
RDP-GAN: A Rényi-Differential Privacy based Generative Adversarial Network

Chuan Ma, Member, IEEE, Jun Li, Senior Member, IEEE, Ming Ding, Senior Member, IEEE,
Bo Liu, Senior Member, IEEE, Kang Wei, Student Member, IEEE,
Jian Weng, Member, IEEE, and H. Vincent Poor, Fellow, IEEE
C. Ma, J. Li and K. Wei are with the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing, China. (e-mail: {chuan.ma, jun.li, kang.wei}@njust.edu.cn).M. Ding is with Data61, CSIRO, Australia (e-mail: Ming.Ding@data61.csiro.au).B. Liu is with University of Technology Sydney, NSW 2007, Australia (email: bo.liu@uts.edu.au).J. Weng is with Jinan University, Guangzhou 510632, China (email: cryptjweng@gmail.com).H. V. Poor is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: poor@princeton.edu).
Abstract

Generative adversarial networks (GANs) have attracted increasing attention recently owing to their impressive ability to generate realistic samples with high privacy protection. Without directly interacting with the training examples, the generative model can be used to estimate the underlying distribution of an original dataset, while the discriminative model examines the quality of the generated samples by comparing their label values with those of the training examples. However, when GANs are applied to sensitive or private training examples, such as medical or financial records, they may still divulge individuals' sensitive and private information. To mitigate this information leakage and construct a private GAN, in this work we propose a Rényi-differentially private GAN (RDP-GAN), which achieves differential privacy (DP) in a GAN by carefully adding random noise to the value of the loss function during training. Moreover, we derive analytical results for the total privacy loss under the subsampling method and cumulated iterations, which show its effectiveness for privacy budget allocation. In addition, in order to mitigate the negative impact of the injected noise, we enhance the proposed algorithm with an adaptive noise tuning step, which adjusts the amount of added noise according to the testing accuracy. Through extensive experimental results, we verify that the proposed algorithm can achieve a better privacy level while producing high-quality samples compared with a benchmark DP-GAN scheme based on noise perturbation of training gradients.

Index Terms:
Generative Adversarial Network, Rényi-Differential Privacy, Adaptive Noise Tuning Algorithm

I Introduction

Recent technological advancements are transforming the ways in which data are created and processed. With the advent of the Internet-of-Things (IoT), the number of intelligent devices in the world has grown rapidly over the last couple of years. Many of these devices are equipped with various sensors and increasingly powerful hardware, which allow them not just to collect, but more importantly, to process data at unprecedented scales. In a concurrent development, artificial intelligence (AI) has revolutionized the ways that information is utilized, with groundbreaking successes in areas such as computer vision, natural language processing, and voice recognition [1]. Therefore, there is a high demand for harnessing the rich data provided by distributed devices to improve machine learning models.

In the past few years, deep learning has demonstrated largely improved performance over traditional machine learning methods in various applications, e.g., image understanding [2], speech recognition [3], cancer analysis [4], and the game of Go [5]. The great success of deep learning owes to the development of powerful computing processors and the availability of massive data for training neural networks. However, there exist domains where access to such large-scale data is not fully granted. For example, sensitive medical data are usually not open-access in most countries. Thus, building a high-quality analytical model remains challenging at present. At the same time, data privacy has become a growing concern for clients. In particular, the emergence of centralized searchable data repositories has made the leakage of private information, e.g., health conditions, travel information, and financial data, an urgent social problem. Furthermore, the diverse set of open data applications, such as census data dissemination and social networks, places more emphasis on privacy concerns. In such practices, access to real-life datasets may cause information leakage even in pure research activities. Consequently, privacy preservation has become a critical issue.

Fortunately, generative models [6] provide a promising solution to alleviate the data scarcity issue. By sketching the data distribution from a small set of training data, it is feasible to sample from the input distribution and generate more synthetic samples. By combining the expressive power of deep neural networks with game theory, generative adversarial networks (GANs) [7] and their variants have demonstrated impressive performance in modelling the underlying data distribution, generating high-quality "fake" (i.e., synthetic) samples that are hard to distinguish from real ones. In this way, GANs can make full use of the generated data while protecting the privacy of individuals.

However, GANs can still implicitly disclose private information about the training examples. The adversarial training procedure and the high model complexity of deep neural networks jointly encourage a distribution that is concentrated around the training samples. By repeatedly sampling from this distribution, there is a considerable chance of recovering the training examples. For example, the authors in [8] introduced an active inference attack model that can reconstruct training samples by sampling generated data. Therefore, there is a strong demand for generative models that not only generate high-quality samples but also protect the privacy of the training data. Indeed, a private GAN has the potential to address this challenging problem.

To preserve data privacy, one way is to make the synthetic data differentially private with respect to the original data. To do this, the authors in [9] modified the training procedure of the discriminator to be differentially private by using the Private Aggregation of Teacher Ensembles (PATE) framework. The post-processing theorem then guarantees that the generator of the GAN is also differentially private. Unlike [9], the work in [10] introduced a subsampled randomized algorithm that first takes a subsample of the dataset through a subsampling procedure, and then applies a known randomized mechanism \mathcal{M} on the subsampled data points. It is important to exploit the randomness in subsampling because if the original mechanism is differentially private, then the subsampled mechanism also obeys differential privacy, a result referred to as the "privacy amplification lemma". Building further on subsampling, [11, 12] introduced the key idea of adding random noise to the updating gradients of the discriminator during training, and provided privacy guarantees. Moreover, norm clipping is used to bound the parameters, which also controls the maximum influence of the added noise. However, bounding the added noise can compromise the privacy level, as the maximum scale of the noise is limited to one under their assumptions. To keep track of the privacy loss, differential privacy (DP), which measures the difference in output between two input databases differing by at most one element, was developed in [13]. By ensuring that each gradient descent step is differentially private, the final output model satisfies a certain level of DP given by the strong composition property. Moreover, to obtain a tighter estimation, Abadi et al. [14] proposed the moments accountant method, which tracks the log moments of the privacy loss variable under random sampling.

Nevertheless, the training of a GAN usually suffers from instability, and directly adding random noise to the updating gradients will certainly harm the learning performance. Therefore, a more subtle design of the privacy protection scheme is needed. In addition, the architecture of a GAN involves multiple iterations. In more detail, the popular Wasserstein GAN [15] trains the discriminator multiple times per generator update to obtain better results. Thus, to conduct the privacy analysis, it is necessary to give an accurate estimation for each iteration. To sum up, there are three key tasks in the design of a private GAN:

  • Design a differentially private algorithm for GAN while guaranteeing its performance.

  • Obtain a tight privacy loss estimation under subsampling.

  • Obtain an accurate privacy budget calculation for cumulated iterations.

Accordingly, in this work, we propose a differentially private GAN. To accurately estimate the privacy loss of the proposed algorithm, we build on the definition of Rényi-DP, which achieves a tighter privacy analysis than the moments accountant method. Furthermore, we improve the proposed algorithm by designing an adaptive noise tuning algorithm, which achieves better learning performance. Specifically, the contributions of this work are listed as follows:

  • To achieve a differentially private GAN, we propose a method in which random noise is added to the value of the loss function of the discriminator in each iteration. Different from previous works [11, 12], the proposed algorithm does not need an additional norm clipping method, as the chosen loss function can be bounded naturally.

  • To investigate the privacy loss of the proposed algorithm, we first elaborate on the effectiveness of Rényi-DP in the subsampling step, which is proved to achieve a tighter analysis compared with the moments accountant method. Then we obtain the expression of the total privacy cost over cumulated iterations.

  • To obtain better learning performance, we further improve the proposed algorithm by adding an adaptive noise tuning step, yielding the ANT-RDP algorithm. The improved algorithm adaptively changes the amount of added noise according to the testing accuracy, further improving overall performance.

  • We conduct extensive numerical experiments to verify the effectiveness of the analytical results. Compared with other algorithms, the proposed algorithm injects less noise and in turn obtains better performance when achieving the same privacy level. Moreover, we also show the superiority of the ANT-RDP algorithm compared with other noise decay methods.

The rest of the paper is structured as follows. We introduce the background in Section II, and propose the RDP-GAN algorithm with a comprehensive privacy loss analysis in Section III. In Section IV, we present the details of the improved adaptive noise tuning algorithm. Then, we validate the privacy analysis through extensive simulations and discuss the network performance in Section V, and conclude the paper in Section VI.

II Preliminary

In this section, we present some background knowledge of DP and GAN.

II-A (\epsilon,\delta)-Differential Privacy

We first recall the standard definition of (\epsilon,\delta)-DP [16].

Definition 1.

((\epsilon,\delta)-DP). A randomized mechanism f:\mathcal{X}\mapsto\mathcal{R} offers (\epsilon,\delta)-DP if for any adjacent X,X'\in\mathcal{X} and S\subset\mathcal{R},

\Pr\left[f(X)\in S\right]\leq e^{\epsilon}\Pr\left[f(X')\in S\right]+\delta, (1)

where f(X) denotes a random function of X.

The above definition is contingent on the notion of adjacent inputs X and X', which is domain-specific and is typically chosen to capture the contribution to the mechanism's input by a single individual. Moreover, to avoid the worst-case scenario of always violating the privacy of a \delta fraction of the dataset, the standard recommendation is to choose \delta\ll 1/N or even \delta=\textrm{negl}(1/N), where N is the data size.

The definition of (\epsilon,\delta)-DP was initially proposed to capture the privacy guarantees of the Gaussian mechanism, defined as follows:

\textbf{G}_{\sigma}f(X)\buildrel\Delta\over{=}f(X)+\mathcal{N}(0,\sigma^{2}), (2)

where \mathcal{N}(0,\sigma^{2}) represents random noise following a Gaussian distribution with mean 0 and standard deviation \sigma. Elementary analysis shows that the Gaussian mechanism satisfies a continuum of incomparable (\epsilon,\delta)-DP guarantees [16], for all combinations of \epsilon<1 and \sigma>\Delta_{2}f\sqrt{2\ln 1.25/\delta}/\epsilon, where the l_{2}-sensitivity of f is defined as

\Delta_{2}f\buildrel\Delta\over{=}\mathop{\max}\limits_{X,X'\in\mathcal{X}}\left\|f(X)-f(X')\right\|_{2}, (3)

and is taken over all adjacent inputs X and X'.
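As a quick illustration of the Gaussian mechanism in Eq. (2), the following is a minimal Python sketch of ours (not from the paper): \sigma is calibrated by the rule \sigma>\Delta_{2}f\sqrt{2\ln 1.25/\delta}/\epsilon quoted above, and the function name and counting-query example are hypothetical.

import numpy as np

def gaussian_mechanism(f_value, l2_sensitivity, epsilon, delta,
                       rng=np.random.default_rng()):
    # Release f(X) + N(0, sigma^2), with sigma calibrated as
    # sigma > Delta_2(f) * sqrt(2 ln(1.25/delta)) / epsilon (valid for epsilon < 1).
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return f_value + rng.normal(0.0, sigma), sigma

# Example: a counting query has l2-sensitivity 1, since adding or removing
# one record changes the count by at most 1.
noisy_count, sigma = gaussian_mechanism(f_value=1042.0, l2_sensitivity=1.0,
                                        epsilon=0.5, delta=1e-5)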

We also list the proposition that will be used in the proof.

Proposition 1.

(Post-processing.) Let \mathcal{M} be an (\epsilon,\delta)-differentially private algorithm and let \mathcal{F}:\mathcal{O}\rightarrow\mathcal{O}', where \mathcal{O}' is any arbitrary space. Then \mathcal{F}\circ\mathcal{M} is (\epsilon,\delta)-differentially private.

II-B \epsilon-Rényi Differential Privacy

The original DP has been investigated in several works to estimate the privacy level of typical machine learning algorithms, such as deep learning [14], federated learning [17] and GANs [11, 12]. However, the original DP requires strong assumptions and loses accuracy when the composition theorem is applied, as pointed out in [18]. Thus, we adopt a generalization of the notion of DP based on the concept of the Rényi divergence, which can avoid this inaccuracy. The Rényi divergence is classically defined as follows:

Definition 2.

(Rényi divergence). For two probability distributions P and Q defined over \mathcal{R}, the Rényi divergence of order \alpha>1 is

D_{\alpha}\left(P||Q\right)\buildrel\Delta\over{=}\frac{1}{\alpha-1}\log E_{x\sim Q}\left(\frac{P(x)}{Q(x)}\right)^{\alpha}, (4)

where P(x) denotes the density of P at x.
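As a sanity check on Definition 2 (ours, not from the paper): for two Gaussians with common variance \sigma^{2} and means \mu and 0, the Rényi divergence has the known closed form D_{\alpha}=\alpha\mu^{2}/(2\sigma^{2}). The sketch below, with hypothetical function names, compares a Monte-Carlo estimate of Eq. (4) against this closed form.

import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def renyi_divergence_mc(mu_p, mu_q, sigma, alpha, n=1_000_000, seed=0):
    # Monte-Carlo estimate of Eq. (4): (1/(alpha-1)) log E_{x~Q} (P(x)/Q(x))^alpha
    x = np.random.default_rng(seed).normal(mu_q, sigma, n)   # x ~ Q
    ratio = gauss_pdf(x, mu_p, sigma) / gauss_pdf(x, mu_q, sigma)
    return np.log(np.mean(ratio ** alpha)) / (alpha - 1.0)

alpha, mu, sigma = 4.0, 0.3, 1.0
print(renyi_divergence_mc(mu, 0.0, sigma, alpha))  # Monte-Carlo, ~0.18
print(alpha * mu ** 2 / (2 * sigma ** 2))          # closed form: 0.18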

This motivates exploring a relaxation of DP based on the Rényi divergence.

Definition 3.

((\alpha,\epsilon)-RDP). A randomized mechanism f:\mathcal{X}\mapsto\mathcal{R} is said to have \epsilon-Rényi DP of order \alpha, or (\alpha,\epsilon)-RDP for short, if for any adjacent X,X'\in\mathcal{X} it holds that

D_{\alpha}\left(f(X)||f(X')\right)\leq\epsilon. (5)
Remark 1.

Similar to the definition of DP, a finite \epsilon in RDP implies that feasible outcomes of f(X) for some X\in\mathcal{X}, i.e., those with non-zero density, are feasible for all inputs from \mathcal{X} except for a set of measure 0. Assuming that this is the case, we let the event space be the support of the distribution.

We then list some propositions that will be used in deriving the analytical results in the following.

Proposition 2.

Let f:\mathcal{D}\mapsto\mathcal{R}_{1} be (\alpha,\epsilon_{1})-RDP and g:\mathcal{R}_{1}\times\mathcal{D}\mapsto\mathcal{R}_{2} be (\alpha,\epsilon_{2})-RDP; then the mechanism defined as (X,Y), where X\sim f(\mathcal{D}) and Y\sim g(X,\mathcal{D}), satisfies (\alpha,\epsilon_{1}+\epsilon_{2})-RDP.

Proposition 3.

(Probability preservation). Let \alpha>1, and let P and Q be two distributions defined over \mathcal{R} with identical support. Let A\subset\mathcal{R} be an arbitrary event. Then

P(A)\leq\left\{\exp\left[D_{\alpha}(P||Q)\right]\cdot Q(A)\right\}^{(\alpha-1)/\alpha}. (6)

II-C Generative Adversarial Networks

The GAN is a recently developed generative model that produces synthetic images or texts after being trained [7]. The learning process is based on a generator (g) and a discriminator (d), two neural networks playing the following zero-sum minimax (i.e., adversarial) game:

\min_{g}\max_{d}V(g,d)=\mathbb{E}_{x\sim p_{data}(x)}\left[\log d(x)\right]+\mathbb{E}_{z\sim p(z)}\left[\log\left(1-d(g(z))\right)\right], (7)

where p(z) is a prior distribution of the latent vector z, g(\cdot) is the generator function, and d(\cdot) is the discriminator function whose output spans [0,1]. d(x)=0 (resp. d(x)=1) indicates that the discriminator d classifies a sample x as generated (resp. real).

In Fig. 1, we show a typical structure of a GAN. g and d can be any form of neural networks. The discriminator attempts to maximize the objective, whereas the generator attempts to minimize it. In other words, the discriminator attempts to distinguish between real and generated samples, while the generator attempts to generate real-looking fake samples that the discriminator cannot distinguish from real ones. According to the feedback in this interaction process, the generator updates its network to enhance the quality of the generated samples. Adversaries exist in this environment and may attack the interaction process to obtain private information about clients. In order to avoid this privacy leakage, in the next section we propose a method to improve the privacy performance of GANs.

Refer to caption
Figure 1: The structure of GAN

III Design of RDP-GAN

In this section, we elaborate on the design of RDP-GAN, a Rényi differentially private generative adversarial network that mitigates information leakage while maintaining desirable utility in the generated data.

III-A RDP-GAN Framework

In most real-world problems, the true data distribution p(x) is unknown and needs to be estimated empirically. Since we are primarily interested in data synthesis, we will use GANs as the mechanism to estimate p(x) and draw samples from it. If trained properly, an RDP-GAN will mitigate inference attacks during data analysis.

The framework of RDP-GAN is structured as follows. Sensitive data X is first fed into a discriminator with a privacy-preserving layer. This discriminator is used to train a differentially private generator to produce a private artificial dataset \tilde{X}. Different from the work on differentially private deep learning (e.g., [14, 11, 12, 19]), the proposed algorithm achieves DP by injecting random noise into the value of the loss function in the interaction procedure. In more detail, the rationale behind our design is as follows:

  • Different from existing works, which add noise to the gradients, the proposed algorithm focuses on the interaction between the generator and the discriminator. This is because, from the perspective of an adversary, the interaction process is more vulnerable to attack than the network architecture when the generator and discriminator are not assembled together. Moreover, it is not trivial for an adversary to obtain information from inside the discriminator or generator.

  • Adding noise to the value of the loss function is a more direct method than those in [11, 12, 19], and brings explicit privacy protection. This is because the loss values, not the parameters, are exchanged between the generator and discriminator, and the perturbation of the loss value directly influences the updates of the generator. In addition, the privacy estimation in previous works is not as accurate, since the change in privacy level when moving from parameters to loss values is ignored there.

  • Adding noise to the value of the loss function does not need an extra norm clipping function, as we naturally use a bounded activation function as the last layer of the neural network. Manually bounding the loss function is a non-trivial task, as its value is sensitive and needs adjustment based on empirical results.

Despite the fact that the generator does not have access to the real data X in the training process, one cannot guarantee DP because of the information passed along with the loss value from the discriminator. A simple high-level example illustrates such a breach of privacy. Let the datasets X and X' contain some small real numbers. The only difference between these two datasets is the number x'\in X', which happens to be extremely large. Since the impact of x' is large enough to influence the performance of the discriminator, specifically its loss value, whether x' is used in training or not leads to a large difference. In this case, this difference in the loss value will be propagated to the generator and will break privacy.

III-B The Implementation of RDP-GAN

Our method focuses on preserving privacy during the training procedure, and we add noise to the loss value of the discriminator as follows:

\mathbb{E}\left[\log d(x)\right]=F_{d}(W;X)\leftarrow\sum^{m}_{i=1}F_{d}(w_{i};x_{i})+\mathcal{N}(0,\sigma^{2}), (8)

where F_{d}(\cdot;\cdot) represents the value of the loss function over the data samples, and m denotes the batch size of the discriminator.
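The following is a minimal numpy sketch of ours illustrating the perturbation in Eq. (8): the per-sample losses of a batch are summed and a single Gaussian draw \mathcal{N}(0,\sigma^{2}) is added to the total; the function name and example values are hypothetical.

import numpy as np

def perturb_loss(per_sample_losses, sigma, rng=np.random.default_rng()):
    # Eq. (8): sum the per-sample discriminator losses over the batch,
    # then add one Gaussian noise draw N(0, sigma^2) to the total.
    return np.sum(per_sample_losses) + rng.normal(0.0, sigma)

losses = np.array([0.62, 0.54, 0.71, 0.48])   # a toy batch of m = 4 loss values
noisy_loss = perturb_loss(losses, sigma=1.0)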

We first explain why this perturbation of the loss value will not vanish during back propagation. Considering an L-layer network, the perturbed loss signal at the last (L-th) layer can be represented by

F_{d}^{L}=\nabla_{A}C\odot A'(z^{L})+\mathcal{N}(0,\sigma^{2}), (9)

where A denotes the activation function of the L-th layer, C denotes the cost function evaluated against the correct labels, \odot denotes the Hadamard (element-wise) product, and z^{L} is the input of the L-th layer [20]. During back propagation, this perturbation is involved as follows:

F_{d}^{l}=\left[(w^{l+1})^{T}F_{d}^{l+1}\right]\odot A'(z^{l}), (10)

where l=L-1,L-2,\ldots,2,1. Thus, if adding noise to the loss value of the discriminator achieves DP, this perturbation is propagated through back propagation when updating the discriminator, ensuring that the discriminator is differentially private. Moreover, the loss value (and hence the parameters) of the generator also guarantees DP because of the post-processing property in [16], which states that any mapping (operation) applied to a differentially private output does not weaken its privacy. Here the mapping is in fact the computation of the parameters of the generator.
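To make Eqs. (9) and (10) concrete, here is a toy numpy sketch of ours, under assumed choices (a single sigmoid output unit with cross-entropy loss and a ReLU hidden layer), showing that noise added to the last-layer error term survives propagation to earlier layers.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy two-layer setup: hidden pre-activation z1, ReLU activation a1,
# output-layer weights w2, last-layer input z2.
z1 = rng.normal(size=3)
a1 = np.maximum(z1, 0.0)
w2 = rng.normal(size=(1, 3))
z2 = w2 @ a1

# Eq. (9): last-layer error term, with Gaussian noise added to the loss signal.
a_L = sigmoid(z2)
y = 1.0                                      # assumed "real" label
grad_C = (a_L - y) / (a_L * (1 - a_L))       # dC/da for cross-entropy loss
delta_L = grad_C * a_L * (1 - a_L) + rng.normal(0.0, 1.0)

# Eq. (10): the perturbation does not vanish; it propagates to layer l = L-1.
delta_l = (w2.T @ delta_L) * (z1 > 0)        # (z1 > 0) is A'(z^l) for ReLU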

The RDP-GAN procedure is summarized in Algorithm 1.

Algorithm 1 Framework of the Proposed RDP-GAN.
1:real samples: \{x_{1},x_{2},\cdots\}\sim p(x); prior samples: \{z_{1},z_{2},\cdots\}\sim p(z); sampling rate: q; number of discriminator iterations per generator iteration: n_{d}; number of generator iterations: n_{g}; noise scale: \sigma;
2:differentially private generator.
3:Initialize discriminator parameters d_{0}, generator parameters g_{0}.
4:For t_{1}=1,\ldots,n_{g} do
5:  For t_{2}=1,\ldots,n_{d} do
6:    Sample \{x_{1},x_{2},\cdots,x_{m}\}\sim p(x), a batch of m real data points.
7:    Sample \{z_{1},z_{2},\cdots,z_{m}\}\sim p(z), a batch of m prior samples.
8:    Train the discriminator d.
9:    Add random noise to the value of the loss function over the batch as F_{d}(W;X)\leftarrow\sum^{m}_{i=1}F_{d}^{t_{2}}(w_{i};x_{i})+\mathcal{N}(0,\sigma^{2}).
10:    Update the discriminator d according to Eqs. (9) and (10).
11:  End for
12:  Update the generator g.
13:End for
14:Return the differentially private generator g.
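The following is a minimal PyTorch sketch of ours for Algorithm 1, using toy MLPs in place of the paper's DCGAN and a synthetic data source; all network sizes and hyperparameters are illustrative. Note that in an autograd framework a constant added to the scalar loss has zero gradient, so the sketch injects the perturbation into the last-layer error signal via a gradient hook, mirroring Eq. (9).

import torch
import torch.nn as nn

latent_dim, data_dim, m, n_g, n_d, sigma = 16, 8, 64, 1000, 5, 10.0
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-5)
bce = nn.BCELoss(reduction="sum")            # summed per-sample losses, as in Eq. (8)

def sample_real(m):                          # stand-in for a batch drawn from p(x)
    return torch.randn(m, data_dim) + 2.0

for t1 in range(n_g):
    for t2 in range(n_d):                    # n_d discriminator steps per generator step
        x, z = sample_real(m), torch.randn(m, latent_dim)
        out_real, out_fake = D(x), D(G(z).detach())
        noise = torch.randn(()) * sigma      # N(0, sigma^2) perturbation (line 9)
        out_real.register_hook(lambda g: g + noise)   # inject into the error signal, Eq. (9)
        d_loss = bce(out_real, torch.ones(m, 1)) + bce(out_fake, torch.zeros(m, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    z = torch.randn(m, latent_dim)
    g_loss = bce(D(G(z)), torch.ones(m, 1))  # generator update (line 12)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()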

III-C Privacy Analysis of RDP-GAN

Since the proposed algorithm consists of multiple iterations, in order to analyze the total privacy loss we first prove that adding noise to the value of the loss function in the discriminator ensures Rényi differential privacy, and then show that the generator satisfies DP.

III-C1 Sensitivity on the loss function

According to the definition of DP, we consider two adjacent databases X and X' of the same size that differ by only one sample.

Consequently, for the discriminator in one iteration, the training process can be written as

\mathcal{L}_{d}(X)=\mathop{\min}\limits_{W}F_{d}(W;X), (11)

where \mathcal{L}_{d}(X) denotes the training process. Therefore, the sensitivity of the loss function can be expressed as

\Delta S=\mathop{\max}\limits_{X,X'\in\mathcal{X}}||F_{d}(W;X)-F_{d}(W';X')||. (12)

To make the loss function bounded, we can choose a bounded activation function as the last network layer; thus the output value W_{i}(X_{i}) for the i-th sample is bounded. Hence the loss function F(\cdot;\cdot) has a bounded input with unchanged labels, and its output is naturally bounded according to the forward-processing property. In addition, we denote the bound of the loss function by C. (Note that C can be different for different datasets and loss functions; in our experiments, we obtain C empirically.) Thus, following the same logic as [17], the sensitivity of the loss function can be expressed as

\Delta S=C/|X|. (13)

From Eq. (13), we can observe that when the batch size increases, the sensitivity decreases accordingly. The intuition is that the maximum difference a single data sample can make is bounded by C, and such a maximum difference is scaled down by a factor of |X|, since every data sample contributes equally to the resulting model. For example, if a model is trained on the records of patients with a certain disease, learning that an individual's record is among them directly affects that individual's privacy. If we increase the size of the dataset, it becomes more difficult to determine whether this record is part of the training data or not.
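As a worked instance of Eq. (13) (ours, using the empirical bounds C reported later in Table II):

def sensitivity(C, batch_size):
    # Eq. (13): Delta_S = C / |X|
    return C / batch_size

print(sensitivity(20, 64))    # 0.3125, matching the Adult entry of Table II
print(sensitivity(23, 128))   # 0.1797 (Table II reports 0.1796)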

III-C2 Privacy analysis on the discriminator

With this sensitivity, we can design the Gaussian mechanism \mathcal{N}(0,\sigma^{2}) to satisfy the (\alpha,\epsilon)-RDP requirement in terms of the sampling rate q under n_{d} iterations. In Theorem 1, we present the privacy level of the output of the discriminator as follows.

Theorem 1.

Given the sampling rate q and the sensitivity \Delta S, adding random noise \mathcal{N}(0,\sigma^{2}) to the loss function, the output of the discriminator for each iteration satisfies (\alpha,\epsilon)-RDP, with \epsilon\leq\frac{q\alpha^{2}\Delta S^{2}}{2(\alpha-1)\sigma^{2}}.

Proof.

Please find the proof in Appendix A. ∎
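For a sense of scale, the per-iteration bound of Theorem 1 can be evaluated numerically; the parameter values below are illustrative (a batch of 64 from the 20K Adult training samples, with \Delta S from Table II), not taken from the paper's experiments.

def rdp_epsilon(q, alpha, delta_S, sigma):
    # Theorem 1: eps <= q * alpha^2 * DeltaS^2 / (2 * (alpha - 1) * sigma^2)
    return q * alpha ** 2 * delta_S ** 2 / (2 * (alpha - 1) * sigma ** 2)

q, alpha, delta_S, sigma = 64 / 20000, 2.0, 0.3125, 10.0
print(rdp_epsilon(q, alpha, delta_S, sigma))   # ~6.25e-6 per iteration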

We then provide the following lemma, which will be used to prove that the output of the discriminator after n_{d} iterations satisfies (\epsilon_{d},\delta)-DP. For simplicity, we write the loss function F_{d}(W;X) as F_{d}(X) in the following.

Lemma 1.

Let f:\mathcal{D}\mapsto\mathcal{R} be an adaptive composition of n mechanisms all satisfying (\alpha,\epsilon)-RDP. Let X and X' be two adjacent inputs. Then for any subset S\subset\mathcal{R}:

\Pr\left[F_{d}(X)\in S\right]\leq\exp\left\{2\epsilon\sqrt{n\log\left(1/\Pr\left[F_{d}(X')\in S\right]\right)}\right\}\cdot\Pr\left[F_{d}(X')\in S\right]. (14)
Proof.

Please find the proof in Appendix B. ∎

Remark 2.

Compared with the standard composition theorem, i.e., Proposition 2, Lemma 1 achieves a tighter privacy estimation over multiple mechanisms.

Theorem 2.

Given the number of discriminator iterations n_{d}, the sampling rate q and the sensitivity \Delta S, adding random noise \mathcal{N}(0,\sigma^{2}) to the value of the loss function of the discriminator, the output guarantees (\epsilon_{d},\delta)-DP under a composition of n_{d} RDP mechanisms, when \log(1/\delta)\geq\epsilon^{2}n_{d}. The expression of \epsilon_{d} can be derived as

\epsilon_{d}\buildrel\Delta\over{=}4\epsilon\sqrt{2n_{d}\log(1/\delta)}, (15)

where \epsilon=\frac{q\alpha^{2}\Delta S^{2}}{2(\alpha-1)\sigma^{2}}.

Proof.

Please find the proof in Appendix C. ∎
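A small numeric check of Eq. (15) (ours; the inputs are illustrative):

import math

def dp_epsilon_discriminator(eps_rdp, n_d, delta):
    # Theorem 2: eps_d = 4 * eps * sqrt(2 * n_d * log(1/delta)),
    # valid only when log(1/delta) >= eps^2 * n_d.
    assert math.log(1 / delta) >= eps_rdp ** 2 * n_d, "Theorem 2 condition violated"
    return 4 * eps_rdp * math.sqrt(2 * n_d * math.log(1 / delta))

print(dp_epsilon_discriminator(eps_rdp=0.1, n_d=5, delta=1e-5))   # ~4.29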

III-C3 Privacy analysis of the generator

We next provide the analytical privacy results for the generator according to the post-processing property.

Theorem 3.

The output of the generator in the proposed algorithm guarantees (\epsilon_{g},\delta)-DP, where

\epsilon_{g}=4\frac{q\alpha^{2}\Delta S^{2}}{2(\alpha-1)\sigma^{2}}\sqrt{2n_{d}\log(1/\delta)}. (16)
Proof.

This is a direct consequence of Proposition 1, i.e., the post-processing property of DP. ∎

III-C4 Privacy analysis of the total training process

Finally, we come to our final result: the proposed algorithm after n_{g} iterations satisfies the DP guarantee.

Theorem 4.

The proposed algorithm satisfies (\epsilon_{total},\delta)-DP, where \epsilon_{total}=n_{g}\epsilon_{g}, and n_{g} denotes the total number of generator iterations in this algorithm.

Remark 3.

For the calculation of the total privacy budget, we cannot directly use the composition property (Lemma 1), as Theorem 2 carries the strict condition \log(1/\delta)\geq\epsilon^{2}n_{d}. For a typical DP setting, choosing \delta=10^{-5} and \epsilon=1 yields n_{g}\leq 11.5. This condition cannot be satisfied in GAN scenarios, as we always set a large number of iterations, usually n_{g}>1000. So we directly sum the privacy budgets of all generator iterations in Theorem 4.
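The arithmetic behind Remark 3 and Theorem 4, as a small sketch of ours:

import math

# Remark 3: with delta = 1e-5 and a per-composition eps = 1, the condition
# log(1/delta) >= eps^2 * n caps the number of compositions at ~11.5.
delta, eps = 1e-5, 1.0
print(math.log(1 / delta) / eps ** 2)   # ~11.51, far below the usual n_g > 1000

def total_epsilon(eps_g, n_g):
    # Theorem 4: simple summation over generator iterations.
    return n_g * eps_g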

IV Adaptive Noise Tuning Algorithm

Although the loss value is naturally bounded, adding random noise may reduce the learning performance. In Algorithm 1, the privacy budget is evenly distributed across iterations, which implies the same noise scale for the Gaussian mechanism in every generator iteration. In general, the final model performance, such as convergence, accuracy, etc., largely depends on the amount of noise added over the training process [19]. Thus, in order to obtain better learning performance, we can optimize the allocation of the privacy budget and design the amount of noise added in each iteration. The details are described as follows.

IV-A Adaptive Noise Tuning Algorithm

Our adaptive noise tuning algorithm follows the idea that, as training continues, less noise on the loss function is desirable, which allows the model to converge faster and obtain better results. Similar ideas are applied by adjusting the learning rate in common practice [21]. Therefore, we propose an algorithm that adaptively tunes the noise scale over the training iterations, and effectively improves the model accuracy in turn.

Our approach for adaptively tuning the noise scale is to monitor the testing accuracy and reduce the noise scale when the accuracy stops improving. To do this, we first add an extra testing process after the generator finishes training in each iteration. Every time the improvement of the testing accuracy is less than a predefined threshold \tau, the noise scale is reduced by a factor of k until the total privacy budget runs out. Although this requires recalculating the privacy cost, the experimental results show that the improvement in performance is convincing despite the extra cost.

In our approach, with a validation dataset, the testing accuracy is checked periodically during the training process of the generator to determine whether the noise scale needs to be reduced for subsequent iterations. Let \sigma_{t} be the noise scale in the t-th iteration obtained by Theorem 4 when the total privacy budget and total number of iterations are given, and let S_{t} be the corresponding testing accuracy. The noise scale for the subsequent iterations is adjusted based on the accuracy difference between the current and previous iterations:

\sigma_{t+1}=\left\{\begin{array}{ll}\sigma_{t},&\hbox{if $S_{t+1}-S_{t}\geq\tau$;}\\k\sigma_{t},&\hbox{otherwise,}\end{array}\right. (17)

where \tau denotes the threshold for the testing accuracy.

Initially we set 0<k\leq 1 and S_{0}=0. Then the updated noise scale \sigma_{t+1} is applied to the next iteration.

In addition, we find that the testing accuracy may not increase monotonically as the training progresses, and its fluctuation may cause unnecessary reductions of the noise scale, thus wasting the privacy budget. This motivates us to use the average testing accuracy to improve the algorithm: at the current testing iteration t, we define an average testing accuracy \overline{S}_{t} over the previous iterations as follows:

\overline{S}_{t}=\frac{1}{t}\sum\limits_{i=1}^{t}S_{i}. (18)

The average testing accuracy replaces the per-iteration value in Eq. (17) and determines whether the noise scale should be reduced or not.
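One possible reading of Eqs. (17)-(18), as a small sketch of ours (the function name, threshold, and accuracy values are hypothetical): the running averages \overline{S}_{t} are compared across consecutive checks, and the noise scale decays by k when the improvement falls below \tau.

def update_noise_scale(sigma_t, accuracies, tau, k):
    # accuracies holds S_1..S_t; compare the running averages (Eq. (18))
    # and apply the decay rule of Eq. (17).
    avg_now = sum(accuracies) / len(accuracies)
    avg_prev = sum(accuracies[:-1]) / (len(accuracies) - 1) if len(accuracies) > 1 else 0.0
    return sigma_t if avg_now - avg_prev >= tau else k * sigma_t

sigma = update_noise_scale(10.0, [0.61, 0.62, 0.62], tau=0.005, k=0.8)   # -> 8.0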

To sum up, we formally present our adaptive noise tuning RDP-GAN framework in Algorithm 2.

Algorithm 2 Framework of the adaptive noise tuning RDP-GAN.
1:real samples: \{x_{1},x_{2},\cdots\}\sim p(x); prior samples: \{z_{1},z_{2},\cdots\}\sim p(z); sampling rate: q; number of discriminator iterations per generator iteration: n_{d}; number of generator iterations: n_{g}; decay rate: k; total privacy budget: \epsilon_{total};
2:differentially private generator.
3:Initialize discriminator parameters d_{0}, generator parameters g_{0}.
4:According to the total privacy budget \epsilon_{total} and the total number of generator iterations n_{g}, estimate the noise scale for each generator iteration by Theorems 3 and 4. The algorithm ends when \epsilon_{total}=0.
5:According to the total number of discriminator iterations n_{d}, estimate the initial noise scale for each generator iteration by Theorem 1.
6:For t_{1}=1,\ldots,n_{g} do
7:  For t_{2}=1,\ldots,n_{d} do
8:    Sample \{x_{1},x_{2},\cdots\}\sim p(x), a batch of m real data points.
9:    Sample \{z_{1},z_{2},\cdots\}\sim p(z), a batch of m prior samples.
10:    Train the discriminator d.
11:    Add random noise to the value of the loss function over the batch as F_{d}(W;X)\leftarrow\sum^{m}_{i=1}F_{d}^{t_{2}}(w_{i};x_{i})+\mathcal{N}(0,\sigma_{t}^{2}).
12:    Update the discriminator d according to Eqs. (9) and (10).
13:  End for
14:  Update the generator g and generate fake samples for testing.
15:  According to Eq. (17), update the noise scale \sigma_{t_{1}} to \sigma_{t_{1}+1} and update the remaining \epsilon_{total}.
16:  Train the discriminator dd.
17:End for
18:Return the differentially private generator g.

IV-B Pre-defined Noise Decay Schedules

In this subsection, we list several predefined noise decay schedules, and provide comprehensive comparison results against our proposed algorithm in Sec. V-E.

IV-B1 Time-Based Decay

According to [22], it is defined with the mathematical form

\sigma_{t}=\sigma_{0}/(1+kt), (19)

where \sigma_{0} is the initial noise scale, k is the decay rate, and t is the current iteration.

IV-B2 Exponential Decay

It has the mathematical form

\sigma_{t}=\sigma_{0}e^{-kt}. (20)

IV-B3 Step Decay

Step decay reduces the noise scale by some factor every few iterations. The mathematical form is expressed as

\sigma_{t}=\sigma_{0}k^{t/t^{*}}, (21)

where t^{*} decides how often the noise is reduced, in terms of the number of iterations.
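The three schedules above can be sketched as follows (ours; the example values mirror the settings used later in Sec. V-E, and the floor division in the step schedule is one common reading of Eq. (21)):

import math

def time_decay(sigma0, k, t):            # Eq. (19)
    return sigma0 / (1 + k * t)

def exponential_decay(sigma0, k, t):     # Eq. (20)
    return sigma0 * math.exp(-k * t)

def step_decay(sigma0, k, t, t_star):    # Eq. (21), decaying every t_star iterations
    return sigma0 * k ** (t // t_star)

# With sigma0 = 10 and t_star = 100 (as in Sec. V-E), at iteration t = 200:
print(time_decay(10, 0.05, 200))         # ~0.91
print(exponential_decay(10, 0.01, 200))  # ~1.35
print(step_decay(10, 0.6, 200, 100))     # 3.6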

IV-C Privacy Preserving Parameter Selection

The proposed algorithms require a set of pre-defined hyperparameters, such as the decay rate k. This value directly affects the training time and the final model accuracy. Intuitively speaking, a smaller decay rate k leads to better performance but requires more training iterations. It is desirable to find an appropriate decay rate for the schedules under an optimal tradeoff between learning performance and training time. However, this decay rate is determined by multiple factors, such as the particular neural network, the training set, and other factors. Thus, in this subsection we only propose a straightforward method to obtain an appropriate value: we list a set of candidate values of k, train a neural network for each candidate, and directly choose the one that achieves the highest accuracy. The corresponding experimental results are shown in Sec. V-E.

V Experimental Results

In this section, we first evaluate the performance of our analytical results for different privacy budgets. Then, we evaluate our proposed RDP-GAN method on the tabular and image datasets, respectively. In addition, we demonstrate the effectiveness of the improved algorithm with various noise scaling methods and parameter settings.

V-A Experimental Setting

V-A1 Dataset

In our experiments, we use two benchmark datasets for different tasks:

  • Adult, which consists of 30K individual records, has 14 attributes (8 selected in the experiments) for each individual, and is split into 20K training and 10K test samples. We show Adult dataset samples with some attributes in Table I.

    TABLE I: Some examples of the Adult dataset
    Age Occupation Education Gender Work Class Marital Status Work Hour Income
    39 Sales Bachelors Male State-gov Never-married 30 \leq 50K
    38 Tech-support HS-grad Male Private Divorced 35 \leq 50K
    44 Prof-specialty Masters Male Private Married-civ-spouse 42 >50K
    43 Other-service Some-college Female State-gov Separated 26 \leq 50K
  • MNIST, which consists of 70K handwritten digit images of size 28\times 28, split into 60K training and 10K test samples. We show typical samples of the standard MNIST dataset in Fig. 2.

    Refer to caption
    Figure 2: Some samples of the MNIST dataset

V-A2 GAN

The neural network architecture is adapted from the deep convolutional GAN (DCGAN), where the discriminator is a convolutional neural network (CNN) that contains 3 layers. In each layer, a list of learnable filters is applied to the entire input matrix, where data samples are converted into square matrices in our method. To make the loss value bounded, we choose the sigmoid activation function as the last layer, which generates the probability of being real or fake. The loss function of the discriminator is the cross-entropy function. The generator is also a neural network, consisting of 3 de-convolutional layers.

In the experiments, we set the learning rates of the discriminator a_{d} and generator a_{g} to 5.0\times 10^{-5}. The sampling rate q is set according to different batch sizes, and the numbers of iterations of the discriminator (n_{d}) and generator (n_{g}) are 5 and 10^{3}, respectively.

V-A3 Privacy and Utility Level Estimation

In the experimental results, we set up two privacy levels: \epsilon_{total}=0.5 and \epsilon_{total}=5, respectively. To verify the generated results, we use an additional classifier to test the accuracy against the right labels on the MNIST dataset. The classifier is trained on the 10K test samples of the MNIST dataset and is used to decide whether an input digit with a predefined label belongs to the right label. Moreover, the classifier uses a multilayer perceptron (MLP) neural network with 3 layers. Except for the input nodes, each node is a neuron that uses the sigmoid activation function. For the Adult dataset, we compute the probability mass function (PMF) of each attribute and the absolute average error with respect to the true data. For the relationship among different attributes, we use a trained classifier to test the accuracy. In addition, the absolute average error is calculated by the following equation:

\textrm{Error}=\sum\limits_{x=1}^{|X|}\left|p_{g}(x)-p_{r}(x)\right|\times x, (22)

where |X| denotes the total number of labels, and p_{g}(x) and p_{r}(x) denote the PMFs of the generated and real data, respectively.
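For concreteness, Eq. (22) can be computed as follows (a sketch of ours; the three-label PMFs are made up):

import numpy as np

def absolute_average_error(pmf_generated, pmf_real):
    # Eq. (22): sum over labels x of |p_g(x) - p_r(x)| * x, labels indexed from 1.
    x = np.arange(1, len(pmf_real) + 1)
    return np.sum(np.abs(np.asarray(pmf_generated) - np.asarray(pmf_real)) * x)

print(absolute_average_error([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))   # 0.1*1 + 0.1*2 = 0.3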

V-B Analytical Results on the Privacy Level

V-B1 Investigation on the Sensitivity

Before we show the analytical results of the proposed RDP-GAN, we need to investigate the sensitivity of the loss function. According to Eq. (13), we can first estimate the bound of the loss function and calculate the sensitivity. We conduct experiments on the Adult and MNIST datasets, respectively, and record the maximum loss value without injecting noise in Table II.

TABLE II: Simulation results for the sensitivity value on the loss function.
Batch size =64 Batch size =128
Maximum value of loss function on Adult 17.852 21.242
Maximum value of loss function on MNIST 23.534 26.823
C on Adult 20 23
C on MNIST 25 28
\Delta S on the Adult dataset 0.3125 0.1796
\Delta S on the MNIST dataset 0.1953 0.2187

From the results we can find that a larger batch size leads to a larger maximum loss value on both datasets. Accordingly, we set C for the different situations in Table II. (Note that the value of C is only used to estimate the privacy loss; nevertheless, we will show that the proposed privacy analysis achieves a lower bound compared to the other method under the same noise.) The results also indicate that one individual has less influence in a larger dataset.

V-B2 Analytical Results on the Privacy Level of Generator

To show the effectiveness of the RDP-GAN, we compare it with the method using the moments accountant (MA) algorithm [12, 14], as shown in the following figures. In [12, 14], the privacy level of the generator \epsilon_{g} in each iteration can be expressed as

\epsilon_{g}=\frac{2q\sqrt{n_{d}\log\left(\frac{1}{\delta}\right)}}{\sigma}, (23)

where the sensitivity is omitted in the formula. In addition, the expression of \epsilon_{g} in MA has an order of O(\frac{1}{\sigma}), whereas our analysis method (RDP) achieves an order of O(\frac{1}{\sigma^{2}}) with respect to the injected noise. This leads to a higher privacy level for RDP when the noise scale is relatively large, e.g., \sigma>10.
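The two bounds can be compared numerically; the sketch below (ours, with illustrative parameters only) evaluates Eq. (23) against the RDP-based bound of Eq. (16) as \sigma grows, exhibiting the O(1/\sigma) versus O(1/\sigma^{2}) behavior.

import math

def eps_g_ma(q, n_d, delta, sigma):               # Eq. (23), MA-style bound
    return 2 * q * math.sqrt(n_d * math.log(1 / delta)) / sigma

def eps_g_rdp(q, alpha, dS, sigma, n_d, delta):   # Eq. (16), RDP-based bound
    eps = q * alpha ** 2 * dS ** 2 / (2 * (alpha - 1) * sigma ** 2)
    return 4 * eps * math.sqrt(2 * n_d * math.log(1 / delta))

q, alpha, dS, n_d, delta = 0.0032, 2.0, 0.3125, 5, 1e-5
for sigma in (1.0, 10.0, 100.0):
    print(sigma, eps_g_ma(q, n_d, delta, sigma), eps_g_rdp(q, alpha, dS, sigma, n_d, delta))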

Refer to caption
(a) Batch size = 64
Refer to caption
(b) Batch size = 128
Figure 3: The analytical privacy level with different noise scales.

In Fig. 3 we investigate the privacy level for two batch sizes. As can be found in these figures, the proposed RDP method achieves a better privacy level when larger noise is added, as expected. Moreover, the privacy level on the Adult dataset is always higher (a smaller \epsilon) than on the MNIST dataset. The intuition behind this is that the sensitivity on the Adult dataset is lower than that on the MNIST dataset, as it has fewer attributes.

V-C Experimental Results on the MNIST Dataset

In this subsection, we execute the proposed algorithm on the MNIST dataset. In Fig. 4, we first show the real and generated data samples with different noise scales. We also show the results generated by DP-GAN [12, 11], which adds noise to the gradients, for comparison. From the figure we can find that the generated digits at the low privacy level preserve the characteristics of the true ones well, and with the increase of the privacy level, the quality begins to drop as unclear pixels appear. For example, we can find that the digits 2 and 8 are hard to recognize in the generated samples when \epsilon_{total}=5. Moreover, the quality generated by DP-GAN is obviously worse than that of the proposed RDP at both privacy levels.

Refer to caption
Figure 4: Generated MNIST samples with different noise scales: (a) RDP (\epsilon_{total}=0.5); (b) DP-GAN (\epsilon_{total}=0.5); (c) RDP (\epsilon_{total}=5); (d) DP-GAN (\epsilon_{total}=5).

In order to further verify the quality of the generated samples, we use an additional classifier to test the accuracy. We show the test accuracy along the training process (i.e., 1000 iterations) in the following figures.

Refer to caption
(a) No noise
Refer to caption
(b) \sigma=0.5
Refer to caption
(c) \sigma=5
Refer to caption
(d) \sigma=10
Figure 5: The classification accuracy with different noise scales on the MNIST dataset.

From Fig. 5 we can first find that the generated samples without privacy protection, i.e., \sigma=0, achieve relatively high classification accuracy. Second, with the increase of the noise scale, the accuracy performance begins to fluctuate, which is caused by the injected noise. In addition, a few collapse cases happen when large noise is added. For example, we can find in Fig. 5(d) that the accuracy sometimes suddenly drops to 0.3 or less. This is because a large \sigma can occasionally bring huge noise to the loss function that the system cannot withstand, leading to unacceptable results in the generated samples, e.g., the digit 8 in Fig. 4 when \sigma=20.

Refer to caption
(a) \epsilon_{total}=0.5
Refer to caption
(b) \epsilon_{total}=5
Figure 6: The classification accuracy comparison with DP-GAN.

In addition, we compare our algorithm (RDP) with the method that adds noise to the parameters, i.e., DP-GAN, at two privacy levels, \epsilon_{total}=0.5 and \epsilon_{total}=5, respectively. As can be found in Fig. 6, the proposed algorithm has better performance than DP-GAN at both privacy levels, with the biggest performance gap of around 7% when classifying digit 3 at the high privacy level. As explained in Sec. III-A, the superior performance of the proposed RDP-GAN compared with DP-GAN arises because adding noise to the loss function brings in more explicit privacy protection.

V-D Experimental Results on the Adult Dataset

In this subsection, we show the performance of the proposed algorithm on the Adult dataset.

Refer to caption
(a) Age
Refer to caption
(b) Occupation
Refer to caption
(c) Education
Refer to caption
(d) Gender
Refer to caption
(e) Work Class
Refer to caption
(f) Marital Status
Refer to caption
(g) Work Hour
Refer to caption
(h) Income
Figure 7: The PMF of generated synthetic samples.

In Fig. 7 we show the probability mass function (PMF) of the generated attributes. Taking the attribute Age as an example, we use binning to divide the age range into 9 regions, as shown in Fig. 7(a), to increase the training speed. As can be found in Fig. 7(a), the generated samples achieve a distribution similar to that of the true ones, while the samples with larger noise deviate more. Moreover, the largest difference is concentrated in the edge domain. The main reason may be that the generator always produces samples that are likely to appear in the true dataset, i.e., those close to the average value. A similar phenomenon can be found for the other attributes as well. To further investigate the quality of the generated samples, we show the absolute average error for these attributes. As can be found in Table III, when more noise is injected, the absolute average error increases. Thus, the proposed algorithm may not have the capability to achieve high performance in maintaining the character of individual attributes, especially in tabular datasets. To further enhance this performance, a more intelligent design of the generator may be needed, which is left as our future work.

TABLE III: Absolute average error for the generated samples with different privacy levels.
Age Occ. Edu. Gen. Work C Mar. Work H Inc.
\epsilon=\infty 1.09 0.68 0.92 0.01 0.834 0.36 1.66 0.06
\epsilon=5 1.1 1.23 1.23 0.009 1.11 0.57 1.65 0.17
\epsilon=0.5 3. 1.75 1.87 0.048 2.21 0.69 2.55 0.44

To verify the correlation among different attributes, we use a trained classifier to determine whether the income of an individual is larger than 50K or not. Experimental results comparing the proposed RDP-GAN with DP-GAN can be found in Fig. 8.

Refer to caption
(a) Classifier trained by real data
Refer to caption
(b) Classifier trained by synthetic data
Figure 8: Correlation comparison with DP-GAN.

From Fig. 8(a) we can find that the classifier trained on the true data samples obtains a fairly high accuracy of around 96.4%. Moreover, compared with DP-GAN, the proposed RDP algorithm better captures the relationship among attributes at both privacy levels. For example, when \epsilon=0.5, the performance gap is around 4.1% (82.5% vs. 78.4%). In addition, in Fig. 8(b) we train the classifier using the synthetic data and test its accuracy using the true data. The results also show that the proposed algorithm achieves better performance than DP-GAN at the same privacy level.

V-E Experimental Results on the Adaptive Noise Tuning Algorithm

In this subsection, we first conduct experiments to find the appropriate decay rate for the different algorithms. Specifically, we choose \epsilon_{total}=0.5 and 1000 iterations, which leads to adding noise with \sigma=10 to the discriminator in each iteration, and the testing accuracy threshold \tau is set to 0.5 to prevent model collapse. In addition, we denote the time-based decay, exponential decay and step decay as Time, Exp and Step, respectively, and t^{*} in Step is set to 100. In the following, we show the trend of the testing accuracy with different values of the decay rate k on the Adult and MNIST datasets.

Refer to caption
(a) Time and Exp
Refer to caption
(b) Step and ANT-RDP
Figure 9: The accuracy with different decay rates k on the Adult dataset.
Refer to caption
(a) Time and Exp
Refer to caption
(b) Step and ANT-RDP
Figure 10: The accuracy with different decay rates k on the MNIST dataset.

From Fig. 9 and Fig. 10, we can find that there exists an optimal decay rate under the current system setting. For example, for the ANT-RDP algorithm, the best accuracy occurs at k=0.7 on the Adult dataset and k=0.8 on the MNIST dataset, respectively. In addition, the performance of the Time, Step and ANT-RDP schedules seems to have a convex relationship with the value of k. Intuitively speaking, this may be because a lower decay rate leads to a longer training time and produces worse accuracy, since most iterations suffer from noise-added parameters. On the other hand, a higher decay rate makes the noise scale decay sharply, resulting in insufficient training time, which may also degrade the accuracy. Moreover, we show comparison results for these algorithms in Table IV and Table V.

TABLE IV: Average accuracy for dynamic schedules on the MNIST dataset: RDP-GAN (\epsilon=0.5); Time (k=0.05); Step (k=0.6); Exp (k=0.01); ANT-RDP (k=0.8)
No Privacy RDP-GAN Time Step Exp ANT-RDP
Iterations 1000 1000 657 564 902 868
Accuracy 0.995 0.863 0.874 0.876 0.898 0.908
TABLE V: Average accuracy for dynamic schedules on the Adult dataset: RDP-GAN (\epsilon=0.5); Time (k=0.04); Step (k=0.5); Exp (k=0.01); ANT-RDP (k=0.7)
No Privacy RDP-GAN Time Step Exp ANT-RDP
Iterations 1000 1000 632 532 894 845
Accuracy 0.964 0.825 0.836 0.828 0.839 0.843

From the tables we can find that all the dynamic schedules achieve better performance than RDP-GAN, at a smaller cost in training iterations. The reason is that the dynamic algorithms adjust the scale of the added noise according to the testing performance, and more privacy budget is consumed per iteration as the noise scale becomes smaller, thus leading to fewer iterations. Moreover, the proposed ANT-RDP obtains the highest accuracy on both datasets while not requiring the most iterations. For example, compared with the Exp algorithm, ANT-RDP has better accuracy, i.e., a 0.4% performance gain on the Adult dataset and 0.9% on the MNIST dataset, respectively, while needing fewer iterations, i.e., 49 and 34 fewer on each dataset.

VI Related Work

In this section, we provide a brief literature review of relevant topics: differential privacy and differentially private deep learning.

VI-A Differential Privacy

Differential privacy (DP) [16] and related algorithms have been widely studied in the literature. The work in [23] laid the theoretical foundations of many DP studies by adding noise calibrated to the maximum change of data-related functions. Other related frameworks that add noise to gradients are [24] and its follow-up work [25], which studied how to learn models satisfying DP requirements using stochastic gradients. In [26], a comprehensive and structured overview of DP data publishing and analysis presents several possible future directions and applications. In addition, to tightly analyze the composition, especially on the tails of the privacy loss, the work in [18] gave a new privacy definition, named Rényi DP. Rényi DP is not only a natural generalization of pure DP, as it shares many properties with DP, but also has the capability to capture a more accurate aggregate privacy loss through its composition theorem.

VI-B Differentially Private Deep Learning

Abadi et al. [14] proposed a differentially private SGD algorithm for deep learning that offers provable privacy guarantees on the output model, and the work in [19] used concentrated DP [13], which is in the spirit of RDP, to provide cumulative privacy loss estimation over computations. Moreover, applications of DP in GANs, such as [11, 12], have addressed information leakage well by adding noise to gradients during training. They can also produce data samples of good quality under reasonable privacy budgets.

Different from these solutions, our work obtains a more accurate estimation of the privacy loss. To the best of our knowledge, this is the first work to apply the properties of RDP to GANs. Moreover, we provide a method that is naturally bounded, without artificial clipping of the noise-added gradients, which may destroy the learning performance.

VII Conclusion

In this work, we proposed a Rényi-differentially private GAN (RDP-GAN), which achieves differential privacy (DP) under the framework of a GAN by carefully adding random Gaussian noise to the value of the loss function during training. The advantages of applying the perturbation to the loss function are three-fold: i) loss values are exchanged directly between the generator and discriminator and thus need more protection than parameters; ii) adding noise to the loss value provides explicit privacy protection; and iii) it does not need value clipping on the parameters, so stability is enhanced. We theoretically obtained analytical results for the total privacy loss, considering the subsampling method and multiple iterations of training with an equal privacy budget allocated to each iteration. In addition, in order to alleviate the negative impact of the noise injection, we improved the proposed algorithm by designing an adaptive noise tuning algorithm, which adaptively changes the amount of added noise according to the testing accuracy. Through comprehensive experimental results, we verified that the proposed algorithm can achieve higher data utility than the existing DP-GAN algorithm at the same DP protection level.

Appendix A Proof of Theorem 1

Let F_{d} denote the loss function of the discriminator, and let X and X' be adjacent datasets that differ in only one element x'. The loss values of these two adjacent datasets after noise perturbation are expressed as:

\begin{split}&L_{d}\left(W;X\right)=F_{d}\left(W;X\right)+\mathcal{N}(0,\sigma^{2});\\&L'_{d}\left(W';X'\right)=F'_{d}\left(W';X'\right)+\mathcal{N}(0,\sigma^{2}),\end{split} (24)

where W denotes the parameters of the neural network.

From Eq. (24), we know L_{d}\left(W;X\right) is a summation of two independent random variables. Using the Gaussian approximation approach in [27], the summation of these two random variables can be treated as another random variable which also follows a Gaussian distribution. Moreover, without loss of generality, we assume that \sum_{i\in|X|\backslash\{x'\}}F_{d}(x_{i})=0. Thus L_{d}\left(W;X\right) and L'_{d}\left(W';X'\right) are distributed identically except for the first coordinate, and hence we have a one-dimensional problem [14]. Let u_{0} denote the pdf of \mathcal{N}(0,\sigma^{2}) and let u_{1} denote the pdf of \mathcal{N}(\Delta S,\sigma^{2}). Thus we have

\begin{split}
&L_{d}^{\prime}\left(W^{\prime};X^{\prime}\right)\sim v_{0}\triangleq u_{0};\\
&L_{d}\left(W;X\right)\sim v_{1}\triangleq(1-q)u_{0}+qu_{1},
\end{split}  (25)

where q is the sampling rate and ΔS denotes the sensitivity bound of the loss function. In particular, q = m/n, where m and n denote the sample size and the total size of the dataset, respectively.

According to the definition of Rényi differential privacy in Eq. (5), the ϵ of one iteration can be expressed as max( D_α(L_d(X)/L_d(X′)), D_α(L_d(X′)/L_d(X)) ). For simplicity, we abbreviate L_d(W;X) as L_d(X).

To obtain ϵ, we first calculate D_α(L_d(X)/L_d(X′)) as

\begin{split}
D_{\alpha}\left(\frac{L_{d}(X)}{L_{d}(X^{\prime})}\right)
&=\frac{1}{\alpha-1}\log E_{x\sim L_{d}(X^{\prime})}\left(\frac{L_{d}(X)}{L_{d}(X^{\prime})}\right)^{\alpha}\\
&=\frac{1}{\alpha-1}\log\int u_{0}\left[\frac{(1-q)u_{0}+qu_{1}}{u_{0}}\right]^{\alpha}dx\\
&=\frac{1}{\alpha-1}\log\int u_{0}\left\{1-q+q\exp\left[\frac{1}{2\sigma^{2}}\left(2x\Delta S-\Delta S^{2}\right)\right]\right\}^{\alpha}dx\\
&\overset{(a)}{=}\frac{1}{\alpha-1}\log\int u_{0}\sum_{k=0}^{\alpha}\binom{\alpha}{k}(1-q)^{\alpha-k}q^{k}\exp\left(\blacktriangle\right)dx\\
&=\frac{1}{\alpha-1}\log\sum_{k=0}^{\alpha}\binom{\alpha}{k}(1-q)^{\alpha-k}q^{k}\exp\left[\frac{k(k-1)\Delta S^{2}}{2\sigma^{2}}\right]\\
&\overset{(b)}{\leq}\frac{1}{\alpha-1}\log\sum_{k=0}^{\alpha}\binom{\alpha}{k}(1-q)^{\alpha-k}\left[q\exp\left(\frac{\Delta S^{2}\alpha}{2\sigma^{2}}\right)\right]^{k}\\
&=\frac{\alpha}{\alpha-1}\log\left\{1+q\left[\exp\left(\frac{\Delta S^{2}\alpha}{2\sigma^{2}}\right)-1\right]\right\}\\
&\overset{(c)}{\leq}\frac{\alpha}{\alpha-1}q\left[\exp\left(\frac{\Delta S^{2}\alpha}{2\sigma^{2}}\right)-1\right]\\
&\overset{(d)}{\leq}\frac{\alpha}{\alpha-1}q\left[\frac{\Delta S^{2}\alpha}{2\sigma^{2}}+O\left(\frac{\Delta S^{4}\alpha^{2}}{4\sigma^{4}}\right)\right]\leq\frac{q\alpha^{2}\Delta S^{2}}{2(\alpha-1)\sigma^{2}},
\end{split}  (26)

where step (a) uses the binomial expansion, with ▲ denoting k/(2σ²)·(2xΔS − ΔS²); step (b) is obtained from k−1 ≤ α; step (c) uses the inequality log(1+x) ≤ x; and step (d) follows from the Taylor series expansion of the exponential function.

We then calculate D_α(L_d(X′)/L_d(X)) as

\begin{split}
D_{\alpha}\left(\frac{L_{d}(X^{\prime})}{L_{d}(X)}\right)
&=\frac{1}{\alpha-1}\log E_{x\sim L_{d}(X)}\left[\frac{L_{d}(X^{\prime})}{L_{d}(X)}\right]^{\alpha}\\
&=\frac{1}{\alpha-1}\log\int u_{0}\left[\frac{u_{0}}{(1-q)u_{0}+qu_{1}}\right]^{\alpha-1}dx\\
&=\frac{1}{\alpha-1}\log\int u_{0}\left[\frac{(1-q)u_{0}+qu_{1}}{u_{0}}\right]^{1-\alpha}dx\\
&=\frac{1}{\alpha-1}\log\int u_{0}\left\{1-q+q\exp\left[\frac{1}{2\sigma^{2}}\left(2x\Delta S-\Delta S^{2}\right)\right]\right\}^{1-\alpha}dx\\
&\overset{(a)}{=}\frac{1}{\alpha-1}\log\int u_{0}\left\{\sum_{k=0}^{\infty}(-1)^{k}\binom{\alpha+k-2}{k}q^{k}\,\blacktriangledown\right\}dx\\
&\overset{(b)}{\leq}\frac{1}{\alpha-1}\log\sum_{k=0}^{\infty}(-1)^{k}\binom{\alpha+k-2}{k}q^{k}\exp\left(\blacktriangle\right)(1-q)^{1-k-\alpha}\\
&=\frac{1}{\alpha-1}\log\left[1-q+q\exp\left(\frac{\Delta S^{2}\alpha}{2\sigma^{2}}\right)\right]^{1-\alpha}\\
&=-\log\left\{1+q\left[\exp\left(\frac{\Delta S^{2}\alpha}{2\sigma^{2}}\right)-1\right]\right\}<D_{\alpha}\left(\frac{L_{d}(X)}{L_{d}(X^{\prime})}\right),
\end{split}  (27)

where step (a) is obtained from (x+a)^{-n} = \sum_{k=0}^{\infty}(-1)^{k}\binom{n+k-1}{k}x^{k}a^{-n-k}, and ▼ denotes \exp\left[\frac{k}{2\sigma^{2}}\left(2x\Delta S-\Delta S^{2}\right)\right](1-q)^{1-k-\alpha}. Step (b) is obtained from k−1 ≤ α, and ▲ denotes \frac{k}{2\sigma^{2}}\left(2x\Delta S-\Delta S^{2}\right).

As a result, ϵ = qα²ΔS²/(2(α−1)σ²), which concludes the proof.
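As a numerical sanity check on Theorem 1 (a sketch under illustrative parameters, not part of the proof), the snippet below evaluates the two Rényi divergences in Eqs. (26) and (27) for the subsampled Gaussian mixture of Eq. (25) by direct integration, and compares the larger of the two against the closed-form bound. The use of scipy, the chosen integration range, and the parameter values are our assumptions.

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    q, sigma, delta_s, alpha = 0.01, 1.0, 1.0, 2.0

    u0 = lambda x: norm.pdf(x, 0.0, sigma)       # N(0, sigma^2)
    u1 = lambda x: norm.pdf(x, delta_s, sigma)   # N(Delta_S, sigma^2)
    v1 = lambda x: (1 - q) * u0(x) + q * u1(x)   # subsampled mixture, Eq. (25)

    def renyi_div(p, r, a):
        """D_alpha(p || r) = 1/(a-1) * log of the integral of p(x)^a * r(x)^(1-a)."""
        val, _ = quad(lambda x: p(x) ** a * r(x) ** (1 - a), -20, 20)
        return np.log(val) / (a - 1)

    eps_numerical = max(renyi_div(v1, u0, alpha), renyi_div(u0, v1, alpha))
    eps_bound = q * alpha ** 2 * delta_s ** 2 / (2 * (alpha - 1) * sigma ** 2)
    print(eps_numerical <= eps_bound)  # True: the Theorem 1 bound holds here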

Appendix B Proof of Lemma 1

By applying Proposition 2 to the composition of n iterations, we obtain that for all α > 1,

D_{\alpha}\left[F_{d}\left(X\right)\,\|\,F_{d}\left(X^{\prime}\right)\right]\leq 2\alpha n\epsilon^{2}.  (28)

Denote Pr[F_d(X) ∈ S] by P and Pr[F_d(X′) ∈ S] by Q, and consider two cases.

Case I: log(1/Q) ≥ ϵ²n. According to Proposition 3, choosing α = √(log(1/Q))/(ϵ√n) ≥ 1, we have

\begin{split}
P&\leq\left\{\exp\left\{D_{\alpha}\left[F_{d}\left(X\right)\,\|\,F_{d}\left(X^{\prime}\right)\right]\right\}\cdot Q\right\}^{1-1/\alpha}\\
&\leq\exp\left[2\left(\alpha-1\right)n\epsilon^{2}\right]\cdot Q^{1-1/\alpha}\\
&\overset{(a)}{<}\exp\left[2\epsilon\sqrt{n\log(1/Q)}-2n\epsilon^{2}-\left(\log Q\right)/\alpha\right]\cdot Q\\
&<\exp\left(2\epsilon\sqrt{n\log(1/Q)}\right)\cdot Q,
\end{split}  (29)

where step (a) is obtained by substituting α = √(log(1/Q))/(ϵ√n) into the expression.

Case II: log(1/Q) < ϵ²n. This case follows trivially, since the right-hand side of Eq. (14) is larger than 1:

\exp\left(2\epsilon\sqrt{n\log(1/Q)}\right)\cdot Q\geq\exp\left(2\log(1/Q)\right)\cdot Q=1/Q>1.  (30)
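For intuition, the following short check (with illustrative values of ϵ, n, and Q chosen to satisfy the Case II condition) confirms numerically that the right-hand side of the bound indeed exceeds 1 in this regime.

    import math

    def rhs(eps, n, Q):
        # right-hand side of Lemma 1: exp(2 * eps * sqrt(n * log(1/Q))) * Q
        return math.exp(2 * eps * math.sqrt(n * math.log(1 / Q))) * Q

    eps, n, Q = 0.01, 1000, 0.95
    assert math.log(1 / Q) < eps ** 2 * n   # the Case II condition holds
    print(rhs(eps, n, Q))                   # about 1.1, i.e., larger than 1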

Appendix C Proof of Theorem 2

Let X and X′ be two adjacent inputs, and let S be any subset of outputs of the loss function F_d. To show that F_d is (ϵ_d, δ)-differentially private, we need to verify that

\Pr\left[F_{d}(X)\in S\right]\leq\exp\left(\epsilon_{d}\right)\Pr\left[F_{d}\left(X^{\prime}\right)\in S\right]+\delta.  (31)

According to Lemma 1, we know that

\Pr\left[F_{d}(X)\in S\right]\leq\exp\left\{2\epsilon\sqrt{n\log\left(1/\Pr\left[F_{d}(X^{\prime})\in S\right]\right)}\right\}\cdot\Pr\left[F_{d}\left(X^{\prime}\right)\in S\right].  (32)

Denote Pr[F_d(X) ∈ S] by P and Pr[F_d(X′) ∈ S] by Q; then we obtain

P\leq\exp\left(2\epsilon\sqrt{n\log(1/Q)}\right)\cdot Q.  (33)

Eq. (33) can be analyzed in the following two cases.

Case I: 8log(1/δ) > log(1/Q). Then Eq. (33) can be bounded as

\begin{split}
P&\leq\exp\left(2\epsilon\sqrt{8n\log(1/\delta)}\right)\cdot Q\\
&=\exp\left(4\epsilon\sqrt{2n\log(1/\delta)}\right)\cdot Q\leq\exp\left(\epsilon_{d}\right)\cdot Q+\delta.
\end{split}  (34)

Case II: 8log(1/δ) ≤ log(1/Q). Then Eq. (33) can be bounded as

\begin{split}
P&\overset{(a)}{\leq}\exp\left(2\sqrt{\log(1/\delta)\cdot\log(1/Q)}\right)\cdot Q\\
&\leq\exp\left(\sqrt{1/2}\,\log(1/Q)\right)\cdot Q\\
&=Q^{1-1/\sqrt{2}}\leq Q^{1/8}\leq\delta\leq\exp\left(\epsilon_{d}\right)\cdot Q+\delta,
\end{split}  (35)

where step (a) is obtained by substituting the condition 8log(1/δ) ≤ log(1/Q) into the expression.

This concludes the proof.
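Combining Theorem 1 with the conversion proved above yields a simple privacy accountant. The sketch below (the helper names are ours, and the parameter values are illustrative) computes the per-iteration RDP parameter from Theorem 1 and converts it into an overall (ϵ_d, δ)-DP guarantee using ϵ_d = 4ϵ√(2n·log(1/δ)), as suggested by Eq. (34).

    import math

    def per_iteration_eps(q, alpha, delta_s, sigma):
        # Theorem 1: eps = q * alpha^2 * Delta_S^2 / (2 * (alpha - 1) * sigma^2)
        return q * alpha ** 2 * delta_s ** 2 / (2 * (alpha - 1) * sigma ** 2)

    def total_dp_eps(eps, n, delta):
        # Eq. (34): eps_d = 4 * eps * sqrt(2 * n * log(1/delta)) over n iterations
        return 4 * eps * math.sqrt(2 * n * math.log(1 / delta))

    eps = per_iteration_eps(q=0.01, alpha=2.0, delta_s=1.0, sigma=2.0)
    print(total_dp_eps(eps, n=10000, delta=1e-5))  # cumulative epsilon after 10000 iterations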

References

  • [1] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, “In-edge ai: Intelligentizing mobile edge computing, caching and communication by federated learning,” arXiv preprint arXiv:1809.07857, 2018.
  • [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [3] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 6645–6649.
  • [4] M. Liang, Z. Li, T. Chen, and J. Zeng, “Integrative data analysis of multi-platform cancer data with a multimodal deep learning approach,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 4, pp. 928–937, July 2015.
  • [5] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
  • [6] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [8] B. Hitaj, G. Ateniese, and F. Perez-Cruz, “Deep models under the gan: information leakage from collaborative deep learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security.   ACM, 2017, pp. 603–618.
  • [9] J. Jordon, J. Yoon, and M. van der Schaar, “Pate-gan: generating synthetic data with differential privacy guarantees,” 2018.
  • [10] Y.-X. Wang, B. Balle, and S. Kasiviswanathan, “Subsampled rényi differential privacy and analytical moments accountant,” arXiv preprint arXiv:1808.00087, 2018.
  • [11] L. Xie, K. Lin, S. Wang, F. Wang, and J. Zhou, “Differentially private generative adversarial network,” arXiv preprint arXiv:1802.06739, 2018.
  • [12] C. Xu, J. Ren, D. Zhang, Y. Zhang, Z. Qin, and K. Ren, “Ganobfuscator: Mitigating information leakage under gan via differential privacy,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 9, pp. 2358–2371, Sep. 2019.
  • [13] C. Dwork and G. N. Rothblum, “Concentrated differential privacy,” arXiv preprint arXiv:1603.01887, 2016.
  • [14] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.   ACM, 2016, pp. 308–318.
  • [15] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
  • [16] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
  • [17] K. Wei, J. Li, M. Ding, C. Ma, H. Su, B. Zhang, and H. V. Poor, “Performance analysis and optimization in privacy-preserving federated learning,” arXiv preprint arXiv:2003.00229, 2020.
  • [18] I. Mironov, “Rényi differential privacy,” in 2017 IEEE 30th Computer Security Foundations Symposium (CSF).   IEEE, 2017, pp. 263–275.
  • [19] L. Yu, L. Liu, C. Pu, M. E. Gursoy, and S. Truex, “Differentially private model publishing for deep learning,” arXiv preprint arXiv:1904.02200, 2019.
  • [20] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
  • [21] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
  • [22] C. Darken and J. E. Moody, “Note on learning rate schedules for stochastic optimization,” in Advances in neural information processing systems, 1991, pp. 832–838.
  • [23] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of cryptography conference.   Springer, 2006, pp. 265–284.
  • [24] S. Song, K. Chaudhuri, and A. D. Sarwate, “Stochastic gradient descent with differentially private updates,” in 2013 IEEE Global Conference on Signal and Information Processing.   IEEE, 2013, pp. 245–248.
  • [25] S. Song, K. Chaudhuri, and A. Sarwate, “Learning from data with heterogeneous noise using sgd,” in Artificial Intelligence and Statistics, 2015, pp. 894–902.
  • [26] T. Zhu, G. Li, W. Zhou, and S. Y. Philip, “Differentially private data publishing and analysis: A survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 8, pp. 1619–1638, 2017.
  • [27] M. Ding, D. López-Pérez, G. Mao, Z. Lin, and S. K. Das, “Dna-ga: A tractable approach for performance analysis of uplink cellular networks,” IEEE Transactions on Communications, vol. 66, no. 1, pp. 355–369, 2017.