
Wasserstein Differential Privacy

Chengyi Yang$^{1}$, Jiayin Qi$^{2}$, Aimin Zhou$^{1}$ (Corresponding Author)
Abstract

Differential privacy (DP) has achieved remarkable results in the field of privacy-preserving machine learning. However, existing DP frameworks do not satisfy all the conditions for becoming metrics, which prevents them from deriving better basic privacy properties and leads to exaggerated privacy budgets. We propose Wasserstein differential privacy (WDP), an alternative DP framework for measuring the risk of privacy leakage, which satisfies the properties of symmetry and triangle inequality. We show and prove that WDP has 13 excellent properties, which provide theoretical support for the better performance of WDP over other DP frameworks. In addition, we derive a general privacy accounting method called the Wasserstein accountant, which enables WDP to be applied in stochastic gradient descent (SGD) scenarios containing subsampling. Experiments on basic mechanisms, compositions and deep learning show that the privacy budgets obtained by the Wasserstein accountant are relatively stable and less influenced by the order. Moreover, the overestimation of privacy budgets can be effectively alleviated. The code is available at https://github.com/Hifipsysta/WDP.

Introduction

Differential privacy (Dwork et al. 2006b) is a mathematically rigorous definition of privacy, providing quantifiable descriptions of the risk of leaking sensitive information. In the early stage, research on differential privacy mainly focused on the issue of statistical queries (SQ) (McSherry 2009; Kasiviswanathan et al. 2011). As the risk of privacy leakage in machine learning has been increasingly highlighted (Wang, Si, and Wu 2015; Shokri et al. 2017; Zhu, Liu, and Han 2019), differential privacy has gradually been applied for privacy protection in deep learning (Shokri and Shmatikov 2015; Abadi et al. 2016; Phan et al. 2019; Cheng et al. 2022).

However, these techniques are always constructed on the postulation of standard DP (Dwork et al. 2006b), which only covers the worst-case scenario and tends to overestimate privacy budgets under the measure of maximum divergence (Triastcyn and Faltings 2020). Although the most commonly applied approximate differential privacy, $(\varepsilon,\delta)$-DP (Dwork et al. 2006a), ignores extreme situations of small probability by introducing a relaxation term $\delta$ called the failure probability, it is believed that $(\varepsilon,\delta)$-DP cannot strictly handle composition problems (Mironov 2017; Dong, Roth, and Su 2022). To address the above issues, subsequent research has taken the specific data distribution into consideration, in two main directions: the distribution of the privacy loss and the distribution of the unique difference. For example, concentrated differential privacy (CDP) (Dwork and Rothblum 2016), zero-concentrated differential privacy (zCDP) (Bun and Steinke 2016), and truncated concentrated differential privacy (tCDP) (Bun et al. 2018) all assume that the mean of the privacy loss follows a subgaussian distribution, while Bayesian differential privacy (BDP) (Triastcyn and Faltings 2020) considers the distribution of the only different data entry $x^{\prime}$. Nevertheless, these frameworks are all defined through upper bounds of divergences, which implies that their privacy budgets are overly pessimistic (Triastcyn and Faltings 2020).

In this paper, we introduce a variant of differential privacy from another perspective. We define the privacy budget through the upper bound of the Wasserstein distance between adjacent distributions, which we call Wasserstein differential privacy (WDP). From a semantic perspective, WDP also follows the concept of indistinguishability (Dwork et al. 2006b) in differential privacy. Specifically, for all possible adjacent databases $D$ and $D^{\prime}$, WDP reflects the maximum variation of the optimal transport (OT) cost between the distributions queried by an adversary before and after any data entry in the database changes.

Intuitively speaking, the advantages of WDP are at least two-fold. (1) WDP focuses on individuals within the distribution, rather than on the entire distribution as divergence does, which is consistent with the original intention of differential privacy to protect individual private information from leakage. (2) More importantly, WDP satisfies all the conditions for being a metric, including non-negativity, symmetry and the triangle inequality (see Propositions 1-3), properties not fully possessed by divergence-based privacy loss, as divergence itself satisfies neither symmetry nor the triangle inequality (see Proposition 11 in the appendix of Mironov (2017)).

The combination of DP and OT has been considered in several existing works, whose contributions are essentially to provide privacy guarantees for computing Wasserstein distances between data domains (Tien, Habrard, and Sebban 2019), distributions (Rakotomamonjy and Ralaivola 2021) or graph embeddings (Jin and Chen 2022). In contrast, our work computes privacy budgets through the Wasserstein distance, and our contributions are summarized as follows:

Firstly, we propose an alternative DP framework called Wasserstein differential privacy (WDP), which satisfies the three basic properties of a metric (non-negativity, symmetry and triangle inequality) and converts easily to and from other DP frameworks (see Propositions 9-11).

Secondly, we show that WDP has 13 excellent properties; notably, basic sequential composition, group privacy and advanced composition are all derived from the triangle inequality, which shows the advantage of WDP as a metric DP.

Thirdly, we derive advanced composition, privacy loss and the absolute moment under WDP, and finally develop the Wasserstein accountant to track and account privacy budgets in subsampling algorithms such as SGD in deep learning.

Fourthly, we conduct experiments to evaluate WDP on basic mechanisms, compositions and deep learning. The results show that applying WDP as the privacy framework can effectively avoid overstating privacy budgets.

Related Work

Pure differential privacy ($\varepsilon$-DP) (Dwork et al. 2006b) provides strict guarantees for all measurable events through the maximum divergence. To address the long-tailed distributions generated by privacy mechanisms, $(\varepsilon,\delta)$-DP (Dwork et al. 2006a) ignores events of extremely low probability through a relaxation term $\delta$. However, $(\varepsilon,\delta)$-DP is considered to be an overly relaxed definition (Bun et al. 2018) and cannot effectively handle composition problems, leading to parameter explosion (Mironov 2017) or failing to capture the correct hypothesis testing (Dong, Roth, and Su 2022). In view of this, CDP (Dwork and Rothblum 2016) applies a subgaussian assumption to the mean of the privacy loss. zCDP (Bun and Steinke 2016) captures the privacy loss as a subgaussian random variable through Rényi divergence. Rényi differential privacy (RDP) (Mironov 2017) proposes a more general definition of DP based on Rényi divergence. tCDP (Bun et al. 2018) further relaxes zCDP. BDP (Triastcyn and Faltings 2020) considers the distribution of the unique different entry. Subspace differential privacy (Gao, Gong, and Yu 2022) and integer subspace differential privacy (Dharangutte et al. 2023) consider privacy computing scenarios with external constraints. However, these concepts are all based on divergences, so their privacy losses do not have the properties of metrics. Although $f$-DP and its special case Gaussian differential privacy (GDP) (Dong, Roth, and Su 2022) innovatively define privacy based on the trade-off function between two types of errors in hypothesis testing, they are difficult to relate to other DP frameworks.

Wasserstein Differential Privacy

In this section, we introduce the concept of Wasserstein distance and define our Wasserstein differential privacy.

Definition 1 (Wasserstein distance (Rüschendorf 2009)). For two probability distributions $P$ and $Q$ defined over $\mathcal{R}$, their $\mu$-Wasserstein distance is

$$W_{\mu}(P,Q)=\left(\inf_{\gamma\in\Gamma(P,Q)}\int_{\mathcal{X}\times\mathcal{Y}}\rho(x,y)^{\mu}\,d\gamma(x,y)\right)^{\frac{1}{\mu}}. \quad (1)$$

where $\rho(x,y)=\|x-y\|$ is the norm defined on the probability space $\Omega=\mathcal{X}\times\mathcal{Y}$, $\Gamma(P,Q)$ is the set of all possible joint distributions, and $\gamma(x,y)>0$ satisfies $\int\gamma(x,y)\,dy=P(x)$ and $\int\gamma(x,y)\,dx=Q(y)$.

In a practical sense, $\rho(x,y)$ can be regarded as the cost of transporting one unit of mass from $x$ to $y$, and $\gamma(x,y)$ can be seen as a transport plan representing the share to be moved from $P$ to $Q$, which measures how much mass must be transported to complete the transportation.

In particular, when $\mu$ is equal to 1, we obtain the 1-Wasserstein distance applied in the Wasserstein generative adversarial network (WGAN) (Arjovsky, Chintala, and Bottou 2017; Gulrajani et al. 2017). The successful application of the 1-Wasserstein distance in WGAN should be attributed to Kantorovich-Rubinstein duality, which effectively reduces the computational complexity of the Wasserstein distance.
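For intuition, here is a minimal numerical sketch (ours, not part of the paper; it assumes NumPy and SciPy are available): in one dimension with equally many samples, the infimum over couplings in Equation 1 is attained by matching sorted samples, so $W_{\mu}$ reduces to a power mean of sorted-sample gaps, and the $\mu=1$ case can be cross-checked against SciPy's built-in 1-Wasserstein distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def empirical_w(mu, x, y):
    """mu-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal transport plan pairs order statistics
    (monotone rearrangement), so the infimum in Eq. (1) is attained by
    matching sorted samples."""
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** mu) ** (1.0 / mu)

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, 10_000)  # samples from P
q = rng.normal(0.5, 1.0, 10_000)  # samples from Q

print(empirical_w(1, p, q))        # ~0.5, the mean shift
print(wasserstein_distance(p, q))  # SciPy's 1-Wasserstein agrees
```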

Definition 2 (Kantorovich-Rubinstein distance (Kantorovich and Rubinshten 1958)). By Kantorovich-Rubinstein duality, the 1-Wasserstein distance can be equivalently expressed as the Kantorovich-Rubinstein distance

$$K(P,Q)=\sup_{\|\varphi\|_{L}\leq 1}\mathbb{E}_{x\sim P}[\varphi(x)]-\mathbb{E}_{y\sim Q}[\varphi(y)]. \quad (2)$$

where $\varphi:\mathcal{X}\rightarrow\mathcal{R}$ is the so-called Kantorovich potential, giving the optimal transport map by a closed-form formula, and $\|\varphi\|_{L}$ is the Lipschitz bound of the Kantorovich potential; $\|\varphi\|_{L}\leq 1$ indicates that $\varphi$ satisfies the 1-Lipschitz condition with

$$\|\varphi\|_{L}=\sup_{x\neq y}\frac{\rho(\varphi(x),\varphi(y))}{\rho(x,y)}. \quad (3)$$

Definition 3 ($(\mu,\varepsilon)$-WDP). A randomized algorithm $\mathcal{M}$ is said to satisfy $(\mu,\varepsilon)$-Wasserstein differential privacy if for any adjacent datasets $D,D^{\prime}\in\mathcal{D}$ and all measurable subsets $S\subseteq\mathcal{R}$ the following inequality holds

$$W_{\mu}\left(Pr[\mathcal{M}(D)\in S],\,Pr[\mathcal{M}(D^{\prime})\in S]\right)=\left(\inf_{\gamma\in\Gamma\left(Pr_{\mathcal{M}}(D),\,Pr_{\mathcal{M}}(D^{\prime})\right)}\int_{\mathcal{X}\times\mathcal{Y}}\rho(x,y)^{\mu}\,d\gamma(x,y)\right)^{\frac{1}{\mu}}\leq\varepsilon. \quad (4)$$

where $\mathcal{M}(D)$ and $\mathcal{M}(D^{\prime})$ represent the two outputs when algorithm $\mathcal{M}$ is performed on datasets $D$ and $D^{\prime}$ respectively. $Pr[\mathcal{M}(D)\in S]$ and $Pr[\mathcal{M}(D^{\prime})\in S]$ are the corresponding probability distributions, also denoted as $Pr_{\mathcal{M}}(D)$ and $Pr_{\mathcal{M}}(D^{\prime})$ in this paper. The value of $W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))$ is the privacy loss under $(\mu,\varepsilon)$-WDP, and its upper bound $\varepsilon$ is called the privacy budget.

Symbolic representations. WDP can also be represented as $W_{\mu}(\mathcal{M}(D),\mathcal{M}(D^{\prime}))\leq\varepsilon$. To emphasize that the inputs are two probability distributions, we denote WDP as $W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\leq\varepsilon$. To avoid confusion, we also represent RDP as $D_{\alpha}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime}))\leq\varepsilon$, although the representation $D_{\alpha}(\mathcal{M}(D)\|\mathcal{M}(D^{\prime}))\leq\varepsilon$ implies that the results depend on the randomized algorithm and the queried data. Both are reasonable because $\mathcal{M}(D)$ can be seen as a random variable satisfying $\mathcal{M}(D)\sim Pr_{\mathcal{M}}(D)$.

For computational convenience, we define Kantorovich differential privacy (KDP) as an alternative way to obtain the privacy loss or privacy budget under $(1,\varepsilon)$-WDP.

Definition 4 (Kantorovich Differential Privacy). If a randomized algorithm $\mathcal{M}$ satisfies $(1,\varepsilon)$-WDP, its guarantee can also be written in the form of Kantorovich-Rubinstein duality

$$K\left(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})\right)=\sup_{\|\varphi\|_{L}\leq 1}\mathbb{E}_{x\sim Pr_{\mathcal{M}}(D)}[\varphi(x)]-\mathbb{E}_{x\sim Pr_{\mathcal{M}}(D^{\prime})}[\varphi(x)]\leq\varepsilon. \quad (5)$$

$\varepsilon$-KDP is equivalent to $(1,\varepsilon)$-WDP, and can be computed more efficiently through the duality formula based on the Kantorovich-Rubinstein distance.
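As an illustration of Definition 3 and Equation 5 (a toy sketch under our own assumptions; the counting query and all constants are hypothetical), one can sample the output distributions of a Gaussian mechanism on two adjacent databases and estimate the $(1,\varepsilon)$-WDP privacy loss empirically; by Kantorovich-Rubinstein duality, the empirical 1-Wasserstein distance computed by SciPy coincides with the KDP objective.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
sigma = 2.0
# Hypothetical counting query on two adjacent databases D and D'
f_D, f_Dprime = 100.0, 101.0

# Output distributions of the Gaussian mechanism M(.) = f(.) + N(0, sigma^2)
out_D = f_D + rng.normal(0.0, sigma, 100_000)
out_Dp = f_Dprime + rng.normal(0.0, sigma, 100_000)

# Empirical (1, eps)-WDP privacy loss; for two Gaussians with equal variance
# the 1-Wasserstein distance is the mean shift |f(D) - f(D')| = 1 here.
loss = wasserstein_distance(out_D, out_Dp)
print(f"empirical W1 privacy loss: {loss:.3f}")  # ~1.0
```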

Properties of WDP

Proposition 1 (Symmetry). Let $\mathcal{M}$ be a $(\mu,\varepsilon)$-WDP algorithm. For any $\mu\geq 1$ and $\varepsilon\geq 0$, the following equation holds

$$W_{\mu}\left(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})\right)=W_{\mu}\left(Pr_{\mathcal{M}}(D^{\prime}),Pr_{\mathcal{M}}(D)\right)\leq\varepsilon. \quad (6)$$

The symmetry of $(\mu,\varepsilon)$-WDP is implied in its definition. Specifically, the set of joint distributions satisfies $\Gamma(Pr_{\mathcal{M}}(D^{\prime}),Pr_{\mathcal{M}}(D))=\Gamma(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))$. In addition, Kantorovich differential privacy also satisfies this property; the proof is available in the appendix.

Proposition 2 (Triangle Inequality). Let $D_{1},D_{2},D_{3}\in\mathcal{D}$ be three arbitrary datasets. Suppose there are fewer different data entries between $D_{1}$ and $D_{2}$ than between $D_{1}$ and $D_{3}$, and the differences between $D_{1}$ and $D_{2}$ are included in the differences between $D_{1}$ and $D_{3}$. For any randomized algorithm $\mathcal{M}$ that satisfies $(\mu,\varepsilon)$-WDP with $\mu\geq 1$, we have

$$W_{\mu}(Pr_{\mathcal{M}}(D_{1}),Pr_{\mathcal{M}}(D_{3}))\leq W_{\mu}(Pr_{\mathcal{M}}(D_{1}),Pr_{\mathcal{M}}(D_{2}))+W_{\mu}(Pr_{\mathcal{M}}(D_{2}),Pr_{\mathcal{M}}(D_{3})). \quad (7)$$

The proof is available in the appendix; Minkowski's inequality is applied in the deduction. Proposition 2 can also be understood as follows: the cost of converting from $Pr_{\mathcal{M}}(D_{1})$ to $Pr_{\mathcal{M}}(D_{2})$ and then to $Pr_{\mathcal{M}}(D_{3})$ is not lower than the cost of converting from $Pr_{\mathcal{M}}(D_{1})$ to $Pr_{\mathcal{M}}(D_{3})$ directly. The triangle inequality is indispensable in proving several properties, such as basic sequential composition (see Proposition 6), group privacy (see Proposition 13) and advanced composition (see Theorem 1).

Proposition 3 (Non-Negativity). For $\mu\geq 1$ and any randomized algorithm $\mathcal{M}$, we have $W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\geq 0$.

Proof. See proof of Proposition 3 in the appendix.

Proposition 4 (Monotonicity). For $1\leq\mu_{1}\leq\mu_{2}$, we have $W_{\mu_{1}}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\leq W_{\mu_{2}}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))$; equivalently, $(\mu_{2},\varepsilon)$-WDP implies $(\mu_{1},\varepsilon)$-WDP.

The proof is available in the appendix, and the derivation is completed with the help of Lyapunov’s inequality.

Proposition 5 (Parallel Composition). Suppose a dataset $D$ is disjointly divided into $n$ parts, denoted $D_{i}$, $i=1,2,\cdots,n$, and each randomized algorithm $\mathcal{M}_{i}$ is performed on a separate part $D_{i}$. If $\mathcal{M}_{i}:\mathcal{D}\rightarrow\mathcal{R}_{i}$ satisfies $(\mu,\varepsilon_{i})$-WDP for $i=1,2,\cdots,n$, then the set of randomized algorithms $\mathcal{M}=\{\mathcal{M}_{1},\mathcal{M}_{2},\cdots,\mathcal{M}_{n}\}$ satisfies $(\mu,\max\{\varepsilon_{1},\varepsilon_{2},\cdots,\varepsilon_{n}\})$-WDP.

Proof. See proof of Proposition 5 in the appendix.

Proposition 6 (Sequential Composition). Consider a series of randomized algorithms $\mathcal{M}=\{\mathcal{M}_{1},\cdots,\mathcal{M}_{i},\cdots,\mathcal{M}_{n}\}$ performed on a dataset sequentially. If each $\mathcal{M}_{i}:\mathcal{D}\rightarrow\mathcal{R}_{i}$ satisfies $(\mu,\varepsilon_{i})$-WDP, then $\mathcal{M}$ satisfies $(\mu,\sum_{i=1}^{n}\varepsilon_{i})$-WDP.

Proof. See proof of Proposition 6 in the appendix.
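Propositions 5 and 6 translate directly into budget bookkeeping: parallel composition over disjoint partitions takes the maximum of the per-part budgets, while sequential composition sums them. A minimal sketch (function names are ours):

```python
def parallel_budget(eps_list):
    # Proposition 5: algorithms on disjoint partitions compose to the max budget
    return max(eps_list)

def sequential_budget(eps_list):
    # Proposition 6: sequential runs on the same dataset compose additively
    return sum(eps_list)

eps = [0.1, 0.3, 0.2]
print(parallel_budget(eps))    # 0.3
print(sequential_budget(eps))  # 0.6
```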

Proposition 7 (Laplace Mechanism). If an algorithm $f:\mathcal{D}\rightarrow\mathcal{R}$ has sensitivity $\Delta_{p}f$ and the order $\mu\geq 1$, then the Laplace mechanism $\mathcal{M}_{L}=f(x)+Lap(0,\lambda)$ preserves $\left(\mu,\frac{1}{2}\Delta_{p}f\left(\sqrt{2\left[1/\lambda+\exp(-1/\lambda)-1\right]}\right)^{\frac{1}{\mu}}\right)$-WDP.

Proof. See proof of Proposition 7 in the appendix.

Proposition 8 (Gaussian Mechanism). If an algorithm $f:\mathcal{D}\rightarrow\mathcal{R}$ has sensitivity $\Delta_{p}f$ and the order $\mu\geq 1$, then the Gaussian mechanism $\mathcal{M}_{G}=f(x)+\mathcal{N}(0,\sigma^{2})$ preserves $\left(\mu,\frac{1}{2}\left(\Delta_{p}f/\sigma\right)^{\frac{1}{\mu}}\right)$-WDP.

The proof for the Gaussian mechanism is available in the appendix. The relations between parameters and privacy budgets for the Laplace and Gaussian mechanisms are summarized in Table 1.

| Differential Privacy Framework | Laplace Mechanism | Gaussian Mechanism |
| --- | --- | --- |
| DP | $1/\lambda$ | $\infty$ |
| RDP for order $\alpha$ | $\alpha>1$: $\frac{1}{\alpha-1}\log\left\{\frac{\alpha}{2\alpha-1}\exp\left(\frac{\alpha-1}{\lambda}\right)+\frac{\alpha-1}{2\alpha-1}\exp\left(-\frac{\alpha}{\lambda}\right)\right\}$; $\alpha=1$: $1/\lambda+\exp(-1/\lambda)-1$ | $\alpha/(2\sigma^{2})$ |
| WDP for order $\mu$ | $\frac{1}{2}\Delta_{p}f\left(\sqrt{2\left[1/\lambda+\exp(-1/\lambda)-1\right]}\right)^{\frac{1}{\mu}}$ | $\frac{1}{2}\left(\Delta_{p}f/\sigma\right)^{\frac{1}{\mu}}$ |
Table 1: Privacy budgets of DP, RDP and WDP for basic mechanisms. The Laplace and Gaussian mechanism budgets for DP and RDP with sensitivity 1 are taken from Table 2 in Mironov (2017). For WDP, the sensitivity $\Delta_{p}f$ can be an arbitrary positive constant.
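The closed forms in Table 1 are straightforward to evaluate; the sketch below (ours) implements the WDP rows from Propositions 7 and 8, together with the Gaussian RDP entry for comparison:

```python
import math

def wdp_laplace_eps(mu, lam, sens=1.0):
    # Proposition 7: (mu, eps)-WDP budget of the Laplace mechanism Lap(0, lam)
    inner = math.sqrt(2.0 * (1.0 / lam + math.exp(-1.0 / lam) - 1.0))
    return 0.5 * sens * inner ** (1.0 / mu)

def wdp_gaussian_eps(mu, sigma, sens=1.0):
    # Proposition 8: (mu, eps)-WDP budget of the Gaussian mechanism N(0, sigma^2)
    return 0.5 * (sens / sigma) ** (1.0 / mu)

def rdp_gaussian_eps(alpha, sigma):
    # Table 1 (from Mironov 2017): RDP budget of the Gaussian mechanism
    return alpha / (2.0 * sigma ** 2)

for mu in (1, 2, 5, 10):
    print(mu, wdp_laplace_eps(mu, lam=1.0), wdp_gaussian_eps(mu, sigma=1.0))
print(rdp_gaussian_eps(alpha=2, sigma=1.0))
```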

Proposition 9 (From DP to WDP). If $\mathcal{M}$ preserves $\varepsilon$-DP with sensitivity $\Delta_{p}f$, it also satisfies $\left(\mu,\frac{1}{2}\Delta_{p}f\left(2\varepsilon\cdot(e^{\varepsilon}-1)\right)^{\frac{1}{2\mu}}\right)$-WDP.

Proof. See proof of Proposition 9 in the appendix.

Proposition 10 (From RDP to WDP). If $\mathcal{M}$ preserves $(\alpha,\varepsilon)$-RDP with sensitivity $\Delta_{p}f$, it also satisfies $\left(\mu,\frac{1}{2}\Delta_{p}f\left(2\varepsilon\right)^{\frac{1}{2\mu}}\right)$-WDP.

Proof. See proof of Proposition 10 in the appendix.

Proposition 11 (From WDP to RDP and DP). Suppose $\mu\geq 1$ and $\log(p_{\mathcal{M}}(\cdot))$ is an $L$-Lipschitz function. If $\mathcal{M}$ preserves $(\mu,\varepsilon)$-WDP with sensitivity $\Delta_{p}f$, it also satisfies $\left(\alpha,\frac{\alpha}{\alpha-1}L\cdot\varepsilon^{\mu/(\mu+1)}\right)$-RDP. Specifically, when $\alpha\rightarrow\infty$, $\mathcal{M}$ satisfies $\left(L\cdot\varepsilon^{\mu/(\mu+1)}\right)$-DP.

The proof is available in the appendix. Here $p_{\mathcal{M}}(\cdot)$ is the probability density function of the distribution $Pr_{\mathcal{M}}(\cdot)$.
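The conversion formulas in Propositions 9-11 are closed-form, so translating a budget between frameworks is a one-line computation each way; a sketch (ours, with illustrative parameter values):

```python
import math

def dp_to_wdp(eps, mu, sens):
    # Proposition 9: eps-DP implies (mu, .)-WDP
    return 0.5 * sens * (2.0 * eps * (math.exp(eps) - 1.0)) ** (1.0 / (2.0 * mu))

def rdp_to_wdp(eps, mu, sens):
    # Proposition 10: (alpha, eps)-RDP implies (mu, .)-WDP
    return 0.5 * sens * (2.0 * eps) ** (1.0 / (2.0 * mu))

def wdp_to_rdp(eps, mu, alpha, L):
    # Proposition 11: (mu, eps)-WDP implies (alpha, .)-RDP,
    # assuming the log-density is L-Lipschitz
    return alpha / (alpha - 1.0) * L * eps ** (mu / (mu + 1.0))

def wdp_to_dp(eps, mu, L):
    # Proposition 11 with alpha -> infinity
    return L * eps ** (mu / (mu + 1.0))

print(dp_to_wdp(eps=1.0, mu=2, sens=1.0))
print(wdp_to_dp(eps=0.5, mu=2, L=1.0))
```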

Proposition 12 (Post-Processing). Let $\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R}$ be a $(\mu,\varepsilon)$-Wasserstein differentially private algorithm, and let $\mathcal{G}:\mathcal{R}\rightarrow\mathcal{R}^{\prime}$ be an arbitrary randomized mapping. For any order $\mu\in[1,\infty)$ and all measurable subsets $S\subseteq\mathcal{R}^{\prime}$, $\mathcal{G}(\mathcal{M}(\cdot))$ is also $(\mu,\varepsilon)$-Wasserstein differentially private, namely

$$W_{\mu}\left(Pr[\mathcal{G}(\mathcal{M}(D))\in S],\,Pr[\mathcal{G}(\mathcal{M}(D^{\prime}))\in S]\right)\leq\varepsilon. \quad (8)$$

Proof. See proof of Proposition 12 in the appendix.

Proposition 13 (Group Privacy). Let $\mathcal{M}:\mathcal{D}\mapsto\mathcal{R}$ be a $(\mu,\varepsilon)$-Wasserstein differentially private algorithm. Then for any pair of datasets $D,D^{\prime}\in\mathcal{D}$ differing in $k$ data entries $x_{1},\cdots,x_{k}$, $\mathcal{M}$ is $(\mu,k\varepsilon)$-Wasserstein differentially private.

Proof. See proof of Proposition 13 in the appendix.

Implementation in Deep Learning

Advanced Composition

To derive advanced composition under WDP, we first define generalized $(\mu,\varepsilon)$-WDP.

Definition 5 (Generalized $(\mu,\varepsilon)$-WDP). A randomized mechanism $\mathcal{M}$ is generalized $(\mu,\varepsilon)$-Wasserstein differentially private if for any two adjacent datasets $D,D^{\prime}\in\mathcal{D}$ it holds that

$$Pr[W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\geq\varepsilon]\leq\delta. \quad (9)$$

According to the above definition, $(\mu,\varepsilon)$-WDP can be regarded as a special case of generalized $(\mu,\varepsilon)$-WDP when $\delta$ tends to zero.

Definition 5 is helpful for designing the Wasserstein accountant applied in private deep learning, and we deduce several necessary theorems based on this notion in the following.

Theorem 1 (Advanced Composition). Suppose a randomized algorithm $\mathcal{M}$ consists of a sequence of $(\mu,\varepsilon)$-WDP algorithms $\mathcal{M}_{1},\mathcal{M}_{2},\cdots,\mathcal{M}_{T}$, which are performed on dataset $D$ adaptively and satisfy $\mathcal{M}_{t}:\mathcal{D}\rightarrow\mathcal{R}_{t}$, $t\in\{1,2,\cdots,T\}$. $\mathcal{M}$ is generalized $(\mu,\varepsilon)$-Wasserstein differentially private with $\varepsilon>0$ and $\mu\geq 1$ if for any two adjacent datasets $D,D^{\prime}\in\mathcal{D}$ it holds that

$$\exp\left[\beta\sum_{t=1}^{T}\mathbb{E}\left(W_{\mu}(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime}))\right)-\beta\varepsilon\right]\leq\delta. \quad (10)$$

where $\beta>0$ is a customizable parameter.

Proof. See proof of Theorem 1 in the appendix.
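Equation 10 gives a direct recipe for the failure probability: once the expected per-step Wasserstein losses are known, $\delta$ is a single exponential. A minimal sketch (ours; the loss values are hypothetical):

```python
import math

def advanced_composition_delta(expected_losses, eps, beta):
    """Failure probability delta from Equation 10, given the per-step
    expected Wasserstein losses E[W_mu] and a target budget eps."""
    return math.exp(beta * (sum(expected_losses) - eps))

# 100 steps with hypothetical expected loss 0.01 each, target eps = 2
print(advanced_composition_delta([0.01] * 100, eps=2.0, beta=10.0))  # ~4.5e-05
```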

Privacy Loss and Absolute Moment

Theorem 2. Suppose an algorithm $\mathcal{M}$ consists of a sequence of private algorithms $\mathcal{M}_{1},\mathcal{M}_{2},\cdots,\mathcal{M}_{T}$ protected by the Gaussian mechanism and satisfying $\mathcal{M}_{t}:\mathcal{D}\rightarrow\mathcal{R}$, $t\in\{1,2,\cdots,T\}$. If the subsampling probability, scale parameter and $l_{2}$-sensitivity of algorithm $\mathcal{M}_{t}$ are represented by $q\in[0,1]$, $\sigma>0$ and $d_{t}\geq 0$ respectively, then the privacy loss under WDP at epoch $t$ is

$$W_{\mu}\left(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})\right)=\inf_{d_{t}}\left[\sum_{i=1}^{n}\mathbb{E}\left(|Z_{ti}|^{\mu}\right)\right]^{\frac{1}{\mu}},\quad Z_{t}\sim\mathcal{N}\left(qd_{t},(2-2q+2q^{2})\sigma^{2}\right). \quad (11)$$

where $Pr_{\mathcal{M}_{t}}(D)$ is the outcome distribution when performing $\mathcal{M}_{t}$ on $D$ at epoch $t$, and $d_{t}=\|g_{t}-g_{t}^{\prime}\|_{2}$ represents the $l_{2}$ norm between the pair of adjacent gradients $g_{t}$ and $g_{t}^{\prime}$. In addition, $Z_{t}$ is a vector following a Gaussian distribution, and $Z_{ti}$ represents the $i$-th component of $Z_{t}$.

Proof. See proof of Theorem 2 in the appendix.

Note that $\mathbb{E}\left(|Z_{ti}|^{\mu}\right)$ is the $\mu$-th raw absolute moment of the Gaussian distribution $\mathcal{N}\left(qd_{t},(2-2q+2q^{2})\sigma^{2}\right)$. The raw moments of a Gaussian distribution can be obtained by taking $\mu$-th order derivatives of its moment generating function. Nevertheless, we do not adopt such an indirect approach; we derive a direct formula, as shown in Lemma 1.

Lemma 1 (Raw Absolute Moment). Assume that $Z_{t}\sim\mathcal{N}(qd_{t},(2-2q+2q^{2})\sigma^{2})$. The raw absolute moment of $Z_{t}$ is

$$\mathbb{E}\left(|Z_{t}|^{\mu}\right)=\left(2Var\right)^{\frac{\mu}{2}}\frac{GF\left(\frac{\mu+1}{2}\right)}{\sqrt{\pi}}\mathcal{K}\left(-\frac{\mu}{2},\frac{1}{2};-\frac{q^{2}d_{t}^{2}}{2Var}\right). \quad (12)$$

where $Var$ represents the variance of the random variable $Z_{t}$, i.e., $Var=(2-2q+2q^{2})\sigma^{2}$, and $GF\left(\frac{\mu+1}{2}\right)$ represents the Gamma function

$$GF\left(\frac{\mu+1}{2}\right)=\int_{0}^{\infty}x^{\frac{\mu+1}{2}-1}e^{-x}\,dx, \quad (13)$$

and $\mathcal{K}\left(-\frac{\mu}{2},\frac{1}{2};-\frac{q^{2}d_{t}^{2}}{2Var}\right)$ represents Kummer's confluent hypergeometric function,

$$\mathcal{K}\left(-\frac{\mu}{2},\frac{1}{2};-\frac{q^{2}d_{t}^{2}}{2Var}\right)=\sum_{n=0}^{\infty}\frac{q^{2n}d_{t}^{2n}}{n!\cdot 4^{n}(1-q+q^{2})^{n}\sigma^{2n}}\prod_{i=1}^{n}\frac{\mu-2i+2}{1+2i-2}. \quad (14)$$

Proof. Our mathematical deduction is based on the work of Winkelbauer (2012), and the proof is available in the appendix.
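Equation 12 maps directly onto standard special functions; a sketch (ours, assuming SciPy, whose `gamma` and `hyp1f1` play the roles of $GF$ and $\mathcal{K}$), with a Monte Carlo cross-check:

```python
import numpy as np
from scipy.special import gamma, hyp1f1

def raw_abs_moment(mu, q, d_t, sigma):
    """E|Z_t|^mu for Z_t ~ N(q*d_t, (2-2q+2q^2) sigma^2), per Equation 12.

    SciPy's gamma is the Gamma function GF, and hyp1f1(a, b, x) is
    Kummer's confluent hypergeometric function K(a, b; x)."""
    var = (2.0 - 2.0 * q + 2.0 * q ** 2) * sigma ** 2
    mean = q * d_t
    return ((2.0 * var) ** (mu / 2.0)
            * gamma((mu + 1.0) / 2.0) / np.sqrt(np.pi)
            * hyp1f1(-mu / 2.0, 0.5, -mean ** 2 / (2.0 * var)))

# Monte Carlo cross-check with hypothetical parameters
q, d_t, sigma, mu = 0.01, 1.0, 1.0, 2.0
var = (2.0 - 2.0 * q + 2.0 * q ** 2) * sigma ** 2
z = np.random.default_rng(0).normal(q * d_t, np.sqrt(var), 1_000_000)
print(raw_abs_moment(mu, q, d_t, sigma), np.mean(np.abs(z) ** mu))
```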

Wasserstein Accountant in Deep Learning

Next, we deduce the Wasserstein accountant applied in private deep learning. We obtain Theorem 3 based on the above preparations, including advanced composition, privacy loss and the absolute moment under WDP.

Theorem 3 (Tail Bound). Under the conditions described in Theorem 2, $\mathcal{M}$ satisfies $(\mu,\varepsilon)$-WDP for

$$\log\delta=\beta\sum_{t=1}^{T}\inf_{d_{t}}\left[\sum_{i=1}^{n}\mathbb{E}\left(|Z_{ti}|^{\mu}\right)\right]^{\frac{1}{\mu}}-\beta\varepsilon. \quad (15)$$

where $Z_{t}\sim\mathcal{N}\left(qd_{t},(2-2q+2q^{2})\sigma^{2}\right)$ and $d_{t}=\|g_{t}-g_{t}^{\prime}\|_{2}$. The proof of Theorem 3 is available in the appendix. Conversely, if we have determined $\delta$ and want to know the privacy budget $\varepsilon$, we can utilize the result in Corollary 1.

Corollary 1. Under the conditions described in Theorem 2, $\mathcal{M}$ satisfies $(\mu,\varepsilon)$-WDP for

$$\varepsilon=\sum_{t=1}^{T}\inf_{d_{t}}\left[\sum_{i=1}^{n}\mathbb{E}\left(|Z_{ti}|^{\mu}\right)\right]^{\frac{1}{\mu}}-\frac{1}{\beta}\log\delta. \quad (16)$$

Corollary 1 is more commonly used than Theorem 3, since the total privacy budget generated by an algorithm plays a more important role in privacy computing.
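Putting Theorem 2, Lemma 1 and Corollary 1 together, a Wasserstein accountant reduces to evaluating the per-epoch moment and accumulating. The sketch below (ours) plugs in given per-epoch gradient gaps $d_t$ instead of taking the infimum, and assumes the $n$ components of $Z_t$ are identically distributed; all parameter values are hypothetical:

```python
import math
from scipy.special import gamma, hyp1f1

def epoch_loss(mu, q, d_t, sigma, n):
    """[sum_i E|Z_ti|^mu]^(1/mu) for one epoch (Theorem 2 plus Lemma 1),
    assuming the n components of Z_t are identically distributed."""
    var = (2.0 - 2.0 * q + 2.0 * q ** 2) * sigma ** 2
    moment = ((2.0 * var) ** (mu / 2.0) * gamma((mu + 1.0) / 2.0)
              / math.sqrt(math.pi)
              * hyp1f1(-mu / 2.0, 0.5, -(q * d_t) ** 2 / (2.0 * var)))
    return (n * moment) ** (1.0 / mu)

def wasserstein_accountant(mu, q, sigma, n, d_ts, delta, beta):
    """Total budget eps from Corollary 1 (Equation 16); d_ts holds the
    per-epoch gradient gaps d_t = ||g_t - g'_t||_2, plugged in directly
    rather than minimized over."""
    total = sum(epoch_loss(mu, q, d_t, sigma, n) for d_t in d_ts)
    return total - math.log(delta) / beta

d_ts = [0.01] * 100  # hypothetical per-epoch gradient gaps
print(wasserstein_accountant(mu=2.0, q=0.01, sigma=1.0, n=10,
                             d_ts=d_ts, delta=1e-10, beta=10.0))
```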

Experiments

The experiments in this paper consist of four parts. Firstly, we test the Laplace mechanism and Gaussian mechanism under RDP and WDP with varying orders. Secondly, we carry out the composition experiments and compare our Wasserstein accountant with the Bayesian accountant and moments accountant. Thirdly, we consider the application scenario of deep learning, and train a convolutional neural network (CNN) optimized by differentially private stochastic gradient descent (DP-SGD) (Abadi et al. 2016) on the task of image classification. Finally, we demonstrate the impact of hyperparameter variations on privacy budgets. All the experiments were performed on a single machine with Ubuntu 18.04, 40 Intel(R) Xeon(R) Silver 4210R CPUs @ 2.40GHz, and two NVIDIA Quadro RTX 8000 GPUs.

Basic Mechanisms

We conduct experiments to test the Laplace mechanism and Gaussian mechanism under RDP and WDP, based on the results of Propositions 7 and 8 and Table 1. We set the scale parameters of the Laplace mechanism and Gaussian mechanism to 1, 2, 3 and 5 respectively. The order $\mu$ of WDP is allowed to vary from 1 to 10, and so is the order $\alpha$ of RDP. We plot the values of privacy budgets $\varepsilon$ against increasing orders, and the results are shown in Figure 1.

We observe that the privacy budgets of WDP increase as $\mu$ grows, which corresponds to our monotonicity property (see Proposition 4). More importantly, the privacy budgets of WDP are not susceptible to the order $\mu$: their curves all exhibit slow upward trends. In contrast, the privacy budgets of RDP experience a steep increase under the Gaussian mechanism when the noise scale equals 1, simply because the order $\alpha$ increases. In addition, the slopes of the RDP curves with different noise scales are significantly different. These phenomena leave users confused about order selection and about risk assessment through privacy budgets when utilizing RDP.

Figure 1: Privacy budget curves of $(\mu,\varepsilon)$-WDP and $(\alpha,\varepsilon)$-RDP for the Laplace mechanism (LM) and Gaussian mechanism (GM) with varying orders: (a) LM for RDP, (b) LM for WDP, (c) GM for RDP, (d) GM for WDP. Here $\lambda$ and $\sigma$ are the scales of LM and GM respectively. The sensitivities are set to 1 and remain unchanged.

Composition

For the convenience of comparison, we adopt the same settings as the composition experiment in Triastcyn and Faltings (2020). We imitate heavy-tailed gradient distributions by generating synthetic gradients from a Weibull distribution with shape parameter $0.5$ and size $50\times 1000$.

The hyperparameter $\sigma$ remains fixed at $0.2$, and the threshold of gradient clipping $C$ is set to the $\{0.05,0.50,0.75,0.99\}$-quantiles of the gradient norm in turn. To observe the original variations of the privacy budgets, we do not clip gradients; thus $C$ only affects the Gaussian noise with variance $C^{2}\sigma^{2}$ in DP-SGD (Abadi et al. 2016) in this experiment. In addition, we provide the composition results with gradient clipping in the appendix for comparison.

In Figure 2, we have the following key observations. (1) The curves obtained from the Wasserstein accountant (WA) almost replicate the changes and trends depicted by the curves obtained from the moments accountant (MA) and Bayesian accountant (BA). (2) The privacy budgets under WA are always the lowest, and this advantage becomes more significant as $C$ increases.

The above results show that Wasserstein accountant can retain the privacy features expressed by MA and BA at a lower privacy budget.

Figure 2: Privacy budgets over synthetic gradients obtained by the moments accountant under DP, the Bayesian accountant under BDP and the Wasserstein accountant under WDP without gradient clipping, for the (a) 0.05-, (b) 0.50-, (c) 0.75- and (d) 0.99-quantiles of $\|g_{t}\|$.

Deep Learning

We adopt DP-SGD (Abadi et al. 2016) as the private optimizer to obtain the privacy budgets under MA, BA and our WA when applying a CNN model designed by Triastcyn and Faltings (2020) to the task of image classification on four baseline datasets including MNIST (Lecun et al. 1998), CIFAR-10 (Krizhevsky and Hinton 2009), SVHN (Netzer et al. 2011) and Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017).

In the deep learning experiment, we allow different DP frameworks to adjust the noise scale $\sigma$ according to their own needs, for the following reasons. (1) MA supported by DP can easily lead to gradient explosion when the noise scale is small, so $\sigma$ can only take a relatively large value to avoid this situation; however, excessive noise limits the performance of BDP and WDP. (2) This setting also makes our experimental results more convenient to compare with those in BDP (Triastcyn and Faltings 2020), because the deep learning experiment in BDP is designed in the same way.

Table 2 shows the results obtained under the above experimental settings. We observe the following phenomena: (1) WDP requires lower privacy budgets than DP and BDP to achieve the same level of test accuracy. (2) The convergence of the deep learning model under WA is faster than under MA and BA. Taking the experiments on the MNIST dataset as an example, DP and BDP need more than 100 epochs and 50 epochs of training respectively to achieve an accuracy of 96%, while our WDP reaches the same level after only 16 epochs of training.

BDP (Triastcyn and Faltings 2020) attributes its better performance over DP to considering gradient distribution information. Similarly, we can analyze the advantages of WDP from the following aspects. (1) From the perspective of the definition, WDP also utilizes gradient distribution information through $\gamma\in\Gamma\left(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})\right)$; from the perspective of the Wasserstein accountant, the information of the gradient distribution is included in $d_{t}$ and $Z_{t}$. (2) More importantly, privacy budgets under WDP do not explode even under low noise, because the Wasserstein distance is more stable than Rényi divergence or maximum divergence; this is similar to the reason why WGAN (Arjovsky, Chintala, and Bottou 2017) succeeds in alleviating mode collapse by applying the Wasserstein distance.

| Dataset | Accuracy (Non-Private) | Accuracy (Private) | Privacy: DP ($\delta=10^{-5}$) | Privacy: BDP ($\delta=10^{-10}$) | Privacy: WDP ($\delta=10^{-10}$) |
| --- | --- | --- | --- | --- | --- |
| MNIST | 99% | 96% | 2.2 (0.898) | 0.95 (0.721) | 0.76 (0.681) |
| CIFAR-10 | 86% | 73% | 8.0 (0.999) | 0.76 (0.681) | 0.52 (0.627) |
| SVHN | 93% | 92% | 5.0 (0.999) | 0.87 (0.705) | 0.40 (0.599) |
| F-MNIST | 92% | 90% | 2.9 (0.623) | 0.91 (0.713) | 0.45 (0.611) |
Table 2: Privacy budgets accounted by DP, BDP and WDP on MNIST, CIFAR-10, SVHN and Fashion-MNIST (F-MNIST). The values in parentheses are the probabilities of a potential attack succeeding, computed by $P(A)=1/(1+e^{-\varepsilon})$ (see Section 3 in Triastcyn and Faltings (2020)).
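The conversion from a privacy budget to the attack-success probability reported in parentheses is a single logistic transform; a sketch (ours):

```python
import math

def attack_success_probability(eps):
    # P(A) = 1 / (1 + e^{-eps}), Section 3 of Triastcyn and Faltings (2020)
    return 1.0 / (1.0 + math.exp(-eps))

for eps in (0.95, 0.76):
    print(eps, round(attack_success_probability(eps), 3))  # 0.721, 0.681
```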

Effect of β\beta and δ\delta

We also conduct experiments to illustrate the relation between privacy budgets and the related hyperparameters, based on the results of Theorem 3 and Corollary 1 proved above. In Figure 3(a), the hyperparameter $\beta$ in WDP is allowed to vary from 1 to 50, and the failure probability $\delta$ of WDP takes values in $\{10^{-10},10^{-8},10^{-5},10^{-3}\}$. In Figure 3(b), the failure probability $\delta$ is allowed to vary from $10^{-10}$ to $10^{-5}$, and the hyperparameter $\beta$ under WDP takes values in $\{1,2,5,10\}$. We observe in Figure 3(a) that $\beta$ has a clear effect on the value of $\varepsilon$: $\varepsilon$ decreases quickly when $\beta$ is less than 10, and very slowly when it is greater than 10. In Figure 3(b), $\varepsilon$ decreases roughly uniformly with the exponential growth of $\delta$.

Figure 3: The impact of $\beta$ and $\delta$: (a) $\varepsilon$ varies with $\beta$; (b) $\varepsilon$ varies with $\delta$. The horizontal axis in 3(b) is on a logarithmic scale.

Discussion

Relations to Other DP Frameworks

We establish bridges between WDP, DP and RDP through Propositions 9, 10 and 11. We know that $\varepsilon$-DP implies $\left(\mu,\frac{1}{2}\Delta_{p}f\left(2\varepsilon\cdot(e^{\varepsilon}-1)\right)^{\frac{1}{2\mu}}\right)$-WDP and $(\alpha,\varepsilon)$-RDP implies $\left(\mu,\frac{1}{2}\Delta_{p}f\left(2\varepsilon\right)^{\frac{1}{2\mu}}\right)$-WDP. In addition, $(\mu,\varepsilon)$-WDP implies $\left(\alpha,\frac{\alpha}{\alpha-1}L\cdot\varepsilon^{\mu/(\mu+1)}\right)$-RDP or $\left(L\cdot\varepsilon^{\mu/(\mu+1)}\right)$-DP.

With the above basic conclusions, we can obtain further derivative relationships through RDP or DP. For example, $(\mu,\varepsilon)$-WDP implies $\frac{1}{2}\left(L\cdot\varepsilon^{\mu/(\mu+1)}\right)^{2}$-zCDP (zero-concentrated differential privacy) according to Proposition 1.4 in Bun and Steinke (2016).
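Chaining Proposition 11 with Proposition 1.4 of Bun and Steinke (2016), the WDP-to-zCDP conversion is again closed-form; a sketch (ours, with illustrative values):

```python
def wdp_to_zcdp(eps, mu, L):
    # (mu, eps)-WDP -> (L * eps^(mu/(mu+1)))-DP (Proposition 11, alpha -> inf),
    # then eps'-DP -> (eps'^2 / 2)-zCDP (Proposition 1.4, Bun and Steinke 2016)
    eps_dp = L * eps ** (mu / (mu + 1.0))
    return 0.5 * eps_dp ** 2

print(wdp_to_zcdp(eps=0.5, mu=2.0, L=1.0))  # ~0.198
```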

Advantages from Metric Property

The privacy losses of DP, RDP and BDP are all non-negative but asymmetric, and they do not satisfy the triangle inequality (Mironov 2017). Several obvious advantages of WDP as a metric DP have been mentioned in the Introduction and verified in the Experiments section; here we provide additional details.

Triangle inequality. (1) Several properties, including basic sequential composition, group privacy and advanced composition, are derived from the triangle inequality. (2) Properties of WDP are more comprehensible and easier to utilize than those of RDP. For example, RDP has to introduce the additional conditions of $2^{c}$-stability and $\alpha\geq 2^{c+1}$ to derive group privacy (see Proposition 2 in Mironov (2017)), where $c$ is a constant. In contrast, WDP utilizes its intrinsic triangle inequality to obtain group privacy without introducing any complex concepts or conditions.

Symmetry. We acknowledge that the asymmetry of the privacy loss is not transferred to the privacy budget. Specifically, even if $D_{\alpha}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime}))\neq D_{\alpha}(Pr_{\mathcal{M}}(D^{\prime})\|Pr_{\mathcal{M}}(D))$, $D_{\alpha}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime}))\leq\varepsilon$ still implies $D_{\alpha}(Pr_{\mathcal{M}}(D^{\prime})\|Pr_{\mathcal{M}}(D))\leq\varepsilon$, because the neighboring datasets $D$ and $D^{\prime}$ range over all possible pairs. Even so, a symmetric privacy loss still has at least two advantages: (1) when computing privacy budgets, it halves the amount of computation needed to traverse adjacent datasets; (2) when proving properties, it is not necessary to exchange the datasets and repeat the deduction, as in non-metric DP (e.g., see the proof of Theorem 3 in Triastcyn and Faltings (2020)).

Limitations

WDP has excellent mathematical properties as a metric DP, and can effectively alleviate exploding privacy budgets as an alternative DP framework. However, when the volume of data in the queried database is extremely small, WDP may release a much smaller privacy budget than other DP frameworks. Fortunately, this situation only occurs when very little data is available in the dataset; WDP has great potential in deep learning, which requires a large amount of data to train neural network models.

Additional Specifications

Other possibility. Symmetry can also be obtained by replacing Rényi divergence with Jensen-Shannon divergence (JSD) (Rao and Nayak 1985). However, JSD does not satisfy the triangle inequality unless its square root is taken instead (Osán, Bussandri, and Lamberti 2018). Moreover, it still tends to exaggerate privacy budgets excessively, as it is defined based on divergence.

Comparability. Another question worth explaining is why the privacy budgets obtained by DP, RDP and WDP can be compared. (1) Their processes of computing privacy budgets follow the same mapping, namely $\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R}$. (2) They essentially measure the differences between the output distributions on adjacent datasets, although their respective measurement methods differ. (3) Privacy budgets can be uniformly transformed into the probability of a successful attack (Triastcyn and Faltings 2020).

Computational problem. Although obtaining the Wasserstein distance generally incurs relatively high computational costs (Dudley 1969; Fournier and Guillin 2015), this is not a concern here: WDP never needs to compute the Wasserstein distance directly, neither in the basic privacy mechanisms nor in the Wasserstein accountant for deep learning (see Propositions 7-8 and Theorems 1-3).

Conclusion

In this paper, we propose an alternative DP framework called Wasserstein differential privacy (WDP), based on the Wasserstein distance. WDP satisfies symmetry, the triangle inequality and non-negativity, properties that other DP frameworks do not all satisfy, which enables the privacy losses under WDP to be real metrics. We prove that WDP has several excellent properties (see Propositions 1-13) through Lyapunov's inequality, Minkowski's inequality, Jensen's inequality, Markov's inequality, Pinsker's inequality and the triangle inequality. We also derive the advanced composition theorem, privacy loss and absolute moment under WDP, and finally obtain the Wasserstein accountant to compute cumulative privacy budgets in deep learning (see Theorems 1-3 and Lemma 1). Our evaluations on basic mechanisms, compositions and deep learning show that WDP makes privacy budgets more stable and can effectively avoid the overestimation or even explosion of privacy budgets.

Acknowledgments

This work is supported by National Natural Science Foundation of China (No. 72293583, No. 72293580), Science and Technology Commission of Shanghai Municipality Grant (No. 22511105901), Defense Industrial Technology Development Program (JCKY2019204A007) and Sino-German Research Network (GZ570).

References

  • Abadi et al. (2016) Abadi, M.; Chu, A.; Goodfellow, I. J.; McMahan, H. B.; Mironov, I.; Talwar, K.; and Zhang, L. 2016. Deep Learning with Differential Privacy. In Proceedings of ACM SIGSAC Conference on Computer and Communications Security (CCS), 308–318.
  • Arjovsky, Chintala, and Bottou (2017) Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (ICML), 214–223.
  • Bobkov and Ledoux (2019) Bobkov, S.; and Ledoux, M. 2019. One-Dimensional Empirical Measures, Order Statistics, and Kantorovich Transport Distances. Memoirs of the American Mathematical Society, 261(1259).
  • Bun et al. (2018) Bun, M.; Dwork, C.; Rothblum, G. N.; and Steinke, T. 2018. Composable and Versatile Privacy via Truncated CDP. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 74–86. ACM.
  • Bun and Steinke (2016) Bun, M.; and Steinke, T. 2016. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. In Theory of Cryptography Conference (TCC), volume 9985, 635–658.
  • Cheng et al. (2022) Cheng, A.; Wang, J.; Zhang, X. S.; Chen, Q.; Wang, P.; and Cheng, J. 2022. DPNAS: Neural Architecture Search for Deep Learning with Differential Privacy. In Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 6358–6366.
  • Clement and Desch (2008) Clement, P.; and Desch, W. 2008. An Elementary Proof of the Triangle Inequality for the Wasserstein Metric. Proceedings of the American Mathematical Society, 136(1): 333–339.
  • Dharangutte et al. (2023) Dharangutte, P.; Gao, J.; Gong, R.; and Yu, F. 2023. Integer Subspace Differential Privacy. In Williams, B.; Chen, Y.; and Neville, J., eds., Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI), 7349–7357. AAAI Press.
  • Dong, Roth, and Su (2022) Dong, J.; Roth, A.; and Su, W. J. 2022. Gaussian Differential Privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1): 3–37.
  • Dudley (1969) Dudley, R. M. 1969. The Speed of Mean Glivenko-Cantelli Convergence. Annals of Mathematical Statistics, 40: 40–50.
  • Dwork et al. (2006a) Dwork, C.; Kenthapadi, K.; McSherry, F.; Mironov, I.; and Naor, M. 2006a. Our Data, Ourselves: Privacy via Distributed Noise Generation. In Vaudenay, S., ed., 25th Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), volume 4004, 486–503. Springer.
  • Dwork and Lei (2009) Dwork, C.; and Lei, J. 2009. Differential Privacy and Robust Statistics. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), 371–380.
  • Dwork et al. (2006b) Dwork, C.; McSherry, F.; Nissim, K.; and Smith, A. D. 2006b. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography, Third Theory of Cryptography Conference (TCC), volume 3876, 265–284. Springer.
  • Dwork and Roth (2014) Dwork, C.; and Roth, A. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theory Computer Science, 9(3-4): 211–407.
  • Dwork and Rothblum (2016) Dwork, C.; and Rothblum, G. N. 2016. Concentrated Differential Privacy. arXiv preprint arXiv:1603.01887.
  • Erven and Harremoës (2014) Erven, T. V.; and Harremoës, P. 2014. Rényi Divergence and Kullback-Leibler Divergence. IEEE Transactions Information Theory, 60(7): 3797–3820.
  • Fedotov, Harremoës, and Topsøe (2003) Fedotov, A. A.; Harremoës, P.; and Topsøe, F. 2003. Refinements of Pinsker’s inequality. IEEE Transactions on Information Theory, 49(6): 1491–1498.
  • Fournier and Guillin (2015) Fournier, N.; and Guillin, A. 2015. On the Rate of Convergence in Wasserstein Distance of the Empirical Measure. Probability Theory and Related Fields, 162: 707–738.
  • Gao, Gong, and Yu (2022) Gao, J.; Gong, R.; and Yu, F. 2022. Subspace Differential Privacy. In Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 3986–3995.
  • Gulrajani et al. (2017) Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NeurIPS), 5767–5777.
  • Jin and Chen (2022) Jin, H.; and Chen, X. 2022. Gromov-Wasserstein Discrepancy with Local Differential Privacy for Distributed Structural Graphs. In Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI), 2115–2121.
  • Kantorovich and Rubinshten (1958) Kantorovich, L. V.; and Rubinshten, G. S. 1958. On a Space of Completely Additive Functions. Vestnik Leningrad Univ, 13(7): 52–59.
  • Kasiviswanathan et al. (2011) Kasiviswanathan, S. P.; Lee, H. K.; Nissim, K.; Raskhodnikova, S.; and Smith, A. D. 2011. What Can We Learn Privately? SIAM Journal on Computing, 40(3): 793–826.
  • Krizhevsky and Hinton (2009) Krizhevsky, A.; and Hinton, G. 2009. Learning Multiple Layers of Features from Tiny Images. Handbook of Systemic Autoimmune Diseases, 1(4).
  • Lecun et al. (1998) Lecun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11): 2278–2324.
  • McSherry (2009) McSherry, F. 2009. Privacy Integrated Queries: An Extensible Platform for Privacy-Preserving Data Analysis. In Proceedings of ACM International Conference on Management of Data (SIGMOD), 19–30.
  • Mironov (2017) Mironov, I. 2017. Rényi Differential Privacy. In 30th IEEE Computer Security Foundations Symposium (CSF), 263–275.
  • Netzer et al. (2011) Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
  • Osán, Bussandri, and Lamberti (2018) Osán, T. M.; Bussandri, D. G.; and Lamberti, P. W. 2018. Monoparametric Family of Metrics Derived from Classical Jensen–Shannon Divergence. Physica A: Statistical Mechanics and its Applications, 495: 336–344.
  • Panaretos and Zemel (2019) Panaretos, V. M.; and Zemel, Y. 2019. Statistical Aspects of Wasserstein Distances. Annual Review of Statistics and Its Application, 6(1).
  • Phan et al. (2019) Phan, N.; Vu, M. N.; Liu, Y.; Jin, R.; Dou, D.; Wu, X.; and Thai, M. T. 2019. Heterogeneous Gaussian Mechanism: Preserving Differential Privacy in Deep Learning with Provable Robustness. In International Joint Conference on Artificial Intelligence (IJCAI), 4753–4759.
  • Rakotomamonjy and Ralaivola (2021) Rakotomamonjy, A.; and Ralaivola, L. 2021. Differentially Private Sliced Wasserstein Distance. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139, 8810–8820.
  • Rao and Nayak (1985) Rao, C.; and Nayak, T. 1985. Cross entropy, Dissimilarity Measures, and Characterizations of Quadratic Entropy. IEEE Transactions on Information Theory, 31(5): 589–593.
  • Rüschendorf (2009) Rüschendorf, L. 2009. Optimal Transport. Old and New. Jahresbericht der Deutschen Mathematiker-Vereinigung, 111(2): 18–21.
  • Shokri and Shmatikov (2015) Shokri, R.; and Shmatikov, V. 2015. Privacy-Preserving Deep Learning. In Proceedings of ACM SIGSAC Conference on Computer and Communications Security (CCS), 1310–1321.
  • Shokri et al. (2017) Shokri, R.; Stronati, M.; Song, C.; and Shmatikov, V. 2017. Membership Inference Attacks Against Machine Learning Models. In IEEE Symposium on Security and Privacy (SP), 3–18.
  • Tien, Habrard, and Sebban (2019) Tien, N. L.; Habrard, A.; and Sebban, M. 2019. Differentially Private Optimal Transport: Application to Domain Adaptation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), 2852–2858.
  • Triastcyn and Faltings (2020) Triastcyn, A.; and Faltings, B. 2020. Bayesian Differential Privacy for Machine Learning. In International Conference on Machine Learning (ICML), 9583–9592.
  • Wang, Si, and Wu (2015) Wang, Y.; Si, C.; and Wu, X. 2015. Regression Model Fitting under Differential Privacy and Model Inversion Attack. In International Joint Conference on Artificial Intelligence (IJCAI), 1003–1009.
  • Winkelbauer (2012) Winkelbauer, A. 2012. Moments and Absolute Moments of the Normal Distribution. arXiv preprint arXiv:1209.4340.
  • Xiao, Rasul, and Vollgraf (2017) Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747.
  • Zhu, Liu, and Han (2019) Zhu, L.; Liu, Z.; and Han, S. 2019. Deep Leakage from Gradients. In Advances in Neural Information Processing Systems (NeurIPS), 14747–14756.

Appendix A Proof of Propositions and Theorems

Proof of Proposition 1

Proposition 1 (Symmetry). Let $\mathcal{M}$ be a $(\mu,\varepsilon)$-WDP algorithm. For any $\mu\geq 1$ and $\varepsilon\geq 0$, the following equation holds

$$W_{\mu}\left(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})\right)=W_{\mu}\left(Pr_{\mathcal{M}}(D^{\prime}),Pr_{\mathcal{M}}(D)\right)\leq\varepsilon.$$

Proof. Considering the definition of $(\mu,\varepsilon)$-WDP, we have

$$W_{\mu}\left(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})\right)=\left(\inf_{\gamma\in\Gamma\left(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})\right)}\int_{\mathcal{X}\times\mathcal{Y}}\rho(x,y)^{\mu}\,d\gamma(x,y)\right)^{\frac{1}{\mu}}\leq\varepsilon.$$

The symmetry of Wasserstein differential privacy follows from the fact that the set of joint distributions satisfies $\Gamma(Pr_{\mathcal{M}}(D^{\prime}),Pr_{\mathcal{M}}(D))=\Gamma(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))$.

Next, we prove that Kantorovich differential privacy also satisfies symmetry. Considering the definition of Kantorovich differential privacy, we have

$$K\left(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})\right)=\sup_{\|\varphi\|_{L}\leq 1}\mathbb{E}_{x\sim Pr_{\mathcal{M}}(D)}[\varphi(x)]-\mathbb{E}_{x\sim Pr_{\mathcal{M}}(D^{\prime})}[\varphi(x)] \quad (17)$$

and

$$K\left(Pr_{\mathcal{M}}(D^{\prime}),Pr_{\mathcal{M}}(D)\right)=\sup_{\|\varphi\|_{L}\leq 1}\mathbb{E}_{x\sim Pr_{\mathcal{M}}(D^{\prime})}[\varphi(x)]-\mathbb{E}_{x\sim Pr_{\mathcal{M}}(D)}[\varphi(x)]. \quad (18)$$

If we set $\psi(x)=-\varphi(x)$, then the above formula can be written as

$$\begin{aligned} K\left(Pr_{\mathcal{M}}(D^{\prime}),Pr_{\mathcal{M}}(D)\right) &=\sup_{\|\psi\|_{L}\leq 1}\mathbb{E}_{x\sim Pr_{\mathcal{M}}(D^{\prime})}[-\psi(x)]-\mathbb{E}_{x\sim Pr_{\mathcal{M}}(D)}[-\psi(x)] \\ &=\sup_{\|\psi\|_{L}\leq 1}\mathbb{E}_{x\sim Pr_{\mathcal{M}}(D)}[\psi(x)]-\mathbb{E}_{x\sim Pr_{\mathcal{M}}(D^{\prime})}[\psi(x)] \\ &=K\left(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})\right). \end{aligned} \quad (19)$$

Proof of Proposition 2

Proposition 2 (Triangle Inequality). Let $D_{1},D_{2},D_{3}\in\mathcal{D}$ be three arbitrary datasets. Suppose there are fewer different data entries between $D_{1}$ and $D_{2}$ than between $D_{1}$ and $D_{3}$, and the differences between $D_{1}$ and $D_{2}$ are included in the differences between $D_{1}$ and $D_{3}$. For any randomized algorithm $\mathcal{M}$ that satisfies $(\mu,\varepsilon)$-WDP with $\mu\geq 1$, we have

$$W_{\mu}(Pr_{\mathcal{M}}(D_{1}),Pr_{\mathcal{M}}(D_{3}))\leq W_{\mu}(Pr_{\mathcal{M}}(D_{1}),Pr_{\mathcal{M}}(D_{2}))+W_{\mu}(Pr_{\mathcal{M}}(D_{2}),Pr_{\mathcal{M}}(D_{3})). \quad (20)$$

Proof. The triangle inequality has been proved by Proposition 2.1 in Clement and Desch (2008). Here we provide a simpler proof from another perspective.

Firstly, we introduce another mathematical form of the Wasserstein distance (see Definition 6.1 in Rüschendorf (2009) or Equation 1 in Panaretos and Zemel (2019)):

$$W_{\mu}(P,Q)=\inf_{X\sim P,\,Y\sim Q}\left[\mathbb{E}\,\rho(X,Y)^{\mu}\right]^{\frac{1}{\mu}},\quad\mu\geq 1. \quad (21)$$

where $X$ and $Y$ are random vectors, and the infimum is taken over all possible pairs of $X$ and $Y$ that are marginally distributed as $P$ and $Q$.

Let $X_{1},X_{2},X_{3}$ be three random variables following the distributions $Pr_{\mathcal{M}}(D_{1}),Pr_{\mathcal{M}}(D_{2}),Pr_{\mathcal{M}}(D_{3})$ respectively. Then

$$\begin{aligned} W_{\mu}(Pr_{\mathcal{M}}(D_{1}),Pr_{\mathcal{M}}(D_{3})) &=\inf_{X_{1}\sim Pr_{\mathcal{M}}(D_{1}),\,X_{3}\sim Pr_{\mathcal{M}}(D_{3})}\left[\mathbb{E}\,\rho(X_{1},X_{3})^{\mu}\right]^{\frac{1}{\mu}} &(22)\\ &\leq\inf_{X_{1}\sim Pr_{\mathcal{M}}(D_{1}),\,X_{2}\sim Pr_{\mathcal{M}}(D_{2})}\left[\mathbb{E}\,\rho(X_{1},X_{2})^{\mu}\right]^{\frac{1}{\mu}}+\inf_{X_{2}\sim Pr_{\mathcal{M}}(D_{2}),\,X_{3}\sim Pr_{\mathcal{M}}(D_{3})}\left[\mathbb{E}\,\rho(X_{2},X_{3})^{\mu}\right]^{\frac{1}{\mu}} &(23)\\ &=W_{\mu}(Pr_{\mathcal{M}}(D_{1}),Pr_{\mathcal{M}}(D_{2}))+W_{\mu}(Pr_{\mathcal{M}}(D_{2}),Pr_{\mathcal{M}}(D_{3})). &(24)\end{aligned}$$

Here Equation 23 is established by applying Minkowski's inequality, $\|X_{1}+X_{2}\|_{r}\leq\|X_{1}\|_{r}+\|X_{2}\|_{r}$ with $1<r<\infty$.
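The inequality can also be checked numerically: for one-dimensional empirical distributions with equally many samples, the optimal coupling in Equation 21 matches sorted samples, so $W_{\mu}$ is computable directly. A sketch (ours; the three Gaussians are arbitrary):

```python
import numpy as np

def w_mu(mu, x, y):
    # 1-D empirical mu-Wasserstein via sorted samples (cf. Equation 21)
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** mu) ** (1.0 / mu)

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 50_000)
x2 = rng.normal(0.3, 1.2, 50_000)
x3 = rng.normal(1.0, 0.8, 50_000)

lhs = w_mu(2, x1, x3)
rhs = w_mu(2, x1, x2) + w_mu(2, x2, x3)
print(lhs <= rhs, lhs, rhs)  # True: the triangle inequality holds
```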

Proof of Proposition 3

Proposition 3 (Non-Negativity). For $\mu\geq 1$ and any randomized algorithm $\mathcal{M}$, we have $W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\geq 0$.

Proof. We can be sure that the integrand $\rho(x,y)\geq 0$, since it is a cost function in the sense of optimal transport (Rüschendorf 2009) and a norm in the statistical sense (Panaretos and Zemel 2019). $\gamma(x,y)$ is a probability measure, so $\gamma(x,y)>0$ holds. Then according to the definition of WDP, the integral satisfies

$$\left(\inf_{\gamma\in\Gamma\left(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})\right)}\int_{\mathcal{X}\times\mathcal{Y}}\rho(x,y)^{\mu}\,d\gamma(x,y)\right)^{\frac{1}{\mu}}\geq 0.$$

Proof of Proposition 4

Proposition 4 (Monotonicity). For 1μ1μ21\leq\mu_{1}\leq\mu_{2}, we have Wμ1(Pr(D),Pr(D))Wμ2(Pr(D),Pr(D))W_{\mu_{1}}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\leq W_{\mu_{2}}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})), or equivalently, (μ2,ε)(\mu_{2},\varepsilon)-WDP implies (μ1,ε)(\mu_{1},\varepsilon)-WDP.

Proof. Consider the expectation form of Wasserstein differential privacy (see Equation 21), and apply Lyapunov’s inequality as follows

[𝔼||μ1]1μ1[𝔼||μ2]1μ2,1μ1μ2\left[\mathbb{E}|\cdot|^{\mu_{1}}\right]^{\frac{1}{\mu_{1}}}\leq[\mathbb{E}|\cdot|^{\mu_{2}}]^{\frac{1}{\mu_{2}}},1\leq\mu_{1}\leq\mu_{2} (25)

we obtain that

Wμ1(Pr(D),Pr(D))\displaystyle W_{\mu_{1}}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})) =infX(D)Y(D)[𝔼ρ(X,Y)μ1]1μ1\displaystyle=\inf_{X\sim\mathcal{M}(D)\atop Y\sim\mathcal{M}(D^{\prime})}\left[\mathbb{E}\;\rho\left(X,Y\right)^{\mu_{1}}\right]^{\frac{1}{\mu_{1}}} (26)
infX(D)Y(D)[𝔼ρ(X,Y)μ2]1μ2\displaystyle\leq\inf_{X\sim\mathcal{M}(D)\atop Y\sim\mathcal{M}(D^{\prime})}\left[\mathbb{E}\;\rho\left(X,Y\right)^{\mu_{2}}\right]^{\frac{1}{\mu_{2}}}
=Wμ2(Pr(D),Pr(D)).\displaystyle=W_{\mu_{2}}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})).
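Monotonicity can likewise be checked empirically: in one dimension the sorted pairing is optimal for every order μ1\mu\geq 1, so Lyapunov’s inequality transfers directly to the empirical distances. The following minimal sketch is our own, with Gaussian samples standing in for Pr(D)Pr_{\mathcal{M}}(D) and Pr(D)Pr_{\mathcal{M}}(D^{\prime}).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.normal(0.0, 1.0, 10_000))  # surrogate for Pr_M(D)
y = np.sort(rng.normal(1.0, 1.0, 10_000))  # surrogate for Pr_M(D')

def w_mu(mu):
    # The sorted pairing is the optimal 1-D coupling for every mu >= 1,
    # so Lyapunov's inequality transfers to the empirical distances.
    return float(np.mean(np.abs(x - y) ** mu) ** (1.0 / mu))

orders = [1.0, 1.5, 2.0, 4.0]
values = [w_mu(m) for m in orders]
assert all(a <= b + 1e-9 for a, b in zip(values, values[1:]))  # Proposition 4
```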

Proof of Proposition 5

Proposition 5 (Parallel Composition). Suppose a dataset DD is divided into nn disjoint parts denoted as Di,i=1,2,,nD_{i},i=1,2,\cdots,n, and each randomized algorithm i\mathcal{M}_{i} is performed on its own part DiD_{i}. If i:𝒟i\mathcal{M}_{i}:\mathcal{D}\rightarrow\mathcal{R}_{i} satisfies (μ,εi)\left(\mu,\varepsilon_{i}\right)-WDP for i=1,2,,ni=1,2,\cdots,n, then the set of randomized algorithms ={1,2,,n}\mathcal{M}=\{\mathcal{M}_{1},\mathcal{M}_{2},\cdots,\mathcal{M}_{n}\} satisfies (μ\mu, max{ε1,ε2,,εn}\max\{\varepsilon_{1},\varepsilon_{2},\cdots,\varepsilon_{n}\})-WDP.

Proof. From the definition of WDP, we obtain that

Wμ(Pr(D),Pr(D))\displaystyle W_{\mu}\left(Pr_{\mathcal{M}}\left(D^{\prime}\right),Pr_{\mathcal{M}}\left(D\right)\right) =(infγΓ(Pr(D),Pr(D))𝒳×𝒴ρ(x,y)μ𝑑γ(x,y))1μ\displaystyle=\left(\inf_{\gamma\in\Gamma\left(Pr_{\mathcal{M}}\left(D\right),Pr_{\mathcal{M}}\left(D^{\prime}\right)\right)}\int_{\mathcal{X}\times\mathcal{Y}}{{\rho\left(x,y\right)}^{\mu}d\gamma\left(x,y\right)}\right)^{\frac{1}{\mu}} (27)
\displaystyle\leq\max\left\{\left(\inf_{\gamma\in\Gamma\left(Pr_{\mathcal{M}_{i}}\left(D_{i}\right),Pr_{\mathcal{M}_{i}}\left(D_{i}^{\prime}\right)\right)}\int_{\mathcal{X}\times\mathcal{Y}}{{\rho\left(x,y\right)}^{\mu}d\gamma\left(x,y\right)}\right)^{\frac{1}{\mu}},\;\forall\mathcal{M}_{i}\in\mathcal{M},\,D_{i}\subseteq D\right\} (28)
\displaystyle\leq max{ε1,ε2,,εn}.\displaystyle\max\{\varepsilon_{1},\varepsilon_{2},\cdots,\varepsilon_{n}\}. (29)

Inequality 28 holds for the following reasons. (1) The privacy budget in the WDP framework is an upper bound on the privacy loss, i.e., on the distance. (2) The algorithm in \mathcal{M} that yields the maximum privacy budget is some particular i\mathcal{M}_{i}, because the differing data entry falls into exactly one partition, so the same mechanism governs both Wμ(Pr(D),Pr(D))W_{\mu}\left(Pr_{\mathcal{M}}\left(D^{\prime}\right),Pr_{\mathcal{M}}\left(D\right)\right) and W_{\mu}(Pr_{\mathcal{M}_{i}}(D_{i}),Pr_{\mathcal{M}_{i}}(D_{i}^{\prime})). (3) There is only one differing element between DD and DD^{\prime}, and likewise between DiD_{i} and DiD^{\prime}_{i}; from the perspective of the entire distribution, a single differing entry is more pronounced when the data volume is smaller. Since a query algorithm in differential privacy must hide individual differences, and a larger amount of data helps to hide them, the budget on a sub-dataset upper-bounds the budget on the full dataset.

Proof of Proposition 6

Proposition 6 (Sequential Composition). Consider a series of randomized algorithms ={1,,i,,n}\mathcal{M}=\{\mathcal{M}_{1},\cdots,\mathcal{M}_{i},\cdots,\mathcal{M}_{n}\} performed on a dataset sequentially. If each i:𝒟i\mathcal{M}_{i}:\mathcal{D}\rightarrow\mathcal{R}_{i} satisfies (μ,εi)\left(\mu,\varepsilon_{i}\right)-WDP, then \mathcal{M} satisfies (μ,i=1nεi)(\mu,\sum_{i=1}^{n}\varepsilon_{i})-WDP.

Proof. Consider the mathematical form of (μ,εi)\left(\mu,\varepsilon_{i}\right)-WDP for each \mathcal{M}_{i}

{Wμ(Pr1(D),Pr1(D))ε1,Wμ(Pr2(D),Pr2(D))ε2,Wμ(Prn(D),Prn(D))εn\displaystyle\begin{cases}W_{\mu}(Pr_{\mathcal{M}_{1}}(D),Pr_{\mathcal{M}_{1}}(D^{\prime}))\leq\varepsilon_{1},\\ W_{\mu}(Pr_{\mathcal{M}_{2}}(D),Pr_{\mathcal{M}_{2}}(D^{\prime}))\leq\varepsilon_{2},\\ \qquad\cdots\\ W_{\mu}(Pr_{\mathcal{M}_{n}}(D),Pr_{\mathcal{M}_{n}}(D^{\prime}))\leq\varepsilon_{n}\end{cases} (30)

According to the basic properties of inequalities, we can obtain the upper bound of the sum of Wasserstein distances

i=1nWμ(Pri(D),Pri(D))i=1nεi.\displaystyle\sum_{i=1}^{n}W_{\mu}(Pr_{\mathcal{M}_{i}}(D),Pr_{\mathcal{M}_{i}}(D^{\prime}))\leq\sum_{i=1}^{n}\varepsilon_{i}. (31)

According to the triangle inequality of Wasserstein distance (see Proposition 2), we have

i=1nWμ(Pri(D),Pri(D))Wμ(Pr(D),Pr(D)).\displaystyle\sum_{i=1}^{n}W_{\mu}(Pr_{\mathcal{M}_{i}}(D),Pr_{\mathcal{M}_{i}}(D^{\prime}))\geq W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})). (32)

Thus, we obtain that Wμ(Pr(D),Pr(D))i=1nεiW_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\leq\sum_{i=1}^{n}\varepsilon_{i}.
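Taken together, Propositions 5 and 6 reduce composition accounting to a max and a sum over per-partition or per-step budgets. The minimal sketch below illustrates this bookkeeping; the function names are ours and not from the released code.

```python
def parallel_budget(epsilons):
    # Proposition 5: disjoint partitions compose to the worst case.
    return max(epsilons)

def sequential_budget(epsilons):
    # Proposition 6: adaptive sequential runs compose additively.
    return sum(epsilons)

eps = [0.1, 0.3, 0.2]
print(parallel_budget(eps))    # 0.3
print(sequential_budget(eps))  # 0.6 (up to floating-point rounding)
```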

Proof of Proposition 7

Proposition 7 (Laplace Mechanism). If an algorithm f:𝒟f:\mathcal{D}\rightarrow\mathcal{R} has sensitivity Δpf\Delta_{p}f and the order μ1\mu\geq 1, then the Laplace mechanism L=f(x)+Lap(0,λ)\mathcal{M}_{L}=f\left(x\right)+Lap\left(0,\lambda\right) preserves (μ,12Δpf(2[1/λ+exp(1/λ)1])1μ)\left(\mu,\frac{1}{2}\Delta_{p}f\left(\sqrt{2\left[1/\lambda+exp(-1/\lambda)-1\right]}\right)^{\frac{1}{\mu}}\right)-Wasserstein differential privacy.

Proof. Considering the Wasserstein distance between two Laplace distributions, we have

Wμ(Lap(0,λ),Lap(Δpf,λ))\displaystyle W_{\mu}\left(Lap\left(0,\lambda\right),Lap\left(\Delta_{p}f,\lambda\right)\right) =(infγΓ(Lap(0,λ),Lap(Δpf,λ))𝒳×𝒴ρ(x,y)μ𝑑γ(x,y))1μ\displaystyle=\left(\inf_{\gamma\in\Gamma\left(Lap\left(0,\lambda\right),Lap\left(\Delta_{p}f,\lambda\right)\right)}\int_{\mathcal{X}\times\mathcal{Y}}{{\rho\left(x,y\right)}^{\mu}d\gamma\left(x,y\right)}\right)^{\frac{1}{\mu}} (33)
(infγΓ(Lap(0,λ),Lap(Δpf,λ))𝒳×𝒴Δpfμ𝑑γ(x,y))1μ\displaystyle\leq\left(\inf_{\gamma\in\Gamma\left(Lap\left(0,\lambda\right),Lap\left(\Delta_{p}f,\lambda\right)\right)}\int_{\mathcal{X}\times\mathcal{Y}}{{\Delta_{p}f}^{\mu}d\gamma\left(x,y\right)}\right)^{\frac{1}{\mu}} (34)
=Δpf(infγΓ(Lap(0,λ),Lap(Δpf,λ))1𝑑γ(x,y))1μ\displaystyle=\Delta_{p}f\left(\inf_{\gamma\in\Gamma\left(Lap\left(0,\lambda\right),Lap\left(\Delta_{p}f,\lambda\right)\right)}\int 1d\gamma\left(x,y\right)\right)^{\frac{1}{\mu}} (35)
=ΔpfinfXLap(0,λ)YLap(Δpf,λ)[𝔼 1XY]1μ\displaystyle=\Delta_{p}f\inf_{X\sim{Lap}\left(0,\lambda\right)\atop Y\sim{Lap}\left(\Delta_{p}f,\lambda\right)}\left[\mathbb{E}\;1_{X\not=Y}\right]^{\frac{1}{\mu}} (36)
=12Δpf(Lap(0,λ)Lap(Δpf,λ)TV)1μ\displaystyle=\frac{1}{2}\Delta_{p}f\left(\|{Lap}\left(0,\lambda\right)-Lap(\Delta_{p}f,\lambda)\|_{TV}\right)^{\frac{1}{\mu}} (37)
12Δpf(2DKL(Lap(0,λ)Lap(Δpf,λ)))1μ.\displaystyle\leq\frac{1}{2}\Delta_{p}f\left(\sqrt{2D_{KL}({Lap}\left(0,\lambda\right)\|{Lap}\left(\Delta_{p}f,\lambda\right))}\right)^{\frac{1}{\mu}}. (38)

Here Δpf\Delta_{p}f is the lpl_{p}-sensitivity between two datasets (see Definition 8), and pp is its order, which can be set to any positive integer as needed. XX and YY are random variables following Laplace distributions (see Equation 36). In addition, TV\|\cdot\|_{TV} represents the total variation. DKL(PQ)D_{KL}(P\|Q) represents the Kullback–Leibler (KL) divergence between PP and QQ, which is also equal to the Rényi divergence of order one, D1(PQ)D_{1}(P\|Q) (see Theorem 5 in Erven and Harremoës (2014) or Definition 3 in Mironov (2017)).

We can obtain Equation 37 from Equation 36 because of the probabilistic interpretation of total variation when ρ(x,y)=1\rho(x,y)=1, which is given on page 10 of Rüschendorf (2009). Equation 38 can be established because of Pinsker’s inequality (see Section I in Fedotov, Harremoës, and Topsøe (2003))

DKL(PQ)12PQTV2.D_{KL}(P\|Q)\geq\frac{1}{2}\|P-Q\|_{TV}^{2}. (39)

Pinsker’s inequality establishes a relation between the KL divergence and the total variation, where PP and QQ represent the distributions of two random variables.

To obtain the final result, we apply the result for the Laplace mechanism under Rényi DP of order one (see Table II in Mironov (2017)) as follows

D1(Lap(0,λ)Lap(1,λ))=1/λ+exp(1/λ)1.D_{1}(Lap(0,\lambda)\|Lap(1,\lambda))=1/\lambda+exp(-1/\lambda)-1. (40)

Then we obtain the outcome of the Laplace mechanism under Wasserstein DP as follows

\displaystyle W_{\mu}\left(Lap\left(0,\lambda\right),Lap\left(\Delta_{p}f,\lambda\right)\right)\leq\frac{1}{2}\Delta_{p}f\left(\sqrt{2\left[1/\lambda+exp(-1/\lambda)-1\right]}\right)^{\frac{1}{\mu}}. (41)
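A minimal calculator for this budget might look as follows; the function name wdp_budget_laplace is our own, and the formula is exactly Proposition 7 with the order-one Rényi divergence taken from Equation 40.

```python
import math

def wdp_budget_laplace(sensitivity, lam, mu):
    """Proposition 7: WDP budget of the Laplace mechanism Lap(0, lam)."""
    kl = 1.0 / lam + math.exp(-1.0 / lam) - 1.0  # Equation 40 (sensitivity 1)
    return 0.5 * sensitivity * math.sqrt(2.0 * kl) ** (1.0 / mu)

print(wdp_budget_laplace(sensitivity=1.0, lam=1.0, mu=2.0))  # about 0.46
```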

Proof of Proposition 8

Proposition 8 (Gaussian Mechanism). If an algorithm f:𝒟f:\mathcal{D}\rightarrow\mathcal{R} has sensitivity Δpf\Delta_{p}f and the order μ1\mu\geq 1, then Gaussian mechanism G=f(x)+𝒩(0,σ2)\mathcal{M}_{G}=f\left(x\right)+\mathcal{N}\left(0,\sigma^{2}\right) preserves (μ,12(Δpf/σ)1μ)\left(\mu,\frac{1}{2}\left({\Delta_{p}f}/{\sigma}\right)^{\frac{1}{\mu}}\right)-Wasserstein differential privacy.

Proof. By directly calculating the Wasserstein distance between Gaussian distributions, we have

Wμ(𝒩(0,σ2),𝒩(Δpf,σ2))\displaystyle W_{\mu}\left(\mathcal{N}\left(0,\sigma^{2}\right),\mathcal{N}\left(\Delta_{p}f,\sigma^{2}\right)\right) =(infγΓ(𝒩(0,σ2),𝒩(Δpf,σ2))𝒳×𝒴ρ(x,y)μ𝑑γ(x,y))1μ\displaystyle=\left(\inf_{\gamma\in\Gamma\left(\mathcal{N}\left(0,\sigma^{2}\right),\mathcal{N}\left(\Delta_{p}f,\sigma^{2}\right)\right)}\int_{\mathcal{X}\times\mathcal{Y}}{{\rho\left(x,y\right)}^{\mu}d\gamma\left(x,y\right)}\right)^{\frac{1}{\mu}} (42)
(infγΓ(𝒩(0,σ2),𝒩(Δpf,σ2))𝒳×𝒴Δpfμ𝑑γ(x,y))1μ\displaystyle\leq\left(\inf_{\gamma\in\Gamma\left(\mathcal{N}\left(0,\sigma^{2}\right),\mathcal{N}\left(\Delta_{p}f,\sigma^{2}\right)\right)}\int_{\mathcal{X}\times\mathcal{Y}}{{\Delta_{p}f}^{\mu}d\gamma\left(x,y\right)}\right)^{\frac{1}{\mu}} (43)
=Δpf(infγΓ(𝒩(0,σ2),𝒩(Δpf,σ2))1𝑑γ(x,y))1μ\displaystyle=\Delta_{p}f\left(\inf_{\gamma\in\Gamma\left(\mathcal{N}\left(0,\sigma^{2}\right),\mathcal{N}\left(\Delta_{p}f,\sigma^{2}\right)\right)}\int 1d\gamma\left(x,y\right)\right)^{\frac{1}{\mu}} (44)
=ΔpfinfX𝒩(0,σ2)Y𝒩(Δpf,σ2)[𝔼 1XY]1μ\displaystyle=\Delta_{p}f\inf_{X\sim{\mathcal{N}}\left(0,\sigma^{2}\right)\atop Y\sim{\mathcal{N}}\left(\Delta_{p}f,\sigma^{2}\right)}\left[\mathbb{E}\;1_{X\not=Y}\right]^{\frac{1}{\mu}} (45)
=12Δpf(𝒩(0,σ2)𝒩(Δpf,σ2)TV)1μ\displaystyle=\frac{1}{2}\Delta_{p}f\left(\|\mathcal{N}(0,\sigma^{2})-\mathcal{N}(\Delta_{p}f,\sigma^{2})\|_{TV}\right)^{\frac{1}{\mu}} (46)
12Δpf(2DKL(𝒩(0,σ2)𝒩(Δpf,σ2)))1μ.\displaystyle\leq\frac{1}{2}\Delta_{p}f\left(\sqrt{2D_{KL}\left(\mathcal{N}(0,\sigma^{2})\|\mathcal{N}(\Delta_{p}f,\sigma^{2})\right)}\right)^{\frac{1}{\mu}}. (47)

Here Δpf\Delta_{p}f is the lpl_{p}-sensitivity between two datasets (see Definition 8). XX and YY are random variables following Gaussian distributions. TV\|\cdot\|_{TV} represents the total variation. DKL(PQ)D_{KL}(P\|Q) represents the KL divergence between PP and QQ, which is also equal to the Rényi divergence of order one, D1(PQ)D_{1}(P\|Q) (see Theorem 5 in Erven and Harremoës (2014) or Definition 3 in Mironov (2017)).

We can obtain Equation 46 from Equation 45 because of the probabilistic interpretation of total variation when ρ(x,y)=1\rho(x,y)=1 (see page 10 in Rüschendorf (2009)). Equation 47 can be established because of Pinsker’s inequality (see Section I in Fedotov, Harremoës, and Topsøe (2003))

DKL(PQ)12PQTV2.D_{KL}(P\|Q)\geq\frac{1}{2}\|P-Q\|_{TV}^{2}. (48)

Pinsker’s inequality establishes a relation between the KL divergence and the total variation, where PP and QQ represent the distributions of two random variables.

To obtain the final result, we apply the result for the Gaussian mechanism under Rényi DP of order one (see Proposition 7 and Table II in Mironov (2017)) as follows

D_{1}(\mathcal{N}(0,\sigma^{2})\|\mathcal{N}(\Delta_{p}f,\sigma^{2}))=\frac{(\Delta_{p}f)^{2}}{2\sigma^{2}}. (49)

Then we obtain the outcome of the Gaussian mechanism under Wasserstein DP as follows

\displaystyle W_{\mu}\left(\mathcal{N}\left(0,\sigma^{2}\right),\mathcal{N}\left(\Delta_{p}f,\sigma^{2}\right)\right)\leq\frac{1}{2}\left(\sqrt{2\frac{(\Delta_{p}f)^{2}}{2\sigma^{2}}}\right)^{\frac{1}{\mu}}=\frac{1}{2}\left(\frac{\Delta_{p}f}{\sigma}\right)^{\frac{1}{\mu}}. (50)

Thus we have proved that if algorithm ff has sensitivity Δpf\Delta_{p}f, then the Gaussian mechanism G\mathcal{M}_{G} satisfies (μ,12(Δpf/σ)1μ)\left(\mu,\frac{1}{2}\left({\Delta_{p}f}/{\sigma}\right)^{\frac{1}{\mu}}\right)-WDP.
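Analogously to the Laplace case, the Gaussian WDP budget of Proposition 8 can be evaluated directly. The sketch below is ours; note that for \Delta_{p}f/\sigma<1 the inner factor approaches 1 as μ\mu grows, so the reported budget increases with the order.

```python
def wdp_budget_gaussian(sensitivity, sigma, mu):
    # Proposition 8: epsilon = 0.5 * (Delta_p f / sigma) ** (1 / mu).
    return 0.5 * (sensitivity / sigma) ** (1.0 / mu)

print(wdp_budget_gaussian(sensitivity=1.0, sigma=2.0, mu=1.0))  # 0.25
print(wdp_budget_gaussian(sensitivity=1.0, sigma=2.0, mu=4.0))  # about 0.42
```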

Proof of Proposition 9

Proposition 9 (From DP to WDP) If \mathcal{M} preserves ε\varepsilon-DP with sensitivity Δpf\Delta_{p}f, it also satisfies (μ,12Δpf(2ε(eε1))12μ)\left(\mu,\frac{1}{2}\Delta_{p}f\left({2\varepsilon\cdot(e^{\varepsilon}-1)}\right)^{\frac{1}{2\mu}}\right)-WDP.

Proof. Considering the definition of Wasserstein differential privacy and referring to Equations 33–38, we have

Wμ(Pr(D),Pr(D))12Δpf(2DKL(Pr(D)Pr(D)))1μ.\displaystyle W_{\mu}\left(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})\right)\leq\frac{1}{2}\Delta_{p}f\left(\sqrt{2D_{KL}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime}))}\right)^{\frac{1}{\mu}}. (51)

To deduce further, we apply Lemma 3.18 in Dwork and Roth (2014). It states that if two random variables XX, YY satisfy D(XY)εD_{\infty}(X\|Y)\leq\varepsilon and D_{\infty}(Y\|X)\leq\varepsilon, then we can obtain

D1(XY)ε(eε1).\displaystyle D_{1}(X\|Y)\leq\varepsilon\cdot(e^{\varepsilon}-1). (52)

It should be noted that the condition of ε\varepsilon-DP ensures that D(XY)εD_{\infty}(X\|Y)\leq\varepsilon and D_{\infty}(Y\|X)\leq\varepsilon can be established (see Remark 3.2 in Dwork and Roth (2014)). Based on Equations 51 and 52, we have

Wμ(Pr(D),Pr(D))12Δpf(2ε(eε1))1μ=12Δpf(2ε(eε1))12μ.\displaystyle W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\leq\frac{1}{2}\Delta_{p}f\left(\sqrt{2\varepsilon\cdot(e^{\varepsilon}-1)}\right)^{\frac{1}{\mu}}=\frac{1}{2}\Delta_{p}f\left({2\varepsilon\cdot(e^{\varepsilon}-1)}\right)^{\frac{1}{2\mu}}. (53)
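Proposition 9 gives a direct conversion from a pure-DP guarantee to a WDP budget; a one-line sketch with our own naming follows.

```python
import math

def dp_to_wdp(eps_dp, sensitivity, mu):
    # Proposition 9: eps-DP implies
    # (mu, 0.5 * Delta_p f * (2 * eps * (e**eps - 1)) ** (1 / (2 * mu)))-WDP.
    return 0.5 * sensitivity * (2.0 * eps_dp * (math.exp(eps_dp) - 1.0)) ** (1.0 / (2.0 * mu))

print(dp_to_wdp(eps_dp=1.0, sensitivity=1.0, mu=2.0))  # about 0.68
```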

Proof of Proposition 10

Proposition 10 (From RDP to WDP) If \mathcal{M} preserves (α,ε)(\alpha,\varepsilon)-RDP with sensitivity Δpf\Delta_{p}f, it also satisfies (μ,12Δpf(2ε)12μ)\left(\mu,\frac{1}{2}\Delta_{p}f\left(2\varepsilon\right)^{\frac{1}{2\mu}}\right)-WDP.

Proof. Considering the definition of Wasserstein differential privacy and referring to Equations 33–38, we have

Wμ(Pr(D),Pr(D))12Δpf(2DKL(Pr(D)Pr(D)))1μ.\displaystyle W_{\mu}\left(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})\right)\leq\frac{1}{2}\Delta_{p}f\left(\sqrt{2D_{KL}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime}))}\right)^{\frac{1}{\mu}}. (54)

Here DKL(Pr(D)Pr(D))D_{KL}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime})) represents the KL divergence between Pr(D)Pr_{\mathcal{M}}(D) and Pr(D)Pr_{\mathcal{M}}(D^{\prime}), which can also be written as the Rényi divergence of order one (see Theorem 5 in Erven and Harremoës (2014) or Definition 3 in Mironov (2017))

DKL(Pr(D)Pr(D))=D1(Pr(D),Pr(D)).\displaystyle D_{KL}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime}))=D_{1}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})). (55)

In addition, from the monotonicity property of the Rényi divergence, we have

Dμ1(Pr(D),Pr(D))Dμ2(Pr(D),Pr(D))\displaystyle D_{\mu_{1}}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\leq D_{\mu_{2}}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})) (56)

for 1μ1<μ21\leq\mu_{1}<\mu_{2} and arbitrary Pr(D)Pr_{\mathcal{M}}(D) and Pr(D)Pr_{\mathcal{M}}(D^{\prime}).

From the condition that \mathcal{M} preserves (α,ε)(\alpha,\varepsilon)-RDP, we have

Dα(Pr(D),Pr(D))ε,α1\displaystyle D_{\alpha}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\leq\varepsilon,\;\alpha\geq 1 (57)

Combining Equations 55, 56 and 57, we have

DKL(Pr(D)Pr(D))=D1(Pr(D),Pr(D))Dα(Pr(D),Pr(D))ε.\displaystyle D_{KL}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime}))=D_{1}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\leq D_{\alpha}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\leq\varepsilon. (58)

Combining Equations 54 and 58, we have

Wμ(Pr(D),Pr(D))12Δpf(2ε)1μ=12Δpf(2ε)12μ.\displaystyle W_{\mu}\left(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})\right)\leq\frac{1}{2}\Delta_{p}f\left(\sqrt{2\varepsilon}\right)^{\frac{1}{\mu}}=\frac{1}{2}\Delta_{p}f\left(2\varepsilon\right)^{\frac{1}{2\mu}}. (59)

Therefore, (α,ε)(\alpha,\varepsilon)-RDP implies (μ,12Δpf(2ε)12μ)\left(\mu,\frac{1}{2}\Delta_{p}f\left(2\varepsilon\right)^{\frac{1}{2\mu}}\right)-WDP.
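Proposition 10 likewise converts an RDP budget into a WDP budget, and the bound is independent of the RDP order α\alpha. A minimal sketch with our own naming:

```python
def rdp_to_wdp(eps_rdp, sensitivity, mu):
    # Proposition 10: (alpha, eps)-RDP implies
    # (mu, 0.5 * Delta_p f * (2 * eps) ** (1 / (2 * mu)))-WDP for any alpha >= 1.
    return 0.5 * sensitivity * (2.0 * eps_rdp) ** (1.0 / (2.0 * mu))

print(rdp_to_wdp(eps_rdp=0.5, sensitivity=1.0, mu=2.0))  # 0.5
```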

Proof of Proposition 11

Proposition 11 (From WDP to RDP) Suppose μ1\mu\geq 1 and log(p())\log(p_{\mathcal{M}}(\cdot)) is an LL-Lipschitz function. If \mathcal{M} preserves (μ,ε)(\mu,\varepsilon)-WDP with sensitivity Δpf\Delta_{p}f, it also satisfies (α,αα1Lεμ/(μ+1))\left(\alpha,\frac{\alpha}{\alpha-1}L\cdot\varepsilon^{\mu/(\mu+1)}\right)-RDP. Specifically, when α\alpha\rightarrow\infty, it satisfies (Lεμ/(μ+1))\left(L\cdot\varepsilon^{\mu/(\mu+1)}\right)-DP.

Proof. Considering the definition of an LL-Lipschitz function, we have

|logp(D)logp(D)|\displaystyle|\log p_{\mathcal{M}}(D)-\log p_{\mathcal{M}}(D^{\prime})| L|p(D)p(D)|\displaystyle\leq L|p_{\mathcal{M}}(D)-p_{\mathcal{M}}(D^{\prime})| (60)
|logp(D)p(D)|\displaystyle\Bigg{|}\log\frac{p_{\mathcal{M}}(D)}{p_{\mathcal{M}}(D^{\prime})}\Bigg{|} L|p(D)p(D)|\displaystyle\leq L|p_{\mathcal{M}}(D)-p_{\mathcal{M}}(D^{\prime})| (61)
L|p(D)p(D)|\displaystyle-L|p_{\mathcal{M}}(D)-p_{\mathcal{M}}(D^{\prime})| logp(D)p(D)L|p(D)p(D)|\displaystyle\leq\log\frac{p_{\mathcal{M}}(D)}{p_{\mathcal{M}}(D^{\prime})}\leq L|p_{\mathcal{M}}(D)-p_{\mathcal{M}}(D^{\prime})| (62)
eL|p(D)p(D)|\displaystyle e^{-L|p_{\mathcal{M}}(D)-p_{\mathcal{M}}(D^{\prime})|} p(D)p(D)eL|p(D)p(D)|.\displaystyle\leq\frac{p_{\mathcal{M}}(D)}{p_{\mathcal{M}}(D^{\prime})}\leq e^{L|p_{\mathcal{M}}(D)-p_{\mathcal{M}}(D^{\prime})|}. (63)

Considering the Rényi divergence with order α\alpha, we have

Dα(Pr(D)Pr(D))\displaystyle D_{\alpha}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime})) =1α1log𝔼Pr(D)[(p(D)p(D))α]\displaystyle=\frac{1}{\alpha-1}\log\mathbb{E}_{Pr_{\mathcal{M}}(D^{\prime})}\left[\left(\frac{p_{\mathcal{M}}(D)}{p_{\mathcal{M}}(D^{\prime})}\right)^{\alpha}\right] (64)
1α1log𝔼Pr(D)(eαL|p(D)p(D)|)\displaystyle\leq\frac{1}{\alpha-1}\log\mathbb{E}_{Pr_{\mathcal{M}}(D^{\prime})}\left(e^{\alpha L|p_{\mathcal{M}}(D)-p_{\mathcal{M}}(D^{\prime})|}\right) (65)
1α1log𝔼Pr(D)(eαLΔpf).\displaystyle\leq\frac{1}{\alpha-1}\log\mathbb{E}_{Pr_{\mathcal{M}}(D^{\prime})}\left(e^{\alpha L\Delta_{p}f}\right). (66)

According to the definition of sensitivity, we know that

{p(D)p(D)+Δpf,p(D)p(D)p(D)p(D)+Δpf,p(D)p(D).\displaystyle\begin{cases}p_{\mathcal{M}}(D)\leq p_{\mathcal{M}}(D^{\prime})+\Delta_{p}f,&p_{\mathcal{M}}(D)\geq p_{\mathcal{M}}(D^{\prime})\\ p_{\mathcal{M}}(D^{\prime})\leq p_{\mathcal{M}}(D)+\Delta_{p}f,&p_{\mathcal{M}}(D)\leq p_{\mathcal{M}}(D^{\prime})\end{cases}. (67)

From Theorem 2.7 in Bobkov and Ledoux (2019), we have

\displaystyle\Delta_{p}f\leq W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))^{\mu/(\mu+1)}. (68)

Combining Equations 66 and 68, we have

\displaystyle D_{\alpha}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime}))\leq\frac{1}{\alpha-1}\log\mathbb{E}_{Pr_{\mathcal{M}}(D^{\prime})}\left(e^{\alpha L[W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))]^{\mu/(\mu+1)}}\right) (69)
=1α1log𝔼Pr(D)(eαLεμ/(μ+1))\displaystyle=\frac{1}{\alpha-1}\log\mathbb{E}_{Pr_{\mathcal{M}}(D^{\prime})}\left(e^{\alpha L\varepsilon^{\mu/(\mu+1)}}\right) (70)
=1α1log(eαLεμ/(μ+1))\displaystyle=\frac{1}{\alpha-1}\log\left(e^{\alpha L\varepsilon^{\mu/(\mu+1)}}\right) (71)
=αα1Lεμ/(μ+1).\displaystyle=\frac{\alpha}{\alpha-1}L\varepsilon^{\mu/(\mu+1)}. (72)

By the same method, we can also prove that

Dα(Pr(D)Pr(D))αα1Lεμ/(μ+1).\displaystyle D_{\alpha}(Pr_{\mathcal{M}}(D^{\prime})\|Pr_{\mathcal{M}}(D))\leq\frac{\alpha}{\alpha-1}L\varepsilon^{\mu/(\mu+1)}. (73)

Next, we consider the special case where α\alpha\rightarrow\infty. From the definition of the max divergence, we have

D(Pr(D)Pr(D))=supPr(D)logp(D)p(D).\displaystyle D_{\infty}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime}))=\sup_{Pr_{\mathcal{M}}(D)}\log\frac{p_{\mathcal{M}}(D)}{p_{\mathcal{M}}(D^{\prime})}. (74)

Referring to Equation 63, we have

D(Pr(D)Pr(D))supPr(D)L|p(D)p(D)|=LΔpf.\displaystyle D_{\infty}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime}))\leq\sup_{Pr_{\mathcal{M}}(D)}L|p_{\mathcal{M}}(D)-p_{\mathcal{M}}(D^{\prime})|=L\Delta_{p}f. (75)

Referring to Equation 68, we know that

D(Pr(D)Pr(D))Lεμ/(μ+1).\displaystyle D_{\infty}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime}))\leq L\varepsilon^{\mu/(\mu+1)}. (76)

By the same method, we can also prove that

D(Pr(D)Pr(D))Lεμ/(μ+1).\displaystyle D_{\infty}(Pr_{\mathcal{M}}(D^{\prime})\|Pr_{\mathcal{M}}(D))\leq L\varepsilon^{\mu/(\mu+1)}. (77)
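Proposition 11’s bounds can be evaluated numerically once the Lipschitz constant LL of logp()\log(p_{\mathcal{M}}(\cdot)) is known or estimated. The two helpers below are our own sketch of the finite-α\alpha case (which requires α>1\alpha>1) and the α\alpha\rightarrow\infty limit.

```python
def wdp_to_rdp(eps_wdp, alpha, lipschitz, mu):
    # Proposition 11 (alpha > 1): (mu, eps)-WDP implies
    # (alpha, alpha / (alpha - 1) * L * eps ** (mu / (mu + 1)))-RDP,
    # provided log p_M(.) is L-Lipschitz.
    return alpha / (alpha - 1.0) * lipschitz * eps_wdp ** (mu / (mu + 1.0))

def wdp_to_pure_dp(eps_wdp, lipschitz, mu):
    # The alpha -> infinity limit gives (L * eps ** (mu / (mu + 1)))-DP.
    return lipschitz * eps_wdp ** (mu / (mu + 1.0))

print(wdp_to_rdp(eps_wdp=0.5, alpha=10.0, lipschitz=1.0, mu=2.0))
print(wdp_to_pure_dp(eps_wdp=0.5, lipschitz=1.0, mu=2.0))
```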

Proof of Proposition 12

Proposition 12 (Post-Processing). Let :𝒟\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R} be a (μ,ε)(\mu,\varepsilon)-Wasserstein differentially private algorithm, and 𝒢:\mathcal{G}:\mathcal{R}\rightarrow\mathcal{R}^{\prime} be an arbitrary randomized mapping. For any order μ[1,)\mu\in[1,\infty) and all measurable subsets S\subseteq\mathcal{R}^{\prime}, 𝒢()()\mathcal{G}(\mathcal{M})(\cdot) is also (μ,ε)(\mu,\varepsilon)-Wasserstein differentially private, namely

Wμ(Pr[𝒢((D))S],Pr[𝒢((D))S])ε.\displaystyle W_{\mu}\left(Pr[\mathcal{G}(\mathcal{M}(D))\in S],Pr[\mathcal{G}(\mathcal{M}(D^{\prime}))\in S]\right)\leq\varepsilon. (78)

Proof. Let T={x:𝒢(x)S}T=\{x\in\mathcal{R}:\mathcal{G}(x)\in S\}; then we have

\displaystyle W_{\mu}(Pr[\mathcal{G}(\mathcal{M}(D))\in S],Pr[\mathcal{G}(\mathcal{M}(D^{\prime}))\in S])=W_{\mu}\left(Pr[\mathcal{M}(D)\in T],Pr[\mathcal{M}(D^{\prime})\in T]\right) (79)
=Wμ(Pr(D),Pr(D))ε.\displaystyle=W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\leq\varepsilon. (80)

Proof of Proposition 13

Proposition 13 (Group Privacy). Let :𝒟\mathcal{M}:\mathcal{D}\mapsto\mathcal{R} be a (μ,ε)(\mu,\varepsilon)-Wasserstein differentially private algorithm. Then for any pair of datasets D,D𝒟D,D^{\prime}\in\mathcal{D} differing in kk data entries x1,,xkx_{1}^{\prime},\cdots,x_{k}^{\prime}, \mathcal{M} is (μ,kε)(\mu,k\varepsilon)-Wasserstein differentially private.

Proof. We decompose the group privacy problem and denote D,D1D,D_{1}^{\prime} as a pair of adjacent datasets differing only in x1x_{1}^{\prime}. Similarly, we denote D1D_{1}^{\prime} and D2D_{2}^{\prime}, D2D_{2}^{\prime} and D3D_{3}^{\prime}, \cdots, Dk1D_{k-1}^{\prime} and DD^{\prime} as the other k1k-1 pairs of adjacent datasets differing only in x2,x3,,xkx_{2}^{\prime},x_{3}^{\prime},\cdots,x_{k}^{\prime} respectively.

Recall that WDP satisfies the triangle inequality (Proposition 2); since each of the kk terms on the right-hand side is bounded by ε\varepsilon, we have

Wμ(Pr(D),Pr(D))Wμ(Pr(D),Pr(D1))\displaystyle W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\leq W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D_{1}^{\prime})) +Wμ(Pr(D1),Pr(D2))+\displaystyle+W_{\mu}(Pr_{\mathcal{M}}(D_{1}^{\prime}),Pr_{\mathcal{M}}(D_{2}^{\prime}))+\cdots (81)
+Wμ(Pr(Dk2),Pr(Dk1))\displaystyle+W_{\mu}(Pr_{\mathcal{M}}(D_{k-2}^{\prime}),Pr_{\mathcal{M}}(D_{k-1}^{\prime}))
+W_{\mu}(Pr_{\mathcal{M}}(D_{k-1}^{\prime}),Pr_{\mathcal{M}}(D^{\prime}))\leq k\varepsilon.

Proof of Theorem 1

Theorem 1 (Advanced Composition) Suppose a randomized algorithm \mathcal{M} consists of a sequence of (μ,ε)(\mu,\varepsilon)-WDP algorithms 1,2,T\mathcal{M}_{1},\mathcal{M}_{2}\cdots,\mathcal{M}_{T}, which are performed on dataset DD adaptively and satisfy t:𝒟t\mathcal{M}_{t}:\mathcal{D}\rightarrow\mathcal{R}_{t}, t{1,2,,T}t\in\{1,2,\cdots,T\}. \mathcal{M} is generalized (μ,ε)(\mu,\varepsilon)-Wasserstein differentially private with ε>0\varepsilon>0 and μ1\mu\geq 1 if for any two adjacent datasets D,D𝒟D,D^{\prime}\in\mathcal{D} the following holds

exp[βt=1T𝔼(Wμ(Prt(D),Prt(D)))βε]δ.{\exp\left[\beta\sum_{t=1}^{T}\mathbb{E}(W_{\mu}(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})))-\beta\varepsilon\right]}\leq\delta. (82)

Here β\beta is a tunable parameter satisfying β>0\beta>0.

Proof. From the definition of generalized (μ,ε)(\mu,\varepsilon)-WDP, we have

Pr[Wμ(Pr(D),Pr(D))ε]\displaystyle Pr\left[W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime}))\geq\varepsilon\right] Pr[βt=1TWμ(Prt(D),Prt(D))βε]\displaystyle\leq Pr\left[\beta\sum_{t=1}^{T}W_{\mu}(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime}))\geq\beta\varepsilon\right] (83)
𝔼[exp(βt=1TWμ(Prt(D),Prt(D)))]exp(βε)\displaystyle\leq\frac{\mathbb{E}\left[\exp(\beta\sum_{t=1}^{T}W_{\mu}(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})))\right]}{\exp\left(\beta\varepsilon\right)} (84)
exp[β𝔼t=1T(Wμ(Prt(D),Prt(D)))]exp(βε)\displaystyle\leq\frac{\exp\left[\beta\mathbb{E}\sum_{t=1}^{T}(W_{\mu}(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})))\right]}{\exp\left(\beta\varepsilon\right)} (85)
\displaystyle=\frac{\exp\left[\beta\sum_{t=1}^{T}\mathbb{E}(W_{\mu}(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})))\right]}{\exp\left(\beta\varepsilon\right)} (86)
=exp[βt=1T𝔼(Wμ(Prt(D),Prt(D)))βε].\displaystyle={\exp\left[\beta\sum_{t=1}^{T}\mathbb{E}(W_{\mu}(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})))-\beta\varepsilon\right]}. (87)

Here Inequality 83 holds because the triangle inequality (see Proposition 2) ensures that

t=1TWμ(Prt(D),Prt(D))Wμ(Pr(D),Pr(D)).\displaystyle\sum_{t=1}^{T}W_{\mu}(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime}))\geq W_{\mu}(Pr_{\mathcal{M}}(D),Pr_{\mathcal{M}}(D^{\prime})).

Inequality 84 holds because of Markov’s inequality

Pr(||c)𝔼(φ(||))φ(c),c>0.Pr(|\cdot|\geq c)\leq\frac{\mathbb{E}(\varphi(|\cdot|))}{\varphi(c)},c>0. (88)

Here φ()\varphi(\cdot) can be any non-negative, monotonically increasing function. To simplify the computation of privacy budgets in WDP, we set φ()\varphi(\cdot) to exp()\exp({\cdot}). Inequality 85 holds because of Jensen’s inequality, and Equation 86 follows from the linearity of expectation. Thus, whenever the right-hand side of Equation 87 is at most δ\delta, we have Pr[Wμ((D),(D))ε]δPr\left[W_{\mu}(\mathcal{M}(D),\mathcal{M}(D^{\prime}))\geq\varepsilon\right]\leq\delta.

Proof of Theorem 2

Theorem 2 Suppose an algorithm \mathcal{M} consists of a sequence of private algorithms 1,2,T\mathcal{M}_{1},\mathcal{M}_{2}\cdots,\mathcal{M}_{T} protected by the Gaussian mechanism and satisfying t:𝒟\mathcal{M}_{t}:\mathcal{D}\rightarrow\mathcal{R}, t\in\left\{1,2,\cdots,T\right\}. If the subsampling probability, scale parameter and l2l_{2}-sensitivity of algorithm t\mathcal{M}_{t} are denoted by q[0,1]q\in[0,1], σ>0\sigma>0 and dt0d_{t}\geq 0 respectively, then the privacy loss under WDP at epoch tt is

Wμ(Prt(D),Prt(D))=infdt[i=1n𝔼(|Zti|μ)]1μ,\displaystyle W_{\mu}\left(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})\right)=\inf_{d_{t}}\left[\sum_{i=1}^{n}\mathbb{E}\left(|Z_{ti}|^{\mu}\right)\right]^{\frac{1}{\mu}}, (89)
Zt𝒩(qdt,(22q+2q2)σ2).\displaystyle Z_{t}\sim\mathcal{N}\left(qd_{t},(2-2q+2q^{2})\sigma^{2}\right).

Here Prt(D)Pr_{\mathcal{M}_{t}}(D) is the outcome distribution when performing \mathcal{M} on DD at epoch tt, and dt=gtgt2d_{t}=\|g_{t}-g_{t}^{\prime}\|_{2} represents the l2l_{2} norm between a pair of adjacent gradients gtg_{t} and gtg_{t}^{\prime}. In addition, ZtZ_{t} is a vector following a Gaussian distribution, and ZtiZ_{ti} represents the ii-th component of ZtZ_{t}.

Proof. With the Gaussian mechanism in a subsampling scenario, we have

Prt(D)=(1q)𝒩(0,σ2)+q𝒩(dt,σ2),\displaystyle Pr_{\mathcal{M}_{t}}(D)=(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(d_{t},\sigma^{2}),
Prt(D)=𝒩(0,σ2).\displaystyle Pr_{\mathcal{M}_{t}}(D^{\prime})=\mathcal{N}(0,\sigma^{2}).

To facilitate the later proof, we slightly simplify the expression of Pr_{\mathcal{M}_{t}}(D).

Prt(D)\displaystyle Pr_{\mathcal{M}_{t}}(D) =(1q)𝒩(0,σ2)+q𝒩(dt,σ2)\displaystyle=(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(d_{t},\sigma^{2}) (90)
=𝒩(0,(1q)2σ2)+𝒩(qdt,q2σ2)\displaystyle=\mathcal{N}\left(0,(1-q)^{2}\sigma^{2}\right)+\mathcal{N}\left(qd_{t},q^{2}\sigma^{2}\right) (91)
=𝒩(qdt,(12q+2q2)σ2).\displaystyle=\mathcal{N}\left(qd_{t},(1-2q+2q^{2})\sigma^{2}\right). (92)

Then we compute the privacy loss at epoch tt

\displaystyle W_{\mu}\left(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})\right)=\inf_{X_{t}\sim Pr_{\mathcal{M}_{t}}(D)\atop Y_{t}\sim Pr_{\mathcal{M}_{t}}(D^{\prime})}\left[\mathbb{E}\;\|X_{t}-Y_{t}\|^{\mu}\right]^{\frac{1}{\mu}}. (93)

Let Zt=XtYtZ_{t}=X_{t}-Y_{t}, where XtX_{t} and YtY_{t} are treated as independent so that their variances add; thus we have

\displaystyle Z_{t}\sim\mathcal{N}\left(qd_{t},(2-2q+2q^{2})\sigma^{2}\right). (94)

The privacy loss is

Wμ(Prt(D),Prt(D))=infdt[𝔼(Ztμ)]1μ.\displaystyle W_{\mu}\left(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})\right)=\inf_{d_{t}}\left[\mathbb{E}\left(\|Z_{t}\|^{\mu}\right)\right]^{\frac{1}{\mu}}. (95)

Referring to the definition of the norm, we can obtain

Zt=(i=1n|Zti|μ)1μZtμ=i=1n|Zti|μ.\displaystyle\|Z_{t}\|=\left(\sum_{i=1}^{n}|Z_{ti}|^{\mu}\right)^{\frac{1}{\mu}}\Rightarrow\;\|Z_{t}\|^{\mu}=\sum_{i=1}^{n}|Z_{ti}|^{\mu}. (96)

According to the summation property of expectation, we have

𝔼[Ztμ]=𝔼[i=1n|Zti|μ]=i=1n𝔼(|Zti|μ).\displaystyle\mathbb{E}\left[\|Z_{t}\|^{\mu}\right]=\mathbb{E}\left[\sum_{i=1}^{n}|Z_{ti}|^{\mu}\right]=\sum_{i=1}^{n}\mathbb{E}\left(|Z_{ti}|^{\mu}\right). (97)

Finally, we have

Wμ(Prt(D),Prt(D))=infdt[i=1n𝔼(|Zti|μ)]1μ.\displaystyle W_{\mu}\left(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})\right)=\inf_{d_{t}}\left[\sum_{i=1}^{n}\mathbb{E}\left(|Z_{ti}|^{\mu}\right)\right]^{\frac{1}{\mu}}. (98)
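Equation 98 can be estimated by Monte Carlo for given qq, σ\sigma, dtd_{t} and μ\mu. The sketch below is our own (the function name and default arguments are assumptions); in practice the infimum over dtd_{t} would be taken by the caller over the observed pairs of adjacent gradients.

```python
import numpy as np

def wdp_loss_epoch(q, sigma, d_t, mu, n_dims=1, n_samples=200_000, seed=0):
    """Monte Carlo estimate of [sum_i E|Z_ti|**mu]**(1/mu) for one epoch,
    with Z_t ~ N(q * d_t, (2 - 2q + 2q**2) * sigma**2) as in Theorem 2."""
    rng = np.random.default_rng(seed)
    var = (2.0 - 2.0 * q + 2.0 * q ** 2) * sigma ** 2
    z = rng.normal(q * d_t, np.sqrt(var), size=(n_samples, n_dims))
    return float(np.sum(np.mean(np.abs(z) ** mu, axis=0)) ** (1.0 / mu))

# The infimum over d_t would be taken over the observed gradient pairs.
print(wdp_loss_epoch(q=0.01, sigma=1.0, d_t=1.0, mu=2.0))
```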

Proof of Theorem 3

Theorem 3 (Tail bound) Under the conditions described in Theorem 2, \mathcal{M} satisfies generalized (μ,ε)(\mu,\varepsilon)-WDP with failure probability δ\delta given by

logδ=βt=1Tinfdt[i=1n𝔼(|Zti|μ)]1μβε,\displaystyle\log\delta=\beta\sum_{t=1}^{T}\inf_{d_{t}}\left[\sum_{i=1}^{n}\mathbb{E}\left(|Z_{ti}|^{\mu}\right)\right]^{\frac{1}{\mu}}-\beta\varepsilon, (99)
Z𝒩(qdt,(22q+2q2)σ2).\displaystyle Z\sim\mathcal{N}\left(qd_{t},(2-2q+2q^{2})\sigma^{2}\right).

Proof. In Theorem 1, we have proved that

exp[βt=1T𝔼(Wμ(Prt(D),Prt(D)))βε]δ.{\exp\left[\beta\sum_{t=1}^{T}\mathbb{E}(W_{\mu}(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})))-\beta\varepsilon\right]}\leq\delta. (100)

Taking logarithms on both sides of Equation 100, we can obtain

βt=1T𝔼(Wμ(Prt(D),Prt(D)))βεlogδ.\displaystyle\beta\sum_{t=1}^{T}\mathbb{E}(W_{\mu}(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})))-\beta\varepsilon\leq\log\delta. (101)

In Theorem 2, we have proved that

\displaystyle W_{\mu}\left(Pr_{\mathcal{M}_{t}}(D),Pr_{\mathcal{M}_{t}}(D^{\prime})\right)=\inf_{d_{t}}\left[\sum_{i=1}^{n}\mathbb{E}\left(|Z_{ti}|^{\mu}\right)\right]^{\frac{1}{\mu}}, (102)

where Z𝒩(qdt,(22q+2q2)σ2)Z\sim\mathcal{N}\left(qd_{t},(2-2q+2q^{2})\sigma^{2}\right) and dt=gtgt2d_{t}=\|g_{t}-g_{t}^{\prime}\|_{2}.

Plugging Equation 102 into Equation 101, we can obtain

\displaystyle\beta\sum_{t=1}^{T}\mathbb{E}\left[\inf_{d_{t}}\left[\sum_{i=1}^{n}\mathbb{E}\left(|Z_{ti}|^{\mu}\right)\right]^{\frac{1}{\mu}}\right]-\beta\varepsilon\leq\log\delta. (103)

Here 𝔼(|Z|μ)\mathbb{E}\left(|Z|^{\mu}\right) can be computed in closed form with the help of Lemma 1, so we regard it as a single computable quantity.

Observing Equation 103, we find that the uncertainty comes from two parts: the Gaussian random variable ZZ and the norm of the pairwise gradients gtgt2\|g_{t}-g_{t}^{\prime}\|_{2}. However, these two uncertainties have already been eliminated by the inner expectation and the infimum. Thus, the outer 𝔼\mathbb{E} is no longer needed and the expression simplifies to

\displaystyle\beta\sum_{t=1}^{T}\inf_{d_{t}}\left[\sum_{i=1}^{n}\mathbb{E}\left(|Z_{ti}|^{\mu}\right)\right]^{\frac{1}{\mu}}-\beta\varepsilon\leq\log\delta. (104)

Since we always want the failure probability to be as small as possible, we replace the inequality with an equality as follows

logδ=βt=1Tinfdt[i=1n𝔼(|Zti|μ)]1μβε.\displaystyle\log\delta=\beta\sum_{t=1}^{T}\inf_{d_{t}}\left[\sum_{i=1}^{n}\mathbb{E}\left(|Z_{ti}|^{\mu}\right)\right]^{\frac{1}{\mu}}-\beta\varepsilon. (105)
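Rearranging Equation 105 gives ε = Σ_t W_t − (log δ)/β, i.e., the privacy budget reported by the Wasserstein accountant once the per-epoch losses have been computed via Theorem 2. A minimal sketch with our own naming:

```python
import math

def wdp_epsilon(per_epoch_losses, delta, beta=1.0):
    # Rearranging Equation 105: eps = sum_t W_t - log(delta) / beta.
    return sum(per_epoch_losses) - math.log(delta) / beta

losses = [0.05] * 100  # e.g. 100 epochs with equal per-epoch loss
print(wdp_epsilon(losses, delta=1e-5, beta=1.0))  # about 16.5
```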

Proof of Lemma 1

Lemma 1 (Raw Absolute Moment) Assume that Z𝒩(qdt,(22q+2q2)σ2)Z\sim\mathcal{N}(qd_{t},(2-2q+2q^{2})\sigma^{2}); then the raw absolute moment of ZZ is given by

𝔼(|Z|μ)=(2Var)μ2GF(μ+12)π𝒦(μ2,12;q2dt22Var).\displaystyle\mathbb{E}\left(|Z|^{\mu}\right)=\left(2Var\right)^{\frac{\mu}{2}}\frac{GF\left({\frac{\mu+1}{2}}\right)}{\sqrt{\pi}}\mathcal{K}\left(-\frac{\mu}{2},\frac{1}{2};-\frac{q^{2}d_{t}^{2}}{2Var}\right).

Here VarVar represents the variance of the Gaussian random variable ZZ, namely Var=(22q+2q2)σ2Var=(2-2q+2q^{2})\sigma^{2}. GF(μ+12)GF\left({\frac{\mu+1}{2}}\right) represents the Gamma function

GF(μ+12)=0xμ+121ex𝑑x,\displaystyle GF\left({\frac{\mu+1}{2}}\right)=\int_{0}^{\infty}x^{{\frac{\mu+1}{2}}-1}e^{-x}dx, (106)

and 𝒦(μ2,12;q2dt22Var)\mathcal{K}\left(-\frac{\mu}{2},\frac{1}{2};-\frac{q^{2}d_{t}^{2}}{2Var}\right) represents Kummer’s confluent hypergeometric function as

n=0q2ndt2nn!4n(1q+q2)nσ2ni=1nμ2i+21+2i2.\displaystyle\sum_{n=0}^{\infty}\frac{{q^{2n}d_{t}}^{2n}}{n!\cdot 4^{n}(1-q+q^{2})^{n}\sigma^{2n}}\prod_{i=1}^{n}\frac{\mu-2i+2}{1+2i-2}. (107)

Proof. From Equation 17 in Winkelbauer (2012), we can obtain the expression of 𝔼(|Z|μ)\mathbb{E}\left(|Z|^{\mu}\right) as follows

𝔼(|Z|μ)=(2Var)μ2GF(μ+12)π𝒦(μ2,12;q2dt22Var).\displaystyle\mathbb{E}\left(|Z|^{\mu}\right)=\left(2Var\right)^{\frac{\mu}{2}}\frac{GF\left({\frac{\mu+1}{2}}\right)}{\sqrt{\pi}}\mathcal{K}\left(-\frac{\mu}{2},\frac{1}{2};-\frac{q^{2}d_{t}^{2}}{2Var}\right). (108)

Here 𝒦(μ2,12;q2dt22Var)\mathcal{K}\left(-\frac{\mu}{2},\frac{1}{2};-\frac{q^{2}d_{t}^{2}}{2Var}\right) can be expanded further as follows

𝒦(μ2,12;q2dt22Var)\displaystyle\mathcal{K}\left(-\frac{\mu}{2},\frac{1}{2};-\frac{q^{2}d_{t}^{2}}{2Var}\right) =𝒦(μ2,12;q2dt22(22q+2q2)σ2)\displaystyle=\mathcal{K}\left(-\frac{\mu}{2},\frac{1}{2};-\frac{q^{2}d_{t}^{2}}{2(2-2q+2q^{2})\sigma^{2}}\right) (109)
=𝒦(μ2,12;q2dt24(1q+q2)σ2)\displaystyle=\mathcal{K}\left(-\frac{\mu}{2},\frac{1}{2};-\frac{q^{2}d_{t}^{2}}{4(1-q+q^{2})\sigma^{2}}\right) (110)
=n=0(μ2)n¯(12)n¯(q2dt24(1q+q2)σ2)nn!\displaystyle=\sum_{n=0}^{\infty}\frac{\left(-\frac{\mu}{2}\right)^{\overline{n}}}{\left(\frac{1}{2}\right)^{\overline{n}}}\frac{\left(-\frac{q^{2}d_{t}^{2}}{4(1-q+q^{2})\sigma^{2}}\right)^{n}}{n!} (111)
=n=0(μ2)n¯(12)n¯(1)n(q2dt24(1q+q2)σ2)nn!\displaystyle=\sum_{n=0}^{\infty}\frac{\left(-\frac{\mu}{2}\right)^{\overline{n}}}{\left(\frac{1}{2}\right)^{\overline{n}}}\left(-1\right)^{n}\frac{\left(\frac{q^{2}d_{t}^{2}}{4(1-q+q^{2})\sigma^{2}}\right)^{n}}{n!} (112)
=n=0(1)n(μ2)n¯(12)n¯(q2dt2)nn!(4(1q+q2)σ2)n\displaystyle=\sum_{n=0}^{\infty}\left(-1\right)^{n}\frac{\left(-\frac{\mu}{2}\right)^{\overline{n}}}{\left(\frac{1}{2}\right)^{\overline{n}}}\frac{\left(q^{2}d_{t}^{2}\right)^{n}}{n!\cdot\left({4(1-q+q^{2})\sigma^{2}}\right)^{n}} (113)
=n=0(1)n(μ2)n¯(12)n¯q2ndt2nn!4n(1q+q2)nσ2n.\displaystyle=\sum_{n=0}^{\infty}\left(-1\right)^{n}\frac{\left(-\frac{\mu}{2}\right)^{\overline{n}}}{\left(\frac{1}{2}\right)^{\overline{n}}}\frac{{q^{2n}d_{t}}^{2n}}{n!\cdot 4^{n}(1-q+q^{2})^{n}\sigma^{2n}}. (114)

Here (μ2)n¯(-\frac{\mu}{2})^{\overline{n}} denotes the rising factorial of μ2-\frac{\mu}{2} (see Winkelbauer (2012))

(μ2)n¯\displaystyle\left(-\frac{\mu}{2}\right)^{\overline{n}} =GF(μ2+n)GF(μ2)\displaystyle=\frac{GF(-\frac{\mu}{2}+n)}{GF(-\frac{\mu}{2})} (115)
=(μ2)(μ2+1)(μ2+n1)\displaystyle=\left(-\frac{\mu}{2}\right)\cdot\left(-\frac{\mu}{2}+1\right)\cdot...\cdot\left(-\frac{\mu}{2}+n-1\right) (116)
=(1)n(12)nμ(μ2)(μ2n+2).\displaystyle=\left(-1\right)^{n}\left(\frac{1}{2}\right)^{n}\mu\cdot\left(\mu-2\right)\cdot...\cdot\left(\mu-2n+2\right). (117)
(12)n¯\displaystyle\left(\frac{1}{2}\right)^{\overline{n}} =GF(12+n)GF(12)\displaystyle=\frac{GF(\frac{1}{2}+n)}{GF(\frac{1}{2})} (118)
=(12)(12+1)(12+n1)\displaystyle=\left(\frac{1}{2}\right)\cdot\left(\frac{1}{2}+1\right)\cdot...\cdot\left(\frac{1}{2}+n-1\right) (119)
=(12)n[13(1+2n2)].\displaystyle=\left(\frac{1}{2}\right)^{n}\left[1\cdot 3\cdot...\cdot(1+2n-2)\right]. (120)

From Equations 117 and 120, we have

(μ2)n¯(12)n¯\displaystyle\frac{\left(-\frac{\mu}{2}\right)^{\overline{n}}}{\left(\frac{1}{2}\right)^{\overline{n}}} =(1)nμ(μ2)(μ2n+2)[13(1+2n2)]\displaystyle=\frac{(-1)^{n}\cdot\mu\cdot\left(\mu-2\right)\cdot...\cdot\left(\mu-2n+2\right)}{\left[1\cdot 3\cdot...\cdot(1+2n-2)\right]} (121)
=(1)ni=1nμ2(i1)i=1n1+2i2\displaystyle=(-1)^{n}\frac{\prod_{i=1}^{n}\mu-2(i-1)}{\prod_{i=1}^{n}1+2i-2} (122)
=(1)ni=1nμ2i+21+2i2.\displaystyle=(-1)^{n}\prod_{i=1}^{n}\frac{\mu-2i+2}{1+2i-2}. (123)

Combining Equations 114 and 123, we can obtain

𝒦(μ2,12;q2dt22Var)\displaystyle\mathcal{K}\left(-\frac{\mu}{2},\frac{1}{2};-\frac{q^{2}d_{t}^{2}}{2Var}\right) =n=0q2ndt2nn!4n(1q+q2)nσ2ni=1nμ2i+21+2i2\displaystyle=\sum_{n=0}^{\infty}\frac{{q^{2n}d_{t}}^{2n}}{n!\cdot 4^{n}(1-q+q^{2})^{n}\sigma^{2n}}\prod_{i=1}^{n}\frac{\mu-2i+2}{1+2i-2} (124)
=n=0q2ndt2nn!4n(Var)ni=1nμ2i+21+2i2.\displaystyle=\sum_{n=0}^{\infty}\frac{{q^{2n}d_{t}}^{2n}}{n!\cdot 4^{n}(Var)^{n}}\prod_{i=1}^{n}\frac{\mu-2i+2}{1+2i-2}. (125)
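Lemma 1 can be evaluated directly with SciPy, since Kummer’s confluent hypergeometric function is available as scipy.special.hyp1f1 and the Gamma function as scipy.special.gamma. The sketch below (our own) cross-checks the closed form against a Monte Carlo estimate.

```python
import numpy as np
from scipy.special import gamma, hyp1f1

def raw_abs_moment(q, sigma, d_t, mu):
    """E|Z|**mu for Z ~ N(q*d_t, Var) with Var = (2 - 2q + 2q**2) * sigma**2,
    via the closed form of Lemma 1 (Winkelbauer 2012, Equation 17)."""
    var = (2.0 - 2.0 * q + 2.0 * q ** 2) * sigma ** 2
    return ((2.0 * var) ** (mu / 2.0)
            * gamma((mu + 1.0) / 2.0) / np.sqrt(np.pi)
            * hyp1f1(-mu / 2.0, 0.5, -((q * d_t) ** 2) / (2.0 * var)))

# Cross-check the closed form against a Monte Carlo estimate.
q, sigma, d_t, mu = 0.1, 1.0, 2.0, 3.0
var = (2.0 - 2.0 * q + 2.0 * q ** 2) * sigma ** 2
z = np.random.default_rng(0).normal(q * d_t, np.sqrt(var), 1_000_000)
print(raw_abs_moment(q, sigma, d_t, mu), np.mean(np.abs(z) ** mu))
```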

Appendix B Experiments

Composition with Clipping

Figure 4 shows how the privacy budget changes as the number of steps increases. We find that the impact of CC on the privacy budget decreases, because the gradient norm is limited by the clipping threshold, and the gap between the privacy budgets of different DP frameworks narrows. Nevertheless, WDP still obtains the lowest cumulative privacy budget, and this value grows slightly more slowly than those of DP and BDP.

Figure 4: Privacy budgets over synthetic gradients obtained by moments accountant under DP, Bayesian accountant under BDP and Wasserstein accountant under WDP when applying gradient clipping. Panels (a)–(d) correspond to the 0.05-, 0.50-, 0.75- and 0.99-quantiles of gt\|g_{t}\|.

Appendix C Several Basic Concepts in Differential Privacy

Definition 6 (Differential Privacy (Dwork et al. 2006b)). A randomized algorithm :𝒟\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R} is (ε,δ)\left(\varepsilon,\delta\right)-differentially private if for any adjacent datasets D,D^{\prime}\in\mathcal{D} and all measurable subsets SS\subseteq\mathcal{R} the following inequality holds:

Pr[(D)S]eεPr[(D)S]+δ.Pr\left[\mathcal{M}\left(D\right)\in S\right]\leq e^{\varepsilon}Pr\left[\mathcal{M}\left(D^{\prime}\right)\in S\right]+\delta. (126)

Here Pr[]Pr[\cdot] denotes probability, and ε\varepsilon is known as the privacy budget. In particular, if δ=0\delta=0, \mathcal{M} is said to preserve ε\varepsilon-DP or pure DP.

Definition 7 (Privacy Loss of DP). For a randomized algorithm :𝒟\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R} with outcome oo, the privacy loss of \mathcal{M} is defined as

Loss(o)=logPr[(D)=o]Pr[(D)=o].Loss\left(o\right)=\log\frac{Pr\left[\mathcal{M}\left(D\right)=o\right]}{Pr\left[\mathcal{M}\left(D^{\prime}\right)=o\right]}. (127)

The privacy budget is the strict upper bound of the privacy loss in ε\varepsilon-differential privacy, and an upper bound of the privacy loss that holds with confidence 1δ1-\delta in (ε\varepsilon,δ\delta)-differential privacy.

Definition 8 (lpl_{p}-Sensitivity (Dwork and Lei 2009)). Sensitivity in DP theory is defined as the maximum pp-norm distance between the outputs of the same query function on two adjacent datasets D{D} and D{D}^{\prime}

Δpf=supρ(D,D)1f(D)f(D)p.{\Delta}_{p}f=\sup_{\rho({D},{D^{\prime}})\leq 1}\|f({D})-f({D^{\prime}})\|_{p}. (128)

Here f:𝒟df:\mathcal{D}\rightarrow\mathbb{R}^{d} is a dd-dimensional query function, and ρ(D,D)=DDp\rho({D},{D^{\prime}})=\|D-D^{\prime}\|_{p} is the norm-induced distance between D{D} and D{D^{\prime}}. The lpl_{p}-sensitivity measures the largest difference over all possible adjacent datasets.

Definition 9 (Rényi Differential Privacy (Mironov 2017)). A randomized algorithm :𝒟\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R} is said to preserve (α,ε)\left(\alpha,\varepsilon\right)-RDP if for any adjacent datasets D,D𝒟D,D^{\prime}\in\mathcal{D} the following holds

Dα(Pr(D)Pr(D))=1α1log𝔼o(D)[(p(D)(o)p(D)(o))α]ε.\displaystyle D_{\alpha}(Pr_{\mathcal{M}}(D)\|Pr_{\mathcal{M}}(D^{\prime}))=\frac{1}{\alpha-1}\log\mathbb{E}_{o\sim\mathcal{M}\left(D^{\prime}\right)}\left[\left(\frac{p_{\mathcal{M}\left({D}\right)}\left(o\right)}{p_{{\mathcal{M}}\left({D}^{\prime}\right)}\left(o\right)}\right)^{\alpha}\right]\leq\varepsilon. (129)

Here α(1,+)\alpha\in\left(1,+\infty\right) is the order of RDP and oo is the output of algorithm \mathcal{M}. Pr(D)Pr_{\mathcal{M}}(D) and Pr(D)Pr_{\mathcal{M}}(D^{\prime}) are probability distributions, while p(D)p_{\mathcal{M}}(D) and p(D)p_{\mathcal{M}}(D^{\prime}) are the corresponding probability density functions.

Definition 10 (Strong Bayesian Differential Privacy (Triastcyn and Faltings 2020)). A randomized algorithm :𝒟\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R} is said to satisfy (εb,δb)\left(\varepsilon_{b},\delta_{b}\right)-strong Bayesian differential privacy if for any adjacent datasets D,D^{\prime}\in\mathcal{D} the following holds

Pr[logp(o|D)p(o|D)εb]δb.Pr\left[\log\frac{p(o|D)}{p(o|D^{\prime})}\geq\varepsilon_{b}\right]\leq\delta_{b}. (130)

Here εb\varepsilon_{b} and δb\delta_{b} are the privacy budget and failure probability in BDP (Triastcyn and Faltings 2020), oo is the output satisfying o=()o=\mathcal{M}(\cdot), and p(o|D)p(o|D) and p(o|D)p(o|D^{\prime}) are the probability density functions under the adjacent datasets.

Definition 11 (Bayesian Differential Privacy (Triastcyn and Faltings 2020)). Suppose the only different data entry xx^{\prime} follows a certain distribution b(x)b(x), namely xb(x)x^{\prime}\sim b(x). A randomized algorithm :𝒟\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R} is said to satisfy (εb,δb)\left(\varepsilon_{b},\delta_{b}\right)-Bayesian differential privacy if for any neighboring datasets D,D𝒟D,D^{\prime}\in\mathcal{D} and any set of outcomes 𝒪\mathcal{O} the following holds

Pr\left[\mathcal{M}\left(D\right)\in\mathcal{O}\right]\leq e^{\varepsilon_{b}}Pr\left[\mathcal{M}\left(D^{\prime}\right)\in\mathcal{O}\right]+\delta_{b}. (131)

From the above definitions, we find that strong BDP is inspired by RDP, and the definition of BDP is similar to that of DP. Therefore, the weaknesses of DP, BDP and RDP are similar: (1) their privacy losses do not satisfy symmetry and the triangle inequality, which prevents them from being metrics; (2) their privacy budgets tend to be overstated. To alleviate these problems, we propose Wasserstein differential privacy in this paper, expecting to achieve better properties in privacy computing and thus higher performance in private machine learning.