
A Randomized Approach to Tight Privacy Accounting

Jiachen T. Wang, Saeed Mahloujifar, Tong Wu, Prateek Mittal (Princeton University); Ruoxi Jia (Virginia Tech)
{tianhaowang,sfar,tongwu,pmittal}@princeton.edu, ruoxijia@vt.edu
Abstract

Bounding privacy leakage over compositions, i.e., privacy accounting, is a key challenge in differential privacy (DP). The privacy parameter ($\varepsilon$ or $\delta$) is often easy to estimate but hard to bound. In this paper, we propose a new differential privacy paradigm called estimate-verify-release (EVR), which tackles the challenge of providing a strict upper bound for the privacy parameter in DP compositions by converting an estimate of the privacy parameter into a formal guarantee. The EVR paradigm first verifies whether the mechanism meets the estimated privacy guarantee, and then releases the query output based on the verification result. The core component of EVR is privacy verification. We develop a randomized privacy verifier using the Monte Carlo (MC) technique. Furthermore, we propose an MC-based DP accountant that outperforms existing DP accounting techniques in terms of accuracy and efficiency. The MC-based DP verifier and accountant are applicable to an important and commonly used class of DP algorithms, including the well-known DP-SGD. An empirical evaluation shows that the proposed EVR paradigm improves the utility-privacy tradeoff for privacy-preserving machine learning.

1 Introduction

Privacy concerns are a major obstacle to deploying machine learning (ML) applications. In response, ML algorithms with differential privacy (DP) guarantees have been proposed and developed. In privacy-preserving ML algorithms, DP mechanisms are often repeatedly applied to private training data. For instance, when training deep learning models with DP-SGD [ACG+16], it is often necessary to execute subsampled Gaussian mechanisms on the private training data thousands of times.

A major challenge in machine learning with differential privacy is privacy accounting, i.e., measuring the privacy loss of the composition of DP mechanisms. A privacy accountant takes a list of mechanisms and returns the privacy parameters ($\varepsilon$ and $\delta$) for the composition of those mechanisms. Specifically, a privacy accountant is given a target $\varepsilon$ and finds the smallest achievable $\delta$ such that the composed mechanism $\mathcal{M}$ is $(\varepsilon,\delta)$-DP (one can also fix $\delta$ and find $\varepsilon$). We use $\delta_{\mathcal{M}}(\varepsilon)$ to denote the smallest achievable $\delta$ given $\varepsilon$, which is often referred to as the optimal privacy curve in the literature.

Training deep learning models with DP-SGD is essentially the adaptive composition of thousands of subsampled Gaussian mechanisms. The Moment Accountant (MA) is a pioneering solution for privacy loss calculation in differentially private deep learning [ACG+16]. However, MA does not provide the optimal $\delta_{\mathcal{M}}(\varepsilon)$ in general [ZDW22]. This motivates the development of more advanced privacy accounting techniques that outperform MA. Two major lines of such work are based on the Fast Fourier Transform (FFT) (e.g., [GLW21]) and the Central Limit Theorem (CLT) [BDLS20, WGZ+22]. Both techniques can provide an estimate as well as an upper bound for $\delta_{\mathcal{M}}(\varepsilon)$ through bounding the worst-case estimation error. In practice, only the upper bounds for $\delta_{\mathcal{M}}(\varepsilon)$ can be used, as differential privacy is a strict guarantee.

Figure 1: Results of estimating/bounding $\delta_{\mathcal{M}}(\varepsilon)$ for the composition of 1200 Gaussian mechanisms with $\sigma=70$. '-upp' means upper bound and '-est' means estimate. The curves of 'Exact', 'FFT-est', and 'CLT-est' overlap. The ground-truth curve ('Exact') for the pure Gaussian mechanism can be computed analytically [GLW21].

Motivation: estimates can be more accurate than upper bounds. The motivation for this paper stems from the limitations of current privacy accounting techniques in providing tight upper bounds for $\delta_{\mathcal{M}}(\varepsilon)$. Despite outperforming MA, both FFT- and CLT-based methods can provide ineffective bounds in certain regimes [GLW21, WGZ+22]. We demonstrate such limitations in Figure 1 using the composition of Gaussian mechanisms. The FFT-based technique [GLW21] outperforms MA in most regimes, but its upper bounds (blue dashed curve) are worse than MA's when $\delta<10^{-10}$ due to computational limitations (as discussed in Appendix A of [GLW21]; see also Remark 6 for why the regime of $\delta<10^{-10}$ is important). The CLT-based techniques (e.g., [WGZ+22]) produce sub-optimal upper bounds (red dashed curve) over the entire range of $\delta$. This is primarily because the number of composed mechanisms ($k=1200$) is too small for the CLT bounds to converge (a similar phenomenon is observed in [WGZ+22]). On the other hand, the estimates of $\delta_{\mathcal{M}}(\varepsilon)$ from both FFT- and CLT-based techniques, which estimate the parameter rather than upper bound it, are in fact very close to the ground truth (the three curves overlap in Figure 1). However, as mentioned earlier, these accurate estimates cannot be used in practice, as we cannot prove that they do not underestimate $\delta_{\mathcal{M}}(\varepsilon)$. This dilemma raises an important question: can we develop new techniques that allow us to use privacy parameter estimates instead of strict upper bounds in privacy accounting? (We note that this brings benefits not only in the regime where $\delta<10^{-10}$, but also in the more common regime where $\delta\approx 10^{-5}$; see Figure 3 for an example.)

Figure 2: An overview of our EVR paradigm. EVR converts an estimated $(\varepsilon,\delta)$ provided by a privacy accountant into a formal guarantee. Compared with the original mechanism, EVR has an extra failure mode in which nothing is output when the estimated $(\varepsilon,\delta)$ is rejected. We show that the MC-based verifier we propose can achieve negligible failure probability ($O(\delta)$) in Section 4.4.

This paper gives a positive answer to this question. Our contributions are summarized as follows:

Estimate-Verify-Release (EVR): a DP paradigm that converts privacy parameter estimates into formal guarantees. We develop a new DP paradigm called Estimate-Verify-Release, which augments a mechanism with a formal privacy guarantee based on its privacy parameter estimate. The basic idea of EVR is to first verify whether the mechanism satisfies the estimated DP guarantee, and release the mechanism's output only if the verification passes. The core component of the EVR paradigm is privacy verification. A DP verifier can be randomized and imperfect, suffering from both false positives (accepting an underestimate) and false negatives (rejecting an overestimate). We show that EVR's privacy guarantee can be achieved when privacy verification has a low false positive rate.

A Monte Carlo-based DP Verifier. For an important and widely used class of DP algorithms, including the subsampled Gaussian mechanism (the building block of DP-SGD), we develop a Monte Carlo (MC) based DP verifier for the EVR paradigm. We present various techniques that ensure the DP verifier has both a low false positive rate (for the privacy guarantee) and a low false negative rate (for the utility guarantee, i.e., making EVR and the original mechanism as similar as possible).

A Monte Carlo-based DP Accountant. We further propose a new MC-based approach for DP accounting, which we call the MC accountant. It utilizes MC techniques similar to those in privacy verification. We show that the MC accountant achieves several advantages over existing privacy accounting methods. In particular, we demonstrate that the MC accountant is efficient for online privacy accounting, a realistic scenario in which a privacy practitioner wants to update the estimate of the privacy guarantee whenever a new mechanism is executed.

Figure 2 gives an overview of the proposed EVR paradigm as well as this paper’s contributions.

2 Privacy Accounting: a Mean Estimation/Bounding Problem

In this section, we review relevant concepts and introduce privacy accounting as a mean estimation/bounding problem.

Symbols and notations. We use $D,D^{\prime}\in\mathbb{N}^{\mathcal{X}}$ to denote two datasets of unspecified size over the space $\mathcal{X}$. We call two datasets $D$ and $D^{\prime}$ adjacent (denoted $D\sim D^{\prime}$) if we can construct one by adding/removing one data point from the other. We use $P,Q$ to denote random variables, and we overload notation by writing $P(\cdot),Q(\cdot)$ for their density functions.

Differential privacy and its equivalent characterizations. Having established the notation, we can now formally define differential privacy.

Definition 1 (Differential Privacy [DMNS06]).

For $\varepsilon,\delta\geq 0$, a randomized algorithm $\mathcal{M}:\mathbb{N}^{\mathcal{X}}\rightarrow\mathcal{Y}$ is $(\varepsilon,\delta)$-differentially private if for every pair of adjacent datasets $D\sim D^{\prime}$ and every subset of possible outputs $E\subseteq\mathcal{Y}$, we have $\Pr_{\mathcal{M}}[\mathcal{M}(D)\in E]\leq e^{\varepsilon}\Pr_{\mathcal{M}}[\mathcal{M}(D^{\prime})\in E]+\delta$.

One can alternatively define differential privacy in terms of the maximum possible divergence between the output distributions of any pair $\mathcal{M}(D)$ and $\mathcal{M}(D^{\prime})$.

Lemma 2 ([BO13]).

A mechanism $\mathcal{M}$ is $(\varepsilon,\delta)$-DP iff $\sup_{D\sim D^{\prime}}\mathsf{E}_{e^{\varepsilon}}(\mathcal{M}(D)\|\mathcal{M}(D^{\prime}))\leq\delta$, where $\mathsf{E}_{\gamma}(P\|Q):=\mathbb{E}_{o\sim Q}[(\frac{P(o)}{Q(o)}-\gamma)_{+}]$ and $(a)_{+}:=\max(a,0)$.

$\mathsf{E}_{\gamma}$ is usually referred to as the Hockey-Stick (HS) divergence in the literature. For every mechanism $\mathcal{M}$ and every $\varepsilon\geq 0$, there exists a smallest $\delta$ such that $\mathcal{M}$ is $(\varepsilon,\delta)$-DP. Following the literature [ZDW22, AGA+23], we formalize such a $\delta$ as a function of $\varepsilon$.

Definition 3 (Optimal Privacy Curve).

The optimal privacy curve of a mechanism $\mathcal{M}$ is the function $\delta_{\mathcal{M}}:\mathbb{R}^{+}\rightarrow[0,1]$ s.t. $\delta_{\mathcal{M}}(\varepsilon):=\sup_{D\sim D^{\prime}}\mathsf{E}_{e^{\varepsilon}}(\mathcal{M}(D)\|\mathcal{M}(D^{\prime}))$.
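As a concrete illustration (ours, not code from the paper), the hockey-stick divergence is straightforward to evaluate on a finite domain. The sketch below computes $\mathsf{E}_{\gamma}(P\|Q)$ for a binary randomized-response mechanism that answers truthfully with probability $0.75$, so its output distributions on two adjacent inputs (under replace-one adjacency) are $P=(0.75,0.25)$ and $Q=(0.25,0.75)$; the name `hockey_stick` is ours.

```python
def hockey_stick(P, Q, gamma):
    # E_gamma(P || Q) = sum over outcomes of max(P(o) - gamma * Q(o), 0),
    # for distributions on a finite domain given as probability lists.
    return sum(max(p - gamma * q, 0.0) for p, q in zip(P, Q))

# Binary randomized response with epsilon_0 = ln 3: truthful w.p. 0.75.
P, Q = [0.75, 0.25], [0.25, 0.75]
delta_at_0 = hockey_stick(P, Q, 1.0)     # gamma = e^0: equals total variation distance
delta_at_eps0 = hockey_stick(P, Q, 3.0)  # gamma = e^{ln 3}
```

Here $\delta(0)=0.5$ (the total variation distance) and $\delta(\ln 3)=0$, consistent with this mechanism being $(\ln 3,0)$-DP.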

Dominating Distribution Pair and Privacy Loss Random Variable (PRV). It is computationally infeasible to find $\delta_{\mathcal{M}}(\varepsilon)$ by computing $\mathsf{E}_{e^{\varepsilon}}(\mathcal{M}(D)\|\mathcal{M}(D^{\prime}))$ for all pairs of adjacent datasets $D$ and $D^{\prime}$. A mainstream strategy in the literature is to find a pair of distributions $(P,Q)$ that dominates all $(\mathcal{M}(D),\mathcal{M}(D^{\prime}))$ in terms of the Hockey-Stick divergence. This motivates the notions of dominating distribution pair and privacy loss random variable (PRV).

Definition 4 ([ZDW22]).

A pair of distributions $(P,Q)$ is a pair of dominating distributions for $\mathcal{M}$ under adjacency relation $\sim$ if for all $\gamma\geq 0$, $\sup_{D\sim D^{\prime}}\mathsf{E}_{\gamma}(\mathcal{M}(D)\|\mathcal{M}(D^{\prime}))\leq\mathsf{E}_{\gamma}(P\|Q)$. If equality is achieved for all $\gamma\geq 0$, we say $(P,Q)$ is a pair of tightly dominating distributions for $\mathcal{M}$. Furthermore, we call $Y:=\log\left(\frac{P(o)}{Q(o)}\right)$, $o\sim P$, the privacy loss random variable (PRV) of $\mathcal{M}$ associated with the dominating distribution pair $(P,Q)$.

Zhu et al. [ZDW22] show that every mechanism has a pair of tightly dominating distributions. Hence, we can alternatively characterize the optimal privacy curve as $\delta_{\mathcal{M}}(\varepsilon)=\mathsf{E}_{e^{\varepsilon}}(P\|Q)$ for the tightly dominating pair $(P,Q)$, and we have $\delta_{\mathcal{M}}(\varepsilon)\leq\mathsf{E}_{e^{\varepsilon}}(P\|Q)$ if $(P,Q)$ is a dominating pair that is not necessarily tight. The importance of the concept of the PRV comes from the fact that we can write $\mathsf{E}_{e^{\varepsilon}}(P\|Q)$ as an expectation over it: $\mathsf{E}_{e^{\varepsilon}}(P\|Q)=\mathbb{E}_{Y}\left[\left(1-e^{\varepsilon-Y}\right)_{+}\right]$. Thus, one can bound $\delta_{\mathcal{M}}(\varepsilon)$ by first identifying $\mathcal{M}$'s dominating pair distributions as well as the associated PRV $Y$, and then computing this expectation. This formulation allows us to bound $\delta_{\mathcal{M}}(\varepsilon)$ without enumerating all adjacent $D$ and $D^{\prime}$. For notational convenience, we denote $\delta_{Y}(\varepsilon):=\mathbb{E}_{Y}\left[\left(1-e^{\varepsilon-Y}\right)_{+}\right]$. Clearly, $\delta_{\mathcal{M}}\leq\delta_{Y}$. If $(P,Q)$ is a tightly dominating pair for $\mathcal{M}$, then $\delta_{\mathcal{M}}=\delta_{Y}$.
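For intuition, this expectation can be checked directly on the sensitivity-1 Gaussian mechanism, whose tightly dominating pair is $(\mathcal{N}(1,\sigma^{2}),\mathcal{N}(0,\sigma^{2}))$, giving the PRV $Y=(2o-1)/(2\sigma^{2})$ with $o\sim\mathcal{N}(1,\sigma^{2})$ and a known closed-form $\delta_{\mathcal{M}}(\varepsilon)$ [GLW21]. The sketch below (our illustration; function names are ours) compares a naive Monte Carlo estimate of $\delta_{Y}(\varepsilon)$ against the analytical value.

```python
import math
import random
from statistics import NormalDist

def delta_gaussian_analytic(sigma, eps):
    # Closed-form optimal privacy curve of the sensitivity-1 Gaussian mechanism:
    # delta(eps) = Phi(1/(2 sigma) - eps*sigma) - e^eps * Phi(-1/(2 sigma) - eps*sigma).
    Phi = NormalDist().cdf
    return Phi(1 / (2 * sigma) - eps * sigma) - math.exp(eps) * Phi(-1 / (2 * sigma) - eps * sigma)

def delta_gaussian_mc(sigma, eps, m, seed=0):
    # Monte Carlo estimate of delta_Y(eps) = E_Y[(1 - e^{eps - Y})_+],
    # sampling the PRV directly: Y = (2o - 1) / (2 sigma^2), o ~ N(1, sigma^2).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        y = (2 * rng.gauss(1.0, sigma) - 1) / (2 * sigma ** 2)
        total += max(0.0, 1.0 - math.exp(eps - y))
    return total / m
```

With $\sigma=1$ and $\varepsilon=1$, both evaluate to roughly $0.127$, illustrating that $\delta_{Y}(\varepsilon)$ really is a plain expectation over the PRV.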

Privacy Accounting as a Mean Estimation/Bounding Problem. Privacy accounting aims to estimate and bound the optimal privacy curve $\delta_{\mathcal{M}}(\varepsilon)$ of the adaptively composed mechanism $\mathcal{M}=\mathcal{M}_{1}\circ\cdots\circ\mathcal{M}_{k}$. The adaptive composition of two mechanisms $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ is defined as $\mathcal{M}_{1}\circ\mathcal{M}_{2}(D):=(\mathcal{M}_{1}(D),\mathcal{M}_{2}(D,\mathcal{M}_{1}(D)))$, in which $\mathcal{M}_{2}$ can access both the dataset and the output of $\mathcal{M}_{1}$. Most practical privacy accounting techniques are based on the concept of the PRV and center on the following result.

Lemma 5 ([ZDW22]).

Let $(P_{j},Q_{j})$ be a pair of tightly dominating distributions for mechanism $\mathcal{M}_{j}$ for $j\in\{1,\ldots,k\}$. Then $(P_{1}\times\cdots\times P_{k},Q_{1}\times\cdots\times Q_{k})$ is a pair of dominating distributions for $\mathcal{M}=\mathcal{M}_{1}\circ\cdots\circ\mathcal{M}_{k}$, where $\times$ denotes the product distribution. Furthermore, the associated privacy loss random variable is $Y=\sum_{i=1}^{k}Y_{i}$, where $Y_{i}$ is the PRV associated with $(P_{i},Q_{i})$.

Lemma 5 suggests that privacy accounting for DP composition can be cast as a mean estimation/bounding problem in which one aims to approximate or bound the expectation $\delta_{Y}(\varepsilon)=\mathbb{E}_{Y}[(1-e^{\varepsilon-Y})_{+}]$ with $Y=\sum_{i=1}^{k}Y_{i}$. Note that while Lemma 5 does not guarantee a pair of tightly dominating distributions for the adaptive composition, it cannot be improved in general, as noted in [DRS19]; in fact, it is tight even for non-adaptive composition. Hence, all current privacy accounting techniques work with $\delta_{Y}$ instead of $\delta_{\mathcal{M}}$. Following prior work, and for simplicity of presentation, in this paper we only consider the practical scenarios where Lemma 5 is tight. That is, we assume $\delta_{Y}=\delta_{\mathcal{M}}$ unless otherwise specified.
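To make Lemma 5 concrete, the sketch below (our illustration) estimates $\delta_{Y}(\varepsilon)$ for the $k$-fold composition of sensitivity-1 Gaussian mechanisms by sampling the composed PRV $Y=\sum_{i}Y_{i}$, where each $Y_{i}\sim\mathcal{N}(\mu,2\mu)$ with $\mu=1/(2\sigma^{2})$ is the standard Gaussian-mechanism PRV (see, e.g., [GLW21]); the function name and defaults are ours.

```python
import math
import random

def delta_composed_gaussian(sigma, k, eps, m=50_000, seed=0):
    # Per Lemma 5, the PRV of the k-fold composition is Y = sum_i Y_i.
    # For a sensitivity-1 Gaussian mechanism with noise sigma, each
    # Y_i ~ N(mu, 2*mu) with mu = 1/(2*sigma^2), so we sample the sum directly.
    mu = 1.0 / (2 * sigma ** 2)
    sd = math.sqrt(2 * mu)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        y = sum(rng.gauss(mu, sd) for _ in range(k))
        total += max(0.0, 1.0 - math.exp(eps - y))
    return total / m
```

For Gaussians the sum is again Gaussian and could be sampled in one draw; the inner loop is kept to mirror the general recipe, where each mechanism contributes one PRV sample. Composing $k=25$ mechanisms with $\sigma=5$ matches a single mechanism with $\sigma=1$.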

Most existing privacy accounting techniques can be described as different approaches to this mean estimation problem. Example 1: FFT-based methods. This line of work (e.g., [GLW21]) discretizes the domain of each $Y_{i}$ and uses the Fast Fourier Transform (FFT) to speed up the approximation of $\delta_{Y}(\varepsilon)$. The upper bound is derived through the worst-case error bound of the approximation. Example 2: CLT-based methods. [BDLS20, WGZ+22] use the CLT to approximate the distribution of $Y=\sum_{i=1}^{k}Y_{i}$ by a Gaussian distribution. They then use the CLT's finite-sample approximation guarantee to derive an upper bound for $\delta_{Y}(\varepsilon)$.

Remark 6 (The Importance of Privacy Accounting in the Regime $\delta<10^{-10}$).

The regime where $\delta<10^{-10}$ is of significant importance for two reasons. (1) $\delta$ serves as an upper bound on the chance of severe privacy breaches, such as complete dataset exposure, necessitating a "cryptographically small" value, namely $\delta<n^{-\omega(1)}$ [DR+14, Vad17]. (2) Even under the oft-used yet questionable guideline of $\delta\approx n^{-1}$ or $n^{-1.1}$, datasets of modern scale, such as JFT-3B [ZKHB22] or LAION-5B [SBV+22], already comprise billions of records, thus rendering small $\delta$ values crucial. While we acknowledge that achieving a good privacy-utility tradeoff requires significant effort even for the current choice of $\delta\approx n^{-1}$, it is important to keep such a goal in mind.

3 Estimate-Verify-Release

As mentioned earlier, upper bounds for $\delta_{Y}(\varepsilon)$ are the only valid options for privacy accounting techniques. However, as demonstrated in Figure 1, both FFT- and CLT-based methods can provide overly conservative upper bounds in certain regimes. On the other hand, their estimates for $\delta_{Y}(\varepsilon)$ can be very close to the ground truth, even though they come with no provable guarantee. It is therefore highly desirable to develop new techniques that enable the use of privacy parameter estimates instead of overly conservative upper bounds in privacy accounting.

We tackle the problem by introducing a new paradigm for constructing DP mechanisms, which we call Estimate-Verify-Release (EVR). The key component of the EVR is an object called DP verifier (Section 3.1). The full EVR paradigm is then presented in Section 3.2, where the DP verifier is utilized as a building block to guarantee privacy.

3.1 Differential Privacy Verifier

We first formalize the concept of differential privacy verifier, the central element of the EVR paradigm. In informal terms, a DP verifier is an algorithm that attempts to verify whether a mechanism satisfies a specific level of differential privacy.

Definition 7 (Differential Privacy Verifier).

A differential privacy verifier $\texttt{DPV}(\cdot)$ is an algorithm that takes as input the description of a mechanism $\mathcal{M}$ and a proposed privacy parameter $(\varepsilon,\delta^{\mathrm{est}})$, and returns $\mathtt{True}\leftarrow\texttt{DPV}(\mathcal{M},\varepsilon,\delta^{\mathrm{est}})$ if the algorithm believes $\mathcal{M}$ is $(\varepsilon,\delta^{\mathrm{est}})$-DP (i.e., $\delta^{\mathrm{est}}\geq\delta_{Y}(\varepsilon)$, where $Y$ is the PRV of $\mathcal{M}$), and returns $\mathtt{False}$ otherwise.

A differential privacy verifier can be imperfect, suffering from both false positives (FP) and false negatives (FN). The FP rate is the likelihood that DPV accepts $(\varepsilon,\delta^{\mathrm{est}})$ when $\delta^{\mathrm{est}}<\delta_{Y}(\varepsilon)$. However, a $\delta^{\mathrm{est}}$ that underestimates $\delta_{Y}(\varepsilon)$ only slightly (e.g., by $<10\%$) is still a good estimate. To account for this, we introduce a smoothing factor $\tau\in(0,1]$ such that $\delta^{\mathrm{est}}$ is deemed "should be rejected" only when $\delta^{\mathrm{est}}\leq\tau\delta_{Y}(\varepsilon)$. A similar argument applies to FN cases, for which we introduce a smoothing factor $\rho\in(0,1]$. This leads to relaxed notions of the FP/FN rate:

Definition 8.

We say a DPV's $\tau$-relaxed false positive rate at $(\varepsilon,\delta^{\mathrm{est}})$ is

$\mathsf{FP}_{\texttt{DPV}}(\varepsilon,\delta^{\mathrm{est}};\tau):=\sup_{\mathcal{M}:\,\delta^{\mathrm{est}}<\tau\delta_{Y}(\varepsilon)}\Pr_{\texttt{DPV}}\left[\texttt{DPV}(\mathcal{M},\varepsilon,\delta^{\mathrm{est}})=\mathtt{True}\right]$

We say a DPV's $\rho$-relaxed false negative rate at $(\varepsilon,\delta^{\mathrm{est}})$ is

$\mathsf{FN}_{\texttt{DPV}}(\varepsilon,\delta^{\mathrm{est}};\rho):=\sup_{\mathcal{M}:\,\delta^{\mathrm{est}}>\rho\delta_{Y}(\varepsilon)}\Pr_{\texttt{DPV}}\left[\texttt{DPV}(\mathcal{M},\varepsilon,\delta^{\mathrm{est}})=\mathtt{False}\right]$

Privacy Verification with a DP Accountant. For a composed mechanism $\mathcal{M}=\mathcal{M}_{1}\circ\ldots\circ\mathcal{M}_{k}$, a DP verifier can be easily implemented using any existing privacy accounting technique. That is, one can execute a DP accountant to obtain an estimate or upper bound $(\varepsilon,\hat{\delta})$ of the actual privacy parameter. If $\delta^{\mathrm{est}}<\hat{\delta}$, the proposed privacy level is rejected, as it claims more privacy than the DP accountant certifies; otherwise, the test is passed. The input description of a mechanism $\mathcal{M}$ in this case can differ depending on the DP accounting method. For the Moment Accountant [ACG+16], the input description is the upper bound on the moment-generating function (MGF) of the privacy loss random variable of each individual mechanism. For FFT- and CLT-based methods, the input description is the cumulative distribution functions (CDFs) of the dominating distribution pair of each individual $\mathcal{M}_{i}$.
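This accountant-based verifier is a one-line decision rule. In the sketch below (ours), `accountant_delta` is a hypothetical callable standing in for any accountant (MA-, FFT-, or CLT-based) that returns its bound $\hat{\delta}$ at a given $\varepsilon$:

```python
def dpv_from_accountant(accountant_delta, eps, delta_est):
    # Accept the proposed (eps, delta_est) iff the accountant's bound
    # delta_hat at eps does not exceed delta_est; reject any proposal
    # that claims more privacy than the accountant certifies.
    return delta_est >= accountant_delta(eps)
```

For instance, against an accountant reporting $\hat{\delta}=10^{-5}$, a proposal of $\delta^{\mathrm{est}}=10^{-4}$ is accepted while $\delta^{\mathrm{est}}=10^{-6}$ is rejected.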

Algorithm 1 Estimate-Verify-Release (EVR) Framework
1:  Input: $\mathcal{M}$: mechanism. $D$: dataset. $(\varepsilon,\delta^{\mathrm{est}})$: an estimated privacy parameter for $\mathcal{M}$.
2:  if $\texttt{DPV}(\mathcal{M},\varepsilon,\delta^{\mathrm{est}})$ outputs $\mathtt{True}$ then Execute $\mathcal{M}(D)$.
3:  else Print $\bot$.

3.2 EVR: Ensuring Estimated Privacy with DP Verifier

We now present the full EVR paradigm. As suggested by the name, it contains three steps: (1) Estimate: a privacy parameter $(\varepsilon,\delta^{\mathrm{est}})$ for $\mathcal{M}$ is estimated, e.g., by a privacy auditing or accounting technique. (2) Verify: a DP verifier DPV is used to validate whether mechanism $\mathcal{M}$ satisfies the $(\varepsilon,\delta^{\mathrm{est}})$-DP guarantee. (3) Release: if the DP verification test passes, we execute $\mathcal{M}$ as usual; otherwise, the program is terminated immediately. For practical utility, the rejection probability needs to be small when $(\varepsilon,\delta^{\mathrm{est}})$ is an accurate estimate. The procedure is summarized in Algorithm 1.
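Algorithm 1 can be sketched in a few lines (our illustration; `mechanism` and `dpv` are placeholder callables):

```python
def evr(mechanism, dataset, dpv, eps, delta_est):
    # Algorithm 1 (Estimate-Verify-Release): verify first, then release.
    if dpv(mechanism, eps, delta_est):
        return mechanism(dataset)   # verification passed: run M(D) as usual
    return None                     # rejected: output the failure symbol (bottom)
```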

Given an estimated privacy parameter $(\varepsilon,\delta^{\mathrm{est}})$, we have the following privacy guarantee for the EVR paradigm:

Theorem. Algorithm 1 is $(\varepsilon,\delta^{\mathrm{est}}/\tau)$-DP for any $\tau>0$ if $\mathsf{FP}_{\texttt{DPV}}(\varepsilon,\delta^{\mathrm{est}};\tau)\leq\delta^{\mathrm{est}}/\tau$.

We defer the proof to Appendix B. The implication of this result is that, for any estimate of the privacy parameter, one can safely use it: a DPV with a bounded false positive rate enforces differential privacy. However, this alone is not enough: an overly conservative DPV that achieves a zero FP rate but rejects everything would not be useful. When $\delta^{\mathrm{est}}$ is accurate, we hope the DPV also achieves a small false negative rate, so that the output distributions of EVR and $\mathcal{M}$ are indistinguishable. We discuss the instantiation of DPV in Section 4.

4 Monte Carlo Verifier of Differential Privacy

As we can see from Section 3.2, a DP verifier (DPV) that achieves a small FP rate is the central element of the EVR framework. Meanwhile, it is also important that the DPV has a low FN rate in order to maintain the utility of EVR when the privacy parameter estimate is accurate. In this section, we introduce an instantiation of DPV based on the Monte Carlo technique that achieves both a low FP and a low FN rate, assuming the PRV is known for each individual mechanism.

Remark 9 (Mechanisms where PRV can be derived).

PRVs can be derived for many commonly used DP mechanisms, such as the Laplace, Gaussian, and subsampled Gaussian mechanisms [KJH20, GLW21]. In particular, our DP verifier applies to DP-SGD, one of the most important application scenarios of privacy accounting. Moreover, the availability of the PRV is also the assumption behind most recently developed privacy accounting techniques (including FFT- and CLT-based methods). Extending beyond these commonly used mechanisms is important future work for the field.

Remark 10 (Previous studies on the hardness of privacy verification).

Several studies [GNP20, BGG22] have shown that DP verification is NP-hard. However, these works consider the setting where the input description of the DP mechanism is its corresponding randomized Boolean circuit. Other works [GM18] show that DP verification is impossible, but this assertion is proved in the black-box setting where the verifier can only query the mechanism. Our work circumvents this barrier by providing the description of the mechanism's PRV as input to the verifier.

4.1 DPV through an MC Estimator for $\delta_{Y}(\varepsilon)$

Recall that most of the recently proposed DP accountants are essentially different techniques for estimating the expectation

$\delta_{Y=\sum_{i=1}^{k}Y_{i}}(\varepsilon)=\mathbb{E}_{Y}\left[\left(1-e^{\varepsilon-Y}\right)_{+}\right]$

where each $Y_{i}$ is the privacy loss random variable $Y_{i}=\log\left(\frac{P_{i}(t)}{Q_{i}(t)}\right)$ for $t\sim P_{i}$, and $(P_{i},Q_{i})$ is a pair of dominating distributions for the individual mechanism $\mathcal{M}_{i}$. In the following text, we denote the product distributions $\bm{P}:=P_{1}\times\ldots\times P_{k}$ and $\bm{Q}:=Q_{1}\times\ldots\times Q_{k}$. Recall from Lemma 5 that $(\bm{P},\bm{Q})$ is a pair of dominating distributions for the composed mechanism $\mathcal{M}$. For notational simplicity, we denote a vector $\bm{t}:=(t^{(1)},\ldots,t^{(k)})$.

Algorithm 2  $\texttt{DPV}(\mathcal{M},\varepsilon,\delta^{\mathrm{est}})$ with Simple MC Estimator and Offset Parameter $\Delta$.
1:  Obtain i.i.d. samples $\{\bm{t}_{i}\}_{i=1}^{m}$ from $\bm{P}$.
2:  Compute $\widehat{\delta}=\frac{1}{m}\sum_{i=1}^{m}\left(1-e^{\varepsilon-y_{i}}\right)_{+}$ with PRV samples $y_{i}=\log\left(\frac{\bm{P}(\bm{t}_{i})}{\bm{Q}(\bm{t}_{i})}\right)$, $i=1,\ldots,m$.
3:  if $\widehat{\delta}<\frac{\delta^{\mathrm{est}}}{\tau}-\Delta$ then return $\mathtt{True}$.
4:  else return $\mathtt{False}$.

The Monte Carlo (MC) technique is arguably one of the most natural and widely used techniques for approximating expectations. Since $\delta_{Y}(\varepsilon)$ is an expectation in terms of the PRV $Y$, one can apply MC-based techniques to estimate it. Given an MC estimator for $\delta_{Y}(\varepsilon)$, we construct a $\texttt{DPV}(\mathcal{M},\varepsilon,\delta^{\mathrm{est}})$ as shown in Algorithm 2 (instantiated with the Simple MC estimator introduced in Section 4.2). Specifically, we first obtain an estimate $\widehat{\delta}$ from an MC estimator for $\delta_{Y}(\varepsilon)$. The estimate $\delta^{\mathrm{est}}$ passes the test if $\widehat{\delta}<\frac{\delta^{\mathrm{est}}}{\tau}-\Delta$, and fails otherwise. The parameter $\Delta\geq 0$ here is an offset that allows us to conveniently control the $\tau$-relaxed false positive rate. We discuss how to set $\Delta$ in Section 4.4.
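A runnable sketch of Algorithm 2 (our illustration, not the paper's code): `sample_prv` is an assumed user-supplied callable that draws one sample of the composed PRV $Y=\sum_{i}Y_{i}$, and the decision rule accepts iff $\widehat{\delta}<\delta^{\mathrm{est}}/\tau-\Delta$.

```python
import math
import random

def dpv_smc(sample_prv, eps, delta_est, tau, Delta, m, seed=0):
    # Algorithm 2 with the simple MC estimator: average (1 - e^{eps - y})_+
    # over m i.i.d. PRV samples, then compare against delta_est/tau - Delta.
    rng = random.Random(seed)
    delta_hat = sum(
        max(0.0, 1.0 - math.exp(eps - sample_prv(rng))) for _ in range(m)
    ) / m
    return delta_hat < delta_est / tau - Delta
```

For a single Gaussian mechanism with $\sigma=1$ (true $\delta_{Y}(1)\approx 0.127$), a proposed $\delta^{\mathrm{est}}=0.15$ is accepted while $\delta^{\mathrm{est}}=0.05$ is rejected.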

In what follows, we first present two constructions of MC estimators for $\delta_{Y}(\varepsilon)$ in Section 4.2. We then discuss the conditions under which our MC-based DPV achieves a given target FP rate in Section 4.3. Finally, we discuss the utility guarantee of the MC-based DPV in Section 4.4.

4.2 Constructing MC Estimators for $\delta_{Y}(\varepsilon)$

In this section, we first present a simple MC estimator that applies to any mechanism for which we can derive and sample from the dominating distribution pair. Given the importance of the Poisson subsampled Gaussian mechanism for privacy-preserving machine learning, we further design a more advanced, specialized MC estimator for it based on the importance sampling technique.

Simple Monte Carlo Estimator. One can easily sample from $Y$ by sampling $\bm{t}\sim\bm{P}$ and outputting $\log\left(\frac{\bm{P}(\bm{t})}{\bm{Q}(\bm{t})}\right)$. Hence, a straightforward algorithm for estimating $\delta_{Y}(\varepsilon)$ is the Simple Monte Carlo (SMC) algorithm, which directly samples from the privacy loss random variable $Y$. We formally define it here.

Definition 11 (Simple Monte Carlo (SMC) Estimator).

We denote by $\widehat{\bm{\delta}}_{\texttt{MC}}^{m}(\varepsilon)$ the random variable of the SMC estimator for $\delta_{Y}(\varepsilon)$ with $m$ samples, i.e., $\widehat{\bm{\delta}}_{\texttt{MC}}^{m}(\varepsilon):=\frac{1}{m}\sum_{i=1}^{m}\left(1-e^{\varepsilon-y_{i}}\right)_{+}$ for $y_{1},\ldots,y_{m}$ i.i.d. sampled from $Y$.

Importance Sampling Estimator for Poisson Subsampled Gaussian (Overview). As $\delta_{Y}(\varepsilon)$ is usually a tiny value ($10^{-5}$ or even cryptographically small), naive sampling from $Y$ is likely to yield samples $\{(1-e^{\varepsilon-y_{i}})_{+}\}_{i=1}^{m}$ that are almost all 0s! That is, the i.i.d. samples $\{y_{i}\}_{i=1}^{m}$ from $Y$ rarely exceed $\varepsilon$. To further improve sample efficiency, one can potentially use more advanced MC techniques such as importance sampling or MCMC. However, these advanced tools usually require additional distributional information about $Y$ and thus need to be developed case by case.

The Poisson subsampled Gaussian mechanism is the main workhorse behind the DP-SGD algorithm [ACG+16]. Given its important role in privacy-preserving ML, we derive an advanced MC estimator for it based on the importance sampling technique. Importance Sampling (IS) is a classic method for rare-event simulation [TK10]. It samples from an alternative distribution instead of the distribution of the quantity of interest, and a weighting factor is then used to correct for the difference between the two distributions. The specific design of the alternative distribution is complicated and notation-heavy, so we defer the technical details to Appendix C. At a high level, we construct the alternative sampling distribution based on the exponential tilting technique and derive the optimal tilting parameter such that the corresponding IS estimator approximately achieves the smallest variance. Similarly to Definition 11, we use $\widehat{\bm{\delta}}_{\texttt{IS}}^{m}$ to denote the random variable of the importance sampling estimator with $m$ samples.
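To illustrate exponential tilting in the simplest case (this is not the paper's estimator for the subsampled Gaussian, whose tilting parameter is derived in Appendix C), the sketch below applies IS to a Gaussian PRV $Y\sim\mathcal{N}(\mu,2\mu)$. Tilting by $\theta$ yields the proposal $\mathcal{N}(\mu+2\mu\theta,2\mu)$, and the likelihood ratio $e^{\psi(\theta)-\theta y}$ with log-MGF $\psi(\theta)=\mu\theta+\mu\theta^{2}$ keeps the estimator unbiased; the heuristic choice of $\theta$ (recentering samples near $\varepsilon$) is ours.

```python
import math
import random

def delta_is_gaussian_prv(mu, eps, m, seed=0):
    # Importance-sampling sketch for delta_Y(eps) when Y ~ N(mu, 2*mu).
    # Proposal: N(mu + 2*mu*theta, 2*mu); weight: exp(psi(theta) - theta*y)
    # with psi(theta) = mu*theta + mu*theta^2 correcting the change of measure.
    theta = max(0.0, (eps - mu) / (2 * mu))  # heuristic: recenter samples near eps
    psi = mu * theta + mu * theta ** 2
    sd = math.sqrt(2 * mu)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        y = rng.gauss(mu + 2 * mu * theta, sd)
        total += math.exp(psi - theta * y) * max(0.0, 1.0 - math.exp(eps - y))
    return total / m
```

In the rare-event regime (e.g., $\mu=0.5$, $\varepsilon=4$, where $\delta_{Y}(\varepsilon)\approx 4.7\times 10^{-5}$), nearly every tilted sample lands near $\varepsilon$ and contributes a nonzero term, whereas naive SMC samples would almost all be 0.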

4.3 Bounding FP Rate

We now discuss the FP guarantee for the DPV instantiated with the estimators $\widehat{\bm{\delta}}_{\texttt{MC}}^{m}$ and $\widehat{\bm{\delta}}_{\texttt{IS}}^{m}$ developed in the last section. Since both estimators are unbiased, by the Law of Large Numbers, both $\widehat{\bm{\delta}}_{\texttt{MC}}^{m}$ and $\widehat{\bm{\delta}}_{\texttt{IS}}^{m}$ converge to $\delta_{Y}(\varepsilon)$ almost surely as $m\rightarrow\infty$, which leads to a DPV with perfect accuracy. Of course, $m$ cannot go to $\infty$ in practice. In the following, we derive the number of samples $m$ required to ensure that the $\tau$-relaxed false positive rate is smaller than $\delta^{\mathrm{est}}/\tau$ for $\widehat{\bm{\delta}}_{\texttt{MC}}^{m}$ and $\widehat{\bm{\delta}}_{\texttt{IS}}^{m}$. We use $\widehat{\bm{\delta}}_{\texttt{MC}}$ (or $\widehat{\bm{\delta}}_{\texttt{IS}}$) as an abbreviation for $\widehat{\bm{\delta}}_{\texttt{MC}}^{1}$ (or $\widehat{\bm{\delta}}_{\texttt{IS}}^{1}$), the random variable for a single draw. We state the theorem for $\widehat{\bm{\delta}}_{\texttt{MC}}^{m}$; the same result holds for $\widehat{\bm{\delta}}_{\texttt{IS}}^{m}$ by simply replacing $\widehat{\bm{\delta}}_{\texttt{MC}}$ with $\widehat{\bm{\delta}}_{\texttt{IS}}$. We use $\mathsf{FP}_{\texttt{MC}}$ to denote the FP rate of the DPV implemented with the SMC estimator.

{restatable}

theoremsmcsamplecomp Suppose $\mathbb{E}\left[(\widehat{\bm{\delta}}_{\texttt{MC}})^{2}\right]\leq\nu$. The DPV instantiated by $\widehat{\bm{\delta}}_{\texttt{MC}}^{m}$ has bounded $\tau$-relaxed false positive rate $\mathsf{FP}_{\texttt{MC}}(\varepsilon,\delta^{\mathrm{est}};\tau)\leq\delta^{\mathrm{est}}/\tau$ whenever $m\geq\frac{2\nu}{\Delta^{2}}\log(\tau/\delta^{\mathrm{est}})$.

The proof is based on Bennett’s inequality and is deferred to Appendix D. This result suggests that, to improve the computational efficiency of the MC-based DPV (i.e., to tighten the number of required samples), it is important to tightly bound $\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{MC}})^{2}]$ (or $\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS}})^{2}]$), the second moment of $\widehat{\bm{\delta}}_{\texttt{MC}}$ (or $\widehat{\bm{\delta}}_{\texttt{IS}}$).

Bounding the Second Moment of MC Estimators (Overview). For clarity, we defer the notation-heavy results and the derivation of the upper bounds for $\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{MC}})^{2}]$ and $\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS}})^{2}]$ to Appendix E. Our high-level idea for bounding $\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{MC}})^{2}]$ is to use the RDP guarantee for the composed mechanism $\mathcal{M}$. This is a natural idea, since converting RDP to upper bounds for $\delta_{Y}(\varepsilon)$ – the first moment of $\widehat{\bm{\delta}}_{\texttt{MC}}$ – is a well-studied problem [Mir17, CKS20, ALC+21]. Bounding $\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS}})^{2}]$ is considerably more technically involved.

4.4 Guaranteeing Utility

Overall picture so far. Given the proposed privacy parameter $(\varepsilon,\delta^{\mathrm{est}})$, a tolerable degree of underestimation $\tau$, and an offset parameter $\Delta$, one can now compute the number of samples $m$ required for the MC-based DPV so that the $\tau$-relaxed FP rate is at most $\delta^{\mathrm{est}}/\tau$, based on the results from Section 4.3 and Appendix E. We have not yet discussed the selection of the hyperparameter $\Delta$. An appropriate $\Delta$ is important for the utility of the MC-based DPV: when $\delta^{\mathrm{est}}$ is not too much smaller than $\delta_{Y}(\varepsilon)$, the probability of being rejected by the DPV should stay negligible. If we set $\Delta\rightarrow\infty$, the DPV simply rejects everything, which achieves 0 FP rate (with $m=0$) but is not useful at all!

Formally, the utility of a DPV is quantified by the $\rho$-relaxed false negative (FN) rate (Definition 8). While one may be able to bound the FN rate through concentration inequalities, a more convenient approach is to pick $\Delta$ such that $\mathsf{FN}_{\texttt{DPV}}$ is approximately smaller than $\mathsf{FP}_{\texttt{DPV}}$. After all, $\mathsf{FP}_{\texttt{DPV}}$ already has to be a small value $\leq\delta^{\mathrm{est}}/\tau$ for the privacy guarantee. The result is stated informally below (it holds for both $\widehat{\bm{\delta}}_{\texttt{MC}}$ and $\widehat{\bm{\delta}}_{\texttt{IS}}$); the involved derivation is deferred to Appendix F.

Theorem 12 (Informal).

If $\Delta=0.4\left(1/\tau-1/\rho\right)\delta^{\mathrm{est}}$, then $\mathsf{FN}_{\texttt{MC}}(\varepsilon,\delta^{\mathrm{est}};\rho)\lessapprox\mathsf{FP}_{\texttt{MC}}(\varepsilon,\delta^{\mathrm{est}};\tau)$.

Therefore, by setting $\Delta=0.4\left(1/\tau-1/\rho\right)\delta^{\mathrm{est}}$, one can ensure that $\mathsf{FN}_{\texttt{MC}}(\varepsilon,\delta^{\mathrm{est}};\rho)$ is also (approximately) upper bounded by $\Theta(\delta^{\mathrm{est}}/\tau)$. Moreover, in the appendix we empirically show that the FP rate is actually a very conservative bound for the FN rate. Both $\tau$ and $\rho$ are selected based on the tradeoff between privacy, utility, and efficiency.
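As a concrete illustration, the offset heuristic and the sample-complexity bound can be wired into a small helper. The function name and interface are ours, not the paper's; the constants come from the theorem in Section 4.3 and the informal Theorem 12.

```python
import math

def verifier_hyperparams(delta_est, tau, rho, nu):
    """Given proposed delta_est, relaxation factors tau < rho (<= 1), and a
    second-moment bound nu on the single-sample estimator, return the offset
    Delta = 0.4*(1/tau - 1/rho)*delta_est (Theorem-12 heuristic) and the
    sample count m >= (2*nu / Delta^2) * log(tau / delta_est) (Section 4.3).
    Note log(tau / delta_est) > 0 since delta_est < tau."""
    assert 0 < tau < rho <= 1 and 0 < delta_est < tau
    offset = 0.4 * (1.0 / tau - 1.0 / rho) * delta_est  # Delta > 0 since tau < rho
    m = math.ceil(2.0 * nu / offset**2 * math.log(tau / delta_est))
    return offset, m
```

The assertion encodes the regime in which both results apply: $\tau < \rho$ makes the offset positive, and $\delta^{\mathrm{est}} < \tau$ makes the logarithm positive.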

The pseudocode of privacy verification for DP-SGD is summarized in Appendix G.

5 Monte Carlo Accountant of Differential Privacy

The Monte Carlo estimators $\widehat{\bm{\delta}}_{\texttt{MC}}$ and $\widehat{\bm{\delta}}_{\texttt{IS}}$ described in Section 4.2 are used for implementing DP verifiers. One may already realize that the same estimators can also be utilized to directly implement a DP accountant that estimates $\delta_{Y}(\varepsilon)$. It is important to note that with the EVR paradigm, DP accountants are no longer required to derive a strict upper bound for $\delta_{Y}(\varepsilon)$. We refer to the technique of estimating $\delta_{Y}(\varepsilon)$ with the MC estimators as the Monte Carlo accountant.

Algorithm 3 MC Accountant for $\varepsilon_{Y}(\delta)$.
1:  Obtain PRV samples $\{y_{i}\}_{i=1}^{m}$ with either Simple MC or Importance Sampling.
2:  Binary search for $\varepsilon$ such that $\frac{1}{m}\sum_{i=1}^{m}\left(1-e^{\varepsilon-y_{i}}\right)_{+}=\delta$.
3:  Return $\varepsilon$.

Finding $\varepsilon$ for a given $\delta$. It is straightforward to implement the MC accountant when we fix $\varepsilon$ and compute $\delta_{Y}(\varepsilon)$. In practice, privacy practitioners often want the inverse: finding $\varepsilon$ for a given $\delta$, which we denote by $\varepsilon_{Y}(\delta)$. Similar to existing privacy accounting methods, we use binary search to find $\varepsilon_{Y}(\delta)$ (see Algorithm 3). Specifically, after generating PRV samples $\{y_{i}\}_{i=1}^{m}$, we simply need to find the $\varepsilon$ such that $\frac{1}{m}\sum_{i=1}^{m}\left(1-e^{\varepsilon-y_{i}}\right)_{+}=\delta$. We do not need to generate new PRV samples for the different values of $\varepsilon$ evaluated during the binary search; hence the additional binary search is computationally efficient.
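A sketch of line 2 of Algorithm 3 follows. The empirical map $\varepsilon \mapsto \frac{1}{m}\sum_i (1-e^{\varepsilon-y_i})_+$ is non-increasing in $\varepsilon$, so plain bisection applies; the fixed PRV samples are reused for every probed $\varepsilon$. The search bracket [0, 64] is our arbitrary choice.

```python
import numpy as np

def eps_from_samples(y, delta, lo=0.0, hi=64.0, iters=50):
    """Invert the MC estimate: find eps with mean((1 - exp(eps - y))_+) = delta.
    y holds fixed PRV samples, drawn once and reused across the whole search."""
    def delta_hat(eps):
        return np.mean(np.maximum(1.0 - np.exp(eps - y), 0.0))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if delta_hat(mid) > delta:
            lo = mid  # estimate too large -> need bigger eps
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Since `delta_hat` is continuous and 1-Lipschitz in `eps`, 50 bisection steps pin down $\varepsilon_Y(\delta)$ to well below floating-point noise.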

Number of Samples for MC Accountant. Compared with the number of samples required for achieving the FP guarantee in Section 4.3, one may be able to use far fewer samples to obtain a decent estimate of $\delta_{Y}(\varepsilon)$, as the sample complexity bound derived from concentration inequalities may be conservative. Many heuristics for choosing the number of samples in MC simulation have been developed (e.g., the Wald confidence interval) and can be applied to the setting of MC accountants.

Compared with FFT-based and CLT-based methods, the MC accountant exhibits the following strengths:

(1) Accurate $\delta_{Y}(\varepsilon)$ estimation in all regimes. As we mentioned earlier, the state-of-the-art FFT-based method [GLW21] fails to provide meaningful bounds due to computational limitations when the true value of $\delta_{Y}(\varepsilon)$ is small. In contrast, the simplicity of the MC accountant allows us to accurately estimate $\delta_{Y}(\varepsilon)$ in all regimes.

(2) Short wall-clock runtime & easy GPU acceleration. MC-based techniques are well-suited for parallel computing and GPU acceleration due to their nature of repeated sampling. One can easily utilize PyTorch’s CUDA functionality (e.g., torch.randn(size=(k,m)).cuda()*sigma+mu) to significantly boost the computational efficiency of sampling from common distributions such as the Gaussian. In Appendix H, we show that when using one NVIDIA A100 GPU, the runtime of sampling the Gaussian mixture $(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(1,\sigma^{2})$ can be improved by $10^{3}$ times compared with the CPU-only setting.
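For reference, here is a CPU sketch (NumPy) of sampling that mixture; on GPU, each of the two vectorized steps maps to the corresponding torch call in the snippet above. The function name is ours.

```python
import numpy as np

def sample_prv_mixture(q, sigma, size, rng):
    """Draw `size` samples from (1-q)*N(0, sigma^2) + q*N(1, sigma^2):
    each sample's mean is 1 with probability q (the 'record was subsampled'
    component) and 0 otherwise, then Gaussian noise of scale sigma is added.
    Both steps are fully vectorized -- no Python-level loop."""
    means = (rng.random(size) < q).astype(np.float64)  # Bernoulli(q) component choice
    return rng.normal(means, sigma)
```

Because the per-sample work is a handful of elementwise operations, the same code parallelizes trivially across millions of samples.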

(3) Efficient online privacy accounting. When training ML models with DP-SGD or its variants, a privacy practitioner usually wants to compute the running privacy leakage at every training iteration and pick the checkpoint with the best utility-privacy tradeoff. This involves estimating $\delta_{Y^{(i)}}(\varepsilon)$ for every $i=1,\ldots,k$, where $Y^{(i)}:=\sum_{j=1}^{i}Y_{j}$. We refer to such a scenario as online privacy accounting (note that this is different from the scenario of a privacy odometer [RRUV16], as here the privacy parameter of the next individual mechanism is not adaptively chosen). The MC accountant is especially efficient for online privacy accounting: when estimating $\delta_{Y^{(i)}}(\varepsilon)$, one can re-use the samples previously drawn from $Y_{1},\ldots,Y_{i-1}$ that were used for estimating the privacy loss at earlier iterations.
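A sketch of the sample-reuse idea: each of the $m$ Monte Carlo "trajectories" keeps a running sum of per-step PRV draws, so estimating $\delta_{Y^{(i)}}(\varepsilon)$ at step $i$ costs only one fresh batch of samples. The class and the per-step PRV are our illustrative choices (here the Gaussian-mechanism PRV $\mathcal{N}(\mu, 2\mu)$, $\mu=1/(2\sigma^2)$; for DP-SGD one would instead sample the subsampled-Gaussian PRV).

```python
import numpy as np

class OnlineMCAccountant:
    """Reuse PRV samples across iterations: maintain m running sums of the
    composed PRV Y^(i) = Y_1 + ... + Y_i, adding fresh per-step samples."""
    def __init__(self, m, seed=0):
        self.rng = np.random.default_rng(seed)
        self.m = m
        self.cum = np.zeros(m)  # one running PRV sum per MC sample

    def step(self, sigma):
        # illustrative per-step PRV: Gaussian mechanism, N(mu, 2*mu)
        mu = 1.0 / (2.0 * sigma**2)
        self.cum += self.rng.normal(mu, np.sqrt(2.0 * mu), self.m)

    def delta(self, eps):
        # Simple MC estimate of delta_{Y^(i)}(eps) at the current step i
        return float(np.mean(np.maximum(1.0 - np.exp(eps - self.cum), 0.0)))
```

Calling `delta` after every `step` gives the whole running-leakage curve for roughly the cost of a single offline estimate.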

These advantages are justified empirically in Section 6 and Appendix H.

6 Numerical Experiments

In this section, we conduct numerical experiments to illustrate that (1) the EVR paradigm with MC verifiers enables a tighter privacy analysis, and (2) the MC accountant achieves state-of-the-art performance in privacy parameter estimation.

6.1 EVR vs Upper Bound

To illustrate the advantage of the EVR paradigm over directly using a strict upper bound for privacy parameters, we take the current state-of-the-art DP accountant, the FFT-based method from [GLW21], as an example.

Figure 3: Privacy analysis and runtime of the EVR paradigm. The settings are the same as in Figure 1. For (a), when $\tau>0.9$, the curves are indistinguishable from ‘Exact’. For a fair comparison, we set $\rho=(1+\tau)/2$ and set $\Delta$ according to Theorem 12, which ensures EVR’s failure probability is of the order of $\delta$. For (b), the runtime is estimated on an NVIDIA A100-SXM4-80GB GPU.

EVR provides a tighter privacy guarantee. Recall that in Figure 1, the FFT-based method provides a vacuous bound when the ground-truth $\delta_{Y}(\varepsilon)<10^{-10}$. Under the same hyperparameter setting, Figure 3 (a) shows the privacy bound of the EVR paradigm where the $\delta^{\mathrm{est}}$ values are FFT’s estimates. We use the Importance Sampling estimator $\widehat{\bm{\delta}}_{\texttt{IS}}$ for DP verification. We experiment with different values of $\tau$; a higher value of $\tau$ leads to a tighter privacy guarantee but a longer runtime. For a fair comparison, the EVR’s output distribution needs to be almost indistinguishable from that of the original mechanism. We set $\rho=(1+\tau)/2$ and set $\Delta$ according to the heuristic from Theorem 12. This guarantees that, as long as the estimate $\delta^{\mathrm{est}}$ from FFT is not a big underestimate (i.e., as long as $\delta^{\mathrm{est}}\geq\rho\delta_{Y}(\varepsilon)$), the failure probability of the EVR paradigm is negligible ($O(\delta_{Y}(\varepsilon))$). The ‘FFT-EVR’ curve in Figure 3 (a) is essentially the ‘FFT-est’ curve in Figure 1 scaled up by $1/\tau$. As we can see, EVR provides a significantly better privacy analysis in the regime where ‘FFT-upp’ is not meaningful ($\delta<10^{-10}$).

EVR incurs little extra runtime. In Figure 3 (b), we plot the runtime of the Importance Sampling verifier for different $\tau\geq 0.9$. Note that for $\tau>0.9$, the privacy curves are indistinguishable from ‘Exact’ in Figure 3 (a). The runtime of EVR is determined by the number of samples required to achieve the target $\tau$-relaxed FP rate from Theorem 4.3. A smaller $\tau$ leads to faster DP verification. As we can see, even when $\tau=0.99$, the runtime of DP verification in EVR is $<2$ minutes. This is attributable to the sample-efficient IS estimator and GPU acceleration.

Figure 4: Utility-privacy tradeoff curve for fine-tuning ImageNet-pretrained BEiT [BDPW21] on CIFAR100 when $\delta=10^{-5}$. We follow the training procedure from [PTS+22].

EVR provides a better privacy-utility tradeoff for privacy-preserving ML with minimal time consumption. To further underscore the superiority of the EVR paradigm in practical applications, we illustrate the privacy-utility tradeoff curve when fine-tuning on the CIFAR100 dataset with DP-SGD. As shown in Figure 4, the EVR paradigm yields a lower test error across all privacy budgets $\varepsilon$ compared with the traditional upper bound method. For instance, it achieves around 7% (relative) error reduction when $\varepsilon=0.6$. The runtime required for privacy verification is less than $10^{-10}$ of the training time for all $\varepsilon$, and is thus negligible. We provide additional experimental results in Appendix H.

6.2 MC Accountant

We evaluate the MC accountant proposed in Section 5. We focus on privacy accounting for the composition of Poisson Subsampled Gaussian mechanisms, the algorithm behind the famous DP-SGD [ACG+16]. The mechanism is specified by the noise magnitude $\sigma$ and the subsampling rate $q$.

Settings. We consider two practical scenarios of privacy accounting: (1) offline accounting, which aims at estimating $\delta_{Y^{(k)}}(\varepsilon)$, and (2) online accounting, which aims at estimating $\delta_{Y^{(i)}}(\varepsilon)$ for all $i=1,\ldots,k$. Due to space constraints, we only show the results of online accounting here and defer the results for offline accounting to Appendix H.

Metric: Relative Error. To easily and fairly evaluate the performance of privacy parameter estimation, we compute the almost exact (yet computationally expensive) privacy parameters as ground-truth values. The ground-truth value allows us to compute the relative error of an estimate of the privacy leakage: if the ground truth corresponding to an estimate $\widehat{\delta}$ is $\delta$, then the relative error is $r_{\mathrm{err}}=|\widehat{\delta}-\delta|/\delta$.

Implementation. For the MC accountant, we use the IS estimator described in Section 4.2. For baselines, in addition to the FFT-based and CLT-based methods mentioned earlier, we also examine AFA [ZDW22] and the GDP accountant [BDLS20]. For a fair comparison, we adjust the number of samples for the MC accountant so that its runtime is comparable to that of FFT. Note that we compare against the privacy parameter estimates, rather than the upper bounds, from the baselines. Detailed settings for both the MC accountant and the baselines are provided in Appendix H.

Figure 5: Experiment for composing Subsampled Gaussian Mechanisms in the online setting. (a) Compares the relative error in approximating $k\mapsto\varepsilon_{Y}(\delta)$. The error bar for the MC accountant is the variance over 5 independent runs. Note that the y-axis is in log scale. (b) Compares the cumulative runtime for online privacy accounting. We do not show AFA [ZDW22] as it does not terminate within 24 hours.

Results for Online Accounting: the MC accountant is both more accurate and more efficient. Figure 5 (a) shows the online accounting results for $(\sigma,\delta,q)=(1.0,10^{-9},10^{-3})$. As we can see, the MC accountant outperforms all of the baselines in estimating $\varepsilon_{Y}(\delta)$. The sharp decrease for FFT at approximately 250 steps is due to the transition of FFT’s estimates from underestimating before this point to overestimating after it. Figure 5 (b) shows that the MC accountant is around 5 times faster than FFT, the baseline with the best performance in (a). This showcases the MC accountant’s efficiency and accuracy in the online setting.

7 Conclusion & Limitations

This paper tackles the challenge of deriving provable privacy leakage upper bounds in privacy accounting. We present the estimate-verify-release (EVR) paradigm, which enables the safe use of privacy parameter estimates. Limitations. Currently, our MC-based DP verifier and accountant require known and efficiently samplable dominating pairs and PRVs for the individual mechanisms. Fortunately, this applies to commonly used mechanisms such as the Gaussian mechanism and DP-SGD. Generalizing the MC-based DP verifier and accountant to other mechanisms is an interesting direction for future work.

Acknowledgments

This work was supported in part by the National Science Foundation under grants CNS-2131938, CNS-1553437, CNS-1704105, the ARL’s Army Artificial Intelligence Innovation Institute (A2I2), the Office of Naval Research Young Investigator Award, the Army Research Office Young Investigator Prize, Schmidt DataX award, and Princeton E-ffiliates Award, Amazon-Virginia Tech Initiative in Efficient and Robust Machine Learning, and Princeton’s Gordon Y. S. Wu Fellowship. We are grateful to anonymous reviewers at NeurIPS for their valuable feedback.

References

  • [ACG+16] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • [AGA+23] Wael Alghamdi, Juan Felipe Gomez, Shahab Asoodeh, Flavio Calmon, Oliver Kosut, and Lalitha Sankar. The saddle-point method in differential privacy. In Proceedings of the 40th International Conference on Machine Learning, pages 508–528, 2023.
  • [ALC+21] Shahab Asoodeh, Jiachun Liao, Flavio P Calmon, Oliver Kosut, and Lalitha Sankar. Three variants of differential privacy: Lossless conversion and applications. IEEE Journal on Selected Areas in Information Theory, 2(1):208–222, 2021.
  • [BDLS20] Zhiqi Bu, Jinshuo Dong, Qi Long, and Weijie J Su. Deep learning with gaussian differential privacy. Harvard Data Science Review, 2020(23), 2020.
  • [BDPW21] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  • [BGG22] Mark Bun, Marco Gaboardi, and Ludmila Glinskih. The complexity of verifying boolean programs as differentially private. In 2022 IEEE 35th Computer Security Foundations Symposium (CSF), pages 396–411. IEEE, 2022.
  • [BO13] Gilles Barthe and Federico Olmedo. Beyond differential privacy: Composition theorems and relational logic for f-divergences between probabilistic programs. In International Colloquium on Automata, Languages, and Programming, pages 49–60. Springer, 2013.
  • [BSBV21] Benjamin Bichsel, Samuel Steffen, Ilija Bogunovic, and Martin Vechev. Dp-sniper: Black-box discovery of differential privacy violations using classifiers. In 2021 IEEE Symposium on Security and Privacy (SP), pages 391–409. IEEE, 2021.
  • [CKS20] Clément L Canonne, Gautam Kamath, and Thomas Steinke. The discrete gaussian for differential privacy. Advances in Neural Information Processing Systems, 33:15676–15688, 2020.
  • [DGK+22] Vadym Doroshenko, Badih Ghazi, Pritish Kamath, Ravi Kumar, and Pasin Manurangsi. Connect the dots: Tighter discrete approximations of privacy loss distributions. Proceedings on Privacy Enhancing Technologies, 4:552–570, 2022.
  • [DL09] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 371–380, 2009.
  • [DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • [DR+14] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • [DRS19] Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3–37, 2019.
  • [DRV10] Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51–60. IEEE, 2010.
  • [GJG+22] Daniele Gorla, Louis Jalouzot, Federica Granese, Catuscia Palamidessi, and Pablo Piantanida. On the (im) possibility of estimating various notions of differential privacy. arXiv preprint arXiv:2208.14414, 2022.
  • [GKKM22] Badih Ghazi, Pritish Kamath, Ravi Kumar, and Pasin Manurangsi. Faster privacy accounting via evolving discretization. In International Conference on Machine Learning, pages 7470–7483. PMLR, 2022.
  • [GLW21] Sivakanth Gopi, Yin Tat Lee, and Lukas Wutschitz. Numerical composition of differential privacy. Advances in Neural Information Processing Systems, 34:11631–11642, 2021.
  • [GM18] Anna C Gilbert and Audra McMillan. Property testing for differential privacy. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 249–258. IEEE, 2018.
  • [GNP20] Marco Gaboardi, Kobbi Nissim, and David Purser. The complexity of verifying loop-free programs as differentially private. In 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
  • [JUO20] Matthew Jagielski, Jonathan Ullman, and Alina Oprea. Auditing differentially private machine learning: How private is private sgd? Advances in Neural Information Processing Systems, 33:22205–22216, 2020.
  • [KH21] Antti Koskela and Antti Honkela. Computing differential privacy guarantees for heterogeneous compositions using fft. CoRR, abs/2102.12412, 2021.
  • [KJH20] Antti Koskela, Joonas Jälkö, and Antti Honkela. Computing tight differential privacy guarantees using fft. In International Conference on Artificial Intelligence and Statistics, pages 2560–2569. PMLR, 2020.
  • [KJPH21] Antti Koskela, Joonas Jälkö, Lukas Prediger, and Antti Honkela. Tight differential privacy for discrete-valued mechanisms and for the subsampled gaussian mechanism using fft. In International Conference on Artificial Intelligence and Statistics, pages 3358–3366. PMLR, 2021.
  • [KOV15] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International conference on machine learning, pages 1376–1385. PMLR, 2015.
  • [LMF+22] Fred Lu, Joseph Munoz, Maya Fuchs, Tyler LeBlond, Elliott V Zaresky-Williams, Edward Raff, Francis Ferraro, and Brian Testa. A general framework for auditing differentially private machine learning. In Advances in Neural Information Processing Systems, 2022.
  • [LO19] Xiyang Liu and Sewoong Oh. Minimax optimal estimation of approximate differential privacy on neighboring databases. Advances in neural information processing systems, 32, 2019.
  • [LWMIZ22] Yun Lu, Yu Wei, Malik Magdon-Ismail, and Vassilis Zikas. Eureka: A general framework for black-box differential privacy estimators. Cryptology ePrint Archive, 2022.
  • [Mir17] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.
  • [MSCJ22] Saeed Mahloujifar, Alexandre Sablayrolles, Graham Cormode, and Somesh Jha. Optimal membership inference bounds for adaptive composition of sampled gaussian mechanisms. arXiv preprint arXiv:2204.06106, 2022.
  • [MV16] Jack Murtagh and Salil Vadhan. The complexity of computing the optimal composition of differential privacy. In Theory of Cryptography Conference, pages 157–175. Springer, 2016.
  • [NHS+23] Milad Nasr, Jamie Hayes, Thomas Steinke, Borja Balle, Florian Tramèr, Matthew Jagielski, Nicholas Carlini, and Andreas Terzis. Tight auditing of differentially private machine learning. arXiv preprint arXiv:2302.07956, 2023.
  • [NST+21] Milad Nasr, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, and Nicholas Carlini. Adversary instantiation: Lower bounds for differentially private machine learning. In 2021 IEEE Symposium on Security and Privacy (SP), pages 866–882. IEEE, 2021.
  • [PTS+22] Ashwinee Panda, Xinyu Tang, Vikash Sehwag, Saeed Mahloujifar, and Prateek Mittal. Dp-raft: A differentially private recipe for accelerated fine-tuning. arXiv preprint arXiv:2212.04486, 2022.
  • [RRUV16] Ryan M Rogers, Aaron Roth, Jonathan Ullman, and Salil Vadhan. Privacy odometers and filters: Pay-as-you-go composition. Advances in Neural Information Processing Systems, 29, 2016.
  • [RZW23] Rachel Redberg, Yuqing Zhu, and Yu-Xiang Wang. Generalized ptr: User-friendly recipes for data-adaptive algorithms with differential privacy. In International Conference on Artificial Intelligence and Statistics, pages 3977–4005. PMLR, 2023.
  • [SBV+22] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • [SNJ23] Thomas Steinke, Milad Nasr, and Matthew Jagielski. Privacy auditing with one (1) training run. arXiv preprint arXiv:2305.08846, 2023.
  • [TK10] Surya T Tokdar and Robert E Kass. Importance sampling: a review. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):54–60, 2010.
  • [Vad17] Salil Vadhan. The complexity of differential privacy. Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, pages 347–450, 2017.
  • [WGZ+22] Hua Wang, Sheng Gao, Huanyu Zhang, Milan Shen, and Weijie J Su. Analytical composition of differential privacy via the edgeworth accountant. arXiv preprint arXiv:2206.04236, 2022.
  • [WMW+22] Jiachen T Wang, Saeed Mahloujifar, Shouda Wang, Ruoxi Jia, and Prateek Mittal. Renyi differential privacy of propose-test-release and applications to private and robust machine learning. Advances in Neural Information Processing Systems, 35:38719–38732, 2022.
  • [ZDW22] Yuqing Zhu, Jinshuo Dong, and Yu-Xiang Wang. Optimal accounting of differential privacy via characteristic function. In International Conference on Artificial Intelligence and Statistics, pages 4782–4817. PMLR, 2022.
  • [ZKHB22] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.

Appendix A Extended Related Work

In this section, we review related work on privacy accounting, privacy verification, and privacy auditing, and we discuss the connection between our EVR paradigm and the famous Propose-Test-Release (PTR) paradigm.

Privacy Accounting.

Early privacy accounting techniques such as the Advanced Composition Theorem [DRV10] only make use of the privacy parameters of the individual mechanisms; that is, they bound $\delta_{\mathcal{M}}(\varepsilon)$ in terms of the privacy parameter $(\varepsilon_{i},\delta_{i})$ of each $\mathcal{M}_{i}$, $i=1,\ldots,k$. The optimal bound for $\delta_{\mathcal{M}}(\varepsilon)$ under this condition has been derived [KOV15, MV16]. However, computing the optimal bound is #P-hard in general. Moreover, bounding $\delta_{\mathcal{M}}(\varepsilon)$ only in terms of $(\varepsilon_{i},\delta_{i})$ is often sub-optimal for many commonly used mechanisms [MV16]. This disadvantage has spurred many recent advances in privacy accounting that make use of more statistical information about the specific mechanisms to be composed [ACG+16, Mir17, KJH20, BDLS20, KH21, KJPH21, GLW21, ZDW22, GKKM22, DGK+22, WGZ+22, AGA+23]. All of these works can be described as approximating the expectation in (2) when $Y=\sum_{i=1}^{k}Y_{i}$. For instance, the line of work [KJH20, BDLS20, KH21, KJPH21, GLW21, GKKM22, DGK+22] discretizes the domain of each $Y_{i}$ and uses the Fast Fourier Transform (FFT) to speed up the approximation of $\delta_{Y}(\varepsilon)$. [ZDW22] tracks the characteristic function of the privacy loss random variable of the composed mechanism, and still requires discretization when the mechanisms do not have closed-form characteristic functions. The line of work [BDLS20, WGZ+22] uses the Central Limit Theorem (CLT) to approximate the distribution of $Y=\sum_{i=1}^{k}Y_{i}$ by a Gaussian and uses finite-sample bounds to derive a strict upper bound for $\delta_{Y}(\varepsilon)$. We also note that [MSCJ22] uses Monte Carlo approaches to calculate optimal membership inference bounds, with a Simple MC estimator similar to the one in Section 4.2. Although their Monte Carlo approach is similar, their error analysis only works for large values of $\delta$ ($\delta\approx 0.5$), as they use sub-optimal concentration bounds.

Privacy Verification.

As we mentioned in Remark 10, some previous works have also studied the problem of privacy verification. Most of these works consider either a “white-box” setting, in which the input description of the DP mechanism is its corresponding randomized Boolean circuit [GNP20, BGG22], or an even more stringent “black-box” setting, in which the verifier can only query the mechanism [GM18, LO19, BSBV21, GJG+22, LWMIZ22]. In contrast, our MC verifier is designed specifically for mechanisms whose PRV can be derived, which includes many commonly used mechanisms such as the Subsampled Gaussian mechanism.

Privacy verification via auditing.

Several heuristic approaches to DP verification form a line of work called auditing differential privacy [JUO20, NST+21, LMF+22, NHS+23, SNJ23]. Specifically, these techniques can verify a claimed privacy parameter by computing a lower bound for the actual privacy parameter and comparing it with the claimed one. The input description of the mechanism $\mathcal{M}$ for the DPV in this case is a black-box oracle $\mathcal{M}(\cdot)$: the DPV makes multiple queries to $\mathcal{M}(\cdot)$ and estimates the actual privacy leakage. Privacy auditing techniques can achieve 100% accuracy when $\delta^{\mathrm{est}}>\delta_{Y}(\varepsilon)$ (or 0 $\rho$-FN rate for any $\rho\leq 1$), as the computed lower bound is guaranteed to be smaller than $\delta^{\mathrm{est}}$. However, when $\delta^{\mathrm{est}}$ lies between $\delta_{Y}(\varepsilon)$ and the computed lower bound, the DP verification will be incorrect. Moreover, such techniques have no guarantee on the lower bound’s tightness.

Remark 13 (Connection between our EVR paradigm and the Propose-Test-Release (PTR) paradigm [DL09]).

PTR is a classic differential privacy paradigm introduced over a decade ago by [DL09] and generalized in [RZW23, WMW+22]. At a high level, PTR privately checks whether releasing the query answer with a certain amount of randomness is safe. If the test passes, the query answer is released; otherwise, the program is terminated. PTR shares a similar underlying philosophy with our EVR paradigm, but the two are fundamentally different in implementation. The verification step in EVR is completely independent of the dataset, whereas the test step in PTR measures the privacy risk of the mechanism $\mathcal{M}$ on a specific dataset $D$, which means the test itself may cause additional privacy leakage. One way to think about the difference is that EVR asks “whether $\mathcal{M}$ is private”, while PTR asks “whether $\mathcal{M}(D)$ is private”.

Appendix B Proofs for Privacy

\privguarantee*

Proof.

For any mechanism $\mathcal{M}$, we denote by $A$ the event that $\delta^{\mathrm{est}}\geq\tau\delta_{Y}(\varepsilon)$, and define the indicator variable $B=\mathbbm{1}[\texttt{DPV}(\mathcal{M},\varepsilon,\delta^{\mathrm{est}};\tau)=\mathtt{True}]$. Note that event $A$ implies that $\mathcal{M}$ is $(\varepsilon,\delta^{\mathrm{est}}/\tau)$-DP.

Thus, we know that

$$\Pr[B=1\mid\bar{A}]\leq\mathsf{FP}_{\texttt{DPV}}(\varepsilon,\delta^{\mathrm{est}};\tau)\qquad(1)$$

For notational simplicity, we also denote $p_{\mathsf{FP}}:=\Pr[B=1\mid\bar{A}]$ and $p_{\mathsf{TP}}:=\Pr[B=1\mid A]$.

For any possible event SS,

Praug[aug(D)S]\displaystyle\Pr_{\mathcal{M}^{\mathrm{aug}}}\left[\mathcal{M}^{\mathrm{aug}}(D)\in S\right]
=Pr[(D)S|B=1]Pr[B=1]+I[S]Pr[B=0]\displaystyle=\Pr_{\mathcal{M}}[\mathcal{M}(D)\in S|B=1]\Pr[B=1]+I[\bot\in S]\Pr[B=0]
=Pr[(D)S|B=1,A]Pr[B=1|A]I[A]+Pr[(D)S|B=1,A¯]Pr[B=1|A¯]I[A¯]\displaystyle=\Pr_{\mathcal{M}}[\mathcal{M}(D)\in S|B=1,A]\Pr[B=1|A]I[A]+\Pr_{\mathcal{M}}[\mathcal{M}(D)\in S|B=1,\bar{A}]\Pr[B=1|\bar{A}]I[\bar{A}]
+I[S]Pr[B=0]\displaystyle~{}~{}~{}~{}+I[\bot\in S]\Pr[B=0]
(eεPr[(D)S|B=1,A]+δestτ)Pr[B=1|A]I[A]\displaystyle\leq\left(e^{\varepsilon}\Pr_{\mathcal{M}}[\mathcal{M}(D^{\prime})\in S|B=1,A]+\frac{\delta^{\mathrm{est}}}{\tau}\right)\Pr[B=1|A]I[A]
+Pr[(D)S|B=1,A¯]Pr[B=1|A¯]I[A¯]\displaystyle~{}~{}~{}~{}+\Pr_{\mathcal{M}}[\mathcal{M}(D)\in S|B=1,\bar{A}]\Pr[B=1|\bar{A}]I[\bar{A}]
+I[S]Pr[B=0]\displaystyle~{}~{}~{}~{}+I[\bot\in S]\Pr[B=0]
(eεPr[(D)S|B=1,A]+δestτ)p𝖳𝖯I[A]+p𝖥𝖯I[A¯]+I[S]Pr[B=0]\displaystyle\leq\left(e^{\varepsilon}\Pr_{\mathcal{M}}[\mathcal{M}(D^{\prime})\in S|B=1,A]+\frac{\delta^{\mathrm{est}}}{\tau}\right)p_{\mathsf{TP}}I[A]+p_{\mathsf{FP}}I[\bar{A}]+I[\bot\in S]\Pr[B=0]
eε(Pr[(D)S|B=1,A]p𝖳𝖯I[A]+Pr[(D)S|B=1,A¯]p𝖥𝖯I[A¯]+I[S]Pr[B=0])\displaystyle\leq e^{\varepsilon}\left(\Pr_{\mathcal{M}}[\mathcal{M}(D^{\prime})\in S|B=1,A]p_{\mathsf{TP}}I[A]+\Pr_{\mathcal{M}}[\mathcal{M}(D^{\prime})\in S|B=1,\bar{A}]p_{\mathsf{FP}}I[\bar{A}]+I[\bot\in S]\Pr[B=0]\right)
+δestτp𝖳𝖯I[A]+p𝖥𝖯I[A¯]\displaystyle~{}~{}~{}~{}~{}~{}+\frac{\delta^{\mathrm{est}}}{\tau}p_{\mathsf{TP}}I[A]+p_{\mathsf{FP}}I[\bar{A}]
eεPraug[aug(D)S]+max(δestp𝖳𝖯τ,p𝖥𝖯)\displaystyle\leq e^{\varepsilon}\Pr_{\mathcal{M}^{\mathrm{aug}}}\left[\mathcal{M}^{\mathrm{aug}}(D^{\prime})\in S\right]+\max\left(\frac{\delta^{\mathrm{est}}p_{\mathsf{TP}}}{\tau},p_{\mathsf{FP}}\right)

where in the first inequality, we use the definition of differential privacy. Therefore, aug\mathcal{M}^{\mathrm{aug}} is (ε,max(δestp𝖳𝖯τ,p𝖥𝖯))\left(\varepsilon,\max\left(\frac{\delta^{\mathrm{est}}p_{\mathsf{TP}}}{\tau},p_{\mathsf{FP}}\right)\right)-DP. By assumption of p𝖥𝖯𝖥𝖯DPV(ε,δest;τ)δest/τp_{\mathsf{FP}}\leq\mathsf{FP}_{\texttt{DPV}}(\varepsilon,\delta^{\mathrm{est}};\tau)\leq\delta^{\mathrm{est}}/\tau, we reach the conclusion. ∎

Appendix C Importance Sampling via Exponential Tilting

Notation Review.

Recall that most of the recently proposed DP accountants are essentially different techniques for estimating the expectation

δY=i=1kYi(ε)=𝔼Y[(1eεY)+]\displaystyle\delta_{Y=\sum_{i=1}^{k}Y_{i}}(\varepsilon)=\mathbb{E}_{Y}\left[\left(1-e^{\varepsilon-Y}\right)_{+}\right]

where each YiY_{i} is the privacy loss random variable Yi=log(Pi(t)Qi(t))Y_{i}=\log\left(\frac{P_{i}(t)}{Q_{i}(t)}\right) for tPit\sim P_{i}, and (Pi,Qi)(P_{i},Q_{i}) is a pair of dominating distributions for the individual mechanism i\mathcal{M}_{i}. In the following text, we denote the product distributions 𝑷:=P1××Pk\bm{P}:=P_{1}\times\ldots\times P_{k} and 𝑸:=Q1××Qk\bm{Q}:=Q_{1}\times\ldots\times Q_{k}. Recall from Lemma 5 that (𝑷,𝑸)(\bm{P},\bm{Q}) is a pair of dominating distributions for the composed mechanism \mathcal{M}. For notation simplicity, we denote a vector 𝒕:=(t(1),,t(k))\bm{t}:=(t^{(1)},\ldots,t^{(k)}). We slightly abuse the notation and write y(t;P,Q):=log(P(t)Q(t))y(t;P,Q):=\log\left(\frac{P(t)}{Q(t)}\right). Note that y(𝒕;𝑷,𝑸)=i=1ky(t(i);Pi,Qi)y(\bm{t};\bm{P},\bm{Q})=\sum_{i=1}^{k}y(t^{(i)};P_{i},Q_{i}). When the context is clear, we omit the dominating pairs and simply write y(t)y(t).

Dominating Distribution Pairs for Poisson Subsampled Gaussian Mechanisms.

The dominating distribution pair for the Poisson Subsampled Gaussian mechanism is well known.

Lemma 14.

For Poisson Subsampled Gaussian mechanism with sensitivity CC, noise variance C2σ2C^{2}\sigma^{2}, and subsampling rate qq, one dominating pair (P,Q)(P,Q) is Q:=𝒩(0,σ2)Q:=\mathcal{N}(0,\sigma^{2}) and P:=(1q)𝒩(0,σ2)+q𝒩(1,σ2)P:=(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(1,\sigma^{2}).

Proof.

See Appendix B of [GLW21]. ∎

That is, QQ is simply a 1-dimensional Gaussian distribution with variance σ2\sigma^{2}, and PP is a mixture of that same Gaussian and a Gaussian with the same variance centered at 11.
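To make the estimation procedure concrete, the simple Monte Carlo estimator based on this dominating pair can be sketched in a few lines of Python (our own illustrative code, not the paper's implementation; all function names are ours):

```python
import math
import random

def plrv(t, q, sigma):
    # privacy loss y(t) = log(P(t)/Q(t)) = log(1 - q + q*exp((2t - 1) / (2 sigma^2)))
    return math.log(1 - q + q * math.exp((2 * t - 1) / (2 * sigma ** 2)))

def smc_delta(eps, q, sigma, k, m, seed=0):
    """Simple Monte Carlo estimate of delta_Y(eps) for k compositions, using m samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        y = 0.0
        for _ in range(k):
            # draw t ~ P = (1-q) N(0, sigma^2) + q N(1, sigma^2)
            mean = 1.0 if rng.random() < q else 0.0
            y += plrv(rng.gauss(mean, sigma), q, sigma)
        total += max(0.0, 1.0 - math.exp(eps - y))
    return total / m
```

For q=1q=1 and k=1k=1 the mechanism reduces to the plain Gaussian mechanism, whose δ(ε)\delta(\varepsilon) has the well-known closed form Φ(1/(2σ)εσ)eεΦ(1/(2σ)εσ)\Phi(1/(2\sigma)-\varepsilon\sigma)-e^{\varepsilon}\Phi(-1/(2\sigma)-\varepsilon\sigma), which provides a convenient sanity check.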

Remark 15 (Dominating pair supported on higher dimensional space).

The cost of our approach would not increase (in terms of the number of samples) even if the dominating pair is supported on a higher-dimensional space. For the Monte Carlo estimator, we can see from Hoeffding’s inequality that the expected estimation error is independent of the dimension of the support of the dominating distribution pair. This means the number of samples we need to ensure a certain confidence interval is independent of the dimension. However, we note that although the number of samples does not change, the sampling process itself might be more costly in higher-dimensional spaces.

C.1 Importance Sampling for the Composition of Poisson Subsampled Gaussian Mechanisms

Importance Sampling (IS) is a classic method for rare event simulation. It samples from an alternative distribution instead of the distribution of the quantity of interest, and a weighting factor is then used for correcting the difference between the two distributions. Specifically, we can re-write the expression for δY(ε)\delta_{Y}(\varepsilon) as follows:

δY(ε)\displaystyle\delta_{Y}(\varepsilon) =𝔼Y[(1eεY)+]\displaystyle=\mathbb{E}_{Y}\left[(1-e^{\varepsilon-Y})_{+}\right]
=𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+]\displaystyle=\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}\right]
=𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+𝑷(𝒕)𝑷(𝒕)]\displaystyle=\mathbb{E}_{\bm{t}\sim\bm{P}^{\prime}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}\frac{\bm{P}(\bm{t})}{\bm{P}^{\prime}(\bm{t})}\right] (2)

where 𝑷\bm{P}^{\prime} is an alternative distribution of the user’s choice. From Equation (2), one can construct an unbiased importance sampling estimator for δY(ε)\delta_{Y}(\varepsilon) by sampling from 𝑷\bm{P}^{\prime}. In this section, we develop a 𝑷\bm{P}^{\prime} for estimating δY(ε)\delta_{Y}(\varepsilon) when composing identically distributed Poisson subsampled Gaussian mechanisms, arguably the most important DP mechanism today due to its application in differentially private stochastic gradient descent.

Exponential tilting is a common way to construct an alternative sampling distribution for IS. The exponential tilting of a distribution PP is defined as

Pθ(t):=eθtMP(θ)P(t)\displaystyle P_{\theta}(t):=\frac{e^{\theta t}}{M_{P}(\theta)}P(t)

where MP(θ):=𝔼tP[eθt]M_{P}(\theta):=\mathbb{E}_{t\sim P}[e^{\theta t}] is the moment generating function for PP. Such a transformation is especially convenient for distributions from the exponential family. For example, for normal distribution 𝒩(μ,σ2)\mathcal{N}(\mu,\sigma^{2}), the tilted distribution is 𝒩(μ+θσ2,σ2)\mathcal{N}(\mu+\theta\sigma^{2},\sigma^{2}), which is easy to sample from.
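This fact admits a quick pointwise numerical check (our own sketch, not from the paper): eθtP(t)/MP(θ)e^{\theta t}P(t)/M_{P}(\theta) coincides with the density of 𝒩(μ+θσ2,σ2)\mathcal{N}(\mu+\theta\sigma^{2},\sigma^{2}).

```python
import math

def normal_pdf(t, mu, sigma):
    return math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def tilted_normal_pdf(t, mu, sigma, theta):
    # exponential tilting of N(mu, sigma^2); its MGF is exp(mu*theta + sigma^2*theta^2/2)
    mgf = math.exp(mu * theta + 0.5 * (sigma * theta) ** 2)
    return math.exp(theta * t) * normal_pdf(t, mu, sigma) / mgf
```

The identity is exact, so the two densities agree up to floating-point rounding at every point.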

Without loss of generality, we consider the Poisson Subsampled Gaussian mechanism with sensitivity 1, noise variance σ2\sigma^{2}, and subsampling rate qq. Recall from Lemma 14 that the dominating pair in this case is Q:=𝒩(0,σ2)Q:=\mathcal{N}(0,\sigma^{2}) and P:=(1q)𝒩(0,σ2)+q𝒩(1,σ2)P:=(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(1,\sigma^{2}). For notation simplicity, we denote P0:=𝒩(1,σ2)P_{0}:=\mathcal{N}(1,\sigma^{2}), and thus P=(1q)Q+qP0P=(1-q)Q+qP_{0}. Since each individual mechanism is the same, 𝑷=P××P\bm{P}=P\times\ldots\times P and 𝑸=Q××Q\bm{Q}=Q\times\ldots\times Q. The exponential tilting of PP with parameter θ\theta is the mixture Pθ:=(1q¯)𝒩(θσ2,σ2)+q¯𝒩(1+θσ2,σ2)P_{\theta}:=(1-\bar{q})\mathcal{N}(\theta\sigma^{2},\sigma^{2})+\bar{q}\mathcal{N}(1+\theta\sigma^{2},\sigma^{2}), where each component mean is shifted by θσ2\theta\sigma^{2} and the mixture weight is re-tilted to q¯:=qeθ/(1q+qeθ)\bar{q}:=qe^{\theta}/(1-q+qe^{\theta}). We propose the following importance sampling estimator for δY(ε)\delta_{Y}(\varepsilon) based on exponential tilting.

Definition 16 (Importance Sampling Estimator for Subsampled Gaussian Composition).

Let the alternative distribution

𝑷:=𝑷θ=(P,,Pθithdim,,P),iUnif([k])\bm{P}^{\prime}:=\bm{P}_{\theta}=(P,\ldots,\underbrace{P_{\theta}}_{i\mathrm{th~{}dim}},\ldots,P),~{}~{}i\sim\mathrm{Unif}([k])

with θ=12σ2+log(exp(ε)(1q)q)\theta=\frac{1}{2\sigma^{2}}+\log\left(\frac{\exp(\varepsilon)-(1-q)}{q}\right). Given a random draw 𝐭𝐏θ\bm{t}\sim\bm{P}_{\theta}, an unbiased sample for δY(ε)\delta_{Y}(\varepsilon) is (1eεy(𝐭;𝐏,𝐐))+(1ki=1kPθ(ti)P(ti))1\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}\left(\frac{1}{k}\sum_{i=1}^{k}\frac{P_{\theta}(t_{i})}{P(t_{i})}\right)^{-1}. We denote 𝛅^IS,θm(ε)\widehat{\bm{\delta}}_{\texttt{IS},\theta}^{m}(\varepsilon) as the random variable of the corresponding importance sampling estimator with mm samples.
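A minimal Python sketch of this estimator follows (our illustrative code; all names are ours). Note that exponentially tilting the mixture PP both shifts each component mean by θσ2\theta\sigma^{2} and re-tilts the mixture weight to qeθ/(1q+qeθ)qe^{\theta}/(1-q+qe^{\theta}).

```python
import math
import random

SQRT2PI = math.sqrt(2 * math.pi)

def npdf(t, mu, s):
    return math.exp(-((t - mu) ** 2) / (2 * s * s)) / (SQRT2PI * s)

def p_pdf(t, q, s):
    # dominating distribution P = (1-q) N(0, s^2) + q N(1, s^2)
    return (1 - q) * npdf(t, 0.0, s) + q * npdf(t, 1.0, s)

def p_theta_pdf(t, q, s, theta):
    # exponential tilting of P: component means shift by theta*s^2, weights re-tilt
    qb = q * math.exp(theta) / (1 - q + q * math.exp(theta))
    return (1 - qb) * npdf(t, theta * s * s, s) + qb * npdf(t, 1.0 + theta * s * s, s)

def plrv(t, q, s):
    return math.log(1 - q + q * math.exp((2 * t - 1) / (2 * s * s)))

def is_delta(eps, q, s, theta, k, m, seed=0):
    """Importance sampling estimate of delta_Y(eps): tilt one random coordinate."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        i_star = rng.randrange(k)
        ts = []
        for i in range(k):
            th = theta if i == i_star else 0.0
            qb = q * math.exp(th) / (1 - q + q * math.exp(th))
            mu = 1.0 if rng.random() < qb else 0.0
            ts.append(rng.gauss(mu + th * s * s, s))
        weight = k / sum(p_theta_pdf(t, q, s, theta) / p_pdf(t, q, s) for t in ts)
        y = sum(plrv(t, q, s) for t in ts)
        total += max(0.0, 1.0 - math.exp(eps - y)) * weight
    return total / m
```

Setting θ=0\theta=0 makes every weight equal to 1 and recovers the simple Monte Carlo estimator, which gives an easy empirical unbiasedness check.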

We defer the formal justification of the choice of θ\theta to Appendix C.2. We first give the intuition for why we choose such an alternative distribution 𝑷θ\bm{P}_{\theta}.

Intuition for the alternative distribution 𝑷θ\bm{P}_{\theta}.

It is well-known that the variance of the importance sampling estimator is minimized when the alternative distribution

𝑷(𝒕)(1eεy(𝒕))+𝑷(𝒕)\displaystyle\bm{P}^{\prime}(\bm{t})\propto\left(1-e^{\varepsilon-y(\bm{t})}\right)_{+}\bm{P}(\bm{t})

The distribution of each privacy loss random variable y(t;P,Q),tPy(t;P,Q),t\sim P is light-tailed, which means that for the rare event where y(𝒕)=i=1ky(t(i))>εy(\bm{t})=\sum_{i=1}^{k}y(t^{(i)})>\varepsilon, it is most likely that there is only one outlier tt^{*} among all {t(i)}i=1k\{t^{(i)}\}_{i=1}^{k} such that y(t)y(t^{*}) is large (which means that y(𝒕)y(\bm{t}) is also large), while all the remaining y(t(i))y(t^{(i)})s are small. Hence, a reasonable alternative distribution can simply tilt the distribution of one randomly picked t(i)t^{(i)} and leave the remaining k1k-1 distributions unchanged. Moreover, θ\theta is selected to approximately minimize the variance of 𝜹^IS,θ\widehat{\bm{\delta}}_{\texttt{IS},\theta} (detailed in Appendix C.2). An intuitive way to see this is that the tilted component center satisfies y(θσ2)=εy(\theta\sigma^{2})=\varepsilon, which significantly increases the probability that y(𝒕)εy(\bm{t})\geq\varepsilon while also accounting for the fact that P(t)P(t) decays exponentially fast as tt increases.

We also empirically verify the advantage of the IS estimator over the SMC estimator. The orange curve in Figure 6 (a) shows the empirical estimate of 𝔼[(𝜹^IS,θ)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}] which quantifies the variance of 𝜹^IS,θ\widehat{\bm{\delta}}_{\texttt{IS},\theta}. Note that θ=0\theta=0 corresponds to the case of 𝜹^MC\widehat{\bm{\delta}}_{\texttt{MC}}. As we can see, 𝔼[(𝜹^IS,θ)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}] drops quickly as θ\theta increases, and eventually converges. We can also see that the θ\theta selected by our heuristic in Definition 16 (marked as red ‘*’) approximately corresponds to the lowest point of 𝔼[(𝜹^IS,θ)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}]. This validates our theoretical justification for the selection of θ\theta.

Refer to caption
Figure 6: We examine the properties of MC-based DP verifiers for the Poisson Subsampled mechanism. We set q=103,σ=0.6,ε=1.5,k=100q=10^{-3},\sigma=0.6,\varepsilon=1.5,k=100. δY(ε)7.7×106\delta_{Y}(\varepsilon)\approx 7.7\times 10^{-6} in this case. (a) Plot for the upper bound and empirical estimate of 𝔼[𝜹^MC2]\mathbb{E}[\widehat{\bm{\delta}}_{\texttt{MC}}^{2}] and 𝔼[𝜹^IS,θ2]\mathbb{E}[\widehat{\bm{\delta}}_{\texttt{IS},\theta}^{2}]. The upper bounds are computed by Corollary 19 (for 𝔼[𝜹^MC2]\mathbb{E}[\widehat{\bm{\delta}}_{\texttt{MC}}^{2}]) and Theorem 19 (for 𝔼[𝜹^IS,θ2]\mathbb{E}[\widehat{\bm{\delta}}_{\texttt{IS},\theta}^{2}]). Note that θ=0\theta=0 corresponds to 𝜹^MC\widehat{\bm{\delta}}_{\texttt{MC}}. The red star indicates the second moment for the value of θ\theta selected by our heuristic in Definition 16. The blue star indicates the θ\theta that minimizes the analytical bound. (b) Empirical estimate of the rejection probability Pr[𝜹^IS,θm>δest/τΔ]\Pr[\widehat{\bm{\delta}}_{\texttt{IS},\theta}^{m}>\delta^{\mathrm{est}}/\tau-\Delta] scaled with the number of samples mm. We set δest=0.8δY(ε),τ=105/δest\delta^{\mathrm{est}}=0.8\delta_{Y}(\varepsilon),\tau=10^{-5}/\delta^{\mathrm{est}}, and we set Δ\Delta following the heuristic proposed in Section 4.4.

C.2 Intuition of the Heuristic of Choosing Exponential Tilting Parameter θ\theta

The second moment of the IS estimator proposed in Definition 16, which upper bounds its variance, is given by

𝔼𝒕𝑷θ[(1eεy(𝒕;𝑷,𝑸))+2(𝑷(𝒕)𝑷θ(𝒕))2]\displaystyle\mathbb{E}_{\bm{t}\sim\bm{P}_{\theta}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{\bm{P}(\bm{t})}{\bm{P}_{\theta}(\bm{t})}\right)^{2}\right]
=𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(𝑷(𝒕)𝑷θ(𝒕))]\displaystyle=\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{\bm{P}(\bm{t})}{\bm{P}_{\theta}(\bm{t})}\right)\right]
=𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(1ki=1kPθ(ti)P(ti))1]\displaystyle=\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{1}{k}\sum_{i=1}^{k}\frac{P_{\theta}(t_{i})}{P(t_{i})}\right)^{-1}\right]
=kMP(θ)𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(1i=1keθti)]\displaystyle=kM_{P}(\theta)\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{1}{\sum_{i=1}^{k}e^{\theta t_{i}}}\right)\right]

Let S(θ):=kMP(θ)𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(1i=1keθti)]S(\theta):=kM_{P}(\theta)\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{1}{\sum_{i=1}^{k}e^{\theta t_{i}}}\right)\right]. We aim to find θ\theta that minimizes S(θ)S(\theta).

Note that

MP(θ)=(1q)e12σ2θ2+qeθ+12σ2θ2\displaystyle M_{P}(\theta)=(1-q)e^{\frac{1}{2}\sigma^{2}\theta^{2}}+qe^{\theta+\frac{1}{2}\sigma^{2}\theta^{2}} (3)

To simplify the notation, let b(𝒕):=(1eεy(𝒕))+2i=1kP(ti)b(\bm{t}):=\left(1-e^{\varepsilon-y(\bm{t})}\right)_{+}^{2}\prod_{i=1}^{k}P(t_{i}).

θS(θ)\displaystyle\frac{\partial}{\partial\theta}S(\theta) =[(1q)e12σ2θ2(σ2θ)+qeθ+12σ2θ2(1+σ2θ)]b(𝒕)(i=1keθti)1𝑑𝒕\displaystyle=\left[(1-q)e^{\frac{1}{2}\sigma^{2}\theta^{2}}(\sigma^{2}\theta)+qe^{\theta+\frac{1}{2}\sigma^{2}\theta^{2}}(1+\sigma^{2}\theta)\right]\int\ldots\int b(\bm{t})\left(\sum_{i=1}^{k}e^{\theta t_{i}}\right)^{-1}d\bm{t} (4)
[(1q)e12σ2θ2+qeθ+12σ2θ2]b(𝒕)i=1keθtiti(i=1keθti)2𝑑𝒕\displaystyle~{}~{}~{}-\left[(1-q)e^{\frac{1}{2}\sigma^{2}\theta^{2}}+qe^{\theta+\frac{1}{2}\sigma^{2}\theta^{2}}\right]\int\ldots\int b(\bm{t})\frac{\sum_{i=1}^{k}e^{\theta t_{i}}t_{i}}{\left(\sum_{i=1}^{k}e^{\theta t_{i}}\right)^{2}}d\bm{t} (5)

By setting θS(θ)=0\frac{\partial}{\partial\theta}S(\theta)=0 and simplifying the expression, we have

(1q+qeθ)(σ2θ)+qeθ1q+qeθ=b(𝒕)i=1keθtiti(i=1keθti)2𝑑𝒕b(𝒕)(i=1keθti)1𝑑𝒕\displaystyle\frac{(1-q+qe^{\theta})(\sigma^{2}\theta)+qe^{\theta}}{1-q+qe^{\theta}}=\frac{\int\ldots\int b(\bm{t})\frac{\sum_{i=1}^{k}e^{\theta t_{i}}t_{i}}{\left(\sum_{i=1}^{k}e^{\theta t_{i}}\right)^{2}}d\bm{t}}{\int\ldots\int b(\bm{t})\left(\sum_{i=1}^{k}e^{\theta t_{i}}\right)^{-1}d\bm{t}} (6)

As we mentioned earlier, b(𝒕)>0b(\bm{t})>0 only when y(𝒕)=i=1ky(t(i))>εy(\bm{t})=\sum_{i=1}^{k}y(t^{(i)})>\varepsilon, and for such an event it is most likely that there is only one outlier tt^{*} among all {t(i)}i=1k\{t^{(i)}\}_{i=1}^{k} such that y(t)εy(t^{*})\approx\varepsilon, while all the remaining y(t(i))0y(t^{(i)})\approx 0. Therefore, a very simple and rough, yet surprisingly effective approximation for the RHS of (6) is

b(𝒕)i=1keθtiti(i=1keθti)2𝑑𝒕b(𝒕)(i=1keθti)1𝑑𝒕b(t)eθttb(t)eθt=t\displaystyle\frac{\int\ldots\int b(\bm{t})\frac{\sum_{i=1}^{k}e^{\theta t_{i}}t_{i}}{\left(\sum_{i=1}^{k}e^{\theta t_{i}}\right)^{2}}d\bm{t}}{\int\ldots\int b(\bm{t})\left(\sum_{i=1}^{k}e^{\theta t_{i}}\right)^{-1}d\bm{t}}\approx\frac{b(t)e^{-\theta t}t}{b(t)e^{-\theta t}}=t (7)

for tt s.t. y(t)=εy(t)=\varepsilon. This leads to an approximate solution

θ=12σ2+log((exp(ε)(1q))/q)\displaystyle\theta^{*}=\frac{1}{2\sigma^{2}}+\log\left((\exp(\varepsilon)-(1-q))/q\right) (8)
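As a sanity check on this choice (our own sketch): the tilted component center θσ2\theta^{*}\sigma^{2} is exactly the point tt at which the single-mechanism privacy loss reaches ε\varepsilon, i.e., y(θσ2)=εy(\theta^{*}\sigma^{2})=\varepsilon.

```python
import math

def theta_star(eps, q, sigma):
    # heuristic tilting parameter from Eq. (8)
    return 1 / (2 * sigma ** 2) + math.log((math.exp(eps) - (1 - q)) / q)

def plrv(t, q, sigma):
    # single-mechanism privacy loss y(t) for the subsampled Gaussian dominating pair
    return math.log(1 - q + q * math.exp((2 * t - 1) / (2 * sigma ** 2)))
```

The identity y(θσ2)=εy(\theta^{*}\sigma^{2})=\varepsilon holds exactly by construction, which is what makes the heuristic effective.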

Appendix D Sample Complexity for Achieving Target False Positive Rate

To derive the sample complexity for achieving a DP verifier with a target false positive rate, we use Bennett’s inequality.

Lemma 17 (Bennett’s inequality).

Let X1,,XnX_{1},\ldots,X_{n} be independent real-valued random variables with finite variance such that XibX_{i}\leq b for some b>0b>0 almost surely for all 1in1\leq i\leq n. Let νi=1n𝔼[Xi2]\nu\geq\sum_{i=1}^{n}\mathbb{E}[X_{i}^{2}]. For any t>0t>0, we have

Pr[i=1nXi𝔼[Xi]t]exp(νb2h(btν))\displaystyle\Pr\left[\sum_{i=1}^{n}X_{i}-\mathbb{E}[X_{i}]\geq t\right]\leq\exp\left(-\frac{\nu}{b^{2}}h\left(\frac{bt}{\nu}\right)\right) (9)

where h(x)=(1+x)log(1+x)xh(x)=(1+x)\log(1+x)-x for x>0x>0.

\smcsamplecomp

*

Proof.

For any \mathcal{M} s.t. δest<τδY(ε)\delta^{\mathrm{est}}<\tau\delta_{Y}(\varepsilon), we have

Pr[𝜹^MCm(ε;Y)<δest/τΔ]\displaystyle\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}(\varepsilon;Y)<\delta^{\mathrm{est}}/\tau-\Delta\right] Pr[𝜹^MCm(ε;Y)<δY(ε)Δ]\displaystyle\leq\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}(\varepsilon;Y)<\delta_{Y}(\varepsilon)-\Delta\right]
=Pr[1mi=1m(𝜹^MC(i)δY(ε))<Δ]\displaystyle=\Pr\left[\frac{1}{m}\sum_{i=1}^{m}(\widehat{\bm{\delta}}_{\texttt{MC}}^{(i)}-\delta_{Y}(\varepsilon))<-\Delta\right]
=Pr[i=1m(δY(ε)𝜹^MC(i))>mΔ]\displaystyle=\Pr\left[\sum_{i=1}^{m}(\delta_{Y}(\varepsilon)-\widehat{\bm{\delta}}_{\texttt{MC}}^{(i)})>m\Delta\right] (10)

We apply Bennett’s inequality with Xi:=𝜹^MC(i)X_{i}:=-\widehat{\bm{\delta}}_{\texttt{MC}}^{(i)}. Since each 𝜹^MC(i)[0,1]\widehat{\bm{\delta}}_{\texttt{MC}}^{(i)}\in[0,1], we have Xi0<bX_{i}\leq 0<b for every b>0b>0, so the condition in Bennett’s inequality holds for arbitrarily small bb and we may take the limit b0+b\rightarrow 0^{+}. Hence, (10) can be upper bounded by

(10)\displaystyle(\ref{eq:to-use-bennett}) limb0+exp(mνb2h(bΔν))\displaystyle\leq\lim_{b\rightarrow 0^{+}}\exp\left(-\frac{m\nu}{b^{2}}h\left(\frac{b\Delta}{\nu}\right)\right)
=exp(mΔ22ν)\displaystyle=\exp\left(-\frac{m\Delta^{2}}{2\nu}\right)

By setting exp(mΔ22ν)δest/τ\exp\left(-\frac{m\Delta^{2}}{2\nu}\right)\leq\delta^{\mathrm{est}}/\tau, we have

m2νΔ2log(τ/δest)\displaystyle m\geq\frac{2\nu}{\Delta^{2}}\log(\tau/\delta^{\mathrm{est}})
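This sample-complexity rule is straightforward to evaluate; below is a small helper (our own sketch, where nu is an upper bound on the second moment 𝔼[𝜹^MC2]\mathbb{E}[\widehat{\bm{\delta}}_{\texttt{MC}}^{2}] and gap plays the role of Δ\Delta):

```python
import math

def smc_sample_complexity(nu, gap, tau, delta_est):
    """Smallest m with exp(-m * gap^2 / (2 * nu)) <= delta_est / tau."""
    return math.ceil(2 * nu / gap ** 2 * math.log(tau / delta_est))
```

The returned mm makes the Bennett-limit tail bound fall below the target false positive rate δest/τ\delta^{\mathrm{est}}/\tau.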

Appendix E Proofs for Moment Bound

E.1 Overview

As suggested by Theorem 4.3, a good upper bound for 𝔼[(𝜹^MC)2]\mathbb{E}\left[\left(\widehat{\bm{\delta}}_{\texttt{MC}}\right)^{2}\right] (or 𝔼[(𝜹^IS)2]\mathbb{E}\left[\left(\widehat{\bm{\delta}}_{\texttt{IS}}\right)^{2}\right]) is important for the computational efficiency of MC-based DPV.

We upper bound the higher moment of 𝜹^MC\widehat{\bm{\delta}}_{\texttt{MC}} through the RDP guarantee for the composed mechanism \mathcal{M}. This is a natural idea since converting RDP to upper bounds for δY(ε)\delta_{Y}(\varepsilon) – the first moment of 𝜹^MC\widehat{\bm{\delta}}_{\texttt{MC}} – is a well-studied problem [Mir17, CKS20, ALC+21]. Recall that the RDP guarantee for \mathcal{M} is equivalent to a bound for MY(λ):=𝔼[eλY]M_{Y}(\lambda):=\mathbb{E}[e^{\lambda Y}] for \mathcal{M}’s privacy loss random variable YY for any λ0\lambda\geq 0.

Lemma 18 (RDP-MGF bound conversion [Mir17]).

If a mechanism \mathcal{M} is (α,εR(α))(\alpha,\varepsilon_{\mathrm{R}}(\alpha))-RDP, then MY(λ)exp(λεR(λ+1))M_{Y}(\lambda)\leq\exp(\lambda\varepsilon_{\mathrm{R}}(\lambda+1)).
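For the plain Gaussian mechanism with sensitivity 1 and noise σ\sigma, we have εR(α)=α/(2σ2)\varepsilon_{\mathrm{R}}(\alpha)=\alpha/(2\sigma^{2}) and Y𝒩(μ,2μ)Y\sim\mathcal{N}(\mu,2\mu) with μ=1/(2σ2)\mu=1/(2\sigma^{2}), so the bound of Lemma 18 actually holds with equality. A small numerical check (our own sketch):

```python
import math
import random

def mgf_bound(lam, sigma):
    # Lemma 18 with eps_R(alpha) = alpha / (2 sigma^2) for the Gaussian mechanism
    return math.exp(lam * (lam + 1) / (2 * sigma ** 2))

def mgf_mc(lam, sigma, m, seed=0):
    # Monte Carlo estimate of E[e^{lam * Y}] with Y ~ N(mu, 2*mu), mu = 1/(2 sigma^2)
    rng = random.Random(seed)
    mu = 1 / (2 * sigma ** 2)
    return sum(math.exp(lam * rng.gauss(mu, math.sqrt(2 * mu))) for _ in range(m)) / m
```

For σ=1\sigma=1 and λ=1\lambda=1 both sides equal ee, so the empirical MGF should concentrate around the analytical bound.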

We convert an upper bound for MY()M_{Y}(\cdot) into the following guarantee for the higher moment of 𝜹^MC=(1eεY)+\widehat{\bm{\delta}}_{\texttt{MC}}=(1-e^{\varepsilon-Y})_{+}.

{restatable}

theoremvariancebound For any u1u\geq 1, we have

𝔼[(𝜹^MC)u]=𝔼[(1eεY)+u]minλ0MY(λ)eελuuλλ(u+λ)u+λ\displaystyle\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{MC}})^{u}]=\mathbb{E}\left[(1-e^{\varepsilon-Y})_{+}^{u}\right]\leq\min_{\lambda\geq 0}M_{Y}(\lambda)e^{-\varepsilon\lambda}\frac{u^{u}\lambda^{\lambda}}{(u+\lambda)^{u+\lambda}}

The proof is deferred to Appendix E.2. The basic idea is to find the smallest constant cc such that 𝔼[(𝜹^MC)u]cMY(λ)\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{MC}})^{u}]\leq cM_{Y}(\lambda). By setting u=1u=1, our result recovers the RDP-DP conversion from [CKS20]. By setting u=2u=2, we obtain the desired bound for 𝔼[(𝜹^MC)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{MC}})^{2}].

Corollary 19.

𝔼[(𝜹^MC)2]minλ0MY(λ)eελ4λλ(λ+2)λ+2\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{MC}})^{2}]\leq\min_{\lambda\geq 0}M_{Y}(\lambda)e^{-\varepsilon\lambda}\frac{4\lambda^{\lambda}}{(\lambda+2)^{\lambda+2}}.

Corollary 19 applies to any mechanism whose RDP guarantee is available, which covers a wide range of commonly used mechanisms such as the (Subsampled) Gaussian or Laplace mechanism. We also note that one may be able to further tighten the above bound, similar to the optimal RDP-DP conversion in [ALC+21]. We leave this as interesting future work.
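Corollary 19 can be evaluated by a one-dimensional grid search over λ\lambda. Below is a sketch for the plain Gaussian mechanism with σ=1\sigma=1 (our own code; the grid range is an arbitrary choice):

```python
import math

def moment_bound(eps, u, mgf, lam_grid):
    """min over the grid of M_Y(lam) * e^{-eps*lam} * u^u * lam^lam / (u+lam)^{u+lam}."""
    best = float("inf")
    for lam in lam_grid:
        val = mgf(lam) * math.exp(-eps * lam) * (u ** u) * (lam ** lam) / ((u + lam) ** (u + lam))
        best = min(best, val)
    return best

sigma = 1.0
gauss_mgf = lambda lam: math.exp(lam * (lam + 1) / (2 * sigma ** 2))
grid = [0.05 * j for j in range(1, 201)]  # lambda in (0, 10]
b1 = moment_bound(1.0, 1, gauss_mgf, grid)  # u=1: upper bound on delta(eps)
b2 = moment_bound(1.0, 2, gauss_mgf, grid)  # u=2: upper bound on E[(delta_hat_MC)^2]
```

With u=1u=1 the same routine reproduces the RDP-to-DP conversion bound, which indeed upper bounds the analytic δ(1)0.127\delta(1)\approx 0.127 for σ=1\sigma=1.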

Next, we derive the upper bound for 𝔼[(𝜹^IS,θ)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}] for Poisson Subsampled Gaussian mechanism.

{restatable}

theoremvarboundisholder For any positive integer λ\lambda, and for any a,b1a,b\geq 1 s.t. 1/a+1/b=11/a+1/b=1, we have 𝔼[(𝜹^IS,θ)2]kMP(θ)(𝔼[𝜹^MC2a])1/a(bθeλε[r(λ,x)]kebθx𝑑x)1/b\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}]\leq kM_{P}(\theta)\left(\mathbb{E}[\widehat{\bm{\delta}}_{\texttt{MC}}^{2a}]\right)^{1/a}\cdot\left(b\theta e^{-\lambda\varepsilon}\int\left[r(\lambda,x)\right]^{k}e^{-b\theta x}dx\right)^{1/b} where r(λ,x)r(\lambda,x) is an upper bound for 𝔼tP[eλy(t)I[tx]]\mathbb{E}_{t\sim P}\left[e^{\lambda y(t)}I[t\leq x]\right], so that [r(λ,x)]keλε[r(\lambda,x)]^{k}e^{-\lambda\varepsilon} upper bounds Pr𝒕𝑷[maxitix,y(𝒕)ε]\Pr_{\bm{t}\sim\bm{P}}[\max_{i}t_{i}\leq x,y(\bm{t})\geq\varepsilon]; see Appendix E.3 for details.

The proof is based on applying Hölder’s inequality to the expression of 𝔼[(𝜹^IS,θ)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}], and then bounding the part where θ\theta is involved: 𝔼𝒕𝑷[(1ki=1kPθ(ti)P(ti))1]\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(\frac{1}{k}\sum_{i=1}^{k}\frac{P_{\theta}(t_{i})}{P(t_{i})}\right)^{-1}\right]. We can bound 𝔼[𝜹^MC2a]\mathbb{E}[\widehat{\bm{\delta}}_{\texttt{MC}}^{2a}] through Theorem 18.

Figure 6 (a) shows the analytical bound from Corollary 19 and Theorem 19 compared with empirically estimated 𝔼[(𝜹^MC)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{MC}})^{2}] and 𝔼[(𝜹^IS,θ)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}]. As we can see, the analytical bound for 𝔼[(𝜹^IS,θ)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}] for relatively large θ\theta is much smaller than the bound for 𝔼[(𝜹^MC)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{MC}})^{2}] (i.e., θ=0\theta=0 in the plot). Moreover, we find that the θ\theta which minimizes the analytical bound (the blue ‘*’) is close to the θ\theta selected by our heuristic (the red ‘*’). For computational efficiency, one may prefer to use θ\theta that minimizes the analytical bound. However, the heuristically selected θ\theta is still useful when one simply wants to estimate δY(ε)\delta_{Y}(\varepsilon) and does not require the formal, analytical guarantee for the false positive rate. We see such a scenario when we introduce the MC accountant in Section 5. We also note that such a discrepancy (and the gap between the analytical bound and empirical estimate) is due to the use of Hölder’s inequality in bounding 𝔼[(𝜹^IS,θ)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}]. Further tightening the bound for 𝔼[(𝜹^IS,θ)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}] is important for future work.

E.2 Moment Bound for Simple Monte Carlo Estimator

\variancebound

*

Proof.
𝔼[(1eεY)+u]\displaystyle\mathbb{E}[(1-e^{\varepsilon-Y})_{+}^{u}] =(1eεx)+uP(x)dx\displaystyle=\int(1-e^{\varepsilon-x})_{+}^{u}P(x)\mathrm{d}x
=MY(λ)(1eεx)+ueλxP(x)eλx𝔼[eλY]dx\displaystyle=M_{Y}(\lambda)\int(1-e^{\varepsilon-x})_{+}^{u}e^{-\lambda x}\frac{P(x)e^{\lambda x}}{\mathbb{E}[e^{\lambda Y}]}\mathrm{d}x
=MY(λ)𝔼xPλ[(1eεx)+ueλx]\displaystyle=M_{Y}(\lambda)\mathbb{E}_{x\sim P_{\lambda}}[(1-e^{\varepsilon-x})_{+}^{u}e^{-\lambda x}]
=MY(λ)eλε𝔼xPλ[(1eεx)+ue(εx)λ]\displaystyle=M_{Y}(\lambda)e^{-\lambda\varepsilon}\mathbb{E}_{x\sim P_{\lambda}}[(1-e^{\varepsilon-x})_{+}^{u}e^{(\varepsilon-x)\lambda}]

where Pλ(x):=P(x)eλx𝔼[eλY]P_{\lambda}(x):=\frac{P(x)e^{\lambda x}}{\mathbb{E}[e^{\lambda Y}]} is the exponential tilting of PP. Define f(x,λ):=(1ex)+uexλf(x,\lambda):=(1-e^{-x})_{+}^{u}e^{-x\lambda}. When x0,f(x,λ)=0x\leq 0,f(x,\lambda)=0. When x>0x>0, the derivative of ff with respect to xx is

f(x,λ)x=exλ(1ex)u1[ex(u+λ)λ]\displaystyle\frac{\partial f(x,\lambda)}{\partial x}=e^{-x\lambda}(1-e^{-x})^{u-1}[e^{-x}(u+\lambda)-\lambda]

It is easy to see that the maximum of f(x,λ)f(x,\lambda) is achieved at x=log(u+λλ)x^{*}=\log\left(\frac{u+\lambda}{\lambda}\right), and we have

maxxf(x,λ)\displaystyle\max_{x}f(x,\lambda) =(uu+λ)u(λu+λ)λ\displaystyle=\left(\frac{u}{u+\lambda}\right)^{u}\left(\frac{\lambda}{u+\lambda}\right)^{\lambda}
=uuλλ(u+λ)u+λ\displaystyle=\frac{u^{u}\lambda^{\lambda}}{(u+\lambda)^{u+\lambda}}

Overall, we have

𝔼[(1eεY)+u]\displaystyle\mathbb{E}[(1-e^{\varepsilon-Y})_{+}^{u}] MY(λ)eελuuλλ(u+λ)u+λ\displaystyle\leq M_{Y}(\lambda)e^{-\varepsilon\lambda}\frac{u^{u}\lambda^{\lambda}}{(u+\lambda)^{u+\lambda}}

E.3 Moment Bound for Importance Sampling Estimator

In this section, we first prove two possible upper bounds for 𝔼[(𝜹^IS,θ)2]\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}] in Theorem E.3 and Theorem E.3. We then combine these two bounds in Theorem 19 via Hölder’s inequality.

{restatable}

theoremvarboundisjs For any θ1/σ2\theta\geq 1/\sigma^{2}, we have

𝔼[(𝜹^IS,θ)2]MP(θ)[1k(εq+k)]θσ2eθ/2𝔼[(𝜹^MC)2]\displaystyle\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}]\leq M_{P}(\theta)\left[\frac{1}{k}\left(\frac{\varepsilon}{q}+k\right)\right]^{-\theta\sigma^{2}}e^{-\theta/2}\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{MC}})^{2}]
Proof.
𝔼𝒕𝑷θ[(1eεy(𝒕;𝑷,𝑸))+2(𝑷(𝒕)𝑷θ(𝒕))2]\displaystyle\mathbb{E}_{\bm{t}\sim\bm{P}_{\theta}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{\bm{P}(\bm{t})}{\bm{P}_{\theta}(\bm{t})}\right)^{2}\right]
=𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(𝑷(𝒕)𝑷θ(𝒕))]\displaystyle=\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{\bm{P}(\bm{t})}{\bm{P}_{\theta}(\bm{t})}\right)\right]
=𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(1ki=1kPθ(ti)P(ti))1]\displaystyle=\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{1}{k}\sum_{i=1}^{k}\frac{P_{\theta}(t_{i})}{P(t_{i})}\right)^{-1}\right]
=MP(θ)𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(ki=1keθti)]\displaystyle=M_{P}(\theta)\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{k}{\sum_{i=1}^{k}e^{\theta t_{i}}}\right)\right] (11)

Note that

𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(ki=1keθti)]\displaystyle\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{k}{\sum_{i=1}^{k}e^{\theta t_{i}}}\right)\right]
=𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(ki=1keθti)I[y(𝒕;𝑷,𝑸)ε]]\displaystyle=\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{k}{\sum_{i=1}^{k}e^{\theta t_{i}}}\right)I[y(\bm{t};\bm{P},\bm{Q})\geq\varepsilon]\right]
Lemma 20.

When y(𝐭;𝐏,𝐐)=i=1klog(1q+qe(2ti1)/(2σ2))εy(\bm{t};\bm{P},\bm{Q})=\sum_{i=1}^{k}\log\left(1-q+qe^{(2t_{i}-1)/(2\sigma^{2})}\right)\geq\varepsilon and θσ21\theta\sigma^{2}\geq 1, we have

i=1keθtik[1k(εq+k)e1/(2σ2)]θσ2\displaystyle\sum_{i=1}^{k}e^{\theta t_{i}}\geq k\left[\frac{1}{k}\left(\frac{\varepsilon}{q}+k\right)e^{1/(2\sigma^{2})}\right]^{\theta\sigma^{2}}
Proof.

Since log(1+x)x\log(1+x)\leq x, we have

ε\displaystyle\varepsilon i=1klog(1q+qe(2ti1)/(2σ2))\displaystyle\leq\sum_{i=1}^{k}\log\left(1-q+qe^{(2t_{i}-1)/(2\sigma^{2})}\right)
i=1kq(e(2ti1)/(2σ2)1)\displaystyle\leq\sum_{i=1}^{k}q\left(e^{(2t_{i}-1)/(2\sigma^{2})}-1\right)

Hence,

i=1keti/σ2(εq+k)e1/(2σ2)\displaystyle\sum_{i=1}^{k}e^{t_{i}/\sigma^{2}}\geq\left(\frac{\varepsilon}{q}+k\right)e^{1/(2\sigma^{2})}

Hence,

i=1ketiθ=i=1k(eti/σ2)θσ2\displaystyle\sum_{i=1}^{k}e^{t_{i}\theta}=\sum_{i=1}^{k}\left(e^{t_{i}/\sigma^{2}}\right)^{\theta\sigma^{2}} =k[1ki=1k(eti/σ2)θσ2]\displaystyle=k\left[\frac{1}{k}\sum_{i=1}^{k}\left(e^{t_{i}/\sigma^{2}}\right)^{\theta\sigma^{2}}\right]
k[1ki=1k(eti/σ2)]θσ2\displaystyle\geq k\left[\frac{1}{k}\sum_{i=1}^{k}\left(e^{t_{i}/\sigma^{2}}\right)\right]^{\theta\sigma^{2}}
k[1k(εq+k)e1/(2σ2)]θσ2\displaystyle\geq k\left[\frac{1}{k}\left(\frac{\varepsilon}{q}+k\right)e^{1/(2\sigma^{2})}\right]^{\theta\sigma^{2}}
=k[1k(εq+k)]θσ2eθ/2\displaystyle=k\left[\frac{1}{k}\left(\frac{\varepsilon}{q}+k\right)\right]^{\theta\sigma^{2}}e^{\theta/2}

where the first inequality is due to Jensen’s inequality. ∎

By Lemma 20, we have

(11)\displaystyle(\ref{eq:cont}) MP(θ)1[1k(εq+k)]θσ2eθ/2𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2]\displaystyle\leq M_{P}(\theta)\frac{1}{\left[\frac{1}{k}\left(\frac{\varepsilon}{q}+k\right)\right]^{\theta\sigma^{2}}e^{\theta/2}}\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\right]
=MP(θ)[1k(εq+k)]θσ2eθ/2𝔼[(𝜹^MC)2]\displaystyle=M_{P}(\theta)\left[\frac{1}{k}\left(\frac{\varepsilon}{q}+k\right)\right]^{-\theta\sigma^{2}}e^{-\theta/2}\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{MC}})^{2}]

which concludes the proof. ∎
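Lemma 20 can be stress-tested empirically (our own sketch): draw 𝒕𝑷\bm{t}\sim\bm{P}, keep the draws with y(𝒕)εy(\bm{t})\geq\varepsilon, and check the claimed lower bound on i=1keθti\sum_{i=1}^{k}e^{\theta t_{i}} for each of them. A tiny multiplicative tolerance accounts for floating-point rounding.

```python
import math
import random

def check_lemma(eps, q, sigma, theta, k, trials, seed=0):
    """Return the number of qualifying draws; assert the bound on each of them."""
    assert theta * sigma ** 2 >= 1  # hypothesis of the lemma
    rng = random.Random(seed)
    bound = k * ((eps / q + k) / k) ** (theta * sigma ** 2) * math.exp(theta / 2)
    checked = 0
    for _ in range(trials):
        # t_i ~ P = (1-q) N(0, sigma^2) + q N(1, sigma^2), i.i.d.
        ts = [rng.gauss(1.0 if rng.random() < q else 0.0, sigma) for _ in range(k)]
        y = sum(math.log(1 - q + q * math.exp((2 * t - 1) / (2 * sigma ** 2))) for t in ts)
        if y >= eps:
            assert sum(math.exp(theta * t) for t in ts) >= bound * (1 - 1e-9)
            checked += 1
    return checked
```

The parameters below satisfy θσ2=1.081\theta\sigma^{2}=1.08\geq 1, and the event y(𝒕)εy(\bm{t})\geq\varepsilon occurs often enough that many draws are actually checked.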

{restatable}

theoremvarboundismax For any positive integer λ\lambda, we have

𝔼[(𝜹^IS,θ)2]kMP(θ)θeλε[r(λ,x)]keθx𝑑x\displaystyle\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}]\leq kM_{P}(\theta)\theta e^{-\lambda\varepsilon}\int\left[r(\lambda,x)\right]^{k}e^{-\theta x}dx (12)

where r(λ,x)r(\lambda,x) is an upper bound for the MGF of the privacy loss random variable under the truncated Gaussian mixture P|xP|_{\leq x}.

Proof.

Similar to Theorem E.3, the goal is to bound

kMP(θ)𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(1i=1keθti)]\displaystyle kM_{P}(\theta)\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{1}{\sum_{i=1}^{k}e^{\theta t_{i}}}\right)\right]

Note that

𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(1i=1keθti)]\displaystyle\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{1}{\sum_{i=1}^{k}e^{\theta t_{i}}}\right)\right]
=𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(1i=1keθti)I[y(𝒕;𝑷,𝑸)ε]]\displaystyle=\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{1}{\sum_{i=1}^{k}e^{\theta t_{i}}}\right)I[y(\bm{t};\bm{P},\bm{Q})\geq\varepsilon]\right]
𝔼𝒕𝑷[1i=1keθtiI[y(𝒕;𝑷,𝑸)ε]]\displaystyle\leq\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\frac{1}{\sum_{i=1}^{k}e^{\theta t_{i}}}I[y(\bm{t};\bm{P},\bm{Q})\geq\varepsilon]\right]
𝔼𝒕𝑷[1eθtmaxI[y(𝒕;𝑷,𝑸)ε]]\displaystyle\leq\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\frac{1}{e^{\theta t_{\max}}}I[y(\bm{t};\bm{P},\bm{Q})\geq\varepsilon]\right] (13)

where tmax:=maxitit_{\max}:=\max_{i}t_{i}.

Further note that

(13)\displaystyle(\ref{eq:conttwo}) =𝔼𝒕𝑷[eθtmaxI[y(𝒕)ε]]\displaystyle=\mathbb{E}_{\bm{t}\sim\bm{P}}\left[e^{-\theta t_{\max}}I[y(\bm{t})\geq\varepsilon]\right]
=𝔼𝒕𝑷[eθtmax|y(𝒕)ε]Pr𝒕𝑷[y(𝒕)ε]\displaystyle=\mathbb{E}_{\bm{t}\sim\bm{P}}\left[e^{-\theta t_{\max}}|y(\bm{t})\geq\varepsilon\right]\Pr_{\bm{t}\sim\bm{P}}[y(\bm{t})\geq\varepsilon]
=(eθxdPr𝒕𝑷[tmaxx])Pr𝒕𝑷[y(𝒕)ε]\displaystyle=\left(\int e^{-\theta x}\mathrm{d}\Pr_{\bm{t}\sim\bm{P}}[t_{\max}\leq x]\right)\Pr_{\bm{t}\sim\bm{P}}[y(\bm{t})\geq\varepsilon]
=(Pr𝒕𝑷[tmaxx|y(𝒕)ε]deθx)Pr𝒕𝑷[y(𝒕)ε]\displaystyle=-\left(\int\Pr_{\bm{t}\sim\bm{P}}\left[t_{\max}\leq x|y(\bm{t})\geq\varepsilon\right]\mathrm{d}e^{-\theta x}\right)\Pr_{\bm{t}\sim\bm{P}}[y(\bm{t})\geq\varepsilon] (14)
=θPr𝒕𝑷[tmaxx,y(𝒕)ε]eθxdx\displaystyle=\theta\int\Pr_{\bm{t}\sim\bm{P}}\left[t_{\max}\leq x,y(\bm{t})\geq\varepsilon\right]e^{-\theta x}\mathrm{d}x (15)

where (14) is obtained through integration by parts.

Now, as we can see from (15), the question reduces to bounding Pr𝒕𝑷[tmaxx,y(𝒕)ε]\Pr_{\bm{t}\sim\bm{P}}\left[t_{\max}\leq x,y(\bm{t})\geq\varepsilon\right] for any xx\in\mathbb{R}. It might be easier to write

Pr𝒕𝑷[tmaxx,y(𝒕)ε]=Pr𝒕𝑷[y(𝒕)ε|tmaxx]Pr𝒕𝑷[tmaxx]\displaystyle\Pr_{\bm{t}\sim\bm{P}}\left[t_{\max}\leq x,y(\bm{t})\geq\varepsilon\right]=\Pr_{\bm{t}\sim\bm{P}}\left[y(\bm{t})\geq\varepsilon|t_{\max}\leq x\right]\Pr_{\bm{t}\sim\bm{P}}\left[t_{\max}\leq x\right]

and we know that

Pr𝒕𝑷[tmaxx]\displaystyle\Pr_{\bm{t}\sim\bm{P}}\left[t_{\max}\leq x\right] =(PrtP[tx])k\displaystyle=\left(\Pr_{t\sim P}\left[t\leq x\right]\right)^{k} (16)
=((1q)Φ(x;0,σ2)+qΦ(x;1,σ2))k\displaystyle=\left((1-q)\Phi(x;0,\sigma^{2})+q\Phi(x;1,\sigma^{2})\right)^{k} (17)

as all tit_{i}s are i.i.d. random samples from PP, where Φ(;μ,σ2)\Phi(\cdot;\mu,\sigma^{2}) is the CDF of Gaussian distribution with mean μ\mu and variance σ2\sigma^{2}.

It remains to bound the conditional probability Pr𝒕𝑷[y(𝒕)ε|tmaxx]\Pr_{\bm{t}\sim\bm{P}}\left[y(\bm{t})\geq\varepsilon|t_{\max}\leq x\right]; it may be easier to see it in the following way:

Pr𝒕𝑷[y(𝒕)ε|tmaxx]\displaystyle\Pr_{\bm{t}\sim\bm{P}}\left[y(\bm{t})\geq\varepsilon|t_{\max}\leq x\right]
=Pr𝒕𝑷[y(𝒕)ε|t1x,,tkx]\displaystyle=\Pr_{\bm{t}\sim\bm{P}}\left[y(\bm{t})\geq\varepsilon|t_{1}\leq x,\ldots,t_{k}\leq x\right]
=Pr𝒕𝑷[i=1ky(ti)ε|y(t1)y(x),,y(tk)y(x)]\displaystyle=\Pr_{\bm{t}\sim\bm{P}}\left[\sum_{i=1}^{k}y(t_{i})\geq\varepsilon|y(t_{1})\leq y(x),\ldots,y(t_{k})\leq y(x)\right]
𝔼𝒕𝑷[eλi=1ky(ti)|y(t1)y(x),,y(tk)y(x)]eλε\displaystyle\leq\frac{\mathbb{E}_{\bm{t}\sim\bm{P}}\left[e^{\lambda\sum_{i=1}^{k}y(t_{i})}|y(t_{1})\leq y(x),\ldots,y(t_{k})\leq y(x)\right]}{e^{\lambda\varepsilon}}

where the last step is due to the Chernoff bound, which holds for any λ>0\lambda>0. Now we only need to bound the moment generating function of y(t)=log(1q+qe2t12σ2),tP|xy(t)=\log(1-q+qe^{\frac{2t-1}{2\sigma^{2}}}),t\sim P|_{\leq x}, where P|xP|_{\leq x} is the truncated distribution of PP. We note that this is equivalent to bounding the Rényi divergence of the truncated Gaussian mixture distribution.

Recall that P=(1q)𝒩(0,σ2)+q𝒩(1,σ2)P=(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(1,\sigma^{2}). For any λ\lambda that is a positive integer, we have

𝔼tP[eλy(t)I[tx]]\displaystyle\mathbb{E}_{t\sim P}\left[e^{\lambda y(t)}I[t\leq x]\right]
=12πσ[(1q)x(1q+qe2t12σ2)λet22σ2𝑑t+qx(1q+qe2t12σ2)λe(t1)22σ2𝑑t]\displaystyle=\frac{1}{\sqrt{2\pi}\sigma}\left[(1-q)\int_{-\infty}^{x}\left(1-q+qe^{\frac{2t-1}{2\sigma^{2}}}\right)^{\lambda}e^{-\frac{t^{2}}{2\sigma^{2}}}dt+q\int_{-\infty}^{x}\left(1-q+qe^{\frac{2t-1}{2\sigma^{2}}}\right)^{\lambda}e^{-\frac{(t-1)^{2}}{2\sigma^{2}}}dt\right]
=12(qe12σ2)λi=0λ(λi)q~i[(1q)e(λi)i2σ2(erf(x(λi)2σ)+1)+qe(λi+1)i12σ2(erf(x(λi+1)2σ)+1)]\displaystyle=\frac{1}{2}\left(qe^{-\frac{1}{2\sigma^{2}}}\right)^{\lambda}\sum_{i=0}^{\lambda}{\lambda\choose i}\tilde{q}^{i}\left[(1-q)e^{\frac{(\lambda-i)^{i}}{2\sigma^{2}}}\left(\mathrm{erf}\left(\frac{x-(\lambda-i)}{\sqrt{2}\sigma}\right)+1\right)+qe^{\frac{(\lambda-i+1)^{i}-1}{2\sigma^{2}}}\left(\mathrm{erf}\left(\frac{x-(\lambda-i+1)}{\sqrt{2}\sigma}\right)+1\right)\right] (18)

where q~:=(1q)exp(1/(2σ2))q\tilde{q}:=\frac{(1-q)\exp(1/(2\sigma^{2}))}{q}. Note that the above expression can be computed efficiently. Denote the expression in (18) by r(λ,x)r(\lambda,x). Hence
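The closed form (18) can be cross-checked numerically: the sketch below (assuming SciPy; the parameters in the check are illustrative) evaluates the truncated MGF E_{t∼P}[e^{λy(t)} I[t ≤ x]] by direct quadrature, which any implementation of (18) should agree with.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def truncated_mgf(lam, x, q, sigma):
    """E_{t~P}[ exp(lam * y(t)) * I[t <= x] ] by numerical quadrature, where
    P = (1-q) N(0, sigma^2) + q N(1, sigma^2) and
    y(t) = log(1 - q + q * exp((2t - 1) / (2 sigma^2)))."""
    def integrand(t):
        dens = (1 - q) * norm.pdf(t, 0.0, sigma) + q * norm.pdf(t, 1.0, sigma)
        y = np.log(1 - q + q * np.exp((2 * t - 1) / (2 * sigma ** 2)))
        return np.exp(lam * y) * dens
    val, _ = integrate.quad(integrand, -np.inf, x)
    return val
```

Agreement between this quadrature and a Monte Carlo estimate of the same expectation gives a quick consistency check.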

𝔼tP[eλy(t)|tx]=r(λ,x)PrtP[tx]\displaystyle\mathbb{E}_{t\sim P}\left[e^{\lambda y(t)}|t\leq x\right]=\frac{r(\lambda,x)}{\Pr_{t\sim P}[t\leq x]} (19)

Now we have

Pr𝒕𝑷[y(𝒕)ε|tmaxx]\displaystyle\Pr_{\bm{t}\sim\bm{P}}\left[y(\bm{t})\geq\varepsilon|t_{\max}\leq x\right] 𝔼𝒕𝑷[eλi=1ky(ti)|tmaxx]eλε\displaystyle\leq\frac{\mathbb{E}_{\bm{t}\sim\bm{P}}\left[e^{\lambda\sum_{i=1}^{k}y(t_{i})}|t_{\max}\leq x\right]}{e^{\lambda\varepsilon}}
=[r(λ,x)]keλε(PrtP[tx])k\displaystyle=\frac{\left[r(\lambda,x)\right]^{k}}{e^{\lambda\varepsilon}\left(\Pr_{t\sim P}[t\leq x]\right)^{k}}

Plugging this bound into (15), we have

(13)\displaystyle(\ref{eq:conttwo}) =θPr𝒕𝑷[tmaxx,y(𝒕)ε]eθxdx\displaystyle=\theta\int\Pr_{\bm{t}\sim\bm{P}}\left[t_{\max}\leq x,y(\bm{t})\geq\varepsilon\right]e^{-\theta x}\mathrm{d}x
θeλε[r(λ,x)]keθxdx\displaystyle\leq\theta e^{-\lambda\varepsilon}\int\left[r(\lambda,x)\right]^{k}e^{-\theta x}\mathrm{d}x

which leads to the final conclusion. ∎

Remark 21.

In practice, we can further improve the bound by moving the minimum operation inside the integral:

𝔼[(𝜹^IS,θ)2]kMP(θ)θeλε[minλr(λ,x)]keθx𝑑x\displaystyle\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}]\leq kM_{P}(\theta)\theta e^{-\lambda\varepsilon}\int\left[\min_{\lambda}r(\lambda,x)\right]^{k}e^{-\theta x}dx (20)

Of course, this bound will be less efficient to compute.

Theorem 19 (Generalizing Theorem E.3 and Theorem E.3 via Hölder’s inequality).

For any positive integer λ\lambda, and for any a,b1a,b\geq 1 s.t. 1/a+1/b=11/a+1/b=1, we have 𝔼[(𝛅^IS,θ)2]kMP(θ)(𝔼[𝛅^MC2a])1/a(bθeλε[r(λ,x)]kebθx𝑑x)1/b\mathbb{E}[(\widehat{\bm{\delta}}_{\texttt{IS},\theta})^{2}]\leq kM_{P}(\theta)\left(\mathbb{E}[\widehat{\bm{\delta}}_{\texttt{MC}}^{2a}]\right)^{1/a}\cdot\left(b\theta e^{-\lambda\varepsilon}\int\left[r(\lambda,x)\right]^{k}e^{-b\theta x}dx\right)^{1/b} where r(λ,x)r(\lambda,x) is an upper bound for the MGF of privacy loss random variable of truncated Gaussian mixture P|xP|_{\leq x} defined in (18).

Proof.

Note that Theorem E.3 and Theorem E.3 can both be viewed as special cases of Hölder’s inequality: for any a,b1a,b\geq 1 s.t. 1a+1b=1\frac{1}{a}+\frac{1}{b}=1, we have

𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(1i=1keθti)]\displaystyle\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{1}{\sum_{i=1}^{k}e^{\theta t_{i}}}\right)\right]
=𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2(1i=1keθti)I[y(𝒕;𝑷,𝑸)ε]]\displaystyle=\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2}\left(\frac{1}{\sum_{i=1}^{k}e^{\theta t_{i}}}\right)I[y(\bm{t};\bm{P},\bm{Q})\geq\varepsilon]\right]
𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2a]1/a𝔼𝒕𝑷[(1i=1keθti)bI[y(𝒕;𝑷,𝑸)ε]]1/b\displaystyle\leq\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2a}\right]^{1/a}\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(\frac{1}{\sum_{i=1}^{k}e^{\theta t_{i}}}\right)^{b}I[y(\bm{t};\bm{P},\bm{Q})\geq\varepsilon]\right]^{1/b}

Theorem E.3 corresponds to the case where a=1a=1, and Theorem E.3 corresponds to the case where b=1b=1. We can tune the parameters aa and bb to potentially obtain better bounds, since we have

𝔼𝒕𝑷[(1eεy(𝒕;𝑷,𝑸))+2a]𝔼[𝜹^MC2a]\displaystyle\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(1-e^{\varepsilon-y(\bm{t};\bm{P},\bm{Q})}\right)_{+}^{2a}\right]\leq\mathbb{E}[\widehat{\bm{\delta}}_{\texttt{MC}}^{2a}] (21)

and

𝔼𝒕𝑷[(1i=1keθti)bI[y(𝒕;𝑷,𝑸)ε]]bθeλε[r(λ,x)]kebθx𝑑x\displaystyle\mathbb{E}_{\bm{t}\sim\bm{P}}\left[\left(\frac{1}{\sum_{i=1}^{k}e^{\theta t_{i}}}\right)^{b}I[y(\bm{t};\bm{P},\bm{Q})\geq\varepsilon]\right]\leq b\theta e^{-\lambda\varepsilon}\int\left[r(\lambda,x)\right]^{k}e^{-b\theta x}dx (22)

Simply combining the above inequalities leads to the final conclusion. ∎

Appendix F Proofs for Utility

F.1 Overview

While one may be able to bound the false negative rate through techniques similar to those we use for the false positive rate, i.e., by applying concentration inequalities, the resulting guarantee may be loose. Since a formal, strict guarantee for 𝖥𝖭DPV\mathsf{FN}_{\texttt{DPV}} is not required, we provide a convenient heuristic for picking an appropriate Δ\Delta such that 𝖥𝖭DPV\mathsf{FN}_{\texttt{DPV}} is approximately smaller than 𝖥𝖯DPV\mathsf{FP}_{\texttt{DPV}}.

For any mechanism \mathcal{M} such that δest>ρδY(ε)\delta^{\mathrm{est}}>\rho\delta_{Y}(\varepsilon), we have

PrDPV[DPV(,ε,δest)=𝙵𝚊𝚕𝚜𝚎]\displaystyle\Pr_{\texttt{DPV}}\left[\texttt{DPV}(\mathcal{M},\varepsilon,\delta^{\mathrm{est}})=\mathtt{False}\right]
=Pr[𝜹^MCm>δest/τΔ]\displaystyle=\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}>\delta^{\mathrm{est}}/\tau-\Delta\right]
=Pr[𝜹^MCmδY(ε)>δest/τδY(ε)Δ]\displaystyle=\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta_{Y}(\varepsilon)>\delta^{\mathrm{est}}/\tau-\delta_{Y}(\varepsilon)-\Delta\right]
Pr[𝜹^MCmδY(ε)>((1/τ1/ρ)δestΔ)]\displaystyle\leq\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta_{Y}(\varepsilon)>\left(\left(1/\tau-1/\rho\right)\delta^{\mathrm{est}}-\Delta\right)\right]

and in the meantime, if δest<τδY(ε)\delta^{\mathrm{est}}<\tau\delta_{Y}(\varepsilon), we have

Pr[𝜹^MCm<δest/τΔ]Pr[𝜹^MCmδY(ε)<Δ]\displaystyle\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}<\delta^{\mathrm{est}}/\tau-\Delta\right]\leq\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta_{Y}(\varepsilon)<-\Delta\right]

Our main idea is to find Δ\Delta such that

Pr[𝜹^MCmδY(ε)>((1/τ1/ρ)δestΔ)]\displaystyle\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta_{Y}(\varepsilon)>\left(\left(1/\tau-1/\rho\right)\delta^{\mathrm{est}}-\Delta\right)\right]
Pr[𝜹^MCmδY(ε)<Δ]\displaystyle\lessapprox\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta_{Y}(\varepsilon)<-\Delta\right]

In this way, we know that 𝖥𝖭MC(ε,δest;ρ)\mathsf{FN}_{\texttt{MC}}(\varepsilon,\delta^{\mathrm{est}};\rho) is upper bounded by Θ(δest/τ)\Theta(\delta^{\mathrm{est}}/\tau) for the same number of samples we discussed in Section 4.3.

Observe that 𝜹^MC\widehat{\bm{\delta}}_{\texttt{MC}} (or 𝜹^IS,θ\widehat{\bm{\delta}}_{\texttt{IS},\theta} with a not-too-large θ\theta) typically follows a highly asymmetric distribution with a significant probability mass at zero, and its density decreases monotonically for larger values. Under such conditions, we prove the following result:

Theorem 22 (Informal).

When m2νΔ2log(τ/δest)m\geq\frac{2\nu}{\Delta^{2}}\log(\tau/\delta^{\mathrm{est}}), we have

Pr[𝜹^MCmδY(ε)<Δ]Pr[𝜹^MCmδY(ε)>32Δ]\displaystyle\Pr[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta_{Y}(\varepsilon)<-\Delta]\gtrapprox\Pr[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta_{Y}(\varepsilon)>\frac{3}{2}\Delta]

The proof is deferred to Appendix F.2. Therefore, by setting Δ=0.4(1/τ1/ρ)δest\Delta=0.4\left(1/\tau-1/\rho\right)\delta^{\mathrm{est}}, one can ensure that 𝖥𝖭MC(ε,δest;ρ)\mathsf{FN}_{\texttt{MC}}(\varepsilon,\delta^{\mathrm{est}};\rho) is also (approximately) upper bounded by Θ(δest/τ)\Theta(\delta^{\mathrm{est}}/\tau). We empirically verify the effectiveness of this heuristic by estimating the actual false negative rate. As we can see from Figure 6 (b), the dashed curve is much higher than the two solid curves, which means that our bound on the false negative rate is very conservative.
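In code, the heuristic Δ together with the sample-size condition of Theorem 22 reads as follows (a sketch; the second-moment bound ν and the values used in the check are illustrative assumptions, not values from the paper):

```python
import math

def heuristic_delta_and_m(delta_est, tau, nu):
    """Heuristic Delta = 0.4 * (1/tau - 1/rho) * delta_est with rho = (1+tau)/2,
    and the matching sample size m >= (2*nu/Delta^2) * log(tau/delta_est)
    from Theorem 22. nu is an assumed second-moment bound on the estimator."""
    rho = (1 + tau) / 2
    Delta = 0.4 * (1 / tau - 1 / rho) * delta_est
    m = math.ceil(2 * nu / Delta ** 2 * math.log(tau / delta_est))
    return Delta, m
```

Note that Delta shrinks linearly with delta_est, so the required m grows quadratically as the target privacy parameter gets smaller.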

Remark 23 (For the halting case).

In this section, we develop techniques for ensuring that the false negative rate (i.e., the rejection probability) is around O(δ)O(\delta) when the proposed privacy parameter δest\delta^{\mathrm{est}} is close to the true δ\delta. In the experiment, we use the state-of-the-art FFT accountant to produce δest\delta^{\mathrm{est}}, which is very accurate, as we can see from Figure 1. Hence, the rejection probability in the experiment is around O(δ)O(\delta), which means the probability of rejection is close to the probability of a catastrophic privacy failure.

If one is still concerned that the rejection probability is too large, we can further reduce it as follows: we run two instances of the EVR paradigm simultaneously. If both instances pass, we randomly pick one and release its output; if exactly one passes, we release the output of the passing instance. The procedure fails only when both instances fail. By running two instances of the EVR paradigm in parallel, the false positive rate (i.e., the final δ\delta) is at most doubled, but the probability of rejection is squared.
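The claim can be sanity-checked with a tiny simulation (a sketch; p_reject denotes a hypothetical single-instance rejection probability, and the verifier randomness of the two instances is assumed independent):

```python
import numpy as np

def two_instance_reject_rate(p_reject, n_trials=200000, seed=3):
    """Monte Carlo check of the parallel-EVR claim: with two independent
    instances, the run is rejected only when both instances fail, so the
    overall rejection probability is approximately p_reject**2."""
    rng = np.random.default_rng(seed)
    fail1 = rng.random(n_trials) < p_reject
    fail2 = rng.random(n_trials) < p_reject
    return (fail1 & fail2).mean()
```

The false positive side is not simulated here; it at most doubles because either released instance can independently consume its δ budget.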

We can also introduce a variant of our EVR paradigm that better deals with the failure case: whenever a rejection occurs, we run a different mechanism \mathcal{M}^{\prime} that is guaranteed to be (ε,δest)(\varepsilon,\delta^{\mathrm{est}})-DP (e.g., by adjusting the subsampling rate and/or noise multiplier in DP-SGD). Moreover, we use the FFT accountant to obtain a strict privacy guarantee upper bound (ε,δupp)(\varepsilon,\delta^{\mathrm{upp}}) for the original mechanism \mathcal{M}, where δest<δupp\delta^{\mathrm{est}}<\delta^{\mathrm{upp}}. We use p𝖥𝖭p_{\mathsf{FN}} and p𝖥𝖯p_{\mathsf{FP}} to denote the false negative and false positive rates of the privacy verifier used in the EVR paradigm. If the original mechanism \mathcal{M} is indeed (ε,δest)(\varepsilon,\delta^{\mathrm{est}})-DP, then for any subset SS, for this augmented EVR paradigm aug\mathcal{M}^{\mathrm{aug}} we have

Pr[aug(D)S]\displaystyle\Pr[\mathcal{M}^{\mathrm{aug}}(D)\in S] =(1p𝖥𝖭)Pr[(D)S]+p𝖥𝖭Pr[(D)S]\displaystyle=(1-p_{\mathsf{FN}})\Pr[\mathcal{M}(D)\in S]+p_{\mathsf{FN}}\Pr[\mathcal{M}^{\prime}(D)\in S]
(1p𝖥𝖭)(eεPr[(D)S]+δest)+p𝖥𝖭(eεPr[(D)S]+δest)\displaystyle\leq(1-p_{\mathsf{FN}})(e^{\varepsilon}\Pr[\mathcal{M}(D^{\prime})\in S]+\delta^{\mathrm{est}})+p_{\mathsf{FN}}(e^{\varepsilon}\Pr[\mathcal{M}^{\prime}(D^{\prime})\in S]+\delta^{\mathrm{est}})
eεPr[aug(D)S]+δest\displaystyle\leq e^{\varepsilon}\Pr[\mathcal{M}^{\mathrm{aug}}(D^{\prime})\in S]+\delta^{\mathrm{est}}

If the original mechanism \mathcal{M} is not (ε,δest)(\varepsilon,\delta^{\mathrm{est}})-DP, then we have

Pr[aug(D)S]\displaystyle\Pr[\mathcal{M}^{\mathrm{aug}}(D)\in S] =p𝖥𝖯Pr[(D)S]+(1p𝖥𝖯)Pr[(D)S]\displaystyle=p_{\mathsf{FP}}\Pr[\mathcal{M}(D)\in S]+(1-p_{\mathsf{FP}})\Pr[\mathcal{M}^{\prime}(D)\in S]
p𝖥𝖯(eεPr[(D)S]+δupp)+(1p𝖥𝖯)(eεPr[(D)S]+δest)\displaystyle\leq p_{\mathsf{FP}}(e^{\varepsilon}\Pr[\mathcal{M}(D^{\prime})\in S]+\delta^{\mathrm{upp}})+(1-p_{\mathsf{FP}})(e^{\varepsilon}\Pr[\mathcal{M}^{\prime}(D^{\prime})\in S]+\delta^{\mathrm{est}})
eεPr[aug(D)S]+δest+p𝖥𝖯(δuppδest)\displaystyle\leq e^{\varepsilon}\Pr[\mathcal{M}^{\mathrm{aug}}(D^{\prime})\in S]+\delta^{\mathrm{est}}+p_{\mathsf{FP}}(\delta^{\mathrm{upp}}-\delta^{\mathrm{est}})

Hence, this augmented EVR algorithm will be (ε,δest+p𝖥𝖯(δuppδest))(\varepsilon,\delta^{\mathrm{est}}+p_{\mathsf{FP}}(\delta^{\mathrm{upp}}-\delta^{\mathrm{est}}))-DP, and if p𝖥𝖯p_{\mathsf{FP}} is around δest\delta^{\mathrm{est}}, then the extra term p𝖥𝖯(δuppδest)p_{\mathsf{FP}}(\delta^{\mathrm{upp}}-\delta^{\mathrm{est}}) will be very small. We can also adjust the privacy guarantee for \mathcal{M}^{\prime} such that the privacy guarantees for the two cases are the same, which can further optimize the final privacy cost.

F.2 Technical Details

In this section, we provide theoretical justification for the heuristic of setting Δ=0.4(1/τ1/ρ)δest\Delta=0.4\left(1/\tau-1/\rho\right)\delta^{\mathrm{est}}.

For notational convenience, throughout this section we discuss 𝜹^MC\widehat{\bm{\delta}}_{\texttt{MC}}, but the same argument also applies to 𝜹^IS,θ\widehat{\bm{\delta}}_{\texttt{IS},\theta} with a not-too-large θ\theta, unless otherwise specified. We use 𝜹^MC(x)\widehat{\bm{\delta}}_{\texttt{MC}}(x) to denote the density of 𝜹^MC\widehat{\bm{\delta}}_{\texttt{MC}} at xx. Note that 𝜹^MC(0)=\widehat{\bm{\delta}}_{\texttt{MC}}(0)=\infty.

We make the following assumption about the distribution of 𝜹^MC\widehat{\bm{\delta}}_{\texttt{MC}}.

Assumption 24.

Pr[𝜹^MC=0]1/2\Pr[\widehat{\bm{\delta}}_{\texttt{MC}}=0]\geq 1/2.

While intuitive, this assumption is hard to verify analytically for the Subsampled Gaussian mechanism. Therefore, we instead provide a condition under which the assumption holds for the Pure Gaussian mechanism.

Lemma 25.

Fix ε,σ\varepsilon,\sigma. When k/(2σ2)εk/(2\sigma^{2})\leq\varepsilon, Assumption 24 holds.

Proof.

The PRV for the composition of kk Gaussian mechanisms is 𝒩(k2σ2,kσ2)\mathcal{N}\left(\frac{k}{2\sigma^{2}},\frac{k}{\sigma^{2}}\right).

Pr[(1eεY)+=0]=Pr[Yε]\displaystyle\Pr[(1-e^{\varepsilon-Y})_{+}=0]=\Pr[Y\leq\varepsilon]

which is clearly 1/2\geq 1/2 when k/(2σ2)εk/(2\sigma^{2})\leq\varepsilon. ∎
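The computation in the proof is a single Gaussian CDF evaluation, which can be checked numerically (a sketch assuming SciPy; the parameters in the check are illustrative):

```python
import numpy as np
from scipy.stats import norm

def prob_clip_zero(k, sigma, eps):
    """Pr[(1 - e^{eps - Y})_+ = 0] = Pr[Y <= eps] for the Pure Gaussian
    mechanism PRV Y ~ N(k/(2 sigma^2), k/sigma^2)."""
    mean = k / (2 * sigma ** 2)
    std = np.sqrt(k) / sigma
    return norm.cdf(eps, loc=mean, scale=std)
```

Whenever the mean k/(2σ²) lies at or below ε, the CDF at ε is at least 1/2, exactly as Lemma 25 states.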

We also empirically verify this assumption for the Subsampled Gaussian mechanism in Figure 7. As we can see, the θ\theta selected by our heuristic (the red star) has Pr[𝜹^IS,θ=0]1/2\Pr[\widehat{\bm{\delta}}_{\texttt{IS},\theta}=0]\approx 1/2, which matches our principle of selecting θ\theta. The θ\theta that minimizes the analytical bound (which we use in practice) achieves Pr[𝜹^IS,θ=0]0.880.5\Pr[\widehat{\bm{\delta}}_{\texttt{IS},\theta}=0]\approx 0.88\gg 0.5.

Refer to caption
Figure 7: We empirically estimate Pr[𝜹^IS,θ=0]\Pr[\widehat{\bm{\delta}}_{\texttt{IS},\theta}=0] for the case of the Poisson Subsampled Gaussian mechanism. We set q=103,σ=0.6,ε=1.5,k=100q=10^{-3},\sigma=0.6,\varepsilon=1.5,k=100. δY(ε)7.7×106\delta_{Y}(\varepsilon)\approx 7.7\times 10^{-6} in this case. The red star indicates the second moment for the value of θ\theta selected by our heuristic in Definition 16. The blue star indicates the θ\theta that minimizes the analytical bound.

Our goal is to show that Pr[𝜹^MCmδY(ε)<Δ]Pr[𝜹^MCmδY(ε)>Δ]\Pr[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta_{Y}(\varepsilon)<-\Delta^{*}]\gtrapprox\Pr[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta_{Y}(\varepsilon)>\Delta^{*}] for large mm. For notational simplicity, we denote δ:=𝔼[𝜹^MC]=δY(ε)\delta:=\mathbb{E}[\widehat{\bm{\delta}}_{\texttt{MC}}]=\delta_{Y}(\varepsilon) in the remainder of this section. We also denote

p0\displaystyle p_{0} :=Pr[𝜹^MC=0]\displaystyle:=\Pr[\widehat{\bm{\delta}}_{\texttt{MC}}=0]
p(0,1)\displaystyle p_{(0,1)} :=Pr[0<𝜹^MC<δ]\displaystyle:=\Pr[0<\widehat{\bm{\delta}}_{\texttt{MC}}<\delta]
p(1,2)\displaystyle p_{(1,2)} :=Pr[δ𝜹^MC2δ]\displaystyle:=\Pr[\delta\leq\widehat{\bm{\delta}}_{\texttt{MC}}\leq 2\delta]
p(2,)\displaystyle p_{(2,\infty)} :=Pr[𝜹^MC2δ]\displaystyle:=\Pr[\widehat{\bm{\delta}}_{\texttt{MC}}\geq 2\delta]

Note that p0+p(0,1)+p(1,2)+p(2,)=1p_{0}+p_{(0,1)}+p_{(1,2)}+p_{(2,\infty)}=1. We also write 𝜹^MC|(a,b)\widehat{\bm{\delta}}_{\texttt{MC}}|_{(a,b)} and 𝜹^MC|[a,b)\widehat{\bm{\delta}}_{\texttt{MC}}|_{[a,b)} to indicate the truncated distribution of 𝜹^MC\widehat{\bm{\delta}}_{\texttt{MC}} on (a,b)(a,b) and [a,b)[a,b), respectively.

We first construct an alternative random variable 𝜹~MC\widetilde{\bm{\delta}}_{\texttt{MC}} with the following distribution:

𝜹~MC={2δxforx𝜹^MC|2δw.p.p(2,)xforx𝜹^MC|(0,δ)w.p.p(0,1)2δxforx𝜹^MC|(0,δ)w.p.p(0,1)xforx𝜹^MC|2δw.p.p(2,)δw.p.12(p(0,1)+p(2,))\widetilde{\bm{\delta}}_{\texttt{MC}}=\left\{\begin{array}[]{lr}2\delta-x~{}~{}\text{for}~{}~{}x\sim\widehat{\bm{\delta}}_{\texttt{MC}}|_{\geq 2\delta}&\text{w.p.}~{}~{}p_{(2,\infty)}\\ x~{}~{}\text{for}~{}~{}x\sim\widehat{\bm{\delta}}_{\texttt{MC}}|_{(0,\delta)}&\text{w.p.}~{}~{}p_{(0,1)}\\ 2\delta-x~{}~{}\text{for}~{}~{}x\sim\widehat{\bm{\delta}}_{\texttt{MC}}|_{(0,\delta)}&\text{w.p.}~{}~{}p_{(0,1)}\\ x~{}~{}\text{for}~{}~{}x\sim\widehat{\bm{\delta}}_{\texttt{MC}}|_{\geq 2\delta}&\text{w.p.}~{}~{}p_{(2,\infty)}\\ \delta&\text{w.p.}~{}~{}1-2(p_{(0,1)}+p_{(2,\infty)})\end{array}\right. (23)

This is a valid probability distribution due to Assumption 24, since p(0,1)+p(2,)1p01/2p_{(0,1)}+p_{(2,\infty)}\leq 1-p_{0}\leq 1/2. The distribution of 𝜹~MC\widetilde{\bm{\delta}}_{\texttt{MC}} is illustrated in Figure 8. Note that 𝜹~MC\widetilde{\bm{\delta}}_{\texttt{MC}} is a symmetric distribution with 𝔼[𝜹~MC]=δ\mathbb{E}[\widetilde{\bm{\delta}}_{\texttt{MC}}]=\delta.

Refer to caption
Figure 8: Illustration for the density function of 𝜹~MC\widetilde{\bm{\delta}}_{\texttt{MC}} (black curve indicates the density curve for 𝜹^MC\widehat{\bm{\delta}}_{\texttt{MC}}, and red curve indicates the density curve for 𝜹^IS,θ\widehat{\bm{\delta}}_{\texttt{IS},\theta}).

Similar to the notation of 𝜹^MCm\widehat{\bm{\delta}}_{\texttt{MC}}^{m}, we write 𝜹~MCm:=1mi=1m𝜹~MC(i)\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}:=\frac{1}{m}\sum_{i=1}^{m}\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}. We show an asymmetry result for 𝜹~MCm\widetilde{\bm{\delta}}_{\texttt{MC}}^{m} in terms of δ\delta.

Lemma 26.

For any Δ\Delta^{*}\in\mathbb{R}, we have

Pr[𝜹~MCmδ<Δ]Pr[𝜹~MCmδ>Δ]\displaystyle\Pr[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\delta<-\Delta^{*}]\geq\Pr[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\delta>\Delta^{*}] (24)
Proof.

Note that the claim holds trivially when m=0,1m=0,1. For m2m\geq 2, we use an induction argument. Suppose we have

Pr[𝜹~MCm1δ<Δ]Pr[𝜹~MCm1δ>Δ]\displaystyle\Pr[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m-1}-\delta<-\Delta^{*}]\geq\Pr[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m-1}-\delta>\Delta^{*}] (25)

for any Δ\Delta^{*}\in\mathbb{R}. That is,

Pr[i=1m1𝜹~MC(i)(m1)δ<(m1)Δ]Pr[i=1m1𝜹~MC(i)(m1)δ>(m1)Δ]\displaystyle\Pr\left[\sum_{i=1}^{m-1}\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}-(m-1)\delta<-(m-1)\Delta^{*}\right]\geq\Pr\left[\sum_{i=1}^{m-1}\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}-(m-1)\delta>(m-1)\Delta^{*}\right] (26)

for any Δ\Delta^{*}\in\mathbb{R}.

Pr[𝜹~MCmδ<Δ]\displaystyle\Pr[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\delta<-\Delta^{*}] =Pr[i=1m𝜹~MC(i)mδ<mΔ]\displaystyle=\Pr\left[\sum_{i=1}^{m}\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}-m\delta<-m\Delta^{*}\right]
=Pr[i=1m1𝜹~MC(i)(m1)δ+𝜹~MC(m)δ<mΔ]\displaystyle=\Pr\left[\sum_{i=1}^{m-1}\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}-(m-1)\delta+\widetilde{\bm{\delta}}_{\texttt{MC}}^{(m)}-\delta<-m\Delta^{*}\right]
=Pr[i=1m1𝜹~MC(i)(m1)δ<mΔx]Pr[𝜹~MC(m)δ=x]dx\displaystyle=\int\Pr\left[\sum_{i=1}^{m-1}\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}-(m-1)\delta<-m\Delta^{*}-x\right]\Pr[\widetilde{\bm{\delta}}_{\texttt{MC}}^{(m)}-\delta=x]\mathrm{d}x
Pr[i=1m1𝜹~MC(i)(m1)δ>mΔ+x]Pr[𝜹~MC(m)δ=x]dx\displaystyle\geq\int\Pr\left[\sum_{i=1}^{m-1}\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}-(m-1)\delta>m\Delta^{*}+x\right]\Pr[\widetilde{\bm{\delta}}_{\texttt{MC}}^{(m)}-\delta=x]\mathrm{d}x
=Pr[i=1m1𝜹~MC(i)(m1)δ(𝜹~MC(m)δ)>mΔ]\displaystyle=\Pr\left[\sum_{i=1}^{m-1}\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}-(m-1)\delta-(\widetilde{\bm{\delta}}_{\texttt{MC}}^{(m)}-\delta)>m\Delta^{*}\right]
=Pr[x(𝜹~MC(m)δ)>mΔ]Pr[i=1m1𝜹~MC(i)(m1)δ=x]dx\displaystyle=\int\Pr\left[x-(\widetilde{\bm{\delta}}_{\texttt{MC}}^{(m)}-\delta)>m\Delta^{*}\right]\Pr\left[\sum_{i=1}^{m-1}\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}-(m-1)\delta=x\right]\mathrm{d}x
=Pr[𝜹~MC(m)δ<mΔ+x]Pr[i=1m1𝜹~MC(i)(m1)δ=x]dx\displaystyle=\int\Pr\left[\widetilde{\bm{\delta}}_{\texttt{MC}}^{(m)}-\delta<-m\Delta^{*}+x\right]\Pr\left[\sum_{i=1}^{m-1}\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}-(m-1)\delta=x\right]\mathrm{d}x
Pr[𝜹~MC(m)δ>mΔx]Pr[i=1m1𝜹~MC(i)(m1)δ=x]dx\displaystyle\geq\int\Pr\left[\widetilde{\bm{\delta}}_{\texttt{MC}}^{(m)}-\delta>m\Delta^{*}-x\right]\Pr\left[\sum_{i=1}^{m-1}\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}-(m-1)\delta=x\right]\mathrm{d}x
=Pr[i=1m𝜹~MC(i)mδ>mΔ]\displaystyle=\Pr\left[\sum_{i=1}^{m}\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}-m\delta>m\Delta^{*}\right]
=Pr[𝜹~MCmδ>Δ]\displaystyle=\Pr\left[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\delta>\Delta^{*}\right]

where the two inequalities are due to the induction assumption. ∎

Now we come back and analyze 𝜹^MCm\widehat{\bm{\delta}}_{\texttt{MC}}^{m} by using Lemma 26.

Pr[𝜹^MCmδΔ]\displaystyle\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta\leq-\Delta^{*}\right]
=Pr[𝜹^MCm𝜹~MCm+𝜹~MCmδΔ]\displaystyle=\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}+\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\delta\leq-\Delta^{*}\right]
Pr[𝜹^MCm𝜹~MCmc]Pr[𝜹^MCm𝜹~MCm+𝜹~MCmδΔ|𝜹^MCm𝜹~MCmc]\displaystyle\geq\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}\leq c\right]\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}+\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\delta\leq-\Delta^{*}|\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}\leq c\right]
Pr[𝜹^MCm𝜹~MCmc]Pr[𝜹~MCmδΔc]\displaystyle\geq\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}\leq c\right]\Pr\left[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\delta\leq-\Delta^{*}-c\right]
Pr[𝜹^MCm𝜹~MCmc]Pr[𝜹~MCmδΔ+c]\displaystyle\geq\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}\leq c\right]\Pr\left[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\delta\geq\Delta^{*}+c\right]

Similarly,

Pr[𝜹~MCmδΔ+c]\displaystyle\Pr\left[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\delta\geq\Delta^{*}+c\right]
=Pr[𝜹~MCm𝜹^MCm+𝜹^MCmδΔ+c]\displaystyle=\Pr\left[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\widehat{\bm{\delta}}_{\texttt{MC}}^{m}+\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta\geq\Delta^{*}+c\right]
Pr[𝜹~MCm𝜹^MCmc]Pr[𝜹~MCm𝜹^MCm+𝜹^MCmδΔ+c|𝜹~MCm𝜹^MCmc]\displaystyle\geq\Pr\left[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\widehat{\bm{\delta}}_{\texttt{MC}}^{m}\geq-c\right]\Pr\left[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\widehat{\bm{\delta}}_{\texttt{MC}}^{m}+\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta\geq\Delta^{*}+c|\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\widehat{\bm{\delta}}_{\texttt{MC}}^{m}\geq-c\right]
Pr[𝜹~MCm𝜹^MCmc]Pr[𝜹^MCmδΔ+2c]\displaystyle\geq\Pr\left[\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}-\widehat{\bm{\delta}}_{\texttt{MC}}^{m}\geq-c\right]\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta\geq\Delta^{*}+2c\right]

Overall, we have

Pr[𝜹^MCmδΔ](Pr[𝜹^MCm𝜹~MCmc])2Pr[𝜹^MCmδΔ+2c]\displaystyle\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta\leq-\Delta^{*}\right]\geq\left(\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}\leq c\right]\right)^{2}\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta\geq\Delta^{*}+2c\right] (27)

for any Δ\Delta^{*}\in\mathbb{R}. Note that the above argument holds for any coupling between 𝜹^MCm\widehat{\bm{\delta}}_{\texttt{MC}}^{m} and 𝜹~MCm\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}. To maximize Pr[𝜹^MCm𝜹~MCmc]\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}\leq c\right], we can sample 𝜹~MCm\widetilde{\bm{\delta}}_{\texttt{MC}}^{m} for a given 𝜹^MCm\widehat{\bm{\delta}}_{\texttt{MC}}^{m} as follows: for each 𝜹^MC(i)\widehat{\bm{\delta}}_{\texttt{MC}}^{(i)},

  1. 1.

    If 𝜹^MC(i)>0\widehat{\bm{\delta}}_{\texttt{MC}}^{(i)}>0, then let 𝜹~MC(i)=𝜹^MC(i)\widetilde{\bm{\delta}}_{\texttt{MC}}^{(i)}=\widehat{\bm{\delta}}_{\texttt{MC}}^{(i)}.

  2. 2.

    If 𝜹^MC(i)=0\widehat{\bm{\delta}}_{\texttt{MC}}^{(i)}=0, then with probability p(2,)/p0p_{(2,\infty)}/p_{0}, output 2δx2\delta-x for x𝜹^MC|2δx\sim\widehat{\bm{\delta}}_{\texttt{MC}}|_{\geq 2\delta}; with probability p(0,1)/p0p_{(0,1)}/p_{0}, output 2δx2\delta-x for x𝜹^MC|(0,δ)x\sim\widehat{\bm{\delta}}_{\texttt{MC}}|_{(0,\delta)}; with probability 1(p(0,1)+p(2,))/p01-(p_{(0,1)}+p_{(2,\infty)})/p_{0}, output δ\delta.
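The two-step coupling above can be sketched in code (illustrative; delta_hat stands for an array of i.i.d. draws of the MC estimator, and empirical tail samples stand in for the truncated distributions):

```python
import numpy as np

def coupled_tilde(delta_hat, delta, rng):
    """Sample tilde-delta jointly with the given hat-delta draws: keep
    positive values as-is; map each zero to 2*delta - x (x drawn from the
    empirical upper tail [2*delta, inf) or lower part (0, delta)) or to
    delta itself, with the probabilities stated above.
    Assumes Pr[hat = 0] >= 1/2 (Assumption 24)."""
    delta_hat = np.asarray(delta_hat, dtype=float)
    n = len(delta_hat)
    zeros = np.flatnonzero(delta_hat == 0)
    p0 = len(zeros) / n
    lower = delta_hat[(delta_hat > 0) & (delta_hat < delta)]
    upper = delta_hat[delta_hat >= 2 * delta]
    p_lo, p_up = len(lower) / n, len(upper) / n
    out = delta_hat.copy()
    for i in zeros:
        u = rng.random()
        if u < p_up / p0:
            out[i] = 2 * delta - rng.choice(upper)       # mirror upper tail
        elif u < (p_up + p_lo) / p0:
            out[i] = 2 * delta - rng.choice(lower)       # mirror lower part
        else:
            out[i] = delta                               # point mass at delta
    return out
```

Mirrored upper-tail samples land at or below zero, so the coupled variable is symmetric around delta rather than nonnegative.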

Denote the random variable 𝜹diff:=𝜹^MC𝜹~MC\bm{\delta}_{\mathrm{diff}}:=\widehat{\bm{\delta}}_{\texttt{MC}}-\widetilde{\bm{\delta}}_{\texttt{MC}}. It is not hard to see that 𝔼[𝜹diff2]𝔼[𝜹^MC2]+δ22𝔼[𝜹^MC2]\mathbb{E}[\bm{\delta}_{\mathrm{diff}}^{2}]\leq\mathbb{E}[\widehat{\bm{\delta}}_{\texttt{MC}}^{2}]+\delta^{2}\leq 2\mathbb{E}[\widehat{\bm{\delta}}_{\texttt{MC}}^{2}]. By Bennett’s inequality, with m2νΔ2log(τ/δest)m\geq\frac{2\nu}{\Delta^{2}}\log(\tau/\delta^{\mathrm{est}}), we have

Pr[𝜹^MCm𝜹~MCmc]\displaystyle\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\widetilde{\bm{\delta}}_{\texttt{MC}}^{m}\leq c\right] 1δestτexp(c2Δ2)\displaystyle\gtrapprox 1-\frac{\delta^{\mathrm{est}}}{\tau}\exp\left(-\frac{c^{2}}{\Delta^{2}}\right)
=1O(δestτ)\displaystyle=1-O\left(\frac{\delta^{\mathrm{est}}}{\tau}\right)

if we set c=O(Δ)c=O(\Delta). Let c=14Δc=\frac{1}{4}\Delta, then by setting Δ=Δ-\Delta^{*}=-\Delta and Δ+2c=(1/τ1/ρ)δestΔ\Delta^{*}+2c=\left(1/\tau-1/\rho\right)\delta^{\mathrm{est}}-\Delta, we have Δ=0.4(1/τ1/ρ)δest\Delta=0.4\left(1/\tau-1/\rho\right)\delta^{\mathrm{est}}.

To summarize, when we set Δ\Delta according to this heuristic, we have

Pr[𝜹^MCmδΔ+2c]\displaystyle\Pr\left[\widehat{\bm{\delta}}_{\texttt{MC}}^{m}-\delta\geq\Delta^{*}+2c\right] δest/τ(1O(δestτ))2\displaystyle\lessapprox\frac{\delta^{\mathrm{est}}/\tau}{\left(1-O\left(\frac{\delta^{\mathrm{est}}}{\tau}\right)\right)^{2}}
δest/τ\displaystyle\approx\delta^{\mathrm{est}}/\tau

Appendix G Pseudocode for Privacy Accounting for DP-SGD with EVR Paradigm

In this section, we outline the steps of privacy accounting for DP-SGD with our EVR paradigm. Recall from Lemma 14 that for the subsampled Gaussian mechanism with sensitivity CC, noise variance C2σ2C^{2}\sigma^{2}, and subsampling rate qq, one dominating pair (P,Q)(P,Q) is Q:=𝒩(0,σ2)Q:=\mathcal{N}(0,\sigma^{2}) and P:=(1q)𝒩(0,σ2)+q𝒩(1,σ2)P:=(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(1,\sigma^{2}). Hence, for DP-SGD with kk iterations, the dominating pair is the product distribution 𝑷:=P1××Pk\bm{P}:=P_{1}\times\ldots\times P_{k} and 𝑸:=Q1××Qk\bm{Q}:=Q_{1}\times\ldots\times Q_{k}, where each PiP_{i} and QiQ_{i} follows the same distribution as PP and QQ. (This can easily be extended to the heterogeneous case.)

Algorithm 4 Privacy Accounting for DP-SGD with EVR Paradigm
1:  Privacy Parameters for DP-SGD: kk – number of mechanisms to be composed, σ\sigma – noise multiplier, qq – subsampling rate, CC – clipping ratio.
2:  
3:  // Step 1: Estimate Privacy Parameter.
4:  Obtain a privacy parameter estimate (ε,δest)(\varepsilon,\delta^{\mathrm{est}}) from a DP accountant (e.g., FFT/GDP/MC accountant). Set the smoothing factor τ=0.99\tau=0.99 (one can also adjust the value of τ\tau according to the privacy guarantee and runtime requirements).
5:  
6:  // Step 2: Verify Privacy Parameter.
7:  
8:  // Step 2.1: Derive the required amount of samples for privacy verification.
9:  Compute the number of required samples mm according to Theorem 4.3, where the second moment upper bound ν\nu is computed according to Corollary 19 for Simple MC estimator, or Theorem 19 for Importance Sampling estimator. Set ρ=(1+τ)/2\rho=(1+\tau)/2 and set Δ\Delta according to Theorem 12 for utility guarantee.
10:  
11:  // Step 2.2: Estimate privacy parameter with MC estimator.
12:  If using the Simple MC estimator: obtain i.i.d. samples {𝒕i}i=1m\{\bm{t}_{i}\}_{i=1}^{m} from 𝑷\bm{P}. Compute the simple MC estimate δ^=1mi=1m(1eεyi)+\widehat{\delta}=\frac{1}{m}\sum_{i=1}^{m}\left(1-e^{\varepsilon-y_{i}}\right)_{+} with PRV samples yi=log(𝑷(𝒕i)𝑸(𝒕i)),i=1my_{i}=\log\left(\frac{\bm{P}(\bm{t}_{i})}{\bm{Q}(\bm{t}_{i})}\right),i=1\ldots m.
13:  If using the Importance Sampling estimator: compute the MC estimate according to Definition 16.
14:  
15:  // Step 3: Release according to verification result.
16:  if δ^<δestτΔ\widehat{\delta}<\frac{\delta^{\mathrm{est}}}{\tau}-\Delta then release the result of DP-SGD.
17:  else terminate program.
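Steps 12 and 16 of Algorithm 4 can be sketched as follows for the Poisson subsampled Gaussian dominating pair (a minimal NumPy sketch of the Simple MC estimator; the function name and the example parameters are illustrative):

```python
import numpy as np

def evr_verify(eps, delta_est, k, sigma, q, m, tau=0.99, Delta=0.0, seed=0):
    """Simple MC privacy verification for the k-fold composed Poisson
    subsampled Gaussian mechanism. Samples t ~ P = (1-q)N(0,s^2) + qN(1,s^2)
    per coordinate, forms the PRV y = sum_i log(P(t_i)/Q(t_i)) with
    Q = N(0,s^2), and passes iff the MC estimate is below delta_est/tau - Delta."""
    rng = np.random.default_rng(seed)
    comp = rng.random((m, k)) < q                        # mixture component
    t = rng.normal(comp.astype(float), sigma)
    # log(P(t)/Q(t)) = log(1 - q + q * exp((2t - 1)/(2 sigma^2))), per coordinate
    y = np.log(1 - q + q * np.exp((2 * t - 1) / (2 * sigma ** 2))).sum(axis=1)
    delta_hat = np.maximum(1 - np.exp(eps - y), 0).mean()
    return delta_hat, delta_hat < delta_est / tau - Delta
```

For a generous eps the clipped terms all vanish and verification passes; for an eps far below the true privacy level the estimate approaches one and the program is terminated.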

Appendix H Experiment Settings & Additional Results

H.1 GPU Acceleration for MC Sampling

MC-based techniques are well-suited for parallel computing and GPU acceleration due to their nature of repeated sampling. One can easily utilize PyTorch’s CUDA functionality, e.g., torch.randn(size=(k,m)).cuda()*sigma+mu, to significantly boost the computational efficiency. Figure 9 (a) shows that when using an NVIDIA A100-SXM4-80GB GPU, the execution time of sampling the Gaussian mixture ((1q)𝒩(0,σ2)+q𝒩(1,σ2)(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(1,\sigma^{2})) can be improved by 10310^{3} times compared with the CPU-only scenario. Figure 9 (b) shows the predicted runtime for different target false positive rates for k=1000k=1000. We vary σ\sigma and set the target false positive rate as the smallest s×10rs\times 10^{-r} that is greater than δY(ε)\delta_{Y}(\varepsilon), where s{1,5}s\in\{1,5\} and rr is a positive integer. We set δest=0.8δY(ε)\delta^{\mathrm{est}}=0.8\delta_{Y}(\varepsilon), and τ,ρ,Δ\tau,\rho,\Delta are set by the heuristics introduced in the previous sections. The runtime is predicted from the number of samples required for the given false positive rate. As we can see, even when we target a 101010^{-10} false positive rate (which means that Δ1010\Delta\approx 10^{-10}), the clock time is still acceptable (around 3 hours).

Refer to caption
Figure 9: The execution time of sampling Gaussian mixture ((1q)𝒩(0,σ2)+q𝒩(1,σ2)(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(1,\sigma^{2})) when only using CPU and when using a NVIDIA A100-SXM4-80GB GPU.

H.2 Experiment for Evaluating EVR Paradigm

H.2.1 Settings

For Figure 1 & Figure 3, the FFT-based method has hyperparameters set to εerror=103,δerror=1010\varepsilon_{\mathrm{error}}=10^{-3},\delta_{\mathrm{error}}=10^{-10}. For the GDP-Edgeworth accountant, we use the second-order expansion and uniform bound, following [WGZ+22].

For Figure 4 (as well as Figure 11), the BEiT [BDPW21] is first pretrained on ImageNet-1k in a self-supervised manner and then finetuned on ImageNet-21k, following the state-of-the-art approach in [PTS+22]. For DP-GD training, we set σ\sigma to 28.914, the clipping norm to 1, and the learning rate to 2; we train for at most 60 iterations and finetune only the last layer on CIFAR-100.

H.2.2 Additional Results

$k\rightarrow\varepsilon_{Y^{(k)}}(\delta)$ curve. We show additional results for a more common setting in privacy-preserving machine learning, where one sets a target $\delta$ and tries to find $\varepsilon_{Y^{(k)}}(\delta)$ for different $k$, the number of individual mechanisms in the composition. We use $Y^{(k)}$ to stress that the PRV $Y$ is for the composition of $k$ mechanisms. Such a scenario arises, for example, when one wants to find the optimal stopping iteration for training a differentially private neural network.

Figure 10 shows such a result for the Poisson Subsampled Gaussian mechanism, where we set the subsampling rate to 0.01, $\sigma=2$, and $\delta=10^{-5}$. We set $\varepsilon_{\mathrm{error}}=10^{-1},\delta_{\mathrm{error}}=10^{-10}$ for the FFT method. The estimate in this case is obtained by fixing $\delta^{\mathrm{est}}=\tau\delta$ and finding the corresponding estimate for $\varepsilon$ through the FFT-based method [GLW21]. As we can see, the EVR paradigm achieves a much tighter privacy analysis compared with the upper bound derived by the FFT-based method. The runtime of privacy verification in this case is $<15$ minutes for all $k$.

Figure 10: $k\rightarrow\varepsilon_{Y^{(k)}}(\delta)$ curve for the Poisson Subsampled Gaussian mechanism with subsampling rate 0.01, $\sigma=2$, and $\delta=10^{-5}$. When running on an NVIDIA A100-SXM4-80GB GPU, the runtime of privacy verification is $<15$ minutes.

Privacy-Utility Tradeoff. We show additional results for the privacy-utility tradeoff curve when finetuning ImageNet-pretrained BEiT on the CIFAR-100 dataset with DP stochastic gradient descent (DP-SGD). For DP-SGD training, we set $\sigma$ as 5.971, the clipping norm as 1, the learning rate as 0.2, the momentum as 0.9, and the batch size as 4096; we train for at most 360 iterations (30 epochs) and only finetune the last layer on CIFAR-100.

As shown in Figure 11 (a), the EVR paradigm provides a better utility-privacy tradeoff compared with the traditional upper bound method. In Figure 11 (b), we show the runtime of DP verification when $\rho=(1+\tau)/2$ and $\Delta$ is set according to Theorem 12 (which ensures EVR's failure probability is negligible). The runtime is estimated on an NVIDIA A100-SXM4-80GB GPU. As we can see, privacy verification takes only a few minutes, which is short compared with the hours of model training.

Figure 11: (a) Utility-privacy tradeoff curve for fine-tuning ImageNet-pretrained BEiT [BDPW21] on CIFAR-100 when $\delta=10^{-12}$ with DP-SGD. We follow the training hyperparameters from [PTS+22]. (b) Runtime of privacy verification in the EVR paradigm. For a fair comparison, we set $\rho=(1+\tau)/2$ and set $\Delta$ according to Theorem 12, which ensures EVR's failure probability is around $O(\delta)$. For (b), the runtime is estimated on an NVIDIA A100-SXM4-80GB GPU.

H.3 Experiment for Evaluating MC Accountant

H.3.1 Settings

Evaluation Protocol. We use $Y^{(k)}$ to stress that the PRV $Y$ is for the composition of $k$ Poisson Subsampled Gaussian mechanisms. For the offline setting, we make the following two kinds of plots: (1) the relative error in approximating $\varepsilon\mapsto\delta_{Y^{(k)}}(\varepsilon)$ (for fixed $k$), and (2) the relative error in $k\mapsto\varepsilon_{Y^{(k)}}(\delta)$ (for fixed $\delta$), where $\varepsilon_{Y^{(k)}}(\delta)$ is the inverse of $\delta_{Y^{(k)}}(\varepsilon)$ from (2). For the online setting, we make the following two kinds of plots: (1) the relative error in approximating $k\mapsto\varepsilon_{Y^{(k)}}(\delta)$ (for fixed $\delta$), and (2) $k\mapsto$ the cumulative runtime for privacy accounting up to the $k$th iteration.
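Since $\varepsilon\mapsto\delta_{Y^{(k)}}(\varepsilon)$ is monotonically decreasing, its inverse $\varepsilon_{Y^{(k)}}(\delta)$ can be computed by bisection over $\varepsilon$. A generic sketch (function names and the search interval are illustrative, not from the paper's implementation):

```python
def eps_from_delta(delta_of_eps, target_delta, lo=0.0, hi=100.0, iters=60):
    """Invert a monotonically decreasing eps -> delta curve by bisection:
    returns (approximately) the smallest eps with delta_of_eps(eps) <= target_delta.
    Assumes the target is achievable somewhere within [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if delta_of_eps(mid) <= target_delta:
            hi = mid  # mid already meets the target; try a smaller eps
        else:
            lo = mid  # mid is too permissive; need a larger eps
    return hi
```

With 60 iterations the interval shrinks to width $100/2^{60}$, i.e. far below any accounting error, so the bisection itself contributes negligibly to the relative-error plots.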

MC Accountant. We use the importance sampling technique, with the tilting parameter set according to the heuristic described in Definition 16.
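For intuition on the estimator underlying the MC accountant, the following is a minimal sketch of the plain (untilted) Monte Carlo estimate of $\delta_{Y}(\varepsilon)=\mathbb{E}[(1-e^{\varepsilon-Y})_{+}]$ for the $k$-fold composition of non-subsampled Gaussian mechanisms, whose PRV is Gaussian in closed form. The function name is illustrative; the paper's accountant additionally applies the importance-sampling tilt of Definition 16 and targets the subsampled Gaussian mechanism:

```python
import numpy as np

def mc_delta_gaussian(eps, sigma, k, m=10**6, seed=0):
    """Plain Monte Carlo estimate of delta_Y(eps) for the k-fold composition of
    Gaussian mechanisms with sensitivity 1 and noise scale sigma. The PRV of one
    Gaussian mechanism is N(1/(2*sigma^2), 1/sigma^2), so the composed PRV is
    the sum of k i.i.d. copies, i.e. N(k/(2*sigma^2), k/sigma^2)."""
    rng = np.random.default_rng(seed)
    y = rng.normal(k / (2.0 * sigma**2), np.sqrt(k) / sigma, size=m)
    # delta_Y(eps) = E[(1 - exp(eps - Y))_+]
    return float(np.mean(np.clip(1.0 - np.exp(eps - y), 0.0, None)))
```

Because each term $(1-e^{\varepsilon-Y})_{+}$ lies in $[0,1]$, the standard error of this estimator with $m$ samples is at most $1/(2\sqrt{m})$; the importance-sampling tilt is what makes the relative error manageable when $\delta_{Y}(\varepsilon)$ itself is tiny.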

Baselines. We compare the MC accountant against the following state-of-the-art DP accountants:

  • The state-of-the-art FFT-based approach [GLW21]. The setting of $\varepsilon_{\mathrm{error}}$ and $\delta_{\mathrm{error}}$ is specified in the next section.

  • The CLT-based GDP accountant [BDLS20].

  • The GDP-Edgeworth accountant with second-order expansion and uniform bound [WGZ+22].

  • The Analytical Fourier Accountant (AFA) based on the characteristic function [ZDW22], with double quadrature approximation, as this is the practical method recommended in the original paper.

H.3.2 Additional Results for Online Accounting

Figures 12 and 13 show the online accounting results for $(\sigma,\delta,q)=(0.5,10^{-5},10^{-3})$ and $(\sigma,\delta,q)=(0.5,10^{-13},10^{-3})$, respectively. For the setting of $(\sigma,\delta,q)=(0.5,10^{-5},10^{-3})$, the MC accountant achieves comparable performance with a shorter runtime. For the setting of $(\sigma,\delta,q)=(0.5,10^{-13},10^{-3})$, the MC accountant achieves significantly better performance than the state-of-the-art FFT accountant (again, with a shorter runtime). This showcases the MC accountant's efficiency and accuracy in the online setting.

Figure 12: Experiment for Composing Subsampled Gaussian Mechanisms in the Online Setting with hyperparameters $(\sigma,\delta,q)=(0.5,10^{-5},10^{-3})$. (a) Compares the relative error in approximating $k\mapsto\varepsilon_{Y}(\delta)$. The error bar for the MC accountant is the variance over 5 independent runs. Note that the y-axis is in log scale. (b) Compares the cumulative runtime for online privacy accounting. We do not show AFA [ZDW22] as it does not terminate within 24 hours.
Figure 13: Experiment for Composing Subsampled Gaussian Mechanisms in the Online Setting with hyperparameters $(\sigma,\delta,q)=(0.5,10^{-13},10^{-3})$. (a) Compares the relative error in approximating $k\mapsto\varepsilon_{Y}(\delta)$. The error bar for the MC accountant is the variance over 5 independent runs. Note that the y-axis is in log scale. (b) Compares the cumulative runtime for online privacy accounting. We do not show AFA [ZDW22] as it does not terminate within 24 hours.

H.3.3 Additional Results for Offline Accounting

In this experiment, we set the number of samples for the MC accountant as $10^{7}$, and the parameters for the FFT-based method as $\varepsilon_{\mathrm{error}}=10^{-3},\delta_{\mathrm{error}}=10^{-10}$. The parameters are controlled so that the MC accountant is faster than the FFT-based method, as shown in Table 1. Figure 14 (a) shows the offline accounting results for $\varepsilon\mapsto\delta_{Y^{(k)}}(\varepsilon)$ when we set $(\sigma,q,k)=(0.5,10^{-3},1000)$. As we can see, the performance of the MC accountant is comparable with the state-of-the-art FFT method. In Figure 15 (a), we decrease $q$ to $10^{-5}$. Compared with the baselines, and in particular the FFT accountant, the MC approximations are significantly more accurate for larger $\varepsilon$. Figure 14 (b) shows the offline accounting results for $k\mapsto\varepsilon_{Y}(\delta)$ when we set $(\sigma,q,\delta)=(0.5,10^{-3},10^{-5})$. Similarly, the MC accountant performs comparably to the FFT accountant. However, when we decrease $q$ to $10^{-5}$ and $\delta$ to $10^{-14}$ (Figure 15 (b)), the MC accountant significantly outperforms the FFT accountant. This illustrates that the MC accountant performs well in all regimes, and is especially favorable when the true value of $\delta_{Y}(\varepsilon)$ is tiny.

Figure 14: Experiment for Composing Subsampled Gaussian Mechanisms: (a) Compares the relative error in approximating $\varepsilon\mapsto\delta_{Y^{(k)}}(\varepsilon)$ where we set $\sigma=0.5,k=1000,q=10^{-3}$. (b) Compares the relative error in $k\mapsto\varepsilon_{Y}(\delta)$ where we set $\sigma=0.5,\delta=10^{-5},q=10^{-3}$. The error bar for the MC accountant is the variance over 5 independent runs. Note that the y-axis is in log scale.
Table 1: Runtime for $k=1000$.
AFA: 18.63 | GDP: $4.1\times 10^{-4}$ | GDP-E: 1.50 | FFT: 3.01 | MC-IS: 2.31
Figure 15: Experiment for Composing Subsampled Gaussian Mechanisms: (a) Compares the relative error in approximating $\varepsilon\mapsto\delta_{Y^{(k)}}(\varepsilon)$ where we set $\sigma=0.5,k=100,q=10^{-5}$. (b) Compares the relative error in $k\mapsto\varepsilon_{Y}(\delta)$ where we set $\sigma=0.5,\delta=10^{-14},q=10^{-5}$. The error bar for the MC accountant is the variance over 5 independent runs. Note that the y-axis is in log scale. The curves for AFA, GDP, and GDP-Edgeworth overlap with each other.