
Differentially private partition selection

Damien Desfontaines
James Voss (Google, jrvoss@google.com)
Bryant Gipson (Google, bryantgipson@google.com)
Chinmoy Mandayam (Google, cvm@google.com)

Abstract

Many data analysis operations can be expressed as a GROUP BY query on an unbounded set of partitions, followed by a per-partition aggregation. To make such a query differentially private, adding noise to each aggregation is not enough: the set of released partitions must also be differentially private.
This problem is not new, and it was recently formally introduced as differentially private set union [13]. In this work, we continue this area of study, and focus on the common setting where each user is associated with a single partition. In this setting, we propose a simple, optimal differentially private mechanism that maximizes the number of released partitions. We discuss implementation considerations, as well as the possible extension of this approach to the setting where each user contributes to a fixed, small number of partitions.

1 Introduction

Suppose that a centralized service collects information on its users, and that an engineer wants to understand the prevalence of different device models among the users. They could run a SQL query similar to the following.

SELECT
device_model,
COUNT(DISTINCT user_id)
FROM database
GROUP BY device_model

Many common data analysis tasks follow a simple structure, similar to this example query: a GROUP BY operation that defines a set of partitions (here, device models), followed by one or several aggregations. To make such a query differentially private, it is not enough to add noise to each count. Indeed, in the example above, suppose that a device model is particularly rare, and that a single user is associated with it. The presence or absence of this user determines whether this partition appears in the output: even if the count is noisy, the differential privacy property is not satisfied. Thus, in addition to the counts, the set of partitions present in the output must also be differentially private. There are two main ways of ensuring this property.

A first option is to determine the set of output partitions in advance, without looking at the private data. In this case, even if some of the partitions do not appear in the private data, they must still be returned, with noise added to the zero value. Conversely, if the private data has partitions that do not appear in the predefined list, they must be dropped from the output. This option is feasible when grouping users by some fixed categories, or if partitions can only take a small number of predefined values.

However, this is not always the case. Text-based partitions like search queries or user agents might take arbitrary values, and often cannot be known without access to the private dataset. Furthermore, when building a generic DP engine, usability is paramount, and requiring users to annotate their dataset with all possible values that can be taken by a given field is a significant burden. This makes a second option attractive: generating the list of partitions from the private data itself, in a differentially private way. This problem was formally introduced in [13] as differentially private set union. Each user is associated with one or several partitions, and the goal is to release as many partitions as possible while making sure that the output is differentially private.

In [13], the main motivation for studying this set union primitive is natural language processing: the discovery of words and n-grams is essential to these tasks, and can be modeled as a set union problem. In this context, each user can contribute to many different partitions. In the context of data analysis queries, however, it is common that each user contributes to only a small number of partitions, often just one. This happens when the partition is a characteristic of each user, for example demographic attributes or the answer to a survey. In the above SQL query example, if the user ID is a device identifier, each user contributes to at most one device model.

In this work, we focus on this single-contribution case, and provide an optimal partition selection strategy for it. More specifically, we show that there is a fundamental upper bound on the probability of returning a partition associated with $k$ users, and present an algorithm that achieves this bound.

This paper is structured as follows. After discussing prior work in more detail and introducing definitions, we present a partition selection mechanism for the case where each user contributes to one partition, and prove its optimality. We then discuss possible extensions to cases where each user contributes to multiple partitions as well as implementation considerations.

1.1 Prior work

In this section, we review existing literature on the problem of releasing a set of partitions from an unbounded set while satisfying differential privacy. This problem did not get specific attention until [13], but the first algorithm that solves it was introduced in [20], in the specific context of privately releasing search queries. This algorithm works as follows: build a histogram of all partitions, count the unique users associated with each partition, add Laplace noise to each count, and keep only the partitions whose noisy counts are above a fixed threshold. The scale of the noise and the value of the threshold determine $\varepsilon$ and $\delta$. This method is simple and natural; it was adapted to work in more general-purpose query engines in [27].

In [13], the authors focus on the more general problem of differentially private set union. The main use case for this work is word and n-gram discovery in Natural Language Processing: data used in training models must not leak private information about individuals. In this context, each user potentially contributes to many elements; the sensitivity of the mechanism can be high. The authors propose two strategies applicable in this context. First, they use a weighted histogram so that if a user contributes to fewer elements than the maximum sensitivity, these elements can add more weight to the histogram count. Second, they introduce policies that determine which elements to add to the histogram depending on which histogram counts are already above the threshold. These strategies obtain significant utility improvements over the simple Laplace-based strategy.

In this work, in contrast to [13], we focus on the low-sensitivity use case: each user contributes exactly one element. This setting is common in data analysis: when the GROUP BY key partitions the set of users into distinct groups, each user can contribute only one element to the set union. Choosing the contributions of each user is therefore not relevant; the only question is how to optimize the probability of releasing each element in the final result. For this specific problem, we introduce an optimal approach, which maximizes this probability.

Public partitions

When the domain of possible partitions is known in advance and considered public data, no partition selection is necessary. This assumption is typically made implicitly in existing work on histogram publication, either by assuming that the domain is known exactly and not too large [17, 10, 28, 29, 1, 30, 31], or that the attributes are numeric and bounded [26]. In the former case, no partition selection is necessary; the strategy usually revolves around grouping known partitions together to limit the impact of the noise. In the latter case, the possible partitions are also indirectly known in advance (all possible intervals in a fixed numerical range), and the problem is to find which intervals to use to slice the data. With such pre-existing knowledge about the partitions, our approach does not provide any benefit.

Domain of fixed size

When the domain does not conform to one of the assumptions described above, the data domain might still be a subset of some large domain. For example, integer attributes are typically stored using 64 bits. Similarly, it is reasonable to assume that search queries or URLs are strings whose size is bounded by some large number.

We can use this fact to perform partition selection by adding noise to all possible partitions, including the ones that do not contain any data, and only return the ones that are above a given threshold. This process can be simulated in an efficient way, without actually enumerating all partitions [5]. Other methods might be possible; for example, one could imagine simulating the sparse vector technique [8] or one of its multiple-queries variants [9, 24, 25, 22] to ask the number of users in all possible partitions, while ignoring the privacy cost of answers below a threshold.

We are not aware of any work using these techniques for the specific problem of partition selection. We also postulate that they are likely to fail for extremely large domain sizes (like long strings); the technique in [5] outputs a number of false positive partitions linear in the domain size.

Differences with our approach

In this work, we focus on cases where all the assumptions above fail because the domain of the data is unbounded or too large. As such, the only way to learn this domain is by looking at the private data, which must be done in a differentially private way. This setting is particularly relevant when building generic tooling, like general-purpose differentially private query engines [2, 19, 27, 23]. Indeed, to use such an engine, either the entire domain of the input data must be enumerated in advance, or partition selection is necessary. The former requires the analyst or data owner to document the data domain for all input databases: a significant usability burden, which makes it difficult to scale the use of the query engine. This problem is the main motivation for our work.

1.2 Definitions

Differential privacy (DP) is a standard notion to quantify the privacy guarantees of statistical data. For the problem of differentially private set union, we use $(\varepsilon,\delta)$-DP.

Definition 1 (Differential privacy [7]).

A randomized mechanism $\mathcal{M}$ is $(\varepsilon,\delta)$-differentially private if for any two databases $D$ and $D'$, where $D'$ can be obtained from $D$ by either adding or removing one user, and for all sets $S$ of possible outputs:

$$\mathbb{P}\left[\mathcal{M}(D)\in S\right]\leq e^{\varepsilon}\,\mathbb{P}\left[\mathcal{M}(D')\in S\right]+\delta.$$

Let us formalize the problem addressed in this work.

Definition 2 (Differentially private partition selection).

Let $U$ be a universe of partitions, possibly infinite. A partition selection mechanism is a mechanism $\mathcal{M}$ that takes a database $D$ in which each user $i$ contributes a subset $W_i\subset U$ of partitions, and outputs a subset $\mathcal{M}(D)\subseteq\cup_i W_i$.

The problem of differentially private partition selection (also called differentially private set union [13]) consists in finding a mechanism $\mathcal{M}$ that outputs as many partitions as possible while satisfying $(\varepsilon,\delta)$-differential privacy.

In the main section of this paper, we assume that each user contributes to only one partition ($|W_i|=1$ for all $i$). We first study the simplified problem of considering each partition independently. The only question is then: with which probability do we release this partition? The strategy can thus be reduced to a function associating the number of users in a partition with the probability of keeping that partition. After finding an optimal primitive for this simpler problem, we show that it is actually optimal in a stronger sense, even among mechanisms that consider all partitions simultaneously.

Definition 3 (Partition selection primitive).

A partition selection primitive is a function $\pi:\mathbb{N}\rightarrow[0,1]$ such that $\pi(0)=0$. The corresponding partition selection strategy consists in counting the number $n$ of users in each partition, and releasing this partition with probability $\pi(n)$.

We say that a partition selection primitive is $(\varepsilon,\delta)$-differentially private if the corresponding partition selection strategy $\rho_\pi:\mathbb{N}\rightarrow\{\textnormal{drop},\textnormal{keep}\}$, defined by:

$$\rho_\pi(n)=\begin{cases}\textnormal{drop}&\textnormal{with probability }1-\pi(n)\\\textnormal{keep}&\textnormal{with probability }\pi(n)\end{cases}$$

is $(\varepsilon,\delta)$-differentially private.

Note that partitions associated with no users are not present in the input data, so the probability of releasing them has to be 0: hence the requirement that $\pi(0)=0$.
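As an illustration, the strategy $\rho_\pi$ induced by a primitive $\pi$ can be implemented in a few lines; the following Python sketch (ours, not from the paper) makes the drop/keep decision:

import random

def rho(n, pi):
    """Strategy induced by a partition selection primitive pi:
    keep a partition with n users with probability pi(n)."""
    return "keep" if random.random() < pi(n) else "drop"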

2 Main result

In this section, we define an $(\varepsilon,\delta)$-DP partition selection primitive $\pi_{\textnormal{opt}}$ and prove that the corresponding partition selection strategy is optimal. In this context, optimal means that it maximizes the probability of releasing a partition with $n$ users, for all $n$.

Definition 4 (Optimal partition selection primitive).

A partition selection primitive $\pi_{\textnormal{opt}}$ is optimal for $(\varepsilon,\delta)$-DP if it is $(\varepsilon,\delta)$-DP, and if for all $(\varepsilon,\delta)$-DP partition selection primitives $\pi$ and all $n\in\mathbb{N}$:

$$\pi(n)\leq\pi_{\textnormal{opt}}(n).$$

We first state our main result, then prove it in two steps: we show that the optimal partition selection primitive can be obtained recursively, then derive the closed-form formula of our main result from the recurrence relation.

Theorem 1 (General solution for $\pi_{\textnormal{opt}}$).

Let $\varepsilon>0$ and $\delta\in(0,1)$. Defining:

$$n_1=1+\left\lfloor\frac{1}{\varepsilon}\ln\left(\frac{e^{\varepsilon}+2\delta-1}{(e^{\varepsilon}+1)\delta}\right)\right\rfloor,$$
$$n_2=n_1+\left\lfloor\frac{1}{\varepsilon}\ln\left(1+\frac{e^{\varepsilon}-1}{\delta}\left(1-\pi_{\textnormal{opt}}(n_1)\right)\right)\right\rfloor,$$

and $m=n-n_1$, the partition selection primitive $\pi_{\textnormal{opt}}$ defined by:

• $\pi_{\textnormal{opt}}(n)=\frac{e^{n\varepsilon}-1}{e^{\varepsilon}-1}\cdot\delta$ if $n\leq n_1$,

• $\pi_{\textnormal{opt}}(n)=\left(1-e^{-m\varepsilon}\right)\left(1+\frac{\delta}{e^{\varepsilon}-1}\right)+e^{-m\varepsilon}\pi_{\textnormal{opt}}(n_1)$ if $n_1<n\leq n_2$,

• $\pi_{\textnormal{opt}}(n)=1$ otherwise

is optimal for $(\varepsilon,\delta)$-DP.

These formulas assume $\varepsilon>0$ and $\delta>0$. We also cover the special cases where $\varepsilon=0$ or $\delta=0$.

Theorem 2 (Special cases for $\pi_{\textnormal{opt}}$).

1. If $\delta=0$, partition selection is impossible: the optimal partition selection primitive $\pi_{\textnormal{opt}}$ for $(\varepsilon,0)$-DP is defined by $\pi_{\textnormal{opt}}(n)=0$ for all $n$.

2. If $\varepsilon=0$, the optimal partition selection primitive $\pi_{\textnormal{opt}}$ for $(0,\delta)$-DP is defined by $\pi_{\textnormal{opt}}(n)=\min(1,n\delta)$ for all $n$.
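To make these formulas concrete, here is a short Python sketch (ours, not from the paper) that computes $\pi_{\textnormal{opt}}(n)$ following Theorems 1 and 2; the function name is our own.

import math

def pi_opt(n, eps, delta):
    """Optimal probability of releasing a partition with n users,
    following the closed form of Theorem 1 (and Theorem 2's special cases)."""
    if delta == 0:
        return 0.0                        # Theorem 2, case 1
    if eps == 0:
        return min(1.0, n * delta)        # Theorem 2, case 2
    n1 = 1 + math.floor(math.log((math.exp(eps) + 2 * delta - 1)
                                 / ((math.exp(eps) + 1) * delta)) / eps)
    pi_n1 = (math.exp(n1 * eps) - 1) / (math.exp(eps) - 1) * delta
    n2 = n1 + math.floor(math.log(1 + (math.exp(eps) - 1) / delta
                                  * (1 - pi_n1)) / eps)
    if n <= n1:
        return (math.exp(n * eps) - 1) / (math.exp(eps) - 1) * delta
    if n <= n2:
        m = n - n1
        return ((1 - math.exp(-m * eps)) * (1 + delta / (math.exp(eps) - 1))
                + math.exp(-m * eps) * pi_n1)
    return 1.0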

2.1 Recursive construction

How do we construct a partition selection primitive $\pi$ so that the partition is output with the highest possible probability, under the constraint that $\pi$ is $(\varepsilon,\delta)$-DP? Using the definition of differential privacy, the following inequalities must hold for all $n\in\mathbb{N}$:

$$\pi(n+1)\leq e^{\varepsilon}\pi(n)+\delta \quad (1)$$
$$\pi(n)\leq e^{\varepsilon}\pi(n+1)+\delta \quad (2)$$
$$1-\pi(n+1)\leq e^{\varepsilon}\left(1-\pi(n)\right)+\delta \quad (3)$$
$$1-\pi(n)\leq e^{\varepsilon}\left(1-\pi(n+1)\right)+\delta. \quad (4)$$

These inequalities are not only necessary, but also sufficient for $\pi$ to be DP. Thus, the optimal partition selection primitive can be constructed by recurrence, maximizing each value while still satisfying the inequalities above. As we will show, only inequalities (1) and (4) need be included in the recurrence relationship. The latter can be rearranged as:

$$\pi_{\textnormal{opt}}(n+1)\leq 1-e^{-\varepsilon}\left(1-\pi_{\textnormal{opt}}(n)-\delta\right)$$

which leads to the following recursive formulation for $\pi_{\textnormal{opt}}$.

Lemma 1 (Recursive solution for $\pi_{\textnormal{opt}}$).

Given $\delta\in[0,1]$ and $\varepsilon\geq 0$, $\pi_{\textnormal{opt}}$ satisfies the following recurrence relationship: $\pi_{\textnormal{opt}}(0)=0$, and for all $n\geq 0$:

$$\pi_{\textnormal{opt}}(n+1)=\min\left(e^{\varepsilon}\pi_{\textnormal{opt}}(n)+\delta,\ 1-e^{-\varepsilon}\left(1-\pi_{\textnormal{opt}}(n)-\delta\right),\ 1\right). \quad (5)$$
Proof.

Let $\pi_0$ be defined by recurrence as above; we will prove that $\pi_0=\pi_{\textnormal{opt}}$.

First, let us show that $\pi_0$ is monotonic. Fix $n\in\mathbb{N}$. It suffices to show that each argument of the min function in (5) is larger than $\pi_0(n)$.

First argument. Since $\varepsilon\geq 0$ implies $e^{\varepsilon}\geq 1$, and $\delta\geq 0$, we trivially have $e^{\varepsilon}\pi_0(n)+\delta\geq\pi_0(n)$.

Second argument. We have:

$$1-e^{-\varepsilon}\left(1-\pi_0(n)-\delta\right)=1-e^{-\varepsilon}\left(1-\pi_0(n)\right)+e^{-\varepsilon}\delta\geq 1-\left(1-\pi_0(n)\right)=\pi_0(n),$$

using that $1-\pi_0(n)\geq 0$, since $\pi_0(n)\leq 1$ by (5).

Third argument. This is immediate given (5) and the fact that $\pi_0(0)=0$.

It follows that $\pi_0(n+1)\geq\pi_0(n)$.

Because $\pi_0$ is monotonic, it immediately satisfies inequalities (2) and (3), and inequalities (1) and (4) are satisfied by definition.

Since $\pi_0$ satisfies all four inequalities above, it is $(\varepsilon,\delta)$-DP. Its optimality follows by recurrence: for each $n+1$, any primitive $\pi$ with $\pi(n+1)>\pi_0(n+1)$ violates one of the inequalities above and thus cannot be $(\varepsilon,\delta)$-DP. Hence $\pi_0$ is the fastest-growing DP partition selection primitive, and therefore equal to $\pi_{\textnormal{opt}}$. ∎

Note that the special cases for $\pi_{\textnormal{opt}}$ in Theorem 2 can be immediately derived from Lemma 1.
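The recurrence also translates directly into code; the following sketch (ours) iterates Lemma 1 and can serve as a cross-check of the closed form sketched above:

import math

def pi_opt_recursive(n, eps, delta):
    """Compute pi_opt(n) by iterating the recurrence (5) of Lemma 1,
    starting from pi_opt(0) = 0."""
    pi = 0.0
    for _ in range(n):
        pi = min(math.exp(eps) * pi + delta,
                 1 - math.exp(-eps) * (1 - pi - delta),
                 1.0)
    return pi

For instance, pi_opt_recursive(n, 1.0, 1e-5) should agree with the closed-form pi_opt(n, 1.0, 1e-5) above for every n.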

2.2 Derivation of the closed-form solution

Let us now show that the closed-form solution of Theorem 1 can be derived from the recursive solution in Lemma 1. First, we show that there is a crossover point $n_1$, below which only the first term of the recurrence relation matters, and after which only the second term matters (until $\pi_{\textnormal{opt}}(n)$ reaches 1).

Lemma 2.

Assume $\varepsilon>0$ and $\delta>0$. There are crossover points $n_1,n_2\in\mathbb{N}$ such that $0<n_1\leq n_2$ and:

• $\pi_{\textnormal{opt}}(n)=0$ if $n=0$,

• $\pi_{\textnormal{opt}}(n)=\pi_{\textnormal{opt}}(n-1)e^{\varepsilon}+\delta$ if $0<n\leq n_1$,

• $\pi_{\textnormal{opt}}(n)=1-e^{-\varepsilon}\left(1-\pi_{\textnormal{opt}}(n-1)-\delta\right)$ if $n_1<n\leq n_2$,

• $\pi_{\textnormal{opt}}(n)=1$ otherwise.

Proof.

We consider the arguments of the min statement in (5), substituting $x$ for $\pi_{\textnormal{opt}}(n)$:

$$\alpha_1(x)=e^{\varepsilon}x+\delta,\qquad\alpha_2(x)=1-e^{-\varepsilon}(1-x-\delta),\qquad\alpha_3(x)=1.$$

This substitution allows us to work directly in the space of probabilities instead of restricting ourselves to the sequence $\left(\pi_{\textnormal{opt}}(n)\right)_{n=0}^{\infty}$. Taking the first derivatives of these functions yields:

$$\alpha_1'(x)=e^{\varepsilon},\qquad\alpha_2'(x)=e^{-\varepsilon},\qquad\alpha_3'(x)=0.$$

Since the derivative of $\alpha_1(x)-\alpha_2(x)$ is $e^{\varepsilon}-e^{-\varepsilon}>0$, there exists at most one crossover point $x_1$ such that $\alpha_1(x)<\alpha_2(x)$ for all $x<x_1$, $\alpha_1(x_1)=\alpha_2(x_1)$, and $\alpha_1(x)>\alpha_2(x)$ for all $x>x_1$. Setting $\alpha_1(x)=\alpha_2(x)$ and solving for $x$ yields:

$$e^{\varepsilon}x+\delta=1-e^{-\varepsilon}(1-x-\delta)$$

which leads to:

$$e^{\varepsilon}x-e^{-\varepsilon}x=(1-\delta)\left(1-e^{-\varepsilon}\right)$$

and finally:

$$x_1=(1-\delta)\cdot\frac{1-e^{-\varepsilon}}{e^{\varepsilon}-e^{-\varepsilon}}.$$

Since the derivative of $\alpha_2(x)-\alpha_3(x)$ is $e^{-\varepsilon}>0$, there exists at most one crossover point $x_2$ such that $\alpha_2(x)<\alpha_3(x)$ for all $x<x_2$, $\alpha_2(x_2)=\alpha_3(x_2)$, and $\alpha_2(x)>\alpha_3(x)$ for all $x>x_2$. Setting $\alpha_2(x)=\alpha_3(x)$ and solving for $x$ yields:

$$x_2=1-\delta.$$

From the formulas for $x_1$ and $x_2$, it is immediate that $0<x_1<x_2<1$. As such, the interval $[0,1]$ can be divided into three non-empty intervals:

1. On $[0,x_1]$, $\alpha_1(x)$ is the active argument of $\min(\alpha_1(x),\alpha_2(x),\alpha_3(x))$.

2. On $[x_1,x_2]$, $\alpha_2(x)$ is the active argument.

3. On $[x_2,1]$, $\alpha_3(x)$ is the active argument.

The existence of the crossover points is not enough to prove the lemma: we must also show that these points are reached in a finite number of steps. For all $n\geq 1$ such that $\pi_{\textnormal{opt}}(n)\neq 1$, we have:

$$\pi_{\textnormal{opt}}(n)-\pi_{\textnormal{opt}}(n-1)=\min\left(e^{\varepsilon}\pi_{\textnormal{opt}}(n-1)+\delta,\ 1-e^{-\varepsilon}\left(1-\pi_{\textnormal{opt}}(n-1)-\delta\right)\right)-\pi_{\textnormal{opt}}(n-1)$$
$$\geq\min\left(\delta,\ \left(1-e^{-\varepsilon}\right)\left(1-\pi_{\textnormal{opt}}(n-1)\right)+e^{-\varepsilon}\delta\right)\geq e^{-\varepsilon}\delta.$$

Since $\pi_{\textnormal{opt}}(n)-\pi_{\textnormal{opt}}(n-1)$ is bounded from below by the strictly positive constant $e^{-\varepsilon}\delta$, the sequence reaches the maximal probability 1 for finite $n$. ∎

This allows us to derive the closed-form solution for $n\leq n_1$ and for $n_1\leq n<n_2$ stated in Theorem 1.

Lemma 3.

Assume $\varepsilon>0$ and $\delta>0$. If $n\leq n_1$, then $\pi_{\textnormal{opt}}(n)=\frac{e^{n\varepsilon}-1}{e^{\varepsilon}-1}\cdot\delta$. If $n_1\leq n<n_2$, then denoting $m=n-n_1$:

$$\pi_{\textnormal{opt}}(n)=\left(1-e^{-m\varepsilon}\right)\left(1+\frac{\delta}{e^{\varepsilon}-1}\right)+e^{-m\varepsilon}\pi_{\textnormal{opt}}(n_1).$$
Proof.

For $n\leq n_1$, expanding the recurrence relation yields:

$$\pi_{\textnormal{opt}}(n)=\pi_{\textnormal{opt}}(n-1)e^{\varepsilon}+\delta=\delta\sum_{k=0}^{n-1}e^{k\varepsilon}=\frac{e^{n\varepsilon}-1}{e^{\varepsilon}-1}\cdot\delta.$$

For $n_1\leq n<n_2$, denoting $m=n-n_1$, expanding the recurrence relation yields:

$$\pi_{\textnormal{opt}}(n)=1-e^{-\varepsilon}\left(1-\pi_{\textnormal{opt}}(n-1)-\delta\right)=\left(1-e^{-\varepsilon}+\delta e^{-\varepsilon}\right)\sum_{k=0}^{m-1}e^{-k\varepsilon}+e^{-m\varepsilon}\pi_{\textnormal{opt}}(n_1)$$
$$=\left(1-e^{-\varepsilon}+\delta e^{-\varepsilon}\right)\frac{1-e^{-m\varepsilon}}{1-e^{-\varepsilon}}+e^{-m\varepsilon}\pi_{\textnormal{opt}}(n_1)=\left(1-e^{-m\varepsilon}\right)\left(1+\frac{\delta}{e^{\varepsilon}-1}\right)+e^{-m\varepsilon}\pi_{\textnormal{opt}}(n_1).$$
∎

We can now find closed-form solutions for $n_1$ and $n_2$.

Lemma 4.

The first crossover point $n_1$ is:

$$n_1=1+\left\lfloor\frac{1}{\varepsilon}\ln\left(\frac{e^{\varepsilon}+2\delta-1}{\delta(e^{\varepsilon}+1)}\right)\right\rfloor. \quad (6)$$
Proof.

Using the formula for $x_1$ in the proof of Lemma 2, and noting that $\frac{1-e^{-\varepsilon}}{e^{\varepsilon}-e^{-\varepsilon}}=\frac{1}{e^{\varepsilon}+1}$, so that $x_1=\frac{1-\delta}{e^{\varepsilon}+1}$, we see that $\pi_{\textnormal{opt}}(n-1)\leq x_1$ whenever:

$$\frac{e^{(n-1)\varepsilon}-1}{e^{\varepsilon}-1}\cdot\delta\leq\frac{1-\delta}{e^{\varepsilon}+1}.$$

Rearranging terms, we can rewrite this inequality as:

$$n\leq 1+\frac{1}{\varepsilon}\ln\left[\frac{(1-\delta)(e^{\varepsilon}-1)}{\delta(e^{\varepsilon}+1)}+1\right]=1+\frac{1}{\varepsilon}\ln\left[\frac{(1-\delta)(e^{\varepsilon}-1)+\delta(e^{\varepsilon}+1)}{\delta(e^{\varepsilon}+1)}\right]=1+\frac{1}{\varepsilon}\ln\left[\frac{e^{\varepsilon}+2\delta-1}{\delta(e^{\varepsilon}+1)}\right].$$

Since $n$ is an integer, the supremum defining $n_1$ is achieved by taking the floor of the right-hand side of this inequality, which concludes the proof. ∎

Lemma 5.

The second crossover point $n_2$ is:

$$n_2=n_1+\left\lfloor\frac{1}{\varepsilon}\ln\left(1+\frac{e^{\varepsilon}-1}{\delta}\left(1-\pi_{\textnormal{opt}}(n_1)\right)\right)\right\rfloor.$$
Proof.

We want to find the maximal $m$ such that:

$$\left(1-e^{-m\varepsilon}\right)\left(1+\frac{\delta}{e^{\varepsilon}-1}\right)+e^{-m\varepsilon}\pi_{\textnormal{opt}}(n_1)\leq 1.$$

We can rewrite this condition as:

$$-e^{-m\varepsilon}\left(1+\frac{\delta}{e^{\varepsilon}-1}-\pi_{\textnormal{opt}}(n_1)\right)\leq\frac{-\delta}{e^{\varepsilon}-1}$$

which leads to:

$$e^{m\varepsilon}\leq\frac{e^{\varepsilon}-1}{\delta}\left(1+\frac{\delta}{e^{\varepsilon}-1}-\pi_{\textnormal{opt}}(n_1)\right)=1+\frac{e^{\varepsilon}-1}{\delta}\left(1-\pi_{\textnormal{opt}}(n_1)\right)$$

and finally:

$$m\leq\frac{1}{\varepsilon}\ln\left(1+\frac{e^{\varepsilon}-1}{\delta}\left(1-\pi_{\textnormal{opt}}(n_1)\right)\right).$$

Since $m$ must be an integer, we take the floor of the right-hand side of this inequality to obtain the result. ∎

2.3 More generic optimality result

Theorem 1 provides an optimal partition selection primitive in the sense of Definition 4: a mechanism using this primitive on each partition separately is optimal among the class of mechanisms that consider every partition separately. Such a mechanism cannot use auxiliary knowledge about relationships between partitions, and the decision for a given partition cannot depend on the data in other partitions. Can we extend the optimality result to a larger class of algorithms, which take the full list of partitions as input?

We can answer this question in the affirmative, in the particular case where each user contributes a single partition. First, we need to define what optimality means in this more general context. Recall that a partition selection mechanism takes a database $D$ in which each user contributes a subset $W_i\subset U$ of partitions, and outputs a subset $\mathcal{M}(D)\subseteq\cup_i W_i$. The goal is to output as many partitions as possible, which we capture by maximizing the expected number of output partitions.

Definition 5 (Optimal partition selection mechanism).

A partition selection mechanism $\mathcal{M}$ is optimal for $(\varepsilon,\delta)$-DP and sensitivity $\kappa$ if it is $(\varepsilon,\delta)$-DP, and if for all $(\varepsilon,\delta)$-DP partition selection mechanisms $\mathcal{M}'$ and all databases $D$ in which each user contributes at most $\kappa$ partitions:

$$\mathbb{E}\left[\left\lvert\mathcal{M}'(D)\right\rvert\right]\leq\mathbb{E}\left[\left\lvert\mathcal{M}(D)\right\rvert\right].$$

We can now prove our more generic optimality result.

Theorem 3.

Let $\mathcal{M}_{\textnormal{opt}}$ be the partition selection mechanism that, on input $D$, returns each partition $k$ with probability $\pi_{\textnormal{opt}}\left(\left\lvert\left\{i\mid W_i=\{k\}\right\}\right\rvert\right)$. Then $\mathcal{M}_{\textnormal{opt}}$ is optimal for $(\varepsilon,\delta)$-DP and sensitivity 1.

Proof.

Let $\mathcal{M}$ be a partition selection mechanism. Since we assume that every user contributes to at most one partition ($\kappa=1$), it is equivalent to consider the input of $\mathcal{M}$ to be the histogram $(n_i)_{i\in U}$, where $n_i$ is the number of users associated with partition $i$. Of course, if $n_k=0$ for some $k$, then $k$ must not be in the output set.

Now, for a given partition $k$, fix all values of the histogram except $n_k=n$, and denote $f(n)=\mathbb{P}\left[k\in\mathcal{M}\left((n_i)_{i\in U}\right)\right]$. Then $f(n)$ must satisfy inequalities (1) to (4) from Section 2.1 in order for $\mathcal{M}$ to be $(\varepsilon,\delta)$-DP, and by Theorem 1, $f(n)\leq\pi_{\textnormal{opt}}(n)$. For a given input $(n_i)_{i\in U}$, the expected size of the output set is:

$$\sum_{k\in U}\mathbb{P}\left[k\in\mathcal{M}\left((n_i)_{i\in U}\right)\right]$$

which is bounded by $\sum_{k\in U}\pi_{\textnormal{opt}}(n_k)$. This is exactly the expected output size obtained with $\mathcal{M}_{\textnormal{opt}}$, which concludes the proof. ∎

3 Thresholding interpretation

In this section, we show that, modulo a minor change in $\varepsilon$ or $\delta$, the optimal partition selection primitive $\pi_{\textnormal{opt}}$ can be interpreted as a noisy thresholding operation, similar to the Laplace-based strategy, but using a truncated version of the geometric distribution. We first define this distribution, then use it to prove this second characterization of $\pi_{\textnormal{opt}}$.

Definition 6 ($k$-TSGD).

Given $p\in(0,1)$ and $k\in\mathbb{N}$ such that $k\geq 1$, the $k$-truncated symmetric geometric distribution ($k$-TSGD) of parameter $p$ is the distribution defined on $\mathbb{Z}$ such that:

$$\mathbb{P}\left[X=x\right]=\begin{cases}c\cdot(1-p)^{|x|}&\textnormal{if }x\in[-k,k]\cap\mathbb{Z}\\0&\textnormal{otherwise}\end{cases} \quad (7)$$

where $c=\frac{p}{1+(1-p)-2(1-p)^{k+1}}$ is a normalization constant ensuring that the total probability is 1.

This distribution can also be obtained by taking a symmetric two-sided geometric distribution [14] (also called the discrete Laplace distribution [18]) with success probability $p$, and conditioning on the event that the result is in $[-k,k]$. As such, the $k$-truncated symmetric geometric distribution is the discrete analogue of the truncated Laplace distribution [12]. A similar construction was also defined in [15] to prove a lower bound on the loss achievable with $(\varepsilon,\delta)$-differential privacy, but it is not a proper probability distribution, since its total mass does not sum up to one: in the proof of Theorem 8 of [15], the sum over non-negative $i$ is assumed to be $1/2$, but 0 is counted twice when summing over non-negative and non-positive $i$.

Given privacy parameters $\varepsilon$ and $\delta$, we can set the values of $p$ and $k$ such that adding noise drawn from this truncated geometric distribution achieves $(\varepsilon,\delta)$-differential privacy for counting queries.

Definition 7 (Truncated geometric mechanism).

Given privacy parameters $\varepsilon>0$ and $\delta>0$, let $p=1-e^{-\varepsilon}$ and $k=\left\lceil\frac{1}{\varepsilon}\ln\left(\frac{e^{\varepsilon}+2\delta-1}{(e^{\varepsilon}+1)\delta}\right)\right\rceil$. Let the true result of an integer-valued query with sensitivity 1 be $\mu\in\mathbb{Z}$. Then the truncated geometric mechanism returns $Y=\mu+X$, where $X$ is drawn from the $k$-TSGD with success probability $p$. The result has the distribution:

$$\mathbb{P}\left[Y=y\right]=\begin{cases}c\cdot e^{-|y-\mu|\varepsilon}&\textnormal{if }y\in[\mu-k,\mu+k]\cap\mathbb{Z}\\0&\textnormal{otherwise}\end{cases}$$

where $c=\frac{1-e^{-\varepsilon}}{1+e^{-\varepsilon}-2e^{-(k+1)\varepsilon}}$ is a normalization constant ensuring that the total probability is 1.

The value of $k$ is the smallest value such that

$$\mathbb{P}\left[X=k\right]=\frac{e^{-k\varepsilon}(1-e^{-\varepsilon})}{1+e^{-\varepsilon}-2e^{-(k+1)\varepsilon}}\leq\delta$$

for the $k$-TSGD.
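As an illustration, here is one possible Python sketch (ours, not from the paper) of the truncated geometric mechanism; it samples the $k$-TSGD by inverse transform over its finite support:

import math
import random

def truncated_geometric_mechanism(mu, eps, delta):
    """Add k-TSGD noise (success probability p = 1 - exp(-eps)) to an
    integer count mu, with k chosen as in Definition 7."""
    k = math.ceil(math.log((math.exp(eps) + 2 * delta - 1)
                           / ((math.exp(eps) + 1) * delta)) / eps)
    # P[X = x] is proportional to (1 - p)^|x| = exp(-|x| * eps) on [-k, k].
    weights = [math.exp(-abs(x) * eps) for x in range(-k, k + 1)]
    r = random.uniform(0, sum(weights))
    for x, w in zip(range(-k, k + 1), weights):
        r -= w
        if r <= 0:
            return mu + x
    return mu + k  # numerical fallback, reached with negligible probability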

Theorem 4.

The truncated geometric mechanism satisfies $(\varepsilon,\delta)$-differential privacy.

Proof.

This follows the same line of reasoning as the proof of Theorem 1 in [12]. The only difference is the change from a continuous distribution to a discrete one, since all the values are integers. If the result of the query before adding noise is $\mu$, then for an adjacent database, the corresponding value $\mu'$ must be in $\{\mu-1,\mu,\mu+1\}$. If $\mu'=\mu$, the distribution of the output after adding noise is unchanged, trivially satisfying the $(\varepsilon,\delta)$-differential privacy property. By symmetry, it is sufficient to analyze the case where $\mu'=\mu+1$. Here, the new distribution of the output of the mechanism is

$$\mathbb{P}\left[Y'=y\right]=c\cdot e^{-|y-\mu-1|\varepsilon}$$

if $y\in[\mu-k+1,\mu+k+1]\cap\mathbb{Z}$, and $\mathbb{P}\left[Y'=y\right]=0$ otherwise.

By symmetry, to show that $(\varepsilon,\delta)$-differential privacy is satisfied, we only need to show that $\mathbb{P}\left[Y'\in S\right]\leq e^{\varepsilon}\mathbb{P}\left[Y\in S\right]+\delta$ for all $S\subset\mathbb{Z}$. For all values $y\in\mathbb{Z}$ except $\mu+k+1$, we have $\mathbb{P}\left[Y'=y\right]\leq e^{\varepsilon}\mathbb{P}\left[Y=y\right]$. Also, $\mathbb{P}\left[Y=\mu+k+1\right]=0$ and $\mathbb{P}\left[Y'=\mu+k+1\right]=\mathbb{P}\left[X=k\right]>0$. Therefore, $\mathbb{P}\left[Y'\in S\right]-e^{\varepsilon}\mathbb{P}\left[Y\in S\right]$ is maximized when $S=\{\mu+k+1\}$. This means that the condition is satisfied if $\mathbb{P}\left[X=k\right]\leq\delta$. From the definition of $k$ in the truncated geometric mechanism:

$$k=\left\lceil\frac{1}{\varepsilon}\ln\left(\frac{e^{\varepsilon}+2\delta-1}{(e^{\varepsilon}+1)\delta}\right)\right\rceil$$

which leads to:

$$e^{k\varepsilon}\geq\frac{e^{\varepsilon}+2\delta-1}{(e^{\varepsilon}+1)\delta}$$

thus:

$$e^{k\varepsilon}(e^{\varepsilon}+1)\delta-2\delta\geq e^{\varepsilon}-1$$

and, multiplying both sides by $e^{-(k+1)\varepsilon}$:

$$\left(1+e^{-\varepsilon}-2e^{-(k+1)\varepsilon}\right)\delta\geq e^{-k\varepsilon}(1-e^{-\varepsilon})$$

and finally:

$$\delta\geq\frac{e^{-k\varepsilon}(1-e^{-\varepsilon})}{1+e^{-\varepsilon}-2e^{-(k+1)\varepsilon}}=\mathbb{P}\left[X=k\right].$$
∎

Let us build some intuition for why thresholding the truncated geometric mechanism leads to an optimal partition selection primitive. First, we compute the tail cumulative distribution function of the output of the truncated geometric mechanism. Summing the probability masses gives a geometric series:

$$\mathbb{P}\left[Y\geq y\right]=\begin{cases}1&\textnormal{if }y\leq\mu-k\\1-\frac{e^{(k+y-\mu)\varepsilon}-1}{e^{\varepsilon}-1}ce^{-k\varepsilon}&\textnormal{if }\mu-k\leq y\leq\mu-1\\\frac{e^{(\mu+k+1-y)\varepsilon}-1}{e^{\varepsilon}-1}ce^{-k\varepsilon}&\textnormal{if }\mu\leq y\leq\mu+k\\0&\textnormal{if }y>\mu+k.\end{cases}$$

If we define $\delta=ce^{-k\varepsilon}$ and rearrange the cases as functions of $\mu$, we get:

$$\mathbb{P}\left[Y\geq y\right]=\begin{cases}0&\textnormal{if }\mu<y-k\\\frac{e^{(\mu+k+1-y)\varepsilon}-1}{e^{\varepsilon}-1}\delta&\textnormal{if }y-k\leq\mu\leq y\\1-\frac{e^{(k+y-\mu)\varepsilon}-1}{e^{\varepsilon}-1}\delta&\textnormal{if }y+1\leq\mu\leq y+k\\1&\textnormal{if }\mu\geq y+k.\end{cases} \quad (8)$$

The $\mu\leq y$ cases of this formula are the same as the closed-form formula for $\pi_{\textnormal{opt}}$ in Theorem 1 for values less than $n_1$; the $\mu>y$ cases are simply the symmetric reflection of the former. We formalize this intuition and show that whenever $\frac{1}{\varepsilon}\ln\left(\frac{e^{\varepsilon}+2\delta-1}{(e^{\varepsilon}+1)\delta}\right)$ is an integer, the two approaches are exactly the same.
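One can check this correspondence numerically; the following sketch (ours) evaluates the tail probability of equation (8), with $c$ and $\delta=ce^{-k\varepsilon}$ computed as in Definition 7:

import math

def tail_prob(mu, y, eps, k):
    """P[Y >= y] for Y = mu + X, X drawn from the k-TSGD with
    p = 1 - exp(-eps), following equation (8)."""
    c = (1 - math.exp(-eps)) / (1 + math.exp(-eps)
                                - 2 * math.exp(-(k + 1) * eps))
    delta = c * math.exp(-k * eps)
    if mu < y - k:
        return 0.0
    if mu <= y:
        return (math.exp((mu + k + 1 - y) * eps) - 1) / (math.exp(eps) - 1) * delta
    if mu < y + k:
        return 1 - (math.exp((k + y - mu) * eps) - 1) / (math.exp(eps) - 1) * delta
    return 1.0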

Theorem 5 (Noisy thresholding is optimal).

If $\delta\in(0,1)$ and $\varepsilon>0$ are such that $k=\frac{1}{\varepsilon}\ln\left(\frac{e^{\varepsilon}+2\delta-1}{(e^{\varepsilon}+1)\delta}\right)$ is an integer, then for all $n$:

$$\pi_{\textnormal{opt}}(n)=\mathbb{P}\left[n+X\geq k+1\right]$$

where $X$ is a random variable sampled from a $k$-truncated symmetric geometric distribution of success probability $1-e^{-\varepsilon}$.

Proof.

When $k=\frac{1}{\varepsilon}\ln\left(\frac{e^{\varepsilon}+2\delta-1}{(e^{\varepsilon}+1)\delta}\right)$ is an integer, we have $n_1=k+1$, and

$$e^{k\varepsilon}=\frac{e^{\varepsilon}+2\delta-1}{(e^{\varepsilon}+1)\delta}$$

which leads to

$$e^{k\varepsilon}(e^{\varepsilon}+1)\delta=e^{\varepsilon}-1+2\delta$$

and

$$\left(e^{(k+1)\varepsilon}-1\right)\delta+\left(e^{k\varepsilon}-1\right)\delta=e^{\varepsilon}-1.$$

On further rearranging, we get

$$\frac{e^{(k+1)\varepsilon}-1}{e^{\varepsilon}-1}\cdot\delta+\frac{e^{k\varepsilon}-1}{e^{\varepsilon}-1}\cdot\delta=1,$$

and thus:

$$1-\pi_{\textnormal{opt}}(n_1)=\pi_{\textnormal{opt}}(n_1-1).$$

From Lemma 2, we also get

$$1-\pi_{\textnormal{opt}}(n)=e^{-\varepsilon}\left(\left(1-\pi_{\textnormal{opt}}(n-1)\right)-\delta\right)$$

if $n_1<n\leq n_2$. Since we also have

$$\pi_{\textnormal{opt}}(n)=e^{\varepsilon}\pi_{\textnormal{opt}}(n-1)+\delta$$

if $0<n\leq n_1$, we find by induction that for $n_1<n\leq n_2$:

$$\pi_{\textnormal{opt}}(n)=1-\pi_{\textnormal{opt}}(2n_1-1-n)=1-\frac{e^{(2k+1-n)\varepsilon}-1}{e^{\varepsilon}-1}\cdot\delta.$$

Consequently, for such special combinations of $\varepsilon$ and $\delta$:

$$n_2=2n_1-1=2k+1.$$

Now, rewriting the formula for $\pi_{\textnormal{opt}}$ in Theorem 1 using $\mu=n$ and $k=n_1-1$ gives us that $\pi_{\textnormal{opt}}(\mu)$ is:

• $0$ if $\mu\leq 0$,

• $\frac{e^{(\mu+k+1-(k+1))\varepsilon}-1}{e^{\varepsilon}-1}\cdot\delta$ if $0<\mu\leq k+1$,

• $\left(1-e^{-(\mu-(k+1))\varepsilon}\right)\left(1+\frac{\delta}{e^{\varepsilon}-1}\right)+e^{-(\mu-(k+1))\varepsilon}\cdot\frac{e^{(k+1)\varepsilon}-1}{e^{\varepsilon}-1}\cdot\delta$ if $k+1<\mu\leq 2k+1$,

• $1$ otherwise.

Comparing this with (8), taking $y=k+1$, shows that for this combination of $\varepsilon$ and $\delta$, and for the corresponding derived values of $p$ and $k$:

$$\pi_{\textnormal{opt}}(\mu)=\mathbb{P}\left[Y\geq k+1\right].$$
∎

This characterization suggests a simple implementation of the optimal partition selection primitive, at a minor cost in $\varepsilon$ or $\delta$. Given arbitrary $\varepsilon$ and $\delta$, one can replace $\varepsilon$ by $\hat{\varepsilon}\leq\varepsilon$, or $\delta$ by $\hat{\delta}\leq\delta$, to ensure that $k$ from Theorem 5 is an integer. In our definition of the truncated geometric mechanism, we choose the latter strategy, requiring a slightly lower $\delta$ by using an integer upper bound on $k$, and using $p=1-e^{-\varepsilon}$ to fully utilize the $\varepsilon$ budget. We then apply the truncated geometric mechanism to the number of unique users in each partition, and return this partition if the noisy count is larger than $k$. Further, this noisy count may also be published for such a partition, while still satisfying $(\varepsilon,\delta)$-differential privacy.

To see this, consider an arbitrarily large finite family of partitions $Q$ (for example, all possible partitions representable by bytestrings that fit within available data storage) such that each user in a database $D$ is associated with at most one partition $q\in Q$. Consider the following mechanism.

Definition 8 ($k$-TSGD thresholded release).

For a database $D$, let $c_q(D)$ be the number of users associated with partition $q$. Let $Q_D\subset Q$ be the finite subset $\{q\mid q\in Q\text{ and }c_q(D)>0\}$ of partitions present in the dataset $D$. Let the noise values $X_q$ for $q\in Q$ be i.i.d. random variables drawn from the $k$-TSGD of parameters $p=1-e^{-\varepsilon}$ and $k=\left\lceil\frac{1}{\varepsilon}\ln\left(\frac{e^{\varepsilon}+2\delta-1}{(e^{\varepsilon}+1)\delta}\right)\right\rceil$, and let $\hat{c}_q(D)=c_q(D)+X_q$. Then the $k$-TSGD thresholded release mechanism produces the set

$$\left\{(q,\hat{c}_q(D))\mid q\in Q_D\text{ and }\hat{c}_q(D)>k\right\}.$$
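A minimal Python sketch of this mechanism (ours), reusing the truncated_geometric_mechanism helper sketched after Definition 7:

import math

def tsgd_thresholded_release(counts, eps, delta):
    """counts: dict mapping each partition q in Q_D to its number of
    unique users c_q(D). Returns {q: noisy count} for noisy counts > k."""
    k = math.ceil(math.log((math.exp(eps) + 2 * delta - 1)
                           / ((math.exp(eps) + 1) * delta)) / eps)
    released = {}
    for q, c in counts.items():
        noisy = truncated_geometric_mechanism(c, eps, delta)
        if noisy > k:
            released[q] = noisy
    return released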
Theorem 6.

The $k$-TSGD thresholded release mechanism satisfies $(\varepsilon,\delta)$-differential privacy.

Proof.

Consider the mechanism that adds $k$-TSGD noise of parameters $p=1-e^{-\varepsilon}$ and $k=\left\lceil\frac{1}{\varepsilon}\ln\left(\frac{e^{\varepsilon}+2\delta-1}{(e^{\varepsilon}+1)\delta}\right)\right\rceil$ to every possible partition count, including those not present in the dataset. That is, we apply the truncated geometric mechanism to the unique-user counts of all possible partitions (even partitions not contained in the database), which produces the set

$$\left\{(q,\hat{c}_q(D))\mid q\in Q\right\}.$$

This mechanism is $(\varepsilon,\delta)$-differentially private: a single user's addition or removal changes only one partition, and on this partition, Theorem 4 shows that the output satisfies $(\varepsilon,\delta)$-differential privacy. Combined with the fact that the noise values are independent, this means that the entire mechanism is also $(\varepsilon,\delta)$-differentially private.

Adding a thresholding step that releases the noised values only when they are greater than $k$ is only post-processing. Therefore, the mechanism that releases

$$\left\{(q,\hat{c}_q(D))\mid q\in Q\text{ and }\hat{c}_q(D)>k\right\}$$

is also $(\varepsilon,\delta)$-differentially private.

Now, notice that this mechanism is exactly the same as if we had only added noise to the partitions in $Q_D$: the noise added to the zero count of an empty partition is at most $k$, so these partitions are removed from the output in the thresholding step. Since these two mechanisms are identical and one is $(\varepsilon,\delta)$-differentially private, both are. ∎

We note that this can be extended to the case where the set of allowed partitions $Q$ is countably infinite, using standard techniques from measure theory [16]. Thus, this mechanism is the $(\varepsilon,\delta)$-differential privacy equivalent of Algorithm Filter in [6], which achieves $(\varepsilon,0)$-differential privacy when the set of possible partitions is known beforehand and not very large.

To demonstrate the utility of such an operation, consider a slight variation of the example query presented in the introduction.

SELECT
device_model,
COUNT(user_id),
AVG(latency)
FROM database
GROUP BY device_model

For simplicity, let us assume that each user contributes only one row with a single value for the latency. Then, this may be implemented in the following way.

SELECT
device_model,
COUNT(user_id),
SUM(latency) / COUNT(user_id)
FROM database
GROUP BY device_model

The available $(\varepsilon,\delta)$ budget must be split between the partition selection, the sum, and the count. Instead of instantiating separate noise values for partition selection and for the count, and having to split the budget three ways, we can use noisy thresholding on the count: the same noisy count both determines whether the partition is released and serves as its published value. This may be used to obtain a more accurate count, or to leave more of the $(\varepsilon,\delta)$ budget for the sum estimation.

4 Numerical evaluation

Theorem 1 shows that the optimal partition selection primitive $\pi_{\textnormal{opt}}$ outperforms all other options. How does it compare with the naive strategy of adding Laplace noise and thresholding the result?

Definition 9 (Laplace partition selection [20]).

We denote by $\textnormal{Lap}(b)$ a random variable sampled from a Laplace distribution of mean 0 and scale $b$. The following partition selection strategy $\rho_{\textnormal{Lap}}$, called Laplace-based partition selection, is $(\varepsilon,\delta)$-differentially private:

$$\rho_{\textnormal{Lap}}(n)=\begin{cases}\textnormal{drop}&\textnormal{if }n+\textnormal{Lap}\left(\frac{1}{\varepsilon}\right)<1-\frac{\ln(2\delta)}{\varepsilon}\\\textnormal{keep}&\textnormal{otherwise.}\end{cases}$$

We denote by $\pi_{\textnormal{Lap}}$ the corresponding partition selection primitive: $\pi_{\textnormal{Lap}}(n)=\mathbb{P}\left[\rho_{\textnormal{Lap}}(n)=\textnormal{keep}\right]$.
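For reference, a Python sketch (ours) of this baseline, sampling $\textnormal{Lap}(1/\varepsilon)$ by inverse transform:

import math
import random

def rho_laplace(n, eps, delta):
    """Laplace-based partition selection: keep the partition if the
    noisy user count clears the threshold 1 - ln(2*delta)/eps."""
    u = random.random() - 0.5
    noise = -math.copysign(math.log(1 - 2 * abs(u)), u) / eps  # Lap(1/eps)
    threshold = 1 - math.log(2 * delta) / eps
    return "keep" if n + noise >= threshold else "drop"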

As expected, using the optimal partition selection primitive translates into a larger probability of releasing a partition with the same number of users. As exemplified in Figure 1, the difference is especially large in the high-privacy regime.

Fig. 1: Probability of releasing a partition depending on the number of unique users $n$, comparing Laplace-based partition selection ($\pi_{\textnormal{Lap}}$) with $\pi_{\textnormal{opt}}$. On the left, $\varepsilon=1$ and $\delta=10^{-5}$; on the right, $\varepsilon=0.1$ and $\delta=10^{-10}$.

To better understand the dependency on $\varepsilon$ and $\delta$, we also compare the midpoint obtained for both partition selection strategies $\rho$: the number $n$ for which the probability of releasing a partition with $n$ users is $0.5$. For Laplace-based partition selection, this $n$ is simply the threshold. As Figure 2 shows, the gains are especially substantial when $\varepsilon$ is small, and not significant for $\varepsilon>1$. Figure 3 shows the dependency on $\delta$: for a fixed $\varepsilon$, there is a constant interval between the midpoints of both strategies. Thus, the relative gains are larger for a larger $\delta$, since the midpoint is also smaller.

Fig. 2: Comparison of the 5th, 50th, and 95th percentiles of the partition selection strategy $\rho$ as a function of $\varepsilon$, for $\delta=10^{-5}$, comparing $\pi_{\textnormal{opt}}$ and $\pi_{\textnormal{Lap}}$. The midpoint is plotted as a solid line, while the 5th and 95th percentiles are dashed.
Fig. 3: Comparison of the 5th, 50th, and 95th percentiles of the partition selection strategy $\rho$ as a function of $\delta$, for $\varepsilon=0.1$ (left) and $\varepsilon=1$ (right), comparing $\pi_{\textnormal{opt}}$ and $\pi_{\textnormal{Lap}}$. The midpoint is plotted as a solid line, while the 5th and 95th percentiles are dashed.

We also verified experimentally that on each partition, the selection mechanism runs in constant, very short time, on the order of 100 nanoseconds on a standard machine. This is not surprising: Theorem 1 provides a simple, closed-form formula for computing $\pi_{\textnormal{opt}}(n)$, and generating the random decision based on this probability is computationally trivial. The performance impact of Laplace-based thresholding is similarly negligible: the only real cost of such simple partition selection strategies is counting the number of unique users $n$ in each partition, which is orders of magnitude more computationally intensive than computing $\pi(n)$.

5 Discussion

The approach presented in this work is both easy to implement and efficient. Counting the number of unique users per partition can be done in one pass over the data and is massively parallelizable. Furthermore, since there is a relatively small value $k$ such that the probability of keeping a partition with $n\geq k$ users is 1, the counting process can be interrupted as soon as a partition reaches $k$ users. This keeps memory usage low (in $O(k)$ per partition) without requiring approximate count-distinct algorithms like HyperLogLog [11], for which a more complex sensitivity analysis would be needed.
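For instance, a one-pass counting routine that stops tracking a partition once it reaches $k$ distinct users could look like the following sketch (ours); the input format is an assumption for illustration:

def count_unique_users_capped(rows, k):
    """rows: iterable of (user_id, partition) pairs, at most one
    partition per user. Tracks at most k distinct user ids per
    partition: since the release probability is 1 for n >= k,
    capping the counts at k loses no information."""
    seen = {}
    for user_id, partition in rows:
        users = seen.setdefault(partition, set())
        if len(users) < k:
            users.add(user_id)
    return {p: len(users) for p, users in seen.items()}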

Extension to multiple partitions per user

Our approach could, in principle, be extended to cases where each user can contribute to $\kappa>1$ partitions. Following the intuition of Lemma 1, we could list a set of recursive equations defining $\pi_{\textnormal{opt}}(n)$ as a function of the $\pi_{\textnormal{opt}}(i)$ for $i<n$. Sadly, the system of equations quickly gets too large to solve naively. Consider, for example, the case where $\kappa=2$. The differential privacy constraint requires, for all $n\geq i\geq 0$ and all $S\subseteq\{\textnormal{drop},\textnormal{keep}\}^2$:

$$\mathbb{P}\left[\left(\rho_\pi(n+1),\rho_\pi(i+1)\right)\in S\right]\leq e^{\varepsilon}\cdot\mathbb{P}\left[\left(\rho_\pi(n),\rho_\pi(i)\right)\in S\right]+\delta$$
$$\mathbb{P}\left[\left(\rho_\pi(n),\rho_\pi(i)\right)\in S\right]\leq e^{\varepsilon}\cdot\mathbb{P}\left[\left(\rho_\pi(n+1),\rho_\pi(i+1)\right)\in S\right]+\delta.$$

Thus, to maximize $\pi(n)$, we have to consider $32n$ inequalities: $n$ possible values of $i$, $2^{(2^2)}=16$ possible values of $S$, and two inequalities per pair. When $\kappa$ increases, the total number of inequalities needed to compute all elements up to $n$ is $O\left(n^{\kappa}2^{\kappa^2}\right)$. Some of these inequalities are trivial (e.g., when $S=\emptyset$ or $S=\{\textnormal{drop},\textnormal{keep}\}^{\kappa}$), but most are not. We do not know whether it is possible to only consider a small number of these inequalities and obtain the others "for free".

Furthermore, the recurrence-based proof of optimality of $\pi_{\textnormal{opt}}$ only holds when we assume that each user contributes to exactly $\kappa$ partitions in the original dataset. As discussed previously, this is relatively frequent when $\kappa=1$, but it rarely happens for larger values of $\kappa$: in typical datasets, some users contribute to more partitions than others. In that case, weighting the contributions of each user differently can bring additional benefits, as can changing each user's strategy based on those of previous users [13]. For this generalized problem, it seems difficult to even define what optimality means.

The simplest option to use our approach for $\kappa>1$ is to divide the total privacy budget by $\kappa$. For generic tooling with strict scalability requirements where the analyst manually specifies $\kappa$, we recommend using our method (splitting the privacy budget) for $\kappa\leq 3$, and weighted Gaussian thresholding (described in [13]) for $\kappa\geq 4$. Figure 4 compares the midpoint of the partition selection strategy between $\pi_{\textnormal{opt}}$, Laplace-based thresholding, and (non-weighted) Gaussian-based thresholding. It shows that the crossover happens at $\kappa=3$; this stays true for varying values of $\varepsilon$ and $\delta$.

Comparison with weighted Gaussian thresholding is less straightforward, since its benefits depend on the data distribution. However, weighted Gaussian thresholding is always better than non-weighted Gaussian thresholding, and is straightforward to implement in a massively parallelizable fashion. We have also observed that its utility benefits are only significant for large $\kappa$, so our recommendation to use $\pi_{\textnormal{opt}}$ for $\kappa\leq 3$ is likely robust.

Fig. 4: Comparison of the midpoint of the partition selection strategy $\rho$ as a function of $\kappa$, for $\varepsilon=1$ and $\delta=10^{-5}$, comparing $\pi_{\textnormal{opt}}$, $\pi_{\textnormal{Lap}}$, and $\pi_{\textnormal{Gauss}}$. For $\kappa>1$, the privacy budget is divided by $\kappa$ for $\pi_{\textnormal{opt}}$ and $\pi_{\textnormal{Lap}}$; for $\pi_{\textnormal{Gauss}}$, we use the formula in [3] to derive the standard deviation of the Gaussian noise, and we split the $\delta$ between noise and thresholding to minimize the threshold.

Policy-based approaches like those described in [13] also provide more utility, but they are not as scalable: since the strategy for each user depends on the choices made by all previous users, the computation cannot be parallelized. These approaches also require keeping an in-memory histogram of all partitions seen so far, which does not scale to extremely large datasets. Improving the scalability of such policy-based approaches is an interesting open problem, on which further research would be valuable.

Extension to bounded differential privacy

In the definition of differential privacy we use in this work, neighboring databases differ in a single user being added or removed. This notion is called unbounded differential privacy in [21], by contrast to bounded differential privacy, in which neighboring datasets differ in a single user changing their data. Since changing one user’s data can be decomposed into a removal followed by an addition, $\left(\varepsilon,\delta\right)$-unbounded DP implies $(2\varepsilon,2\delta)$-bounded DP, which provides a trivial way to extend our method to the bounded version of the definition: simply divide the privacy budget by two. This method outperforms Laplace-based thresholding, since Laplace noise of scale $2/\varepsilon$ must be added in the bounded setting (the $L_{1}$-sensitivity is $2$ instead of $1$). Further, when $k$ from Theorem 5 is an integer, this noise distribution exactly achieves the lower bound on the loss from [15], and is therefore optimal for arbitrary symmetric loss functions.

Extension to weighted sampling

Another extension of this work was developed independently in [4]: the authors consider the problem of differentially private weighted sampling, where the selected partitions are also sampled to generate a compact summary of the data, rather than the maximal possible output. In the extreme case where the sampling probability is one, this is equivalent to partition selection.

Other possible extensions

The truncated geometric mechanism can be used as a building block to replace the Laplace or geometric mechanism in situations where $\left(\varepsilon,\delta\right)$-DP with $\delta>0$ is acceptable. Similarly to the truncated Laplace mechanism [12], this building block is optimal for integer-valued functions.
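Since the sketches below reuse it, here is a hedged sample of such a noise distribution: integer noise with mass proportional to $e^{-\varepsilon|z|}$ on a bounded support. The exact support and normalization required for $\left(\varepsilon,\delta\right)$-DP follow Definition 7 and are not reproduced here; the helper name is ours.

import math
import random

def truncated_geometric_noise(epsilon, bound):
    # Sample integer noise with mass proportional to exp(-epsilon * |z|)
    # on {-bound, ..., bound}. Sketch only: the exact support and
    # normalization needed for (epsilon, delta)-DP follow Definition 7.
    support = list(range(-bound, bound + 1))
    weights = [math.exp(-epsilon * abs(z)) for z in support]
    return random.choices(support, weights=weights, k=1)[0]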

To see how such a building block could be used in practice, consider the problem of releasing a histogram where some partitions are known in advance (call them public partitions), and some are not and must be discovered using the private data (private partitions). Note that some public partitions might be absent from the private data. In that case, one could add truncated geometric noise to all partitions (public and private), and use two distinct thresholds: one given by the formula for $k$ in Definition 7, and an arbitrary one $t$.

  • $k$ is used to threshold the partitions present in the private data but not in the list of public partitions;

  • $t$ is used to threshold the public partitions (whether or not they also appear in the private data).

The second threshold $t$ can be arbitrary, and allows an analyst to control the trade-off between false positives and false negatives. For example, setting $t=0$ guarantees that all public partitions that appear in the private data are present in the output (no false negatives), at the cost of a potentially large number of public partitions appearing in the output even though they were absent from the data (many false positives). Conversely, setting $t=k$ guarantees that only partitions present in the private data can appear in the output (no false positives), at the cost of dropping potentially many of them (many false negatives). Intermediate values of $t$ let an analyst tune this trade-off more finely depending on the application.
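A minimal sketch of this hybrid release, reusing the truncated_geometric_noise helper above and treating $k$ and $t$ as precomputed inputs ($k$ per Definition 7, $t$ chosen by the analyst):

def release_histogram(counts, public_partitions, epsilon, k, t, bound):
    # Noise every partition, then threshold: partitions found only in
    # the private data must clear k, while public partitions only need
    # to clear the analyst-chosen threshold t.
    output = {}
    # Public partitions absent from the data still get a noisy zero.
    for p in set(counts) | set(public_partitions):
        noisy = counts.get(p, 0) + truncated_geometric_noise(epsilon, bound)
        threshold = t if p in public_partitions else k
        if noisy >= threshold:
            output[p] = noisy
    return output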

We postulate that this building block could be used in a variety of different settings, and combined with existing techniques. For example, one could build a variant of the sparse vector technique [8] that uses the truncated geometric mechanism instead of the Laplace mechanism to add noise to the output of the queries and to the threshold. This could be used to efficiently simulate the sparse vector technique on a very large number of queries, most of which are deterministically below the threshold and can be skipped during computation. Formalizing this intuition and using it for partition selection with $\kappa>1$ is left to future work.
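To illustrate the computational point only (this is not the formalization, which is left to future work): if the truncated noise is bounded by some $B$ in absolute value, a query whose value lies more than $2B$ below the threshold can never clear the noisy threshold, so it can be skipped without sampling any noise. A hedged sketch of the skipping logic, again reusing the helper above and ignoring the halting condition of the real sparse vector technique:

def queries_above_threshold(query_values, threshold, epsilon, bound):
    # Both noises are bounded by `bound`, so a query more than 2 * bound
    # below the threshold is deterministically rejected and can be
    # skipped. Sketch of the skipping logic only: a faithful sparse
    # vector variant must also halt after a fixed number of positives.
    noisy_threshold = threshold + truncated_geometric_noise(epsilon, bound)
    above = []
    for i, value in enumerate(query_values):
        if value < threshold - 2 * bound:
            continue  # can never pass: skip without sampling noise
        if value + truncated_geometric_noise(epsilon, bound) >= noisy_threshold:
            above.append(i)
    return above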

6 Conclusion

We introduced an optimal primitive for differentially private partition selection, a special case of differentially private set union where the sensitivity is $1$. This optimal approach is simple to implement and efficient. It outperforms Laplace-based thresholding; the utility gain is especially significant in the high-privacy (small $\varepsilon$) regime. Besides the possible research directions outlined previously, this work leaves two open questions. Is it possible to extend this optimal approach to larger sensitivities in a simple and efficient manner? Furthermore, is it possible to combine this primitive with existing approaches to differentially private set union [13], like weighted histograms or policy-based strategies?

7 Acknowledgments

The authors gratefully acknowledge Alex Kulesza, Chao Li, Michael Daub, Kareem Amin, Peter Dickman, Peter Kairouz, and the PETS reviewers for their helpful feedback on this work.

D.D. was employed by Google and ETH Zurich at the time of this work. This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sector.

References

  • ACC [12] Gergely Acs, Claude Castelluccia, and Rui Chen. Differentially private histogram publishing through lossy compression. In 2012 IEEE 12th International Conference on Data Mining, pages 1–10. IEEE, 2012.
  • BHE+ [18] Johes Bater, Xi He, William Ehrich, Ashwin Machanavajjhala, and Jennie Rogers. ShrinkWrap: Differentially-private query processing in private data federations. arXiv preprint arXiv:1810.01816, 2018.
  • BW [18] Borja Balle and Yu-Xiang Wang. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning, pages 394–403. PMLR, 2018.
  • CGSS [21] Edith Cohen, Ofir Geri, Tamas Sarlos, and Uri Stemmer. Differentially private weighted sampling. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2404–2412. PMLR, 13–15 Apr 2021.
  • CPST [11] Graham Cormode, Magda Procopiuc, Divesh Srivastava, and Thanh TL Tran. Differentially private publication of sparse data. arXiv preprint arXiv:1103.0825, 2011.
  • CPST [12] Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, and Thanh TL Tran. Differentially private summaries for sparse data. In Proceedings of the 15th International Conference on Database Theory, pages 299–311, 2012.
  • DMNS [06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
  • DNR+ [09] Cynthia Dwork, Moni Naor, Omer Reingold, Guy N Rothblum, and Salil Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 381–390, 2009.
  • DR [14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
  • DWHL [11] Bolin Ding, Marianne Winslett, Jiawei Han, and Zhenhui Li. Differentially private data cubes: optimizing noise sources and consistency. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 217–228, 2011.
  • FFGM [07] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science, pages 137–156. Discrete Mathematics and Theoretical Computer Science, 2007.
  • GDGK [20] Quan Geng, Wei Ding, Ruiqi Guo, and Sanjiv Kumar. Tight analysis of privacy and utility tradeoff in approximate differential privacy. In International Conference on Artificial Intelligence and Statistics, pages 89–99, 2020.
  • GGK+ [20] Sivakanth Gopi, Pankaj Gulhane, Janardhan Kulkarni, Judy Hanwen Shen, Milad Shokouhi, and Sergey Yekhanin. Differentially private set union. arXiv preprint arXiv:2002.09745, 2020.
  • GRS [12] Arpita Ghosh, Tim Roughgarden, and Mukund Sundararajan. Universally utility-maximizing privacy mechanisms. SIAM Journal on Computing, 41(6):1673–1693, 2012.
  • GV [16] Quan Geng and Pramod Viswanath. Optimal noise adding mechanisms for approximate differential privacy. IEEE Transactions on Information Theory, 62(2):952–969, Feb 2016.
  • HLM [15] Naoise Holohan, Douglas J Leith, and Oliver Mason. Differential privacy in metric spaces: Numerical, categorical and functional data under the one roof. Information Sciences, 305:256–268, 2015.
  • HRMS [09] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. Boosting the accuracy of differentially-private histograms through consistency. arXiv preprint arXiv:0904.0942, 2009.
  • IK [06] Seidu Inusah and Tomasz J Kozubowski. A discrete analogue of the Laplace distribution. Journal of Statistical Planning and Inference, 136(3):1090–1102, 2006.
  • JNS [18] Noah Johnson, Joseph P Near, and Dawn Song. Towards practical differential privacy for SQL queries. Proceedings of the VLDB Endowment, 11(5):526–539, 2018.
  • KKMN [09] Aleksandra Korolova, Krishnaram Kenthapadi, Nina Mishra, and Alexandros Ntoulas. Releasing search queries and clicks privately. In Proceedings of the 18th International Conference on World Wide Web, pages 171–180, 2009.
  • KM [11] Daniel Kifer and Ashwin Machanavajjhala. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 193–204, 2011.
  • KMS [20] Haim Kaplan, Yishay Mansour, and Uri Stemmer. The sparse vector technique, revisited. arXiv preprint arXiv:2010.00917, 2020.
  • KTH+ [19] Ios Kotsogiannis, Yuchao Tao, Xi He, Maryam Fanaeepour, Ashwin Machanavajjhala, Michael Hay, and Gerome Miklau. PrivateSQL: A differentially private SQL query engine. Proceedings of the VLDB Endowment, 12(11):1371–1384, 2019.
  • LC [14] Jaewoo Lee and Christopher W Clifton. Top-k frequent itemsets via differentially private FP-trees. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 931–940, 2014.
  • LSL [16] Min Lyu, Dong Su, and Ninghui Li. Understanding the sparse vector technique for differential privacy. arXiv preprint arXiv:1603.01699, 2016.
  • LXJ [14] Haoran Li, Li Xiong, and Xiaoqian Jiang. Differentially private synthesization of multi-dimensional data using copula functions. In Advances in database technology: proceedings. International conference on extending database technology, volume 2014, page 475. NIH Public Access, 2014.
  • WZL+ [19] Royce J Wilson, Celia Yuxin Zhang, William Lam, Damien Desfontaines, Daniel Simmons-Marengo, and Bryant Gipson. Differentially private SQL with bounded user contribution. arXiv preprint arXiv:1909.01917, 2019.
  • XWG [10] Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. Differential privacy via wavelet transforms. IEEE Transactions on Knowledge and Data Engineering, 23(8):1200–1214, 2010.
  • XXFG [12] Yonghui Xiao, Li Xiong, Liyue Fan, and Slawomir Goryczka. DPCube: Differentially private histogram release through multidimensional partitioning. arXiv preprint arXiv:1202.5358, 2012.
  • XZX+ [13] Jia Xu, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, Ge Yu, and Marianne Winslett. Differentially private histogram publication. The VLDB Journal, 22(6):797–822, 2013.
  • ZCP+ [17] Jun Zhang, Graham Cormode, Cecilia M Procopiuc, Divesh Srivastava, and Xiaokui Xiao. PrivBayes: Private data release via Bayesian networks. ACM Transactions on Database Systems (TODS), 42(4):1–41, 2017.