
Private Isotonic Regression

Badih Ghazi           Pritish Kamath           Ravi Kumar           Pasin Manurangsi
Google Research
Mountain View, CA, US
badihghazi@gmail.com, pritish@alum.mit.edu,
ravi.k53@gmail.com, pasin@google.com
Abstract

In this paper, we consider the problem of differentially private (DP) algorithms for isotonic regression. For the most general problem of isotonic regression over a partially ordered set (poset) $\mathcal{X}$ and for any Lipschitz loss function, we obtain a pure-DP algorithm that, given $n$ input points, has an expected excess empirical risk of roughly $\mathrm{width}(\mathcal{X})\cdot\log|\mathcal{X}|/n$, where $\mathrm{width}(\mathcal{X})$ is the width of the poset. In contrast, we also obtain a near-matching lower bound of roughly $(\mathrm{width}(\mathcal{X})+\log|\mathcal{X}|)/n$, that holds even for approximate-DP algorithms. Moreover, we show that the above bounds are essentially the best that can be obtained without utilizing any further structure of the poset. In the special case of a totally ordered set and for $\ell_1$ and $\ell_2^2$ losses, our algorithm can be implemented in near-linear running time; we also provide extensions of this algorithm to the problem of private isotonic regression with additional structural constraints on the output function.

1 Introduction

Isotonic regression is a basic primitive in statistics and machine learning, which has been studied at least since the 1950s [4, 9]; see also the textbooks on the topic [5, 38]. It has seen applications in numerous fields including medicine [31, 39] where the expression of an antigen is modeled as a monotone function of the DNA index and WBC count, and education [19], where isotonic regression was used to predict college GPA using high school GPA and standardized test scores. Isotonic regression is also arguably the most common non-parametric method for calibrating machine learning models [51], including modern neural networks [23].

In this paper, we study isotonic regression with a differential privacy (DP) constraint on the output model. DP [17, 16] is a highly popular notion of privacy for algorithms and machine learning primitives, and has seen increased adoption due to its powerful guarantees and properties [37, 43]. Despite the plethora of work on DP statistics and machine learning (see Section 5 for related work), ours is, to the best of our knowledge, the first to study DP isotonic regression.

In fact, we consider the most general version of the isotonic regression problem. We first set up some notation to describe our results. Let $(\mathcal{X},\leq)$ be any partially ordered set (poset). A function $f:\mathcal{X}\to[0,1]$ is monotone if and only if $f(x)\leq f(x')$ for all $x\leq x'$. For brevity, we use $\mathcal{F}(\mathcal{X},\mathcal{Y})$ to denote the set of all monotone functions from $\mathcal{X}$ to $\mathcal{Y}$; throughout, we consider $\mathcal{Y}\subseteq[0,1]$.

Let $[n]$ denote $\{1,\ldots,n\}$. Given an input dataset $D=\{(x_1,y_1),\dots,(x_n,y_n)\}\in(\mathcal{X}\times[0,1])^n$, let the empirical risk of a function $f:\mathcal{X}\to[0,1]$ be $\mathcal{L}(f;D):=\frac{1}{n}\sum_{i\in[n]}\ell(f(x_i),y_i)$, where $\ell:[0,1]\times[0,1]\to\mathbb{R}$ is a loss function.

We study private isotonic regression in the basic machine learning framework of empirical risk minimization. Specifically, the goal of the isotonic regression problem, given a dataset $D=\{(x_1,y_1),\dots,(x_n,y_n)\}\in(\mathcal{X}\times[0,1])^n$, is to find a monotone function $f:\mathcal{X}\to[0,1]$ that minimizes $\mathcal{L}(f;D)$. The excess empirical risk of a function $f$ is defined as $\mathcal{L}(f;D)-\mathcal{L}(f^*;D)$, where $f^*:=\operatorname{argmin}_{g\in\mathcal{F}(\mathcal{X},\mathcal{Y})}\mathcal{L}(g;D)$.

1.1 Our Results

General Posets.

Our first contribution is to give nearly tight upper and lower bounds for any poset, based on its width, as stated below (see Section 4 for a formal definition).

Theorem 1 (Upper Bound for General Poset).

Let $\mathcal{X}$ be any finite poset and let $\ell$ be an $L$-Lipschitz loss function. For any $\varepsilon\in(0,1]$, there is an $\varepsilon$-DP algorithm for isotonic regression for $\ell$ with expected excess empirical risk at most $O\left(\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|\cdot(1+\log^2(\varepsilon n))}{\varepsilon n}\right)$.

Theorem 2 (Lower Bound for General Poset; Informal).

For any $\varepsilon\in(0,1]$ and any $\delta<0.01\cdot\varepsilon/|\mathcal{X}|$, any $(\varepsilon,\delta)$-DP algorithm for isotonic regression for a “nice” loss function $\ell$ must have expected excess empirical risk $\Omega\left(\frac{\operatorname{width}(\mathcal{X})+\log|\mathcal{X}|}{\varepsilon n}\right)$.

While our upper and lower bounds do not exactly match because of the multiplication-vs-addition of $\log|\mathcal{X}|$, we show in Section 4.3 that there are posets for which each bound is tight. In other words, this gap cannot be closed for generic posets.

Totally Ordered Sets.

The above upper and lower bounds immediately translate to the case of totally ordered sets, by plugging in $\operatorname{width}(\mathcal{X})=1$. More importantly, we give efficient algorithms in this case, which run in time $\tilde{O}(n^2+n\log|\mathcal{X}|)$ for a general loss function $\ell$, and in nearly linear $\tilde{O}(n\cdot\log|\mathcal{X}|)$ time for the widely studied $\ell_2^2$- and $\ell_1$-losses. (Recall that the $\ell_2^2$-loss is $\ell_2^2(y,y')=(y-y')^2$ and the $\ell_1$-loss is $\ell_1(y,y')=|y-y'|$.)

Theorem 3.

For all finite totally ordered sets $\mathcal{X}$, $L$-Lipschitz loss functions $\ell$, and $\varepsilon\in(0,1]$, there is an $\varepsilon$-DP algorithm for isotonic regression for $\ell$ with expected excess empirical risk $O\left(\frac{L\cdot(\log|\mathcal{X}|)\cdot(1+\log^2(\varepsilon n))}{\varepsilon n}\right)$. The running time of this algorithm is $\tilde{O}(n^2+n\log|\mathcal{X}|)$ in general and can be improved to $\tilde{O}(n\log|\mathcal{X}|)$ for $\ell_1$ and $\ell_2^2$ losses.

We are not aware of any prior work on private isotonic regression. A simple baseline algorithm for this problem would be to use the exponential mechanism over the set of all monotone functions taking values in a discretized set, to choose one with small loss. We show in Appendix A that this achieves an excess empirical risk of $O(L\cdot\sqrt{\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|/\varepsilon n})$, which is quadratically worse than the bound in Theorem 1. Moreover, even in the case of a totally ordered set, it is unclear how to implement such a mechanism efficiently.

We demonstrate the flexibility of our techniques by showing that they can be extended to variants of isotonic regression where, in addition to monotonicity, we also require $f$ to satisfy additional properties. For example, we may want $f$ to be $L_f$-Lipschitz for some specified $L_f>0$. Other constraints we can handle include $k$-piecewise constant, $k$-piecewise linear, convexity, and concavity. For each of these constraints, we devise an algorithm that yields essentially the same error as in the unconstrained case and still runs in polynomial time.

Theorem 4.

For all finite totally ordered sets $\mathcal{X}$, $L$-Lipschitz loss functions $\ell$, and $\varepsilon\in(0,1]$, there is an $\varepsilon$-DP algorithm for $k$-piecewise constant, $k$-piecewise linear, Lipschitz, convex, or concave isotonic regression for $\ell$ with expected excess empirical risk $\tilde{O}\left(\frac{L\cdot(\log|\mathcal{X}|)}{\varepsilon n}\right)$. The running time of this algorithm is $(n|\mathcal{X}|)^{O(1)}$.

Organization.

We next provide necessary background on DP. In Section 3, we prove our results for totally ordered sets (including Theorem 3). We then move on to discuss general posets in Section 4. Section 5 contains additional related work. Finally, we conclude with some discussion in Section 6. Due to space constraints, we omit some proofs from the main body; these can be found in the Appendix.

2 Background on Differential Privacy

Two datasets $D=\{(x_1,y_1),\ldots,(x_n,y_n)\}$ and $D'=\{(x'_1,y'_1),\ldots,(x'_n,y'_n)\}$ are said to be neighboring, denoted $D\sim D'$, if there is an index $i\in[n]$ such that $(x_j,y_j)=(x'_j,y'_j)$ for all $j\in[n]\setminus\{i\}$. We recall the formal definition of differential privacy [18, 16]:

Definition 5 (Differential Privacy (DP) [18, 16]).

Let $\varepsilon>0$ and $\delta\in[0,1]$. A randomized algorithm $\mathcal{M}:\mathcal{X}^n\to\mathcal{Y}$ is $(\varepsilon,\delta)$-differentially private ($(\varepsilon,\delta)$-DP) if, for all $D\sim D'$ and all (measurable) outcomes $S\subseteq\mathcal{Y}$, we have that $\Pr[\mathcal{M}(D)\in S]\leq e^\varepsilon\cdot\Pr[\mathcal{M}(D')\in S]+\delta$.

We denote $(\varepsilon,0)$-DP as $\varepsilon$-DP (aka pure-DP). The case when $\delta>0$ is referred to as approximate-DP.

We will use the following composition theorems throughout our proofs.

Lemma 6.

$(\varepsilon,\delta)$-DP satisfies the following:

  • Basic Composition: If mechanisms $\mathcal{M}_1,\ldots,\mathcal{M}_t$ are such that $\mathcal{M}_i$ satisfies $(\varepsilon_i,\delta_i)$-DP, then the composed mechanism $(\mathcal{M}_1(D),\ldots,\mathcal{M}_t(D))$ satisfies $(\sum_i\varepsilon_i,\sum_i\delta_i)$-DP. This holds even under adaptive composition, where each $\mathcal{M}_i$ can depend on the outputs of $\mathcal{M}_1,\ldots,\mathcal{M}_{i-1}$.

  • Parallel Composition [33]: If a mechanism $\mathcal{M}$ satisfies $(\varepsilon,\delta)$-DP, then for any partition of $D=D_1\sqcup\cdots\sqcup D_t$, the composed mechanism given as $(\mathcal{M}(D_1),\ldots,\mathcal{M}(D_t))$ satisfies $(\varepsilon,\delta)$-DP.

Exponential Mechanism.

The exponential mechanism solves the basic task of selection in data analysis: given a dataset $D\in\mathcal{Z}^n$ and a set $\mathcal{A}$ of options, it outputs the (approximately) best option, where “best” is defined by a scoring function $\mathfrak{s}:\mathcal{A}\times\mathcal{Z}^n\to\mathbb{R}$. The $\varepsilon$-DP exponential mechanism [34] is the randomized mechanism $\mathcal{M}:\mathcal{Z}^n\to\mathcal{A}$ given by

$$\forall~D\in\mathcal{Z}^n,\ a\in\mathcal{A}:\quad \Pr[\mathcal{M}(D)=a]~\propto~\exp\left(-\frac{\varepsilon}{2\Delta_{\mathfrak{s}}}\cdot\mathfrak{s}(a,D)\right),$$

where $\Delta_{\mathfrak{s}}:=\sup_{D\sim D'}\max_{a\in\mathcal{A}}|\mathfrak{s}(a,D)-\mathfrak{s}(a,D')|$ is the sensitivity of the scoring function.

Lemma 7 ([34]).

For $\mathcal{M}$ being the $\varepsilon$-DP exponential mechanism, it holds for all $D\in\mathcal{Z}^n$ that

$$\mathbb{E}[\mathfrak{s}(\mathcal{M}(D),D)]~\leq~\min_{a\in\mathcal{A}}\,\mathfrak{s}(a,D)~+~\frac{2\Delta_{\mathfrak{s}}}{\varepsilon}\log|\mathcal{A}|.$$

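To make the selection step concrete, the following is a minimal Python sketch (not the paper's implementation) of sampling from the exponential mechanism over a finite option set; `scores`, `sensitivity`, and `eps` are assumed inputs, and the Gumbel-max trick is used only for numerical stability.

```python
import numpy as np

def exponential_mechanism(scores, sensitivity, eps, rng=None):
    """Sample an index a with Pr[a] proportional to exp(-eps * scores[a] / (2 * sensitivity)).

    Lower score is better, matching the convention in Lemma 7.
    The Gumbel-max trick avoids overflow when |scores| is large.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = -eps * np.asarray(scores, dtype=float) / (2.0 * sensitivity)
    gumbel = rng.gumbel(size=len(scores))
    return int(np.argmax(logits + gumbel))
```

In Algorithm 1 below, `scores` would be the vector $(\operatorname{score}_{i,t}(\alpha))_\alpha$ and `sensitivity` would be $L/2^t$.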
Lower Bound for Privatizing Vectors.

Lower bounds for DP algorithms that can output a binary vector that is close (say, in the Hamming distance) to the input are well-known.

Lemma 8 (e.g., [32]).

Let $\varepsilon,\delta>0$, $m\in\mathbb{N}$, let the input domain be $\{0,1\}^m$, and let two vectors $\mathbf{z},\mathbf{z}'\in\{0,1\}^m$ be neighbors if and only if $\|\mathbf{z}-\mathbf{z}'\|_0\leq 1$. Then, for any $(\varepsilon,\delta)$-DP algorithm $\mathcal{M}:\{0,1\}^m\to\{0,1\}^m$, we have $\mathbb{E}_{\mathbf{z}\sim\{0,1\}^m}[\|\mathcal{M}(\mathbf{z})-\mathbf{z}\|_0]\geq e^{-\varepsilon}\cdot m\cdot 0.5\cdot(1-\delta)$.

It is also simple to extend this lower bound to the case where the vector is not binary, as stated below. We defer the full proof to Appendix B.

Lemma 9.

Let $D,m$ be any positive integers such that $D\geq 2$, let the input domain be $[D]^m$, and let two vectors $\mathbf{z},\mathbf{z}'\in[D]^m$ be neighbors if and only if $\|\mathbf{z}-\mathbf{z}'\|_0\leq 1$. Then, for any $(\ln(D/2),0.25)$-DP algorithm $\mathcal{M}:[D]^m\to[D]^m$, we have that $\mathbb{E}_{\mathbf{z}\sim[D]^m}[\|\mathcal{M}(\mathbf{z})-\mathbf{z}\|_0]\geq\Omega(m)$.

Group Differential Privacy.

For any neighboring relation $\sim$, we write $\sim_k$ as the neighboring relation where $D\sim_k D'$ iff there is a sequence $D=D_0,\dots,D_{k'}=D'$ for some $k'\leq k$ such that $D_{i-1}\sim D_i$ for all $i\in[k']$.

Fact 10 (e.g., [41]).

Let $\varepsilon>0$, $\delta\geq 0$, and $k\in\mathbb{N}$. Suppose that $\mathcal{M}$ is an $(\varepsilon,\delta)$-DP algorithm for the neighboring relation $\sim$. Then $\mathcal{M}$ is $\left(k\varepsilon,\frac{e^{k\varepsilon}-1}{e^{\varepsilon}-1}\cdot\delta\right)$-DP for the neighboring relation $\sim_k$.

3 DP Isotonic Regression over Total Orders

We first focus on the “one-dimensional” case where $\mathcal{X}$ is totally ordered; for convenience, we assume that $\mathcal{X}=[m]$ where the order is the natural order on integers. We present an efficient algorithm for this case and then a nearly matching lower bound.

3.1 An Efficient Algorithm

To describe our algorithm, it will be more convenient to use the unnormalized version of the empirical risk, which we define as $\mathcal{L}^{\mathrm{abs}}(f;D):=\sum_{(x,y)\in D}\ell(f(x),y)$.

We now provide a high-level overview of our algorithm. Any monotone function $f:[m]\to[0,1]$ contains a (not necessarily unique) threshold $\alpha\in\{0\}\cup[m]$ such that $f(a)\geq 1/2$ for all $a>\alpha$ and $f(a)\leq 1/2$ for all $a\leq\alpha$. Our algorithm works by first choosing this threshold $\alpha$ in a private manner using the exponential mechanism. The choice of $\alpha$ partitions $[m]$ into $[m]^{>\alpha}:=\{a\in[m]\mid a>\alpha\}$ and $[m]^{\leq\alpha}:=\{a\in[m]\mid a\leq\alpha\}$. The algorithm recurses on these two parts to find functions $f_>:[m]^{>\alpha}\to[1/2,1]$ and $f_\leq:[m]^{\leq\alpha}\to[0,1/2]$, which are then glued to obtain the final function.

In particular, the algorithm proceeds in $T$ stages, where in stage $t$, the algorithm starts with a partition of $[m]$ into $2^t$ intervals $\{P_{i,t}\mid i\in\{0,\dots,2^t-1\}\}$, and the algorithm eventually outputs a monotone function $f$ such that $f(x)\in[i/2^t,(i+1)/2^t]$ for all $x\in P_{i,t}$. This partition is further refined for stage $t+1$ by choosing a threshold $\alpha_{i,t}$ in $P_{i,t}$ and partitioning $P_{i,t}$ into $P_{i,t}^{>\alpha_{i,t}}$ and $P_{i,t}^{\leq\alpha_{i,t}}$. In the final stage, the function $f$ is chosen to be the constant $i/2^{T-1}$ over $P_{i,T-1}$. Note that the algorithm may stop at $T=\Theta_\varepsilon(\log n)$ because the Lipschitzness of $\ell$ ensures that assigning each partition to the constant $i/2^{T-1}$ cannot increase the error by more than $L/2^T\leq O_\varepsilon(L/n)$.

We have already mentioned above that each $\alpha_{i,t}$ has to be chosen in a private manner. However, if we let the scoring function directly be the unnormalized empirical risk, then its sensitivity remains as large as $L$ even at a large stage $t$. This is undesirable because the error from each run of the exponential mechanism can be as large as $O(L\cdot\log m)$, and there are as many as $2^t$ runs in stage $t$. Adding these error terms up would result in a far larger total error than desired.

To circumvent this, we observe that while the sensitivity can still be large, it is mostly “ineffective” because the function range is now restricted to an interval of length $1/2^t$. Indeed, we may use the following “clipped” version of the loss function, which has a low sensitivity of $L/2^t$ instead.

Definition 11 (Clipped Loss Function).

For a range $[\tau,\theta]\subseteq[0,1]$, let $\ell_{[\tau,\theta]}:[\tau,\theta]\times[0,1]\to\mathbb{R}$ be given as $\ell_{[\tau,\theta]}(\hat{y},y):=\ell(\hat{y},y)-\min_{y'\in[\tau,\theta]}\ell(y',y)$. Similar to above, we also define $\mathcal{L}^{\mathrm{abs}}_{[\tau,\theta]}(f;D):=\sum_{(x,y)\in D}\ell_{[\tau,\theta]}(f(x),y)$.

Observe that $\operatorname{range}(\ell_{[\tau,\theta]})\subseteq[0,L\cdot(\theta-\tau)]$, since $\ell$ is $L$-Lipschitz. In other words, the sensitivity of $\mathcal{L}^{\mathrm{abs}}_{[\tau,\theta]}(f;D)$ is only $L\cdot(\theta-\tau)$. Algorithm 1 contains a full description.

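As a small illustration of Definition 11 (a sketch, not part of the paper's pseudocode), here is the clipped loss for the $\ell_2^2$-loss, for which the minimizer of $\ell(y',y)$ over $y'\in[\tau,\theta]$ is simply $y$ clipped to $[\tau,\theta]$.

```python
def clip(v, lo, hi):
    return min(hi, max(lo, v))

def clipped_sq_loss(y_hat, y, tau, theta):
    """ell_{[tau,theta]}(y_hat, y) for the squared loss ell(a, b) = (a - b)^2.

    Subtracting the best achievable loss within [tau, theta] keeps the value in
    [0, 2 * (theta - tau)], i.e., sensitivity L * (theta - tau) with L = 2 on [0, 1].
    """
    assert tau <= y_hat <= theta
    best = (clip(y, tau, theta) - y) ** 2      # min over y' in [tau, theta] of (y' - y)^2
    return (y_hat - y) ** 2 - best
```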
Algorithm 1 DP Isotonic Regression for Totally Ordered Sets.
  Input: $\mathcal{X}=[m]$, dataset $D=\{(x_1,y_1),\ldots,(x_n,y_n)\}$, DP parameter $\varepsilon$.
  Output: Monotone function $f:[m]\to[0,1]$.

  $T\leftarrow\lceil\log(\varepsilon n)\rceil$
  $\varepsilon'\leftarrow\varepsilon/T$
  $P_{0,0}\leftarrow[m]$
  for $t=0,\ldots,T-1$ do
     for $i=0,\ldots,2^t-1$ do
        $D_{i,t}\leftarrow\{(x_j,y_j)\mid j\in[n],\ x_j\in P_{i,t}\}$  {Set of all input points whose $x$ belongs to $P_{i,t}$}
        {Notation: Define $D_{i,t}^{\leq\alpha}:=\{(x,y)\in D_{i,t}\mid x\leq\alpha\}$ and $D_{i,t}^{>\alpha}$ similarly}
        {Notation: Define $P_{i,t}^{\leq\alpha}:=\{x\in P_{i,t}\mid x\leq\alpha\}$ and $P_{i,t}^{>\alpha}$ similarly}
        Choose threshold $\alpha_{i,t}\in\{0\}\cup P_{i,t}$, using the $\varepsilon'$-DP exponential mechanism with scoring function
           $\operatorname{score}_{i,t}(\alpha) := \min_{f_1\in\mathcal{F}(P_{i,t}^{\leq\alpha},[\frac{i}{2^t},\frac{i+0.5}{2^t}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t},\frac{i+1}{2^t}]}(f_1;D_{i,t}^{\leq\alpha}) + \min_{f_2\in\mathcal{F}(P_{i,t}^{>\alpha},[\frac{i+0.5}{2^t},\frac{i+1}{2^t}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t},\frac{i+1}{2^t}]}(f_2;D_{i,t}^{>\alpha})$
        {Note: $\operatorname{score}_{i,t}(\alpha)$ has sensitivity at most $L/2^t$.}
        $P_{2i,t+1}\leftarrow P_{i,t}^{\leq\alpha_{i,t}}$ and $P_{2i+1,t+1}\leftarrow P_{i,t}^{>\alpha_{i,t}}$
  Let $f:[m]\to[0,1]$ be given as $f(x)=i/2^{T-1}$ for all $x\in P_{i,T-1}$ and all $i\in\{0,\ldots,2^{T-1}-1\}$.
  return $f$
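The following is a minimal Python sketch of the recursion in Algorithm 1, assuming the squared loss (so $L=2$) and taking the restricted-optimum subroutine `restricted_loss` as a hypothetical parameter. It illustrates the control flow only; for simplicity the clipping window here coincides with the allowed value range, which differs slightly from the score defined above, so this is not a faithful reproduction of the authors' implementation or its guarantees.

```python
import numpy as np

def dp_isotonic_total_order(points, m, eps, restricted_loss, rng=None):
    """Sketch of Algorithm 1's recursion (illustration only, not the authors' code).

    points          : list of (x, y) pairs with x in {1, ..., m}, y in [0, 1].
    restricted_loss : callable(points, lo, hi) returning the optimal clipped loss over
                      monotone functions with values in [lo, hi]; must return 0.0 on [].
    Returns a list f of length m with f[x-1] in [0, 1], monotone in x.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = max(len(points), 1)
    T = max(1, int(np.ceil(np.log2(eps * n)))) if eps * n > 1 else 1
    eps_prime = eps / T
    L = 2.0  # Lipschitz constant of the squared loss on [0, 1] (assumption)

    parts = [list(range(1, m + 1))]          # P_{0,0} = [m]
    data = [list(points)]                    # D_{0,0}
    for t in range(T - 1):
        new_parts, new_data = [], []
        for i, (P, Dp) in enumerate(zip(parts, data)):
            lo, mid, hi = i / 2**t, (i + 0.5) / 2**t, (i + 1) / 2**t
            candidates = [0] + P             # threshold alpha in {0} union P_{i,t}
            scores = []
            for a in candidates:
                left = [(x, y) for (x, y) in Dp if x <= a]
                right = [(x, y) for (x, y) in Dp if x > a]
                scores.append(restricted_loss(left, lo, mid) + restricted_loss(right, mid, hi))
            # epsilon'-DP exponential mechanism; the clipped score has sensitivity ~ L / 2^t
            logits = -eps_prime * np.array(scores) / (2.0 * (L / 2**t))
            alpha = candidates[int(np.argmax(logits + rng.gumbel(size=len(logits))))]
            new_parts += [[x for x in P if x <= alpha], [x for x in P if x > alpha]]
            new_data += [[(x, y) for (x, y) in Dp if x <= alpha],
                         [(x, y) for (x, y) in Dp if x > alpha]]
        parts, data = new_parts, new_data

    f = [0.0] * m
    for i, P in enumerate(parts):            # constant i / 2^{T-1} on P_{i,T-1}
        for x in P:
            f[x - 1] = i / 2**(T - 1)
    return f
```

For the $\ell_2^2$-loss, `restricted_loss` can be instantiated with the prefix isotonic regression routines sketched later in this section.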
Proof of Theorem 3.

Before proceeding to prove the algorithm's privacy and utility guarantees, we note that the output $f$ is indeed monotone, since for every $x'<x$ that gets separated when we partition $P_{i,t}$ into $P_{2i,t+1},P_{2i+1,t+1}$, we must have $x'\in P_{2i,t+1}$ and $x\in P_{2i+1,t+1}$.

Privacy Analysis.

Since the exponential mechanism is $\varepsilon'$-DP and the dataset is partitioned with the exponential mechanism being applied to each part only once, the parallel composition property (Lemma 6) implies that the entire subroutine for each $t$ is $\varepsilon'$-DP. Thus, by basic composition (Lemma 6), it follows that Algorithm 1 is $\varepsilon$-DP (since $\varepsilon=\varepsilon' T$).

Utility Analysis.

Since the sensitivity of $\operatorname{score}_{i,t}(\cdot)$ is at most $L/2^t$, we have from Lemma 7 that for all $t\in\{0,\dots,T-1\}$ and $i\in\{0,1,\ldots,2^t-1\}$,

$$\mathbb{E}\left[\operatorname{score}_{i,t}(\alpha_{i,t})-\min_{\alpha\in P_{i,t}}\operatorname{score}_{i,t}(\alpha)\right]~\leq~O\left(\frac{L\cdot\log|P_{i,t}|}{\varepsilon'\cdot 2^t}\right)~\leq~O\left(\frac{L\cdot\log m}{\varepsilon'\cdot 2^t}\right). \qquad (1)$$

Let $h_{i,t}$ denote $\operatorname{argmin}_{h\in\mathcal{F}(P_{i,t},[i/2^t,(i+1)/2^t])}\mathcal{L}^{\mathrm{abs}}(h;D_{i,t})$ (with ties broken arbitrarily). Then, let $\tilde{\alpha}_{i,t}$ denote the largest element in $P_{i,t}$ such that $h_{i,t}(\tilde{\alpha}_{i,t})\leq(i+1/2)/2^t$; namely, $\tilde{\alpha}_{i,t}$ is the optimal threshold when restricted to $D_{i,t}$. Under this notation, we have that

$$\begin{aligned}
&\operatorname{score}_{i,t}(\alpha_{i,t})-\min_{\alpha\in P_{i,t}}\operatorname{score}_{i,t}(\alpha)\\
&\quad\geq~\operatorname{score}_{i,t}(\alpha_{i,t})-\operatorname{score}_{i,t}(\tilde{\alpha}_{i,t})\\
&\quad=~\left(\mathcal{L}^{\mathrm{abs}}_{[i/2^t,(i+1)/2^t]}(h_{2i,t+1};D_{2i,t+1})+\mathcal{L}^{\mathrm{abs}}_{[i/2^t,(i+1)/2^t]}(h_{2i+1,t+1};D_{2i+1,t+1})\right)-\mathcal{L}^{\mathrm{abs}}_{[i/2^t,(i+1)/2^t]}(h_{i,t};D_{i,t})\\
&\quad=~\mathcal{L}^{\mathrm{abs}}(h_{2i,t+1};D_{2i,t+1})+\mathcal{L}^{\mathrm{abs}}(h_{2i+1,t+1};D_{2i+1,t+1})-\mathcal{L}^{\mathrm{abs}}(h_{i,t};D_{i,t}). \qquad (2)
\end{aligned}$$

Finally, notice that

$$\mathcal{L}^{\mathrm{abs}}(f;D_{i,T-1})-\mathcal{L}^{\mathrm{abs}}(h_{i,T-1};D_{i,T-1})~\leq~\frac{L}{2^{T-1}}\cdot|D_{i,T-1}|~=~O\left(\frac{L\cdot|D_{i,T-1}|}{\varepsilon n}\right). \qquad (3)$$

With all the ingredients ready, we may now bound the expected (unnormalized) excess risk:

$$\begin{aligned}
\mathcal{L}^{\mathrm{abs}}(f;D)&~=~\sum_{0\leq i<2^{T-1}}\mathcal{L}^{\mathrm{abs}}(f;D_{i,T-1})\\
&~\overset{(3)}{\leq}~\sum_{0\leq i<2^{T-1}}\left(O\left(\frac{L\cdot|D_{i,T-1}|}{\varepsilon n}\right)+\mathcal{L}^{\mathrm{abs}}(h_{i,T-1};D_{i,T-1})\right)\\
&~=~O(L/\varepsilon)~+~\sum_{0\leq i<2^{T-1}}\mathcal{L}^{\mathrm{abs}}(h_{i,T-1};D_{i,T-1})\\
&~=~O(L/\varepsilon)~+~\mathcal{L}^{\mathrm{abs}}(h_{0,0};D_{0,0})\\
&\qquad+~\sum_{\substack{t\in[T-1]\\ 0\leq i<2^{t-1}}}\left(\mathcal{L}^{\mathrm{abs}}(h_{2i,t};D_{2i,t})+\mathcal{L}^{\mathrm{abs}}(h_{2i+1,t};D_{2i+1,t})-\mathcal{L}^{\mathrm{abs}}(h_{i,t-1};D_{i,t-1})\right).
\end{aligned}$$

Taking the expectation on both sides and using (1) and (2) yields

$$\begin{aligned}
\mathbb{E}[\mathcal{L}^{\mathrm{abs}}(f;D)]&~\leq~O(L/\varepsilon)~+~\mathcal{L}^{\mathrm{abs}}(h_{0,0};D_{0,0})~+~\sum_{\substack{t\in[T-1]\\ 0\leq i<2^{t-1}}}O\left(\frac{L\cdot\log m}{\varepsilon'\cdot 2^t}\right)\\
&~=~O(L/\varepsilon)~+~\mathcal{L}^{\mathrm{abs}}(f^*;D)~+~O\left(T^2\cdot\frac{L\cdot\log m}{\varepsilon}\right)\\
&~=~\mathcal{L}^{\mathrm{abs}}(f^*;D)~+~O\left(\frac{L\cdot\log m\cdot(1+\log^2(\varepsilon n))}{\varepsilon}\right).
\end{aligned}$$

Dividing both sides by $n$ yields the desired claim.

Running Time.

To obtain a bound on the running time for general loss functions, we need to make a slight modification to the algorithm: we additionally restrict the range of $f_1,f_2$ to multiples of $1/2^{T-1}$. We remark that this does not affect the utility, since we anyway always take a final output whose values are multiples of $1/2^{T-1}$.

Given any dataset $D=\{(x_1,y_1),\dots,(x_n,y_n)\}$ where $x_1<\cdots<x_n$, the prefix isotonic regression problem is to compute, for each $i\in[n]$, the optimal loss of isotonic regression on $(x_1,y_1),\dots,(x_i,y_i)$. Straightforward dynamic programming solves this in $O(n\cdot v)$ time, where $v$ denotes the number of possible values allowed in the function.

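For concreteness, here is a minimal sketch of that dynamic program for the squared loss (an assumption; the paper allows any Lipschitz loss), returning the optimal prefix losses over monotone functions whose values lie in a given sorted grid.

```python
def prefix_isotonic_grid(ys, values):
    """Optimal isotonic-regression loss on each prefix of ys, with fitted values
    restricted to the ascending grid `values` (squared loss).

    Runs in O(n * v) time: dp[k] = best loss so far if the last fitted value is
    values[k]; a running prefix-minimum enforces monotonicity.
    """
    v = len(values)
    dp = [0.0] * v
    prefix_losses = []
    for y in ys:
        best_so_far = float("inf")
        for k in range(v):
            best_so_far = min(best_so_far, dp[k])          # min over values[0..k] (old dp)
            dp[k] = best_so_far + (values[k] - y) ** 2     # assign values[k] to this point
        prefix_losses.append(min(dp))
    return prefix_losses
```

In the modified Algorithm 1, `values` would be the multiples of $1/2^{T-1}$ inside the current value range, which is where the $O(|D_{i,t}|\cdot 2^{T-t})$ term in the bound that follows comes from.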
Now, for each $i,t$, we may run the above algorithm with $D=D_{i,t}$ and the allowed values being all multiples of $1/2^{T-1}$ in $[\frac{i}{2^t},\frac{i+0.5}{2^t}]$; this gives us $\min_{f_1\in\mathcal{F}(P_{i,t}^{\leq\alpha},[\frac{i}{2^t},\frac{i+0.5}{2^t}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t},\frac{i+1}{2^t}]}(f_1;D_{i,t}^{\leq\alpha})$ for all $\alpha\in P_{i,t}$ in time $O(|D_{i,t}|\cdot 2^{T-t}+|P_{i,t}|)$. Analogously, we can also compute $\min_{f_2\in\mathcal{F}(P_{i,t}^{>\alpha},[\frac{i+0.5}{2^t},\frac{i+1}{2^t}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t},\frac{i+1}{2^t}]}(f_2;D_{i,t}^{>\alpha})$ for all $\alpha\in P_{i,t}$ in a similar time. Thus, we can compute $(\operatorname{score}_{i,t}(\alpha))_{\alpha\in P_{i,t}}$ in time $O(|D_{i,t}|\cdot 2^{T-t}+|P_{i,t}|)$, and then sample accordingly.

We can further speed up the algorithm by observing that the score remains constant for all $\alpha\in[x_i,x_{i+1})$. Hence, we may first sample an interval among $[0,x_1),[x_1,x_2),\ldots,[x_{n-1},x_n),[x_n,m)$ and then sample $\alpha_{i,t}$ uniformly from that interval. This entire process can be done in $O(|D_{i,t}|\cdot 2^{T-t}+\log m)$ time. In total, the running time of the algorithm is thus

$$\sum_{t=0}^{T-1}\sum_{i=0}^{2^t-1}O(|D_{i,t}|\cdot 2^{T-t}+\log m)~\leq~\sum_{t=0}^{T-1}O(n 2^{T}+2^t\log m)~\leq~O(n^2\log n+n\log m).$$

Near-Linear Time Algorithms for $\ell_1$- and $\ell_2^2$-Losses.

We now describe faster algorithms for the $\ell_1$- and $\ell_2^2$-loss functions, thereby proving the last part of Theorem 3. The key observation is that for convex loss functions, the restricted optimum is simple: we just have to “clip” the optimal function to be in the range $[\tau,\theta]$. Below, $\operatorname{clip}_{[\tau,\theta]}$ denotes the function $y\mapsto\min\{\theta,\max\{\tau,y\}\}$.

Observation 12.

Let $\ell$ be any convex loss function, $D$ any dataset, $f^*\in\operatorname{argmin}_{f\in\mathcal{F}(\mathcal{X},\mathcal{Y})}\mathcal{L}(f;D)$, and $\tau\leq\theta$ any real numbers such that $\tau,\theta\in\mathcal{Y}$. Define $f^*_{\mathrm{clipped}}(x):=\operatorname{clip}_{[\tau,\theta]}(f^*(x))$. Then, we must have $f^*_{\mathrm{clipped}}\in\operatorname{argmin}_{f\in\mathcal{F}(\mathcal{X},\mathcal{Y}\cap[\tau,\theta])}\mathcal{L}(f;D)$.

Proof.

Consider any $f\in\mathcal{F}(\mathcal{X},\mathcal{Y}\cap[\tau,\theta])$. Let $\mathcal{X}^>$ (resp. $\mathcal{X}^<$) denote the set of all $x\in\mathcal{X}$ such that $f^*(x)>\theta$ (resp. $f^*(x)<\tau$). Consider the following operations:

  • For each $x\in\mathcal{X}^>$, change $f(x)$ to $\theta$.

  • For each $x\in\mathcal{X}^<$, change $f(x)$ to $\tau$.

  • Let $f(x)=f^*(x)$ for all $x\in\mathcal{X}\setminus(\mathcal{X}^>\cup\mathcal{X}^<)$.

At the end, we end up with $f(x)=f^*_{\mathrm{clipped}}(x)$. Each of the first two changes does not increase the loss $\mathcal{L}(f;D)$; otherwise, due to convexity, changing $f^*(x)$ to $f(x)$ would have decreased the objective function. Finally, the last operation does not decrease the loss; otherwise, we may replace this section of $f^*$ with the values in $f$ instead. Thus, we can conclude that $f^*_{\mathrm{clipped}}\in\operatorname{argmin}_{g\in\mathcal{F}(\mathcal{X},\mathcal{Y}\cap[\tau,\theta])}\mathcal{L}(g;D)$. ∎

We will now show how to compute the scores in Algorithm 1 simultaneously for all $\alpha$ (for fixed $i,t$) in nearly linear time. To do this, recall the prefix isotonic regression problem from earlier. For this problem, Stout [42] gave an $O(n)$-time algorithm for the $\ell_2^2$-loss and an $O(n\log n)$-time algorithm for the $\ell_1$-loss (both for the unrestricted-value case). Furthermore, after the $i$th iteration, the algorithm also keeps a succinct representation $S^{\operatorname{opt}}_i$ of the optimal solution in the form of an array $(i_1,v_1,\ell_1),\dots,(i_k,v_k,\ell_k)$, which denotes that $f(x)=v_j$ for all $x\in[x_{i_j},x_{i_{j+1}})$, and $\ell_j$ indicates the loss $\mathcal{L}^{\mathrm{abs}}$ up to, but not including, $x_{i_{j+1}}$.

We can extend the above algorithm to the prefix clipped isotonic regression problem, which we define in the same manner as above except that we restrict the function range to be $[\tau,\theta]$ for some given $\tau<\theta$. Using Observation 12, it is not hard to extend the above algorithm to work in this case.

Lemma 13.

There is an $O(n\log n)$-time algorithm for $\ell_2^2$- and $\ell_1$-prefix clipped isotonic regression.

Proof.

We first precompute $c_\tau(i)=\sum_{j\leq i}\ell(\tau,x_j)$ and $c_\theta(i)=\sum_{j\leq i}\ell(\theta,x_j)$ for all $i\in[n]$. We then run the aforementioned algorithm from [42]. At each iteration $i$, we use binary search to find the largest index $j_\tau$ such that $v_{j_\tau}<\tau$ and the largest index $j_\theta$ such that $v_{j_\theta}<\theta$. Observation 12 implies that the optimal solution of the clipped version is simply the same as that of the unrestricted version, except that we need to change the function values before $x_{j_\tau}$ to $\tau$ and after $x_{j_\theta}$ to $\theta$. The loss of this clipped optimum can be written as $\ell_{j_\theta}-\ell_{j_\tau}+c_\tau(j_\tau)+(c_\theta(i)-c_\theta(j_\theta))$, which can be computed in $O(1)$ time given that we have already precomputed $c_\tau,c_\theta$. The running time of the entire algorithm is thus the same as that of [42] together with the binary search time; the latter totals $O(n\log n)$. ∎

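Below is a self-contained sketch of prefix clipped isotonic regression for the $\ell_2^2$-loss, built on the standard pool-adjacent-violators (PAV) representation of the optimal prefix fits; for clarity it recomputes the clipped loss from the block statistics at every prefix, so it is simpler (and slower in the worst case) than the $O(n\log n)$ binary-search approach in the proof above.

```python
def prefix_clipped_isotonic_sq(ys, tau, theta):
    """For each prefix of ys, the optimal clipped squared loss (Definition 11) over
    monotone fits with values restricted to [tau, theta].

    Maintains PAV blocks (count, sum, sum of squares) of the unrestricted prefix fit;
    by Observation 12, the restricted optimum is obtained by clipping the block means.
    """
    blocks = []        # each block: [count, sum, sum_sq]; block means are non-decreasing
    out, baseline = [], 0.0
    for y in ys:
        baseline += (min(theta, max(tau, y)) - y) ** 2   # min_{y' in [tau,theta]} (y' - y)^2
        blocks.append([1.0, y, y * y])
        # merge while the last block's mean drops below the previous block's mean
        while len(blocks) > 1 and blocks[-1][1] / blocks[-1][0] < blocks[-2][1] / blocks[-2][0]:
            c, s, q = blocks.pop()
            blocks[-1][0] += c
            blocks[-1][1] += s
            blocks[-1][2] += q
        total = 0.0
        for c, s, q in blocks:
            v = min(theta, max(tau, s / c))              # clipped block value
            total += q - 2.0 * v * s + c * v * v         # sum over the block of (y - v)^2
        out.append(total - baseline)
    return out
```

An analogous routine with block medians (maintained with heaps) can be adapted for the $\ell_1$-loss.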
Our fast algorithm for computing $(\operatorname{score}_{i,t}(\alpha))_{\alpha\in P_{i,t}}$ first runs the above algorithm with $\tau=\frac{i}{2^t},\theta=\frac{i+0.5}{2^t}$ and $D=D_{i,t}$; this gives us $\min_{f_1\in\mathcal{F}(P_{i,t}^{\leq\alpha},[\frac{i}{2^t},\frac{i+0.5}{2^t}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t},\frac{i+1}{2^t}]}(f_1;D_{i,t}^{\leq\alpha})$ for all $\alpha\in P_{i,t}$ in time $O(|D_{i,t}|\log|D_{i,t}|+|P_{i,t}|)$. Analogously, we can also compute $\min_{f_2\in\mathcal{F}(P_{i,t}^{>\alpha},[\frac{i+0.5}{2^t},\frac{i+1}{2^t}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t},\frac{i+1}{2^t}]}(f_2;D_{i,t}^{>\alpha})$ for all $\alpha\in P_{i,t}$ in a similar time. Thus, we can compute $(\operatorname{score}_{i,t}(\alpha))_{\alpha\in P_{i,t}}$ in time $O(|D_{i,t}|\log|D_{i,t}|+|P_{i,t}|)$, and sample accordingly. Using the same observation as in the general loss function case, this can be sped up further to $O(|D_{i,t}|\log|D_{i,t}|+\log m)$ time. In total, the running time of the algorithm is thus

$$\sum_{t=0}^{T-1}\sum_{i=0}^{2^t-1}O(|D_{i,t}|\log|D_{i,t}|+\log m)~\leq~\sum_{t=0}^{T-1}O(n\log n+2^t\log m)~\leq~O(n(\log^2 n+\log m)).\qquad\qed$$

3.2 A Nearly Matching Lower Bound

We show that the excess empirical risk guarantee in Theorem 3 is tight, even for approximate-DP algorithms with a sufficiently small $\delta$, under a mild assumption about the loss function stated below.

Definition 14 (Distance-Based Loss Function).

For $R\geq 0$, a loss function $\ell$ is said to be $R$-distance-based if there exists $g:[0,1]\to\mathbb{R}_+$ such that $\ell(y,y')=g(|y-y'|)$, where $g$ is a non-decreasing function with $g(0)=0$ and $g(1/2)\geq R$.

We remark that standard loss functions, including the $\ell_1$- or $\ell_2^2$-loss, are all $\Omega(1)$-distance-based.

Our lower bound is stated below. It is proved via a packing argument [25] in a similar manner as a lower bound for properly PAC learning threshold functions [10]. This is not a coincidence: indeed, when we restrict the range of our function to $\{0,1\}$, the problem becomes exactly (the empirical version of) properly learning threshold functions. As a result, the same technique can be used to prove a lower bound in our setting as well.

Theorem 15.

For all $0\leq\delta<0.1\cdot(e^\varepsilon-1)/m$, any $(\varepsilon,\delta)$-DP algorithm for isotonic regression over $[m]$ for any $R$-distance-based loss function $\ell$ must have expected excess empirical risk $\Omega\left(R\cdot\min\left\{1,\frac{\log m}{\varepsilon n}\right\}\right)$.

Proof.

Suppose for the sake of contradiction that there exists an $(\varepsilon,\delta)$-DP algorithm $\mathcal{M}$ for isotonic regression with an $R$-distance-based loss function $\ell$ with expected excess empirical risk $0.01\cdot\left(R\cdot\min\left\{1,\frac{\log(0.1m)}{\varepsilon n}\right\}\right)$. Let $k:=\left\lfloor 0.1\log(0.1m)/\varepsilon\right\rfloor$.

We may assume that $n\geq 2k$, as the $\Omega(R)$ lower bound for the case $n=2k$ can easily be adapted to an $\Omega(R)$ lower bound for the case $n<2k$ as well.

We will use the standard packing argument [25]. For each $j\in[m-1]$, we create a dataset $D_j$ that contains $k$ copies of $(j,0)$, $k$ copies of $(j+1,1)$, and $n-2k$ copies of $(1,0)$. Finally, let $D_m$ denote the dataset that contains $k$ copies of $(m,0)$ and $n-k$ copies of $(1,0)$. Let $V_j$ denote the set of all $f\in\mathcal{F}([m],[0,1])$ such that $\mathcal{L}(f;D_j)<Rk/n$. The utility guarantee of $\mathcal{M}$ implies that

$$\Pr[\mathcal{M}(D_j)\in V_j]\geq 0.5.$$

Furthermore, it is not hard to see that $V_1,\dots,V_m$ are disjoint. In particular, for any function $f\in\mathcal{F}([m],[0,1])$, let $x_f$ be the largest element $x\in[m]$ for which $f(x)\leq 1/2$; if no such $x$ exists (i.e., $f(1)>1/2$), let $x_f=0$. For any $j<x_f$, we have $\mathcal{L}(f;D_j)\geq\frac{k}{n}\ell(f(j+1),1)\geq\frac{k}{n}\cdot g(1/2)\geq Rk/n$. Similarly, for any $j>x_f$, we have $\mathcal{L}(f;D_j)\geq\frac{k}{n}\ell(f(j),0)\geq\frac{k}{n}\cdot g(1/2)\geq Rk/n$. This implies that $f$ can belong to at most one $V_j$, as claimed.

Therefore, we have that

$$\begin{aligned}
1&~\geq~\sum_{j\in[m]}\Pr[\mathcal{M}(D_m)\in V_j]\\
&~\geq~\sum_{j\in[m]}\frac{1}{e^{2k\varepsilon}}\left(\Pr[\mathcal{M}(D_j)\in V_j]-\delta\cdot\frac{e^{2k\varepsilon}-1}{e^\varepsilon-1}\right)\qquad\text{(Fact 10)}\\
&~\geq~\sum_{j\in[m]}\frac{10}{m}(0.5-0.1)\\
&~>~1,
\end{aligned}$$

a contradiction. ∎

3.3 Extensions

We now discuss several variants of the isotonic regression problem that place certain additional constraints on the function $f$ that we seek, as listed below.

  • $k$-Piecewise Constant: $f$ must be a step function that consists of at most $k$ pieces.

  • $k$-Piecewise Linear: $f$ must be a piecewise linear function with at most $k$ pieces.

  • Lipschitz Regression: $f$ must be $L_f$-Lipschitz for some specified $L_f>0$.

  • Convex/Concave: $f$ must be convex/concave.

We devise a general meta-algorithm that, with a small tweak in each case, works for all of these constraints and yields Theorem 4. At a high level, our algorithm is similar to Algorithm 1, except that, in addition to using the exponential mechanism to pick the threshold $\alpha_{i,t}$, we also pick certain auxiliary information that is then passed on to the next stage. For example, in the $k$-piecewise constant setting, the algorithm also picks the number of pieces to the left of $\alpha_{i,t}$ and the number to its right; these are then passed on to the next stage. The algorithm stops when the number of pieces becomes one, at which point it simply uses the exponential mechanism to find the constant value on this subdomain.

The full description of the algorithm and the corresponding proof are deferred to Appendix C.

4 DP Isotonic Regression over General Posets

We now provide an algorithm and lower bounds for the case of general discrete posets. We first recall basic quantities about posets. An anti-chain of a poset $(\mathcal{X},\leq)$ is a set of elements such that no two distinct elements are comparable, whereas a chain is a set of elements such that every pair of elements is comparable. The width of a poset $(\mathcal{X},\leq)$, denoted by $\operatorname{width}(\mathcal{X})$, is defined as the maximum size among all anti-chains in the poset. The height of $(\mathcal{X},\leq)$, denoted by $\operatorname{height}(\mathcal{X})$, is defined as the maximum size among all chains in the poset. Dilworth's theorem and Mirsky's theorem give the following relation between chains and anti-chains:

Lemma 16 (Dilworth’s and Mirsky’s theorems [12, 36]).

A poset with width $w$ can be partitioned into $w$ chains. A poset with height $h$ can be partitioned into $h$ anti-chains.

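To make these quantities concrete, the following brute-force sketch (exponential time, for tiny non-empty posets only; the order relation `leq` is a hypothetical input) computes the width via maximum anti-chains and a Mirsky-style partition into anti-chains by longest-chain length.

```python
from itertools import combinations

def width_and_mirsky_levels(elems, leq):
    """Brute-force width and a Mirsky partition into anti-chains of a finite poset.

    elems : list of elements.
    leq   : callable leq(a, b) -> bool implementing the partial order.
    """
    def is_antichain(subset):
        return all(not (leq(a, b) or leq(b, a)) for a, b in combinations(subset, 2))

    width = max(r for r in range(1, len(elems) + 1)
                if any(is_antichain(s) for s in combinations(elems, r)))

    # level[x] = length of the longest chain ending at x; equal levels form anti-chains
    level = {}
    for x in sorted(elems, key=lambda e: sum(leq(y, e) for y in elems)):
        below = [level[y] for y in elems if y in level and leq(y, x) and y != x]
        level[x] = 1 + max(below, default=0)
    height = max(level.values())
    antichains = [[x for x in elems if level[x] == h] for h in range(1, height + 1)]
    return width, antichains
```

For instance, on the grid poset $[w]\times[h]$ from Section 4.3 one can pass `leq = lambda a, b: a[0] <= b[0] and a[1] <= b[1]` and verify on small instances that the width equals $\min\{w,h\}$.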
4.1 An Algorithm

Our algorithm for general posets is similar to that for totally ordered sets presented in the previous section. The only difference is that, instead of attempting to pick a single maximal point $\alpha$ such that $f(\alpha)\leq\tau$ as in the previous case, there could now be many such maximal $\alpha$'s. Indeed, we need to use the exponential mechanism to pick all such $\alpha$'s. Since these are all maximal, they must be incomparable; therefore, they form an anti-chain. Since there can be as many as $|\mathcal{X}|^{\operatorname{width}(\mathcal{X})}$ anti-chains in total, this means that the error from the exponential mechanism is $O\left(\log|\mathcal{X}|^{\operatorname{width}(\mathcal{X})}/\varepsilon'\right)=O(\operatorname{width}(\mathcal{X})\log|\mathcal{X}|/\varepsilon')$, leading to the multiplicative increase of $\operatorname{width}(\mathcal{X})$ in the total error. This completes our proof sketch for Theorem 1.

4.2 Lower Bounds

To prove a lower bound of $\Omega(\operatorname{width}(\mathcal{X})/\varepsilon n)$, we observe that the values of the function on any anti-chain can be arbitrary. Therefore, we may use the elements of a maximum anti-chain of $\mathcal{X}$ to encode a binary vector. The lower bound from Lemma 8 then gives us an $\Omega(\operatorname{width}(\mathcal{X})/n)$ lower bound for $\varepsilon=1$, as formalized below.

Lemma 17.

For any $\delta>0$, any $(1,\delta)$-DP algorithm for isotonic regression for any $R$-distance-based loss function $\ell$ must have expected excess empirical risk $\Omega\left(R(1-\delta)\cdot\min\left\{1,\frac{\operatorname{width}(\mathcal{X})}{n}\right\}\right)$.

Proof.

Consider any $(1,\delta)$-DP isotonic regression algorithm $\mathcal{M}'$ for loss $\ell$. Let $A$ be any maximum anti-chain (of size $\operatorname{width}(\mathcal{X})$) in $\mathcal{X}$. We use this algorithm to build a $(1,\delta)$-DP algorithm $\mathcal{M}$ for privatizing a binary vector of $m=\min\{n,|A|-1\}$ dimensions as follows:

  • Let $x_0,x_1,\dots,x_m$ be distinct elements of $A$.

  • On input $\mathbf{z}\in\{0,1\}^m$, create a dataset $D=\{(x_1,z_1),\dots,(x_m,z_m),(x_0,0),\dots,(x_0,0)\}$ where $(x_0,0)$ is repeated $n-m$ times.

  • Run $\mathcal{M}'$ on the instance $D$ to get $f$, and output a vector $\mathbf{z}'$ where $z'_i=\mathbf{1}[f(x_i)\geq 1/2]$.

It is obvious that this algorithm is $(1,\delta)$-DP. Observe also that $\mathcal{L}(f^*;D)=0$, and thus $\mathcal{M}'$'s expected excess empirical risk is $\mathbb{E}_{f\sim\mathcal{M}'(D)}[\mathcal{L}(f;D)]\geq R\cdot\mathbb{E}_{\mathbf{z}'\sim\mathcal{M}(\mathbf{z})}[\|\mathbf{z}-\mathbf{z}'\|_0]/n$, which, from Lemma 8, must be at least $\Omega(Rm(1-\delta)/n)=\Omega\left(R(1-\delta)\cdot\min\left\{1,\frac{\operatorname{width}(\mathcal{X})}{n}\right\}\right)$. ∎
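A minimal sketch of this reduction, assuming a maximum anti-chain `A` is given; the `private_isotonic` argument stands in for the hypothetical DP isotonic regression algorithm $\mathcal{M}'$.

```python
def privatize_binary_vector(z, A, n, private_isotonic):
    """Turn a DP isotonic regression algorithm into a DP algorithm for a binary vector z.

    A                : list of at least len(z) + 1 pairwise-incomparable domain elements.
    private_isotonic : callable(dataset) -> f, where f maps domain elements to [0, 1].
    """
    m = len(z)
    x0, xs = A[0], A[1:m + 1]
    dataset = [(xs[i], z[i]) for i in range(m)] + [(x0, 0)] * (n - m)
    f = private_isotonic(dataset)   # flipping one bit of z changes exactly one record
    return [1 if f(xs[i]) >= 0.5 else 0 for i in range(m)]
```

Since flipping one bit of $\mathbf{z}$ changes exactly one record of the constructed dataset, the reduction preserves $(1,\delta)$-DP, and each decoding error contributes at least $R/n$ to the empirical risk by the $R$-distance-based property.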

By using group privacy (Fact 10) and repeating each element $\Theta(1/\varepsilon)$ times, we arrive at a lower bound of $\Omega\left(R\cdot\min\left\{1,\frac{\operatorname{width}(\mathcal{X})}{\varepsilon n}\right\}\right)$. Furthermore, since $\mathcal{X}$ contains $\operatorname{height}(\mathcal{X})$ elements that form a totally ordered set, Theorem 15 gives a lower bound of $\Omega(R\cdot\log(\operatorname{height}(\mathcal{X}))/\varepsilon n)$ as long as $\delta<0.01\cdot\varepsilon/\operatorname{height}(\mathcal{X})$. Finally, due to Lemma 16, we have $\operatorname{height}(\mathcal{X})\geq|\mathcal{X}|/\operatorname{width}(\mathcal{X})$, which means that $\max\{\operatorname{width}(\mathcal{X}),\log(\operatorname{height}(\mathcal{X}))\}\geq\Omega(\log|\mathcal{X}|)$. Thus, we arrive at:

Theorem 18.

For any $\varepsilon\in(0,1]$ and any $\delta<0.01\cdot\varepsilon/|\mathcal{X}|$, any $(\varepsilon,\delta)$-DP algorithm for isotonic regression for an $R$-distance-based loss function $\ell$ must have expected excess empirical risk $\Omega\left(R\cdot\min\left\{1,\frac{\operatorname{width}(\mathcal{X})+\log|\mathcal{X}|}{\varepsilon n}\right\}\right)$.

4.3 Tight Examples for Upper and Lower Bounds

Recall that our upper bound is $\tilde{O}\left(\frac{\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|}{\varepsilon n}\right)$ while our lower bound is $\Omega\left(\frac{\operatorname{width}(\mathcal{X})+\log|\mathcal{X}|}{\varepsilon n}\right)$. One might wonder whether this gap can be closed. Below we show that, unfortunately, this is impossible in general: there are posets for which each bound is tight.

Tight Lower Bound Example. Let $\mathcal{X}_{\mathrm{disj}(w,h)}$ denote the poset that consists of $w$ disjoint chains $C_1,\dots,C_w$, where $|C_1|=h$ and $|C_2|=\cdots=|C_w|=1$. (Every pair of elements on different chains is incomparable.) In this case, we can solve the isotonic regression problem directly on each chain and piece the solutions together into the final output $f$. Note that $|\mathcal{X}_{\mathrm{disj}(w,h)}|=w+h-1$, $\operatorname{width}(\mathcal{X})=w$, and $\operatorname{height}(\mathcal{X})=h$. According to Theorem 1, the unnormalized excess empirical risk in $C_i$ is $\tilde{O}\left(\log(|C_i|)/\varepsilon\right)$. Therefore, the total (normalized) excess empirical risk for the entire domain $\mathcal{X}$ is $\tilde{O}\left(\frac{\log h+(w-1)}{\varepsilon n}\right)$. This is at most $\tilde{O}\left(\frac{w}{\varepsilon n}\right)$ as long as $h\leq\exp(O(w))$; this matches the lower bound.

Tight Upper Bound Example. Consider the grid poset $\mathcal{X}_{\mathrm{grid}(w,h)}:=[w]\times[h]$ where $(x,y)\leq(x',y')$ if and only if $x\leq x'$ and $y\leq y'$. We assume throughout that $w\leq h$. Observe that $\operatorname{width}(\mathcal{X}_{\mathrm{grid}(w,h)})=w$ and $\operatorname{height}(\mathcal{X}_{\mathrm{grid}(w,h)})=w+h-1$.

We will show the following lower bound, which matches the $\tilde{O}\left(\frac{\operatorname{width}(\mathcal{X})\log|\mathcal{X}|}{\varepsilon n}\right)$ upper bound in the case where $h\geq w^{1+\Omega(1)}$, up to an $O(\log^2(\varepsilon n))$ factor. We prove it by a reduction from Lemma 9. Note that this reduction is in some sense a “combination” of the proofs of Theorem 15 and Lemma 17, as the coordinate-wise encoding aspect of Lemma 17 is still present (across the rows) whereas the packing-style lower bound is present in how we embed elements of $[D]$ (in blocks of columns).

Lemma 19.

For any $\varepsilon\in(0,1]$ and $\delta<O_\varepsilon(1/h)$, any $(\varepsilon,\delta)$-DP algorithm for isotonic regression for any $R$-distance-based loss function $\ell$ must have expected excess empirical risk $\Omega\left(R\cdot\min\left\{1,\frac{w\cdot\log(h/w)}{\varepsilon n}\right\}\right)$.

Proof.

Let $D:=\lfloor h/w-1\rfloor$, $m=w$, and $r:=\min\{\lfloor 0.5n/m\rfloor,\lfloor 0.5\ln(D/2)/\varepsilon\rfloor\}$. Consider any $(\varepsilon,\delta)$-DP algorithm $\mathcal{M}'$ for isotonic regression for $\ell$ on $\mathcal{X}_{\mathrm{grid}(w,h)}$ where $\delta\leq 0.01\varepsilon/D$. We use this algorithm to build a $(\ln(D/2),0.25)$-DP algorithm $\mathcal{M}$ for privatizing a vector $\mathbf{z}\in[D]^m$ as follows:

  • Create a dataset $D$ that contains:

    • For all $i\in[m]$, $r$ copies of $((i,(w-i)(D+1)+z_i),0)$ and $r$ copies of $((i,(w-i)(D+1)+z_i+1),1)$.

    • $n-2rm$ copies of $((1,1),0)$.

  • Run $\mathcal{M}'$ on instance $D$ to get $f$.

  • Output a vector $\mathbf{z}'$ where $z'_i=\max\{j\in[D]\mid f((i,(w-i)(D+1)+j))\leq 1/2\}$. (For simplicity, when such $j$ does not exist, let $z'_i=0$.)

By group privacy, $\mathcal{M}$ is $(\ln(D/2),0.25)$-DP. Furthermore, $\mathcal{L}(f^*;D)=0$ and the expected empirical excess risk of $\mathcal{M}'$ is

$$\begin{aligned}
\mathbb{E}_{f\sim\mathcal{M}'(D)}[\mathcal{L}(f;D)]&~\geq~\frac{r}{n}\sum_{i\in[m]}\left(\ell(f(i,(w-i)(D+1)+z_i),0)+\ell(f(i,(w-i)(D+1)+z_i+1),1)\right)\\
&~\geq~\frac{r}{n}\sum_{i\in[m]}g(1/2)\cdot\mathbf{1}[z'_i\neq z_i]~=~\frac{Rr}{n}\cdot\|\mathbf{z}-\mathbf{z}'\|_0,
\end{aligned}$$

which must be at least $\Omega(Rrm/n)=\Omega\left(R\cdot\min\left\{1,\frac{w\cdot\log(h/w)}{\varepsilon n}\right\}\right)$ by Lemma 9. ∎

5 Additional Related Work

(Non-private) isotonic regression is well studied in statistics and machine learning. The one-dimensional (aka univariate) case has a long history [9, 46, 5, 44, 45, 13, 8, 35, 14, 15, 49]; for a general introduction, see [22]. Moreover, isotonic regression has been studied in higher dimensions [24, 27, 26], including the sparse setting [21], as well as in online learning [29]. A related line of work studies learning neural networks under (partial) monotonicity constraints [3, 50, 30, 40].

There has been a rich body of work on DP machine learning, including DP empirical risk minimization (ERM), e.g., [11, 6, 48, 47], and DP linear regression, e.g., [1]; however, to the best of our knowledge none of these can be applied to isotonic regression to obtain non-trivial guarantees.

Another line of work related to our setting is on privately learning threshold functions [7, 20, 10, 2, 28]. We leveraged this relation to prove our lower bound for the totally ordered case (Section 3.2).

6 Conclusions

In this paper, we obtained new private algorithms for isotonic regression on posets and proved nearly matching lower bounds in terms of the expected excess empirical risk. Although our algorithms for totally ordered sets are efficient, our algorithm for general posets is not. Specifically, a trivial implementation of the algorithm would run in time $\exp(\tilde{O}(\operatorname{width}(\mathcal{X})))$. It remains an interesting open question whether this can be sped up. To the best of our knowledge, this question does not seem to be well understood even in the non-private setting, as previous algorithmic works have focused primarily on the totally ordered case. Similarly, while our algorithm is efficient for totally ordered sets, it remains interesting to understand whether the nearly linear time algorithms for the $\ell_1$- and $\ell_2^2$-losses can be extended to a larger class of loss functions.

References

  • [1] D. Alabi, A. McMillan, J. Sarathy, A. D. Smith, and S. P. Vadhan. Differentially private simple linear regression. Proc. Priv. Enhancing Technol., 2022(2):184–204, 2022.
  • [2] N. Alon, R. Livni, M. Malliaris, and S. Moran. Private PAC learning implies finite littlestone dimension. In STOC, pages 852–860, 2019.
  • [3] N. P. Archer and S. Wang. Application of the back propagation neural network algorithm with monotonicity constraints for two-group classification problems. Dec. Sci., 24(1):60–75, 1993.
  • [4] M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. Ann. Math. Stat., pages 641–647, 1955.
  • [5] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical Inference Under Order Restrictions. John Wiley & Sons, 1973.
  • [6] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473, 2014.
  • [7] A. Beimel, K. Nissim, and U. Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. Theory Comput., 12(1):1–61, 2016.
  • [8] L. Birgé and P. Massart. Rates of convergence for minimum contrast estimators. Prob. Theory Rel. Fields, 97(1):113–150, 1993.
  • [9] H. D. Brunk. Maximum likelihood estimates of monotone parameters. Ann. Math. Stat., pages 607–616, 1955.
  • [10] M. Bun, K. Nissim, U. Stemmer, and S. P. Vadhan. Differentially private release and learning of threshold functions. In FOCS, pages 634–649, 2015.
  • [11] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. JMLR, 12(3), 2011.
  • [12] R. P. Dilworth. A decomposition theorem for partially ordered sets. Ann. Math., 51(1):161–166, 1950.
  • [13] D. L. Donoho. Gelfand $n$-widths and the method of least squares. Technical Report, University of California, 1991.
  • [14] C. Durot. On the $l_p$-error of monotonicity constrained estimators. Ann. Stat., 35(3):1080–1104, 2007.
  • [15] C. Durot. Monotone nonparametric regression with random design. Math. Methods Stat., 17(4):327–341, 2008.
  • [16] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503, 2006.
  • [17] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
  • [18] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
  • [19] R. L. Dykstra and T. Robertson. An algorithm for isotonic regression for two or more independent variables. Ann. Stat., 10(3):708–716, 1982.
  • [20] V. Feldman and D. Xiao. Sample complexity bounds on differentially private learning via communication complexity. In COLT, volume 35, pages 1000–1019, 2014.
  • [21] D. Gamarnik and J. Gaudio. Sparse high-dimensional isotonic regression. NeurIPS, 2019.
  • [22] P. Groeneboom and G. Jongbloed. Nonparametric Estimation under Shape Constraints. Cambridge University Press, 2014.
  • [23] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In ICML, pages 1321–1330, 2017.
  • [24] Q. Han, T. Wang, S. Chatterjee, and R. J. Samworth. Isotonic regression in general dimensions. Ann. Stat., 47(5):2440–2471, 2019.
  • [25] M. Hardt and K. Talwar. On the geometry of differential privacy. In STOC, pages 705–714, 2010.
  • [26] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NeurIPS, 2011.
  • [27] A. T. Kalai and R. Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
  • [28] H. Kaplan, K. Ligett, Y. Mansour, M. Naor, and U. Stemmer. Privately learning thresholds: Closing the exponential gap. In COLT, pages 2263–2285, 2020.
  • [29] W. Kotłowski, W. M. Koolen, and A. Malek. Online isotonic regression. In COLT, pages 1165–1189, 2016.
  • [30] X. Liu, X. Han, N. Zhang, and Q. Liu. Certified monotonic neural networks. NeurIPS, pages 15427–15438, 2020.
  • [31] R. Luss, S. Rosset, and M. Shahar. Efficient regularized isotonic regression with application to gene–gene interaction search. Ann. Appl. Stat., 6(1):253–283, 2012.
  • [32] P. Manurangsi. Tight bounds for differentially private anonymized histograms. In SOSA, pages 203–213, 2022.
  • [33] F. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun. ACM, 53(9):89–97, 2010.
  • [34] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103, 2007.
  • [35] M. Meyer and M. Woodroofe. On the degrees of freedom in shape-restricted regression. Ann. Stat., 28(4):1083–1104, 2000.
  • [36] L. Mirsky. A dual of Dilworth’s decomposition theorem. AMS, 78(8):876–877, 1971.
  • [37] C. Radebaugh and U. Erlingsson. Introducing TensorFlow Privacy: Learning with Differential Privacy for Training Data, March 2019. blog.tensorflow.org.
  • [38] T. Robertson, F. T. Wright, and R. L. Dykstra. Order restricted statistical inference. John Wiley & Sons, 1988.
  • [39] M. J. Schell and B. Singh. The reduced monotonic regression method. JASA, 92(437):128–135, 1997.
  • [40] A. Sivaraman, G. Farnadi, T. Millstein, and G. Van den Broeck. Counterexample-guided learning of monotonic neural networks. In NeurIPS, pages 11936–11948, 2020.
  • [41] T. Steinke and J. R. Ullman. Between pure and approximate differential privacy. J. Priv. Confidentiality, 7(2), 2016.
  • [42] Q. F. Stout. Unimodal regression via prefix isotonic regression. Comput. Stat. Data Anal., 53(2):289–297, 2008.
  • [43] D. Testuggine and I. Mironov. PyTorch Differential Privacy Series Part 1: DP-SGD Algorithm Explained, August 2020. medium.com.
  • [44] S. Van de Geer. Estimating a regression function. Ann. Stat., pages 907–924, 1990.
  • [45] S. Van de Geer. Hellinger-consistency of certain nonparametric maximum likelihood estimators. Ann. Stat., 21(1):14–44, 1993.
  • [46] C. van Eeden. Testing and Estimating Ordered Parameters of Probability Distribution. CWI, Amsterdam, 1958.
  • [47] D. Wang, C. Chen, and J. Xu. Differentially private empirical risk minimization with non-convex loss functions. In ICML, pages 6526–6535, 2019.
  • [48] D. Wang, M. Ye, and J. Xu. Differentially private empirical risk minimization revisited: Faster and more general. In NIPS, 2017.
  • [49] F. Yang and R. F. Barber. Contraction and uniform convergence of isotonic regression. Elec. J. Stat., 13(1):646–677, 2019.
  • [50] S. You, D. Ding, K. Canini, J. Pfeifer, and M. Gupta. Deep lattice networks and partial monotonic functions. NIPS, 2017.
  • [51] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pages 694–699, 2002.

Appendix A Baseline Algorithm for Private Isotonic Regression

We provide a baseline algorithm for private isotonic regression by a direct application of the exponential mechanism. For simplicity, we start with the case of totally ordered sets and then extend the algorithm to general posets.

Totally ordered sets.

Consider a discretized range 𝒯:={0,1T,2T,,1}\mathcal{T}:=\left\{0,\frac{1}{T},\frac{2}{T},\ldots,1\right\}. Since the loss is LL-Lipschitz, rounding the values of the optimal function to this grid changes the empirical risk by at most L/TL/T; that is, for f~:=argminf([m],𝒯)(f;D)\tilde{f}:=\operatorname{argmin}_{f\in\mathcal{F}([m],\mathcal{T})}\mathcal{L}(f;D) and f:=argminf([m],[0,1])(f;D)f^{*}:=\operatorname{argmin}_{f\in\mathcal{F}([m],[0,1])}\mathcal{L}(f;D), it holds that (f~;D)(f;D)+LT\mathcal{L}(\tilde{f};D)\leq\mathcal{L}(f^{*};D)+\frac{L}{T}. Also, it is a simple combinatorial fact that |([m],𝒯)|=(m+TT)(m+T)T|\mathcal{F}([m],\mathcal{T})|=\binom{m+T}{T}\leq(m+T)^{T}, which bounds the number of monotone functions under this discretization. Thus, the ε\varepsilon-DP exponential mechanism over the set of all monotone functions in ([m],𝒯)\mathcal{F}([m],\mathcal{T}), with the score function (f;D)\mathcal{L}(f;D) of sensitivity at most L/nL/n, returns f:[m]𝒯f:[m]\to\mathcal{T} such that

(f;D)(f~;D)+O(LTlog(m+T)εn)(f;D)+O(LTlog(m+T)εn+LT).\textstyle\mathcal{L}(f;D)\leq\mathcal{L}(\tilde{f};D)+O\left(\frac{LT\log(m+T)}{\varepsilon n}\right)~{}\leq~{}\mathcal{L}(f^{*};D)+O\left(\frac{LT\log(m+T)}{\varepsilon n}+\frac{L}{T}\right).

Setting T=εnlogmT=\sqrt{\frac{\varepsilon n}{\log m}} gives an excess empirical error of O(Llog(m)εn)O\left(L\sqrt{\frac{\log(m)}{\varepsilon n}}\right) (when mnm\geq n).
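To make this baseline concrete, the following is a minimal Python sketch (with naming of our own; not an implementation from the paper) for the ℓ1 loss, which is 1-Lipschitz: it enumerates all (m+T choose T) monotone functions into the grid and samples one via the exponential mechanism, so it is only feasible for very small m and T.

```python
import itertools
import math
import random

def exp_mech_isotonic_baseline(data, m, T, eps):
    """Baseline eps-DP isotonic regression on the chain [m] via the
    exponential mechanism, for the 1-Lipschitz l1 loss (L = 1).

    data : list of (x, y) pairs with x in {1, ..., m} and y in [0, 1].
    Enumerates all C(m + T, T) monotone functions f : [m] -> {0, 1/T, ..., 1},
    so it is only meant for very small m and T.
    """
    n = len(data)
    grid = [t / T for t in range(T + 1)]
    # A monotone function [m] -> grid is a non-decreasing tuple of m grid values.
    candidates = list(itertools.combinations_with_replacement(grid, m))

    def emp_risk(f):
        return sum(abs(f[x - 1] - y) for x, y in data) / n

    # Exponential mechanism: Pr[f] proportional to exp(-eps * score(f) / (2 * sensitivity)),
    # where score = empirical risk and sensitivity = L / n = 1 / n.
    sensitivity = 1.0 / n
    scores = [emp_risk(f) for f in candidates]
    best = min(scores)  # shift for numerical stability; does not change the distribution
    weights = [math.exp(-eps * (s - best) / (2 * sensitivity)) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]

# Example: m = 5 points, grid with T = 4 levels.
data = [(1, 0.1), (2, 0.2), (3, 0.35), (4, 0.7), (5, 0.9)]
print(exp_mech_isotonic_baseline(data, m=5, T=4, eps=1.0))
```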

General posets.

By Lemma 16, we have that 𝒳\mathcal{X} can be partitioned into w:=width(𝒳)w:=\operatorname{width}(\mathcal{X}) many chains H1,,HwH_{1},\ldots,H_{w}. Let hi:=|Hi|h_{i}:=|H_{i}|. Since any monotone function over 𝒳\mathcal{X} has to be monotone over each of the chains, we have that

|(𝒳,𝒯)|i=1w|(Hi,𝒯)|(|𝒳|w+T)wT(|𝒳|+T)wT.\textstyle|\mathcal{F}(\mathcal{X},\mathcal{T})|\leq\prod\limits_{i=1}^{w}|\mathcal{F}(H_{i},\mathcal{T})|~{}\leq~{}\left(\frac{|\mathcal{X}|}{w}+T\right)^{wT}~{}\leq~{}(|\mathcal{X}|+T)^{wT}.

Here, the second inequality uses |(Hi,𝒯)|(hi+T)T|\mathcal{F}(H_{i},\mathcal{T})|\leq(h_{i}+T)^{T} together with the AM–GM inequality (since i=1whi=|𝒳|\sum_{i=1}^{w}h_{i}=|\mathcal{X}|). Thus, by a similar argument as above, the ε\varepsilon-DP exponential mechanism over the set of all monotone functions in (𝒳,𝒯)\mathcal{F}(\mathcal{X},\mathcal{T}), with score function (f;D)\mathcal{L}(f;D) of sensitivity at most L/nL/n, returns f:𝒳𝒯f:\mathcal{X}\to\mathcal{T} such that

(f;D)(f;D)+O(LwTlog(|𝒳|+T)nε+LT).\textstyle\mathcal{L}(f;D)~{}\leq~{}\mathcal{L}(f^{*};D)+O\left(L\cdot\frac{wT\log(|\mathcal{X}|+T)}{n\varepsilon}+\frac{L}{T}\right).

Choosing T=εnwlog|𝒳|T=\sqrt{\frac{\varepsilon n}{w\log|\mathcal{X}|}} gives an excess empirical error of O(Lwlog|𝒳|εn)O\left(L\sqrt{\frac{w\log|\mathcal{X}|}{\varepsilon n}}\right) (when |𝒳|n|\mathcal{X}|\geq n).
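The chain decomposition promised by Lemma 16 (Dilworth's theorem) can be computed explicitly. The following is a minimal sketch, with our own naming and assuming the poset is given by an explicit strict-order predicate, of the classical reduction to maximum bipartite matching: matching u to v whenever u < v and gluing matched pairs into chains yields exactly width(𝒳) chains.

```python
def min_chain_decomposition(elements, less_than):
    """Partition a finite poset into width(X) chains (Dilworth's theorem).

    elements  : list of hashable poset elements.
    less_than : less_than(u, v) is True iff u < v in the partial order.
    Classical reduction: match u to v when u < v (maximum bipartite matching),
    then glue matched pairs into chains; #chains = n - |matching| = width.
    """
    n = len(elements)
    succ = [[j for j in range(n) if less_than(elements[i], elements[j])] for i in range(n)]
    match_right = [None] * n  # match_right[j] = i means edge (i, j) is matched

    def try_augment(i, seen):
        # Kuhn's augmenting-path search from left node i.
        for j in succ[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_right[j] is None or try_augment(match_right[j], seen):
                match_right[j] = i
                return True
        return False

    for i in range(n):
        try_augment(i, set())

    # next_in_chain[i] = j means i is immediately followed by j in its chain.
    next_in_chain, has_pred = {}, set()
    for j, i in enumerate(match_right):
        if i is not None:
            next_in_chain[i] = j
            has_pred.add(j)

    chains = []
    for i in range(n):
        if i not in has_pred:  # i starts a chain
            chain, cur = [], i
            while cur is not None:
                chain.append(elements[cur])
                cur = next_in_chain.get(cur)
            chains.append(chain)
    return chains  # len(chains) equals the width of the poset

# Example: the Boolean lattice on {a, b}, ordered by proper inclusion, has width 2.
subsets = [frozenset(), frozenset('a'), frozenset('b'), frozenset('ab')]
print(min_chain_decomposition(subsets, lambda u, v: u < v))
```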

Appendix B Lower Bound on Privatizing Vectors with Large Alphabet: Proof of Lemma 9

We now prove Lemma 9; the proof is a slight extension of that of Lemma 8 in [32].

Proof of Lemma 9.

For every i[m],σ[D]i\in[m],\sigma\in[D], let 𝐳(i,σ)\mathbf{z}_{(i,\sigma)} denote (z1,,zi1,σ,zi+1,,zm)(z_{1},\dots,z_{i-1},\sigma,z_{i+1},\dots,z_{m}). Let ε=ln(D/2),δ=0.25\varepsilon^{\prime}=\ln(D/2),\delta^{\prime}=0.25. We have

𝔼𝐳[D]m[(𝐳)𝐳0]\displaystyle\mathbb{E}_{\mathbf{z}\sim[D]^{m}}[\|\mathcal{M}(\mathbf{z})-\mathbf{z}\|_{0}]
=i[m]Pr𝐳[D]m[(𝐳)izi]\displaystyle=\sum_{i\in[m]}\Pr_{\mathbf{z}\sim[D]^{m}}[\mathcal{M}(\mathbf{z})_{i}\neq z_{i}]
=mi[m]Pr𝐳[D]m[(𝐳)i=zi]\displaystyle=m-\sum_{i\in[m]}\Pr_{\mathbf{z}\sim[D]^{m}}[\mathcal{M}(\mathbf{z})_{i}=z_{i}]
=mi[m]1Dm+1𝐳[D]m(σ[D]Pr[(𝐳(i,σ))i=σ])\displaystyle=m-\sum_{i\in[m]}\frac{1}{D^{m+1}}\sum_{\mathbf{z}\in[D]^{m}}\left(\sum_{\sigma\in[D]}\Pr[\mathcal{M}(\mathbf{z}_{(i,\sigma)})_{i}=\sigma]\right)
(From (ε,δ)-DP of )\displaystyle(\text{From }(\varepsilon^{\prime},\delta^{\prime})\text{-DP of }\mathcal{M}) mi[m]1Dm+1𝐳[D]m(σ[D](eεPr[(𝐳(i,1))i=σ]+δ))\displaystyle\geq m-\sum_{i\in[m]}\frac{1}{D^{m+1}}\sum_{\mathbf{z}\in[D]^{m}}\left(\sum_{\sigma\in[D]}\left(e^{\varepsilon^{\prime}}\cdot\Pr[\mathcal{M}(\mathbf{z}_{(i,1)})_{i}=\sigma]+\delta^{\prime}\right)\right)
mi[m]1Dm+1𝐳[D]m(eε+Dδ)\displaystyle\geq m-\sum_{i\in[m]}\frac{1}{D^{m+1}}\sum_{\mathbf{z}\in[D]^{m}}\left(e^{\varepsilon^{\prime}}+D\delta^{\prime}\right)
=(1eε/Dδ)m\displaystyle=\left(1-e^{\varepsilon^{\prime}}/D-\delta^{\prime}\right)m
=0.25m.\displaystyle=0.25m.\qed

Appendix C Algorithms for Isotonic Regression with Additional Constraints

In this section, we elaborate on the constrained variants of the isotonic regression problem over totally ordered sets, by designing a meta-algorithm that can be instantiated to get algorithms for each of the cases discussed in Section 3.3.

Recall that Algorithm 1 proceeded in TT rounds, where in round tt the algorithm starts with a partition of [m][m] into 2t2^{t} intervals and then partitions each interval into two using the exponential mechanism. At a high level, our meta-algorithm is similar, except that it maintains a set of pairwise disjoint structured intervals of [m][m]: each interval carries additional structure that constrains the function that can be returned on that interval, and the function is fixed outside the union of these intervals. This idea is described in Algorithm 2, stated using the following abstractions, which will be instantiated to derive algorithms for each constrained variant.

  • The set 𝒮\mathcal{S} of all structured intervals of [m][m], and an initial structured interval S0,0𝒮S_{0,0}\in\mathcal{S}. A structured interval SS will consist of an interval domain denoted PS[m]P_{S}\subseteq[m], an interval range denoted RS[0,1]R_{S}\subseteq[0,1], and potentially additional constraints that the function should satisfy. We use |RS||R_{S}| to denote the length of RSR_{S}. To keep the number of structured intervals bounded, we will consider a discretized range where the endpoints of interval RSR_{S} lie in :={0,1/H,2/H,,1}\mathcal{H}:=\{0,1/H,2/H,\ldots,1\} for some discretization parameter HH.

  • A partition method Φ:S{(Sleft,Sright,g)}\Phi:S\mapsto\{(S^{\mathrm{left}},S^{\mathrm{right}},g)\} that defines a set of all “valid partitions” of a structured interval SS into two structured intervals SleftS^{\mathrm{left}} and SrightS^{\mathrm{right}} and a function g:PS(PSleftPSright)RSg:P_{S}\smallsetminus(P_{S^{\mathrm{left}}}\cup P_{S^{\mathrm{right}}})\to R_{S}. It is required that PS(PSleftPSright)P_{S}\smallsetminus(P_{S^{\mathrm{left}}}\cup P_{S^{\mathrm{right}}}) be an interval. If the algorithm makes a choice of (Sleft,Sright,g)(S^{\mathrm{left}},S^{\mathrm{right}},g), then the final function returned by the algorithm is required to be equal to gg on PS(PSleftPSright)P_{S}\smallsetminus(P_{S^{\mathrm{left}}}\cup P_{S^{\mathrm{right}}}).

  • For all S𝒮S\in\mathcal{S}, we abuse notation to let (S)\mathcal{F}(S) denote the set of all monotone functions mapping PSP_{S} to RSR_{S}, while respecting the additional conditions enforced by the structure in SS.

Algorithm 2 Meta algorithm for variants of DP Isotonic Regression for Totally Ordered Sets.
  Input: 𝒳=[m]\mathcal{X}=[m], dataset D={(x1,y1),,(xn,yn)}D=\{(x_{1},y_{1}),\ldots,(x_{n},y_{n})\}, DP parameter ε\varepsilon.
  Output: Monotone function f:[m][0,1]f:[m]\to[0,1] satisfying additional desired condition.
  
  Tlog(εn)T\leftarrow\left\lceil\log(\varepsilon n)\right\rceil
  εε/T\varepsilon^{\prime}\leftarrow\varepsilon/T
  S0,0S_{0,0} : initial structured interval                          {Any structured interval SS consists of an interval domain PSP_{S} and an interval range RSR_{S}, and potentially other conditions on the function.}
  for t=0,,T1t=0,\ldots,T-1 do
     for i=0,,2t1i=0,\ldots,2^{t}-1 do
        
  • \triangleright

    Di,t{(xj,yj)j[n],xjPSi,t}D_{i,t}\leftarrow\left\{(x_{j},y_{j})\mid j\in[n],x_{j}\in P_{S_{i,t}}\right\}

  • \triangleright

    Choose (Si,tleft,Si,tright,gi,t)Φ(Si,t)(S^{\mathrm{left}}_{i,t},S^{\mathrm{right}}_{i,t},g_{i,t})\in\Phi(S_{i,t}), using ε\varepsilon^{\prime}-DP exponential mechanism with scoring function

    scorei,t(Sleft,Sright,g)\displaystyle\operatorname{score}_{i,t}(S^{\mathrm{left}},S^{\mathrm{right}},g) :=minf1(Sleft)RSabs(f1;Di,tleft)\displaystyle~{}:=~{}\min_{f_{1}\in\mathcal{F}(S^{\mathrm{left}})}\mathcal{L}^{\mathrm{abs}}_{R_{S}}(f_{1};D^{\mathrm{left}}_{i,t})
    +minf2(Sright)RSabs(f2;Di,tright)\displaystyle\qquad+~{}\min_{f_{2}\in\mathcal{F}(S^{\mathrm{right}})}\mathcal{L}^{\mathrm{abs}}_{R_{S}}(f_{2};D^{\mathrm{right}}_{i,t})
    +RSabs(g;Di,tmid)\displaystyle\qquad+~{}\mathcal{L}^{\mathrm{abs}}_{R_{S}}(g;D^{\mathrm{mid}}_{i,t})

    {Notation: Di,tleft:={(x,y)Di,txPSleft}D_{i,t}^{\mathrm{left}}:=\left\{(x,y)\in D_{i,t}\mid x\in P_{S^{\mathrm{left}}}\right\}, Di,trightD_{i,t}^{\mathrm{right}} is defined similarly and

    Notation: Di,tmid:={(x,y)Di,txPSi,t(PSleftPSright)}D_{i,t}^{\mathrm{mid}}:=\left\{(x,y)\in D_{i,t}\mid x\in P_{S_{i,t}}\smallsetminus(P_{S^{\mathrm{left}}}\cup P_{S^{\mathrm{right}}})\right\}.}

    {Note: scorei,t(Sleft,Sright,g)\operatorname{score}_{i,t}(S^{\mathrm{left}},S^{\mathrm{right}},g) has sensitivity at most L|RS|L\cdot|R_{S}|.}

  • \triangleright

    S2i,t+1Si,tleftS_{2i,t+1}\leftarrow S^{\mathrm{left}}_{i,t} and S2i+1,t+1Si,trightS_{2i+1,t+1}\leftarrow S^{\mathrm{right}}_{i,t}.

  Let f:[m][0,1]f:[m]\to[0,1] be defined by choosing f|PSi,T(Si,T)f|_{P_{S_{i,T}}}\in\mathcal{F}(S_{i,T}) arbitrarily for all i[2T]i\in[2^{T}], and setting f(x)=gi,t(x)f(x)=g_{i,t}(x) for all xPSi,t(PS2i,t+1PS2i+1,t+1)x\in P_{S_{i,t}}\smallsetminus(P_{S_{2i,t+1}}\cup P_{S_{2i+1,t+1}}) for all ii, tt.
  return  ff
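To make the abstractions above concrete, the following is a minimal Python sketch (our own naming throughout, not the paper's implementation) of the structured-interval data type and the main loop of Algorithm 2. The variant-specific partition method Φ and the inner score minimization are passed in as callbacks, the Lipschitz constant of the loss is taken to be 1, and assembling the final function from the leaf intervals and the fixed middle pieces is omitted.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class StructuredInterval:
    domain: tuple          # (i, j): the interval domain P_S, a sub-interval of [m]
    rng: tuple             # (tau, theta): the interval range R_S, a sub-interval of [0, 1]
    extra: dict = field(default_factory=dict)  # variant-specific structure (piece budget, flags, ...)

@dataclass
class Partition:
    left: StructuredInterval
    right: StructuredInterval
    g: Optional[Callable] = None  # function fixed on P_S \ (P_left u P_right), if that set is non-empty

def run_meta_algorithm(data, m, eps, S00, Phi, score):
    """Skeleton of Algorithm 2: T rounds of refining structured intervals,
    each refinement chosen by the eps'-DP exponential mechanism.

    Phi(S)           -> list of Partition objects (the valid partitions of S)
    score(P, data_S) -> the (non-private) score of partition P on the data in P_S
    """
    n = len(data)
    T = max(1, math.ceil(math.log2(max(2, eps * n))))  # T ~ ceil(log(eps * n))
    eps_round = eps / T
    intervals, fixed = [S00], []   # `fixed` collects the chosen partitions (incl. g on middles)
    for t in range(T):
        new_intervals = []
        for S in intervals:
            i, j = S.domain
            data_S = [(x, y) for (x, y) in data if i <= x <= j]
            candidates = Phi(S)
            sens = max(S.rng[1] - S.rng[0], 1e-9)   # sensitivity <= L * |R_S|, with L = 1 here
            scores = [score(P, data_S) for P in candidates]
            best = min(scores)
            w = [math.exp(-eps_round * (s - best) / (2 * sens)) for s in scores]
            P = random.choices(candidates, weights=w, k=1)[0]
            fixed.append(P)
            new_intervals += [P.left, P.right]
        intervals = new_intervals
    return intervals, fixed
```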

We instantiate this notion of structured intervals in the following ways to derive algorithms for the constrained variants of isotonic regression mentioned earlier:

  • (Vanilla) Isotonic Regression (recovers Algorithm 1): 𝒮\mathcal{S} is simply the set of all interval domains and all (discretized) interval ranges, and the partition method partitions each interval into two sub-intervals, with the range divided into two equal halves. (We ignore a slight detail that τ+θ2\frac{\tau+\theta}{2} need not be in \mathcal{H}; this can be fixed, e.g., by letting it be Hτ+θ2/H\left\lfloor H\cdot\frac{\tau+\theta}{2}\right\rfloor/H, but we skip this complicated expression for simplicity. Note that, if we let H=2TH=2^{T}, this distinction does not make a difference in the algorithm for vanilla isotonic regression.) Namely,

    𝒮\displaystyle\mathcal{S} :={([i,j],[τ,θ]):i,j[m],τ,θ s.t. ij,τθ},\displaystyle~{}:=~{}\left\{([i,j],[\tau,\theta]):i,j\in[m]\,,\tau,\theta\in\mathcal{H}\text{ s.t. }i\leq j\,,\tau\leq\theta\right\}\,,
    S0,0\displaystyle S_{0,0} :=([1,m],[0,1]),\displaystyle~{}:=~{}([1,m],[0,1])\,,
    Φ(([i,j],[τ,θ]))\displaystyle\Phi(([i,j],[\tau,\theta])) :={(([i,],[τ,τ+θ2]),([+1,j],[τ+θ2,θ])):i1j},\displaystyle\textstyle~{}:=~{}\left\{(([i,\ell],[\tau,\frac{\tau+\theta}{2}]),([\ell+1,j],[\frac{\tau+\theta}{2},\theta])):i-1\leq\ell\leq j\right\}\,,
    (([i,j],[τ,θ]))\displaystyle\mathcal{F}(([i,j],[\tau,\theta])) :=set of monotone functions mapping [i,j] to [τ,θ].\displaystyle~{}:=~{}\text{set of monotone functions mapping }[i,j]\text{ to }[\tau,\theta]\,.

    We skip the description of the function gg in the partition method Φ\Phi, since the middle sub-interval is empty. For all the other variants, we omit the conditions i,j[m]i,j\in[m], τ,θ\tau,\theta\in\mathcal{H}, iji\leq j, and τθ\tau\leq\theta in the definition of 𝒮\mathcal{S}, and similarly the condition that (S)\mathcal{F}(S) consists of monotone functions mapping [i,j][i,j] to [τ,θ][\tau,\theta]; we only spell out the main new conditions. (A minimal code sketch of the vanilla instantiation is given after this list.)

  • kk-Piecewise Constant: 𝒮\mathcal{S} is the set of all interval domains and all discretized ranges, along with a parameter encoding an upper bound on the number of pieces in the final piecewise constant function. The partition method partitions the domain into two sub-intervals, splitting the budget of pieces between them and dividing the range into two equal halves, namely,

    𝒮\displaystyle\mathcal{S} :={([i,j],[τ,θ],r):1rk if ij,r=0 if i>j},\displaystyle~{}:=~{}\left\{([i,j],[\tau,\theta],r):\begin{matrix}1\leq r\leq k&\text{ if }i\leq j,\\ r=0&\text{ if }i>j\end{matrix}\right\}\,,
    S0,0\displaystyle S_{0,0} :=([1,m],[0,1],k),\displaystyle~{}:=~{}([1,m],[0,1],k)\,,
    Φ(([i,j],[τ,θ],r))\displaystyle\Phi(([i,j],[\tau,\theta],r)) :={(([i,],[τ,τ+θ2],r1),([+1,j],[τ+θ2,θ],r2))s.t. i1j and r1+r2=r},\displaystyle\textstyle~{}:=~{}\left\{\begin{matrix}(([i,\ell],[\tau,\frac{\tau+\theta}{2}],r_{1}),([\ell+1,j],[\frac{\tau+\theta}{2},\theta],r_{2}))\\[5.69054pt] \text{s.t. }i-1\leq\ell\leq j\ \text{ and }r_{1}+r_{2}=r\end{matrix}\right\}\,,
    (([i,j],[τ,θ],r))\displaystyle\mathcal{F}(([i,j],[\tau,\theta],r)) :=set of r-piecewise constant functions\displaystyle~{}:=~{}\text{set of $r$-piecewise constant functions}
  • kk-Piecewise Linear: 𝒮\mathcal{S} is the set of all interval domains and all discretized ranges, along with a parameter (encoding an upper bound on the number of pieces in the final piecewise linear function) and two Boolean values (\top/\bot), one encoding whether the function must achieve the minimum possible value at the start of the interval, and the other encoding whether it must achieve the maximum possible value at the end of the interval. The partition method partitions into two sub-intervals respecting the number of pieces, by choosing a middle sub-interval that ensures that each new range is at most half as large as the earlier one, namely,

    𝒮\displaystyle\mathcal{S} :={([i,j],[τ,θ],r,b1,b2):1rk if ij,r=0 if i>j,b1,b2{,}},\displaystyle~{}:=~{}\left\{([i,j],[\tau,\theta],r,b_{1},b_{2}):\begin{matrix}\begin{matrix}1\leq r\leq k&\text{ if }i\leq j,\\ r=0&\text{ if }i>j,\end{matrix}\\ b_{1},b_{2}\in\left\{\top,\bot\right\}\end{matrix}\right\}\,,
    S0,0\displaystyle S_{0,0} :=([1,m],[0,1],k,,),\displaystyle~{}:=~{}([1,m],[0,1],k,\bot,\bot)\,,
    Φ(([i,j],[τ,θ],r,b1,b2))\displaystyle\Phi(([i,j],[\tau,\theta],r,b_{1},b_{2})) :={(Sleft=([i,1],[τ,ω1],r1,b1,),Sright=([2,j],[ω2,θ],r2,,b2)g(x)=ω1+(x1)(ω2ω1)/(21))s.t. i11<2j+1,ω1τ+θ2ω2, and r1+r2=r1},\displaystyle\textstyle~{}:=~{}\left\{\begin{matrix}\left(\begin{matrix}S^{\mathrm{left}}=([i,\ell_{1}],[\tau,\omega_{1}],r_{1},b_{1},\top),\\ S^{\mathrm{right}}=([\ell_{2},j],[\omega_{2},\theta],r_{2},\top,b_{2})\\ g(x)=\omega_{1}+(x-\ell_{1})\cdot(\omega_{2}-\omega_{1})/(\ell_{2}-\ell_{1})\end{matrix}\right)\\[5.69054pt] \text{s.t. }i-1\leq\ell_{1}<\ell_{2}\leq j+1\ ,\omega_{1}\leq\frac{\tau+\theta}{2}\leq\omega_{2}\,,\\ \phantom{\text{s.t. }}\text{ and }r_{1}+r_{2}=r-1\end{matrix}\right\}\,,
    (([i,j],[τ,θ],r,b1,b2))\displaystyle\mathcal{F}(([i,j],[\tau,\theta],r,b_{1},b_{2})) :=set of r-piecewise linear functions f\displaystyle~{}:=~{}\text{set of $r$-piecewise linear functions $f$}
    s.t. f(i)=τf(i)=\tau if b1=b_{1}=\top and f(j)=θf(j)=\theta if b2=b_{2}=\top.

    In other words, Φ(([i,j],[τ,θ],r,b1,b2))\Phi(([i,j],[\tau,\theta],r,b_{1},b_{2})) considers the three sub-intervals [i,1][i,\ell_{1}], [1,2][\ell_{1},\ell_{2}] and [2,j][\ell_{2},j], and fits an affine function gg in the middle sub-interval [1,2][\ell_{1},\ell_{2}] such that g(1)=ω1g(\ell_{1})=\omega_{1} and g(2)=ω2g(\ell_{2})=\omega_{2} and ensures that the function ff returned on sub-intervals [i,1][i,\ell_{1}] and [2,j][\ell_{2},j] satisfies f(1)=ω1f(\ell_{1})=\omega_{1} and f(2)=ω2f(\ell_{2})=\omega_{2}.

  • Lipschitz Regression: Given any Lipschitz constant LfL_{f}, 𝒮\mathcal{S} is the set of all interval domains and all discretized ranges, along with two Boolean values (\top/\bot), one encoding whether the function must achieve the minimum possible value at the start of the interval, and the other encoding whether it must achieve the maximum possible value at the end of the interval. The partition method chooses sub-intervals by choosing \ell and function values f()f(\ell) and f(+1)f(\ell+1) such that f(+1)f()Lff(\ell+1)-f(\ell)\leq L_{f} (thereby respecting the Lipschitz condition), and moreover f()τ+θ2f(\ell)\leq\frac{\tau+\theta}{2} and f(+1)τ+θ2f(\ell+1)\geq\frac{\tau+\theta}{2}.

    𝒮\displaystyle\mathcal{S} :={([i,j],[τ,θ],b1,b2):b1,b2{,}},\displaystyle~{}:=~{}\left\{([i,j],[\tau,\theta],b_{1},b_{2}):b_{1},b_{2}\in\left\{\top,\bot\right\}\right\}\,,
    S0,0\displaystyle S_{0,0} :=([1,m],[0,1],,),\displaystyle~{}:=~{}([1,m],[0,1],\bot,\bot)\,,
    Φ(([i,j],[τ,θ],b1,b2))\displaystyle\Phi(([i,j],[\tau,\theta],b_{1},b_{2})) :={(Sleft=([i,],[τ,ω1],b1,),Sright=([+1,j],[ω2,θ],,b2))s.t. i1j,ω1τ+θ2ω2,ω2ω1Lf},\displaystyle\textstyle~{}:=~{}\left\{\begin{matrix}\left(\begin{matrix}S^{\mathrm{left}}=([i,\ell],[\tau,\omega_{1}],b_{1},\top),\\ S^{\mathrm{right}}=([\ell+1,j],[\omega_{2},\theta],\top,b_{2})\end{matrix}\right)\\[5.69054pt] \text{s.t. }i-1\leq\ell\leq j\ ,\omega_{1}\leq\frac{\tau+\theta}{2}\leq\omega_{2}\,,\\ \phantom{\text{s.t. }}\omega_{2}-\omega_{1}\leq L_{f}\end{matrix}\right\}\,,
    (([i,j],[τ,θ],b1,b2))\displaystyle\mathcal{F}(([i,j],[\tau,\theta],b_{1},b_{2})) :=set of Lf-Lipschitz functions f\displaystyle~{}:=~{}\text{set of $L_{f}$-Lipschitz functions $f$}
    s.t. f(i)=τf(i)=\tau if b1=b_{1}=\top and f(j)=θf(j)=\theta if b2=b_{2}=\top.
  • Convex/Concave: We only describe the convex case; the concave case follows similarly. Note that a function ff is convex over the discrete domain [m][m] if and only if f(x+1)+f(x1)2f(x)f(x+1)+f(x-1)\geq 2\cdot f(x) holds for all 1<x<m1<x<m. Let 𝒮\mathcal{S} be the set of all interval domains and all discretized ranges, along with the following additional parameters:

    • a lower bound LL on the (discrete) derivative of ff,

    • an upper bound UU on the (discrete) derivative of ff,

    • a Boolean value encoding whether the function must achieve the minimum possible value at the start of the interval,

    • another Boolean value encoding whether the function must achieve the maximum possible value at the end of the interval.

    The partition method chooses sub-intervals by choosing \ell and function values f()f(\ell) and f(+1)f(\ell+1) such that Lf(+1)f()UL\leq f(\ell+1)-f(\ell)\leq U, f()τ+θ2f(\ell)\leq\frac{\tau+\theta}{2} and f(+1)τ+θ2f(\ell+1)\geq\frac{\tau+\theta}{2} and enforcing that the left sub-interval has derivatives at most f(+1)f()f(\ell+1)-f(\ell) and the right sub-interval has derivatives at least f(+1)f()f(\ell+1)-f(\ell).

    𝒮\displaystyle\mathcal{S} :={([i,j],[τ,θ],L,U,b1,b2):LU,b1,b2{,}},\displaystyle~{}:=~{}\left\{([i,j],[\tau,\theta],L,U,b_{1},b_{2}):L\leq U,b_{1},b_{2}\in\left\{\top,\bot\right\}\right\}\,,
    S0,0\displaystyle S_{0,0} :=([1,m],[0,1],,+,,),\displaystyle~{}:=~{}([1,m],[0,1],-\infty,+\infty,\bot,\bot)\,,
    Φ(([i,j],[τ,θ],L,U,b1,b2))\displaystyle\Phi(([i,j],[\tau,\theta],L,U,b_{1},b_{2})) :={(Sleft=([i,],[τ,ω1],L,ω2ω1,b1,),Sright=([+1,j],[ω2,θ],ω2ω1,U,,b2))s.t. i1j,ω1τ+θ2ω2,Lω2ω1U},\displaystyle\textstyle~{}:=~{}\left\{\begin{matrix}\left(\begin{matrix}S^{\mathrm{left}}=([i,\ell],[\tau,\omega_{1}],L,\omega_{2}-\omega_{1},b_{1},\top),\\ S^{\mathrm{right}}=([\ell+1,j],[\omega_{2},\theta],\omega_{2}-\omega_{1},U,\top,b_{2})\end{matrix}\right)\\[5.69054pt] \text{s.t. }i-1\leq\ell\leq j\ ,\omega_{1}\leq\frac{\tau+\theta}{2}\leq\omega_{2}\,,\\ \phantom{\text{s.t. }}L\leq\omega_{2}-\omega_{1}\leq U\end{matrix}\right\}\,,
    (([i,j],[τ,θ],L,U,b1,b2))\displaystyle\mathcal{F}(([i,j],[\tau,\theta],L,U,b_{1},b_{2})) :=set of convex functions f\displaystyle~{}:=~{}\text{set of convex functions $f$}
    s.t. for all [i,j)\ell\in[i,j) it holds that Lf(+1)f()UL\leq f(\ell+1)-f(\ell)\leq U,
    and f(i)=τf(i)=\tau if b1=b_{1}=\top and f(j)=θf(j)=\theta if b2=b_{2}=\top.
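As promised after the first bullet, here is a minimal sketch (our own naming; the range midpoint is kept exact, which matches the footnote's choice of H=2^T) of the structured-interval set and the partition method Φ for the vanilla instantiation.

```python
from dataclasses import dataclass

# Sketch of the vanilla instantiation of the partition method Phi (our own naming).

@dataclass(frozen=True)
class Vanilla:
    i: int       # left endpoint of the interval domain [i, j]
    j: int       # right endpoint (j < i encodes an empty domain)
    tau: float   # lower endpoint of the interval range [tau, theta]
    theta: float # upper endpoint

def Phi_vanilla(S):
    """All valid partitions of S: split the domain at any l with
    i - 1 <= l <= j, and split the range at its midpoint."""
    mid = (S.tau + S.theta) / 2
    partitions = []
    for l in range(S.i - 1, S.j + 1):
        left = Vanilla(S.i, l, S.tau, mid)
        right = Vanilla(l + 1, S.j, mid, S.theta)
        partitions.append((left, right))   # middle sub-interval is empty, so no g is needed
    return partitions

# Example: the initial structured interval over [1, 8] with range [0, 1].
S00 = Vanilla(1, 8, 0.0, 1.0)
for left, right in Phi_vanilla(S00)[:3]:
    print(left, right)
```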

Privacy Analysis.

This follows similarly to Algorithm 1: within each round tt, the exponential mechanisms act on disjoint sub-datasets, so each round is ε\varepsilon^{\prime}-DP, and basic composition over the TT rounds yields ε\varepsilon-DP.

Utility Analysis.

Since |RSi,t|2t|R_{S_{i,t}}|\leq 2^{-t} in each of the cases, it follows that the sensitivity of the scoring function is at most L/2tL/2^{t}. The rest of the proof follows similarly, with the only change being that the number of candidates in the exponential mechanism is given as |Φ(Si,t)||\Phi(S_{i,t})|, which in the case of vanilla isotonic regression was simply |Pi,t||P_{i,t}|. We now bound this for each of the cases, which shows that log|Φ(Si,t)|\log|\Phi(S_{i,t})| is at most O(log(mn))O(\log(mn)). In particular,

  • kk-Piecewise Constant: |Φ(S)|O(mk)|\Phi(S)|\leq O(mk).

  • kk-Piecewise Linear: |Φ(S)|O(m2H2k)|\Phi(S)|\leq O(m^{2}H^{2}k).

  • LfL_{f}-Lipschitz: |Φ(S)|O(mH2)|\Phi(S)|\leq O(mH^{2}).

  • Convex/Concave: |Φ(S)|O(mH)|\Phi(S)|\leq O(mH).

Finally, there is an additional error due to discretization. To account for it, we argue below that, for appropriately selected values of HH and for any optimal function ff^{*}, there exists f(S0,0)f\in\mathcal{F}(S_{0,0}) such that |f(x)f(x)|1/n|f^{*}(x)-f(x)|\leq 1/n. Since the loss is LL-Lipschitz, this immediately implies that the (unnormalized) discretization error is at most nL(1/n)=O(L)n\cdot L\cdot(1/n)=O(L).

  • kk-Piecewise Linear: We may select H=nH=n. In this case, for every endpoint \ell, we let f()=Hf()/Hf(\ell)=\lceil H\cdot f^{*}(\ell)\rceil/H and interpolate the intermediate points accordingly. It is simple to see that |f(x)f(x)|1/n|f^{*}(x)-f(x)|\leq 1/n as desired.

  • LfL_{f}-Lipschitz and Convex/Concave: Let H=mnH=mn. Here we discretize the (discrete) derivative of ff. Specifically, let f(1)=Hf(1)/Hf(1)=\lfloor H\cdot f^{*}(1)\rfloor/H and let f(+1)f()=H(f(+1)f())/Hf(\ell+1)-f(\ell)=\lfloor H\cdot(f^{*}(\ell+1)-f^{*}(\ell))\rfloor/H for all =1,,m1\ell=1,\dots,m-1. Once again, it is simple to see that ff and ff^{*} differ by at most 1/n1/n at each point.

In summary, in all cases, we have |Φ(S)|(nm)O(1)|\Phi(S)|\leq(nm)^{O(1)} resulting in the same asymptotic error as in the unconstrained case.

Runtime Analysis.

It is easy to see that each score value can be computed (via dynamic programming) in time poly(n)poly(H)\mathrm{poly}(n)\cdot\mathrm{poly}(H). Since the algorithm invokes the exponential mechanism at most 2T=O(εn)2^{T}=O(\varepsilon n) times, each over |Φ(S)|(nm)O(1)|\Phi(S)|\leq(nm)^{O(1)} candidates, the entire algorithm can be implemented in time (nm)O(1)(nm)^{O(1)} as claimed. (In the main body, we erroneously claimed that the running time was (nlogm)O(1)(n\log m)^{O(1)}, instead of (nm)O(1)(nm)^{O(1)}.)
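To illustrate the kind of dynamic program alluded to above, the following is a minimal sketch (our own naming; ℓ1 loss, no additional structural constraints, distinct x-values assumed) of one inner minimization min over monotone functions into a discretized range of the unnormalized absolute loss, computed in time O(|D_S|·H). Constrained variants add state to the same recursion (e.g., the number of remaining pieces).

```python
def min_isotonic_l1_loss(points, tau, theta, H):
    """Minimum of sum_{(x, y) in points} |f(x) - y| over monotone functions f
    mapping the points (sorted by x, assumed distinct) into the grid
    {tau, tau + d, ..., theta} with d = (theta - tau) / H.
    Runs in O(len(points) * H) time.
    """
    points = sorted(points)                      # process x-values left to right
    grid = [tau + (theta - tau) * v / H for v in range(H + 1)]
    INF = float("inf")
    dp = [0.0] * (H + 1)                         # dp[v]: best cost so far if the current value is grid[v]
    for _, y in points:
        prefix = [INF] * (H + 1)                 # prefix[v] = min_{u <= v} dp[u]  (enforces monotonicity)
        best = INF
        for v in range(H + 1):
            best = min(best, dp[v])
            prefix[v] = best
        dp = [prefix[v] + abs(grid[v] - y) for v in range(H + 1)]
    return min(dp)

# Example: data on the interval domain [1, 4] with range [0.0, 0.5] and H = 10.
D = [(1, 0.05), (2, 0.30), (3, 0.20), (4, 0.45)]
print(min_isotonic_l1_loss(D, tau=0.0, theta=0.5, H=10))
```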

Appendix D Missing Proofs from Section 4

For a set S𝒳S\subseteq\mathcal{X}, its lower closure and upper closure are defined as S:={x𝒳sS,xs}S^{\leq}:=\{x\in\mathcal{X}\mid\exists s\in S,x\leq s\} and S:={x𝒳sS,xs}S^{\geq}:=\{x\in\mathcal{X}\mid\exists s\in S,x\geq s\}, respectively. Similarly, the strict lower closure and strict upper closure are defined as S<:={x𝒳sS,x<s}S^{<}:=\{x\in\mathcal{X}\mid\exists s\in S,x<s\} and S>:={x𝒳sS,x>s}S^{>}:=\{x\in\mathcal{X}\mid\exists s\in S,x>s\}. When S=S=\emptyset, we use the convention that S=S<=S^{\leq}=S^{<}=\emptyset and S=S>=𝒳S^{\geq}=S^{>}=\mathcal{X}.

D.1 Proof of Theorem 1

We note that, in the proof below, we also consider the empty set to be an antichain.

Proof of Theorem 1.

We use the notations of [τ,θ]\ell_{[\tau,\theta]} and [τ,θ]abs\mathcal{L}^{\mathrm{abs}}_{[\tau,\theta]} as defined in the proof of Theorem 3.

Any monotone function f:𝒳[0,1]f:\mathcal{X}\to[0,1] corresponds to an antichain AA in 𝒳\mathcal{X} such that f(a)1/2f(a)\geq 1/2 for all aA>a\in A^{>} and f(a)1/2f(a)\leq 1/2 for all aAa\in A^{\leq}. Our algorithm works by first choosing this antichain AA in a DP manner using the exponential mechanism. The choice of AA partitions the poset into two parts A>A^{>} and AA^{\leq} and the algorithm recurses on these two parts to find functions f>:A>[1/2,1]f_{>}:A^{>}\to[1/2,1] and f:A[0,1/2]f_{\leq}:A^{\leq}\to[0,1/2], which are put together to obtain the final function.

In particular, the algorithm proceeds in TT stages, where in stage tt, the algorithm starts with a partition of 𝒳\mathcal{X} into 2t2^{t} parts {Pi,ti[2t]}\left\{P_{i,t}\mid i\in[2^{t}]\right\}, and the algorithm eventually outputs a monotone function ff such that f(x)[i/2t,(i+1)/2t]f(x)\in[i/2^{t},(i+1)/2^{t}] for all xPi,tx\in P_{i,t}. This partition is further refined for stage t+1t+1 by choosing an antichain Ai,tA_{i,t} in Pi,tP_{i,t} and partitioning Pi,tP_{i,t} into Pi,tAi,t>P_{i,t}\cap A_{i,t}^{>} and Pi,tAi,tP_{i,t}\cap A_{i,t}^{\leq}. In the final stage, the function ff is chosen to be the constant i/2T1i/2^{T-1} over Pi,T1P_{i,T-1}. A complete description is presented in Algorithm 3.

Algorithm 3 DP Isotonic Regression for General Posets
  Input: Poset 𝒳\mathcal{X}, dataset D={(x1,y1),,(xn,yn)}D=\{(x_{1},y_{1}),\ldots,(x_{n},y_{n})\}, DP parameter ε\varepsilon.
  Output: Monotone function f:𝒳[0,1]f:\mathcal{X}\to[0,1].
  
  Tlog(εn)T\leftarrow\left\lceil\log(\varepsilon n)\right\rceil and εε/T\varepsilon^{\prime}\leftarrow\varepsilon/T.
  P0,0𝒳P_{0,0}\leftarrow\mathcal{X}
  for t=0,,T1t=0,\ldots,T-1 do
     for i=0,,2t1i=0,\ldots,2^{t}-1 do
        
  • \triangleright

    Di,t{(xj,yj)j[n],xjPi,t}D_{i,t}\leftarrow\left\{(x_{j},y_{j})\mid j\in[n],x_{j}\in P_{i,t}\right\} (set of all input points whose xx belongs to Pi,tP_{i,t})

  • \triangleright

    𝒜i,t\mathcal{A}_{i,t}\leftarrow set of all antichains in Pi,tP_{i,t}.

    For each antichain A𝒜i,tA\in\mathcal{A}_{i,t}, we abuse notation to use

    • \bullet

      Di,tAD_{i,t}\cap A^{\leq} to denote{(x,y)Di,txA}\{(x,y)\in D_{i,t}\mid x\in A^{\leq}\}, and

    • \bullet

      Di,tA>D_{i,t}\cap A^{>} to denote {(x,y)Di,txA>}\{(x,y)\in D_{i,t}\mid x\in A^{>}\}.

  • \triangleright

    Choose antichain Ai,t𝒜i,tA_{i,t}\in\mathcal{A}_{i,t} using the exponential mechanism with the scoring function

    scorei,t(A)=\displaystyle\operatorname{score}_{i,t}(A)= minf1(Pi,tA,[i2t,i+0.52t])[i2t,i+12t]abs(f1;Di,tA)\displaystyle\min_{f_{1}\in\mathcal{F}(P_{i,t}\cap A^{\leq},[\frac{i}{2^{t}},\frac{i+0.5}{2^{t}}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^{t}},\frac{i+1}{2^{t}}]}(f_{1};D_{i,t}\cap A^{\leq})
    +minf2(Pi,tA>,[i+0.52t,i+12t])[i2t,i+12t]abs(f2;Di,tA>),\displaystyle\qquad+~{}\min_{f_{2}\in\mathcal{F}(P_{i,t}\cap A^{>},[\frac{i+0.5}{2^{t}},\frac{i+1}{2^{t}}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^{t}},\frac{i+1}{2^{t}}]}(f_{2};D_{i,t}\cap A^{>}),

    {scorei,t(A)\operatorname{score}_{i,t}(A) has sensitivity at most L/2tL/2^{t}.}

  • \triangleright

    P2i,t+1Pi,tAi,tP_{2i,t+1}\leftarrow P_{i,t}\cap A_{i,t}^{\leq} and P2i+1,t+1Pi,tAi,t>P_{2i+1,t+1}\leftarrow P_{i,t}\cap A_{i,t}^{>}.

  Let f:𝒳[0,1]f:\mathcal{X}\to[0,1] be given by f(x)=i/2T1f(x)=i/2^{T-1} for all xPi,T1x\in P_{i,T-1} and all i[2T1]i\in[2^{T-1}].
  return  ff

Before proceeding to prove the algorithm’s privacy and utility guarantees, we note that the output ff is indeed monotone: for every x<xx^{\prime}<x that gets separated when we partition Pi,tP_{i,t} into P2i,t+1,P2i+1,t+1P_{2i,t+1},P_{2i+1,t+1}, we must have xP2i,t+1x^{\prime}\in P_{2i,t+1} and xP2i+1,t+1x\in P_{2i+1,t+1} (otherwise two elements of the antichain Ai,tA_{i,t} would be comparable), and P2i,t+1P_{2i,t+1} is assigned the lower half of the range in all subsequent stages, so f(x)f(x)f(x^{\prime})\leq f(x).
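To make the candidate set 𝒜_{i,t} and the induced partition concrete, here is a minimal sketch (our own naming) that enumerates the antichains of a small poset and computes the two parts P∩A^≤ and P∩A^> for a chosen antichain. The exhaustive enumeration is only meant to mirror the analysis, since |𝒜_{i,t}| can be as large as roughly |𝒳|^width(𝒳).

```python
from itertools import combinations

def antichains(P, lt):
    """All antichains of the sub-poset P (including the empty antichain).
    lt(u, v) is True iff u < v.  Exhaustive, so only for small P."""
    result = []
    for k in range(len(P) + 1):
        for A in combinations(P, k):
            if all(not lt(a, b) and not lt(b, a) for a, b in combinations(A, 2)):
                result.append(A)
    return result

def split_by_antichain(P, A, lt):
    """Partition step of Algorithm 3: P ∩ A^<= (lower closure) and P ∩ A^> (strict upper closure).
    With the convention for A = {}, the lower part is empty and the upper part is all of P."""
    if not A:
        return [], list(P)
    lower = [x for x in P if any(x == a or lt(x, a) for a in A)]
    upper = [x for x in P if any(lt(a, x) for a in A)]
    return lower, upper

# Example: subsets of {a, b} ordered by proper inclusion.
P = [frozenset(), frozenset('a'), frozenset('b'), frozenset('ab')]
lt = lambda u, v: u < v
print(len(antichains(P, lt)))                                       # number of candidate antichains
print(split_by_antichain(P, (frozenset('a'), frozenset('b')), lt))  # parts for the maximal antichain {{a}, {b}}
```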

Privacy Analysis.

As in the proof of Theorem 3, each round tt is ε\varepsilon^{\prime}-DP, since the exponential mechanisms within a round operate on the disjoint sub-datasets Di,tD_{i,t}; thus the entire mechanism is ε\varepsilon-DP by basic composition of DP (Lemma 6) over the TT rounds.

Utility Analysis.

Since the sensitivity of scorei,t()\operatorname{score}_{i,t}(\cdot) is at most L/2tL/2^{t}, we have from Lemma 7 that for all t{0,,T1}t\in\{0,\dots,T-1\} and i[2t]i\in[2^{t}],

𝔼[scorei,t(Ai,t)minA𝒜i,tscorei,t(A)]O(Llog|𝒜i,t|ε2t)\displaystyle\mathbb{E}\left[\operatorname{score}_{i,t}(A_{i,t})-\min_{A\in\mathcal{A}_{i,t}}\operatorname{score}_{i,t}(A)\right]\leq O\left(\frac{L\cdot\log|\mathcal{A}_{i,t}|}{\varepsilon^{\prime}\cdot 2^{t}}\right) O(Lwidth(𝒳)log|𝒳|ε2t).\displaystyle\leq O\left(\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|}{\varepsilon^{\prime}\cdot 2^{t}}\right). (4)

To facilitate the subsequent steps of the proof, let us introduce additional notation. Let hi,th_{i,t} denote argminh(Pi,t,[i/2t,(i+1)/2t])abs(h;Di,t)\operatorname{argmin}_{h\in\mathcal{F}(P_{i,t},[i/2^{t},(i+1)/2^{t}])}\mathcal{L}^{\mathrm{abs}}(h;D_{i,t}) (with ties broken arbitrarily). Then, let A~i,t\tilde{A}_{i,t} denote the set of all maximal elements of hi,t1([i/2t,(i+1/2)/2t])h_{i,t}^{-1}([i/2^{t},(i+1/2)/2^{t}]). Under this notation, we have that

scorei,t(Ai,t)minA𝒜i,tscorei,t(A)\displaystyle\operatorname{score}_{i,t}(A_{i,t})-\min_{A\in\mathcal{A}_{i,t}}\operatorname{score}_{i,t}(A)
scorei,t(Ai,t)scorei,t(A~i,t)\displaystyle~{}\geq~{}\operatorname{score}_{i,t}(A_{i,t})-\operatorname{score}_{i,t}(\tilde{A}_{i,t}) (5)
=([i/2t,(i+1)/2t]abs(h2i,t+1;D2i,t+1)+[i/2t,(i+1)/2t]abs(h2i+1,t+1;D2i+1,t+1))\displaystyle~{}=~{}\left(\mathcal{L}^{\mathrm{abs}}_{[i/2^{t},(i+1)/2^{t}]}(h_{2i,t+1};D_{2i,t+1})+\mathcal{L}^{\mathrm{abs}}_{[i/2^{t},(i+1)/2^{t}]}(h_{2i+1,t+1};D_{2i+1,t+1})\right)
[i/2t,(i+1)/2t]abs(hi,t;Di,t)\displaystyle\qquad-\mathcal{L}^{\mathrm{abs}}_{[i/2^{t},(i+1)/2^{t}]}(h_{i,t};D_{i,t})
=abs(h2i,t+1;D2i,t+1)+abs(h2i+1,t+1;D2i+1,t+1)abs(hi,t;Di,t).\displaystyle~{}=~{}\mathcal{L}^{\mathrm{abs}}(h_{2i,t+1};D_{2i,t+1})+\mathcal{L}^{\mathrm{abs}}(h_{2i+1,t+1};D_{2i+1,t+1})-\mathcal{L}^{\mathrm{abs}}(h_{i,t};D_{i,t}). (6)

Finally, notice that

abs(f;Di,T1)abs(hi,T1;Di,T1)L2T1|Di,T1|=O(|Di,T1|εn).\displaystyle\mathcal{L}^{\mathrm{abs}}(f;D_{i,T-1})-\mathcal{L}^{\mathrm{abs}}(h_{i,T-1};D_{i,T-1})\leq\frac{L}{2^{T-1}}\cdot|D_{i,T-1}|=O\left(\frac{|D_{i,T-1}|}{\varepsilon n}\right). (7)

With all the ingredients ready, we may now bound the expected (unnormalized) excess risk. We have that

abs(f;D)\displaystyle\mathcal{L}^{\mathrm{abs}}(f;D) =i[2T1]abs(f;Di,T1)\displaystyle=\sum_{i\in[2^{T-1}]}\mathcal{L}^{\mathrm{abs}}(f;D_{i,T-1})
(7)i[2T1](O(|Di,T1|εn)+abs(hi,T1;Di,T1))\displaystyle\overset{\eqref{eq:rounding-err-util}}{\leq}\sum_{i\in[2^{T-1}]}\left(O\left(\frac{|D_{i,T-1}|}{\varepsilon n}\right)+\mathcal{L}^{\mathrm{abs}}(h_{i,T-1};D_{i,T-1})\right)
=O(1/ε)+i[2T1]abs(hi,T1;Di,T1)\displaystyle=O(1/\varepsilon)~{}+~{}\sum_{i\in[2^{T-1}]}\mathcal{L}^{\mathrm{abs}}(h_{i,T-1};D_{i,T-1})
=O(1/ε)+abs(h0,0;D0,0)\displaystyle=O(1/\varepsilon)~{}+~{}\mathcal{L}^{\mathrm{abs}}(h_{0,0};D_{0,0})
+t[T1]i[2t1](abs(h2i,t;D2i,t)+abs(h2i+1,t;D2i+1,t)abs(hi,t1;Di,t1)).\displaystyle\qquad+\sum_{t\in[T-1]}\sum_{i\in[2^{t-1}]}\left(\mathcal{L}^{\mathrm{abs}}(h_{2i,t};D_{2i,t})+\mathcal{L}^{\mathrm{abs}}(h_{2i+1,t};D_{2i+1,t})-\mathcal{L}^{\mathrm{abs}}(h_{i,t-1};D_{i,t-1})\right).

Taking the expectation on both sides and using (4) and (6) yields

𝔼[abs(f;D)]\displaystyle\mathbb{E}[\mathcal{L}^{\mathrm{abs}}(f;D)] O(1/ε)+abs(h0,0;D0,0)+t[T1]i[2t1]O(Lwidth(𝒳)log|𝒳|ε2t)\displaystyle\leq O(1/\varepsilon)~{}+~{}\mathcal{L}^{\mathrm{abs}}(h_{0,0};D_{0,0})~{}+~{}\sum_{t\in[T-1]}\sum_{i\in[2^{t-1}]}O\left(\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|}{\varepsilon^{\prime}\cdot 2^{t}}\right)
=O(1/ε)+abs(f;D)+t[T1]O(Lwidth(𝒳)log|𝒳|ε)\displaystyle=O(1/\varepsilon)~{}+~{}\mathcal{L}^{\mathrm{abs}}(f^{*};D)~{}+~{}\sum_{t\in[T-1]}O\left(\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|}{\varepsilon^{\prime}}\right)
=O(1/ε)+abs(f;D)+O(TLwidth(𝒳)log|𝒳|ε)\displaystyle=O(1/\varepsilon)~{}+~{}\mathcal{L}^{\mathrm{abs}}(f^{*};D)~{}+~{}O\left(T\cdot\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|}{\varepsilon^{\prime}}\right)
=O(1/ε)+abs(f;D)+O(T2Lwidth(𝒳)log|𝒳|ε)\displaystyle=O(1/\varepsilon)~{}+~{}\mathcal{L}^{\mathrm{abs}}(f^{*};D)~{}+~{}O\left(T^{2}\cdot\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|}{\varepsilon}\right)
=abs(f;D)+O(Lwidth(𝒳)log|𝒳|(1+log2(εn))ε).\displaystyle=\mathcal{L}^{\mathrm{abs}}(f^{*};D)~{}+~{}O\left(\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|\cdot(1+\log^{2}(\varepsilon n))}{\varepsilon}\right).

Dividing both sides by nn yields the desired claim. ∎