
Private Isotonic Regression

Badih Ghazi           Pritish Kamath           Ravi Kumar           Pasin Manurangsi
Google Research
Mountain View, CA, US
badihghazi@gmail.com, pritish@alum.mit.edu,
ravi.k53@gmail.com, pasin@google.com
Abstract

In this paper, we consider the problem of differentially private (DP) algorithms for isotonic regression. For the most general problem of isotonic regression over a partially ordered set (poset) $\mathcal{X}$ and for any Lipschitz loss function, we obtain a pure-DP algorithm that, given $n$ input points, has an expected excess empirical risk of roughly $\mathrm{width}(\mathcal{X})\cdot\log|\mathcal{X}|/n$, where $\mathrm{width}(\mathcal{X})$ is the width of the poset. In contrast, we also obtain a near-matching lower bound of roughly $(\mathrm{width}(\mathcal{X})+\log|\mathcal{X}|)/n$, that holds even for approximate-DP algorithms. Moreover, we show that the above bounds are essentially the best that can be obtained without utilizing any further structure of the poset. In the special case of a totally ordered set and for $\ell_1$ and $\ell_2^2$ losses, our algorithm can be implemented in near-linear running time; we also provide extensions of this algorithm to the problem of private isotonic regression with additional structural constraints on the output function.

1 Introduction

Isotonic regression is a basic primitive in statistics and machine learning, which has been studied at least since the 1950s [4, 9]; see also the textbooks on the topic [5, 38]. It has seen applications in numerous fields including medicine [31, 39] where the expression of an antigen is modeled as a monotone function of the DNA index and WBC count, and education [19], where isotonic regression was used to predict college GPA using high school GPA and standardized test scores. Isotonic regression is also arguably the most common non-parametric method for calibrating machine learning models [51], including modern neural networks [23].

In this paper, we study isotonic regression with a differential privacy (DP) constraint on the output model. DP [17, 16] is a highly popular notion of privacy for algorithms and machine learning primitives, and has seen increased adoption due to its powerful guarantees and properties [37, 43]. Despite the plethora of work on DP statistics and machine learning (see Section 5 for related work), ours is, to the best of our knowledge, the first to study DP isotonic regression.

In fact, we consider the most general version of the isotonic regression problem. We first set up some notation to describe our results. Let $(\mathcal{X},\leq)$ be any partially ordered set (poset). A function $f:\mathcal{X}\to[0,1]$ is monotone if and only if $f(x)\leq f(x')$ for all $x\leq x'$. For brevity, we use $\mathcal{F}(\mathcal{X},\mathcal{Y})$ to denote the set of all monotone functions from $\mathcal{X}$ to $\mathcal{Y}$; throughout, we consider $\mathcal{Y}\subseteq[0,1]$.

Let $[n]$ denote $\{1,\ldots,n\}$. Given an input dataset $D=\{(x_1,y_1),\dots,(x_n,y_n)\}\in(\mathcal{X}\times[0,1])^n$, let the empirical risk of a function $f:\mathcal{X}\to[0,1]$ be $\mathcal{L}(f;D):=\frac{1}{n}\sum_{i\in[n]}\ell(f(x_i),y_i)$, where $\ell:[0,1]\times[0,1]\to\mathbb{R}$ is a loss function.

We study private isotonic regression in the basic machine learning framework of empirical risk minimization. Specifically, the goal of the isotonic regression problem, given a dataset $D=\{(x_1,y_1),\dots,(x_n,y_n)\}\in(\mathcal{X}\times[0,1])^n$, is to find a monotone function $f:\mathcal{X}\to[0,1]$ that minimizes $\mathcal{L}(f;D)$. The excess empirical risk of a function $f$ is defined as $\mathcal{L}(f;D)-\mathcal{L}(f^*;D)$, where $f^*:=\operatorname{argmin}_{g\in\mathcal{F}(\mathcal{X},\mathcal{Y})}\mathcal{L}(g;D)$.

1.1 Our Results

General Posets.

Our first contribution is to give nearly tight upper and lower bounds for any poset, based on its width, as stated below (see Section 4 for a formal definition).

Theorem 1 (Upper Bound for General Poset).

Let $\mathcal{X}$ be any finite poset and let $\ell$ be an $L$-Lipschitz loss function. For any $\varepsilon\in(0,1]$, there is an $\varepsilon$-DP algorithm for isotonic regression for $\ell$ with expected excess empirical risk at most $O\left(\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|\cdot(1+\log^2(\varepsilon n))}{\varepsilon n}\right)$.

Theorem 2 (Lower Bound for General Poset; Informal).

For any $\varepsilon\in(0,1]$ and any $\delta<0.01\cdot\varepsilon/|\mathcal{X}|$, any $(\varepsilon,\delta)$-DP algorithm for isotonic regression for a “nice” loss function $\ell$ must have expected excess empirical risk $\Omega\left(\frac{\operatorname{width}(\mathcal{X})+\log|\mathcal{X}|}{\varepsilon n}\right)$.

While our upper and lower bounds do not exactly match because of the multiplication-vs-addition of $\log|\mathcal{X}|$, we show in Section 4.3 that there are posets for which each bound is tight. In other words, this gap cannot be closed for generic posets.

Totally Ordered Sets.

The above upper and lower bounds immediately translate to the case of totally ordered sets, by plugging in $\operatorname{width}(\mathcal{X})=1$. More importantly, we give efficient algorithms in this case, which run in time $\tilde{O}(n^2+n\log|\mathcal{X}|)$ for a general loss function $\ell$, and in nearly linear $\tilde{O}(n\cdot\log|\mathcal{X}|)$ time for the widely studied $\ell_2^2$- and $\ell_1$-losses. (Recall that the $\ell_2^2$-loss is $\ell_2^2(y,y')=(y-y')^2$ and the $\ell_1$-loss is $\ell_1(y,y')=|y-y'|$.)

Theorem 3.

For all finite totally ordered sets $\mathcal{X}$, $L$-Lipschitz loss functions $\ell$, and $\varepsilon\in(0,1]$, there is an $\varepsilon$-DP algorithm for isotonic regression for $\ell$ with expected excess empirical risk $O\left(\frac{L\cdot(\log|\mathcal{X}|)\cdot(1+\log^2(\varepsilon n))}{\varepsilon n}\right)$. The running time of this algorithm is $\tilde{O}(n^2+n\log|\mathcal{X}|)$ in general and can be improved to $\tilde{O}(n\log|\mathcal{X}|)$ for $\ell_1$ and $\ell_2^2$ losses.

We are not aware of any prior work on private isotonic regression. A simple baseline algorithm for this problem would be to use the exponential mechanism over the set of all monotone functions taking values in a discretized set, to choose one with small loss. We show in Appendix A that this achieves an excess empirical risk of $O(L\cdot\sqrt{\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|/\varepsilon n})$, which is quadratically worse than the bound in Theorem 1. Moreover, even in the case of a totally ordered set, it is unclear how to implement such a mechanism efficiently.

We demonstrate the flexibility of our techniques by showing that they can be extended to variants of isotonic regression where, in addition to monotonicity, we also require $f$ to satisfy additional properties. For example, we may want $f$ to be $L_f$-Lipschitz for some specified $L_f>0$. Other constraints we can handle include $k$-piecewise constant, $k$-piecewise linear, convexity, and concavity. For each of these constraints, we devise an algorithm that yields essentially the same error as in the unconstrained case and still runs in polynomial time.

Theorem 4.

For all finite totally ordered sets $\mathcal{X}$, $L$-Lipschitz loss functions $\ell$, and $\varepsilon\in(0,1]$, there is an $\varepsilon$-DP algorithm for $k$-piecewise constant, $k$-piecewise linear, Lipschitz, convex, or concave isotonic regression for $\ell$ with expected excess empirical risk $\tilde{O}\left(\frac{L\cdot(\log|\mathcal{X}|)}{\varepsilon n}\right)$. The running time of this algorithm is $(n|\mathcal{X}|)^{O(1)}$.

Organization.

We next provide necessary background on DP. In Section 3, we prove our results for totally ordered sets (including Theorem 3). We then move on to discuss general posets in Section 4. Section 5 contains additional related work. Finally, we conclude with some discussion in Section 6. Due to space constraints, we omit some proofs from the main body; these can be found in the Appendix.

2 Background on Differential Privacy

Two datasets $D=\{(x_1,y_1),\ldots,(x_n,y_n)\}$ and $D'=\{(x'_1,y'_1),\ldots,(x'_n,y'_n)\}$ are said to be neighboring, denoted $D\sim D'$, if there is an index $i\in[n]$ such that $(x_j,y_j)=(x'_j,y'_j)$ for all $j\in[n]\setminus\{i\}$. We recall the formal definition of differential privacy [18, 16]:

Definition 5 (Differential Privacy (DP) [18, 16]).

Let $\varepsilon>0$ and $\delta\in[0,1]$. A randomized algorithm $\mathcal{M}:\mathcal{X}^n\to\mathcal{Y}$ is $(\varepsilon,\delta)$-differentially private ($(\varepsilon,\delta)$-DP) if, for all $D\sim D'$ and all (measurable) outcomes $S\subseteq\mathcal{Y}$, we have that $\Pr[\mathcal{M}(D)\in S]\leq e^\varepsilon\cdot\Pr[\mathcal{M}(D')\in S]+\delta$.

We denote $(\varepsilon,0)$-DP as $\varepsilon$-DP (aka pure-DP). The case when $\delta>0$ is referred to as approximate-DP.

We will use the following composition theorems throughout our proofs.

Lemma 6.

$(\varepsilon,\delta)$-DP satisfies the following:

  • Basic Composition: If mechanisms $\mathcal{M}_1,\ldots,\mathcal{M}_t$ are such that $\mathcal{M}_i$ satisfies $(\varepsilon_i,\delta_i)$-DP, then the composed mechanism $(\mathcal{M}_1(D),\ldots,\mathcal{M}_t(D))$ satisfies $(\sum_i\varepsilon_i,\sum_i\delta_i)$-DP. This holds even under adaptive composition, where each $\mathcal{M}_i$ can depend on the outputs of $\mathcal{M}_1,\ldots,\mathcal{M}_{i-1}$.

  • Parallel Composition [33]: If a mechanism $\mathcal{M}$ satisfies $(\varepsilon,\delta)$-DP, then for any partition of $D=D_1\sqcup\cdots\sqcup D_t$, the composed mechanism given as $(\mathcal{M}(D_1),\ldots,\mathcal{M}(D_t))$ satisfies $(\varepsilon,\delta)$-DP.

Exponential Mechanism.

The exponential mechanism solves the basic task of selection in data analysis: given a dataset $D\in\mathcal{Z}^n$ and a set $\mathcal{A}$ of options, it outputs the (approximately) best option, where “best” is defined by a scoring function $\mathfrak{s}:\mathcal{A}\times\mathcal{Z}^n\to\mathbb{R}$. The $\varepsilon$-DP exponential mechanism [34] is the randomized mechanism $\mathcal{M}:\mathcal{Z}^n\to\mathcal{A}$ given by

$$\forall~D\in\mathcal{Z}^n,\ a\in\mathcal{A}:\quad \Pr[\mathcal{M}(D)=a]~\propto~\exp\left(-\frac{\varepsilon}{2\Delta_{\mathfrak{s}}}\cdot\mathfrak{s}(a,D)\right),$$

where $\Delta_{\mathfrak{s}}:=\sup_{D\sim D'}\max_{a\in\mathcal{A}}|\mathfrak{s}(a,D)-\mathfrak{s}(a,D')|$ is the sensitivity of the scoring function.

Lemma 7 ([34]).

For $\mathcal{M}$ being the $\varepsilon$-DP exponential mechanism, it holds for all $D\in\mathcal{Z}^n$ that

$$\mathbb{E}[\mathfrak{s}(\mathcal{M}(D),D)]~\leq~\min_{a\in\mathcal{A}}\,\mathfrak{s}(a,D)~+~\frac{2\Delta_{\mathfrak{s}}}{\varepsilon}\log|\mathcal{A}|.$$

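To make the selection step concrete, the following is a minimal Python sketch (not the paper's implementation) of sampling from the exponential mechanism over a finite option set; `scores`, `sensitivity`, and `eps` are assumed inputs, and the Gumbel-max trick is used only for numerical stability.

```python
import numpy as np

def exponential_mechanism(scores, sensitivity, eps, rng=None):
    """Sample an index a with Pr[a] proportional to exp(-eps * scores[a] / (2 * sensitivity)).

    Lower score is better, matching the convention in Lemma 7.
    The Gumbel-max trick avoids overflow when |scores| is large.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = -eps * np.asarray(scores, dtype=float) / (2.0 * sensitivity)
    gumbel = rng.gumbel(size=len(scores))
    return int(np.argmax(logits + gumbel))
```

In Algorithm 1 below, `scores` would be the vector $(\operatorname{score}_{i,t}(\alpha))_\alpha$ and `sensitivity` would be $L/2^t$.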
Lower Bound for Privatizing Vectors.

Lower bounds for DP algorithms that can output a binary vector that is close (say, in the Hamming distance) to the input are well-known.

Lemma 8 (e.g., [32]).

Let $\varepsilon,\delta>0$, $m\in\mathbb{N}$, let the input domain be $\{0,1\}^m$, and let two vectors $\mathbf{z},\mathbf{z}'\in\{0,1\}^m$ be neighbors if and only if $\|\mathbf{z}-\mathbf{z}'\|_0\leq 1$. Then, for any $(\varepsilon,\delta)$-DP algorithm $\mathcal{M}:\{0,1\}^m\to\{0,1\}^m$, we have $\mathbb{E}_{\mathbf{z}\sim\{0,1\}^m}[\|\mathcal{M}(\mathbf{z})-\mathbf{z}\|_0]\geq e^{-\varepsilon}\cdot m\cdot 0.5\cdot(1-\delta)$.

It is also simple to extend this lower bound to the case where the vector is not binary, as stated below. We defer the full proof to Appendix B.

Lemma 9.

Let $D,m$ be any positive integers such that $D\geq 2$, let the input domain be $[D]^m$, and let two vectors $\mathbf{z},\mathbf{z}'\in[D]^m$ be neighbors if and only if $\|\mathbf{z}-\mathbf{z}'\|_0\leq 1$. Then, for any $(\ln(D/2),0.25)$-DP algorithm $\mathcal{M}:[D]^m\to[D]^m$, we have that $\mathbb{E}_{\mathbf{z}\sim[D]^m}[\|\mathcal{M}(\mathbf{z})-\mathbf{z}\|_0]\geq\Omega(m)$.

Group Differential Privacy.

For any neighboring relation $\sim$, we write $\sim_k$ as the neighboring relation where $D\sim_k D'$ iff there is a sequence $D=D_0,\dots,D_{k'}=D'$ for some $k'\leq k$ such that $D_{i-1}\sim D_i$ for all $i\in[k']$.

Fact 10 (e.g., [41]).

Let $\varepsilon>0$, $\delta\geq 0$, and $k\in\mathbb{N}$. Suppose that $\mathcal{M}$ is an $(\varepsilon,\delta)$-DP algorithm for the neighboring relation $\sim$. Then $\mathcal{M}$ is $\left(k\varepsilon,\frac{e^{k\varepsilon}-1}{e^{\varepsilon}-1}\cdot\delta\right)$-DP for the neighboring relation $\sim_k$.

3 DP Isotonic Regression over Total Orders

We first focus on the “one-dimensional” case where $\mathcal{X}$ is totally ordered; for convenience, we assume that $\mathcal{X}=[m]$ where the order is the natural order on integers. We present an efficient algorithm for this case and then a nearly matching lower bound.

3.1 An Efficient Algorithm

To describe our algorithm, it will be more convenient to use the unnormalized version of the empirical risk, which we define as $\mathcal{L}^{\mathrm{abs}}(f;D):=\sum_{(x,y)\in D}\ell(f(x),y)$.

We now provide a high-level overview of our algorithm. Any monotone function $f:[m]\to[0,1]$ contains a (not necessarily unique) threshold $\alpha\in\{0\}\cup[m]$ such that $f(a)\geq 1/2$ for all $a>\alpha$ and $f(a)\leq 1/2$ for all $a\leq\alpha$. Our algorithm works by first choosing this threshold $\alpha$ in a private manner using the exponential mechanism. The choice of $\alpha$ partitions $[m]$ into $[m]^{>\alpha}:=\{a\in[m]\mid a>\alpha\}$ and $[m]^{\leq\alpha}:=\{a\in[m]\mid a\leq\alpha\}$. The algorithm recurses on these two parts to find functions $f_>:[m]^{>\alpha}\to[1/2,1]$ and $f_\leq:[m]^{\leq\alpha}\to[0,1/2]$, which are then glued to obtain the final function.

In particular, the algorithm proceeds in $T$ stages, where in stage $t$, the algorithm starts with a partition of $[m]$ into $2^t$ intervals $\{P_{i,t}\mid i\in\{0,\dots,2^t-1\}\}$, and the algorithm eventually outputs a monotone function $f$ such that $f(x)\in[i/2^t,(i+1)/2^t]$ for all $x\in P_{i,t}$. This partition is further refined for stage $t+1$ by choosing a threshold $\alpha_{i,t}$ in $P_{i,t}$ and partitioning $P_{i,t}$ into $P_{i,t}^{>\alpha_{i,t}}$ and $P_{i,t}^{\leq\alpha_{i,t}}$. In the final stage, the function $f$ is chosen to be the constant $i/2^{T-1}$ over $P_{i,T-1}$. Note that the algorithm may stop at $T=\Theta_\varepsilon(\log n)$ because the Lipschitzness of $\ell$ ensures that assigning each partition to the constant $i/2^{T-1}$ cannot increase the error by more than $L/2^T\leq O_\varepsilon(L/n)$.

We have already mentioned above that each $\alpha_{i,t}$ has to be chosen in a private manner. However, if we let the scoring function directly be the unnormalized empirical risk, then its sensitivity remains as large as $L$ even at a large stage $t$. This is undesirable because the error from each run of the exponential mechanism can be as large as $O(L\cdot\log m)$, and there are as many as $2^t$ runs in stage $t$. Adding these error terms up would result in a far larger total error than desired.

To circumvent this, we observe that while the sensitivity can still be large, it is mostly “ineffective” because the function range is now restricted to an interval of length $1/2^t$. Indeed, we may use the following “clipped” version of the loss function, which has a low sensitivity of $L/2^t$ instead.

Definition 11 (Clipped Loss Function).

For a range $[\tau,\theta]\subseteq[0,1]$, let $\ell_{[\tau,\theta]}:[\tau,\theta]\times[0,1]\to\mathbb{R}$ be given as $\ell_{[\tau,\theta]}(\hat{y},y):=\ell(\hat{y},y)-\min_{y'\in[\tau,\theta]}\ell(y',y)$. Similar to above, we also define $\mathcal{L}^{\mathrm{abs}}_{[\tau,\theta]}(f;D):=\sum_{(x,y)\in D}\ell_{[\tau,\theta]}(f(x),y)$.

Observe that $\operatorname{range}(\ell_{[\tau,\theta]})\subseteq[0,L\cdot(\theta-\tau)]$, since $\ell$ is $L$-Lipschitz. In other words, the sensitivity of $\mathcal{L}^{\mathrm{abs}}_{[\tau,\theta]}(f;D)$ is only $L\cdot(\theta-\tau)$. Algorithm 1 contains a full description.

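As a small illustration of Definition 11 (a sketch, not part of the paper's pseudocode), here is the clipped loss for the $\ell_2^2$-loss, for which the minimizer of $\ell(y',y)$ over $y'\in[\tau,\theta]$ is simply $y$ clipped to $[\tau,\theta]$.

```python
def clip(v, lo, hi):
    return min(hi, max(lo, v))

def clipped_sq_loss(y_hat, y, tau, theta):
    """ell_{[tau,theta]}(y_hat, y) for the squared loss ell(a, b) = (a - b)^2.

    Subtracting the best achievable loss within [tau, theta] keeps the value in
    [0, 2 * (theta - tau)], i.e., sensitivity L * (theta - tau) with L = 2 on [0, 1].
    """
    assert tau <= y_hat <= theta
    best = (clip(y, tau, theta) - y) ** 2      # min over y' in [tau, theta] of (y' - y)^2
    return (y_hat - y) ** 2 - best
```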
Algorithm 1 DP Isotonic Regression for Totally Ordered Sets.
  Input: $\mathcal{X}=[m]$, dataset $D=\{(x_1,y_1),\ldots,(x_n,y_n)\}$, DP parameter $\varepsilon$.
  Output: Monotone function $f:[m]\to[0,1]$.

  $T\leftarrow\lceil\log(\varepsilon n)\rceil$
  $\varepsilon'\leftarrow\varepsilon/T$
  $P_{0,0}\leftarrow[m]$
  for $t=0,\ldots,T-1$ do
     for $i=0,\ldots,2^t-1$ do
        $D_{i,t}\leftarrow\{(x_j,y_j)\mid j\in[n],\ x_j\in P_{i,t}\}$  {Set of all input points whose $x$ belongs to $P_{i,t}$}
        {Notation: Define $D_{i,t}^{\leq\alpha}:=\{(x,y)\in D_{i,t}\mid x\leq\alpha\}$ and $D_{i,t}^{>\alpha}$ similarly}
        {Notation: Define $P_{i,t}^{\leq\alpha}:=\{x\in P_{i,t}\mid x\leq\alpha\}$ and $P_{i,t}^{>\alpha}$ similarly}
        Choose threshold $\alpha_{i,t}\in\{0\}\cup P_{i,t}$, using the $\varepsilon'$-DP exponential mechanism with scoring function
           $\operatorname{score}_{i,t}(\alpha) := \min_{f_1\in\mathcal{F}(P_{i,t}^{\leq\alpha},[\frac{i}{2^t},\frac{i+0.5}{2^t}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t},\frac{i+1}{2^t}]}(f_1;D_{i,t}^{\leq\alpha}) + \min_{f_2\in\mathcal{F}(P_{i,t}^{>\alpha},[\frac{i+0.5}{2^t},\frac{i+1}{2^t}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t},\frac{i+1}{2^t}]}(f_2;D_{i,t}^{>\alpha})$
        {Note: $\operatorname{score}_{i,t}(\alpha)$ has sensitivity at most $L/2^t$.}
        $P_{2i,t+1}\leftarrow P_{i,t}^{\leq\alpha_{i,t}}$ and $P_{2i+1,t+1}\leftarrow P_{i,t}^{>\alpha_{i,t}}$
  Let $f:[m]\to[0,1]$ be given as $f(x)=i/2^{T-1}$ for all $x\in P_{i,T-1}$ and all $i\in\{0,\ldots,2^{T-1}-1\}$.
  return $f$
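The following is a minimal Python sketch of the recursion in Algorithm 1, assuming the squared loss (so $L=2$) and taking the restricted-optimum subroutine `restricted_loss` as a hypothetical parameter. It illustrates the control flow only; for simplicity the clipping window here coincides with the allowed value range, which differs slightly from the score defined above, so this is not a faithful reproduction of the authors' implementation or its guarantees.

```python
import numpy as np

def dp_isotonic_total_order(points, m, eps, restricted_loss, rng=None):
    """Sketch of Algorithm 1's recursion (illustration only, not the authors' code).

    points          : list of (x, y) pairs with x in {1, ..., m}, y in [0, 1].
    restricted_loss : callable(points, lo, hi) returning the optimal clipped loss over
                      monotone functions with values in [lo, hi]; must return 0.0 on [].
    Returns a list f of length m with f[x-1] in [0, 1], monotone in x.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = max(len(points), 1)
    T = max(1, int(np.ceil(np.log2(eps * n)))) if eps * n > 1 else 1
    eps_prime = eps / T
    L = 2.0  # Lipschitz constant of the squared loss on [0, 1] (assumption)

    parts = [list(range(1, m + 1))]          # P_{0,0} = [m]
    data = [list(points)]                    # D_{0,0}
    for t in range(T - 1):
        new_parts, new_data = [], []
        for i, (P, Dp) in enumerate(zip(parts, data)):
            lo, mid, hi = i / 2**t, (i + 0.5) / 2**t, (i + 1) / 2**t
            candidates = [0] + P             # threshold alpha in {0} union P_{i,t}
            scores = []
            for a in candidates:
                left = [(x, y) for (x, y) in Dp if x <= a]
                right = [(x, y) for (x, y) in Dp if x > a]
                scores.append(restricted_loss(left, lo, mid) + restricted_loss(right, mid, hi))
            # epsilon'-DP exponential mechanism; the clipped score has sensitivity ~ L / 2^t
            logits = -eps_prime * np.array(scores) / (2.0 * (L / 2**t))
            alpha = candidates[int(np.argmax(logits + rng.gumbel(size=len(logits))))]
            new_parts += [[x for x in P if x <= alpha], [x for x in P if x > alpha]]
            new_data += [[(x, y) for (x, y) in Dp if x <= alpha],
                         [(x, y) for (x, y) in Dp if x > alpha]]
        parts, data = new_parts, new_data

    f = [0.0] * m
    for i, P in enumerate(parts):            # constant i / 2^{T-1} on P_{i,T-1}
        for x in P:
            f[x - 1] = i / 2**(T - 1)
    return f
```

For the $\ell_2^2$-loss, `restricted_loss` can be instantiated with the prefix isotonic regression routines sketched later in this section.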
Proof of Theorem 3.

Before proceeding to prove the algorithm's privacy and utility guarantees, we note that the output $f$ is indeed monotone, since for every $x'<x$ that gets separated when we partition $P_{i,t}$ into $P_{2i,t+1},P_{2i+1,t+1}$, we must have $x'\in P_{2i,t+1}$ and $x\in P_{2i+1,t+1}$.

Privacy Analysis.

Since the exponential mechanism is $\varepsilon'$-DP and the dataset is partitioned with the exponential mechanism being applied to each part only once, the parallel composition property (Lemma 6) implies that the entire subroutine for each $t$ is $\varepsilon'$-DP. Thus, by basic composition (Lemma 6), it follows that Algorithm 1 is $\varepsilon$-DP (since $\varepsilon=\varepsilon' T$).

Utility Analysis.

Since the sensitivity of $\operatorname{score}_{i,t}(\cdot)$ is at most $L/2^t$, we have from Lemma 7 that for all $t\in\{0,\dots,T-1\}$ and $i\in\{0,1,\ldots,2^t-1\}$,

$$\mathbb{E}\left[\operatorname{score}_{i,t}(\alpha_{i,t})-\min_{\alpha\in P_{i,t}}\operatorname{score}_{i,t}(\alpha)\right]~\leq~O\left(\frac{L\cdot\log|P_{i,t}|}{\varepsilon'\cdot 2^t}\right)~\leq~O\left(\frac{L\cdot\log m}{\varepsilon'\cdot 2^t}\right). \qquad (1)$$

Let $h_{i,t}$ denote $\operatorname{argmin}_{h\in\mathcal{F}(P_{i,t},[i/2^t,(i+1)/2^t])}\mathcal{L}^{\mathrm{abs}}(h;D_{i,t})$ (with ties broken arbitrarily). Then, let $\tilde{\alpha}_{i,t}$ denote the largest element in $P_{i,t}$ such that $h_{i,t}(\tilde{\alpha}_{i,t})\leq(i+1/2)/2^t$; namely, $\tilde{\alpha}_{i,t}$ is the optimal threshold when restricted to $D_{i,t}$. Under this notation, we have that

$$\begin{aligned}
&\operatorname{score}_{i,t}(\alpha_{i,t})-\min_{\alpha\in P_{i,t}}\operatorname{score}_{i,t}(\alpha)\\
&\quad\geq~\operatorname{score}_{i,t}(\alpha_{i,t})-\operatorname{score}_{i,t}(\tilde{\alpha}_{i,t})\\
&\quad=~\left(\mathcal{L}^{\mathrm{abs}}_{[i/2^t,(i+1)/2^t]}(h_{2i,t+1};D_{2i,t+1})+\mathcal{L}^{\mathrm{abs}}_{[i/2^t,(i+1)/2^t]}(h_{2i+1,t+1};D_{2i+1,t+1})\right)-\mathcal{L}^{\mathrm{abs}}_{[i/2^t,(i+1)/2^t]}(h_{i,t};D_{i,t})\\
&\quad=~\mathcal{L}^{\mathrm{abs}}(h_{2i,t+1};D_{2i,t+1})+\mathcal{L}^{\mathrm{abs}}(h_{2i+1,t+1};D_{2i+1,t+1})-\mathcal{L}^{\mathrm{abs}}(h_{i,t};D_{i,t}). \qquad (2)
\end{aligned}$$

Finally, notice that

$$\mathcal{L}^{\mathrm{abs}}(f;D_{i,T-1})-\mathcal{L}^{\mathrm{abs}}(h_{i,T-1};D_{i,T-1})~\leq~\frac{L}{2^{T-1}}\cdot|D_{i,T-1}|~=~O\left(\frac{L\cdot|D_{i,T-1}|}{\varepsilon n}\right). \qquad (3)$$

With all the ingredients ready, we may now bound the expected (unnormalized) excess risk:

$$\begin{aligned}
\mathcal{L}^{\mathrm{abs}}(f;D)&~=~\sum_{0\leq i<2^{T-1}}\mathcal{L}^{\mathrm{abs}}(f;D_{i,T-1})\\
&~\overset{(3)}{\leq}~\sum_{0\leq i<2^{T-1}}\left(O\left(\frac{L\cdot|D_{i,T-1}|}{\varepsilon n}\right)+\mathcal{L}^{\mathrm{abs}}(h_{i,T-1};D_{i,T-1})\right)\\
&~=~O(L/\varepsilon)~+~\sum_{0\leq i<2^{T-1}}\mathcal{L}^{\mathrm{abs}}(h_{i,T-1};D_{i,T-1})\\
&~=~O(L/\varepsilon)~+~\mathcal{L}^{\mathrm{abs}}(h_{0,0};D_{0,0})\\
&\qquad+~\sum_{\substack{t\in[T-1]\\ 0\leq i<2^{t-1}}}\left(\mathcal{L}^{\mathrm{abs}}(h_{2i,t};D_{2i,t})+\mathcal{L}^{\mathrm{abs}}(h_{2i+1,t};D_{2i+1,t})-\mathcal{L}^{\mathrm{abs}}(h_{i,t-1};D_{i,t-1})\right).
\end{aligned}$$

Taking the expectation on both sides and using (1) and (2) yields

$$\begin{aligned}
\mathbb{E}[\mathcal{L}^{\mathrm{abs}}(f;D)]&~\leq~O(L/\varepsilon)~+~\mathcal{L}^{\mathrm{abs}}(h_{0,0};D_{0,0})~+~\sum_{\substack{t\in[T-1]\\ 0\leq i<2^{t-1}}}O\left(\frac{L\cdot\log m}{\varepsilon'\cdot 2^t}\right)\\
&~=~O(L/\varepsilon)~+~\mathcal{L}^{\mathrm{abs}}(f^*;D)~+~O\left(T^2\cdot\frac{L\cdot\log m}{\varepsilon}\right)\\
&~=~\mathcal{L}^{\mathrm{abs}}(f^*;D)~+~O\left(\frac{L\cdot\log m\cdot(1+\log^2(\varepsilon n))}{\varepsilon}\right).
\end{aligned}$$

Dividing both sides by $n$ yields the desired claim.

Running Time.

To obtain a bound on the running time for general loss functions, we need to make a slight modification to the algorithm: we additionally restrict the range of $f_1,f_2$ to multiples of $1/2^{T-1}$. We remark that this does not affect the utility, since we anyway always take a final output whose values are multiples of $1/2^{T-1}$.

Given any dataset $D=\{(x_1,y_1),\dots,(x_n,y_n)\}$ where $x_1<\cdots<x_n$, the prefix isotonic regression problem is to compute, for each $i\in[n]$, the optimal loss of isotonic regression on $(x_1,y_1),\dots,(x_i,y_i)$. Straightforward dynamic programming solves this in $O(n\cdot v)$ time, where $v$ denotes the number of possible values allowed in the function.

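For concreteness, here is a minimal sketch of that dynamic program for the squared loss (an assumption; the paper allows any Lipschitz loss), returning the optimal prefix losses over monotone functions whose values lie in a given sorted grid.

```python
def prefix_isotonic_grid(ys, values):
    """Optimal isotonic-regression loss on each prefix of ys, with fitted values
    restricted to the ascending grid `values` (squared loss).

    Runs in O(n * v) time: dp[k] = best loss so far if the last fitted value is
    values[k]; a running prefix-minimum enforces monotonicity.
    """
    v = len(values)
    dp = [0.0] * v
    prefix_losses = []
    for y in ys:
        best_so_far = float("inf")
        for k in range(v):
            best_so_far = min(best_so_far, dp[k])          # min over values[0..k] (old dp)
            dp[k] = best_so_far + (values[k] - y) ** 2     # assign values[k] to this point
        prefix_losses.append(min(dp))
    return prefix_losses
```

In the modified Algorithm 1, `values` would be the multiples of $1/2^{T-1}$ inside the current value range, which is where the $O(|D_{i,t}|\cdot 2^{T-t})$ term in the bound that follows comes from.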
Now, for each $i,t$, we may run the above algorithm with $D=D_{i,t}$ and the allowed values being all multiples of $1/2^{T-1}$ in $[\frac{i}{2^t},\frac{i+0.5}{2^t}]$; this gives us $\min_{f_1\in\mathcal{F}(P_{i,t}^{\leq\alpha},[\frac{i}{2^t},\frac{i+0.5}{2^t}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t},\frac{i+1}{2^t}]}(f_1;D_{i,t}^{\leq\alpha})$ for all $\alpha\in P_{i,t}$ in time $O(|D_{i,t}|\cdot 2^{T-t}+|P_{i,t}|)$. Analogously, we can also compute $\min_{f_2\in\mathcal{F}(P_{i,t}^{>\alpha},[\frac{i+0.5}{2^t},\frac{i+1}{2^t}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t},\frac{i+1}{2^t}]}(f_2;D_{i,t}^{>\alpha})$ for all $\alpha\in P_{i,t}$ in a similar time. Thus, we can compute $(\operatorname{score}_{i,t}(\alpha))_{\alpha\in P_{i,t}}$ in time $O(|D_{i,t}|\cdot 2^{T-t}+|P_{i,t}|)$, and then sample accordingly.

We can further speed up the algorithm by observing that the score remains constant for all $\alpha\in[x_i,x_{i+1})$. Hence, we may first sample an interval among $[0,x_1),[x_1,x_2),\ldots,[x_{n-1},x_n),[x_n,m)$ and then sample $\alpha_{i,t}$ uniformly from that interval. This entire process can be done in $O(|D_{i,t}|\cdot 2^{T-t}+\log m)$ time. In total, the running time of the algorithm is thus

$$\sum_{t=0}^{T-1}\sum_{i=0}^{2^t-1}O(|D_{i,t}|\cdot 2^{T-t}+\log m)~\leq~\sum_{t=0}^{T-1}O(n 2^{T}+2^t\log m)~\leq~O(n^2\log n+n\log m).$$

Near-Linear Time Algorithms for $\ell_1$- and $\ell_2^2$-Losses.

We now describe faster algorithms for the $\ell_1$- and $\ell_2^2$-loss functions, thereby proving the last part of Theorem 3. The key observation is that for convex loss functions, the restricted optimum is simple: we just have to “clip” the optimal function to be in the range $[\tau,\theta]$. Below, $\operatorname{clip}_{[\tau,\theta]}$ denotes the function $y\mapsto\min\{\theta,\max\{\tau,y\}\}$.

Observation 12.

Let $\ell$ be any convex loss function, $D$ any dataset, $f^*\in\operatorname{argmin}_{f\in\mathcal{F}(\mathcal{X},\mathcal{Y})}\mathcal{L}(f;D)$, and $\tau\leq\theta$ any real numbers such that $\tau,\theta\in\mathcal{Y}$. Define $f^*_{\mathrm{clipped}}(x):=\operatorname{clip}_{[\tau,\theta]}(f^*(x))$. Then, we must have $f^*_{\mathrm{clipped}}\in\operatorname{argmin}_{f\in\mathcal{F}(\mathcal{X},\mathcal{Y}\cap[\tau,\theta])}\mathcal{L}(f;D)$.

Proof.

Consider any $f\in\mathcal{F}(\mathcal{X},\mathcal{Y}\cap[\tau,\theta])$. Let $\mathcal{X}^>$ (resp. $\mathcal{X}^<$) denote the set of all $x\in\mathcal{X}$ such that $f^*(x)>\theta$ (resp. $f^*(x)<\tau$). Consider the following operations:

  • For each $x\in\mathcal{X}^>$, change $f(x)$ to $\theta$.

  • For each $x\in\mathcal{X}^<$, change $f(x)$ to $\tau$.

  • Let $f(x)=f^*(x)$ for all $x\in\mathcal{X}\setminus(\mathcal{X}^>\cup\mathcal{X}^<)$.

At the end, we end up with $f(x)=f^*_{\mathrm{clipped}}(x)$. Each of the first two changes does not increase the loss $\mathcal{L}(f;D)$; otherwise, due to convexity, changing $f^*(x)$ to $f(x)$ would have decreased the objective function. Finally, the last operation does not decrease the loss; otherwise, we may replace this section of $f^*$ with the values in $f$ instead. Thus, we can conclude that $f^*_{\mathrm{clipped}}\in\operatorname{argmin}_{g\in\mathcal{F}(\mathcal{X},\mathcal{Y}\cap[\tau,\theta])}\mathcal{L}(g;D)$. ∎

We will now show how to compute the scores in Algorithm 1 simultaneously for all $\alpha$ (for fixed $i,t$) in nearly linear time. To do this, recall the prefix isotonic regression problem from earlier. For this problem, Stout [42] gave an $O(n)$-time algorithm for the $\ell_2^2$-loss and an $O(n\log n)$-time algorithm for the $\ell_1$-loss (both for the unrestricted-value case). Furthermore, after the $i$th iteration, the algorithm also keeps a succinct representation $S^{\operatorname{opt}}_i$ of the optimal solution in the form of an array $(i_1,v_1,\ell_1),\dots,(i_k,v_k,\ell_k)$, which denotes that $f(x)=v_j$ for all $x\in[x_{i_j},x_{i_{j+1}})$, and $\ell_j$ indicates the loss $\mathcal{L}^{\mathrm{abs}}$ up to, but not including, $x_{i_{j+1}}$.

We can extend the above algorithm to the prefix clipped isotonic regression problem, which we define in the same manner as above except that we restrict the function range to be $[\tau,\theta]$ for some given $\tau<\theta$. Using Observation 12, it is not hard to extend the above algorithm to work in this case.

Lemma 13.

There is an $O(n\log n)$-time algorithm for $\ell_2^2$- and $\ell_1$-prefix clipped isotonic regression.

Proof.

We first precompute $c_\tau(i)=\sum_{j\leq i}\ell(\tau,x_j)$ and $c_\theta(i)=\sum_{j\leq i}\ell(\theta,x_j)$ for all $i\in[n]$. We then run the aforementioned algorithm from [42]. At each iteration $i$, we use binary search to find the largest index $j_\tau$ such that $v_{j_\tau}<\tau$ and the largest index $j_\theta$ such that $v_{j_\theta}<\theta$. Observation 12 implies that the optimal solution of the clipped version is simply the same as that of the unrestricted version, except that we need to change the function values before $x_{j_\tau}$ to $\tau$ and after $x_{j_\theta}$ to $\theta$. The loss of this clipped optimum can be written as $\ell_{j_\theta}-\ell_{j_\tau}+c_\tau(j_\tau)+(c_\theta(i)-c_\theta(j_\theta))$, which can be computed in $O(1)$ time given that we have already precomputed $c_\tau,c_\theta$. The running time of the entire algorithm is thus the same as that of [42] together with the binary search time; the latter totals $O(n\log n)$. ∎

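Below is a self-contained sketch of prefix clipped isotonic regression for the $\ell_2^2$-loss, built on the standard pool-adjacent-violators (PAV) representation of the optimal prefix fits; for clarity it recomputes the clipped loss from the block statistics at every prefix, so it is simpler (and slower in the worst case) than the $O(n\log n)$ binary-search approach in the proof above.

```python
def prefix_clipped_isotonic_sq(ys, tau, theta):
    """For each prefix of ys, the optimal clipped squared loss (Definition 11) over
    monotone fits with values restricted to [tau, theta].

    Maintains PAV blocks (count, sum, sum of squares) of the unrestricted prefix fit;
    by Observation 12, the restricted optimum is obtained by clipping the block means.
    """
    blocks = []        # each block: [count, sum, sum_sq]; block means are non-decreasing
    out, baseline = [], 0.0
    for y in ys:
        baseline += (min(theta, max(tau, y)) - y) ** 2   # min_{y' in [tau,theta]} (y' - y)^2
        blocks.append([1.0, y, y * y])
        # merge while the last block's mean drops below the previous block's mean
        while len(blocks) > 1 and blocks[-1][1] / blocks[-1][0] < blocks[-2][1] / blocks[-2][0]:
            c, s, q = blocks.pop()
            blocks[-1][0] += c
            blocks[-1][1] += s
            blocks[-1][2] += q
        total = 0.0
        for c, s, q in blocks:
            v = min(theta, max(tau, s / c))              # clipped block value
            total += q - 2.0 * v * s + c * v * v         # sum over the block of (y - v)^2
        out.append(total - baseline)
    return out
```

An analogous routine with block medians (maintained with heaps) can be adapted for the $\ell_1$-loss.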
Our fast algorithm for computing $(\operatorname{score}_{i,t}(\alpha))_{\alpha\in P_{i,t}}$ first runs the above algorithm with $\tau=\frac{i}{2^t},\theta=\frac{i+0.5}{2^t}$ and $D=D_{i,t}$; this gives us $\min_{f_1\in\mathcal{F}(P_{i,t}^{\leq\alpha},[\frac{i}{2^t},\frac{i+0.5}{2^t}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t},\frac{i+1}{2^t}]}(f_1;D_{i,t}^{\leq\alpha})$ for all $\alpha\in P_{i,t}$ in time $O(|D_{i,t}|\log|D_{i,t}|+|P_{i,t}|)$. Analogously, we can also compute $\min_{f_2\in\mathcal{F}(P_{i,t}^{>\alpha},[\frac{i+0.5}{2^t},\frac{i+1}{2^t}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t},\frac{i+1}{2^t}]}(f_2;D_{i,t}^{>\alpha})$ for all $\alpha\in P_{i,t}$ in a similar time. Thus, we can compute $(\operatorname{score}_{i,t}(\alpha))_{\alpha\in P_{i,t}}$ in time $O(|D_{i,t}|\log|D_{i,t}|+|P_{i,t}|)$, and sample accordingly. Using the same observation as in the general loss function case, this can be sped up further to $O(|D_{i,t}|\log|D_{i,t}|+\log m)$ time. In total, the running time of the algorithm is thus

$$\sum_{t=0}^{T-1}\sum_{i=0}^{2^t-1}O(|D_{i,t}|\log|D_{i,t}|+\log m)~\leq~\sum_{t=0}^{T-1}O(n\log n+2^t\log m)~\leq~O(n(\log^2 n+\log m)).\qquad\qed$$

3.2 A Nearly Matching Lower Bound

We show that the excess empirical risk guarantee in Theorem 3 is tight, even for approximate-DP algorithms with a sufficiently small $\delta$, under a mild assumption about the loss function stated below.

Definition 14 (Distance-Based Loss Function).

For $R\geq 0$, a loss function $\ell$ is said to be $R$-distance-based if there exists $g:[0,1]\to\mathbb{R}_+$ such that $\ell(y,y')=g(|y-y'|)$, where $g$ is a non-decreasing function with $g(0)=0$ and $g(1/2)\geq R$.

We remark that standard loss functions, including the $\ell_1$- or $\ell_2^2$-loss, are all $\Omega(1)$-distance-based.

Our lower bound is stated below. It is proved via a packing argument [25] in a similar manner as a lower bound for properly PAC learning threshold functions [10]. This is not a coincidence: indeed, when we restrict the range of our function to $\{0,1\}$, the problem becomes exactly (the empirical version of) properly learning threshold functions. As a result, the same technique can be used to prove a lower bound in our setting as well.

Theorem 15.

For all $0\leq\delta<0.1\cdot(e^\varepsilon-1)/m$, any $(\varepsilon,\delta)$-DP algorithm for isotonic regression over $[m]$ for any $R$-distance-based loss function $\ell$ must have expected excess empirical risk $\Omega\left(R\cdot\min\left\{1,\frac{\log m}{\varepsilon n}\right\}\right)$.

Proof.

Suppose for the sake of contradiction that there exists an $(\varepsilon,\delta)$-DP algorithm $\mathcal{M}$ for isotonic regression with an $R$-distance-based loss function $\ell$ with expected excess empirical risk $0.01\cdot\left(R\cdot\min\left\{1,\frac{\log(0.1m)}{\varepsilon n}\right\}\right)$. Let $k:=\left\lfloor 0.1\log(0.1m)/\varepsilon\right\rfloor$.

We may assume that $n\geq 2k$, as the $\Omega(R)$ lower bound for the case $n=2k$ can easily be adapted to an $\Omega(R)$ lower bound for the case $n<2k$ as well.

We will use the standard packing argument [25]. For each $j\in[m-1]$, we create a dataset $D_j$ that contains $k$ copies of $(j,0)$, $k$ copies of $(j+1,1)$, and $n-2k$ copies of $(1,0)$. Finally, let $D_m$ denote the dataset that contains $k$ copies of $(m,0)$ and $n-k$ copies of $(1,0)$. Let $V_j$ denote the set of all $f\in\mathcal{F}([m],[0,1])$ such that $\mathcal{L}(f;D_j)<Rk/n$. The utility guarantee of $\mathcal{M}$ implies that

$$\Pr[\mathcal{M}(D_j)\in V_j]\geq 0.5.$$

Furthermore, it is not hard to see that $V_1,\dots,V_m$ are disjoint. In particular, for any function $f\in\mathcal{F}([m],[0,1])$, let $x_f$ be the largest element $x\in[m]$ for which $f(x)\leq 1/2$; if no such $x$ exists (i.e., $f(1)>1/2$), let $x_f=0$. For any $j<x_f$, we have $\mathcal{L}(f;D_j)\geq\frac{k}{n}\ell(f(j+1),1)\geq\frac{k}{n}\cdot g(1/2)\geq Rk/n$. Similarly, for any $j>x_f$, we have $\mathcal{L}(f;D_j)\geq\frac{k}{n}\ell(f(j),0)\geq\frac{k}{n}\cdot g(1/2)\geq Rk/n$. This implies that $f$ can belong to at most one $V_j$, as claimed.

Therefore, we have that

$$\begin{aligned}
1&~\geq~\sum_{j\in[m]}\Pr[\mathcal{M}(D_m)\in V_j]\\
&~\geq~\sum_{j\in[m]}\frac{1}{e^{2k\varepsilon}}\left(\Pr[\mathcal{M}(D_j)\in V_j]-\delta\cdot\frac{e^{2k\varepsilon}-1}{e^\varepsilon-1}\right)\qquad\text{(Fact 10)}\\
&~\geq~\sum_{j\in[m]}\frac{10}{m}(0.5-0.1)\\
&~>~1,
\end{aligned}$$

a contradiction. ∎

3.3 Extensions

We now discuss several variants of the isotonic regression problem that place certain additional constraints on the function $f$ that we seek, as listed below.

  • $k$-Piecewise Constant: $f$ must be a step function that consists of at most $k$ pieces.

  • $k$-Piecewise Linear: $f$ must be a piecewise linear function with at most $k$ pieces.

  • Lipschitz Regression: $f$ must be $L_f$-Lipschitz for some specified $L_f>0$.

  • Convex/Concave: $f$ must be convex/concave.

We devise a general meta-algorithm that, with a small tweak in each case, works for all of these constraints and yields Theorem 4. At a high level, our algorithm is similar to Algorithm 1, except that, in addition to using the exponential mechanism to pick the threshold $\alpha_{i,t}$, we also pick certain auxiliary information that is then passed on to the next stage. For example, in the $k$-piecewise constant setting, the algorithm also picks the number of pieces to the left of $\alpha_{i,t}$ and the number to its right; these are then passed on to the next stage. The algorithm stops when the number of pieces becomes one, at which point it simply uses the exponential mechanism to find the constant value on this subdomain.

The full description of the algorithm and the corresponding proof are deferred to Appendix C.

4 DP Isotonic Regression over General Posets

We now provide an algorithm and lower bounds for the case of general discrete posets. We first recall basic quantities about posets. An anti-chain of a poset $(\mathcal{X},\leq)$ is a set of elements such that no two distinct elements are comparable, whereas a chain is a set of elements such that every pair of elements is comparable. The width of a poset $(\mathcal{X},\leq)$, denoted by $\operatorname{width}(\mathcal{X})$, is defined as the maximum size among all anti-chains in the poset. The height of $(\mathcal{X},\leq)$, denoted by $\operatorname{height}(\mathcal{X})$, is defined as the maximum size among all chains in the poset. Dilworth's theorem and Mirsky's theorem give the following relation between chains and anti-chains:

Lemma 16 (Dilworth’s and Mirsky’s theorems [12, 36]).

A poset with width $w$ can be partitioned into $w$ chains. A poset with height $h$ can be partitioned into $h$ anti-chains.

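To make these quantities concrete, the following brute-force sketch (exponential time, for tiny non-empty posets only; the order relation `leq` is a hypothetical input) computes the width via maximum anti-chains and a Mirsky-style partition into anti-chains by longest-chain length.

```python
from itertools import combinations

def width_and_mirsky_levels(elems, leq):
    """Brute-force width and a Mirsky partition into anti-chains of a finite poset.

    elems : list of elements.
    leq   : callable leq(a, b) -> bool implementing the partial order.
    """
    def is_antichain(subset):
        return all(not (leq(a, b) or leq(b, a)) for a, b in combinations(subset, 2))

    width = max(r for r in range(1, len(elems) + 1)
                if any(is_antichain(s) for s in combinations(elems, r)))

    # level[x] = length of the longest chain ending at x; equal levels form anti-chains
    level = {}
    for x in sorted(elems, key=lambda e: sum(leq(y, e) for y in elems)):
        below = [level[y] for y in elems if y in level and leq(y, x) and y != x]
        level[x] = 1 + max(below, default=0)
    height = max(level.values())
    antichains = [[x for x in elems if level[x] == h] for h in range(1, height + 1)]
    return width, antichains
```

For instance, on the grid poset $[w]\times[h]$ from Section 4.3 one can pass `leq = lambda a, b: a[0] <= b[0] and a[1] <= b[1]` and verify on small instances that the width equals $\min\{w,h\}$.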
4.1 An Algorithm

Our algorithm for general posets is similar to that for totally ordered sets presented in the previous section. The only difference is that, instead of attempting to pick a single maximal point $\alpha$ such that $f(\alpha)\leq\tau$ as in the previous case, there could now be many such maximal $\alpha$'s. Indeed, we need to use the exponential mechanism to pick all such $\alpha$'s. Since these are all maximal, they must be incomparable; therefore, they form an anti-chain. Since there can be as many as $|\mathcal{X}|^{\operatorname{width}(\mathcal{X})}$ anti-chains in total, this means that the error from the exponential mechanism is $O\left(\log|\mathcal{X}|^{\operatorname{width}(\mathcal{X})}/\varepsilon'\right)=O(\operatorname{width}(\mathcal{X})\log|\mathcal{X}|/\varepsilon')$, leading to the multiplicative increase of $\operatorname{width}(\mathcal{X})$ in the total error. This completes our proof sketch for Theorem 1.

4.2 Lower Bounds

To prove a lower bound of $\Omega(\operatorname{width}(\mathcal{X})/\varepsilon n)$, we observe that the values of the function on any anti-chain can be arbitrary. Therefore, we may use the elements of a maximum anti-chain of $\mathcal{X}$ to encode a binary vector. The lower bound from Lemma 8 then gives us an $\Omega(\operatorname{width}(\mathcal{X})/n)$ lower bound for $\varepsilon=1$, as formalized below.

Lemma 17.

For any $\delta>0$, any $(1,\delta)$-DP algorithm for isotonic regression for any $R$-distance-based loss function $\ell$ must have expected excess empirical risk $\Omega\left(R(1-\delta)\cdot\min\left\{1,\frac{\operatorname{width}(\mathcal{X})}{n}\right\}\right)$.

Proof.

Consider any $(1,\delta)$-DP isotonic regression algorithm $\mathcal{M}'$ for loss $\ell$. Let $A$ be any maximum anti-chain (of size $\operatorname{width}(\mathcal{X})$) in $\mathcal{X}$. We use this algorithm to build a $(1,\delta)$-DP algorithm $\mathcal{M}$ for privatizing a binary vector of $m=\min\{n,|A|-1\}$ dimensions as follows:

  • Let $x_0,x_1,\dots,x_m$ be distinct elements of $A$.

  • On input $\mathbf{z}\in\{0,1\}^m$, create a dataset $D=\{(x_1,z_1),\dots,(x_m,z_m),(x_0,0),\dots,(x_0,0)\}$ where $(x_0,0)$ is repeated $n-m$ times.

  • Run $\mathcal{M}'$ on the instance $D$ to get $f$, and output a vector $\mathbf{z}'$ where $z'_i=\mathbf{1}[f(x_i)\geq 1/2]$.

It is obvious that this algorithm is $(1,\delta)$-DP. Observe also that $\mathcal{L}(f^*;D)=0$, and thus $\mathcal{M}'$'s expected excess empirical risk is $\mathbb{E}_{f\sim\mathcal{M}'(D)}[\mathcal{L}(f;D)]\geq R\cdot\mathbb{E}_{\mathbf{z}'\sim\mathcal{M}(\mathbf{z})}[\|\mathbf{z}-\mathbf{z}'\|_0]/n$, which, from Lemma 8, must be at least $\Omega(Rm(1-\delta)/n)=\Omega\left(R(1-\delta)\cdot\min\left\{1,\frac{\operatorname{width}(\mathcal{X})}{n}\right\}\right)$. ∎
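A minimal sketch of this reduction, assuming a maximum anti-chain `A` is given; the `private_isotonic` argument stands in for the hypothetical DP isotonic regression algorithm $\mathcal{M}'$.

```python
def privatize_binary_vector(z, A, n, private_isotonic):
    """Turn a DP isotonic regression algorithm into a DP algorithm for a binary vector z.

    A                : list of at least len(z) + 1 pairwise-incomparable domain elements.
    private_isotonic : callable(dataset) -> f, where f maps domain elements to [0, 1].
    """
    m = len(z)
    x0, xs = A[0], A[1:m + 1]
    dataset = [(xs[i], z[i]) for i in range(m)] + [(x0, 0)] * (n - m)
    f = private_isotonic(dataset)   # flipping one bit of z changes exactly one record
    return [1 if f(xs[i]) >= 0.5 else 0 for i in range(m)]
```

Since flipping one bit of $\mathbf{z}$ changes exactly one record of the constructed dataset, the reduction preserves $(1,\delta)$-DP, and each decoding error contributes at least $R/n$ to the empirical risk by the $R$-distance-based property.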

By using group privacy (Fact 10) and repeating each element $\Theta(1/\varepsilon)$ times, we arrive at a lower bound of $\Omega\left(R\cdot\min\left\{1,\frac{\operatorname{width}(\mathcal{X})}{\varepsilon n}\right\}\right)$. Furthermore, since $\mathcal{X}$ contains $\operatorname{height}(\mathcal{X})$ elements that form a totally ordered set, Theorem 15 gives a lower bound of $\Omega(R\cdot\log(\operatorname{height}(\mathcal{X}))/\varepsilon n)$ as long as $\delta<0.01\cdot\varepsilon/\operatorname{height}(\mathcal{X})$. Finally, due to Lemma 16, we have $\operatorname{height}(\mathcal{X})\geq|\mathcal{X}|/\operatorname{width}(\mathcal{X})$, which means that $\max\{\operatorname{width}(\mathcal{X}),\log(\operatorname{height}(\mathcal{X}))\}\geq\Omega(\log|\mathcal{X}|)$. Thus, we arrive at:

Theorem 18.

For any $\varepsilon\in(0,1]$ and any $\delta<0.01\cdot\varepsilon/|\mathcal{X}|$, any $(\varepsilon,\delta)$-DP algorithm for isotonic regression for an $R$-distance-based loss function $\ell$ must have expected excess empirical risk $\Omega\left(R\cdot\min\left\{1,\frac{\operatorname{width}(\mathcal{X})+\log|\mathcal{X}|}{\varepsilon n}\right\}\right)$.

4.3 Tight Examples for Upper and Lower Bounds

Recall that our upper bound is $\tilde{O}\left(\frac{\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|}{\varepsilon n}\right)$ while our lower bound is $\Omega\left(\frac{\operatorname{width}(\mathcal{X})+\log|\mathcal{X}|}{\varepsilon n}\right)$. One might wonder whether this gap can be closed. Below we show that, unfortunately, this is impossible in general: there are posets for which each bound is tight.

Tight Lower Bound Example. Let $\mathcal{X}_{\mathrm{disj}(w,h)}$ denote the poset that consists of $w$ disjoint chains $C_1,\dots,C_w$, where $|C_1|=h$ and $|C_2|=\cdots=|C_w|=1$. (Every pair of elements on different chains is incomparable.) In this case, we can solve the isotonic regression problem directly on each chain and piece the solutions together into the final output $f$. Note that $|\mathcal{X}_{\mathrm{disj}(w,h)}|=w+h-1$, $\operatorname{width}(\mathcal{X})=w$, and $\operatorname{height}(\mathcal{X})=h$. According to Theorem 1, the unnormalized excess empirical risk in $C_i$ is $\tilde{O}\left(\log(|C_i|)/\varepsilon\right)$. Therefore, the total (normalized) excess empirical risk for the entire domain $\mathcal{X}$ is $\tilde{O}\left(\frac{\log h+(w-1)}{\varepsilon n}\right)$. This is at most $\tilde{O}\left(\frac{w}{\varepsilon n}\right)$ as long as $h\leq\exp(O(w))$; this matches the lower bound.

Tight Upper Bound Example. Consider the grid poset $\mathcal{X}_{\mathrm{grid}(w,h)}:=[w]\times[h]$ where $(x,y)\leq(x',y')$ if and only if $x\leq x'$ and $y\leq y'$. We assume throughout that $w\leq h$. Observe that $\operatorname{width}(\mathcal{X}_{\mathrm{grid}(w,h)})=w$ and $\operatorname{height}(\mathcal{X}_{\mathrm{grid}(w,h)})=w+h-1$.

We will show the following lower bound, which matches the $\tilde{O}\left(\frac{\operatorname{width}(\mathcal{X})\log|\mathcal{X}|}{\varepsilon n}\right)$ upper bound in the case where $h\geq w^{1+\Omega(1)}$, up to an $O(\log^2(\varepsilon n))$ factor. We prove it by a reduction from Lemma 9. Note that this reduction is in some sense a “combination” of the proofs of Theorem 15 and Lemma 17, as the coordinate-wise encoding aspect of Lemma 17 is still present (across the rows) whereas the packing-style lower bound is present in how we embed elements of $[D]$ (in blocks of columns).

Lemma 19.

For any $\varepsilon\in(0,1]$ and $\delta<O_\varepsilon(1/h)$, any $(\varepsilon,\delta)$-DP algorithm for isotonic regression for any $R$-distance-based loss function $\ell$ must have expected excess empirical risk $\Omega\left(R\cdot\min\left\{1,\frac{w\cdot\log(h/w)}{\varepsilon n}\right\}\right)$.

Proof.

Let $D:=\lfloor h/w-1\rfloor$, $m=w$, and $r:=\min\{\lfloor 0.5n/m\rfloor,\lfloor 0.5\ln(D/2)/\varepsilon\rfloor\}$. Consider any $(\varepsilon,\delta)$-DP algorithm $\mathcal{M}'$ for isotonic regression for $\ell$ on $\mathcal{X}_{\mathrm{grid}(w,h)}$ where $\delta\leq 0.01\varepsilon/D$. We use this algorithm to build a $(\ln(D/2),0.25)$-DP algorithm $\mathcal{M}$ for privatizing a vector $\mathbf{z}\in[D]^m$ as follows:

  • Create a dataset $D$ that contains:

    • For all $i\in[m]$, $r$ copies of $((i,(w-i)(D+1)+z_i),0)$ and $r$ copies of $((i,(w-i)(D+1)+z_i+1),1)$.

    • $n-2rm$ copies of $((1,1),0)$.

  • Run $\mathcal{M}'$ on instance $D$ to get $f$.

  • Output a vector $\mathbf{z}'$ where $z'_i=\max\{j\in[D]\mid f((i,(w-i)(D+1)+j))\leq 1/2\}$. (For simplicity, when such $j$ does not exist, let $z'_i=0$.)

By group privacy, $\mathcal{M}$ is $(\ln(D/2),0.25)$-DP. Furthermore, $\mathcal{L}(f^*;D)=0$ and the expected empirical excess risk of $\mathcal{M}'$ is

$$\begin{aligned}
\mathbb{E}_{f\sim\mathcal{M}'(D)}[\mathcal{L}(f;D)]&~\geq~\frac{r}{n}\sum_{i\in[m]}\left(\ell(f(i,(w-i)(D+1)+z_i),0)+\ell(f(i,(w-i)(D+1)+z_i+1),1)\right)\\
&~\geq~\frac{r}{n}\sum_{i\in[m]}g(1/2)\cdot\mathbf{1}[z'_i\neq z_i]~=~\frac{Rr}{n}\cdot\|\mathbf{z}-\mathbf{z}'\|_0,
\end{aligned}$$

which must be at least $\Omega(Rrm/n)=\Omega\left(R\cdot\min\left\{1,\frac{w\cdot\log(h/w)}{\varepsilon n}\right\}\right)$ by Lemma 9. ∎

5 Additional Related Work

(Non-private) isotonic regression is well studied in statistics and machine learning. The one-dimensional (aka univariate) case has a long history [9, 46, 5, 44, 45, 13, 8, 35, 14, 15, 49]; for a general introduction, see [22]. Moreover, isotonic regression has been studied in higher dimensions [24, 27, 26], including the sparse setting [21], as well as in online learning [29]. A related line of work studies learning neural networks under (partial) monotonicity constraints [3, 50, 30, 40].

There has been a rich body of work on DP machine learning, including DP empirical risk minimization (ERM), e.g., [11, 6, 48, 47], and DP linear regression, e.g., [1]; however, to the best of our knowledge none of these can be applied to isotonic regression to obtain non-trivial guarantees.

Another line of work related to our setting is on privately learning threshold functions [7, 20, 10, 2, 28]. We leveraged this relation to prove our lower bound for the totally ordered case (Section 3.2).

6 Conclusions

In this paper, we obtained new private algorithms for isotonic regression on posets and proved nearly matching lower bounds in terms of the expected excess empirical risk. Although our algorithms for totally ordered sets are efficient, our algorithm for general posets is not. Specifically, a trivial implementation of the algorithm would run in time $\exp(\tilde{O}(\operatorname{width}(\mathcal{X})))$. It remains an interesting open question whether this can be sped up. To the best of our knowledge, this question does not seem to be well understood even in the non-private setting, as previous algorithmic works have focused primarily on the totally ordered case. Similarly, while our algorithm is efficient for totally ordered sets, it remains interesting to understand whether the nearly linear time algorithms for the $\ell_1$- and $\ell_2^2$-losses can be extended to a larger class of loss functions.

References

  • [1] D. Alabi, A. McMillan, J. Sarathy, A. D. Smith, and S. P. Vadhan. Differentially private simple linear regression. Proc. Priv. Enhancing Technol., 2022(2):184–204, 2022.
  • [2] N. Alon, R. Livni, M. Malliaris, and S. Moran. Private PAC learning implies finite littlestone dimension. In STOC, pages 852–860, 2019.
  • [3] N. P. Archer and S. Wang. Application of the back propagation neural network algorithm with monotonicity constraints for two-group classification problems. Dec. Sci., 24(1):60–75, 1993.
  • [4] M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. Ann. Math. Stat., pages 641–647, 1955.
  • [5] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical Inference Under Order Restrictions. John Wiley & Sons, 1973.
  • [6] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473, 2014.
  • [7] A. Beimel, K. Nissim, and U. Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. Theory Comput., 12(1):1–61, 2016.
  • [8] L. Birgé and P. Massart. Rates of convergence for minimum contrast estimators. Prob. Theory Rel. Fields, 97(1):113–150, 1993.
  • [9] H. D. Brunk. Maximum likelihood estimates of monotone parameters. Ann. Math. Stat., pages 607–616, 1955.
  • [10] M. Bun, K. Nissim, U. Stemmer, and S. P. Vadhan. Differentially private release and learning of threshold functions. In FOCS, pages 634–649, 2015.
  • [11] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. JMLR, 12(3), 2011.
  • [12] R. P. Dilworth. A decomposition theorem for partially ordered sets. Ann. Math., 51(1):161–166, 1950.
  • [13] D. L. Donoho. Gelfand $n$-widths and the method of least squares. Technical Report, University of California, 1991.
  • [14] C. Durot. On the $l_p$-error of monotonicity constrained estimators. Ann. Stat., 35(3):1080–1104, 2007.
  • [15] C. Durot. Monotone nonparametric regression with random design. Math. Methods Stat., 17(4):327–341, 2008.
  • [16] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503, 2006.
  • [17] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
  • [18] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
  • [19] R. L. Dykstra and T. Robertson. An algorithm for isotonic regression for two or more independent variables. Ann. Stat., 10(3):708–716, 1982.
  • [20] V. Feldman and D. Xiao. Sample complexity bounds on differentially private learning via communication complexity. In COLT, volume 35, pages 1000–1019, 2014.
  • [21] D. Gamarnik and J. Gaudio. Sparse high-dimensional isotonic regression. NeurIPS, 2019.
  • [22] P. Groeneboom and G. Jongbloed. Nonparametric Estimation under Shape Constraints. Cambridge University Press, 2014.
  • [23] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In ICML, pages 1321–1330, 2017.
  • [24] Q. Han, T. Wang, S. Chatterjee, and R. J. Samworth. Isotonic regression in general dimensions. Ann. Stat., 47(5):2440–2471, 2019.
  • [25] M. Hardt and K. Talwar. On the geometry of differential privacy. In STOC, pages 705–714, 2010.
  • [26] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NeurIPS, 2011.
  • [27] A. T. Kalai and R. Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
  • [28] H. Kaplan, K. Ligett, Y. Mansour, M. Naor, and U. Stemmer. Privately learning thresholds: Closing the exponential gap. In COLT, pages 2263–2285, 2020.
  • [29] W. Kotłowski, W. M. Koolen, and A. Malek. Online isotonic regression. In COLT, pages 1165–1189, 2016.
  • [30] X. Liu, X. Han, N. Zhang, and Q. Liu. Certified monotonic neural networks. NeurIPS, pages 15427–15438, 2020.
  • [31] R. Luss, S. Rosset, and M. Shahar. Efficient regularized isotonic regression with application to gene–gene interaction search. Ann. Appl. Stat., 6(1):253–283, 2012.
  • [32] P. Manurangsi. Tight bounds for differentially private anonymized histograms. In SOSA, pages 203–213, 2022.
  • [33] F. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun. ACM, 53(9):89–97, 2010.
  • [34] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103, 2007.
  • [35] M. Meyer and M. Woodroofe. On the degrees of freedom in shape-restricted regression. Ann. Stat., 28(4):1083–1104, 2000.
  • [36] L. Mirsky. A dual of Dilworth’s decomposition theorem. AMS, 78(8):876–877, 1971.
  • [37] C. Radebaugh and U. Erlingsson. Introducing TensorFlow Privacy: Learning with Differential Privacy for Training Data, March 2019. blog.tensorflow.org.
  • [38] T. Robertson, F. T. Wright, and R. L. Dykstra. Order restricted statistical inference. John Wiley & Sons, 1988.
  • [39] M. J. Schell and B. Singh. The reduced monotonic regression method. JASA, 92(437):128–135, 1997.
  • [40] A. Sivaraman, G. Farnadi, T. Millstein, and G. Van den Broeck. Counterexample-guided learning of monotonic neural networks. In NeurIPS, pages 11936–11948, 2020.
  • [41] T. Steinke and J. R. Ullman. Between pure and approximate differential privacy. J. Priv. Confidentiality, 7(2), 2016.
  • [42] Q. F. Stout. Unimodal regression via prefix isotonic regression. Comput. Stat. Data Anal., 53(2):289–297, 2008.
  • [43] D. Testuggine and I. Mironov. PyTorch Differential Privacy Series Part 1: DP-SGD Algorithm Explained, August 2020. medium.com.
  • [44] S. Van de Geer. Estimating a regression function. Ann. Stat., pages 907–924, 1990.
  • [45] S. Van de Geer. Hellinger-consistency of certain nonparametric maximum likelihood estimators. Ann. Stat., 21(1):14–44, 1993.
  • [46] C. van Eeden. Testing and Estimating Ordered Parameters of Probability Distribution. CWI, Amsterdam, 1958.
  • [47] D. Wang, C. Chen, and J. Xu. Differentially private empirical risk minimization with non-convex loss functions. In ICML, pages 6526–6535, 2019.
  • [48] D. Wang, M. Ye, and J. Xu. Differentially private empirical risk minimization revisited: Faster and more general. In NIPS, 2017.
  • [49] F. Yang and R. F. Barber. Contraction and uniform convergence of isotonic regression. Elec. J. Stat., 13(1):646–677, 2019.
  • [50] S. You, D. Ding, K. Canini, J. Pfeifer, and M. Gupta. Deep lattice networks and partial monotonic functions. NIPS, 2017.
  • [51] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pages 694–699, 2002.

Appendix A Baseline Algorithm for Private Isotonic Regression

We provide a baseline algorithm for private isotonic regression by a direct application of the exponential mechanism. For simplicity, we start with the case of totally ordered sets and then extend the algorithm to general posets.

Totally ordered sets.

Consider a discretized range 𝒯:={0,1T,2T,,1}\mathcal{T}:=\left\{0,\frac{1}{T},\frac{2}{T},\ldots,1\right\}. Since the loss is LL-Lipschitz, rounding the values of the optimal function to this grid changes the empirical risk by at most L/TL/T; that is, for f~:=argminf([m],𝒯)(f;D)\tilde{f}:=\operatorname{argmin}_{f\in\mathcal{F}([m],\mathcal{T})}\mathcal{L}(f;D) and f:=argminf([m],[0,1])(f;D)f^{*}:=\operatorname{argmin}_{f\in\mathcal{F}([m],[0,1])}\mathcal{L}(f;D), it holds that (f~;D)(f;D)+LT\mathcal{L}(\tilde{f};D)\leq\mathcal{L}(f^{*};D)+\frac{L}{T}. Also, it is a simple combinatorial fact that |([m],𝒯)|=(m+TT)(m+T)T|\mathcal{F}([m],\mathcal{T})|=\binom{m+T}{T}\leq(m+T)^{T}, which bounds the number of monotone functions under this discretization. Thus, the ε\varepsilon-DP exponential mechanism over the set of all monotone functions in ([m],𝒯)\mathcal{F}([m],\mathcal{T}), with the score function (f;D)\mathcal{L}(f;D) of sensitivity at most L/nL/n, returns f:[m]𝒯f:[m]\to\mathcal{T} such that

(f;D)(f~;D)+O(LTlog(m+T)εn)(f;D)+O(LTlog(m+T)εn+LT).\textstyle\mathcal{L}(f;D)\leq\mathcal{L}(\tilde{f};D)+O\left(\frac{LT\log(m+T)}{\varepsilon n}\right)~{}\leq~{}\mathcal{L}(f^{*};D)+O\left(\frac{LT\log(m+T)}{\varepsilon n}+\frac{L}{T}\right).

Setting T=εnlogmT=\sqrt{\frac{\varepsilon n}{\log m}} gives an excess empirical error of O(Llog(m)εn)O\left(L\sqrt{\frac{\log(m)}{\varepsilon n}}\right) (when mnm\geq n).
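To make this baseline concrete, the following is a minimal Python sketch (with naming of our own; not an implementation from the paper) for the ℓ1 loss, which is 1-Lipschitz: it enumerates all (m+T choose T) monotone functions into the grid and samples one via the exponential mechanism, so it is only feasible for very small m and T.

```python
import itertools
import math
import random

def exp_mech_isotonic_baseline(data, m, T, eps):
    """Baseline eps-DP isotonic regression on the chain [m] via the
    exponential mechanism, for the 1-Lipschitz l1 loss (L = 1).

    data : list of (x, y) pairs with x in {1, ..., m} and y in [0, 1].
    Enumerates all C(m + T, T) monotone functions f : [m] -> {0, 1/T, ..., 1},
    so it is only meant for very small m and T.
    """
    n = len(data)
    grid = [t / T for t in range(T + 1)]
    # A monotone function [m] -> grid is a non-decreasing tuple of m grid values.
    candidates = list(itertools.combinations_with_replacement(grid, m))

    def emp_risk(f):
        return sum(abs(f[x - 1] - y) for x, y in data) / n

    # Exponential mechanism: Pr[f] proportional to exp(-eps * score(f) / (2 * sensitivity)),
    # where score = empirical risk and sensitivity = L / n = 1 / n.
    sensitivity = 1.0 / n
    scores = [emp_risk(f) for f in candidates]
    best = min(scores)  # shift for numerical stability; does not change the distribution
    weights = [math.exp(-eps * (s - best) / (2 * sensitivity)) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]

# Example: m = 5 points, grid with T = 4 levels.
data = [(1, 0.1), (2, 0.2), (3, 0.35), (4, 0.7), (5, 0.9)]
print(exp_mech_isotonic_baseline(data, m=5, T=4, eps=1.0))
```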

General posets.

By Lemma 16, we have that 𝒳\mathcal{X} can be partitioned into w:=width(𝒳)w:=\operatorname{width}(\mathcal{X}) many chains H1,,HwH_{1},\ldots,H_{w}. Let hi:=|Hi|h_{i}:=|H_{i}|. Since any monotone function over 𝒳\mathcal{X} has to be monotone over each of the chains, we have that

|(𝒳,𝒯)|i=1w|(Hi,𝒯)|(|𝒳|w+T)wT(|𝒳|+T)wT.\textstyle|\mathcal{F}(\mathcal{X},\mathcal{T})|\leq\prod\limits_{i=1}^{w}|\mathcal{F}(H_{i},\mathcal{T})|~{}\leq~{}\left(\frac{|\mathcal{X}|}{w}+T\right)^{wT}~{}\leq~{}(|\mathcal{X}|+T)^{wT}.

Here, the second inequality uses |(Hi,𝒯)|(hi+T)T|\mathcal{F}(H_{i},\mathcal{T})|\leq(h_{i}+T)^{T} together with the AM–GM inequality (since i=1whi=|𝒳|\sum_{i=1}^{w}h_{i}=|\mathcal{X}|). Thus, by a similar argument as above, the ε\varepsilon-DP exponential mechanism over the set of all monotone functions in (𝒳,𝒯)\mathcal{F}(\mathcal{X},\mathcal{T}), with score function (f;D)\mathcal{L}(f;D) of sensitivity at most L/nL/n, returns f:𝒳𝒯f:\mathcal{X}\to\mathcal{T} such that

(f;D)(f;D)+O(LwTlog(|𝒳|+T)nε+LT).\textstyle\mathcal{L}(f;D)~{}\leq~{}\mathcal{L}(f^{*};D)+O\left(L\cdot\frac{wT\log(|\mathcal{X}|+T)}{n\varepsilon}+\frac{L}{T}\right).

Choosing T=εnwlog|𝒳|T=\sqrt{\frac{\varepsilon n}{w\log|\mathcal{X}|}} gives an excess empirical error of O(Lwlog|𝒳|εn)O\left(L\sqrt{\frac{w\log|\mathcal{X}|}{\varepsilon n}}\right) (when |𝒳|n|\mathcal{X}|\geq n).
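The chain decomposition promised by Lemma 16 (Dilworth's theorem) can be computed explicitly. The following is a minimal sketch, with our own naming and assuming the poset is given by an explicit strict-order predicate, of the classical reduction to maximum bipartite matching: matching u to v whenever u < v and gluing matched pairs into chains yields exactly width(𝒳) chains.

```python
def min_chain_decomposition(elements, less_than):
    """Partition a finite poset into width(X) chains (Dilworth's theorem).

    elements  : list of hashable poset elements.
    less_than : less_than(u, v) is True iff u < v in the partial order.
    Classical reduction: match u to v when u < v (maximum bipartite matching),
    then glue matched pairs into chains; #chains = n - |matching| = width.
    """
    n = len(elements)
    succ = [[j for j in range(n) if less_than(elements[i], elements[j])] for i in range(n)]
    match_right = [None] * n  # match_right[j] = i means edge (i, j) is matched

    def try_augment(i, seen):
        # Kuhn's augmenting-path search from left node i.
        for j in succ[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_right[j] is None or try_augment(match_right[j], seen):
                match_right[j] = i
                return True
        return False

    for i in range(n):
        try_augment(i, set())

    # next_in_chain[i] = j means i is immediately followed by j in its chain.
    next_in_chain, has_pred = {}, set()
    for j, i in enumerate(match_right):
        if i is not None:
            next_in_chain[i] = j
            has_pred.add(j)

    chains = []
    for i in range(n):
        if i not in has_pred:  # i starts a chain
            chain, cur = [], i
            while cur is not None:
                chain.append(elements[cur])
                cur = next_in_chain.get(cur)
            chains.append(chain)
    return chains  # len(chains) equals the width of the poset

# Example: the Boolean lattice on {a, b}, ordered by proper inclusion, has width 2.
subsets = [frozenset(), frozenset('a'), frozenset('b'), frozenset('ab')]
print(min_chain_decomposition(subsets, lambda u, v: u < v))
```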

Appendix B Lower Bound on Privatizing Vectors with Large Alphabet: Proof of Lemma 9

We now prove Lemma 9; the proof is a slight extension of that of Lemma 8 in [32].

Proof of Lemma 9.

For every i[m],σ[D]i\in[m],\sigma\in[D], let 𝐳(i,σ)\mathbf{z}_{(i,\sigma)} denote (z1,,zi1,σ,zi+1,,zm)(z_{1},\dots,z_{i-1},\sigma,z_{i+1},\dots,z_{m}). Let ε=ln(D/2),δ=0.25\varepsilon^{\prime}=\ln(D/2),\delta^{\prime}=0.25. We have

𝔼𝐳[D]m[(𝐳)𝐳0]\displaystyle\mathbb{E}_{\mathbf{z}\sim[D]^{m}}[\|\mathcal{M}(\mathbf{z})-\mathbf{z}\|_{0}]
=i[m]Pr𝐳[D]m[(𝐳)izi]\displaystyle=\sum_{i\in[m]}\Pr_{\mathbf{z}\sim[D]^{m}}[\mathcal{M}(\mathbf{z})_{i}\neq z_{i}]
=mi[m]Pr𝐳[D]m[(𝐳)i=zi]\displaystyle=m-\sum_{i\in[m]}\Pr_{\mathbf{z}\sim[D]^{m}}[\mathcal{M}(\mathbf{z})_{i}=z_{i}]
=mi[m]1Dm+1𝐳[D]m(σ[D]Pr[(𝐳(i,σ))i=σ])\displaystyle=m-\sum_{i\in[m]}\frac{1}{D^{m+1}}\sum_{\mathbf{z}\in[D]^{m}}\left(\sum_{\sigma\in[D]}\Pr[\mathcal{M}(\mathbf{z}_{(i,\sigma)})_{i}=\sigma]\right)
(From (ε,δ)-DP of )\displaystyle(\text{From }(\varepsilon^{\prime},\delta^{\prime})\text{-DP of }\mathcal{M}) mi[m]1Dm+1𝐳[D]m(σ[D](eεPr[(𝐳(i,1))i=σ]+δ))\displaystyle\geq m-\sum_{i\in[m]}\frac{1}{D^{m+1}}\sum_{\mathbf{z}\in[D]^{m}}\left(\sum_{\sigma\in[D]}\left(e^{\varepsilon^{\prime}}\cdot\Pr[\mathcal{M}(\mathbf{z}_{(i,1)})_{i}=\sigma]+\delta^{\prime}\right)\right)
mi[m]1Dm+1𝐳[D]m(eε+Dδ)\displaystyle\geq m-\sum_{i\in[m]}\frac{1}{D^{m+1}}\sum_{\mathbf{z}\in[D]^{m}}\left(e^{\varepsilon^{\prime}}+D\delta^{\prime}\right)
=(1eε/Dδ)m\displaystyle=\left(1-e^{\varepsilon^{\prime}}/D-\delta^{\prime}\right)m
=0.25m.\displaystyle=0.25m.\qed

Appendix C Algorithms for Isotonic Regression with Additional Constraints

In this section, we elaborate on the constrained variants of the isotonic regression problem over totally ordered sets, by designing a meta-algorithm that can be instantiated to get algorithms for each of the cases discussed in Section 3.3.

Recall that Algorithm 1 proceeded in TT rounds, where in round tt the algorithm starts with a partition of [m][m] into 2t2^{t} intervals and then partitions each interval into two using the exponential mechanism. At a high level, our meta-algorithm is similar, except that it maintains a set of pairwise disjoint structured intervals of [m][m]: each interval carries additional structure that constrains the function that can be returned on that interval, and the function is fixed outside the union of these intervals. This idea is described in Algorithm 2, stated using the following abstractions, which will be instantiated to derive algorithms for each constrained variant.

  • The set 𝒮\mathcal{S} of all structured intervals of [m][m], and an initial structured interval S0,0𝒮S_{0,0}\in\mathcal{S}. A structured interval SS will consist of an interval domain denoted PS[m]P_{S}\subseteq[m], an interval range denoted RS[0,1]R_{S}\subseteq[0,1], and potentially additional constraints that the function should satisfy. We use |RS||R_{S}| to denote the length of RSR_{S}. To keep the number of structured intervals bounded, we will consider a discretized range where the endpoints of interval RSR_{S} lie in :={0,1/H,2/H,,1}\mathcal{H}:=\{0,1/H,2/H,\ldots,1\} for some discretization parameter HH.

  • A partition method Φ:S{(Sleft,Sright,g)}\Phi:S\mapsto\{(S^{\mathrm{left}},S^{\mathrm{right}},g)\} that defines a set of all “valid partitions” of a structured interval SS into two structured intervals SleftS^{\mathrm{left}} and SrightS^{\mathrm{right}} and a function g:PS(PSleftPSright)RSg:P_{S}\smallsetminus(P_{S^{\mathrm{left}}}\cup P_{S^{\mathrm{right}}})\to R_{S}. It is required that PS(PSleftPSright)P_{S}\smallsetminus(P_{S^{\mathrm{left}}}\cup P_{S^{\mathrm{right}}}) be an interval. If the algorithm makes a choice of (Sleft,Sright,g)(S^{\mathrm{left}},S^{\mathrm{right}},g), then the final function returned by the algorithm is required to be equal to gg on PS(PSleftPSright)P_{S}\smallsetminus(P_{S^{\mathrm{left}}}\cup P_{S^{\mathrm{right}}}).

  • For all S𝒮S\in\mathcal{S}, we abuse notation to let (S)\mathcal{F}(S) denote the set of all monotone functions mapping PSP_{S} to RSR_{S}, while respecting the additional conditions enforced by the structure in SS.

Algorithm 2 Meta algorithm for variants of DP Isotonic Regression for Totally Ordered Sets.
  Input: 𝒳=[m]\mathcal{X}=[m], dataset D={(x1,y1),,(xn,yn)}D=\{(x_{1},y_{1}),\ldots,(x_{n},y_{n})\}, DP parameter ε\varepsilon.
  Output: Monotone function f:[m][0,1]f:[m]\to[0,1] satisfying additional desired condition.
  
  Tlog(εn)T\leftarrow\left\lceil\log(\varepsilon n)\right\rceil
  εε/T\varepsilon^{\prime}\leftarrow\varepsilon/T
  S0,0S_{0,0} : initial structured interval                          {Any structured interval SS consists of an interval domain PSP_{S} and an interval range RSR_{S}, and potentially other conditions on the function.}
  for t=0,,T1t=0,\ldots,T-1 do
     for i=0,,2t1i=0,\ldots,2^{t}-1 do
        
  • \triangleright

    Di,t{(xj,yj)j[n],xjPSi,t}D_{i,t}\leftarrow\left\{(x_{j},y_{j})\mid j\in[n],x_{j}\in P_{S_{i,t}}\right\}

  • \triangleright

    Choose (Si,tleft,Si,tright,gi,t)Φ(Si,t)(S^{\mathrm{left}}_{i,t},S^{\mathrm{right}}_{i,t},g_{i,t})\in\Phi(S_{i,t}), using ε\varepsilon^{\prime}-DP exponential mechanism with scoring function

    scorei,t(Sleft,Sright,g)\displaystyle\operatorname{score}_{i,t}(S^{\mathrm{left}},S^{\mathrm{right}},g) :=minf1(Sleft)RSabs(f1;Di,tleft)\displaystyle~{}:=~{}\min_{f_{1}\in\mathcal{F}(S^{\mathrm{left}})}\mathcal{L}^{\mathrm{abs}}_{R_{S}}(f_{1};D^{\mathrm{left}}_{i,t})
    +minf2(Sright)RSabs(f2;Di,tright)\displaystyle\qquad+~{}\min_{f_{2}\in\mathcal{F}(S^{\mathrm{right}})}\mathcal{L}^{\mathrm{abs}}_{R_{S}}(f_{2};D^{\mathrm{right}}_{i,t})
    +RSabs(g;Di,tmid)\displaystyle\qquad+~{}\mathcal{L}^{\mathrm{abs}}_{R_{S}}(g;D^{\mathrm{mid}}_{i,t})

    {Notation: Di,tleft:={(x,y)Di,txPSleft}D_{i,t}^{\mathrm{left}}:=\left\{(x,y)\in D_{i,t}\mid x\in P_{S^{\mathrm{left}}}\right\}, Di,trightD_{i,t}^{\mathrm{right}} is defined similarly and

    Notation: Di,tmid:={(x,y)Di,txPSi,t(PSleftPSright)}D_{i,t}^{\mathrm{mid}}:=\left\{(x,y)\in D_{i,t}\mid x\in P_{S_{i,t}}\smallsetminus(P_{S^{\mathrm{left}}}\cup P_{S^{\mathrm{right}}})\right\}.}

    {Note: scorei,t(Sleft,Sright,g)\operatorname{score}_{i,t}(S^{\mathrm{left}},S^{\mathrm{right}},g) has sensitivity at most L|RS|L\cdot|R_{S}|.}

  • \triangleright

    S2i,t+1Si,tleftS_{2i,t+1}\leftarrow S^{\mathrm{left}}_{i,t} and S2i+1,t+1Si,trightS_{2i+1,t+1}\leftarrow S^{\mathrm{right}}_{i,t}.

  Let f:[m][0,1]f:[m]\to[0,1] be defined by choosing f|PSi,T(Si,T)f|_{P_{S_{i,T}}}\in\mathcal{F}(S_{i,T}) arbitrarily for all i[2T]i\in[2^{T}], and setting f(x)=gi,t(x)f(x)=g_{i,t}(x) for all xPSi,t(PS2i,t+1PS2i+1,t+1)x\in P_{S_{i,t}}\smallsetminus(P_{S_{2i,t+1}}\cup P_{S_{2i+1,t+1}}) for all ii, tt.
  return  ff
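To make the abstractions above concrete, the following is a minimal Python sketch (our own naming throughout, not the paper's implementation) of the structured-interval data type and the main loop of Algorithm 2. The variant-specific partition method Φ and the inner score minimization are passed in as callbacks, the Lipschitz constant of the loss is taken to be 1, and assembling the final function from the leaf intervals and the fixed middle pieces is omitted.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class StructuredInterval:
    domain: tuple          # (i, j): the interval domain P_S, a sub-interval of [m]
    rng: tuple             # (tau, theta): the interval range R_S, a sub-interval of [0, 1]
    extra: dict = field(default_factory=dict)  # variant-specific structure (piece budget, flags, ...)

@dataclass
class Partition:
    left: StructuredInterval
    right: StructuredInterval
    g: Optional[Callable] = None  # function fixed on P_S \ (P_left u P_right), if that set is non-empty

def run_meta_algorithm(data, m, eps, S00, Phi, score):
    """Skeleton of Algorithm 2: T rounds of refining structured intervals,
    each refinement chosen by the eps'-DP exponential mechanism.

    Phi(S)           -> list of Partition objects (the valid partitions of S)
    score(P, data_S) -> the (non-private) score of partition P on the data in P_S
    """
    n = len(data)
    T = max(1, math.ceil(math.log2(max(2, eps * n))))  # T ~ ceil(log(eps * n))
    eps_round = eps / T
    intervals, fixed = [S00], []   # `fixed` collects the chosen partitions (incl. g on middles)
    for t in range(T):
        new_intervals = []
        for S in intervals:
            i, j = S.domain
            data_S = [(x, y) for (x, y) in data if i <= x <= j]
            candidates = Phi(S)
            sens = max(S.rng[1] - S.rng[0], 1e-9)   # sensitivity <= L * |R_S|, with L = 1 here
            scores = [score(P, data_S) for P in candidates]
            best = min(scores)
            w = [math.exp(-eps_round * (s - best) / (2 * sens)) for s in scores]
            P = random.choices(candidates, weights=w, k=1)[0]
            fixed.append(P)
            new_intervals += [P.left, P.right]
        intervals = new_intervals
    return intervals, fixed
```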

We instantiate this notion of structured intervals in the following ways to derive algorithms for the constrained variants of isotonic regression mentioned earlier:

  • (Vanilla) Isotonic Regression (recovers Algorithm 1): 𝒮\mathcal{S} is simply the set of all interval domains and all (discretized) interval ranges, and the partition method partitions each interval into two sub-intervals, with the range divided into two equal halves. (We ignore a slight detail that τ+θ2\frac{\tau+\theta}{2} need not be in \mathcal{H}; this can be fixed, e.g., by letting it be Hτ+θ2/H\left\lfloor H\cdot\frac{\tau+\theta}{2}\right\rfloor/H, but we skip this complicated expression for simplicity. Note that, if we let H=2TH=2^{T}, this distinction does not make a difference in the algorithm for vanilla isotonic regression.) Namely,

    𝒮\displaystyle\mathcal{S} :={([i,j],[τ,θ]):i,j[m],τ,θ s.t. ij,τθ},\displaystyle~{}:=~{}\left\{([i,j],[\tau,\theta]):i,j\in[m]\,,\tau,\theta\in\mathcal{H}\text{ s.t. }i\leq j\,,\tau\leq\theta\right\}\,,
    S0,0\displaystyle S_{0,0} :=([1,m],[0,1]),\displaystyle~{}:=~{}([1,m],[0,1])\,,
    Φ(([i,j],[τ,θ]))\displaystyle\Phi(([i,j],[\tau,\theta])) :={(([i,],[τ,τ+θ2]),([+1,j],[τ+θ2,θ])):i1j},\displaystyle\textstyle~{}:=~{}\left\{(([i,\ell],[\tau,\frac{\tau+\theta}{2}]),([\ell+1,j],[\frac{\tau+\theta}{2},\theta])):i-1\leq\ell\leq j\right\}\,,
    (([i,j],[τ,θ]))\displaystyle\mathcal{F}(([i,j],[\tau,\theta])) :=set of monotone functions mapping [i,j] to [τ,θ].\displaystyle~{}:=~{}\text{set of monotone functions mapping }[i,j]\text{ to }[\tau,\theta]\,.

    We skip the description of the function gg in the partition method Φ\Phi, since the middle sub-interval is empty. For all the other variants, we omit the conditions i,j[m]i,j\in[m], τ,θ\tau,\theta\in\mathcal{H}, iji\leq j, and τθ\tau\leq\theta in the definition of 𝒮\mathcal{S}, and similarly the condition that (S)\mathcal{F}(S) consists of monotone functions mapping [i,j][i,j] to [τ,θ][\tau,\theta]; we only spell out the main new conditions. (A minimal code sketch of the vanilla instantiation is given after this list.)

  • kk-Piecewise Constant: 𝒮\mathcal{S} is the set of all interval domains and all discretized ranges, along with a parameter encoding an upper bound on the number of pieces in the final piecewise constant function. The partition method partitions the domain into two sub-intervals, splitting the budget of pieces between them and dividing the range into two equal halves, namely,

    𝒮\displaystyle\mathcal{S} :={([i,j],[τ,θ],r):1rk if ij,r=0 if i>j},\displaystyle~{}:=~{}\left\{([i,j],[\tau,\theta],r):\begin{matrix}1\leq r\leq k&\text{ if }i\leq j,\\ r=0&\text{ if }i>j\end{matrix}\right\}\,,
    S0,0\displaystyle S_{0,0} :=([1,m],[0,1],k),\displaystyle~{}:=~{}([1,m],[0,1],k)\,,
    Φ(([i,j],[τ,θ],r))\displaystyle\Phi(([i,j],[\tau,\theta],r)) :={(([i,],[τ,τ+θ2],r1),([+1,j],[τ+θ2,θ],r2))s.t. i1j and r1+r2=r},\displaystyle\textstyle~{}:=~{}\left\{\begin{matrix}(([i,\ell],[\tau,\frac{\tau+\theta}{2}],r_{1}),([\ell+1,j],[\frac{\tau+\theta}{2},\theta],r_{2}))\\[5.69054pt] \text{s.t. }i-1\leq\ell\leq j\ \text{ and }r_{1}+r_{2}=r\end{matrix}\right\}\,,
    (([i,j],[τ,θ],r))\displaystyle\mathcal{F}(([i,j],[\tau,\theta],r)) :=set of r-piecewise constant functions\displaystyle~{}:=~{}\text{set of $r$-piecewise constant functions}
  • kk-Piecewise Linear: 𝒮\mathcal{S} is the set of all interval domains and all discretized ranges, along with a parameter (encoding an upper bound on the number of pieces in the final piecewise linear function) and two Boolean values (\top/\bot), one encoding whether the function must achieve the minimum possible value at the start of the interval, and the other encoding whether it must achieve the maximum possible value at the end of the interval. The partition method partitions into two sub-intervals respecting the number of pieces, by choosing a middle sub-interval that ensures that each new range is at most half as large as the earlier one, namely,

    𝒮\displaystyle\mathcal{S} :={([i,j],[τ,θ],r,b1,b2):1rk if ij,r=0 if i>j,b1,b2{,}},\displaystyle~{}:=~{}\left\{([i,j],[\tau,\theta],r,b_{1},b_{2}):\begin{matrix}\begin{matrix}1\leq r\leq k&\text{ if }i\leq j,\\ r=0&\text{ if }i>j,\end{matrix}\\ b_{1},b_{2}\in\left\{\top,\bot\right\}\end{matrix}\right\}\,,
    S0,0\displaystyle S_{0,0} :=([1,m],[0,1],k,,),\displaystyle~{}:=~{}([1,m],[0,1],k,\bot,\bot)\,,
    Φ(([i,j],[τ,θ],r,b1,b2))\displaystyle\Phi(([i,j],[\tau,\theta],r,b_{1},b_{2})) :={(Sleft=([i,1],[τ,ω1],r1,b1,),Sright=([2,j],[ω2,θ],r2,,b2)g(x)=ω1+(x1)(ω2ω1)/(21))s.t. i11<2j+1,ω1τ+θ2ω2, and r1+r2=r1},\displaystyle\textstyle~{}:=~{}\left\{\begin{matrix}\left(\begin{matrix}S^{\mathrm{left}}=([i,\ell_{1}],[\tau,\omega_{1}],r_{1},b_{1},\top),\\ S^{\mathrm{right}}=([\ell_{2},j],[\omega_{2},\theta],r_{2},\top,b_{2})\\ g(x)=\omega_{1}+(x-\ell_{1})\cdot(\omega_{2}-\omega_{1})/(\ell_{2}-\ell_{1})\end{matrix}\right)\\[5.69054pt] \text{s.t. }i-1\leq\ell_{1}<\ell_{2}\leq j+1\ ,\omega_{1}\leq\frac{\tau+\theta}{2}\leq\omega_{2}\,,\\ \phantom{\text{s.t. }}\text{ and }r_{1}+r_{2}=r-1\end{matrix}\right\}\,,
    (([i,j],[τ,θ],r,b1,b2))\displaystyle\mathcal{F}(([i,j],[\tau,\theta],r,b_{1},b_{2})) :=set of r-piecewise linear functions f\displaystyle~{}:=~{}\text{set of $r$-piecewise linear functions $f$}
    s.t. f(i)=τf(i)=\tau if b1=b_{1}=\top and f(j)=θf(j)=\theta if b2=b_{2}=\top.

    In other words, Φ(([i,j],[τ,θ],r,b1,b2))\Phi(([i,j],[\tau,\theta],r,b_{1},b_{2})) considers the three sub-intervals [i,1][i,\ell_{1}], [1,2][\ell_{1},\ell_{2}] and [2,j][\ell_{2},j], and fits an affine function gg in the middle sub-interval [1,2][\ell_{1},\ell_{2}] such that g(1)=ω1g(\ell_{1})=\omega_{1} and g(2)=ω2g(\ell_{2})=\omega_{2} and ensures that the function ff returned on sub-intervals [i,1][i,\ell_{1}] and [2,j][\ell_{2},j] satisfies f(1)=ω1f(\ell_{1})=\omega_{1} and f(2)=ω2f(\ell_{2})=\omega_{2}.

  • Lipschitz Regression: Given any Lipschitz constant LfL_{f}, 𝒮\mathcal{S} is the set of all interval domains and all discretized ranges, along with two Boolean values (\top/\bot), one encoding whether the function must achieve the minimum possible value at the start of the interval, and the other encoding whether it must achieve the maximum possible value at the end of the interval. The partition method chooses sub-intervals by choosing \ell and function values f()f(\ell) and f(+1)f(\ell+1) such that f(+1)f()Lff(\ell+1)-f(\ell)\leq L_{f} (thereby respecting the Lipschitz condition), and moreover f()τ+θ2f(\ell)\leq\frac{\tau+\theta}{2} and f(+1)τ+θ2f(\ell+1)\geq\frac{\tau+\theta}{2}.

    𝒮\displaystyle\mathcal{S} :={([i,j],[τ,θ],b1,b2):b1,b2{,}},\displaystyle~{}:=~{}\left\{([i,j],[\tau,\theta],b_{1},b_{2}):b_{1},b_{2}\in\left\{\top,\bot\right\}\right\}\,,
    S0,0\displaystyle S_{0,0} :=([1,m],[0,1],,),\displaystyle~{}:=~{}([1,m],[0,1],\bot,\bot)\,,
    Φ(([i,j],[τ,θ],b1,b2))\displaystyle\Phi(([i,j],[\tau,\theta],b_{1},b_{2})) :={(Sleft=([i,],[τ,ω1],b1,),Sright=([+1,j],[ω2,θ],,b2))s.t. i1j,ω1τ+θ2ω2,ω2ω1Lf},\displaystyle\textstyle~{}:=~{}\left\{\begin{matrix}\left(\begin{matrix}S^{\mathrm{left}}=([i,\ell],[\tau,\omega_{1}],b_{1},\top),\\ S^{\mathrm{right}}=([\ell+1,j],[\omega_{2},\theta],\top,b_{2})\end{matrix}\right)\\[5.69054pt] \text{s.t. }i-1\leq\ell\leq j\ ,\omega_{1}\leq\frac{\tau+\theta}{2}\leq\omega_{2}\,,\\ \phantom{\text{s.t. }}\omega_{2}-\omega_{1}\leq L_{f}\end{matrix}\right\}\,,
    (([i,j],[τ,θ],b1,b2))\displaystyle\mathcal{F}(([i,j],[\tau,\theta],b_{1},b_{2})) :=set of Lf-Lipschitz functions f\displaystyle~{}:=~{}\text{set of $L_{f}$-Lipschitz functions $f$}
    s.t. f(i)=τf(i)=\tau if b1=b_{1}=\top and f(j)=θf(j)=\theta if b2=b_{2}=\top.
  • Convex/Concave: We only describe the convex case; the concave case follows similarly. Note that a function ff is convex over the discrete domain [m][m] if and only if f(x+1)+f(x1)2f(x)f(x+1)+f(x-1)\geq 2\cdot f(x) holds for all 1<x<m1<x<m. Let 𝒮\mathcal{S} be the set of all interval domains and all discretized ranges, along with the following additional parameters:

    • a lower bound LL on the (discrete) derivative of ff,

    • an upper bound UU on the (discrete) derivative of ff,

    • a Boolean value encoding whether the function must achieve the minimum possible value at the start of the interval,

    • another Boolean value encoding whether the function must achieve the maximum possible value at the end of the interval.

    The partition method chooses sub-intervals by choosing \ell and function values f()f(\ell) and f(+1)f(\ell+1) such that Lf(+1)f()UL\leq f(\ell+1)-f(\ell)\leq U, f()τ+θ2f(\ell)\leq\frac{\tau+\theta}{2} and f(+1)τ+θ2f(\ell+1)\geq\frac{\tau+\theta}{2} and enforcing that the left sub-interval has derivatives at most f(+1)f()f(\ell+1)-f(\ell) and the right sub-interval has derivatives at least f(+1)f()f(\ell+1)-f(\ell).

    𝒮\displaystyle\mathcal{S} :={([i,j],[τ,θ],L,U,b1,b2):LU,b1,b2{,}},\displaystyle~{}:=~{}\left\{([i,j],[\tau,\theta],L,U,b_{1},b_{2}):L\leq U,b_{1},b_{2}\in\left\{\top,\bot\right\}\right\}\,,
    S0,0\displaystyle S_{0,0} :=([1,m],[0,1],,+,,),\displaystyle~{}:=~{}([1,m],[0,1],-\infty,+\infty,\bot,\bot)\,,
    Φ(([i,j],[τ,θ],L,U,b1,b2))\displaystyle\Phi(([i,j],[\tau,\theta],L,U,b_{1},b_{2})) :={(Sleft=([i,],[τ,ω1],L,ω2ω1,b1,),Sright=([+1,j],[ω2,θ],ω2ω1,U,,b2))s.t. i1j,ω1τ+θ2ω2,Lω2ω1U},\displaystyle\textstyle~{}:=~{}\left\{\begin{matrix}\left(\begin{matrix}S^{\mathrm{left}}=([i,\ell],[\tau,\omega_{1}],L,\omega_{2}-\omega_{1},b_{1},\top),\\ S^{\mathrm{right}}=([\ell+1,j],[\omega_{2},\theta],\omega_{2}-\omega_{1},U,\top,b_{2})\end{matrix}\right)\\[5.69054pt] \text{s.t. }i-1\leq\ell\leq j\ ,\omega_{1}\leq\frac{\tau+\theta}{2}\leq\omega_{2}\,,\\ \phantom{\text{s.t. }}L\leq\omega_{2}-\omega_{1}\leq U\end{matrix}\right\}\,,
    (([i,j],[τ,θ],L,U,b1,b2))\displaystyle\mathcal{F}(([i,j],[\tau,\theta],L,U,b_{1},b_{2})) :=set of convex functions f\displaystyle~{}:=~{}\text{set of convex functions $f$}
    s.t. for all [i,j)\ell\in[i,j) it holds that Lf(+1)f()UL\leq f(\ell+1)-f(\ell)\leq U,
    and f(i)=τf(i)=\tau if b1=b_{1}=\top and f(j)=θf(j)=\theta if b2=b_{2}=\top.
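As promised after the first bullet, here is a minimal sketch (our own naming; the range midpoint is kept exact, which matches the footnote's choice of H=2^T) of the structured-interval set and the partition method Φ for the vanilla instantiation.

```python
from dataclasses import dataclass

# Sketch of the vanilla instantiation of the partition method Phi (our own naming).

@dataclass(frozen=True)
class Vanilla:
    i: int       # left endpoint of the interval domain [i, j]
    j: int       # right endpoint (j < i encodes an empty domain)
    tau: float   # lower endpoint of the interval range [tau, theta]
    theta: float # upper endpoint

def Phi_vanilla(S):
    """All valid partitions of S: split the domain at any l with
    i - 1 <= l <= j, and split the range at its midpoint."""
    mid = (S.tau + S.theta) / 2
    partitions = []
    for l in range(S.i - 1, S.j + 1):
        left = Vanilla(S.i, l, S.tau, mid)
        right = Vanilla(l + 1, S.j, mid, S.theta)
        partitions.append((left, right))   # middle sub-interval is empty, so no g is needed
    return partitions

# Example: the initial structured interval over [1, 8] with range [0, 1].
S00 = Vanilla(1, 8, 0.0, 1.0)
for left, right in Phi_vanilla(S00)[:3]:
    print(left, right)
```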

Privacy Analysis.

This follows similarly to Algorithm 1: within each round tt, the exponential mechanisms act on disjoint sub-datasets, so each round is ε\varepsilon^{\prime}-DP, and basic composition over the TT rounds yields ε\varepsilon-DP.

Utility Analysis.

Since |RSi,t|2t|R_{S_{i,t}}|\leq 2^{-t} in each of the cases, it follows that the sensitivity of the scoring function is at most L/2tL/2^{t}. The rest of the proof follows similarly, with the only change being that the number of candidates in the exponential mechanism is given as |Φ(Si,t)||\Phi(S_{i,t})|, which in the case of vanilla isotonic regression was simply |Pi,t||P_{i,t}|. We now bound this for each of the cases, which shows that log|Φ(Si,t)|\log|\Phi(S_{i,t})| is at most O(log(mn))O(\log(mn)). In particular,

  • kk-Piecewise Constant: |Φ(S)|O(mk)|\Phi(S)|\leq O(mk).

  • kk-Piecewise Linear: |Φ(S)|O(m2H2k)|\Phi(S)|\leq O(m^{2}H^{2}k).

  • LfL_{f}-Lipschitz: |Φ(S)|O(mH2)|\Phi(S)|\leq O(mH^{2}).

  • Convex/Concave: |Φ(S)|O(mH)|\Phi(S)|\leq O(mH).

Finally, there is an additional error due to discretization. To account for it, we argue below that, for appropriately selected values of HH and for any optimal function ff^{*}, there exists f(S0,0)f\in\mathcal{F}(S_{0,0}) such that |f(x)f(x)|1/n|f^{*}(x)-f(x)|\leq 1/n. Since the loss is LL-Lipschitz, this immediately implies that the (unnormalized) discretization error is at most nL(1/n)=O(L)n\cdot L\cdot(1/n)=O(L).

  • kk-Piecewise Linear: We may select H=nH=n. In this case, for every endpoint \ell, we let f()=Hf()/Hf(\ell)=\lceil H\cdot f^{*}(\ell)\rceil/H and interpolate the intermediate points accordingly. It is simple to see that |f(x)f(x)|1/n|f^{*}(x)-f(x)|\leq 1/n as desired.

  • LfL_{f}-Lipschitz and Convex/Concave: Let H=mnH=mn. Here we discretize the (discrete) derivative of ff. Specifically, let f(1)=Hf(1)/Hf(1)=\lfloor H\cdot f^{*}(1)\rfloor/H and let f(+1)f()=H(f(+1)f())/Hf(\ell+1)-f(\ell)=\lfloor H\cdot(f^{*}(\ell+1)-f^{*}(\ell))\rfloor/H for all =1,,m1\ell=1,\dots,m-1. Once again, it is simple to see that ff and ff^{*} differ by at most 1/n1/n at each point.

In summary, in all cases, we have |Φ(S)|(nm)O(1)|\Phi(S)|\leq(nm)^{O(1)} resulting in the same asymptotic error as in the unconstrained case.

Runtime Analysis.

It is easy to see that each score value can be computed (via dynamic programming) in time poly(n)poly(H)\mathrm{poly}(n)\cdot\mathrm{poly}(H). Since the algorithm invokes the exponential mechanism at most 2T=O(εn)2^{T}=O(\varepsilon n) times, each over |Φ(S)|(nm)O(1)|\Phi(S)|\leq(nm)^{O(1)} candidates, the entire algorithm can be implemented in time (nm)O(1)(nm)^{O(1)} as claimed. (In the main body, we erroneously claimed that the running time was (nlogm)O(1)(n\log m)^{O(1)}, instead of (nm)O(1)(nm)^{O(1)}.)
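To illustrate the kind of dynamic program alluded to above, the following is a minimal sketch (our own naming; ℓ1 loss, no additional structural constraints, distinct x-values assumed) of one inner minimization min over monotone functions into a discretized range of the unnormalized absolute loss, computed in time O(|D_S|·H). Constrained variants add state to the same recursion (e.g., the number of remaining pieces).

```python
def min_isotonic_l1_loss(points, tau, theta, H):
    """Minimum of sum_{(x, y) in points} |f(x) - y| over monotone functions f
    mapping the points (sorted by x, assumed distinct) into the grid
    {tau, tau + d, ..., theta} with d = (theta - tau) / H.
    Runs in O(len(points) * H) time.
    """
    points = sorted(points)                      # process x-values left to right
    grid = [tau + (theta - tau) * v / H for v in range(H + 1)]
    INF = float("inf")
    dp = [0.0] * (H + 1)                         # dp[v]: best cost so far if the current value is grid[v]
    for _, y in points:
        prefix = [INF] * (H + 1)                 # prefix[v] = min_{u <= v} dp[u]  (enforces monotonicity)
        best = INF
        for v in range(H + 1):
            best = min(best, dp[v])
            prefix[v] = best
        dp = [prefix[v] + abs(grid[v] - y) for v in range(H + 1)]
    return min(dp)

# Example: data on the interval domain [1, 4] with range [0.0, 0.5] and H = 10.
D = [(1, 0.05), (2, 0.30), (3, 0.20), (4, 0.45)]
print(min_isotonic_l1_loss(D, tau=0.0, theta=0.5, H=10))
```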

Appendix D Missing Proofs from Section 4

For a set S𝒳S\subseteq\mathcal{X}, its lower closure and upper closure are defined as S:={x𝒳sS,xs}S^{\leq}:=\{x\in\mathcal{X}\mid\exists s\in S,x\leq s\} and S:={x𝒳sS,xs}S^{\geq}:=\{x\in\mathcal{X}\mid\exists s\in S,x\geq s\}, respectively. Similarly, the strict lower closure and strict upper closure are defined as S<:={x𝒳sS,x<s}S^{<}:=\{x\in\mathcal{X}\mid\exists s\in S,x<s\} and S>:={x𝒳sS,x>s}S^{>}:=\{x\in\mathcal{X}\mid\exists s\in S,x>s\}. When S=S=\emptyset, we use the convention that S=S<=S^{\leq}=S^{<}=\emptyset and S=S>=𝒳S^{\geq}=S^{>}=\mathcal{X}.

D.1 Proof of Theorem 1

We note that, in the proof below, we also consider the empty set to be an antichain.

Proof of Theorem 1.

We use the notations of [τ,θ]\ell_{[\tau,\theta]} and [τ,θ]abs\mathcal{L}^{\mathrm{abs}}_{[\tau,\theta]} as defined in the proof of Theorem 3.

Any monotone function f:𝒳[0,1]f:\mathcal{X}\to[0,1] corresponds to an antichain AA in 𝒳\mathcal{X} such that f(a)1/2f(a)\geq 1/2 for all aA>a\in A^{>} and f(a)1/2f(a)\leq 1/2 for all aAa\in A^{\leq}. Our algorithm works by first choosing this antichain AA in a DP manner using the exponential mechanism. The choice of AA partitions the poset into two parts A>A^{>} and AA^{\leq} and the algorithm recurses on these two parts to find functions f>:A>[1/2,1]f_{>}:A^{>}\to[1/2,1] and f:A[0,1/2]f_{\leq}:A^{\leq}\to[0,1/2], which are put together to obtain the final function.

In particular, the algorithm proceeds in TT stages, where in stage tt, the algorithm starts with a partition of 𝒳\mathcal{X} into 2t2^{t} parts {Pi,ti[2t]}\left\{P_{i,t}\mid i\in[2^{t}]\right\}, and the algorithm eventually outputs a monotone function ff such that f(x)[i/2t,(i+1)/2t]f(x)\in[i/2^{t},(i+1)/2^{t}] for all xPi,tx\in P_{i,t}. This partition is further refined for stage t+1t+1 by choosing an antichain Ai,tA_{i,t} in Pi,tP_{i,t} and partitioning Pi,tP_{i,t} into Pi,tAi,t>P_{i,t}\cap A_{i,t}^{>} and Pi,tAi,tP_{i,t}\cap A_{i,t}^{\leq}. In the final stage, the function ff is chosen to be the constant i/2T1i/2^{T-1} over Pi,T1P_{i,T-1}. A complete description is presented in Algorithm 3.

Algorithm 3 DP Isotonic Regression for General Posets
  Input: Poset 𝒳\mathcal{X}, dataset D={(x1,y1),,(xn,yn)}D=\{(x_{1},y_{1}),\ldots,(x_{n},y_{n})\}, DP parameter ε\varepsilon.
  Output: Monotone function f:𝒳[0,1]f:\mathcal{X}\to[0,1].
  
  Tlog(εn)T\leftarrow\left\lceil\log(\varepsilon n)\right\rceil and εε/T\varepsilon^{\prime}\leftarrow\varepsilon/T.
  P0,0𝒳P_{0,0}\leftarrow\mathcal{X}
  for t=0,,T1t=0,\ldots,T-1 do
     for i=0,,2t1i=0,\ldots,2^{t}-1 do
        
  • \triangleright

    Di,t{(xj,yj)j[n],xjPi,t}D_{i,t}\leftarrow\left\{(x_{j},y_{j})\mid j\in[n],x_{j}\in P_{i,t}\right\} (set of all input points whose xx belongs to Pi,tP_{i,t})

  • \triangleright

    𝒜i,t\mathcal{A}_{i,t}\leftarrow set of all antichains in Pi,tP_{i,t}.

    For each antichain A𝒜i,tA\in\mathcal{A}_{i,t}, we abuse notation to use

    • \bullet

      Di,tAD_{i,t}\cap A^{\leq} to denote{(x,y)Di,txA}\{(x,y)\in D_{i,t}\mid x\in A^{\leq}\}, and

    • \bullet

      Di,tA>D_{i,t}\cap A^{>} to denote {(x,y)Di,txA>}\{(x,y)\in D_{i,t}\mid x\in A^{>}\}.

  • \triangleright

    Choose antichain Ai,t𝒜i,tA_{i,t}\in\mathcal{A}_{i,t} using the exponential mechanism with the scoring function

    scorei,t(A)=\displaystyle\operatorname{score}_{i,t}(A)= minf1(Pi,tA,[i2t,i+0.52t])[i2t,i+12t]abs(f1;Di,tA)\displaystyle\min_{f_{1}\in\mathcal{F}(P_{i,t}\cap A^{\leq},[\frac{i}{2^{t}},\frac{i+0.5}{2^{t}}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^{t}},\frac{i+1}{2^{t}}]}(f_{1};D_{i,t}\cap A^{\leq})
    +minf2(Pi,tA>,[i+0.52t,i+12t])[i2t,i+12t]abs(f2;Di,tA>),\displaystyle\qquad+~{}\min_{f_{2}\in\mathcal{F}(P_{i,t}\cap A^{>},[\frac{i+0.5}{2^{t}},\frac{i+1}{2^{t}}])}\mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^{t}},\frac{i+1}{2^{t}}]}(f_{2};D_{i,t}\cap A^{>}),

    {scorei,t(A)\operatorname{score}_{i,t}(A) has sensitivity at most L/2tL/2^{t}.}

  • \triangleright

    P2i,t+1Pi,tAi,tP_{2i,t+1}\leftarrow P_{i,t}\cap A_{i,t}^{\leq} and P2i+1,t+1Pi,tAi,t>P_{2i+1,t+1}\leftarrow P_{i,t}\cap A_{i,t}^{>}.

  Let f:𝒳[0,1]f:\mathcal{X}\to[0,1] be given by f(x)=i/2T1f(x)=i/2^{T-1} for all xPi,T1x\in P_{i,T-1} and all i[2T1]i\in[2^{T-1}].
  return  ff

Before proceeding to prove the algorithm’s privacy and utility guarantees, we note that the output ff is indeed monotone: for every x<xx^{\prime}<x that gets separated when we partition Pi,tP_{i,t} into P2i,t+1,P2i+1,t+1P_{2i,t+1},P_{2i+1,t+1}, we must have xP2i,t+1x^{\prime}\in P_{2i,t+1} and xP2i+1,t+1x\in P_{2i+1,t+1} (otherwise two elements of the antichain Ai,tA_{i,t} would be comparable), and P2i,t+1P_{2i,t+1} is assigned the lower half of the range in all subsequent stages, so f(x)f(x)f(x^{\prime})\leq f(x).
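To make the candidate set 𝒜_{i,t} and the induced partition concrete, here is a minimal sketch (our own naming) that enumerates the antichains of a small poset and computes the two parts P∩A^≤ and P∩A^> for a chosen antichain. The exhaustive enumeration is only meant to mirror the analysis, since |𝒜_{i,t}| can be as large as roughly |𝒳|^width(𝒳).

```python
from itertools import combinations

def antichains(P, lt):
    """All antichains of the sub-poset P (including the empty antichain).
    lt(u, v) is True iff u < v.  Exhaustive, so only for small P."""
    result = []
    for k in range(len(P) + 1):
        for A in combinations(P, k):
            if all(not lt(a, b) and not lt(b, a) for a, b in combinations(A, 2)):
                result.append(A)
    return result

def split_by_antichain(P, A, lt):
    """Partition step of Algorithm 3: P ∩ A^<= (lower closure) and P ∩ A^> (strict upper closure).
    With the convention for A = {}, the lower part is empty and the upper part is all of P."""
    if not A:
        return [], list(P)
    lower = [x for x in P if any(x == a or lt(x, a) for a in A)]
    upper = [x for x in P if any(lt(a, x) for a in A)]
    return lower, upper

# Example: subsets of {a, b} ordered by proper inclusion.
P = [frozenset(), frozenset('a'), frozenset('b'), frozenset('ab')]
lt = lambda u, v: u < v
print(len(antichains(P, lt)))                                       # number of candidate antichains
print(split_by_antichain(P, (frozenset('a'), frozenset('b')), lt))  # parts for the maximal antichain {{a}, {b}}
```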

Privacy Analysis.

As in the proof of Theorem 3, each round tt is ε\varepsilon^{\prime}-DP, since the exponential mechanisms within a round operate on the disjoint sub-datasets Di,tD_{i,t}; thus the entire mechanism is ε\varepsilon-DP by basic composition of DP (Lemma 6) over the TT rounds.

Utility Analysis.

Since the sensitivity of scorei,t()\operatorname{score}_{i,t}(\cdot) is at most L/2tL/2^{t}, we have from Lemma 7 that for all t{0,,T1}t\in\{0,\dots,T-1\} and i[2t]i\in[2^{t}],

𝔼[scorei,t(Ai,t)minA𝒜i,tscorei,t(A)]O(Llog|𝒜i,t|ε2t)\displaystyle\mathbb{E}\left[\operatorname{score}_{i,t}(A_{i,t})-\min_{A\in\mathcal{A}_{i,t}}\operatorname{score}_{i,t}(A)\right]\leq O\left(\frac{L\cdot\log|\mathcal{A}_{i,t}|}{\varepsilon^{\prime}\cdot 2^{t}}\right) O(Lwidth(𝒳)log|𝒳|ε2t).\displaystyle\leq O\left(\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|}{\varepsilon^{\prime}\cdot 2^{t}}\right). (4)

To facilitate the subsequent steps of the proof, let us introduce additional notation. Let hi,th_{i,t} denote argminh(Pi,t,[i/2t,(i+1)/2t])abs(h;Di,t)\operatorname{argmin}_{h\in\mathcal{F}(P_{i,t},[i/2^{t},(i+1)/2^{t}])}\mathcal{L}^{\mathrm{abs}}(h;D_{i,t}) (with ties broken arbitrarily). Then, let A~i,t\tilde{A}_{i,t} denote the set of all maximal elements of hi,t1([i/2t,(i+1/2)/2t])h_{i,t}^{-1}([i/2^{t},(i+1/2)/2^{t}]). Under this notation, we have that

scorei,t(Ai,t)minA𝒜i,tscorei,t(A)\displaystyle\operatorname{score}_{i,t}(A_{i,t})-\min_{A\in\mathcal{A}_{i,t}}\operatorname{score}_{i,t}(A)
scorei,t(Ai,t)scorei,t(A~i,t)\displaystyle~{}\geq~{}\operatorname{score}_{i,t}(A_{i,t})-\operatorname{score}_{i,t}(\tilde{A}_{i,t}) (5)
=([i/2t,(i+1)/2t]abs(h2i,t+1;D2i,t+1)+[i/2t,(i+1)/2t]abs(h2i+1,t+1;D2i+1,t+1))\displaystyle~{}=~{}\left(\mathcal{L}^{\mathrm{abs}}_{[i/2^{t},(i+1)/2^{t}]}(h_{2i,t+1};D_{2i,t+1})+\mathcal{L}^{\mathrm{abs}}_{[i/2^{t},(i+1)/2^{t}]}(h_{2i+1,t+1};D_{2i+1,t+1})\right)
[i/2t,(i+1)/2t]abs(hi,t;Di,t)\displaystyle\qquad-\mathcal{L}^{\mathrm{abs}}_{[i/2^{t},(i+1)/2^{t}]}(h_{i,t};D_{i,t})
=abs(h2i,t+1;D2i,t+1)+abs(h2i+1,t+1;D2i+1,t+1)abs(hi,t;Di,t).\displaystyle~{}=~{}\mathcal{L}^{\mathrm{abs}}(h_{2i,t+1};D_{2i,t+1})+\mathcal{L}^{\mathrm{abs}}(h_{2i+1,t+1};D_{2i+1,t+1})-\mathcal{L}^{\mathrm{abs}}(h_{i,t};D_{i,t}). (6)

Finally, notice that

abs(f;Di,T1)abs(hi,T1;Di,T1)L2T1|Di,T1|=O(|Di,T1|εn).\displaystyle\mathcal{L}^{\mathrm{abs}}(f;D_{i,T-1})-\mathcal{L}^{\mathrm{abs}}(h_{i,T-1};D_{i,T-1})\leq\frac{L}{2^{T-1}}\cdot|D_{i,T-1}|=O\left(\frac{|D_{i,T-1}|}{\varepsilon n}\right). (7)

With all the ingredients ready, we may now bound the expected (unnormalized) excess risk. We have that

abs(f;D)\displaystyle\mathcal{L}^{\mathrm{abs}}(f;D) =i[2T1]abs(f;Di,T1)\displaystyle=\sum_{i\in[2^{T-1}]}\mathcal{L}^{\mathrm{abs}}(f;D_{i,T-1})
(7)i[2T1](O(|Di,T1|εn)+abs(hi,T1;Di,T1))\displaystyle\overset{\eqref{eq:rounding-err-util}}{\leq}\sum_{i\in[2^{T-1}]}\left(O\left(\frac{|D_{i,T-1}|}{\varepsilon n}\right)+\mathcal{L}^{\mathrm{abs}}(h_{i,T-1};D_{i,T-1})\right)
=O(1/ε)+i[2T1]abs(hi,T1;Di,T1)\displaystyle=O(1/\varepsilon)~{}+~{}\sum_{i\in[2^{T-1}]}\mathcal{L}^{\mathrm{abs}}(h_{i,T-1};D_{i,T-1})
=O(1/ε)+abs(h0,0;D0,0)\displaystyle=O(1/\varepsilon)~{}+~{}\mathcal{L}^{\mathrm{abs}}(h_{0,0};D_{0,0})
+t[T1]i[2t1](abs(h2i,t;D2i,t)+abs(h2i+1,t;D2i+1,t)abs(hi,t1;Di,t1)).\displaystyle\qquad+\sum_{t\in[T-1]}\sum_{i\in[2^{t-1}]}\left(\mathcal{L}^{\mathrm{abs}}(h_{2i,t};D_{2i,t})+\mathcal{L}^{\mathrm{abs}}(h_{2i+1,t};D_{2i+1,t})-\mathcal{L}^{\mathrm{abs}}(h_{i,t-1};D_{i,t-1})\right).

Taking the expectation on both sides and using (4) and (6) yields

𝔼[abs(f;D)]\displaystyle\mathbb{E}[\mathcal{L}^{\mathrm{abs}}(f;D)] O(1/ε)+abs(h0,0;D0,0)+t[T1]i[2t1]O(Lwidth(𝒳)log|𝒳|ε2t)\displaystyle\leq O(1/\varepsilon)~{}+~{}\mathcal{L}^{\mathrm{abs}}(h_{0,0};D_{0,0})~{}+~{}\sum_{t\in[T-1]}\sum_{i\in[2^{t-1}]}O\left(\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|}{\varepsilon^{\prime}\cdot 2^{t}}\right)
=O(1/ε)+abs(f;D)+t[T1]O(Lwidth(𝒳)log|𝒳|ε)\displaystyle=O(1/\varepsilon)~{}+~{}\mathcal{L}^{\mathrm{abs}}(f^{*};D)~{}+~{}\sum_{t\in[T-1]}O\left(\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|}{\varepsilon^{\prime}}\right)
=O(1/ε)+abs(f;D)+O(TLwidth(𝒳)log|𝒳|ε)\displaystyle=O(1/\varepsilon)~{}+~{}\mathcal{L}^{\mathrm{abs}}(f^{*};D)~{}+~{}O\left(T\cdot\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|}{\varepsilon^{\prime}}\right)
=O(1/ε)+abs(f;D)+O(T2Lwidth(𝒳)log|𝒳|ε)\displaystyle=O(1/\varepsilon)~{}+~{}\mathcal{L}^{\mathrm{abs}}(f^{*};D)~{}+~{}O\left(T^{2}\cdot\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|}{\varepsilon}\right)
=abs(f;D)+O(Lwidth(𝒳)log|𝒳|(1+log2(εn))ε).\displaystyle=\mathcal{L}^{\mathrm{abs}}(f^{*};D)~{}+~{}O\left(\frac{L\cdot\operatorname{width}(\mathcal{X})\cdot\log|\mathcal{X}|\cdot(1+\log^{2}(\varepsilon n))}{\varepsilon}\right).

Dividing both sides by nn yields the desired claim. ∎