Private Isotonic Regression
Abstract
In this paper, we consider the problem of differentially private (DP) algorithms for isotonic regression. For the most general problem of isotonic regression over a partially ordered set (poset) and for any Lipschitz loss function, we obtain a pure-DP algorithm that, given input points, has an expected excess empirical risk of roughly , where is the width of the poset. In contrast, we also obtain a near-matching lower bound of roughly , that holds even for approximate-DP algorithms. Moreover, we show that the above bounds are essentially the best that can be obtained without utilizing any further structure of the poset. In the special case of a totally ordered set and for and losses, our algorithm can be implemented in near-linear running time; we also provide extensions of this algorithm to the problem of private isotonic regression with additional structural constraints on the output function.
1 Introduction
Isotonic regression is a basic primitive in statistics and machine learning, which has been studied at least since the 1950s [4, 9]; see also the textbooks on the topic [5, 38]. It has seen applications in numerous fields including medicine [31, 39], where the expression of an antigen is modeled as a monotone function of the DNA index and WBC count, and education [19], where isotonic regression was used to predict college GPA using high school GPA and standardized test scores. Isotonic regression is also arguably the most common non-parametric method for calibrating machine learning models [51], including modern neural networks [23].
In this paper, we study isotonic regression with a differential privacy (DP) constraint on the output model. DP [17, 16] is a highly popular notion of privacy for algorithms and machine learning primitives, and has seen increased adoption due to its powerful guarantees and properties [37, 43]. Despite the plethora of work on DP statistics and machine learning (see Section 5 for related work), ours is, to the best of our knowledge, the first to study DP isotonic regression.
In fact, we consider the most general version of the isotonic regression problem. We first set up some notation to describe our results. Let be any partially ordered set (poset). A function is monotone if and only if for all . For brevity, we use to denote the set of all monotone functions from to ; throughout, we consider .
Let denote . Given an input dataset , let the empirical risk of a function be , where is a loss function.
We study private isotonic regression in the basic machine learning framework of empirical risk minimization. Specifically, the goal of the isotonic regression problem, given dataset , is to find a monotone function that minimizes . The excess empirical risk of a function is defined as where .
1.1 Our Results
General Posets.
Our first contribution is to give nearly tight upper and lower bounds for any poset, based on its width, as stated below (see Section 4 for a formal definition).
Theorem 1 (Upper Bound for General Poset).
Let be any finite poset and let be an -Lipschitz loss function. For any , there is an -DP algorithm for isotonic regression for with expected excess empirical risk at most .
Theorem 2 (Lower Bound for General Poset; Informal).
For any and any , any -DP algorithm for isotonic regression for a “nice” loss function must have expected excess empirical risk .
While our upper and lower bounds do not exactly match because of the multiplication-vs-addition of , we show in Section 4.3 that there are posets for which each bound is tight. In other words, this gap cannot be closed for generic posets.
Totally Ordered Sets.
The above upper and lower bounds immediately translate to the case of totally ordered sets, by plugging in . More importantly, we give efficient algorithms in this case, which run in time for general loss functions , and in nearly linear time for the widely-studied - and -losses. (Recall that the -loss is and the -loss is .)
Theorem 3.
For all finite totally ordered sets , -Lipschitz loss functions , and , there is an -DP algorithm for isotonic regression for with expected excess empirical risk . The running time of this algorithm is in general and can be improved to for and losses.
We are not aware of any prior work on private isotonic regression. A simple baseline algorithm for this problem would be to use the exponential mechanism over the set of all monotone functions taking values in a discretized set, to choose one with small loss. We show in Appendix A that this achieves an excess empirical risk of , which is quadratically worse than the bound in Theorem 1. Moreover, even in the case of a totally ordered set, it is unclear how to implement such a mechanism efficiently.
We demonstrate the flexibility of our techniques by showing that they can be extended to variants of isotonic regression where, in addition to monotonicity, we also require the output function to satisfy additional properties. For example, we may want it to be -Lipschitz for some specified . Other constraints we can handle include -piecewise constant, -piecewise linear, convexity, and concavity. For each of these constraints, we devise an algorithm that yields essentially the same error as in the unconstrained case and still runs in polynomial time.
Theorem 4.
For all finite totally ordered sets , -Lipschitz loss functions , and , there is an -DP algorithm for -piecewise constant, -piecewise linear, Lipschitz, convex, or concave isotonic regression for with expected excess empirical risk . The running time of this algorithm is .
Organization.
We next provide necessary background on DP. In Section 3, we prove our results for totally ordered sets (including Theorem 3). We then move on to discuss general posets in Section 4. Section 5 contains additional related work. Finally, we conclude with some discussion in Section 6. Due to space constraints, we omit some proofs from the main body; these can be found in the Appendix.
2 Background on Differential Privacy
Two datasets and are said to be neighboring, denoted , if there is an index such that for all . We recall the formal definition of differential privacy [18, 16]:
Definition 5 (Differential Privacy (DP) [18, 16]).
Let and . A randomized algorithm is -differentially private (-DP) if, for all and all (measurable) outcomes , we have that .
We denote -DP as -DP (aka pure-DP). The case when is referred to as approximate-DP.
We will use the following composition theorems throughout our proofs.
Lemma 6.
-DP satisfies the following:
• Basic Composition: If mechanisms are such that satisfies -DP, then the composed mechanism satisfies -DP. This holds even under adaptive composition, where each can depend on the outputs of .
• Parallel Composition [33]: If a mechanism satisfies -DP, then for any partition of , the composed mechanism given as satisfies -DP.
Exponential Mechanism.
The exponential mechanism solves the basic task of selection in data analysis: given a dataset and a set of options, it outputs the (approximately) best option, where “best” is defined by a scoring function . The -DP exponential mechanism [34] is the randomized mechanism given by
where is the sensitivity of the scoring function.
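To make this concrete, here is a minimal Python sketch of the exponential mechanism over a finite candidate set; the function and argument names are ours (not from the paper), and the scoring function, its sensitivity, and the privacy parameter are supplied by the caller.

```python
import math
import random

def exponential_mechanism(candidates, score, sensitivity, epsilon, rng=random):
    """Sample a candidate with probability proportional to
    exp(epsilon * score(c) / (2 * sensitivity))."""
    scores = [score(c) for c in candidates]
    max_score = max(scores)  # shift by the max for numerical stability
    weights = [math.exp(epsilon * (s - max_score) / (2.0 * sensitivity))
               for s in scores]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]  # guard against floating-point round-off
```

The weights are proportional to exp(ε · score / (2Δ)); subtracting the maximum score before exponentiating only prevents overflow and does not change the sampling distribution.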
Lemma 7 ([34]).
For being the -DP exponential mechanism, it holds for all that
Lower Bound for Privatizing Vectors.
Lower bounds for DP algorithms that can output a binary vector that is close (say, in the Hamming distance) to the input are well-known.
Lemma 8 (e.g., [32]).
Let , let the input domain be and let two vectors be neighbors if and only if . Then, for any -DP algorithm , we have .
It is also simple to extend the lower bound to the case where the vector is not binary, as stated below. We defer the full proof to Appendix B.
Lemma 9.
Let be any positive integer such that , let the input domain be and let two vectors be neighbors if and only if . Then, for any -DP algorithm , we have that
Group Differential Privacy.
For any neighboring relation , we write as a neighboring relation where iff there is a sequence for some such that for all .
Fact 10 (e.g., [41]).
Let and . Suppose that is an -DP algorithm for the neighboring relation . Then is -DP for the neighboring relation .
3 DP Isotonic Regression over Total Orders
We first focus on the “one-dimensional” case where is totally ordered; for convenience, we assume that where the order is the natural order on integers. We first present an efficient algorithm for this case and then a matching lower bound.
3.1 An Efficient Algorithm
To describe our algorithm, it will be more convenient to use the unnormalized version of the empirical risk, which we define as .
We now provide a high-level overview of our algorithm. Any monotone function admits a (not necessarily unique) threshold such that for all and for all . Our algorithm works by first choosing this threshold in a private manner using the exponential mechanism. The choice of partitions into and . The algorithm recurses on these two parts to find functions and , which are then glued to obtain the final function.
In particular, the algorithm proceeds in stages, where in stage , the algorithm starts with a partition of into intervals , and the algorithm eventually outputs a monotone function such that for all . This partition is further refined for stage by choosing a threshold in and partitioning into and . In the final stage, the function is chosen to be the constant over . Note that the algorithm may stop at because the Lipschitzness of ensures that assigning each partition to the constant cannot increase the error by more than .
We have already mentioned above that each has to be chosen in a private manner. However, if we let the scoring function directly be the unnormalized empirical risk, then its sensitivity remains as large as even at a large stage . This is undesirable because the error from each run of the exponential mechanism can be as large as , but there are as many as runs in stage . Adding these error terms up would result in a far larger total error than desired.
To circumvent this, we observe that while the sensitivity can still be large, it is mostly “ineffective” because the function range is now restricted to only an interval of length . Indeed, we may use the following “clipped” version of the loss function, which has a low sensitivity of instead.
Definition 11 (Clipped Loss Function).
For a range , let be given as . Similar to above, we also define .
Observe that , since is -Lipschitz. In other words, the sensitivity of is only . Algorithm 1 contains a full description.
Algorithm 1 (sketch of one stage):
• {Let the working dataset be the set of all input points whose x-coordinate belongs to the current interval.}
• {Notation: define the clipped losses of the two candidate halves, and similarly for the symmetric case.}
• Choose the threshold using the -DP exponential mechanism with the scoring function built from these clipped losses. {Note: the scoring function has sensitivity at most .}
• Recurse on the two resulting sub-intervals.
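As an illustration of the recursive structure (not the exact routine above), here is a simplified Python sketch: it restricts function values to the current half of a discretized range instead of using the clipped loss, scores each candidate threshold by brute-force dynamic programming, and samples the threshold with the exponential mechanism. All names are ours.

```python
import math
import random

def restricted_iso_loss(groups, values, loss):
    """Minimum total loss of a monotone function on consecutive positions whose
    values lie in the sorted list `values`; groups[i] holds the labels at
    position i.  Simple O(#positions * len(values)) dynamic program."""
    prev = [0.0] * len(values)
    for labels in groups:
        cost = [sum(loss(v, y) for y in labels) for v in values]
        run_min, cur = float("inf"), []
        for j, c in enumerate(cost):
            run_min = min(run_min, prev[j])  # best with previous value <= values[j]
            cur.append(run_min + c)
        prev = cur
    return min(prev) if prev else 0.0

def private_isotonic(groups, values, eps_per_level, loss, rng=random):
    """Range-halving recursion: privately choose how many positions get values
    in the lower half of the (discretized) range, then recurse on the halves."""
    if len(values) <= 1 or not groups:
        return [values[0] if values else 0.0] * len(groups)
    m = len(values) // 2
    lower, upper = values[:m], values[m:]

    def score(t):  # t = number of leftmost positions assigned values in `lower`
        return -(restricted_iso_loss(groups[:t], lower, loss)
                 + restricted_iso_loss(groups[t:], upper, loss))

    # For labels and values in [0, 1] and a 1-Lipschitz loss, this score has
    # sensitivity at most 1; the paper's clipped loss sharpens this to the
    # length of the current value range, which is what yields the improved bound.
    cands = list(range(len(groups) + 1))
    scores = [score(t) for t in cands]
    mx = max(scores)
    weights = [math.exp(eps_per_level * (s - mx) / 2.0) for s in scores]
    t = rng.choices(cands, weights=weights, k=1)[0]
    return (private_isotonic(groups[:t], lower, eps_per_level, loss, rng)
            + private_isotonic(groups[t:], upper, eps_per_level, loss, rng))
```

The recursion depth is about log2(len(values)); within a level the exponential mechanisms act on disjoint blocks of positions (parallel composition), so the whole procedure is roughly (eps_per_level · log2(len(values)))-DP by basic composition across levels.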
Proof of Theorem 3.
Before proceeding to prove the algorithm’s privacy and utility guarantees, we note that the output is indeed monotone since for every that gets separated when we partition into , we must have and .
Privacy Analysis.
Since the exponential mechanism is -DP, and the dataset is partitioned such that the exponential mechanism is applied only once to each part, the parallel composition property (Lemma 6) implies that the entire subroutine for each stage is -DP. Thus, by basic composition (Lemma 6), it follows that Algorithm 1 is -DP (since ).
Utility Analysis.
Since the sensitivity of is at most , we have from Lemma 7, that for all and ,
(1)
Let denote (with ties broken arbitrarily). Then, let denote the largest element in such that ; namely, is the optimal threshold when restricted to . Under this notation, we have that
(2)
Finally, notice that
(3)
Running Time.
To obtain a bound on the running time for general loss functions, we need to make a slight modification to the algorithm: we additionally restrict the range of to multiples of . We remark that this does not affect the utility, since the final output always takes values that are multiples of anyway.
Given any dataset where , the prefix isotonic regression problem is to compute, for each , the optimal isotonic regression loss on . Straightforward dynamic programming solves this in time, where denotes the number of possible values allowed for the function.
Now, for each , we may run the above algorithm with , where the allowed values are all multiples of in ; this gives us for all in time . Analogously, we can also compute for all in a similar time. Thus, we can compute in time , and then sample accordingly.
We can further speed up the algorithm by observing that the score remains constant for all . Hence, we may first sample an interval among and then sample uniformly from that interval. This entire process can be done in time. In total, the running time of the algorithm is thus
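For concreteness, the discretized prefix isotonic regression subroutine can be implemented by the following straightforward dynamic program (a sketch assuming the points are already grouped by x-coordinate in increasing order; the names are ours).

```python
def prefix_iso_losses(groups, values, loss):
    """groups[i] is the list of labels at the i-th smallest x-coordinate and
    `values` is the sorted list of allowed function values.  Returns a list
    pref with pref[i] = optimal isotonic loss on the first i positions.
    Runs in O(len(groups) * len(values)) time."""
    dp = [0.0] * len(values)   # dp[j]: best loss so far with current value values[j]
    pref = [0.0]
    for labels in groups:
        run_min, new_dp = float("inf"), []
        for j, v in enumerate(values):
            run_min = min(run_min, dp[j])   # previous value <= values[j] (monotonicity)
            new_dp.append(run_min + sum(loss(v, y) for y in labels))
        dp = new_dp
        pref.append(min(dp))
    return pref
```

Running the same routine on the reversed data gives the corresponding suffix losses; adding the two then gives the score of every candidate threshold, which is fed to the exponential mechanism as in Algorithm 1.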
Near-Linear Time Algorithms for -, -Losses.
We now describe faster algorithms for the - and -loss functions, thereby proving the last part of Theorem 3. The key observation is that for convex loss functions, the restricted optimum has a simple form: we just have to “clip” the optimal function to be in the range . Below, denotes the function .
Observation 12.
Let be any convex loss function, any dataset, and any real numbers such that . Define . Then, we must have .
Proof.
Consider any . Let (resp. ) denote the set of all such that (resp. ). Consider the following operations:
• For each , change to .
• For each , change to .
• Let for all .
At the end, we end up with . Each of the first two changes does not increase the loss ; otherwise, due to convexity, changing to would have decreased the objective function. Finally, the last operation does not decrease the loss; otherwise, we may replace this section of with the values in instead. Thus, we can conclude that . ∎
We will now show how to compute the scores in Algorithm 1 simultaneously for all (for fixed ) in nearly linear time. To do this, recall the prefix isotonic regression problem from earlier. For this problem, Stout [42] gave an -time algorithm for the -loss and an -time algorithm for the -loss (both for the unrestricted-value case). Furthermore, after the th iteration, the algorithm also keeps a succinct representation of the optimal solution in the form of an array , which denotes for all , and indicates the loss up until , not including.
We can extend the above algorithm to the prefix clipped isotonic regression problem, which we define in the same manner as above except that we restrict the function range to be for some given . Using Observation 12, it is not hard to extend the above algorithm to work in this case.
Lemma 13.
There is an -time algorithm for - and -prefix clipped isotonic regression.
Proof.
We first precompute and for all . We then run the aforementioned algorithm from [42]. At each iteration , we use binary search to find the largest index such that and the largest index such that . Observation 12 implies that the optimal solution of the clipped version is simply the same as that of the unrestricted version, except that we need to change the function values before to and after to . The loss of this clipped optimum can be written as , which can be computed in time given that we have already precomputed . The running time of the entire algorithm is thus the same as that of [42] together with the binary search time; the latter totals to . ∎
Our fast algorithm for computing first runs the above algorithm with and ; this gives us for all in time . Analogously, we can also compute for all in a similar time. Thus, we can compute in time , and sample accordingly. Using the same observation as in the general loss function case, this can be sped up further to time. In total, the running time of the algorithm is thus
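We do not reproduce Stout's prefix data structure here; as a minimal illustration of the clipping step for the squared loss, the following Python sketch computes the unrestricted isotonic fit by the standard pool-adjacent-violators algorithm and then clips it to the current range, which by Observation 12 yields the range-restricted optimum for convex losses (names are ours).

```python
def pava_l2(ys, ws=None):
    """Pool-adjacent-violators: the nondecreasing fit minimizing
    sum_i w_i * (f_i - y_i)^2."""
    if ws is None:
        ws = [1.0] * len(ys)
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for y, w in zip(ys, ws):
        blocks.append([y, w, 1])
        # merge while the last two block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            tot = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / tot, tot, n1 + n2])
    fit = []
    for m, _, n in blocks:
        fit.extend([m] * n)
    return fit

def clipped_fit(ys, lo, hi):
    """Range-restricted isotonic fit for the squared loss: clip the
    unrestricted optimum to [lo, hi] (cf. Observation 12)."""
    return [min(max(v, lo), hi) for v in pava_l2(ys)]
```

For example, pava_l2([3, 1, 2, 5, 4]) yields the fit (2, 2, 2, 4.5, 4.5), and clipping it to the range [2, 4] gives (2, 2, 2, 4, 4).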
3.2 A Nearly Matching Lower Bound
We show that the excess empirical risk guarantee in Theorem 3 is tight, even for approximate-DP algorithms with a sufficiently small , under a mild assumption about the loss function stated below.
Definition 14 (Distance-Based Loss Function).
For , a loss function is said to be -distance-based if there exist such that where is a non-decreasing function with and .
We remark that standard loss functions, including - or -loss, are all -distance-based.
Our lower bound is stated below. It is proved via a packing argument [25] in a similar manner as a lower bound for properly PAC learning threshold functions [10]. This is not a coincidence: indeed, when we restrict the range of our function to , the problem becomes exactly (the empirical version of) properly learning threshold functions. As a result, the same technique can be used to prove a lower bound in our setting as well.
Theorem 15.
For all , any -DP algorithm for isotonic regression over for any -distance-based loss function must have expected excess empirical risk .
Proof.
Suppose for the sake of contradiction that there exists an -DP algorithm for isotonic regression with an -distance-based loss function with expected excess empirical risk . Let .
We may assume that , as the lower bound for the case can easily be adapted for an lower bound for the case as well.
We will use the standard packing argument [25]. For each , we create a dataset that contains copies of , copies of and copies of . Finally, let denote the dataset that contains copies of and copies of . Let denote the set of all such that . The utility guarantee of implies that
Furthermore, it is not hard to see that are disjoint. In particular, for any function , let be the largest element for which ; if no such exists (i.e., ), let . For any , we have . Similarly, for any , we have . This implies that can only belong to , as claimed.
Therefore, we have that
a contradiction. ∎
3.3 Extensions
We now discuss several variants of the isotonic regression problem that place certain additional constraints on the function that we seek, as listed below.
• -Piecewise Constant: must be a step function that consists of at most pieces.
• -Piecewise Linear: must be a piecewise linear function with at most pieces.
• Lipschitz Regression: must be -Lipschitz for some specified .
• Convex/Concave: must be convex/concave.
We devise a general meta-algorithm that, with a small tweak in each case, works for all of these constraints to yield Theorem 4. At a high level, our algorithm is similar to Algorithm 1, except that, in addition to using the exponential mechanism to pick the threshold , we also pick certain auxiliary information that is then passed on to the next stage. For example, in the -piecewise constant setting, the algorithm in fact also picks the number of pieces to the left of and that to the right of it. These are then passed on to the next stage. The algorithm stops when the number of pieces becomes one, at which point it simply uses the exponential mechanism to find the constant value on this subdomain.
The full description of the algorithm and the corresponding proof are deferred to Appendix C.
4 DP Isotonic Regression over General Posets
We now provide an algorithm and lower bounds for the case of general discrete posets. We first recall basic quantities about posets. An anti-chain of a poset is a set of elements such that no two distinct elements are comparable, whereas a chain is a set of elements such that every pair of elements is comparable. The width of a poset , denoted by , is defined as the maximum size among all anti-chains in the poset. The height of , denoted by , is defined as the maximum size among all chains in the poset. Dilworth’s theorem and Mirsky’s theorem give the following relation between chains and anti-chains:
Lemma 16 (Dilworth’s and Mirsky’s theorems [12, 36]).
A poset with width can be partitioned into chains. A poset with height can be partitioned into anti-chains.
4.1 An Algorithm
Our algorithm for general posets is similar to the one for totally ordered sets presented in the previous section. The only difference is that, instead of attempting to pick a single maximal point such that as in the previous case, there could now be many such maximal ’s. Indeed, we need to use the exponential mechanism to pick all such ’s. Since these are all maximal, they must be incomparable; therefore, they form an anti-chain. Since there can be as many as anti-chains in total, this means that the error from the exponential mechanism is , leading to the multiplicative increase of in the total error. This completes our proof sketch for Theorem 1.
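To illustrate the candidate set that the exponential mechanism ranges over at each stage, the following Python sketch enumerates the antichains of a small poset given as an explicit comparability predicate; this is a brute-force helper of our own, feasible only for small posets.

```python
from itertools import combinations

def antichains(elements, leq):
    """Enumerate all antichains of a finite poset; leq(a, b) is True iff a <= b.
    Brute force over subsets of each size, so only suitable for small posets."""
    yield frozenset()                      # the empty antichain
    for k in range(1, len(elements) + 1):
        for subset in combinations(elements, k):
            if all(not leq(a, b) and not leq(b, a)
                   for a, b in combinations(subset, 2)):
                yield frozenset(subset)
```

For the 2x2 grid poset, with leq = lambda a, b: a[0] <= b[0] and a[1] <= b[1] over the elements (0,0), (0,1), (1,0), (1,1), this yields six antichains (including the empty one). Since every antichain has at most width(X) elements, the number of candidates is at most (|X| + 1)^{width(X)}, and its logarithm is what the exponential mechanism pays per stage.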
4.2 Lower Bounds
To prove a lower bound of , we observe that the values of the function in any anti-chain can be arbitrary. Therefore, we may use each element in a maximum anti-chain to encode as a binary vector. The lower bound from Lemma 8 then gives us an lower bound for , as formalized below.
Lemma 17.
For any , any -DP algorithm for isotonic regression for any -distance-based loss function must have expected excess empirical risk .
Proof.
Consider any -DP isotonic regression algorithm for loss . Let be any maximum anti-chain (of size ) in . We use this algorithm to build a -DP algorithm for privatizing a binary vector of dimensions as follows:
• Let be distinct elements of .
• On input , create a dataset where is repeated times.
• Run on the instance to get , and output a vector where .
It is obvious that this algorithm is -DP. Observe also that and thus ’s expected excess empirical risk is , which, from Lemma 8, must be at least . ∎
By using group privacy (Fact 10) and repeating each element times, we arrive at a lower bound of . Furthermore, since contains elements that form a totally ordered set, Theorem 15 gives a lower bound of as long as . Finally, due to Lemma 16, we have , which means that . Thus, we arrive at:
Theorem 18.
For any and any , any -DP algorithm for isotonic regression for -distance-based loss function must have expected excess empirical risk .
4.3 Tight Examples for Upper and Lower Bounds
Recall that our upper bound is while our lower bound is . One might wonder whether this gap can be closed. Below we show that, unfortunately, this is impossible in general: there are posets for which each bound is tight.
Tight Lower Bound Example. Let denote the poset that consists of disjoint chains, where and . (Elements on different chains are incomparable.) In this case, we can solve the isotonic regression problem directly on each chain and piece the solutions together into the final output . Note that and . According to Theorem 1, the unnormalized excess empirical risk in is . Therefore, the total (normalized) empirical risk for the entire domain is . This is at most as long as ; this matches the lower bound.
Tight Upper Bound Example. Consider the grid poset where if and only if and . We assume throughout that . Observe that and .
We will show the following lower bound, which matches the upper bound in the case where , up to factor. We prove it by a reduction from Lemma 9. Note that this reduction is in some sense a “combination” of the proofs of Theorem 15 and Lemma 17, as the coordinate-wise encoding aspect of Lemma 17 is still present (across the rows) whereas the packing-style lower bound is present in how we embed elements of (in blocks of columns).
Lemma 19.
For any and , any -DP algorithm for isotonic regression for any -distance-based loss function must have expected excess empirical risk .
Proof.
Let and . Consider any -DP algorithm for isotonic regression for on where . We use this algorithm to build a -DP algorithm for privatizing a vector as follows:
• Create a dataset that contains:
– For all , copies of and copies of .
– copies of .
• Run on instance to get .
• Output a vector where . (For simplicity, when such does not exist, let .)
By group privacy, is -DP. Furthermore, and the expected empirical excess risk of is
which must be at least by Lemma 9. ∎
5 Additional Related Work
(Non-private) isotonic regression is well-studied in statistics and machine learning. The one-dimensional (aka univariate) case has a long history [9, 46, 5, 44, 45, 13, 8, 35, 14, 15, 49]; for a general introduction, see [22]. Moreover, isotonic regression has been studied in higher dimensions [24, 27, 26], including the sparse setting [21], as well as in online learning [29]. A related line of work studies learning neural networks under (partial) monotonicity constraints [3, 50, 30, 40].
6 Conclusions
In this paper, we obtained new private algorithms for isotonic regression on posets and proved nearly matching lower bounds in terms of the expected excess empirical risk. Although our algorithms for totally ordered sets are efficient, our algorithm for general posets is not. Specifically, a trivial implementation of the algorithm would run in time . It remains an interesting open question whether this can be sped up. To the best of our knowledge, this question does not seem to be well understood even in the non-private setting, as previous algorithmic works have focused primarily on the totally ordered case. Similarly, while our algorithm is efficient for totally ordered sets, it remains interesting to understand whether the nearly linear time algorithms for - and -losses can be extended to a larger class of loss functions.
References
- [1] D. Alabi, A. McMillan, J. Sarathy, A. D. Smith, and S. P. Vadhan. Differentially private simple linear regression. Proc. Priv. Enhancing Technol., 2022(2):184–204, 2022.
- [2] N. Alon, R. Livni, M. Malliaris, and S. Moran. Private PAC learning implies finite littlestone dimension. In STOC, pages 852–860, 2019.
- [3] N. P. Archer and S. Wang. Application of the back propagation neural network algorithm with monotonicity constraints for two-group classification problems. Dec. Sci., 24(1):60–75, 1993.
- [4] M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. Ann. Math. Stat., pages 641–647, 1955.
- [5] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical Inference Under Order Restrictions. John Wiley & Sons, 1973.
- [6] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473, 2014.
- [7] A. Beimel, K. Nissim, and U. Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. Theory Comput., 12(1):1–61, 2016.
- [8] L. Birgé and P. Massart. Rates of convergence for minimum contrast estimators. Prob. Theory Rel. Fields, 97(1):113–150, 1993.
- [9] H. D. Brunk. Maximum likelihood estimates of monotone parameters. Ann. Math. Stat., pages 607–616, 1955.
- [10] M. Bun, K. Nissim, U. Stemmer, and S. P. Vadhan. Differentially private release and learning of threshold functions. In FOCS, pages 634–649, 2015.
- [11] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. JMLR, 12(3), 2011.
- [12] R. P. Dilworth. A decomposition theorem for partially ordered sets. Ann. Math., 51(1):161–166, 1950.
- [13] D. L. Donoho. Gelfand -widths and the method of least squares. Technical Report, University of California, 1991.
- [14] C. Durot. On the -error of monotonicity constrained estimators. Ann. Stat., 35(3):1080–1104, 2007.
- [15] C. Durot. Monotone nonparametric regression with random design. Math. Methods Stat., 17(4):327–341, 2008.
- [16] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503, 2006.
- [17] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
- [18] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
- [19] R. L. Dykstra and T. Robertson. An algorithm for isotonic regression for two or more independent variables. Ann. Stat., 10(3):708–716, 1982.
- [20] V. Feldman and D. Xiao. Sample complexity bounds on differentially private learning via communication complexity. In COLT, volume 35, pages 1000–1019, 2014.
- [21] D. Gamarnik and J. Gaudio. Sparse high-dimensional isotonic regression. NeurIPS, 2019.
- [22] P. Groeneboom and G. Jongbloed. Nonparametric Estimation under Shape Constraints. Cambridge University Press, 2014.
- [23] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In ICML, pages 1321–1330, 2017.
- [24] Q. Han, T. Wang, S. Chatterjee, and R. J. Samworth. Isotonic regression in general dimensions. Ann. Stat., 47(5):2440–2471, 2019.
- [25] M. Hardt and K. Talwar. On the geometry of differential privacy. In STOC, pages 705–714, 2010.
- [26] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NeurIPS, 2011.
- [27] A. T. Kalai and R. Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
- [28] H. Kaplan, K. Ligett, Y. Mansour, M. Naor, and U. Stemmer. Privately learning thresholds: Closing the exponential gap. In COLT, pages 2263–2285, 2020.
- [29] W. Kotłowski, W. M. Koolen, and A. Malek. Online isotonic regression. In COLT, pages 1165–1189, 2016.
- [30] X. Liu, X. Han, N. Zhang, and Q. Liu. Certified monotonic neural networks. NeurIPS, pages 15427–15438, 2020.
- [31] R. Luss, S. Rosset, and M. Shahar. Efficient regularized isotonic regression with application to gene–gene interaction search. Ann. Appl. Stat., 6(1):253–283, 2012.
- [32] P. Manurangsi. Tight bounds for differentially private anonymized histograms. In SOSA, pages 203–213, 2022.
- [33] F. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun. ACM, 53(9):89–97, 2010.
- [34] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103, 2007.
- [35] M. Meyer and M. Woodroofe. On the degrees of freedom in shape-restricted regression. Ann. Stat., 28(4):1083–1104, 2000.
- [36] L. Mirsky. A dual of Dilworth’s decomposition theorem. AMS, 78(8):876–877, 1971.
- [37] C. Radebaugh and U. Erlingsson. Introducing TensorFlow Privacy: Learning with Differential Privacy for Training Data, March 2019. blog.tensorflow.org.
- [38] T. Robertson, F. T. Wright, and R. L. Dykstra. Order restricted statistical inference. John Wiley & Sons, 1988.
- [39] M. J. Schell and B. Singh. The reduced monotonic regression method. JASA, 92(437):128–135, 1997.
- [40] A. Sivaraman, G. Farnadi, T. Millstein, and G. Van den Broeck. Counterexample-guided learning of monotonic neural networks. In NeurIPS, pages 11936–11948, 2020.
- [41] T. Steinke and J. R. Ullman. Between pure and approximate differential privacy. J. Priv. Confidentiality, 7(2), 2016.
- [42] Q. F. Stout. Unimodal regression via prefix isotonic regression. Comput. Stat. Data Anal., 53(2):289–297, 2008.
- [43] D. Testuggine and I. Mironov. PyTorch Differential Privacy Series Part 1: DP-SGD Algorithm Explained, August 2020. medium.com.
- [44] S. Van de Geer. Estimating a regression function. Ann. Stat., pages 907–924, 1990.
- [45] S. Van de Geer. Hellinger-consistency of certain nonparametric maximum likelihood estimators. Ann. Stat., 21(1):14–44, 1993.
- [46] C. van Eeden. Testing and Estimating Ordered Parameters of Probability Distribution. CWI, Amsterdam, 1958.
- [47] D. Wang, C. Chen, and J. Xu. Differentially private empirical risk minimization with non-convex loss functions. In ICML, pages 6526–6535, 2019.
- [48] D. Wang, M. Ye, and J. Xu. Differentially private empirical risk minimization revisited: Faster and more general. In NIPS, 2017.
- [49] F. Yang and R. F. Barber. Contraction and uniform convergence of isotonic regression. Elec. J. Stat., 13(1):646–677, 2019.
- [50] S. You, D. Ding, K. Canini, J. Pfeifer, and M. Gupta. Deep lattice networks and partial monotonic functions. NIPS, 2017.
- [51] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pages 694–699, 2002.
Appendix A Baseline Algorithm for Private Isotonic Regression
We provide a baseline algorithm for private isotonic regression by a direct application of the exponential mechanism. For simplicity, we start with the case of totally ordered sets and then extend the algorithm to general posets.
Totally ordered sets.
Consider a discretized range of . We have that for and , it holds that . Also, it is a simple combinatorial fact that , which bounds the number of monotone functions with this discretization. Thus, the -DP exponential mechanism over the set of all monotone functions in , with the score function of sensitivity at most , returns such that
Setting , gives an excess empirical error of (when ).
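A brute-force version of this baseline can be written directly, as in the hedged Python sketch below (names are ours; it is feasible only for tiny domains and coarse grids, since the number of candidate monotone functions is C(domain_size + len(grid) - 1, domain_size)).

```python
from itertools import combinations_with_replacement
import math
import random

def baseline_exp_mech(data, domain_size, grid, epsilon, loss, rng=random):
    """Exponential mechanism run directly over every monotone function from
    {1, ..., domain_size} to the discretized grid.  `data` is a list of (x, y)
    pairs with x in {1, ..., domain_size}."""
    candidates = list(combinations_with_replacement(sorted(grid), domain_size))
    def emp_loss(f):          # f is a nondecreasing tuple; f[x - 1] is the value at x
        return sum(loss(f[x - 1], y) for x, y in data)
    scores = [-emp_loss(f) for f in candidates]
    mx = max(scores)
    # for a 1-Lipschitz loss with labels and values in [0, 1], the score has sensitivity <= 1
    weights = [math.exp(epsilon * (s - mx) / 2.0) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

The logarithm of this candidate count is what drives the baseline's excess empirical error bound stated above.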
General posets.
By Lemma 16, we have that can be partitioned into many chains . Let . Since any monotone function over has to be monotone over each of the chains, we have that
Thus, by a similar argument as above, the -DP exponential mechanism over the set of all monotone functions in , with score function returns such that
Choosing , gives an excess empirical error of (when ).
Appendix B Lower Bound on Privatizing Vectors with Large Alphabet: Proof of Lemma 9
Proof of Lemma 9.
For every , let denote . Let . We have
Appendix C Algorithms for Isotonic Regression with Additional Constraints
In this section, we elaborate on the constrained variants of the isotonic regression problem over totally ordered sets, by designing a meta-algorithm that can be instantiated to get algorithms for each of the cases discussed in Section 3.3.
Recall that Algorithm 1 proceeded in rounds where in round the algorithm starts with a partition of into intervals, and then partitions each interval into two using the exponential mechanism. At a high level, our meta-algorithm is similar, except that it maintains a set of pairwise disjoint structured intervals of ; that is, each interval has an additional structure which imposes constraints on the function that can be returned on the said interval; moreover, the function is fixed outside the union of these intervals. This idea is described in Algorithm 2, stated using the following abstractions, which will be instantiated to derive algorithms for each constrained variant.
• A set of all structured intervals of denoted as , and an initial structured interval . A structured interval will consist of an interval domain denoted , an interval range denoted , and potentially additional other constraints that the function should satisfy. We use to denote the length of . In order to make the number of structured intervals bounded, we will consider a discretized range where the endpoints of the interval lie in for some discretization parameter .
• A partition method that defines a set of all “valid partitions” of a structured interval into two structured intervals and and a function . It is required that be an interval. If the algorithm makes a choice of , then the final function returned by the algorithm is required to be equal to on .
• For all , we abuse notation to let denote the set of all monotone functions mapping to , while respecting the additional conditions enforced by the structure in .
Algorithm 2 (sketch of one step):
• Choose , using the -DP exponential mechanism with scoring function . {Notation: , is defined similarly, and .} {Note: has sensitivity at most .}
• Recurse on and .
We instantiate this notion of structured intervals in the following ways to derive algorithms for the constrained variants of isotonic regression mentioned earlier:
• (Vanilla) Isotonic Regression (recovers Algorithm 1): is simply the set of all interval domains and all (discretized) interval ranges, and the partition method simply partitions into two sub-intervals, with the range divided into two equal parts. (We ignore a slight detail that need not be in ; this can be fixed, e.g., by letting it be , but we skip this complicated expression for simplicity. Note that, if we let , this distinction does not make a difference in the algorithm for vanilla isotonic regression.) Namely,
We skip the description of the function in the partition method , since the middle sub-interval is empty. For all the other variants, we skip having to explicitly write the conditions of , , , and in the definition of , and similarly that consists of monotone functions mapping to ; we only focus on the main new conditions.
• -Piecewise Constant: is the set of all interval domains and all discretized ranges, along with a parameter (encoding an upper bound on the number of pieces in the final piecewise constant function). The partition method partitions into two sub-intervals respecting the number of pieces, with the range divided into two equal parts, namely,
• -Piecewise Linear: is the set of all interval domains and all discretized ranges, along with a parameter (encoding an upper bound on the number of pieces in the final piecewise linear function), and two Boolean values (/), one encoding whether the function must achieve the minimum possible value at the start of the interval, and the other encoding whether it must achieve the maximum possible value at the end of the interval. The partition method partitions into two sub-intervals respecting the number of pieces, by choosing a middle sub-interval that ensures that each range is at most half as large as the earlier one, namely,
s.t. if and if . In other words, considers the three sub-intervals , , and , and fits an affine function in the middle sub-interval such that and , and ensures that the function returned on sub-intervals and satisfies and .
• Lipschitz Regression: Given any Lipschitz constant , is the set of all interval domains and all discretized ranges, along with two Boolean values (/), one encoding whether the function must achieve the minimum possible value at the start of the interval, and the other encoding whether it must achieve the maximum possible value at the end of the interval. The partition method chooses sub-intervals by choosing and function values and such that (thereby respecting the Lipschitz condition), and moreover and .
s.t. if and if .
• Convex/Concave: We only describe the convex case; the concave case follows similarly. Note that a function is convex over the discrete domain if and only if holds for all . Let be the set of all interval domains and all discretized ranges, along with the following additional parameters:
– a lower bound on the (discrete) derivative of ,
– an upper bound on the (discrete) derivative of ,
– a Boolean value encoding whether the function must achieve the minimum possible value at the start of the interval,
– another Boolean value encoding whether the function must achieve the maximum possible value at the end of the interval.
The partition method chooses sub-intervals by choosing and function values and such that , and , and enforcing that the left sub-interval has derivatives at most and the right sub-interval has derivatives at least .
s.t. for all it holds that , and if and if .
Privacy Analysis.
This follows similarly to the corresponding analysis for Algorithm 1.
Utility Analysis.
Since in each of the cases, it follows that the sensitivity of the scoring function is at most . The rest of the proof follows similarly, with the only change being that the number of candidates in the exponential mechanism is given as , which in the case of vanilla isotonic regression was simply . We now bound this for each of the cases, which shows that is at most . In particular,
• -Piecewise Constant: .
• -Piecewise Linear: .
• -Lipschitz: .
• Convex/Concave:
Finally, there is an additional error due to discretization. To account for the discretization error, we argue below for appropriately selected values of that, for any optimal function , there exists such that . This indeed immediately implies that the discretization error is at most .
• -Piecewise Linear: We may select . In this case, for every endpoint , we let and interpolate the intermediate points accordingly. It is simple to see that as desired.
• -Lipschitz and Convex/Concave: Let . Here we discretize the (discrete) derivative of . Specifically, let and let for all . Once again, it is simple to see that differ by at most at each point.
In summary, in all cases, we have resulting in the same asymptotic error as in the unconstrained case.
Runtime Analysis.
It is easy to see that each score value can be computed (via dynamic programming) in time . Thus, the entire algorithm can be implemented in time , as claimed. (In the main body, we erroneously claimed that the running time was instead of .)
Appendix D Missing Proofs from Section 4
For a set , its lower closure and upper closure are defined as and , respectively. Similarly, the strict lower closure and strict upper closure are defined as and . When , we use the convention that and .
D.1 Proof of Theorem 1
We note that, in the proof below, we also consider the empty set to be an anti-chain.
Proof of Theorem 1.
We use the notations of and as defined in the proof of Theorem 3.
Any monotone function corresponds to an antichain in such that for all and for all . Our algorithm works by first choosing this antichain in a DP manner using the exponential mechanism. The choice of partitions the poset into two parts and and the algorithm recurses on these two parts to find functions and , which are put together to obtain the final function.
In particular, the algorithm proceeds in stages, where in stage , the algorithm starts with a partition of into parts , and the algorithm eventually outputs a monotone function such that for all . This partition is further refined for stage by choosing an antichain in and partitioning into and . In the final stage, the function is chosen to be the constant over . A complete description is presented in Algorithm 3.
Algorithm 3 (sketch of one stage):
• {Let the working dataset be the set of all input points whose x-coordinate belongs to the current part.}
• Let be the set of all antichains in . For each antichain , we abuse notation to use to denote , and to denote .
• Choose an antichain using the exponential mechanism with the scoring function . { has sensitivity at most .}
• Recurse on and .
Before proceeding to prove the algorithm’s privacy and utility guarantees, we note that the output is indeed monotone because for every that gets separated when we partition into , we must have and .
Privacy Analysis.
Utility Analysis.
Since the sensitivity of is at most , we have from Lemma 7, that for all and ,
(4)
To facilitate the subsequent steps of the proof, let us introduce additional notation. Let denote (with ties broken arbitrarily). Then, let denote the set of all maximal elements of . Under this notation, we have that
(5)
(6)
Finally, notice that
(7)