Private optimization in the interpolation regime:
faster rates and hardness results
Abstract
In non-private stochastic convex optimization, stochastic gradient methods converge much faster on interpolation problems—problems where there exists a solution that simultaneously minimizes all of the sample losses—than on non-interpolating ones; we show that, in general, similar improvements are impossible in the private setting. However, when the functions exhibit quadratic growth around the optimum, we show (near) exponential improvements in the private sample complexity. In particular, we propose an adaptive algorithm that improves the sample complexity to achieve expected error from to for any fixed , while retaining the standard minimax-optimal sample complexity for non-interpolation problems. We prove a lower bound showing that the dimension-dependent term is tight. Furthermore, we provide a superefficiency result demonstrating the necessity of the polynomial term for adaptive algorithms: any algorithm that has a polylogarithmic sample complexity for interpolation problems cannot achieve the minimax-optimal rates for the family of non-interpolation problems.
1 Introduction
We study differentially private stochastic convex optimization (DP-SCO), where given a dataset we wish to solve
\[ \text{minimize}_{x \in \mathcal{X}} \; F(x) := \mathbb{E}_{S \sim P}\,[f(x; S)] \tag{1} \]
while guaranteeing differential privacy. In problem (1), $\mathcal{X}$ is the parameter space, $\mathcal{S}$ is a sample space with data distribution $P$, and $\{f(\cdot\,; s)\}_{s \in \mathcal{S}}$ is a collection of convex losses. We study the interpolation setting, where there exists a solution that simultaneously minimizes all of the sample losses.
Interpolation problems are ubiquitous in machine learning applications: for example, least squares problems with consistent solutions [28, 24], and problems with over-parametrized models where a perfect predictor exists [23, 11, 12]. This has led to a great deal of work on the advantages and implications of interpolation [26, 16, 11, 12].
For non-private SCO, interpolation problems allow significant improvements in convergence rates over generic problems [26, 16, 23, 29, 31]. For general convex functions, SST [26] develop algorithms that obtain sub-optimality, improving over the minimax-optimal rate for non-interpolation problems. Even more dramatic improvements are possible when the functions exhibit growth around the minimizer, as VBS [29] show that SGD achieves exponential rates in this setting compared to polynomial rates without interpolation. [3, 1, 14] extend these fast convergence results to model-based optimization methods.
Despite the recent progress and increased interest in interpolation problems, in the private setting they remain poorly understood. In spite of the substantial progress in characterizing the tight convergence guarantees for a variety of settings in DP optimization [13, 9, 20, 6, 7], we have little understanding of private optimization in the growing class of interpolation problems.
Given (i) the importance of differential privacy and interpolation problems in modern machine learning, (ii) the (often) paralyzingly slow rates of private optimization algorithms, and (iii) the faster rates possible for non-private interpolation problems, the interpolation setting provides a reasonable opportunity for significant speedups in the private setting. This motivates the following two questions: first, is it possible to improve the rates for DP-SCO in the interpolation regime? And, what are the optimal rates?
1.1 Our contributions
We answer both questions. In particular, we show that
1. No improvements in general (Section 3): our first result is a hardness result demonstrating that the rates cannot be improved for DP-SCO in the interpolation regime with general convex functions. More precisely, we prove a lower bound of on the excess loss for pure differentially private algorithms. This shows that existing algorithms already achieve the optimal private rates in this setting.
2. Faster rates with growth (Section 4): when the functions exhibit quadratic growth around the minimizer, that is, for some , we propose an algorithm that achieves near-exponentially small excess loss, improving over the polynomial rates in the non-interpolation setting. Specifically, we show that the sample complexity to achieve expected excess loss is for pure DP and for -DP, for any fixed . This improves over the sample complexity for non-interpolation problems with growth, which is . We also present new algorithms that improve the rates for interpolation problems under the weaker -growth assumption [7] for , where we achieve excess loss , compared to the previous bound without interpolation.
3. Adaptivity to interpolation (Section 4.3): while these improvements for the interpolation regime are important, practitioners generally cannot tell whether the dataset they are working with is an interpolating one. It is therefore crucial that these algorithms do not fail when given a non-interpolating dataset. We show that our algorithms are adaptive to interpolation, obtaining the better rates for interpolation problems while simultaneously retaining the standard minimax-optimal rates for non-interpolation problems.
4. Tightness (Section 5): finally, we provide a lower bound and a super-efficiency result that demonstrate the (near) tightness of our upper bounds, showing that sample complexity is necessary for interpolation problems under pure DP. Moreover, our super-efficiency result shows that the polynomial dependence on in the sample complexity is necessary for adaptive algorithms: any algorithm that has a polylogarithmic sample complexity for interpolation problems cannot achieve minimax-optimal rates for non-interpolation problems.
1.2 Related work
Over the past decade, a large body of work [15, 17, 27, 13, 2, 9, 20, 6, 5, 8] has studied private convex optimization. CMS [15] and [13] study the closely related problem of differentially private empirical risk minimization (DP-ERM), where the goal is to minimize the empirical loss, and obtain (minimax) optimal rates of for pure DP and for -DP. More recently, attention has moved beyond DP-ERM to privately minimizing the population loss (DP-SCO) [9, 20, 6, 5, 10, 7]. BFTT [9] were the first to obtain the optimal rate for -DP, and subsequent papers develop more efficient algorithms that achieve the same rates [20, 8]. Moreover, other papers study DP-SCO under different settings, including non-Euclidean geometry [6, 5], heavy-tailed data [32], and functions with growth [7]. However, to the best of our knowledge, no prior work in private optimization studies the problem in the interpolation regime.
On the other hand, the (non-private) optimization literature has produced numerous papers on the interpolation regime [26, 16, 23, 29, 22, 31]. SST [26] propose algorithms that roughly achieve the rate for smooth and convex functions, where . In the interpolation regime with , this result obtains loss , improving over the standard rate for non-interpolation problems. Moreover, VBS [29] study the interpolation regime for functions with growth and show that SGD enjoys linear convergence (exponential rates). More recently, several papers have investigated and developed acceleration-based algorithms in the interpolation regime [22, 31].
2 Preliminaries
We begin with notation that will be used throughout the paper and provide some standard definitions from convex analysis and differential privacy.
Notation
We let denote the sample size and the dimension. We let denote the optimization variable and the constraint set. are samples from , and is an -valued random variable. For each sample , is a closed convex function. Let denote the subdifferential of at . We let denote the collection of datasets with data points from . We let denote the empirical loss and denote the population loss. The distance of a point to a set is . We use to denote the diameter of the parameter space and as an upper bound on this diameter.
We recall the definition of -differential privacy.
Definition 2.1.
A randomized mechanism $M$ is $(\varepsilon, \delta)$-differentially private ($(\varepsilon, \delta)$-DP) if for all datasets $S, S'$ that differ in a single data point and for all events $O$ in the output space of $M$, we have
\[ \Pr\bigl(M(S) \in O\bigr) \;\le\; e^{\varepsilon}\,\Pr\bigl(M(S') \in O\bigr) + \delta. \]
We define $\varepsilon$-differential privacy ($\varepsilon$-DP), or pure DP, to be $(\varepsilon, 0)$-differential privacy.
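To make the definition concrete, the following minimal sketch (ours, not from the paper) releases a counting query with the Laplace mechanism; since changing one data point moves the count by at most 1, noise of scale $1/\varepsilon$ yields $(\varepsilon, 0)$-DP, and a Gaussian-noise analogue calibrated to $(\varepsilon, \delta)$ gives approximate DP. The function and dataset are illustrative only.

```python
import numpy as np

def private_count(dataset, predicate, epsilon, seed=None):
    """Release a counting query with (epsilon, 0)-DP via the Laplace mechanism.

    Changing a single data point moves the true count by at most 1 (the
    sensitivity), so Laplace noise of scale 1/epsilon suffices.
    """
    rng = np.random.default_rng(seed)
    true_count = sum(predicate(s) for s in dataset)
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Example: privately count positively labeled samples with epsilon = 0.5.
data = [(1.2, +1), (-0.3, -1), (0.7, +1)]
print(private_count(data, lambda s: s[1] > 0, epsilon=0.5))
```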
We now recall a couple of standard convex analysis definitions.
Definition 2.2.
1. A function $f$ is $L$-Lipschitz if $|f(x) - f(y)| \le L\,\|x - y\|$ for all $x, y \in \mathcal{X}$. Equivalently, $f$ is $L$-Lipschitz if $\|g\| \le L$ for all $x \in \mathcal{X}$ and $g \in \partial f(x)$.
2. A function $f$ is $H$-smooth if it has $H$-Lipschitz gradient: $\|\nabla f(x) - \nabla f(y)\| \le H\,\|x - y\|$ for all $x, y \in \mathcal{X}$.
3. A function $f$ is $\lambda$-strongly convex if $f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\lambda}{2}\|y - x\|^2$ for all $x, y \in \mathcal{X}$.
We formally define interpolation problems:
Definition 2.3 (Interpolation Problem).
Let $X^\star$ denote the set of minimizers of the population objective over $\mathcal{X}$. Then problem (1) is an interpolation problem if there exists $x^\star \in X^\star$ such that, for $P$-almost all samples $s$, $x^\star$ also minimizes the sample loss $f(\cdot\,; s)$ over $\mathcal{X}$.
Interpolation problems are common in modern machine learning, where models are overparameterized. One simple example is overparameterized linear regression: there exists a solution that minimizes each individual sample function. Classification problems with margin are another example.
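As a small numerical illustration (ours, not from the paper), the snippet below exhibits interpolation in overparameterized linear regression: with more parameters than samples and consistent targets, the minimum-norm least-squares solution simultaneously drives every per-sample loss to (numerically) zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # fewer samples than parameters
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true                       # consistent (noiseless) targets

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)    # minimum-norm interpolating solution
per_sample_losses = 0.5 * (A @ x_hat - b) ** 2   # loss of each individual sample
print(per_sample_losses.max())       # ~1e-28: one point minimizes every sample loss
```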
Crucial to our results is the following quadratic growth assumption:
Definition 2.4.
We say that a function $F$ satisfies the $\lambda$-quadratic growth condition if for all $x \in \mathcal{X}$,
\[ F(x) \;\ge\; \min_{x' \in \mathcal{X}} F(x') \;+\; \frac{\lambda}{2}\,\mathrm{dist}(x, X^\star)^2, \]
where $X^\star$ is the set of minimizers of $F$ over $\mathcal{X}$.
This assumption is natural under interpolation and holds in many important applications, including noiseless linear regression [28, 24]. Past work [29, 31] uses this assumption together with interpolation to obtain faster rates of convergence for non-private optimization.
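As a concrete illustration (our own derivation, with notation local to this example), consider noiseless least squares with samples $(a_i, b_i)$ satisfying $b_i = a_i^\top x^\star$ and losses $f(x; (a, b)) = \tfrac{1}{2}(a^\top x - b)^2$. The empirical objective is
\[
F(x) = \frac{1}{2n}\|Ax - b\|_2^2 = \tfrac{1}{2}(x - x^\star)^\top \widehat{\Sigma}\,(x - x^\star),
\qquad \widehat{\Sigma} = \tfrac{1}{n} A^\top A,
\]
whose minimizing set is $X^\star = x^\star + \mathrm{null}(A)$ with minimum value $0$. Since $\mathrm{dist}(x, X^\star) = \|P_{\mathrm{row}(A)}(x - x^\star)\|_2$, we get
\[
F(x) - \min_{x'} F(x') \;\ge\; \frac{\lambda}{2}\,\mathrm{dist}(x, X^\star)^2,
\qquad \lambda = \lambda_{\min}^{+}(\widehat{\Sigma}),
\]
the smallest nonzero eigenvalue of $\widehat{\Sigma}$; that is, quadratic growth holds. Moreover, every point of $X^\star$ minimizes each individual sample loss, so the problem also interpolates.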
Finally, the adaptivity of our algorithms will crucially depend on an innovation leveraging Lipschitzian extensions, defined as follows.
Definition 2.5 (Lipschitzian extension [21]).
The Lipschitzian extension with Lipschitz constant $L$ of a function $f$ is defined as the infimal convolution
\[ f_L(x) \;:=\; \inf_{y}\,\bigl\{ f(y) + L\,\|x - y\| \bigr\}. \tag{2} \]
The Lipschitzian extension (2) essentially transforms a general convex function into an -Lipschitz convex function. We now present a few properties of the Lipschitzian extension that are relevant to our development.
Lemma 2.1.
Let be convex. Then its Lipschitzian extension satisfies the following:
1. $f_L$ is $L$-Lipschitz.
2. $f_L$ is convex.
3. If $f$ is $L$-Lipschitz, then $f_L(x) = f(x)$ for all $x$.
4. Let . If is at a finite distance from , we have
We use the Lipschitzian extension as a substitute for gradient clipping to ensure differential privacy. Unlike gradient clipping, which can effectively turn a convex problem into a non-convex one, the Lipschitzian extension of a convex function remains convex and thus retains the other nice properties that we leverage in our algorithms in Section 4.
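The following rough numerical sketch (our own; `lipschitz_extension_on_grid` is not the paper's code) evaluates the infimal convolution (2) on a one-dimensional grid and checks the properties above: the extension is $L$-Lipschitz, never exceeds the original function, and agrees with it at points where the original gradient already has norm at most $L$.

```python
import numpy as np

def lipschitz_extension_on_grid(f_vals, grid, L):
    """Discrete infimal convolution f_L(x_i) = min_j { f(x_j) + L*|x_i - x_j| }."""
    diffs = L * np.abs(grid[:, None] - grid[None, :])   # L * |x_i - x_j|
    return np.min(f_vals[None, :] + diffs, axis=1)      # minimize over the second index

grid = np.linspace(-2.0, 2.0, 2001)
f = grid ** 4                        # convex, but not 1-Lipschitz on [-2, 2]
fL = lipschitz_extension_on_grid(f, grid, L=1.0)

slopes = np.abs(np.diff(fL) / np.diff(grid))
print(slopes.max() <= 1.0 + 1e-8)    # True: the extension is 1-Lipschitz
print(np.all(fL <= f + 1e-12))       # True: the extension never exceeds f
mask = np.abs(grid) < 0.3            # here |f'(x)| = 4|x|^3 <= 1, so f_L = f
print(np.allclose(fL[mask], f[mask]))
```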
3 Hardness of private interpolation
In non-private stochastic convex optimization, it is well known that for smooth functions interpolation problems enjoy the fast rate [26], compared to the minimax-optimal rate without interpolation [19]. In this section, we show that such an improvement is not generally possible with privacy: the same lower bound as for private non-interpolation problems, , holds for interpolation problems.
To state our lower bounds, we introduce some notation that we will use throughout the paper. We let denote the family of function and dataset pairs such that is convex and -smooth in its first argument, , and is an interpolation problem (Definition 2.3). We define the constrained minimax risk to be
where denotes the collection of -differentially private mechanisms from to . We use to denote the collection of -DP mechanisms from to . Here, the expectation is taken over the randomness of the mechanism, while the dataset is held fixed.
We have the following lower bound for private interpolation problems; the proof is deferred to Section B.1.
Theorem 1.
Suppose contains a -dimensional ball of diameter . Then the following lower bound holds for
Moreover, if and , the following lower bound holds
Recall the optimal rate for pure DP optimization problems without interpolation is . The first term is the non-private rate, as this is the rate one would get if . The second term is the private rate, as this is the price algorithms have to pay for privacy. In modern machine learning, problems are often high dimensional, so we often think of the dimension scaling with some function of the number of samples . Thus, the private rate is often thought to dominate the non-private rate. For this reason, in this section, we focus on the private rate. The lower bounds of Theorem 1 show that it is not possible to improve the private rate for interpolation problems in general. Similarly, for approximate -DP, the lower bound shows that improvements are not possible for . For completeness, as we alluded to earlier, we note that our results do not preclude the possibility of improving the non-private rate from to . We leave this as an open problem of independent interest for future work.
Despite this pessimistic result, in the next section we show that substantial improvements are possible for private interpolation problems with additional growth conditions.
4 Faster rates for interpolation with growth
Having established our hardness result for general interpolation problems, in this section we show that when the functions satisfy additional growth conditions, we get (nearly) exponential improvements in the rates of convergence for private interpolation.
Our algorithms use recent localization techniques that yield optimal algorithms for DP-SCO [20, 7], where the algorithm iteratively shrinks the diameter of the domain. However, to obtain faster rates for interpolation, we crucially build on the observation that the gradient norms shrink as the iterates approach the optimal solution: under interpolation, every sample loss is minimized at the shared minimizer, so smoothness bounds each sample gradient by a multiple of the distance to it. Hence, by carefully localizing the domain and shrinking the Lipschitz constant accordingly, our algorithms improve the rates for interpolating datasets.
However, this technique alone yields an algorithm that may not be private for non-interpolation problems, violating the requirement that privacy hold for all inputs: the reduction in the Lipschitz constant may not be valid for non-interpolation problems, and thus the amount of noise added may not suffice to ensure privacy. To resolve this, we use the Lipschitzian extension (Definition 2.5) to transform our potentially non-Lipschitz sample functions into Lipschitz ones, guaranteeing privacy even for non-interpolation problems.
We begin in Section 4.1 by presenting our Lipschitzian extension based algorithm, which recovers the standard optimal rates for (non-interpolation) -Lipschitz functions while still guaranteeing privacy when the function is not Lipschitz. Then in Section 4.2 we build on this algorithm to develop a localization-based algorithm that obtains faster rates for interpolation-with-growth problems. Finally, in Section 4.3 we present our final adaptive algorithm, which obtains fast rates for interpolation-with-growth problems while achieving optimal rates for non-interpolation growth problems.
4.1 Lipschitzian-extension based algorithms
Existing algorithms for DP-SCO with -Lipschitz functions may not be private if the input function is not -Lipschitz [8, 20, 7]. Given any DP-SCO algorithm that is private for -Lipschitz functions, we present a framework that transforms into an algorithm which is (i) private for all functions, even ones that are not -Lipschitz, and (ii) has the same utility guarantees as for -Lipschitz functions. In simpler terms, our algorithm essentially feeds the Lipschitzian extensions of the sample functions as inputs. Algorithm 1 describes our Lipschitzian-extension-based framework.
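The following schematic sketch (our own Python, with hypothetical names; it is not the paper's Algorithm 1 or the base algorithm of [7]) shows the intended wiring: the base DP-SCO method is simply handed a gradient oracle for the extensions rather than for the original sample losses.

```python
import numpy as np

def extension_gradient(grad_f, x, s, L):
    """Gradient oracle for the Lipschitzian extension of the sample loss f(.; s).

    For a convex, differentiable sample loss, whenever the original gradient has
    norm at most L the extension agrees with the original function at that point
    and the same gradient can be returned; otherwise an additional prox-type
    computation on the infimal convolution (2) is needed, which we only indicate.
    """
    g = np.asarray(grad_f(x, s))
    if np.linalg.norm(g) <= L:
        return g
    raise NotImplementedError("solve the infimal convolution (2) for this sample")

def lipschitzian_extension_dpsco(base_dpsco, grad_f, dataset, L, epsilon, delta, domain):
    """Algorithm-1-style wrapper: run the base DP-SCO method on the extensions."""
    oracle = lambda x, s: extension_gradient(grad_f, x, s, L)
    return base_dpsco(oracle, dataset, L, epsilon, delta, domain)
```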
For this paper we consider to be Algorithm 2 of [7] (reproduced in Section A.2 as Algorithm 5). The following proposition summarizes our guarantees for Algorithm 1.
Proposition 1.
Let denote the set of sample function–dataset pairs such that is -Lipschitz, and let denote the set of sample function–dataset pairs such that is -DP for any . Then
1. For any , Algorithm 1 is -DP.
2. For any , Algorithm 1 achieves the same optimality guarantees as .
Proof For the first item, note that Lemma 2.1 implies that is -Lipschitz, i.e. . Since is -DP when applied over Lipschitz functions in , we have that Algorithm 1 is -DP.
For the second item, Lemma 2.1 implies that when is -Lipschitz. Thus, in Algorithm 1, we apply over itself.
∎
While clipped DP-SGD does ensure privacy for input functions that are not -Lipschitz, our approach has several advantages over clipped DP-SGD: first, clipping does not yield optimal rates for pure DP; second, clipped DP-SGD has time complexity , whereas our Lipschitzian-extension approach is compatible with existing linear-time algorithms [20], allowing almost-linear-time algorithms for interpolation problems; and finally, while clipping the gradients and using the Lipschitzian extension both alter the effective function being optimized, only the Lipschitzian extension preserves the convexity of that effective function (see item 2 in Lemma 2.1). We also note the computational efficiency of Algorithm 1: when the objective is in fact -Lipschitz, computing gradients for the Lipschitzian extension (say, in a first-order method) is only as expensive as computing gradients of the original function. In particular, one can first compute the gradient of the original function and apply item 4 of Lemma 2.1; when the problem is -Lipschitz, is always at most , and thus the gradient of the Lipschitzian extension is just the gradient of the original function.
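To see the Lipschitz case directly (a short argument of our own, using only definition (2)): if $f$ is $L$-Lipschitz, then $f(x) \le f(y) + L\|x - y\|$ for every $y$, so the infimum in (2) is attained at $y = x$ and $f_L(x) = f(x)$; consequently the gradients of $f_L$ and $f$ coincide, and running a first-order method on the extension costs no more than running it on the original function.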
4.2 Faster non-adaptive algorithm
Building on the Lipschitzian-extension framework of the previous section, we now present our epoch-based algorithm, which obtains faster rates in the interpolation-with-growth regime. It uses Algorithm 1, with taken to be Algorithm 5 (reproduced in Section A.2), as a subroutine in each epoch to localize and shrink the domain as the iterates approach the true minimizer. Simultaneously, the algorithm reduces the Lipschitz constant, as the interpolation assumption implies that gradient norms decrease for iterates near the minimizer. The detailed procedure is given in Algorithm 2, where denotes the effective diameter and denotes the effective Lipschitz constant in epoch .
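The following schematic sketch (hypothetical function names and shrinkage factors; it is not the paper's exact Algorithm 2, whose constants appear in Theorem 2) illustrates the epoch structure: each epoch calls the private subroutine on a fresh batch over a smaller ball, and, because interpolation and smoothness bound gradient norms by a multiple of the ball's radius, the effective Lipschitz constant shrinks along with the diameter.

```python
def localized_interpolation_dpsco(dp_subroutine, data_batches, x0, D0, H, epsilon):
    """One private subroutine call per epoch on a shrinking ball (schematic)."""
    x, D = x0, D0
    for batch in data_batches:       # disjoint batch of samples for each epoch
        L = H * D                    # gradient bound near an interpolating optimum
        x = dp_subroutine(batch, center=x, radius=D, lipschitz=L, epsilon=epsilon)
        D = D / 2                    # localization: shrink the search radius
    return x
```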
The following theorem provides our upper bounds for Algorithm 2, demonstrating near-exponential rates for interpolation problems; we present the proof in Appendix C.
Theorem 2.
Assume each sample function is -Lipschitz and -smooth, and let the population function satisfy quadratic growth (Definition 2.4). Let Problem (1) be an interpolation problem. Then Algorithm 2 is -DP. For , , , and any , Algorithm 2 returns such that
(3)
For , , , and any , Algorithm 2 returns such that
(4)
The exponential rates in Theorem 2 show a significant improvement in the interpolation regime over the minimax-optimal rate without interpolation [20, 7]. To obtain the linear convergence rates, we run roughly epochs with samples each; thus, each call of the subroutine runs the algorithm on a batch that is smaller than the full dataset by only a factor of the (logarithmic) number of epochs. Intuitively, the growth condition improves the performance of the sub-algorithm, while growth and interpolation together reduce the search space; in tandem, these lead to faster rates.
To better illustrate the improvement in rates compared to the non-private setting, the next corollary states the private sample complexity required to achieve error in the interpolation regime.
Corollary 4.1.
Let the conditions of Theorem 2 hold. For , Algorithm 2 is -DP and requires
samples to ensure
for any fixed , where ignores only polyloglog factors in .
Moreover, for , Algorithm 2 is -DP and requires
samples to ensure , for any fixed , where ignores polyloglo factors in .
As the sample complexity of DP-SCO to achieve expected error on general quadratic growth problems is [7]
Corollary 4.1 shows that we are able to improve the polynomial dependence on in the sample complexity to (nearly) logarithmic for interpolation problems.
Remark 1.
In contrast to Corollary 4.1, we can tune the failure probability parameter to get the sample complexity . Even though this sample complexity does not have the polynomial factor, it may be worse than , because generally the dimension term is the dominant one.
We end this section by considering growth conditions that are weaker than quadratic growth.
Remark 2.
(interpolation with -growth) We can extend our algorithms to work for the weaker -growth condition [7], i.e., . We present the full details of these algorithms in Section C.1 (see Algorithm 6). In this setting, we obtain excess loss
for interpolation problems, improving over the minimax-optimal loss for non-interpolation problems which is
As an example, when , this corresponds to an improvement from roughly to . Like our previous results, we are again able to show similar improvements for -DP with better dependence on the dimension. Finally, we note that we have not provided lower bounds for the interpolation-with--growth setting for . We leave this question as a direction for future research.
4.3 Adaptive algorithm
Though Algorithm 2 is private and enjoys faster rates of convergence in the interpolation regime, it is not necessarily adaptive to interpolation; that is, it may perform poorly on a non-interpolation problem. In fact, since the shrinkage of the diameter and Lipschitz constant at each iteration hinges squarely on the interpolation assumption, the new domain may not contain the optimizing set in the non-interpolation setting, so the algorithm may not even converge. Since we generally do not know a priori whether a dataset is interpolating, it is important to have an algorithm that adapts to interpolation.
To that end, we present an adaptive algorithm that achieves faster rates for interpolation-with-growth problems while simultaneously obtaining the standard optimal rates for general growth problems. The algorithm consists of two steps. In the first step, it privately minimizes the objective without assuming the problem interpolates. In the second step, it runs our non-adaptive interpolation algorithm of Section 4.2 over the localized domain returned by the first step. If the problem is an interpolating one, the second step recovers the faster rates of Section 4.2; if not, the first localization step ensures that we at least recover the non-interpolation convergence rate. We stress that the privacy of Algorithm 3 requires the call to Algorithm 2 to remain private even if the problem is non-interpolating; this is ensured by using our Lipschitzian-extension-based algorithm with taken to be Algorithm 5, since the extension preserves privacy regardless of interpolation. We present the full details of this algorithm in Algorithm 3.
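A schematic sketch of the two phases (our own, with hypothetical names; the phase-2 radius is passed in as a stand-in for the bound derived from the phase-1 guarantee and quadratic growth) is as follows.

```python
def adaptive_dpsco(standard_dpsco, interpolation_dpsco, data, domain, radius_bound, epsilon):
    """Two-phase adaptive scheme (schematic): localize, then run the interpolation solver."""
    data_phase1, data_phase2 = data[: len(data) // 2], data[len(data) // 2 :]
    # Phase 1: a standard private solver localizes the optimum without any
    # interpolation assumption.
    x1 = standard_dpsco(data_phase1, domain, epsilon)
    # Phase 2: the interpolation algorithm runs on a ball around x1; privacy holds
    # regardless of interpolation because it uses the Lipschitzian extension, and
    # its output stays inside the ball, so the phase-1 error bound is retained (up
    # to constants) even on non-interpolating problems.
    return interpolation_dpsco(data_phase2, center=x1, radius=radius_bound, epsilon=epsilon)
```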
The following theorem (Theorem 3) states the convergence guarantees of our adaptive algorithm (Algorithm 3) in both the interpolation and non-interpolation regimes for the pure DP setting. The results for approximate DP are similar and can be obtained by replacing with ; we give the full details in Appendix C.
Theorem 3.
Let each sample function be -Lipschitz and -smooth, and let the population function satisfy quadratic growth (Definition 2.4) with coefficient . Let be the output of Algorithm 3. Then
1. Algorithm 3 is -DP.
2. Without any additional interpolation assumption, satisfies
3. Let problem (1) be an interpolation problem. Then satisfies
Proof The privacy of Algorithm 3 follows from the privacy of Algorithms 1 and 2 and post-processing.
To prove the convergence guarantees, we first need to show that the optimal set is contained in the shrunken domain . Using the high-probability guarantees of Algorithm 1, we know that with probability , we have
Using the quadratic growth condition, we immediately have and hence .
Using smoothness, we have that for any ,
Since Algorithm 2 always outputs a point in its input domain (in this case ), even in the non-interpolation setting we have that
In the interpolation setting, the guarantees of Algorithm 2 hold and the result is immediate.
∎
5 Optimality and Superefficiency
We conclude this paper by providing a lower bound and a super-efficiency result that demonstrate the tightness of our upper bounds. Recall that our upper bound from Section 4 is roughly (up to constants)
(5)
for any arbitrarily large . We begin with an exponential lower bound showing that the second term in (5) is tight. We then prove a superefficiency result demonstrating that any private algorithm which avoids the first term in (5) cannot be adaptive to interpolation; that is, it cannot achieve the minimax-optimal rate for the family of non-interpolation problems.
Theorem 4 below presents our exponential lower bounds for private interpolation problems with growth. We use the notation and proof structure of Theorem 1. We let be the subcollection of function–dataset pairs whose functions also satisfy -quadratic growth (Definition 2.4). The proof of Theorem 4 appears in Section D.1.
Theorem 4.
Let contain a -dimensional -ball of diameter . Then
This lower bound addresses the second term of (5); we now turn to our superefficiency results to lower bound the first term of (5). We start by defining some notation and making some simplifying assumptions. For a fixed function which is convex and -smooth with respect to its first argument, let be the set of datasets of data points sampled from such that is -Lipschitz and the objective is -strongly convex. For simplicity, we assume that 1. for all , 2. , and 3. the codomain of is . With this setup, we present the formal statement of our result; the proof of Theorem 5 is in Section D.2.
Theorem 5.
Suppose we have some with such that satisfy Definition 2.3. Suppose there is an -DP estimator such that
for some and absolute constant . Then, for sufficiently large , there exists another dataset , where may not satisfy Definition 2.3, such that
To better contextualize this result, suppose there exists an algorithm that attains a convergence rate on interpolation problems; i.e., the algorithm avoids the term in (5). Then Theorem 5 states that there exists some strongly convex, non-interpolation problem on which this algorithm optimizes very poorly; in particular, it can only return a solution that attains, on average, constant error on this “hard” problem. More generally, recall that in the non-interpolation quadratic-growth setting, the optimal error rate is on the order of [7]. Theorem 5 shows that attaining better-than-polynomial sample complexity on quadratic-growth interpolation problems implies that the algorithm cannot be minimax optimal in the non-interpolation quadratic-growth setting. Thus, the rates our adaptive algorithms attain are the best one can hope for from an algorithm that must perform well on both interpolation and non-interpolation quadratic-growth problems.
References
- ACCD [20] Hilal Asi, Karan Chadha, Gary Cheng, and John C. Duchi. Minibatch stochastic approximate proximal point methods. In Advances in Neural Information Processing Systems 33, 2020.
- ACG+ [16] Martin Abadi, Andy Chu, Ian Goodfellow, Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In 23rd ACM Conference on Computer and Communications Security (ACM CCS), pages 308–318, 2016.
- AD [19] Hilal Asi and John C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019.
- AD [20] Hilal Asi and John Duchi. Near instance-optimality in differential privacy. arXiv:2005.10630 [cs.CR], 2020.
- ADF+ [21] Hilal Asi, John Duchi, Alireza Fallah, Omid Javidbakht, and Kunal Talwar. Private adaptive gradient methods for convex optimization. In Proceedings of the 38th International Conference on Machine Learning, pages 383–392, 2021.
- AFKT [21] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in geometry. In Proceedings of the 38th International Conference on Machine Learning, 2021.
- ALD [21] Hilal Asi, Daniel Levy, and John C. Duchi. Adapting to function difficulty and growth conditions in private optimization. In Advances in Neural Information Processing Systems 34, 2021.
- BFGT [20] Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems 33, 2020.
- BFTT [19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems 32, 2019.
- BGN [21] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-euclidean differentially private stochastic convex optimization. In Proceedings of the Thirty Fourth Annual Conference on Computational Learning Theory, pages 474–499, 2021.
- BHM [18] Mikhail Belkin, Daniel Hsu, and Partha Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems 31, pages 2300–2311. Curran Associates, Inc., 2018.
- BRT [19] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1611–1619, 2019.
- BST [14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 55th Annual Symposium on Foundations of Computer Science, pages 464–473, 2014.
- CCD [22] Karan Chadha, Gary Cheng, and John Duchi. Accelerated, optimal and parallel: Some results on model-based stochastic optimization. In International Conference on Machine Learning, pages 2811–2827. PMLR, 2022.
- CMS [11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.
- CSSS [11] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems 24, 2011.
- DJW [13] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy and statistical minimax rates. In 54th Annual Symposium on Foundations of Computer Science, pages 429–438, 2013.
- DR [14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3 & 4):211–407, 2014.
- Duc [18] John C. Duchi. Introductory lectures on stochastic convex optimization. In The Mathematics of Data, IAS/Park City Mathematics Series. American Mathematical Society, 2018.
- FKT [20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in linear time. In Proceedings of the Fifty-Second Annual ACM Symposium on the Theory of Computing, 2020.
- HUL [93] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1993.
- LB [20] Chaoyue Liu and Mikhail Belkin. Accelerating sgd with momentum for over-parameterized learning. In International Conference on Learning Representations, 2020.
- MBB [18] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In Proceedings of the 35th International Conference on Machine Learning, 2018.
- NWS [14] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems 27, pages 1017–1025, 2014.
- SSSSS [09] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In Proceedings of the Twenty Second Annual Conference on Computational Learning Theory, 2009.
- SST [10] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 23, pages 2199–2207, 2010.
- ST [13] Adam Smith and Abhradeep Thakurta. Differentially private feature selection via stability arguments, and the robustness of the Lasso. In Proceedings of the Twenty Sixth Annual Conference on Computational Learning Theory, pages 819–850, 2013.
- SV [09] T. Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262–278, 2009.
- VBS [19] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
- Wai [19] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
- WS [21] Blake E. Woodworth and Nathan Srebro. An even more optimal stochastic optimization algorithm: Minibatching and interpolation learning. In Advances in Neural Information Processing Systems 34, 2021.
- WXDX [20] Di Wang, Hanshen Xiao, Srini Devadas, and Jinhui Xu. On differentially private stochastic convex optimization with heavy-tailed data. In Proceedings of the 37th International Conference on Machine Learning, 2020.
Appendix A Results from previous work
A.1 Proof of Lemma 2.1
A.2 Algorithms from [7]
A.3 Theoretical results from [7]
We first reproduce the high probability guarantees of Algorithm 4 as proved in [7].
Proposition 2.
Let , and be convex, -Lipschitz for all . Setting
then for , Algorithm 4 is -DP and has with probability
Proposition 3.
Let , and be convex, -Lipschitz for all . Setting
then for , Algorithm 4 is -DP and has with probability
Now, we reproduce the high probability convergence guarantees of Algorithm 5.
Theorem 6.
Let , and be convex, -Lipschitz for all . Assume that has -growth (Assumption 2.4) with . Setting , Algorithm 5 is -DP and has with probability
Theorem 7.
Let , and be convex, -Lipschitz for all . Assume that has -growth (Assumption 2.4) with . Setting and , Algorithm 5 is -DP and has with probability
Appendix B Proofs from Section 3
B.1 Proof of Theorem 1
Consider the sample risk function
We define the datasets . We define the corresponding population risk to be . We select to be a -packing (with respect to the norm) of diameter ball contained in . Define the separation between with respect to the loss and by
For the sake of contradiction, suppose that for for all . Then by Markov’s inequality, and for all , and so
where inequality follows from the definition of the separation, and follows from privacy and the disjoint nature of the events in the union. Rearranging, we get that
which is a contradiction. By standard packing inequalities [30], we know that . Setting and and using the fact that is decreasing in gives
We now prove the -DP lower bound. Consider the following sample risk function
We define the datasets inducing the corresponding population risk . We select two points contained within the diameter ball contained in such that . Define the separation between with respect to the loss and as
For the sake of contradiction, suppose that for for all . Then by Markov’s inequality, and for all , and so
where inequality follows from the definition of the separation, and follows from group privacy of -privacy [18]. Rearranging, we get that
which is a contradiction. Setting and using the fact gives the first result.
Appendix C Proofs from Section 4
We first prove a lemma showing that each time we shrink the domain, the set of interpolating solutions still lies in the new domain with high probability, and that the new Lipschitz constant we define is a valid Lipschitz constant for the loss on the new domain. We prove it in generality for -growth.
Lemma C.1.
Let denote the set of interpolating solutions of problem (1). Then for all with probability , and for all .
Proof We prove this lemma for the case when , the case when follows similarly. For epoch , using Theorem 2 of [7], we have with probability ,
Using the growth condition on , we have
Using , we get with probability . Thus, for each epoch , with probability , each point in the set of optimizers lies in the domain . Using a union bound on all epochs, we have for all with probability .
We now prove the second part of the lemma. Using the smoothness of and that for all , we have
as desired.
∎
We now prove the convergence guarantees of Algorithm 2 stated in Theorem 2.
Proof First we prove the privacy guarantee of the algorithm. Each sample affects only one of the iterates ; thus, Algorithm 2 satisfies the same privacy guarantee as Algorithm 5 by post-processing.
We divide the utility proof into two main parts: first, we check the validity of the assumptions when applying Algorithm 5; second, we use its high-probability convergence guarantees to obtain the final rates. For the first part, we must ensure that the optimum set lies in the new domain at step and that the Lipschitz constant defined with respect to the domain is a valid Lipschitz constant; this follows from Lemma C.1.
Next, we use the high probability convergence guarantees of the subalgorithm Algorithm 5 to get convergence rates for Algorithm 2. We prove it for the case when , the case when is similar. We know that
Thus we have
Using Theorem 2 of [7] on the last epoch, we have with probability that
Let and for some such that
This holds for example for
for sufficiently large . Using these values of and , we have
(6)
To get the convergence results in expectation, let denote the “bad” event with tail probability , where . Now,
Substituting and using Equation 6, we get the result.
∎
C.1 Algorithm for general
Remark. is an absolute constant depending on the high-probability performance guarantees of Algorithm 5. We can calculate that it is at most and hence .
Theorem 8.
Assume each sample function is -Lipschitz and -smooth, and let the population function satisfy quadratic growth (Definition 2.4). Let Problem (1) be an interpolation problem. Then, Algorithm 6 is -DP. For , running Algorithm 6 with and , we have
with probability . For , Algorithm 6 when run using and achieves error
with probability .
Proof The privacy guarantee follows from the proof of Theorem 2. We divide the utility proof into two main parts: first, we check the validity of the assumptions when applying Algorithm 5; second, we use its high-probability convergence guarantees to obtain the final rates. For the first part, we must ensure that the optimum set lies in the new domain defined at every step and that the Lipschitz constant defined with respect to the domain is a valid Lipschitz constant; this follows from Lemma C.1.
Next, we use the high probability convergence guarantees of the subalgorithm Algorithm 5 to get convergence rates for Algorithm 2.
We prove it for the case when , the case when is similar. We know that
Thus, we have
We note that for , and thus for large , we ignore the terms of the form since they are . Ignoring these terms by including an additional constant we can write
We now state the result in terms of the sample complexity required to achieve a given error. To ensure , it is sufficient to ensure
Choosing ensures error .
∎
Corollary C.1.
Under the conditions of Theorem 8, for , the expected error of the output of algorithm is upper bounded by
for arbitrarily large . For , the expected error of the output of algorithm is upper bounded by
for arbitrarily large .
C.2 version of Theorem 3
Theorem 9.
Assume each sample function is -Lipschitz and -smooth, and let the population function satisfy quadratic growth (Definition 2.4) with coefficient . Let be the output of Algorithm 3. Then,
1. Algorithm 3 is -DP.
2. Without any additional interpolation assumption, the expected error of the output is upper bounded by
3. Let problem (1) be an interpolation problem. Then the expected error of the output is upper bounded by
Proof First, we note that the privacy of Algorithm 3 follows from the privacy of Algorithm 2 and Algorithm 1 and post-processing.
To prove the convergence guarantees, we first need to show that the optimal set is included in the shrunken domain . Using the high-probability guarantees of Algorithm 1, we know that with probability , we have
Using the quadratic growth condition, we immediately have and hence .
Using smoothness, we have that for any ,
Since Algorithm 2 always outputs a point in its input domain (in this case ), even in the non-interpolation setting we have that
In the interpolation setting, the guarantees of Algorithm 2 hold and the result is immediate.
∎
Appendix D Proofs from Section 5
D.1 Proof of Theorem 4
The proof is exactly the same as that of Theorem 1, except that we set to ensure that for any , has -quadratic growth. Finally, we set and use the facts that and that is decreasing in to obtain the desired lower bound.
D.2 Proof of Theorem 5
The proof of this result hinges on the two following supporting propositions. We first restate Proposition 2.2 from [4] (listed as Proposition 4 below) in our notation for convenience. We then state Proposition 5, which gives upper and lower bounds on the modulus of continuity (defined in Proposition 4). We note that the lower bound in Proposition 5 is one of the novel contributions of this paper; its proof appears in Section D.2.1. We first assume Proposition 5 and prove Theorem 5, returning afterward to prove its correctness.
Proposition 4.
For some fixed which is convex and -smooth with respect to its first argument, let for . Let . Define the corresponding modulus of continuity
Assume the mechanism is -DP and for some achieves
Then there exists a sample where such that
Proposition 5.
Let be convex and -smooth in its first argument and satisfy for all . Suppose we have some with which also induces an interpolation problem (a problem satisfying Definition 2.3). With respect to the dataset , the modulus of continuity satisfies
With these two results, we can now prove Theorem 5. Restating the conditions of the theorem formally, suppose for some constants and there is an -DP estimator such that
If , set , then the bound certainly still holds for large enough . If we let , using the definition of strong convexity, we have that there exists some and such that
To satisfy the expression from Proposition 4, we select such that
Using Proposition 5 we must have . Using Proposition 4, we have that
Before performing a further lower bound on this quantity, we first verify that does not exceed the total size of the dataset, . Using our bounds on , we see that
For any , for sufficiently large , this quantity is less than . We now lower bound the modulus of continuity by using the fact that it is a non-decreasing function in its second argument:
This is the desired result; the last inequality comes from another application of Proposition 5 but with in place of .
D.2.1 Proof of Proposition 5
Proof Outline
At a high level, starting with a function , we first remove an arbitrary fraction to create a function . We then replace the removed sample functions with samples of and argue how far the minimizer of moves away from the minimizer of . We will need several supporting lemmas to complete this proof; we briefly outline how we use them below.
1. We use Lemma D.1 to argue that the minimizers of are no different from .
2. We use Lemma D.2 to argue about the growth of .
3. We use Lemmas D.3, D.4, and D.5 to lower bound how far the minimizer of has moved from the minimizer of .
4. We use Lemma D.6 to upper bound how far the minimizer of has moved from the minimizer of .
We now formally introduce the several supporting lemmas which will aid our proof of Proposition 5. The first ensures that the minimizing set does not change upon the removal of a constant number of samples.
Lemma D.1.
Assume that for all . Suppose satisfies Definition 2.3 and has -quadratic growth. Let . Let consist of any (constant not scaling with ) data points. Then, for we have that .
Proof Suppose for the sake of contradiction that . Since is an interpolation problem, the removal of samples can only increase the size of . Suppose that . There exist at most points in that have non-zero error on . However, by smoothness of each sample function (and the fact that and by construction), we have that for
Since , this contradicts -quadratic growth.
∎
This second lemma ensures that deleting a constant number of samples does not affect the growth or strong convexity of the population function by too much.
Lemma D.2.
Assume that for all . Suppose satisfies Definition 2.3 and has -quadratic growth (respectively -strong convexity). Let be defined as in Lemma D.1. Then has -quadratic growth (respectively -strong convexity) for any .
Proof By Lemma D.1, the minimizing set of is the same as that of . Suppose for the sake of contradiction that does not have -quadratic growth. Then there must exist such that
By smoothness and growth we have
This implies that , a contradiction.
Suppose for the sake of contradiction that does not have -strong convexity. Then there must exist and such that
By smoothness and strong convexity we have
However, this implies that which is a contradiction.
∎
The next lemma is a standard result on the closure under addition of strongly convex functions.
Lemma D.3.
Let functions and be and strongly convex, respectively; then is strongly convex.
This lemma provides some growth conditions on the gradient under smoothness, strong convexity and quadratic growth.
Lemma D.4.
Let be a convex function with such that for , . Suppose has -quadratic growth, then
If instead has -strong convexity, then
Alternatively, suppose has -smoothness, then
Proof We note that by first order optimality conditions, for all , . To prove the first inequality, we have that for any , the following is true:
In particular, minimizing over on the right hand side and rearranging gives the desired result. To prove the second result, we know that by strong convexity for any
To prove the last result, we know that by smoothness for any
Minimizing over on the right hand side gives the desired result.
∎
This lemma controls how much the minimizers of a function can change if another function is added. This will directly be useful in lower bounding the modulus of continuity.
Lemma D.5.
Suppose and . Let be the largest minimizer of and be the smallest minimizer of , and assume that . Let be any minimizer of . Assume that and . If has -quadratic growth and is -smooth, then
If is -smooth and has -quadratic growth, then
The same relation holds with and replaced with and respectively if the above statement is modified such that and are and strongly convex instead.
Proof If , then the first order condition for optimality implies
Thus, . By the monotonicity of the first derivative of convex functions, for , and . Combining this with Lemma D.4, we get
Combining these facts gives
Rearranging these two inequalities gives the desired result. We note that the lower bound only requires that is -smooth and has -quadratic growth, and the upper bound only requires has -quadratic growth and is -smooth. The last statement about strong convexity follows from the same reasoning, except using the strong convexity inequality in Lemma D.4 instead of the quadratic growth inequality.
∎
The following lemma is a slight modification of Claim 6.1 from [25] and will be helpful for us to upper bound the modulus of continuity.
Lemma D.6.
Let consist of data points where . Suppose that is -strongly convex and satisfies Definition 2.3. Assume the sample function is -Lipschitz in its first argument and that for all . For and , we have that
Proof By strong convexity, we have that
since by first order optimality conditions, we know that as a consequence of Definition 2.3. We also have
where the last inequality comes from the Lipschitzness of and that .
∎
Armed with these supporting lemmas, we can now bound the modulus of continuity. Following the steps of the proof outline, let be the largest minimizer of . Without loss of generality, we assume that ; if , by symmetry, it suffices to consider the problem with replaced by in the following argument. By Lemma D.1, has the same minimizing set as . By Lemma D.2, has -strong convexity. We replace the removed data points with samples that have loss function ; it is clear that satisfies the desired Lipschitz condition. Our constructed non-interpolation population function is
which is -strongly convex by Lemmas D.2 and D.3 and is -Lipschitz. This means that the problem this function corresponds to belongs to . Let be the minimizer of .
By the triangle inequality, is -smooth; is -strongly convex. Thus, by Lemma D.5 with and , we have that
Here, implicitly, we are using the fact that is also a minimizer of by Lemma D.2. This completes the proof of the lower bound.
The upper bound follows from Lemma D.6 with and .