
Private optimization in the interpolation regime:
faster rates and hardness results

Hilal Asi (1, 2)
asi@stanford.edu
Karan Chadha (1, 2)
knchadha@stanford.edu
Gary Cheng (1, 2)
chenggar@stanford.edu
John Duchi (2, 3)
jduchi@stanford.edu

(1) Equal contribution, author order alphabetical. (2) Electrical Engineering Department, Stanford University. (3) Statistics Department, Stanford University.
Abstract

In non-private stochastic convex optimization, stochastic gradient methods converge much faster on interpolation problems (problems where there exists a solution that simultaneously minimizes all of the sample losses) than on non-interpolating ones; we show that generally similar improvements are impossible in the private setting. However, when the functions exhibit quadratic growth around the optimum, we show (near) exponential improvements in the private sample complexity. In particular, we propose an adaptive algorithm that improves the sample complexity to achieve expected error $\alpha$ from $\frac{d}{\varepsilon\sqrt{\alpha}}$ to $\frac{1}{\alpha^{\rho}}+\frac{d}{\varepsilon}\log\left(\frac{1}{\alpha}\right)$ for any fixed $\rho>0$, while retaining the standard minimax-optimal sample complexity for non-interpolation problems. We prove a lower bound that shows the dimension-dependent term is tight. Furthermore, we provide a superefficiency result which demonstrates the necessity of the polynomial term for adaptive algorithms: any algorithm that has a polylogarithmic sample complexity for interpolation problems cannot achieve the minimax-optimal rates for the family of non-interpolation problems.

1 Introduction

We study differentially private stochastic convex optimization (DP-SCO), where given a dataset $\mathcal{S}=S_{1}^{n}\stackrel{\rm iid}{\sim}P$ we wish to solve

$$\begin{split}\mathop{\rm minimize}~&f(x)=\mathbb{E}_{P}[F(x;S)]=\int_{\Omega}F(x;s)\,dP(s)\\ \mathop{\rm subject\;to}~&x\in\mathcal{X},\end{split}\tag{1}$$

while guaranteeing differential privacy. In problem (1), $\mathcal{X}\subset\mathbb{R}^{d}$ is the parameter space, $\Omega$ is a sample space, and $\{F(\cdot;s):s\in\Omega\}$ is a collection of convex losses. We study the interpolation setting, where there exists a solution that simultaneously minimizes all of the sample losses.

Interpolation problems are ubiquitous in machine learning applications: for example, least squares problems with consistent solutions [28, 24], and problems with over-parametrized models where a perfect predictor exists [23, 11, 12]. This has led to a great deal of work on the advantages and implications of interpolation [26, 16, 11, 12].

For non-private SCO, interpolation problems allow significant improvements in convergence rates over generic problems [26, 16, 23, 29, 31]. For general convex functions, SST [26] develop algorithms that obtain $O(\frac{1}{n})$ sub-optimality, improving over the minimax-optimal rate $O(\frac{1}{\sqrt{n}})$ for non-interpolation problems. Even more dramatic improvements are possible when the functions exhibit growth around the minimizer: VBS [29] show that SGD achieves exponential rates in this setting, compared to polynomial rates without interpolation. Further work [3, 1, 14] extends these fast convergence results to model-based optimization methods.

Despite the recent progress and increased interest in interpolation problems, they remain poorly understood in the private setting. Although there has been substantial progress in characterizing tight convergence guarantees for a variety of settings in DP optimization [13, 9, 20, 6, 7], we have little understanding of private optimization in the growing class of interpolation problems.

Given (i) the importance of differential privacy and interpolation problems in modern machine learning, (ii) the (often) paralyzingly slow rates of private optimization algorithms, and (iii) the faster rates possible for non-private interpolation problems, the interpolation setting provides a reasonable opportunity for significant speedups in the private setting. This motivates the following two questions: first, is it possible to improve the rates for DP-SCO in the interpolation regime? Second, what are the optimal rates?

1.1 Our contributions

We answer both questions. In particular, we show that

  1. No improvements in general (Section 3): our first result is a hardness result demonstrating that the rates cannot be improved for DP-SCO in the interpolation regime with general convex functions. More precisely, we prove a lower bound of $\Omega(\frac{d}{n\varepsilon})$ on the excess loss for pure differentially private algorithms. This shows that existing algorithms achieve optimal private rates for this setting.

  2. Faster rates with growth (Section 4): when the functions exhibit quadratic growth around the minimizer, that is, $f(x)-f(x^{\star})\geq\lambda\|x-x^{\star}\|_{2}^{2}$ for some $\lambda>0$, we propose an algorithm that achieves near-exponentially small excess loss, improving over the polynomial rates in the non-interpolation setting. Specifically, we show that the sample complexity to achieve expected excess loss $\alpha>0$ is $O(\frac{1}{\alpha^{\rho}}+\frac{d}{\varepsilon}\log(\frac{1}{\alpha}))$ for pure DP and $O(\frac{1}{\alpha^{\rho}}+\frac{\sqrt{d\log(1/\delta)}}{\varepsilon}\log(\frac{1}{\alpha}))$ for $(\varepsilon,\delta)$-DP, for any fixed $\rho>0$. This improves over the sample complexity for non-interpolation problems with growth, which is $O(\frac{1}{\alpha}+\frac{d}{\varepsilon\sqrt{\alpha}})$. We also present new algorithms that improve the rates for interpolation problems with the weaker $\kappa$-growth assumption [7] for $\kappa>2$, where we achieve excess loss $O((\frac{1}{\sqrt{n}}+\frac{d}{n\varepsilon})^{\frac{\kappa}{\kappa-2}})$, compared to the previous bound $O((\frac{1}{\sqrt{n}}+\frac{d}{n\varepsilon})^{\frac{\kappa}{\kappa-1}})$ without interpolation.

  3. Adaptivity to interpolation (Section 4.3): while these improvements for the interpolation regime are important, practitioners typically cannot tell in advance whether the dataset they are working with is an interpolating one. Thus, it is crucial that these algorithms do not fail when given a non-interpolating dataset. We show that our algorithms are adaptive to interpolation, obtaining these better rates for interpolation while simultaneously retaining the standard minimax-optimal rates for non-interpolation problems.

  4. Tightness (Section 5): finally, we provide a lower bound and a super-efficiency result that demonstrate the (near) tightness of our upper bounds, showing that sample complexity $\Omega(\frac{d}{\varepsilon}\log(\frac{1}{\alpha}))$ is necessary for interpolation problems with pure DP. Moreover, our super-efficiency result shows that the polynomial dependence on $1/\alpha$ in the sample complexity is necessary for adaptive algorithms: any algorithm that has a polylogarithmic sample complexity for interpolation problems cannot achieve minimax-optimal rates for non-interpolation problems.

1.2 Related work

Over the past decade, many works [15, 17, 27, 13, 2, 9, 20, 6, 5, 8] have studied the problem of private convex optimization. CMS [15] and [13] study the closely related problem of differentially private empirical risk minimization (DP-ERM), where the goal is to minimize the empirical loss, and obtain (minimax) optimal rates of $d/n\varepsilon$ for pure DP and $\sqrt{d\log(1/\delta)}/n\varepsilon$ for $(\varepsilon,\delta)$-DP. Recently, more papers have moved beyond DP-ERM to privately minimizing the population loss (DP-SCO) [9, 20, 6, 5, 10, 7]. BFTT [9] was the first paper to obtain the optimal rate $1/\sqrt{n}+\sqrt{d\log(1/\delta)}/n\varepsilon$ for $(\varepsilon,\delta)$-DP, and subsequent papers develop more efficient algorithms that achieve the same rates [20, 8]. Moreover, other papers study DP-SCO under different settings including non-Euclidean geometry [6, 5], heavy-tailed data [32], and functions with growth [7]. However, to the best of our knowledge, there has not been any work in private optimization that studies the problem in the interpolation regime.

On the other hand, the optimization literature has witnessed numerous papers on the interpolation regime [26, 16, 23, 29, 22, 31]. SST [26] propose algorithms that roughly achieve the rate $1/n+\sqrt{f^{\star}/n}$ for smooth and convex functions, where $f^{\star}=\min_{x\in\mathcal{X}}f(x)$. In the interpolation regime with $f^{\star}=0$, this result obtains loss $1/n$, improving over the standard $1/\sqrt{n}$ rate for non-interpolation problems. Moreover, VBS [29] studied the interpolation regime for functions with growth and show that SGD enjoys linear convergence (exponential rates). More recently, several papers investigated and developed acceleration-based algorithms in the interpolation regime [22, 31].

2 Preliminaries

We begin with notation that will be used throughout the paper and provide some standard definitions from convex analysis and differential privacy.

Notation

We let $n$ denote the sample size and $d$ the dimension. We let $x$ denote the optimization variable and $\mathcal{X}\subset\mathbb{R}^{d}$ the constraint set. We write $s$ for samples from $\Omega$, and $S$ for an $\Omega$-valued random variable. For each sample $s\in\Omega$, $F(\cdot;s):\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{+\infty\}$ is a closed convex function. Let $\partial F(x;s)$ denote the subdifferential of $F(\cdot;s)$ at $x$. We let $\Omega^{n}$ denote the collection of datasets $\mathcal{S}=(s_{1},\ldots,s_{n})$ with $n$ data points from $\Omega$. We let $f_{\mathcal{S}}(x)\coloneqq\frac{1}{n}\sum_{s\in\mathcal{S}}F(x;s)$ denote the empirical loss and $f(x)\coloneqq\mathbb{E}[F(x;S)]$ denote the population loss. The distance of a point to a set is $\mathop{\rm dist}(x,Y)=\min_{y\in Y}\|x-y\|_{2}$. We use ${\rm Diam}(\mathcal{X})=\sup_{x,y\in\mathcal{X}}\|x-y\|_{2}$ to denote the diameter of the parameter space $\mathcal{X}$ and use $D$ as a bound on the diameter of our parameter space.

We recall the definition of $(\varepsilon,\delta)$-differential privacy.

Definition 2.1.

A randomized mechanism $M$ is $(\varepsilon,\delta)$-differentially private ($(\varepsilon,\delta)$-DP) if for all datasets $\mathcal{S},\mathcal{S}^{\prime}\in\Omega^{n}$ that differ in a single data point and for all events $\mathcal{O}$ in the output space of $M$, we have

$$P(M(\mathcal{S})\in\mathcal{O})\leq e^{\varepsilon}P(M(\mathcal{S}^{\prime})\in\mathcal{O})+\delta.$$

We define $\varepsilon$-differential privacy ($\varepsilon$-DP) to be $(\varepsilon,0)$-differential privacy.
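As a minimal illustration of Definition 2.1 (not one of the paper's algorithms), the sketch below checks the defining inequality for randomized response on a single private bit. Its two-point output space makes every event easy to enumerate. The helper names `randomized_response` and `is_dp` are hypothetical and introduced only for this example.

```python
import math
from itertools import chain, combinations


def randomized_response(bit: int, eps: float) -> dict:
    """Output distribution of randomized response on one private bit.

    The true bit is reported with probability e^eps / (1 + e^eps).
    """
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}


def is_dp(p: dict, q: dict, eps: float, delta: float = 0.0, tol: float = 1e-12) -> bool:
    """Check P(O) <= e^eps * Q(O) + delta, in both directions, for every event O."""
    outcomes = list(p)
    events = chain.from_iterable(
        combinations(outcomes, k) for k in range(len(outcomes) + 1)
    )
    for event in events:
        p_o = sum(p[o] for o in event)
        q_o = sum(q[o] for o in event)
        if p_o > math.exp(eps) * q_o + delta + tol:
            return False
        if q_o > math.exp(eps) * p_o + delta + tol:
            return False
    return True


if __name__ == "__main__":
    eps = 0.5
    # Neighboring "datasets": the single private bit is 0 versus 1.
    p, q = randomized_response(0, eps), randomized_response(1, eps)
    print(is_dp(p, q, eps))       # checked against the mechanism's own eps
    print(is_dp(p, q, 0.4))       # checked against a stricter eps
    print(is_dp(p, q, 0.4, 0.1))  # stricter eps, with some delta slack
```

The singleton events are the binding ones here: their probability ratio is exactly $e^{\varepsilon}$, so the check passes at $\varepsilon$ and fails for any smaller privacy parameter unless $\delta>0$ absorbs the gap.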

We now recall a couple of standard convex analysis definitions.

Definition 2.2.
  1. A function $h:\mathcal{X}\to\mathbb{R}$ is $L$-Lipschitz if for all $x,y\in\mathcal{X}$

     $$|h(x)-h(y)|\leq L\left\|x-y\right\|_{2}.$$

     Equivalently, a differentiable function $h$ is $L$-Lipschitz if $\left\|\nabla h(x)\right\|_{2}\leq L$ for all $x\in\mathcal{X}$.

  2. A function $h$ is $H$-smooth if it has $H$-Lipschitz gradient: for all $x,y\in\mathcal{X}$

     $$\left\|\nabla h(x)-\nabla h(y)\right\|_{2}\leq H\left\|x-y\right\|_{2}.$$

  3. A function $h$ is $\lambda$-strongly convex if for all $x,y\in\mathcal{X}$

     $$h(y)\geq h(x)+\nabla h(x)^{T}(y-x)+\frac{\lambda}{2}\left\|y-x\right\|_{2}^{2}.$$

We formally define interpolation problems:

Definition 2.3 (Interpolation Problem).

Let $\mathcal{X}^{\star}\coloneqq\mathop{\rm argmin}_{x\in\mathcal{X}}f(x)$. Then problem (1) is an interpolation problem if there exists $x^{\star}\in\mathcal{X}^{\star}$ such that for $P$-almost all $s\in\Omega$, we have $0\in\partial F(x^{\star};s)$.

Interpolation problems are common in modern machine learning, where models are overparameterized. One simple example is overparameterized linear regression: there exists a solution that minimizes each individual sample function. Classification problems with margin are another example.
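The overparameterized linear regression example can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's construction: it takes $P$ to be the empirical distribution over $n$ synthetic samples, uses the squared loss $F(x;(a,b))=\frac{1}{2}(a^{T}x-b)^{2}$, and picks the minimum-norm solution as $x^{\star}$. With fewer samples than parameters, every per-sample gradient vanishes at $x^{\star}$, which is the condition $0\in\partial F(x^{\star};s)$ of Definition 2.3.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # fewer samples than parameters
A = rng.standard_normal((n, d))   # rows a_i are the feature vectors
b = rng.standard_normal(n)        # arbitrary targets

# Minimum-norm least-squares solution; it fits every sample exactly
# because A has full row rank (almost surely for Gaussian features).
x_star = np.linalg.pinv(A) @ b


def sample_gradients(x):
    """Row i is the gradient of F(x; (a_i, b_i)) = 0.5 * (a_i^T x - b_i)^2."""
    residuals = A @ x - b
    return residuals[:, None] * A


max_grad_norm = np.linalg.norm(sample_gradients(x_star), axis=1).max()

if __name__ == "__main__":
    print("largest per-sample gradient norm at x_star:", max_grad_norm)
```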

Crucial to our results is the following quadratic growth assumption:

Definition 2.4.

We say that a function $f$ satisfies the quadratic growth condition with parameter $\lambda>0$ if for all $x\in\mathcal{X}$

$$f(x)-\inf_{x^{\prime}\in\mathcal{X}^{\star}}f(x^{\prime})\geq\frac{\lambda}{2}\mathop{\rm dist}\left(x,\mathcal{X}^{\star}\right)^{2}.$$

This assumption is natural with interpolation and holds for many important applications including noiseless linear regression [28, 24]. Past work ([29, 31]) uses this assumption with interpolation to get faster rates of convergence for non-private optimization.
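For noiseless linear regression, the growth constant can be read off the data. The sketch below assumes a synthetic design matrix with full column rank, so that $\mathcal{X}^{\star}=\{x^{\star}\}$, and takes $\lambda$ to be the smallest eigenvalue of $A^{T}A/n$; these choices are illustrative and not taken from the paper. It checks the inequality of Definition 2.4 at random points.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                    # noiseless targets, so f(x_star) = 0


def f(x):
    """Empirical least-squares loss, f(x) = (1 / 2n) * ||A x - b||^2."""
    return 0.5 * np.mean((A @ x - b) ** 2)


# Smallest eigenvalue of the empirical second-moment matrix.
lam = np.linalg.eigvalsh(A.T @ A / n)[0]


def growth_gap(x):
    """f(x) - f(x_star) - (lam / 2) * dist(x, X_star)^2; nonnegative under growth."""
    return f(x) - 0.5 * lam * np.sum((x - x_star) ** 2)


gaps = [growth_gap(x_star + rng.standard_normal(d)) for _ in range(1000)]

if __name__ == "__main__":
    print("lambda:", lam, "smallest gap:", min(gaps))
```

The check works because $f(x)-f(x^{\star})=\frac{1}{2}(x-x^{\star})^{T}(A^{T}A/n)(x-x^{\star})$, which is at least $\frac{\lambda}{2}\|x-x^{\star}\|_{2}^{2}$ by the definition of the smallest eigenvalue.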

Finally, the adaptivity of our algorithms will crucially depend on an innovation leveraging Lipschitzian extensions, defined as follows.

Definition 2.5 (Lipschitzian extension [21]).

The Lipschitzian extension with Lipschitz constant $L$ of a function $f$ is defined as the infimal convolution

$$f_{L}(x)\coloneqq\inf_{y\in\mathbb{R}^{d}}\{f(y)+L\left\|x-y\right\|_{2}\}.\tag{2}$$

The Lipschitzian extension (2) essentially transforms a general convex function into an $L$-Lipschitz convex function. We now present a few properties of the Lipschitzian extension that are relevant to our development.

Lemma 2.1.

Let $f:\mathcal{X}\to\mathbb{R}$ be convex. Then its Lipschitzian extension satisfies the following:

  1. $f_{L}$ is $L$-Lipschitz.

  2. $f_{L}$ is convex.

  3. If $f$ is $L$-Lipschitz, then $f_{L}(x)=f(x)$ for all $x$.

  4. Let $y(x)=\mathop{\rm argmin}_{y\in\mathbb{R}^{d}}\{f(y)+L\left\|x-y\right\|_{2}\}$. If $y(x)$ is at a finite distance from $x$, we have

     $$\nabla f_{L}(x)=\begin{cases}\nabla f(x),&\text{if }\left\|\nabla f(x)\right\|_{2}\leq L\\ L\frac{x-y(x)}{\left\|x-y(x)\right\|_{2}},&\text{otherwise}.\end{cases}$$

We use the Lipschitzian extension as a substitute for gradient clipping to ensure differential privacy. Unlike gradient clipping, which may alter the geometry of a convex problem to a non-convex one, the Lipschitzian extension of a function remains convex and thus retains other nice properties that we leverage in our algorithms in Section 4.
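To make the infimal convolution (2) concrete, the following sketch works out a one-dimensional example that is not in the paper: $f(y)=y^{2}$. Solving the inner minimization by hand gives $f_{L}(x)=x^{2}$ for $|x|\leq L/2$ and $f_{L}(x)=L|x|-L^{2}/4$ otherwise, a Huber-like function. The code compares this hand-derived closed form with a brute-force minimum over a grid; the function names are hypothetical.

```python
import numpy as np


def lipschitz_extension_numeric(f, x, L, grid):
    """Approximate f_L(x) = inf_y { f(y) + L * |x - y| } by a minimum over a grid."""
    return float(np.min(f(grid) + L * np.abs(x - grid)))


def lipschitz_extension_closed_form(x, L):
    """Hand-derived f_L for f(y) = y^2 in one dimension."""
    if abs(x) <= L / 2:
        return x * x                  # |f'(x)| <= L, so f_L agrees with f
    return L * abs(x) - L * L / 4     # linear continuation with slope L


def square(y):
    return y ** 2


L = 2.0
grid = np.linspace(-10.0, 10.0, 200001)

if __name__ == "__main__":
    for x in (0.3, 2.0, -4.0):
        print(x,
              lipschitz_extension_numeric(square, x, L, grid),
              lipschitz_extension_closed_form(x, L))
```

The example shows items 1 to 3 of Lemma 2.1 at work: the extension is convex, its slope never exceeds $L$, and it coincides with $f$ wherever $f$ is already $L$-Lipschitz.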

3 Hardness of private interpolation

In non-private stochastic convex optimization, for smooth functions it is well known that interpolation problems enjoy the fast rate $O(1/n)$ [26] compared to the minimax-optimal $O(1/\sqrt{n})$ without interpolation [19]. In this section, we show that such an improvement is not generally possible with privacy. The same lower bound as for private non-interpolation problems, $d/n\varepsilon$, holds for interpolation problems.

To state our lower bounds, we present some notation that we will use throughout the paper. We let $\mathfrak{S}$ denote the family of pairs of a function $F$ and a dataset $\mathcal{S}$ such that $F:\mathcal{X}\times\Omega\rightarrow\mathbb{R}$ is convex and $H$-smooth in its first argument, $|\mathcal{S}|=n$, and $f_{\mathcal{S}}(y)=\frac{1}{n}\sum_{s\in\mathcal{S}}F(y;s)$ is an interpolation problem (Definition 2.3). We define the constrained minimax risk to be

$$\mathfrak{M}(\mathcal{X},\mathfrak{S},\varepsilon,\delta)\coloneqq\inf_{M\in\mathcal{M}^{(\varepsilon,\delta)}}\sup_{(F,\mathcal{S}^{n})\in\mathfrak{S}}\mathbb{E}[f_{\mathcal{S}^{n}}(M(\mathcal{S}^{n}))]-\inf_{x^{\prime}\in\mathcal{X}}f_{\mathcal{S}^{n}}(x^{\prime}),$$

where $\mathcal{M}^{(\varepsilon,\delta)}$ is the collection of $(\varepsilon,\delta)$-differentially private mechanisms from $\Omega^{n}$ to $\mathcal{X}$. We use $\mathcal{M}^{(\varepsilon,0)}$ to denote the collection of $\varepsilon$-DP mechanisms from $\Omega^{n}$ to $\mathcal{X}$. Here, the expectation is taken over the randomness of the mechanism, while the dataset $\mathcal{S}^{n}$ is fixed.

We have the following lower bound for private interpolation problems; the proof is deferred to Section B.1.

Theorem 1.

Suppose $\mathcal{X}\subset\mathbb{R}^{d}$ contains a $d$-dimensional $\ell_{2}$ ball of diameter $D$. Then the following lower bound holds for $\delta=0$:

$$\mathfrak{M}(\mathcal{X},\mathfrak{S},\varepsilon,0)\geq\frac{HD^{2}d}{96e^{2}n\varepsilon}.$$

Moreover, if $0<\delta<\varepsilon/6$ and $d=1$, the following lower bound holds:

$$\mathfrak{M}(\mathcal{X},\mathfrak{S},\varepsilon,\delta)\geq\frac{HD^{2}}{16(e+1)n\varepsilon}.$$

Recall the optimal rate for pure DP optimization problems without interpolation is $O(\frac{1}{\sqrt{n}}+\frac{d}{n\varepsilon})$. The first term is the non-private rate, as this is the rate one would get without a privacy constraint (that is, as $\varepsilon\to\infty$). The second term is the private rate, as this is the price algorithms have to pay for privacy. In modern machine learning, problems are often high dimensional, so we often think of the dimension $d$ scaling with some function of the number of samples $n$. Thus, the private rate is often thought to dominate the non-private rate. For this reason, in this section, we focus on the private rate. The lower bounds of Theorem 1 show that it is not possible to improve the private rate for interpolation problems in general. Similarly, for approximate $(\varepsilon,\delta)$-DP, the lower bound shows that improvements are not possible for $d=1$. For completeness, we note that our results do not preclude the possibility of improving the non-private rate from $O(1/\sqrt{n})$ to $O(1/n)$. We leave this as an open problem of independent interest for future work.

Despite this pessimistic result, in the next section we show that substantial improvements are possible for private interpolation problems with additional growth conditions.

4 Faster rates for interpolation with growth

Having established our hardness result for general interpolation problems, in this section we show that when the functions satisfy additional growth conditions, we get (nearly) exponential improvements in the rates of convergence for private interpolation.

Our algorithms use recent localization techniques that yield optimal algorithms for DP-SCO [20, 7], where the algorithm iteratively shrinks the diameter of the domain. However, to obtain faster rates for interpolation, we crucially build on the observation that the norm of the gradients decreases as we approach the optimal solution, since $\left\|\nabla F(x;s)\right\|_{2}\leq H\left\|x-x^{\star}\right\|_{2}$. Hence, by carefully localizing the domain and shrinking the Lipschitz constant accordingly, our algorithms improve the rates for interpolating datasets.
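The gradient bound used above follows in one line from the definitions in Section 2. The derivation below is a sketch under the assumption that each $F(\cdot;s)$ is differentiable, so that the interpolation condition $0\in\partial F(x^{\star};s)$ of Definition 2.3 reads $\nabla F(x^{\star};s)=0$.

```latex
\begin{align*}
\left\|\nabla F(x;s)\right\|_{2}
  &= \left\|\nabla F(x;s)-\nabla F(x^{\star};s)\right\|_{2}
  && \text{since } \nabla F(x^{\star};s)=0 \text{ under interpolation} \\
  &\leq H\left\|x-x^{\star}\right\|_{2}
  && \text{by $H$-smoothness of } F(\cdot;s).
\end{align*}
```

In particular, on a localized domain of diameter $D_{i}$ that contains $x^{\star}$, every sample gradient has norm at most $HD_{i}$, so $HD_{i}$ can serve as an effective Lipschitz constant there.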

However, this technique alone yields an algorithm that may not be private for non-interpolation problems, violating that privacy must hold for all inputs: the reduction in the Lipschitz constant may not hold for non-interpolation problems, and thus, the amount of noise added may not be enough to ensure privacy. To solve this issue, we use the Lipschitzian extension (Definition 2.5) to transform our potentially non-Lipschitz sample functions into Lipschitz ones and guarantee privacy even for non-interpolation problems.

We begin in Section 4.1 by presenting our Lipschitzian extension based algorithm, which recovers the standard optimal rates for (non-interpolation) LL-Lipschitz functions while still guaranteeing privacy when the function is not Lipschitz. Then in Section 4.2 we build on this algorithm to develop a localization-based algorithm that obtains faster rates for interpolation-with-growth problems. Finally, in Section 4.3 we present our final adaptive algorithm, which obtains fast rates for interpolation-with-growth problems while achieving optimal rates for non-interpolation growth problems.

4.1 Lipschitzian-extension based algorithms

Existing algorithms for DP-SCO with $L$-Lipschitz functions may not be private if the input function is not $L$-Lipschitz [8, 20, 7]. Given any DP-SCO algorithm $\mathbf{M}^{L}_{(\varepsilon,\delta)}$ that is private for $L$-Lipschitz functions, we present a framework that transforms $\mathbf{M}^{L}_{(\varepsilon,\delta)}$ into an algorithm which (i) is private for all functions, even ones which are not $L$-Lipschitz, and (ii) has the same utility guarantees as $\mathbf{M}^{L}_{(\varepsilon,\delta)}$ for $L$-Lipschitz functions. In simpler terms, our algorithm essentially feeds $\mathbf{M}^{L}_{(\varepsilon,\delta)}$ the Lipschitzian extension of the sample functions as inputs. Algorithm 1 describes our Lipschitzian-extension based framework.

Algorithm 1 Lipschitzian-Extension Algorithm
Require:  Dataset $\mathcal{S}=(s_{1},\ldots,s_{n})\in\Omega^{n}$
1:  Let $F_{L}(x;s_{i})$ be the Lipschitzian extension of $F(x;s_{i})$ for all $i$:
       $F_{L}(x;s_{i})=\inf_{y}\{F(y;s_{i})+L\left\|x-y\right\|_{2}\}$
2:  Run $\mathbf{M}^{L}_{(\varepsilon,\delta)}$ over the functions $F_{L}(\cdot;s_{i})$.
3:  Let $x_{\rm priv}$ denote the output of $\mathbf{M}^{L}_{(\varepsilon,\delta)}$.
4:  return  $x_{\rm priv}$

For this paper we consider $\mathbf{M}^{L}_{(\varepsilon,\delta)}$ to be Algorithm 2 of [7] (reproduced in Section A.2 as Algorithm 5). The following proposition summarizes our guarantees for Algorithm 1.

Proposition 1.

Let $\mathcal{L}_{L}$ denote the set of sample function-dataset pairs $(F,\mathcal{S})$ such that $F$ is $L$-Lipschitz, and let $\mathcal{F}$ denote the set of sample function-dataset pairs $(F,\mathcal{S})$ such that $\mathbf{M}^{L}_{(\varepsilon,\delta)}$ is $(\varepsilon,\delta)$-DP for any $(F,\mathcal{S})\in\mathcal{L}_{L}\cap\mathcal{F}$. Then

  1. For any $(F,\mathcal{S})\in\mathcal{F}$, Algorithm 1 is $(\varepsilon,\delta)$-DP.

  2. For any $(F,\mathcal{S})\in\mathcal{L}_{L}\cap\mathcal{F}$, Algorithm 1 achieves the same optimality guarantees as $\mathbf{M}^{L}_{(\varepsilon,\delta)}$.

Proof    For the first item, note that Lemma 2.1 implies that $F_{L}$ is $L$-Lipschitz, i.e. $(F_{L},\mathcal{S})\in\mathcal{L}_{L}\cap\mathcal{F}$. Since $\mathbf{M}^{L}_{(\varepsilon,\delta)}$ is $(\varepsilon,\delta)$-DP when applied over Lipschitz functions in $\mathcal{F}$, we have that Algorithm 1 is $(\varepsilon,\delta)$-DP.

For the second item, Lemma 2.1 implies that $F_{L}=F$ when $F$ is $L$-Lipschitz. Thus, in Algorithm 1, we apply $\mathbf{M}^{L}_{(\varepsilon,\delta)}$ over $F$ itself. ∎

While clipped DP-SGD does ensure privacy for input functions which are not $L$-Lipschitz, our algorithm has some advantages over clipped DP-SGD: first, clipping does not result in optimal rates for pure DP, and second, clipped DP-SGD results in time complexity $O(n^{3/2})$. In contrast, our Lipschitzian extension approach is amenable to existing linear time algorithms [20], allowing for almost linear time complexity algorithms for interpolation problems. Finally, while clipping the gradients and using the Lipschitzian extension both alter the effective function being optimized, only the Lipschitzian extension is able to preserve the convexity of said effective function (see item 2 in Lemma 2.1).

We make a note about the computational efficiency of Algorithm 1. Recall that when the objective is in fact $L$-Lipschitz, computing gradients for the Lipschitzian extension (say in the context of a first-order method) is only as expensive as computing the gradients for the original function. In particular, one can first compute the gradient of the original function and use item 4 of Lemma 2.1; when the problem is $L$-Lipschitz, $\|\nabla f(x)\|_{2}$ is always less than or equal to $L$ and thus the gradient of the Lipschitzian extension is just the gradient of the original function.
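The gradient recipe in item 4 of Lemma 2.1 can be sketched for one simple loss. The example below assumes the isotropic quadratic $F(x;s)=\frac{h}{2}\|x-s\|_{2}^{2}$, which is an illustrative choice and not a loss from the paper. For this loss the minimizer $y(x)$ lies on the segment between $s$ and $x$ at distance $L/h$ from $s$, so it has a closed form. The names `grad_F` and `grad_F_L` are hypothetical.

```python
import numpy as np


def grad_F(x, s, h):
    """Gradient of F(x; s) = (h / 2) * ||x - s||^2."""
    return h * (x - s)


def grad_F_L(x, s, h, L):
    """Gradient of the Lipschitzian extension of F(.; s), via item 4 of Lemma 2.1."""
    g = grad_F(x, s, h)
    if np.linalg.norm(g) <= L:
        return g                          # F is locally L-Lipschitz: no change
    u = (x - s) / np.linalg.norm(x - s)   # unit direction from s to x
    y = s + (L / h) * u                   # minimizer y(x) of F(y; s) + L * ||x - y||
    return L * (x - y) / np.linalg.norm(x - y)


if __name__ == "__main__":
    s, h, L = np.zeros(3), 2.0, 1.0
    for x in (np.array([0.1, 0.0, 0.0]), np.array([3.0, 4.0, 0.0])):
        g = grad_F_L(x, s, h, L)
        print(x, g, np.linalg.norm(g))
```

For this particular radially symmetric loss the result coincides with norm clipping of $\nabla F(x;s)$; that is a feature of the example, and the two operations differ in general.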

4.2 Faster non-adaptive algorithm

Building on the Lipschitzian-extension framework of the previous section, in this section, we present our epoch-based algorithm, which obtains faster rates in the interpolation-with-growth regime. It uses Algorithm 1 with $\mathbf{M}^{L}_{(\varepsilon,\delta)}$ as Algorithm 5 (reproduced in Section A.2) as a subroutine in each epoch, to localize and shrink the domain as the iterates get closer to the true minimizer. Simultaneously, the algorithm also reduces the Lipschitz constant, as the interpolation assumption implies that the norm of the gradient decreases for iterates near the minimizer. The detailed algorithm is given in Algorithm 2, where $D_{i}$ denotes the effective diameter and $L_{i}$ denotes the effective Lipschitz constant in epoch $i$.

Algorithm 2 Domain and Lipschitz Localization algorithm
Require:  Dataset $\mathcal{S}=(s_{1},\ldots,s_{n})\in\Omega^{n}$, Lipschitz constant $L$, domain $\mathcal{X}$, probability parameter $\beta$, initial point $x_{0}$
1:  Set $L_{1}=L$, $D_{1}={\rm Diam}(\mathcal{X})$ and $\mathcal{X}_{1}=\mathcal{X}$
2:  Partition the dataset into $T$ partitions (denoted by $\{\mathcal{S}_{k}\}_{k=1}^{T}$) of size $m$ each; $\mathcal{S}_{k}=(s_{(k-1)m+1},\dots,s_{km})$
3:  for $i=1$ to $T$ do
4:     $x_{i}\leftarrow$ Run Algorithm 1 with dataset $\mathcal{S}_{i}$, constraint set $\mathcal{X}_{i}$, Lipschitz constant $L_{i}$, probability parameter $\beta/T$, privacy parameters $(\varepsilon,\delta)$, initial point $x_{i-1}$
5:     Shrink the diameter
          $$D_{i+1}=256\left(\frac{L_{i}}{\lambda}\max\left\{\frac{\sqrt{\log(T/\beta)}\log^{3/2}m}{\sqrt{m}},\frac{\min(d,\sqrt{d\log(1/\delta)})\log(T/\beta)\log m}{m\varepsilon}\right\}\right)$$
6:     Set $\mathcal{X}_{i+1}=\{x:\left\|x-x_{i}\right\|_{2}\leq D_{i+1}/2\}$
7:     Set $L_{i+1}=HD_{i+1}$
8:  end for
9:  return  the final iterate $x_{T}$
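The diameter and Lipschitz updates in steps 5 and 7 of Algorithm 2 form a deterministic recursion that can be tabulated without running any private solver. The sketch below computes only that schedule. The function names and the parameter values in the usage example are illustrative assumptions, not the settings of the paper's theorems, and for $\delta=0$ the term $\min(d,\sqrt{d\log(1/\delta)})$ is taken to be $d$.

```python
import math


def shrink_factor(m, T, beta, d, eps, delta=0.0):
    """The max{...} term in step 5 of Algorithm 2."""
    stat = math.sqrt(math.log(T / beta)) * math.log(m) ** 1.5 / math.sqrt(m)
    # With delta = 0 the quantity sqrt(d * log(1 / delta)) is infinite, so the min is d.
    dim = d if delta == 0 else min(d, math.sqrt(d * math.log(1 / delta)))
    priv = dim * math.log(T / beta) * math.log(m) / (m * eps)
    return max(stat, priv)


def localization_schedule(L, D, H, lam, m, T, beta, d, eps, delta=0.0):
    """Return ([D_1, ..., D_{T+1}], [L_1, ..., L_{T+1}]) as set by Algorithm 2."""
    c = shrink_factor(m, T, beta, d, eps, delta)
    diam, lip = [D], [L]
    for _ in range(T):
        d_next = 256 * (lip[-1] / lam) * c   # step 5: shrink the diameter
        diam.append(d_next)
        lip.append(H * d_next)               # step 7: L_{i+1} = H * D_{i+1}
    return diam, lip


if __name__ == "__main__":
    diam, lip = localization_schedule(
        L=1.0, D=1.0, H=1.0, lam=1.0, m=10**12, T=10, beta=0.1, d=10, eps=1.0
    )
    for i, (d_i, l_i) in enumerate(zip(diam, lip), start=1):
        print(f"epoch {i}: D = {d_i:.3e}, L = {l_i:.3e}")
```

From the second epoch on, $D_{i+1}=(256H/\lambda)\,c\,D_{i}$ with $c$ the shrink factor, so the diameter contracts geometrically whenever $m$ is large enough that $(256H/\lambda)\,c<1$.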

The following theorem provides our upper bounds for Algorithm 2, demonstrating near-exponential rates for interpolation problems; we present the proof in Appendix C.

Theorem 2.

Assume each sample function $F$ is $L$-Lipschitz and $H$-smooth, and let the population function $f$ satisfy quadratic growth (Definition 2.4). Let Problem (1) be an interpolation problem. Then Algorithm 2 is $(\varepsilon,\delta)$-DP. For $\delta=0$, $\beta=\frac{1}{n^{\mu}}$, $m=256\log^{2}n\,\frac{H\log(1/\beta)}{\lambda}\max\left\{\frac{256H}{\lambda},\frac{d}{\varepsilon\sqrt{\log n}}\right\}$, $T=n/m$ and any $\mu>0$, Algorithm 2 returns $x_{T}$ such that

$$\mathbb{E}[f(x_{T})-f(x^{\star})]\leq LD\left(\frac{1}{n^{\mu}}+\exp\left(-\widetilde{\Theta}\left(\frac{n\lambda^{2}}{H^{2}}\right)\right)+\exp\left(-\widetilde{\Theta}\left(\frac{\lambda n\varepsilon}{Hd}\right)\right)\right).\tag{3}$$

For $\delta>0$, $\beta=\frac{1}{n^{\mu}}$, $m=256\log^{2}n\,\frac{H\log(1/\beta)}{\lambda}\max\left\{\frac{256H}{\lambda},\frac{\sqrt{d}\log(1/\delta)}{\varepsilon\sqrt{\log n}}\right\}$, $T=n/m$ and any $\mu>0$, Algorithm 2 returns $x_{T}$ such that

$$\mathbb{E}[f(x_{T})-f(x^{\star})]\leq LD\left(\frac{1}{n^{\mu}}+\exp\left(-\widetilde{\Theta}\left(\frac{n\lambda^{2}}{H^{2}}\right)\right)+\exp\left(-\widetilde{\Theta}\left(\frac{\lambda n\varepsilon}{H\sqrt{d\log(1/\delta)}}\right)\right)\right).\tag{4}$$

The exponential rates in Theorem 2 show a significant improvement in the interpolation regime over the minimax-optimal $O((\frac{1}{\sqrt{n}}+\frac{d}{n\varepsilon})^{2})$ without interpolation [20, 7]. To get the linear convergence rates, we run roughly $n/\log n$ epochs with $\log n$ samples each. Thus, each call of the subroutine runs on only a logarithmic number of samples, while the number of epochs is nearly linear in $n$. Intuitively, the growth condition improves the performance of the sub-algorithm, while growth and interpolation together reduce the search space. In tandem, these lead to faster rates.

To better illustrate the improvement in rates compared to the non-interpolation setting, the next corollary states the private sample complexity required to achieve error $\alpha$ in the interpolation regime.

Corollary 4.1.

Let the conditions of Theorem 2 hold. For $\delta=0$, Algorithm 2 is $\varepsilon$-DP and requires

$$n=\widetilde{O}\left(\frac{1}{\alpha^{\rho}}+\frac{d}{\rho\varepsilon}\log\left(\frac{1}{\alpha}\right)\right)$$

samples to ensure $\mathbb{E}[f(x_{T})-f(x^{\star})]\leq\alpha$ for any fixed $\rho>0$, where $\widetilde{O}$ ignores only polyloglog factors in $1/\alpha$.
Moreover, for $\delta>0$, Algorithm 2 is $(\varepsilon,\delta)$-DP and requires

$$n=\widetilde{O}\left(\frac{1}{\alpha^{\rho}}+\frac{\sqrt{d\log(1/\delta)}}{\rho\varepsilon}\log\left(\frac{1}{\alpha}\right)\right)$$

samples to ensure $\mathbb{E}[f(x_{T})-f(x^{\star})]\leq\alpha$ for any fixed $\rho>0$, where $\widetilde{O}$ ignores only polyloglog factors in $1/\alpha$.

As the sample complexity of DP-SCO to achieve expected error $\alpha$ on general quadratic growth problems is [7]

$$\Theta\left(\frac{1}{\alpha}+\frac{d}{\varepsilon\sqrt{\alpha}}\right),$$

Corollary 4.1 shows that we are able to improve the polynomial dependence on $1/\alpha$ in the sample complexity to (nearly) logarithmic for interpolation problems.
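To get a feel for the size of this gap, the sketch below evaluates the two pure-DP sample complexity expressions with all constants and polyloglog factors dropped. The function names and the parameter values are illustrative assumptions, so the outputs indicate orders of magnitude only.

```python
import math


def n_interpolation(alpha, d, eps, rho):
    """Corollary 4.1 (pure DP), constants and polyloglog factors dropped."""
    return alpha ** (-rho) + (d / (rho * eps)) * math.log(1.0 / alpha)


def n_general_growth(alpha, d, eps):
    """Sample complexity for general quadratic growth problems, constants dropped."""
    return 1.0 / alpha + d / (eps * math.sqrt(alpha))


if __name__ == "__main__":
    d, eps, rho = 100, 1.0, 0.5
    for alpha in (1e-2, 1e-4, 1e-6):
        print(f"alpha={alpha:.0e}  "
              f"interpolation ~ {n_interpolation(alpha, d, eps, rho):.3g}  "
              f"general ~ {n_general_growth(alpha, d, eps):.3g}")
```

As $\alpha$ decreases, the dimension-dependent term grows like $\log(1/\alpha)$ in the interpolation bound but like $1/\sqrt{\alpha}$ in the general bound, which is the improvement described above.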

Remark 1.

In contrast to Corollary 4.1, we can tune the failure probability parameter $\beta$ to get the sample complexity $\frac{d}{\varepsilon}\log^{2}\left(\frac{1}{\alpha}\right)$. Even though this sample complexity does not have the polynomial factor, it may be worse than $\frac{1}{\alpha^{\rho}}+\frac{d}{\varepsilon}\log\left(\frac{1}{\alpha}\right)$, because generally the dimension term is the dominant one.

We end this section by considering growth conditions that are weaker than quadratic growth.

Remark 2.

(Interpolation with $\kappa$-growth) We can extend our algorithms to work for the weaker $\kappa$-growth condition [7], i.e., $f(x)-f(x^{\star})\geq\frac{\lambda}{\kappa}\|x-x^{\star}\|_{2}^{\kappa}$. We present the full details of these algorithms in Section C.1 (see Algorithm 6). In this setting, we obtain excess loss

$$O\left(\left(\frac{1}{\sqrt{n}}+\frac{d}{n\varepsilon}\right)^{\frac{\kappa}{\kappa-2}}\right),$$

for interpolation problems, improving over the minimax-optimal loss for non-interpolation problems, which is

$$O\left(\left(\frac{1}{\sqrt{n}}+\frac{d}{n\varepsilon}\right)^{\frac{\kappa}{\kappa-1}}\right).$$

As an example, when $\kappa=3$, this corresponds to an improvement from roughly $(d/n\varepsilon)^{3/2}$ to $(d/n\varepsilon)^{3}$. Like our previous results, we are again able to show similar improvements for $(\varepsilon,\delta)$-DP with better dependence on the dimension. Finally, we note that we have not provided lower bounds for the interpolation-with-$\kappa$-growth setting for $\kappa>2$. We leave this question as a direction for future research.
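The two exponents in Remark 2 can be compared numerically. The sketch below evaluates both excess-loss expressions with constants dropped; the function name and the parameter values are illustrative assumptions rather than quantities from the paper.

```python
import math


def kappa_growth_rate(n, d, eps, kappa, interpolation):
    """Excess-loss rate under kappa-growth (kappa > 2), constants dropped."""
    if kappa <= 2:
        raise ValueError("this expression is stated for kappa > 2")
    base = 1.0 / math.sqrt(n) + d / (n * eps)
    exponent = kappa / (kappa - 2) if interpolation else kappa / (kappa - 1)
    return base ** exponent


if __name__ == "__main__":
    n, d, eps = 10**6, 10, 1.0
    for kappa in (3, 4, 6):
        with_interp = kappa_growth_rate(n, d, eps, kappa, interpolation=True)
        without = kappa_growth_rate(n, d, eps, kappa, interpolation=False)
        print(f"kappa={kappa}: interpolation ~ {with_interp:.3g}, general ~ {without:.3g}")
```

Since $\kappa/(\kappa-2)>\kappa/(\kappa-1)$ for every $\kappa>2$, the interpolation rate is smaller whenever the base $\frac{1}{\sqrt{n}}+\frac{d}{n\varepsilon}$ is below one, and the two exponents approach each other as $\kappa$ grows.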

4.3 Adaptive algorithm

Though Algorithm 2 is private and enjoys faster rates of convergence in the interpolation regime, it is not necessarily adaptive to interpolation, i.e. it may perform poorly given a non-interpolation problem. In fact, since the shrinkage of the diameter and Lipschitz constants at each iteration hinges squarely on the interpolation assumption, the new domain may not include the optimizing set 𝒳\mathcal{X}^{\star} in the non-interpolation setting, so our algorithm may not even converge. Since in general we do not know a priori whether a dataset is interpolating, it is important to have an algorithm which adapts to interpolation.

To that end, we present an adaptive algorithm that achieves faster rates for interpolation-with-growth problems while simultaneously obtaining the standard optimal rates for general growth problems. The algorithm consists of two steps. In the first step, our algorithm privately minimizes the objective without assuming it is an interpolation problem. Next, we run our non-adaptive interpolation algorithm of Section 4.2 over the localized domain returned by the first step. If our problem was an interpolating one, the second step recovers the faster rates in Section 4.2. If our problem was not an interpolating one, the first localization step ensures that we at least recover the non-interpolating convergence rate. We stress that the privacy of Algorithm 3 requires that the call to Algorithm 2 remains private even if the problem is non-interpolating. This is ensured by using our Lipschitzian extension based algorithm with 𝐌(ε,δ)L\mathbf{M}^{L}_{(\varepsilon,\delta)} as Algorithm 5. The Lipschitzian extension allows us to continue preserving privacy. We present the full details of this algorithm in Algorithm 3.

Algorithm 3 Algorithm that adapts to interpolation
0:  Dataset 𝒮=(s1,,sn)𝕊n\mathcal{S}=(s_{1},\ldots,s_{n})\in\mathbb{S}^{n}, Lipschitz constant LL, domain 𝒳\mathcal{X}, probability parameter β\beta, initial point x0x_{0}
1:  Partition the dataset into 2 partitions S1=(s1,,sn/2)S_{1}=(s_{1},\dots,s_{n/2}) and S2=(s(n/2)+1,,sn)S_{2}=(s_{(n/2)+1},\dots,s_{n})
2:  $x_{1}\leftarrow$ Run Algorithm 1 with dataset $S_{1}$, constraint set $\mathcal{X}$, Lipschitz constant $L$, probability parameter $\beta/2$, privacy parameters $(\varepsilon,\delta)$, initial point $x_{0}$
3:  Shrink the diameter
$$D_{\rm int}=\frac{128L}{\lambda}\cdot\left(\frac{\sqrt{\log(2/\beta)}\log^{3/2}n}{\sqrt{n}}+\frac{\min\{d,\sqrt{d\log(1/\delta)}\}\log(2/\beta)\log n}{n\varepsilon}\right)$$
4:  𝒳int={x:xx12Dint/2}\mathcal{X}_{\rm int}=\{x:\left\|{x-x_{1}}\right\|_{2}\leq D_{\rm int}/2\}
5:  xadaptx_{\rm adapt}\leftarrow Run Algorithm 2 with dataset S2S_{2}, diameter DintD_{\rm int}, Lipschitz constant LL, domain 𝒳int\mathcal{X}_{\rm int}, smoothness parameter HH, tail probability parameter β/2\beta/2, growth parameter λ\lambda, initial point x1x_{1}
6:  return  the final iterate xadaptx_{\rm adapt}.
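The two-stage structure of Algorithm 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `shrunk_diameter` transcribes the formula of step 3, while `adaptive_minimize`, `first_stage`, and `second_stage` are hypothetical names, with the two callables standing in for Algorithm 1 and Algorithm 2, which the caller must supply as private solvers.

```python
import math

def shrunk_diameter(L, lam, n, d, eps, beta, delta=0.0):
    """Step 3 of Algorithm 3: D_int = (128 L / lam) * (statistical + privacy term)."""
    # Dimension factor min{d, sqrt(d log(1/delta))}; for pure DP (delta = 0) it is d.
    if delta > 0:
        dim = min(d, math.sqrt(d * math.log(1.0 / delta)))
    else:
        dim = float(d)
    log_b = math.log(2.0 / beta)
    stat = math.sqrt(log_b) * math.log(n) ** 1.5 / math.sqrt(n)
    priv = dim * log_b * math.log(n) / (n * eps)
    return 128.0 * L / lam * (stat + priv)

def adaptive_minimize(data, x0, L, lam, d, eps, beta,
                      first_stage, second_stage, delta=0.0):
    """Two-stage skeleton; first_stage and second_stage are assumed callables."""
    n = len(data)
    s1, s2 = data[: n // 2], data[n // 2:]   # step 1: split the dataset
    x1 = first_stage(s1, x0, beta / 2)       # step 2: stands in for Algorithm 1
    d_int = shrunk_diameter(L, lam, n, d, eps, beta, delta)  # step 3
    # Steps 4-5: Algorithm 2 runs on the ball of radius d_int / 2 around x1.
    return second_stage(s2, x1, d_int / 2, beta / 2)
```

The sketch carries no privacy guarantee by itself; privacy depends entirely on the solvers passed in.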

The following theorem (Theorem 3) states the convergence guarantees of our adaptive algorithm (Algorithm 3) in both the interpolation and non-interpolation regimes for the pure DP setting. The results for approximate DP are similar and can be obtained by replacing dd with dlog(1/δ)\sqrt{d\log(1/\delta)}; we give the full details in Appendix C.

Theorem 3.

Let each sample function FF be LL-Lipschitz and HH-smooth, and let the population function ff satisfy quadratic growth (Definition 2.4) with coefficient λ\lambda. Let xadaptx_{\rm adapt} be the output of Algorithm 3. Then

  1. Algorithm 3 is $\varepsilon$-DP.

  2. Without any additional interpolation assumption, $x_{\rm adapt}$ satisfies

    $$\mathbb{E}[f(x_{\rm adapt})-f(x^{\star})]\leq LD\cdot\widetilde{O}\left(\frac{1}{\sqrt{n}}+\frac{d}{n\varepsilon}\right)^{2}.$$

  3. Let problem (1) be an interpolation problem. Then $x_{\rm adapt}$ satisfies

    $$\mathbb{E}[f(x_{\rm adapt})-f(x^{\star})]\leq LD\left(\frac{1}{n^{\mu}}+\exp\left(-\widetilde{\Theta}\left(\frac{n\lambda^{2}}{H^{2}}\right)\right)+\exp\left(-\widetilde{\Theta}\left(\frac{\lambda n\varepsilon}{Hd}\right)\right)\right).$$

Proof    The privacy of Algorithm 3 follows from the privacy of Algorithms 1 and 2 and post-processing.

To prove the convergence guarantees, we first need to show that the optimal set $\mathcal{X}^{\star}$ is contained in the shrunken domain $\mathcal{X}_{\rm int}$. Using the high-probability guarantees of Algorithm 1, we know that with probability $1-\beta/2$, we have

f(x1)f(x)212Lλ\displaystyle f(x_{1})-f(x^{\star})\leq\frac{2^{12}L}{\lambda}\cdot (log(2/β)log3/2nn+dlog(1/δ)log(2/β)lognnε).\displaystyle\left(\frac{\sqrt{\log(2/\beta)}\log^{3/2}n}{\sqrt{n}}+\frac{\sqrt{d\log(1/\delta)}\log(2/\beta)\log n}{n\varepsilon}\right).

Using the quadratic growth condition, we immediately have xx12Dint/2\left\|{x^{\star}-x_{1}}\right\|_{2}\leq D_{\rm int}/2 and hence 𝒳𝒳int\mathcal{X}^{\star}\subset\mathcal{X}_{\rm int}.

Using smoothness, we have that for any x𝒳intx\in\mathcal{X}_{\rm int},

f(x)f(x)HDint22.\displaystyle f(x)-f(x^{\star})\leq\frac{HD_{\rm int}^{2}}{2}.

Since Algorithm 2 always outputs a point in its input domain (in this case $\mathcal{X}_{\rm int}$), we have, even in the non-interpolation setting, that

$$\mathbb{E}[f(x_{\rm adapt})-f(x^{\star})]\leq LD\cdot\widetilde{O}\left(\frac{1}{\sqrt{n}}+\frac{d}{n\varepsilon}\right)^{2}.$$

In the interpolation setting, the guarantees of Algorithm 2 hold and the result is immediate. ∎

5 Optimality and Superefficiency

We conclude this paper by providing a lower bound and a super-efficiency result that demonstrate the tightness of our upper bounds. Recall that our upper bound from Section 4 is roughly (up to constants)

1nc+exp(Θ~(nεd)),\frac{1}{n^{c}}+\exp\left(-\widetilde{\Theta}\left(\frac{n\varepsilon}{d}\right)\right), (5)

for any arbitrarily large $c$. We begin with an exponential lower bound showing that the second term in (5) is tight. We then prove a superefficiency result demonstrating that any private algorithm which avoids the first term in (5) cannot be adaptive to interpolation; that is, it cannot achieve the minimax-optimal rate for the family of non-interpolation problems.

Theorem 4 below presents our exponential lower bounds for private interpolation problems with growth. We use the notation and proof structure of Theorem 1. We let $\mathfrak{S}^{\lambda}\subset\mathfrak{S}$ be the subcollection of (function, dataset) pairs whose functions $f_{\mathcal{S}^{n}}$ also have $\lambda$-quadratic growth (Definition 2.4). The proof of Theorem 4 is found in Section D.1.

Theorem 4.

Let 𝒳d\mathcal{X}\subset\mathbb{R}^{d} contain a dd-dimensional 2\ell_{2}-ball of diameter DD. Then

𝔐(𝒳,𝔖λ,ε,0)λD296exp(2λnεHd).\displaystyle\mathfrak{M}(\mathcal{X},\mathfrak{S}^{\lambda},\varepsilon,0)\geq\frac{\lambda D^{2}}{96}\exp\left(-\frac{2\lambda n\varepsilon}{Hd}\right).
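Read as a sample-complexity statement, the bound of Theorem 4 can be inverted. The short calculation below (a sketch, valid when $\alpha<\lambda D^{2}/96$) shows that minimax error at most $\alpha$ under $\varepsilon$-DP requires a number of samples scaling as $\frac{d}{\varepsilon}\log\frac{1}{\alpha}$, matching the dimension-dependent term of the upper bound.

```latex
\begin{align*}
\frac{\lambda D^{2}}{96}\exp\left(-\frac{2\lambda n\varepsilon}{Hd}\right)\leq\alpha
\quad\Longleftrightarrow\quad
n\geq\frac{Hd}{2\lambda\varepsilon}\log\left(\frac{\lambda D^{2}}{96\,\alpha}\right).
\end{align*}
```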

This lower bound addresses the second term of (5); we now turn to our superefficiency results to lower bound the first term of (5). We start by defining some notation and making some simplifying assumptions. For a fixed function $F:\mathcal{X}\times\Omega\rightarrow\mathbb{R}$ that is convex and $H$-smooth with respect to its first argument, let $\mathfrak{S}_{\lambda}^{L}(F)$ be the set of datasets $\mathcal{S}$ of $n$ data points sampled from $\Omega$ such that $f_{\mathcal{S}}(x)\coloneqq\frac{1}{n}\sum_{s\in\mathcal{S}}F(x;s)$ is $L$-Lipschitz and $\lambda$-strongly convex. For simplicity, we will assume that (1) $\inf_{x\in\mathcal{X}}F(x;s)=0$ for all $s\in\Omega$, (2) $\mathcal{X}=[-D,D]\subset\mathbb{R}$, and (3) the codomain of $F$ is $\mathbb{R}_{+}$. With this setup, we present the formal statement of our result; the proof of Theorem 5 is found in Section D.2.

Theorem 5.

Suppose we have some 𝒮𝔖λL(F)\mathcal{S}\in\mathfrak{S}_{\lambda}^{L}(F) with L=2HDL=2HD such that (F,𝒮)(F,\mathcal{S}) satisfy Definition 2.3. Suppose there is an ε\varepsilon-DP estimator MM such that

𝔼[f𝒮(M(𝒮))]infx𝒳f𝒮(x)cD2eΘ((nε)t)\displaystyle\mathbb{E}[f_{\mathcal{S}}(M(\mathcal{S}))]-\inf_{x\in\mathcal{X}}f_{\mathcal{S}}(x)\leq cD^{2}e^{-\Theta((n\varepsilon)^{t})}

for some t>0t>0 and absolute constant cc. Then, for sufficiently large nn, there exists another dataset 𝒮𝔖λL(F)\mathcal{S}^{\prime}\in\mathfrak{S}_{\lambda}^{L}(F), where (F,𝒮)(F,\mathcal{S}^{\prime}) may not satisfy Definition 2.3, such that

𝔼[f𝒮(M(𝒮))]infx𝒳f𝒮(x)=Ω(D2(nε)2(1t))\displaystyle\mathbb{E}[f_{\mathcal{S}^{\prime}}(M(\mathcal{S}^{\prime}))]-\inf_{x\in\mathcal{X}}f_{\mathcal{S}^{\prime}}(x)=\Omega\left(\frac{D^{2}}{(n\varepsilon)^{2(1-t)}}\right)

To better contextualize this result, suppose there exists an algorithm which attains an $\exp(-\widetilde{\Theta}(n\varepsilon/d))$ convergence rate on interpolation problems; i.e., the algorithm is able to avoid the $1/n^{c}$ term in (5). Then Theorem 5 states that there exists some strongly convex, non-interpolation problem on which the aforementioned algorithm will optimize very poorly; in particular, the algorithm will only be able to return a solution that attains, on average, constant error on this “hard” problem. More generally, recall that in the non-interpolation quadratic growth setting, the optimal error rate is on the order of $1/(n\varepsilon)^{2}$ [7]. Theorem 5 shows that attaining better-than-polynomial error on quadratic growth interpolation problems implies that the algorithm cannot be minimax optimal in the non-interpolation quadratic growth setting. Thus, the rates our adaptive algorithms attain are the best we can hope for if we want an algorithm to perform well on both interpolation and non-interpolation quadratic growth problems.
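As a worked instance of Theorem 5 (constants suppressed), the lower bound on the non-interpolation problem can be compared with the optimal non-interpolation rate $1/(n\varepsilon)^{2}$ for specific values of $t\in(0,1]$:

```latex
\begin{align*}
t=1:&\quad \Omega\left(\frac{D^{2}}{(n\varepsilon)^{0}}\right)=\Omega(D^{2}),\\
t=\tfrac{1}{2}:&\quad \Omega\left(\frac{D^{2}}{n\varepsilon}\right),\\
\text{any } t\in(0,1]:&\quad \frac{D^{2}}{(n\varepsilon)^{2(1-t)}}\gg\frac{D^{2}}{(n\varepsilon)^{2}}\quad\text{as } n\varepsilon\to\infty.
\end{align*}
```

The case $t=1$ corresponds to the constant-error conclusion, and every $t>0$ leaves a polynomial gap to the optimal non-interpolation rate.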

References

  • ACCD [20] Hilal Asi, Karan Chadha, Gary Cheng, and John C. Duchi. Minibatch stochastic approximate proximal point methods. In Advances in Neural Information Processing Systems 33, 2020.
  • ACG+ [16] Martin Abadi, Andy Chu, Ian Goodfellow, Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In 23rd ACM Conference on Computer and Communications Security (ACM CCS), pages 308–318, 2016.
  • AD [19] Hilal Asi and John C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019.
  • AD [20] Hilal Asi and John Duchi. Near instance-optimality in differential privacy. arXiv:2005.10630 [cs.CR], 2020.
  • ADF+ [21] Hilal Asi, John Duchi, Alireza Fallah, Omid Javidbakht, and Kunal Talwar. Private adaptive gradient methods for convex optimization. In Proceedings of the 38th International Conference on Machine Learning, pages 383–392, 2021.
  • AFKT [21] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in 1\ell_{1} geometry. In Proceedings of the 38th International Conference on Machine Learning, 2021.
  • ALD [21] Hilal Asi, Daniel Levy, and John C. Duchi. Adapting to function difficulty and growth conditions in private optimization. In Advances in Neural Information Processing Systems 34, 2021.
  • BFGT [20] Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems 33, 2020.
  • BFTT [19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems 32, 2019.
  • BGN [21] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-Euclidean differentially private stochastic convex optimization. In Proceedings of the Thirty Fourth Annual Conference on Computational Learning Theory, pages 474–499, 2021.
  • BHM [18] Mikhail Belkin, Daniel Hsu, and Partha Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems 31, pages 2300–2311. Curran Associates, Inc., 2018.
  • BRT [19] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1611–1619, 2019.
  • BST [14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 55th Annual Symposium on Foundations of Computer Science, pages 464–473, 2014.
  • CCD [22] Karan Chadha, Gary Cheng, and John Duchi. Accelerated, optimal and parallel: Some results on model-based stochastic optimization. In International Conference on Machine Learning, pages 2811–2827. PMLR, 2022.
  • CMS [11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.
  • CSSS [11] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems 24, 2011.
  • DJW [13] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy and statistical minimax rates. In 54th Annual Symposium on Foundations of Computer Science, pages 429–438, 2013.
  • DR [14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3 & 4):211–407, 2014.
  • Duc [18] John C. Duchi. Introductory lectures on stochastic convex optimization. In The Mathematics of Data, IAS/Park City Mathematics Series. American Mathematical Society, 2018.
  • FKT [20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in linear time. In Proceedings of the Fifty-Second Annual ACM Symposium on the Theory of Computing, 2020.
  • HUL [93] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1993.
  • LB [20] Chaoyue Liu and Mikhail Belkin. Accelerating SGD with momentum for over-parameterized learning. In International Conference on Learning Representations, 2020.
  • MBB [18] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • NWS [14] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems 27, pages 1017–1025, 2014.
  • SSSSS [09] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In Proceedings of the Twenty Second Annual Conference on Computational Learning Theory, 2009.
  • SST [10] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 23, pages 2199–2207, 2010.
  • ST [13] Adam Smith and Abhradeep Thakurta. Differentially private feature selection via stability arguments, and the robustness of the Lasso. In Proceedings of the Twenty Sixth Annual Conference on Computational Learning Theory, pages 819–850, 2013.
  • SV [09] T. Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262–278, 2009.
  • VBS [19] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
  • Wai [19] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
  • WS [21] Blake E. Woodworth and Nathan Srebro. An even more optimal stochastic optimization algorithm: Minibatching and interpolation learning. In Advances in Neural Information Processing Systems 34, 2021.
  • WXDX [20] Di Wang, Hanshen Xiao, Srini Devadas, and Jinhui Xu. On differentially private stochastic convex optimization with heavy-tailed data. In Proceedings of the 37th International Conference on Machine Learning, 2020.

Appendix A Results from previous work

A.1 Proof of Lemma 2.1

  1. 1.

    Follows from Proposition IV.3.1.4 of [21].

  2. 2.

    Follows from Proposition IV.3.1.4 of [21].

  3. 3.

    Follows since for $L$-Lipschitz functions $0\in\nabla f(x)+L\mathbb{B}_{2}$.

  4. 4.

    Follows from Section VI.4.5 of [21].

A.2 Algorithms from [7]

Algorithm 4 Localization based Algorithm
0:  Dataset D=(s1,,sn)𝕊nD=(s_{1},\ldots,s_{n})\in\mathbb{S}^{n}, constraint set 𝒳\mathcal{X}, step size η\eta, initial point x0x_{0}, Lipschitz (clipping) constant LL, privacy parameters (ε,δ)(\varepsilon,\delta);
1:  Set k=lognk=\left\lceil{\log n}\right\rceil and n0=n/kn_{0}=n/k
2:  for i=1i=1 to kk  do
3:     Set ηi=24iη\eta_{i}=2^{-4i}\eta
4:     Solve the following ERM over 𝒳i={x𝒳:xxi122Lηin0}\mathcal{X}_{i}=\{x\in\mathcal{X}:\left\|{x-x_{i-1}}\right\|_{2}\leq{2L\eta_{i}n_{0}}\}:
Fi(x)=1n0j=1+(i1)n0in0F(x;sj)+1ηin0xxi122F_{i}(x)=\frac{1}{n_{0}}\sum_{j=1+(i-1)n_{0}}^{in_{0}}F(x;s_{j})+\frac{1}{\eta_{i}n_{0}}\left\|{x-x_{i-1}}\right\|_{2}^{2}
5:     Let x^i\hat{x}_{i} be the output of the optimization algorithm.
6:     if δ=0\delta=0 then
7:        Set ζi𝖫𝖺𝗉d(σi)\zeta_{i}\sim\mathsf{Lap}_{d}(\sigma_{i}) where σi=4Lηid/εi\sigma_{i}=4L\eta_{i}\sqrt{d}/\varepsilon_{i}
8:     else if δ>0\delta>0 then
9:        Set ζi𝖭(0,σi2)\zeta_{i}\sim\mathsf{N}(0,\sigma_{i}^{2}) where σi=4Lηilog(1/δ)/ε\sigma_{i}=4L\eta_{i}\sqrt{\log(1/\delta)}/\varepsilon
10:     end if
11:     Set xi=x^i+ζix_{i}=\hat{x}_{i}+\zeta_{i}
12:  end for
13:  return  the final iterate xkx_{k}
Algorithm 5 Epoch-based algorithms for κ\kappa-growth
0:  Dataset 𝒮=(s1,,sn)𝕊n\mathcal{S}=(s_{1},\ldots,s_{n})\in\mathbb{S}^{n}, constraint set 𝒳\mathcal{X}, Lipschitz (clipping) constant LL, initial point x0x_{0}, number of iterations TT, probability parameter β\beta, privacy parameters (ε,δ)(\varepsilon,\delta);
1:  Set n0=n/Tn_{0}=n/T and D0=diam(𝒳)D_{0}={\rm diam}(\mathcal{X})
2:  if δ=0\delta=0 then
3:     Set η0=D02Lmin(1n0log(n0)log(1/β),εdlog(1/β))\eta_{0}=\frac{D_{0}}{2L}\min\left(\frac{1}{\sqrt{n_{0}\log(n_{0})\log(1/\beta)}},\frac{\varepsilon}{d\log(1/\beta)}\right)
4:  else if δ>0\delta>0 then
5:     Set $\eta_{0}=\frac{D_{0}}{2L}\min\left(\frac{1}{\sqrt{n_{0}\log(n_{0})\log(1/\beta)}},\frac{\varepsilon}{\sqrt{d\log(1/\delta)}\log(1/\beta)}\right)$
6:  end if
7:  for i=0i=0 to T1T-1  do
8:     Let $\mathcal{S}_{i}=(s_{1+in_{0}},\dots,s_{(i+1)n_{0}})$
9:     Set Di=2iD0D_{i}=2^{-i}D_{0} and ηi=2iη0\eta_{i}=2^{-i}\eta_{0}
10:     Set 𝒳i={x𝒳:xxi2Di}\mathcal{X}_{i}=\{x\in\mathcal{X}:\left\|{x-x_{i}}\right\|_{2}\leq D_{i}\}
11:     Run Algorithm 4 on dataset 𝒮i\mathcal{S}_{i} with starting point xix_{i}, Lipschitz (clipping) constant LL, privacy parameter (ε,δ)(\varepsilon,\delta), domain 𝒳i\mathcal{X}_{i} (with diameter DiD_{i}), step size ηi\eta_{i}
12:     Let xi+1x_{i+1} be the output of the private procedure
13:  end for
14:  return  xTx_{T}
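The step-size initialization and halving schedule of Algorithm 5 can be sketched as below. This is a minimal illustration that transcribes steps 2 to 6 and step 9 only; the function names `initial_step_size` and `epoch_schedule` are hypothetical, and the call to Algorithm 4 inside each epoch is omitted.

```python
import math

def initial_step_size(D0, L, n0, d, eps, beta, delta=0.0):
    """eta_0 from steps 2-6 of Algorithm 5 (pure DP when delta == 0)."""
    stat = 1.0 / math.sqrt(n0 * math.log(n0) * math.log(1.0 / beta))
    if delta > 0:
        priv = eps / (math.sqrt(d * math.log(1.0 / delta)) * math.log(1.0 / beta))
    else:
        priv = eps / (d * math.log(1.0 / beta))
    return D0 / (2.0 * L) * min(stat, priv)

def epoch_schedule(D0, eta0, T):
    """Step 9: (D_i, eta_i) = (2^{-i} D_0, 2^{-i} eta_0) for i = 0, ..., T-1."""
    return [(D0 * 2.0 ** -i, eta0 * 2.0 ** -i) for i in range(T)]

if __name__ == "__main__":
    eta0 = initial_step_size(D0=2.0, L=1.0, n0=100, d=10, eps=1.0, beta=0.1)
    for i, (D_i, eta_i) in enumerate(epoch_schedule(2.0, eta0, T=4)):
        print(f"epoch {i}: D_i={D_i:.4f}, eta_i={eta_i:.6f}")
```

Both the domain diameter and the step size halve every epoch, so the regularization and noise scale in the inner call to Algorithm 4 shrink together with the localized domain.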

A.3 Theoretical results from [7]

We first reproduce the high probability guarantees of Algorithm 4 as proved in [7].

Proposition 2.

Let β1/(n+d)\beta\leq 1/(n+d), D2(𝒳)DD_{2}(\mathcal{X})\leq D and F(x;s)F(x;s) be convex, LL-Lipschitz for all s𝕊s\in\mathbb{S}. Setting

η=DLmin(1nlog(1/β),εdlog(1/β))\eta=\frac{D}{L}\min\left(\frac{1}{\sqrt{n\log(1/\beta)}},\frac{\varepsilon}{d\log(1/\beta)}\right)

then for δ=0\delta=0, Algorithm 4 is ε\varepsilon-DP and has with probability 1β1-\beta

f(x)f(x)128LD(log(1/β)log3/2nn+dlog(1/β)lognnε).f(x)-f(x^{\star})\leq 128LD\cdot\left(\frac{\sqrt{\log(1/\beta)}\log^{3/2}n}{\sqrt{n}}+\frac{d\log(1/\beta)\log n}{n\varepsilon}\right).
Proposition 3.

Let β1/(n+d)\beta\leq 1/(n+d), D2(𝒳)DD_{2}(\mathcal{X})\leq D and F(x;s)F(x;s) be convex, LL-Lipschitz for all s𝕊s\in\mathbb{S}. Setting

η=DLmin(1nlog(1/β),εdlog(1/δ)log(1/β)),\eta=\frac{D}{L}\min\left(\frac{1}{\sqrt{n\log(1/\beta)}},\frac{\varepsilon}{\sqrt{d\log(1/\delta)}\log(1/\beta)}\right),

then for δ>0\delta>0, Algorithm 4 is (ε,δ)(\varepsilon,\delta)-DP and has with probability 1β1-\beta

f(x)f(x)128LD(log(1/β)log3/2nn+dlog(1/δ)log(1/β)lognnε).f(x)-f(x^{\star})\leq 128LD\cdot\left(\frac{\sqrt{\log(1/\beta)}\log^{3/2}n}{\sqrt{n}}+\frac{\sqrt{d\log(1/\delta)}\log(1/\beta)\log n}{n\varepsilon}\right).

Now, we reproduce the high probability convergence guarantees of Algorithm 5.

Theorem 6.

Let β1/(n+d)\beta\leq 1/(n+d), D2(𝒳)DD_{2}(\mathcal{X})\leq D and F(x;s)F(x;s) be convex, LL-Lipschitz for all sΩs\in\Omega. Assume that ff has κ\kappa-growth (Assumption 2.4) with κκ¯>1\kappa\geq\underline{\kappa}>1. Setting T=2lognκ¯1T=\left\lceil\frac{2\log n}{\underline{\kappa}-1}\right\rceil, Algorithm 5 is ε\varepsilon-DP and has with probability 1β1-\beta

f(xT)minx𝒳f(x)4032λ1κ1(Llog(1/β)log3/2nn+Ldlog(1/β)lognnε(κ¯1))κκ1.f(x_{T})-\min_{x\in\mathcal{X}}f(x)\leq\frac{4032}{\lambda^{\frac{1}{\kappa-1}}}\cdot\left(\frac{L\sqrt{\log(1/\beta)}\log^{3/2}n}{\sqrt{n}}+\frac{Ld\log(1/\beta)\log n}{n\varepsilon(\underline{\kappa}-1)}\right)^{\frac{\kappa}{\kappa-1}}.
Theorem 7.

Let β1/(n+d)\beta\leq 1/(n+d), D2(𝒳)DD_{2}(\mathcal{X})\leq D and F(x;s)F(x;s) be convex, LL-Lipschitz for all sΩs\in\Omega. Assume that ff has κ\kappa-growth (Assumption 2.4) with κκ¯>1\kappa\geq\underline{\kappa}>1. Setting T=2lognκ¯1T=\left\lceil\frac{2\log n}{\underline{\kappa}-1}\right\rceil and δ>0\delta>0, Algorithm 5 is (ε,δ)(\varepsilon,\delta)-DP and has with probability 1β1-\beta

f(xT)minx𝒳f(x)4032λ1κ1(Llog(1/β)log3/2nn+Ldlog(1/δ)log(1/β)lognnε(κ¯1))κκ1.f(x_{T})-\min_{x\in\mathcal{X}}f(x)\leq\frac{4032}{\lambda^{\frac{1}{\kappa-1}}}\cdot\left(\frac{L\sqrt{\log(1/\beta)}\log^{3/2}n}{\sqrt{n}}+\frac{L\sqrt{d\log(1/\delta)}\log(1/\beta)\log n}{n\varepsilon(\underline{\kappa}-1)}\right)^{\frac{\kappa}{\kappa-1}}.

Appendix B Proofs from Section 3

B.1 Proof of Theorem 1

Consider the sample risk function

F(x;s)H2xs22𝟏{s0}.\displaystyle F(x;s)\coloneqq\frac{H}{2}\left\|{x-s}\right\|_{2}^{2}\cdot\mathbf{1}\{s\neq 0\}.

We define the datasets $\mathcal{S}_{v}^{n}\coloneqq\{0\}^{n-k}\cup\{v\}^{k}$. We define the corresponding population risk to be $f_{v}(x)\coloneqq\frac{1}{n}\sum_{s\in\mathcal{S}_{v}}F(x;s)=\frac{kH}{2n}\left\|x-v\right\|_{2}^{2}$. We select $\mathcal{V}$ to be a $\gamma$-packing (with respect to the $\ell_{2}$ norm) of the diameter-$D$ ball contained in $\mathcal{X}$. Define the separation between $v,v^{\prime}\in\mathcal{V}$ with respect to the losses $f_{v}$ and $f_{v^{\prime}}$ by

dopt(fv,fv)infx𝒳fv(x)2+fv(x)2ckH8nγ2.\displaystyle d_{\rm opt}(f_{v},f_{v^{\prime}})\coloneqq\inf_{x\in\mathcal{X}}\frac{f_{v}(x)}{2}+\frac{f_{v^{\prime}}(x)}{2}\geq c\coloneqq\frac{kH}{8n}\gamma^{2}.

For the sake of contradiction, suppose that 𝔼[fv(M(𝒮v))]τ\mathbb{E}[f_{v}(M(\mathcal{S}_{v}))]\leq\tau for τ<kHγ28n(1+ekε2dγd/Dd)\tau<\frac{kH\gamma^{2}}{8n(1+e^{k\varepsilon}2^{d}\gamma^{d}/D^{d})} for all v𝒳v\in\mathcal{X}. Then by Markov’s inequality, (fv(M(𝒮v))>c)τc\mathbb{P}(f_{v}(M(\mathcal{S}_{v}))>c)\leq\frac{\tau}{c} and (fv(M(𝒮v))c)1τc\mathbb{P}(f_{v}(M(\mathcal{S}_{v}))\leq c)\geq 1-\frac{\tau}{c} for all vv, and so

τc\displaystyle\frac{\tau}{c} (i)(fv(M(𝒮v))>c)\displaystyle\stackrel{{\scriptstyle(i)}}{{\geq}}\mathbb{P}(f_{v}(M(\mathcal{S}_{v}))>c)
(ii)(v𝒱{v}fv(M(𝒮v))c)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\geq}}\mathbb{P}(\cup_{v^{\prime}\in\mathcal{V}\setminus\{v\}}f_{v^{\prime}}(M(\mathcal{S}_{v}))\leq c)
(iii)ekεv𝒱{v}(fv(M(𝒮v))c)\displaystyle\stackrel{{\scriptstyle(iii)}}{{\geq}}e^{-k\varepsilon}\sum_{v^{\prime}\in\mathcal{V}\setminus\{v\}}\mathbb{P}(f_{v^{\prime}}(M(\mathcal{S}_{v^{\prime}}))\leq c)
(i)ekε(|𝒱|1)(1τc),\displaystyle\stackrel{{\scriptstyle(i)}}{{\geq}}e^{-k\varepsilon}(|\mathcal{V}|-1)\left(1-\frac{\tau}{c}\right),

where inequality (ii)(ii) follows from the definition of the separation, and (iii)(iii) follows from privacy and the disjoint nature of the events in the union. Rearranging, we get that

τkHγ28n(1+ekε(|𝒱|1)1),\displaystyle\tau\geq\frac{kH\gamma^{2}}{8n(1+e^{k\varepsilon}(|\mathcal{V}|-1)^{-1})},

which is a contradiction. By standard packing inequalities [30], we know that |𝒱|(D/2γ)d|\mathcal{V}|\geq(D/2\gamma)^{d}. Setting k=d/εk=d/\varepsilon and γ=D/2e\gamma=D/2e and using the fact that x/(x1)x/(x-1) is decreasing in xx gives

τdHD232nεe2(1+ed(ed1)1)HD2d96e2nε.\displaystyle\tau\geq\frac{dHD^{2}}{32n\varepsilon e^{2}(1+e^{d}(e^{d}-1)^{-1})}\geq\frac{HD^{2}d}{96e^{2}n\varepsilon}.
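For completeness, the substitution behind the last display can be spelled out as follows (a sketch using $k=d/\varepsilon$, $\gamma=D/(2e)$, the packing bound $|\mathcal{V}|\geq(D/2\gamma)^{d}=e^{d}$, and $d\geq 1$):

```latex
\begin{align*}
\frac{kH\gamma^{2}}{8n}
&=\frac{d}{\varepsilon}\cdot\frac{H}{8n}\cdot\frac{D^{2}}{4e^{2}}
=\frac{dHD^{2}}{32\,e^{2}\,n\varepsilon},\\
1+\frac{e^{k\varepsilon}}{|\mathcal{V}|-1}
&\leq 1+\frac{e^{d}}{e^{d}-1}
\leq 1+\frac{e}{e-1}
\leq 3.
\end{align*}
```

Dividing the first line by the bound in the second gives the stated constant $96e^{2}$.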

We now prove the (ε,δ)(\varepsilon,\delta)-DP lower bound. Consider the following sample risk function

F(x;s)H2(xs)2𝟏{s0}.\displaystyle F(x;s)\coloneqq\frac{H}{2}(x-s)^{2}\mathbf{1}\{s\neq 0\}.

We define the datasets 𝒮vn{0}nk{v}k\mathcal{S}_{v}^{n}\coloneqq\{0\}^{n-k}\cup\{v\}^{k} inducing the corresponding population risk fv(x)1ns𝒮vF(x;s)=kH2n(xv)2f_{v}(x)\coloneqq\frac{1}{n}\sum_{s\in\mathcal{S}_{v}}F(x;s)=\frac{kH}{2n}(x-v)^{2}. We select two points v,vv,v^{\prime} contained within the diameter DD ball contained in 𝒳\mathcal{X} such that |vv|=D|v-v^{\prime}|=D. Define the separation between v,v𝒱v,v^{\prime}\in\mathcal{V} with respect to the loss fvf_{v} and fvf_{v^{\prime}} as

dopt(fv,fv)infx𝒳fv(x)2+fv(x)2ckH8nD2\displaystyle d_{\rm opt}(f_{v},f_{v^{\prime}})\coloneqq\inf_{x\in\mathcal{X}}\frac{f_{v}(x)}{2}+\frac{f_{v^{\prime}}(x)}{2}\geq c\coloneqq\frac{kH}{8n}D^{2}

For the sake of contradiction, suppose that 𝔼[fv(M(𝒮v))]τ\mathbb{E}[f_{v}(M(\mathcal{S}_{v}))]\leq\tau for τ<kHD28n(ekεkeεδ1+ekε)\tau<\frac{kHD^{2}}{8n}\left\lparen\frac{e^{-k\varepsilon}-ke^{-\varepsilon}\delta}{1+e^{-k\varepsilon}}\right\rparen for all v𝒳v\in\mathcal{X}. Then by Markov’s inequality, (fv(M(𝒮v))>c)τc\mathbb{P}(f_{v}(M(\mathcal{S}_{v}))>c)\leq\frac{\tau}{c} and (fv(M(𝒮v))c)1τc\mathbb{P}(f_{v}(M(\mathcal{S}_{v}))\leq c)\geq 1-\frac{\tau}{c} for all vv, and so

τc\displaystyle\frac{\tau}{c} (i)(fv(M(𝒮v))>c)\displaystyle\stackrel{{\scriptstyle(i)}}{{\geq}}\mathbb{P}(f_{v}(M(\mathcal{S}_{v}))>c)
(ii)(fv(M(𝒮v))c)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\geq}}\mathbb{P}(f_{v^{\prime}}(M(\mathcal{S}_{v}))\leq c)
(iii)ekε(fv(M(𝒮v))c)keεδ\displaystyle\stackrel{{\scriptstyle(iii)}}{{\geq}}e^{-k\varepsilon}\mathbb{P}(f_{v^{\prime}}(M(\mathcal{S}_{v^{\prime}}))\leq c)-ke^{-\varepsilon}\delta
(i)ekε(1τc)keεδ,\displaystyle\stackrel{{\scriptstyle(i)}}{{\geq}}e^{-k\varepsilon}\left(1-\frac{\tau}{c}\right)-ke^{-\varepsilon}\delta,

where inequality (ii)(ii) follows from the definition of the separation, and (iii)(iii) follows from group privacy of (ε,δ)(\varepsilon,\delta)-privacy [18]. Rearranging, we get that

τkHD28n(ekεkeεδ1+ekε),\displaystyle\tau\geq\frac{kHD^{2}}{8n}\left\lparen\frac{e^{-k\varepsilon}-ke^{-\varepsilon}\delta}{1+e^{-k\varepsilon}}\right\rparen,

which is a contradiction. Setting k=1/εk=1/\varepsilon and using the fact δεeε1/2\delta\leq\varepsilon e^{\varepsilon-1}/2 gives the first result.
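The final substitution can be made explicit. The constants below are computed directly from the displayed bound with $k=1/\varepsilon$ and $\delta\leq\varepsilon e^{\varepsilon-1}/2$; they are a sketch and may differ from the constants in the theorem statement.

```latex
\begin{align*}
ke^{-\varepsilon}\delta
&\leq\frac{1}{\varepsilon}\,e^{-\varepsilon}\cdot\frac{\varepsilon e^{\varepsilon-1}}{2}
=\frac{e^{-1}}{2},\\
\tau
&\geq\frac{HD^{2}}{8n\varepsilon}\cdot\frac{e^{-1}-e^{-1}/2}{1+e^{-1}}
=\frac{HD^{2}}{16(e+1)\,n\varepsilon}.
\end{align*}
```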

Appendix C Proofs from Section 4

We first prove a lemma that each time we shrink the domain size, the set of interpolating solutions still lies in the new domain with high probability, and the new Lipschitz constant we define is a valid Lipschitz constant for the loss defined on the new domain. We prove it in generality for κ\kappa-growth.

Lemma C.1.

Let 𝒳\mathcal{X}^{\star} denote the set of interpolating solutions of problem (1). Then 𝒳𝒳i\mathcal{X}^{\star}\subset\mathcal{X}_{i} for all i[T]i\in[T] with probability 1β1-\beta, and F(y;s)2Li\left\|{\nabla F(y;s)}\right\|_{2}\leq L_{i} for all y𝒳iy\in\mathcal{X}_{i}.

Proof    We prove this lemma for the case when δ=0\delta=0, the case when δ>0\delta>0 follows similarly. For epoch ii, using Theorem 2 of [7], we have with probability 1β/T1-\beta/T,

f(x^i)f(x)Cκλ1κ1max{Lilog(T/β)log3/2mm,Lidlog(T/β)logmmε}κκ1\displaystyle f(\hat{x}_{i})-f(x^{\star})\leq\frac{C_{\kappa}}{\lambda^{\frac{1}{\kappa-1}}}\max\left\{\frac{L_{i}\sqrt{\log(T/\beta)}\log^{3/2}m}{\sqrt{m}},\frac{L_{i}{d}\log(T/\beta)\log m}{m\varepsilon}\right\}^{\frac{\kappa}{\kappa-1}}

Using the growth condition on f()f(\cdot), we have

x^ix2κ(f(x^i)f(x))λκ(Cκκ)1/κmax{Lilog(T/β)log3/2mλm,Lidlog(T/β)logmλmε}1κ1,\displaystyle\left\|{\hat{x}_{i}-x^{\star}}\right\|_{2}\leq\sqrt[\kappa]{\frac{\kappa(f(\hat{x}_{i})-f(x^{\star}))}{\lambda}}\leq(C_{\kappa}\kappa)^{1/\kappa}\max\left\{\frac{L_{i}\sqrt{\log(T/\beta)}\log^{3/2}m}{\lambda\sqrt{m}},\frac{L_{i}{d}\log(T/\beta)\log m}{\lambda m\varepsilon}\right\}^{\frac{1}{\kappa-1}},

Using cκ=2(Cκκ)1/κc_{\kappa}=2(C_{\kappa}\kappa)^{1/\kappa}, we get x^ix2Di+1/2\left\|{\hat{x}_{i}-x^{\star}}\right\|_{2}\leq D_{i+1}/2 with probability 1β/T1-\beta/T. Thus, for each epoch ii, with probability 1β/T1-\beta/T, each point in the set 𝒳\mathcal{X}^{\star} of optimizers lies in the domain 𝒳i\mathcal{X}_{i}. Using a union bound on all epochs, we have 𝒳𝒳i\mathcal{X}^{\star}\subset\mathcal{X}_{i} for all i[T]i\in[T] with probability 1β1-\beta.

We now prove the second part of the lemma. Using the smoothness of F(;s)F(\cdot;s) and that F(x;s)=0\nabla F(x^{\star};s)=0 for all x𝒳x^{\star}\in\mathcal{X}^{\star}, we have

$$\left\|\nabla F(y;s)\right\|_{2}=\left\|\nabla F(y;s)-\nabla F(x^{\star};s)\right\|_{2}\leq H\left\|y-x^{\star}\right\|_{2}\leq H\left(\left\|y-\hat{x}_{i}\right\|_{2}+\left\|x^{\star}-\hat{x}_{i}\right\|_{2}\right)\leq HD_{i}=L_{i}$$

as desired. ∎

We now restate and prove the convergence rate of Algorithm 2.

See Theorem 2.

Proof    First we prove the privacy guarantee of the algorithm. Each sample impacts only one of the iterates $\hat{x}_{i}$; thus Algorithm 2 satisfies the same privacy guarantee as Algorithm 5 by post-processing.

We divide the utility proof into two main parts: first, we check the validity of the assumptions needed to apply Algorithm 5, and second, we use its high-probability convergence guarantees to obtain the final rates. For the first part, we ensure that the optimum set lies in the new domain $\mathcal{X}_{i}$ at step $i$ and that the Lipschitz constant $L_{i}$ defined with respect to this domain is a valid Lipschitz constant. This follows from Lemma C.1.

Next, we use the high probability convergence guarantees of the subalgorithm Algorithm 5 to get convergence rates for Algorithm 2. We prove it for the case when δ=0\delta=0, the case when δ>0\delta>0 is similar. We know that

Li\displaystyle L_{i} =HDi\displaystyle=HD_{i}
=c2HLi1λmax{log(T/β)log3/2mm,dlog(T/β)logmmε}.\displaystyle=c_{2}\frac{HL_{i-1}}{\lambda}\max\left\{\frac{\sqrt{\log(T/\beta)}\log^{3/2}m}{\sqrt{m}},\frac{d\log(T/\beta)\log m}{m\varepsilon}\right\}.

Thus we have

LT=(c2Hλmax{log(T/β)log3/2mm,dlog(T/β)logmmε})T1L1.\displaystyle L_{T}=\left(c_{2}\frac{H}{\lambda}\max\left\{\frac{\sqrt{\log(T/\beta)}\log^{3/2}m}{\sqrt{m}},\frac{d\log(T/\beta)\log m}{m\varepsilon}\right\}\right)^{T-1}L_{1}.

Using Theorem 2 of [7] on the last epoch, we have with probability $1-\beta$ that

\begin{align*}
f(\hat{x}_{T})-f(x^{\star}) &\leq C_{2}\frac{L_{T}^{2}}{\lambda}\max\left\{\frac{\log(T/\beta)\log^{3/2}m}{m},\frac{d^{2}\log^{2}(T/\beta)\log m}{m^{2}\varepsilon^{2}}\right\} \\
&= \left(c_{2}^{2}\frac{H^{2}}{\lambda^{2}}\max\left\{\frac{\log(T/\beta)\log^{3/2}m}{m},\frac{d^{2}\log^{2}(T/\beta)\log m}{m^{2}\varepsilon^{2}}\right\}\right)^{T}\frac{C_{2}L_{1}^{2}\lambda}{H^{2}c_{2}^{2}} \\
&= \left(c_{2}^{2}\frac{H^{2}}{\lambda^{2}}\max\left\{\frac{\log(T/\beta)\log^{3/2}m}{m},\frac{d^{2}\log^{2}(T/\beta)\log m}{m^{2}\varepsilon^{2}}\right\}\right)^{T}\frac{L_{1}^{2}\lambda}{8H^{2}}.
\end{align*}

Let $m=k\log^{2}n$ and $T=n/m$ for some $k$ such that

\[
c_{2}^{2}\frac{H^{2}}{\lambda^{2}}\max\left\{\frac{\log(n/(\beta k\log^{2}n))\log^{3/2}(k\log^{2}n)}{k\log^{2}n},\frac{d^{2}\log^{2}(n/(\beta k\log^{2}n))\log(k\log^{2}n)}{(k\log^{2}n)^{2}\varepsilon^{2}}\right\}\leq\frac{1}{e}.
\]

This holds, for example, for

\[
k=256\frac{H\log(1/\beta)}{\lambda}\max\left\{\frac{256H}{\lambda},\frac{d}{\varepsilon\sqrt{\log n}}\right\},
\]

for sufficiently large $n$. Using these values of $m$ and $T$, we have

\[
f(\hat{x}_{T})-f(x^{\star})\leq\frac{C_{2}L_{1}^{2}\lambda}{H^{2}c_{2}^{2}}\exp\left(-\frac{n}{k\log^{2}n}\right)=\frac{L_{1}^{2}\lambda}{8H^{2}}\exp\left(-\frac{n}{k\log^{2}n}\right). \tag{6}
\]

To obtain the convergence result in expectation, let $A$ denote the “bad” event, of probability at most $\beta$, on which $f(\hat{x}_{T})-f(x^{\star})>\frac{L_{1}^{2}\lambda}{8H^{2}}\exp\left(-\frac{n}{k\log^{2}n}\right)$. Now,

\begin{align*}
\mathbb{E}[f(\hat{x}_{T})-f(x^{\star})] &\leq \beta\frac{HD^{2}}{2}+(1-\beta)\,\mathbb{E}[f(\hat{x}_{T})-f(x^{\star})\mid A^{c}] \\
&\leq \beta\frac{HD^{2}}{2}+\mathbb{E}[f(\hat{x}_{T})-f(x^{\star})\mid A^{c}].
\end{align*}

Substituting $\beta=\frac{1}{n^{\mu}}$ and using Equation 6, we get the result. ∎

C.1 Algorithm for general κ\kappa

Algorithm 6 Epoch-based clipped-GD
0:  number of epochs: $T$, samples in each round: $m=n/T$, diameter at the start: $D_{1}$, Lipschitz constant at the start: $L_{1}$, domain $\mathcal{X}_{1}$, initial point $\hat{x}_{0}$
1:  for $i=1$ to $T$ do
2:     $\hat{x}_{i}\leftarrow$ output of Algorithm 5 when run on domain $\mathcal{X}_{i}$ (diameter $D_{i}$), with Lipschitz constant $L_{i}$, using $m$ samples.
3:     if $\delta=0$ then
4:        Set $D_{i+1}=c_{\kappa}\left(\frac{L_{i}}{\lambda}\max\left\{\frac{\sqrt{\log(T/\beta)}\log^{3/2}m}{\sqrt{m}},\frac{d\log(T/\beta)\log m}{m\varepsilon}\right\}\right)^{\frac{1}{\kappa-1}}$
5:     else if $\delta>0$ then
6:        Set $D_{i+1}=c_{\kappa}\left(\frac{L_{i}}{\lambda}\max\left\{\frac{\sqrt{\log(T/\beta)}\log^{3/2}m}{\sqrt{m}},\frac{\sqrt{d\log(1/\delta)}\log(T/\beta)\log m}{m\varepsilon}\right\}\right)^{\frac{1}{\kappa-1}}$
7:     end if
8:     Set $\mathcal{X}_{i+1}=\{x:\left\|x-\hat{x}_{i}\right\|_{2}\leq D_{i+1}/2\}$
9:     Set $L_{i+1}=HD_{i+1}$
10:  end for
11:  return the final iterate $\hat{x}_{T}$
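The epoch structure of Algorithm 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `placeholder_solver` is a hypothetical, non-private projected SGD routine standing in for Algorithm 5 (it provides no privacy guarantee), the initial domain $\mathcal{X}_{1}$ is assumed to be a ball of diameter $D_{1}$ around $\hat{x}_{0}$, and the value `c_kappa=0.1` in the demo is chosen only so that the domain visibly shrinks at this small sample size.

```python
import math

def project_ball(x, center, radius):
    """Euclidean projection of x onto the ball {y : ||y - center|| <= radius}."""
    diff = [a - b for a, b in zip(x, center)]
    norm = math.sqrt(sum(v * v for v in diff))
    if norm <= radius:
        return list(x)
    return [c + radius * v / norm for c, v in zip(center, diff)]

def placeholder_solver(grad, center, radius, samples, step):
    """Hypothetical stand-in for Algorithm 5: plain (non-private) projected SGD."""
    x = list(center)
    for s in samples:
        g = grad(x, s)
        x = project_ball([xi - step * gi for xi, gi in zip(x, g)], center, radius)
    return x

def epoch_shrinking(grad, x0, D1, H, lam, samples, T, d, eps, beta,
                    kappa=2.0, c_kappa=1.0):
    """Epoch loop (delta = 0): solve on X_i, then shrink D_{i+1} and L_{i+1}."""
    m = len(samples) // T
    center, D, L = list(x0), D1, H * D1
    for i in range(T):
        batch = samples[i * m:(i + 1) * m]   # each sample is used in one epoch only
        x_hat = placeholder_solver(grad, center, D / 2, batch, step=1.0 / H)
        stat = math.sqrt(math.log(T / beta)) * math.log(m) ** 1.5 / math.sqrt(m)
        priv = d * math.log(T / beta) * math.log(m) / (m * eps)
        D = c_kappa * ((L / lam) * max(stat, priv)) ** (1.0 / (kappa - 1.0))
        center, L = x_hat, H * D             # X_{i+1}: ball of radius D/2 around x_hat
    return x_hat, D

# Toy interpolation problem: F(x; s) = (s/2) * ||x - x_star||^2 with s in [0.5, 1],
# so every sample loss is minimized at x_star (H = 1, lambda = 0.5).
x_star = [0.3, -0.2]
samples = [0.5 + 0.5 * ((37 * i) % 100) / 100 for i in range(4000)]
grad = lambda x, s: [s * (xi - ti) for xi, ti in zip(x, x_star)]
x_out, D_next = epoch_shrinking(grad, [0.0, 0.0], D1=2.0, H=1.0, lam=0.5,
                                samples=samples, T=4, d=2, eps=1.0, beta=0.01,
                                kappa=2.0, c_kappa=0.1)
print(x_out, D_next)
```

On this toy problem the iterate stays at the common minimizer while the domain diameter contracts from epoch to epoch, which is the mechanism the analysis exploits.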

Remark     $c_{\kappa}$ is an absolute constant that depends on the high-probability performance guarantees of Algorithm 5. One can check that $C_{\kappa}$ is at most $2^{12}$ (approximately $4000$), and hence $c_{\kappa}\leq 2(2^{12}\kappa)^{1/\kappa}\leq 4\cdot 2^{12/\kappa}$.

Theorem 8.

Assume each sample function $F$ is $L$-Lipschitz and $H$-smooth, and let the population function $f$ satisfy quadratic growth (Definition 2.4). Let Problem (1) be an interpolation problem. Then Algorithm 6 is $(\varepsilon,\delta)$-DP. For $\delta=0$, Algorithm 6 with $T=\log n$ and $m=\frac{n}{\log n}$ achieves error

\[
f(\hat{x}_{T})-f(x^{\star})\leq\widetilde{O}\left(\frac{1}{\sqrt{n}}+\frac{d}{n\varepsilon}\right)^{\frac{\kappa}{\kappa-2}},
\]

with probability $1-\beta$. For $\delta>0$, Algorithm 6 with $T=\log n$ and $m=n/\log n$ achieves error

\[
f(\hat{x}_{T})-f(x^{\star})\leq\widetilde{O}\left(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}\log(1/\delta)}{n\varepsilon}\right)^{\frac{\kappa}{\kappa-2}},
\]

with probability $1-\beta$.

Proof    The privacy guarantee follows from the proof of Theorem 2. We divide the utility proof into two main parts: first we check that the assumptions needed to apply Algorithm 5 hold in every epoch, and then we use its high-probability convergence guarantees to obtain the final rates. For the first part, we must ensure that the optimum set lies in the new domain defined at every step and that the Lipschitz constant defined with respect to that domain is valid. Both follow from Lemma C.1.

Next, we use the high-probability convergence guarantees of the subalgorithm Algorithm 5 to obtain convergence rates for Algorithm 6.

We prove the case $\delta=0$; the case $\delta>0$ is similar. We know that

\begin{align*}
L_{i} &= HD_{i} \\
&= c_{\kappa}H\left(\frac{L_{i-1}}{\lambda}\max\left\{\frac{\sqrt{\log(T/\beta)}\log^{3/2}m}{\sqrt{m}},\frac{d\log(T/\beta)\log m}{m\varepsilon}\right\}\right)^{\frac{1}{\kappa-1}}.
\end{align*}

Thus, we have

\[
L_{T}=(c_{\kappa}H)^{\frac{\kappa-1}{\kappa-2}\left(1-\frac{1}{(\kappa-1)^{T-1}}\right)}\left(\frac{1}{\lambda}\max\left\{\frac{\sqrt{\log(T/\beta)}\log^{3/2}m}{\sqrt{m}},\frac{d\log(T/\beta)\log m}{m\varepsilon}\right\}\right)^{\frac{1}{\kappa-2}\left(1-\frac{1}{(\kappa-1)^{T-1}}\right)}L_{1}^{\frac{1}{(\kappa-1)^{T-1}}}.
\]
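The closed form for $L_{T}$ is a geometric-series solution of the recursion $L_{i}=A\,L_{i-1}^{1/(\kappa-1)}$, where $A$ collects the factor $c_{\kappa}H$ and the $\max$ term raised to the power $1/(\kappa-1)$. A minimal numerical check, with $A$, $\kappa$, $T$ and $L_{1}$ set to arbitrary illustrative values (assumptions, not values from the paper):

```python
import math

def iterate_recursion(L1, A, kappa, T):
    """Apply L_i = A * L_{i-1}^(1/(kappa-1)) for i = 2, ..., T."""
    L = L1
    for _ in range(T - 1):
        L = A * L ** (1.0 / (kappa - 1.0))
    return L

def closed_form(L1, A, kappa, T):
    """Solution obtained by summing the exponents 1 + r + ... + r^(T-2)."""
    r = 1.0 / (kappa - 1.0)
    tail = r ** (T - 1)
    return A ** ((1.0 - tail) / (1.0 - r)) * L1 ** tail

kappa, T, L1, A = 3.0, 12, 5.0, 0.2
iterated = iterate_recursion(L1, A, kappa, T)
formula = closed_form(L1, A, kappa, T)
limit = A ** ((kappa - 1.0) / (kappa - 2.0))   # value once the 1/(kappa-1)^(T-1) terms vanish
print(iterated, formula, limit)
```

The dependence on $L_{1}$ enters only through the exponent $1/(\kappa-1)^{T-1}$, so after a logarithmic number of epochs the iterated value is already close to the limiting expression.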

We note that for $T\sim\log n$ we have $\frac{1}{(\kappa-1)^{T-1}}\approx\frac{1}{n^{\log(\kappa-1)}}$, so for large $n$ factors of the form $a^{-\frac{1}{n^{\log(\kappa-1)}}}$ are approximately $1$. Absorbing these factors into an additional constant $C^{\prime}$, we can write

\[
L_{T}=C^{\prime}(c_{\kappa}H)^{\frac{\kappa-1}{\kappa-2}}\left(\frac{1}{\lambda}\max\left\{\frac{\sqrt{\log(T/\beta)}\log^{3/2}m}{\sqrt{m}},\frac{d\log(T/\beta)\log m}{m\varepsilon}\right\}\right)^{\frac{1}{\kappa-2}}L_{1}^{\frac{1}{(\kappa-1)^{T-1}}}.
\]

Using Theorem 2 of [7] on the last epoch, we have with probability $1-\beta$ that

\begin{align*}
f(\hat{x}_{T})-f(x^{\star}) &\leq \frac{C_{\kappa}}{\lambda^{\frac{1}{\kappa-1}}}\max\left\{\frac{L_{T}\sqrt{\log(T/\beta)}\log^{3/2}m}{\sqrt{m}},\frac{L_{T}d\log(T/\beta)\log m}{m\varepsilon}\right\}^{\frac{\kappa}{\kappa-1}} \\
&= \frac{(C^{\prime})^{\frac{\kappa}{\kappa-1}}C_{\kappa}(c_{\kappa}H)^{\frac{\kappa}{\kappa-2}}}{\lambda^{\frac{2}{\kappa-2}}}\max\left\{\frac{\sqrt{\log(T/\beta)}\log^{3/2}m}{\sqrt{m}},\frac{d\log(T/\beta)\log m}{m\varepsilon}\right\}^{\frac{\kappa}{\kappa-2}}L_{1}^{\frac{\kappa}{(\kappa-1)^{T}}}.
\end{align*}

Choosing $T=\log n$ and $m=n/\log n$, we have

\[
f(\hat{x}_{T})-f(x^{\star})\leq\frac{(C^{\prime})^{\frac{\kappa}{\kappa-1}}C_{\kappa}(c_{\kappa}H)^{\frac{\kappa}{\kappa-2}}}{\lambda^{\frac{2}{\kappa-2}}}\max\left\{\frac{\sqrt{\log(\log n/\beta)}\log^{3/2}(n/\log n)}{\sqrt{n/\log n}},\frac{d\log(\log n/\beta)\log(n/\log n)}{\varepsilon n/\log n}\right\}^{\frac{\kappa}{\kappa-2}}L_{1}^{\frac{\kappa}{n}}.
\]

We now restate the result in terms of the sample complexity required to achieve a given error. To ensure $f(\hat{x}_{T})-f(x^{\star})<\alpha$, it is sufficient to ensure

\[
\frac{(C^{\prime})^{\frac{\kappa}{\kappa-1}}C_{\kappa}(c_{\kappa}H)^{\frac{\kappa}{\kappa-2}}}{\lambda^{\frac{2}{\kappa-2}}}\max\left\{\frac{\sqrt{\log(T/\beta)}\log^{3/2}m}{\sqrt{m}},\frac{d\log(T/\beta)\log m}{m\varepsilon}\right\}^{\frac{\kappa}{\kappa-2}}L_{1}^{\frac{\kappa}{(\kappa-1)^{T}}}<\alpha.
\]

Choosing $n=\tilde{O}\left(\max\left\{\left(\frac{1}{\alpha^{2}}\right)^{\frac{\kappa-2}{\kappa}},\left(\frac{d}{\varepsilon\alpha}\right)^{\frac{\kappa-2}{\kappa}}\right\}\right)$ ensures error at most $\alpha$. ∎

Corollary C.1.

Under the conditions of Theorem 8, for $\delta=0$, the expected error of the output of Algorithm 6 is upper bounded by

\[
\mathbb{E}[f(\hat{x}_{T})-f(x^{\star})]\leq\widetilde{O}\left(\frac{1}{\sqrt{n}}+\frac{d}{n\varepsilon}\right)^{\frac{\kappa}{\kappa-2}},
\]

for arbitrarily large $\mu$. For $\delta>0$, the expected error of the output of Algorithm 6 is upper bounded by

\[
\mathbb{E}[f(\hat{x}_{T})-f(x^{\star})]\leq\widetilde{O}\left(\frac{1}{\sqrt{n}}+\frac{d}{n\varepsilon}\right)^{\frac{\kappa}{\kappa-2}},
\]

for arbitrarily large $\mu$.

C.2 (ε,δ)(\varepsilon,\delta) version of Theorem 3

Theorem 9.

Assume each sample function $F$ is $L$-Lipschitz and $H$-smooth, and let the population function $f$ satisfy quadratic growth (Definition 2.4) with coefficient $\lambda$. Let $x_{\rm adapt}$ be the output of Algorithm 3. Then,

  1. Algorithm 3 is $(\varepsilon,\delta)$-DP.

  2. Without any additional interpolation assumption, the expected error of $x_{\rm adapt}$ is upper bounded by

    \[
    \mathbb{E}[f(x_{T})-f(x^{\star})]\leq LD\cdot\widetilde{O}\left(\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\delta)}}{n\varepsilon}\right)^{2}.
    \]

  3. Let Problem (1) be an interpolation problem. Then the expected error of $x_{\rm adapt}$ is upper bounded by

    \[
    \mathbb{E}[f(x_{T})-f(x^{\star})]\leq LD\left(\frac{1}{n^{\mu}}+\exp\left(-\widetilde{\Theta}\left(\frac{n\lambda^{2}}{H^{2}}\right)\right)+\exp\left(-\widetilde{\Theta}\left(\frac{\lambda n\varepsilon}{H\sqrt{d\log(1/\delta)}}\right)\right)\right).
    \]

Proof    First, we note that the privacy of Algorithm 3 follows from the privacy of Algorithm 2 and Algorithm 1 and post-processing.

To prove the convergence guarantees, we first need to show that the optimal set $\mathcal{X}^{\star}$ is contained in the shrunken domain $\mathcal{X}_{\rm int}$. Using the high-probability guarantees of Algorithm 1, we know that with probability $1-\beta/2$, we have

f(x1)f(x)212Lλlog(κ1)(log(2/β)log3/2nn+dlog(1/δ)log(2/β)lognnε).\displaystyle f(x_{1})-f(x^{\star})\leq\frac{2^{12}L}{\lambda}\cdot^{\log(\kappa-1)}\left(\frac{\sqrt{\log(2/\beta)}\log^{3/2}n}{\sqrt{n}}\right.+\left.\frac{\sqrt{d\log(1/\delta)}\log(2/\beta)\log n}{n\varepsilon}\right).

Using the quadratic growth condition, we immediately have xx12Dint/2\left\|{x^{\star}-x_{1}}\right\|_{2}\leq D_{\rm int}/2 and hence 𝒳𝒳int\mathcal{X}^{\star}\subset\mathcal{X}_{\rm int}.

Using smoothness, we have that for any x𝒳intx\in\mathcal{X}_{\rm int},

f(x)f(x)HDint22.\displaystyle f(x)-f(x^{\star})\leq\frac{HD_{\rm int}^{2}}{2}.

Since Algorithm 2 always outputs a point in its input domain (in this case 𝒳int\mathcal{X}_{\rm int}), even in the non-interpolation setting we have that

𝔼[f(xT)f(x)]LDO~(1n+dlog(1/δ)nε)2.\displaystyle\mathbb{E}[f(x_{T})-f(x^{\star})]\leq LD\cdot\widetilde{O}\left(\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\delta)}}{n\varepsilon}\right)^{2}.

In the interpolation setting, the guarantees of Algorithm 2 hold and the result is immediate. ∎

Appendix D Proofs from Section 5

D.1 Proof of Theorem 4

The proof is exactly the same as that of Theorem 1, except that we set $k=\frac{\lambda n}{H}$ to ensure that $f_{v}(x)$ has $\lambda$-quadratic growth for any $v\in\mathcal{X}$. Finally, we set $\gamma=\frac{D}{2}\exp\left(\frac{-\lambda n\varepsilon}{Hd}\right)$ and use the fact that $e^{\frac{\lambda n\varepsilon}{H}}\geq 2$, together with the fact that $x/(x-1)$ is decreasing in $x$, to obtain the desired lower bound.

D.2 Proof of Theorem 5

The proof of this result hinges on the following two supporting propositions. We first restate Proposition 2.2 from [4] (listed as Proposition 4 below) in our notation for convenience. We then state Proposition 5, which gives upper and lower bounds on the modulus of continuity (defined in Proposition 4). The lower bound in Proposition 5 is one of the novel contributions of this paper; its proof can be found in Section D.2.1. We first assume Proposition 5 holds and prove Theorem 5, and then return to prove the proposition.

Proposition 4.

For some fixed F:𝒳,ΩF:\mathcal{X},\Omega\rightarrow\mathbb{R} which is convex and HH-smooth with respect to its first argument, let 𝒮𝔖λL(F)\mathcal{S}\in\mathfrak{S}_{\lambda}^{L}(F) for L=2HDL=2HD. Let x𝒮=argminx𝒳f𝒮(x)x_{\mathcal{S}}^{\star}=\mathop{\rm argmin}_{x^{\prime}\in\mathcal{X}}f_{\mathcal{S}}(x^{\prime}). Define the corresponding modulus of continuity

ω(𝒮,1/ε)sup𝒮𝔖λL(F){|x𝒮x𝒮|:dham(𝒮,𝒮)1/ε}.\displaystyle\omega(\mathcal{S},1/\varepsilon)\coloneqq\sup_{\mathcal{S}^{\prime}\in\mathfrak{S}_{\lambda}^{L}(F)}\{|x_{\mathcal{S}}^{\star}-x_{\mathcal{S}^{\prime}}^{\star}|:d_{ham}(\mathcal{S},\mathcal{S}^{\prime})\leq 1/\varepsilon\}.

Assume the mechanism MM is ε\varepsilon-DP and for some γ12e\gamma\leq\frac{1}{2e} achieves

𝔼[|M(𝒮)x𝒮|]γ(ω(𝒮;1/ε)2).\mathbb{E}[|M(\mathcal{S})-x_{\mathcal{S}}^{\star}|]\leq\gamma\left(\frac{\omega(\mathcal{S};1/\varepsilon)}{2}\right).

Then there exists a sample 𝒮𝔖λL(F)\mathcal{S}^{\prime}\in\mathfrak{S}_{\lambda}^{L}(F) where dham(𝒮,𝒮)log(1/2γ)2εd_{ham}(\mathcal{S},\mathcal{S}^{\prime})\leq\frac{\log(1/2\gamma)}{2\varepsilon} such that

𝔼[|M(𝒮)x𝒮|]14(14ω(𝒮;log(1/2γ)2ε)).\displaystyle\mathbb{E}[|M(\mathcal{S}^{\prime})-x_{\mathcal{S}^{\prime}}^{\star}|]\geq\frac{1}{4}\ell\left(\frac{1}{4}\omega\left(\mathcal{S}^{\prime};\frac{\log(1/2\gamma)}{2\varepsilon}\right)\right).
Proposition 5.

Let $F:\mathcal{X}\times\Omega\rightarrow\mathbb{R}$ be convex and $H$-smooth in its first argument, and satisfy $\inf_{x\in\mathcal{X}}F(x;s)=0$ for all $s\in\Omega$. Suppose we have some $\mathcal{S}\in\mathfrak{S}_{\lambda}^{L}(F)$ with $L=2HD$ which also induces an interpolation problem (a problem which satisfies Definition 2.3). With respect to the dataset $\mathcal{S}$, the modulus of continuity $\omega(\mathcal{S},1/\varepsilon)$ satisfies

\[
\frac{D}{n\varepsilon}\leq\omega(\mathcal{S},1/\varepsilon)\leq\frac{8HD}{\lambda n\varepsilon}.
\]

With these two results, we can now prove Theorem 5. Restating the conditions of the theorem formally, suppose for some constants c0c_{0} and c1c_{1} there is an ε\varepsilon-DP estimator MM such that

𝔼[f𝒮(M(𝒮))]infx𝒳f𝒮(x)c0D2ec1(nε)t.\displaystyle\mathbb{E}[f_{\mathcal{S}}(M(\mathcal{S}))]-\inf_{x\in\mathcal{X}}f_{\mathcal{S}}(x)\leq c_{0}D^{2}e^{-c_{1}(n\varepsilon)^{t}}.

If $t>1$, we may replace $t$ by $\min(1,t)=1$: since $e^{-c_{1}(n\varepsilon)^{t}}\leq e^{-c_{1}n\varepsilon}$ whenever $n\varepsilon\geq 1$, the bound still holds for large enough $n$. Letting $x_{\mathcal{S}}^{\star}=\mathop{\rm argmin}_{x\in\mathcal{X}}f_{\mathcal{S}}(x)$ and using the definition of strong convexity, there exist constants $c_{2}$ and $c_{3}$ such that

𝔼[|M(𝒮)x𝒮|]c2Dec3(nε)t\displaystyle\mathbb{E}[|M(\mathcal{S})-x_{\mathcal{S}}^{\star}|]\leq c_{2}De^{-c_{3}(n\varepsilon)^{t}}

To satisfy the expression from Proposition 4, we select γ\gamma such that

γω(𝒮;1/ε)2=c2Dec3(nε)t.\displaystyle\frac{\gamma\omega(\mathcal{S};1/\varepsilon)}{2}=c_{2}De^{-c_{3}(n\varepsilon)^{t}}.

Using Proposition 5, we must have $\frac{\lambda n\varepsilon}{4H}c_{2}\exp(-c_{3}(n\varepsilon)^{t})\leq\gamma\leq 2n\varepsilon c_{2}\exp(-c_{3}(n\varepsilon)^{t})$. Using Proposition 4, we have that

𝔼[|M(𝒮)x𝒮|]ω(𝒮;log(1/2γ)2ε)\displaystyle\mathbb{E}[|M(\mathcal{S}^{\prime})-x_{\mathcal{S}^{\prime}}^{\star}|]\geq\omega\left(\mathcal{S}^{\prime};\frac{\log(1/2\gamma)}{2\varepsilon}\right)

Before performing a further lower bound on this quantity, we first verify that log(1/2γ)2ε\frac{\log(1/2\gamma)}{2\varepsilon} does not exceed the total size of the dataset, nn. Using our bounds on γ\gamma, we see that

log(1/2γ)2ε12ε(c3(nε)tlogc2log(λnε2H))\displaystyle\frac{\log(1/2\gamma)}{2\varepsilon}\leq\frac{1}{2\varepsilon}\left(c_{3}(n\varepsilon)^{t}-\log c_{2}-\log\left(\frac{\lambda n\varepsilon}{2H}\right)\right)

For any t(0,1]t\in(0,1], for sufficiently large nn, this quantity is less than nn. We now lower bound the modulus of continuity by using the fact that it is a non-decreasing function in its second argument:

𝔼[|M(𝒮)x𝒮|]\displaystyle\mathbb{E}[|M(\mathcal{S}^{\prime})-x^{\star}_{\mathcal{S}^{\prime}}|] ω(𝒮;log(1/2γ)2ε)ω(𝒮;c3(nε)tlogc2log(4nε)2ε)\displaystyle\geq\omega\left(\mathcal{S}^{\prime};\frac{\log(1/2\gamma)}{2\varepsilon}\right)\geq\omega\left(\mathcal{S}^{\prime};\frac{c_{3}(n\varepsilon)^{t}-\log c_{2}-\log(4n\varepsilon)}{2\varepsilon}\right)
D2nε[c3(nε)tlogc2log(4nε)].\displaystyle\geq\frac{D}{2n\varepsilon}\left[c_{3}(n\varepsilon)^{t}-\log c_{2}-\log(4n\varepsilon)\right].

This is the desired result; the last inequality comes from another application of Proposition 5 but with c3(nε)tlogc2log(4nε)2ε\frac{c_{3}(n\varepsilon)^{t}-\log c_{2}-\log(4n\varepsilon)}{2\varepsilon} in place of 1/ε1/\varepsilon.

D.2.1 Proof of Proposition 5

Proof Outline
At a high level, starting with a function $f_{\mathcal{S}}$, we first remove an arbitrary set of $1/\varepsilon$ samples to create a function $f_{\mathcal{S}}^{\setminus\varepsilon}$. We then replace the sample functions we removed with $1/\varepsilon$ samples of $\frac{H}{2}(x-D)^{2}$ and bound how far the minimizer of $f_{\mathcal{S}}^{\setminus\varepsilon}+\frac{H}{2n\varepsilon}(x-D)^{2}$ lies from the minimizer of $f_{\mathcal{S}}$. We need several supporting lemmas to complete this proof; we briefly outline how we use them.

  1. We use Lemma D.1 to argue that the minimizers of $f_{\mathcal{S}}^{\setminus\varepsilon}$ are the same as those of $f_{\mathcal{S}}$.

  2. We use Lemma D.2 to control the growth of $f_{\mathcal{S}}^{\setminus\varepsilon}$.

  3. We use Lemma D.3, Lemma D.4, and Lemma D.5 to lower bound how far the minimizer of $f_{\mathcal{S}}^{\setminus\varepsilon}+\frac{H}{2n\varepsilon}(x-D)^{2}$ has moved from the minimizer of $f_{\mathcal{S}}$.

  4. We use Lemma D.6 to upper bound how far the minimizer of $f_{\mathcal{S}}^{\setminus\varepsilon}+\frac{H}{2n\varepsilon}(x-D)^{2}$ has moved from the minimizer of $f_{\mathcal{S}}$.

 
We now formally introduce the several supporting lemmas which will aid our proof of Proposition 5. The first ensures that the minimizing set does not change upon the removal of a constant number of samples.

Lemma D.1.

Assume that infx𝒳F(x;s)=0\inf_{x\in\mathcal{X}}F(x;s)=0 for all sΩs\in\Omega. Suppose f𝒮f_{\mathcal{S}} satisfies Definition 2.3 and has λ\lambda-quadratic growth. Let 𝒳argminx𝒳f𝒮(x)\mathcal{X}^{\star}\coloneqq\mathop{\rm argmin}_{x\in\mathcal{X}}f_{\mathcal{S}}(x). Let 𝒮ε𝒮\mathcal{S}_{\varepsilon}\subset\mathcal{S} consist of any (constant not scaling with nn) 1/ε>01/\varepsilon>0 data points. Then, for f𝒮ε1ns𝒮𝒮εF(x;s)f_{\mathcal{S}}^{\setminus\varepsilon}\coloneqq\frac{1}{n}\sum_{s\in\mathcal{S}\setminus\mathcal{S}_{\varepsilon}}F(x;s) we have that 𝒳εargminx𝒳f𝒮ε(x)=𝒳\mathcal{X}_{\setminus\varepsilon}^{\star}\coloneqq\mathop{\rm argmin}_{x\in\mathcal{X}}f_{\mathcal{S}}^{\setminus\varepsilon}(x)=\mathcal{X}^{\star}.

Proof    Suppose for the sake of contradiction that $\mathcal{X}^{\star}\neq\mathcal{X}_{\setminus\varepsilon}^{\star}$. Since $f_{\mathcal{S}}$ is an interpolation problem, the removal of samples can only enlarge the minimizing set, so $\mathcal{X}_{\setminus\varepsilon}^{\star}\setminus\mathcal{X}^{\star}\neq\emptyset$. There exist at most $1/\varepsilon$ points in $\mathcal{S}$ that have non-zero error on $\mathcal{X}_{\setminus\varepsilon}^{\star}\setminus\mathcal{X}^{\star}$. However, by smoothness of each sample function (and the fact that $f(x^{\star})=0$ and $f^{\prime}(x^{\star})=0$ by construction), we have that for $x\in[a,b]$

f𝒮(x)Hnεdist(x,𝒳)2.\displaystyle f_{\mathcal{S}}(x)\leq\frac{H}{n\varepsilon}\mathop{\rm dist}(x,\mathcal{X}^{\star})^{2}.

Since limnHnε=0\lim_{n\to\infty}\frac{H}{n\varepsilon}=0, this contradicts λ\lambda-quadratic growth. ∎

This second lemma ensures that deleting a constant number of samples does not affect the growth or strong convexity of the population function by too much.

Lemma D.2.

Assume that infx𝒳F(x;s)=0\inf_{x\in\mathcal{X}}F(x;s)=0 for all sΩs\in\Omega. Suppose f𝒮f_{\mathcal{S}} satisfies Definition 2.3 and has λ\lambda-quadratic growth (respectively λ\lambda-strong convexity). Let f𝒮εf_{\mathcal{S}}^{\setminus\varepsilon} be defined as in Lemma D.1. Then f𝒮εf_{\mathcal{S}}^{\setminus\varepsilon} has γ\gamma-quadratic growth (respectively γ\gamma-strong convexity) for any γλHnε\gamma\leq\lambda-\frac{H}{n\varepsilon}.

Proof    By Lemma D.1, the minimizing set of $f_{\mathcal{S}}^{\setminus\varepsilon}$ is the same as that of $f_{\mathcal{S}}$. Suppose for the sake of contradiction that $f_{\mathcal{S}}^{\setminus\varepsilon}$ does not have $\gamma$-quadratic growth. Then there must exist $x_{1}$ such that

f𝒮ε(x1)f𝒮ε(x)<γ2x1x22.\displaystyle f_{\mathcal{S}}^{\setminus\varepsilon}(x_{1})-f_{\mathcal{S}}^{\setminus\varepsilon}(x^{\star})<\frac{\gamma}{2}\left\|{x_{1}-x^{\star}}\right\|_{2}^{2}.

By smoothness and growth we have

H2nεx1x22+γ2x1x22>f𝒮(x1)f𝒮(x)λ2x1x22.\displaystyle\frac{H}{2n\varepsilon}\left\|{x_{1}-x^{\star}}\right\|_{2}^{2}+\frac{\gamma}{2}\left\|{x_{1}-x^{\star}}\right\|_{2}^{2}>f_{\mathcal{S}}(x_{1})-f_{\mathcal{S}}(x^{\star})\geq\frac{\lambda}{2}\left\|{x_{1}-x^{\star}}\right\|_{2}^{2}.

This implies that γ>λHnε\gamma>\lambda-\frac{H}{n\varepsilon}, a contradiction.

Suppose for the sake of contradiction that f𝒮εf_{\mathcal{S}}^{\setminus\varepsilon} does not have γ\gamma-strong convexity. Then there must exist x1x_{1} and x2x_{2} such that

\[
f_{\mathcal{S}}^{\setminus\varepsilon}(x_{1})-f_{\mathcal{S}}^{\setminus\varepsilon}(x_{2})<\frac{\gamma}{2}\left\|x_{1}-x_{2}\right\|_{2}^{2}+\langle\nabla f_{\mathcal{S}}^{\setminus\varepsilon}(x_{2}),x_{1}-x_{2}\rangle.
\]

By smoothness and strong convexity we have

H2nεx1x222+γ2x1x222+f𝒮(x2),x1x2>f𝒮(x1)f𝒮(x2)λ2x1x222+f𝒮(x2),x1x2.\displaystyle\frac{H}{2n\varepsilon}\left\|{x_{1}-x_{2}}\right\|_{2}^{2}+\frac{\gamma}{2}\left\|{x_{1}-x_{2}}\right\|_{2}^{2}+\langle\nabla f_{\mathcal{S}}(x_{2}),x_{1}-x_{2}\rangle>f_{\mathcal{S}}(x_{1})-f_{\mathcal{S}}(x_{2})\geq\frac{\lambda}{2}\left\|{x_{1}-x_{2}}\right\|_{2}^{2}+\langle\nabla f_{\mathcal{S}}(x_{2}),x_{1}-x_{2}\rangle.

However, this implies that γ>λHnε\gamma>\lambda-\frac{H}{n\varepsilon} which is a contradiction. ∎
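For intuition, the curvature loss in Lemma D.2 can be checked on a toy one-dimensional example. The sketch below assumes quadratic sample losses $F(x;s)=\frac{s}{2}x^{2}$ with $s\in[1,H]$ (an illustrative choice, not the general setting of the lemma) and removes the $k=1/\varepsilon$ samples with the largest curvature, which is the worst case.

```python
def curvature_after_removal(curvatures, removed):
    """Curvature of (1/n) * sum over kept samples of (s/2) x^2 (still normalized by n)."""
    n = len(curvatures)
    return sum(c for i, c in enumerate(curvatures) if i not in removed) / n

H, n, k = 2.0, 50, 5                       # k plays the role of 1/eps
curvatures = [1.0 + (i % 11) / 10.0 for i in range(n)]    # each value lies in [1, H]
lam = sum(curvatures) / n                  # f_S is lam-strongly convex
worst = sorted(range(n), key=lambda i: -curvatures[i])[:k]
gamma = curvature_after_removal(curvatures, set(worst))
print(lam, gamma, lam - H * k / n)
```

The remaining curvature `gamma` never drops below `lam - H * k / n`, matching the bound $\gamma\leq\lambda-\frac{H}{n\varepsilon}$ with $k=1/\varepsilon$.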

The next lemma is a standard result on the closure under addition of strongly convex functions.

Lemma D.3.

Let functions h1h_{1} and h2h_{2} be λ\lambda and γ\gamma strongly convex respectively, then h1+h2h_{1}+h_{2} is λ+γ\lambda+\gamma strongly convex.

This lemma provides some growth conditions on the gradient under smoothness, strong convexity and quadratic growth.

Lemma D.4.

Let g:𝒳+g:\mathcal{X}\rightarrow\mathbb{R}_{+} be a convex function with 𝒳=argminx𝒳g(x)\mathcal{X}^{\star}=\mathop{\rm argmin}_{x\in\mathcal{X}}g(x) such that for x𝒳x^{\star}\in\mathcal{X}^{\star}, g(x)=0g(x^{\star})=0. Suppose gg has λ\lambda-quadratic growth, then

|g(x)|λ2dist(x,𝒳).\displaystyle|g^{\prime}(x)|\geq\frac{\lambda}{2}\mathop{\rm dist}(x,\mathcal{X}^{\star}).

If instead gg has λ\lambda-strong convexity, then

|g(x)|λdist(x,𝒳).\displaystyle|g^{\prime}(x)|\geq\lambda\mathop{\rm dist}(x,\mathcal{X}^{\star}).

Alternatively, suppose gg has HH-smoothness, then

|g(x)|Hdist(x,𝒳).\displaystyle|g^{\prime}(x)|\leq H\mathop{\rm dist}(x,\mathcal{X}^{\star}).

Proof    We note that by first order optimality conditions, for all x𝒳x^{\star}\in\mathcal{X}^{\star}, g(x)=0\nabla g(x^{\star})=0. To prove the first inequality, we have that for any x𝒳x^{\star}\in\mathcal{X}^{\star}, the following is true:

λ2dist(x,𝒳)2g(x)g|g(x)||xx|.\displaystyle\frac{\lambda}{2}\mathop{\rm dist}(x,\mathcal{X}^{\star})^{2}\leq g(x)-g^{\star}\leq|g^{\prime}(x)||x-x^{\star}|.

In particular, minimizing over xx^{\star} on the right hand side and rearranging gives the desired result. To prove the second result, we know that by strong convexity for any x𝒳x^{\star}\in\mathcal{X}^{\star}

|g(x)|=|g(x)g(x)|λ|xx|.\displaystyle|g^{\prime}(x)|=|g^{\prime}(x)-g^{\prime}(x^{\star})|\geq\lambda|x-x^{\star}|.

To prove the last result, we know that by smoothness for any x𝒳x^{\star}\in\mathcal{X}^{\star}

|g(x)|=|g(x)g(x)|H|xx|.\displaystyle|g^{\prime}(x)|=|g^{\prime}(x)-g^{\prime}(x^{\star})|\leq H|x-x^{\star}|.

Minimizing over xx^{\star} on the right hand side gives the desired result. ∎

This lemma controls how much the minimizers of a function can change if another function is added. This will directly be useful in lower bounding the modulus of continuity.

Lemma D.5.

Suppose h:[D,D]+h:[-D,D]\to\mathbb{R}_{+} and g:[D,D]+g:[-D,D]\to\mathbb{R}_{+}. Let xhx_{h}^{\star} be the largest minimizer of hh and xgx_{g}^{\star} be the smallest minimizer of gg, and assume that xhxgx_{h}^{\star}\leq x_{g}^{\star}. Let xx^{\star} be any minimizer of h+gh+g. Assume that h(xh)=0h(x_{h}^{\star})=0 and g(xg)=0g(x_{g}^{\star})=0. If hh has λh\lambda_{h}-quadratic growth and gg is HgH_{g}-smooth, then

xxhHg(xgxh)λh2+Hg.\displaystyle x^{\star}-x_{h}^{\star}\leq\frac{H_{g}(x_{g}^{\star}-x_{h}^{\star})}{\frac{\lambda_{h}}{2}+H_{g}}.

If hh is HhH_{h}-smooth and gg has λg\lambda_{g}-quadratic growth, then

λg2(xgxh)λg2+Hhxxh.\displaystyle\frac{\frac{\lambda_{g}}{2}(x_{g}^{\star}-x_{h}^{\star})}{\frac{\lambda_{g}}{2}+H_{h}}\leq x^{\star}-x_{h}^{\star}.

The same relation holds with λg/2\lambda_{g}/2 and λh/2\lambda_{h}/2 replaced with λg\lambda_{g} and λh\lambda_{h} respectively if the above statement is modified such that gg and hh are λg\lambda_{g} and λh\lambda_{h} strongly convex instead.

Proof    If xhDx_{h}^{\star}\neq D, then the first order condition for optimality implies

h(xh)+g(xh)=g(xh)<0andh(xg)+g(xg)=h(xg)>0.\displaystyle h^{\prime}(x_{h}^{\star})+g^{\prime}(x_{h}^{\star})=g^{\prime}(x_{h}^{\star})<0\quad\text{and}\quad h^{\prime}(x_{g}^{\star})+g^{\prime}(x_{g}^{\star})=h^{\prime}(x_{g}^{\star})>0.

Thus, $x^{\star}\in(x_{h}^{\star},x_{g}^{\star})$. By the monotonicity of the first derivative of convex functions, for $x^{\star}\in(x_{h}^{\star},x_{g}^{\star})$ we have $g^{\prime}(x^{\star})<0$ and $h^{\prime}(x^{\star})>0$. Combining this with Lemma D.4, we get

λh2(xxh)h(x)Hh(xxh)\displaystyle\frac{\lambda_{h}}{2}(x^{\star}-x_{h}^{\star})\leq h^{\prime}(x^{\star})\leq H_{h}(x^{\star}-x_{h}^{\star})
Hg(xxg)g(x)λg2(xxg).\displaystyle H_{g}(x^{\star}-x_{g}^{\star})\leq g^{\prime}(x^{\star})\leq\frac{\lambda_{g}}{2}(x^{\star}-x_{g}^{\star}).

Combining these facts gives

\[
\frac{\lambda_{h}}{2}(x^{\star}-x_{h}^{\star})+H_{g}(x^{\star}-x_{g}^{\star})\leq h^{\prime}(x^{\star})+g^{\prime}(x^{\star})=0\leq H_{h}(x^{\star}-x_{h}^{\star})+\frac{\lambda_{g}}{2}(x^{\star}-x_{g}^{\star}).
\]

Rearranging these two inequalities gives the desired result. We note that the lower bound only requires that hh is HhH_{h}-smooth and gg has λg\lambda_{g}-quadratic growth, and the upper bound only requires hh has λh\lambda_{h}-quadratic growth and gg is HgH_{g}-smooth. The last statement about strong convexity follows from the same reasoning, except using the strong convexity inequality in Lemma D.4 instead of the quadratic growth inequality. ∎
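The two bounds of Lemma D.5 can be sanity-checked on a pair of one-dimensional quadratics. In this sketch (all values are illustrative assumptions), $h(x)=\frac{a}{2}(x-x_{h})^{2}$ and $g(x)=\frac{b}{2}(x-x_{g})^{2}$, so $h$ is $a$-smooth with $a$-quadratic growth and $g$ is $b$-smooth with $b$-quadratic growth; the minimizer of $h+g$ is located by bisection on the derivative.

```python
def argmin_by_bisection(deriv, lo, hi, iters=200):
    """Locate the root of an increasing derivative on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if deriv(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

a, b = 3.0, 0.5              # curvatures of h and g
x_h, x_g = -0.4, 1.0         # minimizers of h and g, with x_h <= x_g
deriv = lambda x: a * (x - x_h) + b * (x - x_g)   # derivative of h + g

x_star = argmin_by_bisection(deriv, x_h, x_g)
shift = x_star - x_h
gap = x_g - x_h
lower = (b / 2) * gap / (b / 2 + a)    # quadratic-growth lower bound
upper = b * gap / (a / 2 + b)          # quadratic-growth upper bound
exact = b * gap / (a + b)              # strong-convexity version, tight for quadratics
print(lower, shift, upper)
```

For quadratics the strongly convex form of the lemma holds with equality, while the quadratic-growth form gives a strictly looser sandwich around the same shift.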

The following lemma is a slight modification of Claim 6.1 from [25] and will be helpful for us to upper bound the modulus of continuity.

Lemma D.6.

Let 𝒮\mathcal{S}^{\prime} consist of nn data points where |𝒮𝒮|=k|\mathcal{S}\triangle\mathcal{S}^{\prime}|=k. Suppose that f𝒮f_{\mathcal{S}} is λ\lambda-strongly convex and satisfies Definition 2.3. Assume the sample function F:𝒳×Ω+F:\mathcal{X}\times\Omega\to\mathbb{R}_{+} is LL-Lipschitz in its first argument and that infx𝒳F(x;s)=0\inf_{x\in\mathcal{X}}F(x;s)=0 for all sΩs\in\Omega. For x𝒮argminx𝒳f𝒮(x)x_{\mathcal{S}}\in\mathop{\rm argmin}_{x\in\mathcal{X}}f_{\mathcal{S}}(x) and x𝒮argminx𝒳f𝒮(x)x_{\mathcal{S}^{\prime}}\in\mathop{\rm argmin}_{x\in\mathcal{X}}f_{\mathcal{S}^{\prime}}(x), we have that

x𝒮x𝒮24kLλn.\displaystyle\left\|{x_{\mathcal{S}}-x_{\mathcal{S}^{\prime}}}\right\|_{2}\leq\frac{4kL}{\lambda n}.

Proof    By strong convexity, we have that

f𝒮(x𝒮)f𝒮(x𝒮)λ2x𝒮x𝒮22,\displaystyle f_{\mathcal{S}}(x_{\mathcal{S}^{\prime}})-f_{\mathcal{S}}(x_{\mathcal{S}})\geq\frac{\lambda}{2}\left\|{x_{\mathcal{S}^{\prime}}-x_{\mathcal{S}}}\right\|_{2}^{2},

since by first order optimality conditions, we know that f𝒮(x𝒮)=0\nabla f_{\mathcal{S}}(x_{\mathcal{S}})=0 as a consequence of Definition 2.3. We also have

f𝒮(x𝒮)f𝒮(x𝒮)\displaystyle f_{\mathcal{S}}(x_{\mathcal{S}^{\prime}})-f_{\mathcal{S}}(x_{\mathcal{S}}) =1ns𝒮𝒮[F(x𝒮;s)F(x𝒮;s)]+1ns𝒮𝒮[F(x𝒮;s)F(x𝒮;s)]\displaystyle=\frac{1}{n}\sum_{s\in\mathcal{S}\setminus\mathcal{S}^{\prime}}\left[F(x_{\mathcal{S}^{\prime}};s)-F(x_{\mathcal{S}};s)\right]+\frac{1}{n}\sum_{s\in\mathcal{S}\cap\mathcal{S}^{\prime}}\left[F(x_{\mathcal{S}^{\prime}};s)-F(x_{\mathcal{S}};s)\right]
=1ns𝒮𝒮[F(x𝒮;s)F(x𝒮;s)]1ns𝒮𝒮[F(x𝒮;s)F(x𝒮;s)]+f𝒮(x𝒮)f𝒮(x𝒮)\displaystyle=\frac{1}{n}\sum_{s\in\mathcal{S}\setminus\mathcal{S}^{\prime}}\left[F(x_{\mathcal{S}^{\prime}};s)-F(x_{\mathcal{S}};s)\right]-\frac{1}{n}\sum_{s\in\mathcal{S}^{\prime}\setminus\mathcal{S}}\left[F(x_{\mathcal{S}^{\prime}};s)-F(x_{\mathcal{S}};s)\right]+f_{\mathcal{S}^{\prime}}(x_{\mathcal{S}^{\prime}})-f_{\mathcal{S}^{\prime}}(x_{\mathcal{S}})
2kLnx𝒮x𝒮2,\displaystyle\leq\frac{2kL}{n}\left\|{x_{\mathcal{S}^{\prime}}-x_{\mathcal{S}}}\right\|_{2},

where the last inequality comes from the Lipschitzness of FF and that x𝒮argminx𝒳f𝒮(x)x_{\mathcal{S}^{\prime}}\in\mathop{\rm argmin}_{x\in\mathcal{X}}f_{\mathcal{S}^{\prime}}(x). ∎

Armed with these supporting lemmas, we can now bound the modulus of continuity. Following the steps of the proof outline, let $x_{0}^{\star}$ be the largest minimizer of $f_{\mathcal{S}}$. Without loss of generality, we assume that $x_{0}^{\star}\leq 0$; if $x_{0}^{\star}>0$, by symmetry it suffices to replace $\frac{H}{2}(x-D)^{2}$ with $\frac{H}{2}(x+D)^{2}$ in the following proof. By Lemma D.1, $f_{\mathcal{S}}^{\setminus\varepsilon}$ has the same minimizing set as $f_{\mathcal{S}}$. By Lemma D.2, $f_{\mathcal{S}}^{\setminus\varepsilon}$ is $\left(\lambda-\frac{H}{n\varepsilon}\right)$-strongly convex. We replace the $1/\varepsilon$ removed data points with samples that have the loss function $\frac{H}{2}(x-D)^{2}$, which clearly satisfies the desired Lipschitz condition. Our constructed non-interpolation population function is

f𝒮ε(x)+H2nε(xD)2,\displaystyle f_{\mathcal{S}}^{\setminus\varepsilon}(x)+\frac{H}{2n\varepsilon}(x-D)^{2},

which is λ\lambda-strongly convex by Lemma D.2 and Lemma D.3 and is 2HD2HD-Lipschitz. This means that the 𝒮\mathcal{S}^{\prime} this function corresponds to belongs to 𝔖λL(F)\mathfrak{S}_{\lambda}^{L}(F). Let xx^{\star} be the minimizer of f𝒮ε(x)+H2nε(xD)2f_{\mathcal{S}}^{\setminus\varepsilon}(x)+\frac{H}{2n\varepsilon}(x-D)^{2}.

By the triangle inequality, $f_{\mathcal{S}}^{\setminus\varepsilon}$ is $\left(\frac{n-1/\varepsilon}{n}\right)H$-smooth, and $\frac{H}{2n\varepsilon}(x-D)^{2}$ is $\frac{H}{n\varepsilon}$-strongly convex. Thus, by Lemma D.5 with $h(x)\coloneqq f_{\mathcal{S}}^{\setminus\varepsilon}(x)$ and $g(x)\coloneqq\frac{H}{2n\varepsilon}(x-D)^{2}$, we have that

|xx0|=xx0Hnε(Dx0)(n1/εn)H+Hnε=Dx0nε=D+|x0|nεDnε.\displaystyle|x^{\star}-x_{0}^{\star}|=x^{\star}-x_{0}^{\star}\geq\frac{\frac{H}{n\varepsilon}(D-x_{0}^{\star})}{\left(\frac{n-1/\varepsilon}{n}\right)H+\frac{H}{n\varepsilon}}=\frac{D-x_{0}^{\star}}{n\varepsilon}=\frac{D+|x_{0}^{\star}|}{n\varepsilon}\geq\frac{D}{n\varepsilon}.

Here, implicitly, we are using the fact that $x_{0}^{\star}$ is also a minimizer of $f_{\mathcal{S}}^{\setminus\varepsilon}$ by Lemma D.1. This completes the proof of the lower bound.

The upper bound follows from Lemma D.6 with k=1/εk=1/\varepsilon and L=2HDL=2HD.
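Both sides of Proposition 5 can be illustrated on a toy one-dimensional instance. The sketch below assumes sample losses $F(x;s)=\frac{s}{2}x^{2}$ on $[-D,D]$ with $s\in[\lambda,H]$, so the interpolating minimizer is $x_{0}^{\star}=0$; these losses and all numerical values are illustrative assumptions. Replacing $k=1/\varepsilon$ samples by $\frac{H}{2}(x-D)^{2}$ moves the minimizer by an amount that can be computed in closed form.

```python
def shifted_minimizer(curvatures, k, H, D):
    """Minimizer after replacing the first k samples (s/2) x^2 by (H/2) (x - D)^2."""
    kept = sum(curvatures[k:])
    # Root of the first-order condition: kept * x + k * H * (x - D) = 0.
    return k * H * D / (kept + k * H)

H, lam, D, n, eps = 2.0, 1.0, 1.0, 200, 0.1
k = round(1 / eps)                                   # number of replaced samples
curvatures = [lam + (H - lam) * (i % 5) / 4 for i in range(n)]   # values in [lam, H]

shift = shifted_minimizer(curvatures, k, H, D)       # distance moved from x_0 = 0
lower = D / (n * eps)
upper = 8 * H * D / (lam * n * eps)
print(lower, shift, upper)
```

In this instance the shift lies between $\frac{D}{n\varepsilon}$ and $\frac{8HD}{\lambda n\varepsilon}$, consistent with the two bounds on the modulus of continuity.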