
Adapting to Function Difficulty and
Growth Conditions in Private Optimization

Hilal Asi         Daniel Levy         John C. Duchi
{asi,danilevy,jduchi}@stanford.edu
Equal contribution. Author order determined by coin toss.
Abstract

We develop algorithms for private stochastic convex optimization that adapt to the hardness of the specific function we wish to optimize. While previous work provides worst-case bounds for arbitrary convex functions, the function at hand often belongs to a smaller class that enjoys faster rates. Concretely, we show that for functions exhibiting \kappa-growth around the optimum, i.e., f(x)\geq f(x^{\star})+\lambda\kappa^{-1}\|x-x^{\star}\|_{2}^{\kappa} for \kappa>1, our algorithms improve upon the standard \sqrt{d}/n\varepsilon privacy rate to the faster (\sqrt{d}/n\varepsilon)^{\kappa/(\kappa-1)}. Crucially, they achieve these rates without knowledge of the growth constant \kappa of the function. Our algorithms build upon the inverse sensitivity mechanism, which adapts to instance difficulty [2], and recent localization techniques in private optimization [25]. We complement our algorithms with matching lower bounds for these function classes and demonstrate that our adaptive algorithm is simultaneously (minimax) optimal over all \kappa\geq 1+c whenever c=\Theta(1).

1 Introduction

Stochastic convex optimization (SCO) is a central problem in machine learning and statistics, where for a sample space \mathbb{S}, parameter space \mathcal{X}\subset\mathbb{R}^{d}, and a collection of convex losses \{F(\cdot;s):s\in\mathbb{S}\}, one wishes to solve

\mathop{\rm minimize}_{x\in\mathcal{X}}\;f(x)\coloneqq\mathbb{E}_{S\sim P}\left[F(x;S)\right]=\int_{\mathbb{S}}F(x;s)\,\mathrm{d}P(s) (1)

using an observed dataset \mathcal{S}=S_{1}^{n}\stackrel{\rm iid}{\sim}P. While the problem as formulated is by now fairly well understood [12, 38, 29, 10, 37], it is becoming clear that, because of considerations beyond pure statistical accuracy (memory or communication costs [45, 26, 13], fairness [23, 28], personalization or distributed learning [35]), problem (1) is simply insufficient to address modern learning problems. To that end, researchers have revisited SCO under the additional constraint that the solution preserve the privacy of the provided sample [22, 21, 1, 16, 19]. A waypoint is Bassily et al. [7], who provide a private method with optimal convergence rates for the related empirical risk minimization problem, with recent papers focusing on SCO and providing (worst-case) optimal rates in various settings: smooth convex functions [8, 25], non-smooth functions [9], non-Euclidean geometry [5, 4], and more stringent privacy constraints [34].

Yet these works ground their analyses in worst-case scenarios and provide guarantees only for the hardest instances of the class of problems they consider. In turn, they argue that their algorithms are optimal in a minimax sense: for any algorithm, there exists a hard instance on which the error the algorithm achieves matches the upper bound. While valuable, these results are pessimistic (the exhibited hard instances are typically pathological) and fail to reflect achievable performance.

In this work, we consider the problem of adaptivity when solving (1) under privacy constraints. Importantly, we wish to provide private algorithms that adapt to the hardness of the objective f. A loss function f may belong to multiple problem classes, each exhibiting different achievable rates, so a natural desideratum is to attain the error rate of the easiest sub-class. As a simple vignette, if one gets an arbitrary 1-Lipschitz convex loss function f, the worst-case guarantee of any \varepsilon-DP algorithm is \Theta(1/\sqrt{n}+d/(n\varepsilon)). However, if one learns that f exhibits some growth property (say, f is 1-strongly convex), the guarantee improves to the faster \Theta(1/n+(d/(n\varepsilon))^{2}) rate with the appropriate algorithm. It is thus important to provide algorithms that achieve the rates of the "easiest" class to which the function belongs [32, 46, 18].

To that end, consider the nested classes of functions \mathcal{F}^{\kappa} for \kappa\in[1,\infty] such that, if f\in\mathcal{F}^{\kappa}, then there exists \lambda>0 such that for all x\in\mathcal{X},

f(x)-\inf_{x^{\prime}\in\mathcal{X}}f(x^{\prime})\geq\frac{\lambda}{\kappa}\|x-x^{\star}\|_{2}^{\kappa}.

For example, strong convexity implies growth with parameter \kappa=2. This growth assumption closely relates to uniform convexity [32] and the Polyak-Kurdyka-Łojasiewicz inequality [11], and we make these connections precise in Section 2. Intuitively, smaller \kappa makes the function much easier to optimize: the function grows quickly away from the optimal point. Objectives with growth are widespread in machine learning applications: among others, the \ell_{1}-regularized hinge loss exhibits sharp growth (i.e., \kappa=1), while \ell_{1}- or \ell_{\infty}-constrained \kappa-norm regression (i.e., s=(a,b)\in\mathbb{R}^{d}\times\mathbb{R} and F(x;s)=\lvert b-\langle a,x\rangle\rvert^{\kappa}) has \kappa-growth for any integer \kappa\geq 2 [43]. In this work, we provide private algorithms that adapt to the actual growth of the function at hand.
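As a quick numerical illustration (a standalone sketch of ours, not part of the paper; the probing routine and constants are our own), one can empirically probe the growth condition above. For instance, the \kappa-norm function x\mapsto\kappa^{-1}\|x\|_{2}^{\kappa} has (1,\kappa)-growth around its minimizer, but fails 2-growth with \lambda=1 near the optimum because it is too flat there:

```python
import numpy as np

def has_kappa_growth(f, x_star, kappa, lam, radius=1.0, n_probe=1000, seed=0):
    """Empirically probe the growth condition
    f(x) - f(x*) >= (lam / kappa) * ||x - x*||_2^kappa
    at random points within `radius` of the minimizer x_star."""
    rng = np.random.default_rng(seed)
    d = x_star.shape[0]
    for _ in range(n_probe):
        u = rng.normal(size=d)
        x = x_star + radius * rng.uniform() * u / np.linalg.norm(u)
        gap = f(x) - f(x_star)
        if gap < (lam / kappa) * np.linalg.norm(x - x_star) ** kappa - 1e-12:
            return False
    return True

# f(x) = ||x||_2^4 / 4 has (1, 4)-growth at x* = 0, but no 2-growth
# with lam = 1 near the optimum (it is too flat around x*).
f = lambda x: np.linalg.norm(x) ** 4.0 / 4.0
print(has_kappa_growth(f, np.zeros(3), kappa=4.0, lam=1.0))  # → True
print(has_kappa_growth(f, np.zeros(3), kappa=2.0, lam=1.0))  # → False
```

This is only a finite-sample check, of course; it can refute a growth claim on the probed region but not certify it globally.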

We begin our analysis by examining Asi and Duchi's inverse sensitivity mechanism [2] on ERM as a motivation. While not a practical algorithm, it achieves instance-optimal rates for any one-dimensional function under mild assumptions, quantifying the best bound one could hope to achieve with an adaptive algorithm and showing (in principle) that adaptive private algorithms can exist. We first show that for any function with \kappa-growth, the inverse sensitivity mechanism achieves privacy cost (d/(n\varepsilon))^{\kappa/(\kappa-1)}; importantly, without knowledge of the function class \mathcal{F}^{\kappa} that f belongs to. This grounds and motivates our work in three ways: (i) it validates our choice of sub-classes \mathcal{F}^{\kappa}, as the privacy rate is effectively controlled by the value of \kappa; (ii) it exhibits the rate we wish to achieve with efficient algorithms on \mathcal{F}^{\kappa}; and (iii) it showcases that for easier functions, privacy costs shrink significantly: for \kappa=5/4, the privacy rate becomes (d/(n\varepsilon))^{5}.
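To make the exponent arithmetic concrete (a trivial check of ours, with an illustrative value of d/(n\varepsilon)): \kappa=5/4 gives exponent (5/4)/(1/4)=5, and the exponent decreases toward 1 as \kappa grows, so sharper growth means a much smaller privacy cost:

```python
# Exponent arithmetic from the privacy rate (d/(n*eps))^{kappa/(kappa-1)}:
# kappa = 5/4 gives exponent (5/4)/(1/4) = 5, and the exponent decreases
# toward 1 as kappa grows, so sharper growth means a much smaller cost.
d_over_n_eps = 1e-2  # illustrative value of d/(n * eps)
for kappa in (5 / 4, 2.0, 4.0, 100.0):
    exponent = kappa / (kappa - 1)
    print(f"kappa={kappa}: exponent={exponent:.3f}, cost={d_over_n_eps ** exponent:.2e}")
```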

We continue our treatment of problem (1) under growth in Section 4 and develop practical algorithms that achieve the rates of the inverse sensitivity mechanism. Moreover, for approximate (\varepsilon,\delta)-differential privacy, our algorithms improve the rates, achieving roughly (\sqrt{d}/(n\varepsilon))^{\kappa/(\kappa-1)}. Our algorithms hinge on a reduction to SCO: we show that by solving a sequence of increasingly constrained SCO problems, one achieves the right rate whenever the function exhibits growth at the optimum. Importantly, our algorithm only requires a lower bound \underline{\kappa}\leq\kappa (where \kappa is the actual growth of f).

We provide optimality guarantees for our algorithms in Section 5 and show that both the inverse sensitivity mechanism and the efficient algorithms of Section 4 are simultaneously minimax optimal over all classes \mathcal{F}^{\kappa} whenever \kappa=1+\Theta(1) and d=1 for \varepsilon-DP algorithms. Finally, we prove that in arbitrary dimension, for both pure- and approximate-DP constraints, our algorithms are also simultaneously optimal for all classes \mathcal{F}^{\kappa} with \kappa\geq 2.

On the way, we provide results that may be of independent interest to the community. First, we develop optimal algorithms for SCO under pure differential privacy constraints, which, to the best of our knowledge, do not exist in the literature. Second, our algorithms and analysis provide high-probability bounds on the loss, whereas existing results only provide (weaker) bounds on the expected loss. Finally, we complete the results of Ramdas and Singh [40] on (non-private) optimization lower bounds for functions with \kappa-growth by providing information-theoretic lower bounds (in contrast to oracle-based lower bounds that rely on observing only gradient information) and capturing the optimal dependence on all problem parameters (namely d, L and \lambda).

1.1 Related work

Convex optimization is one of the best-studied problems in private data analysis [16, 19, 41, 7]. The first papers in this line of work mainly study minimizing the empirical loss and readily establish that the (minimax) optimal privacy rates are d/n\varepsilon for pure \varepsilon-DP and \sqrt{d\log(1/\delta)}/n\varepsilon for (\varepsilon,\delta)-DP [16, 7]. More recently, several works instead consider the harder problem of privately minimizing the population loss [8, 25]. These papers introduce new algorithmic techniques to obtain the worst-case optimal rate of 1/\sqrt{n}+\sqrt{d\log(1/\delta)}/n\varepsilon for (\varepsilon,\delta)-DP. They also show how to improve this rate to the faster 1/n+d\log(1/\delta)/(n\varepsilon)^{2} in the case of 1-strongly convex functions. Our work subsumes both of these results, as they correspond to \kappa=\infty and \kappa=2 respectively. To the best of our knowledge, no work in private optimization investigates the rates under general \kappa-growth assumptions or adaptivity to such conditions.

In contrast, the optimization community has extensively studied growth assumptions [40, 32, 15], showing that on these problems carefully crafted algorithms improve upon the standard 1/\sqrt{n} rate for convex functions to the faster (1/\sqrt{n})^{\kappa/(\kappa-1)}. [32] derives worst-case optimal (in the first-order oracle model) gradient algorithms in the uniformly convex case (i.e., \kappa\geq 2) and provides techniques to adapt to the growth \kappa, while [40], drawing connections between growth conditions and active learning, provides upper and lower bounds in the first-order stochastic oracle model. We complete the results of the latter and provide information-theoretic lower bounds that have optimal dependence on d, \lambda and n; their lower bound only holds for \lambda inversely proportional to d^{1/2-1/\kappa} when \kappa\geq 2. Closest to our work is [15], who study instance-optimality via local minimax complexity [14]. For one-dimensional functions, they develop a bisection-based instance-optimal algorithm and show that on individual functions of the form t\mapsto\kappa^{-1}\lvert t\rvert^{\kappa}, the local minimax rate is (1/\sqrt{n})^{\kappa/(\kappa-1)}.

2 Preliminaries

We first provide notation that we use throughout this paper, define useful assumptions and present key definitions in convex analysis and differential privacy.

Notation.

n typically denotes the sample size and d the dimension. Throughout this work, x refers to the optimization variable, \mathcal{X}\subset\mathbb{R}^{d} to the constraint set and s to elements (S when random) of the sample space \mathbb{S}. We usually denote by F:\mathcal{X}\times\mathbb{S}\to\mathbb{R} the (convex) loss function and, for a dataset \mathcal{S}=(s_{1},\ldots,s_{n})\subset\mathbb{S}, we define the empirical and population losses

f_{\mathcal{S}}(x)\coloneqq\frac{1}{n}\sum_{i\leq n}F(x;s_{i})\quad\mbox{and}\quad f(x)\coloneqq\mathbb{E}_{S\sim P}\left[F(x;S)\right].

We omit the dependence on P as it is often clear from context. We reserve \varepsilon,\delta\geq 0 for the privacy parameters of Definition 2.1. We always take gradients with respect to the optimization variable x. In the case that F(\cdot;s) is not differentiable at x, we override notation and define \nabla F(x;s)=\mathop{\rm argmin}_{g\in\partial F(x;s)}\|g\|_{2}, where \partial F(x;s) is the subdifferential of F(\cdot;s) at x. We use \mathsf{A} for (potentially random) mechanisms and S_{1}^{n} as a shorthand for (S_{1},\ldots,S_{n}). For p\geq 1, \|\cdot\|_{p} is the standard \ell_{p}-norm, \mathbb{B}_{p}^{d}(R) is the corresponding d-dimensional \ell_{p}-ball of radius R and p^{\star} is the dual of p, i.e., such that 1/p^{\star}+1/p=1. Finally, we define the Hamming distance between datasets d_{\rm Ham}(\mathcal{S},\mathcal{S}^{\prime})\coloneqq\inf_{\sigma\in\mathfrak{S}_{n}}\sum_{i=1}^{n}\mathbf{1}\{s_{i}\neq s^{\prime}_{\sigma(i)}\}, where \mathfrak{S}_{n} is the set of permutations over sets of size n.
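For multisets, the permutation-minimal Hamming distance just defined reduces to n minus the size of the multiset intersection; the following helper (our own sketch, not from the paper) computes it that way:

```python
from collections import Counter

def hamming_distance(S, S_prime):
    """Permutation-minimal Hamming distance between equal-size datasets:
    the fewest positions at which they disagree after optimally reordering
    S_prime, i.e. n minus the size of the multiset intersection."""
    assert len(S) == len(S_prime)
    c1, c2 = Counter(S), Counter(S_prime)
    overlap = sum(min(c1[v], c2[v]) for v in c1)
    return len(S) - overlap

print(hamming_distance((1, 2, 3), (3, 2, 4)))  # → 1
```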

Assumptions.

We first state standard assumptions for solving (1). We assume that \mathcal{X} is a closed, convex domain such that \mathsf{diam}_{2}(\mathcal{X})=\sup_{x,y\in\mathcal{X}}\|x-y\|_{2}\leq D<\infty. Furthermore, we assume that for any s\in\mathbb{S}, F(\cdot;s) is convex and L-Lipschitz with respect to \|\cdot\|_{2}. Central to our work is the following \kappa-growth assumption.

Assumption 1 (\kappa-growth).

Let x^{\star}=\mathop{\rm argmin}_{x\in\mathcal{X}}f(x). For a loss F and distribution P, we say that (F,P) has (\lambda,\kappa)-growth for \kappa\in[1,\infty] and \lambda>0 if the population function satisfies

\mbox{for all~}x\in\mathcal{X},\quad f(x)-f(x^{\star})\geq\frac{\lambda}{\kappa}\|x-x^{\star}\|_{2}^{\kappa}.

In the case where \widehat{P} is the empirical distribution on a finite dataset \mathcal{S}, we refer to (\lambda,\kappa)-growth of (F,\widehat{P}) as \kappa-growth of the empirical function f_{\mathcal{S}}.

Uniform convexity and Kurdyka-Łojasiewicz inequality.

Assumption 1 is closely related to two fundamental notions in convex analysis: uniform convexity and the Kurdyka-Łojasiewicz inequality. Following [39], we say that h:\mathcal{Z}\subset\mathbb{R}^{d}\to\mathbb{R} is (\sigma,\kappa)-uniformly convex with \sigma>0 and \kappa\geq 2 if

\mbox{for all~}x,y\in\mathcal{Z},\quad h(y)\geq h(x)+\langle\nabla h(x),y-x\rangle+\frac{\sigma}{\kappa}\|x-y\|_{2}^{\kappa}.

This immediately implies that (i) sums (and expectations) preserve uniform convexity, and (ii) if f is (\lambda,\kappa)-uniformly convex, then it has (\lambda,\kappa)-growth. This will be useful when constructing hard instances, as it suffices to consider (\lambda,\kappa)-uniformly convex functions, which are generally more convenient to manipulate. Finally, we point out that, in the general case \kappa\geq 1, the literature refers to Assumption 1 as the Kurdyka-Łojasiewicz inequality [11] with, in their notation, \varphi(s)=(\kappa/\lambda)^{1/\kappa}s^{1/\kappa}. Theorem 5-(ii) in [11] shows that, under mild conditions, Assumption 1 implies the following inequality between the error and the gradient norm: for all x\in\mathcal{X},

f(x)-\inf_{x^{\prime}\in\mathcal{X}}f(x^{\prime})\leq\frac{e}{\lambda^{\frac{1}{\kappa-1}}}\left\|\nabla f(x)\right\|_{2}^{\frac{\kappa}{\kappa-1}}. (2)

This is a key result in our analysis of the inverse sensitivity mechanism of Section 3.
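As a sanity check (ours, not from the paper), inequality (2) can be verified numerically for the one-dimensional model function f(x)=(\lambda/\kappa)|x|^{\kappa} with minimizer x^{\star}=0, for which both sides are explicit:

```python
import numpy as np

# Check inequality (2) for the model function f(x) = (lam/kappa) * |x|^kappa
# in one dimension (minimizer x* = 0), where both sides are explicit:
# lhs = (lam/kappa) * |x|^kappa and |f'(x)| = lam * |x|^{kappa-1}.
lam, kappa = 0.5, 3.0
xs = np.linspace(-2.0, 2.0, 401)
lhs = (lam / kappa) * np.abs(xs) ** kappa
grad_norm = lam * np.abs(xs) ** (kappa - 1)
rhs = np.e / lam ** (1 / (kappa - 1)) * grad_norm ** (kappa / (kappa - 1))
print(bool(np.all(lhs <= rhs + 1e-12)))  # → True
```

Indeed, here the right-hand side simplifies to e\lambda|x|^{\kappa}, which dominates (\lambda/\kappa)|x|^{\kappa} since e\geq 1/\kappa.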

Differential privacy.

We begin by recalling the definition of (\varepsilon,\delta)-differential privacy.

Definition 2.1 ([22, 21]).

A randomized algorithm \mathsf{A} is (\varepsilon,\delta)-differentially private ((\varepsilon,\delta)-DP) if, for all datasets \mathcal{S},\mathcal{S}^{\prime}\in\mathbb{S}^{n} that differ in a single data element and for all events \mathcal{O} in the output space of \mathsf{A}, we have

\Pr\left(\mathsf{A}(\mathcal{S})\in\mathcal{O}\right)\leq e^{\varepsilon}\Pr\left(\mathsf{A}(\mathcal{S}^{\prime})\in\mathcal{O}\right)+\delta.

We use the following standard results in differential privacy.

Lemma 2.1 (Composition [20, Thm. 3.16]).

If \mathsf{A}_{1},\dots,\mathsf{A}_{k} are randomized algorithms, each of which is (\varepsilon,\delta)-DP, then their composition (\mathsf{A}_{1}(\mathcal{S}),\dots,\mathsf{A}_{k}(\mathcal{S})) is (k\varepsilon,k\delta)-DP.

Next, we consider the Laplace mechanism. We let Z\sim\mathsf{Lap}_{d}(\sigma) denote a d-dimensional vector Z\in\mathbb{R}^{d} such that Z_{i}\stackrel{\rm iid}{\sim}\mathsf{Lap}(\sigma) for 1\leq i\leq d.

Lemma 2.2 (Laplace mechanism [20, Thm. 3.6]).

Let h:\mathbb{S}^{n}\to\mathbb{R}^{d} have \ell_{1}-sensitivity \Delta, that is, \sup_{\mathcal{S},\mathcal{S}^{\prime}\in\mathbb{S}^{n}:d_{\rm Ham}(\mathcal{S},\mathcal{S}^{\prime})\leq 1}\|h(\mathcal{S})-h(\mathcal{S}^{\prime})\|_{1}\leq\Delta. Then the Laplace mechanism \mathsf{A}(\mathcal{S})=h(\mathcal{S})+\mathsf{Lap}_{d}(\sigma) with \sigma=\Delta/\varepsilon is \varepsilon-DP.

Finally, we need the Gaussian mechanism for (\varepsilon,\delta)-DP.

Lemma 2.3 (Gaussian mechanism [20, Thm. A.1]).

Let h:\mathbb{S}^{n}\to\mathbb{R}^{d} have \ell_{2}-sensitivity \Delta, that is, \sup_{\mathcal{S},\mathcal{S}^{\prime}\in\mathbb{S}^{n}:d_{\rm Ham}(\mathcal{S},\mathcal{S}^{\prime})\leq 1}\|h(\mathcal{S})-h(\mathcal{S}^{\prime})\|_{2}\leq\Delta. Then the Gaussian mechanism \mathsf{A}(\mathcal{S})=h(\mathcal{S})+\mathsf{N}(0,\sigma^{2}I_{d}) with \sigma=2\Delta\log(2/\delta)/\varepsilon is (\varepsilon,\delta)-DP.
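Both mechanisms can be sketched in a few lines. The following is a minimal illustration (ours) using the noise scales of Lemmas 2.2 and 2.3 as stated; the statistic and data in the usage line are hypothetical:

```python
import numpy as np

def laplace_mechanism(h_value, l1_sensitivity, eps, seed=0):
    """Lemma 2.2: add Lap(Delta/eps) noise per coordinate to a statistic
    with l1-sensitivity Delta; the release is eps-DP."""
    rng = np.random.default_rng(seed)
    return h_value + rng.laplace(scale=l1_sensitivity / eps, size=np.shape(h_value))

def gaussian_mechanism(h_value, l2_sensitivity, eps, delta, seed=0):
    """Lemma 2.3: add N(0, sigma^2 I) noise, sigma = 2 * Delta * log(2/delta) / eps,
    to a statistic with l2-sensitivity Delta; the release is (eps, delta)-DP."""
    rng = np.random.default_rng(seed)
    sigma = 2 * l2_sensitivity * np.log(2 / delta) / eps
    return h_value + rng.normal(scale=sigma, size=np.shape(h_value))

# Hypothetical usage: privately release the mean of scalars in [0, 1],
# whose l1-sensitivity to changing one of n points is 1/n.
data = np.random.default_rng(1).uniform(size=1000)
private_mean = laplace_mechanism(data.mean(), l1_sensitivity=1 / len(data), eps=1.0)
```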

Inverse sensitivity mechanism.

Our goal is to design private optimization algorithms that adapt to the difficulty of the underlying function. As a reference point, we turn to the inverse sensitivity mechanism of [2], as it enjoys general instance-optimality guarantees. For a given function h:\mathbb{S}^{n}\to\mathcal{T}\subset\mathbb{R}^{d} that we wish to estimate privately, define the inverse sensitivity at x\in\mathcal{T}

\mathsf{len}_{h}(\mathcal{S};x)=\inf_{\mathcal{S}^{\prime}}\{d_{\rm Ham}(\mathcal{S}^{\prime},\mathcal{S}):h(\mathcal{S}^{\prime})=x\}, (3)

that is, the inverse sensitivity of a target parameter x\in\mathcal{T} at instance \mathcal{S} is the minimal number of samples one needs to change to reach a new instance \mathcal{S}^{\prime} such that h(\mathcal{S}^{\prime})=x. Given this quantity, the inverse sensitivity mechanism samples an output from the following probability density

\pi_{\mathsf{A}_{\mathrm{inv}}(\mathcal{S})}(x)\propto e^{-\varepsilon\mathsf{len}_{h}(\mathcal{S};x)}. (4)

The inverse sensitivity mechanism preserves \varepsilon-DP and enjoys instance-optimality guarantees in general settings [2]. In contrast to (worst-case) minimax optimality guarantees, which measure the performance of the algorithm on the hardest instance, these notions of instance-optimality provide stronger per-instance guarantees.
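To make definitions (3) and (4) concrete, here is a brute-force sketch (entirely ours, and tractable only for tiny discrete instances) that computes \mathsf{len}_{h} by searching Hamming balls of growing radius and then samples from the density (4), taking h to be the sample median:

```python
import itertools
import math
import random
from statistics import median

def inverse_sensitivity(S, target, h, domain):
    """Brute-force len_h(S; target): search Hamming balls of growing radius k
    for a modified dataset S' with h(S') = target."""
    n = len(S)
    for k in range(n + 1):
        for idx in itertools.combinations(range(n), k):
            for vals in itertools.product(domain, repeat=k):
                S2 = list(S)
                for i, v in zip(idx, vals):
                    S2[i] = v
                if h(S2) == target:
                    return k
    return math.inf

def inverse_sensitivity_mechanism(S, h, domain, eps, seed=0):
    """Sample an output x with probability proportional to exp(-eps * len_h(S; x))."""
    rng = random.Random(seed)
    weights = [math.exp(-eps * inverse_sensitivity(S, t, h, domain)) for t in domain]
    return rng.choices(list(domain), weights=weights)[0]

S = [1, 2, 2, 3, 9]
x_out = inverse_sensitivity_mechanism(S, h=median, domain=range(10), eps=2.0)
```

The search is exponential in the Hamming radius, which is exactly why the paper turns to gradient-based surrogates in Section 3.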

3 Adaptive rates through inverse sensitivity for \varepsilon-DP

To understand the achievable rates when privately optimizing functions with growth, we begin our theoretical investigation by examining the inverse sensitivity mechanism in our setting. We show that, for instances that exhibit \kappa-growth of the empirical function, the inverse sensitivity mechanism privately solves ERM with excess loss roughly (d/n\varepsilon)^{\frac{\kappa}{\kappa-1}}.

In our setting, we use a gradient-based approximation of the inverse sensitivity mechanism to simplify the analysis while attaining similar rates. Following [3], with our function of interest h(\mathcal{S})\coloneqq\mathop{\rm argmin}_{x\in\mathcal{X}}f_{\mathcal{S}}(x), we can lower bound the inverse sensitivity by \mathsf{len}_{h}(\mathcal{S};x)\geq n\|\nabla f_{\mathcal{S}}(x)\|_{2}/2L under natural assumptions. We define a \rho-smoothed version of this quantity, which is more suitable for continuous domains:

G^{\rho}_{\mathcal{S}}(x)=\inf_{y\in\mathcal{X}:\|y-x\|_{2}\leq\rho}\|\nabla f_{\mathcal{S}}(y)\|_{2},

and define the \rho-smooth gradient-based inverse sensitivity mechanism

\pi_{\mathsf{A}_{\mathrm{gr{-}inv}}(\mathcal{S})}(x)\propto e^{-\varepsilon nG^{\rho}_{\mathcal{S}}(x)/2L}. (5)

Note that while exactly sampling from the un-normalized density \pi_{\mathsf{A}_{\mathrm{gr{-}inv}}(\mathcal{S})} is computationally intractable, analyzing its performance is an important step towards understanding the optimal rates for the family of functions with growth that we study in this work. The following theorem demonstrates the adaptivity of the inverse sensitivity mechanism to the growth of the underlying instance. We defer the proof to Appendix A.
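For intuition, mechanism (5) can nevertheless be simulated on a grid in one dimension. The sketch below is our own discretization, under the assumptions that the loss is the absolute loss F(x;s)=|x-s| (so L=1) and \mathcal{X}=[0,1]; it computes G^{\rho}_{\mathcal{S}} at each grid point and samples from the resulting discrete density:

```python
import numpy as np

def gr_inv_sample(S, eps, rho, L=1.0, grid_size=512, seed=0):
    """Grid discretization of mechanism (5) for f_S(x) = (1/n) sum_i |x - s_i|
    on X = [0, 1]: compute the rho-smoothed minimal subgradient norm G^rho_S
    and sample with density proportional to exp(-eps * n * G / (2 L))."""
    rng = np.random.default_rng(seed)
    xs = np.linspace(0.0, 1.0, grid_size)
    # subgradient of f_S at x: fraction of s_i below x minus fraction above
    grads = np.array([np.mean(np.sign(x - S)) for x in xs])
    # smoothing: minimal |subgradient| within distance rho of each grid point
    G = np.array([np.min(np.abs(grads[np.abs(xs - x) <= rho])) for x in xs])
    w = np.exp(-eps * len(S) * G / (2 * L))
    return rng.choice(xs, p=w / w.sum())

S = np.random.default_rng(1).uniform(size=200)
x_hat = gr_inv_sample(S, eps=1.0, rho=0.01)  # concentrates near the empirical median
```

The density rewards points with small nearby subgradient norm, so draws concentrate around the empirical minimizer (here the median), exactly the qualitative behavior Theorem 1 quantifies.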

Theorem 1.

Let \mathcal{S}=(s_{1},\ldots,s_{n})\in\mathbb{S}^{n} and let F(x;s) be convex and L-Lipschitz for all s\in\mathbb{S}. Let x^{\star}=\mathop{\rm argmin}_{x\in\mathcal{X}}f_{\mathcal{S}}(x) and assume x^{\star} is in the interior of \mathcal{X}. Assume that f_{\mathcal{S}} has \kappa-growth (Assumption 1) with \kappa\geq\underline{\kappa}>1. For \rho>0, the \rho-smooth inverse sensitivity mechanism \mathsf{A}_{\mathrm{gr{-}inv}} (5) is \varepsilon-DP, and with probability at least 1-\beta the output \hat{x}=\mathsf{A}_{\mathrm{gr{-}inv}}(\mathcal{S}) satisfies

f_{\mathcal{S}}(\hat{x})-\min_{x\in\mathcal{X}}f_{\mathcal{S}}(x)\leq\frac{1}{\lambda^{\frac{1}{\kappa-1}}}\left(\frac{2L(\log(1/\beta)+d\log(D/\rho))}{n\varepsilon}\right)^{\frac{\kappa}{\kappa-1}}+L\rho.

Moreover, setting \rho=(L/\lambda)^{\frac{1}{\underline{\kappa}-1}}(d/n\varepsilon)^{\frac{\underline{\kappa}}{\underline{\kappa}-1}}, we have

f_{\mathcal{S}}(\hat{x})-\min_{x\in\mathcal{X}}f_{\mathcal{S}}(x)\leq\frac{1}{\lambda^{\frac{1}{\kappa-1}}}\widetilde{O}\left(\frac{Ld}{n\varepsilon}\right)^{\frac{\kappa}{\kappa-1}}.

The rates of the inverse sensitivity mechanism in Theorem 1 provide two main insights into the landscape of the problem under growth conditions. First, these conditions allow us to improve the worst-case rate d/n\varepsilon to (d/n\varepsilon)^{\frac{\kappa}{\kappa-1}} for pure \varepsilon-DP and therefore suggest that a better rate of (\sqrt{d\log(1/\delta)}/n\varepsilon)^{\frac{\kappa}{\kappa-1}} is possible for approximate (\varepsilon,\delta)-DP. Moreover, the general instance-optimality guarantees of this mechanism [2] hint that these are the optimal rates for our class of functions. In the sections to come, we validate these predictions by developing efficient algorithms that achieve these rates (for pure and approximate privacy) and prove matching lower bounds demonstrating the optimality of these algorithms.

4 Efficient algorithms with optimal rates

While the previous section demonstrates that there exist algorithms that improve the rates for functions with growth, we pointed out that \mathsf{A}_{\mathrm{gr{-}inv}} is computationally intractable in the general case. In this section, we develop efficient algorithms (i.e., algorithms implementable with gradient-based methods) that achieve the same convergence rates. Our algorithms build on the recent localization techniques that Feldman et al. [25] used to obtain optimal rates for DP-SCO with general convex functions. In Section 4.1, we use these techniques to develop private algorithms that achieve the optimal rates for (pure) DP-SCO with high probability, in contrast to existing results, which bound the expected excess loss. These results are of independent interest.

In Section 4.2, we translate these results into convergence guarantees for privately optimizing convex functions with growth by solving a sequence of increasingly constrained SCO problems; the high-probability guarantees of Section 4.1 are crucial to our convergence analysis of these algorithms.

4.1 High-probability guarantees for convex DP-SCO

We first describe our algorithm (Algorithm 1), then analyze its performance under pure-DP (Proposition 1) and approximate-DP (Proposition 2) constraints. Our analysis builds on novel tight high-probability generalization bounds for uniformly stable algorithms [24]. We defer the proofs to Appendix B.

Algorithm 1 Localization-based Algorithm
0:  Dataset \mathcal{S}=(s_{1},\ldots,s_{n})\in\mathbb{S}^{n}, constraint set \mathcal{X}, step size \eta, initial point x_{0}, privacy parameters (\varepsilon,\delta);
1:  Set k=\lceil\log n\rceil and n_{0}=n/k
2:  for i=1 to k do
3:     Set \eta_{i}=2^{-4i}\eta
4:     Solve the following ERM over \mathcal{X}_{i}=\{x\in\mathcal{X}:\|x-x_{i-1}\|_{2}\leq 2L\eta_{i}n_{0}\}:
        F_{i}(x)=\frac{1}{n_{0}}\sum_{j=1+(i-1)n_{0}}^{in_{0}}F(x;s_{j})+\frac{1}{\eta_{i}n_{0}}\|x-x_{i-1}\|_{2}^{2}
5:     Let \hat{x}_{i} be the output of the optimization algorithm
6:     if \delta=0 then
7:        Set \zeta_{i}\sim\mathsf{Lap}_{d}(\sigma_{i}) where \sigma_{i}=4L\eta_{i}\sqrt{d}/\varepsilon_{i}
8:     else if \delta>0 then
9:        Set \zeta_{i}\sim\mathsf{N}(0,\sigma_{i}^{2}) where \sigma_{i}=4L\eta_{i}\sqrt{\log(1/\delta)}/\varepsilon
10:    Set x_{i}=\hat{x}_{i}+\zeta_{i}
11:  return the final iterate x_{k}
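A minimal runnable sketch of Algorithm 1 follows (ours, not the paper's implementation). We make two assumptions the pseudocode leaves abstract: the inner ERM solver is plain projected (sub)gradient descent, and we take \varepsilon_{i}=\varepsilon since each phase touches a disjoint chunk of data; the loss and data in the usage line are hypothetical:

```python
import numpy as np

def solve_erm(grad_F, chunk, center, eta_i, n0, radius, steps=200):
    """Inner step of Algorithm 1: approximately minimize the regularized ERM
    F_i(x) = (1/n0) * sum_j F(x; s_j) + ||x - x_{i-1}||^2 / (eta_i * n0)
    over {||x - x_{i-1}|| <= radius} by projected (sub)gradient descent."""
    x = center.copy()
    mu = 2.0 / (eta_i * n0)  # strong-convexity constant of the regularizer
    for t in range(1, steps + 1):
        g = np.mean([grad_F(x, s) for s in chunk], axis=0) + mu * (x - center)
        x = x - g / (mu * t)  # 1/(mu t) step size for strongly convex SGD
        diff = x - center
        nrm = np.linalg.norm(diff)
        if nrm > radius:  # project back onto the localization ball
            x = center + radius * diff / nrm
    return x

def localization_algorithm(S, grad_F, x0, eta, L, eps, delta=0.0, seed=0):
    """Sketch of Algorithm 1: k = ceil(log n) regularized-ERM phases on
    disjoint chunks; each phase's output is privatized with Laplace
    (delta = 0) or Gaussian (delta > 0) noise before localizing the next."""
    rng = np.random.default_rng(seed)
    n, d = len(S), x0.shape[0]
    k = int(np.ceil(np.log(n)))
    n0 = n // k
    x = x0.copy()
    for i in range(1, k + 1):
        eta_i = 2.0 ** (-4 * i) * eta
        chunk = S[(i - 1) * n0 : i * n0]
        x_hat = solve_erm(grad_F, chunk, x, eta_i, n0, 2 * L * eta_i * n0)
        if delta == 0:
            # assumption: eps_i = eps, since phases touch disjoint data
            # (the paper's exact choice of eps_i may differ)
            zeta = rng.laplace(scale=4 * L * eta_i * np.sqrt(d) / eps, size=d)
        else:
            zeta = rng.normal(scale=4 * L * eta_i * np.sqrt(np.log(1 / delta)) / eps, size=d)
        x = x_hat + zeta
    return x

# Hypothetical usage: mean estimation with F(x; s) = ||x - s||^2 / 2.
data = np.random.default_rng(1).normal(loc=1.0, scale=0.1, size=(64, 2))
x_priv = localization_algorithm(data, lambda x, s: x - s, np.zeros(2), eta=0.5, L=1.0, eps=1.0)
```

Note how the noise scale 4L\eta_{i}\sqrt{d}/\varepsilon shrinks geometrically with the step size, which is what lets later phases refine the iterate without destroying privacy.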
Proposition 1.

Let \beta\leq 1/(n+d), \mathsf{diam}_{2}(\mathcal{X})\leq D and F(x;s) be convex and L-Lipschitz for all s\in\mathbb{S}. Setting

\eta=\frac{D}{L}\min\left(\frac{1}{\sqrt{n\log(1/\beta)}},\frac{\varepsilon}{d\log(1/\beta)}\right)

then for \delta=0, Algorithm 1 is \varepsilon-DP and, with probability at least 1-\beta, satisfies

f(x_{k})-f(x^{\star})\leq LD\cdot O\left(\frac{\sqrt{\log(1/\beta)}\log^{3/2}n}{\sqrt{n}}+\frac{d\log(1/\beta)\log n}{n\varepsilon}\right).

Similarly, by using a different choice of parameters and noise distribution, we have the following guarantees for approximate (\varepsilon,\delta)-DP.

Proposition 2.

Let \beta\leq 1/(n+d), \mathsf{diam}_{2}(\mathcal{X})\leq D and F(x;s) be convex and L-Lipschitz for all s\in\mathbb{S}. Setting

\eta=\frac{D}{L}\min\left(\frac{1}{\sqrt{n\log(1/\beta)}},\frac{\varepsilon}{\sqrt{d\log(1/\delta)}\log(1/\beta)}\right),

then for \delta>0, Algorithm 1 is (\varepsilon,\delta)-DP and, with probability at least 1-\beta, satisfies

f(x_{k})-f(x^{\star})\leq LD\cdot O\left(\frac{\sqrt{\log(1/\beta)}\log^{3/2}n}{\sqrt{n}}+\frac{\sqrt{d\log(1/\delta)}\log(1/\beta)\log n}{n\varepsilon}\right).

4.2 Algorithms for DP-SCO with growth

Building on the algorithms of the previous section, we design algorithms that recover the rates of the inverse sensitivity mechanism for functions with growth, importantly without knowledge of the value of \kappa. Inspired by epoch-based algorithms from the optimization literature [31, 29], our algorithm iteratively applies the private procedures of the previous section. Crucially, the growth assumption allows us to reduce the diameter of the domain after each run, improving the overall excess loss through a careful choice of hyper-parameters. We provide full details in Algorithm 2.

Algorithm 2 Epoch-based algorithm for \kappa-growth
0:  Dataset \mathcal{S}=(s_{1},\ldots,s_{n})\in\mathbb{S}^{n}, convex set \mathcal{X}, initial point x_{0}, number of iterations T, privacy parameters (\varepsilon,\delta);
1:  Set n_{0}=n/T and D_{0}=\mathsf{diam}_{2}(\mathcal{X})
2:  if \delta=0 then
3:     Set \eta_{0}=\frac{D_{0}}{2L}\min\left(\frac{1}{\sqrt{n_{0}\log(n_{0})\log(1/\beta)}},\frac{\varepsilon}{d\log(1/\beta)}\right)
4:  else if \delta>0 then
5:     Set \eta_{0}=\frac{D_{0}}{2L}\min\left(\frac{1}{\sqrt{n_{0}\log(n_{0})\log(1/\beta)}},\frac{\varepsilon}{\sqrt{d\log(1/\delta)}\log(1/\beta)}\right)
6:  for i=0 to T-1 do
7:     Let \mathcal{S}_{i}=(s_{1+in_{0}},\dots,s_{(i+1)n_{0}})
8:     Set D_{i}=2^{-i}D_{0} and \eta_{i}=2^{-i}\eta_{0}
9:     Set \mathcal{X}_{i}=\{x\in\mathcal{X}:\|x-x_{i}\|_{2}\leq D_{i}\}
10:    Run Algorithm 1 on dataset \mathcal{S}_{i} with starting point x_{i}, privacy parameters (\varepsilon,\delta), domain \mathcal{X}_{i} (with diameter D_{i}), and step size \eta_{i}
11:    Let x_{i+1} be the output of the private procedure
12:  return x_{T}
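Algorithm 2 can be sketched as a thin loop over an abstract private SCO subroutine (Algorithm 1 in the paper). In the sketch below (ours), the subroutine is passed in as a parameter, the stand-in `dummy_solver` and its data are hypothetical, and \beta and \underline{\kappa} are assumed inputs:

```python
import numpy as np

def epoch_based_growth_algorithm(S, private_sco_solver, x0, D0, L, eps,
                                 delta=0.0, beta=0.01, kappa_lb=1.5):
    """Sketch of Algorithm 2: T = ceil(2 log n / (kappa_lb - 1)) epochs of a
    private SCO subroutine on disjoint chunks, halving the trust-region
    diameter D_i and step size eta_i after every epoch."""
    n, d = len(S), x0.shape[0]
    T = int(np.ceil(2 * np.log(n) / (kappa_lb - 1)))
    n0 = n // T
    root = 1 / np.sqrt(n0 * np.log(n0) * np.log(1 / beta))
    if delta == 0:   # eta_0 as on line 3 of Algorithm 2
        eta0 = (D0 / (2 * L)) * min(root, eps / (d * np.log(1 / beta)))
    else:            # eta_0 as on line 5 of Algorithm 2
        eta0 = (D0 / (2 * L)) * min(root, eps / (np.sqrt(d * np.log(1 / delta)) * np.log(1 / beta)))
    x = x0.copy()
    for i in range(T):
        chunk = S[i * n0 : (i + 1) * n0]
        # the subroutine optimizes over the ball {x' : ||x' - x|| <= D_i}
        x = private_sco_solver(chunk, center=x, radius=2.0 ** (-i) * D0,
                               step_size=2.0 ** (-i) * eta0)
    return x

# Hypothetical stand-in for Algorithm 1: move to the chunk mean, clipped to the ball.
def dummy_solver(chunk, center, radius, step_size):
    step = np.mean(chunk, axis=0) - center
    nrm = np.linalg.norm(step)
    return center + (step if nrm <= radius else radius * step / nrm)

data = np.random.default_rng(0).normal(loc=1.0, scale=0.05, size=(400, 2))
x_final = epoch_based_growth_algorithm(data, dummy_solver, np.zeros(2), D0=2.0, L=1.0, eps=1.0)
```

The choice T=\lceil 2\log n/(\underline{\kappa}-1)\rceil only uses the lower bound \underline{\kappa}, which is exactly how the method avoids needing the true \kappa.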

The following theorem summarizes our main upper bound for DP-SCO with growth in the pure privacy model, recovering the rates of the inverse sensitivity mechanism in Section 3. We defer the proof to Section B.3.

Theorem 2.

Let \beta\leq 1/(n+d), \mathsf{diam}_{2}(\mathcal{X})\leq D and F(x;s) be convex and L-Lipschitz for all s\in\mathbb{S}. Assume that f has \kappa-growth (Assumption 1) with \kappa\geq\underline{\kappa}>1. Setting T=\left\lceil\frac{2\log n}{\underline{\kappa}-1}\right\rceil, Algorithm 2 is \varepsilon-DP and, with probability at least 1-\beta, satisfies

f(x_{T})-\min_{x\in\mathcal{X}}f(x)\leq\frac{1}{\lambda^{\frac{1}{\kappa-1}}}\cdot\widetilde{O}\left(\frac{L\sqrt{\log(1/\beta)}}{\sqrt{n}}+\frac{Ld\log(1/\beta)}{n\varepsilon(\underline{\kappa}-1)}\right)^{\frac{\kappa}{\kappa-1}},

where \widetilde{O} hides logarithmic factors depending on n and d.

Sketch of the proof.

The main challenge of the proof is showing that the iterate achieves good risk without knowledge of \kappa. Let us denote by D\cdot\rho the error guarantee of Proposition 1 (or Proposition 2 for approximate DP). At each stage i, as long as x^{\star}=\mathop{\rm argmin}_{x\in\mathcal{X}}f(x) belongs to \mathcal{X}_{i}, the excess loss is of order D_{i}\cdot\rho and thus decreases exponentially fast with i. The challenge is that, without knowledge of \kappa, we do not know the index i_{0} (roughly \frac{\log_{2}n}{\kappa-1}) after which x^{\star}\notin\mathcal{X}_{j} for j\geq i_{0}, at which point the guarantees become meaningless with respect to the original problem. However, in the stages after i_{0}, as the constraint set becomes very small, we upper bound the variations in function values f(x_{j+1})-f(x_{j}) and show that the sub-optimality cannot increase (overall) by more than O(D_{i_{0}}\cdot\rho), thus achieving the optimal rate of stage i_{0}.

Moreover, we can improve the dependence on the dimension for approximate (\varepsilon,\delta)-DP, resulting in the following bounds.

Theorem 3.

Let \beta\leq 1/(n+d), \mathsf{diam}_{2}(\mathcal{X})\leq D and F(x;s) be convex and L-Lipschitz for all s\in\mathbb{S}. Assume that f has \kappa-growth (Assumption 1) with \kappa\geq\underline{\kappa}>1. Setting T=\left\lceil\frac{2\log n}{\underline{\kappa}-1}\right\rceil and \delta>0, Algorithm 2 is (\varepsilon,\delta)-DP and, with probability at least 1-\beta, satisfies

f(xT)minx𝒳f(x)1λ1κ1O~(Llog(1/β)n+Ldlog(1/δ)log(1/β)nε(κ¯1))κκ1,f(x_{T})-\min_{x\in\mathcal{X}}f(x)\leq\frac{1}{\lambda^{\frac{1}{\kappa-1}}}\cdot\widetilde{O}\left(\frac{L\sqrt{\log(1/\beta)}}{\sqrt{n}}+\frac{L\sqrt{d\log(1/\delta)}\log(1/\beta)}{n\varepsilon(\underline{\kappa}-1)}\right)^{\frac{\kappa}{\kappa-1}},

where O~\widetilde{O} hides logarithmic factors depending on nn and dd.

5 Lower bounds

In this section, we develop (minimax) lower bounds for the problem of SCO with κ\kappa-growth under privacy constraints. Note that taking ε\varepsilon\to\infty provides a lower bound on the unconstrained (non-private) minimax risk. For a sample space 𝕊\mathbb{S} and a collection of distributions 𝒫\mathcal{P} over 𝕊\mathbb{S}, we define the function class κ(𝒫)\mathcal{F}^{\kappa}(\mathcal{P}) as the set of convex functions from d\mathbb{R}^{d}\to\mathbb{R} that are LL-Lipschitz and have κ\kappa-growth (Assumption 1). We define the constrained minimax risk [6]

𝔐n(𝒳,𝒫,κ,ε,δ)infx^n𝒜ε,δsup(F,P)κ×𝒫𝔼[f(x^n(S1n))infx𝒳f(x)],\mathfrak{M}_{n}(\mathcal{X},\mathcal{P},\mathcal{F}^{\kappa},\varepsilon,\delta)\coloneqq\inf_{\widehat{x}_{n}\in\mathcal{A}^{\varepsilon,\delta}}\sup_{(F,P)\in\mathcal{F}^{\kappa}\times\mathcal{P}}\mathbb{E}\left[f(\widehat{x}_{n}(S_{1}^{n}))-\inf_{x^{\prime}\in\mathcal{X}}f(x^{\prime})\right], (6)

where 𝒜ε,δ\mathcal{A}^{\varepsilon,\delta} is the collection of (ε,δ)(\varepsilon,\delta)-DP mechanisms from 𝕊n\mathbb{S}^{n} to 𝒳\mathcal{X}. When clear from context, we omit the dependency on 𝒫\mathcal{P} of the function class and simply write κ\mathcal{F}^{\kappa}. We also forgo the dependence on δ\delta when referring to pure-DP constraints, i.e., 𝔐n(𝒳,𝒫,κ,ε,δ=0)𝔐n(𝒳,𝒫,κ,ε)\mathfrak{M}_{n}(\mathcal{X},\mathcal{P},\mathcal{F}^{\kappa},\varepsilon,\delta=0)\eqqcolon\mathfrak{M}_{n}(\mathcal{X},\mathcal{P},\mathcal{F}^{\kappa},\varepsilon). We now proceed to prove tight lower bounds for ε\varepsilon-DP in Section 5.1 and (ε,δ)(\varepsilon,\delta)-DP in Section 5.2.

5.1 Lower bounds for pure ε\varepsilon-DP

Although in Section 4 we show that the same algorithm achieves the optimal upper bounds for all values of κ>1\kappa>1, the landscape of the problem is more subtle for the lower bounds and we need to delineate two different cases to obtain tight lower bounds. We begin with κ2\kappa\geq 2, which corresponds to uniform convexity and enjoys properties that make the problem easier (e.g., closure under summation or addition of linear terms). The second case, 1<κ<21<\kappa<2, corresponds to sharper growth and requires a different hard instance to satisfy the growth condition.

κ\kappa-growth with κ2\kappa\geq 2.

We begin by developing lower bounds under pure DP for κ2\kappa\geq 2.

Theorem 4 (Lower bound for ε\varepsilon-DP, κ2\kappa\geq 2).

Let d1d\geq 1, 𝒳=𝔹2d(R)\mathcal{X}=\mathbb{B}_{2}^{d}(R), 𝕊={±ej}jd\mathbb{S}=\{\pm e_{j}\}_{j\leq d}, κ2\kappa\geq 2 and nn\in\mathbb{N}. Let 𝒫\mathcal{P} be the set of distributions on 𝕊\mathbb{S}. Assume that

2κ1Lλ1Rκ12κ196n and nε13.2^{\kappa-1}\leq\frac{L}{\lambda}\frac{1}{R^{\kappa-1}}\leq 2^{\kappa-1}\sqrt{96n}\mbox{~{}~{}and~{}~{}}n\varepsilon\geq\frac{1}{\sqrt{3}}.

Then the following lower bound holds:

𝔐n(𝒳,𝒫,κ,ϵ)1λ1κ1Ω~((Ln)κ(κ1)+(Ldnε)κκ1).\mathfrak{M}_{n}(\mathcal{X},\mathcal{P},\mathcal{F}^{\kappa},\epsilon)\geq\frac{1}{\lambda^{\tfrac{1}{\kappa-1}}}\tilde{\Omega}\left(\left(\frac{L}{\sqrt{n}}\right)^{\tfrac{\kappa}{(\kappa-1)}}+\left(\frac{Ld}{n\varepsilon}\right)^{\tfrac{\kappa}{\kappa-1}}\right). (7)

First of all, note that Lλ2κ1Rκ1L\geq\lambda 2^{\kappa-1}R^{\kappa-1} is not an overly restrictive assumption. Indeed, for an arbitrary (λ,κ)(\lambda,\kappa)-uniformly convex and LL-Lipschitz function, it always holds that Lλ2Rκ1L\geq\tfrac{\lambda}{2}R^{\kappa-1}. The assumption thus essentially amounts to requiring κ=Θ(1)\kappa=\Theta(1). Note that when κ1\kappa\gg 1, the standard n1/2+d/(nε)n^{-1/2}+d/(n\varepsilon) lower bound holds. We present the proof in Section C.1.1 and preview the main ideas here.

Sketch of the proof.

Our lower bound hinges on the collection of functions F(x;s)aκ1x2κ+bx,sF(x;s)\coloneqq a\kappa^{-1}\|x\|_{2}^{\kappa}+b\langle x,s\rangle for a,b0a,b\geq 0 to be chosen later. These functions are κ\kappa-uniformly convex for any s𝕊s\in\mathbb{S} [39, Lemma 4] and, in turn, so is the population function ff. We proceed as follows: we first prove an information-theoretic (non-private) lower bound (Theorem 8 in Appendix C.1.1), which provides the statistical term in (7). With the same family of functions, we exhibit a collection of datasets and prove by contradiction that if an estimator were to optimize below a certain error it would violate ε\varepsilon-DP—this yields a lower bound on ERM for our function class (Theorem 9 in Appendix C.1.1). We conclude by proving a reduction from SCO to ERM in Proposition 4. ∎

κ\kappa-growth with κ(1,2]\kappa\in(1,2].

As the construction of the hard instance is more intricate for κ<2\kappa<2, we provide a one-dimensional lower bound and leave the high-dimensional case to future work. In this case we directly obtain the result with a private version of Le Cam’s method [44, 42, 6], albeit with a different family of functions.

The issue with the construction of the previous section is that the function does not exhibit sharp growth for κ<2\kappa<2. Indeed, the added linear function shifts the minimum away from 0 where the function is differentiable; as a result, the function locally behaves as a quadratic and only achieves growth κ=2\kappa=2. To establish the lower bound, we consider a different sample function FF that has growth exactly 11 on one side of the minimum and κ\kappa on the other. This yields the following result.

Theorem 5 (Lower bound for ε\varepsilon-DP, κ(1,2]\kappa\in{(1,2]} ).

Let d=1d=1, 𝕊={1,+1}\mathbb{S}=\{-1,+1\}, κ(1,2]\kappa\in(1,2], λ=1\lambda=1, L=2L=2, and nn\in\mathbb{N}. There exists a collection of distributions 𝒫\mathcal{P} such that, whenever nε1/3n\varepsilon\geq 1/\sqrt{3}, it holds that

𝔐n([1,1],𝒫,d=1κ,ϵ)=Ω{(1n)κκ1+(1nε)κκ1}.\mathfrak{M}_{n}([-1,1],\mathcal{P},\mathcal{F}^{\kappa}_{d=1},\epsilon)=\Omega\left\{\left(\frac{1}{\sqrt{n}}\right)^{\tfrac{\kappa}{\kappa-1}}+\left(\frac{1}{n\varepsilon}\right)^{\tfrac{\kappa}{\kappa-1}}\right\}. (8)

5.2 Lower bounds under approximate privacy constraints

We conclude our treatment by providing lower bounds under approximate privacy constraints, demonstrating the optimality of the risk bound of Theorem 3. We prove the result via a reduction: we show that if one solves ERM with κ\kappa-growth to error Δ\Delta, then one solves arbitrary convex ERM to error ϕ(Δ)\phi(\Delta). Given that a lower bound of Ω(d/(nε))\Omega(\sqrt{d}/(n\varepsilon)) holds for general convex ERM, a lower bound of ϕ1(d/(nε))\phi^{-1}(\sqrt{d}/(n\varepsilon)) holds for ERM with κ\kappa-growth. For this reduction to hold, we require that κ2\kappa\geq 2. Furthermore, we consider κ\kappa to be roughly a constant; in the case that κ\kappa is too large, standard lower bounds on general convex functions already apply.

Theorem 6 (Private lower bound for (ε,δ)(\varepsilon,\delta)-DP).

Let κ2\kappa\geq 2 such that κ=Θ(1)\kappa=\Theta(1), 𝒳=𝔹2d(D)\mathcal{X}=\mathbb{B}_{2}^{d}(D). Let d1d\geq 1 and 𝕊={±1/d}d\mathbb{S}=\{\pm 1/\sqrt{d}\}^{d}. Assume that nε=Ω(d)n\varepsilon=\Omega(\sqrt{d}); then for any (ε,δ)(\varepsilon,\delta)-DP mechanism 𝖠\mathsf{A}, there exist λ>0,F\lambda>0,F and 𝒮𝕊\mathcal{S}\subset\mathbb{S} such that

𝔼[f𝒮(𝖠(𝒮))]infx𝒳f𝒮(x)Ω~[1λ1κ1(Ldnε)κκ1].\mathbb{E}[f_{\mathcal{S}}(\mathsf{A}(\mathcal{S}))]-\inf_{x^{\prime}\in\mathcal{X}}f_{\mathcal{S}}(x^{\prime})\geq\tilde{\Omega}\left[\frac{1}{\lambda^{\tfrac{1}{\kappa-1}}}\left(\frac{L\sqrt{d}}{n\varepsilon}\right)^{\tfrac{\kappa}{\kappa-1}}\right].

Theorem 6 implies that the same lower bound (up to logarithmic factors) applies to SCO via the reduction of [8, Appendix C]. Before proving the theorem, let us state (and prove in Section C.2) the following reduction: if an (ε,δ)(\varepsilon,\delta)-DP algorithm achieves excess error (roughly) Δ\Delta on ERM for any function with κ\kappa-growth, there exists an (ε,δ)(\varepsilon,\delta)-DP algorithm that achieves error Δ(κ1)/κ\Delta^{(\kappa-1)/\kappa} for any convex function. We construct the latter by iteratively solving ERM problems with geometrically increasing 2κ\|\cdot\|_{2}^{\kappa}-regularization towards the previous iterate to ensure the objective has κ\kappa-growth.
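The iterative regularization scheme just described can be made concrete with a minimal one-dimensional sketch. This is our illustration, not the paper's construction: an exact grid search stands in for the assumed mechanism 𝖠\mathsf{A} (privacy noise omitted), and the weight schedule is illustrative. A generic convex loss with no growth around its minimizer is solved by repeatedly minimizing it plus a κ\kappa-th-power penalty centered at the previous iterate, with geometrically increasing weight.

```python
def grid_min(h, lo, hi, m=4000):
    """Exact-ish minimizer by grid search; a noiseless stand-in for the
    assumed kappa-growth ERM mechanism."""
    pts = [lo + (hi - lo) * j / m for j in range(m + 1)]
    return min(pts, key=h)

def solve_via_growth_erm(g, lo, hi, kappa=2.0, k=8):
    """Toy version of the Proposition 3 reduction: k rounds of g plus a
    |.|^kappa penalty toward the previous iterate (the penalty gives each
    subproblem kappa-growth); the weight schedule here is illustrative."""
    x = 0.5 * (lo + hi)
    lam = 1.0 / (hi - lo) ** (kappa - 1.0)
    for _ in range(k):
        penalized = lambda y, c=x, l=lam: g(y) + (l / kappa) * abs(y - c) ** kappa
        x = grid_min(penalized, lo, hi)
        lam *= 2.0  # geometrically increasing regularization
    return x

# generic convex loss with only linear (kappa = 1) behavior at its minimum
g = lambda y: abs(y - 0.3)
x_hat = solve_via_growth_erm(g, lo=-1.0, hi=1.0)
assert abs(x_hat - 0.3) < 1e-2
```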

Proposition 3 (Solving ERM with κ\kappa-growth implies solving any convex ERM).

Let κ2\kappa\geq 2. Assume there exists an (ε,δ)(\varepsilon,\delta)-DP mechanism 𝖠\mathsf{A} such that for any LL-Lipschitz loss GG on 𝒴\mathcal{Y} and dataset 𝒮\mathcal{S} such that g𝒮(x)1ns𝒮G(x;s)g_{\mathcal{S}}(x)\coloneqq\frac{1}{n}\sum_{s\in\mathcal{S}}G(x;s) exhibits (λ,κ)(\lambda,\kappa)-growth, the mechanism achieves excess loss

𝔼[g𝒮(𝖠(𝒮,G,𝒴))]infy𝒴g𝒮(y)1λ1κ1Δ(n,L,ϵ,δ).\mathbb{E}[g_{\mathcal{S}}(\mathsf{A}(\mathcal{S},G,\mathcal{Y}))]-\inf_{y^{\prime}\in\mathcal{Y}}g_{\mathcal{S}}(y^{\prime})\leq\frac{1}{\lambda^{\tfrac{1}{\kappa-1}}}\Delta(n,L,\epsilon,\delta).

Then, we can construct an (ε,δ)(\varepsilon,\delta)-DP mechanism 𝖠\mathsf{A^{\prime}} such that for any LL-Lipschitz loss ff, the mechanism achieves excess loss

𝔼[f𝒮(𝖠(𝒮))]infx𝒳f𝒮(x)O(D[Δ(n,L,ϵ/k,δ/k)]κ1κ),\mathbb{E}[f_{\mathcal{S}}(\mathsf{A}^{\prime}(\mathcal{S}))]-\inf_{x^{\prime}\in\mathcal{X}}f_{\mathcal{S}}(x^{\prime})\leq O\left(D\left[\Delta(n,L,\epsilon/k,\delta/k)\right]^{\tfrac{\kappa-1}{\kappa}}\right),

where kk is the smallest integer such that klog[κ1κ1Lκκ122κ3Δ(n,L,ε/k,δ/k)]k\geq\log\left[\frac{\kappa^{\tfrac{1}{\kappa-1}}L^{\tfrac{\kappa}{\kappa-1}}}{2^{2\kappa-3}\Delta(n,L,\varepsilon/k,\delta/k)}\right].

With this proposition, the proof of the theorem follows directly, as Bassily et al. [7] prove a lower bound of Ω(d/(nε))\Omega(\sqrt{d}/(n\varepsilon)) for ERM under (ε,δ)(\varepsilon,\delta)-DP.

Discussion

In this work, we develop private algorithms that adapt to the growth of the function at hand, achieving the convergence rate corresponding to the “easiest” sub-class the function belongs to. However, the picture is not yet complete. First, there are still gaps in our theoretical understanding, the most interesting being the case κ=1\kappa=1. On such functions, appropriate (non-private) optimization algorithms achieve linear convergence [43], raising the question: can we achieve an exponentially small privacy cost in this setting? Finally, while our optimality guarantees are more fine-grained than the usual minimax results over convex functions, they are still contingent on a predetermined choice of sub-classes. Studying more general notions of adaptivity is an important future direction in private optimization.

Acknowledgments

The authors would like to thank Karan Chadha and Gary Cheng for comments on an early version of the draft.

References

  • Abadi et al. [2016] M. Abadi, A. Chu, I. Goodfellow, B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In 23rd ACM Conference on Computer and Communications Security (ACM CCS), pages 308–318, 2016.
  • Asi and Duchi [2020a] H. Asi and J. Duchi. Near instance-optimality in differential privacy. arXiv:2005.10630 [cs.CR], 2020a.
  • Asi and Duchi [2020b] H. Asi and J. C. Duchi. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. In Advances in Neural Information Processing Systems 33, 2020b.
  • Asi et al. [2021a] H. Asi, J. Duchi, A. Fallah, O. Javidbakht, and K. Talwar. Private adaptive gradient methods for convex optimization. arXiv:2106.13756 [cs.LG], 2021a.
  • Asi et al. [2021b] H. Asi, V. Feldman, T. Koren, and K. Talwar. Private stochastic convex optimization: Optimal rates in 1\ell_{1} geometry. arXiv:2103.01516 [cs.LG], 2021b.
  • Barber and Duchi [2014] R. F. Barber and J. C. Duchi. Privacy and statistical risk: Formalisms and minimax bounds. arXiv:1412.4451 [math.ST], 2014.
  • Bassily et al. [2014] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 55th Annual Symposium on Foundations of Computer Science, pages 464–473, 2014.
  • Bassily et al. [2019] R. Bassily, V. Feldman, K. Talwar, and A. Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems 32, 2019.
  • Bassily et al. [2020] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems 33, 2020.
  • Beck and Teboulle [2003] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.
  • Bolte et al. [2017] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. Suter. From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming, 165:471–507, 2017.
  • Bottou et al. [2018] L. Bottou, F. Curtis, and J. Nocedal. Optimization methods for large-scale learning. SIAM Review, 60(2):223–311, 2018.
  • Braverman et al. [2016] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the Forty-Eighth Annual ACM Symposium on the Theory of Computing, 2016. URL https://arxiv.org/abs/1506.07216.
  • Cai and Low [2015] T. Cai and M. Low. A framework for estimating convex functions. Statistica Sinica, 25:423–456, 2015.
  • Chatterjee et al. [2016] S. Chatterjee, J. Duchi, J. Lafferty, and Y. Zhu. Local minimax complexity of stochastic convex optimization. In Advances in Neural Information Processing Systems 29, 2016.
  • Chaudhuri et al. [2011] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.
  • Duchi [2019] J. C. Duchi. Information theory and statistics. Lecture Notes for Statistics 311/EE 377, Stanford University, 2019. URL http://web.stanford.edu/class/stats311/lecture-notes.pdf. Accessed May 2019.
  • Duchi and Ruan [2021] J. C. Duchi and F. Ruan. Asymptotic optimality in stochastic optimization. Annals of Statistics, 49(1):21–48, 2021.
  • Duchi et al. [2013] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In 54th Annual Symposium on Foundations of Computer Science, pages 429–438, 2013.
  • Dwork and Roth [2014] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3 & 4):211–407, 2014.
  • Dwork et al. [2006a] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology (EUROCRYPT 2006), 2006a.
  • Dwork et al. [2006b] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Theory of Cryptography Conference, pages 265–284, 2006b.
  • Dwork et al. [2012] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), pages 214–226, 2012.
  • Feldman and Vondrak [2019] V. Feldman and J. Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Proceedings of the Thirty Second Annual Conference on Computational Learning Theory, pages 1270–1279, 2019.
  • Feldman et al. [2020] V. Feldman, T. Koren, and K. Talwar. Private stochastic convex optimization: Optimal rates in linear time. In Proceedings of the Fifty-Second Annual ACM Symposium on the Theory of Computing, 2020.
  • Garg et al. [2014] A. Garg, T. Ma, and H. L. Nguyen. On communication cost of distributed statistical estimation and dimensionality. In Advances in Neural Information Processing Systems 27, 2014.
  • Hardt and Talwar [2010] M. Hardt and K. Talwar. On the geometry of differential privacy. In Proceedings of the Forty-Second Annual ACM Symposium on the Theory of Computing, pages 705–714, 2010. URL http://arxiv.org/abs/0907.3754.
  • Hashimoto et al. [2018] T. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang. Fairness without demographics in repeated loss minimization. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • Hazan and Kale [2011] E. Hazan and S. Kale. An optimal algorithm for stochastic strongly convex optimization. In Proceedings of the Twenty Fourth Annual Conference on Computational Learning Theory, 2011. URL http://arxiv.org/abs/1006.2425.
  • Jin et al. [2019] C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv:1902.03736 [math.PR], 2019.
  • Juditsky and Nesterov [2010] A. Juditsky and Y. Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. URL http://hal.archives-ouvertes.fr/docs/00/50/89/33/PDF/Strong-hal.pdf, 2010.
  • Juditsky and Nesterov [2014] A. Juditsky and Y. Nesterov. Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization. Stochastic Systems, 4(1):44–80, 2014.
  • Levy and Duchi [2019] D. Levy and J. C. Duchi. Necessary and sufficient geometries for gradient methods. In Advances in Neural Information Processing Systems 32, 2019.
  • Levy et al. [2021] D. Levy, Z. Sun, K. Amin, S. Kale, A. Kulesza, M. Mohri, and A. T. Suresh. Learning with user-level privacy. arXiv:2102.11845 [cs.LG], 2021. URL https://arxiv.org/abs/2102.11845.
  • McMahan et al. [2017] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
  • Mitzenmacher and Upfal [2005] M. Mitzenmacher and E. Upfal. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge University Press, 2005.
  • Nemirovski and Yudin [1983] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
  • Nemirovski et al. [2009] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • Nesterov [2008] Y. Nesterov. Accelerating the cubic regularization of newton’s method on convex problems. Mathematical Programming, 112(1):159–181, 2008.
  • Ramdas and Singh [2013] A. Ramdas and A. Singh. Optimal rates for stochastic convex optimization under tsybakov noise condition. In Proceedings of the 30th International Conference on Machine Learning, pages 365–373, 2013.
  • Smith and Thakurta [2013] A. Smith and A. Thakurta. Differentially private feature selection via stability arguments, and the robustness of the Lasso. In Proceedings of the Twenty Sixth Annual Conference on Computational Learning Theory, pages 819–850, 2013. URL http://proceedings.mlr.press/v30/Guha13.html.
  • Wainwright [2019] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
  • Xu et al. [2017] Y. Xu, Q. Lin, and T. Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In Proceedings of the 34th International Conference on Machine Learning, pages 3821–3830, 2017.
  • Yu [1997] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer-Verlag, 1997.
  • Zhang et al. [2013] Y. Zhang, J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Information-theoretic lower bounds for distributed estimation with communication constraints. In Advances in Neural Information Processing Systems 26, 2013.
  • Zhu et al. [2016] Y. Zhu, S. Chatterjee, J. Duchi, and J. Lafferty. Local minimax complexity of stochastic convex optimization. In Advances in Neural Information Processing Systems 29, 2016.

Appendix A Proofs for Section 3

A.1 Proof of Theorem 1

Theorem 1 (restated).

Let us first prove privacy. The sensitivity of f𝒮(x)2\|\nabla f_{\mathcal{S}}(x)\|_{2} is 2L/n2L/n as FF is LL-Lipschitz, therefore following the privacy proof of the smooth inverse sensitivity mechanism [2, Prop. 3.2] we get that 𝖠grinv\mathsf{A}_{\mathrm{gr{-}inv}} (5) is ε\varepsilon-DP.

Let us now prove the claim about utility. Denote x^=𝖠grinv(𝒮)\hat{x}=\mathsf{A}_{\mathrm{gr{-}inv}}(\mathcal{S}) and E=2LKnεE=\frac{2LK}{n\varepsilon} with KK to be chosen presently. We argue that it is enough to show that Pr(Gρ(x^)E)β\Pr(G_{\rho}(\hat{x})\geq E)\leq\beta. Indeed then with probability at least 1β1-\beta we have Gρ(x^)EG_{\rho}(\hat{x})\leq E, which implies there is yy such that x^y2ρ\|\hat{x}-y\|_{2}\leq\rho and f𝒮(y)2E\|\nabla f_{\mathcal{S}}(y)\|_{2}\leq E, hence using the Kurdyka-Łojasiewicz inequality (2)

f𝒮(x^)f𝒮(x)\displaystyle f_{\mathcal{S}}(\hat{x})-f_{\mathcal{S}}(x^{\star}) =f𝒮(x^)f𝒮(y)+f𝒮(y)f𝒮(x)\displaystyle=f_{\mathcal{S}}(\hat{x})-f_{\mathcal{S}}(y)+f_{\mathcal{S}}(y)-f_{\mathcal{S}}(x^{\star})
Lρ+eλ1κ1f𝒮(y)2κκ1\displaystyle\leq L\rho+\frac{e}{\lambda^{\frac{1}{\kappa-1}}}\|\nabla f_{\mathcal{S}}(y)\|_{2}^{\frac{\kappa}{\kappa-1}}
Lρ+eλ1κ1Eκκ1.\displaystyle\leq L\rho+\frac{e}{\lambda^{\frac{1}{\kappa-1}}}E^{\frac{\kappa}{\kappa-1}}.

It remains to prove that Pr(Gρ(x^)E)β\Pr(G_{\rho}(\hat{x})\geq E)\leq\beta. Let S0={xd:xx2ρ}S_{0}=\{x\in\mathbb{R}^{d}:\|x-x^{\star}\|_{2}\leq\rho\} and S1={xd:Gρ(x)E}S_{1}=\{x\in\mathbb{R}^{d}:G_{\rho}(x)\geq E\}. Note that Gρ(x)=0G_{\rho}(x)=0 for any xS0x\in S_{0} as xx^{\star} is in the interior of 𝒳\mathcal{X} which implies f𝒮(x)=0\nabla f_{\mathcal{S}}(x^{\star})=0. Hence the definition of the smooth inverse sensitivity mechanism (5) implies

Pr(𝖠grinv(𝒮)S1)\displaystyle\Pr(\mathsf{A}_{\mathrm{gr{-}inv}}(\mathcal{S})\in S_{1}) 𝖵𝗈𝗅({xd:xx2D+ρ})enε2LE𝖵𝗈𝗅({xd:xx2ρ})\displaystyle\leq\frac{\mathsf{Vol}(\{x\in\mathbb{R}^{d}:\|x-x^{\star}\|_{2}\leq D+\rho\})e^{-\frac{n\varepsilon}{2L}E}}{\mathsf{Vol}(\{x\in\mathbb{R}^{d}:\|x-x^{\star}\|_{2}\leq\rho\})}
eK(1+Dρ)dβ,\displaystyle\leq e^{-K}\left(1+\frac{D}{\rho}\right)^{d}\leq\beta,

where the last inequality follows by choosing K=log(1/β)+dlog(1+D/ρ)K=\log(1/\beta)+d\log(1+D/\rho).
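As a quick numeric check of this final step (with arbitrary illustrative values of dd, DD, ρ\rho and β\beta of our choosing), the stated choice of KK makes the bound eK(1+D/ρ)de^{-K}(1+D/\rho)^{d} collapse to exactly β\beta:

```python
import math

# Illustrative values; the identity holds for any choice of d, D, rho, beta.
d, D, rho, beta = 20, 1.0, 1e-3, 1e-2
K = math.log(1 / beta) + d * math.log(1 + D / rho)
bound = math.exp(-K) * (1 + D / rho) ** d
assert math.isclose(bound, beta, rel_tol=1e-9)
```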

Appendix B Proofs for Section 4

We need the following result on the generalization properties of uniformly stable algorithms [24].

Theorem 7.

[24, Cor. 4.2] Assume 𝖽𝗂𝖺𝗆2(𝒳)D\mathsf{diam}_{2}(\mathcal{X})\leq D. Let 𝒮=(S1,,Sn)\mathcal{S}=(S_{1},\dots,S_{n}) where S1niidPS_{1}^{n}\stackrel{{\scriptstyle\rm iid}}{{\sim}}P and F(x;s)F(x;s) is LL-Lipschitz and λ\lambda-strongly convex for all s𝕊s\in\mathbb{S}. Let x^=argminx𝒳f𝒮(x)\hat{x}=\mathop{\rm argmin}_{x\in\mathcal{X}}{f_{\mathcal{S}}(x)} be the empirical minimizer. For 0<β1/n0<\beta\leq 1/n, with probability at least 1β1-\beta

f(x^)f(x)cL2log(n)log(1/β)λn+cLDlog(1/β)n.f(\hat{x})-f(x^{\star})\leq\frac{cL^{2}\log(n)\log(1/\beta)}{\lambda n}+\frac{cLD\sqrt{\log(1/\beta)}}{\sqrt{n}}.

B.1 Proof of Proposition 1

Proposition 1 (restated).

We begin by proving the privacy claim. We show that each iterate is ε\varepsilon-DP; post-processing then implies the claim, as each sample is used in exactly one iterate. To this end, let λi=1/ηin0\lambda_{i}=1/\eta_{i}n_{0} and note that the minimizer x^i\hat{x}_{i} has 2\ell_{2} sensitivity 2L/λin04Lηi2L/\lambda_{i}n_{0}\leq 4L\eta_{i} [25], hence the 1\ell_{1}-sensitivity is at most 4Lηid4L\eta_{i}\sqrt{d}. Standard properties of the Laplace mechanism [20] now imply that xix_{i} is ε\varepsilon-DP, which gives the privacy claim.

Now we proceed to prove utility, which follows arguments similar to the localization-based proof in [25]. Letting x^0=x\hat{x}_{0}=x^{\star}, we have:

f(xk)f(x)\displaystyle f(x_{k})-f(x^{\star}) =i=1kf(x^i)f(x^i1)+f(xk)f(x^k).\displaystyle=\sum_{i=1}^{k}f(\hat{x}_{i})-f(\hat{x}_{i-1})+f(x_{k})-f(\hat{x}_{k}).

First, by using standard properties of Laplace distributions [17], we know that for ζi𝖫𝖺𝗉(σi)\zeta_{i}\sim\mathsf{Lap}(\sigma_{i}),

Pr(ζi2t)Pr(ζit/d)det/dσi,\Pr(\|\zeta_{i}\|_{2}\geq t)\leq\Pr(\|\zeta_{i}\|_{\infty}\geq t/\sqrt{d})\leq de^{-t/\sqrt{d}\sigma_{i}},

which implies (as β1/(n+d)\beta\leq 1/(n+d)) that with probability 1β/21-\beta/2 we have ζi210dσilog(1/β)\|\zeta_{i}\|_{2}\leq 10\sqrt{d}\sigma_{i}\log(1/\beta) for all 1ik1\leq i\leq k. Hence

f(xk)f(x^k)\displaystyle f(x_{k})-f(\hat{x}_{k}) Lxkx^k2\displaystyle\leq L\|x_{k}-\hat{x}_{k}\|_{2}
Lσkdlog(1/β)\displaystyle\leq L\sigma_{k}\sqrt{d}\log(1/\beta)
4L2dηkε\displaystyle\leq 4L^{2}d\frac{\eta_{k}}{\varepsilon}
4L2dηε24k4LDn2,\displaystyle\leq 4L^{2}d\frac{\eta}{\varepsilon 2^{4k}}\leq\frac{4LD}{n^{2}},

where the last inequality follows since η=DεLdlog(k/β)\eta=\frac{D\varepsilon}{Ld\log(k/\beta)}. Now we use high-probability generalization guarantees of uniformly-stable algorithms. We use Theorem 7 with F(x;sj)+xxi122ηin0F(x;s_{j})+\frac{\|x-x_{i-1}\|_{2}^{2}}{\eta_{i}n_{0}} to get that with probability 1β/21-\beta/2 for each ii

f(x^i)f(x^i1)x^i1xi122ηin0+cL2log(n)log(1/β)ηi+cLDlog(1/β)n0.\displaystyle f(\hat{x}_{i})-f(\hat{x}_{i-1})\leq\frac{\|\hat{x}_{i-1}-x_{i-1}\|_{2}^{2}}{\eta_{i}n_{0}}+{cL^{2}\log(n)\log(1/\beta)\eta_{i}}+\frac{cLD\sqrt{\log(1/\beta)}}{\sqrt{n_{0}}}.

Thus,

i=1kf(x^i)f(x^i1)\displaystyle\sum_{i=1}^{k}f(\hat{x}_{i})-f(\hat{x}_{i-1}) i=1k{x^i1xi122ηin0+cL2log(n)log(1/β)ηi+cLDlog(1/β)n0}\displaystyle\leq\sum_{i=1}^{k}\left\{\frac{\|\hat{x}_{i-1}-x_{i-1}\|_{2}^{2}}{\eta_{i}n_{0}}+{cL^{2}\log(n)\log(1/\beta)\eta_{i}}+\frac{cLD\sqrt{\log(1/\beta)}}{\sqrt{n_{0}}}\right\}
D2ηn0+[i=2kσi12dlog2(1/β)ηin0]+2cL2log(n)log(1/β)η+cLDlog(1/β)kn0\displaystyle\leq\frac{D^{2}}{\eta n_{0}}+\left[\sum_{i=2}^{k}\frac{\sigma_{i-1}^{2}d\log^{2}(1/\beta)}{\eta_{i}n_{0}}\right]+{2cL^{2}\log(n)\log(1/\beta)\eta}+\frac{cLD\sqrt{\log(1/\beta)}k}{\sqrt{n_{0}}}
=D2ηn0+[i=2kCL2ηi1d2log2(1/β)n0ε2]+2cL2log(n)log(1/β)η+cLDlog(1/β)kn0\displaystyle=\frac{D^{2}}{\eta n_{0}}+\left[\sum_{i=2}^{k}\frac{CL^{2}\eta_{i-1}d^{2}\log^{2}(1/\beta)}{n_{0}\varepsilon^{2}}\right]+{2cL^{2}\log(n)\log(1/\beta)\eta}+\frac{cLD\sqrt{\log(1/\beta)}k}{\sqrt{n_{0}}}
=D2ηn0+CL2ηd2log2(1/β)n0ε2[i=2k2i]+2cL2log(n)log(1/β)η+cLDlog(1/β)kn0\displaystyle=\frac{D^{2}}{\eta n_{0}}+\frac{CL^{2}\eta d^{2}\log^{2}(1/\beta)}{n_{0}\varepsilon^{2}}\left[\sum_{i=2}^{k}2^{-i}\right]+{2cL^{2}\log(n)\log(1/\beta)\eta}+\frac{cLD\sqrt{\log(1/\beta)}k}{\sqrt{n_{0}}}
LDO(log(1/β)log(n)+log(1/β)log3/2(n)n+dlog(1/β)log(n)nε),\displaystyle\leq LD\cdot O\left(\frac{\sqrt{\log(1/\beta)\log(n)}+\sqrt{\log(1/\beta)}\log^{3/2}(n)}{\sqrt{n}}+\frac{d\log(1/\beta)\log(n)}{n\varepsilon}\right),

where the last inequality follows by choosing η=DLmin(1nlog(1/β),εdlog(1/β))\eta=\frac{D}{L}\min\left(\frac{1}{\sqrt{n\log(1/\beta)}},\frac{\varepsilon}{d\log(1/\beta)}\right).
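As a Monte Carlo sanity check of the Laplace norm tail bound used above, Pr(ζ2t)det/(dσ)\Pr(\|\zeta\|_{2}\geq t)\leq de^{-t/(\sqrt{d}\sigma)}, the following sketch (ours, with illustrative parameter values) samples Laplace vectors as differences of exponentials and compares the empirical tail with the bound:

```python
import math, random

random.seed(0)

def lap(scale):
    # Laplace(scale) as the difference of two Exp(1) samples, scaled
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def tail_check(d, sigma, t, trials=20000):
    """Empirical Pr(||zeta||_2 >= t) and the bound d * exp(-t/(sqrt(d)*sigma))."""
    hits = sum(
        1 for _ in range(trials)
        if math.sqrt(sum(lap(sigma) ** 2 for _ in range(d))) >= t
    )
    return hits / trials, d * math.exp(-t / (math.sqrt(d) * sigma))

emp, bound = tail_check(d=5, sigma=1.0, t=6.0)
assert emp <= bound + 0.01  # small slack for Monte Carlo error
```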

B.2 Proof of Proposition 2

Proposition 2 (restated).

The proof is similar to that of Proposition 1. For privacy, we showed in the proof of Proposition 1 that the 2\ell_{2}-sensitivity of x^i\hat{x}_{i} is upper bounded by 2L/λin04Lηi2L/\lambda_{i}n_{0}\leq 4L\eta_{i}, hence standard properties of the Gaussian mechanism [20] imply that xix_{i} is (ε,δ)(\varepsilon,\delta)-DP, which by post-processing implies that the final algorithm is (ε,δ)(\varepsilon,\delta)-DP.

The utility proof follows the same arguments as in the proof of Proposition 1, except that for ζi𝖭(0,σi2)\zeta_{i}\sim\mathsf{N}(0,\sigma_{i}^{2}) we have [30] (since ζi\zeta_{i} is 22σid2\sqrt{2}\sigma_{i}\sqrt{d}-norm-sub-Gaussian)

Pr(ζi2td)2et216σi2,\Pr(\|\zeta_{i}\|_{2}\geq t\sqrt{d})\leq 2e^{-\tfrac{t^{2}}{16\sigma_{i}^{2}}},

implying that ζi24dσilog(4/β)\|\zeta_{i}\|_{2}\leq 4\sqrt{d}\sigma_{i}\log(4/\beta) for all 1ik1\leq i\leq k with probability 1β/21-\beta/2.

B.3 Proofs of Theorems 2 and 3

We first restate Theorems 2 and 3.

Theorem 2 (restated).

Theorem 3 (restated).

We start by proving privacy. Since each sample sis_{i} is used in exactly one iterate, we only need to show that each iterate is (ε,δ)(\varepsilon,\delta)-DP, which will imply the main claim using post-processing. The privacy of each iterate follows directly from the privacy guarantees of Algorithm 1. We proceed to prove utility.

We will prove the utility claim assuming the subroutine used in Algorithm 2 satisfies the following: the output xk+1x_{k+1} has error

f(xk+1)minx𝒳f(x)Dkρ,f(x_{k+1})-\min_{x\in\mathcal{X}}f(x)\leq D_{k}\cdot\rho,

for some ρ>0\rho>0. Note that in our setting, Proposition 1 implies that ρLO(log(1/β)logn0n0+dlog(1/β)n0ε)\rho\leq L\cdot O(\frac{\sqrt{\log(1/\beta)}\log n_{0}}{\sqrt{n_{0}}}+\frac{d\log(1/\beta)}{n_{0}\varepsilon}) for pure-DP and similarly Proposition 2 gives the corresponding ρ\rho for (ε,δ)(\varepsilon,\delta)-DP.

The proof has two stages. In the first stage (Lemma B.1), we prove that as long as ii0i\leq i_{0} for some i0>0i_{0}>0, then x𝒳ix^{\star}\in\mathcal{X}_{i} and the performance of the algorithm keeps improving. We show that at the end of this stage, the point xi0+1x_{i_{0}+1} has optimal excess loss. Then, in the second stage (Lemma B.2), we show that the iterates do not move much, as the radius DiD_{i} of the domain is sufficiently small; hence the final accumulated error along these iterations is small.

Let us begin with the first stage. Let i0i_{0} be the largest ii such that Di(κ2κρλ)1κ1D_{i}\geq(\frac{\kappa 2^{\kappa}\rho}{\lambda})^{\frac{1}{\kappa-1}}. We prove that x𝒳ix^{\star}\in\mathcal{X}_{i} for all 0ii00\leq i\leq i_{0} where we recall that 𝒳i={x𝒳:xxi2Di}\mathcal{X}_{i}=\{x\in\mathcal{X}:\|x-x_{i}\|_{2}\leq D_{i}\} and Di=2iD0D_{i}=2^{-i}D_{0}.

Lemma B.1.

For all 0ii00\leq i\leq i_{0} we have

x𝒳i and f(xi0+1)minx𝒳f(x)4(2κ)1κ11λ1κ1ρκκ1.x^{\star}\in\mathcal{X}_{i}\quad\mbox{~{}~{}and~{}~{}}\quad f(x_{i_{0}+1})-\min_{x\in\mathcal{X}}f(x)\leq 4({2^{\kappa}})^{\frac{1}{\kappa-1}}\frac{1}{\lambda^{\frac{1}{\kappa-1}}}\rho^{\frac{\kappa}{\kappa-1}}.
Proof.

To prove the first part, we need to show that xix2Di\|x_{i}-x^{\star}\|_{2}\leq D_{i}. Let D¯i=xix2\bar{D}_{i}=\|x_{i}-x^{\star}\|_{2}. First, note that the claim is true for i=0i=0. Now we assume it is correct for 0ii010\leq i\leq i_{0}-1 and prove correctness for i+1i+1. Note that the growth condition implies

D¯i+1(κΔi/λ)1/κ,\bar{D}_{i+1}\leq(\kappa\Delta_{i}/\lambda)^{1/\kappa},

where Δi=f(xi+1)minx𝒳f(x)Diρ\Delta_{i}=f(x_{i+1})-\min_{x\in\mathcal{X}}f(x)\leq D_{i}\cdot\rho. Thus we have

D¯i+1(κDiρ/λ)1/κDi/2=Di+1,\bar{D}_{i+1}\leq(\kappa D_{i}\rho/\lambda)^{1/\kappa}\leq D_{i}/2=D_{i+1},

where the second inequality holds for ii that satisfies Di(κ2κρλ)1κ1D_{i}\geq(\frac{\kappa 2^{\kappa}\rho}{\lambda})^{\frac{1}{\kappa-1}}. This proves the first part of the claim. For the second part, note that the definition of i0i_{0} implies that Di02(κ2κρλ)1κ1D_{i_{0}}\leq 2(\frac{\kappa 2^{\kappa}\rho}{\lambda})^{\frac{1}{\kappa-1}}. Therefore, as x𝒳i0x^{\star}\in\mathcal{X}_{i_{0}} and the algorithm has error Di0ρD_{i_{0}}\cdot\rho, we have

f(xi0+1)minx𝒳f(x)\displaystyle f(x_{i_{0}+1})-\min_{x\in\mathcal{X}}f(x) Di0ρ\displaystyle\leq D_{i_{0}}\cdot\rho
2(κ2κ/λ)1κ1ρκκ1.\displaystyle\leq 2({\kappa 2^{\kappa}}/{\lambda})^{\frac{1}{\kappa-1}}\rho^{\frac{\kappa}{\kappa-1}}.

The claim now follows as κ1κ12\kappa^{\frac{1}{\kappa-1}}\leq 2. ∎
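The key inequality in the induction, (κDiρ/λ)1/κDi/2(\kappa D_{i}\rho/\lambda)^{1/\kappa}\leq D_{i}/2 whenever Di(κ2κρλ)1κ1D_{i}\geq(\frac{\kappa 2^{\kappa}\rho}{\lambda})^{\frac{1}{\kappa-1}}, can be checked numerically; the following sketch (ours) draws the parameters at random over illustrative ranges:

```python
import random

random.seed(1)

def halving_ok(kappa, rho, lam, D):
    """True iff (kappa*D*rho/lam)**(1/kappa) <= D/2, up to float slack,
    i.e. the minimizer stays inside the halved domain."""
    return (kappa * D * rho / lam) ** (1.0 / kappa) <= (D / 2) * (1 + 1e-9)

# Whenever D exceeds the threshold (kappa * 2**kappa * rho / lam)**(1/(kappa-1)),
# the distance bound for the next iterate halves.
for _ in range(1000):
    kappa = 1.1 + 2.9 * random.random()
    rho = 10.0 ** random.uniform(-4, 0)
    lam = 10.0 ** random.uniform(-2, 2)
    threshold = (kappa * 2 ** kappa * rho / lam) ** (1.0 / (kappa - 1.0))
    assert halving_ok(kappa, rho, lam, threshold * (1.0 + random.random()))
```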

We now proceed to the second stage. The following lemma shows that the accumulated error along the iterates i>i0i>i_{0} is small and therefore xTx_{T} obtains the same error as xi0+1x_{i_{0}+1} (up to constant factors).

Lemma B.2.

Assume the algorithm has error DiρD_{i}\cdot\rho. Let i0i_{0} be the largest ii such that Di(κ2κρλ)1κ1D_{i}\geq(\frac{\kappa 2^{\kappa}\rho}{\lambda})^{\frac{1}{\kappa-1}}. For all ii0+1i\geq i_{0}+1 we have

f(xi+1)f(xi)2(ii0)Di0ρ.f(x_{i+1})-f(x_{i})\leq 2^{-(i-i_{0})}D_{i_{0}}\rho.

In particular, for Ti0+1T\geq i_{0}+1 we have

f(xT)minx𝒳f(x)8(2κ/λ)1κ1ρκκ1.f(x_{T})-\min_{x\in\mathcal{X}}f(x)\leq 8({2^{\kappa}}/{\lambda})^{\frac{1}{\kappa-1}}\rho^{\frac{\kappa}{\kappa-1}}.
Proof.

Note that as xi𝒳ix_{i}\in\mathcal{X}_{i}, the guarantees of the algorithm give

f(xi+1)f(xi)Diρ=2(ii0)Di0ρ.f(x_{i+1})-f(x_{i})\leq D_{i}\rho=2^{-(i-i_{0})}D_{i_{0}}\rho.

For the second part of the claim, we have

f(xT)minx𝒳f(x)\displaystyle f(x_{T})-\min_{x\in\mathcal{X}}f(x) =f(xi0+1)minx𝒳f(x)+i=i0+1T1(f(xi+1)f(xi))\displaystyle=f(x_{i_{0}+1})-\min_{x\in\mathcal{X}}f(x)+\sum_{i=i_{0}+1}^{T-1}\left(f(x_{i+1})-f(x_{i})\right)
Di0ρ+i=i0+1T12(ii0)Di0ρ2Di0ρ.\displaystyle\leq D_{i_{0}}\rho+\sum_{i=i_{0}+1}^{T-1}2^{-(i-i_{0})}D_{i_{0}}\rho\leq 2D_{i_{0}}\rho.

The claim now follows as Di02(κ2κρλ)1κ1D_{i_{0}}\leq 2(\frac{\kappa 2^{\kappa}\rho}{\lambda})^{\frac{1}{\kappa-1}} and κ1κ12\kappa^{\frac{1}{\kappa-1}}\leq 2. ∎

Assuming Ti0+1T\geq i_{0}+1, Theorem 2 and Theorem 3 now follow immediately from Lemma B.2. Indeed, for the case of pure-DP (δ=0\delta=0), the choice of hyper-parameters in Algorithm 2 and the guarantees of Algorithm 1 (Proposition 1) imply that ρLO(log(1/β)logn0n0+Tdlog(1/β)n0ε)\rho\leq L\cdot O(\frac{\sqrt{\log(1/\beta)}\log n_{0}}{\sqrt{n_{0}}}+\frac{Td\log(1/\beta)}{n_{0}\varepsilon}), which proves Theorem 2. Similarly, Theorem 3 follows by using the guarantees of Algorithm 1 for approximate (ε,δ)(\varepsilon,\delta)-DP, that is, Proposition 2, which gives ρLO(log(1/β)logn0n0+Tdlog(1/δ)log(1/β)n0ε)\rho\leq L\cdot O\left(\frac{\sqrt{\log(1/\beta)}\log n_{0}}{\sqrt{n_{0}}}+\frac{T\sqrt{d\log(1/\delta)}\log(1/\beta)}{n_{0}\varepsilon}\right). Note that our choice of stepsize at each iterate implies that Theorem 2 guarantees the desired utility with probability at least 1β21-\beta^{2}, hence the final utility guarantee holds with probability at least 1Tβ21β1-T\beta^{2}\geq 1-\beta.

It remains to verify Ti0+1T\geq i_{0}+1. Note that by choosing T2log(D0κ1λ/ρ)κ¯1T\geq\frac{2\log(D_{0}^{\kappa-1}\lambda/\rho)}{\underline{\kappa}-1}, we get that DT(κ2κρλ)1κ1D_{T}\leq(\frac{\kappa 2^{\kappa}\rho}{\lambda})^{\frac{1}{\kappa-1}}, hence Ti0+1T\geq i_{0}+1. As we have ρL/n0\rho\geq L/\sqrt{n_{0}} (non-private error) and D0κ1L/λD_{0}^{\kappa-1}\leq L/\lambda in our setting, we get that choosing T=2lognκ¯1T=\frac{2\log n}{\underline{\kappa}-1} gives the claim.
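The two-phase behavior of this argument is easy to simulate numerically. The sketch below uses toy values for κ\kappa, λ\lambda, and the per-round error ρ\rho (hypothetical stand-ins; no private algorithm is run) and checks that the halved radii trap the optimum until they reach the threshold (κ2κρ/λ)1κ1(\frac{\kappa 2^{\kappa}\rho}{\lambda})^{\frac{1}{\kappa-1}}, at which point the final error matches the rate of Lemma B.2.

```python
# Toy simulation of the localization recursion of Lemmas B.1/B.2.
# kappa, lam (growth constants) and rho (per-round error rate) are
# arbitrary toy values, not outputs of the private algorithm.
kappa, lam, rho, D0 = 1.5, 1.0, 1e-3, 1.0

# Radius threshold below which halving could lose the optimum.
threshold = (kappa * 2 ** kappa * rho / lam) ** (1.0 / (kappa - 1))

D, i0 = D0, None
for i in range(200):
    if D < threshold:
        i0 = i  # first phase ends here
        break
    # Lemma B.1: growth + per-round error D*rho keep x* in the halved domain.
    dist_bound = (kappa * D * rho / lam) ** (1.0 / kappa)
    assert dist_bound <= D / 2 + 1e-15, "optimum escaped the halved domain"
    D /= 2

# Lemma B.2: the final error is O(rho^{kappa/(kappa-1)}) up to constants.
final_error_bound = 2 * D * rho
rate = 8 * (2 ** kappa / lam) ** (1 / (kappa - 1)) * rho ** (kappa / (kappa - 1))
assert final_error_bound <= rate
print(i0, final_error_bound, rate)
```

The number of rounds i0i_{0} grows only logarithmically in 1/ρ1/\rho, consistent with the choice T=O(logn)T=O(\log n) above.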

Appendix C Proofs of Section 5

In this section, we provide the proofs for our lower bounds under privacy constraints for functions with growth. This section is organized as follows: we prove the lower bounds under pure-DP in Section C.1 and the lower bounds under approximate-DP in Section C.2. Within Section C.1, we distinguish between κ2\kappa\geq 2 (Section C.1.1) and κ(1,2)\kappa\in(1,2) (Section C.1.2).

C.1 Proofs of Section 5.1

C.1.1 Proof of Theorem 4

As previewed in the main text, the proof combines the (non-private) information-theoretic lower bound of Theorem 8 with the (private) lower bound on ERM of Theorem 9. Finally, we show in Proposition 4 that privately solving SCO is harder than privately solving ERM, concluding the proof of the theorem. We restate the theorem and prove these results in sequence.

See 4

Non-private lower bound

We begin the proof of Theorem 4 by proving a (non-private) information-theoretic lower bound for minimizing functions with κ2\kappa\geq 2-growth. We use the standard reduction from estimation to testing [see 33, Appendix A.1] in conjunction with Fano’s method [42, 44].

Theorem 8 (Non-private lower bound).

Let d1d\geq 1, 𝒳=𝔹2d(R)\mathcal{X}=\mathbb{B}_{2}^{d}(R), 𝕊={±ej}jd\mathbb{S}=\{\pm e_{j}\}_{j\leq d}, κ2\kappa\geq 2 and nn\in\mathbb{N}. Let 𝒫\mathcal{P} be the set of distributions on 𝕊\mathbb{S}. Assume that

2κ1Lλ1Rκ12κ196n.2^{\kappa-1}\leq\frac{L}{\lambda}\frac{1}{R^{\kappa-1}}\leq 2^{\kappa-1}\sqrt{96n}.

The following lower bound holds

𝔐n(𝒳,𝒫,κ)1λ1κ1(Ln)κ(κ1).\mathfrak{M}_{n}(\mathcal{X},\mathcal{P},\mathcal{F}^{\kappa})\gtrsim\frac{1}{\lambda^{\tfrac{1}{\kappa-1}}}\left(\frac{L}{\sqrt{n}}\right)^{\tfrac{\kappa}{(\kappa-1)}}.
Proof.

For 𝒱{±1}d\mathcal{V}\subset\{\pm 1\}^{d} let us consider the following function and distribution

F(x;s)λ2κ2κx2κ+L2x,s and XPv implies Xj={vjej w.p. 1+δ2vjej w.p. 1δ2.F(x;s)\coloneqq\frac{\lambda 2^{\kappa-2}}{\kappa}\|x\|_{2}^{\kappa}+\frac{L}{2}\langle x,s\rangle\mbox{~{}~{}and~{}~{}}X\sim P_{v}\mbox{~{}~{}implies~{}~{}}X_{j}=\begin{cases}v_{j}e_{j}\mbox{~{}~{}w.p.~{}~{}}\frac{1+\delta}{2}\\ -v_{j}e_{j}\mbox{~{}~{}w.p.~{}~{}}\frac{1-\delta}{2}.\end{cases}

Since the linear term does not affect uniform convexity, Lemma 4 in [39] guarantees that fvf_{v} is (λ,κ)(\lambda,\kappa)-uniformly convex. Furthermore, for s𝕊s\in\mathbb{S}

F(x;s)2λ2κ2Rκ1+L2L,\|\nabla F(x;s)\|_{2}\leq\lambda 2^{\kappa-2}R^{\kappa-1}+\frac{L}{2}\leq L,

by assumption, so the functions are LL-Lipschitz and satisfy Assumption 1.

Computing the separation. As 𝔼PvS=δdv\mathbb{E}_{P_{v}}S=\tfrac{\delta}{d}v, we have

fv(x)=λ2κ2κx2κ+Lδ2dx,v.f_{v}(x)=\frac{\lambda 2^{\kappa-2}}{\kappa}\|x\|_{2}^{\kappa}+\frac{L\delta}{2d}\langle x,v\rangle.

Note that for ud,σ>0u\in\mathbb{R}^{d},\sigma>0, it holds that

infxdσx2κκ+x,u=1κ(1σ)1κ1uκκ1 at xu=(1σ)1κ1(1u2)κ2κ1u.\inf_{x\in\mathbb{R}^{d}}\sigma\frac{\|x\|_{2}^{\kappa}}{\kappa}+\langle x,u\rangle=-\frac{1}{\kappa^{\star}}\left(\frac{1}{\sigma}\right)^{\tfrac{1}{\kappa-1}}\|u\|^{\tfrac{\kappa}{\kappa-1}}\mbox{~{}~{}at~{}~{}}x_{u}^{\star}=-\left(\frac{1}{\sigma}\right)^{\tfrac{1}{\kappa-1}}\left(\frac{1}{\|u\|_{2}}\right)^{\tfrac{\kappa-2}{\kappa-1}}u.

To make sure that xu𝔹2d(R)x_{u}^{\star}\in\mathbb{B}_{2}^{d}(R), we require u2σRκ1\|{u}\|_{2}\leq\sigma R^{\kappa-1}. After choosing δ\delta, we will see that this holds under the assumptions of the theorem. Let us consider the Gilbert-Varshamov packing of the hypercube: there exists 𝒱{±1}d\mathcal{V}\subset\left\{\pm 1\right\}^{d} such that |𝒱|exp(d/8)\lvert\mathcal{V}\rvert\geq\exp(d/8) and dHam(v,v)d/4d_{\rm Ham}(v,v^{\prime})\geq d/4 for all vv𝒱v\neq v^{\prime}\in\mathcal{V}. Let us compute the separation

infx𝔹2d(R)fv(x)+fv(x)2=14κλ1κ1(Lδd)κκ1v+v22κκ1\inf_{x\in\mathbb{B}_{2}^{d}(R)}\frac{f_{v}(x)+f_{v^{\prime}}(x)}{2}=-\frac{1}{4\kappa^{\star}\lambda^{\tfrac{1}{\kappa-1}}}\left(\frac{L\delta}{d}\right)^{\tfrac{\kappa}{\kappa-1}}\left\|\frac{v+v^{\prime}}{2}\right\|_{2}^{\tfrac{\kappa}{\kappa-1}}

Note that (v+v)/22=ddHam(v,v)3d/4\|(v+v^{\prime})/2\|_{2}=\sqrt{d-d_{\rm Ham}(v,v^{\prime})}\leq\sqrt{3d/4}. This yields a separation

d𝗈𝗉𝗍(v,v,𝒳)1(3/4)κ/(2κ2)2κλ1κ1(Lδd)κκ1.d_{\mathsf{opt}}(v,v^{\prime},\mathcal{X})\geq\frac{1-(3/4)^{\kappa/(2\kappa-2)}}{2\kappa^{\star}\lambda^{\tfrac{1}{\kappa-1}}}\left(\frac{L\delta}{\sqrt{d}}\right)^{\tfrac{\kappa}{\kappa-1}}.
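The Gilbert-Varshamov guarantee invoked above can be checked constructively in small dimension: the greedy sketch below (with the arbitrary toy choice d=12d=12) builds a subset of the hypercube with pairwise Hamming distance at least d/4d/4 and verifies that it contains at least exp(d/8)\exp(d/8) points.

```python
import math

def greedy_hypercube_packing(d, min_dist):
    """Greedily collect points of {0,1}^d (as bitmasks) whose pairwise
    Hamming distance is at least min_dist."""
    code = []
    for x in range(2 ** d):
        if all(bin(x ^ c).count("1") >= min_dist for c in code):
            code.append(x)
    return code

d = 12  # toy dimension; the greedy pass is exponential in d
code = greedy_hypercube_packing(d, d // 4)

# Gilbert-Varshamov-style guarantee: at least exp(d/8) points survive.
assert len(code) >= math.exp(d / 8)
assert all(bin(a ^ b).count("1") >= d // 4
           for a in code for b in code if a != b)
print(len(code))
```

The greedy construction in fact achieves size at least 2d/Vol(𝔹Ham(d/41))2^{d}/\mathrm{Vol}(\mathbb{B}_{\rm Ham}(d/4-1)), which dominates exp(d/8)\exp(d/8); the probabilistic argument in the literature gives the same conclusion without enumeration.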

Lower bounding the testing error. In the case of a multiple hypothesis test, we use Fano’s method and for V𝖴𝗇𝗂{𝒱}V\sim\mathsf{Uni}\{\mathcal{V}\} and S1n|V=viidPvS_{1}^{n}|V=v\stackrel{{\scriptstyle\rm iid}}{{\sim}}P_{v}, Fano’s inequality guarantees

infψ:𝕊n𝒱Pr(ψ(S1n)V)1𝖨(S1n;V)+log2log|𝒱|,\inf_{\psi:\mathbb{S}^{n}\to\mathcal{V}}\Pr(\psi(S_{1}^{n})\neq V)\geq 1-\frac{\mathsf{I}(S_{1}^{n};V)+\log 2}{\log\lvert\mathcal{V}\rvert},

where 𝖨(X;Y)\mathsf{I}(X;Y) is the Shannon mutual information between XX and YY. In our case, we have log|𝒱|d/8\log\lvert\mathcal{V}\rvert\geq d/8 and 𝖨(S1n;V)nmaxvvDkl(Pv||Pv)3nδ2\mathsf{I}(S_{1}^{n};V)\leq n\max_{v\neq v^{\prime}}D_{\rm kl}({P_{v}}|\!|{P_{v^{\prime}}})\leq 3n\delta^{2}. In the case d48log2d\geq 48\log 2, we choose δ=d/(24n)\delta=\sqrt{d/(24n)}. We handle the one-dimensional case thereafter. For this δ\delta, we have

𝔐n(𝒳,𝒫,κ)1(34)κ2κ24κ(24)κ2κ21λ1κ1(L2n)κ2κ2.\mathfrak{M}_{n}(\mathcal{X},\mathcal{P},\mathcal{F}^{\kappa})\geq\frac{1-\left(\tfrac{3}{4}\right)^{\tfrac{\kappa}{2\kappa-2}}}{4\kappa^{\star}(24)^{\tfrac{\kappa}{2\kappa-2}}}\frac{1}{\lambda^{\tfrac{1}{\kappa-1}}}\left(\frac{L^{2}}{n}\right)^{\tfrac{\kappa}{2\kappa-2}}.

For this choice of δ\delta, the assumption on nn ensures that the minimum remains in 𝔹2d(R)\mathbb{B}_{2}^{d}(R).

One-dimensional lower bound with Le Cam’s method. Since Fano’s method requires d48log2d\geq 48\log 2, we finish the proof by providing a lower bound for d=1d=1 using Le Cam’s method. We use the same family of functions in one dimension, i.e. 𝕊={±1}\mathbb{S}=\{\pm 1\}, v{±1}v\in\{\pm 1\} and for δ[0,1]\delta\in[0,1] define

F(x;s)=λ2κ2κ|x|κ+L2sx and XPv implies X={v w.p. 1+δ2v w.p. 1δ2.F(x;s)=\frac{\lambda 2^{\kappa-2}}{\kappa}\lvert x\rvert^{\kappa}+\frac{L}{2}s\cdot x\mbox{~{}~{}and~{}~{}}X\sim P_{v}\mbox{~{}~{}implies~{}~{}}X=\begin{cases}v&\mbox{~{}~{}w.p.~{}~{}}\frac{1+\delta}{2}\\ -v&\mbox{~{}~{}w.p.~{}~{}}\frac{1-\delta}{2}.\end{cases}

As this is the one-dimensional analog of the previous construction, FF remains LL-Lipschitz and ff has (λ,κ)(\lambda,\kappa)-growth. A calculation yields that the separation is

d𝗈𝗉𝗍(1,1,𝒳)12λ1κ1(Lδ)κκ1,d_{\mathsf{opt}}(1,-1,\mathcal{X})\geq\frac{1}{2\lambda^{\tfrac{1}{\kappa-1}}}\left(L\delta\right)^{\tfrac{\kappa}{\kappa-1}},

where we used that κ[1,2]\kappa^{\star}\in[1,2]. For V𝖴𝗇𝗂{1,1}V\sim\mathsf{Uni}\{-1,1\} and S1n|V=viidPvS_{1}^{n}|V=v\stackrel{{\scriptstyle\rm iid}}{{\sim}}P_{v}, Le Cam's lemma in conjunction with Pinsker's inequality yields that

infψ:𝕊n{1,1}Pr(ψ(S1n)V)=12(1P1nP1nTV)12(1n2Dkl(P1||P1)).\inf_{\psi:\mathbb{S}^{n}\to\{-1,1\}}\Pr(\psi(S_{1}^{n})\neq V)=\frac{1}{2}(1-\|P_{1}^{n}-P_{-1}^{n}\|_{\rm TV})\geq\frac{1}{2}(1-\sqrt{\tfrac{n}{2}D_{\rm kl}({P_{1}}|\!|{P_{-1}})}).

In our case, we have Dkl(P1||P1)=δlog1+δ1δ3δ2D_{\rm kl}({P_{1}}|\!|{P_{-1}})=\delta\log\frac{1+\delta}{1-\delta}\leq 3\delta^{2} for δ[0,1/2]\delta\in[0,1/2]. We set δ=1/6n\delta=1/\sqrt{6n}, which yields the final result in one dimension

𝔐n([1,1],𝒫,d=1κ)=Ω(1λ1κ1(Ln)κκ1).\mathfrak{M}_{n}([-1,1],\mathcal{P},\mathcal{F}^{\kappa}_{d=1})=\Omega\left(\frac{1}{\lambda^{\tfrac{1}{\kappa-1}}}\left(\frac{L}{\sqrt{n}}\right)^{\tfrac{\kappa}{\kappa-1}}\right).
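Both the Fano and the Le Cam steps rest on the elementary bound Dkl(P1||P1)=δlog1+δ1δ3δ2D_{\rm kl}({P_{1}}|\!|{P_{-1}})=\delta\log\frac{1+\delta}{1-\delta}\leq 3\delta^{2} for δ[0,1/2]\delta\in[0,1/2]; a quick numerical check of this inequality on a grid:

```python
import math

# Verify delta * log((1+delta)/(1-delta)) <= 3 * delta^2 on a grid of (0, 1/2].
worst_ratio = 0.0
for k in range(1, 501):
    delta = k / 1000.0
    kl = delta * math.log((1 + delta) / (1 - delta))
    assert kl <= 3 * delta ** 2, delta
    worst_ratio = max(worst_ratio, kl / (3 * delta ** 2))
print(worst_ratio)  # stays strictly below 1
```

The ratio is largest at δ=1/2\delta=1/2, so the constant 33 has some slack; for small δ\delta the Taylor expansion gives δlog1+δ1δ2δ2\delta\log\frac{1+\delta}{1-\delta}\approx 2\delta^{2}.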

Privatizing the lower bound via a packing argument

We now show how this construction yields a private lower bound via a packing argument. For d1d\geq 1, considering the ERM problem, the following private lower bound holds.

Theorem 9 (Private lower bound for ERM).

Let d1,𝒳=𝔹2d(R)d\geq 1,\mathcal{X}=\mathbb{B}_{2}^{d}(R), 𝕊={±ej}jd\mathbb{S}=\{\pm e_{j}\}_{j\leq d}, κ2\kappa\geq 2 and nn\in\mathbb{N}. Let 𝒫\mathcal{P} be the set of distributions on 𝕊\mathbb{S}. Assume that

2κ1Lλ1Rκ12κ196n.2^{\kappa-1}\leq\frac{L}{\lambda}\frac{1}{R^{\kappa-1}}\leq 2^{\kappa-1}\sqrt{96n}.

Then any ε\varepsilon-DP algorithm 𝖠\mathsf{A} has

sup𝒮𝕊n𝔼[f𝒮(𝖠(𝒮))minx𝒳f𝒮(x)]1λ1κ1(Ldnε)κκ1.\sup_{\mathcal{S}\in\mathbb{S}^{n}}\mathbb{E}\left[f_{\mathcal{S}}(\mathsf{A}(\mathcal{S}))-\min_{x\in\mathcal{X}}f_{\mathcal{S}}(x)\right]\gtrsim\frac{1}{\lambda^{\frac{1}{\kappa-1}}}\left(\frac{Ld}{n\varepsilon}\right)^{\frac{\kappa}{\kappa-1}}.
Proof.

First, note that it is enough to prove the following lower bound

sup𝒮𝕊n𝔼[𝖠(𝒮)x2]1λ1κ1(Ldnε)1κ1.\sup_{\mathcal{S}\in\mathbb{S}^{n}}\mathbb{E}\left[\|\mathsf{A}(\mathcal{S})-x^{\star}\|_{2}\right]\gtrsim\frac{1}{\lambda^{\frac{1}{\kappa-1}}}\left(\frac{Ld}{n\varepsilon}\right)^{\frac{1}{\kappa-1}}. (9)

Indeed, this implies that

sup𝒮𝕊n𝔼[f𝒮(𝖠(𝒮))minx𝒳f𝒮(x)]\displaystyle\sup_{\mathcal{S}\in\mathbb{S}^{n}}\mathbb{E}\left[f_{\mathcal{S}}(\mathsf{A}(\mathcal{S}))-\min_{x\in\mathcal{X}}f_{\mathcal{S}}(x)\right] λκsup𝒮𝕊n𝔼[𝖠(𝒮)x2κ]\displaystyle\geq\frac{\lambda}{\kappa}\sup_{\mathcal{S}\in\mathbb{S}^{n}}\mathbb{E}\left[\|\mathsf{A}(\mathcal{S})-x^{\star}\|_{2}^{\kappa}\right]
1κλ1κ1(Ldnε)κκ1.\displaystyle\gtrsim\frac{1}{\kappa\lambda^{\frac{1}{\kappa-1}}}\left(\frac{Ld}{n\varepsilon}\right)^{\frac{\kappa}{\kappa-1}}.

Let us now prove the lower bound (9). To this end, we consider the function F(x;s)λ2κ2κx2κ+L2x,sF(x;s)\coloneqq\frac{\lambda 2^{\kappa-2}}{\kappa}\|x\|_{2}^{\kappa}+\frac{L}{2}\langle x,s\rangle where s21\|s\|_{2}\leq 1. We now construct MM datasets 𝒮1,,𝒮M\mathcal{S}_{1},\dots,\mathcal{S}_{M} as follows. Let v1,,vM{±1d}dv_{1},\dots,v_{M}\in\left\{\pm\frac{1}{\sqrt{d}}\right\}^{d} be the Gilbert-Varshamov packing of the hypercube: that is, Mexp(d/8)M\geq\exp(d/8) and dHam(vi,vj)d/4d_{\rm Ham}(v_{i},v_{j})\geq d/4 for all iji\neq j. We define 𝒮i=(vi,,vid/20ε,0,,0)\mathcal{S}_{i}=(\underbrace{v_{i},\dots,v_{i}}_{d/20\varepsilon},0,\dots,0). Note that dHam(Si,Sj)d/20εd_{\rm Ham}(S_{i},S_{j})\leq d/20\varepsilon and that f(x;𝒮i)=λ2κ2κx2κ+L2d20nεx,vif(x;\mathcal{S}_{i})=\frac{\lambda 2^{\kappa-2}}{\kappa}\|x\|_{2}^{\kappa}+\frac{L}{2}\frac{d}{20n\varepsilon}\langle x,v_{i}\rangle, hence

xi=(1λ2κ2)1κ1(40nεLd)κ2κ1Ld40nεvi.x_{i}^{\star}=-\left(\frac{1}{\lambda 2^{\kappa-2}}\right)^{\tfrac{1}{\kappa-1}}\left(\frac{40n\varepsilon}{Ld}\right)^{\tfrac{\kappa-2}{\kappa-1}}\frac{Ld}{40n\varepsilon}v_{i}.

Therefore we have

xixj22\displaystyle\|x_{i}^{\star}-x_{j}^{\star}\|_{2}^{2} (1λ2κ2)2κ1(40nεLd)2(κ2)κ1L2d21600n2ε2\displaystyle\geq\left(\frac{1}{\lambda 2^{\kappa-2}}\right)^{\tfrac{2}{\kappa-1}}\left(\frac{40n\varepsilon}{Ld}\right)^{\tfrac{2(\kappa-2)}{\kappa-1}}\frac{L^{2}d^{2}}{1600n^{2}\varepsilon^{2}}
(1λ2κ2)2κ1(Ldnε)2κ1ρ2.\displaystyle\gtrsim\left(\frac{1}{\lambda 2^{\kappa-2}}\right)^{\tfrac{2}{\kappa-1}}\left(\frac{Ld}{n\varepsilon}\right)^{\tfrac{2}{\kappa-1}}\coloneqq\rho^{2}.

We are now ready to finish the proof using packing-based arguments [27]. Assume by contradiction there is an ε\varepsilon-DP algorithm 𝖠\mathsf{A} such that

sup1iM𝔼[𝖠(𝒮i)xi2]ρ/20.\sup_{1\leq i\leq M}\mathbb{E}\left[\|\mathsf{A}(\mathcal{S}_{i})-x_{i}^{\star}\|_{2}\right]\leq\rho/20.

Let Bi={x𝒳:xxi2ρ/2}B_{i}=\{x\in\mathcal{X}:\|x-x_{i}^{\star}\|_{2}\leq\rho/2\}. Note that the sets BiB_{i} are disjoint and that Markov's inequality implies

Pr(𝖠(𝒮i)Bi)=Pr(𝖠(𝒮i)xi2ρ/2)9/10.\Pr(\mathsf{A}(\mathcal{S}_{i})\in B_{i})=\Pr(\|\mathsf{A}(\mathcal{S}_{i})-x_{i}^{\star}\|_{2}\leq\rho/2)\geq 9/10.

Thus, the privacy constraint now gives

1\displaystyle 1 i=1MPr(𝖠(𝒮1)Bi)\displaystyle\geq\sum_{i=1}^{M}\Pr(\mathsf{A}(\mathcal{S}_{1})\in B_{i})
Pr(𝖠(𝒮1)B1)+ed/20i=2MPr(𝖠(𝒮i)Bi)\displaystyle\geq\Pr(\mathsf{A}(\mathcal{S}_{1})\in B_{1})+e^{-d/20}\sum_{i=2}^{M}\Pr(\mathsf{A}(\mathcal{S}_{i})\in B_{i})
910(1+ed/20(M1)),\displaystyle\geq\frac{9}{10}(1+e^{-d/20}(M-1)),

where the second inequality follows since dHam(𝒮i,𝒮j)d/20εd_{\rm Ham}(\mathcal{S}_{i},\mathcal{S}_{j})\leq d/20\varepsilon. This gives a contradiction for d20d\geq 20 as Mexp(d/8)M\geq\exp(d/8). For d=1d=1, we can repeat the same arguments with M=2M=2 to get the desired lower bound. ∎
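The contradiction in the final display is pure arithmetic; one can confirm numerically that (9/10)(1+ed/20(M1))>1(9/10)(1+e^{-d/20}(M-1))>1 with M=exp(d/8)M=\exp(d/8) for every d20d\geq 20:

```python
import math

# The packing bound is violated once (9/10)*(1 + e^{-d/20}*(M-1)) > 1,
# with M = exp(d/8) codewords from the Gilbert-Varshamov packing.
for d in range(20, 500):
    M = math.exp(d / 8)
    lhs = 0.9 * (1 + math.exp(-d / 20) * (M - 1))
    assert lhs > 1, d
print("contradiction holds for all d in [20, 500)")
```

For larger dd the left-hand side grows like ed(1/81/20)=e3d/40e^{d(1/8-1/20)}=e^{3d/40}, so the contradiction only strengthens.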

Reduction from ε\varepsilon-DP ERM to ε\varepsilon-DP SCO

We conclude the proof of the theorem by proving that SCO under privacy constraints is harder than ERM. This is similar to Appendix C in [8], but we require it for pure-DP constraints. We make this formal in the following proposition.

Proposition 4.

Let 0<β1/n0<\beta\leq 1/n. Assume 𝖠\mathsf{A} is an ε2log(2/β)\frac{\varepsilon}{2\log(2/\beta)}-DP algorithm that for a sample 𝒮=(S1,,Sn)\mathcal{S}=(S_{1},\ldots,S_{n}) with S1niidPS_{1}^{n}\stackrel{{\scriptstyle\rm iid}}{{\sim}}P achieves with probability 1β/21-\beta/2 error

f(𝖠(𝒮))minx𝒳f(x)γ.f(\mathsf{A}(\mathcal{S}))-\min_{x\in\mathcal{X}}f(x)\leq\gamma.

Then there is an ε\varepsilon-DP algorithm 𝖠\mathsf{A}^{\prime} such that for any dataset 𝒮𝕊n\mathcal{S}\in\mathbb{S}^{n} has with probability 1β1-\beta,

f𝒮(𝖠(𝒮))minx𝒳f𝒮(x)γ.f_{\mathcal{S}}(\mathsf{A}^{\prime}(\mathcal{S}))-\min_{x\in\mathcal{X}}f_{\mathcal{S}}(x)\leq\gamma.
Proof.

Given the algorithm 𝖠\mathsf{A}, we define 𝖠\mathsf{A}^{\prime} as follows. For an input 𝒮𝕊n\mathcal{S}\in\mathbb{S}^{n}, let P𝒮P_{\mathcal{S}} be the empirical distribution of 𝒮\mathcal{S}. Then, 𝖠\mathsf{A}^{\prime} proceeds as follows:

  1. Sample a new dataset 𝒮1=(S1,,Sn)\mathcal{S}_{1}=(S_{1}^{\prime},\dots,S^{\prime}_{n}) where SiP𝒮S^{\prime}_{i}\sim P_{\mathcal{S}}.

  2. If there is a sample SiS_{i} that was sampled more than k=2log(2/β)k=2\log(2/\beta) times, return 0.

  3. Else, return 𝖠(𝒮1)\mathsf{A}(\mathcal{S}_{1}).

We need to prove that 𝖠\mathsf{A}^{\prime} is ε\varepsilon-DP and that it has the desired utility. For utility, note that 𝖠\mathsf{A}^{\prime} returns 0 at step 22 with probability at most β/2\beta/2, since we have for every 1in1\leq i\leq n

Pr(si used more than k times)\displaystyle\Pr\left(s_{i}\mbox{~{}used more than $k$ times}\right) =Pr(j=1nZjk)\displaystyle=\Pr\left(\sum_{j=1}^{n}Z_{j}\geq k\right)
2kβ2/2,\displaystyle\leq 2^{-k}\leq\beta^{2}/2,

where Zj𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p)Z_{j}\sim\mathsf{Bernoulli}(p) with p=1/np=1/n, and the second inequality follows from Chernoff [36, Thm. 4.4] and β1/10\beta\leq 1/10. Applying a union bound over all samples, we get that step 22 returns 0 with probability at most β/2\beta/2 as β1/n\beta\leq 1/n. Moreover, Algorithm 𝖠\mathsf{A} fails with probability at most β/2\beta/2. Therefore, as f𝒮(x)=𝔼SP𝒮[F(x;S)]f_{\mathcal{S}}(x)=\mathbb{E}_{S\sim P_{\mathcal{S}}}[F(x;S)], we have with probability at least 1β1-\beta,

f𝒮(𝖠(𝒮))minx𝒳f𝒮(x)γ.f_{\mathcal{S}}(\mathsf{A}^{\prime}(\mathcal{S}))-\min_{x\in\mathcal{X}}f_{\mathcal{S}}(x)\leq\gamma.

Let us now prove privacy. Assume we run algorithm 𝖠\mathsf{A}^{\prime} on two neighboring datasets 𝒮,𝒮\mathcal{S},\mathcal{S}^{\prime}, and let 𝒮1,𝒮1\mathcal{S}_{1},\mathcal{S}^{\prime}_{1} be the datasets produced at step 11. Let BB denote the event that there was a sample sis_{i} that was used more than kk times (note that this does not depend on the input). Then for any measurable 𝒪\mathcal{O},

Pr(𝖠(𝒮)𝒪)\displaystyle\Pr(\mathsf{A}^{\prime}(\mathcal{S})\in\mathcal{O}) =Pr(𝖠(𝒮)𝒪B)Pr(B)+Pr(𝖠(𝒮)𝒪Bc)Pr(Bc)\displaystyle=\Pr(\mathsf{A}^{\prime}(\mathcal{S})\in\mathcal{O}\mid B)\Pr(B)+\Pr(\mathsf{A}^{\prime}(\mathcal{S})\in\mathcal{O}\mid B^{c})\Pr(B^{c})
eεPr(𝖠(𝒮)𝒪B)Pr(B)+Pr(𝖠(𝒮)𝒪Bc)Pr(Bc)\displaystyle\leq e^{\varepsilon}\Pr(\mathsf{A}^{\prime}(\mathcal{S}^{\prime})\in\mathcal{O}\mid B)\Pr(B)+\Pr(\mathsf{A}^{\prime}(\mathcal{S}^{\prime})\in\mathcal{O}\mid B^{c})\Pr(B^{c})
eεPr(𝖠(𝒮)𝒪),\displaystyle\leq e^{\varepsilon}\Pr(\mathsf{A}^{\prime}(\mathcal{S}^{\prime})\in\mathcal{O}),

where the first inequality follows from group privacy since dHam(𝒮1,𝒮1)kd_{\rm Ham}(\mathcal{S}_{1},\mathcal{S}^{\prime}_{1})\leq k and 𝖠\mathsf{A} is ε/k\varepsilon/k-DP. This completes the proof.
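A minimal sketch of the reduction defining 𝖠\mathsf{A}^{\prime}, where `solve_privately` is a hypothetical stand-in for the (ε/k)(\varepsilon/k)-DP algorithm 𝖠\mathsf{A}; the sketch only illustrates the resample-and-truncate control flow and adds no privacy noise itself.

```python
import math
import random
from collections import Counter

def erm_to_sco_reduction(sample, solve_privately, beta, rng=random):
    """Sketch of A' from Proposition 4: resample from the empirical
    distribution, bail out if any point repeats more than k times,
    else run the given SCO solver on the resample."""
    n = len(sample)
    k = 2 * math.log2(2 / beta)                         # multiplicity cutoff
    resample = [rng.choice(sample) for _ in range(n)]   # step 1: S' ~ P_S
    # Step 2: the proof tracks multiplicities of sample points; the toy
    # data below is distinct, so counting values is equivalent.
    if max(Counter(resample).values()) > k:
        return 0.0
    return solve_privately(resample)                    # step 3: run A

# Toy usage: `solve_privately` is a hypothetical stand-in that just
# returns the resample mean (a real instantiation would be private).
rng = random.Random(0)
data = [float(i) for i in range(100)]
out = erm_to_sco_reduction(data, lambda s: sum(s) / len(s), beta=1 / 100, rng=rng)
print(out)
```

The cutoff k=2log(2/β)k=2\log(2/\beta) is the quantity through which group privacy degrades 𝖠\mathsf{A}'s guarantee by at most a factor kk, as in the privacy argument above.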

C.1.2 Proof of Theorem 5

See 5

Proof.

We follow the same reduction that we used in the proof of Theorem 8. For δ[0,1/2]\delta\in[0,1/2] and v{±1}v\in\{\pm 1\}, we again consider the distribution PvP_{v} on 𝕊={±1}\mathbb{S}=\{\pm 1\} with Pv(S=1)=1+δv2P_{v}(S=1)=\frac{1+\delta v}{2} and Pv(S=1)=1δv2P_{v}(S=-1)=\frac{1-\delta v}{2}. For a[0,1]a\in[0,1] to be defined later, we construct the following function

F(x;+1)={|xa| if xa|xa|κ if xa and F(x;1)={|x+a|κ if xa|x+a| if xaF(x;+1)=\begin{cases}\lvert x-a\rvert&\mbox{~{}~{}if~{}~{}}x\leq a\\ \lvert x-a\rvert^{\kappa}&\mbox{~{}~{}if~{}~{}}x\geq a\end{cases}\mbox{~{}~{}and~{}~{}}F(x;-1)=\begin{cases}\lvert x+a\rvert^{\kappa}&\mbox{~{}~{}if~{}~{}}x\leq-a\\ \lvert x+a\rvert&\mbox{~{}~{}if~{}~{}}x\geq-a\end{cases}

Computing the separation. First, let us compute the separation d𝗈𝗉𝗍(v,v,𝒳)d_{\mathsf{opt}}(v,v^{\prime},\mathcal{X}). We will then choose aa to ensure fvf_{v} has κ\kappa-growth. By symmetry, assume v=1v=1. fvf_{v} is increasing on [a,1][a,1] and decreasing on [1,a][-1,-a], thus the minimum belongs to [a,a][-a,a] and by inspection, is attained at x=ax=a with value a(1δ)a(1-\delta). Similarly, the minimum of f+1(x)+f1(x)f_{+1}(x)+f_{-1}(x) is attained on [a,a][-a,a] with value 2a2a. This yields

d𝗈𝗉𝗍(v,v,𝒳)=2a2a(1δ)=2aδ.d_{\mathsf{opt}}(v,v^{\prime},\mathcal{X})=2a-2a(1-\delta)=2a\delta.

Let us now pick aa such that fvf_{v} has κ\kappa-growth. Again, by symmetry we only treat the v=1v=1 case. We have

for xa,fv(x)fv=1+δ2(xa)κ+1δ2(x+a)a(1δ)=1+δ2(xa)κ+1δ2(xa)|xa|κ,\mbox{for~{}}x\geq a,f_{v}(x)-f_{v}^{\star}=\frac{1+\delta}{2}(x-a)^{\kappa}+\frac{1-\delta}{2}(x+a)-a(1-\delta)=\frac{1+\delta}{2}(x-a)^{\kappa}+\frac{1-\delta}{2}(x-a)\geq\lvert x-a\rvert^{\kappa},

where the last inequality is because (xa)1(x-a)\leq 1 and so (xa)(xa)κ(x-a)\geq(x-a)^{\kappa} for κ>1\kappa>1. In the second case, we have

for x[a,a],fv(x)fv=δ(ax).\mbox{for~{}}x\in[-a,a],f_{v}(x)-f_{v}^{\star}=\delta(a-x).

It holds that δ(ax)(ax)κ\delta(a-x)\geq(a-x)^{\kappa} for all x[a,a]x\in[-a,a] iff a12δ1κ1a\leq\tfrac{1}{2}\delta^{\tfrac{1}{\kappa-1}}. As a result, we set a=12δ1κ1a=\tfrac{1}{2}\delta^{\tfrac{1}{\kappa-1}}. Finally, for x[1,a]x\in[-1,-a], we define

h(x)1+δ2|xa|+1δ2|x+a|κa(1δ)1κ|xa|κ for x[1,a].h(x)\coloneqq\frac{1+\delta}{2}\lvert x-a\rvert+\frac{1-\delta}{2}\lvert x+a\rvert^{\kappa}-a(1-\delta)-\frac{1}{\kappa}\lvert x-a\rvert^{\kappa}\mbox{~{}~{}for~{}~{}}x\in[-1,-a].

We wish to prove that h(x)0h(x)\geq 0. First, note that h(a)=δκκ1(12+121κ)>0h(-a)=\delta^{\tfrac{\kappa}{\kappa-1}}(\tfrac{1}{2}+\tfrac{1}{2}-\tfrac{1}{\kappa})>0, whenever κ>1\kappa>1. Let us show that h(x)h(x) is decreasing on [1,a][-1,-a], which suffices to conclude the proof. We have

h(x)=1+δ2κ(1δ)2|x+a|κ1+|xa|κ1.h^{\prime}(x)=-\frac{1+\delta}{2}-\frac{\kappa(1-\delta)}{2}\lvert x+a\rvert^{\kappa-1}+\lvert x-a\rvert^{\kappa-1}.

First, note that h(a)=1+δ2+δ0h^{\prime}(-a)=-\tfrac{1+\delta}{2}+\delta\leq 0 and h(1)<0h^{\prime}(-1)<0, thus it suffices to show that if hh^{\prime} has an extremum then it is negative. An extremum of this function is a point xx^{\star} such that

|ax|=(κ(1δ)2)1κ2|a+x|,\lvert a-x^{\star}\rvert=\left(\frac{\kappa(1-\delta)}{2}\right)^{\tfrac{1}{\kappa-2}}\lvert a+x^{\star}\rvert,

which yields that

h(x)=|a+x|κ1(κ(1δ)2)[(κ(1δ)2)1κ21]1+δ20,h^{\prime}(x^{\star})=\lvert a+x^{\star}\rvert^{\kappa-1}\left(\frac{\kappa(1-\delta)}{2}\right)\left[\left(\frac{\kappa(1-\delta)}{2}\right)^{\tfrac{1}{\kappa-2}}-1\right]-\frac{1+\delta}{2}\leq 0,

as κ2\kappa\leq 2. This calculation shows that fvf_{v} has (1,κ)(1,\kappa)-growth. Finally, note that the function is κ\kappa-Lipschitz, and κ2\kappa\leq 2, as desired.
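The (1,κ)(1,\kappa)-growth claim can be spot-checked numerically for particular values of κ\kappa and δ\delta (the choices below are arbitrary toy values):

```python
kappa, delta = 1.5, 0.25               # toy values with 1 < kappa < 2
a = 0.5 * delta ** (1 / (kappa - 1))   # the choice of a from the proof

def F(x, s):
    # The one-dimensional hard instance defined above.
    if s == +1:
        return (a - x) if x <= a else (x - a) ** kappa
    return (-x - a) ** kappa if x <= -a else (x + a)

def f_v(x, v=+1):
    # Population risk under P_v (by symmetry we check v = +1).
    return 0.5 * (1 + delta * v) * F(x, +1) + 0.5 * (1 - delta * v) * F(x, -1)

xstar, fstar = a, f_v(a)  # minimum at x = a with value a*(1 - delta)
assert abs(fstar - a * (1 - delta)) < 1e-12

for k in range(-1000, 1001):
    x = k / 1000.0
    # (1, kappa)-growth: f_v(x) - f_v^* >= |x - x*|^kappa / kappa.
    assert f_v(x) - fstar >= abs(x - xstar) ** kappa / kappa - 1e-12, x
print("growth verified on [-1, 1]")
```

The grid covers the three regimes of the argument (xax\geq a, x[a,a]x\in[-a,a], xax\leq-a); the binding constraint is at x=ax=-a, where δ(ax)=(ax)κ\delta(a-x)=(a-x)^{\kappa} holds with equality for a=12δ1κ1a=\tfrac{1}{2}\delta^{\tfrac{1}{\kappa-1}}.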

Lower bounding the testing error. It remains to choose the value of δ\delta. Since we require a lower bound under privacy constraints, in contrast to the one-dimensional section of the proof of Theorem 8, we require the following privatized version of Le Cam’s lemma from [6]

Proposition 5.

[6, Thm. 2] Let 𝖠𝒜ε\mathsf{A}\in\mathcal{A}^{\varepsilon} be an ε\varepsilon-DP mechanism from 𝕊n𝒳\mathbb{S}^{n}\to\mathcal{X}. It holds that

infψ:𝒳{1,1}inf𝖠𝒜εPr(ψ(𝖠(S1n))V)12(1min{2nεP1P1TV,P1nP1nTV}).\inf_{\psi:\mathcal{X}\to\{-1,1\}}\inf_{\mathsf{A}\in\mathcal{A}^{\varepsilon}}\Pr(\psi(\mathsf{A}(S_{1}^{n}))\neq V)\geq\frac{1}{2}\left(1-\min\left\{2n\varepsilon\|P_{1}-P_{-1}\|_{\rm TV},\|P_{-1}^{n}-P_{1}^{n}\|_{\rm TV}\right\}\right).

With this result, we set δ=max{1/6n,1/(23nε)}\delta=\max\{1/\sqrt{6n},1/(2\sqrt{3}n\varepsilon)\} and lower bound max{a,b}\max\{a,b\} by (a+b)/2(a+b)/2 for readability, which concludes the proof of the theorem. ∎

C.2 Proofs of Section 5.2

See 3

Proof of Proposition 3.

Let us first show how to construct the mechanism 𝖠\mathsf{A}^{\prime}. Let kk\in\mathbb{N} be such that klog2[κ1κ1Lκκ122κ3Δ(n,L,ε/k,δ)]k\geq\log_{2}\left[\frac{\kappa^{\tfrac{1}{\kappa-1}}L^{\tfrac{\kappa}{\kappa-1}}}{2^{2\kappa-3}\Delta(n,L,\varepsilon/k,\delta)}\right] and let {λi}i[k]\{\lambda_{i}\}_{i\in[k]} be a collection of positive scalars. Set x0𝒳x_{0}\in\mathcal{X}, for i{1,,k}i\in\{1,\ldots,k\}

define Gi(x;s)=F(x;s)+λi2κ2κxxi12κ,𝒴i{x𝒳:xxi12(Lκλi2κ2)1κ1}\displaystyle\mbox{define~{}~{}}G_{i}(x;s)=F(x;s)+\frac{\lambda_{i}\cdot 2^{\kappa-2}}{\kappa}\|x-x_{i-1}\|_{2}^{\kappa},\mathcal{Y}_{i}\coloneqq\left\{x\in\mathcal{X}:\|x-x_{i-1}\|_{2}\leq\left(\frac{L\kappa}{\lambda_{i}2^{\kappa-2}}\right)^{\tfrac{1}{\kappa-1}}\right\}
 and set xi=𝖠(𝒮,Gi,𝒴i), with privacy (ε/k,δ/k).\displaystyle\mbox{~{}~{}and set~{}~{}}x_{i}=\mathsf{A}(\mathcal{S},G_{i},\mathcal{Y}_{i}),\mbox{~{}with privacy~{}~{}}(\varepsilon/k,\delta/k).

Finally, define 𝖠(𝒮)=xk\mathsf{A}^{\prime}(\mathcal{S})=x_{k}. Standard composition theorems [20] guarantee that 𝖠\mathsf{A}^{\prime} is (ε,δ)(\varepsilon,\delta)-DP. Let us analyze its utility; we drop the dependence of Δ\Delta on other variables when clear from context. First, since κ\kappa is a constant, note that GiG_{i} is c0Lc_{0}L-Lipschitz with c0<c_{0}<\infty a numerical constant. For simplicity, we define gi(x)1ns𝒮Gi(x;s)g_{i}(x)\coloneqq\frac{1}{n}\sum_{s\in\mathcal{S}}G_{i}(x;s) and xi=argminx𝒴igi(x)x_{i}^{\star}=\mathop{\rm argmin}_{x\in\mathcal{Y}_{i}}g_{i}(x). It holds that gig_{i} is (λi2κ2,κ)(\lambda_{i}2^{\kappa-2},\kappa)-uniformly-convex and thus the following growth condition holds

λiκ𝔼xixi2κ𝔼[gi(xi)]gi(xi)1λi1κ1Δ.\frac{\lambda_{i}}{\kappa}\mathbb{E}\|x_{i}-x_{i}^{\star}\|_{2}^{\kappa}\leq\mathbb{E}[g_{i}(x_{i})]-g_{i}(x_{i}^{\star})\leq\frac{1}{\lambda_{i}^{\tfrac{1}{\kappa-1}}}\Delta.

Also note that for any point y𝒴iy\in\mathcal{Y}_{i}, it holds that

f𝒮(xi)f𝒮(y)λi2κ2κxi1y2κ.f_{\mathcal{S}}(x_{i}^{\star})-f_{\mathcal{S}}(y)\leq\frac{\lambda_{i}2^{\kappa-2}}{\kappa}\|x_{i-1}-y\|_{2}^{\kappa}.

Finally, let us bound the distance to the optimum of f𝒮f_{\mathcal{S}} at the final iterate. We have

λkκxkxk2κgk(xk)gk(xk)c0Lxkxk2 which yields xkxk2(c0Lκλk)1κ1.\frac{\lambda_{k}}{\kappa}\|x_{k}-x_{k}^{\star}\|_{2}^{\kappa}\leq g_{k}(x_{k})-g_{k}(x_{k}^{\star})\leq c_{0}L\|x_{k}-x_{k}^{\star}\|_{2}\mbox{~{}~{}which yields~{}~{}}\|x_{k}-x_{k}^{\star}\|_{2}\leq\left(\frac{c_{0}L\kappa}{\lambda_{k}}\right)^{\tfrac{1}{\kappa-1}}.

Let us put the pieces together: for λ>0\lambda>0 to be determined later and ν=κ1\nu=\kappa-1, set λi=2νiλ\lambda_{i}=2^{\nu i}\lambda (so that the regularization increases as the domains 𝒴i\mathcal{Y}_{i} shrink). After kk rounds and denoting by x0x_{0}^{\star} a minimizer of f𝒮f_{\mathcal{S}} over 𝒳\mathcal{X}, we have

𝔼[f𝒮(xk)]f𝒮(x)\displaystyle\mathbb{E}[f_{\mathcal{S}}(x_{k})]-f_{\mathcal{S}}(x^{\star}) =i=1k𝔼[f𝒮(xi)f𝒮(xi1)]+𝔼[f𝒮(xk)f𝒮(xk)]\displaystyle=\sum_{i=1}^{k}\mathbb{E}\left[f_{\mathcal{S}}(x_{i}^{\star})-f_{\mathcal{S}}(x_{i-1}^{\star})\right]+\mathbb{E}\left[f_{\mathcal{S}}(x_{k})-f_{\mathcal{S}}(x_{k}^{\star})\right]
i=1kλi2κ2κ𝔼xi1xi12κ+L(c0Lκλk)1κ1\displaystyle\leq\sum_{i=1}^{k}\frac{\lambda_{i}2^{\kappa-2}}{\kappa}\mathbb{E}\|x_{i-1}-x_{i-1}^{\star}\|_{2}^{\kappa}+L\left(\frac{c_{0}L\kappa}{\lambda_{k}}\right)^{\tfrac{1}{\kappa-1}}
λDκκ+i=2kλi2κ2λi1κκ1Δ+L(c0Lκλk)1κ1\displaystyle\leq\frac{\lambda D^{\kappa}}{\kappa}+\sum_{i=2}^{k}\frac{\lambda_{i}2^{\kappa-2}}{\lambda_{i-1}^{\tfrac{\kappa}{\kappa-1}}}\Delta+L\left(\frac{c_{0}L\kappa}{\lambda_{k}}\right)^{\tfrac{1}{\kappa-1}}
=λDκκ+Δ2κ2λ1κ1i=2k2νκ1(iκ)+L(c0Lκλ)1κ12νκ1k\displaystyle=\frac{\lambda D^{\kappa}}{\kappa}+\frac{\Delta 2^{\kappa-2}}{\lambda^{\tfrac{1}{\kappa-1}}}\sum_{i=2}^{k}2^{-\tfrac{\nu}{\kappa-1}(i-\kappa)}+L\left(\frac{c_{0}L\kappa}{\lambda}\right)^{\tfrac{1}{\kappa-1}}2^{-\tfrac{\nu}{\kappa-1}k}
λDκκ+22κ3Δλ1κ1+κ1κ1(c0L)κκ12kλ1κ1.\displaystyle\leq\frac{\lambda D^{\kappa}}{\kappa}+2^{2\kappa-3}\frac{\Delta}{\lambda^{\tfrac{1}{\kappa-1}}}+\frac{\kappa^{\tfrac{1}{\kappa-1}}(c_{0}L)^{\tfrac{\kappa}{\kappa-1}}2^{-k}}{\lambda^{\tfrac{1}{\kappa-1}}}.

Finally, note that

klog2[κ1κ1(c0L)κκ122κ3Δ] so that κ1κ1(c0L)κκ12kλ1κ122κ3Δλ1κ1.k\geq\left\lceil\log_{2}\left[\frac{\kappa^{\tfrac{1}{\kappa-1}}(c_{0}L)^{\tfrac{\kappa}{\kappa-1}}}{2^{2\kappa-3}\Delta}\right]\right\rceil\mbox{~{}~{}so that~{}~{}}\frac{\kappa^{\tfrac{1}{\kappa-1}}(c_{0}L)^{\tfrac{\kappa}{\kappa-1}}2^{-k}}{\lambda^{\tfrac{1}{\kappa-1}}}\leq 2^{2\kappa-3}\frac{\Delta}{\lambda^{\tfrac{1}{\kappa-1}}}.

It then holds that

𝔼[f𝒮(xk)]f𝒮(x)λDκκ+4κ1Δ1λ1κ1.\mathbb{E}[f_{\mathcal{S}}(x_{k})]-f_{\mathcal{S}}(x^{\star})\leq\lambda\frac{D^{\kappa}}{\kappa}+4^{\kappa-1}\Delta\frac{1}{\lambda^{\tfrac{1}{\kappa-1}}}.

It remains to pick λ\lambda to minimize the upper bound above. A calculation yields that for a,b0a,b\geq 0

infν0aν+bν1κ1=(κ1)1/κa1/κb(κ1)/κ[1+1κ1] at ν=(ba(κ1))κ1κ.\inf_{\nu\geq 0}a\nu+\frac{b}{\nu^{\tfrac{1}{\kappa-1}}}=(\kappa-1)^{1/\kappa}a^{1/\kappa}b^{(\kappa-1)/\kappa}\left[1+\frac{1}{\kappa-1}\right]\mbox{~{}~{}at~{}~{}}\nu^{\star}=\left(\frac{b}{a(\kappa-1)}\right)^{\tfrac{\kappa-1}{\kappa}}.

Setting λ=4(κ1)2κ(ΔκDκ(κ1))(κ1)/κ\lambda=4^{\tfrac{(\kappa-1)^{2}}{\kappa}}(\tfrac{\Delta\kappa}{D^{\kappa}(\kappa-1)})^{(\kappa-1)/\kappa} yields the utility bound

𝔼[f𝒮(xk)]f𝒮(x)O(1)DΔκ1κ.\mathbb{E}[f_{\mathcal{S}}(x_{k})]-f_{\mathcal{S}}(x^{\star})\leq O(1)D\Delta^{\tfrac{\kappa-1}{\kappa}}. ∎
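The choice of λ\lambda comes from the one-dimensional minimization of νaν+b/ν1κ1\nu\mapsto a\nu+b/\nu^{\tfrac{1}{\kappa-1}}; a grid search against the stationary point ν=(b/(a(κ1)))κ1κ\nu^{\star}=(b/(a(\kappa-1)))^{\tfrac{\kappa-1}{\kappa}} confirms the calculation (with arbitrary toy values of κ\kappa, aa, bb):

```python
# Numerically verify the minimization used to pick lambda:
# inf_{nu > 0} a*nu + b / nu^{1/(kappa-1)}, attained at
# nu* = (b / (a*(kappa-1)))^{(kappa-1)/kappa}.
kappa, a, b = 1.7, 2.0, 3.0  # arbitrary toy values

def obj(nu):
    return a * nu + b / nu ** (1 / (kappa - 1))

nu_star = (b / (a * (kappa - 1))) ** ((kappa - 1) / kappa)
grid_min = min(obj(k / 10000.0) for k in range(1, 200001))

assert obj(nu_star) <= grid_min + 1e-12     # nu* is a global minimizer
assert abs(grid_min - obj(nu_star)) < 1e-3  # the grid agrees closely
print(nu_star, obj(nu_star))
```

Since the objective is convex on ν>0\nu>0, the stationary point is the global minimizer, and plugging it in gives a value of order a1/κb(κ1)/κa^{1/\kappa}b^{(\kappa-1)/\kappa}, i.e., the DΔκ1κD\Delta^{\tfrac{\kappa-1}{\kappa}} bound above.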

Proof.

Consider the reduction of Proposition 3. For c1<c_{1}<\infty to be determined later, assume by contradiction that there exists an (ε,δ)(\varepsilon,\delta)-DP mechanism such that

Δ(n,L,ε,δ)c1(Ldnε)κκ1.\Delta(n,L,\varepsilon,\delta)\leq c_{1}\left(\frac{L\sqrt{d}}{n\varepsilon}\right)^{\tfrac{\kappa}{\kappa-1}}.

Setting k=4log2(nε/d)loglog2((nε/d)κ/(κ1))k=\lceil 4\log_{2}(n\varepsilon/\sqrt{d})\log\log_{2}((n\varepsilon/\sqrt{d})^{\kappa/(\kappa-1)})\rceil, the condition on kk in Proposition 3 holds, and the proposition guarantees that there exists a numerical constant c2<c_{2}<\infty and a mechanism 𝖠\mathsf{A}^{\prime} such that

𝔼[f𝒮(𝖠(𝒮))]infx𝒳f𝒮(x)c2c1κ1κkDLdnε.\mathbb{E}[f_{\mathcal{S}}(\mathsf{A}^{\prime}(\mathcal{S}))]-\inf_{x^{\prime}\in\mathcal{X}}f_{\mathcal{S}}(x^{\prime})\leq c_{2}c_{1}^{\tfrac{\kappa-1}{\kappa}}kD\frac{L\sqrt{d}}{n\varepsilon}.

However, Theorem 5.3 in [7] guarantees that there exists c3>0c_{3}>0 such that for any (ε,δ)(\varepsilon,\delta)-DP mechanism 𝖠′′\mathsf{A}^{\prime\prime}, it must hold

c3LDdnε𝔼[f𝒮(𝖠′′(𝒮))]f𝒮(x).c_{3}LD\frac{\sqrt{d}}{n\varepsilon}\leq\mathbb{E}[f_{\mathcal{S}}(\mathsf{A}^{\prime\prime}(\mathcal{S}))]-f_{\mathcal{S}}(x^{\star}).

Setting c1=12(c3kc2)κκ1c_{1}=\tfrac{1}{2}\left(\tfrac{c_{3}}{kc_{2}}\right)^{\tfrac{\kappa}{\kappa-1}} yields a contradiction and the desired lower bound by noting that kk consists only of log factors. ∎