
A Universal Transfer Theorem for Convex Optimization Algorithms
Using Inexact First-order Oracles

Phillip Kerger    Marco Molinaro    Hongyi Jiang    Amitabh Basu
Abstract

Given any algorithm for convex optimization that uses exact first-order information (i.e., function values and subgradients), we show how to use such an algorithm to solve the problem with access to inexact first-order information. This is done in a “black-box” manner without knowledge of the internal workings of the algorithm. This complements previous work that considers the performance of specific algorithms like (accelerated) gradient descent with inexact information. In particular, our results apply to a wider range of algorithms beyond variants of gradient descent, e.g., projection-free methods, cutting-plane methods, or any other first-order methods formulated in the future. Further, they also apply to algorithms that handle structured nonconvexities like mixed-integer decision variables.

Convex Optimization, First-order methods, inexact information, approximate information, Smooth optimization

1 Introduction

Optimization is a core tool for almost any learning or estimation problem. Such problems are very often approached by setting up an optimization problem whose decision variables model the entity to be estimated, and whose objective and constraints are defined by the observed data combined with structural insights into the inference problem. Algorithms for any sufficiently general class of relevant optimization problems in such settings need to collect information about the particular instance by making (adaptive) queries about the objective before they can report a good solution. In this paper, we focus on the following important class of optimization problems over a fixed ground set $X\subseteq\mathbb{R}^{d}$:

$$\min\{f(x):x\in X\}, \tag{1.1}$$

where $f:\mathbb{R}^{d}\to\mathbb{R}$ is a (possibly nonsmooth) convex function. When the underlying ground set $X$ is all of $\mathbb{R}^{d}$ or some fixed convex subset, (1.1) is the classical convex optimization problem. In this paper, we allow $X$ to be more general and to model some known nonconvexity, e.g., integrality constraints by setting $X=C\cap(\mathbb{Z}^{d_{1}}\times\mathbb{R}^{d_{2}})$ with $d_{1}+d_{2}=d$, where $C\subseteq\mathbb{R}^{d}$ is a fixed convex set. From an algorithmic perspective, the setup is that the algorithm has complete knowledge of what $X$ is, but does not a priori know $f$ and must collect information via queries. A standard model for accessing the function $f$ is through so-called first-order oracles. At any point during its execution, the algorithm can request the function value and the (sub)gradient of $f$ at any point $\bar{x}\in\mathbb{R}^{d}$.

Given access to such oracles, a first-order algorithm makes adaptive queries to this oracle and, after it judges that it has collected enough information about $f$, it reports a solution with certain guarantees. A long line of research has gone into understanding exactly how many queries are needed to solve different classes of problems (with different sets of assumptions on $f$ and $X$), with tight upper and lower bounds on the query complexity (a.k.a. oracle or information complexity) known in the literature; see (Nesterov, 2004; Bubeck, 2015; Nemirovski, 1994; Basu, 2023; Basu et al., 2023) for expositions of these results.

A natural question that arises in this context is what happens if the response of the oracle is not exact, but approximate (with possibly desired accuracy). For example, the response of the oracle might be itself a solution to another computational problem which is solved only approximately, which happens when using function smoothing (Nesterov, 2005), and in minimax problems (Wang & Abernethy, 2018). Stochastic first-order oracles, modeling applications where only some estimate of the gradient is used, may also be viewed as inexact oracles whose accuracy is a random variable at each iteration. Thus, researchers have also investigated what one can say about algorithms that have access to inexact oracle responses (with possibly known guarantees on the inexactness). Early work on this topic appears in Shor (Shor, 1985) and Polyak (Polyak, 1987), and more recent progress can be found in (Devolder et al., 2014; Schmidt et al., 2011; Lan, 2009; Hintermüller, 2001; Kiwiel, 2006; d’Aspremont, 2008) and references therein. To the best of our knowledge, all previous work on inexact first-order oracles has focused either on how specific algorithms like (accelerated) gradient methods perform with inexact (sub)gradients with no essential change to the algorithm, or on how to adapt a particular class of algorithms to perform well with inexact information.

In this paper, we provide a different approach to the problem of inexact information. We provide a way to take any first-order algorithm that solves (1.1) with exact first-order information and, with absolutely no knowledge of its inner workings, show how to make the same algorithm work with inexact oracle information. Thus, in contrast to earlier work, our result is not the analysis of specific algorithms under inexact information or the adaptation of specific algorithms to use inexact information. It is in this sense that we believe our results to be universal: they apply to a much wider class of algorithms than previous work, including gradient descent, cutting-plane methods, bundle methods, projection-free methods, etc., and also to any first-order method that is invented in the future for optimization problems of the form (1.1).

2 Formal statement of results and discussion

We begin with definitions of standard concepts that we need to state our results formally. We use $\|\cdot\|$ to denote the standard Euclidean norm and $B(c,r)$ to denote the Euclidean ball of radius $r$ centered at $c\in\mathbb{R}^{d}$. When the center is the origin, we denote the ball by $B(r)$. A function $h:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is said to be $M$-Lipschitz if $|h(x)-h(y)|\leq M\|x-y\|$ for all $x,y$. Let $\mathcal{F}^{0}(M,R)$ denote the standard family of instances of the optimization problem (1.1) consisting of all $M$-Lipschitz (possibly non-differentiable) convex functions $f$ such that the minimizer $x^{\star}\in X$ is contained in the ball $B(R)$.[1] We now formalize the inexact first-order oracles that we will work with.

[1] This is a standard assumption in the analysis of optimization algorithms: if no such bound is assumed, then it can be shown that no algorithm can report a good solution within a guaranteed number of steps for every instance (Nesterov, 2004). Alternatively, one may give the convergence rates in terms of the distance between the initial iterate of the algorithm and the optimal solution (one can think of $R$ as an upper bound on this distance). Our results can also be formulated in this language with no conceptual or technical changes.

Definition 2.1.

An $\eta$-approximate first-order oracle for a convex function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ takes as input a query point $\bar{x}\in\mathbb{R}^{d}$ and returns a first-order pair $(\tilde{f},\tilde{g})$ satisfying $|\tilde{f}-f(\bar{x})|\leq\eta$ and $\|\tilde{g}-g\|\leq\frac{\eta}{2R}$ for some subgradient $g\in\partial f(\bar{x})$.
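As an illustration of Definition 2.1, the following is a minimal sketch of how one might simulate an $\eta$-approximate first-order oracle from an exact one. The helper name `make_approx_oracle`, the random perturbation model, and the test function $\|x\|_1$ are assumptions made for this example and do not come from the paper; any perturbation within the stated bounds would qualify equally.

```python
import math
import random

def make_approx_oracle(f, subgrad, eta, R, rng=None):
    """Wrap exact f / subgrad into an eta-approximate first-order oracle (Definition 2.1)."""
    rng = rng or random.Random(0)

    def oracle(x):
        # Function value perturbed by at most eta in absolute value.
        f_tilde = f(x) + rng.uniform(-eta, eta)
        # Random direction rescaled to Euclidean norm at most eta / (2R).
        d = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(di * di for di in d)) or 1.0
        scale = rng.random() * eta / (2.0 * R) / norm
        g_tilde = [gi + scale * di for gi, di in zip(subgrad(x), d)]
        return f_tilde, g_tilde

    return oracle

# Hypothetical test function: f(x) = ||x||_1, convex and sqrt(d)-Lipschitz.
f = lambda x: sum(abs(xi) for xi in x)
subgrad = lambda x: [(xi > 0) - (xi < 0) for xi in x]  # one valid subgradient

oracle = make_approx_oracle(f, subgrad, eta=0.1, R=1.0)
f_tilde, g_tilde = oracle([0.5, -0.25])
```

The returned pair is within $\eta$ of the true value $0.75$ and within $\eta/(2R)$ of the subgradient $(1,-1)$; an adversarial oracle could of course choose the perturbation differently at every query.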

We now state our main results. We remind the reader that in (1.1) the underlying set $X$ need not be $\mathbb{R}^{d}$ and may be nonconvex; below, when we talk about a first-order algorithm for (1.1) we mean an algorithm that can solve (1.1) with access to first-order oracles for $f$. We use $\textup{OPT}(f)$ to denote the optimal value of the instance $f$.

Theorem 2.2.

Consider an algorithm for (1.1) such that for any instance $f\in\mathcal{F}^{0}(M,R)$, with access to function values and subgradients of $f$, after $T$ iterations the algorithm reports a feasible solution $x\in X$ with error at most $err(T,M,R)$, i.e., $f(x)\leq\textup{OPT}(f)+err(T,M,R)$.

Then, for any $\eta\geq 0$, there is an algorithm that, with access to an $\eta$-approximate first-order oracle for $f$, after $T$ iterations returns a feasible solution $\bar{x}\in X$ with value

$$f(\bar{x})\,\leq\,\textup{OPT}(f)+err(T,M',R)+4\eta T,$$

where $M'=M+\frac{\eta}{2R}$.

Although we state this theorem as an existence result, our proof is constructive and exactly formulates the desired algorithm via Procedures 1 and 2. Let us illustrate what this theorem says when applied to two classical algorithms for convex optimization (i.e., $X=\mathbb{R}^{d}$): subgradient methods and cutting-plane methods (Nesterov, 2004). When using exact first-order information, the subgradient method produces after $T$ iterations a solution with error at most $O\big(\frac{MR}{\sqrt{T}}\big)$. Applying the procedures behind Theorem 2.2 to this algorithm, one obtains an algorithm that uses only $\eta$-approximate first-order information and after $T$ iterations produces a solution whose error is at most $O\big(\frac{MR+\eta}{\sqrt{T}}\big)+4\eta T$. If one can choose the accuracy of the inexact oracle, setting $\eta=\frac{\varepsilon^{3}}{M^{2}R^{2}}$ and $T=\lceil\frac{M^{2}R^{2}}{\varepsilon^{2}}\rceil$ gives a solution with error at most $O(\varepsilon)$. Note that this does not involve knowing anything about the original algorithm; it simply illustrates the tradeoff between the oracle accuracy and final solution accuracy.
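The parameter choice for the subgradient method can be checked by hand. The worked bound below is a sketch under two assumptions that are not stated in the text: the exact-oracle guarantee is written as $err(T,M,R)\leq c\,MR/\sqrt{T}$ for some absolute constant $c$, and $\varepsilon\leq MR$ (so that $\eta\leq\varepsilon$).

```latex
% Assumptions (illustration only): err(T,M,R) <= c*M*R/sqrt(T) and eps <= M*R.
% Choices from the text: T = ceil(M^2 R^2 / eps^2), eta = eps^3 / (M^2 R^2).
% Note M'R = MR + eta/2, 1/sqrt(T) <= eps/(MR), and eta = eps*(eps/(MR))^2 <= eps.
\begin{align*}
err(T,M',R) &\le \frac{c\,(MR+\eta/2)}{\sqrt{T}}
  \le c\,(MR+\eta/2)\cdot\frac{\varepsilon}{MR}
  = c\,\varepsilon\Big(1+\frac{\eta}{2MR}\Big)
  \le \tfrac{3}{2}\,c\,\varepsilon,\\
4\eta T &\le 4\eta\Big(\frac{M^{2}R^{2}}{\varepsilon^{2}}+1\Big)
  = 4\varepsilon+4\eta \le 8\varepsilon .
\end{align*}
```

Adding the two lines gives a total error of at most $(\tfrac{3}{2}c+8)\,\varepsilon=O(\varepsilon)$, which matches the claim in the text.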

Similarly, for classical cutting-plane methods (e.g., center-of-gravity, ellipsoid, Vaidya) the error after $T$ iterations is at most $O\left(MR\exp\left(\frac{-T}{\operatorname{poly}(d)}\right)\right)$. Thus, with access to $\eta$-approximate first-order oracles, we can use our result to produce a solution with error at most $O\left(\left(MR+\frac{\eta}{2}\right)\exp\left(\frac{-T}{\operatorname{poly}(d)}\right)\right)+4\eta T$. With oracle accuracy $\eta=O\left(\frac{\varepsilon}{\operatorname{poly}(d)\log\left(\frac{MR}{\varepsilon}\right)}\right)$ and $T=O\left(\operatorname{poly}(d)\log\left(\frac{MR}{\varepsilon}\right)\right)$, this gives a solution with error at most $O(\varepsilon)$.

We next consider the family of $\alpha$-smooth functions, i.e., the family $\mathcal{F}^{1}(M,\alpha,R)$ of $M$-Lipschitz convex functions that are differentiable with $\alpha$-Lipschitz gradient maps, whose minimizers are contained in $B(R)$. This is a classical family of objective functions in convex optimization that admits the celebrated accelerated method of Nesterov (1983) (see (d’Aspremont et al., 2021) for a survey). We give the following universal transfer theorem for algorithms for smooth objective functions.

Theorem 2.3.

Consider an algorithm for (1.1) such that for any instance $f\in\mathcal{F}^{1}(M,\alpha,R)$, with access to function values and gradients of $f$, after $T$ iterations the algorithm reports a feasible solution $x\in X$ with error at most $err(T,M,\alpha,R)$, i.e., $f(x)\leq\textup{OPT}(f)+err(T,M,\alpha,R)$.

Then for any $\eta\leq\frac{\alpha R^{2}}{5T}$, there is an algorithm that, with access to an $\eta$-approximate first-order oracle for $f$, after $T$ iterations returns a feasible solution $\bar{x}\in X$ with value

$$f(\bar{x})\,\leq\,\textup{OPT}(f)+err(T,M',\alpha',R)+5\eta\cdot(T+2),$$

where $M'=M+\frac{\eta}{2R}$ and $\alpha'=\alpha\cdot\sqrt{d}\cdot\Big(4\sqrt{5(T+1)}+3\Big)$.

As an illustration, we apply this transfer theorem to the accelerated algorithm of Nesterov (1983) for continuous optimization ($X=\mathbb{R}^{d}$). Under perfect first-order information, it obtains error $O(\frac{\alpha R^{2}}{T^{2}})$ after $T$ iterations. Using our transfer theorem as a wrapper gives an algorithm that, using only $\eta$-approximate first-order information, obtains error $O(\frac{\alpha R^{2}\sqrt{d}}{T^{1.5}}+\eta T)$; if the accuracy of the oracle is set to $\eta=O(\frac{\alpha R^{2}\sqrt{d}}{T^{2.5}})$, this gives an algorithm with error $O(\frac{\alpha R^{2}\sqrt{d}}{T^{1.5}})$. While this does not recover in full the acceleration of Nesterov’s method, the key takeaway is that a significant amount of acceleration (i.e., error rates better than those possible for non-smooth functions) can be preserved under inexact oracles in a universal way, for any accelerated algorithm requiring exact information.

Remark 2.4.

For the sake of exposition, we have assumed that the accuracy $\eta$ of the oracle is fixed and the additional error is $O(\eta T)$. However, one can allow different oracle accuracies $\eta_{t}$ at each query point $x_{t}$, in which case the additional error is $O(\sum_{t}\eta_{t})$ and the parameter $M'$ becomes $M+\frac{1}{2R}\max_{t}\eta_{t}$.

2.1 Allowing inexactness in the constraint set

So far we have assumed that the algorithm has complete knowledge of the constraints $X$. Now, we extend our results to include algorithms that can work with larger classes of constraints that are not fully known up front. In other words, just like the algorithm needs to collect information about $f$, it also needs to collect information about $X$, via another oracle, to be able to solve the problem. To capture the most general algorithms of this type, we formalize this setting by assuming $X$ is of the form $C\cap Z$, where $C$ belongs to a class of closed, convex sets and $Z$ is possibly nonconvex but completely known (e.g., $Z=\mathbb{Z}^{d_{1}}\times\mathbb{R}^{d_{2}}$ with $d_{1}+d_{2}=d$). The problem then reads

$$\min\{f(x):x\in C\cap Z\}. \tag{2.1}$$

The algorithm then must collect information about $C$, for which we use the common model of allowing the algorithm access to a separation oracle. Upon receiving a query point $x$, a separation oracle either reports correctly that $x$ is inside $C$ or otherwise returns a separating hyperplane that separates $x$ from $C$. We note that a separation oracle for $C$ is in some sense comparable to a first-order oracle for a convex function $f$; since the pair $(f(x),\nabla f(x))$ can be viewed as providing a supporting hyperplane for the epigraph of $f$ at $x$, using an oracle that returns separating hyperplanes for $C$ provides a comparable way of collecting information about the constraints.

Let us first precisely define the inexact version of a separation oracle.

Definition 2.5.

For a closed, convex set $C\subseteq B(R)$ and a query point $\bar{x}\in B(R)$, an $\eta$-approximate separation oracle reports a separation response $(\mathit{flag},\tilde{g})\in\{\textsc{Feasible},\textsc{Infeasible}\}\times\mathbb{R}^{d}$ such that if $\bar{x}\in C$ then $\mathit{flag}=\textsc{Feasible}$ (with no requirement on $\tilde{g}$), and otherwise $\mathit{flag}=\textsc{Infeasible}$ and $\tilde{g}$ is a unit vector such that there exists some unit vector $g$ satisfying $\langle g,x\rangle\leq\langle g,\bar{x}\rangle$ for all $x\in C$ and $\|\tilde{g}-g\|_{2}\leq\frac{\eta}{4R}$. Given such a $\tilde{g}$ (for $\bar{x}\notin C$), we call the hyperplane through $\bar{x}$ induced by this normal vector an $\eta$-approximate separating hyperplane for $\bar{x}$.
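To make Definition 2.5 concrete, here is a minimal sketch of an $\eta$-approximate separation oracle for the special case where $C$ is a Euclidean ball. The function name, the choice of $C$, and the noise model are assumptions for illustration only. The sketch relies on the elementary fact that perturbing a unit vector $g$ by a vector of norm at most $s$ and renormalizing moves it by at most $2s$, so a perturbation of norm at most $\eta/(8R)$ suffices.

```python
import math
import random

def approx_separation_oracle(center, radius, eta, R, rng=None):
    """eta-approximate separation oracle for the ball C = B(center, radius)."""
    rng = rng or random.Random(0)

    def oracle(x):
        diff = [xi - ci for xi, ci in zip(x, center)]
        dist = math.sqrt(sum(v * v for v in diff))
        if dist <= radius:
            return "Feasible", None
        # Exact unit normal: <g, y> <= <g, x> holds for every y in C.
        g = [v / dist for v in diff]
        # Perturb by a vector of norm at most eta / (8R), then renormalize;
        # the result stays within eta / (4R) of g.
        n = [rng.gauss(0.0, 1.0) for _ in x]
        n_norm = math.sqrt(sum(v * v for v in n)) or 1.0
        s = rng.random() * eta / (8.0 * R) / n_norm
        u = [gi + s * ni for gi, ni in zip(g, n)]
        u_norm = math.sqrt(sum(v * v for v in u))
        return "Infeasible", [v / u_norm for v in u]

    return oracle

sep = approx_separation_oracle([0.0, 0.0], 0.5, eta=0.2, R=1.0)
flag, g_tilde = sep([0.9, 0.0])      # query outside C
inside_flag, _ = sep([0.1, 0.1])     # query inside C
```

For the outside query the exact normal is $(1,0)$, and the returned unit vector deviates from it by at most $\eta/(4R)=0.05$.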

We now state our results for algorithms that work with separation oracles. Note that for this, instances of (2.1) have to specify both $f$ and $C$, as opposed to just $f$, since only $Z$ is known but not $C$. We use $\mathcal{I}(M,R,\rho)$ to denote the set of all instances $(f,C)$ where $f:\mathbb{R}^{d}\to\mathbb{R}$ is an $M$-Lipschitz convex function and $C$ is a compact, convex set that contains a ball of radius $\rho$ and is contained in $B(R)$. We use $\textup{OPT}(f,C)$ to denote the minimum value of (2.1). The “strict feasibility” assumption of $C$ containing a $\rho$-ball is standard in convex optimization with constraints given via separation oracles. Otherwise, it can be shown that no algorithm will be able to find even an approximately feasible point in a finite number of steps (Nesterov, 2004). The first result we state is for pure convex problems, i.e., $Z=\mathbb{R}^{d}$.

Theorem 2.6.

Consider an algorithm for (2.1) with $Z=\mathbb{R}^{d}$, such that for any instance $(f,C)\in\mathcal{I}(M,R,\rho)$, with access to function values and subgradients of $f$ and separating hyperplanes for $C$, after $T$ iterations the algorithm reports a feasible solution $x\in C$ with error at most $err(T,M,R,\rho)$, i.e., $f(x)\leq\textup{OPT}(f,C)+err(T,M,R,\rho)$.

Then, for any $\eta_{f}\geq 0$ and $0\leq\eta_{C}\leq\rho$, there is an algorithm that, with access to an $\eta_{f}$-approximate first-order oracle for $f$ and an $\eta_{C}$-approximate separation oracle for $C$, after $T$ iterations returns a feasible solution $\bar{x}\in C$ with value

$$f(\bar{x})\,\leq\,\textup{OPT}(f,C)+err(T,M',R,\rho')+4\eta_{f}T+\frac{2\eta_{C}MR}{\rho},$$

where $M'=M+\frac{\eta_{f}}{2R}$ and $\rho'=\rho-\eta_{C}$.

We can handle more general, nonconvex $Z$ with separation oracles under a slightly stronger “strict feasibility” assumption on $C$: let $\mathcal{I}^{\star}(M,R,\rho)$ denote the subclass of instances from $\mathcal{I}(M,R,\rho)$ where the minimizer $x^{\star}$ of (2.1) is $\rho$-deep inside $C$, i.e., $B(x^{\star},\rho)\subseteq C$.

Theorem 2.7.

Consider an algorithm for (2.1) such that for any instance $(f,C)\in\mathcal{I}^{\star}(M,R,\rho)$, with access to function values and subgradients of $f$ and separating hyperplanes for $C$, after $T$ iterations the algorithm reports a feasible solution $x\in C\cap Z$ with error at most $err(T,M,R,\rho)$, i.e., $f(x)\leq\textup{OPT}(f,C)+err(T,M,R,\rho)$.

Then, for any $\eta_{f}\geq 0$ and $0\leq\eta_{C}\leq\rho$, there is an algorithm that, with access to an $\eta_{f}$-approximate first-order oracle for $f$ and an $\eta_{C}$-approximate separation oracle for $C$, after $T$ iterations returns a feasible solution $\bar{x}\in C\cap Z$ with value

$$f(\bar{x})\,\leq\,\textup{OPT}(f,C)+err(T,M',R,\rho')+4\eta_{f}T,$$

where $M'=M+\frac{\eta_{f}}{2R}$ and $\rho'=\rho-\eta_{C}$.

Remark 2.8.

The objective functions in the above results were allowed to be any $M$-Lipschitz, possibly nondifferentiable, convex function. One can state versions of these results for algorithms that work for the smaller class of $\alpha$-smooth functions (e.g., accelerated projected gradient methods), just as Theorem 2.3 is a version of Theorem 2.2 for $\alpha$-smooth objectives. The reason is that the analysis for handling constraints is independent of the arguments needed to handle the objective using inexact oracles; however, for space constraints, we leave the details out of this manuscript. Additionally, one can prove versions of all our theorems for strongly convex objective functions, but we leave these out of the manuscript as well to convey the main message of the paper more crisply.

2.2 Relation to existing work

Previous work on inexact first-order information focused on how certain known algorithms perform or can be made to perform under inexact information, most recently on (accelerated) proximal-gradient methods. For instance, (Devolder et al., 2014) analyze the performance of (accelerated) gradient descent in the presence of inexact oracles, with no change to the algorithm. They show that simple gradient descent (for unconstrained problems) will return a solution with additional error $O(\eta)$ and accelerated gradient descent incurs an additional error of $O(\eta T)$ (similar to our guarantees). We provide a more thorough comparison of our setting and results with those of (Devolder et al., 2014) in Appendix C.

Similarly, (Schmidt et al., 2011) analyze (accelerated) proximal gradient methods, with more complicated forms of the additional error, depending on how well the proximal problems are solved. (Gasnikov & Tyurin, 2019) and (Cohen et al., 2018) also study gradient methods in inexact settings, with their analyses being specific to particular algorithms.

In contrast, our result does not assume any knowledge of the internal logic of the algorithm. We must, therefore, use the algorithm in a “black-box” manner. We are able to do this by using the inexact oracles to construct a modified instance whose optimal solution is similar in quality to that of the true instance, and where this inexact information from the true instance can be interpreted as exact information for the modified instance. Thus, we can effectively run the algorithm as a black-box on this modified instance and leverage its error guarantee. Constructing this modified instance in an online fashion requires technical ideas that are new, to the best of our knowledge, in this literature. For instance, it is not even true that given approximate function values and subgradients of a convex function, we can find another convex function that has these as exact function values and subgradients; see Figure 1. Thus, in the general case we consider, one cannot directly use the inexact information as is (contrary to what is done in many of the papers dealing with inexact information for specific algorithms). The key is to modify the inexact information so that the information the algorithm receives admits an extension into a convex function/set that is still close to the original instance. When dealing with $\alpha$-smooth objectives, the arguments are especially technically challenging since we have to report approximate function and gradient values that allow for a smooth extension that also approximates the unknown objective well. This involves careful use of new, localized smoothing techniques and maximal couplings of probability distributions. Such smoothing guarantees based on the proximity to the class of smooth functions may be of independent interest (see Theorem 4.1).

New applications: Since our results apply to algorithms for any ground set $X$, we are able to handle mixed-integer convex optimization, i.e., $X=C\cap(\mathbb{Z}^{d_{1}}\times\mathbb{R}^{d_{2}})$, with inexact oracles. Recently, there have been several applications of such optimization problems in machine learning and statistics (Bertsimas et al., 2016; Mazumder & Radchenko, 2017; Bandi et al., 2019; Dedieu et al., 2021; Dey et al., 2022; Hazimeh et al., 2022, 2023). General algorithms for mixed-integer convex optimization, as well as specialized ones designed for specific applications in the above papers, all involve a sophisticated combination of techniques like branch-and-bound, cutting planes and other heuristics. To the best of our knowledge, the performance of these algorithms has never been analyzed in the presence of inexact oracles, which can cause issues for all of these components of the algorithm. Our results apply immediately to all these algorithms, precisely because the internal workings of the algorithm are abstracted away in our analysis. This yields the first ever versions of these methods that can work with inexact oracles. Moreover, $X$ can be used to model other types of structured nonconvexities (e.g., complementarity constraints (Cottle et al., 2009)) and our results show how to adapt algorithms in those settings to work with inexact oracles. Note that this holds both when the convex set $C$ is explicitly known a priori (Theorems 2.2 and 2.3) and when it must be accessed via separation oracles (Theorems 2.6 and 2.7).

The remainder of this paper is dedicated to the proof sketches of Theorems 2.2 and 2.3. The missing details and proofs of Theorems 2.6 and 2.7 can be found in the Appendix.

3 Universal transfer for Lipschitz functions

In this section we prove our transfer result stated in Theorem 2.2. The proof relies on the following key concept: Given a set of points $x_{1},\ldots,x_{T}\in B(R)$ (e.g., queries made by an optimization algorithm), we say that the sequence of first-order pairs[2] $(f_{1},g_{1}),\ldots,(f_{T},g_{T})\in\mathbb{R}\times\mathbb{R}^{d}$ has an $M$-Lipschitz convex extension, or simply $M$-extension, if there is a function $F$ that is convex, $M$-Lipschitz, and such that $f_{t}=F(x_{t})$ and $g_{t}\in\partial F(x_{t})$ for all $t$, i.e., the first-order information of $F$ at the queried points is exactly $\{(f_{t},g_{t})\}_{t}$.

[2] We use first-order pair as just a more “visual” name for a pair in $\mathbb{R}\times\mathbb{R}^{d}$.

As mentioned in the introduction, the main idea is to feed to the convex optimization algorithm $\mathcal{A}$ a sequence of pairs $(f_{t},g_{t})$ that have an $M$-Lipschitz extension $F$ that is close to the original function $f$. Since the information is consistent with what the algorithm expects when interacting exactly with the function $F$, it will approximately optimize the latter, which will then give an approximately optimal solution to the neighboring function $f$.

Unfortunately, it is easy to see that approximate first-order information from $f$ for the queried points $x_{t}$ does not necessarily have a Lipschitz convex extension (see Figure 1). Thus, the main subroutine of our algorithm Approximate-to-Exact given below is the following: given an approximate first-order oracle for $f$, it constructs first-order pairs $(\hat{f}_{t},\hat{g}_{t})$ in an online fashion (i.e., $(\hat{f}_{t},\hat{g}_{t})$ only depends on $x_{1},\ldots,x_{t}$) with the desired extension properties. For a function $g$, let $\|g\|_{\infty}:=\sup_{x}|g(x)|$ denote its sup-norm.

Figure 1: An example where two approximate function values and subgradients do not have a convex extension. The true function $f$ is constant. The function values are reported with no error. The reported slopes are shown in red. However, these slopes decrease going from $x_{1}$ to $x_{2}$, thus eliminating the possibility of any convex function having these values and slopes at $x_{1}$ and $x_{2}$.
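The obstruction in Figure 1 can be checked mechanically. A finite set of first-order pairs admits a convex extension exactly when every induced linear function lies on or below every reported value, i.e., $f_{i}+\langle g_{i},x_{j}-x_{i}\rangle\leq f_{j}$ for all $i,j$ (in which case the maximum of the linear functions is an extension). The sketch below tests this condition; the function name and the concrete numbers (slopes $+0.1$ and $-0.1$ at $x_{1}=0$, $x_{2}=1$) are hypothetical stand-ins for the figure, not values from the paper.

```python
def has_convex_extension(points, pairs, tol=1e-12):
    """True iff some convex F has F(x_i) = f_i and g_i in dF(x_i) for every i."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    for xi, (fi, gi) in zip(points, pairs):
        for xj, (fj, _) in zip(points, pairs):
            step = [b - a for a, b in zip(xi, xj)]
            # Subgradient inequality at x_i, evaluated at x_j.
            if fi + dot(gi, step) > fj + tol:
                return False
    return True

points = [[0.0], [1.0]]
# Constant true function f = 0: exact values, but slopes off by 0.1 each.
inexact_pairs = [(0.0, [0.1]), (0.0, [-0.1])]
exact_pairs = [(0.0, [0.0]), (0.0, [0.0])]

bad = has_convex_extension(points, inexact_pairs)   # slopes decrease: no extension
good = has_convex_extension(points, exact_pairs)
```

Here the slope $+0.1$ at $x_{1}$ forces any convex extension to be at least $0.1$ at $x_{2}$, contradicting the reported value $0$.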
Theorem 3.1 (Online first-order Lipschitz-extensibility).

Consider an $M$-Lipschitz convex function $f:B(R)\rightarrow\mathbb{R}$, and a sequence of points $x_{1},\ldots,x_{T}\in B(R)$. There is an online procedure that, given $\eta$-approximate first-order oracle access to $f$, produces first-order pairs $(\hat{f}_{1},\hat{g}_{1}),\ldots,(\hat{f}_{T},\hat{g}_{T})$ that have an $(M+\frac{\eta}{2R})$-extension $F:B(R)\rightarrow\mathbb{R}$ satisfying $\|F-f\|_{\infty}\leq 2\eta T$. (Moreover, the procedure only probes the approximate oracle at the given points $x_{1},\ldots,x_{T}$.)

With this at hand, given any first-order algorithm $\mathcal{A}$ we can run it using only approximate first-order information in the following natural way:

Procedure 1.
Approximate-to-Exact($\mathcal{A}$, $T$)
For each timestep $t=1,\ldots,T$:
  1. Receive query point $x_{t}\in B(R)$ from $\mathcal{A}$.
  2. Send point $x_{t}$ to the $\eta$-approximate oracle for $f$ and receive the information $(\tilde{f}_{t},\tilde{g}_{t})$.
  3. Use the online procedure from Theorem 3.1 to construct the first-order pair $(\hat{f}_{t},\hat{g}_{t})$.
  4. Send $(\hat{f}_{t},\hat{g}_{t})$ to the algorithm $\mathcal{A}$.
Return the point in $X$ returned by $\mathcal{A}$.
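The control flow of Procedure 1 can be sketched as a small wrapper. Everything about the interface here is an assumption made for illustration: the algorithm is modeled as an object with hypothetical methods `query`, `update`, and `solution`, and the online procedure of Theorem 3.1 is passed in as a callback `correct` rather than implemented. The demo uses an exact oracle ($\eta=0$), where passing the pairs through unchanged is trivially valid; it is not the correction procedure of the paper.

```python
def approximate_to_exact(algorithm, approx_oracle, correct, T):
    """Run `algorithm` for T rounds, feeding it corrected first-order pairs."""
    history = []  # (x_t, f_hat_t, g_hat_t) triples already sent to the algorithm
    for _ in range(T):
        x_t = algorithm.query()                                 # step 1
        f_tilde, g_tilde = approx_oracle(x_t)                   # step 2
        f_hat, g_hat = correct(x_t, f_tilde, g_tilde, history)  # step 3
        history.append((x_t, f_hat, g_hat))
        algorithm.update(f_hat, g_hat)                          # step 4
    return algorithm.solution()


class SubgradientMethod:
    """Toy 1-D subgradient method with a fixed step (hypothetical interface)."""
    def __init__(self, x0, step):
        self.x, self.step = x0, step
        self.best_x, self.best_f = x0, float("inf")

    def query(self):
        return self.x

    def update(self, f_val, g):
        if f_val < self.best_f:          # remember the best queried point
            self.best_x, self.best_f = self.x, f_val
        self.x -= self.step * g

    def solution(self):
        return self.best_x


# Sanity run on f(x) = |x - 0.3| with an exact oracle (eta = 0).
exact_oracle = lambda x: (abs(x - 0.3), (x > 0.3) - (x < 0.3))
identity = lambda x, f_tilde, g_tilde, history: (f_tilde, g_tilde)
x_bar = approximate_to_exact(SubgradientMethod(1.0, 0.1), exact_oracle, identity, T=20)
```

The wrapper never inspects the internals of `algorithm`, which is the sense in which the construction is black-box: only the query points and the pairs sent back cross the interface.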
Proof of Theorem 2.2.

Consider a first-order algorithm $\mathcal{A}$ that, for any $M$-Lipschitz convex function $f$, after $T$ iterations returns a point $\bar{x}\in X$ such that $f(\bar{x})\leq\textup{OPT}(f)+err(T,M,R)$. We show that running Procedure 1 with $\mathcal{A}$ as input, which only uses an $\eta$-approximate oracle for $f$, returns a point $\bar{x}\in X$ such that $f(\bar{x})\leq\textup{OPT}(f)+err(T,M',R)+4\eta T$ with $M'=M+\frac{\eta}{2R}$.

To see this, let $F$ be an $M'$-extension for the first-order pairs $(\hat{f}_{t},\hat{g}_{t})$ sent to the algorithm $\mathcal{A}$ in Procedure 1 with $\|F-f\|_{\infty}\leq 2\eta T$, guaranteed by Theorem 3.1. This means that the execution of the first-order algorithm $\mathcal{A}$ during our procedure is exactly the same as executing $\mathcal{A}$ directly on the convex function $F$. Thus, by the error guarantee of $\mathcal{A}$, the point $\bar{x}\in X$ returned by $\mathcal{A}$ after $T$ iterations (which is the same point returned by our procedure) is almost optimal for $F$, i.e., $F(\bar{x})\leq\textup{OPT}(F)+err(T,M',R)$. Since $F$ and $f$ are pointwise within $\pm 2\eta T$ of each other (so in particular $\textup{OPT}(F)\leq\textup{OPT}(f)+2\eta T$), the value of the solution $\bar{x}$ with respect to the original function $f$ satisfies

$$\begin{aligned}
f(\bar{x})\leq F(\bar{x})+2\eta T &\leq \textup{OPT}(F)+err(T,M',R)+2\eta T\\
&\leq \textup{OPT}(f)+err(T,M',R)+4\eta T,
\end{aligned}$$

which proves the desired result. ∎

3.1 Computing Lipschitz-extensible first-order pairs

In this section we describe the procedure that constructs the first-order pairs with a Lipschitz convex extension $F$ that satisfies $\|F-f\|_{\infty}\leq 2\eta T$, proving Theorem 3.1. Before getting into the heart of the matter, we show that this property can be significantly weakened: instead of requiring both $f(x)\geq F(x)-2\eta T$ and $f(x)\leq F(x)+2\eta T$ for all $x\in B(R)$, we can relax the second inequality to only hold for the queried points $x_{1},\ldots,x_{T}$.

Lemma 3.2.

Consider a sequence of points $x_{1},\ldots,x_{T}\in B(R)$, and a sequence of first-order pairs $(\hat{f}_{1},\hat{g}_{1}),\ldots,(\hat{f}_{T},\hat{g}_{T})$. Consider $\delta>0$ and $M'\geq M$, and suppose that there is an $M'$-extension $F$ of these first-order pairs that satisfies:

$$\underbrace{f(x)\geq F(x)-\delta}_{\textrm{approx. under-approximation}},\qquad\forall x\in B(R) \tag{3.1}$$
$$\underbrace{f(x_{t})\leq F(x_{t})+\delta}_{\textrm{approx. queried values}},\qquad\forall t\in\{1,\ldots,T\}. \tag{3.2}$$

Then the first-order pairs have an $M'$-extension $F'$ such that $\|F'-f\|_{\infty}\leq\delta$. In particular, setting

$$F'(x)=\max\{F(x),f(x)-\delta\}$$

provides such an extension.

Proof.

Define the function $F'$ as $F'(x):=\max\{f(x)-\delta,F(x)\}$, i.e., the maximum of $F$ and a downward-shifted copy of $f$. We will show that this function is the desired convex extension of the first-order pairs $(\hat{f}_{1},\hat{g}_{1}),\ldots,(\hat{f}_{T},\hat{g}_{T})$.

First, to show $\|F'-f\|_{\infty}\leq\delta$, note that by the definition of $F'$ one has $F'(x)\geq f(x)-\delta$ for all $x$. Furthermore, because (3.1) guarantees $F(x)\leq f(x)+\delta$, we also have $F'(x)\leq f(x)+\delta$ for all $x$; together these imply that $\|F'-f\|_{\infty}\leq\delta$. Since $f$ and $F$ are $M'$-Lipschitz convex functions, so is $F'$.

It remains to be shown that $F'$ is an extension of the first-order pairs, that is, that $\hat{f}_{t}=F'(x_{t})$ and $\hat{g}_{t}\in\partial F'(x_{t})$ for all $t=1,\ldots,T$. Given property (3.2), we have $F(x_{t})\geq f(x_{t})-\delta$, and so $F'(x_{t})=F(x_{t})=\hat{f}_{t}$. The fact that $F'(x_{t})=F(x_{t})$ also implies that every vector in $\partial F(x_{t})$ is a subgradient of $F'$ at $x_{t}$, namely $\partial F'(x_{t})\supseteq\partial F(x_{t})\ni\hat{g}_{t}$. To see this, recall that since $F$ is convex, for $\hat{g}_{t}\in\partial F(x_{t})$ we have $F(x_{t})+\langle\hat{g}_{t},x-x_{t}\rangle\leq F(x)$ for all $x$. Using the fact that $F'(x_{t})=F(x_{t})$, we thus have $F'(x_{t})+\langle\hat{g}_{t},x-x_{t}\rangle\leq F(x)\leq F'(x)$ for all $x$, and so any $\hat{g}_{t}\in\partial F(x_{t})$ is also a subgradient of $F'$ at $x_{t}$, which concludes the proof. ∎
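A small numerical check may help build intuition for Lemma 3.2. The instance below is hypothetical and chosen only for illustration: $f(x)=x^{2}$ on $[-1,1]$, a single query point $x_{1}=0.5$ with its exact first-order pair, and $\delta=0.1$. The tangent line $F$ satisfies (3.1) and (3.2) but is far from $f$ in sup-norm, while $F'=\max\{F,f-\delta\}$ is within $\delta$ of $f$ and still matches the pair at $x_{1}$.

```python
f = lambda x: x * x                   # convex, 2-Lipschitz on [-1, 1]
delta = 0.1
x1, f1, g1 = 0.5, 0.25, 1.0           # exact first-order pair of f at x1

F = lambda x: f1 + g1 * (x - x1)      # tangent line: an extension with F <= f
F_prime = lambda x: max(F(x), f(x) - delta)   # construction from Lemma 3.2

# Evaluate the sup-norm gaps on a uniform grid over [-1, 1].
grid = [-1.0 + 2.0 * k / 1000 for k in range(1001)]
sup_gap_F = max(abs(F(x) - f(x)) for x in grid)
sup_gap_F_prime = max(abs(F_prime(x) - f(x)) for x in grid)
value_at_query = F_prime(x1)
```

On this grid the tangent line is off by more than 2 at $x=-1$, whereas the repaired function stays within $\delta$ everywhere and keeps the value $0.25$ at the query point.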

Given Lemma 3.2, to prove Theorem 3.1 it suffices to do the following. Consider a sequence of points x1,,xTB(R)x_{1},\ldots,x_{T}\in B(R). Using an η\eta-approximate first-order oracle to access the function ff (at the points x1,,xTx_{1},\ldots,x_{T}), we need to produce a sequence of first-order pairs (f^1,g^1),,(f^T,g^T)(\hat{f}_{1},\hat{g}_{1}),\ldots,(\hat{f}_{T},\hat{g}_{T}) in an online fashion that have an MM-extension FF achieving the approximations (3.1) and (3.2). We do this as follows.

At iteration tt we maintain the function Ft(x):=max{f^τ+g^τ,xxτ:τt}F_{t}(x):=\max\{\hat{f}_{\tau}+\langle\hat{g}_{\tau},x-x_{\tau}\rangle\,:\,\tau\leq t\}, that is, the maximum of the linear functions induced by the first-order pairs (f^τ,g^τ)(\hat{f}_{\tau},\hat{g}_{\tau}) constructed up to this point. We would like to define the pairs (f^τ,g^τ)(\hat{f}_{\tau},\hat{g}_{\tau}) to guarantee that for all tt, FtF_{t} is an MM^{\prime}-extension for these pairs, and satisfies (3.1) and (3.2) for x1,,xtx_{1},\ldots,x_{t}. In this case, F=FTF=F_{T} gives the desired function.

For that, suppose the above holds for t1t-1; we will show how to define (f^t,g^t)(\hat{f}_{t},\hat{g}_{t}) to maintain this invariant for tt. We should think of constructing FtF_{t} by taking the maximum of Ft1F_{t-1} and a new linear function f^t+g^t,xxt\hat{f}_{t}+\langle\hat{g}_{t},x-x_{t}\rangle. To ensure that FtF_{t} is an extension of the first-order pairs thus far, we need to make sure that:

  1. $\hat{f}_{t}\geq F_{t-1}(x_{t})$. This is necessary to ensure that $F_{t}(x_{t})=\hat{f}_{t}$, and also guarantees $\hat{g}_{t}\in\partial F_{t}(x_{t})$.

  2. $\hat{f}_{t}+\langle\hat{g}_{t},x_{\tau}-x_{t}\rangle\leq F_{t-1}(x_{\tau})$ for all $\tau\leq t-1$. This is necessary to ensure that $F_{t}(x_{\tau})=F_{t-1}(x_{\tau})=\hat{f}_{\tau}$, and also guarantees $\partial F_{t}(x_{\tau})\supseteq\partial F_{t-1}(x_{\tau})\ni\hat{g}_{\tau}$.

To construct (f^t,g^t)(\hat{f}_{t},\hat{g}_{t}) with these properties, we probe the approximate first-order oracle for ff at xtx_{t}, and receive an answer (f~t,g~t)(\tilde{f}_{t},\tilde{g}_{t}). If setting (f^t,g^t)=(f~t,g~t)(\hat{f}_{t},\hat{g}_{t})=(\tilde{f}_{t},\tilde{g}_{t}) violates the first item above, we simply use the first-order information of Ft1F_{t-1} at xtx_{t}, i.e., we set f^t=Ft1(xt)\hat{f}_{t}=F_{t-1}(x_{t}) and g^tFt1(xt)\hat{g}_{t}\in\partial F_{t-1}(x_{t}).

If the second item above is violated instead, we shift the value f~t\tilde{f}_{t} down as little as possible to ensure the desired property, i.e., we set (f^t,g^t)=(f~ts,g~t)(\hat{f}_{t},\hat{g}_{t})=(\tilde{f}_{t}-s^{*},\tilde{g}_{t}) for appropriate s>0s^{*}>0. With this shifted value, the first item may now be violated, in which case we again just use the current first-order information of Ft1F_{t-1}.

These steps are formalized in the following procedure.

Procedure 2.
Set $F_{0}(x)=-\infty$. For each $t=1,\ldots,T$:

  1. Query the $\eta$-approximate oracle for $f$ at $x_{t}$, receiving the first-order pair $(\tilde{f}_{t},\tilde{g}_{t})$.

  2. Let $s^{*}:=\min\{s\geq 0:\tilde{f}_{t}-s+\langle\tilde{g}_{t},x_{\tau}-x_{t}\rangle\leq F_{t-1}(x_{\tau}),\ \forall\tau\leq t-1\}$.

  3. Let $F_{t}(x)=\max\{F_{t-1}(x),\ \tilde{f}_{t}+\langle\tilde{g}_{t},x-x_{t}\rangle-s^{*}\}$, and then set $\hat{f}_{t}=F_{t}(x_{t})$ and choose $\hat{g}_{t}\in\partial F_{t}(x_{t})$.
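The steps of Procedure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `OnlineExtension`, its methods, and the representation of points as plain lists of floats are assumptions made here. The sketch stores each returned value $\hat{f}_{\tau}$ and uses it in place of $F_{t-1}(x_{\tau})$ when computing $s^{*}$, which relies on the extension property established in Lemma 3.3.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def diff(u, v):
    return [a - b for a, b in zip(u, v)]


class OnlineExtension:
    """Hypothetical helper: keeps F_t as a maximum of affine pieces."""

    def __init__(self):
        self.pieces = []  # each piece (c, g, a) encodes x -> c + <g, x - a>
        self.points = []  # queried points x_1, ..., x_{t-1}
        self.values = []  # returned values f_hat_1, ..., f_hat_{t-1}

    def evaluate(self, x):
        """Return F_t(x) and the slope of one piece attaining the maximum."""
        best_val, best_g = -math.inf, None
        for c, g, a in self.pieces:
            val = c + dot(g, diff(x, a))
            if val > best_val:
                best_val, best_g = val, g
        return best_val, best_g

    def step(self, x_t, f_tilde, g_tilde):
        """One iteration: takes the inexact oracle answer at x_t."""
        # Step 2: smallest shift s* >= 0 keeping the new piece below
        # F_{t-1}(x_tau) = f_hat_tau at every earlier point.
        s_star = 0.0
        for x_tau, f_hat_tau in zip(self.points, self.values):
            excess = f_tilde + dot(g_tilde, diff(x_tau, x_t)) - f_hat_tau
            s_star = max(s_star, excess)
        # Step 3: add the shifted piece and read off value and subgradient.
        self.pieces.append((f_tilde - s_star, list(g_tilde), list(x_t)))
        f_hat, g_hat = self.evaluate(x_t)
        self.points.append(list(x_t))
        self.values.append(f_hat)
        return f_hat, g_hat
```

Each call to `step` costs $O(td)$ operations in this sketch, which is in line with the $O(T^{2}d)$ total mentioned in the text.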

We remark that this requires storing the historical values of $\tilde{f}_{t}$ and $\tilde{g}_{t}$ (this seems unavoidable to ensure convexity of $F_{t}$). In terms of computational complexity, the procedure takes a total of $O(T^{2}d)$ operations. We now prove that the functions $F_{t}$ have the desired properties.

Lemma 3.3.

For every t=1,,Tt=1,\ldots,T, the function FtF_{t} is an (M+η2R)(M+\frac{\eta}{2R})-extension of the first-order information pairs (f^1,g^1),,(f^t,g^t)(\hat{f}_{1},\hat{g}_{1}),\ldots,(\hat{f}_{t},\hat{g}_{t}).

Proof.

Since $F_{t}$ is the maximum of affine functions, it is convex. Moreover, all of its subgradients lie in the convex hull of the set $\{\tilde{g}_{\tau}\}_{\tau\leq t}$, and by the approximation guarantee of the oracle, for each $\tau$ there is a subgradient $g\in\partial f(x_{\tau})$ such that $\|\tilde{g}_{\tau}\|_{2}\leq\|\tilde{g}_{\tau}-g\|_{2}+\|g\|_{2}\leq\frac{\eta}{2R}+M$, where we used the fact that $f$ is $M$-Lipschitz; thus, $F_{t}$ is $(M+\frac{\eta}{2R})$-Lipschitz.

We prove by induction on $t$ that $F_{t}$ is an extension of the desired pairs (the base case $t=1$ can be readily verified). Recall $F_{t}(x)=\max\{F_{t-1}(x),H(x)\}$, where $H(x):=\hat{f}_{t}+\langle\hat{g}_{t},x-x_{t}\rangle$. By the definition of $s^{*}$, for all $x\in\{x_{1},\ldots,x_{t-1}\}$ this maximum is achieved by the function $F_{t-1}$, giving, by induction, that for all $\tau\leq t-1$, $F_{t}(x_{\tau})=F_{t-1}(x_{\tau})=\hat{f}_{\tau}$; this also implies that for such $\tau$, $\partial F_{t}(x_{\tau})\supseteq\partial F_{t-1}(x_{\tau})\ni\hat{g}_{\tau}$, the last inclusion again following by induction. These give the extension property for the pairs $(\hat{f}_{\tau},\hat{g}_{\tau})$ with $\tau\leq t-1$.

It remains to verify that this also holds for τ=t\tau=t. Now the maximum in the definition of Ft(xt)F_{t}(x_{t}) is achieved by the function HH: if f~tsFt1(xt)\tilde{f}_{t}-s^{*}\geq F_{t-1}(x_{t}), the procedure sets f^t=f~ts\hat{f}_{t}=\tilde{f}_{t}-s^{*} and we have H(xt)=f^tFt1(xt)H(x_{t})=\hat{f}_{t}\geq F_{t-1}(x_{t}); otherwise the procedure sets f^t=Ft1(xt)\hat{f}_{t}=F_{t-1}(x_{t}) and so H(xt)=Ft1(xt)H(x_{t})=F_{t-1}(x_{t}). Again this implies that Ft(xt)H(xt)={g^t}\partial F_{t}(x_{t})\supseteq\partial H(x_{t})=\{\hat{g}_{t}\}. This concludes the proof of the lemma. ∎

Finally, we show that the functions FtF_{t} approximate ff according to (3.1) and (3.2).

Lemma 3.4.

For every $t=1,\ldots,T$, the function $F_{t}$ satisfies inequalities (3.1) and (3.2) with $\delta=2\eta t$.

Proof.

Again we prove this by induction on $t$. Fix $t$. Let $\Delta:=\tilde{f}_{t}-f(x_{t})$ be the error the inexact oracle makes on the function value. We claim that the shift $s^{*}$ used in iteration $t$ of Procedure 2 satisfies $s^{*}\leq\max\{0,\Delta+2\eta t\}$. To see this, the $\eta$-approximation of the oracle guarantees that there is a subgradient $g\in\partial f(x_{t})$ such that $\|\tilde{g}_{t}-g\|_{2}\leq\frac{\eta}{2R}$, and so for every $\tau\leq t-1$

\begin{align}
\tilde{f}_{t}+\langle\tilde{g}_{t},x_{\tau}-x_{t}\rangle
&= \Delta+\underbrace{f(x_{t})+\langle g,x_{\tau}-x_{t}\rangle}_{\leq f(x_{\tau})}+\underbrace{\langle\tilde{g}_{t}-g,x_{\tau}-x_{t}\rangle}_{\leq\|\tilde{g}_{t}-g\|_{2}\|x_{\tau}-x_{t}\|_{2}\leq\eta} \notag\\
&\leq F_{t-1}(x_{\tau})+\Delta+2t\eta, \tag{3.3}
\end{align}

the first underbrace following since $g$ is a subgradient of $f$, the second since $x_{\tau},x_{t}\in B(R)$ implies $\|x_{\tau}-x_{t}\|_{2}\leq 2R$, and the last inequality following from the induction hypothesis $F_{t-1}(x_{\tau})\geq f(x_{\tau})-2(t-1)\eta$ (inequality (3.2)); the optimality of $s^{*}$ then guarantees that it is at most $\max\{0,\Delta+2\eta t\}$, proving the claim.

Now we show that $F_{t}$ satisfies the desired bounds, namely $F_{t}(x_{\tau})\geq f(x_{\tau})-2\eta t$ for all $\tau\leq t$, and $F_{t}(x)\leq f(x)+2\eta t$ for all $x\in B(R)$. From the inductive hypothesis, for $\tau\leq t-1$ we have $F_{t}(x_{\tau})\geq F_{t-1}(x_{\tau})\geq f(x_{\tau})-2\eta(t-1)$, giving the first bound for these $x_{\tau}$. For $x_{t}$, notice that $F_{t}(x_{t})\geq\tilde{f}_{t}-s^{*}$. Therefore,

\begin{align*}
F_{t}(x_{t}) &\geq\tilde{f}_{t}-s^{*}\geq\tilde{f}_{t}-\max\{0,\Delta+2\eta t\}\\
&\geq\min\{f(x_{t})-\eta\,,\,f(x_{t})-2\eta t\}=f(x_{t})-2\eta t,
\end{align*}

where in the second inequality we used the upper bound on the shift $s^{*}\leq\max\{0,\Delta+2\eta t\}$, and in the next inequality we used the identity $\tilde{f}_{t}-\Delta=f(x_{t})$ together with the guarantee $|\tilde{f}_{t}-f(x_{t})|\leq\eta$ from the approximate oracle.

For the upper bound $F_{t}(x)\leq f(x)+2\eta t$, by the inductive hypothesis $F_{t-1}(x)\leq f(x)+2\eta(t-1)$. Moreover, the same development as in (3.3) reveals that

\begin{align*}
\tilde{f}_{t}+\langle\tilde{g}_{t},x-x_{t}\rangle-s^{*} &\leq\tilde{f}_{t}+\langle\tilde{g}_{t},x-x_{t}\rangle\\
&\leq f(x)+\Delta+\eta\leq f(x)+2\eta,
\end{align*}

where the last inequality again uses that $\Delta=\tilde{f}_{t}-f(x_{t})\leq\eta$ due to the guarantee of the approximate oracle. Thus, $F_{t}(x)\leq\max\{f(x)+2\eta(t-1),f(x)+2\eta\}\leq f(x)+2\eta t$, giving the desired bound. This concludes the proof of the lemma. ∎

Combining Lemmas 3.2, 3.3, and 3.4 shows that the first-order pairs produced by Procedure 2 satisfy the properties stated in Theorem 3.1, which concludes its proof.

4 Universal transfer for smooth functions

In this section we prove our transfer theorem for smooth functions stated in Theorem 2.3. Recall that a function ff is α\alpha-smooth if it has α\alpha-Lipschitz gradients:

x,yd,f(x)f(y)αxy.\displaystyle\forall x,y\in\mathbb{R}^{d},\quad\|\nabla f(x)-\nabla f(y)\|\leq\alpha\|x-y\|.

As in the proof of the previous transfer theorem, the core element is the following: Given the sequence of iterates x1,,xtx_{1},\ldots,x_{t} of a black-box optimization algorithm and access to an approximate first-order oracle to the smooth objective function ff, construct in an online fashion first-order pairs (f^t,g^t)(\hat{f}_{t},\hat{g}_{t}) and, implicitly, a smooth function SS close to the original ff such that (f^t,g^t)(\hat{f}_{t},\hat{g}_{t}) provide exactly the value and gradient of SS at xtx_{t}.

Theorem 4.1 (Online first-order smooth-extensibility).

Consider an $\alpha$-smooth, $M$-Lipschitz convex function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$, and a sequence of points $x_{1},\ldots,x_{T}\in B(R)$. Then, for $\eta\leq\frac{\alpha R^{2}}{5T}$, there is an online procedure that, given $\eta$-approximate first-order oracle access to $f$, produces first-order pairs $(\hat{f}_{1},\hat{g}_{1}),\ldots,(\hat{f}_{T},\hat{g}_{T})$ that have an $\alpha'$-smooth $(M+\frac{\eta}{2R})$-extension $S:B(R)\rightarrow\mathbb{R}$ satisfying $\|S-f\|_{\infty}\leq 5\eta\,(T+2)$, where $\alpha'=\alpha\,\sqrt{d}\,\big(4\sqrt{5(T+1)}+3\big)$. Moreover, the procedure only probes the approximate oracle at the given points $x_{1},\ldots,x_{T}$.

In the previous section, the extension was created by adding a new linear function at every iteration, which produced the piecewise linear (non-smooth) functions $F_{t}$. Having to construct a smooth extension creates a challenge. Our approach is to apply a smoothing procedure to these piecewise linear functions, in an online manner. One issue is that most standard smoothing procedures (e.g., via inf-convolution (Beck & Teboulle, 2012) or Gaussian smoothing (Nesterov, 2005)) may use the values of the non-smooth base function over the whole domain; in our online construction, at a given point in time we have determined the value of the function only in a neighborhood of the previous iterates, and the updated functions can change at points outside these small neighborhoods. Thus, we employ a localized smoothing procedure. Moreover, we need the procedure to leverage the fact that the non-smooth base function is close to a smooth one, and to produce stronger smoothing guarantees as a result. We start by describing this smoothing technique and its properties, and then describe the full procedure that gives Theorem 4.1.

Randomized smoothing of almost smooth functions.

Given a function h:dh:\mathbb{R}^{d}\rightarrow\mathbb{R} and a radius r>0r>0, we define the smoothed function hrh_{r} by hr(x):=𝔼h(x+rU),h_{r}(x):=\mathbb{E}h(x+rU), where UU is uniformly distributed on the unit ball B(1)B(1). It is well-known that when hh is convex and MM-Lipschitz, then hrh_{r} is differentiable, also MM-Lipschitz, and, most importantly, is Mdr\frac{M\sqrt{d}}{r}-smooth (Yousefian et al., 2012). However, we show that the smoothing parameter can be significantly improved when the function hh is already close to a smooth function. The proof is deferred to Appendix A.1.

Lemma 4.2.

Let h:B(4R)h:B(4R)\rightarrow\mathbb{R} be a convex function such that there exists an α\alpha-smooth convex function f:B(4R)f:B(4R)\rightarrow\mathbb{R} with hfε\|h-f\|_{\infty}\leq\varepsilon, for εαR2\varepsilon\leq\alpha R^{2}. Then, for rRr\leq R the smoothed function hr:B(R)h_{r}:B(R)\rightarrow\mathbb{R} (so restricted to the ball B(R)B(R)) satisfies:

  1. $h_{r}$ is $\big(\frac{4\sqrt{\alpha\varepsilon d}}{r}+3\alpha\sqrt{d}\big)$-smooth.

  2. $|h_{r}(x)-f(x)|\leq\varepsilon+\frac{\alpha r^{2}}{2}$ for all $x\in B(R)$.
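The smoothing operation $h_{r}(x)=\mathbb{E}\,h(x+rU)$ used in Lemma 4.2 has no closed form in general, but it can be approximated by sampling. The following Python sketch is a Monte Carlo estimate under assumptions made here for illustration: the function names `sample_unit_ball` and `smoothed_value`, the sample count, and the fixed seed are all hypothetical, and the estimate carries sampling error that the exact $h_{r}$ in the text does not.

```python
import math
import random


def sample_unit_ball(d, rng):
    """Draw a point uniformly from the unit Euclidean ball in R^d."""
    # A normalized Gaussian vector gives a uniform direction; scaling by
    # V^(1/d) with V uniform on [0, 1] makes the radius match the ball volume.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    radius = rng.random() ** (1.0 / d)
    return [radius * v / norm for v in g]


def smoothed_value(h, x, r, n_samples=20000, seed=0):
    """Monte Carlo estimate of h_r(x) = E[h(x + r U)], U uniform on B(1)."""
    rng = random.Random(seed)
    d = len(x)
    total = 0.0
    for _ in range(n_samples):
        u = sample_unit_ball(d, rng)
        total += h([xi + r * ui for xi, ui in zip(x, u)])
    return total / n_samples
```

For example, with $h(x)=|x|$ in dimension one, $x=0$ and $r=1$, the exact smoothed value is $\mathbb{E}|U|=1/2$, and `smoothed_value(lambda z: abs(z[0]), [0.0], 1.0)` should return a value close to it.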

Construction of the smooth-extension.

As mentioned, in each iteration tt we will maintain a piecewise linear function FtF_{t} constructed very similarly to the proof of Theorem 3.1. Now we will also maintain the smoothened version (Ft)r(F_{t})_{r} of this function that uses the randomized smoothing discussed above (for a particular value of rr). Our transfer procedure then returns the first-order information f^t:=(Ft)r(xt)\hat{f}_{t}:=(F_{t})_{r}(x_{t}) and g^t:=(Ft)r(xt)\hat{g}_{t}:=\nabla(F_{t})_{r}(x_{t}) of the latter. The final smooth function S:B(R)S:B(R)\rightarrow\mathbb{R} compatible with the first-order information returned by the procedure will be given, as in Lemma 3.2, by taking the maximum between the final FTF_{T} and a shifted version of the original function ff.

The main difference in how the functions FtF_{t}’s are constructed, compared to the proof of Theorem 3.1, is the following. Previously, in order to ensure that FTF_{T} (and so the final extension) was compatible with the first-order pairs output in earlier iterations, we needed to “protect” the points xtx_{t} and ensure that the function values and gradients at these points did not change over time, e.g., we needed FT(xt)=Ft(xt)F_{T}(x_{t})=F_{t}(x_{t}). But now the first-order pair output for the query point xtx_{t} depends not only on the value of FtF_{t} at xtx_{t}, but also on the values on the whole ball B(xt,r)B(x_{t},r) that are used to determine the smoothed function (Ft)r(F_{t})_{r} at xtx_{t}. Thus, we will now need to “protect” these balls and ensure that the function values over them do not change in later iterations.

We now formalize the construction of the functions FtF_{t}, the first-order information returned, and the final extension SS in Procedure 3.

Procedure 3.
Set $r=\sqrt{\eta/\alpha}$ and $F_{0}(x)=-\infty$. For each $t=1,\ldots,T$:

  1. Query the $\eta$-approximate oracle for $f$ at $x_{t}$, receiving the first-order pair $(\tilde{f}_{t},\tilde{g}_{t})$.

  2. Define the function $F_{t}$ by setting $F_{t}(x)=\max\{F_{t-1}(x)\,,\,\tilde{f}_{t}+\langle\tilde{g}_{t},x-x_{t}\rangle-(4\eta t+\alpha r^{2}t+2\eta)\}$ for all $x$.

  3. Output the first-order information of the randomly smoothed function $(F_{t})_{r}$: $\hat{f}_{t}:=(F_{t})_{r}(x_{t})$ and $\hat{g}_{t}:=\nabla(F_{t})_{r}(x_{t})$.

Define the function $S:B(R)\rightarrow\mathbb{R}$ by $S=\big(\max\{F_{T},\,f-(4\eta(T+1)+\alpha r^{2}(T+1))\}\big)_{r}$, where $\max$ denotes pointwise maximum.
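The per-iteration part of Procedure 3 can be sketched as follows. This is an illustrative approximation, not the procedure itself: the exact expectations defining $(F_{t})_{r}(x_{t})$ and $\nabla(F_{t})_{r}(x_{t})$ are replaced here by Monte Carlo averages, the gradient is estimated by averaging the slope of the active linear piece at each sample, and the names `SmoothedExtension`, `shift`, `n_samples` and `seed` are hypothetical. The final function $S$ is not constructed, since it requires the exact $f$.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample_unit_ball(d, rng):
    """Uniform sample from the unit Euclidean ball in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    radius = rng.random() ** (1.0 / d)
    return [radius * v / norm for v in g]


def shift(t, eta, alpha, r):
    """Shift 4*eta*t + alpha*r^2*t + 2*eta applied to the t-th piece."""
    return 4.0 * eta * t + alpha * r * r * t + 2.0 * eta


class SmoothedExtension:
    """Hypothetical helper tracking F_t and sampling (F_t)_r at x_t."""

    def __init__(self, eta, alpha, n_samples=2000, seed=0):
        self.eta, self.alpha = eta, alpha
        self.r = math.sqrt(eta / alpha)
        self.pieces = []  # each piece (c, g, a) encodes x -> c + <g, x - a>
        self.n_samples = n_samples
        self.rng = random.Random(seed)

    def piecewise_max(self, x):
        best_val, best_g = -math.inf, None
        for c, g, a in self.pieces:
            val = c + dot(g, [xi - ai for xi, ai in zip(x, a)])
            if val > best_val:
                best_val, best_g = val, g
        return best_val, best_g

    def step(self, x_t, f_tilde, g_tilde):
        # Step 2: add the new linear piece, shifted down by a fixed amount.
        t = len(self.pieces) + 1
        s = shift(t, self.eta, self.alpha, self.r)
        self.pieces.append((f_tilde - s, list(g_tilde), list(x_t)))
        # Step 3 (approximated): average F_t and its active slope over
        # samples from the ball B(x_t, r).
        d = len(x_t)
        val_sum, grad_sum = 0.0, [0.0] * d
        for _ in range(self.n_samples):
            u = sample_unit_ball(d, self.rng)
            z = [xi + self.r * ui for xi, ui in zip(x_t, u)]
            val, g = self.piecewise_max(z)
            val_sum += val
            grad_sum = [a + b for a, b in zip(grad_sum, g)]
        n = self.n_samples
        return val_sum / n, [a / n for a in grad_sum]
```

Unlike Procedure 2, no data-dependent shift $s^{*}$ is computed here; the fixed shift grows with $t$, which is what the analysis uses to keep later pieces below the function on the balls around earlier query points.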

The proof that this procedure indeed yields Theorem 4.1 is presented in Appendix A.2.

Acknowledgments

We would like to thank the reviewers for their detailed and insightful feedback, which has corrected inaccuracies in the previous proofs and has improved the presentation of the paper.

The first and fourth authors would like to acknowledge support from Air Force Office of Scientific Research (AFOSR) grant FA95502010341 and National Science Foundation (NSF) grant CCF2006587. The second author was supported in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES, Brasil) - Finance Code 001, and by Bolsa de Produtividade em Pesquisa #312751/2021-4 from CNPq.

Impact Statement

This paper presents work whose goal is to advance the fields of Machine Learning and Optimization. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Bandi et al. (2019) Bandi, H., Bertsimas, D., and Mazumder, R. Learning a mixture of gaussians via mixed-integer optimization. INFORMS Journal on Optimization, 1(3):221–240, 2019.
  • Basu (2023) Basu, A. Complexity of optimizing over the integers. Mathematical Programming, Series B, 200:739–780, 2023.
  • Basu et al. (2023) Basu, A., Jiang, H., Kerger, P., and Molinaro, M. Information complexity of mixed-integer convex optimization. arXiv preprint arXiv:2308.11153, 2023.
  • Beck & Teboulle (2012) Beck, A. and Teboulle, M. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557–580, 2012. doi: 10.1137/100818327. URL https://doi.org/10.1137/100818327.
  • Bertsekas (1973) Bertsekas, D. P. Stochastic optimization problems with nondifferentiable cost functionals. Journal of Optimization Theory and Applications, 12:218–231, 1973.
  • Bertsimas et al. (2016) Bertsimas, D., King, A., and Mazumder, R. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.
  • Bubeck (2015) Bubeck, S. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn., 8(3–4):231–357, nov 2015. ISSN 1935-8237. doi: 10.1561/2200000050. URL https://doi.org/10.1561/2200000050.
  • Chen & Qi (2005) Chen, C.-P. and Qi, F. The best bounds in Wallis’ inequality. Proceedings of the American Mathematical Society, (133):397–401, 2005.
  • Cohen et al. (2018) Cohen, M., Diakonikolas, J., and Orecchia, L. On acceleration with noise-corrupted gradients. In International Conference on Machine Learning, pp.  1019–1028. PMLR, 2018.
  • Cottle et al. (2009) Cottle, R. W., Pang, J.-S., and Stone, R. E. The linear complementarity problem, volume 60. Siam, 2009.
  • d’Aspremont (2008) d’Aspremont, A. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008.
  • Dedieu et al. (2021) Dedieu, A., Hazimeh, H., and Mazumder, R. Learning sparse classifiers: Continuous and mixed integer optimization perspectives. The Journal of Machine Learning Research, 22(1):6008–6054, 2021.
  • Devolder et al. (2014) Devolder, O., Glineur, F., and Nesterov, Y. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146:37–75, 2014.
  • Dey et al. (2022) Dey, S. S., Mazumder, R., and Wang, G. Using $\ell_{1}$-relaxation and integer programming to obtain dual bounds for sparse pca. Operations Research, 70(3):1914–1932, 2022.
  • d’Aspremont et al. (2021) d’Aspremont, A., Scieur, D., and Taylor, A. Acceleration methods. Foundations and Trends® in Optimization, 5(1-2):1–245, 2021. ISSN 2167-3888. doi: 10.1561/2400000036. URL http://dx.doi.org/10.1561/2400000036.
  • Gasnikov & Tyurin (2019) Gasnikov, A. and Tyurin, A. Fast gradient descent for convex minimization problems with an oracle producing a ($\delta$, l)-model of function at the requested point. Computational Mathematics and Mathematical Physics, 59:1085–1097, 2019. URL https://api.semanticscholar.org/CorpusID:202124988.
  • Hazimeh et al. (2022) Hazimeh, H., Mazumder, R., and Saab, A. Sparse regression at scale: Branch-and-bound rooted in first-order optimization. Mathematical Programming, 196(1-2):347–388, 2022.
  • Hazimeh et al. (2023) Hazimeh, H., Mazumder, R., and Radchenko, P. Grouped variable selection with discrete optimization: Computational and statistical perspectives. The Annals of Statistics, 51(1):1–32, 2023.
  • Hintermüller (2001) Hintermüller, M. A proximal bundle method based on approximate subgradients. Computational Optimization and Applications, 20:245–266, 2001.
  • Kiwiel (2006) Kiwiel, K. C. A proximal bundle method with approximate subgradient linearizations. SIAM Journal on optimization, 16(4):1007–1023, 2006.
  • Lan (2009) Lan, G. Convex optimization under inexact first-order information. Georgia Institute of Technology, 2009.
  • Lindvall (2002) Lindvall, T. Lectures on the Coupling Method. Dover Books on Mathematics Series. Dover Publications, Incorporated, 2002. ISBN 9780486421452. URL https://books.google.com.br/books?id=GB290HEW724C.
  • Mazumder & Radchenko (2017) Mazumder, R. and Radchenko, P. The discrete dantzig selector: Estimating sparse linear models via mixed integer linear optimization. IEEE Transactions on Information Theory, 63(5):3053–3075, 2017.
  • Nemirovski (1994) Nemirovski, A. Efficient methods in convex programming. Lecture notes, 1994.
  • Nesterov (1983) Nesterov, Y. A method of solving a convex programming problem with convergence rate $o(\frac{1}{k^{2}})$. Dokl. Akad. Nauk SSSR, 269(3):543–547, 1983.
  • Nesterov (2005) Nesterov, Y. Smooth minimization of non-smooth functions. Mathematical Programming, 103:127–152, 2005.
  • Nesterov (2004) Nesterov, Y. E. Introductory Lectures on Convex Optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, 2004. ISBN 1-4020-7553-7.
  • Nesterov (2018) Nesterov, Y. E. Lectures on convex optimization, volume 137. Springer, 2018.
  • Polyak (1987) Polyak, B. Introduction to optimization. Translations Series in Mathematics and Engineering. New York: Optimization Software Inc. Publications Division, 1987.
  • Schmidt et al. (2011) Schmidt, M., Roux, N., and Bach, F. Convergence rates of inexact proximal-gradient methods for convex optimization. Advances in neural information processing systems, 24, 2011.
  • Shor (1985) Shor, N. Z. Minimization methods for non-differentiable functions. Springer Series in Computational Mathematics, 1985.
  • Wang & Abernethy (2018) Wang, J.-K. and Abernethy, J. D. Acceleration through optimistic no-regret dynamics. Advances in Neural Information Processing Systems, 31, 2018.
  • Yousefian et al. (2011) Yousefian, F., Nedić, A., and Shanbhag, U. V. On stochastic gradient and subgradient methods with adaptive steplength sequences. arXiv preprint arxiv:1105.4549, 2011.
  • Yousefian et al. (2012) Yousefian, F., Nedić, A., and Shanbhag, U. V. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica, 48(1):56–67, 2012. ISSN 0005-1098. doi: https://doi.org/10.1016/j.automatica.2011.09.043. URL https://www.sciencedirect.com/science/article/pii/S0005109811004833.

Appendix

Appendix A Universal transfer for smooth functions

In this section we present the missing proofs for our transfer theorem for smooth functions from Section 4. We start by recalling the definition of a smooth function.

Definition A.1.

A function f:df:\mathbb{R}^{d}\to\mathbb{R} is said to be α\alpha-smooth if it is differentiable and its gradient is Lipschitz continuous with a Lipschitz constant α\alpha, namely

x,yd,f(x)f(y)αxy.\displaystyle\forall x,y\in\mathbb{R}^{d},\quad\|\nabla f(x)-\nabla f(y)\|\leq\alpha\|x-y\|.

An $\alpha$-smooth function possesses the following useful upper bounding property: for all $x,y\in\mathbb{R}^{d}$,

f(y)f(x)+f(x),yx+α2yx2.f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{\alpha}{2}\|y-x\|^{2}. (A.1)

A.1 Proof of Lemma 4.2

Let hh and ff be the convex functions over the ball B(4R)B(4R) satisfying the statement of the lemma, i.e., hfε\|h-f\|_{\infty}\leq\varepsilon and ff is α\alpha-smooth. Recall that the smoothed function hrh_{r} is defined as hr(x)=𝔼h(x+rU)h_{r}(x)=\mathbb{E}h(x+rU) for xB(R)x\in B(R), where UU is a random vector uniformly distributed on the unit ball B(1)B(1) and rRr\leq R.

The first observation is that since hh is close to ff and the latter is smooth, their (sub)gradients are close to each other; the same also holds between hrh_{r} and ff.

Lemma A.2.

We have the following:

  1. $\|\partial h(x)-\nabla f(x)\|\leq 2\sqrt{\alpha\varepsilon}$ for every $x\in B(2R)$ and every subgradient $\partial h(x)$.

  2. $\|\nabla h_{r}(x)-\nabla f(x)\|\leq 2\sqrt{\alpha\varepsilon}+\alpha r$ for every $x\in B(R)$.

Proof.

To prove the first item, fix $x\in B(2R)$ and a subgradient $\partial h(x)$; we may assume $\partial h(x)\neq\nabla f(x)$, since otherwise the bound is trivial. Let $y$ be such that

\[
y-x=2\sqrt{\frac{\varepsilon}{\alpha}}\cdot\frac{\partial h(x)-\nabla f(x)}{\|\partial h(x)-\nabla f(x)\|}.
\]

Since $x\in B(2R)$, notice that $y$ has norm at most $2R+2\sqrt{\varepsilon/\alpha}$, which by the assumption $\varepsilon\leq\alpha R^{2}$ is at most $4R$; thus, $y$ is in the domain of $h$ and $f$.

Then using $\alpha$-smoothness of $f$, $\|h-f\|_{\infty}\leq\varepsilon$, and convexity of $h$, we have

\begin{align*}
f(x)+\langle\nabla f(x),y-x\rangle+\frac{\alpha}{2}\|x-y\|^{2}\,&\geq\,f(y)\,\geq\,h(y)-\varepsilon\\
&\geq\,h(x)+\langle\partial h(x),y-x\rangle-\varepsilon,
\end{align*}

and so

\begin{align*}
\langle\partial h(x)-\nabla f(x),y-x\rangle &\leq f(x)-h(x)+\varepsilon+\frac{\alpha}{2}\|x-y\|^{2}\\
&\leq 2\varepsilon+\frac{\alpha}{2}\|x-y\|^{2}\,.
\end{align*}

Plugging the definition of $y$ into this expression gives

\[
2\sqrt{\frac{\varepsilon}{\alpha}}\cdot\|\partial h(x)-\nabla f(x)\|\leq 4\varepsilon,
\]

and so $\|\partial h(x)-\nabla f(x)\|\leq 2\sqrt{\alpha\varepsilon}$, which gives the first item of the lemma.
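The substitution step can be written out explicitly. The worked computation below assumes, as the argument requires, that $y-x$ points in the direction of $\partial h(x)-\nabla f(x)$ with $\|y-x\|=2\sqrt{\varepsilon/\alpha}$, and that the two vectors differ.

```latex
\begin{align*}
\langle \partial h(x)-\nabla f(x),\, y-x\rangle
  &= \|y-x\|\cdot\|\partial h(x)-\nabla f(x)\|
   = 2\sqrt{\tfrac{\varepsilon}{\alpha}}\;\|\partial h(x)-\nabla f(x)\|, \\
2\varepsilon+\tfrac{\alpha}{2}\|x-y\|^{2}
  &= 2\varepsilon+\tfrac{\alpha}{2}\cdot\tfrac{4\varepsilon}{\alpha}
   = 4\varepsilon, \\
\|\partial h(x)-\nabla f(x)\|
  &\le \frac{4\varepsilon}{2\sqrt{\varepsilon/\alpha}}
   = 2\sqrt{\alpha\varepsilon}.
\end{align*}
```

The step length $2\sqrt{\varepsilon/\alpha}$ is the choice that balances the two terms $2\varepsilon$ and $\frac{\alpha}{2}\|x-y\|^{2}$ on the right-hand side.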

For the second item, again let $U$ be uniformly distributed in $B(1)$. This random variable is sufficiently regular that gradients and expectations commute, namely $\nabla h_{r}(x)=\nabla(\mathbb{E}\,h(x+rU))=\mathbb{E}\,\partial h(x+rU)$, where $\partial h$ denotes any subgradient of $h$ (Bertsekas, 1973). Then applying Jensen’s inequality, for any $x\in B(R)$ we get

\begin{align*}
\|\nabla h_{r}(x)-\nabla f(x)\| &=\|\mathbb{E}\,\partial h(x+rU)-\nabla f(x)\|\\
&\leq\mathbb{E}\,\|\partial h(x+rU)-\nabla f(x)\|.
\end{align*}

Also, for any vector $u$ with $\|u\|\leq 1$ we have

\begin{align*}
\|\partial h(x+ru)-\nabla f(x)\| &\leq\|\partial h(x+ru)-\nabla f(x+ru)\|+\|\nabla f(x+ru)-\nabla f(x)\|\\
&\leq 2\sqrt{\alpha\varepsilon}+\alpha r,
\end{align*}

where the last inequality follows from Item 1 of the lemma (since $r\leq R$, $x+ru$ has norm at most $R+r\leq 2R$ and so the item can indeed be applied) and $\alpha$-smoothness of $f$ (which is equivalent to $\|\nabla f(z)-\nabla f(z')\|\leq\alpha\|z-z'\|$ (Nesterov, 2018)). This concludes the proof. ∎

The second element that we will need is a bound on the total variation distance between the uniform distributions on two balls of the same radius with different centers.

Lemma A.3.

Let $X\in\mathbb{R}^{d}$ be uniformly distributed on $B(x,r)$ and $Y\in\mathbb{R}^{d}$ be uniformly distributed on $B(y,r)$. Then there is a random variable $(X',Y')\in\mathbb{R}^{2d}$ where $X'$ has the same distribution as $X$ and $Y'$ the same distribution as $Y$, and where $\Pr(X'\neq Y')\leq\frac{\|x-y\|\sqrt{d}}{r}$.

Proof sketch.

This folklore result can be obtained as follows. Let $\mu_{z}$ be the uniform distribution over $B(z,r)$. Since $\mu_{x}$ and $\mu_{y}$ are the distributions of $X$ and $Y$, by the Maximal Coupling Lemma (Theorem 5.2 of (Lindvall, 2002)) there is a random variable $(X',Y')\in\mathbb{R}^{2d}$ where $X'\sim\mu_{x}$ and $Y'\sim\mu_{y}$ and $\Pr(X'\neq Y')=\frac{1}{2}\int|\mathrm{d}\mu_{x}(z)-\mathrm{d}\mu_{y}(z)|\,\mathrm{d}z$. Moreover, it is known that the right-hand side is at most $\frac{\|x-y\|\sqrt{d}}{r}$; see for example inequality (39) of (Yousefian et al., 2011) (plus the estimate from (Chen & Qi, 2005)). ∎
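As a sanity check, the bound of Lemma A.3 can be verified by hand in dimension $d=1$, where the balls are intervals. The computation below is an illustration added here, not part of the original argument; it treats $\mu_{x}$ and $\mu_{y}$ as densities equal to $\frac{1}{2r}$ on $[x-r,x+r]$ and $[y-r,y+r]$ respectively.

```latex
\begin{align*}
\text{If } |x-y|\le 2r:\quad
\tfrac{1}{2}\int |\mu_x(z)-\mu_y(z)|\,\mathrm{d}z
  &= \tfrac{1}{2}\cdot\underbrace{2|x-y|}_{\text{length of symmetric difference}}\cdot\tfrac{1}{2r}
   = \frac{|x-y|}{2r}
   \;\le\; \frac{|x-y|\sqrt{1}}{r}, \\
\text{If } |x-y|> 2r:\quad
\tfrac{1}{2}\int |\mu_x(z)-\mu_y(z)|\,\mathrm{d}z
  &= 1 \;<\; \frac{|x-y|}{r}.
\end{align*}
```

In both cases the total variation distance is within the claimed bound $\frac{\|x-y\|\sqrt{d}}{r}$ with $d=1$.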

We are now ready to prove Lemma 4.2.

Proof of Lemma 4.2.

Item 1: We prove that $\|\nabla h_{r}(x)-\nabla h_{r}(y)\|\leq\big(\frac{4\sqrt{\alpha\varepsilon d}}{r}+3\alpha\sqrt{d}\big)\|x-y\|$ for all $x,y\in B(R)$. In fact, it suffices to prove this for $x,y$ where $\|x-y\|\leq r$, since the inequality can then be chained along the segment between the two points to obtain the result for any pair of points.

Then fix $x,y$ with $\|x-y\|\leq r$. Using the notation from Lemma A.3, $\nabla h_{r}(x)=\mathbb{E}\,\partial h(X')$ and $\nabla h_{r}(y)=\mathbb{E}\,\partial h(Y')$ and $\Pr(X'\neq Y')\leq\frac{\|x-y\|\sqrt{d}}{r}$. Applying Jensen’s inequality,

\begin{align*}
\|\nabla h_{r}(x)-\nabla h_{r}(y)\| &\leq\mathbb{E}_{X',Y'}\|\partial h(X')-\partial h(Y')\|\\
&\leq\frac{\|x-y\|\sqrt{d}}{r}\cdot\max_{x',y'\in B(x,r)\cup B(y,r)}\|\partial h(x')-\partial h(y')\|.
\end{align*}

We upper bound the last term by applying the triangle inequality and then Lemma A.2:

\begin{align*}
\|\partial h(x')-\partial h(y')\| &\leq\|\partial h(x')-\nabla f(x')\|+\|\nabla f(x')-\nabla f(y')\|+\|\nabla f(y')-\partial h(y')\|\\
&\leq 4\sqrt{\alpha\varepsilon}+\alpha\,\|x'-y'\|\\
&\leq 4\sqrt{\alpha\varepsilon}+3\alpha r,
\end{align*}

where the second inequality uses Lemma A.2 and that $f$ is $\alpha$-smooth, and the last inequality uses $\|x'-y'\|\leq 2r+\|x-y\|\leq 3r$, which follows from $x',y'\in B(x,r)\cup B(y,r)$ and the assumption $\|x-y\|\leq r$. Plugging this into the previous inequality gives

\[
\|\nabla h_{r}(x)-\nabla h_{r}(y)\|\leq\bigg(\frac{4\sqrt{\alpha\varepsilon d}}{r}+3\alpha\sqrt{d}\bigg)\cdot\|x-y\|,
\]

as desired.

Item 2: We now show that $|h_{r}(x)-f(x)|\leq\varepsilon+\frac{\alpha r^{2}}{2}$ for all $x\in B(R)$. Fix $x\in B(R)$, and again let $U$ be uniformly distributed in the unit ball. Using the assumption $\|h-f\|_{\infty}\leq\varepsilon$ and convexity of $f$, we have

\[
h(x+rU)\geq f(x+rU)-\varepsilon\geq f(x)+\langle\nabla f(x),rU\rangle-\varepsilon.
\]

Since $U$ has mean zero, taking expectations gives $h_{r}(x)\geq f(x)-\varepsilon$. Similarly, since $f$ is $\alpha$-smooth,

\begin{align*}
h(x+rU) &\leq f(x+rU)+\varepsilon\\
&\leq f(x)+\langle\nabla f(x),rU\rangle+\frac{\alpha}{2}\|rU\|^{2}+\varepsilon,
\end{align*}

and taking expectations (using $\|U\|\leq 1$) gives $h_{r}(x)\leq f(x)+\frac{\alpha r^{2}}{2}+\varepsilon$. Together, these yield $|h_{r}(x)-f(x)|\leq\varepsilon+\frac{\alpha r^{2}}{2}$, thus proving the result. This concludes the proof of the lemma. ∎

A.2 Proof of Theorem 4.1

Throughout this section, fix an α\alpha-smooth MM-Lipschitz function f:B(R)f:B(R)\rightarrow\mathbb{R}. Recall that we have a sequence of queried points x1,,xTB(R)x_{1},\ldots,x_{T}\in B(R) and access to an η\eta-approximate first-order oracle for ff. Our goal is to produce, in an online fashion, a sequence of first-order pairs (f^1,g^1),,(f^T,g^T)(\hat{f}_{1},\hat{g}_{1}),\ldots,(\hat{f}_{T},\hat{g}_{T}) for the queried points and a function SS that is smooth, Lipschitz, and compatible with these first-order pairs (i.e., S(xt)=f^tS(x_{t})=\hat{f}_{t} and S(xt)=g^t\nabla S(x_{t})=\hat{g}_{t}).

As mentioned, in each iteration $t$ we maintain a piecewise linear function $F_t$ and its smoothed version $(F_t)_r$ (obtained by applying the randomized smoothing from the previous section with a specific value of $r$). Our transfer procedure then returns the first-order information $\hat{f}_t := (F_t)_r(x_t)$ and $\hat{g}_t := \nabla (F_t)_r(x_t)$ of the latter. The final smooth function $S : B(R) \to \mathbb{R}$ compatible with the first-order information output by the procedure is obtained, as in Lemma 3.2, by taking the maximum of the final $F_T$ and a shifted version of the original function $f$. Also recall that, to ensure that $S$ is compatible with the first-order information $(\hat{f}_t, \hat{g}_t)$ output throughout the process, we need to “protect” the points $x_t$ and ensure that the function values and gradients at these points do not change across iterations, i.e., $(F_T)_r(x_t) = (F_t)_r(x_t)$ and $\nabla (F_T)_r(x_t) = \nabla (F_t)_r(x_t)$. Since $(F_{t'})_r(x_t)$ depends on the values of $F_{t'}$ on the ball $B(x_t, r)$ around $x_t$, we need to “protect” $F_{t'}$ on these balls, namely to have $F_T(x) = F_t(x)$ for all $x \in B(x_t, r)$.

For convenience, we recall the exact construction of the functions FtF_{t}, the first-order information returned, and the final extension SS. In hindsight, set r:=η/αr:=\sqrt{\eta/\alpha}, and for every tt define the shift st:=4ηt+αr2ts_{t}:=4\eta t+\alpha r^{2}t.

Procedure 3. Set $F_0(x) = -\infty$. For each $t = 1, \ldots, T$:
  1. Query the $\eta$-approximate oracle for $f$ at $x_t$, receiving the first-order pair $(\tilde{f}_t, \tilde{g}_t)$.
  2. Define the function $F_t$ by setting $F_t(x) = \max\{F_{t-1}(x),\ \tilde{f}_t + \langle \tilde{g}_t, x - x_t\rangle - (s_t + 2\eta)\}$ for all $x$.
  3. Output the first-order information of the randomly smoothed function $(F_t)_r$: $\hat{f}_t := (F_t)_r(x_t)$ and $\hat{g}_t := \nabla (F_t)_r(x_t)$.
Define the function $S : B(R) \to \mathbb{R}$ by $S = (\max\{F_T, f - s_{T+1}\})_r$, where $\max$ denotes the pointwise maximum.
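The bookkeeping in Procedure 3 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the class and method names (`SmoothTransfer`, `add_query`, `smoothed_pair`) are hypothetical, and the exact expectation defining $(F_t)_r$ is replaced by a Monte Carlo average over samples from the ball, which is an assumption made here purely to keep the example runnable.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample_unit_ball(d, rng):
    """Uniform sample from the unit Euclidean ball in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    radius = rng.random() ** (1.0 / d)
    return [radius * gi / norm for gi in g]


class SmoothTransfer:
    """Bookkeeping for Procedure 3 (hypothetical names, not from the paper)."""

    def __init__(self, eta, alpha):
        self.eta = eta
        self.r = math.sqrt(eta / alpha)              # r := sqrt(eta / alpha)
        self.shift = 4 * eta + alpha * self.r ** 2   # s_t = t * shift
        self.pieces = []                             # affine pieces defining F_t

    def add_query(self, x_t, f_tilde, g_tilde):
        """Step 2: append the shifted affine piece built from the noisy pair."""
        t = len(self.pieces) + 1
        offset = t * self.shift + 2 * self.eta       # s_t + 2 * eta
        self.pieces.append((f_tilde, list(g_tilde), list(x_t), offset))

    def _piece(self, piece, x):
        f_t, g_t, x_t, offset = piece
        return f_t + dot(g_t, [a - b for a, b in zip(x, x_t)]) - offset

    def F(self, x, t=None):
        """F_t(x): maximum of the first t affine pieces (default: all of them)."""
        pieces = self.pieces if t is None else self.pieces[:t]
        return max((self._piece(p, x) for p in pieces), default=-math.inf)

    def smoothed_pair(self, x, t=None, samples=1000, seed=0):
        """Step 3, approximated: Monte Carlo estimate of the value and gradient
        of (F_t)_r at x. Requires at least one recorded query."""
        pieces = self.pieces if t is None else self.pieces[:t]
        rng = random.Random(seed)
        d = len(x)
        value, grad = 0.0, [0.0] * d
        for _ in range(samples):
            u = sample_unit_ball(d, rng)
            y = [xi + self.r * ui for xi, ui in zip(x, u)]
            active = max(pieces, key=lambda p: self._piece(p, y))
            value += self._piece(active, y) / samples
            # The gradient of a max of affine pieces is the slope of the active piece.
            grad = [gi + ai / samples for gi, ai in zip(grad, active[1])]
        return value, grad
```

In this sketch, one would call `add_query(x_t, f_tilde, g_tilde)` after each oracle query and then `smoothed_pair(x_t)` to obtain estimates of $(\hat{f}_t, \hat{g}_t)$. The final function $S$ additionally involves $f - s_{T+1}$, which is only used in the analysis and is therefore not represented here.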

We now prove the main properties of the functions FtF_{t}, formulated in the following lemma. The first two are similar to (3.1) and (3.2) used in our non-smooth transfer result and guarantee, loosely speaking, that FtF_{t} is close to the original function ff. The third property is precisely the “ball protection” idea discussed above.

Lemma A.4.

For all tt, the function FtF_{t} satisfies the following:

  1. $F_t(x) \le f(x)$ for every $x \in B(4R)$.

  2. For every $t' \le t$, we have $F_t(x) \ge f(x) - s_{t+1}$ for all $x \in B(x_{t'}, \sqrt{2}\,r)$.

  3. For every $t' \le t$, we have $F_t(x) = F_{t'}(x)$ for every $x \in B(x_{t'}, \sqrt{2}\,r)$. In particular, $(F_t)_r(x_{t'}) = (F_{t'})_r(x_{t'})$ and $\nabla (F_t)_r(x_{t'}) = \nabla (F_{t'})_r(x_{t'})$.

Proof.

We prove these properties by induction on tt.

First item: Since the property holds by induction for Ft1F_{t-1} and Ft(x)=max{Ft1(x),f~t+g~t,xxt(st+2η)}F_{t}(x)=\max\{F_{t-1}(x)\,,\,\tilde{f}_{t}+\langle\tilde{g}_{t},x-x_{t}\rangle-(s_{t}+2\eta)\}, it suffices to show that

f~t+g~t,xxt(st+2η)f(x)\displaystyle\tilde{f}_{t}+\langle\tilde{g}_{t},x-x_{t}\rangle-(s_{t}+2\eta)\leq f(x) (A.2)

for all xB(4R)x\in B(4R). For that, since (f~t,g~t)(\tilde{f}_{t},\tilde{g}_{t}) comes from an η\eta-approximate first-order oracle, by definition |f~tf(xt)|η|\tilde{f}_{t}-f(x_{t})|\leq\eta and g~tf(xt)η2R\|\tilde{g}_{t}-\nabla f(x_{t})\|\leq\frac{\eta}{2R}; in particular, |g~tf(xt),xxt|g~tf(xt)xxt5η2|\langle\tilde{g}_{t}-\nabla f(x_{t}),x-x_{t}\rangle|\leq\|\tilde{g}_{t}-\nabla f(x_{t})\|\|x-x_{t}\|\leq\frac{5\eta}{2} for every xB(4R)x\in B(4R) (since also xtB(R)x_{t}\in B(R), by assumption). Then using convexity of ff we get

$$\begin{aligned}
f(x) &\ge f(x_t) + \langle \nabla f(x_t), x - x_t\rangle \\
&\ge \tilde{f}_t + \langle \tilde{g}_t, x - x_t\rangle - \eta - \frac{5\eta}{2},
\end{aligned} \tag{A.3}$$

which implies (A.2) as desired, since st+2ηη+5η2s_{t}+2\eta\geq\eta+\frac{5\eta}{2}.

Second item: Again since this property holds by induction for Ft1F_{t-1}, it suffices to show

f~t+g~t,xxt(st+2η)f(x)st+1\displaystyle\tilde{f}_{t}+\langle\tilde{g}_{t},x-x_{t}\rangle-(s_{t}+2\eta)\geq f(x)-s_{t+1} (A.4)

for all xB(xt,2r)x\in B(x_{t},\sqrt{2}\,r). Since ff is α\alpha-smooth, for every such xx we have

$$\begin{aligned}
f(x) &\le f(x_t) + \langle \nabla f(x_t), x - x_t\rangle + \frac{\alpha}{2}\|x - x_t\|^2 \\
&\le \tilde{f}_t + \langle \tilde{g}_t, x - x_t\rangle + 2\eta + \alpha r^2.
\end{aligned} \tag{A.5}$$

Since st+1=st+4η+αr2s_{t+1}=s_{t}+4\eta+\alpha r^{2}, reorganizing the terms gives (A.4) as desired.

Third item: To show that for every ttt^{\prime}\leq t, we have Ft(x)=Ft(x)F_{t}(x)=F_{t^{\prime}}(x) for every xB(xt,2r)x\in B(x_{t^{\prime}},\sqrt{2}\,r), it suffices to show that for every t<tt^{\prime}<t

f~t+g~t,xxt(st+2η)Ft1(x)\displaystyle\tilde{f}_{t}+\langle\tilde{g}_{t},x-x_{t}\rangle-(s_{t}+2\eta)\leq F_{t-1}(x) (A.6)

for all xB(xt,2r)x\in B(x_{t^{\prime}},\sqrt{2}\,r). For that, first notice that for all t<tt^{\prime}<t we have Ft1FtF_{t-1}\geq F_{t^{\prime}}, and the latter can be lower bounded by the affine term added during iteration tt^{\prime}. Combining this with (A.5), applied to iteration tt^{\prime}, we get for all xB(xt,2r)x\in B(x_{t^{\prime}},\sqrt{2}\,r)

$$\begin{aligned}
F_{t-1}(x) &\ge \tilde{f}_{t'} + \langle \tilde{g}_{t'}, x - x_{t'}\rangle - (s_{t'} + 2\eta) \\
&\ge f(x) - (s_{t'} + 4\eta + \alpha r^2) \\
&\ge \tilde{f}_t + \langle \tilde{g}_t, x - x_t\rangle - (s_{t'} + 6\eta + \alpha r^2),
\end{aligned}$$

where the last inequality uses (A.3). Since stst+4η+αr2s_{t}\geq s_{t^{\prime}}+4\eta+\alpha r^{2}, this implies (A.6) as desired.

To conclude the proof of this item, notice that $(F_t)_r(x_{t'})$ (respectively $(F_{t'})_r(x_{t'})$) only depends on the values of $F_t$ (resp. $F_{t'}$) on the ball $B(x_{t'}, r)$. Since we just showed that the values of $F_t$ and $F_{t'}$ agree on this ball, we get $(F_t)_r(x_{t'}) = (F_{t'})_r(x_{t'})$. Similarly, the gradient $\nabla (F_t)_r(x_{t'})$ only depends on the values of $F_t$ on an arbitrarily small open neighborhood of the ball $B(x_{t'}, r)$, and the same holds for $\nabla (F_{t'})_r(x_{t'})$. Since the bigger ball $B(x_{t'}, \sqrt{2}\,r)$ contains such a neighborhood, we again obtain $\nabla (F_t)_r(x_{t'}) = \nabla (F_{t'})_r(x_{t'})$. This concludes the proof of the lemma. ∎

We are now ready to prove Theorem 4.1.

Proof of Theorem 4.1.

We need to prove that the function S:B(R)S:B(R)\rightarrow\mathbb{R} defined in Procedure 3 satisfies:

  1. $\|S - f\|_\infty \le s_{T+1} + \frac{\alpha r^2}{2}$.

  2. $S$ is $\Big(\frac{4\sqrt{\alpha \cdot d \cdot s_{T+1}}}{r} + 3\alpha\sqrt{d}\Big)$-smooth.

  3. $S$ is $(M + \frac{\eta}{2R})$-Lipschitz.

  4. $S$ is an extension for the first-order pairs $(\hat{f}_t, \hat{g}_t)$ output by the procedure.

First item: Define the function $\bar{S}(x) := \max\{F_T(x), f(x) - s_{T+1}\}$, so that $S = \bar{S}_r$. By Item 1 of Lemma A.4, $\bar{S}(x) \le f(x)$ for all $x \in B(4R)$, and by definition $\bar{S}(x) \ge f(x) - s_{T+1}$; thus $|\bar{S}(x) - f(x)| \le s_{T+1}$ for all $x \in B(4R)$. Item 2 of Lemma 4.2 then gives $|S(x) - f(x)| \le s_{T+1} + \frac{\alpha r^2}{2}$ for all $x \in B(R)$ (we can indeed use this lemma, since the definition $r = \sqrt{\eta/\alpha}$ and the assumption $\eta \le \frac{\alpha R^2}{5(T+1)}$ imply that $s_{T+1} \le \alpha R^2$ and $r \le R$).

Second item: This follows in the same way, using Item 1 of Lemma 4.2 instead.

Third item: The subgradients of FTF_{T} are (a convex combination of a subset of the) vectors g~t\tilde{g}_{t}, and so FTF_{T} is (maxtg~t)(\max_{t}\|\tilde{g}_{t}\|)-Lipschitz. Since the vectors came from an η\eta-approximate oracle for ff, we have g~tf(xt)η2R\|\tilde{g}_{t}-\nabla f(x_{t})\|\leq\frac{\eta}{2R}, and since ff is MM-Lipschitz we get g~tM+η2R\|\tilde{g}_{t}\|\leq M+\frac{\eta}{2R}; it follows that FTF_{T} is (M+η2R)(M+\frac{\eta}{2R})-Lipschitz. Next, the subgradients of S¯\bar{S} come either from subgradients of FTF_{T} or gradients of ff (or a convex combination thereof), and so S¯\bar{S} is max{M+η2R,M}=M+η2R\max\{M+\frac{\eta}{2R},M\}=M+\frac{\eta}{2R} Lipschitz. Finally, for every xB(R)x\in B(R) we have (UU being uniformly distributed in the unit ball again)

S(x)=𝔼S¯(x+rU)𝔼S¯(x+rU)M+η2R,\|\nabla S(x)\|=\|\mathbb{E}\partial\bar{S}(x+rU)\|\leq\mathbb{E}\|\partial\bar{S}(x+rU)\|\leq M+\frac{\eta}{2R},

where S¯(x+rU)\partial\bar{S}(x+rU) denotes any subgradient at x+rUx+rU and the first inequality follows from Jensen’s inequality. This proves that SS is (M+η2R)(M+\frac{\eta}{2R})-Lipschitz.

Fourth item: We need to show that for all tt, f^t=S(xt)\hat{f}_{t}=S(x_{t}) and g^t=S(xt)\hat{g}_{t}=\nabla S(x_{t}). By definition, f^t=(Ft)r(xt)\hat{f}_{t}=(F_{t})_{r}(x_{t}) and g^t=(Ft)r(xt)\hat{g}_{t}=\nabla(F_{t})_{r}(x_{t}). Moreover, by Item 3 of Lemma A.4, using FTF_{T} instead of FtF_{t} gives the same quantities, namely f^t=(FT)r(xt)\hat{f}_{t}=(F_{T})_{r}(x_{t}) and g^t=(FT)r(xt)\hat{g}_{t}=\nabla(F_{T})_{r}(x_{t}). We claim that for every tt, FTF_{T} and S¯\bar{S} are equal inside the ball B(xt,2r)B(x_{t},\sqrt{2}\,r), which then implies that f^t=(FT)r(xt)=S¯r(xt)=S(xt)\hat{f}_{t}=(F_{T})_{r}(x_{t})=\bar{S}_{r}(x_{t})=S(x_{t}) and g^t=(FT)r(xt)=S¯r(xt)=S(xt)\hat{g}_{t}=\nabla(F_{T})_{r}(x_{t})=\nabla\bar{S}_{r}(x_{t})=\nabla S(x_{t}), as desired. To show the equality in the ball B(xt,2r)B(x_{t},\sqrt{2}\,r), it suffices that the other term in the max defining S¯\bar{S} does not “cut off” FTF_{T}, namely that f(x)sT+1FT(x)f(x)-s_{T+1}\leq F_{T}(x) for every xB(xt,2r)x\in B(x_{t},\sqrt{2}\,r). But this follows from Item 2 of Lemma A.4.

Substituting the value r=η/αr=\sqrt{\eta/\alpha} and st=4ηt+αr2ts_{t}=4\eta t+\alpha r^{2}t in the items above concludes the proof of Theorem 4.1. ∎
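For concreteness, the substitution can be written out as follows. This is only a sketch of the arithmetic using the definitions of $r$ and $s_t$ given above; the exact form in which Theorem 4.1 states these bounds is not reproduced here.

```latex
\begin{aligned}
r = \sqrt{\eta/\alpha} \;&\Longrightarrow\; \alpha r^2 = \eta, \qquad
  s_t = 4\eta t + \alpha r^2 t = 5\eta t, \qquad s_{T+1} = 5\eta(T+1), \\
\|S - f\|_\infty &\le s_{T+1} + \frac{\alpha r^2}{2} = 5\eta(T+1) + \frac{\eta}{2}, \\
\frac{4\sqrt{\alpha\, d\, s_{T+1}}}{r} + 3\alpha\sqrt{d}
  &= 4\sqrt{5\alpha d\,\eta\,(T+1)}\cdot\sqrt{\frac{\alpha}{\eta}} + 3\alpha\sqrt{d}
   = \alpha\sqrt{d}\,\big(4\sqrt{5(T+1)} + 3\big).
\end{aligned}
```

In particular, with this choice of $r$ the smoothness constant of $S$ no longer depends on $\eta$, while the approximation error grows linearly in $\eta T$.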

Appendix B Separation oracles: proofs of Theorems 2.6 and 2.7

We now consider the original constrained problem $\min\{f(x) : x \in C \cap X\}$, and show how to run any first-order algorithm $\mathcal{A}$ using only approximate first-order information about $f$ and approximate separation information for $C$, proving Theorems 2.6 and 2.7. The main additional element is to convert the approximate separation information for $C$ into exact information for a related set $K \approx C$, so that it can be used in a black-box fashion by $\mathcal{A}$, as the previous section did for the first-order information of $f$. For simplicity, we assume throughout that the algorithm $\mathcal{A}$ only queries points in $B(R)$ (the ball containing the feasible set $C$), since points outside it can be separated exactly.

Given a set of points x1,,xTB(R)x_{1},...,x_{T}\in B(R), we say a sequence of separation responses r1,,rT{Feasible,Infeasible}×dr_{1},...,r_{T}\in\{\textsc{Feasible},\textsc{Infeasible}\}\times\mathbb{R}^{d} has a convex extension if there is a convex set KK\not=\emptyset such that there exists an exact (i.e., 0-approximate) separation oracle for KK giving responses r1,,rTr_{1},...,r_{T} for the query points x1,,xTx_{1},...,x_{T}. We will also refer to such responses as consistent with KK. As in the previous section, responses from an η\eta-approximate separation oracle may not by themselves admit a convex extension, and need to be modified in order to allow a consistent, convex extension; for example, approximate separating hyperplanes may not be consistent with a convex set, or may "cut off" points previously reported as feasible. When we say a point yy is cut off by a separating hyperplane through xx with normal vector gg, we mean that g,y>g,x\langle g,y\rangle>\langle g,x\rangle, i.e. that yy is not in the induced halfspace. Note that when given an exact separating hyperplane for some xCx\notin C, no point in CC is cut off by it, whereas approximate separating hyperplanes have no such guarantee. With this in mind, we now give a theorem serving as a feasibility analogue to Theorem 3.1.

Definition B.1.

For any convex set $C \subseteq \mathbb{R}^d$ and any $\delta > 0$, we define $C_{-\delta} := \{x \in C : B(x,\delta) \subseteq C\}$, whose elements will be called the $\delta$-deep points of $C$.

Theorem B.2 (Online Convex Extensibility).

Consider a convex set CB(R)C\subseteq B(R) and a sequence of points x1,,xTB(R)x_{1},\ldots,x_{T}\in B(R). There is an online procedure that, given access to an η\eta-approximate separation oracle for CC, produces separation responses r^1,,r^T\hat{r}_{1},...,\hat{r}_{T} that have a convex extension KK satisfying CηKCC_{-\eta}\subseteq K\subseteq C. Moreover, the procedure only probes the approximate oracle at the points x1,,xTx_{1},\ldots,x_{T}.

Note that the guarantee $C_{-\eta} \subseteq K \subseteq C$ means that for any point $x_t$ that is $\eta$-deep in $C$, i.e., in $C_{-\eta}$, the response $\hat{r}_t$ produced says Feasible, whereas for any $x_t \notin C$ it says Infeasible and gives a hyperplane separating $x_t$ from $K$ (which cannot cut too deep into $C$, since $K$ contains $C_{-\eta}$). At a high level, such responses allow one to cut off infeasible solutions while guaranteeing that there are still ($\eta$-deep) solutions with small $f$-value available.

With this additional procedure at hand, we extend Procedure 1 from the main text in the following way to solve constrained optimization: in each step of the procedure, we also send the point xtx_{t} queried by the algorithm 𝒜\mathcal{A} to the ηC\eta_{C}-approximate separation oracle for CC, receive the response r~t\tilde{r}_{t}, pass it through Theorem B.2 to obtain the new response r^t\hat{r}_{t}, and send the latter back to 𝒜\mathcal{A}. We call this procedure Approximate-to-Exact-Constr, and formally state it as follows:

Procedure 4. Approximate-to-Exact-Constr$(\mathcal{A}, T)$
For each timestep $t = 1, \ldots, T$:
  1. Receive query point $x_t \in B(R)$ from $\mathcal{A}$.
  2. Send point $x_t$ to the $\eta_f$-approximate first-order oracle and to the $\eta_C$-approximate separation oracle, and receive the approximate first-order information $(\tilde{f}_t, \tilde{g}_t) \in \mathbb{R} \times \mathbb{R}^d$ and the separation response $\tilde{r}_t \in \{\textsc{Feasible}\} \cup \{\textsc{Infeasible}\} \times \mathbb{R}^d$.
  3. Use the online procedures from Theorems 3.1 and B.2 to construct the first-order pair $(\hat{f}_t, \hat{g}_t)$ and the separation response $\hat{r}_t$.
  4. Send $(\hat{f}_t, \hat{g}_t), \hat{r}_t$ to the algorithm $\mathcal{A}$.
Return the point returned by $\mathcal{A}$.

The proof that this procedure yields Theorem 2.6 is analogous to the one for the unconstrained case of Theorem 2.2, so we only sketch it to avoid repetition.

Proof sketch of Theorem 2.6.

Let FF and KK be the Lipschitz and convex extensions to the answers sent to 𝒜\mathcal{A} that are guaranteed by Theorems 3.1 and B.2, respectively. Approximate-to-Exact-Constr has the same effect as 𝒜\mathcal{A} running on the instance (F,K)(F,K). One can show that this instance belongs to (M+ηf2R,R,ρηC)\mathcal{I}(M+\frac{\eta_{f}}{2R},R,\rho-\eta_{C}). Then if err()err(\cdot) is the error guarantee of 𝒜\mathcal{A} as in the statement of the theorem, this ensures that we return a solution x¯KX\bar{x}\in K\cap X satisfying

F(x¯)OPT(F,K)+err(T,M+ηf2R,R,ρηC),F(\bar{x})\leq\textup{OPT}(F,K)+err(T,M+\frac{\eta_{f}}{2R},R,\rho-\eta_{C}),

where $\textup{OPT}(F,K) := \min\{F(x) : x \in K \cap X\}$. Since $C_{-\eta_C}$ contains a solution $x$ with value $f(x) \le \textup{OPT}(f,C) + \frac{2MR\eta_C}{\rho}$ (e.g., Lemma 4.7 of (Basu, 2023)), and $C_{-\eta_C} \subseteq K$, we have $\textup{OPT}(f,K) \le \textup{OPT}(f,C) + \frac{2MR\eta_C}{\rho}$. Finally, using the guarantee $\|F - f\|_\infty \le 2\eta_f T$, we obtain that $f(\bar{x}) \le \textup{OPT}(f,C) + err(T, M + \frac{\eta_f}{2R}, R, \rho - \eta_C) + 4\eta_f T + \frac{2MR\eta_C}{\rho}$, concluding the proof of the theorem. ∎

The proof of Theorem 2.7 follows essentially the same reasoning as that of Theorem 2.6; we also sketch it here. The main difference is that one needs to ensure that the optimal solution $x^*$ of 2.1 is contained in the auxiliary feasible region $K$ that the algorithm $\mathcal{A}$ uses; otherwise the additional restrictions imposed by $Z$ may lead to arbitrarily bad solutions, or even to $K \cap Z$ being empty (consider for example the case of $Z$ being a singleton on the boundary of $C$ that is then cut off by an approximate separation response). However, since $K$ is guaranteed to contain $C_{-\eta_C}$ and we assume that $\eta_C \le \rho$, the fact that $x^*$ is $\rho$-deep in $C$ implies that it is also in $K$.

Proof sketch of Theorem 2.7.

Again, let $F$ and $K$ be the Lipschitz and convex extensions as in the previous proof, so that the instance $(F,K)$ belongs to $\mathcal{I}(M + \frac{\eta_f}{2R}, R, \rho - \eta_C)$ and $\mathcal{A}$ returns a solution $\bar{x} \in K \cap Z$ satisfying $F(\bar{x}) \le \textup{OPT}(F,K) + err(T, M + \frac{\eta_f}{2R}, R, \rho - \eta_C)$, where $\textup{OPT}(F,K) := \min\{F(x) : x \in K \cap Z\}$. Recall that $Z$ is assumed to be given and known by the algorithm. Since the optimal solution for $(f,C)$ is assumed to be in $C_{-\rho}$, and $K$ contains $C_{-\eta_C} \supseteq C_{-\rho}$, $K$ contains the optimal solution $x^*$ of the true instance, i.e., the point with $f(x^*) = \textup{OPT}(f,C)$. Finally, using the guarantee $\|F - f\|_\infty \le 2\eta_f T$, we obtain that $f(\bar{x}) \le \textup{OPT}(f,C) + err(T, M + \frac{\eta_f}{2R}, R, \rho - \eta_C) + 4\eta_f T$, concluding the proof of the theorem. ∎

B.1 Computing convex-extensible separation responses

We now prove Theorem B.2. The result requires the existence of a convex extension KK for the responses r^t\hat{r}_{t} that we construct, and we need CηKCC_{-\eta}\subseteq K\subseteq C. We provide a procedure that produces the responses r^1,,r^T\hat{r}_{1},\ldots,\hat{r}_{T} together with sets K1,,KTK_{1},\ldots,K_{T} so that KtCK_{t}\cap C is consistent with the responses up to this round, i.e., r^1,,r^t\hat{r}_{1},\ldots,\hat{r}_{t}, and is sandwiched between CηC_{-\eta} and CC. The set KtK_{t} will consist of all the points that were not excluded by the separating hyperplanes of the responses up to this round. Thus, our main task is to ensure that as KtK_{t} evolves, it does not exclude the points xτx_{\tau} that the responses up to now have reported as Feasible (ensuring consistency with previous responses). We also want to ensure that none of the deep points CηC_{-\eta} is excluded.

Before stating the formal procedure, we give some intuition on how this is accomplished. Suppose one has Kt1K_{t-1} satisfying the desired properties. One receives a new point xtx_{t} and separation response r~t\tilde{r}_{t} from the approximate oracle, and we need to construct a response r^t\hat{r}_{t} and an updated set KtK_{t} to maintain the desired properties.

Suppose r~t\tilde{r}_{t} reports that xtx_{t} is Feasible. Our procedure ignores this information, keeps Kt=Kt1K_{t}=K_{t-1} and creates a response r^t\hat{r}_{t} that is Feasible if and only if xtKt=Kt1x_{t}\in K_{t}=K_{t-1} (also sending a hyperplane separating xtx_{t} from KtK_{t} if xtKtx_{t}\notin K_{t}; notice that since this hyperplane does not cut into Kt1K_{t-1}, we do not need to update this set). Notice that KtCK_{t}\cap C is indeed consistent with the response r^t\hat{r}_{t}.

The interesting case is when $\tilde{r}_t$ reports that $x_t$ is Infeasible (so $x_t \notin C$) but $x_t \in K_{t-1}$. Thus, $x_t$ cannot belong to $K_t \cap C$ (recall we will construct $K_t \subseteq K_{t-1}$), and so to ensure consistency our response $\hat{r}_t$ needs to report Infeasible together with a separating hyperplane that excludes $x_t$. The first idea is to simply use the separating hyperplane reported by the approximate oracle. But this can exclude points that were deemed Feasible by our previous responses $\hat{r}_\tau$ (we call these points $\overline{\textsc{Feas}}_{t-1}$), which would violate consistency. Thus, we first rotate this hyperplane as little as possible so that its halfspace contains all points in $\overline{\textsc{Feas}}_{t-1}$, and report this rotated hyperplane $H$ (intersecting $K_{t-1}$ with the corresponding halfspace to obtain $K_t$). While this rotation protects the points $\overline{\textsc{Feas}}_{t-1}$, we also need to argue that it does not stray too far from the original approximate separating hyperplane, so as not to cut into $C_{-\eta}$.

We now describe the formal procedure in detail. We use H(g,x¯):={x:g,xg,x¯}H(g,\bar{x}):=\{x:\langle g,x\rangle\leq\langle g,\bar{x}\rangle\} to denote the halfspace with normal gg passing through the point x¯\bar{x}.

Procedure 5. Initialize $K_0 = \mathbb{R}^d$ and $\overline{\textsc{Feas}}_0 = \emptyset$. For each $t = 1, \ldots, T$:
  1. Query the $\eta$-approximate separation oracle for $C$ at $x_t$, receiving the response $(flag_t, \tilde{g}_t)$.
  2. If $flag_t = \textsc{Feasible}$ and $x_t \in K_{t-1}$: define the response $\hat{r}_t = (\textsc{Feasible}, \star)$, and set $K_t = K_{t-1}$ and $\overline{\textsc{Feas}}_t = \overline{\textsc{Feas}}_{t-1} \cup \{x_t\}$.
  3. ElseIf $flag_t = \textsc{Feasible}$ but $x_t \notin K_{t-1}$: let $\hat{g}_t$ be any unit vector such that the halfspace $H(\hat{g}_t, x_t)$ contains $K_{t-1}$. Define the response $\hat{r}_t = (\textsc{Infeasible}, \hat{g}_t)$. Set $K_t = K_{t-1}$ and $\overline{\textsc{Feas}}_t = \overline{\textsc{Feas}}_{t-1}$.
  4. Else (so $flag_t = \textsc{Infeasible}$): let $\hat{g}_t$ be a unit vector such that the induced halfspace $H(\hat{g}_t, x_t)$ contains $\overline{\textsc{Feas}}_{t-1}$ and that is the closest to $\tilde{g}_t$ with this property, i.e.,
     $$\|\hat{g}_t - \tilde{g}_t\|_2 = \min_{g \in B(1)} \Big\{ \|g - \tilde{g}_t\|_2 : H(g, x_t) \supseteq \overline{\textsc{Feas}}_{t-1} \Big\}.$$
     Define the response $\hat{r}_t = (\textsc{Infeasible}, \hat{g}_t)$. Set $K_t = K_{t-1} \cap H(\hat{g}_t, x_t)$ and $\overline{\textsc{Feas}}_t = \overline{\textsc{Feas}}_{t-1}$.

We remark that in Line 4, there indeed exists a halfspace supported at xtx_{t} that contains all points in Feas¯t1\overline{\textsc{Feas}}_{t-1}: in this case xtCx_{t}\notin C (since flagt=Infeasible{flag}_{t}=\textsc{Infeasible}) and by definition Feas¯t1C\overline{\textsc{Feas}}_{t-1}\subseteq C, so any halfspace separating xtx_{t} from CC will do.
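The state that Procedure 5 maintains (the cuts defining $K_t$ and the points reported Feasible so far) can be sketched as below. This is a partial, hedged illustration rather than the authors' implementation. The class name `ConvexExtension` and its methods are hypothetical. In the third case, the sketch makes one particular valid choice of $\hat{g}_t$: the normalized normal of an earlier cut that $x_t$ violates, whose halfspace through $x_t$ contains $K_{t-1}$. In the last case it only handles the situation where the oracle's halfspace already contains all previously feasible points (so the closest admissible vector is $\tilde{g}_t$ itself, rescaled into $B(1)$ if needed); the general rotation step requires solving a small convex projection problem and is deliberately left unimplemented.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_halfspace(g, xbar, x, tol=1e-12):
    """True if x lies in H(g, xbar) = {x : <g, x> <= <g, xbar>}."""
    return dot(g, x) <= dot(g, xbar) + tol


class ConvexExtension:
    """State of Procedure 5 (hypothetical names, not from the paper)."""

    def __init__(self):
        self.cuts = []   # pairs (g_hat, x_t); K_t is the intersection of H(g_hat, x_t)
        self.feas = []   # points reported Feasible so far

    def contains(self, x):
        """Membership test for the current set K_t."""
        return all(in_halfspace(g, xb, x) for g, xb in self.cuts)

    def respond(self, x, flag, g_tilde=None):
        if flag == "Feasible":
            violated = [g for g, xb in self.cuts if not in_halfspace(g, xb, x)]
            if not violated:                       # step 2: x is in K_{t-1}
                self.feas.append(list(x))
                return ("Feasible", None)
            g = violated[0]                        # step 3: reuse a violated cut
            norm = math.sqrt(dot(g, g))
            return ("Infeasible", [gi / norm for gi in g])
        # Step 4, easy case only: no previously feasible point is cut off.
        if any(not in_halfspace(g_tilde, x, y) for y in self.feas):
            raise NotImplementedError("rotation step needs a projection solver")
        scale = max(1.0, math.sqrt(dot(g_tilde, g_tilde)))
        g_hat = [gi / scale for gi in g_tilde]     # rescale into the unit ball B(1)
        self.cuts.append((g_hat, list(x)))
        return ("Infeasible", g_hat)
```

Each call to `respond` corresponds to one iteration $t$; the list `cuts` encodes $K_t$ implicitly, so membership in $K_t$ is a sequence of inner-product comparisons.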

Lemma B.3.

For every t=1,,Tt=1,\ldots,T, the set KtCK_{t}\cap C, with KtK_{t} computed by Procedure 5, is a convex extension to the responses r^1,,r^t\hat{r}_{1},\ldots,\hat{r}_{t}. Moreover, all these sets satisfy KtCηK_{t}\supseteq C_{-\eta}.

Proof.

It suffices to prove this for each iteration, so suppose Kt1K_{t-1} with responses r^1,,r^t1\hat{r}_{1},...,\hat{r}_{t-1} satisfy the lemma. If Lines 2 or 3 of the procedure were executed, it is straightforward to see KtK_{t} satisfies the lemma. If Line 4 executes, xtCx_{t}\notin C, and so the response r^t\hat{r}_{t} is consistent with KtCK_{t}\cap C since Kt=Kt1H(g^t,xt)K_{t}=K_{t-1}\cap H(\hat{g}_{t},x_{t}). As Feas¯t1Kt\overline{\textsc{Feas}}_{t-1}\subseteq K_{t} by construction and KtKt1K_{t}\subseteq K_{t-1}, KtCK_{t}\cap C is consistent with all responses r^1,,r^t\hat{r}_{1},...,\hat{r}_{t} made. It remains to show that KtK_{t} contains CηC_{-\eta}, for which showing H(g^t,xt)H(\hat{g}_{t},x_{t}) contains it suffices. Notice that since there exists an exact halfspace H(g,xt)H(g,x_{t}) separating xtCx_{t}\notin C that is η4R\frac{\eta}{4R}-close to g~t\tilde{g}_{t} (due to the η\eta-approximate oracle), and H(g,xt)H(g,x_{t}) contains CC and thus Feas¯t1\overline{\textsc{Feas}}_{t-1}, we have g^tg~t2η4R\|\hat{g}_{t}-\tilde{g}_{t}\|_{2}\leq\frac{\eta}{4R}. The triangle inequality then reveals g^tg2g^tg~t2+g~tg2η2R\|\hat{g}_{t}-g\|_{2}\leq\|\hat{g}_{t}-\tilde{g}_{t}\|_{2}+\|\tilde{g}_{t}-g\|_{2}\leq\frac{\eta}{2R}, and then it is easy to see that H(g^t,xt)H(\hat{g}_{t},x_{t}) contains CηC_{-\eta}, concluding the proof. ∎

Appendix C A closer comparison with related work

In this section, we give a more detailed comparison between our work and the results and settings of the closely related work of (Devolder et al., 2014). Therein, the authors define a similar setting using their own notion of inexact first-order oracles and analyze the behavior of a few primal, dual and accelerated gradient methods for convex optimization problems under these oracles. In particular, they show that accelerated gradient methods must accumulate errors in the inexact information setting. As noted, their analysis gives guarantees for specific algorithms, as opposed to our “universal” algorithm-independent guarantees. In summary, for the algorithms they analyze, our method mostly requires a smaller noise level to achieve equivalent convergence rates. We begin by comparing the oracles used.

Comparing the inexact oracles:

The notion of inexact first-order oracle in (Devolder et al., 2014) uses two parameters, as opposed to the single η\eta parameter in our Definition 2.1. Nevertheless, an oracle from the Devolder et al. setting corresponds to an oracle in our setting for any family of convex functions on B(R)B(R), with appropriate settings of the noise parameters, and vice versa. We provide the definition for an inexact oracle used by Devolder et al. here:

Definition C.1.

Let ff be a convex function on B(R)B(R). A first-order (δ,L)(\delta,L)-oracle queried at some point yB(R)y\in B(R) returns a pair (f~y,g~y)×d\left(\tilde{f}_{y},\tilde{g}_{y}\right)\in\mathbb{R}\times\mathbb{R}^{d} such that for all xB(R)x\in B(R) we have

$$\tilde{f}_y + \langle \tilde{g}_y, x - y\rangle \;\le\; f(x) \;\le\; \tilde{f}_y + \langle \tilde{g}_y, x - y\rangle + \frac{L}{2}\|x - y\|^2 + \delta.$$

Note that while this definition is valid for any convex function, not just $L$-smooth functions, it is motivated by the inequalities satisfied by $L$-smooth functions. The parameter $\delta$ can be viewed as the “noise”, while $L$ is the smoothness constant of the family of functions considered, which is taken for granted throughout (Devolder et al., 2014). We will refer to the oracle of Definition 2.1 as the $\eta$-approximate oracle and that of Definition C.1 as the $(\delta,L)$-oracle in this section.

Here is what one can say when comparing the two oracles. For the family of LL-smooth functions, an η\eta-approximate oracle corresponds to a (δ,L)(\delta,L)-oracle with δ=4η\delta=4\eta (see Example b.b. from Section 2.3 in (Devolder et al., 2014)). For MM-Lipschitz functions, an η\eta-approximate oracle is equivalent to a (δ\delta, LL)-oracle obtained by setting δ=3η+M2L\delta=3\eta+\frac{M^{2}}{L} and arbitrary L>0L>0 (if the response of the η\eta-approximate oracle is f^,g^\hat{f},\hat{g}, the response of the (δ,L)(\delta,L)-oracle is f~=f^32η\tilde{f}=\hat{f}-\frac{3}{2}\eta and g~=g^\tilde{g}=\hat{g}). This follows from the Lipschitz bound f(y)f(x)+f(x),yx+2Myxf(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+2M\|y-x\| combined with the fact that MrL2r2+M22LMr\leq\frac{L}{2}r^{2}+\frac{M^{2}}{2L} for any r,L>0r,L>0.
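The last inequality used above is an instance of the AM-GM (Young) inequality. As a short check, it follows from expanding a square, for any $r, L, M > 0$:

```latex
0 \;\le\; \frac{1}{2}\Big(\sqrt{L}\,r - \frac{M}{\sqrt{L}}\Big)^2
  \;=\; \frac{L}{2}r^2 - Mr + \frac{M^2}{2L}
\quad\Longrightarrow\quad
Mr \;\le\; \frac{L}{2}r^2 + \frac{M^2}{2L}.
```

Equality holds exactly when $r = M/L$, which is why the bound is valid for an arbitrary $L > 0$ at the cost of the additive term in $M^2/L$.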

In the other direction, if one is given a (δ,L)(\delta,L)-oracle from the Devolder et al. setting for any family of convex functions on B(R)B(R), this provides an η\eta-approximate oracle in our sense for η=max(δ,2R2δL)\eta=\max(\delta,2R\sqrt{2\delta L}) (see eqn. (8) in (Devolder et al., 2014)).

Comparison of the noise levels needed for convergence rates:

Next, we compare the noise levels needed to achieve comparable convergence rates across our approach and the results of Devolder et al. We look at three different function classes for this comparison.

  • a) Nonsmooth, $M$-Lipschitz functions: As mentioned above, Devolder et al. do not explicitly analyze this family; they focus on $L$-smooth functions. However, one can carry out an analysis of the subgradient method given an $\eta$-approximate oracle, similar to the Devolder et al. analysis in the smooth case. The additional error incurred by the inexactness of the oracle is indeed still $O(\eta)$. Therefore, if one uses subgradient descent without any modifications, one needs to set $\eta = O(\frac{1}{T^{0.5}})$ to get overall error $O(\frac{1}{T^{0.5}})$ after $T$ iterations. Using our black-box reduction, however, one needs to set $\eta = O(\frac{1}{T^{1.5}})$ for the same convergence rate.

  • b) Comparison for $L$-smooth functions with gradient descent: Devolder et al.’s analysis of gradient descent with their notion of $(\delta,L)$-oracle gives $O(\delta)$ additional error. Thus, they need to set $\delta = O(\frac{1}{T})$ to get $O(\frac{1}{T})$ convergence. From the discussion above, an $\eta$-approximate oracle corresponds to a $(\delta,L)$-oracle with $\delta = 4\eta$. In other words, Devolder et al.’s result says that $\eta = O(\frac{1}{T})$ suffices to get $O(\frac{1}{T})$ convergence, if gradient descent is run without modification with an $\eta$-approximate oracle. From our black-box analysis for the $L$-smooth case, we cannot guarantee a convergence rate better than $O(\frac{1}{\sqrt{T}})$, no matter the choice of $\eta$, since we lose the $\sqrt{T}$ factor due to our smoothing technique. Thus, our technique does not achieve the $O(\frac{1}{T})$ rate for gradient descent for $L$-smooth functions.

  • c) Comparison for $L$-smooth functions with acceleration: Devolder et al.’s analysis of Nesterov’s acceleration with the $(\delta,L)$-oracle gives $O(\delta T)$ additional error. Thus, they need to set $\delta = O(\frac{1}{T^{3}})$ to get the standard $O(\frac{1}{T^{2}})$ convergence. From the discussion above, an $\eta$-approximate oracle in our setting corresponds to a $(\delta,L)$-oracle with $\delta = 4\eta$. In other words, Devolder et al.’s result says that $\eta = O(\frac{1}{T^{3}})$ suffices to get $O(\frac{1}{T^{2}})$ convergence. To get just an $O(\frac{1}{T^{1.5}})$ rate, one can set $\eta = O(\frac{1}{T^{2.5}})$ in the analysis of Devolder et al. (the additional error term from the oracle noise dominates in this case). From our black-box analysis for $L$-smooth functions, we need to set $\eta = O(\frac{1}{T^{2.5}})$ as well, but note that we additionally lose a factor of $\sqrt{d}$ ($d$ being the dimension). So our final rate is $O(\frac{\sqrt{d}}{T^{1.5}})$, whereas the Devolder et al. analysis does not accrue any dimension-dependent factors.

Overall, the algorithm-specific analyses in (Devolder et al., 2014) give better convergence rates for equivalent noise levels or, equivalently, have less stringent noise requirements to achieve a target convergence rate. This is not too surprising: our black-box approach is much more general, since it works with any first-order algorithm, and the loss in the rates can be viewed as the price of this generality.