
Thinking Outside the Ball: Optimal Learning with Gradient Descent for Generalized Linear Stochastic Convex Optimization

Idan Amir Department of Electrical Engineering, Tel Aviv University; idanamir@mail.tau.ac.il.    Roi Livni Department of Electrical Engineering, Tel Aviv University; rlivni@tauex.tau.ac.il.    Nathan Srebro Toyota Technological Institute at Chicago; nati@ttic.edu.
Abstract

We consider linear prediction with a convex Lipschitz loss, or more generally, stochastic convex optimization problems of generalized linear form, i.e. where each instantaneous loss is a scalar convex function of a linear function. We show that in this setting, early stopped Gradient Descent (GD), without any explicit regularization or projection, ensures excess error at most $\varepsilon$ (compared to the best possible with unit Euclidean norm) with an optimal, up to logarithmic factors, sample complexity of $\tilde{O}(1/\varepsilon^{2})$ and only $\tilde{O}(1/\varepsilon^{2})$ iterations. This contrasts with general stochastic convex optimization, where $\Omega(1/\varepsilon^{4})$ iterations are needed (Amir et al., 2021b). The lower iteration complexity is ensured by leveraging uniform convergence rather than stability. But instead of uniform convergence in a norm ball, which we show can guarantee learning only with suboptimal $\Theta(1/\varepsilon^{4})$ sample complexity, we rely on uniform convergence in a distribution-dependent ball.

1 Introduction

The recent success of learning using deep networks, with many more parameters than training points, and even without any explicit regularization, has brought back interest in how biases and “implicit regularization” due to the optimization method used, can ensure good generalization, even in situations where minimizing the optimization objective itself cannot Neyshabur et al. (2015). Alongside interest in understanding the algorithmic biases of optimization in non-convex, deep models, and how they can yield good generalization (e.g. Gunasekar et al., 2018c, b; Li et al., 2018; Nacson et al., 2019; Arora et al., 2019; Lyu and Li, 2019; Woodworth et al., 2020; Moroshko et al., 2020; Chizat and Bach, 2020; Blanc et al., 2020; HaoChen et al., 2021; Razin et al., 2021; Pesme et al., 2021), there has also been renewed interest in understanding the fundamentals of algorithmic regularization in convex models (Shalev-Shwartz et al., 2009; Feldman, 2016; Amir et al., 2021a, b; Sekhari et al., 2021; Dauber et al., 2020), both as an interesting and important setting in its own right, and even more so as a basis for understanding the situation in more complex models (how can we hope to understand phenomena in deep, non-convex models, if we can’t even understand them in convex models?). In particular, these fundamental questions include the relationship between algorithmic regularization, explicit regularization, uniform convergence and stability; and the importance of stochasticity and early stopping.

In this paper we focus on the algorithmic bias of deterministic (full batch) optimization methods, and in particular of full-batch gradient descent (i.e. gradient descent on the empirical risk, or “training error”). Even when gradient descent (GD) is run to convergence, it affords some bias that can ensure generalization. E.g. in linear regression (and even slightly more general settings) where interpolation is possible, it can be shown to converge to the minimum norm interpolating solution. This can be sufficient for generalization even in underdetermined situations where other interpolators (i.e. other minimizers of the optimization objective) would not generalize well. Indeed, the generalization ability of the minimum norm interpolating solution, and hence of GD, even in noisy situations, is the subject of much study. And in parallel, there has also been work going beyond GD on linear regression, characterizing the limit point of GD, or of other optimization methods such as Mirror Descent, steepest descent w.r.t. a norm, coordinate descent and AdaBoost, for different types of loss functions (Telgarsky, 2013; Soudry et al., 2018; Gunasekar et al., 2018a; Ji and Telgarsky, 2019; Li et al., 2019; Ji et al., 2020a; Shamir, 2020; Ji et al., 2020b; Vaskevicius et al., 2020).

But what about situations where interpolation, or completely minimizing the empirical risk, is not desirable, and generalization requires compromising on the empirical risk? In these cases one can consider early stopping, i.e. running GD only for some specific number of iterations. Indeed, early stopped GD is a common regularization approach in practice, and other learning approaches, most prominently Boosting, can also be viewed as early stopping of an optimization procedure (coordinate descent in the case of Boosting). When and how can such early stopped GD allow for generalization? How does this compare to Stochastic Gradient Descent (SGD), or to using explicit regularization, both in terms of generalization ability, and the number of required optimization iterations? And what tools, such as uniform convergence, distribution-dependent uniform convergence, and stability, are appropriate for studying the generalization ability of GD?

Our main result

We will show that when training a linear predictor with a convex Lipschitz loss (or more generally, for stochastic convex optimization with a generalized linear instantaneous objective), GD with early stopping can generalize optimally, up to logarithmic factors, with optimal sample complexity $\tilde{O}(1/\varepsilon^{2})$, and with an optimal number $\tilde{O}(1/\varepsilon^{2})$ of iterations (the same as stochastic gradient descent), even without any explicit regularization, and in particular without projections onto a norm ball; just unconstrained GD on the training error. This contrasts with previous results regarding early stopped GD for arbitrary stochastic convex optimization (not necessarily generalized linear), for which Amir et al. (2021b) showed $\Theta(1/\varepsilon^{4})$ GD iterations are needed (even if projections or explicit regularization are used).

Stability vs Uniform Convergence for SGD and GD

The important difference here, and the only property of generalized linear models (GLMs) that we rely on, is that GLMs satisfy uniform convergence (Bartlett and Mendelson, 2002): empirical losses converge to their expectations uniformly over all predictors in Euclidean balls. This is in contrast to general Stochastic Convex Optimization (SCO), for which no such uniform convergence is possible (Shalev-Shwartz et al., 2009). Consequently, rather than uniform convergence, the analysis for SCO is based on stability arguments (Bassily et al., 2020). For SGD, stability at each step can be used in an online analysis, combined with an online-to-batch conversion, which is sufficient for ensuring optimal generalization with an optimal $O(1/\varepsilon^{2})$ number of iterations. But for GD, we must consider the stability of the method as a whole, which leads to optimal rates in the case of smooth losses (Hardt et al., 2016) but otherwise is much worse, necessitating a smaller stepsize, and hence quadratically more iterations; Amir et al. (2021b) showed that this is not just an analysis artifact, but a real limitation of GD for general SCO. In this paper, we show that once generalization can be ensured via uniform convergence, e.g. for GLMs, then GD does not have to worry about stability, can take much larger step sizes, and generalizes optimally after only $\tilde{O}(1/\varepsilon^{2})$ iterations.

But we must be careful with how we ensure uniform convergence! To rely on uniform convergence, we need to ensure the output of GD lies in some ball, and the generalization error would then scale with the radius of the ball. A naïve approach would be to ensure that the output of GD lies in a norm ball around the origin, and rely on uniform convergence in this ball. In Section 4 we show that this approach can ensure generalization, but only with a suboptimal sample complexity of $O(1/\varepsilon^{4})$, that is, worse than with the stability-based approach. Instead, we show that, with high probability, the output of GD lies in a small (constant radius) distribution-dependent ball, centered not at the origin, but around the (distribution-dependent) output of GD on the population objective. Even though this ball is unknown to the algorithm, this is still sufficient for generalization. The situation here is similar to the notion of algorithm-dependent uniform convergence introduced by Nagarajan and Kolter (2019), though we should emphasize that here we show that algorithm-dependent uniform convergence is sufficient for optimal generalization.

Context and Insights

Our results complement recent results exposing gaps between generalization using stochastic optimization versus explicit regularization or deterministic optimization in stochastic convex optimization. Sekhari et al. (2021) showed that in a slightly weaker setting (where only the population loss is convex, but instantaneous losses can be non-convex), there can be large gaps between SGD and GD or explicit regularization: even though SGD can learn with the optimal sample complexity of $O(1/\varepsilon^{2})$, explicit regularization, in the form of regularized empirical risk minimization, cannot learn at all, even with arbitrarily many samples, and GD requires at least a suboptimal $\Omega(1/\varepsilon^{2.4})$ number of samples (it is not clear whether this sample complexity is sufficient for GD, or even whether GD can ensure learning at all). This highlights that the generalization ability of SGD cannot be understood in terms of mimicking some regularizer, as well as gaps between stochastic and deterministic optimization.

Returning to the strict SCO setting, where the instantaneous losses are convex, uniform convergence still does not hold, and generalization can only be ensured via algorithm-dependent bounds. Nevertheless, optimal generalization can be ensured either through explicit regularization, SGD, or GD: all three approaches can ensure learning with $\Theta(1/\varepsilon^{2})$ samples (Shalev-Shwartz et al., 2009; Nemirovski and Yudin, 1983; Bassily et al., 2020). Even so, differences between deterministic and stochastic methods still exist. Dauber et al. (2020) show that in SCO, the output of SGD cannot even be guaranteed to lie in some “small” distribution-dependent set of (approximate) empirical risk minimizers (“smallness” here refers to a size notion that measures how well an empirical risk minimizer over the set generalizes; see Dauber et al. (2020) for the definition of a statistically complex set), and so the generalization ability of SGD in this setting cannot be attributed to any “regularizer” being small. Dauber et al. also show that GD does not follow some distribution-independent regularization path. In contrast, here we show that if we do take the distribution into account, then GD is constrained to follow (at least approximately) a predetermined path, i.e. a deterministic path that is independent of the sample but does depend on the distribution. Finally, in this case there is also a gap between SGD and GD in the required number of iterations (Amir et al., 2021b). Surprisingly, this gap cannot be fixed by adding regularization to the objective (Amir et al., 2021a).

Importantly, for either weak or strict SCO, regularization in the form of constraining the norm of the predictor (i.e. empirical risk minimization inside the hypothesis class we are competing with) is not sufficient for learning (Shalev-Shwartz et al., 2009; Feldman, 2016). The failure of constrained ERM is critical here for the constructions of Amir et al., Dauber et al., and Sekhari et al. establishing the gaps above: since projected gradient descent would quickly converge to the constrained ERM, it would also generalize just as well, and so the gaps and constructions are valid also for projected gradient descent.

In this paper we turn to the GLM setting, which is perhaps one of the most well studied frameworks in the learning theory literature, as it captures fundamental problems such as logistic regression, SVM and many more. Uniform convergence for GLMs (Bartlett and Mendelson, 2002; Shalev-Shwartz and Ben-David, 2014; Kakade et al., 2008) ensures that constrained ERM learns with optimal sample complexity, and hence so would projected GD with $O(1/\varepsilon^{2})$ iterations. Interestingly, GD on the Tikhonov-type regularized objective similarly yields optimal learning with only $O(1/\varepsilon^{2})$ iterations (Sridharan et al., 2008), indicating that the lower bounds of Amir et al. on the iteration complexity do not hold in this setting. But both projected GD and GD on the Tikhonov regularized objective are forms of explicit regularization, and so we study unregularized, unprojected GD. Our results show that (a) when uniform convergence holds, both the sample complexity and iteration complexity gaps between stochastic and deterministic optimization disappear (though stochastic optimization is of course still much more computationally efficient); (b) perhaps surprisingly, distribution-dependent uniform convergence is not only able to explain generalization of the output of GD, but it can do so better than stability; and (c) though it appears to be suboptimal, distribution-independent uniform convergence can also explain generalization, even though the norm could become very large (by making explanations based on uniform convergence inside a fixed distribution-independent class).

2 Problem Setup and Background

We study the problem of stochastic optimization from i.i.d. data samples. For that purpose, we consider the standard setting of stochastic convex optimization. A learning problem consists of a family of loss functions $f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$ defined over a fixed domain $\mathcal{W}\subseteq\mathbb{R}^{d}$ and parameterized by $z\in\mathcal{Z}$. For each $B$ we denote the domain

\mathcal{W}^{B}=\{w\in\mathbb{R}^{d}:\|w\|\leq B\}.

Our underlying assumption is that for each $z\in\mathcal{Z}$ the function $f(w;z)$ is convex and $L$-Lipschitz with respect to its first argument $w$. In this setting, a learner is provided with a sample $S=\{z_{1},\ldots,z_{n}\}$ of i.i.d. examples drawn from an unknown distribution $D$. The goal of the learner is to optimize the population risk, defined as follows:

F_{D}(w)\coloneqq\mathop{\mathbb{E}}_{z\sim D}\left[f(w;z)\right].

More formally, an algorithm $\mathcal{A}$ is said to learn the class $\mathcal{W}$ (in expectation) with sample complexity $m(\varepsilon)$ if, given an i.i.d. sample $S=\{z_{1},\ldots,z_{n}\}$ with $n\geq m(\varepsilon)$, the learner returns a solution $\mathcal{A}(S)$ that satisfies

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{D}(\mathcal{A}(S))\right]\leq\min_{w\in\mathcal{W}}F_{D}(w)+\varepsilon.

A prevalent approach in stochastic optimization is to consider, and optimize, the empirical risk over a sample $S$:

F_{S}(w)=\frac{1}{n}\sum_{i=1}^{n}f(w;z_{i}).  (1)

2.1 Gradient Descent (without projections).

A concrete and prominent way of minimizing the empirical risk in Eq. 1 is Gradient Descent (GD). GD is an iterative algorithm that runs for $T$ iterations, and has the following update rule that depends on a learning rate $\eta$:

w^{S}_{t+1}=w^{S}_{t}-\eta\nabla F_{S}(w^{S}_{t}),  (2)

where $w^{S}_{0}=0$, and $\nabla F_{S}(w^{S}_{t})$ is the (sub)gradient of $F_{S}$ at the point $w^{S}_{t}$. The output of GD is normally taken to be the average iterate,

\bar{w}^{S}=\frac{1}{T}\sum_{t=1}^{T}w^{S}_{t}.  (3)

Throughout the paper we consider the aforementioned output in Eq. 3, though we remark that our results also extend to any reasonable averaging scheme (including prevalent schemes such as tail-averaging or a random choice of $w_{t}$).
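As a point of reference for the procedure analyzed throughout the paper, the following is a minimal NumPy sketch of unprojected full-batch GD with the averaged iterate of Eq. 3. The absolute loss $f(w;(x,y))=|w\cdot x-y|$, the synthetic data, and all function names are illustrative choices for this sketch only, not prescribed by the paper.

```python
import numpy as np

def gd_average_iterate(subgrad_f, data, eta, T, dim):
    """Full-batch (sub)gradient descent on the empirical risk F_S (Eq. 2),
    started at the origin, returning the averaged iterate (Eq. 3)."""
    w = np.zeros(dim)
    iterates = []
    for _ in range(T):
        # (sub)gradient of F_S(w) = (1/n) * sum_i f(w; z_i)
        g = np.mean([subgrad_f(w, z) for z in data], axis=0)
        w = w - eta * g                      # Eq. 2, no projection
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)         # Eq. 3: average of w_1, ..., w_T

def abs_loss_subgrad(w, z):
    """Subgradient of the illustrative loss f(w; (x, y)) = |w.x - y| (convex, Lipschitz)."""
    x, y = z
    return np.sign(w @ x - y) * x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 5, 200
    X = rng.normal(size=(n, d)) / np.sqrt(d)
    y = X @ np.ones(d) + 0.1 * rng.normal(size=n)
    data = list(zip(X, y))
    T = n
    eta = 1.0 / np.sqrt(T)                   # the eta ~ 1/sqrt(T), T ~ n regime studied below
    print(gd_average_iterate(abs_loss_subgrad, data, eta, T, d))
```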

It is well known (e.g. Shalev-Shwartz and Ben-David (2014)) that the output of GD with standard initialization at the origin enjoys the following guarantee over the empirical risk, for every $B$:

F_{S}(\bar{w}^{S})\leq\min_{w\in\mathcal{W}^{B}}F_{S}(w)+\eta L^{2}+\frac{B^{2}}{\eta T}.  (4)

In contrast, as far as the population risk is concerned, it was recently shown in Amir et al. (2021b) that, for sufficiently large $d$, GD suffers from the following suboptimal rate (the result in Amir et al. (2021b) is formulated for projected GD; however, as the authors note, the proof holds verbatim for the unprojected version):

\mathop{\mathbb{E}}_{S\sim D^{n}}[F_{D}(\bar{w}^{S})]\geq\min_{w\in\mathcal{W}^{1}}F_{D}(w)+\Omega\left(\min\left\{\eta\sqrt{T}+\frac{1}{\eta T},\,1\right\}\right).  (5)

In particular, setting for concreteness $B=1$, a choice of $\eta=1/\sqrt{T}$ and $T=O(1/\varepsilon^{2})$ may lead to $\varepsilon$-error over the empirical risk, but is susceptible to overfitting. In fact, it turns out that $T=O(1/\varepsilon^{4})$ iterations are necessary and sufficient (Bassily et al., 2020) for GD (with or without projections) to obtain $O(\varepsilon)$ population error.

2.2 Generalized Linear Models.

One important class of convex learning problems is the class of generalized linear models (GLMs). In a GLM problem, $f(w;z)$ takes the following generalized linear form:

f(w;z)=\ell(w\cdot\phi(x),y),  (6)

where $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$, $\ell(a,y)$ is convex and $L$-Lipschitz w.r.t. $a$, and $\phi:\mathcal{X}\to H$ is an embedding of the domain $\mathcal{X}$ into some norm-bounded set in a (potentially infinite-dimensional) linear space.

As we mostly care about norm-bounded solutions, we will assume for concreteness that $\|\phi(x)\|\leq 1$. We also treat the value of the function at zero as constant, namely $|\ell(0,y)|\leq O(1)$. In turn, the function $f$ is $L$-Lipschitz as well, and we obtain by Lipschitzness that:

|f(w;z)|=|\ell(w\cdot\phi(x),y)|\leq O(\|w\|L).  (7)
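As a concrete instance of the form in Eq. 6, here is a small sketch (illustrative code only, assuming the hinge loss and a simple normalizing embedding; neither is prescribed by the paper) for which $L=1$ and $\|\phi(x)\|\leq 1$; it can be plugged directly into the GD sketch above.

```python
import numpy as np

def phi(x):
    """Illustrative embedding with ||phi(x)|| <= 1 (here, rescaling into the unit ball)."""
    return x / max(np.linalg.norm(x), 1.0)

def hinge(a, y):
    """ell(a, y) = max(0, 1 - y*a): convex and 1-Lipschitz in its first argument."""
    return max(0.0, 1.0 - y * a)

def glm_loss(w, z):
    """f(w; z) = ell(w . phi(x), y), the generalized linear form of Eq. 6."""
    x, y = z
    return hinge(w @ phi(x), y)

def glm_subgrad(w, z):
    """A subgradient of f(w; z) in w; its norm is at most L * ||phi(x)|| <= 1."""
    x, y = z
    return -y * phi(x) if 1.0 - y * (w @ phi(x)) > 0 else np.zeros_like(w)

if __name__ == "__main__":
    w = np.zeros(3)
    print(glm_loss(w, (np.array([1.0, 2.0, 2.0]), 1.0)))   # hinge(0, 1) = 1.0
```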

Uniform Convergence of GLMs.

One desirable property of GLMs is that, in contrast with general convex functions, they enjoy dimension-independent uniform convergence bounds. This, potentially, allows us to circumvent the bound in Eq. 5. In more detail, a seminal result due to Bartlett and Mendelson (2002) shows that, under our restrictions on $\ell$ and $\phi$, the empirical risk $F_{S}(w)$ converges uniformly to the population loss $F_{D}(w)$ as follows, for any $B$:

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{B}}\left\{F_{D}(w)-F_{S}(w)\right\}\right]\leq\frac{2LB}{\sqrt{n}}.  (8)

By a similar reasoning, one can show a stronger property of GLMs, which we will need: uniform convergence on any ball, not necessarily centered around zero. More formally, for every $u$ and $K$ let us denote

\mathcal{W}^{K}_{u}:=\{w:\|w-u\|\leq K\}.

We can then bound the expected generalization error of $w\in\mathcal{W}^{K}_{u}$ as follows:

Lemma 2.1.

Suppose $\ell(a,y)$ is $L$-Lipschitz w.r.t. $a$ and $\phi:\mathcal{X}\to H$ is an embedding into a linear space $H$ such that $\|\phi(x)\|\leq 1$. Then, for $f$ as in Eq. 6, we have for any $u$ and $K$:

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}_{u}^{K}}\left\{F_{D}(w)-F_{S}(w)\right\}\right]\leq\frac{2LK}{\sqrt{n}}.

The proof is essentially the same as the proof of Eq. 8, and exploits the Rademacher complexity of the class. For the sake of completeness we repeat it in Appendix B.
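For intuition, here is a brief sketch of the re-centering step behind Lemma 2.1 (under the lemma's assumptions; the full Rademacher complexity argument is the one in Appendix B). Writing $w=u+v$ with $\|v\|\leq K$,

\ell(w\cdot\phi(x),y)=\tilde{\ell}_{z}\big(v\cdot\phi(x)\big),\qquad\text{where}\qquad\tilde{\ell}_{z}(a):=\ell\big(u\cdot\phi(x)+a,\,y\big)\ \text{is $L$-Lipschitz in $a$,}

so by the contraction lemma the Rademacher complexity of the shifted loss class is at most $L$ times that of the centered linear class $\{x\mapsto v\cdot\phi(x):\|v\|\leq K\}$, which is at most $LK/\sqrt{n}$; standard symmetrization then yields the stated $2LK/\sqrt{n}$ bound.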

Eq. 8, and more generally Lemma 2.1, imply that any algorithm $\mathcal{A}$ whose range is restricted to a $B$-bounded ball and that has an optimization guarantee of

F_{S}(\mathcal{A}(S))-\min_{w\in\mathcal{W}^{B}}F_{S}(w)\leq O\left(\frac{LB}{\sqrt{n}}\right),

will also obtain an $O\left(LB/\sqrt{n}\right)$ convergence rate for the excess population risk. For example, projected GD is an algorithm that enjoys the aforementioned generalization bound.

3 Main Results

Theorem 3.1.

Suppose $D$ is a distribution over $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$. Let $f:\mathbb{R}^{d}\times\mathcal{Z}\rightarrow\mathbb{R}$ be a loss function of the form given in Eq. 6 such that for all $y\in\mathcal{Y}$, $\ell(\cdot,y)$ is convex and $L$-Lipschitz with respect to its first argument, and such that for all $x\in\mathcal{X}$: $\|\phi(x)\|\leq 1$. Then, for the gradient descent solution in Eq. 3 and for any $B\in\mathbb{R}$ we have:

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})\right]-\min_{w\in\mathcal{W}^{B}}F_{D}(w)\leq\tilde{O}\left(\frac{\eta L^{2}T}{n}+\frac{\eta L^{2}\sqrt{T}}{\sqrt{n}}+\eta L^{2}+\frac{B^{2}}{\eta T}\right).

In particular, setting $\eta=O(1/(L\sqrt{T}))$ and $T=n$ we get,

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})\right]\leq\inf_{w\in\mathbb{R}^{d}}\left\{F_{D}(w)+\tilde{O}\left(\frac{L(\|w\|^{2}+1)}{\sqrt{n}}\right)\right\}.

Thus, Theorem 3.1 shows that using GD, in the case of GLMs, we can learn the class $\mathcal{W}^{B}$ with sample complexity $m(\varepsilon)=\tilde{O}(1/\varepsilon^{2})$ and number of iterations $T=\tilde{O}(1/\varepsilon^{2})$.

Note that an improved rate of $\tilde{O}\left(L\|w\|/\sqrt{n}\right)$ can be obtained if we choose $\eta=O(\|w\|/(L\sqrt{T}))$, but this requires prior knowledge of the bounded domain radius. The bound we formulate above is for a learning rate that is oblivious to the norm of the optimal choice $w$.

The key technical tool in proving Theorem 3.1 is the following structural result that characterizes the output of GD over the empirical risk, and may be of independent interest:

Theorem 3.2.

Let $S=\{z_{1},\ldots,z_{n}\}$ be an i.i.d. sequence drawn from some unknown distribution $D$. Assume that $f:\mathbb{R}^{d}\times\mathcal{Z}\rightarrow\mathbb{R}$ is convex and $L$-Lipschitz with respect to its first argument, for all $z\in\mathcal{Z}$. Then, given the distribution $D$ there exists a sequence $u_{1},\ldots,u_{T}$ that depends on $D$ (but is independent of the sample $S$), such that, for any $\delta\in(0,1)$, with probability at least $1-\delta$ over $S$, the iterates of gradient descent, as depicted in Eq. 2, satisfy:

\forall t\in[T]:\quad\|w^{S}_{t}-u_{t}\|\leq O\left(\frac{\eta Lt}{\sqrt{n}}\sqrt{\log(T/\delta)}+\eta L\sqrt{t}\right).

For example, we can set $\eta=\tilde{O}(1/\sqrt{T})$ and $T=O(n)$; then with probability at least $1-O(1/n)$:

\|w^{S}_{t}-u_{t}\|=O(1).

As such, for the natural choice of learning rate and iteration complexity, we obtain that the iterates of GD remain in the vicinity of a deterministic trajectory, predetermined by the distribution to be learned. Our next result shows that this bound is tight up to logarithmic factors. We refer the reader to Appendix C for the full proof.

Theorem 3.3.

Fix $\eta,L,T$ and $n$. For any sequence $u_{1},\ldots,u_{T}$ independent of the sample $S$, there exists a convex and $L$-Lipschitz function $f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$ and a distribution $D$ over $\mathcal{Z}$, such that, with probability at least $1/20$ over $S$, the iterates of gradient descent, as depicted in Eq. 2, satisfy:

\forall t\in[T]:\quad\|w^{S}_{t}-u_{t}\|\geq\Omega\left(\frac{\eta Lt}{\sqrt{n}}+\eta L\sqrt{t}\right).

3.1 High Probability Rates

We next move to discuss results for learnability with high probability. We first remark that using Markov's inequality and standard confidence-boosting techniques, one can achieve high-probability rates at a computational and sample cost that is only logarithmic in the confidence parameter.

If we want, though, to achieve high-probability rates for the algorithm without alterations, the standard approach requires concentration bounds, which normally rely on boundedness of the predictor. This, unfortunately, is not guaranteed when we run GD without projections, as the predictions can be potentially unbounded.

However, under natural structural assumptions that are often met for the types of losses we are usually interested in, we can achieve such boundedness by a clipping procedure, which we describe next.

For this section we consider a loss function $\ell(a,y)$ such that $y\in[-b,b]$, and we assume that:

\ell(a,y)\geq\begin{cases}\ell(|y|,y)& a\geq|y|,\\ \ell(-|y|,y)& a\leq-|y|,\end{cases}\qquad\textrm{and}\qquad\forall a\in[-b,b]:\;|\ell(a,y)|\leq c.

Note that this is a very natural assumption to have in the case of convex surrogates for prediction tasks. For example, this holds for the widely used hinge loss $\ell(a,y)=\max\{0,1-ya\}$ in binary classification. Observe that under this assumption, if we consider $w\cdot\phi(x)$ as a predictor of $y$, the learner has no incentive to return a prediction that is outside of the interval $[-b,b]$.

Thus, we define the following mapping $g(a)$:

g(a)=\begin{cases}b& a\geq b,\\ a& a\in[-b,b],\\ -b& a\leq-b,\end{cases}  (9)

and consider the clipped solution $g(\bar{w}^{S}\cdot\phi(x))$, where $\bar{w}^{S}$ is the original output of GD defined in Eq. 3. We can now present our high probability result.
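For concreteness, here is a minimal sketch of the clipping step (illustrative code only; it assumes the bound $b$ is known and that the features have already been embedded via $\phi$):

```python
import numpy as np

def clip_prediction(a, b):
    """g(a) of Eq. 9: project the linear prediction onto the interval [-b, b]."""
    return float(np.clip(a, -b, b))

def clipped_loss(w, x_embedded, y, b, loss):
    """Loss of the clipped GD predictor, ell(g(w . phi(x)), y).

    `x_embedded` is assumed to already equal phi(x); `loss` is the surrogate,
    e.g. the hinge loss from the GLM sketch above.
    """
    return loss(clip_prediction(w @ x_embedded, b), y)
```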

Theorem 3.4.

Suppose $D$ is a distribution over $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ and let $\mathcal{W}^{B}=\{w\in\mathbb{R}^{d}:\|w\|\leq B\}$. Let $\ell(a,y)$ be a generalized linear loss function of the form given in Eqs. 6 and 3.1 such that for all $y\in\mathcal{Y}$, $\ell(\cdot,y)$ is convex and $L$-Lipschitz with respect to its first argument. Then, for the gradient descent solution in Eq. 3 and the function $g(a)$ in Eq. 9, we have with probability at least $1-\delta$:

\begin{align*}
\mathop{\mathbb{E}}_{(x,y)\sim D}\left[\ell\big(g\big(\bar{w}^{S}\cdot\phi(x)\big),y\big)\right]
&\leq\inf_{w\in\mathbb{R}^{d}}\left\{\mathop{\mathbb{E}}_{(x,y)\sim D}\left[\ell\big(w\cdot\phi(x),y\big)\right]+\frac{\|w\|^{2}}{\eta T}+\|w\|L\sqrt{\frac{\log(2/\delta)}{n}}\right\}\\
&\quad+O\left(\frac{\eta L^{2}T}{n}\sqrt{\log(4T/\delta)}+\frac{\eta L^{2}\sqrt{T}}{\sqrt{n}}+\eta L^{2}+c\sqrt{\frac{\log(8/\delta)}{n}}\right).
\end{align*}

In particular, setting $\eta=1/(L\sqrt{T})$ and $T=n$ we obtain that with probability $1-\delta$:

\mathop{\mathbb{E}}_{(x,y)\sim D}\left[\ell\big(g\big(\bar{w}^{S}\cdot\phi(x)\big),y\big)\right]\leq\inf_{w\in\mathcal{W}^{1}}\left\{\mathop{\mathbb{E}}_{(x,y)\sim D}\left[\ell\big(w\cdot\phi(x),y\big)\right]\right\}+\tilde{O}\left(\frac{L+c}{\sqrt{n}}\log(1/\delta)\right).

4 Comparison with non-distribution-dependent uniform convergence

In this section, we contrast our approach with what might be possible with a more traditional, non-distribution-dependent, uniform convergence argument. In particular, we consider an argument based on ensuring that the output of GD, with some stepsize $\eta$ and number of iterations $T$, is always (or with high probability) in a ball of radius $B(\eta,T)$ around the origin, namely, $\|\bar{w}^{S}\|\leq B(\eta,T)$. Hence, using Rademacher complexity bounds for GLMs, we can say that its population risk is within $O(BL/\sqrt{n})$ of its empirical risk. By selecting $\eta$ and $T$ to be very small, one can always ensure that the output of GD is in a small ball, but we also need to balance that with the empirical suboptimality of the output. That is, traditional bounds require us to find $\eta$ and $T$ that balance between the guarantee on the empirical suboptimality and the guarantee on the norm of the output. Such an approach would then result in population suboptimality that is governed by these two terms:

G(n)\coloneqq\inf_{\eta,T}\sup_{D,f}\max\left\{\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{S}(\bar{w}^{S})-\min_{w\in\mathcal{W}^{1}}F_{S}(w)\right],\;\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\frac{\|\bar{w}^{S}\|L}{\sqrt{n}}\right]\right\},  (10)

where $D$ and $f$ range over all valid distributions and functions such that $f$ is convex and Lipschitz. In particular, we would get,

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})\right]\leq\min_{w\in\mathcal{W}^{1}}F_{D}(w)+O(G(n)).
Claim.

For GLMs it holds that $G(n)=\Theta(L/n^{1/4})$.

Proof.

The upper bound follows by taking $\eta=1/(L\sqrt{T})$ and $T=\sqrt{n}$, bounding the first term using the standard GD guarantee in Eq. 4, and the second term by noting that each step increases the norm of the iterate by at most $\eta L$. Therefore, $\|\bar{w}^{S}\|\leq\eta LT$ and we obtain,

G(n)\leq\eta L^{2}+\frac{1}{\eta T}+\frac{\eta L^{2}T}{\sqrt{n}}=O\left(\frac{L}{n^{1/4}}\right).
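(Spelling out the substitution, a routine check: with $\eta=1/(L\sqrt{T})$ and $T=\sqrt{n}$, each of the three terms above equals $L/n^{1/4}$,

\eta L^{2}=\frac{L}{\sqrt{T}}=\frac{L}{n^{1/4}},\qquad\frac{1}{\eta T}=\frac{L}{\sqrt{T}}=\frac{L}{n^{1/4}},\qquad\frac{\eta L^{2}T}{\sqrt{n}}=\frac{L\sqrt{T}}{\sqrt{n}}=\frac{L}{n^{1/4}}.)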

For the lower bound we consider two deterministic functions in one dimension. When $\eta T\geq n^{1/4}/L$ we consider the deterministic objective $f(w;z)=Lw$. Clearly, the norm of the solution is $\|\bar{w}^{S}\|=\frac{1}{T}\sum_{t=1}^{T}\eta Lt\geq\eta LT/2$, thus:

\frac{\|\bar{w}^{S}\|L}{\sqrt{n}}\geq\Omega\left(\frac{L}{n^{1/4}}\right).

When $\eta T<n^{1/4}/L$ we consider the objective $f(w;z)=Lw/n^{1/4}$. Similarly, we have that $\bar{w}^{S}=-\eta L(T+1)/(2n^{1/4})$. Thus, the empirical suboptimality is:

F_{S}(\bar{w}^{S})-\min_{w\in\mathcal{W}^{1}}F_{S}(w)=-\frac{\eta L^{2}(T+1)}{2\sqrt{n}}+\frac{L}{n^{1/4}}\geq\frac{L}{2n^{1/4}}-\frac{L}{2Tn^{1/4}}\geq\Omega\left(\frac{L}{n^{1/4}}\right).

The claim shows that uniform convergence over a fixed ball around the origin can ensure learning, but only with a suboptimal sample complexity of $O(1/\varepsilon^{4})$. To obtain better bounds using this approach, additional structural assumptions on the loss or the distribution are required. For example, the work of Shamir (2020) assumes a max-margin condition and smoothness (specifically, the logistic loss) to obtain bounds on the norm of the solution, while our result applies to general Lipschitz GLMs without any distributional assumptions. It is insightful to directly contrast Eq. 10 with the approach of Section 3, which essentially entails looking at:

\tilde{G}(n)\coloneqq\inf_{\eta,T}\sup_{D,f}\max\left\{\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{S}(\bar{w}^{S})-\min_{w\in\mathcal{W}^{1}}F_{S}(w)\right],\;\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\frac{\|\bar{w}^{S}-u\|L}{\sqrt{n}}\right]\right\},

for some deterministic $u$. We can then apply a uniform concentration guarantee for a ball around it, and so $\tilde{G}$ is also sufficient to ensure:

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})\right]\leq\min_{w\in\mathcal{W}^{1}}F_{D}(w)+O(\tilde{G}(n)).

In Section 3 we show that $\tilde{G}(n)=\tilde{O}(L/\sqrt{n})$, improving on $G(n)$ and yielding optimal learning, up to logarithmic factors.

5 Technical Overview

Our main result, Theorem 3.1, establishes a generalization bound for GD without projection, and it builds upon the structural result presented in Theorem 3.2. We next outline the derivation of both results. Because the most interesting implications of our results are obtained when we choose $\eta=\tilde{O}(1/\sqrt{T})$ and $T=O(n)$, we focus the exposition on this regime, mainly to avoid cluttered notation. Hence, unless stated otherwise, we assume that $T$ and $\eta$ are fixed appropriately.

Our proof of Theorem 3.1 relies on two steps. First, through Theorem 3.2 we argue that the output of GD will be (w.h.p.) in a fixed ball that depends solely on the distribution (but not on the sample). Then, as a second step, we can apply standard uniform convergence for bounded-norm balls (see Lemma 2.1) to reason about generalization.

The second step builds on a standard generalization bound that is derived through Rademacher complexity. Also notice that the first step follows immediately from Theorem 3.2. Indeed, Theorem 3.2 argues that there exists a sequence $u_{1},\ldots,u_{T}$ such that if $w_{t}^{S}$ is the trajectory of GD over the empirical risk, we will have (w.h.p.):

\|w_{t}^{S}-u_{t}\|=O(1).

The output of GD is the averaged iterate, hence we obtain that indeed $\bar{w}^{S}$ is restricted to a ball around $u=\frac{1}{T}\sum_{t=1}^{T}u_{t}$, as required. Therefore, we are left with proving our main structural result: that the iterates of GD stay in the proximity of a trajectory $u_{1},\ldots,u_{T}$ that depends solely on the distribution (i.e. Theorem 3.2).

Toward proving Theorem 3.2, we introduce the following GD trajectory:

Gradient Descent on the population loss.

We consider an alternative gradient descent sequence that operates on the population loss $F_{D}(w)$ rather than the empirical risk $F_{S}(w)$. The update rule is then,

w^{D}_{t+1}=w^{D}_{t}-\eta\nabla F_{D}(w^{D}_{t}).  (11)

The sequence $w_{1}^{D},\ldots,w_{T}^{D}$ will serve as the sequence $u_{1},\ldots,u_{T}$.
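The following small simulation sketch illustrates how close the two trajectories tend to stay in practice (illustrative code only: the population trajectory of Eq. 11 is approximated by GD on a much larger independent sample, and the hinge-loss data model is an arbitrary choice, not one used in the paper):

```python
import numpy as np

def gd_trajectory(subgrad_f, data, eta, T, dim):
    """Iterates w_1, ..., w_T of full-batch GD (Eq. 2 / Eq. 11) on the given sample."""
    w = np.zeros(dim)
    traj = []
    for _ in range(T):
        g = np.mean([subgrad_f(w, z) for z in data], axis=0)
        w = w - eta * g
        traj.append(w.copy())
    return traj

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, n = 5, 200

    def sample(m):
        X = rng.normal(size=(m, d)) / np.sqrt(d)
        y = np.sign(X @ np.ones(d) + 0.1 * rng.normal(size=m))
        return list(zip(X, y))

    def hinge_subgrad(w, z):
        x, y = z
        return -y * x if 1.0 - y * (w @ x) > 0 else np.zeros(d)

    T, eta = n, 1.0 / np.sqrt(n)
    traj_S = gd_trajectory(hinge_subgrad, sample(n), eta, T, d)       # empirical trajectory w_t^S
    traj_D = gd_trajectory(hinge_subgrad, sample(20 * n), eta, T, d)  # crude proxy for w_t^D
    print(max(np.linalg.norm(ws - wd) for ws, wd in zip(traj_S, traj_D)))
```

In the $\eta=\tilde{O}(1/\sqrt{T})$, $T=O(n)$ regime, the printed maximal distance should remain of constant order, in line with Lemma 5.1 below.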

In the proof of Theorem 3.2 we require high-probability rates. However, for simplicity of exposition, we prove here a slightly weaker, in-expectation, result (the proof is deferred to Section A.1).

Lemma 5.1.

Let $S=\{z_{1},\ldots,z_{n}\}$ be an i.i.d. sequence drawn from some unknown distribution $D$. Assume that $f:\mathbb{R}^{d}\times\mathcal{Z}\rightarrow\mathbb{R}$ is convex and $L$-Lipschitz with respect to its first argument, for all $z\in\mathcal{Z}$. Then, the iterates of gradient descent, as depicted in Eqs. 2 and 11, satisfy:

\forall t\in[T]:\quad\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\|w^{S}_{t}-w^{D}_{t}\|\right]\leq O\left(\frac{\eta Lt}{\sqrt{n}}+\eta L\sqrt{t}\right).

This lemma bounds the distance, in expectation, between the GD trajectory over the empirical risk and the GD trajectory over the population risk. Again, we remark that to obtain the final result in Theorem 3.2, a stronger high probability version of Lemma 5.1 is required. The proof of Theorem 3.2 follows similar lines to that of Lemma 5.1 with modifications concerning specific concentration inequalities.

One crucial challenge in proving Lemma 5.1 stems from the adaptivity of the gradient sequence. In particular, notice that the sequences $w_{t}^{S},w_{t}^{D}$ are governed, respectively, by the dynamics

w_{t}^{S}=w_{t-1}^{S}-\eta\nabla F_{S}(w^{S}_{t-1}),\qquad w_{t}^{D}=w_{t-1}^{D}-\eta\nabla F_{D}(w^{D}_{t-1}).

At first glance, since $\nabla F_{S}(w)$ is an estimate of $\nabla F_{D}(w)$, it might seem that the result can be obtained by standard concentration bounds and an application of a union bound along the trajectory. Unfortunately, such a naive argument cannot work. Indeed, since $w_{t}$, for $t\geq 2$, depends on the sample $S$, $\nabla F_{S}(w_{t})$ is not necessarily an unbiased estimate of $\nabla F_{D}(w_{t})$. This is not merely a hypothetical concern: a construction in Amir et al. (2021b) demonstrates how, even after only two iterates, the gradient $\nabla F_{S}(w_{t})$ can diverge significantly from $\nabla F_{D}(w_{t})$. While these constructions are outside of the scope of GLMs, we remark that Theorem 3.2 holds for any convex and Lipschitz function. Moreover, it was shown that even for GLMs (Foster et al., 2018), the gradient estimates do not admit any dimension-independent uniform convergence bound, making it a challenge to pursue such a proof direction. To summarize, we need to prove that the two sequences remain in each other's vicinity, even though, a priori, the update steps of the two functions may be different at each iteration.

Towards this goal, we follow an analysis reminiscent of the uniform argument stability analysis of Bassily et al. (2020) for non-smooth convex losses. In a nutshell, both arguments compare the trajectory to a reference trajectory, and bound the incremental difference between the two trajectories while exploiting monotonicity of the (sub)gradient of the convex function. In the case of stability, the reference trajectory is the trajectory $w^{S^{\prime}}_{t}$ induced by a sample $S^{\prime}=(z_{1},\ldots,z^{\prime}_{i},\ldots,z_{n})$ that differs in a single example from the sample $S=(z_{1},\ldots,z_{i},\ldots,z_{n})$.

Bassily et al. showed that if $S$ and $S^{\prime}$ are two samples that differ in a single example, then:

\|w_{t}^{S}-w_{t}^{S^{\prime}}\|=O\left(\frac{\eta T}{n}+\eta\sqrt{T}\right).

In comparison, we show:

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\|w_{t}^{S}-w_{t}^{D}\|\right]=O\left(\frac{\eta T}{\sqrt{n}}+\eta\sqrt{T}\right).

We note that both results are optimal. Namely, the stability bound of Bassily et al. (2020) is the optimal stability rate, and the proximity bound we provide is the best possible bound against a fixed reference point that is independent of the sample.

Note that both bounds yield an $O(1)$ difference for the natural choice of $\eta$ and $T$. However, $O(1)$-stability guarantees are vacuous in the sense that they do not provide any interesting implication for the generalization of the algorithm. In contrast, we provide an $O(1)$-proximity guarantee to some fixed point that is completely independent of the sample $S$; this does imply generalization in our setup.

One might also suggest that our result of $O(1)$-proximity can be derived from $O(1)$-stability. However, it turns out this is not the case. For the same instance we use to lower bound the proximity by $\Omega\left(\eta T/\sqrt{n}\right)$ in Theorem 3.3, it can easily be shown that GD will be $O(\eta T/n)$-stable. Setting $\eta=O(1)$ and $T=O(n)$ will then imply $O(1)$-stability but with $\Omega(\sqrt{n})$-proximity. This asserts that our result is not a direct consequence of standard stability arguments.

References

  • Amir et al. [2021a] I. Amir, Y. Carmon, T. Koren, and R. Livni. Never go full batch (in stochastic convex optimization). In Advances in Neural Information Processing Systems, 2021a.
  • Amir et al. [2021b] I. Amir, T. Koren, and R. Livni. SGD generalizes better than GD (and regularization doesn’t help). In Conference on Learning Theory, 2021b.
  • Arora et al. [2019] S. Arora, N. Cohen, W. Hu, and Y. Luo. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems, 32, 2019.
  • Bartlett and Mendelson [2002] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2002.
  • Bassily et al. [2020] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems, 2020.
  • Blanc et al. [2020] G. Blanc, N. Gupta, G. Valiant, and P. Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Conference on learning theory, pages 483–513. PMLR, 2020.
  • Boucheron et al. [2013] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 978-0-19-953525-5. doi: 10.1093/acprof:oso/9780199535255.001.0001. URL https://doi.org/10.1093/acprof:oso/9780199535255.001.0001.
  • Chizat and Bach [2020] L. Chizat and F. Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pages 1305–1338. PMLR, 2020.
  • Dauber et al. [2020] A. Dauber, M. Feder, T. Koren, and R. Livni. Can implicit bias explain generalization? stochastic convex optimization as a case study. In Advances in Neural Information Processing Systems, 2020.
  • Feldman [2016] V. Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. In Advances in Neural Information Processing Systems, 2016.
  • Foster et al. [2018] D. J. Foster, A. Sekhari, and K. Sridharan. Uniform convergence of gradients for non-convex learning and optimization. Advances in Neural Information Processing Systems, 31, 2018.
  • Gunasekar et al. [2018a] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. In J. G. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1827–1836. PMLR, 2018a. URL http://proceedings.mlr.press/v80/gunasekar18a.html.
  • Gunasekar et al. [2018b] S. Gunasekar, J. D. Lee, N. Srebro, and D. Soudry. Implicit bias of gradient descent on linear convolutional networks. Advances in Neural Information Processing Systems, 2018b.
  • Gunasekar et al. [2018c] S. Gunasekar, B. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit regularization in matrix factorization. In 2018 Information Theory and Applications Workshop, 2018c.
  • HaoChen et al. [2021] J. Z. HaoChen, C. Wei, J. Lee, and T. Ma. Shape matters: Understanding the implicit bias of the noise covariance. In Conference on Learning Theory, pages 2315–2357. PMLR, 2021.
  • Hardt et al. [2016] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 2016.
  • Ji and Telgarsky [2019] Z. Ji and M. Telgarsky. The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory, 2019.
  • Ji et al. [2020a] Z. Ji, M. Dudík, R. E. Schapire, and M. Telgarsky. Gradient descent follows the regularization path for general losses. In Conference on Learning Theory, 2020a.
  • Ji et al. [2020b] Z. Ji, M. Dudík, R. E. Schapire, and M. Telgarsky. Gradient descent follows the regularization path for general losses. In J. D. Abernethy and S. Agarwal, editors, Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 2109–2136. PMLR, 2020b. URL http://proceedings.mlr.press/v125/ji20a.html.
  • Kakade et al. [2008] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 793–800. Curran Associates, Inc., 2008. URL https://proceedings.neurips.cc/paper/2008/hash/5b69b9cb83065d403869739ae7f0995e-Abstract.html.
  • Li et al. [2018] Y. Li, T. Ma, and H. Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pages 2–47. PMLR, 2018.
  • Li et al. [2019] Y. Li, E. X. Fang, H. Xu, and T. Zhao. Implicit bias of gradient descent based adversarial training on separable data. In International Conference on Learning Representations, 2019.
  • Lyu and Li [2019] K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2019.
  • Moroshko et al. [2020] E. Moroshko, B. E. Woodworth, S. Gunasekar, J. D. Lee, N. Srebro, and D. Soudry. Implicit bias in deep linear classification: Initialization scale vs training accuracy. Advances in neural information processing systems, 33:22182–22193, 2020.
  • Nacson et al. [2019] M. S. Nacson, S. Gunasekar, J. Lee, N. Srebro, and D. Soudry. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In International Conference on Machine Learning, pages 4683–4692. PMLR, 2019.
  • Nagarajan and Kolter [2019] V. Nagarajan and J. Z. Kolter. Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Nemirovski and Yudin [1983] A. S. Nemirovski and D. B. Yudin. Problem complexity and method efficiency in optimization. 1983.
  • Neyshabur et al. [2015] B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations (workshop track), 2015.
  • Pesme et al. [2021] S. Pesme, L. Pillaud-Vivien, and N. Flammarion. Implicit bias of sgd for diagonal linear networks: a provable benefit of stochasticity. Advances in Neural Information Processing Systems, 34, 2021.
  • Razin et al. [2021] N. Razin, A. Maman, and N. Cohen. Implicit regularization in tensor factorization. In International Conference on Machine Learning, pages 8913–8924. PMLR, 2021.
  • Sekhari et al. [2021] A. Sekhari, K. Sridharan, and S. Kale. Sgd: The role of implicit regularization, batch-size and multiple-epochs. Advances in Neural Information Processing Systems, 34, 2021.
  • Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • Shalev-Shwartz et al. [2009] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Conference on Learning Theory, 2009.
  • Shamir [2020] O. Shamir. Gradient methods never overfit on separable data. CoRR, abs/2007.00028, 2020. URL https://arxiv.org/abs/2007.00028.
  • Soudry et al. [2018] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
  • Sridharan et al. [2008] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. Advances in Neural Information Processing Systems, 2008.
  • Telgarsky [2013] M. Telgarsky. Margins, shrinkage, and boosting. In International Conference on Machine Learning, pages 307–315. PMLR, 2013.
  • Vaskevicius et al. [2020] T. Vaskevicius, V. Kanade, and P. Rebeschini. The statistical complexity of early-stopped mirror descent. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/024d2d699e6c1a82c9ba986386f4d824-Abstract.html.
  • Woodworth et al. [2020] B. Woodworth, S. Gunasekar, J. D. Lee, E. Moroshko, P. Savarese, I. Golan, D. Soudry, and N. Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635–3673. PMLR, 2020.

Appendix A Main Proofs

Recall that we define $w_{t}^{D}$ to be the $t$-th iterate when applying GD over the population risk, as depicted in Eq. 11.

A.1 Proof of Lemma 5.1

Observe that,

\begin{align*}
\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}
&=\left\|w^{S}_{t}-w^{D}_{t}-\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right)\right\|^{2}\\
&=\|w^{S}_{t}-w^{D}_{t}\|^{2}-2\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+\eta^{2}\left\|\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right\|^{2}\\
&\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}-2\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2} && (\text{$L$-Lipschitz})\\
&=\|w^{S}_{t}-w^{D}_{t}\|^{2}-2\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)\\
&\quad+2\eta\left(\nabla F_{D}(w^{D}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2}\\
&\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+2\eta\left(\nabla F_{D}(w^{D}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2},
\end{align*}

where in the last inequality we use monotonicity of the (sub)gradients of convex functions: $(\nabla F_{S}(w)-\nabla F_{S}(u))\cdot(w-u)\geq 0$ for any $w,u$. Next, applying the Cauchy-Schwarz inequality we get,

\begin{align*}
\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}
&\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+2\eta\left(\nabla F_{D}(w^{D}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2}\\
&\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+2\eta\left\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\right\|\left\|w^{S}_{t}-w^{D}_{t}\right\|+4\eta^{2}L^{2} && (\text{C.S.})\\
&\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+\eta^{2}c\left\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\right\|^{2}+\frac{1}{c}\|w^{S}_{t}-w^{D}_{t}\|^{2}+4\eta^{2}L^{2}\\
&=\left(1+\frac{1}{c}\right)\|w^{S}_{t}-w^{D}_{t}\|^{2}+\eta^{2}c\left\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\right\|^{2}+4\eta^{2}L^{2}.
\end{align*}

The last inequality follows from the observation that $2ab\leq ca^{2}+b^{2}/c$ for any $c>0$ and $a,b\geq 0$, applied with $a=\eta\|\nabla F_{S}(w_{t}^{D})-\nabla F_{D}(w_{t}^{D})\|$ and $b=\|w_{t}^{S}-w_{t}^{D}\|$.

Applying this bound recursively, and noting that $w^{S}_{0}=w^{D}_{0}$:

\begin{align*}
\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}
&\leq\sum_{t^{\prime}=0}^{t}\left(1+\frac{1}{c}\right)^{t-t^{\prime}}\left(\eta^{2}c\left\|\nabla F_{S}(w^{D}_{t^{\prime}})-\nabla F_{D}(w^{D}_{t^{\prime}})\right\|^{2}+4\eta^{2}L^{2}\right)\\
&\leq\sum_{t^{\prime}=0}^{t}\left(e\eta^{2}(t+1)\left\|\nabla F_{S}(w^{D}_{t^{\prime}})-\nabla F_{D}(w^{D}_{t^{\prime}})\right\|^{2}+4e\eta^{2}L^{2}\right),
\end{align*}

where in the last inequality we chose $c=t+1$ and used the known bound $(1+1/(t+1))^{t+1}\leq e$. Taking the square root and using the inequality $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ we conclude

\|w^{S}_{t+1}-w^{D}_{t+1}\|\leq\sqrt{e\eta^{2}(t+1)\sum_{t^{\prime}=0}^{t}\left\|\nabla F_{S}(w^{D}_{t^{\prime}})-\nabla F_{D}(w^{D}_{t^{\prime}})\right\|^{2}}+2\sqrt{e}\,\eta L\sqrt{t+1}.

We are interested in bounding $\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\|^{2}\right]$. By the definition of the empirical risk,

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\left\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\right\|^{2}\right]=\frac{1}{n^{2}}\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\left\|\sum_{i=1}^{n}\left(\nabla f(w^{D}_{t};z_{i})-\mathop{\mathbb{E}}_{z\sim D}\left[\nabla f(w^{D}_{t};z)\right]\right)\right\|^{2}\right].

Note that by Lipschitzness $\|\nabla f(w^{D}_{t};z_{i})-\mathop{\mathbb{E}}[\nabla f(w^{D}_{t};z)]\|\leq 2L$, and that $\{\nabla f(w^{D}_{t};z_{i})-\mathop{\mathbb{E}}[\nabla f(w^{D}_{t};z)]\}_{i\in[n]}$ are independent zero-mean random vectors (as $w^{D}_{t}$ is independent of $z_{i}$). Thus, we get

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\left\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\right\|^{2}\right]=\frac{1}{n^{2}}\sum_{i=1}^{n}\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\left\|\nabla f(w^{D}_{t};z_{i})-\mathop{\mathbb{E}}_{z\sim D}\left[\nabla f(w^{D}_{t};z)\right]\right\|^{2}\right]\leq\frac{4L^{2}}{n}.

Consequently, taking the expectation over the sample $S$ and using Jensen's inequality,

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\left\|w^{S}_{t+1}-w^{D}_{t+1}\right\|\right]\leq\frac{4\eta L(t+1)}{\sqrt{n}}+4\eta L\sqrt{t+1}.

A.2 Proof of Theorem 3.1

Starting with Lemma 2.1 we obtain, for any domain $\mathcal{W}^{K}_{u}$,

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(w)-F_{S}(w)\right\}\right]\leq\frac{2LK}{\sqrt{n}}.  (12)

From Theorem 3.2, there exists a sequence $u_{1},\ldots,u_{T}$ such that with probability at least $1-\delta$,

\left\|\bar{w}^{S}-\frac{1}{T}\sum_{t=1}^{T}u_{t}\right\|\leq\frac{7\eta LT}{\sqrt{n}}\sqrt{\log(T/\delta)}+4\eta L\sqrt{T}.  (13)

Setting $u=\frac{1}{T}\sum_{t=1}^{T}u_{t}$ and $K$ to be the RHS in Eq. 13 we obtain:

\begin{align*}
\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right]
&\leq\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(w)-F_{S}(w)\right\}\,\Big|\,\bar{w}^{S}\in\mathcal{W}^{K}_{u}\right]P(\bar{w}^{S}\in\mathcal{W}_{u}^{K})\\
&\quad+P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\sup_{S\sim D^{n}}\left|F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right|\\
&=\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(w)-F_{S}(w)\right\}\right]-\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(w)-F_{S}(w)\right\}\,\Big|\,\bar{w}^{S}\notin\mathcal{W}^{K}_{u}\right]P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\\
&\quad+P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\sup_{S\sim D^{n}}\left|F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right|\\
&\leq\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(w)-F_{S}(w)\right\}\right]+P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\sup_{S\sim D^{n}}\sup_{w\in\mathcal{W}^{K}_{u}}\left|F_{D}(w)-F_{S}(w)\right|\\
&\quad+P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\sup_{S\sim D^{n}}\left|F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right|\\
&\leq\frac{14\eta L^{2}T}{n}\sqrt{\log(T/\delta)}+\frac{8\eta L^{2}\sqrt{T}}{\sqrt{n}}+P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\sup_{S\sim D^{n}}\sup_{w\in\mathcal{W}^{K}_{u}}\left|F_{D}(w)-F_{S}(w)\right|\\
&\quad+P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\sup_{S\sim D^{n}}\left|F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right| && (\text{Eqs. 13 and 12})\\
&\leq\frac{14\eta L^{2}T}{n}\sqrt{\log(T/\delta)}+\frac{8\eta L^{2}\sqrt{T}}{\sqrt{n}}+O\left(\delta\eta L^{2}T\sqrt{\log(T/\delta)}\right),
\end{align*}

where we used Eq. 7 together with the facts that $\|\bar{w}^{S}\|\leq\eta LT$ and $\|u\|+K\leq O\left(\eta LT+\eta LT\sqrt{\log(T/\delta)}/\sqrt{n}\right)\leq O\left(\eta LT\sqrt{\log(T/\delta)}\right)$ to bound the second and third terms. Hence:

|FD(w¯S)FS(w¯S)||FD(w¯S)|+|FS(w¯S)|O(ηL2T),|F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})|\leq|F_{D}(\bar{w}^{S})|+|F_{S}(\bar{w}^{S})|\leq O\left(\eta L^{2}T\right),

and

supw𝒲uK\absFD(w)FS(w)supw𝒲uK\absFD(w)+supw𝒲uK\absFS(w)O\brk2ηL2Tlog(T/δ).\sup_{w\in\mathcal{W}^{K}_{u}}\abs{F_{D}(w)-F_{S}(w)}\leq\sup_{w\in\mathcal{W}^{K}_{u}}\abs{F_{D}(w)}+\sup_{w\in\mathcal{W}^{K}_{u}}\abs{F_{S}(w)}\leq O\brk 2{\eta L^{2}T\sqrt{\log(T/\delta)}}.

Next, setting δ=O(1/nT)\delta=O(1/\sqrt{nT}) we get that:

𝔼SDn\brk[s]1FD(w¯S)FS(w¯S)\displaystyle\mathop{\mathbb{E}}_{S\sim D^{n}}\brk[s]1{F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})} O(ηL2Tnlog\brknT+ηL2Tnlog\brknT).\displaystyle\leq O\left(\frac{\eta L^{2}T}{n}\sqrt{\log\brk{nT}}+\frac{\eta L^{2}\sqrt{T}}{\sqrt{n}}\sqrt{\log\brk{nT}}\right). (14)

Finally, combining Eqs. 14 and 4 we obtain that for every $w^{\star}\in\mathcal{W}^{B}$:

\mathbb{E}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})\right]-F_{D}(w^{\star})=\mathbb{E}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right]+\mathbb{E}_{S\sim D^{n}}\left[F_{S}(\bar{w}^{S})\right]-F_{D}(w^{\star})
=\mathbb{E}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right]+\mathbb{E}_{S\sim D^{n}}\left[F_{S}(\bar{w}^{S})-F_{S}(w^{\star})\right]
\leq O\left(\frac{\eta L^{2}T}{n}\sqrt{\log(nT)}+\frac{\eta L^{2}\sqrt{T}}{\sqrt{n}}\sqrt{\log\left(nT\right)}+\eta L^{2}+\frac{B^{2}}{\eta T}\right).
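As a quick sanity check on how the terms of this bound trade off, the following Python sketch (an illustration of ours, with $L=B=1$ and the tuning $T=n$, $\eta=B/(L\sqrt{T})$, which the theorem does not prescribe) evaluates the four terms and compares them to a target rate of roughly $BL\sqrt{\log n}/\sqrt{n}$.

```python
import math

def bound_terms(eta, T, n, L=1.0, B=1.0):
    """The four terms of the bound above, ignoring the absolute constants
    hidden by the O-notation."""
    log_term = math.sqrt(math.log(n * T))
    return (eta * L**2 * T / n * log_term,        # eta L^2 T sqrt(log nT) / n
            eta * L**2 * math.sqrt(T / n) * log_term,
            eta * L**2,                           # optimization term eta L^2
            B**2 / (eta * T))                     # optimization term B^2 / (eta T)

for n in (10**2, 10**4, 10**6):
    T = n                      # illustrative choice of the number of iterations
    eta = 1.0 / math.sqrt(T)   # = B / (L * sqrt(T)) with B = L = 1
    print(n, [round(term, 5) for term in bound_terms(eta, T, n)],
          "target ~", round(math.sqrt(math.log(n) / n), 5))
```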

A.3 Proof of Theorem 3.2

The proof is similar to that of Lemma 5.1, except that here we employ a concentration inequality for random variables with bounded differences. The reference sequence we consider is the sequence of GD iterates over the population risk, namely $w^{D}_{t}$ as described in Eq. 11. Observe that

\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}=\|w^{S}_{t}-w^{D}_{t}-\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right)\|^{2}
=\|w^{S}_{t}-w^{D}_{t}\|^{2}-2\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+\eta^{2}\|\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\|^{2}
\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}-2\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2} ($L$-Lipschitz)
=\|w^{S}_{t}-w^{D}_{t}\|^{2}-2\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)
\quad+2\eta\left(\nabla F_{D}(w^{D}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2}.

From convexity of $F_{S}$ we know that $\left(\nabla F_{S}(w)-\nabla F_{S}(u)\right)\cdot\left(w-u\right)\geq 0$ for any $w,u$. Therefore, applying the Cauchy-Schwarz inequality we get,

\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+2\eta\left(\nabla F_{D}(w^{D}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2} (convexity)
\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+2\eta\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\|\,\|w^{S}_{t}-w^{D}_{t}\|+4\eta^{2}L^{2} (Cauchy-Schwarz)
\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+\eta^{2}c\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\|^{2}+\frac{1}{c}\|w^{S}_{t}-w^{D}_{t}\|^{2}+4\eta^{2}L^{2}
=\left(1+\frac{1}{c}\right)\|w^{S}_{t}-w^{D}_{t}\|^{2}+\eta^{2}c\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\|^{2}+4\eta^{2}L^{2}. (15)

The last inequality follows from the observation that $2ab\leq ca^{2}+b^{2}/c$ for any $c>0$ and $a,b\geq 0$, applied with $a=\eta\|\nabla F_{S}(w_{t}^{D})-\nabla F_{D}(w_{t}^{D})\|$ and $b=\|w_{t}^{S}-w_{t}^{D}\|$.

We are interested in bounding $\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\|$. To that end, we consider the following concentration inequality, which is a direct consequence of the bounded difference inequality of McDiarmid.

Theorem (Boucheron, Lugosi, and Massart [2013, Example 6.3]).

Let $X_{1},\ldots,X_{n}$ be independent zero-mean random variables such that $\|X_{i}\|\leq c_{i}/2$, and denote $v=\frac{1}{4}\sum_{i=1}^{n}c_{i}^{2}$. Then, for all $t\geq\sqrt{v}$,

\mathbb{P}\left\{\Big\|\sum_{i=1}^{n}X_{i}\Big\|>t\right\}\leq e^{-(t-\sqrt{v})^{2}/(2v)}.

Note that by Lipschitzness $\|\nabla f(w^{D}_{t};z_{i})-\mathbb{E}\left[\nabla f(w^{D}_{t};z)\right]\|\leq 2L$, and that $\left\{\nabla f(w^{D}_{t};z_{i})-\mathbb{E}\left[\nabla f(w^{D}_{t};z)\right]\right\}_{i\in[n]}$ are independent zero-mean random variables (as $w^{D}_{t}$ is independent of $z_{i}$). Thus, for $\Delta\geq\frac{2L}{\sqrt{n}}$:

\mathbb{P}\left\{\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla f(w^{D}_{t};z_{i})-\mathbb{E}\left[\nabla f(w^{D}_{t};z)\right]\Big\|>\Delta\right\}\leq e^{-(\Delta\sqrt{n}-2L)^{2}/(4L^{2})}.

This implies that with probability at least $1-\delta$,

\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\|\leq\frac{4L}{\sqrt{n}}\sqrt{\log(1/\delta)}.
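As an aside, the following Python sketch (ours; it uses uniformly random norm-$L$ vectors as a stand-in for the per-example gradients, whose only property used above is the $2L$ bound) checks by Monte Carlo that deviations of the averaged gradient beyond the threshold $4L\sqrt{\log(1/\delta)}/\sqrt{n}$ are indeed rare.

```python
import numpy as np

rng = np.random.default_rng(0)
L, n, d, delta, trials = 1.0, 400, 20, 0.05, 2000

def gradient_sample(m):
    """I.i.d. stand-ins for the per-example gradients: uniformly random
    vectors of norm exactly L, so the bounded-difference condition holds."""
    g = rng.standard_normal((m, d))
    return L * g / np.linalg.norm(g, axis=1, keepdims=True)

# For this symmetric stand-in the population gradient is zero by symmetry.
threshold = 4 * L / np.sqrt(n) * np.sqrt(np.log(1 / delta))
deviations = np.array([np.linalg.norm(gradient_sample(n).mean(axis=0))
                       for _ in range(trials)])
print("empirical P[deviation > threshold] =", (deviations > threshold).mean(),
      " (should be at most delta =", delta, ")")
```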

Plugging this back into Eq. 15, we obtain with probability $1-\delta$,

\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}\leq\left(1+\frac{1}{c}\right)\|w^{S}_{t}-w^{D}_{t}\|^{2}+\frac{16\eta^{2}L^{2}c\log\left(1/\delta\right)}{n}+4\eta^{2}L^{2}.

Applying the formula recursively, and noting that $w^{S}_{0}=w^{D}_{0}$:

\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}\leq\sum_{t^{\prime}=0}^{t}\left(1+\frac{1}{c}\right)^{t^{\prime}}\left(\frac{16\eta^{2}L^{2}c\log\left(1/\delta\right)}{n}+4\eta^{2}L^{2}\right)
\leq\frac{16e\eta^{2}L^{2}(t+1)^{2}\log\left(1/\delta\right)}{n}+4e\eta^{2}L^{2}(t+1),

where in the last inequality we chose $c=t+1$ and used the bound $(1+1/t)^{t}\leq e$. Taking the square root and using the inequality $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ we have

\|w^{S}_{t+1}-w^{D}_{t+1}\|\leq 7\sqrt{\frac{\eta^{2}L^{2}(t+1)^{2}\log\left(1/\delta\right)}{n}}+4\eta L\sqrt{t+1}.

Taking a union bound over all $t\in[T]$ concludes the proof.
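To illustrate Theorem 3.2, the following Python sketch (ours, with an arbitrary choice of the logistic loss over unit-norm features so that $L=1$, and a large auxiliary sample standing in for the population $D$) runs GD on the empirical and on the proxy population risk and compares the drift $\|w^{S}_{t}-w^{D}_{t}\|$ to the high-probability envelope derived above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, T, eta, delta = 5, 200, 100, 0.1, 0.05
N_pop = 50_000                       # large sample standing in for the population D

def sample(m):
    x = rng.standard_normal((m, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # ||phi(x)|| <= 1
    y = np.sign(x @ np.ones(d) + 0.5 * rng.standard_normal(m))
    return x, y

def grad(w, x, y):
    """Gradient of the averaged logistic loss (1-Lipschitz since ||x|| <= 1)."""
    margins = y * (x @ w)
    return (-(y / (1 + np.exp(margins)))[:, None] * x).mean(axis=0)

xs, ys = sample(n)        # the empirical sample S
xp, yp = sample(N_pop)    # proxy for D

wS, wD, L = np.zeros(d), np.zeros(d), 1.0
for t in range(1, T + 1):
    wS -= eta * grad(wS, xs, ys)
    wD -= eta * grad(wD, xp, yp)
    if t % 20 == 0:
        envelope = 7 * eta * L * t * np.sqrt(np.log(t / delta) / n) + 4 * eta * L * np.sqrt(t)
        print(t, round(float(np.linalg.norm(wS - wD)), 4), "<=", round(float(envelope), 4))
```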

A.4 Proof of Theorem 3.4

Similarly to the proof in Section A.2, let us consider the domain $\mathcal{W}^{K}_{u}=\left\{w:\|w-u\|\leq K\right\}$, where we set $u=\frac{1}{T}\sum_{t=1}^{T}u_{t}$, the average of the deterministic sequence in Theorem 3.2. From the assumption in Section 3.1 it follows that for any $w$ we have $\left|\ell\left(g\left(w\boldsymbol{\cdot}\phi(x)\right),y\right)\right|\leq c$. We can also use Lemma 2.1 (applying it with $\ell\circ g$ in place of $\ell$) to obtain that

\mathbb{E}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(g\circ w)-F_{S}(g\circ w)\right\}\right]\leq\frac{2LK}{\sqrt{n}}, (16)

where we denote

F_{S}(g\circ w)=\frac{1}{n}\sum_{i=1}^{n}\ell\left(g\left(w\boldsymbol{\cdot}\phi(x_{i})\right),y_{i}\right),\quad F_{D}(g\circ w)=\mathbb{E}_{(x,y)\sim D}\left[\ell\left(g\left(w\boldsymbol{\cdot}\phi(x)\right),y\right)\right].

Next, we define

G(S)=\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(g\circ w)-F_{S}(g\circ w)\right\},

and note that for two samples $S,S^{\prime}$ that differ on a single example we have that

|G(S)-G(S^{\prime})|\leq\frac{2c}{n}.

Using the bounded difference inequality of McDiarmid [see Shalev-Shwartz and Ben-David, 2014, Lemma 26.4], we have that with probability at least $1-\delta$,

G(S)=\sup_{w\in\mathcal{W}^{K}_{u}}\left\{\mathbb{E}_{(x,y)\sim D}\left[\ell\left(g\left(w\boldsymbol{\cdot}\phi(x)\right),y\right)\right]-\frac{1}{n}\sum_{i=1}^{n}\ell\left(g\left(w\boldsymbol{\cdot}\phi(x_{i})\right),y_{i}\right)\right\}
\leq\mathbb{E}_{S\sim D^{n}}[G(S)]+c\sqrt{\frac{2\log(2/\delta)}{n}} (McDiarmid)
\leq\frac{2LK}{\sqrt{n}}+c\sqrt{\frac{2\log(2/\delta)}{n}}.

From Theorem 3.2 we have that with probability at least $1-\delta$,

\|\bar{w}^{S}-u\|\leq\frac{6\eta LT}{\sqrt{n}}\sqrt{\log\left(T/\delta\right)}+4\eta L\sqrt{T}. (17)

Taken together, and applying a union bound, we have that with probability at least $1-\delta$:

\mathbb{E}_{(x,y)\sim D}\left[\ell\left(g\left(\bar{w}^{S}\boldsymbol{\cdot}\phi(x)\right),y\right)\right]\leq\frac{1}{n}\sum_{i=1}^{n}\ell\left(g\left(\bar{w}^{S}\boldsymbol{\cdot}\phi(x_{i})\right),y_{i}\right)+\frac{12\eta L^{2}T}{n}\sqrt{\log\left(2T/\delta\right)}+\frac{8\eta L^{2}\sqrt{T}}{\sqrt{n}}+c\sqrt{\frac{2\log\left(4/\delta\right)}{n}}. (18)

Next, using the assumption in Section 3.1 and the fact that the optimization bound of Eq. 4 holds for any $B>0$:

\frac{1}{n}\sum_{i=1}^{n}\ell\left(g\left(\bar{w}^{S}\boldsymbol{\cdot}\phi(x_{i})\right),y_{i}\right)\leq\frac{1}{n}\sum_{i=1}^{n}\ell(\bar{w}^{S}\boldsymbol{\cdot}\phi(x_{i}),y_{i}) (Section 3.1)
\leq\inf_{B\in\mathbb{R}^{+}}\left\{\min_{w\in\mathcal{W}^{B}}\left\{\frac{1}{n}\sum_{i=1}^{n}\ell\left(w\boldsymbol{\cdot}\phi(x_{i}),y_{i}\right)\right\}+\eta L^{2}+\frac{B^{2}}{\eta T}\right\} (Eq. 4)
\leq\inf_{w\in\mathbb{R}^{d}}\left\{\frac{1}{n}\sum_{i=1}^{n}\ell\left(w\boldsymbol{\cdot}\phi(x_{i}),y_{i}\right)+\eta L^{2}+\frac{\|w\|^{2}}{\eta T}\right\}. (19)

Now, set $w^{\star}$ such that

\mathbb{E}_{(x,y)\sim D}\left[\ell\left(w^{\star}\boldsymbol{\cdot}\phi(x),y\right)\right]+\frac{\|w^{\star}\|^{2}}{\eta T}+\|w^{\star}\|L\sqrt{\frac{2\log(1/\delta)}{n}}\leq\inf_{w\in\mathbb{R}^{d}}\left\{\mathbb{E}_{(x,y)\sim D}\left[\ell\left(w\boldsymbol{\cdot}\phi(x),y\right)\right]+\frac{\|w\|^{2}}{\eta T}+\|w\|L\sqrt{\frac{2\log(1/\delta)}{n}}\right\}+\eta L^{2}. (20)

By the bound $|\ell(0,y)|\leq c$ and Lipschitzness we have $\left|\ell(w^{\star}\boldsymbol{\cdot}\phi(x),y)\right|\leq\|w^{\star}\|L+c$. Since $\left\{(x_{i},y_{i})\right\}_{i=1}^{n}$ are independent, it follows from Hoeffding's inequality that with probability at least $1-\delta$,

\frac{1}{n}\sum_{i=1}^{n}\ell\left(w^{\star}\boldsymbol{\cdot}\phi(x_{i}),y_{i}\right)-\mathbb{E}_{(x,y)\sim D}\left[\ell\left(w^{\star}\boldsymbol{\cdot}\phi(x),y\right)\right]\leq(\|w^{\star}\|L+c)\sqrt{\frac{2\log(1/\delta)}{n}}. (21)

Thus, we have that with probability $1-\delta$:

\frac{1}{n}\sum_{i=1}^{n}\ell\left(g\left(\bar{w}^{S}\boldsymbol{\cdot}\phi(x_{i})\right),y_{i}\right)\leq\inf_{w\in\mathbb{R}^{d}}\left\{\frac{1}{n}\sum_{i=1}^{n}\ell\left(w\boldsymbol{\cdot}\phi(x_{i}),y_{i}\right)+\eta L^{2}+\frac{\|w\|^{2}}{\eta T}\right\}
\leq\frac{1}{n}\sum_{i=1}^{n}\ell\left(w^{\star}\boldsymbol{\cdot}\phi(x_{i}),y_{i}\right)+\eta L^{2}+\frac{\|w^{\star}\|^{2}}{\eta T}
\leq\mathbb{E}_{(x,y)\sim D}\left[\ell\left(w^{\star}\boldsymbol{\cdot}\phi(x),y\right)\right]+\frac{\|w^{\star}\|^{2}}{\eta T}+(\|w^{\star}\|L+c)\sqrt{\frac{2\log(1/\delta)}{n}}+\eta L^{2} (Eq. 21)
\leq\inf_{w\in\mathbb{R}^{d}}\left\{\mathbb{E}_{(x,y)\sim D}\left[\ell\left(w\boldsymbol{\cdot}\phi(x),y\right)\right]+\frac{\|w\|^{2}}{\eta T}+(\|w\|L+c)\sqrt{\frac{2\log(1/\delta)}{n}}\right\}+2\eta L^{2}, (22)

where the last inequality follows from Eq. 20. Combining Eqs. 18 and 22 and applying a union bound we obtain the result.

Appendix B Proof of Lemma 2.1

Using the standard bound on the generalization error via the Rademacher complexity of the class (see e.g. Shalev-Shwartz and Ben-David [2014]), we have that:

\mathbb{E}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(w)-F_{S}(w)\right\}\right]\leq 2\,\mathbb{E}_{S\sim D^{n}}\left[\mathcal{R}_{S}(f\circ\mathcal{W}^{K}_{u})\right],

where we denote the function class

f\circ\mathcal{W}^{K}_{u}=\{z\mapsto\ell(w\boldsymbol{\cdot}\phi(x),y):w\in\mathcal{W}^{K}_{u}\},

and $\mathcal{R}_{S}(f\circ\mathcal{W}^{K}_{u})$ is the Rademacher complexity of the class $f\circ\mathcal{W}^{K}_{u}$, namely:

\mathcal{R}_{S}(f\circ\mathcal{W}^{K}_{u}):=\mathbb{E}_{\sigma}\left[\sup_{h\in f\circ\mathcal{W}^{K}_{u}}\frac{1}{n}\sum_{z_{i}\in S}\sigma_{i}h(z_{i})\right], (23)

where $\sigma_{1},\ldots,\sigma_{n}$ are i.i.d. Rademacher random variables.

We next show that:

\mathcal{R}_{S}(f\circ\mathcal{W}^{K}_{u})\leq\frac{LK}{\sqrt{n}}. (24)

To show Eq. 24, we use the following well-known property of the Rademacher complexity of a class:

Lemma B.1 (contraction lemma, see Shalev-Shwartz and Ben-David [2014]).

For each $i\in[n]$, let $\rho_{i}:\mathbb{R}\rightarrow\mathbb{R}$ be a convex $L$-Lipschitz function. Let $A\subseteq\mathbb{R}^{n}$ and denote $a=\left(a_{1},\ldots,a_{n}\right)\in A$. Then, if $\sigma=\sigma_{1},\ldots,\sigma_{n}$ are i.i.d. Rademacher random variables,

\mathbb{E}_{\sigma}\left[\sup_{a\in A}\sum_{i=1}^{n}\sigma_{i}\rho_{i}(a_{i})\right]\leq L\cdot\mathbb{E}_{\sigma}\left[\sup_{a\in A}\sum_{i=1}^{n}\sigma_{i}a_{i}\right].
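As a quick numerical illustration of the contraction lemma (not used in the proof), the following Python sketch takes $\rho_{i}(a)=|a|$, which is convex and $1$-Lipschitz, over a small random finite set $A$ of our own choosing, and compares the two expectations by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, trials = 6, 20, 20_000
A = rng.standard_normal((m, n))   # a finite set A of m points in R^n

lhs, rhs = 0.0, 0.0
for _ in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=n)
    lhs += np.max(np.sum(sigma * np.abs(A), axis=1))  # rho_i = |.| applied coordinate-wise
    rhs += np.max(np.sum(sigma * A, axis=1))
print("E[sup_a sum_i sigma_i |a_i|] ~", lhs / trials,
      " <= L * E[sup_a sum_i sigma_i a_i] ~", rhs / trials)   # here L = 1
```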

We also use the following bound on the Rademacher complexity of the class of linear predictors $\{x\mapsto v\boldsymbol{\cdot}\phi(x):\|v\|\leq K\}$ over a sample $S=\{\phi(x_{1}),\ldots,\phi(x_{n})\}$ of $\ell_{2}$ $1$-bounded vectors:

\mathcal{R}_{S}(\mathcal{W}^{K}_{0})\leq K/\sqrt{n}. (25)

Next, given a sample $S=\{z_{1},\ldots,z_{n}\}$ we define $\rho_{i}(\alpha):=\ell(\alpha+u\boldsymbol{\cdot}\phi(x_{i}),y_{i})$ and we set

A:=\{(v\boldsymbol{\cdot}\phi(x_{1}),\ldots,v\boldsymbol{\cdot}\phi(x_{n})):v\in\mathcal{W}_{0}^{K}\}.

Then:

n\mathcal{R}_{S}(f\circ\mathcal{W}^{K}_{u})=\mathbb{E}_{\sigma}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\sum_{i=1}^{n}\sigma_{i}\ell(w\boldsymbol{\cdot}\phi(x_{i}),y_{i})\right]
=\mathbb{E}_{\sigma}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\sum_{i=1}^{n}\sigma_{i}\ell((w-u)\boldsymbol{\cdot}\phi(x_{i})+u\boldsymbol{\cdot}\phi(x_{i}),y_{i})\right]
=\mathbb{E}_{\sigma}\left[\sup_{v\in\mathcal{W}^{K}_{0}}\sum_{i=1}^{n}\sigma_{i}\ell(v\boldsymbol{\cdot}\phi(x_{i})+u\boldsymbol{\cdot}\phi(x_{i}),y_{i})\right]
\leq L\cdot\mathbb{E}_{\sigma}\left[\sup_{v\in\mathcal{W}^{K}_{0}}\sum_{i=1}^{n}\sigma_{i}v\boldsymbol{\cdot}\phi(x_{i})\right] (Lemma B.1)
\leq LK\sqrt{n}. (Eq. 25)

Dividing by $n$ yields the proof.
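For intuition, the following Python sketch (an illustration of ours, over a random unit-norm sample) estimates the centered linear-class quantity appearing in the last step by Monte Carlo and compares it to the bound $K/\sqrt{n}$ of Eq. 25; the inner supremum is computed in closed form as $\frac{K}{n}\|\sum_{i}\sigma_{i}\phi(x_{i})\|$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, K, trials = 100, 10, 3.0, 5000

phi = rng.standard_normal((n, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)   # ||phi(x_i)|| = 1

estimates = []
for _ in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=n)
    # sup over ||v|| <= K of (1/n) sum_i sigma_i v . phi(x_i), in closed form
    estimates.append(K / n * np.linalg.norm(sigma @ phi))
print("Monte Carlo Rademacher estimate:", np.mean(estimates), "  bound K/sqrt(n):", K / np.sqrt(n))
```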

Appendix C Proof of Theorem 3.3

Our construction comprises two separate instances. We first provide lower bounds, Lemmas C.1 and C.2, on the distance between the GD iterates $w^{S}_{t},w^{S^{\prime}}_{t}$ obtained from two separate i.i.d. samples $S=\left(z_{1},\ldots,z_{n}\right)$ and $S^{\prime}=\left(z^{\prime}_{1},\ldots,z^{\prime}_{n}\right)$, respectively.

Lemma C.1.

Fix $\eta,L,T$ and $n$. Suppose $S$ and $S^{\prime}$ are i.i.d. samples drawn from $D^{n}$. There exists a convex and $L$-Lipschitz function $f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$ and a distribution $D$ over $\mathcal{Z}$, such that, if $w^{S}_{t}$ and $w^{S^{\prime}}_{t}$ are defined as in Eq. 2, then with probability at least $1/10$:

\forall t\in[T]:\quad\|w^{S}_{t}-w^{S^{\prime}}_{t}\|\geq\Omega\left(\frac{\eta Lt}{\sqrt{n}}\right).
Lemma C.2.

Fix $\eta,L,T$ and $n$. Suppose $S$ and $S^{\prime}$ are i.i.d. samples drawn from $D^{n}$. There exists a convex and $L$-Lipschitz function $f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$ and a distribution $D$ over $\mathcal{Z}$, such that, if $w^{S}_{t}$ and $w^{S^{\prime}}_{t}$ are defined as in Eq. 2, then with probability at least $1/10$:

\forall t\in[T]:\quad\|w^{S}_{t}-w^{S^{\prime}}_{t}\|\geq\Omega\left(\eta L\sqrt{t}\right).

One can then take the dominant of the two bounds and obtain that with probability at least $1/10$:

\|w^{S}_{t}-w^{S^{\prime}}_{t}\|\geq\Omega\left(\frac{\eta Lt}{\sqrt{n}}+\eta L\sqrt{t}\right). (26)

Now consider any $u_{t}$ that is independent of the samples $S$ and $S^{\prime}$. Then, by the triangle inequality, we have that

\mathbb{P}\left(\|w^{S}_{t}-w^{S^{\prime}}_{t}\|\geq a\right)\leq\mathbb{P}\left(\|w^{S}_{t}-u_{t}\|+\|w^{S^{\prime}}_{t}-u_{t}\|\geq a\right)
\leq\mathbb{P}\left(\|w^{S}_{t}-u_{t}\|\geq a/2\right)+\mathbb{P}\left(\|w^{S^{\prime}}_{t}-u_{t}\|\geq a/2\right)
=2\,\mathbb{P}\left(\|w^{S}_{t}-u_{t}\|\geq a/2\right). ($S$ and $S^{\prime}$ are i.i.d.)

Dividing by $2$ and using Eq. 26 we conclude the proof.

C.1 Proof of Lemma C.1

Suppose $f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$ takes the following form:

f(w;z)=Lw\cdot z,

where $\mathcal{W}\subseteq\mathbb{R}$ and $z=\pm 1$ with probability $1/2$ each. Given samples $S=\{z_{1},\ldots,z_{n}\}$ and $S^{\prime}=\{z^{\prime}_{1},\ldots,z^{\prime}_{n}\}$, the update rule in Eq. 2 yields

w^{S}_{t+1}=-\eta Lt\cdot\frac{1}{n}\sum_{i=1}^{n}z_{i},\quad w^{S^{\prime}}_{t+1}=-\eta Lt\cdot\frac{1}{n}\sum_{i=1}^{n}z^{\prime}_{i}.

This implies that $|w^{S}_{t+1}-w^{S^{\prime}}_{t+1}|=\eta Lt\cdot\left|\frac{1}{n}\sum_{i=1}^{n}\left(z_{i}-z^{\prime}_{i}\right)\right|$. Note that

z_{i}-z^{\prime}_{i}=\begin{cases}2&\text{w.p. }1/4,\\ 0&\text{w.p. }1/2,\\ -2&\text{w.p. }1/4.\end{cases}

Using the Berry-Esseen inequality one can show that with probability at least $1/10$:

\left|\frac{1}{n}\sum_{i=1}^{n}\left(z_{i}-z^{\prime}_{i}\right)\right|\geq\frac{1}{\sqrt{n}}.

In turn we conclude that with probability at least $1/10$,

|w^{S}_{t+1}-w^{S^{\prime}}_{t+1}|\geq\frac{\eta Lt}{\sqrt{n}}.

We remark that $f(w;z)$ can be embedded into any larger dimension, so our lower bound holds in arbitrary dimension.
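The following Python sketch (ours, with arbitrary illustrative values of $\eta,L,n,t$) simulates this one-dimensional construction and estimates the probability that the two iterate sequences are at least $\eta Lt/\sqrt{n}$ apart; empirically it comes out well above the $1/10$ claimed by the lemma.

```python
import numpy as np

rng = np.random.default_rng(3)
L, eta, n, t, trials = 1.0, 0.1, 400, 50, 10_000

hits = 0
for _ in range(trials):
    z = rng.choice([-1, 1], n)       # sample S of i.i.d. signs
    zp = rng.choice([-1, 1], n)      # independent sample S'
    wS, wSp = -eta * L * t * z.mean(), -eta * L * t * zp.mean()
    hits += abs(wS - wSp) >= eta * L * t / np.sqrt(n)
print("P[|w_t^S - w_t^S'| >= eta*L*t/sqrt(n)] ~", hits / trials)
```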

C.2 Proof of Lemma C.2

This proof relies on the same construction as Bassily et al. [2020]. The difference is that we show a lower bound between iterates over two i.i.d. samples, while their result holds for two samples that differ only on a single example. The main observation here is that, with some constant probability, the problem is reduced to that of Bassily et al. [2020]. Consider the following $f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$:

f(w;z)=-\frac{\gamma L}{2}z\,(\mathbbm{1}\boldsymbol{\cdot}w)+\frac{L}{2}\max_{i\in[d]}\left\{w_{i}-\varepsilon_{i},0\right\},

where $\mathcal{W}\subseteq\mathbb{R}^{d}$, $\mathbbm{1}$ is the all-ones vector, and

z=\begin{cases}1&\text{w.p. }1/(n+1),\\ 0&\text{w.p. }1-1/(n+1).\end{cases}

We also choose $\varepsilon_{i}$ such that $0<\varepsilon_{1}<\ldots<\varepsilon_{d}<\gamma\eta L/(2n)$, a sufficiently small $\gamma=1/(4\sqrt{dT})$, and $d>T$. Observe that for a given sample $S=\left(z_{1},\ldots,z_{n}\right)$ the empirical risk is then

F_{S}(w)=-\frac{\gamma L}{2n}\sum_{i=1}^{n}z_{i}\,(\mathbbm{1}\boldsymbol{\cdot}w)+\frac{L}{2}\max_{i\in[d]}\left\{w_{i}-\varepsilon_{i},0\right\}.

We now claim that with probability $(1-\frac{1}{n+1})^{n}$ over $S^{\prime}$, the empirical risk is given by

F_{S^{\prime}}(w)=\frac{L}{2}\max_{i\in[d]}\left\{w_{i}-\varepsilon_{i},0\right\}.

Conditioned on this event, we get that $\nabla F_{S^{\prime}}(0)=0$ and therefore $w^{S^{\prime}}_{t}=0$ for any $t\in[T]$. In addition, the event that $z_{i}=1$ for at least a single $i\in[n]$ in the sample $S$ occurs with probability $1-(1-\frac{1}{n+1})^{n}$. Since

\left(1-\frac{1}{n+1}\right)^{n}\leq\left(1-\frac{1}{2n}\right)^{n}\leq e^{-1/2},\quad\text{and}\quad\left(1-\frac{1}{n+1}\right)^{n}\geq e^{-1},

and the two samples are independent, we have that with probability at least $e^{-1}\cdot\left(1-e^{-1/2}\right)\geq 0.14\geq 1/10$ both events occur. Note that $\nabla F_{S}(w)=-\frac{\gamma L}{2n}\sum_{i=1}^{n}z_{i}\mathbbm{1}+\frac{L}{2}\nabla\max_{i\in[d]}\left\{w_{i}-\varepsilon_{i},0\right\}$. Then, applying the update rule in Eq. 2 and using the fact that $w^{S}_{0}=0$, we get

w^{S}_{1}=\frac{\gamma\eta L}{2n}\sum_{i=1}^{n}z_{i}\mathbbm{1}.

Recall that under the aforementioned event we have $\frac{1}{n}\sum_{i=1}^{n}z_{i}\geq\frac{1}{n}$. This implies that $w^{S}_{1}(i)\geq\frac{\gamma\eta L}{2n}>\varepsilon_{i}$ for any $i\in[d]$. Therefore,

w^{S}_{2}=w^{S}_{1}-\eta\nabla F_{S}(w^{S}_{1})=\frac{2\gamma\eta L}{2n}\sum_{i=1}^{n}z_{i}\mathbbm{1}-\frac{\eta L}{2}e_{1},

where $e_{i}$ is the $i$-th standard basis vector. Since $\gamma\leq 1/(4T)$ we have that $w^{S}_{2}(1)\leq\frac{\eta L}{4T}-\frac{\eta L}{2}<0$. Unrolling this dynamic recursively we obtain

w^{S}_{t+1}=\frac{\gamma\eta Lt}{2n}\sum_{i=1}^{n}z_{i}\mathbbm{1}-\frac{\eta L}{2}\sum_{s\in[t]}e_{s}.

Using the reverse triangle inequality we have,

\|w^{S}_{t}-w^{S^{\prime}}_{t}\|=\|w^{S}_{t}\| ($w^{S^{\prime}}_{t}=0$)
\geq\frac{\eta L}{2}\Big\|\sum_{s\in[t]}e_{s}\Big\|-\frac{\gamma\eta Lt}{2n}\Big|\sum_{i=1}^{n}z_{i}\Big|\,\|\mathbbm{1}\| (reverse triangle inequality)
\geq\frac{\eta L\sqrt{t}}{2}-\frac{\gamma\eta Lt}{2}\sqrt{d} ($\|\mathbbm{1}\|=\sqrt{d}$ and $|\sum_{i=1}^{n}z_{i}|\leq n$)
\geq\frac{3}{8}\eta L\sqrt{t}. ($\gamma\leq 1/(4\sqrt{dt})$)
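A minimal Python sketch of this construction (ours, with illustrative values of $\eta,L,n,T$ and one admissible choice of $\varepsilon_{i}$), conditioning on the two events analysed above: $S$ contains a single $z_{i}=1$ and $S^{\prime}$ contains only zeros, so the sketch tracks $\|w^{S}_{t}\|=\|w^{S}_{t}-w^{S^{\prime}}_{t}\|$ and compares it to $\frac{3}{8}\eta L\sqrt{t}$. The subgradient selection (lowest-index maximizer of $w_{i}-\varepsilon_{i}$) matches the dynamics described in the proof.

```python
import numpy as np

L, eta, n, T = 1.0, 0.1, 50, 40
d = T + 1
gamma = 1.0 / (4 * np.sqrt(d * T))
# 0 < eps_1 < ... < eps_d < gamma*eta*L/(2n), one admissible choice
eps = gamma * eta * L / (2 * n) * np.arange(1, d + 1) / (d + 1)

def subgrad_FS(w, z):
    """A subgradient of F_S at w: the linear part plus (L/2) e_j for the
    lowest-index maximizer j of w_i - eps_i, whenever that maximum is positive."""
    g = -gamma * L / (2 * n) * z.sum() * np.ones(d)
    j = int(np.argmax(w - eps))
    if w[j] - eps[j] > 0:
        g[j] += L / 2
    return g

z = np.zeros(n); z[0] = 1.0   # event over S: a single z_i equals 1
wS = np.zeros(d)              # under the all-zeros event over S', w_t^{S'} = 0 for all t
for t in range(1, T + 1):
    wS = wS - eta * subgrad_FS(wS, z)
    if t % 10 == 0:
        print(t, round(float(np.linalg.norm(wS)), 4), ">=", round(3 / 8 * eta * L * np.sqrt(t), 4))
```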