
Thinking Outside the Ball: Optimal Learning with Gradient Descent for Generalized Linear Stochastic Convex Optimization

Idan Amir Department of Electrical Engineering, Tel Aviv University; idanamir@mail.tau.ac.il.    Roi Livni Department of Electrical Engineering, Tel Aviv University; rlivni@tauex.tau.ac.il.    Nathan Srebro Toyota Technological Institute at Chicago; nati@ttic.edu.
Abstract

We consider linear prediction with a convex Lipschitz loss, or more generally, stochastic convex optimization problems of generalized linear form, i.e. where each instantaneous loss is a scalar convex function of a linear function. We show that in this setting, early stopped Gradient Descent (GD), without any explicit regularization or projection, ensures excess error at most $\varepsilon$ (compared to the best possible with unit Euclidean norm) with an optimal, up to logarithmic factors, sample complexity of $\tilde{O}(1/\varepsilon^{2})$ and only $\tilde{O}(1/\varepsilon^{2})$ iterations. This contrasts with general stochastic convex optimization, where $\Omega(1/\varepsilon^{4})$ iterations are needed (Amir et al., 2021b). The lower iteration complexity is ensured by leveraging uniform convergence rather than stability. But instead of uniform convergence in a norm ball, which we show can guarantee learning only with suboptimal $\Theta(1/\varepsilon^{4})$ sample complexity, we rely on uniform convergence in a distribution-dependent ball.

1 Introduction

The recent success of learning using deep networks, with many more parameters than training points, and even without any explicit regularization, has brought back interest in how biases and “implicit regularization” due to the optimization method used, can ensure good generalization, even in situations where minimizing the optimization objective itself cannot Neyshabur et al. (2015). Alongside interest in understanding the algorithmic biases of optimization in non-convex, deep models, and how they can yield good generalization (e.g. Gunasekar et al., 2018c, b; Li et al., 2018; Nacson et al., 2019; Arora et al., 2019; Lyu and Li, 2019; Woodworth et al., 2020; Moroshko et al., 2020; Chizat and Bach, 2020; Blanc et al., 2020; HaoChen et al., 2021; Razin et al., 2021; Pesme et al., 2021), there has also been renewed interest in understanding the fundamentals of algorithmic regularization in convex models (Shalev-Shwartz et al., 2009; Feldman, 2016; Amir et al., 2021a, b; Sekhari et al., 2021; Dauber et al., 2020), both as an interesting and important setting in its own right, and even more so as a basis for understanding the situation in more complex models (how can we hope to understand phenomena in deep, non-convex models, if we can’t even understand them in convex models?). In particular, these fundamental questions include the relationship between algorithmic regularization, explicit regularization, uniform convergence and stability; and the importance of stochasticity and early stopping.

In this paper we focus on the algorithmic bias of deterministic (full batch) optimization methods, and in particular of full-batch gradient descent (i.e. gradient descent on the empirical risk, or “training error”). Even when gradient descent (GD) is run to convergence, it affords some bias that can ensure generalization. E.g. in linear regression (and even slightly more general settings) where interpolation is possible, it can be shown to converge to the minimum norm interpolating solution. This can be sufficient for generalization even in underdetermined situations where other interpolators (i.e. other minimizers of the optimization objective) would not generalize well. Indeed, the generalization ability of the minimum norm interpolating solution, and hence of GD, even in noisy situations, is the subject of much study. And in parallel, there has also been work going beyond GD on linear regression, characterizing the limit point of GD, or of other optimization methods such as Mirror Descent, steepest descent w.r.t. a norm, coordinate descent and AdaBoost, for different types of loss functions (Telgarsky, 2013; Soudry et al., 2018; Gunasekar et al., 2018a; Ji and Telgarsky, 2019; Li et al., 2019; Ji et al., 2020a; Shamir, 2020; Ji et al., 2020b; Vaskevicius et al., 2020).

But what about situations where interpolation, or completely minimizing the empirical risk, is not desirable, and generalization requires compromising on the empirical risk? In these cases one can consider early stopping, i.e. running GD only for some specific number of iterations. Indeed, early stopped GD is a common regularization approach in practice, and other learning approaches, most prominently Boosting, can also be viewed as early stopping of an optimization procedure (coordinate descent in the case of Boosting). When and how can such early stopped GD allow for generalization? How does this compare to Stochastic Gradient Descent (SGD), or to using explicit regularization, both in terms of generalization ability, and the number of required optimization iterations? And what tools, such as uniform convergence, distribution-dependent uniform convergence, and stability, are appropriate for studying the generalization ability of GD?

Our main result

We will show that when training a linear predictor with a convex Lipschitz loss (or more generally, for stochastic convex optimization with a generalized linear instantaneous objective), GD with early stopping can generalize optimally, up to logarithmic factors, with optimal sample complexity $\tilde{O}(1/\varepsilon^{2})$, and with an optimal number $\tilde{O}(1/\varepsilon^{2})$ of iterations (the same as stochastic gradient descent), even without any explicit regularization, and in particular without projections onto a norm ball; just unconstrained GD on the training error. This contrasts with previous results regarding early stopped GD for arbitrary stochastic convex optimization (not necessarily generalized linear), for which Amir et al. (2021b) showed $\Theta(1/\varepsilon^{4})$ GD iterations are needed (even if projections or explicit regularization are used).

Stability vs Uniform Convergence for SGD and GD

The important difference here, and the only property of generalized linear models (GLMs) that we rely on, is that GLMs satisfy uniform convergence (Bartlett and Mendelson, 2002): empirical losses converge to their expectations uniformly over all predictors in Euclidean balls. This is in contrast to general Stochastic Convex Optimization (SCO), for which no such uniform convergence is possible (Shalev-Shwartz et al., 2009). Consequently, rather than uniform convergence, the analysis for SCO is based on stability arguments (Bassily et al., 2020). For SGD, stability at each step can be used in an online analysis, combined with an online-to-batch conversion, which is sufficient for ensuring optimal generalization with an optimal $O(1/\varepsilon^{2})$ number of iterations. But for GD, we must consider the stability of the method as a whole, which leads to optimal rates in the case of smooth losses (Hardt et al., 2016) but otherwise is much worse, necessitating a smaller stepsize, and hence quadratically more iterations; Amir et al. (2021b) showed that this is not just an analysis artifact, but a real limitation of GD for general SCO. In this paper, we show that once generalization can be ensured via uniform convergence, e.g. for GLMs, then GD does not have to worry about stability, can take much larger step sizes, and generalizes optimally after only $\tilde{O}(1/\varepsilon^{2})$ iterations.

But we must be careful with how we ensure uniform convergence! To rely on uniform convergence, we need to ensure the output of GD lies in some ball, and the generalization error would then scale with the radius of the ball. A naïve approach would be to ensure that the output of GD lies in a norm ball around the origin, and rely on uniform convergence in this ball. In Section 4 we show that this approach can ensure generalization, but only with a suboptimal sample complexity of $O(1/\varepsilon^{4})$, that is, worse than with the stability-based approach. Instead, we show that, with high probability, the output of GD lies in a small (constant radius) distribution-dependent ball, centered not at the origin, but around the (distribution-dependent) output of GD on the population objective. Even though this ball is unknown to the algorithm, this is still sufficient for generalization. The situation here is similar to the notion of algorithm-dependent uniform convergence introduced by Nagarajan and Kolter (2019), though we should emphasize that here we show that algorithm-dependent uniform convergence is sufficient for optimal generalization.

Context and Insights

Our results complement recent results exposing gaps between generalization using stochastic optimization versus explicit regularization or deterministic optimization in stochastic convex optimization. Sekhari et al. (2021) showed that in a slightly weaker setting (where only the population loss is convex, but instantaneous losses can be non-convex), there can be large gaps between SGD and GD or explicit regularization: even though SGD can learn with the optimal sample complexity of $O(1/\varepsilon^{2})$, explicit regularization, in the form of regularized empirical risk minimization, cannot learn at all, even with arbitrarily many samples, and GD requires at least a suboptimal $\Omega(1/\varepsilon^{2.4})$ number of samples (it is not clear whether this sample complexity is sufficient for GD, or even whether GD can ensure learning at all). This highlights that the generalization ability of SGD cannot be understood in terms of mimicking some regularizer, as well as gaps between stochastic and deterministic optimization.

Returning to the strict SCO setting, where the instantaneous losses are convex, uniform convergence still does not hold, and generalization can only be ensured via algorithm-dependent bounds. Nevertheless, optimal generalization can be ensured either through explicit regularization, SGD, or GD: all three approaches can ensure learning with $\Theta(1/\varepsilon^{2})$ samples (Shalev-Shwartz et al., 2009; Nemirovski and Yudin, 1983; Bassily et al., 2020). Even so, differences between deterministic and stochastic methods still exist. Dauber et al. (2020) show that in SCO, the output of SGD cannot even be guaranteed to lie in some “small” distribution-dependent set of (approximate) empirical risk minimizers (“smallness” here refers to a size notion that measures how well an empirical risk minimizer over the set generalizes; see Dauber et al. (2020) for the definition of a statistically complex set), and so the generalization ability of SGD in this setting cannot be attributed to any “regularizer” being small. Dauber et al. also show that GD does not follow some distribution-independent regularization path. In contrast, here we show that if we do take the distribution into account, then GD is constrained to follow (at least approximately) a predetermined path, i.e. a deterministic path that is independent of the sample but does depend on the distribution. Finally, in this case there is also a gap between SGD and GD in the required number of iterations (Amir et al., 2021b). Surprisingly, this gap cannot be fixed by adding regularization to the objective (Amir et al., 2021a).

Importantly, for either weak or strict SCO, regularization in the form of constraining the norm of the predictor (i.e. empirical risk minimization inside the hypothesis class we are competing with) is not sufficient for learning (Shalev-Shwartz et al., 2009; Feldman, 2016). The failure of constrained ERM is critical here for the constructions of Amir et al., Dauber et al., and Sekhari et al. establishing the gaps above: since projected gradient descent would quickly converge to the constrained ERM, it would also generalize just as well, and so the gaps and constructions are valid also for projected gradient descent.

In this paper we turn to the GLM setting, which is perhaps one of the most well studied frameworks in the learning theory literature, as it captures fundamental problems such as logistic regression, SVM and many more. Uniform convergence for GLMs (Bartlett and Mendelson, 2002; Shalev-Shwartz and Ben-David, 2014; Kakade et al., 2008) ensures that constrained ERM learns with optimal sample complexity, and hence so would projected GD with $O(1/\varepsilon^{2})$ iterations. Interestingly, GD on the Tikhonov-type regularized objective similarly yields optimal learning with only $O(1/\varepsilon^{2})$ iterations (Sridharan et al., 2008), indicating that the lower bounds of Amir et al. on the iteration complexity do not hold in this setting. But both projected GD and GD on the Tikhonov regularized objective are forms of explicit regularization, and so we study unregularized, unprojected GD. Our results show that (a) when uniform convergence holds, both the sample complexity and iteration complexity gaps between stochastic and deterministic optimization disappear (though stochastic optimization is of course still much more computationally efficient); (b) perhaps surprisingly, distribution-dependent uniform convergence is not only able to explain generalization of the output of GD, but it can do so better than stability; and (c) though it appears to be suboptimal, distribution-independent uniform convergence can also explain generalization, even though the norm could become very large (by making explanations based on uniform convergence inside a fixed distribution-independent class).

2 Problem Setup and Background

We study the problem of stochastic optimization from i.i.d. data samples. For that purpose, we consider the standard setting of stochastic convex optimization. A learning problem consists of a family of loss functions $f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$ defined over a fixed domain $\mathcal{W}\subseteq\mathbb{R}^{d}$ and parameterized by $z\in\mathcal{Z}$. For each $B$ we denote the domain

\mathcal{W}^{B}=\{w\in\mathbb{R}^{d}:\|w\|\leq B\}.

Our underlying assumption is that for each $z\in\mathcal{Z}$ the function $f(w;z)$ is convex and $L$-Lipschitz with respect to its first argument $w$. In this setting, a learner is provided with a sample $S=\{z_{1},\ldots,z_{n}\}$ of i.i.d. examples drawn from an unknown distribution $D$. The goal of the learner is to optimize the population risk, defined as follows:

F_{D}(w)\coloneqq\mathop{\mathbb{E}}_{z\sim D}\left[f(w;z)\right].

More formally, an algorithm $\mathcal{A}$ is said to learn the class $\mathcal{W}$ (in expectation) with sample complexity $m(\varepsilon)$ if, given an i.i.d. sample $S=\{z_{1},\ldots,z_{n}\}$ with $n\geq m(\varepsilon)$, the learner returns a solution $\mathcal{A}(S)$ that satisfies

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{D}(\mathcal{A}(S))\right]\leq\min_{w\in\mathcal{W}}F_{D}(w)+\varepsilon.

A prevalent approach in stochastic optimization is to consider, and optimize, the empirical risk over a sample $S$:

F_{S}(w)=\frac{1}{n}\sum_{i=1}^{n}f(w;z_{i}).  (1)

2.1 Gradient Descent (without projections).

A concrete and prominent way of minimizing the empirical risk in Eq. 1 is Gradient Descent (GD). GD is an iterative algorithm that runs for $T$ iterations, and has the following update rule that depends on a learning rate $\eta$:

w^{S}_{t+1}=w^{S}_{t}-\eta\nabla F_{S}(w^{S}_{t}),  (2)

where $w^{S}_{0}=0$, and $\nabla F_{S}(w^{S}_{t})$ is the (sub)gradient of $F_{S}$ at the point $w^{S}_{t}$. The output of GD is normally taken to be the average iterate,

\bar{w}^{S}=\frac{1}{T}\sum_{t=1}^{T}w^{S}_{t}.  (3)

Throughout the paper we consider the aforementioned output in Eq. 3, though we remark that our results also extend to any reasonable averaging scheme (including prevalent schemes such as tail-averaging or a random choice of $w_{t}$).
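As a point of reference for the procedure analyzed throughout the paper, the following is a minimal NumPy sketch of unprojected full-batch GD with the averaged iterate of Eq. 3. The absolute loss $f(w;(x,y))=|w\cdot x-y|$, the synthetic data, and all function names are illustrative choices for this sketch only, not prescribed by the paper.

```python
import numpy as np

def gd_average_iterate(subgrad_f, data, eta, T, dim):
    """Full-batch (sub)gradient descent on the empirical risk F_S (Eq. 2),
    started at the origin, returning the averaged iterate (Eq. 3)."""
    w = np.zeros(dim)
    iterates = []
    for _ in range(T):
        # (sub)gradient of F_S(w) = (1/n) * sum_i f(w; z_i)
        g = np.mean([subgrad_f(w, z) for z in data], axis=0)
        w = w - eta * g                      # Eq. 2, no projection
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)         # Eq. 3: average of w_1, ..., w_T

def abs_loss_subgrad(w, z):
    """Subgradient of the illustrative loss f(w; (x, y)) = |w.x - y| (convex, Lipschitz)."""
    x, y = z
    return np.sign(w @ x - y) * x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 5, 200
    X = rng.normal(size=(n, d)) / np.sqrt(d)
    y = X @ np.ones(d) + 0.1 * rng.normal(size=n)
    data = list(zip(X, y))
    T = n
    eta = 1.0 / np.sqrt(T)                   # the eta ~ 1/sqrt(T), T ~ n regime studied below
    print(gd_average_iterate(abs_loss_subgrad, data, eta, T, d))
```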

It is well known (e.g. Shalev-Shwartz and Ben-David (2014)) that the output of GD with standard initialization at the origin enjoys the following guarantee over the empirical risk, for every $B$:

F_{S}(\bar{w}^{S})\leq\min_{w\in\mathcal{W}^{B}}F_{S}(w)+\eta L^{2}+\frac{B^{2}}{\eta T}.  (4)

In contrast, as far as the population risk is concerned, it was recently shown in Amir et al. (2021b) that, for sufficiently large $d$, GD suffers from the following suboptimal rate (the result in Amir et al. (2021b) is formulated for projected GD; however, as the authors note, the proof holds verbatim for the unprojected version):

\mathop{\mathbb{E}}_{S\sim D^{n}}[F_{D}(\bar{w}^{S})]\geq\min_{w\in\mathcal{W}^{1}}F_{D}(w)+\Omega\left(\min\left\{\eta\sqrt{T}+\frac{1}{\eta T},\,1\right\}\right).  (5)

In particular, setting for concreteness $B=1$, a choice of $\eta=1/\sqrt{T}$ and $T=O(1/\varepsilon^{2})$ may lead to $\varepsilon$-error over the empirical risk, but is susceptible to overfitting. In fact, it turns out that $T=O(1/\varepsilon^{4})$ iterations are necessary and sufficient (Bassily et al., 2020) for GD (with or without projections) to obtain $O(\varepsilon)$ population error.

2.2 Generalized Linear Models.

One important class of convex learning problems is the class of generalized linear models (GLMs). In a GLM problem, $f(w;z)$ takes the following generalized linear form:

f(w;z)=\ell(w\cdot\phi(x),y),  (6)

where $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$, $\ell(a,y)$ is convex and $L$-Lipschitz w.r.t. $a$, and $\phi:\mathcal{X}\to H$ is an embedding of the domain $\mathcal{X}$ into some norm-bounded set in a (potentially infinite-dimensional) linear space.

As we mostly care about norm-bounded solutions, we will assume for concreteness that $\|\phi(x)\|\leq 1$. We also treat the value of the function at zero as constant, namely $|\ell(0,y)|\leq O(1)$. In turn, the function $f$ is $L$-Lipschitz as well, and we obtain by Lipschitzness that:

|f(w;z)|=|\ell(w\cdot\phi(x),y)|\leq O(\|w\|L).  (7)
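As a concrete instance of the form in Eq. 6, here is a small sketch (illustrative code only, assuming the hinge loss and a simple normalizing embedding; neither is prescribed by the paper) for which $L=1$ and $\|\phi(x)\|\leq 1$; it can be plugged directly into the GD sketch above.

```python
import numpy as np

def phi(x):
    """Illustrative embedding with ||phi(x)|| <= 1 (here, rescaling into the unit ball)."""
    return x / max(np.linalg.norm(x), 1.0)

def hinge(a, y):
    """ell(a, y) = max(0, 1 - y*a): convex and 1-Lipschitz in its first argument."""
    return max(0.0, 1.0 - y * a)

def glm_loss(w, z):
    """f(w; z) = ell(w . phi(x), y), the generalized linear form of Eq. 6."""
    x, y = z
    return hinge(w @ phi(x), y)

def glm_subgrad(w, z):
    """A subgradient of f(w; z) in w; its norm is at most L * ||phi(x)|| <= 1."""
    x, y = z
    return -y * phi(x) if 1.0 - y * (w @ phi(x)) > 0 else np.zeros_like(w)

if __name__ == "__main__":
    w = np.zeros(3)
    print(glm_loss(w, (np.array([1.0, 2.0, 2.0]), 1.0)))   # hinge(0, 1) = 1.0
```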

Uniform Convergence of GLMs.

One desirable property of GLMs is that, in contrast with general convex functions, they enjoy dimension-independent uniform convergence bounds. This, potentially, allows us to circumvent the bound in Eq. 5. In more detail, a seminal result due to Bartlett and Mendelson (2002) shows that, under our restrictions on $\ell$ and $\phi$, the empirical risk $F_{S}(w)$ converges uniformly to the population loss $F_{D}(w)$ as follows, for any $B$:

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{B}}\left\{F_{D}(w)-F_{S}(w)\right\}\right]\leq\frac{2LB}{\sqrt{n}}.  (8)

By a similar reasoning, one can show a stronger property of GLMs, which we will need: uniform convergence on any ball, not necessarily centered around zero. More formally, for every $u$ and $K$ let us denote

\mathcal{W}^{K}_{u}:=\{w:\|w-u\|\leq K\}.

We can then bound the expected generalization error of $w\in\mathcal{W}^{K}_{u}$ as follows:

Lemma 2.1.

Suppose $\ell(a,y)$ is $L$-Lipschitz w.r.t. $a$ and $\phi:\mathcal{X}\to H$ is an embedding into a linear space $H$ such that $\|\phi(x)\|\leq 1$. Then, for $f$ as in Eq. 6, we have for any $u$ and $K$:

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}_{u}^{K}}\left\{F_{D}(w)-F_{S}(w)\right\}\right]\leq\frac{2LK}{\sqrt{n}}.

The proof is essentially the same as the proof of Eq. 8, and exploits the Rademacher complexity of the class. For the sake of completeness we repeat it in Appendix B.
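For intuition, here is a brief sketch of the re-centering step behind Lemma 2.1 (under the lemma's assumptions; the full Rademacher complexity argument is the one in Appendix B). Writing $w=u+v$ with $\|v\|\leq K$,

\ell(w\cdot\phi(x),y)=\tilde{\ell}_{z}\big(v\cdot\phi(x)\big),\qquad\text{where}\qquad\tilde{\ell}_{z}(a):=\ell\big(u\cdot\phi(x)+a,\,y\big)\ \text{is $L$-Lipschitz in $a$,}

so by the contraction lemma the Rademacher complexity of the shifted loss class is at most $L$ times that of the centered linear class $\{x\mapsto v\cdot\phi(x):\|v\|\leq K\}$, which is at most $LK/\sqrt{n}$; standard symmetrization then yields the stated $2LK/\sqrt{n}$ bound.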

Eq. 8, and more generally Lemma 2.1, imply that any algorithm $\mathcal{A}$ whose range is restricted to a $B$-bounded ball and that has an optimization guarantee of

F_{S}(\mathcal{A}(S))-\min_{w\in\mathcal{W}^{B}}F_{S}(w)\leq O\left(\frac{LB}{\sqrt{n}}\right),

will also obtain an $O\left(LB/\sqrt{n}\right)$ convergence rate for the excess population risk. For example, projected GD is an algorithm that enjoys the aforementioned generalization bound.

3 Main Results

Theorem 3.1.

Suppose $D$ is a distribution over $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$. Let $f:\mathbb{R}^{d}\times\mathcal{Z}\rightarrow\mathbb{R}$ be a loss function of the form given in Eq. 6 such that for all $y\in\mathcal{Y}$, $\ell(\cdot,y)$ is convex and $L$-Lipschitz with respect to its first argument, and such that for all $x\in\mathcal{X}$: $\|\phi(x)\|\leq 1$. Then, for the gradient descent solution in Eq. 3 and for any $B\in\mathbb{R}$ we have:

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})\right]-\min_{w\in\mathcal{W}^{B}}F_{D}(w)\leq\tilde{O}\left(\frac{\eta L^{2}T}{n}+\frac{\eta L^{2}\sqrt{T}}{\sqrt{n}}+\eta L^{2}+\frac{B^{2}}{\eta T}\right).

In particular, setting $\eta=O(1/(L\sqrt{T}))$ and $T=n$ we get,

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})\right]\leq\inf_{w\in\mathbb{R}^{d}}\left\{F_{D}(w)+\tilde{O}\left(\frac{L(\|w\|^{2}+1)}{\sqrt{n}}\right)\right\}.

Thus, Theorem 3.1 shows that using GD, in the case of GLMs, we can learn the class $\mathcal{W}^{B}$ with sample complexity $m(\varepsilon)=\tilde{O}(1/\varepsilon^{2})$ and number of iterations $T=\tilde{O}(1/\varepsilon^{2})$.

Note that an improved rate of $\tilde{O}\left(L\|w\|/\sqrt{n}\right)$ can be obtained if we choose $\eta=O(\|w\|/(L\sqrt{T}))$, but this requires prior knowledge of the bounded domain radius. The bound we formulate above is for a learning rate that is oblivious to the norm of the optimal choice $w$.

The key technical tool in proving Theorem 3.1 is the following structural result that characterizes the output of GD over the empirical risk, and may be of independent interest:

Theorem 3.2.

Let $S=\{z_{1},\ldots,z_{n}\}$ be an i.i.d. sequence drawn from some unknown distribution $D$. Assume that $f:\mathbb{R}^{d}\times\mathcal{Z}\rightarrow\mathbb{R}$ is convex and $L$-Lipschitz with respect to its first argument, for all $z\in\mathcal{Z}$. Then, given the distribution $D$ there exists a sequence $u_{1},\ldots,u_{T}$ that depends on $D$ (but is independent of the sample $S$), such that, for any $\delta\in(0,1)$, with probability at least $1-\delta$ over $S$, the iterates of gradient descent, as depicted in Eq. 2, satisfy:

\forall t\in[T]:\quad\|w^{S}_{t}-u_{t}\|\leq O\left(\frac{\eta Lt}{\sqrt{n}}\sqrt{\log(T/\delta)}+\eta L\sqrt{t}\right).

For example, we can set $\eta=\tilde{O}(1/\sqrt{T})$ and $T=O(n)$; then with probability at least $1-O(1/n)$:

\|w^{S}_{t}-u_{t}\|=O(1).

As such, for the natural choice of learning rate and iteration complexity, we obtain that the iterates of GD remain in the vicinity of a deterministic trajectory, predetermined by the distribution to be learned. Our next result shows that this bound is tight up to logarithmic factors. We refer the reader to Appendix C for the full proof.

Theorem 3.3.

Fix $\eta,L,T$ and $n$. For any sequence $u_{1},\ldots,u_{T}$ independent of the sample $S$, there exists a convex and $L$-Lipschitz function $f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$ and a distribution $D$ over $\mathcal{Z}$, such that, with probability at least $1/20$ over $S$, the iterates of gradient descent, as depicted in Eq. 2, satisfy:

\forall t\in[T]:\quad\|w^{S}_{t}-u_{t}\|\geq\Omega\left(\frac{\eta Lt}{\sqrt{n}}+\eta L\sqrt{t}\right).

3.1 High Probability Rates

We next move to discuss results for learnability with high probability. We first remark that using Markov's inequality and standard confidence-boosting techniques, one can achieve high-probability rates at a computational and sample cost that is only logarithmic in the confidence parameter.

If we want, though, to achieve high-probability rates for the algorithm without alterations, the standard approach requires concentration bounds, which normally rely on boundedness of the predictor. This, unfortunately, is not guaranteed when we run GD without projections, as the predictions can be potentially unbounded.

However, under natural structural assumptions that are often met for the types of losses we are usually interested in, we can achieve such boundedness by a clipping procedure, which we describe next.

For this section we consider a loss function $\ell(a,y)$ such that $y\in[-b,b]$, and we assume that:

\ell(a,y)\geq\begin{cases}\ell(|y|,y)& a\geq|y|,\\ \ell(-|y|,y)& a\leq-|y|,\end{cases}\qquad\textrm{and}\qquad\forall a\in[-b,b]:\;|\ell(a,y)|\leq c.

Note that this is a very natural assumption to have in the case of convex surrogates for prediction tasks. For example, this holds for the widely used hinge loss $\ell(a,y)=\max\{0,1-ya\}$ in binary classification. Observe that under this assumption, if we consider $w\cdot\phi(x)$ as a predictor of $y$, the learner has no incentive to return a prediction that is outside of the interval $[-b,b]$.

Thus, we define the following mapping $g(a)$:

g(a)=\begin{cases}b& a\geq b,\\ a& a\in[-b,b],\\ -b& a\leq-b,\end{cases}  (9)

and consider the clipped solution $g(\bar{w}^{S}\cdot\phi(x))$, where $\bar{w}^{S}$ is the original output of GD defined in Eq. 3. We can now present our high probability result.
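For concreteness, here is a minimal sketch of the clipping step (illustrative code only; it assumes the bound $b$ is known and that the features have already been embedded via $\phi$):

```python
import numpy as np

def clip_prediction(a, b):
    """g(a) of Eq. 9: project the linear prediction onto the interval [-b, b]."""
    return float(np.clip(a, -b, b))

def clipped_loss(w, x_embedded, y, b, loss):
    """Loss of the clipped GD predictor, ell(g(w . phi(x)), y).

    `x_embedded` is assumed to already equal phi(x); `loss` is the surrogate,
    e.g. the hinge loss from the GLM sketch above.
    """
    return loss(clip_prediction(w @ x_embedded, b), y)
```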

Theorem 3.4.

Suppose $D$ is a distribution over $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ and let $\mathcal{W}^{B}=\{w\in\mathbb{R}^{d}:\|w\|\leq B\}$. Let $\ell(a,y)$ be a generalized linear loss function of the form given in Eqs. 6 and 3.1 such that for all $y\in\mathcal{Y}$, $\ell(\cdot,y)$ is convex and $L$-Lipschitz with respect to its first argument. Then, for the gradient descent solution in Eq. 3 and the function $g(a)$ in Eq. 9, we have with probability at least $1-\delta$:

\begin{align*}
\mathop{\mathbb{E}}_{(x,y)\sim D}\left[\ell\big(g\big(\bar{w}^{S}\cdot\phi(x)\big),y\big)\right]
&\leq\inf_{w\in\mathbb{R}^{d}}\left\{\mathop{\mathbb{E}}_{(x,y)\sim D}\left[\ell\big(w\cdot\phi(x),y\big)\right]+\frac{\|w\|^{2}}{\eta T}+\|w\|L\sqrt{\frac{\log(2/\delta)}{n}}\right\}\\
&\quad+O\left(\frac{\eta L^{2}T}{n}\sqrt{\log(4T/\delta)}+\frac{\eta L^{2}\sqrt{T}}{\sqrt{n}}+\eta L^{2}+c\sqrt{\frac{\log(8/\delta)}{n}}\right).
\end{align*}

In particular, setting $\eta=1/(L\sqrt{T})$ and $T=n$ we obtain that with probability $1-\delta$:

\mathop{\mathbb{E}}_{(x,y)\sim D}\left[\ell\big(g\big(\bar{w}^{S}\cdot\phi(x)\big),y\big)\right]\leq\inf_{w\in\mathcal{W}^{1}}\left\{\mathop{\mathbb{E}}_{(x,y)\sim D}\left[\ell\big(w\cdot\phi(x),y\big)\right]\right\}+\tilde{O}\left(\frac{L+c}{\sqrt{n}}\log(1/\delta)\right).

4 Comparison with non-distribution-dependent uniform convergence

In this section, we contrast our approach with what might be possible with a more traditional, non-distribution-dependent, uniform convergence argument. In particular, we consider an argument based on ensuring that the output of GD, with some stepsize $\eta$ and number of iterations $T$, is always (or with high probability) in a ball of radius $B(\eta,T)$ around the origin, namely, $\|\bar{w}^{S}\|\leq B(\eta,T)$. Hence, using Rademacher complexity bounds for GLMs, we can say that its population risk is within $O(BL/\sqrt{n})$ of its empirical risk. By selecting $\eta$ and $T$ to be very small, one can always ensure that the output of GD is in a small ball, but we also need to balance that with the empirical suboptimality of the output. That is, traditional bounds require us to find $\eta$ and $T$ that balance between the guarantee on the empirical suboptimality and the guarantee on the norm of the output. Such an approach would then result in population suboptimality that is governed by these two terms:

G(n)\coloneqq\inf_{\eta,T}\sup_{D,f}\max\left\{\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{S}(\bar{w}^{S})-\min_{w\in\mathcal{W}^{1}}F_{S}(w)\right],\;\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\frac{\|\bar{w}^{S}\|L}{\sqrt{n}}\right]\right\},  (10)

where $D$ and $f$ range over all valid distributions and functions such that $f$ is convex and Lipschitz. In particular, we would get,

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})\right]\leq\min_{w\in\mathcal{W}^{1}}F_{D}(w)+O(G(n)).
Claim.

For GLMs it holds that $G(n)=\Theta(L/n^{1/4})$.

Proof.

The upper bound follows by taking $\eta=1/(L\sqrt{T})$ and $T=\sqrt{n}$, bounding the first term using the standard GD guarantee in Eq. 4, and the second term by noting that each step increases the norm of the iterate by at most $\eta L$. Therefore, $\|\bar{w}^{S}\|\leq\eta LT$ and we obtain,

G(n)\leq\eta L^{2}+\frac{1}{\eta T}+\frac{\eta L^{2}T}{\sqrt{n}}=O\left(\frac{L}{n^{1/4}}\right).
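(Spelling out the substitution, a routine check: with $\eta=1/(L\sqrt{T})$ and $T=\sqrt{n}$, each of the three terms above equals $L/n^{1/4}$,

\eta L^{2}=\frac{L}{\sqrt{T}}=\frac{L}{n^{1/4}},\qquad\frac{1}{\eta T}=\frac{L}{\sqrt{T}}=\frac{L}{n^{1/4}},\qquad\frac{\eta L^{2}T}{\sqrt{n}}=\frac{L\sqrt{T}}{\sqrt{n}}=\frac{L}{n^{1/4}}.)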

For the lower bound we consider two deterministic functions in one dimension. When $\eta T\geq n^{1/4}/L$ we consider the deterministic objective $f(w;z)=Lw$. Clearly, the norm of the solution is $\|\bar{w}^{S}\|=\frac{1}{T}\sum_{t=1}^{T}\eta Lt\geq\eta LT/2$, thus:

\frac{\|\bar{w}^{S}\|L}{\sqrt{n}}\geq\Omega\left(\frac{L}{n^{1/4}}\right).

When $\eta T<n^{1/4}/L$ we consider the objective $f(w;z)=Lw/n^{1/4}$. Similarly, we have that $\bar{w}^{S}=-\eta L(T+1)/(2n^{1/4})$. Thus, the empirical suboptimality is:

F_{S}(\bar{w}^{S})-\min_{w\in\mathcal{W}^{1}}F_{S}(w)=-\frac{\eta L^{2}(T+1)}{2\sqrt{n}}+\frac{L}{n^{1/4}}\geq\frac{L}{2n^{1/4}}-\frac{L}{2Tn^{1/4}}\geq\Omega\left(\frac{L}{n^{1/4}}\right).

The claim shows that uniform convergence over a fixed ball around the origin can ensure learning, but only with a suboptimal sample complexity of $O(1/\varepsilon^{4})$. To obtain better bounds using this approach, additional structural assumptions on the loss or the distribution are required. For example, the work of Shamir (2020) assumes a max-margin condition and smoothness (specifically, the logistic loss) to obtain bounds on the norm of the solution, while our result applies to general Lipschitz GLMs without any distributional assumptions. It is insightful to directly contrast Eq. 10 with the approach of Section 3, which essentially entails looking at:

\tilde{G}(n)\coloneqq\inf_{\eta,T}\sup_{D,f}\max\left\{\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{S}(\bar{w}^{S})-\min_{w\in\mathcal{W}^{1}}F_{S}(w)\right],\;\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\frac{\|\bar{w}^{S}-u\|L}{\sqrt{n}}\right]\right\},

for some deterministic $u$. We can then apply a uniform concentration guarantee for a ball around it, and so $\tilde{G}$ is also sufficient to ensure:

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})\right]\leq\min_{w\in\mathcal{W}^{1}}F_{D}(w)+O(\tilde{G}(n)).

In Section 3 we show that $\tilde{G}(n)=\tilde{O}(L/\sqrt{n})$, improving on $G(n)$ and yielding optimal learning, up to logarithmic factors.

5 Technical Overview

Our main result, Theorem 3.1, establishes a generalization bound for GD without projection, and it builds upon the structural result presented in Theorem 3.2. We next outline the derivation of both results. Because the most interesting implications of our results are obtained when we choose $\eta=\tilde{O}(1/\sqrt{T})$ and $T=O(n)$, we focus the exposition on this regime, mainly to avoid cluttered notation. Hence, unless stated otherwise, we assume that $T$ and $\eta$ are fixed appropriately.

Our proof of Theorem 3.1 relies on two steps. First, through Theorem 3.2 we argue that the output of GD will be (w.h.p.) in a fixed ball that depends solely on the distribution (but not on the sample). Then, as a second step, we can apply standard uniform convergence for bounded-norm balls (see Lemma 2.1) to reason about generalization.

The second step builds on a standard generalization bound that is derived through Rademacher complexity. Also notice that the first step follows immediately from Theorem 3.2. Indeed, Theorem 3.2 argues that there exists a sequence $u_{1},\ldots,u_{T}$ such that if $w_{t}^{S}$ is the trajectory of GD over the empirical risk, we will have (w.h.p.):

\|w_{t}^{S}-u_{t}\|=O(1).

The output of GD is the averaged iterate, hence we obtain that indeed $\bar{w}^{S}$ is restricted to a ball around $u=\frac{1}{T}\sum_{t=1}^{T}u_{t}$, as required. Therefore, we are left with proving our main structural result: that the iterates of GD stay in the proximity of a trajectory $u_{1},\ldots,u_{T}$ that depends solely on the distribution (i.e. Theorem 3.2).

Toward proving Theorem 3.2, we introduce the following GD trajectory:

Gradient Descent on the population loss.

We consider an alternative gradient descent sequence that operates on the population loss $F_{D}(w)$ rather than the empirical risk $F_{S}(w)$. The update rule is then,

w^{D}_{t+1}=w^{D}_{t}-\eta\nabla F_{D}(w^{D}_{t}).  (11)

The sequence $w_{1}^{D},\ldots,w_{T}^{D}$ will serve as the sequence $u_{1},\ldots,u_{T}$.
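The following small simulation sketch illustrates how close the two trajectories tend to stay in practice (illustrative code only: the population trajectory of Eq. 11 is approximated by GD on a much larger independent sample, and the hinge-loss data model is an arbitrary choice, not one used in the paper):

```python
import numpy as np

def gd_trajectory(subgrad_f, data, eta, T, dim):
    """Iterates w_1, ..., w_T of full-batch GD (Eq. 2 / Eq. 11) on the given sample."""
    w = np.zeros(dim)
    traj = []
    for _ in range(T):
        g = np.mean([subgrad_f(w, z) for z in data], axis=0)
        w = w - eta * g
        traj.append(w.copy())
    return traj

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, n = 5, 200

    def sample(m):
        X = rng.normal(size=(m, d)) / np.sqrt(d)
        y = np.sign(X @ np.ones(d) + 0.1 * rng.normal(size=m))
        return list(zip(X, y))

    def hinge_subgrad(w, z):
        x, y = z
        return -y * x if 1.0 - y * (w @ x) > 0 else np.zeros(d)

    T, eta = n, 1.0 / np.sqrt(n)
    traj_S = gd_trajectory(hinge_subgrad, sample(n), eta, T, d)       # empirical trajectory w_t^S
    traj_D = gd_trajectory(hinge_subgrad, sample(20 * n), eta, T, d)  # crude proxy for w_t^D
    print(max(np.linalg.norm(ws - wd) for ws, wd in zip(traj_S, traj_D)))
```

In the $\eta=\tilde{O}(1/\sqrt{T})$, $T=O(n)$ regime, the printed maximal distance should remain of constant order, in line with Lemma 5.1 below.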

In the proof of Theorem 3.2 we require high-probability rates. However, for simplicity of exposition, we prove here a slightly weaker, in-expectation, result (the proof is deferred to Section A.1).

Lemma 5.1.

Let $S=\{z_{1},\ldots,z_{n}\}$ be an i.i.d. sequence drawn from some unknown distribution $D$. Assume that $f:\mathbb{R}^{d}\times\mathcal{Z}\rightarrow\mathbb{R}$ is convex and $L$-Lipschitz with respect to its first argument, for all $z\in\mathcal{Z}$. Then, the iterates of gradient descent, as depicted in Eqs. 2 and 11, satisfy:

\forall t\in[T]:\quad\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\|w^{S}_{t}-w^{D}_{t}\|\right]\leq O\left(\frac{\eta Lt}{\sqrt{n}}+\eta L\sqrt{t}\right).

This lemma bounds the distance, in expectation, between the GD trajectory over the empirical risk and the GD trajectory over the population risk. Again, we remark that to obtain the final result in Theorem 3.2, a stronger high probability version of Lemma 5.1 is required. The proof of Theorem 3.2 follows similar lines to that of Lemma 5.1 with modifications concerning specific concentration inequalities.

One crucial challenge in proving Lemma 5.1 stems from the adaptivity of the gradient sequence. In particular, notice that the sequences $w_{t}^{S},w_{t}^{D}$ are governed, respectively, by the dynamics

w_{t}^{S}=w_{t-1}^{S}-\eta\nabla F_{S}(w^{S}_{t-1}),\qquad w_{t}^{D}=w_{t-1}^{D}-\eta\nabla F_{D}(w^{D}_{t-1}).

At first glance, since $\nabla F_{S}(w)$ is an estimate of $\nabla F_{D}(w)$, it might seem that the result can be obtained by standard concentration bounds and an application of a union bound along the trajectory. Unfortunately, such a naive argument cannot work. Indeed, since $w_{t}$, for $t\geq 2$, depends on the sample $S$, $\nabla F_{S}(w_{t})$ is not necessarily an unbiased estimate of $\nabla F_{D}(w_{t})$. This is not merely a hypothetical concern: a construction in Amir et al. (2021b) demonstrates how, even after only two iterates, the gradient $\nabla F_{S}(w_{t})$ can diverge significantly from $\nabla F_{D}(w_{t})$. While these constructions are outside of the scope of GLMs, we remark that Theorem 3.2 holds for any convex and Lipschitz function. Moreover, it was shown that even for GLMs (Foster et al., 2018), the gradient estimates do not admit any dimension-independent uniform convergence bound, making it a challenge to pursue such a proof direction. To summarize, we need to prove that the two sequences remain in each other's vicinity, even though, a priori, the update steps of the two functions may be different at each iteration.

Towards this goal, we follow an analysis reminiscent of the uniform argument stability analysis of Bassily et al. (2020) for non-smooth convex losses. In a nutshell, both arguments compare the trajectory to a reference trajectory, and bound the incremental difference between the two trajectories while exploiting monotonicity of the (sub)gradient of the convex function. In the case of stability, the reference trajectory is the trajectory $w^{S^{\prime}}_{t}$ induced by a sample $S^{\prime}=(z_{1},\ldots,z^{\prime}_{i},\ldots,z_{n})$ that differs in a single example from the sample $S=(z_{1},\ldots,z_{i},\ldots,z_{n})$.

Bassily et al. showed that if $S$ and $S^{\prime}$ are two samples that differ in a single example, then:

\|w_{t}^{S}-w_{t}^{S^{\prime}}\|=O\left(\frac{\eta T}{n}+\eta\sqrt{T}\right).

In comparison, we show:

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\|w_{t}^{S}-w_{t}^{D}\|\right]=O\left(\frac{\eta T}{\sqrt{n}}+\eta\sqrt{T}\right).

We note that both results are optimal. Namely, the stability bound of Bassily et al. (2020) is the optimal stability rate, and the proximity bound we provide is the best possible bound against a fixed reference point that is independent of the sample.

Note that both bounds yield an $O(1)$ difference for the natural choice of $\eta$ and $T$. However, $O(1)$-stability guarantees are vacuous in the sense that they do not provide any interesting implication for the generalization of the algorithm. In contrast, we provide an $O(1)$-proximity guarantee to some fixed point that is completely independent of the sample $S$; this does imply generalization in our setup.

One might also suggest that our result of $O(1)$-proximity can be derived from $O(1)$-stability. However, it turns out this is not the case. For the same instance we use to lower bound the proximity by $\Omega\left(\eta T/\sqrt{n}\right)$ in Theorem 3.3, it can easily be shown that GD will be $O(\eta T/n)$-stable. Setting $\eta=O(1)$ and $T=O(n)$ will then imply $O(1)$-stability but with $\Omega(\sqrt{n})$-proximity. This asserts that our result is not a direct consequence of standard stability arguments.

References

  • Amir et al. [2021a] I. Amir, Y. Carmon, T. Koren, and R. Livni. Never go full batch (in stochastic convex optimization). In Advances in Neural Information Processing Systems, 2021a.
  • Amir et al. [2021b] I. Amir, T. Koren, and R. Livni. SGD generalizes better than GD (and regularization doesn’t help). In Conference on Learning Theory, 2021b.
  • Arora et al. [2019] S. Arora, N. Cohen, W. Hu, and Y. Luo. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems, 32, 2019.
  • Bartlett and Mendelson [2002] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2002.
  • Bassily et al. [2020] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems, 2020.
  • Blanc et al. [2020] G. Blanc, N. Gupta, G. Valiant, and P. Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Conference on learning theory, pages 483–513. PMLR, 2020.
  • Boucheron et al. [2013] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 978-0-19-953525-5. doi: 10.1093/acprof:oso/9780199535255.001.0001. URL https://doi.org/10.1093/acprof:oso/9780199535255.001.0001.
  • Chizat and Bach [2020] L. Chizat and F. Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pages 1305–1338. PMLR, 2020.
  • Dauber et al. [2020] A. Dauber, M. Feder, T. Koren, and R. Livni. Can implicit bias explain generalization? stochastic convex optimization as a case study. In Advances in Neural Information Processing Systems, 2020.
  • Feldman [2016] V. Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. In Advances in Neural Information Processing Systems, 2016.
  • Foster et al. [2018] D. J. Foster, A. Sekhari, and K. Sridharan. Uniform convergence of gradients for non-convex learning and optimization. Advances in Neural Information Processing Systems, 31, 2018.
  • Gunasekar et al. [2018a] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. In J. G. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1827–1836. PMLR, 2018a. URL http://proceedings.mlr.press/v80/gunasekar18a.html.
  • Gunasekar et al. [2018b] S. Gunasekar, J. D. Lee, N. Srebro, and D. Soudry. Implicit bias of gradient descent on linear convolutional networks. Advances in Neural Information Processing Systems, 2018b.
  • Gunasekar et al. [2018c] S. Gunasekar, B. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit regularization in matrix factorization. In 2018 Information Theory and Applications Workshop, 2018c.
  • HaoChen et al. [2021] J. Z. HaoChen, C. Wei, J. Lee, and T. Ma. Shape matters: Understanding the implicit bias of the noise covariance. In Conference on Learning Theory, pages 2315–2357. PMLR, 2021.
  • Hardt et al. [2016] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 2016.
  • Ji and Telgarsky [2019] Z. Ji and M. Telgarsky. The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory, 2019.
  • Ji et al. [2020a] Z. Ji, M. Dudík, R. E. Schapire, and M. Telgarsky. Gradient descent follows the regularization path for general losses. In Conference on Learning Theory, 2020a.
  • Ji et al. [2020b] Z. Ji, M. Dudík, R. E. Schapire, and M. Telgarsky. Gradient descent follows the regularization path for general losses. In J. D. Abernethy and S. Agarwal, editors, Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 2109–2136. PMLR, 2020b. URL http://proceedings.mlr.press/v125/ji20a.html.
  • Kakade et al. [2008] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 793–800. Curran Associates, Inc., 2008. URL https://proceedings.neurips.cc/paper/2008/hash/5b69b9cb83065d403869739ae7f0995e-Abstract.html.
  • Li et al. [2018] Y. Li, T. Ma, and H. Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pages 2–47. PMLR, 2018.
  • Li et al. [2019] Y. Li, E. X. Fang, H. Xu, and T. Zhao. Implicit bias of gradient descent based adversarial training on separable data. In International Conference on Learning Representations, 2019.
  • Lyu and Li [2019] K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2019.
  • Moroshko et al. [2020] E. Moroshko, B. E. Woodworth, S. Gunasekar, J. D. Lee, N. Srebro, and D. Soudry. Implicit bias in deep linear classification: Initialization scale vs training accuracy. Advances in neural information processing systems, 33:22182–22193, 2020.
  • Nacson et al. [2019] M. S. Nacson, S. Gunasekar, J. Lee, N. Srebro, and D. Soudry. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In International Conference on Machine Learning, pages 4683–4692. PMLR, 2019.
  • Nagarajan and Kolter [2019] V. Nagarajan and J. Z. Kolter. Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Nemirovski and Yudin [1983] A. S. Nemirovski and D. B. Yudin. Problem complexity and method efficiency in optimization. 1983.
  • Neyshabur et al. [2015] B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations (workshop track), 2015.
  • Pesme et al. [2021] S. Pesme, L. Pillaud-Vivien, and N. Flammarion. Implicit bias of sgd for diagonal linear networks: a provable benefit of stochasticity. Advances in Neural Information Processing Systems, 34, 2021.
  • Razin et al. [2021] N. Razin, A. Maman, and N. Cohen. Implicit regularization in tensor factorization. In International Conference on Machine Learning, pages 8913–8924. PMLR, 2021.
  • Sekhari et al. [2021] A. Sekhari, K. Sridharan, and S. Kale. Sgd: The role of implicit regularization, batch-size and multiple-epochs. Advances in Neural Information Processing Systems, 34, 2021.
  • Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • Shalev-Shwartz et al. [2009] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Conference on Learning Theory, 2009.
  • Shamir [2020] O. Shamir. Gradient methods never overfit on separable data. CoRR, abs/2007.00028, 2020. URL https://arxiv.org/abs/2007.00028.
  • Soudry et al. [2018] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
  • Sridharan et al. [2008] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. Advances in Neural Information Processing Systems, 2008.
  • Telgarsky [2013] M. Telgarsky. Margins, shrinkage, and boosting. In International Conference on Machine Learning, pages 307–315. PMLR, 2013.
  • Vaskevicius et al. [2020] T. Vaskevicius, V. Kanade, and P. Rebeschini. The statistical complexity of early-stopped mirror descent. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/024d2d699e6c1a82c9ba986386f4d824-Abstract.html.
  • Woodworth et al. [2020] B. Woodworth, S. Gunasekar, J. D. Lee, E. Moroshko, P. Savarese, I. Golan, D. Soudry, and N. Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635–3673. PMLR, 2020.

Appendix A Main Proofs

Recall that we define $w_{t}^{D}$ to be the $t$-th iterate when applying GD over the population risk, as depicted in Eq. 11.

A.1 Proof of Lemma 5.1

Observe that,

\begin{align*}
\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}
&=\left\|w^{S}_{t}-w^{D}_{t}-\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right)\right\|^{2}\\
&=\|w^{S}_{t}-w^{D}_{t}\|^{2}-2\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+\eta^{2}\left\|\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right\|^{2}\\
&\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}-2\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2} && (\text{$L$-Lipschitz})\\
&=\|w^{S}_{t}-w^{D}_{t}\|^{2}-2\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)\\
&\quad+2\eta\left(\nabla F_{D}(w^{D}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2}\\
&\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+2\eta\left(\nabla F_{D}(w^{D}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2},
\end{align*}

where in the last inequality we use monotonicity of the (sub)gradients of convex functions: $(\nabla F_{S}(w)-\nabla F_{S}(u))\cdot(w-u)\geq 0$ for any $w,u$. Next, applying the Cauchy-Schwarz inequality we get,

\begin{align*}
\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}
&\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+2\eta\left(\nabla F_{D}(w^{D}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2}\\
&\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+2\eta\left\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\right\|\left\|w^{S}_{t}-w^{D}_{t}\right\|+4\eta^{2}L^{2} && (\text{C.S.})\\
&\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+\eta^{2}c\left\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\right\|^{2}+\frac{1}{c}\|w^{S}_{t}-w^{D}_{t}\|^{2}+4\eta^{2}L^{2}\\
&=\left(1+\frac{1}{c}\right)\|w^{S}_{t}-w^{D}_{t}\|^{2}+\eta^{2}c\left\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\right\|^{2}+4\eta^{2}L^{2}.
\end{align*}

The last inequality follows from the observation that $2ab\leq ca^{2}+b^{2}/c$ for any $c>0$ and $a,b\geq 0$, applied with $a=\eta\|\nabla F_{S}(w_{t}^{D})-\nabla F_{D}(w_{t}^{D})\|$ and $b=\|w_{t}^{S}-w_{t}^{D}\|$.

Applying this bound recursively, and noting that $w^{S}_{0}=w^{D}_{0}$:

\begin{align*}
\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}
&\leq\sum_{t^{\prime}=0}^{t}\left(1+\frac{1}{c}\right)^{t-t^{\prime}}\left(\eta^{2}c\left\|\nabla F_{S}(w^{D}_{t^{\prime}})-\nabla F_{D}(w^{D}_{t^{\prime}})\right\|^{2}+4\eta^{2}L^{2}\right)\\
&\leq\sum_{t^{\prime}=0}^{t}\left(e\eta^{2}(t+1)\left\|\nabla F_{S}(w^{D}_{t^{\prime}})-\nabla F_{D}(w^{D}_{t^{\prime}})\right\|^{2}+4e\eta^{2}L^{2}\right),
\end{align*}

where in the last inequality we chose $c=t+1$ and used the known bound $(1+1/(t+1))^{t+1}\leq e$. Taking the square root and using the inequality $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ we conclude

\|w^{S}_{t+1}-w^{D}_{t+1}\|\leq\sqrt{e\eta^{2}(t+1)\sum_{t^{\prime}=0}^{t}\left\|\nabla F_{S}(w^{D}_{t^{\prime}})-\nabla F_{D}(w^{D}_{t^{\prime}})\right\|^{2}}+2\sqrt{e}\,\eta L\sqrt{t+1}.

We are interested in bounding $\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\|^{2}\right]$. By the definition of the empirical risk,

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\left\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\right\|^{2}\right]=\frac{1}{n^{2}}\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\left\|\sum_{i=1}^{n}\left(\nabla f(w^{D}_{t};z_{i})-\mathop{\mathbb{E}}_{z\sim D}\left[\nabla f(w^{D}_{t};z)\right]\right)\right\|^{2}\right].

Note that by Lipschitzness $\|\nabla f(w^{D}_{t};z_{i})-\mathop{\mathbb{E}}[\nabla f(w^{D}_{t};z)]\|\leq 2L$, and that $\{\nabla f(w^{D}_{t};z_{i})-\mathop{\mathbb{E}}[\nabla f(w^{D}_{t};z)]\}_{i\in[n]}$ are independent zero-mean random vectors (as $w^{D}_{t}$ is independent of $z_{i}$). Thus, we get

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\left\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\right\|^{2}\right]=\frac{1}{n^{2}}\sum_{i=1}^{n}\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\left\|\nabla f(w^{D}_{t};z_{i})-\mathop{\mathbb{E}}_{z\sim D}\left[\nabla f(w^{D}_{t};z)\right]\right\|^{2}\right]\leq\frac{4L^{2}}{n}.

Consequently, taking the expectation over the sample $S$ and using Jensen's inequality,

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\left\|w^{S}_{t+1}-w^{D}_{t+1}\right\|\right]\leq\frac{4\eta L(t+1)}{\sqrt{n}}+4\eta L\sqrt{t+1}.

A.2 Proof of Theorem 3.1

Starting with Lemma 2.1 we obtain, for any domain $\mathcal{W}^{K}_{u}$,

\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(w)-F_{S}(w)\right\}\right]\leq\frac{2LK}{\sqrt{n}}.  (12)

From Theorem 3.2, there exists a sequence $u_{1},\ldots,u_{T}$ such that with probability at least $1-\delta$,

\left\|\bar{w}^{S}-\frac{1}{T}\sum_{t=1}^{T}u_{t}\right\|\leq\frac{7\eta LT}{\sqrt{n}}\sqrt{\log(T/\delta)}+4\eta L\sqrt{T}.  (13)

Setting $u=\frac{1}{T}\sum_{t=1}^{T}u_{t}$ and $K$ to be the RHS in Eq. 13 we obtain:

\begin{align*}
\mathop{\mathbb{E}}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right]
&\leq\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(w)-F_{S}(w)\right\}\,\Big|\,\bar{w}^{S}\in\mathcal{W}^{K}_{u}\right]P(\bar{w}^{S}\in\mathcal{W}_{u}^{K})\\
&\quad+P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\sup_{S\sim D^{n}}\left|F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right|\\
&=\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(w)-F_{S}(w)\right\}\right]-\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(w)-F_{S}(w)\right\}\,\Big|\,\bar{w}^{S}\notin\mathcal{W}^{K}_{u}\right]P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\\
&\quad+P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\sup_{S\sim D^{n}}\left|F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right|\\
&\leq\mathop{\mathbb{E}}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(w)-F_{S}(w)\right\}\right]+P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\sup_{S\sim D^{n}}\sup_{w\in\mathcal{W}^{K}_{u}}\left|F_{D}(w)-F_{S}(w)\right|\\
&\quad+P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\sup_{S\sim D^{n}}\left|F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right|\\
&\leq\frac{14\eta L^{2}T}{n}\sqrt{\log(T/\delta)}+\frac{8\eta L^{2}\sqrt{T}}{\sqrt{n}}+P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\sup_{S\sim D^{n}}\sup_{w\in\mathcal{W}^{K}_{u}}\left|F_{D}(w)-F_{S}(w)\right|\\
&\quad+P(\bar{w}^{S}\notin\mathcal{W}_{u}^{K})\sup_{S\sim D^{n}}\left|F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right| && (\text{Eqs. 13 and 12})\\
&\leq\frac{14\eta L^{2}T}{n}\sqrt{\log(T/\delta)}+\frac{8\eta L^{2}\sqrt{T}}{\sqrt{n}}+O\left(\delta\eta L^{2}T\sqrt{\log(T/\delta)}\right),
\end{align*}

where we used Eq. 7 together with the facts that $\|\bar{w}^{S}\|\leq\eta LT$ and $\|u\|+K\leq O\left(\eta LT+\eta LT\sqrt{\log(T/\delta)}/\sqrt{n}\right)\leq O\left(\eta LT\sqrt{\log(T/\delta)}\right)$ to bound the second and third terms. Hence:

|FD(w¯S)FS(w¯S)||FD(w¯S)|+|FS(w¯S)|O(ηL2T),|F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})|\leq|F_{D}(\bar{w}^{S})|+|F_{S}(\bar{w}^{S})|\leq O\left(\eta L^{2}T\right),

and

supw𝒲uK\absFD(w)FS(w)supw𝒲uK\absFD(w)+supw𝒲uK\absFS(w)O\brk2ηL2Tlog(T/δ).\sup_{w\in\mathcal{W}^{K}_{u}}\abs{F_{D}(w)-F_{S}(w)}\leq\sup_{w\in\mathcal{W}^{K}_{u}}\abs{F_{D}(w)}+\sup_{w\in\mathcal{W}^{K}_{u}}\abs{F_{S}(w)}\leq O\brk 2{\eta L^{2}T\sqrt{\log(T/\delta)}}.

Next, setting δ=O(1/nT)\delta=O(1/\sqrt{nT}) we get that:

𝔼SDn\brk[s]1FD(w¯S)FS(w¯S)\displaystyle\mathop{\mathbb{E}}_{S\sim D^{n}}\brk[s]1{F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})} O(ηL2Tnlog\brknT+ηL2Tnlog\brknT).\displaystyle\leq O\left(\frac{\eta L^{2}T}{n}\sqrt{\log\brk{nT}}+\frac{\eta L^{2}\sqrt{T}}{\sqrt{n}}\sqrt{\log\brk{nT}}\right). (14)

Finally, combining Eqs. 14 and 4 we obtain that for every $w^{\star}\in\mathcal{W}^{B}$:

\mathbb{E}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})\right]-F_{D}(w^{\star})=\mathbb{E}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right]+\mathbb{E}_{S\sim D^{n}}\left[F_{S}(\bar{w}^{S})\right]-F_{D}(w^{\star})
=\mathbb{E}_{S\sim D^{n}}\left[F_{D}(\bar{w}^{S})-F_{S}(\bar{w}^{S})\right]+\mathbb{E}_{S\sim D^{n}}\left[F_{S}(\bar{w}^{S})-F_{S}(w^{\star})\right]
\leq O\left(\frac{\eta L^{2}T}{n}\sqrt{\log(nT)}+\frac{\eta L^{2}\sqrt{T}}{\sqrt{n}}\sqrt{\log\left(nT\right)}+\eta L^{2}+\frac{B^{2}}{\eta T}\right).
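As a quick sanity check on how the terms of this bound trade off, the following Python sketch (an illustration of ours, with $L=B=1$ and the tuning $T=n$, $\eta=B/(L\sqrt{T})$, which the theorem does not prescribe) evaluates the four terms and compares them to a target rate of roughly $BL\sqrt{\log n}/\sqrt{n}$.

```python
import math

def bound_terms(eta, T, n, L=1.0, B=1.0):
    """The four terms of the bound above, ignoring the absolute constants
    hidden by the O-notation."""
    log_term = math.sqrt(math.log(n * T))
    return (eta * L**2 * T / n * log_term,        # eta L^2 T sqrt(log nT) / n
            eta * L**2 * math.sqrt(T / n) * log_term,
            eta * L**2,                           # optimization term eta L^2
            B**2 / (eta * T))                     # optimization term B^2 / (eta T)

for n in (10**2, 10**4, 10**6):
    T = n                      # illustrative choice of the number of iterations
    eta = 1.0 / math.sqrt(T)   # = B / (L * sqrt(T)) with B = L = 1
    print(n, [round(term, 5) for term in bound_terms(eta, T, n)],
          "target ~", round(math.sqrt(math.log(n) / n), 5))
```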

A.3 Proof of Theorem 3.2

The proof is similar to that of Lemma 5.1, except that here we employ a concentration inequality for random variables with bounded differences. The reference sequence we consider is the sequence of GD iterates over the population risk, namely $w^{D}_{t}$ as described in Eq. 11. Observe that

\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}=\|w^{S}_{t}-w^{D}_{t}-\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right)\|^{2}
=\|w^{S}_{t}-w^{D}_{t}\|^{2}-2\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+\eta^{2}\|\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\|^{2}
\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}-2\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{D}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2} ($L$-Lipschitz)
=\|w^{S}_{t}-w^{D}_{t}\|^{2}-2\eta\left(\nabla F_{S}(w^{S}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)
\quad+2\eta\left(\nabla F_{D}(w^{D}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2}.

From convexity of $F_{S}$ we know that $\left(\nabla F_{S}(w)-\nabla F_{S}(u)\right)\cdot\left(w-u\right)\geq 0$ for any $w,u$. Therefore, applying the Cauchy-Schwarz inequality we get,

\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+2\eta\left(\nabla F_{D}(w^{D}_{t})-\nabla F_{S}(w^{D}_{t})\right)\cdot\left(w^{S}_{t}-w^{D}_{t}\right)+4\eta^{2}L^{2} (convexity)
\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+2\eta\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\|\,\|w^{S}_{t}-w^{D}_{t}\|+4\eta^{2}L^{2} (Cauchy-Schwarz)
\leq\|w^{S}_{t}-w^{D}_{t}\|^{2}+\eta^{2}c\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\|^{2}+\frac{1}{c}\|w^{S}_{t}-w^{D}_{t}\|^{2}+4\eta^{2}L^{2}
=\left(1+\frac{1}{c}\right)\|w^{S}_{t}-w^{D}_{t}\|^{2}+\eta^{2}c\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\|^{2}+4\eta^{2}L^{2}. (15)

The last inequality follows from the observation that $2ab\leq ca^{2}+b^{2}/c$ for any $c>0$ and $a,b\geq 0$, applied with $a=\eta\|\nabla F_{S}(w_{t}^{D})-\nabla F_{D}(w_{t}^{D})\|$ and $b=\|w_{t}^{S}-w_{t}^{D}\|$.

We are interested in bounding $\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\|$. To that end, we consider the following concentration inequality, which is a direct consequence of the bounded difference inequality of McDiarmid.

Theorem (Boucheron, Lugosi, and Massart [2013, Example 6.3]).

Let $X_{1},\ldots,X_{n}$ be independent zero-mean random variables such that $\|X_{i}\|\leq c_{i}/2$, and denote $v=\frac{1}{4}\sum_{i=1}^{n}c_{i}^{2}$. Then, for all $t\geq\sqrt{v}$,

\mathbb{P}\left\{\Big\|\sum_{i=1}^{n}X_{i}\Big\|>t\right\}\leq e^{-(t-\sqrt{v})^{2}/(2v)}.

Note that by Lipschitzness $\|\nabla f(w^{D}_{t};z_{i})-\mathbb{E}\left[\nabla f(w^{D}_{t};z)\right]\|\leq 2L$, and that $\left\{\nabla f(w^{D}_{t};z_{i})-\mathbb{E}\left[\nabla f(w^{D}_{t};z)\right]\right\}_{i\in[n]}$ are independent zero-mean random variables (as $w^{D}_{t}$ is independent of $z_{i}$). Thus, for $\Delta\geq\frac{2L}{\sqrt{n}}$:

\mathbb{P}\left\{\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla f(w^{D}_{t};z_{i})-\mathbb{E}\left[\nabla f(w^{D}_{t};z)\right]\Big\|>\Delta\right\}\leq e^{-(\Delta\sqrt{n}-2L)^{2}/(4L^{2})}.

This implies that with probability at least $1-\delta$,

\|\nabla F_{S}(w^{D}_{t})-\nabla F_{D}(w^{D}_{t})\|\leq\frac{4L}{\sqrt{n}}\sqrt{\log(1/\delta)}.
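As an aside, the following Python sketch (ours; it uses uniformly random norm-$L$ vectors as a stand-in for the per-example gradients, whose only property used above is the $2L$ bound) checks by Monte Carlo that deviations of the averaged gradient beyond the threshold $4L\sqrt{\log(1/\delta)}/\sqrt{n}$ are indeed rare.

```python
import numpy as np

rng = np.random.default_rng(0)
L, n, d, delta, trials = 1.0, 400, 20, 0.05, 2000

def gradient_sample(m):
    """I.i.d. stand-ins for the per-example gradients: uniformly random
    vectors of norm exactly L, so the bounded-difference condition holds."""
    g = rng.standard_normal((m, d))
    return L * g / np.linalg.norm(g, axis=1, keepdims=True)

# For this symmetric stand-in the population gradient is zero by symmetry.
threshold = 4 * L / np.sqrt(n) * np.sqrt(np.log(1 / delta))
deviations = np.array([np.linalg.norm(gradient_sample(n).mean(axis=0))
                       for _ in range(trials)])
print("empirical P[deviation > threshold] =", (deviations > threshold).mean(),
      " (should be at most delta =", delta, ")")
```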

Plugging this back into Eq. 15, we obtain with probability $1-\delta$,

\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}\leq\left(1+\frac{1}{c}\right)\|w^{S}_{t}-w^{D}_{t}\|^{2}+\frac{16\eta^{2}L^{2}c\log\left(1/\delta\right)}{n}+4\eta^{2}L^{2}.

Applying the formula recursively, and noting that $w^{S}_{0}=w^{D}_{0}$:

\|w^{S}_{t+1}-w^{D}_{t+1}\|^{2}\leq\sum_{t^{\prime}=0}^{t}\left(1+\frac{1}{c}\right)^{t^{\prime}}\left(\frac{16\eta^{2}L^{2}c\log\left(1/\delta\right)}{n}+4\eta^{2}L^{2}\right)
\leq\frac{16e\eta^{2}L^{2}(t+1)^{2}\log\left(1/\delta\right)}{n}+4e\eta^{2}L^{2}(t+1),

where in the last inequality we chose $c=t+1$ and used the bound $(1+1/t)^{t}\leq e$. Taking the square root and using the inequality $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ we have

\|w^{S}_{t+1}-w^{D}_{t+1}\|\leq 7\sqrt{\frac{\eta^{2}L^{2}(t+1)^{2}\log\left(1/\delta\right)}{n}}+4\eta L\sqrt{t+1}.

Taking a union bound over all $t\in[T]$ concludes the proof.
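To illustrate Theorem 3.2, the following Python sketch (ours, with an arbitrary choice of the logistic loss over unit-norm features so that $L=1$, and a large auxiliary sample standing in for the population $D$) runs GD on the empirical and on the proxy population risk and compares the drift $\|w^{S}_{t}-w^{D}_{t}\|$ to the high-probability envelope derived above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, T, eta, delta = 5, 200, 100, 0.1, 0.05
N_pop = 50_000                       # large sample standing in for the population D

def sample(m):
    x = rng.standard_normal((m, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # ||phi(x)|| <= 1
    y = np.sign(x @ np.ones(d) + 0.5 * rng.standard_normal(m))
    return x, y

def grad(w, x, y):
    """Gradient of the averaged logistic loss (1-Lipschitz since ||x|| <= 1)."""
    margins = y * (x @ w)
    return (-(y / (1 + np.exp(margins)))[:, None] * x).mean(axis=0)

xs, ys = sample(n)        # the empirical sample S
xp, yp = sample(N_pop)    # proxy for D

wS, wD, L = np.zeros(d), np.zeros(d), 1.0
for t in range(1, T + 1):
    wS -= eta * grad(wS, xs, ys)
    wD -= eta * grad(wD, xp, yp)
    if t % 20 == 0:
        envelope = 7 * eta * L * t * np.sqrt(np.log(t / delta) / n) + 4 * eta * L * np.sqrt(t)
        print(t, round(float(np.linalg.norm(wS - wD)), 4), "<=", round(float(envelope), 4))
```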

A.4 Proof of Theorem 3.4

Similarly to the proof in Section A.2, let us consider the domain $\mathcal{W}^{K}_{u}=\left\{w:\|w-u\|\leq K\right\}$, where we set $u=\frac{1}{T}\sum_{t=1}^{T}u_{t}$, the average of the deterministic sequence in Theorem 3.2. From the assumption in Section 3.1 it follows that for any $w$ we have $\left|\ell\left(g\left(w\boldsymbol{\cdot}\phi(x)\right),y\right)\right|\leq c$. We can also use Lemma 2.1 (applying it with $\ell\circ g$ in place of $\ell$) to obtain that

\mathbb{E}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(g\circ w)-F_{S}(g\circ w)\right\}\right]\leq\frac{2LK}{\sqrt{n}}, (16)

where we denote

F_{S}(g\circ w)=\frac{1}{n}\sum_{i=1}^{n}\ell\left(g\left(w\boldsymbol{\cdot}\phi(x_{i})\right),y_{i}\right),\quad F_{D}(g\circ w)=\mathbb{E}_{(x,y)\sim D}\left[\ell\left(g\left(w\boldsymbol{\cdot}\phi(x)\right),y\right)\right].

Next, we define

G(S)=\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(g\circ w)-F_{S}(g\circ w)\right\},

and note that for two samples $S,S^{\prime}$ that differ on a single example we have that

|G(S)-G(S^{\prime})|\leq\frac{2c}{n}.

Using the bounded difference inequality of McDiarmid [see Shalev-Shwartz and Ben-David, 2014, Lemma 26.4], we have that with probability at least $1-\delta$,

G(S)=\sup_{w\in\mathcal{W}^{K}_{u}}\left\{\mathbb{E}_{(x,y)\sim D}\left[\ell\left(g\left(w\boldsymbol{\cdot}\phi(x)\right),y\right)\right]-\frac{1}{n}\sum_{i=1}^{n}\ell\left(g\left(w\boldsymbol{\cdot}\phi(x_{i})\right),y_{i}\right)\right\}
\leq\mathbb{E}_{S\sim D^{n}}[G(S)]+c\sqrt{\frac{2\log(2/\delta)}{n}} (McDiarmid)
\leq\frac{2LK}{\sqrt{n}}+c\sqrt{\frac{2\log(2/\delta)}{n}}.

From Theorem 3.2 we have that with probability at least $1-\delta$,

\|\bar{w}^{S}-u\|\leq\frac{6\eta LT}{\sqrt{n}}\sqrt{\log\left(T/\delta\right)}+4\eta L\sqrt{T}. (17)

Taken together, and applying a union bound, we have that with probability at least $1-\delta$:

\mathbb{E}_{(x,y)\sim D}\left[\ell\left(g\left(\bar{w}^{S}\boldsymbol{\cdot}\phi(x)\right),y\right)\right]\leq\frac{1}{n}\sum_{i=1}^{n}\ell\left(g\left(\bar{w}^{S}\boldsymbol{\cdot}\phi(x_{i})\right),y_{i}\right)+\frac{12\eta L^{2}T}{n}\sqrt{\log\left(2T/\delta\right)}+\frac{8\eta L^{2}\sqrt{T}}{\sqrt{n}}+c\sqrt{\frac{2\log\left(4/\delta\right)}{n}}. (18)

Next, using the assumption in Section 3.1 and the fact that the optimization bound of Eq. 4 holds for any $B>0$:

\frac{1}{n}\sum_{i=1}^{n}\ell\left(g\left(\bar{w}^{S}\boldsymbol{\cdot}\phi(x_{i})\right),y_{i}\right)\leq\frac{1}{n}\sum_{i=1}^{n}\ell(\bar{w}^{S}\boldsymbol{\cdot}\phi(x_{i}),y_{i}) (Section 3.1)
\leq\inf_{B\in\mathbb{R}^{+}}\left\{\min_{w\in\mathcal{W}^{B}}\left\{\frac{1}{n}\sum_{i=1}^{n}\ell\left(w\boldsymbol{\cdot}\phi(x_{i}),y_{i}\right)\right\}+\eta L^{2}+\frac{B^{2}}{\eta T}\right\} (Eq. 4)
\leq\inf_{w\in\mathbb{R}^{d}}\left\{\frac{1}{n}\sum_{i=1}^{n}\ell\left(w\boldsymbol{\cdot}\phi(x_{i}),y_{i}\right)+\eta L^{2}+\frac{\|w\|^{2}}{\eta T}\right\}. (19)

Now, set $w^{\star}$ such that

\mathbb{E}_{(x,y)\sim D}\left[\ell\left(w^{\star}\boldsymbol{\cdot}\phi(x),y\right)\right]+\frac{\|w^{\star}\|^{2}}{\eta T}+\|w^{\star}\|L\sqrt{\frac{2\log(1/\delta)}{n}}\leq\inf_{w\in\mathbb{R}^{d}}\left\{\mathbb{E}_{(x,y)\sim D}\left[\ell\left(w\boldsymbol{\cdot}\phi(x),y\right)\right]+\frac{\|w\|^{2}}{\eta T}+\|w\|L\sqrt{\frac{2\log(1/\delta)}{n}}\right\}+\eta L^{2}. (20)

By the bound $|\ell(0,y)|\leq c$ and Lipschitzness we have $\left|\ell(w^{\star}\boldsymbol{\cdot}\phi(x),y)\right|\leq\|w^{\star}\|L+c$. Since $\left\{(x_{i},y_{i})\right\}_{i=1}^{n}$ are independent, it follows from Hoeffding's inequality that with probability at least $1-\delta$,

\frac{1}{n}\sum_{i=1}^{n}\ell\left(w^{\star}\boldsymbol{\cdot}\phi(x_{i}),y_{i}\right)-\mathbb{E}_{(x,y)\sim D}\left[\ell\left(w^{\star}\boldsymbol{\cdot}\phi(x),y\right)\right]\leq(\|w^{\star}\|L+c)\sqrt{\frac{2\log(1/\delta)}{n}}. (21)

Thus, we have that with probability $1-\delta$:

\frac{1}{n}\sum_{i=1}^{n}\ell\left(g\left(\bar{w}^{S}\boldsymbol{\cdot}\phi(x_{i})\right),y_{i}\right)\leq\inf_{w\in\mathbb{R}^{d}}\left\{\frac{1}{n}\sum_{i=1}^{n}\ell\left(w\boldsymbol{\cdot}\phi(x_{i}),y_{i}\right)+\eta L^{2}+\frac{\|w\|^{2}}{\eta T}\right\}
\leq\frac{1}{n}\sum_{i=1}^{n}\ell\left(w^{\star}\boldsymbol{\cdot}\phi(x_{i}),y_{i}\right)+\eta L^{2}+\frac{\|w^{\star}\|^{2}}{\eta T}
\leq\mathbb{E}_{(x,y)\sim D}\left[\ell\left(w^{\star}\boldsymbol{\cdot}\phi(x),y\right)\right]+\frac{\|w^{\star}\|^{2}}{\eta T}+(\|w^{\star}\|L+c)\sqrt{\frac{2\log(1/\delta)}{n}}+\eta L^{2} (Eq. 21)
\leq\inf_{w\in\mathbb{R}^{d}}\left\{\mathbb{E}_{(x,y)\sim D}\left[\ell\left(w\boldsymbol{\cdot}\phi(x),y\right)\right]+\frac{\|w\|^{2}}{\eta T}+(\|w\|L+c)\sqrt{\frac{2\log(1/\delta)}{n}}\right\}+2\eta L^{2}, (22)

where the last inequality follows from Eq. 20. Combining Eqs. 18 and 22 and applying a union bound we obtain the result.

Appendix B Proof of Lemma 2.1

Using the standard bound on the generalization error via the Rademacher complexity of the class (see e.g. Shalev-Shwartz and Ben-David [2014]), we have that:

\mathbb{E}_{S\sim D^{n}}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\left\{F_{D}(w)-F_{S}(w)\right\}\right]\leq 2\,\mathbb{E}_{S\sim D^{n}}\left[\mathcal{R}_{S}(f\circ\mathcal{W}^{K}_{u})\right],

where we denote the function class

f\circ\mathcal{W}^{K}_{u}=\{z\mapsto\ell(w\boldsymbol{\cdot}\phi(x),y):w\in\mathcal{W}^{K}_{u}\},

and $\mathcal{R}_{S}(f\circ\mathcal{W}^{K}_{u})$ is the Rademacher complexity of the class $f\circ\mathcal{W}^{K}_{u}$, namely:

\mathcal{R}_{S}(f\circ\mathcal{W}^{K}_{u}):=\mathbb{E}_{\sigma}\left[\sup_{h\in f\circ\mathcal{W}^{K}_{u}}\frac{1}{n}\sum_{z_{i}\in S}\sigma_{i}h(z_{i})\right], (23)

where $\sigma_{1},\ldots,\sigma_{n}$ are i.i.d. Rademacher random variables.

We next show that:

\mathcal{R}_{S}(f\circ\mathcal{W}^{K}_{u})\leq\frac{LK}{\sqrt{n}}. (24)

To show Eq. 24, we use the following well-known property of the Rademacher complexity of a class:

Lemma B.1 (contraction lemma, see Shalev-Shwartz and Ben-David [2014]).

For each $i\in[n]$, let $\rho_{i}:\mathbb{R}\rightarrow\mathbb{R}$ be a convex $L$-Lipschitz function. Let $A\subseteq\mathbb{R}^{n}$ and denote $a=\left(a_{1},\ldots,a_{n}\right)\in A$. Then, if $\sigma=\sigma_{1},\ldots,\sigma_{n}$ are i.i.d. Rademacher random variables,

\mathbb{E}_{\sigma}\left[\sup_{a\in A}\sum_{i=1}^{n}\sigma_{i}\rho_{i}(a_{i})\right]\leq L\cdot\mathbb{E}_{\sigma}\left[\sup_{a\in A}\sum_{i=1}^{n}\sigma_{i}a_{i}\right].
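As a quick numerical illustration of the contraction lemma (not used in the proof), the following Python sketch takes $\rho_{i}(a)=|a|$, which is convex and $1$-Lipschitz, over a small random finite set $A$ of our own choosing, and compares the two expectations by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, trials = 6, 20, 20_000
A = rng.standard_normal((m, n))   # a finite set A of m points in R^n

lhs, rhs = 0.0, 0.0
for _ in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=n)
    lhs += np.max(np.sum(sigma * np.abs(A), axis=1))  # rho_i = |.| applied coordinate-wise
    rhs += np.max(np.sum(sigma * A, axis=1))
print("E[sup_a sum_i sigma_i |a_i|] ~", lhs / trials,
      " <= L * E[sup_a sum_i sigma_i a_i] ~", rhs / trials)   # here L = 1
```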

We also use the following bound on the Rademacher complexity of the class of linear predictors $\{x\mapsto v\boldsymbol{\cdot}\phi(x):\|v\|\leq K\}$ over a sample $S=\{\phi(x_{1}),\ldots,\phi(x_{n})\}$ of $\ell_{2}$ $1$-bounded vectors:

\mathcal{R}_{S}(\mathcal{W}^{K}_{0})\leq K/\sqrt{n}. (25)

Next, given a sample $S=\{z_{1},\ldots,z_{n}\}$ we define $\rho_{i}(\alpha):=\ell(\alpha+u\boldsymbol{\cdot}\phi(x_{i}),y_{i})$ and we set

A:=\{(v\boldsymbol{\cdot}\phi(x_{1}),\ldots,v\boldsymbol{\cdot}\phi(x_{n})):v\in\mathcal{W}_{0}^{K}\}.

Then:

n\mathcal{R}_{S}(f\circ\mathcal{W}^{K}_{u})=\mathbb{E}_{\sigma}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\sum_{i=1}^{n}\sigma_{i}\ell(w\boldsymbol{\cdot}\phi(x_{i}),y_{i})\right]
=\mathbb{E}_{\sigma}\left[\sup_{w\in\mathcal{W}^{K}_{u}}\sum_{i=1}^{n}\sigma_{i}\ell((w-u)\boldsymbol{\cdot}\phi(x_{i})+u\boldsymbol{\cdot}\phi(x_{i}),y_{i})\right]
=\mathbb{E}_{\sigma}\left[\sup_{v\in\mathcal{W}^{K}_{0}}\sum_{i=1}^{n}\sigma_{i}\ell(v\boldsymbol{\cdot}\phi(x_{i})+u\boldsymbol{\cdot}\phi(x_{i}),y_{i})\right]
\leq L\cdot\mathbb{E}_{\sigma}\left[\sup_{v\in\mathcal{W}^{K}_{0}}\sum_{i=1}^{n}\sigma_{i}v\boldsymbol{\cdot}\phi(x_{i})\right] (Lemma B.1)
\leq LK\sqrt{n}. (Eq. 25)

Dividing by $n$ yields the proof.
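For intuition, the following Python sketch (an illustration of ours, over a random unit-norm sample) estimates the centered linear-class quantity appearing in the last step by Monte Carlo and compares it to the bound $K/\sqrt{n}$ of Eq. 25; the inner supremum is computed in closed form as $\frac{K}{n}\|\sum_{i}\sigma_{i}\phi(x_{i})\|$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, K, trials = 100, 10, 3.0, 5000

phi = rng.standard_normal((n, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)   # ||phi(x_i)|| = 1

estimates = []
for _ in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=n)
    # sup over ||v|| <= K of (1/n) sum_i sigma_i v . phi(x_i), in closed form
    estimates.append(K / n * np.linalg.norm(sigma @ phi))
print("Monte Carlo Rademacher estimate:", np.mean(estimates), "  bound K/sqrt(n):", K / np.sqrt(n))
```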

Appendix C Proof of Theorem 3.3

Our construction comprises two separate instances. We first provide lower bounds, Lemmas C.1 and C.2, on the distance between the GD iterates $w^{S}_{t},w^{S^{\prime}}_{t}$ obtained from two separate i.i.d. samples $S=\left(z_{1},\ldots,z_{n}\right)$ and $S^{\prime}=\left(z^{\prime}_{1},\ldots,z^{\prime}_{n}\right)$, respectively.

Lemma C.1.

Fix $\eta,L,T$ and $n$. Suppose $S$ and $S^{\prime}$ are i.i.d. samples drawn from $D^{n}$. There exists a convex and $L$-Lipschitz function $f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$ and a distribution $D$ over $\mathcal{Z}$, such that, if $w^{S}_{t}$ and $w^{S^{\prime}}_{t}$ are defined as in Eq. 2, then with probability at least $1/10$:

\forall t\in[T]:\quad\|w^{S}_{t}-w^{S^{\prime}}_{t}\|\geq\Omega\left(\frac{\eta Lt}{\sqrt{n}}\right).
Lemma C.2.

Fix $\eta,L,T$ and $n$. Suppose $S$ and $S^{\prime}$ are i.i.d. samples drawn from $D^{n}$. There exists a convex and $L$-Lipschitz function $f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$ and a distribution $D$ over $\mathcal{Z}$, such that, if $w^{S}_{t}$ and $w^{S^{\prime}}_{t}$ are defined as in Eq. 2, then with probability at least $1/10$:

\forall t\in[T]:\quad\|w^{S}_{t}-w^{S^{\prime}}_{t}\|\geq\Omega\left(\eta L\sqrt{t}\right).

One can then take the dominant of the two bounds and obtain that with probability at least $1/10$:

\|w^{S}_{t}-w^{S^{\prime}}_{t}\|\geq\Omega\left(\frac{\eta Lt}{\sqrt{n}}+\eta L\sqrt{t}\right). (26)

Now consider any $u_{t}$ that is independent of the samples $S$ and $S^{\prime}$. Then, by the triangle inequality, we have that

\mathbb{P}\left(\|w^{S}_{t}-w^{S^{\prime}}_{t}\|\geq a\right)\leq\mathbb{P}\left(\|w^{S}_{t}-u_{t}\|+\|w^{S^{\prime}}_{t}-u_{t}\|\geq a\right)
\leq\mathbb{P}\left(\|w^{S}_{t}-u_{t}\|\geq a/2\right)+\mathbb{P}\left(\|w^{S^{\prime}}_{t}-u_{t}\|\geq a/2\right)
=2\,\mathbb{P}\left(\|w^{S}_{t}-u_{t}\|\geq a/2\right). ($S$ and $S^{\prime}$ are i.i.d.)

Dividing by $2$ and using Eq. 26 we conclude the proof.

C.1 Proof of Lemma C.1

Suppose $f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$ takes the following form:

f(w;z)=Lw\cdot z,

where $\mathcal{W}\subseteq\mathbb{R}$ and $z=\pm 1$ with probability $1/2$ each. Given samples $S=\{z_{1},\ldots,z_{n}\}$ and $S^{\prime}=\{z^{\prime}_{1},\ldots,z^{\prime}_{n}\}$, the update rule in Eq. 2 yields

w^{S}_{t+1}=-\eta Lt\cdot\frac{1}{n}\sum_{i=1}^{n}z_{i},\quad w^{S^{\prime}}_{t+1}=-\eta Lt\cdot\frac{1}{n}\sum_{i=1}^{n}z^{\prime}_{i}.

This implies that $|w^{S}_{t+1}-w^{S^{\prime}}_{t+1}|=\eta Lt\cdot\left|\frac{1}{n}\sum_{i=1}^{n}\left(z_{i}-z^{\prime}_{i}\right)\right|$. Note that

z_{i}-z^{\prime}_{i}=\begin{cases}2&\text{w.p. }1/4,\\ 0&\text{w.p. }1/2,\\ -2&\text{w.p. }1/4.\end{cases}

Using the Berry-Esseen inequality one can show that with probability at least $1/10$:

\left|\frac{1}{n}\sum_{i=1}^{n}\left(z_{i}-z^{\prime}_{i}\right)\right|\geq\frac{1}{\sqrt{n}}.

In turn we conclude that with probability at least $1/10$,

|w^{S}_{t+1}-w^{S^{\prime}}_{t+1}|\geq\frac{\eta Lt}{\sqrt{n}}.

We remark that $f(w;z)$ can be embedded into any larger dimension, so our lower bound holds in arbitrary dimension.
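The following Python sketch (ours, with arbitrary illustrative values of $\eta,L,n,t$) simulates this one-dimensional construction and estimates the probability that the two iterate sequences are at least $\eta Lt/\sqrt{n}$ apart; empirically it comes out well above the $1/10$ claimed by the lemma.

```python
import numpy as np

rng = np.random.default_rng(3)
L, eta, n, t, trials = 1.0, 0.1, 400, 50, 10_000

hits = 0
for _ in range(trials):
    z = rng.choice([-1, 1], n)       # sample S of i.i.d. signs
    zp = rng.choice([-1, 1], n)      # independent sample S'
    wS, wSp = -eta * L * t * z.mean(), -eta * L * t * zp.mean()
    hits += abs(wS - wSp) >= eta * L * t / np.sqrt(n)
print("P[|w_t^S - w_t^S'| >= eta*L*t/sqrt(n)] ~", hits / trials)
```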

C.2 Proof of Lemma C.2

This proof relies on the same construction as Bassily et al. [2020]. The difference is that we show a lower bound between iterates over two i.i.d. samples, while their result holds for two samples that differ only on a single example. The main observation here is that, with some constant probability, the problem is reduced to that of Bassily et al. [2020]. Consider the following $f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$:

f(w;z)=-\frac{\gamma L}{2}z\,(\mathbbm{1}\boldsymbol{\cdot}w)+\frac{L}{2}\max_{i\in[d]}\left\{w_{i}-\varepsilon_{i},0\right\},

where $\mathcal{W}\subseteq\mathbb{R}^{d}$, $\mathbbm{1}$ is the all-ones vector, and

z=\begin{cases}1&\text{w.p. }1/(n+1),\\ 0&\text{w.p. }1-1/(n+1).\end{cases}

We also choose $\varepsilon_{i}$ such that $0<\varepsilon_{1}<\ldots<\varepsilon_{d}<\gamma\eta L/(2n)$, a sufficiently small $\gamma=1/(4\sqrt{dT})$, and $d>T$. Observe that for a given sample $S=\left(z_{1},\ldots,z_{n}\right)$ the empirical risk is then

F_{S}(w)=-\frac{\gamma L}{2n}\sum_{i=1}^{n}z_{i}\,(\mathbbm{1}\boldsymbol{\cdot}w)+\frac{L}{2}\max_{i\in[d]}\left\{w_{i}-\varepsilon_{i},0\right\}.

We now claim that with probability $(1-\frac{1}{n+1})^{n}$ over $S^{\prime}$, the empirical risk is given by

F_{S^{\prime}}(w)=\frac{L}{2}\max_{i\in[d]}\left\{w_{i}-\varepsilon_{i},0\right\}.

Conditioned on this event, we get that $\nabla F_{S^{\prime}}(0)=0$ and therefore $w^{S^{\prime}}_{t}=0$ for any $t\in[T]$. In addition, the event that $z_{i}=1$ for at least a single $i\in[n]$ in the sample $S$ occurs with probability $1-(1-\frac{1}{n+1})^{n}$. Since

\left(1-\frac{1}{n+1}\right)^{n}\leq\left(1-\frac{1}{2n}\right)^{n}\leq e^{-1/2},\quad\text{and}\quad\left(1-\frac{1}{n+1}\right)^{n}\geq e^{-1},

and the two samples are independent, we have that with probability at least $e^{-1}\cdot\left(1-e^{-1/2}\right)\geq 0.14\geq 1/10$ both events occur. Note that $\nabla F_{S}(w)=-\frac{\gamma L}{2n}\sum_{i=1}^{n}z_{i}\mathbbm{1}+\frac{L}{2}\nabla\max_{i\in[d]}\left\{w_{i}-\varepsilon_{i},0\right\}$. Then, applying the update rule in Eq. 2 and using the fact that $w^{S}_{0}=0$, we get

w^{S}_{1}=\frac{\gamma\eta L}{2n}\sum_{i=1}^{n}z_{i}\mathbbm{1}.

Recall that under the aforementioned event we have $\frac{1}{n}\sum_{i=1}^{n}z_{i}\geq\frac{1}{n}$. This implies that $w^{S}_{1}(i)\geq\frac{\gamma\eta L}{2n}>\varepsilon_{i}$ for any $i\in[d]$. Therefore,

w^{S}_{2}=w^{S}_{1}-\eta\nabla F_{S}(w^{S}_{1})=\frac{2\gamma\eta L}{2n}\sum_{i=1}^{n}z_{i}\mathbbm{1}-\frac{\eta L}{2}e_{1},

where $e_{i}$ is the $i$-th standard basis vector. Since $\gamma\leq 1/(4T)$ we have that $w^{S}_{2}(1)\leq\frac{\eta L}{4T}-\frac{\eta L}{2}<0$. Unrolling this dynamic recursively we obtain

w^{S}_{t+1}=\frac{\gamma\eta Lt}{2n}\sum_{i=1}^{n}z_{i}\mathbbm{1}-\frac{\eta L}{2}\sum_{s\in[t]}e_{s}.

Using the reverse triangle inequality we have,

\|w^{S}_{t}-w^{S^{\prime}}_{t}\|=\|w^{S}_{t}\| ($w^{S^{\prime}}_{t}=0$)
\geq\frac{\eta L}{2}\Big\|\sum_{s\in[t]}e_{s}\Big\|-\frac{\gamma\eta Lt}{2n}\Big|\sum_{i=1}^{n}z_{i}\Big|\,\|\mathbbm{1}\| (reverse triangle inequality)
\geq\frac{\eta L\sqrt{t}}{2}-\frac{\gamma\eta Lt}{2}\sqrt{d} ($\|\mathbbm{1}\|=\sqrt{d}$ and $|\sum_{i=1}^{n}z_{i}|\leq n$)
\geq\frac{3}{8}\eta L\sqrt{t}. ($\gamma\leq 1/(4\sqrt{dt})$)
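A minimal Python sketch of this construction (ours, with illustrative values of $\eta,L,n,T$ and one admissible choice of $\varepsilon_{i}$), conditioning on the two events analysed above: $S$ contains a single $z_{i}=1$ and $S^{\prime}$ contains only zeros, so the sketch tracks $\|w^{S}_{t}\|=\|w^{S}_{t}-w^{S^{\prime}}_{t}\|$ and compares it to $\frac{3}{8}\eta L\sqrt{t}$. The subgradient selection (lowest-index maximizer of $w_{i}-\varepsilon_{i}$) matches the dynamics described in the proof.

```python
import numpy as np

L, eta, n, T = 1.0, 0.1, 50, 40
d = T + 1
gamma = 1.0 / (4 * np.sqrt(d * T))
# 0 < eps_1 < ... < eps_d < gamma*eta*L/(2n), one admissible choice
eps = gamma * eta * L / (2 * n) * np.arange(1, d + 1) / (d + 1)

def subgrad_FS(w, z):
    """A subgradient of F_S at w: the linear part plus (L/2) e_j for the
    lowest-index maximizer j of w_i - eps_i, whenever that maximum is positive."""
    g = -gamma * L / (2 * n) * z.sum() * np.ones(d)
    j = int(np.argmax(w - eps))
    if w[j] - eps[j] > 0:
        g[j] += L / 2
    return g

z = np.zeros(n); z[0] = 1.0   # event over S: a single z_i equals 1
wS = np.zeros(d)              # under the all-zeros event over S', w_t^{S'} = 0 for all t
for t in range(1, T + 1):
    wS = wS - eta * subgrad_FS(wS, z)
    if t % 10 == 0:
        print(t, round(float(np.linalg.norm(wS)), 4), ">=", round(3 / 8 * eta * L * np.sqrt(t), 4))
```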