
Learning Trajectories are Generalization Indicators

Jingwen Fu1, Zhizheng Zhang2, Dacheng Yin3†, Yan Lu2, Nanning Zheng1‡
fu1371252069@stu.xjtu.edu.cn
{zhizzhang,yanlu}@microsoft.com
ydc@mail.ustc.edu.cn
nnzheng@mail.xjtu.edu.cn
1National Key Laboratory of Human-Machine Hybrid Augmented Intelligence,
National Engineering Research Center for Visual Information and Applications,
and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University,
2Microsoft Research Asia, 3University of Science and Technology of China
†Work done during internships at Microsoft Research Asia. ‡Corresponding Authors.
Abstract

This paper explores the connection between learning trajectories of Deep Neural Networks (DNNs) and their generalization capabilities when optimized using (stochastic) gradient descent algorithms. Instead of concentrating solely on the generalization error of the DNN post-training, we present a novel perspective for analyzing generalization error by investigating the contribution of each update step to the change in generalization error. This perspective enables a more direct comprehension of how the learning trajectory influences generalization error. Building upon this analysis, we propose a new generalization bound that incorporates more extensive trajectory information. Our proposed generalization bound depends on the complexity of the learning trajectory and the ratio between the bias and the diversity of the training set. Experimental observations reveal that our method effectively captures the generalization error throughout the training process. Furthermore, our approach can also track changes in generalization error when adjustments are made to learning rates and label noise levels. These results demonstrate that learning trajectory information is a valuable indicator of a model’s generalization capabilities.

1 Introduction

The generalizability of a Deep Neural Network (DNN) is a crucial research topic in the field of machine learning. Deep neural networks are commonly trained with a limited number of training samples while being tested on unseen samples. Despite the commonly used independent and identically distributed (i.i.d.) assumption between the training and testing sets, there often exists a varying degree of discrepancy between them in real-world applications. Generalization theories study the generalization of DNNs by modeling the gap between the empirical risk [36] and the population risk [36]. Classical uniform convergence based methods [20] adopt the complexity of the function space to analyze this generalization error. These theories find that a more complex function space results in a larger generalization error [37]. However, they are not well applicable to DNNs [32, 22]. In deep learning, the double descent phenomenon [6] exists, which shows that a larger function-space complexity may lead to a smaller generalization error. This violates the aforementioned property of uniform convergence methods and motivates further study of the generalization of DNNs.

Although the function space of DNNs is vast, not all functions within that space can be discovered by learning algorithms. Therefore, some representative works bound the generalization of DNNs based on the properties of the learning algorithm, e.g., the stability of the algorithm [11] and information-theoretic analysis [39]. These works rely on the relation between the input (i.e., training data) and output (weights of the model after training) of the learning algorithm to infer the generalization ability of the learned model. Here, the relation refers to how the change of one sample in the training data impacts the final weights of the model in stability analyses, and to the mutual information between the weights and the training data in information-theoretic analyses. Although some works [24, 11] leverage some information from the training process to understand the properties of the learning algorithm, limited trajectory information is conveyed.

The purpose of this article is to enhance our theoretical comprehension of the relation between the learning trajectory and generalization. While some recent experiments [9, 13] have shown a strong correlation between the information contained in the learning trajectory and generalization, the theoretical understanding behind this is still underexplored. By investigating the contribution of each update step to the change in generalization error, we give a new generalization bound with rich trajectory-related information. Our work can serve as a starting point for understanding those experimental discoveries.

1.1 Our Contribution

Our contributions can be summarized below:

  • We demonstrate that learning trajectory information serves as a valuable indicator of generalization abilities. With this motivation, we present a novel perspective for analyzing generalization error by investigating the contribution of each update step to the change in generalization error.

  • Utilizing the aforementioned modeling technique, we introduce a novel generalization bound for deep neural networks (DNNs). Our proposed bound provides a greater depth of trajectory-related insights than existing methods.

  • Our method effectively captures the generalization error throughout the training process, and the assumption underlying this method is confirmed by experiments. Furthermore, our approach can also track changes in generalization error when adjustments are made to learning rates and label noise levels.

2 Related Work

Generalization Theories

Existing works on studying the generalization of DNNs can be divided into three categories: methods based on the complexity of the function space, methods based on learning algorithms, and methods based on PAC-Bayes. The first category considers the generalization of DNNs from the perspective of the complexity of the function space. Many methods for measuring the complexity of the function space have been proposed, e.g., VC dimension [38], Rademacher Complexity [4], and covering number [32]. These works fail when applied to DNN models because the complexity of the function space of a DNN model is so large that the resulting bounds become trivial [40]. This motivates recent works to rethink the generalization of DNNs based on the accessible information in different learning algorithms, such as the stability of the algorithm [11] and information-theoretic analysis [39]. Among them, the stability of the algorithm [7] measures how changing one sample of the training data impacts the finally learned model weights, and the information-theory based generalization bounds [29, 30, 39] rely on the mutual information between the input (training data) and output (weights after training) of the learning algorithm. Another line of work is the PAC-Bayes method [19], which bounds the expected error rate of a classifier chosen from a posterior distribution in terms of the KL divergence from a given prior distribution. Our research modifies the conventional Rademacher Complexity to calculate the complexity of the space explored by a learning algorithm, which in turn helps derive the generalization bound. Our approach resembles the first category, as we also rely on the complexity of the function space. However, our method differs in that we focus on the function space explored by the learning trajectory rather than the entire function space. The novelty of our technique lies in addressing the dependence on training data within the function space explored by the learning trajectory, a dependency that is not permitted by the original Rademacher Complexity theory.

Generalization Analysis for SGD

Optimization plays a nonnegligible role in the success of DNNs. Therefore, many prior works study the generalization of DNNs by exploring properties of SGD, which can be summarized into two categories: stability of SGD and information-theoretic analysis. The most popular approach in the former category is to analyze the stability of the weight updates. Hardt et al. [11] is the first work to analyze the stability of SGD under smoothness and Lipschitz assumptions. Follow-up works try to discard the smoothness [5] or Lipschitz [25] assumptions to obtain more general bounds. Information-theoretic methods leverage the chain rule of KL-divergence to calculate the mutual information between the learned model weights and the data. This line of work is mainly applied to Stochastic Gradient Langevin Dynamics (SGLD), i.e., SGD with noise injected in each step of the parameter update [28]. Negrea et al. [23] and Haghifam et al. [10] improve the results using data-dependent priors. Neu et al. [24] construct an auxiliary iterative noisy process to adapt this method to the SGD scenario. In contrast to these studies, our approach utilizes more information related to learning trajectories. A more detailed comparison can be found in Table 2 and Appendix B.

3 Generalization Bound

Let us consider a supervised learning problem with an instance space 𝒵\mathcal{Z} and a parameter space 𝒲\mathcal{W}. The loss function is defined as f:𝒲×𝒵+f:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}_{+}. We denote the distribution over the instance space 𝒵\mathcal{Z} as μ\mu. The nn i.i.d. samples drawn from μ\mu are denoted as S={z1,,zn}μnS=\{z_{1},...,z_{n}\}\sim\mu^{n}. Given parameters 𝐰\mathbf{w}, the empirical risk and the population risk are denoted as FS(𝐰)1ninf(𝐰,zi)F_{S}(\mathbf{w})\triangleq\frac{1}{n}\sum_{i}^{n}f(\mathbf{w},z_{i}) and Fμ(𝐰)𝔼zμ[f(𝐰,z)]F_{\mu}(\mathbf{w})\triangleq\mathbb{E}_{z\sim\mu}[f(\mathbf{w},z)], respectively. Our work studies the generalization error of the learned model, i.e., Fμ(𝐰)FS(𝐰)F_{\mu}(\mathbf{w})-F_{S}(\mathbf{w}). For an optimization process, the learning trajectory is represented as a function 𝐉:𝒲\mathbf{J}:\mathbb{N}\to\mathcal{W}. We use 𝐉𝐭\mathbf{J_{t}} to denote the weights of the model after tt updates, where 𝐉𝐭=𝐉(t)\mathbf{J_{t}}=\mathbf{J}(t). The learning algorithm is defined as 𝒜:μn×𝐉\mathcal{A}:\mu^{n}\times\mathbb{R}\to\mathbf{J}, where the second input \mathbb{R} denotes all randomness in the algorithm 𝒜\mathcal{A}, including the randomness in initialization, batch sampling, etc. We simply use 𝒜(S)\mathcal{A}(S) to represent a random choice for the second input term. Given two functions U,VU,V, tU(t)dV(t)tU(t)(V(t+1)V(t))\int_{t}U(t)\mathrm{d}V(t)\triangleq\sum_{t}U(t)(V(t+1)-V(t)), and we use \|\cdot\| to denote the L2 norm. If SS is a set, then |S||S| denotes the number of elements in SS. 𝔼t\mathbb{E}_{t} denotes taking the expectation conditioned on {𝐉𝐢|it}\{\mathbf{J_{i}}|i\leq t\}.

Let the mini-batch BB be a random subset sampled from the dataset SS, with |B|=b|B|=b. The averaged function value over mini-batch BB is denoted as FB(𝐰)1bzBf(𝐰,z)F_{B}(\mathbf{w})\triangleq\frac{1}{b}\sum_{z\in B}f(\mathbf{w},z). The parameter update with gradient descent can be formulated as:

𝐉𝐭+𝟏=𝐉𝐭ηtFS(𝐉𝐭).\mathbf{J_{t+1}}=\mathbf{J_{t}}-\eta_{t}\nabla F_{S}(\mathbf{J_{t}}). (1)

where ηt\eta_{t} is the learning rate for the tt-th update. The parameter update with stochastic gradient descent is:

𝐉𝐭+𝟏=𝐉𝐭ηtFB(𝐉𝐭).\mathbf{J_{t+1}}=\mathbf{J_{t}}-\eta_{t}\nabla F_{B}(\mathbf{J_{t}}). (2)

Let ϵ(𝐰)FS(𝐰)FB(𝐰)\epsilon(\mathbf{w})\triangleq\nabla F_{S}(\mathbf{w})-\nabla F_{B}(\mathbf{w}) be the gradient noise in mini-batch updating, where 𝐰\mathbf{w} is the weights of a DNN. Then we can transform Equation (2) into:

𝐉𝐭+𝟏=𝐉𝐭ηtFS(𝐉𝐭)+ηtϵ(𝐉𝐭).\mathbf{J_{t+1}}=\mathbf{J_{t}}-\eta_{t}\nabla F_{S}(\mathbf{J_{t}})+\eta_{t}\epsilon(\mathbf{J_{t}}). (3)

The covariance of the gradients over the entire dataset SS can be calculated as:

Σ(𝐰)1ni=1nf(𝐰,zi)f(𝐰,zi)TFS(𝐰)FS(𝐰)T.\Sigma(\mathbf{w})\triangleq\frac{1}{n}\sum_{i=1}^{n}\nabla f(\mathbf{w},z_{i})\nabla f(\mathbf{w},z_{i})^{\mathrm{T}}-\nabla F_{S}(\mathbf{w})\nabla F_{S}(\mathbf{w})^{\mathrm{T}}. (4)

Therefore, the covariance of the gradient noise ϵ(𝐰)\epsilon(\mathbf{w}) is:

C(𝐰)nbb(n1)Σ(𝐰).C(\mathbf{w})\triangleq\frac{n-b}{b(n-1)}\Sigma(\mathbf{w}). (5)

Since for any 𝐰\mathbf{w} we have 𝔼(ϵ(𝐰))=0\mathbb{E}(\epsilon(\mathbf{w}))\!=\!0, we can represent ϵ(𝐰)\epsilon(\mathbf{w}) as C(𝐰)12ϵC(\mathbf{w})^{\frac{1}{2}}\epsilon^{\prime}, where ϵ\epsilon^{\prime} is a random variable with zero mean and identity covariance matrix. Here, ϵ\epsilon^{\prime} can follow any distribution, including the Gaussian distribution [12] and the 𝒮α𝒮\mathcal{S}\alpha\mathcal{S} distribution [34].
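To make this decomposition concrete, the following is a minimal NumPy sketch (not the authors' code) that computes per-sample gradients for a linear least-squares model, forms Σ(𝐰) as in Equation (4) and C(𝐰) as in Equation (5), and empirically checks that the mini-batch noise ϵ(𝐰) has zero mean and covariance C(𝐰). The model, dimensions, and sample counts are illustrative assumptions.

```python
# Minimal sketch of the gradient-noise decomposition in Eqs. (3)-(5).
# The linear least-squares model and all sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 100, 5, 10                       # dataset size, dimension, batch size
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

# Per-sample gradients of f(w, z_i) = 0.5 * (w^T x_i - y_i)^2.
G = (X @ w - y)[:, None] * X               # row i is grad f(w, z_i)
grad_S = G.mean(axis=0)                    # full-batch gradient, grad F_S(w)

# Gradient covariance over S, Eq. (4).
Sigma = G.T @ G / n - np.outer(grad_S, grad_S)

# Covariance of the mini-batch noise eps(w) = grad F_S(w) - grad F_B(w), Eq. (5).
C = (n - b) / (b * (n - 1)) * Sigma

# Monte Carlo check: the noise has mean ~0 and covariance ~C.
eps = np.array([grad_S - G[rng.choice(n, b, replace=False)].mean(axis=0)
                for _ in range(20000)])
print(np.abs(eps.mean(axis=0)).max())              # close to 0
print(np.abs(np.cov(eps.T, bias=True) - C).max())  # small deviation
```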

The primary objective of our work is to propose a new generalization bound that incorporates more comprehensive trajectory-related information. The key aspects of this information are: 1) it should be adaptive and change according to different learning trajectories; 2) it should not rely on extra information about the data distribution μ\mu beyond the training data SS.

3.1 Investigating generalization along the learning trajectory

As noted above, the learning trajectory is represented by a function 𝐉:𝒲\mathbf{J}:\mathbb{N}\to\mathcal{W}, which defines the relationship between the model weights and the training timestep tt. 𝐉t\mathbf{J}_{t} denotes the model weights after tt updates. Note that 𝐉\mathbf{J} depends on SS, because it comes from the equation 𝐉=𝒜(S)\mathbf{J}=\mathcal{A}(S). We simply use f(𝐉𝐭):𝒵+f(\mathbf{J_{t}}):\mathcal{Z}\rightarrow\mathbb{R}_{+} to represent the function after the tt-th update. Our goal is to analyze the generalization error, i.e., Fμ(𝐉𝐓)FS(𝐉𝐓)F_{\mu}(\mathbf{\mathbf{J_{T}}})-F_{S}(\mathbf{J_{T}}), where TT denotes the total number of training steps.

We reformulate the function corresponding to the finally obtained model as:

f(𝐉𝐓)=f(𝐉𝟎)+t=1T(f(𝐉𝐭)f(𝐉𝐭𝟏)).f(\mathbf{J_{T}})=f(\mathbf{J_{0}})+\sum_{t=1}^{T}(f(\mathbf{J_{t}})-f(\mathbf{J_{t-1}})). (6)

Therefore, the generalization error can be rewritten as:

Fμ(𝐉𝐓)FS(𝐉𝐓)=Fμ(𝐉𝟎)FS(𝐉𝟎)(i)+t=1T[(Fμ(𝐉𝐭)Fμ(𝐉𝐭𝟏))(FS(𝐉𝐭)FS(𝐉𝐭𝟏))](ii)t.F_{\mu}(\mathbf{\mathbf{J_{T}}})-F_{S}(\mathbf{J_{T}})=\underbrace{F_{\mu}(\mathbf{J_{0}})-F_{S}(\mathbf{J_{0}})}_{(i)}+\sum_{t=1}^{T}\underbrace{[(F_{\mu}(\mathbf{J_{t}})-F_{\mu}(\mathbf{J_{t-1}}))-(F_{S}(\mathbf{J_{t}})-F_{S}(\mathbf{J_{t-1}}))]}_{(ii)_{t}}. (7)

In this form, we divide the generalization error into two parts: (i)(i) is the generalization error before training, and (ii)t(ii)_{t} is the generalization error caused by the tt-th update.

Typically, 𝐉𝟎\mathbf{J_{0}} is independent of the data SS. Therefore, we have 𝔼(i)=0\mathbb{E}(i)=0. Combining this with Equation (7), we have:

𝔼[Fμ(𝐉𝐓)FS(𝐉𝐓)]=𝔼t=1T(ii)t.\mathbb{E}[F_{\mu}(\mathbf{\mathbf{J_{T}}})-F_{S}(\mathbf{J_{T}})]=\mathbb{E}\sum_{t=1}^{T}(ii)_{t}. (8)

Analyzing the generalization error after training can thus be transformed into analyzing the increase in generalization error at each update. This is a straightforward yet quite different way to extract information from the learning trajectory compared with previous works. Below, we list the two techniques most commonly used by previous works to extract information from the learning trajectory; a small sketch of our decomposition follows this list.

  • (T1). This method leverages the chain rule of mutual information to calculate an upper bound on the mutual information between 𝐉𝐓\mathbf{J_{T}} and the training data SS, i.e., I(S;𝐉𝐓)I(S;𝐉𝐭𝐓)t=0TI(S;𝐉𝐭|𝐉𝐢<𝐭)I(S;\mathbf{J_{T}})\leq I(S;\mathbf{J_{t\leq T}})\leq\sum_{t=0}^{T}I(S;\mathbf{J_{t}}|\mathbf{J_{i<t}}). I(S;𝐉𝐓)I(S;\mathbf{J_{T}}) is the quantity of interest in these theories.

  • (T2). This method assumes another dataset SS^{\prime}, obtained by replacing one sample in SS with another sample drawn from the distribution μ\mu. 𝐉\mathbf{J^{\prime}} is the learning trajectory trained from SS^{\prime} with the same randomness value as 𝐉\mathbf{J}. Denote Δk𝐉𝐤𝐉𝐤\Delta_{k}\triangleq\|\mathbf{J_{k}}-\mathbf{J^{\prime}_{k}}\| and assume Δ0=0\Delta_{0}=0. The quantity of interest is ΔT\Delta_{T}, whose upper bound is calculated by iteratively applying the formula Δkck1Δk1+ek1\Delta_{k}\leq c_{k-1}\Delta_{k-1}+e_{k-1}.

(T1) is commonly utilized in analyzing Stochastic Gradient Langevin Dynamics (SGLD) [18, 2, 28], while (T2) is frequently employed in stability-based works for analyzing SGD [11, 15, 5]. Our method offers several benefits: 1) we directly focus on the change in generalization error rather than intermediate values such as Δk\Delta_{k} and I(S;𝐉𝐭|𝐉𝐢<𝐭)I(S;\mathbf{J_{t}}|\mathbf{J_{i<t}}); 2) the generalization error is exactly the sum of the (ii)t(ii)_{t}, while (T1) and (T2) bound I(S;𝐉𝐓)I(S;\mathbf{J_{T}}) and ΔT\Delta_{T} only from above; and 3) from this perspective, we can extract more in-depth trajectory-related information. For (T1), the computation of I(S;𝐉𝐭|𝐉𝐢<𝐭)I(S;\mathbf{J_{t}}|\mathbf{J_{i<t}}) primarily involves the information of Fμ(𝐉𝐭)\nabla F_{\mu}(\mathbf{J_{t}}), which is inaccessible to us (details in Appendix D and Neu et al. [24]). (T2) faces the challenge that only the upper bounds of ckc_{k} and eke_{k} can be calculated, and these upper bounds remain unchanged across different learning trajectories. Consequently, both (T1) and (T2) have difficulty conveying meaningful trajectory information.
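To illustrate the decomposition in Equations (7) and (8), the following is a minimal NumPy sketch (an illustrative assumption, not the authors' code) that tracks the per-step increments (ii)_t during gradient descent on a toy least-squares problem, using a large held-out set as a proxy for μ, and checks that (i) plus the sum of the increments telescopes to the final generalization error.

```python
# Minimal sketch of the per-step decomposition in Eqs. (7)-(8). A large
# held-out set stands in for the population mu; all sizes are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d, eta, T = 50, 5, 0.05, 200
Xs, ys = rng.normal(size=(n, d)), rng.normal(size=n)
Xp, yp = rng.normal(size=(20000, d)), rng.normal(size=20000)   # proxy for mu

def risk(w, X, y):                       # F_S(w) or the proxy for F_mu(w)
    return 0.5 * np.mean((X @ w - y) ** 2)

w = np.zeros(d)
gen = risk(w, Xp, yp) - risk(w, Xs, ys)  # term (i): error before training
term_i, increments = gen, []
for t in range(T):
    w = w - eta * Xs.T @ (Xs @ w - ys) / n        # GD step, Eq. (1)
    gen_next = risk(w, Xp, yp) - risk(w, Xs, ys)
    increments.append(gen_next - gen)             # term (ii)_t
    gen = gen_next

# Telescoping check: (i) + sum_t (ii)_t equals the final generalization error.
print(term_i + sum(increments))
print(risk(w, Xp, yp) - risk(w, Xs, ys))
```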

Table 1: Comparison of the generalization bounds with stability-based methods for SGD learning algorithms. T.R.T is an abbreviation for the term related to the trajectory, defined as a term that 1) varies based on different learning trajectories, and 2) does not rely on extra information about the data distribution μ\mu beyond the training data SS. We can infer that the proposed bound incorporates a greater amount of trajectory-related information. Other related works are discussed in Appendix D.
Method β\beta-Smooth LL-Lipschitz Convex Small LR Other Conditions Generalization Bound T.R.T
Hardt et al. [11] 2L2nt=1Tηt\frac{2L^{2}}{n}\sum_{t=1}^{T}\eta_{t} t=1Tηt\sum_{t=1}^{T}\eta_{t}
Hardt et al. [11] f[0,1],ηt<ctf\in[0,1],\eta_{t}<\frac{c}{t} 𝒪(1nL2βc+1Tβcβc+1)\mathcal{O}(\frac{1}{n}L^{\frac{2}{\beta c+1}}T^{\frac{\beta c}{\beta c+1}}) Tβcβc+1T^{\frac{\beta c}{\beta c+1}}
Zhang et al. [42] T>n,ηt=cβtT>n,\eta_{t}=\frac{c}{\beta t} 16L2Tcn1+c\frac{16L^{2}T^{c}}{n^{1+c}} TcT^{c}
Zhou et al. [43] 𝔼zSf(𝐰,z)FS(𝐰)2B2\mathbb{E}_{z\in S}\|\nabla f(\mathbf{w},z)-\nabla F_{S}(\mathbf{w})\|^{2}\leq B^{2} 𝒪(1nL2βFμ(𝐉𝟎)+12𝔼B2logT)\mathcal{O}(\sqrt{\frac{1}{n}L\sqrt{2\beta F_{\mu}(\mathbf{J_{0}})+\frac{1}{2}\mathbb{E}B^{2}}\log T}) logT\sqrt{\log T}
Bassily et al. [5] Projected SGD 2L2t=1T1ηt2+4L2nt=1T1ηt2L^{2}\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}+\frac{4L^{2}}{n}\sum_{t=1}^{T-1}\eta_{t} t=1T1ηt\sum_{t=1}^{T-1}\eta_{t}
Lei and Ying [16] Projected SGD 𝒪((1+Tn2)t=1Tηt2)\mathcal{O}((1+\frac{T}{n^{2}})\sum_{t=1}^{T}\eta_{t}^{2}) Tt=1Tηt2T\sum_{t=1}^{T}\eta_{t}^{2}
Ours (Theorem 3.6) Fμ(𝐰)γFS(𝐰)\|\nabla F_{\mu}(\mathbf{w})\|\leq\gamma\|\nabla F_{S}(\mathbf{w})\| Theorem 3.6 t𝑑FS(𝐉𝐭)1+Tr(Σ(𝐉𝐭))FS(𝐉𝐭)2\int_{t}dF_{S}(\mathbf{J_{t}})\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|^{2}}}

3.2 A New Generalization Bound

In this section, we introduce the generalization bound based on our aforementioned modelling. Let us start with the definitions of commonly used assumptions.

Definition 3.1.

The function ff is LL-Lipschitz if for all 𝐰𝟏,𝐰𝟐𝒲\mathbf{w_{1}},\mathbf{w_{2}}\in\mathcal{W} and all z𝒵z\in\mathcal{Z}, we have f(𝐰𝟏,z)f(𝐰𝟐,z)L𝐰𝟏𝐰𝟐\|f(\mathbf{w_{1}},z)-f(\mathbf{w_{2}},z)\|\leq L\|\mathbf{w_{1}}-\mathbf{w_{2}}\|.

Definition 3.2.

The function ff is β\beta-smooth if for all 𝐰𝟏,𝐰𝟐𝒲\mathbf{w_{1}},\mathbf{w_{2}}\in\mathcal{W} and all z𝒵z\in\mathcal{Z}, we have f(𝐰𝟏,z)f(𝐰𝟐,z)β𝐰𝟏𝐰𝟐\|\nabla f(\mathbf{w_{1}},z)-\nabla f(\mathbf{w_{2}},z)\|\leq\beta\|\mathbf{w_{1}}-\mathbf{w_{2}}\|.

Definition 3.3.

The function ff is convex if for all 𝐰𝟏,𝐰𝟐𝒲\mathbf{w_{1}},\mathbf{w_{2}}\in\mathcal{W} and all z𝒵z\in\mathcal{Z}, we have f(𝐰𝟏,z)f(𝐰𝟐,z)+(𝐰𝟏𝐰𝟐)Tf(𝐰𝟐,z)f(\mathbf{w_{1}},z)\geq f(\mathbf{w_{2}},z)+(\mathbf{w_{1}}-\mathbf{w_{2}})^{\mathrm{T}}\nabla f(\mathbf{w_{2}},z).

Here, the LL-Lipschitz assumption implies that f(𝐰,z)L\|\nabla f(\mathbf{w},z)\|\leq L holds. The β\beta-smooth assumption indicates that the largest eigenvalue of 2f(𝐰,z)\nabla^{2}f(\mathbf{w},z) is smaller than β\beta. Convexity indicates that the smallest eigenvalue of 2f(𝐰,z)\nabla^{2}f(\mathbf{w},z) is nonnegative. These assumptions constrain the gradients and Hessian matrices on both the training data and the unseen samples in the test set. Since the values of gradients and Hessian matrices on the training set are accessible, the key role of these assumptions is to deliver knowledge about the unseen samples in the test set.

We now introduce our new generalization bound, starting with the assumption it requires.

Assumption 3.4.

There exists a value γ\gamma such that for all 𝐰{𝐉𝐭|t}\mathbf{w}\in\{\mathbf{J_{t}}|t\in\mathbb{N}\}, we have Fμ(𝐰)γFS(𝐰)\|\nabla F_{\mu}(\mathbf{w})\|\leq\gamma\|\nabla F_{S}(\mathbf{w})\|.

Remark 3.5.

Assumption 3.4 restricts the norm of the population gradient Fμ(𝐰)\nabla F_{\mu}(\mathbf{w}). This assumption is easily satisfied when nn is large, because we have limnFS(𝐰)=Fμ(𝐰)\lim\limits_{n\rightarrow\infty}\|\nabla F_{S}(\mathbf{w})\|=\|\nabla F_{\mu}(\mathbf{w})\|. When nn is not large enough, the assumption still holds before SGD enters the neighbourhood of a convergence point. For the case where SGD enters the neighbourhood of a convergence point, we give a relaxed assumption and its corresponding generalization bound in Appendix B. According to Zhang et al. [41], this case rarely happens in practice. Section 4 gives experiments that explore the assumption.

Theorem 3.6.

Under Assumption 3.4, given SμnS\sim\mu^{n}, let 𝐉=𝒜(S)\mathbf{J}=\mathcal{A}(S), where 𝒜\mathcal{A} denotes the SGD or GD algorithm trained for TT steps. Then we have:

𝔼[Fμ(𝐉𝐓)FS(𝐉𝐓)]2γ𝕍m𝔼tdFS(𝐉𝐭)n1+Tr(Σ(𝐉𝐭))FS(𝐉𝐭)22+𝒪(ηm)\mathbb{E}[F_{\mu}(\mathbf{J_{T}})-F_{S}(\mathbf{J_{T}})]\leq-2\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}+\mathcal{O}(\eta_{m}) (9)

where 𝕍(𝐰)=FS(𝐰)𝔼US|U|nFU(𝐰)n|U|nFS/U(𝐰)\mathbb{V}(\mathbf{w})=\frac{\|\nabla F_{S}(\mathbf{w})\|}{\mathbb{E}_{U\subset S}\|\frac{|U|}{n}\nabla F_{U}(\mathbf{w})-\frac{n-|U|}{n}\nabla F_{S/U}(\mathbf{w})\|}, 𝕍m=maxt𝕍(𝐉𝐭)\mathbb{V}_{m}\!=\!\max\limits_{t}\mathbb{V}(\mathbf{J_{t}}), γ=max{1,maxUS;t|U|FU(𝐉𝐭)nFS(𝐉𝐭)}γ\gamma^{\prime}\!=\!\max\{1,\max\limits_{U\subset S;t}\frac{|U|\|\nabla F_{U}(\mathbf{J_{t}})\|}{n\|\nabla F_{S}(\mathbf{J_{t}})\|}\}\gamma and ηmmaxtηt\eta_{m}\triangleq\max\limits_{t}\eta_{t}.

Remark 3.7.

Our generalization bound mainly relies on information from gradients. 𝕍(𝐰)\mathbb{V}(\mathbf{w}) is related to the variance of the gradient: when the variance of the gradients across different samples in the training set SS is large, the value of 𝕍(𝐰)\mathbb{V}(\mathbf{w}) is small, and vice versa. Note that we have |U|<n|U|<n due to USU\subset S. Our bound becomes trivial if 𝔼US|U|nFU(𝐰)n|U|nFS/U(𝐰)=0\mathbb{E}_{U\subset S}\|\frac{|U|}{n}\nabla F_{U}(\mathbf{w})-\frac{n-|U|}{n}\nabla F_{S/U}(\mathbf{w})\|=0. This rarely happens in practice, because it requires that for all USU\subset S, we have |U|FU(𝐰)=(n|U|)FS/U(𝐰)|U|\nabla F_{U}(\mathbf{w})=(n-|U|)\nabla F_{S/U}(\mathbf{w}). An example for the linear regression case is given in the appendix. We also give a relaxed-assumption version of this theorem in Appendix B. The generalization bound provides a clear insight into how the reduction of training loss leads to an increase in generalization error.
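As an illustration, 𝕍(𝐰) can be estimated by Monte Carlo over random subsets U. The sketch below reuses a per-sample gradient matrix G (row i equal to ∇f(𝐰, z_i), as in the earlier snippet); since the theorem does not pin down the distribution of U, uniform subsets of size n/2 are an illustrative assumption here.

```python
# Minimal Monte Carlo sketch of V(w) from Theorem 3.6. G has shape (n, d)
# with row i equal to grad f(w, z_i). Uniform subsets U of size n // 2 are
# an illustrative assumption; the theorem does not fix this distribution.
import numpy as np

def V_of_w(G, rng, trials=2000):
    n = G.shape[0]
    grad_S = G.mean(axis=0)
    vals = []
    for _ in range(trials):
        mask = np.zeros(n, dtype=bool)
        mask[rng.choice(n, n // 2, replace=False)] = True
        g_U = G[mask].mean(axis=0)                 # grad F_U(w)
        g_rest = G[~mask].mean(axis=0)             # grad F_{S \ U}(w)
        k = n // 2
        vals.append(np.linalg.norm(k / n * g_U - (n - k) / n * g_rest))
    return np.linalg.norm(grad_S) / np.mean(vals)  # denominator ~ E_U ||.||
```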

Proof Sketch

The proof of this theorem is given in Appendix A. Here, we give a sketch of the proof.

Step 1

Beginning with Equation (8), we decompose Fμ(𝐉𝐓)FS(𝐉𝐓)F_{\mu}(\mathbf{\mathbf{J_{T}}})-F_{S}(\mathbf{J_{T}}) into a linear part (genlin(𝐉𝐓)\operatorname{gen}^{lin}(\mathbf{J_{T}})) and a nonlinear part (gennl(𝐉𝐓)\operatorname{gen}^{nl}(\mathbf{J_{T}})). We have genlin(𝐉𝐓)=t=1T(ii)tlin\operatorname{gen}^{lin}(\mathbf{J_{T}})=\sum_{t=1}^{T}(ii)^{lin}_{t}, where (ii)tlin(𝐉𝐭𝐉𝐭𝟏)T(Fμ(𝐉𝐭𝟏)FS(𝐉𝐭𝟏))(ii)^{lin}_{t}\triangleq(\mathbf{J_{t}}-\mathbf{J_{t-1}})^{\mathrm{T}}(\nabla F_{\mu}(\mathbf{J_{t-1}})-\nabla F_{S}(\mathbf{J_{t-1}})). The nonlinear part is gennl(𝐉𝐓)=Fμ(𝐉𝐓)FS(𝐉𝐓)genlin(𝐉𝐓)\operatorname{gen}^{nl}(\mathbf{J_{T}})=F_{\mu}(\mathbf{\mathbf{J_{T}}})-F_{S}(\mathbf{J_{T}})-\operatorname{gen}^{lin}(\mathbf{J_{T}}). We tackle these two parts differently. Here, we focus on analyzing genlin(𝐉𝐓)\operatorname{gen}^{lin}(\mathbf{J_{T}}) because it dominates under small learning rates. A detailed discussion of gennl(𝐉𝐓)\operatorname{gen}^{nl}(\mathbf{J_{T}}) is given in the appendix (Proposition A.1 and Subsection C.3).

Step 2

We construct the additive linear space 𝐉|S{t=0T1𝐰𝐭Tf(𝐉𝐭)|𝐰𝐭Δt}\mathcal{L}_{\mathbf{J}|S}\triangleq\{\sum_{t=0}^{T-1}\mathbf{w_{t}}^{\mathrm{T}}\nabla f(\mathbf{J_{t}})\ |\ \|\mathbf{w_{t}}\|\leq\Delta_{t}\}, where ΔtηtFS(𝐉𝐭)\Delta_{t}\triangleq\|\eta_{t}\nabla F_{S}(\mathbf{J_{t}})\|. Then 𝔼[genlin(𝐉𝐓)]2γ𝕍m𝔼RS(𝐉|S)\mathbb{E}[\operatorname{gen}^{lin}(\mathbf{J_{T}})]\leq 2\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}R_{S}(\mathcal{L}_{\mathbf{J}|S}), where RS(𝐉|S)𝔼σsuph𝐉|S(1ni=1nσih(zi))R_{S}(\mathcal{L}_{\mathbf{J}|S})\triangleq\mathbb{E}_{\sigma}\sup\limits_{h\in\mathcal{L}_{\mathbf{J}|S}}(\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}h(z_{i})).

Step 3

Finally, we compute the upper bound of RS(𝐉|S)R_{S}(\mathcal{L}_{\mathbf{J}|S}), which follows the same techniques as standard Rademacher Complexity theory. Combining this with Proposition A.1 establishes the theorem.

Technical Novelty

Directly applying Rademacher complexity to calculate the generalization error bound fails because the large complexity of a neural network’s function space leads to trivial bounds [40]. In this work, we want to calculate the complexity of the function space that can be explored during the training process. However, there are two challenges. First, the trajectory of a neural network is a "line" rather than a function space whose complexity can be measured. To solve this problem, we introduce the additive linear space 𝐉|S\mathcal{L}_{\mathbf{J}|S}. This space contains the local information of the learning trajectory and can serve as a pseudo function space. Second, the function space 𝐉|S\mathcal{L}_{\mathbf{J}|S} depends on the sample set SS, while the theory of Rademacher complexity requires that the function space be independent of the training samples. To decouple this dependence, we adapt the Rademacher complexity and obtain 𝔼[genlin(𝐉𝐓)]2γ𝕍m𝔼RS(𝐉|S)\mathbb{E}[\operatorname{gen}^{lin}(\mathbf{J_{T}})]\leq 2\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}R_{S}(\mathcal{L}_{\mathbf{J}|S}), where γ\gamma^{\prime} is introduced to decouple the dependence mentioned above.

Compared with Previous Works

In Table 1, we present a summary of stability-based methods; other methods are outlined in Appendix D. We focus on generalization bounds from previous works that eliminate terms depending on extra information about the data distribution μ\mu, apart from the training data SS, using assumptions such as smoothness or Lipschitz continuity. Analyzing Table 1 reveals that most prior works depend primarily on the learning rate η\eta and the total number of training steps TT. This implies that two runs with an identical learning rate schedule and total number of training steps receive the same bound, which does not align with practical experience. Our proposed generalization bound considers the evolution of function values, gradient covariance, and gradient norms throughout the training process. As a result, our bound encompasses more comprehensive information about the learning trajectory.

Asymptotic Analysis

We first analyze the dependence of 𝕍\mathbb{V} on nn. 𝕍\mathbb{V} is calculated as 𝕍(𝐰)=FS(𝐰)𝔼US|U|nFU(𝐰)n|U|nFS/U(𝐰)\mathbb{V}(\mathbf{w})=\frac{\|\nabla F_{S}(\mathbf{w})\|}{\mathbb{E}_{U\subset S}\|\frac{|U|}{n}\nabla F_{U}(\mathbf{w})-\frac{n-|U|}{n}\nabla F_{S/U}(\mathbf{w})\|}. The gradient of an individual sample is unrelated to the sample size nn, and |U|n|U|\sim n. Therefore, 𝕍=𝒪(1)\mathbb{V}=\mathcal{O}(1). Similarly, we have 𝔼tdFS(𝐉𝐭)n1+Tr(Σ(𝐉𝐭))FS(𝐉𝐭)22=𝒪(1n)\mathbb{E}\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}=\mathcal{O}(\frac{1}{\sqrt{n}}). As for the 𝒪(ηm)\mathcal{O}(\eta_{m}) term in Theorem 3.6, we have limn𝒪(ηm)=0\lim\limits_{n\to\infty}\mathcal{O}(\eta_{m})=0 according to Proposition A.1. We simply assume that 𝒪(ηm)=𝒪(1nc)\mathcal{O}(\eta_{m})=\mathcal{O}(\frac{1}{n^{c}}). Therefore, our bound is 𝒪(1nmin{0.5,c})\mathcal{O}(\frac{1}{n^{\text{min}\{0.5,c\}}}).

Next, to draw a clearer comparison with stability-based methods, we present the following corollary. It employs the β\beta-smooth assumption to bound gennl(𝐉𝐓)\operatorname{gen}^{nl}(\mathbf{J_{T}}) and uses a learning rate setting similar to those found in stability-based works.

Corollary 3.8.

If the function f()f(\cdot) is β\beta-smooth, then under Assumption 3.4, given SμnS\sim\mu^{n}, let 𝐉=𝒜(S)\mathbf{J}=\mathcal{A}(S), ηt=cβ(t+1)\eta_{t}=\frac{c}{\beta(t+1)}, M22=maxt𝔼t1(FS(𝐉𝐭)+ϵ(𝐉𝐭)2)M^{2}_{2}=\max\limits_{t}\mathbb{E}_{t-1}(\|\nabla F_{S}(\mathbf{J_{t}})+\epsilon(\mathbf{J_{t}})\|^{2}) and M44=maxt𝔼t1(FS(𝐉𝐭)+ϵ(𝐉𝐭)4)M^{4}_{4}=\max\limits_{t}\mathbb{E}_{t-1}(\|\nabla F_{S}(\mathbf{J_{t}})+\epsilon(\mathbf{J_{t}})\|^{4}), where 𝒜\mathcal{A} denotes the SGD or GD algorithm trained for TT steps. Then we have:

𝔼[Fμ(𝐉𝐓)FS(𝐉𝐓)]\displaystyle\mathbb{E}[F_{\mu}(\mathbf{J_{T}})-F_{S}(\mathbf{J_{T}})]\leq 2γ𝕍m𝔼tdFS(𝐉𝐭)n1+Tr(Σ(𝐉𝐭))FS(𝐉𝐭)22\displaystyle-2\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}} (10)
+2c2γ𝕍mM42𝔼tdtnβ2(t+1)4(1+Tr(Σ(𝐉𝐭))FS(𝐉𝐭)22)\displaystyle+2c^{2}\gamma^{\prime}\mathbb{V}_{m}M_{4}^{2}\sqrt{\mathbb{E}\int_{t}\frac{dt}{n\beta^{2}(t+1)^{4}}\left(1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}\right)}
+2c2M22β.\displaystyle+2c^{2}\frac{M_{2}^{2}}{\beta}.

where 𝕍(𝐰)=FS(𝐰)𝔼US|U|nFU(𝐰)n|U|nFS/U(𝐰)\mathbb{V}(\mathbf{w})=\frac{\|\nabla F_{S}(\mathbf{w})\|}{\mathbb{E}_{U\subset S}\|\frac{|U|}{n}\nabla F_{U}(\mathbf{w})-\frac{n-|U|}{n}\nabla F_{S/U}(\mathbf{w})\|}, 𝕍m=maxt𝕍(𝐉𝐭)\mathbb{V}_{m}\!=\!\max\limits_{t}\mathbb{V}(\mathbf{J_{t}}) and γ=max{1,maxUS;t|U|FU(𝐉𝐭)nFS(𝐉𝐭)}γ\gamma^{\prime}\!=\!\max\{1,\max\limits_{U\subset S;t}\frac{|U|\|\nabla F_{U}(\mathbf{J_{t}})\|}{n\|\nabla F_{S}(\mathbf{J_{t}})\|}\}\gamma.

3.3 Analysis

3.3.1 Generalization Bounds

Our obtained generalization bound is:

𝔼[Fμ(𝐉𝐓)FS(𝐉𝐓)]γBias of Training Set𝕍m1Diversity of Training Set(2𝔼tdFS(𝐉𝐭)n1+Tr(Σ(𝐉𝐭))FS(𝐉𝐭)22)Complexity of Learning Trajectory+𝒪(ηm)\mathbb{E}[F_{\mu}(\mathbf{J_{T}})-F_{S}(\mathbf{J_{T}})]\leq\!\underbrace{\gamma^{\prime}}_{\text{\tiny Bias of Training Set}}\!\overbrace{\mathbb{V}_{m}}^{\frac{1}{\text{Diversity of Training Set}}}\!\underbrace{\!\left(\!-2\mathbb{E}\!\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}\!\right)\!}_{\text{Complexity of Learning Trajectory}}+\mathcal{O}(\!\eta_{m}\!) (11)

The "Bias of Training Set" refers to the disparity between the characteristics of the training set and those of the broader population. To measure this difference, we use the distance between the norm of the popular gradient and that of the training set gradient, as specified in Assumption 3.4. The "Diversity of Training Set" can be understood as the variation among the samples in the training set, which in turn affects the quality of the training data. The ratio Bias of Training SetDiversity of Training Set\frac{\text{Bias of Training Set}}{\text{Diversity of Training Set}} gives us the property of information conveyed by the training set. It is important to consider the properties of the training set, as the data may not contribute equally to the generalization[35]. The detail version of the equation can be found in Theorem 3.6.

3.3.2 Comparison with Uniform Stability Results

Here, we compare our modelling with uniform stability [11] from several perspectives in Table 2.

Table 2: Comparison with Uniform Stability Methods. The β\beta refers to the β\beta-smooth assumption (see Definition 3.2). SS denotes the training set.
Uniform Stability[11] Ours
Assumption 𝐰𝒲z𝒵f(𝐰,z)L\forall\mathbf{w}\in\mathcal{W}\quad\forall z^{\prime}\in\mathcal{Z}\quad\|\nabla f(\mathbf{w},z^{\prime})\|\leq L 𝐰{𝐉𝐭|t}𝔼zμf(𝐰,z)γ𝔼zSf(𝐰,z)\forall\mathbf{w}\in\{\mathbf{J_{t}}|t\in\mathbb{N}\}\quad\|\mathbb{E}_{z^{\prime}\sim\mu}\nabla f(\mathbf{w},z^{\prime})\|\leq\gamma\|\mathbb{E}_{z\in S}\nabla f(\mathbf{w},z)\|
Modelling Method of SGD Epoch Structure Full Batch Gradient + Stochastic Noise
Batch Size 1 n\leq n
Trajectory Information in Bound Learning rate and number of training steps Values in Trajectory (gradient norm and covariance)
Perspective Stability of Algorithm Complexity of Learning Trajectory

The concept of uniform stability is commonly used to evaluate the generalization ability of SGD by assessing its stability when a single training sample is altered. Our primary point of comparison is Hardt et al. [11], as their work is considered the most representative analysis of the stability of SGD.

First, the assumption of Uniform Stability requires the gradient norm of all input samples for all weights to be bounded by LL, whereas our assumption only limits the expectation of the gradients at the weights along the learning trajectory. Second, Uniform Stability uses an epoch structure to model stochastic gradient descent, whereas our approach regards each stochastic gradient step as a full-batch gradient step with added stochastic noise. The epoch structure complicates the modelling process because it requires a consideration of sampling; as a result, Hardt et al. [11] only consider the setting with batch size 1. Third, the bound of Uniform Stability only uses hyperparameter settings such as the learning rate and the number of training steps. In contrast, our bound contains more trajectory-related information, such as the gradient norm and covariance. Finally, Uniform Stability derives the generalization bound from the stability of the algorithm, while our approach leverages the complexity of the learning trajectory. In summary, there are notable differences between our approach and Uniform Stability in the assumptions made, the modelling process, the type of information used in the bound, and the overall perspective.

4 Experiments

4.1 Tightness of Our Bounds

Table 3: Numeric comparison with stability-based works on toy examples. The value for Zhang et al. [42] is large because our bound and that of Hardt et al. [11] depend on L2β\frac{L^{2}}{\beta}, while Zhang et al. [42] depends on L2L^{2}. LL and β\beta are usually large numbers.
Gen Error Ours Hardt et al. [11] Zhang et al. [42]
1.49 3.62 4.04 4417

In a toy dataset setting, we compare our generalization bound with stability-based methods.

Reasons for toy examples

1) Some values in the bounds are hard to calculate. Computing β\beta (under the β\beta-smooth assumption) and LL (under the LL-Lipschitz assumption) in stability-based works, as well as the values of 𝕍\mathbb{V} and γ\gamma in our proposed bound, is challenging. 2) Stability-based methods require a batch size of 1. Training with batch size 1 and the learning rate setting ηt=1t\eta_{t}=\frac{1}{t} is difficult on complex datasets.

Construction of the toy examples

In the following, we discuss the construction of the toy dataset used to compare the tightness of the generalization bounds. The training data is Xtr={xi}i=1nX_{tr}=\{x_{i}\}_{i=1}^{n}, where each xix_{i} is sampled from the Gaussian distribution 𝒩(0,𝐈d)\mathcal{N}(0,\mathbf{I}_{d}). Sampling 𝐰~𝒩(0,𝐈d)\tilde{\mathbf{w}}\sim\mathcal{N}(0,\mathbf{I}_{d}), the ground truth is generated by yi=1if𝐰~Txi>0else 0y_{i}=1\ \ \text{if}\ \ \tilde{\mathbf{w}}^{\mathrm{T}}x_{i}>0\ \ \text{else}\ \ 0. The learned weights are denoted as 𝐰\mathbf{w}. The prediction y~\tilde{y} is calculated as y~i=𝐰Txi\tilde{y}_{i}=\mathbf{w}^{\mathrm{T}}x_{i}. The loss for a single data point is li=yi𝐰Txi2l_{i}=\left\|y_{i}-\mathbf{w}^{\mathrm{T}}x_{i}\right\|_{2}, and the training loss is =i=1nli\mathcal{L}=\sum_{i=1}^{n}l_{i}. The test data is Xte={xi}X_{te}=\{x^{\prime}_{i}\}, where xi𝒩(0,𝐈d)x^{\prime}_{i}\sim\mathcal{N}(0,\mathbf{I}_{d}). We use 100 samples for training and 1,000 samples for evaluation. The model is trained using SGD for 200 epochs.
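The following is a minimal NumPy sketch of this toy setup. The input dimension, the learning-rate constant, and the reading of the per-sample loss as a squared error are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the toy example above. The dimension d, the learning-rate
# constant, and reading the per-sample loss as squared error are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, n_te, d = 100, 1000, 20
w_true = rng.normal(size=d)                          # the w-tilde above
X_tr = rng.normal(size=(n, d))
y_tr = (X_tr @ w_true > 0).astype(float)             # labels in {0, 1}
X_te = rng.normal(size=(n_te, d))
y_te = (X_te @ w_true > 0).astype(float)

beta = np.linalg.eigvalsh(X_tr.T @ X_tr / n).max()   # smoothness proxy
w, step = np.zeros(d), 1
for epoch in range(200):                             # SGD, batch size 1
    for i in rng.permutation(n):
        eta = 1.0 / (beta * step)                    # eta_t = 1 / (beta * t)
        w -= eta * (w @ X_tr[i] - y_tr[i]) * X_tr[i] # grad of 0.5*(w^T x - y)^2
        step += 1

gen_err = np.mean((X_te @ w - y_te) ** 2) - np.mean((X_tr @ w - y_tr) ** 2)
print(gen_err)
```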

We evaluate the tightness of our bound by comparing our results with the bounds of Hardt et al. [11] and Zhang et al. [42]. We set the learning rate as ηt=1βt\eta_{t}=\frac{1}{\beta t}. Our reasons for comparing with these two papers are: 1) Hardt et al. [11] is a representative study, 2) both papers have theorems using a learning rate setting of ηt=𝒪(1t)\eta_{t}=\mathcal{O}(\frac{1}{t}), which aligns with Corollary 3.8 in our paper, and 3) they do not assume convexity. The generalization bounds we compare are Corollary 3.8 from our paper, Theorem 3.12 from Hardt et al. [11], and Theorem 5 from Zhang et al. [42].

Our results are given in Table 3. Our bound is tighter under this setting.

4.2 Capturing the trend of generalization error

In this section, we 1) conduct deep learning experiments to verify Assumption 3.4 and 2) verify whether our proposed generalization bound can capture the changes in generalization error. In these experiments, we mainly consider the term 𝒞(𝐉𝐭)2i=0tdFS(𝐉𝐢)n1+Tr(Σ(𝐉𝐢))FS(𝐉𝐢)22\mathcal{C}(\mathbf{J_{t}})\triangleq-2\int_{i=0}^{t}\frac{dF_{S}(\mathbf{J_{i}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{i}}))}{\|\nabla F_{S}(\mathbf{J_{i}})\|_{2}^{2}}}. We omit the terms γ\gamma^{\prime} and 𝕍m\mathbb{V}_{m}, because all the trajectory-related information that we want to explore is contained in 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}). Capturing the trend of generalization error is regarded as an important problem in Nagarajan [21]. Unless further specified, we use the default setting of experiments on the CIFAR-10 dataset [14] with the VGG13 [33] network. The experimental details for each figure can be found in Appendix C.2.
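The quantity 𝒞(𝐉𝐭) can be accumulated online during training from quantities that are already available: the training loss and the per-sample gradients. Below is a minimal sketch (illustrative, not the authors' code) using the discrete-integral convention ∫U dV ≜ Σ_t U(t)(V(t+1)−V(t)) from Section 3; the interface, where grads_per_step[t] holds the (n, d) per-sample gradient matrix at 𝐉𝐭, is an assumption.

```python
# Minimal sketch of accumulating C(J_t) during training, using the discrete
# integral convention from Section 3. losses[t] = F_S(J_t); grads_per_step[t]
# is the (n, d) matrix of per-sample gradients at J_t (an assumed interface).
import numpy as np

def trajectory_complexity(losses, grads_per_step):
    n = grads_per_step[0].shape[0]
    C = [0.0]
    for t in range(len(losses) - 1):
        G = grads_per_step[t]
        g = G.mean(axis=0)                          # grad F_S(J_t)
        tr_Sigma = (G ** 2).sum() / n - g @ g       # Tr(Sigma(J_t)), Eq. (4)
        factor = np.sqrt(1.0 + tr_Sigma / (g @ g))
        C.append(C[-1] - 2.0 * factor / np.sqrt(n) * (losses[t + 1] - losses[t]))
    return C                                        # C[t] approximates C(J_t)
```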

Our observations are:

  • Assumption 3.4 is valid when SGD is not exhibiting extreme overfitting.

  • The term 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}) depicts how the generalization error varies along the training process, and it can also track the changes in generalization error when adjustments are made to learning rates and label noise levels.

Figure 1: Exploration of Assumption 3.4 for different datasets. γ~t\widetilde{\gamma}_{t} is stable before the training loss reaches a relatively small value. The assumption holds if training is stopped before extreme overfitting. A relaxed assumption and its corresponding generalization bound are given in Appendix B for the extreme-overfitting situation.
Exploring Assumption 3.4 for different datasets during the training process

To explore Assumption 3.4, we define γtFμ(𝐉𝐭)FS(𝐉𝐭)\gamma_{t}\triangleq\frac{\|\nabla F_{\mu}(\mathbf{J_{t}})\|}{\|\nabla F_{S}(\mathbf{J_{t}})\|} and γ~tFS(𝐉𝐭)FS(𝐉𝐭)\widetilde{\gamma}_{t}\triangleq\frac{\|\nabla F_{S^{\prime}}(\mathbf{J_{t}})\|}{\|\nabla F_{S}(\mathbf{J_{t}})\|}, where SS^{\prime} is another dataset i.i.d. sampled from the distribution μ\mu. Because SS^{\prime} is independent of SS, we have γ~tγt\widetilde{\gamma}_{t}\approx\gamma_{t}. We find that γ~t\widetilde{\gamma}_{t} is stable around 1 during the early stage of training (Figure 1). When the training loss reaches a relatively small value, γ~t\widetilde{\gamma}_{t} increases as training continues. This phenomenon remains consistent across the CIFAR-10, CIFAR-100, and SVHN datasets. The γ\gamma in Assumption 3.4 can be assigned as γ=maxtγ~t\gamma=\max_{t}\widetilde{\gamma}_{t}. We can always find such a γ\gamma if the optimizer is not extremely overfitting. In the extremely overfitting case, we can use the relaxed theorem in Appendix B to bound the generalization error.
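As an illustration, γ̃_t only requires gradient norms on the two sets at the current weights. A minimal sketch follows, assuming a grad_fn(w, data) helper (hypothetical, not an API from the paper) that returns the dataset-averaged gradient.

```python
# Minimal sketch of gamma_tilde_t = ||grad F_{S'}(J_t)|| / ||grad F_S(J_t)||.
# grad_fn(w, data) is an assumed helper returning the averaged gradient.
import numpy as np

def gamma_tilde(grad_fn, w, S, S_prime):
    return np.linalg.norm(grad_fn(w, S_prime)) / np.linalg.norm(grad_fn(w, S))

# Assumption 3.4's gamma can then be taken as the max over recorded steps:
# gamma = max(gamma_tilde(grad_fn, w_t, S, S_prime) for w_t in trajectory)
```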

The bound captures the trend of generalization error during the training process
Figure 2: Exploration of 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}) during the training process. Left: The curve FS(𝐉𝐭)+𝒞(𝐉𝐭)F_{S}(\mathbf{J_{t}})+\mathcal{C}(\mathbf{J_{t}}) exhibits a pattern comparable to the curve FS(𝐉𝐭)F_{S^{\prime}}(\mathbf{J_{t}}). Center: After the early stage, 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}) and FS(𝐉𝐭)FS(𝐉𝐭)F_{S^{\prime}}(\mathbf{J_{t}})-F_{S}(\mathbf{J_{t}}) have a similar trend. Right: The value of d𝒞(𝐉𝐭)dFS(𝐉𝐭)\frac{d\mathcal{C}(\mathbf{J_{t}})}{dF_{S}(\mathbf{J_{t}})} along the training process.

Both the generalization error and 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}) change as training continues. Therefore, we want to verify whether they correlate with each other during the training process. Here, we use the term FS(𝐉𝐭)FS(𝐉𝐭)F_{S^{\prime}}(\mathbf{J_{t}})-F_{S}(\mathbf{J_{t}}) to approximate the generalization error. We find that FS(𝐉𝐭)FS(𝐉𝐭)F_{S^{\prime}}(\mathbf{J_{t}})-F_{S}(\mathbf{J_{t}}) has a trend similar to 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}) (Figure 2, Center). Moreover, the curve FS(𝐉𝐭)+𝒞(𝐉𝐭)F_{S}(\mathbf{J_{t}})+\mathcal{C}(\mathbf{J_{t}}) exhibits a pattern comparable to the curve FS(𝐉𝐭)F_{S^{\prime}}(\mathbf{J_{t}}) (Figure 2, Left). To explore whether 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}) reveals the influence of the change of FS(𝐉𝐭)F_{S}(\mathbf{J_{t}}) on the generalization error, we plot d𝒞(𝐉𝐭)dFS(𝐉𝐭)\frac{d\mathcal{C}(\mathbf{J_{t}})}{dF_{S}(\mathbf{J_{t}})} during the training process (Figure 2, Right). d𝒞(𝐉𝐭)dFS(𝐉𝐭)\frac{d\mathcal{C}(\mathbf{J_{t}})}{dF_{S}(\mathbf{J_{t}})} increases slowly during the early stage of training but surges rapidly afterward. This discovery aligns with our intuition about overfitting.

Figure 3: 𝒞(𝐉𝐓)\mathcal{C(\mathbf{J_{T}})} correlates with the generalization error FS(𝐉𝐭)FS(𝐉𝐭)F_{S^{\prime}}(\mathbf{J_{t}})-F_{S}(\mathbf{J_{t}}). Left: 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}) and the generalization error under different label noise levels. Right: 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}) and the generalization error under different learning rates. 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}) can capture the trend of generalization error caused by the learning rate when the learning rate is small. Appendix E provides a proof that a large learning rate results in a smaller proposed generalization bound. Further discussion of why a small learning rate leads to a larger generalization error can be found in Li et al. [17] and Barrett and Dherin [3].
The complexity of the learning trajectory correlates with the generalization error

In Figure 3, we carry out experiments under various settings. Each data point in the figure represents the average of three repeated experiments. The results demonstrate that both the generalization error and 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}) increase as the level of label noise is raised (Figure 3, Left). Another set of experiments measures 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}) and the generalization error for different learning rates and finds that 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}) captures the trend of the generalization error (Figure 3, Right). The reasons why a larger learning rate results in a smaller generalization error have been explored in Li et al. [17] and Barrett and Dherin [3]. Additionally, Appendix E discusses why a larger learning rate can lead to a smaller 𝒞(𝐉𝐭)\mathcal{C}(\mathbf{J_{t}}).

5 Limitation

Our method requires the assumption of a small learning rate. However, this assumption is also commonly used in previous works. For example, Hardt et al. [11], Zhang et al. [42], and Zhou et al. [43] explicitly require that the learning rate be small and decayed at a rate of 𝒪(1t)\mathcal{O}(\frac{1}{t}). Some methods have no explicit requirement but show that a large learning rate pushes the generalization bounds to a trivial point. For example, the generalization bounds in [5, 16] have a term t=1Tηt2\sum_{t=1}^{T}\eta_{t}^{2} that does not decay as the data size nn increases; the value of this term is non-negligible when the learning rate is large. The small learning rate assumption widens the gap between theory and practice, and eliminating it is an important direction for future work.

6 Conclusion

In this study, we investigate the relation between learning trajectories and the generalization capabilities of Deep Neural Networks (DNNs) from a unique standpoint. We show that learning trajectories can serve as reliable indicators of DNNs’ generalization performance. To understand the relation between the learning trajectory and generalization error, we analyze how each update step impacts the generalization error. Based on this, we propose a novel generalization bound that encompasses extensive information related to the learning trajectory. The conducted experiments validate our newly proposed assumption. Experimental findings reveal that our method effectively captures the generalization error throughout the training process. Furthermore, our approach can also track changes in generalization error when adjustments are made to learning rates and label noise levels.

7 Acknowledgement

We thank all the anonymous reviewers for their valuable comments. This work was supported in part by the National Natural Science Foundation of China (Grant No. 62088102).

References

  • Ahn et al. [2022] K. Ahn, J. Zhang, and S. Sra. Understanding the unstable convergence of gradient descent. In International Conference on Machine Learning, pages 247–257. PMLR, 2022.
  • Banerjee et al. [2022] A. Banerjee, T. Chen, X. Li, and Y. Zhou. Stability based generalization bounds for exponential family langevin dynamics. arXiv preprint arXiv:2201.03064, 2022.
  • Barrett and Dherin [2020] D. G. Barrett and B. Dherin. Implicit gradient regularization. arXiv preprint arXiv:2009.11162, 2020.
  • Bartlett and Mendelson [2002] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Bassily et al. [2020] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33:4381–4391, 2020.
  • Belkin et al. [2019] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
  • Bousquet and Elisseeff [2002] O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
  • Chandramoorthy et al. [2022] N. Chandramoorthy, A. Loukas, K. Gatmiry, and S. Jegelka. On the generalization of learning algorithms that do not converge. arXiv preprint arXiv:2208.07951, 2022.
  • Cohen et al. [2021] J. M. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.
  • Haghifam et al. [2020] M. Haghifam, J. Negrea, A. Khisti, D. M. Roy, and G. K. Dziugaite. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. Advances in Neural Information Processing Systems, 33:9925–9935, 2020.
  • Hardt et al. [2016] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International conference on machine learning, pages 1225–1234. PMLR, 2016.
  • Jastrzebski et al. [2017] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.
  • Jastrzebski et al. [2021] S. Jastrzebski, D. Arpit, O. Astrand, G. B. Kerg, H. Wang, C. Xiong, R. Socher, K. Cho, and K. J. Geras. Catastrophic fisher explosion: Early phase fisher matrix impacts generalization. In International Conference on Machine Learning, pages 4772–4784. PMLR, 2021.
  • Krizhevsky et al. [2009] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Lei [2022] Y. Lei. Stability and generalization of stochastic optimization with nonconvex and nonsmooth problems. arXiv preprint arXiv:2206.07082, 2022.
  • Lei and Ying [2020] Y. Lei and Y. Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pages 5809–5819. PMLR, 2020.
  • Li et al. [2019] Y. Li, C. Wei, and T. Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • Luo et al. [2022] X. Luo, B. Luo, and J. Li. Generalization bounds for gradient methods via discrete and continuous prior. Advances in Neural Information Processing Systems, 35:10600–10614, 2022.
  • McAllester [1999] D. A. McAllester. Pac-bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pages 164–170, 1999.
  • Mohri et al. [2018] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018.
  • Nagarajan [2021] V. Nagarajan. Explaining generalization in deep learning: progress and fundamental limits. arXiv preprint arXiv:2110.08922, 2021.
  • Nagarajan and Kolter [2019] V. Nagarajan and J. Z. Kolter. Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Negrea et al. [2019] J. Negrea, M. Haghifam, G. K. Dziugaite, A. Khisti, and D. M. Roy. Information-theoretic generalization bounds for sgld via data-dependent estimates. Advances in Neural Information Processing Systems, 32, 2019.
  • Neu et al. [2021] G. Neu, G. K. Dziugaite, M. Haghifam, and D. M. Roy. Information-theoretic generalization bounds for stochastic gradient descent. In Conference on Learning Theory, pages 3526–3545. PMLR, 2021.
  • Nikolakakis et al. [2022] K. E. Nikolakakis, F. Haddadpour, A. Karbasi, and D. S. Kalogerias. Beyond lipschitz: Sharp generalization and excess risk bounds for full-batch gd. arXiv preprint arXiv:2204.12446, 2022.
  • Oksendal [2013] B. Oksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013.
  • Park et al. [2022] S. Park, U. Simsekli, and M. A. Erdogdu. Generalization bounds for stochastic gradient descent via localized ϵ\epsilon-covers. Advances in Neural Information Processing Systems, 35:2790–2802, 2022.
  • Pensia et al. [2018] A. Pensia, V. Jog, and P.-L. Loh. Generalization error bounds for noisy, iterative algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 546–550. IEEE, 2018.
  • Russo and Zou [2016] D. Russo and J. Zou. Controlling bias in adaptive data analysis using information theory. In Artificial Intelligence and Statistics, pages 1232–1240. PMLR, 2016.
  • Russo and Zou [2019] D. Russo and J. Zou. How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.
  • Sagun et al. [2017] L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou. Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
  • Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Simonyan and Zisserman [2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Simsekli et al. [2019] U. Simsekli, L. Sagun, and M. Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827–5837. PMLR, 2019.
  • Sorscher et al. [2022] B. Sorscher, R. Geirhos, S. Shekhar, S. Ganguli, and A. Morcos. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35:19523–19536, 2022.
  • Vapnik [1991] V. Vapnik. Principles of risk minimization for learning theory. Advances in neural information processing systems, 4, 1991.
  • Vapnik [1999] V. Vapnik. The nature of statistical learning theory. Springer science & business media, 1999.
  • Vapnik and Chervonenkis [2015] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, pages 11–30. Springer, 2015.
  • Xu and Raginsky [2017] A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems, 30, 2017.
  • Zhang et al. [2021] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
  • Zhang et al. [2022a] J. Zhang, H. Li, S. Sra, and A. Jadbabaie. Neural network weights do not converge to stationary points: An invariant measure perspective. In International Conference on Machine Learning, pages 26330–26346. PMLR, 2022a.
  • Zhang et al. [2022b] Y. Zhang, W. Zhang, S. Bald, V. Pingali, C. Chen, and M. Goswami. Stability of sgd: Tightness analysis and improved bounds. In Uncertainty in Artificial Intelligence, pages 2364–2373. PMLR, 2022b.
  • Zhou et al. [2022] Y. Zhou, Y. Liang, and H. Zhang. Understanding generalization error of sgd in nonconvex optimization. Machine Learning, 111(1):345–375, 2022.
  • Zhu et al. [2018] Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. arXiv preprint arXiv:1803.00195, 2018.

Appendix A Proof of Theorem 3.6

We restate Equation (7) and Equation (8):

Fμ(𝐉𝐓)FS(𝐉𝐓)=Fμ(𝐉𝟎)FS(𝐉𝟎)(i)+t=1T[(Fμ(𝐉𝐭)Fμ(𝐉𝐭𝟏))(FS(𝐉𝐭)FS(𝐉𝐭𝟏))](ii)t,F_{\mu}(\mathbf{\mathbf{J_{T}}})-F_{S}(\mathbf{J_{T}})=\underbrace{F_{\mu}(\mathbf{J_{0}})-F_{S}(\mathbf{J_{0}})}_{(i)}+\sum_{t=1}^{T}\underbrace{[(F_{\mu}(\mathbf{J_{t}})-F_{\mu}(\mathbf{J_{t-1}}))-(F_{S}(\mathbf{J_{t}})-F_{S}(\mathbf{J_{t-1}}))]}_{(ii)_{t}}, (12)

and

𝔼[Fμ(𝐉𝐓)FS(𝐉𝐓)]=𝔼t=1T(ii)t.\mathbb{E}[F_{\mu}(\mathbf{\mathbf{J_{T}}})-F_{S}(\mathbf{J_{T}})]=\mathbb{E}\sum_{t=1}^{T}(ii)_{t}. (13)

Using a Taylor expansion of the function f()f(\cdot), we have:

f(𝐉𝐭)f(𝐉𝐭𝟏)=(𝐉𝐭𝐉𝐭𝟏)Tf(𝐉𝐭𝟏)+𝒪(𝐉𝐭𝐉𝐭𝟏2).f(\mathbf{J_{t}})-f(\mathbf{J_{t-1}})=(\mathbf{J_{t}}-\mathbf{J_{t-1}})^{\mathrm{T}}\nabla f(\mathbf{J_{t-1}})+\mathcal{O}(\|\mathbf{J_{t}}-\mathbf{J_{t-1}}\|^{2}). (14)

Therefore, we can define (ii)tlin(ii)^{lin}_{t} as:

(ii)tlin(𝐉𝐭𝐉𝐭𝟏)T(Fμ(𝐉𝐭𝟏)FS(𝐉𝐭𝟏)).(ii)^{lin}_{t}\triangleq(\mathbf{J_{t}}-\mathbf{J_{t-1}})^{\mathrm{T}}(\nabla F_{\mu}(\mathbf{J_{t-1}})-\nabla F_{S}(\mathbf{J_{t-1}})). (15)

The (ii)t(ii)_{t} can be decomposed as (ii)t=(ii)tlin+(ii)tnl(ii)_{t}=(ii)_{t}^{lin}+(ii)_{t}^{nl}, where (ii)tnl(ii)t(ii)tlin(ii)_{t}^{nl}\triangleq(ii)_{t}-(ii)_{t}^{lin}.

Then Equation (13) can be decomposed as:

𝔼[Fμ(𝐉𝐓)FS(𝐉𝐓)]=𝔼t=1T(ii)tlingenlin(𝐉𝐓)+𝔼t=1T(ii)tnlgennl(𝐉𝐓).\mathbb{E}[F_{\mu}(\mathbf{\mathbf{J_{T}}})-F_{S}(\mathbf{J_{T}})]=\mathbb{E}\underbrace{\sum_{t=1}^{T}(ii)^{lin}_{t}}_{\operatorname{gen}^{lin}(\mathbf{J_{T}})}+\mathbb{E}\underbrace{\sum_{t=1}^{T}(ii)^{nl}_{t}}_{\operatorname{gen}^{nl}(\mathbf{J_{T}})}. (16)
Proposition A.1.

For the gradient descent or the stochastic gradient descent algorithm, we have:

𝔼[genlin(𝐉𝐓)]=𝔼[t=0T1ηtFS(𝐉𝐭)T(FS(𝐉𝐭)Fμ(𝐉𝐭))].\mathbb{E}[\operatorname{gen}^{lin}(\mathbf{J_{T}})]=\mathbb{E}[\sum_{t=0}^{T-1}\eta_{t}\nabla F_{S}(\mathbf{J_{t}})^{\mathrm{T}}(\nabla F_{S}(\mathbf{J_{t}})-\nabla F_{\mu}(\mathbf{J_{t}}))]. (17)

If T=𝒪(1ηm)T=\mathcal{O}(\frac{1}{\eta_{m}}), we have:

|gennl(𝐉𝐓)|=𝒪(ηm),|\operatorname{gen}^{nl}(\mathbf{J_{T}})|=\mathcal{O}(\eta_{m}), (18)

where $\eta_{m}\triangleq\max_{t}\eta_{t}$, and we have:

\lim_{n\rightarrow\infty}|\operatorname{gen}^{nl}(\mathbf{J_{T}})|=0.   (19)
Remark A.2.

Furthermore, we give an experimental exploration of $\operatorname{gen}^{nl}(\mathbf{J_{T}})$ in Appendix C.3. We discover that if the optimizer does not enter the EoS (Edge of Stability) regime [9], we have $\operatorname{gen}^{nl}(\mathbf{J_{T}})\approx 0$. A commonly used assumption in stability-based generalization theories is $\eta_{m}\leq\frac{2}{\beta}$. For gradient descent, the maximum eigenvalue of the Hessian hovers above $\frac{2}{\eta}$ once the optimizer enters EoS. This indicates that the assumption $\eta_{m}\leq\frac{2}{\beta}$ is valid only when the optimizer does not enter EoS. In addition, we observe in Section 4 that the proposed bound effectively captures the generalization error trend under commonly used experimental settings.

Proof.

Analysis of $\operatorname{gen}^{lin}(\mathbf{J_{T}})$.

Since $\epsilon(\mathbf{w})=C(\mathbf{w})^{\frac{1}{2}}\epsilon^{\prime}$ and $\mathbb{E}[\epsilon^{\prime}]=0$ (detailed in Equation (3) and Equation (5)), we get:

\mathbb{E}_{t-1}[\epsilon_{t}^{\mathrm{T}}(\nabla F_{\mu}(\mathbf{J_{t}})-\nabla F_{S}(\mathbf{J_{t}}))]   (20)
=\mathbb{E}_{t-1}[(\epsilon^{\prime})^{\mathrm{T}}(C(\mathbf{J_{t}})^{\frac{1}{2}})^{\mathrm{T}}(\nabla F_{\mu}(\mathbf{J_{t}})-\nabla F_{S}(\mathbf{J_{t}}))]
=\mathbb{E}_{t-1}[\epsilon^{\prime}]^{\mathrm{T}}\,\mathbb{E}_{t-1}[(C(\mathbf{J_{t}})^{\frac{1}{2}})^{\mathrm{T}}(\nabla F_{\mu}(\mathbf{J_{t}})-\nabla F_{S}(\mathbf{J_{t}}))]
=0.

Combining this with Formula (3), we have

\mathbb{E}[\operatorname{gen}^{lin}(\mathbf{J_{T}})]=\mathbb{E}\Big[\sum_{t=0}^{T-1}\eta_{t}\nabla F_{S}(\mathbf{J_{t}})^{\mathrm{T}}(\nabla F_{S}(\mathbf{J_{t}})-\nabla F_{\mu}(\mathbf{J_{t}}))\Big].   (21)

Analysis of $\operatorname{gen}^{nl}(\mathbf{J_{T}})$.

Here, we denote $M\triangleq\max_{t}\|\nabla F_{S}(\mathbf{J_{t}})\|$. According to the definition of $\operatorname{gen}^{nl}(\mathbf{J_{T}})$, we have:

|\operatorname{gen}^{nl}(\mathbf{J_{T}})|\leq|F_{\mu}(\mathbf{J_{T}})-F_{\mu}^{lin}(\mathbf{J_{T}})|+|F_{S}(\mathbf{J_{T}})-F_{S}^{lin}(\mathbf{J_{T}})|   (22)
=\Big|\sum_{t=0}^{T-1}\mathcal{O}(\|\mathbf{J_{t+1}}-\mathbf{J_{t}}\|^{2})\Big|+\Big|\sum_{t=0}^{T-1}\mathcal{O}(\|\mathbf{J_{t+1}}-\mathbf{J_{t}}\|^{2})\Big|
=\sum_{t=0}^{T-1}\mathcal{O}(\eta_{t}^{2}\|\nabla F_{S}(\mathbf{J_{t}})\|^{2})
=\mathcal{O}(T\eta_{m}^{2}M^{2})
=\mathcal{O}(\tfrac{1}{\eta_{m}}\eta_{m}^{2}M^{2})
=\mathcal{O}(\eta_{m}).

Because all the elements of the training set $S$ are sampled from the distribution $\mu$, we have $\lim_{n\rightarrow\infty}\nabla F_{S}(\mathbf{w})=\nabla F_{\mu}(\mathbf{w})$. Therefore:

\lim_{n\rightarrow\infty}(ii)^{lin}_{t}=\lim_{n\rightarrow\infty}(\mathbf{J_{t}}-\mathbf{J_{t-1}})^{\mathrm{T}}(\nabla F_{\mu}(\mathbf{J_{t-1}})-\nabla F_{S}(\mathbf{J_{t-1}}))=0.   (23)

Moreover, by the same argument, we also have:

\lim_{n\rightarrow\infty}\big[F_{\mu}(\mathbf{J_{T}})-F_{S}(\mathbf{J_{T}})\big]=0.   (24)

Because $F_{\mu}(\mathbf{J_{T}})-F_{S}(\mathbf{J_{T}})=\sum_{t=1}^{T}(ii)^{lin}_{t}+\sum_{t=1}^{T}(ii)^{nl}_{t}$, we have:

\lim_{n\rightarrow\infty}|\operatorname{gen}^{nl}(\mathbf{J_{T}})|=\lim_{n\rightarrow\infty}\Big|\sum_{t=1}^{T}(ii)^{nl}_{t}\Big|=\lim_{n\rightarrow\infty}\Big|F_{\mu}(\mathbf{J_{T}})-F_{S}(\mathbf{J_{T}})-\sum_{t=1}^{T}(ii)^{lin}_{t}\Big|=|0-0|=0.   (25) ∎

According to Equation (17), we analyze the generalization error of $\mathcal{F}_{\mathbf{J}|S}\triangleq\{\sum_{t=0}^{T-1}\mathbf{w_{t}}^{\mathrm{T}}\nabla f(\mathbf{J_{t}})\ |\ \mathbf{w_{t}}=\delta_{t}\frac{\nabla F_{S}(\mathbf{J_{t}})}{\|\nabla F_{S}(\mathbf{J_{t}})\|}\}$ as a proxy for the generalization error of the function trained with the SGD or GD algorithm. The value of $\operatorname{gen}^{lin}(\mathbf{J_{T}})$ equals the generalization error of $\mathcal{F}_{\mathbf{J}|S}$. To analyze $\mathcal{F}_{\mathbf{J}|S}$, we construct an additive linear space $\mathcal{L}_{\mathbf{J}|S}\triangleq\{\sum_{t=0}^{T-1}\mathbf{w_{t}}^{\mathrm{T}}\nabla f(\mathbf{J_{t}})\ |\ \|\mathbf{w_{t}}\|\leq\delta_{t}\}$, where $\delta_{t}\triangleq\eta_{t}\|\nabla F_{S}(\mathbf{J_{t}})\|$. Here, we use $\mathbf{J}|S$ to emphasize that $\mathbf{J}$ depends on $S$.

Under Assumption 3.4 (introduced in the main paper), we have the following lemma.

Lemma A.3.

Under Assumption 3.4, given $S\sim\mu^{n}$, let $\mathbf{J}=\mathcal{A}(S)$. We have:

\mathbb{E}[\operatorname{gen}^{lin}(\mathbf{J_{T}})]\leq 2\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}R_{S}(\mathcal{L}_{\mathbf{J}|S}),   (26)

where $\mathbb{V}(\mathbf{w})=\frac{\|\nabla F_{S}(\mathbf{w})\|}{\mathbb{E}_{U\subset S}\|\frac{|U|}{n}\nabla F_{U}(\mathbf{w})-\frac{|S|-|U|}{n}\nabla F_{S/U}(\mathbf{w})\|}$, $\mathbb{V}_{m}=\max_{t}\mathbb{V}(\mathbf{J_{t}})$ and $\gamma^{\prime}=\max\{1,\max_{U\subset S;t}\frac{|U|\|\nabla F_{U}(\mathbf{J_{t}})\|}{n\|\nabla F_{S}(\mathbf{J_{t}})\|}\}\gamma$.

Proof.

For a function $h$, we define $h_{\mu}\triangleq\mathbb{E}_{z\sim\mu}[h(z)]$ and $h_{S}\triangleq\frac{1}{n}\sum_{z_{i}\in S}h(z_{i})$. Given a function space $H$, the maximum generalization error over the space is defined as $\Phi(S,H)\triangleq\sup_{h\in H}(h_{\mu}-h_{S})$. Then:

\Phi(S,H|S)=\sup_{h\in H|S}(h_{\mu}-h_{S})   (27)
=\sup_{h\in H|S}(\mathbb{E}_{S^{\prime}}h_{S^{\prime}}-h_{S})
\leq\mathbb{E}_{S^{\prime}}\sup_{h\in H|S}(h_{S^{\prime}}-h_{S})
=\mathbb{E}_{S^{\prime},\sigma}\sup_{h\in H|S}\Big(\frac{1}{n}\sum_{i}^{n}\sigma_{i}(h(z_{i}^{\sigma_{i}})-h(z_{i}^{-\sigma_{i}}))\Big)
\leq\mathbb{E}_{S^{\prime},\sigma}\sup_{h\in H|S}\Big(\frac{1}{n}\sum_{i}^{n}\sigma_{i}h(z_{i}^{\sigma_{i}})\Big)+\mathbb{E}_{S^{\prime},\sigma}\sup_{h\in H|S}\Big(\frac{1}{n}\sum_{i}^{n}\sigma_{i}h(z_{i}^{\sigma_{i}})\Big)
=2\,\mathbb{E}_{S^{\prime},\sigma}\sup_{h\in H|S}\Big(\frac{1}{n}\sum_{i}^{n}\sigma_{i}h(z_{i}^{\sigma_{i}})\Big),

where $S^{\prime}$ is another i.i.d. sample set drawn from $\mu^{n}$ and $\sigma$ denotes the Rademacher variable. The superscript $\sigma_{i}$ in $z_{i}^{\sigma_{i}}$ indicates whether $z_{i}^{\sigma_{i}}$ belongs to $S$ or $S^{\prime}$: if $\sigma_{i}=-1$, then $z_{i}^{\sigma_{i}}\in S$; otherwise $z_{i}^{\sigma_{i}}\in S^{\prime}$.

R_{S}(\mathcal{L}_{\mathbf{J}|S})\triangleq\mathbb{E}_{\sigma}\sup_{h\in\mathcal{L}_{\mathbf{J}|S}}\Big(\frac{1}{n}\sum_{i}^{n}\sigma_{i}h(z_{i})\Big)   (28)
=\mathbb{E}_{\sigma}\sup_{h\in\mathcal{L}_{\mathbf{J}|S}}\Big(\frac{1}{n}\big(\sum_{z\in S_{+}}h(z)-\sum_{z\in S_{-}}h(z)\big)\Big)
=\mathbb{E}_{\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\|g_{S_{+}}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}})\|\Big),

where $S_{+}\triangleq\{z_{i}\ |\ \sigma_{i}=+1\}$, $S_{-}\triangleq\{z_{i}\ |\ \sigma_{i}=-1\}$, and $g_{S}(\mathbf{w})\triangleq|S|\nabla F_{S}(\mathbf{w})$. Similarly,

\mathbb{E}_{S^{\prime},\sigma}\sup_{h\in\mathcal{F}_{\mathbf{J}|S}}\Big(\frac{1}{n}\sum_{i}^{n}\sigma_{i}h(z_{i}^{\sigma_{i}})\Big)=\mathbb{E}_{S^{\prime},\sigma}\Big(\frac{1}{n}\big(\sum_{z\in S^{\prime}_{+}}h(z)-\sum_{z\in S_{-}}h(z)\big)\Big)   (29)
=\mathbb{E}_{S^{\prime},\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\frac{g_{S}(\mathbf{J_{t}})^{\mathrm{T}}}{\|g_{S}(\mathbf{J_{t}})\|}(g_{S^{\prime}_{+}}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}}))\Big)
=\mathbb{E}_{\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\frac{g_{S}(\mathbf{J_{t}})^{\mathrm{T}}}{\|g_{S}(\mathbf{J_{t}})\|}(|S_{+}|\nabla F_{\mu}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}}))\Big),

where $S^{\prime}_{+}$ is the subset of $S^{\prime}$ with $|S^{\prime}_{+}|=|S_{+}|$. Defining $k\triangleq\gamma^{\prime}\mathbb{V}_{m}$, we have:

k\,\mathbb{E}_{\sigma}\sup_{h\in\mathcal{L}_{\mathbf{J}|S}}\Big(\frac{1}{n}\sum_{i}^{n}\sigma_{i}h(z_{i})\Big)-\mathbb{E}_{S^{\prime},\sigma}\sup_{h\in\mathcal{F}_{\mathbf{J}|S}}\Big(\frac{1}{n}\sum_{i}^{n}\sigma_{i}h(z_{i}^{\sigma_{i}})\Big)   (30)
=k\,\mathbb{E}_{\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\|g_{S_{+}}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}})\|\Big)-\mathbb{E}_{S^{\prime},\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\frac{g_{S}(\mathbf{J_{t}})^{\mathrm{T}}}{\|g_{S}(\mathbf{J_{t}})\|}(g_{S^{\prime}_{+}}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}}))\Big)
=k\,\mathbb{E}_{\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\|g_{S_{+}}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}})\|\Big)-\mathbb{E}_{\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\frac{g_{S}(\mathbf{J_{t}})^{\mathrm{T}}}{\|g_{S}(\mathbf{J_{t}})\|}(|S_{+}|\nabla F_{\mu}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}}))\Big)
\geq k\,\mathbb{E}_{\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\|g_{S_{+}}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}})\|\Big)-\mathbb{E}_{\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\||S_{+}|\nabla F_{\mu}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}})\|\Big)
\geq k\,\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\,\mathbb{E}_{\sigma}\|g_{S_{+}}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}})\|-\sum_{t=0}^{T-1}\delta_{t}\gamma^{\prime}\|\nabla F_{S}(\mathbf{J_{t}})\|
\geq 0.

Therefore, combining Equations (27) and (30), we have $\mathbb{E}[\operatorname{gen}^{lin}(\mathbf{J_{T}})]\leq 2\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}R_{S}(\mathcal{L}_{\mathbf{J}|S})$. ∎

Lemma A.4.

Given $\mathbf{J}=\mathcal{A}(S)$, $R_{S}(\mathcal{L}_{\mathbf{J}|S})$ can be upper bounded as:

\mathbb{E}R_{S}(\mathcal{L}_{\mathbf{J}|S})\leq-\mathbb{E}\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}.   (31)
Proof.

Let us start with the calculation of $R_{S}(\{\mathbf{w}^{\mathrm{T}}\nabla f(\mathbf{J_{t}})\ |\ \|\mathbf{w}\|\leq\delta\})$:

R_{S}(\{\mathbf{w}^{\mathrm{T}}\nabla f(\mathbf{J_{t}})\ |\ \|\mathbf{w}\|\leq\delta\})=\frac{1}{n}\mathbb{E}_{\sigma}\Big(\sup_{\|\mathbf{w}\|\leq\delta}\mathbf{w}^{\mathrm{T}}\sum_{i=1}^{n}\sigma_{i}\nabla f(\mathbf{J_{t}},z_{i})\Big)   (32)
=\frac{\delta}{n}\mathbb{E}_{\sigma}\Big(\sqrt{\big\|\sum_{i=1}^{n}\sigma_{i}\nabla f(\mathbf{J_{t}},z_{i})\big\|^{2}}\Big)
\leq\frac{\delta}{n}\sqrt{\mathbb{E}_{\sigma}\big\|\sum_{i=1}^{n}\sigma_{i}\nabla f(\mathbf{J_{t}},z_{i})\big\|^{2}}
\overset{\blacktriangle}{\leq}\frac{\delta}{n}\sqrt{\mathbb{E}_{\sigma}\sum_{i=1}^{n}\big\|\sigma_{i}\nabla f(\mathbf{J_{t}},z_{i})\big\|^{2}}
=\frac{\delta}{n}\sqrt{\sum_{i=1}^{n}\big\|\nabla f(\mathbf{J_{t}},z_{i})\big\|^{2}},

where $\blacktriangle$ uses the fact that $\mathbb{E}[\sigma_{i}\sigma_{j}]=0$ for $i\neq j$.
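As a quick numerical illustration of the $\blacktriangle$ step (ours, not part of the proof), the following Python snippet checks $\mathbb{E}_{\sigma}\|\sum_{i}\sigma_{i}g_{i}\|\leq\sqrt{\sum_{i}\|g_{i}\|^{2}}$ on synthetic vectors; the shapes and seed are arbitrary choices:

```python
import numpy as np

# Monte Carlo check of E_sigma || sum_i sigma_i g_i || <= sqrt( sum_i ||g_i||^2 ),
# i.e., Jensen's inequality plus E[sigma_i sigma_j] = 0 for i != j.
rng = np.random.default_rng(0)
g = rng.standard_normal((50, 10))                # n = 50 stand-ins for per-sample gradients
sigma = rng.choice([-1.0, 1.0], size=(100_000, 50))
lhs = np.linalg.norm(sigma @ g, axis=1).mean()   # E_sigma || sum_i sigma_i g_i ||
rhs = np.sqrt((g ** 2).sum())                    # sqrt( sum_i ||g_i||^2 )
print(f"{lhs:.3f} <= {rhs:.3f}")                 # the inequality holds
```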

Because $\mathbf{w_{t}}$ is independent of $\mathbf{w_{t^{\prime}}}$ for $t\neq t^{\prime}$, the supremum decomposes across $t$ and we have:

R_{S}(\mathcal{L}_{\mathbf{J}|S})=R_{S}(\{f(\mathbf{J_{0}})+\sum_{t=0}^{T-1}\mathbf{w_{t}}^{\mathrm{T}}\nabla f(\mathbf{J_{t}})\ |\ \|\mathbf{w_{t}}\|\leq\delta_{t}\})   (33)
=\sum_{t=0}^{T-1}R_{S}(\{\mathbf{w_{t}}^{\mathrm{T}}\nabla f(\mathbf{J_{t}})\ |\ \|\mathbf{w_{t}}\|\leq\delta_{t}\})
\leq\sum_{t=0}^{T-1}\frac{\delta_{t}}{n}\sqrt{\sum_{i=1}^{n}\|\nabla f(\mathbf{J_{t}},z_{i})\|^{2}}.

The trace of the covariance of the gradient noise can be calculated as:

\operatorname{Tr}[\Sigma(\mathbf{w})]=\operatorname{Tr}\Big[\frac{1}{n}\sum_{i=1}^{n}\nabla f(\mathbf{w},z_{i})\nabla f(\mathbf{w},z_{i})^{\mathrm{T}}-\nabla F_{S}(\mathbf{w})\nabla F_{S}(\mathbf{w})^{\mathrm{T}}\Big]   (34)
=\frac{1}{n}\sum_{i=1}^{n}\operatorname{Tr}[\nabla f(\mathbf{w},z_{i})\nabla f(\mathbf{w},z_{i})^{\mathrm{T}}]-\operatorname{Tr}[\nabla F_{S}(\mathbf{w})\nabla F_{S}(\mathbf{w})^{\mathrm{T}}]
=\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(\mathbf{w},z_{i})\|^{2}-\|\nabla F_{S}(\mathbf{w})\|^{2}.

Plugging Equation (34) and $\delta_{t}\triangleq\eta_{t}\|\nabla F_{S}(\mathbf{J_{t}})\|$ into Equation (33), we have:

R_{S}(\mathcal{L}_{\mathbf{J}|S})\leq\sum_{t=0}^{T-1}\frac{\delta_{t}}{n}\sqrt{\sum_{i=1}^{n}\|\nabla f(\mathbf{J_{t}},z_{i})\|^{2}}   (35)
=\sum_{t=0}^{T-1}\frac{\delta_{t}}{\sqrt{n}}\sqrt{\operatorname{Tr}[\Sigma(\mathbf{J_{t}})]+\|\nabla F_{S}(\mathbf{J_{t}})\|^{2}}
=\sum_{t=0}^{T-1}\frac{\eta_{t}\|\nabla F_{S}(\mathbf{J_{t}})\|}{\sqrt{n}}\sqrt{\operatorname{Tr}[\Sigma(\mathbf{J_{t}})]+\|\nabla F_{S}(\mathbf{J_{t}})\|^{2}}.

When $\eta_{t}$ is small, $\delta_{t}\approx-\mathbb{E}_{\epsilon}\frac{(\mathbf{J_{t+1}}-\mathbf{J_{t}})^{\mathrm{T}}\nabla F_{S}(\mathbf{J_{t}})}{\|\nabla F_{S}(\mathbf{J_{t}})\|}\approx-\mathbb{E}_{\epsilon}\frac{F_{S}(\mathbf{J_{t+1}})-F_{S}(\mathbf{J_{t}})}{\|\nabla F_{S}(\mathbf{J_{t}})\|}$ holds; therefore we have:

\mathbb{E}R_{S}(\mathcal{L}_{\mathbf{J}|S})\leq\mathbb{E}\sum_{t=0}^{T-1}\frac{\delta_{t}}{n}\sqrt{\sum_{i=1}^{n}\|\nabla f(\mathbf{J_{t}},z_{i})\|_{2}^{2}}   (36)
\approx-\mathbb{E}\sum_{t=0}^{T-1}\frac{F_{S}(\mathbf{J_{t+1}})-F_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}
\approx-\mathbb{E}\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}.  ∎

Theorem A.5.

Under Assumption 3.4, given $S\sim\mu^{n}$, let $\mathbf{J}=\mathcal{A}(S)$, where $\mathcal{A}$ denotes the SGD or GD algorithm training for $T$ steps. We have:

\mathbb{E}[F_{\mu}(\mathbf{J_{T}})-F_{S}(\mathbf{J_{T}})]\leq-2\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}+\mathcal{O}(\eta_{m}),   (37)

where $\mathbb{V}(\mathbf{w})=\frac{\|\nabla F_{S}(\mathbf{w})\|}{\mathbb{E}_{U\subset S}\|\frac{|U|}{n}\nabla F_{U}(\mathbf{w})-\frac{|S|-|U|}{n}\nabla F_{S/U}(\mathbf{w})\|}$, $\mathbb{V}_{m}=\max_{t}\mathbb{V}(\mathbf{J_{t}})$ and $\gamma^{\prime}=\max\{1,\max_{U\subset S;t}\frac{|U|\|\nabla F_{U}(\mathbf{J_{t}})\|}{n\|\nabla F_{S}(\mathbf{J_{t}})\|}\}\gamma$.

Proof.

We rewrite Equation (2), the update rule of SGD with batch size $b$, here:

\mathbf{J_{t}}=\mathbf{J_{t-1}}-\eta_{t}\nabla F_{S}(\mathbf{J_{t-1}})+\eta_{t}\epsilon_{t},   (38)

where we abbreviate $\epsilon(\mathbf{J_{t}})$ as $\epsilon_{t}$. We can then expand the function $f$ at $\mathbf{J_{T}}$ as:

f^{lin}(\mathbf{J_{T}})\triangleq f(\mathbf{J_{0}})+\sum_{t=0}^{T-1}(-\eta_{t}\nabla F_{S}(\mathbf{J_{t}})+\eta_{t}\epsilon_{t})^{\mathrm{T}}\nabla f(\mathbf{J_{t}})   (39)
=f(\mathbf{J_{0}})-\sum_{t=0}^{T-1}\eta_{t}\nabla F_{S}(\mathbf{J_{t}})^{\mathrm{T}}\nabla f(\mathbf{J_{t}})+\sum_{t=0}^{T-1}\eta_{t}\epsilon_{t}^{\mathrm{T}}\nabla f(\mathbf{J_{t}}).

Note that when the learning rate is small, we have $f(\mathbf{J_{T}})\approx f^{lin}(\mathbf{J_{T}})$.

The difference between the population value and the empirical value of the linearized function can be calculated as:

\mathbb{E}\Big[F_{\mu}(\mathbf{J_{0}})+\sum_{t=0}^{T-1}(-\eta_{t}\nabla F_{S}(\mathbf{J_{t}})+\eta_{t}\epsilon_{t})^{\mathrm{T}}\nabla F_{\mu}(\mathbf{J_{t}})\Big]-\mathbb{E}\Big[F_{S}(\mathbf{J_{0}})+\sum_{t=0}^{T-1}(-\eta_{t}\nabla F_{S}(\mathbf{J_{t}})+\eta_{t}\epsilon_{t})^{\mathrm{T}}\nabla F_{S}(\mathbf{J_{t}})\Big]   (40)
=\mathbb{E}\Big[F_{\mu}(\mathbf{J_{0}})-F_{S}(\mathbf{J_{0}})+\sum_{t=0}^{T-1}(-\eta_{t}\nabla F_{S}(\mathbf{J_{t}})+\eta_{t}\epsilon_{t})^{\mathrm{T}}(\nabla F_{\mu}(\mathbf{J_{t}})-\nabla F_{S}(\mathbf{J_{t}}))\Big]
=\mathbb{E}\Big[\sum_{t=0}^{T-1}\eta_{t}\nabla F_{S}(\mathbf{J_{t}})^{\mathrm{T}}(\nabla F_{S}(\mathbf{J_{t}})-\nabla F_{\mu}(\mathbf{J_{t}}))+\sum_{t=0}^{T-1}\eta_{t}\epsilon_{t}^{\mathrm{T}}(\nabla F_{\mu}(\mathbf{J_{t}})-\nabla F_{S}(\mathbf{J_{t}}))\Big]
\overset{\blacktriangle}{=}\mathbb{E}\Big[\sum_{t=0}^{T-1}\eta_{t}\nabla F_{S}(\mathbf{J_{t}})^{\mathrm{T}}(\nabla F_{S}(\mathbf{J_{t}})-\nabla F_{\mu}(\mathbf{J_{t}}))\Big]
\leq\Phi(S,\mathcal{F}_{\mathbf{J}|S}),

where $\blacktriangle$ uses the identity $\mathbb{E}[\epsilon_{t}^{\mathrm{T}}(\nabla F_{\mu}(\mathbf{J_{t}})-\nabla F_{S}(\mathbf{J_{t}}))]=0$ from Equation (20), and $\mathbb{E}[F_{\mu}(\mathbf{J_{0}})-F_{S}(\mathbf{J_{0}})]=0$ since $\mathbf{J_{0}}$ does not depend on $S$.

Because $\mathbb{E}[F_{\mu}(\mathbf{J_{T}})-F_{S}(\mathbf{J_{T}})]=\mathbb{E}[F^{lin}_{\mu}(\mathbf{J_{T}})+\mathcal{O}(\eta_{m})-F^{lin}_{S}(\mathbf{J_{T}})-\mathcal{O}(\eta_{m})]=\mathbb{E}[F^{lin}_{\mu}(\mathbf{J_{T}})-F^{lin}_{S}(\mathbf{J_{T}})]+\mathcal{O}(\eta_{m})$ (from Proposition A.1), applying Lemma A.3 and Lemma A.4 proves the theorem. ∎

Corollary A.6.

If the function $f(\cdot)$ is $\beta$-smooth, under Assumption 3.4, given $S\sim\mu^{n}$, let $\mathbf{J}=\mathcal{A}(S)$, $\eta_{t}=\frac{c}{\beta(t+1)}$, $M^{2}_{2}=\max_{t}\mathbb{E}_{t-1}(\|\nabla F_{S}(\mathbf{J_{t}})+\epsilon(\mathbf{J_{t}})\|^{2})$ and $M^{4}_{4}=\max_{t}\mathbb{E}_{t-1}(\|\nabla F_{S}(\mathbf{J_{t}})+\epsilon(\mathbf{J_{t}})\|^{4})$, where $\mathcal{A}$ denotes the SGD or GD algorithm training for $T$ steps. We have:

\mathbb{E}[F_{\mu}(\mathbf{J_{T}})-F_{S}(\mathbf{J_{T}})]\leq-2\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}   (41)
+2c^{2}\gamma^{\prime}\mathbb{V}_{m}M_{4}^{2}\sqrt{\mathbb{E}\int_{t}\frac{dt}{n\beta^{2}(t+1)^{4}}\Big(1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}\Big)}
+2c^{2}\frac{M_{2}^{2}}{\beta},

where $\mathbb{V}(\mathbf{w})=\frac{\|\nabla F_{S}(\mathbf{w})\|}{\mathbb{E}_{U\subset S}\|\frac{|U|}{n}\nabla F_{U}(\mathbf{w})-\frac{|S|-|U|}{n}\nabla F_{S/U}(\mathbf{w})\|}$, $\mathbb{V}_{m}=\max_{t}\mathbb{V}(\mathbf{J_{t}})$ and $\gamma^{\prime}=\max\{1,\max_{U\subset S;t}\frac{|U|\|\nabla F_{U}(\mathbf{J_{t}})\|}{n\|\nabla F_{S}(\mathbf{J_{t}})\|}\}\gamma$.

Proof.

If $f(\cdot)$ is $\beta$-smooth, we have:

f(\mathbf{J_{t+1}})-f(\mathbf{J_{t}})\leq(\mathbf{J_{t+1}}-\mathbf{J_{t}})^{\mathrm{T}}\nabla f(\mathbf{J_{t}})+\frac{1}{2}\beta\|\mathbf{J_{t+1}}-\mathbf{J_{t}}\|^{2},   (42)
f(\mathbf{J_{t+1}})-f(\mathbf{J_{t}})\geq(\mathbf{J_{t+1}}-\mathbf{J_{t}})^{\mathrm{T}}\nabla f(\mathbf{J_{t}})-\frac{1}{2}\beta\|\mathbf{J_{t+1}}-\mathbf{J_{t}}\|^{2}.   (43)

Combining the two inequalities, we obtain:

|\operatorname{gen}^{nl}(\mathbf{J_{T}})|\leq|F_{\mu}(\mathbf{J_{T}})-F_{\mu}^{lin}(\mathbf{J_{T}})|+|F_{S}(\mathbf{J_{T}})-F_{S}^{lin}(\mathbf{J_{T}})|   (44)
\leq\frac{\beta}{2}\sum_{t=0}^{T-1}\|\mathbf{J_{t+1}}-\mathbf{J_{t}}\|^{2}+\frac{\beta}{2}\sum_{t=0}^{T-1}\|\mathbf{J_{t+1}}-\mathbf{J_{t}}\|^{2}
=\beta\sum_{t=0}^{T-1}\|\mathbf{J_{t+1}}-\mathbf{J_{t}}\|^{2}.

The generalization error can be divided into three parts:

\mathbb{E}[F_{\mu}(\mathbf{J_{T}})-F_{S}(\mathbf{J_{T}})]\leq-2\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}   (45)
\underbrace{-2\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}\int_{t}\frac{dF^{lin}_{S}(\mathbf{J_{t}})-dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}}_{(A)}
+\underbrace{\beta\,\mathbb{E}\sum_{t=0}^{T-1}\|\mathbf{J_{t+1}}-\mathbf{J_{t}}\|^{2}}_{(B)}.

The term $(A)$ arises from replacing $F^{lin}_{S}(\mathbf{J_{t+1}})-F^{lin}_{S}(\mathbf{J_{t}})$ with $F_{S}(\mathbf{J_{t+1}})-F_{S}(\mathbf{J_{t}})$. The term $(B)$ is induced by $\operatorname{gen}^{nl}(\mathbf{J_{T}})$. We first give an upper bound of $(A)$ using $M_{4}^{4}$:

(A)\overset{(\star)}{\leq}2\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}\sum_{t=0}^{T-1}\frac{\beta\|\mathbf{J_{t+1}}-\mathbf{J_{t}}\|^{2}}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}   (46)
\overset{(\star\star)}{\leq}2c^{2}\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}\sum_{t=0}^{T-1}\frac{\|\nabla F_{S}(\mathbf{J_{t}})+\epsilon(\mathbf{J_{t}})\|^{2}}{\beta\sqrt{n}(t+1)^{2}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}
=2c^{2}\gamma^{\prime}\mathbb{V}_{m}\sum_{t=0}^{T-1}\mathbb{E}_{t-1}\frac{\|\nabla F_{S}(\mathbf{J_{t}})+\epsilon(\mathbf{J_{t}})\|^{2}}{\beta\sqrt{n}(t+1)^{2}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}
\overset{(\star\star\star)}{\leq}2c^{2}\gamma^{\prime}\mathbb{V}_{m}\sum_{t=0}^{T-1}\sqrt{\frac{\mathbb{E}_{t-1}\|\nabla F_{S}(\mathbf{J_{t}})+\epsilon(\mathbf{J_{t}})\|^{4}}{\beta^{2}n(t+1)^{4}}\mathbb{E}_{t-1}\Big(1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}\Big)}
\leq 2c^{2}\gamma^{\prime}\mathbb{V}_{m}\sum_{t=0}^{T-1}\sqrt{\frac{M_{4}^{4}}{\beta^{2}n(t+1)^{4}}\mathbb{E}_{t-1}\Big(1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}\Big)}
\leq 2c^{2}\gamma^{\prime}\mathbb{V}_{m}M_{4}^{2}\sqrt{\mathbb{E}\sum_{t=0}^{T-1}\frac{1}{\beta^{2}n(t+1)^{4}}\Big(1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}\Big)},

where $(\star)$ is due to Equation (44), $(\star\star)$ is due to the update rule of $\mathbf{J_{t}}$, and $(\star\star\star)$ is due to Hölder's inequality. Next, we use $M_{2}^{2}$ to give an upper bound for $(B)$:

(B)\leq\frac{c^{2}}{\beta}\sum_{t=0}^{T-1}\frac{1}{(t+1)^{2}}\mathbb{E}\|\nabla F_{S}(\mathbf{J_{t}})+\epsilon(\mathbf{J_{t}})\|^{2}   (47)
\leq\frac{c^{2}}{\beta}\sum_{t=0}^{T-1}\frac{1}{(t+1)^{2}}M_{2}^{2}
\leq\frac{c^{2}}{\beta}\Big(M_{2}^{2}+\sum_{t=1}^{T-1}\frac{1}{(t+1)^{2}}M_{2}^{2}\Big)
\leq\frac{c^{2}}{\beta}\Big(M_{2}^{2}+\sum_{t=1}^{T-1}\Big(\frac{1}{t}-\frac{1}{t+1}\Big)M_{2}^{2}\Big)
\leq\frac{c^{2}}{\beta}\Big(2M_{2}^{2}-\frac{1}{T}M_{2}^{2}\Big)
\leq 2c^{2}\frac{M_{2}^{2}}{\beta}.
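As a quick sanity check (ours, not part of the proof) of the telescoping step $\sum_{t=0}^{T-1}\frac{1}{(t+1)^{2}}\leq 2-\frac{1}{T}$ used above:

```python
# Numeric check: sum_{t=0}^{T-1} 1/(t+1)^2 <= 2 - 1/T for several T.
for T in (1, 2, 10, 1000):
    s = sum(1.0 / (t + 1) ** 2 for t in range(T))
    assert s <= 2 - 1 / T
    print(f"T={T}: {s:.4f} <= {2 - 1 / T:.4f}")
```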

Substituting the upper bounds of $(A)$ and $(B)$ into Equation (45), we obtain the result. ∎

Appendix B Relaxed Assumption and Corresponding Bound

Assumption B.1.

There exist values $\gamma$, $T_{0}$ and $\zeta$ such that for all $\mathbf{w}\in\{\mathbf{J_{t}}\ |\ t\in\mathbb{N}\ \wedge\ t<T_{0}\}$ we have $\|\nabla F_{\mu}(\mathbf{w})\|\leq\gamma\|\nabla F_{S}(\mathbf{w})\|$, and for all $\mathbf{w}\in\{\mathbf{J_{t}}\ |\ t\in\mathbb{N}\ \wedge\ t\geq T_{0}\}$ we have $\|\nabla F_{\mu}(\mathbf{w})\|\leq\gamma\|\nabla F_{S}(\mathbf{w})\|+\zeta$.

Theorem B.2.

Under Assumption B.1, given $S\sim\mu^{n}$, let $\mathbf{J}=\mathcal{A}(S)$, where $\mathcal{A}$ denotes the SGD or GD algorithm training for $T$ steps. We have:

\mathbb{E}[F_{\mu}(\mathbf{J_{T}})-F_{S}(\mathbf{J_{T}})]\leq-2\gamma^{\prime}\mathbb{V}_{m}\mathbb{E}\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}+\frac{1}{2}\sum_{t=T_{0}}^{T}\eta_{t}\|\nabla F_{S}(\mathbf{J_{t}})\|\zeta+\mathcal{O}(\eta_{m}),   (48)

where $\mathbb{V}(\mathbf{w})=\frac{\|\nabla F_{S}(\mathbf{w})\|}{\mathbb{E}_{U\subset S}\|\frac{|U|}{n}\nabla F_{U}(\mathbf{w})-\frac{|S|-|U|}{n}\nabla F_{S/U}(\mathbf{w})\|}$, $\mathbb{V}_{m}=\max_{t}\mathbb{V}(\mathbf{J_{t}})$ and $\gamma^{\prime}=\max\{1,\max_{U\subset S;t}\frac{|U|\|\nabla F_{U}(\mathbf{J_{t}})\|}{n\|\nabla F_{S}(\mathbf{J_{t}})\|}\}\gamma$.

Proof.

Most of the proof is the same as in Appendix A, except that Equation (30) is replaced by:

k\,\mathbb{E}_{\sigma}\sup_{h\in\mathcal{L}_{\mathbf{J}|S}}\Big(\frac{1}{n}\sum_{i}^{n}\sigma_{i}h(z_{i})\Big)-\mathbb{E}_{S^{\prime},\sigma}\sup_{h\in\mathcal{F}_{\mathbf{J}|S}}\Big(\frac{1}{n}\sum_{i}^{n}\sigma_{i}h(z_{i}^{\sigma_{i}})\Big)   (49)
=k\,\mathbb{E}_{\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\|g_{S_{+}}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}})\|\Big)-\mathbb{E}_{S^{\prime},\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\frac{g_{S}(\mathbf{J_{t}})^{\mathrm{T}}}{\|g_{S}(\mathbf{J_{t}})\|}(g_{S^{\prime}_{+}}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}}))\Big)
=k\,\mathbb{E}_{\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\|g_{S_{+}}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}})\|\Big)-\mathbb{E}_{\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\frac{g_{S}(\mathbf{J_{t}})^{\mathrm{T}}}{\|g_{S}(\mathbf{J_{t}})\|}(|S_{+}|\nabla F_{\mu}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}}))\Big)
\geq k\,\mathbb{E}_{\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\|g_{S_{+}}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}})\|\Big)-\mathbb{E}_{\sigma}\Big(\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\||S_{+}|\nabla F_{\mu}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}})\|\Big)
\geq k\,\frac{1}{n}\sum_{t=0}^{T-1}\delta_{t}\,\mathbb{E}_{\sigma}\|g_{S_{+}}(\mathbf{J_{t}})-g_{S_{-}}(\mathbf{J_{t}})\|-\sum_{t=0}^{T-1}\delta_{t}\gamma^{\prime}\|\nabla F_{S}(\mathbf{J_{t}})\|-\frac{1}{2}\sum_{t=T_{0}}^{T}\delta_{t}\zeta
\geq-\frac{1}{2}\sum_{t=T_{0}}^{T}\eta_{t}\|\nabla F_{S}(\mathbf{J_{t}})\|\zeta.  ∎

Remark B.3.

Compared with Theorem 3.6, there is an extra term $\frac{1}{2}\sum_{t=T_{0}}^{T}\eta_{t}\|\nabla F_{S}(\mathbf{J_{t}})\|\zeta$ here. Since the unrelaxed assumption $\|\nabla F_{\mu}(\mathbf{w})\|\leq\gamma\|\nabla F_{S}(\mathbf{w})\|$ fails only when $\|\nabla F_{S}(\mathbf{w})\|$ is relatively small, this extra term has a small value.

Appendix C Experiments

C.1 Calculation of $\mathcal{C}(\mathbf{J})$

To reduce the computational cost, we construct a randomly sampled subset $S_{\operatorname{sp}}=\{z^{\operatorname{sp}}_{1},...,z^{\operatorname{sp}}_{n_{\operatorname{sp}}}\}\subset S$:

\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}=\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n}\|\nabla f(\mathbf{J_{t}},z_{i})\|_{2}^{2}}{n}\frac{1}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}
\approx\int_{t}\frac{dF_{S}(\mathbf{J_{t}})}{\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n_{\operatorname{sp}}}\|\nabla f(\mathbf{J_{t}},z^{\operatorname{sp}}_{i})\|_{2}^{2}}{n_{\operatorname{sp}}}\frac{1}{\|\nabla F_{S}(\mathbf{J_{t}})\|_{2}^{2}}}.

Denote the weights after $t$ epochs of training as $\mathbf{X_{t}}$. We can then roughly calculate $\mathcal{C}(\mathbf{J})$ as:

\sum_{t=1}^{T}\frac{F_{S}(\mathbf{X_{t}})-F_{S}(\mathbf{X_{t-1}})}{\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n_{\operatorname{sp}}}\|\nabla f(\mathbf{X_{t}},z^{\operatorname{sp}}_{i})\|_{2}^{2}}{n_{\operatorname{sp}}}\frac{1}{\|\nabla F_{S}(\mathbf{X_{t}})\|_{2}^{2}}}.
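For concreteness, the following PyTorch sketch estimates this sum along a toy training run. The model, data, and all names are our own illustrative choices (not the paper's released code), and the subset size $n_{\operatorname{sp}}$ is set arbitrarily:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.Tanh(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()
X, y = torch.randn(512, 20), torch.randint(0, 10, (512,))   # training set S (n = 512)
X_sp, y_sp = X[:32], y[:32]                                 # subset S_sp (n_sp = 32)

def grad_sq_norm(inputs, targets):
    """Squared norm of the gradient of the mean loss over (inputs, targets)."""
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()
    return sum(p.grad.pow(2).sum().item() for p in model.parameters())

opt = torch.optim.SGD(model.parameters(), lr=0.05)
complexity = 0.0
prev_loss = loss_fn(model(X), y).item()                     # F_S(X_0)
for epoch in range(20):
    for i in range(0, len(X), 128):                         # one epoch of SGD
        opt.zero_grad()
        loss_fn(model(X[i:i+128]), y[i:i+128]).backward()
        opt.step()
    per_sample = [grad_sq_norm(X_sp[j:j+1], y_sp[j:j+1]) for j in range(len(X_sp))]
    full = grad_sq_norm(X, y)                               # ||grad F_S(X_t)||^2
    cur_loss = loss_fn(model(X), y).item()                  # F_S(X_t)
    ratio = (sum(per_sample) / len(per_sample)) / full      # ~ 1 + Tr(Sigma)/||grad F_S||^2
    complexity += (cur_loss - prev_loss) / len(X) ** 0.5 * ratio ** 0.5
    prev_loss = cur_loss

print("trajectory term:", complexity)  # the bound multiplies the negated sum by 2*gamma'*V_m
```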

C.2 Experimental Details

Here, we give the detailed experimental settings for each figure.

Figure 1  The learning rate is fixed at 0.05 throughout training. The batch size is 256. All experiments are trained for 100 epochs. The test accuracies for CIFAR-10, CIFAR-100, and SVHN are 87.64%, 55.08%, and 92.80%, respectively.

Figure 2  The initial learning rate is set to 0.05 with a batch size of 1024. We use a cosine annealing learning rate schedule to adjust the learning rate during training.

Figure 3  Each point is an average over three repeated experiments. We stop training when the training loss is smaller than 0.2.

C.3 Experimental Exploration of $\operatorname{gen}^{nl}(\mathbf{J_{T}})$

In this section, we investigate the conditions under which $\operatorname{gen}^{nl}(\mathbf{J_{T}})\approx 0$. Since directly calculating $\operatorname{gen}^{nl}(\mathbf{J_{T}})$ is challenging, we concentrate on its upper bound $|F_{\mu}(\mathbf{J_{T}})-F_{\mu}^{lin}(\mathbf{J_{T}})|+|F_{S}(\mathbf{J_{T}})-F_{S}^{lin}(\mathbf{J_{T}})|$.

We conduct the experiment on the cifar10-5k dataset with an fc-tanh network, following the settings of [9]. Cifar10-5k [9] is a subset of the CIFAR-10 dataset. Building upon the work of [1], we compute the Relative Progress Ratio (RP) and Test Relative Progress Ratio (TRP) throughout the training process. We first consider the case of gradient descent. The definitions of RP and TRP for gradient descent are as follows:

\operatorname{RP}(\mathbf{J_{t}})\triangleq\frac{F_{S}(\mathbf{J_{t+1}})-F_{S}(\mathbf{J_{t}})}{\eta\|\nabla F_{S}(\mathbf{J_{t}})\|^{2}},   (50)
\operatorname{TRP}(\mathbf{J_{t}})\triangleq\frac{F_{S^{\prime}}(\mathbf{J_{t+1}})-F_{S^{\prime}}(\mathbf{J_{t}})}{\eta\nabla F_{S}(\mathbf{J_{t}})^{\mathrm{T}}\nabla F_{S^{\prime}}(\mathbf{J_{t}})}.   (51)

Therefore, we have:

F_{S}(\mathbf{J_{0}})+\sum_{t=1}^{T}(F_{S}(\mathbf{J_{t}})-F_{S}(\mathbf{J_{t-1}}))-F_{S}(\mathbf{J_{0}})-\sum_{t=1}^{T}(\mathbf{J_{t}}-\mathbf{J_{t-1}})^{\mathrm{T}}\nabla F_{S}(\mathbf{J_{t-1}})   (52)
=\sum_{t=1}^{T}\big[(F_{S}(\mathbf{J_{t}})-F_{S}(\mathbf{J_{t-1}}))-(\mathbf{J_{t}}-\mathbf{J_{t-1}})^{\mathrm{T}}\nabla F_{S}(\mathbf{J_{t-1}})\big]
=\sum_{t=1}^{T}\big[(F_{S}(\mathbf{J_{t}})-F_{S}(\mathbf{J_{t-1}}))+\eta\|\nabla F_{S}(\mathbf{J_{t-1}})\|^{2}\big]
=\sum_{t=1}^{T}\big[\eta(1+\operatorname{RP}(\mathbf{J_{t-1}}))\|\nabla F_{S}(\mathbf{J_{t-1}})\|^{2}\big].

In the same way, we have:

F_{S^{\prime}}(\mathbf{J_{0}})+\sum_{t=1}^{T}(F_{S^{\prime}}(\mathbf{J_{t}})-F_{S^{\prime}}(\mathbf{J_{t-1}}))-F_{S^{\prime}}(\mathbf{J_{0}})-\sum_{t=1}^{T}(\mathbf{J_{t}}-\mathbf{J_{t-1}})^{\mathrm{T}}\nabla F_{S^{\prime}}(\mathbf{J_{t-1}})   (53)
=\sum_{t=1}^{T}\big[\eta(1+\operatorname{TRP}(\mathbf{J_{t-1}}))\nabla F_{S}(\mathbf{J_{t-1}})^{\mathrm{T}}\nabla F_{S^{\prime}}(\mathbf{J_{t-1}})\big].
Figure 4: Exploration of $\operatorname{gen}^{nl}(\mathbf{J_{T}})$ for gradient descent. Experiments are conducted on the cifar10-5k dataset with cross-entropy loss. The blue dashed line in the fourth row denotes $\frac{2}{\eta}$. Gradient descent enters the EoS regime when the sharpness rises above $\frac{2}{\eta}$. Both RP and TRP take values around -1 when the sharpness is below $\frac{2}{\eta}$.

Combining Equation (52) and Equation (53), we have:

|\operatorname{gen}^{nl}(\mathbf{J_{T}})|\leq\sum_{t=1}^{T}\eta\big[(1+\operatorname{TRP}(\mathbf{J_{t-1}}))|\nabla F_{S}(\mathbf{J_{t-1}})^{\mathrm{T}}\nabla F_{S^{\prime}}(\mathbf{J_{t-1}})|+(1+\operatorname{RP}(\mathbf{J_{t-1}}))\|\nabla F_{S}(\mathbf{J_{t-1}})\|^{2}\big].   (54)

Therefore, if $\operatorname{RP}(\mathbf{J_{t}})\approx-1$ and $\operatorname{TRP}(\mathbf{J_{t}})\approx-1$ for all $t$, then $|\operatorname{gen}^{nl}(\mathbf{J_{T}})|\approx 0$.

From Figure 4 we find that in the stable regime, where the sharpness is below $\frac{2}{\eta}$, we have $\operatorname{TRP}\approx\operatorname{RP}\approx-1$. Under a small learning rate, gradient descent does not enter the edge-of-stability regime, so $\operatorname{TRP}\approx\operatorname{RP}\approx-1$ throughout training and $\operatorname{gen}^{nl}(\mathbf{J_{T}})\approx 0$.
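To make the computation of RP and TRP concrete, here is a minimal PyTorch sketch of Equations (50) and (51) along a full-batch gradient descent run; the toy network, data, loss, and all names are illustrative assumptions rather than the paper's actual experimental setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
mse = nn.MSELoss()
Xs, ys = torch.randn(256, 10), torch.randn(256, 1)  # training set S
Xp, yp = torch.randn(256, 10), torch.randn(256, 1)  # held-out set S'
eta = 0.05

def loss_and_grad(X, y):
    net.zero_grad()
    loss = mse(net(X), y)
    loss.backward()
    return loss.item(), torch.cat([p.grad.flatten() for p in net.parameters()])

for step in range(100):
    f_s, g_s = loss_and_grad(Xs, ys)                # F_S(J_t), grad F_S(J_t)
    f_p, g_p = loss_and_grad(Xp, yp)                # F_S'(J_t), grad F_S'(J_t)
    with torch.no_grad():                           # GD step: J_{t+1} = J_t - eta * grad F_S(J_t)
        i = 0
        for p in net.parameters():
            p -= eta * g_s[i:i + p.numel()].view_as(p)
            i += p.numel()
    rp = (mse(net(Xs), ys).item() - f_s) / (eta * g_s.dot(g_s).item())   # Eq. (50)
    trp = (mse(net(Xp), yp).item() - f_p) / (eta * g_s.dot(g_p).item())  # Eq. (51)
    if step % 20 == 0:
        print(f"step {step:3d}: RP = {rp:+.3f}, TRP = {trp:+.3f}")  # both ~ -1 in the stable regime
```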

Next, we consider the case of Stochastic Gradient Descent (SGD). Due to the stochastic estimation of the gradient, we need to rely on some approximations. Let $\mathbf{X_{t}^{i}}$ represent the weights after $t$ epochs and $i$ iterations of training. We assume a constant learning rate $\eta$ for SGD. The gradient is approximated as follows:

\eta\nabla F_{S}(\mathbf{X_{t}})\approx\frac{B}{n}(\mathbf{X_{t}}-\mathbf{X_{t+1}})=\frac{\eta B}{n}\sum_{i=1}^{\frac{n}{B}}\nabla F_{S}(\mathbf{X_{t}^{i}}),   (55)

and we approximate $\nabla F_{S^{\prime}}(\mathbf{X_{t}^{i}})$ as:

\eta\nabla F_{S^{\prime}}(\mathbf{X_{t}^{i}})\approx\eta\nabla F_{S^{\prime}}(\mathbf{X_{t}}).   (56)

Therefore, we have:

\operatorname{RP}(\mathbf{X_{t}})\approx\frac{\eta_{ef}(F_{S}(\mathbf{X_{t+1}})-F_{S}(\mathbf{X_{t}}))}{\|\mathbf{X_{t+1}}-\mathbf{X_{t}}\|^{2}},   (57)
\operatorname{TRP}(\mathbf{X_{t}})\approx\frac{F_{S^{\prime}}(\mathbf{X_{t+1}})-F_{S^{\prime}}(\mathbf{X_{t}})}{(\mathbf{X_{t}}-\mathbf{X_{t+1}})^{\mathrm{T}}\nabla F_{S^{\prime}}(\mathbf{X_{t}})},   (58)
where $\eta_{ef}\triangleq\frac{n}{B}\eta$ is the effective learning rate for SGD.
Figure 5: Exploration of $\operatorname{gen}^{nl}(\mathbf{J_{T}})$ for the SGD case. Here, the effective learning rate is defined as $\eta_{ef}\triangleq\frac{n}{B}\eta$. We still have $\operatorname{gen}^{nl}(\mathbf{J_{T}})\approx 0$ under a small learning rate.

Figure 5 shows that the conclusions for SGD are similar to those for GD, except that the conditions for entering the EoS regime differ.
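Under the approximations in Equations (55)-(58), RP and TRP for SGD can be estimated purely from epoch-boundary quantities. A small helper sketch (function and argument names are our own; w_prev and w_next are flattened weight snapshots at $\mathbf{X_{t}}$ and $\mathbf{X_{t+1}}$, and grad_p is the held-out gradient at $\mathbf{X_{t}}$):

```python
def sgd_rp(f_s_next, f_s, w_next, w_prev, eta, n, B):
    """Epoch-level RP estimate, Eq. (57): eta_ef * dF_S / ||X_{t+1} - X_t||^2."""
    eta_ef = (n / B) * eta                      # effective learning rate
    dw = w_next - w_prev
    return eta_ef * (f_s_next - f_s) / float(dw.dot(dw))

def sgd_trp(f_p_next, f_p, w_next, w_prev, grad_p):
    """Epoch-level TRP estimate, Eq. (58): dF_S' / <X_t - X_{t+1}, grad F_S'(X_t)>."""
    return (f_p_next - f_p) / float((w_prev - w_next).dot(grad_p))
```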

Appendix D Other Related Work

Table 4: Comparison of trajectory-based generalization bounds. Only our proposed method applies to SGD while conveying rich trajectory-related information.

Method | Conditions | Trajectory-Related Term (T.R.T)
Nikolakakis et al. [25] | Gradient descent, $\eta_{t}\leq\frac{c}{t}\leq\frac{1}{\beta}$, $\beta$-smooth | $\sum_{t=1}^{T}\eta_{t}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(\mathbf{J_{t}},z_{i})\|^{2}$
Neu et al. [24] | $\beta$-smooth, $\mathbb{E}[\|\nabla f(\mathbf{w},z)-\nabla F_{\mu}(\mathbf{w})\|]\leq v$, $f(\cdot)$ is sub-Gaussian | $\sqrt{T\eta^{2}}$
Park et al. [27] | Weak Lipschitz continuity, piecewise $\beta^{\prime}$-smooth, $f(\cdot)$ is bounded, $\eta<\frac{2}{\beta}$ | $T$
Ours | Small learning rate, $\|\nabla F_{\mu}(\mathbf{w})\|\leq\gamma\|\nabla F_{S}(\mathbf{w})\|$ | $\int_{t}dF_{S}(\mathbf{J_{t}})\sqrt{1+\frac{\operatorname{Tr}(\Sigma(\mathbf{J_{t}}))}{\|\nabla F_{S}(\mathbf{J_{t}})\|^{2}}}$

This part compares works that are not listed in Table 2. Table 4 lists other trajectory-based generalization bounds. Nikolakakis et al. [25] is a stability-based work designed mainly for the generalization of gradient descent. It removes the Lipschitz assumption and replaces it with the term $\sum_{t=1}^{T}\eta_{t}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(\mathbf{J_{t}},z_{i})\|^{2}$ in the generalization bound, which helps enrich the trajectory information in the bound. The limitation of this work is that it only applies to gradient descent and is hard to extend to stochastic gradient descent. Neu et al. [24] adapt the information-theoretic generalization bound to stochastic gradient descent. Theorem 1 in Neu et al. [24] contains rich information about the learning trajectory, but most of it involves $\nabla F_{\mu}(\mathbf{w})$, which is unavailable in practice. Therefore, we mainly consider the result of Corollary 2 in Neu et al. [24], which removes the term $\nabla F_{\mu}(\mathbf{w})$ via the assumptions listed in Table 4. For this corollary, the only remaining trajectory information is $\sqrt{T\eta^{2}}$. Although Neu et al. [24] do not require the small learning rate assumption, their bound contains the dimension of the model, which is large for deep neural networks. Compared with these works, our proposed method has the advantage that it both reveals rich information about the learning trajectory and applies to stochastic gradient descent.

Chandramoorthy et al. [8] analyze generalization behavior based on statistical algorithmic stability. Their generalization bound can be applied to algorithms that do not converge. Let $S^{(i)}$ be the dataset obtained by replacing $z_{i}$ in $S$ with another sample $z_{i}^{\prime}$ drawn from the distribution $\mu$. The generalization bound relies on the stability measure $m\triangleq\sup\{\frac{1}{T}\sum_{t=0}^{T-1}f(\mathbf{J_{t}}|S,z)-\frac{1}{T}\sum_{t=0}^{T-1}f(\mathbf{J_{t}}|S^{(i)},z)\ |\ z\in\mathcal{Z},i\in[n]\}$. We do not compare directly with this method because the calculation of $m$ relies on $S^{(i)}$, which contains a sample outside of $S$; we therefore treat this result as an intermediate one. More assumptions are needed to remove this dependence on information about unseen samples, i.e., samples outside the set $S$.

Appendix E Effect of Learning Rate and Stochastic Noise

In this part, we analyze how the learning rate and the stochastic noise jointly affect our proposed generalization bound. Specifically, we denote by $p_{t}(\mathbf{w})$ the distribution of $\mathbf{J_{t}}$ during training over multiple training steps. Following the work of [12], we consider the following SDE as an approximation:

\mathrm{d}\mathbf{w}=-\nabla F_{S}(\mathbf{w})\mathrm{d}t+\sqrt{\eta}\,C^{\frac{1}{2}}\mathrm{d}\mathbf{W}(t).   (59)

The SDE can be regarded as the continuous counterpart of Equation (3) when the distribution of the noise term $\epsilon^{\prime}$ in Equation (3) is set to a Gaussian distribution. The influence of the noise $\epsilon$ on $p_{t}(\mathbf{w})$ is characterized by the following theorem.

Theorem E.1.

When the update of the weight $\mathbf{w}$ follows Equation (59) and the covariance matrix $C$ is the Hessian of a function with a scalar output, we have:

\frac{\partial p_{t}(\mathbf{w})}{\partial t}=\sum_{i=1}^{d}\frac{\partial}{\partial\mathbf{w_{i}}}\Big[\nabla F_{S}(\mathbf{w})p_{t}(\mathbf{w})+\frac{\eta}{2}\big[\nabla\operatorname{Tr}(C(\mathbf{w}))+\underbrace{C(\mathbf{w})\nabla_{w}\log(p_{t}(\mathbf{w}))}_{\text{damping factor}}\big]p_{t}(\mathbf{w})\Big].   (60)
Remark E.2.

Previous studies ([44, 31, 12]) show that the covariance matrix $C$ is approximately equal to the Hessian of the loss function with respect to the parameters of the DNN. Thus, the above condition that the covariance matrix $C$ is the Hessian of a scalar-valued function is easily satisfied. Formula (60) contains three parts. The term $\nabla F_{S}(\mathbf{w})p_{t}(\mathbf{w})$ increases the probability of the parameters being located in regions of the parameter space with low $F_{S}(\mathbf{w})$. $\nabla\operatorname{Tr}(C(\mathbf{w}))$ and $C(\mathbf{w})\nabla_{w}\log(p_{t}(\mathbf{w}))$ usually counteract each other: $\nabla\operatorname{Tr}(C(\mathbf{w}))$ increases the probability of the parameters being located in regions with low $\operatorname{Tr}(C(\mathbf{w}))$, while $C(\mathbf{w})\nabla_{w}\log(p_{t}(\mathbf{w}))$ serves as a damping factor that prevents the probability from concentrating in a small region. Therefore, a larger learning rate exerts a stronger force driving the weights toward regions with lower $\operatorname{Tr}(C(\mathbf{w}))$. According to Equation (5), this also gives a lower $\Sigma(\mathbf{w})$. As a result, a large learning rate yields a smaller bound value in Theorem 3.6.
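To make this decomposition concrete, the following numpy sketch evaluates the three 1-D drift contributions implied by Equation (60) for toy choices of $F_{S}$, $C$, and a Gaussian $p_{t}$ (all our own illustrative assumptions), showing that the $\operatorname{Tr}(C)$ force scales linearly with $\eta$:

```python
import numpy as np

# Drift velocity implied by Eq. (60): v(w) = -( F'(w) + (eta/2) TrC'(w) + (eta/2) C(w) (log p)'(w) ).
w = np.linspace(-2.0, 2.0, 401)
F_grad = 4 * w * (w**2 - 1)            # gradient of toy loss F_S(w) = (w^2 - 1)^2
C = 1.5 + np.tanh(w)                   # positive, increasing Tr(C(w))
dTrC = np.gradient(C, w[1] - w[0])     # d/dw Tr(C(w))
dlogp = -w                             # d/dw log p_t(w) for p_t = N(0, 1)

for eta in (0.05, 0.5):
    loss_force = -F_grad               # pulls w toward the minima of F_S
    noise_force = -0.5 * eta * dTrC    # pulls w toward low Tr(C) (here, w < 0)
    damping = -0.5 * eta * C * dlogp   # opposes concentration of p_t
    print(f"eta={eta}: mean Tr(C)-force = {noise_force.mean():+.4f}")
# The Tr(C)-force grows linearly with eta, matching the remark above.
```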

Proof.

Based on the condition described above, we can write $C(\mathbf{w})=\nabla\nabla G(\mathbf{w})$, where $G$ is a function with a scalar output.

We first prove that $\nabla\cdot C(\mathbf{w})=\nabla\operatorname{Tr}(C(\mathbf{w}))$ as follows:

[\nabla\cdot C(\mathbf{w})]_{j}=[\nabla\cdot\nabla\nabla G(\mathbf{w})]_{j}   (61)
=\sum_{i}\frac{\partial}{\partial w_{i}}\frac{\partial}{\partial w_{i}}\frac{\partial}{\partial w_{j}}G(\mathbf{w})
=\frac{\partial}{\partial w_{j}}\sum_{i}\frac{\partial}{\partial w_{i}}\frac{\partial}{\partial w_{i}}G(\mathbf{w})
=\frac{\partial}{\partial w_{j}}\operatorname{Tr}(C(\mathbf{w})).
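This identity can also be checked symbolically; a short sympy sketch with an arbitrary smooth scalar $G$ of our own choosing:

```python
import sympy as sp

# Check div(C) = grad(Tr C) for C = Hessian(G), as in Eq. (61).
w1, w2, w3 = sp.symbols("w1 w2 w3")
ws = [w1, w2, w3]
G = sp.exp(w1 * w2) + sp.sin(w2 * w3) + w1**2 * w3            # any smooth scalar G
C = sp.Matrix([[sp.diff(G, a, b) for b in ws] for a in ws])   # C = Hessian of G

div_C = sp.Matrix([sum(sp.diff(C[i, j], ws[i]) for i in range(3)) for j in range(3)])
grad_trC = sp.Matrix([sp.diff(C.trace(), a) for a in ws])
print(sp.simplify(div_C - grad_trC))                          # zero vector -> identity holds
```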

Thus $\nabla\cdot C=\nabla\operatorname{Tr}(C)$. According to the Fokker-Planck equation [26], we have:

\frac{\partial p_{t}(\mathbf{w})}{\partial t}=\sum_{i=1}^{d}\frac{\partial}{\partial\mathbf{w_{i}}}\big[\nabla F_{S}(\mathbf{w})p_{t}(\mathbf{w})\big]+\frac{1}{2}\eta\sum_{i=1}^{d}\frac{\partial}{\partial\mathbf{w_{i}}}\Big[\sum_{j}^{d}\frac{\partial}{\partial\mathbf{w_{j}}}\big[C(\mathbf{w})p_{t}(\mathbf{w})\big]\Big]   (62)
=\sum_{i=1}^{d}\frac{\partial}{\partial\mathbf{w_{i}}}\big[\nabla F_{S}(\mathbf{w})p_{t}(\mathbf{w})\big]+\frac{1}{2}\eta\sum_{i=1}^{d}\frac{\partial}{\partial\mathbf{w_{i}}}\big[p_{t}(\mathbf{w})\nabla\cdot C+p_{t}(\mathbf{w})C\nabla_{w}\log p_{t}(\mathbf{w})\big]
=\sum_{i=1}^{d}\frac{\partial}{\partial\mathbf{w_{i}}}\Big[\nabla F_{S}(\mathbf{w})p_{t}(\mathbf{w})+\frac{1}{2}\eta\big[\nabla\cdot C(\mathbf{w})+C(\mathbf{w})\nabla_{w}\log p_{t}(\mathbf{w})\big]p_{t}(\mathbf{w})\Big]
=\sum_{i=1}^{d}\frac{\partial}{\partial\mathbf{w_{i}}}\Big[\nabla F_{S}(\mathbf{w})p_{t}(\mathbf{w})+\frac{1}{2}\eta\big[\nabla\operatorname{Tr}(C(\mathbf{w}))+C(\mathbf{w})\nabla_{w}\log p_{t}(\mathbf{w})\big]p_{t}(\mathbf{w})\Big].

Therefore, the theorem is proven. ∎