
A Method for Enhancing Generalization of Adam by Multiple Integrations

Long Jin1, Han Nong1, Liangming Chen1, Zhenming Su1
Abstract

The insufficient generalization of adaptive moment estimation (Adam) has hindered its broader application. Recent studies have shown that flat minima in loss landscapes are highly associated with improved generalization. Inspired by the filtering effect of integration operations on high-frequency signals, we propose multiple integral Adam (MIAdam), a novel optimizer that integrates a multiple integral term into Adam. This multiple integral term effectively filters out sharp minima encountered during optimization, guiding the optimizer towards flatter regions and thereby enhancing generalization capability. We provide a theoretical explanation for the improvement in generalization through the diffusion theory framework and analyze the impact of the multiple integral term on the optimizer’s convergence. Experimental results demonstrate that MIAdam not only enhances generalization and robustness against label noise but also maintains the rapid convergence characteristic of Adam, outperforming Adam and its variants in state-of-the-art benchmarks.

Code: https://github.com/LongJin-lab/MIAdam

Introduction

An appropriate optimizer is essential to train a deep neural network (DNN), as it directly affects the training convergence and performance of a model (Yao et al. 2021). The goal of optimizers is usually to minimize (or maximize) a certain objective function, typically a loss function, which measures the gap between the predictions and ground-truth values. As a traditional method, stochastic gradient descent (SGD) is a commonly used optimizer for training DNNs (Deng et al. 2023). However, SGD suffers from certain limitations, such as the need to precisely tune the learning rate, the uniform scaling of gradients in all directions, and the risk of being trapped in saddle points (Johnson et al. 2020; Liu et al. 2021). In order to address these challenges, adaptive learning rate optimizers have been developed, offering more nuanced control over learning rates and improved convergence in diverse training scenarios. Among them, adaptive moment estimation (Adam) (Kingma and Ba 2015) is currently one of the most popular adaptive learning rate optimizers for its rapid convergence and efficient handling of sparse gradients. The combination of first-order and second-order moments in Adam enables the effective incorporation of momentum-based optimization and adaptive learning rate methods, thereby enhancing its overall efficiency and applicability in various neural network training contexts.

Figure 1: The idea of this work and the filtering effect of integrations on optimizer trajectories. The blue integrated trajectory represents an equivalent path that does not actually exist on the original loss landscape.

Despite being widely used, Adam also exhibits certain limitations, such as inferior generalization compared to SGD in some scenarios (Wilson et al. 2017; Luo et al. 2019; Zou et al. 2021). Therefore, several enhanced variants of Adam have been developed to alleviate the issue of poor generalization. Switching from Adam to SGD (SWATS) (Keskar and Socher 2017) starts training with the Adam optimizer and then automatically switches to SGD, aiming to improve the model's generalization performance. However, this method does not fully maintain the original convergence rate of Adam. ND-Adam (normalized direction-preserving Adam) (Zhang 2018) preserves the direction of gradients for each parameter and produces a regularization effect akin to L2 weight decay. Despite this, its ability to enhance generalization is quite limited. AdaBound (Luo et al. 2019) employs dynamic constraints on the learning rate to achieve a smooth and gradual transition from adaptive methods to SGD, which enhances the generalization of a model and reduces the dependence on detailed learning rate adjustments. Nonetheless, a significant drawback of AdaBound is its potential for slow convergence in certain scenarios (Savarese 2019). Overall, these Adam variants that attempt to enhance generalization are unable to simultaneously retain the rapid convergence characteristic of Adam and improve generalization effectively.

Many studies show theoretically and empirically that the generalization performance of a model is highly correlated with its loss landscape in the parameter space (Hochreiter and Schmidhuber 1997; Chaudhari et al. 2019; Jiang et al. 2020; Petzka et al. 2021; Du et al. 2022). A crucial observation is that flat regions in the loss landscape tend to be associated with good generalization performance, while sharp or narrow regions may lead to overfitting (Mulayoff and Michaeli 2020; Sun et al. 2023). This means that optimizers can effectively improve the generalization of a model by converging to flat minima during the training process, which provides a new perspective for alleviating the poor generalization of Adam. In order to improve the generalization of a model by finding flat minima in the loss landscape, we propose multiple integral Adam (MIAdam), which is inspired by the observation that integral terms often serve as filters and noise suppressors in signal processing and control systems (Roberts and Mullis 1987; Jin, Zhang, and Li 2015). MIAdam introduces a multiple integral term into the parameter update formula of Adam, utilizing the filtering effect of multiple integrations to smooth the optimizer's trajectory. As shown in Fig. 1, if we consider the trajectory as a time-varying input signal, integrating the signal is equivalent to filtering out the sharp minima encountered by the optimizer on the loss landscape, thereby enabling the optimizer to converge to flat minima. More details about the design of the MIAdam optimizer are discussed in Section MIAdam. The main contributions of this paper are summarized as follows.

  • To the best of our knowledge, this is the first time that a multiple integral term is introduced into an optimizer to find flat minima in the loss landscape. Furthermore, we propose a new optimizer based on Adam, which is called MIAdam.

  • We provide theoretical analyses of MIAdam. Specifically, utilizing the diffusion theory framework in (Xie, Sato, and Sugiyama 2020), we prove that the multiple integral term enables MIAdam to generalize better than Adam under some assumptions. In addition, we analyze the effect of the multiple integral term on convergence.

  • The effectiveness of the proposed method is validated through image classification experiments, text classification experiments, and experiments that inject label noises into datasets. Experimental results demonstrate that MIAdam outperforms Adam and its state-of-the-art (SOTA) variants in both generalization and robustness against label noises.

Preliminaries

In this section, core concepts about Adam and the related theoretical analyses on the relationship between flat minima and generalization are briefly given to set the stage for the detailed exposition of MIAdam that follows.

Overview of Adam

The training procedure for a DNN can be primarily characterized as an optimization problem, which is defined as follows:

\min_{\bm{\theta}}\frac{1}{\left|S\right|}\sum_{k=1}^{\left|S\right|}\mathscr{L}(\bm{x}_{k},\bm{y}_{k};\bm{\theta}), \qquad (1)

where $\mathscr{L}(\bm{x}_{k},\bm{y}_{k};\bm{\theta})$ represents the loss function; $\bm{\theta}$ denotes the parameters of the model; $\bm{x}_{k}$ and $\bm{y}_{k}$ are the input and its corresponding ground-truth label, respectively; $S$ is a subset of $D$, where $D$ is the training dataset. In the early stages of deep learning development, SGD emerges as a prevailing optimizer, with its parameter update formula expressed as follows:

\theta_{t+1,i}=\theta_{t,i}-\alpha g_{t,i}, \qquad (2)

where $\alpha$ represents the learning rate; $\theta_{t,i}$ represents the $i$-th dimension of the parameter at discrete time $t$; $g_{t,i}$ is the gradient with respect to the parameter $\theta_{t,i}$. The gradient is formally defined as

g_{t,i}=\frac{\partial L(\theta_{t,i})}{\partial\theta_{t,i}}, \qquad (3)

where $L(\bm{\theta})=(1/\left|S\right|)\sum_{k=1}^{\left|S\right|}\mathscr{L}(\bm{x}_{k},\bm{y}_{k};\bm{\theta})$. Momentum is a strategy used to expedite the convergence of SGD towards minima and escape saddle points on the loss landscape (Qian 1999). It is computed by accumulating previous gradients into the current gradient. The parameter update formula of SGD with momentum (SGDM) is shown as

m_{t,i}=\beta m_{t-1,i}+g_{t,i}, \qquad (4a)
\theta_{t+1,i}=\theta_{t,i}-\alpha m_{t,i}, \qquad (4b)

where $m_{t,i}$ denotes the momentum at $t$ and $\beta$ is the hyperparameter used to trade off between the current gradient and the accumulation of historical gradients. Adam refines the momentum formulation in Eq. (4a) and introduces an adaptive learning rate achieved through the computation of the first-order and second-order moments of current gradients. The first-order and second-order moments are calculated by the following expressions:

m_{t,i}=\beta_{1}m_{t-1,i}+(1-\beta_{1})g_{t,i}, \qquad (5a)
v_{t,i}=\beta_{2}v_{t-1,i}+(1-\beta_{2})g_{t,i}^{2}, \qquad (5b)

where $v_{t,i}$ denotes the second-order moment at $t$; $\beta_{1}$ and $\beta_{2}$ are the exponential decay rates used to adjust the first-order and second-order moments, respectively. Furthermore, the parameter update formula of Adam is expressed as

\theta_{t+1,i}=\theta_{t,i}-\frac{\alpha\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}+\epsilon}}, \qquad (6)

where $\hat{m}_{t,i}=m_{t,i}/(1-\beta_{1}^{t})$ and $\hat{v}_{t,i}=v_{t,i}/(1-\beta_{2}^{t})$. The hyperparameter $\epsilon$ is a small value that prevents division by zero in the denominator.
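As a reading aid, the update rule in Eqs. (5a)-(6) can be written as a few lines of NumPy; the sketch below is our own illustration (not a reference implementation), and it follows Algorithm 1 in placing $\epsilon$ outside the square root.

```python
import numpy as np

def adam_step(theta, grad, state, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following Eqs. (5a), (5b), and (6)."""
    state["t"] += 1
    t = state["t"]
    # First- and second-order moment estimates, Eqs. (5a)-(5b).
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction and parameter update, Eq. (6).
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

# Toy usage on L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
state = {"t": 0, "m": np.zeros_like(theta), "v": np.zeros_like(theta)}
for _ in range(2000):
    theta = adam_step(theta, grad=theta, state=state)
print(theta)  # close to the minimum at the origin
```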

The Relationship between Flat Minima and Generalization

The generalization of DNNs has been extensively explored in recent years. In order to understand this phenomenon, some of the existing research delves into the relationship between loss landscapes and generalization. A correlation between the flatness of the loss landscape and model generalization is revealed in (Hochreiter and Schmidhuber 1995). Subsequent investigations in (Hochreiter and Schmidhuber 1997) expand on this correlation and provide a method for identifying flat minima. In (Keskar et al. 2017), a definition of the sharpness of a specified point on a loss landscape is given. The study in (Dinh et al. 2017) introduces a reparameterization method and argues that previous sharpness measurements are inadequate for predicting generalization capabilities. Furthermore, it is demonstrated that the generalization capability is influenced by factors such as the batch size, higher-order "smoothness" terms characterized by the Lipschitz constant of the Hessian matrix, the loss function, and the number of parameters (Wang et al. 2018). Based on the above theoretical studies, empirical experiments extensively explore the intrinsic link between the generalization performance of a model and loss landscapes. A consensus emerging from these studies, including (Chaudhari et al. 2019; Jiang et al. 2020; Du et al. 2022; Petzka et al. 2021), is that flat minima usually yield better generalization compared to sharp minima.

MIAdam

In the parameter update formula of Adam, the first-order moment, as defined in Eq. (5a), is reformulated as follows (Kingma and Ba 2015):

m_{t,i}=(1-\beta_{1})\sum^{t}_{j=0}\beta_{1}^{t-j}g_{j,i}. \qquad (7)

Consequently, the parameter update formula of Adam is rewritten as

\theta_{t+1,i}=\theta_{t,i}-\frac{\alpha(1-\beta_{1})\sum^{t}_{j=0}\beta_{1}^{t-j}g_{j,i}}{(1-\beta_{1}^{t})\sqrt{\hat{v}_{t,i}+\epsilon}}. \qquad (8)

When the learning rate $\alpha$ is sufficiently small, Eq. (8) is approximated in a continuous form as follows:

\mathrm{d}\theta_{\tilde{t},i}=\mu_{\tilde{t},i}\int_{0}^{\tilde{t}}\beta_{1}^{\tilde{t}-\tau}g_{i}(\tau)\,\mathrm{d}\tau, \qquad (9)

where $\tilde{t}$ is the continuous time, $\mathrm{d}\tau$ is equivalent to $\alpha$, and $\mu_{\tilde{t},i}=-(1-\beta_{1})/((1-\beta_{1}^{\tilde{t}})\sqrt{\hat{v}_{\tilde{t},i}+\epsilon})$. It is noteworthy that an integral term appears in Eq. (9). In signal processing, a continuous input signal $x(\tilde{t})$ that undergoes an integral operation is written as

y(\tilde{t})=\int_{\tilde{t}_{0}}^{\tilde{t}}x(\tau)\,\mathrm{d}\tau, \qquad (10)

where $y(\tilde{t})$ is the integrated signal and the integration range is from $\tilde{t}_{0}$ to $\tilde{t}$. The resulting integral signal $y(\tilde{t})$ contains the cumulative information of the original signal $x(\tilde{t})$ at different points in time, and after the integral operation the high-frequency components of the signal are filtered out (a numerical illustration of this filtering effect is sketched after Eq. (12)). Inspired by this, the trajectory of the optimizer on the loss landscape can be viewed as the input signal when training a DNN, and the sharp minima are equivalent to the high-frequency components in the signal. Integrating this signal is equivalent to filtering out the sharp minima encountered by the optimizer in the loss landscape, thereby guiding the optimizer toward convergence in flat regions. Therefore, to further strengthen this filtering effect, multiple integrations, an enhanced version of the integral operation, are introduced into the parameter update formula of Adam. Based on the integration process depicted in Eq. (10), we obtain the following equation:

\mathrm{d}\theta_{\tilde{t},i}=\mu_{\tilde{t},i}\overbrace{\int_{0}^{\tilde{t}}\kappa^{\tilde{t}-\tilde{t}_{1}}\int_{0}^{\tilde{t}_{1}}\cdots\int_{0}^{\tilde{t}_{n-2}}\kappa^{\tilde{t}_{n-2}-\tilde{t}_{n-1}}\int_{0}^{\tilde{t}_{n-1}}}^{n\text{-th-order multiple integration}}\beta_{1}^{\tilde{t}_{n-1}-\tilde{\tau}}g_{i}(\tilde{\tau})\,\mathrm{d}\tilde{\tau}\,\mathrm{d}\tilde{t}_{n-1}\cdots\mathrm{d}\tilde{t}_{1}, \qquad (11)

where $\kappa$ is the multiple integration rate, which adjusts the multiple integral term. Then, we perform cumulative operations on the first-order moments in the parameter update formula of Adam to transform the multiple integral term from its continuous form to the corresponding discrete form. Thus, the parameter update formula corresponding to Eq. (11) is derived as follows:

\left\{\begin{aligned}
&m_{t,i}=\beta_{1}m_{t-1,i}+(1-\beta_{1})g_{t,i},\\
&\overline{m}_{t,i}^{(n)}=(1-\beta_{1})\overbrace{\sum^{t}_{t_{1}=0}\kappa^{t-t_{1}}\sum^{t_{1}}_{t_{2}=0}\cdots\sum^{t_{n-2}}_{t_{n-1}=0}\kappa^{t_{n-2}-t_{n-1}}\sum^{t_{n-1}}_{t_{n}=0}\beta_{1}^{t_{n-1}-t_{n}}}^{n\text{-th-order multiple summation}}g_{t_{n},i}\\
&\qquad\;\;=\sum^{t}_{t_{1}=0}\kappa^{t-t_{1}}\sum^{t_{1}}_{t_{2}=0}\cdots\sum^{t_{n-2}}_{t_{n-1}=0}\kappa^{t_{n-2}-t_{n-1}}m_{t_{n-1},i},\\
&\theta_{t+1,i}=\theta_{t,i}-\frac{\alpha^{n}\overline{m}_{t,i}^{(n)}}{(1-\beta_{1}^{t})\sqrt{\hat{v}_{t,i}+\epsilon}},
\end{aligned}\right. \qquad (12)

where the superscript $(n)$ denotes the $n$-th-order multiple summation.
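The numerical illustration referenced above is sketched here: it treats a synthetic trajectory as the input signal and applies the leaky accumulation $y_{t}=\kappa y_{t-1}+x_{t}$ that underlies the inner sums of Eq. (12). The signal and constants are illustrative assumptions, not quantities from the paper's experiments.

```python
import numpy as np

# Synthetic "trajectory" signal: a slow trend (flat-minimum component)
# plus high-frequency oscillations (sharp-minimum component).
t = np.arange(1000)
slow = np.sin(2 * np.pi * t / 1000)
spikes = 0.5 * np.sin(2 * np.pi * t / 10)
signal = slow + spikes

def leaky_cumsum(x, kappa=0.98):
    """Discrete analogue of one integration with decay rate kappa,
    i.e., y_t = kappa * y_{t-1} + x_t, as in the inner sums of Eq. (12)."""
    y = np.zeros_like(x)
    acc = 0.0
    for i, v in enumerate(x):
        acc = kappa * acc + v
        y[i] = acc
    return y

once = leaky_cumsum(signal)    # first-order integration
twice = leaky_cumsum(once)     # second-order integration

def hf_ratio(x):
    """Rough ratio of high-frequency to low-frequency energy."""
    spec = np.abs(np.fft.rfft(x - x.mean()))
    return spec[50:].sum() / spec[:5].sum()

# The high-frequency component is progressively attenuated by each integration:
print(hf_ratio(signal), hf_ratio(once), hf_ratio(twice))  # decreasing values
```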

Algorithm 1 MIAdam

Given: learning rate $\alpha$; exponential decay rates $\beta_{1}$, $\beta_{2}$; multiple integration rate $\kappa$; infinitesimal term $\epsilon$; order of the multiple integral term $n$; switching moment $\zeta$.

Initialize: step time $t\leftarrow 0$; first moment $m_{t=0,i}\leftarrow 0$; second moment $v_{t=0,i}\leftarrow 0$; $\overline{m}_{t=0,i}^{(j)}\leftarrow 0$ for $j=0,\ldots,n$.

1:  while stopping criterion is not met do
2:     $t\leftarrow t+1$
3:     compute the gradient $g_{t,i}$ using Eq. (3)
4:     $m_{t,i}\leftarrow\beta_{1}m_{t-1,i}+(1-\beta_{1})g_{t,i}$
5:     $v_{t,i}\leftarrow\beta_{2}v_{t-1,i}+(1-\beta_{2})g_{t,i}^{2}$
6:     if $t<\zeta$ then
7:        $\overline{m}^{(0)}_{t,i}\leftarrow m_{t,i}$
8:        for $j=1$ to $n$ do
9:           $\overline{m}_{t,i}^{(j)}\leftarrow\kappa\overline{m}_{t-1,i}^{(j)}+\overline{m}_{t,i}^{(j-1)}$
10:        end for
11:        $\alpha_{t}\leftarrow\alpha^{n}$
12:        $\hat{m}_{t,i}\leftarrow\overline{m}_{t,i}^{(n)}/(1-\beta_{1}^{t})$
13:     else
14:        $\alpha_{t}\leftarrow\alpha$
15:        $\hat{m}_{t,i}\leftarrow m_{t,i}/(1-\beta_{1}^{t})$
16:     end if
17:     $\hat{v}_{t,i}\leftarrow v_{t,i}/(1-\beta_{2}^{t})$
18:     $\theta_{t,i}\leftarrow\theta_{t-1,i}-\alpha_{t}\hat{m}_{t,i}/(\sqrt{\hat{v}_{t,i}}+\epsilon)$
19:  end while

According to the theoretical analyses in Section Generalization and Convergence Analyses and the simulations in Fig. 2, although the multiple integral term helps an optimizer find flat minima, the optimizer hovers around flat minima and does not converge. Thus, we only use the multiple integral term in the early stages of training, after which the optimizer switches to Adam to ensure that training eventually converges. With the multiple integral term introduced into Adam in this way, the new optimizer is named MIAdam. The pseudocode for MIAdam is shown in Algorithm 1.

Note that the multiple integration is approximated by the multiple summation in Algorithm 1, which adds only $n$ additional summation operations at each iteration for each dimension of the parameter. Therefore, MIAdam adds very little additional computational overhead compared to Adam. In the following text, we refer to Adam with an additional first-order integration as MIAdam1, and the one with an additional second-order integration as MIAdam2, and so on.
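The full PyTorch implementation is available at the repository linked above. As a reading aid, the following NumPy sketch re-implements Algorithm 1 for a single parameter array under our own simplifying assumptions (gradients supplied externally, one tensor, no weight decay); it is not the official implementation.

```python
import numpy as np

class MIAdamSketch:
    """Minimal per-array sketch of Algorithm 1 (not the official implementation)."""

    def __init__(self, shape, alpha=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, kappa=0.98, n=1, zeta=20):
        self.alpha, self.beta1, self.beta2, self.eps = alpha, beta1, beta2, eps
        self.kappa, self.n, self.zeta = kappa, n, zeta
        self.t = 0
        self.m = np.zeros(shape)
        self.v = np.zeros(shape)
        # One accumulator per integration order j = 1, ..., n (assumed initialized to 0).
        self.m_bar = [np.zeros(shape) for _ in range(n)]

    def step(self, theta, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad        # line 4
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2   # line 5
        if self.t < self.zeta:                                        # lines 6-12
            acc = self.m                                              # m_bar^(0)
            for j in range(self.n):
                self.m_bar[j] = self.kappa * self.m_bar[j] + acc      # line 9
                acc = self.m_bar[j]
            alpha_t = self.alpha ** self.n                            # line 11
            m_hat = acc / (1 - self.beta1 ** self.t)                  # line 12
        else:                                                         # lines 13-15
            alpha_t = self.alpha
            m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)                   # line 17
        return theta - alpha_t * m_hat / (np.sqrt(v_hat) + self.eps)  # line 18

# Usage on a toy quadratic loss whose gradient equals theta.
theta = np.array([1.0, -2.0])
opt = MIAdamSketch(theta.shape, n=1, zeta=20)
for _ in range(200):
    theta = opt.step(theta, grad=theta)
```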

Generalization and Convergence Analyses

In this section, we present the theoretical analyses of the generalization and convergence associated with the addition of the multiple integral term to Adam, which does not involve the switching of optimizers. These analyses provide a theoretical foundation for our proposed optimizer.

Generalization Analyses

In this subsection, the diffusion theory framework is utilized to rigorously demonstrate that the incorporation of the multiple integral term enhances the generalization capabilities of the model. Specifically, generalization is quantitatively assessed by comparing the mean escape time, represented as $\phi$, which indicates an optimizer's ability to escape from sharp minima. In the following analyses, we begin by delineating three fundamental assumptions that are crucial for the application of the diffusion theory framework (Xie, Sato, and Sugiyama 2020).

Assumption 1.

The loss function around the critical point $\bm{p}$ is approximately written as

L(\bm{\theta})=L(\bm{p})+\frac{1}{2}(\bm{\theta}-\bm{p})^{\top}H(\bm{p})(\bm{\theta}-\bm{p}), \qquad (13)

where the superscript $\top$ denotes the transpose of a vector.

Assumption 2.

(Quasi-equilibrium approximation). The system is in quasi-equilibrium near minima.

Assumption 3.

(Low-temperature approximation). The system is under low temperature (small gradient noise).

Consequently, following the theoretical analyses in (Xie, Sato, and Sugiyama 2020; Xie et al. 2022), we can further deduce Theorem 1. The detailed proof is given in the Appendix.

Theorem 1.

Suppose that Assumption 1, Assumption 2, and Assumption 3 hold while saddle point $\bm{u}$ is the exit from sharp minimum $\bm{a}$. Then the mean escape time of MIAdam1 from sharp minimum $\bm{a}$ to flat minimum $\bm{b}$ through saddle point $\bm{u}$ before the switch is

\phi_{\mathrm{MIAdam1}}=\pi\left[\sqrt{1+\frac{4\alpha\sqrt{\mathcal{b}\left|H_{\bm{u}\bm{e}}\right|}}{\tilde{t}(1-\beta_{1})}}+1\right]\frac{\left|\operatorname{det}\left(H_{\bm{a}}^{-1}H_{\bm{u}}\right)\right|^{\frac{1}{4}}}{\left|H_{\bm{u}\bm{e}}\right|}\exp\left[\frac{2\sqrt{\mathcal{b}}\Delta L}{\tilde{t}\alpha}\left(\frac{\varrho}{\sqrt{H_{\bm{a}\bm{e}}}}+\frac{(1-\varrho)}{\sqrt{\left|H_{\bm{u}\bm{e}}\right|}}\right)\right], \qquad (14)

where the subscript $\bm{e}$ denotes the escape direction; $\varrho$ is the path-dependent parameter; $\mathcal{b}=|S|$ indicates the batch size; $\Delta L=L(\bm{u})-L(\bm{a})$; $H$ represents the Hessian matrix.

Figure 2: Simulations of the trajectories of Adam and MIAdam on 2-parameter loss landscapes. Panels: (a) first loss landscape; (b) $\alpha=0.05$; (c) $\alpha=0.1$; (d) $\alpha=0.15$; (e) second loss landscape; (f) MIAdam1 and Adam; (g) MIAdam2 and Adam; (h) MIAdam3 and Adam.

Comparing the mean escape time $\phi_{\mathrm{MIAdam1}}$ obtained from Theorem 1 with that of Adam in (Xie et al. 2022),

\phi_{\mathrm{Adam}}=\pi\left[\sqrt{1+\frac{4\alpha\sqrt{\mathcal{b}\left|H_{\bm{u}\bm{e}}\right|}}{(1-\beta_{1})}}+1\right]\frac{\left|\operatorname{det}\left(H_{\bm{a}}^{-1}H_{\bm{u}}\right)\right|^{\frac{1}{4}}}{\left|H_{\bm{u}\bm{e}}\right|}\exp\left[\frac{2\sqrt{\mathcal{b}}\Delta L}{\alpha}\left(\frac{\varrho}{\sqrt{H_{\bm{a}\bm{e}}}}+\frac{(1-\varrho)}{\sqrt{\left|H_{\bm{u}\bm{e}}\right|}}\right)\right], \qquad (15)

it is found that $\phi_{\mathrm{MIAdam1}}$ is smaller than $\phi_{\mathrm{Adam}}$ when $\tilde{t}>1$, indicating that introducing an additional first-order integration into Adam makes it more likely to escape from sharp minima and consequently converge to flat minima, thereby improving generalization.
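As a quick numerical illustration of this comparison, the sketch below evaluates the closed forms of Eqs. (14) and (15) for assumed scalar values of the Hessian quantities, barrier height, and hyperparameters; the numbers are arbitrary and chosen only so that the expressions stay finite.

```python
import numpy as np

def escape_time(alpha, beta1, b, H_ae, H_ue, det_ratio, delta_L, rho, t_tilde=1.0):
    """Evaluate the mean-escape-time expression; t_tilde = 1 recovers Eq. (15) (Adam),
    while t_tilde > 1 corresponds to Eq. (14) (MIAdam1 before the switch)."""
    prefactor = np.pi * (np.sqrt(1 + 4 * alpha * np.sqrt(b * abs(H_ue))
                                 / (t_tilde * (1 - beta1))) + 1)
    geometry = abs(det_ratio) ** 0.25 / abs(H_ue)
    barrier = np.exp(2 * np.sqrt(b) * delta_L / (t_tilde * alpha)
                     * (rho / np.sqrt(H_ae) + (1 - rho) / np.sqrt(abs(H_ue))))
    return prefactor * geometry * barrier

# Assumed illustrative values: curvature at the sharp minimum (H_ae) and saddle (H_ue),
# determinant ratio, loss barrier delta_L, path parameter rho, and batch size b.
args = dict(alpha=0.1, beta1=0.9, b=16, H_ae=10.0, H_ue=-2.0,
            det_ratio=5.0, delta_L=0.01, rho=0.5)
phi_adam = escape_time(**args, t_tilde=1.0)
phi_miadam1 = escape_time(**args, t_tilde=2.0)
print(phi_miadam1 < phi_adam)  # True: the escape time shrinks once t_tilde > 1
```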

Convergence Analyses

In order to verify the effect of the multiple integral term on the convergence of the optimizer, we follow the regret-based analytical framework used for Adam in this subsection. Concretely, the regret bound $R(\hat{t})$ is utilized to evaluate the convergence of the algorithm and is defined as follows:

R(\hat{t})=\sum_{t=1}^{\hat{t}}f_{t}(\bm{\theta}_{t})-\min_{\bm{\theta}}\sum_{t=1}^{\hat{t}}f_{t}(\bm{\theta}), \qquad (16)

where $f_{t}(\cdot)$ is a convex loss function.

Theorem 2.

Assume that the convex function $f_{t}$ has bounded gradients, $\left\|\nabla f_{t}(\bm{\theta})\right\|_{2}\leq\mathsf{g}$ and $\left\|\nabla f_{t}(\bm{\theta})\right\|_{\infty}\leq\mathsf{g}_{\infty}$ for all $\bm{\theta}\in\mathbb{R}^{d}$; that the distance between any two iterates is bounded, $\left\|\bm{\theta}_{n}-\bm{\theta}_{m}\right\|_{2}\leq\mathsf{d}$ and $\left\|\bm{\theta}_{m}-\bm{\theta}_{n}\right\|_{\infty}\leq\mathsf{d}_{\infty}$ for any $m,n\in\{1,\ldots,\hat{t}\}$; and that $\beta_{1},\beta_{2}\in[0,1)$ satisfy $\beta_{1}^{2}/\sqrt{\beta_{2}}<1$. Let $\alpha_{t}=\alpha/t^{h}$, $\beta_{1,t}=\beta_{1}\lambda^{t-1}$, $\kappa\in(0,1]$, and $\lambda\in(0,1)$. For the convex problem, the $R(\hat{t})$ of MIAdam1 before the switch satisfies

\lim_{\hat{t}\rightarrow\infty}\frac{R(\hat{t})}{\hat{t}}\neq 0. \qquad (17)

The detailed proof of Theorem 2 is presented in the Appendix. From Theorem 2, it is evident that merely adding an extra first-order integration to the parameter update formula of Adam leads to the non-convergence of the optimizer. Although it is non-convergent, it effectively escapes sharp minima and hovers around flat minima in the loss landscape. This observation is corroborated by the simulation results shown in Figs. 2(f)-(h). As a result, the MIAdam algorithm is structured to switch to Adam after a certain number of epochs to guarantee convergence.

Simulations and Experiments

In this section, we conduct simulations on 2-parameter loss landscapes to illustrate the ability of MIAdam to escape from sharp minima. Furthermore, extensive empirical experiments are conducted to demonstrate that MIAdam outperforms Adam in terms of generalization and robustness against label noises.

Figure 3: Comparisons of top Hessian eigenvalues $\lambda_{\text{top}}$, Hessian traces $\lambda_{\text{trace}}$, and full Hessian eigenvalue densities of the loss landscapes obtained by (a) Adam, (b) MIAdam1, (c) MIAdam2, and (d) MIAdam3 on the CIFAR-10 dataset using ResNet18.
Optimizer | ResNet18 CIFAR-10(%) | Time | ResNet18 CIFAR-100(%) | Time | ResNet50 CIFAR-10(%) | Time | ResNet50 CIFAR-100(%) | Time
Adam | 93.89±0.18 | 47m | 73.17±0.26 | 47m | 93.69±0.28 | 2h 45m | 75.24±0.62 | 2h 47m
NAdam | 93.92±0.11 | 46m | 73.08±0.31 | 46m | 94.03±0.07 | 2h 57m | 74.95±0.08 | 2h 58m
AdamW | 93.74±0.07 | 48m | 72.11±0.23 | 47m | 93.80±0.11 | 2h 45m | 74.40±0.61 | 2h 48m
ND-Adam | 93.69±0.08 | 44m | 72.26±0.34 | 48m | 93.81±0.18 | 2h 39m | 74.18±0.35 | 2h 57m
Adamax | 93.68±0.25 | 1h 28m | 74.27±0.19 | 1h 27m | 94.10±0.19 | 2h 48m | 76.30±0.10 | 2h 52m
AdaBound | 92.93±0.11 | 48m | 72.51±0.23 | 49m | 93.14±0.29 | 3h 1m | 73.87±0.27 | 3h 2m
SWATS | 92.73±0.10 | 1h 6m | 72.76±0.27 | 54m | 92.38±1.38 | 2h 51m | 72.83±2.10 | 2h 50m
Adai | 93.86±0.15 | 53m | 74.82±0.09 | 53m | 93.95±0.06 | 3h 32m | 76.07±0.36 | 3h 18m
MIAdam1 | 94.20±0.12* | 47m | 75.03±0.05* | 47m | 94.17±0.10 | 2h 49m | 76.96±0.31* | 2h 47m
MIAdam2 | 94.03±0.12 | 47m | 74.41±0.38 | 46m | **94.28±0.15*** | 2h 47m | 76.63±0.15 | 2h 46m
MIAdam3 | 93.96±0.17 | 47m | 74.53±0.22 | 47m | **94.10±0.10** | 2h 48m | 76.51±0.02 | 2h 46m

Optimizer | DenseNet121 CIFAR-10(%) | Time | DenseNet121 CIFAR-100(%) | Time | PyramidNet110 CIFAR-10(%) | Time | PyramidNet110 CIFAR-100(%) | Time
Adam | 94.11±0.01 | 3h 30m | 73.88±0.34 | 3h 32m | 93.45±0.10 | 2h 46m | 72.20±0.09 | 2h 50m
NAdam | 94.39±0.30 | 4h 23m | 65.71±0.34 | 4h 24m | 93.33±0.20 | 3h 34m | 71.26±0.44 | 3h 32m
AdamW | 94.16±0.11 | 3h 25m | 75.07±0.15 | 3h 30m | 93.25±0.08 | 2h 48m | 70.41±0.69 | 2h 43m
ND-Adam | 94.11±0.01 | 3h 23m | 74.59±0.20 | 3h 34m | 93.00±0.15 | 2h 58m | 70.72±0.30 | 3h 11m
Adamax | 90.97±0.15 | 3h 40m | 63.48±0.08 | 3h 47m | 92.46±0.24 | 4h 51m | 69.43±0.28 | 5h 16m
AdaBound | 93.14±0.10 | 4h 0m | 73.88±0.35 | 4h 2m | 92.14±0.22 | 3h 32m | 68.97±0.39 | 3h 27m
SWATS | 93.64±0.94 | 3h 35m | 75.62±3.10 | 3h 45m | 89.42±4.13 | 3h 35m | 49.82±25.13 | 2h 29m
Adai | 94.45±0.21 | 4h 28m | 76.77±0.31 | 4h 42m | 93.50±0.10 | 4h 18m | 71.94±0.31 | 4h 5m
MIAdam1 | 94.75±0.10* | 3h 29m | 77.02±0.10* | 3h 26m | 93.65±0.08* | 2h 59m | 72.51±0.24* | 2h 56m
MIAdam2 | 94.43±0.12 | 3h 29m | 76.21±0.33 | 3h 32m | 93.02±0.10 | 2h 52m | 71.96±0.59 | 2h 52m
MIAdam3 | 94.35±0.03 | 3h 30m | 76.54±0.28 | 3h 30m | 93.07±0.25 | 2h 54m | 71.34±0.17 | 2h 53m

Table 1: Top-1 test accuracy (mean±std) on CIFAR-10 and CIFAR-100.

Simulations

This subsection mainly includes simulations demonstrating that MIAdam escapes sharp minima more easily than Adam and exploring the impact of the learning rate on the optimization process. The first simulation is conducted on an elaborate 2-parameter loss landscape (Yang 2020) with one flat minimum surrounded by two sharp minima, whose contour map is displayed in Fig. 2(a). On this loss landscape, the learning rates of Adam and MIAdam1 are respectively set to 0.05, 0.1, and 0.15, and the simulation results for their corresponding optimization trajectories are shown in Figs. 2(b)-(d). It is clear that MIAdam1 tends to escape from sharp minima and converge to the flat minimum compared to Adam on the 2-parameter loss landscape. Moreover, as the learning rate increases, MIAdam1 is still able to converge to the flat minimum. In contrast, Adam always shows poor convergence on this loss landscape and cannot converge well to either the flat minimum or the sharp minima. Therefore, our proposed method is effective in finding flat minima and cannot simply be replaced by increasing the learning rate of Adam.

The second 2-parameter loss landscape used for simulations is depicted in Fig. 2(e), which contains a large number of sharp minima and flat minima. On this loss landscape, the optimization trajectories of MIAdam1, MIAdam2, and MIAdam3 are compared with Adam in Figs. 2(f)-(h), respectively. Simulation results indicate that MIAdam1, MIAdam2, and MIAdam3 tend to converge toward flat minima, while the Adam optimizer tends to converge to the nearest sharp minima. It is worth noting that MIAdam3 exhibits more intense oscillations near the flat region compared to MIAdam2, which suggests that increasing the order of multiple integration does not always lead to improved outcomes. Since different starting points influence the trajectory of the optimizer, to make the simulation results more convincing, we conduct additional simulations from 2,500 different starting coordinate points and compare, for each optimizer, the sum of the absolute values of the eigenvalues of the Hessian matrix at the final convergence point. Due to space constraints, the simulation results are presented in the Appendix.

Experiments

The effectiveness of MIAdam is evaluated in this subsection through extensive empirical experiments. Initially, we conduct image classification experiments with various neural network architectures on CIFAR (http://www.cs.toronto.edu/~kriz/cifar.html) and ImageNet-1k (https://www.image-net.org/), compared against widely used adaptive learning rate optimizers, including Adam and its SOTA variants. Additionally, we utilize the fast computation method of Hessian information of loss landscapes provided in (Yao et al. 2020) for further comparative analyses. Subsequently, the effectiveness of the proposed MIAdam optimizer for text classification tasks is tested using the BERT and RoBERTa models across three distinct datasets (Lin et al. 2021). Finally, to validate the robustness of MIAdam against label noises, we perform image classification experiments on datasets injected with label noises. The results of MIAdam exceeding Adam are all bold, and the optimal experimental results are all marked by asterisks. Because of space constraints, the detailed experimental settings for all experiments are included in the Appendix.

Image Classification Experiments

To strengthen the credibility of our experimental results, we employ four different neural network architectures for image classification tasks on the CIFAR-10 and CIFAR-100 datasets: ResNet18 (He et al. 2016), ResNet50 (He et al. 2016), DenseNet121 (Huang et al. 2017), and PyramidNet110 (Han, Kim, and Kim 2017). For experiments on large-scale image datasets, we utilize the AlexNet (Krizhevsky, Sutskever, and Hinton 2012), ResNet18, and DenseNet121 architectures for both training and testing on ImageNet-1k. The classification performance of MIAdam is compared with optimizers such as Adam, NAdam (Dozat 2016), AdamW (Loshchilov and Hutter 2017), ND-Adam, Adamax (Kingma and Ba 2015), AdaBound, SWATS, and Adai (Xie et al. 2022). Detailed hyperparameters and experimental settings are presented in the Appendix. As observed from Table 1 and Table 2, MIAdam maintains a training time comparable to Adam while obtaining much better performance than Adam. To compare the flatness of the final convergence regions, we compute top Hessian eigenvalues, Hessian traces, and full Hessian eigenvalue densities of the loss landscapes of Adam, MIAdam1, MIAdam2, and MIAdam3 using DenseNet121 on the CIFAR-100 dataset in Fig. 3. Fig. 3 suggests that the multiple integral term is helpful in finding flatter minima in a specific neural network training task.
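For reference, a hedged sketch of how such Hessian statistics can be computed with the PyHessian library of (Yao et al. 2020) is given below; the model, data paths, and batch are placeholders, and the call signatures are assumed from PyHessian's public examples rather than taken from our experimental code.

```python
import torch
import torchvision
from pyhessian import hessian  # PyHessian (Yao et al. 2020); API assumed from its examples

# Placeholder model and a single CIFAR-10 mini-batch (paths and transforms are illustrative).
model = torchvision.models.resnet18(num_classes=10).eval()
criterion = torch.nn.CrossEntropyLoss()
dataset = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                       transform=torchvision.transforms.ToTensor())
inputs, targets = next(iter(torch.utils.data.DataLoader(dataset, batch_size=128)))

# Hessian statistics at the current parameters (e.g., after training with Adam or MIAdam).
hessian_comp = hessian(model, criterion, data=(inputs, targets), cuda=False)
top_eigenvalues, _ = hessian_comp.eigenvalues(top_n=1)    # lambda_top
trace_estimates = hessian_comp.trace()                    # Hutchinson estimates of the trace
density_eigen, density_weight = hessian_comp.density()    # full eigenvalue density (as in Fig. 3)

print(top_eigenvalues[0], sum(trace_estimates) / len(trace_estimates))
```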

Optimizer | AlexNet(%) | ResNet18(%) | DenseNet121(%)
Adam | 46.48 | 67.19 | 71.48
MIAdam1 | 52.34* | 72.27* | 75.39*
MIAdam2 | 49.61 | 70.70 | 74.22
MIAdam3 | 43.36 | 70.31 | **74.60**

Table 2: Top-1 test accuracies (mean±std) on ImageNet-1k.
Dataset | Optimizer | BERT(%) | RoBERTa(%)
R8 | Adam | 98.15±0.02 | 98.36±0.05
R8 | MIAdam1 | **98.18±0.03*** | **98.45±0.05***
R8 | MIAdam2 | 98.11±0.10 | 98.29±0.10
R8 | MIAdam3 | 98.04±0.09 | 98.29±0.06
R52 | Adam | 96.36±0.21 | 96.21±0.13
R52 | MIAdam1 | 96.42±0.23 | 96.48±0.01
R52 | MIAdam2 | **96.57±0.07** | **96.46±0.23**
R52 | MIAdam3 | **96.65±0.27*** | **96.65±0.07***
MR | Adam | 86.03±0.30 | 87.72±0.09
MR | MIAdam1 | 86.51±0.09* | 89.73±0.17*
MR | MIAdam2 | 86.45±0.19 | 89.54±0.19
MR | MIAdam3 | **86.34±0.09** | 89.54±0.18

Table 3: Test accuracies (mean±std) on text classification experiments.

Text Classification Experiments

We conduct text classification experiments by fine-tuning the pre-trained BERT and RoBERTa models on three widely used text datasets: R8, R52, and Movie Review (MR). Each optimizer is run three times on each dataset using different network structures, with the mean and standard deviation of the test accuracy reported in Table 3. The experimental results indicate that MIAdam significantly outperforms Adam in text classification tasks.

Robustness Against Label Noises

In this subsection, we investigate the capacity of MIAdam to withstand label noises in the training dataset, thereby validating its robustness against label noises. The ResNet18 network is trained using Adam and MIAdam on a corrupted version of the CIFAR-10 dataset, where some of the training labels are randomly flipped while the inputs are kept clean. The noise levels are 20%, 40%, 60%, and 80%. At each noise level, each optimizer is run only once. The remaining experimental settings are consistent with those used in the previous image classification experiments. As indicated in Table 4, MIAdam consistently achieves the highest test accuracy across all noise levels, underscoring MIAdam's superior robustness against label noises.
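A minimal sketch of the label-noise injection described above (randomly flipping a fraction of CIFAR-10 training labels to a different class while keeping the inputs clean) might look as follows; the symmetric flipping scheme is an assumption, since the exact corruption procedure is not spelled out here.

```python
import numpy as np
import torchvision

def inject_label_noise(targets, noise_rate, num_classes=10, seed=0):
    """Randomly flip a fraction `noise_rate` of labels to a different class
    (symmetric noise; the exact scheme used in the experiments is assumed)."""
    rng = np.random.default_rng(seed)
    targets = np.array(targets)
    n_noisy = int(noise_rate * len(targets))
    noisy_idx = rng.choice(len(targets), size=n_noisy, replace=False)
    for i in noisy_idx:
        candidates = [c for c in range(num_classes) if c != targets[i]]
        targets[i] = rng.choice(candidates)
    return targets.tolist()

# Corrupt 40% of the CIFAR-10 training labels; inputs stay unchanged.
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
train_set.targets = inject_label_noise(train_set.targets, noise_rate=0.4)
```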

Optimizer | Noise rate 20% | Noise rate 40% | Noise rate 60% | Noise rate 80%
Adam | 88.24 | 84.90 | 79.61 | 66.39
ND-Adam | 87.52 | 84.10 | 78.34 | 63.51
AdaBound | 86.51 | 82.46 | 76.58 | 57.86
SWATS | 89.43 | 85.47 | 80.30 | 53.50
Adai | 86.09 | 81.92 | 75.72 | 58.60
MIAdam1 | 90.32* | 87.67* | 82.02* | 67.68*
MIAdam2 | 89.13 | 85.03 | **79.84** | 64.96
MIAdam3 | 88.71 | 85.87 | 79.40 | 64.27

Table 4: Top-1 test accuracy on CIFAR-10 under label noises.

Conclusion

In this paper, we have proposed MIAdam, a new adaptive learning rate optimizer that adds a multiple integral term to Adam. MIAdam smooths the optimization trajectory through the filtering effect of the multiple integral term, enabling it to escape sharp local minima during training and converge towards flat minima, thereby alleviating the problem of poor generalization of Adam and improving robustness against label noises while retaining the fast convergence of Adam. Utilizing the diffusion theory framework, we have provided a proof that incorporating the multiple integral term enhances the capability of the optimizer to escape sharp minima and converge to flatter minima, thus improving the generalization of the models. We have also analyzed the convergence of MIAdam and provided a guarantee of convergence. The simulations have demonstrated that MIAdam is capable of finding flatter minima compared to Adam. For empirical analyses, we have conducted image classification experiments, text classification experiments, and experiments that inject label noises into datasets. The experimental results show that MIAdam has much better generalization and robustness against label noises than Adam. Future work will focus on introducing multiple integral terms into other optimizers.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 62476115 and Grant 62176109, in part by the Fundamental Research Funds for the Central Universities under Grant lzujbky2023-ct05 and Grant Izuibky-2023-ey07, in part by the China Computer Federation (CCF)-Baidu Open Fund under Grant 202306, and in part by the Supercomputing Center of Lanzhou University.

References

  • Chaudhari et al. (2019) Chaudhari, P.; et al. 2019. Entropy-SGD: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12): 124018.
  • Deng et al. (2023) Deng, X.; Sun, T.; Li, S.; and Li, D. 2023. Stability-based generalization analysis of the asynchronous decentralized SGD. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 7340–7348.
  • Dinh et al. (2017) Dinh, L.; Pascanu, R.; Bengio, S.; and Bengio, Y. 2017. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, 1019–1028.
  • Dozat (2016) Dozat, T. 2016. Incorporating Nesterov momentum into Adam. In International Conference on Machine Learning.
  • Du et al. (2022) Du, J.; Zhou, D.; Feng, J.; Tan, V.; and Zhou, J. T. 2022. Sharpness-aware training for free. Advances in Neural Information Processing Systems, 23439–23451.
  • Han, Kim, and Kim (2017) Han, D.; Kim, J.; and Kim, J. 2017. Deep pyramidal residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5927–5935.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
  • Hochreiter and Schmidhuber (1995) Hochreiter, S.; and Schmidhuber, J. 1995. Simplifying neural nets by discovering flat minima. Advances in Neural Information Processing Systems, 7: 529–536.
  • Hochreiter and Schmidhuber (1997) Hochreiter, S.; and Schmidhuber, J. 1997. Flat minima. Neural Computation, 9(1): 1–42.
  • Huang et al. (2017) Huang, G.; Liu, Z.; van der Maaten, L.; and Weinberger, K. Q. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Jiang et al. (2020) Jiang, Y.; Neyshabur, B.; Mobahi, H.; Krishnan, D.; and Bengio, S. 2020. Fantastic Generalization Measures and Where to Find Them. In International Conference on Learning Representations.
  • Jin, Zhang, and Li (2015) Jin, L.; Zhang, Y.; and Li, S. 2015. Integration-enhanced Zhang neural network for real-time-varying matrix inversion in the presence of various kinds of noises. IEEE Transactions on Neural Networks and Learning Systems, 27(12): 2615–2627.
  • Johnson et al. (2020) Johnson, T.; Agrawal, P.; Gu, H.; and Guestrin, C. 2020. AdaScale SGD: A user-friendly algorithm for distributed training. In International Conference on Machine Learning, 4911–4920.
  • Kalinay and Percus (2012) Kalinay, P.; and Percus, J. K. 2012. Phase space reduction of the one-dimensional Fokker-Planck (Kramers) equation. Journal of Statistical Physics, 148(6): 1135–1155.
  • Keskar et al. (2017) Keskar, N. S.; Mudigere, D.; Nocedal, J.; Smelyanskiy, M.; and Tang, P. T. P. 2017. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations.
  • Keskar and Socher (2017) Keskar, N. S.; and Socher, R. 2017. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628.
  • Kingma and Ba (2015) Kingma, D. P.; and Ba, J. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • Krizhevsky, Sutskever, and Hinton (2012) Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1097–1105.
  • Lin et al. (2021) Lin, Y.; Meng, Y.; Sun, X.; Han, Q.; Kuang, K.; Li, J.; and Wu, F. 2021. BERTGCN: Transductive text classification by combining GCN and BERT. arXiv preprint arXiv:2105.05727.
  • Liu et al. (2021) Liu, Z.; Li, B.; Simon, J. B.; and Ueda, M. 2021. SGD can converge to local maxima. In International Conference on Learning Representations.
  • Loshchilov and Hutter (2017) Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Luo et al. (2019) Luo, L.; Xiong, Y.; Liu, Y.; and Sun, X. 2019. Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843.
  • Mulayoff and Michaeli (2020) Mulayoff, R.; and Michaeli, T. 2020. Unique properties of flat minima in deep networks. In International Conference on Machine Learning, 7108–7118.
  • Petzka et al. (2021) Petzka, H.; Kamp, M.; Adilova, L.; Sminchisescu, C.; and Boley, M. 2021. Relative flatness and generalization. Advances in Neural Information Processing Systems, 18420–18432.
  • Qian (1999) Qian, N. 1999. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1): 145–151.
  • Roberts and Mullis (1987) Roberts, R. A.; and Mullis, C. T. 1987. Digital signal processing. Addison-Wesley Longman Publishing Co., Inc.
  • Savarese (2019) Savarese, P. 2019. On the convergence of AdaBound and its connection to SGD. arXiv preprint arXiv:1908.04457.
  • Sun et al. (2023) Sun, Y.; Shen, L.; Chen, S.; Ding, L.; and Tao, D. 2023. Dynamic Regularized Sharpness Aware Minimization in Federated Learning: Approaching Global Consistency and Smooth Landscape. In International Conference on Machine Learning.
  • Wang et al. (2018) Wang, H.; Keskar, N. S.; Xiong, C.; and Socher, R. 2018. Identifying generalization properties in neural networks. arXiv preprint arXiv:1809.07402.
  • Wilson et al. (2017) Wilson, A. C.; Roelofs, R.; Stern, M.; Srebro, N.; and Recht, B. 2017. The marginal value of adaptive gradient methods in machine learning. Advances in Neural Information Processing Systems, 4148–4158.
  • Xie, Sato, and Sugiyama (2020) Xie, Z.; Sato, I.; and Sugiyama, M. 2020. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In International Conference on Machine Learning.
  • Xie et al. (2022) Xie, Z.; Wang, X.; Zhang, H.; Sato, I.; and Sugiyama, M. 2022. Adaptive inertia: Disentangling the effects of adaptive learning rate and momentum. In International Conference on Machine Learning, 24430–24459.
  • Yang (2020) Yang, X. 2020. Stochastic gradient variance reduction by solving a filtering problem. arXiv preprint arXiv:2012.12418.
  • Yao et al. (2020) Yao, Z.; Gholami, A.; Keutzer, K.; and Mahoney, M. W. 2020. Pyhessian: Neural networks through the lens of the hessian. In 2020 IEEE International Conference on Big Data (Big Data), 581–590.
  • Yao et al. (2021) Yao, Z.; Gholami, A.; Shen, S.; Mustafa, M.; Keutzer, K.; and Mahoney, M. 2021. Adahessian: An adaptive second order optimizer for machine learning. In proceedings of the AAAI conference on artificial intelligence, volume 35, 10665–10673.
  • Zhang (2018) Zhang, Z. 2018. Improved Adam optimizer for deep neural networks. In 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), 1–2.
  • Zou et al. (2021) Zou, D.; Cao, Y.; Li, Y.; and Gu, Q. 2021. Understanding the generalization of Adam in learning neural networks with proper regularization. arXiv preprint arXiv:2108.11371.


Appendix B Simulation and Experiment Settings

All experimental procedures are implemented using Python 3.6 and PyTorch 1.8.3. Computational tasks are executed on a system running Ubuntu 16.04 LTS, equipped with an array of 10 NVIDIA RTX 2080Ti GPUs.

Simulation Settings

The simulation settings are as follows. All simulations employ CosineAnnealingLR as the learning rate adjustment strategy, with a total of 1500 training steps. These hyperparameters are set identically for all optimizers: $\epsilon=1\mathrm{e}{-8}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, $\alpha=1\mathrm{e}{-3}$, and weight_decay $=5\mathrm{e}{-5}$. For the simulations on the loss landscape in Fig. 2(a), the hyperparameters $\kappa$ and $\zeta$ for MIAdam are set to 0.885 and 1400, respectively. The same settings are applied for the simulations on the loss landscape in Fig. 2(e), with the only difference being the learning rate $\alpha=0.005$.

Experiment Settings

Image Classification Experiments: For MIAdam, we adopt the same hyperparameters as those used for Adam to ensure a consistent comparison basis. Additionally, the optimal values of the hyperparameters $\kappa$ and $\zeta$ are identified through grid search over $\{0.01, 0.02, \ldots, 0.99\}$ and $\{1, 2, \ldots, 150\}$, respectively. After the grid search, the optimal values of $\kappa$ and $\zeta$ are 0.98 and 20, respectively. For each optimizer, we calculate the mean and standard deviation of the top-1 test accuracy based on three individual runs and compute the average training time over the three runs. For the image classification experiments on the CIFAR datasets, the hyperparameter configurations for the various optimizers are provided in Table 5.

Table 5: Hyperparameter settings for optimizers in the image classification experiments. Epoch $=150$; batch size $=128$; milestone $=[50,75]$.

Adam | $\alpha=1\mathrm{e}{-3}$, $\epsilon=1\mathrm{e}{-8}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, weight_decay $=5\mathrm{e}{-5}$
NAdam | $\alpha=1\mathrm{e}{-3}$, $\epsilon=1\mathrm{e}{-8}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, weight_decay $=5\mathrm{e}{-5}$, $\gamma=1\mathrm{e}{-3}$
ND-Adam | $\alpha=1\mathrm{e}{-3}$, $\epsilon=1\mathrm{e}{-8}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, weight_decay $=5\mathrm{e}{-5}$
Adamax | $\alpha=2\mathrm{e}{-3}$, $\epsilon=1\mathrm{e}{-8}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, weight_decay $=5\mathrm{e}{-5}$
AdaBound | $\alpha=1\mathrm{e}{-3}$, $\epsilon=1\mathrm{e}{-8}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, weight_decay $=5\mathrm{e}{-5}$
SWATS | $\alpha=1\mathrm{e}{-3}$, $\epsilon=1\mathrm{e}{-8}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, weight_decay $=5\mathrm{e}{-5}$
Adai | $\alpha=1\mathrm{e}{-3}$, $\epsilon=1\mathrm{e}{-8}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, weight_decay $=5\mathrm{e}{-5}$
MIAdam | $\alpha=1\mathrm{e}{-3}$, $\epsilon=1\mathrm{e}{-8}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, weight_decay $=5\mathrm{e}{-5}$, $\kappa=0.98$, $\zeta=20$

For the experiments conducted on the ImageNet-1k dataset, all settings remain consistent with those used for the CIFAR dataset experiments, except for the total number of training epochs, which is set to 90.

Text Classification Experiments: We fine-tune the BERT and RoBERTa models using a learning rate of 0.0001 and a batch size of 128 for 30 epochs. Both optimizers used in training, Adam and MIAdam, employ milestones as the learning rate adjustment strategy, with the same hyperparameter settings: $\alpha=1\mathrm{e}{-5}$, $\epsilon=1\mathrm{e}{-8}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, and weight_decay $=5\mathrm{e}{-5}$. Additionally, for MIAdam, the extra hyperparameters $\kappa$ and $\zeta$ are set to 0.98 and 20, respectively.
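A hedged sketch of this fine-tuning setup is given below, using the Hugging Face transformers package (an assumption; the implementation used in the experiments is not specified) and plain Adam standing in for MIAdam; the two-sample dataset is a placeholder for the R8/R52/MR splits.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder data: replace with the R8 / R52 / MR splits.
texts = ["sample document one", "sample document two"]
labels = torch.tensor([0, 1])

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

# torch.optim.Adam stands in for MIAdam here; the released MIAdam optimizer would be dropped in.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=5e-5)
# A MultiStepLR schedule stands in for the "milestones" strategy mentioned above.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 20])

model.train()
for epoch in range(30):
    optimizer.zero_grad()
    out = model(**enc, labels=labels)   # returns an object with .loss when labels are given
    out.loss.backward()
    optimizer.step()
    scheduler.step()
```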

Experiments on Datasets Injected with Label Noises: The extra hyperparameter $\zeta$ of MIAdam is set to 40. Apart from the injection of label noises into the CIFAR-10 dataset, the experimental settings are identical to those of the image classification experiments.

Appendix C Supplementary Experiments and Simulations

Simulations

On the second loss landscape in Fig. 2(e), different starting points influence the trajectory of the optimizer. Therefore, to make the simulation results more convincing, we uniformly select 2,500 starting coordinate points in the region where $\theta_{1}\in[-2,3]$ and $\theta_{2}\in[-2,3]$ and conduct 2,500 rounds of simulations on this loss landscape. Furthermore, for each optimizer, we calculate and compare the sum of the absolute values of the eigenvalues of the Hessian matrix at the final convergence point reached from each starting coordinate point. As can be seen in Fig. 4, the sum of the absolute values of the eigenvalues at the final convergence location on the second loss landscape is generally smaller for MIAdam2 and MIAdam3 than for Adam, which suggests that the multiple integral term is indeed helpful in finding flat minima.
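A minimal sketch of this eigenvalue computation at a convergence point, using torch.autograd.functional.hessian on an illustrative 2-parameter loss (not the actual landscape of Fig. 2(e)), is given below.

```python
import torch

def toy_loss(theta):
    """Illustrative 2-parameter loss with sharp and flat directions (not the landscape of Fig. 2)."""
    return torch.sin(3 * theta[0]) ** 2 + 0.1 * theta[0] ** 2 + 0.1 * theta[1] ** 2

# Suppose the optimizer converged to this point.
theta_star = torch.tensor([0.0, 0.0])

H = torch.autograd.functional.hessian(toy_loss, theta_star)   # 2x2 Hessian at theta_star
eigenvalues = torch.linalg.eigvalsh(H)
sharpness = eigenvalues.abs().sum()                           # sum of |eigenvalues|, as in Fig. 4
print(H, sharpness)
```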

Figure 4: Comparisons of the sum of the absolute values of the eigenvalues of the Hessian matrix at the convergence points of Adam, MIAdam1, MIAdam2, and MIAdam3 over 2,500 rounds of simulation on the loss landscape shown in Fig. 2(e).

Experiments

Fig. 5 provides training and testing comparisons of Adam, MIAdam1, MIAdam2, and MIAdam3 on CIFAR-100 using DenseNet121. Fig. 5 shows that the introduction of the multiple integral term leads to an enhancement in generalization rather than in training accuracy. It is worth noting that before switching optimizers, training with the multiple integral term leads to slower convergence or even non-convergence, as the optimizer searches for the flat minima on the loss landscape. However, when considering the entire training process, the epochs taken for convergence to a steady state by Adam and MIAdam are similar.

Figure 5: Training and testing comparisons of Adam, MIAdam1, MIAdam2, and MIAdam3 on CIFAR-100 using DenseNet121.

Appendix D Generalization Proof

Proof of Theorem 1

Proof.

Without loss of generality, we set the hyperparameter $\kappa=1$ and account for the adaptive learning rate at the end of the derivation, which makes it convenient to analyze the effect of the multiple integral term on generalization. Consequently, the parameter update formula of MIAdam1 without the adaptive learning rate and the switching of optimizers is simplified as follows:

\left\{\begin{aligned}
&\bm{m}_{t}=\beta_{1}\bm{m}_{t-1}+(1-\beta_{1})\bm{g}_{t},\\
&\overline{\bm{m}}_{t}^{(1)}=\sum^{t}_{j=0}\bm{m}_{j},\\
&\bm{\theta}_{t+1}=\bm{\theta}_{t}-\alpha^{2}\overline{\bm{m}}_{t}^{(1)}.
\end{aligned}\right. \qquad (18)

Then, we recast Eq. (18) as the following equation of motion:

\left\{\begin{aligned}
&\bm{\mathcal{v}}_{t}=(1-\delta\Delta t)\bm{\mathcal{v}}_{t-1}+\frac{\bm{\mathcal{f}}}{\chi}\Delta t,\\
&\overline{\bm{\mathcal{v}}}_{t}=\sum^{t}_{j=0}\bm{\mathcal{v}}_{j},\\
&\bm{\mathcal{p}}_{t+1}=\bm{\mathcal{p}}_{t}+\overline{\bm{\mathcal{v}}}_{t}(\Delta t)^{2},
\end{aligned}\right. \qquad (19)

where $\bm{\mathcal{v}}_{t}=-\bm{m}_{t}$, $\bm{\mathcal{f}}=\bm{g}_{t}$, $\Delta t$ is equivalent to $\alpha$, $\delta=(1-\beta_{1})/\Delta t$, and $\chi=\Delta t/(1-\beta_{1})$. When the learning rate $\alpha$ is sufficiently small, the differential form of the equation of motion is derived as

\chi\frac{\mathrm{d}^{3}\bm{\mathcal{p}}(\tilde{t})}{\mathrm{d}\tilde{t}^{3}}=\delta\chi\frac{\mathrm{d}^{2}\bm{\mathcal{p}}(\tilde{t})}{\mathrm{d}\tilde{t}^{2}}+\bm{\mathcal{f}}, \qquad (20)

where $\mathrm{d}\tilde{t}=\Delta t$. As $\bm{\mathcal{f}}$ corresponds to the stochastic gradient term, we obtain

\overline{\chi}\,\mathrm{d}\dot{\bm{\mathcal{p}}}=\delta\overline{\chi}\,\mathrm{d}\bm{\mathcal{p}}+\frac{\partial L(\bm{\mathcal{p}})}{\partial\bm{\mathcal{p}}}\,\mathrm{d}\tilde{t}+2[\psi(\bm{\mathcal{p}})]^{\frac{1}{2}}\,\mathrm{d}W_{\tilde{t}}, \qquad (21)

where $\overline{\chi}=\chi/\tilde{t}$; $\mathrm{d}W_{\tilde{t}}\sim\mathscr{N}(0,I\,\mathrm{d}\tilde{t})$; $I$ denotes the identity matrix; $\psi(\bm{\mathcal{p}})=\alpha C(\bm{\mathcal{p}})/2$ is the diffusion matrix of the dynamics; $C(\bm{\mathcal{p}})$ denotes the gradient noise covariance matrix. Furthermore, its Fokker-Planck equation in the phase space (the $\dot{\bm{\mathcal{p}}}$-$\bm{\mathcal{p}}$ space) is written as follows:

\begin{aligned}
\frac{\partial P(\bm{\mathcal{p}},\bm{\mathcal{v}},\tilde{t})}{\partial\tilde{t}}
&=-\nabla_{\bm{\mathcal{p}}}\cdot[\bm{r}P(\bm{\mathcal{p}},\bm{\mathcal{v}},\tilde{t})]
+\nabla_{\bm{r}}\cdot[\delta\bm{\mathcal{v}}+\overline{\chi}\nabla_{\bm{\mathcal{p}}}L(\bm{\mathcal{p}})]
+\nabla_{\bm{\mathcal{v}}}\cdot\overline{\chi}^{-2}\psi\nabla_{\bm{\mathcal{v}}}P(\bm{\mathcal{p}},\bm{\mathcal{v}},\tilde{t})\\
&=-\nabla\cdot\bm{\varphi}(\bm{\mathcal{p}},\tilde{t}), \qquad (22)
\end{aligned}

where $\bm{r}=\dot{\bm{\mathcal{p}}}$; $\bm{\varphi}$ represents the probability flux density; $\nabla\cdot$ denotes the divergence operator. Under Assumption 2, the probability distribution in the valley around $\bm{a}$ is derived as

P(\bm{\mathcal{p}},\bm{\mathcal{v}},\tilde{t})
=\int_{\bm{\mathcal{p}}\in V_{\bm{a}}}\exp\left[-\frac{\delta\overline{\chi}}{2}(\bm{\mathcal{p}}-\bm{a})^{\top}(\psi_{\bm{a}}^{-\frac{1}{2}}H_{\bm{a}}\psi_{\bm{a}}^{-\frac{1}{2}})(\bm{\mathcal{p}}-\bm{a})\right]\mathrm{d}V
=P(\bm{a})\frac{(2\pi\delta\overline{\chi})^{2}}{\operatorname{det}(\psi_{\bm{a}}^{-1}H_{\bm{a}})^{\frac{1}{2}}}. \qquad (23)

According to (Kalinay and Percus 2012), in the case of finite inertia, we transform the phase-space equation into the position-space Smoluchowski-like form with the effective diffusion correction:

\hat{\psi}_{i}(\bm{\mathcal{p}}_{i})=\psi_{i}(\bm{\mathcal{p}}_{i})\left(1-\sqrt{1-\frac{4H_{i}(\bm{\mathcal{p}}_{i})}{\delta^{2}\overline{\chi}}}\right)\left(\frac{2H_{i}(\bm{\mathcal{p}}_{i})}{\delta^{2}\overline{\chi}}\right). \qquad (24)

Supposing that $\bm{\varphi}$ is fixed on an escape path from sharp minimum $\bm{a}$ to flat minimum $\bm{b}$ through saddle point $\bm{u}$, we derive $\bm{\varphi}$ from (Xie et al. 2022) as

\bm{\varphi}=\frac{\exp\left(\frac{L(\bm{a})-L(\bm{u})}{T_{\bm{a}}}P(\bm{a})\right)}{\int^{\bm{b}}_{\bm{a}}\hat{\psi}^{-1}\exp\left(\frac{L(\bm{\mathcal{p}})-L(\bm{\varrho})}{T}\right)\mathrm{d}\bm{\mathcal{p}}}
=\frac{\exp\left(\frac{L(\bm{a})-L(\bm{u})}{T_{\bm{a}}}P(\bm{a})\right)}{\hat{\psi}_{\bm{u}}^{-1}\exp\left(\frac{L(\bm{u})-L(\bm{\varrho})}{T_{\bm{u}}}\sqrt{\frac{2\pi T_{\bm{u}}}{|H_{\bm{u}}|}}\right)}, \qquad (25)

where $T=\psi/(\delta\overline{\chi})$ denotes the temperature parameter in the stationary distribution. Based on the formula of the probability current and flux, we obtain the flux escaping through the saddle point $\bm{u}$:

$$\begin{aligned}
&\int_{\bm{s}_{\bm{u}}}\bm{\varphi}\cdot\text{d}\bm{s}\\
=\;&\varphi_{i}\int_{\bm{s}_{\bm{u}}}\exp\left[-\frac{\delta\overline{\chi}}{2}(\bm{\mathcal{p}}-\bm{u})^{\top}\left[\psi_{\bm{u}}^{-\frac{1}{2}}H_{\bm{u}}\psi_{\bm{u}}^{-\frac{1}{2}}\right]^{\bot\bm{e}}(\bm{\mathcal{p}}-\bm{u})\right]\text{d}\bm{s}\\
=\;&\frac{\exp\left(\frac{L(\bm{a})-L(\bm{\varrho})}{T_{\bm{a}\bm{e}}}\right)P(\bm{a})\frac{(2\pi\delta\chi)^{\frac{n-1}{2}}}{\left(\prod_{i\neq e}\left(\psi_{\bm{u}_{i}}^{-1}H_{\bm{u}_{i}}\right)\right)^{\frac{1}{2}}}}{\hat{\psi}_{\bm{u}\bm{e}}^{-1}\exp\left(\frac{L(\bm{u})-L(\bm{\varrho})}{T_{\bm{u}\bm{e}}}\sqrt{\frac{2\pi T_{\bm{u}\bm{e}}}{|H_{\bm{u}\bm{e}}|}}\right)},
\end{aligned}\tag{26}$$

where the superscript $\bot\bm{e}$ indicates the directions perpendicular to the escape direction $\bm{e}$. Considering the adaptive learning rate, we have $\psi_{\text{MIAdam1}}=\alpha\sqrt{[H]^{+}/\mathcal{b}}/2$ according to the case of Adam in (Xie et al. 2022), where the superscript $+$ denotes the transformation $H=U\operatorname{diag}(H_{1},H_{2},\cdots,H_{n-1},H_{n})U^{\top}$ in which the $i$-th column vector of $U$ is the eigenvector corresponding to $H_{i}$. Finally, we obtain the mean escape time $\phi$ of MIAdam1:

$$\begin{aligned}
\phi_{\text{MIAdam1}}&=\frac{P\left(\bm{\mathcal{p}}\in V_{\bm{a}}\right)}{\int_{S_{b}}\bm{\varphi}\cdot\text{d}\bm{s}}\\
&=\pi\left[\sqrt{1+\frac{4\alpha\sqrt{\mathcal{b}\left|H_{\bm{u}\bm{e}}\right|}}{\tilde{t}(1-\beta_{1})}}+1\right]\frac{\left|\operatorname{det}\left(H_{\bm{a}}^{-1}H_{\bm{u}}\right)\right|^{\frac{1}{4}}}{\left|H_{\bm{u}\bm{e}}\right|}\exp\left[\frac{2\sqrt{\mathcal{b}}\,\Delta L}{\tilde{t}\alpha}\left(\frac{\varrho}{\sqrt{H_{\bm{a}\bm{e}}}}+\frac{1-\varrho}{\sqrt{\left|H_{\bm{u}\bm{e}}\right|}}\right)\right].
\end{aligned}\tag{27}$$

The proof is thus completed. ∎
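To make the role of Eq. (27) concrete, the following minimal Python sketch evaluates the mean escape time for hypothetical one-dimensional curvatures. The helper name mean_escape_time and all numerical values are illustrative assumptions, not part of the original derivation; the sketch only shows the qualitative trend read off from the formula, namely that a larger $H_{\bm{a}\bm{e}}$ (a sharper minimum) shrinks the exponential factor and hence the escape time.

import math

def mean_escape_time(alpha, beta1, t_tilde, batch, H_ae, H_ue, det_ratio, delta_L, rho):
    # Evaluate Eq. (27) with scalar curvatures (hypothetical illustration).
    # H_ae: curvature along the escape direction at minimum a (positive).
    # H_ue: curvature along the escape direction at saddle u (negative).
    # det_ratio: |det(H_a^{-1} H_u)|; delta_L: loss barrier; rho: path parameter varrho.
    prefactor = math.pi * (1 + math.sqrt(1 + 4 * alpha * math.sqrt(batch * abs(H_ue))
                                         / (t_tilde * (1 - beta1))))
    geometry = det_ratio ** 0.25 / abs(H_ue)
    exponent = (2 * math.sqrt(batch) * delta_L / (t_tilde * alpha)
                * (rho / math.sqrt(H_ae) + (1 - rho) / math.sqrt(abs(H_ue))))
    return prefactor * geometry * math.exp(exponent)

common = dict(alpha=1e-3, beta1=0.9, t_tilde=3.0, batch=128,
              H_ue=-1.0, det_ratio=1.0, delta_L=0.05, rho=0.5)
print(mean_escape_time(H_ae=100.0, **common))  # sharp minimum: smaller escape time
print(mean_escape_time(H_ae=1.0, **common))    # flat minimum: larger escape time

Note also that increasing $\tilde{t}$ reduces the exponent in Eq. (27), i.e., within this model it shortens the expected time to escape a given minimum.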

Appendix E Convergence Proof

Definition 1.

For a convex set $F$, a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is convex if, for all $\bm{x},\bm{y}\in\mathbb{R}^{d}$, it satisfies

$$f\left(\bm{y}\right)\geq f\left(\bm{x}\right)+\nabla f\left(\bm{x}\right)^{\top}\left(\bm{y}-\bm{x}\right),\tag{28}$$

and, for all $\lambda\in[0,1]$, it satisfies

$$f\left(\lambda\bm{x}+\left(1-\lambda\right)\bm{y}\right)\leq\lambda f\left(\bm{x}\right)+\left(1-\lambda\right)f\left(\bm{y}\right).\tag{29}$$
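As a brief consistency check between the two conditions in Definition 1 (our own remark, assuming $f$ is differentiable), the first-order condition (28) follows from (29): for any $\lambda\in(0,1]$,

$$f\left(\bm{x}+\lambda(\bm{y}-\bm{x})\right)=f\left(\lambda\bm{y}+(1-\lambda)\bm{x}\right)\leq f(\bm{x})+\lambda\left(f(\bm{y})-f(\bm{x})\right),$$

and dividing by $\lambda$ and letting $\lambda\rightarrow 0^{+}$ yields $\nabla f(\bm{x})^{\top}(\bm{y}-\bm{x})\leq f(\bm{y})-f(\bm{x})$, which is exactly Eq. (28).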
Definition 2.

For convex functions $f_{t}$, we use the regret $R(\hat{t})$ to determine convergence. Its expression is given as follows:

$$R\left(\hat{t}\right)=\sum_{t=1}^{\hat{t}}f_{t}\left(\bm{\theta}_{t}\right)-\min_{\bm{\theta}}\sum_{t=1}^{\hat{t}}f_{t}\left(\bm{\theta}\right).\tag{30}$$

When $\lim_{\hat{t}\rightarrow\infty}R(\hat{t})/\hat{t}\neq 0$, the algorithm is considered not convergent.
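For context (our own remark), this is the standard average-regret criterion for online convex optimization: the regret bound claimed for Adam in (Kingma and Ba 2015) is of order $O(\sqrt{\hat{t}})$, so that

$$\lim_{\hat{t}\rightarrow\infty}\frac{R_{\text{Adam}}(\hat{t})}{\hat{t}}=\lim_{\hat{t}\rightarrow\infty}O\left(\frac{1}{\sqrt{\hat{t}}}\right)=0,$$

whereas the proof of Theorem 2 below shows that the multiple integral term keeps the average regret $R(\hat{t})/\hat{t}$ of MIAdam from vanishing.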

Proof of Theorem 2

Proof.

According to Definition 1 and Definition 2, we have

$$\begin{aligned}
R\left(\hat{t}\right)&=\sum_{t=1}^{\hat{t}}\left[f_{t}\left(\bm{\theta}_{t}\right)-f_{t}\left(\bm{\theta}^{*}\right)\right]\\
&\leq\sum_{i=1}^{d}\sum_{t=1}^{\hat{t}}g_{t,i}\left(\theta_{t,i}-\theta_{,i}^{*}\right).
\end{aligned}\tag{31}$$

From the updating rules of MIAdam, we get

$$\begin{aligned}
\theta_{t+1,i}&=\theta_{t,i}-\alpha\frac{\hat{m}^{(1)}_{t,i}}{\sqrt{\hat{v}_{t,i}}}\\
&=\theta_{t,i}-\frac{\alpha}{1-\beta_{1}^{t}}\frac{\overline{m}^{(1)}_{t,i}}{\sqrt{\hat{v}_{t,i}}}\\
&=\theta_{t,i}-\frac{\alpha}{1-\beta_{1}^{t}}\frac{\sum^{t}_{r=1}\kappa_{1}^{t-r}m_{r}}{\sqrt{\hat{v}_{t,i}}}.
\end{aligned}\tag{32}$$
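For readers who prefer code, the single-parameter update in Eq. (32) can be sketched as follows. This is a minimal illustration only: it assumes Adam's standard moment recursions for $m_{t}$ and $v_{t}$, folds the accumulation $\overline{m}^{(1)}_{t}=\sum_{r\leq t}\kappa_{1}^{t-r}m_{r}=\kappa_{1}\overline{m}^{(1)}_{t-1}+m_{t}$ into a running variable, and omits any scheduling or numerical-stability details of the released implementation; the function name miadam_step and the default values are hypothetical, and the official repository should be consulted for the actual optimizer.

import numpy as np

def miadam_step(theta, g, m, v, m_bar, t, alpha=1e-3,
                beta1=0.9, beta2=0.999, kappa1=0.9, eps=1e-8):
    # One sketched update of Eq. (32) for a parameter array theta with gradient g (t >= 1).
    m = beta1 * m + (1 - beta1) * g        # first moment (standard Adam recursion, assumed)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (standard Adam recursion, assumed)
    m_bar = kappa1 * m_bar + m             # discrete multiple-integral accumulation
    m_hat = m_bar / (1 - beta1 ** t)       # bias-corrected term appearing in Eq. (32)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, m_bar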

Then we obtain

$$\begin{aligned}
&g_{t,i}\left(\theta_{t,i}-\theta_{,i}^{*}\right)\\
=\;&\underbrace{\frac{(1-\beta_{1}^{t})\sqrt{\hat{v}_{t,i}}}{2\alpha\left(1-\beta_{1,t}\right)}\left(\left(\theta_{t,i}-\theta_{,i}^{*}\right)^{2}-\left(\theta_{t+1,i}-\theta_{,i}^{*}\right)^{2}\right)}_{\{1\}}\\
&+\underbrace{\frac{\beta_{1,t}m_{t-1,i}\left(\theta_{,i}^{*}-\theta_{t,i}\right)}{1-\beta_{1,t}}}_{\{2\}}+\underbrace{\frac{\kappa_{1}\overline{m}^{(1)}_{t-1,i}\left(\theta_{,i}^{*}-\theta_{t,i}\right)}{1-\beta_{1,t}}}_{\{3\}}\\
&+\underbrace{\frac{\alpha(1-\beta_{1}^{t})}{2\left(1-\beta_{1,t}\right)}\frac{\hat{m}_{t,i}^{2}}{\sqrt{\hat{v}_{t,i}}}}_{\{4\}}.
\end{aligned}\tag{33}$$

For the first two terms {1} and {2} in Eq. (33), we derive the following two inequalities based on (Kingma and Ba 2015):

$$\begin{aligned}
\sum_{i=1}^{d}&\sum_{t=1}^{\hat{t}}\frac{(1-\beta_{1}^{t})\sqrt{\hat{v}_{t,i}}}{2\alpha\left(1-\beta_{1,t}\right)}\left(\left(\theta_{t,i}-\theta_{,i}^{*}\right)^{2}-\left(\theta_{t+1,i}-\theta_{,i}^{*}\right)^{2}\right)\\
&\leq\frac{\mathsf{d}^{2}}{2\alpha(1-\beta_{1})}\sum^{d}_{i=1}\sqrt{\hat{t}\,\hat{v}_{\hat{t},i}},
\end{aligned}\tag{34}$$

and

$$\begin{aligned}
\sum_{i=1}^{d}&\sum_{t=1}^{\hat{t}}\frac{\beta_{1,t}m_{t-1,i}\left(\theta_{,i}^{*}-\theta_{t,i}\right)}{1-\beta_{1,t}}\\
&\leq\frac{\alpha(1+\beta_{1})\mathsf{g}_{\infty}}{(1-\beta_{1})\sqrt{1-\beta_{2}}(1-\gamma)^{2}}\sum^{d}_{i=1}\lVert g_{1:\hat{t},i}\rVert_{2}+\sum_{i=1}^{d}\frac{\mathsf{d}^{2}_{\infty}\mathsf{g}_{\infty}\sqrt{1-\beta_{2}}}{2\alpha\beta_{1}(1-\lambda)^{2}}.
\end{aligned}\tag{35}$$

Since the gradient is assumed to be bounded, $m_{t,i}$ is also bounded in Adam and satisfies $m_{t,i}\leq\mathsf{g}$. For the term {3}, we have

$$\begin{aligned}
\sum^{\hat{t}}_{t=2}&\frac{\kappa_{1}\overline{m}^{(1)}_{t-1,i}\left(\theta_{,i}^{*}-\theta_{t,i}\right)}{1-\beta_{1,t}}\\
&=\sum^{\hat{t}}_{t=2}\frac{\theta_{,i}^{*}-\theta_{t,i}}{1-\beta_{1,t}}\sum^{t-1}_{r=1}\kappa_{1}^{t-r}m_{r}\\
&\leq\sum^{\hat{t}}_{t=2}\frac{\theta_{,i}^{*}-\theta_{t,i}}{1-\beta_{1,t}}\left(\frac{\kappa_{1}}{1-\kappa_{1}}-\frac{\kappa_{1}^{t}}{1-\kappa_{1}}\right)t\mathsf{g},
\end{aligned}\tag{36}$$

which is a divergent series. Therefore, we further deduce the following limit:

$$\lim_{\hat{t}\rightarrow\infty}\frac{\sum_{i=1}^{d}\sum^{\hat{t}}_{t=2}\frac{\kappa_{1}\overline{m}^{(1)}_{t-1,i}\left(\theta_{,i}^{*}-\theta_{t,i}\right)}{1-\beta_{1,t}}}{\hat{t}}\rightarrow\infty.\tag{37}$$

For the term {4}, we have

$$\begin{aligned}
\sum_{i=1}^{d}&\sum^{\hat{t}}_{t=1}\frac{\alpha(1-\beta_{1}^{t})}{2\left(1-\beta_{1,t}\right)}\frac{\hat{m}_{t,i}^{2}}{\sqrt{\hat{v}_{t,i}}}\\
&=\sum_{i=1}^{d}\sum^{\hat{t}}_{t=1}\frac{\alpha(1-\beta_{1}^{t})}{2\left(1-\beta_{1,t}\right)}\frac{\left(\sum^{t-1}_{r=1}\kappa_{1}^{t-r}m_{r}\right)^{2}}{\sqrt{\hat{v}_{t,i}}}.
\end{aligned}\tag{38}$$

Analogous to the derivation for the term {3}, we obtain the following limit:

$$\lim_{\hat{t}\rightarrow\infty}\frac{\sum_{i=1}^{d}\sum^{\hat{t}}_{t=1}\frac{\alpha(1-\beta_{1}^{t})}{2\left(1-\beta_{1,t}\right)}\frac{\hat{m}_{t,i}^{2}}{\sqrt{\hat{v}_{t,i}}}}{\hat{t}}\rightarrow\infty.\tag{39}$$

Finally, following the above derivation, we have

$$\lim_{\hat{t}\rightarrow\infty}\frac{R(\hat{t})}{\hat{t}}\neq 0.\tag{40}$$

The proof is thus completed. ∎