A Method for Enhancing Generalization of Adam by Multiple Integrations
Abstract
The insufficient generalization of adaptive moment estimation (Adam) has hindered its broader application. Recent studies have shown that flat minima in loss landscapes are highly associated with improved generalization. Inspired by the filtering effect of integration operations on high-frequency signals, we propose multiple integral Adam (MIAdam), a novel optimizer that integrates a multiple integral term into Adam. This multiple integral term effectively filters out sharp minima encountered during optimization, guiding the optimizer towards flatter regions and thereby enhancing generalization capability. We provide a theoretical explanation for the improvement in generalization through the diffusion theory framework and analyze the impact of the multiple integral term on the optimizer’s convergence. Experimental results demonstrate that MIAdam not only enhances generalization and robustness against label noise but also maintains the rapid convergence characteristic of Adam, outperforming Adam and its state-of-the-art variants on benchmark tasks.
Code — https://github.com/LongJin-lab/MIAdam
Introduction
An appropriate optimizer is essential to train a deep neural network (DNN), as it directly affects the training convergence and performance of a model (Yao et al. 2021). The goal of optimizers is usually to minimize (or maximize) a certain objective function, typically a loss function, which measures the gap between the predictions and ground-truth values. As a traditional method, stochastic gradient descent (SGD) is a commonly used optimizer for training DNNs (Deng et al. 2023). However, SGD suffers from certain limitations, such as the need to precisely tune the learning rate, the uniform scaling of gradients in all directions, and the risk of being trapped in saddle points (Johnson et al. 2020; Liu et al. 2021). In order to address these challenges, adaptive learning rate optimizers are developed, offering more nuanced control over learning rates and improved convergence in diverse training scenarios. Among them, adaptive moment estimation (Adam) (Kingma and Ba 2015) is currently one of the most popular adaptive learning rate optimizers for its rapid convergence and efficient handling of sparse gradients. The combination of first-order and second-order moments in Adam enables the effective incorporation of momentum-based optimization and adaptive learning rate methods, thereby enhancing its overall efficiency and applicability in various neural network training contexts.

Despite being widely used, Adam also exhibits certain limitations, such as inferior generalization capability compared to SGD in some scenarios (Wilson et al. 2017; Luo et al. 2019; Zou et al. 2021). Therefore, several enhanced variants of Adam are developed to alleviate the issue of poor generalization. Switching from Adam to SGD (SWATS) (Keskar and Socher 2017) is designed to start training with the Adam optimizer and then automatically switch to SGD, aiming to improve the model’s generalization performance. However, this method cannot fully maintain the original convergence rate of Adam. ND-Adam (normalized direction-preserving Adam) (Zhang 2018) meticulously preserves the direction of gradients for each parameter and produces a regularization effect akin to L2 weight decay. Despite this, its ability to enhance generalization is quite limited. AdaBound (Luo et al. 2019) employs dynamic constraints on the learning rate to achieve a smooth and gradual transition from adaptive methods to SGD, which enhances the generalization of a model and reduces the dependence on detailed learning rate adjustments. Nonetheless, a significant drawback of AdaBound is its potential for slow convergence in certain scenarios (Savarese 2019). Overall, these improved variants of Adam are unable to simultaneously retain the rapid convergence characteristic of Adam and enhance generalization effectively.
Many studies show theoretically and empirically that the generalization performance of a model is highly correlated with its loss landscape in the parameter space (Hochreiter and Schmidhuber 1997; Chaudhari et al. 2019; Jiang et al. 2020; Petzka et al. 2021; Du et al. 2022). A crucial observation is that flat regions in the loss landscape tend to be associated with good generalization performance, while sharp or narrow regions may lead to overfitting (Mulayoff and Michaeli 2020; Sun et al. 2023). This means that optimizers can effectively improve the generalization of a model by converging to flat minima during the training process, which provides a new perspective for alleviating the poor generalization of Adam. In order to improve the generalization of a model by finding flat minima in the loss landscape, we propose multiple integral Adam (MIAdam), which is inspired by the fact that integral terms often serve as filters and noise suppressors in signal processing and control systems (Roberts and Mullis 1987; Jin, Zhang, and Li 2015). MIAdam introduces a multiple integral term into the parameter update formula of Adam, utilizing the filtering effect of multiple integrations to smooth the optimizer’s trajectory. As shown in Fig. 1, if we consider the trajectory as a time-varying input signal, integrating the signal is equivalent to filtering out the sharp minima encountered by the optimizer on the loss landscape, thereby enabling the optimizer to converge to flat minima. More details about the design of the MIAdam optimizer are discussed in Section MIAdam. The main contributions of this paper are summarized as follows.
- To the best of our knowledge, this is the first time that a multiple integral term is introduced into an optimizer to find flat minima in the loss landscape. Based on this idea, we propose a new optimizer built on Adam, called MIAdam.
- We provide theoretical analyses of MIAdam. Specifically, utilizing the diffusion theory framework in (Xie, Sato, and Sugiyama 2020), we prove that the multiple integral term enables MIAdam to generalize better than Adam under some assumptions. In addition, we analyze the effect of the multiple integral term on convergence.
- The effectiveness of the proposed method is validated through image classification experiments, text classification experiments, and experiments that inject label noises into datasets. Experimental results demonstrate that MIAdam outperforms Adam and its state-of-the-art (SOTA) variants in both generalization and robustness against label noises.
Preliminaries
In this section, core concepts about Adam and the related theoretical analyses on the relationship between flat minima and generalization are briefly given to set the stage for the detailed exposition of MIAdam that follows.
Overview of Adam
The training procedure for a DNN can be primarily characterized as an optimization problem, which is defined as follows:
$$\min_{\theta}\; L(\theta) = \frac{1}{|\mathcal{B}|}\sum_{(x,\,y)\in\mathcal{B}} \ell\big(f(x;\theta),\, y\big), \qquad (1)$$
where $\ell(\cdot,\cdot)$ represents the loss function; $\theta$ denotes the parameters of the model; $x$ and $y$ are the input and its corresponding ground-truth label, respectively; $\mathcal{B}$ is a subset of the training dataset $\mathcal{D}$. In the early stages of deep learning development, SGD emerged as the prevailing optimizer, with its parameter update formula expressed as follows:
$$\theta_{t+1,\,i} = \theta_{t,\,i} - \eta\, g_{t,\,i}, \qquad (2)$$
where $\eta$ represents the learning rate; $\theta_{t,i}$ represents the $i$-th dimension of the parameter $\theta_t$ at discrete time $t$; $g_{t,i}$ is the gradient with respect to the parameter $\theta_{t,i}$. The gradient $g_t$ is formally defined as
$$g_t = \nabla_{\theta} L(\theta_t) = \frac{1}{|\mathcal{B}_t|}\sum_{(x,\,y)\in\mathcal{B}_t} \nabla_{\theta}\, \ell\big(f(x;\theta_t),\, y\big), \qquad (3)$$
where $\mathcal{B}_t$ denotes the mini-batch sampled at time $t$. Momentum is a strategy used to expedite the convergence of SGD towards minima and escape saddle points on the loss landscape (Qian 1999). It is computed by accumulating previous gradients into the current gradient. The parameter update formula of SGD with momentum (SGDM) is shown as
$$m_t = \beta\, m_{t-1} + (1-\beta)\, g_t, \qquad (4a)$$
$$\theta_{t+1} = \theta_t - \eta\, m_t, \qquad (4b)$$
where $m_t$ denotes the momentum at time $t$ and $\beta$ is the hyperparameter used to trade off between the current gradient and the accumulation of historical gradients. Adam refines the momentum formulation in Eq. (4a) and introduces an adaptive learning rate achieved through the computation of the first-order and second-order moments of the current gradients. The first-order and second-order moments are calculated by the following expressions:
$$m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t, \qquad (5a)$$
$$v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^{2}, \qquad (5b)$$
where $v_t$ denotes the second-order moment at time $t$; $\beta_1$ and $\beta_2$ are the exponential decay rates used to adjust the first-order and second-order moments, respectively. Furthermore, the parameter update formula of Adam is expressed as
$$\theta_{t+1} = \theta_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}, \qquad (6)$$
where $\hat{m}_t = m_t/(1-\beta_1^{t})$ and $\hat{v}_t = v_t/(1-\beta_2^{t})$ are the bias-corrected moments. The hyperparameter $\epsilon$ is a small value that prevents division by zero in the denominator.
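To make Eqs. (5a)-(6) concrete, the following minimal NumPy sketch performs one Adam iteration; the function name `adam_step` and its defaults are our own illustrative choices rather than part of the original algorithm description.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam iteration following Eqs. (5a), (5b), and (6)."""
    m = beta1 * m + (1 - beta1) * grad           # first-order moment, Eq. (5a)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-order moment, Eq. (5b)
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # update, Eq. (6)
    return theta, m, v

# Toy usage: minimize ||theta - 1||^2 from the origin.
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    grad = 2 * (theta - 1.0)                     # gradient of the toy objective
    theta, m, v = adam_step(theta, grad, m, v, t)
```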
The Relationship between Flat Minima and Generalization
The generalization of DNNs has been extensively explored in recent years. In order to understand the phenomenon of generalization of DNNs, some of the existing research delves into understanding the relationship between loss landscapes and the generalization. A correlation between the flatness of the loss landscape and model generalization is revealed in (Hochreiter and Schmidhuber 1995). Subsequent investigations in (Hochreiter and Schmidhuber 1997) expand on this correlation and provide a method for identifying flat minima. In (Keskar et al. 2017), a definition of the sharpness of a specified point on a loss landscape is given. The study in (Dinh et al. 2017) introduces a reparameterization method and argues that previous sharpness measurements are inadequate for predicting generalization capabilities. Furthermore, it is demonstrated that the generalization capability is influenced by factors such as the batch size, higher-order “smoothness” terms characterized by the Lipschitz constant of the Hessian matrix, the loss function, and the number of parameters (Wang et al. 2018). Based on the above theoretical studies, empirical experiments extensively explore the intrinsic link between the generalization performance of a model and loss landscapes. A consensus emerging from these studies, including (Chaudhari et al. 2019; Jiang et al. 2020; Du et al. 2022; Petzka et al. 2021), is that flat minima usually yield better generalization compared to sharp minima.
MIAdam
In the parameter update formula of Adam, the first-order moment, as defined in Eq. (5a), is reformulated as follows (Kingma and Ba 2015):
$$m_t = (1-\beta_1)\sum_{j=1}^{t}\beta_1^{\,t-j}\, g_j. \qquad (7)$$
Consequently, the parameter update formula of Adam is rewritten as
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\cdot\frac{1-\beta_1}{1-\beta_1^{t}}\sum_{j=1}^{t}\beta_1^{\,t-j}\, g_j. \qquad (8)$$
When the learning rate is sufficiently small, Eq. (8) is approximated in a continuous form as follows:
$$\frac{\mathrm{d}\theta(\tau)}{\mathrm{d}\tau} = -\frac{\eta}{\sqrt{\hat{v}(\tau)}+\epsilon}\, m(\tau), \quad m(\tau) = (1-\beta_1)\int_{0}^{\tau} e^{-(1-\beta_1)(\tau-s)}\, g(s)\, \mathrm{d}s, \qquad (9)$$
where $\tau$ is the continuous time, $g(s)$ is the continuous counterpart of $g_t$, and $m(\tau)$ and $\hat{v}(\tau)$ are the continuous counterparts of $\hat{m}_t$ and $\hat{v}_t$. It is noteworthy that the integral term appears in Eq. (9). In signal processing, a continuous input signal $u(s)$ that undergoes an integral operation is written as
$$y(\tau) = \int_{0}^{\tau} u(s)\, \mathrm{d}s, \qquad (10)$$
where $y(\tau)$ is the integrated signal and the integration range is from $0$ to $\tau$. Ultimately, the resulting integral signal contains the cumulative information of the original signal at different points in time. After the integral operation, the high-frequency components of the signal are filtered out. Inspired by this, the trajectory of the optimizer on the loss landscape can be viewed as the input signal when training a DNN, and the sharp minima are equivalent to the high-frequency components in the signal. Integrating this signal is equivalent to filtering out the sharp minima encountered by an optimizer in the loss landscape, thereby guiding the optimizer toward convergence in flat regions. Therefore, to further strengthen the effect of filtering out the sharp minima encountered by the optimizer, multiple integrations, an enhanced version of the integral operation, are introduced into the parameter update formula of Adam. Based on the integration process depicted in Eq. (10), we obtain the following equation:
$$\frac{\mathrm{d}\theta(\tau)}{\mathrm{d}\tau} = -\frac{\eta\, k}{\sqrt{\hat{v}(\tau)}+\epsilon} \int_{0}^{\tau}\!\!\int_{0}^{s_n}\!\!\cdots\!\!\int_{0}^{s_2} m(s_1)\, \mathrm{d}s_1 \cdots \mathrm{d}s_{n-1}\, \mathrm{d}s_n, \qquad (11)$$
where $k$ is the multiple integration rate, which adjusts the multiple integral term. Then, we perform cumulative summation operations on the first-order moments in the parameter update formula of Adam to transform the multiple integral term from its continuous form to the corresponding discrete form. Thus, the parameter update formula corresponding to Eq. (11) is derived as follows:
$$\theta_{t+1} = \theta_t - \frac{\eta\, k}{\sqrt{\hat{v}_t}+\epsilon}\, \hat{m}_t^{(n)}, \quad \hat{m}_t^{(n)} = \sum_{j_n=1}^{t}\sum_{j_{n-1}=1}^{j_n}\cdots\sum_{j_1=1}^{j_2} \hat{m}_{j_1}, \qquad (12)$$
where the superscript $(n)$ means the $n$-th-order multiple summation.
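As a discrete illustration of the multiple summation in Eq. (12), the sketch below maintains one running accumulator per integration order, so each step costs only $n$ extra vector additions; the class name `MultipleSum` and its interface are our own, and the exact interaction of this term with Adam's bias correction is our reading of the update rather than the authors' implementation.

```python
import numpy as np

class MultipleSum:
    """Running n-th-order summation of a stream of vectors: a discrete stand-in
    for the multiple integral term in Eq. (12); only n accumulators are stored."""
    def __init__(self, shape, order):
        self.acc = [np.zeros(shape) for _ in range(order)]

    def update(self, m_hat):
        x = m_hat
        for i in range(len(self.acc)):
            self.acc[i] += x   # acc[0] = sum of m_hat, acc[1] = sum of those sums, ...
            x = self.acc[i]
        return x               # n-th-order multiple summation up to the current step

# The summed term, scaled by the multiple integration rate k, would then replace
# the bias-corrected first moment in Adam's update (our reading of Eq. (12)).
ms = MultipleSum(shape=(2,), order=2)
for m_hat in [np.array([1.0, 0.0]), np.array([0.5, 0.5])]:
    direction = 0.98 * ms.update(m_hat)
```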
Algorithm 1: MIAdam.
Given: learning rate: $\eta$;
exponential decay rates: $\beta_1$, $\beta_2$;
multiple integration rate: $k$;
infinitesimal term: $\epsilon$;
the order of the multiple integral term: $n$;
switching moment: $T_s$.
Initialize: step time $t = 0$;
first moment vector $m_0 = 0$;
second moment vector $v_0 = 0$; multiple summation terms $\hat{m}_0^{(1)} = \cdots = \hat{m}_0^{(n)} = 0$.
According to the theoretical analyses in Section Generalization and Convergence Analyses and the simulations in Fig. 2, although the multiple integral term helps an optimizer to find flat minima, the optimizer hovers around flat minima and does not converge. Thus, we only use the multiple integral term in the early stages of training, and after that, the optimizer switches to Adam to ensure that the training is convergent eventually. At this point, the multiple integral term is introduced into Adam, and this new optimizer is named MIAdam. The pseudo code for MIAdam is shown in Algorithm 1.
Note that the multiple integration is approximated by the multiple summation in Algorithm 1, which adds only $n$ additional summation operations at each iteration for each dimension of the parameter. Therefore, MIAdam adds very little computational overhead compared to Adam. In the following text, we refer to Adam with an additional first-order integration as MIAdam1, to the version with an additional second-order integration as MIAdam2, and so on.
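The following PyTorch sketch assembles the pieces of Algorithm 1 as we read them: Adam's moments, the $n$-th-order running summation scaled by the multiple integration rate $k$ before the switching moment, and plain Adam afterwards. It is a hedged approximation, not the authors' released implementation (see the linked repository); all argument names and default values are our own.

```python
import torch
from torch.optim import Optimizer

class MIAdamSketch(Optimizer):
    """Sketch of Algorithm 1: Adam whose bias-corrected first moment is replaced,
    before `switch_step`, by an n-th-order running summation scaled by k; after
    the switch the update reduces to plain Adam."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 k=0.98, order=1, switch_step=10000):
        defaults = dict(lr=lr, betas=betas, eps=eps, k=k,
                        order=order, switch_step=switch_step)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            beta1, beta2 = group['betas']
            for p in group['params']:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state['step'] = 0
                    state['m'] = torch.zeros_like(p)
                    state['v'] = torch.zeros_like(p)
                    # one accumulator per integration order (n extra sums per step)
                    state['sums'] = [torch.zeros_like(p) for _ in range(group['order'])]
                state['step'] += 1
                t = state['step']
                g = p.grad
                state['m'].mul_(beta1).add_(g, alpha=1 - beta1)         # Eq. (5a)
                state['v'].mul_(beta2).addcmul_(g, g, value=1 - beta2)  # Eq. (5b)
                m_hat = state['m'] / (1 - beta1 ** t)
                v_hat = state['v'] / (1 - beta2 ** t)
                if t <= group['switch_step']:
                    d = m_hat
                    for s in state['sums']:     # repeated running summation, Eq. (12)
                        s.add_(d)
                        d = s
                    d = group['k'] * d
                else:
                    d = m_hat                   # after the switch: plain Adam
                p.addcdiv_(d, v_hat.sqrt().add_(group['eps']), value=-group['lr'])
        return loss
```

Under this reading, setting `order` to 1, 2, or 3 corresponds to MIAdam1, MIAdam2, and MIAdam3, respectively.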
Generalization and Convergence Analyses
In this section, we present the theoretical analyses of the generalization and convergence associated with the addition of the multiple integral term to Adam, which does not involve the switching of optimizers. These analyses provide a theoretical foundation for our proposed optimizer.
Generalization Analyses
In this subsection, the diffusion theory framework is utilized to rigorously demonstrate that the incorporation of the multiple integral term enhances the generalization capabilities of the model. Specifically, generalization is quantitatively assessed by comparing the mean escape time, which indicates an optimizer’s ability to escape from sharp minima. In the following analyses, we begin by delineating three fundamental assumptions that are crucial for the application of the diffusion theory framework (Xie, Sato, and Sugiyama 2020).
Assumption 1.
The loss function $L(\theta)$ around the critical point $\theta^{\star}$ is approximately written as
$$L(\theta) = L(\theta^{\star}) + \frac{1}{2}\,(\theta-\theta^{\star})^{\top} H(\theta^{\star})\,(\theta-\theta^{\star}), \qquad (13)$$
where the superscript ⊤ means the transpose of a vector and $H(\theta^{\star})$ is the Hessian matrix of the loss function at the critical point $\theta^{\star}$.
Assumption 2.
(Quasi-equilibrium approximation). The system is in quasi-equilibrium near minima.
Assumption 3.
(Low-temperature approximation). The system is under low temperature (small gradient noise).
Consequently, following the theoretical analyses in (Xie, Sato, and Sugiyama 2020; Xie et al. 2022), we can further deduce Theorem 1. The detailed proof is given in the Appendix.
Theorem 1.
Suppose that Assumption 1, Assumption 2, and Assumption 3 hold, and saddle point $b$ is the exit from sharp minimum $a$. Then the mean escape time of MIAdam1 from sharp minimum $a$ to flat minimum $c$ through saddle point $b$ before the switch is
(14)
where the subscript $e$ denotes the escape direction; $s$ is the path-dependent parameter; $B$ indicates the batch size; $\eta$ is the learning rate; $H$ represents the Hessian matrix.








Comparing the mean escape time obtained from Theorem 1 with that of Adam given in (Xie et al. 2022),
(15)
when the corresponding condition on the hyperparameters is satisfied, it is found that the mean escape time of MIAdam1 is smaller than that of Adam, indicating that Adam equipped with an additional first-order integration is more likely to escape from sharp minima and consequently converge to flat minima, thereby improving generalization.
Convergence Analyses
In order to verify the effect of the multiple integral term on the convergence of the optimizer, we follow the analytical framework of Adam (Kingma and Ba 2015) in this subsection. Concretely, the regret bound is utilized to evaluate the convergence of the algorithm and is defined as follows:
$$R(T) = \sum_{t=1}^{T}\Big[f_t(\theta_t) - f_t(\theta^{\star})\Big], \qquad (16)$$
where $f_t$ is a convex loss function at step $t$ and $\theta^{\star} = \arg\min_{\theta}\sum_{t=1}^{T} f_t(\theta)$.
Theorem 2.
Assume that the convex function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$ and $\|\nabla f_t(\theta)\|_\infty \le G_\infty$ for all $\theta$, the distance between any $\theta_t$ generated by MIAdam1 is guaranteed to be bounded, $\|\theta_n - \theta_m\|_2 \le D$ and $\|\theta_n - \theta_m\|_\infty \le D_\infty$ for any $m, n \in \{1, \dots, T\}$, and $\beta_1, \beta_2 \in [0, 1)$ satisfy $\beta_1^2 / \sqrt{\beta_2} < 1$. Let $\eta_t = \eta / \sqrt{t}$ and $\beta_{1,t} = \beta_1 \lambda^{t-1}$ with $\lambda \in (0, 1)$. For the convex problem, the regret bound $R(T)$ of MIAdam1 before the switch satisfies
(17)
The detailed proof of Theorem 2 is presented in the Appendix. From Theorem 2, it is evident that merely adding an extra first-order integration to the parameter update formula of Adam leads to the non-convergence of the optimizer. Although it is non-convergent, the optimizer effectively escapes sharp minima and hovers around flat minima in the loss landscape. This observation is corroborated by the simulation results shown in Figs. 2(f)-(h). As a result, the MIAdam algorithm is structured to switch to Adam after a certain number of epochs to guarantee convergence.
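A hypothetical usage sketch of this switching schedule follows: the multiple integral term is active only for the first `switch_epochs` epochs (expressed here in optimizer steps), after which the update reduces to plain Adam so that training settles. `MIAdamSketch` refers to the sketch given after Algorithm 1; the toy data, model, and epoch counts are stand-ins, with $k = 0.98$ and a switch after 20 epochs taken from the experimental settings in the Appendix.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for a real dataset and network.
data = TensorDataset(torch.randn(512, 3 * 32 * 32), torch.randint(0, 10, (512,)))
train_loader = DataLoader(data, batch_size=128, shuffle=True)
model = nn.Linear(3 * 32 * 32, 10)
criterion = nn.CrossEntropyLoss()

switch_epochs, num_epochs = 20, 100              # switch point is a tunable choice
opt = MIAdamSketch(model.parameters(), lr=1e-3, k=0.98, order=1,
                   switch_step=switch_epochs * len(train_loader))

for epoch in range(num_epochs):
    for x, y in train_loader:
        opt.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        opt.step()                               # MI term is active only before the switch
```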
Simulations and Experiments
In this section, we conduct simulations on 2-parameter loss landscapes to illustrate the ability of MIAdam to escape from sharp minima. Furthermore, extensive empirical experiments are conducted to demonstrate that MIAdam outperforms Adam in terms of generalization and robustness against label noises.




Optimizer | ResNet18 CIFAR-10(%) | Time | ResNet18 CIFAR-100(%) | Time | ResNet50 CIFAR-10(%) | Time | ResNet50 CIFAR-100(%) | Time
---|---|---|---|---|---|---|---|---
Adam | 47m | 47m | 2h 45m | 2h 47m | ||||
NAdam | 46m | 46m | 2h 57m | 2h 58m | ||||
AdamW | 48m | 47m | 2h 45m | 2h 48m | ||||
ND-Adam | 44m | 48m | 2h 39m | 2h 57m | ||||
Adamax | 1h 28m | 1h 27m | 2h 48m | 2h 52m | ||||
AdaBound | 48m | 49m | 3h 1m | 3h 2m | ||||
SWATS | 1h 6m | 54m | 2h 51m | 2h 50m | ||||
Adai | 53m | 53m | 3h 32m | 3h 18m | ||||
MIAdam1 | * | 47m | * | 47m | 2h 49m | * | 2h 47m | |
MIAdam2 | 47m | 46m | * | 2h 47m | 2h 46m | |||
MIAdam3 | 47m | 47m | 2h 48m | 2h 46m | ||||
Optimizer | DenseNet121 CIFAR-10(%) | Time | DenseNet121 CIFAR-100(%) | Time | PyramidNet110 CIFAR-10(%) | Time | PyramidNet110 CIFAR-100(%) | Time
---|---|---|---|---|---|---|---|---
Adam | 3h 30m | 3h 32m | 2h 46m | 2h 50m | ||||
NAdam | 4h 23m | 4h 24m | 3h 34m | 3h 32m | ||||
AdamW | 3h 25m | 3h 30m | 2h 48m | 2h 43m | ||||
ND-Adam | 3h 23m | 3h 34m | 2h 58m | 3h 11m | ||||
Adamax | 3h 40m | 3h 47m | 4h 51m | 5h 16m | ||||
AdaBound | 4h 0m | 4h 2m | 3h 32m | 3h 27m | ||||
SWATS | 3h 35m | 3h 45m | 3h 35m | 2h 29m | ||||
Adai | 4h 28m | 4h 42m | 4h 18m | 4h 5m | ||||
MIAdam1 | * | 3h 29m | * | 3h 26m | * | 2h 59m | * | 2h 56m |
MIAdam2 | 3h 29m | 3h 32m | 2h 52m | 2h 52m | ||||
MIAdam3 | 3h 30m | 3h 30m | 2h 54m | 2h 53m |
Simulations
This subsection mainly includes simulations demonstrating that MIAdam escapes from sharp minima more easily than Adam and explores the impact of the learning rate on the optimization process. The first simulation is conducted on an elaborate 2-parameter loss landscape (Yang 2020) with one flat minimum surrounded by two sharp minima, whose contour map is displayed in Fig. 2(a). On this loss landscape, the learning rates of Adam and MIAdam1 are respectively set to 0.05, 0.1, and 0.15, and the simulation results for their corresponding optimization trajectories are shown in Figs. 2(b)-(d). It is clear that MIAdam1 tends to escape from sharp minima and converge to the flat minimum compared to Adam on the 2-parameter loss landscape. Moreover, as the learning rate increases, MIAdam1 remains able to converge to the flat minimum. In contrast, Adam always shows poor convergence on this loss landscape and cannot converge well to either the flat minimum or the sharp minima. Therefore, our proposed method is effective in finding flat minima and cannot simply be replaced by increasing the learning rate of Adam.
The second 2-parameter loss landscape used for simulations is depicted in Fig. 2(e), which contains a large number of sharp minima and flat minima. On this loss landscape, the optimization trajectories of MIAdam1, MIAdam2, MIAdam3, and Adam are compared in Figs. 2(f)-(h). Simulation results indicate that MIAdam1, MIAdam2, and MIAdam3 tend to converge toward flat minima, while Adam tends to converge to the nearest sharp minima. It is worth noting that MIAdam3 exhibits more intense oscillations near the flat region compared to MIAdam2, which suggests that increasing the order of multiple integration does not always lead to improved outcomes. Different starting points influence the trajectory of the optimizer. Therefore, to make the simulation results more convincing, we conduct additional simulations from 2,500 different starting coordinate points and compare, for each optimizer, the sum of the absolute values of the eigenvalues of the Hessian matrix at the final convergence point. Due to space constraints, the simulation results are presented in the Appendix.
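The sketch below illustrates this kind of trajectory comparison on a toy two-parameter surface of our own construction (it is not the landscape from (Yang 2020)): one narrow well and one wide well, with Adam and the `MIAdamSketch` class from earlier started near the sharp well. On such a toy surface the pre-switch phase may wander widely, consistent with the non-convergence discussed in Theorem 2, so the values of $k$ and the switch step matter.

```python
import torch

def loss_fn(w):
    """Toy 2-parameter surface (our construction): a narrow/sharp well near (-1, 0)
    and a wide/flat well near (2, 0), both with zero loss."""
    sharp = 1.0 - torch.exp(-20.0 * ((w[0] + 1.0) ** 2 + w[1] ** 2))
    flat = 1.0 - torch.exp(-0.5 * ((w[0] - 2.0) ** 2 + w[1] ** 2))
    return sharp * flat

def run(optimizer_cls, steps=1500, **kwargs):
    w = torch.tensor([-0.8, 0.3], requires_grad=True)   # start near the sharp well
    opt = optimizer_cls([w], **kwargs)
    trajectory = []
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(w).backward()
        opt.step()
        trajectory.append(w.detach().clone())
    return torch.stack(trajectory)

adam_traj = run(torch.optim.Adam, lr=0.1)
miadam_traj = run(MIAdamSketch, lr=0.1, k=0.885, order=1, switch_step=1400)
print("Adam final point:   ", adam_traj[-1], "loss:", loss_fn(adam_traj[-1]).item())
print("MIAdam1 final point:", miadam_traj[-1], "loss:", loss_fn(miadam_traj[-1]).item())
```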
Experiments
The effectiveness of MIAdam is evaluated in this subsection through extensive empirical experiments. Initially, we conduct image classification experiments with various neural network architectures on CIFAR (http://www.cs.toronto.edu/~kriz/cifar.html) and ImageNet-1k (https://www.image-net.org/), compared against widely used adaptive learning rate optimizers, including Adam and its SOTA variants. Additionally, we utilize the fast computation method of the Hessian information of loss landscapes provided in (Yao et al. 2020) for further comparative analyses. Subsequently, the effectiveness of the proposed MIAdam optimizer for text classification tasks is tested using the BERT and RoBERTa models across four distinct datasets (Lin et al. 2021). Finally, to validate the robustness of MIAdam against label noises, we perform image classification experiments on datasets injected with label noises. Results where MIAdam exceeds Adam are shown in bold, and the best results are marked with asterisks. Because of space constraints, the detailed experimental settings for all experiments are included in the Appendix.
Image Classification Experiments
To make our experimental results more convincing, we employ four different neural network architectures for image classification tasks on the CIFAR-10 and CIFAR-100 datasets: ResNet18 (He et al. 2016), ResNet50 (He et al. 2016), DenseNet121 (Huang et al. 2017), and PyramidNet110 (Han, Kim, and Kim 2017). For experiments on large-scale image datasets, we utilize the AlexNet (Krizhevsky, Sutskever, and Hinton 2012), ResNet18, and DenseNet121 architectures for both training and testing on ImageNet-1k. The classification performance of MIAdam is compared with optimizers such as Adam, NAdam (Dozat 2016), AdamW (Loshchilov and Hutter 2017), ND-Adam, Adamax (Kingma and Ba 2015), AdaBound, SWATS, and Adai (Xie et al. 2022). Detailed hyperparameters and experimental settings are presented in the Appendix. As observed from Table 1 and Table 2, MIAdam maintains a training time comparable to Adam while achieving much better performance. To compare the flatness of the final convergence regions, we compute the top Hessian eigenvalues, Hessian traces, and full Hessian eigenvalue densities of the loss landscapes reached by Adam, MIAdam1, MIAdam2, and MIAdam3 using DenseNet121 on the CIFAR-100 dataset in Fig. 3. Fig. 3 suggests that the multiple integral term is helpful in finding flatter minima in a specific neural network training task.
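The flatness comparison can be reproduced in spirit with PyHessian (Yao et al. 2020); the sketch below assumes the package's documented interface and uses a toy model and a random batch as stand-ins for the trained DenseNet121 and a CIFAR-100 mini-batch.

```python
import torch
import torch.nn as nn
from pyhessian import hessian   # pip install pyhessian

# Toy stand-ins for the trained network and one mini-batch of data.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100))
criterion = nn.CrossEntropyLoss()
inputs, targets = torch.randn(64, 3, 32, 32), torch.randint(0, 100, (64,))

comp = hessian(model, criterion, data=(inputs, targets), cuda=False)
top_eigs, _ = comp.eigenvalues(top_n=5)     # larger top eigenvalues indicate sharper minima
trace_est = comp.trace()                    # Hutchinson estimates of the Hessian trace
print("top-5 Hessian eigenvalues:", top_eigs)
print("Hessian trace (mean estimate):", sum(trace_est) / len(trace_est))
```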
Optimizer | AlexNet(%) | ResNet18(%) | DenseNet121(%) |
---|---|---|---|
Adam | |||
MIAdam1 | * | * | * |
MIAdam2 | |||
MIAdam3 |
Dataset | Optimizer | BERT(%) | RoBERTa(%) |
---|---|---|---|
R8 | Adam | ||
MIAdam1 | * | * | |
MIAdam2 | |||
MIAdam3 | |||
R52 | Adam | ||
MIAdam1 | |||
MIAdam2 | |||
MIAdam3 | * | * | |
MR | Adam | ||
MIAdam1 | * | * | |
MIAdam2 | |||
MIAdam3 |
Text Classification Experiments
We conduct text classification experiments by fine-tuning the pre-trained BERT and RoBERTa models on three widely used text datasets: R8, R52, and Movie Review (MR). Each optimizer is run three times on each dataset using different network structures, with the mean and standard deviation of the test accuracy reported in Table 3. The experimental results indicate that MIAdam significantly outperforms Adam in text classification tasks.
Robustness Against Label Noises
In this subsection, we investigate the capacity of MIAdam to withstand label noises in the training dataset, thereby validating its robustness against label noises. The ResNet18 network is trained using Adam and MIAdam on a corrupted version of the CIFAR-10 dataset, where some of the training labels are randomly flipped while the inputs are kept clean. The noise levels are 20%, 40%, 60%, and 80%. At each noise level, each optimizer is run only once. The remaining experimental settings are consistent with those used in the previous image classification experiments. As indicated in Table 4, MIAdam consistently achieves the highest test accuracy across all noise levels, underscoring its superior robustness against label noises.
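A hedged sketch of this label-noise injection follows: a chosen fraction of CIFAR-10 training labels is replaced by a uniformly drawn different class while the inputs stay clean. Guaranteeing that the flipped label differs from the true one is our assumption about "randomly flipped"; the helper name `corrupt_labels` is ours.

```python
import numpy as np
import torchvision

def corrupt_labels(labels, noise_rate, num_classes=10, seed=0):
    """Flip a `noise_rate` fraction of labels to a uniformly chosen different class."""
    rng = np.random.default_rng(seed)
    labels = np.array(labels)
    n_flip = int(noise_rate * len(labels))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    offsets = rng.integers(1, num_classes, size=n_flip)   # offset 1..9 never maps back
    labels[idx] = (labels[idx] + offsets) % num_classes
    return labels

train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
train_set.targets = corrupt_labels(train_set.targets, noise_rate=0.4).tolist()
# train_set can now be wrapped in a DataLoader and trained with Adam or MIAdamSketch.
```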
Optimizer | Noise rate 20(%) | Noise rate 40(%) | Noise rate 60(%) | Noise rate 80(%)
---|---|---|---|---
Adam | ||||
ND-Adam | ||||
AdaBound | ||||
SWATS | ||||
Adai | ||||
MIAdam1 | * | * | * | * |
MIAdam2 | ||||
MIAdam3 |
Conclusion
In this paper, we have proposed MIAdam, a new adaptive learning rate optimizer with a multiple integral term added to Adam. MIAdam smooths the optimization trajectory through the filtering effect of the multiple integral term, enabling it to escape sharp local minima during training and converge towards flat minima, thereby alleviating the problem of poor generalization of Adam and improving the robustness against label noises while retaining the fast convergence of Adam. Utilizing the diffusion theory framework, we have provided the proof that incorporating the multiple integral term enhances the capability of the optimizer to escape sharp minima and converge to flatter minima, thus improving the generalization of the models. We have analyzed the convergence of MIAdam and provided a guarantee of convergence. The simulations have demonstrated that MIAdam is capable of finding flatter minima compared to Adam. For empirical analyses, we have conducted image classification experiments, text classification experiments, and experiments that inject label noises into datasets. The experimental results show that MIAdam has much better generalization and robustness against label noises than Adam. Future work will focus on introducing multiple integral terms into other optimizers.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 62476115 and Grant 62176109, in part by the Fundamental Research Funds for the Central Universities under Grant lzujbky2023-ct05 and Grant Izuibky-2023-ey07, in part by the China Computer Federation (CCF)-Baidu Open Fund under Grant 202306, and in part by the Supercomputing Center of Lanzhou University.
References
- Chaudhari et al. (2019) Chaudhari, P.; et al. 2019. Entropy-SGD: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12): 124018.
- Deng et al. (2023) Deng, X.; Sun, T.; Li, S.; and Li, D. 2023. Stability-based generalization analysis of the asynchronous decentralized SGD. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 7340–7348.
- Dinh et al. (2017) Dinh, L.; Pascanu, R.; Bengio, S.; and Bengio, Y. 2017. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, 1019–1028.
- Dozat (2016) Dozat, T. 2016. Incorporating Nesterov momentum into Adam. In International Conference on Machine Learning.
- Du et al. (2022) Du, J.; Zhou, D.; Feng, J.; Tan, V.; and Zhou, J. T. 2022. Sharpness-aware training for free. Advances in Neural Information Processing Systems, 23439–23451.
- Han, Kim, and Kim (2017) Han, D.; Kim, J.; and Kim, J. 2017. Deep pyramidal residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5927–5935.
- He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
- Hochreiter and Schmidhuber (1995) Hochreiter, S.; and Schmidhuber, J. 1995. Simplifying neural nets by discovering flat minima. Advances in Neural Information Processing Systems, 7: 529–536.
- Hochreiter and Schmidhuber (1997) Hochreiter, S.; and Schmidhuber, J. 1997. Flat minima. Neural Computation, 9(1): 1–42.
- Huang et al. (2017) Huang, G.; Liu, Z.; van der Maaten, L.; and Weinberger, K. Q. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Jiang et al. (2020) Jiang, Y.; Neyshabur, B.; Mobahi, H.; Krishnan, D.; and Bengio, S. 2020. Fantastic Generalization Measures and Where to Find Them. In International Conference on Learning Representations.
- Jin, Zhang, and Li (2015) Jin, L.; Zhang, Y.; and Li, S. 2015. Integration-enhanced Zhang neural network for real-time-varying matrix inversion in the presence of various kinds of noises. IEEE Transactions on Neural Networks and Learning Systems, 27(12): 2615–2627.
- Johnson et al. (2020) Johnson, T.; Agrawal, P.; Gu, H.; and Guestrin, C. 2020. AdaScale SGD: A user-friendly algorithm for distributed training. In International Conference on Machine Learning, 4911–4920.
- Kalinay and Percus (2012) Kalinay, P.; and Percus, J. K. 2012. Phase space reduction of the one-dimensional Fokker-Planck (Kramers) equation. Journal of Statistical Physics, 148(6): 1135–1155.
- Keskar et al. (2017) Keskar, N. S.; Mudigere, D.; Nocedal, J.; Smelyanskiy, M.; and Tang, P. T. P. 2017. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations.
- Keskar and Socher (2017) Keskar, N. S.; and Socher, R. 2017. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628.
- Kingma and Ba (2015) Kingma, D. P.; and Ba, J. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
- Krizhevsky, Sutskever, and Hinton (2012) Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1097–1105.
- Lin et al. (2021) Lin, Y.; Meng, Y.; Sun, X.; Han, Q.; Kuang, K.; Li, J.; and Wu, F. 2021. BERTGCN: Transductive text classification by combining GCN and BERT. arXiv preprint arXiv:2105.05727.
- Liu et al. (2021) Liu, Z.; Li, B.; Simon, J. B.; and Ueda, M. 2021. SGD can converge to local maxima. In International Conference on Learning Representations.
- Loshchilov and Hutter (2017) Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Luo et al. (2019) Luo, L.; Xiong, Y.; Liu, Y.; and Sun, X. 2019. Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843.
- Mulayoff and Michaeli (2020) Mulayoff, R.; and Michaeli, T. 2020. Unique properties of flat minima in deep networks. In International Conference on Machine Learning, 7108–7118.
- Petzka et al. (2021) Petzka, H.; Kamp, M.; Adilova, L.; Sminchisescu, C.; and Boley, M. 2021. Relative flatness and generalization. Advances in Neural Information Processing Systems, 18420–18432.
- Qian (1999) Qian, N. 1999. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1): 145–151.
- Roberts and Mullis (1987) Roberts, R. A.; and Mullis, C. T. 1987. Digital signal processing. Addison-Wesley Longman Publishing Co., Inc.
- Savarese (2019) Savarese, P. 2019. On the convergence of AdaBound and its connection to SGD. arXiv preprint arXiv:1908.04457.
- Sun et al. (2023) Sun, Y.; Shen, L.; Chen, S.; Ding, L.; and Tao, D. 2023. Dynamic Regularized Sharpness Aware Minimization in Federated Learning: Approaching Global Consistency and Smooth Landscape. In International Conference on Machine Learning.
- Wang et al. (2018) Wang, H.; Keskar, N. S.; Xiong, C.; and Socher, R. 2018. Identifying generalization properties in neural networks. arXiv preprint arXiv:1809.07402.
- Wilson et al. (2017) Wilson, A. C.; Roelofs, R.; Stern, M.; Srebro, N.; and Recht, B. 2017. The marginal value of adaptive gradient methods in machine learning. Advances in Neural Information Processing Systems, 4148–4158.
- Xie, Sato, and Sugiyama (2020) Xie, Z.; Sato, I.; and Sugiyama, M. 2020. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In International Conference on Machine Learning.
- Xie et al. (2022) Xie, Z.; Wang, X.; Zhang, H.; Sato, I.; and Sugiyama, M. 2022. Adaptive inertia: Disentangling the effects of adaptive learning rate and momentum. In International Conference on Machine Learning, 24430–24459.
- Yang (2020) Yang, X. 2020. Stochastic gradient variance reduction by solving a filtering problem. arXiv preprint arXiv:2012.12418.
- Yao et al. (2020) Yao, Z.; Gholami, A.; Keutzer, K.; and Mahoney, M. W. 2020. PyHessian: Neural networks through the lens of the Hessian. In 2020 IEEE International Conference on Big Data (Big Data), 581–590.
- Yao et al. (2021) Yao, Z.; Gholami, A.; Shen, S.; Mustafa, M.; Keutzer, K.; and Mahoney, M. 2021. AdaHessian: An adaptive second order optimizer for machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 10665–10673.
- Zhang (2018) Zhang, Z. 2018. Improved Adam optimizer for deep neural networks. In 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), 1–2.
- Zou et al. (2021) Zou, D.; Cao, Y.; Li, Y.; and Gu, Q. 2021. Understanding the generalization of Adam in learning neural networks with proper regularization. arXiv preprint arXiv:2108.11371.
Appendix A Appendix
Appendix B Simulation and Experiment Settings
All experimental procedures are implemented using Python 3.6 and PyTorch 1.8.3. Computational tasks are executed on a system running Ubuntu 16.04 LTS, equipped with an array of 10 NVIDIA RTX 2080Ti GPUs.
Simulation Settings
The simulation settings are illustrated in this paragraph. All simulations employ CosineAnnealingLR as the learning rate adjustment strategy, with a total of 1500 training steps. These hyperparameters are set identically for all optimizers: , , , , and . For the simulations on the loss landscape in Fig. 2(a), the hyperparameters $k$ and $T_s$ for MIAdam are set to 0.885 and 1400, respectively. The same settings are applied for the simulations on the loss landscape in Fig. 2(e), with the only difference being the learning rate.
Experiment Settings
Image Classification Experiments: For MIAdam, we adopt the same hyperparameters as those used for Adam to ensure a consistent comparison basis. Additionally, the hyperparameters $k$ and $T_s$ are identified with their optimal values through grid search. After the grid search, the optimal values of $k$ and $T_s$ are 0.98 and 20, respectively. For each optimizer, we calculate the mean and standard deviation of the top-1 test accuracy over three individual runs and report the average training time of the three runs. For the image classification experiments on the CIFAR datasets, the hyperparameter configurations for the various optimizers are provided in Table 5.
Epoch ; batch size ; milestone | |
---|---|
Adam | , , , |
NAdam | , , , , |
ND-Adam | , , , |
Adamax | , , , |
AdaBound | , , , |
SWATS | , , , |
Adai | , , , |
MIAdam | , , , , , |
For the experiments conducted on the ImageNet-1k dataset, all settings remain consistent with those used for the CIFAR dataset experiments, except for the total number of training epochs, which is set to 90.
Text Classification Experiments: We fine-tune the BERT and RoBERTa models using a learning rate of 0.0001 and a batch size of 128 for 30 epochs. Both optimizers used in training, Adam and MIAdam, employ milestones as the learning rate adjustment strategy, with the same hyperparameter settings as follows: , , , , , and . Additionally, for MIAdam, the extra hyperparameters $k$ and $T_s$ are set to 0.98 and 20, respectively.
Experiments on Datasets Injected with Label Noises: The extra hyperparameter $T_s$ of MIAdam is set to 40. Apart from the injection of label noises on the CIFAR-10 dataset, the experimental settings are identical to those of the image classification experiments.
Appendix C Supplementary Experiments and Simulations
Simulations
On the second loss landscape in Fig. 2(e), different starting points influence the trajectory of the optimizer. Therefore, to make the simulation results more convincing, we uniformly select 2,500 starting coordinate points on this loss landscape and conduct 2,500 rounds of simulations in the region where and . Furthermore, for each optimizer, we calculate the sum of the absolute values of the eigenvalues of the Hessian matrix at the final convergence point reached from each starting coordinate and compare them. As can be seen in Fig. 4, the sum of the absolute values of the eigenvalues at the final convergence location on the second loss landscape is generally smaller for MIAdam2 and MIAdam3 than for Adam, which suggests that the multiple integral term is indeed helpful in finding flat minima.
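For a two-parameter surface, this flatness metric can be computed directly with automatic differentiation; the sketch below uses the toy surface from the earlier simulation sketch and is a generic illustration of the metric rather than the authors' exact procedure.

```python
import torch
from torch.autograd.functional import hessian

def toy_loss(w):
    """Same toy surface as in the simulation sketch: sharp well at (-1, 0), flat well at (2, 0)."""
    sharp = 1.0 - torch.exp(-20.0 * ((w[0] + 1.0) ** 2 + w[1] ** 2))
    flat = 1.0 - torch.exp(-0.5 * ((w[0] - 2.0) ** 2 + w[1] ** 2))
    return sharp * flat

def flatness_score(loss_fn, w_final):
    """Sum of the absolute Hessian eigenvalues of a 2-parameter loss at a convergence point."""
    H = hessian(loss_fn, w_final)                          # 2x2 Hessian evaluated at w_final
    return torch.linalg.eigvalsh(H).abs().sum().item()     # smaller means flatter

print(flatness_score(toy_loss, torch.tensor([-1.0, 0.0])))   # sharp well: large score
print(flatness_score(toy_loss, torch.tensor([2.0, 0.0])))    # flat well: small score
```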

Experiments
Fig. 5 provides training and testing comparisons of Adam, MIAdam1, MIAdam2, and MIAdam3 on CIFAR-100 using DenseNet121. Fig. 5 shows that the introduction of the multiple integral term leads to an enhancement in generalization rather than in training accuracy. It is worth noting that before switching optimizers, training with the multiple integral term leads to slower convergence or even non-convergence, as the optimizer searches for the flat minima on the loss landscape. However, when considering the entire training process, the epochs taken for convergence to a steady state by Adam and MIAdam are similar.


Appendix D Generalization Proof
Proof of Theorem 1
Proof.
Without loss of generality, we set the hyperparameter and consider the adaptive learning rate at the end of the derivation to make it convenient to analyze the effect of multiple integral term on generalization. Consequently, the parameter update formula for MIAdam1 without adaptive learning rate and switching optimizers is simplified as follows:
(18) |
Then, we write the deformed motion equation as follows:
(19) |
where , , is equivalent to , , and . When the learning rate is sufficiently small, the differential form of the motion equation is derived as
(20) |
where . As corresponds to the stochastic gradient term, we obtain
(21) |
where ; ; presents the identity matrix; is the diffusion matrix in the dynamics; denotes the gradient noise covariance matrix. Furthermore, its Fokker-Planck equation in the phase space (the - space) is written as follows:
(22) | ||||
where ; represents the probability flux density; denotes the divergence operator. Under Assumption 2, the probability distribution in the valley around is derived as
(23) | ||||
According to (Kalinay and Percus 2012), in the case of finite inertia, we transform the phase-space equation into the position-space Smoluchowski-like form with the effective diffusion correction:
(24) |
Suppose that is fixed on an escape path from sharp minimum to flat minimum through saddle point , we derive from (Xie et al. 2022) as
(25) | ||||
where and denotes the temperature parameter in the stationary distribution. Based on the formula of probability current and flux, we obtain the flux escaping through saddle point :
(26) | ||||
where superscript ⊥e indicates the directions perpendicular to the escape direction . Considering the adaptive learning rate, we have according to the case of Adam in (Xie et al. 2022), where superscript + presents the transformation that and the -th column vector of is the eigenvector corresponding to . Finally, we obtain the mean escape time of MIAdam1:
(27) | ||||
The proof is thus completed. ∎
Appendix E Convergence Proof
Definition 1.
For a convex set $\mathcal{X} \subseteq \mathbb{R}^{d}$, a function $f: \mathcal{X} \rightarrow \mathbb{R}$ is convex if, for all $x, y \in \mathcal{X}$ and all $\lambda \in [0, 1]$, it satisfies
$$\lambda f(x) + (1-\lambda) f(y) \ge f\big(\lambda x + (1-\lambda) y\big), \qquad (28)$$
and, for all $x, y \in \mathcal{X}$, it satisfies
$$f(y) \ge f(x) + \nabla f(x)^{\top} (y - x). \qquad (29)$$
Definition 2.
For a convex function $f_t$, we use the regret $R(T)$ to determine the convergence. Its expression is shown as follows:
$$R(T) = \sum_{t=1}^{T}\Big[f_t(\theta_t) - f_t(\theta^{\star})\Big]. \qquad (30)$$
When $\lim_{T \rightarrow \infty} R(T)/T \neq 0$, we consider the algorithm to be non-convergent.
Proof of Theorem 2
Proof.
According to the Definition 1 and the Definition 2, we have
(31) | ||||
From the updating rules of MIAdam, we get
(32) | ||||
Then we obtain
(33) | ||||
For the first two terms {1} and {2} in Eq. (LABEL:eq:23), we derive the following two inequalities based on (Kingma and Ba 2015):
(34) | ||||
and
(35) | |||||
Since the gradient is assumed to be bounded, the should also be bounded in Adam and satisfies . For the term {3}, we have
(36) | ||||
which is a divergent series. Therefore, we further deduce the following limit:
(37) |
For the term {4}, we have
(38) | ||||
Analogous to the derivation of the term {3}, we similarly obtain the following limit:
(39) |
Finally, following the above derivation, we have
(40) |
The proof is thus completed. ∎