

A New Accelerated Stochastic Gradient Method with Momentum

Liang Liu (mf1915058@smail.nju.edu.cn) and Xiaopeng Luo (xpluo@nju.edu.cn)
Nanjing, China
Abstract

In this paper, we propose a novel accelerated stochastic gradient method with momentum, in which the momentum term is a weighted average of previous gradients whose weights decay in inverse proportion to the iteration count. Stochastic gradient descent with momentum (Sgdm) uses weights that decay exponentially with the iteration count to generate the momentum term, and variants of Sgdm with carefully designed, more complicated formats built on exponentially decaying weights have been proposed to achieve better performance. The momentum update rule of our method is as simple as that of Sgdm. We provide a theoretical convergence analysis for our method, which shows that both exponentially decaying weights and our inversely proportionally decaying weights confine the variance of the moving direction of the parameters being optimized to a bounded region. Experimental results show empirically that our method works well on practical problems and outperforms Sgdm, and that it outperforms Adam on convolutional neural networks.

Keywords: exponential decay rate weight, gradient descent, inverse proportional decay rate weight, momentum

Editors: Wee Sun Lee and Taiji Suzuki

1 Introduction

A stochastic optimization problem can be considered as minimizing a differentiable function $F:\mathbb{R}^{d}\rightarrow\mathbb{R}$ with a stochastic gradient descent (SGD) optimizer over a data set $\mathcal{S}$ of $n$ samples. In the empirical risk format, the objective can be written as $F(x_{k})=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x_{k})$, where $f_{i}(x_{k})$ is the realization of $F(x_{k})$ on the $i$th sample. Given an initial point $x_{0}$ and an initial stepsize $\alpha_{0}>0$, the optimizer iterates with the following rule

\[x_{k+1}\leftarrow x_{k}-\alpha_{k}g_{i}(x_{k}),\tag{1}\]

until $F(x_{k})$ reaches a predefined state. The vector $g_{i}(x_{k})$ is a stochastic gradient of $f_{i}(x)$ satisfying $\mathbb{E}[g_{i}(x_{k})]=\nabla F(x_{k})$ with bounded variance Bottou et al. (2018). For large-scale data sets, the computational cost of iterating with the full gradient $\nabla F(x_{k})=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(x_{k})$ is unacceptable. It is more efficient to evaluate a single component gradient $\nabla f_{i}(x_{k})$, where $i$ is the index of a sample selected uniformly from the $n$ samples, and move in the noisy direction $g_{i}(x_{k})=\nabla f_{i}(x_{k})$, than to move in the full gradient direction with a larger stepsize Ward et al. (2019); Bottou et al. (2018). However, choosing a feasible and suitable stepsize schedule $\{\alpha_{k}>0\}$ is difficult. According to the work of Robbins and Monro (1951), the stepsize schedule should satisfy

\[\sum_{k=1}^{\infty}\alpha_{k}=\infty\quad\text{and}\quad\sum_{k=1}^{\infty}\alpha_{k}^{2}<\infty\tag{2}\]

to ensure $\lim_{k\rightarrow\infty}\mathbb{E}[\|\nabla F(x_{k})\|^{2}]=0$. This constraint on the stepsize makes stochastic gradient methods converge more slowly than full gradient descent methods.
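To make the setup concrete, the following minimal Python sketch (ours, not from the paper) runs update rule (1) on a hypothetical least-squares problem with the schedule $\alpha_{k}=\alpha_{0}/k$, which satisfies (2):

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))                # hypothetical design matrix
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)  # noisy targets

def grad_i(x, i):
    # stochastic gradient of F(x) = (1/2n) sum_i (A_i x - b_i)^2, sample i
    return A[i] * (A[i] @ x - b[i])

x = np.zeros(d)
alpha0 = 0.5
for k in range(1, 5001):
    i = rng.integers(n)          # index of a uniformly selected sample
    alpha_k = alpha0 / k         # schedule satisfying (2)
    x -= alpha_k * grad_i(x, i)  # update rule (1)

print("distance to x_true:", np.linalg.norm(x - x_true))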

Methods that use gradients from previous iterations as momentum terms have been proposed to accelerate convergence and show great benefits Kingma and Ba (2014); Leen and Orr (1993). Yet theoretical analyses of those methods in stochastic settings remain elusive, and the way the momentum term is generated from previous gradients can be revised.

1.1 Stochastic Gradient Methods With Momentum

Stochastic gradient descent methods with momentum (Sgdm) update the parameter $x$ with

\[x_{k+1}\leftarrow x_{k}-\alpha_{k}g_{k}(x_{k})+\gamma m_{k},\tag{3}\]

where $m_{k}$ is called the momentum term, and Sgdm updates it with

\[m_{k}=\gamma m_{k-1}-\alpha_{k-1}g_{k-1}(x_{k-1}).\tag{4}\]

If $\gamma$ is a constant, then we call it an exponential decay rate. With (4), we can rewrite the update rule (3) as

\[x_{k+1}=x_{k}-\sum_{j=1}^{k}\alpha_{j}\gamma^{k-j}g_{j}(x_{j}),\tag{5}\]

which shows that the moving direction is a weighted average of all previous gradients with exponentially decaying weights. We denote this weighted average by $v_{k}=-\sum_{j=1}^{k}\alpha_{j}\gamma^{k-j}g_{j}(x_{j})$. It is believed that the momentum term can reduce the variance Sutskever et al. (2013). Using an exponential decay rate is common and effective; according to our following analyses, an exponential decay rate limits the variance of $v_{k}$ to a region determined by the constant $\gamma$.
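To make the equivalence between the recursion and the unrolled form (5) concrete, here is a small numerical sanity check (our illustration; arbitrary random numbers stand in for the stochastic gradients $g_{j}(x_{j})$):

import numpy as np

rng = np.random.default_rng(1)
K, gamma, alpha0 = 20, 0.9, 0.1
g = rng.normal(size=K + 1)                     # stand-ins for g_1, ..., g_K
alpha = alpha0 / np.sqrt(np.arange(1, K + 2))  # alpha_k = alpha_0 / sqrt(k)

v = 0.0
for k in range(1, K + 1):
    v = gamma * v - alpha[k - 1] * g[k]        # recursive momentum update

# unrolled weighted average: v_K = -sum_j alpha_j gamma^(K-j) g_j
v_unrolled = -sum(alpha[j - 1] * gamma ** (K - j) * g[j]
                  for j in range(1, K + 1))
assert np.isclose(v, v_unrolled)               # the two forms coincide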

Main Contribution. First, we propose a novel stochastic gradient descent method with an inverse proportional decay rate momentum, which dynamically adjusts the momentum to match the convergence of the stochastic optimization problem. In addition, our rigorous analyses prove that both the simple Sgdm momentum and our novel momentum term limit the variance of the moving direction to a bounded region (Theorem 3.4, Theorem 3.6). We state the two main theorems informally below.

For a differentiable function $F$ with $L$-Lipschitz gradient and bounded gradient variance, Theorem 3.4 implies that the momentum term with exponentially decaying weights limits the variance of $v_{k}$:

\[\mathbb{V}[v_{k}]\leq\frac{\alpha_{0}^{2}}{1-\gamma^{2}}\Big(M+2M_{V}\big(\|\nabla F(x_{k})\|_{2}^{2}+L^{2}D^{2}\big)\Big).\]

Theorem 3.6 implies that the momentum term with our inversely proportionally decaying weights limits the variance of $v_{k}$:

\[\mathbb{V}[v_{k}]\leq\frac{\alpha_{0}^{2}}{2\beta}\Big(M+2M_{V}\|\nabla F(x_{k})\|_{2}^{2}+2M_{V}L^{2}D^{2}\Big).\]

Extensive experiments in Section 4 show that the robustness of our method extends from linear regression to practical deep learning models.

1.2 Previous Work

The momentum method and its use within optimization problems have been studied extensively Sutskever et al. (2013); Orr (1996); Leen and Orr (1993). The classical momentum (CM) method Polyak (1964) accumulates a decaying sum of the previous updates of the parameter $x$ into a momentum term $m$ using equation (4) and updates $x$ with (3). CM uses a constant hyperparameter $\gamma$ and learning rate $\alpha$. With this method, the steps tend to accumulate contributions in directions of persistent descent, while directions that oscillate tend to be cancelled, or at least remain small Bottou et al. (2018). Thus, this method moves optimization algorithms faster along dimensions of low curvature, where the update is small and its direction is persistent, and more slowly along turbulent dimensions, where the update usually changes its direction significantly Sutskever et al. (2013). The adaptive momentum method in Leen and Orr (1993) uses a decaying learning rate and an adaptive $\beta=\max(0,1-\alpha_{0}\mu^{2})$, where $\mu(t)$ is the network input at time $t$.

1.3 Our Method

Our method updates $x_{k}$ with

\[x_{k+1}\leftarrow x_{k}-\alpha_{k}g_{k}(x_{k})+\gamma(k)m_{k},\tag{6}\]

and it updates the momentum term $m_{k}$ with

\[m_{k}\leftarrow\gamma(k-1)m_{k-1}-\alpha_{k-1}g_{k-1}(x_{k-1}),\tag{7}\]

where

\[\gamma(k)=\left(\frac{k}{k+1}\right)^{\beta}.\tag{8}\]

The hyperparameter $\beta$ is a predefined constant. Thus we can rewrite the updating rules (6), (7), and (8) as

\[x_{k+1}\leftarrow x_{k}-\sum_{i=1}^{k}\alpha_{i}g_{i}(x_{i})\left(\frac{i}{k}\right)^{\beta}.\]

We denote the moving direction of $x_{k}$ by $v_{k}=-\sum_{i=1}^{k}\alpha_{i}g_{i}(x_{i})\left(\frac{i}{k}\right)^{\beta}$.
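The following small numpy snippet (ours, with arbitrary $k$, $\gamma$, and $\beta$) contrasts the two weighting schemes at a fixed iteration $k$: the exponential weights $\gamma^{k-i}$ of Sgdm against our inverse proportional weights $(i/k)^{\beta}$:

import numpy as np

k, gamma, beta = 100, 0.9, 2.0   # hypothetical settings
i = np.arange(1, k + 1)

w_exp = gamma ** (k - i)         # Sgdm: exponential decay in (k - i)
w_inv = (i / k) ** beta          # ours: polynomial decay in i / k

for idx in (1, 50, 90, 100):
    print(f"i={idx:3d}  exp={w_exp[idx - 1]:.2e}  inv={w_inv[idx - 1]:.2e}")

Exponential weights forget old gradients at a fixed rate set by $\gamma$, whereas $(i/k)^{\beta}$ gives an effective averaging window that widens as $k$ grows, matching the decaying stepsize (9).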

We use the same stepsize schedule as Adam (Kingma and Ba, 2014),

\[\alpha_{i}=\frac{\alpha_{0}}{\sqrt{i}},\tag{9}\]

and it is straightforward that $\sum_{k=1}^{\infty}\alpha_{k}=\infty$.

Algorithm 1 Our method
  Require: Stepsize $\alpha_{0}$
  Require: Factor of decay rates, $\beta$
  Require: Stochastic objective function $f(x)$
  Require: Initial parameter vector $x_{0}$
  Initialize momentum vector $m_{0}\leftarrow 0$
  Initialize iteration counter $k\leftarrow 0$
  repeat
     $k\leftarrow k+1$
     $\alpha_{k}\leftarrow\alpha_{0}/\sqrt{k}$
     Generate gradient vector $g_{k}\leftarrow\nabla_{x}f_{k}(x_{k-1})$
     $m_{k}\leftarrow\gamma(k)m_{k-1}-\alpha_{k}g_{k}$
     $x_{k}\leftarrow x_{k-1}+m_{k}$
  until predefined conditions are achieved
  return $x_{k}$

2 Algorithm

See Algorithm 1 for the pseudocode of our proposed algorithm. We denote by $f(x)$ a noisy risk function that is differentiable with respect to the parameters $x$. The optimization problem is to minimize the expected risk $\mathbb{E}[f(x)]$ with respect to $x$. We denote by $f_{1}(x),\dots,f_{T}(x)$ the realizations of the risk function at the subsequent iterations $1,\dots,T$, and let $g_{t}=\nabla_{x}f_{t}(x)$ be the gradient of $f_{t}(x)$ with respect to $x$ at the $t$th iteration.

Our algorithm maintains a weighted average of the previous gradients, denoted by $m_{k}$. The hyperparameter $\beta\in(1,\infty)$ controls the decay rate of the weights. We initialize $m_{0}$ as the zero vector.
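For readers who prefer code, here is a direct Python transcription of Algorithm 1 (a minimal sketch; the noisy quadratic test function and the values of $\alpha_{0}$, $\beta$, and the iteration budget are our own hypothetical choices):

import numpy as np

def our_method(grad, x0, alpha0=0.1, beta=2.0, max_iter=10000):
    # Algorithm 1: SGD with inverse proportional decay rate momentum.
    # grad(x, k) should return a stochastic gradient g_k at x.
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for k in range(1, max_iter + 1):
        alpha_k = alpha0 / np.sqrt(k)
        gamma_k = (k / (k + 1)) ** beta   # decay factor gamma(k), eq. (8)
        g = grad(x, k)
        m = gamma_k * m - alpha_k * g     # momentum update
        x = x + m                         # parameter update
    return x

# usage on a noisy quadratic F(x) = 0.5 ||x||^2 (hypothetical test problem)
rng = np.random.default_rng(0)
noisy_grad = lambda x, k: x + 0.1 * rng.normal(size=x.shape)
x_final = our_method(noisy_grad, x0=np.ones(10))
print("final ||x||:", np.linalg.norm(x_final))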

3 Convergence Analysis

In this section, we show that the momentum term can reduce the variance of the stochastic directions $v_{k}$ as the iterations progress. Both a fixed exponential decay factor $\gamma$ and our changing factor $\gamma(k)$ limit the variance to a bounded region.

Let us begin with a basic assumption of smoothness of the objective function.

Assumption  3.1 (Lipschitz-continuous gradients)

The objective function $F:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is continuously differentiable and its gradient $\nabla F:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is Lipschitz continuous with Lipschitz constant $L>0$, i.e.,

\[\|\nabla F(x)-\nabla F(\bar{x})\|_{2}\leq L\|x-\bar{x}\|_{2}\quad\text{for all }\{x,\bar{x}\}\subset\mathbb{R}^{d}.\]

Assumption 3.1 is essential for convergence analyses of most gradient-based methods; it ensures that the gradient of $F$ does not change arbitrarily quickly with respect to the parameter vector $x$. Based on Assumption 3.1, we have

\[F(x)\leq F(\bar{x})+\nabla F(\bar{x})^{T}(x-\bar{x})+\frac{1}{2}L\|x-\bar{x}\|_{2}^{2},\tag{10}\]

which holds for all $\{x,\bar{x}\}\subset\mathbb{R}^{d}$.

Proof 3.1.

Using Assumption 3.1, we have

\begin{align*}
F(x)&=F(\bar{x})+\int_{0}^{1}\frac{\partial F(\bar{x}+t(x-\bar{x}))}{\partial t}\,dt\\
&=F(\bar{x})+\int_{0}^{1}\nabla F(\bar{x}+t(x-\bar{x}))^{T}(x-\bar{x})\,dt\\
&=F(\bar{x})+\nabla F(\bar{x})^{T}(x-\bar{x})+\int_{0}^{1}\big[\nabla F(\bar{x}+t(x-\bar{x}))-\nabla F(\bar{x})\big]^{T}(x-\bar{x})\,dt\\
&\leq F(\bar{x})+\nabla F(\bar{x})^{T}(x-\bar{x})+\int_{0}^{1}L\|t(x-\bar{x})\|_{2}\|x-\bar{x}\|_{2}\,dt,
\end{align*}

from which the desired result follows, since $\int_{0}^{1}Lt\,dt=\frac{L}{2}$.

3.1 Variance Analysis

We first consider a fixed decay factor γ(0,1)\gamma\in(0,1), which is used in many simple stochastic gradient descent methods with momentum.

Assumption  3.2

The objective function $F$ and the stochastic gradients $g_{i}(x_{k})$ satisfy, for all $k\in\mathbb{N}$: there exist scalars $M\geq 0$ and $M_{V}\geq 0$ such that

\[\mathbb{V}_{x_{k}}[g_{i}(x_{k})]\leq M+M_{V}\|\nabla F(x_{k})\|_{2}^{2}.\]
Theorem 3.2.

Under Assumptions 3.1 and 3.2, suppose that Algorithm 1 satisfies $\|x_{i}-x_{j}\|\leq D$ for any $i,j\in\mathbb{N}$. Then

\[\|\nabla F(x_{j})\|_{2}^{2}\leq 2\|\nabla F(x_{k})\|_{2}^{2}+2L^{2}D^{2}.\]
Proof 3.3.

Write $x_{j}=x_{k}+\delta_{j,k}$, and note that $\|\delta_{j,k}\|_{2}=\|x_{j}-x_{k}\|_{2}\leq D$. By Assumption 3.1,

\[\|\nabla F(x_{j})\|_{2}\leq\|\nabla F(x_{k})\|_{2}+\|\nabla F(x_{j})-\nabla F(x_{k})\|_{2}\leq\|\nabla F(x_{k})\|_{2}+L\|\delta_{j,k}\|_{2},\]

and hence

\begin{align*}
\|\nabla F(x_{j})\|_{2}^{2}&\leq\Big(\|\nabla F(x_{k})\|_{2}+L\|\delta_{j,k}\|_{2}\Big)^{2}\\
&=\|\nabla F(x_{k})\|_{2}^{2}+L^{2}\|\delta_{j,k}\|_{2}^{2}+2L\|\delta_{j,k}\|_{2}\|\nabla F(x_{k})\|_{2}\\
&\leq 2\|\nabla F(x_{k})\|_{2}^{2}+2L^{2}\|\delta_{j,k}\|_{2}^{2}\\
&\leq 2\|\nabla F(x_{k})\|_{2}^{2}+2L^{2}D^{2}
\end{align*}

as claimed.

Theorem 3.4 (a fixed decay factor).

Under the conditions of Theorem 3.2 and Assumption 3.2, suppose that (i) the sequence of iterates $\{x_{k}\}$ is generated by (5) with a fixed factor $\gamma\in(0,1)$ and a stepsize sequence $\{\alpha_{k}\}$ satisfying $\alpha_{k}\geq\alpha_{k+1}$ for all $k\in\mathbb{N}$, and (ii) the sequence $\{x_{k}\}$ satisfies $\|x_{i}-x_{j}\|\leq D$ for any $i,j\in\mathbb{N}$. Then

\[\mathbb{V}[v_{k}]\leq\frac{\alpha_{0}^{2}}{1-\gamma^{2}}\big(M+2M_{V}\|\nabla F(x_{k})\|_{2}^{2}+2M_{V}L^{2}D^{2}\big).\]
Proof 3.5.

For the SGDM updating strategy, we have direction vector

\[v_{k}=-\sum_{i=1}^{k}\alpha_{i}\gamma^{k-i}g_{i}(x_{i}).\]

Hence, along with Assumption 3.2, we obtain

\begin{align*}
\mathbb{V}[v_{k}]&=\sum_{j=1}^{k}\gamma^{2(k-j)}\alpha_{j}^{2}\,\mathbb{V}_{x_{j}}[g_{j}(x_{j})]\\
&\leq\alpha_{0}^{2}\sum_{j=1}^{k}\gamma^{2(k-j)}\Big(M+M_{V}\|\nabla F(x_{j})\|_{2}^{2}\Big)\\
&\leq\alpha_{0}^{2}\sum_{j=1}^{k}\gamma^{2(k-j)}\Big(M+2M_{V}\|\nabla F(x_{k})\|_{2}^{2}+2M_{V}L^{2}D^{2}\Big),
\end{align*}

where the second step uses $\alpha_{j}\leq\alpha_{0}$ and the last step applies Theorem 3.2.

Notice that

\[\sum_{j=1}^{k}\gamma^{2(k-j)}=\frac{1-\gamma^{2k}}{1-\gamma^{2}}.\]

Since $\frac{1-\gamma^{2k}}{1-\gamma^{2}}$ increases to $\frac{1}{1-\gamma^{2}}$ as $k$ grows for $0<\gamma<1$, the variance of $v_{k}$ is ultimately bounded by

\[\frac{\alpha_{0}^{2}}{1-\gamma^{2}}\Big(M+2M_{V}\|\nabla F(x_{k})\|_{2}^{2}+2M_{V}L^{2}D^{2}\Big).\]
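As an informal numerical illustration of Theorem 3.4 (ours, not part of the analysis), consider the special case of purely additive gradient noise, i.e. $M=\sigma^{2}$ and $M_{V}=0$, where the bound reads $\mathbb{V}[v_{k}]\leq\alpha_{0}^{2}\sigma^{2}/(1-\gamma^{2})$:

import numpy as np

rng = np.random.default_rng(2)
K, gamma, alpha0, sigma = 200, 0.9, 0.1, 1.0
trials = 20000

# v_K = -sum_j alpha_j gamma^(K-j) g_j; with additive noise only the
# noise part N(0, sigma^2) contributes to the variance of v_K
alpha = alpha0 / np.sqrt(np.arange(1, K + 1))
weights = alpha * gamma ** (K - np.arange(1, K + 1))
noise = sigma * rng.normal(size=(trials, K))
v = -(noise * weights).sum(axis=1)

print("empirical Var[v_K]:", v.var())
print("bound alpha0^2 sigma^2 / (1 - gamma^2):",
      alpha0 ** 2 * sigma ** 2 / (1 - gamma ** 2))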
Theorem 3.6 (a changing decay factor).

Under the conditions of Theorem 3.2 and Assumption 3.2, suppose that (i) the sequence of iterates $\{x_{k}\}$ is generated by (6) with the changing factor $\gamma(k)=\big(k/(k+1)\big)^{\beta}\in(0,1)$ and the stepsize sequence $\{\alpha_{k}\}$ of (9), which satisfies $\alpha_{k}\geq\alpha_{k+1}$ for all $k\in\mathbb{N}$, and (ii) the sequence $\{x_{k}\}$ satisfies $\|x_{i}-x_{j}\|\leq D$ for any $i,j\in\mathbb{N}$. Then

\[\mathbb{V}[v_{k}]\leq\frac{\alpha_{0}^{2}}{2\beta}\Big(M+2M_{V}\|\nabla F(x_{k})\|_{2}^{2}+2M_{V}L^{2}D^{2}\Big).\]
Proof 3.7.

For our updating strategy, the direction vector is

\[v_{k}=-\sum_{j=1}^{k}\frac{j^{\beta}}{k^{\beta}}\alpha_{j}g_{j}(x_{j}).\]

Hence, along with Assumption 3.2, we obtain

\begin{align*}
\mathbb{V}[v_{k}]&=\sum_{j=1}^{k}\frac{j^{2\beta}}{k^{2\beta}}\alpha_{j}^{2}\,\mathbb{V}_{x_{j}}[g_{j}(x_{j})]\\
&\leq\alpha_{0}^{2}\sum_{j=1}^{k}\frac{j^{2\beta}}{k^{2\beta}}\frac{1}{j}\Big(M+M_{V}\|\nabla F(x_{j})\|_{2}^{2}\Big)\\
&\leq\alpha_{0}^{2}\Big(M+2M_{V}\|\nabla F(x_{k})\|_{2}^{2}+2M_{V}L^{2}D^{2}\Big)\frac{1}{k^{2\beta}}\int_{1}^{k}j^{2\beta-1}\,dj\\
&\leq\frac{\alpha_{0}^{2}}{2\beta}\Big(M+2M_{V}\|\nabla F(x_{k})\|_{2}^{2}+2M_{V}L^{2}D^{2}\Big),
\end{align*}

where the second step uses $\alpha_{j}^{2}=\alpha_{0}^{2}/j$ from (9), and the third step applies Theorem 3.2 and bounds the sum by an integral,

and the proof is complete.
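For completeness, the integral in the third step evaluates explicitly, which is where the constant $\frac{1}{2\beta}$ comes from:

\[\frac{1}{k^{2\beta}}\int_{1}^{k}j^{2\beta-1}\,dj=\frac{1}{k^{2\beta}}\cdot\frac{k^{2\beta}-1}{2\beta}=\frac{1-k^{-2\beta}}{2\beta}\leq\frac{1}{2\beta}.\]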

4 Experiments

To compare the performance of the proposed method with that of other methods, we investigate different popular machine learning models, including logistic regression, multi-layer fully connected neural networks and deep convolutional neural networks. Experimental results show that our novel momentum term can efficiently solve practical stochastic optimization problems in the field of deep learning and outperforms other methods in deep convolutional neural networks.

In our experiments, we use the same parameter initialization strategy for all optimization algorithms. The hyperparameter values are selected, based on the results, from commonly used settings.

4.1 Experiment: Logistic Regression

We evaluate the two methods (namely Sgdm and our method) on two multi-class logistic regression problems using the MNIST dataset without regularization. Logistic regression is suitable for comparing optimizers because of its simple structure and convex objective function. In our experiments, we set the stepsize $\alpha_{t}=\frac{\alpha}{\sqrt{t}}$, where $t$ is the current iteration count. The logistic regression problem is to predict the class label directly from the $28\times 28$ image matrix.
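A minimal PyTorch sketch of this experiment might look as follows (our illustration; the learning rate, $\beta$, and epoch count are hypothetical, and the parameter update follows Algorithm 1):

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train = datasets.MNIST(".", train=True, download=True,
                       transform=transforms.ToTensor())
loader = DataLoader(train, batch_size=128, shuffle=True)  # minibatch size 128

model = torch.nn.Linear(28 * 28, 10)    # multi-class logistic regression
alpha0, beta, k = 0.1, 2.0, 0           # hypothetical hyperparameters
momenta = [torch.zeros_like(p) for p in model.parameters()]

for epoch in range(5):
    for images, labels in loader:
        k += 1
        loss = F.cross_entropy(model(images.view(-1, 28 * 28)), labels)
        model.zero_grad()
        loss.backward()
        alpha_k = alpha0 / k ** 0.5      # stepsize schedule (9)
        gamma_k = (k / (k + 1)) ** beta  # decay factor (8)
        with torch.no_grad():            # Algorithm 1 update, per parameter
            for p, m in zip(model.parameters(), momenta):
                m.mul_(gamma_k).add_(p.grad, alpha=-alpha_k)
                p.add_(m)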


Figure 1: The negative log likelihood training cost of logistic regression on MNIST images. (a) Normal logistic regression. (b) Logistic regression after a two layer fully connected neural network.

We compare the performance of our method (Lim for short) with that of Sgdm on the logistic regression problem, using a minibatch size of 128. According to Fig. 1, our method converges faster than Sgdm.

4.2 Experiment: Multi-layer Fully Connected Neural Networks

Multi-layer neural networks are powerful models with non-convex objective functions. We empirically find that our method outperforms Sgdm. In our experiments, the multi-layer neural network models are consistent with those in previous publications: a neural network with two or three fully connected hidden layers of 1000 hidden units each and ReLU activations is used, with a minibatch size of 128, as sketched below.
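A sketch of the two-hidden-layer model (ours; the paper does not publish its exact code), which can be trained with the same update loop as in the logistic regression sketch above:

import torch.nn as nn

# two fully connected hidden layers of 1000 ReLU units each, as in Fig. 2(a)
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 10),
)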


Figure 2: Training cost of fully connected multilayer neural networks on MNIST images. (a) Neural network with two fully connected hidden layers. (b) Neural network with three fully connected hidden layers.

We investigate the two optimizers using the standard deterministic cross-entropy objective function without regularization. According to Fig. 2, we found that our method outperforms Sgdm by a large margin.

4.3 Experiment: Deep Convolutional Neural Networks

Deep convolutional neural networks (CNNs) have shown considerable success in practical machine learning tasks, e.g. computer vision and natural language processing. Our CNN architectures have two or three alternating stages of $5\times 5$ convolution filters and $3\times 3$ max pooling with a stride of 2, followed by a fully connected layer of 1000 units with ReLU activations. We pre-process the input images with whitening and apply dropout noise to the input layer and the fully connected layer. The minibatch size is also set to 128. A sketch of such an architecture follows.
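The two-convolutional-layer variant might be sketched as follows (our illustration; the channel counts and dropout rates are hypothetical, since the paper does not specify them):

import torch.nn as nn

# 28x28 MNIST input; with 5x5 convolutions (no padding) and 3x3 pooling of
# stride 2, the spatial size goes 28 -> 24 -> 11 -> 7 -> 3
cnn = nn.Sequential(
    nn.Dropout(0.2),                    # dropout noise on the input layer
    nn.Conv2d(1, 32, kernel_size=5), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(64 * 3 * 3, 1000), nn.ReLU(),
    nn.Dropout(0.5),                    # dropout on the fully connected layer
    nn.Linear(1000, 10),
)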


Figure 3: Training cost of deep convolutional neural networks on MNIST images. (a) CNN with two convolutional layers. (b) CNN with three convolutional layers.

We investigate the three optimizers (Adam, Sgdm, and our method) using the standard deterministic cross-entropy objective function. According to Fig. 3, our method converges faster than Adam and Sgdm.

5 Conclusion

In this paper, we introduced a computationally efficient algorithm for gradient-based stochastic optimization. The updating strategy of the proposed method enjoys the simplicity of the original SGD with momentum but uses the momentum more efficiently. Our method is designed for machine learning problems with large-scale data sets and non-convex objectives, where it is hard for stochastic optimizers to achieve a linear convergence rate. The experimental results confirm our theoretical analysis of its convergence properties and show that our method can solve practical optimization problems efficiently. Overall, we found our method to be robust and well suited to a wide range of non-convex optimization problems in machine learning.

References

  • Bottou et al. (2018) Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv e-prints, page arXiv:1412.6980, Dec 2014.
  • Leen and Orr (1993) Todd K. Leen and Genevieve B. Orr. Optimal stochastic search and adaptive momentum. In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS’93, page 477–484, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc.
  • Orr (1996) Genevieve Beth Orr. Dynamics and Algorithms for Stochastic Search. PhD thesis, USA, 1996.
  • Polyak (1964) B.T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1 – 17, 1964.
  • Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.
  • Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, page III–1139–III–1147. JMLR.org, 2013.
  • Ward et al. (2019) Rachel Ward, Xiaoxia Wu, and Léon Bottou. Adagrad stepsizes: sharp convergence over nonconvex landscapes. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 6677–6686, 2019.