Saugata Purkayastha (sau.pur9@gmail.com)
Assam Don Bosco University, India
Sukannya Purkayastha (purkayasthasukannya020@gmail.com)
Indian Institute of Technology Kharagpur, India
A Variant of Gradient Descent Algorithm Based on Gradient Averaging
Abstract
In this work, we study an optimizer, Grad-Avg, for optimizing error functions. We mathematically establish the convergence of the sequence of iterates of Grad-Avg to a minimizer (under a boundedness assumption). We apply Grad-Avg along with some popular optimizers on regression as well as classification tasks. In regression tasks, it is observed that the behaviour of Grad-Avg is almost identical to that of Stochastic Gradient Descent (SGD). We present a mathematical justification of this fact. For classification tasks, it is observed that the performance of Grad-Avg can be enhanced by suitably scaling the parameters. Experimental results demonstrate that Grad-Avg converges faster than the other state-of-the-art optimizers for the classification task on two benchmark datasets.
1 Introduction
Gradient descent (GD) [lemarechal2012cauchy] is one of the most popular algorithms for optimizing an error function. Let $J : \mathbb{R}^d \rightarrow \mathbb{R}$ be a function. The classical gradient descent method is given by the following iteration:
\begin{equation*}
\theta_{n+1} = \theta_n - \alpha\,\nabla J(\theta_n),
\end{equation*}
where $\alpha$ is the constant step size, otherwise known as the learning rate, $\nabla J$ stands for the gradient of the function $J$ and $\theta$ is the set of parameters. Although the gradient descent algorithm is guaranteed to converge to the global minimum for convex functions (a local minimum for non-convex functions), the convergence can be slow for large datasets. To overcome this, a variant of GD known as Stochastic Gradient Descent (SGD) [robbins1951stochastic] is applied. SGD, unlike GD, avoids redundant computations but has a tendency to overshoot the minimum in the process. Both GD and SGD, albeit efficient in convex optimization, prove to be inefficient on non-convex surfaces because of the existence of saddle points, as observed in [dauphin2014identifying]. Keeping this in mind, two more variants of GD, namely SGD with momentum [ruder2016overview] and the Nesterov Accelerated Gradient algorithm [nesterov2013introductory], are commonly used for non-convex optimization. The reader is referred to [ruder2016overview] for a detailed discussion of the above-mentioned optimizers. Further, [truong2018backtracking] contains an excellent account of various analytical concepts occurring in the context of machine learning. Motivated by Heun's method [atkinson2008introduction], we define the proposed optimizer, Grad-Avg, by the following iterative scheme:
\begin{equation}\label{eq:1}
\begin{aligned}
\bar{\theta}_{n+1} &= \theta_n - \alpha\,\nabla J(\theta_n),\\
\theta_{n+1} &= \theta_n - \frac{\alpha}{2}\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right).
\end{aligned}
\end{equation}
Thus we note that in the present optimizer, the parameter is updated using the average of the gradients calculated at its previous position and at the position suggested by the GD algorithm. The intuition lies in the fact that if the gradient becomes zero at a position which is not the minimum (say, at a saddle point instead), the update of the parameter still continues by virtue of the average of the gradients mentioned above. Gradient averaging is a known concept in this field: for instance, [defazio2014saga] introduced a new optimization method using this concept. Also, [huang2017snapshot] used the average of model parameters obtained by letting the model converge to multiple local minima before making the final prediction. Our work, however, averages the gradients of the model parameters at every timestep to update the parameters' position.
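To make the update concrete, the following is a minimal NumPy sketch of a single Grad-Avg step on a toy saddle-shaped objective; the objective, starting point and learning rate are illustrative choices, not the experimental setup used later.

```python
import numpy as np

# Toy objective with a saddle at the origin: J(x, y) = x^2 - y^2 (illustrative only).
def grad_J(theta):
    x, y = theta
    return np.array([2.0 * x, -2.0 * y])

def grad_avg_step(theta, lr):
    """One Grad-Avg update following scheme (1)."""
    g_here = grad_J(theta)                 # gradient at the current position
    theta_bar = theta - lr * g_here        # look-ahead position suggested by plain GD
    g_ahead = grad_J(theta_bar)            # gradient at the look-ahead position
    return theta - 0.5 * lr * (g_here + g_ahead)  # move along the averaged gradient

theta = np.array([1.0, 0.5])
print(grad_avg_step(theta, lr=0.1))
```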
Contributions:
In this work, we develop a new optimizer, Grad-Avg, based on the gradient descent algorithm. We obtain its convergence for a learning rate $\alpha$ suitably bounded in terms of the Lipschitz constant $L$ of $\nabla J$ (under a boundedness assumption). We propose a justification of its behaviour being similar to that of GD for regression tasks. We empirically demonstrate its efficiency over other popular optimizers for classification tasks.
2 Assumptions
In section 1, we have introduced the proposed optimizer. Below we mention the assumptions which are necessary to establish the convergence of the optimizer:
- (i) the function $J$ is $C^1$, i.e. the function has continuous first order partial derivatives;
- (ii) the gradient of the function $J$ is Lipschitz continuous, i.e. there exists some $L > 0$ (known as the Lipschitz constant) such that $\|\nabla J(x) - \nabla J(y)\| \leq L\,\|x - y\|$ for all $x, y$.
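For a simple illustration of assumption (ii), consider the quadratic objective $J(\theta) = \frac{1}{2}\theta^{\top} A \theta$ with $A$ symmetric (an illustrative example, not one taken from the later experiments). Here $\nabla J(\theta) = A\theta$, so that
\[
\|\nabla J(x) - \nabla J(y)\| = \|A(x - y)\| \leq \|A\|\,\|x - y\|,
\]
i.e. $\nabla J$ is Lipschitz continuous with Lipschitz constant $L = \|A\|$, the largest eigenvalue of $A$ in absolute value.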
Further, we make use of the Monotone Convergence Theorem, which states that every monotonically decreasing sequence of real numbers that is bounded below converges to its infimum. A detailed account of these concepts can be found in [kreyszig1978introductory] and [shifrin2005multivariable].
3 Our Proposed Algorithm
Algorithm 1: Grad-Avg
Input: input-output pairs $(x, y)$
Require: information of the gradient $\nabla J$ for the function $J$
Require: learning rate $\alpha$
Require: initial parameter values $\theta_0$
for $t$ in range(epochs):
    $\bar{\theta}_{t+1} \leftarrow \theta_t - \alpha\,\nabla J(\theta_t)$
    $\theta_{t+1} \leftarrow \theta_t - \frac{\alpha}{2}\left(\nabla J(\theta_t) + \nabla J(\bar{\theta}_{t+1})\right)$
end for
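A minimal NumPy sketch of the algorithm above, applied to a synthetic least-squares problem (the data, learning rate and number of epochs are illustrative assumptions, not the configuration used in the experiments reported later):

```python
import numpy as np

# Synthetic linear-regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

def grad_J(theta):
    """Gradient of the mean-squared-error objective J(theta) = ||X @ theta - y||^2 / n."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def grad_avg(theta, lr=0.05, epochs=200):
    """Grad-Avg: average the gradient at theta_t and at the look-ahead point."""
    for _ in range(epochs):
        theta_bar = theta - lr * grad_J(theta)                          # look-ahead (plain GD) step
        theta = theta - 0.5 * lr * (grad_J(theta) + grad_J(theta_bar))  # averaged-gradient update
    return theta

print(grad_avg(np.zeros(3)))  # should approach the generating coefficients [1.0, -2.0, 0.5]
```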
4 Convergence Analysis
To analyze the convergence of equation \eqref{eq:1}, we first rewrite equation \eqref{eq:1} in the following way:
\begin{equation}\label{eq:2}
\theta_{n+1} = \theta_n - \frac{\alpha}{2}\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right),
\end{equation}
with
\begin{equation}
\bar{\theta}_{n+1} = \theta_n - \alpha\,\nabla J(\theta_n).
\end{equation}
We note that if $\nabla J(\theta_n) = 0$, then both $\bar{\theta}_{n+1}$ and $\theta_{n+1}$ reduce to $\theta_n$, i.e. $\theta_n$ is the optimal value. We may thus assume that $\nabla J(\theta_n) \neq 0$ and hence $\|\nabla J(\theta_n)\| > 0$. We first establish, in the following result, that the sequence $\{J(\theta_n)\}$, where $\theta_n$ is defined by scheme \eqref{eq:2}, is monotonic decreasing. As such, by the Monotone Convergence Theorem, it follows that $\{J(\theta_n)\}$ converges to its infimum provided it is bounded.
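As a quick numerical sanity check of this claim (a sketch of our own, with an illustrative quadratic objective and a learning rate no larger than $1/L$), the objective values produced by scheme \eqref{eq:2} should form a non-increasing sequence:

```python
import numpy as np

# Quadratic objective J(theta) = 0.5 * theta^T A theta; grad J is Lipschitz with L = ||A||.
A = np.diag([1.0, 4.0])
J = lambda theta: 0.5 * theta @ A @ theta
grad_J = lambda theta: A @ theta

L = np.linalg.norm(A, 2)  # Lipschitz constant of grad J (largest eigenvalue of A)
alpha = 1.0 / L           # learning rate kept no larger than 1/L

theta = np.array([3.0, -2.0])
values = [J(theta)]
for _ in range(20):
    theta_bar = theta - alpha * grad_J(theta)
    theta = theta - 0.5 * alpha * (grad_J(theta) + grad_J(theta_bar))
    values.append(J(theta))

# The sequence {J(theta_n)} should be monotonically decreasing.
assert all(later <= earlier for earlier, later in zip(values, values[1:]))
```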
Theorem 4.1.
Let $J : \mathbb{R}^d \rightarrow \mathbb{R}$ be a $C^1$ function such that $\nabla J$ is Lipschitz continuous with Lipschitz constant $L$. Then, for a sufficiently small learning rate $\alpha$ (depending on $L$), the sequence $\{J(\theta_n)\}$, where $\theta_n$ is defined by scheme \eqref{eq:2}, is monotonic decreasing.
Proof. First of all, using the fact that $\nabla J$ is Lipschitz continuous, we observe that for $\epsilon > 0$ there is some $\delta > 0$ such that when $\|x - y\| < \delta$, one has
\begin{equation}
\|\nabla J(x) - \nabla J(y)\| < \epsilon.
\end{equation}
Let us now consider the line segment joining $\theta_n$ and $\theta_{n+1}$ and define
\begin{equation}
g(t) = J\!\left(\theta_n - \alpha t\,\tfrac{1}{2}\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right)\right)
\end{equation}
for $t \in [0, 1]$. Then by the chain rule we get
\begin{equation}
g'(t) = -\frac{\alpha}{2}\,\nabla J\!\left(\theta_n - \alpha t\,\tfrac{1}{2}\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right)\right)\cdot\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right),
\end{equation}
where $\cdot$ stands for the usual dot product. Now by the Fundamental Theorem of Calculus, we get
\begin{equation}
J(\theta_{n+1}) - J(\theta_n) = g(1) - g(0) = \int_0^1 g'(t)\,dt = -\frac{\alpha}{2}\int_0^1 \nabla J\!\left(\theta_n - \alpha t\,\tfrac{1}{2}\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right)\right)\cdot\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right) dt.
\end{equation}
As $J$ is $C^1$ by assumption, for $\epsilon > 0$ there is a $\delta > 0$ such that whenever $\|\theta - \theta_n\| < \delta$, one gets
\begin{equation}
\|\nabla J(\theta) - \nabla J(\theta_n)\| < \epsilon.
\end{equation}
Further, we have the following two inequalities:
\begin{align*}
&\nabla J\!\left(\theta_n - \alpha t\,\tfrac{1}{2}\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right)\right)\cdot\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right)\\
&\quad= \left(\nabla J\!\left(\theta_n - \alpha t\,\tfrac{1}{2}\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right)\right) - \nabla J(\theta_n)\right)\cdot\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right)\\
&\qquad+ \nabla J(\theta_n)\cdot\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right)\\
&\quad\geq \nabla J(\theta_n)\cdot\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right) - \left\|\nabla J\!\left(\theta_n - \alpha t\,\tfrac{1}{2}\left(\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right)\right) - \nabla J(\theta_n)\right\|\,\left\|\nabla J(\theta_n) + \nabla J(\bar{\theta}_{n+1})\right\|
\end{align*}