
Saugata Purkayastha (sau.pur9@gmail.com)
Assam Don Bosco University, India

Sukannya Purkayastha (purkayasthasukannya020@gmail.com)
Indian Institute of Technology Kharagpur, India

A Variant of Gradient Descent Algorithm Based on Gradient Averaging

Abstract

In this work, we study an optimizer, Grad-Avg, for optimizing error functions. We establish mathematically the convergence of the sequence of iterates of Grad-Avg to a minimizer (under a boundedness assumption). We apply Grad-Avg along with some of the popular optimizers on regression as well as classification tasks. In regression tasks, it is observed that the behaviour of Grad-Avg is almost identical to that of Stochastic Gradient Descent (SGD). We present a mathematical justification of this fact. For classification tasks, it is observed that the performance of Grad-Avg can be enhanced by suitably scaling the parameters. Experimental results demonstrate that Grad-Avg converges faster than the other state-of-the-art optimizers for the classification task on two benchmark datasets.

1 Introduction

The gradient descent (GD) method [lemarechal2012cauchy] is one of the most popular algorithms for optimizing an error function. Let $J:\mathbb{R}^{n}\rightarrow\mathbb{R}$ be a $C^{1}$ function. The classical gradient descent method is given by the following update rule:

\[
\theta=\theta-\alpha\nabla J(\theta)
\]

where $\alpha$ is the constant step size, otherwise known as the learning rate, $\nabla J(\theta)$ stands for the gradient of the function $J$, and $\theta$ is the set of parameters. Although the gradient descent algorithm is guaranteed to converge to the global minimum for convex functions (a local minimum for non-convex functions), for large datasets the convergence can be slow. To overcome this, a variant of GD known as Stochastic Gradient Descent (SGD) [robbins1951stochastic] is applied. SGD, unlike GD, avoids redundant computations but has a tendency to overshoot the minimum in the process. Both GD and SGD, albeit efficient in convex optimization, prove to be inefficient on non-convex surfaces because of the existence of saddle points, as observed in [dauphin2014identifying]. Keeping this in mind, two more variants of GD, namely SGD with momentum [ruder2016overview] and the Nesterov Accelerated Gradient algorithm [nesterov2013introductory], are commonly used for non-convex optimization. The reader is referred to [ruder2016overview] for a detailed discussion of the above-mentioned optimizers. Further, [truong2018backtracking] contains an excellent account of various analytical concepts occurring in the context of machine learning. Motivated by Heun's method [atkinson2008introduction], we define the proposed optimizer, Grad-Avg, with the following iterative scheme:

\[
\theta=\theta-\frac{\alpha}{2}\left(\nabla J(\theta)+\nabla J(\theta-\alpha\nabla J(\theta))\right) \tag{1}
\]

Thus we note that in the present optimizer the update of the parameter is done by considering the average of the gradient at its previous position and the gradient at the position suggested by the GD algorithm. The intuition lies in the fact that if the gradient becomes zero at a position which is not a minimum (say, at a saddle point), the update of the parameter still continues by virtue of the average of the gradients mentioned above. Gradient averaging is a known concept in this field. For instance, [defazio2014saga] introduced a new optimization method using this concept. Also, [huang2017snapshot] used the average of model parameters obtained by letting the model converge to multiple local minima before making the final prediction. Our work, however, uses the average of the gradients of the model parameters at every timestep to update the parameters' position.
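The connection with Heun's method can be made explicit: applying Heun's predictor–corrector scheme to the gradient flow $\dot{\theta}=-\nabla J(\theta)$ with step size $\alpha$, and writing $\overline{\theta}$ for the predictor, yields

\[
\overline{\theta}=\theta-\alpha\nabla J(\theta) \quad \text{(Euler predictor)}, \qquad \theta \leftarrow \theta-\frac{\alpha}{2}\left(\nabla J(\theta)+\nabla J(\overline{\theta})\right) \quad \text{(trapezoidal corrector)},
\]

which is precisely scheme (1).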
Contributions: In this work, we develop a new optimizer, Grad-Avg, based on gradient descent algorithms. We obtain its convergence for $\alpha\leq\frac{1}{3L}$, where $L$ is the Lipschitz constant (under a boundedness assumption). We propose a justification of its behaviour being similar to that of GD for regression tasks. We empirically demonstrate its efficiency over other popular optimizers for classification tasks.

2 Assumptions

In Section 1, we have introduced the proposed optimizer. Below we mention the assumptions which are necessary to establish its convergence:

(i) the function $J:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is $C^{1}$, i.e. the function $J$ has continuous first-order partial derivatives.

(ii) the gradient $\nabla J$ of the function $J:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is Lipschitz continuous, i.e. there exists some $L>0$ (known as the Lipschitz constant) such that $\left\lVert\nabla J(x)-\nabla J(y)\right\rVert<L\left\lVert x-y\right\rVert$ for all $x,y\in\mathbb{R}^{n}$.
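For a concrete illustration of assumption (ii), consider the quadratic $J(\theta)=\frac{1}{2}\theta^{\top}A\theta$ with $A$ a symmetric matrix. Then $\nabla J(\theta)=A\theta$, so

\[
\left\lVert\nabla J(x)-\nabla J(y)\right\rVert=\left\lVert A(x-y)\right\rVert\leq\left\lVert A\right\rVert\left\lVert x-y\right\rVert,
\]

and $\nabla J$ is Lipschitz continuous with $L=\left\lVert A\right\rVert$, the spectral norm of $A$.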

Further, we make use of the Monotone Convergence Theorem, which states that every monotonically decreasing sequence of real numbers which is bounded below converges to its infimum. A detailed account of these concepts can be found in [kreyszig1978introductory] and [shifrin2005multivariable].

3 Our Proposed Algorithm

Algorithm 1: Grad-Avg

Data: $(x_{i},y_{i})$: input–output pair
Require: gradient $\nabla$ of the function $J$
Require: learning rate $\alpha$
Require: initial parameter values $\theta_{0}$
for $t$ in range(epochs) do
    $\overline{\theta_{n+1}}=\theta_{n}-\alpha\cdot\nabla J(\theta_{n};(x_{i},y_{i}))$
    $\theta_{n+1}=\theta_{n}-\alpha\cdot\frac{\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})}{2}$
end
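A minimal NumPy sketch of Algorithm 1, assuming a user-supplied gradient oracle (the names `grad_avg` and `grad_J` are illustrative, not from the paper):

```python
import numpy as np

def grad_avg(theta0, grad_J, alpha=0.1, epochs=100):
    """Sketch of Grad-Avg: average the gradient at the current point
    with the gradient at the plain GD lookahead point."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        g = grad_J(theta)               # gradient at the current position
        theta_bar = theta - alpha * g   # GD lookahead (first line of the loop)
        g_bar = grad_J(theta_bar)       # gradient at the lookahead point
        theta = theta - alpha * 0.5 * (g + g_bar)  # averaged update
    return theta

# Example: minimize J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta_star = grad_avg(np.ones(3), lambda th: th, alpha=0.3, epochs=50)
```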

4 Convergence Analysis

To analyze the convergence of equation (1), we first rewrite it in the following way:

\[
\theta_{n+1}=\theta_{n}-\frac{\alpha}{2}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right) \tag{2}
\]

with

\[
\overline{\theta_{n+1}}=\theta_{n}-\alpha\nabla J(\theta_{n}) \tag{3}
\]

We note that if $\nabla J(\theta_{n})=0$, then both $\overline{\theta_{n+1}}$ and $\theta_{n+1}$ reduce to $\theta_{n}$, i.e. $\theta_{n}$ is the optimal value. We may thus assume that $\nabla J(\theta_{n})\neq 0$ and hence $\left\|\nabla J(\theta_{n})\right\|>0$. In the following result, we first establish that the sequence $\{J(\theta_{n})\}$, where $\{\theta_{n}\}$ is defined by scheme (2), is monotonically decreasing. As such, by the Monotone Convergence Theorem, it follows that $\{J(\theta_{n})\}$ converges to its infimum provided it is bounded.
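Before the formal statement, here is a quick numerical sanity check of this monotonicity on a toy quadratic (an illustrative setup of ours; the constants are arbitrary). We iterate scheme (2)–(3) on $J(\theta)=\frac{L}{2}\theta^{2}$, whose gradient $\nabla J(\theta)=L\theta$ is Lipschitz with constant $L$:

```python
# Toy quadratic: J(theta) = (L/2) * theta**2, grad J(theta) = L * theta,
# so grad J is Lipschitz with constant L.
L = 4.0
J = lambda th: 0.5 * L * th ** 2
grad_J = lambda th: L * th

alpha = 1.0 / (3.0 * L)   # step size permitted by the theorem below
theta = 5.0               # arbitrary starting point
values = [J(theta)]
for _ in range(20):
    theta_bar = theta - alpha * grad_J(theta)                          # scheme (3)
    theta = theta - alpha * 0.5 * (grad_J(theta) + grad_J(theta_bar))  # scheme (2)
    values.append(J(theta))

# {J(theta_n)} should be monotonically decreasing
assert all(a >= b for a, b in zip(values, values[1:]))
```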

Theorem 4.1.

Let $J:\mathbb{R}^{n}\rightarrow\mathbb{R}$ be a $C^{1}$ function such that $\nabla J$ is Lipschitz continuous with Lipschitz constant $L$. Then for $\alpha\leq\frac{1}{3L}$, the sequence $\{J(\theta_{n})\}$, where $\{\theta_{n}\}$ is defined by scheme (2), is monotonically decreasing.

Proof. First of all, since $\overline{\theta_{n+1}}-\theta_{n}=-\alpha\nabla J(\theta_{n})$, using the fact that $\nabla J$ is Lipschitz continuous we observe that for $\epsilon_{0}>0$, there is some $\delta_{0}>0$ such that whenever $\left\|\alpha\nabla J(\theta_{n})\right\|<\delta_{0}$, one has

\[
\left\|\nabla J(\overline{\theta_{n+1}})-\nabla J(\theta_{n})\right\|<L\delta_{0}=\epsilon_{0} \tag{4}
\]

Let us now consider $f:\mathbb{R}\rightarrow\mathbb{R}$ and define

\[
f(t)=J\left(\theta_{n}-\frac{\alpha t}{2}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\right) \tag{5}
\]

for $t\in\mathbb{R}$. Then by the chain rule we get,

\[
f^{\prime}(t)=\nabla J\left(\theta_{n}-\frac{\alpha t}{2}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\right)\boldsymbol{\cdot}\left(-\frac{\alpha}{2}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\right) \tag{6}
\]

where ``$\boldsymbol{\cdot}$'' stands for the usual dot product. Now, noting that $f(0)=J(\theta_{n})$ and $f(1)=J(\theta_{n+1})$, the Fundamental Theorem of Calculus gives

\begin{align}
J(\theta_{n+1})-J(\theta_{n})&=\int_{0}^{1}\nabla J\left(\theta_{n}-\frac{\alpha t}{2}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\right)\nonumber\\
&\qquad\boldsymbol{\cdot}\left(-\frac{\alpha}{2}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\right)dt \tag{7}
\end{align}

As $J$ is $C^{1}$ by assumption, for $\epsilon=\frac{2\delta}{\alpha t}-\left(\epsilon_{0}+\frac{\delta_{0}}{\alpha}\right)$ there is a $\delta>0$ such that whenever $\left\|\frac{\alpha t}{2}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\right\|<\delta$, one gets

\[
\left\|\nabla J\left(\theta_{n}-\frac{\alpha t}{2}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\right)-\nabla J(\theta_{n})\right\|<\epsilon \tag{8}
\]

Further, we have the following two inequalities:

\begin{align*}
&\nabla J\left(\theta_{n}-\frac{\alpha t}{2}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\right)\boldsymbol{\cdot}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\\
&=\left(\nabla J\left(\theta_{n}-\frac{\alpha t}{2}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\right)-\nabla J(\theta_{n})\right)\boldsymbol{\cdot}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\\
&\qquad+\nabla J(\theta_{n})\boldsymbol{\cdot}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\\
&\geq\nabla J(\theta_{n})\boldsymbol{\cdot}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\\
&\qquad-\left\|\nabla J\left(\theta_{n}-\frac{\alpha t}{2}\left(\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right)\right)-\nabla J(\theta_{n})\right\|\left\|\nabla J(\theta_{n})+\nabla J(\overline{\theta_{n+1}})\right\|
\end{align*}

where the last step uses the Cauchy–Schwarz inequality.