
A Comprehensive Study on Optimization Strategies for Gradient Descent In Deep Learning

Kaustubh Yadav School of Computer Science and Engineering, SCOPE
Vellore Institute of Technology
Vellore, Tamil Nadu
Email: kaustubh.q@gmail.com
Abstract

One of the most important parts of training Artificial Neural Networks is minimizing the loss function, which tells us how good or bad our model is. To minimize this loss we need to tune the weights and biases, and to find the minimum of a function we need its gradient; to update the weights we use gradient descent. However, plain gradient descent has some problems: it is quite slow and not that accurate. This article aims to give an introduction to optimization strategies for gradient descent. In addition, we also discuss the architecture of these algorithms and further optimization of Neural Networks in general.

Index Terms:
Gradient Descent, Loss Functions, Optimization

I Introduction

Gradient Descent is one of the most important parts of Deep Learning, used for optimizing Neural Networks. It is used to update the weights so as to get the best result possible. Gradient Descent is a mathematical procedure for finding the minimum of a function; in deep learning it is used to minimize the loss (or cost) function by tuning the weights. The gradient of a single neuron depends on the gradients of the previous neurons, and if you imagine a stack of fully connected layers you can see how hard and computationally tedious this gradient can be to calculate; this is the idea behind backpropagation, which is itself derived from the chain rule. There are also problems with backpropagation, such as the vanishing and exploding gradient problems, which can be mitigated by weight initialization. With this article, we aim to provide a better understanding of gradient descent and its optimization, and why this optimization is necessary. We shall also shed some light on other optimizing strategies that are used in today's Deep Learning models. In Section 2, we start with loss functions and backpropagation. In Section 3, we discuss types of Gradient Descent, and their optimizations in Section 4. Further, we look at the performance of these optimizations on standard datasets in Section 5. Finally, we look at other optimization strategies and regularization in Section 6.

II Loss Functions

Loss Functions are the measure of how well the neural network performs. In layman's terms, a loss function is needed to calculate the deficit between the actual values and the values predicted by our network. The lower the value of the loss, the better the network's performance; a higher value means that our predicted values are far from the actual values. Finally, our goal is to minimize this loss. Loss functions can be grouped on the basis of the task our network is meant to perform, for example regression or classification. Let the input values be $x_i$ and the outputs be $y_i$; the loss function can be defined as:

L=\frac{1}{N}\sum_{i=1}^{N}L_{i}[f(x,W),y_{i}]+\lambda(R) (1)

Here $L$ is the average of the total loss incurred and $L_{i}$ is the per-example loss function we discuss in the following subsections. The $\lambda(R)$ term signifies regularization, which is another optimizing strategy that we discuss in Section 6.

II-A Loss Functions for Regression Problems

II-A1 Mean Square Error Loss

Mean Square Error is one of the most common error functions used in regression problems; it is given by averaging the squared differences between the actual and the predicted values. This function is given by:

L=\frac{1}{N}\sum_{i=1}^{N}(y_{i}-y_{p_{i}})^{2} (2)

Mean Square Error is ideal when the distribution of the target variable is Gaussian. The range of MSE lies from 0 to infinity; it can never be negative. One important property of MSE is that it gives a disproportionately higher error when the deficit is larger, which means the model corrects itself more strongly when the error is large.

II-A2 Mean Absolute Error Loss

Although MSE works satisfactorily for Gaussian distributions, its performance decreases if our dataset has outliers, and continues to worsen as the outliers lie farther from the mean value. To solve this problem, MAE is used. It is calculated by averaging the absolute differences between the actual and predicted values. It is given by:

L=\frac{1}{N}\sum_{i=1}^{N}|y_{i}-y_{p_{i}}| (3)

MAE can be used when the distribution is not Gaussian, and as shown by Nie et al. [1] it outperforms other loss functions when the dataset is noisy.

II-A3 Mean Square Log Error Loss

MSLE is an intermediate between MAE and MSE: it gives almost the same priority to values that are off by a large magnitude and values that are off by a small magnitude, which means it does not give too much weight to outliers. It is given by:

L=\frac{1}{N}\sum_{i=1}^{N}(\log(y_{i}+1)-\log(y_{p_{i}}+1))^{2} (4)

However, MSLE gives an asymmetric error curve because it assigns a higher loss when our predictions undershoot the actual values. A padding of 1 is added because $\log(0)$ is undefined.
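As a rough illustration, the three regression losses above (Equations 2-4) can be written in a few lines of NumPy. The sketch below is only an illustration; the function and array names are ours, not from any library.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Square Error: average of squared differences (Eq. 2)
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: average of absolute differences (Eq. 3)
    return np.mean(np.abs(y_true - y_pred))

def msle(y_true, y_pred):
    # Mean Square Log Error: MSE on log(1 + y); the +1 padding avoids log(0) (Eq. 4)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), msle(y_true, y_pred))
```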

II-B Loss Functions for Classification Problems

The reason we separate regression and classification is that the losses from the previous subsection will not give a valid result for classification. Take MSE as an example, with a target value of 0 or 1 and a classifier that yields a value between 0 and 1. If the network's output is totally off, i.e. the actual value is 1 and the predicted value is 0, the loss incurred by the MSE equation will still not be very large. Hence, for classification problems, a different set of loss functions is defined. This does not mean the aforementioned loss functions cannot be used in classification problems: Ghosh et al. [2] showed that MAE actually performs better at classification than Cross-Entropy when tested on the MNIST dataset.

II-B1 Categorical Cross Entropy Loss

Categorical Cross-Entropy is used to measure the loss between a softmax layer, which contains the predicted class probabilities, and a one-hot encoding, which contains the true class. There are two variants of CCE. One is used for binary classification, i.e. when the target distribution has only 2 output classes. This loss function is a special case of the Cross-Entropy Loss used for multi-class classification, but both have the same intuition. Binary Cross-Entropy Loss is given by:

L=-\frac{1}{N}\sum_{i=1}^{N}\big[y_{i}\log(y_{p_{i}})+(1-y_{i})\log(1-y_{p_{i}})\big] (5)

Here $y_{p_{i}}$ is the predicted probability of the class, so its value ranges from 0 to 1, and $y_{i}$ is the true class, whose value is either 1 or 0. The values of both terms under the summation are negative because the log is negative between 0 and 1, hence the negative sign outside the summation. The reason we use Binary Cross-Entropy is that it gives a very high loss for misclassification. For example, if our model gives a totally wrong result, i.e. the actual value is 1 and the predicted value is 0, the loss becomes infinite because of the first log term. Although this property is crucial, it makes the loss function prone to outliers. A generalized approach to CCE was introduced by Zhang et al. [3], who combined the robustness of MAE with the implicit weighting mechanism of the Cross-Entropy Loss, which focuses on samples that are hard to learn. The other variant of Cross-Entropy is Multiclass Cross-Entropy, a broader generalization of Binary Cross-Entropy that allows any number of target classes instead of two. It is given by:

L=-\frac{1}{N}\sum_{i=1}^{N}y_{i}\log(y_{p_{i}}) (6)

As discussed earlier, Cross-Entropy has the drawback of being quite prone to outliers, but it has the advantage of focusing learning on "difficult examples", meaning it focuses on cases whose predictions are far from the actual labels rather than on samples that are already close; this is also what makes it prone to outliers.
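For concreteness, a minimal NumPy sketch of Equations 5 and 6 is shown below. The clipping constant is a common practical choice to avoid $\log(0)$, not something prescribed by the equations; the function names are ours.

```python
import numpy as np

EPS = 1e-12  # clipping constant to keep the logarithms finite

def binary_cross_entropy(y_true, p_pred):
    # Eq. 5: y_true in {0, 1}, p_pred is the predicted probability of class 1
    p = np.clip(p_pred, EPS, 1.0 - EPS)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true_onehot, p_pred):
    # Eq. 6: y_true_onehot is (N, C) one-hot labels, p_pred is (N, C) softmax output
    p = np.clip(p_pred, EPS, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y, p))

y_onehot = np.array([[1, 0, 0], [0, 1, 0]])
p_soft = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(categorical_cross_entropy(y_onehot, p_soft))
```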

II-B2 Kullback-Leibler Divergence Loss

KL divergence is defined as the deficit between the joint distribution and the product distribution of two variables; in the case of loss functions these are the probability distributions of the true and the predicted labels. KL divergence is similar to entropy in the sense that both have a logarithm term, but the difference is that entropy is used with probabilities while KL divergence is used with Probability Density Functions. A probability density function $p(x)$ can be defined through:

F(x)=P(a\leq x\leq b)=\int_{a}^{b}p(x)dx (7)

We defined entropy by Equation 6, but that was for discrete variables; a PDF is continuous, so the probability of a small interval can be written as $p(x_{i})\Delta x_{i}$ and the new entropy becomes:

H(x)=-\sum_{i=1}^{n}p(x_{i})\Delta x_{i}\log(p(x_{i})\Delta x_{i}) (8)

By properties of logarithms, we can write equation 8 as:

H(x)=-\sum_{i=1}^{n}p(x_{i})\log(p(x_{i}))\Delta x_{i}-\bigg{[}\sum_{i=1}^{n}p(x_{i})\Delta x_{i}\bigg{]}\log(\Delta x_{i}) (9)

Taking the limit as $\Delta x_{i}$ tends to 0:

H(x)=-\int_{-\infty}^{+\infty}p(x)\log(p(x))\,dx-\lim_{\max(\Delta x)\rightarrow 0}\log(\Delta x_{i}) (10)

But there is a problem with these equations: they are not strictly valid. A rule in applied mathematics is that transcendental functions [4] should take dimensionless arguments, and probability density functions do have units, so the logarithms are invalid. Also, as $\Delta x_{i}$ approaches 0, $p(x_{i})\Delta x_{i}$ also approaches zero, which makes the logarithm diverge and gives a huge value for $H(x)$. To solve this problem we use KL divergence, which is the difference between the joint distribution and the product distribution. Let $X$ be a variable with density $p(x)$ and $Y$ be a variable with density $q(x)$; the difference between the joint distribution and the product distribution is given by Equation 11, which is also known as Relative Entropy or Kullback-Leibler Divergence.

D(X||Y)=-\int_{-\infty}^{+\infty}p(x)\log(q(x))\,dx+\int_{-\infty}^{+\infty}p(x)\log(p(x))\,dx (11)

Taking $p(x)$ as the common term:

D(X||Y)=\int_{-\infty}^{+\infty}p(x)[\log(p(x))-\log(q(x))]\,dx (12)

And finally by using the quotient rule for logarithms

D(X||Y)=\int_{-\infty}^{+\infty}p(x)\log\frac{p(x)}{q(x)}\,dx (13)

Now we have a ratio under the logarithm, so this equation is dimensionally valid. Also, if $X=Y$ the value of $D(X||Y)$ becomes 0, and not infinity as in the case of Equation 10.
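A small sketch of the discrete analogue of Equation 13 is shown below for two probability vectors $p$ and $q$; as noted above, the divergence is zero when the two distributions are identical. The epsilon padding is our own guard against division by or logarithm of zero.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # Discrete KL divergence: sum_i p_i * log(p_i / q_i)
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, p))                           # 0.0 when P = Q
print(kl_divergence(p, np.array([0.2, 0.5, 0.3])))   # positive otherwise
```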

II-C Robustness Of Loss Functions

The concept of robustness of loss functions [5] has been around for a long time and is the first step in achieving optimization in deep learning. Here, robustness of a loss function means that it is not much influenced by large errors, and a loss function with this capability can be applied universally to any dataset. For example, MAE is more robust to outliers than MSE because the difference is squared in MSE, so outliers incur a much higher loss than in MAE. But when we optimize our losses with gradient descent, it is the derivative that matters: a larger value of the loss does not mean the derivative will be large as well. The derivative depends on the loss function in use, and a large-loss outlier may have a negligible derivative. In addition, the entries in our dataset can be classified into two types, easy and hard examples. Easy examples are easier to learn and the model focuses on them; they usually have a lower error, but they are known to slow convergence, i.e. during gradient descent these examples take a lot of time to reach optimal results. The other type is hard examples, which are known to accelerate convergence towards optimal results. Hence hard example mining [6] is a big part of the research on optimization. Saying "the more data, the better the performance" is therefore quite an oversimplification: in the end, these hard examples are the key to reaching the required results faster. But we cannot differentiate between hard and easy examples ourselves, and neither can our model; only after some training can we be a little more certain of the difference. Even then robustness matters, because our model might start to focus on noisy data; this can be mitigated by some variations of Stochastic Gradient Descent, as discussed in Section 4. This also brings up the concept of learning: as with human learning patterns, we can apply similar patterns in machine learning, and some of the popular ones are Curriculum Learning and Self-Paced Learning.

II-C1 Curriculum Learning

This method is inspired by how humans learn. It was introduced into machine learning by Bengio et al. [7] as a way of attaining convergence faster and finding a local minimum for non-convex functions. The intuition behind CL is to give the machine easier examples first and then consequently give harder examples when it is ready. This can also be observed while training a CNN: at every layer, the model learns a new level of abstraction.

II-C2 Self-Paced Learning

The motivation behind this method is similar to Curriculum Learning in the sense that we cannot initially give our model everything and expect it to learn the parameters; instead we start with easier examples and then move to harder ones. If we have a dataset of animals and we need the model to classify each into its corresponding label, the measure of easiness is totally ambiguous; furthermore, if the dataset has manually boxed images it is not conclusive whether an example is easy or hard. Technically, the algorithm for self-paced learning takes inspiration from the Concave-Convex Procedure (CCCP) [8], which is a solution to Latent SVM [9]. For a binary linear SVM classifier, the objective is to maximize the margin between the two hyperplanes of the corresponding classes, which is given by:

\max\bigg{[}(\vec{x}_{+}-\vec{x}_{-})\cdot\frac{\vec{w}}{\|w\|}\bigg{]}=\min\frac{1}{2}\|\vec{w}\|^{2} (14)

And the Decision Rule is given by:

y_{i}(\vec{w}x_{i}+b)\geq 1-\xi_{i} (15)

Here $\vec{x}_{+}$ and $\vec{x}_{-}$ are the two classes, $\vec{w}$ is the weight vector, and $y_{i}$ is +1 for positive samples and -1 for negative samples. The minimization can be carried out by taking the gradient with respect to the weights and bias. But we cannot always get a perfect hyperplane because the data might be inseparable, hence there are two methods for a generalized result. The first is transforming the vectors using a kernel function [10], but this virtually increases the complexity of the function; hence we use the second method, a slack variable $\xi$, which is a penalty for how much we fail to satisfy the Decision Rule. Latent SVM works on the premise that we have to find a value of the hidden variable that is consistent with our labels and better than any other pair of ground truth and hidden variable; if not, we add a penalty (slack variable) which we can further minimize. In the following equations, $h_{i}$ are the hidden variables on a dataset $D=\{x_{i},y_{i}\}$ and $\psi(x,y,h)$ is our feature space.

\max_{h_{i}}w^{T}\psi(x_{i},y_{i},h_{i})-w^{T}\psi(x_{i},y,h)\geq\Delta(y_{i},y,h)-\xi_{i} (16)

But this is still a non-convex problem, hence we use CCCP, which first initializes $w_{0}$ randomly and then updates:

\frac{1}{2}\|\vec{w}\|^{2}+C\sum_{i=1}^{n}\xi_{i} (17)

Finally, we update $w_{t+1}$ by solving the convex problem $\min\|w\|^{2}+C\sum_{i}\xi_{i}$ subject to Equation 16. But this still doesn't solve the problem of harder and easier examples, hence Self-Paced Learning introduces a new variable into Equation 17, $v_{i}\in\{0,1\}$, where 1 denotes an easier example and 0 means we will not be bothered by that example for now, making the objective $\min\|w\|^{2}+C\sum_{i}v_{i}\xi_{i}$. But this would be trivial, as we could set all $v_{i}$ to 0; hence we add a penalty that does not allow all the values to be 0 and grows with every $v_{i}$ set to 0, making Equation 17:

\min\|w\|^{2}+C\sum_{i}v_{i}\xi_{i}-\sum_{i}\frac{v_{i}}{K} (18)

Where $K$ is the self-paced learning rate weight; initially $K$ is large, so hard examples are excluded, and it is sequentially decreased to include hard examples.

III Discrepancies in Non-Convex Optimizations

In this section, we look into the problems that arise while optimizing non-convex problems. When we plot our loss function against the weights, through every round of backpropagation [11] we are looking for the set of weights that minimize the loss function; hence here the word optimization really means that we have to minimize the cost function, which is non-convex. The best way to minimize any function is to compute the gradients and move along the direction of decrease. But this becomes problematic at saddle points, local minima, or even local maxima, as the gradient at these points is zero. Gradient descent performs well in a convex setting, as it converges in $O(1/\epsilon)$ iterations, and even better in the strongly convex case, where the convergence rate is poly-logarithmic, i.e. almost independent of the dimension. But deep learning problems are highly non-convex, with saddle points, local minima, and low-gradient regions, and we cannot expect regular gradient descent to converge to the global minimum and not get stuck at these sub-optimal regions. Fortunately, we can mostly rule out local minima as they give quite satisfactory results; this was proved for many cases including dictionary learning [12], matrix completion [13], tensor decomposition [14], and some cases of deep learning [15]. Generally, for deep learning, a local minimum is almost as good as a global minimum [16]. But this still doesn't address saddle points and bad local minima.

III-A Saddle Points

When we use gradient descent on these non-convex problems, it cannot distinguish between a saddle point and a local minimum, as the gradient is 0 in both cases. Hence gradient descent may get stuck, perhaps indefinitely or long enough to make training time infeasible, and we get a sub-optimal result. Prior work on escaping saddle points involved first getting to a stationary point: if it is a local minimum we stop, but if it is a saddle point we have to escape from it. Escaping from the saddle point involves following the negative eigenvector of the Hessian when the gradient is 0 ($\nabla f(x)=0$) and the Hessian is not positive semidefinite ($\nabla^{2}f(x)\nsucceq 0$). In a 2-D picture of the gradient flow around a saddle point, there is a direction in which the gradient flows towards the saddle point and a direction in which it flows away; hence, if we start at a point that is not on the central line, we can escape the saddle point using Stochastic Gradient Descent in $O(1/\epsilon^{4})$ rounds [14] and in $O(1/\epsilon^{2})$ rounds with full Gradient Descent [17]. But this result is highly constrained to first-order saddle points, where a saddle point is of $n$th order if:

f(x')\geq f(x)-O(\|x-x'\|^{n+1}) (19)

Hence a standard saddle, e.g. $z=x^{2}-y^{2}$, is a first-order saddle point. But there are many different types of saddle points, specifically of higher orders: for example, a monkey saddle, which is a second-order saddle point given by $y={x_{1}}^{3}-3{x_{2}}^{2}x_{1}$, while a hyperbolic paraboloid is also a first-order saddle point. Optimality conditions are defined for first-order saddle points: if the gradient is not zero we move in the direction of the negative gradient, and if the Hessian is not positive semidefinite we follow the direction of the negative eigenvector of the Hessian; but these conditions don't cover second-order saddle points. Hence an extension by Anandkumar and Ge [18] includes a special constraint: for every vector $u$ in the null space of the Hessian, the third derivative along $u$ should be zero, which is given by:

\forall u:\;[\nabla^{2}f(x)]u=0,\;[\nabla^{3}f(x)](u,u,u)=0 (20)

This is not true for the monkey saddle, and hence this can also be an optimality condition: when the second term is not equal to zero, we pick a random $u$ in the null space of the Hessian and move in the direction of either $u$ or $-u$. In deep learning problems it is quite rare to see higher-order saddle points, so the final optimality conditions really sum it all up.
Although the methods mentioned above are quite robust, they have two limitations: first, we have to land in the saddle point and then switch to a strategy to escape it, which inherently leads to longer training times; and second, we need third-order information, which is hard to compute. Hence, in a method by Zeyuan Allen-Zhu [19], we just "swing by" a saddle point, meaning that we know when our point is about to get stuck in a saddle point and with that knowledge we perturb it so as to avoid the saddle point. A saddle point is called a $\delta$-strict saddle point if some eigenvalue is below $-\delta$; hence, if we have any point $y$ with $\|y-x\|\leq\gamma$ where $\gamma<\delta$, then by second-order Lipschitz conditions some eigenvalue of $\nabla^{2}f(y)$ is below $-\gamma$. Hence, if we multiply the Hessian with a vector $v$ and its transpose, we can be fairly sure whether that vector is close to the saddle point, and we can then move in the direction of the negative or positive of this vector, i.e. $y^{\prime}=y\pm\delta v$, which steers us away from the saddle point.

III-B Bad Local Minima

This might seem to contradict the statement that local minima are as good as global minima, but looking at the holistic picture, we have to avoid those local optima that give sub-optimal results, and our loss functions have a lot of local minima. Some studies have shown that adding one neuron can actually eliminate all the bad local minima; theoretical proofs have been given in multiple studies [20][21], but it was initially proved by Auer et al. [22] for square loss using a logistic activation function for the extra neuron. The original study changes the loss function to a different form and then gives the evaluations based on that transformation. In general, we define our loss function as $L(f(x,W),y_{i})$, but now we have a separate function $l$ which takes one argument, $l(-y_{i}f_{0}(x,W))$, where $l$ is assumed to be twice differentiable and non-decreasing and all critical points are assumed to be global minima. Another strong assumption is that the dataset is realizable, that is, the model can classify the given examples correctly, which is not a common attribute of datasets in general. It was observed by the original authors that the new loss function can be visualized as a polynomial hinge loss. Hence we modify our loss function to become:

L=\sum_{i}l(-y_{i}f(x_{i}))+\frac{\lambda a^{2}}{2} (21)

Where $\lambda$ is the regularization term and we change our model output to $f(x,W)=f_{0}(x,W)+f_{e}(x_{i})$, where $f_{e}(x_{i})=a\exp(x_{i}W^{T}+b)$, $a$ is the scale and $b$ is the bias. After forming the new loss function we would usually calculate the derivatives with respect to the weights, but in order to show that the effect of the extra neuron vanishes when we reach a global minimum, derivatives with respect to the scale and the bias are needed; these are the same except that the derivative with respect to the scale has an extra $\lambda a$ term, and finally, to find the minimum, we equate both derivatives to zero. By subtracting the two derivatives and multiplying by $a$ on both sides, we get $\lambda a^{2}=0$, which says that with a non-zero $\lambda$, $a$ has to be zero; hence at the global minimum the extra neuron becomes inactive.

IV Gradient Descent and its Variants

The idea of gradient descent derives from the descent methods that were introduced about 150 years ago, hence before going into what gradient descent is we will start with the initial idea, the method of steepest descent, which was given for strongly convex functions. A function $f$ is convex if $\operatorname{dom}f$ is a convex set and it satisfies:

f(\theta x+(1-\theta)y)\leq\theta f(x)+(1-\theta)f(y),\quad\forall x,y\in\operatorname{dom}f,\;0\leq\theta\leq 1 (22)

Using the first-order condition for convexity, the first-order Taylor approximation is always a global underestimator of the function, i.e. $f(y)\geq f(x)+\nabla f(x)^{T}(y-x)$. Also, we can define a function as $\alpha$-strongly convex if it satisfies:

f(y)\geq f(x)+\nabla f(x)^{T}(y-x)+\frac{\alpha}{2}\|x-y\|^{2} (23)

The quadratic term implies that there exists a lower bound on the growth of the function, which in turn means that strongly convex growth is strictly greater than linear growth; strong convexity is also a dual assumption to Lipschitz smoothness. Let $q^{*}$ be the optimum value of the function; using the Polyak-Lojasiewicz inequality we can also derive $f(x)-q^{*}\leq\frac{1}{2\alpha}\|\nabla f(x)\|^{2}$, which also gives a stopping criterion as the gradient tends to zero. To optimize smooth convex problems, descent methods are the key, and generally they can be defined as:

x^{k+1}=x^{k}-\eta\Delta x^{k},\quad f(x^{k+1})<f(x^{k}) (24)

Where $\Delta x^{k}$ is called the normalized step direction, which is usually a vector, and $\eta$ signifies the step size, which is always positive. Hence, general descent methods involve the calculation of the step direction $\Delta x$, then a ray search [23] for the step size $\eta$, and then the update of Equation 24, repeated until we satisfy the stopping criterion. The reason we use the word ray search in place of line search is that we only look for positive values of $\eta$. A basic or exact line search works on the premise $\eta=\arg\min_{\eta>0}f(x+\eta\Delta x)$, which means that we choose $\eta$ to achieve the minimum along the ray. Another variant is the backtracking line search, which performs a bit better than exact line search [24]. In backtracking line search, we start with $\eta=1$ and update $\eta=\beta\eta$ until the condition $f(x+\eta\Delta x)<f(x)+\alpha\eta\nabla f(x)^{T}\Delta x$ is satisfied, where the parameters are $\alpha\in(0,0.5)$ and $\beta\in(0,1)$; this condition is invalid if $\alpha\geq 1$ due to the preservation of convexity and because the first-order Taylor approximation is a global underestimator of the function. When we observe the equation for steepest descent, there is an anomaly: it is not normalized, which means that in order to multiply by the step size we have to put the direction into the same norm. Intuitively, $\Delta x$ should involve the gradient with respect to $x$; the normalized step direction can be given by $\Delta x_{n}=\arg\min\{\nabla f(x)^{T}v\}$ where $v$ is the direction vector. General descent methods use the un-normalized step direction, that is $\Delta x=\|\nabla f(x)\|\Delta x_{n}$. Gradient Descent is actually steepest descent in the Euclidean norm or 2-norm, so the only thing changed from the definition of steepest descent is that Gradient Descent has a step direction equal to the negative of the gradient with respect to $x$. With a fixed step size, the convergence rate is $O(1/k)$ where $k$ is the number of iterations, and it is the same even with backtracking line search; but as $\eta$ changes in every iteration, we replace $\eta$ with $\eta_{min}=\min\{1,\beta/L\}$ where $L$ is the Lipschitz constant and is always greater than 0.
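As a hedged sketch of the backtracking rule just described, the snippet below runs gradient descent with the $\eta\leftarrow\beta\eta$ shrink step on a simple quadratic; the matrix $A$, the constants, and the function names are arbitrary illustrative choices, not from the references.

```python
import numpy as np

A = np.diag([1.0, 10.0])            # ill-conditioned quadratic, as in Figure 1
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def backtracking_gd(x, alpha=0.3, beta=0.8, tol=1e-8, max_iter=1000):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:                      # stopping criterion
            break
        eta = 1.0
        # shrink eta until the sufficient-decrease condition holds
        while f(x - eta * g) > f(x) - alpha * eta * g @ g:
            eta *= beta
        x = x - eta * g                                  # step along -gradient
    return x

print(backtracking_gd(np.array([5.0, 1.0])))             # converges near the origin
```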

Figure 1: Sub-level sets of a function $f(x,y)=x^{2}+by^{2}$, i.e. $f(x)=\frac{1}{2}x^{T}Ax$

As we can see, it is the norm that matters, so it is only useful if we can change the norm for better convergence. As in Figure 1, if the sub-level sets are oval and narrow, steepest descent works very poorly, but if the sub-level sets are isotropic, i.e. almost spherical, then the direction of the negative gradient either points at the minimum or extremely close to it. Also, when we get close to the minimum of the function the gradient tends to zero, but that doesn't mean we are at the minimum, so as that happens we switch to the second-order Taylor approximation, which is obtained by appending $\frac{1}{2}(y-x)^{T}\nabla^{2}f(x)(y-x)$ to the first-order approximation. Near the optimum, the sub-level sets look like ellipsoids due to the Hessian, much like Figure 1, so if we initially have some intuition about the Hessian we can change the norm in accordance with the Hessian and then apply steepest descent; this is called Newton's step. In Newton's step we change the step direction $\Delta x$ to $\Delta x_{ns}=-\nabla^{2}f(x)^{-1}\nabla f(x)$; another interpretation is that we take the second-order approximation and minimize it, or that we need to find a $v$ such that $\nabla f(x+v)$ is zero. After changing the norm and using Newton's step as the new step direction we also have to change the stopping criterion, which is described by Newton's decrement and is given by $\lambda(x)=(\nabla f(x)^{T}\nabla^{2}f(x)^{-1}\nabla f(x))^{1/2}$, a measure of closeness to the optimum similar to the stopping criterion of the descent methods; the best estimate of $f(x)-q^{*}$ is $\lambda^{2}/2$. In conclusion, Newton's method requires calculating Newton's step $\Delta x_{ns}$ and Newton's decrement $\lambda^{2}$ and updating $x^{(k+1)}:=x^{(k)}+\eta\Delta x_{ns}$ until the stopping criterion $\lambda^{2}/2\leq\epsilon$ is met. Although gradient descent is a derivative of steepest descent, the assumption that it performs better than other descent methods is arbitrary, because if we have a norm that aligns with the geometry of the sub-level sets, the convergence will be extremely fast. Newton's method also has the advantage that if we scale the problem or take a different norm, it will converge in almost the same number of steps, whereas changing the norm has a drastic impact on gradient descent.

But in deep learning and machine learning we generally use gradient descent, so going deeper into other descent methods and second-order methods is out of the scope of this paper. As gradient descent is a widely researched area, there are 3 variants of gradient descent that differ in the amount of data used for the calculation of the gradient of the loss function.

IV-A Batch Gradient Descent

Batch gradient descent, also known as vanilla gradient descent, is the first intuitive step in optimizing our cost functions; as this classification is based on the amount of data, in batch gradient descent the entire dataset is used for each gradient update.

x^{k+1}:=x^{k}-\eta\nabla_{:n}f(x_{:n}) (25)

where $n$ is the number of entries in the dataset. Hence, to make one gradient update, the gradient for every example is calculated. For large datasets this is an extremely tedious process and might take hours for a single update, and for extremely large datasets it doesn't converge at all. But it is extremely accurate for small problems, in the sense that it will converge to a global minimum in a convex setting and to a local minimum in a non-convex setting.

IV-B Mini-Batch Gradient Descent

As the name states, mini-batch gradient descent is a truncated version of vanilla gradient descent: we divide our examples and labels into smaller sets, and in each iteration calculate the gradients with respect to one of the mini-batches.

x^{k+1}:=x^{k}-\eta\nabla_{i:i+n}f(x_{i:i+n}) (26)

Hence, instead of calculating the gradient for the entire dataset, we calculate the gradient for only a mini-batch. The convergence plot differs a little from batch gradient descent: in batch gradient descent the plot of cost against iterations only goes downward, whereas in mini-batch gradient descent the plot is noisier because in every iteration we take a new batch for our descent. Another ambiguity that arises is the size of the mini-batch. If our training data is small, say 3000-4000 entries, it's recommended to use batch gradient descent, because even if we divide it into mini-batches the convergence would show little improvement, if any; for large datasets it's recommended to use mini-batches whose sizes are powers of two [25]. Also, the larger the batch size, the smaller the gradient improvement, hence there is a trade-off between batch size and convergence.

IV-C Stochastic Gradient Descent

Stochastic Gradient Descent is essentially a condensed form of mini-batch gradient descent, where the size of the mini-batch is 1.

x^{k+1}:=x^{k}-\eta\nabla_{i}f(x_{i});\quad i\in rand\{1,2,..,n\} (27)

Hence, during a gradient update, we only need the gradient of one training example, which is usually picked at random. If we compare batch gradient descent with SGD, the difference is quite a jump: if we have 1 million training examples, batch gradient descent has to compute 1 million gradients per iteration, and over 1000 iterations that is 1 billion gradient computations, whereas SGD needs only one gradient per iteration, making it a million times cheaper per iteration, on the premise that both reach convergence. Another feature of SGD is that it is very sensitive to the learning rate $\eta$ [26], and as the learning rate increases so does the variance. The reason SGD is common in machine learning is that, as stated earlier, we are not looking for the perfect solution in the sense that the optimum must be a global one; our aim is an optimum that performs well on unseen data. With SGD we get a considerable drop in the cost early on, and that is what we need, so with early stopping we get the result we require [27].
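The sketch below contrasts the three variants on a small synthetic linear regression with the MSE loss; the dataset, learning rate, and helper names are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

def gradient(w, Xb, yb):
    # gradient of the MSE loss (1/n) * ||Xb w - yb||^2 with respect to w
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

def train(w, eta=0.1, epochs=50, batch_size=None):
    n = len(y)
    bs = n if batch_size is None else batch_size   # None -> full-batch GD
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, bs):
            b = idx[start:start + bs]
            w = w - eta * gradient(w, X[b], y[b])
    return w

w0 = np.zeros(3)
print(train(w0))                    # batch gradient descent (Eq. 25)
print(train(w0, batch_size=32))     # mini-batch gradient descent (Eq. 26)
print(train(w0, batch_size=1))      # stochastic gradient descent (Eq. 27)
```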

V Optimization of Gradient Descent

Now, when we talk about optimization there is a lot that can be done, as gradient descent itself has a lot of moving parts. We did touch upon some optimizations that give better results in the early parts of the previous section, but those methods involve second-order information, which is not easy to compute from a deep learning perspective. Also, problems such as bad local minima and saddle points arise here as well, so we shall provide a gradient-descent-based approach to those problems rather than second-order methods.

V-A Gradient with Momentum

As we saw in Section 3, saddle points are mostly responsible for sub-optimal results when it comes to optimization, and even if the surface of our loss function is flat, gradient descent still runs into problems due to low gradients and hence slow movement. Momentum with Gradient Descent [28][11] is based on the intuition from physics that an object in motion will continue in the same direction until an equivalent force is applied in the opposite direction; here we care about the first part of that statement. In the context of optimization, we move in the direction of the negative gradient and continue to update in that direction. Also, when we look at the convergence path on the sub-level sets in Figure 1, we see a zig-zag motion; if the step size is high, we might overshoot out of the sub-level sets, and then even convergence towards the minimum is not guaranteed. It is therefore desirable that the movement along the y-axis be as small as possible while the x-axis gets considerable but constrained updates; this can be achieved with momentum. As shown earlier, gradient descent can be represented as

x_{k+1}:=x_{k}-\eta Z_{k} (28)

where $Z_{k}=\nabla f_{k}$, so with every step we go in the direction of the negative gradient. After introducing the momentum term we have a memory of the previous update, i.e. step $k-1$, so

Z_{k}=\nabla f_{k}+\beta Z_{k-1} (29)

where $\beta$ is essentially the momentum parameter and $Z_{k-1}$ holds the information of the previous step direction. Now, using the same example from Figure 1, the gradient is actually $Ax$, so by replacing $\nabla f_{k}$ in Equation 29 and writing $k$ as $k+1$ we get a system of equations,

x_{k+1}:=x_{k}-\eta Z_{k};\quad Z_{k+1}-Ax_{k+1}=\beta Z_{k} (30)

which can be written in a matrix form as,

\begin{pmatrix}1&0\\ -A&1\end{pmatrix}\begin{pmatrix}x_{k+1}\\ Z_{k+1}\end{pmatrix}=\begin{pmatrix}1&-\eta\\ 0&\beta\end{pmatrix}\begin{pmatrix}x_{k}\\ Z_{k}\end{pmatrix} (31)

And finally, at every step, we have to compute the matrix

\begin{pmatrix}c_{k+1}\\ d_{k+1}\end{pmatrix}=\begin{pmatrix}1&-\eta\\ \lambda&\beta-\eta\lambda\end{pmatrix}\begin{pmatrix}c_{k}\\ d_{k}\end{pmatrix} (32)

which depends on both $\beta$ and $\eta$; hence we have to minimize this matrix, which in turn depends on the eigenvalues, so we minimize over the whole range of eigenvalues. Let $m$ and $M$ be the lower and upper bounds on our eigenvalues; we have to choose $\beta$ and $\eta$ so as to minimize over the whole range of $\lambda$ between $m$ and $M$, and the optimal points for $\eta$ and $\beta$ have been proven [11] to be:

\eta_{optimal}=\bigg{(}\frac{2}{\sqrt{M}+\sqrt{m}}\bigg{)}^{2} (33)
\beta_{optimal}=\bigg{(}\frac{\sqrt{M}-\sqrt{m}}{\sqrt{M}+\sqrt{m}}\bigg{)}^{2} (34)

Usually, $\beta$ is set to 0.9. Because the momentum term records the previous step direction, it keeps moving in that direction with magnitude $\beta$, so the frequency of the oscillations drops and convergence is accelerated.
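A minimal sketch of the momentum update of Equations 28-29 on the quadratic of Figure 1 is shown below; the matrix $A$ and the hyperparameters are illustrative assumptions.

```python
import numpy as np

A = np.diag([1.0, 20.0])
grad = lambda x: A @ x              # gradient of f(x) = 0.5 * x^T A x

def momentum_gd(x, eta=0.05, beta=0.9, iters=200):
    z = np.zeros_like(x)
    for _ in range(iters):
        z = grad(x) + beta * z      # Eq. 29: accumulate the previous direction
        x = x - eta * z             # Eq. 28: step along the accumulated direction
    return x

print(momentum_gd(np.array([5.0, 1.0])))   # approaches the minimum at the origin
```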

V-B Nesterov’s Accelerated Gradient

After the introduction of the momentum term, the oscillations were considerably reduced, but there is always scope for more. Moreover, the momentum term works well on the premise that our data is free of any considerable outliers: if we have an outlier whose gradient points in a direction opposite to the minimum, every subsequent update will keep pointing in that direction, at least for the foreseeable updates, increasing the convergence time. This phenomenon is quite common in a stochastic setting, as each update only looks at one entry. Nesterov's Accelerated Gradient (NAG) [29] introduced the concept of a look-ahead term as an extension of classical momentum. Instead of first calculating the gradient, moving in that direction, and making a final update with regard to the previous step, we first move according to the previously recorded step and then calculate the gradient at that point and move accordingly. So we calculate the gradient at a different point $y_{k}$ than the initial starting point $x_{k}$. For simplicity we can define a function $g(x)$ as $g(x)=x-\eta\nabla f(x)$, which is the same as computing the gradient and moving in its direction of decrease. As we have to compute the gradient at a different point, we can write the function as $g(y_{k})$:

x_{k}=g(y_{k})=y_{k}-\eta\nabla f(y_{k}) (35)

Also, initially $y_{1}=x_{0}$. After calculating the gradient at the new point $y_{k}$ we make a correction, adjusting for the previous momentum step, which is given by,

y_{k+1}=x_{k}+\frac{t_{k}-1}{t_{k+1}}(x_{k}-x_{k-1}) (36)

Here, $(t_{k}-1)/t_{k+1}$ is known as Nesterov's coefficient, also represented by $\kappa$, and $t_{k+1}$ is defined by $t_{k+1}=\frac{1+\sqrt{1+4{t_{k}}^{2}}}{2}$ with $t_{1}=1$. This set of equations guarantees fewer oscillations compared with classical momentum; as a matter of fact, the convergence rate is $O(1/k^{2})$ (over $k$ iterations) for Nesterov's momentum, whereas the convergence rate of gradient descent is $O(1/k)$, so there is a significant boost in convergence. But it still doesn't address the problems with SGD mentioned earlier; in fact, the convergence rate drops to $O(1/\sqrt{k})$ when using the accelerated gradient with SGD, which is even worse than normal gradient descent. We can improve the performance of SGD with Nesterov's momentum to $O(1/k)$ quite easily by using variance reduction with Stochastic Gradient Descent, also known as SVRG [30], but getting the convergence rate of $O(1/k^{2})$ is still under research. A method known as Katyusha momentum [31] was introduced to obtain a convergence rate of $O(1/k^{2})$ as an extension of the concepts of Nesterov's momentum: instead of computing the gradients and taking the step with respect to prior momentum, we compute the next point with the momentum term from the previous time step, but instead of making a definitive update we compute the midpoint between the new and the initial point and do the further updates in the same manner from the midpoints. This method gave a convergence rate congruent to NAG but also dependent on the total number of examples.
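A sketch of NAG in the look-ahead form of Equations 35-36 is shown below: the gradient is evaluated at the look-ahead point $y_{k}$ and the momentum coefficient follows the $t_{k}$ recursion. The objective and step size are illustrative assumptions.

```python
import numpy as np

A = np.diag([1.0, 20.0])
grad = lambda x: A @ x

def nesterov_gd(x0, eta=0.04, iters=200):
    x_prev, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x = y - eta * grad(y)                        # Eq. 35: gradient step at y_k
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        y = x + (t - 1.0) / t_next * (x - x_prev)    # Eq. 36: look-ahead correction
        x_prev, t = x, t_next
    return x_prev

print(nesterov_gd(np.array([5.0, 1.0])))             # approaches the origin
```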

V-C Adaptive Gradient Algorithm (AdaGrad)

Although Nesterov's and Katyusha's momentum show good convergence rates, it is important to note that these rates are not universal and may vary tremendously in certain cases. Another common flaw in the aforementioned methods is overshooting, which is reduced with Nesterov's and Katyusha's momentum but still prevalent. The Adaptive Gradient Algorithm, or AdaGrad [32], addresses this by introducing an adaptive step size, and when tested it performed really well on large-scale and distributed networks. It also works well for sparse data such as word embeddings [33], because the step size changes at every iteration. The intuition behind the adaptive gradient is that when we start, the sum of the prior gradients will be small compared with the sum accumulated after many iterations. As we only change the step size, which is a scalar quantity, the update equation is quite similar to that of gradient descent.

x_{k+1}=x_{k}-\eta'\Delta x_{k} (37)

where $\eta'$ indicates the new step size for the $k$th iteration, so $\eta'$ changes with every iteration, although as a hyperparameter we have to decide the initial step size, which is usually kept at 0.01 and gets decremented later on. Now, $\eta'$ is defined as,

\eta'=\frac{\eta}{\sqrt{diag(G_{i})+\epsilon I}} (38)

where $\epsilon$ is a small positive scalar added to make sure that $\eta'$ remains bounded even if the other term is zero, and $G_{i}$ is the matrix that stores the sum of the products of the gradients up to the $i$th iteration, represented as $G_{i}=\sum_{1}^{i}{\big{(}\nabla{x_{k}}^{(i)}\big{)}}^{2}$. The diagonal of the matrix $G$ holds the elements we actually need, so taking the square root of the entire matrix isn't required. As we are multiplying the gradient by itself, if the value of $diag(G_{i})$ becomes large during the later iterations of gradient descent, the step size automatically drops; hence a small accumulated gradient implies a bigger step size and a large accumulated gradient implies a smaller step size. But if the initial gradients are large, then the step sizes for the rest of the iterations become smaller and smaller; this can be corrected by using a learning rate higher than 0.01. There are problems with this method too: firstly, it doesn't perform well in a non-convex or dense setting, as AdaGrad can easily get stuck at saddle points and escaping them depends only on the value of $\epsilon$, which leads to increased training time. Also, if the sum of squared gradients becomes extremely large, the learning rate eventually becomes so small that our model won't learn anything from that point on; that is why AdaGrad is usually used for sparse data.
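A per-parameter sketch of the AdaGrad update of Equations 37-38 is shown below, keeping only the diagonal accumulator as described above; the objective and hyperparameters are illustrative assumptions.

```python
import numpy as np

A = np.diag([1.0, 20.0])
grad = lambda x: A @ x

def adagrad(x, eta=0.5, eps=1e-8, iters=500):
    G = np.zeros_like(x)                 # diagonal of G_i: running sum of squared gradients
    for _ in range(iters):
        g = grad(x)
        G += g ** 2
        x = x - eta / np.sqrt(G + eps) * g   # Eq. 37 with the adaptive step of Eq. 38
    return x

print(adagrad(np.array([5.0, 1.0])))
```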

V-D AdaDelta

Adadelta [34] was introduced as an extension of AdaGrad, improving on its drawbacks, namely the accumulation of gradients leading to premature stopping of training and the manual setting of the initial learning rate. The first constraint introduced in the paper was limiting the window of previous gradients: instead of accumulating over all $t$ iterations, we keep information for $w$ iterations. By doing so, the denominator term from AdaGrad cannot grow without bound and updates are based on recent gradient information, which in turn doesn't terminate learning prematurely. Also, storing the previous squared gradients is inefficient, as we would have to define a separate matrix and update it at each step; instead we can compute a running average (expectation) of the squared gradients, given by,

\mathbb{E}[g^{2}]_{t}=\rho\mathbb{E}[g^{2}]_{t-1}+(1-\rho){g^{2}}_{t} (39)

here $\rho$ represents the decay constant and plays a role similar to $\beta$ in classical momentum. The value of $\rho$ is usually kept at 0.9, so by Equation 39 we can say that the expectation at step $t$ is influenced both by the expectation of the squared gradients at the previous step and by the squared gradient at the current step, with more weight on the previous steps. As in AdaGrad we require the square root of the accumulated squared gradients, we take the square root of $\mathbb{E}[g^{2}]_{t}$, which is the Root Mean Square (RMS). We still need a small positive number $\epsilon$ to escape regions of low gradients, hence $RMS[g]_{t}=\sqrt{\mathbb{E}[g^{2}]_{t}+\epsilon}$ and the new learning rate becomes,

\eta'=\frac{\eta}{RMS[g]_{t}} (40)

The original paper also pointed out another discrepancy, the units of the parameter update term $\eta\nabla x_{k}$, which are proportional to the units of $1/x$ on the premise that the cost function is unitless. If we compare this with second-order methods such as Newton's method, the units of the update are directly proportional to $x$, using the formula $\Delta x_{ns}=-\nabla^{2}f(x)^{-1}\nabla f(x)$, where $\Delta x_{ns}$ is the update parameter for Newton's step as explained in Section 4. Hence, with a little rearrangement, we can show that the inverse of the Hessian corresponds to $\frac{\Delta x_{ns}}{\nabla f(x)}$. Here the parameter update $\Delta x$ is given by $\eta'\nabla x_{t}$, and since the RMS of the gradients appears in the denominator, an RMS of the updates belongs in the numerator; this would need to be computed for the current time step $t$, but by assuming Lipschitz smoothness we can approximate it with $RMS[\Delta x]_{t-1}$, so the parameter update becomes,

\Delta x_{t}=\frac{RMS[\Delta x]_{t-1}}{RMS[g]_{t}}g_{t} (41)

And the final update is $x_{t+1}=x_{t}-\Delta x_{t}$. The introduction of $RMS[\Delta x]_{t-1}$, which lags behind by one time step, made an improvement: it made this method robust to sudden changes, which would otherwise stall the learning rate. This method is also analogous to momentum, as the RMS information holds the history of the previous time steps just as momentum does.
Another idea quite similar to Adadelta is RMSprop, which pursued the same goal of an adaptive learning rate and is governed by

\begin{split}\mathbb{E}[g^{2}]_{t}=\gamma\mathbb{E}[g^{2}]_{t-1}+(1-\gamma){g^{2}}_{t}\\ x_{t+1}=x_{t}-\eta'g_{t}\end{split} (42)

Where $\eta'$ is the same as the one defined in Equation 40. The only difference is that this method doesn't involve the RMS values of the updates from the previous time step; the rest is the same. Performance-wise, there is not a monumental difference, but Adadelta performs better on highly non-convex loss functions [35] and still outperforms SGD, momentum, and Adagrad on certain tasks [34].
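The sketch below places the Adadelta update (Equations 39-41) next to RMSprop (Equation 42); the decay constants follow the typical values quoted above, while the objective and remaining constants are arbitrary illustrative choices.

```python
import numpy as np

A = np.diag([1.0, 20.0])
grad = lambda x: A @ x

def adadelta(x, rho=0.9, eps=1e-6, iters=2000):
    Eg2, Edx2 = np.zeros_like(x), np.zeros_like(x)
    for _ in range(iters):
        g = grad(x)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2                  # Eq. 39
        dx = np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g     # Eq. 41 (RMS of past updates lags by one step)
        Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
        x = x - dx
    return x

def rmsprop(x, eta=0.05, gamma=0.9, eps=1e-8, iters=500):
    Eg2 = np.zeros_like(x)
    for _ in range(iters):
        g = grad(x)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2              # Eq. 42
        x = x - eta / np.sqrt(Eg2 + eps) * g
    return x

print(adadelta(np.array([5.0, 1.0])))
print(rmsprop(np.array([5.0, 1.0])))
```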

V-E Adam

Adam [36] is yet another first-order optimization method for gradient descent and, in some sense, is the union of RMSprop and momentum: it retains the information of both the squares of past gradients, as in RMSprop, and the gradient history, as in momentum. Although Adam works with adaptive learning rates, it has 2 parameters that have to be tuned manually, $\beta_{1}$ and $\beta_{2}$ with $\beta_{1},\beta_{2}\in[0,1)$, which are decay rates controlling the moving averages of the gradients and of the squared gradients. First we define $m_{t}$, the first moment, which holds the gradient information of the previous time steps and is intuitively similar to the term in Equation 29:

m_{t}=\beta_{1}m_{t-1}+(1-\beta_{1})g_{t} (43)

and $g_{t}$ is the gradient of the objective function at time $t$. This term is quite similar to the momentum update, as we keep a record of the previous gradients. The second variable is $v_{t}$, which holds the running average of the squared gradients up to time step $t$, similar to Equation 42 in RMSprop:

v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2}){g_{t}}^{2} (44)

here ${g_{t}}^{2}$ is the squared gradient at time step $t$. If the gradient update were done with these variables as is, we might run into problems: for example, if the gradient is low and we are leaving a sparse region, the values of both $m_{t}$ and $v_{t}$ will be low and will stay that way for many iterations. To correct this, Bias Correction was introduced in the original paper, which counteracts this situation. To prove the bias correction, the authors made the strong assumption that the gradients at every time step have the same distribution, hence $\mathbb{E}[g_{i}]=\mathbb{E}[g]$. With bias correction, the new $\hat{m_{t}}$ and $\hat{v_{t}}$ are defined as,

\begin{split}\hat{m_{t}}=\frac{m_{t}}{1-{\beta_{1}}^{t}}\\ \hat{v_{t}}=\frac{v_{t}}{1-{\beta_{2}}^{t}}\end{split} (45)

And the final update is given by,

x_{t}=x_{t-1}-\eta\frac{\hat{m_{t}}}{\sqrt{\hat{v_{t}}}+\epsilon} (46)

Here the ratio $\hat{m_{t}}/\sqrt{\hat{v_{t}}}$ was termed the signal-to-noise ratio. Intuitively, as this ratio becomes smaller, so does the effective step size, since a smaller ratio means greater uncertainty about whether the direction of $\hat{m_{t}}$ matches the direction of the true gradient; the ratio tends to zero as we approach an optimum.
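A sketch of the Adam update of Equations 43-46 is given below; the hyperparameter values follow the commonly quoted defaults, while the quadratic objective is an arbitrary illustration.

```python
import numpy as np

A = np.diag([1.0, 20.0])
grad = lambda x: A @ x

def adam(x, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, iters=500):
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g               # Eq. 43: first moment
        v = beta2 * v + (1 - beta2) * g ** 2          # Eq. 44: second moment
        m_hat = m / (1 - beta1 ** t)                  # Eq. 45: bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - eta * m_hat / (np.sqrt(v_hat) + eps)  # Eq. 46: final update
    return x

print(adam(np.array([5.0, 1.0])))
```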

V-F AdaMax

AdaMax was introduced along with Adam and has a similar idea, with a simple but counter-intuitive change. In Adam we were effectively using the $L_{2}$ norm, so all the updates and parameters were scaled by it; we can generalize to the $L_{p}$ norm, where the step size update is now proportional to ${v_{t}}^{1/p}$ and both $\beta_{1}$ and $\beta_{2}$ appear in powers of $p$. Hence the new $m_{t}$ and $v_{t}$ are,

\begin{split}m_{t}=\beta_{1}^{p}m_{t-1}+(1-\beta_{1}^{p})g_{t}\\ v_{t}=\beta_{2}^{p}v_{t-1}+(1-\beta_{2}^{p})|g_{t}|^{p}\end{split} (47)

Now, as $p$ increases, the gradients become harder to compute and trace, but as the authors found, as $p\rightarrow\infty$ we get a stable and desirable result, and this transformation led to the AdaMax algorithm. As the step size depends on ${v_{t}}^{1/p}$, we need the limit as $p\rightarrow\infty$, which is given by $u_{t}=\lim_{p\rightarrow\infty}(v_{t})^{\frac{1}{p}}$ and can be written as $\max(\beta_{2}^{t-1}|g_{1}|,\beta_{2}^{t-2}|g_{2}|,...)$ after solving the limit in $v_{t}$. This can be written as the recursion $u_{t}=\max(\beta_{2}u_{t-1},|g_{t}|)$, which replaces $v_{t}$. And finally the new update becomes,

x_{t}=x_{t-1}-\frac{\eta}{1-\beta_{1}^{t}}\frac{m_{t}}{u_{t}} (48)
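The sketch below follows Equation 48 with the infinity-norm recursion $u_{t}=\max(\beta_{2}u_{t-1},|g_{t}|)$ in place of $v_{t}$; as with the other sketches, the objective and constants are assumptions for illustration.

```python
import numpy as np

A = np.diag([1.0, 20.0])
grad = lambda x: A @ x

def adamax(x, eta=0.1, beta1=0.9, beta2=0.999, iters=500):
    m = np.zeros_like(x)
    u = np.zeros_like(x)
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        u = np.maximum(beta2 * u, np.abs(g))        # infinity-norm accumulator u_t
        x = x - eta / (1 - beta1 ** t) * m / u      # Eq. 48
    return x

print(adamax(np.array([5.0, 1.0])))
```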

V-G Nadam

As the name suggests, Nadam is the integration of Nesterov's accelerated momentum and Adam, collectively called Nesterov-accelerated adaptive moment estimation [37]. To use Adam with NAG, we have to modify NAG a little to work with the bias correction and notation of Adam. We can redefine NAG as follows,

\begin{split}g_{t}=\nabla_{x_{t}}f(x_{t-1})\\ m_{t}=\mu_{t}m_{t-1}+g_{t}\\ \hat{m_{t}}=\mu_{t+1}m_{t}+g_{t}\\ x_{t+1}=x_{t}-\eta\hat{m_{t}}\end{split} (49)

Where $\mu_{t}$ is the same as $\frac{t_{k}-1}{t_{k+1}}$ at time step $k$, as defined in subsection C. An important observation in this variation is that the final moment term contains the gradient term of the current time step, so there is no need to calculate the gradient twice. After redefining NAG it is important to note that the intuition behind it remains the same; this version just reworks the definitions of the concepts mentioned in the previous subsections. Next, when we look at the Adam optimizer, and especially at the final update given by Equation 46, we know that $\hat{m_{t}}$ and $\hat{v_{t}}$ are the bias-corrected terms; for the sake of the derivation we assume they are not, although the final results include the bias correction as well. Substituting the newly calculated $\hat{m_{t}}$ into Equation 46,

x_{t}=x_{t-1}-\eta\bigg{[}\frac{\beta_{1}m_{t}}{\Psi}+\frac{(1-\beta_{1})g_{t}}{\Psi}\bigg{]} (50)

Where $\Psi$ is given by $\Psi=\sqrt{\beta_{2}\hat{v_{t}}+(1-\beta_{2}){g_{t}}^{2}}+\epsilon$. An assumption the original authors made is that $v_{t}\approx v_{t-1}$, due to the fact that $\beta_{2}$ is initialized above 0.9, so we can approximate the denominator as $\sqrt{\hat{v_{t}}}+\epsilon$; replacing the denominator of both terms, we get,

x_{t}=x_{t-1}-\eta\bigg{[}\frac{\beta_{1}m_{t}}{\sqrt{\hat{v_{t}}+\epsilon}}+\frac{(1-\beta_{1})g_{t}}{\sqrt{\hat{v_{t}}+\epsilon}}\bigg{]} (51)

Now, for the sake of readability, we can introduce a new variable $\hat{m_{t}}$, defined by,

\hat{m_{t}}=(1-\beta_{1})g_{t}+\beta_{1}m_{t} (52)

And the final update becomes,

x_{t}=x_{t-1}-\eta\frac{\hat{m_{t}}}{\sqrt{\hat{v_{t}}+\epsilon}} (53)
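The sketch below implements the simplified Nadam update of Equations 52-53 (bias correction on the first moment omitted, as in the derivation above); the objective and hyperparameters are illustrative assumptions.

```python
import numpy as np

A = np.diag([1.0, 20.0])
grad = lambda x: A @ x

def nadam(x, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, iters=500):
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        v_hat = v / (1 - beta2 ** t)
        m_hat = (1 - beta1) * g + beta1 * m            # Eq. 52: Nesterov-style look-ahead moment
        x = x - eta * m_hat / np.sqrt(v_hat + eps)     # Eq. 53
    return x

print(nadam(np.array([5.0, 1.0])))
```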

V-H AMSGrad

Before we get into the workings of AMSGrad or any further optimizer, we have to follow a different set of notations. Let $\mathcal{F}$ be the feasible set of parameters such that $x_{t}\in\mathcal{F}$, and as defined previously let $f_{t}$ be the loss function at time step $t$, so the loss incurred at that time step is $f_{t}(x_{t})$. Due to the uncertainty of the cost function at each time step, we evaluate using the regret, given by $R(T)=\sum_{t=1}^{T}[f_{t}(x_{t})-f_{t}(x^{*})]$ where $x^{*}=\arg\min_{x\in\mathcal{F}}\sum_{t=1}^{T}f_{t}(x)$. In the case of Adam, the regret bound is of the order $O(\sqrt{T})$. Although we have been using the notion of a final update, we haven't constrained it to the set $\mathcal{F}$, so we define a new final update that projects onto $\mathcal{F}$, i.e. $x_{t+1}=\Pi_{\mathcal{F}}(x_{t-1}-\eta g_{t})$. As another measure we define $\Gamma_{t+1}$, the change in the inverse of the effective step size for adaptive methods, given by $\Gamma_{t+1}=\bigg{(}\frac{\sqrt{\hat{v_{t+1}}}}{\eta_{t+1}}-\frac{\sqrt{\hat{v_{t}}}}{\eta_{t}}\bigg{)}$, which is quite important for measuring the step sizes in each iteration.

AMSGrad [38] was essentially an improvement on Adam, motivated by the original authors' finding that adaptive methods which rely on only a limited history of the gradients cannot guarantee good convergence, something they proved explicitly for the Adam optimizer. Hence, to obtain a meaningful convergence guarantee, the optimizer should retain information about past gradients rather than relying purely on expectations or moving averages. Another thing to notice is that the learning rates of SGD and AdaGrad can only decrease, that is, after initialization the step size can only shrink, or \Gamma_{t}\succcurlyeq 0 for every time step t; for Adam and RMSprop, however, \Gamma_{t} can be indefinite, which can lead to poor convergence for both. The authors also proved that for \beta_{1},\beta_{2}\in[0,1] with \beta_{1}/\sqrt{\beta_{2}}<1, Adam need not converge to the optimal solution in a stochastic setting, and the regret R need not be bounded either. Hence the focus of AMSGrad was to develop an algorithm with the same ingredients as Adam, namely bias corrections and adaptive gradients, but which converges to the optimal solution. As mentioned earlier, \Gamma_{t} is indefinite in the case of Adam, although it was originally assumed to be positive semidefinite. As a correction, AMSGrad initializes the step size to a comparatively smaller quantity than Adam while still suppressing the effect of large gradients, so that \Gamma_{t} remains positive semidefinite. The main difference between Adam and AMSGrad is that AMSGrad uses the maximum of the v_{t} values seen so far in place of v_{t} itself in the moving-average denominator, which we can write as,

\hat{v_{t}}=\max(\hat{v_{t-1}},v_{t}) (54)

This shows that instead of allowing the step size to grow when the gradients shrink, as can happen in Adam, AMSGrad never lets \hat{v_{t}} decrease, which keeps the effective learning rate in check even when the gradients vary widely. The final update, analogous to the Adam update, is then given by,

x_{t+1}=\Pi_{\mathcal{F}}\bigg(x_{t}-\eta_{t}\frac{\hat{m_{t}}}{\sqrt{\hat{v_{t}}}+\epsilon}\bigg) (55)
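To see the effect of the max operation in practice, here is a minimal NumPy sketch of an AMSGrad step on the same kind of toy objective; the projection onto \mathcal{F} is omitted, i.e., \mathcal{F} is assumed to be the whole parameter space, and the names and constants are illustrative only.

```python
import numpy as np

def amsgrad_step(x, grad, m, v, v_hat, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update (sketch): like Adam, but v_hat is the running maximum of v, cf. eq. 54."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_hat = np.maximum(v_hat, v)              # never let the denominator shrink
    x = x - lr * m / (np.sqrt(v_hat) + eps)   # final update, cf. eq. 55 without the projection
    return x, m, v, v_hat

# Toy usage on f(x) = ||x||^2.
x = np.array([1.5, -2.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
v_hat = np.zeros_like(x)
for t in range(1, 2001):
    grad = 2 * x
    x, m, v, v_hat = amsgrad_step(x, grad, m, v, v_hat)
print(x)  # approaches [0, 0], with a step size that can only shrink
```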

V-I Padam

Padam [39] is an extension of AMSGrad which makes it possible to interpolate between SGD and AMSGrad through a new parameter p whose value lies in the range [0,1/2]: at p=0 the algorithm behaves like SGD with momentum, and at p=1/2 it reduces to AMSGrad. The parameter p enters the final update of AMSGrad, which becomes,

x_{t+1}=\Pi_{\mathcal{F},\hat{v_{t}}^{p}}\bigg(x_{t}-\eta_{t}\frac{\hat{m_{t}}}{\hat{v_{t}}^{p}}\bigg) (56)
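The role of the exponent p can be sketched directly from equation 56; as before, the projection step is dropped for simplicity, and the function below is an illustrative sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def padam_step(x, grad, m, v, v_hat, p=0.125, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Padam update (sketch): AMSGrad moments, but the denominator is v_hat**p, cf. eq. 56."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_hat = np.maximum(v_hat, v)               # AMSGrad-style running maximum
    x = x - lr * m / (v_hat ** p + eps)        # p = 1/2 recovers AMSGrad; p -> 0 behaves like SGD with momentum
    return x, m, v, v_hat
```

The same toy loop used for the AMSGrad example above can be reused here; with p=1/2 the iterates match AMSGrad, while smaller p makes the update behave more like a momentum step scaled by the raw learning rate.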

This technique is quite similar to what we saw for the AdaMax optimizer as shown by equation 46, the key difference being that here only the exponent in the final step is changed rather than the norm itself. Choosing the value of p is the most important task with this optimizer; the best choice of p is strongly coupled with the step size, and the most promising results were obtained with p=0.125. This is justifiable because if p were large, the value of \eta/\hat{v_{t}}^{p} would also be large for small \hat{v_{t}}, which can only be compensated by using small values of \eta. The authors also describe a kind of early stopping caused by excessive decay of the learning rate, which they term the “small learning rate dilemma”, and show that as the value of p grows (within [0,1/2]) the effect of the dilemma becomes stronger, whereas for p<1/2 Padam still performs well even if the step size is perturbed slightly. An important note here is that the authors focused on modifying \hat{v_{t}} rather than \eta_{t} for the sake of generality: we are not after improvements on a single problem but results that carry over to a variety of problems, and this loss of generality is the reason optimizers like Adam do not give remarkably better results than SGD with momentum [40] on datasets like CIFAR-10 and CIFAR-100 [41].

V-J AdamW

Having built a good foundation of gradient descent optimization, we can now make the connection between loss functions and optimization explicit. As introduced in Section 2, the loss function carries a regularization term, the \mathcal{L}_{2} regularization term, whose purpose is to prevent overfitting. As a prerequisite, we assume the loss function f_{t}(x) already includes this regularization term. A related notion, weight decay, which was not introduced earlier, simply adds a decay term to the weights and is given by,

x_{t+1}=(1-\lambda)x_{t}-\eta\nabla f_{t}(x_{t}) (57)

Here \lambda is the rate at which the decay happens. The terms \mathcal{L}_{2} regularization and weight decay are often used interchangeably, which is not correct in general, although for standard SGD (and SGD with momentum) they can be treated as the same, as we show next. To make this precise, we write the regularized loss function as,

f_{t}'(x)=f_{t}(x)+\frac{\lambda'}{2}\|x\|_{2}^{2} (58)

Where \lambda' is the \mathcal{L}_{2} regularization factor. If we now take a plain SGD step on the regularized loss of equation 58 and compare it with equation 57, we get,

\begin{split}x_{t+1}=x_{t}-\eta\nabla f_{t}'(x_{t})\\ x_{t+1}=x_{t}-\eta\nabla f_{t}(x_{t})-\eta\lambda'x_{t}\\ x_{t+1}=(1-\lambda)x_{t}-\eta\nabla f_{t}(x_{t})\end{split} (59)
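This equivalence is easy to verify numerically for plain SGD: with \lambda=\eta\lambda', one step on the regularized loss of equation 58 lands on exactly the same point as one decoupled weight decay step of equation 57. The quadratic toy loss and constants below are illustrative assumptions.

```python
import numpy as np

# Toy loss f(x) = ||x||^2 and its gradient.
grad_f = lambda x: 2 * x

x0 = np.array([1.0, -3.0])
eta, lam_prime = 0.1, 0.05
lam = eta * lam_prime                        # the coupling assumed in eq. 59

# SGD on the L2-regularized loss f'(x) = f(x) + (lambda'/2)||x||^2
x_l2 = x0 - eta * (grad_f(x0) + lam_prime * x0)

# SGD with decoupled weight decay (eq. 57)
x_wd = (1 - lam) * x0 - eta * grad_f(x0)

print(np.allclose(x_l2, x_wd))               # True: identical for plain SGD
```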

Here the only assumption is that \lambda'=\lambda/\eta. This is probably the reason the two notions are used interchangeably. But it also means that \lambda' and \eta are strongly coupled, which should not be the case, since these hyperparameters are responsible for two separate phenomena. Hence, in SGDW the authors proposed decoupling the weight decay from the gradient step: the decay is applied directly to the weights, scaled by a schedule multiplier \delta_{t} that can either be fixed or itself decay over time, so that the update becomes

x_{t+1}=x_{t}-m_{t}-\delta_{t}\lambda x_{t} (60)

Now, while we showed that for SGD \mathcal{L}_{2} regularization is the same as weight decay, this is not the case for adaptive methods. Let us define a generic adaptive update without weight decay as x_{t+1}=x_{t}-\eta M_{t}\nabla f_{t}(x_{t}), where M_{t} is called a preconditioner matrix with M_{t}\neq kI and can be thought of as carrying the ratio \hat{m_{t}}/\sqrt{\hat{v_{t}}}; the corresponding update with weight decay can be written as x_{t+1}=(1-\lambda)x_{t}-\eta M_{t}\nabla f_{t}(x_{t}). If we instead follow the \mathcal{L}_{2} route of equation 58, we get,

\begin{split}x_{t+1}=x_{t}-\eta M_{t}\nabla f_{t}(x_{t})-\eta M_{t}\lambda'x_{t}\\ x_{t+1}=(1-\lambda)x_{t}-\eta M_{t}\nabla f_{t}(x_{t})\end{split} (61)

Now, for \mathcal{L}_{2} regularization to be the same as weight decay even for adaptive methods, both sub-equations of equation 61 should coincide, that is, \lambda x_{t}=\eta\lambda'M_{t}x_{t} should hold; but for this to be satisfied M_{t} would have to be a scalar multiple of the identity, which it is not, hence the relation does not hold. Therefore, for adaptive methods, \mathcal{L}_{2} regularization is not the same as weight decay, and for this reason we have to modify Adam by introducing an explicit weight decay term. Proceeding as we did for SGD with momentum, the final update of AdamW becomes,

x_{t}=x_{t-1}-\delta_{t}\bigg[\eta\frac{\hat{m_{t}}}{\sqrt{\hat{v_{t}}}+\epsilon}+\lambda x_{t-1}\bigg] (62)
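As a final sketch, a single AdamW step in the spirit of equation 62 can be written as follows; the schedule multiplier \delta_{t} is kept constant and the hyperparameter values are illustrative assumptions, not prescribed settings.

```python
import numpy as np

def adamw_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01, delta=1.0):
    """One AdamW update (sketch): Adam moments plus decoupled weight decay, cf. eq. 62."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                 # bias corrections as in Adam
    v_hat = v / (1 - beta2 ** t)
    # The decay term is applied to the weights directly, not folded into the gradient
    # as L2 regularization would be.
    x = x - delta * (lr * m_hat / (np.sqrt(v_hat) + eps) + weight_decay * x)
    return x, m, v
```

The function can be dropped into the same toy loop as the earlier examples; setting weight_decay to zero recovers a plain Adam step.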

VI Conclusion

In this article we reviewed the building blocks of optimization in deep learning, from loss functions and backpropagation to gradient descent and its refinements: momentum and Nesterov acceleration, adaptive methods such as AdaGrad, AdaDelta, RMSprop and Adam, and the later corrections Nadam, AMSGrad, Padam and AdamW. Each of these strategies addresses a specific weakness of plain gradient descent, whether slow convergence, poorly scaled learning rates, or the coupling of weight decay with the step size, and understanding these trade-offs is what makes it possible to choose a suitable optimizer for a given problem.


References

  • [1] F. Nie, Z. Hu, and X. Li, “An investigation for loss functions widely used in machine learning,” Communications in Information and Systems, vol. 18, no. 1, pp. 37–52, 2018.
  • [2] A. Ghosh, H. Kumar, and P. Sastry, “Robust loss functions under label noise for deep neural networks,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [3] P. Zhao and T. Zhang, “Stochastic optimization with importance sampling for regularized loss minimization,” in international conference on machine learning, 2015, pp. 1–9.
  • [4] M. Warren, J. Gresham, and B. Wyatt, “Transcendental functions with a complex twist,” arXiv preprint arXiv:1805.05320, 2018.
  • [5] P. J. Huber, Robust statistics.   John Wiley & Sons, 2004, vol. 523.
  • [6] A. Shrivastava, A. Gupta, and R. Girshick, “Training region-based object detectors with online hard example mining,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 761–769.
  • [7] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 41–48.
  • [8] A. L. Yuille and A. Rangarajan, “The concave-convex procedure,” Neural computation, vol. 15, no. 4, pp. 915–936, 2003.
  • [9] M. P. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” in Advances in Neural Information Processing Systems, 2010, pp. 1189–1197.
  • [10] S.-i. Amari and S. Wu, “Improving support vector machine classifiers by modifying kernel functions,” Neural Networks, vol. 12, no. 6, pp. 783–789, 1999.
  • [11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” nature, vol. 323, no. 6088, pp. 533–536, 1986.
  • [12] J. Sun, Q. Qu, and J. Wright, “Complete dictionary recovery over the sphere i: Overview and the geometric picture,” IEEE Transactions on Information Theory, vol. 63, no. 2, pp. 853–884, 2016.
  • [13] R. Ge, J. D. Lee, and T. Ma, “Matrix completion has no spurious local minimum,” in Advances in Neural Information Processing Systems, 2016, pp. 2973–2981.
  • [14] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points—online stochastic gradient for tensor decomposition,” in Conference on Learning Theory, 2015, pp. 797–842.
  • [15] K. Kawaguchi, “Deep learning without poor local minima,” in Advances in neural information processing systems, 2016, pp. 586–594.
  • [16] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” in Artificial intelligence and statistics, 2015, pp. 192–204.
  • [17] C. Jin, P. Netrapalli, and M. I. Jordan, “Accelerated gradient descent escapes saddle points faster than gradient descent,” arXiv preprint arXiv:1711.10456, 2017.
  • [18] A. Anandkumar and R. Ge, “Efficient approaches for escaping higher order saddle points in non-convex optimization,” in Conference on learning theory, 2016, pp. 81–102.
  • [19] Z. Allen-Zhu, “Natasha 2: Faster non-convex optimization than sgd,” in Advances in neural information processing systems, 2018, pp. 2675–2686.
  • [20] S. Liang, R. Sun, J. D. Lee, and R. Srikant, “Adding one neuron can eliminate all bad local minima,” in Advances in Neural Information Processing Systems, 2018, pp. 4350–4360.
  • [21] K. Kawaguchi and L. P. Kaelbling, “Elimination of all bad local minima in deep learning,” arXiv preprint arXiv:1901.00279, 2019.
  • [22] P. Auer, M. Herbster, and M. K. Warmuth, “Exponentially many local minima for single neurons,” in Advances in neural information processing systems, 1996, pp. 316–322.
  • [23] J. Nocedal and S. Wright, Numerical optimization.   Springer Science & Business Media, 2006.
  • [24] G. Yuan, S. Lu, Z. Wei et al., “A line search algorithm for unconstrained optimization,” Journal of Software Engineering and Applications, vol. 3, no. 05, p. 503, 2010.
  • [25] M. Li, T. Zhang, Y. Chen, and A. J. Smola, “Efficient mini-batch training for stochastic optimization,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 661–670.
  • [26] D. Musso, “Stochastic gradient descent with random learning rate,” arXiv preprint arXiv:2003.06926, 2020.
  • [27] R. Caruana, S. Lawrence, and C. L. Giles, “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,” in Advances in neural information processing systems, 2001, pp. 402–408.
  • [28] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural networks, vol. 12, no. 1, pp. 145–151, 1999.
  • [29] Y. E. Nesterov, “A method for solving the convex programming problem with convergence rate O(1/k^2),” in Dokl. Akad. Nauk SSSR, vol. 269, 1983, pp. 543–547.
  • [30] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Advances in neural information processing systems, 2013, pp. 315–323.
  • [31] Z. Allen-Zhu, “Katyusha: The first direct acceleration of stochastic gradient methods,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 8194–8244, 2017.
  • [32] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization.” Journal of machine learning research, vol. 12, no. 7, 2011.
  • [33] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
  • [34] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
  • [35] D. Zhou, Y. Tang, Z. Yang, Y. Cao, and Q. Gu, “On the convergence of adaptive gradient methods for nonconvex optimization,” arXiv preprint arXiv:1808.05671, 2018.
  • [36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [37] T. Dozat, “Incorporating nesterov momentum into adam,” 2016.
  • [38] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.
  • [39] J. Chen, D. Zhou, Y. Tang, Z. Yang, and Q. Gu, “Closing the generalization gap of adaptive gradient methods in training deep neural networks,” arXiv preprint arXiv:1806.06763, 2018.
  • [40] X. Gastaldi, “Shake-shake regularization,” arXiv preprint arXiv:1705.07485, 2017.
  • [41] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.