
Channel-Directed Gradients for Optimization of Convolutional Neural Networks

Dong Lao, Peihao Zhu, Peter Wonka, Ganesh Sundaramoorthi
King Abdullah University of Science and Technology
Thuwal, Saudi Arabia
{dong.lao,peihao.zhu,peter.wonka,ganesh.sundaramoorthi}@kaust.edu.sa
Abstract

We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error. The method requires only simple processing of existing stochastic gradients, can be used in conjunction with any optimizer, and has only a linear overhead (in the number of parameters) compared to computation of the stochastic gradient. The method works by computing the gradient of the loss function with respect to output-channel directed re-weighted $\mathbb{L}^2$ or Sobolev metrics, which has the effect of smoothing components of the gradient across a certain direction of the parameter tensor. We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental. We present the continuum theory of such gradients, its discretization, and application to deep networks. Experiments on benchmark datasets, several networks and baseline optimizers show that optimizers can be improved in generalization error by simply computing the stochastic gradient with respect to output-channel directed metrics.

1 Introduction

Stochastic gradient descent (SGD) is currently the dominant algorithm for optimizing large-scale convolutional neural networks (CNNs) LeCun et al. (1998); Simonyan & Zisserman (2014); He et al. (2016b). Although there has been considerable activity in optimization methods seeking to improve performance, SGD still dominates in large-scale CNN optimization in terms of its generalization ability. Despite SGD's dominance, there is still often a gap between training and real-world test accuracy in applications, which necessitates research in optimization methods to increase generalization accuracy.

In this paper, we observe that there is regularity in parameter tensors of learned CNN models, and we thus exploit this regularity implicitly in optimization to derive new optimization methods that are simple modifications of traditional SGD and improve generalization. In particular, we empirically observe that parameter tensors in trained networks typically exhibit correlation over the output channel dimension (see Figure 1). We thus explore encoding correlation through smoothness in optimization, which we show improves generalization accuracy, as learning without imposing regularity may not fully learn it. We encode smoothness implicitly in stochastic gradient descent by considering new metrics on the parameter space of the network, and reformulating the notion of gradient. To do this, we treat the space of parameter tensors as a Riemannian manifold to derive gradients of the loss with respect to new metrics that promote regularity in the output channel dimension of the tensors by changing the geometry of the underlying space of tensors.

Our contributions are as follows. First, we formulate output channel-directed Riemannian metrics (a re-weighted version of the standard $\mathbb{L}^2$ metric and another that is a Sobolev metric) over the space of parameter tensors. This encodes channel-directed regularity inherently in the gradient optimization without changing the loss. Second, we compute Riemannian gradients with respect to these metrics, showing linear complexity (in the number of parameters) over standard gradient computation, and thus derive new optimization methods for CNN training. Finally, we apply the methodology to training CNNs and show the empirical advantage in generalization accuracy, especially with small batch sizes, over standard optimizers (SGD, Adam) on numerous applications (image classification, semantic segmentation, generative adversarial networks) with simple modification of existing optimizers.

1.1 Related Work

Figure 1: Visualization of weights (parameter tensor) of convolutional layers trained on ImageNet using SGD. The vertical structures indicate regularity (correlation) of the weights along the output channel direction. This pattern is frequently observed in layers. Our method is motivated by this empirical observation, and favors parameter correlations in this direction in optimization.

We briefly survey related work in deep network optimization; a detailed survey is Bottou et al. (2018). Stochastic gradient descent (SGD), e.g., Bottou (2012), samples a batch of data to tractably estimate the gradient of the loss function. As the stochastic gradient is a noisy version of the gradient, learning rates must follow a decay schedule in order to converge. Many methods have been formulated to choose the learning rate over epochs and components of the gradient, including recent work on adaptive learning rates (e.g., Duchi et al. (2011); Zeiler (2012); Kingma & Ba (2014); Bengio (2015); Loshchilov & Hutter (2017); Luo et al. (2019)). For instance, Adam Kingma & Ba (2014) adaptively adjusts the learning rate so that parameters that have changed infrequently based on historical gradients are updated more quickly than parameters that have changed frequently. Another way to interpret such methods is that they change the underlying metric on the space on which the loss function is defined to an isotropically scaled version of the $\mathbb{L}^2$ metric given by a simple diagonal matrix; we change the metric anisotropically. We show that our method can be used in conjunction with such methods by simply using the stochastic gradient computed with our metrics to boost performance.

As the stochastic gradient is computed based on sampling, different runs of the algorithm can result in different local optima. To reduce the variance, several methods have been formulated, e.g., Defazio et al. (2014); Johnson & Zhang (2013). We are not motivated by variance reduction; rather, we are motivated by inducing regularity in optimization to improve generalization. However, as our method effectively smooths the gradient, our empirical experiments do indicate reduced variance with our metrics compared to SGD.

Another method motivated by variance reduction is the recent work of Osher et al. (2018), where the stochastic gradient is pre-multiplied with an inverse Laplacian smoothing matrix. For CNNs, the gradient with respect to parameters is rasterized in row or column order of network filters before smoothing, which still lowers variance. Our work is inspired by Osher et al. (2018), though we are motivated by incorporating structured regularity of the parameter tensor inherently in the optimization. Osher et al. (2018) can be interpreted as using the gradient of the loss with respect to a Sobolev metric. Our major insight over Osher et al. (2018) is that keeping the multi-dimensional structure of the parameter tensor (rather than rasterizing) and preferentially defining the Sobolev metric with respect to the output-channel direction boosts generalization accuracy, while other directions appear to provide no boost. Second, we introduce a re-weighted $\mathbb{L}^2$ metric that preferentially treats the output-channel direction, can be implemented with one line of PyTorch code, has linear (in parameter size) complexity, and achieves similar results (in many cases) to our channel-directed Sobolev metric, boosting generalization of SGD and Adam. Third, our channel-directed Sobolev gradient can be implemented at linear cost rather than quasi-linear (not requiring an FFT to compute). Sobolev gradients have been used in computer vision Sundaramoorthi et al. (2007); Charpiat et al. (2007) for their coarse-to-fine evolution properties Sundaramoorthi et al. (2008), and we adapt that formulation to channel-directed metrics for CNNs.

We formulate Sobolev gradients by considering the space of parameter tensors as a Riemannian manifold, and choosing the metric (i.e., inner product) on the tangent space to be a Sobolev metric. By choosing a metric, gradients intrinsic to the manifold can be computed and gradient flows decrease the loss. Other Riemannian metrics have been used for optimization in machine learning, e.g., Amari (1998); Marceau-Caron & Ollivier (2016); Hoffman et al. (2013); Gunasekar et al. (2018; 2020), and these tangentially relate to our work. These works are based on Amari's Amari (1998) information geometry on probability measures, and the metric considered is the Fisher information metric. The motivation for these methods is re-parametrization invariance of optimization, whereas our motivation is imposing regularity directly in the parameter space. Most of these methods relate to density estimation as the metric is on probability measures. Gunasekar et al. (2018) notes that even vanilla gradient descent has a certain implicit bias relating to the underlying metric on the space. In Gunasekar et al. (2020) the Hessian metric (in the convex case) is analyzed and related to mirror descent. These metrics are data-dependent and the gradient is challenging to compute, requiring (a large) inverse matrix computation. Moreover, they do not exploit the channel dimension regularity, the main purpose of our work.

2 Channel-Directed Gradients

We now present the theory to define channel-directed gradients. To do this, we formulate new metrics on the space of tensors, and then derive analytic formulas for channel-directed gradients in terms of the standard $\mathbb{L}^2$ gradient. As we show, our channel-directed gradients effectively smooth the components of the $\mathbb{L}^2$ gradient across a certain direction of the parameter tensors of the CNN. Another interpretation is that we are changing the geometric structure of the loss landscape (without changing the loss) to a smoother one by changing the underlying metric of the space on which the loss is defined.

Our metrics are motivated by the empirical observation that a certain dimension of parameter tensors in trained deep networks exhibits regularity (see Figure 1), and thus our method exploits this regularity implicitly in optimization. If we visualize the parameter tensor along its input and output channel dimensions, we see mostly what appears to be random noise; in addition, however, there are regular (correlated) patterns along the output channel direction, implying that each output channel of a layer of a network uses similar (regularly-varying) weightings of input channels. Our metrics thus favor gradient updates during optimization that exhibit such correlation, which we show in experiments leads to optimization that generalizes better.

2.1 Background on Riemannian Gradients

We first briefly present the definition of gradient on a Riemannian manifold, and show the explicit dependence of the gradient on the chosen metric on the manifold. More detailed theory can be found in Carmo (1992); Abraham et al. (2012). We note that a manifold $\mathcal{X}$ is a space that is locally linear around each point $X\in\mathcal{X}$ in the space, and the linear space at each point is called the tangent space, denoted $T_X\mathcal{X}$. A Riemannian manifold, in addition, has a smoothly varying positive definite bilinear form $\left<\cdot,\cdot\right>$ (called the metric) on the tangent space. This metric allows one to define the notion of lengths of curves on the space, in addition to many other operations, including gradients of functions defined on the space.

Definition 1 (Gradient of a Function)

Let $\mathcal{X}$ be a Riemannian manifold, and $f:\mathcal{X}\to\mathbb{R}$ be a function. The directional derivative of $f$ at $X\in\mathcal{X}$ along a direction $k\in T_X\mathcal{X}$ is defined as $\mathrm{d}f(X)\cdot k=\left.\frac{\mathrm{d}}{\mathrm{d}\varepsilon}f(X+\varepsilon k)\right|_{\varepsilon=0}$. The gradient of $f$ at $X\in\mathcal{X}$ is the vector $\nabla f(X)\in T_X\mathcal{X}$ that satisfies the relation

\mathrm{d}f(X)\cdot k=\left<\nabla f(X),k\right>,\quad\text{for all }k\in T_X\mathcal{X}.    (1)

From the definition, we note that “the” gradient depends on the choice of the metric on the manifold. We note that any such gradient will decrease the function $f$ by moving infinitesimally in the tangent space in the direction of the negative gradient, as $\mathrm{d}f(X)\cdot k=-\|\nabla f(X)\|^2<0$ when $k=-\nabla f(X)$, where $\|\cdot\|$ is the norm induced by $\left<\cdot,\cdot\right>$. The gradient flow, defined by the differential equation $\dot{X}_t=-\nabla f(X_t)$, will converge to a local minimum. In our application of this theory to CNN optimization, $f$ will be the loss function, and $\mathcal{X}$ will be the space of parameter tensors. In this case, as the tensor is multi-dimensional, the gradient flow will be a partial differential equation.
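As a concrete finite-dimensional illustration (ours, added for exposition and not part of the original text), take $\mathcal{X}=\mathbb{R}^n$ with the metric $\left<k_1,k_2\right>_M=k_1^\top M k_2$ for a symmetric positive-definite matrix $M$. Definition 1 then requires $(\nabla_{\mathbb{L}^2}f(X))^\top k=(\nabla_M f(X))^\top M k$ for all $k$, so

\nabla_M f(X)=M^{-1}\,\nabla_{\mathbb{L}^2}f(X),

i.e., changing the metric re-shapes the ordinary $\mathbb{L}^2$ gradient by $M^{-1}$ without changing the function $f$; our channel-directed metrics play the role of such an $M$ on the (infinite-dimensional) space of tensors.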

A consequence of this definition is that the gradient is the direction (up to a scale factor) in the tangent space that optimizes the following problem:

\operatorname*{arg\,max}_{k\in T_X\mathcal{X}\backslash\{0\}}\frac{|\mathrm{d}f(X)\cdot k|}{\|k\|}.    (2)

Thus, the gradient can be regarded as the most efficient direction, as it maximizes the ratio of the change in energy obtained by perturbing in a direction $k$ over the cost (defined by the metric) of $k$. Thus, by constructing the metric to have small cost for perturbations (directions) that we prefer for gradients, the gradient flow will move in these preferential directions while minimizing the function, and thus land in more favorable local minima. In particular, we construct metrics that favor gradients with output channel direction regularity, and this does induce regularity in the final tensor.

2.2 Channel-Directed Metrics

As gradients of a function depend on the metric structure on the underlying space, we re-define the metric on the underlying space so that tensors that differ smoothly along the output channel direction have small distance. In existing deep network gradient-based optimization schemes, the underlying metric on the loss function is assumed to be the standard Euclidean $\mathbb{L}^2$ metric. We will consider a re-weighted version of the $\mathbb{L}^2$ metric and Sobolev metrics that favor regularity in the output channel direction of parameter tensors. To formulate the methodology, we start from a continuum formulation, where we treat weight tensors in the continuum, formulate the metrics in the continuum, and then in the next sub-section derive the gradients with respect to these metrics. Finally, we discretize the gradient flows in the implementation to derive iterative schemes.

Let $X:\mathcal{O}\times\mathcal{I}\times\mathcal{H}\times\mathcal{W}\to\mathbb{R}$ denote a parameter tensor of a deep network (from a layer of a convolutional network). Here $\mathcal{O}=[0,O]$ denotes indices to the output channel dimension of the tensor, $\mathcal{I}=[0,I]$ denotes indices to the input channel, and $\mathcal{H}=[0,H],\mathcal{W}=[0,W]$ denote indices to the height and width dimensions of the spatial filters of the tensor. The metric is defined on the tangent space to the space of such $X$. An element of the tangent space has the same form as the tensor, i.e., $k:\mathcal{O}\times\mathcal{I}\times\mathcal{H}\times\mathcal{W}\to\mathbb{R}$. The $\mathbb{L}^2$ (called $H^0$ from now on) metric is defined as

\left\langle k_1,k_2\right\rangle_{H^0}=\int_{\mathcal{O},\mathcal{I},\mathcal{H},\mathcal{W}}k_1(o,i,h,w)\cdot k_2(o,i,h,w)\,\mathrm{d}o\,\mathrm{d}i\,\mathrm{d}h\,\mathrm{d}w,    (3)

where $k_1,k_2$ are in the tangent space of tensors. We now define a re-weighted version of $H^0$ that favors tangent vectors that have global smoothness in the direction of the $\mathcal{O}$ dimension:

\left<k_1,k_2\right>_{H^0_\lambda}=\int_{\mathcal{I},\mathcal{H},\mathcal{W}}\bar{k}_1(i,h,w)\cdot\bar{k}_2(i,h,w)\,\mathrm{d}i\,\mathrm{d}h\,\mathrm{d}w+\frac{\lambda}{O}\left\langle k_1-\bar{k}_1,k_2-\bar{k}_2\right\rangle_{H^0},    (4)

where $\lambda>0$ is a hyper-parameter, and $\bar{k}$ is the average value in the output channel direction, i.e.,

\bar{k}(i,h,w)=\frac{1}{O}\int_{\mathcal{O}}k(o,i,h,w)\,\mathrm{d}o.    (5)

The metric in (4) splits the tangent vector into a global translation in the output channel direction and its orthogonal complement, i.e., the deformation. The weight $\lambda$ controls the relative weighting between the translation and deformation components, i.e., larger values of $\lambda$ mean that deformations more heavily influence the norm of the perturbation. As shown in the next sub-section, this means gradients with respect to this metric weight channel-directed translations more heavily than deformations.

Next, we introduce channel-directed versions of the Sobolev metric, defined as follows:

\left\langle k_1,k_2\right\rangle_{H^1}=\frac{1}{O}\left\langle k_1,k_2\right\rangle_{H^0}+\lambda O\left\langle\frac{\partial k_1}{\partial o},\frac{\partial k_2}{\partial o}\right\rangle_{H^0}    (6)
\left<k_1,k_2\right>_{\tilde{H}^1}=\int_{\mathcal{I},\mathcal{H},\mathcal{W}}\bar{k}_1(i,h,w)\cdot\bar{k}_2(i,h,w)\,\mathrm{d}i\,\mathrm{d}h\,\mathrm{d}w+\lambda O\left\langle\frac{\partial k_1}{\partial o},\frac{\partial k_2}{\partial o}\right\rangle_{H^0},    (7)

where $\frac{\partial}{\partial o}$ indicates the partial derivative with respect to the output channel direction. The partial derivative in the $o$-direction implies that tensor perturbations that are smooth along the $o$-direction are close with respect to these metrics, which in turn implies that the corresponding gradients will exhibit smoothness in this direction, i.e., convolution filters that are nearby in the output direction will exhibit correlation. The first metric is analogous to the usual Sobolev metric, being a weighted combination of the $H^0$ metric and the $H^0$ metric of the derivative, except that it only considers the derivative with respect to one direction. The second metric is similar to the first except that we use the $H^0$ metric of the channel-directed average rather than that of the perturbation itself. As we will see, both have similar properties, but the latter is computationally less costly to compute. The scale factors of $O$ in the expressions above make the metrics scale invariant with respect to different output channel sizes. The part of the metric with the partial derivative component implies that tensors that differ in the output channel direction by a non-smooth perturbation are far away in distance. In the latter metric, tensors that differ by just a channel-directed translation are close. Compared with the re-weighted $H^0$ metric (4), the latter Sobolev metric promotes smooth perturbations beyond global translations.

We have presented only channel-directed metrics that preferentially treat the output channel dimension of the tensor as our empirical experiments demonstrate that promoting regularity in other directions is detrimental to optimization performance.
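To make discrete analogs of the metrics (3), (4), (6), and (7) concrete, the following is a minimal PyTorch sketch (our illustration, with assumed function names and a unit-spacing discretization; it is not part of the original implementation) of the channel-directed inner products on a parameter tensor of shape $O\times I\times H\times W$:

import torch

def inner_H0(k1, k2):
    # Discrete H^0 (L^2) inner product of eq. (3), with unit grid spacing.
    return (k1 * k2).sum()

def channel_mean(k):
    # Average over the output-channel dimension (dim 0), as in eq. (5).
    return k.mean(dim=0, keepdim=True)

def inner_H0_lambda(k1, k2, lam=1.0):
    # Re-weighted H^0 metric of eq. (4): translation part plus (lam/O) times the deformation part.
    O = k1.shape[0]
    k1_bar, k2_bar = channel_mean(k1), channel_mean(k2)
    return (k1_bar * k2_bar).sum() + (lam / O) * inner_H0(k1 - k1_bar, k2 - k2_bar)

def d_do(k):
    # Periodic forward difference along the output-channel direction (the o-derivative).
    return torch.roll(k, shifts=-1, dims=0) - k

def inner_H1(k1, k2, lam=1.0):
    # Channel-directed Sobolev metric H^1 of eq. (6).
    O = k1.shape[0]
    return inner_H0(k1, k2) / O + lam * O * inner_H0(d_do(k1), d_do(k2))

def inner_H1_tilde(k1, k2, lam=1.0):
    # Channel-directed Sobolev metric H~1 of eq. (7).
    O = k1.shape[0]
    k1_bar, k2_bar = channel_mean(k1), channel_mean(k2)
    return (k1_bar * k2_bar).sum() + lam * O * inner_H0(d_do(k1), d_do(k2))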

2.3 Computing Channel-Directed Gradients

Figure 2: Visualization of kernels applied to the $H^0$ gradient under different metrics for $\lambda=1$. This illustrates the smoothing effect of the metrics. In computation, linear-cost formulas are applied to compute the gradients, without using the convolution interpretation.

We now compute gradients with respect to the metrics defined in the previous sub-section in terms of the $H^0$ gradient, so that simple processing of the existing gradient can be done with no other modification of existing optimization code. To compute the relation between the channel-directed gradients and the usual $H^0$ gradient, we note the relation between the directional derivative of a loss function, the gradient and the metric:

\mathrm{d}L(X)\cdot k=\left\langle\nabla_{H^0}L(X),k\right\rangle_{H^0}=\left<\nabla_{H^0_\lambda}L(X),k\right>_{H^0_\lambda}=\left\langle\nabla_{H^1}L(X),k\right\rangle_{H^1}=\left<\nabla_{\tilde{H}^1}L(X),k\right>_{\tilde{H}^1},    (8)

where $\mathrm{d}L(X)\cdot k=\lim_{\varepsilon\to 0}\frac{L(X+\varepsilon k)-L(X)}{\varepsilon}$ is the directional derivative in the direction of the perturbation $k$. Note that the expression above indicates that the directional derivative is equal to the inner product (metric) between the gradient with respect to the metric and the perturbation. This holds for any metric. With this relation, we may compute the channel-directed gradients in terms of the $H^0$ gradient. Derivations are in the Supplementary materials. Letting $f=\nabla_{H^0}L(X)$, we have

\nabla_{H^0_\lambda}L(X)=\bar{f}+\frac{1}{\lambda}(f-\bar{f})    (9)
f=\nabla_{H^1}L(X)-\lambda O^2\frac{\partial^2}{\partial o^2}\nabla_{H^1}L(X)\quad\text{ and }\quad f=\overline{\nabla_{\tilde{H}^1}L(X)}-\lambda O^2\frac{\partial^2}{\partial o^2}\nabla_{\tilde{H}^1}L(X),    (10)

where the last two expressions are second-order ordinary differential equations (ODEs), whose solutions we discuss next. Notice that the re-weighted $H^0$ gradient (9) simply re-weights the channel-directed translation component and the deformation component of the $H^0$ gradient differently, i.e., as $\lambda$ gets larger, the channel-directed translation becomes more prominent.
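For concreteness, the re-weighted $H^0$ gradient in (9) can be transcribed directly into PyTorch (a sketch with an assumed function name; the Figure 3 code in Section 3 implements a re-scaled variant of the same expression):

import torch

def reweighted_H0_grad(f, lam=1.0):
    # Eq. (9): keep the channel-directed mean component of the H^0 gradient f
    # and down-weight the deformation component by 1/lam.
    f_bar = f.mean(dim=0, keepdim=True)   # average over the output-channel dimension
    return f_bar + (f - f_bar) / lam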

In obtaining the ODE expressions for the Sobolev gradients above, we have assumed periodic boundary conditions in the $\mathcal{O}$ dimension. (Since the ordering of filters along the channel direction in a CNN has no particular significance, choosing periodic or non-periodic boundary conditions is arbitrary; periodic conditions induce smoothness between the starting and ending filters in the $o$-dimension and enable a simpler computational solution.) In this case, the Sobolev gradients can be interpreted as the circular convolution of the $H^0$ gradient with convolution kernels given as

K(o)=\frac{\cosh\left[\lambda^{-1/2}(o-0.5)\right]}{2\sinh\left[\lambda^{-1/2}\right]},\quad\tilde{K}(o)=1+\frac{o^2-o+1/6}{2\lambda},\quad\text{ for }o\in[0,1],    (11)

for each of the $H^1$ and $\tilde{H}^1$ metrics, respectively, where $o$ above is scaled by $O$ to be between 0 and 1, and the circular convolution is given by

\nabla_{\tilde{H}^1}L(X)(o,i,h,w)=\frac{1}{O}\int_{\mathcal{O}}\tilde{K}((o-\tilde{o})/O)\,f(\tilde{o},i,h,w)\,\mathrm{d}\tilde{o}.    (12)

Note that the re-weighted $H^0$ solution also has an interpretation as convolution with a smoothing kernel. Figure 2 shows plots of various kernels for the parameter $\lambda$ chosen in experiments. For each $o$, the Sobolev or re-weighted $H^0$ gradient is a local average whose weights die away far from $o$. Thus, we can see that the effect of the metrics is to induce smoothness of the gradient along the output channel direction.
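For illustration, a brief sketch (ours; a naive quadratic-cost implementation of the $\tilde{K}$ kernel of (11) and the circular convolution of (12), meant only to make the convolution interpretation concrete; the linear-cost route of (13)-(15) below is what one would use in practice):

import torch

def sobolev_tilde_kernel(O, lam=1.0):
    # Samples of K~(o) = 1 + (o^2 - o + 1/6) / (2 * lam) for o in [0, 1), eq. (11).
    o = torch.arange(O, dtype=torch.float32) / O
    return 1.0 + (o ** 2 - o + 1.0 / 6.0) / (2.0 * lam)

def circular_conv_along_output(f, kernel):
    # Circular convolution of the H^0 gradient f (shape O x I x H x W) with the
    # kernel along the output-channel axis, a discrete analog of eq. (12).
    O = f.shape[0]
    idx = (torch.arange(O).unsqueeze(1) - torch.arange(O).unsqueeze(0)) % O  # (o - o~) mod O
    weights = kernel[idx] / O
    return torch.einsum('ab,b...->a...', weights, f)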

The second version of the Sobolev gradient need not use the convolution formula for computation, as one can just integrate the ODE twice (after noting that the channel-directed averages of both the $H^0$ and Sobolev gradients are the same). This avoids computing the convolution directly, and hence reduces the computational cost from quadratic (or quasi-linear with an FFT) to linear in $O$ given the $H^0$ gradient. The second version of the Sobolev gradient can be computed as

g(o,i,h,w)=g(0,i,h,w)+o\frac{\partial g}{\partial o}(0,i,h,w)-\frac{1}{\lambda}\int_0^o(o-\tilde{o})\left(f(\tilde{o}O,i,h,w)-\bar{f}(i,h,w)\right)\mathrm{d}\tilde{o}    (13)
\frac{\partial g}{\partial o}(0,i,h,w)=-\frac{1}{\lambda}\int_0^1 o\left(f(oO,i,h,w)-\bar{f}(i,h,w)\right)\mathrm{d}o    (14)
g(0,i,h,w)=\int_0^1\tilde{K}(o)f(oO,i,h,w)\,\mathrm{d}o,\quad o\in[0,1],    (15)

where $g=\nabla_{\tilde{H}^1}L(X)$ is the second version of the Sobolev gradient and $f=\nabla_{H^0}L(X)$; these are just three integrals that can be computed with linear complexity with respect to $O$. The gradient flows under these metrics are given by

\dot{X}_t=-\nabla L(X_t),    (16)

where $t$ denotes the artificial time variable, $\dot{X}$ is the time derivative of the parameter tensor, and $\nabla$ denotes the gradient with respect to the desired metric. Under any metric, this reduces the loss.

2.4 Properties of Channel Directed Gradient Flows

We describe some properties of the resulting gradient flows according to the metrics defined in the previous sections.

Coarse-to-Fine Evolution and Removal of Some Local Minima: In Sundaramoorthi et al. (2008), it is shown that gradient flows with respect to Sobolev metrics evolve in a coarse-to-fine fashion, deforming according to coarse-scale perturbations before moving to finer-scale perturbations. This can avoid being trapped in local minima due to fine-scale structures. It is also shown that when we change the metric on the space $\mathcal{X}$, the loss landscape changes: some local minima with respect to $H^0$ may change to other critical points with respect to Sobolev metrics, and numerically some local minima may cease to exist. That is, local minima due to fine-scale structures can be removed by switching to a Sobolev metric. As wide, flat minima tend to generalize well, the removal of local minima due to localized fine structures may encourage convergence to wider, flatter minima and hence better generalization than ordinary SGD.

Regularity of the Weight Tensor: By the convolution formulas above, we can see that the Sobolev gradients are a smoothing of the usual $H^0$ gradient. Noting that the gradient flow (16) integrates the (smooth) gradients over time, the final tensor will also be smooth in the output direction provided the initialization is smooth. In practice, in applications with deep networks, one typically initializes the weight tensor with random noise, so the final tensor may exhibit some randomness, but the final tensor is the sum of a smooth component and the randomness, and so it exhibits regularity, e.g., strong correlation across nearby output-channel components in the weight tensor, as we verify in experiments. Further, experiments (see Section 4) indicate that final results with respect to our metrics are less dependent on initialization than SGD, which suggests that the initial randomness may die out.

3 Application to Stochastic Gradient Descent and Implementation

To apply re-weighted $H^0$ and Sobolev channel-directed gradients to optimizing deep convolutional networks based on stochastic gradient descent or its variants, we discretize the gradient flow (16) according to the forward Euler method. We approximate the standard $H^0$ gradient of the loss, $\nabla_{H^0}L(X)$, using a mini-batch, as is standard in deep learning. We then use this approximation of the $H^0$ gradient to approximate the $\tilde{H}^1$ gradient, $\nabla_{\tilde{H}^1}L(X)$, by discretizing (13)-(15) using a standard Riemann sum. Note that (13) can be computed for each $o$, the output channel index of the tensor, with the cumulative sum (CUMSUM) operation, which is linear in cost, as are (14) and (15). We compute the Sobolev gradient for each convolutional layer parameter tensor independently of the others. We use $\lambda=1$ for the $\tilde{H}^1$ gradient and add it to a scaled version (by a hyper-parameter) of the $H^0$ gradient (as shown in Figure 2) to avoid over-smoothing. The re-weighted $H^0$ gradient is computed by using (9) from the $H^0$ stochastic gradient approximation. Both gradients require only a few additional lines of code; the code for re-weighted $H^0$ is shown in Figure 3. Thus, the channel-directed gradients replace the usual one, and all other additions to standard SGD (e.g., momentum, Adam, etc.) can be used as usual.

import torch

def reweighted_L2_grad(grad, lam):
    # grad: L2 (H^0) gradient of a conv layer (O x I x H x W), e.g. param.grad.data
    # lam > 0 weights the output-channel translation component of the L2 gradient
    grad += lam * torch.mean(grad, 0, True).repeat(grad.size(0), 1, 1, 1)
    return grad
Figure 3: PyTorch code to compute the re-weighted $\mathbb{L}^2$ ($H^0_\lambda$) gradient from the $H^0$ gradient.
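For reference, the following is a sketch (ours, not the authors' released implementation; the function name and the left-endpoint Riemann-sum discretization are assumptions) of how (13)-(15) can be computed with cumulative sums, as described above:

import torch

def sobolev_tilde_grad(f, lam=1.0):
    # Sketch of the H~1 gradient of eqs. (13)-(15) from the H^0 gradient f
    # (shape O x I x H x W), discretized with a left-endpoint Riemann sum.
    O = f.shape[0]
    o = (torch.arange(O, dtype=f.dtype, device=f.device) / O).view(O, 1, 1, 1)
    f_bar = f.mean(dim=0, keepdim=True)
    h = f - f_bar  # deformation component (zero channel-mean)
    # eq. (15): g(0) = int_0^1 K~(o) f(oO) do, with K~ from eq. (11)
    K = 1.0 + (o ** 2 - o + 1.0 / 6.0) / (2.0 * lam)
    g0 = (K * f).mean(dim=0, keepdim=True)
    # eq. (14): dg/do(0) = -(1/lam) int_0^1 o (f(oO) - f_bar) do
    dg0 = -(o * h).mean(dim=0, keepdim=True) / lam
    # eq. (13): int_0^o (o - o~) h(o~) do~ = o * int_0^o h do~ - int_0^o o~ h(o~) do~,
    # both computed with exclusive cumulative sums in linear time
    csum_h = torch.cumsum(h, dim=0) / O
    csum_oh = torch.cumsum(o * h, dim=0) / O
    csum_h = torch.cat([torch.zeros_like(csum_h[:1]), csum_h[:-1]], dim=0)
    csum_oh = torch.cat([torch.zeros_like(csum_oh[:1]), csum_oh[:-1]], dim=0)
    return g0 + o * dg0 - (o * csum_h - csum_oh) / lam

In a training step, one would replace param.grad for each convolutional parameter tensor with the output of such a routine (or of reweighted_L2_grad in Figure 3) before calling optimizer.step(), leaving the rest of the optimizer unchanged.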

4 Experiments

We test our proposed channel-directed metrics with different baseline optimizers and tasks. Our intent is to show that any baseline optimizer and task can be improved just by switching to the gradient with respect to channel-directed metrics in the optimizer. We fix $\lambda=1$ for channel-directed metrics unless specified otherwise. Table 1 shows the settings for each experiment. Experiments are run on a single NVIDIA Titan Xp GPU, except for GANs, which are run on a Tesla V100 GPU due to memory requirements.

Table 1: Experimental settings.
Task Dataset Baseline Network Batch Size Epochs Initial LR
Image Classification Cifar-10 SGD ResNet-56 128,32,8 240 0.1
VGG-16 128,8,6 240 0.01
ADAM ResNet-56 128,32,8 200 1e-3
LS ResNet-56 128,32,8 240 0.1
MNIST SGD Two-layer Conv 100 100 0.01
Semantic Segmentation PascalVOC SGD ResNet50 2 70 7e-3
Image Generation (GAN) CityScapes SGD SPADE 2 100 1e-4,4e-4

Image Classification: We experiment on CIFAR-10 Krizhevsky et al. (2009). We test the combination of our channel-directed metrics with both SGD and ADAM on ResNet-56 He et al. (2016a) and VGG-16 Simonyan & Zisserman (2014), following the settings of Osher et al. (2018). For SGD, we set the initial learning rate to 0.1 and 0.01 on ResNet-56 and VGG-16, respectively, with momentum 0.9 and weight decay 5e-4. For ADAM, we set the initial learning rate to 0.01. We decrease the learning rate by a factor of 10 every 40 epochs, as in Osher et al. (2018). Results presented in this section are the average of at least 10 independent trials.

Table 2: Test accuracy on CIFAR-10. Channel-directed metrics improve over $H^0$ in all cases. In the best case, more than 10% of the error is reduced by using our channel-directed metrics. Results are averaged over 10 trials.
SGD baseline:
Architecture     ResNet-56               VGG-16
Batch size       128    32     8         128    8      6
SGD              93.24  91.96  86.54     93.02  92.31  91.88
+ $\tilde{H}^1$  93.29  92.13  87.99     93.26  92.77  92.25
Error reduced    0.7%   2.1%   10.8%     3.4%   6.0%   4.6%
+ $H^0_\lambda$  93.38  92.10  88.04     93.19  92.79  92.43
Error reduced    2.1%   1.7%   11.1%     2.4%   6.2%   6.8%

ADAM baseline:
Architecture     ResNet-56
Batch size       128    32     8
ADAM             91.20  91.04  89.53
+ $\tilde{H}^1$  91.42  91.13  90.02
Error reduced    2.5%   1.0%   4.7%
+ $H^0_\lambda$  91.20  91.06  89.70
Error reduced    0.0%   0.2%   1.6%

Table 2 shows the test accuracy under different settings. Both channel-directed metrics achieve improvement over $H^0$. A greater advantage over the baseline is achieved when the batch size is small, as the stochastic gradient is noisier and our method imposes regularity. In most cases, the channel-directed gradients perform similarly, but $\tilde{H}^1$ performs significantly better with ADAM.

Figure 4: Evolution of training and test accuracy on CIFAR-10: an example with batch size 8. Our metric significantly improves both training and test accuracy.

In Figure 4, we show an example of training and test accuracy curves (batch size of 8) for the baselines as well as Laplacian Smoothing (LS) Osher et al. (2018), which rasterizes before smoothing. We outperform all of these methods. We also apply LS (without rasterization) to smooth the gradient in our output-channel directed fashion, which improves LS, but we still outperform it. The original implementation of LS only runs smoothing for the first 40 epochs, then uses SGD (for speed). In our experiments, we apply smoothing all the way to convergence to test its effectiveness for the whole duration.

Variance reduction: In Figure 5 (left), we compare the histograms of test accuracy over multiple runs of our method and of SGD. Our method achieves higher average test accuracy with reduced variance.

Direction of smoothing: To investigate the effect of different channel directions of smoothing, we apply our method as well as LS along different channel directions. We compare approaches under two settings: smoothing gradients in all layers, and smoothing gradients in only convolutional layers. (For completeness, we have also tested LS for 40 epochs followed by a switch to SGD (LS+SGD), as done by Osher et al. (2018) for speed; this results in slightly worse performance than running LS for all epochs, verifying that running LS for all epochs, as we do in our experiments, gives the best results for LS.) Figure 5 (right) shows that our output-channel direction is preferred regardless of the smoothing method used. This shows that preferentially treating the output channel for smoothing, as in our approach, is essential to performance. Interestingly, smoothing only convolutional layers in a rasterized order (as in LS) performs worse than SGD, but this is made up for by smoothing the non-convolutional layers when smoothing all layers.

Figure 5: Distribution of results on CIFAR-10. Left: Histogram of test accuracy. Ours achieves higher average with significantly reduced variance. Right: Results from different smoothing directions. Best accuracy obtained from our proposed direction. A: Output-Channel Directed; B: Input-Channel Directed; All: parameters rasterized into a 1-D vector to perform smoothing; Ours: re-weighted $\mathbb{L}^2$.

Regularity of Tensor: We show in Figure 6 that the final weight tensor at convergence with our methods has regularity in the output channel dimension, as should be the case since the tensor contains a component that is smooth. To show this, we plot the correlation between filters in the weight tensors as a function of the distance in the output channel dimension. This is done over multiple tensor layers in ResNet-56 and over multiple trials of optimization on CIFAR-10. We also show the correlation of filters in the input channel direction. As can be seen, all optimization methods produce tensors that exhibit regularity in the output channel direction, while no such regularity appears in the input direction. Notice that our methods increase the amount of regularity compared to SGD, as they impose this in optimization.

Figure 6: Regularity of Tensor. Correlation between weights along different channel directions in CIFAR-trained ResNet-56 conv layers (over 10 trials). $|i-j|$ is the distance between weight locations in the tensor for the correlation computation. Sobolev/re-weighted $H^0$ show strong correlation in the output direction, but not the input direction. SGD also shows correlation in the output direction.
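The correlation statistic of Figure 6 can be computed with a short routine; the following is an assumed sketch (our illustration of the kind of measurement described, not the authors' evaluation code):

import torch

def output_channel_correlation(W, max_dist=8):
    # Average Pearson correlation between flattened filters W[i] and W[j] of a
    # conv tensor (shape O x I x H x W), as a function of the distance |i - j|.
    O = W.shape[0]
    flat = W.reshape(O, -1)
    flat = flat - flat.mean(dim=1, keepdim=True)
    flat = flat / (flat.norm(dim=1, keepdim=True) + 1e-12)
    corr = flat @ flat.t()
    return [torch.diagonal(corr, offset=d).mean().item() for d in range(1, max_dist + 1)]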

Effect of Smoothing Parameter: We perform controlled experiments on MNIST LeCun & Cortes (2010) and Fashion-MNIST Xiao et al. (2017) by varying the smoothness parameter $\lambda$ from 0 to 20. Instead of using the standard partition, we conduct training on the test set (10000 samples) and test on the training set (60000 samples), which makes generalization more challenging. We use a 2-layer CNN with 50 and 100 $5\times 5$ filters in each layer, respectively, and train with batch size 100. Figure 7 shows the accuracy at the 100th epoch (averaged over 5 trials). When $\lambda=0$, the optimizer degenerates to vanilla SGD. Our methods are not sensitive to $\lambda$ and improve over SGD for any $\lambda>0$ in this range.

Figure 7: Results on MNIST and Fashion-MNIST with different choices of the smoothness parameter. Our methods improve classification accuracy over SGD (the $\lambda=0$ case) for a wide range of smoothness levels.

Semantic Segmentation: The experiments on semantic segmentation are conducted on the PascalVOC Everingham et al. (2015) dataset using a standard segmentation network Ronneberger et al. (2015) with ResNet-50 as the encoder (see https://github.com/nyoki-mtl/pytorch-segmentation). We perform training with initial learning rate 7e-3 and batch size 2 (the maximum size that fits in Titan Xp GPU memory), and record the training/testing loss and accuracy for 60 epochs. 3 independent trials are run under each setting. Figure 8 shows the comparison between our methods and SGD. Both channel-directed metrics improve the final segmentation accuracy on the test set by about 8% relative. Also note that our method reduces the generalization gap from 0.163 to 0.151 (by 7.4%) and 0.150 (by 8.0%) for $\tilde{H}^1$ and $H^0_\lambda$, respectively.

Figure 8: Semantic Segmentation Results on PascalVOC. Sobolev $\tilde{H}^1$ and re-weighted $\mathbb{L}^2$ ($H^0_\lambda$) improve segmentation accuracy by 8.5% and 7.8%, respectively, relative to SGD.

Image Generation: To test the performance on GANs, we choose the task of converting semantic labels to images. We perform the experiments on the current state-of-the-art model SPADE Park et al. (2019) (a.k.a. GauGAN), which aims to generate high-quality realistic images from given semantic layouts. Experiments are conducted on CityScapes Cordts et al. (2016), and the FID Heusel et al. (2017) score is used to evaluate quality (lower is better). Learning rates are 1e-4 and 4e-4 for the generator and discriminator, respectively. We compare to SGD with momentum 0.9 and weight decay 5e-4. All models are trained for 100 epochs with batch size 2 (to fit in Tesla V100 memory), and 6 independent trials are run for each optimizer. Figure 9 provides FID curves and error bars. Our methods achieve a better average FID score with significantly less variance. Note that 2 out of 6 models trained by SGD suffered from collapse, which led to high variance, while all twelve trials of our methods achieved good final results.

Figure 9: Results on the image generation task. Our methods achieve better results with significantly reduced variance due to the regularity imposed during the training process. Final FID: SGD: 61.37 ± 12.00; Channel-Directed Sobolev ($\tilde{H}^1$): 56.31 ± 3.12; Channel-Directed Re-Weighted $\mathbb{L}^2$ ($H^0_\lambda$): 57.62 ± 4.02. Lower is better.

Speed: With PyTorch, re-weighted $\mathbb{L}^2$ ($H^0_\lambda$) adds negligible overhead. In our current PyTorch implementation, $\tilde{H}^1$ adds on average 45 ms of overhead per mini-batch with batch size 128, which increases training time on CIFAR-10 by 50%. This is because tensor transposes and saving/loading are currently required due to limited library functions, and these contribute a large portion of the computational overhead. In principle, since computing the $\tilde{H}^1$ gradient has linear time complexity, if the computation were done, for instance, in C++, it would, like re-weighted $\mathbb{L}^2$, add negligible overhead over SGD/Adam.

5 Conclusion

Using gradients that are regular in the output-channel dimension of CNN parameter tensors in SGD is effective in improving the generalization accuracy of SGD and its variants. We reformulated the gradient (without changing the loss) by changing the underlying Riemannian geometry on the tensor space using two different metrics. Both the channel-directed re-weighted $H^0$ and $\tilde{H}^1$ gave generalization boosts. Regularity in other tensor dimensions was not effective in improving SGD or its variants. Both channel-directed gradients have similar computational complexity, and the re-weighted $H^0$ adds negligible training time in its PyTorch implementation, which is one line of PyTorch code.

References

  • Abraham et al. (2012) Ralph Abraham, Jerrold E Marsden, and Tudor Ratiu. Manifolds, tensor analysis, and applications, volume 75. Springer Science & Business Media, 2012.
  • Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
  • Bengio (2015) Yoshua Bengio. Rmsprop and equilibrated adaptive learning rates for nonconvex optimization. CoRR abs/1502.04390, 2015.
  • Bottou (2012) Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pp. 421–436. Springer, 2012.
  • Bottou et al. (2018) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
  • Carmo (1992) Manfredo Perdigao do Carmo. Riemannian geometry. Birkhäuser, 1992.
  • Charpiat et al. (2007) Guillaume Charpiat, Pierre Maurel, J-P Pons, Renaud Keriven, and Olivier Faugeras. Generalized gradients: Priors on minimization flows. International Journal of Computer Vision, 73(3):325–344, 2007.
  • Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • Everingham et al. (2015) M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, January 2015.
  • Gunasekar et al. (2018) Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pp. 9461–9471, 2018.
  • Gunasekar et al. (2020) Suriya Gunasekar, Blake Woodworth, and Nathan Srebro. Mirrorless mirror descent: A more natural discretization of Riemannian gradient flow. arXiv preprint arXiv:2004.01025, 2020.
  • He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016a. doi: 10.1109/cvpr.2016.90. URL http://dx.doi.org/10.1109/cvpr.2016.90.
  • He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016b.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2017.
  • Hoffman et al. (2013) Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • Johnson & Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • LeCun & Cortes (2010) Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2017.
  • Luo et al. (2019) Liangchen Luo, Yuanhao Xiong, and Yan Liu. Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg3g2R9FX.
  • Marceau-Caron & Ollivier (2016) Gaétan Marceau-Caron and Yann Ollivier. Practical Riemannian neural networks. arXiv preprint arXiv:1602.08007, 2016.
  • Osher et al. (2018) Stanley Osher, Bao Wang, Penghang Yin, Xiyang Luo, Farzin Barekat, Minh Pham, and Alex Lin. Laplacian smoothing gradient descent. arXiv preprint arXiv:1806.06317, 2018.
  • Park et al. (2019) Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, 2015.
  • Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Sundaramoorthi et al. (2007) Ganesh Sundaramoorthi, Anthony Yezzi, and Andrea C Mennucci. Sobolev active contours. International Journal of Computer Vision, 73(3):345–366, 2007.
  • Sundaramoorthi et al. (2008) Ganesh Sundaramoorthi, Anthony Yezzi, and Andrea Mennucci. Coarse-to-fine segmentation and tracking using Sobolev active contours. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):851–864, 2008.
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
  • Zeiler (2012) Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Appendix A Additional Analysis of Evolution of Channel-Directed Optimization

Figure 10 and Figure 11 present the evolution of training and test accuracy of ADAM and SGD with different batch sizes. Using channel-directed gradients ($\tilde{H}^1$ in this experiment) for SGD or ADAM improves test accuracy for any batch size. More prominent performance gains are seen for smaller batch sizes.

Figure 10: Training and test accuracy on CIFAR-10 with ADAM.
Figure 11: Training and test accuracy on CIFAR-10 with SGD.

Appendix B Additional Experimental Verification of Output-Channel Direction

To investigate the effect of different channel directions of smoothing, we apply our method as well as LS along different channel directions. Figure 12 shows that our output-channel direction is preferred regardless of the smoothing approach used.

Figure 12: Channel-Directed Smoothing Leads to Better Performance. Best accuracy obtained from our proposed direction. A: Output-Channel Directed; B: Input-Channel Directed; All: parameters rasterized into a 1-D vector to perform smoothing; Ours: re-weighted $\mathbb{L}^2$.

Appendix C Detailed Derivations for Section 2.2

We first derive the re-weighted $\mathbb{L}^2$ gradient under the $H^0_\lambda$ metric, following the same notation as the paper. Consider $f\triangleq\nabla_{H^0}L(X)$, the standard $\mathbb{L}^2$ gradient; we want to solve for $g\triangleq\nabla_{H^0_\lambda}L(X)$. By (4) and (8) we have

\left\langle f,k\right\rangle_{H^0}=\left<g,k\right>_{H^0_\lambda}    (17)
=\left\langle\bar{g},\bar{k}\right\rangle_{H^0}+\lambda\left\langle g-\bar{g},k-\bar{k}\right\rangle_{H^0}.    (18)

After decomposing $f$ and $k$ into

f=\bar{f}+(f-\bar{f}),\quad k=\bar{k}+(k-\bar{k})    (19)

and noting the fact that $\left\langle\bar{k},k-\bar{k}\right\rangle_{H^0}=0$ holds for all $k$, we have

\bar{f}=\bar{g},\quad f-\bar{f}=\lambda(g-\bar{g}),    (20)

which leads to the result of (9).

We then derive the Sobolev gradient under the $H^1$ metric, following computations similar to those in Sundaramoorthi et al. (2007). Consider $\nabla_{H^1}L(X)$, the Sobolev gradient under the $H^1$ metric. By (6) and (8) we have

\left\langle\nabla_{H^0}L(X),k\right\rangle_{H^0}=\left\langle\nabla_{H^1}L(X),k\right\rangle_{H^1}    (21)
=\frac{1}{O}\left\langle k,\nabla_{H^1}L(X)\right\rangle_{H^0}+\lambda O\left\langle\frac{\partial k}{\partial o},\frac{\partial\nabla_{H^1}L(X)}{\partial o}\right\rangle_{H^0}.    (22)

Integrating by parts and considering the periodic boundary conditions, we have

\left\langle\nabla_{H^0}L(X),k\right\rangle_{H^0}=\left\langle\nabla_{H^1}L(X)-\lambda O^2\frac{\partial^2}{\partial o^2}\nabla_{H^1}L(X),k\right\rangle_{H^0}.    (23)

Since $k$ can be any perturbation, by uniqueness, we have

\nabla_{H^0}L(X)=\nabla_{H^1}L(X)-\lambda O^2\frac{\partial^2}{\partial o^2}\nabla_{H^1}L(X),    (24)

which is (10). Similarly, for the $\tilde{H}^1$ metric, we have

\nabla_{H^0}L(X)=\overline{\nabla_{\tilde{H}^1}L(X)}-\lambda O^2\frac{\partial^2}{\partial o^2}\nabla_{\tilde{H}^1}L(X).    (25)

First observe that, by computing the output-channel directed average of both sides of the above equation, we see that $\overline{\nabla_{\tilde{H}^1}L(X)}=\overline{\nabla_{H^0}L(X)}$, i.e., the average values are the same. One may integrate (25) twice to solve for the $\tilde{H}^1$ gradient. For simplicity, let $f$ be the $\mathbb{L}^2$ gradient and $g$ be the $\tilde{H}^1$ gradient. Integrating twice yields

g(o,i,h,w)=g(0,i,h,w)+\int_0^o\frac{\partial g}{\partial o}(0,i,h,w)\,\mathrm{d}\tilde{o}-\frac{1}{\lambda}\int_0^o\int_0^{\hat{o}}(f(\tilde{o}O,i,h,w)-\bar{f}(i,h,w))\,\mathrm{d}\tilde{o}\,\mathrm{d}\hat{o}    (26)
=g(0,i,h,w)+\int_0^o\frac{\partial g}{\partial o}(0,i,h,w)\,\mathrm{d}\tilde{o}-\frac{1}{\lambda}\int_0^o\int_{\tilde{o}}^o(f(\tilde{o}O,i,h,w)-\bar{f}(i,h,w))\,\mathrm{d}\hat{o}\,\mathrm{d}\tilde{o}    (27)
=g(0,i,h,w)+o\frac{\partial g}{\partial o}(0,i,h,w)-\frac{1}{\lambda}\int_0^o(o-\tilde{o})(f(\tilde{o}O,i,h,w)-\bar{f}(i,h,w))\,\mathrm{d}\tilde{o}.    (28)

Note that here we normalize by rescaling the channel direction so that $o\in[0,1]$. With boundary conditions $g(0)=g(1)$, $\frac{\partial g}{\partial o}(0)=\frac{\partial g}{\partial o}(1)$ and $\bar{f}=\bar{g}$, we have

\frac{\partial g}{\partial o}(0,i,h,w)=-\frac{1}{\lambda}\int_0^1 o(f(oO,i,h,w)-\bar{f}(i,h,w))\,\mathrm{d}o.    (29)

For simplicity, we omit $i,h,w$ and $O$ in the following derivations. We have

g(0)=g(o)-o\frac{\partial g}{\partial o}(0)+\frac{1}{\lambda}\int_0^o(o-\tilde{o})(f(\tilde{o})-\bar{f})\,\mathrm{d}\tilde{o}    (30)
=g(o)+\frac{o}{\lambda}\int_0^1 o(f(o)-\bar{f})\,\mathrm{d}o+\frac{1}{\lambda}\int_0^o(o-\tilde{o})(f(\tilde{o})-\bar{f})\,\mathrm{d}\tilde{o}.    (31)

Noting that $\int_0^1 g(0)\,\mathrm{d}o=g(0)$ and $\bar{f}=\int_0^1 f(o)\,\mathrm{d}o$, we integrate both sides over the entire interval $[0,1]$:

g(0)=\bar{g}+\frac{1}{\lambda}\int_0^1 o\,\mathrm{d}o\cdot\int_0^1 o(f(o)-\bar{f})\,\mathrm{d}o+\frac{1}{\lambda}\int_0^1\int_0^o(o-\tilde{o})(f(\tilde{o})-\bar{f})\,\mathrm{d}\tilde{o}\,\mathrm{d}o    (32)
=\bar{f}+\frac{1}{2\lambda}\int_0^1 of(o)\,\mathrm{d}o-\frac{1}{4\lambda}\bar{f}+\frac{1}{\lambda}\left(\int_0^1\int_0^o(o-\tilde{o})f(\tilde{o})\,\mathrm{d}\tilde{o}\,\mathrm{d}o-\bar{f}\int_0^1\int_0^o(o-\tilde{o})\,\mathrm{d}\tilde{o}\,\mathrm{d}o\right)    (33)
=\left(1-\frac{1}{4\lambda}-\frac{1}{6\lambda}\right)\bar{f}+\frac{1}{2\lambda}\int_0^1 of(o)\,\mathrm{d}o+\frac{1}{\lambda}\int_0^1\int_{\tilde{o}}^1(o-\tilde{o})f(\tilde{o})\,\mathrm{d}o\,\mathrm{d}\tilde{o}    (34)
=\left(1-\frac{5}{12\lambda}\right)\int_0^1 f(o)\,\mathrm{d}o+\frac{1}{2\lambda}\int_0^1 of(o)\,\mathrm{d}o+\frac{1}{\lambda}\int_0^1\left(\frac{1}{2}+\frac{\tilde{o}^2}{2}-\tilde{o}\right)f(\tilde{o})\,\mathrm{d}\tilde{o}    (35)
=\int_0^1\left(1+\frac{o^2-o+1/6}{2\lambda}\right)f(o)\,\mathrm{d}o.    (36)

This gives (15) in the main paper.