Channel-Directed Gradients for Optimization of Convolutional Neural Networks
Abstract
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error. The method requires only simple processing of existing stochastic gradients, can be used in conjunction with any optimizer, and has only a linear overhead (in the number of parameters) compared to computation of the stochastic gradient. The method works by computing the gradient of the loss function with respect to output-channel directed re-weighted or Sobolev metrics, which has the effect of smoothing components of the gradient across a certain direction of the parameter tensor. We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental. We present the continuum theory of such gradients, its discretization, and application to deep networks. Experiments on benchmark datasets, several networks and baseline optimizers show that optimizers can be improved in generalization error by simply computing the stochastic gradient with respect to output-channel directed metrics.
1 Introduction
Stochastic gradient descent (SGD) is currently the dominant algorithm for optimizing large-scale convolutional neural networks (CNNs) LeCun et al. (1998); Simonyan & Zisserman (2014); He et al. (2016b). Although there has been considerable activity in optimization methods seeking to improve performance, SGD still dominates in large-scale CNN optimization in terms of its generalization ability. Despite SGD's dominance, there is still often a gap between training and real-world test accuracy in applications, which motivates research in optimization methods that increase generalization accuracy.
In this paper, we observe that there is regularity in the parameter tensors of learned CNN models, and we exploit this regularity implicitly in optimization to derive new optimization methods that are simple modifications of traditional SGD and improve generalization. In particular, we empirically observe that parameter tensors in trained networks typically exhibit correlation over the output channel dimension (see Figure 1). We thus explore encoding this correlation through smoothness in optimization, which we show improves generalization accuracy, since learning without imposing regularity may not fully capture it. We encode smoothness implicitly in stochastic gradient descent by considering new metrics on the parameter space of the network and reformulating the notion of gradient. To do this, we treat the space of parameter tensors as a Riemannian manifold and derive gradients of the loss with respect to new metrics that promote regularity in the output channel dimension of the tensors by changing the geometry of the underlying space of tensors.
Our contributions are as follows. First, we formulate output channel-directed Riemannian metrics (a re-weighted version of the standard metric and a Sobolev metric) over the space of parameter tensors. This encodes channel-directed regularity inherently in the gradient optimization without changing the loss. Second, we compute Riemannian gradients with respect to these metrics, showing that they add only linear complexity (in the number of parameters) over standard gradient computation, and thus derive new optimization methods for CNN training. Finally, we apply the methodology to training CNNs and show the empirical advantage in generalization accuracy, especially with small batch sizes, over standard optimizers (SGD, Adam) on numerous applications (image classification, semantic segmentation, generative adversarial networks) with simple modifications of existing optimizers.
1.1 Related Work

We briefly survey related work in deep network optimization; a detailed survey is Bottou et al. (2018). Stochastic gradient descent (SGD), e.g., Bottou (2012), samples a batch of data to tractably estimate the gradient of the loss function. As the stochastic gradient is a noisy version of the gradient, learning rates must follow a decay schedule in order to converge. Many methods have been formulated to choose the learning rate over epochs and components of the gradient, including recent work on adaptive learning rates (e.g., Duchi et al. (2011); Zeiler (2012); Kingma & Ba (2014); Bengio (2015); Loshchilov & Hutter (2017); Luo et al. (2019)). For instance, Adam Kingma & Ba (2014) adaptively adjusts the learning rate so that parameters that have changed infrequently based on historical gradients are updated more quickly than parameters that have changed frequently. Another way to interpret such methods is that they change the underlying metric on the space on which the loss function is defined by a simple diagonal rescaling; we instead change the metric anisotropically, coupling parameters along the output channel direction. We show that our method can be used in conjunction with such methods by simply using the stochastic gradient computed with our metrics to boost performance.
As the stochastic gradient is computed based on sampling, different runs of the algorithm can result in different local optima. To reduce the variance, several methods have been formulated, e.g., Defazio et al. (2014); Johnson & Zhang (2013). We are not motivated by variance reduction; rather, we are motivated by inducing regularity in optimization to improve generalization. However, as our method effectively smooths the gradient, our empirical experiments do indicate reduced variance with our metrics compared to SGD.
Another method motivated by variance reduction is the recent work of Osher et al. (2018), where the stochastic gradient is pre-multiplied with an inverse Laplacian smoothing matrix. For CNNs, the gradient with respect to the parameters is rasterized in row or column order of the network filters before smoothing, which still lowers variance. Our work is inspired by Osher et al. (2018), though we are motivated by incorporating structured regularity of the parameter tensor inherently in the optimization. Osher et al. (2018) can be interpreted as using the gradient of the loss with respect to a Sobolev metric. Our major insight over Osher et al. (2018) is that keeping the multi-dimensional structure of the parameter tensor (rather than rasterizing) and preferentially defining the Sobolev metric with respect to the output-channel direction boosts generalization accuracy, while other directions appear to provide no boost. Second, we introduce a re-weighted metric that preferentially treats the output-channel direction, can be implemented with one line of PyTorch code, has linear (in parameter size) complexity, and achieves results similar (in many cases) to our channel-directed Sobolev metric, boosting the generalization of SGD and Adam. Third, our channel-directed Sobolev gradient can be implemented at linear rather than quasi-linear cost (not requiring an FFT to compute). Sobolev gradients have been used in computer vision Sundaramoorthi et al. (2007); Charpiat et al. (2007) for their coarse-to-fine evolution properties Sundaramoorthi et al. (2008), and we adapt that formulation to channel-directed metrics for CNNs.
We formulate Sobolev gradients by considering the space of parameter tensors as a Riemannian manifold and choosing the metric (i.e., inner product) on the tangent space to be a Sobolev metric. By choosing a metric, gradients intrinsic to the manifold can be computed, and gradient flows decrease the loss. Other Riemannian metrics have been used for optimization in machine learning, e.g., Amari (1998); Marceau-Caron & Ollivier (2016); Hoffman et al. (2013); Gunasekar et al. (2018; 2020), and relate tangentially to our work. These works are based on Amari's (1998) information geometry on probability measures, and the metric considered is the Fisher information metric. The motivation for these methods is re-parametrization invariance of optimization, whereas our motivation is imposing regularity directly in the parameter space. Most of these methods relate to density estimation, as the metric is on probability measures. Gunasekar et al. (2018) note that even vanilla gradient descent has a certain implicit bias relating to the underlying metric on the space. In Gunasekar et al. (2020) the Hessian metric (in the convex case) is analyzed and related to mirror descent. These metrics are data-dependent and the gradient is challenging to compute, requiring the inversion of a large matrix. Moreover, they do not exploit the channel dimension regularity, the main purpose of our work.
2 Channel-Directed Gradients
We now present the theory to define channel-directed gradients. To do this, we formulate new metrics on the space of tensors, and then derive analytic formulas for channel-directed gradients in terms of the standard gradient. As we show, our channel-directed gradients effectively smooth the components of the gradient across a certain direction of the parameter tensors of the CNN. Another interpretation is that we are changing the geometric structure of the loss landscape (without changing the loss) to a smoother one by changing the underlying metric of the space on which the loss is defined.
Our metrics are motivated by the empirical observation that a certain dimension of parameter tensors in trained deep networks exhibits regularity (see Figure 1), and thus our method exploits this regularity implicitly in optimization. If we visualize the parameter tensor along its input and output channel dimensions, we see mostly what appears to be random noise; in addition, however, there are regular (correlated) patterns along the output channel direction, implying that each output channel of a layer uses similar (regularly varying) weightings of input channels. Our metrics thus favor gradient updates during optimization that exhibit this correlation, which we show in experiments leads to optimization that generalizes better.
2.1 Background on Riemannian Gradients
We first briefly present the definition of the gradient on a Riemannian manifold, and show the explicit dependence of the gradient on the chosen metric on the manifold. More detailed theory can be found in Carmo (1992); Abraham et al. (2012). We note that a manifold $M$ is a space that is locally linear around each point $w$ in the space, and the linear space at each point is called the tangent space, denoted $T_wM$. A Riemannian manifold, in addition, has a smoothly varying positive definite bilinear form (called the metric) $\langle \cdot, \cdot \rangle_w$ on the tangent space. This metric allows one to define the notion of lengths of curves on the space, in addition to many other operations, including gradients of functions defined on the space.
Definition 1 (Gradient of a Function)
Let $(M, \langle\cdot,\cdot\rangle)$ be a Riemannian manifold, and $E : M \to \mathbb{R}$ be a function. The directional derivative of $E$ at $w \in M$ along a direction $h \in T_wM$ is defined as $\mathrm{d}E(w)\cdot h = \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon} E(w_\epsilon)\right|_{\epsilon=0}$, where $w_\epsilon$ is a path in $M$ with $w_0 = w$ and $\frac{\mathrm{d}}{\mathrm{d}\epsilon} w_\epsilon|_{\epsilon=0} = h$. The gradient of $E$ at $w$ is the vector, $\nabla E(w) \in T_wM$, that satisfies the relation

$$\mathrm{d}E(w)\cdot h = \langle \nabla E(w),\, h \rangle_w \quad \text{for all } h \in T_wM. \tag{1}$$
From the definition, we note that "the" gradient will depend on the choice of the metric on the manifold. We note that any such gradient will decrease the function by moving infinitesimally in the tangent space in the direction of the negative gradient, since $\mathrm{d}E(w)\cdot(-\nabla E(w)) = -\|\nabla E(w)\|_w^2 < 0$ when $\nabla E(w) \neq 0$, where $\|\cdot\|_w$ is the norm induced by the metric. The gradient flow, defined by the differential equation $\dot w_t = -\nabla E(w_t)$, will converge to a local minimum. In our application of this theory to CNN optimization, $E$ will be the loss function, and $M$ will be the space of parameter tensors. In this case, as the tensor is multi-dimensional, the gradient flow will be a partial differential equation.
A consequence of this definition is that the gradient is the direction (up to a scale factor) in the tangent space that optimizes the following problem:
$$\nabla E(w) \,\propto\, \operatorname*{arg\,max}_{h \in T_wM,\, h \neq 0} \frac{\mathrm{d}E(w)\cdot h}{\|h\|_w}. \tag{2}$$
Thus, the gradient can be regarded as the most efficient direction, as it maximizes the ratio of the change in energy obtained by perturbing in a direction $h$ over the cost (defined by the metric) of $h$. Thus, by constructing the metric to have small cost for perturbations (directions) that we prefer for gradients, the gradient flow will move in these preferential directions while minimizing the function, and thus land in more favorable local minima. In particular, we construct metrics that favor gradients with output-channel-direction regularity, and this does induce regularity in the final tensor.
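To make the dependence on the metric concrete, the following finite-dimensional illustration (our own notation, not from the paper) spells out how changing the metric changes the gradient:

```latex
% Finite-dimensional illustration (our notation): on R^n with the metric
% <h, k>_M = h^T M k, for M symmetric positive definite, the defining relation
% dE(w).h = <grad_M E(w), h>_M for all h reads
\[
  \nabla E(w)^\top h \;=\; \mathrm{grad}_M E(w)^\top M\, h
  \quad\text{for all } h
  \quad\Longrightarrow\quad
  \mathrm{grad}_M E(w) \;=\; M^{-1}\,\nabla E(w),
\]
% where \nabla E is the ordinary Euclidean gradient. Steepest descent under this
% metric is w <- w - eta * M^{-1} \nabla E(w): directions that M makes expensive
% are suppressed in the gradient, while the loss E itself is unchanged.
```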
2.2 Channel-Directed Metrics
As gradients of a function depend on the metric structure of the underlying space, we re-define the metric on the underlying space so that tensors that differ smoothly along the output channel direction have small distance. In existing gradient-based optimization schemes for deep networks, the underlying metric on the domain of the loss function is assumed to be the standard Euclidean metric. We will consider a re-weighted version of this metric and Sobolev metrics that favor regularity in the output channel direction of the parameter tensors. To formulate the methodology, we start from a continuum formulation, where we treat weight tensors in the continuum and formulate the metrics in the continuum, and then in the next sub-section derive the gradients with respect to these metrics. Finally, we discretize the gradient flows in the implementation to derive iterative schemes.
Let $w = w(o, i, x)$ denote a parameter tensor of a deep network (from a layer of a convolutional network). Here $o$ denotes the index to the output channel dimension of the tensor, $i$ denotes the index to the input channel, and $x = (x_1, x_2)$ denotes the indices to the height and width dimensions of the spatial filters of the tensor. In the continuum, we treat $o$ as ranging over $[0, N_o]$, where $N_o$ is the number of output channels. The metric is defined on the tangent space to the space of such $w$. An element of the tangent space will have the same form as the tensor, i.e., $h = h(o, i, x)$. The standard metric (called $L^2$ from now on) is defined as
$$\langle h, k \rangle_{L^2} = \int h(o, i, x)\, k(o, i, x)\,\mathrm{d}o\,\mathrm{d}i\,\mathrm{d}x, \tag{3}$$
where $h, k$ are in the tangent space of tensors. We now define a re-weighted version of $L^2$ that favors tangent vectors that have global smoothness in the direction of the output channel ($o$) dimension:
$$\langle h, k \rangle_{RL^2} = \langle \bar h, \bar k \rangle_{L^2} + (1 + \lambda)\, \langle h - \bar h,\; k - \bar k \rangle_{L^2}, \tag{4}$$
where $\lambda \geq 0$ is a hyper-parameter, and $\bar h$ is the average value in the output channel direction, i.e.,
$$\bar h(i, x) = \frac{1}{N_o} \int_0^{N_o} h(o, i, x)\,\mathrm{d}o. \tag{5}$$
The metric in (4) splits the tangent vector into a global translation in the output channel direction, $\bar h$, and its orthogonal complement, $h - \bar h$, i.e., the deformation. The weight $\lambda$ controls the weighting between the translation and deformation components: larger values of $\lambda$ mean that deformations more heavily influence the norm of the perturbation. As shown in the next sub-section, this means that gradients with respect to this metric weight channel-directed translations more heavily than deformations.
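As a concrete discretization (a minimal sketch with our own tensor names; the paper's code is in Figure 3), the channel-directed average (5) and the translation/deformation split underlying (4) can be computed for a convolutional weight of shape (out_channels, in_channels, H, W) as follows:

```python
import torch

# Minimal sketch (ours) of the discretized channel-directed average (5) and the
# translation/deformation split used by the re-weighted RL^2 metric (4).
h = torch.randn(64, 32, 3, 3)              # a perturbation with 64 output channels
lam = 1.0                                  # hyper-parameter lambda (assumed value)

h_bar = h.mean(dim=0, keepdim=True)        # channel-directed average, shape (1, 32, 3, 3)
translation = h_bar.expand_as(h)           # component constant along output channels
deformation = h - translation              # orthogonal complement (zero channel average)

# discretized RL^2 squared norm of h, assuming the weighting in (4)
rl2_sq_norm = (translation ** 2).sum() + (1 + lam) * (deformation ** 2).sum()
```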
Next, we introduce channel-directed versions of the Sobolev metric, defined as follows:
$$\langle h, k \rangle_{H^1} = \langle h, k \rangle_{L^2} + \lambda\, N_o^2\, \langle \partial_o h,\; \partial_o k \rangle_{L^2}, \tag{6}$$
$$\langle h, k \rangle_{\widetilde H^1} = \langle \bar h, \bar k \rangle_{L^2} + \lambda\, N_o^2\, \langle \partial_o h,\; \partial_o k \rangle_{L^2}, \tag{7}$$
where $\partial_o$ indicates the partial derivative with respect to the output channel direction. The partial derivative in the $o$-direction implies that tensor perturbations that are smooth along the $o$-direction are close with respect to these metrics, which will imply that the corresponding gradients exhibit smoothness in this direction, i.e., convolution filters that are nearby in the output direction will exhibit correlation. The first metric, $H^1$, is analogous to the usual Sobolev metric, being a weighted combination of the $L^2$ metric and the $L^2$ metric of the derivative, except that it only considers the derivative with respect to one direction. The second metric, $\widetilde H^1$, is similar to the first except that we use the $L^2$ metric of the channel-directed average $\bar h$ rather than that of the perturbation itself. As we will see, both have similar properties, but the latter is computationally less costly. The scale factors involving $N_o$ in the expressions above are chosen so that the metrics are scale invariant with respect to different sizes of output channels. The part of the metric with the partial derivative component implies that tensors that differ in the output channel direction by a non-smooth perturbation are far away in distance. In the latter metric, tensors that differ by just a channel-directed translation are close. Compared with the re-weighted metric (4), the latter Sobolev metric promotes smooth perturbations beyond global translations.
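A possible discretization of (6)-(7) (our own sketch, up to the normalization constants discussed above) replaces $\partial_o$ with a periodic finite difference along the output-channel axis:

```python
import torch

# Sketch (ours) of discretized channel-directed Sobolev inner products, up to
# the normalization constants discussed in the text.
def d_out(h):
    # periodic finite difference along the output-channel axis (dim 0)
    return torch.roll(h, shifts=-1, dims=0) - h

def h1_inner(h, k, lam=1.0):
    # (6): L2 term plus weighted L2 term of the channel-directed derivative
    return (h * k).sum() + lam * (d_out(h) * d_out(k)).sum()

def h1_avg_inner(h, k, lam=1.0):
    # (7): L2 term of the channel-directed averages plus the same derivative term
    h_bar, k_bar = h.mean(dim=0), k.mean(dim=0)
    return (h_bar * k_bar).sum() + lam * (d_out(h) * d_out(k)).sum()
```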
We have presented only channel-directed metrics that preferentially treat the output channel dimension of the tensor as our empirical experiments demonstrate that promoting regularity in other directions is detrimental to optimization performance.
2.3 Computing Channel-Directed Gradients

We now compute the gradients with respect to the metrics defined in the previous sub-section in terms of the $L^2$ gradient, so that simple processing of the existing gradient can be done with no other modification of existing optimization code. To compute the relation between the channel-directed gradients and the usual $L^2$ gradient, we note the relation between the directional derivative of a loss function $L$, the gradient, and the metric:
$$\mathrm{d}L(w)\cdot h = \langle \nabla L(w),\, h \rangle, \tag{8}$$
where $\mathrm{d}L(w)\cdot h$ is the directional derivative in the direction of the perturbation $h$. Note that the expression above indicates that the directional derivative is equal to the inner product (metric) between the gradient with respect to that metric and the perturbation. This holds for any metric. With this relation, we may compute the channel-directed gradients in terms of the $L^2$ gradient. Derivations are in the supplementary material (Appendix C). Letting $\nabla_{L^2} L$ denote the $L^2$ gradient, we have
$$\nabla_{RL^2} L = \overline{\nabla_{L^2} L} + \frac{1}{1+\lambda}\left( \nabla_{L^2} L - \overline{\nabla_{L^2} L} \right), \tag{9}$$
$$\nabla_{L^2} L = \nabla_{H^1} L - \lambda N_o^2\, \partial_o^2\, \nabla_{H^1} L, \qquad \nabla_{L^2} L = \overline{\nabla_{\widetilde H^1} L} - \lambda N_o^2\, \partial_o^2\, \nabla_{\widetilde H^1} L, \tag{10}$$
where the last two relations (for the Sobolev metrics) are second order ordinary differential equations (ODEs) in $o$, whose solutions we discuss next. Notice that the re-weighted gradient (9) simply re-weights the channel-directed translation component and the deformation component of the $L^2$ gradient differently, i.e., as $\lambda$ gets larger, the channel-directed translation becomes more prominent.
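For reference, a minimal PyTorch sketch of (9) applied to a parameter's stochastic gradient (the names and the in-place usage below are ours; the paper's own one-line implementation is shown in Figure 3):

```python
import torch

def reweighted_grad(grad, lam=1.0):
    """Channel-directed re-weighted (RL^2) gradient, a sketch of (9): keep the
    output-channel average, shrink the deformation by 1/(1 + lam)."""
    g_bar = grad.mean(dim=0, keepdim=True)       # average over output channels
    return g_bar + (grad - g_bar) / (1.0 + lam)

# usage sketch: process gradients in place just before optimizer.step()
# for p in model.parameters():
#     if p.grad is not None and p.grad.dim() == 4:    # conv weight tensors
#         p.grad.copy_(reweighted_grad(p.grad, lam=1.0))
```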
In obtaining the ODE expressions for the Sobolev gradients above, we have assumed periodic boundary conditions in the $o$ dimension (since the ordering of filters along the channel direction in a CNN has no particular significance, choosing periodic or non-periodic boundary conditions is arbitrary; periodic conditions induce smoothness between the starting and ending filters in the $o$-dimension and enable a simpler computational solution). In this case, the Sobolev gradients can be interpreted as the circular convolution of the $L^2$ gradient with the convolution kernels given by
$$K_{H^1}(o) = \frac{\cosh\!\big((o - \tfrac{1}{2})/\sqrt{\lambda}\big)}{2\sqrt{\lambda}\,\sinh\!\big(\tfrac{1}{2\sqrt{\lambda}}\big)}, \qquad K_{\widetilde H^1}(o) = 1 + \frac{1}{2\lambda}\left( o^2 - o + \tfrac{1}{6} \right), \qquad o \in [0, 1], \tag{11}$$
for the $H^1$ and $\widetilde H^1$ metrics, respectively, where $o$ above is scaled by $N_o$ to lie between 0 and 1, and the circular convolution is given by
$$\left( f \circledast K \right)(o) = \int_0^1 f(\tilde o)\, K\big( (o - \tilde o) \bmod 1 \big)\,\mathrm{d}\tilde o. \tag{12}$$
Note that the re-weighted $RL^2$ gradient also has an interpretation as a convolution with a smoothing kernel. Figure 2 shows plots of the various kernels for the value of $\lambda$ chosen in experiments. For each $o$, the Sobolev or re-weighted gradient is a local average of the $L^2$ gradient whose weights die off far away from $o$. Thus, we can see that the effect of the metrics is to induce smoothness of the gradient along the output channel direction.
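To illustrate the convolution interpretation (12), a circular convolution along the output-channel axis can be realized with an FFT as below (a sanity-check sketch only; the kernel K is whatever discretization of (11) one chooses, and the paper's own implementation of the second Sobolev gradient avoids the FFT altogether, as described next):

```python
import torch

def smooth_grad_circular(grad, K):
    """Circular convolution of grad with a 1-D kernel K along the output-channel
    axis (dim 0): a discrete version of (12). K has length grad.shape[0]; if K
    sums to 1, the output-channel average of the gradient is preserved."""
    n = grad.shape[0]
    Gf = torch.fft.rfft(grad, dim=0)                   # spectrum along output channels
    Kf = torch.fft.rfft(K.to(grad.dtype), n=n)         # kernel spectrum, length n//2 + 1
    Kf = Kf.reshape(-1, *([1] * (grad.dim() - 1)))     # broadcast over the remaining dims
    return torch.fft.irfft(Gf * Kf, n=n, dim=0)
```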
The second version of the Sobolev gradient need not use the convolution formula for computation, as one can simply integrate the ODE twice (after noting that the channel-directed averages of the $L^2$ and Sobolev gradients are the same). This saves one from having to compute the convolution directly, and hence reduces the computational cost from quadratic (or quasi-linear with an FFT) to linear in $N_o$, given the $L^2$ gradient. The second version of the Sobolev gradient can be computed as
$$F(o) = \frac{1}{\lambda} \int_0^{o} \big( \bar f - f(\tilde o) \big)\,\mathrm{d}\tilde o, \tag{13}$$
$$G(o) = \int_0^{o} F(\tilde o)\,\mathrm{d}\tilde o, \tag{14}$$
$$g(o) = \bar f + G(o) - o\, G(1) + \frac{G(1)}{2} - \int_0^1 G(\tilde o)\,\mathrm{d}\tilde o, \tag{15}$$
where $g = \nabla_{\widetilde H^1} L$ is the second version of the Sobolev gradient, $f = \nabla_{L^2} L$, and $o$ is scaled to lie in $[0,1]$; these are just three integrals that can be computed in linear complexity with respect to $N_o$. The gradient flows under these metrics are given by
$$\partial_t w(t) = -\nabla L(w(t)), \tag{16}$$
where $t$ denotes the artificial time variable, $\partial_t w$ is the time derivative of the parameter tensor, and $\nabla L$ denotes the gradient with respect to the desired metric. Under any metric, this flow reduces the loss.
2.4 Properties of Channel Directed Gradient Flows
We describe some properties of the resulting gradient flows according to the metrics defined in the previous sections.
Coarse-to-Fine Evolution and Removal of Some Local Minima: In Sundaramoorthi et al. (2008), it is shown that gradient flows with respect to Sobolev metrics evolve in a coarse-to-fine fashion, deforming according to coarse-scale perturbations before moving to finer-scale perturbations. This can avoid being trapped in local minima due to fine-scale structures. It is also shown that when we change the metric on the space, the loss landscape changes, and some local minima with respect to $L^2$ may change to other critical points with respect to the Sobolev metric; numerically, some local minima may cease to exist. That is, local minima due to fine-scale structures can be removed by switching to a Sobolev metric. As wide, flat minima are known to generalize well, the removal of local minima due to localized fine structures may encourage convergence to wider, flatter minima and hence better generalization than ordinary SGD.
Regularity of the Weight Tensor: By the convolution formulas above, we can see that the Sobolev gradients are a smoothing of the usual gradient. Noting that the gradient flow (16) integrates the (smooth) gradients over time, the final tensor will also be smooth in the output direction provided the initialization is smooth. In practice, in applications with deep networks, one typically initializes the weight tensor with random noise, so the final tensor may exhibit some randomness; however, the final tensor is the sum of a smooth component and that randomness, and so it exhibits regularity, e.g., strong correlation across nearby output-channel components in the weight tensor, as we verify in experiments. Further, experiments (see Section 4) indicate that final results with respect to our metrics are less dependent on initialization than SGD, which suggests that the initial randomness may die out.
3 Application to Stochastic Gradient Descent and Implementation
To apply the re-weighted and Sobolev channel-directed gradients to optimizing deep convolutional networks based on stochastic gradient descent or its variants, we discretize the gradient flow (16) according to the forward Euler method. We approximate the standard gradient of the loss, $\nabla_{L^2} L$, using a mini-batch, as is standard in deep learning. We then use this approximation of the gradient to approximate the Sobolev gradient, $\nabla_{\widetilde H^1} L$, by discretizing (13)-(15) using a standard Riemann sum. Note that (13) can be computed for each $o$, the output channel index of the tensor, with the cumulative sum (CUMSUM) operation, which is linear in cost, as are (14) and (15). We compute the Sobolev gradient for each convolutional layer parameter tensor independently of the others. We use the resulting Sobolev gradient and add it to a scaled version (by a hyper-parameter) of the $L^2$ gradient (as shown in Figure 2) to avoid over-smoothing. The re-weighted $RL^2$ gradient is computed by applying (9) to the stochastic gradient approximation. Both gradients require only a few additional lines of code; the code for the re-weighted $RL^2$ gradient is shown in Figure 3. Thus, the channel-directed gradients replace the usual one, and all other additions to standard SGD (e.g., momentum, Adam, etc.) can be used as usual.
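A sketch of the linear-time computation via cumulative sums is shown below (our own discretization of (13)-(15); it solves the periodic ODE in (10), with the channel-directed average of the output matching that of the input gradient). In practice, as noted above, the result is combined with a scaled copy of the raw stochastic gradient before the optimizer step.

```python
import torch

def sobolev_channel_grad(grad, lam=1.0):
    """Sketch (ours) of the linear-time channel-directed Sobolev gradient: solve
    g_bar - lam * d^2 g/do^2 = grad with periodic boundary conditions along the
    output-channel axis (scaled to [0, 1]) via cumulative sums, as in (13)-(15)."""
    n = grad.shape[0]
    do = 1.0 / n                                         # normalized channel step
    f_bar = grad.mean(dim=0, keepdim=True)               # channel-directed average
    F = torch.cumsum(f_bar - grad, dim=0) * do / lam     # (13)
    G = torch.cumsum(F, dim=0) * do                      # (14)
    G1 = G[-1:]                                          # G evaluated at o = 1
    o = torch.arange(1, n + 1, dtype=grad.dtype, device=grad.device) * do
    o = o.reshape(-1, *([1] * (grad.dim() - 1)))
    # (15): g = f_bar + G - o * G(1) + G(1) / 2 - integral of G over [0, 1]
    return f_bar + G - o * G1 + 0.5 * G1 - G.sum(dim=0, keepdim=True) * do
```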
4 Experiments
We test our proposed channel-directed metrics with different baseline optimizers and tasks. Our intent is to show that any baseline method and task can be improved just by switching to the gradient with respect to a channel-directed metric in the optimizer. We fix the hyper-parameter $\lambda$ of the channel-directed metrics to a single value unless specified otherwise. Table 1 shows the settings for each experiment. Experiments are run on a single NVIDIA Titan Xp GPU except for GANs, which are run on a Tesla V100 GPU due to memory requirements.
| Task | Dataset | Baseline | Network | Batch Size | Epochs | Initial LR |
|---|---|---|---|---|---|---|
| Image Classification | CIFAR-10 | SGD | ResNet-56 | 128, 32, 8 | 240 | 0.1 |
| | | | VGG-16 | 128, 8, 6 | 240 | 0.01 |
| | | ADAM | ResNet-56 | 128, 32, 8 | 200 | 1e-3 |
| | | LS | ResNet-56 | 128, 32, 8 | 240 | 0.1 |
| | MNIST | SGD | Two-layer Conv | 100 | 100 | 0.01 |
| Semantic Segmentation | PascalVOC | SGD | ResNet-50 | 2 | 70 | 7e-3 |
| Image Generation (GAN) | CityScapes | SGD | SPADE | 2 | 100 | 1e-4, 4e-4 |
Image Classification: We experiment on CIFAR-10 Krizhevsky et al. (2009). We test the combination of our channel-directed metrics with both SGD and ADAM on ResNet-56 He et al. (2016a) and VGG-16 Simonyan & Zisserman (2014), following the settings of Osher et al. (2018). For SGD, we set the initial learning rate to 0.1 and 0.01 on ResNet-56 and VGG-16, respectively, with momentum 0.9 and weight decay 5e-4. For ADAM, we set the initial learning rate to 0.01. We decrease the learning rate by a factor of 10 every 40 epochs, as in Osher et al. (2018). Results presented in this section are the average of at least 10 independent trials.
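For concreteness, the stated ResNet-56 baseline corresponds roughly to the following PyTorch training setup (our own sketch, not the authors' code; `model` and `train_loader` are assumed to be provided):

```python
import torch
import torch.nn.functional as F

def train_cifar_baseline(model, train_loader, epochs=240):
    """Sketch (ours) of the stated CIFAR-10 / ResNet-56 SGD baseline:
    lr 0.1, momentum 0.9, weight decay 5e-4, lr divided by 10 every 40 epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            F.cross_entropy(model(images), labels).backward()
            # the channel-directed processing of p.grad (Sections 2-3) would be
            # applied here, just before the optimizer step
            optimizer.step()
        scheduler.step()
```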
| Architecture | ResNet-56 | | | VGG-16 | | | Architecture | ResNet-56 | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Batch size | 128 | 32 | 8 | 128 | 8 | 6 | Batch size | 128 | 32 | 8 |
| SGD | 93.24 | 91.96 | 86.54 | 93.02 | 92.31 | 91.88 | ADAM | 91.20 | 91.04 | 89.53 |
| Ours | 93.29 | 92.13 | 87.99 | 93.26 | 92.77 | 92.25 | Ours | 91.42 | 91.13 | 90.02 |
| Error reduced % | 0.7% | 2.1% | 10.8% | 3.4% | 6.0% | 4.6% | Error reduced % | 2.5% | 1.0% | 4.7% |
| Ours | 93.38 | 92.10 | 88.04 | 93.19 | 92.79 | 92.43 | Ours | 91.20 | 91.06 | 89.70 |
| Error reduced % | 2.1% | 1.7% | 11.1% | 2.4% | 6.2% | 6.8% | Error reduced % | 0.0% | 0.2% | 1.6% |
Table 2 shows the test accuracy under different settings. Both channel-directed metrics achieve an improvement over the $L^2$ baseline. A greater advantage over the baseline is achieved when the batch size is small, as the stochastic gradient is noisier and our method imposes regularity. In most cases, the channel-directed gradients perform similarly, though one of the two metrics performs significantly better with ADAM.


In Figure 4, we show an example of training and test accuracy curves (batch size of 8) for the baselines as well as Laplacian Smoothing (LS) Osher et al. (2018), which rasterizes before smoothing. We outperform all methods. We also apply LS (without rasterization) to smooth the gradient in our output-channel directed fashion, which improves LS, but we still outperform it. The original implementation of LS only runs smoothing for the first 40 epochs and then uses SGD (for speed). In our experiments, we apply smoothing all the way to convergence to test its effectiveness for the whole duration.
Variance reduction: In Figure 5 (left), we compare the histograms of test accuracy over multiple runs of our method and SGD. Our method achieves higher average test accuracy with reduced variance.
Direction of smoothing: To investigate the effect of smoothing along different channel directions, we apply our method as well as LS along different channel directions. We compare approaches under two settings: smoothing the gradients in all layers, and smoothing the gradients in only the convolutional layers. (For completeness, we have also tested LS for 40 epochs followed by SGD (LS+SGD), as done by Osher et al. (2018) for speed; this results in slightly worse performance than running LS for all epochs, verifying that running LS for all epochs, as we do in our experiments, gives the best results for LS.) Figure 5 (right) shows that our output-channel direction is preferred regardless of the smoothing method used. This shows that preferentially treating the output channel for smoothing, as in our approach, is essential to performance. Interestingly, smoothing only the convolutional layers in a rasterized order (as in LS) performs worse than SGD, but this is compensated by smoothing the non-convolutional layers when smoothing all layers.



Regularity of Tensor: We show in Figure 6 that the final weight tensors at convergence under our methods have regularity in the output channel dimension, as expected since the tensor contains a smooth component. To show this, we plot the correlation between filters in the weight tensors as a function of the distance in the output channel dimension. This is done over multiple tensor layers in ResNet-56 and over multiple trials of optimization on CIFAR-10. We also show the correlation of filters in the input channel direction. As can be seen, all optimization methods produce tensors that exhibit regularity in the output channel direction, while no such regularity appears in the input channel direction. Notice that our methods increase the amount of regularity compared to SGD, as they impose this regularity in optimization.
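The correlation statistic can be computed as in the following sketch (our own implementation; the exact normalization used for Figure 6 is not specified here):

```python
import torch

def channel_correlation(weight, max_shift=8):
    """Sketch (ours) of the regularity measurement: mean correlation between
    filters of a conv weight (C_out, C_in, H, W) as a function of their distance
    along the output-channel dimension."""
    filters = weight.reshape(weight.shape[0], -1)
    filters = filters - filters.mean(dim=1, keepdim=True)
    filters = filters / filters.norm(dim=1, keepdim=True).clamp_min(1e-12)
    corr = []
    for d in range(1, max_shift + 1):
        corr.append((filters[:-d] * filters[d:]).sum(dim=1).mean().item())
    return corr  # corr[d - 1]: mean correlation between filters d channels apart
```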

Effect of Smoothing Parameter: We perform controlled experiments on MNIST LeCun & Cortes (2010) and Fashion-MNIST Xiao et al. (2017) by varying the smoothness parameter $\lambda$ from 0 to 20. Instead of using the standard partition, we conduct training on the test set (10000 samples) and test on the training set (60000 samples), which makes generalization more challenging. We use a 2-layer CNN with 50 and 100 filters in each layer, respectively, and train with batch size 100. Figure 7 shows the accuracy at the 100th epoch (averaged over 5 trials). When $\lambda = 0$, the optimizer degenerates to vanilla SGD. Our methods are not sensitive to $\lambda$ and improve over SGD for any $\lambda > 0$.


Semantic Segmentation: The experiments on semantic segmentation are conducted on the PascalVOC Everingham et al. (2015) dataset using a standard segmentation network Ronneberger et al. (2015) with ResNet-50 as the encoder (see https://github.com/nyoki-mtl/pytorch-segmentation). We perform training with initial learning rate 7e-3 and batch size 2 (the maximum size that fits in Titan Xp GPU memory), and record the training/testing loss and accuracy for 60 epochs. Three independent trials are run under each setting. Figure 8 shows the comparison between our methods and SGD. Both channel-directed metrics improve the final segmentation accuracy on the test set by roughly 8% relative. Also note that our method reduced the generalization gap from 0.163 to 0.151 (by 7.4%) and 0.150 (by 8.0%) for our two channel-directed metrics.


Image Generation: To test performance on GANs, we choose the task of converting semantic labels to images. We perform the experiments on the current state-of-the-art model SPADE Park et al. (2019) (a.k.a. GauGAN), which aims to generate high-quality realistic images from given semantic layouts. Experiments are conducted on CityScapes Cordts et al. (2016), and the FID Heusel et al. (2017) score is used to evaluate quality (lower is better). Learning rates are 1e-4 and 4e-4 for the generator and discriminator, respectively. We compare to SGD with momentum 0.9 and weight decay 5e-4. All models are trained for 100 epochs with batch size 2 (to fit in Tesla V100 memory), and 6 independent trials are run for each optimizer. Figure 9 provides FID curves and error bars. Our methods achieve a better average FID score with significantly less variance. Note that 2 of the 6 models trained with SGD suffered from collapse, which led to high variance, while all twelve trials of our methods achieved good final results.

Speed: With PyTorch, the re-weighted ($RL^2$) gradient adds negligible overhead. In our current PyTorch implementation, the channel-directed Sobolev gradient adds on average 45 ms of overhead to each mini-batch with batch size 128, which increases training time on CIFAR-10 by 50%. This is because tensor transposes and saving/loading are currently required due to limited library functions, contributing a large portion of the computational overhead. In principle, as computing the Sobolev gradient has linear time complexity, if the computation were done, for instance, in C++, it would, like the re-weighted $RL^2$ gradient, add negligible overhead over SGD/Adam.
5 Conclusion
Using gradients that are regular in the output-channel dimension of CNN parameter tensors in SGD is effective in improving the generalization accuracy of SGD and its variants. We reformulated the gradient (without changing the loss) by changing the underlying Riemannian geometry on the tensor space using two different metrics. Both the channel-directed re-weighted and Sobolev metrics gave generalization boosts. Regularity in other tensor dimensions was not effective in improving SGD or its variants. Both channel-directed gradients have similar computational complexity, and the re-weighted gradient adds negligible training time in its PyTorch implementation, which requires only one line of code.
References
- Abraham et al. (2012) Ralph Abraham, Jerrold E Marsden, and Tudor Ratiu. Manifolds, tensor analysis, and applications, volume 75. Springer Science & Business Media, 2012.
- Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
- Bengio (2015) Yoshua Bengio. RMSProp and equilibrated adaptive learning rates for nonconvex optimization. CoRR, abs/1502.04390, 2015.
- Bottou (2012) Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pp. 421–436. Springer, 2012.
- Bottou et al. (2018) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
- Carmo (1992) Manfredo Perdigão do Carmo. Riemannian geometry. Birkhäuser, 1992.
- Charpiat et al. (2007) Guillaume Charpiat, Pierre Maurel, J-P Pons, Renaud Keriven, and Olivier Faugeras. Generalized gradients: Priors on minimization flows. International Journal of Computer Vision, 73(3):325–344, 2007.
- Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.
- Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Everingham et al. (2015) M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, January 2015.
- Gunasekar et al. (2018) Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pp. 9461–9471, 2018.
- Gunasekar et al. (2020) Suriya Gunasekar, Blake Woodworth, and Nathan Srebro. Mirrorless mirror descent: A more natural discretization of Riemannian gradient flow. arXiv preprint arXiv:2004.01025, 2020.
- He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016a. doi: 10.1109/cvpr.2016.90. URL http://dx.doi.org/10.1109/cvpr.2016.90.
- He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016b.
- Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2017.
- Hoffman et al. (2013) Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
- Johnson & Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.
- Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- LeCun & Cortes (2010) Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2017.
- Luo et al. (2019) Liangchen Luo, Yuanhao Xiong, and Yan Liu. Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg3g2R9FX.
- Marceau-Caron & Ollivier (2016) Gaétan Marceau-Caron and Yann Ollivier. Practical Riemannian neural networks. arXiv preprint arXiv:1602.08007, 2016.
- Osher et al. (2018) Stanley Osher, Bao Wang, Penghang Yin, Xiyang Luo, Farzin Barekat, Minh Pham, and Alex Lin. Laplacian smoothing gradient descent. arXiv preprint arXiv:1806.06317, 2018.
- Park et al. (2019) Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, 2015.
- Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Sundaramoorthi et al. (2007) Ganesh Sundaramoorthi, Anthony Yezzi, and Andrea C Mennucci. Sobolev active contours. International Journal of Computer Vision, 73(3):345–366, 2007.
- Sundaramoorthi et al. (2008) Ganesh Sundaramoorthi, Anthony Yezzi, and Andrea Mennucci. Coarse-to-fine segmentation and tracking using Sobolev active contours. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):851–864, 2008.
- Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
- Zeiler (2012) Matthew D Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Appendix A Additional Analysis of Evolution of Channel-Directed Optimization
Figure 10 and Figure 11 present the evolution of training and test accuracy of ADAM and SGD with different batch sizes. Using channel-directed gradients (one of our channel-directed metrics in this experiment) with SGD or ADAM improves test accuracy for any batch size. More prominent performance gains are seen for smaller batch sizes.




Appendix B Additional Experimental Verification of Output-Channel Direction
To investigate the effect of smoothing along different channel directions, we apply our method as well as LS along different channel directions. Figure 12 shows that our output-channel direction is preferred regardless of the smoothing approach used.

Appendix C Detailed Derivations for Section 2.2
We first derive the re-weighted gradient under the $RL^2$ metric, following the same notation as the main paper. Consider the standard $L^2$ gradient $f = \nabla_{L^2} L$; we want to solve for the re-weighted gradient $g = \nabla_{RL^2} L$. By (4) and (8) we have
$$\mathrm{d}L(w)\cdot h = \langle g, h \rangle_{RL^2} = \langle \bar g, \bar h \rangle_{L^2} + (1+\lambda)\,\langle g - \bar g,\; h - \bar h \rangle_{L^2} \tag{17}$$
$$= \langle f, h \rangle_{L^2}. \tag{18}$$
After decomposing $h$ and $f$ into their channel-directed translation and deformation components,
$$h = \bar h + (h - \bar h), \qquad f = \bar f + (f - \bar f), \tag{19}$$
and noting the fact that $\langle \bar h,\; k - \bar k \rangle_{L^2} = 0$ holds for all $h, k$, we have
$$\bar g = \bar f, \qquad (1+\lambda)\,(g - \bar g) = f - \bar f, \tag{20}$$
which leads to the result of (9).
We then derive the Sobolev gradients, following computations similar to those in Sundaramoorthi et al. (2007). Consider first the Sobolev gradient $g = \nabla_{H^1} L$ under the $H^1$ metric. By (6) and (8) we have
$$\mathrm{d}L(w)\cdot h = \langle g, h \rangle_{H^1} = \langle g, h \rangle_{L^2} + \lambda N_o^2\, \langle \partial_o g,\; \partial_o h \rangle_{L^2} \tag{21}$$
$$= \langle f, h \rangle_{L^2}. \tag{22}$$
Integrating by parts and considering the periodic boundary conditions, we have
$$\int \left( g - \lambda N_o^2\, \partial_o^2 g - f \right) h \;\mathrm{d}o\,\mathrm{d}i\,\mathrm{d}x = 0. \tag{23}$$
Since $h$ can be any perturbation, by uniqueness, we have
$$f = g - \lambda N_o^2\, \partial_o^2 g, \tag{24}$$
which is the first relation in (10). Similarly, for the $\widetilde H^1$ metric (7), we have
$$f = \bar g - \lambda N_o^2\, \partial_o^2 g. \tag{25}$$
First observe that, by computing the output-channel directed average of both sides of the above equation, we see that $\bar g = \bar f$, i.e., the average values are the same. One may integrate (25) twice to solve for the gradient. For simplicity, let $f$ be the $L^2$ gradient and $g$ be the $\widetilde H^1$ gradient. Integrating twice yields
$$\partial_o^2 g(o) = \frac{1}{\lambda}\big( \bar f - f(o) \big), \tag{26}$$
$$\partial_o g(o) = \partial_o g(0) + F(o), \qquad F(o) = \frac{1}{\lambda}\int_0^{o} \big( \bar f - f(\tilde o) \big)\,\mathrm{d}\tilde o, \tag{27}$$
$$g(o) = g(0) + o\,\partial_o g(0) + G(o), \qquad G(o) = \int_0^{o} F(\tilde o)\,\mathrm{d}\tilde o. \tag{28}$$
Note that here we perform normalization by scaling the channel direction, letting $o \leftarrow o / N_o \in [0, 1]$, which absorbs the factor $N_o^2$ of (25) into the rescaled derivative. With the periodic boundary conditions $g(0) = g(1)$ and $\partial_o g(0) = \partial_o g(1)$, we have
$$\partial_o g(0) = -G(1), \tag{29}$$
since $g(1) - g(0) = \partial_o g(0) + G(1) = 0$ (the condition on $\partial_o g$ holds automatically because $F(1) = 0$).
For simplicity, we suppress the $i$ and $x$ arguments in the following derivations. We have
$$g(o) = g(0) + o\,\partial_o g(0) + G(o) \tag{30}$$
$$= g(0) - o\,G(1) + G(o). \tag{31}$$
Noting that $\bar g = \bar f$ and $\int_0^1 o\,\mathrm{d}o = \tfrac{1}{2}$, we integrate both sides over the entire interval $[0, 1]$:
$$\int_0^1 g(o)\,\mathrm{d}o = g(0) - G(1)\int_0^1 o\,\mathrm{d}o + \int_0^1 G(o)\,\mathrm{d}o \tag{32}$$
$$\bar f = g(0) - \frac{G(1)}{2} + \int_0^1 G(o)\,\mathrm{d}o \tag{33}$$
$$g(0) = \bar f + \frac{G(1)}{2} - \int_0^1 G(o)\,\mathrm{d}o. \tag{34}$$
Substituting (34) into (31),
$$g(o) = \bar f + \frac{G(1)}{2} - \int_0^1 G(\tilde o)\,\mathrm{d}\tilde o - o\,G(1) + G(o) \tag{35}$$
$$= \bar f + G(o) - o\,G(1) + \frac{G(1)}{2} - \int_0^1 G(\tilde o)\,\mathrm{d}\tilde o. \tag{36}$$
This gives (15) in the main paper.
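As a quick numerical sanity check of (36) (our own script; it compares the cumulative-sum solution with a spectral solution of the same periodic ODE for a smooth test signal):

```python
import torch

# Numerical check (ours): the closed form (36) agrees with a spectral solution of
# f = g_bar - lam * g'' (periodic, o in [0, 1]) up to discretization error.
n, lam = 4096, 1.0
do = 1.0 / n
o = torch.arange(n, dtype=torch.float64) * do
f = torch.sin(2 * torch.pi * o) + 0.5 * torch.cos(6 * torch.pi * o) + 0.3

# spectral solution: g_hat(0) = f_hat(0); g_hat(k) = f_hat(k) / (lam * (2*pi*k)^2)
f_hat = torch.fft.fft(f)
k = torch.fft.fftfreq(n, d=do, dtype=torch.float64)
denom = lam * (2 * torch.pi * k) ** 2
denom[0] = 1.0                                  # mean component passes through
g_spec = torch.fft.ifft(f_hat / denom).real

# cumulative-sum solution (13)-(15) / (36)
F = torch.cumsum(f.mean() - f, dim=0) * do / lam
G = torch.cumsum(F, dim=0) * do
oc = torch.arange(1, n + 1, dtype=torch.float64) * do
g_cum = f.mean() + G - oc * G[-1] + 0.5 * G[-1] - G.sum() * do

print(float((g_spec - g_cum).abs().max()))      # small (discretization error)
```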