
Channel-Directed Gradients for Optimization of Convolutional Neural Networks

Dong Lao, Peihao Zhu, Peter Wonka, Ganesh Sundaramoorthi
King Abdullah University of Science and Technology
Thuwal, Saudi Arabia
{dong.lao,peihao.zhu,peter.wonka,ganesh.sundaramoorthi}@kaust.edu.sa
Abstract

We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error. The method requires only simple processing of existing stochastic gradients, can be used in conjunction with any optimizer, and has only a linear overhead (in the number of parameters) compared to computation of the stochastic gradient. The method works by computing the gradient of the loss function with respect to output-channel directed re-weighted $\mathbb{L}^2$ or Sobolev metrics, which has the effect of smoothing components of the gradient across a certain direction of the parameter tensor. We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental. We present the continuum theory of such gradients, its discretization, and application to deep networks. Experiments on benchmark datasets, several networks and baseline optimizers show that optimizers can be improved in generalization error by simply computing the stochastic gradient with respect to output-channel directed metrics.

1 Introduction

Stochastic gradient descent (SGD) is currently the dominant algorithm for optimizing large-scale convolutional neural networks (CNNs) LeCun et al. (1998); Simonyan & Zisserman (2014); He et al. (2016b). Although there has been considerable activity in optimization methods seeking to improve performance, SGD still dominates in large-scale CNN optimization in terms of its generalization ability. Despite SGD's dominance, there is still often a gap between training and real-world test accuracy in applications, which necessitates research in optimization methods to increase generalization accuracy.

In this paper, we observe that there is regularity in parameter tensors of learned CNN models, and we thus exploit this regularity implicitly in optimization to derive new optimization methods that are simple modifications of traditional SGD and improve generalization. In particular, we empirically observe that parameter tensors in trained networks typically exhibit correlation over the output channel dimension (see Figure 1). We thus explore encoding correlation through smoothness in optimization, which we show improves generalization accuracy, as learning without imposing regularity may not fully learn it. We encode smoothness implicitly in stochastic gradient descent by considering new metrics on the parameter space of the network, and reformulating the notion of gradient. To do this, we treat the space of parameter tensors as a Riemannian manifold to derive gradients of the loss with respect to new metrics that promote regularity in the output channel dimension of the tensors by changing the geometry of the underlying space of tensors.

Our contributions are as follows. First, we formulate output channel-directed Riemannian metrics (a re-weighted version of the standard $\mathbb{L}^2$ metric and another that is a Sobolev metric) over the space of parameter tensors. This encodes channel-directed regularity inherently in the gradient optimization without changing the loss. Second, we compute Riemannian gradients with respect to these metrics, showing linear complexity (in the number of parameters) over standard gradient computation, and thus derive new optimization methods for CNN training. Finally, we apply the methodology to training CNNs and show the empirical advantage in generalization accuracy, especially with small batch sizes, over standard optimizers (SGD, Adam) on numerous applications (image classification, semantic segmentation, generative adversarial networks) with simple modification of existing optimizers.

1.1 Related Work

Figure 1: Visualization of weights (parameter tensor) of convolutional layers trained on ImageNet using SGD. The vertical structures indicate regularity (correlation) of the weights along the output channel direction. This pattern is frequently observed in layers. Our method is motivated by this empirical observation, and favors parameter correlations in this direction in optimization.

We briefly survey related work in deep network optimization; a detailed survey is Bottou et al. (2018). Stochastic gradient descent (SGD), e.g., Bottou (2012), samples a batch of data to tractably estimate the gradient of the loss function. As the stochastic gradient is a noisy version of the gradient, learning rates must follow a decay schedule in order to converge. Many methods have been formulated to choose the learning rate over epochs and components of the gradient, including recent work on adaptive learning rates (e.g., Duchi et al. (2011); Zeiler (2012); Kingma & Ba (2014); Bengio (2015); Loshchilov & Hutter (2017); Luo et al. (2019)). For instance, Adam Kingma & Ba (2014) adaptively adjusts the learning rate so that parameters that have changed infrequently based on historical gradients are updated more quickly than parameters that have changed frequently. Another way to interpret such methods is that they change the underlying metric on the space on which the loss function is defined to an isotropically scaled version of the $\mathbb{L}^2$ metric given by a simple diagonal matrix; we change the metric anisotropically. We show that our method can be used in conjunction with such methods by simply using the stochastic gradient computed with our metrics to boost performance.

As the stochastic gradient is computed based on sampling, different runs of the algorithm can result in different local optima. To reduce the variance, several methods have been formulated, e.g., Defazio et al. (2014); Johnson & Zhang (2013). We are not motivated by variance reduction; rather, we are motivated by inducing regularity in optimization to improve generalization. However, as our method effectively smooths the gradient, our empirical experiments do indicate reduced variance with our metrics compared to SGD.

Another method motivated by variance reduction is the recent work of Osher et al. (2018), where the stochastic gradient is pre-multiplied with an inverse Laplacian smoothing matrix. For CNNs, the gradient with respect to parameters is rasterized in row or column order of network filters before smoothing, which still lowers variance. Our work is inspired by Osher et al. (2018), though we are motivated by incorporating structured regularity of the parameter tensor inherently in the optimization. Osher et al. (2018) can be interpreted as using the gradient of the loss with respect to a Sobolev metric. Our major insight over Osher et al. (2018) is that keeping the multi-dimensional structure of the parameter tensor (rather than rasterizing) and preferentially defining the Sobolev metric with respect to the output-channel direction boosts generalization accuracy, while other directions appear to provide no boost. Second, we introduce a re-weighted $\mathbb{L}^2$ metric that preferentially treats the output-channel direction, can be implemented with one line of PyTorch code, has linear (in parameter size) complexity, and achieves similar results (in many cases) to our channel-directed Sobolev metric, boosting generalization of SGD and Adam. Third, our channel-directed Sobolev gradient can be implemented at linear cost rather than quasi-linear (not requiring an FFT to compute). Sobolev gradients have been used in computer vision Sundaramoorthi et al. (2007); Charpiat et al. (2007) for their coarse-to-fine evolution properties Sundaramoorthi et al. (2008), and we adapt that formulation to channel-directed metrics for CNNs.

We formulate Sobolev gradients by considering the space of parameter tensors as a Riemannian manifold, and choosing the metric (i.e., inner product) on the tangent space to be a Sobolev metric. By choosing a metric, gradients intrinsic to the manifold can be computed and gradient flows decrease the loss. Other Riemannian metrics have been used for optimization in machine learning, e.g., Amari (1998); Marceau-Caron & Ollivier (2016); Hoffman et al. (2013); Gunasekar et al. (2018; 2020), and these tangentially relate to our work. These works are based on Amari's Amari (1998) information geometry on probability measures, and the metric considered is the Fisher information metric. The motivation for these methods is re-parametrization invariance of optimization, whereas our motivation is imposing regularity directly in the parameter space. Most of these methods relate to density estimation as the metric is on probability measures. Gunasekar et al. (2018) notes that even vanilla gradient descent has a certain implicit bias relating to the underlying metric on the space. In Gunasekar et al. (2020) the Hessian metric (in the convex case) is analyzed and related to mirror descent. These metrics are data-dependent and the gradient is challenging to compute, requiring (a large) inverse matrix computation. Moreover, they do not exploit the channel dimension regularity, the main purpose of our work.

2 Channel-Directed Gradients

We now present the theory to define channel-directed gradients. To do this, we formulate new metrics on the space of tensors, and then derive analytic formulas for channel-directed gradients in terms of the standard $\mathbb{L}^2$ gradient. As we show, our channel-directed gradients effectively smooth the components of the $\mathbb{L}^2$ gradient across a certain direction of the parameter tensors of the CNN. Another interpretation is that we are changing the geometric structure of the loss landscape (without changing the loss) to a smoother one by changing the underlying metric of the space on which the loss is defined.

Our metrics are motivated by the empirical observation that a certain dimension of parameter tensors in trained deep networks exhibits regularity (see Figure 1), and thus our method exploits this regularity implicitly in optimization. If we visualize the parameter tensor along its input and output channel dimensions, we see mostly what appears to be random noise; in addition, however, there are regular (correlated) patterns along the output channel direction, implying that each output channel of a layer of a network uses similar (regularly-varying) weightings of input channels. Our metrics thus favor gradient updates during optimization that exhibit such correlation, which we show in experiments leads to optimization that generalizes better.

2.1 Background on Riemannian Gradients

We first briefly present the definition of gradient on a Riemannian manifold, and show the explicit dependence of the gradient on the chosen metric on the manifold. More detailed theory can be found in Carmo (1992); Abraham et al. (2012). We note that a manifold $\mathcal{X}$ is a space that is locally linear around each point $X\in\mathcal{X}$ in the space, and the linear space at each point is called the tangent space, denoted $T_X\mathcal{X}$. A Riemannian manifold, in addition, has a smoothly varying positive definite bilinear form $\left<\cdot,\cdot\right>$ (called the metric) on the tangent space. This metric allows one to define the notion of lengths of curves on the space, in addition to many other operations, including gradients of functions defined on the space.

Definition 1 (Gradient of a Function)

Let $\mathcal{X}$ be a Riemannian manifold, and $f:\mathcal{X}\to\mathbb{R}$ be a function. The directional derivative of $f$ at $X\in\mathcal{X}$ along a direction $k\in T_X\mathcal{X}$ is defined as $\mathrm{d}f(X)\cdot k=\left.\frac{\mathrm{d}}{\mathrm{d}\varepsilon}f(X+\varepsilon k)\right|_{\varepsilon=0}$. The gradient of $f$ at $X\in\mathcal{X}$ is the vector $\nabla f(X)\in T_X\mathcal{X}$ that satisfies the relation

\mathrm{d}f(X)\cdot k=\left<\nabla f(X),k\right>,\quad\text{for all }k\in T_X\mathcal{X}.    (1)

From the definition, we note that “the” gradient depends on the choice of the metric on the manifold. We note that any such gradient will decrease the function $f$ by moving infinitesimally in the tangent space in the direction of the negative gradient, as $\mathrm{d}f(X)\cdot k=-\|\nabla f(X)\|^2<0$ when $k=-\nabla f(X)$, where $\|\cdot\|$ is the norm induced by $\left<\cdot,\cdot\right>$. The gradient flow, defined by the differential equation $\dot{X}_t=-\nabla f(X_t)$, will converge to a local minimum. In our application of this theory to CNN optimization, $f$ will be the loss function, and $\mathcal{X}$ will be the space of parameter tensors. In this case, as the tensor is multi-dimensional, the gradient flow will be a partial differential equation.
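As a concrete finite-dimensional illustration (ours, added for exposition and not part of the original text), take $\mathcal{X}=\mathbb{R}^n$ with the metric $\left<k_1,k_2\right>_M=k_1^\top M k_2$ for a symmetric positive-definite matrix $M$. Definition 1 then requires $(\nabla_{\mathbb{L}^2}f(X))^\top k=(\nabla_M f(X))^\top M k$ for all $k$, so

\nabla_M f(X)=M^{-1}\,\nabla_{\mathbb{L}^2}f(X),

i.e., changing the metric re-shapes the ordinary $\mathbb{L}^2$ gradient by $M^{-1}$ without changing the function $f$; our channel-directed metrics play the role of such an $M$ on the (infinite-dimensional) space of tensors.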

A consequence of this definition is that the gradient is the direction (up to a scale factor) in the tangent space that optimizes the following problem:

\operatorname*{arg\,max}_{k\in T_X\mathcal{X}\backslash\{0\}}\frac{|\mathrm{d}f(X)\cdot k|}{\|k\|}.    (2)

Thus, the gradient can be regarded as the most efficient direction, as it maximizes the ratio of the change in energy obtained by perturbing in a direction $k$ over the cost (defined by the metric) of $k$. Thus, by constructing the metric to have small cost for perturbations (directions) that we prefer for gradients, the gradient flow will move in these preferential directions while minimizing the function, and thus land in more favorable local minima. In particular, we construct metrics that favor gradients with output channel direction regularity, and this does induce regularity in the final tensor.

2.2 Channel-Directed Metrics

As gradients of a function depend on the metric structure on the underlying space, we re-define the metric on the underlying space so that tensors that differ smoothly along the output channel direction have small distance. In existing deep network gradient-based optimization schemes, the underlying metric on the loss function is assumed to be the standard Euclidean $\mathbb{L}^2$ metric. We will consider a re-weighted version of the $\mathbb{L}^2$ metric and Sobolev metrics that favor regularity in the output channel direction of parameter tensors. To formulate the methodology, we start from a continuum formulation, where we treat weight tensors in the continuum, formulate the metrics in the continuum, and then in the next sub-section derive the gradients with respect to these metrics. Finally, we discretize the gradient flows in the implementation to derive iterative schemes.

Let $X:\mathcal{O}\times\mathcal{I}\times\mathcal{H}\times\mathcal{W}\to\mathbb{R}$ denote a parameter tensor of a deep network (from a layer of a convolutional network). Here $\mathcal{O}=[0,O]$ denotes indices to the output channel dimension of the tensor, $\mathcal{I}=[0,I]$ denotes indices to the input channel, and $\mathcal{H}=[0,H],\mathcal{W}=[0,W]$ denote indices to the height and width dimensions of the spatial filters of the tensor. The metric is defined on the tangent space to the space of such $X$. An element of the tangent space has the same form as the tensor, i.e., $k:\mathcal{O}\times\mathcal{I}\times\mathcal{H}\times\mathcal{W}\to\mathbb{R}$. The $\mathbb{L}^2$ (called $H^0$ from now on) metric is defined as

\left\langle k_1,k_2\right\rangle_{H^0}=\int_{\mathcal{O},\mathcal{I},\mathcal{H},\mathcal{W}}k_1(o,i,h,w)\cdot k_2(o,i,h,w)\,\mathrm{d}o\,\mathrm{d}i\,\mathrm{d}h\,\mathrm{d}w,    (3)

where $k_1,k_2$ are in the tangent space of tensors. We now define a re-weighted version of $H^0$ that favors tangent vectors that have global smoothness in the direction of the $\mathcal{O}$ dimension:

\left<k_1,k_2\right>_{H^0_\lambda}=\int_{\mathcal{I},\mathcal{H},\mathcal{W}}\bar{k}_1(i,h,w)\cdot\bar{k}_2(i,h,w)\,\mathrm{d}i\,\mathrm{d}h\,\mathrm{d}w+\frac{\lambda}{O}\left\langle k_1-\bar{k}_1,k_2-\bar{k}_2\right\rangle_{H^0},    (4)

where $\lambda>0$ is a hyper-parameter, and $\bar{k}$ is the average value in the output channel direction, i.e.,

\bar{k}(i,h,w)=\frac{1}{O}\int_{\mathcal{O}}k(o,i,h,w)\,\mathrm{d}o.    (5)

The metric in (4) splits the tangent vector into a global translation in the output channel direction and its orthogonal complement, i.e., the deformation. The weight $\lambda$ controls the relative weighting between the translation and deformation components, i.e., larger values of $\lambda$ mean that deformations more heavily influence the norm of the perturbation. As shown in the next sub-section, this means gradients with respect to this metric weight channel-directed translations more heavily than deformations.

Next, we introduce channel-directed versions of the Sobolev metric, defined as follows:

\left\langle k_1,k_2\right\rangle_{H^1}=\frac{1}{O}\left\langle k_1,k_2\right\rangle_{H^0}+\lambda O\left\langle\frac{\partial k_1}{\partial o},\frac{\partial k_2}{\partial o}\right\rangle_{H^0}    (6)
\left<k_1,k_2\right>_{\tilde{H}^1}=\int_{\mathcal{I},\mathcal{H},\mathcal{W}}\bar{k}_1(i,h,w)\cdot\bar{k}_2(i,h,w)\,\mathrm{d}i\,\mathrm{d}h\,\mathrm{d}w+\lambda O\left\langle\frac{\partial k_1}{\partial o},\frac{\partial k_2}{\partial o}\right\rangle_{H^0},    (7)

where $\frac{\partial}{\partial o}$ indicates the partial derivative with respect to the output channel direction. The partial derivative in the $o$-direction implies that tensor perturbations that are smooth along the $o$-direction are close with respect to these metrics, which in turn implies that the corresponding gradients will exhibit smoothness in this direction, i.e., convolution filters that are nearby in the output direction will exhibit correlation. The first metric is analogous to the usual Sobolev metric, being a weighted combination of the $H^0$ metric and the $H^0$ metric of the derivative, except that it only considers the derivative with respect to one direction. The second metric is similar to the first except that we use the $H^0$ metric of the channel-directed average rather than that of the perturbation itself. As we will see, both have similar properties, but the latter is computationally less costly to compute. The scale factors of $O$ in the expressions above make the metrics scale invariant with respect to different output channel sizes. The part of the metric with the partial derivative component implies that tensors that differ in the output channel direction by a non-smooth perturbation are far away in distance. In the latter metric, tensors that differ by just a channel-directed translation are close. Compared with the re-weighted $H^0$ metric (4), the latter Sobolev metric promotes smooth perturbations beyond global translations.

We have presented only channel-directed metrics that preferentially treat the output channel dimension of the tensor as our empirical experiments demonstrate that promoting regularity in other directions is detrimental to optimization performance.
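To make discrete analogs of the metrics (3), (4), (6), and (7) concrete, the following is a minimal PyTorch sketch (our illustration, with assumed function names and a unit-spacing discretization; it is not part of the original implementation) of the channel-directed inner products on a parameter tensor of shape $O\times I\times H\times W$:

import torch

def inner_H0(k1, k2):
    # Discrete H^0 (L^2) inner product of eq. (3), with unit grid spacing.
    return (k1 * k2).sum()

def channel_mean(k):
    # Average over the output-channel dimension (dim 0), as in eq. (5).
    return k.mean(dim=0, keepdim=True)

def inner_H0_lambda(k1, k2, lam=1.0):
    # Re-weighted H^0 metric of eq. (4): translation part plus (lam/O) times the deformation part.
    O = k1.shape[0]
    k1_bar, k2_bar = channel_mean(k1), channel_mean(k2)
    return (k1_bar * k2_bar).sum() + (lam / O) * inner_H0(k1 - k1_bar, k2 - k2_bar)

def d_do(k):
    # Periodic forward difference along the output-channel direction (the o-derivative).
    return torch.roll(k, shifts=-1, dims=0) - k

def inner_H1(k1, k2, lam=1.0):
    # Channel-directed Sobolev metric H^1 of eq. (6).
    O = k1.shape[0]
    return inner_H0(k1, k2) / O + lam * O * inner_H0(d_do(k1), d_do(k2))

def inner_H1_tilde(k1, k2, lam=1.0):
    # Channel-directed Sobolev metric H~1 of eq. (7).
    O = k1.shape[0]
    k1_bar, k2_bar = channel_mean(k1), channel_mean(k2)
    return (k1_bar * k2_bar).sum() + lam * O * inner_H0(d_do(k1), d_do(k2))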

2.3 Computing Channel-Directed Gradients

Figure 2: Visualization of kernels applied to the $H^0$ gradient under different metrics for $\lambda=1$. This illustrates the smoothing effect of the metrics. In computation, linear-cost formulas are applied to compute the gradients, without using the convolution interpretation.

We now compute gradients with respect to the metrics defined in the previous sub-section in terms of the $H^0$ gradient, so that simple processing of the existing gradient can be done with no other modification of existing optimization code. To compute the relation between the channel-directed gradients and the usual $H^0$ gradient, we note the relation between the directional derivative of a loss function, the gradient and the metric:

\mathrm{d}L(X)\cdot k=\left\langle\nabla_{H^0}L(X),k\right\rangle_{H^0}=\left<\nabla_{H^0_\lambda}L(X),k\right>_{H^0_\lambda}=\left\langle\nabla_{H^1}L(X),k\right\rangle_{H^1}=\left<\nabla_{\tilde{H}^1}L(X),k\right>_{\tilde{H}^1},    (8)

where $\mathrm{d}L(X)\cdot k=\lim_{\varepsilon\to 0}\frac{L(X+\varepsilon k)-L(X)}{\varepsilon}$ is the directional derivative in the direction of the perturbation $k$. Note that the expression above indicates that the directional derivative is equal to the inner product (metric) between the gradient with respect to the metric and the perturbation. This holds for any metric. With this relation, we may compute the channel-directed gradients in terms of the $H^0$ gradient. Derivations are in the Supplementary materials. Letting $f=\nabla_{H^0}L(X)$, we have

\nabla_{H^0_\lambda}L(X)=\bar{f}+\frac{1}{\lambda}(f-\bar{f})    (9)
f=\nabla_{H^1}L(X)-\lambda O^2\frac{\partial^2}{\partial o^2}\nabla_{H^1}L(X)\quad\text{ and }\quad f=\overline{\nabla_{\tilde{H}^1}L(X)}-\lambda O^2\frac{\partial^2}{\partial o^2}\nabla_{\tilde{H}^1}L(X),    (10)

where the last two expressions are second-order ordinary differential equations (ODEs), whose solutions we discuss next. Notice that the re-weighted $H^0$ gradient (9) simply re-weights the channel-directed translation component and the deformation component of the $H^0$ gradient differently, i.e., as $\lambda$ gets larger, the channel-directed translation becomes more prominent.
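For concreteness, the re-weighted $H^0$ gradient in (9) can be transcribed directly into PyTorch (a sketch with an assumed function name; the Figure 3 code in Section 3 implements a re-scaled variant of the same expression):

import torch

def reweighted_H0_grad(f, lam=1.0):
    # Eq. (9): keep the channel-directed mean component of the H^0 gradient f
    # and down-weight the deformation component by 1/lam.
    f_bar = f.mean(dim=0, keepdim=True)   # average over the output-channel dimension
    return f_bar + (f - f_bar) / lam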

In obtaining the ODE expressions for the Sobolev gradients above, we have assumed periodic boundary conditions in the $\mathcal{O}$ dimension. (Since the ordering of filters along the channel direction in a CNN has no particular significance, choosing periodic or non-periodic boundary conditions is arbitrary; periodic conditions induce smoothness between the starting and ending filters in the $o$-dimension and enable a simpler computational solution.) In this case, the Sobolev gradients can be interpreted as the circular convolution of the $H^0$ gradient with convolution kernels given as

K(o)=\frac{\cosh\left[\lambda^{-1/2}(o-0.5)\right]}{2\sinh\left[\lambda^{-1/2}\right]},\quad\tilde{K}(o)=1+\frac{o^2-o+1/6}{2\lambda},\quad\text{ for }o\in[0,1],    (11)

for each of the $H^1$ and $\tilde{H}^1$ metrics, respectively, where $o$ above is scaled by $O$ to be between 0 and 1, and the circular convolution is given by

\nabla_{\tilde{H}^1}L(X)(o,i,h,w)=\frac{1}{O}\int_{\mathcal{O}}\tilde{K}((o-\tilde{o})/O)\,f(\tilde{o},i,h,w)\,\mathrm{d}\tilde{o}.    (12)

Note that the re-weighted $H^0$ solution also has an interpretation as convolution with a smoothing kernel. Figure 2 shows plots of various kernels for the parameter $\lambda$ chosen in experiments. For each $o$, the Sobolev or re-weighted $H^0$ gradient is a local average whose weights die away far from $o$. Thus, we can see that the effect of the metrics is to induce smoothness of the gradient along the output channel direction.
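For illustration, a brief sketch (ours; a naive quadratic-cost implementation of the $\tilde{K}$ kernel of (11) and the circular convolution of (12), meant only to make the convolution interpretation concrete; the linear-cost route of (13)-(15) below is what one would use in practice):

import torch

def sobolev_tilde_kernel(O, lam=1.0):
    # Samples of K~(o) = 1 + (o^2 - o + 1/6) / (2 * lam) for o in [0, 1), eq. (11).
    o = torch.arange(O, dtype=torch.float32) / O
    return 1.0 + (o ** 2 - o + 1.0 / 6.0) / (2.0 * lam)

def circular_conv_along_output(f, kernel):
    # Circular convolution of the H^0 gradient f (shape O x I x H x W) with the
    # kernel along the output-channel axis, a discrete analog of eq. (12).
    O = f.shape[0]
    idx = (torch.arange(O).unsqueeze(1) - torch.arange(O).unsqueeze(0)) % O  # (o - o~) mod O
    weights = kernel[idx] / O
    return torch.einsum('ab,b...->a...', weights, f)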

The second version of the Sobolev gradient need not use the convolution formula for computation, as one can just integrate the ODE twice (after noting that the channel-directed averages of both the $H^0$ and Sobolev gradients are the same). This avoids computing the convolution directly, and hence reduces the computational cost from quadratic (or quasi-linear with an FFT) to linear in $O$ given the $H^0$ gradient. The second version of the Sobolev gradient can be computed as

g(o,i,h,w)=g(0,i,h,w)+o\frac{\partial g}{\partial o}(0,i,h,w)-\frac{1}{\lambda}\int_0^o(o-\tilde{o})\left(f(\tilde{o}O,i,h,w)-\bar{f}(i,h,w)\right)\mathrm{d}\tilde{o}    (13)
\frac{\partial g}{\partial o}(0,i,h,w)=-\frac{1}{\lambda}\int_0^1 o\left(f(oO,i,h,w)-\bar{f}(i,h,w)\right)\mathrm{d}o    (14)
g(0,i,h,w)=\int_0^1\tilde{K}(o)f(oO,i,h,w)\,\mathrm{d}o,\quad o\in[0,1],    (15)

where $g=\nabla_{\tilde{H}^1}L(X)$ is the second version of the Sobolev gradient and $f=\nabla_{H^0}L(X)$; these are just three integrals that can be computed with linear complexity with respect to $O$. The gradient flows under these metrics are given by

\dot{X}_t=-\nabla L(X_t),    (16)

where $t$ denotes the artificial time variable, $\dot{X}$ is the time derivative of the parameter tensor, and $\nabla$ denotes the gradient with respect to the desired metric. Under any metric, this reduces the loss.

2.4 Properties of Channel Directed Gradient Flows

We describe some properties of the resulting gradient flows according to the metrics defined in the previous sections.

Coarse-to-Fine Evolution and Removal of Some Local Minima: In Sundaramoorthi et al. (2008), it is shown that gradient flows with respect to Sobolev metrics evolve in a coarse-to-fine fashion, deforming according to coarse-scale perturbations before moving to finer-scale perturbations. This can avoid being trapped in local minima due to fine-scale structures. It is also shown that when we change the metric on the space $\mathcal{X}$, the loss landscape changes: some local minima with respect to $H^0$ may change to other critical points with respect to Sobolev metrics, and numerically some local minima may cease to exist. That is, local minima due to fine-scale structures can be removed by switching to a Sobolev metric. As wide, flat minima tend to generalize well, the removal of local minima due to localized fine structures may encourage convergence to wider, flatter minima and hence better generalization than ordinary SGD.

Regularity of the Weight Tensor: By the convolution formulas above, we can see that the Sobolev gradients are a smoothing of the usual $H^0$ gradient. Noting that the gradient flow (16) integrates the (smooth) gradients over time, the final tensor will also be smooth in the output direction provided the initialization is smooth. In practice, in applications with deep networks, one typically initializes the weight tensor with random noise, so the final tensor may exhibit some randomness, but the final tensor is the sum of a smooth component and the randomness, and so it exhibits regularity, e.g., strong correlation across nearby output-channel components in the weight tensor, as we verify in experiments. Further, experiments (see Section 4) indicate that final results with respect to our metrics are less dependent on initialization than SGD, which suggests that the initial randomness may die out.

3 Application to Stochastic Gradient Descent and Implementation

To apply re-weighted $H^0$ and Sobolev channel-directed gradients to optimizing deep convolutional networks based on stochastic gradient descent or its variants, we discretize the gradient flow (16) according to the forward Euler method. We approximate the standard $H^0$ gradient of the loss, $\nabla_{H^0}L(X)$, using a mini-batch, as is standard in deep learning. We then use this approximation of the $H^0$ gradient to approximate the $\tilde{H}^1$ gradient, $\nabla_{\tilde{H}^1}L(X)$, by discretizing (13)-(15) using a standard Riemann sum. Note that (13) can be computed for each $o$, the output channel index of the tensor, with the cumulative sum (CUMSUM) operation, which is linear in cost, as are (14) and (15). We compute the Sobolev gradient for each convolutional layer parameter tensor independently of the others. We use $\lambda=1$ for the $\tilde{H}^1$ gradient and add it to a scaled version (by a hyper-parameter) of the $H^0$ gradient (as shown in Figure 2) to avoid over-smoothing. The re-weighted $H^0$ gradient is computed by using (9) from the $H^0$ stochastic gradient approximation. Both gradients require only a few additional lines of code; the code for re-weighted $H^0$ is shown in Figure 3. Thus, the channel-directed gradients replace the usual one, and all other additions to standard SGD (e.g., momentum, Adam, etc.) can be used as usual.

import torch

def reweighted_L2_grad(grad, lam):
    # grad: L2 (H^0) gradient of a conv layer (O x I x H x W), e.g. param.grad.data
    # lam > 0 weights the output-channel translation component of the L2 gradient
    grad += lam * torch.mean(grad, 0, True).repeat(grad.size(0), 1, 1, 1)
    return grad
Figure 3: PyTorch code to compute the re-weighted $\mathbb{L}^2$ ($H^0_\lambda$) gradient from the $H^0$ gradient.
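For reference, the following is a sketch (ours, not the authors' released implementation; the function name and the left-endpoint Riemann-sum discretization are assumptions) of how (13)-(15) can be computed with cumulative sums, as described above:

import torch

def sobolev_tilde_grad(f, lam=1.0):
    # Sketch of the H~1 gradient of eqs. (13)-(15) from the H^0 gradient f
    # (shape O x I x H x W), discretized with a left-endpoint Riemann sum.
    O = f.shape[0]
    o = (torch.arange(O, dtype=f.dtype, device=f.device) / O).view(O, 1, 1, 1)
    f_bar = f.mean(dim=0, keepdim=True)
    h = f - f_bar  # deformation component (zero channel-mean)
    # eq. (15): g(0) = int_0^1 K~(o) f(oO) do, with K~ from eq. (11)
    K = 1.0 + (o ** 2 - o + 1.0 / 6.0) / (2.0 * lam)
    g0 = (K * f).mean(dim=0, keepdim=True)
    # eq. (14): dg/do(0) = -(1/lam) int_0^1 o (f(oO) - f_bar) do
    dg0 = -(o * h).mean(dim=0, keepdim=True) / lam
    # eq. (13): int_0^o (o - o~) h(o~) do~ = o * int_0^o h do~ - int_0^o o~ h(o~) do~,
    # both computed with exclusive cumulative sums in linear time
    csum_h = torch.cumsum(h, dim=0) / O
    csum_oh = torch.cumsum(o * h, dim=0) / O
    csum_h = torch.cat([torch.zeros_like(csum_h[:1]), csum_h[:-1]], dim=0)
    csum_oh = torch.cat([torch.zeros_like(csum_oh[:1]), csum_oh[:-1]], dim=0)
    return g0 + o * dg0 - (o * csum_h - csum_oh) / lam

In a training step, one would replace param.grad for each convolutional parameter tensor with the output of such a routine (or of reweighted_L2_grad in Figure 3) before calling optimizer.step(), leaving the rest of the optimizer unchanged.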

4 Experiments

We test our proposed channel-directed metrics with different baseline optimizers and tasks. Our intent is to show that any baseline optimizer and task can be improved just by switching to the gradient with respect to channel-directed metrics in the optimizer. We fix $\lambda=1$ for channel-directed metrics unless specified otherwise. Table 1 shows the settings for each experiment. Experiments are run on a single NVIDIA Titan Xp GPU, except for GANs, which are run on a Tesla V100 GPU due to memory requirements.

Table 1: Experimental settings.
Task Dataset Baseline Network Batch Size Epochs Initial LR
Image Classification Cifar-10 SGD ResNet-56 128,32,8 240 0.1
VGG-16 128,8,6 240 0.01
ADAM ResNet-56 128,32,8 200 1e-3
LS ResNet-56 128,32,8 240 0.1
MNIST SGD Two-layer Conv 100 100 0.01
Semantic Segmentation PascalVOC SGD ResNet50 2 70 7e-3
Image Generation (GAN) CityScapes SGD SPADE 2 100 1e-4,4e-4

Image Classification: We experiment on CIFAR-10 Krizhevsky et al. (2009). We test the combination of our channel-directed metrics with both SGD and ADAM on ResNet-56 He et al. (2016a) and VGG-16 Simonyan & Zisserman (2014), following the settings of Osher et al. (2018). For SGD, we set the initial learning rate to 0.1 and 0.01 on ResNet-56 and VGG-16, respectively, with momentum 0.9 and weight decay 5e-4. For ADAM, we set the initial learning rate to 0.01. We decrease the learning rate by a factor of 10 every 40 epochs, as in Osher et al. (2018). Results presented in this section are the average of at least 10 independent trials.

Table 2: Test accuracy on CIFAR-10. Channel-directed metrics improve over $H^0$ in all cases. In the best case, more than 10% of the error is reduced by using our channel-directed metrics. Results are averaged over 10 trials.
SGD baseline:
Architecture     ResNet-56               VGG-16
Batch size       128    32     8         128    8      6
SGD              93.24  91.96  86.54     93.02  92.31  91.88
+ $\tilde{H}^1$  93.29  92.13  87.99     93.26  92.77  92.25
Error reduced    0.7%   2.1%   10.8%     3.4%   6.0%   4.6%
+ $H^0_\lambda$  93.38  92.10  88.04     93.19  92.79  92.43
Error reduced    2.1%   1.7%   11.1%     2.4%   6.2%   6.8%

ADAM baseline:
Architecture     ResNet-56
Batch size       128    32     8
ADAM             91.20  91.04  89.53
+ $\tilde{H}^1$  91.42  91.13  90.02
Error reduced    2.5%   1.0%   4.7%
+ $H^0_\lambda$  91.20  91.06  89.70
Error reduced    0.0%   0.2%   1.6%

Table 2 shows the test accuracy under different settings. Both channel-directed metrics achieve improvement over $H^0$. A greater advantage over the baseline is achieved when the batch size is small, as the stochastic gradient is noisier and our method imposes regularity. In most cases, the channel-directed gradients perform similarly, but $\tilde{H}^1$ performs significantly better with ADAM.

Figure 4: Evolution of training and test accuracy on CIFAR-10: an example with batch size 8. Our metric significantly improves both training and test accuracy.

In Figure 4, we show an example of training and test accuracy curves (batch size of 8) for the baselines as well as Laplacian Smoothing (LS) Osher et al. (2018), which rasterizes before smoothing. We outperform all of these methods. We also apply LS (without rasterization) to smooth the gradient in our output-channel directed fashion, which improves LS, but we still outperform it. The original implementation of LS only runs smoothing for the first 40 epochs, then uses SGD (for speed). In our experiments, we apply smoothing all the way to convergence to test its effectiveness for the whole duration.

Variance reduction: In Figure 5 (left), we compare the histograms of test accuracy over multiple runs of our method and of SGD. Our method achieves higher average test accuracy with reduced variance.

Direction of smoothing: To investigate the effect of different channel directions of smoothing, we apply our method as well as LS along different channel directions. We compare approaches under two settings: smoothing gradients in all layers, and smoothing gradients in only convolutional layers. (For completeness, we have also tested LS for 40 epochs followed by a switch to SGD (LS+SGD), as done by Osher et al. (2018) for speed; this results in slightly worse performance than running LS for all epochs, verifying that running LS for all epochs, as we do in our experiments, gives the best results for LS.) Figure 5 (right) shows that our output-channel direction is preferred regardless of the smoothing method used. This shows that preferentially treating the output channel for smoothing, as in our approach, is essential to performance. Interestingly, smoothing only convolutional layers in a rasterized order (as in LS) performs worse than SGD, but this is made up for by smoothing the non-convolutional layers when smoothing all layers.

Figure 5: Distribution of results on CIFAR-10. Left: Histogram of test accuracy. Ours achieves higher average with significantly reduced variance. Right: Results from different smoothing directions. Best accuracy obtained from our proposed direction. A: Output-Channel Directed; B: Input-Channel Directed; All: parameters rasterized into a 1-D vector to perform smoothing; Ours: re-weighted $\mathbb{L}^2$.

Regularity of Tensor: We show in Figure 6 that the final weight tensor at convergence with our methods has regularity in the output channel dimension, as should be the case since the tensor contains a component that is smooth. To show this, we plot the correlation between filters in the weight tensors as a function of the distance in the output channel dimension. This is done over multiple tensor layers in ResNet-56 and over multiple trials of optimization on CIFAR-10. We also show the correlation of filters in the input channel direction. As can be seen, all optimization methods produce tensors that exhibit regularity in the output channel direction, while no such regularity appears in the input direction. Notice that our methods increase the amount of regularity compared to SGD, as they impose this in optimization.

Figure 6: Regularity of Tensor. Correlation between weights along different channel directions in CIFAR-trained ResNet-56 conv layers (over 10 trials). $|i-j|$ is the distance between weight locations in the tensor for the correlation computation. Sobolev/re-weighted $H^0$ show strong correlation in the output direction, but not the input direction. SGD also shows correlation in the output direction.
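The correlation statistic of Figure 6 can be computed with a short routine; the following is an assumed sketch (our illustration of the kind of measurement described, not the authors' evaluation code):

import torch

def output_channel_correlation(W, max_dist=8):
    # Average Pearson correlation between flattened filters W[i] and W[j] of a
    # conv tensor (shape O x I x H x W), as a function of the distance |i - j|.
    O = W.shape[0]
    flat = W.reshape(O, -1)
    flat = flat - flat.mean(dim=1, keepdim=True)
    flat = flat / (flat.norm(dim=1, keepdim=True) + 1e-12)
    corr = flat @ flat.t()
    return [torch.diagonal(corr, offset=d).mean().item() for d in range(1, max_dist + 1)]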

Effect of Smoothing Parameter: We perform controlled experiments on MNIST LeCun & Cortes (2010) and Fashion-MNIST Xiao et al. (2017) by varying the smoothness parameter $\lambda$ from 0 to 20. Instead of using the standard partition, we conduct training on the test set (10000 samples) and test on the training set (60000 samples), which makes generalization more challenging. We use a 2-layer CNN with 50 and 100 $5\times 5$ filters in each layer, respectively, and train with batch size 100. Figure 7 shows the accuracy at the 100th epoch (averaged over 5 trials). When $\lambda=0$, the optimizer degenerates to vanilla SGD. Our methods are not sensitive to $\lambda$ and improve over SGD for any $\lambda>0$ in this range.

Figure 7: Results on MNIST and Fashion-MNIST with different choices of the smoothness parameter. Our methods improve classification accuracy over SGD (the $\lambda=0$ case) for a wide range of smoothness levels.

Semantic Segmentation: The experiments on semantic segmentation are conducted on the PascalVOC Everingham et al. (2015) dataset using a standard segmentation network Ronneberger et al. (2015) with ResNet-50 as the encoder (see https://github.com/nyoki-mtl/pytorch-segmentation). We perform training with initial learning rate 7e-3 and batch size 2 (the maximum size that fits in Titan Xp GPU memory), and record the training/testing loss and accuracy for 60 epochs. 3 independent trials are run under each setting. Figure 8 shows the comparison between our methods and SGD. Both channel-directed metrics improve the final segmentation accuracy on the test set by about 8% relative. Also note that our method reduces the generalization gap from 0.163 to 0.151 (by 7.4%) and 0.150 (by 8.0%) for $\tilde{H}^1$ and $H^0_\lambda$, respectively.

Figure 8: Semantic Segmentation Results on PascalVOC. Sobolev $\tilde{H}^1$ and re-weighted $\mathbb{L}^2$ ($H^0_\lambda$) improve segmentation accuracy by 8.5% and 7.8%, respectively, relative to SGD.

Image Generation: To test the performance on GANs, we choose the task of converting semantic labels to images. We perform the experiments on the current state-of-the-art model SPADE Park et al. (2019) (a.k.a. GauGAN), which aims to generate high-quality realistic images from given semantic layouts. Experiments are conducted on CityScapes Cordts et al. (2016), and the FID Heusel et al. (2017) score is used to evaluate quality (lower is better). Learning rates are 1e-4 and 4e-4 for the generator and discriminator, respectively. We compare to SGD with momentum 0.9 and weight decay 5e-4. All models are trained for 100 epochs with batch size 2 (to fit in Tesla V100 memory), and 6 independent trials are run for each optimizer. Figure 9 provides FID curves and error bars. Our methods achieve a better average FID score with significantly less variance. Note that 2 out of 6 models trained by SGD suffered from collapse, which led to high variance, while all twelve trials of our methods achieved good final results.

Figure 9: Results on the image generation task. Our methods achieve better results with significantly reduced variance due to the regularity imposed during the training process. Final FID: SGD: 61.37 ± 12.00; Channel-Directed Sobolev ($\tilde{H}^1$): 56.31 ± 3.12; Channel-Directed Re-Weighted $\mathbb{L}^2$ ($H^0_\lambda$): 57.62 ± 4.02. Lower is better.

Speed: With PyTorch, re-weighted $\mathbb{L}^2$ ($H^0_\lambda$) adds negligible overhead. In our current PyTorch implementation, $\tilde{H}^1$ adds on average 45 ms of overhead per mini-batch with batch size 128, which increases training time on CIFAR-10 by 50%. This is because tensor transposes and saving/loading are currently required due to limited library functions, and these contribute a large portion of the computational overhead. In principle, since computing the $\tilde{H}^1$ gradient has linear time complexity, if the computation were done, for instance, in C++, it would, like re-weighted $\mathbb{L}^2$, add negligible overhead over SGD/Adam.

5 Conclusion

Using gradients that are regular in the output-channel dimension of CNN parameter tensors in SGD is effective in improving the generalization accuracy of SGD and its variants. We reformulated the gradient (without changing the loss) by changing the underlying Riemannian geometry on the tensor space using two different metrics. Both the channel-directed re-weighted $H^0$ and $\tilde{H}^1$ gave generalization boosts. Regularity in other tensor dimensions was not effective in improving SGD or its variants. Both channel-directed gradients have similar computational complexity, and the re-weighted $H^0$ adds negligible training time in its PyTorch implementation, which is one line of PyTorch code.

References

  • Abraham et al. (2012) Ralph Abraham, Jerrold E Marsden, and Tudor Ratiu. Manifolds, tensor analysis, and applications, volume 75. Springer Science & Business Media, 2012.
  • Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
  • Bengio (2015) Yoshua Bengio. Rmsprop and equilibrated adaptive learning rates for nonconvex optimization. CoRR abs/1502.04390, 2015.
  • Bottou (2012) Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pp. 421–436. Springer, 2012.
  • Bottou et al. (2018) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
  • Carmo (1992) Manfredo Perdigao do Carmo. Riemannian geometry. Birkhäuser, 1992.
  • Charpiat et al. (2007) Guillaume Charpiat, Pierre Maurel, J-P Pons, Renaud Keriven, and Olivier Faugeras. Generalized gradients: Priors on minimization flows. International Journal of Computer Vision, 73(3):325–344, 2007.
  • Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • Everingham et al. (2015) M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, January 2015.
  • Gunasekar et al. (2018) Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pp. 9461–9471, 2018.
  • Gunasekar et al. (2020) Suriya Gunasekar, Blake Woodworth, and Nathan Srebro. Mirrorless mirror descent: A more natural discretization of Riemannian gradient flow. arXiv preprint arXiv:2004.01025, 2020.
  • He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016a. doi: 10.1109/cvpr.2016.90. URL http://dx.doi.org/10.1109/cvpr.2016.90.
  • He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016b.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2017.
  • Hoffman et al. (2013) Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • Johnson & Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • LeCun & Cortes (2010) Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2017.
  • Luo et al. (2019) Liangchen Luo, Yuanhao Xiong, and Yan Liu. Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg3g2R9FX.
  • Marceau-Caron & Ollivier (2016) Gaétan Marceau-Caron and Yann Ollivier. Practical Riemannian neural networks. arXiv preprint arXiv:1602.08007, 2016.
  • Osher et al. (2018) Stanley Osher, Bao Wang, Penghang Yin, Xiyang Luo, Farzin Barekat, Minh Pham, and Alex Lin. Laplacian smoothing gradient descent. arXiv preprint arXiv:1806.06317, 2018.
  • Park et al. (2019) Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, 2015.
  • Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Sundaramoorthi et al. (2007) Ganesh Sundaramoorthi, Anthony Yezzi, and Andrea C Mennucci. Sobolev active contours. International Journal of Computer Vision, 73(3):345–366, 2007.
  • Sundaramoorthi et al. (2008) Ganesh Sundaramoorthi, Anthony Yezzi, and Andrea Mennucci. Coarse-to-fine segmentation and tracking using Sobolev active contours. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):851–864, 2008.
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
  • Zeiler (2012) Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Appendix A Additional Analysis of Evolution of Channel-Directed Optimization

Figure 10 and Figure 11 present the evolution of training and test accuracy of ADAM and SGD with different batch sizes. Using channel-directed gradients ($\tilde{H}^1$ in this experiment) for SGD or ADAM improves test accuracy for any batch size. More prominent performance gains are seen for smaller batch sizes.

Figure 10: Training and test accuracy on CIFAR-10 with ADAM.
Figure 11: Training and test accuracy on CIFAR-10 with SGD.

Appendix B Additional Experimental Verification of Output-Channel Direction

To investigate the effect of different channel directions of smoothing, we apply our method as well as LS along different channel directions. Figure 12 shows that our output-channel direction is preferred regardless of the smoothing approach used.

Figure 12: Channel-Directed Smoothing Leads to Better Performance. Best accuracy obtained from our proposed direction. A: Output-Channel Directed; B: Input-Channel Directed; All: parameters rasterized into a 1-D vector to perform smoothing; Ours: re-weighted $\mathbb{L}^2$.

Appendix C Detailed Derivations for Section 2.2

We first derive the re-weighted $\mathbb{L}^2$ gradient under the $H^0_\lambda$ metric, following the same notation as the paper. Consider $f\triangleq\nabla_{H^0}L(X)$, the standard $\mathbb{L}^2$ gradient; we want to solve for $g\triangleq\nabla_{H^0_\lambda}L(X)$. By (4) and (8) we have

\left\langle f,k\right\rangle_{H^0}=\left<g,k\right>_{H^0_\lambda}    (17)
=\left\langle\bar{g},\bar{k}\right\rangle_{H^0}+\lambda\left\langle g-\bar{g},k-\bar{k}\right\rangle_{H^0}.    (18)

After decomposing $f$ and $k$ into

f=\bar{f}+(f-\bar{f}),\quad k=\bar{k}+(k-\bar{k})    (19)

and noting the fact that $\left\langle\bar{k},k-\bar{k}\right\rangle_{H^0}=0$ holds for all $k$, we have

\bar{f}=\bar{g},\quad f-\bar{f}=\lambda(g-\bar{g}),    (20)

which leads to the result of (9).

We then derive the Sobolev gradient under the $H^1$ metric, following computations similar to those in Sundaramoorthi et al. (2007). Consider $\nabla_{H^1}L(X)$, the Sobolev gradient under the $H^1$ metric. By (6) and (8) we have

\left\langle\nabla_{H^0}L(X),k\right\rangle_{H^0}=\left\langle\nabla_{H^1}L(X),k\right\rangle_{H^1}    (21)
=\frac{1}{O}\left\langle k,\nabla_{H^1}L(X)\right\rangle_{H^0}+\lambda O\left\langle\frac{\partial k}{\partial o},\frac{\partial\nabla_{H^1}L(X)}{\partial o}\right\rangle_{H^0}.    (22)

Integrating by parts and considering the periodic boundary conditions, we have

\left\langle\nabla_{H^0}L(X),k\right\rangle_{H^0}=\left\langle\nabla_{H^1}L(X)-\lambda O^2\frac{\partial^2}{\partial o^2}\nabla_{H^1}L(X),k\right\rangle_{H^0}.    (23)

Since $k$ can be any perturbation, by uniqueness, we have

\nabla_{H^0}L(X)=\nabla_{H^1}L(X)-\lambda O^2\frac{\partial^2}{\partial o^2}\nabla_{H^1}L(X),    (24)

which is (10). Similarly, for the $\tilde{H}^1$ metric, we have

\nabla_{H^0}L(X)=\overline{\nabla_{\tilde{H}^1}L(X)}-\lambda O^2\frac{\partial^2}{\partial o^2}\nabla_{\tilde{H}^1}L(X).    (25)

First observe that, by computing the output-channel directed average of both sides of the above equation, we see that $\overline{\nabla_{\tilde{H}^1}L(X)}=\overline{\nabla_{H^0}L(X)}$, i.e., the average values are the same. One may integrate (25) twice to solve for the $\tilde{H}^1$ gradient. For simplicity, let $f$ be the $\mathbb{L}^2$ gradient and $g$ be the $\tilde{H}^1$ gradient. Integrating twice yields

g(o,i,h,w)=g(0,i,h,w)+\int_0^o\frac{\partial g}{\partial o}(0,i,h,w)\,\mathrm{d}\tilde{o}-\frac{1}{\lambda}\int_0^o\int_0^{\hat{o}}(f(\tilde{o}O,i,h,w)-\bar{f}(i,h,w))\,\mathrm{d}\tilde{o}\,\mathrm{d}\hat{o}    (26)
=g(0,i,h,w)+\int_0^o\frac{\partial g}{\partial o}(0,i,h,w)\,\mathrm{d}\tilde{o}-\frac{1}{\lambda}\int_0^o\int_{\tilde{o}}^o(f(\tilde{o}O,i,h,w)-\bar{f}(i,h,w))\,\mathrm{d}\hat{o}\,\mathrm{d}\tilde{o}    (27)
=g(0,i,h,w)+o\frac{\partial g}{\partial o}(0,i,h,w)-\frac{1}{\lambda}\int_0^o(o-\tilde{o})(f(\tilde{o}O,i,h,w)-\bar{f}(i,h,w))\,\mathrm{d}\tilde{o}.    (28)

Note that here we normalize by rescaling the channel direction so that $o\in[0,1]$. With boundary conditions $g(0)=g(1)$, $\frac{\partial g}{\partial o}(0)=\frac{\partial g}{\partial o}(1)$ and $\bar{f}=\bar{g}$, we have

\frac{\partial g}{\partial o}(0,i,h,w)=-\frac{1}{\lambda}\int_0^1 o(f(oO,i,h,w)-\bar{f}(i,h,w))\,\mathrm{d}o.    (29)

For simplicity, we omit $i,h,w$ and $O$ in the following derivations. We have

g(0)=g(o)-o\frac{\partial g}{\partial o}(0)+\frac{1}{\lambda}\int_0^o(o-\tilde{o})(f(\tilde{o})-\bar{f})\,\mathrm{d}\tilde{o}    (30)
=g(o)+\frac{o}{\lambda}\int_0^1 o(f(o)-\bar{f})\,\mathrm{d}o+\frac{1}{\lambda}\int_0^o(o-\tilde{o})(f(\tilde{o})-\bar{f})\,\mathrm{d}\tilde{o}.    (31)

Noting that $\int_0^1 g(0)\,\mathrm{d}o=g(0)$ and $\bar{f}=\int_0^1 f(o)\,\mathrm{d}o$, we integrate both sides over the entire interval $[0,1]$:

g(0)=\bar{g}+\frac{1}{\lambda}\int_0^1 o\,\mathrm{d}o\cdot\int_0^1 o(f(o)-\bar{f})\,\mathrm{d}o+\frac{1}{\lambda}\int_0^1\int_0^o(o-\tilde{o})(f(\tilde{o})-\bar{f})\,\mathrm{d}\tilde{o}\,\mathrm{d}o    (32)
=\bar{f}+\frac{1}{2\lambda}\int_0^1 of(o)\,\mathrm{d}o-\frac{1}{4\lambda}\bar{f}+\frac{1}{\lambda}\left(\int_0^1\int_0^o(o-\tilde{o})f(\tilde{o})\,\mathrm{d}\tilde{o}\,\mathrm{d}o-\bar{f}\int_0^1\int_0^o(o-\tilde{o})\,\mathrm{d}\tilde{o}\,\mathrm{d}o\right)    (33)
=\left(1-\frac{1}{4\lambda}-\frac{1}{6\lambda}\right)\bar{f}+\frac{1}{2\lambda}\int_0^1 of(o)\,\mathrm{d}o+\frac{1}{\lambda}\int_0^1\int_{\tilde{o}}^1(o-\tilde{o})f(\tilde{o})\,\mathrm{d}o\,\mathrm{d}\tilde{o}    (34)
=\left(1-\frac{5}{12\lambda}\right)\int_0^1 f(o)\,\mathrm{d}o+\frac{1}{2\lambda}\int_0^1 of(o)\,\mathrm{d}o+\frac{1}{\lambda}\int_0^1\left(\frac{1}{2}+\frac{\tilde{o}^2}{2}-\tilde{o}\right)f(\tilde{o})\,\mathrm{d}\tilde{o}    (35)
=\int_0^1\left(1+\frac{o^2-o+1/6}{2\lambda}\right)f(o)\,\mathrm{d}o.    (36)

This gives (15) in the main paper.