
Learning $k$-Level Structured Sparse Neural Networks Using Group Envelope Regularization

Yehonathan Refael     Iftach Arbel     Wasim Huleihel
Department of Electrical Engineering, Tel Aviv University
E-mail: {refaelkalim@mail,wasimh@tauex}.tau.ac.il
Independent Researcher
E-mail: i.arbel84@gmail.com
Abstract

The extensive need for computational resources poses a significant obstacle to deploying large-scale Deep Neural Networks (DNNs) on devices with constrained resources. At the same time, studies have demonstrated that a significant number of these DNN parameters are redundant and extraneous. In this paper, we introduce a novel approach for learning structured sparse neural networks, aimed at addressing the hardware-deployment challenges of DNNs. We develop a novel regularization technique, termed the Weighted Group Sparse Envelope Function (WGSEF), generalizing the Sparse Envelope Function (SEF), to select (or nullify) neuron groups, thereby reducing redundancy and enhancing computational efficiency. The method speeds up inference and aims to reduce memory demand and power consumption, thanks to its adaptability, which lets any hardware specify the group definitions, such as filters, channels, filter shapes, layer depths, or even a single parameter (unstructured). The properties of the WGSEF enable the pre-definition of a desired sparsity level that is achieved at training convergence. In the case of redundant parameters, this approach maintains negligible network accuracy degradation, or can even lead to improvements in accuracy. Our method efficiently computes the WGSEF regularizer and its proximal operator in worst-case linear complexity relative to the number of group variables. Employing a proximal-gradient-based optimization technique to train the model, it tackles the non-convex minimization problem incorporating the neural network loss and the WGSEF. Finally, we experimentally illustrate the efficiency of our proposed method in terms of compression ratio, accuracy, and inference latency.

1 Introduction

In the past decade, significant progress has characterized the study of Deep Neural Networks (DNNs), which consistently demonstrate superior performance across the entire spectrum of machine learning tasks. As modern neural networks increase in size and complexity, with parameter counts often surpassing the number of available training samples, their deployment on resource-limited edge devices becomes increasingly challenging. This difficulty stems from higher computational demands, which lead to greater power consumption, longer inference times, and the need for substantial memory for storage, which edge devices typically lack [1, 2]. Nevertheless, many studies have revealed that modern neural networks tend to be excessively over-parameterized [3, 4]. This over-parameterization implies the existence of redundant parameters that could be pruned (or nullified) without compromising network accuracy [5], and which are also responsible for issues such as overfitting [6], memorization of random patterns in the data [7], and a potential degradation in generalization. The realization that numerous redundant parameters exist has prompted a quest for neural network architectures that are both sparse and efficient, which has emerged as a prominent challenge in the field.

To mitigate the challenges associated with the deployment of modern large DNNs, numerous studies have suggested compressing their scale. Various approaches have been explored, including (unstructured) sparsity-inducing regularization [8], pruning [9], low-rank approximation [3, 10], quantization [11, 12], and even sparse neural architecture search (NAS) [13, 12]. In the unstructured case, the most natural sparsity-inducing regularizer is the so-called $\ell_{0}$-pseudo-norm, which counts the number of nonzero elements in the input vector, i.e., $\|\mathbf{z}\|_{0}\triangleq|\{i:z_{i}\neq 0\}|$. These sparse regularized minimization/training problems take the form $\min_{\mathbf{z}\in\mathbb{R}^{n}}\left\{f(\mathbf{z})+\lambda\|\mathbf{z}\|_{0}\right\}$, or, alternatively, one can explicitly constrain the number of parameters used for regression and solve $\min_{\mathbf{z}\in\mathbb{R}^{n}}\left\{f(\mathbf{z}):\|\mathbf{z}\|_{0}\leq k\right\}$. Unfortunately, the $\ell_{0}$-norm is a difficult function to handle, being non-convex and even non-continuous. Indeed, these types of regression problems are NP-hard in general [14] (a globally optimal solution cannot be computed in reasonable time, even for a very small number of parameters). As a remedy for this inherent problem, [15] proposed a highly efficient, tractable convex relaxation technique, termed the sparse envelope function (SEF), for the sum of the $\ell_{0}$ and $\ell_{2}$ norms. Specifically, [15] suggested using this relaxation as a regularizer for a convex loss objective, particularly for a linear regression model, to achieve feature selection while explicitly limiting the number of features to a fixed parameter $k$. It was shown that the performance of this sparsity-inducing regularization method, in both reconstructing a sparse noisy signal and recovering its support, surpasses that of state-of-the-art techniques such as the Elastic-net [16] and the $k$-support norm [17]. Moreover, the computational complexity of the SEF approach is linear in the number of parameters, while all the others are at least quadratic in the number of features, making the SEF very attractive.

Not long ago, the idea of structured sparsification was used in [18, 19] to learn sparse neural networks that leverage tensor arithmetic in dedicated neural processing units (NPUs). In a nutshell, structured sparsity learning amounts to inducing sparsity onto structured components (e.g., channels, filters, or layers) of the neural network during the optimization procedure. In practice, this leads to both lower latency and lower power consumption, which cannot be obtained by deploying unstructured sparse models on such modern hardware.

With the goal of enabling structured sparsification learning that can be customized for different NPU devices, in this paper we propose a novel generalization of the SEF regularizer that handles group-structured sparsification in neural network training. Our new generalized regularization term selects the most essential $k\leq m$ predefined groups of neurons (which could be convolutional filters, channels, individual neurons, or any other user-defined/NPU-defined structure, where $m$ is the total number of groups) and prunes all the others, while maintaining minimal degradation in network accuracy. We define the new regularization term mathematically, propose an efficient method to calculate its value and proximal operator, and suggest a new algorithm to solve the complete optimization problem involving the non-convex term, which is the composition of the loss function and the neural network output.

Related work.

The topic of regularization-based pruning has received a lot of attention in recent years. Generally speaking, these studies can be divided into unstructured and structured pruning. The most prominent regularizers are the convex $\ell_{1}$ and $\ell_{2}$ norms [20, 21, 3], as well as the non-convex $\ell_{0}$ "norm" [22, 9, 23], where Bayesian methods and additional regularization terms were used, for practicality, to deal with the non-convexity of the $\ell_{0}$ norm. Additional works [24, 25] suggest relaxations of the $\ell_{0}$ norm that employ $\ell_{1}$ minimization in general (non-orthogonal) dictionaries, leading to an error surface with fewer local minima than that of the $\ell_{0}$ norm. The motivation for these regularizers is their "sparsity-inducing" property, which can be harnessed to learn sparse neural networks. While these fundamental papers significantly reduce the storage needed to keep the networks on hardware, they offered no benefit in reducing inference latency or power consumption. That is, the sparse neural networks learned by the aforementioned methods were not adapted to the tensor arithmetic of the hardware they were meant to run on.

The practical inefficiency of unstructured sparsity-inducing methods has led researchers to propose regularization-based structured pruning in order to accelerate running time. For example, [26, 18, 27] proposed the use of the Group Lasso regularization technique to learn sparse structures, and [28] uses Sparse Group Lasso, summing Group Lasso with the standard Lasso penalty. Other convex regularizers include the Combined Group and Exclusive Sparsity (CGES) [29], which extends the Exclusive Lasso (in essence, a squared $\ell_{1}$ over groups) [30] using Group Lasso. Recently, [19] suggested a family of nonconvex regularizers that blend Group Lasso with nonconvex terms ($\ell_{0}$, $\ell_{1}-\ell_{2}$ [31], and SCAD [32]). Since [19] introduces a non-convex term into the penalty, it also requires an appropriate optimization scheme, for which the authors propose an Augmented Lagrangian type method. However, this optimization algorithm has an inner optimization loop with a high computational cost. Moreover, their extensive experiments do not show accuracy or sparsity advantages over convex penalties, suggesting that it might still be desirable to use a convex regularizer. Other methods, such as [33, 34], focus on a group structure that captures the relations between parameters, neurons, and layers, in order to construct groups that maximize network compression while minimizing accuracy loss. However, these methods still apply Group Lasso regularization. Specifically, in [33] the authors introduce the concept of Zero-Invariant Groups (ZIGs), which include all input and output connections between layers. In the context of CNNs, this extends the channel-wise grouping [18] to include the corresponding batch normalization parameters. By using this group structure, entire blocks of parameters can be removed while keeping dimensions aligned between layers, ultimately allowing network compression. Moreover, their optimization scheme utilizes a two-phase algorithm that includes a half-space projection step, which they name HSPG. Lately, [35] suggested a methodology that applies adaptive optimization via weighted proximal operators to induce structured sparsity, seamlessly integrating numerical solvers to preserve convergence guarantees, albeit with computational-efficiency concerns due to approximation requirements.

Finally, we mention that there exist other techniques for neural network compression, such as quantization and low-rank decomposition, to name a few. In quantization [36, 37, 38], the precision of the weights is reduced by representing them with a small number of bits (e.g., 8 bits) instead of a higher-precision format (e.g., 32-bit floating point values). The low-rank decomposition approach [39, 40, 41, 42] is based on the observation that many weight matrices in neural networks are highly correlated and can be well approximated by matrices of lower rank. By decomposing a weight matrix into lower-rank matrices, one can reduce the total number of parameters in the network.

Notation.

We denote by $\mathbf{e}$ the vector of all ones. For a positive integer $m$, we denote $[m]\equiv\{1,2,\ldots,m\}$. We denote by $x_{\langle i\rangle}$ the component of $x$ with the $i$th largest absolute value, meaning in particular that $|x_{\langle 1\rangle}|\geq|x_{\langle 2\rangle}|\geq\ldots\geq|x_{\langle n\rangle}|$. $|S|$ refers to the number of elements in the set $S$. Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be an extended real-valued function; then the conjugate function of $f$, denoted by $f^{\star}:\mathbb{R}^{n}\to\mathbb{R}$, is defined as $f^{\star}(y)=\max_{x\in\mathbb{R}^{n}}\left\{\langle x,y\rangle-f(x)\right\}$, for any $y\in\mathbb{R}^{n}$. The bi-conjugate function is defined as the conjugate of the conjugate function, i.e., $f^{\star\star}(x)=\max_{y\in\mathbb{R}^{n}}\left\{\langle x,y\rangle-f^{\star}(y)\right\}$, for any $x\in\mathbb{R}^{n}$. Finally, the proximal operator of a proper, lower semi-continuous convex function $f:\mathbb{R}^{n}\to\mathbb{R}$ is defined as $\operatorname{prox}_{f}(v)=\operatorname*{argmin}_{x\in\mathbb{R}^{n}}\left\{f(x)+\frac{1}{2}\|x-v\|_{2}^{2}\right\}$, for any $v\in\mathbb{R}^{n}$. The sets $\mathbb{R}_{+}$ and $\mathbb{R}_{++}$ denote all non-negative and positive real numbers, respectively.
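For concreteness, here is a short worked example of the last definition (ours, not part of the original text; it follows directly from the definition and previews the group-wise shrinkage that appears later). For the weighted squared norm $f(x)=\frac{d}{2}\|x\|_{2}^{2}$ with $d>0$,

\operatorname{prox}_{\lambda f}(v)=\operatorname*{argmin}_{x\in\mathbb{R}^{n}}\left\{\frac{\lambda d}{2}\|x\|_{2}^{2}+\frac{1}{2}\|x-v\|_{2}^{2}\right\}=\frac{v}{1+\lambda d},

obtained by setting the gradient $\lambda d\,x+(x-v)$ to zero; the same $1/(1+\lambda d_{j})$ shrinkage factor reappears, group by group, in Lemma 3 below.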

2 Structured sparsity via WGSEF

2.1 Problem formulation

In this subsection, we formulate the problem and introduce the weighted group sparse envelope function (WGSEF). Without loss of generality, our method is formulated in terms of weight sparsity, but it can be directly extended to neuron sparsity (i.e., both weights and biases). Let $\mathcal{D}$ be a dataset consisting of $N$ i.i.d. input-output pairs $\left\{\left(x_{1},y_{1}\right),\ldots,\left(x_{N},y_{N}\right)\right\}$. The general neural network training problem is formalized as the following regularized empirical risk minimization over the parameters $\theta\in\Theta$ of a given neural network architecture $f(\cdot;\theta)$,

\underset{\theta\in\Theta}{\operatorname{argmin}}\;\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}\left(f(x_{i},\theta),y_{i}\right)+\lambda\cdot\Omega(\theta), \qquad (1)

where $f(\cdot;\theta)$ is the hypothesis, that is, a given neural network architecture, $\mathcal{L}(\cdot)\geq 0$ corresponds to a loss function, e.g., cross-entropy loss for classification, mean-squared error for regression, etc., $\Omega(\cdot):\Theta\to\mathbb{R}_{+}$ is the parameter regularization term, and $\lambda\in\mathbb{R}_{+}$ is the regularization magnitude. Below, $n\triangleq|\Theta|$ denotes the number of parameters.

The most predominant regularizer used for DNNs is weight decay, also known as $\ell_{2}$ norm regularization. It is known to prevent overfitting and to improve generalization, since it forces the weights to decrease proportionally to their magnitudes. The most natural way to force a predefined $k$-level sparsity would be to constrain the number of non-zero parameters (e.g., the model weights), which can be done by adding the constraint $\|\theta\|_{0}\leq k$, where $k\leq n$ is the required predefined level of sparsity. In this case, the training problem is formalized as follows,

\underset{\theta\in\Theta}{\operatorname{argmin}}\;\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}\left(f(x_{i},\theta),y_{i}\right)+\frac{\lambda}{2}\|\theta\|_{2}^{2}\quad\textrm{s.t.}\quad\|\theta\|_{0}\leq k. \qquad (2)

We refer to the above training problem as unstructured sparsification. In the case of structured sparsity, the parameters $\theta$ are divided into predefined disjoint sub-groups. These subgroups can correspond to the architectural building blocks of DNNs, e.g., filters, channels, filter shapes, and layer depths. Consider the following definition.

Definition 1 (Group projection).

Let $s\subseteq\{1,2,\ldots,n\}$ be a subset of indices of size $|s|\leq n$. Then, given some vector $\theta\in\mathbb{R}^{n}$, the projection $M_{s}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ preserves only the entries of $\theta$ that belong to the set $s$. Furthermore, let $A_{s}$ be an $n\times n$ diagonal matrix, where $[A_{s}]_{ii}=1$ if $i\in s$, and zero otherwise. Note that $M_{s}(\theta)=A_{s}\theta$.

Example 1.

Let $n=3$, $\theta=(3,6,9)^{\top}$, $s=\{1,3\}\subseteq[n]$, and accordingly $|s|=2$. Then $M_{s}(\theta)=M_{s}\left((3,6,9)^{\top}\right)=(3,0,9)^{\top}$, with $[A_{s}]_{11}=[A_{s}]_{33}=1$ and zero otherwise.
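As a small illustration (ours, not part of the original), the projection of Definition 1 and Example 1 can be realized by masking a NumPy array; the function name is ours.

import numpy as np

def group_project(theta, s):
    # M_s(theta): keep only the entries of theta indexed by s, zero out the rest.
    out = np.zeros_like(theta)
    out[s] = theta[s]
    return out

# Example 1: n = 3, s = {1, 3} in 1-based indexing, i.e., indices [0, 2] here.
theta = np.array([3.0, 6.0, 9.0])
print(group_project(theta, np.array([0, 2])))  # [3. 0. 9.]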

Following the above definition, let $m\leq n$ subsets $s_{1},s_{2},\ldots,s_{m}$ be a given (non-overlapping) partition of $[n]$, namely, $s_{i}\cap s_{j}=\emptyset$ for all $i\neq j$, and $\bigcup_{i=1}^{m}s_{i}=[n]$. Without loss of generality, we assume that $n\bmod m=0$; otherwise, the groups would simply have different numbers of coordinates. Every group is associated with some weight $d_{j}\in\mathbb{R}_{++}$, $j\in[m]$, e.g., $d_{j}=\frac{1}{|s_{j}|}$, namely, normalization by the group size. For simplicity of notation, let $\theta_{s_{i}}=M_{s_{i}}(\theta)$, for $i=1,2,\ldots,m$. Then, our structured training problem is,

\underset{\theta\in\Theta}{\operatorname{argmin}}\;\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}\left(f(x_{i},\theta),y_{i}\right)+\frac{\lambda}{2}\sum_{j=1}^{m}d_{j}\|\theta_{s_{j}}\|_{2}^{2}\quad\textrm{s.t.}\quad\left\|\left(\|\theta_{s_{1}}\|_{2}^{2},\|\theta_{s_{2}}\|_{2}^{2},\ldots,\|\theta_{s_{m}}\|_{2}^{2}\right)\right\|_{0}\leq k. \qquad (3)

To wit, we constrain the number of groups that have at least one non-zero coordinate to be at most $k$. Let $C_{k}$ denote the set of all $k$-group-sparse vectors, i.e.,

C_{k}\triangleq\left\{\theta:\left\|\left(\|\theta_{s_{1}}\|^{2},\|\theta_{s_{2}}\|^{2},\ldots,\|\theta_{s_{m}}\|^{2}\right)\right\|_{0}\leq k\right\},

and define $\delta_{C_{k}}$ as the following extended real-valued function,

\delta_{C_{k}}(\theta)\triangleq\begin{cases}0,&\left\|\left(\|\theta_{s_{1}}\|^{2},\|\theta_{s_{2}}\|^{2},\ldots,\|\theta_{s_{m}}\|^{2}\right)\right\|_{0}\leq k,\\ \infty,&\text{else.}\end{cases}

Then, the optimization problem in equation 3 can be reformulated as,

\underset{\theta\in\Theta}{\operatorname{argmin}}\;\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}\left(f(x_{i},\theta),y_{i}\right)+\lambda\cdot gs_{k}(\theta), \qquad (4)

where $gs_{k}(\theta)\triangleq\frac{1}{2}\sum_{j=1}^{m}d_{j}\|\theta_{s_{j}}\|_{2}^{2}+\delta_{C_{k}}(\theta)$. Equivalently, $gs_{k}(\theta)$ can be rewritten as $gs_{k}(\theta)=\frac{1}{2}\sum_{i=1}^{n}d_{i}\,\theta_{i}^{2}+\delta_{C_{k}}(\theta)$, where $d_{i}=d_{j}$ for every $i\in s_{j}$.
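To make the definition concrete, the following is a minimal sketch (ours) that evaluates $gs_{k}$ directly; here groups is an assumed list of index arrays forming the partition $s_{1},\ldots,s_{m}$, and d holds the weights $d_{j}$.

import numpy as np

def gs_k(theta, groups, d, k):
    # gs_k(theta) = 0.5 * sum_j d_j * ||theta_{s_j}||^2 if at most k groups are non-zero,
    # and +infinity otherwise (the indicator delta_{C_k}).
    group_sq_norms = np.array([np.sum(theta[s] ** 2) for s in groups])
    if np.count_nonzero(group_sq_norms) > k:
        return np.inf
    return 0.5 * float(np.dot(d, group_sq_norms))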

The $\ell_{0}$-norm that appears in equation 4 is a difficult function to handle, being non-convex and even non-continuous, which makes the problem an intractable combinatorial NP-hard problem [14]. Following the work in [15], one approach to deal with this inherent difficulty is to consider the best convex underestimator of $gs_{k}(\cdot)$. The latter is its bi-conjugate function, namely, ${\mathcal{GS}}_{k}(\theta)=gs_{k}^{\star\star}(\theta)$, which we refer to as the weighted group sparse envelope function (WGSEF).

Remark 1 (Generalization of SEF).

Consider the case $m=n$, namely, every subset $s_{i}$, $i\in[m]$, is a singleton and $d_{j}=1$ for all $j\in[m]$. Here, $gs_{k}(\theta)=s_{k}(\theta)$, where $s_{k}(\theta)=\frac{1}{2}\|\theta\|_{2}^{2}$ if $\|\theta\|_{0}\leq k$ and $s_{k}(\theta)=\infty$ otherwise. Accordingly, in this case, ${\mathcal{GS}}_{k}(\theta)={\mathcal{S}}_{k}(\theta)=s_{k}^{\star\star}(\theta)$, and $s_{k}^{\star\star}(\theta)$ is exactly the classical SEF, namely ${\mathcal{S}}_{k}(\cdot)$. Therefore, ${\mathcal{GS}}_{k}(\cdot)$ is indeed a new generalization of the SEF to handle group sparsity.

Thus, the path taken in this paper is to consider the following relaxed learning problem (training),

\underset{\theta\in\Theta}{\operatorname{argmin}}\;\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}\left(f(x_{i},\theta),y_{i}\right)+{\mathcal{GS}}_{k}(\theta). \qquad (5)

In the following subsection, we develop an efficient algorithm to calculate the value and the prox operator [43] of the WGSEF; these will be essential ingredients in solving (5).

2.2 Convex relaxation

Let us start by introducing some notation. For any $\theta\in\mathbb{R}^{n}$ and $m$ subgroups of indices $s_{1},s_{2},\ldots,s_{m}\subset[n]$, we denote by $M_{\langle s_{i}\rangle}(\theta)$ the subgroup of coordinates of $\theta$ with the $i$th largest $\ell_{2}$-norm, meaning that

\left\|M_{\langle s_{1}\rangle}(\theta)\right\|_{2}\geq\left\|M_{\langle s_{2}\rangle}(\theta)\right\|_{2}\geq\ldots\geq\left\|M_{\langle s_{m}\rangle}(\theta)\right\|_{2}.

We next show that the conjugate of the $k$ group sparse envelope is the $k$ weighted group hard-thresholding function. In the sequel, we let $D$ be the $n\times n$ diagonal positive-definite weight matrix such that $D_{i,i}=\sqrt{d_{i}}$ for all $i\in s_{j}$, $j\in[m]$.

Lemma 1 (The $k$ weighted group sparse envelope conjugate).

Let $s_{1},s_{2},\ldots,s_{m}$ be a set of $m\leq n$ disjoint index subsets that partition $[n]$. Then, for any $\tilde{\theta}\in\mathbb{R}^{n}$,

gs_{k}^{\star}(\tilde{\theta})=\frac{1}{2}\sum_{j=1}^{k}\frac{1}{d_{j}}\left\|M_{\langle s_{j}\rangle}(\tilde{\theta})\right\|_{2}^{2}. \qquad (6)

Next, we obtain the bi-conjugate function of the $k$ weighted group sparse envelope. To express the following results explicitly, we will deliberately utilize the group projection of Definition 1, expressed as $M_{s}(\theta)=A_{s}\theta$ for some set of indices $s$.

Lemma 2 (The variational bi-conjugate of the $k$ weighted group sparse envelope).

Let $s_{1},s_{2},\ldots,s_{m}$ be a set of $m\leq n$ disjoint subsets that partition $[n]$. Then, for any $\theta\in\mathbb{R}^{n}$, the bi-conjugate of the $k$ group sparse envelope is given by

{\mathcal{GS}}_{k}(\theta)=\frac{1}{2}\min_{\mathbf{u}\in B_{k}}\left\{\sum_{j=1}^{m}d_{j}\,\phi\left(A_{s_{j}}\theta,u_{j}\right)\right\}, \qquad (7)

where,

\phi\left(A_{s_{j}}\theta,u_{j}\right)\triangleq\begin{cases}\frac{\theta^{\top}A_{s_{j}}\theta}{u_{j}},&u_{j}>0,\\ 0,&u_{j}=0\ \text{and}\ A_{s_{j}}\theta=0,\\ \infty,&\text{else.}\end{cases} \qquad (8)

The following is a straightforward corollary of Lemma 2.

Corollary 2.1.

The following holds:

{\mathcal{GS}}_{k}(\theta)={\mathcal{S}}\left(\left(\sqrt{d_{1}}\|A_{s_{1}}\theta\|_{2},\sqrt{d_{2}}\|A_{s_{2}}\theta\|_{2},\ldots,\sqrt{d_{m}}\|A_{s_{m}}\theta\|_{2}\right)^{\top}\right),

where ${\mathcal{S}}(\theta)\triangleq s_{k}^{\star\star}(\theta)$ is the standard SEF.

The above corollary implies that in order to calculate ${\mathcal{GS}}_{k}(\theta)$ we only need to apply an algorithm that calculates the SEF at $\left(\sqrt{d_{1}}\|A_{s_{1}}\theta\|_{2},\sqrt{d_{2}}\|A_{s_{2}}\theta\|_{2},\ldots,\sqrt{d_{m}}\|A_{s_{m}}\theta\|_{2}\right)^{\top}$.

Remark 2.

Noting that $\|\sqrt{d_{j}}A_{s_{j}}\theta\|_{2}^{2}=\sum_{i\in s_{j}}d_{j}\theta_{i}^{2}$, we observe that since $s_{j}$ is given, the number of operations required to calculate $\|\sqrt{d_{j}}A_{s_{j}}\theta\|_{2}^{2}$ is linear in $|s_{j}|$. Thus, the computational complexity of calculating the vector $\left(\sqrt{d_{1}}\|A_{s_{1}}\theta\|_{2},\sqrt{d_{2}}\|A_{s_{2}}\theta\|_{2},\ldots,\sqrt{d_{m}}\|A_{s_{m}}\theta\|_{2}\right)^{\top}$ is linear in $n$.
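In code, the reduction of Corollary 2.1 amounts to the following sketch (ours); the SEF itself would then be evaluated on the returned vector using the algorithm of [15].

import numpy as np

def weighted_group_norms(theta, groups, d):
    # Returns (sqrt(d_1)*||A_{s_1} theta||_2, ..., sqrt(d_m)*||A_{s_m} theta||_2).
    # Each group norm costs O(|s_j|), so the whole vector costs O(n), as noted in Remark 2.
    return np.array([np.sqrt(dj) * np.linalg.norm(theta[s]) for s, dj in zip(groups, d)])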

2.3 Proximal mapping of the WGSEF

In this subsection, we show how to efficiently compute the proximal operator of positive scalar multiples of ${\mathcal{GS}}_{k}$. The ability to perform such an operation implies that it is possible to employ fast proximal gradient methods to solve equation 5. We begin with the following lemma, which shows that the proximal operator can be determined in terms of the optimal solution of a convex problem that resembles the optimization problem of equation 7 (Lemma 2) for computing ${\mathcal{GS}}_{k}$.

Lemma 3.

Let $\lambda>0$, $t\in\mathbb{R}^{n}$, and let $s_{1},s_{2},\ldots,s_{m}$ be a set of $m\leq n$ disjoint subsets that partition $[n]$. Then, $v=\operatorname{prox}_{\lambda{\mathcal{GS}}_{k}}(t)$ is given by

A_{s_{j}}v=\frac{u_{j}A_{s_{j}}t}{\lambda d_{j}+u_{j}},\qquad j\in[m],

where $\left(u_{1},u_{2},\ldots,u_{m}\right)^{\top}$ is the minimizer of

\min_{\mathbf{u}\in D_{k}}\sum_{j=1}^{m}\phi\left(\sqrt{d_{j}}A_{s_{j}}t,\lambda d_{j}+u_{j}\right). \qquad (9)

Next, we show that the proximal operator of 𝒢𝒮k{\mathcal{GS}}_{k} reduces to an efficient one-dimensional search.

Corollary 3.1 (The proximal operator of ${\mathcal{GS}}_{k}$).

The solution of equation 9 is given by $u_{j}=u_{j}(\mu^{*})$, with $u_{j}(\cdot)$ defined as (if $A_{s_{j}}t=0$, then equation 10 implies that $u_{j}(\mu)=0$ for all $\mu\geq 0$)

u_{j}(\mu^{*})=\begin{cases}1,&\sqrt{\mu^{*}}\leq\frac{|b_{j}|}{\alpha_{j}+1},\\ \frac{|b_{j}|}{\sqrt{\mu^{*}}}-\alpha_{j},&\frac{|b_{j}|}{\alpha_{j}+1}<\sqrt{\mu^{*}}<\frac{|b_{j}|}{\alpha_{j}},\\ 0,&\sqrt{\mu^{*}}\geq\frac{|b_{j}|}{\alpha_{j}},\end{cases} \qquad (10)

for $b_{j}=\left\|\sqrt{d_{j}}A_{s_{j}}t\right\|_{2}$ and $\alpha_{j}=\lambda d_{j}\,(>0)$, where $\tilde{\eta}=\frac{1}{\sqrt{\mu^{*}}}$ is a root of the function

g_{t}(\eta)\equiv\sum_{j=1}^{m}u_{j}(\eta)-k, \qquad (11)

which is nondecreasing and satisfies

g_{t}\left(\frac{\lambda\cdot\min_{j\in[m]}\{d_{j}\}}{\left\|\left(\sqrt{d_{1}}\left\|A_{s_{1}}t\right\|_{2},\sqrt{d_{2}}\left\|A_{s_{2}}t\right\|_{2},\ldots,\sqrt{d_{m}}\left\|A_{s_{m}}t\right\|_{2}\right)\right\|_{\infty}}\right)=\sum_{i=1}^{m}0-k<0,

and,

g_{t}\left(\frac{\lambda\left\|\left(d_{1},d_{2},\ldots,d_{m}\right)\right\|_{\infty}+1}{\left\|M_{\langle s_{m}\rangle}\left(\sqrt{d_{j}}\,t\right)\right\|_{2}}\right)=\sum_{i=1}^{m}1-k>0.

In addition, $g_{t}$ can be reformulated as a sum of pairs of the functions

v_{j}(\eta)\equiv\left|\eta|b_{j}|-\alpha_{j}\right|,\qquad w_{j}(\eta)\equiv 1-\left|\eta|b_{j}|-(\alpha_{j}+1)\right|,\qquad j\in[m],

such that,

g_{\mathbf{t}}(\eta)=\frac{1}{2}\sum_{j=1}^{m}v_{j}(\eta)+\frac{1}{2}\sum_{j=1}^{m}w_{j}(\eta)-k.

The following important remarks are in order.

Remark 3 (Root-search application for the function in equation 11).

Employing the randomized root-search method of [15, Algorithm 1], with the $2m$ single-breakpoint piecewise-linear functions $v_{j},w_{j}$ as input, the root of $g_{\mathbf{t}}$ can be found in $O(m)$ time.

Remark 4 (Computational complexity of $\operatorname{prox}_{\lambda{\mathcal{GS}}_{k}}$).

The computation of $\operatorname{prox}_{\lambda{\mathcal{GS}}_{k}}$ boils down to a root-search problem (see Remark 3), which requires $O(m)$ operations. In addition, before employing the root search, the assembly of $v_{j},w_{j}$ requires calculating the $m$ values of $b_{j}$ defined in Corollary 3.1. Note that for any $j\in[m]$, calculating $b_{j}$ essentially amounts to computing $t^{\top}A_{s_{j}}t=\sum_{i\in s_{j}}t_{i}^{2}$. Since the $s_{j}$'s are given, the computational complexity of calculating all $m$ values of $b_{j}$ is linear in $n$, the dimension of $t$. Thus, the total cost of computing $\operatorname{prox}_{\lambda{\mathcal{GS}}_{k}}$ is linear in $n$, the dimension of all group parameters together.
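The following is a minimal sketch (ours) of $\operatorname{prox}_{\lambda{\mathcal{GS}}_{k}}$ built directly from Lemma 3 and Corollary 3.1; t is the flattened parameter vector and groups is a list of index arrays into it. For simplicity it locates the root of $g_{t}$ by bisection rather than by the $O(m)$ randomized breakpoint search of [15, Algorithm 1], so it is illustrative rather than the complexity-optimal routine, and all names are ours.

import numpy as np

def prox_wgsef(t, groups, d, k, lam):
    # prox_{lam*GS_k}(t): per Lemma 3, v_{s_j} = u_j * t_{s_j} / (lam*d_j + u_j), where
    # u_j(eta) = clip(eta*|b_j| - alpha_j, 0, 1), b_j = sqrt(d_j)*||t_{s_j}||_2,
    # alpha_j = lam*d_j, and eta solves g_t(eta) = sum_j u_j(eta) - k = 0 (Corollary 3.1).
    b = np.array([np.sqrt(dj) * np.linalg.norm(t[s]) for s, dj in zip(groups, d)])
    alpha = lam * np.asarray(d, dtype=float)

    def u_of(eta):
        return np.clip(eta * b - alpha, 0.0, 1.0)

    nz = b > 0
    if np.count_nonzero(nz) <= k:
        u = nz.astype(float)  # constraint inactive: keep every non-zero group
    else:
        lo, hi = 0.0, float(np.max((alpha[nz] + 1.0) / b[nz]))  # g_t(lo) < 0 <= g_t(hi)
        for _ in range(100):  # bisection on the nondecreasing g_t
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if u_of(mid).sum() < k else (lo, mid)
        u = u_of(hi)

    v = np.zeros_like(t, dtype=float)
    for uj, s, dj in zip(u, groups, d):
        if uj > 0:
            v[s] = uj * t[s] / (lam * dj + uj)
    return v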

3 Optimization procedure

The general training problem we are solving is of the form

\min_{\mathbf{x}\in\mathbb{R}^{n}}F(\mathbf{x})=f(\mathbf{x})+h(\mathbf{x}), \qquad (12)

where $f=\frac{1}{N}\sum_{i=1}^{N}f_{i}:X\to\mathbb{R}$ is continuously differentiable but possibly nonconvex, and $h$ is a convex but nonsmooth function. We adopt ProxGen [44], which can accommodate momentum and has a proven convergence rate with a fixed and reasonable minibatch size of order $\Theta(\sqrt{N})$ (a comprehensive discussion on the selection of the optimization method is provided in Appendix A.4).

Input: Stepsize $\alpha_{t}$, $\{\rho_{t}\}_{t=1}^{T}\in[0,1)$, regularization parameter $\lambda$.
Initialization: $\bm{\theta}_{1}\in\mathbb{R}^{n}$ and $\mathbf{m}_{0}=\mathbf{0}\in\mathbb{R}^{n}$.
for iteration $t=1,\dots,T$:
  Draw a minibatch sample $\xi_{t}$
  $\mathbf{g}_{t}\longleftarrow\nabla f(\bm{\theta}_{t};\xi_{t})$
  $\mathbf{m}_{t}\longleftarrow\rho_{t}\mathbf{m}_{t-1}+(1-\rho_{t})\mathbf{g}_{t}$
  $\bm{\theta}_{t+1}\longleftarrow\operatorname{prox}_{\alpha_{t}\lambda h}\left(\bm{\theta}_{t}-\alpha_{t}\mathbf{m}_{t}\right)$
return $\bm{\theta}$
Algorithm 1: General Stochastic Proximal Gradient Method

Next, we provide a convergence guarantee for Algorithm 1, as given in [45]. This result holds under several regularity assumptions which can be found in Appendix A.4.1.

Corollary 3.2.

Under Assumptions (C-1)–(C-3), Algorithm 1 with a constant minibatch size $b_{t}=b=\Theta(T)$ is guaranteed to yield $\mathbb{E}\left[\mathrm{dist}\left(\mathbf{0},\widehat{\partial}F(\theta)\right)^{2}\right]\leq O\left(T^{-1}\right)$, where $\widehat{\partial}F$ is the Fréchet sub-differential of $F$.

We now propose Algorithm 2 as an implementation of Algorithm 1 for solving equation 5, where $f\triangleq\mathcal{L}$ and $h\triangleq{\mathcal{GS}}_{k}$.

Input: Stepsize $\alpha_{t}$, $\{\rho_{t}\}_{t=1}^{T}\in[0,1)$, regularization parameters $\{\lambda_{l}\}_{l=1}^{L}\in[0,\infty)$.
Initialization: Randomly initialize the weights $\bm{\theta}_{0}\in\mathbb{R}^{n}$, and set $\mathbf{m}_{0}=\mathbf{0}\in\mathbb{R}^{n}$.
for iteration $t=1,\dots,T$:
  Draw a minibatch sample $\xi_{t}$
  $\mathbf{g}_{t}\longleftarrow\nabla\mathcal{L}(\bm{\theta}_{t};\xi_{t})$
  $\mathbf{m}_{t}\longleftarrow\rho_{t}\mathbf{m}_{t-1}+(1-\rho_{t})\mathbf{g}_{t}$
  $\bm{\theta}_{t+1}\longleftarrow\operatorname{prox}_{\alpha_{t}\lambda{\mathcal{GS}}_{k}}\left(\bm{\theta}_{t}-\alpha_{t}\mathbf{m}_{t}\right)$
Prune (optional): the $(m-k)$ groups of $\bm{\theta}$ with the smallest $\ell_{2}$ group norms.
return $\bm{\theta}$
Algorithm 2: Learning a structured $k$-level sparse neural network by proximal SGD with WGSEF regularization
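A condensed PyTorch-style sketch (ours) of Algorithm 2 on a flattened parameter vector follows; prox_wgsef is the routine sketched after Remark 4, groups indexes the flat vector, and all hyperparameter names and values are illustrative rather than the settings used in the experiments.

import torch

def train_wgsef(model, loss_fn, loader, groups, d, k, lam=1e-2, alpha=0.01, rho=0.9, epochs=1):
    # Algorithm 2 sketch: SGD with momentum on the loss, followed at every step by the WGSEF
    # prox applied to the flattened parameter vector.
    params = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
    m = torch.zeros_like(params)
    for _ in range(epochs):
        for x, y in loader:
            torch.nn.utils.vector_to_parameters(params, model.parameters())
            model.zero_grad()
            loss_fn(model(x), y).backward()
            g = torch.cat([p.grad.reshape(-1) for p in model.parameters()])
            m = rho * m + (1.0 - rho) * g                 # momentum buffer m_t
            z = (params - alpha * m).cpu().numpy()        # gradient step
            v = prox_wgsef(z, groups, d, k, alpha * lam)  # prox_{alpha_t * lambda * GS_k}
            params = torch.as_tensor(v, dtype=params.dtype, device=params.device)
    torch.nn.utils.vector_to_parameters(params, model.parameters())
    return model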

Calculating $\nabla\mathcal{L}(\bm{\theta}_{t};\xi_{t})$, commonly done via backpropagation, is at least linear in the number of parameters and obviously becomes more expensive as the number of layers increases. Therefore, the calculation of the prox of the WGSEF is not a bottleneck of the update step (since it is linear in the number of parameters). Notice that in our setting, one option is to make the regularization separable by layer, as indicated by the regularization function's definition $h(\mathbf{x})=\sum_{l=1}^{L}h_{l}(\mathbf{x}_{l})$. In this case, the prox is applied to each layer separately [43, Theorem 6.6], with different $\lambda_{l},k_{l}$ parameters per layer. Another option is to apply the regularization to the groups of all layers collectively, namely, not in a per-layer fashion, to allow more flexibility in group selection. Formally, in this latter option, only a single pair of $\lambda,k$ values needs to be selected, so $\lambda_{l}=\lambda$ and $k_{l}=k$ for all $l\in[L]$. For example, regularization is applied simultaneously to filters across all convolutional layers at once, resulting in different filter sparsity levels in each layer, but overall adhering to the predefined filter sparsity level $k$. The only condition for both options is that the groups do not overlap; a sketch of the per-layer option is given after this paragraph. We note that since Assumptions (C-1)–(C-3) are met, Algorithm 2 converges to an $\epsilon$-stationary point. Another technique for solving equation 12 is the HSPG family of algorithms [46, 47]. These algorithms utilize a two-step procedure in which optimization is first carried out by standard first-order methods (i.e., subgradient or proximal) to find an approximation that is "sufficiently close" to a solution. This step is followed by a half-space step that freezes the sparse groups and applies a tentative gradient step on the dense groups. Over these dense groups, parameters are zeroed out if a sufficient-decrease condition is met; otherwise, a standard gradient step is executed. Notice that the half-space step, as a variant of a gradient method, requires the regularizer $h$ to have a Lipschitz continuous gradient, which is not satisfied in our setting. However, this property is required only for the dense groups, as the gradient is not used in groups that are already sparse. Since continuity is violated only for sparse groups, the condition is satisfied in the required region. Finally, while the "sufficiently close" condition mentioned above cannot be verified in practice, simple heuristics for switching between steps still work well. We can either run the first-order step for a fixed number of iterations before switching to the half-space step, or, alternatively, run the first-order step until the sparsity level stabilizes and then switch to the half-space step. The sparse (zero) groups are defined as $\mathcal{I}^{0}(\mathbf{x}):=\{\gamma\mid\gamma\in\mathcal{G},\ \|\mathbf{x}^{\gamma}\|=0\}$, and the dense (non-zero) groups as $\mathcal{I}^{\neq 0}(\mathbf{x}):=\{\gamma\mid\gamma\in\mathcal{G},\ \|\mathbf{x}^{\gamma}\|\neq 0\}$. The HSPG pseudo-code (proximal-gradient variant) is given in Algorithm 3 (Appendix A.6). We also mention AdaHSPG+ [47], an enhanced variant of the standard HSPG algorithm, which improves upon it by implementing adaptive strategies that optimize performance, focusing on better handling of complex or dynamic problem scenarios where standard HSPG may be less efficient.
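For the first (per-layer) option mentioned above, the layer-separable prox can be sketched as follows (ours); prox_wgsef is the routine sketched after Remark 4, and the argument names are illustrative.

def prox_wgsef_per_layer(layer_vecs, layer_groups, layer_d, layer_k, layer_lam):
    # Layer-separable regularization h(x) = sum_l h_l(x_l): the prox decomposes
    # [43, Theorem 6.6], so it is applied to each layer's flat vector with its own
    # (lambda_l, k_l) pair.
    return [prox_wgsef(t_l, g_l, d_l, k_l, lam_l)
            for t_l, g_l, d_l, k_l, lam_l
            in zip(layer_vecs, layer_groups, layer_d, layer_k, layer_lam)]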

4 Experiments

In this section, we present a comprehensive benchmark of structured sparse-inducing regularization techniques. Our evaluation covers a wide range of model architectures, datasets, regularizers, optimizers, and pruning techniques. This extensive benchmarking demonstrates that WGSEF achieves state-of-the-art performance across all these dimensions.

4.1 Evaluation of sparsity-inducing optimization methods

To demonstrate the performance of Algorithm 2 in terms of compression and accuracy, as compared to state-of-the-art prox-SGD-based optimization methods, we use the following well-known DNN benchmark architectures: VGG16 [48], ResNet18 [49], and MobileNetV1 [50]. These architectures were tested on the CIFAR-10 [51] and Fashion-MNIST [52] datasets. While WGSEF is a regularizer rather than an optimization algorithm, we benchmark it against the Group Lasso regularizer trained with various optimization algorithms that are well known for their strong performance with group-sparsity-inducing regularization. All experiments were conducted over 300 epochs. For the first 150 epochs, we employed Algorithm 2, and for the remaining epochs we used HSPG with the WGSEF acting as the regularizer (i.e., Algorithm 3). Experiments were conducted using a mini-batch size of $b=128$ on an A100 GPU. The coefficient of the WGSEF regularizer was set to $\lambda=10^{-2}$. In Table 1, we compare our results with those reported in [47]. The primary metrics of interest are the neural network group sparsity ratio and the prediction accuracy (Top-1). Notably, WGSEF achieves markedly higher group sparsity than all the other methods, and slightly higher than AdaHSPG+. It should be mentioned that all techniques achieved comparable generalization error rates on the validation datasets.

Table 1: Comparison of state-of-the-art techniques to our WGSEF regularization technique, in terms of group-sparsity-ratio/validation-accuracy (Top-1), both in percentage, for various models and datasets. Our method provides the highest sparsity level with a comparable accuracy.
Model Dataset Prox-SG Prox-SVRG HSPG AdaHSPG+ WGSEF
VGG16 CIFAR-10 54.0 / 90.6 14.7 / 89.4 74.6 / 91.1 76.1 / 91.0 76.8 / 91.5
F-MNIST 19.1 / 93.0 0.5 / 92.7 39.7 / 93.0 51.2 / 92.9 51.9 / 92.8
ResNet18 CIFAR-10 26.5 / 94.1 2.8 / 94.2 41.6 / 94.4 42.1 / 94.5 42.6 / 94.5
F-MNIST 0.0 / 94.8 0.0 / 94.6 10.4 / 94.9 43.9 / 94.9 44.2 / 94.9
MobileNetV1 CIFAR-10 58.1 / 91.7 29.2 / 90.7 65.4 / 92.0 71.5 / 91.8 71.8 /  91.9
F-MNIST 62.6 / 94.2 42.0 / 94.2 74.3 / 94.5 78.9 / 94.6 79.1 / 94.5
Figure 1: Ratio of sparse filters in the convolutional layers, ordered by layer depth in the ResNet18 model, as obtained by Algorithm 2, corresponding to the experiment in row 3 of Table 1.

4.2 Evaluation of different sparsity-inducing regularizers

In this experiment, we trained the deep residual network ResNet40 [49] on CIFAR-10, applying Algorithm 2 with a predefined sparsity level of 55% over all filters in all convolutional layers of the network, over 5 runs. Again, for a fair comparison, the baseline model was trained using SGD; both used an initial learning rate of $\alpha_{0}=0.01$, regularization magnitude $\lambda=0.03$, a batch size of 128, and a cosine annealing learning rate scheduler. Our results in Table 2 demonstrate our method's superiority over state-of-the-art structured-sparsity-inducing regularizers. These results are comparable to those in [19], and are obtained through a grid search that varied the magnitude of regularization w.r.t. the sparsity level. Note that the model trained using our method achieved a $2.14\times$ speedup and a 47.3% reduction in FLOPs.

Table 2: Comparison of state-of-the-art structured-sparsity-inducing regularization methods to our WGSEF technique, in terms of filter sparsity compression and accuracy in percentage, for ResNet40 on CIFAR-10. The WGSEF regularization provides the highest sparsity level with even better accuracy.
Method Error Sparsed Filters
Baseline (SGD) 6.854% 0%
SGL1 [28] 7.760% 50.7%
SGL0 [19] 8.146% 53.4%
SGSCAD [53] 8.026% 52.2%
SGTL1 [19] 8.096% 53.7%
SGTL1L2 [54] 7.968% 53.7%
WGSEF 7.264% 54.3%

4.3 Evaluation of different state-of-the-art pruning techniques on ImageNet

In this subsection, we compare our method to state-of-the-art pruning techniques, which are often used as an alternative for model compression during (or post) training. We train ResNet50 on the ImageNet dataset, using $\lambda=0.05$, an initial learning rate of $\alpha_{0}=0.01$, a sparsity level of $k=0.34$, and a cosine annealing learning rate scheduler. In Table 3, we compare our results to those obtained by the methods in [33]. We emphasize that, as mentioned later, some of the methods require several stages of training, fine-tuning, etc. Our method trains the model from scratch once, as does OTO, and thus this is the fairest comparison. Additionally, it should be noted that all techniques achieved comparable generalization error on the validation datasets, while our method achieved better compression performance compared to all the other techniques.

Table 3: Comparison of different pruning methods with ResNet50 on ImageNet. Our method provides the highest sparsity level and the lowest total number of parameters, with comparable accuracy in both Top-1 and Top-5 metrics.
Method FLOPs Number of Params Top-1 Acc. Top-5 Acc.
Baseline 100% 100% 76.1% 92.9%
DDS-26 [55] 57.0 % 61.2 % 71.8 % 91.9 %
CP [56] 66.7 % - 72.3 % 90.8 %
RRBP [57] 45.4 % - 73.0 % 91.0 %
SFP [58] 41.8 % - 74.6 % 92.1 %
Hinge [59] 46.6 % - 74.7 % -
GBN-60[60] 59.5 % 68.2 % 76.2 % 92.8 %
ResRep [61] 45.5 % - 76.2 % 92.9 %
DDS-26 [62] 57.0% 61.2% 71.8% 91.9%
ThiNet-50 [63] 44.2% 48.3% 71.0% 90.0%
RBP [57] 43.5% 48.0% 71.1% 90.0%
GHS [64] 52.9% - 76.4% 93.1%
SCP [65] 45.7% - 74.2% 92.0%
OTO [33] 34.5% 35.5% 74.7% 92.1%
WGSEF 34.2% 35.1% 74.2% 92.0%

4.4 Evaluation of different group structures

In this experiment, we demonstrate WGSEF's performance across various architectures, datasets, and group structures. We specifically compare WGSEF to standard training using SGD without regularization. Our aim is to demonstrate WGSEF's flexibility by showing its ability to accommodate different group structures, allowing for consideration of the input data, model architecture, hardware properties, etc. We examine the effectiveness of the WGSEF in the LeNet-5 convolutional neural network [66] (the PyTorch variant of the architecture, rather than the Caffe one, is given in Appendix A.5) on the MNIST dataset [67]. The networks were trained without any data augmentation. We apply the WGSEF regularization on filters in the convolutional layers using a predefined value for the sparsity level $k$. Table 4 summarizes the number of remaining filters at convergence, the FLOPs, and the speedups. We evaluate these metrics both for a LeNet-5 baseline (i.e., without sparsity learning) and for our WGSEF sparsification technique. To ensure a fair and accurate comparison, the baseline model was trained using SGD. It can be seen that WGSEF reduces the number of filters in the convolutional layers by half, as dictated by $k=8$, while the accuracy level did not decrease. Furthermore, since the sparsification is structural, there is a significant improvement in FLOPs, as well as in inference latency. Repeating the same experiment, but now constraining the number of non-pruned filters in the second convolutional layer to be at most 4 (i.e., at most a quarter of the baseline), the accuracy slightly deteriorates; however, significant improvements can be observed in both the number of FLOPs and the speedup, as expected. The networks were trained with a learning rate of 0.001, regularization magnitude $\lambda=10^{-5}$, and a batch size of 32 for 150 epochs across 5 runs.

Table 4: Results of running Algorithm 2 on redundant filters in LeNet-5 (in the order conv1-conv2).
LeNet-5 (MNIST)
Error Filters (sparsity level) FLOPs Speedup
Baseline (SGD) 0.84% 6-16 100%-100% 1.00×-1.00×
WGSEF 0.78% 3-8 48.7%-21.6% 2.06×-4.53×
WGSEF 0.89% 3-4 48.7%-14.7% 2.06×-7.31×
LeNet-5 (MNIST) Error Parameters FLOPs Speedup
Unstructured WGSEF 0.76% 75(/150)-1200(/2400) 68.7%-59.2% 1×-1×

In Table 5, we present the results of training both VGG16 and DenseNet40 [68] on CIFAR-100 [69], while applying WGSEF regularization with a predefined sparsity level of half the channels for VGG16 and 60% of the channels for DenseNet40. The baseline model was trained using SGD; both were trained with an initial learning rate of $\alpha=0.01$ and regularization magnitude $\lambda=0.01$.

Table 5: Results of running Algorithm 2 on redundant channels on CIFAR-100, over 250 epochs.
DCNN (CIFAR-100)
Model Error (%) Pruned Channels Overall Density
VGG16 Baseline 26.28 ~0% ~0%
WGSEF 26.46 50% 41.3%
DenseNet40 Baseline 25.36 ~0% ~0%
WGSEF 25.6 60% 42.8%

4.5 Impact of sparsification levels on model error and training dynamics

Here, we examine the effectiveness of WGSEF in training LeNet-5 on the FashionMNIST dataset. The networks were trained without any data augmentation. We apply the WGSEF regularization on filters in the convolutional layers using a predefined value for the sparsity level $k$. Table 6 summarizes the number of remaining filters at convergence, the FLOPs, and the speedups. We evaluate these metrics both for a LeNet-5 baseline (i.e., without sparsity learning) and for our WGSEF sparsification technique. For an accurate and fair comparison, the baseline model was trained using SGD. We use a learning rate of $10^{-4}$, a batch size of 32, a momentum of 0.95, and 15 epochs.

Table 6: Results of training on FashionMNIST while applying WGSEF sparsification (with $\lambda=0.05$) to redundant filters in LeNet-5 (in the order conv1-conv2) and to neurons in the linear layers. The baseline model was trained using SGD.
LeNet-5 (F-MNIST) Error Filters (non-sparse) FC-layers sparsity Speedup
Baseline 11.1% 6-16 9% 1.0×-1.0×
WGSEF 11% 3-8 5% 2×-4.5×
WGSEF 14% 4-6 62% 1.7×-6.1×
WGSEF 12.3% 2-3 1% 2×-7.12×

The second row of Table 6 shows that our method yields a significant decrease in the number of non-zero filters: half of the filters (groups of parameters) were nullified while, at the same time, the model's performance improved. The remaining experiments show that even higher sparsification of the filters is attainable, with only a negligible degradation in the model's accuracy.

Finally, in Figure 2, we illustrate the sparsity level as a function of the epoch number, during the training of the model which corresponds to the last row of Table 6. The desired (predefined) sparsity level was rapidly attained within the first three epochs, while the model continued to improve its accuracy throughout the remaining epochs without compromising the achieved sparsity.

Figure 2: The graph on the left shows the sparsity level as a function of the epoch number, while the graph on the right shows the model's accuracy as a function of the epoch number.

5 Discussion

In this study, we introduce a novel method for structured sparsification in neural network training, aiming to accelerate neural network inference and compress the network's memory footprint, while minimizing accuracy degradation (or even improving accuracy). Our method utilizes a novel regularizer, termed the weighted group sparse envelope function (WGSEF), which is adaptable for pruning different specified neuron groups (e.g., convolutional filters or channels), according to the unique requirements of an NPU's tensor arithmetic. Mathematically, the WGSEF is the optimal convex underestimator of the sum of weighted $\ell_{2}$ norms and an $\ell_{0}$ norm, where the $\ell_{0}$ norm is applied to the vector of squared group norms. During training, the WGSEF regularizer selects the $k$ most essential predefined neuron groups, where $k$, which controls the compression of the network, is configurable by the trainer. Consequently, the trained neural network benefits from reduced inference latency, a more compact size, and decreased power consumption. Additionally, we show that the computational complexity of the prox operator of the WGSEF, a key component of the training phase, is linear in the number of group parameters. This ensures that it is highly efficient and does not constitute a bottleneck in the computational complexity of the training process.

The experimental results show the effectiveness of the WGSEF in achieving high compression ratios (reduced memory demand) and inference speedups, with negligible compromise in accuracy. Compared to previous approaches, the proposed method stood out in its compression capabilities while maintaining similar network performance.

Along with the method's ability to predetermine the extent of network compression to be obtained at training convergence, it is essential to have a prior understanding of the maximum compression level that can be applied without compromising the network's performance, and to set the $k$ parameter accordingly. Naturally, it is also necessary to define the groups to which the pruning will be encouraged.

Future research could extend this work by delving into more intricate group definitions, such as those that could be defined in large language models. Moreover, we suggest studying a different mechanism for assessing the importance of each group being regularized, as an alternative to the current approach, which is based on the magnitude of the group's squared norm.

References

  • [1] Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108(4):485–532, 2020.
  • [2] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.
  • [3] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015b.
  • [4] Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.
  • [5] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pages 2498–2507. PMLR, 2017.
  • [6] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32, 2019.
  • [7] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
  • [8] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [9] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015.
  • [10] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1405.3866, pages 1269–1277, 2014.
  • [11] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC, 2022.
  • [12] Yan Wu, Aoming Liu, Zhiwu Huang, Siwei Zhang, and Luc Van Gool. Neural architecture search as sparse supernet. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10379–10387, 2021.
  • [13] Yibo Yang, Hongyang Li, Shan You, Fei Wang, Chen Qian, and Zhouchen Lin. Ista-nas: Efficient and consistent neural architecture search by sparse coding. Advances in Neural Information Processing Systems, 33:10503–10513, 2020.
  • [14] Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM journal on computing, 24(2):227–234, 1995.
  • [15] Amir Beck and Yehonathan Refael. Sparse regularization via bidualization. Journal of Global Optimization, 82(3):463–482, 2022.
  • [16] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2):301–320, 2005.
  • [17] Andreas Argyriou, Rina Foygel, and Nathan Srebro. Sparse prediction with the $k$-support norm. Advances in Neural Information Processing Systems, 25, 2012.
  • [18] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. Advances in neural information processing systems, 29, 2016.
  • [19] Kevin Bui, Fredrick Park, Shuai Zhang, Yingyong Qi, and Jack Xin. Structured sparsity of convolutional neural networks via nonconvex sparse group regularization. Frontiers in applied mathematics and statistics, page 62, 2021.
  • [20] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE international conference on computer vision, pages 2736–2744, 2017.
  • [21] Jianbo Ye, Xin Lu, Zhe Lin, and James Z Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124, 2018.
  • [22] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through $\ell_0$ regularization. arXiv preprint arXiv:1712.01312, 2017.
  • [23] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. Advances in neural information processing systems, 30, 2017.
  • [24] David L Donoho and Michael Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via $\ell_1$ minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003.
  • [25] Alice Delmer, Anne Ferréol, and Pascal Larzabal. On the complementarity of sparse l0 and cel0 regularized loss landscapes for doa estimation. Sensors, 21(18), 2021.
  • [26] Vadim Lebedev and Victor Lempitsky. Fast convnets using group-wise brain damage, 2015.
  • [27] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
  • [28] Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.
  • [29] Jaehong Yoon and Sung Ju Hwang. Combined group and exclusive sparsity for deep neural networks. In International Conference on Machine Learning, pages 3958–3966. PMLR, 2017.
  • [30] Yang Zhou, Rong Jin, and Steven Chu-Hong Hoi. Exclusive lasso for multi-task feature selection. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 988–995. JMLR Workshop and Conference Proceedings, 2010.
  • [31] Yifei Lou, Penghang Yin, Qi He, and Jack Xin. Computing sparse representation in a highly coherent dictionary based on difference of $\ell_1$ and $\ell_2$. Journal of Scientific Computing, 64:178–196, 2015.
  • [32] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.
  • [33] Tianyi Chen, Bo Ji, Tianyu Ding, Biyi Fang, Guanyi Wang, Zhihui Zhu, Luming Liang, Yixin Shi, Sheng Yi, and Xiao Tu. Only train once: A one-shot neural network training and pruning framework. Advances in Neural Information Processing Systems, 34:19637–19651, 2021.
  • [34] Jiashi Li, Qi Qi, Jingyu Wang, Ce Ge, Yujian Li, Zhangzhang Yue, and Haifeng Sun. Oicsr: Out-in-channel sparsity regularization for compact deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7046–7055, 2019.
  • [35] Tristan Deleu and Yoshua Bengio. Structured sparsity inducing adaptive optimizers for deep learning, 2023.
  • [36] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830, 2016.
  • [37] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In Computer Vision–ECCV 2016, pages 525–542. Springer, 2016.
  • [38] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
  • [39] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. Advances in neural information processing systems, 27, 2014.
  • [40] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
  • [41] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014.
  • [42] Yawei Li, Shuhang Gu, Christoph Mayer, Luc Van Gool, and Radu Timofte. Group sparsity: The hinge between filter pruning and decomposition for network compression, 2020.
  • [43] Amir Beck. First-order methods in optimization. SIAM, 2017.
  • [44] Jihun Yun, Aurélie C Lozano, and Eunho Yang. Adaptive proximal gradient methods for structured neural networks. Advances in Neural Information Processing Systems, 34:24365–24378, 2021.
  • [45] Yang Yang, Yaxiong Yuan, Avraam Chatzimichailidis, Ruud JG van Sloun, Lei Lei, and Symeon Chatzinotas. Proxsgd: Training structured neural networks under regularization and constraints. In International Conference on Learning Representations (ICLR) 2020, 2020.
  • [46] Tianyi Chen, Guanyi Wang, Tianyu Ding, Bo Ji, Sheng Yi, and Zhihui Zhu. Half-space proximal stochastic gradient method for group-sparsity regularized problem. arXiv preprint arXiv:2009.12078, 2020.
  • [47] Yutong Dai, Tianyi Chen, Guanyi Wang, and Daniel Robinson. An adaptive half-space projection method for stochastic optimization problems with group sparse regularization. Transactions on Machine Learning Research, 2023.
  • [48] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [49] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • [50] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [51] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [52] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • [53] Jinchi Lv and Yingying Fan. A unified approach to model selection and sparse recovery using regularized least squares. 2009.
  • [54] Hoang Tran and Clayton Webster. A class of null space conditions for sparse recovery via nonconvex, non-separable minimizations. Results in Applied Mathematics, 3:100011, 2019.
  • [55] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks, 2018.
  • [56] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks, 2017.
  • [57] Yuefu Zhou, Ya Zhang, Yanfeng Wang, and Qi Tian. Accelerate cnn via recursive bayesian pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3306–3315, 2019.
  • [58] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks, 2018.
  • [59] Yawei Li, Shuhang Gu, Christoph Mayer, Luc Van Gool, and Radu Timofte. Group sparsity: The hinge between filter pruning and decomposition for network compression. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8015–8024, 2020.
  • [60] Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, and Ping Wang. Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks, 2019.
  • [61] Xiaohan Ding, Tianxiang Hao, Jianchao Tan, Ji Liu, Jungong Han, Yuchen Guo, and Guiguang Ding. Resrep: Lossless cnn pruning via decoupling remembering and forgetting, 2021.
  • [62] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
  • [63] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017.
  • [64] Huanrui Yang, Wei Wen, and Hai Li. Deephoyer: Learning sparser neural network with differentiable scale-invariant sparsity measures. arXiv preprint arXiv:1908.09979, 2019.
  • [65] Minsoo Kang and Bohyung Han. Operation-aware soft channel pruning using differentiable masks. In International Conference on Machine Learning, pages 5122–5131. PMLR, 2020.
  • [66] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [67] Yann LeCun and Corinna Cortes. MNIST handwritten digit database, 2010.
  • [68] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2018.
  • [69] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-100 (Canadian Institute for Advanced Research).
  • [70] J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
  • [71] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, 2016.
  • [72] Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Advances in neural information processing systems, 29, 2016.
  • [73] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in neural information processing systems, 27, 2014.
  • [74] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26, 2013.

Appendix A Appendix

A.1 Proofs of results in Subsection 2.2

A.1.1 Proof of Lemma 1

Proof.

Let us define the auxiliary positive definite diagonal matrix $D\in\mathbb{R}^{n\times n}$ whose entries are $D_{i,i}=\sqrt{d_{j}}$ for every $i\in s_{j}$, $j\in[m]$ (recall that $d_{i}=d_{j}$ for all $i\in s_{j}$). Now, consider the following chain of equalities:

\displaystyle gs_{k}^{\star}(\tilde{\theta})\displaystyle=\max_{\theta\in\mathbb{R}^{n}}\{\langle\tilde{\theta},\theta\rangle-gs_{k}(\theta)\}
\displaystyle=\max_{\theta\in\mathbb{R}^{n}}\left\{\tilde{\theta}^{\top}\theta-\frac{1}{2}\sum_{j=1}^{m}d_{j}\|\theta_{s_{j}}\|_{2}^{2}-\delta_{C_{k}}(\theta)\right\}
\displaystyle=\max_{\begin{subarray}{c}\theta\in C_{k}\\ \forall i\in s_{j},d_{i}=d_{j}\end{subarray}}\left\{\tilde{\theta}^{\top}\theta-\frac{1}{2}\sum_{j=1}^{m}d_{j}\|\theta_{s_{j}}\|_{2}^{2}\right\}
\displaystyle=\max_{\begin{subarray}{c}\theta\in C_{k}\\ \forall i\in s_{j},d_{i}=d_{j}\end{subarray}}\left\{\tilde{\theta}^{\top}\theta-\frac{1}{2}\theta^{\top}D^{\top}D\theta\right\}
\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\max_{\begin{subarray}{c}t\in\tilde{C}_{k}\\ \forall i\in s_{j},d_{i}=d_{j}\end{subarray}}\left\{\tilde{\theta}^{\top}D^{-1}t-\frac{1}{2}t^{\top}t\right\}
\displaystyle=\max_{\begin{subarray}{c}t\in\tilde{C}_{k}\\ \forall i\in s_{j},d_{i}=d_{j}\end{subarray}}\left\{-\frac{1}{2}\|t-D^{-1}\tilde{\theta}\|_{2}^{2}+\frac{1}{2}\|D^{-1}\tilde{\theta}\|_{2}^{2}\right\}
\displaystyle=\frac{1}{2}\|D^{-1}\tilde{\theta}\|_{2}^{2}+\max_{\begin{subarray}{c}t\in\tilde{C}_{k}\\ \forall i\in s_{j},d_{i}=d_{j}\end{subarray}}\left\{-\frac{1}{2}\|t-D^{-1}\tilde{\theta}\|_{2}^{2}\right\}
\displaystyle=\frac{1}{2}\|D^{-1}\tilde{\theta}\|_{2}^{2}+\max_{\begin{subarray}{c}t\in\tilde{C}_{k}\\ \forall i\in s_{j},d_{i}=d_{j}\end{subarray}}\left\{-\frac{1}{2}\sum_{i=1}^{n}(t_{i}-\left(D^{-1}\right)_{ii}\tilde{\theta}_{i})^{2}\right\}
\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\frac{1}{2}\|D^{-1}\tilde{\theta}\|_{2}^{2}+\max_{\begin{subarray}{c}t\in\tilde{C}_{k}\\ \forall i\in s_{j},d_{i}=d_{j}\end{subarray}}\left\{-\frac{1}{2}\sum_{j=1}^{m}\left\|M_{s_{j}}(t-D^{-1}\tilde{\theta})\right\|_{2}^{2}\right\}
\displaystyle=\frac{1}{2}\|D^{-1}\tilde{\theta}\|_{2}^{2}-\frac{1}{2}\sum_{j=k+1}^{m}\left\|M_{\left\langle{s_{j}}\right\rangle}(D^{-1}\tilde{\theta})\right\|_{2}^{2}
\displaystyle\stackrel{{\scriptstyle(c)}}{{=}}\frac{1}{2}\sum_{j=1}^{k}\left\|M_{\left\langle{s_{j}}\right\rangle}(D^{-1}\tilde{\theta})\right\|_{2}^{2}
\displaystyle=\frac{1}{2}\sum_{j=1}^{k}\frac{1}{d_{j}}\left\|M_{\left\langle{s_{j}}\right\rangle}(\tilde{\theta})\right\|_{2}^{2},

where in $(a)$ we used the change of variables $t=D\theta$, so that the set $\tilde{C}_{k}$ is given by

\tilde{C}_{k}=\left\{\theta:\left\|\left(\|\left(D\theta\right)_{s_{1}}\|^{2},\|\left(D\theta\right)_{s_{2}}\|^{2},\ldots,\|\left(D\theta\right)_{s_{m}}\|^{2}\right)\right\|_{0}\leq k\right\},

$(b)$ follows since the subsets $s_{1},s_{2},\ldots,s_{m}$ are disjoint and cover the index space, $\bigcup_{i=1}^{m}s_{i}=[n]$, so the sum of the squared coordinates over all subsets equals the sum of the squared coordinates of the original vector, and $(c)$ follows since $\sum_{j=k+1}^{m}\left\|M_{\left\langle{s_{j}}\right\rangle}(D^{-1}\tilde{\theta})\right\|_{2}^{2}$ is the sum of the squared $\ell_{2}$-norms of the $m-k$ disjoint subsets of $D^{-1}\tilde{\theta}$ with the smallest $\ell_{2}$-norms, while $\|D^{-1}\tilde{\theta}\|_{2}^{2}$ is the sum of the squared $\ell_{2}$-norms of all $m$ disjoint subsets of $D^{-1}\tilde{\theta}$. \Box

A.1.2 Proof of Lemma 2

Proof.

We first note that

\displaystyle\sum_{i=1}^{k}\left\|M_{\left\langle{s_{i}}\right\rangle}(D^{-1}\tilde{\theta})\right\|_{2}^{2}=\max_{u\in D_{k}}\sum_{i=1}^{m}u_{i}\left\|M_{s_{i}}(D^{-1}\tilde{\theta})\right\|_{2}^{2}, (13)

where

\displaystyle D_{k}\triangleq\left\{u\in\mathbb{R}^{m}\mid 0\leqslant u\leqslant e,\ e^{\top}u\leq k\right\}. (14)

Now, consider the following chain of equalities:

\displaystyle{\mathcal{GS}}_{k}(\theta)\displaystyle=\max_{\tilde{\theta}\in\mathbb{R}^{n}}\left\{\langle\theta,\tilde{\theta}\rangle-gs_{k}^{\star}(\tilde{\theta})\right\}
\displaystyle=\max_{\tilde{\theta}\in\mathbb{R}^{n}}\left\{\theta^{\top}\tilde{\theta}-\frac{1}{2}\sum_{i=1}^{k}\left\|M_{\left\langle{s_{i}}\right\rangle}(D^{-1}\tilde{\theta})\right\|_{2}^{2}\right\}
\displaystyle=\max_{\begin{subarray}{c}\tilde{\theta}\in\mathbb{R}^{n}\\ t=D^{-1}\tilde{\theta}\end{subarray}}\left\{\theta^{\top}Dt-\frac{1}{2}\sum_{i=1}^{k}\left\|M_{\left\langle{s_{i}}\right\rangle}(t)\right\|_{2}^{2}\right\}
\displaystyle=\max_{t\in\mathbb{R}^{n}}\left\{\theta^{\top}Dt-\frac{1}{2}\max_{u\in D_{k}}\sum_{i=1}^{m}u_{i}\left\|M_{s_{i}}(t)\right\|_{2}^{2}\right\}
\displaystyle=\max_{t\in\mathbb{R}^{n}}\left\{\theta^{\top}Dt-\frac{1}{2}\max_{u\in D_{k}}\left\{\sum_{j=1}^{m}u_{j}\left(t^{\top}A_{s_{j}}^{\top}A_{s_{j}}t\right)\right\}\right\}
\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\max_{t\in\mathbb{R}^{n}}\left\{\theta^{\top}Dt+\frac{1}{2}\min_{\mathbf{u}\in D_{k}}\left\{\sum_{j=1}^{m}(-u_{j})\left(t^{\top}A_{s_{j}}t\right)\right\}\right\}
\displaystyle=\frac{1}{2}\max_{t\in\mathbb{R}^{n}}\left\{\min_{\mathbf{u}\in D_{k}}\left\{2\theta^{\top}Dt-\sum_{j=1}^{m}u_{j}t^{\top}A_{s_{j}}t\right\}\right\}
\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\frac{1}{2}\min_{\mathbf{u}\in D_{k}}\left\{\max_{t\in\mathbb{R}^{n}}\left\{2\theta^{\top}Dt-\sum_{j=1}^{m}u_{j}t^{\top}A_{s_{j}}t\right\}\right\}
\displaystyle=\frac{1}{2}\min_{\mathbf{u}\in D_{k}}\left\{\sum_{j=1}^{m}\max_{t\in\mathbb{R}^{n}}2\theta^{\top}A_{s_{j}}Dt-u_{j}t^{\top}A_{s_{j}}t\right\}
\displaystyle=\frac{1}{2}\min_{\mathbf{u}\in D_{k}}\left\{\sum_{j=1}^{m}\max_{t\in\mathbb{R}^{n}}2\sqrt{d_{j}}\theta^{\top}A_{s_{j}}t-u_{j}t^{\top}A_{s_{j}}t\right\}
\displaystyle=\frac{1}{2}\min_{\mathbf{u}\in D_{k}}\left\{\sum_{j=1}^{m}\phi\left(\sqrt{d_{j}}A_{s_{j}}\theta,u_{j}\right)\right\} (15)
\displaystyle=\frac{1}{2}\min_{\mathbf{u}\in D_{k}}\left\{\sum_{j=1}^{m}d_{j}\phi\left(A_{s_{j}}\theta,u_{j}\right)\right\},

where $(a)$ follows from the fact that $A_{s_{j}}$ is a self-adjoint (and idempotent) matrix, and $(b)$ follows since the objective is concave w.r.t. $t$ and convex w.r.t. $u$, together with the minimax theorem [70]. \Box
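For readability, we note that all of the steps above are consistent with the following explicit form of the function $\phi$ (cf. the scalar definition used with $\varphi_{b,\alpha}$ below): for a vector $b$ and a scalar $u\geq 0$,

\phi(b,u)\equiv\begin{cases}\dfrac{\|b\|_{2}^{2}}{u},&u>0,\\ 0,&u=0,\ b=\mathbf{0},\\ \infty,&u=0,\ b\neq\mathbf{0},\end{cases}

so that, for any $b$ in the range of $A_{s_{j}}$, $\max_{t\in\mathbb{R}^{n}}\left\{2b^{\top}t-u\,t^{\top}A_{s_{j}}t\right\}=\phi(b,u)$, which is exactly the inner maximization appearing in the penultimate equality of the chain.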

A.1.3 Proof of Corollary 2.1

Proof.

The claim follows directly from expression (34). \Box

A.2 Proofs of results in Subsection 2.3

A.2.1 Proof of Lemma 3

Proof.

Recall that

v=\operatorname{prox}_{\lambda{\mathcal{GS}}_{k}}(t)=\underset{\theta\in\mathbb{R}^{n}}{\operatorname{argmin}}\left\{\lambda{\mathcal{GS}}_{k}(\theta)+\frac{1}{2}\|\theta-t\|_{2}^{2}\right\}.

Using Lemma 7, the above minimization problem can be written as

\displaystyle\min_{\mathbf{u}\in D_{k}}\min_{\theta\in\mathbb{R}^{n}}\left\{\Phi(\theta,u,t)\equiv\frac{\lambda}{2}\sum_{j=1}^{m}d_{j}\phi\left(A_{s_{j}}\theta,u_{j}\right)+\frac{1}{2}\|\theta-t\|_{2}^{2}\right\}
\displaystyle=\min_{\mathbf{u}\in D_{k}}\min_{\theta\in\mathbb{R}^{n}}\left\{\frac{\lambda}{2}\sum_{j=1}^{m}d_{j}\phi\left(A_{s_{j}}\theta,u_{j}\right)+\frac{1}{2}\sum_{j=1}^{m}\left\|M_{s_{j}}(\theta-t)\right\|_{2}^{2}\right\}
\displaystyle=\min_{\mathbf{u}\in D_{k}}\min_{\theta\in\mathbb{R}^{n}}\left\{\frac{1}{2}\sum_{j=1}^{m}\left(\lambda d_{j}\phi\left(A_{s_{j}}\theta,u_{j}\right)+(\theta-t)^{\top}A_{s_{j}}(\theta-t)\right)\right\}. (16)

Solving for $\theta$, we get that for any $j\in[m]$ with $u_{j}>0$,

\displaystyle\frac{d_{j}\lambda A_{s_{j}}\hat{\theta}}{u_{j}}+A_{s_{j}}(\hat{\theta}-t)\displaystyle=0
\displaystyle A_{s_{j}}\hat{\theta}\left(\frac{d_{j}\lambda}{u_{j}}+1\right)-A_{s_{j}}t\displaystyle=0
\displaystyle A_{s_{j}}\left(\hat{\theta}\left(\frac{d_{j}\lambda}{u_{j}}+1\right)-t\right)\displaystyle=0,

meaning that,

\displaystyle v_{i}=\frac{t_{i}u_{j}}{\lambda d_{j}+u_{j}},\quad j\in[m],\;i\in s_{j}, (17)

or, equivalently,

\displaystyle A_{s_{j}}v=\frac{u_{j}A_{s_{j}}t}{\lambda d_{j}+u_{j}},\quad j\in[m]. (18)

Next, we show that $u$ is the minimizer of the problem $\min_{\mathbf{u}\in D_{k}}\Phi(\theta,u,t)$. Equation (18) also holds when $u_{j}=0$, since in that case $v_{i}=\hat{\theta}_{i}=0$ for all $i\in s_{j}$. Plugging (18) into $\Phi$ yields

\displaystyle\Phi(\hat{\theta},u,t)\displaystyle=\frac{1}{2}\sum_{j=1}^{m}d_{j}\left(\lambda\frac{\hat{\theta}^{\top}A_{s_{j}}\hat{\theta}}{u_{j}}\right)+\frac{1}{2}\|\hat{\theta}-t\|_{2}^{2} (19)
\displaystyle=\frac{1}{2}\sum_{j=1}^{m}\left(\lambda d_{j}\frac{\hat{\theta}^{\top}A_{s_{j}}\hat{\theta}}{u_{j}}+\left\|A_{s_{j}}\left(\hat{\theta}-t\right)\right\|_{2}^{2}\right)
\displaystyle=\frac{1}{2}\sum_{j=1}^{m}\left(\lambda d_{j}\frac{u_{j}^{2}t^{\top}A_{s_{j}}t}{u_{j}\left(\lambda d_{j}+u_{j}\right)^{2}}+\left\|\frac{\lambda d_{j}A_{s_{j}}t}{\lambda d_{j}+u_{j}}\right\|_{2}^{2}\right)
\displaystyle=\frac{1}{2}\sum_{j=1}^{m}\left(\lambda d_{j}\frac{u_{j}t^{\top}A_{s_{j}}t}{\left(\lambda d_{j}+u_{j}\right)^{2}}+\frac{(\lambda d_{j})^{2}t^{\top}A_{s_{j}}t}{\left(\lambda d_{j}+u_{j}\right)^{2}}\right)
\displaystyle=\frac{\lambda}{2}\sum_{j=1}^{m}d_{j}\frac{t^{\top}A_{s_{j}}t}{\lambda d_{j}+u_{j}}
\displaystyle=\frac{\lambda}{2}\sum_{j=1}^{m}\phi\left(\sqrt{d_{j}}A_{s_{j}}t,\lambda d_{j}+u_{j}\right), (20)

which concludes the proof. \Box

A.2.2 Proof of Corollary 3.1

Proof.

Assigning a Lagrange multiplier $\mu\geq 0$ to the inequality constraint $\mathbf{e}^{\top}\mathbf{u}\leq k$ in problem (16), we obtain the Lagrangian function

L(\mathbf{u},\mu)=\sum_{j=1}^{m}\left(\phi\left(\sqrt{d_{j}}A_{s_{j}}t,\lambda d_{j}+u_{j}\right)+\mu u_{j}\right)-k\mu.

Therefore, the dual objective function is given by

q(\mu)\equiv\min_{\mathbf{u}:0\leq\mathbf{u}\leq\mathbf{e}}L(\mathbf{u},\mu)=\sum_{j=1}^{m}\varphi_{b_{j},\alpha_{j}}(\mu)-k\mu, (21)

where $b_{j}=\left\|\sqrt{d_{j}}A_{s_{j}}t\right\|_{2}$ and $\alpha_{j}=\lambda d_{j}\,(>0)$, and where, for any $b\in\mathbb{R}$ and $\alpha\geq 0$, the function $\varphi_{b,\alpha}$ is defined in [15] by

\varphi_{b,\alpha}(\mu)\equiv\min_{0\leq u\leq 1}\{\phi(b,\alpha+u)+\mu u\},\quad\mu\geq 0. (22)

Thus, the dual of problem (9) is the maximization problem

\max\{q(\mu):\mu\geq 0\}. (23)

A direct consequence of [15, Lemma 2.4] is that if $\tilde{\mu}>0$, then the function $\mathbf{u}\mapsto L(\mathbf{u},\tilde{\mu})$ has a unique minimizer over $\{\mathbf{u}\in\mathbb{R}^{m}:\mathbf{0}\leq\mathbf{u}\leq\mathbf{e}\}$, given by $u_{j}=\varphi_{b_{j},\alpha_{j}}^{\prime}(\tilde{\mu})$, where it was shown that

\varphi_{b_{j},\alpha_{j}}(\mu)=\begin{cases}\frac{b_{j}^{2}}{\alpha_{j}+1}+\mu,&\sqrt{\mu}\leq\frac{|b_{j}|}{\alpha_{j}+1},\\ 2|b_{j}|\sqrt{\mu}-\alpha_{j}\mu,&\frac{|b_{j}|}{\alpha_{j}+1}<\sqrt{\mu}<\frac{|b_{j}|}{\alpha_{j}},\\ \frac{b_{j}^{2}}{\alpha_{j}},&\sqrt{\mu}\geq\frac{|b_{j}|}{\alpha_{j}},\end{cases}

for $b_{j}>0$ (and $\varphi_{b_{j},\alpha_{j}}\equiv 0$ otherwise), and the minimizer is given by

u_{j}(\mu)=\begin{cases}1,&\sqrt{\mu}\leq\frac{|b_{j}|}{\alpha_{j}+1},\\ \frac{|b_{j}|}{\sqrt{\mu}}-\alpha_{j},&\frac{|b_{j}|}{\alpha_{j}+1}<\sqrt{\mu}<\frac{|b_{j}|}{\alpha_{j}},\\ 0,&\sqrt{\mu}\geq\frac{|b_{j}|}{\alpha_{j}}.\end{cases}

Problem (23) is concave and differentiable, and thus the optimal $\tilde{\mu}$ satisfies $q^{\prime}(\tilde{\mu})=0$, namely,

q^{\prime}(\tilde{\mu})=\sum_{j=1}^{m}u_{j}(\tilde{\mu})-k=0.

We observe that for any $j\in[m]$ the function $u_{j}(\mu)$ is continuous and monotonically nonincreasing; therefore, utilizing [15, Lemma 3.1], $\mu^{*}=\frac{1}{\eta^{2}}$, where $\eta$ is a root of the nondecreasing function

g_{t}(\eta)\equiv\sum_{j=1}^{m}u_{j}(\eta)-k.

Note that

\displaystyle g_{t}\left(\frac{\lambda\cdot\min_{j\in[m]}\{d_{j}\}}{\left\|\left(\sqrt{d_{1}}\left\|A_{s_{1}}t\right\|_{2},\sqrt{d_{2}}\left\|A_{s_{2}}t\right\|_{2},\ldots,\sqrt{d_{m}}\left\|A_{s_{m}}t\right\|_{2}\right)\right\|_{\infty}}\right)
\displaystyle=\sum_{i=1}^{m}0-k<0,

while

g_{t}\left(\frac{\lambda\left\|\left(d_{1},d_{2},\ldots,d_{m}\right)\right\|_{\infty}+1}{\left\|M_{\left\langle s_{m}\right\rangle}\left(Dt\right)\right\|_{2}}\right)=\sum_{i=1}^{m}1-k>0.

Now, applying [15, Lemma 3.2], we deduce that $u_{j}(\eta)$ can be written as $u_{j}(\eta)=\frac{1}{2}\left(v_{j}(\eta)+w_{j}(\eta)\right)$, where

v_{j}(\eta)\equiv\big{|}\eta|b_{j}|-\alpha_{j}\big{|},\qquad w_{j}(\eta)\equiv 1-\big{|}\eta|b_{j}|-(\alpha_{j}+1)\big{|},\qquad j\in[m],

and thus $g_{t}$ can be reformulated as follows:

g_{t}(\eta)=\frac{1}{2}\sum_{j=1}^{m}v_{j}(\eta)+\frac{1}{2}\sum_{j=1}^{m}w_{j}(\eta)-k.

\Box
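The derivation above suggests a simple numerical recipe for evaluating $v=\operatorname{prox}_{\lambda{\mathcal{GS}}_{k}}(t)$. The following NumPy sketch is illustrative only: it forms $b_{j}$ and $\alpha_{j}$, finds a root of the nondecreasing function $g_{t}$ by plain bisection rather than the linear-time root search of [15], and then recovers $v$ via equation (17); the function name, argument layout, and iteration budget are assumptions, not the exact implementation used in the paper.

import numpy as np

def prox_wgsef(t, groups, d, lam, k, iters=100):
    """Bisection-based sketch of v = prox_{lam * GS_k}(t).

    t:      flat parameter vector (np.ndarray).
    groups: list of index arrays s_1, ..., s_m (disjoint, covering [n]).
    d:      list of positive per-group weights d_1, ..., d_m.
    """
    b = np.array([np.sqrt(d[j]) * np.linalg.norm(t[g]) for j, g in enumerate(groups)])
    alpha = lam * np.asarray(d)                              # alpha_j = lam * d_j

    def u_of_eta(eta):
        # u_j(eta) = clip(eta*b_j - alpha_j, 0, 1), i.e. u_j(mu) with eta = 1/sqrt(mu)
        return np.clip(eta * b - alpha, 0.0, 1.0)

    if np.count_nonzero(b) <= k:                             # constraint e^T u <= k is inactive
        u = (b > 0).astype(float)
    else:
        lo, hi = 0.0, (alpha.max() + 1.0) / b[b > 0].min()   # bracket: g_t(lo) < 0 <= g_t(hi)
        for _ in range(iters):                               # bisection on the nondecreasing g_t
            mid = 0.5 * (lo + hi)
            if u_of_eta(mid).sum() < k:
                lo = mid
            else:
                hi = mid
        u = u_of_eta(0.5 * (lo + hi))

    v = np.zeros_like(t)
    for j, g in enumerate(groups):
        v[g] = t[g] * u[j] / (lam * d[j] + u[j]) if u[j] > 0 else 0.0   # equation (17)
    return v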

A.3 Proof of Corollary 3.2

Proof.

The proof follows from [44, Corollary 1] by taking $C_{t}=\mathbf{0}$ and $\delta=1$. \Box

A.4 Discussion on the selection of the optimization method

The general training problem we are solving is of the form

\displaystyle\min_{\mathbf{x}\in\mathbb{R}^{n}}f(\mathbf{x})+h(\mathbf{x}),

where $f=\frac{1}{N}\sum_{i=1}^{N}f_{i}:X\to\mathbb{R}$ is continuously differentiable, but possibly nonconvex, and $h$ is a convex, but possibly nonsmooth, function. For practical reasons, we cannot store the full gradient $\nabla f(\mathbf{x})$; hence, we would like to use a stochastic-gradient-type algorithm. However, such a structure poses several difficulties from an optimization perspective, as most research on stochastic first-order algorithms does not account for both a nonconvex smooth term and a nonsmooth convex regularizer. In [71], the authors analyze a simple stochastic proximal gradient algorithm, where at each iteration a minibatch gradient step is followed by a proximal step. This algorithm is proved to converge; however, the rate of convergence depends heavily on the minibatch size, and, in fact, for reasonably sized minibatches it will not converge. The work [72] proposes variance-reduction-type algorithms, but since these extend SAGA [73] and SVRG [74] to the nonconvex and nonsmooth setting, they require either storing the gradient of each sample (SAGA), which demands $\mathcal{O}(Nn)$ storage, or recomputing the full gradient every $s\geq N$ iterations (SVRG), which is undesirable for training neural networks.

The ProxSGD algorithm [45] appears appealing for our problem, as it allows for momentum. While the algorithm comes with a convergence guarantee, the authors do not provide a rate, which makes it less appealing given the known issue with the minibatch size. We have found the most suitable optimization algorithm to be ProxGen [44], as it accommodates momentum and also has a proven convergence rate with a fixed and reasonable minibatch size of order $\Theta(\sqrt{N})$. Next, we provide a convergence guarantee for Algorithm 1, as given in [45]. The convergence is stated in terms of the subdifferential defined as follows.
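As an illustration only, the following Python sketch shows the shape of a single ProxGen-style update as used here: a minibatch gradient, an exponential-moving-average momentum buffer, a forward gradient step, and a backward proximal step on the regularizer. The callable prox stands for any routine approximating $\operatorname{prox}_{\alpha_{t}\lambda{\mathcal{GS}}_{k}}$ (e.g., the sketch at the end of Appendix A.2); all names and the exact momentum form are assumptions rather than the implementation of [44].

def proxgen_step(theta, grad, m, lr, rho, lam, prox):
    """One stochastic proximal-gradient step with momentum (illustrative sketch).

    theta, grad, m : flat parameter vector, minibatch gradient, momentum buffer (arrays).
    prox           : callable prox(t, scale) approximating prox_{scale * GS_k}(t).
    """
    m = rho * m + (1.0 - rho) * grad      # first-moment (momentum) estimate
    t = theta - lr * m                    # forward (gradient) step
    theta = prox(t, lr * lam)             # backward (proximal) step on lambda * GS_k
    return theta, m

In our setting, prox could be instantiated, for example, as lambda t, scale: prox_wgsef(t, groups, d, scale, k), reusing the sketch from Appendix A.2.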

Definition 2 (Fréchet Subdifferential).

Let $\varphi$ be a real-valued function. The Fréchet subdifferential of $\varphi$ at $\bar{\theta}$ with $|\varphi(\bar{\theta})|<\infty$ is defined by

\displaystyle\widehat{\partial}\varphi(\bar{\theta})\triangleq\Big{\{}\theta^{*}\in\Omega~{}\Big{|}~{}\liminf\limits_{\theta\rightarrow\bar{\theta}}\frac{\varphi(\theta)-\varphi(\bar{\theta})-\langle\theta^{*},\theta-\bar{\theta}\rangle}{\|\theta-\bar{\theta}\|}\geq 0\Big{\}}.

A.4.1 Assumptions

The following assumptions on the objective function $f$ and the algorithm parameters are required:

  1. (C-1)

    ($L$-smoothness) The loss function $f$ is differentiable, $L$-smooth, and lower-bounded:

    \displaystyle\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|\quad\text{and}\quad f(x^{*})>-\infty.
  2. (C-2)

    (Bounded variance) The stochastic gradient $g_{t}=\nabla f(\theta_{t};\xi)$ is unbiased and has bounded variance:

    \displaystyle\mathbb{E}_{\xi}\big{[}\nabla f(\theta_{t};\xi)\big{]}=\nabla f(\theta_{t}),\quad\mathbb{E}_{\xi}\big{[}\|g_{t}-\nabla f(\theta_{t})\|^{2}\big{]}\leq\sigma^{2}.
  3. (C-3)

    (i) The step vectors are bounded, (ii) the stochastic gradients are bounded, and (iii) the momentum parameter decays exponentially, namely,

    \displaystyle\text{(i)}~{}~{}\|\theta_{t+1}-\theta_{t}\|\leq D,\quad\text{(ii)}~{}~{}\|g_{t}\|\leq G,\quad\text{(iii)}~{}~{}\rho_{t}=\rho_{0}\mu^{t-1},

    with $D,G>0$ and $\rho_{0},\mu\in[0,1)$.

Based on these assumptions we can state the following general convergence guarantee.

A.5 LeNet convolutional neural network architecture

\begin{array}{|c|c|c|c|c|c|}\hline
\text{Layer}&\begin{array}{c}\text{\# filters /}\\ \text{neurons}\end{array}&\text{Filter size}&\text{Stride}&\begin{array}{c}\text{Size of}\\ \text{feature map}\end{array}&\begin{array}{c}\text{Activation}\\ \text{function}\end{array}\\ \hline
\text{Input}&-&-&-&32\times 32\times 1&\\ \hline
\text{Conv 1}&6&5\times 5&1&28\times 28\times 6&\text{ReLU}\\ \hline
\text{MaxPool2d}&&2\times 2&2&14\times 14\times 6&\\ \hline
\text{Conv 2}&16&5\times 5&1&10\times 10\times 16&\text{ReLU}\\ \hline
\text{MaxPool2d}&&2\times 2&2&5\times 5\times 16&\\ \hline
\text{Fully Connected 1}&-&-&-&120&\text{ReLU}\\ \hline
\text{Fully Connected 2}&-&-&-&84&\text{ReLU}\\ \hline
\text{Fully Connected 3}&-&-&-&10&\text{Softmax}\\ \hline
\end{array}
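For concreteness, the following is a minimal PyTorch-style sketch of the LeNet architecture summarized in the table above; the class name and the log-softmax output (in place of a plain softmax, for numerical stability) are illustrative choices rather than the exact training script used in our experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet(nn.Module):
    """LeNet variant matching the table: 32x32x1 input, two conv blocks, three FC layers."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)    # 32x32x1 -> 28x28x6
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)   # 14x14x6 -> 10x10x16
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(F.relu(self.conv1(x)))           # -> 14x14x6
        x = self.pool(F.relu(self.conv2(x)))           # -> 5x5x16
        x = torch.flatten(x, start_dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.log_softmax(self.fc3(x), dim=1)       # softmax output layer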

A.6 Additional algorithm

Input: step sizes $\{\alpha_{t}\}_{t=1}^{T}$, momentum parameters $\{\rho_{t}\}_{t=1}^{T}\subset[0,1)$, regularization parameter $\lambda$, switch condition $\mathcal{S}$, projection threshold $\epsilon$.
Initialization: $\bm{\theta}_{1}\in\mathbb{R}^{n}$ and $\mathbf{m}_{0}=\mathbf{0}\in\mathbb{R}^{n}$.
for iteration $t=1,\dots,T$:
if condition $\mathcal{S}$ is not satisfied:
  Apply Algorithm 2
else:
  Draw a minibatch sample $\xi_{t}$
  $\mathbf{g}_{t}\longleftarrow\nabla f(\bm{\theta}_{t}^{\mathcal{I}^{\neq 0}};\xi_{t})+\nabla h(\bm{\theta}_{t}^{\mathcal{I}^{\neq 0}};\xi_{t})$
  $\tilde{\bm{\theta}}_{t}^{\mathcal{I}^{\neq 0}}\longleftarrow\bm{\theta}_{t}^{\mathcal{I}^{\neq 0}}-\alpha_{t}\mathbf{g}_{t},\;\tilde{\bm{\theta}}_{t}^{\mathcal{I}^{0}}\longleftarrow\mathbf{0}$
  for each group $\gamma\in\mathcal{I}^{\neq 0}$:
   if $\langle\tilde{\bm{\theta}}_{t}^{\gamma},\bm{\theta}_{t}^{\gamma}\rangle<\epsilon\|\bm{\theta}_{t}^{\gamma}\|^{2}$:
    $\tilde{\bm{\theta}}_{t}^{\gamma}\longleftarrow\mathbf{0}$
  $\bm{\theta}_{t+1}\longleftarrow\tilde{\bm{\theta}}_{t}$
return $\bm{\theta}_{T+1}$
Algorithm 3 General Stochastic Proximal Gradient Method
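For concreteness, a minimal NumPy sketch of the else-branch above is given next; here groups is a hypothetical list of index arrays for the currently non-zero groups $\mathcal{I}^{\neq 0}$, and the flattened-vector layout is an assumption for illustration only.

import numpy as np

def halfspace_group_step(theta, grad, groups, lr, eps):
    """Sketch of the else-branch of Algorithm 3: gradient step on the active
    coordinates, then zero every group whose updated weights leave the
    half-space {z : <z, theta_g> >= eps * ||theta_g||^2}."""
    theta_new = theta - lr * grad                       # gradient step
    for g in groups:                                    # g: integer indices of one active group
        if np.dot(theta_new[g], theta[g]) < eps * np.dot(theta[g], theta[g]):
            theta_new[g] = 0.0                          # project the whole group to zero
    return theta_new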