Spectral Pruning for Recurrent Neural Networks
Abstract
Recurrent neural networks (RNNs) are a class of neural networks used in sequential tasks. However, in general, RNNs have a large number of parameters and incur enormous computational costs by repeating their recurrent structures over many time steps. As a method to overcome this difficulty, RNN pruning has attracted increasing attention in recent years, and it brings benefits in terms of reduced computational cost as the time step progresses. However, most existing RNN pruning methods are heuristic. The purpose of this paper is to study a theoretical scheme for RNN pruning. We propose an appropriate pruning algorithm for RNNs inspired by “spectral pruning”, and provide generalization error bounds for compressed RNNs. We also provide numerical experiments that demonstrate our theoretical results and show the effectiveness of our pruning method compared with existing methods.
1 Introduction
Recurrent neural networks (RNNs) are a class of neural networks used in sequential tasks. However, in general, RNNs have a large number of parameters and incur enormous computational costs by repeating their recurrent structures over many time steps. This makes their application to edge-computing devices difficult. To overcome this difficulty, RNN compression has attracted increasing attention in recent years. It brings more benefits in terms of reduced computational cost as the time step progresses, compared to deep neural networks (DNNs) without any recurrent structure. There are many RNN compression methods, such as pruning [Narang; tang2015pruning; Zhang; Lobacheva; wang2019acceleration; Wen2020StructuredPO; lobacheva2020structured], low rank factorization [Kliegl; Tjandra], quantization [Alom; Liu], distillation [Shi; Tang], and sparse training [liu2021selfish; liu2021efficient; dodge2019rnn; wen2017learning]. This paper is devoted to pruning for RNNs, and its purpose is to provide an RNN pruning method with a theoretical background.
Recently, Suzuki et al. [Suzuki] proposed a novel pruning method with a theoretical background, called spectral pruning, for DNNs such as fully connected and convolutional neural network architectures. The idea of the proposed method is to select important nodes for each layer by minimizing information losses (see (2) in [Suzuki]), which can be represented by the layerwise covariance matrix. The minimization only requires linear algebraic operations. Suzuki et al. [Suzuki] also evaluated generalization error bounds for networks compressed using spectral pruning (see Theorems 1 and 2 in [Suzuki]). It was shown that the generalization error bounds are controlled by the degrees of freedom, which are defined based on the eigenvalues of the covariance matrix. Hence, the characteristics of the eigenvalue distribution have an influence on the error bounds. We can also observe that in the generalization error bounds, there is a bias-variance tradeoff corresponding to compressibility. Numerical experiments have also demonstrated the effectiveness of spectral pruning.
In this paper, we extend the theoretical scheme of spectral pruning to RNNs. Our pruning algorithm involves the selection of hidden nodes by minimizing information losses, which can be represented by the time mean of covariance matrices instead of the layerwise covariance matrix that appears in spectral pruning of DNNs. We emphasize that our information losses are derived from the generalization error bound. More precisely, we show that choosing compressed weight matrices that minimize the information losses reduces the generalization error bound evaluated in Section 4.1 (see the sentences after Theorem 4.5). We also remark that Suzuki et al. [Suzuki] did not clearly explain how their information losses are derived. As for DNNs [Suzuki], we provide generalization error bounds for RNNs compressed with our pruning, and interpret the degrees of freedom and the bias-variance tradeoff.
We also provide numerical experiments to compare our method with existing methods. We observe that our method outperforms existing methods and benefits from over-parameterization [chang2020provable; zhang2021understanding] (see Sections 5.2 and 5.3). In particular, our method can compress models with small degradation (see Remark 3.2) when we employ the IRNN, which is an RNN that uses the ReLU activation function and initializes the recurrent weights to the identity matrix and the biases to zero (see [Le]).
2 Related Works
One of the popular compression methods for RNNs is pruning, which removes redundant weights based on certain criteria. For example, magnitude-based weight pruning [Narang; narang2017block; tang2015pruning] prunes trained weights that are smaller than a threshold value decided by the user. This method has to gradually repeat pruning and retraining to ensure that a certain accuracy is maintained. However, recent developments suggest that the costly repetitions might not always be necessary. In one-shot pruning [Zhang; lee2018snip], weights are pruned once, prior to training, based on the spectrum of the recurrent Jacobian. Bayesian sparsification [Lobacheva; molchanov2017variational] induces sparse weight matrices by choosing a log-uniform prior, and weights are likewise pruned once if the variance of the posterior over the weight is large.
While the above methods are weight pruning, our spectral pruning is a structured pruning in which redundant nodes are removed. The advantage of structured pruning over weight pruning is that it reduces computational costs more simply. The implementation advantages of structured pruning are illustrated in [wang2019acceleration]. Although weight pruning from large networks to small networks is less likely to degrade accuracy, it usually requires an accelerator to exploit the sparsity (see [Parashar]). The structured pruning methods discussed in [wang2019acceleration; Wen2020StructuredPO; lobacheva2020structured] induce sparse weight matrices in the training process, prune weights close to zero, and do not repeat fine-tuning. In our pruning, weight matrices are trained in the usual way, the compressed weight matrices are obtained by multiplying the trained weight matrices by a reconstruction matrix, and there is no need to repeat pruning and fine-tuning. The idea of multiplying the trained weight matrix by a reconstruction matrix is similar to low rank factorization [Kliegl; Tjandra; prabhavalkar2016compression; grachev2019compression; denil2013predicting]. In particular, the work [denil2013predicting] is most closely related to spectral pruning; it employs a reconstruction matrix in which the empirical covariance matrix is replaced with a kernel matrix (see Section 3.1 in [denil2013predicting]).
In general, RNN pruning is more difficult than DNN pruning because recurrent architectures are not robust to pruning; that is, even a little pruning causes errors that accumulate over many time steps, and the total error increases significantly. Such a problem peculiar to recurrent architectures is also observed for dropout (see the introductions of [Gal; Zaremba]).
Our motivation is to propose an RNN pruning algorithm with theoretical grounding. Inspired by [Suzuki], we focus on the generalization error bound and design the algorithm so that the generalization error bound becomes smaller. Thus, the derivation of our pruning method is theoretical, whereas existing methods such as magnitude-based pruning [Narang; narang2017block; tang2015pruning; wang2019acceleration; Wen2020StructuredPO] are heuristic. For studies of generalization error bounds for RNNs, we refer to [tu2019understanding; Chen; akpinar2019sample; joukovsky2021generalization].
3 Pruning Algorithm
We propose a pruning algorithm for RNNs inspired by [Suzuki]. See Appendix A for a review of spectral pruning for DNNs. Let $D = \{(X^i, Y^i)\}_{i=1}^{n}$ be the training data with time series sequences $X^i = (x^i_t)_{t=1}^{T}$ and $Y^i = (y^i_t)_{t=1}^{T}$, where $x^i_t \in \mathbb{R}^{d_x}$ is an input and $y^i_t \in \mathbb{R}^{d_y}$ is an output at time $t$. The training data are independently identically distributed. To train the appropriate relationship between the input $x_t$ and the output $y_t$, we consider RNNs of the form

$$ h_t = \sigma\big( W^{\mathrm{in}} x_t + W h_{t-1} + b \big), \qquad f_t(X) = W^{\mathrm{out}} h_t + c, $$

for $t = 1, \dots, T$, where $\sigma$ is an activation function, $h_t \in \mathbb{R}^{m}$ is the hidden state with the initial state $h_0$, $W^{\mathrm{in}} \in \mathbb{R}^{m \times d_x}$, $W \in \mathbb{R}^{m \times m}$, and $W^{\mathrm{out}} \in \mathbb{R}^{d_y \times m}$ are weight matrices, and $b \in \mathbb{R}^{m}$ and $c \in \mathbb{R}^{d_y}$ are biases. Here, an element-wise activation operator is employed, i.e., we define $\sigma(v) = (\sigma(v_1), \dots, \sigma(v_m))^\top$ for $v \in \mathbb{R}^{m}$.
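To make the recursion above concrete, the following is a minimal NumPy sketch of the forward pass of such an RNN; it is an illustration under the notation introduced here (array names such as `W_in`, `W`, `W_out`, `b`, `c` are ours), not code released with this paper.

```python
import numpy as np

def rnn_forward(X, W_in, W, W_out, b, c, sigma=np.tanh):
    """Forward pass of a vanilla (Elman-type) RNN.

    X      : (T, d_x) array, one input vector per time step
    W_in   : (m, d_x) input-to-hidden weight matrix
    W      : (m, m)   hidden-to-hidden weight matrix
    W_out  : (d_y, m) hidden-to-output weight matrix
    b, c   : hidden and output biases
    sigma  : element-wise activation function
    Returns hidden states H of shape (T, m) and outputs Y of shape (T, d_y).
    """
    T = X.shape[0]
    h = np.zeros(W.shape[0])                  # initial hidden state h_0
    H, Y = [], []
    for t in range(T):
        h = sigma(W_in @ X[t] + W @ h + b)    # h_t = sigma(W_in x_t + W h_{t-1} + b)
        H.append(h)
        Y.append(W_out @ h + c)               # f_t(X) = W_out h_t + c
    return np.stack(H), np.stack(Y)
```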
Let $\hat f$ be a trained RNN obtained from the training data, with weight matrices $\widehat{W}^{\mathrm{in}}$, $\widehat{W}$, and $\widehat{W}^{\mathrm{out}}$, and biases $\hat b$ and $\hat c$, i.e., $\hat h_t = \sigma(\widehat{W}^{\mathrm{in}} x_t + \widehat{W} \hat h_{t-1} + \hat b)$ and $\hat f_t(X) = \widehat{W}^{\mathrm{out}} \hat h_t + \hat c$ for $t = 1, \dots, T$. We denote the hidden state by $\hat h_t = \hat h_t(x_1, \dots, x_t)$, regarded as a function of the inputs $x_1, \dots, x_t$. Our aim is to compress the trained network $\hat f$ to a smaller network without loss of performance to the extent possible.
Let $J = \{j_1, \dots, j_{m^\sharp}\} \subset \{1, \dots, m\}$ be an index set with $j_1 < \dots < j_{m^\sharp}$, where $m$ is the number of hidden nodes of the trained RNN, and let $m^\sharp$ be the number of hidden nodes of the compressed RNN, with $m^\sharp \leq m$. We denote by $h_J = (h_{j_1}, \dots, h_{j_{m^\sharp}})^\top$ the subvector of $h$ corresponding to the index set $J$, where $h_j$ represents the $j$-th component of the vector $h$.
(i) Input information loss. The input information loss is defined by

$$ L^{(\mathrm{in})}_\lambda(J) = \min_{A \in \mathbb{R}^{m \times m^\sharp}} \Big\{ \, \| \hat h - A \hat h_J \|_n^2 + \lambda \| A \|_F^2 \, \Big\}, \qquad (3.1) $$

where $\| \cdot \|_n$ is the empirical $L^2$-norm with respect to the samples and the time steps, i.e.,

$$ \| g \|_n^2 = \frac{1}{nT} \sum_{i=1}^{n} \sum_{t=1}^{T} \big| g(x^i_1, \dots, x^i_t) \big|^2, $$

where $| \cdot |$ is the Euclidean norm, $\| \cdot \|_F$ is the Frobenius norm, and $\lambda > 0$ is the regularization parameter. Here, for index sets $I, K \subset F := \{1, \dots, m\}$, $\widehat{\Sigma}_{I,K}$ denotes the submatrix of $\widehat{\Sigma}$ corresponding to the rows in $I$ and the columns in $K$, i.e., $\widehat{\Sigma}_{I,K} = (\widehat{\Sigma}_{i,k})_{i \in I, k \in K}$. Based on the linear regularization theory (see, e.g., [gockenbach2016linear]), there exists a unique solution of the minimization problem in (3.1), which has the form

$$ \hat A_J = \widehat{\Sigma}_{F,J} \big( \widehat{\Sigma}_{J,J} + \lambda I_{m^\sharp} \big)^{-1}, \qquad (3.2) $$

where $\widehat{\Sigma}$ is the (noncentered) empirical covariance matrix of the hidden state with respect to the samples and the time steps, i.e.,

$$ \widehat{\Sigma} = \frac{1}{nT} \sum_{i=1}^{n} \sum_{t=1}^{T} \hat h_t(x^i_1, \dots, x^i_t) \, \hat h_t(x^i_1, \dots, x^i_t)^\top. \qquad (3.3) $$

We term the unique solution $\hat A_J$ the reconstruction matrix. Here, we would like to emphasize that the time mean of the covariance matrix is employed for RNNs, whereas the layerwise covariance matrix is employed for DNNs (see Appendix A). By substituting the explicit formula (3.2) of the reconstruction matrix into (3.1), the input information loss is reformulated as

$$ L^{(\mathrm{in})}_\lambda(J) = \mathrm{Tr}\Big[ \widehat{\Sigma} - \widehat{\Sigma}_{F,J} \big( \widehat{\Sigma}_{J,J} + \lambda I_{m^\sharp} \big)^{-1} \widehat{\Sigma}_{J,F} \Big]. \qquad (3.4) $$
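As a sketch of how these quantities can be computed in practice, the snippet below forms the time-averaged empirical covariance matrix (3.3), the reconstruction matrix (3.2), and the reformulated input information loss (3.4) from the hidden states of a trained RNN. It assumes the scalar-regularized formulation written above, and all function and variable names are illustrative.

```python
import numpy as np

def covariance(H_all):
    """(Noncentered) empirical covariance of the hidden state, averaged over samples and time.

    H_all : array of shape (n, T, m) holding hat{h}_t(x_1^i, ..., x_t^i) for every sample i and time t.
    """
    n, T, m = H_all.shape
    Hs = H_all.reshape(n * T, m)
    return Hs.T @ Hs / (n * T)                     # equation (3.3)

def reconstruction_matrix(Sigma, J, lam):
    """hat{A}_J = Sigma[:, J] (Sigma[J, J] + lam I)^{-1}, cf. equation (3.2)."""
    J = np.asarray(J)
    S_JJ = Sigma[np.ix_(J, J)]
    return Sigma[:, J] @ np.linalg.inv(S_JJ + lam * np.eye(len(J)))

def input_information_loss(Sigma, J, lam):
    """Tr[Sigma - Sigma[:, J](Sigma[J, J] + lam I)^{-1} Sigma[J, :]], cf. equation (3.4)."""
    A = reconstruction_matrix(Sigma, J, lam)
    return np.trace(Sigma) - np.trace(A @ Sigma[np.asarray(J), :])
```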
(ii) Output information loss. The hidden state of an RNN is forwardly propagated to the next hidden state and to the output, and hence two output information losses are defined by

$$ L^{(\mathrm{out},1)}_\lambda(J) = \sum_{k=1}^{m} \min_{\beta \in \mathbb{R}^{m^\sharp}} \Big\{ \, \| \widehat{W}_{k,:} \hat h - \beta^\top \hat h_J \|_n^2 + \lambda |\beta|^2 \, \Big\}, \qquad (3.5) $$

$$ L^{(\mathrm{out},2)}_\lambda(J) = \sum_{k=1}^{d_y} \min_{\beta \in \mathbb{R}^{m^\sharp}} \Big\{ \, \| \widehat{W}^{\mathrm{out}}_{k,:} \hat h - \beta^\top \hat h_J \|_n^2 + \lambda |\beta|^2 \, \Big\}, \qquad (3.6) $$

where $\widehat{W}_{k,:}$ and $\widehat{W}^{\mathrm{out}}_{k,:}$ denote the $k$-th rows of the matrices $\widehat{W}$ and $\widehat{W}^{\mathrm{out}}$, respectively. Then, the unique solutions of the minimization problems in (3.5) and (3.6) are the rows of $\widehat{W} \hat A_J$ and $\widehat{W}^{\mathrm{out}} \hat A_J$, respectively. By substituting them into (3.5) and (3.6), the output information losses are reformulated as

$$ L^{(\mathrm{out},1)}_\lambda(J) = \mathrm{Tr}\Big[ \widehat{W} \Big( \widehat{\Sigma} - \widehat{\Sigma}_{F,J} \big( \widehat{\Sigma}_{J,J} + \lambda I_{m^\sharp} \big)^{-1} \widehat{\Sigma}_{J,F} \Big) \widehat{W}^\top \Big], \qquad (3.7) $$

$$ L^{(\mathrm{out},2)}_\lambda(J) = \mathrm{Tr}\Big[ \widehat{W}^{\mathrm{out}} \Big( \widehat{\Sigma} - \widehat{\Sigma}_{F,J} \big( \widehat{\Sigma}_{J,J} + \lambda I_{m^\sharp} \big)^{-1} \widehat{\Sigma}_{J,F} \Big) \big( \widehat{W}^{\mathrm{out}} \big)^\top \Big]. \qquad (3.8) $$

Here, we remark that the output information losses $L^{(\mathrm{out},1)}_\lambda(J)$ and $L^{(\mathrm{out},2)}_\lambda(J)$ are bounded above in terms of the input information loss $L^{(\mathrm{in})}_\lambda(J)$ (see Remark 4.3).
(iii) Compressed RNNs. We construct the compressed RNN $f^\sharp$ by $h^\sharp_t = \sigma\big( W^{\mathrm{in}\sharp} x_t + W^\sharp h^\sharp_{t-1} + b^\sharp \big)$ and $f^\sharp_t(X) = W^{\mathrm{out}\sharp} h^\sharp_t + c^\sharp$ for $t = 1, \dots, T$, where $W^{\mathrm{in}\sharp} = \widehat{W}^{\mathrm{in}}_{J,:}$, $W^\sharp = \widehat{W}_{J,:} \hat A_J$, $W^{\mathrm{out}\sharp} = \widehat{W}^{\mathrm{out}} \hat A_J$, $b^\sharp = \hat b_J$, and $c^\sharp = \hat c$.
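A minimal sketch of step (iii) under the same illustrative names: the compressed parameters are obtained by restricting rows to the index set `J` and multiplying by the reconstruction matrix wherever the full hidden state is fed forward.

```python
import numpy as np

def compress_rnn(W_in, W, W_out, b, c, Sigma, J, lam):
    """Build compressed RNN parameters from trained parameters and an index set J."""
    A = reconstruction_matrix(Sigma, J, lam)   # (m, |J|), from the sketch above
    J = np.asarray(J)
    W_in_s  = W_in[J, :]                       # keep only the rows of the input-to-hidden weights in J
    W_s     = W[J, :] @ A                      # hidden-to-hidden: restrict rows, reconstruct h_{t-1} from its subvector
    W_out_s = W_out @ A                        # hidden-to-output: reconstruct h_t from its subvector
    return W_in_s, W_s, W_out_s, b[J], c
```

The compressed network can then be run with the earlier `rnn_forward` sketch using these reduced parameters.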
(iv) Optimization. To select an appropriate index set $J$, we consider the following optimization problem, which minimizes a convex combination of the input and the two output information losses:

$$ \min_{J \subset \{1, \dots, m\},\ |J| = m^\sharp} \ \theta_1 L^{(\mathrm{in})}_\lambda(J) + \theta_2 L^{(\mathrm{out},1)}_\lambda(J) + \theta_3 L^{(\mathrm{out},2)}_\lambda(J) \qquad (3.9) $$

for $\theta_1, \theta_2, \theta_3 \geq 0$ with $\theta_1 + \theta_2 + \theta_3 = 1$, where $m^\sharp$ is a prespecified number. The optimal index set is obtained by a greedy algorithm. We term this method spectral pruning (for a schematic diagram of spectral pruning, see Figure 1). The reason why the information losses are employed in the objective will be explained theoretically later, when the error bounds in Remark 4.3 and Theorem 4.5 are provided. We summarize our pruning algorithm in the following.
[Figure 1: Schematic diagram of spectral pruning.]
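Step (iv) can be implemented with a simple greedy search that, at each round, adds the node whose inclusion most decreases the combined objective (3.9). The sketch below reuses the illustrative helper functions above; it is a plain illustration of the greedy strategy, not an optimized implementation.

```python
import numpy as np

def output_information_loss(Sigma, J, lam, M):
    """Tr[M (Sigma - Sigma[:, J](Sigma[J, J] + lam I)^{-1} Sigma[J, :]) M^T], cf. (3.7)-(3.8)."""
    J = np.asarray(J)
    A = reconstruction_matrix(Sigma, J, lam)
    R = Sigma - A @ Sigma[J, :]                # residual covariance after reconstruction
    return np.trace(M @ R @ M.T)

def greedy_select(Sigma, W, W_out, lam, m_sharp, theta=(1.0, 0.0, 0.0)):
    """Greedily choose an index set J of size m_sharp that (approximately) minimizes (3.9)."""
    def objective(J):
        return (theta[0] * input_information_loss(Sigma, J, lam)
                + theta[1] * output_information_loss(Sigma, J, lam, W)
                + theta[2] * output_information_loss(Sigma, J, lam, W_out))

    m, J = Sigma.shape[0], []
    for _ in range(m_sharp):
        candidates = [j for j in range(m) if j not in J]
        J.append(min(candidates, key=lambda j: objective(J + [j])))
    return sorted(J)
```

With `theta = (1.0, 0.0, 0.0)` the objective reduces to the input information loss alone, which is the setting used in the experiments of Section 5.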
Remark 3.1.
In the case of the regularization parameter $\lambda = 0$, spectral pruning can still be applied, but the following point must be noted. In this case, the uniqueness of the solution of the minimization problem in (3.1) with respect to $A$ does not generally hold (i.e., there might be several reconstruction matrices). One of the solutions is $\hat A_J = \widehat{\Sigma}_{F,J} \widehat{\Sigma}_{J,J}^{\dagger}$, which is the limit of (3.2) as $\lambda \to 0$, where $\widehat{\Sigma}_{J,J}^{\dagger}$ is the pseudo-inverse of $\widehat{\Sigma}_{J,J}$. It should be noted that $\widehat{\Sigma}_{J,J}^{\dagger}$ coincides with the usual inverse $\widehat{\Sigma}_{J,J}^{-1}$ when $m^\sharp$ is smaller than or equal to the rank of the covariance matrix $\widehat{\Sigma}$.
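As noted in the remark, the $\lambda = 0$ case can be handled with the Moore-Penrose pseudo-inverse; an illustrative one-function variant of the earlier sketch:

```python
import numpy as np

def reconstruction_matrix_lam0(Sigma, J):
    """One valid reconstruction matrix at lam = 0: Sigma[:, J] pinv(Sigma[J, J]) (cf. Remark 3.1)."""
    J = np.asarray(J)
    return Sigma[:, J] @ np.linalg.pinv(Sigma[np.ix_(J, J)])
```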
Remark 3.2.
We consider the case of the regularization parameter $\lambda = 0$ and $m^\sharp = m_0$, where $m_0$ denotes the number of non-zero rows of $\widehat{\Sigma}$. Here, we would like to remark on the relation between $m_0$ and pruning. Let $J_0$ be the index set such that its complement $J_0^c$ corresponds to the zero rows of $\widehat{\Sigma}$. Then, by the definition (3.3) of $\widehat{\Sigma}$, we have, for $j \in J_0^c$, $i = 1, \dots, n$, and $t = 1, \dots, T$,

$$ \hat h_{t,j}(x^i_1, \dots, x^i_t) = 0, $$

which implies that $I_{F,J_0}$ is a trivial solution of the minimization problem in (3.1) because $\| \hat h - I_{F,J_0} \hat h_{J_0} \|_n = 0$. Here, $I_{F,J_0}$ is the submatrix of the identity matrix corresponding to the index sets $F$ and $J_0$. If we choose $I_{F,J_0}$ as the reconstruction matrix, then the trivial compressed weights can be obtained by simply removing the columns corresponding to $J_0^c$, i.e., $W^\sharp = \widehat{W}_{J_0,J_0}$ and $W^{\mathrm{out}\sharp} = \widehat{W}^{\mathrm{out}}_{F,J_0}$, and its network coincides with the trained network on the training data, i.e., for $i = 1, \dots, n$ and $t = 1, \dots, T$,

$$ f^\sharp_t(X^i) = \hat f_t(X^i), $$

which means that the trained RNN is compressed to size $m_0$ without degradation. On the other hand, in the case of $m^\sharp < m_0$, $I_{F,J}$ is not a solution of the minimization problem for any choice of the index set $J$, which means that the compressed network using $\hat A_J$ is closer to the trained network than that using $I_{F,J}$. Therefore, spectral pruning essentially contributes to compression when $m^\sharp < m_0$.
4 Generalization Error Bounds for Compressed RNNs
In this section, we discuss the generalization error bounds for compressed RNNs. In Subsection 4.1, the error bounds for general compressed RNNs are evaluated to explain the reason for deriving spectral pruning discussed in Section 3 in the error bound term. In Subsection 4.2, the error bounds for RNNs compressed with spectral pruning are evaluated.
4.1 Error bound for general compressed RNNs
Let the training data be generated independently and identically from the true distribution, let $f^\sharp$ be a general compressed RNN, and assume that it belongs to the following function space:
where is the marginal distribution of with respect to , and , , , , are the upper bounds of the compressed weights , , , biases , and , respectively. Here, denotes the Frobenius norm.
Assumption 4.1.
The following assumptions are made: (i) the marginal distribution of the input at each time step is bounded, i.e., there exists a constant, independent of the time step, that bounds the norm of the input for all time steps; and (ii) the activation function satisfies $\sigma(0) = 0$ and is 1-Lipschitz, i.e., $|\sigma(x) - \sigma(x')| \leq |x - x'|$ for all $x, x'$.
Under these assumptions, we obtain the following approximation error bounds between the trained network and compressed networks .
Proposition 4.2.
Let Assumption 4.1 hold. Let be sampled i.i.d. from the distribution . Then, for all and with , we have
(4.1)
Here, “$\lesssim$” means that the left-hand side in (4.1) is bounded above by the right-hand side times a constant independent of the trained weights and biases and the compressed weights and biases. The proof is given by direct computations. For the exact statement and proof, see Appendix B.
Remark 4.3.
For the RNN , the training error with respect to the -th component of the output is defined as
where and is a loss function. The generalization error with respect to the -th component of the output is defined as
where the expectation is taken with respect to .
Assumption 4.4.
The following assumptions are made: (i) The loss function is bounded, i.e., there exists a constant such that for all , , . (ii) is -Lipschitz continuous, i.e., for all
We obtain the following generalization error bound for .
Theorem 4.5.
Here, “$\lesssim$” means that the left-hand side in (4.3) is bounded above by the right-hand side times a constant independent of the trained weights and biases, the compressed weights and biases, the compressed number of hidden nodes $m^\sharp$, and the number of samples $n$. We remark that some omitted constants blow up as the number of time steps increases, but they can be controlled by increasing the number of samples (see Theorem C.1). The idea behind the proof is that the generalization error is decomposed into the training, approximation, and estimation errors. The approximation and estimation errors are evaluated using Proposition 4.2 and an estimate of the Rademacher complexity, respectively. For the exact statement and proof, see Appendix C.
The second term in (4.3) is the approximation error bound between $\hat f$ and $f^\sharp$, regarded as the bias, which is given by Proposition 4.2, while the third term is the estimation error bound, regarded as the variance. It can be observed that minimizing the corresponding terms with respect to $W^\sharp$ and $W^{\mathrm{out}\sharp}$ is equivalent to minimizing the output information losses (3.5) and (3.6) with $\lambda = 0$, respectively, which means that (iii) in Section 3 with $\lambda = 0$ constructs the compressed RNN such that the bias term becomes smaller. Considering $\lambda > 0$ prevents the blow-up of $W^\sharp$ and $W^{\mathrm{out}\sharp}$, which means that the regularization parameter plays an important role in preventing the blow-up of the variance term, because the variance term includes the upper bounds of $W^\sharp$ and $W^{\mathrm{out}\sharp}$. Therefore, (iii) with $\lambda > 0$ constructs the compressed RNN such that the generalization error bound becomes smaller. In addition, selecting an optimal $J$ by minimizing the information losses (see (iv) in Section 3) further decreases the error bound.
4.2 Error bound for RNNs compressed with spectral pruning
Next, we evaluate the generalization error bounds for the RNN compressed using the reconstruction matrix (see (iii) in Section 3). We define the degrees of freedom by

$$ N(\lambda) = \mathrm{Tr}\big[ \widehat{\Sigma} ( \widehat{\Sigma} + \lambda I )^{-1} \big] = \sum_{j=1}^{m} \frac{\mu_j}{\mu_j + \lambda}, $$

where $\mu_j$ is an eigenvalue of $\widehat{\Sigma}$. Throughout this subsection, the regularization parameter $\lambda$ is chosen such that it satisfies

(4.4)

for a prespecified parameter. Here, $\ell_\lambda(j)$ is the leverage score defined by

$$ \ell_\lambda(j) = \frac{1}{N(\lambda)} \big[ \widehat{\Sigma} ( \widehat{\Sigma} + \lambda I )^{-1} \big]_{jj} = \frac{1}{N(\lambda)} \sum_{k=1}^{m} \frac{\mu_k}{\mu_k + \lambda} \, U_{jk}^2 \qquad (4.5) $$

for $j = 1, \dots, m$, where $U$ is the orthogonal matrix that diagonalizes $\widehat{\Sigma}$, i.e., $\widehat{\Sigma} = U \, \mathrm{diag}(\mu_1, \dots, \mu_m) \, U^\top$. The leverage score contains the information of the eigenvalues and eigenvectors of $\widehat{\Sigma}$, and the components with large leverage scores correspond to the important nodes from the viewpoint of the spectral information of $\widehat{\Sigma}$. Let $q$ be the probability measure on $\{1, \dots, m\}$ defined by

$$ q(j) = \ell_\lambda(j), \qquad j = 1, \dots, m. \qquad (4.6) $$
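The degrees of freedom and the leverage scores depend only on the spectrum of the covariance matrix, so they can be computed directly from $\widehat{\Sigma}$. The sketch below assumes the normalized definition of $\ell_\lambda$ written above (illustrative code and names).

```python
import numpy as np

def degrees_of_freedom(Sigma, lam):
    """N(lam) = sum_j mu_j / (mu_j + lam), with mu_j the eigenvalues of Sigma."""
    mu = np.linalg.eigvalsh(Sigma)
    return np.sum(mu / (mu + lam))

def leverage_scores(Sigma, lam):
    """Normalized leverage scores: diag(Sigma (Sigma + lam I)^{-1}) / N(lam).
    They sum to one, so they also define the probability measure q in (4.6)."""
    m = Sigma.shape[0]
    scores = np.diag(Sigma @ np.linalg.inv(Sigma + lam * np.eye(m)))
    return scores / scores.sum()
```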
Proposition 4.6.
The proof is given in Appendix E. In the proof, we essentially refer to the previous work [Bach]. Combining (4.2) and (4.7), we conclude that
(4.8)
It can be observed that the approximation error bound (4.8) is controlled by the degrees of freedom. If the eigenvalues of $\widehat{\Sigma}$ decrease rapidly, then $N(\lambda)$ is a rapidly decreasing function of $\lambda$. Therefore, in that case, we can choose a smaller $\lambda$ even when $m^\sharp$ is fixed. We will numerically study the relationship between the eigenvalue distribution and the input information loss in Section 5.1.
We make the following additional assumption.
Assumption 4.7.
Assume that the upper bounds for the trained weights and biases are given by , , , , and .
We have the following generalization error bound.
Theorem 4.8.
Here, “$\lesssim$” means that the left-hand side in (4.9) is bounded above by the right-hand side times a constant independent of $m^\sharp$, $\lambda$, and $n$. We remark that some omitted constants blow up as the number of time steps increases, but they can be controlled by increasing the number of samples (see Theorem F.1). The proof is given by combining Theorem 4.5 with Proposition 4.6. For the exact statement and proof, see Appendix F. It can be observed that in (4.9), a bias-variance tradeoff exists with respect to $m^\sharp$. When $m^\sharp$ is large, $\lambda$ can be chosen smaller in condition (4.4), which implies that the bias term (the second term in (4.9)) becomes smaller, but the variance term (the third term in (4.9)) becomes larger. In contrast, the bias becomes larger and the variance becomes smaller when $m^\sharp$ is small. Further remarks on Theorem 4.8 are given in Appendix G.
5 Numerical Experiments
In this section, numerical experiments are presented to demonstrate our theoretical results and to show the effectiveness of spectral pruning compared with existing methods. In Sections 5.1 and 5.2, we select pixel-MNIST as our task and employ the IRNN, which is an RNN that uses the ReLU activation function and initializes the recurrent weights to the identity matrix and the biases to zero (see [Le]). In Section 5.3, we select the PTB [marcus1993building] and employ an RNN language model (RNNLM) whose RNN layer is of the orthodox Elman type. For RNN training details, see Appendix H. We choose the parameters $(\theta_1, \theta_2, \theta_3) = (1, 0, 0)$ in (iv) of Section 3, i.e., we minimize only the input information loss. This choice is not so problematic because the bounds on the output information losses automatically become smaller as the input one is minimized (see Remark 4.3). We choose the regularization parameter $\lambda = 0$; this choice regards $\hat f$ as a well-trained network and gives priority to minimizing the approximation error between $f^\sharp$ and $\hat f$ (see the discussion below Theorem 4.5).
5.1 Eigenvalue distributions and information losses
First, we numerically study the relationship between the eigenvalue distribution and the information losses. Figure 2 (left) shows the eigenvalue distribution of the covariance matrix $\widehat{\Sigma}$ with 128 hidden nodes, where the eigenvalues are sorted in decreasing order. In this experiment, almost half of the eigenvalues are zero, which cannot be visualized in the figure. Figure 2 (right) shows the input information loss versus the compressed number of hidden nodes $m^\sharp$. The information losses vanish when $m^\sharp$ reaches the number of non-zero rows of $\widehat{\Sigma}$ (see Remark 3.2). The blue and pink curves correspond to MNIST (http://yann.lecun.com/exdb/mnist/) and FashionMNIST (https://github.com/zalandoresearch/fashion-mnist), respectively. It can be observed that the eigenvalues for MNIST decrease more rapidly than those for FashionMNIST, and the information losses for MNIST decrease more rapidly than those for FashionMNIST. This phenomenon coincides with the interpretation of Proposition 4.6 (see the discussion below (4.8)).
[Figure 2: Left: eigenvalue distribution of the covariance matrix, sorted in decreasing order. Right: input information loss versus the compressed number of hidden nodes. Blue: MNIST; pink: FashionMNIST.]
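As a usage sketch (not the exact experimental script), the quantities plotted in Figure 2 can be obtained from the hidden states of a trained model by reusing the illustrative `covariance` and `input_information_loss` helpers from Section 3:

```python
import numpy as np

def figure2_quantities(H_all, lam=1e-8):
    """Eigenvalue spectrum of Sigma and the greedy input-information-loss curve.

    H_all : hidden states of the trained RNN on the training set, shape (n, T, m).
    lam   : a tiny value standing in for the lam = 0 / pseudo-inverse case (placeholder choice).
    """
    Sigma = covariance(H_all)
    eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]          # eigenvalues, decreasing order
    losses, J = [], []
    m = Sigma.shape[0]
    for _ in range(m):                                          # loss versus the compressed size m_sharp
        candidates = [j for j in range(m) if j not in J]
        J.append(min(candidates, key=lambda j: input_information_loss(Sigma, J + [j], lam)))
        losses.append(input_information_loss(Sigma, J, lam))
    return eigvals, losses
```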
5.2 Pixel-MNIST (IRNN)
Table 1: Accuracies and numbers of weight parameters for different pruning methods on pixel-MNIST (IRNN).

| Method | Accuracy[%] (std) | Finetuned Accuracy[%] (std) | # input-hidden | # hidden-hidden | # hidden-out | total |
|---|---|---|---|---|---|---|
| Baseline(128) | 96.80 (0.23) | - | 128 | 16384 | 1280 | 17792 |
| Baseline(42) | 93.35 (0.75) | - | 42 | 1764 | 420 | 2226 |
| Spectral w/ rec.(ours) | 92.61 (2.46) | 97.08 (0.16) | 42 | 1764 | 420 | 2226 |
| Spectral w/o rec. | 83.60 (8.24) | - | 42 | 1764 | 420 | 2226 |
| Random w/ rec. | 34.72 (32.47) | - | 42 | 1764 | 420 | 2226 |
| Random w/o rec. | 23.13 (16.09) | - | 42 | 1764 | 420 | 2226 |
| Random Weight | 10.35 (1.38) | - | 128 | 1764 | 1280 | 3172 |
| Magnitude-based Weight | 11.06 (0.70) | 94.41 (3.02) | 128 | 1764 | 1280 | 3172 |
| Column Sparsification | 84.80 (7.29) | - | 128 | 5376 | 1280 | 6784 |
| Low Rank Factorization | 9.65 (3.85) | - | 128 | 10752 | 1280 | 12160 |
Table 2: Perplexity and numbers of weight parameters for different pruning methods on PTB (RNNLM).

| Method | Perplexity (std) | Finetuned Perplexity (std) | # input-hidden | # hidden-hidden | # hidden-out | total |
|---|---|---|---|---|---|---|
| Baseline(128) | 114.66 (0.35) | - | 1270016 | 16384 | 1270016 | 2556416 |
| Baseline(42) | 145.85 (0.74) | 132.46 (0.74) | 416724 | 1764 | 416724 | 835212 |
| Spectral w/ rec.(ours) | 207.63 (2.19) | 124.26 (0.39) | 416724 | 1764 | 416724 | 835212 |
| Spectral w/o rec. | 433.99 (10.64) | - | 416724 | 1764 | 416724 | 835212 |
| Random w/ rec. | 243.76 (9.46) | - | 416724 | 1764 | 416724 | 835212 |
| Random w/o rec. | 492.06 (22.40) | - | 416724 | 1764 | 416724 | 835212 |
| Random Weight | 203.41 (2.02) | - | 1270016 | 1764 | 1270016 | 2541796 |
| Magnitude-based Weight | 168.57 (2.57) | 115.65 (0.31) | 1270016 | 1764 | 1270016 | 2541796 |
| Magnitude-based Weight (all weights) | 201.41 (3.60) | 126.20 (0.28) | 416724 | 1764 | 416724 | 835212 |
| Column Sparsification | 128.98 (0.52) | - | 1270016 | 5376 | 1270016 | 2545408 |
| Low Rank Factorization | 126.24 (1.79) | - | 1270016 | 10752 | 1270016 | 2550784 |
We compare spectral pruning with other pruning methods on pixel-MNIST (IRNN). Table 1 summarizes the accuracies and the numbers of weight parameters for the different pruning methods. We consider roughly one-third compression of the hidden state, i.e., for node pruning, 128 hidden nodes were compressed to 42 nodes, while for weight pruning, the 16384 hidden-to-hidden weights were compressed to 1764 weights.
“Baseline(128)” and “Baseline(42)” represent direct training (not pruning) with 128 and 42 hidden nodes, respectively. “Spectral w/ rec.(ours)” represents spectral pruning with the reconstruction matrix (i.e., the compressed weights are obtained by multiplying the trained weights by the reconstruction matrix, with the index set optimal with respect to (3.9)), while “Spectral w/o rec.” represents spectral pruning without the reconstruction matrix (i.e., the trained weights are simply restricted to the index set optimal with respect to (3.9)), an idea based on [luo2017thinet]. “Random w/ rec.” represents random node pruning with the reconstruction matrix (i.e., the reconstruction matrix is used with a randomly chosen index set), while “Random w/o rec.” represents random node pruning without the reconstruction matrix (again with a randomly chosen index set). “Random Weight” represents random weight pruning. For the reason why we compare with random pruning, see the introduction of [Zhang]. “Magnitude-based Weight” represents magnitude-based weight pruning based on [Narang]. “Column Sparsification” represents magnitude-based column sparsification during training based on [wang2019acceleration]. “Low Rank Factorization” represents low rank factorization that truncates small singular values of the trained weights, based on [prabhavalkar2016compression]. “Accuracy[%] (std)” and “Finetuned Accuracy[%] (std)” represent the mean (standard deviation) of the accuracy before and after fine-tuning, respectively. “# input-hidden”, “# hidden-hidden”, and “# hidden-out” represent the numbers of input-to-hidden, hidden-to-hidden, and hidden-to-output weight parameters, respectively, and “total” represents their sum. For detailed procedures of training, pruning, and fine-tuning, see Appendix H.
We demonstrate that spectral pruning significantly outperforms the other pruning methods. The reason spectral pruning can compress with small degradation is that the covariance matrix has a small number of non-zero rows (we observed around 50 non-zero rows); for details on non-zero rows, see Remark 3.2. Our method with fine-tuning outperforms “Baseline(42)”, which means that spectral pruning benefits from over-parameterization [chang2020provable; zhang2021understanding]. Since magnitude-based weight pruning is a method that requires fine-tuning (e.g., see [Narang]), we also compared our method with magnitude-based weight pruning with fine-tuning, and observed that our method outperforms it as well. We also remark that our method with fine-tuning surpasses “Baseline(128)”.
5.3 PTB (RNNLM)
We compare spectral pruning with other pruning methods on the PTB (RNNLM). Table 2 summarizes the perplexity and the numbers of weight parameters for the different pruning methods. As in Section 5.2, we consider roughly one-third compression of the hidden state, and the entries of “Method” are the same as in Table 1 except for “Magnitude-based Weight (all weights)”, which represents magnitude-based weight pruning applied not only to the hidden-to-hidden weights but also to the input-to-hidden and hidden-to-output weights, so that the number of resulting weight parameters is the same as for Spectral w/ rec.(ours).
We demonstrate that our method with fine-tuning outperforms the other pruning methods except for magnitude-based weight pruning. Even though “Low Rank Factorization” retains a large number of weight parameters, its perplexity is slightly worse than that of our method with fine-tuning. On the other hand, our method with fine-tuning cannot outperform “Magnitude-based Weight”, but it slightly outperforms the variant with the same number of weight parameters (“Magnitude-based Weight (all weights)”). We also remark that our method with fine-tuning surpasses “Baseline(42)”, although it does not surpass “Baseline(128)”.
Therefore, we conclude that spectral pruning works well for Elman-type RNNs, and especially well for the IRNN.
Future Work
It would be interesting to extend our work to the long short-term memory (LSTM). The properties of LSTMs differ from those of RNNs in that LSTMs have gated architectures including product operations, which might require a more complicated analysis of the generalization error bounds than for RNNs. Hence, the investigation of spectral pruning for LSTMs is beyond the scope of this study and will be the focus of future work.
Acknowledgements
The authors are grateful to Professor Taiji Suzuki for useful discussions and comments on our work. The first author was supported by Grant-in-Aid for JSPS Fellows (No.21J00119), Japan Society for the Promotion of Science.
References
Appendix
Appendix A Review of Spectral Pruning for DNNs
Let be training data, where is an input and is an output. The training data are independently identically distributed. To train the appropriate relationship between input and output, we consider DNNs as
where is an activation function, is a weight matrix, and is a bias. Let be a trained DNN obtained from the training data , i.e.,
We denote the input with respect to -th layer by
Let be an index set with , where and is the number of nodes of the -th layer of the compressed DNN with . Let be a subvector of corresponding to the index set , where is the -th components of the vector .
(i) Input information loss. The input information loss is defined by
(A.1) |
where is the empirical -norm with respect to , i.e.,
(A.2) |
where is the Euclidean norm and for a regularization parameter . Here, and . By the linear regularization theory, there exists a unique solution of the minimization problem of , and it has the form
(A.3) |
where is the (noncentered) empirical covariance matrix of with respect to , i.e.,
and is the submatrix of corresponding to index sets with and . By substituting the explicit formula (A.3) of the reconstruction matrix into (A.1), the input information loss is reformulated as
(A.4) |
(ii) Output information loss. For any matrix with an output size , we define the output information loss by
(A.5) |
where denotes the -th row of the matrix . A typical situation is that . The minimization problem of has the unique solution
and by substituting it into (A.5), the output information loss is reformulated as
(iii) Compressed DNN by the reconstruction matrix. We construct the compressed DNN by
where , and and is the compressed weight as the multiplication of the trained weight and the reconstruction matrix , i.e.,
(A.6) |
(iv) Optimization. To select an appropriate index set , we consider the following optimization problem that minimizes a convex combination of input and output information losses, i.e.,
for , where is a prespecified number. We adapt the optimal index in the algorithm. We term this method as spectral pruning.
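For comparison with the RNN case in Section 3, only the covariance computation changes in the DNN setting: it is taken over samples for the inputs of a single layer, rather than averaged over time. A brief illustrative sketch (reusing the helper functions from the Section 3 sketches):

```python
def layer_covariance(Phi):
    """Noncentered empirical covariance of the inputs Phi (shape (n, m_l)) of a given layer."""
    n = Phi.shape[0]
    return Phi.T @ Phi / n

# The reconstruction matrix, information losses, and greedy selection sketched in
# Section 3 can then be applied to this layerwise covariance; the compressed weight
# of the next layer is the trained weight multiplied by the reconstruction matrix
# (cf. equation (A.6)).
```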
In [Suzuki], generalization error bounds for compressed DNNs with spectral pruning have been studied (see Theorems 1 and 2 in [Suzuki]), and the parameters are chosen such that the error bound becomes smaller.
Appendix B Proof of Proposition 4.2
We restate Proposition 4.2 in an exact form as follows:
Proposition B.1.
Suppose that Assumption 4.1 holds. Let be sampled i.i.d. from the distribution . Then,
(B.1) |
for all and with .
Proof.
Let be a trained RNN and . Let us define functions and by
(B.2) |
and denote the hidden states by
(B.3) |
If a training data is used as input, we denote its hidden states by
and its outputs at time by
for . Then, we have
(B.4) |
If we can prove that the second term of the right-hand side in (B.4) is estimated as
(B.5) |
then by using the inequalities (B.4) and , we have
Hence, by taking the average over and , and by using the inequality , we obtain
which concludes the inequality (B.1). It remains to prove (B.5). We calculate that
(B.6) |
where is the operator norm (which is the largest singular value). Concerning the quantity , we estimate
and moreover, the second term is estimated as
for all . Thus, we have the recursive inequality
(B.7) |
for . By repeatedly substituting (B.7) into (B.6), we arrive at (B.5):
Thus, we conclude Proposition B.1. ∎
Appendix C Proof of Theorem 4.5
We restate Theorem 4.5 in an exact form as follows:
Theorem C.1.
Proof.
The generalization error of is decomposed into
where the second term is called the approximation error and the third term is called the estimation error. Since the loss function is -Lipschitz continuous, the approximation error is evaluated as
The term is evaluated by Proposition 4.2 (see also Proposition B.1). In the rest of the proof, let us concentrate on the estimation error bound.
First, we define the following function space
for . For , we have
The quantity is evaluated by
The recurrent structure (B.2) and (B.3) give
as this is repeated,
Hence, we see from (C.1) that
which implies that
By Theorem 3.4.5 in [Gine], for any , we have the following inequality with probability greater than :
where is the i.i.d. Rademacher sequence (see, e.g., Definition 3.1.19 in [Gine]). The first term of the right-hand side in the above inequality, called the Rademacher complexity, is estimated by using Theorem 4.12 in [Ledoux] and Lemma A.5 in [Bartlett] (or Lemma 9 in [Chen]) as follows:
where and are defined by
Here, we denote by the covering number of which means the minimal cardinality of a subset that covers at scale with respect to the norm . By using Lemma D.1 in Appendix D, for any , we conclude the following estimation error bound:
for all with probability greater than , where , and is defined by (C.2). The proof of Theorem C.1 is complete. ∎
Appendix D Upper Bound of the Covering Number
Lemma D.1.
Proof.
The proof is based on the argument of the proof of Lemma 3 in [Chen]. For , we estimate
The second term of the right-hand side is estimated as
We estimate the first term of the right-hand side in the above inequality as
and as this is repeated, we eventually obtain
Summarizing the above, we have
Since
and
we see that
(D.1) |
Since the right-hand side of (D.1) is independent of training data , we estimate
Then, the covering number is bounded as follows
where we used the notation
By Lemma 8 in [Chen], the above five covering numbers are bounded as
Therefore, by using for , we conclude that
where is the constant given by (C.2). The proof of Lemma D.1 is finished. ∎
Appendix E Proof of Proposition 4.6
We review the following proposition (see Proposition 1 in [Suzuki] and Proposition 1 in [Bach]).
Proposition E.1.
Let be i.i.d. sampled from the distribution in (4.6), and . Then, for any and , if , then we have the following inequality with probability greater than :
(E.1) |
for all .
Proof.
Let be an indicator vector which has at the -th component and in other components for . Applying Proposition E.1 with and taking the summation over , we obtain
∎
Appendix F Proof of Theorem 4.8
We restate Theorem 4.8 in an exact form as follows:
Theorem F.1.
Proof.
Let , and let be the compressed RNN with parameters
Once we can prove that
(F.2) |
we can apply Theorem C.1 with to obtain, for any , the following inequality with probability greater than :
(F.3) |
for . Moreover, by using Proposition E.1, we have
(F.4) |
and
(F.5) |
Therefore, by combining (F.3), (F.4) and (F.5), we conclude the inequality (F.1). It remains to prove (F.2). Finally, we prove that (F.2) holds with probability greater than .
Let us recall the definition (4.5) of the leverage score , i.e.,
By Markov’s inequality, we have
(F.6) |
because (see the proof of Lemma 1 in [Suzuki]). Therefore, the probability of the two events (E.1) and
(F.7) |
happening simultaneously is greater than . By the same argument as in (F.4) and (F.5), and by using (F.6), we have
and
Hence, (F.2) holds with probability greater than . Thus, we conclude Theorem F.1. ∎
Appendix G Remarks for Theorems 4.8 and F.1
Remark G.1.
We remark that the index set in Theorem 4.8 is a random variable with distribution $q$. If a deterministic index set satisfying (E.1) and (F.7) is considered, the inequality (4.9) holds with the same probability as that obtained for the inequality in Theorem 2 of [Suzuki]. The index set in Theorem 2 of [Suzuki] is chosen deterministically by minimizing the information losses (2) with an additional constraint. This constraint can be interpreted as requiring that the leverage scores corresponding to the selected nodes be large, which implies that important nodes are selected from the spectral information of the covariance matrix.
Remark G.2.
When the covariance matrix $\widehat{\Sigma}$ has zero rows, we can obtain a sharper error bound than (4.9) in Theorem 4.8. More precisely, the constant omitted in (4.9), which depends on the size of the full trained network, can be improved to a constant depending only on the size of the network after the zero-row nodes are removed. In fact, in this case, consider the network obtained by deleting the nodes corresponding to the zero rows of the covariance matrix $\widehat{\Sigma}$. By the same argument, replacing the trained network with this network in the proof of Theorem 4.8, we obtain Theorem 4.8 with the improved constant, which means that a sharper error bound can be obtained.
Appendix H Detailed configurations for training, pruning and fine-tuning
The architecture employed for the pixel-MNIST classification task consists of a single IRNN layer and an output layer, while that for PTB word-level language modeling consists of an embedding layer, a single RNN layer, and an output layer, where the embedding weight matrix and the RNN input weight matrix can be merged into a single weight matrix. The loss function is the cross-entropy following the softmax for both tasks. Each training and fine-tuning run is optimized with Adam, and the hyper-parameters obtained by grid search are summarized in Table 3, where “FT” denotes the value used in fine-tuning and “bptt” denotes the step size for back-propagation through time. As regards regularization techniques for the PTB task, we adopt dropout in every case and weight tying [inan2016tying] where it is effective.
Table 3: Hyper-parameters obtained by grid search (“FT”: value used in fine-tuning).

| Task | epochs (FT) | batch size | learning rate (FT) | LR decay (step) | gradient clip | bptt |
|---|---|---|---|---|---|---|
| Pixel-MNIST | 500 (250) | 120 | () | 0.95 (10) | 1.0 | 784 |
| PTB | 200 (200) | 20 | () | 0.95 (1) | 0.01 | 35 |
We sample five models for each baseline in Section 5. Furthermore, pruning methods that involve randomness are applied five times to each baseline model. Other detailed configurations for each method are as follows:
- Baseline (128)
  - train:
    - hidden size: 128
    - weight tying: True
- Baseline (42)
  - train:
    - hidden size: 42
    - weight tying: True
  - prune: None
  - finetune (only in the PTB case):
    - hidden size: 42 (unchanged)
    - weight tying: False
- Spectral w/ rec. or w/o rec.
  - train:
    - use Baseline (128)
  - prune:
    - size of hidden-to-hidden weight matrix: 42 × 42 (1764 parameters)
    - size of input-to-hidden weight matrix: 42 parameters (Pixel-MNIST) or 416724 parameters (PTB)
    - size of hidden-to-output weight matrix: 420 parameters (Pixel-MNIST) or 416724 parameters (PTB)
    - reduce the RNN weight matrices based on our proposed method, with or without the reconstruction matrix
  - finetune:
    - hidden size: 42 (reduced from 128)
    - weight tying: False
- Random w/ rec. or w/o rec.
  - same as “Spectral” except that the RNN weight matrices are reduced randomly in the pruning phase
- Column Sparsification
  - train:
    - hidden size: 128
    - weight tying: True
    - mask the columns of the hidden-to-hidden weight matrix with the smallest norms at each iteration, keeping 42 columns (noise is added to the weight matrix before masking when applied to the IRNN)
  - prune:
    - fix the mask
  - finetune: None
- Low Rank Factorization
  - train:
    - use Baseline (128)
  - prune:
    - intrinsic parameters of the hidden-to-hidden weight matrix: 10752 (a rank-42 factorization of the 128 × 128 matrix)
    - decompose the hidden-to-hidden weight matrix based on the SVD
    - the entries of the diagonal factor, which are the singular values, are kept in descending order and the smallest are truncated
  - finetune: None
- Magnitude-based Weight
  - train:
    - use Baseline (128)
  - prune:
    - remaining parameters of the hidden-to-hidden weight matrix: 1764
    - remove the parameters with the smallest magnitudes
  - finetune:
    - hidden size: 128 (unchanged, but with a sparse hidden-to-hidden weight matrix)
- Random Weight
  - same as “Magnitude-based Weight” except that parameters are removed randomly in the pruning phase