
Spectral Pruning for Recurrent Neural Networks

Takashi Furuya, Department of Mathematics, Hokkaido University, Japan. Email: takashi.furuya0101@gmail.com
Kazuma Suetake, AISIN SOFTWARE Co., Ltd., Japan. Email: kazuma.suetake@aisin-software.com
Koichi Taniguchi, Advanced Institute for Materials Research, Tohoku University, Japan. Email: koichi.taniguchi.b7@tohoku.ac.jp
Hiroyuki Kusumoto, Graduate School of Mathematics, Nagoya University, Japan. Email: mackey3141@gmail.com
Ryuji Saiin, AISIN SOFTWARE Co., Ltd., Japan. Email: ryuji.saiin@aisin-software.com
Tomohiro Daimon, AISIN SOFTWARE Co., Ltd., Japan. Email: tomohiro.daimon@aisin-software.com
Abstract

Recurrent neural networks (RNNs) are a class of neural networks used in sequential tasks. However, in general, RNNs have a large number of parameters and incur enormous computational costs by repeating their recurrent structures over many time steps. As a method to overcome this difficulty, RNN pruning has attracted increasing attention in recent years, and it becomes more beneficial in terms of the reduction of computational cost as the time step progresses. However, most existing methods of RNN pruning are heuristic. The purpose of this paper is to study a theoretical scheme for RNN pruning. We propose an appropriate pruning algorithm for RNNs inspired by “spectral pruning”, and provide generalization error bounds for compressed RNNs. We also provide numerical experiments to demonstrate our theoretical results and show the effectiveness of our pruning method compared with existing methods.

1 Introduction

Recurrent neural networks (RNNs) are a class of neural networks used in sequential tasks. However, in general, RNNs have a large number of parameters and incur enormous computational costs by repeating their recurrent structures over many time steps. These make their application difficult in edge-computing devices. To overcome this difficulty, RNN compression has attracted increasing attention in recent years. Compared to deep neural networks (DNNs) without any recurrent structure, it brings more benefits in terms of the reduction of computational costs as the time step progresses. There are many RNN compression methods such as pruning [Narang; tang2015pruning; Zhang; Lobacheva; wang2019acceleration; Wen2020StructuredPO; lobacheva2020structured], low rank factorization [Kliegl; Tjandra], quantization [Alom; Liu], distillation [Shi; Tang], and sparse training [liu2021selfish; liu2021efficient; dodge2019rnn; wen2017learning]. This paper is devoted to pruning for RNNs, and its purpose is to provide an RNN pruning method with a theoretical background.

Recently, Suzuki et al. [Suzuki] proposed a novel pruning method with a theoretical background, called spectral pruning, for DNNs such as fully connected and convolutional neural network architectures. The idea of the proposed method is to select important nodes for each layer by minimizing the information losses (see (2) in [Suzuki]), which can be represented by the layerwise covariance matrix. The minimization only requires linear algebraic operations. Suzuki et al. [Suzuki] also evaluated generalization error bounds for networks compressed using spectral pruning (see Theorems 1 and 2 in [Suzuki]). It was shown that the generalization error bounds are controlled by the degrees of freedom, which are defined based on the eigenvalues of the covariance matrix. Hence, the characteristics of the eigenvalue distribution have an influence on the error bounds. We can also observe that in the generalization error bounds, there is a bias-variance tradeoff corresponding to compressibility. Numerical experiments have also demonstrated the effectiveness of spectral pruning.

In this paper, we extend the theoretical scheme of spectral pruning to RNNs. Our pruning algorithm involves the selection of hidden nodes by minimizing information losses, which can be represented by the time mean of covariance matrices instead of the layerwise covariance matrix that appears in spectral pruning of DNNs. We emphasize that our information losses are derived from the generalization error bound. More precisely, we show that choosing compressed weight matrices that minimize the information losses reduces the generalization error bound evaluated in Section 4.1 (see the sentences after Theorem 4.5). We also remark that Suzuki et al. [Suzuki] do not clearly explain how the information losses are derived. As in DNNs [Suzuki], we can provide the generalization error bounds for RNNs compressed with our pruning and interpret the degrees of freedom and the bias-variance tradeoff.

We also provide numerical experiments to compare our method with existing methods. We observed that our method outperforms existing methods and benefits from over-parameterization [chang2020provable; zhang2021understanding] (see Sections 5.2 and 5.3). In particular, our method can compress models with small degradation (see Remark 3.2) when we employ IRNN, which is an RNN that uses the ReLU as the activation function and initializes the weights as the identity matrix and the biases to zero (see [Le]).

The summary of our contributions is the following:

  • A pruning algorithm for RNNs (Section 3) is proposed based on an analysis of the generalization error (Remark 4.3 and Theorem 4.8).

  • The generalization error bounds for RNNs compressed with our pruning algorithm are provided (Theorem 4.8).

2 Related Works

One of the popular compression methods for RNNs is pruning, which removes redundant weights based on certain criteria. For example, magnitude-based weight pruning [Narang; narang2017block; tang2015pruning] involves pruning trained weights that are less than a threshold value decided by the user. This method has to gradually repeat pruning and retraining weights to ensure that a certain accuracy is maintained. However, based on recent developments, the costly repetitions might not always be necessary. In one-shot pruning [Zhang; lee2018snip], weights are pruned once prior to training based on the spectrum of the recurrent Jacobian. Bayesian sparsification [Lobacheva; molchanov2017variational] induces sparse weight matrices by choosing a log-uniform prior distribution, and weights are likewise pruned once if the variance of the posterior over the weight is large.

While the above methods are referred to as weight pruning, our spectral pruning is a structured pruning in which redundant nodes are removed. The advantage of structured pruning over weight pruning is that it reduces computational costs more simply. The implementation advantages of structured pruning are illustrated in [wang2019acceleration]. Although weight pruning from large networks to small networks is less likely to degrade accuracy, it usually requires an accelerator for addressing sparsification (see [Parashar]). The structured pruning methods discussed in [wang2019acceleration; Wen2020StructuredPO; lobacheva2020structured] induce sparse weight matrices in the training process, prune weights close to zero, and do not repeat fine-tuning. In our pruning, weight matrices are trained in the usual way, the compressed weight matrices consist of the multiplication of the trained weight matrix and the reconstruction matrix, and there is no need to repeat pruning and fine-tuning. The idea of multiplying the trained weight matrix by the reconstruction matrix is similar to low rank factorization [Kliegl; Tjandra; prabhavalkar2016compression; grachev2019compression; denil2013predicting]. In particular, the work [denil2013predicting] is most closely related to spectral pruning; it employs the reconstruction matrix with the empirical covariance matrix replaced by a kernel matrix (see Section 3.1 in [denil2013predicting]).

In general, RNN pruning is more difficult than DNN pruning, because recurrent architectures are not robust to pruning, that is, even a little pruning causes accumulated errors, and the total error increases significantly over many time steps. A similar problem peculiar to the recurrent structure is also observed in dropout (see the Introduction in [Gal; Zaremba]).

Our motivation is to propose an RNN pruning algorithm theoretically. Inspired by [Suzuki], we focus on the generalization error bound and design the algorithm so that the generalization error bound becomes smaller. Thus, the derivation of our pruning method is theoretical, while that of existing methods such as magnitude-based pruning [Narang; narang2017block; tang2015pruning; wang2019acceleration; Wen2020StructuredPO] is heuristic. For studies of generalization error bounds for RNNs, we refer to [tu2019understanding; Chen; akpinar2019sample; joukovsky2021generalization].

3 Pruning Algorithm

We propose a pruning algorithm for RNNs inspired by [Suzuki]. See Appendix A for a review of spectral pruning for DNNs. Let D=\{(X^{i}_{T},Y^{i}_{T})\}_{i=1}^{n} be the training data with time series sequences X^{i}_{T}=(x^{i}_{t})_{t=1}^{T} and Y^{i}_{T}=(y^{i}_{t})_{t=1}^{T}, where x^{i}_{t}\in\mathbb{R}^{d_{x}} is an input and y^{i}_{t}\in\mathbb{R}^{d_{y}} is an output at time t. The training data are independently identically distributed. To train the appropriate relationship between input X_{T}=(x_{t})_{t=1}^{T} and output Y_{T}=(y_{t})_{t=1}^{T}, we consider RNNs f=(f_{t})_{t=1}^{T} as

f_{t}=W^{o}h_{t}+b^{o},\quad h_{t}=\sigma(W^{h}h_{t-1}+W^{i}x_{t}+b^{hi}),

for t=1,\ldots,T, where \sigma:\mathbb{R}\to\mathbb{R} is an activation function, h_{t}\in\mathbb{R}^{m} is the hidden state with the initial state h_{0}=0, W^{o}\in\mathbb{R}^{d_{y}\times m}, W^{h}\in\mathbb{R}^{m\times m}, and W^{i}\in\mathbb{R}^{m\times d_{x}} are weight matrices, and b^{o}\in\mathbb{R}^{d_{y}} and b^{hi}\in\mathbb{R}^{m} are biases. Here, an element-wise activation operator is employed, i.e., we define \sigma(x):=(\sigma(x_{1}),\ldots,\sigma(x_{m}))^{T} for x=(x_{1},\ldots,x_{m})\in\mathbb{R}^{m}.
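To make the recurrence concrete, the following is a minimal NumPy sketch of the forward pass defined above. The function name rnn_forward, the ReLU default activation, and the array shapes are illustrative assumptions, not part of the formal setup.

```python
import numpy as np

def rnn_forward(X, W_o, W_h, W_i, b_o, b_hi, sigma=lambda z: np.maximum(z, 0.0)):
    """Forward pass of f_t = W^o h_t + b^o, h_t = sigma(W^h h_{t-1} + W^i x_t + b^{hi}).

    X has shape (T, d_x); returns outputs of shape (T, d_y) and hidden states (T, m).
    """
    m = W_h.shape[0]
    h = np.zeros(m)                              # initial state h_0 = 0
    outputs, hiddens = [], []
    for x_t in X:
        h = sigma(W_h @ h + W_i @ x_t + b_hi)    # hidden-state update
        outputs.append(W_o @ h + b_o)            # linear readout
        hiddens.append(h)
    return np.stack(outputs), np.stack(hiddens)
```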

Let \widehat{f}=(\widehat{f}_{t})_{t=1}^{T} be a trained RNN obtained from the training data D with weight matrices \widehat{W}^{o}\in\mathbb{R}^{d_{y}\times m}, \widehat{W}^{h}\in\mathbb{R}^{m\times m}, and \widehat{W}^{i}\in\mathbb{R}^{m\times d_{x}}, and biases \widehat{b}^{o}\in\mathbb{R}^{d_{y}} and \widehat{b}^{hi}\in\mathbb{R}^{m}, i.e., \widehat{f}_{t}=\widehat{W}^{o}\widehat{h}_{t}+\widehat{b}^{o}, \widehat{h}_{t}=\sigma(\widehat{W}^{h}\widehat{h}_{t-1}+\widehat{W}^{i}x_{t}+\widehat{b}^{hi}) for t=1,\ldots,T. We denote the hidden state \widehat{h}_{t} by

\widehat{h}_{t}=\phi(x_{t},\widehat{h}_{t-1}),

as a function with inputs x_{t} and \widehat{h}_{t-1}. Our aim is to compress the trained network \widehat{f} to the smaller network f^{\sharp} without loss of performance to the extent possible.

Let J\subset[m] be an index set with |J|=m^{\sharp}, where [m]:=\{1,\ldots,m\}, and let m^{\sharp}\in\mathbb{N} be the number of hidden nodes for a compressed RNN f^{\sharp} with m^{\sharp}\leq m. We denote by \phi_{J}(x_{t},\widehat{h}_{t-1})=(\phi_{j}(x_{t},\widehat{h}_{t-1}))_{j\in J} the subvector of \phi(x_{t},\widehat{h}_{t-1}) corresponding to the index set J, where \phi_{j}(x_{t},\widehat{h}_{t-1}) represents the j-th component of the vector \phi(x_{t},\widehat{h}_{t-1}).

(i) Input information loss. The input information loss is defined by

L^{(A)}_{\tau}(J):=\min_{A\in\mathbb{R}^{m\times m^{\sharp}}}\big\{\|\phi-A\phi_{J}\|^{2}_{n,T}+\|A\|^{2}_{\tau}\big\}, \qquad (3.1)

where \|\cdot\|_{n,T} is the empirical L^{2}-norm with respect to n and t, i.e.,

\|\phi-A\phi_{J}\|^{2}_{n,T}:=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\big\|\phi(x^{i}_{t},\widehat{h}^{i}_{t-1})-A\phi_{J}(x^{i}_{t},\widehat{h}^{i}_{t-1})\big\|^{2}_{2},

where \|\cdot\|_{2} is the Euclidean norm, \|A\|^{2}_{\tau}:=\mathrm{Tr}[AI_{\tau}A^{T}] for the regularization parameter \tau\in\mathbb{R}^{m^{\sharp}}_{+}:=\{x\in\mathbb{R}^{m^{\sharp}}\,|\,x_{j}>0,\ j=1,\ldots,m^{\sharp}\}, and I_{\tau}:=\mathrm{diag}(\tau). Here, \widehat{\Sigma}_{I,I^{\prime}}\in\mathbb{R}^{K\times H} denotes the submatrix of \widehat{\Sigma} corresponding to the index sets I,I^{\prime}\subset[m] with |I|=K, |I^{\prime}|=H, i.e., \widehat{\Sigma}_{I,I^{\prime}}=(\widehat{\Sigma}_{i,i^{\prime}})_{i\in I,i^{\prime}\in I^{\prime}}. Based on linear regularization theory (see, e.g., [gockenbach2016linear]), there exists a unique solution \widehat{A}_{J}\in\mathbb{R}^{m\times m^{\sharp}} of the minimization problem of \|\phi-A\phi_{J}\|^{2}_{n,T}+\|A\|^{2}_{\tau}, which has the form

\widehat{A}_{J}=\widehat{\Sigma}_{[m],J}\big(\widehat{\Sigma}_{J,J}+I_{\tau}\big)^{-1}, \qquad (3.2)

where \widehat{\Sigma} is the (noncentered) empirical covariance matrix of the hidden state \phi(x_{t},\widehat{h}_{t-1}) with respect to n and t, i.e.,

\widehat{\Sigma}=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\phi(x^{i}_{t},\widehat{h}^{i}_{t-1})\phi(x^{i}_{t},\widehat{h}^{i}_{t-1})^{T}. \qquad (3.3)

We term the unique solution \widehat{A}_{J} as the reconstruction matrix. Here, we would like to emphasize that the mean of the covariance matrix with respect to time t is employed in RNNs, while the layerwise covariance matrix is employed in DNNs (see Appendix A). By substituting the explicit formula of the reconstruction matrix \widehat{A}_{J} into (3.1), the input information loss is reformulated as:

L^{(A)}_{\tau}(J)=\mathrm{Tr}\Big[\widehat{\Sigma}-\widehat{\Sigma}_{[m],J}\big(\widehat{\Sigma}_{J,J}+I_{\tau}\big)^{-1}\widehat{\Sigma}_{J,[m]}\Big]. \qquad (3.4)
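As a concrete illustration of (3.2)-(3.4), the following NumPy sketch computes the empirical covariance matrix, the reconstruction matrix, and the input information loss from a matrix Phi that stacks the nT hidden states \phi(x^{i}_{t},\widehat{h}^{i}_{t-1}) as rows; the function names and this data layout are assumptions made for illustration.

```python
import numpy as np

def empirical_covariance(Phi):
    """Noncentered time-mean covariance (3.3); Phi has shape (n*T, m)."""
    return Phi.T @ Phi / Phi.shape[0]

def reconstruction_matrix(Sigma, J, tau):
    """A_hat_J = Sigma_{[m],J} (Sigma_{J,J} + I_tau)^{-1}, as in eq. (3.2)."""
    return Sigma[:, J] @ np.linalg.inv(Sigma[np.ix_(J, J)] + np.diag(tau))

def input_information_loss(Sigma, J, tau):
    """Input information loss L_tau^(A)(J), as in eq. (3.4)."""
    inner = Sigma[:, J] @ np.linalg.inv(Sigma[np.ix_(J, J)] + np.diag(tau)) @ Sigma[J, :]
    return np.trace(Sigma - inner)
```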

(ii) Output information loss. The hidden state of an RNN is forwardly propagated to the next hidden state or the output, and hence two output information losses are defined by

L^{(B,o)}_{\tau}(J):=\sum_{j=1}^{d_{y}}\min_{\beta\in\mathbb{R}^{m^{\sharp}}}\Big\{\big\|\widehat{W}^{o}_{j,:}\phi-\beta^{T}\phi_{J}\big\|^{2}_{n,T}+\big\|\beta^{T}\big\|^{2}_{\tau}\Big\}, \qquad (3.5)
L^{(B,h)}_{\tau}(J):=\sum_{j\in J}\min_{\beta\in\mathbb{R}^{m^{\sharp}}}\Big\{\big\|\widehat{W}^{h}_{j,:}\phi-\beta^{T}\phi_{J}\big\|^{2}_{n,T}+\big\|\beta^{T}\big\|^{2}_{\tau}\Big\}, \qquad (3.6)

where \widehat{W}^{o}_{j,:} and \widehat{W}^{h}_{j,:} denote the j-th rows of the matrices \widehat{W}^{o} and \widehat{W}^{h}, respectively. Then, the unique solutions of the minimization problems of \|\widehat{W}^{o}_{j,:}\phi-\beta^{T}\phi_{J}\|^{2}_{n,T}+\|\beta^{T}\|^{2}_{\tau} and \|\widehat{W}^{h}_{j,:}\phi-\beta^{T}\phi_{J}\|^{2}_{n,T}+\|\beta^{T}\|^{2}_{\tau} are \widehat{\beta}^{o}_{j}=(\widehat{W}^{o}_{j,:}\widehat{A}_{J})^{T} and \widehat{\beta}^{h}_{j}=(\widehat{W}^{h}_{j,:}\widehat{A}_{J})^{T}, respectively. By substituting them into (3.5) and (3.6), the output information losses are reformulated as

L^{(B,o)}_{\tau}(J)=\mathrm{Tr}\bigg[\widehat{W}^{o}\Big(\widehat{\Sigma}-\widehat{\Sigma}_{[m],J}\big(\widehat{\Sigma}_{J,J}+I_{\tau}\big)^{-1}\widehat{\Sigma}_{J,[m]}\Big)\widehat{W}^{o\,T}\bigg], \qquad (3.7)
L^{(B,h)}_{\tau}(J)=\mathrm{Tr}\bigg[\widehat{W}^{h}_{J,[m]}\Big(\widehat{\Sigma}-\widehat{\Sigma}_{[m],J}\big(\widehat{\Sigma}_{J,J}+I_{\tau}\big)^{-1}\widehat{\Sigma}_{J,[m]}\Big)\widehat{W}^{h\,T}_{J,[m]}\bigg]. \qquad (3.8)

Here, we remark that the output information losses L^{(B,o)}_{\tau}(J) and L^{(B,h)}_{\tau}(J) are bounded above by the input information loss L^{(A)}_{\tau}(J) (see Remark 4.3).
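The reformulated output losses (3.7) and (3.8) reuse the same regularized projection of \widehat{\Sigma}; a small sketch in the same illustrative style, with W_o_hat and W_h_hat denoting the trained weights, is the following.

```python
import numpy as np

def output_information_losses(Sigma, J, tau, W_o_hat, W_h_hat):
    """Output information losses L_tau^(B,o)(J) and L_tau^(B,h)(J), eqs. (3.7)-(3.8)."""
    residual = Sigma - Sigma[:, J] @ np.linalg.inv(Sigma[np.ix_(J, J)] + np.diag(tau)) @ Sigma[J, :]
    L_Bo = np.trace(W_o_hat @ residual @ W_o_hat.T)              # output branch, (3.7)
    L_Bh = np.trace(W_h_hat[J, :] @ residual @ W_h_hat[J, :].T)  # hidden branch, (3.8)
    return L_Bo, L_Bh
```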

(iii) Compressed RNNs. We construct the compressed RNN f^{\sharp}_{J} by f^{\sharp}_{J,t}=W^{\sharp o}_{J}h^{\sharp}_{J,t}+b^{\sharp o}_{J} and h^{\sharp}_{J,t}=\sigma(W^{\sharp h}_{J}h^{\sharp}_{J,t-1}+W^{\sharp i}_{J}x_{t}+b^{\sharp hi}_{J}) for t=1,\ldots,T, where W^{\sharp o}_{J}:=\widehat{W}^{o}\widehat{A}_{J}, W^{\sharp h}_{J}:=\widehat{W}^{h}_{J,[m]}\widehat{A}_{J}, W^{\sharp i}_{J}:=\widehat{W}^{i}_{J,[d_{x}]}, b^{\sharp hi}_{J}:=\widehat{b}^{hi}_{J}, and b^{\sharp o}_{J}:=\widehat{b}^{o}.

(iv) Optimization. To select an appropriate index set J, we consider the following optimization problem that minimizes the convex combination of the input and two output information losses:

\min_{J\subset[m]\ \mathrm{s.t.}\ |J|=m^{\sharp}}\left\{\theta_{1}L^{(A)}_{\tau}(J)+\theta_{2}L^{(B,o)}_{\tau}(J)+\theta_{3}L^{(B,h)}_{\tau}(J)\right\}, \qquad (3.9)

for \theta_{1},\theta_{2},\theta_{3}\in[0,1] with \theta_{1}+\theta_{2}+\theta_{3}=1, where m^{\sharp}\in[m] is a prespecified number. The optimal index J^{\sharp} is obtained by the greedy algorithm. We term this method spectral pruning (for a schematic diagram of spectral pruning, see Figure 1). The reason why the information losses are employed in the objective will be theoretically explained later, when the error bounds in Remark 4.3 and Theorem 4.5 are provided. We summarize our pruning algorithm in the following.

Algorithm 1 Spectral pruning
0:  Data set D=\{(x_{n},y_{n})\}_{n=1}^{N}\subset\mathbb{R}^{d_{x}}\times\mathbb{R}^{d_{y}}, Trained RNN \widehat{f}=(\widehat{f}_{t})_{t=1}^{T} with \widehat{f}_{t}=\widehat{W}^{o}\widehat{h}_{t}+\widehat{b}^{o}, \widehat{h}_{t}=\sigma(\widehat{W}^{h}\widehat{h}_{t-1}+\widehat{W}^{i}x_{t}+\widehat{b}^{hi}), Number of hidden nodes m for the trained RNN \widehat{f}, Number of hidden nodes m^{\sharp}\leq m for the returned compressed RNN, Regularization parameter \tau\in\mathbb{R}^{m^{\sharp}}_{+}, Coefficients \theta_{1},\theta_{2},\theta_{3}\in[0,1] with \theta_{1}+\theta_{2}+\theta_{3}=1.
1:  Minimize \left\{\theta_{1}L^{(A)}_{\tau}(J)+\theta_{2}L^{(B,o)}_{\tau}(J)+\theta_{3}L^{(B,h)}_{\tau}(J)\right\} over index sets J\subset[m] with |J|=m^{\sharp} by the greedy algorithm, where L^{(A)}_{\tau}(J), L^{(B,o)}_{\tau}(J), and L^{(B,h)}_{\tau}(J) are computed by (3.4), (3.7), and (3.8), respectively.
2:  Obtain the optimal J^{\sharp}.
3:  Compute \widehat{A}_{J^{\sharp}} by (3.2).
4:  Set W^{\sharp o}_{J^{\sharp}}:=\widehat{W}^{o}\widehat{A}_{J^{\sharp}}, W^{\sharp h}_{J^{\sharp}}:=\widehat{W}^{h}_{J^{\sharp},[m]}\widehat{A}_{J^{\sharp}}, W^{\sharp i}_{J^{\sharp}}:=\widehat{W}^{i}_{J^{\sharp},[d_{x}]}, b^{\sharp hi}_{J^{\sharp}}:=\widehat{b}^{hi}_{J^{\sharp}}, b^{\sharp o}_{J^{\sharp}}:=\widehat{b}^{o}.
5:  return  Compressed RNN f^{\sharp}_{J^{\sharp}}=(f^{\sharp}_{J^{\sharp},t})_{t=1}^{T} with f^{\sharp}_{J^{\sharp},t}=W^{\sharp o}_{J^{\sharp}}h^{\sharp}_{J^{\sharp},t}+b^{\sharp o}_{J^{\sharp}} and h^{\sharp}_{J^{\sharp},t}=\sigma(W^{\sharp h}_{J^{\sharp}}h^{\sharp}_{J^{\sharp},t-1}+W^{\sharp i}_{J^{\sharp}}x_{t}+b^{\sharp hi}_{J^{\sharp}}).
Figure 1: Spectral pruning for RNN
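One possible greedy instantiation of steps 1-4 of Algorithm 1, building on the loss functions sketched above, is given below; slicing the regularization vector tau to the current size of J during the greedy search is an implementation assumption made only for this sketch.

```python
import numpy as np

def spectral_prune(Sigma, W_o_hat, W_h_hat, W_i_hat, b_hi_hat, b_o_hat,
                   m_sharp, tau, theta=(1.0, 0.0, 0.0)):
    """Greedy minimization of theta_1 L^(A) + theta_2 L^(B,o) + theta_3 L^(B,h) (step 1),
    followed by the construction of the compressed weights (steps 3-4)."""
    m = Sigma.shape[0]
    J = []
    for _ in range(m_sharp):
        best_j, best_obj = None, np.inf
        for j in range(m):
            if j in J:
                continue
            J_try, t = J + [j], tau[:len(J) + 1]
            obj = theta[0] * input_information_loss(Sigma, J_try, t)
            if theta[1] or theta[2]:
                L_Bo, L_Bh = output_information_losses(Sigma, J_try, t, W_o_hat, W_h_hat)
                obj += theta[1] * L_Bo + theta[2] * L_Bh
            if obj < best_obj:
                best_j, best_obj = j, obj
        J.append(best_j)                   # greedily add the most informative node
    A = reconstruction_matrix(Sigma, J, tau)
    W_o = W_o_hat @ A                      # W^{sharp o} = W_hat^o A_hat_J
    W_h = W_h_hat[J, :] @ A                # W^{sharp h} = W_hat^h_{J,[m]} A_hat_J
    W_i = W_i_hat[J, :]                    # W^{sharp i} = W_hat^i_{J,[d_x]}
    return J, W_o, W_h, W_i, b_hi_hat[J], b_o_hat
```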
Remark 3.1.

In the case of the regularization parameter \tau=0, spectral pruning can be applied, but the following point must be noted. In this case, the uniqueness of the minimization problem of \|\phi-A\phi_{J}\|^{2}_{n,T} with respect to A does not generally hold (i.e., there might be several reconstruction matrices). One of the solutions is \widehat{A}_{J}=\widehat{\Sigma}_{[m],J}\widehat{\Sigma}_{J,J}^{\dagger}, which is the limit of (3.2) as \tau\to 0, where \widehat{\Sigma}_{J,J}^{\dagger} is the pseudo-inverse of \widehat{\Sigma}_{J,J}. It should be noted that \widehat{\Sigma}_{J,J}^{\dagger} coincides with the usual inverse \widehat{\Sigma}_{J,J}^{-1} when m^{\sharp} is smaller than or equal to the rank of the covariance matrix \widehat{\Sigma}.

Remark 3.2.

We consider the case of the regularization parameter \tau=0 and m^{\sharp}\geq m_{\mathrm{nzr}}, where m_{\mathrm{nzr}} denotes the number of non-zero rows of \widehat{\Sigma}. Here, we would like to remark on the relation between m_{\mathrm{nzr}} and pruning. Let J_{\mathrm{nzr}} be the index set such that [m]\setminus J_{\mathrm{nzr}} corresponds to the zero rows of \widehat{\Sigma}. Then, by the definition (3.3) of \widehat{\Sigma}, we have, for i=1,\cdots,n, t=1,\cdots,T, v\in[m]\setminus J_{\mathrm{nzr}},

\phi_{v}(x_{t}^{i},\widehat{h}^{i}_{t-1})=0,

which implies that \widetilde{A}_{J_{\mathrm{nzr}}}=I_{[m],J_{\mathrm{nzr}}} is a trivial solution of the minimization problem because \|\phi-\widetilde{A}_{J_{\mathrm{nzr}}}\phi_{J_{\mathrm{nzr}}}\|^{2}_{n,T}=0. Here, I_{[m],J_{\mathrm{nzr}}} is the submatrix of the identity matrix corresponding to the index sets [m] and J_{\mathrm{nzr}}. If we choose \widetilde{A}_{J_{\mathrm{nzr}}}=I_{[m],J_{\mathrm{nzr}}} as the reconstruction matrix, then the trivial compressed weights can be obtained by simply removing the columns corresponding to [m]\setminus J_{\mathrm{nzr}}, i.e., W^{\sharp o}:=\widehat{W}^{o}\widetilde{A}_{J_{\mathrm{nzr}}}=\widehat{W}^{o}_{[m],J_{\mathrm{nzr}}} and W^{\sharp h}:=\widehat{W}^{h}_{J_{\mathrm{nzr}},[m]}\widetilde{A}_{J_{\mathrm{nzr}}}=\widehat{W}^{h}_{J_{\mathrm{nzr}},J_{\mathrm{nzr}}}, and its network f^{\sharp}_{J_{\mathrm{nzr}}} coincides with the trained network \widehat{f} on the training data, i.e., for i=1,\cdots,n, t=1,\cdots,T,

f^{\sharp}_{J_{\mathrm{nzr}},t}(X^{i}_{t})=\widehat{f}_{t}(X^{i}_{t}),

which means that the trained RNN is compressed to size m^{\sharp} without degradation. On the other hand, in the case of m^{\sharp}<m_{\mathrm{nzr}}, \widetilde{A}_{J}=I_{[m],J} is not a solution of the minimization problem for any choice of the index J, which means that the compressed network using \widehat{A}_{J}=\widehat{\Sigma}_{[m],J}\widehat{\Sigma}_{J,J}^{\dagger} is closer to the trained network than that using \widetilde{A}_{J}=I_{[m],J}. Therefore, spectral pruning essentially contributes to compression when m^{\sharp}<m_{\mathrm{nzr}}.
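A minimal sketch of the trivial compression described in this remark, under the NumPy conventions used above: it detects the zero rows of \widehat{\Sigma} and drops the corresponding hidden nodes, which leaves the outputs on the training data unchanged.

```python
import numpy as np

def trivial_prune(Sigma, W_o_hat, W_h_hat, W_i_hat, b_hi_hat):
    """Remark 3.2: nodes whose row of Sigma is zero are never active on the training data,
    so removing them does not change the trained network's outputs on that data."""
    J_nzr = np.where(~np.all(np.isclose(Sigma, 0.0), axis=1))[0]  # indices of non-zero rows
    W_o = W_o_hat[:, J_nzr]                    # keep the corresponding output columns
    W_h = W_h_hat[np.ix_(J_nzr, J_nzr)]        # keep the corresponding hidden block
    W_i = W_i_hat[J_nzr, :]
    return J_nzr, W_o, W_h, W_i, b_hi_hat[J_nzr]
```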

4 Generalization Error Bounds for Compressed RNNs

In this section, we discuss the generalization error bounds for compressed RNNs. In Subsection 4.1, error bounds for general compressed RNNs are evaluated in order to explain how the spectral pruning of Section 3 is derived from the error bound. In Subsection 4.2, the error bounds for RNNs compressed with spectral pruning are evaluated.

4.1 Error bound for general compressed RNNs

Let (X^{i}_{T},Y^{i}_{T}) be the training data generated independently identically from the true distribution P_{T}, and let f^{\sharp} be a general compressed RNN, and assume that it belongs to the following function space:

\begin{split}\mathcal{F}^{\sharp}_{T}&=\mathcal{F}^{\sharp}_{T}(R_{o},R_{h},R_{i},R^{b}_{o},R^{b}_{hi})\\ &:=\bigg\{f^{\sharp}\,\Big|\,f^{\sharp}(X_{T})=(f^{\sharp}_{t}(X_{t}))_{t=1}^{T},\\ &\quad f^{\sharp}_{t}(X_{t})=(W^{\sharp o}\sigma(\cdot)+b^{\sharp o})\circ(W^{\sharp h}\sigma(\cdot)+W^{\sharp i}x_{t}+b^{\sharp hi})\circ\cdots\circ(W^{\sharp h}\sigma(\cdot)+W^{\sharp i}x_{2}+b^{\sharp hi})\circ(W^{\sharp i}x_{1}+b^{\sharp hi})\\ &\quad\text{ for }X_{T}\in\mathrm{supp}(P_{X_{T}}),\ \big\|W^{\sharp o}\big\|_{F}\leq R_{o},\ \big\|W^{\sharp h}\big\|_{F}\leq R_{h},\ \big\|W^{\sharp i}\big\|_{F}\leq R_{i},\ \big\|b^{\sharp o}\big\|_{2}\leq R^{b}_{o},\ \big\|b^{\sharp hi}\big\|_{2}\leq R^{b}_{hi}\bigg\},\end{split}

where P_{X_{T}} is the marginal distribution of P_{T} with respect to X_{T}, and R_{o}, R_{h}, R_{i}, R^{b}_{o}, R^{b}_{hi} are the upper bounds of the compressed weights W^{\sharp o}\in\mathbb{R}^{d_{y}\times m^{\sharp}}, W^{\sharp h}\in\mathbb{R}^{m^{\sharp}\times m^{\sharp}}, W^{\sharp i}\in\mathbb{R}^{m^{\sharp}\times d_{x}}, and biases b^{\sharp o}\in\mathbb{R}^{d_{y}} and b^{\sharp hi}\in\mathbb{R}^{m^{\sharp}}, respectively. Here, \|\cdot\|_{F} denotes the Frobenius norm.

Assumption 4.1.

The following assumptions are made: (i) The marginal distribution P_{x_{t}} of P_{T} with respect to x_{t} is bounded, i.e., there exists a constant R_{x} independent of t such that \|x_{t}\|_{2}\leq R_{x} for all x_{t}\in\mathrm{supp}(P_{x_{t}}) and t=1,\ldots,T. (ii) The activation function \sigma:\mathbb{R}\to\mathbb{R} satisfies \sigma(0)=0 and |\sigma(t)-\sigma(s)|\leq\rho_{\sigma}|t-s| for all t,s\in\mathbb{R}.

Under these assumptions, we obtain the following approximation error bounds between the trained network f^\widehat{f} and compressed networks ff^{\sharp}.

Proposition 4.2.

Let Assumption 4.1 hold. Let (X_{T}^{1},Y_{T}^{1}),\ldots,(X_{T}^{n},Y_{T}^{n}) be sampled i.i.d. from the distribution P_{T}. Then, for all f^{\sharp}\in\mathcal{F}^{\sharp}_{T} and J\subset[m] with |J|=m^{\sharp}, we have

\big\|\widehat{f}-f^{\sharp}\big\|_{n,T}\lesssim\big\|\widehat{W}^{o}\phi-W^{\sharp o}\phi_{J}\big\|_{n,T}+\big\|\widehat{W}^{h}_{J,[m]}\phi-W^{\sharp h}\phi_{J}\big\|_{n,T}+\big\|\widehat{W}^{i}_{J,[d_{x}]}-W^{\sharp i}\big\|_{op}+\big\|\widehat{b}^{hi}_{J}-b^{\sharp hi}\big\|_{2}+\big\|\widehat{b}^{o}-b^{\sharp o}\big\|_{2}. \qquad (4.1)

Here, \lesssim implies that the left-hand side in (4.1) is bounded above by the right-hand side times a constant independent of the trained weights and biases \widehat{W}, \widehat{b} and the compressed weights and biases W^{\sharp}, b^{\sharp}. The proof is given by direct computations. For the exact statement and proof, see Appendix B.

Remark 4.3.

Let f^{\sharp}_{J} be the network compressed using the reconstruction matrix (see (iii) in Section 3). By applying Proposition 4.2 as f^{\sharp}=f^{\sharp}_{J}, we obtain

\begin{split}\big\|\widehat{f}-f^{\sharp}_{J}\big\|^{2}_{n,T}&\lesssim\underbrace{\big\|\widehat{W}^{o}\phi-\widehat{W}^{o}\widehat{A}_{J}\phi_{J}\big\|^{2}_{n,T}+\big\|\widehat{W}^{o}\widehat{A}_{J}\big\|^{2}_{\tau}}_{=L^{(B,o)}_{\tau}(J)}+\underbrace{\big\|\widehat{W}^{h}_{J,[m]}\phi-\widehat{W}^{h}_{J,[m]}\widehat{A}_{J}\phi_{J}\big\|^{2}_{n,T}+\big\|\widehat{W}^{h}_{J,[m]}\widehat{A}_{J}\big\|^{2}_{\tau}}_{=L^{(B,h)}_{\tau}(J)}\\ &\leq\left(\big\|\widehat{W}^{o}\big\|^{2}_{F}+\big\|\widehat{W}^{h}_{J,[m]}\big\|^{2}_{F}\right)\underbrace{\left(\big\|\phi-\widehat{A}_{J}\phi_{J}\big\|^{2}_{n,T}+\big\|\widehat{A}_{J}\big\|^{2}_{\tau}\right)}_{=L^{(A)}_{\tau}(J)},\end{split} \qquad (4.2)

i.e., the approximation error is bounded above by the input information loss.

For the RNN f=(f_{t})_{t=1}^{T}, the training error with respect to the j-th component of the output is defined as

\widehat{\Psi}_{j}(f):=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\psi(y^{i}_{t,j},f_{t}(X^{i}_{t})_{j}),

where X_{t}=(x_{s})_{s=1}^{t} and \psi:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_{+} is a loss function. The generalization error with respect to the j-th component of the output is defined as

\Psi_{j}(f):=E\bigg[\frac{1}{T}\sum_{t=1}^{T}\psi(y_{t,j},f_{t}(X_{t})_{j})\bigg],

where the expectation is taken with respect to (X_{T},Y_{T})\sim P_{T}.

Assumption 4.4.

The following assumptions are made: (i) The loss function \psi(y_{t,j},0) is bounded, i.e., there exists a constant R_{y} such that |\psi(y_{t,j},0)|\leq R_{y} for all y_{t,j}\in\mathrm{supp}(P_{y_{t,j}}), t=1,\ldots,T, j=1,\ldots,d_{y}. (ii) \psi is \rho_{\psi}-Lipschitz continuous, i.e., |\psi(y,f)-\psi(y,g)|\leq\rho_{\psi}|f-g| for all y,f,g\in\mathbb{R}.

We obtain the following generalization error bound for f^{\sharp}\in\mathcal{F}_{T}^{\sharp}(R_{o},R_{h},R_{i},R^{b}_{o},R^{b}_{hi}).

Theorem 4.5.

Let Assumptions 4.1 and 4.4 hold, and let (X_{T}^{1},Y_{T}^{1}),\ldots,(X_{T}^{n},Y_{T}^{n}) be sampled i.i.d. from the distribution P_{T}. Then, for any \delta\geq\log 2, we have the following inequality with probability greater than 1-2e^{-\delta}:

\begin{split}\Psi_{j}(f^{\sharp})&\lesssim\widehat{\Psi}_{j}(\widehat{f})+\Big\{\big\|\widehat{W}^{o}\phi-W^{\sharp o}\phi_{J}\big\|_{n,T}+\big\|\widehat{W}^{h}_{J,[m]}\phi-W^{\sharp h}\phi_{J}\big\|_{n,T}+\big\|\widehat{W}^{i}_{J,[d_{x}]}-W^{\sharp i}\big\|_{op}\\ &\quad+\big\|\widehat{b}^{hi}_{J}-b^{\sharp hi}\big\|_{2}+\big\|\widehat{b}^{o}-b^{\sharp o}\big\|_{2}\Big\}+\frac{1}{\sqrt{n}}(m^{\sharp})^{\frac{5}{4}}R^{1/2}_{\infty,T},\end{split} \qquad (4.3)

for j=1,\ldots,d_{y} and for all J\subset[m] with |J|=m^{\sharp}, and f^{\sharp}\in\mathcal{F}^{\sharp}_{T}, where R_{\infty,t} is defined by

R_{\infty,t}:=R_{o}\rho_{\sigma}(R_{i}R_{x}+R^{b}_{hi})\bigg(\sum_{l=1}^{t}(R_{h}\rho_{\sigma})^{l-1}\bigg)+R^{b}_{o}.

Here, \lesssim implies that the left-hand side in (4.3) is bounded above by the right-hand side times a constant independent of the trained weights and biases \widehat{W}, \widehat{b}, the compressed weights and biases W^{\sharp}, b^{\sharp}, the compressed number m^{\sharp}, and the number of samples n. We remark that some omitted constants blow up as T increases, but they can be controlled by increasing the sampling number n (see Theorem C.1). The idea behind the proof is that the generalization error is decomposed into the training, approximation, and estimation errors. The approximation and estimation errors are evaluated using Proposition 4.2 and the estimation of the Rademacher complexity, respectively. For the exact statement and proof, see Appendix C.

The second term in (4.3) is the approximation error bound between \widehat{f} and f^{\sharp}, regarded as the bias, which is given by Proposition 4.2, while the third term is the estimation error bound, regarded as the variance. It can be observed that minimizing the terms \|\widehat{W}^{o}\phi-W^{\sharp o}\phi_{J}\|_{n,T} and \|\widehat{W}^{h}_{J,[m]}\phi-W^{\sharp h}\phi_{J}\|_{n,T} with respect to W^{\sharp o} and W^{\sharp h} is equivalent to minimizing the output information losses (3.5) and (3.6) with \tau=0, respectively, which means that (iii) in Section 3 with \tau=0 constructs the compressed RNN such that the bias term becomes smaller. Considering \tau\neq 0 prevents the blow up of \|W^{\sharp o}\|_{F} and \|W^{\sharp h}\|_{F}, which means that the regularization parameter \tau plays an important role in preventing the blow up of the variance term, because R_{\infty,T} in the variance term includes the upper bounds R_{o} and R_{h} of \|W^{\sharp o}\|_{F} and \|W^{\sharp h}\|_{F}. Therefore, (iii) with \tau\neq 0 constructs the compressed RNN such that the generalization error bound becomes smaller. In addition, selecting an optimal J that minimizes the information losses (see (iv) in Section 3) further decreases the error bound.

4.2 Error bound for RNNs compressed with spectral pruning

Next, we evaluate the generalization error bounds for the RNN f^{\sharp}_{J} compressed using the reconstruction matrix (see (iii) in Section 3). We define the degrees of freedom \widehat{N}(\lambda) by

\widehat{N}(\lambda):=\mathrm{Tr}\big[\widehat{\Sigma}(\widehat{\Sigma}+\lambda I)^{-1}\big]=\sum_{j=1}^{m}\frac{\widehat{\mu}_{j}}{\widehat{\mu}_{j}+\lambda},

where \widehat{\mu}_{j} is an eigenvalue of \widehat{\Sigma}. Throughout this subsection, the regularization parameter \tau\in\mathbb{R}^{m^{\sharp}}_{+} is chosen as \tau=\lambda m^{\sharp}\tau^{\prime}, where \lambda>0 satisfies

m^{\sharp}\geq 5\widehat{N}(\lambda)\log(16\widehat{N}(\lambda)/\widetilde{\delta}), \qquad (4.4)

for a prespecified \widetilde{\delta}\in(0,1/2). Here, \tau^{\prime}=(\tau^{\prime}_{j})_{j\in J}\in\mathbb{R}^{m^{\sharp}} is the leverage score defined, for k\in[m], by

\tau^{\prime}_{k}:=\frac{1}{\widehat{N}(\lambda)}\big[\widehat{\Sigma}(\widehat{\Sigma}+\lambda I)^{-1}\big]_{k,k}=\frac{1}{\widehat{N}(\lambda)}\sum_{j=1}^{m}U^{2}_{k,j}\frac{\widehat{\mu}_{j}}{\widehat{\mu}_{j}+\lambda}, \qquad (4.5)

where U=(U_{k,j})_{k,j} is the orthogonal matrix that diagonalizes \widehat{\Sigma}, i.e., \widehat{\Sigma}=U\mathrm{diag}\{\widehat{\mu}_{1},\ldots,\widehat{\mu}_{m}\}U^{T}. The leverage score includes the information of the eigenvalues and eigenvectors of \widehat{\Sigma}, and indicates that the large components correspond to the important nodes from the viewpoint of the spectral information of \widehat{\Sigma}. Let q be the probability measure on [m] defined by

q(v):=\tau^{\prime}_{v}\quad\text{for}\ v\in[m]. \qquad (4.6)
Proposition 4.6.

Let v_{1},\ldots,v_{m^{\sharp}} be sampled i.i.d. from the distribution q in (4.6), and J=\{v_{1},\ldots,v_{m^{\sharp}}\}. Then, for any \widetilde{\delta}\in(0,1/2) and \lambda>0 satisfying (4.4), we have the following inequality with probability greater than 1-\widetilde{\delta}:

L^{(A)}_{\tau}(J)\leq 4\lambda. \qquad (4.7)

The proof is given in Appendix E. In the proof, we essentially refer to the previous work [Bach]. Combining (4.2) and (4.7), we conclude that

\big\|\widehat{f}-f^{\sharp}_{J}\big\|^{2}_{n,T}\lesssim\lambda. \qquad (4.8)

It can be observed that the approximation error bound (4.8) is controlled by the degrees of freedom. If the eigenvalues of \widehat{\Sigma} decrease rapidly, then \widehat{N}(\lambda) decreases rapidly as \lambda increases. Therefore, in that case, we can choose a smaller \lambda even when m^{\sharp} is fixed. We will numerically study the relationship between the eigenvalue distribution and the input information loss in Section 5.1.
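The degrees of freedom, the leverage scores (4.5), and the sampling of J in Proposition 4.6 can be sketched as follows; the eigendecomposition-based formulas mirror the definitions above, while the function names and the random generator are illustrative assumptions.

```python
import numpy as np

def degrees_of_freedom(Sigma, lam):
    """N_hat(lambda) = sum_j mu_j / (mu_j + lambda)."""
    mu = np.linalg.eigvalsh(Sigma)
    return np.sum(mu / (mu + lam))

def leverage_scores(Sigma, lam):
    """Leverage scores tau'_k of (4.5), returned as a probability vector q on [m]."""
    mu, U = np.linalg.eigh(Sigma)              # Sigma = U diag(mu) U^T
    diag = (U ** 2) @ (mu / (mu + lam))        # diagonal of Sigma (Sigma + lam I)^{-1}
    return diag / diag.sum()                   # division by N_hat(lambda)

def sample_index_set(Sigma, lam, m_sharp, rng=np.random.default_rng(0)):
    """Draw v_1, ..., v_{m_sharp} i.i.d. from q, as in Proposition 4.6."""
    q = leverage_scores(Sigma, lam)
    return rng.choice(Sigma.shape[0], size=m_sharp, replace=True, p=q)
```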

We make the following additional assumption.

Assumption 4.7.

Assume that the upper bounds for the trained weights and biases are given by \|\widehat{W}^{o}\|_{F}\leq\widehat{R}_{o}, \|\widehat{W}^{h}\|_{F}\leq\widehat{R}_{h}, \|\widehat{W}^{i}\|_{F}\leq\widehat{R}_{i}, \|\widehat{b}^{hi}\|_{2}\leq\widehat{R}^{b}_{hi}, and \|\widehat{b}^{o}\|_{2}\leq\widehat{R}^{b}_{o}.

We have the following generalization error bound.

Theorem 4.8.

Let Assumptions 4.1, 4.4, and 4.7 hold, and let (X_{T}^{1},Y_{T}^{1}),\ldots,(X_{T}^{n},Y_{T}^{n}) and v_{1},\ldots,v_{m^{\sharp}} be sampled i.i.d. from the distributions P_{T} and q in (4.6), respectively. Let J=\{v_{1},\ldots,v_{m^{\sharp}}\}. Then, for any \delta\geq\log 2 and \widetilde{\delta}\in(0,1/2), we have the following inequality with probability greater than (1-2e^{-\delta})\widetilde{\delta}:

\Psi_{j}(f^{\sharp}_{J})\lesssim\widehat{\Psi}_{j}(\widehat{f})+\sqrt{\lambda}+\frac{1}{\sqrt{n}}(m^{\sharp})^{\frac{5}{4}}, \qquad (4.9)

for j=1,\ldots,d_{y} and for all \lambda>0 satisfying (4.4).

Here, \lesssim implies that the left-hand side in (4.9) is bounded above by the right-hand side times a constant independent of \lambda, m^{\sharp}, and n. We remark that some omitted constants blow up as T increases, but they can be controlled by increasing the sampling number n (see Theorem F.1). The proof is given by the combination of applying Theorem 4.5 as f^{\sharp}=f^{\sharp}_{J} and using Proposition 4.6. For the exact statement and proof, see Appendix F. It can be observed that in (4.9), a bias-variance tradeoff exists with respect to m^{\sharp}. When m^{\sharp} is large, \lambda can be chosen smaller in the condition (4.4), which implies that the bias term (the second term in (4.9)) becomes smaller, but the variance term (the third term in (4.9)) becomes larger. In contrast, the bias becomes larger and the variance becomes smaller when m^{\sharp} is small. Further remarks on Theorem 4.8 are given in Appendix G.

5 Numerical Experiments

In this section, numerical experiments are detailed to demonstrate our theoretical results and show the effectiveness of spectral pruning compared with existing methods. In Sections 5.1 and 5.2, we select pixel-MNIST as our task and employ the IRNN, which is an RNN that uses the ReLU as the activation function and initializes the weights as the identity matrix and the biases to zero (see [Le]). In Section 5.3, we select the PTB [marcus1993building] and employ the RNNLM whose RNN layer is an orthodox Elman type. For RNN training details, see Appendix H. We choose the parameters \theta_{1}=1, \theta_{2}=\theta_{3}=0 in (iv) of Section 3, i.e., we minimize only the input information loss. This choice is not so problematic because the bounds of the output information losses automatically become smaller when the input one is minimized (see Remark 4.3). We choose the regularization parameter \tau=0; this choice regards \widehat{f} as a well-trained network and gives priority to minimizing the approximation error between \widehat{f} and f^{\sharp}_{J} (see the discussion below Theorem 4.5).

5.1 Eigenvalue distributions and information losses

First, we numerically study the relationship between the eigenvalue distribution and the information losses. Figure 2(a) shows the eigenvalue distribution of the covariance matrix \widehat{\Sigma} with 128 hidden nodes, where the eigenvalues are sorted in decreasing order. In this experiment, almost half of the eigenvalues are zero, which cannot be visualized in the figure. Figure 2(b) shows the input information loss L^{(A)}_{0}(J) versus the compressed number m^{\sharp}. The information losses vanish when m^{\sharp}>m_{\mathrm{nzr}} (see Remark 3.2). The blue and pink curves correspond to MNIST (http://yann.lecun.com/exdb/mnist/) and FashionMNIST (https://github.com/zalandoresearch/fashion-mnist), respectively. It can be observed that the eigenvalues for MNIST decrease more rapidly than those for FashionMNIST, and the information losses for MNIST decrease more rapidly than those for FashionMNIST. This phenomenon coincides with the interpretation of Proposition 4.6 (see the discussion below (4.8)).

Figure 2: Relationship between the eigenvalue distribution and the input information loss. (a) Eigenvalue distribution for \widehat{\Sigma}. (b) Input information loss vs. m^{\sharp}.

5.2 Pixel-MNIST (IRNN)

Table 1: Pixel-MNIST (IRNN)
Method | Accuracy[%] (std) | Finetuned Accuracy[%] (std) | # input-hidden | # hidden-hidden | # hidden-out | total
Baseline(128) | 96.80 (0.23) | - | 128 | 16384 | 1280 | 17792
Baseline(42) | 93.35 (0.75) | - | 42 | 1764 | 420 | 2226
Spectral w/ rec.(ours) | 92.61 (2.46) | 97.08 (0.16) | 42 | 1764 | 420 | 2226
Spectral w/o rec. | 83.60 (8.24) | - | 42 | 1764 | 420 | 2226
Random w/ rec. | 34.72 (32.47) | - | 42 | 1764 | 420 | 2226
Random w/o rec. | 23.13 (16.09) | - | 42 | 1764 | 420 | 2226
Random Weight | 10.35 (1.38) | - | 128 | 1764 | 1280 | 3172
Magnitude-based Weight | 11.06 (0.70) | 94.41 (3.02) | 128 | 1764 | 1280 | 3172
Column Sparsification | 84.80 (7.29) | - | 128 | 5376 | 1280 | 6784
Low Rank Factorization | 9.65 (3.85) | - | 128 | 10752 | 1280 | 12160
Table 2: PTB (RNNLM)
Method | Perplexity (std) | Finetuned Perplexity (std) | # input-hidden | # hidden-hidden | # hidden-out | total
Baseline(128) | 114.66 (0.35) | - | 1270016 | 16384 | 1270016 | 2556416
Baseline(42) | 145.85 (0.74) | 132.46 (0.74) | 416724 | 1764 | 416724 | 835212
Spectral w/ rec.(ours) | 207.63 (2.19) | 124.26 (0.39) | 416724 | 1764 | 416724 | 835212
Spectral w/o rec. | 433.99 (10.64) | - | 416724 | 1764 | 416724 | 835212
Random w/ rec. | 243.76 (9.46) | - | 416724 | 1764 | 416724 | 835212
Random w/o rec. | 492.06 (22.40) | - | 416724 | 1764 | 416724 | 835212
Random Weight | 203.41 (2.02) | - | 1270016 | 1764 | 1270016 | 2541796
Magnitude-based Weight | 168.57 (2.57) | 115.65 (0.31) | 1270016 | 1764 | 1270016 | 2541796
Magnitude-based Weight \diamondsuit | 201.41 (3.60) | 126.20 (0.28) | 416724 | 1764 | 416724 | 835212
Column Sparsification | 128.98 (0.52) | - | 1270016 | 5376 | 1270016 | 2545408
Low Rank Factorization | 126.24 (1.79) | - | 1270016 | 10752 | 1270016 | 2550784

We compare spectral pruning with other pruning methods on pixel-MNIST (IRNN). Table 1 summarizes the accuracies and the number of weight parameters for different pruning methods. We consider one-third compression in the hidden state, i.e., for node pruning, 128 hidden nodes were compressed to 42 nodes, while for weight pruning, 128^{2}(=16384) hidden weights were compressed to 42^{2}(=1764) weights.

“Baseline(128)” and “Baseline(42)” represent direct training (not pruning) with 128 and 42 hidden nodes, respectively. “Spectral w/ rec.(ours)” represents spectral pruning with the reconstruction matrix (i.e., the compressed weight is chosen as W^{\sharp h}=\widehat{W}^{h}_{J,[m]}\widehat{A}_{J} with the optimal J with respect to (3.9)), while “Spectral w/o rec.” represents spectral pruning without the reconstruction matrix (i.e., W^{\sharp h}=\widehat{W}^{h}_{J,J} with the optimal J with respect to (3.9)), an idea based on [luo2017thinet]. “Random w/ rec.” represents random node pruning with the reconstruction matrix (i.e., W^{\sharp h}=\widehat{W}^{h}_{J,[m]}\widehat{A}_{J}, where J is randomly chosen), while “Random w/o rec.” represents random node pruning without the reconstruction matrix (i.e., W^{\sharp h}=\widehat{W}^{h}_{J,J}, where J is randomly chosen). “Random Weight” represents random weight pruning. For the reason why we compare with random pruning, see the introduction of [Zhang]. “Magnitude-based Weight” represents magnitude-based weight pruning based on [Narang]. “Column Sparsification” represents the magnitude-based column sparsification during training based on [wang2019acceleration]. “Low Rank Factorization” represents low rank factorization, which truncates small singular values of trained weights, based on [prabhavalkar2016compression]. “Accuracy[%] (std)” and “Finetuned Accuracy[%] (std)” represent the mean (standard deviation) of accuracy before and after fine-tuning, respectively. “# input-hidden”, “# hidden-hidden”, and “# hidden-out” represent the number of input-to-hidden, hidden-to-hidden, and hidden-to-output weight parameters, respectively. “total” represents their sum. For detailed procedures of training, pruning, and fine-tuning, see Appendix H.

We demonstrate that spectral pruning significantly outperforms the other pruning methods. The reason why spectral pruning can compress with small degradation is that the covariance matrix \widehat{\Sigma} has a small number of non-zero rows (we observed around 50 non-zero rows). For details on non-zero rows, see Remark 3.2. Our method with fine-tuning outperforms “Baseline(42)”, which means that spectral pruning benefits from over-parameterization [chang2020provable; zhang2021understanding]. Since magnitude-based weight pruning is a method that requires fine-tuning (e.g., see [Narang]), we have also compared our method with magnitude-based weight pruning with fine-tuning, and observed that our method outperforms magnitude-based weight pruning as well. We also remark that our method with fine-tuning outperforms “Baseline(128)”.

5.3 PTB (RNNLM)

We compare spectral pruning with other pruning methods on the PTB (RNNLM). Table 2 summarizes the perplexity and the number of weight parameters for different pruning methods. As in Section 5.2, we consider one-third compression in the hidden state, and the entries in “Method” are the same as in Table 1 except for “Magnitude-based Weight \diamondsuit”, which represents magnitude-based weight pruning applied not only to hidden-to-hidden weights but also to input-to-hidden and hidden-to-output weights, so that the number of resulting weight parameters is the same as for Spectral w/ rec.(ours).

We demonstrate that our method with fine-tuning outperforms the other pruning methods except for magnitude-based weight pruning. Even though “Low Rank Factorization” retains a large number of weight parameters, its perplexity is slightly worse than that of our method with fine-tuning. On the other hand, our method with fine-tuning cannot outperform “Magnitude-based Weight”, but it slightly outperforms “Magnitude-based Weight \diamondsuit”, which has the same number of weight parameters. We also remark that our method with fine-tuning outperforms “Baseline(42)”, although it does not outperform “Baseline(128)”.

Therefore, we conclude that spectral pruning works well in Elman-RNN, especially in IRNN.

Future Work

It would be interesting to extend our work to the long short-term memory (LSTM). The properties of LSTMs are different from those of RNNs in that LSTMs have the gated architectures including product operations, which might require more complicated analysis of the generalization error bounds as compared with RNNs. Hence, the investigation of spectral pruning for LSTMs is beyond the scope of this study and will be the focus of future work.

Acknowledgements

The authors are grateful to Professor Taiji Suzuki for useful discussions and comments on our work. The first author was supported by Grant-in-Aid for JSPS Fellows (No.21J00119), Japan Society for the Promotion of Science.

References

Appendix

Appendix A Review of Spectral Pruning for DNNs

Let D=\{(x^{i},y^{i})\}_{i=1}^{n} be training data, where x^{i}\in\mathbb{R}^{d_{x}} is an input and y^{i}\in\mathbb{R}^{d_{y}} is an output. The training data are independently identically distributed. To train the appropriate relationship between input and output, we consider DNNs f as

f(x)=(W^{(L)}\sigma(\cdot)+b^{(L)})\circ\cdots\circ(W^{(1)}x+b^{(1)}),

where \sigma:\mathbb{R}\to\mathbb{R} is an activation function, W^{(l)}\in\mathbb{R}^{m_{l+1}\times m_{l}} is a weight matrix, and b^{(l)}\in\mathbb{R}^{m_{l+1}} is a bias. Let \widehat{f} be a trained DNN obtained from the training data D, i.e.,

\widehat{f}(x)=(\widehat{W}^{(L)}\sigma(\cdot)+\widehat{b}^{(L)})\circ\cdots\circ(\widehat{W}^{(1)}x+\widehat{b}^{(1)}).

We denote the input with respect to the l-th layer by

\phi^{(l)}(x)=\sigma\circ(\widehat{W}^{(l-1)}\sigma(\cdot)+\widehat{b}^{(l-1)})\circ\cdots\circ(\widehat{W}^{(1)}x+\widehat{b}^{(1)}).

Let J^{(l)}\subset[m_{l}] be an index set with |J^{(l)}|=m^{\sharp}_{l}, where [m_{l}]:=\{1,\ldots,m_{l}\} and m^{\sharp}_{l}\in\mathbb{N} is the number of nodes of the l-th layer of the compressed DNN f^{\sharp} with m^{\sharp}_{l}\leq m_{l}. Let \phi^{(l)}_{J^{(l)}}(x)=(\phi^{(l)}_{j}(x))_{j\in J^{(l)}} be a subvector of \phi^{(l)}(x) corresponding to the index set J^{(l)}, where \phi^{(l)}_{j}(x) is the j-th component of the vector \phi^{(l)}(x).

(i) Input information loss. The input information loss is defined by

L^{(A,l)}_{\tau}(J^{(l)}):=\min_{A\in\mathbb{R}^{m_{l}\times m^{\sharp}_{l}}}\big\{\big\|\phi^{(l)}-A\phi^{(l)}_{J^{(l)}}\big\|^{2}_{n}+\big\|A\big\|^{2}_{\tau}\big\}, \qquad (A.1)

where \|\cdot\|_{n} is the empirical L^{2}-norm with respect to n, i.e.,

\big\|\phi^{(l)}-A\phi^{(l)}_{J^{(l)}}\big\|^{2}_{n}:=\frac{1}{n}\sum_{i=1}^{n}\big\|\phi^{(l)}(x^{i})-A\phi^{(l)}_{J^{(l)}}(x^{i})\big\|^{2}_{2}, \qquad (A.2)

where \|\cdot\|_{2} is the Euclidean norm and \|A\|^{2}_{\tau}:=\mathrm{Tr}[AI_{\tau}A^{T}] for a regularization parameter \tau\in\mathbb{R}^{m^{\sharp}_{l}}_{+}. Here, \mathbb{R}^{m^{\sharp}_{l}}_{+}:=\big\{x\in\mathbb{R}^{m^{\sharp}_{l}}\,|\,x_{j}>0,\ j=1,\ldots,m^{\sharp}_{l}\big\} and I_{\tau}:=\mathrm{diag}(\tau). By linear regularization theory, there exists a unique solution \widehat{A}^{(l)}_{J^{(l)}}\in\mathbb{R}^{m_{l}\times m^{\sharp}_{l}} of the minimization problem of \|\phi^{(l)}-A\phi^{(l)}_{J^{(l)}}\|^{2}_{n}+\|A\|^{2}_{\tau}, and it has the form

\widehat{A}^{(l)}_{J^{(l)}}=\widehat{\Sigma}^{(l)}_{[m_{l}],J^{(l)}}\big(\widehat{\Sigma}^{(l)}_{J^{(l)},J^{(l)}}+I_{\tau}\big)^{-1}, \qquad (A.3)

where \widehat{\Sigma}^{(l)} is the (noncentered) empirical covariance matrix of \phi^{(l)}(x) with respect to n, i.e.,

\widehat{\Sigma}^{(l)}=\frac{1}{n}\sum_{i=1}^{n}\phi^{(l)}(x^{i})\phi^{(l)}(x^{i})^{T},

and \widehat{\Sigma}^{(l)}_{I,I^{\prime}}=(\widehat{\Sigma}^{(l)}_{i,i^{\prime}})_{i\in I,i^{\prime}\in I^{\prime}}\in\mathbb{R}^{K\times H} is the submatrix of \widehat{\Sigma}^{(l)} corresponding to index sets I,I^{\prime}\subset[m] with |I|=K and |I^{\prime}|=H. By substituting the explicit formula (A.3) of the reconstruction matrix \widehat{A}^{(l)}_{J^{(l)}} into (A.1), the input information loss is reformulated as

L^{(A,l)}_{\tau}(J^{(l)})=\mathrm{Tr}\Big[\widehat{\Sigma}^{(l)}-\widehat{\Sigma}^{(l)}_{[m_{l}],J^{(l)}}\big(\widehat{\Sigma}^{(l)}_{J^{(l)},J^{(l)}}+I_{\tau}\big)^{-1}\widehat{\Sigma}^{(l)}_{J^{(l)},[m_{l}]}\Big]. \qquad (A.4)

(ii) Output information loss. For any matrix Z^{(l)}\in\mathbb{R}^{m\times m_{l}} with an output size m\in\mathbb{N}, we define the output information loss by

L^{(B,l)}_{\tau}(J^{(l)}):=\sum_{j=1}^{m}\min_{\beta\in\mathbb{R}^{m_{l}^{\sharp}}}\big\{\big\|Z^{(l)}_{j,:}\phi^{(l)}-\beta^{T}\phi^{(l)}_{J^{(l)}}\big\|^{2}_{n}+\big\|\beta^{T}\big\|^{2}_{\tau}\big\}, \qquad (A.5)

where Z^{(l)}_{j,:} denotes the j-th row of the matrix Z^{(l)}. A typical situation is that Z^{(l)}=\widehat{W}^{(l)}. The minimization problem of \|Z^{(l)}_{j,:}\phi^{(l)}-\beta^{T}\phi^{(l)}_{J^{(l)}}\|^{2}_{n}+\|\beta^{T}\|^{2}_{\tau} has the unique solution

\widehat{\beta}^{(l)}_{j}=(Z^{(l)}_{j,:}\widehat{A}^{(l)}_{J^{(l)}})^{T},

and by substituting it into (A.5), the output information loss is reformulated as

L^{(B,l)}_{\tau}(J^{(l)})=\mathrm{Tr}\Big[Z^{(l)}\big(\widehat{\Sigma}^{(l)}-\widehat{\Sigma}^{(l)}_{[m],J^{(l)}}\big(\widehat{\Sigma}^{(l)}_{J^{(l)},J^{(l)}}+I_{\tau}\big)^{-1}\widehat{\Sigma}^{(l)}_{J^{(l)},[m]}\big)Z^{(l)T}\Big].

(iii) Compressed DNN by the reconstruction matrix. We construct the compressed DNN by

f^{\sharp}_{J^{(1:L)}}(x)=(W^{\sharp(L)}_{J^{(L)}}\sigma(\cdot)+b^{\sharp(L)})\circ\cdots\circ(W^{\sharp(1)}_{J^{(1)}}x+b^{\sharp(1)}),

where J^{(1:L)}=J^{(1)}\cup\cdots\cup J^{(L)}, b^{\sharp(l)}=\widehat{b}^{(l)}, and W^{\sharp(l)}_{J^{(l)}} is the compressed weight given by the multiplication of the trained weight \widehat{W}^{(l)}_{J^{(l+1)},[m_{l}]} and the reconstruction matrix \widehat{A}_{J^{(l)}}, i.e.,

W^{\sharp(l)}_{J^{(l)}}:=\widehat{W}^{(l)}_{J^{(l+1)},[m_{l}]}\widehat{A}_{J^{(l)}}. \qquad (A.6)


(iv) Optimization. To select an appropriate index set J^{(l)}, we consider the following optimization problem that minimizes a convex combination of input and output information losses, i.e.,

\min_{J^{(l)}\subset[m_{l}]\ \mathrm{s.t.}\ |J^{(l)}|=m^{\sharp}_{l}}\big\{\theta L^{(A,l)}_{\tau}(J^{(l)})+(1-\theta)L^{(B,l)}_{\tau}(J^{(l)})\big\},

for \theta\in[0,1], where m^{\sharp}_{l}\in[m_{l}] is a prespecified number. We adopt the optimal index J^{\sharp(1:L)} in the algorithm. We term this method spectral pruning.

In [Suzuki], the generalization error bounds for DNNs compressed with spectral pruning have been studied (see Theorems 1 and 2 in [Suzuki]), and the parameters \theta, \tau, and Z^{(l)} are chosen such that the error bound becomes smaller.

Appendix B Proof of Proposition 4.2

We restate Proposition 4.2 in an exact form as follows:

Proposition B.1.

Suppose that Assumption 4.1 holds. Let \{(X_{T}^{i},Y_{T}^{i})\}_{i=1}^{n} be sampled i.i.d. from the distribution P_{T}. Then,

\|\widehat{f}-f^{\sharp}\|_{n,T}\leq\sqrt{3}\,\biggl\{\big\|\widehat{W}^{o}\phi-W^{\sharp o}\phi_{J}\big\|_{n,T}+R_{o}\rho_{\sigma}\max\{1,(R_{h}\rho_{\sigma})^{T-2}\}T\big\|\widehat{W}^{h}_{J,[m]}\phi-W^{\sharp h}\phi_{J}\big\|_{n,T}+R_{o}\rho_{\sigma}\bigg(\sum_{t=1}^{T}(R_{h}\rho_{\sigma})^{t-1}\bigg)\big(R_{x}\big\|\widehat{W}^{i}_{J,[d_{x}]}-W^{\sharp i}\big\|_{op}+\|\widehat{b}^{hi}_{J}-b^{\sharp hi}\|_{2}\big)+\big\|\widehat{b}^{o}-b^{\sharp o}\big\|_{2}\biggr\}, \qquad (B.1)

for all f^{\sharp}\in\mathcal{F}^{\sharp}_{T}(R_{o},R_{h},R_{i},R^{b}_{o},R^{b}_{hi}) and J\subset[m] with |J|=m^{\sharp}.

Proof.

Let \widehat{f}=(\widehat{f}_{t})_{t=1}^{T} be a trained RNN and f^{\sharp}\in\mathcal{F}^{\sharp}_{T}(R_{o},R_{h},R_{i},R^{b}_{o},R^{b}_{hi}). Let us define functions \phi and \phi^{\sharp} by

\phi(x,h):=\sigma(\widehat{W}^{h}h+\widehat{W}^{i}x+\widehat{b}^{hi})\quad\text{for}\ x\in\mathbb{R}^{d_{x}},\ h\in\mathbb{R}^{m},
\phi^{\sharp}(x,h^{\sharp}):=\sigma(W^{\sharp h}h^{\sharp}+W^{\sharp i}x+b^{\sharp hi})\quad\text{for}\ x\in\mathbb{R}^{d_{x}},\ h^{\sharp}\in\mathbb{R}^{m^{\sharp}}, \qquad (B.2)

and denote the hidden states by

\widehat{h}_{t}:=\phi(x_{t},\widehat{h}_{t-1}),\quad h^{\sharp}_{t}:=\phi^{\sharp}(x_{t},h^{\sharp}_{t-1})\quad\text{for}\ t=1,2,\cdots,T. \qquad (B.3)

If the training data X^{i}_{T}=(x^{i}_{t})_{t=1}^{T} are used as input, we denote the corresponding hidden states by

\widehat{h}^{i}_{t}:=\phi(x^{i}_{t},\widehat{h}^{i}_{t-1}),\quad h^{\sharp i}_{t}:=\phi^{\sharp}(x^{i}_{t},h^{\sharp i}_{t-1}),

and its outputs at time t by

\widehat{f}_{t}(X^{i}_{t})=\widehat{W}^{o}\phi(x^{i}_{t},\widehat{h}^{i}_{t-1})+\widehat{b}^{o},\quad f^{\sharp}_{t}(X^{i}_{t})=W^{\sharp o}\phi^{\sharp}(x^{i}_{t},h^{\sharp i}_{t-1})+b^{\sharp o},

for t=1,2,\ldots,T. Then, we have

\begin{split}\big\|\widehat{f}_{t}(X^{i}_{t})-f^{\sharp}_{t}(X^{i}_{t})\big\|_{2}&\leq\big\|\widehat{W}^{o}\phi(x^{i}_{t},\widehat{h}^{i}_{t-1})-W^{\sharp o}\phi_{J}(x^{i}_{t},\widehat{h}^{i}_{t-1})\big\|_{2}\\ &\quad+\big\|W^{\sharp o}\phi_{J}(x^{i}_{t},\widehat{h}^{i}_{t-1})-W^{\sharp o}\phi^{\sharp}(x^{i}_{t},h^{\sharp i}_{t-1})\big\|_{2}+\big\|\widehat{b}^{o}-b^{\sharp o}\big\|_{2}.\end{split} \qquad (B.4)

If we can prove that the second term on the right-hand side of (B.4) is estimated as

WoϕJ(xti,h^t1i)Woϕ(xti,ht1i)2Roρσ{max{1,(Rhρσ)t2}l=1t1W^J,[m]hϕ(xtli,h^tl1i)WhϕJ(xtli,h^tl1i)2+l=1t(Rhρσ)l1(W^J,[dx]iWiopxtl+1i2+b^Jhibhi2)},\big{\|}W^{\sharp o}\phi_{J}(x^{i}_{t},\widehat{h}^{i}_{t-1})-W^{\sharp o}\phi^{\sharp}(x^{i}_{t},h^{\sharp i}_{t-1})\big{\|}_{2}\\ \leq R_{o}\rho_{\sigma}\bigg{\{}\max\big{\{}1,(R_{h}\rho_{\sigma})^{t-2}\big{\}}\sum_{l=1}^{t-1}\big{\|}\widehat{W}^{h}_{J,[m]}\phi(x^{i}_{t-l},\widehat{h}^{i}_{t-l-1})-W^{\sharp h}\phi_{J}(x^{i}_{t-l},\widehat{h}^{i}_{t-l-1})\big{\|}_{2}\\ +\sum_{l=1}^{t}(R_{h}\rho_{\sigma})^{l-1}\big{(}\big{\|}\widehat{W}^{i}_{J,[d_{x}]}-W^{\sharp i}\big{\|}_{op}\|x^{i}_{t-l+1}\|_{2}+\big{\|}\widehat{b}^{hi}_{J}-b^{\sharp hi}\big{\|}_{2}\big{)}\bigg{\}}, (B.5)

then by using the inequalities (B.4) and (k=1Kak)2Kk=1Kak2(\sum_{k=1}^{K}a_{k})^{2}\leq K\sum_{k=1}^{K}a_{k}^{2}, we have

f^t(Xti)ft(Xti)223{W^oϕ(xti,h^t1i)WoϕJ(xti,h^t1i)22+(Roρσmax{1,(Rhρσ)t2}l=1t1W^J,[m]hϕ(xtli,h^tl1i)WhϕJ(xtli,h^tl1i)2)2+(Roρσl=1t(Rhρσ)l1(W^J,[dx]iWiopxtl+1i2+b^Jhibhi2)+b^obo2)2}.\begin{split}&\big{\|}\widehat{f}_{t}(X^{i}_{t})-f^{\sharp}_{t}(X^{i}_{t})\big{\|}_{2}^{2}\leq 3\Bigg{\{}\big{\|}\widehat{W}^{o}\phi(x^{i}_{t},\widehat{h}^{i}_{t-1})-W^{\sharp o}\phi_{J}(x^{i}_{t},\widehat{h}^{i}_{t-1})\big{\|}_{2}^{2}\\ &+\bigg{(}R_{o}\rho_{\sigma}\max\big{\{}1,(R_{h}\rho_{\sigma})^{t-2}\big{\}}\sum_{l=1}^{t-1}\big{\|}\widehat{W}^{h}_{J,[m]}\phi(x^{i}_{t-l},\widehat{h}^{i}_{t-l-1})-W^{\sharp h}\phi_{J}(x^{i}_{t-l},\widehat{h}^{i}_{t-l-1})\big{\|}_{2}\bigg{)}^{2}\\ &+\bigg{(}R_{o}\rho_{\sigma}\sum_{l=1}^{t}(R_{h}\rho_{\sigma})^{l-1}\big{(}\big{\|}\widehat{W}^{i}_{J,[d_{x}]}-W^{\sharp i}\big{\|}_{op}\|x^{i}_{t-l+1}\|_{2}+\big{\|}\widehat{b}^{hi}_{J}-b^{\sharp hi}\big{\|}_{2}\big{)}+\big{\|}\widehat{b}^{o}-b^{\sharp o}\big{\|}_{2}\bigg{)}^{2}\Bigg{\}}.\end{split}

Hence, by taking the average over i=1,\ldots,n and t=1,\ldots,T, and by using the inequality \sum_{t=1}^{T}(\sum_{l=1}^{t}a_{l})^{2}\leq T^{2}\sum_{t=1}^{T}a_{t}^{2} (a consequence of the Cauchy--Schwarz inequality) together with the bound \|x^{i}_{t}\|_{2}\leq R_{x}, we obtain

\begin{split}&\big{\|}\widehat{f}-f^{\sharp}\big{\|}_{n,T}^{2}=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\big{\|}\widehat{f}_{t}(X^{i}_{t})-f^{\sharp}_{t}(X^{i}_{t})\big{\|}^{2}_{2}\\ &\leq 3\,\Bigg{\{}\underbrace{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\big{\|}\widehat{W}^{o}\phi(x^{i}_{t},\widehat{h}^{i}_{t-1})-W^{\sharp o}\phi_{J}(x^{i}_{t},\widehat{h}^{i}_{t-1})\big{\|}_{2}^{2}}_{=\|\widehat{W}^{o}\phi-W^{\sharp o}\phi_{J}\|_{n,T}^{2}}\\ &+\big{(}R_{o}\rho_{\sigma}\max\big{\{}1,(R_{h}\rho_{\sigma})^{T-2}\big{\}}T\big{)}^{2}\underbrace{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\big{\|}\widehat{W}^{h}_{J,[m]}\phi(x^{i}_{t},\widehat{h}^{i}_{t-1})-W^{\sharp h}\phi_{J}(x^{i}_{t},\widehat{h}^{i}_{t-1})\big{\|}_{2}^{2}}_{=\|\widehat{W}^{h}_{J,[m]}\phi-W^{\sharp h}\phi_{J}\|_{n,T}^{2}}\\ &+\bigg{(}R_{o}\rho_{\sigma}\bigg{(}\sum_{t=1}^{T}(R_{h}\rho_{\sigma})^{t-1}\bigg{)}\big{(}R_{x}\big{\|}\widehat{W}^{i}_{J,[d_{x}]}-W^{\sharp i}\big{\|}_{op}+\big{\|}\widehat{b}^{hi}_{J}-b^{\sharp hi}\big{\|}_{2}\big{)}+\big{\|}\widehat{b}^{o}-b^{\sharp o}\big{\|}_{2}\bigg{)}^{2}\Bigg{\}},\end{split}

which concludes the inequality (B.1). It remains to prove (B.5). We calculate that

WoϕJ(xti,h^t1i)Woϕ(xti,ht1i)2Woopσ(W^J,[m]hϕ(xt1i,h^t2i)+W^J,[dx]ixti+b^Jhi)σ(Whϕ(xt1i,ht2i)+Wixti+bhi)2Roρσ{W^J,[m]hϕ(xt1i,h^t2i)Whϕ(xt1i,ht2i)2=:Ht1+W^J,[dx]iWiopxti2+b^hibhi2},\begin{split}\big{\|}W^{\sharp o}\phi_{J}(x^{i}_{t},&\,\widehat{h}^{i}_{t-1})-W^{\sharp o}\phi^{\sharp}(x^{i}_{t},h^{\sharp i}_{t-1})\big{\|}_{2}\\ &\leq\big{\|}W^{\sharp o}\big{\|}_{op}\big{\|}\sigma\big{(}\widehat{W}^{h}_{J,[m]}\phi(x^{i}_{t-1},\widehat{h}^{i}_{t-2})+\widehat{W}^{i}_{J,[d_{x}]}x^{i}_{t}+\widehat{b}^{hi}_{J}\big{)}\\ &\hskip 113.81102pt-\sigma\big{(}W^{\sharp h}\phi^{\sharp}(x^{i}_{t-1},h^{\sharp i}_{t-2})+W^{\sharp i}x^{i}_{t}+b^{\sharp hi}\big{)}\big{\|}_{2}\\ &\leq R_{o}\rho_{\sigma}\bigg{\{}\underbrace{\big{\|}\widehat{W}^{h}_{J,[m]}\phi(x^{i}_{t-1},\widehat{h}^{i}_{t-2})-W^{\sharp h}\phi^{\sharp}(x^{i}_{t-1},h^{\sharp i}_{t-2})\big{\|}_{2}}_{=:H_{t-1}}\\ &\hskip 113.81102pt+\big{\|}\widehat{W}^{i}_{J,[d_{x}]}-W^{\sharp i}\big{\|}_{op}\|x^{i}_{t}\|_{2}+\big{\|}\widehat{b}^{hi}-b^{\sharp hi}\big{\|}_{2}\bigg{\}},\end{split} (B.6)

where op\|\cdot\|_{op} is the operator norm (which is the largest singular value). Concerning the quantity Ht1H_{t-1}, we estimate

Ht1W^J,[m]hϕ(xt1i,h^t2i)WhϕJ(xt1i,h^t2i)2+WhϕJ(xt1i,h^t2i)Whϕ(xt1i,ht2i)2,H_{t-1}\leq\big{\|}\widehat{W}^{h}_{J,[m]}\phi(x^{i}_{t-1},\widehat{h}^{i}_{t-2})-W^{\sharp h}\phi_{J}(x^{i}_{t-1},\widehat{h}^{i}_{t-2})\big{\|}_{2}\\ +\big{\|}W^{\sharp h}\phi_{J}(x^{i}_{t-1},\widehat{h}^{i}_{t-2})-W^{\sharp h}\phi^{\sharp}(x^{i}_{t-1},h^{\sharp i}_{t-2})\big{\|}_{2},

and moreover, the second term is estimated as

WhϕJ(xt1i,h^t2i)Whϕ(xt1i,ht2i)2Whopσ(W^J,[m]hϕ(xt2i,h^t3i)+W^J,[dx]ixt1i+b^Jhi)σ(Whϕ(xt2i,ht3i)+Wixt1i+bhi)2Rhρσ{Ht2+W^J,[dx]iWiopxt1i2+b^Jhibhi2},\begin{split}\big{\|}W^{\sharp h}\phi_{J}(x^{i}_{t-1},&\,\widehat{h}^{i}_{t-2})-W^{\sharp h}\phi^{\sharp}(x^{i}_{t-1},h^{\sharp i}_{t-2})\big{\|}_{2}\\ &\leq\big{\|}W^{\sharp h}\big{\|}_{op}\big{\|}\sigma\big{(}\widehat{W}^{h}_{J,[m]}\phi(x^{i}_{t-2},\widehat{h}^{i}_{t-3})+\widehat{W}^{i}_{J,[d_{x}]}x^{i}_{t-1}+\widehat{b}^{hi}_{J}\big{)}\\ &\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad-\sigma\big{(}W^{\sharp h}\phi^{\sharp}(x^{i}_{t-2},h^{\sharp i}_{t-3})+W^{\sharp i}x^{i}_{t-1}+b^{\sharp hi}\big{)}\big{\|}_{2}\\ &\leq R_{h}\rho_{\sigma}\Big{\{}H_{t-2}+\big{\|}\widehat{W}^{i}_{J,[d_{x}]}-W^{\sharp i}\big{\|}_{op}\|x^{i}_{t-1}\|_{2}+\big{\|}\widehat{b}^{hi}_{J}-b^{\sharp hi}\big{\|}_{2}\Big{\}},\end{split}

for all tt. Thus, we have the recursive inequality

Ht1W^J,[m]hϕ(xt1i,h^t2i)WhϕJ(xt1i,h^t2i)2+Rhρσ{Ht2+W^J,[dx]iWiopxt1i2+b^Jhibhi2},\begin{split}H_{t-1}&\leq\big{\|}\widehat{W}^{h}_{J,[m]}\phi(x^{i}_{t-1},\widehat{h}^{i}_{t-2})-W^{\sharp h}\phi_{J}(x^{i}_{t-1},\widehat{h}^{i}_{t-2})\big{\|}_{2}\\ &\quad+R_{h}\rho_{\sigma}\Big{\{}H_{t-2}+\big{\|}\widehat{W}^{i}_{J,[d_{x}]}-W^{\sharp i}\big{\|}_{op}\|x^{i}_{t-1}\|_{2}+\big{\|}\widehat{b}^{hi}_{J}-b^{\sharp hi}\big{\|}_{2}\Big{\}},\end{split} (B.7)

for t=2,,Tt=2,\ldots,T. By repeatedly substituting (B.7) into (B.6), we arrive at (B.5):

WoϕJ(xti,h^t1i)Woϕ(xti,ht1i)2Roρσ{l=1t1(Rhρσ)l1max{1,(Rhρσ)t2}W^J,[m]hϕ(xtli,h^tl1i)WhϕJ(xtli,h^tl1i)2+l=1t(Rhρσ)l1(W^J,[dx]iWiopxtl+1i2+b^Jhibhi2)}.\begin{split}\big{\|}W^{\sharp o}&\,\phi_{J}(x^{i}_{t},\widehat{h}^{i}_{t-1})-W^{\sharp o}\phi^{\sharp}(x^{i}_{t},h^{\sharp i}_{t-1})\big{\|}_{2}\\ &\leq R_{o}\rho_{\sigma}\bigg{\{}\sum_{l=1}^{t-1}\underbrace{(R_{h}\rho_{\sigma})^{l-1}}_{\leq\,\max\{1,(R_{h}\rho_{\sigma})^{t-2}\}}\big{\|}\widehat{W}^{h}_{J,[m]}\phi(x^{i}_{t-l},\widehat{h}^{i}_{t-l-1})-W^{\sharp h}\phi_{J}(x^{i}_{t-l},\widehat{h}^{i}_{t-l-1})\big{\|}_{2}\\ &\quad+\sum_{l=1}^{t}(R_{h}\rho_{\sigma})^{l-1}\big{(}\big{\|}\widehat{W}^{i}_{J,[d_{x}]}-W^{\sharp i}\big{\|}_{op}\|x^{i}_{t-l+1}\|_{2}+\big{\|}\widehat{b}^{hi}_{J}-b^{\sharp hi}\big{\|}_{2}\big{)}\bigg{\}}.\end{split}

Thus, we conclude Proposition B.1. ∎

Appendix C Proof of Theorem 4.5

We restate Theorem 4.5 in an exact form as follows:

Theorem C.1.

Suppose that Assumptions 4.1 and 4.4 hold. Let {(XTi,YTi)}i=1n\{(X_{T}^{i},Y_{T}^{i})\}_{i=1}^{n} be sampled i.i.d. from the distribution PTP_{T}. Then, for any δlog2\delta\geq\log 2, we have the following inequality with probability greater than 12eδ1-2e^{-\delta}:

Ψj(f)Ψ^j(f^)+3ρψ{W^oϕWoϕJn,T+Roρσmax{1,(Rhρσ)T2}TW^J,[m]hϕWhϕJn,T+Roρσ(t=1T(Rhρσ)t1)(RxW^J,[dx]iWiop+b^Jhibhi2)+b^obo2}+1n{c^ρψmT(t=1TMt1/2R,t1/2)+32δ(ρψR,T+Ry)},\begin{split}\Psi_{j}(f^{\sharp})&\leq\widehat{\Psi}_{j}(\widehat{f})+\sqrt{3}\rho_{\psi}\Bigg{\{}\big{\|}\widehat{W}^{o}\phi-W^{\sharp o}\phi_{J}\big{\|}_{n,T}\\ &\quad+R_{o}\rho_{\sigma}\max\{1,(R_{h}\rho_{\sigma})^{T-2}\}T\big{\|}\widehat{W}^{h}_{J,[m]}\phi-W^{\sharp h}\phi_{J}\big{\|}_{n,T}\\ &\quad+R_{o}\rho_{\sigma}\bigg{(}\sum_{t=1}^{T}(R_{h}\rho_{\sigma})^{t-1}\bigg{)}\big{(}R_{x}\big{\|}\widehat{W}^{i}_{J,[d_{x}]}-W^{\sharp i}\big{\|}_{op}+\big{\|}\widehat{b}^{hi}_{J}-b^{\sharp hi}\big{\|}_{2}\big{)}+\big{\|}\widehat{b}^{o}-b^{\sharp o}\big{\|}_{2}\Bigg{\}}\\ &\quad+\frac{1}{\sqrt{n}}\Bigg{\{}\frac{\widehat{c}\rho_{\psi}\sqrt{m^{\sharp}}}{T}\bigg{(}\sum_{t=1}^{T}M_{t}^{1/2}R^{1/2}_{\infty,t}\bigg{)}+3\sqrt{2\delta}(\rho_{\psi}R_{\infty,T}+R_{y})\Bigg{\}},\end{split}

for j=1,,dyj=1,\ldots,d_{y} and for all J[m]J\subset[m] with |J|=m|J|=m^{\sharp} and fT(Ro,Rh,Ri,Rob,Rhib)f^{\sharp}\in\mathcal{F}^{\sharp}_{T}(R_{o},R_{h},R_{i},R^{b}_{o},R^{b}_{hi}), where c^:=1925\widehat{c}:=192\sqrt{5}, and R,tR_{\infty,t} and MtM_{t} are defined by

R,t:=Roρσ(RiRx+Rhib)(l=1t(Rhρσ)l1)+Rob,R_{\infty,t}:=R_{o}\rho_{\sigma}(R_{i}R_{x}+R^{b}_{hi})\bigg{(}\sum_{l=1}^{t}(R_{h}\rho_{\sigma})^{l-1}\bigg{)}+R^{b}_{o}, (C.1)
Mt:=Roρσ[(dymin{m,dy}+dxmin{m,dx})RiRx+(dymin{m,dy}+1)Rhib](l=0t1(Rhρσ)l)+(m)32Rhρσ2Ro(RiRx+Rhib)(l=1t1k=0l1(Rhρσ)t1l+k)+dyRob.\begin{split}M_{t}&:=R_{o}\rho_{\sigma}\biggl{[}\big{(}d_{y}\min\{\sqrt{m^{\sharp}},\sqrt{d_{y}}\}+d_{x}\min\{\sqrt{m^{\sharp}},\sqrt{d_{x}}\}\big{)}R_{i}R_{x}\\ &\quad+\big{(}d_{y}\min\{\sqrt{m^{\sharp}},\sqrt{d_{y}}\}+1\big{)}R^{b}_{hi}\biggr{]}\bigg{(}\sum_{l=0}^{t-1}(R_{h}\rho_{\sigma})^{l}\bigg{)}\\ &\quad+(m^{\sharp})^{\frac{3}{2}}R_{h}\rho_{\sigma}^{2}R_{o}(R_{i}R_{x}+R^{b}_{hi})\bigg{(}\sum_{l=1}^{t-1}\sum_{k=0}^{l-1}(R_{h}\rho_{\sigma})^{t-1-l+k}\bigg{)}+d_{y}R^{b}_{o}.\end{split} (C.2)
Proof.

The generalization error of f^{\sharp}\in\mathcal{F}^{\sharp}_{T} is decomposed as

Ψj(f)=Ψj(f^)+(Ψ^j(f)Ψ^j(f^))+(Ψj(f)Ψ^j(f)),\Psi_{j}(f^{\sharp})=\Psi_{j}(\widehat{f})+\big{(}\widehat{\Psi}_{j}(f^{\sharp})-\widehat{\Psi}_{j}(\widehat{f})\big{)}+\big{(}\Psi_{j}(f^{\sharp})-\widehat{\Psi}_{j}(f^{\sharp})\big{)},

where the second term Ψ^j(f)Ψ^j(f^)\widehat{\Psi}_{j}(f^{\sharp})-\widehat{\Psi}_{j}(\widehat{f}) is called the approximation error and the third term Ψj(f)Ψ^j(f)\Psi_{j}(f^{\sharp})-\widehat{\Psi}_{j}(f^{\sharp}) is called the estimation error. Since the loss function ψ\psi is ρψ\rho_{\psi}-Lipschitz continuous, the approximation error is evaluated as

|Ψ^j(f)Ψ^j(f^)|1nTi=1nt=1T|ψ(yt,ji,ft(Xti)j)ψ(yti,f^t(Xti)j)|ρψnTi=1nt=1T|f(Xti)jf^t(Xti)j|ρψnTi=1nt=1Tft(Xti)f^t(Xti)2ρψ1nTi=1nt=1Tft(Xti)f^t(Xti)22=ρψff^n,T.\begin{split}\big{|}\widehat{\Psi}_{j}(f^{\sharp})-\widehat{\Psi}_{j}(\widehat{f})\big{|}&\leq\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\big{|}\psi(y^{i}_{t,j},f^{\sharp}_{t}(X^{i}_{t})_{j})-\psi(y^{i}_{t},\widehat{f}_{t}(X^{i}_{t})_{j})\big{|}\\ &\leq\frac{\rho_{\psi}}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\big{|}f^{\sharp}(X^{i}_{t})_{j}-\widehat{f}_{t}(X^{i}_{t})_{j}\big{|}\\ &\leq\frac{\rho_{\psi}}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\big{\|}f^{\sharp}_{t}(X^{i}_{t})-\widehat{f}_{t}(X^{i}_{t})\big{\|}_{2}\\ &\leq\rho_{\psi}\sqrt{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\big{\|}f^{\sharp}_{t}(X^{i}_{t})-\widehat{f}_{t}(X^{i}_{t})\big{\|}^{2}_{2}}=\rho_{\psi}\big{\|}f^{\sharp}-\widehat{f}\big{\|}_{n,T}.\end{split}

The term ff^n,T\|f^{\sharp}-\widehat{f}\|_{n,T} is evaluated by Proposition 4.2 (see also Proposition B.1). In the rest of the proof, let us concentrate on the estimation error bound.

First, we define the following function space

𝒢T,j:={gj|gj(YT,XT)=1Tt=1Tψ(yt,j,ft(Xt)j) for (XT,YT)supp(PT)fT,}\mathcal{G}^{\sharp}_{T,j}:=\bigg{\{}g_{j}\,\bigg{|}\,g_{j}(Y_{T},X_{T})=\frac{1}{T}\sum_{t=1}^{T}\psi(y_{t,j},f_{t}(X_{t})_{j})\text{ for $(X_{T},Y_{T})\in\mathrm{supp}(P_{T})$, $f\in\mathcal{F}^{\sharp}_{T}$},\bigg{\}}

for j=1,,dyj=1,\ldots,d_{y}. For gj𝒢T,jg_{j}\in\mathcal{G}^{\sharp}_{T,j}, we have

|gj(YT,XT)|1Tt=1T{|ψ(yt,j,ft(Xt)j)ψ(yt,j,0)|+|ψ(yt,j,0)|}1Tt=1T(ρψ|ft(Xt)j|+Ry)ρψTt=1Tft(Xt)2+Ry.\begin{split}\big{|}g_{j}(Y_{T},X_{T})\big{|}&\leq\frac{1}{T}\sum_{t=1}^{T}\big{\{}|\psi(y_{t,j},f_{t}(X_{t})_{j})-\psi(y_{t,j},0)|+|\psi(y_{t,j},0)|\big{\}}\\ &\leq\frac{1}{T}\sum_{t=1}^{T}\big{(}\rho_{\psi}|f_{t}(X_{t})_{j}|+R_{y}\big{)}\leq\frac{\rho_{\psi}}{T}\sum_{t=1}^{T}\|f_{t}(X_{t})\|_{2}+R_{y}.\end{split}

The quantity ft(Xt)2\|f_{t}(X_{t})\|_{2} is evaluated by

ft(Xt)2RoρσWhϕ(xt1,ht2i)+Wixt+bhi2+Rob.\|f_{t}(X_{t})\|_{2}\leq R_{o}\rho_{\sigma}\big{\|}W^{\sharp h}\phi^{\sharp}(x_{t-1},h^{\sharp i}_{t-2})+W^{\sharp i}x_{t}+b^{\sharp hi}\big{\|}_{2}+R_{o}^{b}.

The recurrent structures (B.2) and (B.3) give

Whϕ(xt1,ht2i)+Wixt+bhi2RhρσWhϕ(xt2,ht3i)+Wixt1+bhi2+RiRx+Rhib,\begin{split}\big{\|}W^{\sharp h}\phi^{\sharp}&\,(x_{t-1},h^{\sharp i}_{t-2})+W^{\sharp i}x_{t}+b^{\sharp hi}\big{\|}_{2}\\ &\leq R_{h}\rho_{\sigma}\big{\|}W^{\sharp h}\phi^{\sharp}(x_{t-2},h^{\sharp i}_{t-3})+W^{\sharp i}x_{t-1}+b^{\sharp hi}\big{\|}_{2}+R_{i}R_{x}+R^{b}_{hi},\end{split}

and by repeating this estimate,

Whϕ(xt1,ht2i)+Wixt+bhi2(RiRx+Rhib)(l=1t(Rhρσ)l1).\big{\|}W^{\sharp h}\phi^{\sharp}(x_{t-1},h^{\sharp i}_{t-2})+W^{\sharp i}x_{t}+b^{\sharp hi}\big{\|}_{2}\leq(R_{i}R_{x}+R^{b}_{hi})\bigg{(}\sum_{l=1}^{t}(R_{h}\rho_{\sigma})^{l-1}\bigg{)}.

Hence, we see from (C.1) that

ft(Xt)2Roρσ(RiRx+Rhib)(l=1t(Rhρσ)l1)+Rob=R,t,\|f_{t}(X_{t})\|_{2}\leq R_{o}\rho_{\sigma}(R_{i}R_{x}+R^{b}_{hi})\bigg{(}\sum_{l=1}^{t}(R_{h}\rho_{\sigma})^{l-1}\bigg{)}+R^{b}_{o}=R_{\infty,t},

which implies that

\begin{split}\big{|}g_{j}(Y_{T},X_{T})\big{|}&\leq\rho_{\psi}R_{o}\rho_{\sigma}(R_{i}R_{x}+R^{b}_{hi})\bigg{\{}\frac{1}{T}\sum_{t=1}^{T}\sum_{l=1}^{t}(R_{h}\rho_{\sigma})^{l-1}\bigg{\}}+\rho_{\psi}R^{b}_{o}+R_{y}\\ &\leq\rho_{\psi}R_{o}\rho_{\sigma}(R_{i}R_{x}+R^{b}_{hi})\bigg{(}\sum_{t=1}^{T}(R_{h}\rho_{\sigma})^{t-1}\bigg{)}+\rho_{\psi}R^{b}_{o}+R_{y}\\ &=\rho_{\psi}R_{\infty,T}+R_{y}.\end{split}

By Theorem 3.4.5 in [[Gine], for any \delta>\log 2, we have the following inequality with probability greater than 1-2e^{-\delta}:

|Ψj(f)Ψ^j(f)|supgj𝒢T,j|1ni=1ngj(YTi,XTi)EPT[gj(YT,XT)]|2Eϵ[supgj𝒢T,j|1ni=1nϵigj(YTi,XTi)|]+3(ρψR,T+Ry)2δn,\begin{split}\big{|}\Psi_{j}(f^{\sharp})-\widehat{\Psi}_{j}(f^{\sharp})\big{|}&\leq\sup_{g_{j}\in\mathcal{G}^{\sharp}_{T,j}}\bigg{|}\frac{1}{n}\sum_{i=1}^{n}g_{j}(Y_{T}^{i},X_{T}^{i})-E_{P_{T}}[g_{j}(Y_{T},X_{T})]\bigg{|}\\ &\leq 2E_{\epsilon}\Bigg{[}\sup_{g_{j}\in\mathcal{G}^{\sharp}_{T,j}}\bigg{|}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}g_{j}(Y_{T}^{i},X_{T}^{i})\bigg{|}\Bigg{]}+3(\rho_{\psi}R_{\infty,T}+R_{y})\sqrt{\frac{2\delta}{n}},\end{split}

where (\epsilon_{i})_{i=1}^{n} is an i.i.d. Rademacher sequence (see, e.g., Definition 3.1.19 in [[Gine]). The first term on the right-hand side of the above inequality, called the Rademacher complexity, is estimated by using Theorem 4.12 in [[Ledoux] and Lemma A.5 in [[Bartlett] (or Lemma 9 in [[Chen]) as follows:

Eϵ[supgj𝒢T,j|1ni=1nϵigj(YTi,XTi)|]1Tt=1TEϵ[supft,jt,j|1ni=1nϵiψj(yt,ji,ft(Xti)j)|]2ρψTt=1TEϵ[supft,jt,j|1ni=1nϵift(Xti)j|]2ρψTt=1Tinfα>0(4αn+12nα2R,tnlogN(t,j,ϵ,S)𝑑ϵ),\begin{split}E_{\epsilon}\Bigg{[}\sup_{g_{j}\in\mathcal{G}^{\sharp}_{T,j}}\bigg{|}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}g_{j}(Y_{T}^{i},X_{T}^{i})\bigg{|}\Bigg{]}&\leq\frac{1}{T}\sum_{t=1}^{T}E_{\epsilon}\Bigg{[}\sup_{f_{t,j}\in\mathcal{F}^{\sharp}_{t,j}}\bigg{|}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}\psi_{j}(y_{t,j}^{i},f_{t}(X_{t}^{i})_{j})\bigg{|}\Bigg{]}\\ &\leq\frac{2\rho_{\psi}}{T}\sum_{t=1}^{T}E_{\epsilon}\Bigg{[}\sup_{f_{t,j}\in\mathcal{F}^{\sharp}_{t,j}}\bigg{|}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f_{t}(X_{t}^{i})_{j}\bigg{|}\Bigg{]}\\ &\leq\frac{2\rho_{\psi}}{T}\sum_{t=1}^{T}\inf_{\alpha>0}\bigg{(}\frac{4\alpha}{\sqrt{n}}+\frac{12}{n}\int_{\alpha}^{2R_{\infty,t}\sqrt{n}}\sqrt{\log N(\mathcal{F}^{\sharp}_{t,j},\epsilon,\|\cdot\|_{S})}\,d\epsilon\bigg{)},\end{split}

where t,j\mathcal{F}^{\sharp}_{t,j} and S\|\cdot\|_{S} are defined by

t,j:={ft,j|ft,j(Xt)=ft(Xt)j for Xtsupp(PXt)fT},\mathcal{F}^{\sharp}_{t,j}:=\big{\{}f_{t,j}\,\big{|}\,f_{t,j}(X_{t})=f_{t}(X_{t})_{j}\text{ for $X_{t}\in\mathrm{supp}(P_{X_{t}})$, $f\in\mathcal{F}^{\sharp}_{T}$}\big{\}},
ft,jS:=(i=1n|ft(Xti)j|2)1/2.\|f_{t,j}\|_{S}:=\bigg{(}\sum_{i=1}^{n}|f_{t}(X_{t}^{i})_{j}|^{2}\bigg{)}^{1/2}.

Here, N(F,\epsilon,\|\cdot\|) denotes the covering number of F, i.e., the minimal cardinality of a subset C\subset F that covers F at scale \epsilon with respect to the norm \|\cdot\|. By using Lemma D.1 in Appendix D, for any \delta>\log 2, we conclude the following estimation error bound:

|Ψj(f)Ψ^j(f)|16ρψαn+48ρψnTt=1Tα2R,tnlogN(t,j,ϵ,S)𝑑ϵ+3(ρψR,T+Ry)2δn48ρψnT10mn1/4t=1TMt1/2α2R,tndϵϵ+3(ρψR,T+Ry)2δn+O(α)=c^ρψmnT(t=1TMt1/2R,t1/2)+3(ρψR,T+Ry)2δn+O(α),\begin{split}&\big{|}\Psi_{j}(f^{\sharp})-\widehat{\Psi}_{j}(f^{\sharp})\big{|}\\ &\leq\frac{16\rho_{\psi}\alpha}{\sqrt{n}}+\frac{48\rho_{\psi}}{nT}\sum_{t=1}^{T}\int_{\alpha}^{2R_{\infty,t}\sqrt{n}}\sqrt{\log N(\mathcal{F}^{\sharp}_{t,j},\epsilon,\|\cdot\|_{S})}\,d\epsilon+3(\rho_{\psi}R_{\infty,T}+R_{y})\sqrt{\frac{2\delta}{n}}\\ &\leq\frac{48\rho_{\psi}}{nT}\sqrt{10m^{\sharp}}n^{1/4}\sum_{t=1}^{T}M_{t}^{1/2}\int_{\alpha}^{2R_{\infty,t}\sqrt{n}}\frac{d\epsilon}{\sqrt{\epsilon}}+3(\rho_{\psi}R_{\infty,T}+R_{y})\sqrt{\frac{2\delta}{n}}+O(\alpha)\\ &=\frac{\widehat{c}\rho_{\psi}\sqrt{m^{\sharp}}}{\sqrt{n}T}\bigg{(}\sum_{t=1}^{T}M_{t}^{1/2}R^{1/2}_{\infty,t}\bigg{)}+3(\rho_{\psi}R_{\infty,T}+R_{y})\sqrt{\frac{2\delta}{n}}+O(\alpha),\end{split}

for all \alpha>0 with probability greater than 1-2e^{-\delta}, where \widehat{c}:=192\sqrt{5} and M_{t} is defined by (C.2). Since this inequality holds for every \alpha>0, letting \alpha\to 0 yields the stated estimation error bound. The proof of Theorem C.1 is complete. ∎
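We also record, for the reader's convenience, how the constant \widehat{c}=192\sqrt{5} arises. Dropping the lower limit of the entropy integral gives

\int_{\alpha}^{2R_{\infty,t}\sqrt{n}}\frac{d\epsilon}{\sqrt{\epsilon}}\leq 2\sqrt{2R_{\infty,t}\sqrt{n}}=2\sqrt{2}\,R_{\infty,t}^{1/2}\,n^{1/4},

so that

\frac{48\rho_{\psi}}{nT}\sqrt{10m^{\sharp}}\,n^{1/4}\sum_{t=1}^{T}M_{t}^{1/2}\cdot 2\sqrt{2}\,R_{\infty,t}^{1/2}\,n^{1/4}=\frac{96\sqrt{20}\,\rho_{\psi}\sqrt{m^{\sharp}}}{\sqrt{n}\,T}\sum_{t=1}^{T}M_{t}^{1/2}R_{\infty,t}^{1/2}=\frac{\widehat{c}\,\rho_{\psi}\sqrt{m^{\sharp}}}{\sqrt{n}\,T}\sum_{t=1}^{T}M_{t}^{1/2}R_{\infty,t}^{1/2},

since 96\sqrt{20}=192\sqrt{5}.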

Appendix D Upper Bound of the Covering Number

Lemma D.1.

Under the same assumptions as in Theorem C.1, the covering number N(t,j,ϵ,S)N(\mathcal{F}^{\sharp}_{t,j},\epsilon,\|\cdot\|_{S}) has the following bound:

logN(t,j,ϵ,S)10mn1/2Mtϵ,\log N(\mathcal{F}^{\sharp}_{t,j},\epsilon,\|\cdot\|_{S})\leq\frac{10m^{\sharp}n^{1/2}M_{t}}{\epsilon},

for any ϵ>0\epsilon>0, where MtM_{t} is given by (C.2).

Proof.

The proof is based on the argument of the proof of Lemma 3 in [[Chen]. For f^{\sharp}_{t,j},\widetilde{f}^{\sharp}_{t,j}\in\mathcal{F}^{\sharp}_{t,j}, we estimate

|ft(Xt)jf~t(Xt)j|ft(Xt)f~t(Xt)2WoW~oopϕ(xt,ht1)2+W~oϕ~(xt,h~t1)W~oϕ(xt,ht1)2+bob~o2.\begin{split}|f^{\sharp}_{t}(X_{t})_{j}-\widetilde{f}^{\sharp}_{t}(X_{t})_{j}|&\leq\big{\|}f^{\sharp}_{t}(X_{t})-\widetilde{f}^{\sharp}_{t}(X_{t})\big{\|}_{2}\\ &\leq\big{\|}W^{\sharp o}-\widetilde{W}^{\sharp o}\big{\|}_{op}\big{\|}\phi^{\sharp}(x_{t},h^{\sharp}_{t-1})\big{\|}_{2}\\ &\qquad+\big{\|}\widetilde{W}^{\sharp o}\widetilde{\phi}^{\sharp}(x_{t},\widetilde{h}^{\sharp}_{t-1})-\widetilde{W}^{\sharp o}\phi^{\sharp}(x_{t},h^{\sharp}_{t-1})\big{\|}_{2}+\big{\|}b^{\sharp o}-\widetilde{b}^{\sharp o}\big{\|}_{2}.\end{split}

The second term on the right-hand side is estimated as

W~oϕ~(xt,h~t1)W~oϕ(xt,ht1)2W~oopρσ(W~hϕ~(xt1,h~t2)Whϕ(xt1,ht2)2+WiW~iopxt2+bhib~hi2).\big{\|}\widetilde{W}^{\sharp o}\widetilde{\phi}^{\sharp}(x_{t},\widetilde{h}^{\sharp}_{t-1})-\widetilde{W}^{\sharp o}\phi^{\sharp}(x_{t},h^{\sharp}_{t-1})\big{\|}_{2}\\ \leq\big{\|}\widetilde{W}^{\sharp o}\big{\|}_{op}\rho_{\sigma}\Big{(}\big{\|}\widetilde{W}^{\sharp h}\widetilde{\phi}^{\sharp}(x_{t-1},\widetilde{h}^{\sharp}_{t-2})-W^{\sharp h}\phi^{\sharp}(x_{t-1},h^{\sharp}_{t-2})\big{\|}_{2}\\ +\big{\|}W^{\sharp i}-\widetilde{W}^{\sharp i}\big{\|}_{op}\big{\|}x_{t}\big{\|}_{2}+\big{\|}b^{\sharp hi}-\widetilde{b}^{hi}\big{\|}_{2}\Big{)}.

We estimate the first term on the right-hand side of the above inequality as

W~hϕ~(xt1,h~t2)Whϕ(xt1,ht2)2W~hopϕ~(xt1,h~t2)ϕ(xt1,ht2)2+WhW~hopϕ(xt1,ht2)2W~hopρσ(W~hϕ~(xt2,h~t3)Whϕ(xt2,ht3)2+WiW~iopxt12+b~hibhi)2)+WhW~hopϕ(xt1,ht2)2,\begin{split}&\big{\|}\widetilde{W}^{\sharp h}\widetilde{\phi}^{\sharp}(x_{t-1},\widetilde{h}^{\sharp}_{t-2})-W^{\sharp h}\phi^{\sharp}(x_{t-1},h^{\sharp}_{t-2})\big{\|}_{2}\\ &\leq\big{\|}\widetilde{W}^{\sharp h}\big{\|}_{op}\big{\|}\widetilde{\phi}^{\sharp}(x_{t-1},\widetilde{h}^{\sharp}_{t-2})-\phi^{\sharp}(x_{t-1},h^{\sharp}_{t-2})\big{\|}_{2}+\big{\|}W^{\sharp h}-\widetilde{W}^{\sharp h}\big{\|}_{op}\big{\|}\phi^{\sharp}(x_{t-1},h^{\sharp}_{t-2})\big{\|}_{2}\\ &\leq\big{\|}\widetilde{W}^{\sharp h}\big{\|}_{op}\rho_{\sigma}\Big{(}\big{\|}\widetilde{W}^{\sharp h}\widetilde{\phi}^{\sharp}(x_{t-2},\widetilde{h}^{\sharp}_{t-3})-W^{\sharp h}\phi^{\sharp}(x_{t-2},h^{\sharp}_{t-3})\big{\|}_{2}\\ &\qquad+\big{\|}W^{\sharp i}-\widetilde{W}^{\sharp i}\big{\|}_{op}\big{\|}x_{t-1}\big{\|}_{2}+\big{\|}\widetilde{b}^{hi}-b^{\sharp hi})\big{\|}_{2}\Big{)}+\big{\|}W^{\sharp h}-\widetilde{W}^{\sharp h}\big{\|}_{op}\big{\|}\phi^{\sharp}(x_{t-1},h^{\sharp}_{t-2})\big{\|}_{2},\end{split}

and by repeating this estimate, we eventually obtain

W~oϕ~(xt,h~t1)W~oϕ(xt,ht1)2Roρσ{l=0t1(Rhρσ)l(WiW~iopxtl2+bhib~hi2)+l=1t1(Rhρσ)t1lϕ(xl,hl1)2WhW~hop}.\big{\|}\widetilde{W}^{\sharp o}\widetilde{\phi}^{\sharp}(x_{t},\widetilde{h}^{\sharp}_{t-1})-\widetilde{W}^{\sharp o}\phi^{\sharp}(x_{t},h^{\sharp}_{t-1})\big{\|}_{2}\\ \leq R_{o}\rho_{\sigma}\bigg{\{}\sum_{l=0}^{t-1}(R_{h}\rho_{\sigma})^{l}\Big{(}\big{\|}W^{\sharp i}-\widetilde{W}^{\sharp i}\big{\|}_{op}\big{\|}x_{t-l}\big{\|}_{2}+\big{\|}b^{\sharp hi}-\widetilde{b}^{\sharp hi}\big{\|}_{2}\Big{)}\\ +\sum_{l=1}^{t-1}(R_{h}\rho_{\sigma})^{t-1-l}\big{\|}\phi^{\sharp}(x_{l},h^{\sharp}_{l-1})\big{\|}_{2}\big{\|}W^{\sharp h}-\widetilde{W}^{\sharp h}\big{\|}_{op}\bigg{\}}.

Summarizing the above, we have

|ft(Xt)jf~t(Xt)j|WoW~oopϕ(xt,ht1)2+Roρσ{l=0t1(Rhρσ)l(WiW~iopxtl2+bhib~hi2)+l=1t1(Rhρσ)t1lϕ(xl,hl1)2WhW~hop}+bob~o2.\begin{split}|f^{\sharp}_{t}(X_{t})_{j}-\widetilde{f}^{\sharp}_{t}(X_{t})_{j}|&\leq\big{\|}W^{\sharp o}-\widetilde{W}^{\sharp o}\big{\|}_{op}\big{\|}\phi^{\sharp}(x_{t},h^{\sharp}_{t-1})\big{\|}_{2}\\ &\quad+R_{o}\rho_{\sigma}\Biggl{\{}\sum_{l=0}^{t-1}(R_{h}\rho_{\sigma})^{l}\Big{(}\big{\|}W^{\sharp i}-\widetilde{W}^{\sharp i}\big{\|}_{op}\big{\|}x_{t-l}\big{\|}_{2}+\big{\|}b^{\sharp hi}-\widetilde{b}^{\sharp hi}\big{\|}_{2}\Big{)}\\ &\quad+\sum_{l=1}^{t-1}(R_{h}\rho_{\sigma})^{t-1-l}\big{\|}\phi^{\sharp}(x_{l},h^{\sharp}_{l-1})\big{\|}_{2}\big{\|}W^{\sharp h}-\widetilde{W}^{\sharp h}\big{\|}_{op}\Biggr{\}}+\big{\|}b^{\sharp o}-\widetilde{b}^{\sharp o}\big{\|}_{2}.\end{split}

Since

ϕt(xt,ht1)2ρσ(Whopϕ(xt1,ht2)2+Wiopxt2+bhi2)ρσ(RiRx+Rhib)l=0t1(Rhρσ)l,\begin{split}\big{\|}\phi^{\sharp}_{t}(x_{t},h^{\sharp}_{t-1})\big{\|}_{2}&\leq\rho_{\sigma}\big{(}\big{\|}W^{\sharp h}\big{\|}_{op}\big{\|}\phi^{\sharp}(x_{t-1},h^{\sharp}_{t-2})\big{\|}_{2}+\big{\|}W^{\sharp i}\big{\|}_{op}\big{\|}x_{t}\big{\|}_{2}+\big{\|}b^{\sharp hi}\big{\|}_{2}\big{)}\\ &\leq\rho_{\sigma}(R_{i}R_{x}+R^{b}_{hi})\sum_{l=0}^{t-1}(R_{h}\rho_{\sigma})^{l},\end{split}

and

l=1t1(Rhρσ)t1lϕt(xl,hl1)2ρσ(RiRx+Rhib)l=1t1k=0l1(Rhρσ)t1l(Rhρσ)k=ρσ(RiRx+Rhib)l=1t1k=0l1(Rhρσ)t1l+k,\begin{split}\sum_{l=1}^{t-1}(R_{h}\rho_{\sigma})^{t-1-l}\big{\|}\phi^{\sharp}_{t}(x_{l},h^{\sharp}_{l-1})\big{\|}_{2}&\leq\rho_{\sigma}(R_{i}R_{x}+R^{b}_{hi})\sum_{l=1}^{t-1}\sum_{k=0}^{l-1}(R_{h}\rho_{\sigma})^{t-1-l}(R_{h}\rho_{\sigma})^{k}\\ &=\rho_{\sigma}(R_{i}R_{x}+R^{b}_{hi})\sum_{l=1}^{t-1}\sum_{k=0}^{l-1}(R_{h}\rho_{\sigma})^{t-1-l+k},\end{split}

we see that

|ft(Xt)jf~t(Xt)j|ρσ(RiRx+Rhib)(l=0t1(Rhρσ)l)=:Lo,tWoW~oop+ρσRoRx(l=0t1(Rhρσ)l)=:Li,tWiW~iop+ρσRo(l=0t1(Rhρσ)l)=:Lb,tbhib~hi2+ρσ2Ro(RiRx+Rhib)(l=1t1k=0l1(Rhρσ)t1l+k)=:Lh,tWhW~hop+bob~o2.\begin{split}&|f^{\sharp}_{t}(X_{t})_{j}-\widetilde{f}^{\sharp}_{t}(X_{t})_{j}|\leq\underbrace{\rho_{\sigma}(R_{i}R_{x}+R^{b}_{hi})\bigg{(}\sum_{l=0}^{t-1}(R_{h}\rho_{\sigma})^{l}\bigg{)}}_{=:L_{o,t}}\big{\|}W^{\sharp o}-\widetilde{W}^{\sharp o}\big{\|}_{op}\\ &\quad+\underbrace{\rho_{\sigma}R_{o}R_{x}\bigg{(}\sum_{l=0}^{t-1}(R_{h}\rho_{\sigma})^{l}\bigg{)}}_{=:L_{i,t}}\big{\|}W^{\sharp i}-\widetilde{W}^{\sharp i}\big{\|}_{op}+\underbrace{\rho_{\sigma}R_{o}\bigg{(}\sum_{l=0}^{t-1}(R_{h}\rho_{\sigma})^{l}\bigg{)}}_{=:L_{b,t}}\big{\|}b^{\sharp hi}-\widetilde{b}^{\sharp hi}\big{\|}_{2}\\ &\quad+\underbrace{\rho_{\sigma}^{2}R_{o}(R_{i}R_{x}+R^{b}_{hi})\bigg{(}\sum_{l=1}^{t-1}\sum_{k=0}^{l-1}(R_{h}\rho_{\sigma})^{t-1-l+k}\bigg{)}}_{=:L_{h,t}}\big{\|}W^{\sharp h}-\widetilde{W}^{\sharp h}\big{\|}_{op}+\big{\|}b^{\sharp o}-\widetilde{b}^{\sharp o}\big{\|}_{2}.\end{split} (D.1)

Since the right-hand side of (D.1) is independent of the training data X_{t}^{i}, we estimate

ft(Xt)jf~t(Xt)jS=(i=1n|ft(Xti)jf~t(Xti)j|2)1/2n(Lo,tWoW~oop+Li,tWiW~iop+Lb,tbhib~hi2+Lh,tWhW~hop+bob~o2).\begin{split}\big{\|}f^{\sharp}_{t}(X_{t})_{j}-\widetilde{f}^{\sharp}_{t}(X_{t})_{j}\big{\|}_{S}&=\bigg{(}\sum_{i=1}^{n}\big{|}f^{\sharp}_{t}(X^{i}_{t})_{j}-\widetilde{f}^{\sharp}_{t}(X^{i}_{t})_{j}\big{|}^{2}\bigg{)}^{1/2}\\ &\leq\sqrt{n}\Big{(}L_{o,t}\big{\|}W^{\sharp o}-\widetilde{W}^{\sharp o}\big{\|}_{op}+L_{i,t}\big{\|}W^{\sharp i}-\widetilde{W}^{\sharp i}\big{\|}_{op}\\ &\quad+L_{b,t}\big{\|}b^{\sharp hi}-\widetilde{b}^{\sharp hi}\big{\|}_{2}+L_{h,t}\big{\|}W^{\sharp h}-\widetilde{W}^{\sharp h}\big{\|}_{op}+\big{\|}b^{\sharp o}-\widetilde{b}^{\sharp o}\big{\|}_{2}\Big{)}.\end{split}

Then, the covering number N(t,j,ϵ,S)N(\mathcal{F}^{\sharp}_{t,j},\epsilon,\|\cdot\|_{S}) is bounded as follows

N(t,j,ϵ,S)N(Wo,Ro,ϵ5nLo,t,F)N(Wi,Ri,ϵ5nLi,t,F)×N(bhi,Rhib,ϵ5nLb,t,F)N(Wh,Rh,ϵ5nLh,t,F)N(bo,Rob,ϵ5n,F),\begin{split}&N(\mathcal{F}^{\sharp}_{t,j},\epsilon,\|\cdot\|_{S})\leq N\Big{(}\mathcal{H}_{W^{\sharp o},R_{o}},\frac{\epsilon}{5\sqrt{n}L_{o,t}},\|\cdot\|_{F}\Big{)}N\Big{(}\mathcal{H}_{W^{\sharp i},R_{i}},\frac{\epsilon}{5\sqrt{n}L_{i,t}},\|\cdot\|_{F}\Big{)}\\ &\times N\Big{(}\mathcal{H}_{b^{\sharp hi},R^{b}_{hi}},\frac{\epsilon}{5\sqrt{n}L_{b,t}},\|\cdot\|_{F}\Big{)}N\Big{(}\mathcal{H}_{W^{\sharp h},R_{h}},\frac{\epsilon}{5\sqrt{n}L_{h,t}},\|\cdot\|_{F}\Big{)}N\Big{(}\mathcal{H}_{b^{\sharp o},R^{b}_{o}},\frac{\epsilon}{5\sqrt{n}},\|\cdot\|_{F}\Big{)},\end{split}

where we used the notation

A,R:={Ad1×d2|AFR}.\mathcal{H}_{A,R}:=\big{\{}A\in\mathbb{R}^{d_{1}\times d_{2}}\,|\,\|A\|_{F}\leq R\big{\}}.

By Lemma 8 in [[Chen], the above five covering numbers are bounded as

N(Wo,Ro,ϵ5nLo,t,F)(1+10min{m,dy}RoLo,tnϵ)mdy,N\Big{(}\mathcal{H}_{W^{\sharp o},R_{o}},\frac{\epsilon}{5\sqrt{n}L_{o,t}},\|\cdot\|_{F}\Big{)}\leq\bigg{(}1+\frac{10\min\{\sqrt{m^{\sharp}},\sqrt{d_{y}}\}R_{o}L_{o,t}\sqrt{n}}{\epsilon}\bigg{)}^{m^{\sharp}d_{y}},
N(Wi,Ri,ϵ5nLi,t,F)(1+10min{m,dx}RiLi,tnϵ)mdx,N\Big{(}\mathcal{H}_{W^{\sharp i},R_{i}},\frac{\epsilon}{5\sqrt{n}L_{i,t}},\|\cdot\|_{F}\Big{)}\leq\bigg{(}1+\frac{10\min\{\sqrt{m^{\sharp}},\sqrt{d_{x}}\}R_{i}L_{i,t}\sqrt{n}}{\epsilon}\bigg{)}^{m^{\sharp}d_{x}},
N(bhi,Rhib,ϵ5nLb,t,F)(1+10RhibLb,tnϵ)m,N\Big{(}\mathcal{H}_{b^{\sharp hi},R^{b}_{hi}},\frac{\epsilon}{5\sqrt{n}L_{b,t}},\|\cdot\|_{F}\Big{)}\leq\bigg{(}1+\frac{10R^{b}_{hi}L_{b,t}\sqrt{n}}{\epsilon}\bigg{)}^{m^{\sharp}},
N(Wh,Rh,ϵ5nLh,t,F)(1+10mRhLh,tnϵ)(m)2,N\Big{(}\mathcal{H}_{W^{\sharp h},R_{h}},\frac{\epsilon}{5\sqrt{n}L_{h,t}},\|\cdot\|_{F}\Big{)}\leq\bigg{(}1+\frac{10\sqrt{m^{\sharp}}R_{h}L_{h,t}\sqrt{n}}{\epsilon}\bigg{)}^{(m^{\sharp})^{2}},
N(bo,Rob,ϵ5n,F)(1+10Robnϵ)dy.N\Big{(}\mathcal{H}_{b^{\sharp o},R^{b}_{o}},\frac{\epsilon}{5\sqrt{n}},\|\cdot\|_{F}\Big{)}\leq\bigg{(}1+\frac{10R^{b}_{o}\sqrt{n}}{\epsilon}\bigg{)}^{d_{y}}.

Therefore, by using log(1+x)x\log(1+x)\leq x for x0x\geq 0, we conclude that

logN(t,j,ϵ,S)10mdymin{m,dy}RoLo,tnϵ+10mdxmin{m,dx}RiLi,tnϵ+10mRhibLb,tnϵ+10(m)52RhLh,tnϵ+10mdyRobnϵ=10mnMtϵ,\begin{split}&\log N(\mathcal{F}^{\sharp}_{t,j},\epsilon,\|\cdot\|_{S})\\ &\leq\frac{10m^{\sharp}d_{y}\min\{\sqrt{m^{\sharp}},\sqrt{d_{y}}\}R_{o}L_{o,t}\sqrt{n}}{\epsilon}+\frac{10m^{\sharp}d_{x}\min\{\sqrt{m^{\sharp}},\sqrt{d_{x}}\}R_{i}L_{i,t}\sqrt{n}}{\epsilon}\\ &\qquad+\frac{10m^{\sharp}R^{b}_{hi}L_{b,t}\sqrt{n}}{\epsilon}+\frac{10(m^{\sharp})^{\frac{5}{2}}R_{h}L_{h,t}\sqrt{n}}{\epsilon}+\frac{10m^{\sharp}d_{y}R^{b}_{o}\sqrt{n}}{\epsilon}\\ &=\frac{10m^{\sharp}\sqrt{n}M_{t}}{\epsilon},\end{split}

where MtM_{t} is the constant given by (C.2). The proof of Lemma D.1 is finished. ∎
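We note that the final equality in the display above holds because M_{t} defined in (C.2) is precisely the combination of the Lipschitz constants L_{o,t}, L_{i,t}, L_{b,t}, L_{h,t} introduced in (D.1); indeed, one can check directly that

M_{t}=d_{y}\min\{\sqrt{m^{\sharp}},\sqrt{d_{y}}\}R_{o}L_{o,t}+d_{x}\min\{\sqrt{m^{\sharp}},\sqrt{d_{x}}\}R_{i}L_{i,t}+R^{b}_{hi}L_{b,t}+(m^{\sharp})^{\frac{3}{2}}R_{h}L_{h,t}+d_{y}R^{b}_{o}.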

Appendix E Proof of Proposition 4.6

We review the following proposition (see Proposition 1 in [[Suzuki] and Proposition 1 in [[Bach]).

Proposition E.1.

Let v1,,vmv_{1},\ldots,v_{m^{\sharp}} be i.i.d. sampled from the distribution qq in (4.6), and J={v1,,vm}J=\{v_{1},\ldots,v_{m^{\sharp}}\}. Then, for any δ~(0,1/2)\widetilde{\delta}\in(0,1/2) and λ>0\lambda>0, if m5N^(λ)log(16N^(λ)/δ~)m^{\sharp}\geq 5\widehat{N}(\lambda)\log(16\widehat{N}(\lambda)/\widetilde{\delta}), then we have the following inequality with probability greater than 1δ~1-\widetilde{\delta}:

infαm{zTϕαTϕJn,T2+λmαTτ2}4λzTΣ^(Σ^+λI)1z,\inf_{\alpha\in\mathbb{R}^{m^{\sharp}}}\big{\{}\big{\|}z^{T}\phi-\alpha^{T}\phi_{J}\big{\|}^{2}_{n,T}+\lambda m^{\sharp}\big{\|}\alpha^{T}\big{\|}^{2}_{\tau^{\prime}}\big{\}}\leq 4\lambda z^{T}\widehat{\Sigma}(\widehat{\Sigma}+\lambda I)^{-1}z, (E.1)

for all zmz\in\mathbb{R}^{m}.

Proof.

Let e_{j} be the indicator vector whose j-th component is 1 and whose other components are 0, for j=1,\ldots,m. Applying Proposition E.1 with z=e_{j} and taking the summation over j=1,\ldots,m, we obtain

Lτ(A)(J)=ϕA^JϕJn,T2+λmA^Jτ2j=1m{ejTϕejTA^JϕJn,T2+λmejTA^Jτ2}=j=1minfαm{ejTϕαTϕJn,T2+λmαTτ2}4λj=1mejTΣ^(Σ^+λI)1ej4λ.\begin{split}L^{(A)}_{\tau}(J)&=\big{\|}\phi-\widehat{A}_{J}\phi_{J}\big{\|}^{2}_{n,T}+\lambda m^{\sharp}\big{\|}\widehat{A}_{J}\big{\|}^{2}_{\tau^{\prime}}\\ &\leq\sum_{j=1}^{m}\big{\{}\big{\|}e_{j}^{T}\phi-e_{j}^{T}\widehat{A}_{J}\phi_{J}\big{\|}^{2}_{n,T}+\lambda m^{\sharp}\big{\|}e_{j}^{T}\widehat{A}_{J}\big{\|}^{2}_{\tau^{\prime}}\big{\}}\\ &=\sum_{j=1}^{m}\inf_{\alpha\in\mathbb{R}^{m^{\sharp}}}\big{\{}\big{\|}e_{j}^{T}\phi-\alpha^{T}\phi_{J}\big{\|}^{2}_{n,T}+\lambda m^{\sharp}\big{\|}\alpha^{T}\big{\|}^{2}_{\tau^{\prime}}\big{\}}\\ &\leq 4\lambda\sum_{j=1}^{m}e_{j}^{T}\widehat{\Sigma}(\widehat{\Sigma}+\lambda I)^{-1}e_{j}\leq 4\lambda.\end{split}
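For concreteness, the leverage scores \tau^{\prime} in (4.5) and the sampling of the index set J in Proposition E.1 can be computed directly from the empirical covariance matrix \widehat{\Sigma}. The following is a minimal numpy sketch; as an assumption for this sketch, we take \widehat{N}(\lambda)=\mathrm{Tr}[\widehat{\Sigma}(\widehat{\Sigma}+\lambda I)^{-1}] (the degrees of freedom) so that the scores sum to one, and we assume that the distribution q in (4.6) places probability proportional to the leverage scores on the hidden nodes, as in the leverage-score sampling of [[Bach]. Function names are illustrative.

```python
import numpy as np

def leverage_scores(Sigma, lam):
    """Leverage scores tau'_j = [Sigma (Sigma + lam I)^{-1}]_{jj} / N(lam),
    normalized by the degrees of freedom N(lam) = Tr[Sigma (Sigma + lam I)^{-1}]
    so that they sum to one and define a sampling distribution."""
    m = Sigma.shape[0]
    H = Sigma @ np.linalg.inv(Sigma + lam * np.eye(m))
    dof = np.trace(H)              # stands in for \hat{N}(lambda)
    return np.diag(H) / dof, dof

def sample_index_set(Sigma, m_sharp, lam, rng=None):
    """Draw J = {v_1, ..., v_{m_sharp}} i.i.d. from the leverage-score distribution,
    mirroring the sampling setting of Proposition E.1 under the stated assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    q, _ = leverage_scores(Sigma, lam)
    return rng.choice(Sigma.shape[0], size=m_sharp, replace=True, p=q)
```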

Appendix F Proof of Theorem 4.8

We restate Theorem 4.8 in an exact form as follows:

Theorem F.1.

Suppose that Assumptions 4.1, 4.4 and 4.7 hold. Let {(XTi,YTi)}i=1n\{(X_{T}^{i},Y_{T}^{i})\}_{i=1}^{n} and {vj}j=1m\{v_{j}\}_{j=1}^{m^{\sharp}} be sampled i.i.d. from the distributions PTP_{T} and qq in (4.6), respectively. Let J={v1,,vm}J=\{v_{1},\ldots,v_{m^{\sharp}}\}. Then, for any δlog2\delta\geq\log 2 and δ~(0,1/2)\widetilde{\delta}\in(0,1/2), we have the following inequality with probability greater than (12eδ)δ~(1-2e^{-\delta})\widetilde{\delta}:

Ψj(fJ)Ψ^j(f^)+3ρψ{2R^o+4R^oρσm12δ~max{1,(2ρσR^hm12δ~)T2}TR^h}λ+1n{c^ρψmT(t=1TM^t1/2R^,t1/2)+32δ(ρψR^,T+Ry)}Ψ^j(f^)+λ+1n(m)54R^,T1/2,\begin{split}\Psi_{j}(f^{\sharp}_{J})&\leq\widehat{\Psi}_{j}(\widehat{f})\\ &+\sqrt{3}\rho_{\psi}\Bigg{\{}2\widehat{R}_{o}+4\widehat{R}_{o}\rho_{\sigma}\sqrt{\frac{m}{1-2\widetilde{\delta}}}\max\bigg{\{}1,\bigg{(}2\rho_{\sigma}\widehat{R}_{h}\sqrt{\frac{m}{1-2\widetilde{\delta}}}\bigg{)}^{T-2}\bigg{\}}T\widehat{R}_{h}\Bigg{\}}\sqrt{\lambda}\\ &+\frac{1}{\sqrt{n}}\Bigg{\{}\frac{\widehat{c}\rho_{\psi}\sqrt{m^{\sharp}}}{T}\bigg{(}\sum_{t=1}^{T}\widehat{M}_{t}^{1/2}\widehat{R}^{1/2}_{\infty,t}\bigg{)}+3\sqrt{2\delta}(\rho_{\psi}\widehat{R}_{\infty,T}+R_{y})\Bigg{\}}\\ &\lesssim\widehat{\Psi}_{j}(\widehat{f})+\sqrt{\lambda}+\frac{1}{\sqrt{n}}(m^{\sharp})^{\frac{5}{4}}\widehat{R}^{1/2}_{\infty,T},\end{split} (F.1)

for j=1,,dyj=1,\ldots,d_{y} and for all λ>0\lambda>0 satisfying (4.4), where R^,t\widehat{R}_{\infty,t} and M^t\widehat{M}_{t} are defined by

R^,t:=2ρσR^om12δ~(R^iRx+R^hib){l=1t(2ρσR^hm12δ~)l1}+R^ob,\widehat{R}_{\infty,t}:=2\rho_{\sigma}\widehat{R}_{o}\sqrt{\frac{m}{1-2\widetilde{\delta}}}(\widehat{R}_{i}R_{x}+\widehat{R}^{b}_{hi})\Bigg{\{}\sum_{l=1}^{t}\bigg{(}2\rho_{\sigma}\widehat{R}_{h}\sqrt{\frac{m}{1-2\widetilde{\delta}}}\bigg{)}^{l-1}\Bigg{\}}+\widehat{R}^{b}_{o},
M^t:=2ρσR^om12δ~{(dymin{m,dy}+dxmin{m,dx})R^iRx+(dymin{m,dy}+1)R^hib}{l=0t1(2ρσR^hm12δ~)l}+4(m)3/2R^hR^om12δ~ρσ2(R^iRx+R^hib){l=1t1k=0l1(2ρσR^hm12δ~)t1l+k}+dyR^ob.\begin{split}\widehat{M}_{t}&:=2\rho_{\sigma}\widehat{R}_{o}\sqrt{\frac{m}{1-2\widetilde{\delta}}}\Biggl{\{}\Big{(}d_{y}\min\{\sqrt{m^{\sharp}},\sqrt{d_{y}}\}+d_{x}\min\{\sqrt{m^{\sharp}},\sqrt{d_{x}}\}\Big{)}\widehat{R}_{i}R_{x}\\ &+\Big{(}d_{y}\min\{\sqrt{m^{\sharp}},\sqrt{d_{y}}\}+1\Big{)}\widehat{R}^{b}_{hi}\Biggr{\}}\Bigg{\{}\sum_{l=0}^{t-1}\bigg{(}2\rho_{\sigma}\widehat{R}_{h}\sqrt{\frac{m}{1-2\widetilde{\delta}}}\bigg{)}^{l}\Bigg{\}}\\ &+4(m^{\sharp})^{3/2}\widehat{R}_{h}\widehat{R}_{o}\frac{m}{1-2\widetilde{\delta}}\rho_{\sigma}^{2}(\widehat{R}_{i}R_{x}+\widehat{R}^{b}_{hi})\Bigg{\{}\sum_{l=1}^{t-1}\sum_{k=0}^{l-1}\bigg{(}2\rho_{\sigma}\widehat{R}_{h}\sqrt{\frac{m}{1-2\widetilde{\delta}}}\bigg{)}^{t-1-l+k}\Bigg{\}}+d_{y}\widehat{R}^{b}_{o}.\end{split}
Proof.

Let δ~(0,1/2)\tilde{\delta}\in(0,1/2), and let fJf^{\sharp}_{J} be the compressed RNN with parameters

WJo:=W^oA^J,WJh:=W^J,[m]hA^J,WJi:=W^J,[dx]i,bJhi:=b^Jhi,andbJo:=b^o.W^{\sharp o}_{J}:=\widehat{W}^{o}\widehat{A}_{J},\quad W^{\sharp h}_{J}:=\widehat{W}^{h}_{J,[m]}\widehat{A}_{J},\quad W^{\sharp i}_{J}:=\widehat{W}^{i}_{J,[d_{x}]},\quad b^{\sharp hi}_{J}:=\widehat{b}^{hi}_{J},\quad\text{and}\quad b^{\sharp o}_{J}:=\widehat{b}^{o}.

Once we can prove that

fJT(2R^om12δ~,2R^hm12δ~,R^i,R^ob,R^hib),f^{\sharp}_{J}\in\mathcal{F}^{\sharp}_{T}\bigg{(}2\widehat{R}_{o}\sqrt{\frac{m}{1-2\widetilde{\delta}}},2\widehat{R}_{h}\sqrt{\frac{m}{1-2\widetilde{\delta}}},\widehat{R}_{i},\widehat{R}^{b}_{o},\widehat{R}^{b}_{hi}\bigg{)}, (F.2)

we can apply Theorem C.1 with f=fJf^{\sharp}=f^{\sharp}_{J} to obtain, for any δlog2\delta\geq\log 2, the following inequality with probability greater than 12eδ1-2e^{-\delta}:

Ψj(fJ)Ψ^j(f^)+3ρψ{W^oϕWoϕJn,T+2R^om12δ~ρσmax{1,(2R^hm12δ~ρσ)T2}TW^J,[m]hϕWhϕJn,T}+1n{c^ρψmT(t=1TM^t1/2R^,t1/2)+32δ(ρψR^,T+Ry)},\begin{split}\Psi_{j}&(f^{\sharp}_{J})\leq\widehat{\Psi}_{j}(\widehat{f})+\sqrt{3}\rho_{\psi}\Bigg{\{}\big{\|}\widehat{W}^{o}\phi-W^{\sharp o}\phi_{J}\big{\|}_{n,T}\\ &+2\widehat{R}_{o}\sqrt{\frac{m}{1-2\widetilde{\delta}}}\rho_{\sigma}\max\bigg{\{}1,\bigg{(}2\widehat{R}_{h}\sqrt{\frac{m}{1-2\widetilde{\delta}}}\rho_{\sigma}\bigg{)}^{T-2}\bigg{\}}T\big{\|}\widehat{W}^{h}_{J,[m]}\phi-W^{\sharp h}\phi_{J}\big{\|}_{n,T}\Bigg{\}}\\ &+\frac{1}{\sqrt{n}}\Bigg{\{}\frac{\widehat{c}\rho_{\psi}\sqrt{m^{\sharp}}}{T}\bigg{(}\sum_{t=1}^{T}\widehat{M}_{t}^{1/2}\widehat{R}^{1/2}_{\infty,t}\bigg{)}+3\sqrt{2\delta}(\rho_{\psi}\widehat{R}_{\infty,T}+R_{y})\Bigg{\}},\end{split} (F.3)

for j=1,,dyj=1,\ldots,d_{y}. Moreover, by using Proposition E.1, we have

W^oϕWJoϕJn,T2=W^oϕW^oA^JϕJn,T2j=1dy(W^j,:oϕW^j,:oA^JϕJn,T2+λmW^j,:oA^Jτ2)=j=1dyinfαm(W^j,:oϕαTϕJn,T2+λmαTτ2)4λj=1dyW^j,:oΣ^(Σ^+λI)1(W^j,:o)T4λW^oF24λ(R^o)2,\begin{split}\big{\|}\widehat{W}^{o}\phi-W^{\sharp o}_{J}\phi_{J}\big{\|}^{2}_{n,T}&=\big{\|}\widehat{W}^{o}\phi-\widehat{W}^{o}\widehat{A}_{J}\phi_{J}\big{\|}^{2}_{n,T}\\ &\leq\sum_{j=1}^{d_{y}}\Big{(}\big{\|}\widehat{W}^{o}_{j,:}\phi-\widehat{W}^{o}_{j,:}\widehat{A}_{J}\phi_{J}\big{\|}^{2}_{n,T}+\lambda m^{\sharp}\big{\|}\widehat{W}^{o}_{j,:}\widehat{A}_{J}\big{\|}^{2}_{\tau^{\prime}}\Big{)}\\ &=\sum_{j=1}^{d_{y}}\inf_{\alpha\in\mathbb{R}^{m^{\sharp}}}\Big{(}\big{\|}\widehat{W}^{o}_{j,:}\phi-\alpha^{T}\phi_{J}\big{\|}^{2}_{n,T}+\lambda m^{\sharp}\big{\|}\alpha^{T}\big{\|}^{2}_{\tau^{\prime}}\Big{)}\\ &\leq 4\lambda\sum_{j=1}^{d_{y}}\widehat{W}^{o}_{j,:}\widehat{\Sigma}(\widehat{\Sigma}+\lambda I)^{-1}(\widehat{W}^{o}_{j,:})^{T}\\ &\leq 4\lambda\big{\|}\widehat{W}^{o}\big{\|}^{2}_{F}\leq 4\lambda(\widehat{R}_{o})^{2},\end{split} (F.4)

and

W^J,[m]hϕWJhϕJn,T2=W^J,[m]hϕW^J,[m]hA^JϕJn,T2jJ(W^j,:hϕW^j,:hA^JϕJn,T2+λmW^j,:hA^Jτ2)=jJinfαm(W^j,:hϕαTϕJn,T2+λmαTτ2)4λjJW^j,:hΣ^(Σ^+λI)1(W^j,:h)T4λW^hF24λ(R^h)2.\begin{split}\big{\|}\widehat{W}^{h}_{J,[m]}\phi-W^{\sharp h}_{J}\phi_{J}\big{\|}^{2}_{n,T}&=\big{\|}\widehat{W}^{h}_{J,[m]}\phi-\widehat{W}^{h}_{J,[m]}\widehat{A}_{J}\phi_{J}\big{\|}^{2}_{n,T}\\ &\leq\sum_{j\in J}\Big{(}\big{\|}\widehat{W}^{h}_{j,:}\phi-\widehat{W}^{h}_{j,:}\widehat{A}_{J}\phi_{J}\big{\|}^{2}_{n,T}+\lambda m^{\sharp}\big{\|}\widehat{W}^{h}_{j,:}\widehat{A}_{J}\big{\|}^{2}_{\tau^{\prime}}\Big{)}\\ &=\sum_{j\in J}\inf_{\alpha\in\mathbb{R}^{m^{\sharp}}}\Big{(}\big{\|}\widehat{W}^{h}_{j,:}\phi-\alpha^{T}\phi_{J}\big{\|}^{2}_{n,T}+\lambda m^{\sharp}\big{\|}\alpha^{T}\big{\|}^{2}_{\tau^{\prime}}\Big{)}\\ &\leq 4\lambda\sum_{j\in J}\widehat{W}^{h}_{j,:}\widehat{\Sigma}(\widehat{\Sigma}+\lambda I)^{-1}(\widehat{W}^{h}_{j,:})^{T}\\ &\leq 4\lambda\big{\|}\widehat{W}^{h}\big{\|}^{2}_{F}\leq 4\lambda(\widehat{R}_{h})^{2}.\end{split} (F.5)

Therefore, by combining (F.3), (F.4), and (F.5), we conclude the inequality (F.1). It remains to prove that (F.2) holds with probability greater than \widetilde{\delta}.

Let us recall the definition (4.5) of the leverage score τ=(τj)jJm\tau^{\prime}=(\tau^{\prime}_{j})_{j\in J}\in\mathbb{R}^{m^{\sharp}}, i.e.,

τj:=1N^(λ)[Σ^(Σ^+λI)1]j,j,j=1,,m.\tau^{\prime}_{j}:=\frac{1}{\widehat{N}(\lambda)}\big{[}\widehat{\Sigma}(\widehat{\Sigma}+\lambda I)^{-1}\big{]}_{j,j},\quad j=1,\cdots,m.

By Markov’s inequality, we have

P[jJ(τj)1<mm12δ~]=1P[jJ(τj)1mm12δ~]1E[jJ(τj)1]mm12δ~=2δ~,\begin{split}P\bigg{[}\sum_{j\in J}(\tau^{\prime}_{j})^{-1}<\frac{mm^{\sharp}}{1-2\widetilde{\delta}}\bigg{]}&=1-P\bigg{[}\sum_{j\in J}(\tau^{\prime}_{j})^{-1}\geq\frac{mm^{\sharp}}{1-2\widetilde{\delta}}\bigg{]}\\ &\geq 1-\frac{E\big{[}\sum_{j\in J}(\tau^{\prime}_{j})^{-1}\big{]}}{\frac{mm^{\sharp}}{1-2\widetilde{\delta}}}=2\widetilde{\delta},\end{split} (F.6)

because E[jJ(τj)1]=mmE\big{[}\sum_{j\in J}(\tau^{\prime}_{j})^{-1}\big{]}=mm^{\sharp} (see the proof of Lemma 1 in [[Suzuki]). Therefore, the probability of two events (E.1) and

jJ(τj)1<mm12δ~,\sum_{j\in J}(\tau^{\prime}_{j})^{-1}<\frac{mm^{\sharp}}{1-2\widetilde{\delta}}, (F.7)

happening simultaneously is greater than (1-\widetilde{\delta})+2\widetilde{\delta}-1=\widetilde{\delta}. By the same argument as in (F.4) and (F.5), and by using (F.7), we have

WJoF2=λmλmW^oA^JF2(jJ(τj)1)λmj=1dy(W^j,:oϕW^j,:oA^JϕJn,T2+λmW^j,:oA^Jτ2)4(R^o)2m12δ~,\begin{split}\big{\|}W^{\sharp o}_{J}\big{\|}^{2}_{F}&=\frac{\lambda m^{\sharp}}{\lambda m^{\sharp}}\big{\|}\widehat{W}^{o}\widehat{A}_{J}\big{\|}^{2}_{F}\\ &\leq\frac{(\sum_{j\in J}(\tau^{\prime}_{j})^{-1})}{\lambda m^{\sharp}}\sum_{j=1}^{d_{y}}\Big{(}\big{\|}\widehat{W}^{o}_{j,:}\phi-\widehat{W}^{o}_{j,:}\widehat{A}_{J}\phi_{J}\big{\|}^{2}_{n,T}+\lambda m^{\sharp}\big{\|}\widehat{W}^{o}_{j,:}\widehat{A}_{J}\big{\|}^{2}_{\tau^{\prime}}\Big{)}\\ &\leq 4(\widehat{R}_{o})^{2}\frac{m}{1-2\widetilde{\delta}},\end{split}
WJhF2=λmλmW^J,[m]hA^JF2(jJ(τj)1)λmjJ(W^j,:hϕW^j,:hA^JϕJn,T2+λmW^j,:hA^Jτ2)4(R^h)2m12δ~,\begin{split}\big{\|}W^{\sharp h}_{J}\big{\|}^{2}_{F}&=\frac{\lambda m^{\sharp}}{\lambda m^{\sharp}}\big{\|}\widehat{W}^{h}_{J,[m]}\widehat{A}_{J}\big{\|}^{2}_{F}\\ &\leq\frac{(\sum_{j\in J}(\tau^{\prime}_{j})^{-1})}{\lambda m^{\sharp}}\sum_{j\in J}\Big{(}\big{\|}\widehat{W}^{h}_{j,:}\phi-\widehat{W}^{h}_{j,:}\widehat{A}_{J}\phi_{J}\big{\|}^{2}_{n,T}+\lambda m^{\sharp}\big{\|}\widehat{W}^{h}_{j,:}\widehat{A}_{J}\big{\|}^{2}_{\tau^{\prime}}\Big{)}\\ &\leq 4(\widehat{R}_{h})^{2}\frac{m}{1-2\widetilde{\delta}},\end{split}

and

WJiF2W^iF2(R^i)2,bJoF2b^oF2(R^ob)2,bJhiF2b^hiF2(R^hib)2.\big{\|}W^{\sharp i}_{J}\big{\|}^{2}_{F}\leq\big{\|}\widehat{W}^{i}\big{\|}^{2}_{F}\leq(\widehat{R}_{i})^{2},\quad\big{\|}b^{\sharp o}_{J}\big{\|}^{2}_{F}\leq\big{\|}\widehat{b}^{o}\big{\|}^{2}_{F}\leq(\widehat{R}_{o}^{b})^{2},\quad\big{\|}b^{\sharp hi}_{J}\big{\|}^{2}_{F}\leq\big{\|}\widehat{b}^{hi}\big{\|}^{2}_{F}\leq(\widehat{R}_{hi}^{b})^{2}.

Hence, (F.2) holds with probability greater than δ~\widetilde{\delta}. Thus, we conclude Theorem F.1. ∎

Appendix G Remarks for Theorems 4.8 and F.1

Remark G.1.

We remark that the index J in Theorem 4.8 is a random variable with distribution q. If a deterministic J satisfying (E.1) and (F.7) is considered instead, the inequality (4.9) holds with probability greater than 1-2e^{-\delta}, which is the same probability as for the inequality in Theorem 2 of [[Suzuki]. The index J in Theorem 2 of [[Suzuki] is chosen deterministically by minimizing the information losses (2) under the additional constraint \sum_{j\in J}(\tau^{\prime}_{j})^{-1}<\frac{5}{3}mm^{\sharp}. This constraint can be interpreted as requiring that the leverage scores \tau^{\prime}_{j} corresponding to J be large, which implies that important nodes are selected based on the spectral information of the covariance matrix \widehat{\Sigma}.

Remark G.2.

In the case of m>m_{\mathrm{nzr}}, we can obtain a sharper error bound than (4.9) in Theorem 4.8. More precisely, the constant omitted in (4.9), which depends on the size m of \widehat{f}, can be improved to a constant depending on m_{\mathrm{nzr}} rather than on m. In fact, when m>m_{\mathrm{nzr}}, let \widehat{f}_{\mathrm{nzr}} be the network obtained by deleting the nodes corresponding to the zero rows of the covariance matrix \widehat{\Sigma}. By the same argument, replacing \widehat{\Psi}_{j}(\widehat{f}) with \widehat{\Psi}_{j}(\widehat{f}_{\mathrm{nzr}}) in the proof of Theorem 4.8, we obtain Theorem 4.8 with m replaced by m_{\mathrm{nzr}}, that is, a sharper error bound.

Appendix H Detailed configurations for training, pruning and fine-tuning

The architecture employed for the Pixel-MNIST classification task consists of a single IRNN layer and an output layer, while that for the PTB word-level language modeling task consists of an embedding layer, a single RNN layer, and an output layer, where the embedding weight matrix and the RNN input weight matrix can be merged into a single weight matrix. The loss function is the cross-entropy function applied to the soft-max outputs for both tasks. Each training and fine-tuning run is optimized by Adam, and the hyper-parameters obtained by grid search are summarized in Table 3, where “FT” denotes the value used in fine-tuning and “bptt” denotes the step size for back-propagation through time (an illustrative sketch of the Pixel-MNIST model is given after Table 3). As regards regularization techniques for the PTB task, we adopt dropout with ratio 0.1 in all cases and weight tying [[inan2016tying] where it is effective.

Table 3: Hyper-parameters for learning.
Task | epochs (FT) | batch size | learning rate (FT) | LR decay (step) | gradient clip | bptt
Pixel-MNIST | 500 (250) | 120 | 10^{-4} (5^{-5}) | 0.95 (10) | 1.0 | 784
PTB | 200 (200) | 20 | 5.0 (2.5) | 0.95 (1) | 0.01 | 35
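The following is a minimal, illustrative PyTorch sketch of the Pixel-MNIST model described above (a single IRNN layer followed by an output layer). As an assumption, we take "IRNN" to mean a vanilla RNN with ReLU activation whose recurrent weight matrix is initialized to the identity and whose biases are initialized to zero; the class and variable names are ours and do not reproduce the code used in the experiments.

```python
import torch
import torch.nn as nn

class PixelMNISTIRNN(nn.Module):
    """Illustrative Pixel-MNIST model: a single IRNN layer and an output layer."""
    def __init__(self, input_size=1, hidden_size=128, num_classes=10):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size,
                          nonlinearity="relu", batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)
        # IRNN-style initialization (an assumption; see the remark above).
        nn.init.eye_(self.rnn.weight_hh_l0)
        nn.init.zeros_(self.rnn.bias_hh_l0)
        nn.init.zeros_(self.rnn.bias_ih_l0)

    def forward(self, x):              # x: (batch, 784, 1), one pixel per step
        out, _ = self.rnn(x)
        return self.fc(out[:, -1, :])  # classify from the last hidden state
```

With input sequences of shape (batch, 784, 1), i.e., one pixel per time step, this is consistent with the hidden size 128, the 1×128 input-to-hidden matrix, and the 128×10 hidden-to-output matrix listed in the configurations below.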

We sample five models for each baseline in Section 5. Furthermore, pruning methods that involve randomness are applied five times to each baseline model. The other detailed configurations for each method are as follows:

  • Baseline (128)
    • train:
      * hidden size: 128
      * weight tying: True
  • Baseline (42)
    • train:
      * hidden size: 42
      * weight tying: True
    • prune:
      * None
    • finetune: (only PTB case)
      * hidden size: 42 (unchanged)
      * weight tying: False
  • Spectral w/ rec. or w/o rec.
    • train:
      * Use Baseline (128)
    • prune:
      * size of hidden-to-hidden weight matrix: 16384 (=128×128) → 1764 (=42×42)
      * size of input-to-hidden weight matrix: 128 (=1×128) → 42 (=1×42) (Pixel-MNIST) or 1270016 (=9922×128) → 416724 (=9922×42) (PTB)
      * size of hidden-to-output weight matrix: 1280 (=128×10) → 420 (=42×10) (Pixel-MNIST) or 1270016 (=9922×128) → 416724 (=9922×42) (PTB)
      * Reduce the RNN weight matrices based on our proposed method, with or without the reconstruction matrix
    • finetune:
      * hidden size: 42 (reduced from 128)
      * weight tying: False
  • Random w/ rec. or w/o rec.
    • Same as “Spectral”, except that the RNN weight matrices are reduced randomly in the pruning phase
  • Column Sparsification
    • train:
      * hidden size: 128
      * weight tying: True
      * Mask the 86 (=128−42) columns of the hidden-to-hidden weight matrix with the lowest L^{2}-norm at each iteration (add noise to the weight matrix before masking when applied to the IRNN)
    • prune:
      * Fix the mask
    • finetune:
      * None
  • Low Rank Factorization
    • train:
      * Use Baseline (128)
    • prune:
      * intrinsic parameters of the hidden-to-hidden weight matrix: 16384 (=128×128) → 10752 (=128×42+42×128)
      * Decompose the hidden-to-hidden weight matrix based on the SVD: W=USV^{\top} → W^{\prime}=U[:,:42] S[:42] V^{\top}[:42,:] (see the short numpy sketch after this list)
      * The entries of S (the singular values) are in descending order
    • finetune:
      * None
  • Magnitude-based Weight
    • train:
      * Use Baseline (128)
    • prune:
      * parameters of the hidden-to-hidden weight matrix: 16384 (=128×128) → 1764 (=42×42)
      * Remove the 14620 (=128×128−42×42) parameters with the lowest L^{1}-norm
    • finetune:
      * hidden size: 128 (unchanged, but with a sparse weight matrix)
  • Random Weight
    • Same as “Magnitude-based Weight”, except that parameters are removed randomly in the pruning phase
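As referenced in the “Low Rank Factorization” item above, the truncated-SVD step can be sketched in a few lines of numpy. This is an illustration of that baseline, not the exact experimental code.

```python
import numpy as np

def low_rank_factorize(W_hh, rank=42):
    """Truncated-SVD factorization of the hidden-to-hidden weight matrix:
    W = U S V^T  ->  W' = U[:, :rank] S[:rank] V^T[:rank, :].
    Storing the two factors keeps 128*42 + 42*128 parameters instead of 128*128."""
    U, S, Vt = np.linalg.svd(W_hh, full_matrices=False)  # S is in descending order
    left = U[:, :rank] * S[:rank]    # fold the singular values into one factor
    right = Vt[:rank, :]
    return left, right               # W' ≈ left @ right

# Example: factorize a 128x128 recurrent weight matrix to rank 42.
W_hh = np.random.randn(128, 128)
left, right = low_rank_factorize(W_hh, rank=42)
W_approx = left @ right
```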