Fast Deep Predictive Coding Networks for Videos Feature Extraction without Labels

Wenqian  Xue, Chi Ding, Jose Principe
Department of Electrical & Computer Engineering
University of Florida
Gainesville, FL 32611
w.xue@ufl.edu, ding.chi@ufl.edu, principe@cnel.ufl.edu
Abstract

Brain-inspired deep predictive coding networks (DPCNs) effectively model and capture video features through a bi-directional information flow, even without labels. They are based on an overcomplete description of video scenes, and one of the bottlenecks has been the lack of effective sparsification techniques to find discriminative and robust dictionaries. FISTA has been the best alternative. This paper proposes a DPCN with a fast inference of internal model variables (states and causes) that achieves high sparsity and accuracy of feature clustering. The proposed unsupervised learning procedure, inspired by adaptive dynamic programming with a majorization-minimization framework, and its convergence are rigorously analyzed. Experiments on the CIFAR-10, Super Mario Bros video game, and Coil-100 data sets validate the approach, which outperforms previous versions of DPCNs on learning rate, sparsity ratio, and feature clustering accuracy. Because of DPCN's solid foundation and explainability, this advance opens the door for general applications in object recognition in video without labels.

1 Introduction

Sparse models are important for systems with a plethora of parameters and variables, as they selectively activate only a small subset of the variables or coefficients while maintaining representation accuracy and computational efficiency. This not only reduces the amount of data, and hence storage, needed to represent a dynamic system but also gives more concise and easier access to the contained information in areas including control, signal processing, and sensory compression.

In the control theory sense, a model for a dynamic process is often described by the equations

$$
\left\{\begin{array}{l}
y_{t} = G_{t}(x_{t}) + n_{t} \\
x_{t} = F_{t}(x_{t-1}, u_{t}) + w_{t}
\end{array}\right.
$$

where $y_{t}$ is a set of measurements associated with a changing state $x_{t}$ through a mapping function $G_{t}$; the state $x_{t}$, also known as the signal of interest, is produced from the past state $x_{t-1}$ and an input $u_{t}$ through an evolution function $F_{t}$; $n_{t}$ is the measurement noise and $w_{t}$ is the modeling error. Given measurements $y_{t}$ and input $u_{t}$, the Kalman filter [1, 2] has emerged as a widely employed technique for estimating states [3, 4] and mapping functions using neural networks [5] in a sparse way [6, 7, 8]. It is, however, typically constrained to estimate a single variable, namely the state. Can both state and input variables be estimated? For many dynamic plants characterized by natural and complex signals, latent variables often exhibit residual dependencies as well as non-stationary statistical properties. Can data with non-stationary statistics be well represented? Additionally, (deep) neural networks (NNs) [9, 10, 11, 5] with multi-layered structures are extensively used for sparse modeling of dynamic systems [12, 13, 14]. Similarly structured, convolutional NNs have demonstrated significant success in tasks such as target detection and feature classification in computer vision and control applications [15, 16, 17, 18, 19]. However, these methods are difficult to interpret mathematically, and the NN architecture is a feedforward pass through stacks of convolutional layers. As studied in [20], the brain uses a bi-directional information pathway, with feedback and recurrent passes in addition to the feedforward pass, for effective visual perception. Can dynamics be represented in an interpretable way with bi-directional connections and interactions?

These goals can be achieved by the hierarchical predictive coding networks [21, 22, 23, 24], also known as deep-predictive-coding networks (DPCNs) [25, 26, 27, 28, 29, 30, 31], where, inspired by [20], a hierarchical generative model is formulated as

$$
\left\{\begin{array}{l}
y_{t}^{l} = G_{t}(x_{t}^{l}) + n_{t}^{l} \\
x_{t}^{l} = F_{t}(x_{t-1}^{l}, u_{t}^{l}) + w_{t}^{l}
\end{array}\right.
$$

where $l$ denotes the layer. Measurements for layer $l$ are the causes of the lower layer, i.e., $y_{t}^{l}=u_{t}^{(l-1)}$ for $l>1$. The causes link the layers, and the states link the dynamics over time $t$. The model admits a bi-directional information flow [32, 30], including feedforward, feedback, and recurrent connections. That is, measurements travel through a bottom-up pathway from lower to higher visual areas (for rapid object recognition) and simultaneously through a top-down pathway running in the opposite direction (to enhance the recognition) [33]. Previous DPCNs use either linear filters for sound [25, 26] or convolutions to better preserve neighborhoods in images [27, 28]. With fovea vision, non-convolutional DPCNs may offer a more automated and straightforward implementation [31, 30]. In both types of DPCNs, proximal gradient descent methods, such as the fast iterative shrinkage-thresholding algorithm (FISTA) [34], are frequently used to accelerate variable and model inference [27, 31, 30]. Can DPCN inference be made faster while maintaining high sparsity?

This paper answers these questions by studying a vector DPCN with an improved inference procedure for both variables and models (dictionaries) that is applicable to both types of DPCNs and that is tested, as a proof of concept, on modeling and capturing objects in videos. Given measurements from the real world, DPCNs infer model parameters and variables through feedforward, feedback, and recurrent connections represented by optimization problems with sparsity penalties. Inspired by majorization minimization (MM) [35] and the value iteration of reinforcement learning (RL) [36], this paper proposes an MM-based unsupervised learning procedure that enhances the inference of DPCNs by introducing a majorizer of the sparsity penalty. The resulting networks are called MM-DPCNs and offer the following advantages:

  • The learning procedure does not need labels and offers accelerated inference.

  • The inference results guarantee sparsity of variables and representation accuracy of features.

  • Rigorous proofs show convergence and interpretability.

  • Experiments validate the higher performance of MM-DPCNs versus previous DPCNs on learning rate, sparsity ratio, and feature clustering accuracy.

2 Dynamic Networks for DPCNs

Table 1: Notations.
Symbol | Meaning
$y_{t,n}^{1}$ | $n$-th patch of the video frame at time $t$
$y_{t,n}^{l}$, $l>1$ | the causes from layer $l-1$
$x_{t,n}^{l}$ | state at layer $l$ for $y_{t,n}$
$u_{t}^{l}$ | cause at layer $l$ for a group of $x_{t,n}^{l}$
$A^{l}, B^{l}, C^{l}$ | model parameters at layer $l$

Based on the hierarchical generative model [20, 31] briefly reviewed in the Introduction, we now review the dynamic networks for DPCNs [31, 30] in terms of sparse optimization problems for sparse modeling and feature extraction of videos.

Figure 1: Two-layered DPCN structure. The video frame is decomposed into patches (green blocks). Every patch is mapped onto a state $x_{t}^{1}$ at layer 1, and the cause $u_{t}^{1}$ pools all the states within a group. The cause $u_{t}^{1}$ is the input of layer 2 and corresponds to state $x_{t}^{2}$ and cause $u_{t}^{2}$.

The structure of the DPCN is shown in Fig. 1, and the notation is given in Table 1. Given a video input, each video frame is decomposed into multiple spatially contiguous patches, denoted by $y_{t,n}\in\mathbb{R}^{P}$, $n\in\mathcal{N}=\{1,2,\cdots,N\}$, each a vectorized form of a $\sqrt{P}\times\sqrt{P}$ square patch. These measurements are injected into the DPCN, which has a hierarchical multi-layered structure. From the second layer on, the causes from a lower layer serve as the input of the next layer, i.e., $y_{t,n}^{l}=u_{t,n}^{l-1}$. At every layer, the network consists of two distinct parts: feature extraction (inferring states) and pooling (inferring causes). The parameters that connect states and causes are called the model (dictionary), and they are learned alongside the states and causes (inferring the model). The networks and connections at each layer $l$ are given in terms of objective functions for the inferences. In the following, we omit the layer superscript $l$ for simplicity.

For inferring states given a patch measurement $y_{t,n}$, we use a linear state-space model with an over-complete dictionary of $K$ filters, i.e., $C\in\mathbb{R}^{P\times K}$ with $P<K$, to obtain sparse states $x_{t,n}\in\mathbb{R}^{K}$. In addition, a state-transition matrix $A\in\mathbb{R}^{K\times K}$ is applied to keep track of the dynamics of the historical sparse states. To this end, the objective function for states is given by

$$
E_{x}(x_{t}) = \sum_{n=1}^{N}\frac{1}{2}\|y_{t,n}-Cx_{t,n}\|_{2}^{2}+\mu\|x_{t,n}\|_{1}+\lambda\|x_{t,n}-Ax_{t-1,n}\|_{1}, \tag{1}
$$

where $\lambda>0$ and $\mu>0$ are weighting parameters, $\|\cdot\|_{2}^{2}$ is the squared $L_{2}$-norm denoting energy, and $\|\cdot\|_{1}$ is the $L_{1}$-norm serving as the penalty term that makes the solution sparse [37].
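As a minimal sketch of how the state objective in Eq. (1) can be evaluated, the NumPy function below sums the reconstruction, sparsity, and transition terms over all patches of one frame. The function name `state_energy`, the row-wise storage of patches and states, and the toy values are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def state_energy(Y, X, X_prev, C, A, mu, lam):
    """Evaluate E_x of Eq. (1) for one frame.

    Y: (N, P) patches; X, X_prev: (N, K) states at t and t-1;
    C: (P, K) dictionary; A: (K, K) state-transition matrix.
    """
    recon = 0.5 * np.sum((Y - X @ C.T) ** 2)             # sum_n 1/2 ||y_n - C x_n||_2^2
    sparsity = mu * np.sum(np.abs(X))                     # mu * sum_n ||x_n||_1
    transition = lam * np.sum(np.abs(X - X_prev @ A.T))   # lam * sum_n ||x_n - A x_{t-1,n}||_1
    return recon + sparsity + transition

# Toy example: one patch (N=1), P=2 pixels, K=3 filters.
C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
A = np.eye(3)
Y = np.array([[1.0, 1.0]])
X = np.array([[1.0, 0.0, -1.0]])
X_prev = np.zeros((1, 3))
value = state_energy(Y, X, X_prev, C, A, mu=0.5, lam=0.1)
print(value)  # about 3.7 = 2.5 (reconstruction) + 1.0 (sparsity) + 0.2 (transition)
```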

For inferring causes given states, $u_{t}\in\mathbb{R}^{D}$ multiplicatively interacts with the accumulated states through $B\in\mathbb{R}^{K\times D}$ such that whenever a component of $u_{t}$ is active, the corresponding set of components in $x_{t}$ is also likely to be active. This supports meaningful clustering of features even when the states have a non-stationary distribution [38]. To this end, the objective function for causes is given by

$$
E_{u}(u_{t}) = \sum_{n=1}^{N}\sum_{k=1}^{K}\gamma|(x_{t,n})_{k}|\left(1+\exp(-(Bu_{t})_{k})\right)+\beta\|u_{t}\|_{1} \tag{2}
$$

where $\gamma>0$ and $\beta>0$ are weighting parameters.
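The multiplicative interaction in Eq. (2) can be illustrated with a short NumPy sketch. It shows, on assumed toy values, that activating the cause linked to a strongly active state lowers the energy. The name `cause_energy` and the identity choice for $B$ are hypothetical and serve only as an illustration.

```python
import numpy as np

def cause_energy(X, u, B, gamma, beta):
    """Evaluate E_u of Eq. (2). X: (N, K) states, u: (D,) causes, B: (K, D)."""
    abs_states = gamma * np.sum(np.abs(X), axis=0)   # gamma * sum_n |x_{t,n}|, shape (K,)
    gate = 1.0 + np.exp(-(B @ u))                    # 1 + exp(-(B u)_k), shape (K,)
    return abs_states @ gate + beta * np.sum(np.abs(u))

X = np.array([[1.0, -2.0]])   # the second state component is the more active one
B = np.eye(2)                 # toy choice: cause k gates state k
inactive = cause_energy(X, np.zeros(2), B, gamma=1.0, beta=0.3)
active = cause_energy(X, np.array([0.0, 1.0]), B, gamma=1.0, beta=0.3)
print(inactive, active)       # the second value is smaller
```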

For inferring the model $\theta=\{A,B,C\}$ given states and causes, the overall objective function is given by

$$
E_{p}(x_{t},u_{t},\theta)=E_{x}(x_{t})+E_{u}(u_{t}). \tag{3}
$$

Notably, minimizing $E_{x}$ and minimizing $E_{u}$ are strongly convex problems, and we will design a learning method to find the unique optimal sparse solution.

3 Learning For Model Inference and Variable Inference

Figure 2: (a) Bi-directional inference flow, where feedforward (yellow), feedback (green), and recurrent (pink) connections convey the bottom-up and top-down predictions. (b) Connections for variable inference (solid lines) and for model inference (dashed lines).

In this section, we propose an unsupervised learning method for self-organizing models and variables with accelerated learning while maintaining high sparsity and accuracy of feature extraction. The flow and connections for the inference are shown in Fig. 2. The inference process includes Model Inference and Variable Inference. Model inference requires repeated interleaved updates of states and causes, alternated with updates of the model. Then, given a model, variable inference requires interleaved updates of states and causes using an extra top-down preference from the upper layer. Together, these form a bi-directional inference process on a bottom-up feedforward path, a top-down feedback path, and a recurrent path.

For the updates of states and causes involved in Model Inference and Variable Inference, we propose a new learning procedure using the majorization minimization (MM) framework [39, 35] for optimization with a sparsity constraint. The frequently used proximal gradient descent methods, the iterative shrinkage-thresholding algorithm (ISTA) and fast ISTA (FISTA) [34, 40, 41], use a majorizer for the differentiable non-sparsity-penalty terms [31]; in contrast, this paper uses a majorizer for the sparsity penalty. As such, the convex non-differentiable optimization problem with a sparsity constraint is transformed into a convex and differentiable problem. Moreover, taking advantage of the over-complete dictionary and an iteration form inspired by the value iteration of RL [36], the iterations for inference are derived from the condition for the optimal sparse solution to the MM-based optimization problems. This also differs from the traditional gradient descent method and the adaptive moment estimation (ADAM) [42] method for solving optimization problems.

3.1 MM-Based Model Inference

Model inference seeks $\theta=\{A,B,C\}$ by minimizing $E_{p}(x_{t},u_{t},\theta)$ in (3), with an interleaved procedure that infers states and causes by minimizing $E_{x}$ (1) and $E_{u}$ (2).

State Inference

To infer sparse $x_{t,n}$ by minimizing $E_{x}$ (1), we first let $e_{t,n}=x_{t,n}-Ax_{t-1,n}$ and use Nesterov's smooth approximator [43, 44], which takes the form

$$
\|e\|_{1}\approx f_{s}(e)\triangleq(\alpha^{*})^{T}e-\frac{m}{2}\|\alpha^{*}\|_{2}^{2} \tag{4}
$$

where $m>0$ is a constant and $\alpha^{*}$ is some vector reaching the best approximation. Then, we find a majorizer for the penalty term $\mu\|x_{t,n}\|_{1}$ [39] in the form

$$
\mu\|x\|_{1}\leq h(x,V_{x})\triangleq\frac{1}{2}x^{T}W_{x}x+c \tag{5}
$$

with equality at $x=V_{x}$, where $V_{x}$ is a vector, $W_{x}=\text{diag}(\mu./|V_{x}|)$ with $./$ denoting component-wise division, and $c$ is a constant independent of $x$ (see details in Appendix A).
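The two building blocks in Eqs. (4) and (5) can be checked numerically with a small NumPy sketch. It rests on two assumptions that the text above does not spell out: $\alpha^{*}$ is taken as the standard Nesterov maximizer $\text{clip}(e/m,-1,1)$, and the constant is taken as $c=\frac{\mu}{2}\|V_{x}\|_{1}$, which is the usual choice that gives equality at $x=V_{x}$. The function names are hypothetical.

```python
import numpy as np

def smooth_l1(e, m):
    """Nesterov-smoothed L1 norm of Eq. (4), assuming alpha* = clip(e / m, -1, 1)."""
    alpha = np.clip(e / m, -1.0, 1.0)
    return alpha @ e - 0.5 * m * (alpha @ alpha), alpha

def l1_majorizer(x, V, mu):
    """Quadratic majorizer h(x, V) of mu * ||x||_1 in Eq. (5); needs every V_k != 0."""
    w = mu / np.abs(V)                    # diagonal of W_x
    c = 0.5 * mu * np.sum(np.abs(V))      # assumed constant, independent of x
    return 0.5 * np.sum(w * x ** 2) + c

e = np.array([2.0, -0.05, 0.0])
approx, alpha_star = smooth_l1(e, m=0.1)
print(approx, np.sum(np.abs(e)))          # smooth value stays just below the L1 norm

V = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 1.0, -0.7])
print(l1_majorizer(x, V, mu=0.3) >= 0.3 * np.sum(np.abs(x)))   # majorization holds
```

With this choice of $\alpha^{*}$, the gap between $\|e\|_{1}$ and $f_{s}(e)$ is at most $m/2$ per component, so a smaller $m$ gives a tighter but less smooth approximation.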

Applying the approximator (4), the majorizer (5), and MM principles, the minimization problem of $E_{x}$ (1) can be transformed into the minimization of

$$
H_{x}(x_{t,n})=\sum_{n=1}^{N}\frac{1}{2}\|y_{t,n}-Cx_{t,n}\|_{2}^{2}+\lambda f_{s}(e_{t,n})+h(x_{t,n},V_{x}). \tag{6}
$$

Minimizing $H_{x}$ with respect to $x_{t,n}$ yields the Karush–Kuhn–Tucker (KKT) condition for the optimal sparse state

$$
(C^{T}C+W_{x})x_{t,n}=C^{T}y_{t,n}-\lambda\alpha^{*}. \tag{7}
$$

To find such an optimal state, we propose Algorithm 1, which applies an iterative form of (7) and is applicable to every layer. The update of states at each iteration is one-step optimal. We set a positive initial value for the state; it cannot be zero because the iteration never updates when $R_{x}^{0}=0$. Also, the optimal state in (7) is expected to be sparse, namely some components of $x_{t,n}^{i}$ go to zero. This makes the corresponding entries of $W_{x}$ go to infinity, leading to numerically inaccurate results. We avoid this by using $R_{x}=(W_{x})^{-1}$ and the matrix inversion lemma [45]

$$
(C^{T}C+W_{x})^{-1}=R_{x}-R_{x}C^{T}(I+CR_{x}C^{T})^{-1}CR_{x}\triangleq T(C,R_{x}). \tag{8}
$$

Note that the matrix $C^{T}C+W_{x}$ is invertible because $C^{T}C$ is positive semi-definite and the diagonal matrix $W_{x}$ is positive definite. To further accelerate the computation, we can avoid directly computing the inverse $(I+CR_{x}C^{T})^{-1}$ by using the conjugate gradient method to compute $(I+CR_{x}C^{T})^{-1}CR_{x}$.
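The identity in Eq. (8) can be verified numerically. The sketch below, with assumed toy sizes and a hypothetical helper `T` that takes the diagonal of $R_{x}$ as a vector, compares the lemma against a direct inverse and shows that the $R_{x}$ form stays finite when a component of $R_{x}$ is exactly zero, which is the case in which $W_{x}$ would be infinite.

```python
import numpy as np

def T(C, r):
    """Eq. (8): (C^T C + W)^{-1} computed from r = diag(R) = 1 / diag(W)."""
    P = C.shape[0]
    CR = C * r                                  # C @ diag(r), shape (P, K)
    inner = np.eye(P) + CR @ C.T                # I + C R C^T, only P x P
    return np.diag(r) - CR.T @ np.linalg.solve(inner, CR)

rng = np.random.default_rng(0)
C = rng.normal(size=(4, 6))                     # P=4 < K=6 (over-complete)
r = np.array([0.5, 1.0, 2.0, 0.1, 3.0, 0.7])
direct = np.linalg.inv(C.T @ C + np.diag(1.0 / r))
woodbury = T(C, r)
print(np.allclose(direct, woodbury))

r_sparse = r.copy()
r_sparse[2] = 0.0                               # a state component that is exactly zero
print(np.isfinite(T(C, r_sparse)).all())        # no division by zero is needed
```

Only a $P\times P$ system is solved here, which is the smaller dimension when the dictionary is over-complete.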

Cause Inference

To infer sparse causes by minimizing $E_{u}$ (2), we find a majorizer of $\beta\|u_{t}\|_{1}$ as

$$
\beta\|u\|_{1}\leq h(u,V_{u})\triangleq\frac{1}{2}u^{T}W_{u}u+c \tag{9}
$$

with equality at $u=V_{u}$, where $W_{u}=\text{diag}(\beta./|V_{u}|)$. Therefore, based on MM principles, we transform the minimization of $E_{u}$ in (2) into the minimization of

$$
H_{u}(u_{t})=|X_{t}|^{T}(1+\exp(-Bu_{t}))+h(u_{t},V_{u}) \tag{10}
$$

where $|X_{t}|=\gamma\sum_{n=1}^{N}|x_{t,n}|$. Minimizing $H_{u}$ with respect to $u_{t}$ yields the KKT condition (with $.$ denoting the component-wise product)

$$
W_{u}u_{t}=B^{T}(|X_{t}|\,.\,\exp(-Bu_{t})). \tag{11}
$$

To find such an optimal cause, we propose Algorithm 2, which applies the iterative form of (11) for cause inference and is applicable to every layer. Since $R_{u}=(W_{u})^{-1}$ and the iteration never updates when $R_{u}^{0}=0$, we set an initial value $u_{t}^{0}>0$.

With the model parameter $\theta$ fixed, states $x_{t,n}$ and causes $u_{t}$ are updated in an interleaved manner until they converge. Since the sparsity penalty terms are replaced by majorizers during learning, small values of the variables are clamped to zero via thresholds, $e_{x}>0$ for states and $e_{u}>0$ for causes. As such, the states and causes become sparse within a finite number of iterations.

Algorithm 1 State Inference

1. Initialization: initial values of states $x_{t,n}^{0}$, initial iteration step $i=0$.

2. Update the state at patch $n$ and time $t$:

$$
x_{t,n}^{i+1}=T(C,R_{x}^{i})(C^{T}y_{t,n}-\lambda\alpha^{*}), \tag{12}
$$
$$
R_{x}^{i}=\text{diag}\left(\frac{|x_{t,n}^{i}|}{\mu}\right). \tag{13}
$$

3. Set $i=i+1$ and repeat step 2 until convergence.
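The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the product $\lambda\alpha^{*}$ is passed in as a fixed, optional vector `lam_alpha` (a hypothetical parameter; leaving it out corresponds to $\lambda=0$), the threshold `eps` plays the role of $e_{x}$, and the recorded objective is the $\lambda=0$ case $\frac{1}{2}\|y-Cx\|_{2}^{2}+\mu\|x\|_{1}$.

```python
import numpy as np

def lasso_objective(x, y, C, mu):
    return 0.5 * np.sum((y - C @ x) ** 2) + mu * np.sum(np.abs(x))

def mm_state_inference(y, C, mu, lam_alpha=None, x0=None, n_iter=50, eps=1e-6):
    """Iterate Eqs. (12)-(13) for one patch; returns the state and the objective history."""
    P, K = C.shape
    x = np.ones(K) if x0 is None else np.array(x0, dtype=float)   # non-zero start
    b = C.T @ y - (0.0 if lam_alpha is None else lam_alpha)       # right-hand side of Eq. (7)
    history = [lasso_objective(x, y, C, mu)]
    for _ in range(n_iter):
        r = np.abs(x) / mu                                  # Eq. (13): diagonal of R_x^i
        CR = C * r                                          # C @ diag(r)
        inner = np.eye(P) + CR @ C.T                        # I + C R C^T
        x = r * b - CR.T @ np.linalg.solve(inner, CR @ b)   # Eq. (12): T(C, R_x^i) b
        history.append(lasso_objective(x, y, C, mu))
    x[np.abs(x) < eps] = 0.0                                # clamp small values to zero
    return x, history

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 20))
C /= np.linalg.norm(C, axis=0)                              # unit-norm dictionary columns
x_true = np.zeros(20)
x_true[[2, 11]] = [1.5, -2.0]
y = C @ x_true
x_hat, history = mm_state_inference(y, C, mu=0.1)
print(history[0], history[-1], np.count_nonzero(x_hat))
```

No step size appears in the loop: each iteration solves the majorized quadratic problem exactly, which is why the recorded objective does not increase from one iteration to the next.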

Algorithm 2 Cause Inference

1. Initialization: initial values of causes $u_{t}^{0}$, initial iteration step $j=0$.

2. Update the causes at time $t$:

$$
u_{t}^{j+1}=R_{u}^{j}B^{T}(|X_{t}|\,.\,\exp(-Bu_{t}^{j})), \tag{14}
$$
$$
R_{u}^{j}=\text{diag}\left(\frac{|u_{t}^{j}|}{\beta}\right). \tag{15}
$$

3. Set $j=j+1$ and repeat step 2 until convergence.
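Algorithm 2 can be sketched in the same way. The toy input below is an assumption chosen so that the fixed point is known in closed form: with $B=I$, $\gamma=1$, and $\beta=0.5$, condition (11) gives $u=\ln(|X_{t}|/\beta)$ for a component with $|X_{t}|>\beta$ and $u=0$ otherwise. The function name and the threshold `eps` (standing in for $e_{u}$) are hypothetical.

```python
import numpy as np

def mm_cause_inference(X, B, gamma, beta, u0=None, n_iter=50, eps=1e-6):
    """Iterate Eqs. (14)-(15). X: (N, K) states, B: (K, D); returns causes of shape (D,)."""
    D = B.shape[1]
    abs_states = gamma * np.sum(np.abs(X), axis=0)          # |X_t|, shape (K,)
    u = np.full(D, 0.1) if u0 is None else np.array(u0, dtype=float)   # u^0 > 0
    for _ in range(n_iter):
        r = np.abs(u) / beta                                # Eq. (15): diagonal of R_u^j
        u = r * (B.T @ (abs_states * np.exp(-(B @ u))))     # Eq. (14)
    u[np.abs(u) < eps] = 0.0                                # clamp small values to zero
    return u

X = np.array([[1.0, 0.25]])     # one strongly active and one weakly active state
u = mm_cause_inference(X, np.eye(2), gamma=1.0, beta=0.5)
print(u)                        # first cause near ln(2), second clamped to zero
```

The toy values keep this plain fixed-point iteration well behaved; much larger ratios of $|X_{t}|$ to $\beta$ may require damping, which is an observation about this sketch and not a statement about the original implementation.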

Model Parameters Inference

By fixing the converged states and causes, the model parameters $\theta=\{A,B,C\}$ are updated based on the overall objective function (3). For time-varying input, to keep track of the temporal relationships of the parameters, we put an additional constraint on the parameters [30, 31], i.e., $\theta_{t}=\theta_{t-1}+z_{t}$, where $z_{t}$ is Gaussian transition noise acting as an additional temporal smoothness prior. Along with this constraint, each matrix can be updated independently using gradient descent. We recommend normalizing the columns of matrices $C$ and $B$ after the update to avoid trivial solutions.
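For the dictionary $C$, one gradient step followed by column normalization could look like the sketch below. It uses only the reconstruction term of $E_{x}$, whose gradient with respect to $C$ is $-\sum_{n}(y_{t,n}-Cx_{t,n})x_{t,n}^{T}$, and it leaves out the temporal smoothness prior on $\theta_{t}$. The step size `lr` and all function names are assumptions made for illustration.

```python
import numpy as np

def reconstruction_loss(C, Y, X):
    return 0.5 * np.sum((Y - X @ C.T) ** 2)

def grad_C(C, Y, X):
    """Gradient of sum_n 1/2 ||y_n - C x_n||_2^2 with respect to C, shape (P, K)."""
    return -(Y - X @ C.T).T @ X

def update_C(C, Y, X, lr=1e-2):
    """One gradient descent step on C followed by column normalization."""
    C_new = C - lr * grad_C(C, Y, X)
    norms = np.linalg.norm(C_new, axis=0, keepdims=True)
    return C_new / np.maximum(norms, 1e-12)     # guard against an all-zero column

rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 5))      # N=4 patches, P=5
X = rng.normal(size=(4, 8))      # converged states, K=8
C = rng.normal(size=(5, 8))
C_next = update_C(C, Y, X)
print(np.linalg.norm(C_next, axis=0))   # every column has unit norm
```

The same pattern would apply to $A$ and $B$ with their own gradients from (3).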

3.2 MM-Based Variable Inference with Top-Down Preference

Given the learned model, the updates of states and causes in the variable inference process are the same as in Section 3.1, except that a top-down preference is added to $E_{u}$ (2) for cause inference. Since the causes at a lower layer serve as the input of the upper layer, a predicted top-down reference computed from the states of the layer above is injected into the cause inference of the lower layer. That is,

$$
\bar{E}_{u}(u_{t})=E_{u}(u_{t})+\frac{1}{2}\|u_{t}-\hat{u}_{t}\|_{2}^{2}, \tag{16}
$$

where $\hat{u}_{t}$ is the top-down prediction [46]. Determination of its value can be found in Appendix A and [31]. Similar to Section 3.1, using the majorizer (9) to replace the $L_{1}$-norm penalty in $E_{u}$, minimizing $\bar{E}_{u}$ (16) becomes minimizing

$$
\bar{H}_{u}(u_{t})=H_{u}(u_{t})+\frac{1}{2}\|u_{t}-\hat{u}_{t}\|_{2}^{2} \tag{17}
$$

with respect to $u_{t}$, which yields the KKT condition

$$
(I+W_{u})u_{t}=\hat{u}_{t}+B^{T}(|X_{t}|\,.\,\exp(-Bu_{t})) \tag{18}
$$

for every layer, where $I$ denotes the identity matrix. Since the diagonal matrix $(I+W_{u})$ is non-singular, we develop the iterative form in Algorithm 3.

Since inferences at each layer are independent, the complete learning procedure for each layer is summarized in Algorithm 4. For better convergence of state inference and cause inference, which are interleaved in an alternating minimization manner, we recommend running Algorithm 1 for several iterations $i_{s}$ and then Algorithm 2 for several iterations $j_{s}$.

Algorithm 3 Top-down Cause Inference

1. Initialization: initial values of causes $u_{t}^{0}$, initial iteration step $j=0$.

2. Update the causes at time $t$:

$$
u_{t}^{j+1}=T(I_{D},\bar{R}_{u}^{j})\left(\hat{u}_{t}+B^{T}(|X_{t}|\,.\,\exp(-Bu_{t}^{j}))\right), \tag{19}
$$
$$
\bar{R}_{u}^{j}=\text{diag}\left(\frac{|u_{t}^{j}|}{\beta}\right). \tag{20}
$$

3. Set $j=j+1$ and repeat step 2 until convergence.
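Because $T(I_{D},\bar{R}_{u})=(I+W_{u})^{-1}$ is diagonal with entries $r_{k}/(1+r_{k})$, where $r_{k}=|(u_{t})_{k}|/\beta$, Algorithm 3 needs no linear solve. The NumPy sketch below illustrates this on assumed toy inputs; the function name and the threshold `eps` are hypothetical. When the states are all zero, the fixed point of (18) reduces to soft-thresholding of $\hat{u}_{t}$ by $\beta$, which gives a convenient sanity check.

```python
import numpy as np

def mm_topdown_cause_inference(X, B, u_hat, gamma, beta, u0=None, n_iter=100, eps=1e-6):
    """Iterate Eqs. (19)-(20) with a top-down prediction u_hat of shape (D,)."""
    D = B.shape[1]
    abs_states = gamma * np.sum(np.abs(X), axis=0)          # |X_t|, shape (K,)
    u = np.full(D, 0.1) if u0 is None else np.array(u0, dtype=float)   # u^0 > 0
    for _ in range(n_iter):
        r = np.abs(u) / beta                                # Eq. (20): diagonal of R_u^j
        t = r / (1.0 + r)                                   # diagonal of T(I_D, R_u^j)
        u = t * (u_hat + B.T @ (abs_states * np.exp(-(B @ u))))   # Eq. (19)
    u[np.abs(u) < eps] = 0.0                                # clamp small values to zero
    return u

u_hat = np.array([2.0, 0.2])
X = np.array([[1.0, 0.25]])
u_td = mm_topdown_cause_inference(X, np.eye(2), u_hat, gamma=1.0, beta=0.5)
u_no_state = mm_topdown_cause_inference(np.zeros((1, 2)), np.eye(2), u_hat, gamma=1.0, beta=0.5)
print(u_td, u_no_state)   # the second result is close to [1.5, 0.0]
```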

Algorithm 4 MM-DPCNs

1. Initialization: video input $y_{t,n}$, initial model parameters $\theta^{0}$, initial variables $x_{t,n}^{0}$, $u_{t}^{0}$.

2. Model Inference:
i) Update state $x_{t,n}$ by Algorithm 1 and cause $u_{t}$ by Algorithm 2 in an interleaved manner until convergence.
ii) Update dictionary $\theta$ once using the gradient descent method.
iii) Go to step i) until $\theta$ converges.

3. Bi-Directional Variable Inference:
Fix model $\theta$. Run Algorithms 1 and 3 in an interleaved manner to infer $x_{t,n}$ and $u_{t}$ until they converge.

4 Convergence Analysis of MM-Based Variable Inference

In this section, we analyze the convergence of the proposed Algorithm 1 for state inference and Algorithm 2 for cause inference, respectively.

Convergence of State Inference

State inference is independent at each patch $n$ and each layer $l$; hence, we analyze the convergence of the objective function $E_{x}$ (1) under Algorithm 1, dropping the indices $n$ and $l$ for simplicity. To do this, we introduce an auxiliary objective function

$$
F(x_{t})=f(x_{t})+g(x_{t}) \tag{21}
$$

where $f(x_{t})=\frac{1}{2}\|y_{t}-Cx_{t}\|_{2}^{2}+\lambda f_{s}(e_{t})$ and $g(x_{t})=\mu\|x_{t}\|_{1}$. Rewrite $H_{x}$ in (6) for each patch as

$$
H_{x}(x_{t},V_{x})=f(x_{t})+h(x_{t},V_{x}) \tag{22}
$$

where $g(x_{t})\leq h(x_{t},V_{x})$ with equality at $x_{t}=V_{x}$ as shown in (5). This admits the unique minimizer

$$
P(V_{x}):=\underset{x_{t}}{\text{argmin}}\,H_{x}(x_{t},V_{x}). \tag{23}
$$

Theorem 1

Consider the sequence $\{x_{t}^{i}\}\in\mathbb{R}^{K}$ for a patch generated by Algorithm 1. Then, $F(x_{t}^{i})$ converges, and for any $s\geq 1$ we have

$$
F(x_{t}^{s})-F(x_{t}^{*})\leq\frac{1}{2s}\sum_{i=0}^{s-1}(|x_{t}^{*}|-|x_{t}^{i}|)^{T}R(|x_{t}^{*}|-|x_{t}^{i}|) \tag{24}
$$

where $R=\text{diag}\{1/(\tilde{1}|(x_{t}^{0})_{k}|+(1-\tilde{1}-\bar{1})|(x_{t}^{*})_{k}|+\bar{1}|(x_{t}^{i})_{k}|)\}$, $k\in\{1,2,\ldots,K\}$, with $\tilde{1}=1$ if $|(x_{t}^{*})_{k}|\geq|(x_{t}^{0})_{k}|>0$, $\tilde{1}=0$ if $0\leq|(x_{t}^{*})_{k}|<|(x_{t}^{0})_{k}|$, $\bar{1}=1$ if $|(x_{t}^{*})_{k}|=0$, and $\bar{1}=0$ otherwise. Here $(\cdot)_{k}$ denotes the $k$-th element of a vector.

Proof: Please see Appendix B.

Theorem 2

Let $x_{t}^{*}$ be the optimal solution to minimizing $E_{x}$ (1) for a single patch at a layer. The upper bound of its convergence satisfies

$$
E_{x}(x_{t}^{s})-E_{x}(x_{t}^{*})\leq\lambda m\bar{D}+\frac{1}{2s}\sum_{i=0}^{s-1}(|x_{t}^{*}|-|x_{t}^{i}|)^{T}R(|x_{t}^{*}|-|x_{t}^{i}|), \tag{25}
$$

where $\bar{D}=\underset{\|\alpha\|_{\infty}\leq 1}{\text{max}}\frac{1}{2}\|\alpha\|_{2}^{2}$.

Proof: Please see Appendix B.
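As a small worked step, and assuming $\alpha\in\mathbb{R}^{K}$ (the same dimension as the state), the constant $\bar{D}$ in (25) can be evaluated in closed form, because a convex quadratic attains its maximum over the box $\|\alpha\|_{\infty}\leq 1$ at a vertex with $\alpha_{k}=\pm 1$:

$$
\bar{D}=\max_{\|\alpha\|_{\infty}\leq 1}\frac{1}{2}\sum_{k=1}^{K}\alpha_{k}^{2}=\frac{K}{2},\qquad\text{so}\qquad\lambda m\bar{D}=\frac{\lambda mK}{2}.
$$

Under this reading, the first term of (25) is a fixed smoothing offset that shrinks linearly with $m$, while the second term carries the dependence on the iteration count $s$.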

Convergence of Causes Inference

The convergence of cause inference can be analyzed similarly. We rewrite the function $E_{u}$ (2) at a single layer as

$$
E_{u}(u_{t})=f_{u}(u_{t})+\beta\|u_{t}\|_{1} \tag{26}
$$

where $f_{u}(u_{t})=|X_{t}|^{T}(1+\exp(-Bu_{t}))$. We also rewrite $H_{u}$ (10) with (9) as

$$
H_{u}(u_{t},V_{u})=f_{u}(u_{t})+h(u_{t},V_{u}). \tag{27}
$$

Theorem 3

Consider the sequence $\{u_{t}^{j}\}\in\mathbb{R}^{D}$ generated by Algorithm 2. Then, $E_{u}(u_{t}^{j})$ converges, and for any $s\geq 1$ we have

$$
E_{u}(u_{t}^{s})-E_{u}(u_{t}^{*})\leq\frac{1}{2s}\sum_{j=0}^{s-1}(|u_{t}^{*}|-|u_{t}^{j}|)^{T}\bar{R}(|u_{t}^{*}|-|u_{t}^{j}|), \tag{28}
$$

where $\bar{R}=\text{diag}\{1/(\tilde{1}|(u_{t}^{0})_{k}|+(1-\tilde{1}-\bar{1})|(u_{t}^{*})_{k}|+\bar{1}|(u_{t}^{j})_{k}|)\}$, $k\in\{1,2,\ldots,D\}$, with $\tilde{1}=1$ if $|(u_{t}^{*})_{k}|\geq|(u_{t}^{0})_{k}|>0$, $\tilde{1}=0$ if $0\leq|(u_{t}^{*})_{k}|<|(u_{t}^{0})_{k}|$, $\bar{1}=1$ if $|(u_{t}^{*})_{k}|=0$, and $\bar{1}=0$ otherwise.

Proof: Please see Appendix B.

We have a similar conclusion for Algorithm 3. In Algorithm 3, we set the initial value $u_{t}^{0}>0$. With the diagonal positive-definite matrix $T(I_{D},\bar{R}_{u}^{j})$, i.e., $(I+W_{u}^{j})^{-1}$, and given $u_{t}^{j}>0$, (19) with a normalized matrix $B$ yields $u_{t}^{j+1}>0$. Using a proof similar to that for Algorithm 2, we can deduce that Algorithm 3 makes $u_{t}^{j}$ sparse and minimizes $\bar{H}_{u}$ in (17). Based on the MM principles, it also minimizes the function $\bar{E}_{u}$ in (16).

5 Experiments

We report the performance of MM-DPCNs on image sparse coding and video feature clustering. We compare the MM-based algorithm used for MM-DPCNs with FISTA [34], ISTA [40], and ADAM [42] to test the optimization quality of sparse coding on the CIFAR-10 data set. For video feature clustering, we compare our MM-DPCNs with the previous DPCN version, FISTA-DPCN [31], and with the auto-encoder (AE) [47] and WTA-RNN-AE [48] methods (architecture details are provided in Appendix C) on the video data sets OpenAI Gym Super Mario Bros environment [49] and Coil-100 [50]. Note that these are standard data sets used for sparse coding and feature extraction [51, 52]. We use clustering accuracy (ACC, computed as the completeness score), the adjusted Rand index (ARI), and the sparsity level (SPA) to evaluate clustering quality, and the learning convergence time (LCT) to evaluate sparse coding optimization on each frame. More results on a geometric moving shape data set can be found in Appendix C. The implementations are written in Python with PyTorch, and all the experiments were run on a Linux server with a 32G NVIDIA V100 Tensor Core GPU.

Figure 3: (a) Convergence of MM Algorithm 1, ISTA, FISTA, and ADAM, (b) MM state histogram: sparsity level using MM Algorithm 1, and (c) FISTA state histogram: sparsity level using FISTA.

5.1 Comparison on Image Sparse Coding

Table 2: CIFAR-10 sparse coding optimization.
Methods | $E_{x}$ | SPA
ISTA | $2.96\mathrm{e}4\pm 680$ | $8.96\pm 0.39$
FISTA | $1.77\mathrm{e}4\pm 537$ | $19.50\pm 0.83$
Adam | $1.59\mathrm{e}4\pm 13.68$ | $34.99\pm 0.05$
MM | $1.09\mathrm{e}4\pm 390$ | $79.87\pm 0.32$

The proposed MM Algorithm 1 is applicable to general sparse optimization problems such as Lasso problems [53]. We apply MM Algorithm 1, as well as the well-known ISTA [40] and FISTA [34] for comparison, to the CIFAR-10 data set with the reconstruction and sparsity loss $E_{x}$ (1) ($\mu=0.3$, $\lambda=0$, and randomized $C\in\mathbb{R}^{256\times 300}$). We also compare the performance with the Adam algorithm [42] used to optimize the smooth majorizer, which is of particular interest to the deep learning optimization community. The images are preprocessed by splitting them into four equally sized patches. FISTA and ISTA require a learning rate, set to $\eta=10^{-2}$, while MM is learning-rate-free.
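For contrast with the learning-rate-free MM update, a minimal ISTA baseline for the same $\lambda=0$ objective can be sketched as follows. The problem sizes are toy assumptions and not the CIFAR-10 setting above, and the step size is set from the spectral norm of $C$ instead of the fixed $\eta$ used in the experiments.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(y, C, mu, eta, n_iter=100):
    """ISTA for 1/2 ||y - C x||_2^2 + mu ||x||_1 with step size eta."""
    x = np.zeros(C.shape[1])
    for _ in range(n_iter):
        grad = C.T @ (C @ x - y)                        # gradient of the smooth term
        x = soft_threshold(x - eta * grad, eta * mu)    # gradient step, then shrinkage
    return x

def objective(x, y, C, mu):
    return 0.5 * np.sum((y - C @ x) ** 2) + mu * np.sum(np.abs(x))

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 20))
C /= np.linalg.norm(C, axis=0)
x_true = np.zeros(20)
x_true[[2, 11]] = [1.5, -2.0]
y = C @ x_true
eta = 1.0 / np.linalg.norm(C, 2) ** 2                   # 1 / L, L = largest eigenvalue of C^T C
x_ista = ista(y, C, mu=0.1, eta=eta)
print(objective(np.zeros(20), y, C, 0.1), objective(x_ista, y, C, 0.1))
```

ISTA majorizes the smooth reconstruction term and keeps the $L_{1}$ penalty exact, which is the opposite split from the MM procedure of Section 3.1.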

Fig. 3(a) shows that MM Algorithm 1 converges in fewer than 10 steps, much faster than the others. It also attains a higher sparsity level of the learned state, a direct benefit of the fast convergence rate, as shown in Fig. 3(b) and Fig. 3(c). The statistics of the optimization results are summarized in Table 2, where MM Algorithm 1 produces the lowest loss value while maintaining the highest sparsity level. The results reveal three potential advantages for MM-DPCN: 1. Faster computation. 2. Higher sparsity level for the latent space embeddings. 3. More faithful reconstructions. The last two advantages enable the algorithm to embed highly condensed and faithful information into the latent space, which also benefits feature clustering.

5.2 Comparison on Video Clustering

Super Mario Bros data set

We picked five main objects of the Mario [49] data set from a video sequence played by humans: Bullet Bill, Goomba, Koopa, Mario, and Piranha Plant. They exhibit various movements, such as jumping, running, and opening or closing, against diverse backgrounds. Both training and testing videos contain 500 frames ($32\times 32\times 3$ pixels), with 100 consecutive frames per object. For DPCNs, each frame is divided into four vectorized patches normalized between 0 and 1. The network is initialized with $x^{1}\in\mathbb{R}^{300}$, $u^{1}\in\mathbb{R}^{40}$, $x^{2}\in\mathbb{R}^{100}$, $u^{2}\in\mathbb{R}^{20}$, and model matrices $A^{l},B^{l},C^{l}$, $l=1,2$. We set $\mu^{l}=0.3$ and $\beta^{l}=0.3$ for MM-DPCN and $\mu^{l}=1$ and $\beta^{l}=0.5$ for FISTA-DPCN. Figure 4 shows that MM-DPCN produces a clean separation while keeping each cluster compact. Figure 5(a) demonstrates the reconstruction quality produced by MM-DPCN in comparison to the alternative methods. We observe from Table 3 that MM-DPCN achieves the best ACC, ARI, and SPA, and is much faster than the previous version, FISTA-DPCN.

Figure 4: Clustering result for a Super Mario Bros video data set: (a) AE, (b) WTA-RNN-AE, (c) FISTA-DPCN, (d) MM-DPCN.

Coil-100 data set

The Coil-100 data set [50] consists of 100 videos of different objects, each 72 frames long. The frames are resized to $32\times 32$ pixels and normalized between 0 and 1. We used the first 50 frames of all the objects for training and the remaining 22 frames for testing. We initialize our MM-DPCNs with randomized model matrices $A^{l},B^{l},C^{l}$, $l=1,2$, and $x^{1}\in\mathbb{R}^{2000}$, $x^{2}\in\mathbb{R}^{500}$, $u^{1}\in\mathbb{R}^{128}$, and $u^{2}\in\mathbb{R}^{80}$. We set $\mu^{l}=0.1$, $\beta^{l}=0.1$ for MM-DPCN and $\mu^{l}=1$, $\beta^{l}=0.2$ for FISTA-DPCN. We extract the causes from the last layer of the MM- and FISTA-DPCNs and use PCA to project them into three-dimensional vectors, then apply K-Means for clustering. The same process is applied to the learned latent space encodings of both AE and WTA-RNN-AE, which are constructed using MLPs and ReLU.
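The evaluation pipeline described above (project the last-layer causes to three dimensions with PCA, then cluster with K-Means) can be sketched in plain NumPy. This is a stand-in under assumptions: the synthetic `causes` array replaces real DPCN outputs, PCA is done through an SVD, and K-Means uses a deterministic farthest-point initialization chosen here for reproducibility, which is not necessarily what the experiments used.

```python
import numpy as np

def pca_project(Z, dim=3):
    """Project the rows of Z onto the top `dim` principal components."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:dim].T

def kmeans(Z, k, n_iter=50):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [Z[0]]
    for _ in range(k - 1):
        d = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])          # farthest point from current centers
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)             # assign each point to its nearest center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Synthetic stand-in for last-layer causes: 3 objects, 20 frames each, D=10.
rng = np.random.default_rng(0)
prototypes = 10.0 * np.eye(3, 10)
causes = np.vstack([p + 0.1 * rng.normal(size=(20, 10)) for p in prototypes])
labels = kmeans(pca_project(causes, dim=3), k=3)
print(labels)
```

On real features, the resulting labels would then be scored against the object identities with the ACC and ARI metrics.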

Table 3: Quantitative comparison for video clustering and learning convergence time (LCT, in seconds).
Methods | Mario ACC | Mario ARI | Mario SPA | Mario LCT (s) | Coil-100 ACC | Coil-100 ARI | Coil-100 SPA | Coil-100 LCT (s)
AE | 84.81 | 76.74 | 0.00 | * | 77.74 | 44.04 | 0.00 | *
WTA-RNN-AE | 92.76 | 88.22 | 90.00 | * | 79.28 | 44.45 | 90.00 | *
FISTA-DPCN | 87.74 | 72.01 | 87.22 | 0.084 | 80.48 | 47.00 | 81.02 | 0.102
MM-DPCN | 94.87 | 91.98 | 95.17 | 0.015 | 82.98 | 48.93 | 57.86 | 0.016

Figure 5: Qualitative video sequence reconstruction for (a) Super Mario Bros and (b) Coil-100 data sets.

Table 3 presents the quantitative clustering and learning results, and Figure 5(b) showcases the qualitative video sequence reconstruction results. WTA-RNN-AE includes an additional RNN to learn video dynamics, which, however, comes at the cost of reconstruction quality. In contrast, the FISTA- and MM-DPCNs provide much better reconstruction because the recurrent models $A$ are linear and less susceptible to overfitting than an RNN, whereas WTA-RNN-AE tends to blend and blur different objects. Moreover, the efficiency of the iterative process enables MM to provide the best reconstruction quality. As shown in Table 3, WTA-RNN-AE has the highest SPA on Coil-100, since the sparsity level of its encodings is preset to $90\%$; this, however, results in worse ACC and ARI due to excessive loss of information. In contrast, MM and FISTA, which control how much information is compressed through the sparsity coefficients without resorting to nonlinear DL models, achieve much better ACC and ARI, with our MM-DPCN attaining the best ACC, ARI, and MSE.

During learning, the matrix inversion involves a conjugate gradient computation with complexity approximately $O(\sqrt{m}K^{2})$, where $m$ is the matrix condition number and $K$ is the state size. The memory complexity for storing matrices is $O(K^{2})$; this requirement grows as the state size increases and can lead to memory overhead when the vector size is too large. It can be mitigated by moderately increasing the number of patches or by enlarging hardware memory.
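To make the cost estimate concrete, here is a minimal conjugate gradient sketch for a symmetric positive-definite system $Mz=b$. It is a generic textbook version written for illustration, not the authors' implementation; the names `conjugate_gradient` and `dense_matvec` and the 2×2 example are assumptions. Each iteration needs one matrix-vector product, which costs $O(K^{2})$ for a dense $K\times K$ matrix, and the iteration count typically scales with the square root of the condition number.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=None):
    """Solve M z = b for a symmetric positive-definite M given only z -> M z."""
    n = len(b)
    max_iter = max_iter if max_iter is not None else 10 * n
    z = [0.0] * n
    r = list(b)  # residual b - M z for the initial guess z = 0
    p = list(r)  # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Mp = matvec(p)  # the only O(K^2) operation per iteration
        alpha = rs_old / sum(pi * mpi for pi, mpi in zip(p, Mp))
        z = [zi + alpha * pi for zi, pi in zip(z, p)]
        r = [ri - alpha * mpi for ri, mpi in zip(r, Mp)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return z


def dense_matvec(M):
    """Wrap a dense matrix (list of rows) as a matrix-vector product."""
    return lambda v: [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


# Hypothetical 2x2 example: 4 z1 + z2 = 1 and z1 + 3 z2 = 2.
M = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
z = conjugate_gradient(dense_matvec(M), b)
```

Passing the matrix as a function keeps the sketch usable when the product can be computed without storing the full matrix, which is one way to ease the $O(K^{2})$ memory requirement.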

6 Conclusion

We proposed MM-based DPCNs that circumvent the non-smooth optimization problem arising from the sparsity penalty in sparse coding by turning it into a smooth minimization problem through a majorizer of the sparsity penalty. The method searches for the optimal solution directly along the direction of the stationary point of the smoothed objective function. The experiments on image and video data sets demonstrated that this substantially improves the rate of convergence, computation time, and feature clustering performance.

Acknowledgments and Disclosure of Funding

This work is partially supported by the Office of the Under Secretary of Defense for Research and Engineering under awards N00014-21-1-2295 and N00014-21-1-2345.

References

  • [1] Rudolf E Kalman. On the general theory of control systems. In the 1st International Conference on Automatic Control, pages 481–492, 1960.
  • [2] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
  • [3] Bosen Lian, Frank L Lewis, Gary A Hewer, Katia Estabridis, and Tianyou Chai. Robustness analysis of distributed kalman filter for estimation in sensor networks. IEEE Transactions on Cybernetics, 52(11):12479–12490, 2021.
  • [4] Bosen Lian, Yan Wan, Ya Zhang, Mushuang Liu, Frank L Lewis, Alexandra Abad, Tina Setter, Dunham Short, and Tianyou Chai. Distributed consensus-based kalman filtering for estimation with multiple moving targets. In IEEE 58th Conference on Decision and Control, pages 3910–3915, 2019.
  • [5] Amir Parviz Valadbeigi, Ali Khaki Sedigh, and Frank L Lewis. $H_{\infty}$ static output-feedback control design for discrete-time systems using reinforcement learning. IEEE transactions on neural networks and learning systems, 31(2):396–406, 2020.
  • [6] Adam Charles, M Salman Asif, Justin Romberg, and Christopher Rozell. Sparsity penalties in dynamical system estimation. In the 45th IEEE conference on information sciences and systems, pages 1–6, 2011.
  • [7] Ashish Pal and Satish Nagarajaiah. Sparsity promoting algorithm for identification of nonlinear dynamic system based on unscented kalman filter using novel selective thresholding and penalty-based model selection. Mechanical Systems and Signal Processing, 212(111301):1–22, 2024.
  • [8] Tapio Schneider, Andrew M Stuart, and Jinlong Wu. Ensemble kalman inversion for sparse learning of dynamical systems from time-averaged data. Journal of Computational Physics, 470(111559):1–31, 2022.
  • [9] Fernando Ornelas-Tellez, J Jesus Rico-Melgoza, Angel E Villafuerte, and Febe J Zavala-Mendoza. Neural networks: A methodology for modeling and control design of dynamical systems. In Artificial neural networks for engineering applications, pages 21–38. Elsevier, 2019.
  • [10] Christian Legaard, Thomas Schranz, Gerald Schweiger, Ján Drgoňa, Basak Falay, Cláudio Gomes, Alexandros Iosifidis, Mahdi Abkar, and Peter Larsen. Constructing neural network based models for simulating dynamical systems. ACM Computing Surveys, 55(11):1–34, 2023.
  • [11] Kyriakos G Vamvoudakis and Frank L Lewis. Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 46(5):878–888, 2010.
  • [12] Shaowu Pan and Karthik Duraisamy. Long-time predictive modeling of nonlinear dynamical systems using neural networks. Complexity, 2018:1–26, 2018.
  • [13] Pawan Goyal and Peter Benner. Discovery of nonlinear dynamical systems using a runge–kutta inspired dictionary-based sparse regression approach. Proceedings of the Royal Society A, 478(20210883):1–24, 2022.
  • [14] Yingcheng Lai. Finding nonlinear system equations and complex network structures from data: A sparse optimization approach. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31(082101):1–12, 2021.
  • [15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision–ECCV 2016: 14th European Conference, pages 630–645, 2016.
  • [18] Pu Li and Wangda Zhao. Image fire detection algorithms based on convolutional neural networks. Case Studies in Thermal Engineering, 19:100625, 2020.
  • [19] Dolly Das, Saroj Kumar Biswas, and Sivaji Bandyopadhyay. Detection of diabetic retinopathy using convolutional neural networks for feature extraction and classification (drfec). Multimedia Tools and Applications, 82(19):29943–30001, 2023.
  • [20] Karl Friston. Hierarchical models in the brain. PLoS computational biology, 4(11):e1000211, 2008.
  • [21] Karl Friston and Stefan Kiebel. Predictive coding under the free-energy principle. Philosophical transactions of the Royal Society B: Biological sciences, 364(1521):1211–1221, 2009.
  • [22] Andre M Bastos, W Martin Usrey, Rick A Adams, George R Mangun, Pascal Fries, and Karl J Friston. Canonical microcircuits for predictive coding. Neuron, 76(4):695–711, 2012.
  • [23] Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1):79–87, 1999.
  • [24] Janneke FM Jehee, Constantin Rothkopf, Jeffrey M Beck, and Dana H Ballard. Learning receptive fields using predictive feedback. Journal of Physiology-Paris, 100(1-3):125–132, 2006.
  • [25] Kuan Han, Haiguang Wen, Yizhen Zhang, Di Fu, Eugenio Culurciello, and Zhongming Liu. Deep predictive coding network with local recurrent processing for object recognition. In the 32nd Conference on Neural Information Processing Systems, pages 1–13, 2018.
  • [26] Haiguang Wen, Kuan Han, Junxing Shi, Yizhen Zhang, Eugenio Culurciello, and Zhongming Liu. Deep predictive coding network for object recognition. In International conference on machine learning, pages 5266–5275. PMLR, 2018.
  • [27] Rakesh Chalasani and Jose C Principe. Context dependent encoding using convolutional dynamic networks. IEEE Transactions on Neural Networks and Learning Systems, 26(9):1992–2004, 2015.
  • [28] Isaac J Sledge and José C Príncipe. Faster convergence in deep-predictive-coding networks to learn deeper representations. IEEE Transactions on Neural Networks and Learning Systems, 34(8):5156–5170, 2021.
  • [29] Jamal Banzi, Isack Bulugu, and Zhongfu Ye. Learning a deep predictive coding network for a semi-supervised 3d-hand pose estimation. IEEE/CAA Journal of Automatica Sinica, 7(5):1371–1379, 2020.
  • [30] Jose C Principe and Rakesh Chalasani. Cognitive architectures for sensory processing. Proceedings of the IEEE, 102(4):514–525, 2014.
  • [31] Rakesh Chalasani and Jose C Principe. Deep predictive coding networks. arXiv preprint arXiv:1301.3541, 2013.
  • [32] Daniel J Felleman and David C Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991.
  • [33] Thomas Serre, Aude Oliva, and Tomaso Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the national academy of sciences, 104(15):6424–6429, 2007.
  • [34] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
  • [35] Jérôme Bolte and Edouard Pauwels. Majorization-minimization procedures and convergence of sqp methods for semi-algebraic and tame programs. Mathematics of Operations Research, 41(2):442–465, 2016.
  • [36] Frank L Lewis and Draguna Vrabie. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE circuits and systems magazine, 9(3):32–50, 2009.
  • [37] Ramzi Ben Mhenni, Sébastien Bourguignon, and Jordan Ninin. Global optimization for sparse solution of least squares problems. Optimization Methods and Software, 37(5):1740–1769, 2022.
  • [38] Yan Karklin and Michael S Lewicki. A hierarchical bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural computation, 17(2):397–423, 2005.
  • [39] Ivan Selesnick. Penalty and shrinkage functions for sparse signal processing. Connexions, 11(22):1–26, 2012.
  • [40] Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 57(11):1413–1457, 2004.
  • [41] Mário AT Figueiredo and Robert D Nowak. An em algorithm for wavelet-based image restoration. IEEE Transactions on Image Processing, 12(8):906–916, 2003.
  • [42] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [43] Xi Chen, Qihang Lin, Seyoung Kim, Jaime G Carbonell, and Eric P Xing. Smoothing proximal gradient method for general structured sparse regression. The ANNALS of Applied Statistics, 6(2):719–752, 2012.
  • [44] Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103:127–152, 2005.
  • [45] Mário AT Figueiredo, José M Bioucas-Dias, and Robert D Nowak. Majorization minimization algorithms for wavelet-based image restoration. IEEE Transactions on Image processing, 16(12):2980–2991, 2007.
  • [46] Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467, 2010.
  • [47] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
  • [48] Eder Santana, Matthew S Emigh, Pablo Zegers, and Jose C Principe. Exploiting spatio-temporal structure with recurrent winner-take-all networks. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3738–3746, 2017.
  • [49] OpenAI. Super mario bros environment for openai gym, 2017.
  • [50] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (coil-100). Technical Report CUCS-006-96, 1996.
  • [51] Hongming Li, Ran Dou, Andreas Keil, and Jose C Principe. A self-learning cognitive architecture exploiting causality from rewards. Neural Networks, 150:274–292, 2022.
  • [52] Zhenyu Qian, Yizhang Jiang, Zhou Hong, Lijun Huang, Fengda Li, Khin Wee Lai, and Kaijian Xia. Multiscale and auto-tuned semi-supervised deep subspace clustering and its application in brain tumor clustering. Computers, Materials & Continua, 79(3), 2024.
  • [53] Silvia Cascianelli, Gabriele Costante, Francesco Crocetti, Elisa Ricci, Paolo Valigi, and Mario Luca Fravolini. Data-based design of robust fault detection and isolation residuals via lasso optimization and bayesian filtering. Asian Journal of Control, 23(1):57–71, 2021.

Appendix A Appendix for Derivations

For the term $\|e_{t,n}\|_{1}$, where $e_{t,n}=x_{t,n}-Ax_{t-1,n}$, the smooth approximation is given by

$$\|e_{t,n}\|_{1}\approx f_{s}(e_{t,n})=\max_{\|\alpha\|_{\infty}\leq 1}\left(\alpha^{T}e_{t,n}-\frac{m}{2}\|\alpha\|_{2}^{2}\right). \tag{29}$$

The best approximation, as well as the maximum, is reached at $\alpha^{*}$ such that

$$\alpha^{*}=S\left(\frac{e_{t,n}}{m}\right)=\begin{cases}\frac{e_{t,n}}{m}, & -1\leq\frac{e_{t,n}}{m}\leq 1,\\ 1, & \frac{e_{t,n}}{m}>1,\\ -1, & \frac{e_{t,n}}{m}<-1,\end{cases} \tag{33}$$

where $S(\cdot)$ acts elementwise.
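The smoothing in (29) and its maximizer in (33) can be checked numerically with a short sketch. The helper names `clip_unit` and `smoothed_l1` and the sample values are illustrative assumptions. The sketch evaluates $f_{s}$ by plugging in $\alpha^{*}$ and compares it with the exact $\ell_{1}$ norm; on the unit infinity-norm ball, $\max\frac{1}{2}\|\alpha\|_{2}^{2}=K/2$ for a vector of length $K$, which gives the gap bound used in the example.

```python
def clip_unit(values):
    """Elementwise operator S in (33): clip each entry to [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in values]


def smoothed_l1(e, m):
    """Evaluate f_s(e) in (29) by plugging in the maximizer alpha* = S(e / m)."""
    alpha = clip_unit([ek / m for ek in e])
    return sum(a * ek - 0.5 * m * a * a for a, ek in zip(alpha, e))


# Hypothetical residual vector and smoothing parameter.
e = [0.05, -2.0, 0.7, 0.0]
m = 0.1
l1 = sum(abs(ek) for ek in e)   # exact l1 norm
fs = smoothed_l1(e, m)          # smooth approximation
gap_bound = m * len(e) / 2.0    # m * K / 2, the largest possible gap
```

Entries with $|e_{k}|\leq m$ contribute the quadratic $e_{k}^{2}/(2m)$ and larger entries contribute $|e_{k}|-m/2$, so a smaller $m$ tightens the approximation at the cost of a less smooth function.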

The majorizer of the sparsity penalty is given by

$$\mu\|x_{t,n}\|_{1}\leq h(x_{t,n},V_{x})=\sum_{k=1}^{K}h((x_{t,n})_{k},(V_{x})_{k}) \tag{34}$$

where

$$h((x_{t,n})_{k},(V_{x})_{k})=\frac{\phi^{\prime}((V_{x})_{k})}{2(V_{x})_{k}}(x_{t,n})_{k}^{2}+\phi((V_{x})_{k})-\frac{(V_{x})_{k}}{2}\phi^{\prime}((V_{x})_{k})\geq\mu|(x_{t,n})_{k}|,\quad\forall(x_{t,n})_{k}\in\mathbb{R}, \tag{35}$$

where $\phi((V_{x})_{k})=\mu|(V_{x})_{k}|$ and $V_{x}\in\mathbb{R}^{K}$ can be any vector. Equality holds when $|(V_{x})_{k}|=|(x_{t,n})_{k}|$ for all $k$, in particular at $V_{x}=x_{t,n}$. Rewriting the majorizer compactly gives (5), where $c=\sum_{k=1}^{K}\left[\phi((V_{x})_{k})-0.5(V_{x})_{k}\phi^{\prime}((V_{x})_{k})\right]$ is a constant independent of $x_{t,n}$. Accordingly, the constant $c$ in (9) is $c=\sum_{k=1}^{D}\left[\psi((V_{u})_{k})-0.5(V_{u})_{k}\psi^{\prime}((V_{u})_{k})\right]$ with $\psi((u_{t})_{k})=\beta|(u_{t})_{k}|$, where $V_{u}\in\mathbb{R}^{D}$ can be any vector.
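A quick numerical check of the scalar majorizer in (35) is sketched below. With $\phi(v)=\mu|v|$, the coefficients simplify to $\phi^{\prime}(v)/(2v)=\mu/(2|v|)$ and $\phi(v)-\frac{v}{2}\phi^{\prime}(v)=\mu|v|/2$, so the majorizer is $\mu\left(x^{2}/(2|v|)+|v|/2\right)$. The function name `majorizer_term` and the sample grid are illustrative assumptions.

```python
def majorizer_term(x, v, mu):
    """Scalar majorizer in (35) with phi(v) = mu * |v|, defined for v != 0."""
    if v == 0:
        raise ValueError("the majorizer is undefined at v = 0")
    # phi'(v) / (2 v) = mu / (2 |v|) and phi(v) - (v / 2) phi'(v) = mu |v| / 2.
    return mu * (x * x / (2.0 * abs(v)) + abs(v) / 2.0)


mu = 0.1
xs = (-2.0, -0.3, 0.0, 0.5, 1.7)
vs = (-1.5, -0.2, 0.4, 2.0)
# Gap between the majorizer and the penalty mu * |x| on a small grid.
gaps = [majorizer_term(x, v, mu) - mu * abs(x) for x in xs for v in vs]
# The gap closes when |v| = |x|.
tight_gap = majorizer_term(0.5, -0.5, mu) - mu * 0.5
```

The gap equals $\mu(|x|-|v|)^{2}/(2|v|)$, which is non-negative and vanishes only when $|v|=|x|$.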

The top-down prediction for layer $l$ from the upper layer $l+1$ is denoted by $\hat{u}_{t}^{l}$, which is given by

$$\hat{u}_{t}^{l}=C^{l+1}\hat{x}^{l+1}_{t}, \tag{36}$$

$$(\hat{x}^{l+1}_{t})_{k}=\begin{cases}(A^{l+1}x^{l+1}_{t-1})_{k}, & \lambda>\gamma\left(1+\exp(-(B^{l+1}u^{l+1}_{t})_{k})\right),\\ 0, & \lambda\leq\gamma\left(1+\exp(-(B^{l+1}u^{l+1}_{t})_{k})\right),\end{cases} \tag{39}$$

where $\lambda$ belongs to layer $l+1$. At the top layer $L$, we set $\hat{u}^{L}_{t}=u^{L}_{t-1}$, which induces some temporal coherence on the final outputs.
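The thresholded prediction in (36) and (39) can be sketched for a single time step as follows. This is an illustrative reading of the two equations, not the authors' code; the function names, the list-of-rows matrix layout, and all numeric values are assumptions.

```python
import math


def matvec(M, v):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


def top_down_prediction(A_up, B_up, C_up, x_prev_up, u_up, lam, gamma):
    """Return (u_hat, x_hat) following (36) and (39) for one time step."""
    Ax = matvec(A_up, x_prev_up)  # A^{l+1} x_{t-1}^{l+1}
    Bu = matvec(B_up, u_up)       # B^{l+1} u_t^{l+1}
    # Keep a predicted state entry only where lambda exceeds the threshold.
    x_hat = [
        ax if lam > gamma * (1.0 + math.exp(-bu)) else 0.0
        for ax, bu in zip(Ax, Bu)
    ]
    u_hat = matvec(C_up, x_hat)   # predicted causes for layer l
    return u_hat, x_hat


# Hypothetical upper layer with two states and one cause.
A_up = [[1.0, 0.0], [0.0, 1.0]]
B_up = [[1.0], [-1.0]]
C_up = [[2.0, 1.0]]
u_hat, x_hat = top_down_prediction(
    A_up, B_up, C_up, x_prev_up=[1.0, 3.0], u_up=[2.0], lam=0.8, gamma=0.5
)
```

In this toy case the second state entry is zeroed because its threshold $\gamma(1+e^{2})$ exceeds $\lambda$, so only the first entry contributes to the predicted cause.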

Appendix B Appendix for Proofs

We first show a necessary lemma before proving Theorem 1. Since $V_{x}$ in (5) represents any vector with the same dimension as $x_{t}$, for simplicity we write $V$ for $V_{x}$ in the following analysis regarding state inference. Likewise, we write $V$ for the $V_{u}$ that appears in (9) in the analysis regarding cause inference.

Lemma 1

Let $V\in\mathbb{R}^{K}$ satisfy

$$F(P(V))\leq H_{x}(P(V),V). \tag{40}$$

For any $x_{t}\in\mathbb{R}^{K}$ one has

$$F(x_{t})-F(P(V))\geq\sum_{k=1}^{K}-\frac{(|(x_{t})_{k}|-|(V)_{k}|)^{2}}{2|(V)_{k}|}. \tag{41}$$

Proof: Recalling the majorizer for states, i.e., $h(x_{t},V)$ in (5), it can be deduced from (22)-(23) that $P(V)$ satisfies

$$\nabla f(P(V))+\nabla_{x_{t}}h(P(V),V)=0. \tag{42}$$

Then, we know from (12) that

$$x_{t}^{i+1}=P(x_{t}^{i}). \tag{43}$$

It follows from (5) that (40) holds. Since $f(x_{t})$ and $h(x_{t},V)$ are convex in $x_{t}$, we have

$$f(x_{t})-f(P(V))\geq\langle x_{t}-P(V),\nabla f(P(V))\rangle, \tag{44}$$

$$h(x_{t},V)-h(P(V),V)\geq\langle x_{t}-P(V),\nabla_{x_{t}}h(P(V),V)\rangle. \tag{45}$$

Hence, with (40), (21) and (22), we have

$$\begin{aligned}
F(x_{t})-F(P(V)) &\geq F(x_{t})-H_{x}(P(V),V)\\
&=f(x_{t})+g(x_{t})-f(P(V))-h(P(V),V)\\
&\geq\langle x_{t}-P(V),\nabla f(P(V))\rangle+h(x_{t},x_{t})-h(P(V),V)\\
&=\langle x_{t}-P(V),\nabla f(P(V))\rangle+h(x_{t},V)-h(P(V),V)+h(x_{t},x_{t})-h(x_{t},V)\\
&\geq\langle x_{t}-P(V),\nabla f(P(V))\rangle+\langle x_{t}-P(V),\nabla_{x_{t}}h(P(V),V)\rangle+h(x_{t},x_{t})-h(x_{t},V)\\
&=h(x_{t},x_{t})-h(x_{t},V).
\end{aligned} \tag{46}$$

Note that the first inequality applies (40), the second inequality applies (44) together with $g(x_{t})=h(x_{t},x_{t})$, the third inequality applies (45), and the final equality applies (42).

It follows from (5) and Appendix A that

$$\begin{aligned}
h(x_{t},x_{t})-h(x_{t},V) &=\sum_{k=1}^{K}\left[(x_{t})_{k}\,\text{sign}((x_{t})_{k})-\frac{\text{sign}((V)_{k})}{2(V)_{k}}\left((x_{t})_{k}^{2}+(V)_{k}^{2}\right)\right]\\
&=\sum_{k=1}^{K}\left[|(x_{t})_{k}|-\frac{(x_{t})_{k}^{2}+(V)_{k}^{2}}{2|(V)_{k}|}\right]\\
&=\sum_{k=1}^{K}-\frac{(|(x_{t})_{k}|-|(V)_{k}|)^{2}}{2|(V)_{k}|}\leq 0.
\end{aligned} \tag{47}$$

Substituting it into (46) yields (41). This completes the proof.

Proof of Theorem 1

It can be inferred from the derivations that

$$F(x_{t}^{i})=H_{x}(x_{t}^{i},x_{t}^{i})\leq H_{x}(x_{t}^{i},x_{t}^{i-1})\leq H_{x}(x_{t}^{i-1},x_{t}^{i-1})=F(x_{t}^{i-1}) \tag{48}$$

where equality in the two inequalities holds only at $x_{t}^{i}=x_{t}^{i-1}$, i.e., when $x_{t}^{i}$ satisfies the optimality condition (7). That is, $F(x_{t}^{i})$ decreases monotonically until $x_{t}^{i}$ satisfies the optimality condition. Moreover, it follows from the approximation shown in (4) that the approximation gap is

$$\|e_{t,n}\|_{1}-m\bar{D}\leq f_{s}(e_{t,n})\leq\|e_{t,n}\|_{1} \tag{49}$$

where $\bar{D}=\max_{\|\alpha\|_{\infty}\leq 1}\frac{1}{2}\|\alpha\|_{2}^{2}$. This indicates that $F(x_{t})$ is bounded such that

$$E_{x}(x_{t})-\lambda m\bar{D}\leq F(x_{t})\leq E_{x}(x_{t}) \tag{50}$$

where $E_{x}(x_{t})\geq 0$. Therefore, $F(x_{t}^{i})$ is monotonically convergent, with bounds expressed in terms of $E_{x}(x_{t}^{i})$.

By taking $x_{t}=x_{t}^{*}$, $P(V)=x_{t}^{i+1}$, and $V=x_{t}^{i}$ in Lemma 1, we can write

$$F(x_{t}^{*})-F(x_{t}^{i+1})\geq\sum_{k=1}^{K}-\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{2|(x_{t}^{i})_{k}|}. \tag{51}$$

Summing it over $s$ iterations yields

$$sF(x_{t}^{*})-\sum_{i=1}^{s}F(x_{t}^{i})\geq\sum_{i=0}^{s-1}\sum_{k=1}^{K}-\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{2|(x_{t}^{i})_{k}|}. \tag{52}$$

Subtracting $sF(x_{t}^{s})$ from both sides yields

$$sF(x_{t}^{*})-sF(x_{t}^{s})\geq\sum_{i=0}^{s-1}\sum_{k=1}^{K}-\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{2|(x_{t}^{i})_{k}|}+\sum_{i=1}^{s}\left(F(x_{t}^{i})-F(x_{t}^{s})\right). \tag{53}$$

From (48) we infer that $\sum_{i=1}^{s}\left(F(x_{t}^{i})-F(x_{t}^{s})\right)\geq 0$. Therefore, (53) becomes

$$F(x_{t}^{s})-F(x_{t}^{*})\leq\frac{1}{2s}\sum_{i=0}^{s-1}\sum_{k=1}^{K}\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{|(x_{t}^{i})_{k}|}. \tag{54}$$

Let $x_{t}^{*}$ be the optimal sparse solution satisfying (7). Since $F(x_{t}^{i})$ decreases monotonically to $F(x_{t}^{*})$, as does the sequence $R_{x}^{i}$ in (13), $|x_{t}^{i}|$ approaches $|x_{t}^{*}|$ monotonically. The sign of the initial $x_{t}^{0}$ does not influence the result because $|x_{t}^{0}|$ is used: the update drives a positive $x_{t}^{0}$ to a non-negative $x_{t}^{*}$ and, similarly, a negative $x_{t}^{0}$ to a non-positive $x_{t}^{*}$. Note that we never choose $x_{t}^{0}=0$. Therefore, for an optimal value $(x_{t}^{*})_{k}=0$, one has

$$\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{|(x_{t}^{i})_{k}|}\leq|(x_{t}^{i})_{k}|. \tag{55}$$

For an optimal value with $0<|(x_{t}^{0})_{k}|\leq|(x_{t}^{*})_{k}|$, one has

$$\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{|(x_{t}^{i})_{k}|}\leq\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{|(x_{t}^{0})_{k}|}. \tag{56}$$

For an optimal value with $0<|(x_{t}^{*})_{k}|<|(x_{t}^{0})_{k}|$, one has

$$\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{|(x_{t}^{i})_{k}|}\leq\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{|(x_{t}^{*})_{k}|}. \tag{57}$$

Using (55)-(57) in (54) for all $x_{t}^{*}\in\mathbb{R}^{K}$, we write

$$\begin{aligned}
F(x_{t}^{s})-F(x_{t}^{*}) &\leq\frac{1}{2s}\sum_{i=0}^{s-1}\sum_{k=1}^{K}\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{\tilde{1}|(x_{t}^{0})_{k}|+(1-\tilde{1}-\bar{1})|(x_{t}^{*})_{k}|+\bar{1}|(x_{t}^{i})_{k}|}\\
&=\frac{1}{2s}\sum_{i=0}^{s-1}(|x_{t}^{*}|-|x_{t}^{i}|)^{T}R(|x_{t}^{*}|-|x_{t}^{i}|)
\end{aligned} \tag{58}$$

where $R=\text{diag}\{1/(\tilde{1}|(x_{t}^{0})_{k}|+(1-\tilde{1}-\bar{1})|(x_{t}^{*})_{k}|+\bar{1}|(x_{t}^{i})_{k}|)\}$, $k\in\{1,2,\ldots,K\}$, with $\tilde{1}=1$ if $|(x_{t}^{*})_{k}|\geq|(x_{t}^{0})_{k}|>0$, $\tilde{1}=0$ if $0\leq|(x_{t}^{*})_{k}|<|(x_{t}^{0})_{k}|$, $\bar{1}=1$ if $|(x_{t}^{*})_{k}|=0$, and $\bar{1}=0$ otherwise. It can be inferred from the uniqueness of $x_{t}^{*}$ and the monotonic convergence of $F(x_{t}^{i})$ that the upper bound in (58) decreases with the number of iterations $s$. This completes the proof.
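The monotone descent in (48) can be illustrated on a deliberately simplified scalar problem. The sketch below is an assumption-laden toy and not the paper's Algorithm 1: it drops the smoothed dynamics term and minimizes $F(x)=\frac{1}{2}(y-x)^{2}+\mu|x|$ by repeatedly minimizing the quadratic majorizer from (35) with $V=x^{i}$, which gives the closed-form update $x^{i+1}=y|x^{i}|/(|x^{i}|+\mu)$. The function names and numeric values are hypothetical.

```python
def objective(x, y, mu):
    """Separable toy objective F(x) = 0.5 * (y - x)^2 + mu * |x|."""
    return 0.5 * (y - x) ** 2 + mu * abs(x)


def mm_iterations(y, mu, x0, steps):
    """Minimize the quadratic majorizer at each step; x0 must be nonzero."""
    if x0 == 0:
        raise ValueError("x0 must be nonzero")
    xs = [x0]
    for _ in range(steps):
        v = abs(xs[-1])
        # argmin_x 0.5 * (y - x)^2 + mu * x^2 / (2 v)  equals  y * v / (v + mu)
        xs.append(y * v / (v + mu))
    return xs


def soft_threshold(y, mu):
    """Closed-form minimizer of the toy objective."""
    return (1.0 if y > 0 else -1.0) * max(abs(y) - mu, 0.0)


# One case with a nonzero optimum and one where the optimum is exactly zero.
dense_path = mm_iterations(y=3.0, mu=1.0, x0=0.5, steps=60)
sparse_path = mm_iterations(y=0.5, mu=1.0, x0=0.5, steps=60)
dense_values = [objective(x, 3.0, 1.0) for x in dense_path]
sparse_values = [objective(x, 0.5, 1.0) for x in sparse_path]
```

In both cases the objective values are non-increasing and the iterates approach the soft-thresholded solution, with the second case shrinking toward zero without ever being initialized there, consistent with the remark that $x_{t}^{0}=0$ is never chosen.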

Proof of Theorem 2

We write $E_{x}(x_{t}^{s})-E_{x}(x_{t}^{*})$ in three pairs as

$$E_{x}(x_{t}^{s})-E_{x}(x_{t}^{*})=E_{x}(x_{t}^{s})-F(x_{t}^{s})+F(x_{t}^{s})-F(x_{t}^{*})+F(x_{t}^{*})-E_{x}(x_{t}^{*}). \tag{59}$$

The first and third pairs in (59), i.e., $E_{x}(x_{t}^{s})-F(x_{t}^{s})$ and $F(x_{t}^{*})-E_{x}(x_{t}^{*})$, are bounded by the approximation gap shown in (50), namely

$$E_{x}(x_{t}^{s})-\lambda m\bar{D}\leq F(x_{t}^{s})\leq E_{x}(x_{t}^{s}), \tag{60}$$

$$E_{x}(x_{t}^{*})-\lambda m\bar{D}\leq F(x_{t}^{*})\leq E_{x}(x_{t}^{*}). \tag{61}$$

That is, $E_{x}(x_{t}^{s})-F(x_{t}^{s})$ is upper-bounded by $\lambda m\bar{D}$, and $F(x_{t}^{*})-E_{x}(x_{t}^{*})$ is upper-bounded by $0$. From Theorem 1, the second pair $F(x_{t}^{s})-F(x_{t}^{*})$ is bounded by (24). Therefore, we can conclude (25). This completes the proof.

Proof of Theorem 3

It is seen from (14) that $u_{t}^{j+1}>0$ given $u_{t}^{j}>0$ with a normalized matrix $B$. Also, we observe a trade-off between the effects on the update of $|u_{t}^{j}|$ and $e^{-Bu_{t}^{j}}$: as one moves away from zero, the other approaches zero. Based on the fact that $\lim_{u_{t}^{j}\rightarrow 0}u_{t}^{j}.(\bar{B}e^{-Bu_{t}^{j}})=0$ and $\lim_{u_{t}^{j}\rightarrow\infty}u_{t}^{j}.(\bar{B}e^{-Bu_{t}^{j}})=0$, where $\bar{B}$ is a constant matrix with non-negative elements, we can infer that the update (14) does not diverge and that the updated $u_{t}^{j+1}$ has an upper bound. Recalling Algorithm 2 and the condition (11), we can write (14) as

$$\begin{aligned}
u_{t}^{j+1}-u_{t}^{j} &=R_{u}^{j}(-\nabla f_{u}(u_{t}^{j}))-u_{t}^{j}\\
&=-R_{u}^{j}(\nabla f_{u}(u_{t}^{j})+(R_{u}^{j})^{-1}u_{t}^{j})\\
&=-R_{u}^{j}(\nabla f_{u}(u_{t}^{j})+\nabla_{u_{t}}h_{u}(u_{t}^{j},u_{t}^{j}))\\
&=-R_{u}^{j}\nabla_{u_{t}}H_{u}(u_{t}^{j},u_{t}^{j}).
\end{aligned} \tag{62}$$

It follows from (15) that $R_{u}^{j}>0$ is a diagonal matrix during the learning. Therefore, the update law in Algorithm 2 for causes admits a gradient descent form with a positive-definite diagonal matrix as the step size. The learning stops when $R_{u}^{j}=0$, i.e., $u_{t}^{j}=0$, and $H_{u}(u_{t}^{j},\cdot)$ is minimized. That is, the method learns until $u_{t}$ becomes sparse and the optimality condition (11) is met. Taking the Taylor expansion of $H_{u}(u_{t}^{j+1},\cdot)$ up to first order, we have

$$\begin{aligned}
H_{u}(u_{t}^{j+1},u_{t}^{j}) &=H_{u}(u_{t}^{j},u_{t}^{j})+(\nabla_{u_{t}}H_{u}(u_{t}^{j},u_{t}^{j}))^{T}(u_{t}^{j+1}-u_{t}^{j})\\
&=H_{u}(u_{t}^{j},u_{t}^{j})-(\nabla_{u_{t}}H_{u}(u_{t}^{j},u_{t}^{j}))^{T}R_{u}^{j}(\nabla_{u_{t}}H_{u}(u_{t}^{j},u_{t}^{j}))\\
&\leq H_{u}(u_{t}^{j},u_{t}^{j}).
\end{aligned} \tag{63}$$

Combining it with (26)-(27) yields

$$E_{u}(u_{t}^{j+1})=H_{u}(u_{t}^{j+1},u_{t}^{j+1})\leq H_{u}(u_{t}^{j+1},u_{t}^{j})\leq H_{u}(u_{t}^{j},u_{t}^{j})=E_{u}(u_{t}^{j}) \tag{64}$$

with equality at $u_{t}^{j+1}=u_{t}^{j}$. It can be concluded that the function $E_{u}$ decreases when Algorithm 2 is used for cause inference. This convergence is also verified by the experiments.

Lemma 1 still holds if we replace $f(x_{t}),h(x_{t},V),F,H_{x}$ with $g(u_{t}),h(u_{t},V),E_{u},H_{u}$, respectively. Let $V=u_{t}^{j}$ and $P(V)=u_{t}^{j+1}$, and let $u_{t}^{*}$ be the optimal solution satisfying (11). Following Theorem 1 we have

$$E_{u}(u_{t}^{s})-E_{u}(u_{t}^{*})\leq\frac{1}{2s}\sum_{j=0}^{s-1}\sum_{k=1}^{D}\frac{(|(u_{t}^{*})_{k}|-|(u_{t}^{j})_{k}|)^{2}}{|(u_{t}^{j})_{k}|}, \tag{65}$$

and consequently (28). This completes the proof.

Appendix C Appendix for more results and AE architecture details

We used a simple data set of moving geometric shapes to further demonstrate the video clustering mechanism of MM-DPCN. Each video contains three geometric shapes: diamond, triangle, and square. Each shape appears consistently for 100 frames until another shape shows up. A shape can appear in any patch of the image frame and moves during the 100 frames in which it is shown.

Figure 6: Clustering results for the moving geometric shape data set. The first row is the cause vector plot for one video. The three shapes are perfectly orthogonalized and assigned to the correct clusters. The third row shows examples of reconstruction.

To visualize the learned filters, the plots for matrix $C^{1}$ are provided in Fig. 7.

Figure 7: Learned filters $C^{1}$ on the moving geometric shape data set.

The architectures for AE and WTA-RNN-AE used for the comparison results are provided in Table 4. We use the same architectures for both Mario and Coil-100 data sets.

Table 4: AE and WTA-RNN-AE architectures.
| layer name | AE | WTA-RNN-AE |
| --- | --- | --- |
| encoder_layer1 | [3, 128] | [3, 256] |
| encoder_layer2 | [128, 64] | [256, 128] |
| encoder_layer3 | [64, 36] | [128, 64] |
| encoder_layer4 | [36, 18] | * |
| encoder_layer5 | [18, 9] | * |
| RNN | * | [64, 64] × 5 |