Fast Deep Predictive Coding Networks for Videos Feature Extraction without Labels

Wenqian Xue, Chi Ding, Jose Principe
Department of Electrical & Computer Engineering
University of Florida
Gainesville, FL 32611
w.xue@ufl.edu, ding.chi@ufl.edu, principe@cnel.ufl.edu

Abstract

Brain-inspired deep predictive coding networks (DPCNs) effectively model and capture video features through a bi-directional information flow, even without labels. They are based on an overcomplete description of video scenes, and one of the bottlenecks has been the lack of effective sparsification techniques to find discriminative and robust dictionaries. FISTA has been the best alternative. This paper proposes a DPCN with fast inference of internal model variables (states and causes) that achieves high sparsity and high feature clustering accuracy. The proposed unsupervised learning procedure, inspired by adaptive dynamic programming with a majorization-minimization framework, is presented and its convergence is rigorously analyzed. Experiments on the CIFAR-10, Super Mario Bros video game, and Coil-100 data sets validate the approach, which outperforms previous versions of DPCNs on learning rate, sparsity ratio, and feature clustering accuracy. Because of the DPCN’s solid foundation and explainability, this advance opens the door to general applications in object recognition in video without labels.

1 Introduction

Sparse modeling is important for systems with a plethora of parameters and variables, as it selectively activates only a small subset of the variables or coefficients while maintaining representation accuracy and computational efficiency. This not only reduces the data and storage needed to represent a dynamic system but also gives more concise and easier access to the contained information in areas including control, signal processing, and sensory compression.

In the control theory sense, a model for a dynamic process is often described by the equations

\left\{\begin{array}[]{l}{y_{t}}={G_{t}}(x_{t})+n_{t}\\ x_{t}={F_{t}}(x_{t-1},u_{t})+w_{t}\end{array}\right.

where $y_{t}$ is a set of measurements associated with a changing state $x_{t}$ through a mapping function $G_{t}$; the state $x_{t}$, also known as the signal of interest, is produced from the past state $x_{t-1}$ and an input $u_{t}$ through an evolution function $F_{t}$; $n_{t}$ is the measurement noise and $w_{t}$ is the modeling error. Given measurements $y_{t}$ and input $u_{t}$, the Kalman filter [1, 2] has emerged as a widely employed technique for estimating states [3, 4] and mapping functions using neural networks [5] in a sparse way [6, 7, 8]. However, it is typically constrained to estimate one variable, namely the state. Can both state and input variables be estimated? For many dynamic plants characterized by natural and complex signals, latent variables often exhibit residual dependencies as well as non-stationary statistical properties. Can data with non-stationary statistics be well represented? Additionally, (deep) neural networks (NNs) [9, 10, 11, 5] with multi-layered structures are extensively used for sparse modeling of dynamic systems [12, 13, 14]. Similarly structured, convolutional NNs have demonstrated significant success in tasks such as target detection and feature classification in computer vision and control applications [15, 16, 17, 18, 19]. However, these methods are widely regarded as hard to interpret mathematically, and the NN architecture is a feedforward pass through stacks of convolutional layers. As studied in [20], the brain uses a bi-directional information pathway, including not only feedforward but also feedback and recurrent passing, for effective visual perception. Can dynamics be represented in an interpretable way with bi-directional connections and interactions?

These goals can be achieved by the hierarchical predictive coding networks [21, 22, 23, 24], also known as deep-predictive-coding networks (DPCNs) [25, 26, 27, 28, 29, 30, 31], where, inspired by [20], a hierarchical generative model is formulated as

\left\{\begin{array}[]{l}{y_{t}^{l}}={G_{t}}(x_{t}^{l})+n_{t}^{l}\\ x_{t}^{l}={F_{t}}(x_{t-1}^{l},u_{t}^{l})+w_{t}^{l}\end{array}\right.

where $l$ denotes layers. Measurements for layer $l$ are the causes of the lower layer, i.e., $y_{t}^{l}=u_{t}^{(l-1)}$ for $l>1$ . The causes link the layers, and the states link the dynamics over time $t$ . The model admits a bi-directional information flow [32, 30], including feedforward, feedback, and recurrent connections. That is, measurements travel through a bottom-up pathway from lower to higher visual areas (for rapid object recognition) and simultaneously a top-down pathway running in the opposite direction (to enhance the recognition) [33]. The previous DPCNs either use linear filters for sound [25, 26] or convolutions to better preserve neighborhoods in images [27, 28]. With fovea vision, non-convolutional DPCNs may offer a more automated and straightforward implementation [31, 30]. In both types of DPCNs, the proximal gradient descent methods, such as fast iterative shrinkage-thresholding algorithm (FISTA) [34], are frequently used for variable and model inferences in [27, 31, 30] for accelerated inference. Can the DPCNs inference be faster while maintaining high sparsity?

This paper answers these questions by studying a vector DPCN with an improved inference procedure for both variables and models (dictionaries) that is applicable to the two types, and that is tested as a proof of concept for modeling and capturing objects in videos. Given measurements from the real world, the DPCNs infer model parameters and variables through feedforward, feedback, and recurrent connections represented by optimization problems with sparsity penalties. Inspired by majorization minimization (MM) [35] and the value iteration of reinforcement learning (RL) [36], this paper proposes an MM-based unsupervised learning procedure to enhance the inference of DPCNs by introducing a majorizer of the sparsity penalty. The resulting networks are called MM-DPCNs and offer the following advantages:

• The learning procedure does not need labels and offers accelerated inference.
• The inference results guarantee sparsity of variables and representation accuracy of features.
• Rigorous proofs show convergence and interpretability.
• Experiments validate the higher performance of MM-DPCNs versus previous DPCNs on learning rate, sparsity ratio, and feature clustering accuracy.

2 Dynamic Networks for DPCNs

Table 1: Notation.

$y_{t,n}^{1}$	$n$ -th patch of video frame at time $t$
$y_{t,n}^{l}$ , $l>1$	the causes from layer $l-1$
$x_{t,n}^{l}$	state at layer $l$ for $y_{t,n}$
$u_{t}^{l}$	cause at layer $l$ for a group of $x_{t,n}^{l}$
$A^{l},B^{l},C^{l}$	model parameters at layer $l$

Based on the hierarchical generative model [20, 31] briefly reviewed in the Introduction, we now review the dynamic networks for DPCNs [31, 30] in terms of sparse optimization problems for sparse model and feature extraction of videos.

Figure 1: Two-layered DPCN structure. The video frame is decomposed into patches (green blocks). Every patch is mapped onto a state $x_{t}^{1}$ at layer 1, and the cause $u_{t}^{1}$ pools all the states within a group. The cause $u_{t}^{1}$ is the input of layer 2 and corresponds to state $x_{t}^{2}$ and cause $u_{t}^{2}$.

The structure of the DPCN is shown in Fig. 1, and the notation is shown in Table 1. Given a video input, each video frame is decomposed into multiple spatially contiguous patches, denoted by $y_{t,n}\in\mathbb{R}^{P}$, $n\in\mathcal{N}=\{1,2,\cdots,N\}$, each a vectorized form of a $\sqrt{P}\times\sqrt{P}$ square patch. These measurements are fed into the DPCN, which has a hierarchical multi-layered structure. From the second layer on, the causes from a lower layer serve as the input of the next layer, i.e., $y_{t,n}^{l}=u_{t,n}^{l-1}$. At every layer, the network consists of two distinct parts: feature extraction (inferring states) and pooling (inferring causes). The parameters that connect states and causes are called the model (dictionary) and are inferred along with the states and causes (inferring the model). The networks and connections at each layer $l$ are given in terms of objective functions for the inferences. In the following, we omit the layer superscript $l$ for simplicity.

To infer states given a patch measurement $y_{t,n}$, a linear state-space model with an over-complete dictionary of $K$ filters, i.e., $C\in\mathbb{R}^{P\times K}$ with $P<K$, is used to obtain sparse states $x_{t,n}\in\mathbb{R}^{K}$. In addition, a state-transition matrix $A\in\mathbb{R}^{K\times K}$ is applied to keep track of the dynamics of past sparse states. To this end, the objective function for states is given by

\displaystyle E_{x}(x_{t})=\sum_{n=1}^{N}\frac{1}{2}\|y_{t,n}-Cx_{t,n}\|_{2}^{2}+\mu\|x_{t,n}\|_{1}+\lambda\|x_{t,n}-Ax_{t-1,n}\|_{1},

(1)

where $\lambda>0$ and $\mu>0$ are weighting parameters, $\|\cdot\|_{2}^{2}$ is the squared $L_{2}$-norm denoting energy, and $\|\cdot\|_{1}$ is the $L_{1}$-norm serving as the penalty term that makes the solution sparse [37].
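As an illustrative sketch (not the authors' implementation), the state objective (1) can be evaluated for one frame as follows. The function name `state_energy` and the array layout (patches stacked as rows) are assumptions made here for illustration, and the sum over $n$ is taken to cover all three terms.

```python
import numpy as np

def state_energy(Y, X, X_prev, C, A, mu, lam):
    """Evaluate E_x in (1) for one frame (hypothetical helper).

    Y: (N, P) patches, X: (N, K) states, X_prev: (N, K) previous states,
    C: (P, K) dictionary, A: (K, K) state-transition matrix.
    """
    recon = 0.5 * np.sum((Y - X @ C.T) ** 2)            # sum_n 0.5 * ||y_n - C x_n||_2^2
    sparsity = mu * np.sum(np.abs(X))                   # sum_n mu * ||x_n||_1
    dynamics = lam * np.sum(np.abs(X - X_prev @ A.T))   # sum_n lam * ||x_n - A x_{t-1,n}||_1
    return recon + sparsity + dynamics

# Toy data: one active component per patch and a noiseless measurement.
rng = np.random.default_rng(0)
N, P, K = 4, 16, 24
C = rng.standard_normal((P, K))
A = np.eye(K)
X_prev = np.zeros((N, K))
X = np.zeros((N, K))
X[:, 0] = 1.0
Y = X @ C.T
energy = state_energy(Y, X, X_prev, C, A, mu=0.3, lam=0.1)
```

In this toy case the reconstruction term vanishes, so the energy consists only of the two $L_{1}$ terms.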

For inferring causes given states, $u_{t}\in\mathbb{R}^{D}$ multiplicatively interacts with the accumulated states through $B\in\mathbb{R}^{K\times D}$ in the way that whenever a component in $u_{t}$ is active, the corresponding set of components in $x_{t}$ are also likely to be active. This is for significant clustering of features even with non-stationary distribution of states [38]. To this end, the objective function for causes is given by

\displaystyle E_{u}(u_{t})=\sum_{n=1}^{N}\sum_{k=1}^{K}\gamma|(x_{t,n})_{k}|(1+exp(-(Bu_{t})_{k}))+\beta\|u_{t}\|_{1}

(2)

where $\gamma>0$ and $\beta>0$ are weighting parameters.
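A minimal sketch of the cause objective (2) is given below, assuming states are stacked as rows of an array; the helper name `cause_energy` and the toy matrices are hypothetical. It also shows the intended effect: activating a cause that is connected to active states lowers the first term.

```python
import numpy as np

def cause_energy(X, u, B, gamma, beta):
    """Evaluate E_u in (2) (hypothetical helper). X: (N, K), u: (D,), B: (K, D)."""
    abs_states = gamma * np.sum(np.abs(X), axis=0)   # gamma * sum_n |x_{t,n}|, shape (K,)
    gain = 1.0 + np.exp(-(B @ u))                    # 1 + exp(-(B u)_k)
    return abs_states @ gain + beta * np.sum(np.abs(u))

X = np.ones((2, 3))                                  # two patches, three state components
B = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # cause 0 is tied to state components 0 and 1
e_inactive = cause_energy(X, np.zeros(2), B, gamma=0.5, beta=0.3)
e_active = cause_energy(X, np.array([1.0, 0.0]), B, gamma=0.5, beta=0.3)
```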

For inferring model $\theta=\{A,B,C\}$ given states and causes, the overall objective function is given by

\displaystyle E_{p}(x_{t},u_{t},\theta)=E_{x}(x_{t})+E_{u}(u_{t}).

(3)

Notably, minimizing $E_{x}$ and $E_{u}$ are strongly convex problems, and we will design a learning method to find the unique optimal sparse solution.

3 Learning For Model Inference and Variable Inference

In this section, we propose an unsupervised learning method for self-organizing models and variables with accelerated learning while maintaining high sparsity and accuracy of feature extraction. The flow and connections for the inference are shown in Fig. 2. The inference process includes Model Inference and Variable Inference. The model inference requires repeated interleaved updates of states and causes, followed by updates of the model. Then, given a model, the variable inference requires interleaved updates of states and causes using an extra top-down preference from the upper layer. These form a bi-directional inference process on a bottom-up feedforward path, a top-down feedback path, and a recurrent path.

For the updates of states and causes involved in the Model Inference and Variable Inference, we propose a new learning procedure using the majorization minimization (MM) framework [39, 35] for optimization with a sparsity constraint. Unlike the frequently used proximal gradient descent methods, namely the iterative shrinkage-thresholding algorithm (ISTA) and fast ISTA (FISTA) [34, 40, 41], which use a majorizer for the differentiable non-sparsity-penalty terms [31], this paper uses a majorizer for the sparsity penalty. As such, the convex non-differentiable optimization problem with a sparsity constraint is transformed into a convex and differentiable problem. Moreover, taking advantage of the over-complete dictionary and an iteration form inspired by the value iteration of RL [36], the iterations for inference are derived from the condition for the optimal sparse solution to the MM-based optimization problems. This also differs from the traditional gradient descent method and the adaptive moment estimation (ADAM) [42] method for solving optimization problems.

3.1 MM-Based Model Inference

Model inference seeks $\theta=\{A,B,C\}$ by minimizing $E_{p}(x_{t},u_{t},\theta)$ in (3) with an interleaved procedure to infer states and causes by minimizing $E_{x}$ (1) and $E_{u}$ (2).

State Inference

To infer sparse $x_{t,n}$ by minimizing $E_{x}$ (1), we first let $e_{t,n}=x_{t,n}-Ax_{t-1,n}$ and use Nesterov’s smooth approximator [43, 44], which takes the form

\displaystyle\|e\|_{1}\approx f_{s}(e)\triangleq(\alpha^{*})^{T}e-\frac{m}{2}\|\alpha^{*}\|_{2}^{2}

(4)

where $m>0$ is a constant and $\alpha^{*}$ is some vector reaching the best approximation. Then, we find a majorizer for the penalty term $\mu\|x_{t,n}\|_{1}$ [39] in the form

\displaystyle\mu\|x\|_{1}\leq h(x,V_{x})\triangleq\frac{1}{2}x^{T}W_{x}x+c

(5)

with equality at $x=V_{x}$, where $V_{x}$ is a vector, $W_{x}=\text{diag}(\mu./|{V_{x}}|)$ with $./$ denoting component-wise division, and $c$ is a constant independent of $x$ (see details in Appendix A).
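The approximator (4) and the majorizer (5) can be checked numerically with the sketch below. Two assumptions are made here and are not taken from the text: $\alpha^{*}$ is computed in the closed form that is standard for Nesterov smoothing of the $L_{1}$-norm (clipping $e/m$ to $[-1,1]$), and the constant is taken as $c=\frac{\mu}{2}\|V_{x}\|_{1}$, which follows from $(|x_{k}|-|v_{k}|)^{2}\geq 0$. The function names are hypothetical.

```python
import numpy as np

def smooth_l1(e, m):
    """Smoothed L1 norm f_s(e) as in (4), with alpha* = clip(e / m, -1, 1) (assumed closed form)."""
    alpha = np.clip(e / m, -1.0, 1.0)
    return alpha @ e - 0.5 * m * (alpha @ alpha), alpha

def l1_majorizer(x, v, mu):
    """Quadratic majorizer h(x, V_x) as in (5); v must have no zero entries."""
    w = mu / np.abs(v)                     # diagonal of W_x
    c = 0.5 * mu * np.sum(np.abs(v))       # assumed constant, independent of x
    return 0.5 * np.sum(w * x * x) + c

rng = np.random.default_rng(0)
e = rng.standard_normal(10)
x = rng.standard_normal(10)
v = rng.standard_normal(10)
m, mu = 0.1, 0.3

f_val, alpha_star = smooth_l1(e, m)
gap = np.sum(np.abs(e)) - f_val            # approximation error of (4)
h_at_x = l1_majorizer(x, v, mu)            # upper bound on mu * ||x||_1
h_at_v = l1_majorizer(v, v, mu)            # equals mu * ||v||_1
```

Under these assumptions the approximation gap lies between 0 and $mK/2$, and the majorizer touches $\mu\|x\|_{1}$ exactly at $x=V_{x}$.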

Applying the approximator (4), majorizer (5) and MM principles, the minimization problem of $E_{x}$ (1) can be transformed to the minimization of

\displaystyle H_{x}(x_{t,n})=\sum_{n=1}^{N}\frac{1}{2}\|y_{t,n}-Cx_{t,n}\|_{2}^{2}+\lambda f_{s}(e_{t,n})+h(x_{t,n},V_{x}).

(6)

Minimizing $H_{x}$ with respect to $x_{t,n}$ yields the Karush–Kuhn–Tucker (KKT) condition for the optimal sparse state

\displaystyle(C^{T}C+W_{x})x_{t,n}=C^{T}y_{t,n}-\lambda\alpha^{*}.

(7)
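As a short worked step, (7) follows from setting the gradient of the per-patch $H_{x}$ in (6) to zero. The sketch below assumes that $\alpha^{*}$ is held fixed during the step, so that $\nabla_{x}f_{s}(e_{t,n})=\alpha^{*}$ with $e_{t,n}=x_{t,n}-Ax_{t-1,n}$; the derivation in the appendix may organize these steps differently.

```latex
\begin{aligned}
\nabla_{x_{t,n}} H_{x}
  &= -C^{T}(y_{t,n}-Cx_{t,n}) + \lambda\alpha^{*} + W_{x}x_{t,n} = 0 \\
\Longrightarrow\quad (C^{T}C+W_{x})\,x_{t,n}
  &= C^{T}y_{t,n}-\lambda\alpha^{*}.
\end{aligned}
```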

To find such an optimal state, we propose Algorithm 1, which is applicable to every layer and applies an iterative form of (7). The update of states at each iteration is one-step optimal. We set a positive initial value for the state. Note that it cannot be zero because the iteration will never update with $R_{x}^{0}=0$. Also, the optimal state in (7) is expected to be sparse, namely some components of $x_{t,n}^{i}$ go to zero. This makes entries of $W_{x}$ go to infinity, leading to numerically inaccurate results. We avoid this by using $R_{x}=(W_{x})^{-1}$ and the matrix inversion lemma [45]

\displaystyle(C^{T}C+W_{x})^{-1}=R_{x}-R_{x}C^{T}(I+CR_{x}C^{T})^{-1}CR_{x}\triangleq T(C,R_{x}).

(8)

Note that the matrix $C^{T}C+W_{x}$ is invertible due to positive semi-definite $C^{T}C$ and positive definite diagonal $W_{x}$ . To further accelerate the computation, we can avoid directly computing the inverse term $(I+CR_{x}C^{T})^{-1}$ by using the conjugate gradient method to compute $(I+CR_{x}C^{T})^{-1}CR_{x}$ .
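The identity (8) can be verified numerically with a minimal sketch such as the following. The name `woodbury_T` is hypothetical, and the diagonal of $R_{x}$ is passed as a vector; note that only a $P\times P$ system is solved, and that zero entries of $R_{x}$ (exactly sparse components) cause no division by zero.

```python
import numpy as np

def woodbury_T(C, r):
    """T(C, R_x) in (8) with R_x = diag(r); entries of r may be exactly zero."""
    P = C.shape[0]
    CR = C * r                         # C @ diag(r), shape (P, K)
    S = np.eye(P) + CR @ C.T           # I + C R_x C^T, a P x P matrix
    return np.diag(r) - CR.T @ np.linalg.solve(S, CR)

rng = np.random.default_rng(0)
P, K = 5, 8
C = rng.standard_normal((P, K))
r = rng.uniform(0.1, 1.0, size=K)

T_woodbury = woodbury_T(C, r)
T_direct = np.linalg.inv(C.T @ C + np.diag(1.0 / r))   # left-hand side of (8), W_x = diag(1 / r)
max_err = np.max(np.abs(T_woodbury - T_direct))

# A zero entry in r (a state component that is exactly zero) is handled without overflow.
r_sparse = r.copy()
r_sparse[0] = 0.0
T_sparse = woodbury_T(C, r_sparse)
```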

Cause Inference

To infer sparse causes by minimizing $E_{u}$ (2), we find a majorizer of $\beta\|u_{t}\|_{1}$ as

\displaystyle\beta\|u\|_{1}\leq h(u,V_{u})\triangleq\frac{1}{2}u^{T}W_{u}{u}+c

(9)

with equality at $u=V_{u}$ , where $W_{u}=\text{diag}(\beta./|{V_{u}}|)$ . Therefore, based on MM principles, we transform the minimization of $E_{u}$ in (2) to the minimization of

\displaystyle H_{u}(u_{t})=|X_{t}|^{T}(1+exp(-Bu_{t}))+h(u_{t},V_{u})

(10)

where $|X_{t}|=\gamma\sum_{n=1}^{N}|x_{t,n}|$ . Minimizing $H_{u}$ with respect to $u_{t}$ yields the KKT condition

\displaystyle W_{u}u_{t}=B^{T}(|X_{t}|.exp(-Bu_{t})).

(11)

To find such an optimal cause, we propose Algorithm 2, which is applicable to every layer and applies the iterative form of (11) for cause inference. Since $R_{u}=(W_{u})^{-1}$ and the iteration never updates with $R_{u}^{0}=0$, we set an initial value $u_{t}^{0}>0$.

With the model parameter $\theta$ fixed, states $x_{t,n}$ and causes $u_{t}$ can be updated in an interleaved manner until they converge. Since the sparsity penalty terms are replaced by a majorizer in the learning, small values of the variables are clamped to zero via thresholds, $e_{x}>0$ for states and $e_{u}>0$ for causes. As such, the states and causes become sparse in a finite number of iterations.

Algorithm 1 State Inference

1. Initialization: initial values of states $x_{t,n}^{0}$, initial iteration step $i=0$.

2. Update the state at patch $n$ and time $t$:

\displaystyle x_{t,n}^{i+1}=T(C,R_{x}^{i})(C^{T}y_{t,n}-\lambda\alpha^{*}),

(12)

\displaystyle R_{x}^{i}=\text{diag}\left(\frac{|x_{t,n}^{i}|}{\mu}\right).

(13)

3. Set $i=i+1$ and repeat step 2 until convergence.
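The iteration (12)-(13) can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the authors' code: the term $\lambda\alpha^{*}$ is passed in as a fixed vector `lam_alpha` (with `None` meaning $\lambda=0$, the setting of the CIFAR-10 experiment), a fixed iteration count replaces a convergence test, and the names `mm_state_inference`, `eps`, and `lasso_objective` are hypothetical.

```python
import numpy as np

def mm_state_inference(y, C, mu, n_iter=20, x0=None, lam_alpha=None, eps=1e-6):
    """Iterate (12)-(13) for a single patch (illustrative sketch)."""
    P, K = C.shape
    x = np.full(K, 0.1) if x0 is None else x0.astype(float).copy()   # nonzero start
    b = C.T @ y if lam_alpha is None else C.T @ y - lam_alpha        # right-hand side of (7)
    for _ in range(n_iter):
        r = np.abs(x) / mu                      # (13): diagonal of R_x^i
        CR = C * r                              # C R_x^i
        S = np.eye(P) + CR @ C.T                # I + C R_x^i C^T
        # (12): T(C, R) b = R b - R C^T (I + C R C^T)^{-1} C R b
        x = r * b - CR.T @ np.linalg.solve(S, CR @ b)
    x[np.abs(x) < eps] = 0.0                    # clamp small entries (threshold e_x)
    return x

rng = np.random.default_rng(0)
P, K = 16, 24
C = rng.standard_normal((P, K))
C /= np.linalg.norm(C, axis=0)                  # unit-norm dictionary columns
x_true = np.zeros(K)
x_true[[2, 7, 11]] = [1.0, -0.5, 0.8]
y = C @ x_true
mu = 0.05

def lasso_objective(x):
    return 0.5 * np.sum((y - C @ x) ** 2) + mu * np.sum(np.abs(x))

x_start = np.full(K, 0.1)
x_hat = mm_state_inference(y, C, mu, n_iter=30, x0=x_start)
f_start, f_end = lasso_objective(x_start), lasso_objective(x_hat)
```

With $\lambda=0$ each update exactly minimizes the majorized objective, so the value of $E_{x}$ should not increase from one iteration to the next.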

Algorithm 2 Cause Inference

1. Initialization: initial values of causes $u_{t}^{0}$, initial iteration step $j=0$.

2. Update causes at time $t$:

\displaystyle u_{t}^{j+1}=R_{u}^{j}B^{T}(|X_{t}|.exp(-Bu_{t}^{j})),

(14)

\displaystyle R_{u}^{j}=\text{diag}\left(\frac{|u_{t}^{j}|}{\beta}\right).

(15)

3. Set $j=j+1$ and repeat step 2 until convergence.
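A minimal sketch of the fixed-point iteration (14)-(15) follows. It assumes a non-negative $B$ with normalized columns and a positive initial cause, uses a fixed iteration count instead of a convergence test, and the names `mm_cause_inference` and `eps` are hypothetical.

```python
import numpy as np

def mm_cause_inference(X, B, gamma, beta, n_iter=50, u0=None, eps=1e-6):
    """Iterate (14)-(15) for one time step (illustrative sketch). X: (N, K), B: (K, D)."""
    D = B.shape[1]
    abs_states = gamma * np.sum(np.abs(X), axis=0)             # |X_t| = gamma * sum_n |x_{t,n}|
    u = np.full(D, 0.1) if u0 is None else u0.astype(float).copy()
    for _ in range(n_iter):
        r = np.abs(u) / beta                                   # (15): diagonal of R_u^j
        u = r * (B.T @ (abs_states * np.exp(-(B @ u))))        # (14)
    u[np.abs(u) < eps] = 0.0                                   # clamp small entries (threshold e_u)
    return u

rng = np.random.default_rng(0)
N, K, D = 4, 24, 5
B = np.abs(rng.standard_normal((K, D)))
B /= np.linalg.norm(B, axis=0)                                 # assumed: non-negative, unit-norm columns
X = rng.standard_normal((N, K)) * (rng.random((N, K)) < 0.2)   # sparse toy states
u_hat = mm_cause_inference(X, B, gamma=1.0, beta=0.3)
```

With non-negative $B$ and $u_{t}^{0}>0$, every iterate stays non-negative, since each update multiplies $|u_{t}^{j}|$ by a non-negative factor.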

Model Parameters Inference

By fixing the converged states and causes, the model parameters $\theta=\{A,B,C\}$ are updated based on the overall objective function (3). For time-varying input, to keep track of parameter temporal relationships, we put an additional constraint on the parameters [30, 31], i.e., ${\theta_{t}}=\theta_{t-1}+z_{t}$, where $z_{t}$ is Gaussian transition noise acting as an additional temporal smoothness prior. Along with this constraint, each matrix can be updated independently using gradient descent. We recommend normalizing the columns of matrices $C$ and $B$ after the update to avoid any trivial solution.
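For the dictionary $C$, one such update can be sketched as a gradient step on the reconstruction term of (1) followed by column normalization. This is a simplified illustration: the temporal smoothness prior is omitted, the step size `lr` is a hypothetical parameter, and the updates of $A$ and $B$ are not shown.

```python
import numpy as np

def update_dictionary(C, Y, X, lr):
    """One gradient step on sum_n 0.5 * ||y_n - C x_n||^2 w.r.t. C, then column normalization."""
    residual = Y - X @ C.T                      # (N, P), rows are y_n - C x_n
    grad = -residual.T @ X                      # -sum_n (y_n - C x_n) x_n^T, shape (P, K)
    C_new = C - lr * grad
    norms = np.linalg.norm(C_new, axis=0, keepdims=True)
    return C_new / np.maximum(norms, 1e-12)     # unit-norm columns avoid trivial scaling

rng = np.random.default_rng(0)
N, P, K = 8, 16, 24
C_true = rng.standard_normal((P, K))
X = rng.standard_normal((N, K)) * (rng.random((N, K)) < 0.2)   # sparse toy states
Y = X @ C_true.T
C0 = rng.standard_normal((P, K))
C1 = update_dictionary(C0, Y, X, lr=0.01)
col_norms = np.linalg.norm(C1, axis=0)
```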

3.2 MM-Based Variable Inference with Top-Down Preference

Given the learned model, the updates of states and causes in the variable inference process are the same as in Section 3.1, except that a top-down preference is added to $E_{u}$ (2) for cause inference. Since the causes at a lower layer serve as the input of the upper layer, a predicted top-down reference using the states from the layer above is injected into the cause inference of the lower layer. That is,

\displaystyle\bar{E}_{u}(u_{t})=E_{u}(u_{t})+\frac{1}{2}\|u_{t}-\hat{u}_{t}\|_{2}^{2},

(16)

where $\hat{u}_{t}$ is the top-down prediction [46]. Determination of its value can be found in Appendix A and [31]. Similar to Section 3.1, using the majorizer (9) to replace the $L_{1}$ -norm penalty in $E_{u}$ , minimizing $\bar{E}_{u}$ (16) becomes minimizing

\displaystyle\bar{H}_{u}(u_{t})=H_{u}(u_{t})+\frac{1}{2}\|u_{t}-\hat{u}_{t}\|_{2}^{2}

(17)

with respect to $u_{t}$ , which yields the KKT condition

\displaystyle(I+W_{u})u_{t}=\hat{u}_{t}+B^{T}(|X_{t}|.exp(-Bu_{t}))

(18)

for every layer, where $I$ denotes identity matrix. Since the diagonal matrix $(I+W_{u})$ is non-singular, we develop the iterative form in Algorithm 3.

Since inferences at each layer are independent, the complete learning procedure for each layer is summarized in Algorithm 4. For better convergence of state inference and cause inference, which are interleaved in an alternating minimization manner, we recommend running Algorithm 1 for several iterations $i_{s}$ and then Algorithm 2 for several iterations $j_{s}$.

Algorithm 3 Top-down Cause Inference

1. Initialization: initial values of causes $u_{t}^{0}$, initial iteration step $j=0$.

2. Update causes at time $t$:

\displaystyle u_{t}^{j+1}=T(I_{D},\bar{R}_{u}^{j})\left(\hat{u}_{t}+B^{T}(|X_{t}|.exp(-Bu_{t}^{j}))\right),

(19)

\displaystyle\bar{R}_{u}^{j}=\text{diag}\left(\frac{|u_{t}^{j}|}{\beta}\right).

(20)

3. Set $j=j+1$ and repeat step 2 until convergence.
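Because $T(I_{D},\bar{R}_{u}^{j})=(I+W_{u}^{j})^{-1}$ is diagonal, the update (19) reduces to an element-wise shrinkage with factor $|u_{k}|/(|u_{k}|+\beta)$. The sketch below illustrates this under the assumptions of a non-negative $B$, a non-negative top-down prediction $\hat{u}_{t}$, and a fixed iteration count; the names `shrink_diag` and `mm_topdown_cause_inference` are hypothetical.

```python
import numpy as np

def shrink_diag(u, beta):
    """Diagonal of T(I_D, R_u) = (I + W_u)^{-1} with W_u = diag(beta / |u|)."""
    a = np.abs(u)
    return a / (a + beta)

def mm_topdown_cause_inference(X, B, u_pred, gamma, beta, n_iter=50, u0=None):
    """Iterate (19)-(20) with a top-down prediction u_pred (illustrative sketch)."""
    D = B.shape[1]
    abs_states = gamma * np.sum(np.abs(X), axis=0)             # |X_t|
    u = np.full(D, 0.1) if u0 is None else u0.astype(float).copy()
    for _ in range(n_iter):
        bottom_up = B.T @ (abs_states * np.exp(-(B @ u)))
        u = shrink_diag(u, beta) * (u_pred + bottom_up)        # (19)
    return u

rng = np.random.default_rng(0)
N, K, D = 4, 24, 5
B = np.abs(rng.standard_normal((K, D)))
B /= np.linalg.norm(B, axis=0)
X = rng.standard_normal((N, K)) * (rng.random((N, K)) < 0.2)
u_pred = np.abs(rng.standard_normal(D))                        # stand-in for the top-down prediction
u_td = mm_topdown_cause_inference(X, B, u_pred, gamma=1.0, beta=0.3)
```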

Algorithm 4 MM-DPCNs

1. Initialization: Video input $y_{t,n}$, initial model parameters $\theta^{0}$, initial variables $x_{t,n}^{0},u_{t}^{0}$.

2. Model Inference:
i). Update state $x_{t,n}$ by Algorithm 1 and cause $u_{t}$ by Algorithm 2 in an interleaved manner until convergence.
ii). Update dictionary $\theta$ once using the gradient descent method.
iii). Go to step i) until $\theta$ converges.

3. Bi-Directional Variable Inference:
Fix model $\theta$. Run Algorithms 1 and 3 in an interleaved manner to infer $x_{t,n}$ and $u_{t}$ until they converge.

4 Convergence Analysis of MM-Based Variable Inference

In this section, we analyze the convergence of the proposed Algorithm 1 for state inference and Algorithm 2 for cause inference, respectively.

Convergence of State Inference

State inference is independent at each patch $n$ and each layer $l$; hence, we analyze the convergence of the objective function $E_{x}$ (1) under Algorithm 1, dropping the subscript $n$ and the superscript $l$ for simplicity. To do this, we introduce an auxiliary objective function

\displaystyle F(x_{t})=f(x_{t})+g(x_{t})

(21)

where $f(x_{t})=\frac{1}{2}\|y_{t}-Cx_{t}\|_{2}^{2}+\lambda f_{s}(e_{t})$ and $g(x_{t})=\mu\|x_{t}\|_{1}$ . Rewrite $H_{x}$ in (6) for each patch as

\displaystyle H_{x}(x_{t},V_{x})=f(x_{t})+h(x_{t},V_{x})

(22)

where $g(x_{t})\leq h(x_{t},V_{x})$ with equality at $x_{t}=V_{x}$ as shown in (5). This admits the unique minimizer

\displaystyle P(V_{x}):=\underset{x_{t}}{\text{argmin}}H_{x}(x_{t},V_{x}).

(23)

Theorem 1

Consider the sequence $\{x_{t}^{i}\}\in\mathbb{R}^{K}$ for a patch generated by Algorithm 1. Then, $F(x_{t}^{i})$ converges, and for any $s\geq 1$ we have

\displaystyle F(x_{t}^{s})-F(x_{t}^{*})\leq\frac{1}{2s}\sum_{i=0}^{s-1}(|x_{t}^{*}|-|x_{t}^{i}|)^{T}R(|x_{t}^{*}|-|x_{t}^{i}|)

(24)

where $R=\text{diag}\{1/(\tilde{{1}}|(x_{t}^{0})_{k}|+(1-\tilde{{1}}-\bar{{1}})|(x_{t}^{*})_{k}|+\bar{{1}}|(x_{t}^{i})_{k}|)\}$, $k\in\{1,2,...,K\}$, with $\tilde{{1}}=1$ if $|(x_{t}^{*})_{k}|\geq|(x_{t}^{0})_{k}|>0$, $\tilde{{1}}=0$ if $0\leq|(x_{t}^{*})_{k}|<|(x_{t}^{0})_{k}|$, $\bar{{1}}=1$ if $|(x_{t}^{*})_{k}|=0$, and $\bar{{1}}=0$ otherwise. Here, $()_{k}$ denotes the $k$-th element of a vector.

Proof: Please see Appendix B.

Theorem 2

Let $x_{t}^{*}$ be the optimal solution to minimizing $E_{x}$ (1) for a single patch at a layer. The upper bound of its convergence satisfies

\displaystyle E_{x}(x_{t}^{s})-E_{x}(x_{t}^{*})\leq\lambda m\bar{D}+\frac{1}{2s}\sum_{i=0}^{s-1}(|x_{t}^{*}|-|x_{t}^{i}|)^{T}R(|x_{t}^{*}|-|x_{t}^{i}|).

(25)

where $\bar{D}=\underset{\|\alpha\|_{\infty}\leq 1}{\text{max}}\frac{1}{2}\|\alpha\|_{2}^{2}$ .
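As a short worked step, the constant $\bar{D}$ has a closed form. Assuming $\alpha\in\mathbb{R}^{K}$ (the same dimension as the state, which is an assumption made here), the maximum over the unit $L_{\infty}$ ball is attained at any vertex $\alpha_{k}=\pm 1$:

```latex
\bar{D}
  = \max_{\|\alpha\|_{\infty}\leq 1}\frac{1}{2}\sum_{k=1}^{K}\alpha_{k}^{2}
  = \frac{K}{2},
\qquad\text{so}\qquad
\lambda m\bar{D} = \frac{\lambda m K}{2}.
```

Under this assumption, the first term of the bound (25) scales linearly with the smoothing constant $m$ and vanishes when $\lambda=0$.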

Proof: Please see Appendix B.

Convergence of Causes Inference

The convergence of cause inference can be analyzed similarly. We rewrite the function $E_{u}$ (2) at a single layer as

\displaystyle E_{u}(u_{t})=f_{u}(u_{t})+\beta\|u_{t}\|_{1}

(26)

where $f_{u}(u_{t})=|X_{t}|^{T}(1+exp(-Bu_{t}))$ . We also rewrite $H_{u}$ (10) with (9) as

\displaystyle H_{u}(u_{t},V_{u})=f_{u}(u_{t})+h(u_{t},V_{u}).

(27)

Theorem 3

Consider the sequence $\{u_{t}^{j}\}\in\mathbb{R}^{D}$ generated by Algorithm 2. Then, $E_{u}(u_{t}^{j})$ converges, and for any $s\geq 1$ we have

\displaystyle E_{u}(u_{t}^{s})-E_{u}(u_{t}^{*})\leq\frac{1}{2s}\sum_{j=0}^{s-1}(|u_{t}^{*}|-|u_{t}^{j}|)^{T}\bar{R}(|u_{t}^{*}|-|u_{t}^{j}|).

(28)

where $\bar{R}=\text{diag}\{1/(\tilde{1}|(u_{t}^{0})_{k}|+(1-\tilde{1}-\bar{{1}})|(u_{t}^{*})_{k}|+\bar{{1}}|(u_{t}^{j})_{k}|)\}$ , $k\in\{1,2,...,D\}$ , with $\tilde{1}=1$ if $|(u_{t}^{*})_{k}|\geq|(u_{t}^{0})_{k}|>0$ , $\tilde{1}=0$ if $0\leq|(u_{t}^{*})_{k}|<|(u_{t}^{0})_{k}|$ , $\bar{{1}}=1$ if $|(u_{t}^{*})_{k}|=0$ , and $\bar{{1}}=0$ otherwise.

Proof: Please see Appendix B.

We have a similar conclusion for Algorithm 3. In Algorithm 3, we set the initial $u_{t}^{0}>0$. With a diagonal positive-definite matrix $T(I_{D},\bar{R}_{u}^{j})$, i.e., $(I+W_{u}^{j})^{-1}$, and given $u_{t}^{j}>0$, (19) with a normalized matrix $B$ yields $u_{t}^{j+1}>0$. Using a proof similar to that for Algorithm 2, we can deduce that Algorithm 3 makes $u_{t}^{j}$ sparse and minimizes $\bar{H}_{u}$ in (17). Based on the MM principles, it also minimizes the function $\bar{E}_{u}$ in (16).

5 Experiments

We report the performance of MM-DPCNs on image sparse coding and video feature clustering. We compare the MM-based algorithm used for MM-DPCNs with FISTA [34], ISTA [40], and ADAM [42] to test the optimization quality of sparse coding on the CIFAR-10 data set. For video feature clustering, we compare our MM-DPCNs to the previous DPCN version FISTA-DPCN [31] and to the auto-encoder (AE) [47] and WTA-RNN-AE [48] methods (architecture details are provided in Appendix C) on the video data sets OpenAI Gym Super Mario Bros environment [49] and Coil-100 [50]. Note that these are standard data sets used for sparse coding and feature extraction [51, 52]. We use clustering accuracy (ACC), computed as the completeness score, the adjusted Rand index (ARI), and the sparsity level (SPA) to evaluate the clustering quality, and the learning convergence time (LCT) to evaluate sparse coding optimization on each frame. More results on a geometric moving shape data set can be found in Appendix C. The implementations are written in Python with PyTorch, and all the experiments were run on a Linux server with a 32G NVIDIA V100 Tensor Core GPU.

5.1 Comparison on Image Sparse Coding

Table 2: CIFAR-10 sparse coding optimization.

Methods	$E_{x}$	SPA
ISTA	${2.96}\mathrm{e}{4}\pm 680$	$8.96\pm 0.39$
FISTA	${1.77}\mathrm{e}{4}\pm 537$	$19.50\pm 0.83$
Adam	${1.59}\mathrm{e}{4}\pm 13.68$	$34.99\pm 0.05$
MM	${1.09}\mathrm{e}{4}\pm 390$	$79.87\pm 0.32$

The proposed MM Algorithm 1 is applicable to general sparse optimization problems such as Lasso problems [53]. We apply the MM Algorithm 1, as well as the well-known ISTA [40] and FISTA [34] for comparison, on the CIFAR-10 data set with the reconstruction and sparsity loss $E_{x}$ (1) ($\mu=0.3$, $\lambda=0$, and randomized $C\in\mathbb{R}^{256\times 300}$). We also compare the performance with the Adam algorithm [42] used to optimize the smooth majorizer, which is of particular interest to the deep learning optimization community. The images are preprocessed by splitting them into four equally sized patches. FISTA and ISTA have learning rates, set as $\eta=1e-2$, while MM is learning-rate-free.

Fig. 3(a) shows that the MM Algorithm 1 converges in fewer than 10 steps, much faster than the others. It also attains a higher sparsity level of the learned state, which is a direct benefit of the fast convergence rate, as shown in Fig. 3(b) and Fig. 3(c). The statistics of the optimization results are summarized in Table 2, where MM Algorithm 1 produces the lowest loss value while maintaining the highest sparsity level. The results reveal three potential advantages for MM-DPCN: 1. Faster computation. 2. Higher sparsity of the latent space embeddings. 3. More faithful reconstructions. The last two advantages enable the algorithm to produce highly condensed and faithful information embedded into the latent space, which also benefits feature clustering.

5.2 Comparison on Video Clustering

Super Mario Bros data set

We picked five main objects of the Mario [49] data set from the video sequence played by humans: Bullet Bill, Goomba, Koopa, Mario, and Piranha Plant. They exhibit various movements, such as jumping, running, and opening or closing, against diverse backgrounds. Both training and testing videos contain 500 frames ($32\times 32\times 3$ pixels), with 100 consecutive frames per object. For DPCNs, each frame is divided into four vectorized patches normalized between 0 and 1. The network is initialized with $x^{1}\in\mathbb{R}^{300}$, $u^{1}\in\mathbb{R}^{40}$, $x^{2}\in\mathbb{R}^{100}$, $u^{2}\in\mathbb{R}^{20}$, and model matrices $A^{l},B^{l},C^{l}$, $l=1,2$. We set $\mu^{l}=0.3$ and $\beta^{l}=0.3$ for MM-DPCN and $\mu^{l}=1$ and $\beta^{l}=0.5$ for FISTA-DPCN. Figure 4 shows that MM-DPCN produces a clean separation while keeping each cluster compact. Figure 5(a) demonstrates the reconstruction quality produced by MM-DPCN in comparison to alternative methods. We observe from Table 3 that MM-DPCN achieves the best ACC, ARI, and SPA, and is much faster than the previous version FISTA-DPCN.

Coil-100 data set

The Coil-100 data set [50] consists of 100 videos of different objects, each 72 frames long. The frames are resized to 32×32 pixels and normalized between 0 and 1. We used the first 50 frames of each object for training and the remaining 22 frames for testing. We initialize our MM-DPCNs with randomized model $A^{l},B^{l},C^{l}$, $l=1,2$, and $x^{1}\in\mathbb{R}^{2000}$, $x^{2}\in\mathbb{R}^{500}$, $u^{1}\in\mathbb{R}^{128}$ and $u^{2}\in\mathbb{R}^{80}$. We set $\mu^{l}=0.1$, $\beta^{l}=0.1$ for MM-DPCN and $\mu^{l}=1$, $\beta^{l}=0.2$ for FISTA-DPCN. We extract the causes from the last layer of MM- and FISTA-DPCNs and use PCA to project them into three-dimensional vectors, then apply K-Means for clustering. This same process is applied to the learned latent space encodings for both AE and WTA-RNN-AE, constructed using MLPs and ReLU.
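The evaluation pipeline described above (project the last-layer causes to three dimensions with PCA, then cluster with K-Means) can be sketched with NumPy only. This is a stand-in under stated assumptions: the toy `causes` array replaces real DPCN outputs, the PCA is computed by SVD, and the K-Means uses a deterministic farthest-point initialization chosen here for reproducibility; the library and settings used in the actual experiments are not specified in the text.

```python
import numpy as np

def pca_project(Z, n_components=3):
    """Project rows of Z onto the top principal components (via SVD)."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:n_components].T

def kmeans(Z, k, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point initialization."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Toy stand-in for last-layer causes: three objects, 20 frames each, 10-dimensional.
rng = np.random.default_rng(0)
true_labels = np.repeat(np.arange(3), 20)
prototypes = 10.0 * np.eye(3, 10)
causes = prototypes[true_labels] + 0.1 * rng.standard_normal((60, 10))

embedded = pca_project(causes, n_components=3)
pred = kmeans(embedded, k=3)
```

The predicted labels can then be scored against the ground-truth object identities with metrics such as the completeness score and ARI.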

Table 3: Quantitative comparison for video clustering and learning convergence time.

Methods	Mario				Coil-100
Methods	ACC	ARI	SPA	LCT ( $s$ )	ACC	ARI	SPA	LCT ( $s$ )
AE	84.81	76.74	0.00	*	77.74	44.04	0.00	*
WTA-RNN-AE	92.76	88.22	90.00	*	79.28	44.45	90.00	*
FISTA-DPCN	87.74	72.01	87.22	0.084	80.48	47.00	81.02	0.102
MM-DPCN	94.87	91.98	95.17	0.015	82.98	48.93	57.86	0.016

Table 3 presents the quantitative clustering and learning results, and Figure 5(b) showcases the qualitative video sequence reconstruction results. WTA-RNN-AE includes an additional RNN to learn video dynamics, which, however, comes at the cost of reconstruction quality. On the other hand, the FISTA- and MM-DPCNs provide much better reconstruction, as the recurrent models $A$ are linear and less susceptible to overfitting than an RNN, while WTA-RNN-AE tends to blend and blur different objects. Moreover, the efficiency of the iterative process enables MM to provide the best reconstruction quality. As shown in Table 3, WTA-RNN-AE has the best SPA on Coil-100 since it allows the sparsity level of the encodings to be fixed at $90\%$, which, however, results in worse ACC and ARI due to excessive loss of information. In contrast, MM and FISTA, which control how much information is compressed through the sparsity coefficients without resorting to nonlinear DL models, have much better ACC and ARI, and our MM-DPCN has the best ACC, ARI, and MSE.

In the learning, the matrix inversion operation involves a conjugate gradient computation with complexity approximately $O(\sqrt{m}K^{2})$, where $m$ is the matrix condition number and $K$ is the state size. The memory complexity for storing matrices is $O(K^{2})$; this requirement grows as the state size increases, potentially leading to memory overhead when the vector size is too large. This can be mitigated by moderately increasing the number of patches or enlarging hardware memory.
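The conjugate gradient step can be illustrated with the sketch below, which solves $(I+CR_{x}C^{T})v=CR_{x}b$ without forming an explicit inverse. The function `conjugate_gradient`, its tolerance, and the matrix-free operator `apply_S` are assumptions made for illustration and are not taken from the authors' implementation.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=200):
    """Solve S v = b for a symmetric positive-definite operator given as a matvec."""
    x = np.zeros_like(b)
    r = b - matvec(x)                  # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        step = rs / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p      # new conjugate search direction
        rs = rs_new
    return x

rng = np.random.default_rng(0)
P, K = 16, 24
C = rng.standard_normal((P, K)) / np.sqrt(P)
r_diag = rng.uniform(0.0, 1.0, size=K)         # diagonal of R_x
b = rng.standard_normal(K)

rhs = C @ (r_diag * b)                          # C R_x b

def apply_S(v):
    return v + C @ (r_diag * (C.T @ v))         # (I + C R_x C^T) v without forming the matrix

v_cg = conjugate_gradient(apply_S, rhs)
v_direct = np.linalg.solve(np.eye(P) + (C * r_diag) @ C.T, rhs)
```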

6 Conclusion

We proposed MM-based DPCNs that circumvent the non-smooth optimization problem with a sparsity penalty for sparse coding by turning it into a smooth minimization problem using a majorizer for the sparsity penalty. The method searches for the optimal solution directly in the direction of the stationary point of the smoothed objective function. The experiments on image and video data sets demonstrated that this substantially improves the rate of convergence, computation time, and feature clustering performance.

Acknowledgments and Disclosure of Funding

This work is partially supported by the Office of the Under Secretary of Defense for Research and Engineering under awards N00014-21-1-2295 and N00014-21-1-2345.

References

[1] Rudolf E Kalman. On the general theory of control systems. In the 1st International Conference on Automatic Control, pages 481–492, 1960.
[2] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[3] Bosen Lian, Frank L Lewis, Gary A Hewer, Katia Estabridis, and Tianyou Chai. Robustness analysis of distributed Kalman filter for estimation in sensor networks. IEEE Transactions on Cybernetics, 52(11):12479–12490, 2021.
[4] Bosen Lian, Yan Wan, Ya Zhang, Mushuang Liu, Frank L Lewis, Alexandra Abad, Tina Setter, Dunham Short, and Tianyou Chai. Distributed consensus-based Kalman filtering for estimation with multiple moving targets. In IEEE 58th Conference on Decision and Control, pages 3910–3915, 2019.
[5] Amir Parviz Valadbeigi, Ali Khaki Sedigh, and Frank L Lewis. $H_{\infty}$ static output-feedback control design for discrete-time systems using reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 31(2):396–406, 2020.
[6] Adam Charles, M Salman Asif, Justin Romberg, and Christopher Rozell. Sparsity penalties in dynamical system estimation. In the 45th IEEE Conference on Information Sciences and Systems, pages 1–6, 2011.
[7] Ashish Pal and Satish Nagarajaiah. Sparsity promoting algorithm for identification of nonlinear dynamic system based on unscented Kalman filter using novel selective thresholding and penalty-based model selection. Mechanical Systems and Signal Processing, 212(111301):1–22, 2024.
[8] Tapio Schneider, Andrew M Stuart, and Jinlong Wu. Ensemble Kalman inversion for sparse learning of dynamical systems from time-averaged data. Journal of Computational Physics, 470(111559):1–31, 2022.
[9] Fernando Ornelas-Tellez, J Jesus Rico-Melgoza, Angel E Villafuerte, and Febe J Zavala-Mendoza. Neural networks: A methodology for modeling and control design of dynamical systems. In Artificial neural networks for engineering applications, pages 21–38. Elsevier, 2019.
[10] Christian Legaard, Thomas Schranz, Gerald Schweiger, Ján Drgoňa, Basak Falay, Cláudio Gomes, Alexandros Iosifidis, Mahdi Abkar, and Peter Larsen. Constructing neural network based models for simulating dynamical systems. ACM Computing Surveys, 55(11):1–34, 2023.
[11] Kyriakos G Vamvoudakis and Frank L Lewis. Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 46(5):878–888, 2010.
[12] Shaowu Pan and Karthik Duraisamy. Long-time predictive modeling of nonlinear dynamical systems using neural networks. Complexity, 2018:1–26, 2018.
[13] Pawan Goyal and Peter Benner. Discovery of nonlinear dynamical systems using a runge–kutta inspired dictionary-based sparse regression approach. Proceedings of the Royal Society A, 478(20210883):1–24, 2022.
[14] Yingcheng Lai. Finding nonlinear system equations and complex network structures from data: A sparse optimization approach. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31(082101):1–12, 2021.
[15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision–ECCV 2016: 14th European Conference, pages 630–645, 2016.
[18] Pu Li and Wangda Zhao. Image fire detection algorithms based on convolutional neural networks. Case Studies in Thermal Engineering, 19:100625, 2020.
[19] Dolly Das, Saroj Kumar Biswas, and Sivaji Bandyopadhyay. Detection of diabetic retinopathy using convolutional neural networks for feature extraction and classification (drfec). Multimedia Tools and Applications, 82(19):29943–30001, 2023.
[20] Karl Friston. Hierarchical models in the brain. PLoS computational biology, 4(11):e1000211, 2008.
[21] Karl Friston and Stefan Kiebel. Predictive coding under the free-energy principle. Philosophical transactions of the Royal Society B: Biological sciences, 364(1521):1211–1221, 2009.
[22] Andre M Bastos, W Martin Usrey, Rick A Adams, George R Mangun, Pascal Fries, and Karl J Friston. Canonical microcircuits for predictive coding. Neuron, 76(4):695–711, 2012.
[23] Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1):79–87, 1999.
[24] Janneke FM Jehee, Constantin Rothkopf, Jeffrey M Beck, and Dana H Ballard. Learning receptive fields using predictive feedback. Journal of Physiology-Paris, 100(1-3):125–132, 2006.
[25] Kuan Han, Haiguang Wen, Yizhen Zhang, Di Fu, Eugenio Culurciello, and Zhongming Liu. Deep predictive coding network with local recurrent processing for object recognition. In the 32nd Conference on Neural Information Processing Systems, pages 1–13, 2018.
[26] Haiguang Wen, Kuan Han, Junxing Shi, Yizhen Zhang, Eugenio Culurciello, and Zhongming Liu. Deep predictive coding network for object recognition. In International conference on machine learning, pages 5266–5275. PMLR, 2018.
[27] Rakesh Chalasani and Jose C Principe. Context dependent encoding using convolutional dynamic networks. IEEE Transactions on Neural Networks and Learning Systems, 26(9):1992–2004, 2015.
[28] Isaac J Sledge and José C Príncipe. Faster convergence in deep-predictive-coding networks to learn deeper representations. IEEE Transactions on Neural Networks and Learning Systems, 34(8):5156–5170, 2021.
[29] Jamal Banzi, Isack Bulugu, and Zhongfu Ye. Learning a deep predictive coding network for a semi-supervised 3d-hand pose estimation. IEEE/CAA Journal of Automatica Sinica, 7(5):1371–1379, 2020.
[30] Jose C Principe and Rakesh Chalasani. Cognitive architectures for sensory processing. Proceedings of the IEEE, 102(4):514–525, 2014.
[31] Rakesh Chalasani and Jose C Principe. Deep predictive coding networks. arXiv preprint arXiv:1301.3541, 2013.
[32] Daniel J Felleman and David C Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991.
[33] Thomas Serre, Aude Oliva, and Tomaso Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the national academy of sciences, 104(15):6424–6429, 2007.
[34] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
[35] Jérôme Bolte and Edouard Pauwels. Majorization-minimization procedures and convergence of sqp methods for semi-algebraic and tame programs. Mathematics of Operations Research, 41(2):442–465, 2016.
[36] Frank L Lewis and Draguna Vrabie. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE circuits and systems magazine, 9(3):32–50, 2009.
[37] Ramzi Ben Mhenni, Sébastien Bourguignon, and Jordan Ninin. Global optimization for sparse solution of least squares problems. Optimization Methods and Software, 37(5):1740–1769, 2022.
[38] Yan Karklin and Michael S Lewicki. A hierarchical bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural computation, 17(2):397–423, 2005.
[39] Ivan Selesnick. Penalty and shrinkage functions for sparse signal processing. Connexions, 11(22):1–26, 2012.
[40] Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 57(11):1413–1457, 2004.
[41] Mário AT Figueiredo and Robert D Nowak. An em algorithm for wavelet-based image restoration. IEEE Transactions on Image Processing, 12(8):906–916, 2003.
[42] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[43] Xi Chen, Qihang Lin, Seyoung Kim, Jaime G Carbonell, and Eric P Xing. Smoothing proximal gradient method for general structured sparse regression. The ANNALS of Applied Statistics, 6(2):719–752, 2012.
[44] Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103:127–152, 2005.
[45] Mário AT Figueiredo, José M Bioucas-Dias, and Robert D Nowak. Majorization minimization algorithms for wavelet-based image restoration. IEEE Transactions on Image processing, 16(12):2980–2991, 2007.
[46] Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467, 2010.
[47] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
[48] Eder Santana, Matthew S Emigh, Pablo Zegers, and Jose C Principe. Exploiting spatio-temporal structure with recurrent winner-take-all networks. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3738–3746, 2017.
[49] OpenAI. Super mario bros environment for openai gym, 2017.
[50] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (coil-100). Technical Report CUCS-006-96, 1996.
[51] Hongming Li, Ran Dou, Andreas Keil, and Jose C Principe. A self-learning cognitive architecture exploiting causality from rewards. Neural Networks, 150:274–292, 2022.
[52] Zhenyu Qian, Yizhang Jiang, Zhou Hong, Lijun Huang, Fengda Li, Khin Wee Lai, and Kaijian Xia. Multiscale and auto-tuned semi-supervised deep subspace clustering and its application in brain tumor clustering. Computers, Materials & Continua, 79(3), 2024.
[53] Silvia Cascianelli, Gabriele Costante, Francesco Crocetti, Elisa Ricci, Paolo Valigi, and Mario Luca Fravolini. Data-based design of robust fault detection and isolation residuals via lasso optimization and bayesian filtering. Asian Journal of Control, 23(1):57–71, 2021.

Appendix A Appendix for Derivations

For the term $\|e_{t,n}\|_{1}$ where $e_{t,n}=x_{t,n}-Ax_{t-1,n}$ , the smooth approximation on it is given by

\displaystyle\|e_{t,n}\|_{1}\approx f_{s}(e_{t,n})=\underset{\|\alpha\|_{\infty}\leq 1}{\text{max}}\left(\alpha^{T}e_{t,n}-\frac{m}{2}\|\alpha\|_{2}^{2}\right).

(29)

The best approximation, as well as the maximum, is reached at $\alpha^{*}$ such that

\displaystyle\alpha^{*}=S(\frac{e_{t,n}}{m})=\left\{\begin{array}[]{cc}\frac{e_{t,n}}{m}&-1\leq\frac{e_{t,n}}{m}\leq 1\\ 1&\frac{e_{t,n}}{m}>1\\ -1&\frac{e_{t,n}}{m}<-1\end{array}\right.

(33)
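The smoothing in (29) can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: the function name `smoothed_l1` and the sample vector are introduced here. It evaluates $f_{s}$ by clipping $e_{t,n}/m$ to $[-1,1]$ componentwise and compares the result with the exact $\ell_{1}$ norm. Since $\frac{1}{2}\|\alpha\|_{2}^{2}\leq K/2$ on the box $\|\alpha\|_{\infty}\leq 1$ for a $K$-dimensional vector, the gap is at most $mK/2$.

```python
def smoothed_l1(e, m):
    """Smooth approximation of ||e||_1: max over |alpha|_inf <= 1 of alpha^T e - (m/2)||alpha||^2."""
    # The maximizer is the componentwise clipping of e/m to [-1, 1].
    alpha = [max(-1.0, min(1.0, ek / m)) for ek in e]
    value = sum(a * ek - 0.5 * m * a * a for a, ek in zip(alpha, e))
    return value, alpha


# Illustrative error vector and smoothing parameter (values are assumptions).
e = [0.05, -2.0, 0.3]
m = 0.1
f_s, alpha_star = smoothed_l1(e, m)
l1 = sum(abs(ek) for ek in e)
gap = l1 - f_s                       # lies between 0 and m * len(e) / 2
```

Small entries of the error are penalized quadratically and large entries linearly, so the approximation tightens as $m$ decreases.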

The majorizer of the sparsity penalty is given by

\displaystyle\mu\|x_{t,n}\|_{1}\leq h(x_{t,n},V_{x})=\sum_{k=1}^{K}h((x_{t,n})_{k},(V_{x})_{k})

(34)

where

	$\displaystyle h((x_{t,n})_{k},(V_{x})_{k})$	$\displaystyle=\frac{\phi^{\prime}((V_{x})_{k})}{2(V_{x})_{k}}(x_{t,n})_{k}^{2}+\phi((V_{x})_{k})-\frac{(V_{x})_{k}}{2}\phi^{\prime}((V_{x})_{k})$
		$\displaystyle\geq\mu|(x_{t,n})_{k}|,\quad\ \forall(x_{t,n})_{k}\in\mathbb{R}.$		(35)

where $\phi((V_{x})_{k})=\mu|(V_{x})_{k}|$ and $V_{x}\in\mathbb{R}^{K}$ can be any vector. The equality holds only at $V_{x}=x_{t,n}$ . By rewriting the left-hand-side majorizer compactly, it becomes (5) where $c=\sum_{k=1}^{K}\phi((V_{x})_{k})-0.5{(V_{x})_{k}}\phi^{\prime}((V_{x})_{k})$ is a constant independent of $x_{t,n}$ . Accordingly, the constant $c$ in (9) is $c=\sum_{k=1}^{D}\psi((V_{u})_{k})-0.5{(V_{u})_{k}}\psi^{\prime}((V_{u})_{k})$ , $\psi((u_{t})_{k})=\beta|(u_{t})_{k}|$ , where $V_{u}\in\mathbb{R}^{D}$ can be any vector.
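As a sanity check on (35), the following sketch evaluates the scalar majorizer with $\phi(v)=\mu|v|$ and $\phi^{\prime}(v)=\mu\,\text{sign}(v)$ and confirms that it lies above $\mu|x|$ and touches it at $x=v$. The function name `l1_majorizer` and the sample values are assumptions made for this illustration.

```python
def l1_majorizer(x, v, mu):
    """Quadratic majorizer of mu*|x| built around a nonzero point v, following (35)."""
    phi = mu * abs(v)                          # phi(v) = mu * |v|
    dphi = mu * (1.0 if v > 0 else -1.0)       # phi'(v) = mu * sign(v), v != 0
    return dphi / (2.0 * v) * x * x + phi - 0.5 * v * dphi


# Illustrative values (assumptions): penalty weight and expansion point.
mu = 0.3
v = 0.7
grid = [k / 10.0 for k in range(-50, 51)]
# Non-negative slack between the majorizer and the penalty on the grid.
slack = [l1_majorizer(x, v, mu) - mu * abs(x) for x in grid]
touch = l1_majorizer(v, v, mu) - mu * abs(v)   # zero up to rounding
```

Algebraically the slack equals $\mu(|x|-|v|)^{2}/(2|v|)$, which is why the expansion point must be nonzero.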

The top-down prediction for layer $l$ from the upper layer $l+1$ is denoted by $\hat{u}_{t}^{l}$ , which is given by

	$\displaystyle\hat{u}_{t}^{l}=C^{l+1}\hat{x}^{l+1}_{t},$		(36)
	$\displaystyle(\hat{x}^{l+1}_{t})_{k}=\left\{\begin{array}[]{cc}(A^{l+1}x^{l+1}_{t-1})_{k}&\lambda>\gamma(1+\exp(-(B^{l+1}u^{l+1}_{t})_{k}))\\ 0&\lambda\leq\gamma(1+\exp(-(B^{l+1}u^{l+1}_{t})_{k}))\end{array}\right.$		(39)

where $\lambda$ belongs to layer $l+1$ . At the top layer $L$ , we set $\hat{u}^{L}_{t}=u^{L}_{t-1}$ , which induces some temporal coherence on the final outputs.
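The gating in (36) and (39) can be sketched as follows. This is a minimal illustration that mirrors the two equations as written, not the authors' implementation. The function names, the matrix shapes, and all numerical values are assumptions chosen so that one state component passes the gate and the other is zeroed.

```python
import math


def matvec(M, v):
    """Dense matrix-vector product for lists of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


def top_down_prediction(C_up, A_up, B_up, x_prev_up, u_up, lam, gamma):
    """Return (u_hat, x_hat): the prediction for layer l and the gated state of layer l+1."""
    predicted = matvec(A_up, x_prev_up)        # A^{l+1} x^{l+1}_{t-1}
    drive = matvec(B_up, u_up)                 # B^{l+1} u^{l+1}_t
    # Keep a predicted state component only where lambda exceeds the cause-dependent threshold.
    x_hat = [p if lam > gamma * (1.0 + math.exp(-d)) else 0.0
             for p, d in zip(predicted, drive)]
    return matvec(C_up, x_hat), x_hat


# Illustrative layer with two states and one cause (all values are assumptions).
A_up = [[1.0, 0.0], [0.0, 1.0]]
B_up = [[5.0], [-5.0]]
C_up = [[1.0, 1.0]]
u_hat, x_hat = top_down_prediction(C_up, A_up, B_up, [2.0, 3.0], [1.0],
                                   lam=1.0, gamma=0.5)
```

In this example the first component has a threshold close to $\gamma$ and is kept, while the second has a threshold far above $\lambda$ and is set to zero.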

Appendix B Appendix for Proofs

We first show a necessary lemma before proving Theorem 1. Since $V_{x}$ in (5) represents any vector with the same dimension as $x_{t}$ , for simplicity we write $V$ for $V_{x}$ in the following analysis regarding state inference. Likewise, we write $V$ for the $V_{u}$ that appears in (9) in the analysis regarding cause inference.

Lemma 1

Let $V\in\mathbb{R}^{K}$ satisfy

\displaystyle F(P(V))\leq H_{x}(P(V),V).

(40)

For any $x_{t}\in\mathbb{R}^{K}$ one has

\displaystyle F(x_{t})-F(P(V))\geq\sum_{k=1}^{K}-\frac{(|(x_{t})_{k}|-|(V)_{k}|)^{2}}{2|(V)_{k}|}.

(41)

Proof: Recalling the majorizer for states, i.e., $h(x_{t},V)$ in (5), it can be deduced from (22)-(23) that $P(V)$ satisfies

\displaystyle\nabla f(P(V))+\nabla_{x_{t}}h(P(V),V)=0.

(42)

Then, we know from (12) that

\displaystyle x_{t}^{i+1}=P(x_{t}^{i}).

(43)

It follows from (5) that (40) holds. Since $f(x_{t})$ and $h(x_{t},V)$ are convex in $x_{t}$ , we have

	$\displaystyle f(x_{t})-f(P(V))\geq\langle x_{t}-P(V),\nabla f(P(V))\rangle,$		(44)
	$\displaystyle h(x_{t},V)-h(P(V),V)\geq\langle x_{t}-P(V),\nabla_{x_{t}}h(P(V),V)\rangle.$		(45)

Hence, with (40), (21) and (22), we have

	$\displaystyle F(x_{t})-F(P(V))$
	$\displaystyle\geq F(x_{t})-H_{x}(P(V),V)$
	$\displaystyle=f(x_{t})+g(x_{t})-f(P(V))-h(P(V),V)$
	$\displaystyle\geq\langle x_{t}-P(V),\nabla f(P(V))\rangle+h(x_{t},x_{t})-h(P(V),V)$
	$\displaystyle=\langle x_{t}-P(V),\nabla f(P(V))\rangle+h(x_{t},V)-h(P(V),V)+h(x_{t},x_{t})-h(x_{t},V)$
	$\displaystyle\geq\langle x_{t}-P(V),\nabla f(P(V))\rangle+\langle x_{t}-P(V),\nabla_{x_{t}}h(P(V),V)\rangle+h(x_{t},x_{t})-h(x_{t},V)$
	$\displaystyle=h(x_{t},x_{t})-h(x_{t},V).$		(46)

Note that the fourth line applies (44) and $g(x_{t})=h(x_{t},x_{t})$ , the sixth line applies (45), and the last line applies (42).

It follows from (5) and Appendix A that

$\displaystyle h(x_{t},x_{t})-h(x_{t},V)$	$\displaystyle=\sum_{k=1}^{K}\left[(x_{t})_{k}\text{sign}((x_{t})_{k})-\frac{\text{sign}((V)_{k})}{2(V)_{k}}\left((x_{t})_{k}^{2}+(V)_{k}^{2}\right)\right]$
	$\displaystyle=\sum_{k=1}^{K}\left[|(x_{t})_{k}|-\frac{(x_{t})_{k}^{2}+(V)_{k}^{2}}{2|(V)_{k}|}\right]$
	$\displaystyle=\sum_{k=1}^{K}-\frac{(|(x_{t})_{k}|-|(V)_{k}|)^{2}}{2|(V)_{k}|}\leq 0.$	(47)

Substituting it into (46) yields (41). This completes the proof.
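For completeness, the last equality in (47) is a completion of the square in each component. Writing $a=|(x_{t})_{k}|$ and $b=|(V)_{k}|>0$ (shorthand introduced here only for this step) and using $(x_{t})_{k}^{2}=a^{2}$ and $(V)_{k}^{2}=b^{2}$ , one has

```latex
a-\frac{a^{2}+b^{2}}{2b}
  =\frac{2ab-a^{2}-b^{2}}{2b}
  =-\frac{(a-b)^{2}}{2b}\leq 0 .
```

The bound is tight exactly when $a=b$ , i.e., when the magnitudes of $(x_{t})_{k}$ and $(V)_{k}$ coincide.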

Proof of Theorem 1

It can be inferred from the derivations that

\displaystyle F(x_{t}^{i})=H_{x}(x_{t}^{i},x_{t}^{i})\leq H_{x}(x_{t}^{i},x_{t}^{i-1})\leq H_{x}(x_{t}^{i-1},x_{t}^{i-1})=F(x_{t}^{i-1})

(48)

where the two inequalities hold with equality only at $x_{t}^{i}=x_{t}^{i-1}$ , i.e., when $x_{t}^{i}$ satisfies the optimality condition (7). That is, $F(x_{t}^{i})$ decreases monotonically until $x_{t}^{i}$ satisfies the optimality condition. Moreover, it follows from the approximation shown in (4) that the approximation gap is

\displaystyle\|e_{t,n}\|_{1}-m\bar{D}\leq f_{s}(e_{t,n})\leq\|e_{t,n}\|_{1}

(49)

where $\bar{D}=\underset{\|\alpha\|_{\infty}\leq 1}{\text{max}}\frac{1}{2}\|\alpha\|_{2}^{2}$ . It indicates that $F(x_{t})$ is lower-bounded such that

\displaystyle E_{x}(x_{t})-\lambda m\bar{D}\leq F(x_{t})\leq E_{x}(x_{t})

(50)

where $E_{x}(x_{t})\geq 0$ . Therefore, $F(x_{t}^{i})$ is monotonically decreasing and bounded below, and hence convergent, with bounds given in terms of $E_{x}(x_{t}^{i})$ .
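The monotone decrease in (48) can be observed on a toy problem. The sketch below is a deliberately simplified illustration and not the paper's Algorithm: it assumes an identity dictionary and no temporal term, so the objective is $\frac{1}{2}\|y-x\|_{2}^{2}+\mu\|x\|_{1}$ , and it replaces the $\ell_{1}$ penalty by the quadratic majorizer built at the previous iterate. The names `objective` and `mm_step` and all numerical values are assumptions. For this toy objective the known minimizer is the soft-thresholded $y$ , which the iterates approach.

```python
def objective(x, y, mu):
    """Toy objective 0.5*||y - x||^2 + mu*||x||_1 (identity dictionary, an assumption)."""
    return sum(0.5 * (yk - xk) ** 2 + mu * abs(xk) for xk, yk in zip(x, y))


def mm_step(v, y, mu):
    """Minimize 0.5*(y - x)^2 + mu/(2|v|)*x^2 componentwise; requires nonzero v."""
    # Setting the derivative to zero gives x * (1 + mu/|v|) = y.
    return [yk * abs(vk) / (abs(vk) + mu) for vk, yk in zip(v, y)]


# Illustrative data (assumptions): observation, penalty weight, nonzero start.
y = [3.0, -0.2, 1.5]
mu = 0.5
x = [1.0, 1.0, 1.0]
history = [objective(x, y, mu)]
for _ in range(200):
    x = mm_step(x, y, mu)
    history.append(objective(x, y, mu))
```

The recorded `history` is non-increasing, and the component whose optimum is zero shrinks geometrically toward zero without reaching it exactly, which is consistent with the requirement that the iteration never starts from $x_{t}^{0}=0$ .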

By taking $x_{t}=x_{t}^{*}$ , $P(V)=x_{t}^{i+1}$ , and $V=x_{t}^{i}$ in Lemma 1, we can write

\displaystyle F(x_{t}^{*})-F(x_{t}^{i+1})\geq\sum_{k=1}^{K}-\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{2|(x_{t}^{i})_{k}|}.

(51)

Summing it for $s$ iterations yields

\displaystyle sF(x_{t}^{*})-\sum_{i=1}^{s}F(x_{t}^{i})\geq\sum_{i=0}^{s-1}\sum_{k=1}^{K}-\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{2|(x_{t}^{i})_{k}|}.

(52)

Adding $\sum_{i=1}^{s}F(x_{t}^{i})-sF(x_{t}^{s})$ to both sides yields

\displaystyle sF(x_{t}^{*})-sF(x_{t}^{s})\geq\sum_{i=0}^{s-1}\sum_{k=1}^{K}-\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{2|(x_{t}^{i})_{k}|}+\sum_{i=1}^{s}\left(F(x_{t}^{i})-F(x_{t}^{s})\right).

(53)

From (48) we infer that $\sum_{i=1}^{s}\left(F(x_{t}^{i})-F(x_{t}^{s})\right)\geq 0$ . Dropping this non-negative term, dividing by $s$ , and rearranging, (53) becomes

\displaystyle F(x_{t}^{s})-F(x_{t}^{*})

\displaystyle\leq\frac{1}{2s}\sum_{i=0}^{s-1}\sum_{k=1}^{K}\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{|(x_{t}^{i})_{k}|}.

(54)

Let $x_{t}^{*}$ be the optimal sparse solution satisfying (7). Since $F(x_{t}^{i})$ is monotonically decreasing to $F(x_{t}^{*})$ , as well as the sequence $R_{x}^{i}$ in (13), $|x_{t}^{i}|$ approaches $|x_{t}^{*}|$ monotonically. The sign of the initial $x_{t}^{0}$ does not influence the result because only $|x_{t}^{0}|$ is used: the update can view $x_{t}^{0}$ as positive and drive it to a non-negative $x_{t}^{*}$ or, similarly, view $x_{t}^{0}$ as negative and drive it to a non-positive $x_{t}^{*}$ . Note that we never choose $x_{t}^{0}=0$ . Therefore, for an optimal value $(x_{t}^{*})_{k}=0$ , one has

\displaystyle\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{|(x_{t}^{i})_{k}|}\leq|(x_{t}^{i})_{k}|.

(55)

For an optimal value $0<|(x_{t}^{0})_{k}|\leq|(x_{t}^{*})_{k}|$ , one has

\displaystyle\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{|(x_{t}^{i})_{k}|}\leq\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{|(x_{t}^{0})_{k}|}.

(56)

For an optimal value $0<|(x_{t}^{*})_{k}|<|(x_{t}^{0})_{k}|$ , one has

\displaystyle\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{|(x_{t}^{i})_{k}|}\leq\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{|(x_{t}^{*})_{k}|}.

(57)

Using (55)-(57) in (54) for any $x_{t}^{*}\in\mathbb{R}^{K}$ , we write

	$\displaystyle F(x_{t}^{s})-F(x_{t}^{*})$	$\displaystyle\leq\frac{1}{2s}\sum_{i=0}^{s-1}\sum_{k=1}^{K}\frac{(|(x_{t}^{*})_{k}|-|(x_{t}^{i})_{k}|)^{2}}{\tilde{1}|(x_{t}^{0})_{k}|+(1-\tilde{1}-\bar{1})|(x_{t}^{*})_{k}|+\bar{1}|(x_{t}^{i})_{k}|}$
		$\displaystyle=\frac{1}{2s}\sum_{i=0}^{s-1}(|x_{t}^{*}|-|x_{t}^{i}|)^{T}R(|x_{t}^{*}|-|x_{t}^{i}|)$		(58)

where $R=\text{diag}\{1/(\tilde{1}|(x_{t}^{0})_{k}|+(1-\tilde{1}-\bar{1})|(x_{t}^{*})_{k}|+\bar{1}|(x_{t}^{i})_{k}|)\}$ , $k\in\{1,2,...,K\}$ , with $\tilde{1}=1$ if $|(x_{t}^{*})_{k}|\geq|(x_{t}^{0})_{k}|>0$ , $\tilde{1}=0$ if $0\leq|(x_{t}^{*})_{k}|<|(x_{t}^{0})_{k}|$ , $\bar{1}=1$ if $|(x_{t}^{*})_{k}|=0$ , and $\bar{1}=0$ otherwise. It can be inferred from the uniqueness of $x_{t}^{*}$ and the monotonic convergence of $F(x_{t}^{i})$ that the upper bound in (58) decreases with the number of iterations $s$ . This completes the proof.

Proof of Theorem 2

We write $E_{x}(x_{t}^{s})-E_{x}(x_{t}^{*})$ in three pairs as

\displaystyle E_{x}(x_{t}^{s})-E_{x}(x_{t}^{*})=E_{x}(x_{t}^{s})-F(x_{t}^{s})+F(x_{t}^{s})-F(x_{t}^{*})+F(x_{t}^{*})-E_{x}(x_{t}^{*}).

(59)

The first and third pairs in (59), i.e., $E_{x}(x_{t}^{s})-F(x_{t}^{s})$ and $F(x_{t}^{*})-E_{x}(x_{t}^{*})$ , are bounded by the gap of approximation shown in (50). That is

	$\displaystyle E_{x}(x_{t}^{s})-\lambda m\bar{D}\leq F(x_{t}^{s})\leq E_{x}(x_{t}^{s}),$		(60)
	$\displaystyle E_{x}(x_{t}^{*})-\lambda m\bar{D}\leq F(x_{t}^{*})\leq E_{x}(x_{t}^{*}).$		(61)

Hence, $E_{x}(x_{t}^{s})-F(x_{t}^{s})$ is upper-bounded by $\lambda m\bar{D}$ , and $F(x_{t}^{*})-E_{x}(x_{t}^{*})$ is upper-bounded by 0. From Theorem 1, the second pair $F(x_{t}^{s})-F(x_{t}^{*})$ is bounded by (24). Therefore, we can conclude (25). This completes the proof.
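Spelled out, the three bounds combine term by term in (59) as follows; this display only restates (59)-(61), and the middle term is then replaced by the bound from Theorem 1 to obtain the final statement.

```latex
E_{x}(x_{t}^{s})-E_{x}(x_{t}^{*})
  =\underbrace{E_{x}(x_{t}^{s})-F(x_{t}^{s})}_{\leq\,\lambda m\bar{D}}
  +\underbrace{F(x_{t}^{s})-F(x_{t}^{*})}_{\text{bounded by Theorem 1}}
  +\underbrace{F(x_{t}^{*})-E_{x}(x_{t}^{*})}_{\leq\,0}
  \leq\lambda m\bar{D}+F(x_{t}^{s})-F(x_{t}^{*}) .
```

The smoothing parameter $m$ therefore sets a floor on the achievable gap in the original objective, independent of the number of iterations.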

Proof of Theorem 3

It is seen from (14) that $u_{t}^{j+1}>0$ given $u_{t}^{j}>0$ with a normalized matrix $B$ . Also, we observe a trade-off between the effects of $|u_{t}^{j}|$ and $e^{-Bu_{t}^{j}}$ on the update: as one moves away from zero, the other approaches zero. Based on the fact that $\lim_{u_{t}^{j}\rightarrow 0}u_{t}^{j}.(\bar{B}e^{-Bu_{t}^{j}})=0$ and $\lim_{u_{t}^{j}\rightarrow\infty}u_{t}^{j}.(\bar{B}e^{-Bu_{t}^{j}})=0$ , where $\bar{B}$ is a constant matrix with non-negative elements, we can infer that the update (14) does not diverge and that the updated $u_{t}^{j+1}$ has an upper bound. Recalling Algorithm 2 and the condition (11), we can write (14) as

$\displaystyle u_{t}^{j+1}-u_{t}^{j}$	$\displaystyle=R_{u}^{j}(-\nabla f_{u}(u_{t}^{j}))-u_{t}^{j}$
	$\displaystyle=-R_{u}^{j}(\nabla f_{u}(u_{t}^{j})+(R_{u}^{j})^{-1}u_{t}^{j})$
	$\displaystyle=-R_{u}^{j}(\nabla f_{u}(u_{t}^{j})+\nabla_{u_{t}}h_{u}(u_{t}^{j},u_{t}^{j}))$
	$\displaystyle=-R_{u}^{j}\nabla_{u_{t}}H_{u}(u_{t}^{j},u_{t}^{j})$	(62)

It follows from (15) that $R_{u}^{j}>0$ is a diagonal matrix during the learning. Therefore, the update law in Algorithm 2 for causes admits a gradient descent form with a positive-definite diagonal matrix as the step size. The learning stops when $R_{u}^{j}=0$ , i.e., $u_{t}^{j}=0$ , and $H_{u}(u_{t}^{j},.)$ is minimized. That is, the method learns until $u_{t}$ becomes sparse and the optimality condition (11) is met. Taking the Taylor expansion of $H_{u}(u_{t}^{j+1},.)$ up to first order, we have

$\displaystyle H_{u}(u_{t}^{j+1},u_{t}^{j})$	$\displaystyle=H_{u}(u_{t}^{j},u_{t}^{j})+(\nabla_{u_{t}}H_{u}(u_{t}^{j},u_{t}^{j}))^{T}(u_{t}^{j+1}-u_{t}^{j})$
	$\displaystyle=H_{u}(u_{t}^{j},u_{t}^{j})-(\nabla_{u_{t}}H_{u}(u_{t}^{j},u_{t}^{j}))^{T}R_{u}^{j}(\nabla_{u_{t}}H_{u}(u_{t}^{j},u_{t}^{j}))$
	$\displaystyle\leq H_{u}(u_{t}^{j},u_{t}^{j})$	(63)

Combining it with (26)-(27) yields

\displaystyle E_{u}(u_{t}^{j+1})

\displaystyle=H_{u}(u_{t}^{j+1},u_{t}^{j+1})\leq H_{u}(u_{t}^{j+1},u_{t}^{j})\leq H_{u}(u_{t}^{j},u_{t}^{j})=E_{u}(u_{t}^{j})

(64)

with equality at $u_{t}^{j+1}=u_{t}^{j}$ . It can be concluded that the function $E_{u}$ decreases under Algorithm 2 for cause inference. This convergence is also verified by the experiments.

Lemma 1 still holds if we replace $f(x_{t}),h(x_{t},V),F,H_{x}$ with $g(u_{t}),h(u_{t},V),E_{u},H_{u}$ , respectively. Let $V=u_{t}^{j}$ and $P(V)=u_{t}^{j+1}$ , and let $u_{t}^{*}$ be the optimal solution satisfying (11). Following Theorem 1 we have

\displaystyle E_{u}(u_{t}^{s})-E_{u}(u_{t}^{*})

\displaystyle\leq\frac{1}{2s}\sum_{j=0}^{s-1}\sum_{k=1}^{D}\frac{(|(u_{t}^{*})_{k}|-|(u_{t}^{j})_{k}|)^{2}}{|(u_{t}^{j})_{k}|},

(65)

and consequently (28). This completes the proof.

Appendix C Appendix for more results and AE architecture details

We used a simple data set of moving geometric shapes to further demonstrate the video clustering mechanism of MM-DPCN. Each video contains three geometric shapes: diamond, triangle, and square. Each shape appears consistently for 100 frames until another shape shows up. The shape can appear in any patch of the image frame and can move during the 100 frames in which it is shown.

To visualize the learned filters, the plots for matrix $C^{1}$ are provided in Fig. 7.

The architectures for AE and WTA-RNN-AE used for the comparison results are provided in Table 4. We use the same architectures for both Mario and Coil-100 data sets.

Table 4: AE and WTA-RNN-AE architectures.

layer name	AE	WTA-RNN-AE
encoder_layer1	$\left[3,128\right]$	$\left[3,256\right]$
encoder_layer2	$\left[128,64\right]$	$\left[256,128\right]$
encoder_layer3	$\left[64,36\right]$	$\left[128,64\right]$
encoder_layer4	$\left[36,18\right]$	*
encoder_layer5	$\left[18,9\right]$	*
RNN	*	$\left[64,64\right]\times 5$