
On The Multi-View Information Bottleneck Representation

Teng-Hui Huang Electrical and Computer Engineering
Purdue University
West Lafayette, IN, USA
huan1456@purdue.edu
   Aly El Gamal Emerging Technologies Lab
InterDigital
Los Altos, CA, USA
aly.elgamal@interdigital.com
   Hesham El Gamal Electrical and Information Engineering
The University of Sydney
NSW, Australia
hesham.elgamal@sydney.edu.au
Abstract

In this work, we generalize the information bottleneck (IB) approach to the multi-view learning context. The exponentially growing complexity of the optimal representation motivates the development of two novel formulations with more favorable performance-complexity tradeoffs. The first approach is based on forming a stochastic consensus and is suited for scenarios with significant representation overlap between the different views. The second method, relying on incremental updates, is tailored for the other extreme scenario with minimal representation overlap. In both cases, we extend our earlier work on the alternating direction method of multipliers (ADMM) solver and establish its convergence and scalability. Empirically, we find that the proposed methods outperform state-of-the-art approaches in multi-view classification problems under a broad range of modelling parameters.

Index Terms:
Information bottleneck, consensus ADMM, non-convex optimization, classification, multi-view learning.

I Introduction

Recently, learning from multi-view data has drawn significant interest in the machine learning and data science community (e.g., [1, 2, 3, 4, 5]). In this context, a view of data is a description or observation about the source. For example, an object can be described in words or images. It is natural to expect learning from multi-view data to improve performance [6].

The challenges in multi-view learning are two-fold. First, one can naively combine all view observations into one giant view, which loses no information but suffers from exponential growth of the dimensionality of the merged observation; we refer to this as the performance-complexity trade-off. Second, if one instead opts to extract view-shared or view-specific relevant features from each view, then heterogeneous forms of observations (e.g., audio and images) make it difficult to learn low-complexity and meaningful representations with a unified method. In other words, the amount of representation overlap across the view observations is important for efficient multi-view learning.

To address these challenges, several recent works have applied the IB [7] principle to multi-view learning, since it matches the objective well, namely trading off relevance and complexity in extracting both view-shared and view-specific features [8, 9, 10, 11]. This generalization is known as the multi-view IB (MvIB), a special case of the multi-terminal remote source coding problem with logarithmic loss. The logarithmic loss corresponds to soft reconstruction, where the likelihood of all possible outcomes is received, in contrast to a single reconstructed symbol in conventional source coding problems [12, 13, 14, 15, 16].

In the literature, the achievable region of the remote source coding problem is characterized in [17] for discrete cases and, more recently, for jointly Gaussian cases as well [13]. Along with the characterization, a variety of variational inference-based algorithms have been proposed [14, 15]. This type of algorithm introduces extra variational variables to facilitate the optimization: by fixing one of the two sets of variables, the overall objective function is convex w.r.t. the other set. Optimizing in this alternating fashion then assures convergence.

Extending this line of research, our approach is rooted in a top-down information-theoretic formulation closely related to the optimal characterization of MvIB. Moreover, contrary to [11], which relies on black-box deep neural networks, we propose two constructive information-theoretic formulations with performance comparable to that of the optimal joint-view approach. The two approaches are motivated by two extreme multi-view learning scenarios: the first is characterized by a significant representation overlap between the different views, which favors our consensus-complement two-stage formulation, whereas the second extreme scenario is characterized by a minimal representation overlap, leading to our incremental update approach.

Different from existing variational inference-based algorithms that avoid dealing with the non-convexity of the overall objective function, in both of the proposed methods we adopt non-convex consensus ADMM as the main tool in deriving our solvers [18, 19, 20]. These new solvers can therefore be viewed as generalizations of our earlier work on the single-view ADMM algorithm [21]. More specifically, in the consensus-complement version, we separate the proposed Lagrangian into consensus and complement sub-objective functions and then proceed to solve the optimization problem in two steps. The new ADMM solver can hence efficiently form a consensus representation in large-scale multi-view learning problems with significantly lower dimensions compared to joint-view approaches. The same intuition is applied to the incremental update approach, as detailed in the sequel. Finally, we prove the linear rate of convergence of our solvers under significantly milder constraints compared with earlier convergence results on this class of solvers [22, 23, 24]. More specifically, we relax the need for a strongly convex sub-objective function and, moreover, establish the linear rate of convergence around a local optimal point in each case.

II Multiview Information Bottleneck

Given $V$ views with observations $\{X^{(i)}\}_{i=1}^{V}$ generated from a target variable $Y$, we aim to find a set of individual representations $\{Z\}$ that is most compressed w.r.t. the individual-view observations $X^{(i)},\forall i\in[V]$, while at the same time maximizing its relevance toward the target $Y$ through $X^{(i)}$.

Using a Lagrangian multiplier formulation, the problem can be cast as:

$$\mathcal{L}_{\text{MvIB}} := \gamma I(\{X\}_{V};\{Z\}) - I(Y;\{Z\}), \qquad (1)$$

where $\{X\}$ denotes the set of all $V$ views of observations and $\{Z\}$ is the set of representations to be designed. Note that if the observations are combined in this manner and treated as one view, the above reduces to the standard IB and can be solved with any existing single-view solver. However, combining all the observations in one giant view will result in an exponential increase in complexity (curse of dimensionality).

A basic assumption in the multi-view learning literature is conditional independence [25, 9], where the observations of all views $\{X^{(i)}\}_{i=1}^{V}$ are independent given the target variable $Y$, that is, $p(\{x\}|y)=\prod_{i=1}^{V}p(x^{(i)}|y)$. In the next two sections, we use this conditional independence assumption while constraining the set of allowable latent representations $\{Z\}$ to develop our two novel information-theoretic formulations of the Multi-view IB (MvIB) problem.
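As a small illustration of this factorization (a sketch under assumed values, not code from the paper), the following Python snippet builds the joint distribution $p(y,x^{(1)},x^{(2)})=p(y)\,p(x^{(1)}|y)\,p(x^{(2)}|y)$ for two discrete views and checks its marginals; the tables happen to match the synthetic example later given in (19).

```python
# Minimal sketch of the conditional-independence factorization
# p({x}|y) = prod_i p(x^(i)|y) for two discrete views (illustrative values).
import numpy as np

p_y = np.array([0.5, 0.5])
Px1_y = np.array([[0.75, 0.05], [0.20, 0.20], [0.05, 0.75]])  # p(x^(1)|y), 3x2
Px2_y = np.array([[0.85, 0.15], [0.15, 0.85]])                # p(x^(2)|y), 2x2

# joint[k, m, n] = p(y_k) p(x^(1)_m | y_k) p(x^(2)_n | y_k)
joint = p_y[:, None, None] * Px1_y.T[:, :, None] * Px2_y.T[:, None, :]

print(joint.sum())             # sums to 1.0
print(joint.sum(axis=(0, 2)))  # marginal p(x^(1)) = Px1_y @ p_y
```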

II-A Consensus-Complement Form

Inspired by the co-training methods in multi-view research [25], we design the set of latent representations $\{Z\}$ to consist of a consensus representation $Z_{c}$ and view-specific complement components $Z_{e}^{(i)},\forall i\in[V]$. Then, by the chain rule of mutual information, the Lagrangian of (1) becomes

$$\mathcal{L}_{\text{con}} = \gamma I(Z_{c};\{X\}) - I(Z_{c};Y) + \sum_{i=1}^{V}\Big[\gamma I(Z_{e}^{(i)};\{X\}|Z_{c},\{Z_{e}\}_{i-1}) - I(Y;Z_{e}^{(i)}|Z_{c},\{Z_{e}\}_{i-1})\Big], \qquad (2)$$

where the sequence $\{Z_{e}\}_{j}:=\{Z_{e}^{(1)},\cdots,Z_{e}^{(j)}\}$ is defined to represent the accumulated complement views. To further simplify the above, we restrict the set of possible representations to satisfy the following conditions (similar to [25, 11]):

  • There always exist constants $\kappa_{i},\forall i\in[V]$, independent of the observations $\{X\}$, such that $\kappa_{i}I(Z_{c};X^{(i)})=I(Z_{c};X^{(i)}|\{X\}_{i-1})$.

  • $Y\rightarrow X^{(i)}\rightarrow Z_{e}^{(i)}\leftarrow Z_{c}$ forms a Markov chain. That is, $Z_{c}$ is side-information for $Z_{e}^{(i)}$.

  • For each view $i\in[V]$, given the consensus $Z_{c}$, $\{Z_{e}\}$ are independent.

Under these constraints, (2) can be rewritten as the superposition of two parts, i.e., $\mathcal{L}:=\bar{\mathcal{L}}+\sum_{i=1}^{V}\mathcal{L}_{e}^{(i)}$, where the first component $\bar{\mathcal{L}}$ is defined as the multi-view consensus IB Lagrangian:

$$\bar{\mathcal{L}}:=\sum_{i=1}^{V}\gamma_{i}I(Z_{c};X^{(i)})-I(Z_{c};Y), \qquad (3)$$

and the second consists of $V$ terms with each one corresponding to a complement sub-objective for each view:

$$\mathcal{L}_{e}^{(i)}:=\gamma I(Z_{e}^{(i)};X^{(i)}|Z_{c})-I(Z_{e}^{(i)};Y|Z_{c}),\quad\forall i\in[V]. \qquad (4)$$

We can now recast $\bar{\mathcal{L}}$ in (3) as:

$$\bar{\mathcal{L}} = -\sum_{i=1}^{V}\gamma_{i}H(Z|X^{(i)}) + \Big(-1+\sum_{i=1}^{V}\gamma_{i}\Big)H(Z) + H(Z|Y) = \sum_{i=1}^{V}F_{i}(p_{z|x,i}) + G(p_{z},p_{z|y}). \qquad (5)$$

Similarly, we rewrite (4), $\forall i\in[V]$, as:

$$\mathcal{L}_{e}^{(i)} = -\gamma H(Z_{e}^{(i)}|Z_{c},X^{(i)}) + (\gamma-1)H(Z_{e}^{(i)}|Z_{c}) + H(Z_{e}^{(i)}|Z_{c},Y). \qquad (6)$$

By representing the discrete (conditional) probabilities as vectors/tensors, we can solve (5) and (6) with augmented Lagrangian methods. We define the following vectors:

$$\begin{aligned}
p_{z|x,i} &:= \begin{bmatrix} p(z_{1}|x_{1}^{(i)}) & \cdots & p(z_{1}|x_{N_{i}}^{(i)}) & \cdots & p(z_{L}|x_{N_{i}}^{(i)}) \end{bmatrix}^{T},\\
p_{z} &:= \begin{bmatrix} p(z_{1}) & \cdots & p(z_{L}) \end{bmatrix}^{T},\\
p_{z|y} &:= \begin{bmatrix} p(z_{1}|y_{1}) & \cdots & p(z_{1}|y_{K}) & \cdots & p(z_{L}|y_{K}) \end{bmatrix}^{T},
\end{aligned} \qquad (7)$$

where $N_{i}:=|\mathcal{X}^{(i)}|,\forall i\in[V]$, $L:=|\mathcal{Z}|$, and $K:=|\mathcal{Y}|$. For clarity, we rewrite the primal variable for each view as $p_{z|x,i}:=p_{i}$, and cascade the augmented variables, which gives $q:=\begin{bmatrix}p_{z}^{T}&p_{z|y}^{T}\end{bmatrix}^{T}$. On the other hand, for the complement part, we define the following tensors:

$$\begin{aligned}
\pi_{x,i}[m,n,r] &:= P(Z_{e}^{(i)}=z_{e,m}^{(i)}|Z_{c}=z_{c,n},X^{(i)}=x_{r}^{(i)}),\\
\pi_{y,i}[m,n,r] &:= P(Z_{e}^{(i)}=z_{e,m}^{(i)}|Z_{c}=z_{c,n},Y=y_{r}),\\
\pi_{z,i}[m,n] &:= P(Z_{e}^{(i)}=z_{e,m}^{(i)}|Z_{c}=z_{c,n}).
\end{aligned} \qquad (8)$$

Then we present the consensus-complement MvIB augmented Lagrangian as follows. For the consensus part:

$$\mathcal{L}_{c}(\{p_{i}\}_{i=1}^{V},q,\{\nu_{i}\}_{i=1}^{V}) = \sum_{i=1}^{V}\Big[F_{i}(p_{i})+\langle\nu_{i},A_{i}p_{i}-q\rangle+\frac{c}{2}\lVert A_{i}p_{i}-q\rVert^{2}\Big]+G(q), \qquad (9)$$

where $\lVert\cdot\rVert$ is the 2-norm. As for the complement part, let

$$\begin{aligned}
F_{e,i} &= -\gamma H(Z_{e}^{(i)}|Z_{c},X^{(i)}),\\
G_{e,i} &= (\gamma-1)H(Z_{e}^{(i)}|Z_{c})+H(Z_{e}^{(i)}|Z_{c},Y).
\end{aligned}$$

As for the tensors used in the complement step, denote by $\pi[t]$ the slice corresponding to the realization $t\in\mathcal{Z}_{c}$ of the consensus representation. By Bayes' rule, we can recover the linear expression $\pi_{y,i}[t]=\Lambda^{-1}_{z_{c}|y}[t]A_{x^{(i)}|y}^{T}\pi_{x,i}[t]$, where $\Lambda_{z_{c}|y}[t]$ is a diagonal matrix formed from the vector $p_{z_{c}=t|y},\forall y\in\mathcal{Y}$. To simplify notation, define the equivalent prior as a linear operator $\tilde{A}_{x^{(i)}|y,t}:=\Lambda^{-1}_{z_{c}|y}[t]A^{T}_{x^{(i)}|y}$; then we can express the augmented Lagrangian for the complement step as $\mathcal{L}^{(i)}_{e,c}=\sum_{t\in\mathcal{Z}_{c}}\mathcal{L}_{e,c}^{(i)}[t]$, with each term defined as:

$$\mathcal{L}_{e,c}^{(i)}[t] = F_{e,i}(\pi_{x,i}[t])+G_{e,i}(\pi_{q,i}[t]) + \langle\mu_{i},\tilde{A}_{e}\pi_{x,i}[t]-\pi_{q,i}[t]\rangle+\frac{c}{2}\lVert\tilde{A}_{e}\pi_{x,i}[t]-\pi_{q,i}[t]\rVert^{2}, \qquad (10)$$

where $c>0$ is the penalty coefficient and the linear penalty $A_{i}p_{i}-q$ for each view $i\in[V]$ encourages the variables $q$ and each $p_{i}$ to satisfy the marginal probability and Markov chain conditions. Specifically, letting $\otimes$ denote the Kronecker product, $A_{z,i}:=I\otimes p_{x^{(i)}}^{T}$ and $A_{z|y,i}:=I\otimes P_{x^{(i)}|y}^{T}$, where $P_{x^{(i)}|y}$ is the matrix form of the conditional distribution $p(x^{(i)}|y)$ with entry $(m,n)$ equal to $p(x^{(i)}_{m}|y_{n})$.
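The following numerical sketch (an illustration with placeholder values, not the paper's implementation) shows how these Kronecker-structured operators act on the vectorization in (7): $A_{z,i}$ recovers $p_{z}$ and $A_{z|y,i}$ recovers $p_{z|y}$ from the encoder $p_{z|x,i}$.

```python
# Sketch: the linear operators A_{z,i} = kron(I, p_x^T) and
# A_{z|y,i} = kron(I, P_{x|y}^T) applied to the vectorized encoder of (7).
# The view distribution and encoder Q below are placeholders.
import numpy as np

p_y = np.array([0.5, 0.5])
Px_y = np.array([[0.75, 0.05], [0.20, 0.20], [0.05, 0.75]])  # p(x^(i)|y), N_i=3
p_x = Px_y @ p_y                                             # marginal p(x^(i))
L = 2                                                        # |Z|

Q = np.array([[0.9, 0.3, 0.2],                               # Q[l, n] = p(z_l | x_n)
              [0.1, 0.7, 0.8]])
p_zx = Q.ravel()                                             # vectorization as in (7)

A_z = np.kron(np.eye(L), p_x[None, :])                       # A_{z,i}
A_zy = np.kron(np.eye(L), Px_y.T)                            # A_{z|y,i}

assert np.allclose(A_z @ p_zx, Q @ p_x)                      # recovers p_z
assert np.allclose(A_zy @ p_zx, (Q @ Px_y).ravel())          # recovers p_{z|y}
```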

We propose a two-step algorithm to solve (2), described as follows. The first step is solving (9) through the following consensus ADMM algorithm, $\forall i\in[V]$:

$$\begin{aligned}
p_{i}^{k+1} &:= \underset{p_{i}\in\Omega_{i}}{\arg\min}\,\mathcal{L}_{c}(\{p^{k+1}_{<i}\},p_{i},\{p^{k}_{>i}\},\{\nu_{i}^{k}\},q^{k}), &\qquad (11a)\\
\nu_{i}^{k+1} &:= \nu_{i}^{k}+c\left(A_{i}p_{i}^{k+1}-q^{k}\right), &\qquad (11b)\\
q^{k+1} &:= \underset{q\in\Omega_{q}}{\arg\min}\,\mathcal{L}_{c}(\{p^{k+1}_{i}\}_{i=1}^{V},\{\nu_{i}^{k+1}\},q). &\qquad (11c)
\end{aligned}$$

Then in the second step we solve (10) with a two-block ADMM:

$$\begin{aligned}
\pi^{k+1}_{x,i} &:= \underset{\pi_{x,i}\in\Pi^{(i)}_{x}}{\arg\min}\,\mathcal{L}_{e,c}(\pi_{x,i},\mu_{i}^{k},\pi^{k}_{q,i}), &\qquad (12a)\\
\mu^{k+1}_{i} &:= \mu^{k}_{i}+c(\tilde{A}_{e,i}\pi^{k+1}_{x,i}-\pi^{k}_{q,i}), &\qquad (12b)\\
\pi^{k+1}_{y,i} &:= \underset{\pi_{y,i}\in\Pi^{(i)}_{y}}{\arg\min}\,\mathcal{L}_{e,c}(\pi^{k+1}_{x,i},\mu^{k+1}_{i},\pi_{y,i}), &\qquad (12c)
\end{aligned}$$

where in (11) we use the short-hand notation $\{p^{k+1}_{<i}\}:=\{p^{k+1}_{l}\}_{l=1}^{i-1}$ to denote the primal variables of the first $i-1$ views that have already been updated to step $k+1$, and $\{p^{k}_{>i}\}:=\{p_{m}^{k}\}_{m=i+1}^{V}$ to denote the rest that are still at step $k$. We define $\{p^{k+1}_{<1}\}=\{\varnothing\}=\{p^{k}_{>V}\}$; in (11) and (12), the superscript $k$ denotes the step index; each of $\Omega_{i},\Omega_{q},\Pi_{x}^{(i)},\Pi_{y}^{(i)}$ denotes a compound probability simplex. The algorithm starts with (11a), updating each view in succession. Then the difference between the primal and augmented variables is added to the dual variables via (11b). Finally, the augmented variables are updated with (11c) to complete step $k$. After convergence of (11), we run (12) in a similar fashion for each view, which completes the full algorithm.
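To make the update order concrete, the sketch below runs the consensus step (11a)-(11c) on a small two-view model (the same tables later used in (19)). It is only schematic: each exact inner arg-min over a compound simplex is replaced by a generic numerical optimizer acting on softmax-reparameterized variables, and the values of $\gamma$, $c$, and the iteration count are assumptions for the example rather than the paper's settings.

```python
# Schematic sketch of the consensus ADMM updates (11a)-(11c) for two views.
# Not the authors' solver: each inner arg-min is approximated by a generic
# optimizer on softmax-reparameterized variables; gamma, c are assumed values.
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

L = 2                                                   # |Z_c|
p_y = np.array([0.5, 0.5])
Pxy = [np.array([[0.75, 0.05], [0.20, 0.20], [0.05, 0.75]]),   # p(x^(1)|y)
       np.array([[0.85, 0.15], [0.15, 0.85]])]                 # p(x^(2)|y)
p_x = [M @ p_y for M in Pxy]                                   # marginals p(x^(i))
gamma, c = [0.3, 0.3], 16.0

def A(i, Ei):     # linear map A_i: encoder p(z|x^(i)) -> [p_z ; vec(p_{z|y})]
    return np.concatenate([Ei.T @ p_x[i], (Ei.T @ Pxy[i]).ravel()])

def F(i, Ei):     # F_i(p_i) = -gamma_i * H(Z|X^(i))
    return gamma[i] * np.sum(p_x[i][:, None] * Ei * np.log(Ei + 1e-12))

def G(qv):        # G(q) = (sum_i gamma_i - 1) H(Z) + H(Z|Y), as in (5)
    pz, pzy = qv[:L], qv[L:].reshape(L, len(p_y))
    Hz = -np.sum(pz * np.log(pz + 1e-12))
    Hzy = -np.sum(p_y * np.sum(pzy * np.log(pzy + 1e-12), axis=0))
    return (sum(gamma) - 1.0) * Hz + Hzy

def enc(t, i):    # logits -> row-stochastic encoder p(z|x^(i))
    return softmax(t.reshape(len(p_x[i]), L), axis=1)

def q_of(t):      # logits -> feasible q = [p_z ; vec(p_{z|y})]
    return np.concatenate([softmax(t[:L]),
                           softmax(t[L:].reshape(L, len(p_y)), axis=0).ravel()])

rng = np.random.default_rng(0)
E = [softmax(rng.standard_normal((len(px), L)), axis=1) for px in p_x]
q = np.concatenate([np.full(L, 1.0 / L), np.full(L * len(p_y), 1.0 / L)])
nu = [np.zeros_like(q) for _ in p_x]

for _ in range(30):
    for i in range(2):                                  # (11a): per-view primal update
        def obj(t, i=i):
            Ei = enc(t, i)
            r = A(i, Ei) - q
            return F(i, Ei) + nu[i] @ r + 0.5 * c * np.sum(r * r)
        E[i] = enc(minimize(obj, np.log(E[i] + 1e-12).ravel()).x, i)
    for i in range(2):                                  # (11b): dual ascent
        nu[i] = nu[i] + c * (A(i, E[i]) - q)
    def obj_q(t):
        qq = q_of(t)
        return G(qq) + sum(nu[i] @ (A(i, E[i]) - qq)
                           + 0.5 * c * np.sum((A(i, E[i]) - qq) ** 2) for i in range(2))
    q = q_of(minimize(obj_q, np.log(q + 1e-12)).x)      # (11c): augmented update

print("consensus marginal p(z_c):", q[:L])
```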

II-B Incremental Update Form

Intuitively, the consensus-complement form works well when the common information in the observations $\{X\}$ across all views is abundant. However, if the views are almost distinct, where each view is a complement to the others, then the previous form is inefficient in the sense that learning the common part may have negligible benefit. To address this, we propose another formulation of the multi-view IB by restricting the representation set to $\{Z^{(i)}\}_{i=1}^{V}$. The incremental update multi-view IB Lagrangian is therefore given by:

$$\mathcal{L}_{\text{inc}}:=\sum_{i=1}^{V}\gamma I(\{X\};Z^{(i)}|\{Z\}_{i-1})-I(Y;Z^{(i)}|\{Z\}_{i-1}). \qquad (13)$$

Again, to simplify the above, the incremental form is required to satisfy the following constraint:

  • For each view $i\in[V]$, the corresponding representation $Z^{(i)}$ only accesses $X^{(i)}$, so $Y\rightarrow X^{(i)}\rightarrow Z^{(i)}\leftarrow\{Z\}_{i-1}$ forms a Markov chain.

With the assumption above, in each step we can replace the observations of all views $\{X\}$ with the view-specific observation $X^{(i)}$ and rewrite (13) as:

$$\mathcal{L}_{\text{inc}}:=\sum_{i=1}^{V}\gamma I(X^{(i)};Z^{(i)}|\{Z\}_{i-1})-I(Y;Z^{(i)}|\{Z\}_{i-1}). \qquad (14)$$

In solving (14), we consider the following algorithm. At the $i^{th}$ step, we have:

$$\begin{aligned}
P^{(i)}_{z|x,z_{<i}} &:= \underset{P\in\Omega^{(i)}}{\arg\min}\,\mathcal{L}_{\text{inc}}(P,\{P^{(j)}_{z|y,z_{<j}}\}_{j=1}^{i-1}), &\qquad (15a)\\
p(z^{(i)}|y,z_{<i}) &= \frac{\sum_{x^{(i)}}p(x^{(i)}|y)\,p(z^{(i)},z_{<i}|x^{(i)})}{p(z_{<i}|y)}, &\qquad (15b)
\end{aligned}$$

where $P^{(i)}_{z|x,z_{<i}}$ denotes the tensor form of the conditional probability $p(z^{(i)}|x^{(i)},z^{(i-1)},\cdots,z^{(1)}),\forall i\in[V]$. The tensor is the primal variable for step $i$ and belongs to a compound simplex $\Omega^{(i)}$. In the algorithm, each step (15a) is solved with (11) by setting $V=1$ and treating the estimators from the previous steps as priors. For example, $p(z^{(2)},x^{(2)}|z^{(1)})=p(x^{(2)}|z^{(1)})\,p(z^{(2)}|z^{(1)},x^{(2)})$, and $p(x^{(2)}|z^{(1)})=\sum_{y}p(x^{(2)}|y)\,p(y|z^{(1)})$.
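For concreteness, the short sketch below assembles the step-2 prior $p(x^{(2)}|z^{(1)})=\sum_{y}p(x^{(2)}|y)\,p(y|z^{(1)})$ from step-1 quantities, mirroring the example above. The first-step encoder $p(z^{(1)}|x^{(1)})$ is a placeholder; in the algorithm it would be the step-1 solution of (11).

```python
# Sketch: building the step-2 prior of the incremental form from step-1 output.
# The encoder E1 = p(z^(1)|x^(1)) is a placeholder, not an ADMM solution.
import numpy as np

p_y = np.array([0.5, 0.5])
Px1_y = np.array([[0.75, 0.05], [0.20, 0.20], [0.05, 0.75]])   # p(x^(1)|y)
Px2_y = np.array([[0.85, 0.15], [0.15, 0.85]])                 # p(x^(2)|y)

E1 = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])            # placeholder p(z^(1)|x^(1))

# Bayes: p(y|z^(1)) = p(z^(1)|y) p(y) / p(z^(1)), with p(z^(1)|y) = sum_x p(z|x) p(x|y)
p_z1_y = E1.T @ Px1_y                                # shape (|Z1|, |Y|)
p_z1 = p_z1_y @ p_y                                  # marginal p(z^(1))
p_y_z1 = (p_z1_y * p_y[None, :]) / p_z1[:, None]     # p(y|z^(1)), rows sum to 1

# Equivalent prior for step 2: p(x^(2)|z^(1)) = sum_y p(x^(2)|y) p(y|z^(1))
p_x2_z1 = p_y_z1 @ Px2_y.T                           # shape (|Z1|, |X2|)
print(p_x2_z1.sum(axis=1))                           # each row sums to 1
```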

III Main Results

We propose two new information-theoretic formulations of MvIB and develop optimal-bound-achieving algorithms that parallel existing solvers [15, 26, 14]; our main results are the convergence proofs for the two proposed algorithms. The convergence analysis goes beyond the MvIB setting and recent non-convex multi-block ADMM convergence results, as we further show that strong convexity of $\{F_{i}\}_{i=1}^{V}$ is not necessary for proving convergence [24]. This new result connects our analysis to a more general class of functions that can be solved with multi-block non-convex consensus ADMM. For simplicity, we denote the collective point at step $k$ as $w^{k}:=(\{p_{i}^{k}\},\{\nu^{k}_{i}\},q^{k})$, write $\mathcal{L}_{c}^{k}:=\mathcal{L}_{c}(\{p_{i}^{k}\},\{\nu_{i}^{k}\},q^{k})$ for the function value evaluated at $w^{k}$, and let $\mu_{B},\lambda_{B}$ denote the smallest and largest singular values of a linear operator $B$, respectively.

Theorem 1

Suppose $F_{i}(p_{i})$ is $L_{i}$-smooth and $M_{i}$-Lipschitz continuous $\forall i\in[V]$, and $G(q)$ is $\sigma_{G}$-weakly convex. Further, let $\mathcal{L}_{c}$ be defined as in (9) and solved with algorithm (11). If the penalty coefficient satisfies $c>\max_{i\in[V]}\{(M_{i}\lambda_{A_{i}}L_{i})/\mu_{A_{i}A_{i}^{T}},\sigma_{G}/V\}$, then the sequence $\{w^{k}\}_{k\in\mathbb{N}}$ is finite and bounded. Moreover, $\{w^{k}\}_{k\in\mathbb{N}}$ converges linearly to a stationary point $w^{*}$ within a neighborhood such that $\mathcal{L}_{c}^{*}<\mathcal{L}_{c}<\mathcal{L}_{c}^{*}+\delta$ and $\lVert w-w^{*}\rVert<\epsilon$ for some $\delta,\epsilon>0$.

Proof:

The details of the proof are deferred to Appendix A. Here we explain the key ideas.

Continuing with the proof sketch, the first step is to establish a sufficient decrease lemma (Lemma 3), which assures that the function value $\mathcal{L}_{c}^{k}$ decreases from step $k$ to $k+1$ by an amount lower bounded by the positive squared norm $\lVert w^{k}-w^{k+1}\rVert^{2}$. We decompose $\mathcal{L}_{c}^{k}-\mathcal{L}_{c}^{k+1}$ according to each step of algorithm (11), $\forall i\in[V]$, as follows:

$$\begin{aligned}
\mathcal{L}_{c}^{k}-\mathcal{L}_{c}^{k+1}
= &\sum_{i=1}^{V}\Big[\mathcal{L}_{c}(\{p^{k+1}_{<i}\},p^{k}_{i},\{p^{k}_{>i}\},\{\nu^{k}\},q^{k})-\mathcal{L}_{c}(\{p^{k+1}_{<i}\},p^{k+1}_{i},\{p^{k}_{>i}\},\{\nu^{k}\},q^{k})\Big] &\quad (16a)\\
&+\sum_{i=1}^{V}\Big[\mathcal{L}_{c}(\{p^{k+1}\},\{\nu^{k+1}_{<i}\},\nu^{k}_{i},\{\nu^{k}_{>i}\},q^{k})-\mathcal{L}_{c}(\{p^{k+1}\},\{\nu^{k+1}_{<i}\},\nu^{k+1}_{i},\{\nu^{k}_{>i}\},q^{k})\Big] &\quad (16b)\\
&+\mathcal{L}_{c}(\{p^{k+1}\},\{\nu^{k+1}\},q^{k})-\mathcal{L}_{c}(\{p^{k+1}\},\{\nu^{k+1}\},q^{k+1}). &\quad (16c)
\end{aligned}$$

For each view, each difference in (16a) can be lower bounded using the convexity of $F_{i}$, and we get:

$$\mathcal{L}_{c}(\{p_{i}^{k}\})-\mathcal{L}_{c}(\{p_{i}^{k+1}\})\geq\sum_{i=1}^{V}\frac{c}{2}\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert^{2}. \qquad (17)$$

On the other hand, in (16c), a similar lower bound for $G$ follows from its $\sigma_{G}$-weak convexity. This results in a negative squared norm $-\frac{\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}$. Nonetheless, by the first-order minimizer conditions (24) and the identity $2\langle u-v,w-u\rangle=\lVert v-w\rVert^{2}-\lVert u-v\rVert^{2}-\lVert u-w\rVert^{2}$, the negative term is balanced by the penalty coefficient $c$, as the corresponding lower bound is (with the other variables fixed):

$$\mathcal{L}_{c}(q^{k})-\mathcal{L}_{c}(q^{k+1})\geq\frac{cV-\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}.$$

As for the dual update, (16b) gives a combination of negative norms $-\frac{1}{c}\sum_{i=1}^{V}\lVert\nu_{i}^{k}-\nu_{i}^{k+1}\rVert^{2}$. It turns out that, by the first-order minimizer condition for $F_{i}$ and its smoothness,

$$\nabla F_{i}(p_{i}^{k+1})=-A_{i}^{T}\nu_{i}^{k+1},$$

and the fact that $A_{i}$ is full row rank (which also holds for the complement step), we obtain:

$$\lVert\nu_{i}^{k}-\nu_{i}^{k+1}\rVert^{2}\leq\frac{\lambda^{2}_{A_{i}}L_{i}^{2}}{\mu^{2}_{A_{i}A_{i}^{T}}}\lVert p_{i}^{k}-p_{i}^{k+1}\rVert^{2}, \qquad (18)$$

where $\mu_{B},\lambda_{B}$ denote the smallest and largest singular values of a linear operator $B$. Then we need the following relation:

$$\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert\geq M\lVert p_{i}^{k}-p_{i}^{k+1}\rVert,\quad M>0,$$

which is non-trivial as $A_{i}$ is full row rank. To address this, we adopt the sub-minimization path method in [20], which is applicable since $F_{i}$ is convex. Observe that (11a) is equivalent to a proximal operator:

$$\Psi_{i}(\eta):=\underset{p_{i}\in\Omega_{i}}{\arg\min}\,F_{i}(p_{i})+\frac{c}{2}\lVert A_{i}p_{i}-\eta\rVert^{2},$$

with $\eta:=A_{i}p_{i}^{k}$ at step $k$. Using this technique, we obtain the desired result through the Lipschitz continuity of $F_{i}$:

$$\lVert\Psi_{i}(A_{i}p^{k}_{i})-\Psi_{i}(A_{i}p_{i}^{k+1})\rVert=\lVert p_{i}^{k}-p_{i}^{k+1}\rVert\leq M_{i}\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert,$$

which proves the sufficient decrease lemma and hence convergence (Appendix F).

We further prove that the rate of convergence is linear by explicitly showing that the Kurdyka-Łojasiewicz (KŁ) property [19, 27] is satisfied with exponent $\theta=1/2$ (Appendix D). This is based on the known result that the KŁ inequality characterizes the rate of convergence into three regimes in terms of $\theta$ [19] ($\theta=1/2$ corresponds to a linear rate). The proof that $\theta=1/2$ is again based on the convexity of $\{F_{i}\}$ and the weak convexity of $G$ and is deferred to Appendix D. We note that the linear rate holds in a neighborhood of a stationary point $w^{*}:=(\{p_{i}^{*}\},\{\nu_{i}^{*}\},q^{*})$, which aligns with the results in [27].

As a remark, if the minimum element of a probability vector is bounded away from zero by a constant $\xi>0$, a commonly adopted smoothness condition in density and entropy estimation research [28], then the sub-objectives $\{F_{i}\}_{i=1}^{V}$ and $G$ can be shown to be Lipschitz continuous and smooth. Furthermore, under these smoothness conditions, $G$ is a weakly convex function w.r.t. $q$ (Lemma 2).

From Theorem 1, the consensus-complement algorithm is convergent, since the complement step is a special case of algorithm (11) with $V=1$ while treating $p(z_{c}|x^{(i)})$ as an additional prior probability. The incremental algorithm is convergent for the same reason.
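To illustrate how the penalty condition of Theorem 1, $c>\max_{i\in[V]}\{(M_{i}\lambda_{A_{i}}L_{i})/\mu_{A_{i}A_{i}^{T}},\sigma_{G}/V\}$, could be evaluated numerically, the sketch below computes the singular values by SVD and treats the constants $M_{i}$, $L_{i}$, $\sigma_{G}$ as user-supplied inputs (for example, obtained from the $\varepsilon$-infimal bounds discussed above). The matrices and constants in the demo are arbitrary stand-ins, not quantities from the paper.

```python
# Helper sketch (illustrative only) for the penalty threshold of Theorem 1:
#   c > max_i { M_i * lambda_{A_i} * L_i / mu_{A_i A_i^T} , sigma_G / V }.
# The operators A_i and the constants are assumed inputs; random wide matrices
# are used as full-row-rank stand-ins for the demo.
import numpy as np

def penalty_lower_bound(A_list, M, L_smooth, sigma_G):
    V, per_view = len(A_list), []
    for Ai, Mi, Li in zip(A_list, M, L_smooth):
        lam = np.linalg.svd(Ai, compute_uv=False).max()         # lambda_{A_i}
        mu = np.linalg.svd(Ai @ Ai.T, compute_uv=False).min()   # mu_{A_i A_i^T}
        per_view.append(Mi * lam * Li / mu)
    return max(max(per_view), sigma_G / V)

rng = np.random.default_rng(0)
A_list = [rng.standard_normal((4, 8)), rng.standard_normal((4, 10))]
print("c must exceed:", penalty_lower_bound(A_list, M=[1.0, 1.0],
                                            L_smooth=[3.0, 3.0], sigma_G=4.0))
```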

Figure 1: Simulation results on synthetic datasets: (a) classification accuracy; (b) mutual information. We set $c=64$, $\min\{\varepsilon\}=10^{-11}$ and run the algorithms with random initialization. For simplicity, we let $\gamma_{1}=\gamma_{2}=\gamma$. The termination criterion for the proposed approaches is that the total variation (the linear constraints) between the primal and augmented variables satisfies $D_{TV}(A_{i}p_{i}\|q)<10^{-6},\forall i\in[V]$ (convergent case), or that the maximum number of iterations is reached (divergent case). Figure 1a follows the distribution given in (19), with Joint denoting the joint-view IB; the distribution in Figure 1b is given in (20). In both figures, the Bayes decoder for Cons-Cmpl is $p_{cc}(y|x^{(1)},x^{(2)})=\sum_{\{z_{e}^{(1)},z_{e}^{(2)},z_{c}\}}p(y|z_{c},z_{e}^{(1)},z_{e}^{(2)})\,p(z_{c}|x^{(1)},x^{(2)})\,p(z_{e}^{(1)}|z_{c},x^{(1)})\,p(z_{e}^{(2)}|z_{c},x^{(2)})$, where we approximate $p(z_{c}=k|x^{(1)},x^{(2)})\approx p(z_{c}=k|x^{(1)})\,p(z_{c}=k|x^{(2)})/\sum_{t\in\mathcal{Z}_{c}}p(z_{c}=t|x^{(1)})\,p(z_{c}=t|x^{(2)})$; for Increment, $p_{inc}(y|x^{(1)},x^{(2)})=\sum_{\{z^{(1)},z^{(2)}\}}p(y|z^{(1)},z^{(2)})\,p(z^{(1)}|x^{(1)})\,p(z^{(2)}|z^{(1)},x^{(2)})$.
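The approximation of the consensus posterior used by the decoder above amounts to a normalized product of the per-view posteriors. A minimal sketch of that combination step, with placeholder encoders standing in for the ADMM solutions, is:

```python
# Sketch of the consensus combination in the Figure 1 decoder:
# p(z_c|x^(1),x^(2)) approximated by a normalized product of per-view posteriors.
# The encoders below are placeholders, not ADMM solutions.
import numpy as np

Ec1 = np.array([[0.8, 0.2], [0.5, 0.5], [0.2, 0.8]])   # placeholder p(z_c|x^(1))
Ec2 = np.array([[0.9, 0.1], [0.1, 0.9]])               # placeholder p(z_c|x^(2))

def consensus_posterior(i1, i2):
    """Approximate p(z_c | x^(1)=i1, x^(2)=i2) by a normalized product."""
    prod = Ec1[i1] * Ec2[i2]
    return prod / prod.sum()

print(consensus_posterior(0, 0))   # posterior over z_c for the pair (x1_1, x2_1)
```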

IV Numerical Results

We evaluate the two proposed approaches on two-view, synthetic distributions. For simplicity, we denote the consensus-complement approach as Cons-Cmpl and the incremental update approach as Increment.

We simulate a classification task and compare the performance of the two proposed approaches to joint-view/single-view IB solvers [29], which serve as references for the best- and worst-case performance, and to a state-of-the-art deep neural network-based method (DeepMvIB [14, 11]) with two fully connected layers of 4 neurons plus ReLU activation for each view. Given (19), we randomly sample 10000 outcome tuples $(y,x^{(1)},x^{(2)})$ as testing data. Then we run the algorithms, sweeping through a range of $\gamma\in[0.1,0.7]$, and record the best accuracy from 50 trials per $\gamma$. We use the Bayes decoder to predict the testing data, where we perform inverse transform sampling on the cumulative distribution of the decoders to obtain $\hat{y}$ for each pair $(x^{(1)},x^{(2)})$. The data-generating distribution is:

$$P(X^{(1)}|Y)=\begin{bmatrix}0.75&0.05\\ 0.20&0.20\\ 0.05&0.75\end{bmatrix},\quad P(X^{(2)}|Y)=\begin{bmatrix}0.85&0.15\\ 0.15&0.85\end{bmatrix}, \qquad (19)$$

with $P(Y)=\begin{bmatrix}0.5&0.5\end{bmatrix}^{T}$. The result is shown in Figure 1a. The dimension of each of $Z_{c},Z_{e}^{(2)},Z^{(2)}$ is $2$, and that of each of $Z_{e}^{(1)},Z^{(1)}$ is $3$. Clearly, the two proposed approaches achieve performance comparable to that of the joint-view IB solver and outperform DeepMvIB over the range of $\gamma$ we simulated. Interestingly, Cons-Cmpl outperforms Increment in best accuracy. This might be due to the abundance of representation overlap. To investigate this observation further, we consider a different distribution with the dimensions of all representations $|\mathcal{Z}_{c}|=|\mathcal{Z}_{e}^{(i)}|=|\mathcal{Z}^{(i)}|=3,\forall i\in\{1,2\}$:

$$p(Y)=\begin{bmatrix}\frac{1}{3}&\frac{1}{3}&\frac{1}{3}\end{bmatrix}^{T},\quad P(X^{(1)}|Y):=\begin{bmatrix}0.90&0.20&0.20\\ 0.05&0.45&0.35\\ 0.05&0.35&0.45\end{bmatrix},\quad P(X^{(2)}|Y):=\begin{bmatrix}0.25&0.10&0.55\\ 0.20&0.80&0.25\\ 0.55&0.10&0.20\end{bmatrix}. \qquad (20)$$

Observe that for each view in (20) there is one class ($y_{1}$ in view 1 and $y_{2}$ in view 2) that is easy to infer through $X^{(i)},i\in\{1,2\}$, while the remaining two are ambiguous. This results in low representation overlap, and a consensus is therefore difficult to form. In Figure 1b we examine the components of the relevance rate $I(\{Z\};Y)$, where Sum denotes $I(Z_{c};Y)+\sum_{i=1}^{2}I(Z_{e}^{(i)};Y|Z_{c})$ for Cons-Cmpl and $I(Z^{(1)};Y)+I(Z^{(2)};Y|Z^{(1)})$ for Increment, and Step 1 denotes $I(Z_{c};Y)$ for Cons-Cmpl and $I(Z^{(1)};Y)$ for Increment. Observe that there is almost no increase in $I(Z_{c};Y)$ over varying $\gamma$, and that Increment has a greater relevance rate than Cons-Cmpl when $\gamma<0.4$. Since a high relevance rate is known to be related to high prediction accuracy [30], this example favors the Increment approach, as it is designed to increase the overall relevance rate view by view.

Lastly, we compare the complexity of the two approaches in terms of the number of dimensions of the primal variables. For simplicity, let $|X|=|X^{(i)}|$ and $|Z|=|Z_{c}|=|Z_{e}|=|Z^{(i)}|$. For Cons-Cmpl, the number of dimensions of the variables scales as $\mathcal{O}(V|X||Z|^{2})$, while for Increment it grows as $\mathcal{O}(|X||Z|^{V})$. Both methods improve over the joint view, whose complexity scales as $\mathcal{O}(|X|^{V}|Z|)$, since $|Z|\ll|X|$ in general. Remarkably, the complexity of Cons-Cmpl scales linearly in the number of views $V$, whereas Increment exhibits exponential growth with factor $|Z|$.
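A quick numerical illustration of these scaling expressions, with arbitrary example sizes, is given below.

```python
# Quick illustration of the primal-variable dimension counts quoted above:
# O(V|X||Z|^2) for Cons-Cmpl, O(|X||Z|^V) for Increment, O(|X|^V|Z|) for the
# joint view. The sizes are arbitrary assumptions.
def dims(V, X, Z):
    return {"Cons-Cmpl": V * X * Z ** 2,
            "Increment": X * Z ** V,
            "Joint": X ** V * Z}

print(dims(V=4, X=100, Z=4))   # linear vs. exponential growth in the number of views
```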

V Conclusion

In this work, we proposed two new information-theoretic formulations of MvIB and developed new optimal-bound-achieving algorithms based on non-convex consensus ADMM, in parallel to existing solvers. We proposed two algorithms to solve the two forms respectively and proved their convergence and linear rates. Empirically, they achieve performance comparable to joint-view benchmarks and outperform state-of-the-art deep neural network-based approaches on some synthetic datasets. For future work, we plan to evaluate the two methods on available multi-view datasets [31, 32] and to generalize the proposed MvIB framework to continuous distributions [33].

Appendix A Convergence Analysis

In this part, we prove the convergence of the non-convex consensus ADMM algorithms for the two proposed MvIB solvers, (11) and (15). Moreover, we demonstrate that the convergence rate is linear, based on recent non-convex ADMM convergence results obtained through the KŁ inequality. Specifically, we explicitly show that the Łojasiewicz exponent associated with the augmented Lagrangian is $1/2$ for both forms, which corresponds to a linear rate. As mentioned in Section II, the complement step and each level of the incremental update algorithm are special cases of the consensus-step algorithm with a normalized linear operator for each realization of the conditioned representation and with the number of views set to $V=1$, so it suffices to analyze the convergence and the associated rate of the non-convex consensus ADMM (11). In solving (11), we consider the following first-order optimization method and assume an exact solution exists and can be found at each step.

A-A Preliminaries

We first introduce the following definitions which allow us to study the properties of the sub-objective functions $F_{i}(p_{i}),G(q),\forall i\in[V]$.

We start with the elementary definitions of smoothness conditions for optimization.

Definition 1 (Lipschitz continuity)

A real-valued function $f:\mathbb{R}^{n}\mapsto\mathbb{R}$ is $M$-Lipschitz continuous if $|f(x)-f(y)|\leq M|x-y|$ for some $M>0$ and all $x,y\in\text{dom}(f)$.

A function is “smooth” if its gradient is Lipschitz continuous.

Definition 2 (Smoothness)

A real-valued function $f:\mathbb{R}^{n}\mapsto\mathbb{R},f\in\mathcal{C}^{2}$ is $L$-smooth if $|\nabla f(x)-\nabla f(y)|\leq L|x-y|$ for some $L>0$ and $\forall x,y\in\text{dom}(f)$.

Note that if an $L$-smooth function satisfies $f\in\mathcal{C}^{2}$, then the smoothness coefficient $L$ of $f$ satisfies $|\nabla^{2}f|\leq L$. In this work, the variables are cascades of (conditional) probability masses, and a common assumption in density/entropy estimation research is a non-zero minimal measure [28, 34].

Definition 3 (ε\varepsilon-infimal)

A measure $f(x)$ is said to be $\varepsilon$-infimal if there exists $\varepsilon>0$ such that $\inf_{x\in\mathcal{X}}f(x)=\varepsilon$.

Assuming $\varepsilon$-infimality for a given set of primal variables $p_{i}$, we have the following results:

Lemma 1

Suppose $p_{i}$ in (7) is $\varepsilon_{i}$-infimal $\forall i\in[V]$, $p_{z}$ is $\varepsilon_{z}$-infimal, and $p_{z|y}$ is $\varepsilon_{z|y}$-infimal. Then $F_{i}$ is $\gamma_{i}/\varepsilon_{i}$-smooth and $G$ is $1/\varepsilon_{q}$-smooth, where $\varepsilon_{q}:=\max\{|\sigma_{z}|/\varepsilon_{z},1/\varepsilon_{z|y}\}$ and $\sigma_{z}:=1-\sum_{i=1}^{V}\gamma_{i}$.

Proof:

For $F_{i}$, it suffices to consider a single view. Since $F_{i}\in\mathcal{C}^{2}$ and $\nabla F_{i}\leq\gamma_{i}\big[I\otimes\text{diag}(p_{x^{(i)}})\big]\big(\log{p_{z|x,i}}+\mathbf{1}\big)$, we have:

$$|\nabla^{2}F_{i}|\leq\gamma_{i}\max_{x\in\mathcal{X}^{(i)},z\in\mathcal{Z}}\frac{p(x^{(i)})}{p(z|x^{(i)})}=\frac{\gamma_{i}}{\varepsilon_{i}}.$$

On the other hand, recalling $q:=\begin{bmatrix}p_{z}^{T}&p_{z|y}^{T}\end{bmatrix}^{T}$, we can separate $G(q)$ into two parts, denoted $G(p_{z})$ and $G(p_{z|y})$ respectively. For the first part $G(p_{z})$:

$$\nabla G(p_{z})=\Big(1-\sum_{i=1}^{V}\gamma_{i}\Big)\big(\log{p_{z}}+\mathbf{1}\big)=\sigma_{z}\big(\log{p_{z}}+\mathbf{1}\big).$$

On the other hand, for $p_{z|y}$:

$$\nabla G(p_{z|y})=\big[I\otimes\text{diag}(p_{y})\big]\big(\log{p_{z|y}}+\mathbf{1}\big).$$

Since $G\in\mathcal{C}^{2}$, combining the two parts:

$$|\nabla^{2}G(q)|\leq\max_{z\in\mathcal{Z},y\in\mathcal{Y}}\bigg\{\frac{|\sigma_{z}|}{p(z)},\frac{p(y)}{p(z|y)}\bigg\}\leq\max_{z\in\mathcal{Z},y\in\mathcal{Y}}\bigg\{\frac{|\sigma_{z}|}{\varepsilon_{z}},\frac{1}{\varepsilon_{z|y}}\bigg\}. \qquad (21)$$

Besides the smoothness conditions, given the joint probability $p(x,y)$, the (conditional) entropy functions are concave w.r.t. the associated probability masses [35]. The observation in (3) is that its non-convexity is due to a combination of differences of convex functions. By convexity, we refer to the following definition.

Definition 4 (Hypoconvexity)

A function $f:\mathbb{R}^{n}\mapsto\mathbb{R},f\in\mathcal{C}^{2}$ is $\sigma$-hypoconvex if $\exists\sigma\in\mathbb{R},|\sigma|<+\infty$ such that $f(y)\geq f(x)+\langle\nabla f(x),y-x\rangle+\frac{\sigma}{2}|y-x|^{2},\forall x,y\in\text{dom}(f)$. In particular, if $\sigma>0$, $f$ is strongly convex; if $\sigma<0$, $f$ is weakly convex.

In MvIB, given $p_{x^{(i)}}$, it is easy to show that each $F_{i},\forall i\in[V]$ is a convex function. On the other hand, for the function $G(q)$, assuming $\varepsilon_{q}$-infimality, we show in the following that it is weakly convex.

Lemma 2

Let $G(q)=\sigma_{\gamma}H(Z)+H(Z|Y)$ and $q:=\begin{bmatrix}p_{z}^{T}&p_{z|y}^{T}\end{bmatrix}^{T}$. If $p_{z},p_{z|y}$ are $\varepsilon_{z}$- and $\varepsilon_{z|y}$-infimal measures respectively, then the function $G$ is $\sigma_{G}$-weakly convex, where $\sigma_{\gamma}:=1-\sum_{i=1}^{V}\gamma_{i}$ and $\sigma_{G}:=\max\{(2N_{z}|\sigma_{\gamma}|)/\varepsilon_{z},(2N_{z}N_{y})/\varepsilon_{z|y}\}$ with $N_{z}=|\mathcal{Z}|,N_{y}=|\mathcal{Y}|$.

Proof:

see Appendix B. ∎

From Lemma 2, it turns out that the MvIB objective is a multi-block objective consisting of $V$ convex $F_{i}$ in addition to a weakly convex $G$. This decomposition of the non-convexity of the overall objective enables us to generalize recent strongly-weakly-pair non-convex ADMM convergence results to consensus ADMM [36, 37].

If a function satisfies the KŁ properties, then its rate of convergence can be determined in terms of its Łojasiewicz exponent.

Definition 5

A function $f(x):\mathbb{R}^{|\mathcal{X}|}\mapsto\mathbb{R}$ is said to satisfy the Łojasiewicz inequality if there exist an exponent $\theta\in[0,1)$, $\delta>0$, a critical point $x^{*}\in\Omega^{*}$, a constant $C>0$, and a neighborhood $\lVert x-x^{*}\rVert\leq\varepsilon$ such that:

$$\left|f(x)-f(x^{*})\right|^{\theta}\leq C\,\text{dist}(0,\nabla f(x)).$$

In the literature, a broad class of functions is known to satisfy the KŁ properties, in particular those definable in an o-minimal structure (e.g., sub-analytic and semi-algebraic functions [19]).

Definition 6

A function $f(x):\mathbb{R}^{|\mathcal{X}|}\mapsto\mathbb{R}$ is said to satisfy the Kurdyka-Łojasiewicz inequality if there exist a neighborhood around $\bar{q}$, a level set $Q:=\{q\,|\,q\in\Omega,f(q)<f(\bar{q})<f(q)+\eta\}$ with a margin $\eta>0$, and a continuous concave function $\varphi(s):[0,\eta)\rightarrow\mathbb{R}_{+}$ such that the following inequality holds:

$$\varphi^{\prime}(f(q)-f(\bar{q}))\,\text{dist}(0,\partial f(q))\geq 1, \qquad (22)$$

where $\partial f$ denotes the sub-gradient of $f(\cdot)$ for non-smooth functions, and $\text{dist}(y,A):=\inf_{x\in A}\lVert x-y\rVert$ is defined as the distance of a fixed point $y$ to a set $A$ when it exists. Note that if $\varphi(s):=\bar{c}s^{1-\theta},\bar{c}>0$, then this recovers the definition of the Łojasiewicz inequality.

The following elementary identity is useful in the convergence analysis; we list it for completeness:

$$2\langle u-v,w-u\rangle=\lVert v-w\rVert^{2}-\lVert u-v\rVert^{2}-\lVert w-u\rVert^{2}. \qquad (23)$$
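A one-line numerical sanity check of (23), added only as an illustration with arbitrary vectors:

```python
# Numerical sanity check of identity (23); the vectors are arbitrary.
import numpy as np
u, v, w = np.random.default_rng(1).standard_normal((3, 5))
lhs = 2 * np.dot(u - v, w - u)
rhs = np.sum((v - w) ** 2) - np.sum((u - v) ** 2) - np.sum((w - u) ** 2)
assert np.isclose(lhs, rhs)
```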

Lastly, by “linear” rate of convergence, we refer to the definition in [38]:

Definition 7

Let $\{w^{k}\}$ be a sequence in $\mathbb{R}^{n}$ that converges to a stationary point $w^{*}$ when $k>K_{0}\in\mathbb{N}$. If it converges $Q$-linearly, then $\exists Q\in(0,1)$ such that

$$\frac{\lVert w^{k+1}-w^{*}\rVert}{\lVert w^{k}-w^{*}\rVert}\leq Q,\quad\forall k>K_{0}.$$

On the other hand, the convergence of the sequence is $R$-linear if there is a $Q$-linearly convergent sequence $\{\mu^{k}\},\forall k\in\mathbb{N},\mu^{k}\geq 0$ such that:

$$\lVert w^{k}-w^{*}\rVert\leq\mu^{k},\quad\forall k\in\mathbb{N}.$$

A-B Convergence and Rate Analysis

As in the standard convex setting, to prove convergence of consensus ADMM we need to establish a sufficient decrease lemma [39, 18]. However, since the MvIB problem is non-convex, the sub-objectives cannot be viewed as monotone operators, which would lead to convergence naturally. Our key insight is that the non-convexity of the MvIB problem can be separated into a combination of a set of convex sub-objectives $F_{i}$ and a single weakly convex sub-objective $G$, which can be exploited to show convergence. This result requires certain smoothness conditions to be satisfied, which follow from assuming $\varepsilon$-infimality (Definition 3) of the primal variables $p_{i},\forall i\in[V]$ and the augmented variable $q$. With these smoothness conditions, it can easily be shown that $F_{i}$ is $M_{i}$-Lipschitz continuous and $L_{i}$-smooth w.r.t. $p_{i}$ for some $M_{i},L_{i}>0$. In addition to the properties of the sub-objective functions, it turns out that the structural advantages of consensus ADMM allow us to connect the dual update to the gradient of $F_{i}$, and we can therefore establish the desired results. Before presenting the results, we summarize the minimizer conditions as follows:

$$\begin{aligned}
0 &= \nabla F_{i}(p_{i}^{k+1})+A_{i}^{T}\left[\nu_{i}^{k}+c\left(A_{i}p_{i}^{k+1}-q^{k}\right)\right] = \nabla F_{i}(p_{i}^{k+1})+A_{i}^{T}\nu_{i}^{k+1}, &\qquad (24a)\\
\nu_{i}^{k+1} &= \nu_{i}^{k}+c\left(A_{i}p_{i}^{k+1}-q^{k}\right), &\qquad (24b)\\
0 &= \nabla G(q^{k+1})-\sum_{i=1}^{V}\left[\nu_{i}^{k+1}+c\left(A_{i}p_{i}^{k+1}-q^{k+1}\right)\right], &\qquad (24c)
\end{aligned}$$

where $i\in[V]$ denotes the view index. Following this, suppose there exists a stationary point $w^{*}:=(\{p_{i}^{*}\},\{\nu_{i}^{*}\},q^{*})$ such that $\nabla\mathcal{L}_{c}(w^{*})=0$; then (24) reduces to:

$$A_{i}p_{i}^{*}=q^{*},\quad\nabla F_{i}(p_{i}^{*})=-A_{i}^{T}\nu_{i}^{*},\quad\nabla G(q^{*})=\sum_{i=1}^{V}\nu_{i}^{*}. \qquad (25)$$

Furthermore, we impose the following set of assumptions to facilitate the convergence analysis:

Assumption A

  • There exist stationary points $w^{*}:=(\{p_{i}^{*}\},q^{*},\{\nu_{i}^{*}\})$ that belong to the set $\Omega^{*}:=\{w\,|\,w\in\Omega,\nabla\mathcal{L}_{c}(w)=0\}$.

  • $F_{i}(p_{i}),\forall i\in[V]$ is $L_{i}$-smooth, $M_{i}$-Lipschitz continuous and convex; $G(q)$ is $L_{q}$-smooth and $\sigma_{G}$-weakly convex.

  • The penalty coefficient satisfies $c>\max_{i\in[V]}\{(M_{i}\lambda_{A_{i}}L_{i})/\mu_{A_{i}A_{i}^{T}},\sigma_{G}/V\}$.

With (24) and assumption A, we can establish the sufficient decrease lemma, which is given below.

Lemma 3

Define $\mathcal{L}_{c}$ as in (9). If Assumption A is satisfied and the sequence $\{w^{k}\}_{k\in\mathbb{N}}$, with $w^{k}:=(\{p_{i}^{k}\},\{\nu_{i}^{k}\},q^{k})$, is obtained from algorithm (11), then we have:

$$\mathcal{L}_{c}^{k}-\mathcal{L}_{c}^{k+1}\geq\sum_{i=1}^{V}\left[\frac{c}{2M_{i}^{2}}-\frac{\lambda^{2}_{A_{i}}L^{2}_{i}}{c\mu^{2}_{A_{i}A_{i}^{T}}}\right]\lVert p_{i}^{k+1}-p_{i}^{k}\rVert^{2},$$

where $\mu_{B},\lambda_{B}$ denote the smallest and largest singular values of a matrix $B$, respectively.

Proof:

See Appendix C. ∎

As a remark, Lemma 3 implies that the convergence of the non-convex ADMM-based algorithm depends on a sufficiently large penalty coefficient $c$, and the minimum value that assures this, in turn, relies on the properties of the sub-objective functions $F_{i},\forall i\in[V]$ and $G$. Note that both the Lipschitz continuity and the smoothness of $F_{i}$ are exploited to prove Lemma 3, which corresponds to the sub-minimization path method developed in [20] (applicable since $F_{i},\forall i\in[V]$ are convex) and to the connection between the dual update and the gradient of $F_{i}$ [40, 37].

In addition to convergence, it turns out that we can follow recent results in optimization that adopt the KŁ inequality to analyze the convergence, and hence the rate of convergence, of non-convex ADMM and prove that the proposed algorithms converge linearly. It is worth noting that the linear rate obtained through this framework is not uniform over the whole parameter space; rather, the iterates converge linearly when the solution is located in the vicinity of a local stationary point. In other words, the rate of convergence is locally linear. In the following, we use the KŁ inequality to prove a locally linear rate for the two proposed algorithms (11) and (15). As summarized in [40], three elements are needed to adopt the KŁ inequality: 1) a sufficient decrease lemma; 2) showing that the Łojasiewicz exponent $\theta$ of the objective function $\mathcal{L}$, solved with the algorithm to be analyzed, is $1/2$; and 3) contraction of the gradients, $\lVert\nabla\mathcal{L}^{k}\rVert\leq C^{*}\lVert w^{k}-w^{k-1}\rVert,C^{*}>0$. Since we already have the first element, we can focus on the others. The desired result $\theta=1/2$ can be obtained through the following lemma.

Lemma 4

Let $\mathcal{L}_{c}$ be defined as in (9) and suppose Assumption A is satisfied. Let the sequence $\{w^{k}\}_{k\in\mathbb{N}}$ be obtained from algorithm (11), with $w^{k}:=(\{p^{k}_{i}\},\{\nu^{k}_{i}\},q^{k})$ the step-$k$ collective point. Then the Łojasiewicz exponent of $\mathcal{L}_{c}$ is $\theta=1/2$ in a neighborhood of a stationary point $w^{*}$ such that $\lVert w-w^{*}\rVert<\varepsilon$ and $\mathcal{L}_{c}(w^{*})\leq\mathcal{L}_{c}(w)<\mathcal{L}_{c}(w^{*})+\delta$ for some $\varepsilon,\delta>0$.

Proof:

See Appendix D. ∎

The remaining element needed to adopt the KŁ inequality is the contraction of the gradients of the augmented Lagrangian between consecutive updates.

Lemma 5

Let $\mathcal{L}_{c}$ be defined as in (9). Suppose Assumption A is satisfied and the sequence $\{w^{k}\}_{k\in\mathbb{N}}$ is obtained through algorithm (11), where $w^{k}:=(\{p_{i}^{k}\},\{\nu^{k}_{i}\},q^{k})$ denotes the collective point at step $k$. Then there exists a constant $C^{*}>0$ such that the following holds:

$$\lVert\nabla\mathcal{L}_{c}^{k}\rVert\leq C^{*}\lVert w^{k}-w^{k-1}\rVert.$$
Proof:

See Appendix E. ∎

Then, combining the lemmas, we can prove the locally linear rate of convergence of the non-convex consensus ADMM algorithm used to form the consensus MvIB representation. To be self-contained, the framework for adopting the KŁ inequality is summarized in the following. Note that this result characterizes the rate of convergence into three regimes in terms of the value of the Łojasiewicz exponent, and $\theta=1/2$ suffices in our case. For the complete characterization, we refer to [19, 40] for details.

Lemma 6 (Theorem 2 [19])

Assume that a function $\mathcal{L}_{c}(\{p_{i}\},\{\nu_{i}\},q)$ satisfies the KŁ properties, define $w^{k}$ as the collective point at step $k$, and let $\{w^{k}\}_{k\in\mathbb{N}}$ be a sequence generated by algorithm (11). Suppose $\{w^{k}\}_{k\in\mathbb{N}}$ is bounded and the following relation holds:

$$\lVert\nabla\mathcal{L}_{c}^{k}\rVert\leq C^{*}\lVert w^{k}-w^{k-1}\rVert,$$

where $\mathcal{L}_{c}^{k}:=\mathcal{L}_{c}(\{p_{i}^{k}\},\{\nu_{i}^{k}\},q^{k})$ and $C^{*}>0$ is some constant. Denote the Łojasiewicz exponent of $\mathcal{L}_{c}$ at $\{w^{\infty}\}$ as $\theta$. Then the following holds:

  1. If $\theta=0$, the sequence $\{w^{k}\}_{k\in\mathbb{N}}$ converges in a finite number of steps.

  2. If $\theta\in(0,1/2]$, then there exist $\tau>0$ and $Q\in[0,1)$ such that $\lVert w^{k}-w^{\infty}\rVert\leq\tau Q^{k}$.

  3. If $\theta\in(1/2,1)$, then there exists $\tau>0$ such that $\lVert w^{k}-w^{\infty}\rVert\leq\tau k^{-\frac{1-\theta}{2\theta-1}}$.

Overall, the three elements for applying the KŁ inequality are obtained from Lemmas 3, 4 and 5. Then, by using Lemma 6, we prove the linear rate of convergence.

Theorem 1

Let $\mathcal{L}_{c}$ be defined as in (9), and suppose Assumption A is satisfied. Then the sequence $\{w^{k}\}_{k\in\mathbb{N}}$ obtained through algorithm (11), where $w^{k}:=(\{p^{k}_{i}\},\{\nu^{k}_{i}\},q^{k})$ denotes the collective point at step $k$, converges $Q$-linearly in a neighborhood of a stationary point $w^{*}$ satisfying $\lVert w-w^{*}\rVert<\varepsilon$ and $\mathcal{L}_{c}^{*}<\mathcal{L}_{c}<\mathcal{L}_{c}^{*}+\delta$ for some $\varepsilon,\delta>0$.

Proof:

See Appendix F. ∎

Appendix B Proof of Lemma 2

By definition, for two arbitrary $q^{m},q^{n}\in\Omega_{q}$, $G(q^{m})-G(q^{n})$ consists of two parts. For the first part, we have:

$$G(p_{z}^{m})-G(p_{z}^{n})=\Big(1-\sum_{i=1}^{V}\gamma_{i}\Big)[H(Z^{m})-H(Z^{n})]=\sigma_{\gamma}[H(Z^{m})-H(Z^{n})].$$

If $\sigma_{\gamma}<0$, then $G(p_{z})$ is a scaled negative entropy function w.r.t. $p_{z}$, which is therefore $|\sigma_{\gamma}|$-strongly convex. In turn, the positive squared norm introduced by strong convexity is always greater than zero, which implies 0-weak convexity. On the other hand, if $\sigma_{\gamma}>0$, let $\sigma_{\gamma}=1$ without loss of generality; we have:

$$\begin{aligned}
&H(Z^{m})-H(Z^{n})\\
=&\;\sum_{z}\langle p_{z}^{m}-p_{z}^{n},-\log{p_{z}^{n}}\rangle-D_{KL}(p_{z}^{m}||p_{z}^{n})\\
\geq&\;\langle\nabla H(Z^{n}),p_{z}^{m}-p_{z}^{n}\rangle-\frac{1}{\varepsilon_{z}}\lVert p_{z}^{m}-p_{z}^{n}\rVert^{2}_{1}\\
\geq&\;\langle\nabla H(Z^{n}),p_{z}^{m}-p_{z}^{n}\rangle-\frac{N_{z}}{\varepsilon_{z}}\lVert p_{z}^{m}-p_{z}^{n}\rVert^{2}_{2},
\end{aligned} \qquad (26)$$

where the first inequality is due to the reverse Pinsker's inequality [41], while the last inequality is due to the norm bound $\lVert x\rVert_{1}\leq\sqrt{N}\lVert x\rVert_{2},x\in\mathbb{R}^{N}$. Then, for the second part, consider the following:

$$\begin{aligned}
&H(Z^{m}|Y)-H(Z^{n}|Y)\\
&=\sum_{y}p(y)\big[\langle p^{m}_{z|Y}-p^{n}_{z|Y},-\log{p^{m}_{z|Y}}\rangle-D_{KL}(p^{m}_{z|Y}||p^{n}_{z|Y})\big]\\
&\geq\langle\nabla H(Z^{m}|Y),p^{m}_{z|y}-p^{n}_{z|y}\rangle-E_{y}\Big[\frac{1}{\epsilon_{z|y}}\lVert p^{m}_{z|Y}-p^{n}_{z|Y}\rVert_{1}^{2}\Big]\\
&\geq\langle\nabla H(Z^{m}|Y),p^{m}_{z|y}-p^{n}_{z|y}\rangle-\frac{N_{z}N_{y}}{\epsilon_{z|y}}\lVert p^{m}_{z|y}-p^{n}_{z|y}\rVert_{2}^{2},
\end{aligned}$$

where the first and second inequalities follow for the same reasons as in (26). Combining the above discussions, we conclude that $G(q)$ is $\sigma_{G}$-weakly convex, where:

$$\sigma_{G}:=\max\Big\{\frac{2N_{z}|\sigma_{\gamma}|}{\varepsilon_{z}},\frac{N_{z}N_{y}}{\epsilon_{z|y}}\Big\}.$$

Appendix C Proof of Lemma 3

We divide the proof into three parts, according to the $p_{i}$, $\nu_{i}$, and $q$ updates sequentially.

First, for each view $i\in[V]$, the $p_{i}$ update goes from step $k$ to $k+1$ with $\{\nu_{i}^{k}\},q^{k}$ fixed; denoting the corresponding value by $\mathcal{L}_{c}(p_{i}^{k})$:

$$\begin{aligned}
&\mathcal{L}_{c}(p_{i}^{k})-\mathcal{L}_{c}(p_{i}^{k+1})\\
=&\;F_{i}(p_{i}^{k})-F_{i}(p_{i}^{k+1})+\langle\nu_{i}^{k},A_{i}(p_{i}^{k}-p_{i}^{k+1})\rangle+\frac{c}{2}\lVert A_{i}p_{i}^{k}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}\\
\geq&\;\langle\nabla F_{i}(p_{i}^{k+1}),p_{i}^{k}-p_{i}^{k+1}\rangle+\langle\nu_{i}^{k},A_{i}(p_{i}^{k}-p_{i}^{k+1})\rangle+\frac{c}{2}\lVert A_{i}p_{i}^{k}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}\\
=&\;-c\langle A_{i}p_{i}^{k+1}-q^{k},A_{i}(p_{i}^{k}-p_{i}^{k+1})\rangle+\frac{c}{2}\lVert A_{i}p_{i}^{k}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}\\
=&\;\frac{c}{2}\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert^{2}, \qquad (27)
\end{aligned}$$

where the inequality is due to the convexity of $F_{i}$, and the last two lines follow from the minimizer condition (24a) and the identity (23), respectively.

Second, for the dual update of each view:

$$\mathcal{L}_{c}(\nu_{i}^{k})-\mathcal{L}_{c}(\nu_{i}^{k+1})=\langle\nu_{i}^{k}-\nu_{i}^{k+1},A_{i}p_{i}^{k+1}-q^{k}\rangle=-\frac{1}{c}\lVert\nu_{i}^{k}-\nu_{i}^{k+1}\rVert^{2}. \qquad (28)$$

Lastly, for the $q$ update, from the $\sigma_{G}$-weak convexity of the sub-objective function $G$:

$$\begin{aligned}
&\mathcal{L}_{c}(q^{k})-\mathcal{L}_{c}(q^{k+1})\\
=&\;G(q^{k})-G(q^{k+1})+\sum_{i=1}^{V}\Big[\langle\nu_{i}^{k+1},q^{k+1}-q^{k}\rangle+\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}\Big]\\
\geq&\;\langle\nabla G(q^{k+1}),q^{k}-q^{k+1}\rangle-\frac{\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}+\sum_{i=1}^{V}\Big[\langle\nu_{i}^{k+1},q^{k+1}-q^{k}\rangle+\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}\Big]\\
=&\;\sum_{i=1}^{V}\Big[c\langle A_{i}p_{i}^{k+1}-q^{k+1},q^{k}-q^{k+1}\rangle+\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}\Big]-\frac{\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}\\
=&\;\frac{cV-\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}. \qquad (29)
\end{aligned}$$

Combining (27) (28) i[V]\forall i\in[V] and (29), we have:

ckck+1cVσG2qkqk+12+i=1V[c2AipikAipik+121cνikνik+12],\mathcal{L}_{c}^{k}-\mathcal{L}_{c}^{k+1}\geq\frac{cV-\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}\\ +\sum_{i=1}^{V}\left[\frac{c}{2}\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert^{2}-\frac{1}{c}\lVert\nu_{i}^{k}-\nu_{i}^{k+1}\rVert^{2}\vphantom{\frac{c}{2}}\right],

where ck\mathcal{L}_{c}^{k} denotes the augmented Lagrangian evaluated with step kk solution. The next step is to address the negative squared norm νikνik+1\lVert\nu_{i}^{k}-\nu_{i}^{k+1}\rVert. Since AiA_{i} is full-row rank i[V]\forall i\in[V], consider the following:

\lVert(A_{i}A_{i}^{T})^{-1}A_{i}A_{i}^{T}(\nu_{i}^{k}-\nu_{i}^{k+1})\rVert^{2}\leq\frac{\lambda_{A_{i}}^{2}}{\mu^{2}_{A_{i}A_{i}^{T}}}\lVert A_{i}^{T}(\nu_{i}^{k}-\nu_{i}^{k+1})\rVert^{2}, (30)

where $\mu_{B},\lambda_{B}$ denote the smallest and largest singular values of a linear operator $B$. Using this, we connect the gradient of $F_{i}$ to the dual update:

\lVert\nu_{i}^{k}-\nu_{i}^{k+1}\rVert^{2}\leq\frac{\lambda_{A_{i}}^{2}}{\mu^{2}_{A_{i}A_{i}^{T}}}\lVert\nabla F_{i}(p_{i}^{k})-\nabla F_{i}(p_{i}^{k+1})\rVert^{2}\leq\frac{L_{i}^{2}\lambda^{2}_{A_{i}}}{\mu^{2}_{A_{i}A_{i}^{T}}}\lVert p_{i}^{k}-p_{i}^{k+1}\rVert^{2}, (31)

where the last inequality is due to the $L_{i}$-smoothness of $F_{i}$. After applying (31), we still need the relation:

\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert\geq M^{*}\lVert p_{i}^{k}-p_{i}^{k+1}\rVert,

for some constant $M^{*}>0$. Note that, by definition, $A_{i}$ is only full row rank for every $i\in[V]$, hence the above relation is non-trivial. Fortunately, since each $F_{i}$ is convex w.r.t. $p_{i}$, which assures a unique minimizer of the $p_{i}$ sub-problem, we can follow the sub-minimization path technique recently developed in [20] to establish the desired relation. Define the following proximal operator:

\Psi_{i}(\eta):=\underset{p_{i}\in\Omega_{i}}{\arg\min}\;F_{i}(p_{i})+\frac{c}{2}\lVert A_{i}p_{i}-\eta\rVert^{2},

which coincides with the $p_{i}$ update. We then have:

\lVert\Psi_{i}(A_{i}p_{i}^{k})-\Psi_{i}(A_{i}p_{i}^{k+1})\rVert=\lVert p_{i}^{k}-p_{i}^{k+1}\rVert\leq M_{i}\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert, (32)

where the last inequality is due to the Lipschitz continuity of the sub-minimization path $\Psi_{i}$ [20]. Applying (31) and (32) to the combined bound above, we have:

\mathcal{L}_{c}^{k}-\mathcal{L}_{c}^{k+1}\geq\frac{cV-\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}+\sum_{i=1}^{V}\left[\frac{c}{2M_{i}^{2}}-\frac{\lambda_{A_{i}}^{2}L_{i}^{2}}{c\mu^{2}_{A_{i}A_{i}^{T}}}\right]\lVert p_{i}^{k}-p_{i}^{k+1}\rVert^{2}.
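The r.h.s. certifies sufficient decrease only when both coefficients are positive. As a sketch (the precise requirement is presumably part of Assumption A in the main text), it suffices to choose the penalty as

c>\max\left\{\frac{\sigma_{G}}{V},\;\max_{i\in[V]}\frac{\sqrt{2}\,M_{i}L_{i}\lambda_{A_{i}}}{\mu_{A_{i}A_{i}^{T}}}\right\},

so that $cV-\sigma_{G}>0$ and $\frac{c}{2M_{i}^{2}}-\frac{\lambda_{A_{i}}^{2}L_{i}^{2}}{c\mu^{2}_{A_{i}A_{i}^{T}}}>0$ for every view.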

This completes the proof.

Appendix D Proof of Lemma 4

The proof is divided into two parts. We first establish the following relation between the step-$(k+1)$ solution and a stationary point $w^{*}:=(\{p_{i}^{*}\},\{\nu_{i}^{*}\},q^{*})$:

\mathcal{L}_{c}^{k+1}-\mathcal{L}_{c}^{*}\leq\frac{c}{2}\sum_{i=1}^{V}\lVert A_{i}p_{i}^{k+1}-A_{i}p_{i}^{*}\rVert^{2}. (33)

This is accomplished by the following relations. Starting from $\mathcal{L}_{c}^{k+1}-\mathcal{L}_{c}^{*}$, for the $F_{i}$ differences, $\forall i\in[V]$, using convexity and the minimizer condition (24a):

F_{i}(p_{i}^{k+1})-F_{i}(p_{i}^{*})\leq-\langle\nu_{i}^{k+1},A_{i}p_{i}^{k+1}-A_{i}p^{*}_{i}\rangle, (34)

where we use the reduction of the minimizer conditions at a stationary point (25) to have $A_{i}p_{i}^{*}=q^{*}$, $\forall i\in[V]$. As for the $G(q)$ difference:

G(q^{k+1})-G(q^{*})\leq\sum_{i=1}^{V}\langle\nu_{i}^{k+1},q^{k+1}-q^{*}\rangle+c\sum_{i=1}^{V}\langle A_{i}p_{i}^{k+1}-q^{k+1},q^{k+1}-q^{*}\rangle+\frac{\sigma_{G}}{2}\lVert q^{k+1}-q^{*}\rVert^{2}. (35)

Note that, by assumption, $c>\sigma_{G}/V$. Therefore, combining (34) and (35) and collecting the inner products associated with the dual variables $\nu_{i}^{k+1}$, we have the desired result (33). The second part is to construct the following relation:

\lVert\nabla\mathcal{L}_{c}(w^{k+1})\rVert^{2}\geq\sum_{i=1}^{V}\left(c^{2}\mu_{A_{i}A_{i}^{T}}+1\right)\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}, (36)

which is straightforward to show since:

\begin{split}
&\nabla\mathcal{L}_{c}(w^{k+1})\\
=\;&\begin{bmatrix}\nabla F_{i}(p_{i}^{k+1})+A_{i}^{T}\left[\nu_{i}^{k+1}+c\left(A_{i}p_{i}^{k+1}-q^{k+1}\right)\right]\\ A_{i}p_{i}^{k+1}-q^{k+1}\\ \nabla G(q^{k+1})-\sum_{i=1}^{V}\left[\nu_{i}^{k+1}+c\left(A_{i}p_{i}^{k+1}-q^{k+1}\right)\right]\end{bmatrix}\\
=\;&\begin{bmatrix}cA_{i}^{T}\left(A_{i}p_{i}^{k+1}-q^{k+1}\right)\\ A_{i}p_{i}^{k+1}-q^{k+1}\\ 0\end{bmatrix},
\end{split} (37)

where the last equality is due to the minimizer conditions (24). Note that in (37), for simplicity, we write the relation for a single $i\in[V]$, i.e., for one pair $(p_{i},\nu_{i})$. With (33) and (36), consider a neighborhood around a stationary point $w^{*}$ such that $|\bar{w}-w^{*}|<\varepsilon$ and $\mathcal{L}^{*}_{c}<\bar{\mathcal{L}}_{c}<\mathcal{L}^{*}_{c}+\delta$ for some $\varepsilon,\delta>0$. Then we have:

\begin{split}
&\mathcal{L}^{k+1}_{c}-\mathcal{L}_{c}^{*}\\
\leq\;&\sum_{i=1}^{V}\left[\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-A_{i}p_{i}^{*}\rVert^{2}+\left(c^{2}\mu_{A_{i}A_{i}^{T}}+1\right)\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}\right]\\
\leq\;&\sum_{i=1}^{V}\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-A_{i}p_{i}^{*}\rVert^{2}+\lVert\nabla\mathcal{L}_{c}^{k+1}\rVert^{2}\\
\leq\;&\lVert\nabla\mathcal{L}_{c}^{k+1}\rVert^{2}\left(1+\frac{c\varepsilon^{2}}{2\lVert\nabla\mathcal{L}_{c}^{k+1}\rVert^{2}}\right)\\
\leq\;&\lVert\nabla\mathcal{L}_{c}^{k+1}\rVert^{2}\left(1+\frac{c\varepsilon^{2}}{2\eta^{2}}\right),
\end{split} (38)

where in the last inequality, by construction, $w^{k+1}$ is not a stationary point, so there exists an $\eta>0$ such that $\lVert\nabla\mathcal{L}_{c}^{k+1}\rVert>\eta$ [27, Lemma 2.1]. This completes the proof, since (38) implies that the Łojasiewicz exponent is $\theta=1/2$.
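Spelling out the last implication: on the stated neighborhood, (38) rearranges into the Łojasiewicz inequality with exponent $\theta=1/2$,

\left(\mathcal{L}_{c}^{k+1}-\mathcal{L}_{c}^{*}\right)^{1/2}\leq\sqrt{1+\frac{c\varepsilon^{2}}{2\eta^{2}}}\,\lVert\nabla\mathcal{L}_{c}^{k+1}\rVert,

which is the form required by the KŁ property.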

Appendix E Proof of Lemma 5

From (37), we have:

\begin{split}
&\lVert\nabla\mathcal{L}_{c}^{k}\rVert^{2}\\
\leq\;&\sum_{i=1}^{V}\left(c^{2}\lambda_{A_{i}A_{i}^{T}}+1\right)\lVert A_{i}p_{i}^{k}-q^{k}\rVert^{2}\\
\leq\;&2\sum_{i=1}^{V}\left(c^{2}\lambda_{A_{i}A_{i}^{T}}+1\right)\left[\lVert A_{i}p_{i}^{k}-q^{k-1}\rVert^{2}+\lVert q^{k}-q^{k-1}\rVert^{2}\right]\\
=\;&2\sum_{i=1}^{V}\left[\left(\lambda_{A_{i}A_{i}^{T}}+\frac{1}{c^{2}}\right)\lVert\nu_{i}^{k}-\nu_{i}^{k-1}\rVert^{2}+\left(c^{2}\lambda_{A_{i}A_{i}^{T}}+1\right)\lVert q^{k}-q^{k-1}\rVert^{2}\right]\\
\leq\;&2\sum_{i=1}^{V}\left(\lambda_{A_{i}A_{i}^{T}}+\frac{1}{c^{2}}\right)\left[\lVert\nu_{i}^{k}-\nu_{i}^{k-1}\rVert^{2}+\lVert p_{i}^{k}-p_{i}^{k-1}\rVert^{2}\right]+2\sum_{i=1}^{V}\left(c^{2}\lambda_{A_{i}A_{i}^{T}}+1\right)\lVert q^{k}-q^{k-1}\rVert^{2}.
\end{split} (39)

Then there exists a positive constant $C^{*}>0$ such that:

\lVert\nabla\mathcal{L}_{c}^{k}\rVert\leq C^{*}\lVert w^{k}-w^{k-1}\rVert,

where $w^{k}:=(\{p_{i}^{k}\},\{\nu_{i}^{k}\},q^{k})$ denotes the collective point at step $k$.
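For concreteness, (39) admits the explicit (not necessarily tight) choice

C^{*2}=2\max\left\{\max_{i\in[V]}\left(\lambda_{A_{i}A_{i}^{T}}+\frac{1}{c^{2}}\right),\;\sum_{i=1}^{V}\left(c^{2}\lambda_{A_{i}A_{i}^{T}}+1\right)\right\},

since $\lVert w^{k}-w^{k-1}\rVert^{2}=\sum_{i=1}^{V}\left[\lVert p_{i}^{k}-p_{i}^{k-1}\rVert^{2}+\lVert\nu_{i}^{k}-\nu_{i}^{k-1}\rVert^{2}\right]+\lVert q^{k}-q^{k-1}\rVert^{2}$.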

Appendix F Proof of Theorem 1

We first show convergence. By Assumption A, the sufficient decrease lemma (Lemma 3) holds. Consider a sequence $\{w^{k}\}_{k\in\mathbb{N}}$ obtained through the algorithm (11). By the sufficient decrease lemma, there exist constants $\rho_{p,i},\rho_{q}>0$, $\forall i\in[V]$, such that:

\mathcal{L}_{c}^{0}-\mathcal{L}_{c}^{\infty}\geq\sum_{k=0}^{\infty}\left[\sum_{i=1}^{V}\rho_{p,i}\lVert p_{i}^{k}-p_{i}^{k+1}\rVert^{2}+\rho_{q}\lVert q^{k}-q^{k+1}\rVert^{2}\right]. (40)

In discrete settings, $\mathcal{L}_{c}$ is lower semi-continuous and bounded below, and therefore the l.h.s. of (40) is bounded. It follows that the partial sums on the r.h.s. of (40) form a convergent (Cauchy) sequence, so the successive differences vanish and hence $p_{i}^{k}\rightarrow p_{i}^{*}$ for all $i\in[V]$ and $q^{k}\rightarrow q^{*}$ as $k\rightarrow\infty$. Combined with (31), this implies $\nu_{i}^{k}\rightarrow\nu_{i}^{*}$ as $k\rightarrow\infty$, which proves convergence.

Given convergence, together with Lemma 4 and Lemma 5 (that is, the KŁ property holds with Łojasiewicz exponent $\theta=1/2$, and the contraction of the gradient of $\mathcal{L}_{c}$ is established), Lemma 6 implies that the sequence $\{w^{k}\}_{k\in\mathbb{N}}$ obtained through the algorithm (11) converges $Q$-linearly in a neighborhood of a stationary point $w^{*}$.
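To illustrate the sufficient decrease and convergence behavior established above, the following is a minimal numerical sketch of a consensus-ADMM iteration with the same structure. It replaces the IB sub-objectives with quadratic surrogates $F_i(p_i)=\frac{1}{2}\lVert p_i-a_i\rVert^2$ and $G(q)=\frac{g}{2}\lVert q\rVert^2$; the matrices $A_i$, the targets $a_i$, the penalty $c$, and the update order (p-update, dual ascent, q-update) are illustrative assumptions consistent with the proofs, not quantities taken from the paper.

```python
# Minimal consensus-ADMM sketch with quadratic surrogates (not the paper's IB objective).
import numpy as np

rng = np.random.default_rng(0)
V, d, c, g = 3, 5, 10.0, 0.5                       # views, dimension, penalty c, curvature of G
A = [np.eye(d) + 0.05 * rng.standard_normal((d, d)) for _ in range(V)]  # well-conditioned, full rank
a = [rng.standard_normal(d) for _ in range(V)]     # illustrative targets defining F_i

p = [np.zeros(d) for _ in range(V)]
nu = [np.zeros(d) for _ in range(V)]
q = np.zeros(d)

def aug_lagrangian(p, nu, q):
    """Augmented Lagrangian L_c for the quadratic surrogate problem."""
    val = 0.5 * g * q @ q
    for i in range(V):
        r = A[i] @ p[i] - q
        val += 0.5 * np.sum((p[i] - a[i]) ** 2) + nu[i] @ r + 0.5 * c * r @ r
    return val

vals = []
for k in range(60):
    # p-update: exact minimizer of F_i + <nu_i, A_i p_i - q> + (c/2)||A_i p_i - q||^2
    for i in range(V):
        H = np.eye(d) + c * A[i].T @ A[i]
        p[i] = np.linalg.solve(H, a[i] - A[i].T @ (nu[i] - c * q))
    # dual ascent on each view
    for i in range(V):
        nu[i] = nu[i] + c * (A[i] @ p[i] - q)
    # q-update: exact minimizer of G + sum_i [<nu_i, A_i p_i - q> + (c/2)||A_i p_i - q||^2]
    q = sum(nu[i] + c * (A[i] @ p[i]) for i in range(V)) / (g + c * V)
    vals.append(aug_lagrangian(p, nu, q))

# Sufficient decrease (cf. Lemma 3): once the p/dual optimality couplings hold (k >= 1),
# the augmented Lagrangian is non-increasing for a sufficiently large penalty c.
assert all(later <= earlier + 1e-9 for earlier, later in zip(vals[1:], vals[2:]))
print("consensus residual:", max(np.linalg.norm(A[i] @ p[i] - q) for i in range(V)))
```

With well-conditioned $A_i$ and a sufficiently large penalty $c$, the recorded augmented Lagrangian values are non-increasing after the first sweep, mirroring the sufficient decrease lemma, and the consensus residuals $\lVert A_ip_i-q\rVert$ vanish as the iterates converge.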

References

  • [1] S. Sun, “A survey of multi-view machine learning,” Neural computing and applications, vol. 23, no. 7, pp. 2031–2038, 2013.
  • [2] Y. Yang and H. Wang, “Multi-view clustering: A survey,” Big Data Mining and Analytics, vol. 1, no. 2, pp. 83–107, 2018.
  • [3] M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata, “Learning robust representations via multi-view information bottleneck,” in International Conference on Learning Representations, 2020.
  • [4] W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multi-view representation learning,” in International conference on machine learning.   PMLR, 2015, pp. 1083–1092.
  • [5] Y. Li, M. Yang, and Z. Zhang, “A survey of multi-view representation learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 10, pp. 1863–1883, 2019.
  • [6] K. Zhan, F. Nie, J. Wang, and Y. Yang, “Multiview consensus graph clustering,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1261–1270, 2019.
  • [7] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000.
  • [8] C. Xu, D. Tao, and C. Xu, “Large-margin multi-view information bottleneck,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1559–1572, 2014.
  • [9] Y. Gao, S. Gu, L. Xia, and Y. Fei, “Web document clustering with multi-view information bottleneck,” in 2006 International Conference on Computational Intelligence for Modelling Control and Automation and International Conference on Intelligent Agents Web Technologies and International Commerce (CIMCA’06), 2006, pp. 148–148.
  • [10] S. Hu, Z. Shi, and Y. Ye, “Dmib: Dual-correlated multivariate information bottleneck for multiview clustering,” IEEE Transactions on Cybernetics, pp. 1–15, 2020.
  • [11] Q. Wang, C. Boudreau, Q. Luo, P.-N. Tan, and J. Zhou, “Deep multi-view information bottleneck,” in Proceedings of the 2019 SIAM International Conference on Data Mining.   SIAM, 2019, pp. 37–45.
  • [12] Z. Goldfeld and Y. Polyanskiy, “The information bottleneck problem and its applications in machine learning,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 19–38, 2020.
  • [13] Y. Uğur, I. E. Aguerri, and A. Zaidi, “Vector Gaussian CEO problem under logarithmic loss and applications,” IEEE Transactions on Information Theory, vol. 66, no. 7, pp. 4183–4202, 2020.
  • [14] I. E. Aguerri and A. Zaidi, “Distributed variational representation learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 120–138, 2021.
  • [15] I. Estella Aguerri and A. Zaidi, “Distributed information bottleneck method for discrete and Gaussian sources,” in International Zurich Seminar on Information and Communication (IZS 2018). Proceedings.   ETH Zurich, 2018, pp. 35–39.
  • [16] A. Zaidi, I. Estella-Aguerri, and S. Shamai, “On the information bottleneck problems: Models, connections, applications and information theoretic views,” Entropy, vol. 22, no. 2, p. 151, 2020.
  • [17] T. A. Courtade and T. Weissman, “Multiterminal source coding under logarithmic loss,” IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 740–761, 2014.
  • [18] S. Boyd, N. Parikh, and E. Chu, Distributed optimization and statistical learning via the alternating direction method of multipliers.   Now Publishers Inc, 2011.
  • [19] H. Attouch and J. Bolte, “On the convergence of the proximal algorithm for nonsmooth functions involving analytic features,” Mathematical Programming, vol. 116, no. 1, pp. 5–16, 2009.
  • [20] Y. Wang, W. Yin, and J. Zeng, “Global convergence of ADMM in nonconvex nonsmooth optimization,” Journal of Scientific Computing, vol. 78, no. 1, pp. 29–63, 2019.
  • [21] T.-H. Huang and A. El Gamal, “A provably convergent information bottleneck solution via ADMM,” in 2021 IEEE International Symposium on Information Theory (ISIT), 2021, pp. 43–48.
  • [22] D. Boley, “Local linear convergence of the alternating direction method of multipliers on quadratic or linear programs,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2183–2207, 2013.
  • [23] K. Guo, D. R. Han, and T. T. Wu, “Convergence of alternating direction method for minimizing sum of two nonconvex functions with linear constraints,” International Journal of Computer Mathematics, vol. 94, no. 8, pp. 1653–1669, 2017.
  • [24] M. Chao, D. Han, and X. Cai, “Convergence of the Peaceman-Rachford splitting method for a class of nonconvex programs,” Numerical Mathematics: Theory, Methods and Applications, vol. 14, no. 2, pp. 438–460, 2021.
  • [25] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, ser. COLT’ 98.   New York, NY, USA: Association for Computing Machinery, 1998, p. 92–100.
  • [26] Y. Ugur, I. E. Aguerri, and A. Zaidi, “A generalization of Blahut-Arimoto algorithm to compute rate-distortion regions of multiterminal source coding under logarithmic loss,” arXiv preprint arXiv:1708.07309, 2017.
  • [27] G. Li and T. K. Pong, “Calculus of the exponent of Kurdyka–Łojasiewicz inequality and its applications to linear convergence of first-order methods,” Foundations of Computational Mathematics, vol. 18, no. 5, p. 1199–1232, 2018.
  • [28] K. Sricharan, R. Raich, and A. O. Hero, “Estimation of nonlinear functionals of densities with confidence,” IEEE Transactions on Information Theory, vol. 58, no. 7, pp. 4135–4159, 2012.
  • [29] R. Blahut, “Computation of channel capacity and rate-distortion functions,” IEEE Transactions on Information Theory, vol. 18, no. 4, pp. 460–473, 1972.
  • [30] O. Shamir, S. Sabato, and N. Tishby, “Learning and generalization with the information bottleneck,” Theoretical Computer Science, vol. 411, no. 29-30, pp. 2696–2711, 2010.
  • [31] E. Schubert and A. Zimek, “ELKI: A large open-source library for data analysis - ELKI release 0.7.5 ”heidelberg”,” CoRR, vol. abs/1902.03616, 2019.
  • [32] D. Cremers and K. Kolev, “Multiview stereo and silhouette consistency via convex functionals over convex domains,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1161–1174, 2011.
  • [33] G. Franca, D. Robinson, and R. Vidal, “ADMM and accelerated ADMM as continuous dynamical systems,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1559–1567.
  • [34] Y. Han, J. Jiao, T. Weissman, and Y. Wu, “Optimal rates of entropy estimation over lipschitz balls,” The Annals of Statistics, vol. 48, no. 6, pp. 3228–3250, 2020.
  • [35] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing).   New York, NY, USA: Wiley-Interscience, 2006.
  • [36] K. Guo, D. Han, and X. Yuan, “Convergence analysis of Douglas–Rachford splitting method for “strongly + weakly” convex programming,” SIAM Journal on Numerical Analysis, vol. 55, no. 4, p. 1549–1577, 2017.
  • [37] Z. Jia, X. Gao, X. Cai, and D. Han, “Local linear convergence of the alternating direction method of multipliers for nonconvex separable optimization problems,” Journal of Optimization Theory and Applications, vol. 188, no. 1, p. 1–25, 2021.
  • [38] J. Nocedal, Numerical optimization, 2nd ed., ser. Springer series in operations research.   New York: Springer, 2006.
  • [39] Y. Nesterov, Lectures on convex optimization, ser. Springer optimization and its applications.   Cham: Springer, 2018, vol. 137.
  • [40] T.-H. Huang, A. E. Gamal, and H. E. Gamal, “A linearly convergent Douglas-Rachford splitting solver for Markovian information-theoretic optimization problems,” arXiv preprint arXiv:2203.07527, 2022.
  • [41] I. Sason, “On reverse pinsker inequalities,” CoRR, vol. abs/1503.07118, 2015.