This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

On The Multi-View Information Bottleneck Representation

Teng-Hui Huang Electrical and Computer Engineering
Purdue University
West Lafayette, IN, USA
huan1456@purdue.edu
   Aly El Gamal Emerging Technologies Lab
InterDigital
Los Altos, CA, USA
aly.elgamal@interdigital.com
   Hesham El Gamal Electrical and Information Engineering
The University of Sydney
NSW, Australia
hesham.elgamal@sydney.edu.au
Abstract

In this work, we generalize the information bottleneck (IB) approach to the multi-view learning context. The exponentially growing complexity of the optimal representation motivates the development of two novel formulations with more favorable performance-complexity tradeoffs. The first approach is based on forming a stochastic consensus and is suited for scenarios with significant representation overlap between the different views. The second method, relying on incremental updates, is tailored for the other extreme scenario with minimal representation overlap. In both cases, we extend our earlier work on the alternating directional methods of multiplier (ADMM) solver and establish its convergence and scalability. Empirically, we find that the proposed methods outperform state-of-the-art approaches in multi-view classification problems under a broad range of modelling parameters.

Index Terms:
Information bottleneck, consensus ADMM, non-convex optimization, classification, multi-view learning.

I Introduction

Recently, learning from multi-view data has drawn significant interest in the machine learning and data science community (e.g., [1, 2, 3, 4, 5]). In this context, a view of data is a description or observation about the source. For example, an object can be described in words or images. It is natural to expect learning from multi-view data to improve performance[6].

The challenges in multi-view learning are two-folded: First, as one can naively combine all views of observations to form one giant view which loses no information contained within them but would suffer from exponential growth of the dimensionality of the merged observation, we call this the performance-complexity trade-off. Second, following this, if one instead opts to extract either view-shared or view-specific relevant features from each view, then heterogeneous forms of observations (e.g., audio and images) would make it difficult to learn low-complexity and meaningful representations with a unified method. In other words, the amount of representation overlap across all view observations is important for efficient multi-view learning.

In addressing the challenges, several recent works have attempted to apply the IB [7] principle to multi-view learning for it matches the objective well, that is, trading-off relevance and complexity in extracting both view-shared and view-specific features [8, 9, 10, 11]. This generalization is known as the multi-view IB (MvIB) which is known to be a special case of the multi-terminal remote source coding problems with logarithmic loss. The logarithmic loss corresponds to soft reconstruction where the likelihood of all possible outcomes is received in contrast to a reconstructed symbol in conventional source coding problems [12, 13, 14, 15, 16].

In literature, the achievable region for the remote source coding problem is characterized in [17] for discrete cases, and recently, jointly Gaussian cases as well [13]. Along with the characterization, a variety of variational inference-based algorithms are proposed[14, 15]. This type of algorithms introduces extra variational variables to facilitate the optimization as by fixing one of the two sets of variables, the overall objective function is convex w.r.t. to the other set of variables. Then optimize in this alternating fashion, the convergence is assured.

Extending this line of research, our approach is rooted in a top-down information-theoretic formulation closely related to the optimal characterization of MvIB. Moreover, contrary to [11] which relies on black-box deep neural networks, we propose two constructive information theoretic formulations with performance comparable to that of the optimal joint view approach. The two approaches are motivated by two extreme multi-view learning scenarios: The first is characterized by a significant representation overlap between the different views which favors our consensus-complement two stage formulation, whereas the second extreme scenario is characterized by a minimal representation overlap leading to our incremental update approach.

Different from existing variational inference-based algorithms that avoid dealing with the non-convexity of the overall objective function, in both of the proposed methods, we adopt the non-convex consensus ADMM as the main tool in deriving our solvers [18, 19, 20]. These new solvers can, therefore, be viewed as generalizations of our earlier work on the single view ADMM algorithm [21]. More specifically, in the consensus-complement version, we separate the proposed Lagrangian into consensus and complement sub-objective functions and then proceed to solve the optimization problem in two steps. The new ADMM solver can hence efficiently form a consensus representation in large-scale multi-view learning problems with significantly lower dimensions compared to joint-view approaches. The same intuition is applied to the increment update approach as detailed in the sequel. Finally, we prove the linear rate of convergence of our solvers under significantly milder constraints; as compared with earlier convergence results on this class of solvers [22, 23, 24]. More specifically, we relax the need for a strongly convex sub-objective function and, moreover, establish the linear rate of convergence around a local optimal point in each case.

II Multiview Information Bottleneck

Given VV views with observations {X(i)}i=1V\{X^{(i)}\}_{i=1}^{V} generated from a target variable YY, we aim to find a set of individual representations {Z}\{Z\} that is most compressed w.r.t. the individual-view observations X(i),i[V]X^{(i)},\forall i\in[V] but at the same time maximize their relevance toward the target YY through X(i)X^{(i)}.

Using a Lagrangian multiplier formulation, the problem can be casted as:

MvIB:=γI({X}V;{Z})I(Y;{Z}),\mathcal{L}_{\text{MvIB}}:=\gamma I(\{X\}_{V};\{Z\})-I(Y;\{Z\}), (1)

where {X}\{X\} denotes the set of all VV-views of observations and {Z}\{Z\} is the set of representations to be designed. Note that if the observations are combined in this manner and treated as one view, the above reduces to the standard IB and can be solved with any existing single-view solver. However, combining all the observations in one giant view will result in an exponential increase in complexity (curse of dimensionality).

A basic assumption in the multiview learning literature is the conditional independence [25, 9], where the observations of all views {X(i)}i=1V\{X^{(i)}\}_{i=1}^{V} are independent given the target variable YY. That is, p({x}|y)=i=1Vp(x(i)|y)p(\{x\}|y)=\prod_{i=1}^{V}p(x^{(i)}|y). In the next two sections, we use this conditional independence assumption while constraining the set of allowable latent representations {Z}\{Z\} to develop our two novel information-theoretic formulations of the Multi-view IB (MvIB) problem.

II-A Consensus-Complement Form

Inspired by the co-training methods in multi-view research [25], we design the set of latent representations {Z}\{Z\} to consist of a consensus representation ZcZ_{c} and view-specific complement components Ze(i),i[V]Z_{e}^{(i)},\forall i\in[V]. Then, by the chain rule of mutual information, the Lagrangian of (1) becomes

con=γI(Zc;{X})I(Zc;Y)+i=1VγI(Ze(i);{X}|Zc,{Ze}i1)I(Y;Ze(i)|Zc,{Ze}i1),\mathcal{L}_{\text{con}}=\gamma I(Z_{c};\{X\})-I(Z_{c};Y)\\ +\sum_{i=1}^{V}\gamma I(Z_{e}^{(i)};\{X\}|Z_{c},\{Z_{e}\}_{i-1})-I(Y;Z_{e}^{(i)}|Z_{c},\{Z_{e}\}_{i-1}), (2)

where the sequence {Ze}j:={Ze(1),,Ze(j)}\{Z_{e}\}_{j}:=\{Z_{e}^{(1)},\cdots,Z_{e}^{(j)}\} is defined to represent the accumulated complement views. To further simplify the above, we restrict the set of possible representations to satisfy the following conditions (similar to [25, 11]):

  • There always exist constants κi,i[V]\kappa_{i},\forall i\in[V], independent of the observations {X}\{X\} such that κiI(Zc;X(i))=I(Zc;X(i)|{X}i1)\kappa_{i}I(Z_{c};X^{(i)})=I(Z_{c};X^{(i)}|\{X\}_{i-1}).

  • YX(i)Ze(i)ZcY\rightarrow X^{(i)}\rightarrow Z_{e}^{(i)}\leftarrow Z_{c} forms a Markov chain. That is, ZcZ_{c} is side-information for Ze(i)Z_{e}^{(i)}.

  • For each view i[V]i\in[V], given the consensus ZcZ_{c}, {Ze}\{Z_{e}\} are independent.

Under these constraints, (2) can be rewritten as the superposition of two parts, i.e., :=¯+i=1Ve(i)\mathcal{L}:=\bar{\mathcal{L}}+\sum_{i=1}^{V}\mathcal{L}_{e}^{(i)}, where the first component ¯\bar{\mathcal{L}} is defined as the multi-view consensus IB Lagrangian:

¯:=i=1VγiI(Zc;X(i))I(Zc;Y),\bar{\mathcal{L}}:=\sum_{i=1}^{V}\gamma_{i}I(Z_{c};X^{(i)})-I(Z_{c};Y), (3)

and the second consists of VV terms with each one corresponding to a complement sub-objective for each view:

e(i):=γI(Ze(i);X(i)|Zc)I(Ze(i);Y|Zc),i[V].\mathcal{L}_{e}^{(i)}:=\gamma I(Z_{e}^{(i)};X^{(i)}|Z_{c})-I(Z_{e}^{(i)};Y|Z_{c}),\forall i\in[V]. (4)

We can now recast ¯\bar{\mathcal{L}} in (3) as:

¯:=i=1VγiH(Z|X(i))+(1+i=1Vγi)H(Z)+H(Z|Y)=i=1VFi(pz|x,i)+G(pz,pz|y).\begin{split}\bar{\mathcal{L}}:&=-\sum_{i=1}^{V}\gamma_{i}H(Z|X^{(i)})+\bigl{(}-1+\sum_{i=1}^{V}\gamma_{i}\bigr{)}H(Z)+H(Z|Y)\\ &=\sum_{i=1}^{V}F_{i}(p_{z|x,i})+G(p_{z},p_{z|y}).\end{split} (5)

Similarly, we rewrite (4), i[V]\forall i\in[V], as:

e(i)=γH(Ze(i)|Zc,X(i))+(γ1)H(Ze(i)|Zc)+H(Ze(i)|Zc,Y).\mathcal{L}_{e}^{(i)}=-\gamma H(Z_{e}^{(i)}|Z_{c},X^{(i)})\\ +(\gamma-1)H(Z_{e}^{(i)}|Z_{c})+H(Z_{e}^{(i)}|Z_{c},Y). (6)

By representing the discrete (conditional) probabilities as vectors/tensors, we can solve (5) and (6) with augmented Lagrangian methods. We define the following vectors:

pz|x,i:=[p(z1|x1(i))p(z1|xNi(i))p(zL|xNi(i))]T,pz:=[p(z1)p(zL)]T,pz|y:=[p(z1|y1)p(z1|yT)p(zL|yK)]T.\begin{split}p_{z|x,i}&:=\begin{bmatrix}p(z_{1}|x_{1}^{(i)})\cdots p(z_{1}|x^{(i)}_{N_{i}})\cdots p(z_{L}|x^{(i)}_{N_{i}})\end{bmatrix}^{T},\\ p_{z}&:=\begin{bmatrix}p(z_{1})&\cdots p(z_{L})\end{bmatrix}^{T},\\ p_{z|y}&:=\begin{bmatrix}p(z_{1}|y_{1})\cdots p(z_{1}|y_{T})\cdots p(z_{L}|y_{K})\end{bmatrix}^{T}.\end{split} (7)

where Ni:=|𝒳(i)|,i[V],L:=|𝒵|,K:=|𝒴|N_{i}:=|\mathcal{X}^{(i)}|,\forall i\in[V],L:=|\mathcal{Z}|,K:=|\mathcal{Y}|. For clarity, we rewrite the primal variables for each view as pz|x,i:=pip_{z|x,i}:=p_{i}, and cascade the augmented variables which gives q:=[pzTpz|yT]Tq:=\begin{bmatrix}p_{z}^{T}&p_{z|y}^{T}\end{bmatrix}^{T}. On the other hand, for the complement part, we define the following tensors:

πx,i[m,n,r]:=P(Ze(i)=ze,m(i)|Zc=zc,n,X(i)=xr(i)),πy,i[m,n,r]:=P(Ze(i)=ze,m(i)|Zc=zc,n,Y=yr),πz,i[m,n]:=P(Ze(i)=ze,m(i)|Zc=zc,n).\begin{split}\pi_{x,i}[m,n,r]&:=P(Z^{(i)}_{e}=z^{(i)}_{e,m}|Z_{c}=z_{c,n},X^{(i)}=x_{r}^{(i)}),\\ \pi_{y,i}[m,n,r]&:=P(Z^{(i)}_{e}=z^{(i)}_{e,m}|Z_{c}=z_{c,n},Y=y_{r}),\\ \pi_{z,i}[m,n]&:=P(Z^{(i)}_{e}=z^{(i)}_{e,m}|Z_{c}=z_{c,n}).\end{split} (8)

Then we present the consensus-complement MvIB augmented Lagrangian as follows. For the consensus part:

c({pi}i=1V,q,{νi}i=1V)=i=1V[Fi(pi)+νi,Aipiq+c2Aipiq2]+G(q),\mathcal{L}_{c}(\{p_{i}\}_{i=1}^{V},q,\{\nu_{i}\}_{i=1}^{V})\\ =\sum_{i=1}^{V}\left[F_{i}(p_{i})+\langle\nu_{i},A_{i}p_{i}-q\rangle+\frac{c}{2}\lVert A_{i}p_{i}-q\rVert^{2}\right]+G(q), (9)

where \lVert\cdot\rVert is 2-norm. As for the complement part, let

Fe,i=γH(Ze(i)|Zc,X(i)),Ge,i=(γ1)H(Ze(i))+H(Ze(i)|Zc,Y).\begin{split}F_{e,i}&=-\gamma H(Z_{e}^{(i)}|Z_{c},X^{(i)}),\\ G_{e,i}&=(\gamma-1)H(Z_{e}^{(i)})+H(Z_{e}^{(i)}|Z_{c},Y).\end{split}

As for the tensors used in the complement step, denote π[t]\pi[t] the realization of the consensus representation t𝒵ct\in\mathcal{Z}_{c}, by Bayes’ rule, we can recover the linear expression: πy,i[t]=Λzc|y1[t]Ax(i)|yTπx,i[t]\pi_{y,i}[t]=\Lambda^{-1}_{z_{c}|y}[t]A_{x^{(i)}|y}^{T}\pi_{x,i}[t], where Λzc|y[t]\Lambda_{z_{c}|y}[t] is a diagonal matrix formed from the vector pzc=t|y,y𝒴p_{z_{c}=t|y},\forall y\in\mathcal{Y}. To simplify notation, define the equivalent prior as a linear operator A~x(i)|y,t:=Λzc|y1[t]Ax(i)|yT\tilde{A}_{x^{(i)|y,t}}:=\Lambda^{-1}_{z_{c}|y}[t]A^{T}_{x^{(i)}|y}, then we can express the augmented Lagrangian for the complement step as e,c(i)=t𝒵ce,c(i)[t]\mathcal{L}^{(i)}_{e,c}=\sum_{t\in\mathcal{Z}_{c}}\mathcal{L}_{e,c}^{(i)}[t] and each term is defined as:

e,c(i)[t]=Fe,i(πx,i[t])+Ge,i(πq,i[t])+μi,A~eπx,i[t]πq,i[t]+c2A~eπx,i[t]πq,i[t]2,\mathcal{L}_{e,c}^{(i)}[t]=F_{e,i}(\pi_{x,i}[t])+G_{e,i}(\pi_{q,i}[t])\\ +\langle\mu_{i},\tilde{A}_{e}\pi_{x,i}[t]-\pi_{q,i}[t]\rangle+\frac{c}{2}\lVert\tilde{A}_{e}\pi_{x,i}[t]-\pi_{q,i}[t]\rVert^{2}, (10)

where c>0c>0 is the penalty coefficient and the linear penalty AipiqA_{i}p_{i}-q for each view i[V]i\in[V] encourages the variables qq and each pip_{i} to satisfy the marginal probability and Markov chain conditions. Specifically, let \otimes denote the Kronecker product, Az,i:=Ipx(i)T,Az|y,i:=IPx(i)|yTA_{z,i}:=I\otimes p_{x^{(i)}}^{T},A_{z|y,i}:=I\otimes P_{x^{(i)}|y}^{T} where Px(i)|yP_{x^{(i)}|y} is the matrix form of the conditional distribution p(x(i)|y)p(x^{(i)}|y) with each entry (m,n)(m,n) equal to p(xm(i)|yn)p(x^{(i)}_{m}|y_{n}).

We propose a two-step algorithm to solve (2) described as follows. The first step is solving (9) through the following consensus ADMM algorithm. i[V]\forall i\in[V]:

pik+1:=\displaystyle p_{i}^{k+1}:= argminpiΩic({p<ik+1},pi,{p>ik},{νik},qk),\displaystyle\underset{p_{i}\in\Omega_{i}}{\arg\min}\,\mathcal{L}_{c}(\{p^{k+1}_{<i}\},p_{i},\{p^{k}_{>i}\},\{\nu_{i}^{k}\},q^{k}), (11a)
νik+1:=\displaystyle\nu_{i}^{k+1}:= νik+c(Aipik+1qk),\displaystyle\nu_{i}^{k}+c\left(A_{i}p_{i}^{k+1}-q^{k}\right), (11b)
qk+1:=\displaystyle q^{k+1}:= argminqΩqc({pik+1}i=1V,{νik+1},q),\displaystyle\underset{q\in\Omega_{q}}{\arg\min}\,\mathcal{L}_{c}(\{p^{k+1}_{i}\}_{i=1}^{V},\{\nu_{i}^{k+1}\},q), (11c)

Then in the second step we solve (10) with two-block ADMM:

πe,ik+1\displaystyle\pi^{k+1}_{e,i} :=argminπx,iΠx(i)e,c(πx,i,μik,πq,ik),\displaystyle:=\underset{\pi_{x,i}\in\Pi^{(i)}_{x}}{\arg\min}\,\mathcal{L}_{e,c}(\pi_{x,i},\mu_{i}^{k},\pi^{k}_{q,i}), (12a)
μik+1\displaystyle\mu^{k+1}_{i} :=μik+c(A~e,iπx,ik+1πq,ik),\displaystyle:=\mu^{k}_{i}+c(\tilde{A}_{e,i}\pi^{k+1}_{x,i}-\pi^{k}_{q,i}), (12b)
πy,ik+1\displaystyle\pi^{k+1}_{y,i} :=argminπy,iΠy(i)e,c(πx,ik+1,μik+1,πy,i),\displaystyle:=\underset{\pi_{y,i}\in\Pi^{(i)}_{y}}{\arg\min}\,\mathcal{L}_{e,c}(\pi^{k+1}_{x,i},\mu^{k+1}_{i},\pi_{y,i}), (12c)

where in (11), we use the short-hand notation {p<ik+1}:={plk+1}l=1i1\{p^{k+1}_{<i}\}:=\{p^{k+1}_{l}\}_{l=1}^{i-1} to denote the primal variables, up to i1i-1 views that are already updated to step k+1k+1, and {pi<k}:={pmk}m=i+1V\{p^{k}_{i<}\}:=\{p_{m}^{k}\}_{m=i+1}^{V} to denote the rest that are still at step kk. We define {p<0k+1}={}={p>Vk}\{p^{k+1}_{<0}\}=\{\varnothing\}=\{p^{k}_{>V}\}; in (11) and (12), the superscript kk denotes the step index; each of Ωi,Ωq,Πx(i),Πy(i)\Omega_{i},\Omega_{q},\Pi_{x}^{(i)},\Pi_{y}^{(i)} denotes a compound probability simplex. The algorithm starts with (11a), updating each view in succession. Then the augmented variables are updated with (11c). Finally, the difference between the primal and augmented variables are added to the dual variables (11b) to complete step kk. After convergence of (11), we run (12) in similar fashion for each view. And this completes the full algorithm.

II-B Incremental Update Form

Intuitively, the consensus-complement form works well in the case where the common information in the observations {X}\{X\} across all views is abundant. However, if the views are almost distinct, where each view is a complement to the others, then the previous form will be inefficient in the sense that learning the common may have negligible benefit. To address this, we propose another formulation of the multi-view IB, by restricting the representation set to {Z(i)}i=1V\{Z^{(i)}\}_{i=1}^{V}. The incremental update multi-view IB Lagrangian is therefore given by:

inc:=i=1VγI({X};Z(i)|{Z}i1)I(Y;Z(i)|{Z}i1).\mathcal{L}_{\text{inc}}:=\sum_{i=1}^{V}\gamma I(\{X\};Z^{(i)}|\{Z\}_{i-1})-I(Y;Z^{(i)}|\{Z\}_{i-1}). (13)

Again, to simplify the above, the incremental form will satisfy the following constraints:

  • For each view i[V]i\in[V], the corresponding representation Z(i)Z^{(i)} only access X(i)X^{(i)}, so YX(i)Z(i){Z}i1Y\rightarrow X^{(i)}\rightarrow Z^{(i)}\leftarrow\{Z\}_{i-1} forms a Markov chain.

With the assumptions above, in each step, we can replace observations of all views {X}\{X\} with the view-specific observation X(i)X^{(i)} and rewrite (13) as:

inc:=i=1VγI(X(i);Z(i)|{Z}i1)I(Y;Z(i)|{Z}i1).\mathcal{L}_{\text{inc}}:=\sum_{i=1}^{V}\gamma I(X^{(i)};Z^{(i)}|\{Z\}_{i-1})-I(Y;Z^{(i)}|\{Z\}_{i-1}). (14)

In solving (14), we consider the following algorithm. At the ithi^{th} step, we have:

Pz|x,z<i(i):=argminPΩ(i)inc(P,{Pz|y,z<j(j)}j=1i1),\displaystyle P^{(i)}_{z|x,z_{<i}}:=\underset{P\in\Omega^{(i)}}{\arg\min}\,\mathcal{L}_{\text{inc}}(P,\{P^{(j)}_{z|y,z_{<j}}\}_{j=1}^{i-1}), (15a)
p(z(i)|y,z<i)=xip(x(i)|y)p(z(i),z<i|x(i))p(z<i|y),\displaystyle p(z^{(i)}|y,z_{<i})=\frac{\sum_{x_{i}}p(x^{(i)}|y)p(z^{(i)},z_{<i}|x^{(i)})}{p(z_{<i}|y)}, (15b)

where Pz|x,z<i(i)P^{(i)}_{z|x,z_{<i}} denotes the tensor form of a conditional probability p(z(i)|x(i),z(i1),,z(1)),i[V]p(z^{(i)}|x^{(i)},z^{(i-1)},\cdots,z^{(1)}),\forall i\in[V]. The tensor is the primal variable for step ii which belongs to a compound simplex Ω(i)\Omega^{(i)}. In the algorithm, for each step (15a), we solve it with (11) by setting V=1V=1 and treating the estimators from the previous steps as priors. For example, p(z(2),x(2)|z(1))=p(x(2)|z(1))p(z(2)|z(1),x(2))p(z^{(2)},x^{(2)}|z^{(1)})=p(x^{(2)}|z^{(1)})p(z^{(2)}|z^{(1)},x^{(2)}), and p(x(2)|z(1))=yp(x(2)|y)p(y|z(1))p(x^{(2)}|z^{(1)})=\sum_{y}p(x^{(2)}|y)p(y|z^{(1)}).

III Main Results

We propose two new information-theoretic formulations of MvIB and develop optimal-bound achieving algorithms that are in parallel to existing solvers [15, 26, 14]; our main results are the convergence proofs for the proposed two algorithms. The convergence analysis goes beyond the MvIB and the recent non-convex multi-block ADMM convergence results as we further show that strong convexity on {Fi}i=1V\{F_{i}\}_{i=1}^{V} is not necessary for proving convergence [24]. This new result connects our analysis to a more general class of functions that can be solved with multi-block non-convex consensus ADMM. For simplicity we denote the collective point at step kk as wk:=({pik},{νik},qk)w^{k}:=(\{p_{i}^{k}\},\{\nu^{k}_{i}\},q^{k}), ck:=c({pik},{νik},qk)\mathcal{L}_{c}^{k}:=\mathcal{L}_{c}(\{p_{i}^{k}\},\{\nu_{i}^{k}\},q^{k}) as the function value evaluated with wkw^{k} and μB,λB\mu_{B},\lambda_{B} denote the smallest and largest singular value of a linear operator BB respectively,

Theorem 1

Suppose Fi(pi)F_{i}(p_{i}) is LiL_{i}-smooth and MiM_{i}-Lipschitz continuous i[V]\forall i\in[V] and G(q)G(q) is σG\sigma_{G}-weakly convex. Further, let c\mathcal{L}_{c} be defined as in (9) and solved with the algorithm (11). If the penalty coefficient satisfies c>maxi[V]{(MiλAiLi)/μAiAiT,σG/V}c>\max_{i\in[V]}\{(M_{i}\lambda_{A_{i}}L_{i})/\mu_{A_{i}A_{i}^{T}},\sigma_{G}/V\}, then the sequence {wk}k\{w^{k}\}_{k\in\mathbb{N}} is finite and bounded. Moreover, {wk}k\{w^{k}\}_{k\in\mathbb{N}} converges linearly to a stationary point ww^{*} around a neighborhood such that c<c<c+δ\mathcal{L}_{c}^{*}<\mathcal{L}_{c}<\mathcal{L}_{c}^{*}+\delta for ww<ϵ\lVert w-w^{*}\rVert<\epsilon where δ,ϵ>0\delta,\epsilon>0.

Proof:

The details of the proof are deferred to Appendix A. Here we explain the key ideas.

Continuing on the proof sketch, the first step is to construct a sufficient decrease lemma (Lemma 3) to assure that the function value ck\mathcal{L}_{c}^{k} decreases from step kk to k+1k+1 by an amount lower bounded by the positive squared norm wkwk+12\lVert w^{k}-w^{k+1}\rVert^{2}. We decompose ckck+1\mathcal{L}_{c}^{k}-\mathcal{L}_{c}^{k+1} according to each step of the algorithm (11), i[V]\forall i\in[V] as follows:

ckck+1\displaystyle\mathcal{L}_{c}^{k}-\mathcal{L}_{c}^{k+1}
=\displaystyle= i=1V[c({p<ik+1},pik,{p>ik},{νk},qk)c({p<ik+1},pik+1,{p>ik},{νk},qk)]\displaystyle\begin{multlined}\sum_{i=1}^{V}\left[\mathcal{L}_{c}(\{p^{k+1}_{<i}\},p^{k}_{i},\{p^{k}_{>i}\},\{\nu^{k}\},q^{k})\right.\\ \left.-\mathcal{L}_{c}(\{p^{k+1}_{<i}\},p^{k+1}_{i},\{p^{k}_{>i}\},\{\nu^{k}\},q^{k})\vphantom{\mathcal{L}_{c}}\right]\end{multlined}\sum_{i=1}^{V}\left[\mathcal{L}_{c}(\{p^{k+1}_{<i}\},p^{k}_{i},\{p^{k}_{>i}\},\{\nu^{k}\},q^{k})\right.\\ \left.-\mathcal{L}_{c}(\{p^{k+1}_{<i}\},p^{k+1}_{i},\{p^{k}_{>i}\},\{\nu^{k}\},q^{k})\vphantom{\mathcal{L}_{c}}\right] (16c)
+i=1V[c({pk+1},{ν<ik+1},νik,{ν>ik},qk)c({pk+1},{ν<ik+1},νik+1,{ν>ik},qk+1)]\displaystyle+\begin{multlined}\sum_{i=1}^{V}\left[\mathcal{L}_{c}(\{p^{k+1}\},\{\nu^{k+1}_{<i}\},\nu^{k}_{i},\{\nu^{k}_{>i}\},q^{k})\right.\\ \left.-\mathcal{L}_{c}(\{p^{k+1}\},\{\nu^{k+1}_{<i}\},\nu^{k+1}_{i},\{\nu^{k}_{>i}\},q^{k+1})\vphantom{\mathcal{L}_{c}}\right]\end{multlined}\sum_{i=1}^{V}\left[\mathcal{L}_{c}(\{p^{k+1}\},\{\nu^{k+1}_{<i}\},\nu^{k}_{i},\{\nu^{k}_{>i}\},q^{k})\right.\\ \left.-\mathcal{L}_{c}(\{p^{k+1}\},\{\nu^{k+1}_{<i}\},\nu^{k+1}_{i},\{\nu^{k}_{>i}\},q^{k+1})\vphantom{\mathcal{L}_{c}}\right] (16f)
+c({pk+1},{νk},qk)c({pk+1},{νk},qk+1).\displaystyle+\mathcal{L}_{c}(\{p^{k+1}\},\{\nu^{k}\},q^{k})-\mathcal{L}_{c}(\{p^{k+1}\},\{\nu^{k}\},q^{k+1}). (16g)

For each view, each difference in (16c) can be lower bounded by using the convexity of FiF_{i}, and we get:

c({pik})c({pik+1})i=1Vc2AipikAipik+12.\mathcal{L}_{c}(\{p_{i}^{k}\})-\mathcal{L}_{c}(\{p_{i}^{k+1}\})\geq\sum_{i=1}^{V}\frac{c}{2}\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert^{2}. (17)

On the other hand, in (16g), a similar lower bound for GG follows from its σG\sigma_{G}-weak convexity. This results in a negative squared norm σG/2qkqk+12-\sigma_{G}/2\lVert q^{k}-q^{k+1}\rVert^{2}. Nonetheless, by the first-order minimizer conditions (24) and the identity 2uv,wu=|vw|2|uv|2|uw|22\langle u-v,w-u\rangle=|v-w|^{2}-|u-v|^{2}-|u-w|^{2}, the negative term is balanced by the penalty coefficient cc as the corresponding lower bound is (with other variables fixed):

c(qk)c(qk+1)cVσG2qkqk+12.\mathcal{L}_{c}(q^{k})-\mathcal{L}_{c}(q^{k+1})\geq\frac{cV-\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}.

As for the dual update, (16f) gives a combination of negative norms 1/ci=1Vνikνik+12-1/c\sum_{i=1}^{V}\lVert\nu_{i}^{k}-\nu_{i}^{k+1}\rVert^{2}. It turns out that by the first-order minimizer condition of FiF_{i} and its smoothness:

Fi(pik+1)=AiTνik+1,\nabla F_{i}(p_{i}^{k+1})=-A_{i}^{T}\nu_{i}^{k+1},

and that AiA_{i} is full-row rank (holds for complement step):

νikνik+12λAi2Li2μAiAiT2pikpik+12,\lVert\nu_{i}^{k}-\nu_{i}^{k+1}\rVert^{2}\leq\frac{\lambda^{2}_{A_{i}}L_{i}^{2}}{\mu^{2}_{A_{i}A_{i}^{T}}}\lVert p_{i}^{k}-p_{i}^{k+1}\rVert^{2}, (18)

where μB,λB\mu_{B},\lambda_{B} denote the smallest and largest singular value of a linear operator B. Then we need the following relation:

AipikAipik+1Mpikpik+1,M>0,\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert\geq M\lVert p_{i}^{k}-p_{i}^{k+1}\rVert,\quad M>0,

which is non-trivial as AiA_{i} is full-row rank. To address this, we adopt the sub-minimization path method in [20], which is applicable since FiF_{i} is convex. Observe that (11a) is equivalent to a proximal operator:

Ψi(η):=argminπiΩiF(pi)+c2Aipikη2,\Psi_{i}(\eta):=\underset{\pi_{i}\in\Omega_{i}}{\arg\min}\,F(p_{i})+\frac{c}{2}\lVert A_{i}p_{i}^{k}-\eta\rVert^{2},

with η:=Aipik\eta:=A_{i}p_{i}^{k} at step kk. Using this technique, we can have the desired result using the Lipschitz continuity of FiF_{i}:

Ψi(Aipik)Ψi(Aipik+1)=pikpik+1MiAipikAipik+1\lVert\Psi_{i}(A_{i}p^{k}_{i})-\Psi_{i}(A_{i}p_{i}^{k+1})\rVert=\lVert p_{i}^{k}-p_{i}^{k+1}\rVert\\ \leq M_{i}\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert

prove the sufficient decrease lemma and hence the convergence (Appendix F).

We further prove that the rate of convergence is linear by explicitly showing the Kurdyka-Łojasiewicz (KŁ [19, 27]) property is satisfied with an exponent θ=1/2\theta=1/2 (Appendix D). This is based on the known result based on the KŁ inequality, which characterizes the rate of convergence in to three regions in terms of θ\theta [19] (θ=1/2\theta=1/2 corresponds to linear rate). The proof for θ=1/2\theta=1/2 is again based on the convexity of {Fi}\{F_{i}\} and the weak convexity of GG and is referred to Appendix D. We note that the linear rate holds around a neighborhood of a stationary point w:=({pi},{νi},q)w^{*}:=(\{p_{i}^{*}\},\{\nu_{i}^{*}\},q^{*}) which aligns with the results in [27].

As a remark, if the minimum element of a probability vector is bounded away from zero by a constant ξ>0\xi>0, a commonly adopted smoothness condition in density and entropy estimation research [28], the sub-objectives {Fi}i=1V\{F_{i}\}_{i=1}^{V} and GG can be shown to be Lipschitz continuous and smooth. Furthermore, under smoothness conditions, GG is a weakly convex function w.r.t. qq (Lemma 2).

From Theorem 1, the consensus-complement algorithm is convergent since the complement step is a special case of the algorithm (11) with V=1V=1 while treating p(zc|x(i))p(z_{c}|x^{(i)}) as an additional prior probability. The incremental algorithm is convergent following the same reason.

Refer to caption
(a) Classification Accuracy
Refer to caption
(b) Mutual information
Figure 1: Simulation results on synthetic datasets. We set c=64,min{ε}=1011c=64,\min\{\varepsilon\}=10^{-11} and run the algorithms with random initialization. For simplicity, we let γ1=γ2=γ\gamma_{1}=\gamma_{2}=\gamma. The termination criterion for the proposed approaches is when the total variation (the linear constraints) between the primal and augmented variables DTV(Aipi||q)<106,i[V]D_{TV}(A_{i}p_{i}||q)<10^{-6},\forall i\in[V] (convergent case), or the maximum number of iterations is reached (divergent case). Figure 1a follows the distribution given in (19) with Joint denoting the joint-view IB; The distribution in Figure 1b is given in (20). In both figures, the Bayes’ decoder for Cons-Cmpl is pcc(y|x(1),x(2))={ze(1),ze(2),zc}p(y|zc,ze(1),ze(2))p(zc|x(1),x(2))p(ze(1)|zc,x(1))p(ze(2)|zc,x(2))p_{cc}(y|x^{(1)},x^{(2)})=\sum_{\{z_{e}^{(1)},z_{e}^{(2)},z_{c}\}}p(y|z_{c},z_{e}^{(1)},z_{e}^{(2)})p(z_{c}|x^{(1)},x^{(2)})p(z_{e}^{(1)}|z_{c},x^{(1)})p(z_{e}^{(2)}|z_{c},x^{(2)}) where we approximate p(zc=k|x(1),x(2))p(zc=k|x(1))p(zc=k|x(2))/t𝒵cp(zc=t|x(1))p(zc=t|x(2))p(z_{c}=k|x^{(1)},x^{(2)})\approx p(z_{c}=k|x^{(1)})p(z_{c}=k|x^{(2)})/\sum_{t\in\mathcal{Z}_{c}}p(z_{c}=t|x^{(1)})p(z_{c}=t|x^{(2)}); On the other hand, for Increment: pinc(y|x(1),x(2))={z(1),z(2)}p(y|z(1),z(2))p(z(1)|x(1))p(z(2)|z(1),x(2)).p_{inc}(y|x^{(1)},x^{(2)})=\sum_{\{z^{(1)},z^{(2)}\}}p(y|z^{(1)},z^{(2)})p(z^{(1)}|x^{(1)})p(z^{(2)}|z^{(1)},x^{(2)}).

IV Numerical Results

We evaluate the proposed two approaches for two-view, synthetic distributions. For simplicity, we denote the consensus-complement approach as Cons-Cmpl while the incremental update approach as Increment.

We simulate a classification task and compare the performance of the two proposed approaches to joint-view/single-view IB solvers [29], which are served as references for the best- and worst-case performance, and the state-of-the-art deep neural network-based method (DeepMvIB [14, 11]), with two layers of 44-neuron, fully connected weights plus ReLU activation for each view. Given (19), we randomly sample 1000010000 pairs of outcomes (y,x(1),x(2))(y,x^{(1)},x^{(2)}) as testing data. Then we run the algorithms, sweeping through a range of γ[0.1,0.7]\gamma\in[0.1,0.7] and record the best accuracy from 5050 trials per γ\gamma. We use Bayes’ decoder to predict the testing data, where we perform inverse transform sampling for the cumulative distribution of the decoders to obtain y^\hat{y} for each pair of (x(1),x(2))(x^{(1)},x^{(2)}). The data-generating distribution is:

P(X(1)|Y)=[0.750.050.200.200.050.75],P(X(2)|Y)=[0.850.150.150.85],P(X^{(1)}|Y)=\begin{bmatrix}0.75&0.05\\ 0.20&0.20\\ 0.05&0.75\end{bmatrix},P(X^{(2)}|Y)=\begin{bmatrix}0.85&0.15\\ 0.15&0.85\end{bmatrix}, (19)

with P(Y)=[0.50.5]TP(Y)=\begin{bmatrix}0.5&0.5\end{bmatrix}^{T}. The result is shown in Figure 1a. The dimension of each of Zc,Ze(2),Z(2)Z_{c},Z_{e}^{(2)},Z^{(2)} is 22, and 33 for each of Ze(1),Z(1)Z_{e}^{(1)},Z^{(1)}. Clearly, the two proposed approaches can achieve comparable performance to that of the joint-view IB solver and outperform the deepMvIB over the range of γ\gamma we simulated. Interestingly, Cons-Cmpl outperforms Increment in the best accuracy. This might be due to the abundance of representation overlap. To better investigate this observation, we further consider a different distribution with dimensions of all representations |𝒵c|=|𝒵e(i)|=|𝒵(i)|=3,i{1,2}|\mathcal{Z}_{c}|=|\mathcal{Z}_{e}^{(i)}|=|\mathcal{Z}^{(i)}|=3,\forall i\in\{1,2\}:

p(Y)=[131313]T,P(X(1)|Y):=[0.900.200.200.050.450.350.050.350.45],\displaystyle p(Y)=\begin{bmatrix}\frac{1}{3}&\frac{1}{3}&\frac{1}{3}\end{bmatrix}^{T},P(X^{(1)}|Y):=\begin{bmatrix}0.90&0.20&0.20\\ 0.05&0.45&0.35\\ 0.05&0.35&0.45\end{bmatrix},
P(X(2)|Y):=[0.250.100.550.200.800.250.550.100.20].\displaystyle P(X^{(2)}|Y):=\begin{bmatrix}0.25&0.10&0.55\\ 0.20&0.80&0.25\\ 0.55&0.10&0.20\end{bmatrix}. (20)

Observe that for each view in (20), there is one class (y1y_{1} in view 11 an y2y_{2} in view 22), that is easy to infer through X(i),i{1,2}X^{(i)},i\in\{1,2\} while the remaining two are ambiguous. This results in low representation overlap and consensus is therefore difficult to form. In Figure 1b we examine the components of the relevance rate I({Z};Y)I(\{Z\};Y) where the Sum is: I(Zc;Y)+i=12I(Ze(i);Y|Zc))I(Z_{c};Y)+\sum_{i=1}^{2}I(Z_{e}^{(i)};Y|Z_{c})) for Cons-Cmpl and I(Z(1);Y)+I(Z(2);Y|Z(1))I(Z^{(1)};Y)+I(Z^{(2)};Y|Z^{(1)}) for Increment. Step 1 indicates I(Zc;Y)I(Z_{c};Y) in Cons-Cmpl, and I(Z(1);Y)I(Z^{(1)};Y) in Increment. Observe that there is almost no increase in I(Zc;Y)I(Z_{c};Y) over varying γ\gamma, and that Increment has a greater relevance rate than Cons-Cmpl when γ<0.4\gamma<0.4. Since it is known that the high relevance rate is related to high prediction accuracy [30], this example favors the Increment approach as it is designed to increase the overall relevance rate view-by-view.

Lastly, we can compare the complexity of the two approaches in terms of the number of dimensions for the primal variables. For simplicity, let |X|=|X(i)||X|=|X^{(i)}|,|Z|=|Zc|=|Ze|=|Z(i)||Z|=|Z_{c}|=|Z_{e}|=|Z^{(i)}|. For Cons-Cmpl, the number of dimensions for the variables scales as 𝒪(V|X||Z|2)\mathcal{O}(V|X||Z|^{2}) while for Increment, it grows as 𝒪(|X||Z|V)\mathcal{O}(|X||Z|^{V}). The two methods both improve over the joint view as their complexity values scale as 𝒪(|X|V|Z|)\mathcal{O}(|X|^{V}|Z|) and |Z||X||Z|\ll|X| in general. Remarkably, the complexity for Cons-Cmpl scales linearly in the number of views VV while we get an exponential growth with factor |Z||Z| for Increment.

V Conclusion

In this work, we propose two new information-theoretic formulations of MvIB and develop new optimal bound-achieving algorithms based on non-convex consensus ADMM, which are in parallel to existing solvers. We proposed two algorithms to solve the two forms respectively and prove their convergence and linear rates. Empirically, they achieve comparable performance to joint-view benchmarks and outperforms state-of-the-art deep neural networks-based approaches in some synthetic datasets. For future works, we plan to evaluate the two methods on available multiview datasets [31, 32] and generalize the proposed MvIB framework to continuous distributions [33].

Appendix A Convergence Analysis

In this part, we prove the convergence of the consensus non-convex ADMM algorithm for the proposed two MvIB solvers, (11) and (15). Moreover, we demonstrate that the convergence rate is linear based on the recent non-convex ADMM convergence results through the KŁ inequality. Specifically, we explicit show that the Łojasiewicz exponents associate with the augmented Lagrangian for both forms is 1/21/2 and therefore corresponds to linear rates. As mentioned in section II, the complement step and each level of the incremental update algorithms are special cases of the consensus step algorithm with a normalized linear operator for each realization of the conditioned representation and setting the numbers of view V=1V=1, so it suffices to analyze the convergence and the associated rate of the non-convex consensus ADMM (11). In solving (11), we consider the following first-order optimization method and assume an exact solution exists and can be found at each step.

A-A Preliminaries

We first introduce the following definitions which allow us to study the properties of the sub-objective functions Fi(pi),G(q),i[V]F_{i}(p_{i}),G(q),\forall i\in[V].

We start with the elementary definitions of smoothness conditions for optimization.

Definition 1 (Lipschitz continuity)

A real-value function f:nf:\mathbb{R}^{n}\mapsto\mathbb{R} is MM-Lipschitz continuous if |f(x)f(y)|M|xy|,M>0|f(x)-f(y)|\leq M|x-y|,M>0 for x,ydom(f)x,y\in\textit{\text{dom}}(f).

A function is “smooth” if its gradient is Lipschitz continuous.

Definition 2 (Smoothness)

A real-value function f:n,f𝒞2f:\mathbb{R}^{n}\mapsto\mathbb{R},f\in\mathcal{C}^{2} is LL-smooth if |f(x)f(y)|L|xy||\nabla f(x)-\nabla f(y)|\leq L|x-y|, L>0L>0 and x,ydom(f)\forall x,y\in\textit{\text{dom}}(f).

Note that if a LL-smooth function f𝒞2f\in\mathcal{C}^{2}, then the Lipschitz smoothness coefficient LL of ff satisfies |2f|L|\nabla^{2}f|\leq L. In this work, the variables are cascade of (conditional) probability mass, and a common assumption in density/entropy estimation research is non-zero minimal measure [28, 34].

Definition 3 (ε\varepsilon-infimal)

A measure f(x)f(x) is said to be  ε\varepsilon-infimal if there exists ε>0\varepsilon>0 such that infx𝒳f(x)=ε\inf_{x\in\mathcal{X}}f(x)=\varepsilon.

Assuming ε\varepsilon-infimal for a given set of primal variables pip_{i}, we have the following results:

Lemma 1

Suppose pip_{i} in (7) is εi\varepsilon_{i}-infimal i[V]\forall i\in[V], pzp_{z} is εz\varepsilon_{z}-infimal and pz|yp_{z|y} is εz|y\varepsilon_{z|y}-infimal. Then FiF_{i} is γi/εi\gamma_{i}/\varepsilon_{i}-smooth and GG is 1/εq1/\varepsilon_{q}-smooth. Where εq:=max{|σz|/εz,1/εz|y},σz:=1i=1Vγi\varepsilon_{q}:=\max\{|\sigma_{z}|/\varepsilon_{z},1/\varepsilon_{z|y}\},\sigma_{z}:=1-\sum_{i=1}^{V}\gamma_{i}.

Proof:

For FiF_{i}, it suffices to consider a single view. Since F𝒞2F\in\mathcal{C}^{2} and Fiγi[Idiag(px(i))](logpz|x,i+𝟏)\nabla F_{i}\leq\gamma_{i}\big{[}I\otimes\text{diag}(p_{x^{(i)}})\big{]}\big{(}\log{p_{z|x,i}}+\mathbf{1}\big{)}, we have:

|2Fi|γimaxx𝒳(i),z𝒵p(x(i))p(z|x(i))=γiεi.|\nabla^{2}F_{i}|\leq\gamma_{i}\max_{x\in\mathcal{X}^{(i)},z\in\mathcal{Z}}\frac{p(x^{(i)})}{p(z|x^{(i)})}=\frac{\gamma_{i}}{\varepsilon_{i}}.

On the other hand, recall q:=[pzpz|y]Tq:=\begin{bmatrix}p_{z}&p_{z|y}\end{bmatrix}^{T}, we can separate G(q)G(q) into two parts, denotes G(pz),G(pz|y)G(p_{z}),G(p_{z|y}) respectively. For the first part G(pz)G(p_{z}):

G(pz)=(1i=1Vγi)(logpz+𝟏)=σz(logpz+𝟏).\nabla G(p_{z})=\big{(}1-\sum_{i=1}^{V}\gamma_{i}\big{)}\big{(}\log{p_{z}}+\mathbf{1}\big{)}=\sigma_{z}\big{(}\log{p_{z}}+\mathbf{1}\big{)}.

On the other hand, for pz|yp_{z|y}:

G(pz|y)=[Idiag(py)](logpz|y+𝟏).\nabla G(p_{z|y})=\big{[}I\otimes\text{diag}(p_{y})\big{]}\big{(}\log{p_{z|y}}+\mathbf{1}\big{)}.

Since G𝒞2G\in\mathcal{C}^{2}, combine the two parts:

|2G(q)|maxz𝒵,y𝒴{|σz|p(z),p(y)p(z|y)}maxz𝒵,y𝒴{|σz|εz,1εz|y}.|\nabla^{2}G(q)|\\ \leq\max_{z\in\mathcal{Z},y\in\mathcal{Y}}\bigg{\{}\frac{|\sigma_{z}|}{p(z)},\frac{p(y)}{p(z|y)}\bigg{\}}\leq\max_{z\in\mathcal{Z},y\in\mathcal{Y}}\bigg{\{}\frac{|\sigma_{z}|}{\varepsilon_{z}},\frac{1}{\varepsilon_{z|y}}\bigg{\}}. (21)

Besides the smoothness conditions, given the joint probability p(x,y)p(x,y), the (conditional) entropy functions are concave w.r.t. the associated probability mass [35]. The observation in (3) is that its non-convexity is due to a combination of differences of convex functions. By convexity, we refer to the following definition.

Definition 4 (Hypoconvexity)

A convex function f:n,f𝒞2f:\mathbb{R}^{n}\mapsto\mathbb{R},f\in\mathcal{C}^{2} is σ\sigma-hypoconvex if σ,|σ|<+\exists\sigma\in\mathbb{R},|\sigma|<+\infty such that f(y)f(x)+f(x),yx+σ2|yx|2,x,ydom(f)f(y)\geq f(x)+\langle\nabla f(x),y-x\rangle+\frac{\sigma}{2}|y-x|^{2},\forall x,y\in\textit{\text{dom}}(f). In particular, if σ>0\sigma>0, ff is strongly convex; if σ<0\sigma<0, ff is weakly convex.

In MvIB, given the px(i)p_{x^{(i)}}, it is easy to show that each Fi,i[V]F_{i},\forall i\in[V] is a convex function. On the other hand, for the function G(q)G(q), if assume εq\varepsilon_{q}-infimal, we show in the following that it is weakly convex.

Lemma 2

Let G(q)=σγH(Z)+H(Z|Y)G(q)=\sigma_{\gamma}H(Z)+H(Z|Y) and q:=[pzpz|y]Tq:=\begin{bmatrix}p_{z}&p_{z|y}\end{bmatrix}^{T}. If pz,pz|yp_{z},p_{z|y} is εz,εz|y\varepsilon_{z},\varepsilon_{z|y}-infimal measure respectively. Then the function GG is σG\sigma_{G}-weakly convex. Where σγ:=1i=1Vγi\sigma_{\gamma}:=1-\sum_{i=1}^{V}\gamma_{i} and σG:=max{(2Nz|σγ|)/εz,(2NzNy)/εz|y}\sigma_{G}:=\max\{(2N_{z}|\sigma_{\gamma}|)/\varepsilon_{z},(2N_{z}N_{y})/\varepsilon_{z|y}\} with Nz=|𝒵|,Ny=|𝒴|N_{z}=|\mathcal{Z}|,N_{y}=|\mathcal{Y}|.

Proof:

see Appendix B. ∎

From Lemma 2, it turns out that the MvIB objective is a multi-block consists of VV convex FiF_{i} in addition to a weakly convex GG. This decomposition of the non-convexity of the overall objective enables us to generalize the recent strongly-weakly pair non-convex ADMM convergence results to consensus ADMM [36, 37].

If a function satisfies the KŁ properties, then its rate of convergence can be determined in terms of its Łojasiewicz exponent.

Definition 5

A function f(x):R|𝒳|Rf(x):R^{|\mathcal{X}|}\mapsto R is said to satisfy the Łojasweicz inequality if there exists an exponent θ[0,1)\theta\in[0,1), δ>0\delta>0 and a critical point xΩx^{*}\in\Omega^{*} with a constant C>0C>0, and a neighborhood xxε\lVert x-x^{*}\rVert\leq\varepsilon such that:

|f(x)f(x)|θCdist(0,f(x)).\left|f(x)-f(x^{*})\right|^{\theta}\leq C\text{dist}(0,\nabla f(x)).

In literature, there are a broad class of functions that are known to satisfy the KŁ properties, in particular, the oo-minimal structure (e.g., sub-analytic, semi-algebraic functions [19]).

Definition 6

A function f(x):R|𝒳|Rf(x):R^{|\mathcal{X}|}\mapsto R is said to satisfy the Kurdyka-Łojasiewicz inequality if there exists a neighborhood around q¯\bar{q} and a level set Q:={q|qΩ,f(q)<f(q¯)<f(q)+η}Q:=\{q|q\in\Omega,f(q)<f(\bar{q})<f(q)+\eta\} with a margin η>0\eta>0 and a continuous concave function φ(s):[0,η)+\varphi(s):[0,\eta)\rightarrow\mathbb{R}_{+}, such that the following inequality holds:

φ(f(q)f(q¯))dist(0,f(q))1,\varphi^{\prime}(f(q)-f(\bar{q}))\textit{dist}(0,\partial f(q))\geq 1, (22)

where f\partial f denotes the sub-gradient of f()f(\cdot) for non-smooth functions, and dist(y,A):=infx𝒜xy\textit{dist}(y,A):=\inf_{x\in\mathcal{A}}\lVert x-y\rVert is defined as the distance of a set AA to a fixed point yy if exists. Note that if φ(s):=c¯s1θ,c¯>0\varphi(s):=\bar{c}s^{1-\theta},\bar{c}>0 then is recovers the definition of Łojasiewicz inequality.

The following elementary identity is useful in the convergence analysis, we list it for completeness.

2uv,wu=vw2uv2wu2.2\langle u-v,w-u\rangle=\lVert v-w\rVert^{2}-\lVert u-v\rVert^{2}-\lVert w-u\rVert^{2}. (23)

Lastly, by “linear” rate of convergence, we refer to the definition in [38]:

Definition 7

Let {wk}\{w^{k}\} be a sequence in n\mathbb{R}^{n} that converges to a stationary point ww^{*} when k>K0k>K_{0}\in\mathbb{N}. If it converges QQ-linearly, then Q(0,1)\exists Q\in(0,1) such that

wk+1wwkwQ,k>K0.\frac{\lVert w^{k+1}-w^{*}\rVert}{\lVert w^{k}-w^{*}\rVert}\leq Q,\quad\forall k>K_{0}.

On the other hand, the convergence of the sequence is RR-linear if there is QQ-linearly convergent sequence {μk},k,μk0\{\mu^{k}\},\forall k\in\mathbb{N},\mu^{k}\geq 0 such that:

wkwμk,k.\lVert w^{k}-w^{*}\rVert\leq\mu^{k},\quad\forall k\in\mathbb{N}.

A-B Convergence and Rate Analysis

As in the standard convex setting, to prove convergence of consensus ADMM, we simply need to establish a sufficient decrease lemma [39, 18]. However, since the MvIB problem is non-convex, the sub-objectives cannot be viewed as a monotone operator which leads to convergence naturally. Our key insight is that the non-convexity of the MvIB problem can be separated into a combination of a set of convex sub-objective FiF_{i} and a single weakly-convex sub-objective GG which can be exploited to show convergence. This result requires certain smoothness conditions to be satisfied, which is a consequence when assuming ε\varepsilon-infimality (Definition 3) on the primal pi,i[V]p_{i},\forall i\in[V] and the augmented variables qq. With these smoothness conditions, it can be easily shown that FiF_{i} is MiM_{i}-Lipschitz continuous and LiL_{i}-smooth w.r.t. pip_{i} for some Mi,Li>0M_{i},L_{i}>0. In addition to the properties of the sub-objective functions, it turns out that the structural advantages in consensus ADMM allows us to connect the dual update to the gradient of FiF_{i}, and we can therefore establish the desired results. Before presenting the results, we summarize the minimizer conditions as follows:

0\displaystyle 0 =Fi(pik+1)+AiT[νik+(Aipik+1qk)]\displaystyle=\nabla F_{i}(p_{i}^{k+1})+A_{i}^{T}\left[\nu_{i}^{k}+\left(A_{i}p_{i}^{k+1}-q^{k}\right)\right]
=Fi(pik+1)+AiTνik+1,\displaystyle=\nabla F_{i}(p_{i}^{k+1})+A_{i}^{T}\nu_{i}^{k+1}, (24a)
νik+1\displaystyle\nu_{i}^{k+1} =νik+c(Aipik+1qk),\displaystyle=\nu_{i}^{k}+c\left(A_{i}p_{i}^{k+1}-q^{k}\right), (24b)
0\displaystyle 0 =G(qk+1)i=1Vνik+1c(Aipik+1qk+1),\displaystyle=\nabla G(q^{k+1})-\sum_{i=1}^{V}\nu_{i}^{k+1}-c\left(A_{i}p_{i}^{k+1}-q^{k+1}\right), (24c)

where i[V]i\in[V] denotes the view index. Following this, suppose there exists a stationary point w:=({pi},{νi},q)w^{*}:=(\{p_{i}^{*}\},\{\nu_{i}^{*}\},q^{*}) such that c(w)=0\nabla\mathcal{L}_{c}(w^{*})=0, (24) reduces to:

Aipi=q,Fi(pi)=AiTpi,G(q)=i=1Vνi.A_{i}p_{i}^{*}=q^{*},\quad\nabla F_{i}(p_{i}^{*})=-A_{i}^{T}p_{i}^{*},\quad\nabla G(q^{*})=\sum_{i=1}^{V}\nu_{i}^{*}. (25)

Furthermore, we impose the following set of assumptions to facilitate the convergence analysis:

Assumption A
  • There exists stationary points w:=({pi},q,{νi})w^{*}:=(\{p_{i}^{*}\},q^{*},\{\nu_{i}^{*}\}) that belong to a set Ω:={w|wΩ,c(w)=0}\Omega^{*}:=\{w|w\in\Omega,\nabla\mathcal{L}_{c}(w)=0\}.

  • Fi(pi),i[V]F_{i}(p_{i}),\forall i\in[V] is LiL_{i}-smooth, MiM_{i}-Lipschitz continuous and convex; G(q)G(q) is LqL_{q}-smooth and σG\sigma_{G}-weakly convex.

  • The penalty coefficient satisfies:

    c>maxi[V]{(MiλAiLi)/μAiAiT,σG/V}.c>\max_{i\in[V]}\{(M_{i}\lambda_{A_{i}}L_{i})/\mu_{A_{i}A_{i}^{T}},\sigma_{G}/V\}.

With (24) and assumption A, we can establish the sufficient decrease lemma, which is given below.

Lemma 3

Define c\mathcal{L}_{c} as in (9). If Assumption A is satisfied and the sequence {wk}k\{w^{k}\}_{k\in\mathbb{N}}, with wk:=({pik},{νil},qk)w^{k}:=(\{p_{i}^{k}\},\{\nu_{i}^{l}\},q^{k}), is obtained from the algorithm (11), then we have:

ckck+1i=1V[c2Mi2λAi2Li2cμAiAiT2]pik+1pik2.\mathcal{L}_{c}^{k}-\mathcal{L}_{c}^{k+1}\geq\sum_{i=1}^{V}\left[\frac{c}{2M_{i}^{2}}-\frac{\lambda^{2}_{A_{i}}L^{2}_{i}}{c\mu^{2}_{A_{i}A_{i}^{T}}}\right]\lVert p_{i}^{k+1}-p_{i}^{k}\rVert^{2}.

where μB,λB\mu_{B},\lambda_{B} denote the smallest and largest singular value of a matrix BB respectively.

Proof:

See Appendix C. ∎

As a remark, Lemma 3 implies the convergence of the non-convex ADMM-based algorithm depends on a sufficient large penalty coefficient cc, and the minimum value that assures this, in turns, relies on the properties of the sub-objective functions Fi,i[V]F_{i},\forall i\in[V] and GG. Note that both the Lipschitz continuity and the smoothness of FiF_{i} are exploited to prove Lemma 3 which corresponds to the sub-minimization path method developed in [20] as Fi,i[V]F_{i},\forall i\in[V] are convex, and the connection between dual update and the gradient of FiF_{i} [40, 37].

In addition to convergence, it turns out that we can follow the recent results in optimization mathematics that adopt the KŁ inequality to analyze the convergence and hence the rate of convergence for non-convex ADMM and prove that the proposed algorithms converge linearly. It is worth noting that the linear rate obtained through this framework is not uniform over the whole parameter space but converge linearly when the solution is located in the vicinity around local stationary points. In other words, the rate of convergence is locally linear. In the following we aim to use the KŁ inequality to prove locally linear rate for the proposed two algorithms (11) and (15). As summarized in [40], three elements are needed to adopt the KŁ inequality: 1) a sufficient decrease lemma. 2) Showing the Łojasiewicz exponent θ\theta of the objective function \mathcal{L}, solved with the algorithm to be analyzed is 1/21/2 and 3) Contraction of the gradients |k|C|wkwk1|,C>0|\nabla\mathcal{L}^{k}|\leq C^{*}|w^{k}-w^{k-1}|,C^{*}>0. Since we already have the first element, we can focus on the others. The desired result θ=1/2\theta=1/2 can be obtained through the following lemma.

Lemma 4

Let c\mathcal{L}_{c} be defined as in (9) and Assumption A is satisfied. Let the sequence {wk}k\{w^{k}\}_{k\in\mathbb{N}} be obtained from the algorithm 11 with wk:=({pik},{νik},qk)w^{k}:=(\{p^{k}_{i}\},\{\nu^{k}_{i}\},q^{k}) be the step kk collective point. Then the Łojasiewicz exponent of c\mathcal{L}_{c} is θ=1/2\theta=1/2 around a neighborhood of a stationary point ww^{*} such that |ww|<ε|w-w^{*}|<\varepsilon and c(w)c(w)<c(w)+δ\mathcal{L}_{c}(w^{*})\leq\mathcal{L}_{c}(w)<\mathcal{L}_{c}(w^{*})+\delta for ε,δ>0\varepsilon,\delta>0.

Proof:

See Appendix D

The element to adopt the KŁ inequality is the contraction of the gradients of the augmented Lagrangian between consecutive updates.

Lemma 5

Let c\mathcal{L}_{c} be defined as in (9). Suppose Assumption A is satisfied and the sequence {wk}k\{w^{k}\}_{k\in\mathbb{N}} obtained through the algorithm (11) where wk:=({pik},{νik},q)w^{k}:=(\{p_{i}^{k}\},\{\nu^{k}_{i}\},q) denotes the collective point at step kk, then there exists a constant C>0C^{*}>0 such that the following holds:

ckCwkwk1.\lVert\nabla\mathcal{L}_{c}^{k}\rVert\leq C^{*}\lVert w^{k}-w^{k-1}\rVert.
Proof:

See Appendix E. ∎

Then combining the lemmas, we can prove the locally linear rate of convergence of the non-convex consensus ADMM algorithm to form consensus MvIB representation. To be self-contained, the framework to adopt the KŁ inequality is summarized in the following. Note that this result characterizes the rate of convergence into three regions in terms of the value of the Łojasiewicz exponent and θ=1/2\theta=1/2 suffices in our case. For the complete characterization, we refer to [19, 40] for details.

Lemma 6 (Theorem 2 [19])

Assume that a function c({pi},{νi},q)\mathcal{L}_{c}(\{p_{i}\},\{\nu_{i}\},q) satisfies the KŁ properties, define wkw^{k} the collective point at step kk, and let {wk}k\{w^{k}\}_{k\in\mathbb{N}} be a sequence generated by the algorithm (11). Suppose {wk}k\{w^{k}\}_{k\in\mathbb{N}} is bounded and the following relation holds:

ckCwkwk1,\lVert\nabla\mathcal{L}_{c}^{k}\rVert\leq C^{*}\lVert w^{k}-w^{k-1}\rVert,

where ck:=c({pik},{νik},q)\mathcal{L}_{c}^{k}:=\mathcal{L}_{c}(\{p_{i}^{k}\},\{\nu_{i}^{k}\},q) and C>0C^{*}>0 is some constants. Denote the Łojasiewicz exponent of c\mathcal{L}_{c} with {w}\{w^{\infty}\} as θ\theta. Then the following holds:

  1. 1.

    If θ=0\theta=0, the sequence {wk}k\{w^{k}\}_{k\in\mathbb{N}} converges in a finite number of steps,

  2. 2.

    If θ(0,1/2]\theta\in(0,1/2] then there exist τ>0\tau>0 and Q[0,1)Q\in[0,1) such that

    |wkw|τQk,|w^{k}-w^{\infty}|\leq\tau Q^{k},
  3. 3.

    If θ(1/2,1)\theta\in(1/2,1) then there exists τ>0\tau>0 such that

    |wkw|τk1θ2θ1.|w^{k}-w^{\infty}|\leq\tau k^{-\frac{1-\theta}{2\theta-1}}.

Overall, the three elements for applying the KŁ inequality are obtained from Lemma 3, 4 and 5. Then, by using Lemma 6 we prove the linear rate of convergence.

Theorem 1

Let c\mathcal{L}_{c} be defined as in (11c), and Assumption A is satisfied, then the sequence {wk}k\{w_{k}\}_{k\in\mathbb{N}} obtained through the algorithm (11) where wk:=({pik},{νik},q)w_{k}:=(\{p^{k}_{i}\},\{\nu^{k}_{i}\},q) denote the collective point at step kk converges QQ-linearly around a neighborhood of a stationary point ww^{*}, satisfying ww<ε\lVert w-w^{*}\rVert<\varepsilon and c<c<c+δ\mathcal{L}_{c}^{*}<\mathcal{L}_{c}<\mathcal{L}_{c}+\delta for some ε,δ>0\varepsilon,\delta>0.

Proof:

See Appendix F. ∎

Appendix B Proof of Lemma 2

By definition, for two arbitrary qm,qnΩgq^{m},q^{n}\in\Omega_{g}, G(qm)G(qn)G(q^{m})-G(q^{n}) consists of two parts. For the first part, we have:

G(pzm)G(pzn)=(1i=1Vγi)[H(Zm)H(Zn)]=σγ[H(Zm)H(Zn)]\begin{split}G(p_{z}^{m})-G(p_{z}^{n})&=(1-\sum_{i=1}^{V}\gamma_{i})[H(Z^{m})-H(Z^{n})]\\ &=\sigma_{\gamma}[H(Z^{m})-H(Z^{n})]\end{split}

If σγ<0\sigma_{\gamma}<0 then G(pz)G(p_{z}) is a scaled negative entropy function w.r.t. pzp_{z} which is therefore |σγ||\sigma_{\gamma}|-strongly convex. In turns, the positive squared norm introduced by strong convexity is always greater than zero, which is 0-weakly convex. On the other hand, if σγ>0\sigma_{\gamma}>0, let σγ=1\sigma_{\gamma}=1 without loss of generosity, we have:

H(Zm)H(Zn)=zpzmpzn,logpznDKL(pzm||pzn)H(Zn),pzmpzn1εzpzmpzn12H(Zn),pzmpznNzεzpzmpzn22\begin{split}{}&H(Z^{m})-H(Z^{n})\\ =&\sum_{z}\langle p_{z}^{m}-p_{z}^{n},-\log{p_{z}^{n}}\rangle-D_{KL}(p_{z}^{m}||p_{z}^{n})\\ \geq&\langle\nabla H(Z^{n}),p_{z}^{m}-p_{z}^{n}\rangle-\frac{1}{\varepsilon_{z}}\lVert p_{z}^{m}-p_{z}^{n}\rVert^{2}_{1}\\ \geq&\langle\nabla H(Z^{n}),p_{z}^{m}-p_{z}^{n}\rangle-\frac{N_{z}}{\varepsilon_{z}}\lVert p_{z}^{m}-p_{z}^{n}\rVert^{2}_{2}\end{split} (26)

where the first inequality is due to the reverse Pinsker’s inequality [41] while the last inequality is due to norm bound: x1Nx2,xn\lVert x\rVert_{1}\leq\sqrt{N}\lVert x\rVert_{2},x\in\mathbb{R}^{n}. Then, for the second part, consider the following:

H(Zm|Y)H(Zn|Y)\displaystyle H(Z^{m}|Y)-H(Z^{n}|Y)
=yp(y)[pz|Ympz|Yn,logpz|YmDKL(pz|Ym||pz|Yn)]\displaystyle=\sum_{y}p(y)[\langle p^{m}_{z|Y}-p^{n}_{z|Y},-\log{p^{m}_{z|Y}}\rangle-D_{KL}(p^{m}_{z|Y}||p^{n}_{z|Y})]
H(Zm|Y),pz|ympz|ynEy[1ϵz|ypz|Ympz|Yn12]\displaystyle\geq\langle\nabla H(Z^{m}|Y),p^{m}_{z|y}-p^{n}_{z|y}\rangle-E_{y}\bigl{[}\frac{1}{\epsilon_{z|y}}\lVert p^{m}_{z|Y}-p^{n}_{z|Y}\rVert_{1}^{2}\bigr{]}
H(Zm|Y),pz|ympz|ynNzNyϵz|ypz|ympz|yn22.\displaystyle\geq\langle\nabla H(Z^{m}|Y),p^{m}_{z|y}-p^{n}_{z|y}\rangle-\frac{N_{z}N_{y}}{\epsilon_{z|y}}\lVert p^{m}_{z|y}-p^{n}_{z|y}\rVert_{2}^{2}.

where the first and second inequality follows the same reasons in (26). Combining the above discussions, we conclude that G(q)G(q) is σG\sigma_{G}-weakly convex where:

σG:=max{2Nz|σγ|εz,NzNyεz|y}.\sigma_{G}:=\max\{\frac{2N_{z}|\sigma_{\gamma}|}{\varepsilon_{z}},\frac{N_{z}N_{y}}{\varepsilon_{z|y}}\}.

Appendix C Proof of Lemma 3

We divide the proof into three parts according to pi,νi,qp_{i},\nu_{i},q updates sequentially.

First, for each view i[V]i\in[V], the pip_{i} updates from step kk to k+1k+1 with {νik},qk\{\nu_{i}^{k}\},q^{k} fixed, denote c(pik)\mathcal{L}_{c}(p_{i}^{k}):

c(pik)c(pik+1)=Fi(pik)Fi(pik+1)+νik,Ai(pikpik+1)+c2Aipikqk2c2Aipik+1qk2Fi(pik+1),pikpik+1+νik,Ai(pikpik+1)+c2Aipikqk2c2Aipik+1qk2=cAipik+1qk,Ai(pikpik+1)+c2Aipikqk2c2Aipik+1qk2=c2AipikAipik+12,\begin{split}{}&\mathcal{L}_{c}(p_{i}^{k})-\mathcal{L}_{c}(p_{i}^{k+1})\\ =&\begin{multlined}F_{i}(p_{i}^{k})-F_{i}(p_{i}^{k+1})+\langle\nu_{i}^{k},A_{i}(p_{i}^{k}-p_{i}^{k+1})\rangle\\ +\frac{c}{2}\lVert A_{i}p_{i}^{k}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}\end{multlined}F_{i}(p_{i}^{k})-F_{i}(p_{i}^{k+1})+\langle\nu_{i}^{k},A_{i}(p_{i}^{k}-p_{i}^{k+1})\rangle\\ +\frac{c}{2}\lVert A_{i}p_{i}^{k}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}\\ \geq&\begin{multlined}\langle\nabla F_{i}(p_{i}^{k+1}),p_{i}^{k}-p_{i}^{k+1}\rangle+\langle\nu_{i}^{k},A_{i}(p_{i}^{k}-p_{i}^{k+1})\rangle\\ +\frac{c}{2}\lVert A_{i}p_{i}^{k}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}\end{multlined}\langle\nabla F_{i}(p_{i}^{k+1}),p_{i}^{k}-p_{i}^{k+1}\rangle+\langle\nu_{i}^{k},A_{i}(p_{i}^{k}-p_{i}^{k+1})\rangle\\ +\frac{c}{2}\lVert A_{i}p_{i}^{k}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}\\ =&\begin{multlined}-c\langle A_{i}p_{i}^{k+1}-q^{k},A_{i}(p_{i}^{k}-p_{i}^{k+1})\rangle\\ +\frac{c}{2}\lVert A_{i}p_{i}^{k}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}\end{multlined}-c\langle A_{i}p_{i}^{k+1}-q^{k},A_{i}(p_{i}^{k}-p_{i}^{k+1})\rangle\\ +\frac{c}{2}\lVert A_{i}p_{i}^{k}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}\\ =&\frac{c}{2}\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert^{2},\end{split} (27)

where the inequality is due to the convexity of FiF_{i} and the last two lines follow the minimizer condition (24a) and the identity (23) respectively.

Second, for the dual update of each view:

c(νik)c(νik+1)=1cAipik+1qk2.\mathcal{L}_{c}(\nu_{i}^{k})-\mathcal{L}_{c}(\nu_{i}^{k+1})=-\frac{1}{c}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}. (28)

Lastly, for the qq updates, from the σG\sigma_{G}-weak convexity of the sub-objective function GG:

c(qk)c(qk+1)=G(qk)G(qk+1)+i=1V[νik+1,qk+1qk+c2Aipik+1qk2c2Aipik+1qk+12]G(qk+1),qkqk+1σG2qkqk+12+i=1V[νik+1,qk+1qk+c2Aipik+1qk2c2Aipik+1qk+12]=ci=1V[Aipik+1qk+1,qkqk+1+c2Aipik+1qk2c2Aipik+1qk+12]σG2qkqk+12.=cVσG2qkqk+12.\begin{split}{}&\mathcal{L}_{c}(q^{k})-\mathcal{L}_{c}(q^{k+1})\\ =&\begin{multlined}G(q^{k})-G(q^{k+1})+\sum_{i=1}^{V}\left[\langle\nu_{i}^{k+1},q^{k+1}-q^{k}\rangle\right.\\ \left.+\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}\vphantom{\langle\nu_{i}^{k+1}}\right]\end{multlined}G(q^{k})-G(q^{k+1})+\sum_{i=1}^{V}\left[\langle\nu_{i}^{k+1},q^{k+1}-q^{k}\rangle\right.\\ \left.+\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}\vphantom{\langle\nu_{i}^{k+1}}\right]\\ \geq&\begin{multlined}\langle\nabla G(q^{k+1}),q^{k}-q^{k+1}\rangle-\frac{\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}\\ +\sum_{i=1}^{V}\left[\langle\nu_{i}^{k+1},q^{k+1}-q^{k}\rangle+\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}\right.\\ \left.-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}\vphantom{\langle\nu_{i}^{k+1}}\right]\end{multlined}\langle\nabla G(q^{k+1}),q^{k}-q^{k+1}\rangle-\frac{\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}\\ +\sum_{i=1}^{V}\left[\langle\nu_{i}^{k+1},q^{k+1}-q^{k}\rangle+\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}\right.\\ \left.-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}\vphantom{\langle\nu_{i}^{k+1}}\right]\\ =&\begin{multlined}c\sum_{i=1}^{V}\left[\langle A_{i}p_{i}^{k+1}-q^{k+1},q^{k}-q^{k+1}\rangle\right.\\ \left.+\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}\vphantom{\langle A_{i}}\right]\\ \frac{-\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}.\end{multlined}c\sum_{i=1}^{V}\left[\langle A_{i}p_{i}^{k+1}-q^{k+1},q^{k}-q^{k+1}\rangle\right.\\ \left.+\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k}\rVert^{2}-\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}\vphantom{\langle A_{i}}\right]\\ \frac{-\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}.\\ &=\frac{cV-\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}.\end{split} (29)

Combining (27) (28) i[V]\forall i\in[V] and (29), we have:

ckck+1cVσG2qkqk+12+i=1V[c2AipikAipik+121cνikνik+12],\mathcal{L}_{c}^{k}-\mathcal{L}_{c}^{k+1}\geq\frac{cV-\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}\\ +\sum_{i=1}^{V}\left[\frac{c}{2}\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert^{2}-\frac{1}{c}\lVert\nu_{i}^{k}-\nu_{i}^{k+1}\rVert^{2}\vphantom{\frac{c}{2}}\right],

where ck\mathcal{L}_{c}^{k} denotes the augmented Lagrangian evaluated with step kk solution. The next step is to address the negative squared norm νikνik+1\lVert\nu_{i}^{k}-\nu_{i}^{k+1}\rVert. Since AiA_{i} is full-row rank i[V]\forall i\in[V], consider the following:

(AiAiT)1AiAiT(νikνik+1)2λAi2μAiAiT2AiT(νikνik+1)2,\lVert(A_{i}A_{i}^{T})^{-1}A_{i}A_{i}^{T}(\nu_{i}^{k}-\nu_{i}^{k+1})\rVert^{2}\leq\frac{\lambda_{A_{i}}^{2}}{\mu^{2}_{A_{i}A_{i}^{T}}}\lVert A_{i}^{T}(\nu_{i}^{k}-\nu_{i}^{k+1})\rVert^{2}, (30)

where μB,λB\mu_{B},\lambda_{B} denotes the smallest and largest singular value of a linear operator BB, we connect the gradient of FiF_{i} to the dual update:

νikνik+12λAi2μAiAiT2Fi(pik)Fi(pik+1)2Li2λAi2μAiAiT2pikpik+12,\lVert\nu_{i}^{k}-\nu_{i}^{k+1}\rVert^{2}\leq\frac{\lambda_{A_{i}}^{2}}{\mu^{2}_{A_{i}A_{i}^{T}}}\lVert\nabla F_{i}(p_{i}^{k})-\nabla F_{i}(p_{i}^{k+1})\rVert^{2}\\ \leq\frac{L_{i}^{2}\lambda^{2}_{A_{i}}}{\mu^{2}_{A_{i}A_{i}^{T}}}\lVert p_{i}^{k}-p_{i}^{k+1}\rVert^{2}, (31)

where the last inequality is due to the LiL_{i}-smoothness of FiF_{i}. After applying (31), we need the relation:

AipikAipik+1Mpikpik+1,\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert\geq M^{*}\lVert p_{i}^{k}-p_{i}^{k+1}\rVert,

for a constant M>0M^{*}>0. Note that by definition, Ai,i[V]A_{i},\forall i\in[V] is full-row rank, hence the above relation is non-trivial. Fortunately, since each FiF_{i} is convex w.r.t. pip_{i}, which assures a unique minimizer, we can follow the sub-minimization technique recently developed in [20] to establish the desired relation. By defining the following proximal operator:

Ψi(η):=argminpiΩiFi(pi)+c2Aipiη2,\Psi_{i}(\eta):=\underset{p_{i}\in\Omega_{i}}{\arg\min}\,F_{i}(p_{i})+\frac{c}{2}\lVert A_{i}p_{i}-\eta\rVert^{2},

which coincides with the pip_{i} update, we have:

Ψi(Aipik)Ψi(Aipik+1)=pikpik+1MiAipikAipik+1,\lVert\Psi_{i}(A_{i}p_{i}^{k})-\Psi_{i}(A_{i}p_{i}^{k+1})\rVert=\lVert p_{i}^{k}-p_{i}^{k+1}\rVert\\ \leq M_{i}\lVert A_{i}p_{i}^{k}-A_{i}p_{i}^{k+1}\rVert, (32)

where the last inequality is due to the Lipschitz continuity of FiF_{i}. Applying (32) to (27) and we have:

ckck+1cVσG2qkqk+12+i=1V[c2Mi2λAi2Li2cμAiAiT2]pikpik+12.\mathcal{L}_{c}^{k}-\mathcal{L}_{c}^{k+1}\geq\frac{cV-\sigma_{G}}{2}\lVert q^{k}-q^{k+1}\rVert^{2}\\ +\sum_{i=1}^{V}\left[\frac{c}{2M_{i}^{2}}-\frac{\lambda_{A_{i}}^{2}L_{i}^{2}}{c\mu^{2}_{A_{i}A_{i}^{T}}}\right]\lVert p_{i}^{k}-p_{i}^{k+1}\rVert^{2}.

This completes the proof.

Appendix D Proof of Lemma 4

The proof is divided into two parts. We first establish the following relation between a step k+1k+1 solution and a stationary point w:=({pi},{νi},q)w^{*}:=(\{p_{i}^{*}\},\{\nu_{i}^{*}\},q^{*}):

ck+1cc2i=1VAipik+1Aipi2.\mathcal{L}_{c}^{k+1}-\mathcal{L}_{c}^{*}\leq\frac{c}{2}\sum_{i=1}^{V}\lVert A_{i}p_{i}^{k+1}-A_{i}p_{i}^{*}\rVert^{2}. (33)

This is accomplished by the following relations. Start from ck+1c\mathcal{L}_{c}^{k+1}-\mathcal{L}_{c}^{*}, For Fi,i[V]F_{i},\forall i\in[V] differences, using the convexity and the minimizer condition (24a):

Fi(pik+1)Fi(pi)νik+1,Aipik+1Aipi,\begin{split}F_{i}(p_{i}^{k+1})-F_{i}(p_{i}^{*})\leq-\langle\nu_{i}^{k+1},A_{i}p_{i}^{k+1}-A_{i}p^{*}_{i}\rangle,\end{split} (34)

where we use the reduction of the minimizer conditions at a stationary point (25) to have Aipi=q,i[V]A_{i}p_{i}^{*}=q^{*},\forall i\in[V]. As for the G(q)G(q) difference:

G(qk+1)G(q)i=1Vνik+1,qk+1q+ci=1VAipik+1qk+1,qk+1q+σG2qk+1q2.G(q^{k+1})-G(q^{*})\leq\sum_{i=1}^{V}\langle\nu_{i}^{k+1},q^{k+1}-q^{*}\rangle\\ +c\sum_{i=1}^{V}\langle A_{i}p_{i}^{k+1}-q^{k+1},q^{k+1}-q^{*}\rangle+\frac{\sigma_{G}}{2}\lVert q^{k+1}-q^{*}\rVert^{2}. (35)

Note that by assumption c>σG/Vc>\sigma_{G}/V. Therefore, by combining the two with the inner products where the dual variables νik+1\nu_{i}^{k+1} associated with, we have the desired result (33). The second part is to construct the following relation:

c(wk+1)2i=1V(c2μAiAiT+1)Aipik+1qk+12,\lVert\nabla\mathcal{L}_{c}(w^{k+1})\rVert^{2}\geq\sum_{i=1}^{V}\left(c^{2}\mu_{A_{i}A_{i}^{T}}+1\right)\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}, (36)

which is straightforward to show since:

c(wk+1)=[Fi(pik+1)+AiT[νik+1+c(Aipik+1qk+1)]Aipik+1qk+1G(qk+1)i=1V[νik+1+c(Aipik+1qk+1)]]=[cAiT(Aipik+1qk+1)Aipik+1qk+10],\begin{split}{}&\nabla\mathcal{L}_{c}(w^{k+1})\\ =&\begin{bmatrix}\nabla F_{i}(p_{i}^{k+1})+A_{i}^{T}\left[\nu_{i}^{k+1}+c\left(A_{i}p_{i}^{k+1}-q^{k+1}\right)\right]\\ A_{i}p_{i}^{k+1}-q^{k+1}\\ \nabla G(q^{k+1})-\sum_{i=1}^{V}\left[\nu_{i}^{k+1}+c\left(A_{i}p_{i}^{k+1}-q^{k+1}\right)\right]\end{bmatrix}\\ =&\begin{bmatrix}cA_{i}^{T}\left(A_{i}p_{i}^{k+1}-q^{k+1}\right)\\ A_{i}p_{i}^{k+1}-q^{k+1}\\ 0\end{bmatrix},\end{split} (37)

where the last equality is due to the minimizer conditions (24). Note that in (37), for simplicity, we derive the relation considering a single i[V]i\in[V] for pip_{i} and νi\nu_{i}. With (33) and (36), consider a neighborhood around a stationary point ww^{*} such that |w¯w|<ε|\bar{w}-w^{*}|<\varepsilon and c<¯c<c+δ\mathcal{L}^{*}_{c}<\bar{\mathcal{L}}_{c}<\mathcal{L}^{*}_{c}+\delta for some ε,δ>0\varepsilon,\delta>0, then we have:

ck+1ci=1V[c2Aipik+1Aipi2+(c2μAiAiT+1)Aipik+1qk+12]i=1Vc2Aipik+1Aipi2+ck+12ck+12(1+cε22ck+12)ck+12(1+cε22η2),\begin{split}{}&\mathcal{L}^{k+1}_{c}-\mathcal{L}_{c}^{*}\\ \leq&\begin{multlined}\sum_{i=1}^{V}\left[\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-A_{i}p_{i}^{*}\rVert^{2}\right.\\ \left.+(c^{2}\mu_{A_{i}A_{i}^{T}}+1)\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}\vphantom{\frac{c}{2}}\right]\end{multlined}\sum_{i=1}^{V}\left[\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-A_{i}p_{i}^{*}\rVert^{2}\right.\\ \left.+(c^{2}\mu_{A_{i}A_{i}^{T}}+1)\lVert A_{i}p_{i}^{k+1}-q^{k+1}\rVert^{2}\vphantom{\frac{c}{2}}\right]\\ \leq&\sum_{i=1}^{V}\frac{c}{2}\lVert A_{i}p_{i}^{k+1}-A_{i}p_{i}^{*}\rVert^{2}+\lVert\nabla\mathcal{L}_{c}^{k+1}\rVert^{2}\\ \leq&\lVert\nabla\mathcal{L}_{c}^{k+1}\rVert^{2}\left(1+\frac{c\varepsilon^{2}}{2\lVert\nabla\mathcal{L}_{c}^{k+1}\rVert^{2}}\right)\\ \leq&\lVert\nabla\mathcal{L}_{c}^{k+1}\rVert^{2}\left(1+\frac{c\varepsilon^{2}}{2\eta^{2}}\right),\end{split} (38)

where in the last equality, by construction, because ck+1\mathcal{L}_{c}^{k+1} is not a stationary point, there exists a η>0\eta>0 such that |ck+1|>η|\nabla\mathcal{L}_{c}^{k+1}|>\eta [27, Lemma 2.1]. Then we complete the proof as (38) implies that the Łojasiewicz exponent θ=1/2\theta=1/2.

Appendix E Proof of Lemma 5

From (37), we have:

ck2i=1V(c2λAiAiT+1)Aipikqk22i=1V(c2λAiAiT+1)[Aipikqk12+qkqk12]=2i=1V[(λAiAiT+1c2)νikνik12+(c2λAiAiT+1)qkqk12]2i=1V(λAiAiT+1c2)[νikνik12+pikpik12]+2i=1V(c2λAiAiT+1)qkqk12,\begin{split}{}&\lVert\nabla\mathcal{L}_{c}^{k}\rVert^{2}\\ \leq&\sum_{i=1}^{V}\left(c^{2}\lambda_{A_{i}A_{i}^{T}}+1\right)\lVert A_{i}p_{i}^{k}-q^{k}\rVert^{2}\\ \leq&2\sum_{i=1}^{V}\left(c^{2}\lambda_{A_{i}A_{i}^{T}}+1\right)\left[\lVert A_{i}p_{i}^{k}-q^{k-1}\rVert^{2}+\lVert q^{k}-q^{k-1}\rVert^{2}\right]\\ =&\begin{multlined}2\sum_{i=1}^{V}\left[\left(\lambda_{A_{i}A_{i}^{T}}+\frac{1}{c^{2}}\right)\lVert\nu_{i}^{k}-\nu_{i}^{k-1}\rVert^{2}\right.\\ \left.+\left(c^{2}\lambda_{A_{i}A_{i}^{T}}+1\right)\lVert q^{k}-q^{k-1}\rVert^{2}\vphantom{\lambda_{A_{i}A_{i}^{T}}}\right]\end{multlined}2\sum_{i=1}^{V}\left[\left(\lambda_{A_{i}A_{i}^{T}}+\frac{1}{c^{2}}\right)\lVert\nu_{i}^{k}-\nu_{i}^{k-1}\rVert^{2}\right.\\ \left.+\left(c^{2}\lambda_{A_{i}A_{i}^{T}}+1\right)\lVert q^{k}-q^{k-1}\rVert^{2}\vphantom{\lambda_{A_{i}A_{i}^{T}}}\right]\\ \leq&\begin{multlined}2\sum_{i=1}^{V}\left(\lambda_{A_{i}A_{i}^{T}}+\frac{1}{c^{2}}\right)\left[\lVert\nu_{i}^{k}-\nu_{i}^{k-1}\rVert^{2}+\lVert p_{i}^{k}-p_{i}^{k-1}\rVert^{2}\right]\\ +2\sum_{i=1}^{V}\left(c^{2}\lambda_{A_{i}A_{i}^{T}}+1\right)\lVert q^{k}-q^{k-1}\rVert^{2},\end{multlined}2\sum_{i=1}^{V}\left(\lambda_{A_{i}A_{i}^{T}}+\frac{1}{c^{2}}\right)\left[\lVert\nu_{i}^{k}-\nu_{i}^{k-1}\rVert^{2}+\lVert p_{i}^{k}-p_{i}^{k-1}\rVert^{2}\right]\\ +2\sum_{i=1}^{V}\left(c^{2}\lambda_{A_{i}A_{i}^{T}}+1\right)\lVert q^{k}-q^{k-1}\rVert^{2},\end{split} (39)

Then there exists a positive constant C>0C^{*}>0 such that:

ckCwkwk1,\lVert\nabla\mathcal{L}_{c}^{k}\rVert\leq C^{*}\lVert w^{k}-w^{k-1}\rVert,

where wk:=({pik},{νik},qk)w^{k}:=(\{p_{i}^{k}\},\{\nu_{i}^{k}\},q^{k}) denotes the step kk collective point.

Appendix F Proof of Theorem 1

We first show the convergence. By Assumption A, the sufficient decrease lemma (Lemma 3) holds. Consider a sequence {wk}k\{w_{k}\}_{k\in\mathbb{N}} obtained through the algorithm (11), by sufficient decrease lemma, there exists some constants ρp,i,ρq>0,i[V]\rho_{p,i},\rho_{q}>0,\forall i\in[V] such that:

c0ck=0[i=1Vρp,ipikpik+12+ρqqkqk+12].\mathcal{L}_{c}^{0}-\mathcal{L}_{c}^{\infty}\geq\sum_{k=0}^{\infty}\left[\sum_{i=1}^{V}\rho_{p,i}\lVert p_{i}^{k}-p_{i}^{k+1}\rVert^{2}+\rho_{q}\lVert q^{k}-q^{k+1}\rVert^{2}\right]. (40)

In discrete settings, the c\mathcal{L}_{c} is lower semi-continuous and therefore the l.h.s. of (40) is bounded. Following this, observe that the r.h.s. of (40) is a Cauchy sequence, and hence i[V],pikpi,qkq\forall i\in[V],p_{i}^{k}\rightarrow p_{i}^{*},q^{k}\rightarrow q^{*} as kk\rightarrow\infty. Given these, by (31), this result implies νikνi\nu_{i}^{k}\rightarrow\nu_{i}^{*} as kk\rightarrow\infty. And we prove the convergence.

Given convergence, along with Lemma 4 and Lemma 5, that is, the KŁ property is satisfied with a Łojasiewicz exponent is θ=1/2\theta=1/2, and the contraction of the gradient of c\mathcal{L}_{c} is established, by Lemma 6, we prove the rate of convergence of the sequence {wk}k\{w^{k}\}_{k\in\mathbb{N}} obtained through the algorithm (11) is QQ-linear around a neighborhood of a stationary point ww^{*}.

References

  • [1] S. Sun, “A survey of multi-view machine learning,” Neural computing and applications, vol. 23, no. 7, pp. 2031–2038, 2013.
  • [2] Y. Yang and H. Wang, “Multi-view clustering: A survey,” Big Data Mining and Analytics, vol. 1, no. 2, pp. 83–107, 2018.
  • [3] M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata, “Learning robust representations via multi-view information bottleneck,” in International Conference on Learning Representations, 2020.
  • [4] W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multi-view representation learning,” in International conference on machine learning.   PMLR, 2015, pp. 1083–1092.
  • [5] Y. Li, M. Yang, and Z. Zhang, “A survey of multi-view representation learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 10, pp. 1863–1883, 2019.
  • [6] K. Zhan, F. Nie, J. Wang, and Y. Yang, “Multiview consensus graph clustering,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1261–1270, 2019.
  • [7] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000.
  • [8] C. Xu, D. Tao, and C. Xu, “Large-margin multi-viewinformation bottleneck,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1559–1572, 2014.
  • [9] Y. Gao, S. Gu, L. Xia, and Y. Fei, “Web document clustering with multi-view information bottleneck,” in 2006 International Conference on Computational Inteligence for Modelling Control and Automation and International Conference on Intelligent Agents Web Technologies and International Commerce (CIMCA’06), 2006, pp. 148–148.
  • [10] S. Hu, Z. Shi, and Y. Ye, “Dmib: Dual-correlated multivariate information bottleneck for multiview clustering,” IEEE Transactions on Cybernetics, pp. 1–15, 2020.
  • [11] Q. Wang, C. Boudreau, Q. Luo, P.-N. Tan, and J. Zhou, “Deep multi-view information bottleneck,” in Proceedings of the 2019 SIAM International Conference on Data Mining.   SIAM, 2019, pp. 37–45.
  • [12] Z. Goldfeld and Y. Polyanskiy, “The information bottleneck problem and its applications in machine learning,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 19–38, 2020.
  • [13] Y. Uğur, I. E. Aguerri, and A. Zaidi, “Vector gaussian ceo problem under logarithmic loss and applications,” IEEE Transactions on Information Theory, vol. 66, no. 7, pp. 4183–4202, 2020.
  • [14] I. E. Aguerri and A. Zaidi, “Distributed variational representation learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 120–138, 2021.
  • [15] I. Estella Aguerri and A. Zaidi, “Distributed information bottleneck method for discrete and gaussian sources,” in International Zurich Seminar on Information and Communication (IZS 2018). Proceedings.   ETH Zurich, 2018, pp. 35–39.
  • [16] A. Zaidi, I. Estella-Aguerri, and S. Shamai, “On the information bottleneck problems: Models, connections, applications and information theoretic views,” Entropy, vol. 22, no. 2, p. 151, 2020.
  • [17] T. A. Courtade and T. Weissman, “Multiterminal source coding under logarithmic loss,” IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 740–761, 2014.
  • [18] S. Boyd, N. Parikh, and E. Chu, Distributed optimization and statistical learning via the alternating direction method of multipliers.   Now Publishers Inc, 2011.
  • [19] H. Attouch and J. Bolte, “On the convergence of the proximal algorithm for nonsmooth functions involving analytic features,” Mathematical Programming, vol. 116, no. 1, pp. 5–16, 2009.
  • [20] Y. Wang, W. Yin, and J. Zeng, “Global convergence of admm in nonconvex nonsmooth optimization,” Journal of Scientific Computing, vol. 78, no. 1, pp. 29–63, 2019.
  • [21] T.-H. Huang and A. el Gamal, “A provably convergent information bottleneck solution via admm,” in 2021 IEEE International Symposium on Information Theory (ISIT), 2021, pp. 43–48.
  • [22] D. Boley, “Local linear convergence of the alternating direction method of multipliers on quadratic or linear programs,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2183–2207, 2013.
  • [23] K. Guo, D. R. Han, and T. T. Wu, “Convergence of alternating direction method for minimizing sum of two nonconvex functions with linear constraints,” International Journal of Computer Mathematics, vol. 94, no. 8, pp. 1653–1669, 2017.
  • [24] M. Chao, D. Han, and X. Cai, “Convergence of the peaceman-rachford splitting method for a class of nonconvex programs,” Numerical Mathematics: Theory, Methods and Applications, vol. 14, no. 2, pp. 438–460, 2021.
  • [25] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, ser. COLT’ 98.   New York, NY, USA: Association for Computing Machinery, 1998, p. 92–100.
  • [26] Y. Ugur, I. E. Aguerri, and A. Zaidi, “A generalization of blahut-arimoto algorithm to compute rate-distortion regions of multiterminal source coding under logarithmic loss,” arXiv preprint arXiv:1708.07309, 2017.
  • [27] G. Li and T. K. Pong, “Calculus of the exponent of kurdyka–Łojasiewicz inequality and its applications to linear convergence of first-order methods,” Foundations of Computational Mathematics, vol. 18, no. 5, p. 1199–1232, 2018.
  • [28] K. Sricharan, R. Raich, and A. O. Hero, “Estimation of nonlinear functionals of densities with confidence,” IEEE Transactions on Information Theory, vol. 58, no. 7, pp. 4135–4159, 2012.
  • [29] R. Blahut, “Computation of channel capacity and rate-distortion functions,” IEEE Transactions on Information Theory, vol. 18, no. 4, pp. 460–473, 1972.
  • [30] O. Shamir, S. Sabato, and N. Tishby, “Learning and generalization with the information bottleneck,” Theoretical Computer Science, vol. 411, no. 29-30, pp. 2696–2711, 2010.
  • [31] E. Schubert and A. Zimek, “ELKI: A large open-source library for data analysis - ELKI release 0.7.5 ”heidelberg”,” CoRR, vol. abs/1902.03616, 2019.
  • [32] D. Cremers and K. Kolev, “Multiview stereo and silhouette consistency via convex functionals over convex domains,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1161–1174, 2011.
  • [33] G. Franca, D. Robinson, and R. Vidal, “Admm and accelerated admm as continuous dynamical systems,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1559–1567.
  • [34] Y. Han, J. Jiao, T. Weissman, and Y. Wu, “Optimal rates of entropy estimation over lipschitz balls,” The Annals of Statistics, vol. 48, no. 6, pp. 3228–3250, 2020.
  • [35] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing).   New York, NY, USA: Wiley-Interscience, 2006.
  • [36] K. Guo, D. Han, and X. Yuan, “Convergence analysis of douglas–rachford splitting method for “strongly + weakly” convex programming,” SIAM Journal on Numerical Analysis, vol. 55, no. 4, p. 1549–1577, 2017.
  • [37] Z. Jia, X. Gao, X. Cai, and D. Han, “Local linear convergence of the alternating direction method of multipliers for nonconvex separable optimization problems,” Journal of Optimization Theory and Applications, vol. 188, no. 1, p. 1–25, 2021.
  • [38] J. Nocedal, Numerical optimization, 2nd ed., ser. Springer series in operations research.   New York: Springer, 2006.
  • [39] Y. Nesterov, Lectures on convex optimization, ser. Springer optimization and its applications.   Cham: Springer, 2018, vol. 137.
  • [40] T.-H. Huang, A. E. Gamal, and H. E. Gamal, “A linearly convergent douglas-rachford splitting solver for markovian information-theoretic optimization problems,” arXiv preprint arXiv:2203.07527, 2022.
  • [41] I. Sason, “On reverse pinsker inequalities,” CoRR, vol. abs/1503.07118, 2015.