On The Multi-View Information Bottleneck Representation
Abstract
In this work, we generalize the information bottleneck (IB) approach to the multi-view learning context. The exponentially growing complexity of the optimal representation motivates the development of two novel formulations with more favorable performance-complexity tradeoffs. The first approach is based on forming a stochastic consensus and is suited for scenarios with significant representation overlap between the different views. The second method, relying on incremental updates, is tailored for the other extreme scenario with minimal representation overlap. In both cases, we extend our earlier work on the alternating direction method of multipliers (ADMM) solver and establish its convergence and scalability. Empirically, we find that the proposed methods outperform state-of-the-art approaches in multi-view classification problems under a broad range of modelling parameters.
Index Terms:
Information bottleneck, consensus ADMM, non-convex optimization, classification, multi-view learning.

I Introduction
Recently, learning from multi-view data has drawn significant interest in the machine learning and data science communities (e.g., [1, 2, 3, 4, 5]). In this context, a view of data is a description or observation of the source. For example, an object can be described in words or images. It is natural to expect learning from multi-view data to improve performance [6].
The challenges in multi-view learning are twofold. First, one can naively combine the observations of all views to form one giant view, which loses no information but suffers from exponential growth in the dimensionality of the merged observation; we call this the performance-complexity trade-off. Second, if one instead opts to extract either view-shared or view-specific relevant features from each view, then heterogeneous forms of observations (e.g., audio and images) make it difficult to learn low-complexity and meaningful representations with a unified method. In other words, the amount of representation overlap across the view observations is important for efficient multi-view learning.
To address these challenges, several recent works have applied the IB [7] principle to multi-view learning, as it matches the objective well, namely, trading off relevance and complexity in extracting both view-shared and view-specific features [8, 9, 10, 11]. This generalization is known as the multi-view IB (MvIB), which is a special case of the multi-terminal remote source coding problem with logarithmic loss. The logarithmic loss corresponds to soft reconstruction, where the likelihood of all possible outcomes is received, in contrast to a reconstructed symbol in conventional source coding problems [12, 13, 14, 15, 16].
In the literature, the achievable region for the remote source coding problem is characterized in [17] for discrete cases and, recently, for jointly Gaussian cases as well [13]. Along with the characterization, a variety of variational inference-based algorithms have been proposed [14, 15]. Algorithms of this type introduce extra variational variables to facilitate the optimization: fixing one of the two sets of variables renders the overall objective function convex w.r.t. the other set. Optimizing in this alternating fashion, convergence is assured.
Extending this line of research, our approach is rooted in a top-down information-theoretic formulation closely related to the optimal characterization of MvIB. Moreover, contrary to [11] which relies on black-box deep neural networks, we propose two constructive information theoretic formulations with performance comparable to that of the optimal joint view approach. The two approaches are motivated by two extreme multi-view learning scenarios: The first is characterized by a significant representation overlap between the different views which favors our consensus-complement two stage formulation, whereas the second extreme scenario is characterized by a minimal representation overlap leading to our incremental update approach.
Different from existing variational inference-based algorithms that avoid dealing with the non-convexity of the overall objective function, in both of the proposed methods we adopt the non-convex consensus ADMM as the main tool in deriving our solvers [18, 19, 20]. These new solvers can, therefore, be viewed as generalizations of our earlier work on the single-view ADMM algorithm [21]. More specifically, in the consensus-complement version, we separate the proposed Lagrangian into consensus and complement sub-objective functions and then proceed to solve the optimization problem in two steps. The new ADMM solver can hence efficiently form a consensus representation in large-scale multi-view learning problems with significantly lower dimensions compared to joint-view approaches. The same intuition is applied to the incremental update approach as detailed in the sequel. Finally, we prove the linear rate of convergence of our solvers under significantly milder constraints, as compared with earlier convergence results on this class of solvers [22, 23, 24]. More specifically, we relax the need for a strongly convex sub-objective function and, moreover, establish the linear rate of convergence around a local optimal point in each case.
II Multiview Information Bottleneck
Given views with observations generated from a target variable , we aim to find a set of individual representations that is most compressed w.r.t. the individual-view observations while at the same time maximizing their relevance toward the target through .
Using a Lagrangian multiplier formulation, the problem can be cast as:
(1)
where denotes the set of all -views of observations and is the set of representations to be designed. Note that if the observations are combined in this manner and treated as one view, the above reduces to the standard IB and can be solved with any existing single-view solver. However, combining all the observations in one giant view will result in an exponential increase in complexity (curse of dimensionality).
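This exponential blow-up can be made concrete with a toy calculation; the alphabet sizes below are hypothetical placeholders, not the paper's experimental dimensions:

```python
# Toy illustration with hypothetical alphabet sizes: merging V views into
# one giant view multiplies their alphabet sizes (exponential in V),
# whereas processing views separately only adds them.
view_sizes = [32, 32, 32, 32]      # |X_v| for 4 hypothetical views

joint_dim = 1
for s in view_sizes:
    joint_dim *= s                 # joint view: 32**4 outcomes

per_view_dim = sum(view_sizes)     # separate views: 32 + 32 + 32 + 32

print(joint_dim)      # 1048576
print(per_view_dim)   # 128
```

Even for modest per-view alphabets, the joint-view alphabet quickly becomes intractable, which is precisely the curse of dimensionality motivating the per-view formulations.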
A basic assumption in the multiview learning literature is the conditional independence [25, 9], where the observations of all views are independent given the target variable . That is, . In the next two sections, we use this conditional independence assumption while constraining the set of allowable latent representations to develop our two novel information-theoretic formulations of the Multi-view IB (MvIB) problem.
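The conditional independence assumption can be sketched numerically; the small alphabets and randomly drawn conditionals below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small alphabets: |Y| = 2, |X1| = 3, |X2| = 4.
p_y = np.array([0.4, 0.6])
p_x1_y = rng.dirichlet(np.ones(3), size=2)   # row y: p(x1 | y)
p_x2_y = rng.dirichlet(np.ones(4), size=2)   # row y: p(x2 | y)

# Conditional-independence factorization: p(y, x1, x2) = p(y) p(x1|y) p(x2|y).
p_joint = np.einsum('y,yi,yj->yij', p_y, p_x1_y, p_x2_y)

# Given Y = y, the two views are independent: p(x1, x2 | y) factorizes.
for y in range(2):
    cond = p_joint[y] / p_y[y]               # p(x1, x2 | y)
    assert np.allclose(cond, np.outer(p_x1_y[y], p_x2_y[y]))

print(np.isclose(p_joint.sum(), 1.0))        # True: a valid joint pmf
```

Any joint distribution built this way satisfies the assumption by construction, which is what the next two sections exploit.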
II-A Consensus-Complement Form
Inspired by the co-training methods in multi-view research [25], we design the set of latent representations to consist of a consensus representation and view-specific complement components . Then, by the chain rule of mutual information, the Lagrangian of (1) becomes
(2)
where the sequence is defined to represent the accumulated complement views. To further simplify the above, we restrict the set of possible representations to satisfy the following conditions (similar to [25, 11]):
• There always exist constants , independent of the observations, such that .
• forms a Markov chain. That is, is side-information for .
• For each view , given the consensus , are independent.
Under these constraints, (2) can be rewritten as the superposition of two parts, i.e., , where the first component is defined as the multi-view consensus IB Lagrangian:
(3)
and the second consists of terms with each one corresponding to a complement sub-objective for each view:
(4)
We can now recast in (3) as:
(5)
Similarly, we rewrite (4), , as:
(6)
By representing the discrete (conditional) probabilities as vectors/tensors, we can solve (5) and (6) with augmented Lagrangian methods. We define the following vectors:
(7)
where . For clarity, we rewrite the primal variables for each view as , and cascade the augmented variables which gives . On the other hand, for the complement part, we define the following tensors:
(8)
Then we present the consensus-complement MvIB augmented Lagrangian as follows. For the consensus part:
(9)
where denotes the 2-norm. As for the tensors used in the complement step, denote the realization of the consensus representation ; by Bayes' rule, we can recover the linear expression , where is a diagonal matrix formed from the vector . To simplify notation, define the equivalent prior as a linear operator ; then we can express the augmented Lagrangian for the complement step as , with each term defined as:
(10)
where is the penalty coefficient and the linear penalty for each view encourages the variables and each to satisfy the marginal probability and Markov chain conditions. Specifically, let denote the Kronecker product, where is the matrix form of the conditional distribution with each entry equal to .
We propose a two-step algorithm to solve (2), described as follows. The first step is solving (9) through the following consensus ADMM algorithm:
(11a)
(11b)
(11c)
Then in the second step we solve (10) with two-block ADMM:
(12a)
(12b)
(12c)
where in (11), we use the short-hand notation to denote the primal variables, up to views that are already updated to step , and to denote the rest that are still at step . We define ; in (11) and (12), the superscript denotes the step index; each of denotes a compound probability simplex. The algorithm starts with (11a), updating each view in succession. Then the augmented variables are updated with (11c). Finally, the difference between the primal and augmented variables is added to the dual variables (11b) to complete step . After convergence of (11), we run (12) in a similar fashion for each view, which completes the full algorithm.
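The per-view primal update, consensus update, and dual ascent pattern of consensus ADMM can be sketched on a toy problem. The quadratic local objectives and closed-form updates below are illustrative stand-ins, not the paper's actual IB sub-problems:

```python
import numpy as np

# Minimal consensus-ADMM sketch: V "views", each with a local objective
# f_v(z) = 0.5 * (z - a_v)**2, coupled by the consensus constraint
# z_v = zbar.  The consensus minimizer is mean(a).
a = np.array([1.0, 3.0, 5.0])   # hypothetical per-view targets
V, c = len(a), 1.0              # number of views; penalty coefficient
z = np.zeros(V)                 # primal variables, one per view
u = np.zeros(V)                 # scaled dual variables
zbar = 0.0                      # consensus (augmented) variable

for _ in range(200):
    # primal: argmin_z f_v(z) + (c/2)(z - zbar + u_v)^2, here in closed form
    z = (a + c * (zbar - u)) / (1.0 + c)
    # consensus: average of primal-plus-dual
    zbar = np.mean(z + u)
    # dual ascent on the consensus residual
    u = u + z - zbar

print(round(zbar, 6))           # 3.0, i.e. mean(a)
```

In the paper's solver the primal step is a simplex-constrained minimization rather than a closed-form quadratic, but the update ordering mirrors (11a), (11c), (11b).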
II-B Incremental Update Form
Intuitively, the consensus-complement form works well in the case where the common information in the observations across all views is abundant. However, if the views are almost distinct, where each view is a complement to the others, then the previous form will be inefficient in the sense that learning the common representation may have negligible benefit. To address this, we propose another formulation of the multi-view IB by restricting the representation set to . The incremental update multi-view IB Lagrangian is therefore given by:
(13)
Again, to simplify the above, the incremental form will satisfy the following constraints:
• For each view , the corresponding representation only accesses , so forms a Markov chain.
With the assumptions above, in each step, we can replace observations of all views with the view-specific observation and rewrite (13) as:
(14)
In solving (14), we consider the following algorithm. At the step, we have:
(15a)
(15b)
where denotes the tensor form of a conditional probability . The tensor is the primal variable for step which belongs to a compound simplex . In the algorithm, for each step (15a), we solve it with (11) by setting and treating the estimators from the previous steps as priors. For example, , and .
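The incremental pattern, each view fitting only what earlier views left unexplained while treating their outputs as a fixed prior, can be sketched with a toy stand-in. The least-squares residual fitting below is an illustrative analogue, not the paper's IB objective:

```python
import numpy as np

# Each view v fits a 1-D least-squares weight using only its own
# observation, treating the earlier views' accumulated prediction as a
# fixed prior and fitting only the residual that prior leaves behind.
rng = np.random.default_rng(1)
n = 200
y = rng.normal(size=n)                               # target
views = [y + rng.normal(size=n) for _ in range(3)]   # noisy views of y

pred = np.zeros(n)                           # accumulated "prior"
for x in views:
    resid = y - pred                         # what earlier views missed
    w = (x @ resid) / (x @ x)                # least squares on the residual
    pred = pred + w * x                      # incremental update

w1 = (views[0] @ y) / (views[0] @ views[0])
mse_first = np.mean((y - w1 * views[0]) ** 2)    # view 1 alone
mse_all = np.mean((y - pred) ** 2)               # all three views

print(mse_all <= mse_first)                  # True: each view only helps
```

Each greedy step can only reduce the residual, which mirrors the design goal of increasing the overall relevance view-by-view.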
III Main Results
We propose two new information-theoretic formulations of MvIB and develop optimal-bound-achieving algorithms that parallel existing solvers [15, 26, 14]; our main results are the convergence proofs for the two proposed algorithms. The convergence analysis goes beyond the MvIB and the recent non-convex multi-block ADMM convergence results, as we further show that strong convexity on is not necessary for proving convergence [24]. This new result connects our analysis to a more general class of functions that can be solved with multi-block non-convex consensus ADMM. For simplicity, we denote the collective point at step as , the function value evaluated with as , and the smallest and largest singular values of a linear operator as , respectively.
Theorem 1
Suppose is -smooth and -Lipschitz continuous and is -weakly convex. Further, let be defined as in (9) and solved with the algorithm (11). If the penalty coefficient satisfies , then the sequence is finite and bounded. Moreover, converges linearly to a stationary point around a neighborhood such that for where .
Proof:
The details of the proof are deferred to Appendix A. Here we explain the key ideas.
The first step is to construct a sufficient decrease lemma (Lemma 3) to assure that the function value decreases from step to by an amount lower bounded by the positive squared norm . We decompose according to each step of the algorithm (11), as follows:
(16c)
(16f)
(16g)
For each view, each difference in (16c) can be lower bounded by using the convexity of , and we get:
(17)
On the other hand, in (16g), a similar lower bound for follows from its -weak convexity. This results in a negative squared norm . Nonetheless, by the first-order minimizer conditions (24) and the identity , the negative term is balanced by the penalty coefficient as the corresponding lower bound is (with other variables fixed):
As for the dual update, (16f) gives a combination of negative norms . It turns out that by the first-order minimizer condition of and its smoothness:
and that is full-row rank (holds for complement step):
(18) |
where denote the smallest and largest singular value of a linear operator B. Then we need the following relation:
which is non-trivial as is full-row rank. To address this, we adopt the sub-minimization path method in [20], which is applicable since is convex. Observe that (11a) is equivalent to a proximal operator:
with at step . Using this technique, we can obtain the desired result using the Lipschitz continuity of :

This proves the sufficient decrease lemma and hence the convergence (Appendix F).
We further prove that the rate of convergence is linear by explicitly showing that the Kurdyka-Łojasiewicz (KŁ [19, 27]) property is satisfied with an exponent (Appendix D). This builds on the known result that the KŁ inequality characterizes the rate of convergence into three regions in terms of [19] ( corresponds to a linear rate). The proof for is again based on the convexity of and the weak convexity of and is deferred to Appendix D. We note that the linear rate holds around a neighborhood of a stationary point, which aligns with the results in [27].
∎
As a remark, if the minimum element of a probability vector is bounded away from zero by a constant , a commonly adopted smoothness condition in density and entropy estimation research [28], the sub-objectives and can be shown to be Lipschitz continuous and smooth. Furthermore, under smoothness conditions, is a weakly convex function w.r.t. (Lemma 2).
From Theorem 1, the consensus-complement algorithm is convergent since the complement step is a special case of algorithm (11) with while treating as an additional prior probability. The incremental algorithm is convergent for the same reason.
IV Numerical Results
We evaluate the two proposed approaches on two-view synthetic distributions. For simplicity, we denote the consensus-complement approach as Cons-Cmpl and the incremental update approach as Increment.
We simulate a classification task and compare the performance of the two proposed approaches to joint-view/single-view IB solvers [29], which serve as references for the best- and worst-case performance, and to a state-of-the-art deep neural network-based method (DeepMvIB [14, 11]) with two layers of fully connected -neuron weights plus ReLU activation for each view. Given (19), we randomly sample pairs of outcomes as testing data. Then we run the algorithms, sweeping through a range of , and record the best accuracy from trials per . We use a Bayes decoder to predict the testing data, where we perform inverse transform sampling on the cumulative distribution of the decoders to obtain for each pair of . The data-generating distribution is:
(19)
with . The result is shown in Figure 1a. The dimension of each of is , and for each of . Clearly, the two proposed approaches achieve performance comparable to that of the joint-view IB solver and outperform DeepMvIB over the range of we simulated. Interestingly, Cons-Cmpl outperforms Increment in the best accuracy. This might be due to the abundance of representation overlap. To investigate this observation further, we consider a different distribution with dimensions of all representations :
(20)
Observe that for each view in (20), there is one class ( in view and in view ) that is easy to infer through , while the remaining two are ambiguous. This results in low representation overlap, and consensus is therefore difficult to form. In Figure 1b we examine the components of the relevance rate , where the Sum is for Cons-Cmpl and for Increment. Step 1 indicates in Cons-Cmpl and in Increment. Observe that there is almost no increase in over varying , and that Increment has a greater relevance rate than Cons-Cmpl when . Since a high relevance rate is known to be related to high prediction accuracy [30], this example favors the Increment approach, as it is designed to increase the overall relevance rate view-by-view.
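The Bayes decoder with inverse transform sampling used in the evaluation can be sketched as follows; the encoder/decoder tables here are hypothetical, not the learned ones:

```python
import numpy as np

# Predict y for a test x by sampling from p(y|x) = sum_z p(y|z) p(z|x).
rng = np.random.default_rng(2)

def inverse_transform_sample(pmf, u):
    """Return index i with probability pmf[i], given uniform u in [0, 1)."""
    return int(np.searchsorted(np.cumsum(pmf), u, side='right'))

p_z_given_x = np.array([[0.7, 0.2, 0.1],     # row x: p(z | x)
                        [0.1, 0.3, 0.6]])
p_y_given_z = np.array([[0.9, 0.1],          # row z: p(y | z)
                        [0.5, 0.5],
                        [0.1, 0.9]])

p_y_given_x = p_z_given_x @ p_y_given_z      # marginalize out z

samples = [inverse_transform_sample(p_y_given_x[0], rng.random())
           for _ in range(10000)]
# empirical frequency matches p(y=0 | x=0) = 0.7*0.9 + 0.2*0.5 + 0.1*0.1 = 0.74
print(abs(np.mean(np.array(samples) == 0) - 0.74) < 0.03)
```

Here the matrix product implements the Markov chain through the representation, and `searchsorted` on the cumulative distribution is the inverse transform step.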
Lastly, we compare the complexity of the two approaches in terms of the number of dimensions of the primal variables. For simplicity, let . For Cons-Cmpl, the number of dimensions of the variables scales as , while for Increment it grows as . Both methods improve over the joint view, as their complexity values scale as and in general. Remarkably, the complexity of Cons-Cmpl scales linearly in the number of views, while we get exponential growth with factor for Increment.
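The scaling comparison can be made concrete with placeholder alphabet sizes; the counting functions below illustrate linear-in-V, per-level-multiplicative, and joint-view growth, not the paper's exact dimension counts:

```python
# Hypothetical per-view observation alphabet n_x, representation
# alphabet n_z, and V views.
n_x, n_z = 16, 4

def dims_linear(V):
    # one n_z-by-n_x table per view: linear in V (Cons-Cmpl-like scaling)
    return V * n_z * n_x

def dims_incremental(V):
    # level v conditions on all v previous representations (Increment-like)
    return sum(n_z ** v * n_z * n_x for v in range(V))

def dims_joint(V):
    # single giant view with alphabet n_x ** V
    return n_z * n_x ** V

for V in (2, 3, 4):
    print(V, dims_linear(V), dims_incremental(V), dims_joint(V))
```

Even in this toy count, the incremental scheme's extra factor per level is visible, yet both remain far below the joint view's exponential blow-up in the observation alphabet.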
V Conclusion
In this work, we propose two new information-theoretic formulations of MvIB and develop new optimal-bound-achieving algorithms based on non-convex consensus ADMM, in parallel to existing solvers. We propose two algorithms to solve the two forms respectively and prove their convergence and linear rates. Empirically, they achieve performance comparable to joint-view benchmarks and outperform state-of-the-art deep neural network-based approaches on some synthetic datasets. For future work, we plan to evaluate the two methods on available multi-view datasets [31, 32] and generalize the proposed MvIB framework to continuous distributions [33].
Appendix A Convergence Analysis
In this part, we prove the convergence of the consensus non-convex ADMM algorithm for the two proposed MvIB solvers, (11) and (15). Moreover, we demonstrate that the convergence rate is linear, based on the recent non-convex ADMM convergence results through the KŁ inequality. Specifically, we explicitly show that the Łojasiewicz exponent associated with the augmented Lagrangian for both forms is and therefore corresponds to a linear rate. As mentioned in Section II, the complement step and each level of the incremental update algorithm are special cases of the consensus step algorithm with a normalized linear operator for each realization of the conditioned representation and with the number of views set to , so it suffices to analyze the convergence and the associated rate of the non-convex consensus ADMM (11). In solving (11), we consider the following first-order optimization method and assume an exact solution exists and can be found at each step.
A-A Preliminaries
We first introduce the following definitions which allow us to study the properties of the sub-objective functions .
We start with the elementary definitions of smoothness conditions for optimization.
Definition 1 (Lipschitz continuity)
A real-valued function is -Lipschitz continuous if for .
A function is “smooth” if its gradient is Lipschitz continuous.
Definition 2 (Smoothness)
A real-valued function is -smooth if , and .
Note that if a function is -smooth, then the Lipschitz smoothness coefficient of satisfies . In this work, the variables are cascades of (conditional) probability masses, and a common assumption in density/entropy estimation research is a non-zero minimal measure [28, 34].
Definition 3 (-infimal)
A measure is said to be -infimal if there exists such that .
Assuming -infimal for a given set of primal variables , we have the following results:
Lemma 1
Suppose in (7) is -infimal, is -infimal, and is -infimal. Then is -smooth and is -smooth, where .
Proof:
For , it suffices to consider a single view. Since and , we have:
On the other hand, recalling , we can separate into two parts, denoted respectively. For the first part :

On the other hand, for :

Since , combining the two parts:
(21)
∎
Besides the smoothness conditions, given the joint probability , the (conditional) entropy functions are concave w.r.t. the associated probability mass [35]. The key observation for (3) is that its non-convexity is due to a combination of differences of convex functions. Regarding convexity, we refer to the following definition.
Definition 4 (Hypoconvexity)
A function is -hypoconvex if such that . In particular, if , is strongly convex; if , is weakly convex.
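Definition 4 can be checked numerically on a toy function: adding a sufficiently large quadratic restores convexity of a weakly convex function. The function and constant below are illustrative, not the paper's sub-objectives:

```python
import numpy as np

# f(x) = -0.5 * x**2 is nonconvex, but adding (rho/2) * x**2 with
# rho = 1.5 > 1 makes it convex, i.e. f is (at least) 1-weakly convex.
f = lambda x: -0.5 * x ** 2
rho = 1.5
g = lambda x: f(x) + 0.5 * rho * x ** 2   # convexified surrogate: 0.25*x**2

rng = np.random.default_rng(3)
ok = True
for _ in range(1000):
    x, y = rng.normal(size=2) * 5.0
    t = rng.random()
    # segment test for convexity of g
    ok &= g(t * x + (1 - t) * y) <= t * g(x) + (1 - t) * g(y) + 1e-9
print(ok)   # True: the quadratic offset restores convexity
```

This is exactly the mechanism exploited later in the analysis: a large enough penalty coefficient compensates for the weakly convex block.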
In MvIB, given the , it is easy to show that each is a convex function. On the other hand, for the function , if we assume -infimality, we show in the following that it is weakly convex.
Lemma 2
Let and . Suppose is -infimal, respectively. Then the function is -weakly convex, where and with .
Proof:
see Appendix B. ∎
From Lemma 2, it turns out that the MvIB objective is a multi-block objective consisting of convex sub-objectives in addition to a weakly convex . This decomposition of the non-convexity of the overall objective enables us to generalize the recent strongly-weakly-pair non-convex ADMM convergence results to consensus ADMM [36, 37].
If a function satisfies the KŁ properties, then its rate of convergence can be determined in terms of its Łojasiewicz exponent.
Definition 5
A function is said to satisfy the Łojasiewicz inequality if there exist an exponent , a critical point with a constant , and a neighborhood such that:
In the literature, there is a broad class of functions known to satisfy the KŁ properties, in particular those with -minimal structure (e.g., sub-analytic, semi-algebraic functions [19]).
Definition 6
A function is said to satisfy the Kurdyka-Łojasiewicz inequality if there exists a neighborhood around and a level set with a margin and a continuous concave function , such that the following inequality holds:
(22) |
where denotes the sub-gradient of for non-smooth functions, and is defined as the distance from a set to a fixed point if exists. Note that if , then recovers the definition of the Łojasiewicz inequality.
The following elementary identity is useful in the convergence analysis; we list it for completeness.
(23)
Lastly, by “linear” rate of convergence, we refer to the definition in [38]:
Definition 7
Let be a sequence in that converges to a stationary point when . If it converges -linearly, then such that
On the other hand, the convergence of the sequence is -linear if there is a -linearly convergent sequence such that:
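The distinction in Definition 7 can be illustrated with toy sequences (synthetic examples, not solver iterates): the first converges Q-linearly with a constant error ratio, while the second has an oscillating ratio but is dominated by a Q-linear envelope, hence converges R-linearly.

```python
# Toy sequences distinguishing the two notions.
x = [2.0 ** (-k) for k in range(20)]                      # Q-linear, rate 1/2
y = [2.0 ** (-k) * (1 + (-0.5) ** k) for k in range(20)]  # oscillating ratio

ratios_x = [x[k + 1] / x[k] for k in range(19)]
print(max(ratios_x) == min(ratios_x) == 0.5)   # constant ratio: Q-linear

# y's successive-error ratio oscillates (e.g. 0.125 then 1.25), yet y is
# dominated by the Q-linearly convergent envelope 2 * x: R-linear.
print(all(y[k] <= 2 * x[k] for k in range(20)))  # True
```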
A-B Convergence and Rate Analysis
As in the standard convex setting, to prove convergence of consensus ADMM, we simply need to establish a sufficient decrease lemma [39, 18]. However, since the MvIB problem is non-convex, the sub-objectives cannot be viewed as monotone operators, which would lead to convergence naturally. Our key insight is that the non-convexity of the MvIB problem can be separated into a combination of a set of convex sub-objectives and a single weakly convex sub-objective, which can be exploited to show convergence. This result requires certain smoothness conditions to be satisfied, which follow when assuming -infimality (Definition 3) on the primal and the augmented variables . With these smoothness conditions, it can easily be shown that is -Lipschitz continuous and -smooth w.r.t. for some . In addition to the properties of the sub-objective functions, it turns out that the structural advantages of consensus ADMM allow us to connect the dual update to the gradient of , and we can therefore establish the desired results. Before presenting the results, we summarize the minimizer conditions as follows:
(24a)
(24b)
(24c)
where denotes the view index. Following this, suppose there exists a stationary point such that ; then (24) reduces to:
(25) |
Furthermore, we impose the following set of assumptions to facilitate the convergence analysis:
Assumption A
• There exist stationary points that belong to a set .
• is -smooth, -Lipschitz continuous, and convex; is -smooth and -weakly convex.
• The penalty coefficient satisfies:
Lemma 3
Proof:
See Appendix C. ∎
As a remark, Lemma 3 implies that the convergence of the non-convex ADMM-based algorithm depends on a sufficiently large penalty coefficient , and the minimum value that assures this, in turn, relies on the properties of the sub-objective functions and . Note that both the Lipschitz continuity and the smoothness of are exploited to prove Lemma 3, which corresponds to the sub-minimization path method developed in [20] as are convex, and the connection between the dual update and the gradient of [40, 37].
In addition to convergence, it turns out that we can follow recent results in the optimization literature that adopt the KŁ inequality to analyze the convergence, and hence the rate of convergence, of non-convex ADMM, and prove that the proposed algorithms converge linearly. It is worth noting that the linear rate obtained through this framework is not uniform over the whole parameter space; rather, the algorithms converge linearly when the solution is located in the vicinity of local stationary points. In other words, the rate of convergence is locally linear. In the following, we aim to use the KŁ inequality to prove a locally linear rate for the two proposed algorithms (11) and (15). As summarized in [40], three elements are needed to adopt the KŁ inequality: 1) a sufficient decrease lemma; 2) showing that the Łojasiewicz exponent of the objective function , solved with the algorithm to be analyzed, is ; and 3) contraction of the gradients . Since we already have the first element, we can focus on the others. The desired result can be obtained through the following lemma.
Lemma 4
Proof:
See Appendix D ∎
The last element needed to adopt the KŁ inequality is the contraction of the gradients of the augmented Lagrangian between consecutive updates.
Lemma 5
Proof:
See Appendix E. ∎
Combining the lemmas, we can prove the locally linear rate of convergence of the non-convex consensus ADMM algorithm for forming the consensus MvIB representation. To be self-contained, the framework for adopting the KŁ inequality is summarized in the following. Note that this result characterizes the rate of convergence into three regions in terms of the value of the Łojasiewicz exponent, which suffices in our case. For the complete characterization, we refer to [19, 40] for details.
Lemma 6 (Theorem 2 [19])
Assume that a function satisfies the KŁ properties, define the collective point at step , and let be a sequence generated by the algorithm (11). Suppose is bounded and the following relation holds:
where and are constants. Denote the Łojasiewicz exponent of with as . Then the following holds:
1. If , the sequence converges in a finite number of steps.
2. If , then there exist and such that
3. If , then there exists such that
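The three regimes can be visualized with synthetic error sequences (toy data, not actual solver output):

```python
import numpy as np

# Synthetic error sequences illustrating the three regimes of Lemma 6.
k = np.arange(1, 51)

finite = np.where(k < 5, 1.0 / k, 0.0)   # exponent 0: stops in finite steps
linear = 0.9 ** k.astype(float)          # exponent in (0, 1/2]: geometric
sublinear = k.astype(float) ** (-2.0)    # exponent in (1/2, 1): polynomial

print(finite[10] == 0.0)                          # True: already terminated
print(abs(linear[11] / linear[10] - 0.9) < 1e-9)  # constant ratio 0.9
print(sublinear[40] / sublinear[39])              # ~0.95, ratio creeps to 1
```

The geometric sequence keeps a constant successive-error ratio (the linear-rate regime claimed for the proposed solvers), while the polynomial sequence's ratio approaches one, i.e. progress stalls.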
Overall, the three elements for applying the KŁ inequality are obtained from Lemmas 3, 4, and 5. Then, by using Lemma 6, we prove the linear rate of convergence.
Theorem 1
Proof:
See Appendix F. ∎
Appendix B Proof of Lemma 2
By definition, for two arbitrary , consists of two parts. For the first part, we have:
If , then is a scaled negative entropy function w.r.t. , which is therefore -strongly convex. In turn, the positive squared norm introduced by strong convexity is always greater than zero, so is -weakly convex. On the other hand, if , let without loss of generality; we have:
(26)
where the first inequality is due to the reverse Pinsker's inequality [41], while the last inequality is due to the norm bound . Then, for the second part, consider the following:
where the first and second inequalities follow the same reasoning as in (26). Combining the above discussions, we conclude that is -weakly convex, where:
Appendix C Proof of Lemma 3
We divide the proof into three parts according to the update sequence.
First, for each view , consider the updates from step to with fixed; denote :
(27)
where the inequality is due to the convexity of , and the last two lines follow from the minimizer condition (24a) and the identity (23), respectively.
Second, for the dual update of each view:
(28)
Lastly, for the updates, from the -weak convexity of the sub-objective function :
(29)
Combining (27), (28), and (29), we have:
where denotes the augmented Lagrangian evaluated with step solution. The next step is to address the negative squared norm . Since is full-row rank , consider the following:
(30)
where denotes the smallest and largest singular value of a linear operator , we connect the gradient of to the dual update:
(31)
where the last inequality is due to the -smoothness of . After applying (31), we need the relation:
for a constant . Note that by definition, is full-row rank, hence the above relation is non-trivial. Fortunately, since each is convex w.r.t. , which assures a unique minimizer, we can follow the sub-minimization technique recently developed in [20] to establish the desired relation. By defining the following proximal operator:
which coincides with the update, we have:
(32)
where the last inequality is due to the Lipschitz continuity of . Applying (32) to (27), we have:
This completes the proof.
Appendix D Proof of Lemma 4
The proof is divided into two parts. We first establish the following relation between a step solution and a stationary point :
(33)
This is accomplished by the following relations. Starting from , for the differences, using convexity and the minimizer condition (24a):
(34)
where we use the reduction of the minimizer conditions at a stationary point (25) to have . As for the difference:
(35)
Note that by assumption . Therefore, combining the two with the inner products associated with the dual variables, we have the desired result (33). The second part is to construct the following relation:
(36)
which is straightforward to show since:
(37)
where the last equality is due to the minimizer conditions (24). Note that in (37), for simplicity, we derive the relation considering a single for and . With (33) and (36), consider a neighborhood around a stationary point such that and for some , then we have:
(38)
where in the last equality, by construction, because is not a stationary point, there exists a such that [27, Lemma 2.1]. Then we complete the proof as (38) implies that the Łojasiewicz exponent .
Appendix E Proof of Lemma 5
From (37), we have:
(39)
Then there exists a positive constant such that:
where denotes the step collective point.
Appendix F Proof of Theorem 1
We first show the convergence. By Assumption A, the sufficient decrease lemma (Lemma 3) holds. Consider a sequence obtained through the algorithm (11); by the sufficient decrease lemma, there exist constants such that:
(40)
In discrete settings, is lower semi-continuous and therefore the l.h.s. of (40) is bounded. Following this, observe that the r.h.s. of (40) forms a Cauchy sequence, and hence as . Given this, by (31), the result implies as . This proves convergence.
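The telescoping argument can be checked numerically on a simple smooth problem; gradient descent below is an illustrative stand-in for the ADMM solver, and the function and constants are chosen for this toy example:

```python
import numpy as np

# Numerical sketch of the sufficient-decrease/telescoping argument.
# Gradient descent on a 2-smooth function: with step 0.5, the descent
# lemma gives F(w_k) - F(w_{k+1}) >= kappa * ||w_{k+1} - w_k||^2 with
# kappa = 1/eta - L/2 = 1, so summing telescopes and the step norms vanish.
F = lambda w: 0.5 * w @ w + np.cos(w).sum()
gradF = lambda w: w - np.sin(w)          # gradient of F; Hessian in [0, 2]

eta, kappa = 0.5, 1.0                    # step size; decrease constant
w = np.array([3.0, -2.0])
gaps, steps = [], []
for _ in range(100):
    w_new = w - eta * gradF(w)
    gaps.append(F(w) - F(w_new))         # per-step decrease
    steps.append(float(np.sum((w_new - w) ** 2)))
    w = w_new

print(all(g >= kappa * s - 1e-12 for g, s in zip(gaps, steps)))  # True
print(sum(steps) <= sum(gaps) / kappa + 1e-9)                    # True
print(steps[-1] < 1e-3 * steps[0])       # step norms vanish
```

Boundedness of the objective from below caps the telescoped sum of decreases, which forces the squared step norms to be summable, the same mechanism as in (40).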
Given convergence, along with Lemma 4 and Lemma 5, that is, with the KŁ property satisfied with Łojasiewicz exponent and the contraction of the gradient of established, by Lemma 6 we prove that the rate of convergence of the sequence obtained through the algorithm (11) is -linear around a neighborhood of a stationary point .
References
- [1] S. Sun, “A survey of multi-view machine learning,” Neural computing and applications, vol. 23, no. 7, pp. 2031–2038, 2013.
- [2] Y. Yang and H. Wang, “Multi-view clustering: A survey,” Big Data Mining and Analytics, vol. 1, no. 2, pp. 83–107, 2018.
- [3] M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata, “Learning robust representations via multi-view information bottleneck,” in International Conference on Learning Representations, 2020.
- [4] W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multi-view representation learning,” in International conference on machine learning. PMLR, 2015, pp. 1083–1092.
- [5] Y. Li, M. Yang, and Z. Zhang, “A survey of multi-view representation learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 10, pp. 1863–1883, 2019.
- [6] K. Zhan, F. Nie, J. Wang, and Y. Yang, “Multiview consensus graph clustering,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1261–1270, 2019.
- [7] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000.
- [8] C. Xu, D. Tao, and C. Xu, “Large-margin multi-view information bottleneck,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1559–1572, 2014.
- [9] Y. Gao, S. Gu, L. Xia, and Y. Fei, “Web document clustering with multi-view information bottleneck,” in 2006 International Conference on Computational Intelligence for Modelling Control and Automation and International Conference on Intelligent Agents Web Technologies and International Commerce (CIMCA’06), 2006, pp. 148–148.
- [10] S. Hu, Z. Shi, and Y. Ye, “DMIB: Dual-correlated multivariate information bottleneck for multiview clustering,” IEEE Transactions on Cybernetics, pp. 1–15, 2020.
- [11] Q. Wang, C. Boudreau, Q. Luo, P.-N. Tan, and J. Zhou, “Deep multi-view information bottleneck,” in Proceedings of the 2019 SIAM International Conference on Data Mining. SIAM, 2019, pp. 37–45.
- [12] Z. Goldfeld and Y. Polyanskiy, “The information bottleneck problem and its applications in machine learning,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 19–38, 2020.
- [13] Y. Uğur, I. E. Aguerri, and A. Zaidi, “Vector Gaussian CEO problem under logarithmic loss and applications,” IEEE Transactions on Information Theory, vol. 66, no. 7, pp. 4183–4202, 2020.
- [14] I. E. Aguerri and A. Zaidi, “Distributed variational representation learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 120–138, 2021.
- [15] I. Estella Aguerri and A. Zaidi, “Distributed information bottleneck method for discrete and Gaussian sources,” in International Zurich Seminar on Information and Communication (IZS 2018). Proceedings. ETH Zurich, 2018, pp. 35–39.
- [16] A. Zaidi, I. Estella-Aguerri, and S. Shamai, “On the information bottleneck problems: Models, connections, applications and information theoretic views,” Entropy, vol. 22, no. 2, p. 151, 2020.
- [17] T. A. Courtade and T. Weissman, “Multiterminal source coding under logarithmic loss,” IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 740–761, 2014.
- [18] S. Boyd, N. Parikh, and E. Chu, Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc, 2011.
- [19] H. Attouch and J. Bolte, “On the convergence of the proximal algorithm for nonsmooth functions involving analytic features,” Mathematical Programming, vol. 116, no. 1, pp. 5–16, 2009.
- [20] Y. Wang, W. Yin, and J. Zeng, “Global convergence of admm in nonconvex nonsmooth optimization,” Journal of Scientific Computing, vol. 78, no. 1, pp. 29–63, 2019.
- [21] T.-H. Huang and A. el Gamal, “A provably convergent information bottleneck solution via ADMM,” in 2021 IEEE International Symposium on Information Theory (ISIT), 2021, pp. 43–48.
- [22] D. Boley, “Local linear convergence of the alternating direction method of multipliers on quadratic or linear programs,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2183–2207, 2013.
- [23] K. Guo, D. R. Han, and T. T. Wu, “Convergence of alternating direction method for minimizing sum of two nonconvex functions with linear constraints,” International Journal of Computer Mathematics, vol. 94, no. 8, pp. 1653–1669, 2017.
- [24] M. Chao, D. Han, and X. Cai, “Convergence of the Peaceman-Rachford splitting method for a class of nonconvex programs,” Numerical Mathematics: Theory, Methods and Applications, vol. 14, no. 2, pp. 438–460, 2021.
- [25] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, ser. COLT’ 98. New York, NY, USA: Association for Computing Machinery, 1998, p. 92–100.
- [26] Y. Ugur, I. E. Aguerri, and A. Zaidi, “A generalization of Blahut-Arimoto algorithm to compute rate-distortion regions of multiterminal source coding under logarithmic loss,” arXiv preprint arXiv:1708.07309, 2017.
- [27] G. Li and T. K. Pong, “Calculus of the exponent of Kurdyka–Łojasiewicz inequality and its applications to linear convergence of first-order methods,” Foundations of Computational Mathematics, vol. 18, no. 5, pp. 1199–1232, 2018.
- [28] K. Sricharan, R. Raich, and A. O. Hero, “Estimation of nonlinear functionals of densities with confidence,” IEEE Transactions on Information Theory, vol. 58, no. 7, pp. 4135–4159, 2012.
- [29] R. Blahut, “Computation of channel capacity and rate-distortion functions,” IEEE Transactions on Information Theory, vol. 18, no. 4, pp. 460–473, 1972.
- [30] O. Shamir, S. Sabato, and N. Tishby, “Learning and generalization with the information bottleneck,” Theoretical Computer Science, vol. 411, no. 29-30, pp. 2696–2711, 2010.
- [31] E. Schubert and A. Zimek, “ELKI: A large open-source library for data analysis - ELKI release 0.7.5 ”heidelberg”,” CoRR, vol. abs/1902.03616, 2019.
- [32] D. Cremers and K. Kolev, “Multiview stereo and silhouette consistency via convex functionals over convex domains,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1161–1174, 2011.
- [33] G. Franca, D. Robinson, and R. Vidal, “ADMM and accelerated ADMM as continuous dynamical systems,” in International Conference on Machine Learning. PMLR, 2018, pp. 1559–1567.
- [34] Y. Han, J. Jiao, T. Weissman, and Y. Wu, “Optimal rates of entropy estimation over Lipschitz balls,” The Annals of Statistics, vol. 48, no. 6, pp. 3228–3250, 2020.
- [35] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). New York, NY, USA: Wiley-Interscience, 2006.
- [36] K. Guo, D. Han, and X. Yuan, “Convergence analysis of Douglas–Rachford splitting method for “strongly + weakly” convex programming,” SIAM Journal on Numerical Analysis, vol. 55, no. 4, pp. 1549–1577, 2017.
- [37] Z. Jia, X. Gao, X. Cai, and D. Han, “Local linear convergence of the alternating direction method of multipliers for nonconvex separable optimization problems,” Journal of Optimization Theory and Applications, vol. 188, no. 1, p. 1–25, 2021.
- [38] J. Nocedal, Numerical optimization, 2nd ed., ser. Springer series in operations research. New York: Springer, 2006.
- [39] Y. Nesterov, Lectures on convex optimization, ser. Springer optimization and its applications. Cham: Springer, 2018, vol. 137.
- [40] T.-H. Huang, A. E. Gamal, and H. E. Gamal, “A linearly convergent Douglas-Rachford splitting solver for Markovian information-theoretic optimization problems,” arXiv preprint arXiv:2203.07527, 2022.
- [41] I. Sason, “On reverse Pinsker inequalities,” CoRR, vol. abs/1503.07118, 2015.