
One-Step Late Fusion Multi-view Clustering with Compressed Subspace

Abstract

Late fusion multi-view clustering (LFMVC) has become a rapidly growing class of methods in the multi-view clustering (MVC) field, owing to its excellent computational speed and clustering performance. One bottleneck of existing late fusion methods is that they are usually aligned to the average kernel, which makes the clustering performance highly dependent on the quality of the datasets. Another problem is that they require a subsequent k-means step after obtaining the consensus partition matrix to produce the final discrete labels, and the resulting separation of label learning from cluster structure optimization limits the integrity of these models. To address the above issues, we propose an integrated framework named One-Step Late Fusion Multi-view Clustering with Compressed Subspace (OS-LFMVC-CS). Specifically, we use the consensus subspace to align the partition matrices while optimizing the partition fusion, and utilize the fused partition matrix to guide the learning of discrete labels. A six-step iterative optimization approach with verified convergence is proposed. Extensive experiments on multiple datasets validate the effectiveness and efficiency of our proposed method.

Index Terms—  Multi-view Clustering; Unsupervised learning and clustering; Late Fusion; One Step

1 Introduction

The $k$-means algorithm, as a classical and widely used clustering algorithm, provides an intuitive and effective method for cluster analysis. Denote the data matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$, each row of which represents an element of the sample set $\{\mathbf{x}_{i}\}_{i=1}^{n}\subseteq\mathcal{X}$. Denote $\mathbf{F}\in\mathbb{R}^{n\times\mu}$ as the cluster indicator matrix, where $\mu$ is the total number of clusters; $\mathbf{F}_{i,j}=\frac{1}{\sqrt{|C_{j}|}}$ if and only if $x_{i}$ belongs to the $j$-th cluster, and $\mathbf{F}_{i,j}=0$ otherwise. Discrete $k$-means can be expressed as:

$\max_{\mathbf{F}}\operatorname{Tr}\left(\mathbf{F}^{\top}\mathbf{X}\mathbf{X}^{\top}\mathbf{F}\right), \quad \text{s.t. } \mathbf{F}\in\mathbb{R}^{n\times\mu},\ \mathbf{F}_{ij}=\begin{cases}\frac{1}{\sqrt{|\mathbf{C}_{j}|}}, & \text{if } x_{i} \text{ is in the } j\text{-th cluster},\\ 0, & \text{otherwise}.\end{cases}$ (1)

Optimizing the above objective is an NP-hard problem. Therefore, traditional $k$-means clustering algorithms relax the discrete constraint on $\mathbf{F}$ to an orthogonality constraint. The notation used in this paper is listed in Table 1.

Table 1: NOMENCLATURE
$p$ — The number of views.
$\mu$ — The number of clusters.
$k$ — The dimension of the partition matrices.
$m$ — The scale of the compressed subspace.
$\mathbf{I}$ — Identity matrix.
$\mathbf{X}\in\mathbb{R}^{n\times d}$ — Data matrix of $n$ samples with $d$ dimensions.
$\|\cdot\|_{\mathrm{F}}$ — Frobenius norm.
$\boldsymbol{\beta}\in\mathbb{R}^{p}$ — The vector of $p$ view weights.
$\{\mathbf{K}_{i}\}_{i=1}^{p}$ — The $p$ base kernel matrices.
$\mathbf{Y}\in\{0,1\}^{\mu\times n}$ — Label matrix.
$\mathbf{C}\in\mathbb{R}^{k\times\mu}$ — The clustering centroids.
$\{\mathbf{W}_{i}\}_{i=1}^{p}$ — The permutation matrix of the $i$-th partition.
$\mathbf{H}\in\mathbb{R}^{k\times n}$ — Partition matrix of the consensus embedding.
$\mathbf{H}_{i}\in\mathbb{R}^{k\times n}$ — The partition matrix of the $i$-th base kernel.
$\mathbf{P}\in\mathbb{R}^{n\times m}$ — The unified compression matrix.
$\mathbf{S}\in\mathbb{R}^{m\times n}$ — The consensus reconstruction matrix.

For datasets that are not linearly separable, kernel mappings are utilized to perform kernel $k$-means[1, 2], which can be easily adopted in multi-view clustering tasks[3, 4, 5]. Denote $\phi_{i}:\mathcal{X}\rightarrow\mathcal{H}_{i}$ as the $i$-th feature mapping from $\{\mathbf{x}_{i}\}_{i=1}^{n}$ into the $p$ Reproducing Kernel Hilbert Spaces (RKHS) $\{\mathcal{H}_{i}\}_{i=1}^{p}$. Each sample in multiple kernel clustering is represented as $\phi_{\boldsymbol{\beta}}(\mathbf{x})=\left[\beta_{1}\phi_{1}(\mathbf{x})^{\top},\cdots,\beta_{p}\phi_{p}(\mathbf{x})^{\top}\right]^{\top}$, where $\boldsymbol{\beta}$ is the coefficient vector of the $p$ base kernels; the kernel weights are adjusted during the clustering process to optimize clustering performance. The combined kernel is $\kappa_{\boldsymbol{\beta}}(\mathbf{x}_{i},\mathbf{x}_{j})=\phi_{\boldsymbol{\beta}}(\mathbf{x}_{i})^{\top}\phi_{\boldsymbol{\beta}}(\mathbf{x}_{j})$, and the corresponding loss function of Multiple Kernel K-means (MKKM) is:

$\min_{\mathbf{H},\boldsymbol{\beta}} \operatorname{Tr}\left(\mathbf{K}_{\boldsymbol{\beta}}\left(\mathbf{I}_{n}-\mathbf{H}^{\top}\mathbf{H}\right)\right), \quad \text{s.t. } \mathbf{H}\in\mathbb{R}^{k\times n},\ \mathbf{H}\mathbf{H}^{\top}=\mathbf{I}_{k},\ \boldsymbol{\beta}^{\top}\mathbf{1}_{p}=1,\ \boldsymbol{\beta}\geq 0.$ (2)

Optimizing this objective reduces to alternately solving a traditional $k$-means problem for $\mathbf{H}$ and a quadratic programming problem for $\boldsymbol{\beta}$.
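For concreteness, the following is a minimal NumPy sketch of this alternating scheme under the parameterization $\mathbf{K}_{\boldsymbol{\beta}}=\sum_{i}\beta_{i}^{2}\mathbf{K}_{i}$ implied by the definition of $\phi_{\boldsymbol{\beta}}$. Function and variable names are illustrative rather than taken from any released implementation, and the $\boldsymbol{\beta}$-step uses the standard closed-form solution of the simplex-constrained quadratic subproblem instead of an explicit solver.

import numpy as np

def mkkm(kernels, k, n_iter=50, tol=1e-6):
    # Toy alternating optimization of Eq. (2), assuming K_beta = sum_i beta_i^2 K_i.
    p = len(kernels)
    beta = np.full(p, 1.0 / p)
    prev = np.inf
    for _ in range(n_iter):
        K = sum(b ** 2 * Ki for b, Ki in zip(beta, kernels))
        # H-step: relaxed k-means, i.e. the top-k eigenvectors of K_beta (so H H^T = I_k).
        _, vecs = np.linalg.eigh(K)
        H = vecs[:, -k:].T
        # beta-step: min sum_i beta_i^2 a_i with sum_i beta_i = 1, beta >= 0 (assumes a_i > 0),
        # whose closed form is beta_i proportional to 1 / a_i.
        a = np.array([np.trace(Ki) - np.trace(H @ Ki @ H.T) for Ki in kernels])
        beta = (1.0 / a) / np.sum(1.0 / a)
        obj = float(beta ** 2 @ a)
        if abs(prev - obj) < tol:
            break
        prev = obj
    return H, beta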

Plenty of work has emerged to improve MKKM. First, linear fusion-based methods[6] assume that each base kernel corresponds to a view of different dimensions and thus captures complementary aspects of the features, so a multi-view consensus kernel matrix can be obtained as a linear combination of the base kernel matrices. Secondly, joint training-based methods[7, 8] assume that clustering results can be obtained independently from each base kernel matrix and that the per-view results should be consistent, so the clustering information can be fused by statistical methods to further enhance the credibility of the results. Thirdly, consensus structure extraction methods[9, 10] assume that different kernel functions acting on the data preserve a consistent clustering structure across multiple types of data, and this consensus structure is obtained by decomposing the kernel matrices. There are also deep learning based methods for improving clustering performance and learning representations[11, 12, 13, 14, 15, 16, 17, 18]. To improve efficiency and to better represent the sample distribution of individual views, late-fusion-based multiple kernel clustering methods[19, 20] use a more compact structural representation (base partition matrices $\mathbf{H}_{i}\in\mathbb{R}^{k\times n}$ extracted by the $k$-means algorithm) rather than the base kernels themselves.

We propose a One-Step Late Fusion Multi-view Clustering with Compressed Subspace method to directly obtain discrete cluster labels by integrating cluster structure optimization and label learning into a unified framework. Our proposed algorithm has the following advantages:

  • Our algorithm is able to obtain clustering labels in one step, by negotiating label learning and cluster structure optimization through a unified framework.

  • The method is highly efficient with both $\mathcal{O}(n)$ time and space cost, which allows our algorithm to be applied directly to large-scale multi-view datasets.

  • We propose a six-step iterative optimization algorithm whose objective converges quickly. We conduct experiments to verify the effectiveness and efficiency of the algorithm.

2 Methodology

Most existing multiple kernel clustering methods assume that the optimal kernel lies in the linear span of the base kernels[6], and this assumption greatly limits the feasible domain of the optimal kernel. ONKC methods[21] reconstruct the kernel matrices in a nonlinear neighborhood space, thus enlarging the search space of the optimal kernel, but their computational overhead is $\mathcal{O}(n^{3})$, which restricts their application to large-scale datasets. LFMVC methods[22, 20] reduce the dimensionality of the kernel matrices by constructing the corresponding base partition matrices, thereby reducing the computational overhead. However, these methods usually use the average kernel as the reference for partition alignment and therefore require high-quality base partition matrices. These drawbacks limit their generalizability to clustering tasks over a wide range of datasets.

In addition, the above methods cannot directly produce cluster labels and require an extra spectral clustering or $k$-means step to obtain the final clusters. To tackle the NP-hard problem posed by discrete cluster labels, spectral rotation (SR)[23, 24, 25] and improved spectral rotation (ISR)[26] methods have been proposed to learn discrete labels and representations simultaneously. However, their $\mathcal{O}(n^{3})$ time cost and $\mathcal{O}(n^{2})$ space cost greatly inhibit scalability to large-scale datasets.

We adopt the kernel subspace clustering method[27] for self-reconstruction of the consensus kernel partition, and use trace alignment rather than the Frobenius norm to avoid a re-weighting procedure[28]. We additionally adopt a compressed subspace, defined by a unified matrix $\mathbf{P}$, to further increase computational efficiency. It is proved in [29] that maximizing the late-fusion alignment is equivalent to minimizing the MKKM objective in Eq. (2). By integrating late-fusion kernel partition alignment maximization and self-reconstruction through a shared reconstruction matrix $\mathbf{S}$, our method optimizes the cluster structure through the negotiation of multiple views. A cluster label assignment matrix $\mathbf{Y}$ is learned together with the centroids $\mathbf{C}$ to refine the aligned partition $\mathbf{H}$. The overall optimization objective is:

$\max_{\mathbf{P},\mathbf{S},\{\mathbf{W}_{i}\}_{i=1}^{p},\boldsymbol{\beta},\mathbf{C},\mathbf{Y}} \operatorname{tr}\left(\mathbf{H}(\mathbf{H}\mathbf{P}\mathbf{S})^{\top}+\mathbf{Y}^{\top}\mathbf{C}^{\top}\mathbf{H}\right)$ (3)
$\text{s.t. } \mathbf{P}^{\top}\mathbf{P}=\mathbf{I}_{m},\ \mathbf{S}\geq 0,\ \sum_{i=1}^{m}\mathbf{S}_{i,j}^{2}=1,\ \forall j\in\{1,2,\cdots,n\},$
$\boldsymbol{\beta}_{i}\geq 0,\ \forall i\in\{1,2,\cdots,p\},\ \mathbf{1}_{\mu}^{\top}\mathbf{Y}=\mathbf{1}_{n}^{\top},\ \mathbf{Y}\in\{0,1\}^{\mu\times n},$
$\mathbf{H}=\sum_{i=1}^{p}\boldsymbol{\beta}_{i}\mathbf{W}_{i}\mathbf{H}_{i},\ \mathbf{W}_{i}\mathbf{W}_{i}^{\top}=\mathbf{I}_{k},\ \mathbf{C}^{\top}\mathbf{C}=\mathbf{I}_{\mu},$

where $\mathbf{1}_{n}\in\mathbb{R}^{n}$ is an all-one column vector.
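To make the coupling between the two trace terms concrete, the short NumPy sketch below evaluates the objective of Eq. (3) for given variables; all names are illustrative, and the traces are computed as Frobenius inner products so that no $n\times n$ matrix is ever formed.

import numpy as np

def objective(beta, W, H_list, P, S, C, Y):
    # Value of Eq. (3), with H the fused partition sum_i beta_i W_i H_i (k x n).
    H = sum(b * Wi @ Hi for b, Wi, Hi in zip(beta, W, H_list))
    fusion = np.sum(H * (H @ P @ S))   # tr(H (H P S)^T): alignment with the compressed subspace
    labels = np.sum((C @ Y) * H)       # tr(Y^T C^T H) = <C Y, H>_F: agreement with the discrete labels
    return fusion + labels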

3 Optimization

We develop a six-step iterative optimization algorithm to maximize the clustering objective in Eq. (3).

Update $\{\mathbf{W}_{i}\}_{i=1}^{p}$ with $\mathbf{P}$, $\mathbf{S}$, $\boldsymbol{\beta}$, $\mathbf{C}$, $\mathbf{Y}$ fixed. For each $\delta\in\{1,2,\cdots,p\}$, we update $\mathbf{W}_{\delta}$ by maximizing the target listed below:

$\max_{\mathbf{W}_{\delta}} \operatorname{tr}\left(\mathbf{G}\mathbf{W}_{\delta}^{\top}\right) \quad \text{s.t. } \mathbf{W}_{\delta}^{\top}\mathbf{W}_{\delta}=\mathbf{I}_{k},$ (4)

where $\mathbf{G}=\beta_{\delta}\left(\sum_{j=1,j\neq\delta}^{p}\beta_{j}\mathbf{W}_{j}\mathbf{H}_{j}\right)\mathbf{S}^{\top}\mathbf{P}^{\top}\mathbf{H}_{\delta}^{\top}+\beta_{\delta}\mathbf{C}\mathbf{Y}\mathbf{H}_{\delta}^{\top}=\mathbf{U}_{g}\mathbf{D}_{g}\mathbf{V}_{g}^{\top}$, with $\mathbf{U}_{g}$ and $\mathbf{V}_{g}$ the left and right singular matrices, respectively. According to [30], a closed-form optimal solution is:

$\mathbf{W}_{\delta}^{*}=\mathbf{U}_{g}\mathbf{V}_{g}^{\top}.$ (5)

Following the updating formula in Eq. (5), the algorithm refreshes $\mathbf{W}_{i},\ i=1,2,\cdots,p$ in succession.
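As an illustration of Eqs. (4)-(5), the sketch below implements the SVD-based (orthogonal Procrustes) update in NumPy; the procrustes helper is generic and is reused in the sketches of the $\mathbf{P}$- and $\mathbf{C}$-steps further down. All names are illustrative assumptions rather than a released implementation.

import numpy as np

def procrustes(G):
    # Solves max_W tr(G W^T) s.t. W is orthogonal: W* = U V^T, where G = U D V^T.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def update_W(delta, beta, W, H_list, P, S, C, Y):
    # G of Eq. (4): beta_delta (sum_{j != delta} beta_j W_j H_j) S^T P^T H_delta^T + beta_delta C Y H_delta^T.
    rest = sum(beta[j] * W[j] @ H_list[j] for j in range(len(W)) if j != delta)
    G = beta[delta] * rest @ S.T @ P.T @ H_list[delta].T \
        + beta[delta] * C @ Y @ H_list[delta].T
    return procrustes(G)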

Update $\boldsymbol{\beta}$ with $\{\mathbf{W}_{i}\}_{i=1}^{p}$, $\mathbf{P}$, $\mathbf{S}$, $\mathbf{C}$, $\mathbf{Y}$ fixed. With the other variables settled, updating $\boldsymbol{\beta}$ with fixed $\{\mathbf{W}_{i}\}_{i=1}^{p}$, which concludes the final fusion step, can be written as:

$\max_{\boldsymbol{\beta}} \sum_{i=1}^{p}\sum_{j=1}^{p}\beta_{i}\beta_{j}\operatorname{tr}\left(\mathbf{P}\mathbf{S}\mathbf{H}_{i}^{\top}\mathbf{W}_{i}^{\top}\mathbf{W}_{j}\mathbf{H}_{j}\right)+\sum_{i=1}^{p}\beta_{i}\operatorname{tr}\left(\mathbf{C}^{\top}\mathbf{W}_{i}\mathbf{H}_{i}\mathbf{Y}^{\top}\right) \quad \text{s.t. } \boldsymbol{\beta}\geq 0,$ (6)

denoting the quadratic coefficient matrix $\mathbf{M}$ with $\mathbf{M}_{i,j}=-\operatorname{tr}\left(\mathbf{P}\mathbf{S}\mathbf{H}_{i}^{\top}\mathbf{W}_{i}^{\top}\mathbf{W}_{j}\mathbf{H}_{j}\right)$ and the linear coefficient vector $f$ with $f_{i}=-\operatorname{tr}\left(\mathbf{C}^{\top}\mathbf{W}_{i}\mathbf{H}_{i}\mathbf{Y}^{\top}\right)$, Eq. (6) can be rewritten as the minimization problem:

$\min_{\boldsymbol{\beta}} \boldsymbol{\beta}^{\top}\mathbf{M}\boldsymbol{\beta}+f^{\top}\boldsymbol{\beta}, \quad \text{s.t. } \boldsymbol{\beta}\geq 0.$ (7)

It is worth noticing that Eq. (7) is a quadratic optimization problem, and symmetrizing $\mathbf{M}$ does not change the value of the quadratic form. Alternatively, a closed-form solution could be obtained via the Cauchy-Schwarz inequality after diagonalizing the symmetrized $\mathbf{M}$; here we directly adopt the quadratic programming scheme of [31] for higher speed.
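A minimal sketch of this step is given below, with a generic bound-constrained SciPy solver standing in for the reflective Newton method of [31]; since $\mathbf{M}$ need not be positive semi-definite, such a solver only guarantees a stationary point. The names and the choice of L-BFGS-B are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def update_beta(M, f, beta0):
    # Solves Eq. (7): min_beta beta^T M beta + f^T beta, s.t. beta >= 0.
    Ms = 0.5 * (M + M.T)  # symmetrization leaves the quadratic form unchanged
    fun = lambda b: b @ Ms @ b + f @ b
    jac = lambda b: 2.0 * Ms @ b + f
    res = minimize(fun, beta0, jac=jac, method="L-BFGS-B",
                   bounds=[(0.0, None)] * len(beta0))
    return res.x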

Update $\mathbf{P}$ with $\boldsymbol{\beta}$, $\{\mathbf{W}_{i}\}_{i=1}^{p}$, $\mathbf{S}$, $\mathbf{C}$, $\mathbf{Y}$ fixed. The optimization problem in Eq. (3) with respect to $\mathbf{P}$ can be written as:

$\max_{\mathbf{P}} \operatorname{tr}\left(\mathbf{H}(\mathbf{H}\mathbf{P}\mathbf{S})^{\top}\right) \quad \text{s.t. } \mathbf{P}^{\top}\mathbf{P}=\mathbf{I}_{m},\ \mathbf{H}=\sum_{i=1}^{p}\beta_{i}\mathbf{W}_{i}\mathbf{H}_{i}.$ (8)

Likewise, the derivation for the compression matrix $\mathbf{P}$ is similar to Eq. (4). Denoting $\mathbf{A}=\mathbf{H}^{\top}\mathbf{H}\mathbf{S}^{\top}=\mathbf{U}_{a}\mathbf{D}_{a}\mathbf{V}_{a}^{\top}$, the optimal solution is:

$\mathbf{P}^{*}=\mathbf{U}_{a}\mathbf{V}_{a}^{\top}.$ (9)
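In code this is again an orthogonal Procrustes problem; a two-line sketch that reuses the procrustes helper from the $\mathbf{W}$-step sketch and forms $\mathbf{A}$ without the explicit $n\times n$ product (names illustrative):

A = H.T @ (H @ S.T)   # n x m, equals H^T H S^T
P = procrustes(A)     # optimal compression matrix of Eq. (9)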

Update $\mathbf{S}$ with $\boldsymbol{\beta}$, $\{\mathbf{W}_{i}\}_{i=1}^{p}$, $\mathbf{P}$, $\mathbf{C}$, $\mathbf{Y}$ fixed. With $\boldsymbol{\beta}$ and $\{\mathbf{W}_{i}\}_{i=1}^{p}$ fixed and the compression matrix $\mathbf{P}$ given, the late-fusion kernel partition is lodged in a unified compressed subspace shared among views. Updating $\mathbf{S}$ is equivalent to constructing the bipartite graph from the anchor space to the original feature space; in our method, however, the consensus subspace needs no further integration. The optimization problem with respect to $\mathbf{S}$ is:

$\max_{\mathbf{S}} \operatorname{tr}\left(\mathbf{H}^{\top}\mathbf{H}\mathbf{P}\mathbf{S}\right) \quad \text{s.t. } \sum_{i=1}^{m}\mathbf{S}_{i,j}^{2}=1,\ \forall j\in\{1,2,\cdots,n\}.$ (10)

Let $\mathbf{Q}=\mathbf{H}^{\top}\mathbf{H}\mathbf{P}=\left(\boldsymbol{q}_{1}\ \boldsymbol{q}_{2}\ \cdots\ \boldsymbol{q}_{n}\right)^{\top}$ and $\mathbf{S}=\left(\boldsymbol{s}_{1}\ \boldsymbol{s}_{2}\ \cdots\ \boldsymbol{s}_{n}\right)$, where the $\boldsymbol{q}_{i}$ and $\boldsymbol{s}_{i}$ are column vectors. Eq. (10) is then equivalent to:

$\max_{\mathbf{S}} \sum_{i=1}^{n}\boldsymbol{q}_{i}^{\top}\boldsymbol{s}_{i}, \quad \text{s.t. } \sum_{i=1}^{m}\mathbf{S}_{i,j}^{2}=1,$ (11)

accordingly, the optimal solution is:

$\mathbf{S}_{j,i}^{*}=\frac{\mathbf{Q}_{ij}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}.$ (12)
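A short sketch of this column-wise normalization, continuing the naming of the earlier sketches and again avoiding the explicit $n\times n$ product (it assumes no row of $\mathbf{Q}$ is exactly zero):

Q = H.T @ (H @ P)                                       # n x m, whose i-th row is q_i^T
S = (Q / np.linalg.norm(Q, axis=1, keepdims=True)).T    # m x n with unit-norm columns, Eq. (12)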

Update $\mathbf{C}$ with $\boldsymbol{\beta}$, $\{\mathbf{W}_{i}\}_{i=1}^{p}$, $\mathbf{P}$, $\mathbf{Y}$, $\mathbf{S}$ fixed. The optimization problem w.r.t. the centroids $\mathbf{C}$ is:

$\max_{\mathbf{C}} \operatorname{tr}\left(\mathbf{H}\mathbf{Y}^{\top}\mathbf{C}^{\top}\right), \quad \text{s.t. } \mathbf{C}^{\top}\mathbf{C}=\mathbf{I}_{\mu},$ (13)

likewise, denoting $\mathbf{H}\mathbf{Y}^{\top}=\mathbf{U}_{c}\mathbf{D}_{c}\mathbf{V}_{c}^{\top}$, the optimal solution is:

$\mathbf{C}^{*}=\mathbf{U}_{c}\mathbf{V}_{c}^{\top}.$ (14)
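Under the same illustrative naming, this step is a one-line call to the Procrustes helper from the $\mathbf{W}$-step sketch (assuming $\mu\leq k$ so that $\mathbf{C}^{\top}\mathbf{C}=\mathbf{I}_{\mu}$ is attainable):

C = procrustes(H @ Y.T)   # SVD-based solution of Eq. (14) on the k x mu matrix H Y^T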

Update $\mathbf{Y}$ with $\boldsymbol{\beta}$, $\{\mathbf{W}_{i}\}_{i=1}^{p}$, $\mathbf{P}$, $\mathbf{C}$, $\mathbf{S}$ fixed. The optimization problem w.r.t. the discrete labels $\mathbf{Y}$ is:

$\max_{\mathbf{Y}} \operatorname{tr}\left((\mathbf{C}^{\top}\mathbf{H})^{\top}\mathbf{Y}\right), \quad \text{s.t. } \mathbf{1}_{\mu}^{\top}\mathbf{Y}=\mathbf{1}_{n}^{\top},\ \mathbf{Y}\in\{0,1\}^{\mu\times n},$ (15)

denoting $\mathbf{B}=\mathbf{C}^{\top}\mathbf{H}$, the optimal solution is:

$\mathbf{Y}^{*}(i,j)=\begin{cases}1, & \text{if } i=\arg\max_{i'}\mathbf{B}(i',j),\\ 0, & \text{otherwise},\end{cases} \quad \forall j\in\{1,2,\cdots,n\}.$ (16)
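In code, this is a column-wise argmax over the score matrix $\mathbf{B}$; a sketch under the same illustrative naming:

B = C.T @ H                                            # mu x n score matrix
Y = np.zeros_like(B)
Y[np.argmax(B, axis=0), np.arange(B.shape[1])] = 1.0   # one-hot assignment per column, Eq. (16)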

The overall optimization procedure is summarized in Algorithm 1, and a minimal NumPy sketch of the outer loop is given after the pseudocode. By examining each step, the time and space costs are $\mathcal{O}(n)$, and the efficiency is validated in Sec. 4.

Algorithm 1 One-Step Late Fusion Multi-view Clustering with Compressed Subspace
Input: Multiple base kernels $\{\mathbf{K}_{i}\}_{i=1}^{p}$, number of clusters $\mu$, scale of the compressed subspace $m$, dimension of the partition matrices $k$.
Output: The label matrix $\mathbf{Y}$.
1:  Initialization: initialize the compression matrix $\mathbf{P}\in\mathbb{R}^{n\times m}$ by orthogonalizing a random matrix; initialize $\mathbf{S}$ by imposing the unit $\ell_{2}$-norm constraint on the columns of a random matrix; $\beta_{i}=1/p,\ \forall i$; $\mathbf{W}_{i}=\mathbf{I}_{k},\ \forall i$; $t=1$.
2:  repeat
3:     Calculate $\{\mathbf{W}_{i}\}_{i=1}^{p}$ by Eq. (5);
4:     Calculate $\boldsymbol{\beta}$ by solving Eq. (7);
5:     Calculate $\mathbf{P}$ by Eq. (9);
6:     Calculate $\mathbf{S}$ by Eq. (12);
7:     Calculate $\mathbf{C}$ by Eq. (14);
8:     Calculate $\mathbf{Y}$ by Eq. (16);
9:     $t=t+1$.
10:  until $(obj^{(t)}-obj^{(t-1)})^{2}<10^{-3}$.
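The sketch below stitches the per-step sketches from Sec. 3 into the outer loop of Algorithm 1. It reuses the procrustes, update_W, and update_beta helpers defined earlier, takes the base partitions $\{\mathbf{H}_{i}\}_{i=1}^{p}$ (already extracted from the base kernels) as input, and all names, initializations, and defaults are illustrative assumptions rather than the reference implementation.

import numpy as np

def os_lfmvc_cs(H_list, mu, m, k, max_iter=100, tol=1e-3, seed=0):
    # Sketch of Algorithm 1; H_list holds the p base partition matrices, each k x n.
    p, n = len(H_list), H_list[0].shape[1]
    rng = np.random.default_rng(seed)
    P, _ = np.linalg.qr(rng.standard_normal((n, m)))          # step 1: orthogonal init of P
    S = rng.random((m, n)); S /= np.linalg.norm(S, axis=0)    # unit l2-norm columns
    beta = np.full(p, 1.0 / p)
    W = [np.eye(k) for _ in range(p)]
    C = np.eye(k, mu)                                         # orthonormal columns (assumes mu <= k)
    Y = np.zeros((mu, n)); Y[rng.integers(mu, size=n), np.arange(n)] = 1.0
    prev = None
    for _ in range(max_iter):
        for d in range(p):                                    # W_i updates, Eq. (5)
            W[d] = update_W(d, beta, W, H_list, P, S, C, Y)
        A = [Wi @ Hi for Wi, Hi in zip(W, H_list)]            # aligned partitions W_i H_i
        M = -np.array([[np.sum(Ai * (Aj @ P @ S)) for Aj in A] for Ai in A])
        f = -np.array([np.sum((C.T @ Ai) * Y) for Ai in A])
        beta = update_beta(M, f, beta)                        # Eq. (7)
        H = sum(b * Ai for b, Ai in zip(beta, A))             # fused partition
        P = procrustes(H.T @ (H @ S.T))                       # Eq. (9)
        Q = H.T @ (H @ P)                                     # Eq. (12)
        S = (Q / np.linalg.norm(Q, axis=1, keepdims=True)).T
        C = procrustes(H @ Y.T)                               # Eq. (14)
        B = C.T @ H                                           # Eq. (16)
        Y = np.zeros_like(B); Y[np.argmax(B, axis=0), np.arange(n)] = 1.0
        obj = np.sum(H * (H @ P @ S)) + np.sum((C @ Y) * H)   # value of Eq. (3)
        if prev is not None and (obj - prev) ** 2 < tol:      # stopping rule of Algorithm 1
            break
        prev = obj
    return np.argmax(Y, axis=0)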

4 Experiment

In this section, we conduct a series of experiments to evaluate the effectiveness and efficiency of our proposed method.

Table 2: Comparison of clustering performance on 5 benchmark datasets. The best performances are in bold-face.
Datasets Avg-KKM SB-KKM MKKM RMSC FMKKM FMR LSGMC Proposed
ACC
Citeseer 20.8 ± 0.0 46.3 ± 0.2 20.1 ± 0.0 19.9 ± 0.3 30.6 ± 0.7 23.9 ± 0.0 22.0 ± 0.4 56.6 ± 0.0
Cora 30.7 ± 0.8 45.2 ± 0.1 25.3 ± 0.4 20.2 ± 0.1 38.4 ± 0.1 40.7 ± 0.5 20.9 ± 0.2 60.8 ± 0.0
ProteinFold 29.0 ± 1.5 33.8 ± 1.3 27.0 ± 1.1 31.2 ± 1.0 32.4 ± 1.8 34.5 ± 1.4 32.8 ± 1.2 35.3 ± 0.0
NUSWIDE 12.5 ± 0.4 12.2 ± 0.3 12.7 ± 0.2 - 14.0 ± 0.3 - - 14.7 ± 0.0
Reuters 45.5 ± 1.5 47.2 ± 0.0 45.4 ± 1.5 - 45.5 ± 1.6 - - 53.7 ± 0.0
NMI
Citeseer 2.3 ± 0.0 23.2 ± 0.5 1.9 ± 0.0 0.4 ± 0.1 10.1 ± 0.4 2.3 ± 0.0 1.7 ± 0.1 28.7 ± 0.0
Cora 15.7 ± 1.4 25.6 ± 0.1 9.5 ± 0.2 1.7 ± 0.1 21.8 ± 0.1 20.0 ± 0.2 0.7 ± 0.0 37.8 ± 0.0
ProteinFold 40.3 ± 1.3 41.1 ± 1.1 38.0 ± 0.6 43.2 ± 0.8 41.5 ± 1.0 42.0 ± 1.1 43.9 ± 0.5 44.1 ± 0.0
NUSWIDE 11.1 ± 0.1 11.0 ± 0.1 11.3 ± 0.2 - 12.6 ± 0.2 - - 13.2 ± 0.0
Reuters 27.4 ± 0.4 25.5 ± 0.0 27.3 ± 0.4 - 27.6 ± 0.5 - - 31.8 ± 0.0
Purity
Citeseer 24.9 ± 0.0 48.8 ± 0.4 24.2 ± 0.0 22.1 ± 0.3 32.9 ± 0.7 25.8 ± 0.0 24.9 ± 0.2 58.9 ± 0.0
Cora 41.5 ± 1.3 52.5 ± 0.1 36.1 ± 1.0 31.5 ± 0.0 46.9 ± 0.1 42.9 ± 0.5 30.2 ± 0.0 63.3 ± 0.0
ProteinFold 37.4 ± 1.7 39.4 ± 1.2 33.7 ± 1.1 38.5 ± 0.9 38.6 ± 1.5 40.6 ± 1.4 40.7 ± 0.6 43.4 ± 0.0
NUSWIDE 23.3 ± 0.3 23.7 ± 0.2 24.2 ± 0.4 - 25.7 ± 0.4 - - 25.6 ± 0.0
Reuters 53.0 ± 0.4 53.9 ± 0.0 52.9 ± 0.5 - 53.1 ± 0.4 - - 61.9 ± 0.0

4.1 Evaluation Preliminaries

Datasets We adopt 5 publicly available multi-view benchmark datasets, including Citeseer (http://linqs-data.soe.ucsc.edu/public/lbc/), Cora (http://mlg.ucd.ie/aggregation/), ProteinFold (http://mkl.ucsd.edu/dataset/protein-fold-prediction), NUS-WIDE (https://lms.comp.nus.edu.sg/wp-content/uploads/2019/research/nuswide/NUSWIDE.html), and Reuters (https://kdd.ics.uci.edu/databases/reuters21578/). Among them, NUS-WIDE and Reuters are large-scale datasets with over 10 thousand samples each.

Compared Algorithms Seven algorithms, covering multiple kernel clustering and multi-view subspace clustering methods, are compared with ours over the above 5 benchmark datasets. Specifically, we use SB-KKM and Avg-KKM as kernel clustering baselines, which report the best single-kernel performance and the average-kernel performance of kernel $k$-means, respectively. We select MKKM[32] and FMKKM[33] as representatives of classical kernel methods ranging from fuzzy $k$-means to late-fusion-based methods, and RMSC[34] as a representative spectral clustering method. In the subspace clustering area, we select FMR[35] and LSGMC[36] for performance comparison.

4.2 Performance Analysis

We use the initialization specified in Algorithm 1. Table 2 reports the ACC, NMI, and purity of all methods. To increase the confidence of the results, we run each clustering algorithm 20 times, and within each $k$-means call the number of replicates is set to 10. Both the best performance and the standard deviation, which arises from the $k$-means step, are reported. Since our algorithm is a one-step method without downstream $k$-means or spectral clustering, its variance is 0, guaranteeing stability. The symbol '-' indicates an 'Out-of-Memory' failure. All methods are run on a PC with an Intel(R) Core(TM) i7-12700H CPU at 2.30 GHz and 64 GB RAM.

Fig. 1: Experimental results. (a) Convergence. (b) Parameter sensitivity.

Fig. 2: Time comparison.

Convergence In our six-step iterative optimization process, each updating formula monotonically increases the objective value while keeping the other five decision variables fixed, and the objective is upper-bounded. We further verify the convergence in Fig. 1(a).

Parameter sensitivity Our algorithm has 2 hyper-parameters: the sampling scale $m$ and the partition dimension $k$. An empirical guideline is to set them to multiples of the cluster number $\mu$; our experiments use the $3\times 3$ parameter grid $\{\mu, 2\mu, 4\mu\}$ for each. From Fig. 1(b), we observe that our method maintains stable performance under variation of the hyper-parameters.

Time cost The running times of the compared algorithms on the benchmark datasets are shown in Fig. 2. On the large-scale datasets (NUSWIDE and Reuters), some algorithms (FMR, RMSC, LSGMC) encounter the 'Out-of-Memory' problem, so their time bars are omitted from the comparison graph.

5 Conclusion

In this article, we propose an effective and efficient multi-view clustering method, OS-LFMVC-CS, which simultaneously maximizes the alignment between different views, learns the shared subspace, and directly optimizes the cluster assignments. This mechanism greatly enhances the negotiation among views and improves the consistency in the shared subspace. In this way, we obtain the clustering results in one step and reduce the time and space expenditure to linear cost. We derive a novel optimization framework using a six-step iterative scheme with verified convergence, and extensive experiments support the effectiveness of the method. In the future, we will explore more efficient clustering algorithms and apply the one-step clustering scheme to a wider range of scenarios.

References

  • [1] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis, “Kernel k-means: spectral clustering and normalized cuts,” in KDD, 2004.
  • [2] Radha Chitta, Rong Jin, Timothy C Havens, and Anil K Jain, “Approximate kernel k-means: Solution to large scale kernel clustering,” in ACM SIGKDD, 2011.
  • [3] Liang Du, Peng Zhou, Lei Shi, Hanmo Wang, Mingyu Fan, Wenjian Wang, and Yi-Dong Shen, “Robust multiple kernel k-means using l21-norm,” in IJCAI, 2015.
  • [4] Pei Zhang, Xinwang Liu, Jian Xiong, Sihang Zhou, Wentao Zhao, En Zhu, and Zhiping Cai, “Consensus one-step multi-view subspace clustering,” IEEE TKDE, 2020.
  • [5] Xinwang Liu, “Simplemkkm: Simple multiple kernel k-means,” IEEE TPAMI, 2022.
  • [6] Xinwang Liu, Yong Dou, Jianping Yin, Lei Wang, and En Zhu, “Multiple kernel k-means clustering with matrix-induced regularization.,” in AAAI, 2016.
  • [7] Zhenwen Ren, Quansen Sun, and Dong Wei, “Multiple kernel clustering with kernel k-means coupled graph tensor learning,” in AAAI, 2021.
  • [8] Yihang Lu, Haonan Xin, Rong Wang, Feiping Nie, and Xuelong Li, “Scalable multiple kernel k-means clustering,” in CIKM, 2022.
  • [9] Jing Liu, Fuyuan Cao, Xiao-Zhi Gao, Liqin Yu, and Jiye Liang, “A cluster-weighted kernel k-means method for multi-view clustering,” in AAAI, 2020.
  • [10] Jitao Lu, Yihang Lu, Rong Wang, Feiping Nie, and Xuelong Li, “Multiple kernel k-means clustering with simultaneous spectral rotation,” in ICASSP, 2022.
  • [11] Yue Liu, Ke Liang, Jun Xia, Sihang Zhou, Xihong Yang, Xinwang Liu, and Stan Z. Li, “Dink-net: Neural clustering on large graphs,” in Proc. of ICML, 2023.
  • [12] Yiqi Wang, Chaozhuo Li, Wei Jin, Rui Li, Jianan Zhao, Jiliang Tang, and Xing Xie, “Test-time training for graph neural networks,” arXiv preprint arXiv:2210.08813, 2022.
  • [13] Yiqi Wang, Chaozhuo Li, Mingzheng Li, Wei Jin, Yuming Liu, Hao Sun, Xing Xie, and Jiliang Tang, “Localized graph collaborative filtering,” in SDM, 2022.
  • [14] Wenxuan Tu, Sihang Zhou, Xinwang Liu, Xifeng Guo, Zhiping Cai, En Zhu, and Jieren Cheng, “Deep fusion clustering network,” in AAAI, 2021.
  • [15] Wenxuan Tu, Qing Liao, Sihang Zhou, Xin Peng, Chuan Ma, Zhe Liu, Xinwang Liu, and Zhiping Cai, “Rare: Robust masked graph autoencoder,” IEEE TKDE, 2023.
  • [16] Meng Liu, Yue Liu, Ke Liang, Siwei Wang, Sihang Zhou, and Xinwang Liu, “Deep temporal graph clustering,” arXiv preprint arXiv:2305.10738, 2023.
  • [17] Renxiang Guan, Zihao Li, Teng Li, Xianju Li, Jinzhong Yang, and Weitao Chen, “Classification of heterogeneous mining areas based on ResCapsNet and Gaofen-5 imagery,” Remote Sensing, 2022.
  • [18] Renxiang Guan, Zihao Li, Xianju Li, Chang Tang, and Ruyi Feng, “Contrastive multi-view subspace clustering of hyperspectral images based on graph convolutional networks,” arXiv preprint arXiv:2312.06068, 2023.
  • [19] Siwei Wang, Xinwang Liu, En Zhu, Chang Tang, Jiyuan Liu, Jingtao Hu, Jingyuan Xia, and Jianping Yin, “Multi-view clustering via late fusion alignment maximization,” in IJCAI, 2019.
  • [20] Siwei Wang, Xinwang Liu, Li Liu, Sihang Zhou, and En Zhu, “Late fusion multiple kernel clustering with proxy graph refinement,” TNNLS, 2021.
  • [21] Jiyuan Liu, Xinwang Liu, Jian Xiong, Qing Liao, Sihang Zhou, Siwei Wang, and Yuexiang Yang, “Optimal neighborhood multiple kernel clustering with adaptive local kernels,” TKDE, 2022.
  • [22] Siwei Wang, En Zhu, Jingtao Hu, Miaomiao Li, Kaikai Zhao, Ning Hu, and Xinwang Liu, “Efficient multiple kernel k-means clustering with late fusion,” IEEE Access, 2019.
  • [23] Jin Huang, Feiping Nie, and Heng Huang, “Spectral rotation versus k-means in spectral clustering.,” in AAAI, 2013.
  • [24] Yanwei Pang, Jin Xie, Feiping Nie, and Xuelong Li, “Spectral clustering by joint spectral embedding and spectral rotation.,” IEEE Trans. Cybern., 2020.
  • [25] Yihang Lu, Jitao Lu, Rong Wang, and Feiping Nie, “Discrete multi-kernel k-means with diverse and optimal kernel learning,” in ICASSP, 2022.
  • [26] Xiaojun Chen, Feiping Nie, Joshua Zhexue Huang, and Min Yang, “Scalable normalized cut with improved spectral rotation.,” in IJCAI, 2017.
  • [27] Jihun Ham and Daniel D. Lee, “Extended grassmann kernels for subspace-based learning,” in NeurIPS 21, 2008.
  • [28] Feiping Nie, Jianjun Yuan, and Heng Huang, “Optimal mean robust principal component analysis,” in ICML, 2014.
  • [29] Xinwang Liu, Xinzhong Zhu, Miaomiao Li, Lei Wang, Chang Tang, Jianping Yin, Dinggang Shen, Huaimin Wang, and Wen Gao, “Late fusion incomplete multi-view clustering.,” TPAMI, 2019.
  • [30] Qiyuan Ou, Siwei Wang, Sihang Zhou, Miaomiao Li, Xifeng Guo, and En Zhu, “Anchor-based multiview subspace clustering with diversity regularization,” IEEE MultiMedia, 2020.
  • [31] Thomas F. Coleman and Yuying Li, “A reflective newton method for minimizing a quadratic function subject to bounds on some of the variables,” SIAM J. Optim., 1996.
  • [32] Hsin-Chien Huang, Yung-Yu Chuang, and Chu-Song Chen, “Multiple kernel fuzzy clustering,” IEEE Trans. Fuzzy Syst., 2012.
  • [33] Yi Zhang, Xinwang Liu, Jiyuan Liu, Sisi Dai, Changwang Zhang, Kai Xu, and En Zhu, “Fusion multiple kernel k-means,” in AAAI, 2022.
  • [34] Rongkai Xia, Yan Pan, Lei Du, and Jian Yin, “Robust multi-view spectral clustering via low-rank and sparse decomposition,” in AAAI, 2014.
  • [35] Ruihuang Li, Changqing Zhang, Qinghua Hu, Pengfei Zhu, and Zheng Wang, “Flexible multi-view representation learning for subspace clustering,” in IJCAI, 2019.
  • [36] Wei Lan, Tianchuan Yang, Qingfeng Chen, Shichao Zhang, Yi Dong, Huiyu Zhou, and Yi Pan, “Multiview subspace clustering via low-rank symmetric affinity graph,” TNNLS, 2023.