
One-Step Late Fusion Multi-view Clustering with Compressed Subspace

Abstract

Late fusion multi-view clustering (LFMVC) has become a rapidly growing class of methods in the multi-view clustering (MVC) field, owing to its excellent computational speed and clustering performance. One bottleneck of existing late fusion methods is that they are usually aligned to the average kernel, which makes the clustering performance highly dependent on the quality of the datasets. Another problem is that they require a subsequent k-means step after obtaining the consensus partition matrix to produce the final discrete labels, and the resulting separation of label learning from cluster structure optimization limits the integrity of these models. To address the above issues, we propose an integrated framework named One-Step Late Fusion Multi-view Clustering with Compressed Subspace (OS-LFMVC-CS). Specifically, we use the consensus subspace to align the partition matrices while optimizing the partition fusion, and utilize the fused partition matrix to guide the learning of discrete labels. A six-step iterative optimization approach with verified convergence is proposed. Extensive experiments on multiple datasets validate the effectiveness and efficiency of our proposed method.

Index Terms—  Multi-view Clustering; Unsupervised learning and clustering; Late Fusion; One Step

1 Introduction

The $k$-means algorithm, as a classical and widely used clustering algorithm, provides an intuitive and effective method for cluster analysis. Denote the data matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$, each row of which represents an element of the sample set $\{\mathbf{x}_{i}\}_{i=1}^{n}\subseteq\mathcal{X}$. Denote $\mathbf{F}\in\mathbb{R}^{n\times\mu}$ as the cluster indicator matrix, where $\mu$ is the total number of clusters; $\mathbf{F}_{i,j}=\frac{1}{\sqrt{|C_{j}|}}$ if and only if $x_{i}$ belongs to the $j$-th cluster, and $\mathbf{F}_{i,j}=0$ otherwise. Discrete $k$-means can be expressed as:

$\max_{\mathbf{F}}\operatorname{Tr}\left(\mathbf{F}^{\top}\mathbf{X}\mathbf{X}^{\top}\mathbf{F}\right), \quad \text{s.t. } \mathbf{F}\in\mathbb{R}^{n\times\mu},\ \mathbf{F}_{ij}=\begin{cases}\frac{1}{\sqrt{|\mathbf{C}_{j}|}}, & \text{if } x_{i} \text{ is in the } j\text{-th cluster},\\ 0, & \text{otherwise}.\end{cases}$ (1)

Optimizing the above objective is an NP-hard problem. Therefore, traditional $k$-means clustering algorithms relax the discrete constraint on $\mathbf{F}$ to an orthogonality constraint. The notation used in this paper is listed in Table 1.

Table 1: NOMENCLATURE
$p$ — The number of views.
$\mu$ — The number of clusters.
$k$ — The dimension of the partition matrices.
$m$ — The scale of the compressed subspace.
$\mathbf{I}$ — Identity matrix.
$\mathbf{X}\in\mathbb{R}^{n\times d}$ — Data matrix of $n$ samples with $d$ dimensions.
$\|\cdot\|_{\mathrm{F}}$ — Frobenius norm.
$\boldsymbol{\beta}\in\mathbb{R}^{p}$ — The vector of $p$ view weights.
$\{\mathbf{K}_{i}\}_{i=1}^{p}$ — The $p$ base kernel matrices.
$\mathbf{Y}\in\{0,1\}^{\mu\times n}$ — Label matrix.
$\mathbf{C}\in\mathbb{R}^{k\times\mu}$ — The clustering centroids.
$\{\mathbf{W}_{i}\}_{i=1}^{p}$ — The permutation matrix of the $i$-th partition.
$\mathbf{H}\in\mathbb{R}^{k\times n}$ — Partition matrix of the consensus embedding.
$\mathbf{H}_{i}\in\mathbb{R}^{k\times n}$ — The partition matrix of the $i$-th base kernel.
$\mathbf{P}\in\mathbb{R}^{n\times m}$ — The unified compression matrix.
$\mathbf{S}\in\mathbb{R}^{m\times n}$ — The consensus reconstruction matrix.

For datasets that are not linearly separable, kernel mappings are utilized to perform kernel $k$-means[1, 2], which can be easily adopted in multi-view clustering tasks[3, 4, 5]. Denote $\phi_{i}:\mathcal{X}\rightarrow\mathcal{H}_{i}$ as the $i$-th feature mapping from $\{\mathbf{x}_{i}\}_{i=1}^{n}$ into the $p$ Reproducing Kernel Hilbert Spaces (RKHS) $\{\mathcal{H}_{i}\}_{i=1}^{p}$. Each sample in multiple kernel clustering is represented as $\phi_{\boldsymbol{\beta}}(\mathbf{x})=\left[\beta_{1}\phi_{1}(\mathbf{x})^{\top},\cdots,\beta_{p}\phi_{p}(\mathbf{x})^{\top}\right]^{\top}$, where $\boldsymbol{\beta}$ is the coefficient vector of the $p$ base kernels; the kernel weights are adjusted during the clustering process to optimize clustering performance. The combined kernel is $\kappa_{\boldsymbol{\beta}}(\mathbf{x}_{i},\mathbf{x}_{j})=\phi_{\boldsymbol{\beta}}(\mathbf{x}_{i})^{\top}\phi_{\boldsymbol{\beta}}(\mathbf{x}_{j})$, and the corresponding loss function of Multiple Kernel K-means (MKKM) is:

$\min_{\mathbf{H},\boldsymbol{\beta}} \operatorname{Tr}\left(\mathbf{K}_{\boldsymbol{\beta}}\left(\mathbf{I}_{n}-\mathbf{H}^{\top}\mathbf{H}\right)\right), \quad \text{s.t. } \mathbf{H}\in\mathbb{R}^{k\times n},\ \mathbf{H}\mathbf{H}^{\top}=\mathbf{I}_{k},\ \boldsymbol{\beta}^{\top}\mathbf{1}_{p}=1,\ \boldsymbol{\beta}\geq 0.$ (2)

Optimizing this objective reduces to alternately solving a traditional $k$-means problem for $\mathbf{H}$ and a quadratic programming problem for $\boldsymbol{\beta}$.
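For concreteness, the following is a minimal NumPy sketch of this alternating scheme under the parameterization $\mathbf{K}_{\boldsymbol{\beta}}=\sum_{i}\beta_{i}^{2}\mathbf{K}_{i}$ implied by the definition of $\phi_{\boldsymbol{\beta}}$. Function and variable names are illustrative rather than taken from any released implementation, and the $\boldsymbol{\beta}$-step uses the standard closed-form solution of the simplex-constrained quadratic subproblem instead of an explicit solver.

import numpy as np

def mkkm(kernels, k, n_iter=50, tol=1e-6):
    # Toy alternating optimization of Eq. (2), assuming K_beta = sum_i beta_i^2 K_i.
    p = len(kernels)
    beta = np.full(p, 1.0 / p)
    prev = np.inf
    for _ in range(n_iter):
        K = sum(b ** 2 * Ki for b, Ki in zip(beta, kernels))
        # H-step: relaxed k-means, i.e. the top-k eigenvectors of K_beta (so H H^T = I_k).
        _, vecs = np.linalg.eigh(K)
        H = vecs[:, -k:].T
        # beta-step: min sum_i beta_i^2 a_i with sum_i beta_i = 1, beta >= 0 (assumes a_i > 0),
        # whose closed form is beta_i proportional to 1 / a_i.
        a = np.array([np.trace(Ki) - np.trace(H @ Ki @ H.T) for Ki in kernels])
        beta = (1.0 / a) / np.sum(1.0 / a)
        obj = float(beta ** 2 @ a)
        if abs(prev - obj) < tol:
            break
        prev = obj
    return H, beta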

Plenty of work has emerged to improve MKKM. First, linear fusion-based methods[6] assume that each base kernel corresponds to a view of different dimensions and thus captures complementary aspects of the features, so a multi-view consensus kernel matrix can be obtained as a linear combination of the base kernel matrices. Secondly, joint training-based methods[7, 8] assume that clustering results can be obtained independently from each base kernel matrix and that the per-view results should be consistent, so the clustering information can be fused by statistical methods to further enhance the credibility of the results. Thirdly, consensus structure extraction methods[9, 10] assume that different kernel functions acting on the data preserve a consistent clustering structure across multiple types of data, and this consensus structure is obtained by decomposing the kernel matrices. There are also deep learning based methods for improving clustering performance and learning representations[11, 12, 13, 14, 15, 16, 17, 18]. To improve efficiency and to better represent the sample distribution of individual views, late-fusion-based multiple kernel clustering methods[19, 20] use a more compact structural representation (base partition matrices $\mathbf{H}_{i}\in\mathbb{R}^{k\times n}$ extracted by the $k$-means algorithm) rather than the base kernels themselves.

We propose a One-Step Late Fusion Multi-view Clustering with Compressed Subspace method to directly obtain discrete cluster labels by integrating cluster structure optimization and label learning into a unified framework. Our proposed algorithm has the following advantages:

  • Our algorithm is able to obtain clustering labels in one step, by negotiating label learning and cluster structure optimization through a unified framework.

  • The method is highly efficient with both $\mathcal{O}(n)$ time and space cost, which allows our algorithm to be applied directly to large-scale multi-view datasets.

  • We propose a six-step iterative optimization algorithm whose objective converges quickly. We conduct experiments to verify the effectiveness and efficiency of the algorithm.

2 Methodology

Most existing multiple kernel clustering methods assume that the optimal kernel lies in the linear span of the base kernels[6], and this assumption greatly limits the feasible domain of the optimal kernel. ONKC methods[21] reconstruct the kernel matrices in a nonlinear neighborhood space, thus enlarging the search space of the optimal kernel, but their computational overhead is $\mathcal{O}(n^{3})$, which restricts their application to large-scale datasets. LFMVC methods[22, 20] reduce the dimensionality of the kernel matrices by constructing the corresponding base partition matrices, thereby reducing the computational overhead. However, these methods usually use the average kernel as the reference for partition alignment and therefore require high-quality base partition matrices. These drawbacks limit their generalizability to clustering tasks over a wide range of datasets.

In addition, the above methods cannot directly produce cluster labels and require an extra spectral clustering or $k$-means step to obtain the final clusters. To tackle the NP-hard problem posed by discrete cluster labels, spectral rotation (SR)[23, 24, 25] and improved spectral rotation (ISR)[26] methods have been proposed to learn discrete labels and representations simultaneously. However, their $\mathcal{O}(n^{3})$ time cost and $\mathcal{O}(n^{2})$ space cost greatly inhibit scalability to large-scale datasets.

We adopt the kernel subspace clustering method[27] for self-reconstruction of the consensus kernel partition, and use trace alignment rather than the Frobenius norm to avoid a re-weighting procedure[28]. We additionally adopt a compressed subspace, defined by a unified matrix $\mathbf{P}$, to further increase computational efficiency. It is proved in [29] that maximizing the late-fusion alignment is equivalent to minimizing the MKKM objective in Eq. (2). By integrating late-fusion kernel partition alignment maximization and self-reconstruction through a shared reconstruction matrix $\mathbf{S}$, our method optimizes the cluster structure through the negotiation of multiple views. A cluster label assignment matrix $\mathbf{Y}$ is learned together with the centroids $\mathbf{C}$ to refine the aligned partition $\mathbf{H}$. The overall optimization objective is:

$\max_{\mathbf{P},\mathbf{S},\{\mathbf{W}_{i}\}_{i=1}^{p},\boldsymbol{\beta},\mathbf{C},\mathbf{Y}} \operatorname{tr}\left(\mathbf{H}(\mathbf{H}\mathbf{P}\mathbf{S})^{\top}+\mathbf{Y}^{\top}\mathbf{C}^{\top}\mathbf{H}\right)$ (3)
$\text{s.t. } \mathbf{P}^{\top}\mathbf{P}=\mathbf{I}_{m},\ \mathbf{S}\geq 0,\ \sum_{i=1}^{m}\mathbf{S}_{i,j}^{2}=1,\ \forall j\in\{1,2,\cdots,n\},$
$\boldsymbol{\beta}_{i}\geq 0,\ \forall i\in\{1,2,\cdots,p\},\ \mathbf{1}_{\mu}^{\top}\mathbf{Y}=\mathbf{1}_{n}^{\top},\ \mathbf{Y}\in\{0,1\}^{\mu\times n},$
$\mathbf{H}=\sum_{i=1}^{p}\boldsymbol{\beta}_{i}\mathbf{W}_{i}\mathbf{H}_{i},\ \mathbf{W}_{i}\mathbf{W}_{i}^{\top}=\mathbf{I}_{k},\ \mathbf{C}^{\top}\mathbf{C}=\mathbf{I}_{\mu},$

where $\mathbf{1}_{n}\in\mathbb{R}^{n}$ is an all-one column vector.
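To make the coupling between the two trace terms concrete, the short NumPy sketch below evaluates the objective of Eq. (3) for given variables; all names are illustrative, and the traces are computed as Frobenius inner products so that no $n\times n$ matrix is ever formed.

import numpy as np

def objective(beta, W, H_list, P, S, C, Y):
    # Value of Eq. (3), with H the fused partition sum_i beta_i W_i H_i (k x n).
    H = sum(b * Wi @ Hi for b, Wi, Hi in zip(beta, W, H_list))
    fusion = np.sum(H * (H @ P @ S))   # tr(H (H P S)^T): alignment with the compressed subspace
    labels = np.sum((C @ Y) * H)       # tr(Y^T C^T H) = <C Y, H>_F: agreement with the discrete labels
    return fusion + labels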

3 Optimization

We develop a six-step iterative optimization algorithm to maximize the clustering objective in Eq. (3).

Update $\{\mathbf{W}_{i}\}_{i=1}^{p}$ with $\mathbf{P}$, $\mathbf{S}$, $\boldsymbol{\beta}$, $\mathbf{C}$, $\mathbf{Y}$ fixed. For each $\delta\in\{1,2,\cdots,p\}$, we update $\mathbf{W}_{\delta}$ by maximizing the target listed below:

$\max_{\mathbf{W}_{\delta}} \operatorname{tr}\left(\mathbf{G}\mathbf{W}_{\delta}^{\top}\right) \quad \text{s.t. } \mathbf{W}_{\delta}^{\top}\mathbf{W}_{\delta}=\mathbf{I}_{k},$ (4)

where $\mathbf{G}=\beta_{\delta}\left(\sum_{j=1,j\neq\delta}^{p}\beta_{j}\mathbf{W}_{j}\mathbf{H}_{j}\right)\mathbf{S}^{\top}\mathbf{P}^{\top}\mathbf{H}_{\delta}^{\top}+\beta_{\delta}\mathbf{C}\mathbf{Y}\mathbf{H}_{\delta}^{\top}=\mathbf{U}_{g}\mathbf{D}_{g}\mathbf{V}_{g}^{\top}$, with $\mathbf{U}_{g}$ and $\mathbf{V}_{g}$ the left and right singular matrices, respectively. According to [30], a closed-form optimal solution is:

$\mathbf{W}_{\delta}^{*}=\mathbf{U}_{g}\mathbf{V}_{g}^{\top}.$ (5)

Following the updating formula in Eq. (5), the algorithm refreshes $\mathbf{W}_{i},\ i=1,2,\cdots,p$ in succession.
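As an illustration of Eqs. (4)-(5), the sketch below implements the SVD-based (orthogonal Procrustes) update in NumPy; the procrustes helper is generic and is reused in the sketches of the $\mathbf{P}$- and $\mathbf{C}$-steps further down. All names are illustrative assumptions rather than a released implementation.

import numpy as np

def procrustes(G):
    # Solves max_W tr(G W^T) s.t. W is orthogonal: W* = U V^T, where G = U D V^T.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def update_W(delta, beta, W, H_list, P, S, C, Y):
    # G of Eq. (4): beta_delta (sum_{j != delta} beta_j W_j H_j) S^T P^T H_delta^T + beta_delta C Y H_delta^T.
    rest = sum(beta[j] * W[j] @ H_list[j] for j in range(len(W)) if j != delta)
    G = beta[delta] * rest @ S.T @ P.T @ H_list[delta].T \
        + beta[delta] * C @ Y @ H_list[delta].T
    return procrustes(G)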

Update $\boldsymbol{\beta}$ with $\{\mathbf{W}_{i}\}_{i=1}^{p}$, $\mathbf{P}$, $\mathbf{S}$, $\mathbf{C}$, $\mathbf{Y}$ fixed. With the other variables settled, updating $\boldsymbol{\beta}$ with fixed $\{\mathbf{W}_{i}\}_{i=1}^{p}$, which concludes the final fusion step, can be written as:

$\max_{\boldsymbol{\beta}} \sum_{i=1}^{p}\sum_{j=1}^{p}\beta_{i}\beta_{j}\operatorname{tr}\left(\mathbf{P}\mathbf{S}\mathbf{H}_{i}^{\top}\mathbf{W}_{i}^{\top}\mathbf{W}_{j}\mathbf{H}_{j}\right)+\sum_{i=1}^{p}\beta_{i}\operatorname{tr}\left(\mathbf{C}^{\top}\mathbf{W}_{i}\mathbf{H}_{i}\mathbf{Y}^{\top}\right) \quad \text{s.t. } \boldsymbol{\beta}\geq 0,$ (6)

denoting the quadratic coefficient matrix $\mathbf{M}$ with $\mathbf{M}_{i,j}=-\operatorname{tr}\left(\mathbf{P}\mathbf{S}\mathbf{H}_{i}^{\top}\mathbf{W}_{i}^{\top}\mathbf{W}_{j}\mathbf{H}_{j}\right)$ and the linear coefficient vector $f$ with $f_{i}=-\operatorname{tr}\left(\mathbf{C}^{\top}\mathbf{W}_{i}\mathbf{H}_{i}\mathbf{Y}^{\top}\right)$, Eq. (6) can be rewritten as the minimization problem:

$\min_{\boldsymbol{\beta}} \boldsymbol{\beta}^{\top}\mathbf{M}\boldsymbol{\beta}+f^{\top}\boldsymbol{\beta}, \quad \text{s.t. } \boldsymbol{\beta}\geq 0.$ (7)

It is worth noticing that Eq. (7) is a quadratic optimization problem, and symmetrizing $\mathbf{M}$ does not change the value of the quadratic form. Alternatively, a closed-form solution could be obtained via the Cauchy-Schwarz inequality after diagonalizing the symmetrized $\mathbf{M}$; here we directly adopt the quadratic programming scheme of [31] for higher speed.
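A minimal sketch of this step is given below, with a generic bound-constrained SciPy solver standing in for the reflective Newton method of [31]; since $\mathbf{M}$ need not be positive semi-definite, such a solver only guarantees a stationary point. The names and the choice of L-BFGS-B are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def update_beta(M, f, beta0):
    # Solves Eq. (7): min_beta beta^T M beta + f^T beta, s.t. beta >= 0.
    Ms = 0.5 * (M + M.T)  # symmetrization leaves the quadratic form unchanged
    fun = lambda b: b @ Ms @ b + f @ b
    jac = lambda b: 2.0 * Ms @ b + f
    res = minimize(fun, beta0, jac=jac, method="L-BFGS-B",
                   bounds=[(0.0, None)] * len(beta0))
    return res.x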

Update $\mathbf{P}$ with $\boldsymbol{\beta}$, $\{\mathbf{W}_{i}\}_{i=1}^{p}$, $\mathbf{S}$, $\mathbf{C}$, $\mathbf{Y}$ fixed. The optimization problem in Eq. (3) with respect to $\mathbf{P}$ can be written as:

$\max_{\mathbf{P}} \operatorname{tr}\left(\mathbf{H}(\mathbf{H}\mathbf{P}\mathbf{S})^{\top}\right) \quad \text{s.t. } \mathbf{P}^{\top}\mathbf{P}=\mathbf{I}_{m},\ \mathbf{H}=\sum_{i=1}^{p}\beta_{i}\mathbf{W}_{i}\mathbf{H}_{i}.$ (8)

Likewise, the derivation for the compression matrix $\mathbf{P}$ is similar to Eq. (4). Denoting $\mathbf{A}=\mathbf{H}^{\top}\mathbf{H}\mathbf{S}^{\top}=\mathbf{U}_{a}\mathbf{D}_{a}\mathbf{V}_{a}^{\top}$, the optimal solution is:

$\mathbf{P}^{*}=\mathbf{U}_{a}\mathbf{V}_{a}^{\top}.$ (9)
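In code this is again an orthogonal Procrustes problem; a two-line sketch that reuses the procrustes helper from the $\mathbf{W}$-step sketch and forms $\mathbf{A}$ without the explicit $n\times n$ product (names illustrative):

A = H.T @ (H @ S.T)   # n x m, equals H^T H S^T
P = procrustes(A)     # optimal compression matrix of Eq. (9)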

Update $\mathbf{S}$ with $\boldsymbol{\beta}$, $\{\mathbf{W}_{i}\}_{i=1}^{p}$, $\mathbf{P}$, $\mathbf{C}$, $\mathbf{Y}$ fixed. With $\boldsymbol{\beta}$ and $\{\mathbf{W}_{i}\}_{i=1}^{p}$ fixed and the compression matrix $\mathbf{P}$ given, the late-fusion kernel partition is lodged in a unified compressed subspace shared among views. Updating $\mathbf{S}$ is equivalent to constructing the bipartite graph from the anchor space to the original feature space; in our method, however, the consensus subspace needs no further integration. The optimization problem with respect to $\mathbf{S}$ is:

$\max_{\mathbf{S}} \operatorname{tr}\left(\mathbf{H}^{\top}\mathbf{H}\mathbf{P}\mathbf{S}\right) \quad \text{s.t. } \sum_{i=1}^{m}\mathbf{S}_{i,j}^{2}=1,\ \forall j\in\{1,2,\cdots,n\}.$ (10)

Let $\mathbf{Q}=\mathbf{H}^{\top}\mathbf{H}\mathbf{P}=\left(\boldsymbol{q}_{1}\ \boldsymbol{q}_{2}\ \cdots\ \boldsymbol{q}_{n}\right)^{\top}$ and $\mathbf{S}=\left(\boldsymbol{s}_{1}\ \boldsymbol{s}_{2}\ \cdots\ \boldsymbol{s}_{n}\right)$, where the $\boldsymbol{q}_{i}$ and $\boldsymbol{s}_{i}$ are column vectors. Eq. (10) is then equivalent to:

$\max_{\mathbf{S}} \sum_{i=1}^{n}\boldsymbol{q}_{i}^{\top}\boldsymbol{s}_{i}, \quad \text{s.t. } \sum_{i=1}^{m}\mathbf{S}_{i,j}^{2}=1,$ (11)

accordingly, the optimal solution is:

$\mathbf{S}_{j,i}^{*}=\frac{\mathbf{Q}_{ij}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}.$ (12)
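A short sketch of this column-wise normalization, continuing the naming of the earlier sketches and again avoiding the explicit $n\times n$ product (it assumes no row of $\mathbf{Q}$ is exactly zero):

Q = H.T @ (H @ P)                                       # n x m, whose i-th row is q_i^T
S = (Q / np.linalg.norm(Q, axis=1, keepdims=True)).T    # m x n with unit-norm columns, Eq. (12)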

Update $\mathbf{C}$ with $\boldsymbol{\beta}$, $\{\mathbf{W}_{i}\}_{i=1}^{p}$, $\mathbf{P}$, $\mathbf{Y}$, $\mathbf{S}$ fixed. The optimization problem w.r.t. the centroids $\mathbf{C}$ is:

$\max_{\mathbf{C}} \operatorname{tr}\left(\mathbf{H}\mathbf{Y}^{\top}\mathbf{C}^{\top}\right), \quad \text{s.t. } \mathbf{C}^{\top}\mathbf{C}=\mathbf{I}_{\mu},$ (13)

likewise, denoting $\mathbf{H}\mathbf{Y}^{\top}=\mathbf{U}_{c}\mathbf{D}_{c}\mathbf{V}_{c}^{\top}$, the optimal solution is:

$\mathbf{C}^{*}=\mathbf{U}_{c}\mathbf{V}_{c}^{\top}.$ (14)
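Under the same illustrative naming, this step is a one-line call to the Procrustes helper from the $\mathbf{W}$-step sketch (assuming $\mu\leq k$ so that $\mathbf{C}^{\top}\mathbf{C}=\mathbf{I}_{\mu}$ is attainable):

C = procrustes(H @ Y.T)   # SVD-based solution of Eq. (14) on the k x mu matrix H Y^T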

Update $\mathbf{Y}$ with $\boldsymbol{\beta}$, $\{\mathbf{W}_{i}\}_{i=1}^{p}$, $\mathbf{P}$, $\mathbf{C}$, $\mathbf{S}$ fixed. The optimization problem w.r.t. the discrete labels $\mathbf{Y}$ is:

$\max_{\mathbf{Y}} \operatorname{tr}\left((\mathbf{C}^{\top}\mathbf{H})^{\top}\mathbf{Y}\right), \quad \text{s.t. } \mathbf{1}_{\mu}^{\top}\mathbf{Y}=\mathbf{1}_{n}^{\top},\ \mathbf{Y}\in\{0,1\}^{\mu\times n},$ (15)

denoting $\mathbf{B}=\mathbf{C}^{\top}\mathbf{H}$, the optimal solution is:

$\mathbf{Y}^{*}(i,j)=\begin{cases}1, & \text{if } i=\arg\max_{i'}\mathbf{B}(i',j),\\ 0, & \text{otherwise},\end{cases} \quad \forall j\in\{1,2,\cdots,n\}.$ (16)
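In code, this is a column-wise argmax over the score matrix $\mathbf{B}$; a sketch under the same illustrative naming:

B = C.T @ H                                            # mu x n score matrix
Y = np.zeros_like(B)
Y[np.argmax(B, axis=0), np.arange(B.shape[1])] = 1.0   # one-hot assignment per column, Eq. (16)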

The overall optimization procedure is summarized in Algorithm 1, and a minimal NumPy sketch of the outer loop is given after the pseudocode. By examining each step, the time and space costs are $\mathcal{O}(n)$, and the efficiency is validated in Sec. 4.

Algorithm 1 One-Step Late Fusion Multi-view Clustering with Compressed Subspace
Input: Multiple base kernels $\{\mathbf{K}_{i}\}_{i=1}^{p}$, number of clusters $\mu$, scale of the compressed subspace $m$, dimension of the partition matrices $k$.
Output: The label matrix $\mathbf{Y}$.
1:  Initialization: initialize the compression matrix $\mathbf{P}\in\mathbb{R}^{n\times m}$ by orthogonalizing a random matrix; initialize $\mathbf{S}$ by imposing the unit $\ell_{2}$-norm constraint on the columns of a random matrix; $\beta_{i}=1/p,\ \forall i$; $\mathbf{W}_{i}=\mathbf{I}_{k},\ \forall i$; $t=1$.
2:  repeat
3:     Calculate $\{\mathbf{W}_{i}\}_{i=1}^{p}$ by Eq. (5);
4:     Calculate $\boldsymbol{\beta}$ by solving Eq. (7);
5:     Calculate $\mathbf{P}$ by Eq. (9);
6:     Calculate $\mathbf{S}$ by Eq. (12);
7:     Calculate $\mathbf{C}$ by Eq. (14);
8:     Calculate $\mathbf{Y}$ by Eq. (16);
9:     $t=t+1$.
10:  until $(obj^{(t)}-obj^{(t-1)})^{2}<10^{-3}$.
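The sketch below stitches the per-step sketches from Sec. 3 into the outer loop of Algorithm 1. It reuses the procrustes, update_W, and update_beta helpers defined earlier, takes the base partitions $\{\mathbf{H}_{i}\}_{i=1}^{p}$ (already extracted from the base kernels) as input, and all names, initializations, and defaults are illustrative assumptions rather than the reference implementation.

import numpy as np

def os_lfmvc_cs(H_list, mu, m, k, max_iter=100, tol=1e-3, seed=0):
    # Sketch of Algorithm 1; H_list holds the p base partition matrices, each k x n.
    p, n = len(H_list), H_list[0].shape[1]
    rng = np.random.default_rng(seed)
    P, _ = np.linalg.qr(rng.standard_normal((n, m)))          # step 1: orthogonal init of P
    S = rng.random((m, n)); S /= np.linalg.norm(S, axis=0)    # unit l2-norm columns
    beta = np.full(p, 1.0 / p)
    W = [np.eye(k) for _ in range(p)]
    C = np.eye(k, mu)                                         # orthonormal columns (assumes mu <= k)
    Y = np.zeros((mu, n)); Y[rng.integers(mu, size=n), np.arange(n)] = 1.0
    prev = None
    for _ in range(max_iter):
        for d in range(p):                                    # W_i updates, Eq. (5)
            W[d] = update_W(d, beta, W, H_list, P, S, C, Y)
        A = [Wi @ Hi for Wi, Hi in zip(W, H_list)]            # aligned partitions W_i H_i
        M = -np.array([[np.sum(Ai * (Aj @ P @ S)) for Aj in A] for Ai in A])
        f = -np.array([np.sum((C.T @ Ai) * Y) for Ai in A])
        beta = update_beta(M, f, beta)                        # Eq. (7)
        H = sum(b * Ai for b, Ai in zip(beta, A))             # fused partition
        P = procrustes(H.T @ (H @ S.T))                       # Eq. (9)
        Q = H.T @ (H @ P)                                     # Eq. (12)
        S = (Q / np.linalg.norm(Q, axis=1, keepdims=True)).T
        C = procrustes(H @ Y.T)                               # Eq. (14)
        B = C.T @ H                                           # Eq. (16)
        Y = np.zeros_like(B); Y[np.argmax(B, axis=0), np.arange(n)] = 1.0
        obj = np.sum(H * (H @ P @ S)) + np.sum((C @ Y) * H)   # value of Eq. (3)
        if prev is not None and (obj - prev) ** 2 < tol:      # stopping rule of Algorithm 1
            break
        prev = obj
    return np.argmax(Y, axis=0)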

4 Experiment

In this section, we conduct a series of experiments to evaluate the effectiveness and efficiency of our proposed method.

Table 2: Comparison of clustering performance on 5 benchmark datasets. The best performances are in bold-face.
Datasets Avg-KKM SB-KKM MKKM RMSC FMKKM FMR LSGMC Proposed
ACC
Citeseer 20.8 ± 0.0 46.3 ± 0.2 20.1 ± 0.0 19.9 ± 0.3 30.6 ± 0.7 23.9 ± 0.0 22.0 ± 0.4 56.6 ± 0.0
Cora 30.7 ± 0.8 45.2 ± 0.1 25.3 ± 0.4 20.2 ± 0.1 38.4 ± 0.1 40.7 ± 0.5 20.9 ± 0.2 60.8 ± 0.0
ProteinFold 29.0 ± 1.5 33.8 ± 1.3 27.0 ± 1.1 31.2 ± 1.0 32.4 ± 1.8 34.5 ± 1.4 32.8 ± 1.2 35.3 ± 0.0
NUSWIDE 12.5 ± 0.4 12.2 ± 0.3 12.7 ± 0.2 - 14.0 ± 0.3 - - 14.7 ± 0.0
Reuters 45.5 ± 1.5 47.2 ± 0.0 45.4 ± 1.5 - 45.5 ± 1.6 - - 53.7 ± 0.0
NMI
Citeseer 2.3 ± 0.0 23.2 ± 0.5 1.9 ± 0.0 0.4 ± 0.1 10.1 ± 0.4 2.3 ± 0.0 1.7 ± 0.1 28.7 ± 0.0
Cora 15.7 ± 1.4 25.6 ± 0.1 9.5 ± 0.2 1.7 ± 0.1 21.8 ± 0.1 20.0 ± 0.2 0.7 ± 0.0 37.8 ± 0.0
ProteinFold 40.3 ± 1.3 41.1 ± 1.1 38.0 ± 0.6 43.2 ± 0.8 41.5 ± 1.0 42.0 ± 1.1 43.9 ± 0.5 44.1 ± 0.0
NUSWIDE 11.1 ± 0.1 11.0 ± 0.1 11.3 ± 0.2 - 12.6 ± 0.2 - - 13.2 ± 0.0
Reuters 27.4 ± 0.4 25.5 ± 0.0 27.3 ± 0.4 - 27.6 ± 0.5 - - 31.8 ± 0.0
Purity
Citeseer 24.9 ± 0.0 48.8 ± 0.4 24.2 ± 0.0 22.1 ± 0.3 32.9 ± 0.7 25.8 ± 0.0 24.9 ± 0.2 58.9 ± 0.0
Cora 41.5 ± 1.3 52.5 ± 0.1 36.1 ± 1.0 31.5 ± 0.0 46.9 ± 0.1 42.9 ± 0.5 30.2 ± 0.0 63.3 ± 0.0
ProteinFold 37.4 ± 1.7 39.4 ± 1.2 33.7 ± 1.1 38.5 ± 0.9 38.6 ± 1.5 40.6 ± 1.4 40.7 ± 0.6 43.4 ± 0.0
NUSWIDE 23.3 ± 0.3 23.7 ± 0.2 24.2 ± 0.4 - 25.7 ± 0.4 - - 25.6 ± 0.0
Reuters 53.0 ± 0.4 53.9 ± 0.0 52.9 ± 0.5 - 53.1 ± 0.4 - - 61.9 ± 0.0

4.1 Evaluation Preliminaries

Datasets We adopt 5 publicly available multi-view benchmark datasets, including Citeseer (http://linqs-data.soe.ucsc.edu/public/lbc/), Cora (http://mlg.ucd.ie/aggregation/), ProteinFold (http://mkl.ucsd.edu/dataset/protein-fold-prediction), NUS-WIDE (https://lms.comp.nus.edu.sg/wp-content/uploads/2019/research/nuswide/NUSWIDE.html), and Reuters (https://kdd.ics.uci.edu/databases/reuters21578/). Among them, NUS-WIDE and Reuters are large-scale datasets with over 10 thousand samples each.

Compared Algorithms Seven algorithms, covering multiple kernel clustering and multi-view subspace clustering methods, are compared with ours over the above 5 benchmark datasets. Specifically, we use SB-KKM and Avg-KKM as kernel clustering baselines, which report the best single-kernel performance and the average-kernel performance of kernel $k$-means, respectively. We select MKKM[32] and FMKKM[33] as representatives of classical kernel methods ranging from fuzzy $k$-means to late-fusion-based methods, and RMSC[34] as a representative spectral clustering method. In the subspace clustering area, we select FMR[35] and LSGMC[36] for performance comparison.

4.2 Performance Analysis

We use the initialization specified in Algorithm 1. Table 2 reports the ACC, NMI, and purity of all methods. To increase the confidence of the results, we run each clustering algorithm 20 times, and within each $k$-means call the number of replicates is set to 10. Both the best performance and the standard deviation, which arises from the $k$-means step, are reported. Since our algorithm is a one-step method without downstream $k$-means or spectral clustering, its variance is 0, guaranteeing stability. The symbol '-' indicates an 'Out-of-Memory' failure. All methods are run on a PC with an Intel(R) Core(TM) i7-12700H CPU at 2.30 GHz and 64 GB RAM.

Fig. 1: Experimental results. (a) Convergence. (b) Parameter sensitivity.

Fig. 2: Time comparison.

Convergence In our six-step iterative optimization process, each updating formula monotonically increases the objective value while keeping the other five decision variables fixed, and the objective is upper-bounded. We further verify the convergence in Fig. 1(a).

Parameter sensitivity Our algorithm has 2 hyper-parameters: the sampling scale $m$ and the partition dimension $k$. An empirical guideline is to set them to multiples of the cluster number $\mu$; our experiments use the $3\times 3$ parameter grid $\{\mu, 2\mu, 4\mu\}$ for each. From Fig. 1(b), we observe that our method maintains stable performance under variation of the hyper-parameters.

Time cost The running times of the compared algorithms on the benchmark datasets are shown in Fig. 2. On the large-scale datasets (NUSWIDE and Reuters), some algorithms (FMR, RMSC, LSGMC) encounter the 'Out-of-Memory' problem, so their time bars are omitted from the comparison graph.

5 Conclusion

In this article, we propose an effective and efficient multi-view clustering method, OS-LFMVC-CS, which simultaneously maximizes the alignment between different views, learns the shared subspace, and directly optimizes the cluster assignments. This mechanism greatly enhances the negotiation among views and improves the consistency in the shared subspace. In this way, we obtain the clustering results in one step and reduce the time and space expenditure to linear cost. We derive a novel optimization framework using a six-step iterative scheme with verified convergence, and extensive experiments support the effectiveness of the method. In the future, we will explore more efficient clustering algorithms and apply the one-step clustering scheme to a wider range of scenarios.

References

  • [1] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis, “Kernel k-means: spectral clustering and normalized cuts,” in KDD, 2004.
  • [2] Radha Chitta, Rong Jin, Timothy C Havens, and Anil K Jain, “Approximate kernel k-means: Solution to large scale kernel clustering,” in ACM SIGKDD, 2011.
  • [3] Liang Du, Peng Zhou, Lei Shi, Hanmo Wang, Mingyu Fan, Wenjian Wang, and Yi-Dong Shen, “Robust multiple kernel k-means using l21-norm,” in IJCAI, 2015.
  • [4] Pei Zhang, Xinwang Liu, Jian Xiong, Sihang Zhou, Wentao Zhao, En Zhu, and Zhiping Cai, “Consensus one-step multi-view subspace clustering,” IEEE TKDE, 2020.
  • [5] Xinwang Liu, “Simplemkkm: Simple multiple kernel k-means,” IEEE TPAMI, 2022.
  • [6] Xinwang Liu, Yong Dou, Jianping Yin, Lei Wang, and En Zhu, “Multiple kernel k-means clustering with matrix-induced regularization.,” in AAAI, 2016.
  • [7] Zhenwen Ren, Quansen Sun, and Dong Wei, “Multiple kernel clustering with kernel k-means coupled graph tensor learning,” in AAAI, 2021.
  • [8] Yihang Lu, Haonan Xin, Rong Wang, Feiping Nie, and Xuelong Li, “Scalable multiple kernel k-means clustering,” in CIKM, 2022.
  • [9] Jing Liu, Fuyuan Cao, Xiao-Zhi Gao, Liqin Yu, and Jiye Liang, “A cluster-weighted kernel k-means method for multi-view clustering,” in AAAI, 2020.
  • [10] Jitao Lu, Yihang Lu, Rong Wang, Feiping Nie, and Xuelong Li, “Multiple kernel k-means clustering with simultaneous spectral rotation,” in ICASSP, 2022.
  • [11] Yue Liu, Ke Liang, Jun Xia, Sihang Zhou, Xihong Yang, Xinwang Liu, and Stan Z. Li, “Dink-net: Neural clustering on large graphs,” in Proc. of ICML, 2023.
  • [12] Yiqi Wang, Chaozhuo Li, Wei Jin, Rui Li, Jianan Zhao, Jiliang Tang, and Xing Xie, “Test-time training for graph neural networks,” arXiv preprint arXiv:2210.08813, 2022.
  • [13] Yiqi Wang, Chaozhuo Li, Mingzheng Li, Wei Jin, Yuming Liu, Hao Sun, Xing Xie, and Jiliang Tang, “Localized graph collaborative filtering,” in SDM, 2022.
  • [14] Wenxuan Tu, Sihang Zhou, Xinwang Liu, Xifeng Guo, Zhiping Cai, En Zhu, and Jieren Cheng, “Deep fusion clustering network,” in AAAI, 2021.
  • [15] Wenxuan Tu, Qing Liao, Sihang Zhou, Xin Peng, Chuan Ma, Zhe Liu, Xinwang Liu, and Zhiping Cai, “Rare: Robust masked graph autoencoder,” IEEE TKDE, 2023.
  • [16] Meng Liu, Yue Liu, Ke Liang, Siwei Wang, Sihang Zhou, and Xinwang Liu, “Deep temporal graph clustering,” arXiv preprint arXiv:2305.10738, 2023.
  • [17] Renxiang Guan, Zihao Li, Teng Li, Xianju Li, Jinzhong Yang, and Weitao Chen, “Classification of heterogeneous mining areas based on ResCapsNet and Gaofen-5 imagery,” Remote Sensing, 2022.
  • [18] Renxiang Guan, Zihao Li, Xianju Li, Chang Tang, and Ruyi Feng, “Contrastive multi-view subspace clustering of hyperspectral images based on graph convolutional networks,” arXiv preprint arXiv:2312.06068, 2023.
  • [19] Siwei Wang, Xinwang Liu, En Zhu, Chang Tang, Jiyuan Liu, Jingtao Hu, Jingyuan Xia, and Jianping Yin, “Multi-view clustering via late fusion alignment maximization,” in IJCAI, 2019.
  • [20] Siwei Wang, Xinwang Liu, Li Liu, Sihang Zhou, and En Zhu, “Late fusion multiple kernel clustering with proxy graph refinement,” TNNLS, 2021.
  • [21] Jiyuan Liu, Xinwang Liu, Jian Xiong, Qing Liao, Sihang Zhou, Siwei Wang, and Yuexiang Yang, “Optimal neighborhood multiple kernel clustering with adaptive local kernels,” TKDE, 2022.
  • [22] Siwei Wang, En Zhu, Jingtao Hu, Miaomiao Li, Kaikai Zhao, Ning Hu, and Xinwang Liu, “Efficient multiple kernel k-means clustering with late fusion,” IEEE Access, 2019.
  • [23] Jin Huang, Feiping Nie, and Heng Huang, “Spectral rotation versus k-means in spectral clustering.,” in AAAI, 2013.
  • [24] Yanwei Pang, Jin Xie, Feiping Nie, and Xuelong Li, “Spectral clustering by joint spectral embedding and spectral rotation.,” IEEE Trans. Cybern., 2020.
  • [25] Yihang Lu, Jitao Lu, Rong Wang, and Feiping Nie, “Discrete multi-kernel k-means with diverse and optimal kernel learning,” in ICASSP, 2022.
  • [26] Xiaojun Chen, Feiping Nie, Joshua Zhexue Huang, and Min Yang, “Scalable normalized cut with improved spectral rotation.,” in IJCAI, 2017.
  • [27] Jihun Ham and Daniel D. Lee, “Extended grassmann kernels for subspace-based learning,” in NeurIPS 21, 2008.
  • [28] Feiping Nie, Jianjun Yuan, and Heng Huang, “Optimal mean robust principal component analysis,” in ICML, 2014.
  • [29] Xinwang Liu, Xinzhong Zhu, Miaomiao Li, Lei Wang, Chang Tang, Jianping Yin, Dinggang Shen, Huaimin Wang, and Wen Gao, “Late fusion incomplete multi-view clustering.,” TPAMI, 2019.
  • [30] Qiyuan Ou, Siwei Wang, Sihang Zhou, Miaomiao Li, Xifeng Guo, and En Zhu, “Anchor-based multiview subspace clustering with diversity regularization,” IEEE MultiMedia, 2020.
  • [31] Thomas F. Coleman and Yuying Li, “A reflective newton method for minimizing a quadratic function subject to bounds on some of the variables,” SIAM J. Optim., 1996.
  • [32] Hsin-Chien Huang, Yung-Yu Chuang, and Chu-Song Chen, “Multiple kernel fuzzy clustering,” IEEE Trans. Fuzzy Syst., 2012.
  • [33] Yi Zhang, Xinwang Liu, Jiyuan Liu, Sisi Dai, Changwang Zhang, Kai Xu, and En Zhu, “Fusion multiple kernel k-means,” in AAAI, 2022.
  • [34] Rongkai Xia, Yan Pan, Lei Du, and Jian Yin, “Robust multi-view spectral clustering via low-rank and sparse decomposition,” in AAAI, 2014.
  • [35] Ruihuang Li, Changqing Zhang, Qinghua Hu, Pengfei Zhu, and Zheng Wang, “Flexible multi-view representation learning for subspace clustering,” in IJCAI, 2019.
  • [36] Wei Lan, Tianchuan Yang, Qingfeng Chen, Shichao Zhang, Yi Dong, Huiyu Zhou, and Yi Pan, “Multiview subspace clustering via low-rank symmetric affinity graph,” TNNLS, 2023.