
Deep Attention-guided Graph Clustering
with Dual Self-supervision

Zhihao Peng, Hui Liu, Yuheng Jia, and Junhui Hou. This work was supported in part by the Hong Kong UGC under grant UGC/FDS11/E02/22 and RGC under grants 11219019 and 11202320, in part by the National Natural Science Foundation of China under Grant 62106044, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20210221, and in part by the ZhiShan Youth Scholar Program from Southeast University under Grant 2242022R40015. Corresponding authors: Hui Liu and Yuheng Jia. Z. Peng and J. Hou are with the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong 999077 (e-mail: zhihapeng3-c@my.cityu.edu.hk; jh.hou@cityu.edu.hk). H. Liu is with the School of Computing & Information Sciences, Caritas Institute of Higher Education, Hong Kong (e-mail: hliu99-c@my.cityu.edu.hk). Y. Jia is with the School of Computer Science and Engineering, Southeast University, Nanjing 210096, China, and also with the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China (e-mail: yhjia@seu.edu.cn).
Abstract

Existing deep embedding clustering methods fail to sufficiently utilize the available off-the-shelf information from feature embeddings and cluster assignments, limiting their performance. To this end, we propose a novel method, namely deep attention-guided graph clustering with dual self-supervision (DAGC). Specifically, DAGC first utilizes a heterogeneity-wise fusion module to adaptively integrate the features of the auto-encoder and the graph convolutional network in each layer and then uses a scale-wise fusion module to dynamically concatenate the multi-scale features in different layers. Such modules are capable of learning an informative feature embedding via an attention-based mechanism. In addition, we design a distribution-wise fusion module that leverages cluster assignments to acquire clustering results directly. To better explore the off-the-shelf information from the cluster assignments, we develop a dual self-supervision solution consisting of a soft self-supervision strategy with a Kullback-Leibler divergence loss and a hard self-supervision strategy with a pseudo supervision loss. Extensive experiments on nine benchmark datasets validate that our method consistently outperforms state-of-the-art methods. In particular, our method improves the ARI by more than 10.29% over the best baseline. The code will be publicly available at https://github.com/ZhihaoPENG-CityU/DAGC.

Index Terms:
Unsupervised learning, deep embedding clustering, feature fusion, self-supervision.

I Introduction

Clustering is one of the fundamental tasks in data analysis, which aims to categorize samples into multiple groups according to their intrinsic similarities, and has been successfully applied to many real-world applications such as image processing [1, 2, 3, 4, 5], face recognition [6, 7, 8], and object detection [9, 10, 11]. Recently, with the boom of deep learning, numerous researchers have paid attention to deep embedding clustering analysis, which can effectively learn a clustering-friendly representation by extracting intrinsic patterns from the latent embedding space. For example, Hinton and Salakhutdinov [12] developed a deep auto-encoder (DAE) framework that first conducts embedding learning and then performs K-means [13] to obtain clustering results. Xie et al. [14] designed a deep embedding clustering method (DEC) to perform embedding learning and cluster assignment jointly. Guo et al. [15] improved DEC by introducing a reconstruction loss to preserve the data structure. Although these DAE-based approaches obtain impressive improvement, they neglect the underlying topological structure among data, which has demonstrated its importance in various works [16, 17, 18].

Recently, a series of works have been proposed to use graph convolutional networks (GCNs) [19] to exploit the topological structure information. For instance, Kipf and Welling [20] incorporated GCN into DAE and variational DAE, and proposed the graph auto-encoder (GAE) and the variational graph auto-encoder (VGAE), respectively. Pan et al. [21] designed an adversarially regularized graph auto-encoder network (ARGA) to promote GAE. Wang et al. [22] incorporated graph attention networks [23] into GAE for attributed graph clustering. Bo et al. [24] fused GCN into DEC to consider the node content and topological structure information at the same time.

However, these works still suffer from the following drawbacks. First, they equate the importance of the features extracted from DAE and GCN; e.g., in [24], the DAE and GCN features of a given layer are simply averaged. Such a simple fusion strategy is not a good choice, since those features contain different characteristic information. Second, they neglect the multi-scale information embedded in different layers, which may lead to inferior clustering results. Third, they output two probability distributions, either of which could yield the final clustering results; however, complex real-world datasets are usually agnostic and vastly different, so it is difficult to decide which distribution should be used to obtain the final clustering results. To the best of our knowledge, this is a decision-making dilemma for such deep graph clustering methods. Last but not least, the previous approaches fail to adequately exploit the available information from the high-confidence clustering assignments.

Figure 1: The overall flowchart of the proposed method, namely deep attention-guided graph clustering with dual self-supervision (DAGC). It consists of a DAE module, a heterogeneity-wise fusion (HWF) module, a scale-wise fusion (SWF) module, a distribution-wise fusion (DWF) module, a soft self-supervision (SSS) strategy, and a hard self-supervision (HSS) strategy. Specifically, HWF and SWF conduct the weighted fusion in the sum and concatenation manner, respectively, where both modules involve a multilayer perceptron, a normalization operation, and a GCN module. DWF uses a softmax function to infer a probability distribution. To achieve end-to-end self-supervision, SSS drives the soft assignments to achieve distribution alignment between the $\mathbf{Q}$ and $\mathbf{Z}$ distributions, and HSS transfers the cluster assignment to a hard one-hot encoding. The detailed architectures of the HWF and SWF modules, the DWF module, and the dual self-supervision solution are given in Figures 2, 3, and 4, respectively.

To address the above-mentioned drawbacks, we propose a novel deep embedding clustering method, focusing on exploiting the available off-the-shelf information from feature embeddings and cluster assignments. As shown in Figure 1, the proposed method consists of a heterogeneity-wise fusion (HWF) module (here, 'heterogeneity' refers to the difference in feature structure, e.g., between the DAE-based and the GCN-based feature structures), a scale-wise fusion (SWF) module, a distribution-wise fusion (DWF) module, a soft self-supervision (SSS) strategy, and a hard self-supervision (HSS) strategy. A preliminary version of this work was published in ACM Multimedia 2021 [25]; it can be regarded as a special case of the current version that focuses on embedding enhancement via the HWF and SWF modules. However, the conference paper suffers from the decision-making dilemma concerning the two probability distributions learned from DAE and GCN, i.e., which one should be selected as the final clustering assignment. In summary, the main contributions of this journal paper are as follows:

  • To handle the decision-making dilemma, we propose a learning-aware fusion module to adaptively fuse the learned data probability distributions to predict the clustering results.

  • In addition, for the two learned distributions, we improve the soft self-supervision strategy to better preserve the consistency alignment between them.

  • Moreover, for the aforementioned fused distribution, we develop a hard self-supervision strategy with a pseudo supervision loss to employ the high-confidence clustering assignments to improve clustering performance.

  • Extensive experiments on nine benchmark datasets validate that our method consistently outperforms the conference version [25]. For instance, on DBLP, our method improves the ARI by more than 14.80%. In addition, the ablation studies and visualizations quantitatively and qualitatively validate the effectiveness of this method.

We organize the rest of this paper as follows. Section II briefly reviews the related works. Section III introduces the proposed network architecture, followed by the experimental results and analyses in Section IV. Finally, we conclude this paper in Section V.

Notation: Throughout the paper, scalars are denoted by italic lowercase letters, vectors by bold lowercase letters, matrices by bold uppercase letters, and operators by calligraphic letters. Let $\mathbf{X}$ be the input data, $\mathcal{V}$ the node set, $\mathcal{E}$ the edge set, and $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathbf{X})$ the undirected graph. $\mathbf{A}\in\mathbb{R}^{n\times n}$ denotes the adjacency matrix, $\mathbf{D}\in\mathbb{R}^{n\times n}$ the degree matrix, and $\mathbf{I}\in\mathbb{R}^{n\times n}$ the identity matrix. We summarize the main notations in Table I.

TABLE I: Main notations and descriptions.
Notations | Descriptions
$\mathbf{X}\in\mathbb{R}^{n\times d}$ | The input matrix
$\hat{\mathbf{X}}\in\mathbb{R}^{n\times d}$ | The reconstructed matrix
$\mathbf{A}\in\mathbb{R}^{n\times n}$ | The adjacency matrix
$\mathbf{D}\in\mathbb{R}^{n\times n}$ | The degree matrix
$\mathbf{Z}_{i}\in\mathbb{R}^{n\times d_{i}}$ | The GCN feature from the $i$-th layer
$\mathbf{H}_{i}\in\mathbb{R}^{n\times d_{i}}$ | The encoder feature from the $i$-th layer
$\mathbf{M}_{i}\in\mathbb{R}^{n\times 2}$ | The HWF weight matrix
$\mathbf{Z}_{i}^{\prime}\in\mathbb{R}^{n\times d_{i}}$ | The HWF combined feature
$\mathbf{U}\in\mathbb{R}^{n\times(l+1)}$ | The SWF weight matrix
$\mathbf{H}\in\mathbb{R}^{n\times d_{l}}$ | The DAE extracted feature
$\mathbf{Q}\in\mathbb{R}^{n\times k}$ | The distribution obtained from DAE
$\mathbf{Z}\in\mathbb{R}^{n\times k}$ | The distribution obtained from SWF
$\mathbf{P}\in\mathbb{R}^{n\times k}$ | The auxiliary distribution
$\mathbf{V}\in\mathbb{R}^{n\times 2}$ | The DWF weight matrix
$\mathbf{F}\in\mathbb{R}^{n\times k}$ | The DWF combined feature
$n$ | The number of samples
$d$ | The dimension of $\mathbf{X}$
$d_{i}$ | The dimension of the $i$-th latent feature
$l$ | The number of network layers
$k$ | The number of clusters
$\hat{k}$ | The number of neighbors for the KNN graph
$r$ | The threshold value for pseudo supervision
$\cdot\|\cdot$ | The concatenation operation
Figure 2: Illustration of the architectures of the HWF module (left) and the SWF module (right). The HWF module fuses the GCN feature $\mathbf{Z}_{i}$ and the DAE feature $\mathbf{H}_{i}$ to obtain $\mathbf{Z}_{i+1}$ in a weighted-sum form, while the SWF module combines the multi-scale weighted features in a feature-concatenation manner. More specifically, we first learn the weights through the attention-based mechanism (the left dashed box in the triple-solid-line box) and then integrate the corresponding features through the weighted fusion (the right dashed box in the triple-solid-line box). Here, $\Downarrow$ represents the input and output actions.

II Related Work

DAE is a typical deep neural network that allows computational models composed of multiple processing layers to learn data representations with multiple levels of abstraction. Recently, benefiting from the powerful representation ability of DAE, deep embedding clustering has achieved remarkable development [26, 27, 28, 29, 30, 31, 32]. For example, Hinton and Salakhutdinov [12] used DAE to extract the feature representation of the input data, on which K-means [13] is performed to obtain the clustering results. DEC [14] jointly conducted embedding learning and cluster assignment in an iterative optimization manner. The improved DEC (IDEC) [15] enhanced the clustering performance by adding a reconstruction loss function to DEC. A series of works [2, 3, 33] introduced multi-view information into the DAE framework to further improve embedding learning. However, these DAE-based methods neglect the underlying topological structure among data, which has demonstrated its effectiveness for data clustering [16, 17, 34, 35, 18], thus limiting their performance.

Graph embedding is a new paradigm for clustering to capture the topological structure among samples [36, 37, 38, 17, 39, 40, 41], and many recent approaches [42, 43, 44, 45, 46] have explored GCN to achieve graph embedding. For instance, Kipf and Welling [20] provided the GAE and VGAE methods by incorporating GCN into the DAE and variational DAE frameworks, respectively. ARGA [21] extended GAE by introducing a designed adversarial regularization. Wang et al. [22] merged GAE and the graph attention network [23] to build a deep attentional embedding framework for attributed graph clustering (DAEGC). The structural deep clustering network (SDCN) [24] fused the node content and topological structure information to achieve deep embedding clustering. Peng et al. [25] designed the attention-driven graph clustering network (AGCN) to merge numerous features to enhance embedding learning via an adaptive mechanism. The deep fusion clustering network (DFCN) [47] exploited a designed fusion strategy to combine the DAE and GAE frameworks to merge the node attribute and topological structure information. He et al. [48] developed an adaptive graph convolutional clustering (AGCC) model to update the graph structure and the data representation layer by layer.

Optimizing a deep clustering network is a fundamental yet challenging task because there are no ground-truth labels as supervision. Previous works [14, 15, 22, 24, 25] minimize the Kullback-Leibler (KL) divergence to tackle this challenge, and its effectiveness has been proven. Specifically, it first uses the Student's t-distribution [49, 50] as a kernel function to measure the similarity between the extracted feature $\mathbf{h}_{i}$ and its corresponding centroid vector $\boldsymbol{\mu}_{j}$, in which the measured similarity can be regarded as a probability distribution $\mathbf{Q}$ with its $(i,j)$-th element being

$$q_{i,j}=\frac{(1+\|\mathbf{h}_{i}-\boldsymbol{\mu}_{j}\|^{2}/\alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j^{\prime}}(1+\|\mathbf{h}_{i}-\boldsymbol{\mu}_{j^{\prime}}\|^{2}/\alpha)^{-\frac{\alpha+1}{2}}},\qquad(1)$$

where $\alpha$ is set to $1$. Then, it minimizes the KL divergence between $\mathbf{Q}$ and a distribution $\mathbf{B}$ whose $(i,j)$-th element is $b_{i,j}=\frac{q_{i,j}^{2}/\sum_{i}q_{i,j}}{\sum_{j^{\prime}}q_{i,j^{\prime}}^{2}/\sum_{i}q_{i,j^{\prime}}}$, i.e.,

$$\min KL(\mathbf{B},\mathbf{Q})=\sum_{i}\sum_{j}b_{i,j}\log\frac{b_{i,j}}{q_{i,j}},\qquad(2)$$

where $KL(\cdot,\cdot)$ is the Kullback-Leibler divergence function that measures the distance between two distributions. Such an auxiliary distribution amplifies high-confidence probability values, enabling the network to learn from high-confidence assignments. However, previous works fail to sufficiently utilize the available off-the-shelf information from high-confidence clustering assignments, inevitably leading to inferior clustering results.
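For concreteness, a minimal PyTorch sketch of this self-optimizing scheme, Eqs. (1)-(2), is given below; the function and tensor names are ours for illustration and not taken from any released implementation:

```python
import torch

def soft_assignment(h, mu, alpha=1.0):
    # Student's t-kernel similarity between embeddings h (n x d) and
    # centroids mu (k x d), Eq. (1); rows of q sum to one.
    dist2 = torch.cdist(h, mu) ** 2
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def auxiliary_distribution(q):
    # Distribution B of Eq. (2): square q and renormalize per cluster,
    # which sharpens the high-confidence assignments.
    weight = q ** 2 / q.sum(dim=0, keepdim=True)
    return weight / weight.sum(dim=1, keepdim=True)

def kl_divergence(b, q, eps=1e-8):
    # KL(B, Q) of Eq. (2), summed over all samples and clusters.
    return (b * (torch.log(b + eps) - torch.log(q + eps))).sum()
```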

III Proposed Method

Figure 1 illustrates the architecture of the proposed method; we detail its main components in what follows.

III-A Heterogeneity-Wise Fusion

We first exploit a DAE module with a series of encoders and decoders to extract the latent representation by adopting the reconstruction loss, i.e.,

$$\mathcal{L}_{R}=\left\|\mathbf{X}-\hat{\mathbf{X}}\right\|^{2}_{F},\qquad(3)$$

where $\mathbf{X}$ and $\hat{\mathbf{X}}$ denote the input matrix and the reconstructed matrix, respectively. Here, $\mathbf{H}_{0}=\mathbf{X}$, $\hat{\mathbf{H}}_{l}=\hat{\mathbf{X}}$, $\mathbf{H}_{i}=\phi(\mathbf{W}_{i}^{e}\mathbf{H}_{i-1}+\mathbf{b}_{i}^{e})$, and $\hat{\mathbf{H}}_{i}=\phi(\mathbf{W}_{i}^{d}\hat{\mathbf{H}}_{i-1}+\mathbf{b}_{i}^{d})$, where $\mathbf{H}_{i}$ and $\hat{\mathbf{H}}_{i}$ denote the encoder and decoder outputs of the $i$-th layer, respectively, $l$ denotes the number of encoder/decoder layers, $\mathbf{W}_{i}^{e}$, $\mathbf{b}_{i}^{e}$, $\mathbf{W}_{i}^{d}$, and $\mathbf{b}_{i}^{d}$ denote the network weights and biases of the $i$-th encoder and decoder layers, respectively, and $\phi(\cdot)$ denotes an activation function, such as Tanh or ReLU [51]. For convenience, we set $\mathbf{H}=\mathbf{H}_{l}$. In addition, we denote the GCN feature learned from the $i$-th layer as $\mathbf{Z}_{i}\in\mathbb{R}^{n\times d_{i}}$ with $d_{i}$ being the dimension of the $i$-th layer, where $\mathbf{Z}_{0}=\mathbf{X}$. Previous works (e.g., SDCN [24]) combine the heterogeneity-wise representations of the $i$-th layer ($\mathbf{Z}_{i}$ and $\mathbf{H}_{i}$) via a fixed fusion strategy (i.e., $\mathbf{Z}_{i}^{\prime}=0.5\mathbf{Z}_{i}+0.5\mathbf{H}_{i}$) to enhance representation learning. However, such a fusion strategy is simple yet unreasonable, since the heterogeneity-wise representations $\mathbf{Z}_{i}$ and $\mathbf{H}_{i}$ carry different characteristic information. To this end, we propose a learning-aware fusion module that dynamically weights $\mathbf{Z}_{i}$ and $\mathbf{H}_{i}$. Specifically, to learn the corresponding attention coefficients of $\mathbf{Z}_{i}$ and $\mathbf{H}_{i}$, we first concatenate them as $[\mathbf{Z}_{i}\|\mathbf{H}_{i}]\in\mathbb{R}^{n\times 2d_{i}}$ and then build a fully connected layer parametrized by a weight matrix $\mathbf{W}_{i}^{a}\in\mathbb{R}^{2d_{i}\times 2}$. Afterwards, we apply the LeakyReLU (LReLU) [52] to the product of $[\mathbf{Z}_{i}\|\mathbf{H}_{i}]$ and $\mathbf{W}_{i}^{a}$, and normalize the output of the LReLU unit via the softmax function and $\ell_{2}$ normalization (i.e., 'softmax-$\ell_{2}$' normalization). Formally, we formulate the prediction of the corresponding attention coefficients as

$$\mathbf{M}_{i}=[\mathbf{m}_{i,1}\|\mathbf{m}_{i,2}]=\Upsilon_{A}([\mathbf{Z}_{i}\|\mathbf{H}_{i}]\mathbf{W}_{i}^{a}),\qquad(4)$$

where $\mathbf{M}_{i}\in\mathbb{R}^{n\times 2}$ is the attention coefficient matrix whose entries are greater than $0$, $\mathbf{m}_{i,1}$ and $\mathbf{m}_{i,2}$ are the weight vectors measuring the importance of $\mathbf{Z}_{i}$ and $\mathbf{H}_{i}$, respectively, and $\Upsilon_{A}(\cdot)=\ell_{2}(softmax(LReLU(\cdot)))$. Thus, we can adaptively fuse the GCN feature $\mathbf{Z}_{i}$ and the DAE feature $\mathbf{H}_{i}$ of the $i$-th layer as

$$\mathbf{Z}_{i}^{\prime}=(\mathbf{m}_{i,1}\mathbf{1}_{i})\odot\mathbf{Z}_{i}+(\mathbf{m}_{i,2}\mathbf{1}_{i})\odot\mathbf{H}_{i},\qquad(5)$$

where $\mathbf{1}_{i}\in\mathbb{R}^{1\times d_{i}}$ denotes the vector of all ones, and $\odot$ denotes the Hadamard product of matrices. Then, we use the resulting matrix $\mathbf{Z}_{i}^{\prime}\in\mathbb{R}^{n\times d_{i}}$ as the input of the $(i+1)$-th GCN layer to learn the representation $\mathbf{Z}_{i+1}$, i.e.,

$$\mathbf{Z}_{i+1}=LReLU(\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}\mathbf{Z}_{i}^{\prime}\mathbf{W}_{i}),\qquad(6)$$

where $\mathbf{W}_{i}$ denotes the weight matrix of the $i$-th GCN layer, and $\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}$ normalizes the adjacency matrix via the renormalization trick, i.e., adding a self-loop to $\mathbf{A}$ and normalizing by the corresponding degree matrix $\mathbf{D}$.
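A minimal PyTorch sketch of one HWF layer, Eqs. (4)-(6), is given below; the module and argument names are ours, and the normalized adjacency $\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}$ (a_norm) is assumed to be precomputed as a sparse tensor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HWF(nn.Module):
    # Heterogeneity-wise fusion of one layer, Eqs. (4)-(6).
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.attn = nn.Linear(2 * dim_in, 2, bias=False)     # W_i^a in Eq. (4)
        self.w_gcn = nn.Linear(dim_in, dim_out, bias=False)  # W_i in Eq. (6)

    def forward(self, z, h, a_norm):
        # Eq. (4): attention coefficients via 'softmax-l2' normalization.
        m = F.leaky_relu(self.attn(torch.cat([z, h], dim=1)), negative_slope=0.2)
        m = F.normalize(F.softmax(m, dim=1), p=2, dim=1)
        # Eq. (5): adaptively weighted sum of the GCN and DAE features.
        z_prime = m[:, :1] * z + m[:, 1:] * h
        # Eq. (6): propagate the fused feature through the next GCN layer.
        return F.leaky_relu(torch.sparse.mm(a_norm, self.w_gcn(z_prime)))
```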

III-B Scale-Wise Fusion

As aforementioned, previous works neglect the off-the-shelf multi-scale information embedded in different layers, which is of great importance for embedding learning. To this end, we propose the SWF module to concatenate the multi-scale features from different layers via an attention-based mechanism. The right part of Figure 2 shows the overall architecture of SWF.

We aggregate the multi-scale features in a concatenation manner to dynamically combine features of various scales with different dimensions. Afterwards, we build a fully connected layer parametrized by a weight matrix $\mathbf{W}^{s}\in\mathbb{R}^{(\sum_{j=1}^{l}d_{j}+d_{l})\times(l+1)}$ to capture the relationship among the multi-scale features. Formally, we formulate the whole process as

$$\mathbf{U}=\Upsilon_{A}\left(\Xi_{j=1}^{l+1}\mathbf{Z}_{j}\mathbf{W}^{s}\right),\qquad(7)$$

where $\Xi_{j=1}^{l+1}\mathbf{Z}_{j}=[\mathbf{Z}_{1}\|\cdots\|\mathbf{Z}_{l}\|\mathbf{Z}_{l+1}]$ denotes the concatenation of multiple elements. We then conduct the feature fusion as

$$\mathbf{Z}^{\prime}=\Xi_{j=1}^{l+1}\left((\mathbf{u}_{j}\mathbf{1}_{j})\odot\mathbf{Z}_{j}\right),\qquad(8)$$

where $\mathbf{u}_{j}$ is the $j$-th element of $\mathbf{U}$, i.e., $\Xi_{j=1}^{l+1}\mathbf{u}_{j}=\mathbf{U}$. In addition, we use a Laplacian smoothing operator [53] and the softmax function to transform the fused feature $\mathbf{Z}^{\prime}$ into a reasonable probability distribution, i.e.,

$$\mathbf{Z}=softmax(\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}\mathbf{Z}^{\prime}\mathbf{W}),\qquad(9)$$

where $\mathbf{W}$ denotes the learnable parameters.
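Analogously, the following sketch outlines the SWF module of Eqs. (7)-(9) (names ours; a_norm is the precomputed sparse normalized adjacency as above, dims lists the feature widths of the $l+1$ scales, and k is the number of clusters):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SWF(nn.Module):
    # Scale-wise fusion of Eqs. (7)-(9).
    def __init__(self, dims, k):
        super().__init__()
        self.attn = nn.Linear(sum(dims), len(dims), bias=False)  # W^s in Eq. (7)
        self.w = nn.Linear(sum(dims), k, bias=False)             # W in Eq. (9)

    def forward(self, zs, a_norm):
        # Eq. (7): one attention weight per scale for every sample.
        u = F.leaky_relu(self.attn(torch.cat(zs, dim=1)))
        u = F.normalize(F.softmax(u, dim=1), p=2, dim=1)
        # Eq. (8): weight each scale, then re-concatenate.
        z_prime = torch.cat([u[:, j:j + 1] * zs[j] for j in range(len(zs))], dim=1)
        # Eq. (9): Laplacian smoothing + softmax gives the distribution Z.
        return F.softmax(torch.sparse.mm(a_norm, self.w(z_prime)), dim=1)
```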

Figure 3: The architecture of the DWF module. The DWF module dynamically combines the distributions $\mathbf{Z}$ and $\mathbf{Q}$ to learn the final probability distribution, from which we can directly obtain the predicted cluster labels.

III-C Distribution-Wise Fusion

Given the feature $\mathbf{H}$ obtained from DAE, we can exploit it to calculate the cluster center embedding $\boldsymbol{\mu}$ with K-means. Afterward, we measure the similarity between the extracted feature $\mathbf{h}_{i}$ and its corresponding centroid vector $\boldsymbol{\mu}_{j}$, in which the measured similarity can be regarded as a probability distribution $\mathbf{Q}$ with its $(i,j)$-th element being $q_{i,j}=\frac{(1+\|\mathbf{h}_{i}-\boldsymbol{\mu}_{j}\|^{2})^{-1}}{\sum_{j^{\prime}}(1+\|\mathbf{h}_{i}-\boldsymbol{\mu}_{j^{\prime}}\|^{2})^{-1}}$, following Eq. (1). Both $\mathbf{Z}$ and $\mathbf{Q}$ can generate the final clustering results; however, it is challenging to decide which one to use in different scenarios. To the best of our knowledge, this is an unsolved decision-making dilemma that commonly exists in previous deep graph clustering methods. To handle this challenge, we propose the DWF module to fuse the learned probability distributions in an attention-driven manner to predict the cluster labels. Figure 3 shows the overall architecture.

Specifically, we first learn the importance of $\mathbf{Z}$ and $\mathbf{Q}$ by an attention-based mechanism, i.e.,

$$\mathbf{V}=[\mathbf{v}_{1}\|\mathbf{v}_{2}]=\Upsilon_{A}\left([\mathbf{Z}\|\mathbf{Q}]\hat{\mathbf{W}}\right),\qquad(10)$$

where $\mathbf{V}\in\mathbb{R}^{n\times 2}$ is the attention coefficient matrix, and $\hat{\mathbf{W}}$ is a weight matrix learned via a fully connected layer. We then adaptively fuse $\mathbf{Z}$ and $\mathbf{Q}$ as

$$\mathbf{F}=(\mathbf{v}_{1}\mathbf{1})\odot\mathbf{Z}+(\mathbf{v}_{2}\mathbf{1})\odot\mathbf{Q},\qquad(11)$$

where $\mathbf{1}\in\mathbb{R}^{1\times k}$ denotes the vector of all ones. Finally, we apply the softmax function to normalize $\mathbf{F}$ with

$$\mathbf{F}=softmax(\mathbf{F})\quad\mathrm{s.t.}\quad\sum_{j=1}^{k}f_{i,j}=1,\quad f_{i,j}>0,\qquad(12)$$

where $f_{i,j}$ is the $(i,j)$-th element of $\mathbf{F}$. When the network is well-trained, we can directly infer the predicted cluster labels from $\mathbf{F}$, i.e.,

$$y_{i}=\mathop{\arg\max}_{j}f_{i,j}\quad\mathrm{s.t.}\quad j=1,\cdots,k,\qquad(13)$$

where $y_{i}$ is the predicted label of $\mathbf{x}_{i}$. In this way, the cluster structure is represented explicitly in $\mathbf{F}$.
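A sketch of the DWF module of Eqs. (10)-(13), under the same naming assumptions as the sketches above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DWF(nn.Module):
    # Distribution-wise fusion of Eqs. (10)-(12); k is the number of clusters.
    def __init__(self, k):
        super().__init__()
        self.attn = nn.Linear(2 * k, 2, bias=False)  # \hat{W} in Eq. (10)

    def forward(self, z, q):
        # Eq. (10): attention over the two learned distributions.
        v = F.leaky_relu(self.attn(torch.cat([z, q], dim=1)))
        v = F.normalize(F.softmax(v, dim=1), p=2, dim=1)
        # Eqs. (11)-(12): weighted combination, renormalized into a distribution.
        return F.softmax(v[:, :1] * z + v[:, 1:] * q, dim=1)

# Eq. (13): after training, the labels follow directly from F, e.g.,
#   y_pred = dwf(z, q).argmax(dim=1)
```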

III-D Dual Self-supervision

As unsupervised clustering lacks reliable guidance, we propose a novel dual self-supervision scheme that combines a soft self-supervision strategy with a Kullback-Leibler (KL) divergence loss and a hard self-supervision strategy with a pseudo supervision loss to guide the overall network training, as illustrated in Figure 4.

III-D1 Soft Self-supervision

Since we take advantage of the high-confidence assignments to iteratively refine the clusters by utilizing the soft assignments (i.e., the probability distributions $\mathbf{Q}$ and $\mathbf{Z}$), we term this strategy soft self-supervision. Concretely, since $\mathbf{Z}$ involves the graph information through the HWF and SWF modules, we first derive an auxiliary distribution $\mathbf{P}$ from $\mathbf{Z}$ by squaring $z_{i,j}$ and normalizing per cluster, i.e.,

$$p_{i,j}=\frac{z_{i,j}^{2}/\sum_{i^{\prime}=1}^{n}z_{i^{\prime},j}}{\sum_{j^{\prime}=1}^{k}\left(z_{i,j^{\prime}}^{2}/\sum_{i^{\prime}=1}^{n}z_{i^{\prime},j^{\prime}}\right)},\qquad(14)$$

where $0\leq p_{i,j}\leq 1$ is the $(i,j)$-th element of $\mathbf{P}$. Then, we minimize the KL divergence not only between each learned distribution and the auxiliary distribution (i.e., $KL(\mathbf{P},\mathbf{Z})$ and $KL(\mathbf{P},\mathbf{Q})$), but also between the two learned distributions (i.e., $KL(\mathbf{Z},\mathbf{Q})$) to promote a highly consistent distribution alignment, i.e.,

$$\mathcal{L}_{S}=\lambda_{1}\left(KL(\mathbf{P},\mathbf{Z})+KL(\mathbf{P},\mathbf{Q})\right)+\lambda_{2}KL(\mathbf{Z},\mathbf{Q})=\lambda_{1}\sum_{i}^{n}\sum_{j}^{k}p_{i,j}\log\frac{p_{i,j}^{2}}{z_{i,j}q_{i,j}}+\lambda_{2}\sum_{i}^{n}\sum_{j}^{k}z_{i,j}\log\frac{z_{i,j}}{q_{i,j}},\qquad(15)$$

where $\lambda_{1}>0$ and $\lambda_{2}>0$ are trade-off parameters.
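A corresponding PyTorch sketch of Eqs. (14)-(15) follows (names ours; detaching $\mathbf{P}$ as a fixed target is our assumption, following common DEC-style practice):

```python
import torch

def sss_loss(z, q, lambda1, lambda2, eps=1e-8):
    # Eq. (14): auxiliary distribution P derived from Z; P is treated as a
    # fixed target (detach) so that gradients flow only through Z and Q.
    weight = z ** 2 / z.sum(dim=0, keepdim=True)
    p = (weight / weight.sum(dim=1, keepdim=True)).detach()

    def kl(b, a):  # KL(b, a), summed over samples and clusters
        return (b * (torch.log(b + eps) - torch.log(a + eps))).sum()

    # Eq. (15): align Z and Q with P, and Z with Q.
    return lambda1 * (kl(p, z) + kl(p, q)) + lambda2 * kl(z, q)
```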

III-D2 Hard Self-supervision

Although the soft self-supervision strategy is a helpful tool for unsupervised clustering, it preserves the low-confidence predicted probabilities, limiting the clustering performance. To further exploit the available off-the-shelf information from the cluster assignments, we introduce the pseudo supervision technique [54] and set the pseudo-label $\hat{y}_{i}$ as $\hat{y}_{i}=y_{i}$. Considering that the pseudo-labels may contain many incorrect labels, we select the high-confidence ones as supervisory information using a large threshold $r$, i.e.,

$$g_{i,j}=\begin{cases}1&\text{if }f_{i,j}\geq r,\\0&\text{otherwise},\end{cases}\qquad(16)$$

In the experiments, we set $r=0.8$. Then, we leverage the high-confidence pseudo-labels to supervise the network training, i.e.,

$$\mathcal{L}_{H}=\lambda_{3}\sum_{i}\sum_{j}g_{i,j}\,\Upsilon_{CE}(f_{i,j},\Upsilon_{OH}(\hat{y}_{i})),\qquad(17)$$

where $\lambda_{3}>0$ is the trade-off parameter, $\Upsilon_{CE}$ denotes the cross-entropy loss [55], and $\Upsilon_{OH}$ transforms $\hat{y}_{i}$ into its one-hot form. As shown in Figure 4, the pseudo-labels transfer the cluster assignment to a hard one-hot encoding; we thus name this strategy hard self-supervision.
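A matching sketch of the HSS loss of Eqs. (16)-(17) (function names ours; $\mathbf{F}$ is assumed to be the output of Eq. (12)):

```python
import torch
import torch.nn.functional as F

def hss_loss(f, lambda3, r=0.8, eps=1e-8):
    # Pseudo-labels and their confidences from the fused distribution F.
    conf, y_hat = f.max(dim=1)
    # Eq. (16): keep only samples whose confidence reaches the threshold r.
    mask = (conf >= r).float()
    # Eq. (17): cross-entropy against the one-hot pseudo-labels; since f is
    # already a probability distribution, CE reduces to -log f[i, y_hat_i].
    ce = F.nll_loss(torch.log(f + eps), y_hat, reduction="none")
    return lambda3 * (mask * ce).sum()
```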

In addition, we have empirically observed that using only HSS does not perform well in all scenarios. The reason may be that the distribution probability values are small in some situations, providing only weak self-supervision for guiding the network training with the HSS strategy. To this end, we combine the SSS and HSS strategies to drive the network training. Combining Eqs. (3), (15), and (17), our overall loss function can be written as

$$\mathcal{L}=\min_{\mathbf{F}}\left(\mathcal{L}_{R}+\mathcal{L}_{S}+\mathcal{L}_{H}\right).\qquad(18)$$

The whole training process is shown in Algorithm 1.

Figure 4: The illustration of the proposed dual self-supervision solution. It exploits a soft self-supervision strategy and a hard self-supervision strategy to effectively train the proposed network in an end-to-end manner. Such strategies iteratively refine the network training by learning from high-confidence assignments.
Algorithm 1 Training process of our method
Input: Input matrix $\mathbf{X}$; adjacency matrix $\mathbf{A}$; cluster number $k$; trade-off parameters $\lambda_{1},\lambda_{2},\lambda_{3}$; maximum iterations $i_{MaxIter}$;
Output: Reconstructed matrix $\hat{\mathbf{X}}$; clustering result $\mathbf{y}$;
1:  Initialization: $l=4$, $i_{Iter}=1$, $\mathbf{Z}_{0}=\mathbf{X}$, $\mathbf{H}_{0}=\mathbf{X}$;
2:  Initialize the parameters of the DAE network;
3:  while $i_{Iter}<i_{MaxIter}$ do
4:     Obtain the feature $\mathbf{H}$ by Eq. (3);
5:     Obtain the feature $\mathbf{Z}$ via Eq. (9);
6:     Obtain the cluster center embedding $\boldsymbol{\mu}$ with K-means based on the feature $\mathbf{H}$;
7:     Calculate the distribution $\mathbf{Q}$ via Eq. (1);
8:     Calculate the distribution $\mathbf{P}$ via Eq. (14);
9:     Calculate the distribution $\mathbf{F}$ via Eq. (12);
10:    Conduct the soft self-supervision via Eq. (15);
11:    Conduct the hard self-supervision via Eq. (17);
12:    Minimize the overall loss function via Eq. (18);
13:    Conduct back propagation and update the parameters of the proposed network;
14:    $i_{Iter}=i_{Iter}+1$;
15:  end while
16:  Calculate the clustering result $\mathbf{y}$ from $\mathbf{F}$ by Eq. (13);
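For reference, Algorithm 1 condenses into a full-batch PyTorch training loop like the following sketch, where model, a_norm, and the loss helpers sss_loss/hss_loss (sketched above) are illustrative names rather than the released implementation:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for it in range(max_iter):
    # Forward pass through the DAE, HWF/SWF, and DWF modules.
    x_hat, z, q, f = model(x, a_norm)
    loss_r = torch.sum((x - x_hat) ** 2)  # Eq. (3), squared Frobenius norm
    loss = loss_r + sss_loss(z, q, lambda1, lambda2) + hss_loss(f, lambda3)  # Eq. (18)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
y_pred = f.argmax(dim=1)                  # Eq. (13)
```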

III-E Computational Complexity Analysis

For the DAE module, the time complexity is $\mathcal{O}(n\sum_{i=2}^{l}d_{i-1}d_{i})$. For the GCN module, as the operation can be computed efficiently using sparse matrix computation, the time complexity is only $\mathcal{O}(|\mathcal{E}|\sum_{i=2}^{l}d_{i-1}d_{i})$ according to [21]. For Eq. (1), the time complexity is $\mathcal{O}(nk+n\log n)$ based on [14]. For the HWF, SWF, and DWF modules, the total time complexity is $\mathcal{O}(\sum_{i=1}^{l-1}d_{i})+\mathcal{O}((\sum_{i=1}^{l+1}d_{i})(l+1))+\mathcal{O}(k)$. Thus, the overall computational complexity of Algorithm 1 in one iteration is about $\mathcal{O}(n\sum_{i=2}^{l}d_{i-1}d_{i}+|\mathcal{E}|\sum_{i=2}^{l}d_{i-1}d_{i}+(n+1)k+n\log n+\sum_{i=1}^{l-1}d_{i}+(\sum_{i=1}^{l+1}d_{i})(l+1))$.

TABLE II: Description of the adopted datasets.
Dataset Type Samples Classes Edges
USPS Image 9298 10 27894
Reuters Text 10000 4 30000
HHAR Record 10299 6 30897
ACM Graph 3025 3 13128
CiteSeer Graph 3327 6 4552
DBLP Graph 4057 4 3528
Amazon Photo Graph 7650 8 119081
PubMed Graph 19717 3 44324
AIDS Graph 31385 38 64780
TABLE III: Clustering results (mean±std) with twelve compared methods on nine benchmark datasets. The best and second-best results are bolded and underlined, respectively. 'OOM' denotes the Out-Of-Memory case.
Datasets Metrics K-means [13] DAE [12] DEC [14] IDEC [15] GAE [20] VGAE [20] DAEGC [22] ARGA [21] SDCN [24] AGCN [25] DFCN [47] AGCC [48] Our
[Science06] [ICML16] [AAAI17] [NIPS16] [NIPS16] [AAAI19] [IJCAI18] [WWW20] [MM21] [AAAI21] [TNNLS22]
Reuters ARI 46.09±0.02 49.55±0.37 48.44±0.14 51.26±0.21 19.61±0.22 26.18±0.36 31.12±0.18 24.50±0.40 55.36±0.37 60.55±1.78 59.80±0.40 62.98±2.24 63.48±1.10
F1 58.33±0.03 60.96±0.22 64.25±0.22 63.21±0.12 43.53±0.42 57.14±0.17 61.82±0.13 51.10±0.20 65.48±0.08 66.16±0.64 69.60±0.10 67.21±1.61 68.81±1.26
ACC 59.98±0.02 74.90±0.21 73.58±0.13 75.43±0.14 54.40±0.27 60.85±0.23 65.50±0.13 56.20±0.20 77.15±0.21 79.30±1.07 77.70±0.20 81.65±1.52 81.68±0.69
NMI 58.86±0.01 49.69±0.29 47.50±0.34 50.28±0.17 25.92±0.41 25.51±0.22 30.55±0.29 28.70±0.30 50.82±0.21 57.83±1.01 59.90±0.40 59.56±0.94 58.94±1.16
HHAR ARI 27.95±0.38 60.36±0.88 61.25±0.51 62.83±0.45 42.63±1.63 51.47±0.73 60.38±2.15 44.70±1.00 72.84±0.09 77.07±0.66 76.40±0.10 75.58±1.85 77.38±0.97
F1 41.28±2.43 66.36±0.34 67.29±0.29 68.63±0.33 62.64±0.97 71.55±0.29 76.89±2.18 61.10±0.90 82.58±0.08 88.00±0.53 87.30±0.10 85.79±2.48 87.90±1.11
ACC 54.04±0.01 68.69±0.31 69.39±0.25 71.05±0.36 62.33±1.01 71.30±0.36 76.51±2.19 63.30±0.80 84.26±0.17 88.11±0.43 87.10±0.10 86.54±1.79 87.83±1.01
NMI 41.54±0.51 71.42±0.97 72.91±0.39 74.19±0.39 55.06±1.39 62.95±0.36 69.10±2.28 57.10±1.40 79.90±0.09 82.44±0.62 82.20±0.10 82.21±1.78 85.34±2.11
USPS ARI 54.55±0.06 58.83±0.05 63.70±0.27 67.86±0.12 50.30±0.55 40.96±0.59 63.33±0.34 51.10±0.60 71.84±0.24 73.61±0.43 75.30±0.20 68.50±3.83 75.54±1.28
F1 64.78±0.03 69.74±0.03 71.82±0.21 74.63±0.10 61.84±0.43 53.63±1.05 72.45±0.49 66.10±1.20 76.98±0.18 77.61±0.38 78.30±0.20 74.86±2.56 79.33±0.74
ACC 66.82±0.04 71.04±0.03 73.31±0.17 76.22±0.12 63.10±0.33 56.19±0.72 73.55±0.40 66.80±0.70 78.08±0.19 80.98±0.28 79.50±0.20 77.14±1.21 81.13±1.89
NMI 62.63±0.05 67.53±0.03 70.58±0.25 75.56±0.06 60.69±0.58 51.08±0.37 71.12±0.24 61.60±0.30 79.51±0.27 79.64±0.32 82.80±0.30 75.93±3.83 82.14±0.15
ACM ARI 30.60±0.69 54.64±0.16 60.64±1.87 62.16±1.50 59.46±3.10 57.72±0.67 59.35±3.89 62.90±2.10 73.91±0.40 74.20±0.38 74.90±0.40 73.73±0.90 76.72±0.98
F1 67.57±0.74 82.01±0.08 84.51±0.74 85.11±0.48 84.65±1.33 84.17±0.23 87.07±2.79 86.10±1.20 90.42±0.19 90.58±0.17 90.80±0.20 90.39±0.39 91.53±0.42
ACC 67.31±0.71 81.83±0.08 84.33±0.76 85.12±0.52 84.52±1.44 84.13±0.22 86.94±2.83 86.10±1.20 90.45±0.18 90.59±0.15 90.90±0.20 90.38±0.38 91.55±0.40
NMI 32.44±0.46 49.30±0.16 54.54±1.51 56.61±1.16 55.38±1.92 53.20±0.52 56.18±4.15 55.70±1.40 68.31±0.25 68.38±0.45 69.40±0.40 68.34±0.89 71.50±0.80
CiteSeer ARI 06.97±0.39 29.31±0.14 28.12±0.36 25.70±2.65 33.55±1.18 33.13±0.53 37.78±1.24 33.40±1.50 40.17±0.43 43.79±0.31 45.50±0.30 41.82±2.03 47.98±0.91
F1 31.92±0.27 53.80±0.11 52.62±0.17 61.62±1.39 57.36±0.82 57.70±0.49 62.20±1.32 54.80±0.80 63.62±0.24 62.37±0.21 64.30±0.20 60.47±1.57 62.37±0.52
ACC 38.65±0.65 57.08±0.13 55.89±0.20 60.49±1.42 61.35±0.80 60.97±0.36 64.54±1.39 56.90±0.70 65.96±0.31 68.79±0.23 69.50±0.20 68.08±1.44 72.01±0.53
NMI 11.45±0.38 27.64±0.08 28.34±0.30 27.17±2.40 34.63±0.65 32.69±0.27 36.41±0.86 34.50±0.80 38.71±0.32 41.54±0.30 43.90±0.20 40.86±1.45 45.34±0.70
DBLP ARI 13.43±3.02 12.21±0.43 23.92±0.39 25.37±0.60 22.02±1.40 17.92±0.07 21.03±0.52 22.70±0.30 39.15±2.01 42.49±0.31 47.00±1.50 44.40±3.79 57.29±1.20
F1 36.08±3.53 52.53±0.36 59.38±0.51 61.33±0.56 61.41±2.23 58.69±0.07 61.75±0.67 61.80±0.90 67.71±1.51 72.80±0.56 75.70±0.80 71.84±2.02 80.79±0.61
ACC 39.32±3.17 51.43±0.35 58.16±0.56 60.31±0.62 61.21±1.22 58.59±0.06 62.05±0.48 61.60±1.00 68.05±1.81 73.26±0.37 76.00±0.80 73.45±2.16 81.26±0.62
NMI 16.94±3.22 25.40±0.16 29.51±0.28 31.17±0.50 30.80±0.91 26.92±0.06 32.49±0.45 26.80±1.00 39.50±1.34 39.68±0.42 43.70±1.00 40.36±2.81 51.99±0.76
Amazon Photo ARI 05.50±0.44 20.80±0.47 18.59±0.04 19.24±0.07 48.82±4.57 56.24±4.66 59.39±0.02 44.18±4.41 31.21±1.23 41.15±2.78 58.98±0.84 29.96±3.46 60.51±1.58
F1 23.96±0.51 47.87±0.20 46.71±0.12 47.20±0.11 68.08±1.76 70.38±2.98 69.97±0.02 64.30±1.95 50.66±1.49 43.68±5.08 71.58±0.31 39.67±5.22 71.68±2.35
ACC 27.22±0.76 48.25±0.08 47.22±0.08 47.62±0.08 71.57±2.48 74.26±3.63 76.44±0.01 69.28±2.30 53.44±0.81 58.53±1.74 76.88±0.80 51.47±3.04 78.75±1.02
NMI 13.23±1.33 38.76±0.30 37.35±0.05 37.83±0.08 62.13±2.79 66.01±3.40 65.57±0.03 58.36±2.76 44.85±0.83 51.76±3.23 69.21±1.00 39.19±4.07 66.27±1.13
PubMed ARI 28.10±0.01 23.86±0.67 19.55±0.13 20.58±0.39 20.62±1.39 30.15±1.23 29.84±0.04 24.35±0.17 22.30±2.07 31.39±0.67 30.64±0.11 OOM 35.29±1.02
F1 58.88±0.01 64.01±0.29 61.49±0.10 62.41±0.32 61.37±0.85 67.68±0.89 68.23±0.02 65.69±0.13 65.01±1.21 69.73±0.45 68.10±0.07 OOM 72.78±0.72
ACC 59.83±0.01 63.07±0.31 60.14±0.09 60.70±0.34 62.09±0.81 68.48±0.77 68.73±0.03 65.26±0.12 64.20±1.30 69.67±0.42 68.89±0.07 OOM 73.16±0.69
NMI 31.05±0.02 26.32±0.57 22.44±0.14 23.67±0.29 23.84±3.54 30.61±1.71 28.26±0.03 24.80±0.17 22.87±2.04 30.96±0.99 31.43±0.13 OOM 33.29±1.14
AIDS ARI 05.37±0.19 05.71±0.66 10.71±3.49 13.39±5.35 03.50±0.79 00.44±0.43 OOM 01.79±0.97 -0.06±0.00 14.03±5.76 00.48±0.00 OOM 21.40±7.12
F1 11.86±1.02 11.91±1.54 13.81±1.60 12.10±1.55 08.40±1.23 09.13±2.74 OOM 06.01±1.41 02.02±0.00 06.14±1.80 05.51±0.01 OOM 21.77±1.10
ACC 17.11±0.63 21.78±5.02 35.12±3.69 47.32±5.76 15.72±0.89 23.04±6.08 OOM 59.27±2.54 62.25±0.00 59.82±3.37 11.28±0.02 OOM 63.84±2.81
NMI 23.42±0.57 24.29±2.49 24.30±2.19 25.37±4.90 12.66±2.64 01.62±0.29 OOM 04.86±2.16 00.16±0.00 09.08±2.28 04.67±0.01 OOM 34.44±2.96

IV Experiments

We conducted quantitative and qualitative experiments on nine commonly used benchmark datasets to evaluate the proposed model. In addition, we performed ablation studies to investigate the effectiveness of the proposed modules and the adopted strategies. Moreover, we performed a series of parameter analyses to verify the robustness of our method.

IV-A Datasets and Compared Methods

We conducted experiments on one image dataset (USPS [56]), one text dataset (Reuters [57]), one record dataset (HHAR [58]), and six graph datasets (ACM (http://dl.acm.org), CiteSeer (http://CiteSeerx.ist.psu.edu/), DBLP (https://dblp.uni-trier.de), Amazon Photo, PubMed [59], and AIDS [60]), which are briefly summarized in Table II.

We compared the proposed method with the classic clustering method K-means [13], three DAE-based embedding clustering methods [12, 14, 15], and eight GCN-based embedding clustering methods [20, 21, 22, 24, 25, 47, 48], the details of which are listed as follows.

TABLE IV: Results of ablation studies. ✓ and ✗ indicate that the corresponding component is used and unused, respectively. The best results are noted in bold.
Datasets SSS HSS DWF SWF HWF ARI F1 ACC NMI
USPS ✗ ✗ ✗ ✗ ✗ 71.67±0.44 76.88±0.30 78.08±0.30 79.19±0.44
✗ ✗ ✗ ✗ ✓ 71.71±0.87 76.46±0.54 78.98±0.97 78.87±0.36
✗ ✗ ✗ ✓ ✓ 70.96±0.24 76.44±0.17 77.70±0.14 78.61±0.22
✗ ✗ ✓ ✓ ✓ 71.73±0.71 76.31±0.34 79.63±0.43 78.41±0.29
✗ ✓ ✓ ✓ ✓ 71.90±0.99 76.61±0.56 79.74±0.79 78.64±0.46
✓ ✗ ✓ ✓ ✓ 74.39±0.13 78.51±0.09 79.11±0.11 82.06±0.16
✓ ✓ ✓ ✓ ✓ 75.54±1.28 79.33±0.74 81.13±1.89 82.14±0.15
Reuters ✗ ✗ ✗ ✗ ✗ 56.37±4.76 65.03±1.87 78.19±2.02 53.74±3.63
✗ ✗ ✗ ✗ ✓ 61.38±0.78 67.22±1.15 80.19±0.53 57.94±0.49
✗ ✗ ✗ ✓ ✓ 61.55±0.64 66.54±0.21 80.60±0.47 58.15±0.49
✗ ✗ ✓ ✓ ✓ 62.70±1.00 66.90±0.30 80.95±0.46 59.42±0.69
✗ ✓ ✓ ✓ ✓ 63.32±0.57 67.21±0.18 81.28±0.32 60.79±0.69
✓ ✗ ✓ ✓ ✓ 62.75±2.00 68.74±1.23 81.02±0.81 57.93±1.50
✓ ✓ ✓ ✓ ✓ 63.48±1.10 68.81±1.26 81.68±0.69 58.94±1.16
HHAR ✗ ✗ ✗ ✗ ✗ 73.17±1.95 82.70±3.97 84.18±2.80 80.03±1.16
✗ ✗ ✗ ✗ ✓ 72.45±1.02 83.25±0.81 84.60±0.66 79.08±0.85
✗ ✗ ✗ ✓ ✓ 73.24±0.73 83.34±1.69 84.77±1.21 80.10±0.50
✗ ✗ ✓ ✓ ✓ 72.84±1.23 83.72±1.10 84.95±0.86 79.22±0.94
✗ ✓ ✓ ✓ ✓ 73.24±0.52 83.74±0.68 85.01±0.46 79.99±0.47
✓ ✗ ✓ ✓ ✓ 75.91±0.40 86.65±0.70 86.23±0.81 82.61±0.12
✓ ✓ ✓ ✓ ✓ 77.38±0.97 87.90±1.11 87.83±1.01 85.34±2.11
ACM ✗ ✗ ✗ ✗ ✗ 73.91±0.40 90.42±0.19 90.45±0.18 68.31±0.25
✗ ✗ ✗ ✗ ✓ 73.95±0.60 90.48±0.26 90.47±0.24 68.42±0.61
✗ ✗ ✗ ✓ ✓ 74.20±0.38 90.58±0.17 90.59±0.15 68.38±0.45
✗ ✗ ✓ ✓ ✓ 74.58±0.78 90.72±0.35 90.73±0.33 68.94±0.63
✗ ✓ ✓ ✓ ✓ 74.83±0.73 90.85±0.33 90.85±0.31 69.02±0.66
✓ ✗ ✓ ✓ ✓ 75.78±0.64 91.18±0.25 91.18±0.26 70.59±0.68
✓ ✓ ✓ ✓ ✓ 76.72±0.98 91.53±0.42 91.55±0.40 71.50±0.80
CiteSeer ✗ ✗ ✗ ✗ ✗ 40.17±0.43 63.62±0.24 65.96±0.31 38.71±0.32
✗ ✗ ✗ ✗ ✓ 40.93±1.78 60.91±0.81 66.38±1.72 39.07±1.52
✗ ✗ ✗ ✓ ✓ 43.79±0.31 62.37±0.21 68.79±0.23 41.54±0.30
✗ ✗ ✓ ✓ ✓ 43.50±0.47 61.25±0.31 68.54±0.30 41.35±0.58
✗ ✓ ✓ ✓ ✓ 43.72±0.60 61.52±0.65 68.46±0.40 41.25±0.41
✓ ✗ ✓ ✓ ✓ 47.76±1.28 62.24±0.80 71.86±0.79 45.10±1.05
✓ ✓ ✓ ✓ ✓ 47.98±0.91 62.37±0.52 72.01±0.53 45.34±0.70
DBLP ✗ ✗ ✗ ✗ ✗ 39.15±2.01 67.71±1.51 68.05±1.81 39.50±1.34
✗ ✗ ✗ ✗ ✓ 37.78±1.85 68.69±1.65 69.65±1.43 35.37±1.58
✗ ✗ ✗ ✓ ✓ 42.49±0.31 72.80±0.56 73.26±0.37 39.68±0.42
✗ ✗ ✓ ✓ ✓ 41.72±0.47 72.68±0.20 72.92±0.21 39.26±0.33
✗ ✓ ✓ ✓ ✓ 42.52±0.96 72.81±0.59 73.43±0.50 39.99±0.70
✓ ✗ ✓ ✓ ✓ 55.45±0.60 79.83±0.32 80.29±0.33 50.08±0.56
✓ ✓ ✓ ✓ ✓ 57.29±1.20 80.79±0.61 81.26±0.62 51.99±0.76
Amazon Photo ✗ ✗ ✗ ✗ ✗ 31.21±1.23 50.66±1.49 53.44±0.81 44.85±0.83
✗ ✗ ✗ ✗ ✓ 37.86±3.46 35.86±4.00 54.84±1.43 46.51±4.93
✗ ✗ ✗ ✓ ✓ 41.15±2.78 43.68±5.08 58.53±1.74 51.76±3.23
✗ ✗ ✓ ✓ ✓ 41.14±2.78 43.68±5.08 58.52±1.74 51.77±3.22
✗ ✓ ✓ ✓ ✓ 43.50±2.29 46.20±4.18 60.59±1.94 52.23±1.67
✓ ✗ ✓ ✓ ✓ 51.81±2.25 66.37±2.64 71.93±2.08 59.09±1.60
✓ ✓ ✓ ✓ ✓ 60.51±1.58 71.68±2.35 78.75±1.02 66.27±1.13
PubMed ✗ ✗ ✗ ✗ ✗ 22.30±2.07 65.01±1.21 64.20±1.30 22.87±2.04
✗ ✗ ✗ ✗ ✓ 27.65±1.16 67.21±0.83 67.31±0.78 27.77±1.85
✗ ✗ ✗ ✓ ✓ 31.39±0.67 69.73±0.45 69.67±0.42 30.96±0.99
✗ ✗ ✓ ✓ ✓ 30.85±1.10 69.05±0.87 68.67±0.79 32.19±1.29
✗ ✓ ✓ ✓ ✓ 33.21±1.94 70.75±1.28 71.35±1.39 31.47±1.75
✓ ✗ ✓ ✓ ✓ 32.79±1.57 69.89±1.20 70.56±1.37 31.85±1.36
✓ ✓ ✓ ✓ ✓ 35.29±1.02 72.78±0.72 73.16±0.69 33.29±1.14
AIDS ✗ ✗ ✗ ✗ ✗ 10.15±3.04 4.76±0.69 58.32±3.33 7.67±0.48
✗ ✗ ✗ ✗ ✓ 10.31±3.52 3.92±0.67 62.25±0.01 6.68±1.13
✗ ✗ ✗ ✓ ✓ 10.69±3.33 4.33±0.99 62.32±0.21 7.54±1.40
✗ ✗ ✓ ✓ ✓ 11.85±3.35 21.19±1.39 62.29±0.11 32.31±0.40
✗ ✓ ✓ ✓ ✓ 12.78±2.61 21.65±1.51 62.25±0.00 32.34±0.35
✓ ✗ ✓ ✓ ✓ 15.34±5.94 21.37±1.60 62.61±1.15 32.46±1.01
✓ ✓ ✓ ✓ ✓ 21.40±7.12 21.77±1.10 63.84±2.81 34.44±2.96
  • DAE [12] uses deep auto-encoder to learn latent feature representations and then performs K-means on that feature to obtain clustering results.

  • DEC [14] jointly conducts embedding learning and cluster assignment with an iterative procedure.

  • IDEC [15] introduces a reconstruction loss into DEC to improve the clustering performance.

  • GAE [20] and VGAE [20] incorporate DAE and variational DAE into GCN frameworks, respectively.

  • DAEGC [22] achieves a neighbor-wise embedding learning with an attention-driven strategy and supervises the network training with a clustering loss.

  • ARGA [21] guides embedding learning with a designed adversarial regularization.

  • SDCN [24] fuses DEC and GCN to merge the topological structure information into deep embedding clustering.

  • AGCN [25] focuses on enhancing embedding learning via attention-based feature fusion.

  • DFCN [47] merges the node attribute and topological structure information based on the DAE and GAE.

  • AGCC [48] updates the graph structure layer by layer to mine the latent connectivity relationships among data.

IV-B Implementation Details

IV-B1 Evaluation metrics

We used four metrics to evaluate the clustering performance: Adjusted Rand Index (ARI), macro F1-score (F1), Accuracy (ACC), and Normalized Mutual Information (NMI). For each metric, a larger value implies a better clustering result.
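For reproducibility, these metrics can be computed with scikit-learn, plus a Hungarian matching step for ACC; this is a standard recipe rather than the paper's own evaluation code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import metrics

def clustering_accuracy(y_true, y_pred):
    # ACC needs the best one-to-one mapping between predicted clusters and
    # ground-truth classes, found with the Hungarian algorithm.
    cm = metrics.confusion_matrix(y_true, y_pred)
    row, col = linear_sum_assignment(-cm)  # maximize the matched counts
    return cm[row, col].sum() / len(y_true)

# ari = metrics.adjusted_rand_score(y_true, y_pred)
# nmi = metrics.normalized_mutual_info_score(y_true, y_pred)
# Macro F1 is computed after remapping y_pred with the matching above.
```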

IV-B2 Graph construction

As the non-graph datasets (i.e., USPS, Reuters, and HHAR) lack a topology graph, we used a typical graph construction approach to generate their graph data. Specifically, we first employed the cosine similarity to compute the similarity matrix $\mathbf{S}$, i.e.,

$$\mathbf{S}=\frac{\mathbf{X}\mathbf{X}^{\mathsf{T}}}{\left\|\mathbf{X}\right\|_{F}\left\|\mathbf{X}^{\mathsf{T}}\right\|_{F}},\qquad(19)$$

where $\left\|\mathbf{X}\right\|_{F}=\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{d}|x_{i,j}|^{2}}$ denotes the Frobenius norm and $\mathbf{X}^{\mathsf{T}}$ the transpose of $\mathbf{X}$. Then, we keep the top-$\hat{k}$ similar neighbors of each sample to construct an undirected $\hat{k}$-nearest neighbor (KNN [61]) graph. The constructed KNN graph can depict the topological structure of a dataset and hence is used as the GCN input.
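A NumPy sketch of this construction is given below (the function name is ours; since $\|\mathbf{X}\|_{F}=\|\mathbf{X}^{\mathsf{T}}\|_{F}$, the denominator of Eq. (19) is simply $\|\mathbf{X}\|_{F}^{2}$, and the global scaling does not affect the neighbor ranking):

```python
import numpy as np

def knn_graph(x, k_hat=3):
    # Eq. (19): globally normalized similarity S = X X^T / ||X||_F^2.
    s = (x @ x.T) / (np.linalg.norm(x) ** 2)
    np.fill_diagonal(s, -np.inf)             # exclude self-similarity
    idx = np.argsort(-s, axis=1)[:, :k_hat]  # top-k_hat neighbors per sample
    a = np.zeros_like(s)
    rows = np.repeat(np.arange(x.shape[0]), k_hat)
    a[rows, idx.ravel()] = 1.0
    return np.maximum(a, a.T)                # symmetrize: undirected graph
```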

IV-B3 Training Procedure

Similar to [14, 15, 24, 25], we first pre-trained the DAE module for 30 epochs with a learning rate of 0.001. Then, we trained the whole network for 200 iterations. We set the dimensions of the auto-encoder and GCN layers to 500-500-2000-10, the batch size to 256, and the negative input slope of LReLU to 0.2. In addition, we set the learning rate to 0.001 for the USPS, HHAR, ACM, DBLP, and PubMed datasets, and to 0.0001 for the Reuters, CiteSeer, and Amazon Photo datasets. We set $r$ to 0.8 in this paper; more detailed experiments and analyses of the threshold value are given in Section IV-E3. For ARGA, we used the parameter settings given in the original paper [21]. For the other methods under comparison, we directly cite the results in [25]. We repeated each experiment 10 times and report the mean values and the corresponding standard deviations (i.e., mean±std). The training procedure is implemented in PyTorch on two GPUs (GeForce RTX 2080 Ti and NVIDIA GeForce RTX 3090).

IV-C Clustering Results

Table III provides the clustering results of the proposed method and twelve compared methods with four metrics, where we have the following observations.

  • Our method achieves the best clustering results on most benchmark datasets. For example, in the non-graph dataset Reuters, our approach improves the ARI, F1, ACC, and NMI values of SDCN [24] by 8.12%, 3.33%, 4.53%, and 8.12%, respectively. In the graph dataset DBLP, our approach improves 18.14% over SDCN on ARI, 13.08% on F1, 13.21% on ACC, and 12.49% on NMI.

  • DAEGC enhances GAE by introducing the neighbor-wise embedding learning with an attention-based strategy, benefiting clustering performance improvement. Such a phenomenon validates the effectiveness of the attention-based mechanism. Differently, our method extends the attention-based mechanism to the heterogeneity-wise, scale-wise, and distribution-wise fusion modules to adaptively utilize the multiple off-the-shelf information, which significantly improves the clustering performance.

  • SDCN performs better than the DAE-based (DAE, DEC, IDEC) and GCN-based (GAE, VGAE, ARGA) embedding clustering methods, demonstrating that combining DAE and GCN can contribute to clustering performance. Nevertheless, SDCN (i) equates the importance of the DAE feature and the GCN feature; (ii) neglects the multi-scale features; and (iii) fails to utilize the available off-the-shelf information from the clustering assignment. The proposed method addresses those issues and thus produces significantly better clustering performance than SDCN on all the datasets in almost all metrics.

  • Our method typically achieves better clustering performance than AGCN [25], demonstrating the effectiveness of the proposed distribution-wise fusion module and the dual self-supervision solution in guiding the unsupervised clustering network training. For instance, on Amazon Photo, our approach improves over AGCN by 19.36% on ARI, 28.00% on F1, 20.22% on ACC, and 14.51% on NMI.

  • Our method provides a significant improvement on DBLP and PubMed; e.g., on DBLP, our approach improves over the second-best method by 10.29% on ARI, 5.09% on F1, 5.26% on ACC, and 8.29% on NMI. The possible reason is that DBLP and PubMed have low feature dimensions (i.e., little information), meaning that sufficiently utilizing the available off-the-shelf information plays an important role in improving the clustering performance.

  • Our method does not outperform AGCN on HHAR in every metric. The possible reason is that in HHAR, a number of dissimilar nodes are connected in the constructed KNN graph, reducing the graph quality. Although AGCN also uses the KNN graph, its auxiliary distribution $\mathbf{P}$ is inferred from the output of the conventional auto-encoder. Differently, the proposed method uses the graph convolutional network output to derive $\mathbf{P}$ to utilize the rich graph information of $\mathbf{Z}$. Thus, if the graph quality is poor, the clustering performance of the proposed method may be worse than that of AGCN.

  • Our method obtains the best clustering performance on AIDS with all four metrics, where AIDS is a large-scale long-tailed dataset in which one class accounts for 62.34% of the samples and the other thirty-seven classes share the remaining 37.66%.

Figure 5: The visual comparison of (a) the distribution $\mathbf{Q}$, (b) the distribution $\mathbf{Z}$, and (c) our adaptively fused distribution $\mathbf{F}$.
Figure 6: Analyses of the number of neighbors for KNN graph construction on (a) USPS, (b) Reuters, and (c) HHAR. All the sub-figures share the same legend.
Figure 7: Analysis of different hyperparameters ($\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$) with four metrics ((a) ARI, (b) F1, (c) ACC, and (d) NMI) on DBLP. We illustrate the results w.r.t. these hyperparameters in a 4D manner, where the color indicates the fourth dimension, i.e., the corresponding experimental results.
Figure 8: Investigation of the effect of the threshold value for pseudo supervision on (a) USPS, (b) Reuters, (c) HHAR, (d) ACM, (e) CiteSeer, (f) DBLP, (g) Amazon Photo, (h) PubMed, and (i) AIDS.

IV-D Ablation Study

We conducted comprehensive ablation studies to validate the effectiveness of the proposed modules and self-supervision strategies. Table IV lists the quantitative results, where the first row of each dataset denotes the baseline that merges the DAE and GCN features in a half-and-half mechanism (i.e., without HWF), uses the feature of the last GCN layer (i.e., without SWF and DWF), and is optimized via the reconstruction loss and the self-optimizing embedding loss following [14, 15, 22, 24, 25] (i.e., without SSS and HSS). The second, third, and fourth rows denote the baselines that adopt HWF, HWF+SWF, and HWF+SWF+DWF, respectively. The fifth, sixth, and seventh rows denote the methods optimized by the introduced hard self-supervision, the soft self-supervision, and both, respectively.

IV-D1 Heterogeneity-wise fusion module

By comparing the first and second rows of each dataset in Table IV, we can observe that HWF can typically improve clustering performance in most cases, validating its effectiveness. For example, on Reuters, it produces a 5.01% performance improvement on ARI, 2.19% on F1, 2.00% on ACC, and 4.20% on NMI.

IV-D2 Scale-wise fusion module

We can examine the effectiveness of SWF by comparing the second and third rows of each dataset in Table IV, in which the compared results with four metrics indicate the superiority of the SWF module in most datasets.

IV-D3 Distribution-wise fusion module

By comparing the results of the third and fourth rows of each dataset in Table IV, we observe that DWF also improves clustering performance, benefiting from the adaptive fusion of the information of two distributions.

To qualitatively validate the performance of the DWF module, we plot 2D t-distributed stochastic neighbor embedding (t-SNE) [62] visualizations of the distributions $\mathbf{Q}$, $\mathbf{Z}$, and $\mathbf{F}$ on DBLP in Figure 5, where we can see that our adaptively aggregated distribution is better than the others, benefiting from adaptively (due to the DWF module) and effectively (due to the dual self-supervision solution) merging the information of the two distributions.

IV-D4 Hard self-supervision strategy

From the results of the fourth and fifth rows of each dataset in Table IV, it can be seen that on the non-graph dataset HHAR and graph dataset DBLP, there is about 2.00% improvement when involving HSS, validating its effectiveness.

IV-D5 Soft Self-supervision Strategy

We can validate the effectiveness of SSS by comparing the results of the fourth and sixth rows of each dataset in Table IV. Specifically, on DBLP, SSS produces a 13.73% improvement on ARI, 7.15% on F1, 7.37% on ACC, and 10.82% on NMI. Such an impressive improvement is credited to the fact that the SSS strategy refines the cluster assignments by minimizing a Kullback-Leibler divergence loss to promote consistent distribution alignment among the distributions $\mathbf{Q}$, $\mathbf{Z}$, and $\mathbf{P}$.

IV-D6 Dual Self-supervision (DSS)

By comparing the results of the fourth, fifth, sixth, and seventh rows of each dataset in Table IV, we can observe that DSS, which combines HSS and SSS, produces the best results on almost all nine benchmark datasets.

TABLE V: The experiments on DBLP with the learned weights from the attention modules and four metrics results of corresponding representations. The larger weight values are highlighted in bold, and the better clustering results are highlighted in red.
Sample   m_{1,1}(Z_1)  m_{1,2}(H_1)  m_{2,1}(Z_2)  m_{2,2}(H_2)  m_{3,1}(Z_3)  m_{3,2}(H_3)  u_1(Z_1)  u_2(Z_2)  u_3(Z_3)  u_4(Z_4)  u_5(Z_5)  v_1(Z)  v_2(Q)
x_0      0.6417        0.7669        0.0954        0.9954        0.7091        0.7051        0.3874    0.3896    0.4718    0.3930    0.5666    0.8980  0.4399
x_1      0.6166        0.7872        0.0918        0.9958        0.7052        0.7090        0.2780    0.2783    0.2854    0.2824    0.8271    0.8584  0.5131
x_2      0.8232        0.5678        0.2532        0.9674        0.7051        0.7091        0.2188    0.2187    0.8826    0.2216    0.2762    0.8585  0.5128
x_4054   0.6254        0.7803        0.0686        0.9976        0.7047        0.7095        0.2503    0.2505    0.2572    0.2544    0.8624    0.8582  0.5134
x_4055   0.6238        0.7816        0.0508        0.9987        0.7060        0.7082        0.2733    0.2741    0.2817    0.2782    0.8327    0.8584  0.5129
x_4056   0.5919        0.8060        0.0811        0.9967        0.7082        0.7060        0.3903    0.3926    0.5544    0.3956    0.4793    0.8979  0.4402
AVG      0.7182        0.6704        0.1310        0.9854        0.5916        0.7698        0.3162    0.3157    0.5141    0.3200    0.5821    0.8555  0.5138
ARI      0.0836        0.5539        0.3009        0.5500        0.5437        0.5489        0.0836    0.3009    0.5437    0.5453    0.5489    0.5500  0.5489
F1       0.4072        0.7972        0.5781        0.7969        0.7941        0.7964        0.4072    0.5781    0.7941    0.7944    0.7964    0.7969  0.7964
ACC      0.4212        0.8023        0.6120        0.8011        0.7981        0.8006        0.4212    0.6120    0.7981    0.7986    0.8006    0.8011  0.8006
NMI      0.1778        0.4987        0.3023        0.4996        0.4950        0.4988        0.1778    0.3023    0.4950    0.4965    0.4988    0.4995  0.4988
TABLE VI: Comparison of the clustering results, number of network parameters, and running time of the compared methods on DBLP. The best results are highlighted in bold.
Metrics         SDCN         AGCN         AGCC          Ours          Boost ↑
ARI (%)         39.15±2.01   42.49±0.31   44.40±3.79    57.29±1.20    ↑ 12.89
F1 (%)          67.71±1.51   72.80±0.56   71.84±2.02    80.79±0.61    ↑ 7.99
ACC (%)         68.05±1.81   73.26±0.37   73.45±2.16    81.26±0.62    ↑ 7.81
NMI (%)         39.50±1.34   39.68±0.42   40.36±2.81    51.99±0.76    ↑ 11.63
Parameters (M)  4.31742      4.35658      11.86304      4.35659       –
Time (s)        253.3905     273.1204     5420.5255     310.6794      –
Figure 9: Visualization of the learned representations on DBLP by (a) SDCN [24], (b) AGCN [25], (c) DFCN [47], and (d) Ours, where different colors represent different clusters.

IV-E Parameters Analysis

IV-E1 Analysis of the number of neighbors

Since the number of neighbors k̂ directly determines the quality of the constructed KNN graph, i.e., the adjacency matrix, we tested different values of k̂ on the non-graph datasets USPS, Reuters, and HHAR. From Figure 6, we observe that our model is not sensitive to k̂. In the experiments, we fixed k̂ to 3 to construct the KNN graph for the non-graph datasets.
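For reference, a KNN adjacency for a non-graph dataset can be constructed as in the following sketch (k = 3 as used in our experiments; the symmetrization step is an assumption).

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_knn_adjacency(x, k=3):
    # Build a binary KNN graph over the raw features, then symmetrize it
    # so that the resulting adjacency matrix is undirected.
    a = kneighbors_graph(x, n_neighbors=k, mode='connectivity',
                         include_self=False).toarray()
    return np.maximum(a, a.T)
```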

IV-E2 Analysis of hyperparameters

We investigated the influence of the hyperparameters λ_1, λ_2, and λ_3 on DBLP. Figure 7 illustrates the results of the four metrics in a 4D manner, where the color encodes the fourth dimension, i.e., the corresponding experimental result. From Figure 7, we have the following observations.

  • The settings of λ_1 and λ_2 are critical to the proposed model. Specifically, the best clustering results occur when λ_1 and λ_2 take similar values. This phenomenon reflects the importance of balancing the regularization terms that constrain the distribution alignment.

  • Our model is robust to the hyperparameter λ_3, i.e., our method obtains near-optimal performance over a wide and common range of λ_3.

IV-E3 Analysis of the threshold value

We investigated the effect of the threshold value r on clustering performance. Figure 8 shows the clustering results with various thresholds (i.e., 0.5, 0.6, 0.7, 0.8, and 0.9). From Figure 8, we draw the following conclusions.

  • A small threshold value unavoidably degrades the clustering performance compared with a large one. For example, when we set r to 0.5 or 0.6, all four metrics degrade on DBLP. The reason is that a small threshold easily generates many incorrect pseudo-labels.

  • A large threshold value generally leads to high clustering performance. However, setting r to an excessively large value such as 0.9 does not further improve clustering performance, because a larger threshold reduces the number of selected supervised labels, resulting in weak label propagation. Thus, we set r to 0.8 in this paper.

IV-E4 Analysis of the learned attention-aware weights

We report the learned weights on DBLP in Table V to verify the effectiveness of the designed attention mechanism, where x_j indicates the j-th sample; m_{i,1} and m_{i,2} indicate the weights learned by HWF for Z_i and H_i in the i-th layer, respectively; u_1, u_2, u_3, u_4, and u_5 indicate the weights learned by SWF; v_1 and v_2 indicate the weights learned by DWF for Z and Q, respectively; and AVG indicates the average of the weights. The clustering results of Z and Q are inferred by taking the column index of the maximum value in each row, while the results of the other features are obtained with K-means; higher clustering performance indicates a better feature representation. We can see that the representation associated with a larger weight typically yields better clustering results than the one associated with a smaller weight, substantiating the effectiveness of the designed attention mechanism in the weighted fusion.
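For clarity, the following sketch shows how the clustering results in Table V can be obtained: row-wise argmax for the distributions Z and Q, and K-means for the remaining feature representations (the K-means settings are assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans

def infer_labels(rep, n_clusters, is_distribution):
    # For the distributions Z and Q: take the column index of the row
    # maximum. For intermediate features: run K-means (n_init is an
    # assumption; any reasonable value works).
    if is_distribution:
        return np.asarray(rep).argmax(axis=1)
    return KMeans(n_clusters=n_clusters, n_init=20).fit_predict(rep)
```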

IV-F Time and Space Complexity Analysis

We repeated each experiment 10 times and report the mean values and standard deviations (i.e., mean±std), the number of parameters, and the running time of the proposed method and the baselines [24, 25, 47] on DBLP in Table VI. The experiments were implemented with Python 3.6.12 and PyTorch 1.9.0+cu102 on an NVIDIA GeForce RTX 2080 Ti GPU and an i7-8700K CPU. M and s denote millions and seconds, respectively. From Table VI, we observe that our method obtains a significant clustering improvement at the cost of acceptable resource consumption.

IV-G Visual Comparison

To qualitatively evaluate the effectiveness of the proposed method, we plotted 2D t-SNE visualizations of the baselines [24, 25, 47] and the proposed method on DBLP in Figure 9. The feature representation obtained by our method shows the best separability across clusters: samples from the same class naturally gather together, and the gaps between different groups are the most distinct. This substantiates that our method produces the most clustering-oriented representation among the compared state-of-the-art methods.
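A generic sketch of the t-SNE visualization setup used for Figures 5 and 9 follows; all plotting parameters are illustrative assumptions.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(rep, labels, path='tsne_dblp.png'):
    # Project a learned representation to 2D with t-SNE and color the
    # points by cluster label (a sketch; the figures' exact settings
    # may differ).
    emb = TSNE(n_components=2, init='pca', random_state=0).fit_transform(rep)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=3, cmap='tab10')
    plt.axis('off')
    plt.savefig(path, dpi=300, bbox_inches='tight')
```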

V Conclusion

We have presented a novel deep embedding clustering method that simultaneously enhances embedding learning and cluster assignment. Specifically, we first designed heterogeneity-wise and scale-wise fusion modules to learn an informative representation adaptively. Then, we utilized a distribution-wise fusion module to achieve cluster enhancement via an attention-based mechanism. Finally, we proposed a soft self-supervision strategy with a Kullback-Leibler divergence loss and a hard self-supervision strategy with a pseudo supervision loss to utilize the available off-the-shelf information from the cluster assignments. The quantitative and qualitative experiments and analyses demonstrate that our method consistently outperforms state-of-the-art approaches. We also provided comprehensive ablation studies to validate the effectiveness and advantage of our network.

References

  • [1] R. Vidal, Y. Ma, and S. Sastry, “Generalized principal component analysis (gpca),” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1945–1959, 2005.
  • [2] Q. Wang, J. Cheng, Q. Gao, G. Zhao, and L. Jiao, “Deep multi-view subspace clustering with unified and discriminative learning,” IEEE Transactions on Multimedia, vol. 23, pp. 3483–3493, 2020.
  • [3] Z. Dang, C. Deng, X. Yang, and H. Huang, “Multi-scale fusion subspace clustering using similarity constraint,” in CVPR, 2020, pp. 6658–6667.
  • [4] X. Wang, S. Fan, K. Kuang, C. Shi, J. Liu, and B. Wang, “Decorrelated clustering with data selection bias,” in IJCAI, 2021, pp. 2177–2183.
  • [5] Y. Jia, H. Liu, J. Hou, and Q. Zhang, “Clustering ensemble meets low-rank tensor approximation,” in AAAI, vol. 35, no. 9, 2021, pp. 7970–7978.
  • [6] S. Yang, W. Deng, M. Wang, J. Du, and J. Hu, “Orthogonality loss: learning discriminative representations for face recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 6, pp. 2301–2314, 2020.
  • [7] Y. Jia, J. Hou, and S. Kwong, “Constrained clustering with dissimilarity propagation-guided graph-laplacian pca,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2020.
  • [8] Y. Wu, L. Du, and H. Hu, “Parallel multi-path age distinguish network for cross-age face recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 9, pp. 3482–3492, 2020.
  • [9] Z. Peng, W. Zhang, N. Han, X. Fang, P. Kang, and L. Teng, “Active transfer learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 4, pp. 1022–1036, 2019.
  • [10] Y. Jia, H. Liu, J. Hou, S. Kwong, and Q. Zhang, “Multi-view spectral clustering tailored tensor low-rank representation,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • [11] Z. Peng, Y. Jia, H. Liu, J. Hou, and Q. Zhang, “Maximum entropy subspace clustering network,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • [12] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [13] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of The Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1.   Oakland, CA, USA: Berkeley, 1967, pp. 281–297.
  • [14] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in ICML.   New York, NY, USA: PMLR, 2016, pp. 478–487.
  • [15] X. Guo, L. Gao, X. Liu, and J. Yin, “Improved deep embedded clustering with local structure preservation,” in IJCAI.   Melbourne, Australia: AAAI Press, 2017, pp. 1753–1759.
  • [16] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in ICML.   PMLR, 2019, pp. 6861–6871.
  • [17] D. Kim and A. Oh, “How to find your friendly neighborhood: Graph attention design with self-supervision,” in ICLR.   Vienna, Austria: ICLR, 2021, pp. 1–14.
  • [18] L. Wu, P. Cui, J. Pei, and L. Zhao, Graph Neural Networks: Foundations, Frontiers, and Applications.   Singapore: Springer Singapore, 2022.
  • [19] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” ICLR, 2017.
  • [20] ——, “Variational graph auto-encoders,” in NIPS workshop.   Centre Convencions Internacional Barcelona, Barcelona SPAIN: NIPS, 2016, pp. 1–3.
  • [21] S. Pan, R. Hu, S.-f. Fung, G. Long, J. Jiang, and C. Zhang, “Learning graph embedding with adversarial training methods,” IEEE Transactions on Cybernetics, vol. 50, no. 6, pp. 2475–2487, 2019.
  • [22] C. Wang, S. Pan, R. Hu, G. Long, J. Jiang, and C. Zhang, “Attributed graph clustering: A deep attentional embedding approach,” in IJCAI.   Macao, China: AAAI Press, 2019, pp. 3670–3676.
  • [23] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in ICLR.   Vancouver Convention Center, Vancouver, BC, Canada: ICLR, 2018, pp. 1–12.
  • [24] D. Bo, X. Wang, C. Shi, M. Zhu, E. Lu, and P. Cui, “Structural deep clustering network,” in WWW.   Taipei Taiwan: Association for Computing Machinery, New York, NY, United States, 2020, pp. 1400–1410.
  • [25] Z. Peng, H. Liu, Y. Jia, and J. Hou, “Attention-driven graph clustering network,” in ACM MM, 2021, pp. 935–943.
  • [26] X. Dong, L. Liu, L. Zhu, Z. Cheng, and H. Zhang, “Unsupervised deep k-means hashing for efficient image retrieval and clustering,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 8, pp. 3266–3277, 2020.
  • [27] J. Huang, S. Gong, and X. Zhu, “Deep semantic clustering by partition confidence maximisation,” in CVPR, 2020, pp. 8849–8858.
  • [28] Z. Wang, Y. Zou, and Z. Zhang, “Cluster attention contrast for video anomaly detection,” in ACM MM.   Seattle, United States: ACM, 2020, pp. 2463–2471.
  • [29] K. Han, A. Vedaldi, and A. Zisserman, “Learning to discover novel visual categories via deep transfer clustering,” in ICCV.   Seoul, Korea: IEEE, 2019, pp. 8401–8409.
  • [30] Y. Ou, Z. Chen, and F. Wu, “Multimodal local-global attention network for affective video content analysis,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 5, pp. 1901–1914, 2020.
  • [31] X. Wang, Z. Chen, J. Tang, B. Luo, Y. Wang, Y. Tian, and F. Wu, “Dynamic attention guided multi-trajectory analysis for single object tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 12, pp. 4895–4908, 2021.
  • [32] Y. Liu, W. Tu, S. Zhou, X. Liu, L. Song, X. Yang, and E. Zhu, “Deep graph clustering via dual correlation reduction,” in AAAI, 2022.
  • [33] Y. Hu, Z. Song, B. Wang, J. Gao, Y. Sun, and B. Yin, “AKM3C: Adaptive k-multiple-means for multi-view clustering,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 11, pp. 4214–4226, 2021.
  • [34] Z. Zhang, J. Wang, J. Ye, and F. Wu, “Rethinking graph convolutional networks in knowledge graph completion,” in WWW, 2022, pp. 798–807.
  • [35] H. He, J. Wang, Z. Zhang, and F. Wu, “Compressing deep graph neural networks via adversarial knowledge distillation,” in ACM SIGKDD, 2022.
  • [36] X. Wang, M. Zhu, D. Bo, P. Cui, C. Shi, and J. Pei, “Am-gcn: Adaptive multi-channel graph convolutional networks,” in ACM SIGKDD.   Virtual Conference: ACM, 2020, pp. 1243–1253.
  • [37] Z. Zhang, P. Cui, and W. Zhu, “Deep learning on graphs: A survey,” IEEE Transactions on Knowledge and Data Engineering, 2020.
  • [38] B. Chen, Z. Zhang, Y. Li, G. Lu, and D. Zhang, “Multi-label chest x-ray image classification via semantic similarity graph embedding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 4, pp. 2455–2468, 2021.
  • [39] J. He, T. Zhang, Y. Zheng, M. Xu, Y. Zhang, and F. Wu, “Consistency graph modeling for semantic correspondence,” IEEE Transactions on Image Processing, vol. 30, pp. 4932–4946, 2021.
  • [40] J. Wang, Z. Zhang, Z. Shi, J. Cai, S. Ji, and F. Wu, “Duality-induced regularizer for semantic matching knowledge graph embeddings,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [41] L. Hu, Z. Dai, L. Tian, and W. Zhang, “Class-oriented self-learning graph embedding for image compact representation,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.
  • [42] A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan, “Graph embedded pose clustering for anomaly detection,” in CVPR, 2020, pp. 10539–10547.
  • [43] J. Park, M. Lee, H. J. Chang, K. Lee, and J. Y. Choi, “Symmetric graph convolutional autoencoder for unsupervised graph representation learning,” in ICCV.   Seoul, Korea: IEEE, 2019, pp. 6519–6528.
  • [44] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey,” Knowledge-Based Systems, vol. 151, pp. 78–94, 2018.
  • [45] G. Li, M. Muller, A. Thabet, and B. Ghanem, “Deepgcns: Can gcns go as deep as cnns?” in ICCV, 2019, pp. 9267–9276.
  • [46] Q. Huang, H. He, A. Singh, S.-N. Lim, and A. Benson, “Combining label propagation and simple models out-performs graph neural networks,” in ICLR.   Vienna, Austria: ICLR, 2021, pp. 1–19.
  • [47] W. Tu, S. Zhou, X. Liu, X. Guo, Z. Cai, E. Zhu, and J. Cheng, “Deep fusion clustering network,” in AAAI, vol. 35, no. 11, 2021, pp. 9978–9987.
  • [48] X. He, B. Wang, Y. Hu, J. Gao, Y. Sun, and B. Yin, “Parallelly adaptive graph convolutional clustering model,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [49] F. Helmert, “Die genauigkeit der formel von peters zur berechnung des wahrscheinlichen beobachtungsfehlers director beobachtungen gleicher genauigkeit,” Astronomische Nachrichten, vol. 88, p. 113, 1876.
  • [50] Student, “The probable error of a mean,” Biometrika, vol. 6, no. 1, pp. 1–25, 1908.
  • [51] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in AISTATS.   Fort Lauderdale, FL, USA: PMLR, 2011, pp. 315–323.
  • [52] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML, vol. 30.   Atlanta, USA: Citeseer, 2013, p. 3.
  • [53] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in AAAI, vol. 32.   Hilton New Orleans Riverside, New Orleans, Louisiana, USA: AAAI Press, 2018, pp. 1–8.
  • [54] D.-H. Lee et al., “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in Workshop on challenges in representation learning, ICML, vol. 3, no. 2, 2013, p. 896.
  • [55] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A tutorial on the cross-entropy method,” Annals of Operations Research, vol. 134, no. 1, pp. 19–67, 2005.
  • [56] J. J. Hull, “A database for handwritten text recognition research,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 5, pp. 550–554, 1994.
  • [57] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “Rcv1: A new benchmark collection for text categorization research,” Journal of Machine Learning Research, vol. 5, no. Apr, pp. 361–397, 2004.
  • [58] A. Stisen, H. Blunck, S. Bhattacharya, T. S. Prentow, M. B. Kjærgaard, A. Dey, T. Sonne, and M. M. Jensen, “Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition,” in SenSys.   New York, NY, United States: ACM, 2015, pp. 127–140.
  • [59] S. Wan, Y. Zhan, L. Liu, B. Yu, S. Pan, and C. Gong, “Contrastive graph poisson networks: Semi-supervised learning with extremely limited labels,” NIPS, vol. 34, 2021.
  • [60] K. Riesen and H. Bunke, “Iam graph database repository for graph based pattern recognition and machine learning,” in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR).   Springer, 2008, pp. 287–297.
  • [61] N. S. Altman, “An introduction to kernel and nearest-neighbor nonparametric regression,” The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.
  • [62] L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
Zhihao Peng received the B.S. and M.S. degrees in computer science and technology from Guangdong University of Technology, Guangzhou, China, in 2016 and 2019, respectively. He is currently pursuing the Ph.D. degree with the Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China. His current research interests include spectral clustering, subspace learning, and domain adaptation in image/text/graph processing with unsupervised learning.
Yuheng Jia received the B.S. degree in automation and the M.S. degree in control theory and engineering from Zhengzhou University, Zhengzhou, China, in 2012 and 2015, respectively, and the Ph.D. degree in computer science from the City University of Hong Kong, Hong Kong SAR, China, in 2019. He is currently an associate professor with the School of Computer Science and Engineering, Southeast University, China. His research interests include machine learning, Bayesian methods, spectral clustering, and low-rank modeling.
Hui Liu received the B.Sc. degree in communication engineering from Central South University, Changsha, China, the M.Eng. degree in computer science from Nanyang Technological University, Singapore, and the Ph.D. degree from the Department of Computer Science, City University of Hong Kong, Hong Kong. From 2014 to 2017, she was a Research Associate at the Maritime Institute, Nanyang Technological University. She is currently an Assistant Professor with the School of Computing & Information Sciences, Caritas Institute of Higher Education, Hong Kong. Her research interests include image processing and machine learning.
Junhui Hou (Senior Member) is an Assistant Professor with the Department of Computer Science, City University of Hong Kong. He received the B.Eng. degree in information engineering (Talented Students Program) from the South China University of Technology, Guangzhou, China, in 2009, the M.Eng. degree in signal and information processing from Northwestern Polytechnical University, Xi'an, China, in 2012, and the Ph.D. degree in electrical and electronic engineering from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, in 2016. His research interests fall into the general areas of multimedia signal processing, such as image/video/3D geometry data representation, processing and analysis, graph-based clustering/classification, and data compression. He received the Chinese Government Award for Outstanding Students Study Abroad from China Scholarship Council in 2015 and the Early Career Award (3/381) from the Hong Kong Research Grants Council in 2018. He is an elected member of IEEE MSA-TC, IEEE VSPC-TC, and IEEE MMSP-TC. He is currently an Associate Editor for IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, Signal Processing: Image Communication, and The Visual Computer. He also served as the Guest Editor for the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing and Journal of Visual Communication and Image Representation, and as an Area Chair of ACM MM’19-22, IEEE ICME’20, VCIP’20-22, ICIP’22, MMSP’22, and WACV’21.