
Deep Attention-guided Graph Clustering
with Dual Self-supervision

Zhihao Peng, Hui Liu, Yuheng Jia, and Junhui Hou. This work was supported in part by the Hong Kong UGC under grant UGC/FDS11/E02/22 and RGC under grants 11219019 and 11202320, in part by the National Natural Science Foundation of China under Grant 62106044, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20210221, and in part by the ZhiShan Youth Scholar Program from Southeast University under Grant 2242022R40015. Corresponding authors: Hui Liu and Yuheng Jia. Z. Peng and J. Hou are with the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong 999077 (e-mail: zhihapeng3-c@my.cityu.edu.hk; jh.hou@cityu.edu.hk). H. Liu is with the School of Computing & Information Sciences, Caritas Institute of Higher Education, Hong Kong (e-mail: hliu99-c@my.cityu.edu.hk). Y. Jia is with the School of Computer Science and Engineering, Southeast University, Nanjing 210096, China, and also with the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China (e-mail: yhjia@seu.edu.cn).
Abstract

Existing deep embedding clustering methods fail to sufficiently utilize the available off-the-shelf information from feature embeddings and cluster assignments, limiting their performance. To this end, we propose a novel method, namely deep attention-guided graph clustering with dual self-supervision (DAGC). Specifically, DAGC first utilizes a heterogeneity-wise fusion module to adaptively integrate the features of the auto-encoder and the graph convolutional network in each layer and then uses a scale-wise fusion module to dynamically concatenate the multi-scale features in different layers. Such modules are capable of learning an informative feature embedding via an attention-based mechanism. In addition, we design a distribution-wise fusion module that leverages cluster assignments to acquire clustering results directly. To better explore the off-the-shelf information from the cluster assignments, we develop a dual self-supervision solution consisting of a soft self-supervision strategy with a Kullback-Leibler divergence loss and a hard self-supervision strategy with a pseudo supervision loss. Extensive experiments on nine benchmark datasets validate that our method consistently outperforms state-of-the-art methods. In particular, our method improves the ARI by more than 10.29% over the best baseline. The code will be publicly available at https://github.com/ZhihaoPENG-CityU/DAGC.

Index Terms:
Unsupervised learning, deep embedding clustering, feature fusion, self-supervision.

I Introduction

Clustering is one of the fundamental tasks in data analysis, which aims to categorize samples into multiple groups according to their intrinsic similarities, and has been successfully applied to many real-world applications such as image processing [1, 2, 3, 4, 5], face recognition [6, 7, 8], and object detection [9, 10, 11]. Recently, with the boom of deep learning, numerous researchers have paid attention to deep embedding clustering analysis, which can effectively learn a clustering-friendly representation by extracting intrinsic patterns from the latent embedding space. For example, Hinton and Salakhutdinov [12] developed a deep auto-encoder (DAE) framework that first conducts embedding learning and then performs K-means [13] to obtain clustering results. Xie et al. [14] designed a deep embedding clustering method (DEC) to perform embedding learning and cluster assignment jointly. Guo et al. [15] improved DEC by introducing a reconstruction loss to preserve the data structure. Although these DAE-based approaches obtain impressive improvement, they neglect the underlying topological structure among data, which has demonstrated its importance in various works [16, 17, 18].

Recently, a series of works have been proposed to use graph convolutional networks (GCNs) [19] to exploit the topological structure information. For instance, Kipf and Welling [20] incorporated GCN into DAE and variational DAE, and proposed the graph auto-encoder (GAE) and the variational graph auto-encoder (VGAE), respectively. Pan et al. [21] designed an adversarially regularized graph auto-encoder network (ARGA) to promote GAE. Wang et al. [22] incorporated graph attention networks [23] into GAE for attributed graph clustering. Bo et al. [24] fused GCN into DEC to consider the node content and topological structure information at the same time.

However, these works still suffer from the following drawbacks. First, they equate the importance of the features extracted from DAE and GCN; e.g., in [24], the DAE and GCN features of a given layer are simply averaged. Such a simple fusion strategy is not a good choice, since those features contain different characteristic information. Second, they neglect the multi-scale information embedded in different layers, which may lead to inferior clustering results. Third, they output two probability distributions, either of which could yield the final clustering results; however, complex real-world datasets are usually agnostic and vastly different, so it is difficult to decide which distribution should be used to obtain the final clustering results. To the best of our knowledge, this is a decision-making dilemma for such deep graph clustering methods. Last but not least, the previous approaches fail to adequately exploit the available information from the high-confidence clustering assignments.

Figure 1: The overall flowchart of the proposed method, namely deep attention-guided graph clustering with dual self-supervision (DAGC). It consists of a DAE module, a heterogeneity-wise fusion (HWF) module, a scale-wise fusion (SWF) module, a distribution-wise fusion (DWF) module, a soft self-supervision (SSS) strategy, and a hard self-supervision (HSS) strategy. Specifically, HWF and SWF conduct the weighted fusion in the sum and concatenation manner, respectively, where both modules involve a multilayer perceptron, a normalization operation, and a GCN module. DWF uses a softmax function to infer a probability distribution. To achieve end-to-end self-supervision, SSS drives the soft assignments to achieve distribution alignment between the $\mathbf{Q}$ and $\mathbf{Z}$ distributions, and HSS transfers the cluster assignment to a hard one-hot encoding. The detailed architectures of the HWF and SWF modules, the DWF module, and the dual self-supervision solution are given in Figures 2, 3, and 4, respectively.

To address the above-mentioned drawbacks, we propose a novel deep embedding clustering method, focusing on exploiting the available off-the-shelf information from feature embeddings and cluster assignments. As shown in Figure 1, the proposed method consists of a heterogeneity-wise fusion (HWF) module (here, 'heterogeneity' refers to the difference in feature structure, e.g., between the DAE-based and the GCN-based feature structures), a scale-wise fusion (SWF) module, a distribution-wise fusion (DWF) module, a soft self-supervision (SSS) strategy, and a hard self-supervision (HSS) strategy. A preliminary version of this work was published in ACM Multimedia 2021 [25]; it can be regarded as a special case of the current version that focuses on embedding enhancement via the HWF and SWF modules. However, the conference paper suffers from the decision-making dilemma concerning the two probability distributions learned from DAE and GCN, i.e., which one should be selected as the final clustering assignment. In summary, the main contributions of this journal paper are as follows:

  • To handle the decision-making dilemma, we propose a learning-aware fusion module to adaptively fuse the learned data probability distributions to predict the clustering results.

  • In addition, for the two learned distributions, we improve the soft self-supervision strategy to better preserve the consistency alignment between them.

  • Moreover, for the aforementioned fused distribution, we develop a hard self-supervision strategy with a pseudo supervision loss to employ the high-confidence clustering assignments to improve clustering performance.

  • Extensive experiments on nine benchmark datasets validate that our method consistently outperforms the conference version [25]. For instance, on DBLP, our method improves the ARI by more than 14.80%. In addition, the ablation studies and visualizations quantitatively and qualitatively validate the effectiveness of this method.

We organize the rest of this paper as follows. Section II briefly reviews the related works. Section III introduces the proposed network architecture, followed by the experimental results and analyses in Section IV. Finally, we conclude this paper in Section V.

Notation: Throughout the paper, scalars are denoted by italic lowercase letters, vectors by bold lowercase letters, matrices by bold uppercase letters, and operators by calligraphic letters. Let $\mathbf{X}$ be the input data, $\mathcal{V}$ the node set, $\mathcal{E}$ the edge set, and $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathbf{X})$ the undirected graph. $\mathbf{A}\in\mathbb{R}^{n\times n}$ denotes the adjacency matrix, $\mathbf{D}\in\mathbb{R}^{n\times n}$ the degree matrix, and $\mathbf{I}\in\mathbb{R}^{n\times n}$ the identity matrix. We summarize the main notations in Table I.

TABLE I: Main notations and descriptions.
Notations | Descriptions
$\mathbf{X}\in\mathbb{R}^{n\times d}$ | The input matrix
$\hat{\mathbf{X}}\in\mathbb{R}^{n\times d}$ | The reconstructed matrix
$\mathbf{A}\in\mathbb{R}^{n\times n}$ | The adjacency matrix
$\mathbf{D}\in\mathbb{R}^{n\times n}$ | The degree matrix
$\mathbf{Z}_{i}\in\mathbb{R}^{n\times d_{i}}$ | The GCN feature from the $i$-th layer
$\mathbf{H}_{i}\in\mathbb{R}^{n\times d_{i}}$ | The encoder feature from the $i$-th layer
$\mathbf{M}_{i}\in\mathbb{R}^{n\times 2}$ | The HWF weight matrix
$\mathbf{Z}_{i}^{\prime}\in\mathbb{R}^{n\times d_{i}}$ | The HWF combined feature
$\mathbf{U}\in\mathbb{R}^{n\times(l+1)}$ | The SWF weight matrix
$\mathbf{H}\in\mathbb{R}^{n\times d_{l}}$ | The DAE extracted feature
$\mathbf{Q}\in\mathbb{R}^{n\times k}$ | The distribution obtained from DAE
$\mathbf{Z}\in\mathbb{R}^{n\times k}$ | The distribution obtained from SWF
$\mathbf{P}\in\mathbb{R}^{n\times k}$ | The auxiliary distribution
$\mathbf{V}\in\mathbb{R}^{n\times 2}$ | The DWF weight matrix
$\mathbf{F}\in\mathbb{R}^{n\times k}$ | The DWF combined feature
$n$ | The number of samples
$d$ | The dimension of $\mathbf{X}$
$d_{i}$ | The dimension of the $i$-th latent feature
$l$ | The number of network layers
$k$ | The number of clusters
$\hat{k}$ | The number of neighbors for the KNN graph
$r$ | The threshold value for pseudo supervision
$\cdot\|\cdot$ | The concatenation operation
Figure 2: Illustration of the architectures of the HWF module (left) and the SWF module (right). The HWF module fuses the GCN feature $\mathbf{Z}_{i}$ and the DAE feature $\mathbf{H}_{i}$ to obtain $\mathbf{Z}_{i+1}$ in a weighted-sum form, while the SWF module combines the multi-scale weighted features in a feature-concatenation manner. More specifically, we first learn the weights through the attention-based mechanism (the left dashed box in the triple-solid-line box) and then integrate the corresponding features through the weighted fusion (the right dashed box in the triple-solid-line box). Here, $\Downarrow$ represents the input and output actions.

II Related Work

DAE is a typical deep neural network that allows computational models composed of multiple processing layers to learn data representations with multiple levels of abstraction. Recently, benefiting from the powerful representation ability of DAE, deep embedding clustering has achieved remarkable development [26, 27, 28, 29, 30, 31, 32]. For example, Hinton and Salakhutdinov [12] used DAE to extract the feature representation of the input data, on which K-means [13] is performed to obtain the clustering results. DEC [14] jointly conducted embedding learning and cluster assignment in an iterative optimization manner. The improved DEC (IDEC) [15] enhanced the clustering performance by adding a reconstruction loss function to DEC. A series of works [2, 3, 33] introduced multi-view information into the DAE framework to further improve embedding learning. However, these DAE-based methods neglect the underlying topological structure among data, which has demonstrated its effectiveness for data clustering [16, 17, 34, 35, 18], thus limiting their performance.

Graph embedding is a new paradigm for clustering to capture the topological structure among samples [36, 37, 38, 17, 39, 40, 41], and many recent approaches [42, 43, 44, 45, 46] have explored GCN to achieve graph embedding. For instance, Kipf and Welling [20] provided the GAE and VGAE methods by incorporating GCN into the DAE and variational DAE frameworks, respectively. ARGA [21] extended GAE by introducing a designed adversarial regularization. Wang et al. [22] merged GAE and the graph attention network [23] to build a deep attentional embedding framework for attributed graph clustering (DAEGC). The structural deep clustering network (SDCN) [24] fused the node content and topological structure information to achieve deep embedding clustering. Peng et al. [25] designed the attention-driven graph clustering network (AGCN) to merge numerous features to enhance embedding learning via an adaptive mechanism. The deep fusion clustering network (DFCN) [47] exploited a designed fusion strategy to combine the DAE and GAE frameworks to merge the node attribute and topological structure information. He et al. [48] developed an adaptive graph convolutional clustering (AGCC) model to update the graph structure and the data representation layer by layer.

Optimizing a deep clustering network is a fundamental yet challenging task because there are no ground-truth labels as supervision. Previous works [14, 15, 22, 24, 25] minimize the Kullback-Leibler (KL) divergence to tackle this challenge, and its effectiveness has been proven. Specifically, it first uses the Student's t-distribution [49, 50] as a kernel function to measure the similarity between the extracted feature $\mathbf{h}_{i}$ and its corresponding centroid vector $\boldsymbol{\mu}_{j}$, in which the measured similarity can be regarded as a probability distribution $\mathbf{Q}$ with its $(i,j)$-th element being

$$q_{i,j}=\frac{(1+\|\mathbf{h}_{i}-\boldsymbol{\mu}_{j}\|^{2}/\alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j^{\prime}}(1+\|\mathbf{h}_{i}-\boldsymbol{\mu}_{j^{\prime}}\|^{2}/\alpha)^{-\frac{\alpha+1}{2}}},\qquad(1)$$

where $\alpha$ is set to $1$. Then, it minimizes the KL divergence between $\mathbf{Q}$ and a distribution $\mathbf{B}$ whose $(i,j)$-th element is $b_{i,j}=\frac{q_{i,j}^{2}/\sum_{i}q_{i,j}}{\sum_{j^{\prime}}q_{i,j^{\prime}}^{2}/\sum_{i}q_{i,j^{\prime}}}$, i.e.,

$$\min KL(\mathbf{B},\mathbf{Q})=\sum_{i}\sum_{j}b_{i,j}\log\frac{b_{i,j}}{q_{i,j}},\qquad(2)$$

where $KL(\cdot,\cdot)$ is the Kullback-Leibler divergence function that measures the distance between two distributions. Such an auxiliary distribution amplifies high-confidence probability values, enabling the network to learn from high-confidence assignments. However, previous works fail to sufficiently utilize the available off-the-shelf information from high-confidence clustering assignments, inevitably leading to inferior clustering results.
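For concreteness, a minimal PyTorch sketch of this self-optimizing scheme, Eqs. (1)-(2), is given below; the function and tensor names are ours for illustration and not taken from any released implementation:

```python
import torch

def soft_assignment(h, mu, alpha=1.0):
    # Student's t-kernel similarity between embeddings h (n x d) and
    # centroids mu (k x d), Eq. (1); rows of q sum to one.
    dist2 = torch.cdist(h, mu) ** 2
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def auxiliary_distribution(q):
    # Distribution B of Eq. (2): square q and renormalize per cluster,
    # which sharpens the high-confidence assignments.
    weight = q ** 2 / q.sum(dim=0, keepdim=True)
    return weight / weight.sum(dim=1, keepdim=True)

def kl_divergence(b, q, eps=1e-8):
    # KL(B, Q) of Eq. (2), summed over all samples and clusters.
    return (b * (torch.log(b + eps) - torch.log(q + eps))).sum()
```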

III Proposed Method

Figure 1 illustrates the architecture of the proposed method; we detail its main components in what follows.

III-A Heterogeneity-Wise Fusion

We first exploit a DAE module with a series of encoders and decoders to extract the latent representation by adopting the reconstruction loss, i.e.,

$$\mathcal{L}_{R}=\left\|\mathbf{X}-\hat{\mathbf{X}}\right\|^{2}_{F},\qquad(3)$$

where $\mathbf{X}$ and $\hat{\mathbf{X}}$ denote the input matrix and the reconstructed matrix, respectively. Here, $\mathbf{H}_{0}=\mathbf{X}$, $\hat{\mathbf{H}}_{l}=\hat{\mathbf{X}}$, $\mathbf{H}_{i}=\phi(\mathbf{W}_{i}^{e}\mathbf{H}_{i-1}+\mathbf{b}_{i}^{e})$, and $\hat{\mathbf{H}}_{i}=\phi(\mathbf{W}_{i}^{d}\hat{\mathbf{H}}_{i-1}+\mathbf{b}_{i}^{d})$, where $\mathbf{H}_{i}$ and $\hat{\mathbf{H}}_{i}$ denote the encoder and decoder outputs of the $i$-th layer, respectively, $l$ denotes the number of encoder/decoder layers, $\mathbf{W}_{i}^{e}$, $\mathbf{b}_{i}^{e}$, $\mathbf{W}_{i}^{d}$, and $\mathbf{b}_{i}^{d}$ denote the network weights and biases of the $i$-th encoder and decoder layers, respectively, and $\phi(\cdot)$ denotes an activation function, such as Tanh or ReLU [51]. For convenience, we set $\mathbf{H}=\mathbf{H}_{l}$. In addition, we denote the GCN feature learned from the $i$-th layer as $\mathbf{Z}_{i}\in\mathbb{R}^{n\times d_{i}}$ with $d_{i}$ being the dimension of the $i$-th layer, where $\mathbf{Z}_{0}=\mathbf{X}$. Previous works (e.g., SDCN [24]) combine the heterogeneity-wise representations of the $i$-th layer ($\mathbf{Z}_{i}$ and $\mathbf{H}_{i}$) via a fixed fusion strategy (i.e., $\mathbf{Z}_{i}^{\prime}=0.5\mathbf{Z}_{i}+0.5\mathbf{H}_{i}$) to enhance representation learning. However, such a fusion strategy is simple yet unreasonable, since the heterogeneity-wise representations $\mathbf{Z}_{i}$ and $\mathbf{H}_{i}$ carry different characteristic information. To this end, we propose a learning-aware fusion module that dynamically weights $\mathbf{Z}_{i}$ and $\mathbf{H}_{i}$. Specifically, to learn the corresponding attention coefficients of $\mathbf{Z}_{i}$ and $\mathbf{H}_{i}$, we first concatenate them as $[\mathbf{Z}_{i}\|\mathbf{H}_{i}]\in\mathbb{R}^{n\times 2d_{i}}$ and then build a fully connected layer parametrized by a weight matrix $\mathbf{W}_{i}^{a}\in\mathbb{R}^{2d_{i}\times 2}$. Afterwards, we apply the LeakyReLU (LReLU) [52] to the product of $[\mathbf{Z}_{i}\|\mathbf{H}_{i}]$ and $\mathbf{W}_{i}^{a}$, and normalize the output of the LReLU unit via the softmax function and $\ell_{2}$ normalization (i.e., 'softmax-$\ell_{2}$' normalization). Formally, we formulate the prediction of the corresponding attention coefficients as

$$\mathbf{M}_{i}=[\mathbf{m}_{i,1}\|\mathbf{m}_{i,2}]=\Upsilon_{A}([\mathbf{Z}_{i}\|\mathbf{H}_{i}]\mathbf{W}_{i}^{a}),\qquad(4)$$

where $\mathbf{M}_{i}\in\mathbb{R}^{n\times 2}$ is the attention coefficient matrix whose entries are greater than $0$, $\mathbf{m}_{i,1}$ and $\mathbf{m}_{i,2}$ are the weight vectors measuring the importance of $\mathbf{Z}_{i}$ and $\mathbf{H}_{i}$, respectively, and $\Upsilon_{A}(\cdot)=\ell_{2}(softmax(LReLU(\cdot)))$. Thus, we can adaptively fuse the GCN feature $\mathbf{Z}_{i}$ and the DAE feature $\mathbf{H}_{i}$ of the $i$-th layer as

$$\mathbf{Z}_{i}^{\prime}=(\mathbf{m}_{i,1}\mathbf{1}_{i})\odot\mathbf{Z}_{i}+(\mathbf{m}_{i,2}\mathbf{1}_{i})\odot\mathbf{H}_{i},\qquad(5)$$

where $\mathbf{1}_{i}\in\mathbb{R}^{1\times d_{i}}$ denotes the vector of all ones, and $\odot$ denotes the Hadamard product of matrices. Then, we use the resulting matrix $\mathbf{Z}_{i}^{\prime}\in\mathbb{R}^{n\times d_{i}}$ as the input of the $(i+1)$-th GCN layer to learn the representation $\mathbf{Z}_{i+1}$, i.e.,

$$\mathbf{Z}_{i+1}=LReLU(\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}\mathbf{Z}_{i}^{\prime}\mathbf{W}_{i}),\qquad(6)$$

where $\mathbf{W}_{i}$ denotes the weight matrix of the $i$-th GCN layer, and $\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}$ normalizes the adjacency matrix via the renormalization trick, i.e., adding a self-loop to $\mathbf{A}$ and normalizing by the corresponding degree matrix $\mathbf{D}$.
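A minimal PyTorch sketch of one HWF layer, Eqs. (4)-(6), is given below; the module and argument names are ours, and the normalized adjacency $\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}$ (a_norm) is assumed to be precomputed as a sparse tensor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HWF(nn.Module):
    # Heterogeneity-wise fusion of one layer, Eqs. (4)-(6).
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.attn = nn.Linear(2 * dim_in, 2, bias=False)     # W_i^a in Eq. (4)
        self.w_gcn = nn.Linear(dim_in, dim_out, bias=False)  # W_i in Eq. (6)

    def forward(self, z, h, a_norm):
        # Eq. (4): attention coefficients via 'softmax-l2' normalization.
        m = F.leaky_relu(self.attn(torch.cat([z, h], dim=1)), negative_slope=0.2)
        m = F.normalize(F.softmax(m, dim=1), p=2, dim=1)
        # Eq. (5): adaptively weighted sum of the GCN and DAE features.
        z_prime = m[:, :1] * z + m[:, 1:] * h
        # Eq. (6): propagate the fused feature through the next GCN layer.
        return F.leaky_relu(torch.sparse.mm(a_norm, self.w_gcn(z_prime)))
```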

III-B Scale-Wise Fusion

As aforementioned, previous works neglect the off-the-shelf multi-scale information embedded in different layers, which is of great importance for embedding learning. To this end, we propose the SWF module to concatenate the multi-scale features from different layers via an attention-based mechanism. The right part of Figure 2 shows the overall architecture of SWF.

We aggregate the multi-scale features in a concatenation manner to dynamically combine features of various scales with different dimensions. Afterwards, we build a fully connected layer parametrized by a weight matrix $\mathbf{W}^{s}\in\mathbb{R}^{(\sum_{j=1}^{l}d_{j}+d_{l})\times(l+1)}$ to capture the relationship among the multi-scale features. Formally, we formulate the whole process as

$$\mathbf{U}=\Upsilon_{A}\left(\Xi_{j=1}^{l+1}\mathbf{Z}_{j}\mathbf{W}^{s}\right),\qquad(7)$$

where $\Xi_{j=1}^{l+1}\mathbf{Z}_{j}=[\mathbf{Z}_{1}\|\cdots\|\mathbf{Z}_{l}\|\mathbf{Z}_{l+1}]$ denotes the concatenation of multiple elements. We then conduct the feature fusion as

$$\mathbf{Z}^{\prime}=\Xi_{j=1}^{l+1}\left((\mathbf{u}_{j}\mathbf{1}_{j})\odot\mathbf{Z}_{j}\right),\qquad(8)$$

where $\mathbf{u}_{j}$ is the $j$-th element of $\mathbf{U}$, i.e., $\Xi_{j=1}^{l+1}\mathbf{u}_{j}=\mathbf{U}$. In addition, we use a Laplacian smoothing operator [53] and the softmax function to transform the fused feature $\mathbf{Z}^{\prime}$ into a reasonable probability distribution, i.e.,

$$\mathbf{Z}=softmax(\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}\mathbf{Z}^{\prime}\mathbf{W}),\qquad(9)$$

where $\mathbf{W}$ denotes the learnable parameters.
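Analogously, the following sketch outlines the SWF module of Eqs. (7)-(9) (names ours; a_norm is the precomputed sparse normalized adjacency as above, dims lists the feature widths of the $l+1$ scales, and k is the number of clusters):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SWF(nn.Module):
    # Scale-wise fusion of Eqs. (7)-(9).
    def __init__(self, dims, k):
        super().__init__()
        self.attn = nn.Linear(sum(dims), len(dims), bias=False)  # W^s in Eq. (7)
        self.w = nn.Linear(sum(dims), k, bias=False)             # W in Eq. (9)

    def forward(self, zs, a_norm):
        # Eq. (7): one attention weight per scale for every sample.
        u = F.leaky_relu(self.attn(torch.cat(zs, dim=1)))
        u = F.normalize(F.softmax(u, dim=1), p=2, dim=1)
        # Eq. (8): weight each scale, then re-concatenate.
        z_prime = torch.cat([u[:, j:j + 1] * zs[j] for j in range(len(zs))], dim=1)
        # Eq. (9): Laplacian smoothing + softmax gives the distribution Z.
        return F.softmax(torch.sparse.mm(a_norm, self.w(z_prime)), dim=1)
```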

Figure 3: The architecture of the DWF module. The DWF module dynamically combines the distributions $\mathbf{Z}$ and $\mathbf{Q}$ to learn the final probability distribution, from which we can directly obtain the predicted cluster labels.

III-C Distribution-Wise Fusion

Given the feature $\mathbf{H}$ obtained from DAE, we can exploit it to calculate the cluster center embedding $\boldsymbol{\mu}$ with K-means. Afterward, we measure the similarity between the extracted feature $\mathbf{h}_{i}$ and its corresponding centroid vector $\boldsymbol{\mu}_{j}$, in which the measured similarity can be regarded as a probability distribution $\mathbf{Q}$ with its $(i,j)$-th element being $q_{i,j}=\frac{(1+\|\mathbf{h}_{i}-\boldsymbol{\mu}_{j}\|^{2})^{-1}}{\sum_{j^{\prime}}(1+\|\mathbf{h}_{i}-\boldsymbol{\mu}_{j^{\prime}}\|^{2})^{-1}}$, following Eq. (1). Both $\mathbf{Z}$ and $\mathbf{Q}$ can generate the final clustering results; however, it is challenging to decide which one to use in different scenarios. To the best of our knowledge, this is an unsolved decision-making dilemma that commonly exists in previous deep graph clustering methods. To handle this challenge, we propose the DWF module to fuse the learned probability distributions in an attention-driven manner to predict the cluster labels. Figure 3 shows the overall architecture.

Specifically, we first learn the importance of $\mathbf{Z}$ and $\mathbf{Q}$ by an attention-based mechanism, i.e.,

$$\mathbf{V}=[\mathbf{v}_{1}\|\mathbf{v}_{2}]=\Upsilon_{A}\left([\mathbf{Z}\|\mathbf{Q}]\hat{\mathbf{W}}\right),\qquad(10)$$

where $\mathbf{V}\in\mathbb{R}^{n\times 2}$ is the attention coefficient matrix, and $\hat{\mathbf{W}}$ is a weight matrix learned via a fully connected layer. We then adaptively fuse $\mathbf{Z}$ and $\mathbf{Q}$ as

$$\mathbf{F}=(\mathbf{v}_{1}\mathbf{1})\odot\mathbf{Z}+(\mathbf{v}_{2}\mathbf{1})\odot\mathbf{Q},\qquad(11)$$

where $\mathbf{1}\in\mathbb{R}^{1\times k}$ denotes the vector of all ones. Finally, we apply the softmax function to normalize $\mathbf{F}$ with

$$\mathbf{F}=softmax(\mathbf{F})\quad\mathrm{s.t.}\quad\sum_{j=1}^{k}f_{i,j}=1,\quad f_{i,j}>0,\qquad(12)$$

where $f_{i,j}$ is the $(i,j)$-th element of $\mathbf{F}$. When the network is well-trained, we can directly infer the predicted cluster labels from $\mathbf{F}$, i.e.,

$$y_{i}=\mathop{\arg\max}_{j}f_{i,j}\quad\mathrm{s.t.}\quad j=1,\cdots,k,\qquad(13)$$

where $y_{i}$ is the predicted label of $\mathbf{x}_{i}$. In this way, the cluster structure is represented explicitly in $\mathbf{F}$.
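A sketch of the DWF module of Eqs. (10)-(13), under the same naming assumptions as the sketches above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DWF(nn.Module):
    # Distribution-wise fusion of Eqs. (10)-(12); k is the number of clusters.
    def __init__(self, k):
        super().__init__()
        self.attn = nn.Linear(2 * k, 2, bias=False)  # \hat{W} in Eq. (10)

    def forward(self, z, q):
        # Eq. (10): attention over the two learned distributions.
        v = F.leaky_relu(self.attn(torch.cat([z, q], dim=1)))
        v = F.normalize(F.softmax(v, dim=1), p=2, dim=1)
        # Eqs. (11)-(12): weighted combination, renormalized into a distribution.
        return F.softmax(v[:, :1] * z + v[:, 1:] * q, dim=1)

# Eq. (13): after training, the labels follow directly from F, e.g.,
#   y_pred = dwf(z, q).argmax(dim=1)
```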

III-D Dual Self-supervision

As unsupervised clustering lacks reliable guidance, we propose a novel dual self-supervision scheme that combines a soft self-supervision strategy with a Kullback-Leibler (KL) divergence loss and a hard self-supervision strategy with a pseudo supervision loss to guide the overall network training, as illustrated in Figure 4.

III-D1 Soft Self-supervision

Since we take advantage of the high-confidence assignments to iteratively refine the clusters by utilizing the soft assignments (i.e., the probability distributions $\mathbf{Q}$ and $\mathbf{Z}$), we term this strategy soft self-supervision. Concretely, since $\mathbf{Z}$ involves the graph information through the HWF and SWF modules, we first derive an auxiliary distribution $\mathbf{P}$ from $\mathbf{Z}$ by squaring $z_{i,j}$ and normalizing per cluster, i.e.,

$$p_{i,j}=\frac{z_{i,j}^{2}/\sum_{i^{\prime}=1}^{n}z_{i^{\prime},j}}{\sum_{j^{\prime}=1}^{k}\left(z_{i,j^{\prime}}^{2}/\sum_{i^{\prime}=1}^{n}z_{i^{\prime},j^{\prime}}\right)},\qquad(14)$$

where $0\leq p_{i,j}\leq 1$ is the $(i,j)$-th element of $\mathbf{P}$. Then, we minimize the KL divergence not only between each learned distribution and the auxiliary distribution (i.e., $KL(\mathbf{P},\mathbf{Z})$ and $KL(\mathbf{P},\mathbf{Q})$), but also between the two learned distributions (i.e., $KL(\mathbf{Z},\mathbf{Q})$) to promote a highly consistent distribution alignment, i.e.,

$$\mathcal{L}_{S}=\lambda_{1}\left(KL(\mathbf{P},\mathbf{Z})+KL(\mathbf{P},\mathbf{Q})\right)+\lambda_{2}KL(\mathbf{Z},\mathbf{Q})=\lambda_{1}\sum_{i}^{n}\sum_{j}^{k}p_{i,j}\log\frac{p_{i,j}^{2}}{z_{i,j}q_{i,j}}+\lambda_{2}\sum_{i}^{n}\sum_{j}^{k}z_{i,j}\log\frac{z_{i,j}}{q_{i,j}},\qquad(15)$$

where $\lambda_{1}>0$ and $\lambda_{2}>0$ are trade-off parameters.
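A corresponding PyTorch sketch of Eqs. (14)-(15) follows (names ours; detaching $\mathbf{P}$ as a fixed target is our assumption, following common DEC-style practice):

```python
import torch

def sss_loss(z, q, lambda1, lambda2, eps=1e-8):
    # Eq. (14): auxiliary distribution P derived from Z; P is treated as a
    # fixed target (detach) so that gradients flow only through Z and Q.
    weight = z ** 2 / z.sum(dim=0, keepdim=True)
    p = (weight / weight.sum(dim=1, keepdim=True)).detach()

    def kl(b, a):  # KL(b, a), summed over samples and clusters
        return (b * (torch.log(b + eps) - torch.log(a + eps))).sum()

    # Eq. (15): align Z and Q with P, and Z with Q.
    return lambda1 * (kl(p, z) + kl(p, q)) + lambda2 * kl(z, q)
```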

III-D2 Hard Self-supervision

Although the soft self-supervision strategy is a helpful tool for unsupervised clustering, it preserves the low-confidence predicted probabilities, limiting the clustering performance. To further exploit the available off-the-shelf information from the cluster assignments, we introduce the pseudo supervision technique [54] and set the pseudo-label $\hat{y}_{i}$ as $\hat{y}_{i}=y_{i}$. Considering that the pseudo-labels may contain many incorrect labels, we select the high-confidence ones as supervisory information using a large threshold $r$, i.e.,

$$g_{i,j}=\begin{cases}1&\text{if }f_{i,j}\geq r,\\0&\text{otherwise},\end{cases}\qquad(16)$$

In the experiments, we set $r=0.8$. Then, we leverage the high-confidence pseudo-labels to supervise the network training, i.e.,

$$\mathcal{L}_{H}=\lambda_{3}\sum_{i}\sum_{j}g_{i,j}\,\Upsilon_{CE}(f_{i,j},\Upsilon_{OH}(\hat{y}_{i})),\qquad(17)$$

where $\lambda_{3}>0$ is the trade-off parameter, $\Upsilon_{CE}$ denotes the cross-entropy loss [55], and $\Upsilon_{OH}$ transforms $\hat{y}_{i}$ into its one-hot form. As shown in Figure 4, the pseudo-labels transfer the cluster assignment to a hard one-hot encoding; we thus name this strategy hard self-supervision.
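A matching sketch of the HSS loss of Eqs. (16)-(17) (function names ours; $\mathbf{F}$ is assumed to be the output of Eq. (12)):

```python
import torch
import torch.nn.functional as F

def hss_loss(f, lambda3, r=0.8, eps=1e-8):
    # Pseudo-labels and their confidences from the fused distribution F.
    conf, y_hat = f.max(dim=1)
    # Eq. (16): keep only samples whose confidence reaches the threshold r.
    mask = (conf >= r).float()
    # Eq. (17): cross-entropy against the one-hot pseudo-labels; since f is
    # already a probability distribution, CE reduces to -log f[i, y_hat_i].
    ce = F.nll_loss(torch.log(f + eps), y_hat, reduction="none")
    return lambda3 * (mask * ce).sum()
```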

In addition, we have empirically observed that using only HSS does not perform well in all scenarios. The reason may be that the distribution probability values are small in some situations, providing only weak self-supervision for guiding the network training with the HSS strategy. To this end, we combine the SSS and HSS strategies to drive the network training. Combining Eqs. (3), (15), and (17), our overall loss function can be written as

$$\mathcal{L}=\min_{\mathbf{F}}\left(\mathcal{L}_{R}+\mathcal{L}_{S}+\mathcal{L}_{H}\right).\qquad(18)$$

The whole training process is shown in Algorithm 1.

Figure 4: The illustration of the proposed dual self-supervision solution. It exploits a soft self-supervision strategy and a hard self-supervision strategy to effectively train the proposed network in an end-to-end manner. Such strategies iteratively refine the network training by learning from high-confidence assignments.
Algorithm 1 Training process of our method
Input: Input matrix $\mathbf{X}$; adjacency matrix $\mathbf{A}$; cluster number $k$; trade-off parameters $\lambda_{1},\lambda_{2},\lambda_{3}$; maximum iterations $i_{MaxIter}$;
Output: Reconstructed matrix $\hat{\mathbf{X}}$; clustering result $\mathbf{y}$;
1:  Initialization: $l=4$, $i_{Iter}=1$, $\mathbf{Z}_{0}=\mathbf{X}$, $\mathbf{H}_{0}=\mathbf{X}$;
2:  Initialize the parameters of the DAE network;
3:  while $i_{Iter}<i_{MaxIter}$ do
4:     Obtain the feature $\mathbf{H}$ by Eq. (3);
5:     Obtain the feature $\mathbf{Z}$ via Eq. (9);
6:     Obtain the cluster center embedding $\boldsymbol{\mu}$ with K-means based on the feature $\mathbf{H}$;
7:     Calculate the distribution $\mathbf{Q}$ via Eq. (1);
8:     Calculate the distribution $\mathbf{P}$ via Eq. (14);
9:     Calculate the distribution $\mathbf{F}$ via Eq. (12);
10:    Conduct the soft self-supervision via Eq. (15);
11:    Conduct the hard self-supervision via Eq. (17);
12:    Minimize the overall loss function via Eq. (18);
13:    Conduct back propagation and update the parameters of the proposed network;
14:    $i_{Iter}=i_{Iter}+1$;
15:  end while
16:  Calculate the clustering result $\mathbf{y}$ from $\mathbf{F}$ by Eq. (13);
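For reference, Algorithm 1 condenses into a full-batch PyTorch training loop like the following sketch, where model, a_norm, and the loss helpers sss_loss/hss_loss (sketched above) are illustrative names rather than the released implementation:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for it in range(max_iter):
    # Forward pass through the DAE, HWF/SWF, and DWF modules.
    x_hat, z, q, f = model(x, a_norm)
    loss_r = torch.sum((x - x_hat) ** 2)  # Eq. (3), squared Frobenius norm
    loss = loss_r + sss_loss(z, q, lambda1, lambda2) + hss_loss(f, lambda3)  # Eq. (18)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
y_pred = f.argmax(dim=1)                  # Eq. (13)
```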

III-E Computational Complexity Analysis

For the DAE module, the time complexity is $\mathcal{O}(n\sum_{i=2}^{l}d_{i-1}d_{i})$. For the GCN module, as the operation can be computed efficiently using sparse matrix computation, the time complexity is only $\mathcal{O}(|\mathcal{E}|\sum_{i=2}^{l}d_{i-1}d_{i})$ according to [21]. For Eq. (1), the time complexity is $\mathcal{O}(nk+n\log n)$ based on [14]. For the HWF, SWF, and DWF modules, the total time complexity is $\mathcal{O}(\sum_{i=1}^{l-1}d_{i})+\mathcal{O}((\sum_{i=1}^{l+1}d_{i})(l+1))+\mathcal{O}(k)$. Thus, the overall computational complexity of Algorithm 1 in one iteration is about $\mathcal{O}(n\sum_{i=2}^{l}d_{i-1}d_{i}+|\mathcal{E}|\sum_{i=2}^{l}d_{i-1}d_{i}+(n+1)k+n\log n+\sum_{i=1}^{l-1}d_{i}+(\sum_{i=1}^{l+1}d_{i})(l+1))$.

TABLE II: Description of the adopted datasets.
Dataset Type Samples Classes Edges
USPS Image 9298 10 27894
Reuters Text 10000 4 30000
HHAR Record 10299 6 30897
ACM Graph 3025 3 13128
CiteSeer Graph 3327 6 4552
DBLP Graph 4057 4 3528
Amazon Photo Graph 7650 8 119081
PubMed Graph 19717 3 44324
AIDS Graph 31385 38 64780
TABLE III: Clustering results (mean±std) with twelve compared methods on nine benchmark datasets. The best and second-best results are bolded and underlined, respectively. 'OOM' denotes the Out-Of-Memory case.
Datasets Metrics K-means [13] DAE [12] DEC [14] IDEC [15] GAE [20] VGAE [20] DAEGC [22] ARGA [21] SDCN [24] AGCN [25] DFCN [47] AGCC [48] Our
[Science06] [ICML16] [AAAI17] [NIPS16] [NIPS16] [AAAI19] [IJCAI18] [WWW20] [MM21] [AAAI21] [TNNLS22]
Reuters ARI 46.09±0.02 49.55±0.37 48.44±0.14 51.26±0.21 19.61±0.22 26.18±0.36 31.12±0.18 24.50±0.40 55.36±0.37 60.55±1.78 59.80±0.40 62.98±2.24 63.48±1.10
F1 58.33±0.03 60.96±0.22 64.25±0.22 63.21±0.12 43.53±0.42 57.14±0.17 61.82±0.13 51.10±0.20 65.48±0.08 66.16±0.64 69.60±0.10 67.21±1.61 68.81±1.26
ACC 59.98±0.02 74.90±0.21 73.58±0.13 75.43±0.14 54.40±0.27 60.85±0.23 65.50±0.13 56.20±0.20 77.15±0.21 79.30±1.07 77.70±0.20 81.65±1.52 81.68±0.69
NMI 58.86±0.01 49.69±0.29 47.50±0.34 50.28±0.17 25.92±0.41 25.51±0.22 30.55±0.29 28.70±0.30 50.82±0.21 57.83±1.01 59.90±0.40 59.56±0.94 58.94±1.16
HHAR ARI 27.95±0.38 60.36±0.88 61.25±0.51 62.83±0.45 42.63±1.63 51.47±0.73 60.38±2.15 44.70±1.00 72.84±0.09 77.07±0.66 76.40±0.10 75.58±1.85 77.38±0.97
F1 41.28±2.43 66.36±0.34 67.29±0.29 68.63±0.33 62.64±0.97 71.55±0.29 76.89±2.18 61.10±0.90 82.58±0.08 88.00±0.53 87.30±0.10 85.79±2.48 87.90±1.11
ACC 54.04±0.01 68.69±0.31 69.39±0.25 71.05±0.36 62.33±1.01 71.30±0.36 76.51±2.19 63.30±0.80 84.26±0.17 88.11±0.43 87.10±0.10 86.54±1.79 87.83±1.01
NMI 41.54±0.51 71.42±0.97 72.91±0.39 74.19±0.39 55.06±1.39 62.95±0.36 69.10±2.28 57.10±1.40 79.90±0.09 82.44±0.62 82.20±0.10 82.21±1.78 85.34±2.11
USPS ARI 54.55±0.06 58.83±0.05 63.70±0.27 67.86±0.12 50.30±0.55 40.96±0.59 63.33±0.34 51.10±0.60 71.84±0.24 73.61±0.43 75.30±0.20 68.50±3.83 75.54±1.28
F1 64.78±0.03 69.74±0.03 71.82±0.21 74.63±0.10 61.84±0.43 53.63±1.05 72.45±0.49 66.10±1.20 76.98±0.18 77.61±0.38 78.30±0.20 74.86±2.56 79.33±0.74
ACC 66.82±0.04 71.04±0.03 73.31±0.17 76.22±0.12 63.10±0.33 56.19±0.72 73.55±0.40 66.80±0.70 78.08±0.19 80.98±0.28 79.50±0.20 77.14±1.21 81.13±1.89
NMI 62.63±0.05 67.53±0.03 70.58±0.25 75.56±0.06 60.69±0.58 51.08±0.37 71.12±0.24 61.60±0.30 79.51±0.27 79.64±0.32 82.80±0.30 75.93±3.83 82.14±0.15
ACM ARI 30.60±0.69 54.64±0.16 60.64±1.87 62.16±1.50 59.46±3.10 57.72±0.67 59.35±3.89 62.90±2.10 73.91±0.40 74.20±0.38 74.90±0.40 73.73±0.90 76.72±0.98
F1 67.57±0.74 82.01±0.08 84.51±0.74 85.11±0.48 84.65±1.33 84.17±0.23 87.07±2.79 86.10±1.20 90.42±0.19 90.58±0.17 90.80±0.20 90.39±0.39 91.53±0.42
ACC 67.31±0.71 81.83±0.08 84.33±0.76 85.12±0.52 84.52±1.44 84.13±0.22 86.94±2.83 86.10±1.20 90.45±0.18 90.59±0.15 90.90±0.20 90.38±0.38 91.55±0.40
NMI 32.44±0.46 49.30±0.16 54.54±1.51 56.61±1.16 55.38±1.92 53.20±0.52 56.18±4.15 55.70±1.40 68.31±0.25 68.38±0.45 69.40±0.40 68.34±0.89 71.50±0.80
CiteSeer ARI 06.97±0.39 29.31±0.14 28.12±0.36 25.70±2.65 33.55±1.18 33.13±0.53 37.78±1.24 33.40±1.50 40.17±0.43 43.79±0.31 45.50±0.30 41.82±2.03 47.98±0.91
F1 31.92±0.27 53.80±0.11 52.62±0.17 61.62±1.39 57.36±0.82 57.70±0.49 62.20±1.32 54.80±0.80 63.62±0.24 62.37±0.21 64.30±0.20 60.47±1.57 62.37±0.52
ACC 38.65±0.65 57.08±0.13 55.89±0.20 60.49±1.42 61.35±0.80 60.97±0.36 64.54±1.39 56.90±0.70 65.96±0.31 68.79±0.23 69.50±0.20 68.08±1.44 72.01±0.53
NMI 11.45±0.38 27.64±0.08 28.34±0.30 27.17±2.40 34.63±0.65 32.69±0.27 36.41±0.86 34.50±0.80 38.71±0.32 41.54±0.30 43.90±0.20 40.86±1.45 45.34±0.70
DBLP ARI 13.43±3.02 12.21±0.43 23.92±0.39 25.37±0.60 22.02±1.40 17.92±0.07 21.03±0.52 22.70±0.30 39.15±2.01 42.49±0.31 47.00±1.50 44.40±3.79 57.29±1.20
F1 36.08±3.53 52.53±0.36 59.38±0.51 61.33±0.56 61.41±2.23 58.69±0.07 61.75±0.67 61.80±0.90 67.71±1.51 72.80±0.56 75.70±0.80 71.84±2.02 80.79±0.61
ACC 39.32±3.17 51.43±0.35 58.16±0.56 60.31±0.62 61.21±1.22 58.59±0.06 62.05±0.48 61.60±1.00 68.05±1.81 73.26±0.37 76.00±0.80 73.45±2.16 81.26±0.62
NMI 16.94±3.22 25.40±0.16 29.51±0.28 31.17±0.50 30.80±0.91 26.92±0.06 32.49±0.45 26.80±1.00 39.50±1.34 39.68±0.42 43.70±1.00 40.36±2.81 51.99±0.76
Amazon Photo ARI 05.50±0.44 20.80±0.47 18.59±0.04 19.24±0.07 48.82±4.57 56.24±4.66 59.39±0.02 44.18±4.41 31.21±1.23 41.15±2.78 58.98±0.84 29.96±3.46 60.51±1.58
F1 23.96±0.51 47.87±0.20 46.71±0.12 47.20±0.11 68.08±1.76 70.38±2.98 69.97±0.02 64.30±1.95 50.66±1.49 43.68±5.08 71.58±0.31 39.67±5.22 71.68±2.35
ACC 27.22±0.76 48.25±0.08 47.22±0.08 47.62±0.08 71.57±2.48 74.26±3.63 76.44±0.01 69.28±2.30 53.44±0.81 58.53±1.74 76.88±0.80 51.47±3.04 78.75±1.02
NMI 13.23±1.33 38.76±0.30 37.35±0.05 37.83±0.08 62.13±2.79 66.01±3.40 65.57±0.03 58.36±2.76 44.85±0.83 51.76±3.23 69.21±1.00 39.19±4.07 66.27±1.13
PubMed ARI 28.10±0.01 23.86±0.67 19.55±0.13 20.58±0.39 20.62±1.39 30.15±1.23 29.84±0.04 24.35±0.17 22.30±2.07 31.39±0.67 30.64±0.11 OOM 35.29±1.02
F1 58.88±0.01 64.01±0.29 61.49±0.10 62.41±0.32 61.37±0.85 67.68±0.89 68.23±0.02 65.69±0.13 65.01±1.21 69.73±0.45 68.10±0.07 OOM 72.78±0.72
ACC 59.83±0.01 63.07±0.31 60.14±0.09 60.70±0.34 62.09±0.81 68.48±0.77 68.73±0.03 65.26±0.12 64.20±1.30 69.67±0.42 68.89±0.07 OOM 73.16±0.69
NMI 31.05±0.02 26.32±0.57 22.44±0.14 23.67±0.29 23.84±3.54 30.61±1.71 28.26±0.03 24.80±0.17 22.87±2.04 30.96±0.99 31.43±0.13 OOM 33.29±1.14
AIDS ARI 05.37±0.19 05.71±0.66 10.71±3.49 13.39±5.35 03.50±0.79 00.44±0.43 OOM 01.79±0.97 -0.06±0.00 14.03±5.76 00.48±0.00 OOM 21.40±7.12
F1 11.86±1.02 11.91±1.54 13.81±1.60 12.10±1.55 08.40±1.23 09.13±2.74 OOM 06.01±1.41 02.02±0.00 06.14±1.80 05.51±0.01 OOM 21.77±1.10
ACC 17.11±0.63 21.78±5.02 35.12±3.69 47.32±5.76 15.72±0.89 23.04±6.08 OOM 59.27±2.54 62.25±0.00 59.82±3.37 11.28±0.02 OOM 63.84±2.81
NMI 23.42±0.57 24.29±2.49 24.30±2.19 25.37±4.90 12.66±2.64 01.62±0.29 OOM 04.86±2.16 00.16±0.00 09.08±2.28 04.67±0.01 OOM 34.44±2.96

IV Experiments

We conducted quantitative and qualitative experiments on nine commonly used benchmark datasets to evaluate the proposed model. In addition, we performed ablation studies to investigate the effectiveness of the proposed modules and the adopted strategies. Moreover, we performed a series of parameter analyses to verify the robustness of our method.

IV-A Datasets and Compared Methods

We conducted experiments on one image dataset (USPS [56]), one text dataset (Reuters [57]), one record dataset (HHAR [58]), and six graph datasets (ACM (http://dl.acm.org), CiteSeer (http://CiteSeerx.ist.psu.edu/), DBLP (https://dblp.uni-trier.de), Amazon Photo, PubMed [59], and AIDS [60]), which are briefly summarized in Table II.

We compared the proposed method with the classic clustering method K-means [13], three DAE-based embedding clustering methods [12, 14, 15], and eight GCN-based embedding clustering methods [20, 21, 22, 24, 25, 47, 48], the details of which are listed as follows.

TABLE IV: Results of ablation studies. ✓ and ✗ indicate that the corresponding component is used and unused, respectively. The best results are noted in bold.
Datasets SSS HSS DWF SWF HWF ARI F1 ACC NMI
USPS ✗ ✗ ✗ ✗ ✗ 71.67±0.44 76.88±0.30 78.08±0.30 79.19±0.44
✗ ✗ ✗ ✗ ✓ 71.71±0.87 76.46±0.54 78.98±0.97 78.87±0.36
✗ ✗ ✗ ✓ ✓ 70.96±0.24 76.44±0.17 77.70±0.14 78.61±0.22
✗ ✗ ✓ ✓ ✓ 71.73±0.71 76.31±0.34 79.63±0.43 78.41±0.29
✗ ✓ ✓ ✓ ✓ 71.90±0.99 76.61±0.56 79.74±0.79 78.64±0.46
✓ ✗ ✓ ✓ ✓ 74.39±0.13 78.51±0.09 79.11±0.11 82.06±0.16
✓ ✓ ✓ ✓ ✓ 75.54±1.28 79.33±0.74 81.13±1.89 82.14±0.15
Reuters ✗ ✗ ✗ ✗ ✗ 56.37±4.76 65.03±1.87 78.19±2.02 53.74±3.63
✗ ✗ ✗ ✗ ✓ 61.38±0.78 67.22±1.15 80.19±0.53 57.94±0.49
✗ ✗ ✗ ✓ ✓ 61.55±0.64 66.54±0.21 80.60±0.47 58.15±0.49
✗ ✗ ✓ ✓ ✓ 62.70±1.00 66.90±0.30 80.95±0.46 59.42±0.69
✗ ✓ ✓ ✓ ✓ 63.32±0.57 67.21±0.18 81.28±0.32 60.79±0.69
✓ ✗ ✓ ✓ ✓ 62.75±2.00 68.74±1.23 81.02±0.81 57.93±1.50
✓ ✓ ✓ ✓ ✓ 63.48±1.10 68.81±1.26 81.68±0.69 58.94±1.16
HHAR ✗ ✗ ✗ ✗ ✗ 73.17±1.95 82.70±3.97 84.18±2.80 80.03±1.16
✗ ✗ ✗ ✗ ✓ 72.45±1.02 83.25±0.81 84.60±0.66 79.08±0.85
✗ ✗ ✗ ✓ ✓ 73.24±0.73 83.34±1.69 84.77±1.21 80.10±0.50
✗ ✗ ✓ ✓ ✓ 72.84±1.23 83.72±1.10 84.95±0.86 79.22±0.94
✗ ✓ ✓ ✓ ✓ 73.24±0.52 83.74±0.68 85.01±0.46 79.99±0.47
✓ ✗ ✓ ✓ ✓ 75.91±0.40 86.65±0.70 86.23±0.81 82.61±0.12
✓ ✓ ✓ ✓ ✓ 77.38±0.97 87.90±1.11 87.83±1.01 85.34±2.11
ACM ✗ ✗ ✗ ✗ ✗ 73.91±0.40 90.42±0.19 90.45±0.18 68.31±0.25
✗ ✗ ✗ ✗ ✓ 73.95±0.60 90.48±0.26 90.47±0.24 68.42±0.61
✗ ✗ ✗ ✓ ✓ 74.20±0.38 90.58±0.17 90.59±0.15 68.38±0.45
✗ ✗ ✓ ✓ ✓ 74.58±0.78 90.72±0.35 90.73±0.33 68.94±0.63
✗ ✓ ✓ ✓ ✓ 74.83±0.73 90.85±0.33 90.85±0.31 69.02±0.66
✓ ✗ ✓ ✓ ✓ 75.78±0.64 91.18±0.25 91.18±0.26 70.59±0.68
✓ ✓ ✓ ✓ ✓ 76.72±0.98 91.53±0.42 91.55±0.40 71.50±0.80
CiteSeer ✗ ✗ ✗ ✗ ✗ 40.17±0.43 63.62±0.24 65.96±0.31 38.71±0.32
✗ ✗ ✗ ✗ ✓ 40.93±1.78 60.91±0.81 66.38±1.72 39.07±1.52
✗ ✗ ✗ ✓ ✓ 43.79±0.31 62.37±0.21 68.79±0.23 41.54±0.30
✗ ✗ ✓ ✓ ✓ 43.50±0.47 61.25±0.31 68.54±0.30 41.35±0.58
✗ ✓ ✓ ✓ ✓ 43.72±0.60 61.52±0.65 68.46±0.40 41.25±0.41
✓ ✗ ✓ ✓ ✓ 47.76±1.28 62.24±0.80 71.86±0.79 45.10±1.05
✓ ✓ ✓ ✓ ✓ 47.98±0.91 62.37±0.52 72.01±0.53 45.34±0.70
DBLP ✗ ✗ ✗ ✗ ✗ 39.15±2.01 67.71±1.51 68.05±1.81 39.50±1.34
✗ ✗ ✗ ✗ ✓ 37.78±1.85 68.69±1.65 69.65±1.43 35.37±1.58
✗ ✗ ✗ ✓ ✓ 42.49±0.31 72.80±0.56 73.26±0.37 39.68±0.42
✗ ✗ ✓ ✓ ✓ 41.72±0.47 72.68±0.20 72.92±0.21 39.26±0.33
✗ ✓ ✓ ✓ ✓ 42.52±0.96 72.81±0.59 73.43±0.50 39.99±0.70
✓ ✗ ✓ ✓ ✓ 55.45±0.60 79.83±0.32 80.29±0.33 50.08±0.56
✓ ✓ ✓ ✓ ✓ 57.29±1.20 80.79±0.61 81.26±0.62 51.99±0.76
Amazon Photo ✗ ✗ ✗ ✗ ✗ 31.21±1.23 50.66±1.49 53.44±0.81 44.85±0.83
✗ ✗ ✗ ✗ ✓ 37.86±3.46 35.86±4.00 54.84±1.43 46.51±4.93
✗ ✗ ✗ ✓ ✓ 41.15±2.78 43.68±5.08 58.53±1.74 51.76±3.23
✗ ✗ ✓ ✓ ✓ 41.14±2.78 43.68±5.08 58.52±1.74 51.77±3.22
✗ ✓ ✓ ✓ ✓ 43.50±2.29 46.20±4.18 60.59±1.94 52.23±1.67
✓ ✗ ✓ ✓ ✓ 51.81±2.25 66.37±2.64 71.93±2.08 59.09±1.60
✓ ✓ ✓ ✓ ✓ 60.51±1.58 71.68±2.35 78.75±1.02 66.27±1.13
PubMed ✗ ✗ ✗ ✗ ✗ 22.30±2.07 65.01±1.21 64.20±1.30 22.87±2.04
✗ ✗ ✗ ✗ ✓ 27.65±1.16 67.21±0.83 67.31±0.78 27.77±1.85
✗ ✗ ✗ ✓ ✓ 31.39±0.67 69.73±0.45 69.67±0.42 30.96±0.99
✗ ✗ ✓ ✓ ✓ 30.85±1.10 69.05±0.87 68.67±0.79 32.19±1.29
✗ ✓ ✓ ✓ ✓ 33.21±1.94 70.75±1.28 71.35±1.39 31.47±1.75
✓ ✗ ✓ ✓ ✓ 32.79±1.57 69.89±1.20 70.56±1.37 31.85±1.36
✓ ✓ ✓ ✓ ✓ 35.29±1.02 72.78±0.72 73.16±0.69 33.29±1.14
AIDS ✗ ✗ ✗ ✗ ✗ 10.15±3.04 4.76±0.69 58.32±3.33 7.67±0.48
✗ ✗ ✗ ✗ ✓ 10.31±3.52 3.92±0.67 62.25±0.01 6.68±1.13
✗ ✗ ✗ ✓ ✓ 10.69±3.33 4.33±0.99 62.32±0.21 7.54±1.40
✗ ✗ ✓ ✓ ✓ 11.85±3.35 21.19±1.39 62.29±0.11 32.31±0.40
✗ ✓ ✓ ✓ ✓ 12.78±2.61 21.65±1.51 62.25±0.00 32.34±0.35
✓ ✗ ✓ ✓ ✓ 15.34±5.94 21.37±1.60 62.61±1.15 32.46±1.01
✓ ✓ ✓ ✓ ✓ 21.40±7.12 21.77±1.10 63.84±2.81 34.44±2.96
  • DAE [12] uses deep auto-encoder to learn latent feature representations and then performs K-means on that feature to obtain clustering results.

  • DEC [14] jointly conducts embedding learning and cluster assignment with an iterative procedure.

  • IDEC [15] introduces a reconstruction loss into DEC to improve the clustering performance.

  • GAE [20] and VGAE [20] incorporate DAE and variational DAE into GCN frameworks, respectively.

  • DAEGC [22] achieves a neighbor-wise embedding learning with an attention-driven strategy and supervises the network training with a clustering loss.

  • ARGA [21] guides embedding learning with a designed adversarial regularization.

  • SDCN [24] fuses DEC and GCN to merge the topological structure information into deep embedding clustering.

  • AGCN [25] focuses on enhancing embedding learning via attention-based feature fusion.

  • DFCN [47] merges the node attribute and topological structure information based on the DAE and GAE.

  • AGCC [48] updates the graph structure layer by layer to mine the latent connectivity relationships among data.

IV-B Implementation Details

IV-B1 Evaluation metrics

We used four metrics to evaluate the clustering performance: Adjusted Rand Index (ARI), macro F1-score (F1), Accuracy (ACC), and Normalized Mutual Information (NMI). For each metric, a larger value implies a better clustering result.
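For reproducibility, these metrics can be computed with scikit-learn, plus a Hungarian matching step for ACC; this is a standard recipe rather than the paper's own evaluation code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import metrics

def clustering_accuracy(y_true, y_pred):
    # ACC needs the best one-to-one mapping between predicted clusters and
    # ground-truth classes, found with the Hungarian algorithm.
    cm = metrics.confusion_matrix(y_true, y_pred)
    row, col = linear_sum_assignment(-cm)  # maximize the matched counts
    return cm[row, col].sum() / len(y_true)

# ari = metrics.adjusted_rand_score(y_true, y_pred)
# nmi = metrics.normalized_mutual_info_score(y_true, y_pred)
# Macro F1 is computed after remapping y_pred with the matching above.
```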

IV-B2 Graph construction

As the non-graph datasets (i.e., USPS, Reuters, and HHAR) lack a topology graph, we used a typical graph construction approach to generate their graph data. Specifically, we first employed the cosine similarity to compute the similarity matrix $\mathbf{S}$, i.e.,

$$\mathbf{S}=\frac{\mathbf{X}\mathbf{X}^{\mathsf{T}}}{\left\|\mathbf{X}\right\|_{F}\left\|\mathbf{X}^{\mathsf{T}}\right\|_{F}},\qquad(19)$$

where $\left\|\mathbf{X}\right\|_{F}=\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{d}|x_{i,j}|^{2}}$ denotes the Frobenius norm and $\mathbf{X}^{\mathsf{T}}$ the transpose of $\mathbf{X}$. Then, we keep the top-$\hat{k}$ similar neighbors of each sample to construct an undirected $\hat{k}$-nearest neighbor (KNN [61]) graph. The constructed KNN graph can depict the topological structure of a dataset and hence is used as the GCN input.
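A NumPy sketch of this construction is given below (the function name is ours; since $\|\mathbf{X}\|_{F}=\|\mathbf{X}^{\mathsf{T}}\|_{F}$, the denominator of Eq. (19) is simply $\|\mathbf{X}\|_{F}^{2}$, and the global scaling does not affect the neighbor ranking):

```python
import numpy as np

def knn_graph(x, k_hat=3):
    # Eq. (19): globally normalized similarity S = X X^T / ||X||_F^2.
    s = (x @ x.T) / (np.linalg.norm(x) ** 2)
    np.fill_diagonal(s, -np.inf)             # exclude self-similarity
    idx = np.argsort(-s, axis=1)[:, :k_hat]  # top-k_hat neighbors per sample
    a = np.zeros_like(s)
    rows = np.repeat(np.arange(x.shape[0]), k_hat)
    a[rows, idx.ravel()] = 1.0
    return np.maximum(a, a.T)                # symmetrize: undirected graph
```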

IV-B3 Training Procedure

Similar to [14, 15, 24, 25], we first pre-trained the DAE module for 30 epochs with a learning rate of 0.001. Then, we trained the whole network for 200 iterations. We set the dimensions of the auto-encoder and GCN layers to 500-500-2000-10, the batch size to 256, and the negative input slope of LReLU to 0.2. In addition, we set the learning rate to 0.001 for the USPS, HHAR, ACM, DBLP, and PubMed datasets, and to 0.0001 for the Reuters, CiteSeer, and Amazon Photo datasets. We set $r$ to 0.8 in this paper; more detailed experiments and analyses of the threshold value are given in Section IV-E3. For ARGA, we used the parameter settings given in the original paper [21]. For the other methods under comparison, we directly cite the results in [25]. We repeated each experiment 10 times and report the mean values and the corresponding standard deviations (i.e., mean±std). The training procedure is implemented in PyTorch on two GPUs (GeForce RTX 2080 Ti and NVIDIA GeForce RTX 3090).

IV-C Clustering Results

Table III provides the clustering results of the proposed method and twelve compared methods with four metrics, where we have the following observations.

  • Our method achieves the best clustering results on most benchmark datasets. For example, in the non-graph dataset Reuters, our approach improves the ARI, F1, ACC, and NMI values of SDCN [24] by 8.12%, 3.33%, 4.53%, and 8.12%, respectively. In the graph dataset DBLP, our approach improves 18.14% over SDCN on ARI, 13.08% on F1, 13.21% on ACC, and 12.49% on NMI.

  • DAEGC enhances GAE by introducing the neighbor-wise embedding learning with an attention-based strategy, benefiting clustering performance improvement. Such a phenomenon validates the effectiveness of the attention-based mechanism. Differently, our method extends the attention-based mechanism to the heterogeneity-wise, scale-wise, and distribution-wise fusion modules to adaptively utilize the multiple off-the-shelf information, which significantly improves the clustering performance.

  • SDCN performs better than the DAE-based (DAE, DEC, IDEC) and GCN-based (GAE, VGAE, ARGA) embedding clustering methods, demonstrating that combining DAE and GCN can contribute to clustering performance. Nevertheless, SDCN (i) equates the importance of the DAE feature and the GCN feature; (ii) neglects the multi-scale features; and (iii) fails to utilize the available off-the-shelf information from the clustering assignment. The proposed method addresses those issues and thus produces significantly better clustering performance than SDCN on all the datasets in almost all metrics.

  • Our method typically achieves better clustering performance than AGCN [25], demonstrating the effectiveness of the proposed distribution-wise fusion module and the dual self-supervision solution in guiding the unsupervised clustering network training. For instance, on Amazon Photo, our approach improves over AGCN by 19.36% on ARI, 28.00% on F1, 20.22% on ACC, and 14.51% on NMI.

  • Our method provides a significant improvement on DBLP and PubMed; e.g., on DBLP, our approach improves over the second-best method by 10.29% on ARI, 5.09% on F1, 5.26% on ACC, and 8.29% on NMI. The possible reason is that DBLP and PubMed have low feature dimensions (i.e., little information), meaning that sufficiently utilizing the available off-the-shelf information plays an important role in improving the clustering performance.

  • Our method does not outperform AGCN on HHAR in every metric. The possible reason is that in HHAR, a number of dissimilar nodes are connected in the constructed KNN graph, reducing the graph quality. Although AGCN also uses the KNN graph, its auxiliary distribution $\mathbf{P}$ is inferred from the output of the conventional auto-encoder. Differently, the proposed method uses the graph convolutional network output to derive $\mathbf{P}$ to utilize the rich graph information of $\mathbf{Z}$. Thus, if the graph quality is poor, the clustering performance of the proposed method may be worse than that of AGCN.

  • Our method obtains the best clustering performance on AIDS with all four metrics, where AIDS is a large-scale long-tailed dataset in which one class accounts for 62.34% of the samples and the other thirty-seven classes share the remaining 37.66%.

Figure 5: The visual comparison of (a) the distribution $\mathbf{Q}$, (b) the distribution $\mathbf{Z}$, and (c) our adaptively fused distribution $\mathbf{F}$.
Figure 6: Analyses of the number of neighbors for KNN graph construction on (a) USPS, (b) Reuters, and (c) HHAR. All the sub-figures share the same legend.
Figure 7: Analysis of different hyperparameters ($\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$) with four metrics ((a) ARI, (b) F1, (c) ACC, and (d) NMI) on DBLP. We illustrate the results w.r.t. these hyperparameters in a 4D manner, where the color indicates the fourth dimension, i.e., the corresponding experimental results.
Figure 8: Investigation of the effect of the threshold value for pseudo supervision on (a) USPS, (b) Reuters, (c) HHAR, (d) ACM, (e) CiteSeer, (f) DBLP, (g) Amazon Photo, (h) PubMed, and (i) AIDS.

IV-D Ablation Study

We conducted comprehensive ablation studies to validate the effectiveness of the proposed modules and self-supervision strategies. Table IV lists the quantitative results, where the first row of each dataset denotes the baseline that merges the DAE and GCN features in a half-and-half mechanism (i.e., without HWF), uses the feature of the last GCN layer (i.e., without SWF and DWF), and is optimized via the reconstruction loss and the self-optimizing embedding loss following [14, 15, 22, 24, 25] (i.e., without SSS and HSS). The second, third, and fourth rows denote the baselines that adopt HWF, HWF+SWF, and HWF+SWF+DWF, respectively. The fifth, sixth, and seventh rows denote the methods optimized by the introduced hard self-supervision, the soft self-supervision, and both, respectively.

IV-D1 Heterogeneity-wise fusion module

By comparing the first and second rows of each dataset in Table IV, we can observe that HWF can typically improve clustering performance in most cases, validating its effectiveness. For example, on Reuters, it produces a 5.01% performance improvement on ARI, 2.19% on F1, 2.00% on ACC, and 4.20% on NMI.

IV-D2 Scale-wise fusion module

We can examine the effectiveness of SWF by comparing the second and third rows of each dataset in Table IV, in which the compared results with four metrics indicate the superiority of the SWF module in most datasets.

IV-D3 Distribution-wise fusion module

By comparing the results of the third and fourth rows of each dataset in Table IV, we observe that DWF also improves clustering performance, benefiting from the adaptive fusion of the information of two distributions.

To qualitatively validate the performance of the DWF module, we plot 2D t-distributed stochastic neighbor embedding (t-SNE) [62] visualizations of the distributions $\mathbf{Q}$, $\mathbf{Z}$, and $\mathbf{F}$ on DBLP in Figure 5, where we can see that our adaptively aggregated distribution is better than the others, benefiting from adaptively (due to the DWF module) and effectively (due to the dual self-supervision solution) merging the information of the two distributions.

IV-D4 Hard self-supervision strategy

From the results of the fourth and fifth rows of each dataset in Table IV, it can be seen that on the non-graph dataset HHAR and graph dataset DBLP, there is about 2.00% improvement when involving HSS, validating its effectiveness.

IV-D5 Soft Self-supervision Strategy

We can validate the effectiveness of SSS by comparing the results of the fourth and sixth rows of each dataset in Table IV. Specifically, on DBLP, SSS produces a 13.73% improvement on ARI, 7.15% on F1, 7.37% on ACC, and 10.82% on NMI. Such an impressive improvement is credited to the fact that the SSS strategy refines the cluster assignments by minimizing a Kullback-Leibler divergence loss to promote consistent distribution alignment among the distributions $\mathbf{Q}$, $\mathbf{Z}$, and $\mathbf{P}$.

IV-D6 Dual Self-supervision (DSS)

By comparing the results of the fourth, fifth, sixth, and seventh rows of each dataset in Table IV, we can observe that DSS, which combines HSS and SSS, produces the best results on almost all nine benchmark datasets.

TABLE V: The experiments on DBLP with the learned weights from the attention modules and four metrics results of corresponding representations. The larger weight values are highlighted in bold, and the better clustering results are highlighted in red.
Sample   m_{1,1}(Z_1)  m_{1,2}(H_1)  m_{2,1}(Z_2)  m_{2,2}(H_2)  m_{3,1}(Z_3)  m_{3,2}(H_3)  u_1(Z_1)  u_2(Z_2)  u_3(Z_3)  u_4(Z_4)  u_5(Z_5)  v_1(Z)  v_2(Q)
x_0      0.6417        0.7669        0.0954        0.9954        0.7091        0.7051        0.3874    0.3896    0.4718    0.3930    0.5666    0.8980  0.4399
x_1      0.6166        0.7872        0.0918        0.9958        0.7052        0.7090        0.2780    0.2783    0.2854    0.2824    0.8271    0.8584  0.5131
x_2      0.8232        0.5678        0.2532        0.9674        0.7051        0.7091        0.2188    0.2187    0.8826    0.2216    0.2762    0.8585  0.5128
x_4054   0.6254        0.7803        0.0686        0.9976        0.7047        0.7095        0.2503    0.2505    0.2572    0.2544    0.8624    0.8582  0.5134
x_4055   0.6238        0.7816        0.0508        0.9987        0.7060        0.7082        0.2733    0.2741    0.2817    0.2782    0.8327    0.8584  0.5129
x_4056   0.5919        0.8060        0.0811        0.9967        0.7082        0.7060        0.3903    0.3926    0.5544    0.3956    0.4793    0.8979  0.4402
AVG      0.7182        0.6704        0.1310        0.9854        0.5916        0.7698        0.3162    0.3157    0.5141    0.3200    0.5821    0.8555  0.5138
ARI      0.0836        0.5539        0.3009        0.5500        0.5437        0.5489        0.0836    0.3009    0.5437    0.5453    0.5489    0.5500  0.5489
F1       0.4072        0.7972        0.5781        0.7969        0.7941        0.7964        0.4072    0.5781    0.7941    0.7944    0.7964    0.7969  0.7964
ACC      0.4212        0.8023        0.6120        0.8011        0.7981        0.8006        0.4212    0.6120    0.7981    0.7986    0.8006    0.8011  0.8006
NMI      0.1778        0.4987        0.3023        0.4996        0.4950        0.4988        0.1778    0.3023    0.4950    0.4965    0.4988    0.4995  0.4988
TABLE VI: Comparison of the clustering results, number of network parameters, and running time of the compared methods on DBLP. The best results are highlighted in bold.
Metrics         SDCN         AGCN         AGCC          Ours          Boost ↑
ARI (%)         39.15±2.01   42.49±0.31   44.40±3.79    57.29±1.20    ↑ 12.89
F1 (%)          67.71±1.51   72.80±0.56   71.84±2.02    80.79±0.61    ↑ 7.99
ACC (%)         68.05±1.81   73.26±0.37   73.45±2.16    81.26±0.62    ↑ 7.81
NMI (%)         39.50±1.34   39.68±0.42   40.36±2.81    51.99±0.76    ↑ 11.63
Parameters (M)  4.31742      4.35658      11.86304      4.35659       –
Time (s)        253.3905     273.1204     5420.5255     310.6794      –
Figure 9: Visualization of the learned representations on DBLP by (a) SDCN [24], (b) AGCN [25], (c) DFCN [47], and (d) Ours, where different colors represent different clusters.

IV-E Parameters Analysis

IV-E1 Analysis of the number of neighbors

Since the number of neighbors k̂ directly determines the quality of the constructed KNN graph, i.e., the adjacency matrix, we tested different values of k̂ on the non-graph datasets USPS, Reuters, and HHAR. From Figure 6, we observe that our model is not sensitive to k̂. In the experiments, we fixed k̂ to 3 to construct the KNN graph for the non-graph datasets.
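For reference, a KNN adjacency for a non-graph dataset can be constructed as in the following sketch (k = 3 as used in our experiments; the symmetrization step is an assumption).

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_knn_adjacency(x, k=3):
    # Build a binary KNN graph over the raw features, then symmetrize it
    # so that the resulting adjacency matrix is undirected.
    a = kneighbors_graph(x, n_neighbors=k, mode='connectivity',
                         include_self=False).toarray()
    return np.maximum(a, a.T)
```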

IV-E2 Analysis of hyperparameters

We investigated the influence of the hyperparameters λ_1, λ_2, and λ_3 on DBLP. Figure 7 illustrates the results of the four metrics in a 4D manner, where the color encodes the fourth dimension, i.e., the corresponding experimental result. From Figure 7, we have the following observations.

  • The settings of λ_1 and λ_2 are critical to the proposed model. Specifically, the best clustering results occur when λ_1 and λ_2 take similar values. This phenomenon reflects the importance of balancing the regularization terms that constrain the distribution alignment.

  • Our model is robust to the hyperparameter λ_3, i.e., our method obtains near-optimal performance over a wide and common range of λ_3.

IV-E3 Analysis of the threshold value

We investigated the effect of the threshold value r on clustering performance. Figure 8 shows the clustering results with various thresholds (i.e., 0.5, 0.6, 0.7, 0.8, and 0.9). From Figure 8, we draw the following conclusions.

  • A small threshold value unavoidably degrades the clustering performance compared with a large one. For example, when we set r to 0.5 or 0.6, all four metrics degrade on DBLP. The reason is that a small threshold easily generates many incorrect pseudo-labels.

  • A large threshold value generally leads to high clustering performance. However, setting r to an excessively large value such as 0.9 does not further improve clustering performance, because a larger threshold reduces the number of selected supervised labels, resulting in weak label propagation. Thus, we set r to 0.8 in this paper.

IV-E4 Analysis of the learned attention-aware weights

We report the learned weights on DBLP in Table V to verify the effectiveness of the designed attention mechanism, where x_j indicates the j-th sample; m_{i,1} and m_{i,2} indicate the weights learned by HWF for Z_i and H_i in the i-th layer, respectively; u_1, u_2, u_3, u_4, and u_5 indicate the weights learned by SWF; v_1 and v_2 indicate the weights learned by DWF for Z and Q, respectively; and AVG indicates the average of the weights. The clustering results of Z and Q are inferred by taking the column index of the maximum value in each row, while the results of the other features are obtained with K-means; higher clustering performance indicates a better feature representation. We can see that the representation associated with a larger weight typically yields better clustering results than the one associated with a smaller weight, substantiating the effectiveness of the designed attention mechanism in the weighted fusion.
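For clarity, the following sketch shows how the clustering results in Table V can be obtained: row-wise argmax for the distributions Z and Q, and K-means for the remaining feature representations (the K-means settings are assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans

def infer_labels(rep, n_clusters, is_distribution):
    # For the distributions Z and Q: take the column index of the row
    # maximum. For intermediate features: run K-means (n_init is an
    # assumption; any reasonable value works).
    if is_distribution:
        return np.asarray(rep).argmax(axis=1)
    return KMeans(n_clusters=n_clusters, n_init=20).fit_predict(rep)
```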

IV-F Time and Space Complexity Analysis

We repeated each experiment 10 times and report the mean values and standard deviations (i.e., mean±std), the number of parameters, and the running time of the proposed method and the baselines [24, 25, 47] on DBLP in Table VI. The experiments were implemented with Python 3.6.12 and PyTorch 1.9.0+cu102 on an NVIDIA GeForce RTX 2080 Ti GPU and an i7-8700K CPU. M and s denote millions and seconds, respectively. From Table VI, we observe that our method obtains a significant clustering improvement at the cost of acceptable resource consumption.

IV-G Visual Comparison

To qualitatively evaluate the effectiveness of the proposed method, we plotted 2D t-SNE visualizations of the baselines [24, 25, 47] and the proposed method on DBLP in Figure 9. The feature representation obtained by our method shows the best separability across clusters: samples from the same class naturally gather together, and the gaps between different groups are the most distinct. This substantiates that our method produces the most clustering-oriented representation among the compared state-of-the-art methods.
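A generic sketch of the t-SNE visualization setup used for Figures 5 and 9 follows; all plotting parameters are illustrative assumptions.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(rep, labels, path='tsne_dblp.png'):
    # Project a learned representation to 2D with t-SNE and color the
    # points by cluster label (a sketch; the figures' exact settings
    # may differ).
    emb = TSNE(n_components=2, init='pca', random_state=0).fit_transform(rep)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=3, cmap='tab10')
    plt.axis('off')
    plt.savefig(path, dpi=300, bbox_inches='tight')
```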

V Conclusion

We have presented a novel deep embedding clustering method that simultaneously enhances embedding learning and cluster assignment. Specifically, we first designed heterogeneity-wise and scale-wise fusion modules to learn an informative representation adaptively. Then, we utilized a distribution-wise fusion module to achieve cluster enhancement via an attention-based mechanism. Finally, we proposed a soft self-supervision strategy with a Kullback-Leibler divergence loss and a hard self-supervision strategy with a pseudo supervision loss to utilize the available off-the-shelf information from the cluster assignments. The quantitative and qualitative experiments and analyses demonstrate that our method consistently outperforms state-of-the-art approaches. We also provided comprehensive ablation studies to validate the effectiveness and advantage of our network.

References

  • [1] R. Vidal, Y. Ma, and S. Sastry, “Generalized principal component analysis (gpca),” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1945–1959, 2005.
  • [2] Q. Wang, J. Cheng, Q. Gao, G. Zhao, and L. Jiao, “Deep multi-view subspace clustering with unified and discriminative learning,” IEEE Transactions on Multimedia, vol. 23, pp. 3483–3493, 2020.
  • [3] Z. Dang, C. Deng, X. Yang, and H. Huang, “Multi-scale fusion subspace clustering using similarity constraint,” in CVPR, 2020, pp. 6658–6667.
  • [4] X. Wang, S. Fan, K. Kuang, C. Shi, J. Liu, and B. Wang, “Decorrelated clustering with data selection bias,” in IJCAI, 2021, pp. 2177–2183.
  • [5] Y. Jia, H. Liu, J. Hou, and Q. Zhang, “Clustering ensemble meets low-rank tensor approximation,” in AAAI, vol. 35, no. 9, 2021, pp. 7970–7978.
  • [6] S. Yang, W. Deng, M. Wang, J. Du, and J. Hu, “Orthogonality loss: learning discriminative representations for face recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 6, pp. 2301–2314, 2020.
  • [7] Y. Jia, J. Hou, and S. Kwong, “Constrained clustering with dissimilarity propagation-guided graph-laplacian pca,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2020.
  • [8] Y. Wu, L. Du, and H. Hu, “Parallel multi-path age distinguish network for cross-age face recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 9, pp. 3482–3492, 2020.
  • [9] Z. Peng, W. Zhang, N. Han, X. Fang, P. Kang, and L. Teng, “Active transfer learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 4, pp. 1022–1036, 2019.
  • [10] Y. Jia, H. Liu, J. Hou, S. Kwong, and Q. Zhang, “Multi-view spectral clustering tailored tensor low-rank representation,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • [11] Z. Peng, Y. Jia, H. Liu, J. Hou, and Q. Zhang, “Maximum entropy subspace clustering network,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • [12] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [13] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of The Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1.   Oakland, CA, USA: Berkeley, 1967, pp. 281–297.
  • [14] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in ICML.   New York, NY, USA: PMLR, 2016, pp. 478–487.
  • [15] X. Guo, L. Gao, X. Liu, and J. Yin, “Improved deep embedded clustering with local structure preservation,” in IJCAI.   Melbourne, Australia: AAAI Press, 2017, pp. 1753–1759.
  • [16] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in ICML.   PMLR, 2019, pp. 6861–6871.
  • [17] D. Kim and A. Oh, “How to find your friendly neighborhood: Graph attention design with self-supervision,” in ICLR.   Vienna, Austria: ICLR, 2021, pp. 1–14.
  • [18] L. Wu, P. Cui, J. Pei, and L. Zhao, Graph Neural Networks: Foundations, Frontiers, and Applications.   Singapore: Springer Singapore, 2022.
  • [19] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” ICLR, 2017.
  • [20] ——, “Variational graph auto-encoders,” in NIPS workshop.   Centre Convencions Internacional Barcelona, Barcelona SPAIN: NIPS, 2016, pp. 1–3.
  • [21] S. Pan, R. Hu, S.-f. Fung, G. Long, J. Jiang, and C. Zhang, “Learning graph embedding with adversarial training methods,” IEEE Transactions on Cybernetics, vol. 50, no. 6, pp. 2475–2487, 2019.
  • [22] C. Wang, S. Pan, R. Hu, G. Long, J. Jiang, and C. Zhang, “Attributed graph clustering: A deep attentional embedding approach,” in IJCAI.   Macao, China: AAAI Press, 2019, pp. 3670–3676.
  • [23] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in ICLR.   Vancouver Convention Center, Vancouver, BC, Canada: ICLR, 2018, pp. 1–12.
  • [24] D. Bo, X. Wang, C. Shi, M. Zhu, E. Lu, and P. Cui, “Structural deep clustering network,” in WWW.   Taipei Taiwan: Association for Computing Machinery, New York, NY, United States, 2020, pp. 1400–1410.
  • [25] Z. Peng, H. Liu, Y. Jia, and J. Hou, “Attention-driven graph clustering network,” in ACM MM, 2021, pp. 935–943.
  • [26] X. Dong, L. Liu, L. Zhu, Z. Cheng, and H. Zhang, “Unsupervised deep k-means hashing for efficient image retrieval and clustering,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 8, pp. 3266–3277, 2020.
  • [27] J. Huang, S. Gong, and X. Zhu, “Deep semantic clustering by partition confidence maximisation,” in CVPR, 2020, pp. 8849–8858.
  • [28] Z. Wang, Y. Zou, and Z. Zhang, “Cluster attention contrast for video anomaly detection,” in ACM MM.   Seattle, United States: ACM, 2020, pp. 2463–2471.
  • [29] K. Han, A. Vedaldi, and A. Zisserman, “Learning to discover novel visual categories via deep transfer clustering,” in ICCV.   Seoul, Korea: IEEE, 2019, pp. 8401–8409.
  • [30] Y. Ou, Z. Chen, and F. Wu, “Multimodal local-global attention network for affective video content analysis,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 5, pp. 1901–1914, 2020.
  • [31] X. Wang, Z. Chen, J. Tang, B. Luo, Y. Wang, Y. Tian, and F. Wu, “Dynamic attention guided multi-trajectory analysis for single object tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 12, pp. 4895–4908, 2021.
  • [32] Y. Liu, W. Tu, S. Zhou, X. Liu, L. Song, X. Yang, and E. Zhu, “Deep graph clustering via dual correlation reduction,” in AAAI, 2022.
  • [33] Y. Hu, Z. Song, B. Wang, J. Gao, Y. Sun, and B. Yin, “AKM3C: Adaptive k-multiple-means for multi-view clustering,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 11, pp. 4214–4226, 2021.
  • [34] Z. Zhang, J. Wang, J. Ye, and F. Wu, “Rethinking graph convolutional networks in knowledge graph completion,” in WWW, 2022, pp. 798–807.
  • [35] H. He, J. Wang, Z. Zhang, and F. Wu, “Compressing deep graph neural networks via adversarial knowledge distillation,” in ACM SIGKDD, 2022.
  • [36] X. Wang, M. Zhu, D. Bo, P. Cui, C. Shi, and J. Pei, “Am-gcn: Adaptive multi-channel graph convolutional networks,” in ACM SIGKDD.   Virtual Conference: ACM, 2020, pp. 1243–1253.
  • [37] Z. Zhang, P. Cui, and W. Zhu, “Deep learning on graphs: A survey,” IEEE Transactions on Knowledge and Data Engineering, 2020.
  • [38] B. Chen, Z. Zhang, Y. Li, G. Lu, and D. Zhang, “Multi-label chest x-ray image classification via semantic similarity graph embedding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 4, pp. 2455–2468, 2021.
  • [39] J. He, T. Zhang, Y. Zheng, M. Xu, Y. Zhang, and F. Wu, “Consistency graph modeling for semantic correspondence,” IEEE Transactions on Image Processing, vol. 30, pp. 4932–4946, 2021.
  • [40] J. Wang, Z. Zhang, Z. Shi, J. Cai, S. Ji, and F. Wu, “Duality-induced regularizer for semantic matching knowledge graph embeddings,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [41] L. Hu, Z. Dai, L. Tian, and W. Zhang, “Class-oriented self-learning graph embedding for image compact representation,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.
  • [42] A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan, “Graph embedded pose clustering for anomaly detection,” in CVPR, 2020, pp. 10539–10547.
  • [43] J. Park, M. Lee, H. J. Chang, K. Lee, and J. Y. Choi, “Symmetric graph convolutional autoencoder for unsupervised graph representation learning,” in ICCV.   Seoul, Korea: IEEE, 2019, pp. 6519–6528.
  • [44] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey,” Knowledge-Based Systems, vol. 151, pp. 78–94, 2018.
  • [45] G. Li, M. Muller, A. Thabet, and B. Ghanem, “Deepgcns: Can gcns go as deep as cnns?” in ICCV, 2019, pp. 9267–9276.
  • [46] Q. Huang, H. He, A. Singh, S.-N. Lim, and A. Benson, “Combining label propagation and simple models out-performs graph neural networks,” in ICLR.   Vienna, Austria: ICLR, 2021, pp. 1–19.
  • [47] W. Tu, S. Zhou, X. Liu, X. Guo, Z. Cai, E. Zhu, and J. Cheng, “Deep fusion clustering network,” in AAAI, vol. 35, no. 11, 2021, pp. 9978–9987.
  • [48] X. He, B. Wang, Y. Hu, J. Gao, Y. Sun, and B. Yin, “Parallelly adaptive graph convolutional clustering model,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [49] F. Helmert, “Die genauigkeit der formel von peters zur berechnung des wahrscheinlichen beobachtungsfehlers director beobachtungen gleicher genauigkeit,” Astronomische Nachrichten, vol. 88, p. 113, 1876.
  • [50] Student, “The probable error of a mean,” Biometrika, vol. 6, no. 1, pp. 1–25, 1908.
  • [51] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in AISTATS.   Fort Lauderdale, FL, USA: PMLR, 2011, pp. 315–323.
  • [52] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML, vol. 30.   Atlanta, USA: Citeseer, 2013, p. 3.
  • [53] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in AAAI, vol. 32.   Hilton New Orleans Riverside, New Orleans, Louisiana, USA: AAAI Press, 2018, pp. 1–8.
  • [54] D.-H. Lee et al., “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in Workshop on challenges in representation learning, ICML, vol. 3, no. 2, 2013, p. 896.
  • [55] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A tutorial on the cross-entropy method,” Annals of Operations Research, vol. 134, no. 1, pp. 19–67, 2005.
  • [56] J. J. Hull, “A database for handwritten text recognition research,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 5, pp. 550–554, 1994.
  • [57] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “Rcv1: A new benchmark collection for text categorization research,” Journal of Machine Learning Research, vol. 5, no. Apr, pp. 361–397, 2004.
  • [58] A. Stisen, H. Blunck, S. Bhattacharya, T. S. Prentow, M. B. Kjærgaard, A. Dey, T. Sonne, and M. M. Jensen, “Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition,” in SenSys.   New York, NY, United States: ACM, 2015, pp. 127–140.
  • [59] S. Wan, Y. Zhan, L. Liu, B. Yu, S. Pan, and C. Gong, “Contrastive graph poisson networks: Semi-supervised learning with extremely limited labels,” NIPS, vol. 34, 2021.
  • [60] K. Riesen and H. Bunke, “Iam graph database repository for graph based pattern recognition and machine learning,” in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR).   Springer, 2008, pp. 287–297.
  • [61] N. S. Altman, “An introduction to kernel and nearest-neighbor nonparametric regression,” The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.
  • [62] L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
Zhihao Peng received the B.S. and M.S. degrees in computer science and technology from Guangdong University of Technology, Guangzhou, China, in 2016 and 2019, respectively. He is currently pursuing the Ph.D. degree with the Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China. His current research interests include spectral clustering, subspace learning, and domain adaptation in image/text/graph processing with unsupervised learning.
Yuheng Jia received the B.S. degree in automation and the M.S. degree in control theory and engineering from Zhengzhou University, Zhengzhou, China, in 2012 and 2015, respectively, and the Ph.D. degree in computer science from the City University of Hong Kong, Hong Kong SAR, China, in 2019. He is currently an associate professor with the School of Computer Science and Engineering, Southeast University, China. His research interests include machine learning, Bayesian methods, spectral clustering, and low-rank modeling.
Hui Liu received the B.Sc. degree in communication engineering from Central South University, Changsha, China, the M.Eng. degree in computer science from Nanyang Technological University, Singapore, and the Ph.D. degree from the Department of Computer Science, City University of Hong Kong, Hong Kong. From 2014 to 2017, she was a Research Associate at the Maritime Institute, Nanyang Technological University. She is currently an Assistant Professor with the School of Computing & Information Sciences, Caritas Institute of Higher Education, Hong Kong. Her research interests include image processing and machine learning.
Junhui Hou (Senior Member) is an Assistant Professor with the Department of Computer Science, City University of Hong Kong. He received the B.Eng. degree in information engineering (Talented Students Program) from the South China University of Technology, Guangzhou, China, in 2009, the M.Eng. degree in signal and information processing from Northwestern Polytechnical University, Xi'an, China, in 2012, and the Ph.D. degree in electrical and electronic engineering from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, in 2016. His research interests fall into the general areas of multimedia signal processing, such as image/video/3D geometry data representation, processing and analysis, graph-based clustering/classification, and data compression. He received the Chinese Government Award for Outstanding Students Study Abroad from China Scholarship Council in 2015 and the Early Career Award (3/381) from the Hong Kong Research Grants Council in 2018. He is an elected member of IEEE MSA-TC, IEEE VSPC-TC, and IEEE MMSP-TC. He is currently an Associate Editor for IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, Signal Processing: Image Communication, and The Visual Computer. He also served as the Guest Editor for the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing and Journal of Visual Communication and Image Representation, and as an Area Chair of ACM MM’19-22, IEEE ICME’20, VCIP’20-22, ICIP’22, MMSP’22, and WACV’21.