

A Unified Framework for Fair Spectral Clustering With Effective Graph Learning

Xiang Zhang, Qiao Wang

The authors are with the School of Information Science and Engineering, Southeast University, Nanjing 210096, China (e-mail: xiangzhang369@seu.edu.cn; qiaowang@seu.edu.cn).
Abstract

We consider the problem of spectral clustering under group fairness constraints, where samples from each sensitive group are approximately proportionally represented in each cluster. Traditional fair spectral clustering (FSC) methods consist of two consecutive stages, i.e., performing fair spectral embedding on a given graph and conducting k-means to obtain discrete cluster labels. However, in practice, the graph is usually unknown, and we need to construct the underlying graph from potentially noisy data, the quality of which inevitably affects subsequent fair clustering performance. Furthermore, performing FSC through separate steps breaks the connections among these steps, leading to suboptimal results. To this end, we first theoretically analyze the effect of the constructed graph on FSC. Motivated by the analysis, we propose a novel graph construction method with a node-adaptive graph filter to learn graphs from noisy data. Then, all independent stages of conventional FSC are integrated into a single objective function, forming an end-to-end framework that inputs raw data and outputs discrete cluster labels. An algorithm is developed to jointly and alternately update the variables in each stage. Finally, we conduct extensive experiments on synthetic, benchmark, and real data, which show that our model is superior to state-of-the-art fair clustering methods.

Index Terms:
Spectral clustering, graph learning, joint optimization, fairness constraints, spectral embedding.

I Introduction

Clustering is an unsupervised task that aims to group samples with common attributes and separate dissimilar samples. It has numerous practical applications, e.g., image processing [1], remote sensing [2], and bioinformatics [3]. Existing clustering methods include k-means [4], spectral clustering (SC) [5], and hierarchical clustering [6]. Among these methods, SC is a graph-based method utilizing the topological information of data and usually achieves better performance when handling complex high-dimensional datasets [5].

Recently, many concerns have arisen regarding fairness when performing clustering algorithms. For example, in loan applications, applicants are grouped into several clusters to support cluster-specific loan policies. However, clustering results could be affected by sensitive factors such as race and gender [7], even if the clustering algorithms do not consider sensitive attributes. Unfair clustering can lead to discriminatory outcomes, such as a specific group being more likely to be denied a loan. Therefore, there is a growing need for fair clustering methods unbiased by sensitive attributes. In the literature, [8] first introduces the notion of group fairness into clustering. As illustrated in Fig.1, given data with sensitive attributes, fair clustering aims to partition the data into clusters, where samples in every sensitive group are approximately proportionally represented in each cluster [8]. In this way, every sensitive group is treated fairly. Following [8], [9] generalizes the definition of fair clustering, [10] proposes a scalable fair clustering algorithm, and [11] applies the variational method to fair clustering. Furthermore, fairness constraints are also incorporated into deep clustering methods that leverage deep neural networks to partition data [12, 13].

Refer to caption
Figure 1: The illustration of fair clustering. Given data points of two sensitive groups (squares and circles), fair clustering partitions them into two clusters (blue and red), where samples of each group are proportionally represented in each cluster.

Here, we consider the problem of fair spectral clustering (FSC). The first work discussing FSC is [14], which designs a fairness constraint for SC according to the definition of group fairness in [8]. A scalable algorithm is proposed in [15] to solve the model in [14], and [16] considers group fairness of normalized-cut graph partitioning. In [17], individual fairness is considered in SC, which utilizes a representation graph to encode sensitive attributes and requires the neighbors of a node in the graph to be approximately proportionally represented in the clusters. Recently, [18] proposes a fair multi-view SC method. However, existing FSC models are built on a given similarity graph, which may not be available in practice. Thus, before proceeding with FSC algorithms, it is necessary to construct a graph from raw data. That is, a complete FSC method typically consists of three subtasks. First, a similarity graph is constructed from raw data. Second, spectral embedding under fairness constraints is performed on the graph to obtain a continuous cluster indicator matrix. Third, k-means is conducted on the continuous matrix to obtain discrete cluster labels.

Although feasible, the traditional FSC paradigm still has the following problems to be addressed. (i) The quality of the constructed graph inevitably affects subsequent fair clustering performance, but this has not been explored theoretically. Additionally, noisy observations make it more difficult to construct accurate graphs. (ii) The post-processing k-means discretization is sensitive to the initial cluster centers and may deviate far from the true discrete solution [19]. (iii) Performing the subtasks separately breaks the connections among graph construction, fair spectral embedding, and discretization, leading to suboptimal fair clustering results. For example, independent graph construction may fail to find the optimal graph for fair clustering [20]. Furthermore, independent spectral embedding is inferior to joint optimization of graph construction and spectral embedding [21].

To address the above issues, we propose a unified FSC model based on group fairness, which is an end-to-end framework that inputs observed data and outputs discrete cluster labels. Specifically, we first theoretically analyze how the estimated graphs affect FSC, demonstrating that accurate graphs are crucial to improve fair clustering performance. Motivated by the analysis, we propose a novel graph construction method to learn graphs from observed data under the smoothness assumption. Our approach incorporates a node-adaptive graph filter to denoise and produce smooth signals from potentially noisy data. Second, we introduce the group fairness constraint into traditional spectral embedding to guarantee fair clustering results. Third, we utilize spectral rotation instead of k-means as the discretization operation since it can produce discrete results with smaller discrepancies from the true labels. Finally, all subtasks are integrated into a single objective function to avoid the sub-optimality caused by separate optimization.

In summary, the contributions of this study are as follows.

  1. \bullet

    We theoretically analyze the impact of the estimated graph on fair clustering errors, justifying the necessity of an accurate graph to improve FSC performance. Motivated by the analysis, we propose a graph construction method to learn accurate graphs as inputs to FSC.

  2. \bullet

    We propose a unified FSC model integrating graph construction, fair spectral embedding, and discretization into a single objective function. Our model is an end-to-end framework that inputs observed data and outputs discrete fair clustering results and a similarity graph.

  3. \bullet

    We develop an algorithm to solve the objective function of our model. Compared with separate optimization, our algorithm updates all variables jointly and alternately, leading to an overall optimal solution for all subtasks.

  4. \bullet

    We conduct extensive experiments on synthetic, benchmark, and real data to test the proposed FSC model. Experimental results demonstrate that our model outperforms state-of-the-art fair clustering models.

Organization: The rest of this paper is organized as follows. Section II presents some related works. Background information is introduced in Section III. We propose our unified FSC framework in Section IV. Then, the proposed algorithm is provided in Section V. We conduct experiments to test the proposed FSC method in Section VI. Finally, concluding remarks are presented in Section VII.

Notations: Throughout this paper, vectors, matrices, and sets are written in bold lowercase, bold uppercase letters, and calligraphic uppercase letters, respectively. Given a matrix 𝐁\mathbf{B}, 𝐁[i,:],𝐁[:,j]\mathbf{B}_{[i,:]},\mathbf{B}_{[:,j]}, and 𝐁[ij]\mathbf{B}_{[ij]} denote the i-th row, the j-th column, and the (i,j) entry of 𝐁\mathbf{B}, respectively. 𝐁0\mathbf{B}\geq 0 means all elements of 𝐁\mathbf{B} are non-negative. Furthermore, diag(𝐁)\mathrm{diag}(\mathbf{B}) and diag0(𝐁)\mathrm{diag}_{\mathrm{0}}(\mathbf{B}) denote converting the diagonal elements of 𝐁\mathbf{B} to a vector and setting the diagonal entries of 𝐁\mathbf{B} to zero, respectively. The vectors 𝟏\mathbf{1}, 𝟎\mathbf{0}, and matrix 𝐈\mathbf{I} represent all-one vectors, all-zero vectors, and identity matrices, respectively. Moreover, F\lVert\cdot\rVert_{\mathrm{F}}, 1,1\lVert\cdot\rVert_{1,1}, and q\lVert\cdot\rVert_{q} are the Frobenius norm, element-wise 1\ell_{1} norm, and q\ell_{q} norm of a vector (matrix), respectively. The notations {\dagger}, \circ, and Tr()\mathrm{Tr}(\cdot) denote the pseudo-inverse, the Hadamard product, and the trace operator, respectively. Given a set \mathcal{B}, |||\mathcal{B}| is the number of elements in \mathcal{B}. Finally, \mathbb{R} and 𝕊\mathbb{S} denote the domains of real numbers and symmetric matrices, whose dimensions depend on the context.

II Related Work

II-A Graph Learning Methods For (Fair) SC

Graph learning (GL) aims to infer the graph topology behind observed data, a prerequisite step for (fair) SC when similarity graphs are unavailable. Traditionally, graphs are constructed via simple rules, such as k-nearest-neighbor (k-NN) and ε-nearest-neighbor (ε-NN) construction [22], or sample correlation methods like Pearson correlation (PC). These methods may be limited in capturing similarity relationships between data pairs [23]. Thus, many works attempt to learn graphs from data adaptively, including the sparse representation (SR) method [24] and the low-rank representation method [25]. The emergence of adaptive neighbourhood graph learning (ANGL) [26] provides a new approach that measures the similarity between two samples by the probability that they are adjacent. In [27], a possibilistic neighbourhood graph, an improved version of [26], is proposed. Recently, with the rise of graph signal processing (GSP) [28], many works attempt to learn graphs from the perspective of signal processing. One of the widely-used GSP-based GL methods postulates that signals are smooth over the corresponding graphs [29]. Intuitively, a smooth graph signal means the signal values of two connected nodes are similar [30], which is also a fundamental principle of SC [5]. Many methods are dedicated to learning graphs from smooth signals [31]. However, to the best of our knowledge, applying smoothness-based GL to SC has yet to be thoroughly explored, let alone to FSC.

II-B Unified SC Models

Many works focus on establishing unified models for SC, which can be roughly divided into three categories. The first category integrates graph construction and spectral embedding [21, 32, 26] and uses an independent discretization step as post-processing. The second category assumes a given similarity graph and integrates spectral embedding and discretization [33, 34, 35]. The third category integrates all three stages into a single objective function [20, 23, 36, 37, 38]. Our model differs from these models in two main ways. (i) Our framework utilizes a new graph construction method. (ii) We further consider fairness issues in clustering tasks.

III Background

This section presents background information, including SC under group fairness constraints and spectral rotation.

III-A SC Under Group Fairness Constraints

Given an undirected graph 𝒢={𝒱,}\mathcal{G}=\{\mathcal{V},\mathcal{E}\} of DD vertices, where 𝒱\mathcal{V} and \mathcal{E} are the sets of vertices and edges of 𝒢\mathcal{G}, respectively, its adjacency matrix 𝐖𝕊D×D\mathbf{W}\in\mathbb{S}^{D\times D} is a symmetric matrix with zero diagonal entries and non-negative off-diagonal elements if the graph has non-negative edge weights and no self-loops. The Laplacian matrix of 𝒢\mathcal{G} is defined as 𝐋=𝐃𝐖\mathbf{L}=\mathbf{D}-\mathbf{W}, where 𝐃𝕊D×D\mathbf{D}\in\mathbb{S}^{D\times D} is a diagonal matrix satisfying 𝐃[ii]=j=1D𝐖[ij]\mathbf{D}_{[ii]}=\sum_{j=1}^{D}\mathbf{W}_{[ij]}. Unnormalized SC aims to partition DD nodes into KK disjoint clusters 𝒞1,,𝒞K\mathcal{C}_{1},...,\mathcal{C}_{K}, where 𝒱=𝒞1𝒞K\mathcal{V}=\mathcal{C}_{1}\cup...\cup\mathcal{C}_{K}, and 𝒞k\mathcal{C}_{k} is the set containing nodes in the kk–th cluster. The problem of unnormalized SC is equivalent to minimizing the RatioCut\mathrm{RatioCut} objective function [5], i.e.,

RatioCut(𝒞1,,𝒞K)=k=1KCut(𝒞k,𝒱𝒞k)|𝒞k|,\displaystyle\mathrm{RatioCut}(\mathcal{C}_{1},...,\mathcal{C}_{K})=\sum_{k=1}^{K}\frac{\mathrm{Cut}(\mathcal{C}_{k},\mathcal{V}\setminus\mathcal{C}_{k})}{|\mathcal{C}_{k}|}, (1)

where 𝒱𝒞k\mathcal{V}\setminus\mathcal{C}_{k} contains all nodes in 𝒱\mathcal{V} except those in 𝒞k\mathcal{C}_{k}, and

Cut(𝒞k,𝒱𝒞k)=i𝒞k,j𝒱𝒞k𝐖[ij].\displaystyle\mathrm{Cut}(\mathcal{C}_{k},\mathcal{V}\setminus\mathcal{C}_{k})=\sum_{i\in\mathcal{C}_{k},j\in\mathcal{V}\setminus\mathcal{C}_{k}}\mathbf{W}_{[ij]}. (2)

Let 𝐔~D×K\widetilde{\mathbf{U}}\in\mathbb{R}^{D\times K} be

𝐔~[ik]={1|𝒞k|i𝒞k0i𝒞k.\displaystyle\widetilde{\mathbf{U}}_{[ik]}=\begin{cases}\frac{1}{\sqrt{|\mathcal{C}_{k}|}}&i\in\mathcal{C}_{k}\\ 0&i\notin\mathcal{C}_{k}\end{cases}. (3)

Then, minimizing the RatioCut\mathrm{RatioCut} objective function (1) is equivalent to solving the following problem [5]

min𝐔~Tr(𝐔~𝐋𝐔~),s.t.𝐔~is of form (3).\displaystyle\underset{\widetilde{\mathbf{U}}}{\min}\;\mathrm{Tr}(\widetilde{\mathbf{U}}^{\top}\mathbf{L}\widetilde{\mathbf{U}}),\;\;\mathrm{s.t.}\;\widetilde{\mathbf{U}}\;\text{is of form \eqref{eq-prelim-U}}. (4)

Due to the discrete constraint of (3), problem (4) is NP-hard. In practice, problem (4) is usually relaxed to

min𝐔Tr(𝐔𝐋𝐔),s.t.𝐔𝐔=𝐈,\displaystyle\underset{\mathbf{U}}{\min}\;\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}),\;\;\mathrm{s.t.}\;\mathbf{U}^{\top}\mathbf{U}=\mathbf{I}, (5)

where 𝐔D×K\mathbf{U}\in\mathbb{R}^{D\times K} is a relaxed continuous clustering label matrix, and 𝐔𝐔=𝐈\mathbf{U}^{\top}\mathbf{U}=\mathbf{I} is adopted to avoid trivial solutions. The process of solving (5) is called spectral embedding. After obtaining 𝐔\mathbf{U}^{*}, a common practice is to apply k-means to the rows of 𝐔\mathbf{U}^{*} to yield final discrete clustering labels 𝐐\mathbf{Q}, where 𝐐{0,1}D×K\mathbf{Q}\in\{0,1\}^{D\times K} is a binary cluster indicator matrix. The only non-zero element of the i-th row of 𝐐\mathbf{Q} indicates the cluster membership of the i-th node of 𝒢\mathcal{G}.

Fair spectral clustering groups the vertices of 𝒢\mathcal{G} by considering fairness. If the nodes of 𝒢\mathcal{G} belong to SS sensitive groups 𝒟1,,𝒟S\mathcal{D}_{1},...,\mathcal{D}_{S}, where 𝒟s\mathcal{D}_{s} contains the nodes of the ss-th sensitive group, we define the Balance\mathrm{Balance} of cluster 𝒞k\mathcal{C}_{k} as [8]

Balance(𝒞k)=minss[S]|𝒟s𝒞k||𝒟s𝒞k|[0,1],\displaystyle\mathrm{Balance}(\mathcal{C}_{k})=\underset{s\neq s^{\prime}\in[S]}{\min}\;\frac{\left\lvert\mathcal{D}_{s}\cap\mathcal{C}_{k}\right\rvert}{\left\lvert\mathcal{D}_{s^{\prime}}\cap\mathcal{C}_{k}\right\rvert}\in[0,1], (6)

where [S]:={1,,S}[S]:=\{1,...,S\}. The higher the Balance\mathrm{Balance} of each cluster, the fairer the clustering [8]. It is not difficult to check that mink[K]Balance(𝒞k)minss[S]|𝒟s|/|𝒟s|\underset{k\in[K]}{\min}\mathrm{Balance}(\mathcal{C}_{k})\leq\underset{s\neq s^{\prime}\in[S]}{\min}\;|\mathcal{D}_{s}|/|\mathcal{D}_{s^{\prime}}|. Thus, this notion of fairness is asking for a clustering where the fraction of different sensitive groups in each cluster is approximately the same as that of the entire dataset 𝒱\mathcal{V} [14], which is also called group fairness. To incorporate this fairness notion into SC, a group-membership vector 𝐟s{0,1}D\mathbf{f}_{s}\in\{0,1\}^{D} of 𝒟s\mathcal{D}_{s} is defined, where (𝐟s)[i]=1(\mathbf{f}_{s})_{[i]}=1 if i𝒟si\in\mathcal{D}_{s} and (𝐟s)[i]=0(\mathbf{f}_{s})_{[i]}=0 otherwise, for s[S]s\in[S] and i[D]i\in[D]. Then, we have the following lemma.

Lemma 1.

(Fairness constraint as linear constraint on 𝐔~\widetilde{\mathbf{U}} [14]) Let 𝒱=𝒞1𝒞K\mathcal{V}=\mathcal{C}_{1}\cup...\cup\mathcal{C}_{K} be a clustering that is encoded as in (3). We have, for every k[K]k\in[K]

s[S]:|𝒟s𝒞k||𝒞k|=|𝒟s|D𝐅𝐔~=𝟎,\displaystyle\forall s\in[S]:\frac{|\mathcal{D}_{s}\cap\mathcal{C}_{k}|}{|\mathcal{C}_{k}|}=\frac{|\mathcal{D}_{s}|}{D}\Leftrightarrow\mathbf{F}^{\top}\widetilde{\mathbf{U}}=\mathbf{0}, (7)

where 𝐅D×(S1)\mathbf{F}\in\mathbb{R}^{D\times(S-1)} is a matrix satisfying 𝐅[:,s]=𝐟s(|𝒟s|/D)𝟏,s[S1]\mathbf{F}_{[:,s]}=\mathbf{f}_{s}-(|\mathcal{D}_{s}|/D)\cdot\mathbf{1},s\in[S-1].

Lemma 1 states that the proportional representation of all sensitive attribute samples in each cluster can be interpreted as a linear constraint 𝐅𝐔~=𝟎\mathbf{F}^{\top}\widetilde{\mathbf{U}}=\mathbf{0}. Under this fairness constraint, unnormalized SC is equivalent to the following problem

min𝐔~Tr(𝐔~𝐋𝐔~),s.t.𝐔~is of form (3),𝐅𝐔~=𝟎.\displaystyle\underset{\widetilde{\mathbf{U}}}{\min}\;\mathrm{Tr}(\widetilde{\mathbf{U}}^{\top}\mathbf{L}\widetilde{\mathbf{U}}),\;\;\mathrm{s.t.}\;\widetilde{\mathbf{U}}\;\text{is of form \eqref{eq-prelim-U}},\;\mathbf{F}^{\top}\widetilde{\mathbf{U}}=\mathbf{0}. (8)

Similarly, we can relax (8) to

min𝐔Tr(𝐔𝐋𝐔),s.t.𝐔𝐔=𝐈,𝐅𝐔=𝟎.\displaystyle\underset{\mathbf{U}}{\min}\;\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}),\;\;\mathrm{s.t.}\;\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\;\mathbf{F}^{\top}\mathbf{U}=\mathbf{0}. (9)

Following traditional SC, existing FSC models perform k-means on the rows of 𝐔\mathbf{U} to obtain discrete cluster labels 𝐐\mathbf{Q}.
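For concreteness, the following Python/NumPy sketch (with hypothetical helper names, not our released implementation) illustrates the pipeline of (9) followed by k-means, using the nullspace construction of [14]: build 𝐅 from the group-membership vectors of Lemma 1, take an orthonormal basis of the nullspace of 𝐅ᵀ, and compute the eigenvectors of the K smallest eigenvalues of the reduced Laplacian.

import numpy as np
from scipy.linalg import null_space, eigh
from sklearn.cluster import KMeans

def fair_spectral_clustering(L, groups, K):
    """Fair spectral embedding (9) via the nullspace of F^T, then k-means.

    L      : (D, D) graph Laplacian
    groups : length-D integer array with values in {0, ..., S-1}, S >= 2
    K      : number of clusters
    """
    # F_[:, s] = f_s - (|D_s| / D) * 1 for s = 0, ..., S - 2 (Lemma 1)
    S = groups.max() + 1
    F = np.stack([(groups == s).astype(float) - np.mean(groups == s)
                  for s in range(S - 1)], axis=1)
    Z = null_space(F.T)                          # orthonormal basis of null(F^T)
    M = Z.T @ L @ Z                              # reduced ("fair") Laplacian
    _, Y = eigh(M, subset_by_index=[0, K - 1])   # K smallest eigenpairs
    U = Z @ Y                                    # continuous fair indicator matrix
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(U)
    return labels, U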

III-B Spectral Rotation

Spectral rotation [19] is an alternative to k-means for obtaining discrete clustering results from continuous labels 𝐔\mathbf{U}, which is formulated as

min𝐐,𝐑𝐐𝐔𝐑F2,\displaystyle\underset{\mathbf{Q},\mathbf{R}}{\min}\;\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}, s.t.𝐑𝐑=𝐈,𝐐,\displaystyle\mathrm{s.t.}\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}, (10)

where the set \mathcal{I} contains all discrete cluster indicator matrices, and 𝐑K×K\mathbf{R}\in\mathbb{R}^{K\times K} is an orthonormal matrix. According to the spectral solution invariance property [19], if 𝐔\mathbf{U} is a solution of (5), 𝐔𝐑\mathbf{U}\mathbf{R} is another solution. A suitable 𝐑\mathbf{R} can make 𝐔𝐑\mathbf{U}\mathbf{R} as close to 𝐐\mathbf{Q} as possible. In contrast, k-means is performed directly on the 𝐔\mathbf{U} obtained from spectral embedding, which may deviate far from the true discrete solution. Thus, spectral rotation usually achieves better performance than k-means [19].
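As a minimal illustration (assuming a given continuous matrix U; variable names are hypothetical), spectral rotation can be carried out by alternating a row-wise argmax step for Q and an orthogonal Procrustes step for R, mirroring the updates derived later in Section V:

import numpy as np

def spectral_rotation(U, n_iter=30):
    """Alternately solve (10) for Q and R given continuous labels U (D x K)."""
    D, K = U.shape
    R = np.eye(K)
    for _ in range(n_iter):
        # Q-step: row i of Q has a single 1 at argmax_j (U R)_{ij}
        idx = np.argmax(U @ R, axis=1)
        Q = np.zeros((D, K))
        Q[np.arange(D), idx] = 1.0
        # R-step: orthogonal Procrustes on Q^T U
        Tl, _, Tr = np.linalg.svd(Q.T @ U)
        R = Tr.T @ Tl.T
    return Q, R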

IV Model Formulation

In this section, we first theoretically analyze the impact of the constructed graph on FSC, which justifies an accurate graph for improving FSC performance. Then, we propose a novel graph construction method to learn graphs from potentially noisy observed data. Next, we integrate graph construction, fair spectral embedding, and discretization into an end-to-end framework. Finally, we analyze the connections between our model and existing works.

IV-A Why Do We Need an Accurate Graph?

We first introduce a variant of the stochastic block model [39] (vSBM) to generate random graphs with cluster structures and sensitive attributes [14]. This model assumes that there are two or more meaningful ground-truth clusterings of the observed data, and only one of them is fair. Assume that 𝒱\mathcal{V} comprises SS sensitive groups and is partitioned into KK clusters such that |𝒟s𝒞k|/|𝒞k|=ζs,s[S],k[K]|\mathcal{D}_{s}\cap\mathcal{C}_{k}|/|\mathcal{C}_{k}|=\zeta_{s},s\in[S],k\in[K], for ζs(0,1)\zeta_{s}\in(0,1) with s=1Sζs=1\sum_{s=1}^{S}\zeta_{s}=1. Based on the clusters and sensitive groups, we construct a random graph by connecting two vertices ii and jj with a probability Pr(i,j)\mathrm{Pr}(i,j) that depends on the clusters and sensitive groups of ii and jj. We define

Pr(i,j)={a,πC(i)=πC(j),πS(i)=πS(j)b,πC(i)πC(j),πS(i)=πS(j)c,πC(i)=πC(j),πS(i)πS(j)d,πC(i)πC(j),πS(i)πS(j),\displaystyle\mathrm{Pr}(i,j)=\begin{cases}a,&\pi_{C}(i)=\pi_{C}(j),\;\pi_{S}(i)=\pi_{S}(j)\\ b,&\pi_{C}(i)\neq\pi_{C}(j),\;\pi_{S}(i)=\pi_{S}(j)\\ c,&\pi_{C}(i)=\pi_{C}(j),\;\pi_{S}(i)\neq\pi_{S}(j)\\ d,&\pi_{C}(i)\neq\pi_{C}(j),\;\pi_{S}(i)\neq\pi_{S}(j),\end{cases} (11)

where πC:[D][K]\pi_{C}:[D]\to[K] and πS:[D][S]\pi_{S}:[D]\to[S] are two functions that assign a node i𝒱i\in\mathcal{V} to one of the clusters and sensitive groups, respectively. Let 𝐋\mathbf{L}^{*} be the real graph Laplacian matrix generated by the vSBM method and 𝐋^\widehat{\mathbf{L}} be the Laplacian matrix estimated by any graph construction method. The matrix 𝐋^\widehat{\mathbf{L}} is used as the input to fair spectral embedding in (9), and spectral rotation is utilized to obtain discrete cluster labels. Our goal is to derive a fair clustering error bound related to the estimation error between 𝐋^\widehat{\mathbf{L}} and 𝐋\mathbf{L}^{*}. Let us make some assumptions.

Assumption 1.

Let 𝐔^\widehat{\mathbf{U}} be a continuous cluster indicator matrix estimated from 𝐋^\widehat{\mathbf{L}} via (9). For a given constant ϵ>0\epsilon>0, the 𝐐^\widehat{\mathbf{Q}} and 𝐑^\widehat{\mathbf{R}} estimated by spectral rotation satisfy

𝐐^𝐔^𝐑^F2(1+ϵ)min𝐐,𝐑𝐑=𝐈𝐐𝐔^𝐑F2.\displaystyle\lVert\widehat{\mathbf{Q}}-\widehat{\mathbf{U}}\widehat{\mathbf{R}}\rVert_{\mathrm{F}}^{2}\leq(1+\epsilon)\underset{\mathbf{Q}\in\mathcal{I},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}}{\min}\lVert{\mathbf{Q}}-\widehat{\mathbf{U}}{\mathbf{R}}\rVert_{\mathrm{F}}^{2}. (12)
Assumption 2.

The ground-truth clustering and sensitive partitions of 𝒱\mathcal{V} satisfy

|𝒟s|=DS,|𝒞k|=DK,|𝒟s𝒞k||𝒞k|=1S.\displaystyle|\mathcal{D}_{s}|=\frac{D}{S},\;|\mathcal{C}_{k}|=\frac{D}{K},\;\frac{|\mathcal{D}_{s}\cap\mathcal{C}_{k}|}{|\mathcal{C}_{k}|}=\frac{1}{S}. (13)

Assumption 1 is similar to the (1+ϵ)(1+\epsilon)-approximation of k-means [40], which provides the estimation accuracy of spectral rotation. Assumption 2 is the same as that in Theorem 1 of [14], which is only made to facilitate theoretical analysis. In practice, Assumption 2 may be violated, which, however, does not affect the effectiveness of FSC algorithms [14]. Based on the two assumptions, we have the following proposition.

Proposition 1.

Let 𝐋\mathbf{L}^{*} be the real Laplacian matrix of the random graph generated by the vSBM method with a>b>c>da>b>c>d satisfying a>r1lnD/Da>r_{1}\ln{D}/D for some r1>0r_{1}>0, and 𝐋^\widehat{\mathbf{L}} be the estimated Laplacian matrix from observed data. Assume that we run fair spectral embedding (9) on 𝐋^\widehat{\mathbf{L}} and perform (1+ϵ)(1+\epsilon) spectral rotation (10) to obtain discrete cluster labels. Besides, let π^C(i)\widehat{\pi}_{C}(i) be the assigned cluster label (after proper permutation) of node ii, and define k:={i𝒞k:π^C(i)k}\mathcal{M}_{k}:=\left\{i\in\mathcal{C}_{k}:\widehat{\pi}_{C}(i)\neq k\right\} as the set of misclassified vertices of cluster kk. Under Assumptions 1-2, for every r2>0r_{2}>0, there exist constants C^=C^(r1,r2)\widehat{C}=\widehat{C}(r_{1},r_{2}) and C~=C~(r1,r2)\widetilde{C}=\widetilde{C}(r_{1},r_{2}) such that if

aK3lnDD(cd)2C^1+ϵ,\displaystyle\frac{aK^{3}\ln{D}}{D(c-d)^{2}}\leq\frac{\widehat{C}}{1+\epsilon}, (14)

then with probability at least 1Dr21-D^{-r_{2}}, the number of misclassified vertices, k=1K|k|\sum_{k=1}^{K}{|\mathcal{M}_{k}|}, is at most

C~(1+ϵ)aK2lnD(cd)2relatedtothevSBM+512(4+2ϵ)K2D(cd)2𝐙𝐋𝐙𝐙𝐋^𝐙F2relatedtographestimation,\displaystyle\underbrace{\frac{\widetilde{C}(1+\epsilon)aK^{2}\ln{D}}{(c-d)^{2}}}_{\mathrm{related\;to\;the\;vSBM\;}}+\underbrace{\frac{512(4+2\epsilon)K^{2}}{D(c-d)^{2}}\lVert\mathbf{Z}^{\top}\mathbf{L}^{*}\mathbf{Z}-\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z}\rVert_{\mathrm{F}}^{2}}_{\mathrm{related\;to\;graph\;estimation}}, (15)

where 𝐙D×(DS+1)\mathbf{Z}\in\mathbb{R}^{D\times(D-S+1)} is a matrix whose columns form the orthonormal basis of the nullspace of 𝐅\mathbf{F}^{\top}.

Proof.

The proof is inspired by [14], but has two main differences. First, spectral rotation instead of kkmeans is used to obtain discrete labels. Second, fair spectral embedding is based on an estimated graph rather than a known graph generated by the vSBM method. See Appendix A for details. ∎

According to [14], the meaning of “the number of misclassified vertices is at most DmD_{m}” is that there exists a permutation of cluster indices such that the clustering results, up to this permutation, correctly predict the cluster labels of all but DmD_{m} vertices. Note that the error bound consists of two parts. The first one is caused by the difference between the expected and real graph produced by the vSBM method, which is similar to [14]. The second part is related to the estimation error of graph construction methods. The fairness constraint affects clustering performance via 𝐙\mathbf{Z}, which is a matrix determined by the sensitive group-membership matrix 𝐅\mathbf{F}. For convenience, 𝐙𝐋𝐙\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z} is dubbed the fair graph. Generally, the error bound in (15) depends on KK, DD, and ϵ\epsilon. If we divide (15) by DD, we obtain a bound on the misclassification rate. The first part of the misclassification rate bound tends to zero as DD goes to infinity, meaning that if 𝐋\mathbf{L}^{*} is exactly estimated (the second part equals zero), performing FSC via (9) and spectral rotation is weakly consistent [14]. However, 𝐋\mathbf{L}^{*} usually cannot be estimated exactly, introducing an additional error into subsequent fair clustering results. If the fair graph estimation error does not grow quadratically with DD, the second part of the misclassification rate bound also decays to zero. Proposition 1 illustrates that a well-estimated graph 𝐋^\widehat{\mathbf{L}}, i.e., one close to 𝐋\mathbf{L}^{*}, yields a small misclassification error bound. This motivates us to seek a more effective method to construct accurate graphs from observed data.

IV-B The Proposed Graph Construction Method

Given NN observed data 𝐗oD×N\mathbf{X}_{o}\in\mathbb{R}^{D\times N}, we need to infer the underlying similarity graph topology as the input to FSC algorithms. However, contaminated data may lead to poor graph estimation performance, as indicated in Proposition 1, which degrades subsequent fair clustering performance. Therefore, we propose a method to learn graphs from potentially noisy data 𝐗o\mathbf{X}_{o}, which is formulated as

min𝐋,𝐗,𝝊>0\displaystyle\underset{\mathbf{L}\in\mathcal{L},\mathbf{X},\bm{\upsilon}>0}{\mathrm{min}}\,\, 1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+i=1D1𝝊[i]\displaystyle\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}
𝟏log(diag(𝐋))+βdiag0(𝐋)F2Reg(𝐋),\displaystyle\underbrace{-\mathbf{1}^{\top}\log\left(\mathrm{diag}(\mathbf{L})\right)+{\beta}\lVert\mathrm{diag_{0}}(\mathbf{L})\rVert^{2}_{\mathrm{F}}}_{Reg(\mathbf{L})}, (16)

where :={𝐋:𝐋𝕊D×D,𝐋𝟏=𝟎,𝐋[ij]0,ij}\mathcal{L}:=\left\{\mathbf{L}:\mathbf{L}\in\mathbb{S}^{D\times D},\,\mathbf{L}\mathbf{1}=\mathbf{0},\,\mathbf{L}_{[ij]}\leq 0,\,\,i\neq j\right\} contains all Laplacian matrices. Moreover, ξ\xi and β\beta are parameters, and 𝝊D\bm{\upsilon}\in\mathbb{R}^{D} is a vector of adaptive weights. We let 𝚼:=diag(𝝊)\mathbf{\Upsilon}:=\mathrm{diag}(\sqrt{\bm{\upsilon}}), where 𝝊=(𝝊[1],,𝝊[D])\sqrt{\bm{\upsilon}}=(\sqrt{\bm{\upsilon}_{[1]}},...,\sqrt{\bm{\upsilon}_{[D]}})^{\top}. Eq.(16) is a joint model of denoising and smoothness-based GL [30].

1) Denoising: If 𝐋\mathbf{L} is fixed, the problem (16) becomes

min𝐗,𝝊1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+i=1D1𝝊[i].\displaystyle\underset{\mathbf{X},\bm{\upsilon}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}. (17)

The model is a node-adaptive graph filter, and 𝝊\bm{\upsilon} represents node weights. Specifically, given node weights 𝝊\bm{\upsilon}, we have

min𝐗𝚼(𝐗o𝐗)F2+ξTr(𝐗𝐋𝐗).\displaystyle\underset{\mathbf{X}}{\mathrm{min}}\,\,\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+{\xi}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X}). (18)

Taking the derivative of (18) and setting it to zero, we obtain

𝐗=(𝚼𝚼+ξ𝐋)1𝚼𝚼𝐗o=(𝐈+ξ(𝚼𝚼)1𝐋)1𝐗o.\displaystyle\mathbf{X}=\left(\mathbf{\Upsilon}^{\top}\mathbf{\Upsilon}+{\xi}\mathbf{L}\right)^{-1}\mathbf{\Upsilon}^{\top}\mathbf{\Upsilon}\mathbf{X}_{o}={\left(\mathbf{I}+{\xi}(\mathbf{\Upsilon}^{\top}\mathbf{\Upsilon})^{-1}\mathbf{L}\right)^{-1}}\mathbf{X}_{o}. (19)

We let 𝐊:=(𝐈+ξ(𝚼𝚼)1𝐋)1\mathbf{K}:=\left(\mathbf{I}+{\xi}(\mathbf{\Upsilon}^{\top}\mathbf{\Upsilon})^{-1}\mathbf{L}\right)^{-1}, which is positive definite and has eigen-decomposition 𝐊=𝚯𝚲𝚯\mathbf{K}=\mathbf{\Theta}\mathbf{\Lambda}\mathbf{\Theta}^{\top} with eigenvalue matrix 𝚲\mathbf{\Lambda} and eigenvector matrix 𝚯\mathbf{\Theta}. Moreover, 𝚲=diag(11+ξλ1,.,11+ξλD)\mathbf{\Lambda}=\mathrm{diag}\left(\frac{1}{1+\xi\lambda_{1}},\ldots,\frac{1}{1+\xi\lambda_{D}}\right), where 0=λ1,.,λD0=\lambda_{1}\leq\ldots\leq\lambda_{D} are the eigenvalues of (𝚼𝚼)1𝐋(\mathbf{\Upsilon}^{\top}\mathbf{\Upsilon})^{-1}\mathbf{L}. From the perspective of the graph Fourier transform (GFT) [28], 𝐊𝐗o=𝚯𝚲𝚯𝐗o\mathbf{K}\mathbf{X}_{o}=\mathbf{\Theta}\mathbf{\Lambda}\mathbf{\Theta}^{\top}\mathbf{X}_{o} can be interpreted as follows: the observed graph signals (columns of 𝐗o\mathbf{X}_{o}) are first transformed to the graph frequency domain via 𝚯\mathbf{\Theta}^{\top}, their GFT coefficients are attenuated according to 𝚲\mathbf{\Lambda}, and the result is transformed back to the nodal domain via 𝚯\mathbf{\Theta}. It is observed from 𝚲\mathbf{\Lambda} that the graph filter 𝐊\mathbf{K} is low-pass since the attenuation is stronger for larger eigenvalues. Thus, the graph filter suppresses the high-frequency components of the raw data 𝐗o\mathbf{X}_{o}, attenuating noise on the graph and outputting the “noiseless” signals 𝐗\mathbf{X}.
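A minimal NumPy sketch of this denoising step, assuming 𝐋, the node weights 𝝊, and ξ are given, is shown below; it applies the closed-form filter (19) by solving a linear system rather than explicitly inverting the matrix.

import numpy as np

def node_adaptive_filter(X_o, L, upsilon, xi):
    """Apply (19): X = (Upsilon^T Upsilon + xi * L)^{-1} Upsilon^T Upsilon X_o,
    where Upsilon^T Upsilon = diag(upsilon)."""
    G = np.diag(upsilon)
    return np.linalg.solve(G + xi * L, G @ X_o)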

Our graph filter 𝐊\mathbf{K} differs from the Auto-Regressive graph filter (𝐈+ξ𝐋)1\left(\mathbf{I}+{\xi}\mathbf{L}\right)^{-1} [41] in that we assign each node an individual weight 𝝊[i],i=1,,D\bm{\upsilon}_{[i]},i=1,...,D. The reason for using 𝝊\bm{\upsilon} is that the measurement noise of different nodes may be heterogeneous. If the ii-th node signal (the ii-th row of 𝐗o\mathbf{X}_{o}) has a small noise scale, a large 𝝊[i]\bm{\upsilon}_{[i]} should be assigned to the fidelity term of node ii in (17) to ensure 𝐗[i,:]\mathbf{X}_{[i,:]} is close to the corresponding observation (𝐗o)[i,:](\mathbf{X}_{o})_{[i,:]} [42]. When the noise scale is not known a priori, we can adaptively learn 𝝊\bm{\upsilon} from the data. Specifically, given 𝐗\mathbf{X}, the problem (17) becomes

min𝝊>01Ni=1D𝝊[i](𝐗o)[i,:]𝐗[i,:]22+1𝝊[i].\displaystyle\underset{\bm{\upsilon}>0}{\mathrm{min}}\,\,\frac{1}{N}\sum_{i=1}^{D}\bm{\upsilon}_{[i]}\lVert(\mathbf{X}_{o})_{[i,:]}-\mathbf{X}_{[i,:]}\rVert_{2}^{2}+\frac{1}{\bm{\upsilon}_{[i]}}. (20)

Intuitively, solving (20) will assign a large 𝝊[i]\bm{\upsilon}_{[i]} to node ii if 𝐗[i,:]\mathbf{X}_{[i,:]} is close to (𝐗o)[i,:](\mathbf{X}_{o})_{[i,:]}, as expected.

2) Graph learning: If we have obtained the “noiseless” signals 𝐗\mathbf{X} via the graph filter 𝐊\mathbf{K}, the problem (16) becomes

min𝐋ξNTr(𝐗𝐋𝐗)+Reg(𝐋).\displaystyle\underset{\mathbf{L}\in\mathcal{L}}{\mathrm{min}}\,\,\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+{Reg(\mathbf{L})}. (21)

The first Laplacian quadratic term of (21) is equivalent to

1NTr(𝐗𝐋𝐗)=1Nn=1Ni,j=1D𝐖[ij](𝐗[in]𝐗[jn])2,\displaystyle\frac{1}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})=\frac{1}{N}\sum_{n=1}^{N}\sum_{i,j=1}^{D}\mathbf{W}_{[ij]}\left(\mathbf{X}_{[in]}-\mathbf{X}_{[jn]}\right)^{2}, (22)

which measures the average smoothness of data 𝐗\mathbf{X} over the graph 𝐋\mathbf{L} [30]. The second term of (21) contains regularizers that endow the learned graphs with desired properties. The log\log-degree term controls the node degrees, and the Frobenius norm term controls graph sparsity. Our model (21) can learn a graph suitable for graph-based clustering tasks for the following reasons. (i) It is observed from (22) that minimizing the smoothness term seeks a graph in which vertices with similar signals are strongly connected, which is consistent with the fundamental principle of SC. (ii) The log\log-degree term can avoid isolated nodes, which is crucial for SC, especially for normalized SC [5]. (iii) The Frobenius norm term can lead to a sparse graph, which may remove redundant and noisy edges.

The model (21) is similar to the ANGL method [26] since both construct graphs by minimizing the smoothness. The main differences lie in three aspects. (i) Our model removes the sum-to-one constraint of the ANGL method, which forces the degree of each node to be one, since this constraint makes the output graphs sensitive to noisy points [27]. Removing this constraint allows our model to capture more complex similarity relationships. (ii) We add a log\log-degree term to ensure the learned graph has no isolated nodes. (iii) The input data of (21) are the signals produced by the low-pass graph filter.

3) Discussion: We now explain why our method is effective in constructing graphs from observed data. If data 𝐗o\mathbf{X}_{o} have a clustering structure, they should follow the cluster and manifold assumptions, i.e., data points in the same cluster are close to each other. According to [43], smooth signals containing low-frequency parts tend to follow the cluster and manifold assumptions. Thus, if 𝐋\mathbf{L} accurately represents the graph behind observed data, the denoising part of our model has two functions. First, it filters out the high-frequency components of the observed graph signals that correspond to noise. Second, it produces smooth signals that have a clearer clustering structure, which could facilitate subsequent clustering. To better illustrate the effectiveness of the node-adaptive filter, Fig.2 depicts the t-SNE [44] results of our method on the MNIST dataset, where four clusters correspond to four randomly selected digits. We can see that the raw data are entangled. In contrast, the data 𝐗\mathbf{X} denoised by the graph filter are clearly separated, meaning that the denoising part of our model can produce cluster-friendly signals. From the perspective of GL, our model (21) learns a graph minimizing the smoothness of data, i.e., the nodes corresponding to similar signals are closely connected. Thus, the learned graph can effectively capture similarity relationships between data and preserve clustering structures. Consequently, the denoising operation and the smoothness-based GL reinforce each other to produce a high-quality graph for subsequent fair clustering tasks.

Refer to caption
Figure 2: The t-SNE results of MNIST with different ξ\xi values.
Refer to caption
Figure 3: The illustration of the proposed framework.

IV-C The Unified FSC Model

In this subsection, we build an end-to-end FSC framework that inputs observed data 𝐗o\mathbf{X}_{o} and node attributes 𝐅\mathbf{F} and directly outputs discrete cluster labels. As shown in Fig.3, our model consists of four modules, i.e., denoising, graph learning, fair spectral embedding, and discretization. First, we construct graphs from the observed data by the proposed method (16). Once we obtain 𝐋\mathbf{L}, the Laplacian matrix together with 𝐅\mathbf{F} can be directly used to perform fair spectral embedding (9) to obtain the continuous clustering label matrix 𝐔\mathbf{U}. Finally, we leverage spectral rotation (10) instead of k-means to obtain discrete cluster labels. In addition to its superior performance, as stated in Section III-B, we utilize spectral rotation because it can be flexibly integrated into an end-to-end framework. Integrating all the above subtasks into a single objective function, we obtain

min𝐗,𝐋,𝝊,𝐘,𝐑,𝐐1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+Reg(𝐋)\displaystyle\underset{\mathbf{X},\mathbf{L},\bm{\upsilon},\mathbf{Y},\mathbf{R},\mathbf{Q}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})
+i=1D1𝝊[i]+μTr(𝐔𝐋𝐔)+γ𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐋,𝝊>0,𝐔𝐔=𝐈,𝐅𝐔=𝟎,𝐑𝐑=𝐈,𝐐,\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\bm{\upsilon}>0,\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{{U}}=\mathbf{0},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}, (23)

where μ\mu and γ\gamma are two parameters. The modules are not simply stacked together; they are bridged by two Laplacian quadratic terms, i.e., ξNTr(𝐗𝐋𝐗)\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X}) and μTr(𝐔𝐋𝐔)\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}). First, ξNTr(𝐗𝐋𝐗)\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X}) can be viewed as the graph Tikhonov regularizer of the denoising task (18) to output smooth signals [41]. On the other hand, it measures smoothness in the GL task to capture the similarity relationships between data. Second, μTr(𝐔𝐋𝐔)\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}) together with 𝐅𝐔=𝟎\mathbf{F}^{\top}\mathbf{U}=\mathbf{0} performs fair spectral embedding, whose output serves as the input to the discretization. It is also used to impose structural constraints on the constructed graph, which is discussed in the next subsection. The four modules are coupled with each other to achieve overall optimal results for all subtasks.
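For reference, a small Python sketch (dense NumPy arrays, hypothetical function name) that evaluates the objective value of (23), e.g., for monitoring the convergence of the algorithm in Section V, is:

import numpy as np

def objective_value(X_o, X, L, upsilon, U, R, Q, xi, beta, mu, gamma):
    """Evaluate the unified objective in (23) for a given set of variables."""
    N = X_o.shape[1]
    Ups = np.diag(np.sqrt(upsilon))
    fidelity = np.linalg.norm(Ups @ (X_o - X), 'fro') ** 2 / N
    smooth = xi * np.trace(X.T @ L @ X) / N
    off = L - np.diag(np.diag(L))                   # off-diagonal part of L, i.e., -W
    reg = -np.sum(np.log(np.diag(L))) + beta * np.linalg.norm(off, 'fro') ** 2
    weights = np.sum(1.0 / upsilon)
    embedding = mu * np.trace(U.T @ L @ U)
    rotation = gamma * np.linalg.norm(Q - U @ R, 'fro') ** 2
    return fidelity + smooth + reg + weights + embedding + rotation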

To better understand how the fairness constraint works, we introduce a new variable matrix 𝐘(DS+1)×K\mathbf{Y}\in\mathbb{R}^{(D-S+1)\times K} and let 𝐔=𝐙𝐘\mathbf{U}=\mathbf{Z}\mathbf{Y}, where 𝐙\mathbf{Z} is the matrix defined in Proposition 1. The matrix 𝐅\mathbf{F} encodes sensitive information, as does 𝐙\mathbf{Z}. Then, problem (23) can be rephrased in terms of 𝐘\mathbf{Y} as

min𝐗,𝐋,𝝊,𝐘,𝐑,𝐐1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+Reg(𝐋)\displaystyle\underset{\mathbf{X},\mathbf{L},\bm{\upsilon},\mathbf{Y},\mathbf{R},\mathbf{Q}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})
+i=1D1𝝊[i]+μTr(𝐘𝐙𝐋𝐙𝐘)+γ𝐐𝐙𝐘𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}+\mu\mathrm{Tr}(\mathbf{Y}^{\top}\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}\mathbf{Y})+\gamma\lVert\mathbf{Q}-\mathbf{Z}\mathbf{Y}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐋,𝝊>0,𝐘𝐘=𝐈,𝐑𝐑=𝐈,𝐐.\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\bm{\upsilon}>0,\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (24)

In (24), the fairness constraint 𝐅𝐔=𝟎\mathbf{F}^{\top}\mathbf{{U}}=\mathbf{0} is removed. We conduct spectral embedding on the fair graph 𝐙𝐋𝐙\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}, which encodes graph topology and sensitive information simultaneously, instead of 𝐋\mathbf{L} to ensure fair clustering. The impact of fairness constraints is discussed in the next subsection.

The basic formulation (23) is flexible and has many possible extensions. Here are some examples. (i) We can replace spectral rotation with improved spectral rotation [45] to further improve discretization performance. (ii) We can introduce self-weighted features into (23) to determine the importance of different features in assigning cluster labels [46]. (iii) We can extend (23) from unnormalized SC (5) to normalized SC [47]. (iv) We can also incorporate individual fairness [17] into our unified model. We place the details of these extensions in the supplementary material for completeness.

Remark 1.

The above extensions may improve fair clustering performance. However, we focus on the basic formulation (23) since our primary goal is to demonstrate the advantages of the proposed graph construction method and the unified framework rather than to propose a complex FSC model.

IV-D Connections to Existing Works

1) Connections to community-based GL models: If we only focus on GL and fair spectral embedding, Eq.(24) becomes

min𝐋,𝐘ξNTr(𝐗𝐋𝐗)+Reg(𝐋)+μTr(𝐘𝐙𝐋𝐙𝐘)\displaystyle\underset{\mathbf{L},\mathbf{Y}}{\min}\;\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})+\mu\mathrm{Tr}(\mathbf{Y}^{\top}\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}\mathbf{Y})
s.t.𝐋,𝐘𝐘=𝐈,\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I}, (25)

where 𝐗\mathbf{X} is regarded as the “noiseless” data here. According to Ky Fan’s theorem [48], we have min𝐘𝐘=𝐈Tr(𝐘𝐙𝐋𝐙𝐘)=k=1Kλ~k\underset{\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I}}{\min}\;\mathrm{Tr}(\mathbf{Y}^{\top}\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}\mathbf{Y})=\sum_{k=1}^{K}\widetilde{\lambda}_{k}, where λ~k\widetilde{\lambda}_{k} is the k-th smallest eigenvalue of 𝐙𝐋𝐙\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}. Thus, the problem (25) can be rephrased as

min𝐋ξNTr(𝐗𝐋𝐗)+Reg(𝐋)+μk=1Kλ~k.\displaystyle\underset{\mathbf{L}\in\mathcal{L}}{\min}\;\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})+\mu\sum_{k=1}^{K}\widetilde{\lambda}_{k}. (26)

Note that 𝐙𝐋𝐙\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z} is a positive semi-definite matrix, i.e., λ~k0\widetilde{\lambda}_{k}\geq 0. Minimizing (26) is equivalent to forcing k=1Kλ~k0\sum_{k=1}^{K}\widetilde{\lambda}_{k}\to 0 if μ\mu is large enough. That is, (26) encourages the fair graph to have KK connected components. Therefore, (26) can be viewed as a community-based GL model, which has been widely studied. For example, [49] forces the KK smallest eigenvalues of the Laplacian matrix to be zero to obtain community structures, which can be relaxed to the last term in (26). The works [50, 26] constrain the rank of the Laplacian matrix to be DKD-K, which can also be interpreted as minimizing the sum of the KK smallest eigenvalues. Furthermore, [51] adds a term Tr(𝚵𝐋𝚵)\mathrm{Tr}(\mathbf{\Xi}^{\top}\mathbf{L}\mathbf{\Xi}) to impose community constraints, where 𝚵𝚵\mathbf{\Xi}\mathbf{\Xi}^{\top} contains the value 1 for within-community edges only and 0 everywhere else. Although closely related, (25) differs from existing community-based GL models in two key aspects. First, the basic GL models are different. Our model is based on smoothness-based GL, while [49, 51] are based on statistical GL models like Graphical Lasso, and [50, 26] are based on the ANGL method. Second, our model imposes the community constraint on the fair graph 𝐙𝐋𝐙\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z} rather than on 𝐋\mathbf{L} as in existing works. Thus, the fairness constraint may affect the topology of the learned graph to obtain fair clustering. We will test the impact of the fairness constraint in the experimental section.

2) Connections to unified SC models: If we remove the denoising module and fairness constraint, our model becomes

min𝐋,𝐔,𝐑,𝐐ξNTr(𝐗𝐋𝐗)+Reg(𝐋)+μTr(𝐔𝐋𝐔)\displaystyle\underset{\mathbf{L},\mathbf{U},\mathbf{R},\mathbf{Q}}{\min}\;\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
+γ𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;+\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐋,𝐔𝐔=𝐈,𝐑𝐑=𝐈,𝐐.\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (27)

Again, 𝐗\mathbf{X} is treated as the observed data. The model (27) is an end-to-end SC model. Here, we discuss the connections between our model and those unified SC models integrating graph construction, spectral embedding, and discretization. As stated in Remark 1, we focus on basic formulations without additional extensions. The first model we compare is [20]

min𝐖,𝐔,𝐐,𝐑𝐗𝐖𝐗F2+αU𝐖1,1+μUTr(𝐔𝐋𝐔)\displaystyle\underset{\mathbf{W},\mathbf{U},\mathbf{Q},\mathbf{R}}{\min}\lVert\mathbf{X}-\mathbf{W}^{\top}\mathbf{X}\rVert_{\mathrm{F}}^{2}+\alpha_{U}\lVert\mathbf{W}\rVert_{1,1}+{\mu_{U}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
+γU𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;+\gamma_{U}\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐖𝒲,𝐔𝐔=𝐈,𝐑𝐑=𝐈,𝐐,\displaystyle\mathrm{s.t.}\;\mathbf{W}\in\mathcal{W},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}, (28)

where αU,μU\alpha_{U},\mu_{U}, and γU\gamma_{U} are constant parameters. Moreover, 𝒲={𝐖:𝐖𝕊D×D,𝐖0,diag(𝐖)=𝟎}\mathcal{W}=\left\{\mathbf{W}:\mathbf{W}\in\mathbb{S}^{D\times D},\mathbf{W}\geq 0,\mathrm{diag}(\mathbf{W})=\mathbf{0}\right\} is the set containing all adjacency matrices. This is a unified SC model that leverages the sparse representation method [24] to construct graphs, which is different from our GL method.

Another unified SC model [23, 36] can be summarized as

min𝐖,𝐔,𝐐,𝐑i,j=1D𝐗[i,:]𝐗[j,:]22𝐖[i,j]+βJ𝐖[i,j]2\displaystyle\underset{\mathbf{W},\mathbf{U},\mathbf{Q},\mathbf{R}}{\min}\sum_{i,j=1}^{D}\lVert\mathbf{X}_{[i,:]}-\mathbf{X}_{[j,:]}\rVert_{2}^{2}\mathbf{W}_{[i,j]}+\beta_{J}\mathbf{W}_{[i,j]}^{2}
+μJTr(𝐔𝐋𝐔)+γJ𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;+{\mu_{J}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma_{J}\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐖𝟏=𝟏,𝐖0,𝐋=𝐃𝐖,𝐔𝐔=𝐈,\displaystyle\mathrm{s.t.}\;\mathbf{W}\mathbf{1}=\mathbf{1},\mathbf{W}\geq 0,\mathbf{L}=\mathbf{D}-\mathbf{W},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},
𝐑𝐑=𝐈,𝐐,\displaystyle\;\;\;\;\;\;\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}, (29)

where βJ,μJ\beta_{J},\mu_{J}, and γJ\gamma_{J} are constant parameters. The graph construction method in (29) is the ANGL method [26]. We have discussed the difference between our graph construction method and the ANGL in the previous subsection.

In summary, our model (27) differs from the existing unified SC models (28)-(29) mainly in the graph construction method. As Proposition 1 states, an accurate GL method can boost fair clustering performance. In the experimental section, we develop fair versions of (28)-(29) and compare them with our model (23) to illustrate the superiority of our model.

V Model Optimization

In this section, we first propose an algorithm for solving (23), followed by convergence and complexity analyses.

V-A Optimization Algorithm

Our algorithm alternately updates 𝐋,𝐔,𝐑\mathbf{L},\mathbf{U},\mathbf{R}, 𝐐\mathbf{Q}, 𝐗\mathbf{X}, and 𝝊\bm{\upsilon} in (23), i.e., updating one with the others fixed. For clarity, we omit the iteration index here. The following derivations are the updates in one iteration.

1) Update 𝐋\mathbf{L}: The sub-problem of updating 𝐋\mathbf{L} is

min𝐋ξNTr(𝐗𝐋𝐗)+Reg(𝐋)+μTr(𝐔𝐋𝐔).\displaystyle\underset{\mathbf{L}\in\mathcal{L}}{\min}\;\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}). (30)

The problem can be rewritten in terms of 𝐖\mathbf{W}

min𝐖𝒲12𝐖𝐏1,1+RegW(𝐖),\displaystyle\underset{\mathbf{W}\in\mathcal{W}}{\min}\;\frac{1}{2}\lVert\mathbf{W}\circ\mathbf{P}\rVert_{1,1}+Reg_{W}(\mathbf{W}), (31)

where

𝐏[ij]=ξN𝐗[i,:]𝐗[j,:]22+μ𝐔[i,:]𝐔[j,:]22,\displaystyle\mathbf{P}_{[ij]}=\frac{\xi}{N}\lVert\mathbf{X}_{[i,:]}-\mathbf{X}_{[j,:]}\rVert_{2}^{2}+{\mu}\lVert\mathbf{U}_{[i,:]}-\mathbf{U}_{[j,:]}\rVert_{2}^{2}, (32)

and RegW(𝐖)=𝟏log(𝐖𝟏)+β𝐖F2Reg_{W}(\mathbf{W})=-\mathbf{1}^{\top}\mathrm{log}(\mathbf{W}\mathbf{1})+{\beta}\lVert\mathbf{W}\rVert_{\mathrm{F}}^{2}. By the definition of 𝒲\mathcal{W}, the free variables of 𝐖\mathbf{W} are the upper triangle elements. Thus, we define a vector 𝐰P,P:=D(D1)2\mathbf{w}\in\mathbb{R}^{P},P:=\frac{D(D-1)}{2}, satisfying that 𝐰=Triu(𝐖)\mathbf{w}=\mathrm{Triu}(\mathbf{W}), where Triu():D×DP\mathrm{Triu}(\cdot):\mathbb{R}^{D\times D}\to\mathbb{R}^{P} is a function that converts the upper triangular elements of a matrix into a vector. Then, the problem (31) is equivalent to

min𝐰0𝐩𝐰𝟏log(𝐒𝐰)+2β𝐰22,\displaystyle\underset{\mathbf{w}\geq 0}{\min}\;\mathbf{p}^{\top}\mathbf{w}-\mathbf{1}^{\top}\log(\mathbf{S}\mathbf{w})+2\beta\lVert\mathbf{w}\rVert_{2}^{2}, (33)

where 𝐩=Triu(𝐏\mathbf{p}=\mathrm{Triu}(\mathbf{P}), 𝐒D×P\mathbf{S}\in\mathbb{R}^{D\times P} is a linear operator satisfying 𝐒𝐰=𝐖𝟏\mathbf{S}\mathbf{w}=\mathbf{W}\mathbf{1} [30]. The problem (33) is convex, and we employ the algorithm in [52] to solve the problem. The complete algorithm flow is presented in the supplementary materials. After obtaining the estimated 𝐰{\mathbf{w}}, we let 𝐖=iTriu(𝐰){\mathbf{W}}=\mathrm{iTriu}({\mathbf{w}}), where iTriu():PD×D\mathrm{iTriu}(\cdot):\mathbb{R}^{P}\to\mathbb{R}^{D\times D} is the inverse Triu\mathrm{Triu} operation. The operation iTriu(𝐰)\mathrm{iTriu}({\mathbf{w}}) converts 𝐰{\mathbf{w}} into an adjacency matrix, where 𝐰{\mathbf{w}} corresponds to the upper triangle elements of 𝐖{\mathbf{W}}. Finally, we calculate the Laplacian matrix from 𝐖{\mathbf{W}} and feed it into subsequent updates of other variables.
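In our experiments the primal-dual solver of [52] is used; purely as an illustration of the structure of (33) (and not the algorithm of [52]), a simple projected-gradient sketch with an explicit sparse operator 𝐒 is given below. The helper names, step size, and iteration count are hypothetical.

import numpy as np
import scipy.sparse as sp

def degree_operator(D):
    """Sparse S (D x P) with S @ w = W @ 1, for w = Triu(W) in row-major order."""
    rows, cols = np.triu_indices(D, k=1)
    P = rows.size
    e = np.arange(P)
    data = np.ones(2 * P)
    return sp.coo_matrix((data, (np.concatenate([rows, cols]),
                                 np.concatenate([e, e]))), shape=(D, P)).tocsr()

def solve_w(p, D, beta, step=1e-3, n_iter=500):
    """Projected gradient on (33): min_{w >= 0} p^T w - 1^T log(S w) + 2*beta*||w||_2^2."""
    S = degree_operator(D)
    w = np.ones(D * (D - 1) // 2)
    for _ in range(n_iter):
        grad = p - S.T @ (1.0 / (S @ w + 1e-12)) + 4.0 * beta * w
        w = np.maximum(w - step * grad, 0.0)   # projection onto w >= 0
    return w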

2) Update 𝐔\mathbf{U}: The sub-problem of updating 𝐔\mathbf{U} is

min𝐔μTr(𝐔𝐋𝐔)+γ𝐐𝐔𝐑F2\displaystyle\underset{\mathbf{U}}{\min}\;\mu\mathrm{Tr}\left(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}\right)+\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐔𝐔=𝐈,𝐅𝐔=𝟎.\displaystyle\mathrm{s.t.}\;\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{U}=\mathbf{0}. (34)

Like (24), (34) can be cast into a problem of variable 𝐘\mathbf{Y}

min𝐘𝐘=𝐈μTr(𝐘𝐙𝐋𝐙𝐘)+γ𝐐𝐙𝐘𝐑F2\displaystyle\underset{\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I}}{\min}\;\mu\mathrm{Tr}\left(\mathbf{Y}^{\top}\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}\mathbf{Y}\right)+\gamma\lVert\mathbf{Q}-\mathbf{Z}\mathbf{Y}\mathbf{R}\rVert_{\mathrm{F}}^{2}
\displaystyle\Leftrightarrow min𝐘𝐘=𝐈μTr(𝐘𝐙𝐋𝐙𝐘)2γTr(𝐑𝐐𝐙𝐘).\displaystyle\underset{\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I}}{\min}\;\mu\mathrm{Tr}\left(\mathbf{Y}^{\top}\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}\mathbf{Y}\right)-2\gamma\mathrm{Tr}(\mathbf{R}\mathbf{Q}^{\top}\mathbf{Z}\mathbf{Y}). (35)

This is a typical quadratic optimization problem with orthogonal constraints. Let ϕ(𝐘)\phi(\mathbf{Y}) be the objective function of (35). We have that ϕ(𝐘)\phi(\mathbf{Y}) is differentiable, and 𝐘ϕ(𝐘)=2μ𝐙𝐋𝐙𝐘2γ𝐙𝐐𝐑\nabla_{\mathbf{Y}}\,\phi(\mathbf{Y})=2\mu\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}\mathbf{Y}-2\gamma\mathbf{Z}^{\top}\mathbf{Q}\mathbf{R}^{\top}. Thus, the problem can be efficiently solved via the algorithm in [53]. After obtaining 𝐘\mathbf{Y}, we let U=ZY\textbf{U}=\textbf{Z}\textbf{Y}.
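Our implementation relies on the curvilinear search of [53]; a simpler (and slower) sketch conveying the idea is a Riemannian gradient step on the Stiefel manifold followed by a QR retraction, where A denotes 𝐙ᵀ𝐋𝐙 and B denotes 𝐙ᵀ𝐐𝐑ᵀ (the step size and iteration count are hypothetical):

import numpy as np

def update_Y(A, B, Y, mu, gamma, step=1e-2, n_iter=200):
    """Approximately minimize mu*Tr(Y^T A Y) - 2*gamma*Tr(B^T Y) s.t. Y^T Y = I."""
    for _ in range(n_iter):
        G = 2.0 * mu * A @ Y - 2.0 * gamma * B        # Euclidean gradient of (35)
        sym = (Y.T @ G + G.T @ Y) / 2.0               # sym(Y^T G)
        G_tan = G - Y @ sym                           # tangent-space projection
        Y, _ = np.linalg.qr(Y - step * G_tan)         # QR retraction back to Y^T Y = I
    return Y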

3) Update 𝐑\mathbf{R}: The sub-problem of updating 𝐑\mathbf{R} is

min𝐑𝐑=𝐈γ𝐐𝐔𝐑F2\displaystyle\underset{\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}}{\min}\;\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
\displaystyle\Leftrightarrow max𝐑𝐑=𝐈Tr(𝐐𝐔𝐑).\displaystyle\underset{\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}}{\max}\;\mathrm{Tr}(\mathbf{Q}^{\top}\mathbf{U}\mathbf{R}). (36)

It is the orthogonal Procrustes problem with a closed-form solution [54]. Assuming that 𝚯L\mathbf{\Theta}_{L} and 𝚯R\mathbf{\Theta}_{R} are the left and right matrices of SVD of 𝐐𝐔\mathbf{Q}^{\top}\mathbf{U}, the solution to (36) is [54]

𝐑=𝚯R𝚯L.\displaystyle\mathbf{R}=\mathbf{\Theta}_{R}\mathbf{\Theta}_{L}^{\top}. (37)

4) Update 𝐐\mathbf{Q}: The sub-problem of updating 𝐐\mathbf{Q} is

min𝐐γ𝐐𝐔𝐑F2\displaystyle\underset{\mathbf{Q}\in\mathcal{I}}{\min}\;\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
\displaystyle\Leftrightarrow min𝐐γTr(𝐐𝐐)2γTr(𝐐𝐔𝐑)\displaystyle\underset{\mathbf{Q}\in\mathcal{I}}{\min}\;\gamma\mathrm{Tr}(\mathbf{Q}^{\top}\mathbf{Q})-2\gamma\mathrm{Tr}(\mathbf{Q}^{\top}\mathbf{U}\mathbf{R})
\displaystyle\Leftrightarrow max𝐐Tr(𝐐𝐔𝐑).\displaystyle\underset{\mathbf{Q}\in\mathcal{I}}{\max}\;\mathrm{Tr}(\mathbf{Q}^{\top}\mathbf{U}\mathbf{R}). (38)

The optimal solution to (38) is as follows:

𝐐[ik]={1k=argmaxj[K](𝐔𝐑)[ij],0others.\displaystyle\mathbf{Q}_{[ik]}=\begin{cases}1&k=\mathrm{argmax}_{j\in[K]}\;(\mathbf{U}\mathbf{R})_{[ij]},\\ 0&\mathrm{others}.\end{cases} (39)

5) Update 𝐗\mathbf{X}: The sub-problem of updating 𝐗\mathbf{X} is (17), which has a closed-form solution (19). However, matrix inversion is computationally expensive with complexity 𝒪(D3)\mathcal{O}(D^{3}). Fortunately, 𝚼𝚼+ξ𝐋\mathbf{\Upsilon}^{\top}\mathbf{\Upsilon}+{\xi}\mathbf{L} is symmetric, sparse, and positive definite. We can hence solve (17) efficiently using the conjugate gradient (CG) algorithm without explicit matrix inversion [55].
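A sketch of this step with SciPy's CG solver (column by column, since (17) decouples over the columns of 𝐗) is:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def update_X(X_o, L, upsilon, xi):
    """Solve (diag(upsilon) + xi * L) X = diag(upsilon) X_o by conjugate gradient."""
    A = sp.diags(upsilon) + xi * sp.csr_matrix(L)   # sparse, symmetric, positive definite
    B = upsilon[:, None] * X_o
    cols = [cg(A, B[:, n])[0] for n in range(X_o.shape[1])]
    return np.column_stack(cols)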

6) Update 𝛖\bm{\upsilon}: The sub-problem of updating 𝝊\bm{\upsilon} is (20). Taking the derivative of (20) and setting it to zero, we have

𝝊[i]=N(𝐗o)[i,:]𝐗[i,:]2,i=1,,D.\displaystyle\bm{\upsilon}_{[i]}=\frac{\sqrt{N}}{\lVert(\mathbf{X}_{o})_{[i,:]}-\mathbf{X}_{[i,:]}\rVert_{2}},\;\;i=1,...,D. (40)

It is observed that the updates of 𝐋\mathbf{L}, 𝐔\mathbf{U}, 𝐑\mathbf{R}, 𝐐\mathbf{Q}, 𝐗\mathbf{X}, and 𝝊\bm{\upsilon} are coupled with each other. Updating one variable depends on the other variables, leading to an overall optimal solution. The complete procedure is presented in Algorithm 1.

Algorithm 1 The algorithm for problem (23)
0:  Data matrix 𝐗oD×N\mathbf{X}_{o}\in\mathbb{R}^{D\times N}, sensitive attributes related matrix 𝐅\mathbf{F} or 𝐙\mathbf{Z}, the number of clusters KK, model parameters ξ,β,μ\xi,\beta,\mu, and γ\gamma
0:  The learned 𝐋\mathbf{L} and discrete cluster labels 𝐐\mathbf{Q}
1:  Initialize 𝐋\mathbf{L}, 𝐔\mathbf{U}, 𝐐\mathbf{Q}, and 𝐑\mathbf{R} randomly, 𝐗=𝐗o\mathbf{X}=\mathbf{X}_{o}, and 𝝊=𝟏\bm{\upsilon}=\mathbf{1}
2:  while not converged do
3:     Calculate 𝐏\mathbf{P} by (32) and let 𝐩=Triu(𝐏)\mathbf{p}=\mathrm{Triu(\mathbf{P})}
4:     Update 𝐰\mathbf{w} by solving (33)
5:     Convert 𝐖=iTriu(𝐰)\mathbf{W}=\mathrm{iTriu}(\mathbf{w}) and calculate 𝐋=𝐃𝐖\mathbf{L}=\mathbf{D}-\mathbf{W}
6:     Update 𝐘\mathbf{Y} by solving problem (35) using the algorithm in [53], and let 𝐔=𝐙𝐘\mathbf{U}=\mathbf{Z}\mathbf{Y}
7:     Update 𝐑=𝚯R𝚯L\mathbf{R}=\mathbf{\Theta}_{R}\mathbf{\Theta}_{L}^{\top}, where 𝚯L\mathbf{\Theta}_{L} and 𝚯R\mathbf{\Theta}_{R} are the left and right matrices of SVD of 𝐐𝐔\mathbf{Q}^{\top}\mathbf{U}
8:     Update 𝐐\mathbf{Q} via (39)
9:     Update 𝐗\mathbf{X} by solving (17)
10:     Update 𝝊\bm{\upsilon} using (40)
11:  end while

V-B Convergence and Complexity Analysis

1) Convergence analysis: It is challenging to obtain a globally optimal solution to (23) since it is not jointly convex in all variables. However, our algorithm for solving each sub-problem can reach its optimal solution. Specifically, when we update 𝐋\mathbf{L}, the problem (33) is convex, and the corresponding algorithm is guaranteed to converge to the global optimum [52]. When updating 𝐔\mathbf{U}, we use the algorithm in [53] to solve the problem (35), which can converge to the global optimum [53]. The updates of 𝐐\mathbf{Q}, 𝐑\mathbf{R}, and 𝝊\bm{\upsilon} have closed-form solutions. Although the update of 𝐗\mathbf{X} in (18) has a closed-form solution (19), we update 𝐗\mathbf{X} using CG, which is guaranteed to converge [55]. In summary, the update of each variable converges in our algorithm. In practice, the whole algorithm converges well, which is verified experimentally in Section VI.

2) Complexity analysis: In one iteration, our algorithm consists of six parts, which we analyze one by one below. As stated in [52], the update of 𝐋\mathbf{L} requires 𝒪(T1D2)\mathcal{O}(T_{1}D^{2}) costs, where T1T_{1} is the average number of iterations of updating 𝐰\mathbf{w}. The computational cost can be further reduced if the average number of neighbors per node is fixed; see [56] and analysis therein. The computational complexity of our algorithm for updating 𝐔\mathbf{U} is 𝒪(T2(DK2+K3))\mathcal{O}(T_{2}(DK^{2}+K^{3})) according to [53], where T2T_{2} is the average number of iterations of the algorithm in [53]. When updating 𝐑\mathbf{R}, we perform SVD on 𝐐𝐔K×K\mathbf{Q}^{\top}\mathbf{U}\in\mathbb{R}^{K\times K}, which costs 𝒪(K3)\mathcal{O}(K^{3}). The updates of 𝐐\mathbf{Q} and 𝝊\bm{\upsilon} require 𝒪(DK2)\mathcal{O}(DK^{2}) and 𝒪(DN)\mathcal{O}(DN), respectively. Finally, the complexity of using CG to update 𝐗\mathbf{X} is 𝒪(T3DN)\mathcal{O}(T_{3}DN), where T3T_{3} is the average number of iterations of the CG algorithm.

VI Experiments

In this section, we test our proposed model using synthetic, benchmark, and real-world data. First, some experimental setups are introduced.

VI-A Experimental Setups

1) Graph generation: For synthetic data, we leverage the vSBM method to generate random graphs with sensitive attributes. Specifically, we let \zeta_{s}=\frac{1}{S}, a=0.8, b=0.2, c=0.15, and d=0.05. After obtaining the connections among nodes, we assign each edge a random weight in [0.1,2]. Finally, we normalize the edge weights to satisfy \mathrm{Tr}(\mathbf{L}^{*})=D.
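For illustration only, the snippet below sketches this generation pipeline. The exact vSBM construction follows the definition given earlier in the paper; here we simply assume, as one plausible reading, that a, b, c, and d are the connection probabilities for node pairs in (same cluster, same group), (same cluster, different group), (different cluster, same group), and (different cluster, different group), respectively.

```python
import numpy as np

def generate_vsbm_graph(cluster, group, a=0.8, b=0.2, c=0.15, d=0.05, seed=0):
    """Rough sketch of the vSBM-style generation, under the ASSUMED reading
    of a/b/c/d described above; `cluster` and `group` are length-D label vectors."""
    rng = np.random.default_rng(seed)
    D = len(cluster)
    W = np.zeros((D, D))
    for i in range(D):
        for j in range(i + 1, D):
            same_c, same_g = cluster[i] == cluster[j], group[i] == group[j]
            p = a if (same_c and same_g) else (b if same_c else (c if same_g else d))
            if rng.random() < p:
                W[i, j] = W[j, i] = rng.uniform(0.1, 2.0)  # random edge weight in [0.1, 2]
    L = np.diag(W.sum(axis=1)) - W
    scale = D / np.trace(L)                                # normalize so that Tr(L*) = D
    return scale * W, scale * L
```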

2) Signal generation: We generate NN observed signals from the following Gaussian distribution [29]

(𝐗o)[:,n]𝒩(𝟎,(𝐋)+𝚺e),n=1,,N,\displaystyle(\mathbf{X}_{o})_{[:,n]}\sim\mathcal{N}\left(\mathbf{0},(\mathbf{L}^{*})^{{\dagger}}+\mathbf{\Sigma}_{e}\right),\;\;n=1,...,N, (41)

where 𝚺e=diag(σ12,,σD2)\mathbf{\Sigma}_{e}=\mathrm{diag}(\sigma_{1}^{2},...,\sigma_{D}^{2}) and σi\sigma_{i} is the noise scale of the ii-th node. As stated in [29], signals generated in this way are smooth over the corresponding graph.
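A minimal NumPy sketch of this sampling step, directly following (41), is shown below; the helper name and the example noise scales are ours.

```python
import numpy as np

def generate_signals(L_true, sigma, N, seed=0):
    """Draw N smooth graph signals per Eq. (41):
    x_n ~ N(0, pinv(L*) + Sigma_e), with Sigma_e = diag(sigma_1^2, ..., sigma_D^2)."""
    rng = np.random.default_rng(seed)
    D = L_true.shape[0]
    cov = np.linalg.pinv(L_true) + np.diag(np.asarray(sigma) ** 2)
    return rng.multivariate_normal(np.zeros(D), cov, size=N).T  # X_o is D x N

# Example noise scales in the spirit of Table II (illustrative, not the exact seeds used):
# sigma = np.random.default_rng(1).uniform(0.4, 0.6, size=D)
```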

3) Evaluation metrics: In topology inference, determining whether two vertices are connected can be regarded as a binary classification problem. Thus, we employ the F1-score (FS\mathrm{FS}) to evaluate classification results

FS=2TP2TP+FN+FP,\displaystyle\mathrm{FS}=\frac{2\mathrm{TP}}{2\mathrm{TP}+\mathrm{FN}+\mathrm{FP}}, (42)

where \mathrm{TP}, \mathrm{FP}, and \mathrm{FN} denote the numbers of true positives, false positives, and false negatives, respectively. We also use the estimation error (\mathrm{EE}) to evaluate the learned graph

EE=𝐙𝐋^𝐙𝐙𝐋𝐙F.\displaystyle\mathrm{EE}=\lVert\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z}-\mathbf{Z}^{\top}\mathbf{L}^{*}\mathbf{Z}\lVert_{\mathrm{F}}. (43)

For a fair comparison of EE\mathrm{EE}, we normalize the learned graphs to Tr(𝐋^)=D\mathrm{Tr}(\widehat{\mathbf{L}})=D. For fair clustering, we use the same two metrics as in [14]: clustering error (CE\mathrm{CE}) and Balance (Bal\mathrm{Bal})

\displaystyle\mathrm{CE}=\frac{1}{D}\left|\{i:\widehat{\pi}_{C}(i)\neq{\pi}_{C}(i),\,i=1,...,D\}\right|,
Balance(Bal)=1Kk=1KBalance(Ck),\displaystyle\mathrm{Balance}\;(\mathrm{Bal})=\frac{1}{K}\sum_{k=1}^{K}\mathrm{Balance}(C_{k}), (44)

where π^C(i)\widehat{\pi}_{C}(i) is the estimated cluster label of node ii (after proper permutation), and πC(i){\pi}_{C}(i) is the ground-truth. The metric Balance\mathrm{Balance} measures the average balance of all clusters.
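The sketch below shows one way to compute \mathrm{FS}, \mathrm{EE}, and \mathrm{CE}, with the label permutation in \mathrm{CE} resolved by the Hungarian algorithm; \mathrm{Balance} follows the definition in [14] and is omitted here. The function names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def f_score(W_true, W_hat, tol=1e-8):
    """F1-score of Eq. (42), treating edge detection as binary classification."""
    iu = np.triu_indices_from(W_true, k=1)
    t, p = np.abs(W_true)[iu] > tol, np.abs(W_hat)[iu] > tol
    tp, fp, fn = np.sum(t & p), np.sum(~t & p), np.sum(t & ~p)
    return 2 * tp / (2 * tp + fn + fp)

def estimation_error(L_true, L_hat, Z):
    """EE of Eq. (43); both Laplacians are assumed normalized to Tr(L) = D."""
    return np.linalg.norm(Z.T @ L_hat @ Z - Z.T @ L_true @ Z, 'fro')

def clustering_error(labels_true, labels_hat, K):
    """CE of Eq. (44): error rate after the best label permutation (Hungarian matching)."""
    C = np.zeros((K, K))
    for t, h in zip(labels_true, labels_hat):
        C[t, h] += 1
    rows, cols = linear_sum_assignment(-C)              # maximize matched samples
    return 1.0 - C[rows, cols].sum() / len(labels_true)
```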

4) Baselines: The comparison baselines are listed in Table I. The model Fairlets is the fair version of kmedian [8]. Models 3-5 are implementations of [14] using different graph construction methods. FGLASSO [57] is the only baseline that jointly performs graph construction and fair spectral embedding. FSRSC and FJGSED are the fair versions of the unified SC models (28) and (29); their formulations and algorithms are given in the supplementary material.

TABLE I: Comparison baselines
  Index Models Graph-based Fair End-to-End GL method
1 kkmeans
2 Fairlets
3 CorrFSC PC
4 KNNFSC kk-NN
5 ε\varepsilonNNFSC ε\varepsilon-NN
6 FGLASSO GLASSO
7 FJGSED ANGL
8 FSRSC SR
 

5) Determination of parameters: For our model, we first grid-search \xi and \beta in the range [0.001,0.1] and keep the pair achieving the best \mathrm{FS} on the graph learning task. Then, \mu and \gamma are grid-searched in the range [0.001,1] and selected as the pair achieving the best \mathrm{CE}. All parameters of the baselines are likewise selected as those achieving the best \mathrm{CE} values. A sketch of this two-stage search is given below.
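In the sketch, the callbacks `run_graph_learning` (returning \mathrm{FS}) and `run_fair_clustering` (returning \mathrm{CE}) are hypothetical placeholders standing in for our model, and the specific grid points are only examples within the stated ranges.

```python
import itertools

def grid_search(run_graph_learning, run_fair_clustering):
    """Two-stage grid search: (xi, beta) by best FS, then (mu, gamma) by best CE."""
    grid_small = [0.001, 0.005, 0.01, 0.05, 0.1]   # illustrative points in [0.001, 0.1]
    grid_large = [0.001, 0.01, 0.1, 1.0]           # illustrative points in [0.001, 1]
    # stage 1: pick (xi, beta) maximizing the F1-score of the learned graph
    xi, beta = max(itertools.product(grid_small, grid_small),
                   key=lambda p: run_graph_learning(xi=p[0], beta=p[1]))
    # stage 2: pick (mu, gamma) minimizing the clustering error CE
    mu, gamma = min(itertools.product(grid_large, grid_large),
                    key=lambda p: run_fair_clustering(xi, beta, mu=p[0], gamma=p[1]))
    return xi, beta, mu, gamma
```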

TABLE II: The results of our model and the compared baselines under different cases.
  σi𝒰(0,0.2),N=1000\sigma_{i}\sim\mathcal{U}(0,0.2),N=1000 σi𝒰(0.4,0.6),N=1000\sigma_{i}\sim\mathcal{U}(0.4,0.6),N=1000 σi𝒰(0,0.2),N=5000\sigma_{i}\sim\mathcal{U}(0,0.2),N=5000 σi𝒰(0.4,0.6),N=5000\sigma_{i}\sim\mathcal{U}(0.4,0.6),N=5000
FS\mathrm{FS}\uparrow EE\mathrm{EE}\downarrow CE\mathrm{CE}\downarrow Bal\mathrm{Bal}\uparrow FS\mathrm{FS}\uparrow EE\mathrm{EE}\downarrow CE\mathrm{CE}\downarrow Bal\mathrm{Bal}\uparrow FS\mathrm{FS}\uparrow EE\mathrm{EE}\downarrow CE\mathrm{CE}\downarrow Bal\mathrm{Bal}\uparrow FS\mathrm{FS}\uparrow EE\mathrm{EE}\downarrow CE\mathrm{CE}\downarrow Bal\mathrm{Bal}\uparrow
kkmeans 0.671 0.191 0.687 0.149 0.635 0.161 0.667 0.145
Fairlets 0.658 0.485 0.665 0.457 0.611 0.355 0.623 0.348
CorrFSC 0.472 2.858 0.567 0.482 0.441 3.016 0.578 0.705 0.630 2.529 0.104 0.874 0.596 2.511 0.156 0.859
KNNFSC 0.105 0.687 0.829 0.103 0.703 0.626 0.113 0.729 0.628 0.098 0.682 0.731
EpsNNFSC 0.086 0.729 0.380 0.094 0.739 0.333 0.091 0.718 0.652 0.065 0.724 0.369
FGLASSO 0.482 3.902 0.411 0.616 0.450 3.724 0.406 0.646 0.587 3.971 0.291 0.657 0.574 3.533 0.271 0.908
FJGSED 0.271 28.159 0.724 0.359 0.263 22.626 0.734 0.240 0.325 23.576 0.604 0.579 0.293 31.552 0.734 0.247
FSRSC 0.374 5.222 0.724 0.619 0.355 9.671 0.733 0.607 0.345 5.049 0.729 0.766 0.512 10.024 0.739 0.663
Ours 0.501 2.375 0.286 0.845 0.474 2.414 0.390 0.801 0.715 1.691 0.052 0.960 0.674 2.174 0.142 0.870
 
  • \uparrow means that higher value is better and \downarrow means that lower value is better.

  • σi𝒰(a1,a2),i=1,,D\sigma_{i}\sim\mathcal{U}(a1,a2),i=1,...,D, means that the noise scale of the ii-th node is from the uniform distribution 𝒰(a1,a2).\mathcal{U}(a1,a2).

Figure 4: The visualization of the learned graphs (unnormalized weights) when N=5000 and \sigma_{i}\sim\mathcal{U}(0.4,0.6). Panels: (a) Ground-truth, (b) CorrFSC, (c) KNNFSC, (d) EpsNNFSC, (e) FGLASSO, (f) FJGSED, (g) FSRSC, (h) Ours.

VI-B Synthetic Data

1) Model performance: We first compare our model with all baselines in four cases. We let D=192, K=4, and S=2. As listed in Table II, our model outperforms kmeans and Fairlets on the clustering metrics because it exploits the structural information behind the raw data. The graphs established by the KNNFSC and EpsNNFSC methods are not evaluated by \mathrm{EE} since no edge weights are assigned. Among Models 3-5, CorrFSC achieves the best graph learning performance (\mathrm{FS}) as well as the best clustering performance (\mathrm{CE}). However, the graph construction performance of these three methods is inferior to that of our model, leading to unsatisfactory fair clustering results. Furthermore, compared with the three methods, our model unifies all separate stages into a single optimization objective, avoiding the suboptimality caused by separate optimization. The reason our model outperforms FGLASSO could be that FGLASSO uses a separate kmeans step to obtain the final cluster labels; in addition, our method learns better graphs than FGLASSO. Although FJGSED and FSRSC also perform fair clustering in an end-to-end manner, our model obtains superior fair clustering performance owing to the more accurate graphs constructed by our method. Finally, our model has a node-adaptive graph filter to denoise the observed signals, so it obtains the best graph construction performance under different levels of noise contamination.

We visualize the learned graphs of different methods in Fig.4. We see that EpsNNFSC fails to capture the clustering structure, resulting in the worst fair clustering performance. The graph of KNNFSC tends to have imbalanced node degrees, and the graph of FSRSC has small edge weights. Compared with CorrFSC, FGLASSO, and FJGSED, the graph of our model has fewer noisy edges and clearer cluster structures.

2) The effect of K and S: We set D=192, N=5000, and \sigma_{i}\sim\mathcal{U}(0.4,0.6). In the first case, we fix S=2 and vary K from 2 to 6. In the second case, we fix K=2 and vary S from 2 to 6. Fig. 5 shows that the fair clustering performance degrades as K increases (\mathrm{CE} increases and \mathrm{Balance} decreases), which is consistent with Proposition 1. In contrast, the fair clustering performance is less affected by S.

Figure 5: The effect of (a) K and (b) S on clustering results.

3) The effect of D: We let N=10^{4}, \sigma_{i}\sim\mathcal{U}(0.4,0.6), K=4, and S=2. As depicted in Fig. 6, for a fixed data size, \mathrm{CE} first decreases and then increases as D grows. The reason may be that, as stated in Proposition 1, the misclassification rate of FSC algorithms on graphs generated by the vSBM method decreases with D if the underlying graph is exactly estimated. However, the quality of the estimated graph declines for large D when N is fixed, so the second part of the error bound in Proposition 1 worsens. If the performance improvement brought by increasing D is smaller than the degradation caused by the graph estimation error, the fair clustering performance decreases when D is large.

Figure 6: The effect of D on (a) graph learning and (b) clustering.

4) The sensitivity of parameters: We let D=196, K=4, S=2, N=5000, and \sigma_{i}\sim\mathcal{U}(0.4,0.6). First, we fix \mu=0.01 and \gamma=0.01 and vary \beta and \xi from 0.001 to 0.1. We then fix \beta=0.01 and \xi=0.05 and vary \mu and \gamma from 0.001 to 1. As shown in Fig. 8, our model achieves consistent GL and fair clustering performance except when \beta is too small and \xi is too large. Moreover, the GL performance is more sensitive to \mu than to \gamma. There exist combinations of \mu and \gamma that achieve satisfactory \mathrm{CE} and \mathrm{Balance} simultaneously.

5) The effect of the fairness constraint: We consider a special case where the real graph contains two clusters, each consisting of samples from a single sensitive group. In this case, the \mathrm{Balance} of the ground-truth clustering is zero. We then perform clustering using our FSC model and a variant with the fairness constraint removed. As shown in Fig. 7, if we remove the fairness constraint, our model groups all samples exactly. With the fairness constraint, however, some samples are deliberately misclassified to improve fairness. We list the corresponding model performance in Table III. Our model achieves a significantly higher \mathrm{Balance} value at the cost of reduced clustering accuracy, and its GL performance is also degraded by the fairness constraint. Thus, if the clusters of the underlying graph are highly unbalanced with respect to the sensitive groups, fairness constraints may degrade GL and clustering performance.

TABLE III: The results of removing fairness.
  FS\mathrm{FS} EE\mathrm{EE} CE\mathrm{CE} Bal\mathrm{Bal}
w/o fairness 0.734 1.211 0 0
Ours 0.719 1.353 0.484 0.867
 
Figure 7: The effect of the fairness constraint. Panels: (a) the ground-truth, (b) w/o fairness, (c) Ours. Colors represent clusters, while mark shapes represent sensitive groups.

6) Ablation study: Three cases are taken into consideration. (i) We construct graphs using (16), conduct fair spectral embedding, and discretize using spectral rotation separately to test the benefit of a unified model (Ours-Sep). (ii) We construct graphs using (17) and conduct fair spectral embedding jointly. After obtaining continuous results, we exploit kkmeans as the discretization step to test the benefit of spectral rotation (Ours-kkmeans). (iii) We remove the denoising module in our model to test the benefit of the node-adaptive graph filter (Ours-noDN). We let D=196,K=4,S=2,N=5000D=196,K=4,S=2,N=5000, and σi𝒰(0.4,0.6)\sigma_{i}\sim\mathcal{U}(0.4,0.6), and the results are listed in Table IV. Our model outperforms Ours-Sep, demonstrating the benefit of a unified model. Although the graph of Ours-kkmeans is well estimated, it obtains the worst CE\mathrm{CE} due to the poor performance of kkmeans. Our model outperforms Ours-noDN because it has a low-pass filter to enhance graph construction.

TABLE IV: The results of ablation studies.
  FS\mathrm{FS} EE\mathrm{EE} CE\mathrm{CE} Bal\mathrm{Bal}
Ours-Sep 0.637 2.333 0.250 0.704
Ours-kkmeans 0.635 2.320 0.549 0.353
Ours-noDN 0.623 2.354 0.276 0.694
Ours 0.658 2.203 0.167 0.782
 
Figure 8: The effect of parameter sensitivity. (a)-(d) The results of varying \xi and \beta, reported as FS, EE, CE, and Balance. (e)-(h) The results of varying \mu and \gamma, reported as FS, EE, CE, and Balance.
Figure 9: The convergence of our algorithm under (a) \sigma_{i}\sim\mathcal{U}(0,0.2) and (b) \sigma_{i}\sim\mathcal{U}(0.4,0.6).
Figure 10: The results of the benchmark datasets. (a)-(b) The fair clustering results of the FACEBOOK and DRUGNET datasets. (c)-(d) The real and the learned FACEBOOK network. Colors represent clusters, while mark shapes represent sensitive groups.

7) Convergence: Finally, we test the convergence of our algorithm. We let D=192, K=4, and S=2. As shown in Fig. 9, the objective function values decrease monotonically with the number of iterations. Moreover, our algorithm converges within a few iterations, indicating fast convergence.

VI-C Benchmark Data

In this section, we test the performance of our model on commonly used FSC benchmark datasets [14]. The first dataset is a high school friendship network named FACEBOOKNET (http://www.sociopatterns.org/datasets/high-school-contact-and-friendship-networks/). The dataset contains a graph whose vertices represent high school students and whose edges represent connections between students on Facebook. After data preprocessing, we obtain 155 students. Gender is considered a sensitive attribute, so all vertices are divided into two groups, i.e., male and female. The second dataset, DRUGNET (https://sites.google.com/site/ucinetsoftware/datasets/covert-networks/drugnet), is a network encoding acquaintanceship between drug users in Hartford. After data preprocessing, we obtain 193 vertices. We use ethnicity as a sensitive attribute and split the vertices into three groups: African Americans, Latinos, and others. Note that previous FSC work [14] is based on a given graph, and these two datasets only contain ground-truth graphs and no observed signals. However, one of the primary advantages of our model is that we can group observed data without the real graph structures. Thus, we generate data via (41) based on the ground-truth networks and then use our model to group vertices from the observed data. For comparison, we apply the FSC algorithm in [14] (FairSC) and unnormalized spectral clustering (SC) to the real networks to cluster vertices. We aim to demonstrate that our model can achieve competitive fair clustering performance even without the real graphs. Following [14], we use \mathrm{Balance} and \mathrm{RatioCut} as evaluation metrics since we have no real labels. We let N=1000 and \sigma_{i}\sim\mathcal{U}(0,0.2). As displayed in Fig. 10 (a)-(b), for the two datasets, our model achieves almost the same \mathrm{RatioCut} as FairSC and SC, which are based on the ground-truth networks, even though we do not know the underlying graphs. Meanwhile, compared with FairSC and SC, our model improves \mathrm{Balance} at only a moderate sacrifice of \mathrm{RatioCut}. Figures 10 (c)-(d) depict the real FACEBOOKNET graph and the graph learned by our model when K=2. Fewer edges are learned between the two clusters, suggesting that our model tends to learn a graph that is more suitable for clustering. Furthermore, two clusters are clearly observed in our learned graph, meaning that our model can fairly partition the nodes from the observed data even without the real graph.

Figure 11: The results of the MovieLens dataset. (a)-(b) The fair clustering results of different models. (c)-(d) The learned sub-graphs of KNNFSC and our model when K=2. Colors represent clusters, while mark shapes represent sensitive attributes.

VI-D Real Data

1) MovieLens 100K dataset: We employ the MovieLens 100K dataset (http://www.grouplens.org) to group movies by their ratings. This dataset contains ratings of 1682 movies by 943 users in the range [1,5] and is sparse since many movies have few ratings. To alleviate the impact of sparsity, we select the top 200 most-rated movies from the 1682 movies, which yields a who-rated-what matrix \mathbf{X}\in\mathbb{R}^{200\times 943}. This matrix can be used to construct a movie-movie similarity graph strongly correlated with how users explicitly rate items [58], so we can perform clustering on the similarity graph to group movies with similar attributes. However, as stated in [57], old movies tend to obtain higher ratings because only masterpieces have survived. To obtain fair results unbiased by production time, we consider the movie year as a sensitive attribute: movies made before 1991 are considered old, while the others are considered new. To evaluate clustering results, we conduct traditional item-based collaborative filtering (CF) on each cluster, termed cluster CF, to predict user ratings of movies. As claimed in [57], if the obtained clusters accurately contain sets of similar items, cluster CF can better predict user ratings of movies. Therefore, we follow [57] and use the root mean square error (\mathrm{RMSE}) between the predicted and true ratings as an evaluation metric in addition to \mathrm{Balance} [57, 58]. Figures 11 (a)-(b) depict the fair clustering results of different models. Our model obtains the highest \mathrm{Balance} among all models except KNNFSC. However, KNNFSC performs poorly on \mathrm{RMSE}, indicating unsatisfactory clustering results, possibly because the graph constructed by KNNFSC hardly characterizes the similarity relationships between movies. In contrast, our model achieves the best \mathrm{RMSE} since it better reveals the similarity relationships behind the observed data. In Fig. 11 (c)-(d), we provide the learned sub-graphs and clustering results of the top 30 rated movies when K=2. The graph learned by KNNFSC has isolated nodes since those nodes are connected only to movies outside the top 30 rated movies. In our graph, nodes 1, 4, 11, and 18 are closely connected because they belong to the Star Wars series, whereas in Fig. 11 (c) they are not connected. Moreover, our model successfully groups these four nodes into the same cluster.
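As a simplified illustration of the cluster CF evaluation described above, the sketch below runs item-based CF with cosine similarity inside each cluster and reports the RMSE over the observed ratings; the exact CF variant used in [57] may differ, so this is only one plausible instantiation.

```python
import numpy as np

def cluster_cf_rmse(R, labels):
    """R: item-by-user rating matrix (zeros = missing); labels: cluster id per item.
    Predict each observed rating from the other items in the same cluster via
    cosine-similarity-weighted averaging, then return the RMSE."""
    errs = []
    for k in np.unique(labels):
        Rk = R[labels == k]                                   # items of cluster k
        norms = np.linalg.norm(Rk, axis=1, keepdims=True) + 1e-12
        S = (Rk / norms) @ (Rk / norms).T                     # item-item cosine similarity
        np.fill_diagonal(S, 0.0)
        mask = (Rk > 0).astype(float)
        denom = S @ mask + 1e-12                              # total similarity of raters
        pred = (S @ Rk) / denom                               # weighted-average prediction
        obs = mask > 0
        errs.append((pred[obs] - Rk[obs]) ** 2)
    return float(np.sqrt(np.concatenate(errs).mean()))
```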

2) MNIST-USPS dataset: The second dataset we employ is the MNIST-USPS dataset (http://yann.lecun.com/exdb/mnist, https://www.kaggle.com/bistaumanga/usps-dataset), which contains two sub-datasets, i.e., MNIST and USPS. Both sub-datasets contain images of handwritten digits from 0 to 9. We cluster these images and use the digits as the ground-truth cluster labels. Specifically, we randomly select 48 images from each sub-dataset, covering four digits with twelve images per digit. We finally obtain 96 images and resize each image to a 28\times 28 matrix. We take each image as a node in a graph and flatten the corresponding matrix as the node signals, so the observed data are \mathbf{X}\in\mathbb{R}^{96\times 784}. We take the domain source of an image (MNIST or USPS) as the sensitive attribute; thus, we have S=2 and K=4. We use \mathrm{CE} and \mathrm{Balance} as evaluation metrics since we have real labels but no ground-truth graphs. As shown in Fig. 12, our model achieves the best fair clustering performance in terms of both \mathrm{CE} and \mathrm{Balance}, indicating its superiority. The reason for CorrFSC, KNNFSC, EpsNNFSC, and FSRSC achieving unsatisfactory \mathrm{CE} may be that the corresponding graphs for this dataset cannot reflect the real topological similarity.

Figure 12: The clustering results of the MNIST-USPS dataset.

VII Conclusion

In this paper, we theoretically analyzed the impact of similarity graphs on FSC performance. Motivated by the analysis, we proposed a graph construction method for FSC tasks as well as an end-to-end FSC framework. Then, we designed an efficient algorithm to alternately update the variables corresponding to each submodule in our model. Extensive experiments showed that our approach is superior to state-of-the-art (fair) SC models. Future research directions may include developing more scalable FSC algorithms.

Appendix A Proof of Proposition 1

We first provide the following lemma.

Lemma 2.

For any \epsilon>0 and any two matrices \mathbf{U},\widehat{\mathbf{U}}\in\mathbb{R}^{D\times K} such that \mathbf{U}=\mathbf{Q}\mathbf{R} with \mathbf{Q}\in\mathcal{I} and \mathbf{R}^{\top}\mathbf{R}=\mathbf{I}, let (\widehat{\mathbf{Q}},\widehat{\mathbf{R}}) be a (1+\epsilon) approximation of \widehat{\mathbf{U}} obtained by spectral rotation as in Assumption 1, and let \breve{\mathbf{U}}=\widehat{\mathbf{Q}}\widehat{\mathbf{R}}. Then, for any \delta_{k}\geq 0, defining \widetilde{\mathcal{M}}_{k}=\left\{i\in\mathcal{C}_{k}:\lVert\mathbf{U}_{[i,:]}-\breve{\mathbf{U}}_{[i,:]}\rVert_{2}\geq\delta_{k}/2\right\} for k=1,...,K, we have

k=1K|~k|δk24(4+2ϵ)𝐔𝐔^F2,\displaystyle\sum_{k=1}^{K}|\widetilde{\mathcal{M}}_{k}|\delta_{k}^{2}\leq 4(4+2\epsilon)\lVert{\mathbf{U}}-\widehat{\mathbf{U}}\rVert_{\mathrm{F}}^{2}, (45)
Proof.

First, by the procedure of spectral rotation, we have

\displaystyle\widehat{\mathbf{Q}},\widehat{\mathbf{R}}=\underset{\mathbf{Q},\mathbf{R}}{\arg\min}\;\lVert\mathbf{Q}-\widehat{\mathbf{U}}{\mathbf{R}}\rVert_{\mathrm{F}}^{2}\;\;\;\mathrm{s.t.}\;\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\;\mathbf{Q}\in\mathcal{I}
\displaystyle\Leftrightarrow\;\widehat{\mathbf{Q}},\widehat{\mathbf{R}}=\underset{\mathbf{Q},\mathbf{R}}{\arg\min}\;\lVert\widehat{\mathbf{U}}-\mathbf{Q}{\mathbf{R}}\rVert_{\mathrm{F}}^{2}\;\;\;\mathrm{s.t.}\;\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\;\mathbf{Q}\in\mathcal{I}. (46)

Then, based on Assumption 1, we can obtain that

𝐔^𝐐^𝐑^F2(1+ϵ)min𝐐,𝐑𝐑=𝐈𝐔^𝐐𝐑F2\displaystyle\lVert\widehat{\mathbf{U}}-\widehat{\mathbf{Q}}\widehat{\mathbf{R}}\rVert_{\mathrm{F}}^{2}\leq(1+\epsilon)\underset{\mathbf{Q}\in\mathcal{I},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}}{\min}\lVert\widehat{\mathbf{U}}-{\mathbf{Q}}{\mathbf{R}}\rVert_{\mathrm{F}}^{2}
\displaystyle\Rightarrow 𝐔^𝐔˘F2(1+ϵ)𝐔^𝐔F2.\displaystyle\lVert\widehat{\mathbf{U}}-\breve{\mathbf{U}}\rVert_{\mathrm{F}}^{2}\leq(1+\epsilon)\lVert\widehat{\mathbf{U}}-\mathbf{U}\rVert_{\mathrm{F}}^{2}. (47)

It is not difficult to obtain the following inequalities

𝐔˘𝐔F22𝐔˘𝐔^F2+2𝐔^𝐔F2\displaystyle\lVert\breve{\mathbf{U}}-\mathbf{U}\rVert_{\mathrm{F}}^{2}\leq 2\lVert\breve{\mathbf{U}}-\widehat{\mathbf{U}}\rVert_{\mathrm{F}}^{2}+2\lVert\widehat{\mathbf{U}}-\mathbf{U}\rVert_{\mathrm{F}}^{2}
(4+2ϵ)𝐔^𝐔F2.\displaystyle\leq(4+2\epsilon)\lVert\widehat{\mathbf{U}}-{\mathbf{U}}\rVert_{\mathrm{F}}^{2}. (48)

The first inequality holds due to the basic inequality \lVert\mathbf{A}+\mathbf{B}\rVert_{\mathrm{F}}^{2}\leq 2\lVert\mathbf{A}\rVert_{\mathrm{F}}^{2}+2\lVert\mathbf{B}\rVert_{\mathrm{F}}^{2}, and the second one holds due to (47). Finally, since every i\in\widetilde{\mathcal{M}}_{k} contributes at least \delta_{k}^{2}/4 to \lVert\breve{\mathbf{U}}-\mathbf{U}\rVert_{\mathrm{F}}^{2}, summing over k yields the conclusion (45). ∎

We start our proof of Proposition 1, which is inspired by [14]. To incorporate the fairness constraint into the objective function of (9), we let 𝐔^=𝐙𝐘^\widehat{\mathbf{U}}=\mathbf{Z}\widehat{\mathbf{Y}}, where 𝐘^\widehat{\mathbf{Y}} contains the eigenvectors of 𝐙𝐋^𝐙\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z} corresponding to the KK smallest eigenvalues. Suppose that 𝐘¯\bar{\mathbf{Y}} contains the eigenvectors of 𝐙𝐋¯𝐙\mathbf{Z}^{\top}\bar{\mathbf{L}}\mathbf{Z} corresponding to the KK smallest eigenvalues, where 𝐋¯\bar{\mathbf{L}} is the expected Laplacian matrix of the graphs generated by the vSBM method. We apply spectral rotation on 𝐔^\widehat{\mathbf{U}} estimated from 𝐋^\widehat{\mathbf{L}} by solving (9). For any 𝐕K×K\mathbf{V}\in\mathbb{R}^{K\times K} satisfying 𝐕𝐕=𝐈,𝐕𝐕=𝐈\mathbf{V}^{\top}\mathbf{V}=\mathbf{I},\mathbf{V}\mathbf{V}^{\top}=\mathbf{I}, it is not difficult to obtain

\displaystyle\lVert 𝐙𝐘¯𝐙𝐘^𝐕F2=Tr((𝐘¯𝐘^𝐕)𝐙𝐙(𝐘¯𝐘^𝐕))\displaystyle\mathbf{Z}\bar{\mathbf{Y}}-\mathbf{Z}\widehat{\mathbf{Y}}\mathbf{V}\rVert_{\mathrm{F}}^{2}=\mathrm{Tr}\left((\bar{\mathbf{Y}}-\widehat{\mathbf{Y}}\mathbf{V})^{\top}\mathbf{Z}^{\top}\mathbf{Z}(\bar{\mathbf{Y}}-\widehat{\mathbf{Y}}\mathbf{V})\right)
=\displaystyle= 𝐘¯𝐘^𝐕F2.\displaystyle\lVert\bar{\mathbf{Y}}-\widehat{\mathbf{Y}}\mathbf{V}\rVert_{\mathrm{F}}^{2}. (49)

Therefore, we have

min𝐕𝐕=𝐈,𝐕𝐕=𝐈𝐙𝐘¯𝐙𝐘^𝐕F=min𝐕𝐕=𝐈,𝐕𝐕=𝐈𝐘¯𝐘^𝐕F\displaystyle\underset{\mathbf{V}^{\top}\mathbf{V}=\mathbf{I},\mathbf{V}\mathbf{V}^{\top}=\mathbf{I}}{\min}\;\lVert\mathbf{Z}\bar{\mathbf{Y}}-\mathbf{Z}\widehat{\mathbf{Y}}\mathbf{V}\rVert_{\mathrm{F}}=\underset{\mathbf{V}^{\top}\mathbf{V}=\mathbf{I},\mathbf{V}\mathbf{V}^{\top}=\mathbf{I}}{\min}\;\lVert\bar{\mathbf{Y}}-\widehat{\mathbf{Y}}\mathbf{V}\rVert_{\mathrm{F}}
\displaystyle\leq 82K3D(cd)𝐙𝐋¯𝐙𝐙𝐋^𝐙282K3D(cd)𝐙𝐋¯𝐙𝐙𝐋^𝐙F.\displaystyle\frac{8\sqrt{2K^{3}}}{D(c-d)}\lVert\mathbf{Z}^{\top}\bar{\mathbf{L}}\mathbf{Z}-\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z}\rVert_{2}\leq\frac{8\sqrt{2K^{3}}}{D(c-d)}\lVert\mathbf{Z}^{\top}\bar{\mathbf{L}}\mathbf{Z}-\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z}\rVert_{\mathrm{F}}. (50)

The first inequality holds due to [14] and the way we generate the ground-truth graph, and the second inequality holds because the spectral norm is bounded by the Frobenius norm. On the other hand, we have

𝐙𝐘¯𝐙𝐘^𝐕F=𝐙𝐘¯𝐕𝐙𝐘^F.\displaystyle\lVert\mathbf{Z}\bar{\mathbf{Y}}-\mathbf{Z}\widehat{\mathbf{Y}}\mathbf{V}\rVert_{\mathrm{F}}=\lVert\mathbf{Z}\bar{\mathbf{Y}}\mathbf{V}^{\top}-\mathbf{Z}\widehat{\mathbf{Y}}\rVert_{\mathrm{F}}. (51)

As in Lemma 6 of [14], we can choose \bar{\mathbf{Y}} in such a way that \mathbf{Z}\bar{\mathbf{Y}}=\mathbf{E}, where \mathbf{E}_{[i,:]}=\mathbf{E}_{[j,:]} if the vertices i and j are in the same cluster and \lVert\mathbf{E}_{[i,:]}-\mathbf{E}_{[j,:]}\rVert_{2}=\sqrt{2K/D} if they are not. Furthermore, multiplying \mathbf{E} by \mathbf{V}^{\top} does not change these properties of \mathbf{E} since \mathbf{V}^{\top} is an orthogonal matrix. Finally, according to Lemma 2, if we let \delta_{k}=\sqrt{2K/D}, then \widetilde{\mathcal{M}}_{k} in Lemma 2 is equivalent to \mathcal{M}_{k}. Furthermore, according to Lemma 5.3 in [40], if \frac{4(4+2\epsilon)}{\delta_{k}^{2}}\lVert{\mathbf{E}}\mathbf{V}^{\top}-{\mathbf{Z}}\widehat{\mathbf{Y}}\rVert_{\mathrm{F}}^{2}\leq\frac{D}{K}, we have

k=1K|k|\displaystyle\sum_{k=1}^{K}|\mathcal{M}_{k}| 4(4+2ϵ)δk2𝐄𝐕𝐙𝐘^F2\displaystyle\leq\frac{4(4+2\epsilon)}{\delta_{k}^{2}}\lVert{\mathbf{E}}\mathbf{V}^{\top}-{\mathbf{Z}}\widehat{\mathbf{Y}}\rVert_{\mathrm{F}}^{2}
256(4+2ϵ)K2D(cd)2𝐙𝐋¯𝐙𝐙𝐋^𝐙F2.\displaystyle\leq\frac{256(4+2\epsilon)K^{2}}{D(c-d)^{2}}\lVert\mathbf{Z}^{\top}\bar{\mathbf{L}}\mathbf{Z}-\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z}\rVert_{\mathrm{F}}^{2}. (52)

Letting C_{1}=\frac{256(4+2\epsilon)K^{2}}{D(c-d)^{2}}, we have

k=1K|k|\displaystyle\sum_{k=1}^{K}|\mathcal{M}_{k}| 2C1𝐙𝐋¯𝐙𝐙𝐋𝐙F2𝒯1+2C1𝐙𝐋𝐙𝐙𝐋^𝐙F2𝒯2.\displaystyle\leq 2C_{1}\underbrace{\lVert\mathbf{Z}^{\top}\bar{\mathbf{L}}\mathbf{Z}-\mathbf{Z}^{\top}\mathbf{L}^{*}\mathbf{Z}\rVert_{\mathrm{F}}^{2}}_{\mathcal{T}_{1}}+2C_{1}\underbrace{\lVert\mathbf{Z}^{\top}\mathbf{L}^{*}\mathbf{Z}-\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z}\rVert_{\mathrm{F}}^{2}}_{\mathcal{T}_{2}}. (53)

The first term is the difference between the expected Laplacian matrix and the real one, which has been derived in [14]. Specifically, for any r_{2}>0 and some r_{1}>0 satisfying a\geq r_{1}\ln{D}/D, with probability at least 1-D^{-r_{2}}, there exists a constant C_{2}(r_{1},r_{2}) such that

𝒯1C2(r1,r2)aDlnD.\displaystyle\mathcal{T}_{1}\leq C_{2}(r_{1},r_{2})aD\ln{D}. (54)

The second term \mathcal{T}_{2} of (53) is the error between the Laplacian estimated by our model and the real one. Substituting (54) into (53) completes the proof.

References

  • [1] T. Lei, X. Jia, Y. Zhang, S. Liu, H. Meng, and A. K. Nandi, “Superpixel-based fast fuzzy c-means clustering for color image segmentation,” IEEE Trans. Fuzzy Syst., vol. 27, no. 9, pp. 1753–1766, 2018.
  • [2] H. Xie, A. Zhao, S. Huang, J. Han, S. Liu, X. Xu, X. Luo, H. Pan, Q. Du, and X. Tong, “Unsupervised hyperspectral remote sensing image clustering based on adaptive density,” IEEE Geosci. Remote S., vol. 15, no. 4, pp. 632–636, 2018.
  • [3] V. Y. Kiselev, T. S. Andrews, and M. Hemberg, “Challenges in unsupervised clustering of single-cell rna-seq data,” Nat. Rev. Genet., vol. 20, no. 5, pp. 273–282, 2019.
  • [4] A. Likas, N. Vlassis, and J. J. Verbeek, “The global k-means clustering algorithm,” Pattern Recognit., vol. 36, no. 2, pp. 451–461, 2003.
  • [5] U. Von Luxburg, “A tutorial on spectral clustering,” Stat. Comput., vol. 17, pp. 395–416, 2007.
  • [6] W.-B. Xie, Y.-L. Lee, C. Wang, D.-B. Chen, and T. Zhou, “Hierarchical clustering supported by reciprocal nearest neighbors,” Inf. Sci., vol. 527, pp. 279–292, 2020.
  • [7] A. Chouldechova and A. Roth, “The frontiers of fairness in machine learning,” arXiv:1810.08810, 2018.
  • [8] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii, “Fair clustering through fairlets,” Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017.
  • [9] S. Bera, D. Chakrabarty, N. Flores, and M. Negahbani, “Fair algorithms for clustering,” Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019.
  • [10] A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner, “Scalable fair clustering,” in Proc. Int. Conf. Mach. Learn.   PMLR, 2019, pp. 405–413.
  • [11] I. M. Ziko, J. Yuan, E. Granger, and I. B. Ayed, “Variational fair clustering,” in Proc. Natl. Conf. Artif. Intell., vol. 35, no. 12, 2021, pp. 11 202–11 209.
  • [12] P. Zeng, Y. Li, P. Hu, D. Peng, J. Lv, and X. Peng, “Deep fair clustering via maximizing and minimizing mutual information: Theory, algorithm and metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 23 986–23 995.
  • [13] P. Li, H. Zhao, and H. Liu, “Deep fair clustering for visual learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9070–9079.
  • [14] M. Kleindessner, S. Samadi, P. Awasthi, and J. Morgenstern, “Guarantees for spectral clustering with fairness constraints,” in Proc. Int. Conf. Mach. Learn.   PMLR, 2019, pp. 3458–3467.
  • [15] J. Wang, D. Lu, I. Davidson, and Z. Bai, “Scalable spectral clustering with group fairness constraints,” in Proc. Int. Conf. Artif. Intell. Stat., AISTATS.   PMLR, 2023, pp. 6613–6629.
  • [16] J. Li, Y. Wang, and A. Merchant, “Spectral normalized-cut graph partitioning with fairness constraints,” arXiv:2307.12065, 2023.
  • [17] S. Gupta and A. Dukkipati, “Protecting individual interests across clusters: Spectral clustering with guarantees,” arXiv: 2105.03714, 2021.
  • [18] Y. Wang, J. Kang, Y. Xia, J. Luo, and H. Tong, “ifig: Individually fair multi-view graph clustering,” in 2022 IEEE International Conference on Big Data (Big Data).   IEEE, 2022, pp. 329–338.
  • [19] J. Huang, F. Nie, and H. Huang, “Spectral rotation versus k-means in spectral clustering,” in Proc. Natl. Conf. Artif. Intell., vol. 27, no. 1, 2013, pp. 431–437.
  • [20] Z. Kang, C. Peng, Q. Cheng, and Z. Xu, “Unified spectral clustering with optimal graph,” in Proc. Natl. Conf. Artif. Intell., vol. 32, no. 1, 2018.
  • [21] Z. Kang, C. Peng, and Q. Cheng, “Twin learning for similarity and clustering: A unified kernel approach,” in Proc. Natl. Conf. Artif. Intell., vol. 31, no. 1, 2017.
  • [22] J. Huang, F. Nie, and H. Huang, “A new simplex sparse learning model to measure data similarity for clustering,” in Int. Joint Conf. Artif. Intell., 2015.
  • [23] Y. Peng, W. Huang, W. Kong, F. Nie, and B.-L. Lu, “Jgsed: An end-to-end spectral clustering model for joint graph construction, spectral embedding and discretization,” IEEE Trans. Emerg. Topics Comput. Intell., 2023.
  • [24] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 35, no. 11, pp. 2765–2781, 2013.
  • [25] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 35, no. 1, pp. 171–184, 2012.
  • [26] F. Nie, X. Wang, and H. Huang, “Clustering and projected clustering with adaptive neighbors,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2014, pp. 977–986.
  • [27] C. Gao, Y. Wang, J. Zhou, W. Ding, L. Shen, and Z. Lai, “Possibilistic neighborhood graph: A new concept of similarity graph learning,” IEEE Trans. Emerg. Topics Comput. Intell., 2022.
  • [28] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, 2013.
  • [29] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, “Learning Laplacian matrix in smooth graph signal representations,” IEEE Trans. Signal Process., vol. 64, no. 23, pp. 6160–6173, 2016.
  • [30] V. Kalofolias, “How to learn a graph from smooth signals,” in Proc. Int. Conf. Artif. Intell. Stat., AISTATS.   PMLR, 2016, pp. 920–929.
  • [31] X. Dong, D. Thanou, M. Rabbat, and P. Frossard, “Learning graphs from data: A signal representation perspective,” IEEE Signal Process. Mag., vol. 36, no. 3, pp. 44–63, 2019.
  • [32] F. Nie, D. Wu, R. Wang, and X. Li, “Self-weighted clustering with adaptive neighbors,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 9, pp. 3428–3441, 2020.
  • [33] Y. Pang, J. Xie, F. Nie, and X. Li, “Spectral clustering by joint spectral embedding and spectral rotation,” IEEE Trans. Cybern., vol. 50, no. 1, pp. 247–258, 2018.
  • [34] Y. Yang, F. Shen, Z. Huang, and H. T. Shen, “A unified framework for discrete spectral clustering.” in IJCAI, 2016, pp. 2273–2279.
  • [35] W. Huang, Y. Peng, Y. Ge, and W. Kong, “A new kmeans clustering model and its generalization achieved by joint spectral embedding and rotation,” PeerJ Comput. Sci., vol. 7, p. e450, 2021.
  • [36] Y. Han, L. Zhu, Z. Cheng, J. Li, and X. Liu, “Discrete optimal graph clustering,” IEEE Trans. Cybern., vol. 50, no. 4, pp. 1697–1710, 2018.
  • [37] C. Tang, Z. Li, J. Wang, X. Liu, W. Zhang, and E. Zhu, “Unified one-step multi-view spectral clustering,” IEEE Trans. Knowl. Data Eng., vol. 35, no. 6, pp. 6449–6460, 2022.
  • [38] F. Zhang, J. Zhao, X. Ye, and H. Chen, “One-step adaptive spectral clustering networks,” IEEE Signal Process. Lett., vol. 29, pp. 2263–2267, 2022.
  • [39] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Soc. Networks, vol. 5, no. 2, pp. 109–137, 1983.
  • [40] J. Lei and A. Rinaldo, “Consistency of spectral clustering in stochastic block models,” Ann. Stat., vol. 43, no. 1, 2015.
  • [41] Q. Li, X.-M. Wu, H. Liu, X. Zhang, and Z. Guan, “Label efficient semi-supervised learning via graph filtering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9582–9591.
  • [42] Y. Y. Pilavcı, P.-O. Amblard, S. Barthelmé, and N. Tremblay, “Graph tikhonov regularization and interpolation via random spanning forests,” IEEE Trans. Signal. Inf. Process. Netw., vol. 7, pp. 359–374, 2021.
  • [43] E. Pan and Z. Kang, “Multi-view contrastive graph clustering,” Proc. Adv. Neural Inf. Process. Syst., vol. 34, pp. 2148–2159, 2021.
  • [44] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” J. Mach. Learn. Res., vol. 9, no. 11, 2008.
  • [45] G. Zhong and C.-M. Pun, “Self-taught multi-view spectral clustering,” Pattern Recognit., vol. 138, p. 109349, 2023.
  • [46] F. Nie, S. Shi, and X. Li, “Semi-supervised learning with auto-weighting feature and adaptive graph,” IEEE Trans. Knowl. Data Eng., vol. 32, no. 6, pp. 1167–1178, 2019.
  • [47] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 22, no. 8, pp. 888–905, 2000.
  • [48] K. Fan, “On a theorem of weyl concerning eigenvalues of linear transformations i,” Proc. of the Nat. Academy. of Sci., vol. 35, no. 11, pp. 652–655, 1949.
  • [49] S. Kumar, J. Ying, J. V. de Miranda Cardoso, and D. P. Palomar, “A unified framework for structured graph learning via spectral constraints.” J. Mach. Learn. Res., vol. 21, no. 22, pp. 1–60, 2020.
  • [50] D. Wu, F. Nie, J. Lu, R. Wang, and X. Li, “Effective clustering via structured graph learning,” IEEE Trans. Knowl. Data Eng., 2022.
  • [51] E. Pircalabelu and G. Claeskens, “Community-based group graphical lasso,” J. Mach. Learn. Res., vol. 21, no. 1, pp. 2406–2437, 2020.
  • [52] S. S. Saboksayr and G. Mateos, “Accelerated graph learning from smooth signals,” IEEE Signal Process. Lett., vol. 28, pp. 2192–2196, 2021.
  • [53] Z. Wen and W. Yin, “A feasible method for optimization with orthogonality constraints,” Math. Program., vol. 142, pp. 397–434, 2013.
  • [54] P. H. Schönemann, “A generalized solution of the orthogonal procrustes problem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
  • [55] O. Axelsson and G. Lindskog, “On the rate of convergence of the preconditioned conjugate gradient method,” Numer. Math., vol. 48, pp. 499–523, 1986.
  • [56] V. Kalofolias and N. Perraudin, “Large scale graph learning from smooth signals,” in Int. Conf. Learn. Representations, 2019.
  • [57] D. A. Tarzanagh, L. Balzano, and A. O. Hero, “Fair structure learning in heterogeneous graphical models,” arXiv:2112.05128, 2021.
  • [58] H. Wang, N. Wang, and D.-Y. Yeung, “Collaborative deep learning for recommender systems,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2015, pp. 1235–1244.
  • [59] X. Chen, G. Yuan, F. Nie, and Z. Ming, “Semi-supervised feature selection via sparse rescaled linear square regression,” IEEE Trans. Knowl. Data Eng., vol. 32, no. 1, pp. 165–176, 2018.
  • [60] L. Hagen and A. B. Kahng, “New spectral methods for ratio cut partitioning and clustering,” IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., vol. 11, no. 9, pp. 1074–1085, 1992.

Supplementary Materials

A-A Several Extensions to The Proposed Model

1) Improved spectral rotation: The improved spectral rotation is a refined version of (10), which is formulated as [45]:

min𝐐,𝐑𝐐(𝐐𝐐)12𝐔𝐑F2\displaystyle\underset{\mathbf{Q},\mathbf{R}}{\min}\;\lVert\mathbf{Q}(\mathbf{Q}^{\top}\mathbf{Q})^{-\frac{1}{2}}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐑𝐑=𝐈,𝐐.\displaystyle\mathrm{s.t.}\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (55)

The improved spectral rotation can output a discrete label matrix \mathbf{Q} that is closer to \mathbf{U}\mathbf{R} since \left(\mathbf{Q}(\mathbf{Q}^{\top}\mathbf{Q})^{-\frac{1}{2}}\right)^{\top}\mathbf{Q}(\mathbf{Q}^{\top}\mathbf{Q})^{-\frac{1}{2}}=(\mathbf{U}\mathbf{R})^{\top}\mathbf{U}\mathbf{R}=\mathbf{I}, i.e., \mathbf{Q}(\mathbf{Q}^{\top}\mathbf{Q})^{-\frac{1}{2}} and \mathbf{U}\mathbf{R} lie in the same space [45]. If we employ the improved spectral rotation in (23), the model becomes

min𝐗,𝐋,𝝊,𝐘,𝐑,𝐐1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+Reg(𝐋)\displaystyle\underset{\mathbf{X},\mathbf{L},\bm{\upsilon},\mathbf{Y},\mathbf{R},\mathbf{Q}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})
+μTr(𝐔𝐋𝐔)+γ𝐐(𝐐𝐐)12𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma\lVert\mathbf{Q}(\mathbf{Q}^{\top}\mathbf{Q})^{-\frac{1}{2}}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
+i=1D1𝝊[i]\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}
s.t.𝐋,𝝊>0,𝐔𝐔=𝐈,𝐅𝐔=𝟎,𝐑𝐑=𝐈,𝐐.\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\bm{\upsilon}>0,\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{{U}}=\mathbf{0},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (56)

2) Self-weighted feature importance: To improve clustering performance, some works define feature weights to determine the importance of different features in assigning cluster labels [46, 59]. Specifically, given the data matrix \mathbf{X}, we define the weight matrix \mathbf{\Psi}=\mathrm{diag}(\bm{\psi})\in\mathbb{R}^{N\times N}, where \bm{\psi}\in\mathbb{R}^{N}, \bm{\psi}\geq 0, and \bm{\psi}^{\top}\mathbf{1}=1. The weighted i-th feature is \mathbf{\Psi}\mathbf{X}_{[i,:]}^{\top}, and the weights \mathbf{\Psi} can be directly learned from the data. Thus, our model (23) with self-weighted feature importance is formulated as

min𝐖,𝝊,𝐔,𝐑,𝐐,𝚿,𝐗1N𝚼(𝐗o𝐗)F2+ξ2N𝐖𝐏ψ1,1\displaystyle\underset{\mathbf{W},\bm{\upsilon},\mathbf{U},\mathbf{R},\mathbf{Q},\mathbf{\Psi},\mathbf{X}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{2N}\lVert\mathbf{W}\circ\mathbf{P}_{\psi}\rVert_{1,1}
+RegW(𝐖)+μTr(𝐔𝐋𝐔)+γ𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+Reg_{W}(\mathbf{W})+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
+i=1D1𝝊[i]\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}
s.t.𝐖𝒲,𝐋=𝐃𝐖,𝐔𝐔=𝐈,𝐅𝐔=𝟎,𝐑𝐑=𝐈,\displaystyle\mathrm{s.t.}\;\mathbf{W}\in\mathcal{W},\mathbf{L}=\mathbf{D}-\mathbf{W},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{{U}}=\mathbf{0},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},
𝐐,(𝐏ψ)[ij]=𝚿𝐗[i,:]𝚿𝐗[j,:]22,𝝊>0,\displaystyle\;\;\;\;\;\;\mathbf{Q}\in\mathcal{I},(\mathbf{P}_{\psi})_{[ij]}=\left\lVert\mathbf{\Psi}\mathbf{X}_{[i,:]}^{\top}-\mathbf{\Psi}\mathbf{X}_{[j,:]}^{\top}\right\rVert_{2}^{2},\bm{\upsilon}>0,
𝝍𝟏=1,𝚿=diag(𝝍).\displaystyle\;\;\;\;\;\;\bm{\psi}^{\top}\mathbf{1}=1,\mathbf{\Psi}=\mathrm{diag}(\bm{\psi}). (57)

3) Normalized spectral clustering: The model (23) is a unified framework based on unnormalized SC [60]. Here, we extend (23) to normalized SC [47]. The standard normalized spectral embedding is

min𝐔Tr(𝐔𝐋𝐔),s.t.𝐔𝐃𝐔=𝐈.\displaystyle\underset{\mathbf{U}}{\min}\;\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}),\;\;\mathrm{s.t.}\mathbf{U}^{\top}\mathbf{D}\mathbf{U}=\mathbf{I}. (58)

The fairness constraint \mathbf{F}^{\top}\mathbf{U}=\mathbf{0} also holds for normalized SC [14]. Thus, our model based on normalized SC is

min𝐗,𝐋,𝝊,𝐘,𝐑,𝐐1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+Reg(𝐋)\displaystyle\underset{\mathbf{X},\mathbf{L},\bm{\upsilon},\mathbf{Y},\mathbf{R},\mathbf{Q}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})
+μTr(𝐔𝐋𝐔)+γ𝐐𝐔𝐑F2+i=1D1𝝊[i]\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}
s.t.𝐋,𝝊>0,𝐔𝐃𝐔=𝐈,𝐅𝐔=𝟎,𝐑𝐑=𝐈,𝐐.\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\bm{\upsilon}>0,\mathbf{U}^{\top}\mathbf{D}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{{U}}=\mathbf{0},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (59)

4) Individual fairness: Our model is based on group fairness, which induces the fairness constraint 𝐅𝐔=𝟎\mathbf{F}^{\top}\mathbf{U}=\mathbf{0}. The work [17] introduces individual fairness into SC, which induces a new fairness constraint 𝐌(𝐈1D𝟏𝟏)𝐔=𝟎\mathbf{M}(\mathbf{I}-\frac{1}{D}\mathbf{1}\mathbf{1}^{\top})\mathbf{U}=\mathbf{0}, where 𝐌D×D\mathbf{M}\in\mathbb{R}^{D\times D} is a graph representing individual sensitive attributes. Our unified model based on individual fairness is

min𝐗,𝐋,𝝊,𝐘,𝐑,𝐐1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+Reg(𝐋)\displaystyle\underset{\mathbf{X},\mathbf{L},\bm{\upsilon},\mathbf{Y},\mathbf{R},\mathbf{Q}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})
+μTr(𝐔𝐋𝐔)+γ𝐐𝐔𝐑F2+i=1D1𝝊[i]\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}
s.t.𝐋,𝝊>0,𝐔𝐔=𝐈,𝐌(𝐈1D𝟏𝟏)𝐔=𝟎,\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\bm{\upsilon}>0,\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{M}\left(\mathbf{I}-\frac{1}{D}\mathbf{1}\mathbf{1}^{\top}\right)\mathbf{U}=\mathbf{0},
\displaystyle\;\;\;\;\;\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\;\mathbf{Q}\in\mathcal{I}. (60)

A-B The Complete Algorithm Flow for Updating (33)

We use the algorithm in [52] to solve problem (33). The complete algorithm flow is presented in Algorithm 2.

Algorithm 2 The algorithm for problem (33)
0:  β,𝐩\beta,\mathbf{p}, set L=D12βL=\frac{D-1}{2\beta}
0:  The learned graph 𝐰{\mathbf{w}}
1:  Initialize η(1)=1\eta^{(1)}=1 and 𝝎(1)=𝐫(0)D\bm{\omega}^{(1)}=\mathbf{r}^{(0)}\in\mathbb{R}^{D} at random
2:  for t=1,2,,t=1,2,..., do
3:     𝐰¯(t)=max(𝐒𝝎(t)2𝐩4β,0)\bar{\mathbf{w}}^{(t)}=\max\left(\frac{\mathbf{S}^{\top}\bm{\omega}^{(t)}-2\mathbf{p}}{4\beta},0\right)
4:     𝐯(t)=𝐒𝐰¯(t)L𝝎(t)+(𝐒𝐰¯(t)L𝝎(t))2+4L𝟏2\mathbf{v}^{(t)}=\frac{\mathbf{S}\bar{\mathbf{w}}^{(t)}-L\bm{\omega}^{(t)}+\sqrt{(\mathbf{S}\bar{\mathbf{w}}^{(t)}-L\bm{\omega}^{(t)})^{2}+4L\mathbf{1}}}{2}
5:     𝐫(t)=𝝎(t)L1(𝐒𝐰¯(t)𝐯(t))\mathbf{r}^{(t)}=\bm{\omega}^{(t)}-L^{-1}\left(\mathbf{S}\bar{\mathbf{w}}^{(t)}-\mathbf{v}^{(t)}\right)
6:     η(t+1)=1+1+4(η(t))22\eta^{(t+1)}=\frac{1+\sqrt{1+4(\eta^{(t)})^{2}}}{2}
7:     𝝎(t+1)=𝐫(t)+(η(t)1η(t+1))(𝐫(t)𝐫(t1))\bm{\omega}^{(t+1)}=\mathbf{r}^{(t)}+\left(\frac{\eta^{(t)}-1}{\eta^{(t+1)}}\right)\left(\mathbf{r}^{(t)}-\mathbf{r}^{(t-1)}\right)
8:  end for
9:  return  𝐰=max(𝐒𝐫(t)2𝐩4β,0)\mathbf{w}=\max\left(\frac{\mathbf{S}^{\top}\mathbf{r}^{(t)}-2\mathbf{p}}{4\beta},0\right)
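The following NumPy sketch is a direct transcription of Algorithm 2, under the assumption that S is the D-by-D(D-1)/2 binary matrix mapping the half-vectorized weight vector w to the node degree vector (as in the smooth-signal formulation of [52]); a fixed iteration budget replaces a convergence check.

```python
import numpy as np

def edge_node_incidence(D):
    """ASSUMED definition of S in Algorithm 2: S @ w gives the degree vector
    of the graph whose upper-triangular weights are stacked in w."""
    E = D * (D - 1) // 2
    S = np.zeros((D, E))
    e = 0
    for i in range(D):
        for j in range(i + 1, D):
            S[i, e] = S[j, e] = 1.0
            e += 1
    return S

def solve_w(p, D, beta, n_iter=500, seed=0):
    """Accelerated dual updates of Algorithm 2 for problem (33)."""
    rng = np.random.default_rng(seed)
    S = edge_node_incidence(D)
    Lc = (D - 1) / (2 * beta)                                           # constant L
    eta, omega = 1.0, rng.standard_normal(D)                            # eta^(1), omega^(1)
    r_prev = omega.copy()                                               # r^(0) = omega^(1)
    for _ in range(n_iter):
        w_bar = np.maximum((S.T @ omega - 2 * p) / (4 * beta), 0)       # step 3
        q = S @ w_bar - Lc * omega
        v = (q + np.sqrt(q ** 2 + 4 * Lc)) / 2                          # step 4
        r = omega - (S @ w_bar - v) / Lc                                # step 5
        eta_next = (1 + np.sqrt(1 + 4 * eta ** 2)) / 2                  # step 6
        omega = r + ((eta - 1) / eta_next) * (r - r_prev)               # step 7
        eta, r_prev = eta_next, r
    return np.maximum((S.T @ r_prev - 2 * p) / (4 * beta), 0)           # step 9
```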

A-C The Formulation and Algorithm for FJGSED

The model FJGSED is formulated as

min𝐖,𝐔,𝐐,𝐑i,j=1D𝐗[i,:]𝐗[j,:]22𝐖[i,j]+βJ𝐖[i,j]2\displaystyle\underset{\mathbf{W},\mathbf{U},\mathbf{Q},\mathbf{R}}{\min}\sum_{i,j=1}^{D}\lVert\mathbf{X}_{[i,:]}-\mathbf{X}_{[j,:]}\rVert_{2}^{2}\mathbf{W}_{[i,j]}+\beta_{J}\mathbf{W}_{[i,j]}^{2}
+μJTr(𝐔𝐋𝐔)+γJ𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;+{\mu_{J}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma_{J}\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐖𝟏=𝟏,𝐖0,𝐋=𝐃𝐖,𝐔𝐔=𝐈,𝐅𝐔=𝟎\displaystyle\mathrm{s.t.}\;\mathbf{W}\mathbf{1}=\mathbf{1},\mathbf{W}\geq 0,\mathbf{L}=\mathbf{D}-\mathbf{W},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{U}=\mathbf{0}
𝐑𝐑=𝐈,𝐐.\displaystyle\;\;\;\;\;\;\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (61)

The framework of our algorithm for solving (61) is the same as that of Algorithm 1, which alternately updates \mathbf{W}, \mathbf{U}, \mathbf{R}, and \mathbf{Q}. The updates of \mathbf{U}, \mathbf{R}, and \mathbf{Q} are the same as in Algorithm 1. The main difference lies in updating \mathbf{W}/\mathbf{L}, and hence we discuss the update of \mathbf{W} here. The corresponding sub-problem is

min𝐖\displaystyle\underset{\mathbf{W}}{\min} i,jD𝐗[i,:]𝐗[j,:]22𝐖[i,j]+βJ𝐖[i,j]2+μJTr(𝐔𝐋𝐔)\displaystyle\;\;\sum_{i,j}^{D}\lVert\mathbf{X}_{[i,:]}-\mathbf{X}_{[j,:]}\rVert_{2}^{2}\mathbf{W}_{[i,j]}+\beta_{J}\mathbf{W}_{[i,j]}^{2}+{\mu_{J}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
s.t.\displaystyle\mathrm{s.t.}\; 𝐖[i,:]𝟏=1,𝐖[i,:]0,𝐋=𝐃𝐖.\displaystyle\mathbf{W}_{[i,:]}\mathbf{1}=1,\mathbf{W}_{[i,:]}\geq 0,\mathbf{L}=\mathbf{D}-\mathbf{W}. (62)

We can rewrite the problem as

min𝐖𝟏=𝟏,𝐖0i,j=1D𝐗[i,:]𝐗[j,:]22𝐖[ij]\displaystyle\underset{\mathbf{W}\mathbf{1}=\mathbf{1},\mathbf{W}\geq 0}{\min}\sum_{i,j=1}^{D}\;\lVert\mathbf{X}_{[i,:]}-\mathbf{X}_{[j,:]}\rVert_{2}^{2}\mathbf{W}_{[ij]}
+μJ2𝐔[i,:]𝐔[j,:]22𝐖[ij]+βJ𝐖[ij]2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\frac{\mu_{J}}{2}\lVert\mathbf{U}_{[i,:]}-\mathbf{U}_{[j,:]}\rVert_{2}^{2}\mathbf{W}_{[ij]}+\beta_{J}\mathbf{W}^{2}_{[ij]} (63)

Let 𝐂[ij]=𝐗[i,:]𝐗[j,:]22+μJ2𝐔[i,:]𝐔[j,:]22\mathbf{C}_{[ij]}=\lVert\mathbf{X}_{[i,:]}-\mathbf{X}_{[j,:]}\rVert_{2}^{2}+\frac{\mu_{J}}{2}\lVert\mathbf{U}_{[i,:]}-\mathbf{U}_{[j,:]}\rVert_{2}^{2}, and the problem (63) can be optimized for each row, i.e. for i=1,,Di=1,...,D,

min𝐖[i,:]j=1D𝐂[ij]𝐖[ij]+βJ𝐖[ij]2s.t.𝐖[i,:]𝟏=1,𝐖[i,:]0,\displaystyle\underset{\mathbf{W}_{[i,:]}}{\min}\sum_{j=1}^{D}\;\mathbf{C}_{[ij]}\mathbf{W}_{[ij]}+\beta_{J}\mathbf{W}^{2}_{[ij]}\;\;\mathrm{s.t.}\;\mathbf{W}_{[i,:]}\mathbf{1}=1,\mathbf{W}_{[i,:]}\geq 0,
\displaystyle\Rightarrow min𝐖[i,:]𝐖[i,:]+12βJ𝐂[i,:]22s.t.𝐖[i,:]𝟏=1,𝐖[i,:]0.\displaystyle\underset{\mathbf{W}_{[i,:]}}{\min}\left\lVert\mathbf{W}_{[i,:]}+\frac{1}{2\beta_{J}}\mathbf{C}_{[i,:]}\right\rVert_{2}^{2}\;\;\mathrm{s.t.}\;\mathbf{W}_{[i,:]}\mathbf{1}=1,\mathbf{W}_{[i,:]}\geq 0. (64)

This is the Euclidean projection of -\frac{1}{2\beta_{J}}\mathbf{C}_{[i,:]} onto the probability simplex. Inspired by [23], we update \mathbf{W}_{[i,:]} as

𝐖[i,j]=max(𝐂[i,l+1]𝐂[i,j]l𝐂[i,l+1]j=1l𝐂[i,j],0),\displaystyle\mathbf{W}_{[i,j]}=\max\left(\frac{\mathbf{C}_{[i,l+1]}-\mathbf{C}_{[i,j]}}{l\mathbf{C}_{[i,l+1]}-\sum_{j=1}^{l}\mathbf{C}_{[i,j]}},0\right), (65)

where l is a hyper-parameter determining the number of neighbors of each node in the learned graph. We select l instead of \beta_{J} as the model parameter. A sketch of this row-wise update is given below.
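The sketch assumes the standard convention for this type of closed-form update, namely that the costs in each row of C are processed in ascending order and that self-distances are excluded; under this convention, each row of W keeps at most l nonzero entries summing to one.

```python
import numpy as np

def update_W_rows(C, l):
    """Row-wise update of Eq. (65): each row of W is supported on its l
    smallest-cost neighbors (self excluded), with costs sorted ascending."""
    D = C.shape[0]
    W = np.zeros((D, D))
    for i in range(D):
        c = C[i].astype(float).copy()
        c[i] = np.inf                                    # exclude the self-loop
        order = np.argsort(c)                            # neighbors by ascending cost
        cs = c[order]
        denom = l * cs[l] - cs[:l].sum() + 1e-12         # l*C_{i,l+1} - sum_{j<=l} C_{i,j}
        W[i, order[:l]] = np.maximum((cs[l] - cs[:l]) / denom, 0)
    return W
```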

We iteratively update \mathbf{W}, \mathbf{U}, \mathbf{R}, and \mathbf{Q} until convergence. The complete algorithm is shown in Algorithm 3.

Algorithm 3 The algorithm for FJGSED
0:    𝐗\mathbf{X}, the number of clusters KK, parameters l,μJ,γJl,\mu_{J},\gamma_{J}
0:    The learned graph 𝐖\mathbf{W}, the cluster indicator matrix 𝐐\mathbf{Q}
1:  Initialize 𝐖\mathbf{W}, 𝐔\mathbf{U}, 𝐐\mathbf{Q}, and 𝐑\mathbf{R}
2:  while not converged do
3:     Update 𝐖\mathbf{W} via (65)
4:     Update 𝐘\mathbf{Y} by solving (35), and let 𝐔=𝐙𝐘\mathbf{U}=\mathbf{Z}\mathbf{Y}
5:     Update 𝐑\mathbf{R} as 𝐑=𝚯R𝚯L\mathbf{R}=\mathbf{\Theta}_{R}\mathbf{\Theta}_{L}^{\top}
6:     Update 𝐐\mathbf{Q} via (39)
7:  end while

Algorithm 4 The algorithm for FSRSC
0:    𝐗\mathbf{X}, the number of clusters KK, parameters γU,μJ,γJ\gamma_{U},\mu_{J},\gamma_{J}
0:    The learned graph 𝐖\mathbf{W}, the cluster indicator matrix 𝐐\mathbf{Q}
1:  Initialize 𝐖\mathbf{W}, 𝐔\mathbf{U}, 𝚪\mathbf{\Gamma}, 𝐐\mathbf{Q}, and 𝐑\mathbf{R}
2:  while not converged do
3:     Update 𝐀\mathbf{A} via (71)
4:     Update 𝐖\mathbf{W} via (76)
5:     𝐖=max(𝐖,0)\mathbf{W}=\max(\mathbf{W},0) and let diag(𝐖)=𝟎\mathrm{diag(}\mathbf{W})=\mathbf{0}.
6:     𝐖=12(𝐖+𝐖)\mathbf{W}=\frac{1}{2}(\mathbf{W}^{\top}+\mathbf{W}).
7:     Update 𝚪\mathbf{\Gamma} as 𝚪=𝚪+γU(𝐀𝐖)\mathbf{\Gamma}=\mathbf{\Gamma}+\gamma_{U}(\mathbf{A}-\mathbf{W})
8:     Update 𝐔\mathbf{U} by solving (35)
9:     Update 𝐑\mathbf{R} as 𝐑=𝚯R𝚯L\mathbf{R}=\mathbf{\Theta}_{R}\mathbf{\Theta}_{L}^{\top}
10:     Update 𝐐\mathbf{Q} via (39)
11:  end while

A-D The Formulation And Algorithm For FSRSC

The model FSRSC is formulated as

min𝐖,𝐔,𝐐,𝐑𝐗𝐖𝐗F2+αU𝐖1,1+μUTr(𝐔𝐋𝐔)\displaystyle\underset{\mathbf{W},\mathbf{U},\mathbf{Q},\mathbf{R}}{\min}\lVert\mathbf{X}-\mathbf{W}^{\top}\mathbf{X}\rVert_{\mathrm{F}}^{2}+\alpha_{U}\lVert\mathbf{W}\rVert_{1,1}+{\mu_{U}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
+γU𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;+\gamma_{U}\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐖𝒲,𝐔𝐔=𝐈,𝐅𝐔=𝟎,𝐑𝐑=𝐈,𝐐.\displaystyle\mathrm{s.t.}\;\mathbf{W}\in\mathcal{W},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{U}=\mathbf{0},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (66)

The framework of our algorithm for (66) is the same as that of Algorithm 1, which alternately updates \mathbf{W}, \mathbf{U}, \mathbf{R}, and \mathbf{Q}. The updates of \mathbf{U}, \mathbf{R}, and \mathbf{Q} are the same as in Algorithm 1. The main difference lies in updating \mathbf{W}, and hence we discuss the update of \mathbf{W} here. The corresponding sub-problem is

min𝐖\displaystyle\underset{\mathbf{W}}{\min} 𝐗𝐖𝐗F2+αU𝐖1,1+μUTr(𝐔𝐋𝐔)\displaystyle\lVert\mathbf{X}-\mathbf{W}^{\top}\mathbf{X}\rVert_{\mathrm{F}}^{2}+\alpha_{U}\lVert\mathbf{W}\rVert_{1,1}+{\mu_{U}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
s.t.\displaystyle\mathrm{s.t.}\; diag(𝐖)=𝟎,𝐖=𝐖,𝐖0,𝐋=𝐃𝐖.\displaystyle\mathrm{diag}(\mathbf{W})=\mathbf{0},\mathbf{W}^{\top}=\mathbf{W},\mathbf{W}\geq 0,\mathbf{L}=\mathbf{D}-\mathbf{W}. (67)

We use an augmented Lagrangian multiplier (ALM)-type method to solve the problem (67). We first introduce an auxiliary variable \mathbf{A} and rewrite the problem as

min𝐖\displaystyle\underset{\mathbf{W}}{\min} 𝐗𝐖𝐗F2+αU𝐀1,1+μUTr(𝐔𝐋𝐔)\displaystyle\lVert\mathbf{X}-\mathbf{W}^{\top}\mathbf{X}\rVert_{\mathrm{F}}^{2}+\alpha_{U}\lVert\mathbf{A}\rVert_{1,1}+{\mu_{U}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
s.t.\displaystyle\mathrm{s.t.}\; diag(𝐖)=𝟎,𝐖=𝐖,𝐖0,𝐋=𝐃𝐖,𝐀=𝐖.\displaystyle\mathrm{diag}(\mathbf{W})=\mathbf{0},\mathbf{W}^{\top}=\mathbf{W},\mathbf{W}\geq 0,\mathbf{L}=\mathbf{D}-\mathbf{W},\mathbf{A}=\mathbf{W}. (68)

The augmented Lagrangian function of the problem is

Lag(𝐖,𝐀,𝚪)\displaystyle Lag(\mathbf{W},\mathbf{A},\mathbf{\Gamma})
=\displaystyle= 𝐗𝐖𝐗F2+αU𝐀1,1+μUTr(𝐔𝐋𝐔)\displaystyle\lVert\mathbf{X}-\mathbf{W}^{\top}\mathbf{X}\rVert_{\mathrm{F}}^{2}+\alpha_{U}\lVert\mathbf{A}\rVert_{1,1}+{\mu_{U}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
\displaystyle+\frac{\gamma_{U}}{2}\left\lVert\mathbf{A}-\mathbf{W}+{\mathbf{\Gamma}}/{\gamma_{U}}\right\rVert_{\mathrm{F}}^{2}, (69)

where \gamma_{U} is the penalty parameter. In the ALM algorithm, we update \mathbf{W}, \mathbf{\Gamma}, and \mathbf{A} in an alternating manner. We first fix \mathbf{W} and \mathbf{\Gamma} and update \mathbf{A}. Letting \mathbf{J}=\mathbf{W}-\frac{\mathbf{\Gamma}}{\gamma_{U}}, the optimization problem is

\displaystyle\underset{\mathbf{A}}{\min}\;\;\alpha_{U}\lVert\mathbf{A}\rVert_{1,1}+\frac{\gamma_{U}}{2}\left\lVert\mathbf{A}-\mathbf{J}\right\rVert_{\mathrm{F}}^{2}, (70)

which can be updated elementwise as

𝐀[ij]=max(|𝐉[ij]|αUγU,0)sign(𝐉[ij]).\displaystyle\mathbf{A}_{[ij]}=\max\left(|\mathbf{J}_{[ij]}|-\frac{\alpha_{U}}{\gamma_{U}},0\right)\mathrm{sign}\left(\mathbf{J}_{[ij]}\right). (71)

Then, we fix 𝐀,𝚪\mathbf{A},\mathbf{\Gamma} and update 𝐖\mathbf{W}. Let 𝐉~=𝐀+𝚪γU\widetilde{\mathbf{J}}=\mathbf{A}+\frac{\mathbf{\Gamma}}{\gamma_{U}}, and we have

min𝐖𝐗𝐖𝐗F2+μUTr(𝐔𝐋𝐔)+γU2𝐖𝐉~F2,\displaystyle\underset{\mathbf{W}}{\min}\;\lVert\mathbf{X}-\mathbf{W}^{\top}\mathbf{X}\rVert_{\mathrm{F}}^{2}+{\mu_{U}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\frac{\gamma_{U}}{2}\lVert\mathbf{W}-\widetilde{\mathbf{J}}\rVert_{\mathrm{F}}^{2},
s.t.diag(𝐖)=𝟎,𝐖=𝐖,𝐖0,\displaystyle\mathrm{s.t.}\;\mathrm{diag}(\mathbf{W})=\mathbf{0},\mathbf{W}^{\top}=\mathbf{W},\mathbf{W}\geq 0, (72)

which is equivalent to

min𝐖g(𝐖)\displaystyle\underset{\mathbf{W}}{\min}\,g(\mathbf{W})
:=\displaystyle:= min𝐖Tr(𝐖𝐗𝐗𝐖2𝐗𝐗𝐖)+μU2𝐖𝐏U1,1\displaystyle\underset{\mathbf{W}}{\min}\;\mathrm{Tr}\left(\mathbf{W}^{\top}\mathbf{X}\mathbf{X}^{\top}\mathbf{W}-2\mathbf{X}\mathbf{X}^{\top}\mathbf{W}^{\top}\right)+\frac{\mu_{U}}{2}\lVert\mathbf{W}\circ\mathbf{P}_{U}\rVert_{1,1}
+γU2Tr(𝐖𝐖2𝐉~𝐖)\displaystyle+\frac{\gamma_{U}}{2}\mathrm{Tr}\left(\mathbf{W}^{\top}\mathbf{W}-2\widetilde{\mathbf{J}}^{\top}\mathbf{W}\right)
s.t.diag(𝐖)=𝟎,𝐖=𝐖,𝐖0,\displaystyle\mathrm{s.t.}\;\mathrm{diag}(\mathbf{W})=\mathbf{0},\mathbf{W}^{\top}=\mathbf{W},\mathbf{W}\geq 0, (73)

where 𝐏U\mathbf{P}_{U} is the pair-wise distance matrix of 𝐔\mathbf{U}. For every column of 𝐖\mathbf{W}, we have the following problem

min𝐖[:,i]g(𝐖[:,i])\displaystyle\underset{\mathbf{W}_{[:,i]}}{\min}\,g(\mathbf{W}_{[:,i]})
=\displaystyle= min𝐖[:,i]𝐖[:,i](γU2𝐈+𝐗𝐗)𝐖[:,i]\displaystyle\underset{\mathbf{W}_{[:,i]}}{\min}\,\mathbf{W}_{[:,i]}^{\top}\left(\frac{\gamma_{U}}{2}\mathbf{I}+\mathbf{X}\mathbf{X}^{\top}\right)\mathbf{W}_{[:,i]}
+(μU2(𝐏U)[:,i]γU𝐉~[:,i]2(𝐗𝐗)[i,:])𝐖[:,i].\displaystyle+\left(\frac{\mu_{U}}{2}(\mathbf{P}_{U})^{\top}_{[:,i]}-\gamma_{U}\widetilde{\mathbf{J}}^{\top}_{[:,i]}-2(\mathbf{X}\mathbf{X}^{\top})_{[i,:]}\right)\mathbf{W}_{[:,i]}. (74)

We calculate the derivative of g(𝐖[:,i])g(\mathbf{W}_{[:,i]}) and have

𝐖[:,i]g(𝐖[:,i])\displaystyle\nabla_{\mathbf{W}_{[:,i]}}g(\mathbf{W}_{[:,i]})
=\displaystyle= 2(γU2𝐈+𝐗𝐗)𝐖[:,i]+μU2(𝐏U)[:,i]γU𝐉~[:,i]2(𝐗𝐗)[:,i].\displaystyle 2\left(\frac{\gamma_{U}}{2}\mathbf{I}+\mathbf{X}\mathbf{X}^{\top}\right)\mathbf{W}_{[:,i]}+\frac{\mu_{U}}{2}(\mathbf{P}_{U})_{[:,i]}-\gamma_{U}\widetilde{\mathbf{J}}_{[:,i]}-2(\mathbf{X}\mathbf{X}^{\top})_{[:,i]}. (75)

Let 𝐖[:,i]g(𝐖[:,i])=𝟎\nabla_{\mathbf{W}_{[:,i]}}g(\mathbf{W}_{[:,i]})=\mathbf{0}, and we obtain

𝐖[:,i]=(γU𝐈+2𝐗𝐗)1(γU𝐉~[:,i]+2(𝐗𝐗)[:,i]μU2(𝐏U)[:,i]).\displaystyle\mathbf{W}_{[:,i]}=\left({\gamma_{U}}\mathbf{I}+2\mathbf{X}\mathbf{X}^{\top}\right)^{-1}\left(\gamma_{U}\widetilde{\mathbf{J}}_{[:,i]}+2(\mathbf{X}\mathbf{X}^{\top})_{[:,i]}-\frac{\mu_{U}}{2}(\mathbf{P}_{U})_{[:,i]}\right). (76)

After updating all columns of 𝐖\mathbf{W}, we project 𝐖\mathbf{W} into the constraints diag(𝐖)=𝟎,𝐖=𝐖,𝐖0\mathrm{diag}(\mathbf{W})=\mathbf{0},\mathbf{W}^{\top}=\mathbf{W},\mathbf{W}\geq 0.

Finally, we fix 𝐖,𝐀\mathbf{W},\mathbf{A} and update 𝚪\mathbf{\Gamma}, i.e., 𝚪=𝚪+γU(𝐀𝐖)\mathbf{\Gamma}=\mathbf{\Gamma}+\gamma_{U}(\mathbf{A}-\mathbf{W}).

After updating \mathbf{W}, \mathbf{A}, and \mathbf{\Gamma}, we update \mathbf{U}, \mathbf{R}, and \mathbf{Q} by following Algorithm 1. We iteratively update \mathbf{W}, \mathbf{A}, \mathbf{\Gamma}, \mathbf{U}, \mathbf{R}, and \mathbf{Q} until convergence. The complete algorithm flow is shown in Algorithm 4. A sketch of one inner ALM iteration is given below.
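The sketch strings the three sub-updates of this ALM scheme together, i.e., the soft-thresholding step (71), the column-wise solve (76) followed by the projection in Algorithm 4, and the multiplier update; it assumes that \mathbf{P}_{U} stores the pairwise squared Euclidean distances between the rows of \mathbf{U}.

```python
import numpy as np

def fsrsc_alm_step(X, U, W, A, Gamma, alpha_U, mu_U, gamma_U):
    """One ALM iteration for the W-subproblem (67): A via (71), W via (76)
    plus the projection of Algorithm 4 (steps 5-6), then Gamma (step 7)."""
    D = X.shape[0]
    XXt = X @ X.T
    # (71): elementwise soft-thresholding of J = W - Gamma / gamma_U
    J = W - Gamma / gamma_U
    A = np.sign(J) * np.maximum(np.abs(J) - alpha_U / gamma_U, 0)
    # (76): closed-form column update with J_tilde = A + Gamma / gamma_U;
    # all columns share the same left-hand matrix, so we solve them at once
    J_tilde = A + Gamma / gamma_U
    P_U = np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=2)  # pairwise ||U_i - U_j||^2
    lhs = gamma_U * np.eye(D) + 2 * XXt
    rhs = gamma_U * J_tilde + 2 * XXt - 0.5 * mu_U * P_U
    W = np.linalg.solve(lhs, rhs)
    # projection onto {W >= 0, diag(W) = 0, W symmetric}
    W = np.maximum(W, 0)
    np.fill_diagonal(W, 0.0)
    W = (W + W.T) / 2
    # multiplier update
    Gamma = Gamma + gamma_U * (A - W)
    return W, A, Gamma
```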