

A Unified Framework for Fair Spectral Clustering With Effective Graph Learning

Xiang Zhang, Qiao Wang

The authors are with the School of Information Science and Engineering, Southeast University, Nanjing 210096, China (e-mail: xiangzhang369@seu.edu.cn; qiaowang@seu.edu.cn).
Abstract

We consider the problem of spectral clustering under group fairness constraints, where samples from each sensitive group are approximately proportionally represented in each cluster. Traditional fair spectral clustering (FSC) methods consist of two consecutive stages, i.e., performing fair spectral embedding on a given graph and conducting k-means to obtain discrete cluster labels. However, in practice, the graph is usually unknown, and we need to construct the underlying graph from potentially noisy data, the quality of which inevitably affects subsequent fair clustering performance. Furthermore, performing FSC through separate steps breaks the connections among these steps, leading to suboptimal results. To this end, we first theoretically analyze the effect of the constructed graph on FSC. Motivated by the analysis, we propose a novel graph construction method with a node-adaptive graph filter to learn graphs from noisy data. Then, all independent stages of conventional FSC are integrated into a single objective function, forming an end-to-end framework that inputs raw data and outputs discrete cluster labels. An algorithm is developed to jointly and alternately update the variables in each stage. Finally, we conduct extensive experiments on synthetic, benchmark, and real data, which show that our model is superior to state-of-the-art fair clustering methods.

Index Terms:
Spectral clustering, graph learning, joint optimization, fairness constraints, spectral embedding.

I Introduction

Clustering is an unsupervised task that aims to group samples with common attributes and separate dissimilar samples. It has numerous practical applications, e.g., image processing [1], remote sensing [2], and bioinformatics [3]. Existing clustering methods include k-means [4], spectral clustering (SC) [5], and hierarchical clustering [6]. Among these methods, SC is a graph-based method utilizing the topological information of data and usually achieves better performance when handling complex high-dimensional datasets [5].

Recently, many concerns have arisen regarding fairness when performing clustering algorithms. For example, in loan applications, applicants are grouped into several clusters to support cluster-specific loan policies. However, clustering results could be affected by sensitive factors such as race and gender [7], even if the clustering algorithms do not consider sensitive attributes. Unfair clustering can lead to discriminatory outcomes, such as a specific group being more likely to be denied a loan. Therefore, there is a growing need for fair clustering methods unbiased by sensitive attributes. In the literature, [8] first introduces the notion of group fairness into clustering. As illustrated in Fig.1, given data with sensitive attributes, fair clustering aims to partition the data into clusters, where samples in every sensitive group are approximately proportionally represented in each cluster [8]. In this way, every sensitive group is treated fairly. Following [8], [9] generalizes the definition of fair clustering, [10] proposes a scalable fair clustering algorithm, and [11] applies the variational method to fair clustering. Furthermore, fairness constraints are also incorporated into deep clustering methods that leverage deep neural networks to partition data [12, 13].

Refer to caption
Figure 1: The illustration of fair clustering. Given data points of two sensitive groups (squares and circles), fair clustering partitions them into two clusters (blue and red), where samples of each group are proportionally represented in each cluster.

Here, we consider the problem of fair spectral clustering (FSC). The first work discussing FSC is [14], which designs a fairness constraint for SC according to the definition of group fairness in [8]. A scalable algorithm is proposed in [15] to solve the model in [14], and [16] considers group fairness of normalized-cut graph partitioning. In [17], individual fairness is considered in SC, which utilizes a representation graph to encode sensitive attributes and requires the neighbors of a node in the graph to be approximately proportionally represented in the clusters. Recently, [18] proposes a fair multi-view SC method. However, existing FSC models are built on a given similarity graph, which may not be available in practice. Thus, before proceeding with FSC algorithms, it is necessary to construct a graph from raw data. That is, a complete FSC method typically consists of three subtasks. First, a similarity graph is constructed from raw data. Second, spectral embedding under fairness constraints is performed on the graph to obtain a continuous cluster indicator matrix. Third, k-means is conducted on the continuous matrix to obtain discrete cluster labels.

Although feasible, the traditional FSC paradigm still has the following problems to be addressed. (i) The quality of the constructed graph inevitably affects subsequent fair clustering performance, but this has not been explored theoretically. Additionally, noisy observations make it more difficult to construct accurate graphs. (ii) The post-processing k-means discretization is sensitive to the initial cluster centers and may deviate far from the true discrete solution [19]. (iii) Performing the subtasks separately breaks the connections among graph construction, fair spectral embedding, and discretization, leading to suboptimal fair clustering results. For example, independent graph construction may fail to find the optimal graph for fair clustering [20]. Furthermore, independent spectral embedding is inferior to joint optimization of graph construction and spectral embedding [21].

To address the above issues, we propose a unified FSC model based on group fairness, which is an end-to-end framework that inputs observed data and outputs discrete cluster labels. Specifically, we first theoretically analyze how the estimated graphs affect FSC, demonstrating that accurate graphs are crucial to improve fair clustering performance. Motivated by the analysis, we propose a novel graph construction method to learn graphs from observed data under the smoothness assumption. Our approach incorporates a node-adaptive graph filter to denoise and produce smooth signals from potentially noisy data. Second, we introduce the group fairness constraint into traditional spectral embedding to guarantee fair clustering results. Third, we utilize spectral rotation instead of k-means as the discretization operation since it can produce discrete results with smaller discrepancies from the true labels. Finally, all subtasks are integrated into a single objective function to avoid the sub-optimality caused by separate optimization.

In summary, the contributions of this study are as follows.

  1. \bullet

    We theoretically analyze the impact of the estimated graph on fair clustering errors, justifying the necessity of an accurate graph to improve FSC performance. Motivated by the analysis, we propose a graph construction method to learn accurate graphs as inputs to FSC.

  2. \bullet

    We propose a unified FSC model integrating graph construction, fair spectral embedding, and discretization into a single objective function. Our model is an end-to-end framework that inputs observed data and outputs discrete fair clustering results and a similarity graph.

  3. \bullet

    We develop an algorithm to solve the objective function of our model. Compared with separate optimization, our algorithm updates all variables jointly and alternately, leading to an overall optimal solution for all subtasks.

  4. \bullet

    We conduct extensive experiments on synthetic, benchmark, and real data to test the proposed FSC model. Experimental results demonstrate that our model outperforms state-of-the-art fair clustering models.

Organization: The rest of this paper is organized as follows. Section II presents some related works. Background information is introduced in Section III. We propose our unified FSC framework in Section IV. Then, the proposed algorithm is provided in Section V. We conduct experiments to test the proposed FSC method in Section VI. Finally, concluding remarks are presented in Section VII.

Notations: Throughout this paper, vectors, matrices, and sets are written in bold lowercase, bold uppercase letters, and calligraphic uppercase letters, respectively. Given a matrix 𝐁\mathbf{B}, 𝐁[i,:],𝐁[:,j]\mathbf{B}_{[i,:]},\mathbf{B}_{[:,j]}, and 𝐁[ij]\mathbf{B}_{[ij]} denote the i-th row, the j-th column, and the (i,j) entry of 𝐁\mathbf{B}, respectively. 𝐁0\mathbf{B}\geq 0 means all elements of 𝐁\mathbf{B} are non-negative. Furthermore, diag(𝐁)\mathrm{diag}(\mathbf{B}) and diag0(𝐁)\mathrm{diag}_{\mathrm{0}}(\mathbf{B}) denote converting the diagonal elements of 𝐁\mathbf{B} to a vector and setting the diagonal entries of 𝐁\mathbf{B} to zero, respectively. The vectors 𝟏\mathbf{1}, 𝟎\mathbf{0}, and matrix 𝐈\mathbf{I} represent all-one vectors, all-zero vectors, and identity matrices, respectively. Moreover, F\lVert\cdot\rVert_{\mathrm{F}}, 1,1\lVert\cdot\rVert_{1,1}, and q\lVert\cdot\rVert_{q} are the Frobenius norm, element-wise 1\ell_{1} norm, and q\ell_{q} norm of a vector (matrix), respectively. The notations {\dagger}, \circ, and Tr()\mathrm{Tr}(\cdot) denote the pseudo-inverse, the Hadamard product, and the trace operator, respectively. Given a set \mathcal{B}, |||\mathcal{B}| is the number of elements in \mathcal{B}. Finally, \mathbb{R} and 𝕊\mathbb{S} denote the domains of real numbers and symmetric matrices, whose dimensions depend on the context.

II Related Work

II-A Graph Learning Methods For (Fair) SC

Graph learning (GL) aims to infer the graph topology behind observed data, a prerequisite step for (fair) SC when similarity graphs are unavailable. Traditionally, graphs are constructed via simple rules, such as k-nearest-neighbor (k-NN) and ε-nearest-neighbor (ε-NN) construction [22], or sample correlation methods like Pearson correlation (PC). These methods may be limited in capturing similarity relationships between data pairs [23]. Thus, many works attempt to learn graphs from data adaptively, including the sparse representation (SR) method [24] and the low-rank representation method [25]. The emergence of adaptive neighbourhood graph learning (ANGL) [26] provides a new approach that measures the similarity between two samples by the probability that they are adjacent. In [27], a possibilistic neighbourhood graph, an improved version of [26], is proposed. Recently, with the rise of graph signal processing (GSP) [28], many works attempt to learn graphs from the perspective of signal processing. One of the widely-used GSP-based GL methods postulates that signals are smooth over the corresponding graphs [29]. Intuitively, a smooth graph signal means the signal values of two connected nodes are similar [30], which is also a fundamental principle of SC [5]. Many methods are dedicated to learning graphs from smooth signals [31]. However, to the best of our knowledge, applying smoothness-based GL to SC has yet to be thoroughly explored, let alone to FSC.

II-B Unified SC Models

Many works focus on establishing unified models for SC, which can be roughly divided into three categories. The first category integrates graph construction and spectral embedding [21, 32, 26] and uses an independent discretization step as post-processing. The second category assumes a given similarity graph and integrates spectral embedding and discretization [33, 34, 35]. The third category integrates all three stages into a single objective function [20, 23, 36, 37, 38]. Our model differs from these models in two main ways. (i) Our framework utilizes a new graph construction method. (ii) We further consider fairness issues in clustering tasks.

III Background

This section presents background information, including SC under group fairness constraints and spectral rotation.

III-A SC Under Group Fairness Constraints

Given an undirected graph 𝒢={𝒱,}\mathcal{G}=\{\mathcal{V},\mathcal{E}\} of DD vertices, where 𝒱\mathcal{V} and \mathcal{E} are the sets of vertices and edges of 𝒢\mathcal{G}, respectively, its adjacency matrix 𝐖𝕊D×D\mathbf{W}\in\mathbb{S}^{D\times D} is a symmetric matrix with zero diagonal entries and non-negative off-diagonal elements if the graph has non-negative edge weights and no self-loops. The Laplacian matrix of 𝒢\mathcal{G} is defined as 𝐋=𝐃𝐖\mathbf{L}=\mathbf{D}-\mathbf{W}, where 𝐃𝕊D×D\mathbf{D}\in\mathbb{S}^{D\times D} is a diagonal matrix satisfying 𝐃[ii]=j=1D𝐖[ij]\mathbf{D}_{[ii]}=\sum_{j=1}^{D}\mathbf{W}_{[ij]}. Unnormalized SC aims to partition DD nodes into KK disjoint clusters 𝒞1,,𝒞K\mathcal{C}_{1},...,\mathcal{C}_{K}, where 𝒱=𝒞1𝒞K\mathcal{V}=\mathcal{C}_{1}\cup...\cup\mathcal{C}_{K}, and 𝒞k\mathcal{C}_{k} is the set containing nodes in the kk–th cluster. The problem of unnormalized SC is equivalent to minimizing the RatioCut\mathrm{RatioCut} objective function [5], i.e.,

RatioCut(𝒞1,,𝒞K)=k=1KCut(𝒞k,𝒱𝒞k)|𝒞k|,\displaystyle\mathrm{RatioCut}(\mathcal{C}_{1},...,\mathcal{C}_{K})=\sum_{k=1}^{K}\frac{\mathrm{Cut}(\mathcal{C}_{k},\mathcal{V}\setminus\mathcal{C}_{k})}{|\mathcal{C}_{k}|}, (1)

where 𝒱𝒞k\mathcal{V}\setminus\mathcal{C}_{k} contains all nodes in 𝒱\mathcal{V} except those in 𝒞k\mathcal{C}_{k}, and

Cut(𝒞k,𝒱𝒞k)=i𝒞k,j𝒱𝒞k𝐖[ij].\displaystyle\mathrm{Cut}(\mathcal{C}_{k},\mathcal{V}\setminus\mathcal{C}_{k})=\sum_{i\in\mathcal{C}_{k},j\in\mathcal{V}\setminus\mathcal{C}_{k}}\mathbf{W}_{[ij]}. (2)

Let 𝐔~D×K\widetilde{\mathbf{U}}\in\mathbb{R}^{D\times K} be

𝐔~[ik]={1|𝒞k|i𝒞k0i𝒞k.\displaystyle\widetilde{\mathbf{U}}_{[ik]}=\begin{cases}\frac{1}{\sqrt{|\mathcal{C}_{k}|}}&i\in\mathcal{C}_{k}\\ 0&i\notin\mathcal{C}_{k}\end{cases}. (3)

Then, minimizing the RatioCut\mathrm{RatioCut} objective function (1) is equivalent to solving the following problem [5]

min𝐔~Tr(𝐔~𝐋𝐔~),s.t.𝐔~is of form (3).\displaystyle\underset{\widetilde{\mathbf{U}}}{\min}\;\mathrm{Tr}(\widetilde{\mathbf{U}}^{\top}\mathbf{L}\widetilde{\mathbf{U}}),\;\;\mathrm{s.t.}\;\widetilde{\mathbf{U}}\;\text{is of form \eqref{eq-prelim-U}}. (4)

Due to the discrete constraint of (3), problem (4) is NP-hard. In practice, problem (4) is usually relaxed to

min𝐔Tr(𝐔𝐋𝐔),s.t.𝐔𝐔=𝐈,\displaystyle\underset{\mathbf{U}}{\min}\;\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}),\;\;\mathrm{s.t.}\;\mathbf{U}^{\top}\mathbf{U}=\mathbf{I}, (5)

where 𝐔D×K\mathbf{U}\in\mathbb{R}^{D\times K} is a relaxed continuous clustering label matrix, and 𝐔𝐔=𝐈\mathbf{U}^{\top}\mathbf{U}=\mathbf{I} is adopted to avoid trivial solutions. The process of solving (5) is called spectral embedding. After obtaining 𝐔\mathbf{U}^{*}, a common practice is to apply k-means to the rows of 𝐔\mathbf{U}^{*} to yield final discrete clustering labels 𝐐\mathbf{Q}, where 𝐐{0,1}D×K\mathbf{Q}\in\{0,1\}^{D\times K} is a binary cluster indicator matrix. The only non-zero element of the i-th row of 𝐐\mathbf{Q} indicates the cluster membership of the i-th node of 𝒢\mathcal{G}.

Fair spectral clustering groups the vertices of 𝒢\mathcal{G} by considering fairness. If the nodes of 𝒢\mathcal{G} belong to SS sensitive groups 𝒟1,,𝒟S\mathcal{D}_{1},...,\mathcal{D}_{S}, where 𝒟s\mathcal{D}_{s} contains the nodes of the ss-th sensitive group, we define the Balance\mathrm{Balance} of cluster 𝒞k\mathcal{C}_{k} as [8]

Balance(𝒞k)=minss[S]|𝒟s𝒞k||𝒟s𝒞k|[0,1],\displaystyle\mathrm{Balance}(\mathcal{C}_{k})=\underset{s\neq s^{\prime}\in[S]}{\min}\;\frac{\left\lvert\mathcal{D}_{s}\cap\mathcal{C}_{k}\right\rvert}{\left\lvert\mathcal{D}_{s^{\prime}}\cap\mathcal{C}_{k}\right\rvert}\in[0,1], (6)

where [S]:={1,,S}[S]:=\{1,...,S\}. The higher the Balance\mathrm{Balance} of each cluster, the fairer the clustering [8]. It is not difficult to check that mink[K]Balance(𝒞k)minss[S]|𝒟s|/|𝒟s|\underset{k\in[K]}{\min}\mathrm{Balance}(\mathcal{C}_{k})\leq\underset{s\neq s^{\prime}\in[S]}{\min}\;|\mathcal{D}_{s}|/|\mathcal{D}_{s^{\prime}}|. Thus, this notion of fairness is asking for a clustering where the fraction of different sensitive groups in each cluster is approximately the same as that of the entire dataset 𝒱\mathcal{V} [14], which is also called group fairness. To incorporate this fairness notion into SC, a group-membership vector 𝐟s{0,1}D\mathbf{f}_{s}\in\{0,1\}^{D} of 𝒟s\mathcal{D}_{s} is defined, where (𝐟s)[i]=1(\mathbf{f}_{s})_{[i]}=1 if i𝒟si\in\mathcal{D}_{s} and (𝐟s)[i]=0(\mathbf{f}_{s})_{[i]}=0 otherwise, for s[S]s\in[S] and i[D]i\in[D]. Then, we have the following lemma.

Lemma 1.

(Fairness constraint as linear constraint on 𝐔~\widetilde{\mathbf{U}} [14]) Let 𝒱=𝒞1𝒞K\mathcal{V}=\mathcal{C}_{1}\cup...\cup\mathcal{C}_{K} be a clustering that is encoded as in (3). We have, for every k[K]k\in[K]

s[S]:|𝒟s𝒞k||𝒞k|=|𝒟s|D𝐅𝐔~=𝟎,\displaystyle\forall s\in[S]:\frac{|\mathcal{D}_{s}\cap\mathcal{C}_{k}|}{|\mathcal{C}_{k}|}=\frac{|\mathcal{D}_{s}|}{D}\Leftrightarrow\mathbf{F}^{\top}\widetilde{\mathbf{U}}=\mathbf{0}, (7)

where 𝐅D×(S1)\mathbf{F}\in\mathbb{R}^{D\times(S-1)} is a matrix satisfying 𝐅[:,s]=𝐟s(|𝒟s|/D)𝟏,s[S1]\mathbf{F}_{[:,s]}=\mathbf{f}_{s}-(|\mathcal{D}_{s}|/D)\cdot\mathbf{1},s\in[S-1].

Lemma 1 states that the proportional representation of all sensitive attribute samples in each cluster can be interpreted as a linear constraint 𝐅𝐔~=𝟎\mathbf{F}^{\top}\widetilde{\mathbf{U}}=\mathbf{0}. Under this fairness constraint, unnormalized SC is equivalent to the following problem

min𝐔~Tr(𝐔~𝐋𝐔~),s.t.𝐔~is of form (3),𝐅𝐔~=𝟎.\displaystyle\underset{\widetilde{\mathbf{U}}}{\min}\;\mathrm{Tr}(\widetilde{\mathbf{U}}^{\top}\mathbf{L}\widetilde{\mathbf{U}}),\;\;\mathrm{s.t.}\;\widetilde{\mathbf{U}}\;\text{is of form \eqref{eq-prelim-U}},\;\mathbf{F}^{\top}\widetilde{\mathbf{U}}=\mathbf{0}. (8)

Similarly, we can relax (8) to

min𝐔Tr(𝐔𝐋𝐔),s.t.𝐔𝐔=𝐈,𝐅𝐔=𝟎.\displaystyle\underset{\mathbf{U}}{\min}\;\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}),\;\;\mathrm{s.t.}\;\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\;\mathbf{F}^{\top}\mathbf{U}=\mathbf{0}. (9)

Following traditional SC, existing FSC models perform k-means on the rows of 𝐔\mathbf{U} to obtain discrete cluster labels 𝐐\mathbf{Q}.
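For concreteness, the following Python/NumPy sketch (with hypothetical helper names, not our released implementation) illustrates the pipeline of (9) followed by k-means, using the nullspace construction of [14]: build 𝐅 from the group-membership vectors of Lemma 1, take an orthonormal basis of the nullspace of 𝐅ᵀ, and compute the eigenvectors of the K smallest eigenvalues of the reduced Laplacian.

import numpy as np
from scipy.linalg import null_space, eigh
from sklearn.cluster import KMeans

def fair_spectral_clustering(L, groups, K):
    """Fair spectral embedding (9) via the nullspace of F^T, then k-means.

    L      : (D, D) graph Laplacian
    groups : length-D integer array with values in {0, ..., S-1}, S >= 2
    K      : number of clusters
    """
    # F_[:, s] = f_s - (|D_s| / D) * 1 for s = 0, ..., S - 2 (Lemma 1)
    S = groups.max() + 1
    F = np.stack([(groups == s).astype(float) - np.mean(groups == s)
                  for s in range(S - 1)], axis=1)
    Z = null_space(F.T)                          # orthonormal basis of null(F^T)
    M = Z.T @ L @ Z                              # reduced ("fair") Laplacian
    _, Y = eigh(M, subset_by_index=[0, K - 1])   # K smallest eigenpairs
    U = Z @ Y                                    # continuous fair indicator matrix
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(U)
    return labels, U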

III-B Spectral Rotation

Spectral rotation [19] is an alternative to k-means for obtaining discrete clustering results from continuous labels 𝐔\mathbf{U}, which is formulated as

min𝐐,𝐑𝐐𝐔𝐑F2,\displaystyle\underset{\mathbf{Q},\mathbf{R}}{\min}\;\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}, s.t.𝐑𝐑=𝐈,𝐐,\displaystyle\mathrm{s.t.}\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}, (10)

where the set \mathcal{I} contains all discrete cluster indicator matrices, and 𝐑K×K\mathbf{R}\in\mathbb{R}^{K\times K} is an orthonormal matrix. According to the spectral solution invariance property [19], if 𝐔\mathbf{U} is a solution of (5), 𝐔𝐑\mathbf{U}\mathbf{R} is another solution. A suitable 𝐑\mathbf{R} can make 𝐔𝐑\mathbf{U}\mathbf{R} as close to 𝐐\mathbf{Q} as possible. In contrast, k-means is performed directly on the 𝐔\mathbf{U} obtained from spectral embedding, which may deviate far from the true discrete solution. Thus, spectral rotation usually achieves better performance than k-means [19].
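As a minimal illustration (assuming a given continuous matrix U; variable names are hypothetical), spectral rotation can be carried out by alternating a row-wise argmax step for Q and an orthogonal Procrustes step for R, mirroring the updates derived later in Section V:

import numpy as np

def spectral_rotation(U, n_iter=30):
    """Alternately solve (10) for Q and R given continuous labels U (D x K)."""
    D, K = U.shape
    R = np.eye(K)
    for _ in range(n_iter):
        # Q-step: row i of Q has a single 1 at argmax_j (U R)_{ij}
        idx = np.argmax(U @ R, axis=1)
        Q = np.zeros((D, K))
        Q[np.arange(D), idx] = 1.0
        # R-step: orthogonal Procrustes on Q^T U
        Tl, _, Tr = np.linalg.svd(Q.T @ U)
        R = Tr.T @ Tl.T
    return Q, R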

IV Model Formulation

In this section, we first theoretically analyze the impact of the constructed graph on FSC, which justifies an accurate graph for improving FSC performance. Then, we propose a novel graph construction method to learn graphs from potentially noisy observed data. Next, we integrate graph construction, fair spectral embedding, and discretization into an end-to-end framework. Finally, we analyze the connections between our model and existing works.

IV-A Why Do We Need an Accurate Graph?

We first introduce a variant of the stochastic block model [39] (vSBM) to generate random graphs with cluster structures and sensitive attributes [14]. This model assumes that there are two or more meaningful ground-truth clusterings of the observed data, and only one of them is fair. Assume that 𝒱\mathcal{V} comprises SS sensitive groups and is partitioned into KK clusters such that |𝒟s𝒞k|/|𝒞k|=ζs,s[S],k[K]|\mathcal{D}_{s}\cap\mathcal{C}_{k}|/|\mathcal{C}_{k}|=\zeta_{s},s\in[S],k\in[K], for ζs(0,1)\zeta_{s}\in(0,1) with s=1Sζs=1\sum_{s=1}^{S}\zeta_{s}=1. Based on the clusters and sensitive groups, we construct a random graph by connecting two vertices ii and jj with a probability Pr(i,j)\mathrm{Pr}(i,j) that depends on the clusters and sensitive groups of ii and jj. We define

Pr(i,j)={a,πC(i)=πC(j),πS(i)=πS(j)b,πC(i)πC(j),πS(i)=πS(j)c,πC(i)=πC(j),πS(i)πS(j)d,πC(i)πC(j),πS(i)πS(j),\displaystyle\mathrm{Pr}(i,j)=\begin{cases}a,&\pi_{C}(i)=\pi_{C}(j),\;\pi_{S}(i)=\pi_{S}(j)\\ b,&\pi_{C}(i)\neq\pi_{C}(j),\;\pi_{S}(i)=\pi_{S}(j)\\ c,&\pi_{C}(i)=\pi_{C}(j),\;\pi_{S}(i)\neq\pi_{S}(j)\\ d,&\pi_{C}(i)\neq\pi_{C}(j),\;\pi_{S}(i)\neq\pi_{S}(j),\end{cases} (11)

where πC:[D][K]\pi_{C}:[D]\to[K] and πS:[D][S]\pi_{S}:[D]\to[S] are two functions that assign a node i𝒱i\in\mathcal{V} to one of the clusters and sensitive groups, respectively. Let 𝐋\mathbf{L}^{*} be the real graph Laplacian matrix generated by the vSBM method and 𝐋^\widehat{\mathbf{L}} be the Laplacian matrix estimated by any graph construction method. The matrix 𝐋^\widehat{\mathbf{L}} is used as the input to fair spectral embedding in (9), and spectral rotation is utilized to obtain discrete cluster labels. Our goal is to derive a fair clustering error bound related to the estimation error between 𝐋^\widehat{\mathbf{L}} and 𝐋\mathbf{L}^{*}. Let us make some assumptions.

Assumption 1.

Let 𝐔^\widehat{\mathbf{U}} be a continuous cluster indicator matrix estimated from 𝐋^\widehat{\mathbf{L}} via (9). For a given constant ϵ>0\epsilon>0, the 𝐐^\widehat{\mathbf{Q}} and 𝐑^\widehat{\mathbf{R}} estimated by spectral rotation satisfy

𝐐^𝐔^𝐑^F2(1+ϵ)min𝐐,𝐑𝐑=𝐈𝐐𝐔^𝐑F2.\displaystyle\lVert\widehat{\mathbf{Q}}-\widehat{\mathbf{U}}\widehat{\mathbf{R}}\rVert_{\mathrm{F}}^{2}\leq(1+\epsilon)\underset{\mathbf{Q}\in\mathcal{I},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}}{\min}\lVert{\mathbf{Q}}-\widehat{\mathbf{U}}{\mathbf{R}}\rVert_{\mathrm{F}}^{2}. (12)
Assumption 2.

The ground-truth clustering and sensitive partitions of 𝒱\mathcal{V} satisfy

|𝒟s|=DS,|𝒞k|=DK,|𝒟s𝒞k||𝒞k|=1S.\displaystyle|\mathcal{D}_{s}|=\frac{D}{S},\;|\mathcal{C}_{k}|=\frac{D}{K},\;\frac{|\mathcal{D}_{s}\cap\mathcal{C}_{k}|}{|\mathcal{C}_{k}|}=\frac{1}{S}. (13)

Assumption 1 is similar to the (1+ϵ)(1+\epsilon)-approximation of k-means [40], which provides the estimation accuracy of spectral rotation. Assumption 2 is the same as that in Theorem 1 of [14], which is only made to facilitate theoretical analysis. In practice, Assumption 2 may be violated, which, however, does not affect the effectiveness of FSC algorithms [14]. Based on the two assumptions, we have the following proposition.

Proposition 1.

Let 𝐋\mathbf{L}^{*} be the real Laplacian matrix of the random graph generated by the vSBM method with a>b>c>da>b>c>d satisfying a>r1lnD/Da>r_{1}\ln{D}/D for some r1>0r_{1}>0, and 𝐋^\widehat{\mathbf{L}} be the estimated Laplacian matrix from observed data. Assume that we run fair spectral embedding (9) on 𝐋^\widehat{\mathbf{L}} and perform (1+ϵ)(1+\epsilon) spectral rotation (10) to obtain discrete cluster labels. Besides, let π^C(i)\widehat{\pi}_{C}(i) be the assigned cluster label (after proper permutation) of node ii, and define k:={i𝒞k:π^C(i)k}\mathcal{M}_{k}:=\left\{i\in\mathcal{C}_{k}:\widehat{\pi}_{C}(i)\neq k\right\} as the set of misclassified vertices of cluster kk. Under Assumptions 1-2, for every r2>0r_{2}>0, there exist constants C^=C^(r1,r2)\widehat{C}=\widehat{C}(r_{1},r_{2}) and C~=C~(r1,r2)\widetilde{C}=\widetilde{C}(r_{1},r_{2}) such that if

aK3lnDD(cd)2C^1+ϵ,\displaystyle\frac{aK^{3}\ln{D}}{D(c-d)^{2}}\leq\frac{\widehat{C}}{1+\epsilon}, (14)

then with probability at least 1Dr21-D^{-r_{2}}, the number of misclassified vertices, k=1K|k|\sum_{k=1}^{K}{|\mathcal{M}_{k}|}, is at most

C~(1+ϵ)aK2lnD(cd)2relatedtothevSBM+512(4+2ϵ)K2D(cd)2𝐙𝐋𝐙𝐙𝐋^𝐙F2relatedtographestimation,\displaystyle\underbrace{\frac{\widetilde{C}(1+\epsilon)aK^{2}\ln{D}}{(c-d)^{2}}}_{\mathrm{related\;to\;the\;vSBM\;}}+\underbrace{\frac{512(4+2\epsilon)K^{2}}{D(c-d)^{2}}\lVert\mathbf{Z}^{\top}\mathbf{L}^{*}\mathbf{Z}-\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z}\rVert_{\mathrm{F}}^{2}}_{\mathrm{related\;to\;graph\;estimation}}, (15)

where 𝐙D×(DS+1)\mathbf{Z}\in\mathbb{R}^{D\times(D-S+1)} is a matrix whose columns form the orthonormal basis of the nullspace of 𝐅\mathbf{F}^{\top}.

Proof.

The proof is inspired by [14], but has two main differences. First, spectral rotation instead of kkmeans is used to obtain discrete labels. Second, fair spectral embedding is based on an estimated graph rather than a known graph generated by the vSBM method. See Appendix A for details. ∎

According to [14], the meaning of “the number of misclassified vertices is at most DmD_{m}” is that there exists a permutation of cluster indices such that the clustering results, up to this permutation, correctly predict the cluster labels of all but DmD_{m} vertices. Note that the error bound consists of two parts. The first one is caused by the difference between the expected and real graph produced by the vSBM method, which is similar to [14]. The second part is related to the estimation error of graph construction methods. The fairness constraint affects clustering performance via 𝐙\mathbf{Z}, which is a matrix determined by the sensitive group-membership matrix 𝐅\mathbf{F}. For convenience, 𝐙𝐋𝐙\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z} is dubbed the fair graph. Generally, the error bound in (15) depends on KK, DD, and ϵ\epsilon. If we divide (15) by DD, we obtain a bound on the misclassification rate. The first part of the misclassification rate bound tends to zero as DD goes to infinity, meaning that if 𝐋\mathbf{L}^{*} is exactly estimated (the second part equals zero), performing FSC via (9) and spectral rotation is weakly consistent [14]. However, 𝐋\mathbf{L}^{*} usually cannot be estimated exactly, introducing an additional error into subsequent fair clustering results. If the fair graph estimation error does not grow quadratically with DD, the second part of the misclassification rate bound also decays to zero. Proposition 1 illustrates that a well-estimated graph 𝐋^\widehat{\mathbf{L}}, i.e., one close to 𝐋\mathbf{L}^{*}, yields a small misclassification error bound. This motivates us to seek a more effective method to construct accurate graphs from observed data.

IV-B The Proposed Graph Construction Method

Given NN observed data 𝐗oD×N\mathbf{X}_{o}\in\mathbb{R}^{D\times N}, we need to infer the underlying similarity graph topology as the input to FSC algorithms. However, contaminated data may lead to poor graph estimation performance, as indicated in Proposition 1, which degrades subsequent fair clustering performance. Therefore, we propose a method to learn graphs from potentially noisy data 𝐗o\mathbf{X}_{o}, which is formulated as

min𝐋,𝐗,𝝊>0\displaystyle\underset{\mathbf{L}\in\mathcal{L},\mathbf{X},\bm{\upsilon}>0}{\mathrm{min}}\,\, 1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+i=1D1𝝊[i]\displaystyle\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}
𝟏log(diag(𝐋))+βdiag0(𝐋)F2Reg(𝐋),\displaystyle\underbrace{-\mathbf{1}^{\top}\log\left(\mathrm{diag}(\mathbf{L})\right)+{\beta}\lVert\mathrm{diag_{0}}(\mathbf{L})\rVert^{2}_{\mathrm{F}}}_{Reg(\mathbf{L})}, (16)

where :={𝐋:𝐋𝕊D×D,𝐋𝟏=𝟎,𝐋[ij]0,ij}\mathcal{L}:=\left\{\mathbf{L}:\mathbf{L}\in\mathbb{S}^{D\times D},\,\mathbf{L}\mathbf{1}=\mathbf{0},\,\mathbf{L}_{[ij]}\leq 0,\,\,i\neq j\right\} contains all Laplacian matrices. Moreover, ξ\xi and β\beta are parameters, and 𝝊D\bm{\upsilon}\in\mathbb{R}^{D} is a vector of adaptive weights. We let 𝚼:=diag(𝝊)\mathbf{\Upsilon}:=\mathrm{diag}(\sqrt{\bm{\upsilon}}), where 𝝊=(𝝊[1],,𝝊[D])\sqrt{\bm{\upsilon}}=(\sqrt{\bm{\upsilon}_{[1]}},...,\sqrt{\bm{\upsilon}_{[D]}})^{\top}. Eq.(16) is a joint model of denoising and smoothness-based GL [30].

1) Denoising: If 𝐋\mathbf{L} is fixed, the problem (16) becomes

min𝐗,𝝊1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+i=1D1𝝊[i].\displaystyle\underset{\mathbf{X},\bm{\upsilon}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}. (17)

The model is a node-adaptive graph filter, and 𝝊\bm{\upsilon} represents node weights. Specifically, given node weights 𝝊\bm{\upsilon}, we have

min𝐗𝚼(𝐗o𝐗)F2+ξTr(𝐗𝐋𝐗).\displaystyle\underset{\mathbf{X}}{\mathrm{min}}\,\,\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+{\xi}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X}). (18)

Taking the derivative of (18) and setting it to zero, we obtain

𝐗=(𝚼𝚼+ξ𝐋)1𝚼𝚼𝐗o=(𝐈+ξ(𝚼𝚼)1𝐋)1𝐗o.\displaystyle\mathbf{X}=\left(\mathbf{\Upsilon}^{\top}\mathbf{\Upsilon}+{\xi}\mathbf{L}\right)^{-1}\mathbf{\Upsilon}^{\top}\mathbf{\Upsilon}\mathbf{X}_{o}={\left(\mathbf{I}+{\xi}(\mathbf{\Upsilon}^{\top}\mathbf{\Upsilon})^{-1}\mathbf{L}\right)^{-1}}\mathbf{X}_{o}. (19)

We let 𝐊:=(𝐈+ξ(𝚼𝚼)1𝐋)1\mathbf{K}:=\left(\mathbf{I}+{\xi}(\mathbf{\Upsilon}^{\top}\mathbf{\Upsilon})^{-1}\mathbf{L}\right)^{-1}, which is positive definite and has eigen-decomposition 𝐊=𝚯𝚲𝚯\mathbf{K}=\mathbf{\Theta}\mathbf{\Lambda}\mathbf{\Theta}^{\top} with eigenvalue matrix 𝚲\mathbf{\Lambda} and eigenvector matrix 𝚯\mathbf{\Theta}. Moreover, 𝚲=diag(11+ξλ1,.,11+ξλD)\mathbf{\Lambda}=\mathrm{diag}\left(\frac{1}{1+\xi\lambda_{1}},\ldots,\frac{1}{1+\xi\lambda_{D}}\right), where 0=λ1,.,λD0=\lambda_{1}\leq\ldots\leq\lambda_{D} are the eigenvalues of (𝚼𝚼)1𝐋(\mathbf{\Upsilon}^{\top}\mathbf{\Upsilon})^{-1}\mathbf{L}. From the perspective of the graph Fourier transform (GFT) [28], 𝐊𝐗o=𝚯𝚲𝚯𝐗o\mathbf{K}\mathbf{X}_{o}=\mathbf{\Theta}\mathbf{\Lambda}\mathbf{\Theta}^{\top}\mathbf{X}_{o} can be interpreted as follows: the observed graph signals (columns of 𝐗o\mathbf{X}_{o}) are first transformed to the graph frequency domain via 𝚯\mathbf{\Theta}^{\top}, their GFT coefficients are attenuated according to 𝚲\mathbf{\Lambda}, and the result is transformed back to the nodal domain via 𝚯\mathbf{\Theta}. It is observed from 𝚲\mathbf{\Lambda} that the graph filter 𝐊\mathbf{K} is low-pass since the attenuation is stronger for larger eigenvalues. Thus, the graph filter suppresses the high-frequency components of the raw data 𝐗o\mathbf{X}_{o}, attenuating noise on the graph and outputting the “noiseless” signals 𝐗\mathbf{X}.
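A minimal NumPy sketch of this denoising step, assuming 𝐋, the node weights 𝝊, and ξ are given, is shown below; it applies the closed-form filter (19) by solving a linear system rather than explicitly inverting the matrix.

import numpy as np

def node_adaptive_filter(X_o, L, upsilon, xi):
    """Apply (19): X = (Upsilon^T Upsilon + xi * L)^{-1} Upsilon^T Upsilon X_o,
    where Upsilon^T Upsilon = diag(upsilon)."""
    G = np.diag(upsilon)
    return np.linalg.solve(G + xi * L, G @ X_o)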

Our graph filter 𝐊\mathbf{K} differs from the Auto-Regressive graph filter (𝐈+ξ𝐋)1\left(\mathbf{I}+{\xi}\mathbf{L}\right)^{-1} [41] in that we assign each node an individual weight 𝝊[i],i=1,,D\bm{\upsilon}_{[i]},i=1,...,D. The reason for using 𝝊\bm{\upsilon} is that the measurement noise of different nodes may be heterogeneous. If the ii-th node signal (the ii-th row of 𝐗o\mathbf{X}_{o}) has a small noise scale, a large 𝝊[i]\bm{\upsilon}_{[i]} should be assigned to the fidelity term of node ii in (17) to ensure 𝐗[i,:]\mathbf{X}_{[i,:]} is close to the corresponding observation (𝐗o)[i,:](\mathbf{X}_{o})_{[i,:]} [42]. When the noise scale is not known a priori, we can adaptively learn 𝝊\bm{\upsilon} from the data. Specifically, given 𝐗\mathbf{X}, the problem (17) becomes

min𝝊>01Ni=1D𝝊[i](𝐗o)[i,:]𝐗[i,:]22+1𝝊[i].\displaystyle\underset{\bm{\upsilon}>0}{\mathrm{min}}\,\,\frac{1}{N}\sum_{i=1}^{D}\bm{\upsilon}_{[i]}\lVert(\mathbf{X}_{o})_{[i,:]}-\mathbf{X}_{[i,:]}\rVert_{2}^{2}+\frac{1}{\bm{\upsilon}_{[i]}}. (20)

Intuitively, solving (20) will assign a large 𝝊[i]\bm{\upsilon}_{[i]} to node ii if 𝐗[i,:]\mathbf{X}_{[i,:]} is close to (𝐗o)[i,:](\mathbf{X}_{o})_{[i,:]}, as expected.

2) Graph learning: If we have obtained the “noiseless” signals 𝐗\mathbf{X} via the graph filter 𝐊\mathbf{K}, the problem (16) becomes

min𝐋ξNTr(𝐗𝐋𝐗)+Reg(𝐋).\displaystyle\underset{\mathbf{L}\in\mathcal{L}}{\mathrm{min}}\,\,\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+{Reg(\mathbf{L})}. (21)

The first Laplacian quadratic term of (21) is equivalent to

1NTr(𝐗𝐋𝐗)=1Nn=1Ni,j=1D𝐖[ij](𝐗[in]𝐗[jn])2,\displaystyle\frac{1}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})=\frac{1}{N}\sum_{n=1}^{N}\sum_{i,j=1}^{D}\mathbf{W}_{[ij]}\left(\mathbf{X}_{[in]}-\mathbf{X}_{[jn]}\right)^{2}, (22)

which measures the average smoothness of data 𝐗\mathbf{X} over the graph 𝐋\mathbf{L} [30]. The second term of (21) contains regularizers that endow the learned graphs with desired properties. The log\log-degree term controls the node degrees, and the Frobenius norm term controls graph sparsity. Our model (21) can learn a graph suitable for graph-based clustering tasks for the following reasons. (i) It is observed from (22) that minimizing the smoothness term seeks a graph in which vertices with similar signals are strongly connected, which is consistent with the fundamental principle of SC. (ii) The log\log-degree term can avoid isolated nodes, which is crucial for SC, especially for normalized SC [5]. (iii) The Frobenius norm term can lead to a sparse graph, which may remove redundant and noisy edges.

The model (21) is similar to the ANGL method [26] since both construct graphs by minimizing the smoothness. The main differences lie in three aspects. (i) Our model removes the sum-to-one constraint of the ANGL method, which forces the degree of each node to be one, since this constraint makes the output graphs sensitive to noisy points [27]. Removing this constraint allows our model to capture more complex similarity relationships. (ii) We add a log\log-degree term to ensure the learned graph has no isolated nodes. (iii) The input data of (21) are the signals produced by the low-pass graph filter.

3) Discussion: We now explain why our method is effective in constructing graphs from observed data. If data 𝐗o\mathbf{X}_{o} have a clustering structure, they should follow the cluster and manifold assumptions, i.e., data points in the same cluster are close to each other. According to [43], smooth signals containing low-frequency parts tend to follow the cluster and manifold assumptions. Thus, if 𝐋\mathbf{L} accurately represents the graph behind observed data, the denoising part of our model has two functions. First, it filters out the high-frequency components of the observed graph signals that correspond to noise. Second, it produces smooth signals that have a clearer clustering structure, which could facilitate subsequent clustering. To better illustrate the effectiveness of the node-adaptive filter, Fig.2 depicts the t-SNE [44] results of our method on the MNIST dataset, where four clusters correspond to four randomly selected digits. We can see that the raw data are entangled. In contrast, the data 𝐗\mathbf{X} denoised by the graph filter are clearly separated, meaning that the denoising part of our model can produce cluster-friendly signals. From the perspective of GL, our model (21) learns a graph minimizing the smoothness of data, i.e., the nodes corresponding to similar signals are closely connected. Thus, the learned graph can effectively capture similarity relationships between data and preserve clustering structures. Consequently, the denoising operation and the smoothness-based GL reinforce each other to produce a high-quality graph for subsequent fair clustering tasks.

Refer to caption
Figure 2: The t-SNE results of MNIST with different ξ\xi values.
Refer to caption
Figure 3: The illustration of the proposed framework.

IV-C The Unified FSC Model

In this subsection, we build an end-to-end FSC framework that inputs observed data 𝐗o\mathbf{X}_{o} and node attributes 𝐅\mathbf{F} and directly outputs discrete cluster labels. As shown in Fig.3, our model consists of four modules, i.e., denoising, graph learning, fair spectral embedding, and discretization. First, we construct graphs from the observed data by the proposed method (16). Once we obtain 𝐋\mathbf{L}, the Laplacian matrix together with 𝐅\mathbf{F} can be directly used to perform fair spectral embedding (9) to obtain the continuous clustering label matrix 𝐔\mathbf{U}. Finally, we leverage spectral rotation (10) instead of k-means to obtain discrete cluster labels. In addition to its superior performance, as stated in Section III-B, we utilize spectral rotation because it can be flexibly integrated into an end-to-end framework. Integrating all the above subtasks into a single objective function, we obtain

min𝐗,𝐋,𝝊,𝐘,𝐑,𝐐1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+Reg(𝐋)\displaystyle\underset{\mathbf{X},\mathbf{L},\bm{\upsilon},\mathbf{Y},\mathbf{R},\mathbf{Q}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})
+i=1D1𝝊[i]+μTr(𝐔𝐋𝐔)+γ𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐋,𝝊>0,𝐔𝐔=𝐈,𝐅𝐔=𝟎,𝐑𝐑=𝐈,𝐐,\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\bm{\upsilon}>0,\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{{U}}=\mathbf{0},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}, (23)

where μ\mu and γ\gamma are two parameters. The modules are not simply stacked together; they are bridged by two Laplacian quadratic terms, i.e., ξNTr(𝐗𝐋𝐗)\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X}) and μTr(𝐔𝐋𝐔)\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}). First, ξNTr(𝐗𝐋𝐗)\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X}) can be viewed as the graph Tikhonov regularizer of the denoising task (18) to output smooth signals [41]. On the other hand, it measures smoothness in the GL task to capture the similarity relationships between data. Second, μTr(𝐔𝐋𝐔)\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}) together with 𝐅𝐔=𝟎\mathbf{F}^{\top}\mathbf{U}=\mathbf{0} performs fair spectral embedding, whose output serves as the input to the discretization. It is also used to impose structural constraints on the constructed graph, which is discussed in the next subsection. The four modules are coupled with each other to achieve overall optimal results for all subtasks.
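For reference, a small Python sketch (dense NumPy arrays, hypothetical function name) that evaluates the objective value of (23), e.g., for monitoring the convergence of the algorithm in Section V, is:

import numpy as np

def objective_value(X_o, X, L, upsilon, U, R, Q, xi, beta, mu, gamma):
    """Evaluate the unified objective in (23) for a given set of variables."""
    N = X_o.shape[1]
    Ups = np.diag(np.sqrt(upsilon))
    fidelity = np.linalg.norm(Ups @ (X_o - X), 'fro') ** 2 / N
    smooth = xi * np.trace(X.T @ L @ X) / N
    off = L - np.diag(np.diag(L))                   # off-diagonal part of L, i.e., -W
    reg = -np.sum(np.log(np.diag(L))) + beta * np.linalg.norm(off, 'fro') ** 2
    weights = np.sum(1.0 / upsilon)
    embedding = mu * np.trace(U.T @ L @ U)
    rotation = gamma * np.linalg.norm(Q - U @ R, 'fro') ** 2
    return fidelity + smooth + reg + weights + embedding + rotation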

To better understand how the fairness constraint works, we introduce a new variable matrix 𝐘(DS+1)×K\mathbf{Y}\in\mathbb{R}^{(D-S+1)\times K} and let 𝐔=𝐙𝐘\mathbf{U}=\mathbf{Z}\mathbf{Y}, where 𝐙\mathbf{Z} is the matrix defined in Proposition 1. The matrix 𝐅\mathbf{F} encodes sensitive information, as does 𝐙\mathbf{Z}. Then, problem (23) can be rephrased in terms of 𝐘\mathbf{Y} as

min𝐗,𝐋,𝝊,𝐘,𝐑,𝐐1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+Reg(𝐋)\displaystyle\underset{\mathbf{X},\mathbf{L},\bm{\upsilon},\mathbf{Y},\mathbf{R},\mathbf{Q}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})
+i=1D1𝝊[i]+μTr(𝐘𝐙𝐋𝐙𝐘)+γ𝐐𝐙𝐘𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}+\mu\mathrm{Tr}(\mathbf{Y}^{\top}\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}\mathbf{Y})+\gamma\lVert\mathbf{Q}-\mathbf{Z}\mathbf{Y}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐋,𝝊>0,𝐘𝐘=𝐈,𝐑𝐑=𝐈,𝐐.\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\bm{\upsilon}>0,\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (24)

In (24), the fairness constraint 𝐅𝐔=𝟎\mathbf{F}^{\top}\mathbf{{U}}=\mathbf{0} is removed. We conduct spectral embedding on the fair graph 𝐙𝐋𝐙\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}, which encodes graph topology and sensitive information simultaneously, instead of 𝐋\mathbf{L} to ensure fair clustering. The impact of fairness constraints is discussed in the next subsection.

The basic formulation (23) is flexible and has many possible extensions. Here are some examples. (i) We can replace spectral rotation with improved spectral rotation [45] to further improve discretization performance. (ii) We can introduce self-weighted features into (23) to determine the importance of different features in assigning cluster labels [46]. (iii) We can extend (23) from unnormalized SC (5) to normalized SC [47]. (iv) We can also incorporate individual fairness [17] into our unified model. We place the details of these extensions in the supplementary material for completeness.

Remark 1.

The above extensions may improve fair clustering performance. However, we focus on the basic formulation (23) since our primary goal is to demonstrate the advantages of the proposed graph construction method and the unified framework rather than to propose a complex FSC model.

IV-D Connections to Existing Works

1) Connections to community-based GL models: If we only focus on GL and fair spectral embedding, Eq.(24) becomes

min𝐋,𝐘ξNTr(𝐗𝐋𝐗)+Reg(𝐋)+μTr(𝐘𝐙𝐋𝐙𝐘)\displaystyle\underset{\mathbf{L},\mathbf{Y}}{\min}\;\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})+\mu\mathrm{Tr}(\mathbf{Y}^{\top}\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}\mathbf{Y})
s.t.𝐋,𝐘𝐘=𝐈,\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I}, (25)

where 𝐗\mathbf{X} is regarded as the “noiseless” data here. According to Ky Fan’s theorem [48], we have min𝐘𝐘=𝐈Tr(𝐘𝐙𝐋𝐙𝐘)=k=1Kλ~k\underset{\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I}}{\min}\;\mathrm{Tr}(\mathbf{Y}^{\top}\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}\mathbf{Y})=\sum_{k=1}^{K}\widetilde{\lambda}_{k}, where λ~k\widetilde{\lambda}_{k} is the k-th smallest eigenvalue of 𝐙𝐋𝐙\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}. Thus, the problem (25) can be rephrased as

min𝐋ξNTr(𝐗𝐋𝐗)+Reg(𝐋)+μk=1Kλ~k.\displaystyle\underset{\mathbf{L}\in\mathcal{L}}{\min}\;\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})+\mu\sum_{k=1}^{K}\widetilde{\lambda}_{k}. (26)

Note that 𝐙𝐋𝐙\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z} is a positive semi-definite matrix, i.e., λ~k0\widetilde{\lambda}_{k}\geq 0. Minimizing (26) is equivalent to forcing k=1Kλ~k0\sum_{k=1}^{K}\widetilde{\lambda}_{k}\to 0 if μ\mu is large enough. That is, (26) encourages the fair graph to have KK connected components. Therefore, (26) can be viewed as a community-based GL model, which has been widely studied. For example, [49] forces the KK smallest eigenvalues of the Laplacian matrix to be zero to obtain community structures, which can be relaxed to the last term in (26). The works [50, 26] constrain the rank of the Laplacian matrix to be DKD-K, which can also be interpreted as minimizing the sum of the KK smallest eigenvalues. Furthermore, [51] adds a term Tr(𝚵𝐋𝚵)\mathrm{Tr}(\mathbf{\Xi}^{\top}\mathbf{L}\mathbf{\Xi}) to impose community constraints, where 𝚵𝚵\mathbf{\Xi}\mathbf{\Xi}^{\top} contains the value 1 for within-community edges only and 0 everywhere else. Although closely related, (25) differs from existing community-based GL models in two key aspects. First, the basic GL models are different. Our model is based on smoothness-based GL, while [49, 51] are based on statistical GL models like Graphical Lasso, and [50, 26] are based on the ANGL method. Second, our model imposes the community constraint on the fair graph 𝐙𝐋𝐙\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z} rather than on 𝐋\mathbf{L} as in existing works. Thus, the fairness constraint may affect the topology of the learned graph to obtain fair clustering. We will test the impact of the fairness constraint in the experimental section.

2) Connections to unified SC models: If we remove the denoising module and fairness constraint, our model becomes

min𝐋,𝐔,𝐑,𝐐ξNTr(𝐗𝐋𝐗)+Reg(𝐋)+μTr(𝐔𝐋𝐔)\displaystyle\underset{\mathbf{L},\mathbf{U},\mathbf{R},\mathbf{Q}}{\min}\;\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
+γ𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;+\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐋,𝐔𝐔=𝐈,𝐑𝐑=𝐈,𝐐.\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (27)

Again, 𝐗\mathbf{X} is treated as the observed data. The model (27) is an end-to-end SC model. Here, we discuss the connections between our model and those unified SC models integrating graph construction, spectral embedding, and discretization. As stated in Remark 1, we focus on basic formulations without additional extensions. The first model we compare is [20]

min𝐖,𝐔,𝐐,𝐑𝐗𝐖𝐗F2+αU𝐖1,1+μUTr(𝐔𝐋𝐔)\displaystyle\underset{\mathbf{W},\mathbf{U},\mathbf{Q},\mathbf{R}}{\min}\lVert\mathbf{X}-\mathbf{W}^{\top}\mathbf{X}\rVert_{\mathrm{F}}^{2}+\alpha_{U}\lVert\mathbf{W}\rVert_{1,1}+{\mu_{U}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
+γU𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;+\gamma_{U}\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐖𝒲,𝐔𝐔=𝐈,𝐑𝐑=𝐈,𝐐,\displaystyle\mathrm{s.t.}\;\mathbf{W}\in\mathcal{W},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}, (28)

where αU,μU\alpha_{U},\mu_{U}, and γU\gamma_{U} are constant parameters. Moreover, 𝒲={𝐖:𝐖𝕊D×D,𝐖0,diag(𝐖)=𝟎}\mathcal{W}=\left\{\mathbf{W}:\mathbf{W}\in\mathbb{S}^{D\times D},\mathbf{W}\geq 0,\mathrm{diag}(\mathbf{W})=\mathbf{0}\right\} is the set containing all adjacency matrices. This is a unified SC model that leverages the sparse representation method [24] to construct graphs, which is different from our GL method.

Another unified SC model [23, 36] can be summarized as

min𝐖,𝐔,𝐐,𝐑i,j=1D𝐗[i,:]𝐗[j,:]22𝐖[i,j]+βJ𝐖[i,j]2\displaystyle\underset{\mathbf{W},\mathbf{U},\mathbf{Q},\mathbf{R}}{\min}\sum_{i,j=1}^{D}\lVert\mathbf{X}_{[i,:]}-\mathbf{X}_{[j,:]}\rVert_{2}^{2}\mathbf{W}_{[i,j]}+\beta_{J}\mathbf{W}_{[i,j]}^{2}
+μJTr(𝐔𝐋𝐔)+γJ𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;+{\mu_{J}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma_{J}\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐖𝟏=𝟏,𝐖0,𝐋=𝐃𝐖,𝐔𝐔=𝐈,\displaystyle\mathrm{s.t.}\;\mathbf{W}\mathbf{1}=\mathbf{1},\mathbf{W}\geq 0,\mathbf{L}=\mathbf{D}-\mathbf{W},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},
𝐑𝐑=𝐈,𝐐,\displaystyle\;\;\;\;\;\;\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}, (29)

where βJ,μJ\beta_{J},\mu_{J}, and γJ\gamma_{J} are constant parameters. The graph construction method in (29) is the ANGL method [26]. We have discussed the difference between our graph construction method and the ANGL in the previous subsection.

In summary, our model (27) differs from the existing unified SC models (28)-(29) mainly in the graph construction method. As Proposition 1 states, an accurate GL method can boost fair clustering performance. In the experimental section, we develop fair versions of (28)-(29) and compare them with our model (23) to illustrate the superiority of our model.

V Model Optimization

In this section, we first propose an algorithm for solving (23), followed by convergence and complexity analyses.

V-A Optimization Algorithm

Our algorithm alternately updates 𝐋,𝐔,𝐑\mathbf{L},\mathbf{U},\mathbf{R}, 𝐐\mathbf{Q}, 𝐗\mathbf{X}, and 𝝊\bm{\upsilon} in (23), i.e., updating one with the others fixed. For clarity, we omit the iteration index here. The following derivations are the updates in one iteration.

1) Update 𝐋\mathbf{L}: The sub-problem of updating 𝐋\mathbf{L} is

min𝐋ξNTr(𝐗𝐋𝐗)+Reg(𝐋)+μTr(𝐔𝐋𝐔).\displaystyle\underset{\mathbf{L}\in\mathcal{L}}{\min}\;\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}). (30)

The problem can be rewritten in terms of 𝐖\mathbf{W}

min𝐖𝒲12𝐖𝐏1,1+RegW(𝐖),\displaystyle\underset{\mathbf{W}\in\mathcal{W}}{\min}\;\frac{1}{2}\lVert\mathbf{W}\circ\mathbf{P}\rVert_{1,1}+Reg_{W}(\mathbf{W}), (31)

where

𝐏[ij]=ξN𝐗[i,:]𝐗[j,:]22+μ𝐔[i,:]𝐔[j,:]22,\displaystyle\mathbf{P}_{[ij]}=\frac{\xi}{N}\lVert\mathbf{X}_{[i,:]}-\mathbf{X}_{[j,:]}\rVert_{2}^{2}+{\mu}\lVert\mathbf{U}_{[i,:]}-\mathbf{U}_{[j,:]}\rVert_{2}^{2}, (32)

and RegW(𝐖)=𝟏log(𝐖𝟏)+β𝐖F2Reg_{W}(\mathbf{W})=-\mathbf{1}^{\top}\mathrm{log}(\mathbf{W}\mathbf{1})+{\beta}\lVert\mathbf{W}\rVert_{\mathrm{F}}^{2}. By the definition of 𝒲\mathcal{W}, the free variables of 𝐖\mathbf{W} are the upper triangle elements. Thus, we define a vector 𝐰P,P:=D(D1)2\mathbf{w}\in\mathbb{R}^{P},P:=\frac{D(D-1)}{2}, satisfying that 𝐰=Triu(𝐖)\mathbf{w}=\mathrm{Triu}(\mathbf{W}), where Triu():D×DP\mathrm{Triu}(\cdot):\mathbb{R}^{D\times D}\to\mathbb{R}^{P} is a function that converts the upper triangular elements of a matrix into a vector. Then, the problem (31) is equivalent to

min𝐰0𝐩𝐰𝟏log(𝐒𝐰)+2β𝐰22,\displaystyle\underset{\mathbf{w}\geq 0}{\min}\;\mathbf{p}^{\top}\mathbf{w}-\mathbf{1}^{\top}\log(\mathbf{S}\mathbf{w})+2\beta\lVert\mathbf{w}\rVert_{2}^{2}, (33)

where 𝐩=Triu(𝐏\mathbf{p}=\mathrm{Triu}(\mathbf{P}), 𝐒D×P\mathbf{S}\in\mathbb{R}^{D\times P} is a linear operator satisfying 𝐒𝐰=𝐖𝟏\mathbf{S}\mathbf{w}=\mathbf{W}\mathbf{1} [30]. The problem (33) is convex, and we employ the algorithm in [52] to solve the problem. The complete algorithm flow is presented in the supplementary materials. After obtaining the estimated 𝐰{\mathbf{w}}, we let 𝐖=iTriu(𝐰){\mathbf{W}}=\mathrm{iTriu}({\mathbf{w}}), where iTriu():PD×D\mathrm{iTriu}(\cdot):\mathbb{R}^{P}\to\mathbb{R}^{D\times D} is the inverse Triu\mathrm{Triu} operation. The operation iTriu(𝐰)\mathrm{iTriu}({\mathbf{w}}) converts 𝐰{\mathbf{w}} into an adjacency matrix, where 𝐰{\mathbf{w}} corresponds to the upper triangle elements of 𝐖{\mathbf{W}}. Finally, we calculate the Laplacian matrix from 𝐖{\mathbf{W}} and feed it into subsequent updates of other variables.
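In our experiments the primal-dual solver of [52] is used; purely as an illustration of the structure of (33) (and not the algorithm of [52]), a simple projected-gradient sketch with an explicit sparse operator 𝐒 is given below. The helper names, step size, and iteration count are hypothetical.

import numpy as np
import scipy.sparse as sp

def degree_operator(D):
    """Sparse S (D x P) with S @ w = W @ 1, for w = Triu(W) in row-major order."""
    rows, cols = np.triu_indices(D, k=1)
    P = rows.size
    e = np.arange(P)
    data = np.ones(2 * P)
    return sp.coo_matrix((data, (np.concatenate([rows, cols]),
                                 np.concatenate([e, e]))), shape=(D, P)).tocsr()

def solve_w(p, D, beta, step=1e-3, n_iter=500):
    """Projected gradient on (33): min_{w >= 0} p^T w - 1^T log(S w) + 2*beta*||w||_2^2."""
    S = degree_operator(D)
    w = np.ones(D * (D - 1) // 2)
    for _ in range(n_iter):
        grad = p - S.T @ (1.0 / (S @ w + 1e-12)) + 4.0 * beta * w
        w = np.maximum(w - step * grad, 0.0)   # projection onto w >= 0
    return w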

2) Update 𝐔\mathbf{U}: The sub-problem of updating 𝐔\mathbf{U} is

min𝐔μTr(𝐔𝐋𝐔)+γ𝐐𝐔𝐑F2\displaystyle\underset{\mathbf{U}}{\min}\;\mu\mathrm{Tr}\left(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}\right)+\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐔𝐔=𝐈,𝐅𝐔=𝟎.\displaystyle\mathrm{s.t.}\;\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{U}=\mathbf{0}. (34)

Like (24), (34) can be cast into a problem of variable 𝐘\mathbf{Y}

min𝐘𝐘=𝐈μTr(𝐘𝐙𝐋𝐙𝐘)+γ𝐐𝐙𝐘𝐑F2\displaystyle\underset{\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I}}{\min}\;\mu\mathrm{Tr}\left(\mathbf{Y}^{\top}\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}\mathbf{Y}\right)+\gamma\lVert\mathbf{Q}-\mathbf{Z}\mathbf{Y}\mathbf{R}\rVert_{\mathrm{F}}^{2}
\displaystyle\Leftrightarrow min𝐘𝐘=𝐈μTr(𝐘𝐙𝐋𝐙𝐘)2γTr(𝐑𝐐𝐙𝐘).\displaystyle\underset{\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I}}{\min}\;\mu\mathrm{Tr}\left(\mathbf{Y}^{\top}\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}\mathbf{Y}\right)-2\gamma\mathrm{Tr}(\mathbf{R}\mathbf{Q}^{\top}\mathbf{Z}\mathbf{Y}). (35)

This is a typical quadratic optimization problem with orthogonal constraints. Let ϕ(𝐘)\phi(\mathbf{Y}) be the objective function of (35). We have that ϕ(𝐘)\phi(\mathbf{Y}) is differentiable, and 𝐘ϕ(𝐘)=2μ𝐙𝐋𝐙𝐘2γ𝐙𝐐𝐑\nabla_{\mathbf{Y}}\,\phi(\mathbf{Y})=2\mu\mathbf{Z}^{\top}\mathbf{L}\mathbf{Z}\mathbf{Y}-2\gamma\mathbf{Z}^{\top}\mathbf{Q}\mathbf{R}^{\top}. Thus, the problem can be efficiently solved via the algorithm in [53]. After obtaining 𝐘\mathbf{Y}, we let U=ZY\textbf{U}=\textbf{Z}\textbf{Y}.
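Our implementation relies on the curvilinear search of [53]; a simpler (and slower) sketch conveying the idea is a Riemannian gradient step on the Stiefel manifold followed by a QR retraction, where A denotes 𝐙ᵀ𝐋𝐙 and B denotes 𝐙ᵀ𝐐𝐑ᵀ (the step size and iteration count are hypothetical):

import numpy as np

def update_Y(A, B, Y, mu, gamma, step=1e-2, n_iter=200):
    """Approximately minimize mu*Tr(Y^T A Y) - 2*gamma*Tr(B^T Y) s.t. Y^T Y = I."""
    for _ in range(n_iter):
        G = 2.0 * mu * A @ Y - 2.0 * gamma * B        # Euclidean gradient of (35)
        sym = (Y.T @ G + G.T @ Y) / 2.0               # sym(Y^T G)
        G_tan = G - Y @ sym                           # tangent-space projection
        Y, _ = np.linalg.qr(Y - step * G_tan)         # QR retraction back to Y^T Y = I
    return Y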

3) Update 𝐑\mathbf{R}: The sub-problem of updating 𝐑\mathbf{R} is

min𝐑𝐑=𝐈γ𝐐𝐔𝐑F2\displaystyle\underset{\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}}{\min}\;\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
\displaystyle\Leftrightarrow max𝐑𝐑=𝐈Tr(𝐐𝐔𝐑).\displaystyle\underset{\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}}{\max}\;\mathrm{Tr}(\mathbf{Q}^{\top}\mathbf{U}\mathbf{R}). (36)

It is the orthogonal Procrustes problem with a closed-form solution [54]. Assuming that 𝚯L\mathbf{\Theta}_{L} and 𝚯R\mathbf{\Theta}_{R} are the left and right matrices of SVD of 𝐐𝐔\mathbf{Q}^{\top}\mathbf{U}, the solution to (36) is [54]

𝐑=𝚯R𝚯L.\displaystyle\mathbf{R}=\mathbf{\Theta}_{R}\mathbf{\Theta}_{L}^{\top}. (37)

4) Update 𝐐\mathbf{Q}: The sub-problem of updating 𝐐\mathbf{Q} is

min𝐐γ𝐐𝐔𝐑F2\displaystyle\underset{\mathbf{Q}\in\mathcal{I}}{\min}\;\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
\displaystyle\Leftrightarrow min𝐐γTr(𝐐𝐐)2γTr(𝐐𝐔𝐑)\displaystyle\underset{\mathbf{Q}\in\mathcal{I}}{\min}\;\gamma\mathrm{Tr}(\mathbf{Q}^{\top}\mathbf{Q})-2\gamma\mathrm{Tr}(\mathbf{Q}^{\top}\mathbf{U}\mathbf{R})
\displaystyle\Leftrightarrow max𝐐Tr(𝐐𝐔𝐑).\displaystyle\underset{\mathbf{Q}\in\mathcal{I}}{\max}\;\mathrm{Tr}(\mathbf{Q}^{\top}\mathbf{U}\mathbf{R}). (38)

The optimal solution to (38) is as follows:

𝐐[ik]={1k=argmaxj[K](𝐔𝐑)[ij],0others.\displaystyle\mathbf{Q}_{[ik]}=\begin{cases}1&k=\mathrm{argmax}_{j\in[K]}\;(\mathbf{U}\mathbf{R})_{[ij]},\\ 0&\mathrm{others}.\end{cases} (39)

5) Update 𝐗\mathbf{X}: The sub-problem of updating 𝐗\mathbf{X} is (17), which has a closed-form solution (19). However, matrix inversion is computationally expensive with complexity 𝒪(D3)\mathcal{O}(D^{3}). Fortunately, 𝚼𝚼+ξ𝐋\mathbf{\Upsilon}^{\top}\mathbf{\Upsilon}+{\xi}\mathbf{L} is symmetric, sparse, and positive definite. We can hence solve (17) efficiently using the conjugate gradient (CG) algorithm without explicit matrix inversion [55].
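A sketch of this step with SciPy's CG solver (column by column, since (17) decouples over the columns of 𝐗) is:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def update_X(X_o, L, upsilon, xi):
    """Solve (diag(upsilon) + xi * L) X = diag(upsilon) X_o by conjugate gradient."""
    A = sp.diags(upsilon) + xi * sp.csr_matrix(L)   # sparse, symmetric, positive definite
    B = upsilon[:, None] * X_o
    cols = [cg(A, B[:, n])[0] for n in range(X_o.shape[1])]
    return np.column_stack(cols)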

6) Update 𝛖\bm{\upsilon}: The sub-problem of updating 𝝊\bm{\upsilon} is (20). Taking the derivative of (20) and setting it to zero, we have

𝝊[i]=N(𝐗o)[i,:]𝐗[i,:]2,i=1,,D.\displaystyle\bm{\upsilon}_{[i]}=\frac{\sqrt{N}}{\lVert(\mathbf{X}_{o})_{[i,:]}-\mathbf{X}_{[i,:]}\rVert_{2}},\;\;i=1,...,D. (40)

It is observed that the updates of 𝐋\mathbf{L}, 𝐔\mathbf{U}, 𝐑\mathbf{R}, 𝐐\mathbf{Q}, 𝐗\mathbf{X}, and 𝝊\bm{\upsilon} are coupled with each other. Updating one variable depends on the other variables, leading to an overall optimal solution. The complete procedure is presented in Algorithm 1.

Algorithm 1 The algorithm for problem (23)
0:  Data matrix 𝐗oD×N\mathbf{X}_{o}\in\mathbb{R}^{D\times N}, sensitive attributes related matrix 𝐅\mathbf{F} or 𝐙\mathbf{Z}, the number of clusters KK, model parameters ξ,β,μ\xi,\beta,\mu, and γ\gamma
0:  The learned 𝐋\mathbf{L} and discrete cluster labels 𝐐\mathbf{Q}
1:  Initialize 𝐋\mathbf{L}, 𝐔\mathbf{U}, 𝐐\mathbf{Q}, and 𝐑\mathbf{R} randomly, 𝐗=𝐗o\mathbf{X}=\mathbf{X}_{o}, and 𝝊=𝟏\bm{\upsilon}=\mathbf{1}
2:  while not converged do
3:     Calculate 𝐏\mathbf{P} by (32) and let 𝐩=Triu(𝐏)\mathbf{p}=\mathrm{Triu(\mathbf{P})}
4:     Update 𝐰\mathbf{w} by solving (33)
5:     Convert 𝐖=iTriu(𝐰)\mathbf{W}=\mathrm{iTriu}(\mathbf{w}) and calculate 𝐋=𝐃𝐖\mathbf{L}=\mathbf{D}-\mathbf{W}
6:     Update 𝐘\mathbf{Y} by solving problem (35) using the algorithm in [53], and let 𝐔=𝐙𝐘\mathbf{U}=\mathbf{Z}\mathbf{Y}
7:     Update 𝐑=𝚯R𝚯L\mathbf{R}=\mathbf{\Theta}_{R}\mathbf{\Theta}_{L}^{\top}, where 𝚯L\mathbf{\Theta}_{L} and 𝚯R\mathbf{\Theta}_{R} are the left and right matrices of SVD of 𝐐𝐔\mathbf{Q}^{\top}\mathbf{U}
8:     Update 𝐐\mathbf{Q} via (39)
9:     Update 𝐗\mathbf{X} by solving (17)
10:     Update 𝝊\bm{\upsilon} using (40)
11:  end while

V-B Convergence and Complexity Analysis

1) Convergence analysis: It is challenging to obtain a globally optimal solution to (23) since it is not jointly convex in all variables. However, our algorithm for solving each sub-problem can reach its optimal solution. Specifically, when we update 𝐋\mathbf{L}, the problem (33) is convex, and the corresponding algorithm is guaranteed to converge to the global optimum [52]. When updating 𝐔\mathbf{U}, we use the algorithm in [53] to solve the problem (35), which can converge to the global optimum [53]. The updates of 𝐐\mathbf{Q}, 𝐑\mathbf{R}, and 𝝊\bm{\upsilon} have closed-form solutions. Although the update of 𝐗\mathbf{X} in (18) has a closed-form solution (19), we update 𝐗\mathbf{X} using CG, which is guaranteed to converge [55]. In summary, the update of each variable converges in our algorithm. In practice, the whole algorithm converges well, which is verified experimentally in Section VI.

2) Complexity analysis: In one iteration, our algorithm consists of six parts, which we analyze one by one below. As stated in [52], the update of 𝐋\mathbf{L} requires 𝒪(T1D2)\mathcal{O}(T_{1}D^{2}) costs, where T1T_{1} is the average number of iterations of updating 𝐰\mathbf{w}. The computational cost can be further reduced if the average number of neighbors per node is fixed; see [56] and analysis therein. The computational complexity of our algorithm for updating 𝐔\mathbf{U} is 𝒪(T2(DK2+K3))\mathcal{O}(T_{2}(DK^{2}+K^{3})) according to [53], where T2T_{2} is the average number of iterations of the algorithm in [53]. When updating 𝐑\mathbf{R}, we perform SVD on 𝐐𝐔K×K\mathbf{Q}^{\top}\mathbf{U}\in\mathbb{R}^{K\times K}, which costs 𝒪(K3)\mathcal{O}(K^{3}). The updates of 𝐐\mathbf{Q} and 𝝊\bm{\upsilon} require 𝒪(DK2)\mathcal{O}(DK^{2}) and 𝒪(DN)\mathcal{O}(DN), respectively. Finally, the complexity of using CG to update 𝐗\mathbf{X} is 𝒪(T3DN)\mathcal{O}(T_{3}DN), where T3T_{3} is the average number of iterations of the CG algorithm.

VI Experiments

In this section, we test our proposed model using synthetic, benchmark, and real-world data. First, some experimental setups are introduced.

VI-A Experimental Setups

1) Graph generation: For synthetic data, we leverage the vSBM method to generate random graphs with sensitive attributes. Specifically, we let \zeta_{s}=\frac{1}{S}, a=0.8, b=0.2, c=0.15, and d=0.05. After obtaining the connections among nodes, we assign each edge a random weight in [0.1,2]. Finally, we normalize the edge weights to satisfy \mathrm{Tr}(\mathbf{L}^{*})=D.
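For illustration only, the snippet below sketches this generation pipeline. The exact vSBM construction follows the definition given earlier in the paper; here we simply assume, as one plausible reading, that a, b, c, and d are the connection probabilities for node pairs in (same cluster, same group), (same cluster, different group), (different cluster, same group), and (different cluster, different group), respectively.

```python
import numpy as np

def generate_vsbm_graph(cluster, group, a=0.8, b=0.2, c=0.15, d=0.05, seed=0):
    """Rough sketch of the vSBM-style generation, under the ASSUMED reading
    of a/b/c/d described above; `cluster` and `group` are length-D label vectors."""
    rng = np.random.default_rng(seed)
    D = len(cluster)
    W = np.zeros((D, D))
    for i in range(D):
        for j in range(i + 1, D):
            same_c, same_g = cluster[i] == cluster[j], group[i] == group[j]
            p = a if (same_c and same_g) else (b if same_c else (c if same_g else d))
            if rng.random() < p:
                W[i, j] = W[j, i] = rng.uniform(0.1, 2.0)  # random edge weight in [0.1, 2]
    L = np.diag(W.sum(axis=1)) - W
    scale = D / np.trace(L)                                # normalize so that Tr(L*) = D
    return scale * W, scale * L
```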

2) Signal generation: We generate NN observed signals from the following Gaussian distribution [29]

(𝐗o)[:,n]𝒩(𝟎,(𝐋)+𝚺e),n=1,,N,\displaystyle(\mathbf{X}_{o})_{[:,n]}\sim\mathcal{N}\left(\mathbf{0},(\mathbf{L}^{*})^{{\dagger}}+\mathbf{\Sigma}_{e}\right),\;\;n=1,...,N, (41)

where 𝚺e=diag(σ12,,σD2)\mathbf{\Sigma}_{e}=\mathrm{diag}(\sigma_{1}^{2},...,\sigma_{D}^{2}) and σi\sigma_{i} is the noise scale of the ii-th node. As stated in [29], signals generated in this way are smooth over the corresponding graph.
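A minimal NumPy sketch of this sampling step, directly following (41), is shown below; the helper name and the example noise scales are ours.

```python
import numpy as np

def generate_signals(L_true, sigma, N, seed=0):
    """Draw N smooth graph signals per Eq. (41):
    x_n ~ N(0, pinv(L*) + Sigma_e), with Sigma_e = diag(sigma_1^2, ..., sigma_D^2)."""
    rng = np.random.default_rng(seed)
    D = L_true.shape[0]
    cov = np.linalg.pinv(L_true) + np.diag(np.asarray(sigma) ** 2)
    return rng.multivariate_normal(np.zeros(D), cov, size=N).T  # X_o is D x N

# Example noise scales in the spirit of Table II (illustrative, not the exact seeds used):
# sigma = np.random.default_rng(1).uniform(0.4, 0.6, size=D)
```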

3) Evaluation metrics: In topology inference, determining whether two vertices are connected can be regarded as a binary classification problem. Thus, we employ the F1-score (FS\mathrm{FS}) to evaluate classification results

FS=2TP2TP+FN+FP,\displaystyle\mathrm{FS}=\frac{2\mathrm{TP}}{2\mathrm{TP}+\mathrm{FN}+\mathrm{FP}}, (42)

where \mathrm{TP}, \mathrm{FP}, and \mathrm{FN} denote the numbers of true positives, false positives, and false negatives, respectively. We also use the estimation error (\mathrm{EE}) to evaluate the learned graph

EE=𝐙𝐋^𝐙𝐙𝐋𝐙F.\displaystyle\mathrm{EE}=\lVert\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z}-\mathbf{Z}^{\top}\mathbf{L}^{*}\mathbf{Z}\lVert_{\mathrm{F}}. (43)

For a fair comparison of EE\mathrm{EE}, we normalize the learned graphs to Tr(𝐋^)=D\mathrm{Tr}(\widehat{\mathbf{L}})=D. For fair clustering, we use the same two metrics as in [14]: clustering error (CE\mathrm{CE}) and Balance (Bal\mathrm{Bal})

\displaystyle\mathrm{CE}=\frac{1}{D}\left|\{i:\widehat{\pi}_{C}(i)\neq{\pi}_{C}(i),\,i=1,...,D\}\right|,
Balance(Bal)=1Kk=1KBalance(Ck),\displaystyle\mathrm{Balance}\;(\mathrm{Bal})=\frac{1}{K}\sum_{k=1}^{K}\mathrm{Balance}(C_{k}), (44)

where π^C(i)\widehat{\pi}_{C}(i) is the estimated cluster label of node ii (after proper permutation), and πC(i){\pi}_{C}(i) is the ground-truth. The metric Balance\mathrm{Balance} measures the average balance of all clusters.
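The sketch below shows one way to compute \mathrm{FS}, \mathrm{EE}, and \mathrm{CE}, with the label permutation in \mathrm{CE} resolved by the Hungarian algorithm; \mathrm{Balance} follows the definition in [14] and is omitted here. The function names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def f_score(W_true, W_hat, tol=1e-8):
    """F1-score of Eq. (42), treating edge detection as binary classification."""
    iu = np.triu_indices_from(W_true, k=1)
    t, p = np.abs(W_true)[iu] > tol, np.abs(W_hat)[iu] > tol
    tp, fp, fn = np.sum(t & p), np.sum(~t & p), np.sum(t & ~p)
    return 2 * tp / (2 * tp + fn + fp)

def estimation_error(L_true, L_hat, Z):
    """EE of Eq. (43); both Laplacians are assumed normalized to Tr(L) = D."""
    return np.linalg.norm(Z.T @ L_hat @ Z - Z.T @ L_true @ Z, 'fro')

def clustering_error(labels_true, labels_hat, K):
    """CE of Eq. (44): error rate after the best label permutation (Hungarian matching)."""
    C = np.zeros((K, K))
    for t, h in zip(labels_true, labels_hat):
        C[t, h] += 1
    rows, cols = linear_sum_assignment(-C)              # maximize matched samples
    return 1.0 - C[rows, cols].sum() / len(labels_true)
```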

4) Baselines: The comparison baselines are listed in Table I. The model Fairlets is the fair version of kmedian [8]. Models 3-5 are implementations of [14] using different graph construction methods. FGLASSO [57] is the only baseline that jointly performs graph construction and fair spectral embedding. FSRSC and FJGSED are the fair versions of the unified SC models (28) and (29); their formulations and algorithms are given in the supplementary material.

TABLE I: Comparison baselines
  Index Models Graph-based Fair End-to-End GL method
1 kkmeans
2 Fairlets
3 CorrFSC PC
4 KNNFSC kk-NN
5 ε\varepsilonNNFSC ε\varepsilon-NN
6 FGLASSO GLASSO
7 FJGSED ANGL
8 FSRSC SR
 

5) Determination of parameters: For our model, we first grid-search \xi and \beta in the range [0.001,0.1] and keep the pair achieving the best \mathrm{FS} on the graph learning task. Then, \mu and \gamma are grid-searched in the range [0.001,1] and selected as the pair achieving the best \mathrm{CE}. All parameters of the baselines are likewise selected as those achieving the best \mathrm{CE} values. A sketch of this two-stage search is given below.
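In the sketch, the callbacks `run_graph_learning` (returning \mathrm{FS}) and `run_fair_clustering` (returning \mathrm{CE}) are hypothetical placeholders standing in for our model, and the specific grid points are only examples within the stated ranges.

```python
import itertools

def grid_search(run_graph_learning, run_fair_clustering):
    """Two-stage grid search: (xi, beta) by best FS, then (mu, gamma) by best CE."""
    grid_small = [0.001, 0.005, 0.01, 0.05, 0.1]   # illustrative points in [0.001, 0.1]
    grid_large = [0.001, 0.01, 0.1, 1.0]           # illustrative points in [0.001, 1]
    # stage 1: pick (xi, beta) maximizing the F1-score of the learned graph
    xi, beta = max(itertools.product(grid_small, grid_small),
                   key=lambda p: run_graph_learning(xi=p[0], beta=p[1]))
    # stage 2: pick (mu, gamma) minimizing the clustering error CE
    mu, gamma = min(itertools.product(grid_large, grid_large),
                    key=lambda p: run_fair_clustering(xi, beta, mu=p[0], gamma=p[1]))
    return xi, beta, mu, gamma
```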

TABLE II: The results of our model and the compared baselines under different cases.
  σi𝒰(0,0.2),N=1000\sigma_{i}\sim\mathcal{U}(0,0.2),N=1000 σi𝒰(0.4,0.6),N=1000\sigma_{i}\sim\mathcal{U}(0.4,0.6),N=1000 σi𝒰(0,0.2),N=5000\sigma_{i}\sim\mathcal{U}(0,0.2),N=5000 σi𝒰(0.4,0.6),N=5000\sigma_{i}\sim\mathcal{U}(0.4,0.6),N=5000
FS\mathrm{FS}\uparrow EE\mathrm{EE}\downarrow CE\mathrm{CE}\downarrow Bal\mathrm{Bal}\uparrow FS\mathrm{FS}\uparrow EE\mathrm{EE}\downarrow CE\mathrm{CE}\downarrow Bal\mathrm{Bal}\uparrow FS\mathrm{FS}\uparrow EE\mathrm{EE}\downarrow CE\mathrm{CE}\downarrow Bal\mathrm{Bal}\uparrow FS\mathrm{FS}\uparrow EE\mathrm{EE}\downarrow CE\mathrm{CE}\downarrow Bal\mathrm{Bal}\uparrow
kkmeans 0.671 0.191 0.687 0.149 0.635 0.161 0.667 0.145
Fairlets 0.658 0.485 0.665 0.457 0.611 0.355 0.623 0.348
CorrFSC 0.472 2.858 0.567 0.482 0.441 3.016 0.578 0.705 0.630 2.529 0.104 0.874 0.596 2.511 0.156 0.859
KNNFSC 0.105 0.687 0.829 0.103 0.703 0.626 0.113 0.729 0.628 0.098 0.682 0.731
EpsNNFSC 0.086 0.729 0.380 0.094 0.739 0.333 0.091 0.718 0.652 0.065 0.724 0.369
FGLASSO 0.482 3.902 0.411 0.616 0.450 3.724 0.406 0.646 0.587 3.971 0.291 0.657 0.574 3.533 0.271 0.908
FJGSED 0.271 28.159 0.724 0.359 0.263 22.626 0.734 0.240 0.325 23.576 0.604 0.579 0.293 31.552 0.734 0.247
FSRSC 0.374 5.222 0.724 0.619 0.355 9.671 0.733 0.607 0.345 5.049 0.729 0.766 0.512 10.024 0.739 0.663
Ours 0.501 2.375 0.286 0.845 0.474 2.414 0.390 0.801 0.715 1.691 0.052 0.960 0.674 2.174 0.142 0.870
 
  • \uparrow means that higher value is better and \downarrow means that lower value is better.

  • σi𝒰(a1,a2),i=1,,D\sigma_{i}\sim\mathcal{U}(a1,a2),i=1,...,D, means that the noise scale of the ii-th node is from the uniform distribution 𝒰(a1,a2).\mathcal{U}(a1,a2).

Figure 4: The visualization of the learned graphs (unnormalized weights) when N=5000 and \sigma_{i}\sim\mathcal{U}(0.4,0.6). Panels: (a) Ground-truth, (b) CorrFSC, (c) KNNFSC, (d) EpsNNFSC, (e) FGLASSO, (f) FJGSED, (g) FSRSC, (h) Ours.

VI-B Synthetic Data

1) Model performance: We first compare our model with all baselines in four cases. We let D=192, K=4, and S=2. As listed in Table II, our model outperforms kmeans and Fairlets on the clustering metrics because it exploits the structural information behind the raw data. The graphs established by the KNNFSC and EpsNNFSC methods are not evaluated by \mathrm{EE} since no edge weights are assigned. Among Models 3-5, CorrFSC achieves the best graph learning performance (\mathrm{FS}) as well as the best clustering performance (\mathrm{CE}). However, the graph construction performance of these three methods is inferior to that of our model, leading to unsatisfactory fair clustering results. Furthermore, compared with the three methods, our model unifies all separate stages into a single optimization objective, avoiding the suboptimality caused by separate optimization. The reason our model outperforms FGLASSO could be that FGLASSO uses a separate kmeans step to obtain the final cluster labels; in addition, our method learns better graphs than FGLASSO. Although FJGSED and FSRSC also perform fair clustering in an end-to-end manner, our model obtains superior fair clustering performance owing to the more accurate graphs constructed by our method. Finally, our model has a node-adaptive graph filter to denoise the observed signals, so it obtains the best graph construction performance under different levels of noise contamination.

We visualize the learned graphs of different methods in Fig.4. We see that EpsNNFSC fails to capture the clustering structure, resulting in the worst fair clustering performance. The graph of KNNFSC tends to have imbalanced node degrees, and the graph of FSRSC has small edge weights. Compared with CorrFSC, FGLASSO, and FJGSED, the graph of our model has fewer noisy edges and clearer cluster structures.

2) The effect of K and S: We set D=192, N=5000, and \sigma_{i}\sim\mathcal{U}(0.4,0.6). In the first case, we fix S=2 and vary K from 2 to 6. In the second case, we fix K=2 and vary S from 2 to 6. Fig. 5 shows that the fair clustering performance degrades as K increases (\mathrm{CE} increases and \mathrm{Balance} decreases), which is consistent with Proposition 1. In contrast, the fair clustering performance is less affected by S.

Figure 5: The effect of (a) K and (b) S on clustering results.

3) The effect of D: We let N=10^{4}, \sigma_{i}\sim\mathcal{U}(0.4,0.6), K=4, and S=2. As depicted in Fig. 6, for a fixed data size, \mathrm{CE} first decreases and then increases as D grows. The reason may be that, as stated in Proposition 1, the misclassification rate of FSC algorithms on graphs generated by the vSBM method decreases with D if the underlying graph is exactly estimated. However, the quality of the estimated graph declines for large D when N is fixed, so the second part of the error bound in Proposition 1 worsens. If the performance improvement brought by increasing D is smaller than the degradation caused by the graph estimation error, the fair clustering performance decreases when D is large.

Figure 6: The effect of D on (a) graph learning and (b) clustering.

4) The sensitivity of parameters: We let D=196, K=4, S=2, N=5000, and \sigma_{i}\sim\mathcal{U}(0.4,0.6). First, we fix \mu=0.01 and \gamma=0.01 and vary \beta and \xi from 0.001 to 0.1. We then fix \beta=0.01 and \xi=0.05 and vary \mu and \gamma from 0.001 to 1. As shown in Fig. 8, our model achieves consistent GL and fair clustering performance except when \beta is too small and \xi is too large. Moreover, the GL performance is more sensitive to \mu than to \gamma. There exist combinations of \mu and \gamma that achieve satisfactory \mathrm{CE} and \mathrm{Balance} simultaneously.

5) The effect of the fairness constraint: We consider a special case where the real graph contains two clusters, each consisting of samples from a single sensitive group. In this case, the \mathrm{Balance} of the ground-truth clustering is zero. We then perform clustering using our FSC model and a variant with the fairness constraint removed. As shown in Fig. 7, if we remove the fairness constraint, our model groups all samples exactly. With the fairness constraint, however, some samples are deliberately misclassified to improve fairness. We list the corresponding model performance in Table III. Our model achieves a significantly higher \mathrm{Balance} value at the cost of reduced clustering accuracy, and its GL performance is also degraded by the fairness constraint. Thus, if the clusters of the underlying graph are highly unbalanced with respect to the sensitive groups, fairness constraints may degrade GL and clustering performance.

TABLE III: The results of removing fairness.
  FS\mathrm{FS} EE\mathrm{EE} CE\mathrm{CE} Bal\mathrm{Bal}
w/o fairness 0.734 1.211 0 0
Ours 0.719 1.353 0.484 0.867
 
Figure 7: The effect of the fairness constraint. Panels: (a) the ground-truth, (b) w/o fairness, (c) Ours. Colors represent clusters, while mark shapes represent sensitive groups.

6) Ablation study: Three cases are taken into consideration. (i) We construct graphs using (16), conduct fair spectral embedding, and discretize using spectral rotation separately to test the benefit of a unified model (Ours-Sep). (ii) We construct graphs using (17) and conduct fair spectral embedding jointly. After obtaining continuous results, we exploit kkmeans as the discretization step to test the benefit of spectral rotation (Ours-kkmeans). (iii) We remove the denoising module in our model to test the benefit of the node-adaptive graph filter (Ours-noDN). We let D=196,K=4,S=2,N=5000D=196,K=4,S=2,N=5000, and σi𝒰(0.4,0.6)\sigma_{i}\sim\mathcal{U}(0.4,0.6), and the results are listed in Table IV. Our model outperforms Ours-Sep, demonstrating the benefit of a unified model. Although the graph of Ours-kkmeans is well estimated, it obtains the worst CE\mathrm{CE} due to the poor performance of kkmeans. Our model outperforms Ours-noDN because it has a low-pass filter to enhance graph construction.

TABLE IV: The results of ablation studies.
  FS\mathrm{FS} EE\mathrm{EE} CE\mathrm{CE} Bal\mathrm{Bal}
Ours-Sep 0.637 2.333 0.250 0.704
Ours-kkmeans 0.635 2.320 0.549 0.353
Ours-noDN 0.623 2.354 0.276 0.694
Ours 0.658 2.203 0.167 0.782
 
Figure 8: The effect of parameter sensitivity. (a)-(d) The results of varying \xi and \beta, reported as FS, EE, CE, and Balance. (e)-(h) The results of varying \mu and \gamma, reported as FS, EE, CE, and Balance.
Figure 9: The convergence of our algorithm under (a) \sigma_{i}\sim\mathcal{U}(0,0.2) and (b) \sigma_{i}\sim\mathcal{U}(0.4,0.6).
Figure 10: The results of the benchmark datasets. (a)-(b) The fair clustering results of the FACEBOOK and DRUGNET datasets. (c)-(d) The real and the learned FACEBOOK network. Colors represent clusters, while mark shapes represent sensitive groups.

7) Convergence: Finally, we test the convergence of our algorithm. We let D=192, K=4, and S=2. As shown in Fig. 9, the objective function values decrease monotonically with the number of iterations. Moreover, our algorithm converges within a few iterations, indicating fast convergence.

VI-C Benchmark Data

In this section, we test the performance of our model on commonly used FSC benchmark datasets [14]. The first dataset is a high school friendship network named FACEBOOKNET (http://www.sociopatterns.org/datasets/high-school-contact-and-friendship-networks/). The dataset contains a graph whose vertices represent high school students and whose edges represent connections between students on Facebook. After data preprocessing, we obtain 155 students. Gender is considered a sensitive attribute, so all vertices are divided into two groups, i.e., male and female. The second dataset, DRUGNET (https://sites.google.com/site/ucinetsoftware/datasets/covert-networks/drugnet), is a network encoding acquaintanceship between drug users in Hartford. After data preprocessing, we obtain 193 vertices. We use ethnicity as a sensitive attribute and split the vertices into three groups: African Americans, Latinos, and others. Note that previous FSC work [14] is based on a given graph, and these two datasets only contain ground-truth graphs and no observed signals. However, one of the primary advantages of our model is that we can group observed data without the real graph structures. Thus, we generate data via (41) based on the ground-truth networks and then use our model to group vertices from the observed data. For comparison, we apply the FSC algorithm in [14] (FairSC) and unnormalized spectral clustering (SC) to the real networks to cluster vertices. We aim to demonstrate that our model can achieve competitive fair clustering performance even without the real graphs. Following [14], we use \mathrm{Balance} and \mathrm{RatioCut} as evaluation metrics since we have no real labels. We let N=1000 and \sigma_{i}\sim\mathcal{U}(0,0.2). As displayed in Fig. 10 (a)-(b), for the two datasets, our model achieves almost the same \mathrm{RatioCut} as FairSC and SC, which are based on the ground-truth networks, even though we do not know the underlying graphs. Meanwhile, compared with FairSC and SC, our model improves \mathrm{Balance} at only a moderate sacrifice of \mathrm{RatioCut}. Figures 10 (c)-(d) depict the real FACEBOOKNET graph and the graph learned by our model when K=2. Fewer edges are learned between the two clusters, suggesting that our model tends to learn a graph that is more suitable for clustering. Furthermore, two clusters are clearly observed in our learned graph, meaning that our model can fairly partition the nodes from the observed data even without the real graph.

Figure 11: The results of the MovieLens dataset. (a)-(b) The fair clustering results of different models. (c)-(d) The learned sub-graphs of KNNFSC and our model when K=2. Colors represent clusters, while mark shapes represent sensitive attributes.

VI-D Real Data

1) MovieLens 100K dataset: We employ the MovieLens 100K dataset (http://www.grouplens.org) to group movies by their ratings. This dataset contains ratings of 1682 movies by 943 users in the range [1,5] and is sparse since many movies have few ratings. To alleviate the impact of sparsity, we select the top 200 most-rated movies from the 1682 movies, which yields a who-rated-what matrix \mathbf{X}\in\mathbb{R}^{200\times 943}. This matrix can be used to construct a movie-movie similarity graph strongly correlated with how users explicitly rate items [58], so we can perform clustering on the similarity graph to group movies with similar attributes. However, as stated in [57], old movies tend to obtain higher ratings because only masterpieces have survived. To obtain fair results unbiased by production time, we consider the movie year as a sensitive attribute: movies made before 1991 are considered old, while the others are considered new. To evaluate clustering results, we conduct traditional item-based collaborative filtering (CF) on each cluster, termed cluster CF, to predict user ratings of movies. As claimed in [57], if the obtained clusters accurately contain sets of similar items, cluster CF can better predict user ratings of movies. Therefore, we follow [57] and use the root mean square error (\mathrm{RMSE}) between the predicted and true ratings as an evaluation metric in addition to \mathrm{Balance} [57, 58]. Figures 11 (a)-(b) depict the fair clustering results of different models. Our model obtains the highest \mathrm{Balance} among all models except KNNFSC. However, KNNFSC performs poorly on \mathrm{RMSE}, indicating unsatisfactory clustering results, possibly because the graph constructed by KNNFSC hardly characterizes the similarity relationships between movies. In contrast, our model achieves the best \mathrm{RMSE} since it better reveals the similarity relationships behind the observed data. In Fig. 11 (c)-(d), we provide the learned sub-graphs and clustering results of the top 30 rated movies when K=2. The graph learned by KNNFSC has isolated nodes since those nodes are connected only to movies outside the top 30 rated movies. In our graph, nodes 1, 4, 11, and 18 are closely connected because they belong to the Star Wars series, whereas in Fig. 11 (c) they are not connected. Moreover, our model successfully groups these four nodes into the same cluster.
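As a simplified illustration of the cluster CF evaluation described above, the sketch below runs item-based CF with cosine similarity inside each cluster and reports the RMSE over the observed ratings; the exact CF variant used in [57] may differ, so this is only one plausible instantiation.

```python
import numpy as np

def cluster_cf_rmse(R, labels):
    """R: item-by-user rating matrix (zeros = missing); labels: cluster id per item.
    Predict each observed rating from the other items in the same cluster via
    cosine-similarity-weighted averaging, then return the RMSE."""
    errs = []
    for k in np.unique(labels):
        Rk = R[labels == k]                                   # items of cluster k
        norms = np.linalg.norm(Rk, axis=1, keepdims=True) + 1e-12
        S = (Rk / norms) @ (Rk / norms).T                     # item-item cosine similarity
        np.fill_diagonal(S, 0.0)
        mask = (Rk > 0).astype(float)
        denom = S @ mask + 1e-12                              # total similarity of raters
        pred = (S @ Rk) / denom                               # weighted-average prediction
        obs = mask > 0
        errs.append((pred[obs] - Rk[obs]) ** 2)
    return float(np.sqrt(np.concatenate(errs).mean()))
```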

2) MNIST-USPS dataset: The second dataset we employ is the MNIST-USPS dataset (http://yann.lecun.com/exdb/mnist, https://www.kaggle.com/bistaumanga/usps-dataset), which contains two sub-datasets, i.e., MNIST and USPS. Both sub-datasets contain images of handwritten digits from 0 to 9. We cluster these images and use the digits as the ground-truth cluster labels. Specifically, we randomly select 48 images from each sub-dataset, covering four digits with twelve images per digit. We finally obtain 96 images and resize each image to a 28\times 28 matrix. We take each image as a node in a graph and flatten the corresponding matrix as the node signals, so the observed data are \mathbf{X}\in\mathbb{R}^{96\times 784}. We take the domain source of an image (MNIST or USPS) as the sensitive attribute; thus, we have S=2 and K=4. We use \mathrm{CE} and \mathrm{Balance} as evaluation metrics since we have real labels but no ground-truth graphs. As shown in Fig. 12, our model achieves the best fair clustering performance in terms of both \mathrm{CE} and \mathrm{Balance}, indicating its superiority. The reason for CorrFSC, KNNFSC, EpsNNFSC, and FSRSC achieving unsatisfactory \mathrm{CE} may be that the corresponding graphs for this dataset cannot reflect the real topological similarity.

Figure 12: The clustering results of the MNIST-USPS dataset.

VII Conclusion

In this paper, we theoretically analyzed the impact of similarity graphs on FSC performance. Motivated by the analysis, we proposed a graph construction method for FSC tasks as well as an end-to-end FSC framework. Then, we designed an efficient algorithm to alternately update the variables corresponding to each submodule in our model. Extensive experiments showed that our approach is superior to state-of-the-art (fair) SC models. Future research directions may include developing more scalable FSC algorithms.

Appendix A Proof of Proposition 1

We first provide the following lemma.

Lemma 2.

For any \epsilon>0 and any two matrices \mathbf{U},\widehat{\mathbf{U}}\in\mathbb{R}^{D\times K} such that \mathbf{U}=\mathbf{Q}\mathbf{R} with \mathbf{Q}\in\mathcal{I} and \mathbf{R}^{\top}\mathbf{R}=\mathbf{I}, let (\widehat{\mathbf{Q}},\widehat{\mathbf{R}}) be a (1+\epsilon) approximation of \widehat{\mathbf{U}} obtained by spectral rotation as in Assumption 1, and let \breve{\mathbf{U}}=\widehat{\mathbf{Q}}\widehat{\mathbf{R}}. Then, for any \delta_{k}\geq 0, defining \widetilde{\mathcal{M}}_{k}=\left\{i\in\mathcal{C}_{k}:\lVert\mathbf{U}_{[i,:]}-\breve{\mathbf{U}}_{[i,:]}\rVert_{2}\geq\delta_{k}/2\right\} for k=1,...,K, we have

k=1K|~k|δk24(4+2ϵ)𝐔𝐔^F2,\displaystyle\sum_{k=1}^{K}|\widetilde{\mathcal{M}}_{k}|\delta_{k}^{2}\leq 4(4+2\epsilon)\lVert{\mathbf{U}}-\widehat{\mathbf{U}}\rVert_{\mathrm{F}}^{2}, (45)
Proof.

First, by the procedure of spectral rotation, we have

\displaystyle\widehat{\mathbf{Q}},\widehat{\mathbf{R}}=\underset{\mathbf{Q},\mathbf{R}}{\arg\min}\;\lVert\mathbf{Q}-\widehat{\mathbf{U}}{\mathbf{R}}\rVert_{\mathrm{F}}^{2}\;\;\;\mathrm{s.t.}\;\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\;\mathbf{Q}\in\mathcal{I}
\displaystyle\Leftrightarrow\;\widehat{\mathbf{Q}},\widehat{\mathbf{R}}=\underset{\mathbf{Q},\mathbf{R}}{\arg\min}\;\lVert\widehat{\mathbf{U}}-\mathbf{Q}{\mathbf{R}}\rVert_{\mathrm{F}}^{2}\;\;\;\mathrm{s.t.}\;\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\;\mathbf{Q}\in\mathcal{I}. (46)

Then, based on Assumption 1, we can obtain that

𝐔^𝐐^𝐑^F2(1+ϵ)min𝐐,𝐑𝐑=𝐈𝐔^𝐐𝐑F2\displaystyle\lVert\widehat{\mathbf{U}}-\widehat{\mathbf{Q}}\widehat{\mathbf{R}}\rVert_{\mathrm{F}}^{2}\leq(1+\epsilon)\underset{\mathbf{Q}\in\mathcal{I},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}}{\min}\lVert\widehat{\mathbf{U}}-{\mathbf{Q}}{\mathbf{R}}\rVert_{\mathrm{F}}^{2}
\displaystyle\Rightarrow 𝐔^𝐔˘F2(1+ϵ)𝐔^𝐔F2.\displaystyle\lVert\widehat{\mathbf{U}}-\breve{\mathbf{U}}\rVert_{\mathrm{F}}^{2}\leq(1+\epsilon)\lVert\widehat{\mathbf{U}}-\mathbf{U}\rVert_{\mathrm{F}}^{2}. (47)

It is not difficult to obtain the following inequalities

𝐔˘𝐔F22𝐔˘𝐔^F2+2𝐔^𝐔F2\displaystyle\lVert\breve{\mathbf{U}}-\mathbf{U}\rVert_{\mathrm{F}}^{2}\leq 2\lVert\breve{\mathbf{U}}-\widehat{\mathbf{U}}\rVert_{\mathrm{F}}^{2}+2\lVert\widehat{\mathbf{U}}-\mathbf{U}\rVert_{\mathrm{F}}^{2}
(4+2ϵ)𝐔^𝐔F2.\displaystyle\leq(4+2\epsilon)\lVert\widehat{\mathbf{U}}-{\mathbf{U}}\rVert_{\mathrm{F}}^{2}. (48)

The first inequality holds due to the basic inequality \lVert\mathbf{A}+\mathbf{B}\rVert_{\mathrm{F}}^{2}\leq 2\lVert\mathbf{A}\rVert_{\mathrm{F}}^{2}+2\lVert\mathbf{B}\rVert_{\mathrm{F}}^{2}, and the second one holds due to (47). Finally, since every i\in\widetilde{\mathcal{M}}_{k} contributes at least \delta_{k}^{2}/4 to \lVert\breve{\mathbf{U}}-\mathbf{U}\rVert_{\mathrm{F}}^{2}, summing over k yields the conclusion (45). ∎

We start our proof of Proposition 1, which is inspired by [14]. To incorporate the fairness constraint into the objective function of (9), we let 𝐔^=𝐙𝐘^\widehat{\mathbf{U}}=\mathbf{Z}\widehat{\mathbf{Y}}, where 𝐘^\widehat{\mathbf{Y}} contains the eigenvectors of 𝐙𝐋^𝐙\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z} corresponding to the KK smallest eigenvalues. Suppose that 𝐘¯\bar{\mathbf{Y}} contains the eigenvectors of 𝐙𝐋¯𝐙\mathbf{Z}^{\top}\bar{\mathbf{L}}\mathbf{Z} corresponding to the KK smallest eigenvalues, where 𝐋¯\bar{\mathbf{L}} is the expected Laplacian matrix of the graphs generated by the vSBM method. We apply spectral rotation on 𝐔^\widehat{\mathbf{U}} estimated from 𝐋^\widehat{\mathbf{L}} by solving (9). For any 𝐕K×K\mathbf{V}\in\mathbb{R}^{K\times K} satisfying 𝐕𝐕=𝐈,𝐕𝐕=𝐈\mathbf{V}^{\top}\mathbf{V}=\mathbf{I},\mathbf{V}\mathbf{V}^{\top}=\mathbf{I}, it is not difficult to obtain

\displaystyle\lVert 𝐙𝐘¯𝐙𝐘^𝐕F2=Tr((𝐘¯𝐘^𝐕)𝐙𝐙(𝐘¯𝐘^𝐕))\displaystyle\mathbf{Z}\bar{\mathbf{Y}}-\mathbf{Z}\widehat{\mathbf{Y}}\mathbf{V}\rVert_{\mathrm{F}}^{2}=\mathrm{Tr}\left((\bar{\mathbf{Y}}-\widehat{\mathbf{Y}}\mathbf{V})^{\top}\mathbf{Z}^{\top}\mathbf{Z}(\bar{\mathbf{Y}}-\widehat{\mathbf{Y}}\mathbf{V})\right)
=\displaystyle= 𝐘¯𝐘^𝐕F2.\displaystyle\lVert\bar{\mathbf{Y}}-\widehat{\mathbf{Y}}\mathbf{V}\rVert_{\mathrm{F}}^{2}. (49)

Therefore, we have

min𝐕𝐕=𝐈,𝐕𝐕=𝐈𝐙𝐘¯𝐙𝐘^𝐕F=min𝐕𝐕=𝐈,𝐕𝐕=𝐈𝐘¯𝐘^𝐕F\displaystyle\underset{\mathbf{V}^{\top}\mathbf{V}=\mathbf{I},\mathbf{V}\mathbf{V}^{\top}=\mathbf{I}}{\min}\;\lVert\mathbf{Z}\bar{\mathbf{Y}}-\mathbf{Z}\widehat{\mathbf{Y}}\mathbf{V}\rVert_{\mathrm{F}}=\underset{\mathbf{V}^{\top}\mathbf{V}=\mathbf{I},\mathbf{V}\mathbf{V}^{\top}=\mathbf{I}}{\min}\;\lVert\bar{\mathbf{Y}}-\widehat{\mathbf{Y}}\mathbf{V}\rVert_{\mathrm{F}}
\displaystyle\leq 82K3D(cd)𝐙𝐋¯𝐙𝐙𝐋^𝐙282K3D(cd)𝐙𝐋¯𝐙𝐙𝐋^𝐙F.\displaystyle\frac{8\sqrt{2K^{3}}}{D(c-d)}\lVert\mathbf{Z}^{\top}\bar{\mathbf{L}}\mathbf{Z}-\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z}\rVert_{2}\leq\frac{8\sqrt{2K^{3}}}{D(c-d)}\lVert\mathbf{Z}^{\top}\bar{\mathbf{L}}\mathbf{Z}-\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z}\rVert_{\mathrm{F}}. (50)

The first inequality holds due to [14] and the way we generate the ground-truth graph, and the second inequality holds because the spectral norm is bounded by the Frobenius norm. On the other hand, we have

𝐙𝐘¯𝐙𝐘^𝐕F=𝐙𝐘¯𝐕𝐙𝐘^F.\displaystyle\lVert\mathbf{Z}\bar{\mathbf{Y}}-\mathbf{Z}\widehat{\mathbf{Y}}\mathbf{V}\rVert_{\mathrm{F}}=\lVert\mathbf{Z}\bar{\mathbf{Y}}\mathbf{V}^{\top}-\mathbf{Z}\widehat{\mathbf{Y}}\rVert_{\mathrm{F}}. (51)

As in Lemma 6 of [14], we can choose \bar{\mathbf{Y}} in such a way that \mathbf{Z}\bar{\mathbf{Y}}=\mathbf{E}, where \mathbf{E}_{[i,:]}=\mathbf{E}_{[j,:]} if the vertices i and j are in the same cluster and \lVert\mathbf{E}_{[i,:]}-\mathbf{E}_{[j,:]}\rVert_{2}=\sqrt{2K/D} if they are not. Furthermore, multiplying \mathbf{E} by \mathbf{V}^{\top} does not change these properties of \mathbf{E} since \mathbf{V}^{\top} is an orthogonal matrix. Finally, according to Lemma 2, if we let \delta_{k}=\sqrt{2K/D}, then \widetilde{\mathcal{M}}_{k} in Lemma 2 is equivalent to \mathcal{M}_{k}. Furthermore, according to Lemma 5.3 in [40], if \frac{4(4+2\epsilon)}{\delta_{k}^{2}}\lVert{\mathbf{E}}\mathbf{V}^{\top}-{\mathbf{Z}}\widehat{\mathbf{Y}}\rVert_{\mathrm{F}}^{2}\leq\frac{D}{K}, we have

k=1K|k|\displaystyle\sum_{k=1}^{K}|\mathcal{M}_{k}| 4(4+2ϵ)δk2𝐄𝐕𝐙𝐘^F2\displaystyle\leq\frac{4(4+2\epsilon)}{\delta_{k}^{2}}\lVert{\mathbf{E}}\mathbf{V}^{\top}-{\mathbf{Z}}\widehat{\mathbf{Y}}\rVert_{\mathrm{F}}^{2}
256(4+2ϵ)K2D(cd)2𝐙𝐋¯𝐙𝐙𝐋^𝐙F2.\displaystyle\leq\frac{256(4+2\epsilon)K^{2}}{D(c-d)^{2}}\lVert\mathbf{Z}^{\top}\bar{\mathbf{L}}\mathbf{Z}-\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z}\rVert_{\mathrm{F}}^{2}. (52)

Letting C_{1}=\frac{256(4+2\epsilon)K^{2}}{D(c-d)^{2}}, we have

k=1K|k|\displaystyle\sum_{k=1}^{K}|\mathcal{M}_{k}| 2C1𝐙𝐋¯𝐙𝐙𝐋𝐙F2𝒯1+2C1𝐙𝐋𝐙𝐙𝐋^𝐙F2𝒯2.\displaystyle\leq 2C_{1}\underbrace{\lVert\mathbf{Z}^{\top}\bar{\mathbf{L}}\mathbf{Z}-\mathbf{Z}^{\top}\mathbf{L}^{*}\mathbf{Z}\rVert_{\mathrm{F}}^{2}}_{\mathcal{T}_{1}}+2C_{1}\underbrace{\lVert\mathbf{Z}^{\top}\mathbf{L}^{*}\mathbf{Z}-\mathbf{Z}^{\top}\widehat{\mathbf{L}}\mathbf{Z}\rVert_{\mathrm{F}}^{2}}_{\mathcal{T}_{2}}. (53)

The first term is the difference between the expected Laplacian matrix and the real one, which has been derived in [14]. Specifically, for any r_{2}>0 and some r_{1}>0 satisfying a\geq r_{1}\ln{D}/D, with probability at least 1-D^{-r_{2}}, there exists a constant C_{2}(r_{1},r_{2}) such that

𝒯1C2(r1,r2)aDlnD.\displaystyle\mathcal{T}_{1}\leq C_{2}(r_{1},r_{2})aD\ln{D}. (54)

The second term \mathcal{T}_{2} of (53) is the error between the Laplacian estimated by our model and the real one. Substituting (54) into (53) completes the proof.

References

  • [1] T. Lei, X. Jia, Y. Zhang, S. Liu, H. Meng, and A. K. Nandi, “Superpixel-based fast fuzzy c-means clustering for color image segmentation,” IEEE Trans. Fuzzy Syst., vol. 27, no. 9, pp. 1753–1766, 2018.
  • [2] H. Xie, A. Zhao, S. Huang, J. Han, S. Liu, X. Xu, X. Luo, H. Pan, Q. Du, and X. Tong, “Unsupervised hyperspectral remote sensing image clustering based on adaptive density,” IEEE Geosci. Remote S., vol. 15, no. 4, pp. 632–636, 2018.
  • [3] V. Y. Kiselev, T. S. Andrews, and M. Hemberg, “Challenges in unsupervised clustering of single-cell rna-seq data,” Nat. Rev. Genet., vol. 20, no. 5, pp. 273–282, 2019.
  • [4] A. Likas, N. Vlassis, and J. J. Verbeek, “The global k-means clustering algorithm,” Pattern Recognit., vol. 36, no. 2, pp. 451–461, 2003.
  • [5] U. Von Luxburg, “A tutorial on spectral clustering,” Stat. Comput., vol. 17, pp. 395–416, 2007.
  • [6] W.-B. Xie, Y.-L. Lee, C. Wang, D.-B. Chen, and T. Zhou, “Hierarchical clustering supported by reciprocal nearest neighbors,” Inf. Sci., vol. 527, pp. 279–292, 2020.
  • [7] A. Chouldechova and A. Roth, “The frontiers of fairness in machine learning,” arXiv:1810.08810, 2018.
  • [8] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii, “Fair clustering through fairlets,” Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017.
  • [9] S. Bera, D. Chakrabarty, N. Flores, and M. Negahbani, “Fair algorithms for clustering,” Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019.
  • [10] A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner, “Scalable fair clustering,” in Proc. Int. Conf. Mach. Learn.   PMLR, 2019, pp. 405–413.
  • [11] I. M. Ziko, J. Yuan, E. Granger, and I. B. Ayed, “Variational fair clustering,” in Proc. Natl. Conf. Artif. Intell., vol. 35, no. 12, 2021, pp. 11 202–11 209.
  • [12] P. Zeng, Y. Li, P. Hu, D. Peng, J. Lv, and X. Peng, “Deep fair clustering via maximizing and minimizing mutual information: Theory, algorithm and metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 23 986–23 995.
  • [13] P. Li, H. Zhao, and H. Liu, “Deep fair clustering for visual learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9070–9079.
  • [14] M. Kleindessner, S. Samadi, P. Awasthi, and J. Morgenstern, “Guarantees for spectral clustering with fairness constraints,” in Proc. Int. Conf. Mach. Learn.   PMLR, 2019, pp. 3458–3467.
  • [15] J. Wang, D. Lu, I. Davidson, and Z. Bai, “Scalable spectral clustering with group fairness constraints,” in Proc. Int. Conf. Artif. Intell. Stat., AISTATS.   PMLR, 2023, pp. 6613–6629.
  • [16] J. Li, Y. Wang, and A. Merchant, “Spectral normalized-cut graph partitioning with fairness constraints,” arXiv:2307.12065, 2023.
  • [17] S. Gupta and A. Dukkipati, “Protecting individual interests across clusters: Spectral clustering with guarantees,” arXiv: 2105.03714, 2021.
  • [18] Y. Wang, J. Kang, Y. Xia, J. Luo, and H. Tong, “ifig: Individually fair multi-view graph clustering,” in 2022 IEEE International Conference on Big Data (Big Data).   IEEE, 2022, pp. 329–338.
  • [19] J. Huang, F. Nie, and H. Huang, “Spectral rotation versus k-means in spectral clustering,” in Proc. Natl. Conf. Artif. Intell., vol. 27, no. 1, 2013, pp. 431–437.
  • [20] Z. Kang, C. Peng, Q. Cheng, and Z. Xu, “Unified spectral clustering with optimal graph,” in Proc. Natl. Conf. Artif. Intell., vol. 32, no. 1, 2018.
  • [21] Z. Kang, C. Peng, and Q. Cheng, “Twin learning for similarity and clustering: A unified kernel approach,” in Proc. Natl. Conf. Artif. Intell., vol. 31, no. 1, 2017.
  • [22] J. Huang, F. Nie, and H. Huang, “A new simplex sparse learning model to measure data similarity for clustering,” in Int. Joint Conf. Artif. Intell., 2015.
  • [23] Y. Peng, W. Huang, W. Kong, F. Nie, and B.-L. Lu, “Jgsed: An end-to-end spectral clustering model for joint graph construction, spectral embedding and discretization,” IEEE Trans. Emerg. Topics Comput. Intell., 2023.
  • [24] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 35, no. 11, pp. 2765–2781, 2013.
  • [25] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 35, no. 1, pp. 171–184, 2012.
  • [26] F. Nie, X. Wang, and H. Huang, “Clustering and projected clustering with adaptive neighbors,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2014, pp. 977–986.
  • [27] C. Gao, Y. Wang, J. Zhou, W. Ding, L. Shen, and Z. Lai, “Possibilistic neighborhood graph: A new concept of similarity graph learning,” IEEE Trans. Emerg. Topics Comput. Intell., 2022.
  • [28] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, 2013.
  • [29] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, “Learning Laplacian matrix in smooth graph signal representations,” IEEE Trans. Signal Process., vol. 64, no. 23, pp. 6160–6173, 2016.
  • [30] V. Kalofolias, “How to learn a graph from smooth signals,” in Proc. Int. Conf. Artif. Intell. Stat., AISTATS.   PMLR, 2016, pp. 920–929.
  • [31] X. Dong, D. Thanou, M. Rabbat, and P. Frossard, “Learning graphs from data: A signal representation perspective,” IEEE Signal Process. Mag., vol. 36, no. 3, pp. 44–63, 2019.
  • [32] F. Nie, D. Wu, R. Wang, and X. Li, “Self-weighted clustering with adaptive neighbors,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 9, pp. 3428–3441, 2020.
  • [33] Y. Pang, J. Xie, F. Nie, and X. Li, “Spectral clustering by joint spectral embedding and spectral rotation,” IEEE Trans. Cybern., vol. 50, no. 1, pp. 247–258, 2018.
  • [34] Y. Yang, F. Shen, Z. Huang, and H. T. Shen, “A unified framework for discrete spectral clustering.” in IJCAI, 2016, pp. 2273–2279.
  • [35] W. Huang, Y. Peng, Y. Ge, and W. Kong, “A new kmeans clustering model and its generalization achieved by joint spectral embedding and rotation,” PeerJ Comput. Sci., vol. 7, p. e450, 2021.
  • [36] Y. Han, L. Zhu, Z. Cheng, J. Li, and X. Liu, “Discrete optimal graph clustering,” IEEE Trans. Cybern., vol. 50, no. 4, pp. 1697–1710, 2018.
  • [37] C. Tang, Z. Li, J. Wang, X. Liu, W. Zhang, and E. Zhu, “Unified one-step multi-view spectral clustering,” IEEE Trans. Knowl. Data Eng., vol. 35, no. 6, pp. 6449–6460, 2022.
  • [38] F. Zhang, J. Zhao, X. Ye, and H. Chen, “One-step adaptive spectral clustering networks,” IEEE Signal Process. Lett., vol. 29, pp. 2263–2267, 2022.
  • [39] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Soc. Networks, vol. 5, no. 2, pp. 109–137, 1983.
  • [40] J. Lei and A. Rinaldo, “Consistency of spectral clustering in stochastic block models,” Ann. Stat., vol. 43, no. 1, 2015.
  • [41] Q. Li, X.-M. Wu, H. Liu, X. Zhang, and Z. Guan, “Label efficient semi-supervised learning via graph filtering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9582–9591.
  • [42] Y. Y. Pilavcı, P.-O. Amblard, S. Barthelmé, and N. Tremblay, “Graph tikhonov regularization and interpolation via random spanning forests,” IEEE Trans. Signal. Inf. Process. Netw., vol. 7, pp. 359–374, 2021.
  • [43] E. Pan and Z. Kang, “Multi-view contrastive graph clustering,” Proc. Adv. Neural Inf. Process. Syst., vol. 34, pp. 2148–2159, 2021.
  • [44] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” J. Mach. Learn. Res., vol. 9, no. 11, 2008.
  • [45] G. Zhong and C.-M. Pun, “Self-taught multi-view spectral clustering,” Pattern Recognit., vol. 138, p. 109349, 2023.
  • [46] F. Nie, S. Shi, and X. Li, “Semi-supervised learning with auto-weighting feature and adaptive graph,” IEEE Trans. Knowl. Data Eng., vol. 32, no. 6, pp. 1167–1178, 2019.
  • [47] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 22, no. 8, pp. 888–905, 2000.
  • [48] K. Fan, “On a theorem of weyl concerning eigenvalues of linear transformations i,” Proc. of the Nat. Academy. of Sci., vol. 35, no. 11, pp. 652–655, 1949.
  • [49] S. Kumar, J. Ying, J. V. de Miranda Cardoso, and D. P. Palomar, “A unified framework for structured graph learning via spectral constraints.” J. Mach. Learn. Res., vol. 21, no. 22, pp. 1–60, 2020.
  • [50] D. Wu, F. Nie, J. Lu, R. Wang, and X. Li, “Effective clustering via structured graph learning,” IEEE Trans. Knowl. Data Eng., 2022.
  • [51] E. Pircalabelu and G. Claeskens, “Community-based group graphical lasso,” J. Mach. Learn. Res., vol. 21, no. 1, pp. 2406–2437, 2020.
  • [52] S. S. Saboksayr and G. Mateos, “Accelerated graph learning from smooth signals,” IEEE Signal Process. Lett., vol. 28, pp. 2192–2196, 2021.
  • [53] Z. Wen and W. Yin, “A feasible method for optimization with orthogonality constraints,” Math. Program., vol. 142, pp. 397–434, 2013.
  • [54] P. H. Schönemann, “A generalized solution of the orthogonal procrustes problem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
  • [55] O. Axelsson and G. Lindskog, “On the rate of convergence of the preconditioned conjugate gradient method,” Numer. Math., vol. 48, pp. 499–523, 1986.
  • [56] V. Kalofolias and N. Perraudin, “Large scale graph learning from smooth signals,” in Int. Conf. Learn. Representations, 2019.
  • [57] D. A. Tarzanagh, L. Balzano, and A. O. Hero, “Fair structure learning in heterogeneous graphical models,” arXiv:2112.05128, 2021.
  • [58] H. Wang, N. Wang, and D.-Y. Yeung, “Collaborative deep learning for recommender systems,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2015, pp. 1235–1244.
  • [59] X. Chen, G. Yuan, F. Nie, and Z. Ming, “Semi-supervised feature selection via sparse rescaled linear square regression,” IEEE Trans. Knowl. Data Eng., vol. 32, no. 1, pp. 165–176, 2018.
  • [60] L. Hagen and A. B. Kahng, “New spectral methods for ratio cut partitioning and clustering,” IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., vol. 11, no. 9, pp. 1074–1085, 1992.

Supplementary Materials

A-A Several Extensions to The Proposed Model

1) Improved spectral rotation: The improved spectral rotation is a refined version of (10), which is formulated as [45]:

min𝐐,𝐑𝐐(𝐐𝐐)12𝐔𝐑F2\displaystyle\underset{\mathbf{Q},\mathbf{R}}{\min}\;\lVert\mathbf{Q}(\mathbf{Q}^{\top}\mathbf{Q})^{-\frac{1}{2}}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐑𝐑=𝐈,𝐐.\displaystyle\mathrm{s.t.}\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (55)

The improved spectral rotation can output a discrete label matrix \mathbf{Q} that is closer to \mathbf{U}\mathbf{R} since \left(\mathbf{Q}(\mathbf{Q}^{\top}\mathbf{Q})^{-\frac{1}{2}}\right)^{\top}\mathbf{Q}(\mathbf{Q}^{\top}\mathbf{Q})^{-\frac{1}{2}}=(\mathbf{U}\mathbf{R})^{\top}\mathbf{U}\mathbf{R}=\mathbf{I}, i.e., \mathbf{Q}(\mathbf{Q}^{\top}\mathbf{Q})^{-\frac{1}{2}} and \mathbf{U}\mathbf{R} lie in the same space [45]. If we employ the improved spectral rotation in (23), the model becomes

min𝐗,𝐋,𝝊,𝐘,𝐑,𝐐1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+Reg(𝐋)\displaystyle\underset{\mathbf{X},\mathbf{L},\bm{\upsilon},\mathbf{Y},\mathbf{R},\mathbf{Q}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})
+μTr(𝐔𝐋𝐔)+γ𝐐(𝐐𝐐)12𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma\lVert\mathbf{Q}(\mathbf{Q}^{\top}\mathbf{Q})^{-\frac{1}{2}}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
+i=1D1𝝊[i]\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}
s.t.𝐋,𝝊>0,𝐔𝐔=𝐈,𝐅𝐔=𝟎,𝐑𝐑=𝐈,𝐐.\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\bm{\upsilon}>0,\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{{U}}=\mathbf{0},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (56)

2) Self-weighted feature importance: To improve clustering performance, some works define feature weights to determine the importance of different features in assigning cluster labels [46, 59]. Specifically, given the data matrix \mathbf{X}, we define the weight matrix \mathbf{\Psi}=\mathrm{diag}(\bm{\psi})\in\mathbb{R}^{N\times N}, where \bm{\psi}\in\mathbb{R}^{N}, \bm{\psi}\geq 0, and \bm{\psi}^{\top}\mathbf{1}=1. The weighted i-th feature is \mathbf{\Psi}\mathbf{X}_{[i,:]}^{\top}, and the weights \mathbf{\Psi} can be directly learned from the data. Thus, our model (23) with self-weighted feature importance is formulated as

min𝐖,𝝊,𝐔,𝐑,𝐐,𝚿,𝐗1N𝚼(𝐗o𝐗)F2+ξ2N𝐖𝐏ψ1,1\displaystyle\underset{\mathbf{W},\bm{\upsilon},\mathbf{U},\mathbf{R},\mathbf{Q},\mathbf{\Psi},\mathbf{X}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{2N}\lVert\mathbf{W}\circ\mathbf{P}_{\psi}\rVert_{1,1}
+RegW(𝐖)+μTr(𝐔𝐋𝐔)+γ𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+Reg_{W}(\mathbf{W})+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
+i=1D1𝝊[i]\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}
s.t.𝐖𝒲,𝐋=𝐃𝐖,𝐔𝐔=𝐈,𝐅𝐔=𝟎,𝐑𝐑=𝐈,\displaystyle\mathrm{s.t.}\;\mathbf{W}\in\mathcal{W},\mathbf{L}=\mathbf{D}-\mathbf{W},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{{U}}=\mathbf{0},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},
𝐐,(𝐏ψ)[ij]=𝚿𝐗[i,:]𝚿𝐗[j,:]22,𝝊>0,\displaystyle\;\;\;\;\;\;\mathbf{Q}\in\mathcal{I},(\mathbf{P}_{\psi})_{[ij]}=\left\lVert\mathbf{\Psi}\mathbf{X}_{[i,:]}^{\top}-\mathbf{\Psi}\mathbf{X}_{[j,:]}^{\top}\right\rVert_{2}^{2},\bm{\upsilon}>0,
𝝍𝟏=1,𝚿=diag(𝝍).\displaystyle\;\;\;\;\;\;\bm{\psi}^{\top}\mathbf{1}=1,\mathbf{\Psi}=\mathrm{diag}(\bm{\psi}). (57)

3) Normalized spectral clustering: The model (23) is a unified framework based on unnormalized SC [60]. Here, we extend (23) to normalized SC [47]. The standard normalized spectral embedding is

min𝐔Tr(𝐔𝐋𝐔),s.t.𝐔𝐃𝐔=𝐈.\displaystyle\underset{\mathbf{U}}{\min}\;\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U}),\;\;\mathrm{s.t.}\mathbf{U}^{\top}\mathbf{D}\mathbf{U}=\mathbf{I}. (58)

The fairness constraint \mathbf{F}^{\top}\mathbf{U}=\mathbf{0} also holds for normalized SC [14]. Thus, our model based on normalized SC is

min𝐗,𝐋,𝝊,𝐘,𝐑,𝐐1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+Reg(𝐋)\displaystyle\underset{\mathbf{X},\mathbf{L},\bm{\upsilon},\mathbf{Y},\mathbf{R},\mathbf{Q}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})
+μTr(𝐔𝐋𝐔)+γ𝐐𝐔𝐑F2+i=1D1𝝊[i]\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}
s.t.𝐋,𝝊>0,𝐔𝐃𝐔=𝐈,𝐅𝐔=𝟎,𝐑𝐑=𝐈,𝐐.\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\bm{\upsilon}>0,\mathbf{U}^{\top}\mathbf{D}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{{U}}=\mathbf{0},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (59)

4) Individual fairness: Our model is based on group fairness, which induces the fairness constraint 𝐅𝐔=𝟎\mathbf{F}^{\top}\mathbf{U}=\mathbf{0}. The work [17] introduces individual fairness into SC, which induces a new fairness constraint 𝐌(𝐈1D𝟏𝟏)𝐔=𝟎\mathbf{M}(\mathbf{I}-\frac{1}{D}\mathbf{1}\mathbf{1}^{\top})\mathbf{U}=\mathbf{0}, where 𝐌D×D\mathbf{M}\in\mathbb{R}^{D\times D} is a graph representing individual sensitive attributes. Our unified model based on individual fairness is

min𝐗,𝐋,𝝊,𝐘,𝐑,𝐐1N𝚼(𝐗o𝐗)F2+ξNTr(𝐗𝐋𝐗)+Reg(𝐋)\displaystyle\underset{\mathbf{X},\mathbf{L},\bm{\upsilon},\mathbf{Y},\mathbf{R},\mathbf{Q}}{\mathrm{min}}\,\,\frac{1}{N}\lVert{\mathbf{\Upsilon}}(\mathbf{X}_{o}-\mathbf{X})\rVert_{\mathrm{F}}^{2}+\frac{\xi}{N}\mathrm{Tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X})+Reg(\mathbf{L})
+μTr(𝐔𝐋𝐔)+γ𝐐𝐔𝐑F2+i=1D1𝝊[i]\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\mu\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}+\sum_{i=1}^{D}\frac{1}{\bm{\upsilon}_{[i]}}
s.t.𝐋,𝝊>0,𝐔𝐔=𝐈,𝐌(𝐈1D𝟏𝟏)𝐔=𝟎,\displaystyle\mathrm{s.t.}\;\mathbf{L}\in\mathcal{L},\bm{\upsilon}>0,\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{M}\left(\mathbf{I}-\frac{1}{D}\mathbf{1}\mathbf{1}^{\top}\right)\mathbf{U}=\mathbf{0},
\displaystyle\;\;\;\;\;\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\;\mathbf{Q}\in\mathcal{I}. (60)

A-B The Complete Algorithm Flow for Updating (33)

We use the algorithm in [52] to solve problem (33). The complete algorithm flow is presented in Algorithm 2.

Algorithm 2 The algorithm for problem (33)
0:  β,𝐩\beta,\mathbf{p}, set L=D12βL=\frac{D-1}{2\beta}
0:  The learned graph 𝐰{\mathbf{w}}
1:  Initialize η(1)=1\eta^{(1)}=1 and 𝝎(1)=𝐫(0)D\bm{\omega}^{(1)}=\mathbf{r}^{(0)}\in\mathbb{R}^{D} at random
2:  for t=1,2,,t=1,2,..., do
3:     𝐰¯(t)=max(𝐒𝝎(t)2𝐩4β,0)\bar{\mathbf{w}}^{(t)}=\max\left(\frac{\mathbf{S}^{\top}\bm{\omega}^{(t)}-2\mathbf{p}}{4\beta},0\right)
4:     𝐯(t)=𝐒𝐰¯(t)L𝝎(t)+(𝐒𝐰¯(t)L𝝎(t))2+4L𝟏2\mathbf{v}^{(t)}=\frac{\mathbf{S}\bar{\mathbf{w}}^{(t)}-L\bm{\omega}^{(t)}+\sqrt{(\mathbf{S}\bar{\mathbf{w}}^{(t)}-L\bm{\omega}^{(t)})^{2}+4L\mathbf{1}}}{2}
5:     𝐫(t)=𝝎(t)L1(𝐒𝐰¯(t)𝐯(t))\mathbf{r}^{(t)}=\bm{\omega}^{(t)}-L^{-1}\left(\mathbf{S}\bar{\mathbf{w}}^{(t)}-\mathbf{v}^{(t)}\right)
6:     η(t+1)=1+1+4(η(t))22\eta^{(t+1)}=\frac{1+\sqrt{1+4(\eta^{(t)})^{2}}}{2}
7:     𝝎(t+1)=𝐫(t)+(η(t)1η(t+1))(𝐫(t)𝐫(t1))\bm{\omega}^{(t+1)}=\mathbf{r}^{(t)}+\left(\frac{\eta^{(t)}-1}{\eta^{(t+1)}}\right)\left(\mathbf{r}^{(t)}-\mathbf{r}^{(t-1)}\right)
8:  end for
9:  return  𝐰=max(𝐒𝐫(t)2𝐩4β,0)\mathbf{w}=\max\left(\frac{\mathbf{S}^{\top}\mathbf{r}^{(t)}-2\mathbf{p}}{4\beta},0\right)
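The following NumPy sketch is a direct transcription of Algorithm 2, under the assumption that S is the D-by-D(D-1)/2 binary matrix mapping the half-vectorized weight vector w to the node degree vector (as in the smooth-signal formulation of [52]); a fixed iteration budget replaces a convergence check.

```python
import numpy as np

def edge_node_incidence(D):
    """ASSUMED definition of S in Algorithm 2: S @ w gives the degree vector
    of the graph whose upper-triangular weights are stacked in w."""
    E = D * (D - 1) // 2
    S = np.zeros((D, E))
    e = 0
    for i in range(D):
        for j in range(i + 1, D):
            S[i, e] = S[j, e] = 1.0
            e += 1
    return S

def solve_w(p, D, beta, n_iter=500, seed=0):
    """Accelerated dual updates of Algorithm 2 for problem (33)."""
    rng = np.random.default_rng(seed)
    S = edge_node_incidence(D)
    Lc = (D - 1) / (2 * beta)                                           # constant L
    eta, omega = 1.0, rng.standard_normal(D)                            # eta^(1), omega^(1)
    r_prev = omega.copy()                                               # r^(0) = omega^(1)
    for _ in range(n_iter):
        w_bar = np.maximum((S.T @ omega - 2 * p) / (4 * beta), 0)       # step 3
        q = S @ w_bar - Lc * omega
        v = (q + np.sqrt(q ** 2 + 4 * Lc)) / 2                          # step 4
        r = omega - (S @ w_bar - v) / Lc                                # step 5
        eta_next = (1 + np.sqrt(1 + 4 * eta ** 2)) / 2                  # step 6
        omega = r + ((eta - 1) / eta_next) * (r - r_prev)               # step 7
        eta, r_prev = eta_next, r
    return np.maximum((S.T @ r_prev - 2 * p) / (4 * beta), 0)           # step 9
```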

A-C The Formulation and Algorithm for FJGSED

The model FJGSED is formulated as

min𝐖,𝐔,𝐐,𝐑i,j=1D𝐗[i,:]𝐗[j,:]22𝐖[i,j]+βJ𝐖[i,j]2\displaystyle\underset{\mathbf{W},\mathbf{U},\mathbf{Q},\mathbf{R}}{\min}\sum_{i,j=1}^{D}\lVert\mathbf{X}_{[i,:]}-\mathbf{X}_{[j,:]}\rVert_{2}^{2}\mathbf{W}_{[i,j]}+\beta_{J}\mathbf{W}_{[i,j]}^{2}
+μJTr(𝐔𝐋𝐔)+γJ𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;+{\mu_{J}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\gamma_{J}\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐖𝟏=𝟏,𝐖0,𝐋=𝐃𝐖,𝐔𝐔=𝐈,𝐅𝐔=𝟎\displaystyle\mathrm{s.t.}\;\mathbf{W}\mathbf{1}=\mathbf{1},\mathbf{W}\geq 0,\mathbf{L}=\mathbf{D}-\mathbf{W},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{U}=\mathbf{0}
𝐑𝐑=𝐈,𝐐.\displaystyle\;\;\;\;\;\;\;\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (61)

The framework of our algorithm for solving (61) is the same as that of Algorithm 1, which alternately updates \mathbf{W}, \mathbf{U}, \mathbf{R}, and \mathbf{Q}. The updates of \mathbf{U}, \mathbf{R}, and \mathbf{Q} are the same as in Algorithm 1. The main difference lies in updating \mathbf{W}/\mathbf{L}, and hence we discuss the update of \mathbf{W} here. The corresponding sub-problem is

min𝐖\displaystyle\underset{\mathbf{W}}{\min} i,jD𝐗[i,:]𝐗[j,:]22𝐖[i,j]+βJ𝐖[i,j]2+μJTr(𝐔𝐋𝐔)\displaystyle\;\;\sum_{i,j}^{D}\lVert\mathbf{X}_{[i,:]}-\mathbf{X}_{[j,:]}\rVert_{2}^{2}\mathbf{W}_{[i,j]}+\beta_{J}\mathbf{W}_{[i,j]}^{2}+{\mu_{J}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
s.t.\displaystyle\mathrm{s.t.}\; 𝐖[i,:]𝟏=1,𝐖[i,:]0,𝐋=𝐃𝐖.\displaystyle\mathbf{W}_{[i,:]}\mathbf{1}=1,\mathbf{W}_{[i,:]}\geq 0,\mathbf{L}=\mathbf{D}-\mathbf{W}. (62)

We can rewrite the problem as

min𝐖𝟏=𝟏,𝐖0i,j=1D𝐗[i,:]𝐗[j,:]22𝐖[ij]\displaystyle\underset{\mathbf{W}\mathbf{1}=\mathbf{1},\mathbf{W}\geq 0}{\min}\sum_{i,j=1}^{D}\;\lVert\mathbf{X}_{[i,:]}-\mathbf{X}_{[j,:]}\rVert_{2}^{2}\mathbf{W}_{[ij]}
+μJ2𝐔[i,:]𝐔[j,:]22𝐖[ij]+βJ𝐖[ij]2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\frac{\mu_{J}}{2}\lVert\mathbf{U}_{[i,:]}-\mathbf{U}_{[j,:]}\rVert_{2}^{2}\mathbf{W}_{[ij]}+\beta_{J}\mathbf{W}^{2}_{[ij]} (63)

Let 𝐂[ij]=𝐗[i,:]𝐗[j,:]22+μJ2𝐔[i,:]𝐔[j,:]22\mathbf{C}_{[ij]}=\lVert\mathbf{X}_{[i,:]}-\mathbf{X}_{[j,:]}\rVert_{2}^{2}+\frac{\mu_{J}}{2}\lVert\mathbf{U}_{[i,:]}-\mathbf{U}_{[j,:]}\rVert_{2}^{2}, and the problem (63) can be optimized for each row, i.e. for i=1,,Di=1,...,D,

min𝐖[i,:]j=1D𝐂[ij]𝐖[ij]+βJ𝐖[ij]2s.t.𝐖[i,:]𝟏=1,𝐖[i,:]0,\displaystyle\underset{\mathbf{W}_{[i,:]}}{\min}\sum_{j=1}^{D}\;\mathbf{C}_{[ij]}\mathbf{W}_{[ij]}+\beta_{J}\mathbf{W}^{2}_{[ij]}\;\;\mathrm{s.t.}\;\mathbf{W}_{[i,:]}\mathbf{1}=1,\mathbf{W}_{[i,:]}\geq 0,
\displaystyle\Rightarrow min𝐖[i,:]𝐖[i,:]+12βJ𝐂[i,:]22s.t.𝐖[i,:]𝟏=1,𝐖[i,:]0.\displaystyle\underset{\mathbf{W}_{[i,:]}}{\min}\left\lVert\mathbf{W}_{[i,:]}+\frac{1}{2\beta_{J}}\mathbf{C}_{[i,:]}\right\rVert_{2}^{2}\;\;\mathrm{s.t.}\;\mathbf{W}_{[i,:]}\mathbf{1}=1,\mathbf{W}_{[i,:]}\geq 0. (64)

This is the Euclidean projection of -\frac{1}{2\beta_{J}}\mathbf{C}_{[i,:]} onto the probability simplex. Inspired by [23], we update \mathbf{W}_{[i,:]} as

𝐖[i,j]=max(𝐂[i,l+1]𝐂[i,j]l𝐂[i,l+1]j=1l𝐂[i,j],0),\displaystyle\mathbf{W}_{[i,j]}=\max\left(\frac{\mathbf{C}_{[i,l+1]}-\mathbf{C}_{[i,j]}}{l\mathbf{C}_{[i,l+1]}-\sum_{j=1}^{l}\mathbf{C}_{[i,j]}},0\right), (65)

where l is a hyper-parameter determining the number of neighbors of each node in the learned graph. We select l instead of \beta_{J} as the model parameter. A sketch of this row-wise update is given below.
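The sketch assumes the standard convention for this type of closed-form update, namely that the costs in each row of C are processed in ascending order and that self-distances are excluded; under this convention, each row of W keeps at most l nonzero entries summing to one.

```python
import numpy as np

def update_W_rows(C, l):
    """Row-wise update of Eq. (65): each row of W is supported on its l
    smallest-cost neighbors (self excluded), with costs sorted ascending."""
    D = C.shape[0]
    W = np.zeros((D, D))
    for i in range(D):
        c = C[i].astype(float).copy()
        c[i] = np.inf                                    # exclude the self-loop
        order = np.argsort(c)                            # neighbors by ascending cost
        cs = c[order]
        denom = l * cs[l] - cs[:l].sum() + 1e-12         # l*C_{i,l+1} - sum_{j<=l} C_{i,j}
        W[i, order[:l]] = np.maximum((cs[l] - cs[:l]) / denom, 0)
    return W
```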

We iteratively update \mathbf{W}, \mathbf{U}, \mathbf{R}, and \mathbf{Q} until convergence. The complete algorithm is shown in Algorithm 3.

Algorithm 3 The algorithm for FJGSED
0:    𝐗\mathbf{X}, the number of clusters KK, parameters l,μJ,γJl,\mu_{J},\gamma_{J}
0:    The learned graph 𝐖\mathbf{W}, the cluster indicator matrix 𝐐\mathbf{Q}
1:  Initialize 𝐖\mathbf{W}, 𝐔\mathbf{U}, 𝐐\mathbf{Q}, and 𝐑\mathbf{R}
2:  while not converged do
3:     Update 𝐖\mathbf{W} via (65)
4:     Update 𝐘\mathbf{Y} by solving (35), and let 𝐔=𝐙𝐘\mathbf{U}=\mathbf{Z}\mathbf{Y}
5:     Update 𝐑\mathbf{R} as 𝐑=𝚯R𝚯L\mathbf{R}=\mathbf{\Theta}_{R}\mathbf{\Theta}_{L}^{\top}
6:     Update 𝐐\mathbf{Q} via (39)
7:  end while

Algorithm 4 The algorithm for FSRSC
0:    𝐗\mathbf{X}, the number of clusters KK, parameters γU,μJ,γJ\gamma_{U},\mu_{J},\gamma_{J}
0:    The learned graph 𝐖\mathbf{W}, the cluster indicator matrix 𝐐\mathbf{Q}
1:  Initialize 𝐖\mathbf{W}, 𝐔\mathbf{U}, 𝚪\mathbf{\Gamma}, 𝐐\mathbf{Q}, and 𝐑\mathbf{R}
2:  while not converged do
3:     Update 𝐀\mathbf{A} via (71)
4:     Update 𝐖\mathbf{W} via (76)
5:     𝐖=max(𝐖,0)\mathbf{W}=\max(\mathbf{W},0) and let diag(𝐖)=𝟎\mathrm{diag(}\mathbf{W})=\mathbf{0}.
6:     𝐖=12(𝐖+𝐖)\mathbf{W}=\frac{1}{2}(\mathbf{W}^{\top}+\mathbf{W}).
7:     Update 𝚪\mathbf{\Gamma} as 𝚪=𝚪+γU(𝐀𝐖)\mathbf{\Gamma}=\mathbf{\Gamma}+\gamma_{U}(\mathbf{A}-\mathbf{W})
8:     Update 𝐔\mathbf{U} by solving (35)
9:     Update 𝐑\mathbf{R} as 𝐑=𝚯R𝚯L\mathbf{R}=\mathbf{\Theta}_{R}\mathbf{\Theta}_{L}^{\top}
10:     Update 𝐐\mathbf{Q} via (39)
11:  end while

A-D The Formulation And Algorithm For FSRSC

The model FSRSC is formulated as

min𝐖,𝐔,𝐐,𝐑𝐗𝐖𝐗F2+αU𝐖1,1+μUTr(𝐔𝐋𝐔)\displaystyle\underset{\mathbf{W},\mathbf{U},\mathbf{Q},\mathbf{R}}{\min}\lVert\mathbf{X}-\mathbf{W}^{\top}\mathbf{X}\rVert_{\mathrm{F}}^{2}+\alpha_{U}\lVert\mathbf{W}\rVert_{1,1}+{\mu_{U}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
+γU𝐐𝐔𝐑F2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;+\gamma_{U}\lVert\mathbf{Q}-\mathbf{U}\mathbf{R}\rVert_{\mathrm{F}}^{2}
s.t.𝐖𝒲,𝐔𝐔=𝐈,𝐅𝐔=𝟎,𝐑𝐑=𝐈,𝐐.\displaystyle\mathrm{s.t.}\;\mathbf{W}\in\mathcal{W},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I},\mathbf{F}^{\top}\mathbf{U}=\mathbf{0},\mathbf{R}^{\top}\mathbf{R}=\mathbf{I},\mathbf{Q}\in\mathcal{I}. (66)

The framework of our algorithm for (66) is the same as that of Algorithm 1, which alternately updates \mathbf{W}, \mathbf{U}, \mathbf{R}, and \mathbf{Q}. The updates of \mathbf{U}, \mathbf{R}, and \mathbf{Q} are the same as in Algorithm 1. The main difference lies in updating \mathbf{W}, and hence we discuss the update of \mathbf{W} here. The corresponding sub-problem is

min𝐖\displaystyle\underset{\mathbf{W}}{\min} 𝐗𝐖𝐗F2+αU𝐖1,1+μUTr(𝐔𝐋𝐔)\displaystyle\lVert\mathbf{X}-\mathbf{W}^{\top}\mathbf{X}\rVert_{\mathrm{F}}^{2}+\alpha_{U}\lVert\mathbf{W}\rVert_{1,1}+{\mu_{U}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
s.t.\displaystyle\mathrm{s.t.}\; diag(𝐖)=𝟎,𝐖=𝐖,𝐖0,𝐋=𝐃𝐖.\displaystyle\mathrm{diag}(\mathbf{W})=\mathbf{0},\mathbf{W}^{\top}=\mathbf{W},\mathbf{W}\geq 0,\mathbf{L}=\mathbf{D}-\mathbf{W}. (67)

We use an augmented Lagrangian multiplier (ALM)-type method to solve the problem (67). We first introduce an auxiliary variable \mathbf{A} and rewrite the problem as

min𝐖\displaystyle\underset{\mathbf{W}}{\min} 𝐗𝐖𝐗F2+αU𝐀1,1+μUTr(𝐔𝐋𝐔)\displaystyle\lVert\mathbf{X}-\mathbf{W}^{\top}\mathbf{X}\rVert_{\mathrm{F}}^{2}+\alpha_{U}\lVert\mathbf{A}\rVert_{1,1}+{\mu_{U}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
s.t.\displaystyle\mathrm{s.t.}\; diag(𝐖)=𝟎,𝐖=𝐖,𝐖0,𝐋=𝐃𝐖,𝐀=𝐖.\displaystyle\mathrm{diag}(\mathbf{W})=\mathbf{0},\mathbf{W}^{\top}=\mathbf{W},\mathbf{W}\geq 0,\mathbf{L}=\mathbf{D}-\mathbf{W},\mathbf{A}=\mathbf{W}. (68)

The augmented Lagrangian function of the problem is

Lag(𝐖,𝐀,𝚪)\displaystyle Lag(\mathbf{W},\mathbf{A},\mathbf{\Gamma})
=\displaystyle= 𝐗𝐖𝐗F2+αU𝐀1,1+μUTr(𝐔𝐋𝐔)\displaystyle\lVert\mathbf{X}-\mathbf{W}^{\top}\mathbf{X}\rVert_{\mathrm{F}}^{2}+\alpha_{U}\lVert\mathbf{A}\rVert_{1,1}+{\mu_{U}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})
\displaystyle+\frac{\gamma_{U}}{2}\left\lVert\mathbf{A}-\mathbf{W}+{\mathbf{\Gamma}}/{\gamma_{U}}\right\rVert_{\mathrm{F}}^{2}, (69)

where \gamma_{U} is the penalty parameter. In the ALM algorithm, we update \mathbf{W}, \mathbf{\Gamma}, and \mathbf{A} in an alternating manner. We first fix \mathbf{W} and \mathbf{\Gamma} and update \mathbf{A}. Letting \mathbf{J}=\mathbf{W}-\frac{\mathbf{\Gamma}}{\gamma_{U}}, the optimization problem is

\displaystyle\underset{\mathbf{A}}{\min}\;\;\alpha_{U}\lVert\mathbf{A}\rVert_{1,1}+\frac{\gamma_{U}}{2}\left\lVert\mathbf{A}-\mathbf{J}\right\rVert_{\mathrm{F}}^{2}, (70)

which can be updated elementwise as

𝐀[ij]=max(|𝐉[ij]|αUγU,0)sign(𝐉[ij]).\displaystyle\mathbf{A}_{[ij]}=\max\left(|\mathbf{J}_{[ij]}|-\frac{\alpha_{U}}{\gamma_{U}},0\right)\mathrm{sign}\left(\mathbf{J}_{[ij]}\right). (71)

Then, we fix 𝐀,𝚪\mathbf{A},\mathbf{\Gamma} and update 𝐖\mathbf{W}. Let 𝐉~=𝐀+𝚪γU\widetilde{\mathbf{J}}=\mathbf{A}+\frac{\mathbf{\Gamma}}{\gamma_{U}}, and we have

min𝐖𝐗𝐖𝐗F2+μUTr(𝐔𝐋𝐔)+γU2𝐖𝐉~F2,\displaystyle\underset{\mathbf{W}}{\min}\;\lVert\mathbf{X}-\mathbf{W}^{\top}\mathbf{X}\rVert_{\mathrm{F}}^{2}+{\mu_{U}}\mathrm{Tr}(\mathbf{U}^{\top}\mathbf{L}\mathbf{U})+\frac{\gamma_{U}}{2}\lVert\mathbf{W}-\widetilde{\mathbf{J}}\rVert_{\mathrm{F}}^{2},
s.t.diag(𝐖)=𝟎,𝐖=𝐖,𝐖0,\displaystyle\mathrm{s.t.}\;\mathrm{diag}(\mathbf{W})=\mathbf{0},\mathbf{W}^{\top}=\mathbf{W},\mathbf{W}\geq 0, (72)

which is equivalent to

min𝐖g(𝐖)\displaystyle\underset{\mathbf{W}}{\min}\,g(\mathbf{W})
:=\displaystyle:= min𝐖Tr(𝐖𝐗𝐗𝐖2𝐗𝐗𝐖)+μU2𝐖𝐏U1,1\displaystyle\underset{\mathbf{W}}{\min}\;\mathrm{Tr}\left(\mathbf{W}^{\top}\mathbf{X}\mathbf{X}^{\top}\mathbf{W}-2\mathbf{X}\mathbf{X}^{\top}\mathbf{W}^{\top}\right)+\frac{\mu_{U}}{2}\lVert\mathbf{W}\circ\mathbf{P}_{U}\rVert_{1,1}
+γU2Tr(𝐖𝐖2𝐉~𝐖)\displaystyle+\frac{\gamma_{U}}{2}\mathrm{Tr}\left(\mathbf{W}^{\top}\mathbf{W}-2\widetilde{\mathbf{J}}^{\top}\mathbf{W}\right)
s.t.diag(𝐖)=𝟎,𝐖=𝐖,𝐖0,\displaystyle\mathrm{s.t.}\;\mathrm{diag}(\mathbf{W})=\mathbf{0},\mathbf{W}^{\top}=\mathbf{W},\mathbf{W}\geq 0, (73)

where 𝐏U\mathbf{P}_{U} is the pair-wise distance matrix of 𝐔\mathbf{U}. For every column of 𝐖\mathbf{W}, we have the following problem

min𝐖[:,i]g(𝐖[:,i])\displaystyle\underset{\mathbf{W}_{[:,i]}}{\min}\,g(\mathbf{W}_{[:,i]})
=\displaystyle= min𝐖[:,i]𝐖[:,i](γU2𝐈+𝐗𝐗)𝐖[:,i]\displaystyle\underset{\mathbf{W}_{[:,i]}}{\min}\,\mathbf{W}_{[:,i]}^{\top}\left(\frac{\gamma_{U}}{2}\mathbf{I}+\mathbf{X}\mathbf{X}^{\top}\right)\mathbf{W}_{[:,i]}
+(μU2(𝐏U)[:,i]γU𝐉~[:,i]2(𝐗𝐗)[i,:])𝐖[:,i].\displaystyle+\left(\frac{\mu_{U}}{2}(\mathbf{P}_{U})^{\top}_{[:,i]}-\gamma_{U}\widetilde{\mathbf{J}}^{\top}_{[:,i]}-2(\mathbf{X}\mathbf{X}^{\top})_{[i,:]}\right)\mathbf{W}_{[:,i]}. (74)

We calculate the derivative of g(𝐖[:,i])g(\mathbf{W}_{[:,i]}) and have

𝐖[:,i]g(𝐖[:,i])\displaystyle\nabla_{\mathbf{W}_{[:,i]}}g(\mathbf{W}_{[:,i]})
=\displaystyle= 2(γU2𝐈+𝐗𝐗)𝐖[:,i]+μU2(𝐏U)[:,i]γU𝐉~[:,i]2(𝐗𝐗)[:,i].\displaystyle 2\left(\frac{\gamma_{U}}{2}\mathbf{I}+\mathbf{X}\mathbf{X}^{\top}\right)\mathbf{W}_{[:,i]}+\frac{\mu_{U}}{2}(\mathbf{P}_{U})_{[:,i]}-\gamma_{U}\widetilde{\mathbf{J}}_{[:,i]}-2(\mathbf{X}\mathbf{X}^{\top})_{[:,i]}. (75)

Let 𝐖[:,i]g(𝐖[:,i])=𝟎\nabla_{\mathbf{W}_{[:,i]}}g(\mathbf{W}_{[:,i]})=\mathbf{0}, and we obtain

𝐖[:,i]=(γU𝐈+2𝐗𝐗)1(γU𝐉~[:,i]+2(𝐗𝐗)[:,i]μU2(𝐏U)[:,i]).\displaystyle\mathbf{W}_{[:,i]}=\left({\gamma_{U}}\mathbf{I}+2\mathbf{X}\mathbf{X}^{\top}\right)^{-1}\left(\gamma_{U}\widetilde{\mathbf{J}}_{[:,i]}+2(\mathbf{X}\mathbf{X}^{\top})_{[:,i]}-\frac{\mu_{U}}{2}(\mathbf{P}_{U})_{[:,i]}\right). (76)

After updating all columns of 𝐖\mathbf{W}, we project 𝐖\mathbf{W} into the constraints diag(𝐖)=𝟎,𝐖=𝐖,𝐖0\mathrm{diag}(\mathbf{W})=\mathbf{0},\mathbf{W}^{\top}=\mathbf{W},\mathbf{W}\geq 0.

Finally, we fix 𝐖,𝐀\mathbf{W},\mathbf{A} and update 𝚪\mathbf{\Gamma}, i.e., 𝚪=𝚪+γU(𝐀𝐖)\mathbf{\Gamma}=\mathbf{\Gamma}+\gamma_{U}(\mathbf{A}-\mathbf{W}).

After updating \mathbf{W}, \mathbf{A}, and \mathbf{\Gamma}, we update \mathbf{U}, \mathbf{R}, and \mathbf{Q} by following Algorithm 1. We iteratively update \mathbf{W}, \mathbf{A}, \mathbf{\Gamma}, \mathbf{U}, \mathbf{R}, and \mathbf{Q} until convergence. The complete algorithm flow is shown in Algorithm 4. A sketch of one inner ALM iteration is given below.
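The sketch strings the three sub-updates of this ALM scheme together, i.e., the soft-thresholding step (71), the column-wise solve (76) followed by the projection in Algorithm 4, and the multiplier update; it assumes that \mathbf{P}_{U} stores the pairwise squared Euclidean distances between the rows of \mathbf{U}.

```python
import numpy as np

def fsrsc_alm_step(X, U, W, A, Gamma, alpha_U, mu_U, gamma_U):
    """One ALM iteration for the W-subproblem (67): A via (71), W via (76)
    plus the projection of Algorithm 4 (steps 5-6), then Gamma (step 7)."""
    D = X.shape[0]
    XXt = X @ X.T
    # (71): elementwise soft-thresholding of J = W - Gamma / gamma_U
    J = W - Gamma / gamma_U
    A = np.sign(J) * np.maximum(np.abs(J) - alpha_U / gamma_U, 0)
    # (76): closed-form column update with J_tilde = A + Gamma / gamma_U;
    # all columns share the same left-hand matrix, so we solve them at once
    J_tilde = A + Gamma / gamma_U
    P_U = np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=2)  # pairwise ||U_i - U_j||^2
    lhs = gamma_U * np.eye(D) + 2 * XXt
    rhs = gamma_U * J_tilde + 2 * XXt - 0.5 * mu_U * P_U
    W = np.linalg.solve(lhs, rhs)
    # projection onto {W >= 0, diag(W) = 0, W symmetric}
    W = np.maximum(W, 0)
    np.fill_diagonal(W, 0.0)
    W = (W + W.T) / 2
    # multiplier update
    Gamma = Gamma + gamma_U * (A - W)
    return W, A, Gamma
```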