
A Gromov–Wasserstein Geometric View of Spectrum-Preserving
Graph Coarsening

Yifan Chen    Rentian Yao    Yun Yang    Jie Chen
Abstract

Graph coarsening is a technique for solving large-scale graph problems by working on a smaller version of the original graph, and possibly interpolating the results back to the original graph. It has a long history in scientific computing and has recently gained popularity in machine learning, particularly in methods that preserve the graph spectrum. This work studies graph coarsening from a different perspective, developing a theory for preserving graph distances and proposing a method to achieve this. The geometric approach is useful when working with a collection of graphs, such as in graph classification and regression. In this study, we consider a graph as an element of a metric space equipped with the Gromov–Wasserstein (GW) distance, and bound the difference between the distance of two graphs and that of their coarsened versions. Minimizing this difference can be done using the popular weighted kernel $K$-means method, which improves existing spectrum-preserving methods with the proper choice of the kernel. The study includes a set of experiments to support the theory and method, including approximating the GW distance, preserving the graph spectrum, classifying graphs using spectral information, and performing regression using graph convolutional networks. Code is available at https://github.com/ychen-stat-ml/GW-Graph-Coarsening.


1 Introduction

Modeling the complex relationships among objects by using graphs and networks is ubiquitous in scientific applications (Morris et al., 2020). Examples range from the analysis of chemical and biological networks (Debnath et al., 1991; Helma et al., 2001; Dobson & Doig, 2003; Irwin et al., 2012; Sorkun et al., 2019) and learning and inference with social interactions (Oettershagen et al., 2020), to image understanding (Dwivedi et al., 2020). Many of these tasks involve large graphs, for which the computational costs can be rather high even when the graph is sparse (e.g., computing a few extreme eigenvalues and eigenvectors of the graph Laplacian can be done with a linear cost by using the Lanczos method, whereas computing the round-trip commute times requires the pseudoinverse of the Laplacian, which incurs a cubic cost). Therefore, it is practically important to develop methods that can efficiently handle graphs as their size grows. In this work, we consider one type of method, which resolves the scalability challenge by working on a smaller version of the graph. The production of the smaller surrogate is called graph coarsening.

Figure 1: An example of graph coarsening. Three nodes $v_1$, $v_2$, and $v_3$ are merged into a supernode $v^{\prime}_1$ in the coarsened graph, while each of the other nodes ($v_4$, $v_5$, and $v_6$) is a separate supernode.

Graph coarsening is a methodology for solving large-scale graph problems. Depending on the problem itself, one may develop a solution based on the coarsened graph, or interpolate the solution back to the original one. Figure 1 illustrates a toy example of graph coarsening, whereby a cluster of nodes ($v_1$, $v_2$, and $v_3$) in the original graph is merged into a so-called supernode ($v^{\prime}_1$) in the coarsened graph. If the problem is graph classification (predicting classes of multiple graphs in a dataset), one may train a classifier by using these smaller surrogates, if it is believed that they inherit the characteristics of the original graphs necessary for classification. On the other hand, if the problem is node regression,¹ one may predict the targets for the nodes in the coarsened graph and interpolate them in the original graph (Huang et al., 2021). [¹ Machine learning problems on graphs can be posed at the graph level (e.g., classifying the toxicity of a protein graph) or at the node level (e.g., predicting the coordinates of each node (atom) in a molecular graph). In this work, our experiments focus on graph-level problems.]

Graph coarsening has a long history in scientific computing, while it has recently gained traction in machine learning (Chen et al., 2022). A key question to consider is what characteristics should be preserved when reducing the graph size. A majority of work in machine learning focuses on spectral properties (Loukas & Vandergheynst, 2018; Loukas, 2019; Hermsdorff & Gunderson, 2019; Jin et al., 2020). That is, one desires that the eigenvalues of the original graph $G$ are close to those of the coarsened graph $G^{(c)}$. While being attractive, it is unclear why this objective is effective for problems involving a collection of graphs (e.g., graph classification), where the distances between graphs shape the classifier and the decision boundary. Hence, in this work, we consider a different objective, which desires that the distance of a pair of graphs, $\mathrm{dist}(G_1, G_2)$, is close to that of their coarsened versions, $\mathrm{dist}(G_1^{(c)}, G_2^{(c)})$.

To this end, we make the following contributions.

  • We consider a graph as an element of the metric space endowed with the Gromov–Wasserstein (GW) distance (Chowdhury & Mémoli, 2019), which formalizes the distance between graphs of different sizes and with different node weights. We analyze the distance between $G$ and $G^{(c)}$ as a major lemma for subsequent results.

  • We establish an upper bound on the difference between the GW distance of the original graph pair and that of the coarsened pair. Interestingly, this bound depends only on the respective spectrum change of each of the original graphs. Such a finding explains the effectiveness of spectrum-preserving coarsening methods for graph classification and regression problems.

  • We draw a connection between the upper bound and weighted kernel $K$-means, a popular clustering method, under a proper choice of the kernel. This connection leads to a graph coarsening method that we demonstrate to exhibit attractive inference quality for graph-level tasks, compared with other spectrum-preserving methods.

2 Related Work

Graph coarsening was made popular by scientific computing many decades ago, where a major problem was to solve large, sparse linear systems of equations (Saad, 2003). (Geometric) Multigrid methods (Briggs et al., 2000) were used to solve partial differential equations, the discretization of which leads to a mesh and an associated linear system. Multigrid methods solve a smaller system on the coarsened mesh and interpolate the solution back to the original mesh. When the linear system is not associated with a geometric mesh, algebraic multigrid methods (Ruge & Stüben, 1987) were developed to treat the coefficient matrix as a general graph, so that graph coarsening results in a smaller graph and the procedure of “solving a smaller problem then interpolating the solution on the larger one” remains applicable.

Graph coarsening was introduced to machine learning as a methodology of graph summarization (Liu et al., 2018) through the concepts of “graph cuts,” “graph clustering,” and “graph partitioning.” A commonality of these concepts is that graph nodes are grouped together based on a certain objective. Graph coarsening plays an important role in multilevel graph partitioning (Karypis & Kumar, 1999). The normalized cut (Shi & Malik, 2000) is a well-known and pioneering method for segmenting an image treated as a graph. Dhillon et al. (2007) compute weighted graph cuts by performing clustering on the coarsest graph resulting from hierarchical coarsening and refining the clustering along the reverse hierarchy. Graph partitioning is used to form convolutional features in graph neural networks (Defferrard et al., 2016) and to perform neighborhood pooling (Ying et al., 2018). Graph coarsening is used to learn node embeddings in a hierarchical manner (Chen et al., 2018). For a survey of graph coarsening with comprehensive accounts on scientific computing and machine learning, see Chen et al. (2022).

A class of graph coarsening methods aim to preserve the spectra of the original graphs. Loukas & Vandergheynst (2018) and Loukas (2019) introduce the notion of “restricted spectral similarity” (RSS), requiring the eigenvalues and eigenvectors of the coarsened graph Laplacian, when restricted to the principal eigen-subspace, to approximate those of the original graph. Local variation algorithms are developed therein to achieve RSS. Jin et al. (2020) suggest the use of a spectral distance as the key metric for measuring the preservation of spectral properties. The authors develop two coarsening algorithms to maximally reduce the spectral distance between the original and the coarsened graph. Hermsdorff & Gunderson (2019) develop a probabilistic algorithm to coarsen a graph while preserving the Laplacian pseudoinverse, by using an unbiased procedure to minimize the variance.

Graph coarsening is increasingly used in deep learning. One limitation of traditional coarsening methods is that they are meant to be universally applicable to any graph, without the flexibility to adapt to a particular dataset or distribution with its own characteristics. Cai et al. (2021) address this limitation by adjusting the edge weights in the coarsened graph through graph neural networks (GNNs). Conversely, graph coarsening techniques can be applied to scale up GNNs by preprocessing the graphs (Huang et al., 2021). On a separate note, graph condensation (Jin et al., 2021) is another technique to accelerate GNNs, which uses supervised learning to condense node features and the graph structure. Furthermore, graph pooling can assign a node in the original graph to multiple supernodes (Grattarola et al., 2022), similar to the relaxation-based graph coarsening strategy explored in the algebraic multigrid literature (Ron et al., 2011).

3 Notations and Preliminaries

In this section, we set up the notations for graph coarsening and review the background of Gromov–Wasserstein distance. Additional information can be found in Appendix A.

3.1 Graphs and Laplacians

We denote an undirected graph by $G$, whose node set is $\mathcal{V} := \{v_i\}_{i=1}^{N}$ with size $N = |\mathcal{V}|$ and whose symmetric weighted adjacency matrix is $\bm{A} := [a_{ij}]$. The $N$-by-$N$ (combinatorial) Laplacian matrix is defined as $\bm{L} := \bm{D} - \bm{A}$, where $\bm{D} := \mathrm{diag}(\bm{A}\bm{1}_{N})$ is the degree matrix. The normalized Laplacian is $\bm{\mathcal{L}} := \bm{D}^{-\frac{1}{2}} \bm{L} \bm{D}^{-\frac{1}{2}}$. Without loss of generality, we assume that $G$ is connected, in which case the smallest eigenvalue of $\bm{L}$ and of $\bm{\mathcal{L}}$ is zero and simple.

3.2 Graph coarsening and coarsened graphs

Given a graph $G$, graph coarsening amounts to finding a smaller graph $G^{(c)}$ with $n \leq |\mathcal{V}|$ nodes that approximates $G$. One common coarsening approach obtains the coarsened graph from a partitioning $\mathcal{P} = \{\mathcal{P}_1, \mathcal{P}_2, \dots, \mathcal{P}_n\}$ of the node set $\mathcal{V}$ (Loukas & Vandergheynst, 2018). In this approach, each subset $\mathcal{P}_j$ of nodes is collapsed into a supernode in the coarsened graph, and the edge weight between two supernodes is the sum of the edge weights crossing the two corresponding partitions. Additionally, the sum of the edge weights within a partition becomes the weight of the supernode. In matrix notation, we use $\bm{C}_p \in \{0,1\}^{n \times N}$ to denote the membership matrix induced by the partitioning $\mathcal{P}$, with the $(k,i)$ entry being $\bm{C}_p(k,i) = 1_{\{v_i \in \mathcal{P}_k\}}$, where $1_{\{\cdot\}}$ is the indicator function. For notational convenience, we define the adjacency matrix of the coarsened graph to be $\bm{A}^{(c)} = \bm{C}_p \bm{A} \bm{C}_p^{\intercal}$. Note that $\bm{A}^{(c)}$ includes not only edge weights but also node weights.
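As a concrete illustration of this construction, the following minimal NumPy sketch builds $\bm{C}_p$ and $\bm{A}^{(c)} = \bm{C}_p \bm{A} \bm{C}_p^{\intercal}$ for a toy graph in the spirit of Figure 1; the edge weights and the partition are made-up values for the example.

```python
import numpy as np

# Toy graph with N = 6 nodes; symmetric, hypothetical edge weights.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 0],
], dtype=float)

# Partition P = {{v1, v2, v3}, {v4}, {v5}, {v6}} as in Figure 1 (0-based indices).
partition = [[0, 1, 2], [3], [4], [5]]
n, N = len(partition), A.shape[0]

# Membership matrix C_p with C_p(k, i) = 1 if node v_i belongs to partition P_k.
C_p = np.zeros((n, N))
for k, part in enumerate(partition):
    C_p[k, part] = 1.0

# Coarsened adjacency A^(c) = C_p A C_p^T: off-diagonal entries sum the edge weights
# crossing two partitions; diagonal entries aggregate the within-partition weights.
A_c = C_p @ A @ C_p.T
print(A_c)
```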

If we similarly define $\bm{D}^{(c)} := \mathrm{diag}(\bm{A}^{(c)}\bm{1}_{n})$ to be the degree matrix of the coarsened graph $G^{(c)}$, then it can be verified that the matrix $\bm{L}^{(c)} = \bm{C}_p \bm{L} \bm{C}_p^{\intercal} = \bm{D}^{(c)} - \bm{A}^{(c)}$ is a (combinatorial) Laplacian (because its smallest eigenvalue is zero and simple). Additionally, $\bm{\mathcal{L}}^{(c)} = \left(\bm{D}^{(c)}\right)^{-\frac{1}{2}} \bm{L}^{(c)} \left(\bm{D}^{(c)}\right)^{-\frac{1}{2}}$ is the normalized Laplacian.

In the literature, the objective of coarsening is often to minimize some difference between the original and the coarsened graph. For example, in spectral graph coarsening, Loukas (2019) proposes that the $\bm{L}$-norm of any $N$-dimensional vector $\bm{x}$ be close to the $\bm{L}^{(c)}$-norm of the $n$-dimensional vector $(\bm{C}_p^{\intercal})^{+}\bm{x}$; while Jin et al. (2020) propose to minimize the difference between the ordered eigenvalues of the Laplacian of $G$ and those of the lifted Laplacian of $G^{(c)}$ (such that the number of eigenvalues matches). Such objectives, while being natural and interesting in their own right, are not the only choice. Questions may be raised; for example, it is unclear why the objective is to preserve the Laplacian spectrum but not the degree distribution or the clustering coefficients. Moreover, it is unclear how preserving the spectrum benefits the downstream use. In this paper, we consider a different objective, namely preserving the graph distance, which may help, for example, maintain the decision boundary in graph classification problems. To this end, we review the Gromov–Wasserstein (GW) distance.

3.3 Gromov–Wasserstein distance and its induced metric space

The GW distance was originally proposed by Mémoli (2011) to measure the distance between two metric measure spaces $M_{\mathcal{X}}$ and $M_{\mathcal{Y}}$.² However, using metrics to characterize the difference between elements in the sets $\mathcal{X}$ and $\mathcal{Y}$ can be too restrictive. Peyré et al. (2016) relaxed the metric notion by proposing the GW discrepancy, which uses a dissimilarity, instead of a metric, to characterize differences. Chowdhury & Mémoli (2019) then extended the concept of GW distance/discrepancy from metric measure spaces to measure networks, which can be considered a generalization of graphs. [² A metric measure space $M_{\mathcal{X}}$ is the triple $(\mathcal{X}, d_X, \mu_X)$, where $(\mathcal{X}, d_X)$ is a metric space with metric $d_X$ and $\mu_X$ is a Borel probability measure on $\mathcal{X}$.]

Definition 3.1 (Measure network).

A measure network is a triple $(\mathcal{X}, \omega_X, \mu_X)$, where $\mathcal{X}$ is a Polish space (a separable and completely metrizable topological space), $\mu_X$ is a fully supported Borel probability measure, and $\omega_X$ is a bounded measurable function on $\mathcal{X}^2$.

In particular, a graph $G$ (augmented with additional information) can be taken as a discrete measure network. We let $\mathcal{X}$ be the set of graph nodes $\{v_i\}_{i=1}^{N}$ and associate with it a probability mass $\mu_X = \bm{m} = [m_1, \dots, m_N]^{\intercal} \in \mathbb{R}_{+}^{N}$, $\sum_{i=1}^{N} m_i = 1$. Additionally, we associate $G$ with the node similarity matrix $\bm{S} = [s_{ij}] \in \mathbb{R}^{N \times N}$, whose entries are induced from the measurable map $\omega_X$: $s_{ij} = \omega_X(v_i, v_j)$. Note that the mass $\bm{m}$ and the similarity $\bm{S}$ need not be related to the node weights and edge weights, although later we will justify and advocate the use of a variant of the graph Laplacian as $\bm{S}$.

For a source graph $G_s$ with $N_s$ nodes and mass $\bm{m}_s$ and a target graph $G_t$ with $N_t$ nodes and mass $\bm{m}_t$, we can define a transport matrix $\bm{T} = [t_{ij}] \in \mathbb{R}^{N_s \times N_t}$, where $t_{ij}$ specifies the probability mass transported from $v_i^{s}$ (the $i$-th node of $G_s$) to $v_j^{t}$ (the $j$-th node of $G_t$). We denote the collection of all feasible transport matrices by $\Pi_{s,t}$, which includes all $\bm{T}$ that satisfy $\bm{T}\bm{1} = \bm{m}_s$ and $\bm{T}^{\intercal}\bm{1} = \bm{m}_t$. Using the $\ell_p$ transportation cost $L(a,b) = (a-b)^p$, Chowdhury & Mémoli (2019) define the $\mathrm{GW}_p$ distance for graphs as

\[
\operatorname{GW}_p^p(G_s, G_t) = \min_{\bm{T} \in \Pi_{s,t}} \sum_{i,j=1}^{N_s} \sum_{i',j'=1}^{N_t} \left| s^{s}_{ij} - s^{t}_{i'j'} \right|^p \bm{T}_{ii'} \bm{T}_{jj'} = \min_{\bm{T} \in \Pi_{s,t}} \left\langle \bm{M}, \bm{T} \right\rangle, \qquad (1)
\]

where the cross-graph dissimilarity matrix $\bm{M} \in \mathbb{R}^{N_s \times N_t}$ has entries $\bm{M}_{jj'} = \sum_{i,i'} \left| s^{s}_{ij} - s^{t}_{i'j'} \right|^p \bm{T}_{ii'}$ (which themselves depend on $\bm{T}$).

The computation of the $\mathrm{GW}_p$ distance (GW distance for short) can be thought of as finding a proper alignment of the nodes in two graphs, such that the aligned nodes $v^{s}_{j}$ and $v^{t}_{j'}$ have similar interactions with other nodes in their respective graphs. Intuitively, this is achieved by assigning a large transport mass $\bm{T}_{jj'}$ to a node pair $(v^{s}_{j}, v^{t}_{j'})$ with small dissimilarity $\bm{M}_{jj'}$, making the GW distance a useful tool for graph matching (Xu et al., 2019b). We note that variants of the GW distance exist; for example, Titouan et al. (2019) proposed a fused GW distance by additionally taking into account the dissimilarity of node features. We also note that computing the GW distance is NP-hard, but several approximate methods have been developed (Xu et al., 2019a; Zheng et al., 2022). In this work, we use the GW distance as a theoretical tool and do not necessarily need to compute it in practice.
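As a numerical illustration of (1), the sketch below approximates the $\operatorname{GW}_2$ distance between two tiny graphs with the POT library (Flamary et al., 2021). Using the adjacency matrices as stand-ins for the similarity matrices, the uniform node masses, and the exact keyword arguments are assumptions of this example, not a prescription from the paper.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)

# Two small graphs; for illustration only, the adjacency matrices play the role
# of the similarity matrices S_s and S_t.
S1 = np.array([[0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]], dtype=float)          # triangle
S2 = np.array([[0, 1, 0, 0],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [0, 0, 1, 0]], dtype=float)       # path on four nodes

# Uniform probability masses m_s and m_t (an assumption for this example).
m1 = np.full(S1.shape[0], 1.0 / S1.shape[0])
m2 = np.full(S2.shape[0], 1.0 / S2.shape[0])

# POT solves (1) approximately with the squared (p = 2) loss and returns the
# transport plan T; log["gw_dist"] is the corresponding objective value <M, T>.
T, log = ot.gromov.gromov_wasserstein(S1, S2, m1, m2,
                                      loss_fun="square_loss", log=True)
print(T)
print("GW_2^2 (approx.):", log["gw_dist"])
```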

To close this section, we remark that Chowdhury & Mémoli (2019, Theorem 18) show that the GW distance is indeed a metric for measure networks, modulo weak isomorphism.³ Therefore, we can formally establish the metric space of interest. [³ Weak isomorphism allows a node to be split into several identical nodes with the mass preserved. See Definition 3 of Chowdhury & Mémoli (2019).]

Definition 3.2 ($\operatorname{GW}_p$ space).

Let $\mathcal{N}$ be the collection of all measure networks. For $p \geq 1$, we denote by $(\mathcal{N}, \operatorname{GW}_p)$ the metric space of measure networks endowed with the $\operatorname{GW}_p$ distance defined in (1) and call it the $\operatorname{GW}_p$ space.

4 Graph Coarsening from a Gromov–Wasserstein Geometric View

In this section, we examine how the GW geometric perspective can shape graph coarsening. In Section 4.1, we provide a framework that unifies many variants of the coarsening matrices through the use of the probability mass $\bm{m}$ introduced in the context of measure networks. Then, in Section 4.2, we analyze the variant associated with the similarity matrix $\bm{S}$. In particular, we establish an upper bound on the difference of the GW distances before and after coarsening. Based on the upper bound, in Section 4.3 we connect it with some spectral graph techniques and in Section 4.4 we advocate a choice of $\bm{S}$ in practice.

4.1 A unified framework for coarsening matrices

The membership matrix $\bm{C}_p$ is one kind of coarsening matrix: it connects the adjacency matrix $\bm{A}$ with the coarsened version $\bm{A}^{(c)}$ through $\bm{A}^{(c)} = \bm{C}_p \bm{A} \bm{C}_p^{\intercal}$. There are, however, different variants of coarsening matrices. We consider three here, all of size $n \times N$ (a small construction sketch follows the list).

  (i) Accumulation. This is the matrix $\bm{C}_p$. When it multiplies $\bm{A}$ from the left, the $i$-th row of the product is the sum of all rows of $\bm{A}$ corresponding to the partition $\mathcal{P}_i$.

  (ii) Averaging. A natural alternative to summation is (weighted) averaging. We define the diagonal mass matrix $\bm{W} = \mathrm{diag}(m_1, \dots, m_N)$ and, for each partition $\mathcal{P}_i$, the accumulated mass $c_i = \sum_{j \in \mathcal{P}_i} m_j$ for all $i \in [n]$. Then, the averaging coarsening matrix is
\[ \bar{\bm{C}}_w := \mathrm{diag}(c_1^{-1}, \dots, c_n^{-1}) \bm{C}_p \bm{W}. \]
This matrix has an averaging effect because of the division by $c_i$. Moreover, when the probability mass $\bm{m}$ is uniform (i.e., all $m_j$'s are equal), we have the relation $\bar{\bm{C}}_w^{+} = \bm{C}_p^{\intercal}$.

  (iii) Projection. Neither $\bm{C}_p$ nor $\bar{\bm{C}}_w$ is orthogonal. We define the projection coarsening matrix as
\[ \bm{C}_w := \mathrm{diag}(c_1^{-1/2}, \dots, c_n^{-1/2}) \bm{C}_p \bm{W}^{\frac{1}{2}}, \]
noting that $\bm{C}_w \bm{C}_w^{\intercal}$ is the identity and hence $\bm{C}_w$ has orthonormal rows. Therefore, the $N$-by-$N$ matrix $\bm{\Pi}_w := \bm{W}^{\frac{1}{2}} \bm{C}_p^{\intercal} \bar{\bm{C}}_w \bm{W}^{-\frac{1}{2}} = \bm{C}_w^{\intercal} \bm{C}_w$ is a projection operator.
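The sketch below (the partition and the mass vector are made-up values) constructs the three variants and checks that $\bar{\bm{C}}_w$ indeed averages (its rows sum to one) and that $\bm{C}_w$ has orthonormal rows.

```python
import numpy as np

def coarsening_matrices(partition, m):
    """Return (C_p, C_bar_w, C_w) for a node partition and a mass vector m."""
    n, N = len(partition), len(m)
    C_p = np.zeros((n, N))
    for k, part in enumerate(partition):
        C_p[k, part] = 1.0                      # accumulation: C_p(k, i) = 1{v_i in P_k}
    W = np.diag(m)
    c = C_p @ m                                 # accumulated masses c_k = sum_{j in P_k} m_j
    C_bar_w = np.diag(1.0 / c) @ C_p @ W        # averaging
    C_w = np.diag(1.0 / np.sqrt(c)) @ C_p @ np.sqrt(W)  # projection
    return C_p, C_bar_w, C_w

# Hypothetical example: 5 nodes, 2 supernodes, non-uniform mass.
partition = [[0, 1, 2], [3, 4]]
m = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
C_p, C_bar_w, C_w = coarsening_matrices(partition, m)

assert np.allclose(C_bar_w @ np.ones(5), 1.0)   # rows of the averaging matrix sum to 1
assert np.allclose(C_w @ C_w.T, np.eye(2))      # C_w C_w^T = I: orthonormal rows
```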

In Section 3.2, we have seen that the combinatorial Laplacian $\bm{L}$ takes $\bm{C}_p$ as the coarsening matrix, because the relationship $\bm{L}^{(c)} = \bm{C}_p \bm{L} \bm{C}_p^{\intercal}$ inherits from $\bm{A}^{(c)} = \bm{C}_p \bm{A} \bm{C}_p^{\intercal}$. On the other hand, it can be proved that, if we take the diagonal mass matrix $\bm{W}$ to be the degree matrix $\bm{D}$, the normalized Laplacian defined in Section 3.2 can be written as $\bm{\mathcal{L}}^{(c)} = \bm{C}_w \bm{\mathcal{L}} \bm{C}_w^{\intercal}$ (see Section B.1). In other words, the normalized Laplacian uses $\bm{C}_w$ as the coarsening matrix. The matrix $\bm{\mathcal{L}}^{(c)}$ is called a doubly-weighted Laplacian (Chung & Langlands, 1996).

For a general similarity matrix $\bm{S}$ (not necessarily a Laplacian), we use the averaging coarsening matrix $\bar{\bm{C}}_w$ and define $\bm{S}^{(c)} := \bar{\bm{C}}_w \bm{S} \bar{\bm{C}}_w^{\intercal}$. This definition appears to be more natural in the GW setting; see a toy example in Section B.2. It is interesting to note that Vincent-Cuaz et al. (2022) proposed the concept of the semi-relaxed GW (srGW) divergence, wherein the first-order optimality condition of the constrained srGW barycenter problem is exactly the equality $\bm{S}^{(c)} = \bar{\bm{C}}_w \bm{S} \bar{\bm{C}}_w^{\intercal}$. See Section B.3.

In the next subsection, we will consider the matrix $\bm{U} := \bm{W}^{\frac{1}{2}} \bm{S} \bm{W}^{\frac{1}{2}}$. We define the coarsened version by using the projection coarsening matrix $\bm{C}_w$, as in $\bm{U}^{(c)} := \bm{C}_w \bm{U} \bm{C}_w^{\intercal}$. We will bound the distance between the original and the coarsened graph by using the eigenvalues of $\bm{U}$ and $\bm{U}^{(c)}$. The reason why $\bm{U}$ and $\bm{S}$ use different coarsening matrices lies in the technical subtlety of $\bm{W}^{\frac{1}{2}}$: $\bm{C}_w$ absorbs this factor from $\bar{\bm{C}}_w$.

4.2 Graph distance on the $\operatorname{GW}_2$ space

Now we consider the GW distance between two graphs. For theoretical and computational convenience, we take the $\ell_2$ transportation cost (i.e., $p = 2$ in (1)). When two graphs $G_1$ and $G_2$ are concerned, we attach the subscripts 1 and 2 to all related quantities, such as the probability masses $\bm{m}_1$ and $\bm{m}_2$, to avoid redundancy when introducing notation. This pair of subscripts should not cause confusion with other subscripts when interpreted in the proper context. Our analysis can be generalized from the $\operatorname{GW}_2$ distance to other distances, as long as the transportation cost satisfies the following decomposability condition.

Proposition 4.1.

(Peyré et al., 2016) Let $\bm{T}^{\ast} \in \mathbb{R}^{N_1 \times N_2}$ be the optimal transport plan from $\bm{m}_1$ to $\bm{m}_2$ (its existence is guaranteed because the feasible region of $\bm{T}$ is compact and the objective function $\langle \bm{M}, \bm{T} \rangle$ is continuous with respect to $\bm{T}$), and let $\bm{S}_k \in \mathbb{R}^{N_k \times N_k}$ be similarity matrices for $k = 1, 2$. If the transport cost can be written as $L(a,b) = f_1(a) + f_2(b) - h_1(a) h_2(b)$ for some element-wise functions $(f_1, f_2, h_1, h_2)$, then we can write $\bm{M}$ in (1) as

\[ f_1(\bm{S}_1) \bm{m}_1 \bm{1}_{N_2}^{\intercal} + \bm{1}_{N_1} \bm{m}_2^{\intercal} f_2(\bm{S}_2)^{\intercal} - h_1(\bm{S}_1) \bm{T}^{*} h_2(\bm{S}_2)^{\intercal}. \]

Clearly, for the squared cost $L(a,b) = (a-b)^2$, we may take $f_1(a) = a^2$, $f_2(b) = b^2$, $h_1(a) = a$, and $h_2(b) = 2b$.
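A small sketch of this decomposition for the squared cost may be helpful; the function below assembles $\bm{M}$ as in Proposition 4.1 for given similarity matrices, masses, and a transport plan (the function name and inputs are hypothetical).

```python
import numpy as np

def cross_dissimilarity_squared_loss(S1, S2, m1, m2, T):
    """M from Proposition 4.1 with L(a, b) = (a - b)^2, i.e.,
    f1(a) = a^2, f2(b) = b^2, h1(a) = a, h2(b) = 2b."""
    N1, N2 = S1.shape[0], S2.shape[0]
    term1 = np.outer((S1 ** 2) @ m1, np.ones(N2))   # f1(S1) m1 1_{N2}^T
    term2 = np.outer(np.ones(N1), (S2 ** 2) @ m2)   # 1_{N1} m2^T f2(S2)^T
    term3 = S1 @ T @ (2.0 * S2).T                   # h1(S1) T h2(S2)^T
    return term1 + term2 - term3

# At the optimal plan T*, the GW_2^2 value in (1) equals <M, T*>, e.g.,
# gw2 = np.sum(cross_dissimilarity_squared_loss(S1, S2, m1, m2, T_star) * T_star)
```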

We start the analysis by first bounding the distance between $G$ and $G^{(c)}$.

Theorem 4.2 (Single graph).

Consider a graph $G$ with a positive semi-definite (PSD) similarity matrix $\bm{S}$ and diagonal mass matrix $\bm{W}$, and similarly the coarsened graph $G^{(c)}$. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_N$ be the sorted eigenvalues of $\bm{U} = \bm{W}^{\frac{1}{2}} \bm{S} \bm{W}^{\frac{1}{2}}$ and $\lambda^{(c)}_1 \geq \cdots \geq \lambda^{(c)}_n$ be the sorted eigenvalues of $\bm{U}^{(c)} = \bm{C}_w \bm{U} \bm{C}_w^{\intercal}$. Then,

\[ \operatorname{GW}_2^2(G, G^{(c)}) \leq \lambda_{N-n+1} \sum_{i=1}^{n} \left( \lambda_i - \lambda^{(c)}_i \right) + C_{\bm{U},n}, \qquad (2) \]

where $C_{\bm{U},n} = \sum_{i=1}^{n} \lambda_i (\lambda_i - \lambda_{N-n+i}) + \sum_{i=n+1}^{N} \lambda_i^2$ is non-negative and is independent of coarsening.
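To make the quantities in (2) concrete, the following sketch evaluates $\Delta$, $C_{\bm{U},n}$, and the right-hand side of the bound from a PSD similarity matrix, a mass vector, and a projection coarsening matrix (e.g., produced by the earlier snippets); the inputs and the function name are assumptions of this example.

```python
import numpy as np

def theorem_4_2_bound(S, m, C_w):
    """Evaluate Delta in (3) and the right-hand side of the bound (2)."""
    N, n = S.shape[0], C_w.shape[0]
    W_half = np.diag(np.sqrt(m))
    U = W_half @ S @ W_half                         # U = W^{1/2} S W^{1/2}
    U_c = C_w @ U @ C_w.T                           # U^(c) = C_w U C_w^T
    lam = np.sort(np.linalg.eigvalsh(U))[::-1]      # lambda_1 >= ... >= lambda_N
    lam_c = np.sort(np.linalg.eigvalsh(U_c))[::-1]  # lambda^(c)_1 >= ... >= lambda^(c)_n
    delta = np.sum(lam[:n] - lam_c)                 # spectral difference Delta
    C_Un = np.sum(lam[:n] * (lam[:n] - lam[N - n:])) + np.sum(lam[n:] ** 2)
    bound = lam[N - n] * delta + C_Un               # lam[N - n] is lambda_{N-n+1}
    return delta, C_Un, bound
```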

Remark 4.3.

(i) The bound is tight when $n = N$ because the right-hand side is zero in this case. (ii) The choice of coarsening only affects the spectral difference

\[ \Delta := \sum_{i=1}^{n} \left( \lambda_i - \lambda_i^{(c)} \right), \qquad (3) \]

because $C_{\bm{U},n}$ is independent of it. Each term $\lambda_i - \lambda_i^{(c)}$ in $\Delta$ is non-negative due to the Poincaré separation theorem (see Section D.1). (iii) $\Delta$ is a generalization of the spectral distance proposed by Jin et al. (2020), because our matrix $\bm{U}$ is not necessarily the normalized Laplacian. For additional discussion, see Section B.4. (iv) When $\bm{U}$ is taken as the normalized Laplacian, our bound is advantageous over the bound established by Jin et al. (2020) in the sense that $\Delta$ is the only term impacted by coarsening and that no assumptions on the $K$-means cost are imposed.

We now bound the difference of distances. The following theorem suggests that the only terms dependent on coarsening are $\Delta_1$ and $\Delta_2$, the counterparts of $\Delta$ in Theorem 4.2 for graphs $G_1$ and $G_2$, respectively.

Theorem 4.4.

Given a pair of graphs $G_1$ and $G_2$, we extend all notations in Theorem 4.2 by adding subscripts 1 and 2 for $G_1$ and $G_2$, respectively. We denote the optimal transport plan induced by $\operatorname{GW}_2(G_1, G_2)$ as $\bm{T}^{*}$ and let the normalized counterpart be $\bm{P} = \bm{W}_1^{-\frac{1}{2}} \bm{T}^{*} \bm{W}_2^{-\frac{1}{2}}$. Additionally, we define $\bm{V}_1 := \bm{P} \bm{W}_2^{\frac{1}{2}} \bm{S}_2 \bm{W}_2^{\frac{1}{2}} \bm{P}^{\intercal}$ with eigenvalues $\nu_{1,1} \geq \nu_{1,2} \geq \cdots \geq \nu_{1,N_1}$ and $\bm{V}_2 := \bm{P}^{\intercal} \bm{W}_1^{\frac{1}{2}} \bm{S}_1 \bm{W}_1^{\frac{1}{2}} \bm{P}$ with eigenvalues $\nu_{2,1} \geq \nu_{2,2} \geq \cdots \geq \nu_{2,N_2}$, both independent of coarsening. Then, $\left| \operatorname{GW}_2^2(G_1^{(c)}, G_2^{(c)}) - \operatorname{GW}_2^2(G_1, G_2) \right|$ is upper bounded by

\[ \max\Big\{ \lambda_{1,N_1-n_1+1}\,\Delta_1 + C_{\bm{U}_1,n_1} + \lambda_{2,N_2-n_2+1}\,\Delta_2 + C_{\bm{U}_2,n_2},\ \ 2 \big[ \nu_{1,N_1-n_1+1}\,\Delta_1 + C_{\bm{U}_1,\bm{V}_1,n_1} + \nu_{2,N_2-n_2+1}\,\Delta_2 + C_{\bm{U}_2,\bm{V}_2,n_2} \big] \Big\}, \]

where $C_{\bm{U}_1,n_1}$ is from Theorem 4.2 and the other coarsening-independent terms $\bm{U}_2$, $C_{\bm{U}_2,n_2}$, $C_{\bm{U}_1,\bm{V}_1,n_1}$, and $C_{\bm{U}_2,\bm{V}_2,n_2}$ are introduced in Lemma E.5 in Appendix E.

Remark 4.5.

(i) The above bound takes into account both the differences between the two graphs and their respective coarsenings. Even when the two graphs are identical, $G_1 = G_2$, the bound can still be nonzero if the coarsened graphs $G_1^{(c)}$ and $G_2^{(c)}$ do not match. (ii) The decoupling of $\Delta_1$ and $\Delta_2$ offers an algorithmic benefit when one wants to optimize the differences of distances for all graph pairs in a dataset: it suffices to optimize the distance between $G$ and $G^{(c)}$ for each graph individually. This benefit is in line with the prior practice of directly applying spectrum-preserving coarsening methods to graph-level tasks (Jin et al., 2020; Huang et al., 2021). Their experimental results and our numerical verification in Section 6 show that our bound is useful and partly explains the empirical success of spectral graph coarsening.

4.3 Connections with truncated SVDs and Laplacian eigenmaps

The spectral difference $\Delta = \sum_{i=1}^{n} (\lambda_i - \lambda^{(c)}_i)$ in Theorem 4.2 can be used as the loss function for defining an optimal coarsening. Because $\Delta + \sum_{i=n+1}^{N} \lambda_i = \operatorname{Tr}(\bm{U}) - \operatorname{Tr}(\bm{C}_w \bm{U} \bm{C}_w^{\intercal})$ and because $\sum_{i=n+1}^{N} \lambda_i$ is independent of coarsening, minimizing $\Delta$ is equivalent to the following problem

\[ \min_{\bm{C}_w} \operatorname{Tr}\left( \bm{U} - \bm{\Pi}_w \bm{U} \bm{\Pi}_w \right), \qquad (4) \]

by recalling that $\bm{\Pi}_w = \bm{C}_w^{\intercal} \bm{C}_w$ is a projector. Problem (4) is a well-known trace optimization problem, which has rich connections with many spectral graph techniques (Kokiopoulou et al., 2011).

Connection with truncated SVD. At first sight, a small trace difference does not necessarily imply the two matrices are close. However, because $\bm{\Pi}_w$ is a projector, the Poincaré separation theorem (see Section D.1) suggests that their eigenvalues can be close. A well-known example of using the trace to find optimal approximations is the truncated SVD, which retains the top singular values (equivalently, eigenvalues for PSD matrices). The truncated SVD is a technique to find the optimal rank-$n$ approximation of a general matrix $\bm{U}$ in terms of the spectral norm or the Frobenius norm, solving the problem

\[ \min_{\bm{C} \in \mathcal{O}_n} \operatorname{Tr}\left( \bm{U} - \bm{C}^{\intercal} \bm{C} \bm{U} \bm{C}^{\intercal} \bm{C} \right), \]

where $\mathcal{O}_n$ is the class of all rank-$n$ orthogonal matrices. The projection coarsening matrix $\bm{C}_w$ belongs to this class.

Connection with Laplacian eigenmaps. The Laplacian eigenmap (Belkin & Niyogi, 2003) is a manifold learning technique that learns an $n$-dimensional embedding for $N$ points connected by a graph. The embedding matrix $\bm{Y}$ solves the trace problem $\min_{\bm{Y}\bm{D}\bm{Y}^{\intercal} = \bm{I}_n} \operatorname{Tr}\left( \bm{Y} \bm{L} \bm{Y}^{\intercal} \right)$. If we let $\bar{\bm{Y}} = \bm{Y} \bm{D}^{\frac{1}{2}}$, the problem is equivalent to

\[ \min_{\bar{\bm{Y}} \in \mathcal{O}_n} \operatorname{Tr}\left( \bar{\bm{Y}} \bm{\mathcal{L}} \bar{\bm{Y}}^{\intercal} \right). \]

Different from the truncated SVD, which uses the top singular vectors (eigenvectors) to form the solution, the Laplacian eigenmap uses the bottom eigenvectors of $\bm{\mathcal{L}}$ to form the solution.

4.4 Signless Laplacians as similarity matrices

The theory established in Section 4.2 is applicable to any PSD matrix $\bm{S}$, but for practical use we still have to define it. Based on the foregoing exposition, it is tempting to let $\bm{S}$ be the Laplacian $\bm{L}$ or the normalized Laplacian $\bm{\mathcal{L}}$, because they are PSD and they reveal important information about the graph structure (Tsitsulin et al., 2018). In fact, as a real example, Chowdhury & Needham (2021) used the heat kernel as the similarity matrix to define the GW distance (and in this case the GW framework is related to spectral clustering). However, two problems make such a choice troublesome. First, the (normalized) Laplacian is sparse but its off-diagonal, nonzero entries are negative. Thus, when $\bm{S}$ is interpreted as a similarity matrix, a pair of nodes not connected by an edge becomes more similar than a pair of connected nodes, causing a dilemma. Second, under the trace optimization framework, one is intuitively lured to look for solutions toward the bottom eigenvectors of $\bm{S}$, as in Laplacian eigenmaps, the opposite direction to the true solution of (4), which is toward the top eigenvectors instead.

To resolve these problems, we propose to use the signless Laplacian (Cvetković et al., 2007), $\bm{D} + \bm{A}$, or its normalized version, $\bm{I}_N + \bm{D}^{-\frac{1}{2}} \bm{A} \bm{D}^{-\frac{1}{2}}$, as the similarity matrix $\bm{S}$. These matrices are PSD and their nonzero entries are all positive. With such a choice, the spectral difference $\Delta$ in (3) has a fundamentally different behavior from the spectral distance used by Jin et al. (2020) when defining the coarsening objective.
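A minimal sketch of the two proposed choices of $\bm{S}$ follows (the example adjacency matrix is made up); a quick eigenvalue check confirms that both matrices are PSD.

```python
import numpy as np

def signless_laplacians(A):
    """Return the signless Laplacian D + A and its normalized version
    I + D^{-1/2} A D^{-1/2} for a weighted adjacency matrix A."""
    d = A.sum(axis=1)
    Q = np.diag(d) + A
    D_inv_half = np.diag(1.0 / np.sqrt(d))
    Q_norm = np.eye(A.shape[0]) + D_inv_half @ A @ D_inv_half
    return Q, Q_norm

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)               # hypothetical example graph
Q, Q_norm = signless_laplacians(A)
assert np.all(np.linalg.eigvalsh(Q) >= -1e-10)       # PSD
assert np.all(np.linalg.eigvalsh(Q_norm) >= -1e-10)  # PSD, nonnegative entries
```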

5 Computational Equivalence between Graph Coarsening and Weighted Kernel $K$-means

By recalling that $\bm{U} = \bm{W}^{\frac{1}{2}} \bm{S} \bm{W}^{\frac{1}{2}}$ and that $\bm{\Pi}_w = \bm{C}_w^{\intercal} \bm{C}_w$ is a projector, we rewrite the coarsening objective in (4) as

\[ \operatorname{Tr}\left( \bm{W}^{\frac{1}{2}} \bm{S} \bm{W}^{\frac{1}{2}} \right) - \operatorname{Tr}\left( \bm{C}_w \bm{W}^{\frac{1}{2}} \bm{S} \bm{W}^{\frac{1}{2}} \bm{C}_w^{\intercal} \right). \qquad (5) \]

When $\bm{S}$ is PSD, it can be interpreted as a kernel matrix, such that there exists a set of feature vectors $\{\bm{\phi}_i\}$ for which $\bm{S}_{ij} = \langle \bm{\phi}_i, \bm{\phi}_j \rangle$ for $i, j = 1, \dots, N$. Then, with simple algebraic manipulation, we see that (5) is equivalent to the well-known clustering objective:

\[ \sum_{k} \sum_{i \in \mathcal{P}_k} m_i \|\bm{\phi}_i - \bm{\mu}_k\|^2 \quad \text{with} \quad \bm{\mu}_k = \sum_{i \in \mathcal{P}_k} \frac{m_i}{c_k} \bm{\phi}_i, \qquad (6) \]

where the norm $\|\cdot\|$ is induced by the inner product $\langle\cdot,\cdot\rangle$. Here, $\bm{\mu}_k$ is the weighted center of the $\bm{\phi}_i$ for all nodes $i$ belonging to cluster $k$. Hence, the weighted kernel $K$-means algorithm (Dhillon et al., 2004, 2007) can be applied to minimize (6) by iteratively recomputing the centers and updating the cluster assignments according to the distance of a node to all centers. We denote the squared distance between $\bm{\phi}_i$ and any $\bm{\mu}_j$ by $\mathrm{dist}^2_j(i)$, which is

\[ \bm{S}_{ii} - 2 \sum_{k \in \mathcal{P}_j} m_k \bm{S}_{ki} / c_j + \sum_{k_1, k_2 \in \mathcal{P}_j} m_{k_1} m_{k_2} \bm{S}_{k_1 k_2} / c_j^2. \qquad (7) \]

The $K$-means algorithm is summarized in Algorithm 1, and we call this method kernel graph coarsening (KGC).

Algorithm 1 Kernel graph coarsening (KGC).

Input: $\bm{S}$: similarity matrix, $\bm{m}$: node mass vector, $n$: number of clusters
Output: $\mathcal{P} = \{\mathcal{P}_i\}_{i=1}^{n}$: node partition

1:  function KGC($\bm{S}, \bm{m}, n$)
2:     Initialize the $n$ clusters: $\mathcal{P}^{(0)} = \{\mathcal{P}_1^{(0)}, \dots, \mathcal{P}_n^{(0)}\}$; $c_j^{(0)} = \sum_{k \in \mathcal{P}_j^{(0)}} m_k$, $\forall j \in [n]$.
3:     Set the iteration counter $t = 0$.
4:     for node $i = 1$ to $N$ do
5:        Find its new cluster index by (7): $\mathrm{idx}(i) = \operatorname*{arg\,min}_{j \in [n]} \mathrm{dist}^{(t)}_{j}(i)$.
6:     end for
7:     Update the clusters: for all $j \in [n]$, $\mathcal{P}_j^{(t+1)} = \{i : \mathrm{idx}(i) = j\}$ and $c_j^{(t+1)} = \sum_{k \in \mathcal{P}_j^{(t+1)}} m_k$.
8:     if the partition $\mathcal{P}^{(t+1)}$ is invariant then
9:        return $\mathcal{P}^{(t+1)}$
10:     else
11:        Set $t = t + 1$ and go to Line 4.
12:     end if
13:  end function
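A minimal NumPy sketch of Algorithm 1 follows, with the per-node distances computed via (7). The random initialization, the handling of empty clusters, and the iteration cap are simplifying assumptions of this sketch rather than details of the released implementation.

```python
import numpy as np

def kgc(S, m, n, max_iter=100, seed=0):
    """Kernel graph coarsening (Algorithm 1): weighted kernel K-means on S."""
    N = S.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.integers(n, size=N)                  # random initial assignment (assumption)
    diag_S = np.diag(S)
    for _ in range(max_iter):
        dist2 = np.full((N, n), np.inf)
        for j in range(n):
            mask = idx == j
            if not mask.any():                     # empty cluster: leave distance at infinity
                continue
            mj, cj = m[mask], m[mask].sum()        # masses in P_j and accumulated mass c_j
            # Squared distance (7): S_ii - 2 sum_k m_k S_ki / c_j
            #                       + sum_{k1,k2} m_{k1} m_{k2} S_{k1 k2} / c_j^2
            dist2[:, j] = (diag_S
                           - 2.0 * (S[:, mask] @ mj) / cj
                           + (mj @ S[np.ix_(mask, mask)] @ mj) / cj**2)
        new_idx = dist2.argmin(axis=1)
        if np.array_equal(new_idx, idx):           # partition invariant: stop
            break
        idx = new_idx
    return [np.flatnonzero(idx == j) for j in range(n)]
```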

KGC as a post-refinement of graph coarsening. KGC can be used as a standalone coarsening method. A potential drawback of this usage is that the $K$-means algorithm is sensitive to initialization and the clustering quality sometimes varies significantly. Even advanced initialization techniques, such as $K$-means++ (Arthur & Vassilvitskii, 2006), are not guaranteed to work well in practice. Moreover, KGC, in its vanilla form, does not fully utilize the graph information (such as node features), unless additional engineering of the similarity matrix $\bm{S}$ is conducted. Faced with these drawbacks, we suggest a simple alternative: initialize KGC with the output of another coarsening method and use KGC to improve it (Scrucca & Raftery, 2015). In our experience, KGC almost always monotonically reduces the spectral difference $\Delta$ and improves the quality of the initial coarsening.

Time complexity. Let $T$ be the number of $K$-means iterations. The time complexity of KGC is $\mathcal{O}(T(M + Nn))$, where $M = \mathrm{nnz}(\bm{S})$ is the number of nonzeros in $\bm{S}$. See Section B.5 for the derivation of this cost and a comparison with the cost of spectral graph coarsening (Jin et al., 2020).

6 Numerical Experiments

We evaluate graph coarsening methods, including ours, on eight benchmark graph datasets: MUTAG (Debnath et al., 1991; Kriege & Mutzel, 2012), PTC (Helma et al., 2001), PROTEINS (Borgwardt et al., 2005; Schomburg et al., 2004), MSRC (Neumann et al., 2016), IMDB (Yanardag & Vishwanathan, 2015), Tumblr (Oettershagen et al., 2020), AQSOL (Sorkun et al., 2019; Dwivedi et al., 2020), and ZINC (Irwin et al., 2012). Information of these datasets is summarized in Table 1.

Table 1: Summary of datasets. $|V|$ and $|E|$ denote the average number of nodes and edges, respectively. All graphs are treated as undirected. R(1) represents a regression task.

Dataset    Classes  Size   |V|    |E|
MUTAG      2        188    17.93  19.79
PTC        2        344    14.29  14.69
PROTEINS   2        1113   39.06  72.82
MSRC       8        221    39.31  77.35
IMDB       2        1000   19.77  96.53
Tumblr     2        373    53.11  199.78
AQSOL      R(1)     9823   17.57  17.86
ZINC       R(1)     12000  23.16  49.83

We compare our method with the following baseline methods (Loukas, 2019; Jin et al., 2020): (1) Variation Neighborhood Graph Coarsening (VNGC); (2) Variation Edge Graph Coarsening (VEGC); (3) Multilevel Graph Coarsening (MGC); and (4) Spectral Graph Coarsening (SGC). For our method, we consider the vanilla KGC, which uses $K$-means++ for initialization, and the variant KGC(A), which takes the output of the best-performing baseline method for initialization. More implementation details are provided in Appendix C.

6.1 GW distance approximation

For a sanity check, we evaluate each coarsening method on the approximation of the GW distance, to support our motivation of minimizing the upper bound in Theorem 4.2.

We first compare the average squared GW distance, using the normalized signless Laplacian $\bm{I}_N + \bm{D}^{-\frac{1}{2}} \bm{A} \bm{D}^{-\frac{1}{2}}$ as the similarity matrix $\bm{S}$ (see Section 4.4). We vary the coarsened graph size $n = \lceil cN \rceil$ for $c = 0.3, \dots, 0.9$. For each graph in PTC and IMDB, we compute $\operatorname{GW}_2^2(G, G^{(c)})$ and report the average in Table 5, Appendix C.1. We obtain similar observations across the two datasets. In particular, KGC and KGC(A) outperform the baselines. When $c$ is small, it is harder for $K$-means clustering to find the best clustering plan and hence KGC(A) works better. When $c$ gets larger, KGC can work better, probably because the baseline outputs are poor initializations.

We then report the average gap between the left- and right-hand sides of the bound (2) in Table 6, Appendix C.1. For all methods, including the baselines, the gap is comparable to the actual squared distance shown in Table 5, showcasing the quality of the bound. Moreover, the gap decreases when cc increases, as expected.

We next compute the matrix of GW distances, $\bm{Z}$ (before coarsening) and $\bm{Z}^{(c)}$ (after coarsening), and compute the change $\|\bm{Z} - \bm{Z}^{(c)}\|_F$. Following previous works (Chan & Airoldi, 2014; Xu et al., 2020), we set $n = \lfloor N_{\mathrm{max}} / \log(N_{\mathrm{max}}) \rfloor$. The similarity matrix for the coarsened graph uses the averaging coarsening matrix as advocated in Section 4.1; that is, $\bm{S}^{(c)} = \bar{\bm{C}}_w \left( \bm{I}_N + \bm{D}^{-\frac{1}{2}} \bm{A} \bm{D}^{-\frac{1}{2}} \right) \bar{\bm{C}}_w^{\intercal}$. Additionally, we use a variant of the similarity matrix, $\bm{S}^{(c)} = \bm{I}_n + \left( \bm{D}^{(c)} \right)^{-\frac{1}{2}} \bm{A}^{(c)} \left( \bm{D}^{(c)} \right)^{-\frac{1}{2}}$, resulting from the projection coarsening matrix, for comparison.

Table 2: Change of the matrix of GW distances (computed as the Frobenius norm error) and coarsening time. Dataset: PTC. Results are averaged over 10 runs. MGC is a deterministic method, so its standard deviation is 0. KGC(A) is initialized with the MGC output.

Coars. Mat.  Methods  Frob. Error ↓    Time ↓
Projection   VNGC     182.85 ± 0.02    6.38 ± 0.01
             VEGC      54.81 ± 0.02    3.81 ± 0.
             MGC       13.69 ± 0.      6.71 ± 0.01
             SGC       12.41 ± 0.04   30.24 ± 0.07
Averaging    VNGC      17.34 ± 0.01    6.55 ± 0.18
             VEGC       9.22 ± 0.02    3.75 ± 0.01
             MGC        5.31 ± 0.      6.59 ± 0.02
             SGC        6.06 ± 0.02   28.06 ± 0.10
             KGC        4.45 ± 0.03    1.34 ± 0.33
             KGC(A)     5.28 ± 0.      0.27 ± 0.

Table 2 summarizes the results for PTC. It confirms that using "Averaging" is better than using "Projection" for the coarsening matrix. Additionally, KGC(A) initialized with MGC (the best baseline) induces a small change (though no smaller than the change induced by KGC), further verifying that minimizing the loss (4) leads to a small change in the GW distance. The runtime reported in the table confirms the analysis in Section 5, showing that KGC is efficient. Moreover, the runtime of KGC(A) is nearly negligible compared with that of the baseline method used for initializing KGC(A).

6.2 Laplacian spectrum preservation

We evaluate the capability of the methods in preserving the Laplacian spectrum, noting that the spectral objectives of the baseline methods differ from ours (see the remarks after Theorem 4.2). We coarsen each graph to $10\%$, $20\%$, $30\%$, and $40\%$ of its size and evaluate the metric $\frac{1}{5}\sum_{i=1}^{5} \frac{\lambda_i - \lambda_i^{(c)}}{\lambda_i}$ (the relative error of the five largest eigenvalues). The experiment is conducted on Tumblr and the calculation of the error metric is repeated ten times.

In Figure 2, we see that KGC has a comparable error to the variational methods VNGC and VEGC, while incurring a lower time cost, especially when the coarsening ratio is small. Additionally, KGC(A) consistently improves the initialization method SGC and attains the lowest error.

Figure 2: Runtime of the coarsening methods and relative error for spectrum preservation on Tumblr. KGC(A) is initialized with SGC (the black dashed curve); its error curve (the red solid curve) is slightly below that of SGC.
Table 3: Classification accuracy with coarsened graphs five times smaller than the original graphs.
Datasets  MUTAG         PTC           PROTEINS      MSRC          IMDB          Tumblr
VNGC      76.11 ± 2.25  56.69 ± 2.52  65.44 ± 1.57  14.92 ± 1.57  53.90 ± 0.50  50.43 ± 2.62
VEGC      84.59 ± 2.02  56.39 ± 2.03  64.08 ± 1.11  16.80 ± 2.15  64.20 ± 1.90  48.26 ± 1.71
MGC       84.15 ± 3.14  54.66 ± 3.59  66.16 ± 1.64  15.36 ± 1.80  69.50 ± 1.42  50.14 ± 2.67
SGC       84.44 ± 2.86  53.79 ± 2.28  63.91 ± 1.51  16.76 ± 2.50  66.00 ± 1.26  48.53 ± 2.35
KGC       81.90 ± 2.74  61.58 ± 2.49  63.45 ± 0.83  19.84 ± 2.23  67.80 ± 1.65  52.52 ± 2.81
KGC(A)    86.23 ± 2.69  57.25 ± 2.16  66.43 ± 0.92  17.17 ± 2.91  69.20 ± 1.37  52.57 ± 2.22
EIG       85.61 ± 1.69  56.08 ± 2.28  64.35 ± 1.43  12.19 ± 2.79  68.70 ± 1.71  49.57 ± 1.95
FULL      84.59 ± 2.51  54.37 ± 2.12  67.51 ± 0.82  23.58 ± 2.50  69.90 ± 1.40  52.57 ± 3.36

6.3 Graph classification using Laplacian spectrum

We test the performance of the various coarsening methods for graph classification. We follow the setting of Jin et al. (2020) and adopt the Network Laplacian Spectral Descriptor (Tsitsulin et al., 2018, NetLSD) along with a 1-NN classifier as the classification method. We set $n = 0.2N$. For evaluation, we select the best average accuracy of 10-fold cross validation among three different seeds. Additionally, we add two baselines, EIG and FULL, which respectively use the first $n$ eigenvalues and the full spectrum of $\bm{\mathcal{L}}$ in NetLSD. As before, we initialize KGC(A) with the best baseline.

Table 3 shows that preserving the spectrum does not necessarily lead to the best classification, since EIG and FULL, which use the original spectrum, are sometimes outperformed by other methods. On the other hand, our proposed method, KGC or KGC(A), is almost always the best.

6.4 Graph regression using GCNs

We follow the setting of Dwivedi et al. (2020) to perform graph regression on AQSOL and ZINC (see Appendix C.4 for details). For evaluation, we pre-coarsen the whole datasets, reduce the graph size by 70%, and run GCN on the coarsened graphs. Table 4 suggests that KGC(A) initialized with VEGC (the best baseline) always returns the best test MAE. On ZINC, KGC sometimes suffers from the initialization, but its performance is still comparable to a reported MLP baseline (Dwivedi et al., 2020, TestMAE $0.706 \pm 0.006$).

Table 4: Graph regression results on AQSOL and ZINC. The asterisk (*) indicates that the TestMAE improvement of KGC(A) over its VEGC initialization is statistically significant at the 95% level in a paired $t$-test.

(a) AQSOL.
Methods  TestMAE ± s.d.   TrainMAE ± s.d.  Epochs
VNGC     1.403 ± 0.005    0.629 ± 0.018    135.75
VEGC     1.390 ± 0.005    0.702 ± 0.003    107.75
MGC      1.447 ± 0.005    0.628 ± 0.012    111.00
SGC      1.489 ± 0.010    0.676 ± 0.021    107.00
KGC      1.389 ± 0.015    0.678 ± 0.013    112.00
KGC(A)   1.383 ± 0.005*   0.657 ± 0.013    124.75
FULL     1.372 ± 0.020    0.593 ± 0.030    119.50

(b) ZINC.
Methods  TestMAE ± s.d.   TrainMAE ± s.d.  Epochs
VNGC     0.709 ± 0.005    0.432 ± 0.012    120.00
VEGC     0.646 ± 0.001    0.418 ± 0.008    138.25
MGC      0.677 ± 0.002    0.414 ± 0.006    112.50
SGC      0.649 ± 0.007    0.429 ± 0.008    111.75
KGC      0.737 ± 0.010    0.495 ± 0.012    113.50
KGC(A)   0.641 ± 0.003*   0.433 ± 0.013    126.50
FULL     0.416 ± 0.006    0.313 ± 0.011    159.50

7 Conclusions

In this work, we propose a new perspective to study graph coarsening, by analyzing the distance between graphs on the GW space. We derive an upper bound on the change of the distance caused by coarsening, which depends only on the spectral difference $\Delta = \sum_{i=1}^{n} (\lambda_i - \lambda^{(c)}_i)$. This bound in a way justifies the idea of preserving the spectral information as the main objective of graph coarsening, although our definition of "spectrum-preserving" differs from prior spectral coarsening techniques. More importantly, we point out the equivalence between the bound and the objective of weighted kernel $K$-means clustering. This equivalence leads to a new coarsening method, which we term KGC. Our experimental results validate the theoretical analysis, showing that KGC preserves the GW distance between graphs and improves the accuracy of graph-level classification and regression tasks.

Acknowledgements

We thank the anonymous reviewers for their constructive and insightful comments. Yun Yang's research was supported in part by U.S. NSF grant DMS-2210717. Jie Chen acknowledges support from the MIT-IBM Watson AI Lab.

References

  • Arthur & Vassilvitskii (2006) Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. Technical report, Stanford, 2006.
  • Belkin & Niyogi (2003) Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
  • Bellman (1997) Bellman, R. Introduction to matrix analysis. SIAM, 1997.
  • Borgwardt et al. (2005) Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S., Smola, A. J., and Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
  • Briggs et al. (2000) Briggs, W. L., Henson, V. E., and McCormick, S. F. A Multigrid Tutorial. Society for Industrial and Applied Mathematics, 2nd edition, 2000.
  • Cai et al. (2021) Cai, C., Wang, D., and Wang, Y. Graph coarsening with neural networks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  • Chan & Airoldi (2014) Chan, S. and Airoldi, E. A consistent histogram estimator for exchangeable graph models. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp.  208–216, Bejing, China, 22–24 Jun 2014. PMLR. URL https://proceedings.mlr.press/v32/chan14.html.
  • Chen et al. (2018) Chen, H., Perozzi, B., Hu, Y., and Skiena, S. HARP: Hierarchical representation learning for networks. In AAAI, 2018.
  • Chen et al. (2022) Chen, J., Saad, Y., and Zhang, Z. Graph coarsening: from scientific computing to machine learning. SeMA Journal, 79(1):187–223, 2022.
  • Chowdhury & Mémoli (2019) Chowdhury, S. and Mémoli, F. The gromov–wasserstein distance between networks and stable network invariants. Information and Inference: A Journal of the IMA, 8(4):757–787, 2019.
  • Chowdhury & Needham (2021) Chowdhury, S. and Needham, T. Generalized spectral clustering via gromov-wasserstein learning. In International Conference on Artificial Intelligence and Statistics, pp.  712–720. PMLR, 2021.
  • Chung & Langlands (1996) Chung, F. R. and Langlands, R. P. A combinatorial laplacian with vertex weights. journal of combinatorial theory, Series A, 75(2):316–327, 1996.
  • Cvetković et al. (2007) Cvetković, D., Rowlinson, P., and Simić, S. K. Signless laplacians of finite graphs. Linear Algebra and its applications, 423(1):155–171, 2007.
  • Debnath et al. (1991) Debnath, A. K., Lopez de Compadre, R. L., Debnath, G., Shusterman, A. J., and Hansch, C. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity. Journal of medicinal chemistry, 34(2):786–797, 1991.
  • Defferrard et al. (2016) Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp.  3837–3845, 2016.
  • Dhillon et al. (2004) Dhillon, I. S., Guan, Y., and Kulis, B. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp.  551–556, 2004.
  • Dhillon et al. (2007) Dhillon, I. S., Guan, Y., and Kulis, B. Weighted graph cuts without eigenvectors a multilevel approach. IEEE transactions on pattern analysis and machine intelligence, 29(11):1944–1957, 2007.
  • Dobson & Doig (2003) Dobson, P. D. and Doig, A. J. Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology, 330(4):771–783, 2003.
  • Dwivedi et al. (2020) Dwivedi, V. P., Joshi, C. K., Luu, A. T., Laurent, T., Bengio, Y., and Bresson, X. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.
  • Flamary et al. (2021) Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., Gautheron, L., Gayraud, N. T., Janati, H., Rakotomamonjy, A., Redko, I., Rolet, A., Schutz, A., Seguy, V., Sutherland, D. J., Tavenard, R., Tong, A., and Vayer, T. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1–8, 2021. URL http://jmlr.org/papers/v22/20-451.html.
  • Grattarola et al. (2022) Grattarola, D., Zambon, D., Bianchi, F. M., and Alippi, C. Understanding pooling in graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • Helma et al. (2001) Helma, C., King, R. D., Kramer, S., and Srinivasan, A. The predictive toxicology challenge 2000–2001. Bioinformatics, 17(1):107–108, 2001.
  • Hermsdorff & Gunderson (2019) Hermsdorff, G. B. and Gunderson, L. M. A unifying framework for spectrum-preserving graph sparsification and coarsening. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp.  7734–7745, 2019.
  • Hu et al. (2020) Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
  • Huang et al. (2021) Huang, Z., Zhang, S., Xi, C., Liu, T., and Zhou, M. Scaling up graph neural networks via graph coarsening. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp.  675–684, 2021.
  • Irwin et al. (2012) Irwin, J. J., Sterling, T., Mysinger, M. M., Bolstad, E. S., and Coleman, R. G. Zinc: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52(7):1757–1768, 2012.
  • Ivashkin & Chebotarev (2016) Ivashkin, V. and Chebotarev, P. Do logarithmic proximity measures outperform plain ones in graph clustering? In International Conference on Network Analysis, pp. 87–105. Springer, 2016.
  • Jin et al. (2021) Jin, W., Zhao, L., Zhang, S., Liu, Y., Tang, J., and Shah, N. Graph condensation for graph neural networks. arXiv preprint arXiv:2110.07580, 2021.
  • Jin et al. (2020) Jin, Y., Loukas, A., and JáJá, J. Graph coarsening with preserved spectral properties. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pp. 4452–4462. PMLR, 2020.
  • Karypis & Kumar (1999) Karypis, G. and Kumar, V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1999.
  • Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
  • Kokiopoulou et al. (2011) Kokiopoulou, E., Chen, J., and Saad, Y. Trace optimization and eigenproblems in dimension reduction methods. Numerical Linear Algebra with Applications, 18(3):565–602, 2011.
  • Kriege & Mutzel (2012) Kriege, N. M. and Mutzel, P. Subgraph matching kernels for attributed graphs. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. icml.cc / Omnipress, 2012.
  • Liu et al. (2018) Liu, Y., Safavi, T., Dighe, A., and Koutra, D. Graph summarization methods and applications: A survey. ACM Comput. Surv., 51(3), jun 2018. ISSN 0360-0300. doi: 10.1145/3186727. URL https://doi.org/10.1145/3186727.
  • Loukas (2019) Loukas, A. Graph reduction with spectral and cut guarantees. Journal of Machine Learning Research, 20(116):1–42, 2019.
  • Loukas & Vandergheynst (2018) Loukas, A. and Vandergheynst, P. Spectrally approximating large graphs with smaller graphs. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 3243–3252. PMLR, 2018.
  • Mémoli (2011) Mémoli, F. Gromov–wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11(4):417–487, 2011.
  • Minka (2000) Minka, T. P. Old and new matrix algebra useful for statistics, 2000. https://tminka.github.io/papers/matrix/.
  • Morris et al. (2020) Morris, C., Kriege, N. M., Bause, F., Kersting, K., Mutzel, P., and Neumann, M. Tudataset: A collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), 2020.
  • Neumann et al. (2016) Neumann, M., Garnett, R., Bauckhage, C., and Kersting, K. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.
  • Oettershagen et al. (2020) Oettershagen, L., Kriege, N. M., Morris, C., and Mutzel, P. Temporal graph kernels for classifying dissemination processes. In Proceedings of the 2020 SIAM International Conference on Data Mining, SDM 2020, Cincinnati, Ohio, USA, May 7-9, 2020, pp. 496–504. SIAM, 2020. doi: 10.1137/1.9781611976236.56.
  • Peyré et al. (2016) Peyré, G., Cuturi, M., and Solomon, J. Gromov-wasserstein averaging of kernel and distance matrices. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp.  2664–2672. JMLR.org, 2016.
  • Ron et al. (2011) Ron, D., Safro, I., and Brandt, A. Relaxation-based coarsening and multiscale graph organization. Multiscale Modeling & Simulation, 9(1):407–423, 2011.
  • Ruge & Stüben (1987) Ruge, J. W. and Stüben, K. Algebraic multigrid. In Multigrid methods, pp.  73–130. SIAM, 1987.
  • Ruhe (1970) Ruhe, A. Perturbation bounds for means of eigenvalues and invariant subspaces. BIT Numerical Mathematics, 10(3):343–354, 1970.
  • Saad (2003) Saad, Y. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2nd edition, 2003.
  • Schomburg et al. (2004) Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt, C., Huhn, G., and Schomburg, D. Brenda, the enzyme database: updates and major new developments. Nucleic acids research, 32(suppl_1):D431–D433, 2004.
  • Scrucca & Raftery (2015) Scrucca, L. and Raftery, A. E. Improved initialisation of model-based clustering using gaussian hierarchical partitions. Advances in data analysis and classification, 9(4):447–460, 2015.
  • Shi & Malik (2000) Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
  • Sorkun et al. (2019) Sorkun, M. C., Khetan, A., and Er, S. Aqsoldb, a curated reference set of aqueous solubility and 2d descriptors for a diverse set of compounds. Scientific data, 6(1):1–8, 2019.
  • Titouan et al. (2019) Titouan, V., Courty, N., Tavenard, R., and Flamary, R. Optimal transport for structured data with application on graphs. In International Conference on Machine Learning, pp. 6275–6284. PMLR, 2019.
  • Tsitsulin et al. (2018) Tsitsulin, A., Mottin, D., Karras, P., Bronstein, A. M., and Müller, E. Netlsd: Hearing the shape of a graph. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pp.  2347–2356. ACM, 2018. doi: 10.1145/3219819.3219991.
  • Vincent-Cuaz et al. (2022) Vincent-Cuaz, C., Flamary, R., Corneli, M., Vayer, T., and Courty, N. Semi-relaxed gromov-wasserstein divergence and applications on graphs. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=RShaMexjc-x.
  • Xu et al. (2019a) Xu, H., Luo, D., and Carin, L. Scalable gromov-wasserstein learning for graph partitioning and matching. Advances in neural information processing systems, 32, 2019a.
  • Xu et al. (2019b) Xu, H., Luo, D., Zha, H., and Carin, L. Gromov-wasserstein learning for graph matching and node embedding. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 6932–6941. PMLR, 2019b.
  • Xu et al. (2020) Xu, H., Luo, D., Carin, L., and Zha, H. Learning graphons via structured gromov-wasserstein barycenters. In AAAI Conference on Artificial Intelligence, 2020.
  • Yanardag & Vishwanathan (2015) Yanardag, P. and Vishwanathan, S. V. N. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pp.  1365–1374. ACM, 2015. doi: 10.1145/2783258.2783417.
  • Ying et al. (2018) Ying, R., You, J., Morris, C., Ren, X., Hamilton, W. L., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In NeurIPS, 2018.
  • Zheng et al. (2022) Zheng, L., Xiao, Y., and Niu, L. A brief survey on computational gromov-wasserstein distance. Procedia Computer Science, 199:697–702, 2022.

Appendix A List of notations

Notation: Meaning
𝑳(𝑳(c))\bm{L}(\bm{L}^{(c)}): (coarsened) graph Laplacian matrix
𝓛(𝓛(c))\bm{\mathcal{L}}(\bm{\mathcal{L}}^{(c)}): (coarsened) normalized graph Laplacian matrix
G(c)G^{(c)}: coarsened graph
𝑫(𝑫(c))\bm{D}(\bm{D}^{(c)}): (coarsened) degree matrix
𝑨(𝑨(c))\bm{A}(\bm{A}^{(c)}): (coarsened) adjacency matrix
𝑪p\bm{C}_{p}: membership matrix, coarsening matrix with entries {0,1}\in\left\{0,1\right\}
𝑪w\bm{C}_{w}: orthogonal coarsening matrix, a variant of coarsening matrix
𝑪¯w\bar{\bm{C}}_{w}: weighted averaging matrix, a variant of coarsening matrix
𝚷w\bm{\Pi}_{w}: projection matrix induced by 𝑪w\bm{C}_{w}
𝑾\bm{W}: diagonal node mass matrix
𝑺(𝑺(c))\bm{S}(\bm{S}^{(c)}): (coarsened) similarity matrix

We elaborate here on the family of coarsening matrices. Specifically, we define the membership matrix 𝑪pn×N\bm{C}_{p}\in\mathbb{R}^{n\times N} as

𝑪p(k,i)={1vi𝒫k0otherwise.\displaystyle\bm{C}_{p}(k,i)=\begin{cases}1&v_{i}\in\mathcal{P}_{k}\\ 0&otherwise\end{cases}.

Let pi=|𝒫i|p_{i}=|\mathcal{P}_{i}| be the number of nodes in the ii-th partition 𝒫i\mathcal{P}_{i} and nn be the number of partitions. Then, we define the node mass matrix 𝑾=diag(m1,,mN)\bm{W}=\text{diag}(m_{1},\cdots,m_{N}) and let ci=j𝒫imjc_{i}=\sum_{j\in\mathcal{P}_{i}}m_{j} be the mass of the ii-th supernode. We can thus define the weighted averaging matrix as

𝑪¯w=(1c11c21cn)𝑪p𝑾.\displaystyle\bar{\bm{C}}_{w}=\begin{pmatrix}\frac{1}{c_{1}}&&&\\ &\frac{1}{c_{2}}&&\\ &&\cdots&\\ &&&\frac{1}{c_{n}}\end{pmatrix}\bm{C}_{p}\bm{W}.

Similarly, the orthogonal coarsening matrix is defined as

𝑪w=(1c11c21cn)𝑪p𝑾12\displaystyle\bm{C}_{w}=\begin{pmatrix}\sqrt{\frac{1}{c_{1}}}&&&\\ &\sqrt{\frac{1}{c_{2}}}&&\\ &&\cdots&\\ &&&\sqrt{\frac{1}{c_{n}}}\end{pmatrix}\bm{C}_{p}\bm{W}^{\frac{1}{2}}

Let the projection matrix be 𝚷w=𝑪w𝑪w\bm{\Pi}_{w}=\bm{C}_{w}^{\intercal}\bm{C}_{w}. We can check that 𝚷w2=𝚷w\bm{\Pi}_{w}^{2}=\bm{\Pi}_{w}.
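
To make the constructions above concrete, the following is a minimal NumPy sketch (the function name build_coarsening_matrices and the toy partition are ours, for illustration only) that builds C_p, the weighted averaging matrix, the orthogonal coarsening matrix, and the induced projection matrix from a partition assignment and node masses, and numerically checks the stated properties.

```python
import numpy as np

def build_coarsening_matrices(assign, m):
    """assign[i] = index of the partition containing node v_i; m = node masses."""
    N, n = len(assign), assign.max() + 1
    W = np.diag(m)
    Cp = np.zeros((n, N))
    Cp[assign, np.arange(N)] = 1.0                     # membership matrix C_p
    c = Cp @ m                                         # supernode masses c_k = sum_{i in P_k} m_i
    Cw_bar = np.diag(1.0 / c) @ Cp @ W                 # weighted averaging matrix
    Cw = np.diag(1.0 / np.sqrt(c)) @ Cp @ np.sqrt(W)   # orthogonal coarsening matrix
    Pi_w = Cw.T @ Cw                                   # induced projection matrix
    return Cp, Cw_bar, Cw, Pi_w

# Toy check: 6 nodes, the first three merged into one supernode (as in Figure 1).
assign = np.array([0, 0, 0, 1, 2, 3])
m = np.ones(6) / 6
Cp, Cw_bar, Cw, Pi_w = build_coarsening_matrices(assign, m)
assert np.allclose(Pi_w @ Pi_w, Pi_w)        # Pi_w is a projection
assert np.allclose(Cw @ Cw.T, np.eye(4))     # C_w is semi-orthogonal
```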

Appendix B Derivations Omitted in the Main Text

B.1 Weighted graph coarsening leads to doubly-weighted Laplacian

We show in the following that 𝑪w𝓛𝑪w\bm{C}_{w}\bm{\mathcal{L}}\bm{C}_{w}^{\intercal} can reproduce the normalized graph Laplacian 𝓛(c)\bm{\mathcal{L}}^{(c)}. In this case, 𝑪w=(𝑪p𝑫𝑪p)12𝑪p𝑫12\bm{C}_{w}=\left(\bm{C}_{p}\bm{D}\bm{C}_{p}^{\intercal}\right)^{-\frac{1}{2}}\bm{C}_{p}\bm{D}^{\frac{1}{2}} and interestingly 𝑪p𝑫𝑪p=diag(𝑪p𝑨𝑪p𝟏)=diag(𝑨(c)𝟏)=𝑫(c)\bm{C}_{p}\bm{D}\bm{C}_{p}^{\intercal}=\mathrm{diag}\left(\bm{C}_{p}\bm{A}\bm{C}_{p}^{\intercal}\bm{1}\right)=\mathrm{diag}\left(\bm{A}^{(c)}\bm{1}\right)=\bm{D}^{(c)}, indicating

𝑪w𝓛𝑪w=𝑰n𝑪w𝑫12𝑨𝑫12𝑪w\displaystyle\bm{C}_{w}\bm{\mathcal{L}}\bm{C}_{w}^{\intercal}=\bm{I}_{n}-\bm{C}_{w}\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}}\bm{C}_{w}^{\intercal}
=\displaystyle= 𝑰n(𝑫(c))12𝑪p𝑫12𝑫12𝑨𝑫12𝑫12𝑪p(𝑫(c))12\displaystyle\bm{I}_{n}-\left(\bm{D}^{(c)}\right)^{-\frac{1}{2}}\bm{C}_{p}\bm{D}^{\frac{1}{2}}\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}}\bm{D}^{\frac{1}{2}}\bm{C}_{p}^{\intercal}\left(\bm{D}^{(c)}\right)^{-\frac{1}{2}}
=\displaystyle= 𝑰n(𝑫(c))12𝑨(c)(𝑫(c))12=𝓛(c).\displaystyle\bm{I}_{n}-\left(\bm{D}^{(c)}\right)^{-\frac{1}{2}}\bm{A}^{(c)}\left(\bm{D}^{(c)}\right)^{-\frac{1}{2}}=\bm{\mathcal{L}}^{(c)}.

The results above imply that we can unify previous coarsening results under the weighted graph coarsening framework of this paper, with a proper choice of the similarity matrix 𝑺\bm{S} and the node measure μ\mu.
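
The identity above can also be checked numerically; below is a small sketch (the 4-node graph is ours, for illustration) verifying that the coarsened normalized Laplacian built from the coarsened adjacency and degree matrices is reproduced exactly.

```python
import numpy as np

# A 4-node graph without isolated nodes; merge v1 and v2 into one supernode.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 0.],
              [0., 1., 0., 0.]])
Cp = np.array([[1., 1., 0., 0.],
               [0., 0., 1., 0.],
               [0., 0., 0., 1.]])
D = np.diag(A.sum(axis=1))
L_norm = np.eye(4) - np.diag(D.diagonal() ** -0.5) @ A @ np.diag(D.diagonal() ** -0.5)
Cw = np.diag((Cp @ D @ Cp.T).diagonal() ** -0.5) @ Cp @ np.sqrt(D)

A_c = Cp @ A @ Cp.T                           # coarsened adjacency (internal edges become a self-loop)
D_c = np.diag(A_c.sum(axis=1))                # equals Cp D Cp^T
L_c = np.eye(3) - np.diag(D_c.diagonal() ** -0.5) @ A_c @ np.diag(D_c.diagonal() ** -0.5)
assert np.allclose(Cw @ L_norm @ Cw.T, L_c)   # reproduces the coarsened normalized Laplacian
```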

B.2 A toy example of coarsening a 3-node graph

Consider a fully connected 3-node graph with unit edge weights, unit node weights, and the partition {{v1},{v2,v3}}\left\{\left\{v_{1}\right\},\left\{v_{2},v_{3}\right\}\right\}. Its similarity matrix is 𝑺=𝑫+𝑨\bm{S}=\bm{D}+\bm{A}, and the three possible coarsened similarity matrices are, respectively,

𝑺=(211121112),𝑪¯w𝑺𝑪¯w=(2113/2),𝑪w𝑺𝑪w=(2223),𝑪p𝑺𝑪p=(2226).\displaystyle\bm{S}=\begin{pmatrix}2&1&1\\ 1&2&1\\ 1&1&2\end{pmatrix},\quad\bar{\bm{C}}_{w}\bm{S}\bar{\bm{C}}_{w}^{\intercal}=\begin{pmatrix}2&1\\ 1&3/2\end{pmatrix},\quad\bm{C}_{w}\bm{S}\bm{C}_{w}^{\intercal}=\begin{pmatrix}2&\sqrt{2}\\ \sqrt{2}&3\end{pmatrix},\quad\bm{C}_{p}\bm{S}\bm{C}_{p}^{\intercal}=\begin{pmatrix}2&2\\ 2&6\end{pmatrix}.

It is clear that 𝑪¯w𝑺𝑪¯w\bar{\bm{C}}_{w}\bm{S}\bar{\bm{C}}_{w}^{\intercal}, which has the most appropriate entry magnitude to yield the minimal GW distance, differs from the matrix 𝑪p𝑺𝑪p\bm{C}_{p}\bm{S}\bm{C}_{p}^{\intercal} proposed by Jin et al. (2020) and Cai et al. (2021). We note that in 𝑺(c)=𝑪¯w𝑺𝑪¯w\bm{S}^{(c)}=\bar{\bm{C}}_{w}\bm{S}\bar{\bm{C}}_{w}^{\intercal} the edge weight (similarity) remains 11, since we explicitly specify that the node weight of the supernode becomes 22. The new GW geometric framework thus decouples the node weights from the similarity intensities.
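
For completeness, the three coarsened similarity matrices of this toy example can be reproduced with a few lines of NumPy (a sketch assuming unit node masses, matching the remark above).

```python
import numpy as np

S = np.array([[2., 1., 1.],
              [1., 2., 1.],
              [1., 1., 2.]])          # S = D + A for the fully connected 3-node graph
Cp = np.array([[1., 0., 0.],
               [0., 1., 1.]])         # partition {{v1}, {v2, v3}}
W = np.eye(3)                         # unit node masses
c = Cp @ np.diag(W)                   # supernode masses: [1, 2]
Cw_bar = np.diag(1 / c) @ Cp @ W
Cw = np.diag(1 / np.sqrt(c)) @ Cp @ np.sqrt(W)

print(Cw_bar @ S @ Cw_bar.T)          # [[2, 1], [1, 1.5]]   ("Averaging")
print(Cw @ S @ Cw.T)                  # [[2, 1.414...], [1.414..., 3]]
print(Cp @ S @ Cp.T)                  # [[2, 2], [2, 6]]
```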

B.3 Deriving the “Averaging” magnitude from the constrained srGW barycenter problem

We first introduce the definition of srGW divergence. For a given graph GG with node mass distribution 𝒎ΔN1\bm{m}\in\Delta^{N-1} and similarity matrix 𝑺\bm{S}, we can construct another graph G(c)G^{(c)} with given similarity matrix 𝑺(c)\bm{S}^{(c)} and unspecified node mass distribution 𝒎(c)Δn1\bm{m}^{(c)}\in\Delta^{n-1}. To better reflect the dependence of GW22(G,G(c))\mathrm{GW}_{2}^{2}\left(G,G^{(c)}\right) on node mass distribution and similarity matrix, we abuse the notation GW2\mathrm{GW}_{2} as GW22(𝑺,𝒎,𝑺(c),𝒎(c))\mathrm{GW}_{2}^{2}\left(\bm{S},\bm{m},\bm{S}^{(c)},\bm{m}^{(c)}\right) and Π(𝒎,𝒎(c)):={𝑻+N×n:𝑻𝟏=𝒎,𝑻𝟏=𝒎(c)}\Pi\left(\bm{m},\bm{m}^{(c)}\right)\vcentcolon=\left\{\bm{T}\in\mathbb{R}_{+}^{N\times n}:\bm{T1}=\bm{m},\bm{T}^{\intercal}\bm{1}=\bm{m}^{(c)}\right\} in this subsection. Vincent-Cuaz et al. (2022) then defined

srGW22(𝑺,𝒎,𝑺(c)):=min𝒎(c)GW22(𝑺,𝒎,𝑺(c),𝒎(c)),\displaystyle\mathrm{srGW}_{2}^{2}\left(\bm{S},\bm{m},\bm{S}^{(c)}\right)\vcentcolon=\min_{\bm{m}^{(c)}}\mathrm{GW}_{2}^{2}\left(\bm{S},\bm{m},\bm{S}^{(c)},\bm{m}^{(c)}\right),

and we can further find an optimal coarsened similarity matrix (w.r.t. the srGW divergence) by solving the following srGW barycenter problem with a single input (𝑺,𝒎)(\bm{S},\bm{m}), i.e.,

min𝑺(c)srGW22(𝑺,𝒎,𝑺(c)).\displaystyle\min_{\bm{S}^{(c)}}\mathrm{srGW}_{2}^{2}\left(\bm{S},\bm{m},\bm{S}^{(c)}\right). (8)

We can then transform Equation 8 as follows to reveal its connection with our proposed “Averaging” magnitude 𝑪¯w𝑺𝑪¯w\bar{\bm{C}}_{w}\bm{S}\bar{\bm{C}}_{w}^{\intercal} for the coarsened similarity matrix 𝑺(c)\bm{S}^{(c)}:

min𝑺(c)srGW22(𝑺,𝒎,𝑺(c))\displaystyle\min_{\bm{S}^{(c)}}\mathrm{srGW}_{2}^{2}\left(\bm{S},\bm{m},\bm{S}^{(c)}\right) =min𝑺(c)min𝒎(c)GW22(𝑺,𝒎,𝑺(c),𝒎(c))\displaystyle=\min_{\bm{S}^{(c)}}\min_{\bm{m}^{(c)}}\mathrm{GW}_{2}^{2}\left(\bm{S},\bm{m},\bm{S}^{(c)},\bm{m}^{(c)}\right)
=min𝑺(c)min𝒎(c)min𝑻Π(𝒎,𝒎(c))𝑴(𝑺,𝑺(c),𝑻),𝑻(𝒎,𝒎(c))\displaystyle=\min_{\bm{S}^{(c)}}\min_{\bm{m}^{(c)}}\min_{\bm{T}\in\Pi\left(\bm{m},\bm{m}^{(c)}\right)}\left\langle\bm{M}\left(\bm{S},\bm{S}^{(c)},\bm{T}\right),\bm{T}\left(\bm{m},\bm{m}^{(c)}\right)\right\rangle
=min𝒎(c),𝑻Π(𝒎,𝒎(c))min𝑺(c)𝑴(𝑺,𝑺(c),𝑻),𝑻(𝒎,𝒎(c))\displaystyle=\min_{\bm{m}^{(c)},\bm{T}\in\Pi\left(\bm{m},\bm{m}^{(c)}\right)}\min_{\bm{S}^{(c)}}\left\langle\bm{M}\left(\bm{S},\bm{S}^{(c)},\bm{T}\right),\bm{T}\left(\bm{m},\bm{m}^{(c)}\right)\right\rangle
=min𝑻+N×n:𝑻𝟏=𝒎min𝑺(c)𝑴(𝑺,𝑺(c),𝑻),𝑻.\displaystyle=\min_{\bm{T}\in\mathbb{R}_{+}^{N\times n}:\bm{T1}=\bm{m}}\min_{\bm{S}^{(c)}}\left\langle\bm{M}\left(\bm{S},\bm{S}^{(c)},\bm{T}\right),\bm{T}\right\rangle.

In the display above, we write 𝑴(𝑺,𝑺(c),𝑻)\bm{M}\left(\bm{S},\bm{S}^{(c)},\bm{T}\right) and 𝑻(𝒎,𝒎(c))\bm{T}\left(\bm{m},\bm{m}^{(c)}\right) to make explicit the dependence of the cross-graph dissimilarity matrix 𝑴\bm{M} and the transport matrix 𝑻\bm{T} on their arguments. The last equality holds since, for a given transport matrix 𝑻\bm{T}, the new node mass distribution 𝒎(c)\bm{m}^{(c)} is uniquely determined by 𝒎(c)=𝑻𝟏\bm{m}^{(c)}=\bm{T}^{\intercal}\bm{1}.

Notably, there is a closed-form solution for the inner minimization problem min𝑺(c)𝑴(𝑺,𝑺(c),𝑻),𝑻\min_{\bm{S}^{(c)}}\left\langle\bm{M}\left(\bm{S},\bm{S}^{(c)},\bm{T}\right),\bm{T}\right\rangle. Peyré et al. (2016, Equation (14)) derive that the optimal 𝑺(c)\bm{S}^{(c)} reads

diag(𝒎(c))1𝑻𝑺𝑻diag(𝒎(c))1.\displaystyle\mathrm{diag}\left(\bm{m}^{(c)}\right)^{-1}\bm{T}^{\intercal}\bm{S}\bm{T}\mathrm{diag}\left(\bm{m}^{(c)}\right)^{-1}.

If we then enforce the restriction that the node mass transport must be performed in a clustering manner (i.e., the transport matrix 𝑻=𝑾𝑪pN×n\bm{T}=\bm{W}\bm{C}_{p}^{\intercal}\in\mathbb{R}^{N\times n} for a certain membership matrix 𝑪p\bm{C}_{p}), we exactly have 𝑺(c)=𝑪¯w𝑺𝑪¯w\bm{S}^{(c)}=\bar{\bm{C}}_{w}\bm{S}\bar{\bm{C}}_{w}^{\intercal}. The derivation above verifies the effectiveness of the “Averaging” magnitude we propose.
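
A short sketch of this closed-form solution is given below (the function name optimal_coarse_similarity is ours, for illustration only); it also checks that a clustering transport T = W C_p^T indeed recovers the "Averaging" magnitude.

```python
import numpy as np

def optimal_coarse_similarity(S, T):
    """Closed-form minimizer over S^(c) of the inner problem (Peyre et al., 2016, Eq. (14))."""
    m_c = T.sum(axis=0)                              # coarse node masses m^(c) = T^T 1
    D_inv = np.diag(1.0 / m_c)
    return D_inv @ T.T @ S @ T @ D_inv

S = np.array([[2., 1., 1.], [1., 2., 1.], [1., 1., 2.]])
Cp = np.array([[1., 0., 0.], [0., 1., 1.]])
m = np.ones(3) / 3                                   # uniform node mass distribution
T = np.diag(m) @ Cp.T                                # clustering transport T = W C_p^T
Cw_bar = np.diag(1.0 / (Cp @ m)) @ Cp @ np.diag(m)   # weighted averaging matrix
assert np.allclose(optimal_coarse_similarity(S, T), Cw_bar @ S @ Cw_bar.T)
```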

B.4 Comparing spectral difference Δ\Delta to spectral distance in Jin et al. (2020)

Jin et al. (2020) proposed to specify the graph coarsening plan by minimizing the following full spectral distance term (c.f. their Equation (8)):

SDfull(G,G(c))\displaystyle SD_{\text{full}}\left(G,G^{(c)}\right) =i=1k1|𝝀(i)𝝀c(i)|+i=k1+1k2|𝝀(i)1|+i=k2+1N|𝝀(i)𝝀c(iN+n)|\displaystyle=\sum_{i=1}^{k_{1}}\left|\bm{\lambda}(i)-\bm{\lambda}_{c}(i)\right|+\sum_{i=k_{1}+1}^{k_{2}}|\bm{\lambda}(i)-1|+\sum_{i=k_{2}+1}^{N}\left|\bm{\lambda}(i)-\bm{\lambda}_{c}(i-N+n)\right|
=i=1k1(𝝀c(i)𝝀(i))+i=k2+1N(𝝀(i)𝝀c(iN+n))+i=k1+1k2|𝝀(i)1|,\displaystyle=\sum_{i=1}^{k_{1}}\left(\bm{\lambda}_{c}(i)-\bm{\lambda}(i)\right)+\sum_{i=k_{2}+1}^{N}\left(\bm{\lambda}(i)-\bm{\lambda}_{c}(i-N+n)\right)+\sum_{i=k_{1}+1}^{k_{2}}|\bm{\lambda}(i)-1|,

where 𝝀(i)\bm{\lambda}(i)’s and 𝝀c(i)\bm{\lambda}_{c}(i)’s correspond to the eigenvalues (in an ascending order) of the normalized Laplacian 𝓛\bm{\mathcal{L}} and 𝓛(c)\bm{\mathcal{L}}^{(c)}, and k1k_{1} and k2k_{2} are defined as k1=argmaxi{i:𝝀c(i)<1},k2=Nn+k1k_{1}=\operatorname*{arg\,max}_{i}\left\{i:\bm{\lambda}_{c}(i)<1\right\},k_{2}=N-n+k_{1}. The last equation holds due to their Interlacing Property 4.1 (similar to Theorem D.1).

We note that (i) the two “spectral” loss functions, Δ\Delta and SDfullSD_{\text{full}}, refer to different spectra: we leverage the spectrum of CwUCwTC_{w}UC_{w}^{T}, while they focus on the graph Laplacians; our framework is more general and takes node weights into consideration. (ii) They actually divide the 𝝀c(i)\bm{\lambda}_{c}(i)’s into two sets and compare them to 𝝀(i)\bm{\lambda}(i) and 𝝀(i+Nn)\bm{\lambda}(i+N-n), respectively; the signs of λ(i)λc(i)\lambda(i)-\lambda_{c}(i) and λ(i+Nn)λc(i)\lambda(i+N-n)-\lambda_{c}(i) are thus different.
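
For reference, a minimal sketch of the full spectral distance of Jin et al. (2020) is given below; it assumes the eigenvalues are supplied as ascending NumPy arrays and follows their Equation (8) as restated above.

```python
import numpy as np

def sd_full(lam, lam_c):
    """Full spectral distance SD_full between the ascending eigenvalues lam (length N)
    of the normalized Laplacian and lam_c (length n) of its coarsened counterpart."""
    N, n = len(lam), len(lam_c)
    k1 = int(np.searchsorted(lam_c, 1.0))   # number of coarse eigenvalues strictly below 1
    k2 = N - n + k1
    head = np.abs(lam[:k1] - lam_c[:k1]).sum()
    middle = np.abs(lam[k1:k2] - 1.0).sum()
    tail = np.abs(lam[k2:] - lam_c[k1:]).sum()
    return head + middle + tail
```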

B.5 Time complexity of Algorithm 1

We first recall the complexity of Algorithm 1 stated in Section 5. Let TT be an upper bound on the number of KK-means iterations; the time complexity of Algorithm 1 is then 𝒪(T(M+Nn))\mathcal{O}\left(T\left(M+Nn\right)\right), where M=nnz(𝑺)M=\mathrm{nnz}(\bm{S}) is the number of non-zero elements in 𝑺\bm{S}.

The cost mainly comes from the computation of the “distance” (7): it takes 𝒪(M)\mathcal{O}(M) time to obtain the second term in Equation 7 and to pre-compute the third term (which is independent of ii); obtaining the exact distj(i)\mathrm{dist}_{j}(i) for all the NnNn (i,j)(i,j) pairs requires another 𝒪(Nn)\mathcal{O}(Nn) time. Compared to the previous spectral graph coarsening method (Jin et al., 2020), Algorithm 1 removes the partial sparse eigenvalue decomposition, which takes 𝒪(R(Mn+Nn2))\mathcal{O}\left(R(Mn+Nn^{2})\right) time using Lanczos iteration with RR restarts. The KK-means part of spectral graph coarsening takes 𝒪(TNn2)\mathcal{O}(TNn^{2}) time for a certain choice of the eigenvector feature dimension k1k_{1} (Jin et al., 2020, Section 5.2), while the weighted kernel KK-means clustering nested in our algorithm can better utilize the sparsity of the similarity matrix.
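
The paragraph above refers to the distance in Equation 7, which is not restated here; the sketch below therefore assumes it takes the standard weighted kernel K-means form of Dhillon et al. (2007) and only illustrates how the second and third terms are grouped so that the similarity matrix is touched once per iteration.

```python
import numpy as np

def weighted_kernel_kmeans_distances(S, w, assign, n):
    """Distances dist_j(i) of every node i to every cluster j, assuming the standard
    weighted kernel K-means form (Dhillon et al., 2007):
        dist_j(i) = S_ii - 2 * sum_{l in P_j} w_l S_il / s_j
                    + sum_{l, l' in P_j} w_l w_l' S_ll' / s_j^2,   s_j = sum_{l in P_j} w_l.
    S is a dense similarity matrix here; a sparse S follows the same pattern."""
    N = len(w)
    Cp = np.zeros((n, N))
    Cp[assign, np.arange(N)] = 1.0
    s = Cp @ w                                             # cluster weights s_j
    second = S @ (Cp * w).T                                # second-term numerators, one pass over S
    third = np.einsum('jN,Nj->j', Cp * w, second) / s**2   # third term, independent of i
    return np.diag(S)[:, None] - 2 * second / s + third    # N x n matrix of dist_j(i)
```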

Appendix C Details of Experiments

We first introduce the hardware and the codebases used in the experiments. The algorithms tested are all implemented in unoptimized Python code and run on one core of a server CPU (Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz) under Ubuntu 18.04. Specifically, our method KGC is developed based on the (unweighted) kernel KK-means clustering program provided by Ivashkin & Chebotarev (2016); for the other baseline methods, the variation methods VNGC and VEGC are implemented by Loukas (2019), and the spectrum-preserving methods MGC and SGC are implemented by Jin et al. (2020). The S-GWL method (Xu et al., 2019a; Chowdhury & Needham, 2021) is tested as well (its implementation can be found in the GitHub repository of this work), although this algorithm cannot guarantee that the number of output partitions matches the specified value. For the codebases, we implement the experiments mainly using the code of Jin et al. (2020) and Dwivedi et al. (2020), for graph classification and graph regression, respectively. Experimental settings omitted from Section 6 are introduced in the following subsections.

Table 5: Average squared GW distance GW22(G(c),G)GW_{2}^{2}(G^{(c)},G) on the PTC and IMDB datasets (lower is better).
Dataset | Method | c=0.3c=0.3 | c=0.5c=0.5 | c=0.7c=0.7 | c=0.9c=0.9
PTC | VNGC | 0.05558 | 0.04880 | 0.03781 | 0.03326
PTC | VEGC | 0.03064 | 0.02352 | 0.01614 | 0.00927
PTC | MGC | 0.05290 | 0.04360 | 0.02635 | 0.00598
PTC | SGC | 0.03886 | 0.03396 | 0.02309 | 0.00584
PTC | KGC | 0.03332 | 0.02369 | 0.01255 | 0.00282
PTC | KGC(A) | 0.03055 | 0.02346 | 0.01609 | 0.00392
IMDB | VNGC | 0.05139 | 0.05059 | 0.05043 | 0.05042
IMDB | VEGC | 0.02791 | 0.02106 | 0.01170 | 0.00339
IMDB | MGC | 0.02748 | 0.02116 | 0.01175 | 0.00339
IMDB | SGC | 0.02907 | 0.02200 | 0.01212 | 0.00352
IMDB | KGC | 0.02873 | 0.02111 | 0.01137 | 0.00320
IMDB | KGC(A) | 0.02748 | 0.02106 | 0.01170 | 0.00337

C.1 Details of GW distance approximation in Section 6.1

Table 6: Average empirical bound gaps (in Theorem 4.2) on the PTC and IMDB datasets.
Dataset | Method | c=0.3c=0.3 | c=0.5c=0.5 | c=0.7c=0.7 | c=0.9c=0.9
PTC | VNGC | 0.06701 | 0.06671 | 0.05393 | 0.04669
PTC | VEGC | 0.06246 | 0.06129 | 0.04424 | 0.02577
PTC | MGC | 0.03203 | 0.03200 | 0.02167 | 0.00540
PTC | SGC | 0.04599 | 0.04156 | 0.02488 | 0.00554
PTC | KGC | 0.05145 | 0.05173 | 0.03530 | 0.00852
PTC | KGC(A) | 0.06519 | 0.06402 | 0.04702 | 0.00372
IMDB | VNGC | 0.009281 | 0.00927 | 0.009278 | 0.009268
IMDB | VEGC | 0.016879 | 0.016735 | 0.01636 | 0.008221
IMDB | MGC | 0.017307 | 0.01663 | 0.016309 | 0.008179
IMDB | SGC | 0.015719 | 0.015793 | 0.015934 | 0.008049
IMDB | KGC | 0.016054 | 0.016679 | 0.016687 | 0.008347
IMDB | KGC(A) | 0.017307 | 0.016735 | 0.01636 | 0.008177

We compute the pair-wise GW2\operatorname{\mathrm{GW}}_{2} distance matrix 𝑮\bm{G} for graphs in PTC and IMDB, with the normalized signless Laplacians set as the similarity matrices. For the computation of the GW2\operatorname{\mathrm{GW}}_{2} distance, we mainly turn to the popular OT package POT (Flamary et al., 2021, Python Optimal Transport).

The omitted tables of average squared GW distance GW22(G(c),G)GW_{2}^{2}(G^{(c)},G) and average empirical bound gaps are presented in Tables 5 and 6.

Regarding time efficiency, we additionally remark that our method KGC operates on the dense graph matrix, even though it enjoys a better theoretical complexity when a sparse matrix is used; for the small graphs in MUTAG, representing them directly as dense matrices can even be faster on a modern CPU.
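
For reproducibility, a minimal sketch of the per-pair computation is shown below; it uses POT's ot.gromov.gromov_wasserstein2 with the square loss (the exact call signature may vary slightly across POT versions), uniform node masses, and the normalized signless Laplacian as the similarity matrix. The helper names are ours, for illustration only.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)

def normalized_signless_laplacian(A):
    # Assumes no isolated nodes (all degrees positive).
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(len(A)) + d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def squared_gw_distance(A1, A2):
    """GW_2^2 between two graphs given their adjacency matrices, with uniform node masses."""
    S1, S2 = normalized_signless_laplacian(A1), normalized_signless_laplacian(A2)
    p = np.ones(len(A1)) / len(A1)
    q = np.ones(len(A2)) / len(A2)
    return ot.gromov.gromov_wasserstein2(S1, S2, p, q, loss_fun='square_loss')
```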

C.2 Details of Laplacian spectrum preservation in Section 6.2

We mainly specify the evaluation metric for spectrum preservation in this subsection. Following the eigenvalue notation in Theorem 4.2, we define the top-55 eigenvalue relative error as 15i=15λiλi(c)λi\frac{1}{5}\sum_{i=1}^{5}\frac{\lambda_{i}-\lambda_{i}^{(c)}}{\lambda_{i}}. Here, for all coarsening methods, λiλi(c)\lambda_{i}-\lambda_{i}^{(c)} is always non-negative thanks to the Poincaré separation theorem (Theorem D.1).
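
A sketch of this metric is given below, assuming the original matrix and its coarsened counterpart from Theorem 4.2 are available as dense symmetric NumPy arrays; eigenvalues are taken in descending order, as in the theorem.

```python
import numpy as np

def top_k_eigenvalue_relative_error(U, U_c, k=5):
    """(1/k) * sum_{i<=k} (lam_i - lam_i^(c)) / lam_i with eigenvalues in descending order."""
    lam = np.sort(np.linalg.eigvalsh(U))[::-1][:k]
    lam_c = np.sort(np.linalg.eigvalsh(U_c))[::-1][:k]
    return float(np.mean((lam - lam_c) / lam))
```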

C.3 Details of Graph classification with Laplacian spectrum in Section 6.3

We remark that the graph classification experiments are mainly adapted from the similar experiments in Jin et al. (2020, Section 6.1). For the MUTAG and PTC datasets, we apply Algorithm 1 to 𝑫+𝑨\bm{D}+\bm{A}; for the other four datasets, we utilize the normalized signless Laplacian 𝑰N+𝑫12𝑨𝑫12\bm{I}_{N}+\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}}.

C.4 Details of Graph regression with GCNs in Section 6.4

We mainly follow the GCN settings in Dwivedi et al. (2020). More detailed settings are stated as follows.

Graph regression. The procedure of graph regression is as follows: Taking a graph GG as input, a GCN will return a graph embedding 𝒚Gd\bm{y}_{G}\in\mathbb{R}^{d}; Dwivedi et al. (2020) then pass 𝒚G\bm{y}_{G} to an MLP and compute the prediction score ypredy_{\text{pred}}\in\mathbb{R} as

ypred=𝑷ReLU(𝑸𝒚𝒢),y_{\text{pred}}=\bm{P}\cdot\text{ReLU}(\bm{Q}\bm{y}_{\mathcal{G}}),

where 𝑷1×d,𝑸d×d\bm{P}\in\mathbb{R}^{1\times d},\bm{Q}\in\mathbb{R}^{d\times d}. They then use |ypredy||y_{\text{pred}}-y|, the L1L_{1} loss (the MAE metric in Table 4) between the predicted score ypredy_{\text{pred}} and the ground-truth score yy, both for training and for performance evaluation.

Data splitting. They apply a scaffold splitting (Hu et al., 2020) to the AQSOL dataset in the ratio 8:1:1, which yields 7,831, 996, and 996 samples for the train, validation, and test sets, respectively.

Training hyperparameters. For the learning rate strategy across all GCN models, we follow the existing setting: the initial learning rate is 1×1031\times 10^{-3}, the reduce factor is 0.50.5, and the stopping learning rate is 1×1051\times 10^{-5}. Also, all the GCN models tested in our experiments share the same architecture: the network has 4 layers and 108,442 tunable parameters.

As for the concrete application of graph coarsening to GCNs, we discuss the main idea in Section C.4.1 and leave the technical details to Section C.4.2.

C.4.1 Application of graph coarsening to GCNs

Motivated by the GW setting, we discuss the application of graph coarsening to a prevailing and fundamental graph model, the GCN (we omit an introduction to GCNs here and refer readers to Kipf & Welling (2017) for more details). After removing the bias term for simplicity and adapting the notation of this paper, we take a vanilla one-layer GCN as an example and formulate it as

𝒚=𝟏NNσ(𝑫12𝑨𝑫12𝑯𝑾GCN),\displaystyle\bm{y}^{\intercal}=\frac{\bm{1}_{N}^{\intercal}}{N}\sigma\left(\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}}\bm{H}\bm{W}_{\mathrm{GCN}}\right),

where 𝒚\bm{y} is the graph representation of a size-NN graph associated with the adjacency matrix 𝑨\bm{A}, σ\sigma is an arbitrary activation function, 𝑯\bm{H} is the embedding matrix for nodes in the graph, and 𝑾GCN\bm{W}_{\mathrm{GCN}} is the weight matrix for the linear transform in the layer. We take the mean operation 𝟏NN\frac{\bm{1}_{N}^{\intercal}}{N} in the GCN as an implication of even node weights (therefore no ww subscript for coarsening matrices), and intuitively incorporate graph coarsening into the computation by solely replacing 𝑫12𝑨𝑫12\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}} with 𝚷𝑫12𝑨𝑫12𝚷\bm{\Pi}\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}}\bm{\Pi} and denote 𝒚(c)\bm{y}^{(c)} as the corresponding representation. We have

(𝒚(c))=𝒄σ(𝑪¯𝑫12𝑨𝑫12𝑪p𝑪¯𝑯𝑾GCN),\displaystyle\left(\bm{y}^{(c)}\right)^{\intercal}=\bm{c}^{\intercal}\sigma\left({\bar{\bm{C}}\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}}\bm{C}_{p}^{\intercal}}{\bar{\bm{C}}\bm{H}\bm{W}_{\mathrm{GCN}}}\right), (9)

and the graph matrix 𝑫12𝑨𝑫12\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}} is supposed to be coarsened as 𝑪¯𝑫12𝑨𝑫12𝑪p\bar{\bm{C}}\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}}\bm{C}_{p}, which can be extended to multiple layers. Due to the space limit, we defer the derivation to Appendix C.4.2 and solely leave some remarks here.

The propagation above is guided by the GW setting used throughout this paper: even if the nodes in the original graph are equally weighted, the supernodes after coarsening can have different weights, which induces the new readout operation 𝒄\bm{c}^{\intercal} (𝒄\bm{c} is the mass vector of the supernodes). Furthermore, an obvious difference between Equation 9 and previous designs (Huang et al., 2021; Cai et al., 2021) for applying graph coarsening to GNNs is the asymmetry of the coarsened graph matrix 𝑪¯𝑫12𝑨𝑫12𝑪p\bar{\bm{C}}\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}}\bm{C}_{p}^{\intercal}. The design in Equation 9 is tested in the graph regression experiment in Section 6.4. More technical details are introduced in the subsequent subsection.

C.4.2 Derivation of the GCN propagation in the GW setting

We consider a general LL-layer GCN used by Dwivedi et al. (2020). We first recall its definition (normalization modules are omitted for simplicity):

𝒚\displaystyle\bm{y}^{\intercal} =𝟏NN𝑯(L),\displaystyle=\frac{\bm{1}_{N}^{\intercal}}{N}\bm{H}^{(L)},
𝑯(l)\displaystyle\bm{H}^{(l)} =σ(𝒁(l)),𝒁(l)=(𝑫12𝑨𝑫12)𝑯(l1)𝑾(l),l[L],\displaystyle=\sigma(\bm{Z}^{(l)}),\quad\bm{Z}^{(l)}=\left(\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}}\right)\bm{H}^{(l-1)}\bm{W}^{(l)},\quad\forall l\in[L],
𝑯(0)\displaystyle\bm{H}^{(0)} :=𝑿,the embedding matrix of the nodes,\displaystyle\vcentcolon=\bm{X},\quad\text{the embedding matrix of the nodes},

where σ\sigma is an activation function and from now on we will abuse 𝑷\bm{P} here to denote 𝑫12𝑨𝑫12\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}}; 𝑯(l)\bm{H}^{(l)} is the embedding matrix of the graph nodes in the ll-th layer, and 𝑾(l)\bm{W}^{(l)} is the weight matrix of the same layer.

To apply the coarsened graph, we impose the following rules:

𝑯(c,l)\displaystyle\bm{H}^{(c,l)} =σ(𝒁(c,l)),𝒁(c,l)=(𝚷𝑷𝚷)𝑯(c,l1)𝑾(l),l[L].\displaystyle=\sigma(\bm{Z}^{(c,l)}),\quad\bm{Z}^{(c,l)}=\left(\bm{\Pi}\bm{P}\bm{\Pi}\right)\bm{H}^{(c,l-1)}\bm{W}^{(l)},\quad\forall l\in[L].

To reduce the computation in node aggregation, we utilize the two properties that 𝚷=𝑪p𝑪¯\bm{\Pi}=\bm{C}_{p}^{\intercal}\bar{\bm{C}} and σ(𝑪p𝑩)=𝑪pσ(𝑩)\sigma\left(\bm{C}_{p}^{\intercal}\bm{B}\right)=\bm{C}_{p}^{\intercal}\sigma\left(\bm{B}\right) for any element-wise activation function and any matrix 𝑩\bm{B} of a proper shape (since 𝑪p\bm{C}_{p}^{\intercal} simply copies rows of 𝑩\bm{B}); for the top two layers, we then have

𝒚\displaystyle\bm{y}^{\intercal} =𝟏NNσ(𝑪p𝑪¯𝑷𝚷𝑯(c,L1)𝑾(L))=𝟏NN𝑪pσ(𝑪¯𝑷𝑪p𝑪¯𝑯(c,L1)𝑾(L)),\displaystyle=\frac{\bm{1}_{N}^{\intercal}}{N}\sigma(\bm{C}_{p}^{\intercal}\bar{\bm{C}}\bm{P}\bm{\Pi}\bm{H}^{(c,L-1)}\bm{W}^{(L)})=\frac{\bm{1}_{N}^{\intercal}}{N}\bm{C}_{p}^{\intercal}\sigma\left(\bar{\bm{C}}\bm{P}\bm{C}_{p}^{\intercal}\bar{\bm{C}}\bm{H}^{(c,L-1)}\bm{W}^{(L)}\right),
𝑯(c,L1)\displaystyle\bm{H}^{(c,L-1)} =σ(𝑪p𝑪¯𝑷𝑪p𝑪¯𝑯(c,L2)𝑾(L1))=𝑪pσ(𝑪¯𝑷𝑪p𝑪¯𝑯(c,L2)𝑾(L1)).\displaystyle=\sigma\left(\bm{C}_{p}^{\intercal}\bar{\bm{C}}\bm{P}\bm{C}_{p}^{\intercal}\bar{\bm{C}}\bm{H}^{(c,L-2)}\bm{W}^{(L-1)}\right)=\bm{C}_{p}^{\intercal}\sigma\left(\bar{\bm{C}}\bm{P}\bm{C}_{p}^{\intercal}\bar{\bm{C}}\bm{H}^{(c,L-2)}\bm{W}^{(L-1)}\right).

Note that 𝑪¯𝑪p=𝑰n\bar{\bm{C}}\bm{C}_{p}^{\intercal}=\bm{I}_{n} and hence 𝑪p𝑪¯𝑪p=𝑪p\bm{C}_{p}^{\intercal}\bar{\bm{C}}\bm{C}_{p}^{\intercal}=\bm{C}_{p}^{\intercal}; for the top two layers we finally obtain

𝒚=𝟏NN𝑪pσ(𝑪¯𝑷𝑪p𝑪¯𝑯(c,L1)𝑾(L))=𝒄σ(𝑪¯𝑷𝑪pσ(𝑪¯𝑷𝑪p𝑪¯𝑯(c,L2)𝑾(L1))𝑾(L)),\displaystyle\bm{y}^{\intercal}=\frac{\bm{1}_{N}^{\intercal}}{N}\bm{C}_{p}^{\intercal}\sigma\left(\bar{\bm{C}}\bm{P}\bm{C}_{p}^{\intercal}\bar{\bm{C}}\bm{H}^{(c,L-1)}\bm{W}^{(L)}\right)=\bm{c}^{\intercal}\sigma\left(\bar{\bm{C}}\bm{P}\bm{C}_{p}^{\intercal}\sigma\left(\bar{\bm{C}}\bm{P}\bm{C}_{p}^{\intercal}\bar{\bm{C}}\bm{H}^{(c,L-2)}\bm{W}^{(L-1)}\right)\bm{W}^{(L)}\right),

implying that we can replace 𝑷\bm{P} with 𝑪¯𝑷𝑪p\bar{\bm{C}}\bm{P}\bm{C}_{p}^{\intercal} in the propagation of the top two layers. The trick can indeed be repeated for each layer, and specifically, in the bottom layer we can have

𝑯(c,1)=𝑪pσ[(𝑪¯𝑷𝑪p)(𝑪¯𝑯(0))𝑾(1)],\displaystyle\bm{H}^{(c,1)}=\bm{C}_{p}^{\intercal}\sigma\left[\left(\bar{\bm{C}}\bm{P}\bm{C}_{p}^{\intercal}\right)\left(\bar{\bm{C}}\bm{H}^{(0)}\right)\bm{W}^{(1)}\right],

which corresponds well to the node weight concept in the GW setting: 𝑪¯𝑯(0)\bar{\bm{C}}\bm{H}^{(0)} represents each coarsened centroid by the average embedding of the nodes within its cluster. This completes the justification of the new GCN propagation in Equation 9.
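
Putting the pieces together, the whole coarsened propagation (with even node weights, as assumed above) can be summarized by the following NumPy sketch; the function name coarsened_gcn_embedding and the use of ReLU as the activation are ours, for illustration only.

```python
import numpy as np

def coarsened_gcn_embedding(A, H, Ws, assign):
    """Graph embedding of an L-layer GCN run on the coarsened graph, following Equation 9.
    A: adjacency (N x N, no isolated nodes); H: node features (N x d_0); Ws: layer weights;
    assign: assignment of the N nodes to n supernodes. Even node weights 1/N are assumed."""
    N, n = len(A), assign.max() + 1
    Cp = np.zeros((n, N))
    Cp[assign, np.arange(N)] = 1.0
    sizes = Cp.sum(axis=1)
    C_bar = Cp / sizes[:, None]                  # averaging matrix (even node weights)
    d = A.sum(axis=1)
    P = A / np.sqrt(np.outer(d, d))              # D^{-1/2} A D^{-1/2}
    P_c = C_bar @ P @ Cp.T                       # coarsened (asymmetric) graph matrix
    h = C_bar @ H                                # average node embeddings within each supernode
    for W in Ws:                                 # coarse-level propagation for every layer
        h = np.maximum(P_c @ h @ W, 0.0)         # ReLU activation
    c = sizes / N                                # supernode masses, used as readout weights
    return c @ h
```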

Appendix D Useful Facts

D.1 Poincaré separation theorem

For the convenience of the reader, we repeat the Poincaré separation theorem (Bellman, 1997) in this subsection.

Theorem D.1 (Poincaré separation theorem).

Let 𝐀\bm{A} be an N×NN\times N real symmetric matrix and 𝐂\bm{C} an n×Nn\times N semi-orthogonal matrix such that 𝐂𝐂=𝐈n\bm{C}\bm{C}^{\intercal}=\bm{I}_{n}. Denote by λi,i=1,2,,N\lambda_{i},i=1,2,\dots,N and λi(c),i=1,2,,n\lambda^{(c)}_{i},i=1,2,\dots,n the eigenvalues of 𝐀\bm{A} and 𝐂𝐀𝐂\bm{C}\bm{A}\bm{C}^{\intercal}, respectively (in descending order). We have

λiλi(c)λNn+i,i=1,2,,n.\displaystyle\lambda_{i}\geq\lambda^{(c)}_{i}\geq\lambda_{{N-n+i}},\quad i=1,2,\dots,n. (10)
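
A quick numerical illustration of this interlacing property, with a random symmetric matrix and a random semi-orthogonal 𝐂, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 8, 3
A = rng.standard_normal((N, N)); A = A + A.T                 # real symmetric matrix
C = np.linalg.qr(rng.standard_normal((N, n)))[0].T           # C C^T = I_n, shape n x N
lam = np.sort(np.linalg.eigvalsh(A))[::-1]                   # eigenvalues, descending
lam_c = np.sort(np.linalg.eigvalsh(C @ A @ C.T))[::-1]
assert all(lam[i] + 1e-9 >= lam_c[i] >= lam[N - n + i] - 1e-9 for i in range(n))
```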

D.2 Ruhe’s trace inequality

For convenience of the reader, Ruhe’s trace inequality (Ruhe, 1970) is stated as follows.

Lemma D.2 (Ruhe’s trace inequality).

If 𝐀,𝐁\bm{A},\bm{B} are both N×NN\times N PSD matrices with eigenvalues,

λ1λN0,ν1νN0,\displaystyle\lambda_{1}\geq\cdots\geq\lambda_{N}\geq 0,\quad\nu_{1}\geq\cdots\geq\nu_{N}\geq 0,

then

i=1NλiνNi+1tr(AB)i=1Nλiνi.\displaystyle\sum_{i=1}^{N}\lambda_{i}\nu_{N-i+1}\leq\mathrm{tr}(AB)\leq\sum_{i=1}^{N}\lambda_{i}\nu_{i}.

Appendix E Proof of Theorem 4.2 and Theorem 4.4

We first prove an intermediate result, Lemma E.1, for a coarsened graph G1(c)G_{1}^{(c)} and an uncoarsened graph G2G_{2}, to introduce the technical tools needed for Theorem 4.2 and Theorem 4.4. The proofs of Theorem 4.2 and Theorem 4.4 are then given in Appendix E.2 and Appendix E.3, respectively.

E.1 Lemma E.1 and Its Proof

We first give the complete statement of Lemma E.1. In Lemma E.1, we consider the case in which only graph G1G_{1} is coarsened, and the notation is slightly simplified: when the context is clear, the coarsening-related terms without graph-specific subscripts, e.g. 𝑪p,𝑪¯w\bm{C}_{p},\bar{\bm{C}}_{w}, are by default associated with G1G_{1} unless otherwise stated. We follow the same simplified notation in the statement and proof of Theorem 4.2, which focuses solely on GW2(G(c),G)\text{GW}_{2}(G^{(c)},G), so the indices 1,21,2 are not involved. For Theorem 4.4, we will explicitly use specific subscripts, such as 𝑪p,1,𝑪¯w,2\bm{C}_{p,1},\bar{\bm{C}}_{w,2}, for disambiguation.

Lemma E.1.

Let 𝐏=𝐖112𝐓𝐖212\bm{P}=\bm{W}_{1}^{-\frac{1}{2}}\bm{T}^{*}\bm{W}_{2}^{-\frac{1}{2}}. If both 𝐒1\bm{S}_{1} and 𝐒2\bm{S}_{2} are PSD, we have

|GW22(G1,G2)GW22(G1(c),G2)|\displaystyle\lvert\operatorname{\mathrm{GW}}_{2}^{2}(G_{1},G_{2})-\operatorname{\mathrm{GW}}_{2}^{2}(G_{1}^{(c)},G_{2})\rvert max{λN1n1+1i=1n1(λiλi(c))+C𝑼,n1,\displaystyle\leq\max\left\{\vphantom{\left(\sum_{i=1}^{n_{1}}\right)}\lambda_{N_{1}-n_{1}+1}\sum_{i=1}^{n_{1}}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)+C_{\bm{U},n_{1}},\right. (11)
2(νN1n1+1i=1n1(λiλi(c))+C𝑼,𝑽,n1)}.\displaystyle\qquad\qquad\left.2\left(\nu_{N_{1}-n_{1}+1}\sum_{i=1}^{n_{1}}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)+C_{\bm{U},\bm{V},n_{1}}\right)\right\}.

Here, λ1(c)λn(c)\lambda_{1}^{(c)}\geq\cdots\geq\lambda_{n}^{(c)} are eigenvalues of 𝚷w𝐔𝚷w\bm{\Pi}_{w}\bm{U}\bm{\Pi}_{w}, 𝐔=𝐖112𝐒1𝐖112\bm{U}=\bm{W}_{1}^{\frac{1}{2}}\bm{S}_{1}\bm{W}_{1}^{\frac{1}{2}} with eigenvalues λ1λ2λN1\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{N_{1}}, and 𝐕=𝐏𝐖212𝐒2𝐖212𝐏\bm{V}=\bm{P}\bm{W}_{2}^{\frac{1}{2}}\bm{S}_{2}\bm{W}_{2}^{\frac{1}{2}}\bm{P}^{\intercal} with eigenvalues ν1ν2νN1\nu_{1}\geq\nu_{2}\geq\cdots\geq\nu_{N_{1}}, and we let C𝐔,n1:=i=1n1λi(λiλN1n1+i)+i=n1+1N1λi2C_{\bm{U},n_{1}}\vcentcolon=\sum_{i=1}^{n_{1}}\lambda_{i}(\lambda_{i}-\lambda_{N_{1}-n_{1}+i})+\sum_{i=n_{1}+1}^{N_{1}}\lambda_{i}^{2} and C𝐔,𝐕,n1:=i=1n1λi(νiνN1i+1)+i=n1+1N1λiνiC_{\bm{U},\bm{V},n_{1}}\vcentcolon=\sum_{i=1}^{n_{1}}\lambda_{i}\left(\nu_{i}-\nu_{N_{1}-i+1}\right)+\sum_{i=n_{1}+1}^{N_{1}}\lambda_{i}\nu_{i} be two non-negative constants.

Remark E.2.

Taking G2=G1G_{2}=G_{1}, we have 𝑻=𝑾1\bm{T}^{\ast}=\bm{W}_{1}, which implies 𝑷=𝑰N\bm{P}=\bm{I}_{N} and 𝑽=𝑾112𝑺1𝑾112=𝑼\bm{V}=\bm{W}_{1}^{\frac{1}{2}}\bm{S}_{1}\bm{W}_{1}^{\frac{1}{2}}=\bm{U}. This directly leads to the bound in Theorem 4.2, though with an additional factor of 22. In Appendix E.2 we show how to obtain a slightly tighter bound that removes this unnecessary factor of 22.

To illustrate our idea more clearly, we start from the following decomposition of the GW2\operatorname{\mathrm{GW}}_{2} distance. The detailed proofs of all the lemmas in this section are provided in Appendix E.4 for interested readers.

Lemma E.3.

For any two graphs G1G_{1} and G2G_{2}, we have

GW22(G1,G2)=I1+I22I3,\displaystyle\operatorname{\mathrm{GW}}_{2}^{2}(G_{1},G_{2})=I_{1}+I_{2}-2I_{3},

where

I1=Tr((𝑾112𝑺𝟏𝑾112)(𝑾112𝑺𝟏𝑾112)),I2=Tr((𝑾212𝑺𝟐𝑾212)(𝑾212𝑺𝟐𝑾212))\displaystyle I_{1}=\operatorname{Tr}\left(\left(\bm{W}_{1}^{\frac{1}{2}}\bm{S_{1}}\bm{W}_{1}^{\frac{1}{2}}\right)\left(\bm{W}_{1}^{\frac{1}{2}}\bm{S_{1}}\bm{W}_{1}^{\frac{1}{2}}\right)\right),\qquad I_{2}=\operatorname{Tr}\left(\left(\bm{W}_{2}^{\frac{1}{2}}\bm{S_{2}}\bm{W}_{2}^{\frac{1}{2}}\right)\left(\bm{W}_{2}^{\frac{1}{2}}\bm{S_{2}}\bm{W}_{2}^{\frac{1}{2}}\right)\right)

do not depend on the choice of transport map, while

I3=Tr(𝑺1𝑻𝑺2(𝑻))\displaystyle I_{3}=\operatorname{Tr}\left(\bm{S}_{1}\bm{T}^{*}\bm{S}_{2}\left(\bm{T}^{*}\right)^{\intercal}\right)

requires the optimal transport map 𝐓\bm{T}^{\ast} from the graph G1G_{1} to the graph G2G_{2}.

Replacing the G1G_{1}-related terms in Equation 15 with their G1(c)G_{1}^{(c)} counterparts, we know that the squared distance between G1(c)G_{1}^{(c)} and G2G_{2} is GW22(G1(c),G2)=I1+I22I3\text{GW}_{2}^{2}(G_{1}^{(c)},G_{2})=I_{1}^{\prime}+I_{2}-2I_{3}^{\prime}, where:

I1=Tr([(𝑾1(c))12𝑺𝟏(𝒄)(𝑾1(c))12]2),I3=Tr(𝑺1(c)𝑻co𝑺2(𝑻co)).\displaystyle I_{1}^{\prime}=\operatorname{Tr}\left(\left[\left(\bm{W}_{1}^{(c)}\right)^{\frac{1}{2}}\bm{S_{1}^{(c)}}\left(\bm{W}_{1}^{(c)}\right)^{\frac{1}{2}}\right]^{2}\right),\qquad I_{3}^{\prime}=\operatorname{Tr}\left(\bm{S}_{1}^{(c)}\bm{T}_{co}^{*}\bm{S}_{2}\left(\bm{T}_{co}^{*}\right)^{\intercal}\right). (12)

Here, 𝑻coΠ(μ1(c),μ2)\bm{T}_{co}^{\ast}\in\Pi(\mu_{1}^{(c)},\mu_{2}) is the optimal transport matrix induced by the GW distance (the subscript co indicates transport from the “c”oarsened graph to the “o”riginal graph), and 𝑺1(c):=𝑪¯w𝑺1𝑪¯w\bm{S}_{1}^{(c)}\vcentcolon=\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}.

The key step to preserving the GW distance is to control the differences |I1I1||I_{1}-I_{1}^{\prime}| and |I3I3||I_{3}-I_{3}^{\prime}|, since the terms I2=I2I_{2}=I^{\prime}_{2} cancel each other. We start with the bound on |I1I1||I_{1}-I_{1}^{\prime}|.

Lemma E.4.

Let 𝐔:=𝐖112𝐒1𝐖112\bm{U}\vcentcolon=\bm{W}_{1}^{\frac{1}{2}}\bm{S}_{1}\bm{W}_{1}^{\frac{1}{2}}. If the similarity matrix 𝐒1N×N\bm{S}_{1}\in\mathbb{R}^{N\times N} is PSD, we have

0I1I1λNn+1i=1n(λiλi(c))+C𝑼,n.\displaystyle 0\leq I_{1}-I_{1}^{\prime}\leq\lambda_{N-n+1}\sum_{i=1}^{n}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)+C_{\bm{U},n}.

Here, λ1λ2λN\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{N} are the eigenvalues of 𝐔\bm{U}, and C𝐔,n:=i=1nλi(λiλNn+i)+i=n+1Nλi2C_{\bm{U},n}\vcentcolon=\sum_{i=1}^{n}\lambda_{i}(\lambda_{i}-\lambda_{N-n+i})+\sum_{i=n+1}^{N}\lambda_{i}^{2} is non-negative.

We can similarly bound I3I3I_{3}-I^{\prime}_{3} with an additional tool, Ruhe’s trace inequality (a variant of von Neumann’s trace inequality specific to PSD matrices; cf. Appendix D.2).

Lemma E.5.

Let 𝐏=𝐖112𝐓𝐖212\bm{P}=\bm{W}_{1}^{-\frac{1}{2}}\bm{T}^{*}\bm{W}_{2}^{-\frac{1}{2}} be the normalized optimal transport matrix, and 𝐕:=𝐏𝐖212𝐒2𝐖212𝐏\bm{V}\vcentcolon=\bm{P}\bm{W}_{2}^{\frac{1}{2}}\bm{S}_{2}\bm{W}_{2}^{\frac{1}{2}}\bm{P}^{\intercal} with eigenvalues ν1ν2νN\nu_{1}\geq\nu_{2}\geq\cdots\geq\nu_{N}. If both 𝐒1\bm{S}_{1} and 𝐒2\bm{S}_{2} are PSD, we have

0I3I3νNn+1i=1n(λiλi(c))+C𝑼,𝑽,n.\displaystyle 0\leq I_{3}-I_{3}^{\prime}\leq\nu_{N-n+1}\sum_{i=1}^{n}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)+C_{\bm{U},\bm{V},n}.

Here, C𝐔,𝐕,n:=i=1nλi(νiνNi+1)+i=n+1NλiνiC_{\bm{U},\bm{V},n}\vcentcolon=\sum_{i=1}^{n}\lambda_{i}\left(\nu_{i}-\nu_{N-i+1}\right)+\sum_{i=n+1}^{N}\lambda_{i}\nu_{i} is non-negative.

Now we have

|GW22(G1,G2)GW22(G1(c),G2)|=|I1I1+2(I3I3)|max{I1I1,2(I3I3)}\displaystyle\lvert\operatorname{\mathrm{GW}}_{2}^{2}(G_{1},G_{2})-\operatorname{\mathrm{GW}}_{2}^{2}(G_{1}^{(c)},G_{2})\rvert=\lvert I_{1}-I_{1}^{\prime}+2(I_{3}^{\prime}-I_{3})\rvert\leq\max\{I_{1}-I_{1}^{\prime},2(I_{3}-I_{3}^{\prime})\}

considering that I1I10I_{1}-I^{\prime}_{1}\geq 0 and I3I30I^{\prime}_{3}-I_{3}\leq 0. Then, combining all the pieces above yields the bound in Lemma E.1.

E.2 Proof of Theorem 4.2

With the above dissection of the terms I1,I3I^{\prime}_{1},I^{\prime}_{3}, we can now give a finer control of the distance GW22(G1(c),G1)\text{GW}_{2}^{2}(G_{1}^{(c)},G_{1}). We first expand GW22(G1(c),G1)\text{GW}_{2}^{2}(G_{1}^{(c)},G_{1}) as

GW22(G1(c),G1)=I1+I12Tr(𝑺1(c)𝑻co𝑺1(𝑻co)),\displaystyle\text{GW}_{2}^{2}(G_{1}^{(c)},G_{1})=I_{1}+I^{\prime}_{1}-2\operatorname{Tr}\left(\bm{S}_{1}^{(c)}\bm{T}_{co}^{*}\bm{S}_{1}\left(\bm{T}_{co}^{*}\right)^{\intercal}\right),

where the notation 𝑻co\bm{T}_{co}^{*} is now abused as the optimal transport matrix induced by GW2(G1(c),G1)\mathrm{GW}_{2}(G_{1}^{(c)},G_{1}). Applying the optimality inequality in Lemma E.7, we have

GW22(G1(c),G1)I1+I12Tr(𝑺1(c)𝑪p𝑾1𝑺1𝑾1𝑪p),\displaystyle\text{GW}_{2}^{2}(G_{1}^{(c)},G_{1})\leq I^{\prime}_{1}+I_{1}-2\operatorname{Tr}\left(\bm{S}_{1}^{(c)}\bm{C}_{p}\bm{W}_{1}\bm{S}_{1}\bm{W}_{1}\bm{C}_{p}^{\intercal}\right),

where we remark 𝑪p𝑾1\bm{C}_{p}\bm{W}_{1} is a qualified transport matrix for G1(c)G_{1}^{(c)} and G1G_{1}.

To further simplify the upper bound above, we show the equivalence between I1I^{\prime}_{1} and Tr(𝑺1(c)𝑪p𝑾1𝑺1𝑾1𝑪p)\operatorname{Tr}\left(\bm{S}_{1}^{(c)}\bm{C}_{p}\bm{W}_{1}\bm{S}_{1}\bm{W}_{1}\bm{C}_{p}^{\intercal}\right) as follows:

Tr(𝑺1(c)𝑪p𝑾1𝑺1𝑾1𝑪p)\displaystyle\operatorname{Tr}\left(\bm{S}_{1}^{(c)}\bm{C}_{p}\bm{W}_{1}\bm{S}_{1}\bm{W}_{1}\bm{C}_{p}^{\intercal}\right) =Tr(𝑪¯w𝑺1𝑪¯w𝑪p𝑾1𝑺1𝑾1𝑪p)=Tr(𝚷w𝑼𝚷w𝑼)=Tr(𝚷w𝚷w𝑼𝚷w𝚷w𝑼)\displaystyle=\operatorname{Tr}\left(\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}\bm{C}_{p}\bm{W}_{1}\bm{S}_{1}\bm{W}_{1}\bm{C}_{p}^{\intercal}\right)=\operatorname{Tr}\left(\bm{\Pi}_{w}\bm{U}\bm{\Pi}_{w}\bm{U}\right)=\operatorname{Tr}\left(\bm{\Pi}_{w}\bm{\Pi}_{w}\bm{U}\bm{\Pi}_{w}\bm{\Pi}_{w}\bm{U}\right)
=Tr(𝚷w𝑼𝚷w𝚷w𝑼𝚷w)=I1.\displaystyle=\operatorname{Tr}\left(\bm{\Pi}_{w}\bm{U}\bm{\Pi}_{w}\bm{\Pi}_{w}\bm{U}\bm{\Pi}_{w}\right)=I^{\prime}_{1}.

Combining the pieces above, we obtain

GW22(G1(c),G1)I1+I12I1λNn+1i=1n(λiλi(c))+C𝑼,n,\displaystyle\text{GW}_{2}^{2}(G_{1}^{(c)},G_{1})\leq I_{1}+I^{\prime}_{1}-2I^{\prime}_{1}\leq\lambda_{N-n+1}\sum_{i=1}^{n}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)+C_{\bm{U},n},

which uses Lemma E.4 for the last inequality. The proof of Theorem 4.2 is now complete.

Remark E.6.

Theorem 4.2 can be leveraged to directly control |GW2(G1(c),G2(c))GW2(G1,G2)|\left|\text{GW}_{2}(G_{1}^{(c)},G_{2}^{(c)})-\text{GW}_{2}(G_{1},G_{2})\right|. Note that GW2\text{GW}_{2} is a pseudo-metric and satisfies the triangle inequality (Chowdhury & Mémoli, 2019), which implies

GW2(G1(c),G2(c))GW2(G1,G2)\displaystyle\text{GW}_{2}(G_{1}^{(c)},G_{2}^{(c)})-\text{GW}_{2}(G_{1},G_{2}) GW2(G1(c),G1)+GW2(G1,G2)+GW2(G2,G2(c))GW2(G1,G2)\displaystyle\leq\text{GW}_{2}(G_{1}^{(c)},G_{1})+\text{GW}_{2}(G_{1},G_{2})+\text{GW}_{2}(G_{2},G_{2}^{(c)})-\text{GW}_{2}(G_{1},G_{2})
=GW2(G1(c),G1)+GW2(G2,G2(c))\displaystyle=\text{GW}_{2}(G_{1}^{(c)},G_{1})+\text{GW}_{2}(G_{2},G_{2}^{(c)})
2(GW22(G1(c),G1)+GW22(G2,G2(c)))12,\displaystyle\leq\sqrt{2}\left(\text{GW}_{2}^{2}(G_{1}^{(c)},G_{1})+\text{GW}_{2}^{2}(G_{2},G_{2}^{(c)})\right)^{\frac{1}{2}},

and similarly we have

GW2(G1,G2)GW2(G1(c),G2(c))GW2(G1(c),G1)+GW2(G2,G2(c))2(GW22(G1(c),G1)+GW22(G2,G2(c)))12.\displaystyle\text{GW}_{2}(G_{1},G_{2})-\text{GW}_{2}(G_{1}^{(c)},G_{2}^{(c)})\leq\text{GW}_{2}(G_{1}^{(c)},G_{1})+\text{GW}_{2}(G_{2},G_{2}^{(c)})\leq\sqrt{2}\left(\text{GW}_{2}^{2}(G_{1}^{(c)},G_{1})+\text{GW}_{2}^{2}(G_{2},G_{2}^{(c)})\right)^{\frac{1}{2}}.

This implies

|GW2(G1(c),G2(c))GW2(G1,G2)|2(GW22(G1(c),G1)+GW22(G2,G2(c)))12\displaystyle\quad\big{|}\text{GW}_{2}(G_{1}^{(c)},G_{2}^{(c)})-\text{GW}_{2}(G_{1},G_{2})\big{|}\leq\sqrt{2}\left(\text{GW}_{2}^{2}(G_{1}^{(c)},G_{1})+\text{GW}_{2}^{2}(G_{2},G_{2}^{(c)})\right)^{\frac{1}{2}}
2(λ1,N1n1+1i=1n1(λ1,iλ1,i(c))+C𝑼1,n1+λ2,N2n2+1i=1n2(λ2,iλ2,i(c))+C𝑼2,n2)12.\displaystyle\leq\sqrt{2}\bigg{(}\lambda_{1,N_{1}-n_{1}+1}\sum_{i=1}^{n_{1}}\left(\lambda_{1,i}-\lambda^{(c)}_{1,i}\right)+C_{\bm{U}_{1},n_{1}}+\lambda_{2,N_{2}-n_{2}+1}\sum_{i=1}^{n_{2}}\left(\lambda_{2,i}-\lambda^{(c)}_{2,i}\right)+C_{\bm{U}_{2},n_{2}}\bigg{)}^{\frac{1}{2}}.

Here, the last inequality is due to Theorem 4.2. We comment that the above result, obtained from a direct application of the triangle inequality, is indeed weaker than the result stated in Theorem 4.4; we devote the remaining part of this section to its proof.

E.3 Proof of Theorem 4.4

For one side of the result, we can follow the derivation in Lemma E.3 and apply Theorem 4.2 to have

GW22(G1,G2)GW22(G1(c),G2(c))=\displaystyle\text{GW}_{2}^{2}(G_{1},G_{2})-\text{GW}_{2}^{2}(G_{1}^{(c)},G_{2}^{(c)})= I1I1+I2I2+2(I3I3)I1I1+I2I2\displaystyle I_{1}-I^{\prime}_{1}+I_{2}-I^{\prime}_{2}+2(I^{\prime}_{3}-I_{3})\leq I_{1}-I^{\prime}_{1}+I_{2}-I^{\prime}_{2}
\displaystyle\leq λN1n1+1i=1n1(λiλi(c))+C𝑼1,n1+λN2n2+1i=1n2(λiλi(c))+C𝑼2,n2,\displaystyle\lambda_{N_{1}-n_{1}+1}\sum_{i=1}^{n_{1}}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)+C_{\bm{U}_{1},n_{1}}+\lambda_{N_{2}-n_{2}+1}\sum_{i=1}^{n_{2}}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)+C_{\bm{U}_{2},n_{2}},

where we recall I3:=Tr(𝑺1𝑻𝑺2(𝑻))I_{3}\vcentcolon=\operatorname{Tr}\left(\bm{S}_{1}\bm{T}^{*}\bm{S}_{2}\left(\bm{T}^{*}\right)^{\intercal}\right) and I3I^{\prime}_{3} now is Tr(𝑺1(c)𝑻cc𝑺2(c)(𝑻cc))\operatorname{Tr}\left(\bm{S}_{1}^{(c)}\bm{T}_{cc}^{*}\bm{S}_{2}^{(c)}\left(\bm{T}_{cc}^{*}\right)^{\intercal}\right) (TccT_{cc}^{*} is the optimal transport matrix for G1(c)G_{1}^{(c)} and G2(c)G_{2}^{(c)}).

For the other side, we still decompose the objective GW22(G1(c),G2(c))GW22(G1,G2)\text{GW}_{2}^{2}(G_{1}^{(c)},G_{2}^{(c)})-\text{GW}_{2}^{2}(G_{1},G_{2}) as (I1I1)+(I2I2)+2(I3I3)(I^{\prime}_{1}-I_{1})+(I^{\prime}_{2}-I_{2})+2(I_{3}-I^{\prime}_{3}), and we can similarly follow Lemma E.4 to bound the first two terms. The next task is to decompose I3I3I_{3}-I^{\prime}_{3}.

We first prepare some notations for the analysis. To clarify the affiliation, we redefine 𝑼1:=𝑾112𝑺1𝑾112\bm{U}_{1}\vcentcolon=\bm{W}_{1}^{\frac{1}{2}}\bm{S}_{1}\bm{W}_{1}^{\frac{1}{2}} with eigenvalues λ1,1λ1,2λ1,N1\lambda_{1,1}\geq\lambda_{1,2}\geq\cdots\geq\lambda_{1,N_{1}}, 𝑽1=𝑷𝑾212𝑺2𝑾212𝑷\bm{V}_{1}=\bm{P}\bm{W}_{2}^{\frac{1}{2}}\bm{S}_{2}\bm{W}_{2}^{\frac{1}{2}}\bm{P}^{\intercal} with eigenvalues ν1,1ν1,2ν1,N1\nu_{1,1}\geq\nu_{1,2}\geq\cdots\geq\nu_{1,N_{1}}666We index 𝑷𝑾212𝑺2𝑾212𝑷\bm{P}\bm{W}_{2}^{\frac{1}{2}}\bm{S}_{2}\bm{W}_{2}^{\frac{1}{2}}\bm{P}^{\intercal} as 𝑽1\bm{V}_{1} since it is associated with 𝑼1\bm{U}_{1} and is an N1×N1N_{1}\times N_{1} matrix., 𝑼2=𝑾212𝑺2𝑾212\bm{U}_{2}=\bm{W}_{2}^{\frac{1}{2}}\bm{S}_{2}\bm{W}_{2}^{\frac{1}{2}} with eigenvalues λ2,1λ2,2λ2,N2\lambda_{2,1}\geq\lambda_{2,2}\geq\cdots\geq\lambda_{2,N_{2}}, 𝑽2=𝑷𝑾112𝑺1𝑾112𝑷\bm{V}_{2}=\bm{P}^{\intercal}\bm{W}_{1}^{\frac{1}{2}}\bm{S}_{1}\bm{W}_{1}^{\frac{1}{2}}\bm{P} with eigenvalues ν2,1ν2,2ν2,N2\nu_{2,1}\geq\nu_{2,2}\geq\cdots\geq\nu_{2,N_{2}}, and similarly re-introduce Cp,1,𝑪¯w,1,𝑪w,1C_{p,1},\bar{\bm{C}}_{w,1},\bm{C}_{w,1}, Cp,2,𝑪¯w,2,𝑪w,2C_{p,2},\bar{\bm{C}}_{w,2},\bm{C}_{w,2}, and

𝚷w,1:=𝑾112𝑪p,1𝑪¯w,1𝑾112=𝑪w,1𝑪w,1,𝚷w,2:=𝑾212𝑪p,2𝑪¯w,2𝑾212=𝑪w,2𝑪w,2.\displaystyle\bm{\Pi}_{w,1}\vcentcolon=\bm{W}_{1}^{\frac{1}{2}}\bm{C}_{p,1}^{\intercal}\bar{\bm{C}}_{w,1}\bm{W}_{1}^{-\frac{1}{2}}=\bm{C}_{w,1}^{\intercal}\bm{C}_{w,1},\qquad\bm{\Pi}_{w,2}\vcentcolon=\bm{W}_{2}^{\frac{1}{2}}\bm{C}_{p,2}^{\intercal}\bar{\bm{C}}_{w,2}\bm{W}_{2}^{-\frac{1}{2}}=\bm{C}_{w,2}^{\intercal}\bm{C}_{w,2}.

Recalling Lemma E.7, we can analogously obtain I3I30I^{\prime}_{3}-I_{3}\leq 0; replacing 𝑻cc\bm{T}_{cc}^{*} with 𝑪p,1𝑻𝑪p,2\bm{C}_{p,1}\bm{T}^{*}\bm{C}_{p,2}^{\intercal}, we have

I3I3\displaystyle I_{3}-I^{\prime}_{3}\leq Tr(𝑺1𝑻𝑺2(𝑻))Tr(𝑺1(c)𝑪p,1𝑻𝑪p,2𝑺2(c)𝑪p,2(𝑻)𝑪p,1)\displaystyle\operatorname{Tr}\left(\bm{S}_{1}\bm{T}^{*}\bm{S}_{2}\left(\bm{T}^{*}\right)^{\intercal}\right)-\operatorname{Tr}\left(\bm{S}_{1}^{(c)}\bm{C}_{p,1}\bm{T}^{*}\bm{C}_{p,2}^{\intercal}\bm{S}_{2}^{(c)}\bm{C}_{p,2}\left(\bm{T}^{*}\right)^{\intercal}\bm{C}_{p,1}^{\intercal}\right)
=\displaystyle= Tr(𝑼1𝑷𝑼2𝑷)Tr(𝚷w,1𝑼1𝚷w,1𝑷𝚷w,2𝑼2𝚷w,2𝑷),\displaystyle\operatorname{Tr}\left(\bm{U}_{1}\bm{P}\bm{U}_{2}\bm{P}^{\intercal}\right)-\operatorname{Tr}\left(\bm{\Pi}_{w,1}\bm{U}_{1}\bm{\Pi}_{w,1}\bm{P}\bm{\Pi}_{w,2}\bm{U}_{2}\bm{\Pi}_{w,2}\bm{P}^{\intercal}\right),

where we apply the same derivation as in Equation (16) to obtain the second line above. For simplicity, we let 𝑼1Π:=𝚷w,1𝑼1𝚷w,1\bm{U}_{1}^{\Pi}\vcentcolon=\bm{\Pi}_{w,1}\bm{U}_{1}\bm{\Pi}_{w,1} and 𝑼2Π:=𝚷w,2𝑼2𝚷w,2\bm{U}_{2}^{\Pi}\vcentcolon=\bm{\Pi}_{w,2}\bm{U}_{2}\bm{\Pi}_{w,2}. We can now bound I3I3I_{3}-I^{\prime}_{3} as

I3I3\displaystyle I_{3}-I^{\prime}_{3}\leq Tr(𝑼1𝑷𝑼2𝑷)Tr(𝑼1Π𝑷𝑼2Π𝑷)\displaystyle\operatorname{Tr}\left(\bm{U}_{1}\bm{P}\bm{U}_{2}\bm{P}^{\intercal}\right)-\operatorname{Tr}\left(\bm{U}_{1}^{\Pi}\bm{P}\bm{U}_{2}^{\Pi}\bm{P}^{\intercal}\right) (13)
=\displaystyle= Tr(𝑼1𝑷𝑼2𝑷)Tr(𝑼1Π𝑷𝑼2𝑷)+Tr(𝑼1Π𝑷𝑼2𝑷)Tr(𝑼1Π𝑷𝑼2Π𝑷)\displaystyle\operatorname{Tr}\left(\bm{U}_{1}\bm{P}\bm{U}_{2}\bm{P}^{\intercal}\right)-\operatorname{Tr}\left(\bm{U}_{1}^{\Pi}\bm{P}\bm{U}_{2}\bm{P}^{\intercal}\right)+\operatorname{Tr}\left(\bm{U}_{1}^{\Pi}\bm{P}\bm{U}_{2}\bm{P}^{\intercal}\right)-\operatorname{Tr}\left(\bm{U}_{1}^{\Pi}\bm{P}\bm{U}_{2}^{\Pi}\bm{P}^{\intercal}\right)
=\displaystyle= Tr[(𝑼1𝑼1Π)𝑷𝑼2𝑷]+Tr[𝑼1Π𝑷(𝑼2𝑼2Π)𝑷]\displaystyle\operatorname{Tr}\left[\left(\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right)\bm{P}\bm{U}_{2}\bm{P}^{\intercal}\right]+\operatorname{Tr}\left[\bm{U}_{1}^{\Pi}\bm{P}\left(\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right)\bm{P}^{\intercal}\right]
=\displaystyle= Tr[(𝑼1𝑼1Π)𝑷𝑼2𝑷]+Tr[𝑼1𝑷(𝑼2𝑼2Π)𝑷]\displaystyle\operatorname{Tr}\left[\left(\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right)\bm{P}\bm{U}_{2}\bm{P}^{\intercal}\right]+\operatorname{Tr}\left[\bm{U}_{1}\bm{P}\left(\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right)\bm{P}^{\intercal}\right]
Tr[𝑼1𝑷(𝑼2𝑼2Π)𝑷]+Tr[𝑼1Π𝑷(𝑼2𝑼2Π)𝑷]\displaystyle\qquad-\operatorname{Tr}\left[\bm{U}_{1}\bm{P}\left(\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right)\bm{P}^{\intercal}\right]+\operatorname{Tr}\left[\bm{U}_{1}^{\Pi}\bm{P}\left(\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right)\bm{P}^{\intercal}\right]
=\displaystyle= Tr[(𝑼1𝑼1Π)𝑽1]+Tr[𝑽2(𝑼2𝑼2Π)]Tr[(𝑼1𝑼1Π)𝑷(𝑼2𝑼2Π)𝑷].\displaystyle\operatorname{Tr}\left[\left(\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right)\bm{V}_{1}\right]+\operatorname{Tr}\left[\bm{V}_{2}\left(\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right)\right]-\operatorname{Tr}\left[\left(\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right)\bm{P}\left(\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right)\bm{P}^{\intercal}\right].

The first two terms in the last line above can be directly addressed by Lemma E.5; for the last term, we can bound it as

|Tr[(𝑼1𝑼1Π)𝑷(𝑼2𝑼2Π)𝑷]|\displaystyle\left|\operatorname{Tr}\left[\left(\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right)\bm{P}\left(\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right)\bm{P}^{\intercal}\right]\right|\leq (𝑼1𝑼1Π)𝑷F(𝑼2𝑼2Π)𝑷F\displaystyle\left\|\left(\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right)\bm{P}\right\|_{F}\left\|\left(\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right)\bm{P}^{\intercal}\right\|_{F}
\displaystyle\leq 𝑼1𝑼1ΠF𝑷𝑼2𝑼2ΠF𝑷𝑼1𝑼1ΠF𝑼2𝑼2ΠF,\displaystyle\left\|\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right\|_{F}\left\|\bm{P}\right\|\left\|\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right\|_{F}\left\|\bm{P}\right\|\leq\left\|\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right\|_{F}\left\|\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right\|_{F},

where Lemma E.8 shows 𝑷1\left\|\bm{P}\right\|\leq 1 and justifies the last inequality.

We now give another useful fact that I1I1=𝑼1𝑼1ΠF2I_{1}-I^{\prime}_{1}=\left\|\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right\|_{F}^{2}:

I1I1=(i)\displaystyle I_{1}-I^{\prime}_{1}\overset{(i)}{=} Tr[𝑼12(𝑼1Π)2]=Tr[𝑼12+(𝑼1Π)2]2Tr[(𝚷w,1𝑼1𝚷w,1)2]\displaystyle\operatorname{Tr}\left[\bm{U}_{1}^{2}-\left(\bm{U}_{1}^{\Pi}\right)^{2}\right]=\operatorname{Tr}\left[\bm{U}_{1}^{2}+\left(\bm{U}_{1}^{\Pi}\right)^{2}\right]-2\operatorname{Tr}\left[\left(\bm{\Pi}_{w,1}\bm{U}_{1}\bm{\Pi}_{w,1}\right)^{2}\right]
=\displaystyle= Tr[𝑼12+(𝑼1Π)2]Tr(𝚷w,1𝑼1𝚷w,1𝑼1)Tr(𝑼1𝚷w,1𝑼1𝚷w,1)=Tr[(𝑼1𝑼1Π)2]\displaystyle\operatorname{Tr}\left[\bm{U}_{1}^{2}+\left(\bm{U}_{1}^{\Pi}\right)^{2}\right]-\operatorname{Tr}\left(\bm{\Pi}_{w,1}\bm{U}_{1}\bm{\Pi}_{w,1}\bm{U}_{1}\right)-\operatorname{Tr}\left(\bm{U}_{1}\bm{\Pi}_{w,1}\bm{U}_{1}\bm{\Pi}_{w,1}\right)=\operatorname{Tr}\left[\left(\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right)^{2}\right]
=\displaystyle= 𝑼1𝑼1ΠF2,\displaystyle\left\|\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right\|_{F}^{2},

where (i)(i) comes from Equation (E.4.2). Analogously we have I2I2=𝑼2𝑼2ΠF2I_{2}-I^{\prime}_{2}=\left\|\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right\|_{F}^{2}. Combining the above pieces together we can bound the object as

GW22(G1(c),G2(c))GW22(G1,G2)\displaystyle\text{GW}_{2}^{2}(G_{1}^{(c)},G_{2}^{(c)})-\text{GW}_{2}^{2}(G_{1},G_{2})\leq 𝑼1𝑼1ΠF2𝑼2𝑼2ΠF2+\displaystyle-\left\|\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right\|_{F}^{2}-\left\|\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right\|_{F}^{2}+
2Tr[(𝑼1𝑼1Π)𝑽1]+2Tr[𝑽2(𝑼2𝑼2Π)]+2𝑼1𝑼1ΠF𝑼2𝑼2ΠF\displaystyle\quad 2\operatorname{Tr}\left[\left(\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right)\bm{V}_{1}\right]+2\operatorname{Tr}\left[\bm{V}_{2}\left(\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right)\right]+2\left\|\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right\|_{F}\left\|\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right\|_{F}
\displaystyle\leq 2Tr[(𝑼1𝑼1Π)𝑽1]+2Tr[𝑽2(𝑼2𝑼2Π)]\displaystyle 2\operatorname{Tr}\left[\left(\bm{U}_{1}-\bm{U}_{1}^{\Pi}\right)\bm{V}_{1}\right]+2\operatorname{Tr}\left[\bm{V}_{2}\left(\bm{U}_{2}-\bm{U}_{2}^{\Pi}\right)\right]
(i)\displaystyle\overset{(i)}{\leq} 2[ν1,N1n1+1i=1n1(λ1,iλ1,i(c))+C𝑼1,𝑽1,n1+\displaystyle 2\cdot\left[\nu_{1,N_{1}-n_{1}+1}\sum_{i=1}^{n_{1}}\left(\lambda_{1,i}-\lambda^{(c)}_{1,i}\right)+C_{\bm{U}_{1},\bm{V}_{1},n_{1}}+\right.
ν2,N2n2+1i=1n2(λ2,iλ2,i(c))+C𝑼2,𝑽2,n2],\displaystyle\quad\left.\nu_{2,N_{2}-n_{2}+1}\sum_{i=1}^{n_{2}}\left(\lambda_{2,i}-\lambda^{(c)}_{2,i}\right)+C_{\bm{U}_{2},\bm{V}_{2},n_{2}}\right],

where we reuse Inequality (E.4.2) to attain (i)(i). The proof of Theorem 4.4 is then completed.

E.4 Proof of some technical results

E.4.1 Proof of Lemma E.3

Proof.

Following the definition in Section 3.3, we rewrite the GW distance in the trace form and have

𝑴,𝑻\displaystyle\left\langle\bm{M},\bm{T}^{*}\right\rangle =f1(𝑺1)𝒎1𝟏N2,𝑻+𝟏N1𝒎2f2(𝑺2),𝑻h1(𝑺1)𝑻h2(𝑺2),𝑻\displaystyle=\left\langle f_{1}(\bm{S}_{1})\bm{m}_{1}\bm{1}_{N_{2}}^{\intercal},\bm{T}^{*}\right\rangle+\left\langle\bm{1}_{N_{1}}\bm{m}_{2}^{\intercal}f_{2}(\bm{S}_{2})^{\intercal},\bm{T}^{*}\right\rangle-\left\langle h_{1}(\bm{S}_{1})\bm{T}^{*}h_{2}(\bm{S}_{2})^{\intercal},\bm{T}^{*}\right\rangle
=Tr(f1(𝑺1)𝒎1𝟏N2(𝑻))+Tr(f2(𝑺2)𝒎2𝟏N1𝑻)Tr(h1(𝑺1)𝑻h2(𝑺2)(𝑻))\displaystyle=\operatorname{Tr}\left(f_{1}(\bm{S}_{1})\bm{m}_{1}\bm{1}_{N_{2}}^{\intercal}\left(\bm{T}^{*}\right)^{\intercal}\right)+\operatorname{Tr}\left(f_{2}(\bm{S}_{2})\bm{m}_{2}\bm{1}_{N_{1}}^{\intercal}\bm{T}^{*}\right)-\operatorname{Tr}\left(h_{1}(\bm{S}_{1})\bm{T}^{*}h_{2}(\bm{S}_{2})^{\intercal}\left(\bm{T}^{*}\right)^{\intercal}\right)
=Tr(𝒎1f1(𝑺1)𝒎1)+Tr(𝒎2f2(𝑺2)𝒎2)Tr(h1(𝑺1)𝑻h2(𝑺2)(𝑻));\displaystyle=\operatorname{Tr}\left(\bm{m}_{1}^{\intercal}f_{1}(\bm{S}_{1})\bm{m}_{1}\right)+\operatorname{Tr}\left(\bm{m}_{2}^{\intercal}f_{2}(\bm{S}_{2})\bm{m}_{2}\right)-\operatorname{Tr}\left(h_{1}(\bm{S}_{1})\bm{T}^{*}h_{2}(\bm{S}_{2})^{\intercal}\left(\bm{T}^{*}\right)^{\intercal}\right); (14)

the third equation above holds because for any 𝑻Π(μ1,μ2)\bm{T}\in\Pi(\mu_{1},\mu_{2}), 𝑻𝟏N2=𝒎1\bm{T}\bm{1}_{N_{2}}=\bm{m}_{1} and 𝑻𝟏N1=𝒎2\bm{T}^{\intercal}\bm{1}_{N_{1}}=\bm{m}_{2}.

In the classical square loss case, we can immediately have f1(𝑺1)=𝑺𝟏𝑺𝟏,f2(𝑺2)=𝑺𝟐𝑺𝟐,h1(𝑺1)=𝑺1f_{1}(\bm{S}_{1})=\bm{S_{1}}\odot\bm{S_{1}},f_{2}(\bm{S}_{2})=\bm{S_{2}}\odot\bm{S_{2}},h_{1}(\bm{S}_{1})=\bm{S}_{1} and h2(𝑺2)=2𝑺2h_{2}(\bm{S}_{2})=2\bm{S}_{2}, where \odot denotes the Hadamard product of two matrices with the same size. We can accordingly expand the first term in Equation 14 as

Tr(𝒎1f1(𝑺1)𝒎1)=𝒎1(𝑺𝟏𝑺𝟏)𝒎1=Tr(𝑺𝟏diag(𝒎𝟏)𝑺𝟏diag(𝒎𝟏)),\displaystyle\operatorname{Tr}\left(\bm{m}_{1}^{\intercal}f_{1}(\bm{S}_{1})\bm{m}_{1}\right)=\bm{m}_{1}^{\intercal}\left(\bm{S_{1}}\odot\bm{S_{1}}\right)\bm{m}_{1}=\operatorname{Tr}\left(\bm{S_{1}}^{\intercal}\text{diag}(\bm{m_{1}})\bm{S_{1}}\text{diag}(\bm{m_{1}})\right),

in which the proof of the last equality is provided in a summary sheet by Minka (2000). We note that 𝑾1:=diag(𝒎𝟏)\bm{W}_{1}\vcentcolon=\text{diag}(\bm{m_{1}}) and that 𝑺1\bm{S}_{1} is symmetric by construction; combining the pieces above, we have

Tr(𝒎1f1(𝑺1)𝒎1)=Tr((𝑾112𝑺𝟏𝑾112)(𝑾112𝑺𝟏𝑾112)),\displaystyle\operatorname{Tr}\left(\bm{m}_{1}^{\intercal}f_{1}(\bm{S}_{1})\bm{m}_{1}\right)=\operatorname{Tr}\left(\left(\bm{W}_{1}^{\frac{1}{2}}\bm{S_{1}}\bm{W}_{1}^{\frac{1}{2}}\right)\left(\bm{W}_{1}^{\frac{1}{2}}\bm{S_{1}}\bm{W}_{1}^{\frac{1}{2}}\right)\right),

and similarly we obtain Tr(𝒎2f2(𝑺2)𝒎2)=Tr((𝑾212𝑺𝟐𝑾212)(𝑾212𝑺𝟐𝑾212))\operatorname{Tr}\left(\bm{m}_{2}^{\intercal}f_{2}(\bm{S}_{2})\bm{m}_{2}\right)=\operatorname{Tr}\left(\left(\bm{W}_{2}^{\frac{1}{2}}\bm{S_{2}}\bm{W}_{2}^{\frac{1}{2}}\right)\left(\bm{W}_{2}^{\frac{1}{2}}\bm{S_{2}}\bm{W}_{2}^{\frac{1}{2}}\right)\right). The GW distance (14) can be therefore represented as

Tr((𝑾112𝑺𝟏𝑾112)(𝑾112𝑺𝟏𝑾112))=:I1+Tr((𝑾212𝑺𝟐𝑾212)(𝑾212𝑺𝟐𝑾212))=:I22Tr(𝑺1𝑻𝑺2(𝑻))=:I3.\displaystyle\underbrace{\operatorname{Tr}\left(\left(\bm{W}_{1}^{\frac{1}{2}}\bm{S_{1}}\bm{W}_{1}^{\frac{1}{2}}\right)\left(\bm{W}_{1}^{\frac{1}{2}}\bm{S_{1}}\bm{W}_{1}^{\frac{1}{2}}\right)\right)}_{=\,:I_{1}}+\underbrace{\operatorname{Tr}\left(\left(\bm{W}_{2}^{\frac{1}{2}}\bm{S_{2}}\bm{W}_{2}^{\frac{1}{2}}\right)\left(\bm{W}_{2}^{\frac{1}{2}}\bm{S_{2}}\bm{W}_{2}^{\frac{1}{2}}\right)\right)}_{=\,:I_{2}}-2\underbrace{\operatorname{Tr}\left(\bm{S}_{1}\bm{T}^{*}\bm{S}_{2}\left(\bm{T}^{*}\right)^{\intercal}\right)}_{=\,:I_{3}}. (15)

\diamondsuit
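The algebra above does not rely on the optimality of $\bm{T}^{*}$: for the square loss, the decomposition holds for any feasible coupling, with only $I_{3}$ depending on which coupling is plugged in. The following NumPy snippet (a minimal sanity check, not taken from the paper's repository) compares the brute-force quadruple sum of the square-loss GW objective against the closed form $I_{1}+I_{2}-2I_{3}$, and also checks the intermediate identity $\bm{m}_{1}^{\intercal}(\bm{S}_{1}\odot\bm{S}_{1})\bm{m}_{1}=\operatorname{Tr}\big((\bm{W}_{1}^{1/2}\bm{S}_{1}\bm{W}_{1}^{1/2})^{2}\big)$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N1, N2 = 5, 4

# Random symmetric similarity matrices and an arbitrary feasible coupling T.
S1 = rng.random((N1, N1)); S1 = (S1 + S1.T) / 2
S2 = rng.random((N2, N2)); S2 = (S2 + S2.T) / 2
T = rng.random((N1, N2)); T /= T.sum()        # any nonnegative coupling
m1, m2 = T.sum(axis=1), T.sum(axis=0)         # its marginals
W1_half, W2_half = np.diag(np.sqrt(m1)), np.diag(np.sqrt(m2))

# Brute-force square-loss GW objective: sum_{i,j,k,l} (S1_ik - S2_jl)^2 T_ij T_kl.
brute = sum((S1[i, k] - S2[j, l]) ** 2 * T[i, j] * T[k, l]
            for i in range(N1) for j in range(N2)
            for k in range(N1) for l in range(N2))

# Closed form I1 + I2 - 2*I3 from Equations (14)-(15), with T in place of T*.
A1 = W1_half @ S1 @ W1_half
A2 = W2_half @ S2 @ W2_half
I1, I2 = np.trace(A1 @ A1), np.trace(A2 @ A2)
I3 = np.trace(S1 @ T @ S2 @ T.T)
assert np.isclose(brute, I1 + I2 - 2 * I3)

# Intermediate identity: m1' (S1 . S1) m1 = Tr((W1^{1/2} S1 W1^{1/2})^2).
assert np.isclose(m1 @ (S1 * S1) @ m1, np.trace(A1 @ A1))
\end{verbatim}

Note that $I_{1}$ and $I_{2}$ are determined by the two marginals alone, which is what makes the decomposition (15) convenient when coarsening each graph separately.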

E.4.2 Proof of Lemma E.4

Proof.

First, recall that $\bm{\Pi}_{w}=\bm{W}_{1}^{\frac{1}{2}}\bm{C}_{p}^{\intercal}\bar{\bm{C}}_{w}\bm{W}_{1}^{-\frac{1}{2}}=\bm{C}_{w}^{\intercal}\bm{C}_{w}$. So, we have

\begin{align}
I_{1}^{\prime}&=\operatorname{Tr}\left(\left[\left(\bm{C}_{p}\bm{W}_{1}\bm{C}_{p}^{\intercal}\right)^{\frac{1}{2}}\left(\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}\right)\left(\bm{C}_{p}\bm{W}_{1}\bm{C}_{p}^{\intercal}\right)^{\frac{1}{2}}\right]^{2}\right) \tag{16}\\
&=\operatorname{Tr}\left[\left(\bm{C}_{p}\bm{W}_{1}\bm{C}_{p}^{\intercal}\right)\left(\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}\right)\left(\bm{C}_{p}\bm{W}_{1}\bm{C}_{p}^{\intercal}\right)\left(\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}\right)\right] \nonumber\\
&=\operatorname{Tr}\left(\bm{W}_{1}^{\frac{1}{2}}\bm{C}_{p}^{\intercal}\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}\bm{C}_{p}\bm{W}_{1}\bm{C}_{p}^{\intercal}\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}\bm{C}_{p}\bm{W}_{1}^{\frac{1}{2}}\right) \nonumber\\
&=\operatorname{Tr}\left(\bm{\Pi}_{w}\bm{U}_{1}\bm{\Pi}_{w}^{\intercal}\bm{\Pi}_{w}\bm{U}_{1}\bm{\Pi}_{w}\right) \nonumber\\
&=\operatorname{Tr}\left[\left(\bm{\Pi}_{w}\bm{U}_{1}\bm{\Pi}_{w}\right)^{2}\right]. \nonumber
\end{align}

This implies $I_{1}-I_{1}^{\prime}=\operatorname{Tr}\left[\bm{U}^{2}-\left(\bm{\Pi}_{w}\bm{U}\bm{\Pi}_{w}\right)^{2}\right]$. Applying Lemma E.9 yields $I_{1}-I_{1}^{\prime}\geq 0$.

To bound the other direction, we have

\begin{align}
I_{1}-I^{\prime}_{1}&=\operatorname{Tr}\left[\bm{U}^{2}-\left(\bm{\Pi}_{w}\bm{U}\bm{\Pi}_{w}\right)^{2}\right] \nonumber\\
&=\sum_{i=1}^{N_{1}}\lambda_{i}^{2}-\sum_{i=1}^{n_{1}}\left(\lambda^{(c)}_{i}\right)^{2}\stackrel{\text{(i)}}{\leq}\sum_{i=1}^{N_{1}}\lambda_{i}^{2}-\sum_{i=1}^{n_{1}}\lambda^{(c)}_{i}\lambda_{N_{1}-n_{1}+i} \nonumber\\
&=\sum_{i=1}^{n_{1}}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)\lambda_{N_{1}-n_{1}+i}+\sum_{i=1}^{n_{1}}\lambda_{i}\left(\lambda_{i}-\lambda_{N_{1}-n_{1}+i}\right)+\sum_{i=n_{1}+1}^{N_{1}}\lambda_{i}^{2} \nonumber\\
&=\sum_{i=1}^{n_{1}}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)\lambda_{N_{1}-n_{1}+i}+C_{\bm{U},n_{1}} \nonumber\\
&\stackrel{\text{(ii)}}{\leq}\lambda_{N_{1}-n_{1}+1}\sum_{i=1}^{n_{1}}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)+C_{\bm{U},n_{1}}, \tag{17}
\end{align}

Here, both (i) and (ii) are by Theorem D.1. \diamondsuit
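As a numerical illustration of (17), the sketch below (not taken from the paper's code) assumes the standard constructions: a $0/1$ partition matrix $\bm{C}_{p}$, $\bm{C}_{w}=(\bm{C}_{p}\bm{W}_{1}\bm{C}_{p}^{\intercal})^{-1/2}\bm{C}_{p}\bm{W}_{1}^{1/2}$ (one construction consistent with $\bm{C}_{w}\bm{C}_{w}^{\intercal}=\bm{I}_{n_{1}}$ and $\bm{\Pi}_{w}=\bm{C}_{w}^{\intercal}\bm{C}_{w}$ used above), $\bm{U}=\bm{W}_{1}^{1/2}\bm{S}_{1}\bm{W}_{1}^{1/2}$ with a PSD $\bm{S}_{1}$, and $C_{\bm{U},n_{1}}$ read off the two middle terms of (17).

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
N1, n1 = 8, 3

# PSD similarity matrix, node masses, and a random partition into n1 clusters.
B = rng.random((N1, N1)); S1 = B @ B.T
m1 = rng.random(N1); m1 /= m1.sum()
labels = rng.integers(0, n1, size=N1); labels[:n1] = np.arange(n1)
Cp = np.eye(n1)[:, labels]                     # 0/1 partition matrix (n1 x N1)

Cw = np.diag((Cp @ m1) ** -0.5) @ Cp @ np.diag(np.sqrt(m1))
Pi_w = Cw.T @ Cw                               # projection; Cw Cw^T = I_{n1}
U = np.diag(np.sqrt(m1)) @ S1 @ np.diag(np.sqrt(m1))

diff = np.trace(U @ U) - np.trace((Pi_w @ U @ Pi_w) @ (Pi_w @ U @ Pi_w))
lam = np.sort(np.linalg.eigvalsh(U))[::-1]     # lambda_1 >= ... >= lambda_{N1}
lam_c = np.sort(np.linalg.eigvalsh(Cw @ U @ Cw.T))[::-1]

C_U = (sum(lam[i] * (lam[i] - lam[N1 - n1 + i]) for i in range(n1))
       + sum(lam[i] ** 2 for i in range(n1, N1)))
upper = lam[N1 - n1] * sum(lam[i] - lam_c[i] for i in range(n1)) + C_U

assert -1e-10 <= diff <= upper + 1e-10         # 0 <= I1 - I1' <= bound (17)
\end{verbatim}

Lemma E.9 guarantees the left inequality; the right one is exactly the chain (i)--(ii).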

E.4.3 Proof of Lemma E.5

Proof.

By applying Lemma E.7, we have

\begin{align*}
0\leq\operatorname{Tr}\left(\bm{S}_{1}\bm{T}^{*}\bm{S}_{2}\left(\bm{T}^{*}\right)^{\intercal}-\bm{S}_{1}^{(c)}\bm{T}_{co}^{*}\bm{S}_{2}\left(\bm{T}_{co}^{*}\right)^{\intercal}\right)
&\leq\operatorname{Tr}\left[\left(\bm{U}-\bm{\Pi}_{w}\bm{U}\bm{\Pi}_{w}\right)\bm{V}\right]\\
&=\operatorname{Tr}\left(\bm{U}\bm{V}\right)-\operatorname{Tr}\left(\bm{C}_{w}\bm{U}\bm{C}_{w}^{\intercal}\bm{C}_{w}\bm{V}\bm{C}_{w}^{\intercal}\right).
\end{align*}

Recall that $\left\{\lambda^{(c)}_{i}\right\}_{i=1}^{n_{1}}$ are the eigenvalues of $\bm{C}_{w}\bm{U}\bm{C}_{w}^{\intercal}$; let $\nu_{1}\geq\nu_{2}\geq\cdots\geq\nu_{N_{1}}$ be the eigenvalues of $\bm{V}$ and $\nu^{(c)}_{1}\geq\cdots\geq\nu^{(c)}_{n_{1}}$ be the eigenvalues of $\bm{C}_{w}\bm{V}\bm{C}_{w}^{\intercal}$. Applying Ruhe's trace inequality (cf. Appendix D.2), we further have

\begin{align}
&\operatorname{Tr}\left(\bm{U}\bm{V}\right)-\operatorname{Tr}\left(\bm{C}_{w}\bm{U}\bm{C}_{w}^{\intercal}\bm{C}_{w}\bm{V}\bm{C}_{w}^{\intercal}\right)\stackrel{\text{(i)}}{\leq}\sum_{i=1}^{N_{1}}\lambda_{i}\nu_{i}-\sum_{i=1}^{n_{1}}\lambda^{(c)}_{i}\nu^{(c)}_{n_{1}-i+1}\stackrel{\text{(ii)}}{\leq}\sum_{i=1}^{N_{1}}\lambda_{i}\nu_{i}-\sum_{i=1}^{n_{1}}\lambda^{(c)}_{i}\nu_{N_{1}-i+1} \nonumber\\
&=\sum_{i=1}^{n_{1}}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)\nu_{N_{1}-i+1}+\sum_{i=1}^{n_{1}}\lambda_{i}\left(\nu_{i}-\nu_{N_{1}-i+1}\right)+\sum_{i=n_{1}+1}^{N_{1}}\lambda_{i}\nu_{i}=\sum_{i=1}^{n_{1}}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)\nu_{N_{1}-i+1}+C_{\bm{U},\bm{V},n_{1}} \nonumber\\
&\stackrel{\text{(iii)}}{\leq}\nu_{N_{1}-n_{1}+1}\sum_{i=1}^{n_{1}}\left(\lambda_{i}-\lambda^{(c)}_{i}\right)+C_{\bm{U},\bm{V},n_{1}}, \tag{18}
\end{align}

where (i) is by Ruhe's trace inequality (Lemma D.2), since both $\bm{U}$ and $\bm{V}$ are PSD, and both (ii) and (iii) are by the Poincaré separation theorem (Theorem D.1). \diamondsuit
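Ruhe's trace inequality is the main new ingredient relative to the previous proof. The snippet below (a minimal sanity check under the stated PSD assumption, not part of the paper's code) verifies both sides of it, $\sum_{i}\lambda_{i}(\bm{A})\lambda_{N-i+1}(\bm{B})\leq\operatorname{Tr}(\bm{A}\bm{B})\leq\sum_{i}\lambda_{i}(\bm{A})\lambda_{i}(\bm{B})$, for random PSD matrices; in the proof, the upper bound is applied to $\operatorname{Tr}(\bm{U}\bm{V})$ and the lower bound to $\operatorname{Tr}(\bm{C}_{w}\bm{U}\bm{C}_{w}^{\intercal}\bm{C}_{w}\bm{V}\bm{C}_{w}^{\intercal})$, which together yield step (i).

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
N = 6

# Two random PSD matrices.
A = rng.random((N, N)); A = A @ A.T
B = rng.random((N, N)); B = B @ B.T

lam = np.sort(np.linalg.eigvalsh(A))[::-1]   # decreasing eigenvalues of A
nu  = np.sort(np.linalg.eigvalsh(B))[::-1]   # decreasing eigenvalues of B

lower = np.sum(lam * nu[::-1])               # sum_i lambda_i * nu_{N-i+1}
upper = np.sum(lam * nu)                     # sum_i lambda_i * nu_i
trace = np.trace(A @ B)

assert lower - 1e-10 <= trace <= upper + 1e-10
\end{verbatim}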

E.5 Other technical lemmas

Lemma E.7.

We have

\begin{align*}
0\leq\operatorname{Tr}\left(\bm{S}_{1}\bm{T}^{*}\bm{S}_{2}\left(\bm{T}^{*}\right)^{\intercal}-\bm{S}_{1}^{(c)}\bm{T}_{co}^{*}\bm{S}_{2}\left(\bm{T}_{co}^{*}\right)^{\intercal}\right)\leq\operatorname{Tr}\left[\left(\bm{U}-\bm{\Pi}_{w}\bm{U}\bm{\Pi}_{w}\right)\bm{V}\right].
\end{align*}
Proof.

The proof is based on the optimality of $\bm{T}_{co}^{*}$; that is, the GW objective attained by $\bm{T}_{co}^{*}$ is no larger than that attained by any other feasible transport plan. Intuitively, we can imagine that the mass of a cluster center is transported to the same target nodes in $G_{2}$ as the original source nodes within this cluster, which corresponds to the transport matrix $\widetilde{\bm{T}}_{co}\vcentcolon=\bm{C}_{p}\bm{T}^{\ast}$. We can verify that $\widetilde{\bm{T}}_{co}\in\Pi(\mu_{1}^{(c)},\mu_{2})$ is feasible, since $(\bm{C}_{p}\bm{T}^{\ast})\bm{1}_{N_{2}}=\bm{C}_{p}\bm{m}_{1}=\bm{m}_{1}^{(c)}$ and $\left(\bm{C}_{p}\bm{T}^{*}\right)^{\intercal}\bm{1}_{n_{1}}=(\bm{T}^{\ast})^{\intercal}\bm{1}_{N_{1}}=\bm{m}_{2}$.

To derive the upper bound, applying the optimality of 𝑻co\bm{T}_{co}^{\ast} yields

\begin{align}
&\quad\,\operatorname{Tr}\left(\bm{S}_{1}\bm{T}^{*}\bm{S}_{2}\left(\bm{T}^{*}\right)^{\intercal}-\bm{S}_{1}^{(c)}\bm{T}_{co}^{*}\bm{S}_{2}\left(\bm{T}_{co}^{*}\right)^{\intercal}\right)\leq\operatorname{Tr}\left(\bm{S}_{1}\bm{T}^{*}\bm{S}_{2}\left(\bm{T}^{*}\right)^{\intercal}\right)-\operatorname{Tr}\left(\bm{S}_{1}^{(c)}\widetilde{\bm{T}}_{co}\bm{S}_{2}\widetilde{\bm{T}}_{co}^{\intercal}\right) \tag{19}\\
&=\operatorname{Tr}\left(\bm{S}_{1}\bm{T}^{*}\bm{S}_{2}\left(\bm{T}^{*}\right)^{\intercal}\right)-\operatorname{Tr}\left(\big(\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}\big)\big(\bm{C}_{p}\bm{T}^{\ast}\big)\bm{S}_{2}\big(\bm{C}_{p}\bm{T}^{\ast}\big)^{\intercal}\right) \tag{20}\\
&=\operatorname{Tr}\left(\bm{S}_{1}\bm{T}^{\ast}\bm{S}_{2}(\bm{T}^{\ast})^{\intercal}\right)-\operatorname{Tr}\left(\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}\bm{C}_{p}\bm{T}^{\ast}\bm{S}_{2}(\bm{T}^{\ast})^{\intercal}\bm{C}_{p}^{\intercal}\right) \tag{21}\\
&=\operatorname{Tr}\left(\bm{S}_{1}\big(\bm{W}_{1}^{\frac{1}{2}}\bm{P}\bm{W}_{2}^{\frac{1}{2}}\big)\bm{S}_{2}\big(\bm{W}_{1}^{\frac{1}{2}}\bm{P}\bm{W}_{2}^{\frac{1}{2}}\big)^{\intercal}\right)-\operatorname{Tr}\left(\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}\bm{C}_{p}\big(\bm{W}_{1}^{\frac{1}{2}}\bm{P}\bm{W}_{2}^{\frac{1}{2}}\big)\bm{S}_{2}\big(\bm{W}_{2}^{\frac{1}{2}}\bm{P}^{\intercal}\bm{W}_{1}^{\frac{1}{2}}\big)\bm{C}_{p}^{\intercal}\right) \nonumber\\
&=\operatorname{Tr}\left(\bm{S}_{1}\bm{W}_{1}^{\frac{1}{2}}\bm{P}\bm{W}_{2}^{\frac{1}{2}}\bm{S}_{2}\bm{W}_{2}^{\frac{1}{2}}\bm{P}^{\intercal}\bm{W}_{1}^{\frac{1}{2}}\right)-\operatorname{Tr}\left(\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}\bm{C}_{p}\bm{W}_{1}^{\frac{1}{2}}\bm{P}\bm{W}_{2}^{\frac{1}{2}}\bm{S}_{2}\bm{W}_{2}^{\frac{1}{2}}\bm{P}^{\intercal}\bm{W}_{1}^{\frac{1}{2}}\bm{C}_{p}^{\intercal}\right) \nonumber\\
&=\operatorname{Tr}\left[\left(\underbrace{\bm{W}_{1}^{\frac{1}{2}}\bm{S}_{1}\bm{W}_{1}^{\frac{1}{2}}-\bm{W}_{1}^{\frac{1}{2}}\bm{C}_{p}^{\intercal}\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}\bm{C}_{p}\bm{W}_{1}^{\frac{1}{2}}}_{\bm{D}_{1}}\right)\bm{P}\bm{W}_{2}^{\frac{1}{2}}\bm{S}_{2}\bm{W}_{2}^{\frac{1}{2}}\bm{P}^{\intercal}\right]. \tag{22}
\end{align}

The treatment is similar for the lower bound. We replace $\bm{T}^{*}$ with $\widetilde{\bm{T}}\vcentcolon=\bar{\bm{C}}_{w}^{\intercal}\bm{T}_{co}^{*}\in\Pi(\mu_{1},\mu_{2})$. Then again, due to the optimality of $\bm{T}^{*}$:

\begin{align*}
\operatorname{Tr}\left(\bm{S}_{1}^{(c)}\bm{T}_{co}^{*}\bm{S}_{2}\left(\bm{T}_{co}^{*}\right)^{\intercal}-\bm{S}_{1}\bm{T}^{*}\bm{S}_{2}\left(\bm{T}^{*}\right)^{\intercal}\right)\leq\operatorname{Tr}\left(\bm{S}_{1}^{(c)}\bm{T}_{co}^{\ast}\bm{S}_{2}(\bm{T}_{co}^{\ast})^{\intercal}\right)-\operatorname{Tr}\left(\bm{S}_{1}\widetilde{\bm{T}}\bm{S}_{2}\widetilde{\bm{T}}^{\intercal}\right)=0,
\end{align*}

where the final equality holds because $\bm{S}_{1}^{(c)}=\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}$ implies $\operatorname{Tr}\big(\bm{S}_{1}\widetilde{\bm{T}}\bm{S}_{2}\widetilde{\bm{T}}^{\intercal}\big)=\operatorname{Tr}\big(\bar{\bm{C}}_{w}\bm{S}_{1}\bar{\bm{C}}_{w}^{\intercal}\bm{T}_{co}^{*}\bm{S}_{2}(\bm{T}_{co}^{*})^{\intercal}\big)=\operatorname{Tr}\big(\bm{S}_{1}^{(c)}\bm{T}_{co}^{*}\bm{S}_{2}(\bm{T}_{co}^{*})^{\intercal}\big)$. \diamondsuit
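The two auxiliary couplings used above can be checked mechanically. Below is a small NumPy sketch (not from the paper's repository) that assumes a $0/1$ partition matrix $\bm{C}_{p}$ and the mass-weighted averaging matrix $\bar{\bm{C}}_{w}=(\bm{C}_{p}\bm{W}_{1}\bm{C}_{p}^{\intercal})^{-1}\bm{C}_{p}\bm{W}_{1}$ (an assumption about the constructions defined earlier in the paper); it verifies that $\bm{C}_{p}\bm{T}$ has marginals $(\bm{m}_{1}^{(c)},\bm{m}_{2})$ and $\bar{\bm{C}}_{w}^{\intercal}\bm{T}_{co}$ has marginals $(\bm{m}_{1},\bm{m}_{2})$ for arbitrary feasible couplings $\bm{T}$ and $\bm{T}_{co}$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
N1, N2, n1 = 7, 5, 3

# Masses, a partition, and the (assumed) coarsening matrices Cp and C_bar_w.
m1 = rng.random(N1); m1 /= m1.sum()
m2 = rng.random(N2); m2 /= m2.sum()
labels = rng.integers(0, n1, size=N1); labels[:n1] = np.arange(n1)
Cp = np.eye(n1)[:, labels]                    # 0/1 partition matrix (n1 x N1)
C_bar_w = np.diag(1 / (Cp @ m1)) @ Cp @ np.diag(m1)
m1_c = Cp @ m1                                # supernode masses

# Arbitrary feasible couplings via independent (product) plans.
T = np.outer(m1, m2)                          # in Pi(m1, m2)
T_co = np.outer(m1_c, m2)                     # in Pi(m1_c, m2)

# The lifted/coarsened couplings used in the proof keep the required marginals.
T_tilde_co = Cp @ T                           # candidate in Pi(m1_c, m2)
T_tilde = C_bar_w.T @ T_co                    # candidate in Pi(m1, m2)
assert np.allclose(T_tilde_co.sum(axis=1), m1_c)
assert np.allclose(T_tilde_co.sum(axis=0), m2)
assert np.allclose(T_tilde.sum(axis=1), m1)
assert np.allclose(T_tilde.sum(axis=0), m2)
\end{verbatim}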

Remark. An intuitive scheme to control the upper bound (22) is to upper bound the trace of the $G_{1}$-related matrix difference $\bm{D}_{1}$, since when coarsening $G_{1}$ we have no information about $G_{2}$; we will shortly show that this term is also the key to bounding the whole GW distance difference $\left|(15)-(12)\right|$.

Lemma E.8.

Consider a non-negative matrix $\bm{T}\in\mathbb{R}^{N_{1}\times N_{2}}$, i.e., all of its elements satisfy $t_{ij}\geq 0$. As before, denote $\bm{W}_{1}=\operatorname{diag}\left(\bm{T}\bm{1}_{N_{2}}\right)$, $\bm{W}_{2}=\operatorname{diag}\left(\bm{T}^{\intercal}\bm{1}_{N_{1}}\right)$, and $\bm{P}=\bm{W}_{1}^{-\frac{1}{2}}\bm{T}\bm{W}_{2}^{-\frac{1}{2}}$. Then we have $\|\bm{P}\|\leq 1$, where $\|\cdot\|$ denotes the spectral norm.

Proof.

Motivated by the standard argument for bounding the eigenvalues of normalized Laplacian matrices, for arbitrary vectors $\bm{u}\in\mathbb{R}^{N_{1}}$ and $\bm{v}\in\mathbb{R}^{N_{2}}$ on the unit spheres we can recast the target statement as

\begin{align}
1-\bm{u}^{\intercal}\bm{P}\bm{v}\geq 0,\quad\forall\,\bm{u},\bm{v}\text{ s.t. }\|\bm{u}\|=1,\ \|\bm{v}\|=1. \tag{23}
\end{align}

We denote the diagonals of $\bm{W}_{1}$ and $\bm{W}_{2}$ respectively as $\bm{m}_{1}=[m_{1,1},m_{1,2},\dots,m_{1,N_{1}}]^{\intercal}$ and $\bm{m}_{2}=[m_{2,1},m_{2,2},\dots,m_{2,N_{2}}]^{\intercal}$; then we can rewrite the left-hand-side quantity in Inequality (23) as

\begin{align*}
1-\bm{u}^{\intercal}\bm{P}\bm{v}
&=\frac{1}{2}\left(\|\bm{u}\|^{2}+\|\bm{v}\|^{2}\right)-\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\frac{t_{ij}\bm{u}_{i}\bm{v}_{j}}{\sqrt{m_{1,i}m_{2,j}}}\\
&=\frac{1}{2}\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\frac{t_{ij}\bm{u}_{i}^{2}}{m_{1,i}}+\frac{1}{2}\sum_{j=1}^{N_{2}}\sum_{i=1}^{N_{1}}\frac{t_{ij}\bm{v}_{j}^{2}}{m_{2,j}}-\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\frac{t_{ij}\bm{u}_{i}\bm{v}_{j}}{\sqrt{m_{1,i}m_{2,j}}}\\
&=\frac{1}{2}\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}t_{ij}\left(\frac{\bm{u}_{i}}{\sqrt{m_{1,i}}}-\frac{\bm{v}_{j}}{\sqrt{m_{2,j}}}\right)^{2}\geq 0,
\end{align*}

where the second equality holds due to the conditions $\sum_{j=1}^{N_{2}}t_{ij}=m_{1,i}$ and $\sum_{i=1}^{N_{1}}t_{ij}=m_{2,j}$, and the last inequality holds since $t_{ij}\geq 0$ for all $i\in[N_{1}]$ and $j\in[N_{2}]$. \diamondsuit
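A quick numerical check of Lemma E.8 (a sketch, not part of the released code): draw a random non-negative $\bm{T}$, form $\bm{P}=\bm{W}_{1}^{-1/2}\bm{T}\bm{W}_{2}^{-1/2}$ from its row and column sums, and confirm that the spectral norm does not exceed one.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
N1, N2 = 9, 6

T = rng.random((N1, N2))                      # non-negative matrix
W1_inv_half = np.diag(T.sum(axis=1) ** -0.5)  # W1 = diag(T 1); take its ^{-1/2}
W2_inv_half = np.diag(T.sum(axis=0) ** -0.5)  # W2 = diag(T' 1); take its ^{-1/2}
P = W1_inv_half @ T @ W2_inv_half

assert np.linalg.norm(P, ord=2) <= 1 + 1e-10  # spectral norm ||P|| <= 1
\end{verbatim}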

Lemma E.9.

Consider a positive semi-definite (PSD) similarity matrix $\bm{S}\in\mathbb{R}^{N\times N}$ along with the probability mass vector $\bm{m}$ and the diagonal matrix $\bm{W}\vcentcolon=\operatorname{diag}\left(\bm{m}\right)$. For any non-overlapping partition $\left\{\mathcal{P}_{1},\mathcal{P}_{2},\dots,\mathcal{P}_{n}\right\}$, we denote the corresponding coarsening matrices $\bm{C}_{p},\bar{\bm{C}}_{w}$ and the projection matrix $\bm{\Pi}_{w}=\bm{C}_{w}^{\intercal}\bm{C}_{w}$. Let $\bm{A}\vcentcolon=\bm{W}^{\frac{1}{2}}\bm{S}\bm{W}^{\frac{1}{2}}$. Then we have

\begin{align}
\operatorname{Tr}\left(\bm{A}^{2}\right)\geq\operatorname{Tr}\left[\left(\bm{\Pi}_{w}\bm{A}\bm{\Pi}_{w}\right)^{2}\right]. \tag{24}
\end{align}
Proof.

We first transform $\operatorname{Tr}\left[\left(\bm{\Pi}_{w}\bm{A}\bm{\Pi}_{w}\right)^{2}\right]$ as follows:

\begin{align*}
\operatorname{Tr}\left[\left(\bm{\Pi}_{w}\bm{A}\bm{\Pi}_{w}\right)^{2}\right]=\operatorname{Tr}\left(\bm{\Pi}_{w}\bm{A}\bm{\Pi}_{w}\bm{\Pi}_{w}\bm{A}\bm{\Pi}_{w}\right)=\operatorname{Tr}\left(\bm{C}_{w}^{\intercal}\bm{C}_{w}\bm{A}\bm{C}_{w}^{\intercal}\bm{C}_{w}\bm{A}\bm{C}_{w}^{\intercal}\bm{C}_{w}\right)=\operatorname{Tr}\left(\bm{C}_{w}\bm{A}\bm{C}_{w}^{\intercal}\bm{C}_{w}\bm{A}\bm{C}_{w}^{\intercal}\right),
\end{align*}

and the last two equalities hold because $\bm{C}_{w}\bm{C}_{w}^{\intercal}=\bm{I}_{n}$.

We notice that $\bm{A}\vcentcolon=\bm{W}^{\frac{1}{2}}\bm{S}\bm{W}^{\frac{1}{2}}$ is symmetric, so we can apply the Poincaré separation theorem (cf. Appendix D.1 for the complete statement) to control the eigenvalues of $\bm{C}_{w}\bm{A}\bm{C}_{w}^{\intercal}$. Specifically, let $\lambda_{1}\geq\lambda_{2}\geq\dots\geq\lambda_{N}$ be the eigenvalues of $\bm{A}$, and let $\lambda^{(c)}_{1}\geq\dots\geq\lambda^{(c)}_{n}$ be the eigenvalues of $\bm{C}_{w}\bm{A}\bm{C}_{w}^{\intercal}$; for all $i\in[n]$ we have $\lambda_{i}\geq\lambda^{(c)}_{i}\geq 0$ (the non-negativity follows from the PSD-ness of $\bm{A}$), which further implies $\lambda_{i}^{2}\geq\left(\lambda^{(c)}_{i}\right)^{2}$ for all $i\in[n]$. We can therefore complete the proof with $\operatorname{Tr}\left(\bm{A}^{2}\right)=\sum_{i=1}^{N}\lambda_{i}^{2}\geq\sum_{i=1}^{n}\left(\lambda^{(c)}_{i}\right)^{2}=\operatorname{Tr}\left[\left(\bm{\Pi}_{w}\bm{A}\bm{\Pi}_{w}\right)^{2}\right]$. \diamondsuit
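The interlacing step can also be checked directly. The snippet below (a minimal sketch, again assuming $\bm{C}_{w}=(\bm{C}_{p}\bm{W}\bm{C}_{p}^{\intercal})^{-1/2}\bm{C}_{p}\bm{W}^{1/2}$, one construction satisfying $\bm{C}_{w}\bm{C}_{w}^{\intercal}=\bm{I}_{n}$ as used above) verifies $\lambda_{i}\geq\lambda_{i}^{(c)}\geq 0$ and the trace inequality (24) for a random PSD $\bm{S}$ and a random partition.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
N, n = 10, 4

B = rng.random((N, N)); S = B @ B.T           # PSD similarity matrix
m = rng.random(N); m /= m.sum()               # probability masses
labels = rng.integers(0, n, size=N); labels[:n] = np.arange(n)
Cp = np.eye(n)[:, labels]                     # 0/1 partition matrix (n x N)

Cw = np.diag((Cp @ m) ** -0.5) @ Cp @ np.diag(np.sqrt(m))
Pi_w = Cw.T @ Cw
A = np.diag(np.sqrt(m)) @ S @ np.diag(np.sqrt(m))

lam = np.sort(np.linalg.eigvalsh(A))[::-1]
lam_c = np.sort(np.linalg.eigvalsh(Cw @ A @ Cw.T))[::-1]

assert np.allclose(Cw @ Cw.T, np.eye(n))      # rows of Cw are orthonormal
assert np.all(lam[:n] >= lam_c - 1e-10) and np.all(lam_c >= -1e-10)
assert np.trace(A @ A) >= np.trace((Pi_w @ A @ Pi_w) @ (Pi_w @ A @ Pi_w)) - 1e-10
\end{verbatim}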