
Compressing network populations with modal networks reveals structural diversity

Alec Kirkley alec.w.kirkley@gmail.com Institute of Data Science, University of Hong Kong, Hong Kong Department of Urban Planning and Design, University of Hong Kong, Hong Kong Urban Systems Institute, University of Hong Kong, Hong Kong    Alexis Rojas Department of Computer Science, University of Helsinki, Helsinki 00100, Finland    Martin Rosvall Integrated Science Lab, Department of Physics, Umeå University, SE-901 87 Umeå, Sweden    Jean-Gabriel Young Department of Mathematics and Statistics, University of Vermont, Burlington VT, USA Vermont Complex Systems Center, University of Vermont, Burlington VT, USA
Abstract

Analyzing relational data consisting of multiple samples or layers involves critical challenges: How many networks are required to capture the variety of structures in the data? And what are the structures of these representative networks? We describe efficient nonparametric methods derived from the minimum description length principle to construct the network representations automatically. The methods input a population of networks or a multilayer network measured on a fixed set of nodes and output a small set of representative networks together with an assignment of each network sample or layer to one of the representative networks. We identify the representative networks and assign network samples to them with an efficient Monte Carlo scheme that minimizes our description length objective. For temporally ordered networks, we use a polynomial time dynamic programming approach that restricts the clusters of network layers to be temporally contiguous. These methods recover planted heterogeneity in synthetic network populations and identify essential structural heterogeneities in global trade and fossil record networks. Our methods are principled, scalable, parameter-free, and accommodate a wide range of data, providing a unified lens for exploratory analyses and preprocessing large sets of network samples.

I Introduction

A common way to measure a network is to gather multiple observations of the connectivity of the same nodes. Examples include the mobility patterns of a particular group of students encoded as a longitudinal set of co-location networks eagle2009inferring ; lehmann2019fundamental , measurements of connectivity among the same brain regions for different individuals sporns2010networks , or the observation of protein-protein relationships through a variety of different interaction mechanisms stark2006biogrid . These measurements can be viewed as a multilayer network Kivela14 consisting of one layer for each measurement of all links between the nodes. For generality, we consider them as a population of networks—a set of independent network measurements on the same set of nodes, either over time or across systems with consistent, aligned node labels. There often are regularities among such collections of measurements, but each sample may differ substantially from the next. Summarizing these measurements with robust statistical analyses can separate regularities from noise and simplify downstream analyses such as network visualization or regression  butts2003network ; Newman18b ; young2020bayesian ; Peixoto18 ; priebe2015statistical ; arroyo2021inference ; tang2018connectome ; lunagomez2020modeling ; wang2019common ; young2021reconstruction .

Most statistical methods for summarizing populations of networks share a similar approach. They model all the members of a population as realizations of a single representative network banks1994metric ; butts2003network ; Newman18a ; Peixoto18 ; le2018estimating ; lunagomez2020modeling , which can be retrieved by fitting the model in question to the observed population. However, the strong assumption that a single “modal” network best explains the observed populations can lead to a poor representation of the data at hand la2016gibbs ; YKN22Clustering . For instance, accurately modeling a population of networks recording face-to-face interactions between elementary school pupils requires at least two representative networks if the data include networks observed during class and recess stehle2011high . Modeling the measurements with a single network will most likely neglect essential variations in the pupils’ face-to-face interactions, akin to the oversights that arise when summarizing a multimodal probability distribution with only its mean.

Recent research has examined related problems and led to, for example, methods for detecting abrupt regime changes in temporal series of networks peel2015detecting ; peixoto2018change , pooling information across subsets of layers of multiplex networks de2015structural , and embedding nodes in common subspaces across network layers nielsen2018multiple ; wang2019joint ; arroyo2021inference . Several recent contributions have addressed the problem of summarizing populations of networks when multiple distinct underlying network representations are needed, using mixtures of parametric models stanley2016clustering ; signorelli2020model ; mantziou2021 ; yin2022finite ; YKN22Clustering , latent space models durante2017nonparametric , or generative models based on ad hoc graph distance measures la2016gibbs .

These methods cluster network populations with good performance but have some significant drawbacks. None of the methods discussed, except Ref. la2016gibbs , outputs a single sparse representative network for each cluster; instead, they require handling ensembles of network structures, which makes downstream applications such as network visualization or regression cumbersome. Most of these methods also require potentially unrealistic modeling assumptions about the structure of the clusters—for example, that stochastic block models or random dot product graphs can model all network structures in the clusters. Specifying a generative model for the modal structures also has the downside of often requiring complex and time-consuming methods to perform the within-cluster estimation. Perhaps most critically, existing approaches require either specifying the number of modes ahead of time or resorting to regularization with ad hoc penalties mantziou2021 ; YKN22Clustering ; de2015structural ; durante2017nonparametric not motivated directly by the clustering objective, or with approximate information criteria la2016gibbs ; signorelli2020model ; yin2022finite poorly adapted to network problems. Overall, current approaches for clustering network populations do not provide a principled solution for model selection and often demand extensive tuning and significant computational overhead from fitting the model to several choices of the number of clusters.

Here we introduce nonparametric inference methods that overcome these obstacles and provide a coherent framework through which to approach the problem of clustering network populations or multiplex network layers, while extracting a representative modal network to summarize each cluster. Our solution employs the minimum description length principle, which allows us to derive an objective function that favors parsimonious representations in an information-theoretic sense and selects the number and composition of representative modal networks automatically from first principles. We first develop a fast Monte Carlo scheme for identifying the configuration of measurement clusters and modal networks that minimizes our description length objective. We then extend our framework to account for special cases of interest: bipartite and directed networks, and contiguous clusters in which each cluster contains a consecutive range of the ordered networks. We show how to solve the latter problem in polynomial run time with a dynamic program PAY2022 . We demonstrate our methods in applications involving synthetic and real network data, finding that they can effectively recover planted network modes and clusters even with considerable noise. Our methods also provide a concise and meaningful summary of real network populations from applications in global trade and macroevolutionary research.

II Methods

II.1 Minimum description length objective

For our clustering method, we rely on the minimum description length (MDL) principle: the best model among a set of candidate models is the one that permits the greatest compression—or shortest description—of a dataset rissanen1978 . The MDL principle provides a principled criterion for statistical model selection and has consequently been employed in various applications ranging from regression to time series analysis to clustering hansen2001mdl . A large body of research uses the MDL principle for clustering data, including studies on MDL-based methods for mixture models that accommodate continuous georgieva2011cluster ; tabor2014cross and categorical data li2004entropy , as well as methods that are based on more general probabilistic generative models narasimhan2005q . The MDL approach has also been applied to complex network data, most notably for community detection algorithms to cluster nodes within a network Rosvall07 ; Peixoto14a ; kirkley2022spatial and for decomposing graphs into subgraphs koutra2014vog ; wegner2014subgraph ; bloem2020large ; young2021hypergraph ; bouritsas2021partition , but also for clustering entire partitions of networks Kirkley22Reps . Our methods are similar in spirit to the one presented in ref. Kirkley22Reps for identifying representative community divisions among a set of plausible network partitions. Both approaches involve transmitting first a set of representatives and then the dataset itself by describing how each partition or network differs from its corresponding representative. However, the methods differ substantially in their details since they address fundamentally different questions.

We consider an experiment in which the initial data are a population of networks consisting of $S$ undirected, unweighted networks $\mathcal{D}=\{\bm{D}^{(1)},\ldots,\bm{D}^{(S)}\}$ on a set of $N$ labeled nodes. The networks record, for instance, the co-location patterns among kids in a class of $N$ students over $S$ class periods.

We aim to summarize these data with $K$ modal networks $\mathcal{A}=\{\bm{A}^{(1)},\ldots,\bm{A}^{(K)}\}$ (also undirected and unweighted) on the same set of nodes, with associated clusters of networks $\mathcal{C}=\{C_{1},\ldots,C_{K}\}$, where $C_{k}$ comprises networks that are similar to the mode $\bm{A}^{(k)}$. This summary would allow researchers to, for instance, perform all downstream network analyses on a small set of representative networks—the modes—instead of a large set of networks likely to include measurement errors and from which it is difficult to draw valid conclusions.

We assume for simplicity of presentation that all networks in $\mathcal{D}$ and $\mathcal{A}$ have no self- or multi-edges, although we can account for them straightforwardly. While $K$ can be fixed if desired, we assume that it is unknown and must be determined from regularities in the data.

To select among all the possible modes and assignments of networks to clusters, we employ information theory and construct an objective function that quantifies how much information is needed to communicate the structure of the network population $\mathcal{D}$ to a receiver. Clustering networks in groups of mostly similar instances allows us to communicate the population $\mathcal{D}$ efficiently in three steps: first the modes, then the clusters, and finally the networks $\mathcal{D}$ themselves as a series of small modifications to the modes $\mathcal{A}$. The MDL principle tells us that any compression achieved in this way reveals modes and clusters that are genuine regularities of the population rather than noise rissanen1978 .

We first establish a baseline for the code length: the number of bits needed to communicate $\mathcal{D}$ without using any regularities. One way to do this is to first communicate the parameters of the population at a negligible information cost (the population size $S$, the number of nodes $N$, and the total number of edges $E$ in all networks of $\mathcal{D}$) and then transmit the population $\mathcal{D}$ directly. There are ${N\choose 2}$ possible edge positions in each of the $S$ undirected networks in $\mathcal{D}$, or $S{N\choose 2}$ possible edge positions for the whole population. So these networks can be configured in ${S{N\choose 2}\choose E}$ ways. It thus takes approximately

\mathcal{L}_{0}(\mathcal{D})=\log{S{N\choose 2}\choose E}    (1)

bits to transmit these networks to a receiver. (We use the convention $\log\equiv\log_{2}$ for brevity.) Applying Stirling’s approximation $\log x!\approx x\log x-x/\ln 2$, we obtain

\mathcal{L}_{0}(\mathcal{D})\approx S{N\choose 2}H_{b}\!\left(\frac{E}{S{N\choose 2}}\right)    (2)

written in terms of the binary Shannon entropy

H_{b}(p)=-p\log p-(1-p)\log(1-p).    (3)
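
To make these quantities concrete, the following short sketch (in Python, assuming each network is stored as a set of edges, with each edge a pair of node labels; the function names are illustrative, not part of the paper's released code) computes the binary entropy of Eq. (3) and the Stirling-approximated baseline of Eq. (2).

```python
from math import comb, log2

def binary_entropy(p):
    """Binary Shannon entropy H_b(p) in bits, Eq. (3)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def baseline_description_length(networks, N):
    """Approximate baseline L_0(D) of Eq. (2). `networks` is a list of edge
    sets on the same N labeled nodes."""
    S = len(networks)
    E = sum(len(edges) for edges in networks)
    pairs = S * comb(N, 2)          # total number of possible edge positions
    return pairs * binary_entropy(E / pairs)
```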

In practice, we expect to need many fewer bits than $\mathcal{L}_{0}$ to communicate $\mathcal{D}$, because the population of networks will often have regularities. We propose a multi-part encoding that identifies such regularities by grouping similar networks in clusters $\mathcal{C}$ with modes $\mathcal{A}$, which proceeds as follows. First, we send a small number of modes $\mathcal{A}$ in their entirety, which ideally capture most of the heterogeneity in the population $\mathcal{D}$. This step is costly but will save us information later. We then send the network clusters $\mathcal{C}$ by transmitting the cluster label of each network $s\in\mathcal{D}$. Finally, we transmit the edges of the networks in each cluster, using the already transmitted modes as a starting point to compress this part of the encoding significantly. The expected code length can be quantified using simple combinatorial expressions, and the configuration of modes $\mathcal{A}$ and clusters $\mathcal{C}$ that minimizes the total expected code length—the MDL configuration—provides a succinct summary of the data $\mathcal{D}$. Figure 1 summarizes the transmission process and the individual description length contributions.

Figure 1: Information transmission scheme. (a) Example population of networks $\mathcal{D}$, with $S=9$ networks of $N=8$ nodes each. (b) Representative modes $\{\bm{A}^{(k)}\}$ with their corresponding clusters of networks $\{C_{k}\}$. First, each mode network is transmitted individually in its entirety, with information content $\mathcal{L}(\bm{A}^{(k)})$ given by Eq. (4). Then, networks in the population are assigned to disjoint clusters surrounding each mode, requiring an information content given by Eq. (6). Finally, all the networks $\bm{D}^{(s)}$ in each cluster $C_{k}$ are transmitted, given the number of false-negative and false-positive edges $n_{k}$ and $p_{k}$ in the cluster (represented with dotted red and solid blue lines, respectively). The information content of this step is given by $\ell_{k}$ in Eq. (10). Different choices of clusters and modes lead to a different total information content, and the aim is to identify the clusters and modes that minimize this information content.

The expected length of this multi-part encoding is the sum of the description lengths of each part of the code that has significant communication costs. The modes are the first objects that incur such costs. Following the same reasoning as before, we denote the number of edges in mode $k$ as $M_{k}$ and conclude that we can transmit the positions of the occupied edges in mode $\bm{A}^{(k)}$ using approximately

\mathcal{L}(\bm{A}^{(k)})=\log{{N\choose 2}\choose M_{k}}\approx{N\choose 2}H_{b}\!\left(\frac{M_{k}}{{N\choose 2}}\right)    (4)

bits, where the second expression results from a Stirling approximation as in Eq. (2). We can therefore transmit all the modes with a total code length of

\mathcal{L}(\mathcal{A})=\sum_{k=1}^{K}\mathcal{L}(\bm{A}^{(k)})    (5)

bits.

The next step is to transmit the cluster label $k$ of each network in $\mathcal{D}$. For this part of the code, we first send the number of networks $S_{k}$ in each cluster $k=1,\ldots,K$ at a negligible cost and then specify a particular clustering compatible with these constraints. The multinomial coefficient ${S\choose S_{1}\,S_{2}\,\cdots\,S_{K}}$ gives the total number of possible combinations of these cluster labels. The information content of this step is thus

\mathcal{L}(\mathcal{C})=\log{S\choose S_{1}\,S_{2}\,\cdots\,S_{K}}\approx S\,H\big(\{S_{k}/S\}\big),    (6)

where we again use the Stirling approximation and where

H\big(\{q_{k}\}\big)=-\sum_{k=1}^{K}q_{k}\log q_{k}    (7)

is the Shannon entropy of a distribution $\{q_{k}\}$.

Finally, we transmit the network population $\mathcal{D}$ by sending the differences between the networks in each cluster and their associated mode. To calculate the length of this part of the code, we focus on a particular cluster $C_{k}$ and count the number of times we will have to remove an edge from the mode $\bm{A}^{(k)}$ when specifying the structure of the networks in its cluster using $\bm{A}^{(k)}$ as a reference. We call these edges false negatives, and count them as

n_{k}=\sum_{s\in C_{k}}\left|\bm{A}^{(k)}\setminus\bm{D}^{(s)}\right|,    (8)

where we interpret $\bm{D}^{(s)}$ and $\bm{A}^{(k)}$ as sets of edges, so the summand is the number of edges in mode $k$ that are not in network $s$. Similarly, we also require the number of edges that occur in the networks of cluster $k$ but not in the mode—the number of false positives,

p_{k}=\sum_{s\in C_{k}}\left|\bm{D}^{(s)}\setminus\bm{A}^{(k)}\right|.    (9)

Like the cluster sizes $S_{k}$ and the edge counts per cluster $M_{k}$, the pairs $(n_{k},p_{k})$ can be communicated to the receiver at a comparatively negligible cost, and we ignore this contribution in our calculations.

To estimate the information content of this part of the transmission, we count the number of configurations of false-negative and false-positive edges in $C_{k}$. Focusing first on the false negatives—the edges that must be deleted—we note that of the $S_{k}M_{k}$ edges in the $S_{k}$ copies of the mode of cluster $k$, $n_{k}$ will be false-negative edges, which can be configured in ${S_{k}M_{k}\choose n_{k}}$ ways. Similarly, using the shorthand $M_{k}^{\ast}={N\choose 2}-M_{k}$ to denote the number of unoccupied pairs of nodes in mode $k$, there are $S_{k}M_{k}^{\ast}$ locations in which we must place $p_{k}$ false-positive edges, for a total of ${S_{k}M_{k}^{\ast}\choose p_{k}}$ possible configurations of false-positive edges. The total information content required for transmitting the locations of the false-negative and false-positive edges of every network in cluster $k$ is thus

\ell_{k}:=\mathcal{L}(\{\bm{D}^{(s)}\,|\,s\in C_{k}\}\,|\,\bm{A}^{(k)})=\log{S_{k}M_{k}\choose n_{k}}+\log{S_{k}M_{k}^{\ast}\choose p_{k}},    (10)

which we approximate as

\ell_{k}\approx S_{k}M_{k}H_{b}\!\left(\frac{n_{k}}{S_{k}M_{k}}\right)+S_{k}M_{k}^{\ast}H_{b}\!\left(\frac{p_{k}}{S_{k}M_{k}^{\ast}}\right).    (11)

Summing over all clusters,

\mathcal{L}(\mathcal{D}|\mathcal{A},\mathcal{C})=\sum_{k=1}^{K}\ell_{k},    (12)

we obtain the total information content of the final step in the transmission process.
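
As a concrete illustration, a minimal sketch of the correction cost of Eqs. (8)–(11), reusing binary_entropy and comb from the sketch above and again assuming each network and mode is an edge set, might look as follows.

```python
def cluster_transmission_cost(cluster, mode, N):
    """Approximate l_k of Eq. (11): the cost of transmitting the networks in
    `cluster` (a list of edge sets) as corrections to `mode` (an edge set)."""
    Sk, Mk = len(cluster), len(mode)
    Mk_star = comb(N, 2) - Mk
    n_k = sum(len(mode - D) for D in cluster)   # false negatives, Eq. (8)
    p_k = sum(len(D - mode) for D in cluster)   # false positives, Eq. (9)
    cost = 0.0
    if Mk > 0:
        cost += Sk * Mk * binary_entropy(n_k / (Sk * Mk))
    if Mk_star > 0:
        cost += Sk * Mk_star * binary_entropy(p_k / (Sk * Mk_star))
    return cost
```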

We obtain the total description length $\mathcal{L}(\mathcal{D})$ by adding the contributions of Eqs. (5), (6), and (12), as

\mathcal{L}(\mathcal{D})=\mathcal{L}(\mathcal{A})+\mathcal{L}(\mathcal{C})+\mathcal{L}(\mathcal{D}|\mathcal{A},\mathcal{C}).    (13)

This objective function allows for efficient optimization because we can express it as a sum of the cluster-level description lengths

\mathcal{L}_{k}(\bm{A}^{(k)},C_{k})=\mathcal{L}(\bm{A}^{(k)})+S_{k}\log\left(\frac{S}{S_{k}}\right)+\ell_{k},    (14)

giving

\mathcal{L}(\mathcal{D})=\sum_{k=1}^{K}\mathcal{L}_{k}(\bm{A}^{(k)},C_{k}).    (15)

Equations (4) and (10) provide explicit expressions for $\mathcal{L}(\bm{A}^{(k)})$ and $\ell_{k}$.

Equation (15) gives the total description length of the data $\mathcal{D}$ under our multi-part transmission scheme. By minimizing this objective function we identify the best configurations of modes $\mathcal{A}$ and clusters $\mathcal{C}$. A good configuration $\{\mathcal{A},\mathcal{C}\}$ will allow us to transmit a large portion of the information in $\mathcal{D}$ through the modes alone. If we use too many modes, the description length will increase because they are costly to communicate in full. And if we use too few, the description length will also increase because we will have to send lengthy messages describing how mismatched networks and modes differ. Hence, through the principle of parsimony, Eq. (15) favors descriptions with the number of clusters $K$ as small as possible but not smaller.
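
Putting the pieces together, a hedged sketch of the full objective of Eqs. (13)/(15), reusing the helpers defined above, might look like the following. Dividing its minimized value by the baseline of Eq. (2) gives the inverse compression ratio defined later in Eq. (17).

```python
def total_description_length(clusters, modes, N):
    """Total description length of Eqs. (13)/(15). `clusters` is a list of
    lists of edge sets and `modes` the corresponding list of mode edge sets."""
    S = sum(len(C) for C in clusters)
    pairs = comb(N, 2)
    total = 0.0
    for C, A in zip(clusters, modes):
        Sk = len(C)
        total += pairs * binary_entropy(len(A) / pairs)   # mode cost, Eq. (4)
        total += Sk * log2(S / Sk)                        # cluster-label cost, Eq. (6)
        total += cluster_transmission_cost(C, A, N)       # correction cost, Eq. (11)
    return total
```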

This framework can be modified to accommodate populations of bipartite or directed networks. For the bipartite case, we make the transformations ${N\choose 2}\to N_{1}N_{2}$ and $M_{k}^{\ast}\to N_{1}N_{2}-M_{k}$, where $N_{1}$ and $N_{2}$ are the numbers of nodes in each of the two groups. This modification reduces the number of available positions for potential edges. Similarly, for the directed case, we can make the transformations ${N\choose 2}\to N(N-1)$ and $M_{k}^{\ast}\to N(N-1)-M_{k}$, which increases the number of available edge positions.
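
In the sketches above, the only change these variants require is the count of available edge positions per network; a small illustrative helper (the name and interface are ours, not the paper's) could be:

```python
def num_edge_positions(N, kind="undirected", N1=None, N2=None):
    """Number of possible edge positions in one network; replaces comb(N, 2)
    in the expressions above for bipartite or directed populations."""
    if kind == "undirected":
        return comb(N, 2)
    if kind == "directed":
        return N * (N - 1)
    if kind == "bipartite":
        return N1 * N2
    raise ValueError(kind)
```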

II.2 Optimization and model selection

Since Eq. (15) has a large support, is not convex, and has many local optima, a stochastic optimization method is a natural choice for finding reasonable solutions rapidly. We exploit the objective function’s decoupling into a sum over clusters $k$ and implement an efficient merge-split Monte Carlo method for the search peixoto2020merge ; Kirkley22Reps . The method greedily optimizes $\mathcal{L}(\mathcal{D})$ using moves that involve merging and splitting clusters of networks $\bm{D}^{(s)}\in\mathcal{D}$.

Our merge-split algorithm minimizes the description length in Eq. (15) by performing one of the following moves, selected uniformly at random, and accepting the move as long as it results in a reduction of the description length:

1. Reassignment: Pick a network $s$ at random and move it from its current cluster $C_{k}$ to the cluster $C_{k^{\prime}}$ that results in the greatest decrease in the description length. Compute the modes $\bm{A}^{(k)}$ and $\bm{A}^{(k^{\prime})}$ that minimize the cluster-level description lengths $\mathcal{L}_{k}(\bm{A}^{(k)},C_{k})$ and $\mathcal{L}_{k^{\prime}}(\bm{A}^{(k^{\prime})},C_{k^{\prime}})$ using Eq. (14) and the procedure described below, conditioned on the networks in $C_{k}$ and $C_{k^{\prime}}$.

2. Merge: Pick two clusters $C_{k^{\prime}}$ and $C_{k^{\prime\prime}}$ at random and merge them into a single cluster $C_{k}$. Compute the mode $\bm{A}^{(k)}$ that minimizes the cluster-level description length $\mathcal{L}_{k}(\bm{A}^{(k)},C_{k})$ using Eq. (14) and the procedure described below, conditioned on the networks in $C_{k}$. Finally, compute the change in the description length that results from this merge.

3. Split: Pick a cluster $C_{k}$ at random and split it into two clusters $C_{k^{\prime}}$ and $C_{k^{\prime\prime}}$ using the following 2-means algorithm. First assign every network in $C_{k}$ to the cluster $C_{k^{\prime}}$ or $C_{k^{\prime\prime}}$ at random. Refine the assignments by successively moving every network to the cluster, $C_{k^{\prime}}$ or $C_{k^{\prime\prime}}$, that results in the greater decrease in the description length, and compute the modes $\bm{A}^{(k^{\prime})}$ and $\bm{A}^{(k^{\prime\prime})}$ that minimize the cluster-level description lengths $\mathcal{L}_{k^{\prime}}(\bm{A}^{(k^{\prime})},C_{k^{\prime}})$ and $\mathcal{L}_{k^{\prime\prime}}(\bm{A}^{(k^{\prime\prime})},C_{k^{\prime\prime}})$, conditioned on the networks now in $C_{k^{\prime}}$ and $C_{k^{\prime\prime}}$. After convergence of this 2-means-style algorithm, compute the change in the description length that results from the split of cluster $C_{k}$.

4. Merge-split: Pick two clusters at random, merge them as in move 2, then perform move 3 on this merged cluster. These two moves in direct succession help reassign multiple networks simultaneously; their addition to the move set improves the algorithm’s performance.

Since these moves modify only one or two clusters, the change in the global description length $\mathcal{L}(\mathcal{D})$ can be recomputed quickly as updates to the cluster-level description lengths in Eq. (14). Every time a mode is needed for these calculations, we use the mode that minimizes the cluster-level description length $\mathcal{L}_{k}(\bm{A}^{(k)},C_{k})$ in Eq. (14). To find this optimal mode efficiently, we start with the “complete” mode

\bm{A}_{\mathrm{comp}}^{(k)}=\bigcup_{s\in C_{k}}\bm{D}^{(s)},    (16)

which has an edge between nodes $i$ and $j$ if at least one network in the cluster contains that edge. We then greedily remove edges from $\bm{A}_{\mathrm{comp}}^{(k)}$ in increasing order of their occurrence in the networks of $C_{k}$—starting with edges found in only a single network and going up from there—and update the cluster-level description length as we go. After removing all edges from $\bm{A}_{\mathrm{comp}}^{(k)}$, the mode giving the lowest cluster-level description length is chosen as the mode for the cluster. This approach is locally optimal under a few assumptions about the sparsity of the networks and the composition of edges in the clusters (see Appendix A for details).
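
A minimal sketch of this greedy mode search, again assuming the edge-set representation and the helpers from Sec. II.1, is shown below. For clarity it recomputes the cluster-level cost from scratch after each removal; an efficient implementation would instead update the cost incrementally.

```python
from collections import Counter

def greedy_mode(cluster, N, S):
    """Greedy mode search: start from the union of all edges in the cluster
    (Eq. (16)) and remove edges in increasing order of occurrence, keeping the
    mode giving the smallest cluster-level description length, Eq. (14)."""
    counts = Counter(e for D in cluster for e in D)
    removal_order = sorted(counts, key=counts.get)    # rarest edges removed first
    mode = set(counts)                                # the "complete" mode
    Sk, pairs = len(cluster), comb(N, 2)

    def cluster_dl(A):
        return (pairs * binary_entropy(len(A) / pairs)      # mode cost, Eq. (4)
                + Sk * log2(S / Sk)                          # cluster-label term
                + cluster_transmission_cost(cluster, A, N))  # correction cost

    best_mode, best_dl = set(mode), cluster_dl(mode)
    for e in removal_order:
        mode.discard(e)
        dl = cluster_dl(mode)
        if dl < best_dl:
            best_mode, best_dl = set(mode), dl
    return best_mode, best_dl
```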

We run the algorithm by starting with $K_{0}$ initial clusters (this choice has a negligible effect on the results; see Appendix B) and stop when a specified number of consecutive moves all result in rejections, indicating that the algorithm has likely converged. The worst-case complexity of this algorithm is roughly $O(NS)$ (the worst case is a split move right at the start; see Appendix B). Appendix A details the entire algorithm, and Appendix B provides additional tests of the algorithm, such as its robustness for different choices of $K_{0}$.
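
The following deliberately simplified skeleton illustrates the overall structure of such a merge-split search; it includes only the merge and split moves, uses a crude random split proposal in place of the 2-means refinement, and is a sketch rather than the reference implementation released with the paper.

```python
import random

def merge_split_mdl(networks, N, K0=1, max_fails=100):
    """Schematic merge-split minimization of Eq. (15): `networks` is a list of
    edge sets; returns clusters (lists of network indices) and their modes."""
    S = len(networks)
    clusters = [list(range(k, S, K0)) for k in range(K0)]   # arbitrary start

    def cluster_dl(idxs):
        # cluster-level description length, Eq. (14), via the greedy mode search
        return greedy_mode([networks[i] for i in idxs], N, S)[1]

    fails = 0
    while fails < max_fails:
        if random.random() < 0.5 and len(clusters) > 1:      # attempt a merge
            a, b = random.sample(range(len(clusters)), 2)
            merged = clusters[a] + clusters[b]
            if cluster_dl(merged) < cluster_dl(clusters[a]) + cluster_dl(clusters[b]):
                clusters = [C for i, C in enumerate(clusters) if i not in (a, b)] + [merged]
                fails = 0
                continue
        else:                                                # attempt a split
            a = random.randrange(len(clusters))
            C = clusters[a]
            if len(C) > 1:
                half = set(random.sample(C, len(C) // 2))    # crude split proposal
                part1 = [i for i in C if i in half]
                part2 = [i for i in C if i not in half]
                if cluster_dl(part1) + cluster_dl(part2) < cluster_dl(C):
                    clusters = [D for i, D in enumerate(clusters) if i != a] + [part1, part2]
                    fails = 0
                    continue
        fails += 1
    modes = [greedy_mode([networks[i] for i in C], N, S)[0] for C in clusters]
    return clusters, modes
```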

To diagnose the quality of a solution, we compute the inverse compression ratio

\eta(\mathcal{D})=\mathcal{L}_{\mathrm{MDL}}(\mathcal{D})/\mathcal{L}_{0}(\mathcal{D}),    (17)

where $\mathcal{L}_{\mathrm{MDL}}(\mathcal{D})$ is the minimum value of $\mathcal{L}(\mathcal{D})$ over all configurations of $\mathcal{A},\mathcal{C}$, given by the algorithm after termination, and $\mathcal{L}_{0}$ is given in Eq. (2). Equation (17) tells us how much better we can compress the network population $\mathcal{D}$ by using our multi-step encoding than by using the naïve fixed-length code to transmit all networks individually. If $\eta(\mathcal{D})<1$, our model compresses the data $\mathcal{D}$, and if $\eta(\mathcal{D})>1$, it does not, because we waste too much information in the initial transmission steps.

II.3 Contiguous clusters

In Sec. II.2, we described a merge-split Monte Carlo algorithm to identify the clusters $\mathcal{C}$ and modes $\mathcal{A}$ that minimize the description length in Eq. (15). This algorithm samples the space of unconstrained partitions $\mathcal{C}$ of the network population $\mathcal{D}$. However, in many applications, particularly in longitudinal studies, we may only be interested in constructing contiguous clusters, where each cluster is a set of networks whose adjacent indices $s\in\{1,\ldots,S\}$ indicate contiguity of some form (temporal, spatial, or otherwise). Such constraints reduce the space of possible clusterings $\mathcal{C}$ drastically, and we can minimize the description length exactly (up to the greedy heuristic for the mode construction) using a dynamic program jackson2005algorithm ; bellman2013dynamic ; PAY2022 .

Before we introduce an optimization method for this problem, we require a small modification to the cluster-level description length in Eq. (14) so that it accurately reflects the constrained space of ordered partitions $\mathcal{C}$ we are considering. In our derivation of the description length, we assumed that the receiver knows the sizes $\{S_{k}\}$ of the clusters in $\mathcal{C}$. If we transmit these sizes in the order of the clusters they describe, the receiver will also know the exact clusters $\mathcal{C}$, since knowing the sizes $\{S_{k}\}$ is equivalent to knowing the cluster boundaries in this contiguous case. We can therefore ignore the term $S_{k}\log(S/S_{k})$ in Eq. (14) that tells us how much information is required to transmit the exact cluster configuration. This modification results in a new, shorter description length

\mathcal{L}^{(\mathrm{cont})}_{k}(\bm{A}^{(k)},C_{k})=\mathcal{L}(\bm{A}^{(k)})+\ell_{k}    (18)

and a new global objective

\mathcal{L}_{\mathrm{cont}}(\mathcal{D})=\sum_{k}\mathcal{L}^{(\mathrm{cont})}_{k}(\bm{A}^{(k)},C_{k}).    (19)

Since the objective in Eq. (19) is a sum of independent cluster-level terms, minimizing this description length over contiguous clusterings admits a dynamic programming solution jackson2005algorithm ; bellman2013dynamic ; PAY2022 that identifies the true optimum in polynomial time.

The algorithm is constructed by recursing on $\mathcal{L}^{(i)}_{\mathrm{MDL}}$, the minimum description length of the first $i$ networks in $\mathcal{D}$ according to Eq. (19). Since the objective function decomposes as a sum over clusters, for any $j\in[1,S]$ the MDL can be calculated as

\mathcal{L}^{(j)}_{\mathrm{MDL}}=\min_{i\in[1,j]}\left\{\mathcal{L}^{(i-1)}_{\mathrm{MDL}}+\mathcal{L}^{(\mathrm{cont})}_{k}([i,j])\right\},    (20)

where we set the base case to $\mathcal{L}^{(0)}_{\mathrm{MDL}}=0$ and define $\mathcal{L}^{(\mathrm{cont})}_{k}([i,j])$ as the description length of the cluster of networks with indices $\{i,\ldots,j\}$, according to Eq. (18), with the mode computed using the greedy procedure described in Sec. II.2. Once we recurse to $\mathcal{L}^{(S)}_{\mathrm{MDL}}$, we have found the MDL of our complete dataset, and keeping track of the minimizing $i$ in Eq. (20) for every $j$ allows us to reconstruct the clusters.

In practice, the recursion can be implemented from the bottom up, starting with $\mathcal{L}^{(1)}_{\mathrm{MDL}}$, then $\mathcal{L}^{(2)}_{\mathrm{MDL}}$, and so on. The computational bottleneck for calculating $\mathcal{L}^{(j)}_{\mathrm{MDL}}$ is finding the mode of a candidate cluster $j$ times for each evaluation of Eq. (20) (once for each $i=1,\ldots,j$), leading to an overall complexity of $O(jN\log N)$ for this step. Summing over $j\in[1,S]$, the overall time complexity of the dynamic programming algorithm is $O(S^{2}N\log N)$, which we verify numerically in Appendix B.
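
A sketch of this bottom-up recursion, reusing the greedy mode search above to score each candidate contiguous segment (the segment cost implements Eq. (18), i.e., it omits the cluster-label term), might look like the following; it is a simplified illustration rather than the released implementation.

```python
def contiguous_mdl(networks, N):
    """Dynamic program of Eq. (20): optimal contiguous clustering (up to the
    greedy mode heuristic). Returns the MDL and the cluster index ranges."""
    S = len(networks)
    pairs = comb(N, 2)

    def segment_dl(i, j):
        # L^(cont) of Eq. (18) for networks i..j (0-indexed, inclusive)
        cluster = networks[i:j + 1]
        mode, _ = greedy_mode(cluster, N, S)
        return (pairs * binary_entropy(len(mode) / pairs)
                + cluster_transmission_cost(cluster, mode, N))

    best = [0.0] * (S + 1)        # best[j] = MDL of the first j networks
    back = [0] * (S + 1)
    for j in range(1, S + 1):
        best[j], back[j] = min(
            (best[i] + segment_dl(i, j - 1), i) for i in range(j)
        )
    # recover the contiguous clusters by walking the back pointers
    boundaries, j = [], S
    while j > 0:
        boundaries.append((back[j], j - 1))
        j = back[j]
    return best[S], boundaries[::-1]
```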

III Results

We test our methods on a range of real and synthetic example network populations. First, we show that our algorithms can recover synthetically generated clusters and modes with high accuracy despite considerable noise levels. Applied to worldwide networks of food imports and exports, we find a strong compression that exploits differences between categories of products and the locations in which they are produced. We then apply our method for contiguous clustering of ordered network populations to a set of networks representing the fossil record over ordered geological stages in the last 500 million years rojas_multiscale_2021 . We examine bipartite and unipartite representations of these systems and find close alignment between our inferred clusters and known global biotic transitions, including those triggered by mass extinction events.

III.1 Reconstruction of synthetic network populations

To demonstrate that the methods presented in Secs. II.2 and II.3 can effectively identify modes and clusters in network populations, we test their ability to recover the underlying modes and clusters generated from the heterogeneous population model introduced in Ref. YKN22Clustering . We examine the robustness of these methods under varying noise levels that influence the similarity of the generated networks to the cluster’s mode.

The generative model in YKN22Clustering supposes (using different notation) that we are given $K$ modes $\mathcal{A}$ as well as the cluster assignments $\mathcal{C}$ of the networks $\mathcal{D}$. Each network $s\in C_{k}$ is generated by first taking each edge $(i,j)\in\bm{A}^{(k)}$ independently and adding it to $\bm{D}^{(s)}$ with probability $\alpha_{k}$ (the true-positive rate). Then, each of the $M_{k}^{\ast}$ possible edges absent from $\bm{A}^{(k)}$ is added to $\bm{D}^{(s)}$ with probability $\beta_{k}$ (the false-positive rate). After performing this procedure for all clusters, the end result is a heterogeneous population of networks $\mathcal{D}$ with $K$ underlying modes, with noise in the networks of $C_{k}$ surrounding each mode $\bm{A}^{(k)}$ determined by the rates $\alpha_{k}$ and $\beta_{k}$. The higher the true-positive rate $\alpha_{k}$ and the lower the false-positive rate $\beta_{k}$, the more closely the networks in cluster $C_{k}$ resemble their corresponding mode $\bm{A}^{(k)}$.
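
As an illustration, one network can be sampled from a planted mode as follows; this is a sketch assuming the edge-set convention used above (edges stored as (i, j) tuples with i < j), not code from Ref. YKN22Clustering.

```python
import random
from itertools import combinations

def sample_from_mode(mode, N, alpha, beta):
    """Draw one network from a planted mode: keep each mode edge with
    probability alpha (true-positive rate) and add each non-mode node pair
    with probability beta (false-positive rate)."""
    edges = {e for e in mode if random.random() < alpha}
    for pair in combinations(range(N), 2):
        if pair not in mode and random.random() < beta:
            edges.add(pair)
    return edges
```

In the flip-probability experiments described below, every cluster uses alpha = 1 - p and beta = p.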

Figure 2: Recovery of planted modes and their clusters in synthetic network populations. Various aspects of the recovery performance are plotted for the three experiments described in Sec. III.1. (a) Network distance, as quantified by the average Hamming distance between the true and inferred modes (modes 1 and 3 in Fig. 1), for various flip probabilities $p$. (b) Partition distance, given by one minus the normalized mutual information between the true and inferred clusterings of the network population. (c) Inverse compression ratio, given in Eq. (17). Each data point is an average over 200 realizations of the population for the corresponding value of the flip probability, and error bars correspond to three standard errors in the mean.

Employing Bayesian inference of the modes and cluster assignments as in Ref. YKN22Clustering involves adding prior probability distributions over the modes $\mathcal{A}$ and cluster assignments $\mathcal{C}$ to the heterogeneous network model YKN22Clustering . With a specific choice of priors on the modes and cluster sizes, Eq. (15) is precisely the equation giving us the maximum a posteriori (MAP) estimators of $\mathcal{A}$ and $\mathcal{C}$ in this model. We defer the details of this correspondence to Appendix D.

For our experiments, we use two modes, modes 1 and 3 from Fig. 1, as the planted modes $\mathcal{A}_{\mathrm{true}}$ we aim to recover. To provide a single intuitive parameter quantifying the noise level in the generative model, we choose the true- and false-positive rates to satisfy $p=\beta_{1}=\beta_{3}=1-\alpha_{1}=1-\alpha_{3}$ for each run. Viewing the networks as binary adjacency matrices, the parameter $p$ corresponds to the probability of flipping entries of the matrix from 0 to 1 and vice versa when constructing a network in its assigned cluster. We denote the parameter $p$ as the “flip probability” to emphasize this interpretation (the same formulation as in Ref. YKN22Clustering ). A flip probability $p=0$ corresponds to clusters of networks identical to the cluster modes, and a flip probability of $p=0.5$ corresponds to completely random networks with no clustering in the population. We thus expect it to be easy to recover the planted modes $\mathcal{A}_{\mathrm{true}}$ and clusters $\mathcal{C}_{\mathrm{true}}$ for $p=0$, and the problem becomes more and more difficult as we approach $p=0.5$.

We run three separate recovery experiments to test both the unconstrained and contiguous description length objectives in Eqs. (15) and (19), respectively. For the unconstrained objective, in each run we generate a population of $S$ networks from the model described above, with each network generated from either mode 1 or mode 3 at random with equal probability. We then identify the modes $\mathcal{A}_{\mathrm{MDL}}$ and clusters $\mathcal{C}_{\mathrm{MDL}}$ that minimize the objective in Eq. (15) using the merge-split algorithm detailed in Sec. II.2 and Appendix A. For the recovery of contiguous clusters, in one experiment we generate $S/2$ consecutive networks from each mode so that the population consists of $K=2$ adjacent contiguous clusters. In another experiment, we generate $S/4$ networks from mode 1, then $S/4$ networks from mode 3, and repeat this pattern so that there are $K=4$ adjacent contiguous clusters of the $S$ networks generated from the two distinct modes. For these two experiments, we run the dynamic programming algorithm detailed in Sec. II.3 to identify the modes $\mathcal{A}_{\mathrm{MDL}}$ and clusters $\mathcal{C}_{\mathrm{MDL}}$ that minimize the objective in Eq. (19). In all three experiments, we generate a population of $S=100$ networks, each constructed from its corresponding mode using the single flip probability $p$ to introduce true- and false-positive edges.

To quantify the mode recovery error, we use the network distance quantified by the average Hamming distance between the inferred modes $\mathcal{A}$ and the planted modes $\mathcal{A}_{\mathrm{true}}$. As both of our algorithms automatically select the optimal number of clusters $K$, the number of modes we infer can differ from the true number ($K=2$ or $K=4$, depending on the experiment). In each experiment, we therefore choose the $K$ inferred modes in $\mathcal{A}$ with the largest corresponding clusters and compute the average Hamming distance between these and the true modes in $\mathcal{A}_{\mathrm{true}}$. (Since there are $K!$ ways to assign the inferred mode labels, we choose the labeling that produces the smallest Hamming distance.) To measure the error between our inferred clusters $\mathcal{C}$ and the planted clusters $\mathcal{C}_{\mathrm{true}}$ (the “partition distance”), we use one minus the normalized mutual information vinh2010information . We also compute the inverse compression ratio, Eq. (17), to measure how well the network population can be compressed. We pick a range of values of $p$ to tune the noise level in the populations, and at each value of $p$ we average these three quantities over 200 realizations of the model to smooth out noise due to randomness in the synthetic network populations. We choose $K_{0}=1$ for these experiments, but this choice has little to no effect on the results (see Appendix B).
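
For reference, these two error measures can be computed as sketched below. The use of scikit-learn's normalized_mutual_info_score is our assumption about tooling, not the paper's stated implementation, and the mode error reported in the figures additionally averages the Hamming distance over the best matching of inferred to planted modes as described above.

```python
from sklearn.metrics import normalized_mutual_info_score

def hamming_distance(A, B):
    """Number of node pairs on which two networks (edge sets) disagree."""
    return len(A ^ B)

def partition_distance(true_labels, inferred_labels):
    """One minus the normalized mutual information between two clusterings,
    with cluster labels given as equal-length integer sequences."""
    return 1.0 - normalized_mutual_info_score(true_labels, inferred_labels)
```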

Figure 3: Cluster recovery for different mode separations. Partition distance between true and inferred clusters for (a) unconstrained clustering, (b) contiguous clustering with $K=2$, and (c) contiguous clustering with $K=4$, for various values of the mode separation $\gamma$. Each data point is an average over 200 realizations of the population for the corresponding value of the flip probability, and error bars correspond to three standard errors in the mean.

Figure 2 shows the results of our first reconstruction experiment. The reconstruction performance gradually worsens as $p$ increases due to the increasing noise level in the sampled networks relative to their corresponding modes (Fig. 2a). In all experiments, the network distance reaches that expected for a completely random guess of the mode networks—a 50/50 coin flip to determine the existence of each edge, denoted by the dashed black line—when $p=0.5$. The results in Fig. 2a indicate that in both the unconstrained and contiguous cases, our algorithms are capable of recovering the modes underlying these synthetic network populations with high accuracy, even for substantial levels of noise (up to $p\approx 0.3$, corresponding to an average of 30% of the edges/non-edges differing between each mode and the networks in its cluster).

The partition distance shows similar gradual performance degradation, with substantial increases in the distance beginning at $p\approx 0.3$ for the contiguous experiments and $p\approx 0.15$ for the unconstrained experiment (Fig. 2b). The partition distance levels off at different values across the three experiments, with the unconstrained case exhibiting significantly worse performance than the contiguous cases. We expect this result since contiguity simplifies the reconstruction problem by reducing the space of possible clusterings. Because information-based measures account for the entire space of possible clusterings instead of the highly constrained set produced by contiguous partitions, they overestimate the similarity of partitions in this constrained set. This overestimation intensifies with more clusters kirkley2022spatial .

The inverse compression ratio, Eq. (17), for these experiments gradually approaches 1 (no compression relative to transmitting each network individually, denoted by the dashed black line) as the noise level $p$ increases (Fig. 2c). This result is consistent with the intuition that noisier data will be harder to compress, while data with strong internal regularity will be much easier to compress, as the homogeneities can be exploited for shorter encodings. When $p$ is small, we can achieve up to 10 times compression over the naive baseline by using the inferred underlying modes and clusters to transmit these network populations.

The results in Fig. 2 indicate that our algorithms can recover the underlying modes and their clusters in synthetic network populations. However, these results also depend on how distinguishable the underlying modes are. For identical modes, $\bm{A}^{(1)}=\bm{A}^{(2)}$, it is impossible to recover the cluster labels of the individual network samples $\bm{D}^{(s)}$. To investigate the dependence of the recovery performance on the modes themselves, we repeat the experiment in Fig. 2, except this time we systematically vary the mode networks $\mathcal{A}$ for each trial to achieve various levels of distinguishability. In each trial, we set $\bm{A}^{(1)}$ equal to the corresponding mode in Fig. 1 as before, but then generate the edges in $\bm{A}^{(2)}$ from $\bm{A}^{(1)}$ using the flip probability $\gamma$, which we call the “mode separation”. For mode separations $\gamma\approx 0$, it is challenging to recover the correct cluster labels of the individual sample networks because $\bm{A}^{(2)}$ will closely resemble $\bm{A}^{(1)}$. On the other hand, for mode separations $\gamma\approx 0.5$, the modes will typically be easily distinguished since $\bm{A}^{(2)}$ will have many edges/non-edges that have flipped relative to $\bm{A}^{(1)}$.

Figure 3 shows the results of this second experiment. The panels show the partition distance between the true and inferred cluster labels for a range of mode separations $\gamma$. In all experiments, the recovery becomes worse for lower values of the separation $\gamma$, but the algorithm still recovers a significant amount of cluster information even for relatively low $\gamma$. As in the previous set of experiments, the recovery performance is substantially worse for the discontiguous case compared with the contiguous cases, again due to the highly constrained ensemble of possible partitions considered by the partition distance in the contiguous cases.

In Appendix B, we show the recovery performance for the network distance between the true and inferred modes as we vary the mode separation $\gamma$. The mode recovery results are even more robust to changes in the mode separation. This result is consistent with the recovery performance in Fig. 2, where the recovery of the partitions starts to worsen at lower noise levels than the recovery of the modes. Thus, small perturbations in the inferred clusters may not affect the inferred modes much, since misclassified networks likely have little in common with the rest of their cluster.

III.2 Unordered network population representing global trade relationships

For our first example with empirical network data, we study a collection of worldwide import/export networks. The nodes represent countries and the edges encode trading relationships. The Food and Agriculture Organization of the United Nations (FAO) aggregates these data, and we use the trades made in 2010, as in de2015structural . Each network in the collection corresponds to a category of products, for example, bread, meat, or cigars. We ignore information about the intensity of trades and merely record the presence or absence of a trading relationship for each category of products. The resulting collection comprises 364 networks (layers) on the same set of 214 nodes, with 874.6 edges on average (average degree of 8.2); some sparse networks have as few as one edge, and the densest contains 6,529 edges. These networks are unordered, so we employ the discontiguous clustering method described in Sec. II.2. We run the algorithm multiple times with varying initial numbers of clusters $K_{0}$ to find the best optima, although as with the synthetic reconstruction examples the choice of $K_{0}$ has little impact on compression. The best compression we find results in 8 modes and achieves an inverse compression ratio of $\eta(\mathcal{D})=0.562$, indicating that it is nearly twice as efficient to communicate the data when we use the modal networks and their clusters. In contrast, a clustering analysis of the same network layers in de2015structural using structural reducibility—a measure of how many layers can be aggregated to reduce pairwise information redundancies among the layers—yielded 182 final aggregated layers, which would poorly compress the data under our scheme and would not provide a significant benefit in downstream analyses due to the large final number of clusters. Key properties of the configuration of modes and clusters inferred by our algorithm are illustrated in Fig. 4.

Figure 4: Discontiguous networks of imports and exports. We apply the algorithm of Sec. II.2 to a collection of trade networks de2015structural described in Sec. III.2 to identify similar networks of products. (a) Number of edges in each cluster’s mode. (b) Number of networks in each cluster. (c) Edges in mode 7 but not mode 5 are colored in blue, while edges in mode 5 but not in mode 7 are colored in red, highlighting the differences between these two modes. (d) The shared backbone of edges common to both modes 5 and 7. (e) Distribution of product types across the networks in each cluster.

In Fig. 4a we show the number of edges in each inferred mode, which indicates that these modes vary substantially in density to reflect the key underlying structures in networks within their corresponding clusters. The sizes of the clusters, shown in Fig. 4b, also vary substantially, with the most populated cluster (cluster 4) containing nearly 7 times as many networks as the least populated cluster (cluster 6). Some striking geographical commonalities and differences in the structure of the modes can be seen due to the varying composition of their corresponding clusters of networks. Figures 4c and 4d show the differences and similarities respectively between the structure of the modes for clusters 5 and 7, which are chosen as example modes because of their modest densities and distinct distributions of product types (Fig. 4e). Edges that are in mode 7 but not in mode 5 are highlighted in blue, while edges in mode 5 but not in mode 7 are highlighted in red. Meanwhile, the shared edges common to both networks are shown in Fig. 4d in black. Mode 5, which contains a diversity of product types and a relatively large portion of grain and protein products, has a large number of edges connecting the Americas to Europe that are not present in mode 7. On the other hand, mode 7, which is primarily composed of networks representing the trade of fruits and vegetables, has many edges in the global south that are not present in mode 5. However, both modes share a common backbone of edges that are distributed globally.

We categorized the 364 products (the network layers being clustered) into 12 broader categories of product types, plotting their distributions within each cluster in Fig. 4e. There are a few interesting observations we can make about this figure. Nearly all of the dairy products are traded within networks in a single cluster (cluster 3), indicating a high degree of similarity in the trade patterns for dairy products across countries. A similar observation can be made for live animals, which are primarily traded in cluster 4. On the other hand, many of the other products (grains, proteins, sweets, fruits, vegetables, and drinks) are traded in reasonable proportion in all clusters, which may reflect the diversity of these products as well as their geographical sources, which can give rise to heterogeneous trading structures. The densities of the modes and the sizes of the clusters do not have a clear relationship, with cluster 6 containing the smallest number of networks but the densest mode, and clusters 4 and 5 having sparser modes and much larger clusters. This reflects a higher level of heterogeneity in the structure of the trading relationships captured in cluster 6, which requires a denser mode for optimal compression, while the converse is true for clusters 4 and 5.

We also identify substantial structural differences in the inferred modes. In Appendix C, we compute summary statistics (average degree, transitivity, and average betweenness) for the modes output in this experiment and the network layers in their corresponding clusters. The statistics vary much more across clusters than within the clusters, suggesting that the MDL optimal mode configuration exemplifies distinct network structures within the dataset. Because the within-cluster average value of each statistic and the corresponding value for the mode network are similar, our method provides an effective preprocessing step for network-level regression tasks.

III.3 Ordered network population representing the fossil record

Figure 5: Contiguous clusters of networks representing the post-Cambrian fossil record. We apply the dynamic programming algorithm of Sec. II.3 to the unipartite genus-genus network population (lower bar) and the bipartite genus-location network population (upper bar) described in Sec. III.3 to identify key time intervals with distinct fossil assemblages. The clusters inferred by the algorithm are represented with distinct colors, and the networks, one per post-Cambrian geological stage, are separated by white lines. Boundaries between geological periods, i.e., larger scale rock units cohen2013ics , are indicated by dashed vertical black lines. The five major mass extinction events raup1982mass are shown with dotted vertical red lines.
                                       Eras     Extinction Events     Periods     MDL Optimal
$K(\mathcal{D}_{\mathrm{uni}})$        3        6                     11          17
$K(\mathcal{D}_{\mathrm{bi}})$         3        6                     11          10
$\eta(\mathcal{D}_{\mathrm{uni}})$     0.814    0.805                 0.810       0.796
$\eta(\mathcal{D}_{\mathrm{bi}})$      0.842    0.838                 0.850       0.833
Table 1: Compression results for different partitions of the fossil record. $\mathcal{D}_{\mathrm{uni}}$ denotes the set of fossil record networks in the unipartite genus-genus representation, and $\mathcal{D}_{\mathrm{bi}}$ denotes the same networks in the bipartite genus-location representation.

We conclude our analysis with a study of a set of networks representing global marine fauna over the past 500 million years. We aggregate fossil occurrences of the shelled marine animals, including bryozoans, corals, brachiopods, mollusks, arthropods, and echinoderms, into a regular grid covering the Earth’s surface rojas_multiscale_2021 . From these data, we construct unweighted bipartite networks representing 90 ordered time intervals in Earth’s history (geological stages): An edge between a genus and a grid cell indicates that the genus was observed in the grid cell during the network’s corresponding geological stage. We also construct the unipartite projections of these networks: An edge from one genus to another indicates that these two genera were present in the same grid cell during the stage corresponding to the network. In total, there were 18,297 genus nodes, 664 grid cell nodes, 67,371.5 edges on average for the 90 unipartite graphs (average degree of 7.4), and 1,462.2 edges on average for the bipartite graphs (average degree of 0.08, corresponding to an average of roughly 10 percent of genera being present at each layer).

In Fig. 5, we show the results of applying our clustering method for contiguous network populations (Sec. II.3) to both the unipartite and bipartite populations representing the post-Cambrian fossil record. We find clusters that capture the known large-scale organization of marine diversity. Major groups of marine animals archived in the fossil record are organized into global-scale assemblages that sequentially dominated oceans and shifted across major biotic transitions. Overall, the bipartite and unipartite fossil record network representations both result in transitions concurrent with the major known geological perturbations in Earth’s history, including the so-called mass extinction events. However, differences in the clusters retrieved from the unipartite and bipartite representations of the underlying paleontological data highlight the impact of this choice on the observed macroevolutionary pattern eriksson_how_2021 .

We also use our methodology to assess the extent to which the standard division of the post-Cambrian rock record in the geological time scale and the well-known mass extinction events compress the assembled networks. Specifically, we evaluate the inverse compression ratio in Eq. (17) on three different partitions of the fossil record networks, defined by clustering the assembled networks into geological eras (Paleozoic, Mesozoic, and Cenozoic), geological periods (Ordovician to Quaternary), and the six time intervals between the five mass extinctions in Fig. 5, with the mode of each planted cluster constructed from its assigned networks using the greedy procedure described in Sec. II.2 and Appendix A.

Table 1 shows the results of these experiments. All three partitions compress the fossil record networks almost as much as the optimal partition, which represents a natural division based on major regularities. Accordingly, the planted partition based on mass extinctions is almost as good as this optimal partition because mass extinctions are concurrent with the major geological events shaping the history of marine life. In contrast, partitions based either on standard geological eras or periods are less optimal, likely because they represent, to some extent, arbitrary divisions that are maintained for historical reasons. Our results here provide a complementary perspective to the work in rojas_multiscale_2021 , where a multilayer network clustering algorithm was employed that clusters nodes within and across layers to reveal three major biotic transitions from the fossil data. In Appendix E we review this and other existing multiplex and network population clustering techniques, discussing the similarities and differences with our proposed methods.

IV Conclusion

We have used the minimum description length principle to develop efficient parameter-free methods for summarizing populations of networks using a small set of representative modal networks that succinctly describe the variation across the population. For clustering network populations with no ordering, we have developed a fast merge-split Monte Carlo procedure that performs a series of moves to refine a partition of the networks. For clustering ordered networks into contiguous clusters, we employ a time and memory-efficient dynamic programming approach. These algorithms can accurately reconstruct modes and associated clusters in synthetic datasets and identify significant heterogeneities in real network datasets derived from trading relationships and fossil records. Our methods are principled, non-parametric, and efficient in summarizing complex sets of independent network measurements, providing an essential tool for exploratory and visual analyses of network data and preprocessing large sets of network measurements for downstream applications.

This information-theoretic framework for representing network populations with modal networks can be extended in several ways. For example, a multi-step encoding that allows for hierarchical partitions of network populations would capture multiple levels of heterogeneity in the data. More complex encodings that exploit structural regularities within the networks would allow for simultaneous inference of mesoscale structures—such as communities, core-periphery divisions, or specific informative subgraphs vreeken2022differential —along with the modes and clusters. The encodings can also be adapted to capture weighted networks with multiedges by altering the combinatorial expressions for the number of allowable edge configurations.

Acknowledgments

A.K. was supported in part by the HKU-100 Start Up Grant. M.R. was supported by the Swedish Research Council, Grant No. 2016-00796.

V Code Availability

The algorithm presented in this paper is available at https://github.com/aleckirkley/MDL-network-population-clustering.

VI Data Availability

The data sets used in this paper are available at https://github.com/aleckirkley/MDL-network-population-clustering.

VII Author Contributions

A.K. designed the study and methodology, A.K., A.R., M.R., and J.G.Y. designed the experiments, A.K. and J.G.Y. performed the experiments, A.R. and M.R. provided new datasets, and A.K. wrote the manuscript. All authors reviewed, edited, and approved the manuscript.

VIII Competing Interests

The authors declare no competing interests.

References

  • (1) N. Eagle, A. S. Pentland, and D. Lazer, Inferring friendship network structure by using mobile phone data. Proc. Natl. Acad. Sci. U.S.A. 106(36), 15274–15278 (2009).
  • (2) S. Lehmann, Fundamental structures in temporal communication networks. In Temporal Network Theory, pp. 25–48, Springer (2019).
  • (3) O. Sporns, Networks of the Brain. MIT Press, Cambridge, MA (2010).
  • (4) C. Stark, B.-J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers, BioGRID: A general repository for interaction datasets. Nucleic Acids Research 34, D535–D539 (2006).
  • (5) M. Kivelä, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno, and M. A. Porter, Multilayer networks. Journal of Complex Networks 2, 203–271 (2014).
  • (6) C. T. Butts, Network inference, error, and informant (in)accuracy: A Bayesian approach. Social Networks 25, 103–140 (2003).
  • (7) M. E. J. Newman, Estimating network structure from unreliable measurements. Physical Review E 98, 062321 (2018).
  • (8) J.-G. Young, G. T. Cantwell, and M. Newman, Bayesian inference of network structure from unreliable data. J. of Complex Netw. 8(6), cnaa046 (2020).
  • (9) T. P. Peixoto, Reconstructing networks with unknown and heterogeneous errors. Physical Review X 8, 041011 (2018).
  • (10) C. E. Priebe, D. L. Sussman, M. Tang, and J. T. Vogelstein, Statistical inference on errorfully observed graphs. J. Comput. Graph. Stat 24(4), 930–953 (2015).
  • (11) J. Arroyo, A. Athreya, J. Cape, G. Chen, C. E. Priebe, and J. T. Vogelstein, Inference for multiple heterogeneous networks with a common invariant subspace. Journal of machine learning research 22(142) (2021).
  • (12) R. Tang et al., Connectome smoothing via low-rank approximations. IEEE Trans. Med. Imaging. 38(6), 1446–1456 (2018).
  • (13) S. Lunagómez, S. C. Olhede, and P. J. Wolfe, Modeling network populations via graph distances. J. Am. Stat. Assoc. (116), 536 (2021).
  • (14) L. Wang, Z. Zhang, D. Dunson, et al., Common and individual structure of brain networks. Ann. Appl. Stat. 13.
  • (15) J.-G. Young, F. S. Valdovinos, and M. Newman, Reconstruction of plant–pollinator networks from observational data. Nature Communications 12(1), 1–12 (2021).
  • (16) D. Banks and K. Carley, Metric inference for social networks. Journal of Classification 11, 121–149 (1994).
  • (17) M. E. J. Newman, Network structure from rich but noisy data. Nature Physics 14, 542–545 (2018).
  • (18) C. M. Le, K. Levin, E. Levina, et al., Estimating a network from multiple noisy realizations. Electron. J. Stat. 12(2), 4697–4740 (2018).
  • (19) P. S. La Rosa, T. L. Brooks, E. Deych, B. Shands, F. Prior, L. J. Larson-Prior, and W. D. Shannon, Gibbs distribution for statistical analysis of graphical data with a sample application to fcMRI brain images. Stat. Med. 35, 566–580 (2016).
  • (20) J.-G. Young, A. Kirkley, and M. E. J. Newman, Clustering of heterogeneous populations of networks. Physical Review E 105, 014312 (2022).
  • (21) J. Stehlé, N. Voirin, A. Barrat, C. Cattuto, L. Isella, J.-F. Pinton, M. Quaggiotto, W. Van den Broeck, C. Régis, B. Lina, et al., High-resolution measurements of face-to-face contact patterns in a primary school. PLOS ONE 6, e23176 (2011).
  • (22) L. Peel and A. Clauset, Detecting change points in the large-scale structure of evolving networks. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp. 2914–2920 (2015).
  • (23) T. P. Peixoto and L. Gauvin, Change points, memory and epidemic spreading in temporal networks. Scientific Reports 8, 15511 (2018).
  • (24) M. De Domenico, V. Nicosia, A. Arenas, and V. Latora, Structural reducibility of multilayer networks. Nature communications 6(1), 1–9 (2015).
  • (25) A. M. Nielsen and D. Witten, The multiple random dot product graph model. Preprint arXiv:1811.12172 (2018).
  • (26) S. Wang, J. Arroyo, J. T. Vogelstein, and C. E. Priebe, Joint embedding of graphs. IEEE Trans. Pattern Anal. Mach. Intell. 43, 1324–1336 (2019).
  • (27) N. Stanley, S. Shai, D. Taylor, and P. J. Mucha, Clustering network layers with the strata multilayer stochastic block model. IEEE Trans. Netw. Sci. Eng. 3, 95–105 (2016).
  • (28) M. Signorelli and E. C. Wit, Model-based clustering for populations of networks. Statistical Modelling 20, 9–29 (2020).
  • (29) A. Mantziou, S. Lunagómez, and R. Mitra, Bayesian model-based clustering for multiple network data. Preprint arXiv:2107.03431 (2021).
  • (30) F. Yin, W. Shen, and C. T. Butts, Finite mixtures of ERGMs for ensembles of networks. Bayesian Analysis (Advance Publication) pp. 1–39 (2022).
  • (31) D. Durante, D. B. Dunson, and J. T. Vogelstein, Nonparametric Bayes modeling of populations of networks. J. Am. Stat. Assoc. 112, 1516–1530 (2017).
  • (32) A. Patania, A. Allard, and J.-G. Young, Exact and rapid linear clustering of networks with dynamic programming. Preprint arXiv:2301.10403 (2023).
  • (33) J. Rissanen, Modeling by the shortest data description. Automatica 14, 465–471 (1978).
  • (34) M. H. Hansen and B. Yu, Model selection and the principle of minimum description length. Journal of the American Statistical Association 96, 746–774 (2001).
  • (35) O. Georgieva, K. Tschumitschew, and F. Klawonn, Cluster validity measures based on the minimum description length principle. In Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 82–89, Springer-Verlag, Berlin (2011).
  • (36) J. Tabor and P. Spurek, Cross-entropy clustering. Pattern Recognition 47, 3046–3059 (2014).
  • (37) T. Li, S. Ma, and M. Ogihara, Entropy-based criterion in categorical clustering. In Proceedings of the Twenty-first International Conference on Machine Learning, p. 68, Association for Computing Machinery, New York (2004).
  • (38) M. Narasimhan, N. Jojic, and J. A. Bilmes, Q-clustering. Advances in Neural Information Processing Systems 18, 979–986 (2005).
  • (39) M. Rosvall and C. T. Bergstrom, An information-theoretic framework for resolving community structure in complex networks. Proceedings of the National Academy of Sciences 104, 7327–7331 (2007).
  • (40) T. P. Peixoto, Hierarchical block structures and high-resolution model selection in large networks. Physical Review X 4, 011047 (2014).
  • (41) A. Kirkley, Spatial regionalization based on optimal information compression. Communications Physics 5(1), 1–10 (2022).
  • (42) D. Koutra, U. Kang, J. Vreeken, and C. Faloutsos, VoG: Summarizing and understanding large graphs. In Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 91–99, SIAM (2014).
  • (43) A. E. Wegner, Subgraph covers: an information-theoretic approach to motif analysis in networks. Physical Review X 4(4), 041026 (2014).
  • (44) P. Bloem and S. de Rooij, Large-scale network motif analysis using compression. Data Mining and Knowledge Discovery 34(5), 1421–1453 (2020).
  • (45) J.-G. Young, G. Petri, and T. P. Peixoto, Hypergraph reconstruction from network data. Communications Physics 4(1), 1–11 (2021).
  • (46) G. Bouritsas, A. Loukas, N. Karalias, and M. Bronstein, Partition and code: learning how to compress graphs. Advances in Neural Information Processing Systems 34, 18603–18619 (2021).
  • (47) A. Kirkley and M. E. J. Newman, Representative community divisions of networks. Communications Physics 5, 1–10 (2022).
  • (48) T. P. Peixoto, Merge-split Markov chain Monte Carlo for community detection. Physical Review E 102, 012305 (2020).
  • (49) B. Jackson, J. D. Scargle, D. Barnes, S. Arabhi, A. Alt, P. Gioumousis, E. Gwin, P. Sangtrakulcharoen, L. Tan, and T. T. Tsai, An algorithm for optimal partitioning of data on an interval. IEEE Signal Processing Letters 12(2), 105–108 (2005).
  • (50) R. Bellman, Dynamic Programming. Princeton University Press (1957).
  • (51) A. Rojas, J. Calatayud, M. Kowalewski, M. Neuman, and M. Rosvall, A multiscale view of the Phanerozoic fossil record reveals the three major biotic transitions. Communications Biology 1, 1–2 (2021).
  • (52) N. X. Vinh, J. Epps, and J. Bailey, Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research 11, 2837–2854 (2010).
  • (53) K. M. Cohen, S. C. Finney, P. L. Gibbard, and J.-X. Fan, The ICS International Chronostratigraphic Chart. Episodes Journal of International Geoscience 36, 199–204 (2013).
  • (54) D. M. Raup and J. J. Sepkoski Jr., Mass extinctions in the marine fossil record. Science 215, 1501–1503 (1982).
  • (55) A. Eriksson, D. Edler, A. Rojas, M. de Domenico, and M. Rosvall, How choosing random-walk model and network representation matters for flow-based community detection in hypergraphs. Communications Physics 4, 133 (2021).
  • (56) C. Coupette, S. Dalleiger, and J. Vreeken, Differentially describing groups of graphs. Preprint arxiv:2201.04064 (2022).
  • (57) D. J. MacKay, Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003).
  • (58) P. D. Grünwald, The Minimum Description Length Principle. MIT Press, Cambridge, MA (2007).
  • (59) L. Peel, D. B. Larremore, and A. Clauset, The ground truth about metadata and community detection in networks. Science Advances 3(5), e1602548 (2017).
  • (60) T. P. Peixoto and A. Kirkley, Implicit models, latent compression, intrinsic biases, and cheap lunches in community detection. Preprint arXiv:2210.09186 (2022).
  • (61) A. Spanos, Statistical Foundations of Econometric Modelling. Cambridge University Press (1986).
  • (62) A. Kirkley, Information theoretic network approach to socioeconomic correlations. Physical Review Research 2, 043212 (2020).

Appendix A Merge-split Monte Carlo algorithm

In this appendix we describe the greedy merge-split Monte Carlo algorithm used to identify the clusters $\mathcal{C}$ and modes $\mathcal{A}$ that minimize the description length in Eq. (15). The input to the algorithm is the $S$ edge sets corresponding to the network population $\mathcal{D}$, an initial number of clusters $K_0$, and a maximum number of consecutive failed moves $n_{\mathrm{fails}}$ before termination ($n_{\mathrm{fails}}=100$ for all experiments performed in the paper). Note that the number of true-positive edges, which we denote $t_k = S_k M_k - n_k$, will give equivalent results if swapped for $n_k$ in Eq. (14). We will use these true-positive counts $t_k$ in the algorithm below rather than the false negatives $n_k$, keeping in mind that they are interchangeable in the description length expressions.

The algorithm is as follows:

1. Initialize $K_0$ clusters $\mathcal{C}$ by placing each network in $\mathcal{D}$ in a cluster chosen uniformly at random.

2. For each cluster $C_k$, create a dictionary $E_k = \{(i,j): X^{(k)}_{ij}\}$ mapping every edge $(i,j)$ appearing in the networks within $C_k$ to the number of times $(i,j)$ occurs in $C_k$. We do not need to keep track of edges that never occur in a cluster. This is one computational bottleneck of the algorithm, requiring $O(SN)$ operations for sparse networks with $O(N)$ edges each.

3. For each cluster $C_k$, compute the mode $\bm{A}^{(k)}$ using the following greedy edge removal procedure (a Python sketch of this step is given after the algorithm listing):

   (a) Create a list of tuples $\tilde{E}_k = \{((i,j), X^{(k)}_{ij}) : (i,j)\in E_k\}$, sorted in increasing order of frequency $X^{(k)}_{ij}$. This sort is the computational bottleneck of the mode update step, with roughly $O(N\log N)$ time complexity assuming the networks in cluster $C_k$ have $O(N)$ unique edges.

   (b) Initialize:
      - $r \leftarrow 0$
      - $t_k \leftarrow \sum_{(i,j)\in C_k} X^{(k)}_{ij}$
      - $p_k \leftarrow 0$
      - $M_k \leftarrow |\tilde{E}_k|$
      - $\bm{A}^{(k)} \leftarrow \{(i,j) : (i,j)\in E_k\}$
      - $\Delta\mathcal{L} \leftarrow 0$
      - $\Delta\mathcal{L}_{\mathrm{best}} \leftarrow 0$
      - $r_{\mathrm{best}} \leftarrow 0$

   (c) While $r < |\tilde{E}_k|$:
      i. $(i,j), X^{(k)}_{ij} \leftarrow \tilde{E}_k[r]$
      ii. Compute the change in description length
          \Delta\mathcal{L}_{\mathrm{sub}} = \mathcal{L}(\bm{A}^{(k)}\setminus\{(i,j)\}, C_k) - \mathcal{L}(\bm{A}^{(k)}, C_k). \qquad (21)
          The first term is identical to the second, but with $t_k \to t_k - X^{(k)}_{ij}$, $p_k \to p_k + X^{(k)}_{ij}$, and $M_k \to M_k - 1$.
      iii. $\Delta\mathcal{L} \leftarrow \Delta\mathcal{L} + \Delta\mathcal{L}_{\mathrm{sub}}$
      iv. $\bm{A}^{(k)} \leftarrow \bm{A}^{(k)}\setminus\{(i,j)\}$
      v. Increment: $r \leftarrow r+1$, $t_k \leftarrow t_k - X^{(k)}_{ij}$, $p_k \leftarrow p_k + X^{(k)}_{ij}$, $M_k \leftarrow M_k - 1$
      vi. If $\Delta\mathcal{L} < \Delta\mathcal{L}_{\mathrm{best}}$: set $\Delta\mathcal{L}_{\mathrm{best}} \leftarrow \Delta\mathcal{L}$ and $r_{\mathrm{best}} \leftarrow r$

   (d) Return $\bm{A}^{(k)}$ with the first $r_{\mathrm{best}}$ edges in $\tilde{E}_k$ removed.

4. Initialize a counter $n_f \leftarrow 0$ and compute the total description length $\mathcal{L}(\mathcal{D})$ for the current cluster and mode configuration using Eq. (15).

5. While $n_f < n_{\mathrm{fails}}$, attempt one of the following four moves, chosen at random:

   (a) Move 1: Reassign a randomly chosen network $\bm{D}^{(s)}$ from its cluster $C_k$ to the cluster $C_{k'}$ that most reduces the description length.
      i. For all $k' \neq k$, compute
         \Delta\mathcal{L}_{1}(k') = \mathcal{L}(\bm{A}^{(k')}, C_{k'}\cup\{\bm{D}^{(s)}\}) + \mathcal{L}(\bm{A}^{(k)}, C_k\setminus\{\bm{D}^{(s)}\}) - \mathcal{L}(\bm{A}^{(k')}, C_{k'}) - \mathcal{L}(\bm{A}^{(k)}, C_k).
         The modes are not updated to account for the inclusion/absence of $\bm{D}^{(s)}$, so we only require updates to $E_k$ and $E_{k'}$ to account for this reassignment and to compute the new values of $t_k, p_k, t_{k'}, p_{k'}$. These edge dictionary updates can be done in $O(N)$ operations for sparse graphs.
      ii. If $\Delta\mathcal{L}_{1}(k'') < 0$ for some $k'' \neq k$:
         - Move $s$ to the cluster $C_{k'}$ that minimizes $\Delta\mathcal{L}_{1}(k')$, and remove $s$ from $C_k$
         - $n_f \leftarrow 0$
         - Update $E_k, E_{k'}$ to account for the new absence/presence of the edges in $\bm{D}^{(s)}$
         - Update the modes $\bm{A}^{(k)}$ and $\bm{A}^{(k')}$ using the new dictionaries $E_k, E_{k'}$
         - Update the description length with the corresponding change after the move.
      iii. Else: $n_f \leftarrow n_f + 1$
      iv. Time complexity: $O(KN)$ for sparse networks, since computing the change in description length for the move to $k'$ requires the $O(N)$ operation of comparing the edge list of $\bm{D}^{(s)}$ with the edge list of $\bm{A}^{(k')}$.

   (b) Move 2: Merge a randomly chosen pair of clusters $C_{k'}$ and $C_{k''}$ to form a new cluster $C_k = C_{k'}\cup C_{k''}$.
      i. Compute the dictionary $E_k$ by merging the counts in the dictionaries $E_{k'}$ and $E_{k''}$.
      ii. Compute $\bm{A}^{(k)}$ using the method outlined in Step 3 of the algorithm.
      iii. Compute the change in description length
         \Delta\mathcal{L}_{2} = \mathcal{L}(\bm{A}^{(k)}, C_k) - \mathcal{L}(\bm{A}^{(k')}, C_{k'}) - \mathcal{L}(\bm{A}^{(k'')}, C_{k''}). \qquad (22)
      iv. If $\Delta\mathcal{L}_{2} < 0$:
         - Add $C_k$ to $\mathcal{C}$ and delete $C_{k'}$ and $C_{k''}$
         - Keep $E_k$ and delete $E_{k'}$ and $E_{k''}$
         - Keep $\bm{A}^{(k)}$ and delete $\bm{A}^{(k')}$ and $\bm{A}^{(k'')}$
         - $\mathcal{L}(\mathcal{D}) \leftarrow \mathcal{L}(\mathcal{D}) + \Delta\mathcal{L}_{2}$
      v. Else: $n_f \leftarrow n_f + 1$
      vi. Time complexity: roughly $O(N\log N)$, with the bottleneck being the computation of the mode $\bm{A}^{(k)}$.

   (c) Move 3: Split a randomly chosen cluster $C_k$ to form two new clusters $C_{k'}$ and $C_{k''}$.
      i. Initialize the two clusters and their edge dictionaries using the method in Steps 1 and 2 of the algorithm.
      ii. Identify the modes of $C_{k'}$ and $C_{k''}$ using the method in Step 3 of the algorithm.
      iii. Run a 2-means-style algorithm to compute $C_{k'}$ and $C_{k''}$ and their corresponding modes. While networks are still being reassigned in the current iteration:
         A. For all networks $\bm{D}^{(s)}$ in the two sub-clusters, compute $\Delta\mathcal{L}_{1}(k')$ and $\Delta\mathcal{L}_{1}(k'')$, and add $\bm{D}^{(s)}$ to the cluster with the lower of the two values.
         B. Update $E_{k'}$ and $E_{k''}$ using the method in Step 2 of the overall algorithm.
         C. Update the modes of $C_{k'}$ and $C_{k''}$ using the method in Step 3 of the overall algorithm.
      iv. Compute the change in description length
         \Delta\mathcal{L}_{3} = \mathcal{L}(\bm{A}^{(k')}, C_{k'}) + \mathcal{L}(\bm{A}^{(k'')}, C_{k''}) - \mathcal{L}(\bm{A}^{(k)}, C_k). \qquad (23)
      v. If $\Delta\mathcal{L}_{3} < 0$:
         - Add $C_{k'}$ and $C_{k''}$ to $\mathcal{C}$, delete $C_k$
         - Keep $E_{k'}$ and $E_{k''}$, delete $E_k$
         - Keep $\bm{A}^{(k')}$ and $\bm{A}^{(k'')}$, delete $\bm{A}^{(k)}$
         - $\mathcal{L}(\mathcal{D}) \leftarrow \mathcal{L}(\mathcal{D}) + \Delta\mathcal{L}_{3}$
      vi. Else: $n_f \leftarrow n_f + 1$
      vii. Time complexity: $O(NS/K)$ for roughly equally sized clusters, with the computational bottleneck being the reassignment step in the local 2-means algorithm.

   (d) Move 4: Pick two clusters at random to merge, then immediately split the merged cluster following Move 3. Time complexity: $O(NS/K)$, as in Move 3.

Now that we have laid out the merge-split algorithm, we can show that the greedy mode update in Step 3 is locally optimal under a few assumptions about the sparsity of the networks and the set of edges in the cluster as a whole. The relevant terms of the local description length that we need to consider are those that depend on the choice of mode:

\phi_{k}(\bm{A}^{(k)},C_{k}) \approx \log\binom{\binom{N}{2}}{M_{k}} + \log\binom{S_{k}M_{k}}{t_{k}} + \log\binom{S_{k}\binom{N}{2}}{p_{k}}, \qquad (24)

where we have approximated $M_{k}^{\ast} = \binom{N}{2} - M_{k} \approx \binom{N}{2}$, which is a good approximation when the modes are sparse. We also note the use of true positives $t_{k}$ rather than false negatives $n_{k}$, as discussed at the beginning of this appendix.

We can first notice that in the sparse network regime it will never be optimal to include an edge $(i,j)$ in the mode $\bm{A}^{(k)}$ that does not appear in any of the networks $\bm{D}^{(s)}\in C_{k}$. If $M_{k} < \frac{1}{2}\binom{N}{2} - 1$, then including such an edge will only serve to augment $M_{k}$ by 1, increasing the values of the first two terms of $\phi_{k}$ in Eq. (24). We will thus end up increasing the description length if we add such an edge, and so no such edges will exist in the optimal mode $\bm{A}^{(k)}$. We can therefore begin our algorithm with the maximally dense mode $\bm{A}_{\mathrm{comp}}^{(k)}$ in Eq. (16) and only consider removing edges, since the optimal mode will be comprised of some subset of these edges.

We can now argue that, at any current iteration of the mode, the best edge $(i,j)$ to remove next is the one that occurs least often in the cluster $C_{k}$ (i.e., the one whose occurrence count $X^{(k)}_{ij}$ is smallest). $M_{k}$ always decreases by 1 regardless of which edge is chosen, but $t_{k}$ and $p_{k}$ decrease and increase by $X^{(k)}_{ij}$, respectively. If we assume that $t_{k}/S_{k} > M_{k}/2$—on average, each network in $C_{k}$ contains at least half of the edges in the mode $\bm{A}^{(k)}$, which holds when the networks in $C_{k}$ have relatively high overlap in their edge sets—then the second term in Eq. (24) increases by the smallest amount under the edge removal when $t_{k}$ decreases by the least amount. Likewise, the third term in Eq. (24) also increases under the edge removal (since $p_{k}$ increases by $X^{(k)}_{ij}$, and we assume $p_{k}\ll S_{k}\binom{N}{2}$), and it increases by the smallest amount when $X^{(k)}_{ij}$ is smallest. The best edge $(i,j)$ to remove from the mode in any given step is therefore the one with the smallest value of $X^{(k)}_{ij}$, as it results in the most negative change in $\phi_{k}$. The greedy edge removal algorithm thus reaches a locally optimal edge configuration for the mode under these assumptions.
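To make this explicit, applying the substitutions above, the change in Eq. (24) from removing a single edge $(i,j)$ from the current mode is

\Delta\phi_{k}(i,j) = \log\frac{\binom{\binom{N}{2}}{M_{k}-1}}{\binom{\binom{N}{2}}{M_{k}}} + \log\frac{\binom{S_{k}(M_{k}-1)}{t_{k}-X^{(k)}_{ij}}}{\binom{S_{k}M_{k}}{t_{k}}} + \log\frac{\binom{S_{k}\binom{N}{2}}{p_{k}+X^{(k)}_{ij}}}{\binom{S_{k}\binom{N}{2}}{p_{k}}},

where the first term is independent of which edge is removed, and under the assumptions above the second and third terms grow by the least when $X^{(k)}_{ij}$ is smallest.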

In practice, the total run time of the merge-split algorithm depends on the user's choice of $n_{\mathrm{fails}}$, the number of consecutive failed moves before termination. When computation time is not a concern, it is safer to set $n_{\mathrm{fails}}$ as high as possible to ensure that no additional merge-split moves could further decrease the description length. For context, convergence of the algorithm in the global trade experiment presented here with $n_{\mathrm{fails}}=100$ required on average 150 merge-split steps and a total run time of 15 minutes for a pure Python implementation with no substantial optimizations on an Intel Core i7 CPU. With $n_{\mathrm{fails}}=10$, the same experiment takes on average 36.4 merge-split steps and a total run time of 3.7 minutes on the same processor. In the synthetic network experiments, each simulation took on average 0.39 seconds for 138 MCMC moves with $n_{\mathrm{fails}}=100$ ($S$ and $N$ are much smaller, resulting in shorter run times). The contiguous clustering algorithm does not exhibit the same variability in total run time, since it is deterministic given the input networks.

Appendix B Additional synthetic reconstruction tests

In this appendix we present additional synthetic reconstruction results for the discontiguous and contiguous network population clustering algorithms of Secs. II.2 and II.3. These results verify the robustness of the algorithms as well as their theoretical run time scaling.

In Fig. 6, we plot the results shown in Fig. 2 for different choices of the initial number of clusters $K_0$ in the discontiguous case (the contiguous algorithm always begins with all networks in their own clusters), finding that the results are practically indistinguishable for all choices $K_0 = 1, 5, 10, 20$. In Fig. 7, we repeat the same experiment but this time for different population sizes $S$, finding that the results do depend on $S$, but in an intuitive way: more networks give us more statistical evidence for the clusters and their modes, resulting in better reconstruction.

In Fig. 8, we numerically verify the runtime scaling of the MCMC algorithm discussed in Sec. II.2 and Appendix A. For each realization of the experiment, we first constructed two Erdős–Rényi random graphs with $N$ nodes and average degree 5—this value does not matter for the scaling so long as it is constant across $N$ and $S$—which served as modes for the synthetic population. We then generated a population of $S$ networks from these modes using the model of Appendix D, each network having a 50/50 chance of being generated from either mode, with true-positive rates $\alpha_1=\alpha_2=0.9$ and false-positive rates $\beta_1=\beta_2=0$, introducing noise into the population while keeping recovery relatively simple. The MCMC algorithm of Sec. II.2 was then applied to each population, and the average run time of Move 4 in the algorithm was recorded—this move is the computational bottleneck for $K>1$ since it combines a merge and a split move (Moves 2 and 3). Fig. 8 shows the average run time over 100 realizations of synthetic populations for a range of $N$ values (at fixed $S$) and a range of $S$ values (at fixed $N$), with error bars corresponding to three standard errors in the mean. The regression $\log y = a\log x + b$ was run for the experiments in each panel, and the resulting slopes $a\approx 1$ and coefficients of determination $R^2$ provide strong numerical evidence that the theoretical run time scaling $O(NS)$ holds for this algorithm. The dependence on $K$ is not investigated since the algorithm constantly updates this quantity. However, the slowest MCMC moves occur at $K=1$ (for Move 3) and $K=2$ (for Moves 2 and 4), since these involve the largest clusters; these moves dominate the run time, so in practice the $O(NS)$ dependence is the primary one of interest.
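For reference, the following is a minimal sketch of this sampling step under the generative model of Appendix D, with modes stored as sets of node-pair tuples; the function names are illustrative only.

```python
import random
from itertools import combinations

def sample_network(mode_edges, N, alpha, beta, rng=random):
    """Draw one network from a mode: keep each mode edge with probability alpha
    (true-positive rate) and add each non-mode pair with probability beta
    (false-positive rate)."""
    edges = {e for e in mode_edges if rng.random() < alpha}
    if beta > 0:
        for pair in combinations(range(N), 2):
            if pair not in mode_edges and rng.random() < beta:
                edges.add(pair)
    return edges

def sample_population(modes, N, S, alpha=0.9, beta=0.0, rng=random):
    """Draw S networks, each generated from a uniformly chosen mode."""
    population, labels = [], []
    for _ in range(S):
        k = rng.randrange(len(modes))
        population.append(sample_network(modes[k], N, alpha, beta, rng))
        labels.append(k)
    return population, labels
```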

A similar experiment was run for synthetic contiguous populations with two planted clusters to verify the theoretical run time scaling of the dynamic programming algorithm. The modes for each realization were generated in the same way as in the previous experiment, except that the first $S/2$ networks were generated from the first mode and the final $S/2$ networks from the second mode. The dynamic programming algorithm of Sec. II.3 was then applied to each population, and the average run time of the full algorithm was computed. Fig. 9 shows the average run time over 3 realizations of synthetic populations (each run gave a very consistent run time) for a range of $N$ values (at fixed $S$) and a range of $S$ values (at fixed $N$), with error bars corresponding to three standard errors in the mean. The regression $y = a(x\log x)$ was run for the experiment in panel (a) and the regression $\log y = a\log x + b$ for the experiment in panel (b). The resulting coefficients of determination $R^2$ and the slope $a\approx 2$ for the second regression provide strong numerical evidence that the theoretical run time scaling $O(S^2 N\log N)$ holds for this algorithm.
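The log-log fits used in Fig. 8 and in panel (b) of Fig. 9 amount to ordinary least-squares regression in log-log space; a minimal sketch (with hypothetical arrays of problem sizes and measured run times as input) is the following, and the $y = a(x\log x)$ fit of Fig. 9(a) is an analogous one-parameter least-squares fit.

```python
import numpy as np

def loglog_fit(sizes, runtimes):
    """Fit log y = a log x + b and return the slope a, intercept b, and R^2."""
    lx = np.log(np.asarray(sizes, dtype=float))
    ly = np.log(np.asarray(runtimes, dtype=float))
    a, b = np.polyfit(lx, ly, 1)
    residuals = ly - (a * lx + b)
    r2 = 1.0 - residuals.var() / ly.var()
    return a, b, r2
```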

In Fig. 11, we rerun the reconstruction experiments of Fig. 2 with the same parameters, except with the two modes generated from an Erdős–Rényi random graph with $N=100$ nodes and average degree $c=5$ as in the previous experiments. In general, the recovery performance remains qualitatively similar to that of Fig. 2, except for variations in the high-noise regime ($p\sim 0.5$) due to size-dependent effects of the synthetic population generative process. In this regime, there is a very low probability of any substantial overlap among the networks in the population, due to the quadratic scaling of the number of possible positions where an edge can be placed. The discontiguous clustering method (red curves) places all networks in a single cluster, which is the most parsimonious description of an unstructured population. This outcome results in one nearly complete inferred mode of $E\sim\binom{N}{2}$ edges: by the symmetry of the binomial coefficient, it is equally economical in an information-theoretic sense to specify the positions of non-edges as of edges. Considering the density of the true modes, we then have network Hamming distances of roughly $\binom{N}{2} - cN/2 \sim 4700$, consistent with the red curve in the top panel of the figure.

In the same $p\sim 0.5$ regime for the $N=100$ case, the contiguous clustering methods tend to do the opposite and place every network in its own cluster, since the corresponding description length does not incur as strong a penalty for extra clusters. Here the inferred modes consist of the individual networks in the population—in other words, random coin flips at each edge position, with half of the $\binom{N}{2}$ possible edges occupied. Considering the expected overlap with the true modes and the fluctuations, we get mode Hamming distances of roughly 2000, as seen in the top panel. Meanwhile, as opposed to the $K=1$ discontiguous case where the NMI equals 0 (i.e., subtracting the NMI from 1 gives a partition distance of 1), in the contiguous setting we see NMI values of $2\log K/(\log S + \log K)$, resulting in partition distances of 0.70 and 0.48 for $K=2$ and $K=4$, respectively.
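As a side note, the partition distance used in these recovery plots is simply one minus the NMI between the planted and inferred cluster labels; a one-line sketch (the particular NMI normalization here is an illustrative choice) is:

```python
from sklearn.metrics import normalized_mutual_info_score

def partition_distance(labels_true, labels_inferred):
    """Partition distance: 1 - NMI between planted and inferred labels."""
    return 1.0 - normalized_mutual_info_score(labels_true, labels_inferred)
```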

Figure 6: Cluster recovery for different initial numbers of clusters $K_0$. Recovery performance is plotted for the three experiments described in Sec. III.1, for different initial numbers of clusters $K_0$. (a) Network distance, for various flip probabilities $p$. (b) Partition distance. (c) Inverse compression ratio (Eq. (17)). Each data point is an average over 200 realizations of the population for the corresponding value of the flip probability, and error bars correspond to three standard errors in the mean.
Figure 7: Cluster recovery for different population sizes $S$. Recovery performance is plotted for the three experiments described in Sec. III.1, for different population sizes $S$. (a) Network distance, for various flip probabilities $p$. (b) Partition distance. (c) Inverse compression ratio (Eq. (17)). Each data point is an average over 200 realizations of the population for the corresponding value of the flip probability, and error bars correspond to three standard errors in the mean.
Figure 8: Runtime scaling for discontiguous clustering MCMC moves, for different network sizes $N$ and population sizes $S$. In each realization a synthetic population was generated using the procedure outlined in Appendix B, and the average runtime for the merge-split move (Move 4 in Appendix A, which is the computational bottleneck for $K>1$) was recorded when applying the algorithm of Sec. II.2 to the synthetic data. Regressions $\log y = a\log x + b$ were run for each experiment (dashed lines), and the resulting slopes and $R^2$ values are shown in the corresponding figure panels. Each data point is an average over 100 synthetic realizations, with error bars corresponding to three standard errors.
Figure 9: Runtime scaling for the contiguous clustering dynamic program, for different network sizes $N$ and population sizes $S$. In each realization a synthetic population was generated using the procedure outlined in Appendix B, and the total runtime was recorded when applying the dynamic programming algorithm of Sec. II.3 to the synthetic data. Regressions $y = a(x\log x)$ and $\log y = a\log x + b$ were run for the top and bottom panels, respectively (dashed lines), and the resulting slopes and $R^2$ values are shown in each panel. Each data point is an average over 3 synthetic realizations (each realization showed remarkable runtime consistency), with error bars corresponding to three standard errors.
Figure 10: Mode recovery for different mode separations. Network distance between true and inferred modes for (a) unconstrained clustering, (b) contiguous clustering with $K=2$, and (c) contiguous clustering with $K=4$, for various values of the mode separation $\gamma$. Each data point is an average over 200 realizations of the population for the corresponding value of the flip probability, and error bars correspond to three standard errors in the mean.
Figure 11: Reconstruction experiments with larger modes. Same reconstruction experiments as in Fig. 2, except with the two modes generated from an Erdős–Rényi random graph with $N=100$ nodes and average degree $c=5$.

Appendix C Analysis of modes

One of the advantages of using a concise set of modes instead of the complete collection of networks is that it greatly simplifies analysis of the whole population. For example, a common task in network machine learning and statistics is network-level regression. In such a task, our goal is to understand how the properties of networks relate to response variables, such as the adoption of a product or practice, or the efficiency of shipping.

To implement a regression, one denotes by $\bm{x}_i$ the vector of properties of network $i$ and by $y_i$ the response for this network, and models

y_i = f(\bm{x}_i) + \epsilon_i, \qquad (25)

where $\epsilon_i$ is a random variable modeling deviations from the predictor $f(\bm{x}_i)$. Fitting the regression is a matter of identifying a good predictor function $f(\cdot)$ given data $\{(\bm{x}_1, y_1), (\bm{x}_2, y_2), \ldots\}$.

Such an analysis can become complex and impractical if the set of networks is too large (for example, if computing the network predictors is costly, as is the case for some centrality measures or macroscopic properties like community structure). Hence, one may run the analysis on a smaller set of representative samples, and clustering with an information criterion can achieve this goal. We demonstrate this in Fig. 12, using the FAO trade network analyzed in Sec. III.2. The figure shows the relationship between the values of three representative network predictors computed on the eight identified modes and the average values of these same predictors across the networks of the cluster associated with each mode. The two quantities are proportional to one another: when, say, the average betweenness centrality of nodes is higher in the mode, it is also higher on average in the networks of the associated cluster. This means that we can run a regression analysis on a small set of modes instead of on the complete population—for example, by using the average response of the networks in any given cluster as the new response variable.
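As an illustration of this workflow, the following sketch computes the three predictors of Fig. 12 on each mode and fits a linear model against cluster-level responses; the graph representation (networkx), the choice of a linear model, and the variable names are assumptions made for the example rather than part of our method.

```python
import numpy as np
import networkx as nx
from sklearn.linear_model import LinearRegression

def mode_predictors(G):
    """Predictors shown in Fig. 12: transitivity, average degree, mean betweenness."""
    degrees = [d for _, d in G.degree()]
    betweenness = nx.betweenness_centrality(G)
    return [nx.transitivity(G),
            float(np.mean(degrees)),
            float(np.mean(list(betweenness.values())))]

def fit_mode_regression(modes, cluster_responses):
    """Regress cluster-level average responses on mode-level predictors.
    `modes` is a list of networkx Graphs (one per cluster) and
    `cluster_responses` holds the average response of each mode's cluster."""
    X = np.array([mode_predictors(G) for G in modes])
    y = np.array(cluster_responses)
    return LinearRegression().fit(X, y)
```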

Figure 12: Relationship between network predictors in modes and clusters of the FAO trade network. Network quantities are computed for each of the 8 modes of the FAO trade network and shown on the horizontal axis. The same quantities are then averaged over the networks of each associated cluster and shown on the vertical axis. The quantities are: (a) Transitivity, defined as the ratio of the number of triangles to the number of connected triplets; the Pearson correlation of the two quantities is 0.949. (b) Average degree, with a Pearson correlation coefficient of 0.943. (c) Average node betweenness centrality, with a Pearson correlation coefficient of 0.936. Error bars show one standard deviation of the quantities among the networks within each mode's cluster.

Appendix D Equivalence with Bayesian MAP estimation

In this appendix we establish the correspondence between our minimum description length objective and Bayesian MAP estimation in the heterogeneous network population model of YKN22Clustering . Adapting to the notation of this paper and simplifying the product over network samples in Eq. 17 of YKN22Clustering , the likelihood of the data $\mathcal{D}$ under the heterogeneous network population generative model is given by

P(\mathcal{D}|\mathcal{A},\mathcal{C},\bm{\alpha},\bm{\beta}) = \prod_{k=1}^{K}\alpha_{k}^{t_{k}}(1-\alpha_{k})^{S_{k}M_{k}-t_{k}}\prod_{k=1}^{K}\beta_{k}^{p_{k}}(1-\beta_{k})^{S_{k}M_{k}^{\ast}-p_{k}}, \qquad (26)

where $\bm{\alpha}=\{\alpha_{1},\alpha_{2},\ldots,\alpha_{K}\}$ and $\bm{\beta}=\{\beta_{1},\beta_{2},\ldots,\beta_{K}\}$ are vectors of true- and false-positive rates within the clusters, and where we define $t_{k}=S_{k}M_{k}-n_{k}$ as the number of true-positive edges, as in Appendix A.

We can additionally add priors on the modes and clusters to prepare for performing Bayesian inference with the model. As is done in YKN22Clustering , for our prior on the clusters we can assume that each sample $s$ is assigned a cluster $C_{k}$ independently at random with prior probability $\pi_{k}$, giving the following prior over all clusters

P(\mathcal{C}|\bm{\pi}) = \prod_{k=1}^{K}\pi_{k}^{S_{k}}. \qquad (27)

We can also suppose that each mode $\bm{A}^{(k)}$ has some prior probability $\rho_{k}$ for the existence of each edge $(i,j)$, which results in the following prior over all modes

P(\mathcal{A}|\bm{\rho}) = \prod_{k=1}^{K}\rho_{k}^{M_{k}}(1-\rho_{k})^{M_{k}^{\ast}}. \qquad (28)

This prior differs from the one used in YKN22Clustering , which couples all modes together through a single edge probability $\rho$. That prior would result in a description length expression similar to ours when computing MAP estimators, but it lacks the beneficial properties of being a sum of decoupled cluster-level description lengths and of allowing the modes to fluctuate independently in their densities. We finally assume (as in YKN22Clustering ) uniform priors over the parameters $\{\bm{\alpha},\bm{\beta},\bm{\pi},\bm{\rho}\}$, which we collectively denote $\bm{\theta}$.

Applying Bayes' rule, the posterior distribution over the modes $\mathcal{A}$, clusters $\mathcal{C}$, and model parameters $\bm{\theta}$ is given by

P(\mathcal{A},\mathcal{C},\bm{\theta}|\mathcal{D}) = \frac{P(\mathcal{D},\mathcal{A},\mathcal{C},\bm{\theta})}{\sum_{\mathcal{A},\mathcal{C},\bm{\theta}}P(\mathcal{D},\mathcal{A},\mathcal{C},\bm{\theta})} \propto P(\mathcal{D}|\mathcal{A},\mathcal{C},\bm{\alpha},\bm{\beta})\,P(\mathcal{C}|\bm{\pi})\,P(\mathcal{A}|\bm{\rho}). \qquad (29)

Now, to find the maximum a posteriori (MAP) estimators $\{\hat{\mathcal{A}},\hat{\mathcal{C}},\hat{\bm{\theta}}\}$, we can maximize Eq. (29) (or equivalently, its logarithm) using the method of profiling. To do this, we first maximize over the parameters $\bm{\theta}$, finding the MAP estimates $\hat{\bm{\theta}}(\mathcal{A},\mathcal{C})$ as a function of the configuration $\{\mathcal{A},\mathcal{C}\}$. Then, we can substitute $\hat{\bm{\theta}}(\mathcal{A},\mathcal{C})$ back into Eq. (29) to obtain an objective function over the configuration $\{\mathcal{A},\mathcal{C}\}$, which can then be optimized over the remaining variables to find all the MAP estimators.

The logarithm of the posterior distribution in Eq. (29), which can be maximized to find the MAP estimators, is given by

\log P(\mathcal{A},\mathcal{C},\bm{\theta}|\mathcal{D}) = \sum_{k=1}^{K}\left[t_{k}\log\alpha_{k}+(S_{k}M_{k}-t_{k})\log(1-\alpha_{k})\right]
 + \sum_{k=1}^{K}\left[p_{k}\log\beta_{k}+(S_{k}M_{k}^{\ast}-p_{k})\log(1-\beta_{k})\right]
 + \sum_{k=1}^{K}S_{k}\log\pi_{k}
 + \sum_{k=1}^{K}\left[M_{k}\log\rho_{k}+M_{k}^{\ast}\log(1-\rho_{k})\right]. \qquad (30)

Maximizing over the model parameters $\bm{\theta}$ gives the estimators

\hat{\alpha}_{k} = \frac{t_{k}}{S_{k}M_{k}}, \qquad (31)
\hat{\beta}_{k} = \frac{p_{k}}{S_{k}M_{k}^{\ast}}, \qquad (32)
\hat{\pi}_{k} = \frac{S_{k}}{S}, \qquad (33)
\hat{\rho}_{k} = \frac{M_{k}}{\binom{N}{2}}. \qquad (34)

These estimators can be derived using the method of Lagrange multipliers, with the normalization constraint $\sum_{k=1}^{K}\pi_{k}=1$ imposed on the group assignment probabilities.
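As a brief sketch of this routine calculation, the stationarity condition for the cluster probabilities reads

\frac{\partial}{\partial\pi_{k}}\left[\sum_{k'=1}^{K}S_{k'}\log\pi_{k'} + \lambda\Big(1-\sum_{k'=1}^{K}\pi_{k'}\Big)\right] = \frac{S_{k}}{\pi_{k}} - \lambda = 0,

so that $\pi_{k}=S_{k}/\lambda$; the constraint $\sum_{k}\pi_{k}=1$ together with $\sum_{k}S_{k}=S$ fixes $\lambda=S$ and gives $\hat{\pi}_{k}=S_{k}/S$, while the estimators $\hat{\alpha}_{k}$, $\hat{\beta}_{k}$, and $\hat{\rho}_{k}$ follow from analogous unconstrained maximizations. Finally, substituting these estimators back into Eq. (30) and using $t_{k}=S_{k}M_{k}-n_{k}$, we find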

\log P(\mathcal{A},\mathcal{C},\hat{\bm{\theta}}(\mathcal{A},\mathcal{C})|\mathcal{D}) = -\mathcal{L}(\mathcal{D}). \qquad (35)

Thus, maximizing the left-hand side of Eq. (35) to find the (profiled) MAP estimators of this Bayesian model is equivalent to minimizing the description length $\mathcal{L}(\mathcal{D})$. Further, with $\bm{\theta}=\hat{\bm{\theta}}$, the first two lines of Eq. (30) correspond to $\mathcal{L}(\mathcal{D}|\mathcal{A},\mathcal{C})$, the third line corresponds to $\mathcal{L}(\mathcal{C})$, and the last line to $\mathcal{L}(\mathcal{A})$, which establishes a direct mapping between the likelihood, the priors, and each part of the multi-part encoding. This correspondence unifies the Bayesian and MDL formulations of network population clustering and explains the good reconstruction performance seen in the experiments of the main text mackay2003information ; grunwald2007minimum .

Appendix E Additional related methods for clustering network populations

The two algorithms proposed in this paper are not the first methods capable of clustering network populations. Indeed, as discussed in the Introduction, there are a variety of existing methods that have tackled this problem using various modeling assumptions. However, directly making an apples-to-apples numerical comparison of the performance of these methods with our own is difficult for a number of reasons. Firstly, as mentioned in the Introduction, most of these methods do not have a unified model selection criterion that allows for automatic inference of the number of clusters. Therefore to make a completely fair numerical comparison with our own methods, we would either have to fix the number of clusters in all methods being compared—removing the additional obstacle of model selection which is one of the key innovations of our method—or add an ad hoc regularization term/Bayesian prior to the existing methods, to which their output may be highly sensitive mantziou2021 . Additionally, class labels of real-world network datasets are often arbitrary and uncorrelated with the structure of the data peel2017ground , so any purely numerical comparison of unsupervised clustering algorithms for network data using label recovery in real-world examples will be ill-posed peixoto2022implicit . Recovery of synthetic clusters, as described in the experiment in Sec. III.1, can serve as a well-posed task for testing the relative performance of existing algorithms if the data can be generated from the model corresponding to each algorithm peixoto2022implicit , but many existing methods do not have explicit generative models and fitting them to other synthetic data may result in severe model misspecification spanos1986statistical .

With these caveats in mind, we can assess how our proposed algorithm performs relative to similar existing methods by applying other algorithms to the same synthetic discontiguous reconstruction task as in Fig. 2. (No directly comparable methods exist for the temporally contiguous compression task.) Since alternative layer and population clustering algorithms also seek similar structural regularities in the data, we would expect them to perform well at reconstructing the same synthetic benchmark used to verify the performance of the proposed method. In Fig. 13, we show the results of these experiments, comparing the reconstruction performance of our method with the methods of YKN22Clustering (denoted "Gibbs"), de2015structural (denoted "MultiRed"), and a naive baseline (denoted "Jaccard") in which layers are clustered by applying the Ward hierarchical linkage criterion (used in de2015structural ) to the pairwise matrix of Jaccard distances among the layers' edge sets, as sketched below. These methods are chosen for comparison as they are the most similar to the proposed method: the method of YKN22Clustering tackles a similar inference problem but from a Bayesian perspective and with a different prior on the mode densities; the method of de2015structural aims to compress the information in the network layers but uses the Jensen–Shannon divergence between network Laplacians; and the naive baseline also uses the overlap of edge positions among networks to identify structurally cohesive clusters.
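For reference, the naive Jaccard baseline can be implemented in a few lines; the sketch below assumes each layer is stored as a set of node-pair tuples and uses SciPy's hierarchical clustering routines (the function name and the choice to cut the dendrogram at a fixed number of clusters are illustrative assumptions).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def jaccard_ward_clusters(edge_sets, n_clusters):
    """Jaccard baseline: Ward linkage on pairwise Jaccard distances between
    the layers' edge sets, cut into n_clusters clusters."""
    S = len(edge_sets)
    D = np.zeros((S, S))
    for a in range(S):
        for b in range(a + 1, S):
            union = len(edge_sets[a] | edge_sets[b])
            inter = len(edge_sets[a] & edge_sets[b])
            D[a, b] = D[b, a] = 1.0 - inter / union if union else 0.0
    Z = linkage(squareform(D, checks=False), method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```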

None of these methods constructs a single unweighted representative modal network for each cluster as our method does, so to compute the description lengths for each method we apply the greedy mode update algorithm of Appendix A to the clusters of network layers it identifies and evaluate the resulting configuration of modes and clusters using Eq. (15). However, since no principled means exists to construct comparable modal networks for the alternative methods without imposing these assumptions, we do not compare mode reconstruction performance in this experiment (panel (a) in Fig. 2). We have also given a significant advantage to the alternative algorithms by fixing their inferred number of clusters to the correct value $K=2$, whereas our algorithm must learn the value of $K$ automatically. All algorithms identify the planted clusters well on the synthetic benchmark, but our algorithm performs substantially better for intermediate flip probabilities despite needing to identify the correct value of $K$ (Fig. 13a). Since our method directly minimizes the description length used to evaluate the algorithms, it also has an edge in terms of compression (Fig. 13b).

Figure 13: Comparison of reconstruction performance across alternative layer clustering approaches. (a) Partition distance and (b) difference in mean inverse compression ratio (Eq. (17)) relative to the method of this paper. Each data point is an average over 50 realizations of the population for the corresponding value of the flip probability. Error bars correspond to three standard errors in the mean.

Below we provide a further qualitative comparison between our methods and existing methods to demonstrate the substantial innovations and benefits of our proposed algorithms:

  • A number of recent methods propose to cluster populations of networks by imposing structure on the clusters with explicit network models such as stochastic block models, exponential random graphs, or latent space models stanley2016clustering ; mantziou2021 ; yin2022finite ; durante2017nonparametric . This approach allows the user to simultaneously infer the clusters of networks as well as the model parameters for each cluster. While the model parameters within each cluster can be useful for identifying common structure among its constituent networks, any explicit model will necessarily miss some important structure (for instance, transitivity is often absent from parametric network models). Explicit model inference also often requires a complex and time-intensive within-cluster estimation scheme, such as expectation-maximization, to identify stochastic block model communities and model parameters at the network-cluster level. In contrast, our method takes a non-parametric approach to the representative cluster summaries and infers a single representative modal network for each cluster using a simple greedy edge removal approach, constrained only in its density, since the density affects the information content of the mode.

  • Other methods perform network change point detection (peel2015detecting and peixoto2018change ), which is comparable only to the contiguous partitioning objective we consider in this paper. The information-theoretic methods of this paper, by contrast, unify the discontiguous and contiguous clustering objectives, modifying only a single term of the encoding to accommodate the more restrictive setting of the latter problem. The contiguous clustering method of this paper also has the benefit of achieving exact optimization (up to the greedy mode updates) in polynomial time. Change point detection methods may also allow for significant but slow temporal variation between change points, resulting in very heterogeneous clusters, since they only look for substantial jumps in network structure across time.

  • The methods of nielsen2018multiple , wang2019joint , and arroyo2021inference use joint embeddings of network populations. These methods use a single vector to model each node's position in a latent space across all networks in the population, while the loadings of the latent space decomposition may change between clusters to capture heterogeneity across network layers. These methods have advantages and disadvantages similar to those of the methods stanley2016clustering ; mantziou2021 ; yin2022finite ; durante2017nonparametric discussed above when compared with our methods.

  • In la2016gibbs , a Gibbs distribution approach is used to model network populations, where the Hamiltonian determining the properties of the ensemble is based on a user-specified graph distance to the modal network. This approach, like our own, infers an entire graph for each mode, giving it the flexibility to capture a range of possible network structures. On the other hand, this method requires the user to specify a graph distance of interest, of which there are many to choose from and to which the results are sensitive, in contrast with our method, which requires no user input. The same downside arises when using the Fréchet mean lunagomez2020modeling to define the modal network of a cluster.

  • The method of YKN22Clustering uses Gibbs sampling to construct posterior distributions over modal edge probabilities and cluster assignments. As with some other methods discussed, this method requires the user to specify an ad hoc prior on the modal structures and to re-run the time-consuming sampling process for multiple values of $K$ to infer the optimal number of clusters. The particular prior used in YKN22Clustering also has the disadvantage of coupling all the modes together with respect to their densities, whereas our transmission scheme treats all clusters independently.

  • The method in de2015structural reduces multilayer networks into a smaller set of aggregated layers using the network Jensen–Shannon divergence as a measure of information redundancy. While conceptually somewhat similar to our approach in that it is grounded in information theory, this method does not explicitly construct parsimonious modal structures describing the clusters, and instead sums all layers within a cluster. It also performs pairwise comparisons using the Jensen–Shannon divergence—rather than group-wise comparisons as done, for example, in kirkley2020information —which may result in biased estimates of information redundancy compared with our method. This could potentially explain why the method returns many more clusters than would be optimal for our method (see Sec. III.2). Finally, this method requires computing the layers' Laplacian eigenspectra, which results in a significant increase in computation time compared with our method, even when the spectra are approximated with polynomial expansions.

  • The approach used in rojas_multiscale_2021 , like some other multilayer community detection methods, clusters a multilayer network into communities of nodes within and across layers, so it is not directly comparable to our method. It does, however, provide a complementary perspective to the analysis in Sec. III.3.

While the algorithms discussed above all perform well on certain aspects of multilayer network and network population clustering, the methods of this paper have two significant advantages over all of them. The first is that these existing methods do not have a natural criterion for specifying the number of clusters $K$; model selection must therefore be addressed with ad hoc or approximate regularization methods (e.g., arbitrary information criteria or sparse finite mixtures) instead of the principled automatic penalization of extra modes we achieve through the minimum description length formulation. As discussed in mantziou2021 , the choice of penalty for the number of clusters significantly impacts the results, so this model selection is a critical innovation of our method. Second, our proposed dynamic programming scheme for solving the restricted problem of contiguous layer clustering is unique among existing methods, which do not consider the restriction to contiguous clusters.