Identifying stable communities in Hi-C data using a multifractal null model

Lucas Hedström lucas.hedstrom@umu.se Integrated Science Lab, Department of Physics, Umeå University, Umeå, Sweden Antón Carcedo Martínez Integrated Science Lab, Department of Physics, Umeå University, Umeå, Sweden Ludvig Lizana Integrated Science Lab, Department of Physics, Umeå University, Umeå, Sweden

(October 1, 2025)

Abstract

Chromosome capture techniques like Hi-C have expanded our understanding of mammalian genome 3D architecture and how it influences gene activity. To analyze Hi-C data sets, researchers increasingly treat them as DNA-contact networks and use standard community detection techniques to identify mesoscale 3D communities. However, there are considerable challenges in finding significant communities because the Hi-C networks have cross-scale interactions and are almost fully connected. This paper presents a pipeline to distill 3D communities that remain intact under experimental noise. To this end, we bootstrap an ensemble of Hi-C datasets representing noisy data and extract 3D communities that we compare with the unperturbed dataset. Notably, we extract the communities by maximizing local modularity (using the Generalized Louvain method), which considers the multifractal spectrum recently discovered in Hi-C maps. Our pipeline finds that stable communities (under noise) typically have above-average internal contact frequencies and tend to be enriched in active chromatin marks. We also find they fold into more nested cross-scale hierarchies than less stable ones. Apart from presenting how to systematically extract robust communities in Hi-C data, our paper offers new ways to generate null models that take advantage of the network’s multifractal properties. We anticipate this has a broad applicability to several network applications.

I Introduction

Mammalian genomes fold into complex 3D structures that help regulate genetic processes such as transcription, DNA replication and repair, and epigenetics [1, 2, 3, 4, 5]. Several of these insights derive from chromosome capture techniques, such as Hi-C [6, 7, 8], that quantifies the number of physical contacts between all DNA segment pairs in the genome across a cell population. Data from these experiments allowed constructing comprehensive chromosome-wide 3D interaction maps, highlighting local 3D structures with high internal frequencies, such as Topologically Associated Domains (TADs) and the large-scale binary division into A/B compartments generally associated with active (A) and inactive (B) chromatin [9, 10].

However, these 3D structures represent but two instances on a wide spectrum of structures that fold into each other, forming complex, often nested, hierarchies [11, 12, 13, 14, 15]. These hierarchies make it difficult to design robust TAD-finding algorithms (”TAD callers”) and for the field to agree on common TAD definitions [16, 17]. This ambiguity is likely one reason there exists a large collection of TAD callers [18], each associated with their own parameters and arbitrary choices. In practice, this means that two callers may disagree on the TAD boundaries using the same dataset, let alone two datasets deriving from theoretically identical replicate experiments, including experimental and biological noise [19].

To help resolve some of these issues, we recently proposed a method to extract robust 3D communities in Hi-C data [20]. In that paper, we used a network community detection algorithm (Generalized Louvain) to generate an ensemble of feasible network partitions from a public Hi-C data set [21] and quantified the most conserved node-community relationships. This approach distinguishes robust from variable communities and tracks how the robustness changes with the network scale (and chromatin state). We found that robustness is highly scale-dependent and typically worse for small communities, including TADs. When robustness is low, there are many ways to split the network into mathematically feasible community partitions, which is likely one reason TAD callers disagree, as each focuses on slightly different data features [22].

This paper extends our study [20] by addressing how community robustness changes under experimental noise rather than variability in the community detection method. To this end, we first estimate noise levels from actual Hi-C maps (from humans [21]) and construct an ensemble of blurred maps by adding noise to each contact (or pixel). We call these bootstrapped maps [23]. Next, we extract 3D communities by considering the Hi-C data as a network and using the Generalized Louvain algorithm. Finally, we quantify the overlap between the bootstrapped maps relative to the unperturbed map. We find that stable communities typically have above-average internal contact frequencies, tend to be enriched in active chromatin marks, and fold into more nested cross-scale hierarchies than less stable ones.

Refer to caption — Figure 1: A schematic showcasing the methodology that computes the community stability for a Hi-C network. Using the original Hi-C data, we bootstrap by sampling data from the contact distributions associated with each distance (Fig. 2). We then run a community detection algorithm on the original Hi-C data to get the original partition (GenLouvain). To reduce the number of samples required for good statistics, we use the original partition as an initial partition for GenLouvain. Each sample gives a list of partitions, each with unique communities, shown as boxes along the diagonal colored by the community number ID. Based on the overlap (calculated with the Jaccard index), we count the number of times each community in the original Hi-C data appears in the bootstrapped data, which we use as a proxy for stability. (rightmost panel) Original partition, where a darker color indicates communities with higher stability.

In addition, we made yet another improvement that reaches beyond Hi-C data. This improvement takes advantage of the network’s multifractal spectrum [24], in a way we anticipate useful for other community detection approaches. When using Generalized Louvain, or other methods resting on maximum modularity [25], users must specify a background network connectivity or null model. Simple null models like the random Newman-Girvan model [26] are common choices in network science due to limited alternatives. However, this model does not apply to Hi-C data since it disregards the spatial constraints associated with the DNA segments (represented as nodes) belonging to a long linear molecule. One way to deal with this specific problem is to introduce a distant-dependent null model, where the contacts decay as a power law (as discovered in Hi-C) [14, 27]. But as the power-law exponent changes with distance, too, (from $\approx$ 0.75 to $\approx$ 1.08 [28]) this introduces a new arbitrary parameter as to when to switch exponent. While we explored both exponents in [20], here we remove this arbitrary choice by implementing a one-parameter null model that we fit to the Hi-C map’s bifractal spectrum using techniques from [24]. This approach is not limited to Hi-C data and should apply to any complex network.

II Methods

Our computational method has several steps (Fig. 1). First, we generate bootstrapped samples representing noisy Hi-C data. Second, we find communities using a method that considers the bifractal structure in Hi-C data. Third, we step through the pipeline to determine the stable communities that are invariant under noise.

II.1 Collecting and preprocessing experimental data

Chromosome contacts. We downloaded Hi-C data for the B-lymphoblastoid human cell line (GM12878) [21] from the GEO database (MAPQG0 dataset, 100 kb resolution) [29]. We only consider intra-chromosome contacts in our analysis. We interpret the data file as a weighted network (”Hi-C network”) with elements $A_{ij}$ in sparse form, where each node represents 100 kb of DNA, and the link weight is the measured contact count.

Before constructing the Hi-C network, we normalize the data using the standard Knight-Ruiz matrix balancing algorithm [30] so that the sum of all weights to a node equals unity. Note that the centromere (with no measured contact counts) must be removed before KR-normalization to avoid convergence issues. However, since the indices on the Hi-C matrix represent physical distances, we reintroduce the centromere before performing community detection (see Sec. II.3).

Chromatin states. To relate our results to different functional genomic regions, we downloaded a dataset from ENCODE [31] of 13 different chromatin types created from a multivariate hidden Markov model (denoted HMM states) [32, 33]. To reduce the number of states, we grouped the data into five effective categories: Promoters, Enhancers, Transcribed regions, Heterochromatin, and Insulators (see App. C for complete definitions in terms of the standard HMM states). Moreover, this data set allows us to calculate the chromatin state’s folds of enrichment (FE) for each Hi-C bin. We calculate enrichment relative to the chromosome-wide average (App. C).

II.2 Bootstrapping networks

To estimate community stability across replicates, we bootstrapped data that simulated noise from a typical Hi-C experiment. This noise comes from multiple sources [19], such as randomness of ligations between loci [34] or random contacts between loci due to fluctuations [6]. In Fig 2, we show histograms of the logged Hi-C contacts $\log(A_{ij})$ for four fixed distances in chromosome 10. We note that the distributions at all these distances are log-normal, albeit with their unique mean and standard deviation We use these empirical distributions as a proxy for experimental noise.

To generate bootstrapped Hi-maps with noise, we calculate the standard deviation $\sigma_{d}$ of the logged contacts for a fixed distance $d=|i-j|$ as

\sigma_{d}=\sqrt{\langle\log^{2}(A_{ij})\rangle_{d}-\langle\log(A_{ij})\rangle_{d}^{2}},

(1)

where $\langle\ldots\rangle_{d}$ denotes the average at the distance $d$ . Next, we bootstrap new noisy contacts $A_{ij}^{\prime}$ for each bin using

A_{ij}^{\prime}=A_{ij}\times e^{N(0,(\epsilon\sigma_{d})^{2})}

(2)

where $\epsilon$ is the noise amplitude, and $N(0,(\epsilon\sigma_{d})^{2})$ denotes the normal distribution with mean $0$ and variance $(\epsilon\sigma_{d})^{2}$ . As shown below, we tune the amplitude $\epsilon$ to not completely disrupt the chromosome’s community structure (e.g., the TADs), yet yield maps with realistic noise.

II.3 Detecting communities using a multifractal null model

To find Hi-C network communities, or ”3D communities” [14], we use the Generalized Louvain method (GenLouvain) implemented in MATLAB [35]. GenLouvain searches for network partitions that maximize the modularity function $Q$ , capturing local deviations from the expected background connectivity or null model. In its general form, the modularity is expressed as

Q=\frac{1}{2m}\sum_{i\neq j}\left(A_{ij}-\gamma P_{ij}^{(0)}\right)\delta_{C_{i},C_{j}}

(3)

where $A_{ij}$ are entries in the weighted adjacency (Hi-C) matrix, $m$ is the total weight of this matrix, $\gamma$ is a scale parameter setting the effective community size, $C_{i}$ is node $i$ ’s community assignment, and $P_{ij}^{(0)}$ is the null model. The last factor, $\delta_{ij}$ , denotes the Kronecker-delta function and indicates that the sum calculates the partial sum of all local modularities within a community.

To calculate $Q$ and identify communities, one must specify a null model that captures the characteristic contact patterns in the data. The most common choice is random connections, known as the Newman-Girvan null model [26], but it does not account for structural aspects associated with long DNA polymer chain folded in 3D inside the cell nucleus [36, 37]. Previous work generalized the Newman-Girvin model to include a power-law with linear node separation $d$ as observed in Hi-C maps and predicted by theoretical polymer models [14]. However, the power-law exponent tends to change with distance [6, 28]. For example, Within TADs, the contacts decay on average as $d^{-0.75}$ , whereas it is closer to $d^{-1.08}$ for longer distances. Therefore, if using a single decay function as the null hypothesis, one must introduce an arbitrary cutoff at which the decay exponent should change. To circumvent this problem, we propose another null model embracing the multifractal properties of Hi-C maps [24]. Admittedly, this model also has a parameter, but the fitting to Hi-C data is more systematic.

The model reads

P_{ij}^{(0)}=2m\frac{H_{a}(i,j)}{\sum_{i,j}H_{a}(i,j)}.

(4)

where $H_{a}(i,j)$ represents the so-called Hierarchical domain model (HDM) from [24], calculated as

		$\displaystyle H_{a}(i,j)=$
		$\displaystyle\left\{\begin{aligned} &\frac{1}{2^{n}-d}\sum_{k=0}^{n-1}2^{n}g(d2^{k-n})\frac{a^{k}b^{1-\delta_{kn}}}{4^{\max(n-k-1,0)}}\,\,{\rm if}\,d<2^{n}\\ &\sum_{k=0}^{n-1}2^{n}g(2^{k}-2^{k-n})\frac{a^{k}b^{1-\delta_{kn}}}{4^{\max(n-k-1,0)}}\,\,{\rm otherwise}\end{aligned}\right.$		(5)

where

g(x)=\left\{\begin{aligned} &x\,&{\rm if}\,0\leq x\leq 0.5\\ &1-x\,&{\rm if}\,0.5\leq x\leq 1\\ &0\,&{\rm otherwise.}\end{aligned}\right.

(6)

Here, $n=\lfloor\log_{2}(N)\rfloor$ , $N$ is the size of the Hi-C matrix, $a$ and $b$ are fitting paramters ( $b=0.5-a$ ). We use a matrix formulation of $H_{a}(i,j)$ and calculate the mean contact probability at each distance $d=|i-j|$ , serving as $P_{ij}^{(0)}$ (see App. A). Finally, we fitted the parameter $a$ against our Hi-C data using least-squares and found $a=0.464(3)$ with minor variation between chromosomes. This value model accurately yields the decay exponent in Hi-C contact maps at all scales (Fig. 8).

II.4 Reducing community detection variance

Since the GenLouvin algorithm tries to optimize the modularity function in a Monte Carlo-like fashion, it often finds different local minima, producing slightly different partitions with similar quality [20] (also discussed in App. A). To reduce the number of samples required to get good statistics, we use the partition detected in the original data as the starting partition and then generate a collection of bootstrapped samples, see Fig. 1. This procedure reduces the number of required samples since we do not have to account for varying initial conditions.

III Results

III.1 Fine-tuning noise amplitudes $\epsilon$

To find realistic noise levels $\epsilon$ when bootstrapping our noisy Hi-C maps, we benchmarked against measured TAD stabilities between replicate experiments [16]. These experiments found the overlap of TAD boundaries between replicates to be 62%. We generated communities using GenLouvain, where we set the scale parameter $\gamma$ to achieve an effective community size $\hat{s}\sim 0.88$ Mb that best agrees with the TADs considered in [16] (see App. B). Then we generated sets of communities for $n_{\rm samp}$ bootstrapped samples for different values of $\epsilon$ . To calculate the TAD boundary similarity between two sets of community boundaries $b_{i}$ and $b_{j}$ from two bootstrapped samples $i$ and $j$ , we use the Jaccard index

J_{\rm TAD}(b_{i},b_{j})=\frac{|b_{i}\cap b_{j}|}{|b_{i}\cup b_{j}|},

(7)

which varies between 1 (identical) and 0 (dissimilar).

However, Eq. (7) has a few problems when using it to quantify boundary overlaps. For example, under extensive noise, the communities shrink until each node is its community. This situation yields $J_{\rm TAD}(b_{i},b_{j})\sim 1$ , indicating a large overlap between community borders, but where the partitions no longer represent TAD-sized communities. To solve this problem, we multiply the Jaccard index in Eq. (7) by a ”penalty factor” containing the fraction of the number of communities in the bootstrapped sample compared to the original. This gives

J_{\rm TAD}^{*}(b_{i},b_{j})=\frac{|b_{i}\cap b_{j}|}{|b_{i}\cup b_{j}|}\times\left(1-\frac{||b_{i}\cup b_{j}|-|b_{o}||}{|b_{i}\cup b_{j}|}\right)

(8)

where $b_{o}$ are the TAD boundaries calculated from the original dataset without noise.

To better understand how the penalty factor works, consider two illustrative examples. One is where the bootstrapped partitions contain the same number of communities as the original dataset but shifted along the chromosome. Here, the overlap is low, making the term $|b_{i}\cap b_{j}|/|b_{i}\cup b_{j}|$ small. Another case is when the bootstrapped samples have many small communities (in the extreme case, where every node represents a separate community). Then, the overlap between bootstrapped samples will be large, indicating a high $J_{\rm TAD}(b_{i},b_{j})$ , but the number of communities compared to the original partition differs significantly. Therefore, the last term $||b_{i}\cup b_{j}|-|b_{o}||/|b_{i}\cup b_{j}|$ approaches $0$ , leading to small $J_{\rm TAD}^{*}(b_{i},b_{j})$ , suggesting poor overlap.

Finally, to provide a global measure of TAD stability in an ensemble of bootstrapped samples, we calculate the mean $J_{\rm TAD}^{*}(b_{i},b_{j})$ across all boundaries

J_{\rm TAD}^{*}=\frac{2}{n_{\rm samp}(n_{\rm samp}-1)}\sum_{b_{i}\neq b_{j}}J_{\rm TAD}^{*}(b_{i},b_{j}),

(9)

where $n_{\rm samp}$ is the number of bootstrapped samples.

We show $J_{\rm TAD}^{*}$ over a range of noise amplitudes $\epsilon$ across all chromosomes in Fig. 3. As expected, we find perfect overlap ( $J_{\rm TAD}^{*}\rightarrow 1$ ) when the noise level becomes low ( $\epsilon\rightarrow 0$ ). Also, as $\epsilon$ increases, the overlap reduces below the experimental benchmark (0.62, dashed horizontal line) until reaching a plateau where each node represents separate communities. Using the 0.62-line, we may find numerically the critical noise amplitude $\epsilon^{*}$ for each chromosome. We note that $\epsilon^{*}$ varies between $\sim 0.05-0.15$ . This parameter presents an analog to stability since a higher $\epsilon^{*}$ also indicates stronger resilience to noise, as the original TADs are still extractable from the dataset. From Fig. 3, we choose $\epsilon^{*}=0.11$ as the critical noise amplitude for all chromosomes (vertical dashed line).

III.2 Identifying stable communities

Once we tuned the noise amplitude, we generated several bootstrapped Hi-C maps (see Sec. II.2) and extracted the parts of the community partition that stayed consistent under noise relative to the original dataset. The specific procedure reads as follows (inspired by [23]):

•

Get the original community partition $P_{o}$ .
•

Generate a $n_{\rm samp}$ bootstrapped samples from the original dataset.
•

Get communities from each bootstrapped sample $P_{n}$ , based on the starting partition $P_{0}$ .
•

Identify the communities from the original partition $P_{o}$ that are also in the bootstrapped samples. To do this, we calculate the Jaccard indices between all community pairs in each sample and pick the ones with the largest Jaccard index.
•

Filter all pairs with a Jaccard index smaller than some cutoff $J_{\rm cutoff}$ and calculate the fraction of bootstrapped samples that each community in $P_{o}$ appears in.

Based on these steps,we calculate the stability fraction $S_{C_{i}}$ , for community $C_{i}$ in the original partition $P_{o}$ as

S_{C_{i}}=\frac{1}{n_{\rm samp}}\sum_{n=1}^{n_{\rm samp}}\theta\left(\max_{C_{j}\in P_{n}}J(C_{i},C_{j})-J_{\rm cutoff}\right),

(10)

Here, $\theta(\cdot)$ denotes the Heaviside step function, and $J(C_{i},C_{j})$ is the Jaccard index between two communities as formulated in Eq. (7) but not specific to TADs.

The cutoff $J_{\rm cutoff}$ is a free parameter. We set it to 1.0, meaning that we define a community $C_{i}$ as stable if all of its nodes remain in the original dataset and the bootstrapped samples. After setting this cutoff, we generated several bootstrapped Hi-C maps and calculated the stability fraction $S_{C_{i}}$ (Eq. (10)) for each community $C_{i}$ associated with the original dataset ( $P_{o}$ ). As before, we set $\gamma$ to get the effective community size $\hat{s}\sim 0.88Mb$ for each chromosome.

We show the distributions of the stabilities $S_{C_{i}}$ as boxplots in Fig 4, one box per chromosome, sorted by the $S_{C_{i}}$ medians. By reading the chromosome numbers and the x-axis, we note no strong correlation between chromosome size and stability, as both large and small chromosomes appear on either end. To compare this result to experiments, we note that the same work that calculated the median TAD boundary similarity to be 62% also used an advanced similarity metric TADsim to measure the structural similarity of TAD sets between replicates [16]. They found this overlap to be lower, about 50% between replicates. This is close to our median stability between all chromosomes, which turns out to be $\sim 0.4$ .

Until this point, we studied each chromosome separately. Below, we aggregate all the data and consider genome-wide averages. We also used $n_{\rm samp}=1000$ bootstrapped samples per chromosome to calculate the stability, which we bin to a resolution of 1%.

III.3 Understanding stability and community connectedness

The boxplots and the distribution in Fig. 4 foreshadow that some of the communities have a more resilient community structure than others under noise. Here, we ask how much this stability depends on the local structure of the underlying network. In particular, we study the communities’ internal node strength. This metric differs from general node strength—the sum of all edge weights of a node—by restricting the sum to nodes within a community. Node strength is a good predictor of, for example, node centrality in a fully connected network [38] or search times to a random node [39, 40, 41, 42].

However, in our case, the stability of a community is not related to how well the nodes are connected to other parts of the network. Instead, we quantify the internal community connectedness. We define this as the sum of all edge weights within a community (including self-loops) normalized by the total node strength

\hat{s}_{C_{i}}^{\rm int}=\frac{\sum_{j\in C_{i}}\sum_{i\in C_{i}}A_{ij}}{\sum_{j}\sum_{i\in C_{i}}A_{ij}},

(11)

where ”int” stands for internal node strength. Plotting $\hat{s}_{C_{i}}^{\rm int}$ for all communities across all chromosomes against the community stability $S_{C_{i}}$ (Eq. (10)), we find that high internal node strength suggests high community stability under noise (Fig. 5). This means that strong inter-connected communities are less affected by random noise during bootstrapping.

III.4 Relating community stability to chromatin states

The previous section studied how stability is connected to network structure. Next, we perform a similar analysis using biologically relevant markers: chromatin states. In particular, we ask if our stability metric calculated on TAD-sized communities is associated with the enrichment or depletion of active or inactive genomic regions, which could suggest underlying mechanisms that form strongly connected communities. For example, there are several indications of insulation enrichment at TAD borders and facilitated promoter-enhancer interactions within TADs [43, 44].

To study this, we used data that classify genomic regions into 15 different functional states (see Tab. 2), which we grouped into five categories (defined in App. C). We plotted the folds of enrichment (FE) of these five groups against community stability for TAD-sized communities ( $\hat{s}\sim 0.38$ Mb) in Fig. 6. For three of the groups associated with active genomic regions (promoters, enhancers, and transcribed regions), there is a clear positive relation between stability and FE. However, repressed regions (heterochromatin & repressive states and insulators) show no such correlation. This indicates that the active genomic regions are associated with strong stability towards noise or communities having strong internal connectivity.

III.5 Quantifying cross-scale nestedness of stable communities

Thus far, we have focused on the stability of communities of a particular size. However, by tuning the GenLouvain parameter $\gamma$ we can get communities of varying sizes. This allows us to study how communities at one scale nest with communities at larger scales. Such studies complement previous work finding that active chromatin congregates in communities with a significant hierarchical nestedness across community sizes $\hat{s}$ [15], in particular, hierarchical TADs [45, 46]. Inspired by this, we ask: do nodes in stable communities of one size stay within stable communities of a larger size? Or is community nesting related to stability?

To this end, we first calculated the community stability $S_{C}$ for different effective community sizes. We choose four commonly studied chromosome scales: TADs ( $\hat{s}\sim 0.38$ MB), A/B segments ( $\hat{s}\sim 1.04$ MB), A_1,2, B_1,2,3 structures ( $\hat{s}\sim 49$ MB) and A/B compartments ( $\hat{s}\sim 60$ MB) (see App. B). Next, we assign a node stability $s_{i}$ to each node $i$ , which we choose as the community stability $S_{C}$ . Thus, as $\gamma$ varies, we obtain $s_{i}$ values for each size $\hat{s}$ . The change in $s_{i}$ reflects how stability changes over different network scales.

We show the change in $s_{i}$ over three different scale transitions as a boxplot in Fig. 7 (a) (green boxes). For relatively small communities (around the size of TADs), we note that the stability difference is smaller in smaller communities. To make a fair comparison, we calculated the same difference between two randomly chosen nodes (orange boxes). These boxes show that the stability difference is smaller in the actual data (green) than in the random case, at least for smaller communities. This indicates that stable communities at one size are generally supported by stable communities at a larger size, indicating the cross-scale nestedness of stable communities.

However, this relationship holds best as communities become smaller. At large sizes, the communities occupy a considerable part of the chromosome, implying that randomization does not change much since the probability of selecting two random nodes from the same community is high. This does not necessarily imply that the stability is not nested at larger sizes but rather that a much larger partition sample is required to get good statistics.

To better illustrate how nested stability changes between two select chromosome scales ( $\hat{s}=0.38\mathrm{Mb}\to\hat{s}=1.04\rm{Mb}$ ), we plotted $s_{i}$ in an alluvial diagram (Fig. 7 (b)). To increase readability, we rounded the $s_{i}$ values to a precision of 0.2. We note that most flows from $\hat{s}=0.38\mathrm{Mb}$ to $\hat{s}=1.04\mathrm{Mb}$ go towards between nodes with similar stabilities or just above or below. This gives additional support to our conclusion that stability is nested across network scales. We also depict the chromatin state enrichment FE as bars to the left and right of the alluvial diagram (same data as in Fig. 6). We note that the FE levels for active chromatin (three leftmost bars) remain conserved across both scales for all six stability groups. The inactive chromatin shows similar conservation, but this like stem from them being uncorrelated to node stability (Fig. 6).

IV Discussion

Hi-C represents a promising method to uncover essential relationships between gene activity and chromosome 3D organization. However, extracting biologically meaningful 3D structures from Hi-C data sets poses several challenges. One challenge concerns the 3D contacts forming a nearly fully connected complex network with cross-scale interactions. This leads to inconsistent network partitions when comparing data clustering or TAD-finding methods [18, 13, 20]. Another challenge comes from experimental noise, where the same clustering method yields slightly different network partitions between theoretically identical replicate Hi-C experiments. Here, we propose a new method that systematically helps deal with the noise problem. We also propose a new null model for the distant-dependent average contact frequency that builds on Hi-C maps’ multifractal properties. Using our method, we find that stable 3D communities typically have a high internal connectedness, tend to be enriched in active chromatin marks, and have a more nested cross-scale hierarchy than less stable ones.

As mentioned previously and discussed before [15], genome organization does not likely follow a perfect hierarchical tree structure. This is partly due to scale-dependent folding mechanisms, such as loop extrusion and chromatin-chromatin interactions, which are critical in forming TADs and A/B compartments [47]. Our work focuses on extracting stable communities under noise and suggests new folding mechanisms that are worthy of further experimental inquiry. We base this conclusion on Fig. 7(b), where highly stable communities maintain chromatin state enrichment across chromosome scales. This foreshadows the same mechanisms being responsible for hierarchical folding in select parts of the 3D hierarchy.

We acknowledge that our results stem from a specific Hi-C data resolution (100kb). However, our method is not limited to this choice and can handle any resolution and Hi-C data set. In fact, we do not anticipate drastic changes to our findings because previous research finds persistent correlations between TAD boundaries and structure in the 40-100kb range [16]. When applying the approach in this paper to a different Hi-C resolution, it is essential to double-check the noise amplitude $\epsilon^{*}$ , as it is not a universal parameter.

Furthermore, $\epsilon^{*}$ also depends on the specific choice of community detection method as we calibrate against the communities associated with the unperturbed Hi-C data set. For example, if choosing one of the TAD-finding methods in [18]. Our pipeline also allows experimenters to assess how stable 3D communities are under noise from varying sampling depths [19]. This case likely requires a revised model of the noise (Sec. II.2) and subsequent recalibration of $\epsilon^{*}$ . Otherwise, the pipeline is the same.

We choose the GenLouvin algorithm to find communities. While it has known shortcomings [48, 20], it is one of the most widely used methods in network science because it is relatively simple and allows specifying a theoretical null. Here, we use a null model that takes advantage of the multifractal spectrum of the Hi-C map. We believe this approach is useful for community detection in any network setting representing a complex system because it provides a straightforward way of parameterizing an explicit null model that otherwise, in lack of better options, often becomes the random Newman-Girvan model. While there exist methods to construct multifractal networks (e.g., [49]), our approach addresses the inverse problem.

To summarize, our work expands the techniques employed for community detection in Hi-C data and introduces additional methods and metrics to extract communities that survive realistic noise levels. We anticipate having better access to robust 3D communities will help future research uncover causal connections between chromosome contact networks and genetic processes, such as gene expression and repression.

V Author contributions

L.L. and L.H. devised the study. L.H. and A.C.M. performed the analysis. L.H. made the visualizations and wrote the manuscript. All authors edited and accepted the manuscript in its final form.

VI Competing interests

The authors declare no competing interests.

Acknowledgements.

We thank Moa Lundkvist and Juhee Lee for providing valuable discussion and feedback. We also thank Moa Lundkvist for her help creating the alluvial plot in Fig. 7. We acknowledge financial support from the Swedish Research Council (grant no. 2017-03848). The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) and the Swedish National Infrastructure for Computing (SNIC) at High-Performance Computing Center North (HPC2N), partially funded by the Swedish Research Council through grant agreements no. 2022-06725 and no. 2018-05973.

Appendix A The hierarchical domain model as a null model

As discussed in Methods, the null model used for community detection has to be chosen to represent the underlying structure of the data. Mimicking the power-law dropoff in contact probability observed in Hi-C, the null model for Hi-C is commonly chosen to follow $\sim d^{-\alpha}$ , where $d=|i-j|$ is the distance between two bins on Hi-C and $\alpha$ is a scale parameter [14]. However, later work showed that to capture communities at different scales on Hi-C $\alpha$ had to be fine-tuned to avoid degenerate partitions, i.e. partitions that are very different in structure and quality [20]. To remove the need for fine-tuning a scale parameter, we look at recent work where the contact dropoff in Hi-C was fitted to a bifractal contact map [24].

In that paper, they devise a hierarchical domain model (HDM) as a more representative metric for the distance-dependent structure of Hi-C. In short, the algorithm starts from a $2\times 2$ matrix, which self-repeats $n$ times to a size $2^{n}$ . We use the analytical expression of this contact map to calculate the average contact probability dependent on the distance between two Hi-C bins show in Eq. (II.3). Since the Hi-C matrix of size $N$ is not necessarily a multiple of $2$ , we let $H_{a}(i,j)$ repeat for $|i-j|\geq 2^{n}$ , which is a constant value. We show $H_{a}(i,j)$ versus the genomic distance, along with commonly used contact probabilities [14] in Fig. 8 (a). By capturing the different fractal scales of the chromosomes using this null model, we remove the need to fine-tune the scale parameter $\alpha$ to capture communities at different sizes. Instead, we only fine-tuned the parameter $a$ to match our Hi-C data, which we calculated to have the mean $a=0.464(3)$ .

To evaluate the HDM null model against the power-law scaling based on the fractal globule assumption, we recreate some of the results in [20], shown in Fig. 8 (b). In short, the metric ${\rm MD}(d)$ quantifies the distance between partitions, and a large value indicates that the partitions between community detection runs are structurally significantly different. ${\rm CV}(Q)$ quantifies the coefficient-of-variation between partitions, where a large value indicates that the partition varies much in quality, indicating strong local minima. We see that ${\rm MD}(d)$ for the HDM null model falls somewhere in between the two power-law relationships. However, the HDM has on average a smaller variation in quality $Q$ , indicating that even if the partitions have a slightly larger distance between each other, they are comparatively similar in quality, indicating that there is little risk of getting stuck in local minima.

Appendix B Picking $\gamma$ for different sized communities on Hi-C

We picked the parameter $\gamma$ in the community detection pipeline to approximately represent structural elements at different scales on the Hi-C. We calculated the median effective community size of each partition for all chromosomes as in [20] and compared them against their values reported in the same paper. The parameter $\gamma$ and the corresponding structural scales are shown in Table. 1.

Table 1: The value of

\gamma

that reproduce certain scaled communities on the Hi-C. The sizes of the structures and how to calculate the characteristic community size can be found in [20]. Here we used 100kb data using

a=0.464

in the hierarchical domain model.

Structure	Size (Mb)	$\gamma$	Comm. size (Mb)
TADs	$0.33$	$1.75$	$0.38$
TADs	$0.88$	$1.4$	$0.90$
A/B segments	$1.1$	$1.35$	$1.04$
A1,…,B3	$27$	$0.55$	$49$
A/B compartments	$66$	$0.40$	$60$

Appendix C Preprocessing chromatin states to Hi-C bins

To compare our results of community stability on Hi-C with different functional regions on the genome, we used data dividing the genome into 15 separate functional divisions [32, 33], taken from the ENCODE database [31]. Due to the functional similarity between different states, we group the states into five groups. The different states, along with their names and groupings can be found in Table. 2 below.

Table 2: The 15 functional states identified on the genome grouped by their function [33, 32, 50]. Here we show the state numbers, their functional names and the groups which we combine the states into.

Group & name	HMM state	Name
(A) Promoters	S1	active promoter
(A) Promoters	S2	weak promoter
(B) Enhancers	S4	strong enhancer
	S5	strong enhancer
	S6	weak/poised enhancer
	S7	weak/poised enhancer
(C) Transcribed regions	S9	transcriptional transition
	S10	transcriptional elongation
	S11	weakly transcribed
(D) Heterochromatin & repressive states	S3	inactive/poised promoter
	S13	heterochromatin
	S14	repetitive/copy number variation
	S15	repetitive/copy number variation
(E) Insulators	S8	insulator

The dataset contains start and stop regions for each of the state. To map how much of each state is expressed in each Hi-C bin we calculate the folds of enrichment (FE) relative to the genome-wide average following the same procedure as in [50]

1.

We count the number of peaks $k_{X}$ per bin, where $X$ corresponds to one of the states. We only count the start of the peaks.
2.

We calculate the peak’s expected frequency value using the hypergeometric test as $k^{\prime}_{X}=K_{X}\times(n/N)$ . Here $n$ is the number of peaks of any state in a bin, $N$ is the total number of peaks per chromosome, and $K_{X}$ is the total number of peaks for the state $X$ .
3.

Using this, we calculate the folds of enrichment for each state $Z$ by dividing the observed peak by the expected $FE_{X}=k_{X}/k^{\prime}_{X}$ .

References

Dixon et al. [2016] J. R. Dixon, D. U. Gorkin, and B. Ren, Molecular cell 62, 668 (2016).
Schwartz and Cavalli [2017] Y. B. Schwartz and G. Cavalli, Genetics 205, 5 (2017).
Bonev and Cavalli [2016] B. Bonev and G. Cavalli, Nature Reviews Genetics 17, 661 (2016).
Denker and De Laat [2016] A. Denker and W. De Laat, Genes & development 30, 1357 (2016).
Marchal et al. [2019] C. Marchal, J. Sima, and D. M. Gilbert, Nature Reviews Molecular Cell Biology 20, 721 (2019).
Lieberman-Aiden et al. [2009] E. Lieberman-Aiden, N. L. Van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, B. R. Lajoie, P. J. Sabo, M. O. Dorschner, et al., Science 326, 289 (2009).
Sexton et al. [2012] T. Sexton, E. Yaffe, E. Kenigsberg, F. Bantignies, B. Leblanc, M. Hoichman, H. Parrinello, A. Tanay, and G. Cavalli, Cell 148, 458 (2012).
Dekker et al. [2013] J. Dekker, M. A. Marti-Renom, and L. A. Mirny, Nature Reviews Genetics 14, 390 (2013).
MacKay and Kusalik [2020] K. MacKay and A. Kusalik, Briefings in functional genomics 19, 292 (2020).
Liu et al. [2021] Y. Liu, L. Nanni, S. Sungalee, M. Zufferey, D. Tavernari, M. Mina, S. Ceri, E. Oricchio, and G. Ciriello, Nature communications 12, 1 (2021).
Weinreb and Raphael [2016] C. Weinreb and B. J. Raphael, Bioinformatics 32, 1601 (2016).
Fraser et al. [2015] J. Fraser, C. Ferrai, A. M. Chiariello, M. Schueler, T. Rito, G. Laudanno, M. Barbieri, B. L. Moore, D. C. Kraemer, S. Aitken, et al., Molecular systems biology 11, 852 (2015).
Cabreros et al. [2016] I. Cabreros, E. Abbe, and A. Tsirigos, in 2016 Annual Conference on Information Science and Systems (CISS) (IEEE, 2016) pp. 584–589.
Lee et al. [2019] S. H. Lee, Y. Kim, S. Lee, X. Durang, P. Stenberg, J.-H. Jeon, and L. Lizana, Scientific reports 9, 1 (2019).
Bernenko et al. [2022] D. Bernenko, S. H. Lee, P. Stenberg, and L. Lizana, bioRxiv (2022).
Sauerwald et al. [2020] N. Sauerwald, A. Singhal, and C. Kingsford, NAR genomics and bioinformatics 2, lqz008 (2020).
de Wit [2020] E. de Wit, Journal of molecular biology 432, 638 (2020).
Sefer [2022] E. Sefer, BMC bioinformatics 23, 127 (2022).
Yardımcı et al. [2019] G. G. Yardımcı, H. Ozadam, M. E. Sauria, O. Ursu, K.-K. Yan, T. Yang, A. Chakraborty, A. Kaul, B. R. Lajoie, F. Song, et al., Genome biology 20, 1 (2019).
Holmgren et al. [2023] A. Holmgren, D. Bernenko, and L. Lizana, Scientific Reports 13, 12979 (2023).
Rao et al. [2014] S. S. Rao, M. H. Huntley, N. C. Durand, E. K. Stamenova, I. D. Bochkov, J. T. Robinson, A. L. Sanborn, I. Machol, A. D. Omer, E. S. Lander, et al., Cell 159, 1665 (2014).
Eres and Gilad [2021] I. E. Eres and Y. Gilad, Trends in Genetics 37, 216 (2021).
Rosvall and Bergstrom [2010] M. Rosvall and C. T. Bergstrom, PloS one 5, e8694 (2010).
Pigolotti et al. [2020] S. Pigolotti, M. H. Jensen, Y. Zhan, and G. Tiana, Physical Review Research 2, 043078 (2020).
Sarnataro et al. [2017] S. Sarnataro, A. M. Chiariello, A. Esposito, A. Prisco, and M. Nicodemi, PLoS One 12, e0188201 (2017).
Newman and Girvan [2004] M. E. Newman and M. Girvan, Physical review E 69, 026113 (2004).
Yan et al. [2017] K.-K. Yan, S. Lou, and M. Gerstein, PLoS computational biology 13, e1005647 (2017).
Sanborn et al. [2015] A. L. Sanborn, S. S. Rao, S.-C. Huang, N. C. Durand, M. H. Huntley, A. I. Jewett, I. D. Bochkov, D. Chinnappan, A. Cutkosky, J. Li, et al., Proceedings of the National Academy of Sciences 112, E6456 (2015).
Edgar et al. [2002] R. Edgar, M. Domrachev, and A. E. Lash, Nucleic acids research 30, 207 (2002).
Knight and Ruiz [2013] P. A. Knight and D. Ruiz, IMA Journal of Numerical Analysis 33, 1029 (2013).
Raney et al. [2024] B. J. Raney, G. P. Barber, A. Benet-Pagès, J. Casper, H. Clawson, M. S. Cline, M. Diekhans, C. Fischer, J. Navarro Gonzalez, G. Hickey, et al., Nucleic Acids Research 52, D1082 (2024).
Ernst et al. [2011] J. Ernst, P. Kheradpour, T. S. Mikkelsen, N. Shoresh, L. D. Ward, C. B. Epstein, X. Zhang, L. Wang, R. Issner, M. Coyne, et al., Nature 473, 43 (2011).
Ernst and Kellis [2010] J. Ernst and M. Kellis, Nature biotechnology 28, 817 (2010).
Lajoie et al. [2015] B. R. Lajoie, J. Dekker, and N. Kaplan, Methods 72, 65 (2015).
Jeub et al. [2019] L. Jeub, M. Bazzi, I. Jutla, and P. Mucha, https://github.com/GenLouvain/GenLouvain (2011-2019).
Grosberg et al. [1988] A. Y. Grosberg, S. K. Nechaev, and E. I. Shakhnovich, Journal de physique 49, 2095 (1988).
Ghosh and Jost [2018] S. K. Ghosh and D. Jost, PLoS computational biology 14, e1006159 (2018).
Opsahl et al. [2010] T. Opsahl, F. Agneessens, and J. Skvoretz, Social networks 32, 245 (2010).
Nyberg et al. [2021] M. Nyberg, T. Ambjörnsson, P. Stenberg, and L. Lizana, Physical Review Research 3, 013055 (2021).
Hedström and Lizana [2023] L. Hedström and L. Lizana, New Journal of Physics 25, 033024 (2023).
Noh and Rieger [2004] J. D. Noh and H. Rieger, Physical review letters 92, 118701 (2004).
Tejedor et al. [2009] V. Tejedor, O. Bénichou, and R. Voituriez, Physical Review E 80, 065104 (2009).
Gong et al. [2018] Y. Gong, C. Lazaris, T. Sakellaropoulos, A. Lozano, P. Kambadur, P. Ntziachristos, I. Aifantis, and A. Tsirigos, Nature communications 9, 542 (2018).
Qu et al. [2019] J. Qu, G. Yi, and H. Zhou, Epigenetics & chromatin 12, 31 (2019).
Liu et al. [2022] E. Liu, H. Lyu, Q. Peng, Y. Liu, T. Wang, and J. Han, Communications Biology 5, 608 (2022).
Cresswell et al. [2020] K. G. Cresswell, J. C. Stansfield, and M. G. Dozmorov, BMC bioinformatics 21, 1 (2020).
Nuebler et al. [2018] J. Nuebler, G. Fudenberg, M. Imakaev, N. Abdennur, and L. A. Mirny, Proceedings of the National Academy of Sciences 115, E6697 (2018).
Fortunato and Hric [2016] S. Fortunato and D. Hric, Physics reports 659, 1 (2016).
Palla et al. [2010] G. Palla, L. Lovász, and T. Vicsek, Proceedings of the National Academy of Sciences 107, 7640 (2010).
Bernenko et al. [2023] D. Bernenko, S. H. Lee, and L. Lizana, Journal of Physics: Complexity 4, 035004 (2023).