This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Identifying stable communities in Hi-C data using a multifractal null model

Lucas Hedström lucas.hedstrom@umu.se Integrated Science Lab, Department of Physics, Umeå University, Umeå, Sweden    Antón Carcedo Martínez Integrated Science Lab, Department of Physics, Umeå University, Umeå, Sweden    Ludvig Lizana Integrated Science Lab, Department of Physics, Umeå University, Umeå, Sweden
(October 1, 2025)
Abstract

Chromosome capture techniques like Hi-C have expanded our understanding of mammalian genome 3D architecture and how it influences gene activity. To analyze Hi-C data sets, researchers increasingly treat them as DNA-contact networks and use standard community detection techniques to identify mesoscale 3D communities. However, there are considerable challenges in finding significant communities because the Hi-C networks have cross-scale interactions and are almost fully connected. This paper presents a pipeline to distill 3D communities that remain intact under experimental noise. To this end, we bootstrap an ensemble of Hi-C datasets representing noisy data and extract 3D communities that we compare with the unperturbed dataset. Notably, we extract the communities by maximizing local modularity (using the Generalized Louvain method), which considers the multifractal spectrum recently discovered in Hi-C maps. Our pipeline finds that stable communities (under noise) typically have above-average internal contact frequencies and tend to be enriched in active chromatin marks. We also find they fold into more nested cross-scale hierarchies than less stable ones. Apart from presenting how to systematically extract robust communities in Hi-C data, our paper offers new ways to generate null models that take advantage of the network’s multifractal properties. We anticipate this has a broad applicability to several network applications.

I Introduction

Mammalian genomes fold into complex 3D structures that help regulate genetic processes such as transcription, DNA replication and repair, and epigenetics [1, 2, 3, 4, 5]. Several of these insights derive from chromosome capture techniques, such as Hi-C  [6, 7, 8], that quantifies the number of physical contacts between all DNA segment pairs in the genome across a cell population. Data from these experiments allowed constructing comprehensive chromosome-wide 3D interaction maps, highlighting local 3D structures with high internal frequencies, such as Topologically Associated Domains (TADs) and the large-scale binary division into A/B compartments generally associated with active (A) and inactive (B) chromatin [9, 10].

However, these 3D structures represent but two instances on a wide spectrum of structures that fold into each other, forming complex, often nested, hierarchies [11, 12, 13, 14, 15]. These hierarchies make it difficult to design robust TAD-finding algorithms (”TAD callers”) and for the field to agree on common TAD definitions [16, 17]. This ambiguity is likely one reason there exists a large collection of TAD callers [18], each associated with their own parameters and arbitrary choices. In practice, this means that two callers may disagree on the TAD boundaries using the same dataset, let alone two datasets deriving from theoretically identical replicate experiments, including experimental and biological noise [19].

To help resolve some of these issues, we recently proposed a method to extract robust 3D communities in Hi-C data [20]. In that paper, we used a network community detection algorithm (Generalized Louvain) to generate an ensemble of feasible network partitions from a public Hi-C data set [21] and quantified the most conserved node-community relationships. This approach distinguishes robust from variable communities and tracks how the robustness changes with the network scale (and chromatin state). We found that robustness is highly scale-dependent and typically worse for small communities, including TADs. When robustness is low, there are many ways to split the network into mathematically feasible community partitions, which is likely one reason TAD callers disagree, as each focuses on slightly different data features [22].

This paper extends our study [20] by addressing how community robustness changes under experimental noise rather than variability in the community detection method. To this end, we first estimate noise levels from actual Hi-C maps (from humans [21]) and construct an ensemble of blurred maps by adding noise to each contact (or pixel). We call these bootstrapped maps [23]. Next, we extract 3D communities by considering the Hi-C data as a network and using the Generalized Louvain algorithm. Finally, we quantify the overlap between the bootstrapped maps relative to the unperturbed map. We find that stable communities typically have above-average internal contact frequencies, tend to be enriched in active chromatin marks, and fold into more nested cross-scale hierarchies than less stable ones.

Refer to caption
Figure 1: A schematic showcasing the methodology that computes the community stability for a Hi-C network. Using the original Hi-C data, we bootstrap by sampling data from the contact distributions associated with each distance (Fig. 2). We then run a community detection algorithm on the original Hi-C data to get the original partition (GenLouvain). To reduce the number of samples required for good statistics, we use the original partition as an initial partition for GenLouvain. Each sample gives a list of partitions, each with unique communities, shown as boxes along the diagonal colored by the community number ID. Based on the overlap (calculated with the Jaccard index), we count the number of times each community in the original Hi-C data appears in the bootstrapped data, which we use as a proxy for stability. (rightmost panel) Original partition, where a darker color indicates communities with higher stability.

In addition, we made yet another improvement that reaches beyond Hi-C data. This improvement takes advantage of the network’s multifractal spectrum [24], in a way we anticipate useful for other community detection approaches. When using Generalized Louvain, or other methods resting on maximum modularity [25], users must specify a background network connectivity or null model. Simple null models like the random Newman-Girvan model [26] are common choices in network science due to limited alternatives. However, this model does not apply to Hi-C data since it disregards the spatial constraints associated with the DNA segments (represented as nodes) belonging to a long linear molecule. One way to deal with this specific problem is to introduce a distant-dependent null model, where the contacts decay as a power law (as discovered in Hi-C) [14, 27]. But as the power-law exponent changes with distance, too, (from \approx0.75 to \approx1.08 [28]) this introduces a new arbitrary parameter as to when to switch exponent. While we explored both exponents in [20], here we remove this arbitrary choice by implementing a one-parameter null model that we fit to the Hi-C map’s bifractal spectrum using techniques from [24]. This approach is not limited to Hi-C data and should apply to any complex network.

II Methods

Our computational method has several steps (Fig. 1). First, we generate bootstrapped samples representing noisy Hi-C data. Second, we find communities using a method that considers the bifractal structure in Hi-C data. Third, we step through the pipeline to determine the stable communities that are invariant under noise.

II.1 Collecting and preprocessing experimental data

Chromosome contacts. We downloaded Hi-C data for the B-lymphoblastoid human cell line (GM12878) [21] from the GEO database (MAPQG0 dataset, 100 kb resolution) [29]. We only consider intra-chromosome contacts in our analysis. We interpret the data file as a weighted network (”Hi-C network”) with elements AijA_{ij} in sparse form, where each node represents 100 kb of DNA, and the link weight is the measured contact count.

Before constructing the Hi-C network, we normalize the data using the standard Knight-Ruiz matrix balancing algorithm [30] so that the sum of all weights to a node equals unity. Note that the centromere (with no measured contact counts) must be removed before KR-normalization to avoid convergence issues. However, since the indices on the Hi-C matrix represent physical distances, we reintroduce the centromere before performing community detection (see Sec. II.3).

Chromatin states. To relate our results to different functional genomic regions, we downloaded a dataset from ENCODE [31] of 13 different chromatin types created from a multivariate hidden Markov model (denoted HMM states) [32, 33]. To reduce the number of states, we grouped the data into five effective categories: Promoters, Enhancers, Transcribed regions, Heterochromatin, and Insulators (see App. C for complete definitions in terms of the standard HMM states). Moreover, this data set allows us to calculate the chromatin state’s folds of enrichment (FE) for each Hi-C bin. We calculate enrichment relative to the chromosome-wide average (App. C).

II.2 Bootstrapping networks

To estimate community stability across replicates, we bootstrapped data that simulated noise from a typical Hi-C experiment. This noise comes from multiple sources [19], such as randomness of ligations between loci [34] or random contacts between loci due to fluctuations [6]. In Fig 2, we show histograms of the logged Hi-C contacts log(Aij)\log(A_{ij}) for four fixed distances in chromosome 10. We note that the distributions at all these distances are log-normal, albeit with their unique mean and standard deviation We use these empirical distributions as a proxy for experimental noise.

To generate bootstrapped Hi-maps with noise, we calculate the standard deviation σd\sigma_{d} of the logged contacts for a fixed distance d=|ij|d=|i-j| as

σd=log2(Aij)dlog(Aij)d2,\sigma_{d}=\sqrt{\langle\log^{2}(A_{ij})\rangle_{d}-\langle\log(A_{ij})\rangle_{d}^{2}}, (1)

where d\langle\ldots\rangle_{d} denotes the average at the distance dd. Next, we bootstrap new noisy contacts AijA_{ij}^{\prime} for each bin using

Aij=Aij×eN(0,(ϵσd)2)A_{ij}^{\prime}=A_{ij}\times e^{N(0,(\epsilon\sigma_{d})^{2})} (2)

where ϵ\epsilon is the noise amplitude, and N(0,(ϵσd)2)N(0,(\epsilon\sigma_{d})^{2}) denotes the normal distribution with mean 0 and variance (ϵσd)2(\epsilon\sigma_{d})^{2}. As shown below, we tune the amplitude ϵ\epsilon to not completely disrupt the chromosome’s community structure (e.g., the TADs), yet yield maps with realistic noise.

II.3 Detecting communities using a multifractal null model

Refer to caption
Figure 2: Histogram of the logged Hi-C contacts AijA_{ij} at four distances for chromosome 10 (bars). Overlaying the histogram as solid lines, we depict log-normal distributions fitted with the mean and standard deviation from the data.

To find Hi-C network communities, or ”3D communities” [14], we use the Generalized Louvain method (GenLouvain) implemented in MATLAB [35]. GenLouvain searches for network partitions that maximize the modularity function QQ, capturing local deviations from the expected background connectivity or null model. In its general form, the modularity is expressed as

Q=12mij(AijγPij(0))δCi,CjQ=\frac{1}{2m}\sum_{i\neq j}\left(A_{ij}-\gamma P_{ij}^{(0)}\right)\delta_{C_{i},C_{j}} (3)

where AijA_{ij} are entries in the weighted adjacency (Hi-C) matrix, mm is the total weight of this matrix, γ\gamma is a scale parameter setting the effective community size, CiC_{i} is node ii’s community assignment, and Pij(0)P_{ij}^{(0)} is the null model. The last factor, δij\delta_{ij}, denotes the Kronecker-delta function and indicates that the sum calculates the partial sum of all local modularities within a community.

To calculate QQ and identify communities, one must specify a null model that captures the characteristic contact patterns in the data. The most common choice is random connections, known as the Newman-Girvan null model [26], but it does not account for structural aspects associated with long DNA polymer chain folded in 3D inside the cell nucleus [36, 37]. Previous work generalized the Newman-Girvin model to include a power-law with linear node separation dd as observed in Hi-C maps and predicted by theoretical polymer models [14]. However, the power-law exponent tends to change with distance [6, 28]. For example, Within TADs, the contacts decay on average as d0.75d^{-0.75}, whereas it is closer to d1.08d^{-1.08} for longer distances. Therefore, if using a single decay function as the null hypothesis, one must introduce an arbitrary cutoff at which the decay exponent should change. To circumvent this problem, we propose another null model embracing the multifractal properties of Hi-C maps [24]. Admittedly, this model also has a parameter, but the fitting to Hi-C data is more systematic.

The model reads

Pij(0)=2mHa(i,j)i,jHa(i,j).P_{ij}^{(0)}=2m\frac{H_{a}(i,j)}{\sum_{i,j}H_{a}(i,j)}. (4)

where Ha(i,j)H_{a}(i,j) represents the so-called Hierarchical domain model (HDM) from [24], calculated as

Ha(i,j)=\displaystyle H_{a}(i,j)=
{12ndk=0n12ng(d2kn)akb1δkn4max(nk1,0)ifd<2nk=0n12ng(2k2kn)akb1δkn4max(nk1,0)otherwise\displaystyle\left\{\begin{aligned} &\frac{1}{2^{n}-d}\sum_{k=0}^{n-1}2^{n}g(d2^{k-n})\frac{a^{k}b^{1-\delta_{kn}}}{4^{\max(n-k-1,0)}}\,\,{\rm if}\,d<2^{n}\\ &\sum_{k=0}^{n-1}2^{n}g(2^{k}-2^{k-n})\frac{a^{k}b^{1-\delta_{kn}}}{4^{\max(n-k-1,0)}}\,\,{\rm otherwise}\end{aligned}\right. (5)

where

g(x)={xif 0x0.51xif 0.5x10otherwise.g(x)=\left\{\begin{aligned} &x\,&{\rm if}\,0\leq x\leq 0.5\\ &1-x\,&{\rm if}\,0.5\leq x\leq 1\\ &0\,&{\rm otherwise.}\end{aligned}\right. (6)

Here, n=log2(N)n=\lfloor\log_{2}(N)\rfloor, NN is the size of the Hi-C matrix, aa and bb are fitting paramters (b=0.5ab=0.5-a). We use a matrix formulation of Ha(i,j)H_{a}(i,j) and calculate the mean contact probability at each distance d=|ij|d=|i-j|, serving as Pij(0)P_{ij}^{(0)} (see App. A). Finally, we fitted the parameter aa against our Hi-C data using least-squares and found a=0.464(3)a=0.464(3) with minor variation between chromosomes. This value model accurately yields the decay exponent in Hi-C contact maps at all scales (Fig. 8).

II.4 Reducing community detection variance

Since the GenLouvin algorithm tries to optimize the modularity function in a Monte Carlo-like fashion, it often finds different local minima, producing slightly different partitions with similar quality [20] (also discussed in App. A). To reduce the number of samples required to get good statistics, we use the partition detected in the original data as the starting partition and then generate a collection of bootstrapped samples, see Fig. 1. This procedure reduces the number of required samples since we do not have to account for varying initial conditions.

III Results

III.1 Fine-tuning noise amplitudes ϵ\epsilon

To find realistic noise levels ϵ\epsilon when bootstrapping our noisy Hi-C maps, we benchmarked against measured TAD stabilities between replicate experiments [16]. These experiments found the overlap of TAD boundaries between replicates to be 62%. We generated communities using GenLouvain, where we set the scale parameter γ\gamma to achieve an effective community size s^0.88\hat{s}\sim 0.88 Mb that best agrees with the TADs considered in [16] (see App. B). Then we generated sets of communities for nsampn_{\rm samp} bootstrapped samples for different values of ϵ\epsilon. To calculate the TAD boundary similarity between two sets of community boundaries bib_{i} and bjb_{j} from two bootstrapped samples ii and jj, we use the Jaccard index

JTAD(bi,bj)=|bibj||bibj|,J_{\rm TAD}(b_{i},b_{j})=\frac{|b_{i}\cap b_{j}|}{|b_{i}\cup b_{j}|}, (7)

which varies between 1 (identical) and 0 (dissimilar).

However, Eq. (7) has a few problems when using it to quantify boundary overlaps. For example, under extensive noise, the communities shrink until each node is its community. This situation yields JTAD(bi,bj)1J_{\rm TAD}(b_{i},b_{j})\sim 1, indicating a large overlap between community borders, but where the partitions no longer represent TAD-sized communities. To solve this problem, we multiply the Jaccard index in Eq. (7) by a ”penalty factor” containing the fraction of the number of communities in the bootstrapped sample compared to the original. This gives

JTAD(bi,bj)=|bibj||bibj|×(1||bibj||bo|||bibj|)J_{\rm TAD}^{*}(b_{i},b_{j})=\frac{|b_{i}\cap b_{j}|}{|b_{i}\cup b_{j}|}\times\left(1-\frac{||b_{i}\cup b_{j}|-|b_{o}||}{|b_{i}\cup b_{j}|}\right) (8)

where bob_{o} are the TAD boundaries calculated from the original dataset without noise.

Refer to caption
Figure 3: TAD boundary stability for different noise amplitudes for all chromosomes (color bar). The dash-dotted lines show the modified Jaccard index in Eq. 9 (averaged over 100 samples per chromosome). As the noise ϵ\epsilon increases, the overlap decreases. We note that every chromosome follows a similar decaying pattern, where the stability drops and reaches a plateau. The horizontal dashed line at JTAD=0.62J_{\rm TAD}^{*}=0.62 indicates the experimental boundary overlap from [16]. This intersects all JTADJ_{\rm TAD}^{*} lines at slightly different critical noise amplitudes. The vertical dashed line shows the mean of all these amplitudes, ϵ=0.11\epsilon^{*}=0.11.
Refer to caption
Figure 4: Boxplots showing the distribution of community stability at an effective community size of 0.88\sim 0.88Mb across all chromosomes using ϵ=0.11\epsilon^{*}=0.11 and a Jaccard index cutoff of 1.0. The kernel density estimation (KDE) for the distribution of all stabilities over all chromosomes is shown to the right. With these parameters, the stability median over all chromosomes is 0.40. We note that the total distribution is bimodal, albeit with a slight maximum of around 0.5.

To better understand how the penalty factor works, consider two illustrative examples. One is where the bootstrapped partitions contain the same number of communities as the original dataset but shifted along the chromosome. Here, the overlap is low, making the term |bibj|/|bibj||b_{i}\cap b_{j}|/|b_{i}\cup b_{j}| small. Another case is when the bootstrapped samples have many small communities (in the extreme case, where every node represents a separate community). Then, the overlap between bootstrapped samples will be large, indicating a high JTAD(bi,bj)J_{\rm TAD}(b_{i},b_{j}), but the number of communities compared to the original partition differs significantly. Therefore, the last term ||bibj||bo||/|bibj|||b_{i}\cup b_{j}|-|b_{o}||/|b_{i}\cup b_{j}| approaches 0, leading to small JTAD(bi,bj)J_{\rm TAD}^{*}(b_{i},b_{j}), suggesting poor overlap.

Finally, to provide a global measure of TAD stability in an ensemble of bootstrapped samples, we calculate the mean JTAD(bi,bj)J_{\rm TAD}^{*}(b_{i},b_{j}) across all boundaries

JTAD=2nsamp(nsamp1)bibjJTAD(bi,bj),J_{\rm TAD}^{*}=\frac{2}{n_{\rm samp}(n_{\rm samp}-1)}\sum_{b_{i}\neq b_{j}}J_{\rm TAD}^{*}(b_{i},b_{j}), (9)

where nsampn_{\rm samp} is the number of bootstrapped samples.

We show JTADJ_{\rm TAD}^{*} over a range of noise amplitudes ϵ\epsilon across all chromosomes in Fig. 3. As expected, we find perfect overlap (JTAD1J_{\rm TAD}^{*}\rightarrow 1) when the noise level becomes low (ϵ0\epsilon\rightarrow 0). Also, as ϵ\epsilon increases, the overlap reduces below the experimental benchmark (0.62, dashed horizontal line) until reaching a plateau where each node represents separate communities. Using the 0.62-line, we may find numerically the critical noise amplitude ϵ\epsilon^{*} for each chromosome. We note that ϵ\epsilon^{*} varies between 0.050.15\sim 0.05-0.15. This parameter presents an analog to stability since a higher ϵ\epsilon^{*} also indicates stronger resilience to noise, as the original TADs are still extractable from the dataset. From Fig. 3, we choose ϵ=0.11\epsilon^{*}=0.11 as the critical noise amplitude for all chromosomes (vertical dashed line).

III.2 Identifying stable communities

Once we tuned the noise amplitude, we generated several bootstrapped Hi-C maps (see Sec. II.2) and extracted the parts of the community partition that stayed consistent under noise relative to the original dataset. The specific procedure reads as follows (inspired by [23]):

  • Get the original community partition PoP_{o}.

  • Generate a nsampn_{\rm samp} bootstrapped samples from the original dataset.

  • Get communities from each bootstrapped sample PnP_{n}, based on the starting partition P0P_{0}.

  • Identify the communities from the original partition PoP_{o} that are also in the bootstrapped samples. To do this, we calculate the Jaccard indices between all community pairs in each sample and pick the ones with the largest Jaccard index.

  • Filter all pairs with a Jaccard index smaller than some cutoff JcutoffJ_{\rm cutoff} and calculate the fraction of bootstrapped samples that each community in PoP_{o} appears in.

Based on these steps,we calculate the stability fraction SCiS_{C_{i}}, for community CiC_{i} in the original partition PoP_{o} as

SCi=1nsampn=1nsampθ(maxCjPnJ(Ci,Cj)Jcutoff),S_{C_{i}}=\frac{1}{n_{\rm samp}}\sum_{n=1}^{n_{\rm samp}}\theta\left(\max_{C_{j}\in P_{n}}J(C_{i},C_{j})-J_{\rm cutoff}\right), (10)

Here, θ()\theta(\cdot) denotes the Heaviside step function, and J(Ci,Cj)J(C_{i},C_{j}) is the Jaccard index between two communities as formulated in Eq. (7) but not specific to TADs.

The cutoff JcutoffJ_{\rm cutoff} is a free parameter. We set it to 1.0, meaning that we define a community CiC_{i} as stable if all of its nodes remain in the original dataset and the bootstrapped samples. After setting this cutoff, we generated several bootstrapped Hi-C maps and calculated the stability fraction SCiS_{C_{i}} (Eq. (10)) for each community CiC_{i} associated with the original dataset (PoP_{o}). As before, we set γ\gamma to get the effective community size s^0.88Mb\hat{s}\sim 0.88Mb for each chromosome.

We show the distributions of the stabilities SCiS_{C_{i}} as boxplots in Fig 4, one box per chromosome, sorted by the SCiS_{C_{i}} medians. By reading the chromosome numbers and the x-axis, we note no strong correlation between chromosome size and stability, as both large and small chromosomes appear on either end. To compare this result to experiments, we note that the same work that calculated the median TAD boundary similarity to be 62% also used an advanced similarity metric TADsim to measure the structural similarity of TAD sets between replicates [16]. They found this overlap to be lower, about 50% between replicates. This is close to our median stability between all chromosomes, which turns out to be 0.4\sim 0.4.

Until this point, we studied each chromosome separately. Below, we aggregate all the data and consider genome-wide averages. We also used nsamp=1000n_{\rm samp}=1000 bootstrapped samples per chromosome to calculate the stability, which we bin to a resolution of 1%.

III.3 Understanding stability and community connectedness

Refer to caption
Figure 5: The mean internal node strength s^Ciint\hat{s}_{C_{i}}^{\rm int} versus community stability SCiS_{C_{i}} at an effective community size of 0.88\sim 0.88Mb across all chromosomes. The mean internal node strength is binned by the stability. We observe a nearly linear relationship between the two quantities, indicating that the internal strength within a community is a good predictor of community stability between bootstrapped samples.
Refer to caption
Figure 6: Community stability versus folds of enrichment (FE) for several chromatin groups (colored markers, defined in App. C). The number inside each panel shows the Spearman correlation coefficient. We note a convincing positive correlation with enrichment in the three leftmost panels (promoters, enhancers, and transcribed regions). We note no or weak negative correlation for the two rightmost panels (heterochromatin & repressive states, insulators).

The boxplots and the distribution in Fig. 4 foreshadow that some of the communities have a more resilient community structure than others under noise. Here, we ask how much this stability depends on the local structure of the underlying network. In particular, we study the communities’ internal node strength. This metric differs from general node strength—the sum of all edge weights of a node—by restricting the sum to nodes within a community. Node strength is a good predictor of, for example, node centrality in a fully connected network [38] or search times to a random node [39, 40, 41, 42].

However, in our case, the stability of a community is not related to how well the nodes are connected to other parts of the network. Instead, we quantify the internal community connectedness. We define this as the sum of all edge weights within a community (including self-loops) normalized by the total node strength

s^Ciint=jCiiCiAijjiCiAij,\hat{s}_{C_{i}}^{\rm int}=\frac{\sum_{j\in C_{i}}\sum_{i\in C_{i}}A_{ij}}{\sum_{j}\sum_{i\in C_{i}}A_{ij}}, (11)

where ”int” stands for internal node strength. Plotting s^Ciint\hat{s}_{C_{i}}^{\rm int} for all communities across all chromosomes against the community stability SCiS_{C_{i}} (Eq. (10)), we find that high internal node strength suggests high community stability under noise (Fig. 5). This means that strong inter-connected communities are less affected by random noise during bootstrapping.

III.4 Relating community stability to chromatin states

The previous section studied how stability is connected to network structure. Next, we perform a similar analysis using biologically relevant markers: chromatin states. In particular, we ask if our stability metric calculated on TAD-sized communities is associated with the enrichment or depletion of active or inactive genomic regions, which could suggest underlying mechanisms that form strongly connected communities. For example, there are several indications of insulation enrichment at TAD borders and facilitated promoter-enhancer interactions within TADs [43, 44].

To study this, we used data that classify genomic regions into 15 different functional states (see Tab. 2), which we grouped into five categories (defined in App. C). We plotted the folds of enrichment (FE) of these five groups against community stability for TAD-sized communities (s^0.38\hat{s}\sim 0.38 Mb) in Fig. 6. For three of the groups associated with active genomic regions (promoters, enhancers, and transcribed regions), there is a clear positive relation between stability and FE. However, repressed regions (heterochromatin & repressive states and insulators) show no such correlation. This indicates that the active genomic regions are associated with strong stability towards noise or communities having strong internal connectivity.

III.5 Quantifying cross-scale nestedness of stable communities

Thus far, we have focused on the stability of communities of a particular size. However, by tuning the GenLouvain parameter γ\gamma we can get communities of varying sizes. This allows us to study how communities at one scale nest with communities at larger scales. Such studies complement previous work finding that active chromatin congregates in communities with a significant hierarchical nestedness across community sizes s^\hat{s} [15], in particular, hierarchical TADs [45, 46]. Inspired by this, we ask: do nodes in stable communities of one size stay within stable communities of a larger size? Or is community nesting related to stability?

To this end, we first calculated the community stability SCS_{C} for different effective community sizes. We choose four commonly studied chromosome scales: TADs (s^0.38\hat{s}\sim 0.38MB), A/B segments (s^1.04\hat{s}\sim 1.04MB), A1,2, B1,2,3 structures (s^49\hat{s}\sim 49MB) and A/B compartments (s^60\hat{s}\sim 60MB) (see App. B). Next, we assign a node stability sis_{i} to each node ii, which we choose as the community stability SCS_{C}. Thus, as γ\gamma varies, we obtain sis_{i} values for each size s^\hat{s}. The change in sis_{i} reflects how stability changes over different network scales.

We show the change in sis_{i} over three different scale transitions as a boxplot in Fig. 7 (a) (green boxes). For relatively small communities (around the size of TADs), we note that the stability difference is smaller in smaller communities. To make a fair comparison, we calculated the same difference between two randomly chosen nodes (orange boxes). These boxes show that the stability difference is smaller in the actual data (green) than in the random case, at least for smaller communities. This indicates that stable communities at one size are generally supported by stable communities at a larger size, indicating the cross-scale nestedness of stable communities.

However, this relationship holds best as communities become smaller. At large sizes, the communities occupy a considerable part of the chromosome, implying that randomization does not change much since the probability of selecting two random nodes from the same community is high. This does not necessarily imply that the stability is not nested at larger sizes but rather that a much larger partition sample is required to get good statistics.

To better illustrate how nested stability changes between two select chromosome scales (s^=0.38Mbs^=1.04Mb\hat{s}=0.38\mathrm{Mb}\to\hat{s}=1.04\rm{Mb}), we plotted sis_{i} in an alluvial diagram (Fig. 7 (b)). To increase readability, we rounded the sis_{i} values to a precision of 0.2. We note that most flows from s^=0.38Mb\hat{s}=0.38\mathrm{Mb} to s^=1.04Mb\hat{s}=1.04\mathrm{Mb} go towards between nodes with similar stabilities or just above or below. This gives additional support to our conclusion that stability is nested across network scales. We also depict the chromatin state enrichment FE as bars to the left and right of the alluvial diagram (same data as in Fig. 6). We note that the FE levels for active chromatin (three leftmost bars) remain conserved across both scales for all six stability groups. The inactive chromatin shows similar conservation, but this like stem from them being uncorrelated to node stability (Fig. 6).

Refer to caption
Figure 7: Connection between stability, nestedness, and active genomic regions. (a) Difference in node stability between different network scales (community sizes). Each group in the boxplots shows the distribution of absolute differences in node stability sis_{i} between two different scales. The left boxes (green) show the difference from the actual data, and the right (orange) boxes show sis_{i} after randomization. The x-labels indicate the effective community sizes, and the vertical lines show the median. For smaller communities, we note smaller sis_{i} in the actual data compared to the random case. This indicates that the stability is hierarchical, where a stable node in a smaller community belongs to a similarly stable community with a larger size (or scale). However, this trend does not seem to hold for larger communities. (b) Cross-scale stability and chromatin state. The alluvial diagram shows the individual node stabilities between community sizes of 0.38 Mb and 1.04 Mb, rounded with a precision of 0.20.2. We see that most of the flows from left to right go from between similar stabilities (e.g., most of the dark red at 0.38 Mb remains dark red at 1.04 Mb or slightly lighter). The bars on the sides show the five coarse-grained chromatin states (same colors as in Fig. 6, indicated by colors). We see that correlated HMM states stay correlated between different community sizes, where groups 4 and 5 are notable exceptions.

IV Discussion

Hi-C represents a promising method to uncover essential relationships between gene activity and chromosome 3D organization. However, extracting biologically meaningful 3D structures from Hi-C data sets poses several challenges. One challenge concerns the 3D contacts forming a nearly fully connected complex network with cross-scale interactions. This leads to inconsistent network partitions when comparing data clustering or TAD-finding methods [18, 13, 20]. Another challenge comes from experimental noise, where the same clustering method yields slightly different network partitions between theoretically identical replicate Hi-C experiments. Here, we propose a new method that systematically helps deal with the noise problem. We also propose a new null model for the distant-dependent average contact frequency that builds on Hi-C maps’ multifractal properties. Using our method, we find that stable 3D communities typically have a high internal connectedness, tend to be enriched in active chromatin marks, and have a more nested cross-scale hierarchy than less stable ones.

As mentioned previously and discussed before [15], genome organization does not likely follow a perfect hierarchical tree structure. This is partly due to scale-dependent folding mechanisms, such as loop extrusion and chromatin-chromatin interactions, which are critical in forming TADs and A/B compartments [47]. Our work focuses on extracting stable communities under noise and suggests new folding mechanisms that are worthy of further experimental inquiry. We base this conclusion on Fig. 7(b), where highly stable communities maintain chromatin state enrichment across chromosome scales. This foreshadows the same mechanisms being responsible for hierarchical folding in select parts of the 3D hierarchy.

We acknowledge that our results stem from a specific Hi-C data resolution (100kb). However, our method is not limited to this choice and can handle any resolution and Hi-C data set. In fact, we do not anticipate drastic changes to our findings because previous research finds persistent correlations between TAD boundaries and structure in the 40-100kb range [16]. When applying the approach in this paper to a different Hi-C resolution, it is essential to double-check the noise amplitude ϵ\epsilon^{*}, as it is not a universal parameter.

Furthermore, ϵ\epsilon^{*} also depends on the specific choice of community detection method as we calibrate against the communities associated with the unperturbed Hi-C data set. For example, if choosing one of the TAD-finding methods in [18]. Our pipeline also allows experimenters to assess how stable 3D communities are under noise from varying sampling depths [19]. This case likely requires a revised model of the noise (Sec. II.2) and subsequent recalibration of ϵ\epsilon^{*}. Otherwise, the pipeline is the same.

We choose the GenLouvin algorithm to find communities. While it has known shortcomings [48, 20], it is one of the most widely used methods in network science because it is relatively simple and allows specifying a theoretical null. Here, we use a null model that takes advantage of the multifractal spectrum of the Hi-C map. We believe this approach is useful for community detection in any network setting representing a complex system because it provides a straightforward way of parameterizing an explicit null model that otherwise, in lack of better options, often becomes the random Newman-Girvan model. While there exist methods to construct multifractal networks (e.g., [49]), our approach addresses the inverse problem.

To summarize, our work expands the techniques employed for community detection in Hi-C data and introduces additional methods and metrics to extract communities that survive realistic noise levels. We anticipate having better access to robust 3D communities will help future research uncover causal connections between chromosome contact networks and genetic processes, such as gene expression and repression.

V Author contributions

L.L. and L.H. devised the study. L.H. and A.C.M. performed the analysis. L.H. made the visualizations and wrote the manuscript. All authors edited and accepted the manuscript in its final form.

VI Competing interests

The authors declare no competing interests.

Acknowledgements.
We thank Moa Lundkvist and Juhee Lee for providing valuable discussion and feedback. We also thank Moa Lundkvist for her help creating the alluvial plot in Fig. 7. We acknowledge financial support from the Swedish Research Council (grant no. 2017-03848). The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) and the Swedish National Infrastructure for Computing (SNIC) at High-Performance Computing Center North (HPC2N), partially funded by the Swedish Research Council through grant agreements no. 2022-06725 and no. 2018-05973.

Appendix A The hierarchical domain model as a null model

As discussed in Methods, the null model used for community detection has to be chosen to represent the underlying structure of the data. Mimicking the power-law dropoff in contact probability observed in Hi-C, the null model for Hi-C is commonly chosen to follow dα\sim d^{-\alpha}, where d=|ij|d=|i-j| is the distance between two bins on Hi-C and α\alpha is a scale parameter [14]. However, later work showed that to capture communities at different scales on Hi-C α\alpha had to be fine-tuned to avoid degenerate partitions, i.e. partitions that are very different in structure and quality [20]. To remove the need for fine-tuning a scale parameter, we look at recent work where the contact dropoff in Hi-C was fitted to a bifractal contact map [24].

In that paper, they devise a hierarchical domain model (HDM) as a more representative metric for the distance-dependent structure of Hi-C. In short, the algorithm starts from a 2×22\times 2 matrix, which self-repeats nn times to a size 2n2^{n}. We use the analytical expression of this contact map to calculate the average contact probability dependent on the distance between two Hi-C bins show in Eq. (II.3). Since the Hi-C matrix of size NN is not necessarily a multiple of 22, we let Ha(i,j)H_{a}(i,j) repeat for |ij|2n|i-j|\geq 2^{n}, which is a constant value. We show Ha(i,j)H_{a}(i,j) versus the genomic distance, along with commonly used contact probabilities [14] in Fig. 8 (a). By capturing the different fractal scales of the chromosomes using this null model, we remove the need to fine-tune the scale parameter α\alpha to capture communities at different sizes. Instead, we only fine-tuned the parameter aa to match our Hi-C data, which we calculated to have the mean a=0.464(3)a=0.464(3).

Refer to caption
Figure 8: Comparing the HDM null-model versus two different power-law exponents. (a) The HDM model shows a similar power-law scaling compared to the slope using α=0.75\alpha=0.75. However, the HDM has a slight change in curvature, flattening out for large distances. (b) Degeneracy evaluation plot recreated with permission from [20] using chromosome 10. Two metrics versus the effective community size are shown, the mean distance MD(d){\rm MD}(d) between pairwise partitions for different community detection runs, and the coefficient of variation CV(Q){\rm CV}(Q) of the modularity function between partitions. Higher values in both metrics indicate separated partitions with differing qualities, which are degenerate states. We see that the HDM avoids local minima by having a low CV(Q){\rm CV}(Q), albeit with slightly larger distances between partitions for smaller communities.

To evaluate the HDM null model against the power-law scaling based on the fractal globule assumption, we recreate some of the results in [20], shown in Fig. 8 (b). In short, the metric MD(d){\rm MD}(d) quantifies the distance between partitions, and a large value indicates that the partitions between community detection runs are structurally significantly different. CV(Q){\rm CV}(Q) quantifies the coefficient-of-variation between partitions, where a large value indicates that the partition varies much in quality, indicating strong local minima. We see that MD(d){\rm MD}(d) for the HDM null model falls somewhere in between the two power-law relationships. However, the HDM has on average a smaller variation in quality QQ, indicating that even if the partitions have a slightly larger distance between each other, they are comparatively similar in quality, indicating that there is little risk of getting stuck in local minima.

Appendix B Picking γ\gamma for different sized communities on Hi-C

We picked the parameter γ\gamma in the community detection pipeline to approximately represent structural elements at different scales on the Hi-C. We calculated the median effective community size of each partition for all chromosomes as in [20] and compared them against their values reported in the same paper. The parameter γ\gamma and the corresponding structural scales are shown in Table. 1.

Table 1: The value of γ\gamma that reproduce certain scaled communities on the Hi-C. The sizes of the structures and how to calculate the characteristic community size can be found in [20]. Here we used 100kb data using a=0.464a=0.464 in the hierarchical domain model.
Structure Size (Mb) γ\gamma Comm. size (Mb)
TADs 0.330.33 1.751.75 0.380.38
TADs 0.880.88 1.41.4 0.900.90
A/B segments 1.11.1 1.351.35 1.041.04
A1,…,B3 2727 0.550.55 4949
A/B compartments 6666 0.400.40 6060

Appendix C Preprocessing chromatin states to Hi-C bins

To compare our results of community stability on Hi-C with different functional regions on the genome, we used data dividing the genome into 15 separate functional divisions [32, 33], taken from the ENCODE database [31]. Due to the functional similarity between different states, we group the states into five groups. The different states, along with their names and groupings can be found in Table. 2 below.

Table 2: The 15 functional states identified on the genome grouped by their function [33, 32, 50]. Here we show the state numbers, their functional names and the groups which we combine the states into.
Group & name HMM state Name
(A) Promoters S1 active promoter
S2 weak promoter
(B) Enhancers S4 strong enhancer
S5 strong enhancer
S6 weak/poised enhancer
S7 weak/poised enhancer
(C) Transcribed regions S9 transcriptional transition
S10 transcriptional elongation
S11 weakly transcribed
(D) Heterochromatin & repressive states S3 inactive/poised promoter
S13 heterochromatin
S14 repetitive/copy number variation
S15 repetitive/copy number variation
(E) Insulators S8 insulator

The dataset contains start and stop regions for each of the state. To map how much of each state is expressed in each Hi-C bin we calculate the folds of enrichment (FE) relative to the genome-wide average following the same procedure as in [50]

  1. 1.

    We count the number of peaks kXk_{X} per bin, where XX corresponds to one of the states. We only count the start of the peaks.

  2. 2.

    We calculate the peak’s expected frequency value using the hypergeometric test as kX=KX×(n/N)k^{\prime}_{X}=K_{X}\times(n/N). Here nn is the number of peaks of any state in a bin, NN is the total number of peaks per chromosome, and KXK_{X} is the total number of peaks for the state XX.

  3. 3.

    Using this, we calculate the folds of enrichment for each state ZZ by dividing the observed peak by the expected FEX=kX/kXFE_{X}=k_{X}/k^{\prime}_{X}.

References

  • Dixon et al. [2016] J. R. Dixon, D. U. Gorkin, and B. Ren, Molecular cell 62, 668 (2016).
  • Schwartz and Cavalli [2017] Y. B. Schwartz and G. Cavalli, Genetics 205, 5 (2017).
  • Bonev and Cavalli [2016] B. Bonev and G. Cavalli, Nature Reviews Genetics 17, 661 (2016).
  • Denker and De Laat [2016] A. Denker and W. De Laat, Genes & development 30, 1357 (2016).
  • Marchal et al. [2019] C. Marchal, J. Sima, and D. M. Gilbert, Nature Reviews Molecular Cell Biology 20, 721 (2019).
  • Lieberman-Aiden et al. [2009] E. Lieberman-Aiden, N. L. Van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, B. R. Lajoie, P. J. Sabo, M. O. Dorschner, et al., Science 326, 289 (2009).
  • Sexton et al. [2012] T. Sexton, E. Yaffe, E. Kenigsberg, F. Bantignies, B. Leblanc, M. Hoichman, H. Parrinello, A. Tanay, and G. Cavalli, Cell 148, 458 (2012).
  • Dekker et al. [2013] J. Dekker, M. A. Marti-Renom, and L. A. Mirny, Nature Reviews Genetics 14, 390 (2013).
  • MacKay and Kusalik [2020] K. MacKay and A. Kusalik, Briefings in functional genomics 19, 292 (2020).
  • Liu et al. [2021] Y. Liu, L. Nanni, S. Sungalee, M. Zufferey, D. Tavernari, M. Mina, S. Ceri, E. Oricchio, and G. Ciriello, Nature communications 12, 1 (2021).
  • Weinreb and Raphael [2016] C. Weinreb and B. J. Raphael, Bioinformatics 32, 1601 (2016).
  • Fraser et al. [2015] J. Fraser, C. Ferrai, A. M. Chiariello, M. Schueler, T. Rito, G. Laudanno, M. Barbieri, B. L. Moore, D. C. Kraemer, S. Aitken, et al., Molecular systems biology 11, 852 (2015).
  • Cabreros et al. [2016] I. Cabreros, E. Abbe, and A. Tsirigos, in 2016 Annual Conference on Information Science and Systems (CISS) (IEEE, 2016) pp. 584–589.
  • Lee et al. [2019] S. H. Lee, Y. Kim, S. Lee, X. Durang, P. Stenberg, J.-H. Jeon, and L. Lizana, Scientific reports 9, 1 (2019).
  • Bernenko et al. [2022] D. Bernenko, S. H. Lee, P. Stenberg, and L. Lizana, bioRxiv  (2022).
  • Sauerwald et al. [2020] N. Sauerwald, A. Singhal, and C. Kingsford, NAR genomics and bioinformatics 2, lqz008 (2020).
  • de Wit [2020] E. de Wit, Journal of molecular biology 432, 638 (2020).
  • Sefer [2022] E. Sefer, BMC bioinformatics 23, 127 (2022).
  • Yardımcı et al. [2019] G. G. Yardımcı, H. Ozadam, M. E. Sauria, O. Ursu, K.-K. Yan, T. Yang, A. Chakraborty, A. Kaul, B. R. Lajoie, F. Song, et al., Genome biology 20, 1 (2019).
  • Holmgren et al. [2023] A. Holmgren, D. Bernenko, and L. Lizana, Scientific Reports 13, 12979 (2023).
  • Rao et al. [2014] S. S. Rao, M. H. Huntley, N. C. Durand, E. K. Stamenova, I. D. Bochkov, J. T. Robinson, A. L. Sanborn, I. Machol, A. D. Omer, E. S. Lander, et al., Cell 159, 1665 (2014).
  • Eres and Gilad [2021] I. E. Eres and Y. Gilad, Trends in Genetics 37, 216 (2021).
  • Rosvall and Bergstrom [2010] M. Rosvall and C. T. Bergstrom, PloS one 5, e8694 (2010).
  • Pigolotti et al. [2020] S. Pigolotti, M. H. Jensen, Y. Zhan, and G. Tiana, Physical Review Research 2, 043078 (2020).
  • Sarnataro et al. [2017] S. Sarnataro, A. M. Chiariello, A. Esposito, A. Prisco, and M. Nicodemi, PLoS One 12, e0188201 (2017).
  • Newman and Girvan [2004] M. E. Newman and M. Girvan, Physical review E 69, 026113 (2004).
  • Yan et al. [2017] K.-K. Yan, S. Lou, and M. Gerstein, PLoS computational biology 13, e1005647 (2017).
  • Sanborn et al. [2015] A. L. Sanborn, S. S. Rao, S.-C. Huang, N. C. Durand, M. H. Huntley, A. I. Jewett, I. D. Bochkov, D. Chinnappan, A. Cutkosky, J. Li, et al., Proceedings of the National Academy of Sciences 112, E6456 (2015).
  • Edgar et al. [2002] R. Edgar, M. Domrachev, and A. E. Lash, Nucleic acids research 30, 207 (2002).
  • Knight and Ruiz [2013] P. A. Knight and D. Ruiz, IMA Journal of Numerical Analysis 33, 1029 (2013).
  • Raney et al. [2024] B. J. Raney, G. P. Barber, A. Benet-Pagès, J. Casper, H. Clawson, M. S. Cline, M. Diekhans, C. Fischer, J. Navarro Gonzalez, G. Hickey, et al., Nucleic Acids Research 52, D1082 (2024).
  • Ernst et al. [2011] J. Ernst, P. Kheradpour, T. S. Mikkelsen, N. Shoresh, L. D. Ward, C. B. Epstein, X. Zhang, L. Wang, R. Issner, M. Coyne, et al., Nature 473, 43 (2011).
  • Ernst and Kellis [2010] J. Ernst and M. Kellis, Nature biotechnology 28, 817 (2010).
  • Lajoie et al. [2015] B. R. Lajoie, J. Dekker, and N. Kaplan, Methods 72, 65 (2015).
  • Jeub et al. [2019] L. Jeub, M. Bazzi, I. Jutla, and P. Mucha, https://github.com/GenLouvain/GenLouvain  (2011-2019).
  • Grosberg et al. [1988] A. Y. Grosberg, S. K. Nechaev, and E. I. Shakhnovich, Journal de physique 49, 2095 (1988).
  • Ghosh and Jost [2018] S. K. Ghosh and D. Jost, PLoS computational biology 14, e1006159 (2018).
  • Opsahl et al. [2010] T. Opsahl, F. Agneessens, and J. Skvoretz, Social networks 32, 245 (2010).
  • Nyberg et al. [2021] M. Nyberg, T. Ambjörnsson, P. Stenberg, and L. Lizana, Physical Review Research 3, 013055 (2021).
  • Hedström and Lizana [2023] L. Hedström and L. Lizana, New Journal of Physics 25, 033024 (2023).
  • Noh and Rieger [2004] J. D. Noh and H. Rieger, Physical review letters 92, 118701 (2004).
  • Tejedor et al. [2009] V. Tejedor, O. Bénichou, and R. Voituriez, Physical Review E 80, 065104 (2009).
  • Gong et al. [2018] Y. Gong, C. Lazaris, T. Sakellaropoulos, A. Lozano, P. Kambadur, P. Ntziachristos, I. Aifantis, and A. Tsirigos, Nature communications 9, 542 (2018).
  • Qu et al. [2019] J. Qu, G. Yi, and H. Zhou, Epigenetics & chromatin 12, 31 (2019).
  • Liu et al. [2022] E. Liu, H. Lyu, Q. Peng, Y. Liu, T. Wang, and J. Han, Communications Biology 5, 608 (2022).
  • Cresswell et al. [2020] K. G. Cresswell, J. C. Stansfield, and M. G. Dozmorov, BMC bioinformatics 21, 1 (2020).
  • Nuebler et al. [2018] J. Nuebler, G. Fudenberg, M. Imakaev, N. Abdennur, and L. A. Mirny, Proceedings of the National Academy of Sciences 115, E6697 (2018).
  • Fortunato and Hric [2016] S. Fortunato and D. Hric, Physics reports 659, 1 (2016).
  • Palla et al. [2010] G. Palla, L. Lovász, and T. Vicsek, Proceedings of the National Academy of Sciences 107, 7640 (2010).
  • Bernenko et al. [2023] D. Bernenko, S. H. Lee, and L. Lizana, Journal of Physics: Complexity 4, 035004 (2023).