This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Joint identification of spatially variable genes via a network-assisted Bayesian regularization approach

Mingcong Wu a,†, Yang Li a,†, Shuangge Ma b, Mengyun Wu c,∗
aSchool of Statistics, Renmin University of China
b Department of Biostatistics, Yale School of Public Health
c School of Statistics and Management,
Shanghai University of Finance and Economics

the co-first authors
the corresponding author: wu.mengyun@mail.shufe.edu.cn
Abstract

Identifying genes that display spatial patterns is critical to investigating expression interactions within a spatial context and further dissecting biological understanding of complex mechanistic functionality. Despite the increase in statistical methods designed to identify spatially variable genes, they are mostly based on marginal analysis and share the limitation that the dependence (network) structures among genes are not well accommodated, where a biological process usually involves changes in multiple genes that interact in a complex network. Moreover, the latent cellular composition within spots may introduce confounding variations, negatively affecting identification accuracy. In this study, we develop a novel Bayesian regularization approach for spatial transcriptomic data, with the confounding variations induced by varying cellular distributions effectively corrected. Significantly advancing from the existing studies, a thresholded graph Laplacian regularization is proposed to simultaneously identify spatially variable genes and accommodate the network structure among genes. The proposed method is based on a zero-inflated negative binomial distribution, effectively accommodating the count nature, zero inflation, and overdispersion of spatial transcriptomic data. Extensive simulations and the application to real data demonstrate the competitive performance of the proposed method.


Keywords: Spatial transcriptomic data; Network analysis; Bayesian analysis

1 Introduction

For complex diseases, the recently developed and rapidly advancing spatial transcriptomics (ST) technology has provided a powerful tool to identify cellular organizations of gene expressions in tissues, which have long been recognized to be intimately linked to critical biological functions (Rao et al., 2021). One crucial task in ST data analysis is to identify genes with varying expressions across space, termed as spatially variable (SV) genes (Svensson et al., 2018). SV genes have been demonstrated to be associated with disease characteristics such as immune cell infiltration and tumor cell proliferation (Zuo et al., 2024), hence facilitating the discovery of tumorigenesis mechanisms and the development of therapeutic strategies.

Many approaches have been recently proposed for SV gene detection. A great majority of them are based on the Gaussian process (GP). For example, SpatialDE (Svensson et al., 2018) models normalized expression data via GP regression and tests the significance of the spatial covariance matrix for each gene separately. SPARK (Sun et al., 2020), BOOST-GP (Li et al., 2021), and GPcounts (BinTayyash et al., 2021) also take advantage of GP regression but directly model raw count data with the Poisson, Negative Binomial (NB), and zero-inflated NB (ZINB) distributions, respectively. CTSV (Yu and Luo, 2022), on the other hand, implements a slightly different technique based on ZINB regression where the mean expression level is parameterized as a linear combination of functions of spatial coordinates. There are some non-parametric approaches with more computational efficiency, such as SPARK-X (Zhu et al., 2021) based on covariance-based testing, MERINGUE (Miller et al., 2021) based on Voronoi tessellation and classical Moran’s I score, and HEARTSVG (Yuan et al., 2023) based on constant variance testing. Despite considerable successes, the results of the aforementioned works are still sometimes unsatisfactory due to the high dimensionality of genes, high levels of noise and sparsity, and low resolution of spots. In addition to the transcriptomic data, other biological information, such as cellular phenotypes and genetic interactions, is often available and potentially provides a valuable complement to the present analysis. The integration of such assisted information into the transcriptomic analysis is a promising direction to mitigate the aforementioned challenges in ST data and further advance existing ST research.

Specifically, first, most existing approaches for SV gene detection rely on marginal analysis, which models each gene separately and has the limitation that the dependence among genes is not well utilized. Increasing evidence has shown that diseases are mostly a result of a combination of multiple genetic alterations, and genetic factors usually interact with each other and are involved in a biological network (Barrio-Hernandez et al., 2023) (Figure 1). Specifically, genes connected within a network are believed to have similar biological functions, leading to potentially similar contributions to cellular organizations and functional mechanisms. Recently, an increasing number of biological networks have been amassed, such as protein-protein interaction (PPI) networks, metabolic networks, and regulatory networks. Curated network information has been widely adopted as a powerful supplement to gene expression analysis, particularly in bulk and single-cell sequencing analysis (Elyanow et al., 2020; Qin et al., 2023). However, the integration of network information for SV gene detection remains limited.

Second, despite prosperous developments in recent years, measurements obtained using the sequencing-based ST technologies, such as Slide-seq and 10X Genomics Visium, are still “spot”-based, where the gene expression measurement at a single spot is usually a mixture of diverse cells from heterogeneous types rather than at a single-cell resolution (Yu and Luo, 2022) (Figure 1). This cellular composition diversity among spots has been demonstrated to probably contribute to expression variability (Cable et al., 2022), potentially confounding the detection of biologically relevant SV genes. In the published studies, there are very few approaches that conduct simultaneous cellular composition accommodation and SV gene detection. Limited existing studies include SPARK-X, which converts the inferred major cell types into binary indicators to mitigate the confounding effect caused by latent cell type distributions. Alternatively, CTSV concentrates on the detection of cell-type-specific SV genes whose expressions are affected by the spatial coordinates of cells of the same type.

Motivated by the aforementioned challenges, we propose a Bayesian regularization approach for the joint identification of SV genes as shown in Figure 1. Specifically, the ZINB distribution is adopted to account for the count nature, sparsity, and overdispersion of raw expression measurements. The mean-value-based strategy is applied for SV gene detection, which enjoys the advantages of simplicity, interpretability, and computational efficiency. The most substantial advancement is that the proposed approach conducts joint detection with the network dependency structure among genes well incorporated, making it a big step forward from the existing marginal analysis. Moreover, the proposed approach introduces a series of cell-type-specific parameters to effectively correct for the confounding variations induced by varying cell-type compositions across spots. Overall, this study provides a practically useful and biologically meaningful approach for SV gene identification in ST analysis, with improved performance over alternatives as demonstrated in both simulation studies and application to a real-world ST dataset.

Refer to caption
Figure 1: Overview.

2 Methods

Consider a tissue section consisting of nn spots, pp genes, and KK cell types. Let 𝐘n×p\mathbf{Y}\in\mathcal{R}^{n\times p} be the ST expression matrix composed of 𝐘j\mathbf{Y}_{j}’s, where 𝐘j=(Y1j,,Ynj)T\mathbf{Y}_{j}=\left(Y_{1j},\dots,Y_{nj}\right)^{\mathrm{T}} is the vector of nn observed raw counts of the jjth gene. 𝐘\mathbf{Y} usually has a very high level of sparsity because of a low capture rate. Each spot ii is associated with a 2-dimensional coordinate 𝐱i=(xi1,xi2)\mathbf{x}_{i}=\left(x_{i1},x_{i2}\right) which represents the location of the corresponding spot center.

2.1 Network and Cellular Composition Information for Integration

Consider an undirected network G(V,E)G(V,E) that is constructed using biological information, where VV is the node set consisting of pp genes and E={e(j,l),j,l{1,E=\left\{e(j,l),j,l\in\{1,\right. ,p}}\left.\cdots,p\}\right\} is the set of edges between nodes. For genes connected within the network, it is expected that they have similar biological functionalities, leading to potentially similar spatial variability, and thus are more likely to be SV or non-SV genes simultaneously. To induce such network-assisted identification, an adjacency matrix 𝑨=(ajl)p×p\boldsymbol{A}=(a_{jl})_{p\times p} is first constructed based on G(V,E)G(V,E), with ajl=1a_{jl}=1 if there is an edge e(j,l)e(j,l) between the jjth and llth genes and ajl=0a_{jl}=0 otherwise, and ajj=0{a}_{jj}=0, for j,l=1,2,,pj,l=1,2,\ldots,p.

As for cellular compositions, denote 𝒘i=(wi1,,wiK)T\boldsymbol{w}_{i}=\left(w_{i1},\dots,w_{iK}\right)^{\mathrm{T}} as the vector of cellular proportions for the iith spot, which satisfies the constraint that 0wik1,k=1,,K0\leq w_{ik}\leq 1,k=1,\dots,K and k=1Kwik=1\sum_{k=1}^{K}w_{ik}=1. Such information is typically available as ground truth or can be obtained using deconvolution methods such as RCTD (Cable et al., 2022), Redeconve (Zhou et al., 2023), and SONAR (Liu et al., 2023). Since the distributions of cellular proportions usually display spatial relatedness, which may confound SV gene detection, we correct for this potential confounding in SV gene identification.

2.2 Network-assisted Bayesian Modeling

We introduce latent binary variables to accommodate the zero-inflation in 𝐘\mathbf{Y} and consider the following zero-inflated negative binomial (ZINB) model:

Yijrij,μij,ϕNB(Yijμij,ϕ)(1rij)δ0(Yij)rij,Y_{ij}\mid r_{ij},\mu_{ij},\phi\sim\mathrm{NB}\left(Y_{ij}\mid\mu_{ij},\phi\right)^{(1-r_{ij})}\delta_{0}(Y_{ij})^{r_{ij}}, (1)

where rij=1r_{ij}=1 indicates that YijY_{ij} is from a Dirac probability measure δ0()\delta_{0}(\cdot) with a point mass at zero, and otherwise YijY_{ij} is from a NB distribution NB(Yijμij,ϕ)\mathrm{NB}\left(Y_{ij}\mid\mu_{ij},\phi\right) with expectation μij\mu_{ij} and dispersion 1/ϕ1/\phi. The NB distribution has variance μij+μij2/ϕ\mu_{ij}+\mu_{ij}^{2}/\phi, thus allowing modeling extra variation.

To accommodate spatial differential expression and cell-type-specific confounding, the logarithm of μij\mu_{ij} is modeled as:

logμij=μ0j+βj(1)𝒦(xi1)+βj(2)𝒦(xi2)+k=1Kαkwik,\log\mu_{ij}=\mu_{0j}+\beta^{(1)}_{j}\mathcal{K}\left(x_{i1}\right)+\beta^{(2)}_{j}\mathcal{K}\left(x_{i2}\right)+\sum_{k=1}^{K}\alpha_{k}w_{ik}, (2)

where μ0j\mu_{0j} is the baseline expression parameter, αk\alpha_{k} is the cell-type-specific coefficient to accommodate the potential influence of cellular distributions, and βj(1)\beta^{(1)}_{j} and βj(2)\beta^{(2)}_{j} reflect the degree of spatial differential expression, with 𝒦()\mathcal{K}(\cdot) being the pre-specified spatialization function to measure the specified trend of spatial gene expression variation.

We adopt the ZINB model due to its superiority for simultaneously accommodating the count measure, over-dispersion, and excessive zeros caused by dropouts. For SV gene identification, significantly different from the previous studies that introduce spatial differential expression via a covariance matrix, which is usually space- and time-consuming and lacks efficiency and scalability for large-scale analysis, we adopt the mean-value-based formulation. The spatial variability is intuitively reflected by βj(1)\beta^{(1)}_{j} and βj(2)\beta^{(2)}_{j}, where the genes with βj(1)0\beta^{(1)}_{j}\neq 0 or βj(2)0\beta^{(2)}_{j}\neq 0 are regarded as SV. This requires less storage space and also involves a simpler estimation procedure. Moreover, we innovatively utilize a set of cell-type-specific αk\alpha_{k}’s for eliminating the confounding impact of the latent cellular composition, where spots that are spatially closer are often observed to have similar cell-type proportions (Cable et al., 2022). Different from the mean-value-based strategy adopted by Yu and Luo (2022) for marginal cell-type-specific SV gene detection, we focus on the detection of global SV genes while accommodating the cell-type proportion confounding.

2.3 Priors Specification

The proposed priors are defined as follows:

𝜷=𝜸𝐭λ,𝝆(𝜸),𝜸N(𝟎2p,σγ2(𝐋+ε𝐈2p)1),σγ2IG(aγ,bγ),λUnif(λl,λu),\displaystyle\boldsymbol{\beta}=\boldsymbol{\gamma}\circ{\mathbf{t}_{\lambda,\boldsymbol{\rho}}(\boldsymbol{\gamma})},\boldsymbol{\gamma}\sim\mathrm{N}\left(\mathbf{0}_{2p},\sigma_{\gamma}^{2}\left(\mathbf{L}+\varepsilon\mathbf{I}_{2p}\right)^{-1}\right),\sigma_{\gamma}^{2}\sim\operatorname{IG}\left(a_{\gamma},b_{\gamma}\right),\lambda\sim\operatorname{Unif}\left(\lambda_{l},\lambda_{u}\right), (3)
rijBern(πj),πjBeta(aπ,bπ),ϕGa(aϕ,bϕ),\displaystyle r_{ij}\sim\operatorname{Bern}\left(\pi_{j}\right),\pi_{j}\sim\operatorname{Beta}\left(a_{\pi},b_{\pi}\right),\phi\sim\operatorname{Ga}\left(a_{\phi},b_{\phi}\right),
μ0jN(0,σ0j2),αkN(0,σαk2),\displaystyle\mu_{0j}\sim N\left(0,\sigma_{0j}^{2}\right),\alpha_{k}\sim N\left(0,\sigma_{\alpha_{k}}^{2}\right),

where 𝜷=((𝜷(1))T,(𝜷(2))T)(2p)T=(β1(1),,βp(1),β1(2),,βp(2))T\boldsymbol{\beta}=\left((\boldsymbol{\beta}^{(1)})^{\mathrm{T}},(\boldsymbol{\beta}^{(2)})^{\mathrm{T}}\right)_{(2p)}^{\mathrm{T}}=\left(\beta^{(1)}_{1},\ldots,\beta^{(1)}_{p},\beta^{(2)}_{1},\ldots,\beta^{(2)}_{p}\right)^{\mathrm{T}}, 𝜸=((𝜸(1))T,(𝜸(2))T)(2p)T\boldsymbol{\gamma}=\left((\boldsymbol{\gamma}^{(1)})^{\mathrm{T}},(\boldsymbol{\gamma}^{(2)})^{\mathrm{T}}\right)_{(2p)}^{\mathrm{T}} represents the effect size of the genes, and \circ denotes the element-wise product. 𝐭λ,𝝆(𝜸)={I(|γ1(1)|>λρ1(1)),,I(|γp(2)|>λρp(2))}T\mathbf{t}_{\lambda,\boldsymbol{\rho}}(\boldsymbol{\gamma})=\left\{\operatorname{I}\left(\left|\gamma^{(1)}_{1}\right|>\lambda\cdot\rho^{(1)}_{1}\right),\ldots,\operatorname{I}\left(\left|\gamma^{(2)}_{p}\right|>\lambda\cdot\rho^{(2)}_{p}\right)\right\}^{\mathrm{T}} is a vector thresholding function with λ\lambda being a parameter controlling model sparsity and 𝝆=((𝝆(1))T,(𝝆(2))T)(2p)T\boldsymbol{\rho}=\left((\boldsymbol{\rho}^{(1)})^{\mathrm{T}},(\boldsymbol{\rho}^{(2)})^{\mathrm{T}}\right)_{(2p)}^{\mathrm{T}} being the adaptive weights. Moreover, 𝐋=(sgn(𝜷~)sgn(𝜷~)T)𝐋~\mathbf{L}=\left(\operatorname{sgn}(\tilde{\boldsymbol{\beta}})\operatorname{sgn}(\tilde{\boldsymbol{\beta}})^{\mathrm{T}}\right)\,\circ\,\widetilde{\mathbf{L}} is a block diagonal adaptive Laplacian matrix, where sgn(𝜷~)=(sgn(β~1(1)),,sgn(β~p(2)))T\operatorname{sgn}(\tilde{\boldsymbol{\beta}})=\left({\operatorname{sgn}}\left(\tilde{\beta}^{(1)}_{1}\right),\ldots,{\operatorname{sgn}}\left(\tilde{\beta}^{(2)}_{p}\right)\right)^{\mathrm{T}} with 𝜷~\tilde{\boldsymbol{\beta}} being a rough estimate of 𝜷\boldsymbol{\beta} and 𝐋~=𝐈𝑫~1/2𝑨~𝑫~1/2\widetilde{\mathbf{L}}=\mathbf{I}-\widetilde{\boldsymbol{D}}^{-1/2}\widetilde{\boldsymbol{A}}\widetilde{\boldsymbol{D}}^{-1/2} with 𝑨~=(𝑨𝟎𝟎𝑨)=(a~jl)2p×2p\widetilde{\boldsymbol{A}}=\left(\begin{array}[]{cc}\boldsymbol{A}&\mathbf{0}\\ \mathbf{0}&\boldsymbol{A}\end{array}\right)=(\widetilde{a}_{jl})_{2p\times 2p} and 𝑫~=diag(d1,,dp,d1,,dp)\widetilde{\boldsymbol{D}}=\operatorname{diag}\left(d_{1},\cdots,d_{p},d_{1},\cdots,d_{p}\right) with dj=l=1pajld_{j}=\sum_{l=1}^{p}a_{jl}. ε\varepsilon is a small constant to make (𝐋+ε𝐈2p)\left({\mathbf{L}}+\varepsilon\mathbf{I}_{2p}\right) strictly positive-definite, which is set as 10310^{-3} in our numerical work.

The proposed priors have been motivated by the following considerations. The identification of SV genes is achieved using the hard-thresholded Gaussian prior, where βj(d)\beta^{(d)}_{j} is shrunk to zero when tλ,ρj(d)(γj(d))=I(|γj(d)|>λρj(d))=0\mathrm{t}_{\lambda,\rho^{(d)}_{j}}\left(\gamma^{(d)}_{j}\right)=\operatorname{I}\left(\left|\gamma^{(d)}_{j}\right|>\lambda\cdot\rho^{(d)}_{j}\right)=0 (d=1,2d=1,2). The thresholded Gaussian prior has been shown as a useful alternative to shrinkage priors in Bayesian sparse analysis (Wu et al., 2024), and is favored here for its simplicity and flexibility as well as its appealing interpretability as a minimal detectable signal. Here, we further introduce a series of weights ρj(d)\rho_{j}^{(d)}’s in 𝐭λ,𝝆(𝜸)\mathbf{t}_{\lambda,\boldsymbol{\rho}}(\boldsymbol{\gamma}) to adjust the shrinkage of various γj(d)\gamma^{(d)}_{j}’s to improve selection efficiency, where the genes with strong spatial variability are potentially assigned small weights and thus are more likely to have nonzero βj(d)\beta^{(d)}_{j}’s.

Moreover, the network dependency is introduced via the graph Laplacian matrix in the covariance matrix of the hard-thresholded Gaussian prior. The proposed network-assisted strategy is motivated by the successes of Laplacian shrinkage in high-dimensional regression analysis (Chakraborty and Lozano, 2019; Cai et al., 2020). Different from these studies, we innovatively conduct SV gene detection with the network structures among genes incorporated. In particular, the Laplacian matrix 𝐋~\widetilde{\mathbf{L}} is further modified with a pre-defined sign matrix sgn(𝜷~)sgn(𝜷~)T\operatorname{sgn}(\tilde{\boldsymbol{\beta}})\operatorname{sgn}(\tilde{\boldsymbol{\beta}})^{\mathrm{T}} to accommodate the scenario where two neighborhood genes are negatively correlated and have opposite directions of spatial variability. With the proposed priors, we have 𝜸T(𝐋+ε𝐈2p)𝜸=jl(sgn(β~j(1))γj(1)djsgn(β~l(1))γl(1)dl)2+jl(sgn(β~j(2))γj(2)djsgn(β~l(2))γl(2)dl)2\boldsymbol{\gamma}^{\mathrm{T}}\left(\mathbf{L}+\varepsilon\mathbf{I}_{2p}\right)\boldsymbol{\gamma}=\sum_{j\sim l}\left(\frac{\text{sgn}\left(\tilde{\beta}_{j}^{(1)}\right)\gamma_{j}^{(1)}}{\sqrt{d_{j}}}-\frac{\text{sgn}\left(\tilde{\beta}_{l}^{(1)}\right)\gamma_{l}^{(1)}}{\sqrt{d_{l}}}\right)^{2}+\sum_{j\sim l}\left(\frac{\text{sgn}\left(\tilde{\beta}_{j}^{(2)}\right)\gamma_{j}^{(2)}}{\sqrt{d_{j}}}-\frac{\text{sgn}\left(\tilde{\beta}_{l}^{(2)}\right)\gamma_{l}^{(2)}}{\sqrt{d_{l}}}\right)^{2}, where for genes jj and ll with an edge e(j,l)e(j,l), the absolute values of γj(d)\gamma^{(d)}_{j} and γl(d)\gamma^{(d)}_{l} are promoted to be similar, further inducing simultaneous SV or non-SV.

We assign a Bernoulli prior for the latent variable rijr_{ij} with the hyperparameter πjBeta(aπ,bπ)\pi_{j}\sim\text{Beta}(a_{\pi},b_{\pi}), where πj\pi_{j} is the probability that YijY_{ij} is a dropout zero. A Gamma distribution Ga(aϕ,bϕ)\text{Ga}(a_{\phi},b_{\phi}) is assumed for the dispersion parameter ϕ\phi. For the variance term σ𝜸2\sigma_{\boldsymbol{\gamma}}^{2}, we use the conjugate prior by assigning the Inverse-Gamma distribution IG(aγ,bγ)\operatorname{IG}\left(a_{\gamma},b_{\gamma}\right). The non-negative uniform prior Unif(λl,λu)\text{Unif}\left(\lambda_{l},\lambda_{u}\right) is assigned for the threshold parameter λ\lambda. For μ0j\mu_{0j} and αk\alpha_{k}, the normal priors with mean 0 and variance σ0j2\sigma_{0j}^{2} and σαk2\sigma_{\alpha_{k}}^{2} are assumed, respectively. These priors have been popular in the existing Bayesian studies.

2.4 Bayesian Inference

The model parameter space consists of (𝜸,𝑹,𝝁0,𝜶,ϕ,λ)\left(\boldsymbol{\gamma},\boldsymbol{R},\boldsymbol{\mu}_{0},\boldsymbol{\alpha},\phi,\lambda\right), where 𝑹=(rij)(n×p)\boldsymbol{R}=(r_{ij})_{(n\times p)}, and 𝝁0\boldsymbol{\mu}_{0} and 𝜶\boldsymbol{\alpha} are the vectors consisting of μ0j\mu_{0j}’s and αk\alpha_{k}’s, respectively. The posterior distribution is

f(𝜸,𝑹,𝝁0,𝜶,ϕ,λ𝐘)\displaystyle f\left(\boldsymbol{\gamma},\boldsymbol{R},\boldsymbol{\mu}_{0},\boldsymbol{\alpha},\phi,\lambda\mid\mathbf{Y}\right)\propto (4)
f(𝐘𝜸,𝑹,𝝁0,𝜶,ϕ,λ)f(𝜸σγ2)f(σγ2)f(λ)f(𝝁0)f(𝜶)f(ϕ)f(𝑹𝝅)f(𝝅)𝑑𝝅.\displaystyle\int f\left(\mathbf{Y}\mid\boldsymbol{\gamma},\boldsymbol{R},\boldsymbol{\mu}_{0},\boldsymbol{\alpha},\phi,\lambda\right)f\left(\boldsymbol{\gamma}\mid\sigma_{\gamma}^{2}\right)f\left(\sigma_{\gamma}^{2}\right)f(\lambda)f\left(\boldsymbol{\mu}_{0}\right)f\left(\boldsymbol{\alpha}\right)f(\phi)f(\boldsymbol{R}\mid\boldsymbol{\pi})f(\boldsymbol{\pi})d\boldsymbol{\pi}.
={{(i,j):rij=0}NB(Yijμij,ϕ)}×N(𝜸𝟎,σγ2(𝐋+ε𝐈2p)1)×IG(σ𝜸aγ,bγ)\displaystyle=\left\{\prod_{\left\{(i,j):r_{ij}=0\right\}}\operatorname{NB}\left(Y_{ij}\mid\mu_{ij},\phi\right)\right\}\times\mathrm{N}\left(\boldsymbol{\gamma}\mid\mathbf{0},\sigma_{\gamma}^{2}\left(\mathbf{L}+\varepsilon\mathbf{I}_{2p}\right)^{-1}\right)\times\operatorname{IG}\left(\sigma_{\boldsymbol{\gamma}}\mid a_{\gamma},b_{\gamma}\right)
×Unif(λλl,λu)×N(𝝁0𝟎,diag(σμ012,,σμ0p2))×N(𝜶𝟎,diag(σα12,,σαK2))\displaystyle\quad\times\operatorname{Unif}\left(\lambda\mid\lambda_{l},\lambda_{u}\right)\times\mathrm{N}\left(\boldsymbol{\mu}_{0}\mid\mathbf{0},\operatorname{diag}\left(\sigma_{\mu_{01}}^{2},\ldots,\sigma_{\mu_{0p}}^{2}\right)\right)\times\mathrm{N}\left(\boldsymbol{\alpha}\mid\mathbf{0},\operatorname{diag}\left(\sigma_{\alpha_{1}}^{2},\ldots,\sigma_{\alpha_{K}}^{2}\right)\right)
×Ga(ϕaϕ,bϕ)×i=1nj=1pBeBern(rijaπ,bπ),\displaystyle\quad\times\operatorname{Ga}\left(\phi\mid a_{\phi},b_{\phi}\right)\times\prod_{i=1}^{n}\prod_{j=1}^{p}\operatorname{Be-Bern}\left(r_{ij}\mid a_{\pi},b_{\pi}\right),

where BeBern\operatorname{Be-Bern} denotes the Beta-Bernoulli distribution with the probability mass function BeBern(rijaπ,bπ)=Beta(aπ+rij,bπrij+1)Beta(aπ,bπ)\operatorname{Be-Bern}\left(r_{ij}\mid a_{\pi},b_{\pi}\right)=\frac{\operatorname{Beta}\left(a_{\pi}+r_{ij},b_{\pi}-r_{ij}+1\right)}{\operatorname{Beta}\left(a_{\pi},b_{\pi}\right)}.

The posterior sampling is conducted based on the MCMC algorithm. We first introduce the sampling variances τμ02,τα2,τϕ2,τγ2,\tau_{\mu_{0}}^{2},\tau_{\alpha}^{2},\tau_{\phi}^{2},\tau_{\gamma}^{2}, and τλ2\tau_{\lambda}^{2} for μ0j\mu_{0j}’s, αk\alpha_{k}’s, ϕ\phi, 𝜸\boldsymbol{\gamma}, and λ\lambda, respectively, and then conduct the following steps at each MCMC iteration, where the symbol “\boldsymbol{-}” in the condition position denotes “the rest parameters”.

  1. (a)

    Sequentially update rijr_{ij} for {(i,j):Yij=0}\left\{\left(i,j\right):Y_{ij}=0\right\} with the conditional distribution of rijr_{ij} given by f(rij)(ϕμij+ϕ)ϕ(1rij)×BeBern(rijaπ,bπ)f\left(r_{ij}\mid\boldsymbol{-}\right)\propto\left(\frac{\phi}{\mu_{ij}+\phi}\right)^{\phi(1-r_{ij})}\times\operatorname{Be-Bern}\left(r_{ij}\mid a_{\pi},b_{\pi}\right).

  2. (b)

    Sequentially sample μ0jN(μ0j,τμ02)\mu_{0j}^{*}\sim N(\mu_{0j},\tau_{\mu_{0}}^{2}) for j=1,,pj=1,\ldots,p, and accept μ0j\mu_{0j}^{*} with probability min{1,N(μ0jμ0j,τμ02)×N(μ0j0,σ0j2)×{i:rij=0}NB(Yijμ0j,)N(μ0jμ0j,τμ02)×N(μ0j0,σ0j2)×{i:rij=0}NB(Yijμ0j,)}.\min\left\{1,\frac{N\left(\mu_{0j}\mid\mu_{0j}^{*},\tau_{\mu_{0}}^{2}\right)\times N\left(\mu_{0j}^{*}\mid 0,\sigma_{0j}^{2}\right)\times\prod_{\left\{i:r_{ij}=0\right\}}\text{NB}\left(Y_{ij}\mid\mu_{0j}^{*},\boldsymbol{-}\right)}{N\left(\mu^{*}_{0j}\mid\mu_{0j},\tau_{\mu_{0}}^{2}\right)\times N\left(\mu_{0j}\mid 0,\sigma_{0j}^{2}\right)\times\prod_{\left\{i:r_{ij}=0\right\}}\text{NB}\left(Y_{ij}\mid\mu_{0j},\boldsymbol{-}\right)}\right\}.

  3. (c)

    Sequentially sample αkN(αk,τα2)\alpha_{k}^{*}\sim N(\alpha_{k},\tau_{\alpha}^{2}) for k=1,,Kk=1,\ldots,K, and accept αk\alpha_{k}^{*} with probability min{1,N(αkαk,τα2)×N(αk0,σαk2)×{(i,j):rij=0}NB(Yijαk,)N(αkαk,τα2)×N(αk0,σαk2)×{(i,j):rij=0}NB(Yijαk,)}.\min\left\{1,\frac{N\left(\alpha_{k}\mid\alpha_{k}^{*},\tau_{\alpha}^{2}\right)\times N\left(\alpha_{k}^{*}\mid 0,\sigma_{\alpha_{k}}^{2}\right)\times\prod_{\left\{(i,j):r_{ij}=0\right\}}\text{NB}\left(Y_{ij}\mid\alpha_{k}^{*},\boldsymbol{-}\right)}{N\left(\alpha^{*}_{k}\mid\alpha_{k},\tau_{\alpha}^{2}\right)\times N\left(\alpha_{k}\mid 0,\sigma_{\alpha_{k}}^{2}\right)\times\prod_{\left\{(i,j):r_{ij}=0\right\}}\text{NB}\left(Y_{ij}\mid\alpha_{k},\boldsymbol{-}\right)}\right\}.

  4. (d)

    Sample ϕ\phi^{*} from N(ϕ,τϕ2)N(\phi,\tau_{\phi}^{2}) truncated at 0, and accept ϕ\phi^{*} with probability min{1,\min\left\{1,\right. N+(ϕϕ,0,,τϕ2)×Ga(ϕ;aϕ,bϕ)×{(i,j):rij=0}NB(Yijϕ,)N+(ϕϕ,0,,τϕ2)×Ga(ϕ;aϕ,bϕ)×{(i,j):rij=0}NB(Yijϕ,)}\left.\frac{N_{+}\left(\phi\mid\phi^{*},0,\infty,\tau_{\phi}^{2}\right)\times\mathrm{Ga}\left(\phi^{*};a_{\phi},b_{\phi}\right)\times\prod_{\left\{\left(i,j\right):r_{ij}=0\right\}}\text{NB}\left(Y_{ij}\mid\phi^{*},\boldsymbol{-}\right)}{N_{+}\left(\phi^{*}\mid\phi,0,\infty,\tau_{\phi}^{2}\right)\times\mathrm{Ga}\left(\phi;a_{\phi},b_{\phi}\right)\times\prod_{\left\{\left(i,j\right):r_{ij}=0\right\}}\text{NB}\left(Y_{ij}\mid\phi,\boldsymbol{-}\right)}\right\}.

  5. (e)

    Sample 𝜸\boldsymbol{\gamma}^{*} from N(μ(𝜸),τ𝜸2σγ2(𝐋+ε𝐈2p)1)\mathrm{N}\left(\mu\left(\boldsymbol{\gamma}\right),\tau_{\boldsymbol{\gamma}}^{2}\cdot\sigma_{\gamma}^{2}\left(\mathbf{L}+\varepsilon\mathbf{I}_{2p}\right)^{-1}\right), where

    μ(𝜸)=1τ𝜸2𝜸+(11τ𝜸2)σγ2(𝐋+ε𝐈2p)1𝜸logf(𝐘𝜸,),\mu\left(\boldsymbol{\gamma}\right)=\sqrt{1-\tau_{\boldsymbol{\gamma}}^{2}}\boldsymbol{\gamma}+\left(1-\sqrt{1-\tau_{\boldsymbol{\gamma}}^{2}}\right)\sigma_{\gamma}^{2}\left(\mathbf{L}+\varepsilon\mathbf{I}_{2p}\right)^{-1}\nabla_{\boldsymbol{\gamma}}\log f(\mathbf{Y}\mid\boldsymbol{\gamma},\boldsymbol{-}), (5)

    with 𝜸logf(𝐘𝜸,)\nabla_{\boldsymbol{\gamma}}\log f(\mathbf{Y}\mid\boldsymbol{\gamma},\boldsymbol{-}) being the first derivative of the log-likelihood function with respect to 𝜸\boldsymbol{\gamma}. Then, accept 𝜸\boldsymbol{\gamma}^{*} with probability min{1,{(i,j):rij=0}NB(Yij𝜸,){(i,j):rij=0}NB(Yij𝜸,)}\min\left\{1,\frac{\prod_{\left\{(i,j):r_{ij}=0\right\}}\text{NB}\left(Y_{ij}\mid\boldsymbol{\gamma}^{*},\boldsymbol{-}\right)}{\prod_{\left\{(i,j):r_{ij}=0\right\}}\text{NB}\left(Y_{ij}\mid\boldsymbol{\gamma},\boldsymbol{-}\right)}\right\}.

  6. (f)

    Update ρj(d)=s=1S(min{l:lVs}1|𝜸l(d)|1/2)I(jVs)\rho_{j}^{(d)}=\prod_{s=1}^{S}\left(\min_{\left\{l:l\in V_{s}\right\}}\frac{1}{\left|\boldsymbol{\gamma}^{(d)}_{l}\right|^{1/2}}\right)^{\operatorname{I}(j\in V_{s})} for j=1,,pj=1,\ldots,p and d=1,2d=1,2, and update 𝐭λ,𝝆(𝜸)\mathbf{t}_{\lambda,\boldsymbol{\rho}}(\boldsymbol{\gamma}) accordingly, where V1,V2,,VSV_{1},V_{2},\ldots,V_{S} are the index sets of the SS disconnected sub-networks in G(V,E)G(V,E).

  7. (g)

    Sample σ𝜸2\sigma_{\boldsymbol{\gamma}}^{2} from IG(a~𝜸,b~𝜸)\operatorname{IG}\left(\tilde{a}_{\boldsymbol{\gamma}},\tilde{b}_{\boldsymbol{\gamma}}\right) with shape parameter a~𝜸=a𝜸+p\tilde{a}_{\boldsymbol{\gamma}}=a_{\boldsymbol{\gamma}}+p and scale parameter b~𝜸=b𝜸+𝜸T(𝐋+ε𝐈2p)𝜸2\tilde{b}_{\boldsymbol{\gamma}}=b_{\boldsymbol{\gamma}}+\frac{\boldsymbol{\gamma}^{\mathrm{T}}\left(\mathbf{L}+\varepsilon\mathbf{I}_{2p}\right)\boldsymbol{\gamma}}{2}.

  8. (h)

    Sample λ\lambda^{*} from N(λ,τλ2)N\left(\lambda,\tau_{\lambda}^{2}\right) truncated at interval [λl,λu][\lambda_{l},\lambda_{u}], and accept λ\lambda^{*} with probability min{1,N+(λλ,λl,λu,τλ2){(i,j):rij=0}NB(Yijλ,)N+(λλ,λl,λu,τλ2){(i,j):rij=0}NB(Yijλ,)}.\min\left\{1,\frac{N_{+}\left(\lambda\mid\lambda^{*},\lambda_{l},\lambda_{u},\tau_{\lambda}^{2}\right)\prod_{\left\{\left(i,j\right):r_{ij}=0\right\}}\text{NB}\left(Y_{ij}\mid\lambda^{*},\boldsymbol{-}\right)}{N_{+}\left(\lambda^{*}\mid\lambda,\lambda_{l},\lambda_{u},\tau_{\lambda}^{2}\right)\prod_{\left\{\left(i,j\right):r_{ij}=0\right\}}\text{NB}\left(Y_{ij}\mid\lambda,\boldsymbol{-}\right)}\right\}.

Here, updating rijr_{ij}’s and σ𝜸2\sigma_{\boldsymbol{\gamma}}^{2} is achieved through the Gibbs sampler, while the Metropolis-Hasting (MH) algorithm is adopted for sampling μ0j\mu_{0j}’s, αk\alpha_{k}’s, ϕ\phi, and λ\lambda. For sampling 𝜸\boldsymbol{\gamma}, we resort to the preconditioned Crank-Nicolson Langevin dynamics (pCNLD), which takes advantage of the gradient information of the target distribution to speed up convergence. Furthermore, with (5), pCNLD explicitly incorporates the network dependency into the sampling process. The details of the proposed pCNLD sampling are given in Supplementary Section S1. For ρj(d)\rho_{j}^{(d)}’s, since biological networks are often composed of multiple disconnected sub-networks, we propose adopting a set of group-wise weights for the SS sub-networks for more effectively utilizing network information. Specifically, a series of data-driven weights inversely proportional to the absolute effect sizes of the genes involved in the specific sub-networks are introduced, potentially facilitating the identification of SV genes with weak signals, which may be involved in the sub-networks consisting of SV genes with strong signals and small thresholds. The settings for the hyperparameters and sampling variances are provided in Supplementary Section S2.

For the spatial modeling function 𝒦()\mathcal{K}(\cdot), we follow the published studies and consider three most popular functions. Specifically, for xid(d=1,2)x_{id}~{}(d=1,2) which has been transformed to have mean 0 and standard deviation 1, we consider one linear function 𝒦1(xid)=xid\mathcal{K}_{1}(x_{id})=x_{id}, two exponential functions 𝒦2(xid)=exp(xid2(ld(1))2)\mathcal{K}_{2}(x_{id})=\exp\left(-\frac{x_{id}}{2\left(l_{d}^{(1)}\right)^{2}}\right) and 𝒦3(xid)=exp(xid2(ld(2))2)\mathcal{K}_{3}(x_{id})=\exp\left(-\frac{x_{id}}{2\left(l_{d}^{(2)}\right)^{2}}\right), and two periodic functions 𝒦4(xid)=cos(2πxidld(1))\mathcal{K}_{4}(x_{id})=\cos\left(\frac{2\pi x_{id}}{l_{d}^{(1)}}\right) and 𝒦5(xid)=cos(2πxidld(2))\mathcal{K}_{5}(x_{id})=\cos\left(\frac{2\pi x_{id}}{l_{d}^{(2)}}\right), where ld(1)l_{d}^{(1)} and ld(2)l_{d}^{(2)} are the 40% and 60% quantiles of |x1d|,,|xnd||x_{1d}|,\cdots,|x_{nd}|, respectively. To accommodate the fact that results may be sensitive to the choice of scale parameter, two values are considered for the exponential and periodic patterns. Then, we conduct the MCMC sampling for each 𝒦s()(s=1,,5)\mathcal{K}_{s}(\cdot)(s=1,\cdots,5) and obtain the estimated posterior expectation 𝜷^𝒦s=m=1M𝜸(m)𝐭λ(m),𝝆(m)(𝜸(m))m=1M𝐭λ(m),𝝆(m)(𝜸(m)),𝝁^0𝒦s=m=1M𝝁0(m)M,𝜶^𝒦s=m=1M𝜶(m)M\hat{\boldsymbol{\beta}}^{{\mathcal{K}_{s}}}=\frac{\sum_{m=1}^{M}\boldsymbol{\gamma}^{(m)}\boldsymbol{\mathbf{t}}_{\lambda^{(m)},\boldsymbol{\rho}^{(m)}}\left(\boldsymbol{\gamma}^{(m)}\right)}{\sum_{m=1}^{M}\boldsymbol{\mathbf{t}}_{\lambda^{(m)},\boldsymbol{\rho}^{(m)}}\left(\boldsymbol{\gamma}^{(m)}\right)},\hat{\boldsymbol{\mu}}_{0}^{{\mathcal{K}_{s}}}=\frac{\sum_{m=1}^{M}\boldsymbol{\mu}_{0}^{(m)}}{M},\hat{\boldsymbol{\alpha}}^{{\mathcal{K}_{s}}}=\frac{\sum_{m=1}^{M}\boldsymbol{\alpha}^{(m)}}{M}, ϕ^𝒦s=m=1Mϕ(m)M\hat{\phi}^{{\mathcal{K}_{s}}}=\frac{\sum_{m=1}^{M}\phi^{(m)}}{M}, and 𝑹^𝒦s=m=1M𝑹(m)M\hat{\boldsymbol{R}}^{{\mathcal{K}_{s}}}=\frac{\sum_{m=1}^{M}\boldsymbol{R}^{(m)}}{M}, where {𝜸(m),λ(m),𝝆(m),𝝁0(m),𝜶(m),ϕ(m),𝑹(m)}m=1M\left\{\boldsymbol{\gamma}^{(m)},\lambda^{(m)},\boldsymbol{\rho}^{(m)},\boldsymbol{\mu}_{0}^{(m)},\boldsymbol{\alpha}^{(m)},\phi^{(m)},\boldsymbol{R}^{(m)}\right\}_{m=1}^{M} denotes the MM samples obtained after burn-in and thinning (we omit the dependence on 𝒦s\mathcal{K}_{s} to simplify notation).

To facilitate the combination of five models, instead of directly considering the values of 𝜷^𝒦s\hat{\boldsymbol{\beta}}^{{\mathcal{K}_{s}}} for SV gene identification, we further introduce a posterior inclusion probability vector estimated as PIP𝒦s=1Mm=1M𝐭λ(m),𝝆(m)(𝜸(m)){\mathrm{PIP}}^{{\mathcal{K}_{s}}}=\frac{1}{M}\sum_{m=1}^{M}\boldsymbol{\mathbf{t}}_{\lambda^{(m)},\boldsymbol{\rho}^{(m)}}\left(\boldsymbol{\gamma}^{(m)}\right) and consider the posterior model probability f(𝒦s𝐘)f\left(\mathcal{M}_{\mathcal{K}_{s}}\mid\mathbf{Y}\right) introduced in Quintana et al. (2011), where 𝒦s\mathcal{M}_{\mathcal{K}_{s}} denotes the model with 𝒦s()\mathcal{K}_{s}(\cdot). Specifically, for each 𝒦s\mathcal{M}_{\mathcal{K}_{s}}, we calculate f(𝒦s𝐘)=f(𝐘𝒦s)f(𝒦s)s=15f(𝐘𝒦s)f(𝒦s)f\left(\mathcal{M}_{\mathcal{K}_{s}}\mid\mathbf{Y}\right)=\frac{f\left(\mathbf{Y}\mid\mathcal{M}_{\mathcal{K}_{s}}\right)f\left(\mathcal{M}_{\mathcal{K}_{s}}\right)}{\sum_{s^{\prime}=1}^{5}f\left(\mathbf{Y}\mid\mathcal{M}_{\mathcal{K}_{s^{\prime}}}\right)f\left(\mathcal{M}_{\mathcal{K}_{s^{\prime}}}\right)}. Here, f(𝒦s)f\left(\mathcal{M}_{\mathcal{K}_{s}}\right) is the prior probability of model 𝒦s\mathcal{M}_{\mathcal{K}_{s}}, and we set a non-informative prior with f(𝒦s)=1/5f\left(\mathcal{M}_{\mathcal{K}_{s}}\right)=1/5. f(𝐘𝒦s)f\left(\mathbf{Y}\mid\mathcal{M}_{\mathcal{K}_{s}}\right) is the likelihood function with the estimated parameters 𝜷^𝒦s,𝝁^0𝒦s,𝜶^𝒦s,ϕ^𝒦s\hat{\boldsymbol{\beta}}^{{\mathcal{K}_{s}}},\hat{\boldsymbol{\mu}}_{0}^{{\mathcal{K}_{s}}},\hat{\boldsymbol{\alpha}}^{{\mathcal{K}_{s}}},\hat{\phi}^{\mathcal{K}_{s}}, and 𝑹^𝒦s\hat{\boldsymbol{R}}^{{\mathcal{K}_{s}}} under model 𝒦s\mathcal{M}_{\mathcal{K}_{s}}. We then utilize f(𝒦s𝐘)f\left(\mathcal{M}_{\mathcal{K}_{s}}\mid\mathbf{Y}\right)’s as the model-specific weights to obtain the combined PIP=s=15f(𝒦s𝐘)PIP𝒦s\operatorname{PIP}=\sum\limits_{s=1}^{5}f\left(\mathcal{M}_{\mathcal{K}_{s}}\mid\mathbf{Y}\right)\mathrm{PIP}^{{\mathcal{K}_{s}}}. Moreover, as gene jj with at least one non-zero βj(d)(d=1,2)\beta^{(d)}_{j}\left(d=1,2\right) is identified as SV, we introduce PIP~j=max(PIPj,PIPj+p)\widetilde{\operatorname{PIP}}_{j}=\max\left(\operatorname{PIP}_{j},\mathrm{PIP}_{j+p}\right) for j=1,,pj=1,\ldots,p. Based on PIP~j\widetilde{\operatorname{PIP}}_{j}’s, the Bayesian false discovery rate (BFDR) control strategy is adopted for controlling multiplicity. Specifically, BFDR(c)=j=1p(1PIP~j)I(1PIP~j<c)j=1pI(1PIP~j<c)\operatorname{BFDR}(c)=\frac{\sum_{j=1}^{p}\left(1-\widetilde{\operatorname{PIP}}_{j}\right)\mathrm{I}\left(1-\widetilde{\operatorname{PIP}}_{j}<c\right)}{\sum_{j=1}^{p}\mathrm{I}\left(1-\widetilde{\operatorname{PIP}}_{j}<c\right)} with BFDR(c)\operatorname{BFDR}(c) being the desired significance level. We set BFDR(c)\operatorname{BFDR}(c) as 0.05 in our numerical studies. Then, the SV gene set is defined as {j:PIP~jc}\left\{j:\widetilde{\operatorname{PIP}}_{j}\geq c\right\}. To achieve improved stability and better false negative control, in numerical studies, we run five MCMC chains independently, and the genes identified in more than 80% of the chains are finally identified as SV.

To improve computational efficiency, parallelization is implemented with the R package RcppParallel. Specifically, the marginal sequential sampling is divided into parallel programming to reduce computer time. For 𝜸\boldsymbol{\gamma} with the dependency Laplacian matrix incorporated, benefiting from the sparse and block-wise nature of biological networks, the parallelism block-wise sampler is adopted to avoid sampling from a high-dimensional multivariate normal distribution, which further accelerates computation. Take a simulated dataset with p=5,000p=5,000 and n=1,024n=1,024 as an example. The average computer time is about 1.01 minutes for 100 iterations on a MacBook Pro with an Intel Core i5 CPU and 16GB RAM.

3 Simulations

3.1 Basic Simulations

Simulation studies are conducted under the following settings. (a) n=1,024n=1,024 spots located on a 32 by 32 square lattice, p=5,000p=5,000 genes, and K=6K=6 cell types. (b) The square lattice is partitioned into three regions as displayed in Figure S1, where the cellular compositions 𝒘i\boldsymbol{w}_{i}’s are independently sampled from Dirichlet distributions Dirc(1,1,1,1,1,1)\operatorname{Dirc}(1,1,1,1,1,1) (Region 1), Dirc(3,5,7,9,11,13)\operatorname{Dirc}(3,5,7,9,11,13) (Region 2), and Dirc(18,16,14,12,10,8)\operatorname{Dirc}(18,16,14,12,10,8) (Region 3). (c) Consider three types of spatial pattern 𝒦(xid)\mathcal{K}\left(x_{id}\right), including xidx_{id} (Linear), exp(xid2)\exp\left(-\frac{x_{id}}{2}\right) (Exponential), and cos(2πxid)\cos\left(2\pi x_{id}\right) (Periodic). (d) The networks are block-wise and composed of 100 disconnected sub-networks with 50 nodes each. For each sub-network, two types of network structure, namely Star and Scale-free, are considered to mimic the real-world transcription factor regulatory network and interaction network with scale-free properties, respectively. Illustrative examples of the sub-networks are presented in Figure S2(A) and S2(B). (e) All genes in the first ten sub-networks are SV, leading to a total of 500 SV genes. Both positive and negative signals are considered with various levels of magnitude. More detailed settings are provided in Supplementary Section S3.1. (f) The spatial transcriptomics count data is generated from model (1) with the dispersion parameter ϕ\phi being 10. Two settings of dropout rates, 0.1 and 0.5, are considered, representing low and high sparsity. The baseline parameter μ0j{\mu}_{0j} and cell-type-specific effect αk{\alpha}_{k} are independently generated from N(2,0.52)N(2,0.5^{2}) and N(0,3.52)N(0,3.5^{2}), respectively. There are 12 scenarios, comprehensively covering a wide spectrum with different patterns of spatial expressions, different structures of networks and the corresponding spatially variable signals, and different degrees of sparsity.

In addition to the proposed approach, six alternatives are considered: (a) SpatialDE (Svensson et al., 2018), a likelihood ratio test method based on Gaussian process regression, (b) SPARK (Sun et al., 2020), a method based on a Poisson log linear model with a Gaussian process incorporated, (c) SPARK-X (Zhu et al., 2021), a scalable non-parametric test built on a robust covariance test framework, (d) MERINGUE (Miller et al., 2021), a test based on spatial cross-correlation analysis, (e) CTSV and CTSV-g (Yu and Luo, 2022), a test based on the ZINB model for cell-type-specific SV gene identification and global SV gene identification, respectively, and (f) HEARTSVG (Yuan et al., 2023), a distribution-free method that employs the exclusion of non SV genes to infer the presence of SV genes. Among them, SpatialDE, SPARK, CTSV, and CTSV-g are parametric, while SPARK-X, MERINGUE, and HEARTSVG are non-parametric. All the alternatives conduct tests marginally and implement multiple testing control. The implementation details for the competing methods are given in Supplementary Section S3.2.

Table 1: Simulation results under the scenarios with a low dropout rate, where FDR (BFDR) is controlled to be <<0.05. In each cell, mean (SD) is based on 50 replicates.
Recall Precision F1 Recall Precision F1
Linear pattern Star Network Scale-free Network
proposed 0.992(0.016) 1.000(0.000) 0.996(0.008) 0.991(0.016) 1.000(0.000) 0.995(0.008)
spatialDE 0.408(0.172) 0.990(0.030) 0.554(0.194) 0.549(0.183) 0.990(0.045) 0.685(0.179)
SPARK 0.907(0.111) 0.977(0.010) 0.937(0.065) 0.911(0.079) 0.991(0.005) 0.947(0.046)
SPARK-X 0.887(0.259) 0.154(0.193) 0.181(0.021) 0.772(0.143) 0.485(0.370) 0.502(0.287)
MERINGUE 0.562(0.272) 0.944(0.027) 0.658(0.241) 0.642(0.256) 0.958(0.021) 0.732(0.219)
CTSV 0.118(0.039) 0.642(0.249) 0.186(0.054) 0.162(0.076) 0.729(0.193) 0.249(0.100)
CTSV-g 0.978(0.057) 0.810(0.104) 0.880(0.063) 0.978(0.044) 0.936(0.013) 0.956(0.023)
HEARTSVG 0.762(0.246) 0.284(0.265) 0.298(0.129) 0.920(0.142) 0.178(0.159) 0.259(0.106)
Exponential pattern Star Network Scale-free Network
proposed 0.933(0.141) 1.000(0.001) 0.959(0.091) 0.979(0.035) 1.000(0.001) 0.989(0.019)
spatialDE 0.036(0.032) 0.989(0.058) 0.084(0.053) 0.278(0.186) 0.986(0.061) 0.424(0.232)
SPARK 0.229(0.103) 0.988(0.013) 0.360(0.139) 0.622(0.133) 0.993(0.005) 0.756(0.113)
SPARK-X 0.860(0.296) 0.098(0.004) 0.177(0.011) 0.702(0.262) 0.473(0.369) 0.442(0.276)
MERINGUE 0.153(0.127) 0.949(0.042) 0.248(0.174) 0.479(0.252) 0.957(0.023) 0.592(0.254)
CTSV 0.012(0.022) 0.116(0.150) 0.043(0.033) 0.025(0.034) 0.366(0.315) 0.058(0.045)
CTSV-g 0.899(0.173) 0.918(0.022) 0.897(0.121) 0.957(0.078) 0.934(0.015) 0.944(0.041)
HEARTSVG 0.282(0.169) 0.839(0.278) 0.361(0.195) 0.634(0.184) 0.623(0.313) 0.536(0.188)
Periodic pattern Star Network Scale-free Network
proposed 0.996(0.008) 1.000(0.000) 0.998(0.004) 0.979(0.130) 1.000(0.000) 0.981(0.121)
spatialDE 0.087(0.027) 0.991(0.053) 0.158(0.049) 0.174(0.097) 0.990(0.056) 0.286(0.137)
SPARK 0.215(0.088) 0.987(0.012) 0.345(0.117) 0.620(0.130) 0.992(0.006) 0.754(0.111)
SPARK-X 0.887(0.268) 0.156(0.195) 0.178(0.023) 0.817(0.238) 0.492(0.366) 0.501(0.299)
MERINGUE 0.433(0.265) 0.952(0.025) 0.543(0.263) 0.688(0.277) 0.961(0.019) 0.761(0.236)
CTSV 0.088(0.055) 0.664(0.268) 0.142(0.072) 0.215(0.163) 0.742(0.217) 0.291(0.187)
CTSV-g 0.969(0.076) 0.884(0.045) 0.922(0.042) 0.990(0.028) 0.947(0.009) 0.968(0.015)
HEARTSVG 0.230(0.135) 0.980(0.043) 0.355(0.165) 0.598(0.184) 0.976(0.084) 0.720(0.159)

For evaluating identification performance, we adopt Recall = TP+FNTP\frac{\text{TP+FN}}{\operatorname{TP}} , Precision =TP+FPTP\frac{\operatorname{TP+FP}}{\operatorname{TP}}, and F1 score = 2PrecisionRecallPrecision+Recall,\frac{2\cdot\text{Precision}\cdot\operatorname{Recall}}{\text{Precision+Recall}}, with TP, FP, and FN being the numbers of true positives, false positives, and false negatives, respectively. Under each scenario, we simulate 50 replicates. Summary results under the scenarios with a low dropout rate are given in Table 1. The rest of the results with a high dropout rate are provided in Supplementary Table S1. It is observed that the proposed approach achieves superior accuracy in identifying SV genes with higher F1 scores across all scenarios. Overall, CTSV-g achieves the second-best identification performance since it also accommodates dropouts and over-dispersion through the adoption of ZINB distribution. Improved performance over CTSV-g supports the validity of incorporating network information and accommodating cellular diversity. Furthermore, the majority of the alternatives exhibit good performance with the simple linear patterns while the proposed approach performs stably across different spatial expression patterns. Moreover, similar to those in some published studies, SPARK-X exhibits many false negatives, probably due to its covariance test framework and neglection of cellular composition confounding, and SPARK achieves excellent FP control across all scenarios, mainly attributable to the robust Cauchy combination rule. CTSV identifies the fewest global SV genes across all scenarios due to its concern with cell-type-specific SV genes and inability in accurately identifying genes with global spatial expression variations. Noticeably, under the scenarios with a scale-free network where the signals of the SV genes are larger, all approaches exhibit improved performance with higher F1 score values. However, the superiority of the proposed approach is again evidently observed. With a higher dropout rate, which is more common with practical ST data, all approaches have decayed performance, especially for SpatialDE which is not well-suited for sparse expression distribution. However, the superiority of the proposed approach becomes more prominent.

To better appreciate the operating characteristics of the proposed network-assisted strategy, in Figure S3, we take the scale-free network as an example and further examine the SV genes with large and small signals separately with two Recall indices (L:Recall and S:Recall). Here, besides the proposed approach, we also consider the corresponding approach without the assistance of the network (where 𝐋\mathbf{L} is a simple identity matrix). As shown in Figure S3, benefiting from network smoothness, the proposed approach has remarkably improved identification performance, especially for the SV genes with small signals. Such an improvement becomes more obvious for sparser data with a higher dropout rate. Moreover, selection stability is also improved.

3.2 Model Misspecification

We further evaluate performance of the proposed approach under scenarios where the data generation model is misspecified. In particular, we consider two types of model misspecification: (M1) the ZINB model considered in Yu and Luo (2022) where the logarithmic mean value is set as a mix of cell-type-specific expression levels; and (M2) the Poisson model considered in Sun et al. (2020) with the parameters inferred from a real mouse olfactory bulb study. More detailed settings are provided in Supplementary Section S3.4, and summary results are reported in Figure S6.

It is observed that the proposed approach can still maintain its superiority or at least perform comparably to the method that matches the data generation model. Specifically, under model (M1), which favors the CTSV method, the proposed approach achieves more accurate identification with a median F1 equal to 1, while CTSV exhibits the second-best performance. Moreover, even under model (M2) without cellular composition confounding, the proposed approach can still correctly identify SV genes with a median F1 score larger than 0.92. Here, spatialDE and SPARK also behave well since the expression pattern is inferred based on the analysis results of the two approaches.

3.3 Examination on the Noise of Network and Cellular Composition

We continue to examine scenarios with noisy network and cellular composition information. In particular, the noisy network is likely to involve edges that connect both SV and non-SV genes. Two specific generation models are considered, including the ZINB model considered in Section 3.1 with a star network, a linear spatial pattern, and two degrees of sparsity, and the misspecified model (M1) considered in Section 3.2, where in each informative sub-network, about 20% of the genes connected to the TF genes are uninformative. The summary results are reported in Figure S7. It can be seen that the proposed approach can still achieve an effective balance between the prior network information and observation likelihood, leading to certain robustness.

Moreover, to explore the impact of noisy cellular composition on identification performance, we consider the scale-free network with the linear pattern and high sparsity (dropout rate = 0.5), where the cell-type deconvolution estimates 𝒘^i\hat{\boldsymbol{w}}_{i} are sampled from Dirc(c𝒘i)\operatorname{Dirc}(c\cdot\boldsymbol{w}_{i}) with 𝒘i\boldsymbol{w}_{i} being the true value and concentration parameter c=100,80,c=100,80, and 50. The lower value of cc indicates less accuracy of the cellular composition estimates. The summary results are provided in Figure S8. As expected, the noisy cellular compositions tend to result in decreased identification performance with increased variance. However, even with c=50c=50, the identification performance is still satisfactory, and the superiority over the alternatives maintains.

4 Data Analysis

We apply the proposed approach to the triple-negative breast cancer (TNBC) dataset from a cohort study of TNBC tumors (Bassiouni et al., 2023). A total of 28 tissue sections representing 14 primary TNBC tumors are subject to spatial transcriptomics using the 10x Genomics Visium platform. The raw expression data and the tissue information are available at the Gene Expression Omnibus (GEO) under record GSE210616. Here, we focus on the first tissue section from patient ID 1, which contains 1,109 spots and 36,601 genes. Following the published studies (Charitakis et al., 2023; Yan and Luo, 2024), to improve efficiency and stability, we first remove the genes expressed in fewer than 5 spots and the spots with fewer than 500 expressed genes, resulting in a total of 1,108 spots and 17,781 genes, and then select the top 5,000 highly variable genes for downstream analysis. After preprocessing, the median value of expressed counts across spots is 7,036, and the proportion of zero expressions is about 68.89%.

We consider the protein-protein interaction (PPI) network obtained from the STRING database (Szklarczyk et al., 2023). Integrating the PPI network has been widely adopted in recent breast cancer analysis, which provides a powerful complement for a mechanistic understanding of the genomic alterations related to breast cancer (Kim et al., 2021). In particular, it has been demonstrated in published studies (Pranavathiyani et al., 2019; Chen et al., 2021) that the oncogenes or prognostic genes associated with breast cancer are typically highly interconnected within distinct modules, underscoring the importance of considering the connections within networks in SV gene identification. Matching with this PPI network leads to 1,687 connected genes involved in 164 disconnected sub-networks, while the remaining 3,313 genes are singleton nodes.

For cell-type deconvolution, we adopt the Redeconve algorithm developed in Zhou et al. (2023), which has been proven to have higher accuracy, robustness, and efficacy. More importantly, the superior performance of Redeconve on human breast cancer has been verified based on biological ground truths. To alleviate the impact of rare cell types and to improve interpretability, we follow Yu and Luo (2022) and remove the cell types whose 90 percentile of proportions across spots is less than 0.1, resulting in a total of seven major cell types. The pie charts of the distributions of different cell types across spots are shown in Figure 2(A). It can be seen that most spots consist of cancer and normal epithelial cells, while some spots located at the middle and lower right parts of the tissue section are mostly CAFs.

Refer to caption
Figure 2: (A) Pie charts of the distributions of cell types across spots. (B) Spatial expression patterns of three representative SV genes identified by all of SPARK-X, spatialDE, SPARK, HEARTSVG, and MERINGUE. (C) Spatial expression patterns of three representative SV genes only identified by the proposed approach.

The proposed approach identifies 1,084 SV genes. Among them, 782 are connected in the PPI network, of which the graph representation is shown in Figure 3(A). Analysis is also conducted using the alternatives. The upset plot, which provides the numbers of the SV genes identified by different approaches as well as their overlaps, is shown in Figure 4. Specifically, SPARK-X identifies the largest number of SV genes (2,365), while CTSV detects the smallest (266). A total of 54 overlapping genes are identified by all seven approaches. For SPARK-X, spatialDE, SPARK, HEARTSVG, and MERINGUE, there are more overlapping SV genes (245), mainly due to the potential cell-type diversity confounding that is not well accommodated by these approaches. To provide a more intuitive illustration, among these 245 genes, we take three representative genes MICAL2, ADAM12, and LTBP2 and present their spatial expression patterns in Figure 2(B). All three genes are observed to have significant cell type diversity. Moreover, 47 genes (all connected) are only selected by the proposed approach. Among them, the expression patterns of three representative genes SF3B4, PCGF6, and MAPK15 are presented in Figure 2(C), where clear spatial variability (not driven by cell type diversity) is observed.

Refer to caption
Figure 3: Network of 782 connected SV genes identified by the proposed approach. (B) Three representative detected communities in the network.
Refer to caption
Figure 4: Upset plot of the numbers of SV genes identified by different approaches and their overlaps.

To provide additional comparisons for the SV genes identified by different approaches, we obtain the lists of breast cancer driver genes from the Candidate Cancer Gene Database (CCGD), IntOGen, and DriverDB databases, which contain 579, 290, and 222 genes, respectively, and examine whether these genes are identified as SV. The inclusion rates per thousand are (20.30, 11.70, 23.06) for the proposed approach, (13.26, 7.50, 14.42) for spatialDE, (14.58, 7.61, 13.95) for SPARK, (19.45, 5.50, 14.80) for SPARK-X, (0.00, 0.00, 7.52) for CTSV, (12.37, 9.00, 13.50) for MERINGUE, and (12.44, 8.61, 6.70) for HEARTSVG, respectively. This may support the higher power of the proposed approach to some extent.

We then conduct community detection with the 782 connected SV genes and show three representative communities in Figure 3(B), where the overlapping genes identified by all seven approaches (Overlap), genes only identified by the proposed approach (proposed Only), and the remaining genes (Others) are represented by different colors. It is observed that most overlapping genes are hubs with larger values of degrees. Some of the proposed Only genes are connected to the hub genes, for example, SF3B4 and PINX1 in community B, which may exhibit weak spatial patterns but are involved in the same functional organization or regulation mechanism with the hub ones. Some other genes seem to serve as bridge linkage genes with high betweenness centrality. Examples include SGMS2 in community C, which expedites the formulation of the complete gene regulatory networks.

Then, we conduct Gene Ontology (GO) enrichment analysis. Specifically, for the three communities in Figure 3(B), we list the corresponding top five significant GO terms in Table 2, which have important biological implications for breast cancer. For example, in community A, wound healing has been shown to promote the growth of breast cancer, while integrin-mediated signaling has been confirmed to play a significant role in TNBC bone metastasis. The genes involved in community B are mainly enriched in RNA splicing, and the relationships between RNA splicing and a variety of cancers have long been recognized and widely utilized in cancer diagnosis and therapy. For community C, the sphingolipid metabolism has been shown to be essential for breast cancer progression, while ceramide metabolism balance has been confirmed as a critical step of breast cancer development and has been long adopted as a targeted pathway to induce apoptosis in breast cancer cell lines. References that support the above results are presented in Supplementary Section S4.

We further take a closer look at the proposed Only genes. Specifically, we conduct GO enrichment analysis for each community again, but with the proposed Only genes eliminated. The comparison results are shown in Figure 5(A), where the top ten significant GO terms with the proposed Only genes included are considered. It can be seen that when the proposed Only genes are included, some more significant GO terms are found. Moreover, the incorporation of the proposed Only genes contributes to some newly detected GO terms (shown in Figure 5(B)), such as skeletal muscle contraction, which has been confirmed to be critical in tumor progression and therapy response (Crist et al., 2022). These evidences suggest the importance of the proposed Only genes in the PPI network and their biological implications.

Table 2: Top five significant GO terms associated with the three representative communities.
ID adjusted P value Description
Community A
GO:0042060 2.20×109\times 10^{-9} wound healing
GO:0031589 2.20×109\times 10^{-9} cell-substrate adhesion
GO:0007229 3.04×108\times 10^{-8} integrin-mediated signaling pathway
GO:0007160 1.07×106\times 10^{-6} cell-matrix adhesion
GO:1902905 1.07×106\times 10^{-6} positive regulation of supramolecular fiber organization
Community B
GO:0000375 1.19×1037\times 10^{-37} RNA splicing, via transesterification reactions
GO:0008380 4.22×1036\times 10^{-36} RNA splicing
GO:0000377 4.22×1036\times 10^{-36}
RNA splicing, via transesterification reactions with
bulged adenosine as nucleophile
GO:0000398 4.22×1036\times 10^{-36} mRNA splicing, via spliceosome
GO:0000387 8.16×1015\times 10^{-15} spliceosomal snRNP assembly
Community C
GO:0030148 1.18×1014\times 10^{-14} sphingolipid biosynthetic process
GO:0046467 7.13×1014\times 10^{-14} membrane lipid biosynthetic process
GO:0006665 9.42×1014\times 10^{-14} sphingolipid metabolic process
GO:0006643 5.62×1013\times 10^{-13} membrane lipid metabolic process
GO:0006672 9.39×1013\times 10^{-13} ceramide metabolic process
Refer to caption
Figure 5: Lollipop plots of GO enrichment analysis for the three communities. Count represents the number of the input genes that are annotated to the specific GO term, and GeneRatio denotes the proportion of the Count divided by the total number of input genes. (A) Comparison results of the analysis without and with the proposed Only genes. (B) Newly detected significant GO terms with the proposed Only genes included.

5 Discussion

In this article, we have presented a novel Bayesian regularization approach for the joint identification of SV genes, with the network structure among genes accommodated. The proposed approach has two main advantages. First, attributing to the Bayesian framework, it can automatically incorporate the network structure through the Gaussian Graph Laplacian priors. Compared to most of the existing methods, the utilization of such network information provides more opportunities to search for more biologically sensible SV genes. Second, to tackle the confounding effects of cell type mixtures within spots, cell-type-specific parameters have been introduced to model the variations induced by diverse latent cellular compositions, which are not well accommodated by most existing methods. The extensive simulation studies have revealed that the proposed approach can achieve better performance. The biological implications of the findings from the application to the TNBC dataset have further supported the utility of the proposed approach.

We have considered a series of presupposed spatial patterns and introduced the parametric model. It can be of interest to extend the proposed framework to accommodate unknown spatial patterns using nonparametric functional techniques. The proposed approach has focused on the identification of SV genes. It has the potential to be extended to additionally accommodate region segmentation of the tissue section. In data analysis, we have utilized the PPI information for network construction. Other resources, such as Kyoto Encyclopedia of Genes and Genomes (KEGG) and High-quality INTeractomes (HINT) databases, can also be adopted. Furthermore, a literature search and GO enrichment analysis have demonstrated the significant implications of the findings. A more definitive confirmation from functional validations may be needed.

References

  • Arnold et al. (2015) Arnold, K. M., L. M. Opdenaker, D. Flynn, and J. Sims-Mourtada (2015). Wound healing and cancer stem cells: inflammation as a driver of treatment resistance in breast cancer. Cancer Growth and Metastasis 8, CGM–S11286.
  • Barrio-Hernandez et al. (2023) Barrio-Hernandez, I., J. Schwartzentruber, A. Shrivastava, N. Del-Toro, A. Gonzalez, Q. Zhang, E. Mountjoy, D. Suveges, D. Ochoa, M. Ghoussaini, et al. (2023). Network expansion of genetic associations defines a pleiotropy map of human cell biology. Nature Genetics 55(3), 389–398.
  • Bassiouni et al. (2023) Bassiouni, R., M. O. Idowu, L. D. Gibbs, V. Robila, P. J. Grizzard, M. G. Webb, J. Song, A. Noriega, D. W. Craig, and J. D. Carpten (2023). Spatial transcriptomic analysis of a diverse patient cohort reveals a conserved architecture in triple-negative breast cancer. Cancer Research 83(1), 34–48.
  • BinTayyash et al. (2021) BinTayyash, N., S. Georgaka, S. John, S. Ahmed, A. Boukouvalas, J. Hensman, and M. Rattray (2021). Non-parametric modelling of temporal and spatial counts data from rna-seq experiments. Bioinformatics 37(21), 3788–3795.
  • Cable et al. (2022) Cable, D. M., E. Murray, L. S. Zou, A. Goeva, E. Z. Macosko, F. Chen, and R. A. Irizarry (2022). Robust decomposition of cell type mixtures in spatial transcriptomics. Nature Biotechnology 40(4), 517–526.
  • Cai et al. (2020) Cai, Q., J. Kang, and T. Yu (2020). Bayesian network marker selection via the thresholded graph Laplacian Gaussian prior. Bayesian Analysis 15(1), 79.
  • Chakraborty and Lozano (2019) Chakraborty, S. and A. C. Lozano (2019). A graph Laplacian prior for Bayesian variable selection and grouping. Computational Statistics & Data Analysis 136, 72–91.
  • Charitakis et al. (2023) Charitakis, N., A. Salim, A. T. Piers, K. I. Watt, E. R. Porrello, D. A. Elliott, and M. Ramialison (2023). Disparities in spatially variable gene calling highlight the need for benchmarking spatial transcriptomics methods. Genome Biology 24(1), 209.
  • Chen et al. (2022) Chen, M., C. Wu, Z. Fu, and S. Liu (2022). ICAM1 promotes bone metastasis via integrin-mediated tgf-β\beta/emt signaling in triple-negative breast cancer. Cancer Science 113(11), 3751–3765.
  • Chen et al. (2021) Chen, Y., F. J. Verbeek, and K. Wolstencroft (2021). Establishing a consensus for the hallmarks of cancer based on gene ontology and pathway annotations. BMC Bioinformatics 22, 1–20.
  • Corsetto et al. (2023) Corsetto, P. A., S. Zava, A. M. Rizzo, and I. Colombo (2023). The critical impact of sphingolipid metabolism in breast cancer progression and drug response. International Journal of Molecular Sciences 24(3), 2107.
  • Crist et al. (2022) Crist, S. B., T. Nemkov, R. F. Dumpit, J. Dai, S. J. Tapscott, L. D. True, A. Swarbrick, L. B. Sullivan, P. S. Nelson, K. C. Hansen, et al. (2022). Unchecked oxidative stress in skeletal muscle prevents outgrowth of disseminated tumour cells. Nature Cell Biology 24(4), 538–553.
  • Elyanow et al. (2020) Elyanow, R., B. Dumitrascu, B. E. Engelhardt, and B. J. Raphael (2020). netNMF-sc: leveraging gene–gene interactions for imputation and dimensionality reduction in single-cell expression analysis. Genome Research 30(2), 195–204.
  • García-González et al. (2018) García-González, V., J. F. Díaz-Villanueva, O. Galindo-Hernández, I. Martínez-Navarro, G. Hurtado-Ureta, and A. A. Pérez-Arias (2018). Ceramide metabolism balance, a multifaceted factor in critical steps of breast cancer development. International Journal of Molecular Sciences 19(9), 2527.
  • Kim et al. (2021) Kim, M., J. Park, M. Bouhaddou, K. Kim, A. Rojc, M. Modak, M. Soucheray, M. J. McGregor, P. O’Leary, D. Wolf, et al. (2021). A protein interaction landscape of breast cancer. Science 374(6563), eabf3066.
  • Li et al. (2021) Li, Q., M. Zhang, Y. Xie, and G. Xiao (2021). Bayesian modeling of spatial molecular profiling data via Gaussian process. Bioinformatics 37(22), 4129–4136.
  • Liu et al. (2023) Liu, Z., D. Wu, W. Zhai, and L. Ma (2023). Sonar enables cell type deconvolution with spatially weighted poisson-gamma model for spatial transcriptomics. Nature Communications 14(1), 4727.
  • Miller et al. (2021) Miller, B. F., D. Bambah-Mukku, C. Dulac, X. Zhuang, and J. Fan (2021). Characterizing spatial gene expression heterogeneity in spatially resolved single-cell transcriptomic data with nonuniform cellular densities. Genome Research 31(10), 1843–1855.
  • Pranavathiyani et al. (2019) Pranavathiyani, G., R. R. Thanmalagan, N. L. Devi, and A. Venkatesan (2019). Integrated transcriptome interactome study of oncogenes and tumor suppressor genes in breast cancer. Genes & Diseases 6(1), 78–87.
  • Qin et al. (2023) Qin, X., S. Ma, and M. Wu (2023). Two-level Bayesian interaction analysis for survival data incorporating pathway information. Biometrics 79(3), 1761–1774.
  • Quintana et al. (2011) Quintana, M. A., J. L. Berstein, D. C. Thomas, and D. V. Conti (2011). Incorporating model uncertainty in detecting rare variants: the bayesian risk index. Genetic Epidemiology 35(7), 638–649.
  • Rao et al. (2021) Rao, A., D. Barkley, G. S. França, and I. Yanai (2021). Exploring tissue architecture using spatial transcriptomics. Nature 596(7871), 211–220.
  • Song et al. (2021) Song, Y., X. Zhou, J. Kang, M. T. Aung, M. Zhang, W. Zhao, B. L. Needham, S. L. Kardia, Y. Liu, J. D. Meeker, et al. (2021). Bayesian sparse mediation analysis with targeted penalization of natural indirect effects. Journal of the Royal Statistical Society Series C: Applied Statistics 70(5), 1391–1412.
  • Ståhl et al. (2016) Ståhl, P. L., F. Salmén, S. Vickovic, A. Lundmark, J. F. Navarro, J. Magnusson, S. Giacomello, M. Asp, J. O. Westholm, M. Huss, et al. (2016). Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353(6294), 78–82.
  • Stanley and Abdel-Wahab (2022) Stanley, R. F. and O. Abdel-Wahab (2022). Dysregulation and therapeutic targeting of rna splicing in cancer. Nature Cancer 3(5), 536–546.
  • Sun et al. (2020) Sun, S., J. Zhu, and X. Zhou (2020). Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nature Methods 17(2), 193–200.
  • Svensson et al. (2018) Svensson, V., S. A. Teichmann, and O. Stegle (2018). SpatialDE: identification of spatially variable genes. Nature Methods 15(5), 343–346.
  • Szklarczyk et al. (2023) Szklarczyk, D., R. Kirsch, M. Koutrouli, K. Nastou, F. Mehryary, R. Hachilif, A. L. Gable, T. Fang, N. T. Doncheva, S. Pyysalo, et al. (2023). The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Research 51(D1), D638–D646.
  • Vethakanraj et al. (2015) Vethakanraj, H. S., T. A. Babu, G. B. Sudarsanan, P. K. Duraisamy, and S. A. Kumar (2015). Targeting ceramide metabolic pathway induces apoptosis in human breast cancer cell lines. Biochemical and Biophysical Research Communications 464(3), 833–839.
  • Wu et al. (2024) Wu, B., Y. Guo, and J. Kang (2024). Bayesian spatial blind source separation via the thresholded Gaussian process. Journal of the American Statistical Association 119(545), 422–433.
  • Yamauchi et al. (2022) Yamauchi, H., K. Nishimura, and A. Yoshimi (2022). Aberrant RNA splicing and therapeutic opportunities in cancers. Cancer Science 113(2), 373–381.
  • Yan and Luo (2024) Yan, Y. and X. Luo (2024). Bayesian integrative region segmentation in spatially resolved transcriptomic studies. Journal of the American Statistical Association (just-accepted), 1–21.
  • Yu and Luo (2022) Yu, J. and X. Luo (2022). Identification of cell-type-specific spatially variable genes accounting for excess zeros. Bioinformatics 38(17), 4135–4144.
  • Yuan et al. (2023) Yuan, X., Y. Ma, R. Gao, S. Cui, B. Fa, Y. Wang, S. Ma, T. Wei, S. S. Ma, and Z. Yu (2023). HEARTSVG: a fast and accurate method for spatially variable gene identification in large-scale spatial transcriptomic data. bioRxiv, 2023–08.
  • Zhou et al. (2023) Zhou, Z., Y. Zhong, Z. Zhang, and X. Ren (2023). Spatial transcriptomics deconvolution at single-cell resolution using redeconve. Nature Communications 14(1), 7930.
  • Zhu et al. (2021) Zhu, J., S. Sun, and X. Zhou (2021). SPARK-X: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome Biology 22(1), 184.
  • Zuo et al. (2024) Zuo, C., J. Xia, and L. Chen (2024). Dissecting tumor microenvironment from spatially resolved transcriptomics data by heterogeneous graph learning. Nature Communications 15(1), 5057.

Supplementary Materials for Joint identification of spatially variable genes via a network-assisted Bayesian regularization approach

S1 Derivation of Step (e) in Posterior Sampling

The preconditioned Crank-Nicolson Langevin dynamics (pCNLD) sampler is adopted for sampling of 𝜸\boldsymbol{\gamma}. The proposed sample of 𝜸\boldsymbol{\gamma}^{*} is generated as follows.

𝜸=\displaystyle\boldsymbol{\gamma}^{*}= 1τγ2𝜸+(11τγ2)σγ2(𝐋+ε𝐈2p)1𝜸logf(𝐘𝜸,)\displaystyle\sqrt{1-\tau_{\gamma}^{2}}\boldsymbol{\gamma}+\left(1-\sqrt{1-\tau_{\gamma}^{2}}\right)\sigma_{\gamma}^{2}\left(\mathbf{L}+\varepsilon\mathbf{I}_{2p}\right)^{-1}\nabla_{\boldsymbol{\gamma}}\log f(\mathbf{Y}\mid\boldsymbol{\gamma},\boldsymbol{-})
+τγN{𝟎,σγ2(𝐋+ε𝐈2p)1}\displaystyle+\tau_{\gamma}N\left\{\boldsymbol{0},\sigma_{\gamma}^{2}\left(\mathbf{L}+\varepsilon\mathbf{I}_{2p}\right)^{-1}\right\}

with

γj(d)logf(𝐘𝜸,)={i:rij=0}ϕ(Yijciμij)μij(ciμij+ϕ)μij𝒦(xid)γj(d){γj(d)I(|γj(d)|>λρj(d))}.\nabla_{\gamma^{(d)}_{j}}\log f(\mathbf{Y}\mid\boldsymbol{\gamma},\boldsymbol{-})=\sum_{\left\{i:r_{ij}=0\right\}}\frac{\phi\left(Y_{ij}-c_{i}\mu_{ij}\right)}{\mu_{ij}\left(c_{i}\mu_{ij}+\phi\right)}\cdot\mu_{ij}\cdot\mathcal{K}\left(x_{id}\right)\cdot\nabla_{\gamma^{(d)}_{j}}\left\{\gamma^{(d)}_{j}\operatorname{I}\left(\left|\gamma^{(d)}_{j}\right|>\lambda\cdot\rho^{(d)}_{j}\right)\right\}.

For the derivative of the hard-thresholding function, a smooth approximation is introduced that I(|γj(d)|>λρj(d))12{1+2πarctan((γj(d))2(λρj(d))2ϵ)}\operatorname{I}\left(\left|\gamma_{j}^{(d)}\right|>\lambda\cdot\rho^{(d)}_{j}\right)\simeq\frac{1}{2}\left\{1+\frac{2}{\pi}\arctan\left(\frac{\left(\gamma_{j}^{(d)}\right)^{2}-\left(\lambda\cdot\rho^{(d)}_{j}\right)^{2}}{\epsilon}\right)\right\} for ϵ0\epsilon\rightarrow 0, leading to

γj(d){γj(d)I(|γj(d)|>λρj(d))}I(|γj(d)|>λρj(d))+γj(d)2γj(d)/ϵπ{1+((γj(d))2(λρj(d))2)2/ϵ2},\nabla_{\gamma^{(d)}_{j}}\left\{\gamma^{(d)}_{j}\operatorname{I}\left(\left|\gamma^{(d)}_{j}\right|>\lambda\cdot\rho^{(d)}_{j}\right)\right\}\simeq\operatorname{I}\left(\left|\gamma^{(d)}_{j}\right|>\lambda\cdot\rho^{(d)}_{j}\right)+\gamma^{(d)}_{j}\cdot\frac{2\gamma^{(d)}_{j}/\epsilon}{\pi\left\{1+\left(\left(\gamma^{(d)}_{j}\right)^{2}-\left(\lambda\cdot\rho^{(d)}_{j}\right)^{2}\right)^{2}/\epsilon^{2}\right\}},

where ϵ\epsilon is set as 10410^{-4} throughout the article.

S2 Detailed Settings for the hyperparameter set

Regarding the hyperparameter set, we assume aπ=bπ=1a_{\pi}=b_{\pi}=1 for a weakly informative prior on πj\pi_{j}. We set aϕ=10a_{\phi}=10 and bϕ=0.1b_{\phi}=0.1 for Ga(aϕ,bϕ)\text{Ga}(a_{\phi},b_{\phi}), and aγ=3.5a_{\gamma}=3.5 and bγ=0.5b_{\gamma}=0.5 for IG(aγ,bγ)\operatorname{IG}\left(a_{\gamma},b_{\gamma}\right). σ0j2\sigma_{0j}^{2}’s and σαk2\sigma_{\alpha_{k}}^{2}’s are set as 323^{2} for all jj and kk to yield vague priors for the baseline expression levels and cell-type-specific effects. We set λl=0\lambda_{l}=0 and λu\lambda_{u} as the 90% quantile of |𝜷~|\left|\tilde{\boldsymbol{\beta}}\right| to help against false positive detection as suggested in Song et al. (2021). The sampling variances τμ02,τα2,τϕ2,τγ2,\tau_{\mu_{0}}^{2},\tau_{\alpha}^{2},\tau_{\phi}^{2},\tau_{\gamma}^{2}, and τλ2\tau_{\lambda}^{2} are all adaptively chosen by tuning acceptance rates to 30% in simulation studies and 15% in real data analysis. We perform 2,000 iterations with the first 1000 discarded as burn-in and thinning by 10 in the simulations, and 3,000 iterations with the first 2000 burn-in and thinning by 10 in the data analysis.

S3 Additional simulation settings and results

S3.1 Detailed Settings for Basic Simulations

Refer to caption
Figure S3.1: Partitioned spot regions considered in the basic simulations introduced in Section 3 of the main text. Regions are denoted by different colors, where the cellular compositions 𝒘i\boldsymbol{w}_{i}’s are independently sampled from Dirichlet distributions Dirc(1,1,1,1,1,1)\operatorname{Dirc}(1,1,1,1,1,1) (Region 1), Dirc(3,5,7,9,11,13)\operatorname{Dirc}(3,5,7,9,11,13) (Region 2), and Dirc(18,16,14,12,10,8)\operatorname{Dirc}(18,16,14,12,10,8) (Region 3).
Refer to caption
Figure S3.2: Illustrative examples of the (A) Star network and (B) Scale-free network.

Detailed settings for the spatial effects:

  1. (a)

    Star network

    For each informative sub-network, the transcription factor (TF) gene and its connected target genes have the same sign of spatial effect (either all positive or all negative), while the magnitudes satisfy that |βTF(d)|ntarget>|βtarget(d)|\frac{\left|\beta_{\operatorname{TF}}^{(d)}\right|}{\sqrt{n_{\operatorname{target}}}}>\left|\beta_{\operatorname{target}}^{(d)}\right| with ntargetn_{\operatorname{target}} being the number of the target genes regulated by TF. Specifically, the values of βTF(1)\beta^{(1)}_{\operatorname{TF}} and βTF(2)\beta^{(2)}_{\operatorname{TF}} are fixed as 11 and 1-1 for all three patterns, while βtarget(1)=0.15\beta^{(1)}_{\operatorname{target}}=0.15 and βtarget(2)=0.25\beta^{(2)}_{\operatorname{target}}=-0.25 for linear pattern, βtarget(1)=0.2\beta^{(1)}_{\operatorname{target}}=0.2 and βtarget(2)=0.3\beta^{(2)}_{\operatorname{target}}=-0.3 for periodic pattern, and βtarget(1)=0.4\beta^{(1)}_{\operatorname{target}}=0.4 and βtarget(2)=0.5\beta^{(2)}_{\operatorname{target}}=-0.5 for exponential pattern.

    For the non-informative sub-networks, βTF(1)=βTF(2)=βtarget(1)=βtarget(2)=0\beta^{(1)}_{\operatorname{TF}}=\beta^{(2)}_{\operatorname{TF}}=\beta^{(1)}_{\operatorname{target}}=\beta^{(2)}_{\operatorname{target}}=0.

  2. (b)

    Scale-free network

    For each informative sub-network, the spatial effects of the hub gene and its connected regular genes are randomly assigned as positive or negative. In addition, the magnitudes of βhub(d)(d=1,2)\beta^{(d)}_{\operatorname{hub}}(d=1,2) are simulated from Unif(1,1.2)\operatorname{Unif}(1,1.2) for all three patterns. For the magnitudes of the regular SV genes, consider two types with large and small signal strengths, respectively. Specifically, for both d=1,2d=1,2, the magnitudes are simulated as follows: |βregular(d)|Unif(a0.1,a0.2)|\beta^{(d)}_{\operatorname{regular}}|\sim\operatorname{Unif}(a\cdot 0.1,a\cdot 0.2) for linear pattern, |βregular(d)|Unif(a0.2,a0.3)|\beta^{(d)}_{\operatorname{regular}}|\sim\operatorname{Unif}(a\cdot 0.2,a\cdot 0.3) for periodic pattern, and |βregular(d)|Unif(a0.4,a0.5)|\beta^{(d)}_{\operatorname{regular}}|\sim\operatorname{Unif}(a\cdot 0.4,a\cdot 0.5) for exponential pattern, where a=1a=1 and 2 for small and large strengths, respectively.

    For the non-informative sub-networks, βhub(1)=βhub(2)=βregular(1)=βregular(2)=0\beta^{(1)}_{\operatorname{hub}}=\beta^{(2)}_{\operatorname{hub}}=\beta^{(1)}_{\operatorname{regular}}=\beta^{(2)}_{\operatorname{regular}}=0.

S3.2 Implementation Details of the Competing Methods

  1. (a)

    SpatialDE: The source code for SpatialDE is publicly available on https://github.com/Teichlab/SpatialDE. Following the instructions, we filter the practically unobserved genes with total observations less than 3. The linear regression is conducted on the raw count matrix to account for the potential bias caused by library size or sequencing depth. The main function for SV gene detection is SpatialDE.run() with the arguments being the coordinate information (xi1,xi2)\left(x_{i1},x_{i2}\right) and the sample residual expressions corrected by the linear regression. To account for multiple testing, the genes with qval less than 0.05 are identified as SV.

  2. (b)

    SPARK: The source code for SPARK is publicly available on https://github.com/xzhoulab/SPARK. Following the instructions, we filter genes that are expressed in less than 10% spots and spots with the total observations less than 10. The statistical model under the null hypothesis is first fit by employing the function spark.vc(), and the function spark.test() is subsequently employed for SV gene detection. Then, the genes with adjusted_pvalue less than 0.05 are identified as SV.

  3. (c)

    SPARK-X: The source code for SPARK-X is publicly available on https://github.com/xzhoulab/SPARK. The SV gene detection is implemented with the main function sparkx() where the argument option is set as “mixture” for multiple kernels testing. To account for the impact of latent cellular compositions, as conducted in the original paper, each spot is assigned its major cell type, with the argument X_in being the matrix of binary indicators. Then, the genes with adjustedPval less than 0.05 are identified as SV.

  4. (d)

    MERINGUE: The source code for MERINGUE is publicly available on https://github.com/JEFworks-Lab/MERINGUE. The non-expressed genes and spots are filtered following the tutorial. Normalization is conducted for the raw count matrix by employing the function normalizeCounts(). For SV gene detection, the neighborhood relationships are first constructed using the function getSpatialNeighbors() with the argument filterDist set as the default value 2.5. Then, the SV genes are identified with the function filterSpatialPatterns() with the arguments minPercentCells set as 0.05 to restrict that the SV genes are driven by more than 5% of spots. The adjusted significance threshold is set as 0.05 through setting the arguments adjustPv as TRUE and alpha as 0.05.

  5. (e)

    CTSV and CTSV-g: The source code for CTSV is publicly available on https://github.com/jingeyu/CTSV. The SV gene detection is conducted through the main function ctsv(). Specifically, the cell-type-specific SV genes are identified with the argument WW being the n×Kn\times K matrix composed of wiw_{i}’s (i=1,,ni=1,\ldots,n). The final SV gene set is the union of all cell-type-specific SV genes. We also conduct the global SV gene detection without accommodation for the cellular composition in the simulation studies (CTSV-g). Specifically, the argument WW is set as (1,,1)(n)T(1,\ldots,1)^{\mathrm{T}}_{(n)}. For both CTSV and CTSV-g, the SV genes are identified through function svGene() with the significance threshold thre.alpha set as 0.05.

  6. (f)

    HEARTSVG: The source code for HEARTSVG is publicly available on https://github.com/cz0316/HEARTSVG. The detection of SV genes is conducted through the main function heartsvg() with the raw count matrix first scaled as recommended. The Holm method is further conducted for multiple testing control. The genes with adjusted P values p_adj less than 0.05 are identified as SV.

S3.3 Additional Simulation Results under the Scenarios with a High Dropout Rate

Table S3: Simulation results under the scenarios with a high dropout rate, where FDR (BFDR) is controlled to be <<0.05. In each cell, mean (SD) is based on 50 replicates.
Recall Precision F1 Recall Precision F1
Linear pattern Star Network Scale-free Network
proposed 0.975(0.081) 1.000(0.000) 0.985(0.053) 0.965(0.106) 1.000(0.000) 0.978(0.078)
spatialDE 0.079(0.036) 0.999(0.004) 0.168(0.028) 0.093(0.039) 1.000(0.000) 0.174(0.060)
SPARK 0.169(0.065) 0.985(0.013) 0.284(0.094) 0.312(0.135) 0.983(0.016) 0.457(0.163)
SPARK-X 0.735(0.361) 0.220(0.276) 0.179(0.025) 0.465(0.158) 0.847(0.226) 0.567(0.171)
MERINGUE 0.133(0.044) 0.967(0.026) 0.230(0.069) 0.238(0.111) 0.974(0.022) 0.369(0.145)
CTSV 0.085(0.051) 0.551(0.293) 0.128(0.059) 0.103(0.060) 0.571(0.267) 0.147(0.073)
CTSV-g 0.921(0.131) 0.883(0.039) 0.894(0.074) 0.932(0.100) 0.939(0.015) 0.932(0.055)
HEARTSVG 0.205(0.079) 0.674(0.261) 0.284(0.075) 0.534(0.178) 0.616(0.258) 0.494(0.114)
Exponential pattern Star Network Scale-free Network
proposed 0.873(0.215) 1.000(0.000) 0.913(0.170) 0.937(0.152) 1.000(0.001) 0.958(0.128)
spatialDE 0.000(0.000) -(-) -(-) 0.000(0.000) -(-) -(-)
SPARK 0.001(0.002) 0.897(0.285) 0.007(0.003) 0.009(0.009) 0.965(0.098) 0.024(0.015)
SPARK-X 0.698(0.394) 0.131(0.157) 0.165(0.041) 0.211(0.181) 0.820(0.263) 0.319(0.224)
MERINGUE 0.004(0.005) 0.900(0.251) 0.013(0.011) 0.038(0.035) 0.899(0.173) 0.082(0.061)
CTSV 0.018(0.029) 0.103(0.129) 0.044(0.032) 0.023(0.032) 0.250(0.035) 0.043(0.036)
CTSV-g 0.733(0.278) 0.929(0.018) 0.783(0.215) 0.878(0.154) 0.937(0.013) 0.898(0.095)
HEARTSVG 0.029(0.020) 0.890(0.186) 0.057(0.036) 0.193(0.112) 0.933(0.094) 0.304(0.154)
Periodic pattern Star Network Scale-free Network
proposed 0.983(0.099) 1.000(0.000) 0.988(0.076) 0.977(0.134) 1.000(0.000) 0.980(0.127)
spatialDE 0.000(0.000) -(-) -(-) 0.001(0.002) 1.000(0.000) 0.006(0.004)
SPARK 0.029(0.026) 0.954(0.177) 0.070(0.043) 0.050(0.037) 0.975(0.038) 0.112(0.056)
SPARK-X 0.723(0.370) 0.213(0.267) 0.173(0.033) 0.437(0.222) 0.845(0.236) 0.531(0.246)
MERINGUE 0.092(0.033) 0.965(0.035) 0.166(0.057) 0.272(0.149) 0.972(0.024) 0.401(0.189)
CTSV 0.061(0.062) 0.442(0.315) 0.094(0.074) 0.098(0.092) 0.565(0.302) 0.152(0.114)
CTSV-g 0.890(0.166) 0.912(0.024) 0.889(0.106) 0.959(0.080) 0.945(0.010) 0.950(0.044)
HEARTSVG 0.075(0.017) 0.988(0.021) 0.139(0.029) 0.159(0.055) 0.994(0.010) 0.270(0.083)
Refer to caption
Figure S3.3: Comparison boxplots of Recall for large (L:Recall) and small (S:Recall) signals, overall Recall, Precision, and F1 based on 50 replicates under scenarios with scale-free network and (A) linear pattern, (B) exponential pattern, and (C) periodic pattern, where FDR (BFDR) is controlled to be <<0.05.

S3.4 Simulations under Model Misspecification

We considered two types of model misspecification scenarios as follows.

  1. (M1)

    ZINB model considered in Yu and Luo (2022). Specifically, the raw count data YijY_{ij} is generated from NB(Yijμij,ϕ)(1rij)δ0(Yij)rij,\mathrm{NB}\left(Y_{ij}\mid\mu_{ij},\phi\right)^{(1-r_{ij})}\delta_{0}(Y_{ij})^{r_{ij}}, where logμij=k=1Kμij(k)wik\log\mu_{ij}=\sum_{k=1}^{K}\mu^{(k)}_{ij}w_{ik} with μij(k)=ηjk+βjk(1)𝒦(xi1)+βjk(2)𝒦(xi2)\mu^{(k)}_{ij}=\eta_{jk}+\beta^{(1)}_{jk}\mathcal{K}\left(x_{i1}\right)+\beta^{(2)}_{jk}\mathcal{K}\left(x_{i2}\right). Following Yu and Luo (2022), consider n=600n=600 spots located on a 30×2030\times 20 grid and partitioned into four regions as shown in Figure S3.4(A). Consider K=6K=6 underlying cell types where the cellular composition of spots in regions 1, 2, 3, and 4 are independently sampled from Dir(1,1,1,1,1,1)\operatorname{Dir}(1,1,1,1,1,1), Dir(1,3\operatorname{Dir}(1,3, 5,7,9,11),Dir(16,14,12,10,8,6)5,7,9,11),\operatorname{Dir}(16,14,12,10,8,6), and Dir(1,4,4,4,4,1)\operatorname{Dir}(1,4,4,4,4,1), respectively. Consider p=5,000p=5,000 genes, involving in a block-wise network composed of 100 disconnected sub-networks with 50 nodes each. Here, each sub-network includes one TF gene and 49 connected target genes, as shown in Figure S3.4(B). All genes in the first seven sub-networks are SV. Specifically, the (50(k1)+1)\left(50\cdot(k-1)+1\right)th to (50(k+1))\left(50\cdot(k+1)\right)th genes are set as cell-type-kk SV for k=1,,Kk=1,\ldots,K, resulting in a total of 350 SV genes. The spatial effect function is fixed as a linear pattern with βjk(1)=1.8\beta^{(1)}_{jk}=1.8 and βjk(2)=0.8\beta^{(2)}_{jk}=0.8 for SV genes and βjk(1)\beta^{(1)}_{jk} = βjk(2)\beta^{(2)}_{jk} = 0 for non-SV genes. For ηjk\eta_{jk}, first simulate ηj1\eta_{j1} from N(2,0.22)\mathrm{N}\left(2,0.2^{2}\right) for j=1,,pj=1,\ldots,p, and then simulate 150 differentially expressed genes for each cell type k(2kK)k(2\leq k\leq K) from N(θk,ξk2)\mathrm{N}\left(\theta_{k},\xi_{k}^{2}\right) independently, where (θ2,ξ2)=(3,0.2),(θ3,ξ3)=\left(\theta_{2},\xi_{2}\right)=(3,0.2),\left(\theta_{3},\xi_{3}\right)= (2,0.2),(θ4,ξ4)=(4,0.2),(θ5,ξ5)=(3,0.2),(2,0.2),\left(\theta_{4},\xi_{4}\right)=(4,0.2),\left(\theta_{5},\xi_{5}\right)=(3,0.2), and (θ6,ξ6)=(4,0.2)\left(\theta_{6},\xi_{6}\right)=(4,0.2). In addition, ϕ=100\phi=100 and rijr_{ij}’s are simulated from a Bernoulli distribution Bern(0.6)\operatorname{Bern}\left(0.6\right).

  2. (M2)

    Poisson model considered in Sun et al. (2020). The raw count data is generated as YijPoi(Niλij)Y_{ij}\sim\operatorname{Poi}(N_{i}\cdot\lambda_{ij}), where logλij=Interceptij+ϵij+k=1Kαkwik\log\lambda_{ij}=\operatorname{Intercept}_{ij}+\epsilon_{ij}+\sum_{k=1}^{K}\alpha_{k}w_{ik} with Interceptij\operatorname{Intercept}_{ij} and ϵij\epsilon_{ij} being the intercept and residual error term, respectively. Specifically, following Sun et al. (2020), consider 260 spots collected in the mouse olfactory bulb study (Ståhl et al., 2016) with the corresponding parameters inferred from the data analysis conducted by spatialDE and SPARK, and p=5,000p=5,000 genes. The network is set as the same as (M1) (as shown in Figure S3.4(B)), except that all genes in the first ten sub-networks are SV, resulting in a total number of 500 SV genes. For non-SV genes, Interceptij\operatorname{Intercept}_{ij} is set to be 10.2-10.2 across all spots, which corresponds to the median of the intercept estimates in the mouse olfactory data analysis. For SV genes, the spots are categorized into two groups, including the group with low expression levels (green) and the group with high expression levels (pink), according to the spatial patterns of the identified SV genes in the mouse olfactory bulb data, as illustrated in Figure S3.5. In particular, Interceptij\operatorname{Intercept}_{ij}’s for the low group are set as 10.2-10.2 while Interceptij\operatorname{Intercept}_{ij}’s for the high group are set as 8.8-8.8 and 17.8-17.8 for the target and TF genes, respectively, according to the estimates for mean values. ϵij\epsilon_{ij}’s are independently simulated from N(0,0.22)N(0,0.2^{2}), which is approximately the first quantile of the non-spatial variance estimates. For cellular composition variations, consider K=6K=6 underlying cell types, where wikw_{ik}’s of spots from high and low groups are independently sampled from Dir(3,5,7,9,11,13)\operatorname{Dir}(3,5,7,9,11,13) and Dir(18,16,14,12,10,8)\operatorname{Dir}(18,16,14,12,10,8), respectively. αk\alpha_{k} is set as 0 for k=1,,Kk=1,\ldots,K, so that the cellular composition variations do not contribute to the expression levels. NiN_{i}’s are obtained based on the real counts.

Refer to caption
Figure S3.4: (A) Partitioned spot regions considered in the model misspecification scenario (M1). Regions are denoted by different colors, where the cellular compositions 𝒘i\boldsymbol{w}_{i}’s are independently sampled from Dirichlet distributions Dirc(1,1,1,1,1,1)\operatorname{Dirc}(1,1,1,1,1,1) (Region 1), Dirc(1,3,5,7,9,11)\operatorname{Dirc}(1,3,5,7,9,11) (Region 2), Dirc(16,14,12,10,8,6)\operatorname{Dirc}(16,14,12,10,8,6) (Region 3), and Dirc(1,4,4,4,4,1)\operatorname{Dirc}(1,4,4,4,4,1) (Region 4). (B) Illustrative example of the simple regulatory network considered in model misspecification scenarios.
Refer to caption
Figure S3.5: Summarized spatial expression pattern inferred from the data analysis of mouse olfactory bulb study.
Refer to caption
Figure S3.6: Comparison boxplots of Recall, Precision, and F1 based on 50 replicates under (A) ZINB model considered in Yu and Luo (2022) and (B) Poisson model considered in Sun et al. (2020), where FDR (BFDR) is controlled to be <<0.05.
Refer to caption
Figure S3.7: Comparison boxplots of Recall, Precision, and F1 based on 50 replicates, under (A) ZINB model considered in Yu and Luo (2022); (B) ZINB model considered in Section 3.1 of the main text with star network, linear spatial pattern, and low dropout rate = 0.1; (C) ZINB model considered in Section 3.1 of the main text with star network, linear spatial pattern, and high dropout rate = 0.5, where in each informative sub-network, about 20% of the genes connected to the TF genes are set as uninformative and FDR (BFDR) is controlled to be <<0.05.
Refer to caption
Figure S3.8: Comparison boxplots of Recall, Precision, and F1 with the proposed approach with true cellular compositions (proposed) and different degrees of cellular estimate errors (concentration = 100, 80, and 50) based on 50 replicates, under ZINB model considered in Section 3.1 of the main text with scale-free network, linear pattern, and high dropout rate = 0.5, where FDR (BFDR) is controlled to be <<0.05.

S4 References that support the biological implications of three communities

In community A, wound healing has been shown to promote the growth of breast cancer (Arnold et al., 2015), while integrin-mediated signaling has been confirmed to play a significant role in TNBC bone metastasis (Chen et al., 2022). The genes involved in community B are mainly enriched in RNA splicing, and the relationships between RNA splicing and a variety of cancers have long been recognized and widely utilized in cancer diagnosis and therapy (Stanley and Abdel-Wahab, 2022; Yamauchi et al., 2022). For community C, the sphingolipid metabolism has been shown to be essential for breast cancer progression (Corsetto et al., 2023), while ceramide metabolism balance has been confirmed as a critical step of breast cancer development (García-González et al., 2018) and has been long adopted as a targeted pathway to induce apoptosis in breast cancer cell lines (Vethakanraj et al., 2015).