Bayesian Mapping of Mortality Clusters

A. Sottosanti, P. Belloni, E. Bovo, G. Boccuzzo
University of Padova, Department of Statistical Sciences

(
Address for correspondence: andrea.sottosanti@unipd.it)

Abstract

Disease mapping analyses the distribution of several disease outcomes within a territory. Primary goals include identifying areas with unexpected changes in mortality rates, studying the relation among multiple diseases, and dividing the analysed territory into clusters based on the observed levels of disease incidence or mortality. In this work, we focus on detecting spatial mortality clusters, that occur when neighbouring areas within a territory exhibit similar mortality levels due to one or more diseases. When multiple causes of death are examined together, it is relevant to identify not only the spatial boundaries of the clusters but also the diseases that lead to their formation. However, existing methods in literature struggle to address this dual problem effectively and simultaneously. To overcome these limitations, we introduce perla, a multivariate Bayesian model that clusters areas in a territory according to the observed mortality rates of multiple causes of death, also exploiting the information of external covariates. Our model incorporates the spatial structure of data directly into the clustering probabilities by leveraging the stick-breaking formulation of the multinomial distribution. Additionally, it exploits suitable global-local shrinkage priors to ensure that the detection of clusters depends on diseases showing concrete increases or decreases in mortality levels, while excluding uninformative diseases. We propose an MCMC algorithm for posterior inference that consists of closed-form Gibbs sampling moves for nearly every model parameter, without requiring complex tuning operations. This work is primarily motivated by a case study on the territory of a local unit within the Italian public healthcare system, known as ULSS6 Euganea. To demonstrate the flexibility and effectiveness of our methodology, we also validate perla with a series of simulation experiments and an extensive case study on mortality levels in U.S. counties.
Keywords: Multinomial stick-breaking; Global-local shrinkage priors; Multivariate areal data clustering; Spatial disease mapping

1 Introduction

1.1 Motivation and case studies

In the analysis of environmental and epidemiological phenomena, researchers often gather geolocalised data in the form of multiple outcomes. For example, studies on the effects of long-term exposure to air pollution and high temperatures on public health generally collect and analyse both multiple environmental and health indicators (Künzli et al., 2000, Wong et al., 2008, Lee et al., 2009, Taylor et al., 2015). In epidemiological studies, several outcomes are used to determine the frailty characteristics of specific groups of patients (Li et al., 2021, Qian et al., 2023). Therefore, analysing the joint distribution of multiple outcomes and their interactions is a crucial step to uncover patterns in space and relationships that may not be evident when analysing each outcome on its own. This approach provides a more comprehensive understanding of the underlying environmental and epidemiological processes in a territory.

Disease mapping, in particular, aims at studying the spatial distribution of health phenomena to identify significant changes of diseases incidence or mortality. It therefore represents the first step for understanding geographical disparities in health outcomes and developing strategies for disease prevention and control. Disease mapping techniques have evolved significantly over the last decades, moving from simple representations of a disease occurrence to more advanced statistical analyses and models (Waller and Gotway, 2004, Lawson, 2013, Banerjee et al., 2015). The use of sophisticated modelling techniques yields a more detailed representation of the spatial distribution of phenomena in space and of the dependence across outcomes, resulting in more accurate explanatory and predictive performances (Aungkulanon et al., 2017, Coker et al., 2023, Griffith, 2023, Tesema et al., 2023). One of the main goals of disease mapping is discovering and locating groups of spatially proximate areas that show similar levels of the outcomes considered. We refer to these groups as spatial clusters. In public health, discovering spatial clusters of disease incidence or mortality is crucial, as it represents a step forward in the identification of possible underlying risk factors in specific areas of the territory (Kulldorff, 1997, Kulldorff et al., 2005, Robertson et al., 2010).

For instance, we consider a study of mortality levels in the Padua province, located in northeastern Italy, with data provided by the local healthcare unit ULSS6 Euganea. This institution oversees health services across the province, whose territory is divided into 106 administrative sub-areas (the city of Padua is not considered as a stand-alone municipality but is divided into six districts of size and population comparable to other municipalities in the province). The goal of ULSS6 is to optimise logistical and financial resources by identifying significant variations in mortality levels across either individual areas or clusters of areas. Their interest is particularly focused on the mortality due to three major death causes: diseases of the circulatory system, diseases of the respiratory system, and malignant neoplasms. According to Fedeli et al. (2021), these three categories collectively represent 70% of the overall mortality rate recorded in the province. Mortality data for these death causes are available from 2017 to 2019, divided by gender. The left panel of Figure 1 illustrates the province territory divided into the 106 districts managed by ULSS6. The left panel of Supplementary Figure 1 displays the geographical position of the Padua province in Italy. Bovo et al. (2023) made a first attempt to detect mortality clusters in this province by analysing each cause of death separately. Nonetheless, we recognise some important limitations in this approach since treating the variables independently can lead to identify clusters that vary substantially for each cause of death. Moreover, this method does not take full advantage of the information across variables, meaning the clustering of the territory driven by one cause of death lacks insight from the data of other causes. In contrast, gaining information from all causes of death simultaneously would lead to the identification of territorial clusters with specific mortality profiles, i.e., with estimated risk levels for each cause of death.

The case study of ULSS6 mortality data is relevant due to its depiction of a setting where areas have similar territorial characteristics, comparable population sizes and socio-economic conditions. On the other hand, in many other cases, the territory under consideration may be highly heterogeneous in one or more aspects. For instance, in the United States of America (U.S.) the administrative division of the territory into counties provides a far different scenario from the province of Padua. The distribution of population in space is heterogeneous, as well as population density, urbanisation and wealth; in addition, different U.S. states may have different health care systems, and this leads to another source of variability. The distribution of U.S. counties in space is displayed in the right panel of Figure 1. In this framework, the Centres for Disease Control and Prevention (CDC) collected data on mortality levels by counties in the U.S. from 2016 to 2019, available through the WONDER Online Databases (CDC, 2021). In such diverse contexts, it is crucial not only to quantify different risk levels across areas, but also to determine the existence of possible geographically concentrated clusters with specific epidemiological relevance.

Both the discussed examples reveal the need for a statistical method that can: i.) perform multivariate clustering of a territory based on the joint analyses of multiple outcomes, exploiting the spatial structure of data and the dependence across diseases; ii.) leverage information provided by exogenous variables that influence the outcomes; and iii.) identify which outcome variables lead to the formation of spatial clusters. This last point becomes more and more relevant as the number of outcomes increases or the size of the territory grows, making it essential to identify clusters of disease prevalence or excess mortality that are geographically concentrated in small areas while avoiding spurious signals.

1.2 Spatial clustering of mortality levels

In this article, we consider the problem of clustering areas of a territory based on the observed outcomes of multiple diseases. We assume that the territory is divided into polygons defined by administrative boundaries, and that the outcome variables are aggregates of measures taken within the boundaries of each area. In the ULSS6 Euganea case study presented in Section 1.1, the areas correspond to municipalities or administrative districts, while in the U.S. dataset the areas represent the U.S. counties. In both studies, the outcome variables are measurements of mortality from multiple causes of death.

A common approach for identifying groups of contiguous areas with excess of mortality, and thus detecting some spatial clusters, is through spatial scan statistics (SSS, Kulldorff, 1997). These techniques are designed to detect clusters with unexpected changes in outcome variable levels compared to the rest of the territory. This detection is carried out using suitable test statistics (Cucala, 2016, Liu et al., 2018, Ahmed and Genin, 2020). Over the past twenty years, these techniques have gained considerable attention and have been adapted for spatio-temporal data analysis in both frequentist (Kulldorff et al., 2005, Robertson et al., 2010, Amini et al., 2023) and Bayesian paradigms (Li et al., 2012). Recently, SSS have been largely extended also to the multivariate framework (Cucala et al., 2017, 2019) and to more complex data structures, such as multivariate functional data (Frévent et al., 2023). For a detailed review of applications of multivariate SSS to different contexts, readers can refer to the Introduction section of Frévent et al. (2023). Despite their extensive use, SSS are primarily designed to detect abrupt changes in outcome variables and thus they cannot be used to divide the entire map into clusters. In addition, SSS may fail to detect significant changes in areas with highly irregular shapes (Tango, 2021).

Another established approach for detecting spatial clusters is through hidden Potts models (HPM), introduced as an extension of the Gaussian mixture models (GMM) to account for spatial dependence across adjacent neighbours (Potts, 1952). In Bayesian literature, inference on HPM has been extensively explored using various approaches. Examples include Markov chain Monte Carlo (MCMC) methods (Green and Richardson, 2002), variational Bayes (McGrory et al., 2009), and approximate Bayesian computation (Moores et al., 2020a). Recent advancements also include estimation based on synthetic likelihood (Zhu and Fan, 2023). For a comprehensive review of HPM estimation methods, readers can refer to Moores et al. (2020b). Despite their popularity, the Potts model presents some relevant drawbacks, particularly related to the estimation of the inverse temperature parameter. Recently, Moores et al. (2020a) made a detailed survey of the approaches investigated so far for bypassing the estimation problems of HPM, with particular attention to Approximate Bayesian Computation algorithms, and proposed their own scalable solution. However, the current statistical literature does not adequately address the extension of HPM to the joint analysis of multiple outcomes. Additionally, to the best of our knowledge, there is no implementation of a multivariate HPM available, for instance, for the R programming language (R Core Team, 2024).

To address the need to identify mortality clusters jointly considering multiple causes of death, we propose perla (PEnalised Regression with Localities Aggregation), a multivariate Bayesian model for areal data that aggregates areas of a territory into spatially informed clusters based on different levels of outcome variables, while accounting for the presence of exogenous variables and for the marginal correlation across responses. Recently, Coker et al. (2023) applied a Dirichlet process mixture model to the analysis of areal data, without accounting for the spatial structure in the clustering phase. Instead, perla integrates the data spatial proximity directly into the clustering probabilities using the multinomial stick-breaking strategy of Linderman et al. (2015). This method consists of rewriting a multinomial model as a sequence of binomial random variables, allowing the use of the well-established Pólya-gamma data augmentation in the presence of multinomial outcomes. Not only does the multinomial stick-breaking simplifies the Bayesian inference allowing the conditional posterior distributions of the model parameters to be derived in closed form, but it also allows avoiding the use of the Potts model to account for the neighbourhood structure and consequently avoiding the known inferential problems associated with it. Alternative approaches to link a probability vector defined on a simplex to a series of Gaussian random variables include, for example, the logistic normal distribution (Lafferty and Blei, 2005, Russo et al., 2022). In addition, perla regulates the clusters intercepts through global-local shrinkage priors (Polson and Scott, 2011), allowing to affirm or inhibit the role of specific diseases in the formation of clusters. The shrinkage priors play also a regularising role in the selection of the number of clusters, as they help suppressing the effect of excess clusters. The reader can refer to Bhadra et al. (2019) and van Erp et al. (2019) for two reviews of global-local shrinkage priors.

This work is strongly motivated by the two epidemiological case studies presented in Section 1.1 and can be of great usefulness in the analysis of several other social and epidemiological phenomena, including the distribution of the level of education, the incidence of diseases, and the occurrence of multiple kind of crimes. Nonetheless, perla can be applied to any kind of multivariate areal data. Two examples of the numerous possible applications outside the epidemiological context are image segmentation (Moores et al., 2015) and spatial transcriptomics (Zhao et al., 2021). For these reasons, we implement our methodological solution in the software library perla for the R programming language.

The remainder of this article is structured as follows. Section 2 introduces the perla model, highlighting the formulation of the prior distributions and their role in the detection of spatial clusters. Section 3 delineates an MCMC algorithm to perform posterior inference and explains how to operate model selection and post-processing using the posterior samples. Section 4 presents three simulation experiments. The goal of the first experiment is to demonstrate the significant improvements in interpretability of the results and inferential conclusions provided by our global-local shrinkage priors compared to standard, non-informative priors. In addition, it performs a comparison of the clustering accuracy with other competing methods. The second experiment investigates whether different a priori choices on the parameters that handle the spatial correlation of the clustering probabilities have an impact on the final clustering performance, in particular when spatial clusters present different levels of spatial association. The third experiment evaluates the performance of the proposed methodology against competing methods using data simulated from a different generating mechanism that diverges from perla in several key aspects. Section 5 presents the application of perla on the two cases studies described in Section 1.1, showing the usefulness of our methodology in responding to specific epidemiological questions. Finally, Section 6 takes some concluding remarks and outlines future perspectives.

2 Model formulation

2.1 The statistical model

We assume that data are collected in the form of an $n\times d$ matrix $\mathbf{Y}$ , where $y_{ij}\in\mathbb{R}$ denotes the mortality level due to the $j$ -th disease ( $j=1,\dots,d$ ) measured at the $i$ -th area of a map ( $i=1,\dots,n$ ). Data are structured such that $y_{ij}>0$ denotes a mortality excess, and $y_{ij}<0$ denotes a mortality deficit. In addition, let $\mathbf{X}$ be an $n\times p$ matrix collecting $p$ exogenous variables whose values are available for the $n$ areas. We assume that the $n$ areas can be divided into $K$ territorial clusters, implying that the average mortality levels in the $i$ -th spatial area due to the $d$ diseases considered vary based on the cluster to which the $i$ -th area belong. We collect the information about the unknown clusters into a $n\times K$ matrix $\mathbf{Z}$ , where $\mathrm{Z}_{ik}=1$ if the $i$ -th area belongs to the $k$ -th cluster (with $k=1,\dots,K$ ), and 0 otherwise. To access the rows and the columns of a generic matrix of parameters $\boldsymbol{\theta}$ , we use the notation $\boldsymbol{\theta}_{i.}$ and $\boldsymbol{\theta}_{.j}$ to refer to its $i$ -th row and its $j$ -th column, respectively. The same notation is used also for the data matrices $\mathbf{Y}$ and $\mathbf{X}$ , with the only exception that their row and column vectors are denoted with bold lowercase letters.

We assume that regions within a spatial cluster have similar mortality levels for each disease considered, whereas significant variations in mortality levels occur across clusters. In addition, we assume that exogenous variables influence the outcomes but do not regulate the formation of clusters (Coker et al., 2023). Based on these considerations, we formulate the multivariate regression model

\mathbf{y}_{i.}=\mathbf{Z}^{T}_{i.}\boldsymbol{\mu}+\mathbf{x}^{T}_{i.}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{i.},\hskip 28.45274pt\boldsymbol{\epsilon}_{i.}|\boldsymbol{\Sigma}\sim\mathcal{N}_{d}(\mathbf{0},\boldsymbol{\Sigma}),

(1)

where $\mathbf{Z}_{i.}$ and $\mathbf{x}_{i.}$ are the $i$ -th row of $\mathbf{Z}$ and $\mathbf{X}$ , $\boldsymbol{\mu}=\{\mu_{kj}\}_{k=1,\dots,K;j=1,\dots,d}$ is a matrix containing the cluster- and disease-specific intercepts, $\boldsymbol{\beta}=\{\beta_{\ell j}\}_{\ell=1,\dots,p;j=1,\dots,d}$ is the matrix of regression coefficients, and $\boldsymbol{\epsilon}_{i.}$ is the vector of error terms. $\boldsymbol{\Sigma}$ aims at capturing the correlation across the $d$ diseases. The formulation of the regression model in (1) is equivalent to assuming a matrix variate Gaussian distribution on the outcome matrix $\mathbf{Y}$ , with mean matrix $\mathbf{Z\boldsymbol{\mu}}+\mathbf{X\boldsymbol{\beta}}$ , $\mathbb{I}_{n}$ as covariance matrix of the rows, and $\boldsymbol{\Sigma}$ as covariance matrix of the columns.

By definition of $\mathbf{Z}$ , we define a cluster as a set of labelled areas with the same average levels of the $d$ responses, net of the effects of the covariates $\mathbf{X}$ . Since the clustering matrix $\mathbf{Z}$ is unknown, we assume that $\mathbf{Z}_{i.}$ distributes according to a multinomial distribution of size 1 and vector of probabilities $\boldsymbol{\pi}_{i}=(\pi_{i1},\dots,\pi_{iK})$ . To let the clustering draw information by the spatial structure of data, we first express the multinomial model as a series of binomial probability functions using the multinomial stick-breaking representation of Linderman et al. (2015)

\Pr(\mathbf{Z}_{i.}|\tilde{\boldsymbol{\pi}}_{i})=\prod_{k=1}^{K-1}\left\{\tilde{\pi}_{ik}^{\mathrm{Z}_{ik}}(1-\tilde{\pi}_{ik})^{1-\mathrm{Z}_{ik}}\right\}^{N_{ik}},

(2)

where $N_{i1}=1$ (there is at most one element per row that is equal to one), $\tilde{\pi}_{i1}={\pi}_{i1}$ , and, for $k>1$ ,

\hskip 19.91684ptN_{ik}=1-\sum_{k^{\prime}<k}\mathrm{Z}_{ik^{\prime}},\hskip 14.22636pt\tilde{\pi}_{ik}=\dfrac{\pi_{ik}}{1-\sum_{k^{\prime}<k}\pi_{ik^{\prime}}}.

By construction, $\tilde{\pi}_{iK}=1$ . Then, for $k=1,\dots,K-1$ , we reparametrise the probabilities $\tilde{\pi}_{ik}$ using the logistic function, thus writing $\tilde{\pi}_{ik}=\exp(\psi_{ik})/\{1+\exp(\psi_{ik})\}$ . The vector $\boldsymbol{\psi}_{.k}=(\psi_{1k},\dots,\psi_{nk})$ contains $n$ realisations in space of the transformed conditional probabilities of belonging to cluster $k$ , given that the observations were not assigned to the previous $k-1$ clusters $(\boldsymbol{\psi}_{.1}$ simply represents the transformed probabilities of belonging to cluster 1). To introduce spatial dependence across the transformed clustering probabilities, we assume that each vector $\boldsymbol{\psi}_{.k}$ distributes according to a conditionally autoregressive (CAR) model of the form

\boldsymbol{\psi}_{.k}|\tau_{k},\rho_{k}\sim\mathcal{N}_{n}\{\boldsymbol{0},\tau_{k}(\mathbf{D}-\rho_{k}\mathbf{W})^{-1}\},

(3)

where $\mathbf{W}$ has null diagonal and $\mathrm{W}_{ii^{\prime}}=1$ if regions $i$ and $i^{\prime}$ are neighbours, and 0 otherwise, $\mathbf{D}$ is a diagonal matrix with $\mathrm{D}_{ii}$ equal to the number of neighbours of $i$ , $\tau_{k}>0$ and $\rho_{k}\in[0,1)$ .

Although the model in (3) does not force the areas within the same cluster to be connected by a path, therefore allowing even spatially distant areas to belong to the same cluster, modelling the clustering probabilities with CAR distributions as in (3) encourage the identification of spatially separated clusters. The principal role of $\rho_{k}$ is ensuring that $\mathbf{D}-\rho_{k}\mathbf{W}$ is positive semidefinite. As discussed in Section 4.3.1 of Banerjee et al. (2015), $\rho_{k}$ cannot be interpreted as a measure of spatial correlation of the data, similarly for example to what is expressed by the Moran’s I statistic. The only exception is $\rho_{k}=0$ , which denotes the case of independence across spatial areas. In addition, Assunção and Krainski (2009) highlighted that $\rho_{k}$ can assume also negative values, which however can still be associated to positive correlation values. These inconsistencies were underlined also by Ver Hoef et al. (2018), who stated that, in geostatistics, it is common to set $\rho_{k}$ to be positive, as negative values of spatial correlation are generally not considered. Therefore, following also Carlin and Banerjee (2003), in this work we assume $\rho_{k}$ to be only positive. To solve these oddities, Datta et al. (2019) proposed an alternative to the CAR formulation, named DAGAR, using a directed acyclic graph (DAG) to model the precision matrix of $\boldsymbol{\psi}_{.k}$ . One of the key advantages of their approach lies in the interpretation of $\rho_{k}$ , which now represents the degree of spatial correlation within the data. However, for this study, we have opted to retain the CAR formulation for mainly two reasons. Firstly, the DAGAR model necessitates the ordering of map areas. While Datta et al. (2019) demonstrate that ordering does not heavily impact the final results, it is not clear how the ordering operation should be performed when the statistical units, i.e., the spatial areas, are divided into clusters. In addition, the concept of ordering spatial data may pose challenges for non-experts in the field. Given our intention for the model to be used in epidemiological studies, we prefer to maintain a standard framework that does not rely on ordering. Secondly, our decision is reinforced by the favourable performance of the CAR model compared to DAGAR, as evidenced by Datta et al. (2019). Their study indicates that both models yield comparable results from various perspectives, with the interpretability of $\rho_{k}$ emerging as the primary advantage of DAGAR over CAR. In our application, we employ the spatial model solely to incorporate spatial correlation into the parameterised clustering probabilities $\boldsymbol{\psi}_{.k}$ . Thus, we are not using a spatial model for the raw data itself, but rather for a latent parameter. Therefore, while interpretable, we believe that $\rho_{k}$ would offer a few epidemiological insights.

With the current formulation and assumptions, the stick-breaking construction offers a natural ordering of the discovered clusters. In fact, taking $\mathbb{E}(\boldsymbol{\psi}_{.k}|\tau_{k},\rho_{k})=\mathbf{0}$ is equivalent to assuming that the marginal expected value of $\tilde{\pi}_{ik}$ is $\mathbb{E}(\tilde{\pi}_{ik}|\tau_{k},\rho_{k})=0.5$ , for $i=1,\dots,n$ and $k=1,\dots,K-1$ . As a direct consequence, $\mathbb{E}({\pi}_{ik}|\tau_{k},\rho_{k})>\mathbb{E}({\pi}_{ik^{\prime}}|\tau_{k^{\prime}},\rho_{k^{\prime}})$ when $k>k^{\prime}$ , therefore the average number of observations expected in the $k$ -th cluster decreases as $k$ grows, and clusters are naturally interpreted from the most to the least predominant.

The model in Equation (1) can be rephrased also in terms of a multivariate Gaussian mixture model. In fact, it corresponds to assuming $\mathbf{y}_{i.}|\mathrm{Z}_{ik}=1\sim\mathcal{N}_{d}(\boldsymbol{\mu}_{k.}+\mathbf{x}^{T}_{i.}\boldsymbol{\beta},\boldsymbol{\Sigma})$ , where $\Pr(\mathrm{Z}_{ik}=1)=\pi_{ik}$ . Although, at a data level, perla and HPM are practically equivalent, they adopt distinct strategies for embedding the spatial structure of data into the clustering labels. In fact, HPM works directly with the conditional distributions $\mathrm{Z}_{ik}|\mathbf{Z}_{\mathscr{N}(i),k}$ , where $\mathscr{N}(i)$ denotes the spatial neighbours of region $i$ . Instead, perla assumes that the clustering probabilities are spatially correlated. In Section 3, we show how this strategy offers practical advantages in terms of Bayesian posterior inference.

2.2 Prior specification

In the Bayesian framework, the prior distributions of the model parameters are used either for conveying previous knowledge about the analysed phenomena or for specific modelling needs, like bounding the parameter space or performing variable selection. In this section, we discuss the a priori assumptions that we make to help the model retrieving the spatial structure of data, while inferring the underlying epidemiological phenomena.

perla is designed to disentangle spatial clusters based on the variations of disease-specific mortality in space. The elements in the Formula (1) that differentiate the clusters are the intercepts $\boldsymbol{\mu}=\{\mu_{kj}\}_{k=1,\dots,K,j=1,\dots,d}$ . For instance, it is possible that not all considered diseases contribute to the formation of spatial clusters, or that certain diseases contribute to the formation of clusters only in specific areas of the map. In practice, we state that disease $j$ contributes to the formation of the $k$ -th cluster if $\mu_{kj}$ deviates from zero, either positively or negatively. A positive deviation indicates a mortality excess in the cluster due to disease $j$ , while a negative deviation suggests a mortality deficit. Therefore, we design an a priori setting for $\boldsymbol{\mu}$ that promotes the clustering only when it detects significant deviations from zero of the mean levels of the response variables across different areas of the map. For instance, suppose that in an epidemiological study comparing the mortality due to two diseases, the mean levels related to the first disease ( $j=1$ ) are indicative the presence of two spatial clusters, while the mean levels related to the second disease ( $j=2$ ) are close to zero. We aim for a prior distribution on the $\mu_{kj}$ that leverages the diversity between $\{\mu_{k1}\}_{k=1,2}$ to delineate the clusters, while shrinking the disparities between $\{\mu_{k2}\}_{k=1,2}$ toward zero, as they do not inform the clustering structure of the territory. We translate these concepts into the following global-local shrinkage prior:

\mu_{kj}|\phi,\zeta_{j},\gamma_{kj}\sim\mathcal{N}\left(0,{\phi\zeta_{j}\gamma_{kj}}\right),

(4)

where

\sqrt{\phi}\sim\mathcal{C}_{+}(0,1),\hskip 14.22636pt\sqrt{\zeta_{j}}\sim\mathcal{C}_{+}(0,1),\hskip 14.22636pt\sqrt{\gamma_{kj}}\sim\mathcal{C}_{+}(0,1),

and $\mathcal{C}_{+}$ denotes the half-Cauchy distribution. While $\phi$ represents a global shrinkage factor, shared among all the $\mu_{kj}$ , $\zeta_{j}$ is a disease-specific parameter that is in common with the cluster intercepts $\mu_{1j},\dots,\mu_{Kj}$ . A large value of $\zeta_{j}$ emphasises the contribution of outcome $j$ to the clustering process, and a small value of $\zeta_{j}$ diminishes its contribution. Lastly, the local factor $\gamma_{kj}$ acts specifically on the $kj$ -th mean component, highlighting or shrinking the contribution of outcome $j$ to the identification of cluster $k$ . We collect the shrinkage parameters appearing in (4) into the vectors $\boldsymbol{\zeta}=\{\zeta_{j}\}_{j=1,\dots,d}$ and $\boldsymbol{\gamma}=\{\gamma_{kj}\}_{k=1,\dots,K;j=1,\dots,d}$ .

By introducing progressively more local factors in Formula (4), we model the influence of each disease on the overall clustering structure of data through $\boldsymbol{\zeta}$ , and then we investigate the contribution of each disease to the formation of individual mortality clusters through the $\boldsymbol{\gamma}$ . The local factors $\boldsymbol{\gamma}$ become particularly relevant when some diseases are informative only of a few, specific clusters. For instance, suppose that a disease $j$ is not informative of any cluster but the $k$ -th. We expect the model to shrink the contribution of disease $j$ through a small value of $\zeta_{j}$ , and to highlight the contribution of disease $j$ in identifying the $k$ -th cluster through a large value of $\gamma_{kj}$ . The total number of shrinkage parameters introduced with the prior model in (4) is $Kd+d+1$ . Without $\zeta_{j}$ , our prior distribution reduces to the well-know horseshoe prior of Carvalho et al. (2010).

In alternative to (4), we consider also the prior distribution

\mu_{kj}|\phi,\zeta_{j},\delta_{k}\sim\mathcal{N}\left(0,{\phi\delta_{k}\zeta_{j}}\right).

(5)

The parameter $\sqrt{\delta_{k}}\sim\mathcal{C}_{+}(0,1)$ highlights or shrinks toward zero the differences across diseases within the whole $k$ -th cluster, following a similar rationale of Denti et al. (2023). We collect all the shrinkage parameters introduced through Formula (5) into the vector $\boldsymbol{\delta}=\{\delta_{k}\}_{k=1,\dots,K}$ . Not only does the prior in (5) simplify (4), but it is also more parsimonious than the horseshoe prior when $K,d\geq 2$ , with at least one of them strictly greater than 2. This is because it requires only $K+d+1$ shrinkage parameters, compared to $Kd+1$ in the horseshoe prior. The assumption of separability between cluster and disease effects allows to interpret the whole contribution of disease $j$ in the formation of clusters through $\zeta_{j}$ , and the amount of variability across diseases in cluster $k$ through $\delta_{k}$ . In addition, despite not operating directly as a method for selecting the number of clusters, $\delta_{k}$ shrinks all the coefficients of the $k$ -th cluster toward zero if the cluster finds poor evidence from the data, allowing for a clearer identification of the number of clusters in the post-processing phase. Finally, we can also obtain simpler priors taking only a subset of the shrinkage parameters in Formulas (4)-(5). For example, a prior having only the elements $\phi$ and $\boldsymbol{\zeta}$ applies a global shrinkage and a local disease-specific shrinkage.

The use of a half-Cauchy distribution could be potentially extended by allowing its scale parameter to differ from 1, thereby enabling the inferential process to incorporate some a priori knowledge about the analysed phenomena (Piironen and Vehtari, 2017). This extension can, for example, be used to specify which diseases are more likely to contribute to the clustering. However, we opted for the $\mathcal{C}_{+}(0,1)$ distribution because, a priori, there is often little or no knowledge of how many diseases are responsible of the formation of clusters. This is because, in mortality mapping, the primary question is whether clusters of mortality excess or deficit exist at all, with the number of clusters investigated subsequently (Bivand et al. 2013, Section 10.6). This differs fundamentally from other unsupervised analyses: for example, in transcriptomics, cells are clustered based on their gene expression levels and where there is typically a priori knowledge of both the number of cell types present in a tissue and of the genes known to be primarily linked to its biological processes (Zhao et al., 2021).

The parameters $\tau_{k}$ and $\rho_{k}$ calibrate the spatial distribution of $\boldsymbol{\psi}_{.k}$ . In Supplementary Section 1, we discuss the roles of these two parameters on the clustering probability values. In summary, it emerges that $\tau_{k}$ has no impact on the mean of the values in $\boldsymbol{\psi}_{.k}$ , but it has a large impact on their variability: the larger $\tau_{k}$ , the greater the variability associated with each $\psi_{ik}$ (see for instance Supplementary Figure 4). The inverse gamma distribution is conjugate with the Gaussian model; therefore, a natural choice of a prior model for $\tau_{k}$ would be $\mathcal{IG}(\alpha_{\tau},\beta_{\tau})$ . However, one could also avoid inferring an extra parameter and instead fixing $\tau_{k}$ sufficiently large to guarantee an adequate level of uncertainty in the clustering probabilities. In the simulation experiment proposed in Section 4.3, we compare different fixed values of $\tau_{k}$ with the use of its conjugate prior.

A second relevant aspect that emerges from the discussion in Supplementary Section 1 is that only values of $\rho_{k}$ close to 1 allow the CAR prior to handle spatially correlated data. The limit case $\rho_{k}=1$ is not admittable because $\mathbf{D}-\mathbf{W}$ is not a full-rank precision matrix, implying that the integral of the joint prior density of $\boldsymbol{\psi}_{.k}$ is not finite. Nonetheless, when the CAR model is used as a prior, Banerjee et al. (2015) encourage using the kernel of a CAR with $\rho_{k}=1$ , trusting that the posterior distribution will be proper due to the contribution of the likelihood function. This solution is known in the literature as intrinsic CAR. However, in perla, the CAR serves not as a prior for a likelihood parameter, but for a parameter that defines the distribution of $\mathbf{Z}$ , which is itself a random variable. Therefore, being in a second level of hierarchy in the model structure, we believe that the usage of an intrinsic CAR prior should be properly investigated and compared with a proper CAR prior that takes $\rho_{k}=0.99$ for every $k$ . Alternatively, we propose also inferring $\rho_{k}$ from the data. Carlin and Banerjee (2003) suggest using $\rho_{k}\sim\mathcal{B}(18,2)$ , where $\mathcal{B}$ denotes the Beta distribution, to give large probability to values of $\rho_{k}$ close to 1. It is worth noting that the density of this Beta distribution peaks at 0.944 and decreases rapidly as values approach 1, thus discouraging the limit case $\rho_{k}=1$ . To construct a prior distribution for $\rho_{k}$ , first it could be worth evaluating how the estimate of $\rho_{k}$ changes with respect to different levels of spatial correlation in the data. To do so, we simulated a series of spatial datasets using a DAGAR model, based on a map of $n=259$ U.S. counties on the West side of the U.S. map (see Supplementary Figure 2). The choice of the DAGAR as a data generative mechanism allows us to control the level of spatial correlation (called $\rho^{\mathrm{DAGAR}}$ ) in the simulations. For every level of $\rho^{\mathrm{DAGAR}}$ used as input, we produced 200 realisations from the spatial process, and we estimated the CAR model. Figure 2 reports the distribution of the maximum likelihood estimates of $\rho^{\mathrm{CAR}}$ under different $\rho^{\mathrm{DAGAR}}$ . We notice that a clear correspondence between the estimate of $\rho^{\mathrm{CAR}}$ and the level of spatial correlation $\rho^{\mathrm{DAGAR}}$ does not emerge. In fact, when $\rho^{\mathrm{DAGAR}}\geq 0.4$ , the distributions of the estimates of $\rho^{\mathrm{CAR}}$ concentrate above 0.9, while, when $0.1\leq\rho^{\mathrm{DAGAR}}\leq 0.3$ , the estimates of $\rho^{\mathrm{CAR}}$ span over the whole $(0,1)$ interval. Lastly, when $\rho^{\mathrm{DAGAR}}=0.01$ , the most likely value of $\rho^{\mathrm{CAR}}$ appears to be 0.01, although some uncertainty is also evident. Based on these considerations, a prior distribution for $\rho_{k}$ that favours values near 0 or 1 would be appropriate to capture a wide range of spatial correlation scenarios. We thus propose the following mixture prior:

\rho_{k}\sim 0.5\mathcal{B}(2,18)+0.5\mathcal{B}(18,2).

(6)

We choose a symmetric structure to maintain non-informativeness about the value of $\rho_{k}$ , giving equal probability to very small or very high values. The mixture component $\mathcal{B}(2,18)$ reflects the idea of the $\mathcal{B}(18,2)$ component, assigning high probability mass to small values of $\rho_{k}$ , while discouraging the case of complete independence $\rho_{k}=0$ , which is unusual in spatial data modelling. By placing most of the probability mass toward values close to zero or one - reflecting the almost absence or large presence of spatial correlation - the distribution in (6) retraces the idea of a continuous spike-and-slab prior (Ročková, 2018). Notice also that (6) represents a mixture model that integrates out the mixing probability $\theta_{k}$ , whose prior distribution is assumed to be $\theta_{k}\sim\mathcal{B}(c_{\theta},c_{\theta})$ , for any $c_{\theta}>0$ . One of the goals of the simulation experiments proposed in Section 4.3 is comparing the two prior proposals for $\rho_{k}$ , investigating whether they make any difference in terms of clustering accuracy and model fitting.

We complete the model specification with the prior distributions on $\boldsymbol{\beta}$ and $\boldsymbol{\Sigma}$ . We generically assume the conjugate and poorly informative priors ${\beta}_{kj}\sim\mathcal{N}(0,10)$ and $\boldsymbol{\Sigma}\sim\mathcal{IW}(d,\mathbf{I}_{d})$ , where $\mathcal{IW}$ denotes the inverse-Wishart distribution.

Figure 3 illustrates the hierarchical structure of perla through a DAG. Note that this scheme also includes the augmented variables $\boldsymbol{\omega}$ , $\boldsymbol{\alpha}_{\phi}$ , $\boldsymbol{\alpha}_{\zeta}$ , $\boldsymbol{\alpha}_{\delta}$ and $\boldsymbol{\alpha}_{\gamma}$ , which will be introduced in the next section to allow the posterior simulation of some model parameters to be performed via Gibbs sampling.

3 Posterior inference

3.1 The MCMC algorithm

We carry out the inference on the model parameters simulating values from their posterior distributions. We present a Markov chain Monte Carlo scheme that allows a direct sampling from the conditional posterior distributions of most of the model parameters, requiring a Metropolis-Hastings step only for a few of them. This translates into the practical advantage of requiring fewer tuning parameters that can be hard to set when either the number of responses $d$ or the number of clusters $K$ grow. With respect to HPM, which requires particular attention for sampling the inverse temperature parameter, our method presents no critical step.

As we already mentioned, the stick-breaking representation in (2) allows to rewrite the multinomial model as a sequence of Bernoulli probability functions. To facilitate the posterior sampling, we introduce the latent variables $\omega_{ik}\sim\mathcal{PG}(N_{ik},0)$ , for $k=1,\dots,K-1$ and $i=1,\dots,n$ , where $\mathcal{PG}(\cdot,\cdot)$ denotes the Pólya-gamma random variable. Through this data augmentation strategy, we can update the transformed probabilities $\boldsymbol{\psi}_{.k}$ by generating from the conditional posterior distribution of $\boldsymbol{\psi}_{.k}|\omega_{k},\mathbf{Z}_{.k},\rho_{k}$ , that is

\displaystyle\pi(\boldsymbol{\psi}_{.k}|\omega_{k},\mathbf{Z}_{.k},

\displaystyle\tau_{k},\rho_{k})\propto\left\{\prod_{i=1}^{n}\exp\left(\kappa_{ik}\psi_{ik}-\dfrac{\omega_{ik}\psi^{2}_{ik}}{2}\right)\right\}p(\boldsymbol{\psi}_{.k}|\tau_{k},\rho_{k}),

where $\kappa_{ik}=\mathrm{Z}_{ik}-N_{ik}/2$ . As the number of spatial areas $n$ grows, and with it the dimension of the neighbour matrix $\mathbf{W}$ , it becomes more convenient to update the $\psi_{ik}$ iteratively, rather than generating the full vector $\boldsymbol{\psi}_{.k}$ . This can be performed thanks to the Brook’s lemma, which allows to factorise the joint distribution in (3) into a series of univariate conditional distributions derived from the neighbourhood structure imposed by $\mathbf{W}$ (Besag, 1974). The conditional posterior of $\psi_{ik}$ is

\psi_{ik}|\boldsymbol{\psi}_{-i,k},\mathbf{Z}_{i.},\omega_{ik},\rho_{k}\sim\mathcal{N}\left\{\varsigma^{2}_{ik}\left(\kappa_{ik}+\dfrac{\rho_{k}}{\tau_{k}}\mathbf{W}^{T}_{i,-i}\boldsymbol{\psi}_{-i,k}\right),\varsigma^{2}_{ik}\right\},

where $\varsigma^{2}_{ik}={\tau_{k}}/\left({\tau_{k}\omega_{ik}+D_{ii}}\right)$ . The augmented variables $\{\omega_{ik}\}_{i;k}$ are updated drawing from $\omega_{ik}|\mathbf{Z}_{i.},\psi_{ik}\sim\mathcal{PG}(\kappa_{ik},\psi_{ik})$ if $N_{ik}=1$ , or setting $\omega_{ik}=0$ if $N_{ik}=0$ .

The parameter $\rho_{k}$ under the prior model in (6) is the only case for which a closed-form updating scheme is not available. We implemented a random walk Metropolis-Hastings on the transformed parameter $\mathrm{logit}(\rho_{k})$ , using a Gaussian distribution centred in the parameter value sampled at the previous iteration, and with variance $\tilde{\sigma}^{2}_{\rho}=3$ , in the role of proposal.

To update the clustering labels, it is now sufficient to transform the quantities $\boldsymbol{\psi}_{.1},\dots,\boldsymbol{\psi}_{.K-1}$ into the clustering probabilities $\{{\pi}_{ik}\}_{i;k}$ , then drawing a value from the multinomial distribution whose probabilities are defined as

\Pr(\mathrm{Z}_{ik}=1|{\pi}_{ik},\mathbf{y}_{i.},\boldsymbol{\mu}_{.k},\boldsymbol{\beta},\boldsymbol{\Sigma})\propto\pi_{ik}\mathcal{N}_{d}(\mathbf{y}_{i.}|\boldsymbol{\mu}_{.k}+\mathbf{x}^{T}_{i.}\boldsymbol{\beta},\boldsymbol{\Sigma}),

where $\mathcal{N}_{d}(\mathbf{a}|\mathbf{b},\mathbf{C})$ denotes the density of the $d$ -variate normal distribution with mean vector $\mathbf{b}$ and covariance matrix $\mathbf{C}$ , evaluated in $\mathbf{a}$ .

Under the prior model in (4), we update the shrinkage parameters $\boldsymbol{\zeta}$ , $\boldsymbol{\gamma}$ and $\phi$ following the data augmentation strategy of Makalic and Schmidt (2016), which exploits the scale mixture representation of the half-Cauchy distribution. We thus introduce a set of augmented scale parameters for each shrinkage parameter considered: $\boldsymbol{\alpha}_{\zeta}=\{\alpha_{\zeta_{j}}\}_{j=1,\dots,d}$ , $\boldsymbol{\alpha}_{\gamma}=\{\alpha_{\gamma_{kj}}\}_{k=1,\dots,K;j=1,\dots,d}$ , and $\alpha_{\phi}$ . Each of these augmented parameters is assumed to distribute according to an inverse gamma $\mathcal{IG}(1/2,1)$ . Then, for $j=1,\dots,d$ , draw from

\displaystyle\alpha_{\zeta_{j}}|\zeta_{j}

\displaystyle\sim\mathcal{IG}\left(1,\dfrac{1}{\zeta_{j}}+1\right),\hskip 19.91684pt\zeta_{j}|\boldsymbol{\mu}_{.j},\phi,\boldsymbol{\gamma},\alpha_{\zeta_{j}}\sim\mathcal{IG}\left(\dfrac{K+1}{2},\dfrac{1}{2\phi}\sum_{k=1}^{K}\dfrac{\mu_{kj}^{2}}{\gamma_{kj}}+\dfrac{1}{\alpha_{\zeta_{j}}}\right).

(7)

For $k=1,\dots,K$ and $j=1,\dots,d$ , draw from

\displaystyle\alpha_{\gamma_{kj}}|\gamma_{kj}\sim\mathcal{IG}\left(1,\dfrac{1}{\gamma_{kj}}+1\right),\hskip 19.91684pt\gamma_{kj}|{\mu}_{kj},\phi,\boldsymbol{\zeta},\alpha_{\gamma_{kj}}\sim\mathcal{IG}\left(1,\dfrac{\mu_{kj}^{2}}{2\phi\zeta_{j}}+\dfrac{1}{\alpha_{\gamma_{kj}}}\right).

(8)

Lastly, draw from

\displaystyle a_{\phi}|\phi\sim\mathcal{IG}\left(1,\dfrac{1}{\phi}+1\right),\hskip 19.91684pt\phi|\boldsymbol{\mu},\boldsymbol{\zeta},\boldsymbol{\gamma},\alpha_{\phi}\sim\mathcal{IG}\left(\dfrac{Kd+1}{2},\sum_{j=1}^{d}\dfrac{1}{2\zeta_{j}}\sum_{k=1}^{K}\dfrac{\mu_{kj}^{2}}{\gamma_{jk}}+\dfrac{1}{\alpha_{\phi}}\right).

(9)

Let $\mathcal{C}_{k}=\{i=1,\dots,n:\mathrm{Z}_{ik}=1\}$ be the set denoting which observations belong to cluster $k$ , and $n_{k}=|\mathcal{C}_{k}|$ . The cluster intercepts are updated drawing from

\boldsymbol{\mu}_{k.}|\mathbf{Y}_{\mathcal{C}_{k}.},\boldsymbol{\beta},\phi,\boldsymbol{\zeta},\boldsymbol{\gamma}_{k.},\boldsymbol{\Sigma}\sim\mathcal{N}_{d}\left[\boldsymbol{\Sigma}^{*}_{k}\left\{\boldsymbol{\Sigma}^{-1}\left(\mathbf{Y}_{\mathcal{C}_{k}.}-\mathbf{X}_{\mathcal{C}_{k}.}\boldsymbol{\beta}\right)\right\},\boldsymbol{\Sigma}^{*}_{k}\right],

(10)

where $\boldsymbol{\Sigma}^{*}_{k}=\left[\left\{\phi\;\mathrm{diag}(\zeta_{1}\gamma_{k1},\dots,\zeta_{d}\gamma_{kd})\right\}^{-1}+n_{k}\boldsymbol{\Sigma}^{-1}\right]^{-1}$ .

If instead the prior model in (5) is considered, we introduce the set of augmented scale parameters $\boldsymbol{\alpha}_{\delta}=\{\alpha_{\delta_{k}}\}_{k=1,\dots,K}$ , each of which distributes according to an inverse gamma $\mathcal{IG}(1/2,1)$ . Then, for $k=1,\dots,K$ , draw from

\displaystyle\alpha_{\delta_{k}}|\delta_{k}\sim\mathcal{IG}\left(1,\dfrac{1}{\delta_{k}}+1\right),\hskip 19.91684pt\delta_{k}|\boldsymbol{\mu}_{k.},\phi,\boldsymbol{\zeta},\alpha_{\delta_{k}}\sim\mathcal{IG}\left(\dfrac{d+1}{2},{\dfrac{1}{2\phi}\sum_{j=1}^{d}\dfrac{\mu_{kj}^{2}}{\zeta_{j}}}+\dfrac{1}{\alpha_{\delta_{k}}}\right).

The rest of the updates can be performed as in Formulas (7), (9) and (10), just replacing $\gamma_{kj}$ with $\delta_{k}$ .

Finally, the updates of $\boldsymbol{\Sigma}$ and $\boldsymbol{\beta}$ are straightforward to perform, thanks to the conjugancy between the likelihood and the selected priors. Details can be found for example in Marin and Robert (2007).

In our software implementation, we initialise the MCMC algorithm as follows. The $n$ observations are randomly assigned to the $K$ clusters; each element in $\boldsymbol{\psi}$ and $\boldsymbol{\beta}$ is generated from $\mathcal{N}(0,1)$ ; the Pólya-gamma parameters in $\boldsymbol{\omega}$ are drawn from $\mathcal{G}(1,1)$ ; and all shrinkage parameters are drawn from $\mathcal{C}_{+}(0,1)$ . The cluster centroids $\boldsymbol{\mu}$ and the covariance matrix $\boldsymbol{\Sigma}$ are the only two parameters not randomly initialised. The former are taken equal to the ordinary least squares estimate $(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}\mathbf{Y}$ , while the latter are initialised as $\sum_{i=1}^{n}(\mathbf{y}_{i.}-\overline{\mathbf{y}})^{T}(\mathbf{y}_{i.}-\overline{\mathbf{y}})/(n-1)$ , where $\overline{\mathbf{y}}$ denotes the vector of sample means for the $d$ outcome variables.

To provide point estimates of the model parameters, we consider the posterior mode. With the notation $\mathcal{M}(\theta|\mathrm{data})$ , we denote the posterior mode of a generic model parameter $\theta$ , given the data. A relevant quantity in the analysis of mortality data is also the posterior probability of observing an excess mortality in a specific region, compared to the whole territory. Following the considerations of Richardson et al. (2004), we state that cluster $k$ shows evidence of mortality excess due to disease $j$ if $\Pr(\mu_{kj}>0|\mathrm{data})\geq 0.95$ , and a deficit of mortality if $\Pr(\mu_{kj}>0|\mathrm{data})\leq 0.05$ . These quantities can be computed straightforwardly from the output of the MCMC algorithm.

3.2 Post-processing and model selection

The presented MCMC scheme is susceptible to the common label-switching phenomenon, which arises due to the unidentifiability of the mixture components and leads the posterior distributions of $\boldsymbol{\mu}$ to be multimodal. In practice, this makes inference on model parameters challenging. To disentangle the mixture components, we explore the ECR algorithm of Papastamoulis and Iliopoulos (2013), considering the partition associated to the largest log-likelihood value as the pivotal one. In addition to this method, the R package label.switching (Papastamoulis, 2016) implements numerous alternative algorithms. Not only are these tools useful to relabel the posterior draws, but they are also designed to return an optimal partition of data based on the MCMC output. This process can also be accomplished using other algorithms, such as the SALSO greedy search method proposed by Dahl et al. (2022).

perla can be fitted with different setups, varying the number of clusters $K$ and the prior distributions of the model parameters. Therefore, a criterion that allows users to evaluate different model setups and determine which is preferable for the analysed data is necessary. This approach aligns with the philosophy of parsimonious clustering (Celeux and Govaert, 1995, Scrucca et al., 2016), which involves comparing the same type of model with different degrees of flexibility and determining the setup that best fits the data.

We conduct the model selection using the $\mathrm{DIC}_{3}$ information criterion of Celeux et al. (2006), which is recommended for mixture models, and whose properties have been studied by Kim (2021). A detailed description on how to compute $\mathrm{DIC}_{3}$ for perla is given in Supplementary Section 2. Models that return smaller $\mathrm{DIC}_{3}$ values are preferable to those producing larger values.

We already discussed in Section 2.2 that different setups for $\boldsymbol{\mu}$ imply different numbers of model parameters. Assuming that the global shrinkage parameter $\phi$ is always inserted into the model, we report in Table 1 the possible combinations of shrinkage factors that we consider for perla, along with the corresponding code notation used in our R package perla. Given the model dimension, users can evaluate all the five combinations of shrinkage factors shown in the figure, starting with simpler models and increasing the complexity.

Lastly, it is crucial to underline that model selection cannot rely solely on an information criterion, but must be performed considering also other factors, such as the quality of model fitting, evaluated through the MCMC output, and the interpretability of the clusters.

4 Simulation studies

We study the performance of perla with three simulation experiments that aim at recreating some possible real data scenarios and conditions. The map considered for the simulations consists of $n=259$ counties of the U.S. map introduced in Section 1, located West of the 110° W longitude meridian. A representation of this territory is given in Supplementary Figure 2.

Every version of perla considered is fitted running the MCMC algorithm four times separately, each for 10,000 iterations, discarding the first half of each run, and merging the remaining draws into a unique chain of length 20,000. Finally, we apply the ECR algorithm to resolve eventual label switching.

4.1 Simulation model

The data generating mechanism that we adopt follows the scheme of perla displayed in Figure 3, but using the DAGAR prior in place of the CAR to generate the clustering probabilities. This choice allows us to control the amount of spatial correlation simulated in each cluster. The use of the DAGAR requires the ordering of the areas. We thus order the counties from the most southern to the most northern. If $i=1,\dots,259$ is the index denoting the counties, then $i=1$ corresponds to the most southern county, and $i=259$ to the most northern. Being $K^{\text{true}}$ the true number of clusters in the map, the simulation model generates the transformed clustering probabilities as

\psi^{\text{true}}_{ik}|\boldsymbol{\psi}^{\text{true}}_{\mathscr{N}(i),k}\sim\mathcal{N}\left(\dfrac{\rho^{\text{true}}_{k}}{1+(n_{<i}-1){\rho^{\text{true}}_{k}}^{2}},\dfrac{1-{\rho^{\text{true}}_{k}}^{2}}{1+(n_{<i}-1){\rho^{\text{true}}_{k}}^{2}}\right)

where $k=1,\dots,K^{\text{true}}-1$ , $\mathscr{N}(i)$ denotes the set of spatial areas that are neighbours of $i$ and precede $i$ according to the imposed ordering, and $n_{<i}=|\mathscr{N}(i)|$ . The quantity $\rho^{\text{true}}_{k}\in(0,1)$ denotes the amount of spatial correlation of the probabilities of belonging to cluster $k$ . For further details on the DAGAR model, the reader can refer to Datta et al. (2019).

By transforming the quantities $\{\psi^{\text{true}}_{ik}\}_{i=1,\dots,259;k=1,\dots,K^{\text{true}}-1}$ into the clustering probabilities $\{\pi^{\text{true}}_{ik}\}_{i=1,\dots,259;k=1,\dots,K^{\text{true}}}$ through the inverse transformation of the stick-breaking weights, we would obtain highly unbalanced clustering probabilities. To simulate approximately balanced clusters, the inverse transformation of the stick-breaking weights must be applied to $\{\psi^{\text{true}}_{ik}-\log(K-k)\}_{i=1,\dots,259;k=1,\dots,K^{\text{true}}-1}$ . Then, the model allocates the spatial areas into clusters, storing the labels into the matrix $\mathbf{Z}^{\text{true}}$ . Lastly, given the matrix of centroids $\boldsymbol{\mu}^{\text{true}}$ and the covariance matrix of diseases $\boldsymbol{\Sigma}^{\text{true}}$ , the model generates the observed data in the form of the logarithm of mortality rates of $d$ diseases (named also death causes) as

\mathbf{y}_{i.}\sim\mathcal{N}_{d}\left\{\left(\mathbf{Z}^{\text{true}}_{i.}\right)^{T}\boldsymbol{\mu}^{\text{true}},\boldsymbol{\Sigma}^{\text{true}}\right\}.

(11)

In all the proposed simulation experiments, we take our conclusions using 20 replicates of the data generated under the same experimental conditions.

4.2 Simulation 1

The first experiment serves a dual purpose. On one hand, it compares perla with some competing clustering models, such as mixture models and SSS. On the other hand, it examines the gain brought by the use of shrinkage priors, such as those appearing in Formulas (4) and (5), in posterior inference compared to using non-informative priors when the number of diseases considered in the analysis is large.

To accomplish this, we propose a scenario where the formation of mortality clusters is driven only by a subset of the diseases considered. We use $d=10$ diseases and $K^{\text{true}}=3$ clusters. For each of the 20 simulated datasets, we induce large spatial correlation across clustering probabilities, taking ${\rho}^{\text{true}}_{k}\sim\mathcal{U}(0.8,1)$ , for $k=1,2$ . Next, we randomly determine the diseases that do not contribute to cluster formation: for $j=1,\dots,10$ , we introduce a random variable $I_{j}\sim\mathcal{B}e(0.5)$ , which takes value 1 if disease $j$ informs the clustering, and 0 if it doesn’t. Then, we draw $\mu^{\text{true}}_{kj}|I_{j}=1\sim\mathcal{U}(-0.5,0.5)$ , while we set $\mu_{kj}=0$ if $I_{j}=0$ , for all $k$ . To generate the correlation matrix of the diseases $d$ , we draw values from the uniform distribution $\mathcal{U}(-0.1,0.1)$ to generate the correlations across diseases, which are used to fill the off-diagonal elements of an identity matrix of order $d$ . Then, we check that the resulting matrix is semi-positive definite. Finally, we multiply all the elements of the correlation matrix by a factor of 0.05 to obtain the covariance matrix $\boldsymbol{\Sigma}^{\text{true}}$ . The 20 synthetic datasets are finally generated following the simulation model described in Section 4.1. The left panel of Figure 4 illustrates the distribution of the outcome variables in one of the 20 datasets, where $I_{j}=0$ for $j=1,\dots,5$ and $j=7$ .

For each of the 20 scenarios, we fit the following models:

•

perla, with shrinkage factors $\phi\zeta$ , $\phi\delta$ , $\phi\gamma$ , $\phi\zeta\delta$ , and $\phi\zeta\gamma$ . To recall them, we use the notation reported in the right column of Table 1;
•

perla, with the non-informative priors $\mu_{kj}\sim\mathcal{N}(0,10)$ , for all $k$ and $j$ ;
•

a Gaussian mixture model (GMM), selected with the BIC criterion, using the R library mclust (Scrucca et al., 2016);
•

the k-means algorithm, considered also by Aungkulanon et al. (2017);
•

the SSS of Gómez-Rubio et al. (2019), implemented in the R library Dclusterm. This algorithm fits a series of generalised linear models using the observed death counts as outcomes, and the expected death counts as offsets. However, our simulation model produces continuous data. Therefore, we applied Dclusterm within a linear regression model, rather than a Poisson regression. In addition, this method does not handle the analysis of multiple causes of death simultaneously, so it needs to be applied to every disease separately.

To the best of our knowledge, a library software that implements the multivariate HPM is not available in the R computing environment. In addition, the current version of bayesImageS (Moores et al., 2020a) does not fully support non-regular lattice data. This does not allow us to compare perla with HPM.

We apply perla, GMM and k-means using $K=4$ to test how effectively each model can retrieve the clustering structure of data when the number of clusters is incorrectly specified. The right panel of Figure 4 displays the adjusted Rand index (ARI, Hubert and Arabie 1985) distribution on the 20 datasets given by each model. We notice that all the configurations of perla lead to comparable results in terms of classification accuracy. This result can be explained by the presence of some $\mu^{\text{true}}_{kj}$ whose value is close to 0 for some $k$ even when $I_{j}=1$ . For example, in the left plot of Figure 4 we see that the distribution of $\{y_{i6}:\mathrm{Z}^{\text{true}}_{i}=3\}$ , the death cause 6 in cluster 3, is much closer to zero than the distributions in the two other clusters. Therefore, in this simulation context, it is reasonable that different combinations of the shrinkage parameters can be equally efficient in retrieving the clustering structure. As expected, all the configurations of perla globally perform better than mixture models not informed by the spatial data structure. We do not report the results obtained with the Dclusterm algorithm, as it failed to retrieve the clustering structure in all of the 20 datasets.

Despite the ARI values obtained with different perla setups are comparable, we now show that the contribute brought by different shrinkage priors extends beyond the overall clustering accuracy and consists of retrieving which diseases are responsible of the formation of mortality clusters. We display in Figure 5 the posterior distributions of the $\mu_{kj}$ parameters across the 20 simulated maps for which $I_{j}=0$ . We compare the results obtained from the five penalised versions of perla, and perla with non-informative priors (denoted ‘Perla NI’ in the figures). Ideally, since the corresponding $\mu^{\text{true}}_{kj}$ are zero, the posterior distributions should be concentrated around zero. Figure 5a displays the posterior modes and the logarithm of the 95% HPD interval lengths. From top-left panel, it emerges that ‘Perla NI’ provides highly variable posterior estimates of the $\mu_{kj}$ , and large HPD intervals. This is also evident from the marginal distributions of the posterior modes and interval lengths, displayed in Figures 5b and 5c. We notice that, among the estimates provided by ‘Perla NI’, some parameters show very large HPD interval sizes. These parameters are associated to the fourth mixture component, which does not find a correspondence with a real cluster. This effect is no longer present when using perla with only the cluster-specific shrinkage factors (setup (c)), as seen in the top-central panel of Figure 5a, although the posterior modes remain generally variable. With the setups (cd), (d), and (c,d), many posterior modes are shrunk toward zero, while a few of them remain insufficiently shrunk, and the HPD interval lengths decrease. In addition, note from Figure 5c that the HPD intervals of perla (cd), which uses the horseshoe prior, remain wider that the ones provided by the two other setups. Finally, our proposed prior (4), implemented in (d,cd) effectively shrinks all posterior modes towards zero, with narrow HPD intervals.

Figure 6 displays the posterior distributions of the $\mu_{kj}$ parameters for cases with $I_{j}=1$ , corresponding to diseases that effectively contribute to the formation of mortality clusters. Since $\mu^{\text{true}}_{kj}\mid I_{j}=1\sim\mathcal{U}(-0.5,0.5)$ , some coefficients may be generated very close to zero, and we expect our shrinkage priors to estimate them as null. This is observed in Figure 6a, where several posterior distributions have mode on zero, especially under the priors (cd) and (d,cd). The reason is that these setups employ the shrinkage factor $\gamma_{kj}$ to act directly on $\mu_{kj}$ , without affecting the other elements in the same row, $\boldsymbol{\mu}_{k\cdot}$ , and column, $\boldsymbol{\mu}_{\cdot j}$ . In contrast, the setup (d) applies the same amount of shrinkage to all the coefficients of the same disease, while (c,d) applies to $\mu_{kj}$ a combination of the shrinkage factors $\delta_{k}$ and $\zeta_{j}$ , which account for all the other elements in $\boldsymbol{\mu}_{k.}$ and $\boldsymbol{\mu}_{.j}$ . Finally, the version (c) effectively shrinks to zero all coefficients that do not match a real cluster, which is something that ‘Perla NI’ fails to achieve. Overall, the marginal distributions of the posterior modes and interval lengths from the five penalised versions of perla appear similar (see Figures 6b and 6c).

We conclude that clustering accuracy is not the only metric to consider for evaluating the model performance. Although perla provides stable results in terms of ARI using different prior setups, our results demonstrate that our shrinkage priors improve interpretability by indicating which diseases do, and which do not, contribute to cluster formation. If disease $j$ does not affect the overall clustering, a prior setup that uses just its specific shrinkage parameter $\zeta_{j}$ is sufficient to shrink the means $\boldsymbol{\mu}_{.j}$ towards zero while recovering the clustering using information from informative diseases. Differently, if disease $j$ is informative for only a subset of the $K$ clusters, then the model requires the use of additional cluster-disease shrinkage parameters $\gamma_{kj}$ . Our results also indicate that combining disease-specific shrinkage factors $\zeta_{j}$ with cluster-disease shrinkage factors $\gamma_{kj}$ yields better performance in distinguishing between informative and uninformative diseases, compared to using cluster-disease shrinkage factors alone. Finally, the cluster-specific factors $\delta_{k}$ efficiently shrink additional mean parameters, a feature that is particularly useful when the number of mixture components is taken larger than the number of clusters in the data.

Model selection can be guided by the $\mathrm{DIC}_{3}$ criterion, as discussed in Section 3.2. In our practical applications presented in Section 5, we select the model based on both the $\mathrm{DIC}_{3}$ and an examination of the posterior distributions of the $\{\mu_{kj}\}_{k;j}$ , evaluating which models produce well-separated mixture components that can be used to describe the underlying epidemiological phenomena.

4.3 Simulation 2

The second simulation experiment aims to study the performance of perla when the clustering probabilities have different levels of spatial correlation. This setting reflects a scenario where some clusters exhibit strong spatial connection, resulting in geographically concentrated and interconnected areas, while others may show weak spatial connection, with areas scattered across the whole territory. Since perla accounts for the spatial structure of the data through the CAR priors in (3), an additional scope of this experiment is investigating the effects of different prior configurations for the parameters $\tau_{k}$ and $\rho_{k}$ .

To accomplish this, we generate 20 simulated maps of mortality rates, considering $d=3$ death causes and $K^{\text{true}}=4$ clusters. To induce different levels of spatial correlation among clustering probabilities, we consider three equispaced values of spatial correlation $\boldsymbol{\rho}^{\text{true}}=(0.01,0.455,0.9)$ , that remain fixed across the 20 replicates. To induce marginal correlation across causes of death, we generate $\boldsymbol{\Sigma}^{\text{true}}$ as described in Section 4.2: correlation values are drawn from a uniform distribution $\mathcal{U}(-0.2,0.2)$ . After ensuring the semi-positive definiteness of the correlation matrix, we scale all its elements by a factor of 0.07 to obtain the covariance matrix. We generate the cluster- and disease-specific intercepts $\mu_{kj}$ , for $k=1,\dots,4$ and $j=1,2,3$ , from $\mathcal{U}(-1,1)$ . Lastly, each dataset is drawn from the simulation model detailed in Section 4.1. Supplementary Figure 5 displays, on the left panel, the spatial clusters in one of the 20 simulated scenarios and, in the right panel, the distribution of the three outcomes within each cluster appearing in the left panel. The map shows that counties in cluster 1 are spread throughout the territory, while those in clusters 2 and 3 are more geographically concentrated due to the larger spatial correlation values employed. From the spatial shape of the clusters, we highlight the importance of using a model that leverages spatial information without strictly enforcing clusters to be made solely of contiguous areas. For example, cluster 3 (shown in green in the left panel) consists of separate agglomerates of contiguous areas that are not directly connected. It is essential that the clustering model recognises these separate agglomerates as part of a unique cluster, since they share the same epidemiological characteristics.

First, we study the effects of different a priori assumptions on $\rho_{k}$ , keeping $\tau_{k}=1$ for all $k$ . For this experiment, we consider three prior setups: i.) $\rho_{k}=0.99$ $\forall k$ , that is a proper CAR model with large spatial correlation; ii.) $\rho_{k}=1$ $\forall k$ , that is an improper CAR; and iii.) the prior given in Formula (6) (denoted as ‘S&S’ in the figures). We evaluate whether allowing $\rho_{k}$ to be learnt from the data offers advantages in clustering accuracy and model fitting compared to assuming a constant value, especially in a framework where clusters are characterised by different levels of spatial correlation. In addition, we check whether assuming a proper or an improper CAR model yields meaningful differences in performance. We fit only perla (d,cd), as the evaluation of different prior models for $\mu_{kj}$ is already treated in Section 4.2.

The left panel of Figure 7 displays the ARI index distribution based on 20 simulated datasets. We observe that, when the number of clusters is correctly specified, perla accurately recovers the underlying clustering structure (median ARI $>0.9$ ). As expected, the performance is generally worse when the number of clusters is misspecified (median ARI index $\in(0.85,0.9)$ ). All the prior setups perform similarly in terms of clustering accuracy for $K=3$ and $K=4$ . Since clusters are characterised by different levels of spatial correlation, we also measure the accuracy in recovering each cluster separately. To do so, let $\mathbf{Z}^{\mathrm{post}}$ be the posterior estimate of the matrix $\mathbf{Z}$ , and $h(k)=\arg\max_{\ell=1,\dots,K}\mathrm{ARI}(\mathrm{Z}^{\text{true}}_{.k},\mathrm{Z}^{\mathrm{post}}_{.\ell})$ be the index of the estimated cluster that is most associated with the $k$ -th real cluster. The top line of Figure 8 displays the ARI distribution between $\mathrm{Z}^{\text{true}}_{.k}$ and $\mathrm{Z}^{\mathrm{post}}_{.h(k)}$ , for each true cluster $k=1,\dots,K^{\text{true}}$ . These graphs help to visualise which clusters are more challenging to be recovered. When $K=3$ , the third cluster remains the most difficult to retrieve, and the model employing the spike-and-slab-type prior ‘S&S’ performs worse than the models with assume a fixed $\rho_{k}$ . When $K=4$ , we do not observe significant differences across clusters or models.

Last, we test if using different prior setups for $\rho_{k}$ has an impact or not on model fitting. To do so, we display in the second row of Figure 8 the mean squared error (MSE) distribution of each real cluster, that is computed as $\sum_{i:\mathrm{Z}^{\text{true}}_{ik}=1}(y_{ij}-\mathbf{Z}_{i.}^{\mathrm{post}}\boldsymbol{\mu}^{\mathrm{post}}_{.j})^{2}$ , where $\boldsymbol{\mu}^{\mathrm{post}}_{.j}$ denotes the posterior estimate of $\boldsymbol{\mu}_{.j}$ . When the number of clusters is misspecified, all models struggle to fit the data in cluster 3. However, the model denoted ‘S&S’ consistently performs poorly in all clusters and causes of death compared to the CAR priors with fixed $\rho_{k}$ values, which perform similarly. This trend is also observed when $K=4$ .

The second step involves studying the effects of different prior models on $\tau_{k}$ , keeping $\rho_{k}=1$ $\forall k$ . For this experiment, we assume $\tau_{k}=\tau$ $\forall k$ , and we investigate the following options: i.) fixing $\tau$ respectively to 0.1, 1, and 5, ii.) allowing $\tau$ to be learned from the data, assuming the conjugate prior $\mathcal{IG}(2.1,3.1)$ , which implies $\mathcal{M}(\tau)=1$ and a 95% HPD interval that ranges approximately between 0.3 and 7.85.

The right panel of Figure 7 displays the distribution of the ARI across the 20 simulated datasets. We observe that any choice of $\tau$ leads essentially to equivalent results in terms of the overall classification accuracy. We display also in the third row of Figure 8 the distribution of the ARI between $\mathrm{Z}^{\text{true}}_{.k}$ and $\mathrm{Z}^{\mathrm{post}}_{.h(k)}$ , for each true cluster $k=1,\dots,K^{\text{true}}$ . When $K=3$ , the third cluster remains the most difficult to recover, and the model assuming $\tau\sim\mathcal{IG}$ performs slightly worse in recovering it than the models with fixed $\tau$ . When $K=4$ , we do not observe significant differences across clusters or across models.

Finally, the last row of Figure 8 displays the distribution of the mean squared error (MSE) within each true cluster. The third cluster remains the most difficult to recover when $K=3$ , while the quality of the model fitting remains stable across every cluster when $K=4$ .

We conclude that, when the data-generating mechanism presents several clusters with different levels of spatial correlation, perla performs well in recovering the clustering structure when the number of clusters is correctly specified. In addition, it appears not to be influenced by the varying levels of spatial correlation. On the contrary, when the assumed number of clusters is lower than the true number, both the clustering accuracy and the model fitting are more affected in areas with high spatial correlation. This experiment has served also to frame the impact of the hyperparameters values of the CAR prior on the model performance. Our simulations show that specifying a prior distribution on $\rho_{k}$ or fixing it at a constant value yield equivalent results in terms of clustering accuracy, while the model fitting is poorer when $\rho_{k}$ is not taken as constant. In addition, we have observed that assuming an improper CAR ( $\rho_{k}=1$ ) is practically equivalent to assuming a proper CAR with a sufficiently large $\rho_{k}$ . Finally, our model appears to be robust to different choices of $\tau_{k}$ , producing practically equivalent results whether $\tau_{k}$ is assumed variable or fixed in advance.

4.4 Simulation 3

In epidemiological studies, data are typically collected as counts. For instance, mortality studies rely on the number of deaths attributed to various diseases within each area of the analysed territory. However, these raw counts are not meaningful per se unless compared with an appropriate reference measure. For example, the observed number of deaths can be assessed against the expected number in the territory, which accounts for population stratification by age and allows to identify areas with mortality excess or deficit.

From a statistical modelling perspective, Poisson regression is a natural choice for count data, with the chosen reference measure incorporated as an offset. In practice, however, while raw counts are often available, suitable comparison measures are frequently inaccessible due to privacy restrictions. This limitation motivated us to develop a model that works with continuous outcome variables, such as rates, which are both readily available and unaffected by privacy constraints. Nonetheless, we believe it essential to assess our model’s performance in scenarios where both observed counts and a valid comparison measure are available.

In this section, we propose an experiment that stresses perla by simulating the data using a generating mechanism that substantially deviates from our model in three aspects: first, it uses Formula (11) to simulate the log-risks, incorporating also three covariates into the generative mechanism; then, it simulates the observed deaths from a Poisson model; and last, it assumes that different diseases yield different spatial clustering patterns. We consider $d=5$ causes of death. The first two diseases yield two clusters, obtained by dividing the map along a line at longitude 117°8’2.4” W. This line lies at the midpoint of the latitude range, so one cluster occupies the northern part of the map and the other the southern part. A representation of these two clusters is provided in the left panel of Supplementary Figure 6. We store the clustering information into the $n\times 2$ matrix ${\mathbf{Z}}^{(1:2)^{\text{true}}}$ , and the mean levels of the two diseases observed in the two clusters into the $2\times 2$ matrix ${\boldsymbol{\mu}^{(1:2)}}^{\text{true}}$ . The next two diseases yield two clusters, obtained by dividing the map along a line at latitude 40°10’33.888” N. As this line lies at the midpoint of the longitude range, one cluster occupies the western part of the territory and the other the eastern part. A representation of these two clusters is shown in the central panel of Supplementary Figure 6. We store the clustering information into the $n\times 2$ matrix ${\mathbf{Z}}^{(3:4)^{\text{true}}}$ , and the mean levels into the $2\times 2$ matrix ${\boldsymbol{\mu}^{(3:4)}}^{\text{true}}$ . The last disease does not show any clustering pattern.

The proposed data-generating scheme simulates the log-risks as

\log(\boldsymbol{\theta}^{\text{true}}_{i.})\sim\mathcal{N}_{d}\left\{\left[\left({\mathbf{Z}^{(1:2)}_{i.}}^{\text{true}}\right)^{T}{\boldsymbol{\mu}^{(1:2)}}^{\text{true}},\left({\mathbf{Z}^{(3:4)}_{i.}}^{\text{true}}\right)^{T}{\boldsymbol{\mu}^{(3:4)}}^{\text{true}},0\right]^{T}+\mathbf{x}^{T}_{i.}\boldsymbol{\beta}^{\text{true}},\boldsymbol{\Sigma}^{\text{true}}\right\}.

The independent variables are generated as $x_{i\ell}\sim\mathcal{N}(-0.2,\,0.5^{2})$ and the regression coefficients as $\beta_{\ell j}\sim\mathcal{N}(0.02,\,0.05^{2})$ , for $\ell=1,2,3$ . These values were manually tuned to avoid unrealistically high mortality risks. The covariance matrix $\boldsymbol{\Sigma}^{\text{true}}$ is assumed to be block diagonal, with the blocks $\boldsymbol{\Sigma}^{\text{true}}_{1:2,1:2}$ and $\boldsymbol{\Sigma}^{\text{true}}_{3:4,3:4}$ generated as in Simulation 1, and with $\Sigma^{\text{true}}_{55}=0.05$ .

To generate the observed deaths, we consider the expected ones as offset. These are obtained from $E_{ij}\sim\mathcal{G}(1,0.03)$ . This distribution was chosen because it assumes that the average number of expected deaths is 30 and approximately the 95% of the areas in the territory have at most 100 deaths, while the remaining 5% can have much higher values, representing a few areas with large population sizes (an consequently more deaths). The observed counts in the $i$ -th area of the map and for the $j$ -th cause of death are simulated from $O_{ij}\sim{\mathcal{P}oi}(\theta^{\text{true}}_{ij}E_{ij})$ . Finally, to analyse these data using perla, we compute the log-SMR as $y_{ij}=\log(O_{ij}/E_{ij})$ or, in case $O_{ij}=0$ , $y_{ij}=\log\{(O_{ij}+1/2)/E_{ij}\}$ , as suggested by Bivand et al. (2013).

We generate 20 replicates of the experiment, and we apply perla (d,cd) to ensure flexibility, incorporating the model covariates. For comparison, we apply the SSS of Gómez-Rubio et al. (2019), already considered in Section 4.2, that must be applied to each outcome separately and works directly with counts data $O_{ij}$ . In order to obtain a final clustering, we combine the labels obtained on the five outcomes into a unique clustering configuration. In addition, we consider also the Gaussian mixture model that assumes $\mathbf{y}_{i\cdot}\mid\mathrm{Z}_{ik}=1\sim\mathcal{N}_{d}\Bigl{(}\boldsymbol{\mu}_{k\cdot}+\mathbf{x}_{i\cdot}^{T}\boldsymbol{\beta},\,\boldsymbol{\Sigma}_{k}\Bigr{)}$ , with $\Pr(\mathrm{Z}_{ik}=1)=\pi_{k}$ and the covariance structure $\boldsymbol{\Sigma}_{k}$ selected using the BIC criterion. Similarly to perla, this model assumes that the covariates are not directly related to the clustering. All the model parameters are estimated via maximum likelihood.

Due to the complex clustering structure used to generate these data, evaluating the clustering accuracy of the models is not immediate. By combining ${\mathbf{Z}^{(1:2)}}^{\text{true}}$ and ${\mathbf{Z}^{(3:4)}}^{\text{true}}$ , we obtain a partition of the territory into four clusters (see the right panel of Supplementary Figure 6) that we use to evaluate the clustering models. We fit perla and GMM using $K=2,3,4$ . The ARI distributions in the left panel of Figure 9 show that the highest clustering quality is achieved by perla with $K=4$ . This result is particularly remarkable considering that the SSS is works directly with raw counts data. We notice also that both perla and GMM perform better when the number of clusters is enlarged. This is due to the nested clustering structure that characterises this simulation; the gain in the ARI level is particularly visible moving from 2 to 3 clusters. We conclude that, when different the clusters vary with the outcome variables, an effective strategy is to increase the number of clusters assumed by the model. In addition, we observe that transforming the outcomes into continuous variables rather that using directly the counts data does not lead to any loss in clustering accuracy.

In addition, we report in the right panel of Figure 9 the distribution of the computational cost of perla as a function of the number of clusters. Each model estimation involves running four Markov chains of length 10,000 sequentially using the algorithm detailed in Section 3.1 and combining the results with the post-processing algorithm described in Section 3.2. Consequently, the runtime for a single Markov chain of length 10,000 can be roughly deduced by dividing the values in the graph by four.

5 Case studies

In this section, we investigate the existence of mortality clusters in the case studies discussed in Section 1.1. In both scenarios, we apply perla to male and female populations separately. We do not include sex as a covariate in the model because, as outlined in Section 2.1, perla assumes that exogenous variables influence the outcomes but do not regulate the formation of clusters. However, in the case studies considered in this article, it is of particular interest to investigate whether the mortality clusters based on the three causes of death considered show significant differences between males and females. This requires us to conduct the analysis separately for each group. We consider all the combinations of shrinkage factors shown in the right plot of Figure 3, assuming $\tau_{k}=1$ and $\rho_{k}=0.99$ , $\forall k$ . Every model is fitted running the MCMC algorithm four times separately, each for 10,000 iterations, discarding the first half of each run, and merging the remaining draws into a unique chain of length 20,000. Lastly, we apply the ECR algorithm to resolve eventual label switching.

5.1 Mortality in the Padua province

We investigate the presence of mortality clusters in Padua province using age-adjusted standardised mortality ratios $SMR_{ij}=O_{ij}/E_{ij}$ , derived as the ratio between observed death counts ( $O_{ij}$ ) and expected death counts ( $E_{ij}$ ) in the $i=1,\dots,106$ areas and for the $j=1,2,3$ diseases considered. The expected number of deaths represents the number of deaths per age class that would occur if the mortality rates per age class of the entire province were applied. Thus, assuming $T$ age classes, the expected number of deaths from disease $j$ in the $i$ -th area is $E_{ij}=\sum_{t=1}^{T}n_{it}r_{jt}$ , where $r_{jt}$ is the provincial mortality rate of the $j$ -th disease of interest within the $t$ -th age class, measured in the $i$ -th area, and $n_{it}$ is the population size within the $i$ -th area and the $t$ -th age class. For our analysis, we consider

y_{ij}=\log\left(SMR_{ij}\right);

the logarithm is applied so that $y_{ij}=0$ represents the case of concordance between observed and expected deaths, $y_{ij}>0$ denotes a mortality excess compared to the province level, and $y_{ij}<0$ denotes a mortality deficit compared to the province level. In cases where $O_{ij}=0$ for some $i$ and $j$ , one can apply the alternative formula $SMR_{ij}=(O_{ij}+1/2)/E_{ij}$ to enable the computation of the logarithm of the standardised mortality ratios. This approach is described, for example, in Section 5 of Bivand et al. (2013). However, it is important to clarify that, although we know how the standardised mortality ratios were calculated, the raw data $O_{ij}$ and $E_{ij}$ were not provided by ULSS6 Euganea due to privacy considerations; only the $SMR_{ij}$ values are available for the analysis.

We also consider for the analysis two variables that are representative of the socioeconomic status of individuals in an area. Specifically, we consider:

•

Deprivation index, quantified through the Caranci’s index (Caranci and Costa, 2009). This index quantifies the deprivation, which represents socio-economic disadvantages, by considering various factors affecting residents in the territory. It is calculated as the standardised sum of five indicators, with higher scores indicating higher deprivation (Rosano et al., 2020). These five indicators include the level of education among individuals aged 15-60, the unemployment status, the condition of single-parent family with underage children, house renting and housing density per 100 m². The index typically ranges between $-5$ and 40, with most cases between $-$ 2 and 2. The data used to compute the index, sourced from the Italian 2011 census, are available at census section level and were aggregated to obtain data at municipality level.
•

Sale price of real estate per squared meter, which serves as an indicator of the general economic status of municipalities. These price data were scraped from www.immobiliare.it, a web portal that collects property sales listings in Italy, using the average of the prices in euro/m² on January 1 ${}^{\makebox{st}}$ of 2017, 2018, and 2019 (immobiliare.it, 2022). The specific time frame was chosen as it is characterised by stable price trends, without notable increases or decreases.

On male data, we fit perla considering $K=2$ and $K=3$ , mainly motivated by the relatively small number of areas in the map. Results of the model selection are shown in Figure 10. The $\mathrm{DIC}_{3}$ criterion clearly supports $K=3$ . In fact, the plot of the posterior distribution of $\boldsymbol{\mu}$ under $K=2$ (top panel of Supplementary Figure 7) displays multiple modes, which is a clear indication that a model with an additional component should be considered. We thus apply perla with $K=3$ and shrinkage factor (c,d), which is more parsimonious than (d,cd) while providing in practice the same value of $\mathrm{DIC}_{3}$ . The final classification of the map is reported in the top-left plot of Figure 11. While cluster 1 extends over the whole territory, particularly on the northern area of the province, clusters 2 and 3 are prevalently located in the south-west. The posterior densities of $\{\exp(\mu_{kj})\}_{k=1,2,3;j=1,2,3}$ are displayed in the top-right panel of Figure 11, showing that cluster 1 is characterised by an excess mortality due to respiratory diseases ( $\mathcal{M}(\mu_{12}|\mathrm{data})=1.216$ , $\Pr\{\exp(\mu_{12})>1|\mathrm{data}\}=0.997$ ), and cluster 2 by a mortality excess due to circulatory diseases ( $\mathcal{M}(\mu_{21}|\mathrm{data})=1.166$ , $\Pr\{\exp(\mu_{21})>1|\mathrm{data}\}=0.966$ ) and by a mortality deficit due to respiratory diseases ( $\mathcal{M}(\mu_{22}|\mathrm{data})=0.318$ , $\Pr\{\exp(\mu_{22})>1|\mathrm{data}\}=0$ ). Lastly, cluster 3 does not show evidence of neither excess nor deficit of mortality due to the three causes of death considered. This is visible from the posterior distributions of the elements in $\{\exp(\mu_{3j})\}_{j=1,2,3}$ , which are shrank toward one thanks to the cluster-specific shrinkage factor $\delta_{3}$ . In Table 2, we report all the posterior SMR estimates and the probabilities of observing an excess mortality. Notice that, according to our model, deaths for tumours appear not to impact on the formation of spatial mortality clusters. This is highlighted by the disease-specific shrinkage factor $\zeta_{3}$ , which collapses the posterior estimates of $\{\exp(\mu_{k3})\}_{k=1,2,3}$ toward one. We report also the trace plots of the cluster centroids $\boldsymbol{\mu}$ in the top panel of Supplementary Figure 8. Chains appear stable, demonstrating consistency across different parallel runs.

Figure 12 displays the posterior estimates of the shrinkage factors used in the analysis of male data. The plot consists of a $K\times d$ matrix, coloured according to the posterior mode of the normalised disease-specific shrinkage factors, $\zeta_{j}/\sum_{l=1}^{d}\zeta_{l}$ . This normalisation allows this quantity to be interpreted as the percentage contribution of disease $j$ to the clustering structure. Since $\zeta_{j}$ either amplifies or shrinks the role of disease $j$ in formation of clusters, in the figure $\zeta_{j}/\sum_{l=1}^{d}\zeta_{l}$ remains consistent across all clusters. The results indicate that cluster formation is driven primarily by respiratory diseases, accounting for roughly 90% of the observed structure. This is further visible from the top row of Table 2, where respiratory diseases exhibit substantially higher risk variability across the associated posterior estimates compared to the other two causes of death. Thus, while mortality from respiratory diseases varies substantially across the territory, leading to distinct spatial clusters, the other two diseases show more stable geographical patterns. The shrinkage factors $\{\delta_{k}\}_{k=1}^{K}$ quantify the variability in mortality levels across different causes of death within each cluster, with larger values indicating greater heterogeneity. However, from an epidemiological perspective, we find these parameters less interpretable than the ${\zeta}_{j}$ , and therefore we do not report their graphical representation.

On female data, all the model setups, except (d), are comparable in terms of $\mathrm{DIC}_{3}$ (see the right panel of Figure 10). Therefore, we apply perla with $K=4$ and shrinkage factor (c), which is the most parsimonious. In alternative, we could have considered the model with three clusters, following the conclusions taken from male data. However, the posterior distribution of $\boldsymbol{\mu}$ under $K=3$ displays multiple modes, as shown in the bottom panel of Supplementary Figure 7. The final classification of the map is displayed in the bottom-left plot of Figure 11. Some differences emerge with respect to male data. Cluster 3 is geographically located in the same area of cluster 2 identified on male data, but its composition is rather different, as is the municipality included in the northern province. Additionally, the model detects a fourth cluster consisting of only two separate municipalities. The posterior densities of $\{\exp(\mu_{kj})\}_{k;j}$ are displayed in the bottom-right plot of Figure 11, and detailed estimates are reported in the bottom part of Table 2. We detect an excess mortality in cluster 1, that corresponds to the northern province, due to respiratory diseases, and a deficit of mortality for the same cause in cluster 3 and 4. In addition, cluster 4 is characterised by a non-negligible deficit of mortality due to tumours. Cluster 3 shows also a probability of excess mortality due to circulatory diseases of almost 0.9. We report also the trace plots of the cluster centroids $\boldsymbol{\mu}$ in the bottom panel of Supplementary Figure 8.

Although cluster 4 comprises only two municipalities, there is strong evidence of its separation from cluster 3. Supplementary Figure 9 illustrates the percentage of times each pair of municipalities in clusters 3 and 4 were assigned to the same cluster across the MCMC iterations. The estimated probability that the two municipalities finally assigned to cluster 4 (Barbona and Arquà Petrarca) belong to the same cluster exceeds 0.9. In contrast, the estimated probability of either municipality being placed in the same cluster with any of the nine municipalities finally assigned to cluster 3 ranges from 0 to 0.05.

We notice also that the best partition selected by the ECR algorithm does not include cluster 2. Observations assigned to cluster 2 in some MCMC iterations are all part of the group of 95 observations denoted as cluster 1 in the final estimate of the data partition. We conclude that, although cluster 2 does not have a practical epidemiological interpretation, its inclusion in some MCMC iterations helped achieve unimodal posterior distributions, facilitating the inference on the model parameters and the computation of some quantities of interest, such as the probability of mortality excess. Notice also that the posterior distributions of the elements in $\{\exp(\mu_{2j})\}_{j=1,2,3}$ are shrank toward one thanks to the cluster-specific shrinkage factor $\delta_{2}$ . Lastly, it is worth underlining that the data partition shown in the bottom-left panel of Figure 11 is equivalent to the partition returned by the SALSO algorithm, which confirms the robustness of the results discussed.

Both male and female mortality data do not show substantial evidence of marginal correlation across the three causes of death considered, as shown in Supplementary Table 1.

5.2 Mortality in the U.S. counties

We now investigate the presence of mortality clusters in the U.S. from 2016 to 2019. Due to the high diversity of the U.S. territory already mentioned in Section 1.1, we restrict our attention to the north-east and central-east coasts, selecting the U.S. counties whose centroids are located east of the 80° W longitude meridian. This territory consists of 388 counties, which are displayed in the right plot of Supplementary Figure 1. We examine the relative age-adjusted mortality risks ${RR}_{ij}=MR_{ij}/TMR_{j}$ , derived as the ratio between the age-adjusted mortality rate ( $MR_{ij}$ ) and the total age-adjusted mortality rate ( $TMR_{j}$ ) in the $i=1,\dots,388$ counties and for the $j=1,2,3$ diseases considered. These rates were obtained from the Centres for Disease Control and Prevention (CDC) WONDER Online Databases CDC (2021). Assuming $T$ age classes, the age-adjusted mortality rates in the county $i$ were computed as $MR_{ij}=\sum_{t=1}^{T}MR_{ijt}P_{t}/P$ , where $P$ is the size of the standard population – the U.S. population in 2000 – $P_{t}$ is the size of the standard population in the $t$ -th age class, and $MR_{ijt}$ is the age-specific mortality rate in the $t$ -th age class. Similarly, $TMR_{j}=\sum_{t=1}^{T}TMR_{jt}P_{t}/P$ , where $TMR_{jt}$ is the total age-specific mortality rate in the $t$ -th age class over the entire U.S. territory. Both $MR_{ij}$ and $TMR_{ij}$ are directly available in the WONDER Online Databases; however, the specific information of age classes (e.g., the $MR_{ijt}$ ) is not.

In our analysis, we apply

y_{ij}=\log\left(RR_{ij}\right),

so that $y_{ij}=0$ , $y_{ij}>0$ and $y_{ij}<0$ represent respectively the cases where the mortality in county $i$ due to disease $j$ is equal, larger or smaller than the mortality across the entire U.S. territory. To enable the computation of the logarithm of the relative age-adjusted mortality risk when $MR_{ij}=0$ for some $i$ and $j$ , one can adopt, for example, the approach described by Möller and Ahrenfeldt (2021). However, it is also common that, when the number of cases falls below a certain threshold, the mortality ratio is reported as missing for privacy reasons, which is the case in our study. For counties with less than 10 deceased persons due to a specific death cause, the number of observed deaths and the corresponding mortality rate for that specific death cause are not reported in the WONDER data. In particular, the mortality level due to respiratory causes is not available for the male population in Tyrrell County (NC) and Grand Isle County (VT), and it is unavailable for both male and female populations in Highland County (VA). In such cases, we impute the missing data using the average mortality rates for the same death cause from neighbouring and bordering counties. It is worth underlying that, although the WONDER Online Databases also provide the number of deaths for every county, information of the population size per age class is not available. This prevents us from computing a possible measure of comparison of the observed deaths that accounts for the age population structure, such as the expected deaths. The use of the crude death rate – the ratio between death counts and population size in each county – does not account for the age population structure. This motivates the use of the age-adjusted mortality rates $MR_{ij}$ (and, consequently, the relative age-adjusted mortality risks $RR_{ij}$ ) in our analysis.

On male data, we choose the model with $K=3$ clusters that exploits cluster-specific penalisation factors, denoted (c), as it is the model that returns the smallest value of $\mathrm{DIC}_{3}$ (see the left panel of Supplementary Figure 10). Results are shown in the top row of Figure 13. A predominant cluster that spans the entire territory is characterised by elevated mortality rates for all causes of death considered. The second cluster comprises primarily large urban areas or their neighbouring counties (including Boston, New York, Long Island, Philadelphia, Baltimore, and the urban area around Washington D.C.), along with some isolated regions. Mortality levels in these areas are uniformly below 1, indicating lower mortality compared to the entire U.S. territory. Lastly, the third cluster consists of isolated counties, showing a mortality deficit for respiratory diseases, and a mortality excess for circulatory diseases tumours. Further details are provided in Table 3. It also appears that the marginal correlation across causes of death is not negligible. The posterior estimates and the corresponding 95% HPD intervals of the marginal correlations are: 0.439 (0.319,0.544) between circulatory and respiratory causes, 0.424 (0.307,0.508) between circulatory causes and tumours, and 0.363 (0.250,0.478) between respiratory causes and tumours.

On female data, we select the model (d,cd), as it returns the minimum value of $\mathrm{DIC}_{3}$ (see the right panel of Supplementary Figure 10). The results displayed in the bottom row of Figure 13 reveal a spatial distribution of the clusters that mirrors the patterns observed in the male data. Once again, cluster 1 spans the entire territory, cluster 2 captures prevalently large urban areas, and cluster 3 consists of isolated counties. Additionally, cluster 4 comprises only two counties. The posterior distributions in the bottom-right plot illustrate the variations in mortality levels across clusters. The mortality rates for circulatory and respiratory diseases observed in cluster 4 are similar to those in cluster 2, while mortality rate for tumours is substantially lower. Unlike the male data, there is also a mortality excess due to tumours in cluster 3. Further details are provided in the bottom section of Table 3. In comparison to males, the marginal correlations with tumours appear to be lower in female data. The posterior estimates and corresponding 95% HPD intervals are: 0.301 (0.168, 0.432) between circulatory and respiratory causes, 0.357 (0.239, 0.453) between circulatory causes and tumours, and 0.37 (0.264, 0.486) between respiratory causes and tumours.

Figure 14 displays the posterior estimates of the shrinkage factors employed by the models used for the analysis of female data. Each figure represents a $K\times d$ matrix, coloured according to the posterior mode of the shrinkage factors. The larger the value in the $kj$ -th cell, the more significant the contribution of disease $j$ to the formation of the $k$ -th cluster. The results for $\zeta_{j}$ are displayed as the normalised version $\zeta_{j}/\sum_{l=1}^{d}\zeta_{l}$ , which can be interpreted as the percentage contribution of disease $j$ to the clustering structure. The left panel shows that territorial mortality clusters are predominantly shaped by respiratory diseases ( $\sim 50\%$ ) and tumours ( $\sim 40\%$ ). The second panel shows the joint effects $\gamma_{kj}$ . Although this quantity cannot be normalised as $\zeta_{j}$ , it indicates that tumours are particularly responsible for the formation of cluster 4, and circulatory diseases significantly contribute to the formation of cluster 3.

The two clusters denominated cluster 1 for both male and female data encompass the majority of counties on the map and exhibit higher mortality rates compared to the entire U.S. territory. The absence of spatial clusters with average mortality levels in keeping with national mortality levels might raise concerns about potential systematic biases in the data. Therefore, we conduct an additional analysis in a geographically distant area. Specifically, we apply perla to female data from the 259 counties on the West side map of the U.S., which is the territory used for the simulation experiments in Section 4. The prior setup considered is the same used for the East coast female data. In this region, 27 areas have at least one missing data point and their value are imputed similarly to the missing data on the East coast territory. The analysis reveal the presence of a predominant cluster characterised by lower mortality rates for circulatory diseases and tumours compared to the national average. The remaining two clusters do not correspond to major urban areas as observed in the East coast data; instead, they include regions with low population densities. Detailed plots are provided in Supplementary Figure 11. We conclude that the U.S. East coast exhibits higher mortality levels for major causes of death compared to the rest of the nation, while the West coast demonstrates generally lower mortality levels.

6 Discussion

The analysis of the mortality distribution due to multiple death causes provides an overview of the health conditions and disparities within a territory. In particular, detecting mortality clusters means identifying areas with similar levels of mortality and investigating the presence of unknown risk factors. In this work, we have proposed perla, a multivariate Bayesian model that addresses the dual problem of detecting spatial mortality clusters considering multiple causes of death simultaneously, and identifying the diseases responsible for cluster formation. perla integrates the spatial structure of the data directly into the clustering probabilities, enabling tractable posterior distributions. Additionally, it accounts for the presence of external covariates, if available.

We have proposed different setups for perla considering different types of global-local shrinkage priors. These setups help the model avoid the identification of spurious clusters, and identify which death causes are associated with mortality increment and deficit in different clusters. The several prior setups allow for the selection of an appropriate level of shrinkage while maintaining parsimony, i.e., without excessively increasing the number of model parameters.

Through two different case-studies, we have demonstrated the various insights achievable with our methodology. Not only does our model partition the areas of a territory into clusters, but it also estimates the probability of risk excess in each cluster for each cause of death. Additionally, the model quantifies the marginal correlation across different causes of death. Finally, the inference on the shrinkage parameters allows us to quantify the impact of multiple diseases on the formation of clusters and to identify the causes of death that more characterise the mortality in specific areas of a territory.

Although this article has introduced a comprehensive solution to address some key questions in the analysis of multivariate areal data, we believe that there is space for further extensions. While we assumed that the data $\mathbf{Y}$ are realisations of continuous and zero-centred variables, as discussed in Section 4.4 the mortality data often come in the form of counts. Consequently, following the approach of Green and Richardson (2002), the regression specified in Formula (1) could be extended to the generalised linear model framework. In addition, the regression framework could be further extended by considering a prior distribution on $\boldsymbol{\beta}$ that induces variable selection when $p$ grows substantially. This approach should be performed considering the $d$ regressions jointly, as suggested by Deshpande et al. (2019).

Our simulation experiments have shown that perla can recover the clustering structure of the data even when the clusters are characterised by different levels of spatial correlation. Nonetheless, in view of potential applications in other frameworks, perla could be extended by incorporating an unstructured random component both at the data level in Formula (1), mirroring the structure of the BYM model (Besag et al., 1991), and in the clustering probabilities. Such an extension, however, complicates the posterior sampling since some steps of the simulation scheme would require the use of the Metropolis-Hastings algorithm.

In the case-studies treated in this work, we have performed the model selection looking both at the $\mathrm{DIC}_{3}$ information criterion and the quality of the MCMC samples, preferring models with unimodal posterior distributions. However, this approach necessitates running perla multiple times. Alternatively, a simulation algorithm such as the reversible jump Markov chain Monte Carlo (RJMCMC, Green 1995) could streamline this process. RJMCMC would test multiple model setups and provide a probabilistic assessment of the number of clusters and the optimal combination of penalisation parameters that best fit the data. Bayesian nonparametric methods are also valuable for selecting the number of mixture components (Rodriguez and Müller, 2013). In our case studies, we have observed that a small number of clusters is enough to describe the data. However, in other applications, this step may be necessary. In such cases, perla can be easily extended to an infinite mixture model by allowing the stick-breaking process in (2) to continue without stopping at a fixed number of components, thereby generating progressively smaller probability sticks (see, for instance, Jo et al. 2017).

Finally, as we previously mentioned, inference on the parameters can be performed easily via MCMC methods, thanks to multiple data augmentation strategies that allow for expressing most conditional posterior distributions in closed form. These computational characteristics make it possible to explore potentially more efficient estimation procedures, such as variational Bayes methods, which are particularly appealing for the analysis of massive datasets.

7 Supplementary material

Contains details about the derivation of the $\mathrm{DIC}_{3}$ information criterion, a discussion on the role of the CAR parameters in the variation of the clustering probabilities, and additional figures and tables related to both the simulation experiments and case studies.

8 Software

The methodology presented in this work is implemented in the R package perla, which can be accessed online at https://github.com/andreasottosanti/perla. This package ensures full reproducibility of the analyses conducted on both simulated data and the publicly available mortality U.S. data. Data from the Italian health care system are not accessible due to privacy concerns.

9 Competing interests

No competing interest is declared.

10 Funding

This work was supported by Next Generation EU, in the context of the National Recovery and Resilience Plan, Investment PE8 – Project Age-It: “Ageing Well in an Ageing Society”, CUP C93C22005240007 [DM 1557 11.10.2022]. The views and opinions expressed are only those of the authors and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them. Additionally, GB and PB acknowledge funding from PRIN SOcial and health Frailty as determinants of Inequality in Aging. (SOFIA), project n. 020KHSSKE, CUP C93C22000270001, directorial decree ministry of university n. 222 dated February 18, 2022.

11 Acknowledgments

The authors are thankful to C. Castiglione (Bocconi University) and P. Onorati (University of Padua) for the helpful discussions about global-local shrinkage priors, and to F. Denti (University of Padua) for precious suggestions after reading the first draft of the manuscript. The authors extend their gratitude to B. Bezzon (University of Padua) for linguistic support. Authors also acknowledge ULSS6 Euganea, particularly L. Benacchio, for providing the aggregated SMRs used in this study, within the framework of the STHEP (State of Health in Padua) collaboration between ULSS6 Euganea and the Department of Statistical Sciences of the University of Padua.

References

Ahmed and Genin (2020) M.-S. Ahmed and M. Genin. A functional-model-adjusted spatial scan statistic. Statistics in Medicine, 39(8):1025–1040, 2020. ISSN 1097-0258. doi: 10.1002/sim.8459.
Amini et al. (2023) M. Amini, M. Azizmohammad Looha, S. Rahimi Pordanjani, H. Asadzadeh Aghdaei, and M. A. Pourhoseingholi. Global long-term trends and spatial cluster analysis of pancreatic cancer incidence and mortality over a 30-year period using the global burden of disease study 2019 data. PloS One, 18(7), 2023.
Assunção and Krainski (2009) R. Assunção and E. Krainski. Neighborhood Dependence in Bayesian Spatial Models. Biometrical Journal, 51(5):851–869, 2009. ISSN 1521-4036. doi: 10.1002/bimj.200900056.
Aungkulanon et al. (2017) S. Aungkulanon, V. Tangcharoensathien, K. Shibuya, K. Bundhamcharoen, and V. Chongsuvivatwong. Area-level socioeconomic deprivation and mortality differentials in Thailand: results from principal component analysis and cluster analysis. International journal for equity in health, 16:1–12, 2017.
Banerjee et al. (2015) S. Banerjee, B. P. Carlin, and A. E. Gelfand. Hierarchical modeling and analysis for spatial data, volume 135 of Monographs on Statistics and Applied Probability. CRC Press, Boca Raton, FL, second edition, 2015. ISBN 978-1-4398-1917-3.
Besag (1974) J. Besag. Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):192–236, 1974. ISSN 0035-9246.
Besag et al. (1991) J. Besag, J. York, and A. Mollié. Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43(1):1–20, 1991. ISSN 1572-9052. doi: 10.1007/BF00116466.
Bhadra et al. (2019) A. Bhadra, J. Datta, N. G. Polson, and B. Willard. Lasso Meets Horseshoe: A Survey. Statistical Science, 34(3):405–427, 2019. ISSN 0883-4237.
Bivand et al. (2013) R. S. Bivand, E. Pebesma, and V. Gómez-Rubio. Applied Spatial Data Analysis with R. Springer, New York, NY, 2013. ISBN 978-1-4614-7617-7 978-1-4614-7618-4.
Bovo et al. (2023) E. Bovo, P. Belloni, A. Sottosanti, and G. Boccuzzo. Territorial clusters of mortality and role of social and environmental factors: the case of ULSS 6 Euganea (Italy). In 2023 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pages 1–5. IEEE, 2023.
Caranci and Costa (2009) N. Caranci and G. Costa. Un indice di deprivazione a livello aggregato da utilizzare su scala nazionale: giustificazioni e composizione. Salute e società. Fascicolo 1, 2009, pages 1000–1021, 2009.
Carlin and Banerjee (2003) B. P. Carlin and S. Banerjee. Hierarchical Multivarite CAR Models for Spatio-Temporally Correlated Survival Data. Bayesian Statistics, 7(7):45–63, 2003.
Carvalho et al. (2010) C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010. ISSN 0006-3444.
CDC (2021) CDC. National vital statistics system, mortality 1999-2020 on cdc wonder online database, 2021. Accessed on: Apr 11, 2024. http://wonder.cdc.gov/mcd-icd10.html.
Celeux and Govaert (1995) G. Celeux and G. Govaert. Gaussian parsimonious clustering models. Pattern Recognition, 28(5):781–793, 1995. ISSN 0031-3203. doi: 10.1016/0031-3203(94)00125-6.
Celeux et al. (2006) G. Celeux, F. Forbes, C. P. Robert, and D. M. Titterington. Deviance information criteria for missing data models. Bayesian Analysis, 1(4):651–673, 2006. ISSN 1936-0975, 1931-6690. doi: 10.1214/06-BA122.
Coker et al. (2023) E. S. Coker, J. Molitor, S. Liverani, J. Martin, P. Maranzano, N. Pontarollo, and S. Vergalli. Bayesian profile regression to study the ecologic associations of correlated environmental exposures with excess mortality risk during the first year of the Covid-19 epidemic in Lombardy, Italy. Environmental Research, 216, 2023.
Cucala (2016) L. Cucala. A Mann–Whitney scan statistic for continuous data. Communications in Statistics - Theory and Methods, 45(2):321–329, Jan. 2016. ISSN 0361-0926. doi: 10.1080/03610926.2013.806667.
Cucala et al. (2017) L. Cucala, M. Genin, C. Lanier, and F. Occelli. A multivariate Gaussian scan statistic for spatial data. Spatial Statistics, 21:66–74, Aug. 2017. ISSN 2211-6753. doi: 10.1016/j.spasta.2017.06.001.
Cucala et al. (2019) L. Cucala, M. Genin, F. Occelli, and J. Soula. A multivariate nonparametric scan statistic for spatial data. Spatial Statistics, 29:1–14, Mar. 2019. ISSN 2211-6753. doi: 10.1016/j.spasta.2018.10.002.
Dahl et al. (2022) D. B. Dahl, D. J. Johnson, and P. Müller. Search Algorithms and Loss Functions for Bayesian Clustering. Journal of Computational and Graphical Statistics, 31(4):1189–1201, 2022. ISSN 1061-8600. doi: 10.1080/10618600.2022.2069779.
Datta et al. (2019) A. Datta, S. Banerjee, J. S. Hodges, and L. Gao. Spatial Disease Mapping Using Directed Acyclic Graph Auto-Regressive (DAGAR) Models. Bayesian Analysis, 14(4), 2019. ISSN 1936-0975. doi: 10.1214/19-BA1177.
Denti et al. (2023) F. Denti, R. Azevedo, C. Lo, D. G. Wheeler, S. P. Gandhi, M. Guindani, and B. Shahbaba. A horseshoe mixture model for Bayesian screening with an application to light sheet fluorescence microscopy in brain imaging. The Annals of Applied Statistics, 17(3):2639–2658, Sept. 2023. ISSN 1932-6157, 1941-7330. doi: 10.1214/23-AOAS1736.
Deshpande et al. (2019) S. K. Deshpande, V. Ročková, and E. I. George. Simultaneous Variable and Covariance Selection With the Multivariate Spike-and-Slab LASSO. Journal of Computational and Graphical Statistics, 28(4):921–931, 2019. ISSN 1061-8600. doi: 10.1080/10618600.2019.1593179.
Fedeli et al. (2021) U. Fedeli, E. Schievano, F. Avossa, T. Baruffa, A. Rampado, A. Lucia, M. Veronese, N. Gennaro, M. Pellizzari, E. Pinato, E. Ferroni, C. Basso, S. T. Netti, L. Cestari, A. D. Paoli, M. Dotto, S. Pierobon, V. Casotto, M. Costa, M. Braggion, M. R. Lamattina, and V. Zabeo. La mortalità nella Regione del Veneto. periodo 2016-2019, 2021.
Frévent et al. (2023) C. Frévent, M.-S. Ahmed, S. Dabo-Niang, and M. Genin. Investigating spatial scan statistics for multivariate functional data. Journal of the Royal Statistical Society Series C: Applied Statistics, 72(2):450–475, May 2023. ISSN 0035-9254. doi: 10.1093/jrsssc/qlad017.
Green (1995) P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995. ISSN 0006-3444. doi: 10.1093/biomet/82.4.711.
Green and Richardson (2002) P. J. Green and S. Richardson. Hidden Markov Models and Disease Mapping. Journal of the American Statistical Association, 97(460):1055–1070, 2002. ISSN 0162-1459.
Griffith (2023) D. A. Griffith. Spatial autocorrelation mixtures in geospatial disease data: An important global epidemiologic/public health assessment ingredient? Transactions in GIS, 27(3):730–751, 2023.
Gómez-Rubio et al. (2019) V. Gómez-Rubio, P. Moraga, J. Molitor, and B. Rowlingson. DClusterm: Model-Based Detection of Disease Clusters. Journal of Statistical Software, 90:1–26, Aug. 2019. ISSN 1548-7660. doi: 10.18637/jss.v090.i14.
Hubert and Arabie (1985) L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, Dec. 1985. ISSN 1432-1343. doi: 10.1007/BF01908075.
immobiliare.it (2022) immobiliare.it. Real estate price data in Italy, 2022. Accessed on: Dec 19, 2022. https://www.immobiliare.it/mercato-immobiliare.
Jo et al. (2017) S. Jo, J. Lee, P. Müller, F. A. Quintana, and L. Trippa. Dependent Species Sampling Models for Spatial Density Estimation. Bayesian Analysis, 12(2), June 2017. ISSN 1936-0975. doi: 10.1214/16-BA1006.
Kim (2021) C. Kim. Deviance information criteria for mixtures of distributions. Communications in Statistics - Simulation and Computation, 50(10):2935–2948, 2021. ISSN 0361-0918. doi: 10.1080/03610918.2019.1617878.
Kulldorff (1997) M. Kulldorff. A spatial scan statistic. Communications in Statistics-Theory and methods, 26(6):1481–1496, 1997.
Kulldorff et al. (2005) M. Kulldorff, R. Heffernan, J. Hartman, R. Assunçao, and F. Mostashari. A space–time permutation scan statistic for disease outbreak detection. PLoS medicine, 2(3):e59, 2005.
Künzli et al. (2000) N. Künzli, R. Kaiser, S. Medina, M. Studnicka, O. Chanel, P. Filliger, M. Herry, F. Horak, V. Puybonnieux-Texier, P. Quénel, J. Schneider, R. Seethaler, J.-C. Vergnaud, and H. Sommer. Public-health impact of outdoor and traffic-related air pollution: a European assessment. The Lancet, 356(9232):795–801, Sept. 2000. ISSN 0140-6736. doi: 10.1016/S0140-6736(00)02653-2.
Lafferty and Blei (2005) J. Lafferty and D. Blei. Correlated Topic Models. In Advances in Neural Information Processing Systems, volume 18. MIT Press, 2005.
Lawson (2013) A. B. Lawson. Statistical methods in spatial epidemiology. John Wiley & Sons, 2013.
Lee et al. (2009) D. Lee, C. Ferguson, and R. Mitchell. Air pollution and health in Scotland: a multicity study. Biostatistics, 10(3):409–423, 2009.
Li et al. (2012) G. Li, N. Best, A. L. Hansell, I. Ahmed, and S. Richardson. BaySTDetect: detecting unusual temporal patterns in small area data via Bayesian model choice. Biostatistics, 13(4):695–710, 2012. ISSN 1465-4644. doi: 10.1093/biostatistics/kxs005.
Li et al. (2021) Y. Li, D. V. Nguyen, S. Banerjee, C. M. Rhee, K. Kalantar-Zadeh, E. Kürüm, and D. Şentürk. Multilevel Modeling of Spatially Nested Functional Data: Spatiotemporal Patterns of Hospitalization Rates in the U.S. Dialysis Population. Statistics in medicine, 40(17):3937–3952, July 2021. ISSN 0277-6715. doi: 10.1002/sim.9007.
Linderman et al. (2015) S. W. Linderman, M. J. Johnson, and R. P. Adams. Dependent multinomial models made easy: stick breaking with the Pólya-gamma augmentation. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 3456–3464, Cambridge, MA, USA, 2015. MIT Press.
Liu et al. (2018) Y. Liu, Y. Liu, and T. Zhang. Wald-based spatial scan statistics for cluster detection. Computational Statistics & Data Analysis, 127:298–310, Nov. 2018. ISSN 0167-9473. doi: 10.1016/j.csda.2018.06.002.
Makalic and Schmidt (2016) E. Makalic and D. F. Schmidt. A Simple Sampler for the Horseshoe Estimator. IEEE Signal Processing Letters, 23(1):179–182, 2016. ISSN 1558-2361. doi: 10.1109/LSP.2015.2503725.
Marin and Robert (2007) J.-M. Marin and C. P. Robert. Regression and Variable Selection. In Bayesian Core: A Practical Approach to Computational Bayesian Statistics, pages 47–84. Springer, New York, NY, 2007. ISBN 978-0-387-38983-7.
McGrory et al. (2009) C. A. McGrory, D. M. Titterington, R. Reeves, and A. N. Pettitt. Variational Bayes for estimating the parameters of a hidden Potts model. Statistics and Computing, 19(3):329–340, 2009. ISSN 1573-1375. doi: 10.1007/s11222-008-9095-6.
Moores et al. (2020a) M. Moores, G. Nicholls, A. Pettitt, and K. Mengersen. Scalable Bayesian Inference for the Inverse Temperature of a Hidden Potts Model. Bayesian Analysis, 15(1):1–27, 2020a. ISSN 1936-0975, 1931-6690. doi: 10.1214/18-BA1130.
Moores et al. (2015) M. T. Moores, C. C. Drovandi, K. Mengersen, and C. P. Robert. Pre-processing for approximate Bayesian computation in image analysis. Statistics and Computing, 25(1):23–33, 2015. ISSN 1573-1375. doi: 10.1007/s11222-014-9525-6.
Moores et al. (2020b) M. T. Moores, A. N. Pettitt, and K. L. Mengersen. Bayesian Computation with Intractable Likelihoods. In K. L. Mengersen, P. Pudlo, and C. P. Robert, editors, Case Studies in Applied Bayesian Data Science: CIRM Jean-Morlet Chair, Fall 2018, pages 137–151. Springer International Publishing, Cham, 2020b. ISBN 978-3-030-42553-1.
Möller and Ahrenfeldt (2021) S. Möller and L. J. Ahrenfeldt. Estimating Relative Risk When Observing Zero Events—Frequentist Inference and Bayesian Credibility Intervals. International Journal of Environmental Research and Public Health, 18(11):5527, May 2021. ISSN 1661-7827. doi: 10.3390/ijerph18115527.
Papastamoulis (2016) P. Papastamoulis. label.switching: An R Package for Dealing with the Label Switching Problem in MCMC Outputs. Journal of Statistical Software, 69:1–24, 2016. ISSN 1548-7660. doi: 10.18637/jss.v069.c01.
Papastamoulis and Iliopoulos (2013) P. Papastamoulis and G. Iliopoulos. On the Convergence Rate of Random Permutation Sampler and ECR Algorithm in Missing Data Models. Methodology and Computing in Applied Probability, 15(2):293–304, 2013. ISSN 1573-7713. doi: 10.1007/s11009-011-9238-7.
Piironen and Vehtari (2017) J. Piironen and A. Vehtari. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018–5051, Jan. 2017. ISSN 1935-7524, 1935-7524. doi: 10.1214/17-EJS1337SI.
Polson and Scott (2011) N. G. Polson and J. G. Scott. Shrink Globally, Act Locally: Sparse Bayesian Regularization and Prediction. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West, editors, Bayesian Statistics 9, page 0. Oxford University Press, Oct. 2011. ISBN 978-0-19-969458-7.
Potts (1952) R. B. Potts. Some generalized order-disorder transformations. Mathematical Proceedings of the Cambridge Philosophical Society, 48(1):106–109, 1952. ISSN 1469-8064, 0305-0041. doi: 10.1017/S0305004100027419.
Qian et al. (2023) Q. Qian, D. V. Nguyen, D. Telesca, E. Kurum, C. M. Rhee, S. Banerjee, Y. Li, and D. Senturk. Multivariate spatiotemporal functional principal component analysis for modeling hospitalization and mortality rates in the dialysis population. Biostatistics, 2023.
R Core Team (2024) R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024. URL https://www.R-project.org/.
Richardson et al. (2004) S. Richardson, A. Thomson, N. Best, and P. Elliott. Interpreting Posterior Relative Risk Estimates in Disease-Mapping Studies. Environmental Health Perspectives, 112(9):1016–1025, 2004. doi: 10.1289/ehp.6740.
Robertson et al. (2010) C. Robertson, T. A. Nelson, Y. C. MacNab, and A. B. Lawson. Review of methods for space–time disease surveillance. Spatial and Spatio-temporal Epidemiology, 1(2):105–116, July 2010. ISSN 1877-5845. doi: 10.1016/j.sste.2009.12.001.
Rodriguez and Müller (2013) A. Rodriguez and P. Müller. Nonparametric Bayesian Inference. NSF-CBMS Regional Conference Series in Probability and Statistics, 9:i–110, 2013. ISSN 1935-5920.
Rosano et al. (2020) A. Rosano, B. Pacelli, N. Zengarini, G. Costa, C. Cislaghi, and N. Caranci. Aggiornamento e revisione dell’indice di deprivazione italiano 2011 a livello di sezione di censimento. Epidemiologia e Prevenzione, 44(2-3):162–70, 2020.
Ročková (2018) V. Ročková. Bayesian estimation of sparse signals with a continuous spike-and-slab prior. The Annals of Statistics, 46(1):401–437, Feb. 2018. ISSN 0090-5364, 2168-8966. doi: 10.1214/17-AOS1554.
Russo et al. (2022) M. Russo, B. H. Singer, and D. B. Dunson. Multivariate mixed membership modeling: Inferring domain-specific risk profiles. The Annals of Applied Statistics, 16(1):391–413, 2022. ISSN 1932-6157. doi: 10.1214/21-aoas1496.
Scrucca et al. (2016) L. Scrucca, M. Fop, T. B. Murphy, and A. E. Raftery. mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. The R journal, 8(1):289–317, 2016. ISSN 2073-4859.
Tango (2021) T. Tango. Spatial scan statistics can be dangerous. Statistical Methods in Medical Research, 30(1):75–86, 2021.
Taylor et al. (2015) J. Taylor, P. Wilkinson, M. Davies, B. Armstrong, Z. Chalabi, A. Mavrogianni, P. Symonds, E. Oikonomou, and S. I. Bohnenstengel. Mapping the effects of urban heat island, housing, and age on excess heat-related mortality in london. Urban Climate, 14:517–528, 2015.
Tesema et al. (2023) G. A. Tesema, Z. T. Tessema, S. Heritier, R. G. Stirling, and A. Earnest. A systematic review of joint spatial and spatiotemporal models in health research. International Journal of Environmental Research and Public Health, 20(7):5295, 2023.
van Erp et al. (2019) S. van Erp, D. L. Oberski, and J. Mulder. Shrinkage priors for Bayesian penalized regression. Journal of Mathematical Psychology, 89:31–50, Apr. 2019. ISSN 0022-2496. doi: 10.1016/j.jmp.2018.12.004.
Ver Hoef et al. (2018) J. M. Ver Hoef, E. E. Peterson, M. B. Hooten, E. M. Hanks, and M.-J. Fortin. Spatial autoregressive models for statistical inference from ecological data. Ecological Monographs, 88(1):36–59, 2018. ISSN 1557-7015. doi: 10.1002/ecm.1283.
Waller and Gotway (2004) L. A. Waller and C. A. Gotway. Applied spatial statistics for public health data. John Wiley & Sons, 2004.
Wong et al. (2008) C.-M. Wong, N. Vichit-Vadakan, H. Kan, Z. Qian, and the PAPA Project Teams. Public Health and Air Pollution in Asia (PAPA): A Multicity Study of Short-Term Effects of Air Pollution on Mortality. Environmental Health Perspectives, 116(9):1195–1202, Sept. 2008. doi: 10.1289/ehp.11257.
Zhao et al. (2021) E. Zhao, M. R. Stone, X. Ren, J. Guenthoer, K. S. Smythe, T. Pulliam, S. R. Williams, C. R. Uytingco, S. E. B. Taylor, P. Nghiem, J. H. Bielas, and R. Gottardo. Spatial transcriptomics at subspot resolution with BayesSpace. Nature Biotechnology, 39(11):1375–1384, 2021. ISSN 1546-1696. doi: 10.1038/s41587-021-00935-2.
Zhu and Fan (2023) W. Zhu and Y. Fan. A synthetic likelihood approach for intractable markov random fields. Computational Statistics, 38(2):749–777, 2023. ISSN 1613-9658. doi: 10.1007/s00180-022-01256-x.

Refer to caption — Figure 1: Left: map of the Padua province territory, which covers about 2,144 km² with a population of approximately 936,000 residents. Right: map of the U.S. divided into counties. We highlight in black the 388 counties on the U.S. East coast that we analyse in Section 5.2. Maps are not shown on the same scale.

Table 1: List of the combinations of shrinkage parameters considered for the prior models for

\boldsymbol{\mu}

. The first column provides the R code from our perla library used to denote the prior setup, the second lists the model parameters, and the third describes the shrinkage effects.

Code in perla	Shrinkage	Description
Code in perla	parameters	Description
(1)	$\phi$	global shrinkage
(c)	$\phi$	global shrinkage;
(c)	$\boldsymbol{\delta}$	cluster shrinkage
(d)	$\phi$	global shrinkage;
(d)	$\boldsymbol{\zeta}$	disease shrinkage
(cd)	$\phi$	global shrinkage;
(cd)	$\boldsymbol{\gamma}$	cluster-disease shrinkage
(c,d)	$\phi$	global shrinkage;
	$\boldsymbol{\delta}$	cluster shrinkage;
	$\boldsymbol{\zeta}$	disease shrinkage
(d,cd)	$\phi$	global shrinkage;
	$\boldsymbol{\zeta}$	disease shrinkage;
	$\boldsymbol{\gamma}$	cluster-disease shrinkage

Table 2: Results on the Padua province, divided for males and females. Each cell of the table displays the posterior modal value of

\exp(\mu_{kj})

and the probability of observing a mortality excess

\Pr\{\exp(\mu_{kj})>1|\mathrm{data}\}

. Values for which the probability of mortality excess is larger than 0.95 or smaller than 0.05 are reported in bold.

		Circulatory ( $j=1$ )	Respiratory ( $j=2$ )	Tumours ( $j=3$ )
Males	$k=1$	1.000 (0.393)	1.216 (0.997)	1.000 (0.511)
	$k=2$	1.166 (0.966)	0.318 (0.000)	1.000 (0.638)
	$k=3$	1.000 (0.580)	0.999 (0.677)	1.000 (0.476)
Females	$k=1$	1.001 (0.461)	1.098 (0.953)	1.001 (0.69)
	$k=2$	1.000 (0.588)	1.002 (0.523)	0.999 (0.637)
	$k=3$	1.154 (0.898)	0.229 (0.000)	1.042 (0.768)
	$k=4$	1.061 (0.704)	0.614 (0.034)	0.295 (0.003)

Table 3: Results on the U.S. East coast data, divided for males and females. Each cell of the table displays the posterior modal value of

\exp(\mu_{kj})

and the probability of observing a mortality excess

\Pr\{\exp(\mu_{kj})>1|\mathrm{data}\}

. Values for which the probability of mortality excess is larger than 0.95 or smaller than 0.05 are reported in bold.

		Circulatory ( $j=1$ )	Respiratory ( $j=2$ )	Tumours ( $j=3$ )
Males	$k=1$	1.033 (0.995)	1.098 (1.000)	1.095 (1.000)
	$k=2$	0.887 (0.001)	0.802 (0.000)	0.919 (0.002)
	$k=3$	1.128 (0.967)	0.483 (0.000)	1.135 (0.999)
Females	$k=1$	1.034 (0.975)	1.079 (1.000)	1.072 (1.000)
	$k=2$	0.878 (0.007)	0.793 (0.000)	0.939 (0.006)
	$k=3$	0.998 (0.445)	0.494 (0.000)	1.22 (1.000)
	$k=4$	0.883 (0.067)	0.707 (0.004)	0.374 (0.000)