This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Modeling heterogeneity in higher-order moments
while preserving mean and variance:
application to spatio-temporal modeling

Hajime Kuno hakuno103@gmail.com Daisuke Murakami
Abstract

In this study, we propose a general model capable of addressing heterogeneity in higher-order moments while preserving mean and variance, including the t, Laplace, and skew-normal distributions as special cases. Our model flexibly accommodates variations in tail heaviness and asymmetry at each data point while maintaining interpretability similar to normal distribution models. Notably, it is closed under linear transformations and provides explicit analytical expressions for skewness and kurtosis. The proposed model is applied to spatial and temporal data analysis, demonstrating that its properties vary based on the chosen matrix decomposition approach. To facilitate efficient inference, we develop a Bayesian estimation method using data augmentation, which is particularly effective for temporal models. Simulation studies confirm that accounting for heterogeneity in higher-order moments enhances parameter estimation accuracy and predictive performance. To illustrate real-world applicability, we analyze production functions across U.S. states. The results indicate that our model effectively captures heterogeneity in higher-order moments, leading to superior model fit in empirical data analysis.

keywords:
Heterogeneity in higher-order moments; Spatio-temporal modeling; Efficient Bayesian inference;
journal: Spatial Statistics
\affiliation

[label1] organization=The Graduate University for Advanced Studies,addressline=Shonan Village, city=Hayama, postcode=240-0193, state=Kanagawa, country=Japan

\affiliation

[label2] organization=BrainPad inc.,addressline=3-1-1 Roppongi, city=Minato-ku, postcode=108-0071, state=Tokyo, country=Japan

\affiliation

[label3] organization=Institute of Statistical Mathematics,addressline=10-3 Midori-cho, city=Tachikawa, postcode=190-8562, state=Tokyo, country=Japan

1 Introduction

The modeling of higher-order moments, such as skewness and kurtosis, is crucial in many real-world applications (Tagle et al., 2019). For example, infection risk may exhibit a skewed distribution with extremely high values during epidemic periods, and the strength of skewness can vary across different regions and time periods. Therefore, describing the process using a non-Gaussian model that can flexibly handle such heterogeneous skewness and higher-order moments is desirable.

To account for non-Gaussianity, a wide variety of methods have been developed, as reviewed by Yan et al. (2020). A prominent example is the closed-skew normal (CSN) distribution, known for its expressive power (Azzalini, 1985; Azzalini and Capitanio, 1999). However, replacing the normal distribution in a model with a CSN distribution often disrupts the mean and variance properties, which are essential for interpretability. In particular, spatio-temporal models—actively studied in fields such as disease mapping (Best et al., 2005), species distribution modeling (Dormann et al., 2007), and socio-economic analysis (Anselin, 2022) —rely on the mean regression function to interpret the influence of the specified covariates and covariance specified to quantify the strength of spatial and temporal correlations.

Márquez-Urbina and González-Farías (2022) addressed this issue by proposing a flexible subclass of the CSN distribution that maintains the mean and variance. This property enhances the interpretability of parameters such that when the mean is specified by a regression function, the coefficients and variance parameters can be interpreted similarly to those in a conventional Gaussian regression model. Additionally, Tan and Chen (2024) proposed an extended model known as the closed skew normal subclass (CSNS) distribution, which addresses the heterogeneity of higher-order moments. The CSNS distribution still has limitations in the range of achievable skewness and kurtosis, restricting its representational flexibility. Moreover, Tan and Chen (2024) used the CSNS distribution as an approximation in variational inference, rather than as a component of modeling. Notably, although the model’s properties vary depending on the matrix decomposition method used internally, it is restricted to LU decomposition. Similarly, no established Bayesian estimation method is available despite its potential merits for modeling uncertainty in spatio-temporal phenomena (Cressie and Wikle, 2011; Márquez-Urbina and González-Farías, 2022).

To overcome these limitations, we propose a general model that encompasses the CSNS distribution as a special case and accommodates the heterogeneity of higher-order moments while preserving the mean and variance. Subsequently, several properties of the proposed model are presented. In particular, since the characteristics of the model vary depending on the matrix factorization method employed, we explain the specific matrix factorization techniques suitable for adapting to spatio-temporal data. Furthermore, we develop an efficient Bayesian inference method that is particularly effective for temporal data.

The remainder of this paper is organized as follows. Section 2 provides background on the CSN and CSNS distributions. Section 3 introduces the proposed model and its theoretical properties. Section 4 investigates the impact of different matrix factorization methods, particularly in spatio-temporal context. Section 5 develops an efficient Bayesian inference method. Section 6 presents the simulations conducted to the model estimation accuracy, and Section 7 applies the model to the estimation of production functions across U.S. states.

2 CSN and CSNS distribution

The CSN distribution is an extension of the multivariate normal distribution that incorporates higher-order moments. For 𝒔𝒞𝒮𝒩N,M(𝝁,𝚺,𝑫,𝝂,𝛀)\bm{s}\sim\mathcal{CSN}_{N,M}(\bm{\mu},\bm{\Sigma},\bm{D},\bm{\nu},\bm{\Omega}), the probability density function is given by:

p(𝒔)=ϕN(𝒔;𝝁,𝚺)ΦM(𝑫(𝒔𝝁);𝝂,𝛀)ΦM(𝟎M;𝝂,𝛀+𝑫𝚺𝑫).\displaystyle p(\bm{s})=\phi_{N}(\bm{s};\bm{\mu},\bm{\Sigma})\frac{\Phi_{M}(\bm{D}(\bm{s}-\bm{\mu});\bm{\nu},\bm{\Omega})}{\Phi_{M}\left(\bm{0}_{M};\bm{\nu},\bm{\Omega}+\bm{D}\bm{\Sigma}\bm{D}^{\top}\right)}.

Here, 𝝁p\bm{\mu}\in\mathbb{R}^{p}, 𝝂q,𝑫q×p\bm{\nu}\in\mathbb{R}^{q},\bm{D}\in\mathbb{R}^{q\times p}, and 𝚺N×N,𝛀M×M\bm{\Sigma}\in\mathbb{R}^{N\times N},\bm{\Omega}\in\mathbb{R}^{M\times M} are covariance matrices. ϕN(𝒔;𝝁,𝚺)\phi_{N}(\bm{s};\bm{\mu},\bm{\Sigma}) and ΦN(𝒔;𝝁,𝚺)\Phi_{N}(\bm{s};\bm{\mu},\bm{\Sigma}) denotes the probability distribution function and cumulative distribution function of an NN-dimensional normal distribution with mean 𝝁\bm{\mu} and covariance 𝚺\bm{\Sigma}, respectively. Finally, 𝟎M\bm{0}_{M} represents an MM-dimensional zero vector.

As a modified version of the CSN distribution, the CSNS distribution ensures that the mean and variance remain consistent with those of a multivariate normal distribution. The CSNS distribution is derived by first defining the following random variable:

θiui+vi,ui𝒯𝒩(0,1),vi𝒩1(0,1),\displaystyle\theta_{i}u_{i}+v_{i},u_{i}\sim\mathcal{TN}(0,1),v_{i}\sim\mathcal{N}_{1}(0,1),

where 𝒩N(𝝁,𝚺)\mathcal{N}_{N}(\bm{\mu},\bm{\Sigma}) denotes a multivariate normal distribution with mean 𝝁\bm{\mu} and covariance 𝚺\bm{\Sigma}, and 𝒯𝒩(μ,σ2)\mathcal{TN}(\mu,\sigma^{2}) denotes a normal distribution 𝒩1(μ,σ2)\mathcal{N}_{1}(\mu,\sigma^{2}) truncated below 0. The mean of the above random variable is 2/πθi\sqrt{2/\pi}\theta_{i} and the variance is fi2=(π+(π2)θi2)/πf_{i}^{-2}=(\pi+(\pi-2)\theta_{i}^{2})/\pi. To standardize the variable, we define ti=fi(θi(ui2/π)+vi)t_{i}=f_{i}(\theta_{i}(u_{i}-\sqrt{2/\pi})+v_{i}), ensuring that tit_{i} has a mean of 0 and a variance of 11. Setting

𝒔=𝝁+𝚺LU𝒕,\displaystyle\bm{s}=\bm{\mu}+\bm{\Sigma}^{LU}\bm{t}, (1)

guarantees that 𝒔\bm{s} maintains mean 𝝁\bm{\mu} and covariance 𝚺\bm{\Sigma}. Here, 𝒔=(s1,,sN)\bm{s}=(s_{1},\ldots,s_{N})^{\top}, 𝒕=(t1,,tN)\bm{t}=(t_{1},\ldots,t_{N})^{\top}, and 𝚺LU\bm{\Sigma}^{LU} is the LU decomposition of 𝚺\bm{\Sigma}. The distribution of 𝒔\bm{s} can be expressed as the CSN distribution:

𝒞𝒮𝒩N,N\displaystyle\mathcal{CSN}_{N,N} (𝝁𝚺LU𝑭𝝁𝒖,𝚺LU𝑭(𝑰N+𝚯𝚯)𝑭(𝚺LU),\displaystyle(\bm{\mu}-\bm{\Sigma}^{LU}\bm{F}\bm{\mu}_{\bm{u}},\bm{\Sigma}^{LU}\bm{F}(\bm{I}_{N}+\bm{\Theta}\bm{\Theta})\bm{F}(\bm{\Sigma}^{LU})^{\top},
𝚯𝑭(𝚺LU𝑭(𝑰N+𝚯𝚯)1/2)1,𝟎N,𝑰N𝚯𝑭𝑭𝚯),\displaystyle\qquad\bm{\Theta}\bm{F}(\bm{\Sigma}^{LU}\bm{F}(\bm{I}_{N}+\bm{\Theta}\bm{\Theta})^{1/2})^{-1},\bm{0}_{N},\bm{I}_{N}-\bm{\Theta}\bm{F}\bm{F}\bm{\Theta}),

where 𝝁𝒖=(2/πθ1,,2/πθN)\bm{\mu}_{\bm{u}}=(\sqrt{2/\pi}\theta_{1},\ldots,\sqrt{2/\pi}\theta_{N})^{\top}, 𝑭=diag(f1,,fN)\bm{F}=diag(f_{1},\ldots,f_{N}), 𝚯=diag(θ1,,θN)\bm{\Theta}=diag(\theta_{1},\ldots,\theta_{N}), and 𝑰N\bm{I}_{N} is the identity matrix.

Figure 1 illustrates the probability density function of the CSNS distribution with mean 0 and variance 11 for different values of θ\theta. When θ=0\theta=0, the CSNS distribution coincides with the normal distribution. As θ\theta increases, the skewness increases, and the distribution approaches a shifted half-normal distribution.

Refer to caption
Figure 1: Probability density function of the CSNS distribution for various values of θ\theta.

3 Proposed method

In this section, a general model that can handle heterogeneity in higher-order moments while preserving the mean and variance is proposed, based on the idea of the CSNS distribution. In Section 3.1, we provide an overview of the model along with several illustrative examples. Subsequently, in Section 3.2, we discuss the model’s closure under linear transformations as well as skewness and kurtosis.

3.1 A general model for heterogeneous highe-order moments

To derive the proposed model, we first consider an NN-dimensional random variable 𝒔=(s1,,sN)\bm{s}=(s_{1},\ldots,s_{N})^{\top} defined as:

𝒔=𝝁+𝚺1/2𝒕,\displaystyle\bm{s}=\bm{\mu}+\bm{\Sigma}^{1/2}\bm{t}, (2)

where 𝒕\bm{t} is a random variable with independent components of mean 0 and variance 11, and 𝚺1/2\bm{\Sigma}^{1/2} is any matrix satisfying 𝚺1/2(𝚺1/2)=𝚺\bm{\Sigma}^{1/2}(\bm{\Sigma}^{1/2})^{\top}=\bm{\Sigma}. The mean and variance of 𝒔\bm{s} are 𝝁\bm{\mu} and 𝚺\bm{\Sigma}, respectively. As observed in Equation (1), the original CSNS distribution considers the LU decomposition. However, since the mean and variance can be preserved regardless of the choice of matrix decomposition, we employ a more general matrix decomposition in the definitions to facilitate subsequent discussions.

There are many possible ways to design 𝐭=(t1,,tN)T\mathbf{t}=(t_{1},\ldots,t_{N})^{T}. In this study, we consider a random variable that can be expressed as follows:

ti=f(𝜽i)(g(𝒖i,𝜽i)+h(𝒖i,𝜽i)vi),vi𝒩(0,1),\displaystyle t_{i}=f(\bm{\theta}_{i})(g(\bm{u}_{i},\bm{\theta}_{i})+h(\bm{u}_{i},\bm{\theta}_{i})v_{i}),v_{i}\sim\mathcal{N}(0,1),

where 𝒖i\bm{u}_{i} is a d1d_{1} dimensional random variable independent of viv_{i}, and 𝜽i\bm{\theta}_{i} is a d2d_{2} dimensional parameter. The function f:d2f:\mathbb{R}^{d_{2}}\rightarrow\mathbb{R} is the reciprocal of the standard deviation of g(𝒖i,𝜽i)+h(𝒖i,𝜽i)vig(\bm{u}_{i},\bm{\theta}_{i})+h(\bm{u}_{i},\bm{\theta}_{i})v_{i} that ensures standardization. g:d1×d2g:\mathbb{R}^{d_{1}}\times\mathbb{R}^{d_{2}}\rightarrow\mathbb{R} is a random variable with mean 0. h:d1×d2h:\mathbb{R}^{d_{1}}\times\mathbb{R}^{d_{2}}\rightarrow\mathbb{R} is chosen so that the variance of g(𝒖i,𝜽i)+h(𝒖i,𝜽i)vig(\bm{u}_{i},\bm{\theta}_{i})+h(\bm{u}_{i},\bm{\theta}_{i})v_{i} is positive. We adopted the following notations fi=f(𝜽i)f_{i}=f(\bm{\theta}_{i}), gi=g(𝒖i,𝜽i)g_{i}=g(\bm{u}_{i},\bm{\theta}_{i}), and hi=h(𝒖i,𝜽i)h_{i}=h(\bm{u}_{i},\bm{\theta}_{i}).

Table 1 summarizes the specific instances of the proposed model. Here, 𝒢(a,b)\mathcal{IG}(a,b) denotes an inverse gamma distribution with parameters aa and bb, (μ)\mathcal{E}(\mu) denotes an exponential distribution with mean μ\mu, and 𝒩(μ,λ)\mathcal{IN}(\mu,\lambda) denotes an inverse Gaussian distribution with parameters μ\mu and λ\lambda.

fif_{i} gig_{i} hih_{i} p(𝒖i)p(\bm{u}_{i})
t θi2θi\sqrt{\frac{\theta_{i}-2}{\theta_{i}}} 0 ui\sqrt{u_{i}} 𝒢(θi/2,θi/2)\mathcal{IG}(\theta_{i}/2,\theta_{i}/2)
Laplace 11 0 ui\sqrt{u_{i}} (1)\mathcal{E}(1)
NIG θi2θi12+θi2\sqrt{\frac{\theta_{i2}}{\theta_{i1}^{2}+\theta_{i2}}} θi1(ui1)\theta_{i1}(u_{i}-1) ui\sqrt{u_{i}} 𝒩(1,θi2)\mathcal{IN}(1,\theta_{i2})
AL 1θi2+1\frac{1}{\sqrt{\theta_{i}^{2}+1}} θi(ui1)\theta_{i}(u_{i}-1) ui\sqrt{u_{i}} (1)\mathcal{E}(1)
SN ππ+(π2)θi2\sqrt{\frac{\pi}{\pi+(\pi-2)\theta_{i}^{2}}} θi(ui2π)\theta_{i}(u_{i}-\sqrt{\frac{2}{\pi}}) 11 𝒯𝒩(0,1)\mathcal{TN}(0,1)
ST π(θi22)θi2(π+(π2)θi12)\sqrt{\frac{\pi(\theta_{i2}-2)}{\theta_{i2}(\pi+(\pi-2)\theta_{i1}^{2})}} ui2θi1(ui12π)\sqrt{u_{i2}}\theta_{i1}(u_{i1}-\sqrt{\frac{2}{\pi}}) ui2\sqrt{u_{i2}}
ui1𝒯𝒩(0,1)u_{i1}\sim\mathcal{TN}(0,1)
ui2𝒢(θi22,θi22)u_{i2}\sim\mathcal{IG}(\frac{\theta_{i2}}{2},\frac{\theta_{i2}}{2})
Table 1: Examples of the proposed model.

First, a simple example is the t-distribution, which requires θi>2\theta_{i}>2 for the mean and variance to exist. Similarly, the Laplace distribution can be considered. Both distributions are symmetric with zero skewness. A more generalized class of distributions that allow for non-zero skewness is the family of normal mean-variance mixtures. Examples of such distributions include the normal-inverse Gaussian (NIG) and asymmetric Laplace (AL) distributions. A further generalization of the NIG distribution is the generalized hyperbolic distribution.

The most important example is the skew normal (SN) distribution, because, as discussed in the previous section, 𝒔\bm{s} follows the CSNS distribution, which is a subclass of the CSN distribution. Moreover, the CSN distribution and its derivative, the unified skew normal distribution, are known to exhibit conjugacy in linear regression, probit, multinomial probit, and Tobit models, making them an active area of research in recent years (Niccolò et al., 2023). Additionally, as a scale mixture of the SN distribution, the skew-t distribution (ST) emerges as a natural extension. An important property of this model is that it is closed under linear transformations of the SN distribution through scale mixtures (Wang et al., 2024).

Furthermore, since the distributions of each tit_{i} do not necessarily have to be identical, more general models, such as combinations of multiple distributions, can also be treated within the same framework.

3.2 Properties of the proposed model

An important property of this model is that it is closed under linear transformations. Specifically, for any N×NN\times N matrix 𝑨\bm{A}, we have 𝑨𝒔=𝑨𝝁+𝑨𝚺1/2𝒕\bm{A}\bm{s}=\bm{A}\bm{\mu}+\bm{A}\bm{\Sigma}^{1/2}\bm{t}, indicating that the model is closed under linear transformations.

Furthermore, if the third and fourth moments of tit_{i} exist, the skewness and kurtosis can be analytically derived as follows;

Skewness(si)\displaystyle Skewness(s_{i}) =j=1Nγj(Σi,j1/2)3(Σi,i)3/2\displaystyle=\frac{\sum_{j=1}^{N}\gamma_{j}(\Sigma^{1/2}_{i,j})^{3}}{(\Sigma_{i,i})^{3/2}}
Kurtosis(si)\displaystyle Kurtosis(s_{i}) =j=1Nδj(Σi,j1/2)4(Σi,i)23\displaystyle=\frac{\sum_{j=1}^{N}\delta_{j}(\Sigma^{1/2}_{i,j})^{4}}{(\Sigma_{i,i})^{2}}-3

where Σi,j\Sigma_{i,j} is the element in the iith row and jjth column of 𝚺\bm{\Sigma}, and Σi,j1/2\Sigma^{1/2}_{i,j} is the element in the iith row and jjth column of 𝚺1/2\bm{\Sigma}^{1/2}; γi\gamma_{i} represents the third moment of xix_{i}; and δi\delta_{i} denotes its fourth moment. Table 2 presents the third and fourth moments for each model. In the case of the tt-distribution, the third moment exists only if θi>3\theta_{i}>3, and the fourth moment exists only if θi>4\theta_{i}>4. Similarly, in the case of the ST distribution, the third moment exists only if θi2>3\theta_{i2}>3, and the fourth moment exists only if θi2>4\theta_{i2}>4.

Third moment Fourth moment
t 0 3+6θi43+\frac{6}{\theta_{i}-4}
Laplace 0 66
NIG 3θi1θi2(θi12+θi2)\frac{3\theta_{i1}}{\sqrt{\theta_{i2}(\theta_{i1}^{2}+\theta_{i2})}} 3+15θi212θi12+θi23+\frac{15}{\theta_{i2}}-\frac{12}{\theta_{i1}^{2}+\theta_{i2}}
AL θi(3+2θi3)(1+θi2)3/2\frac{\theta_{i}(3+2\theta_{i}^{3})}{(1+\theta_{i}^{2})^{3/2}} 93(1+θi2)29-\frac{3}{(1+\theta_{i}^{2})^{2}}
SN 2(4π)θi3(π+(π2)θi2)3/2\frac{\sqrt{2}(4-\pi)\theta_{i}^{3}}{(\pi+(\pi-2)\theta_{i}^{2})^{3/2}} 3+4θi2(2(π3)θi23(π2))(π+(π2)θi2)23+\frac{4\theta_{i}^{2}(2(\pi-3)\theta_{i}^{2}-3(\pi-2))}{(\pi+(\pi-2)\theta_{i}^{2})^{2}}
ST (4π)2Γ((θi23)/2)Γ(θi2/2)(θi1θi23π+(π2)θi12)3\frac{(4-\pi)}{2}\frac{\Gamma((\theta_{i2}-3)/2)}{\Gamma(\theta_{i2}/2)}\left(\frac{\theta_{i1}\sqrt{\theta_{i2}-3}}{\sqrt{\pi+(\pi-2)\theta_{i1}^{2}}}\right)^{3} θi22θi24(3+4θi12(2(π3)θi123(π2))(π+(π2)θi12)2)\frac{\theta_{i2}-2}{\theta_{i2}-4}\left(3+\frac{4\theta_{i1}^{2}(2(\pi-3)\theta_{i1}^{2}-3(\pi-2))}{(\pi+(\pi-2)\theta_{i1}^{2})^{2}}\right)
Table 2: Third and fourth moments of distributions.

Mardia’s skewness and kurtosis can also be analytically derived, and both are independent of the mean and variance. For two random variables 𝒔\bm{s} and 𝒔\bm{s}^{*} following the same distribution with mean 𝝁\bm{\mu} and covariance matrix 𝚺\bm{\Sigma}, Mardia’s skewness MS(𝒔)\mathrm{MS}(\bm{s}) and kurtosis MK(𝒔)\mathrm{MK}(\bm{s}) follow the expressions given below:

MS(𝒔)\displaystyle\mathrm{MS}(\bm{s}) =𝔼[((𝒔𝝁)𝚺1(𝒔𝝁))3],\displaystyle=\mathbb{E}[((\bm{s}-\bm{\mu})^{\top}\bm{\Sigma}^{-1}(\bm{s}^{*}-\bm{\mu}))^{3}],
MK(𝒔)\displaystyle\mathrm{MK}(\bm{s}) =𝔼[((𝒔𝝁)𝚺1(𝒔𝝁))2].\displaystyle=\mathbb{E}[((\bm{s}-\bm{\mu})^{\top}\bm{\Sigma}^{-1}(\bm{s}-\bm{\mu}))^{2}].

Since 𝒔=𝝁+𝚺1/2𝒕\bm{s}=\bm{\mu}+\bm{\Sigma}^{1/2}\bm{t},

MS(𝒔)\displaystyle\mathrm{MS}(\bm{s}) =𝔼[(𝒕𝒕)3]=i=1Nγi2,\displaystyle=\mathbb{E}[(\bm{t}^{\top}\bm{t}^{*})^{3}]=\sum_{i=1}^{N}\gamma_{i}^{2},
MK(𝒔)\displaystyle\mathrm{MK}(\bm{s}) =𝔼[(𝒕𝒕)2]=N(N+2)+i=1Nδi.\displaystyle=\mathbb{E}[(\bm{t}^{\top}\bm{t})^{2}]=N(N+2)+\sum_{i=1}^{N}\delta_{i}.

These properties imply that the first, second, and higher-order moments can be designed independently.

4 Application to spatio-temporal modeling

In the proposed model, characteristics vary based on the matrix decomposition method. Although the scope of this model is not necessarily limited to spatial or temporal data, it is well-suited for explaining changes in model properties. Additionally, since spatial and temporal models are often designed based on the mean and variance, preserving these statistical properties. In Section 4.1, we consider applications to spatial models, followed by applications to temporal models in Section 4.2.

4.1 Application to spatial modeling

Many spatial models, such as the spatial autoregressive (SAR) and the conditional autoregressive (CAR) model, capture spatial correlations by assuming that the random variable of interest follows NK(𝝁,𝚺)N_{K}(\bm{\mu},\bm{\Sigma}) and designing 𝚺\bm{\Sigma} accordingly, where KK denotes the number of locations.

When applying the proposed model to spatial data, we substitute the mean and variance of the spatial model into the corresponding parameters of the proposed model. However, the model’s properties can significantly vary depending on the chosen matrix decomposition method. For example, similar to the standard SAR model, consider a model 𝒚=η𝑾~𝒛+𝒔\bm{y}=\eta\tilde{\bm{W}}\bm{z}+\bm{s}. where η\eta is a parameter that determines the strength of the spatial correlation, and 𝑾~\tilde{\bm{W}} is the row-standardized adjacency matrix. Here, we assume that 𝒔\bm{s} follows the proposed model, that is, it satisfies Equation (2). In this case, 𝒚=(𝑰η𝑾~)1𝝁+(𝑰η𝑾~)1𝚺1/2𝒕\bm{y}=(\bm{I}-\eta\tilde{\bm{W}})^{-1}\bm{\mu}+(\bm{I}-\eta\tilde{\bm{W}})^{-1}\bm{\Sigma}^{1/2}\bm{t} follows the model, because of the property of closure under linear transformations. This indicates that using (𝑰Kη𝑾~)1𝚺1/2(\bm{I}_{K}-\eta\tilde{\bm{W}})^{-1}\bm{\Sigma}^{1/2} for matrix decomposition is natural, and properties are preserved.

On the other hand, for CAR models, and more generally for data where the column-wise ordering has no inherent meaning, the standard practice is to use the matrix square root. For example, in the Leroux model, which is a type of CAR model, the covariance matrix is specified as follows:

𝚺\displaystyle\bm{\Sigma} =ω2𝑸1\displaystyle=\omega^{2}\bm{Q}^{-1} (3)
𝑸\displaystyle\bm{Q} =η(diag(𝑾𝟏K)𝑾)+(1η)𝑰K,\displaystyle=\eta({\rm diag}(\bm{W}\bm{1}_{K})-\bm{W})+(1-\eta)\bm{I}_{K},

where 𝑾\bm{W} is the adjacency matrix. The reason for using the matrix square root in the CAR model is that the model constructed using the matrix square root exhibits symmetry with respect to changes in the ordering of the data, whereas the model using the Cholesky decomposition or LU decomposition is not symmetric under permutation. Let the permuted random variable of 𝒚\bm{y}, which follows the proposed model, be 𝒚~=(yχ(1),,yχ(N))\tilde{\bm{y}}=(y_{\chi(1)},\ldots,y_{\chi(N)})^{\top}, where 𝑷\bm{P} is the permutation matrix such that 𝑷𝒚=𝒚~\bm{P}\bm{y}=\tilde{\bm{y}}. Consider the randomly permuted variable 𝑷𝒚=𝑷𝝁+𝑷𝚺S𝒙\bm{P}\bm{y}=\bm{P}\bm{\mu}+\bm{P}\bm{\Sigma}^{S}\bm{x} when using the matrix square root 𝚺S\bm{\Sigma}^{S}. Since the matrix square root of 𝚺~=𝑷𝚺𝑷\tilde{\bm{\Sigma}}=\bm{P}\bm{\Sigma}\bm{P}^{\top} is given by 𝑷𝚺S𝑷\bm{P}\bm{\Sigma}^{S}\bm{P}^{\top}, we obtain

𝑷𝒚=𝑷𝝁+𝑷𝚺S𝑷𝑷𝒙=𝝁~+𝚺~S𝒙~\displaystyle\bm{P}\bm{y}=\bm{P}\bm{\mu}+\bm{P}\bm{\Sigma}^{S}\bm{P}^{\top}\bm{P}\bm{x}=\tilde{\bm{\mu}}+\tilde{\bm{\Sigma}}^{S}\tilde{\bm{x}}

which implies that the distribution of 𝑷𝒚\bm{P}\bm{y} is identical to that of 𝒚~\tilde{\bm{y}}. On the other hand, for general matrix decompositions such as Cholesky decomposition or LU decomposition, 𝚺~1/2\tilde{\bm{\Sigma}}^{1/2} does not satisfy 𝚺~1/2=𝑷𝚺1/2𝑷\tilde{\bm{\Sigma}}^{1/2}=\bm{P}\bm{\Sigma}^{1/2}\bm{P}^{\top}, leading to the loss of symmetry with respect to permutation.

If the symmetry does not hold under permutation, a change in the order of the data can affect the estimation results. To illustrate how the choice of the matrix decomposition method impacts symmetry, Figure 2 shows the skewness at each point on a 3×33\times 3 grid for the model under different matrix decomposition methods. When using Cholesky decomposition, the skewness lacks symmetry, and reordering the data sequence causes random variations in skewness. In contrast, when using the matrix square root, the skewness exhibits symmetry and remains unchanged regardless of the order of points. This property holds not only for skewness but for the entire distribution. Consequently, using the matrix square root is generally preferable to decompose the variance matrix in cases where the ordering of data points is arbitrary.

Notably, when using the CAR model, spatial Markov properties are not preserved, although this is rarely a practical concern. Therefore, it is more appropriate to consider the CAR model merely as one of the design methods in constructing covariance matrices.

Refer to caption
(a) Skewness of CSNS distribution using the Cholesky decomposition.
Refer to caption
(b) Skewness of CSNS distribution using the matrix square root.
Figure 2: Skewness of the CSNS when using different matrix decomposition methods. 𝚺\bm{\Sigma} is given by (0.5diag(𝑾𝟏9𝑾)+0.5𝑰9)1(0.5\,\text{diag}(\bm{W}\bm{1}_{9}-\bm{W})+0.5\bm{I}_{9})^{-1}. (Left) Skewness when data are arranged sequentially from the top left; (Right) Skewness when using square root decomposition with data order randomly shuffled.

4.2 Application to temporal modeling

To incorporate the proposed model into a time series framework, first, we consider an \ell-order autoregressive (AR) model with the error term following the proposed distribution:

𝒔τ\displaystyle\bm{s}_{\tau} =𝑨1(𝒔τ1𝝁τ1)++𝑨τ(𝒔τ𝝁τ)+𝝃τ,\displaystyle=\bm{A}_{1}(\bm{s}_{\tau-1}-\bm{\mu}_{\tau-1})+\cdots+\bm{A}_{\tau}(\bm{s}_{\tau-\ell}-\bm{\mu}_{\tau-\ell})+\bm{\xi}_{\tau}, (4)

where τ>\tau>\ell, 𝑨1,,𝑨N×N\bm{A}_{1},\ldots,\bm{A}_{\ell}\in\mathbb{R}^{N\times N}, and 𝒔1=𝝁1+𝚺1/2𝒕1,,𝒔=𝝁+𝚺1/2𝒕\bm{s}_{1}=\bm{\mu}_{1}+\bm{\Sigma}^{1/2}\bm{t}_{1},\ldots,\bm{s}_{\ell}=\bm{\mu}_{\ell}+\bm{\Sigma}^{1/2}\bm{t}_{\ell}. Here, we assume that 𝝃τ\bm{\xi}_{\tau} follows Equation (2).

The AR model is closed under the proposed model, and its mean and variance remain the same when assuming multivariate normal distribution is assumed for the error terms. Specifically, the model represented by Equation (4) is equivalent to the model expressed below:

𝒔1:T=𝝁1:T+𝑹C(𝑰T𝚺1/2)𝒕1:T,\displaystyle\bm{s}_{1:T}=\bm{\mu}_{1:T}+\bm{R}^{C}(\bm{I}_{T}\otimes\bm{\Sigma}^{1/2})\bm{t}_{1:T},

where TT is length of time; \otimes denotes the Kroneker product; 𝒔1:T=(𝒔1,,𝒔T)\bm{s}_{1:T}=(\bm{s}_{1}^{\top},\ldots,\bm{s}_{T}^{\top})^{\top}, 𝒕1:T=(𝒕1,,𝒕T)\bm{t}_{1:T}=(\bm{t}_{1}^{\top},\ldots,\bm{t}_{T}^{\top})^{\top}, and 𝝁1:T=(𝝁1,,𝝁T)\bm{\mu}_{1:T}=(\bm{\mu}_{1}^{\top},\ldots,\bm{\mu}_{T}^{\top})^{\top}. 𝑹C\bm{R}^{C} is the Cholesky decompositon of the covariance matrix of the AR model when 𝚺=𝑰K\bm{\Sigma}=\bm{I}_{K}. Each element of this submatrix is given as follows:

(𝑹C)τ1,τ1=𝑰N,\displaystyle(\bm{R}^{C})_{\tau_{1},\tau_{1}}=\bm{I}_{N}, for τ1=1,,T,\displaystyle\text{ for }\tau_{1}=1,\ldots,T,
(𝑹C)τ1,τ2=𝑶N,\displaystyle(\bm{R}^{C})_{\tau_{1},\tau_{2}}=\bm{O}_{N}, for 1τ2<τ1τ,\displaystyle\text{ for }1\leq\tau_{2}<\tau_{1}\leq\tau,
(𝑹C)τ1,τ2=s=1τ𝑨s(𝑹C)τ1s,τ2,\displaystyle(\bm{R}^{C})_{\tau_{1},\tau_{2}}=\sum_{s=1}^{\tau}\bm{A}_{s}(\bm{R}^{C})_{\tau_{1}-s,\tau_{2}}, for ττ2<τ1T,\displaystyle\text{ for }\tau\leq\tau_{2}<\tau_{1}\leq T,

where (𝑹C)τ1,τ2(\bm{R}^{C})_{\tau_{1},\tau_{2}} is the submatrix of 𝑹C\bm{R}^{C}, consisting of rows (τ11)N+1(\tau_{1}-1)N+1 to τ1N\tau_{1}N and columns (τ21)N+1(\tau_{2}-1)N+1 to τ2N\tau_{2}N, and 𝑶N\bm{O}_{N} is the N×NN\times N zero matrix.

Unlike spatial models, the appearance of the Cholesky decomposition in the AR model can be explained by the fact that the order of data in the temporal domain carries significant meaning. This distinction can also be understood from the fact that spatial adjacency is represented by an undirected graph, whereas temporal adjacency is represented by a directed graph.

5 Bayesian inference

This section proposes a Bayesian estimation method for the parameters of the proposed model. Section 5.1 describes a Bayesian inference approach using the data augmentation method for general models, including spatial models. Since the data augmentation method is particularly effective for time series models, its application is detailed in Section 5.2.

5.1 Bayesian inference for the proposed model

To efficiently perform Bayesian inference for the posterior distribution of the proposed model, we propose a sampling method that combines data augmentation and Gibbs sampling. Firstly, 𝒔𝒖,𝜽,𝝁,𝚺\bm{s}\mid\bm{u},\bm{\theta},\bm{\mu},\bm{\Sigma} follows a multivariate normal distribution:

𝒩N(𝝁+𝚺1/2𝑭𝒈,𝚺1/2𝑭𝑯𝑯𝑭(𝚺1/2)),\displaystyle\mathcal{N}_{N}(\bm{\mu}+\bm{\Sigma}^{1/2}\bm{F}\bm{g},\bm{\Sigma}^{1/2}\bm{F}\bm{H}\bm{H}\bm{F}(\bm{\Sigma}^{1/2})^{\top}),

where 𝑭=diag(f1,,fN)\bm{F}=diag(f_{1},\ldots,f_{N}), 𝒈=(g1,,gN)\bm{g}=(g_{1},\ldots,g_{N})^{\top}, 𝑯=diag(h1,,hN)\bm{H}=diag(h_{1},\ldots,h_{N}), with the elements depending on 𝒖\bm{u} and 𝜽\bm{\theta} (see Table 1). Therefore, 𝒔\bm{s} conditioned on 𝒖\bm{u} can be treated in the same manner as in models with a standard normal prior distribution. For example, given the observation model

𝒚=𝒔+𝜺,𝜺𝒩N(𝟎N,σ2𝑰N),\displaystyle\bm{y}=\bm{s}+\bm{\varepsilon},\quad\bm{\varepsilon}\sim\mathcal{N}_{N}(\bm{0}_{N},\sigma^{2}\bm{I}_{N}),

the posterior distribution of 𝒔𝒚,𝒖,𝜽,𝝁,𝚺\bm{s}\mid\bm{y},\bm{u},\bm{\theta},\bm{\mu},\bm{\Sigma} can be expressed analytically as follows:

𝒩N(𝝁𝒔+𝚺𝒔(𝚺𝒔+σ2𝑰N)1(𝒚𝝁𝒔),𝚺𝒔+𝚺𝒔(𝚺𝒔+σ2𝑰N)1𝚺𝒔),\displaystyle\mathcal{N}_{N}(\bm{\mu}_{\bm{s}}+\bm{\Sigma}_{\bm{s}}(\bm{\Sigma}_{\bm{s}}+\sigma^{2}\bm{I}_{N})^{-1}(\bm{y}-\bm{\mu}_{\bm{s}}),\bm{\Sigma}_{\bm{s}}+\bm{\Sigma}_{\bm{s}}(\bm{\Sigma}_{\bm{s}}+\sigma^{2}\bm{I}_{N})^{-1}\bm{\Sigma}_{\bm{s}}),

where 𝝁𝒔=𝝁+𝚺1/2𝑭𝒈\bm{\mu}_{\bm{s}}=\bm{\mu}+\bm{\Sigma}^{1/2}\bm{F}\bm{g} and 𝚺𝒔=𝚺1/2𝑭𝑯𝑯𝑭(𝚺1/2)\bm{\Sigma}_{\bm{s}}=\bm{\Sigma}^{1/2}\bm{F}\bm{H}\bm{H}\bm{F}(\bm{\Sigma}^{1/2})^{\top}. Similarly, if the prior for 𝝁\bm{\mu} is a multivariate normal distribution, then 𝝁𝒔,𝒖,𝜽,𝚺\bm{\mu}\mid\bm{s},\bm{u},\bm{\theta},\bm{\Sigma} also follows a multivariate normal distribution. This property is useful for deriving coefficients in linear regression and other related models. The variables 𝒖,𝜽\bm{u},\bm{\theta} influences 𝒔\bm{s} only through 𝒕=𝚺1/2(𝒔𝝁)\bm{t}=\bm{\Sigma}^{-1/2}(\bm{s}-\bm{\mu}). Using this property, the elements of 𝒖,𝜽\bm{u},\bm{\theta} may be sampled individually from ui,θitiu_{i},\theta_{i}\mid t_{i} instead of 𝒖,𝜽𝒔,𝝁,𝚺\bm{u},\bm{\theta}\mid\bm{s},\bm{\mu},\bm{\Sigma} after sampling 𝒔\bm{s}. Therefore, the sampling from the posterior distribution uiti,θiu_{i}\mid t_{i},\theta_{i} and θiti,ui\theta_{i}\mid t_{i},u_{i} can be implemented in parallel. Additionally, in some distributions, such as the t, Laplace, CSN distributions, p(uiti,θi)p(u_{i}\mid t_{i},\theta_{i}) is represented analytically. For example, in a CSNS distribution,

uiti,θi𝒯𝒩(θi1+θi2((π2)θi2+ππti+2πθi),11+θi2).\displaystyle u_{i}\mid t_{i},\theta_{i}\sim\mathcal{TN}\left(\frac{\theta_{i}}{1+\theta_{i}^{2}}\left(\sqrt{\frac{(\pi-2)\theta_{i}^{2}+\pi}{\pi}}t_{i}+\sqrt{\frac{2}{\pi}}\theta_{i}\right),\frac{1}{1+\theta_{i}^{2}}\right).

This method significantly reduces computational costs. Sampling from an NN-dimensional CSN distribution requires a computational complexity of O(N3)O(N^{3}) when sampling from an NN-dimensional multivariate truncated normal distribution (Botev, 2017). In contrast, the sampling method described above only requires sampling uiu_{i} from independently distributed truncated normal distributions, resulting in a computational complexity of O(N)O(N).

5.2 Bayesian inference for temporal model

As the AR model is closed under the proposed framework, the aforementioned data augmentation method can also be applied to the AR model. Furthermore, because the transitions of 𝒔1:T\bm{s}_{1:T} form a linear Gaussian state-space model, a more efficient method known as forward filtering backward sampling (FFBS) method can be applied (Carter and Kohn, 1994; Frühwirth-Schnatter, 1994). For simplicity, we consider only the first-order AR here; however, the following techniques can generally be applied to higher-order models as well. The model with observation noise is expresed as follows:

𝒚τ\displaystyle\bm{y}_{\tau} =𝒔τ+𝜺τ,\displaystyle=\bm{s}_{\tau}+\bm{\varepsilon}_{\tau}, (5)
𝒔τ\displaystyle\bm{s}_{\tau} =𝑨1(𝒔τ1𝝁τ1)+𝝃τ,\displaystyle=\bm{A}_{1}(\bm{s}_{\tau-1}-\bm{\mu}_{\tau-1})+\bm{\xi}_{\tau},
𝜺τ\displaystyle\bm{\varepsilon}_{\tau} 𝒩N(𝟎N,σ2𝑰N)\displaystyle\sim\mathcal{N}_{N}(\bm{0}_{N},\sigma^{2}\bm{I}_{N})
𝝃τ\displaystyle\bm{\xi}_{\tau} =𝝁τ+𝚺1/2𝒕τ.\displaystyle=\bm{\mu}_{\tau}+\bm{\Sigma}^{1/2}\bm{t}_{\tau}.

Regarding the posterior distribution of 𝒔1:T\bm{s}_{1:T}, conditioned on 𝒖1:T\bm{u}_{1:T}, it becomes a normal distribution allowing the use of the FFBS method. Let 𝝁τ1|τ2\bm{\mu}_{\tau_{1}|\tau_{2}} and 𝚺τ1|τ2\bm{\Sigma}_{\tau_{1}|\tau_{2}} be the mean and the variance of 𝒔τ1|𝒚1:τ2,𝒖1:τ2,𝜽1:τ2,𝝁1:τ2,𝚺,𝑨1,σ\bm{s}_{\tau_{1}}|\bm{y}_{1:\tau_{2}},\bm{u}_{1:\tau_{2}},\bm{\theta}_{1:\tau_{2}},\bm{\mu}_{1:\tau_{2}},\bm{\Sigma},\bm{A}_{1},\sigma, then

𝝁ττ1\displaystyle\bm{\mu}_{\tau\mid\tau-1} =𝑨1𝝁t1|t1+𝝁t+𝚺1/2𝑭𝒈,\displaystyle=\bm{A}_{1}\bm{\mu}_{t-1|t-1}+\bm{\mu}_{t}+\bm{\Sigma}^{1/2}\bm{F}\bm{g},
𝚺ττ1\displaystyle\bm{\Sigma}_{\tau\mid\tau-1} =𝑨1𝚺t1|t1𝑨1+𝚺1/2𝑭𝑯𝑯𝑭(𝚺1/2),\displaystyle=\bm{A}_{1}\bm{\Sigma}_{t-1|t-1}\bm{A}_{1}^{\top}+\bm{\Sigma}^{1/2}\bm{F}\bm{H}\bm{H}\bm{F}(\bm{\Sigma}^{1/2})^{\top},
𝝁ττ\displaystyle\bm{\mu}_{\tau\mid\tau} =𝝁ττ1+1σ2𝚺ττ(𝒚t𝝁ττ1),\displaystyle=\bm{\mu}_{\tau\mid\tau-1}+\frac{1}{\sigma^{2}}\bm{\Sigma}_{\tau\mid\tau}(\bm{y}_{t}-\bm{\mu}_{\tau\mid\tau-1}),
𝚺ττ\displaystyle\bm{\Sigma}_{\tau\mid\tau} =σ2(𝚺ττ1+σ2𝑰K)1𝚺ττ1,\displaystyle=\sigma^{2}(\bm{\Sigma}_{\tau\mid\tau-1}+\sigma^{2}\bm{I}_{K})^{-1}\bm{\Sigma}_{\tau\mid\tau-1},

where 𝝁1|0\bm{\mu}_{1|0} is 𝝁1+𝚺1/2𝑭𝒈\bm{\mu}_{1}+\bm{\Sigma}^{1/2}\bm{F}\bm{g}, and variance is 𝚺1|0=𝚺1/2𝑭𝑯𝑯𝑭(𝚺1/2)\bm{\Sigma}_{1|0}=\bm{\Sigma}^{1/2}\bm{F}\bm{H}\bm{H}\bm{F}(\bm{\Sigma}^{1/2})^{\top}. By using the above formulation iteratively, the distribution of 𝒔τ|𝒚1:τ,𝒖1:τ,𝜽1:τ,𝝁1:τ,𝚺,𝑨1,σ\bm{s}_{\tau}|\bm{y}_{1:\tau},\bm{u}_{1:\tau},\bm{\theta}_{1:\tau},\bm{\mu}_{1:\tau},\bm{\Sigma},\bm{A}_{1},\sigma can be calculated for τ=1,,T\tau=1,\ldots,T. After calculation, 𝒔τ|𝒚1:T,𝒖1:T,𝜽1:T,𝝁1:T,𝚺,𝑨1,σ\bm{s}_{\tau}|\bm{y}_{1:T},\bm{u}_{1:T},\bm{\theta}_{1:T},\bm{\mu}_{1:T},\bm{\Sigma},\bm{A}_{1},\sigma can be sequentially backward sampled:

𝝁τ|T\displaystyle\bm{\mu}_{\tau|T} =𝝁τ|t+𝑨1𝚺ττ𝚺τ+1|τ1(𝒔τ+1|T(i)𝝁τ+1|τ),\displaystyle=\bm{\mu}_{\tau|t}+\bm{A}_{1}\bm{\Sigma}_{\tau\mid\tau}\bm{\Sigma}_{\tau+1|\tau}^{-1}(\bm{s}_{\tau+1|T}^{(i)}-\bm{\mu}_{\tau+1|\tau}),
𝚺τ|T\displaystyle\bm{\Sigma}_{\tau|T} =𝚺τ|τ𝚺τ|τ𝚺τ+1|τ1𝚺τ|τ,\displaystyle=\bm{\Sigma}_{\tau|\tau}-\bm{\Sigma}_{\tau|\tau}\bm{\Sigma}_{\tau+1|\tau}^{-1}\bm{\Sigma}_{\tau|\tau},

where 𝒔τ+1|T(i)\bm{s}_{\tau+1|T}^{(i)} is a sample of 𝒔τ+1|𝒚1:T,𝒖1:T,𝜽1:T,𝝁1:T,𝚺,𝑨1,σ\bm{s}_{\tau+1}|\bm{y}_{1:T},\bm{u}_{1:T},\bm{\theta}_{1:T},\bm{\mu}_{1:T},\bm{\Sigma},\bm{A}_{1},\sigma.

6 Simulation studies

This study investigated the inverse estimation performance for spatio-temporal data through simulations. Specifically, using the case of the CSNS distribution as an example, we examined whether the estimation performance improves by properly modeling heterogeneity in higher-order moments. To this end, four types of heterogeneity were investigated. The models are defined as follows: the model with θtk=0\theta_{tk}=0 for t=1,,T,k=1,,Kt=1,\ldots,T,k=1,\ldots,K is referred to as the zero theta (ZT) model, corresponding to the Gaussian regression model; the model with θtk=θ\theta_{tk}=\theta for t=1,,T,k=1,,Kt=1,\ldots,T,k=1,\ldots,K is referred to as the fixed theta (FT) model; the model with θt1==θtK=θt\theta_{t1}=\cdots=\theta_{tK}=\theta_{t} for t=1,,Tt=1,\ldots,T is referred to as the temporal varying theta (TT) model; and the model with θ1k==θTk\theta_{1k}=\cdots=\theta_{Tk} for k=1,,Kk=1,\ldots,K is referred to as the spatial varying theta (ST) model.

The observation model is as follows:

𝒚τ\displaystyle\bm{y}_{\tau} =β1𝒙τ+β2𝟏K+𝒔τ+𝜺τ,\displaystyle=\beta_{1}\bm{x}_{\tau}+\beta_{2}\bm{1}_{K}+\bm{s}_{\tau}+\bm{\varepsilon}_{\tau},
𝒔τ\displaystyle\bm{s}_{\tau} =ρ𝒔τ1+𝝃τ,\displaystyle=\rho\bm{s}_{\tau-1}+\bm{\xi}_{\tau},
𝜺τ\displaystyle\bm{\varepsilon}_{\tau} 𝒩N(𝟎N,σ2𝑰N)\displaystyle\sim\mathcal{N}_{N}(\bm{0}_{N},\sigma^{2}\bm{I}_{N})

where 𝝃τ\bm{\xi}_{\tau} follows a CSNS distribution, depending on the θtk\theta_{tk} parameter, using the matrix square root. The covariance matrix 𝚺\bm{\Sigma}, which models spatial correlation, is assumed to follow Equation (3). The time length TT for parameter estimation was set to 5050, whereas the time length for evaluating prediction accuracy, TfutureT_{future}, was set to 22. The location is represented by a 5×55\times 5 grid. The adjacency matrix is defined by assigning a value of 11 to the neighboring pairs that share a border with separating grids, and 0 to all others. The features 𝒙τ\bm{x}_{\tau} is generated as a random sample from the standard normal distribution.

The true values for the simulation model for each case are as follows: in Case 1, we used the ZT model; in Case 2, the FT model with θ=2.5\theta=2.5; in Case 3, the TT model with θτN(2.5,32),t=1,,T\theta_{\tau}\sim N(2.5,3^{2}),t=1,\ldots,T, where θτ\theta_{\tau} were generated as random variables; and in Case 4, the ST model with θkN(2.5,32),k=1,,K\theta_{k}\sim N(2.5,3^{2}),k=1,\ldots,K, where θk\theta_{k} is generated as random variables. For the remaining parameters, the coefficient β1\beta_{1} for randomly generated features was set to 1.01.0, intercept β2\beta_{2} to 0.50.5, standard deviation of the observation noise σ\sigma to 0.10.1, coefficient of the temporal AR model ρ\rho to 0.50.5, parameter ω\omega, which determines the variability of 𝜽\bm{\theta}, to 1.01.0, and parameter η\eta, which controls the strength of spatial correlation, to 0.50.5.

The prior distributions for the estimation model were shared across all four models, following Lee et al. (2018). We employ a hierarchical distribution as the prior distributions for 𝜽1:T\bm{\theta}_{1:T}. The priors are as follows:

θi\displaystyle\theta_{i} 𝒩1(μθ,νθ2),\displaystyle\sim\mathcal{N}_{1}(\mu_{\theta},\nu_{\theta}^{2}),
μθ\displaystyle\mu_{\theta} 𝒩1(0,100νθ2),\displaystyle\sim\mathcal{N}_{1}(0,100\nu_{\theta}^{2}),
νθ2\displaystyle\nu_{\theta}^{2} 𝒢(1,0.01)\displaystyle\sim\mathcal{IG}(1,0.01)
𝜷\displaystyle\bm{\beta} 𝒩p(𝟎p,100𝑰p),\displaystyle\sim\mathcal{N}_{p}(\bm{0}_{p},100\bm{I}_{p}),
σ2\displaystyle\sigma^{2} 𝒢(1,0.01),\displaystyle\sim\mathcal{IG}(1,0.01),
ρ\displaystyle\rho 𝒰(0,1),\displaystyle\sim\mathcal{U}(0,1),
ω2\displaystyle\omega^{2} 𝒢(1,0.01),\displaystyle\sim\mathcal{IG}(1,0.01),
η\displaystyle\eta 𝒰(0,1),\displaystyle\sim\mathcal{U}(0,1),

where 𝒰(0,1)\mathcal{U}(0,1) denotes the uniform distribution from 0 to 11.

Using the sampling method described in Section 4.3, we obtained 1,000 samples from the posterior distribution for each simulation to evaluate estimation performance. During the sampling process, more than 50,000 samples were discarded corresponding to the burn-in period based on the results of trace plots, and thinning was applied to ensure that the autocorrelation was below 0.3. Figure 3 presents an example of a trace plot for several parameters of the FT model in Case 1, after discarding burn-in samples and applying thinning. The plot confirms that convergence was sufficient.

Refer to caption
Figure 3: Trace plots of samples from posterior distribution.

Figure 4 presents the root mean squared errors (RMSE) of each parameter for the four cases and four estimation models that were simulated 50 times. Box plots are used to illustrate each case. In the ZT model, 𝜽\bm{\theta} was fixed to 𝟎TK\bm{0}_{TK} and not estimated; therefore, the RMSE plot for 𝜽\bm{\theta} is left blank for the ZT model.

First, for Case 1, assuming normally distributed latent variables, the estimation performance is nearly identical across all four models. This indicates that even when the true model follows a normal distribution, accounting for higher-order moments does not degrade estimation performance. For Case 2, the estimation performance of the ZT model is generally poor; however no significant difference in performance is observed for the other three models. This suggests that when θ\theta is fixed to a single value, consideration of heterogeneity in higher-order moments does not impair estimation performance. Finally, in Cases 3 and 4, the true models (TT and ST) demonstrate improved estimation performance for most parameters. Notably, the improvements are significant for the standard deviation of observational noise (σ\sigma), time-series coefficient (ρ\rho), and variability of random effects (ω\omega). These results suggest that modeling heterogeneity in higher-order moments enhances parameter estimation performance. For θ\theta, the FT model performed well in Cases 1 and 2, whereas in Case 3, the variation is slightly different with no notable differences among the three models. In Case 4, despite increased variation, the ST model improved estimation performance.

Refer to caption
Figure 4: RMSE in each case for each model.

Similarly, Figure 5 displays predictive accuracy, which was evaluated using log marginal predictive likelihood (LMPL), LMPL in the future data (FLMPL), and RMSE of yy in the future data (FRMSE). The calculation method for LMPL followed that used in the CarBayesST package (Lee et al., 2018). FLMPL is an estimator of LMPL for the future data yT+1:T+Tfuturey_{T+1:T+T_{future}} conditioned on the learning data y1:Ty_{1:T}, specifically logp(yT:T+Tfuture|y1:T)\log p(y_{T:T+T_{future}}|y_{1:T}). FLMPL is calculated from samples of the posterior distribution as follows:

FLMPL=1Si=1Slogp(𝒚T+1:T+Tfuture|𝒔(i),𝜽(i),𝜷(i),σ(i),ρ(i),ω(i),η(i)),\displaystyle\mathrm{FLMPL}=\frac{1}{S}\sum_{i=1}^{S}\log p(\bm{y}_{T+1:T+T_{future}}|\bm{s}^{(i)},\bm{\theta}^{(i)},\bm{\beta}^{(i)},\sigma^{(i)},\rho^{(i)},\omega^{(i)},\eta^{(i)}),

where 𝒔(i),𝜽(i),𝜷(i),σ(i),ρ(i),ω(i),η(i)\bm{s}^{(i)},\bm{\theta}^{(i)},\bm{\beta}^{(i)},\sigma^{(i)},\rho^{(i)},\omega^{(i)},\eta^{(i)} is a sample of parameters from the posterior distribution conditioned on 𝒚1:T\bm{y}_{1:T}, and SS is the number of samples. Similarly, FRMSE is the RMSE of 𝒚T+1:T+Tfuture\bm{y}_{T+1:T+T_{future}} and is numerically calculated as follows:

FRMSE\displaystyle\mathrm{FRMSE} =1Si=1S𝒚T+1:T+Tfuture𝒚^T+1:T+Tfuture(i)2.\displaystyle=\sqrt{\frac{1}{S}\sum_{i=1}^{S}||\bm{y}_{T+1:T+T_{future}}-\hat{\bm{y}}_{T+1:T+T_{future}}^{(i)}||^{2}}.

where 𝒚^T+1:T+Tfuture(i)\hat{\bm{y}}_{T+1:T+T_{future}}^{(i)} is a sample drawn from p(𝒚T+1:T+Tfuture|𝒚1:T)p(\bm{y}_{T+1:T+T_{future}}|\bm{y}_{1:T}), and ||||||\cdot|| is the Euclidean norm.

For LMPL, in Case 1, there is no significant difference among the four models, whereas in Cases 2, 3, and 4, the true model shows higher values. This indicates that selecting the true model improves the fit to the observed data. In contrast, for FLMPL, the ZT model performs poorly across all cases, with the TT model consistently performing slightly better than the FT and ST models. In this scenario, the standard deviation of the observational noise is 0.1, whereas the FRMSE ranges from approximately 1.1 to 1.5. This can be attributed to the assumption that θ\theta varies over time, causing the model to exhibit heavier tails, which influences prediction. Finally, for FRMSE, there is little variation across the cases. This can be interpreted as the CSNS distribution maintaining the mean and variance, resulting in minimal differences in squared errors of the observations.

Refer to caption
Figure 5: Predictive accuracy in each case for each model.

7 Application

This study focuses on identifying the production functions for 48 states in the United States, using annual data from 1970 to 1986, following the model outlined in Baltagi (2021). A production function represents how capital and labor affect output. We assumed a Cobb-Douglas type production function, where output is expressed as the product of capital and labor raised to respective exponents. By taking the logarithm of the Cobb-Douglas production function and rewriting it in a form that allows for coefficient estimation from observed data, it can be expressd as follows:

ytk\displaystyle y_{tk} =βintercept+βKPlogKPtk+βKHlogKHtk+βKWlogKWtk\displaystyle=\beta_{intercept}+\beta_{\mathrm{KP}}\log\mathrm{KP}_{tk}+\beta_{\mathrm{KH}}\log\mathrm{KH}_{tk}+\beta_{\mathrm{KW}}\log\mathrm{KW}_{tk}
+βKOlogKOtk+βLlogLtk+βUnempUnemptk+stk+ϵtk,\displaystyle\qquad+\beta_{\mathrm{KO}}\log\mathrm{KO}_{tk}+\beta_{\mathrm{L}}\log\mathrm{L}_{tk}+\beta_{\mathrm{Unemp}}\mathrm{Unemp}_{tk}+s_{tk}+\epsilon_{tk},

Here, yy represents the logarithm of the gross state product, with the features including private capital stock (KP\mathrm{KP}), highways and streets (KH\mathrm{KH}), water and sewer facilities (KW\mathrm{KW}), other public buildings and structures (KO\mathrm{KO}), nonagricultural payroll employment (L\mathrm{L}), and the state unemployment rate (Unemp\mathrm{Unemp}). All features, except the unemployment rate, are expressed in natural logarithms. The data spanning from 1970 to 1984 were utilized as the training dataset, whereas the data from 1985 to 1986 were used to validate the predictive accuracy of the models. Figure 6 displays the value of 𝒚\bm{y}, the logarithm of the gross state product for each state. Data from the first (1970), middle (1977), and final year (1984) are plotted.

Refer to caption
Figure 6: Logarithm of gross state product in 1970, 1977, and 1984.

For the models, we compared the estimation results of ZT, FT, TT, and ST, using the same approach and priors as that of the simulation study. We set sufficiently long burn-in and thinning periods and obtained 40004000 samples from the posterior distribution to evaluate the estimation results.

Table 3 displays the median and 90% credible interval of samples drawn from the posterior distributions of parameters for each model. The medians of the posterior distributions of the all parameters exhibit similar patterns, although a slight variation is observed in the estimates of the regression coefficients. This variation can be attributed to the proposed model, which preserves both the mean and variance.

Table 3: Estimation results of parameters-50% percentile and 90% credible interval
ZT FT
βintercept\beta_{intercept} 1.197 (1.033–1.322) 1.317 (1.144–1.451)
βKP\beta_{\mathrm{KP}} 0.404 (0.371–0.429) 0.386 (0.352–0.413)
βKH\beta_{\mathrm{KH}} 0.143 (0.101–0.175) 0.120 (0.079–0.152)
βKW\beta_{\mathrm{KW}} 0.080 (0.054–0.099) 0.080 (0.054–0.101)
βKO\beta_{\mathrm{KO}} -0.007 (-0.037–0.016) -0.018 (-0.048–0.007)
βL\beta_{\mathrm{L}} 0.470 (0.431–0.501) 0.513 (0.468–0.547)
βUnemp\beta_{\mathrm{Unemp}} -0.007 (-0.010– -0.005) -0.006 (-0.009– -0.004)
σ\sigma 0.015 (0.014–0.016) 0.015 (0.014–0.016)
ρ\rho 0.953 (0.929–0.973) 0.953 (0.927–0.972)
ω\omega 0.050 (0.047–0.052) 0.049 (0.046–0.052)
η\eta 0.756 (0.640–0.830) 0.742 (0.627–0.822)
TT ST
βintercept\beta_{intercept} 1.431 (1.253–1.566) 1.378 (1.211–1.522)
βKP\beta_{\mathrm{KP}} 0.366 (0.331–0.394) 0.328 (0.293–0.356)
βKH\beta_{\mathrm{KH}} 0.114 (0.072–0.148) 0.179 (0.134–0.212)
βKW\beta_{\mathrm{KW}} 0.081 (0.054–0.102) 0.081 (0.055–0.102)
βKO\beta_{\mathrm{KO}} -0.025 (-0.056– -0.001) -0.021 (-0.053–0.004)
βL\beta_{\mathrm{L}} 0.538 (0.495–0.572) 0.527 (0.486–0.560)
βUnemp\beta_{\mathrm{Unemp}} -0.004 (-0.007– -0.002) -0.007 (-0.010– -0.005)
σ\sigma 0.015 (0.014–0.016) 0.014 (0.013–0.015)
ρ\rho 0.947 (0.924–0.964) 0.960 (0.939–0.975)
ω\omega 0.047 (0.044–0.049) 0.043 (0.041–0.046)
η\eta 0.569 (0.447–0.662) 0.697 (0.565–0.785)

Figure 7 displays the median of the posterior distribution of 𝒔τ\bm{s}_{\tau} for each state in 1970, 1977, and 1984 under each model. The figure reveals that, while the trends in the estimates of 𝒔τ\bm{s}_{\tau} are similar across the four models, the actual values differ significantly. For example, the posterior median of ss for Wyoming in 1970 under the ZT model is approximately 0.150.15, whereas it is 0.280.28 under the TT model. Certain states appear in much brighter colors under the CSNS model. For instance, Wyoming around 1970 is depicted in a brighter color, indicating higher productivity than expected based on its capital and infrastructure. In 1970, Wyoming was the second least populous state (United States Census Bureau, 1973), exhibiting low feature values, with LL being the smallest. However, owing to its abundant resources such as coal, crude oil, and natural gas (U.S. Energy Information Administration, 2023), Wyoming demonstrated a total production that surpassed predictions based on its features. The states appearing in brighter colors under the CSNS model suggest that this model assigns greater probability to positive values than the normal model, because θi\theta_{i} is positive. Additionally, in the ST model, differences in shading intensity are more distinct separated than in the FT and TT models.

Refer to caption
Figure 7: Median of 𝒔\bm{s} for each model

Figure 8 depicts the skewness of the posterior distributions in each model. In this figure, in the ST model for 1970, the differences in skewness are more distinct than in the other models. This is attributed to the ST model’s ability to capture the spatial heterogeneity of higher-order moments. In 1977 and 1984, these differences in skewness are less pronounced, presumably because the variance increases over time, making it more difficult for the skewness to become substantially larger.

Refer to caption
Figure 8: Skewness of 𝒔\bm{s} for each model.

Finally, Table 4 presents the evaluation of predictive accuracy. Compared with the ZT model, the FT, TT and ST model show improvements in all evaluated metrics. Even though 𝒚\bm{y} has undergone a logarithmic transformation, considering the heterogeneity of higher-order moments leads to enhanced estimation performance. In particular, the TT and ST models outperform the FT model, suggesting that incorporating both temporal and spatial heterogeneity in higher-order moments improves model fit and predictive accuracy.

LMPL FLMPL FRMSE(×102)(\times 10^{-2})
ZT 1908.433 203.602 5.945
FT 1920.203 254.741 5.875
TT 1947.617 268.959 5.680
ST 1945.871 255.556 5.302
Table 4: Prediction accuracy in application.

8 Conclusion

In this study, we propose a method that enables the treatment of the heterogeneity of the higher order moments while preserving the mean and variance, crucial for maintaining parameter interpretability (see Section 1). After demonstrating closure under linear transformations of the proposed model, we analytically derived its skewness and kurtosis. In addition, applying it to spatio-temporal data, we examined how the properties of the model change depending on the chosen matrix decomposition method. In particular, using the matrix square root preserves symmetry with respect to ordering, making it suitable for data with inherent symmetry. For time-series data, we showed that the Cholesky decomposition corresponds to the AR model representation. We then proposed an efficient Bayesian inference method employing data augmentation and the FFBS algorithm.

In the simulation study, we compared the estimation performance of the four models that incorporated heterogeneity in higher-order moments. The results demonstrated that appropriately accounting for higher-order moment heterogeneity improves parameter estimation performance, model fit, and predictive accuracy.

For a real-data application, we applied the CSNS model to identify production functions across the United States. Although parameters other than 𝒔\bm{s} generally exhibited similar trends, the posterior distribution of 𝒔\bm{s} revealed that skewness varies substantially by location, particularly when spatial heterogeneity in higher-order moments is considered. Furthermore, both the TT and ST models showed improvements in terms of model fit and predictive performance metrics, indicating that even when the response variable is log-transformed, accounting for higher-order moment heterogeneity enhances estimation performance.

An open question for future research is the development of a model that simultaneously preserves the mean and variance, maintains spatial conditional independence, and accommodates heterogeneity in higher-order moments. Furthermore, as the proposed model encompasses a broad class of models, a potential extension may explore more complex structures where each dimension follows a different distribution, which remains unaddressed in this study.

The Julia implementation of CSNS model is available on GitHub:
(https://github.com/Kuno3/CsnSubclass).

Acknowledgments

This work was supported by JSPS KAKENHI Grant Number 24K00175.

References

  • Anselin (2022) Anselin, L., 2022. Spatial econometrics, in: Rey, S.J., Franklin, R.S. (Eds.), Handbook of Spatial Analysis in the Social Sciences. Edward Elgar Publishing, Cheltenham, pp. 101–122.
  • Azzalini (1985) Azzalini, A., 1985. A class of distributions which includes the normal ones. Scand. J. Stat. 12, 171–178.
  • Azzalini and Capitanio (1999) Azzalini, A., Capitanio, A., 1999. Statistical applications of the multivariate skew normal distribution. J. R. Stat. Soc. Ser. B Stat. Methodol. 61, 579–602.
  • Baltagi (2021) Baltagi, B.H., 2021. Econometric Analysis of Panel Data. 6th ed., Springer, Berlin.
  • Best et al. (2005) Best, N., Richardson, S., Thomson, A., 2005. A comparison of bayesian spatial models for disease mapping. Stat. Methods Med. Res. 14, 35–59.
  • Botev (2017) Botev, Z.I., 2017. The normal law under linear restrictions: Simulation and estimation via minimax tilting. J. R. Stat. Soc. Ser. B Stat. Methodol. 79, 125–148.
  • Carter and Kohn (1994) Carter, C.K., Kohn, R., 1994. On gibbs sampling for state space models. Biometrika 81, 541–553.
  • Cressie and Wikle (2011) Cressie, N., Wikle, C.K., 2011. Statistics for spatio-temporal data. John Wiley & Sons.
  • Dormann et al. (2007) Dormann, C.F., McPherson, J.M., Araújo, M.B., Bivand, R., Bolliger, J., Carl, G., Davies, R.G., Hirzel, A., Jetz, W., Daniel Kissling, W., et al., 2007. Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography 30, 609–628.
  • Frühwirth-Schnatter (1994) Frühwirth-Schnatter, S., 1994. Data augmentation and dynamic linear models. J. Time Ser. Anal. 15, 183–202.
  • Lee et al. (2018) Lee, D., Rushworth, A., Napier, G., 2018. Spatio-temporal areal unit modeling in r with conditional autoregressive priors using the CARBayesST package. J. Stat. Softw. 84, 1–39.
  • Márquez-Urbina and González-Farías (2022) Márquez-Urbina, J.U., González-Farías, G., 2022. A flexible special case of the CSN for spatial modeling and prediction. Spatial Stat. 47, 100556.
  • Niccolò et al. (2023) Niccolò, A., Augusto, F., Daniele, D., Giacomo, Z., 2023. Bayesian conjugacy in probit, tobit, multinomial probit and extensions: A review and new results. J. Amer. Statist. Assoc. 118, 1451–1469.
  • Tagle et al. (2019) Tagle, F., Castruccio, S., Crippa, P., Genton, M.G., 2019. A non-gaussian spatio-temporal model for daily wind speeds based on a multi-variate skew-t distribution. J. Time Ser. Anal. 40, 312–326.
  • Tan and Chen (2024) Tan, L.S.L., Chen, A., 2024. Variational inference based on a subclass of closed skew normals. J. of Comput. and Graph. Stat. 0, 1–15.
  • United States Census Bureau (1973) United States Census Bureau, 1973. 1970 census of population: Characteristics of the population. https://www.census.gov/library/publications/1973/dec/population-volume-1.html. (accessed 25 May 2024).
  • U.S. Energy Information Administration (2023) U.S. Energy Information Administration, 2023. Production - state energy data system (seds). https://www.eia.gov/state/seds/seds-data-complete.php. (accessed 25 May 2024).
  • Wang et al. (2024) Wang, K., Karling, M.J., Arellano-Valle, R.B., Genton, M.G., 2024. Multivariate unified skew-t distributions and their properties. J. of Multivar. Analysis 203, 105322.
  • Yan et al. (2020) Yan, Y., Jeong, J., Genton, M.G., 2020. Multivariate transformed gaussian processes. Jpn. J. Stat. Data Sci. 3, 129–152.