Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains
Abstract
We introduce a class of scalable Bayesian hierarchical models for the analysis of massive geostatistical datasets. The underlying idea draws from recent work on high-dimensional geostatistics: we partition the spatial domain and model the regions of the partition using a sparsity-inducing directed acyclic graph (DAG). We extend the model over the DAG to a well-defined spatial process, which we call the Meshed Gaussian Process (MGP). A major contribution is the development of MGPs on tessellated domains, accompanied by a Gibbs sampler for the efficient recovery of spatial random effects. In particular, the cubic MGP (Q-MGP) can harness high-performance computing resources by executing all large-scale operations in parallel within the Gibbs sampler, improving mixing and computing time compared to sequential updating schemes. Unlike some existing models for large spatial data, a Q-MGP facilitates massive caching of expensive matrix operations, making it particularly apt for spatiotemporal remote-sensing data. We compare Q-MGPs against state-of-the-art methods on large synthetic and real-world data. We also illustrate using Normalized Difference Vegetation Index (NDVI) data from the Serengeti park region to recover latent multivariate spatiotemporal random effects at millions of locations. The source code is available at github.com/mkln/meshgp.
Keywords: Bayesian, spatial, large data, graphical models, domain partitioning, sparsity.
1 Introduction
Collecting large quantities of spatial and spatiotemporal data is now commonplace in many fields. In ecology and forestry, massive datasets are collected using satellite imaging and other remote sensing instruments such as LiDAR that periodically record high-resolution images. Unfortunately, clouds frequently obstruct the view resulting in large regions with missing information. Figure 1 shows this phenomenon in Normalized Difference Vegetation Index (NDVI) data from the Serengeti region. Filling such gaps in the data is an important goal as is quantifying uncertainty in predictions. This goal is achieved through stochastic modeling of the underlying phenomenon, which involves the specification of a spatial or spatiotemporal process characterizing dependence from a finite realization. Gaussian processes (GP) are a customary choice to characterize spatial dependence, but their implementation is notoriously burdened by their computational complexity. Consequently, intense research has been devoted in recent years to developing scalable models for large spatial datasets – see detailed reviews by Sun et al., (2011) and Banerjee, (2017).

Computational complexity can be reduced by considering low-rank models; among these, knot-based methods motivated by “kriging” ideas enjoy some optimality properties but oversmooth the estimates of spatial random effects unless the number of knots is large, and require corrections to avoid overestimation of the nugget (Banerjee et al.,, 2008; Cressie and Johannesson,, 2008; Banerjee et al.,, 2010; Guhaniyogi et al.,, 2011; Finley et al.,, 2012). Other methods reduce the computational burden by introducing sparsity in the covariance matrix; strategies include tapering (Furrer et al.,, 2006; Kaufman et al.,, 2008) or partitioning of the spatial domain into regions with a typical assumption of independence across regions (Sang and Huang,, 2012; Stein,, 2014). These can be improved by considering a recursive partitioning scheme, resulting in a multi-resolution approximation (MRA; Katzfuss, 2017). Other conditional independence assumptions also have a good track record in terms of scalability to large spatial datasets: Gaussian Markov random fields (GMRF; Rue and Held,, 2005), composite likelihood methods (Eidsvik et al.,, 2014), and neighbor-based likelihood approximations (Vecchia,, 1988) belong to this family.
The recent literature has witnessed substantial activity surrounding the so-called Vecchia approximation (Vecchia,, 1988). This approximation can be regarded as a special case of the GMRF approximations with a simplified neighborhood structure motivated from a directed acyclic graph (DAG) representation of a Gaussian process likelihood. Extensions leading to well-defined spatial processes that accommodate inference at arbitrary locations by extending the DAG representation to the entire domain include nearest-neighbor Gaussian processes (NNGP; Datta et al., 2016a; Datta et al., 2016b) and further generalizations that construct DAGs over the augmented space of outcomes and spatial effects (Katzfuss and Guinness,, 2017). These approaches achieve computational scalability by introducing sparsity in the precision matrix. The DAG relies upon a specific topological ordering of the locations, which also determines the construction of neighborhood sets, and certain orderings tend to deliver improved performance of such models (Katzfuss and Guinness,, 2017; Guinness,, 2018).
When inference on the latent process is sought, Bayesian inference has the benefit of providing direct probability statements based upon the posterior distribution of the process. Inference based on asymptotic approximations is avoided, but challenges remain in computing the posterior distribution given that inference is sought on a very high-dimensional parameter space (including the realizations of the latent process). One possibility, available for Gaussian first-stage likelihoods, is to work with a collapsed or marginalized likelihood by integrating out the spatial random effects. However, Gibbs samplers and other MCMC algorithms for the collapsed models can be inexorably slow and are impractical when data are in the millions. A sequential Gibbs sampler that updates the latent spatial effects (Datta et al., 2016a) is faster in updating the parameters but suffers from high autocorrelation and slow mixing. Another possibility emerges when interest lies in prediction or imputation of the outcome variable only and not the latent process. Here, a so-called “response” model that models the outcome itself using an NNGP can be constructed. This model is much faster and enjoys superior convergence properties, but we lose inference on the latent process and its predictive performance tends to be inferior to the latent process model. Furthermore, these options are unavailable in non-Gaussian first-stage hierarchical models or when the focus is not uniquely on prediction. A detailed comparison of different approaches for computing Bayesian NNGP models is presented in Finley et al., (2019).
Our current contribution introduces a class of Meshed Gaussian Process (MGP) models for Bayesian hierarchical modeling of large spatial datasets. This class builds upon the aforementioned DAG-based approaches stemming from Vecchia, (1988). The inferential focus remains within the context of massive spatial datasets over very large domains. We exploit the demonstrated benefits of DAG-based models, but we now adapt them to partitioned domains. We describe dependence across regions of a partitioned domain using a small, patterned DAG which we refer to as a mesh. Within each region, some locations are selected as reference and collectively mapped to a single node in the DAG. Relationships among nodes are governed by kriging ideas. In the resulting MGP, regions in the spatial domain depend on each other through the reference locations. Realizations at all other locations are assumed independent, conditional upon the reference locations. This construction leads to a valid standalone spatial process.
As a particular subclass of MGPs, we propose a novel partitioning and graph design based on domain tessellations. Unlike methods that build sparse DAGs by limiting dependence to nearest neighbors, our approach shapes the underlying DAG with a known, repeating pattern corresponding to the chosen tessellation geometry. The underlying sparse DAG enables scaling computations to large data settings, and its known pattern guarantees the availability of block-parallel sampling schemes; furthermore, large computational savings can be achieved at no additional approximation cost if data are collected on patterned lattices. Finally, extensions to spatiotemporal and/or multivariate data are straightforward once a suitable covariance function has been defined. We use axis-parallel domain partitioning and the corresponding cubic DAG – resulting in cubic MGPs or Q-MGPs – to show substantial improvements in computational time and inferential performance relative to other models, with data sizes ranging from thousands to several millions, for both spatial and spatiotemporal data and using multivariate spatial processes.
The present work may appear to share similarities with the block-NNGP model of Quiroz et al., (2019), who advocate building sparse DAGs on grouped locations based on their ordering and subsequent identification of “past” neighbors. Unlike block-NNGPs, our tessellated GPs consider the domain tessellation as generating the DAG; the number of parents of any node is thus fixed and depends on the geometry of the chosen tessellation rather than on a user-defined parameter. Inclusion of more distant locations in the parent set of any location will, therefore, not proceed by increasing the number of neighbors, but rather by increasing the regions’ size and/or modifying their shape. Central to tessellated GPs is the idea of forcing a DAG with known coloring on the data, resulting in guaranteed efficiencies when recovering the latent spatial effects. This strategy is analogous in spirit to multi-resolution approximations (Katzfuss,, 2017; Gramacy and Lee,, 2008), which also force a DAG on the data, resulting in conditional independence patterns that are known in advance and that can be used to improve computations. However, while multi-resolution approximations are defined by branching graphs associated with recursive domain partitioning, tessellated GPs use a single domain partition, with each region connected in the DAG only to its immediate neighbors. Compared to treed graphs, tessellated GPs are associated with DAGs that have fewer conditionally-independent groups and whose repeated patterns facilitate the identification of redundant matrix operations arising when one or more coordinate margins are gridded. We also note that while the idea of partitioning domains to create approximations is not new, construction of DAG-based approximations over partitioned domains has received considerably less attention.
Finally, our focus here is on developing tessellated GPs as a methodology that enables the efficient recovery of the latent spatial random effects and the Bayesian estimation of covariance parameters via MCMC; we are thus not focusing on alternative computational algorithms (see e.g. Finley et al., 2019), which have been developed for NNGPs but can nonetheless be adapted to general MGP models.
The balance of this paper proceeds as follows. Section 2 introduces our general framework for hierarchical Bayesian modeling of spatial processes using networks of grouped spatial locations. The MGP is outlined in Section 3, where we provide a general, scalable computing algorithm in Section 3.1. Tessellation-based schemes and the specific case of Q-MGPs are outlined in Section 4, which highlights their properties and computational advantages. We illustrate the performance of our proposed approach in Section 5 using simulation experiments and an application on a massive dataset with millions of spatiotemporal locations. We conclude the paper with a discussion and pointers to further research. Supplementary material accompanying this manuscript as an Appendix is available online and contains further comparisons of Q-MGPs with several state-of-the-art methods for spatial data.
2 Spatial processes on partitioned domains
A spatial process assigns a probability law on , where is a random vector with elements for . In the following general discussion we will not distinguish between spatial () and spatiotemporal domains (), and denote spatial or spatiotemporal locations as , or .
For any finite set of spatial locations of size , let denote the probability law of the random vector with probability density . The joint density of can be factorized according to a DAG (or a Bayesian network model) with respect to the ordered set of locations as
$p(\boldsymbol{w}_{s_1}, \ldots, \boldsymbol{w}_{s_n}) = p(\boldsymbol{w}_{s_1}) \prod_{i=2}^{n} p(\boldsymbol{w}_{s_i} \mid \boldsymbol{w}_{s_{i-1}}, \ldots, \boldsymbol{w}_{s_1})$   (1)
where the conditioning set for each can be interpreted as the set of its parents in a large, dense Bayesian network. Defining a simplified valid joint density on by reducing the size of the conditioning sets is a popular strategy for fast likelihood approximations in the context of large spatial datasets. One typically limits dependence to “past” neighboring locations with respect to the ordering in (1) (Vecchia,, 1988; Gramacy and Apley,, 2015; Stein et al.,, 2004; Datta et al., 2016a; Katzfuss and Guinness,, 2017). The neighbors are defined and fixed, and model performance may benefit from the addition of some distant locations (Stein et al.,, 2004). The ordering is also fixed, and inferential performance may benefit from the use of certain fixed permutations (Guinness,, 2018). Shrinking the conditioning sets to a smaller set of neighbors from the past yields a sparse DAG or Bayesian network, and with it potentially massive computational gains.
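The reduced-conditioning factorization can be made concrete with a small numerical sketch. The following is our illustration (not the paper's implementation), assuming a univariate GP with a unit-variance exponential covariance and parent sets given by the `m` nearest "past" locations under a fixed ordering; the function names and the choice of covariance are ours. When `m` is at least the number of preceding locations, the product of conditionals recovers the exact joint density.

```python
import numpy as np

def exp_cov(A, B, phi=1.0):
    # unit-variance exponential covariance between rows of A and B
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-phi * d)

def vecchia_loglik(y, locs, m=3, phi=1.0):
    # approximate log N(y; 0, C): each site conditions only on
    # (at most) m nearest previously-ordered sites
    ll = -0.5 * (np.log(2 * np.pi) + y[0] ** 2)   # first site: marginal N(0, 1)
    for i in range(1, len(y)):
        d = np.linalg.norm(locs[:i] - locs[i], axis=1)
        nb = np.argsort(d)[:m]                    # nearest "past" neighbors
        C_nn = exp_cov(locs[nb], locs[nb], phi)
        C_in = exp_cov(locs[i:i + 1], locs[nb], phi)
        H = np.linalg.solve(C_nn, C_in.T).T       # kriging weights
        mean = (H @ y[nb]).item()
        var = 1.0 - (H @ C_in.T).item()           # conditional variance
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return ll
```

With `m` much smaller than the sample size, each conditional involves only a small linear system, which is the source of the computational gains discussed above.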
We proceed in a similar manner, but instead of defining a sparse DAG at the level of each individual location, we map entire groups of locations to nodes in a much smaller graph; the same graph will be used to model the dependence between any location in the spatial domain and, therefore, to define a spatial process. Let be a partition of into mutually exclusive subsets so that and whenever . Similar to the nomenclature in the NNGP, we fix a reference set , which itself is partitioned using by letting . The set of non-reference locations is similarly partitioned with so that for each . We now construct a DAG to model dependence within and between and . Let be a graph with nodes , where we refer to as the reference nodes and to as the non-reference, or simply “other”, nodes. Let . We introduce a map such that
(4) |
This surjective many-to-one map links each location in and to a node in . The edges connecting nodes in are where denotes the set of parents of any and, hence, identifies the directed edges pointing to . We let be acyclic, i.e., there is no chain of elements of such that and . Crucially, we assume that for all , i.e., that only reference nodes have children, to distinguish the reference nodes from the other nodes . Apart from the assumption that , we refrain from defining the parents of a node, thereby retaining flexibility. In general, however, all locations in will share the same parent set. In Section 4 we will consider meshes associated to domain tessellations.
Consider the enumeration , where , and let be the random vector listing elements of for each . We now rewrite (1) as a product of conditional densities
$p(\boldsymbol{w}_1, \ldots, \boldsymbol{w}_M) = p(\boldsymbol{w}_1) \prod_{i=2}^{M} p(\boldsymbol{w}_i \mid \boldsymbol{w}_{i-1}, \ldots, \boldsymbol{w}_1)$   (5)
The conditioning sets are then reduced based on the graph :
$\tilde{p}(\boldsymbol{w}_1, \ldots, \boldsymbol{w}_M) = \prod_{i=1}^{M} p(\boldsymbol{w}_i \mid \boldsymbol{w}_{[i]})$   (6)
where we denote , and . This is a proper multivariate joint density since the graph is acyclic (Lauritzen,, 1996). It is instructive to note how the above approximation behaves when the size of the parent set shrinks, for a given domain partitioning scheme. To this end, we adapt a result in Banerjee, (2020) and show that sparser DAGs correspond to a larger Kullback-Leibler (KL) divergence from the base density . This result has been proved earlier for Gaussian likelihoods by Guinness, (2018), but the argument given below is free of distributional assumptions and is linked to the submodularity of entropy and the “information never hurts” principle (see e.g. Cover and Thomas,, 1991).
Consider random vector and some partition of the domain corresponding to nodes via map . Let the base process correspond to graph where . Then, let where and for all . Finally construct by letting for some . In other words, graph is obtained by removing the directed edge from . We approximate using densities and based on and , respectively, obtaining
$\tilde{p}_k(\boldsymbol{w}_1, \ldots, \boldsymbol{w}_M) = \prod_{i=1}^{M} p(\boldsymbol{w}_i \mid \boldsymbol{w}_{[i]_k}), \qquad k = 1, 2$   (7)
Considering the Kullback-Leibler divergence of each density from , and denoting , we find
(8)
where we use (7), the fact that and are disjoint, and Jensen’s inequality. This result implies that larger parent sets are preferable as they correspond to better approximations to the full model; the choice of sparser graphs will be driven by computational considerations – see Section 3.2.
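The monotonicity argument can be checked numerically in the Gaussian case. The sketch below is our illustration, not code from the paper: we form the density implied by a DAG via its kriging coefficients, recover its covariance as $(I - H)^{-1} D (I - H)^{-\top}$, and verify that the KL divergence of the base density from the approximation does not increase when nested parent sets grow (here, nearest-"past"-neighbor sets; all function names are ours).

```python
import numpy as np

def implied_cov(C, parents):
    # covariance of the DAG density prod_i p(w_i | w_{parents[i]}):
    # w = H w + eta, eta ~ N(0, diag(D)), so Cov = (I-H)^-1 D (I-H)^-T
    n = C.shape[0]
    H = np.zeros((n, n))
    D = np.zeros(n)
    for i, pa in enumerate(parents):
        if pa:
            h = np.linalg.solve(C[np.ix_(pa, pa)], C[pa, i])
            H[i, pa] = h
            D[i] = C[i, i] - C[i, pa] @ h
        else:
            D[i] = C[i, i]
    A = np.linalg.inv(np.eye(n) - H)
    return A @ np.diag(D) @ A.T

def kl_gauss(S0, S1):
    # KL( N(0, S0) || N(0, S1) )
    n = S0.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(S1, S0)) - n
                  + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])
```

We use the direction KL(base || approximation), for which the "information never hurts" argument gives exact monotonicity under nested parent sets; with full "past" parent sets the divergence is zero.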
We construct the spatial process over arbitrary locations by enumerating other locations as and extending (6) to the non-reference locations. Given the partition of defined earlier with components for , for each we set and recall that by construction. For each , we denote and define the conditional density of given as
(9) |
Therefore, for any finite subset of spatial locations we can let and obtain
We show (see Appendix A, available online) that this is a well-defined process by verifying the Kolmogorov consistency conditions. This new process can be built starting from a base process, a fixed reference set, domain partition and a graph . Next, we elucidate with Gaussian processes.
3 Meshed Gaussian Processes
Let be a -variate multivariate Gaussian process, denoted as . The cross-covariance indexed by parameters is a function , where is a subset of (the space of all real matrices) such that the -th entry of evaluates the covariance between the -th and -th elements of at and , respectively, i.e., . We omit dependence on to simplify notation. The cross-covariance function itself needs to be neither symmetric nor positive-definite, but must satisfy the following two properties: (i) ; and (ii) for any integer and any finite collection of points and for all . See Genton and Kleiber, (2015) for a review of cross-covariance functions for multivariate processes. The (partial) realization of the multivariate process over any finite set has a multivariate normal distribution where is the column vector and is the block matrix with the matrix as its block for .
We construct the MGP from a base, or parent, multivariate GP for and then, using the graph defined in Section 2, represent the joint density at the reference set as
$\tilde{p}(\boldsymbol{w}_1, \ldots, \boldsymbol{w}_M) = \prod_{i=1}^{M} N(\boldsymbol{w}_i;\, \boldsymbol{H}_i \boldsymbol{w}_{[i]},\, \boldsymbol{R}_i)$   (10)
where , and for , and . The resulting joint density is multivariate normal with covariance and a precision matrix . The precision matrix for Gaussian graphical models is easily derived using customary linear model representations for each conditional regression. Consider the DAG in (6). Each is and let be the number of parents for in the graph . Furthermore, let be the covariance matrix between and , be the covariance matrix between and , and be the covariance matrix between and itself. Representing each conditional density in (6) as a linear regression on , we get
$\boldsymbol{w}_i = \boldsymbol{H}_i \boldsymbol{w}_{[i]} + \boldsymbol{\eta}_i, \qquad \boldsymbol{\eta}_i \sim N(\boldsymbol{0}, \boldsymbol{R}_i)$   (11)
where each is a coefficient matrix representing the multivariate regression of given , for , and each is a residual covariance matrix. We set and , where is the matrix of zeros, whenever . For , let be the indices in and let be the block matrix formed by stacking side by side for each . Since , we obtain and each can be obtained from the respective submatrix of . We also obtain . Therefore, all the ’s and ’s can be computed from the base cross-covariance function.
The distribution of can be obtained by noting that , where is the block matrix with as -th block. Therefore, , where is block-diagonal with as the -th block. Note that is block lower-triangular with ’s on the diagonal, hence non-singular. Also, the precision matrix is sparse because of whenever . Block-sparsity of can be induced by building with few, carefully placed directed edges among nodes in ; Appendix B, available online, contains a more in-depth treatment. We extend (10) to the collection of non-reference locations :
$\tilde{p}(\boldsymbol{w}_{\mathcal{U}} \mid \boldsymbol{w}_{\mathcal{S}}) = \prod_{u \in \mathcal{U}} N(\boldsymbol{w}_u;\, \boldsymbol{H}_u \boldsymbol{w}_{[u]},\, \boldsymbol{R}_u)$   (12)
where and , analogously to (10), while and are analogous to and . Clearly, given that all the densities are Gaussian, all finite dimensional distributions will also be Gaussian. We have constructed a Gaussian process with the following cross-covariance function for any two locations
For a given base Gaussian covariance function , domain partitioning , mesh , and reference set , we denote the corresponding meshed Gaussian process as .
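The block structure of the implied precision matrix can also be made concrete. The script below is our toy illustration (a chain-shaped mesh over four blocks of two locations each; all names are ours): it assembles the precision as $(I - H)^{\top} D^{-1} (I - H)$ from the per-block coefficient and residual-covariance matrices, then confirms that blocks not linked in the graph receive exact zero off-diagonal precision blocks.

```python
import numpy as np

rng = np.random.default_rng(2)
locs = rng.uniform(size=(8, 2))
C = np.exp(-np.linalg.norm(locs[:, None] - locs[None, :], axis=-1))
blocks = [[0, 1], [2, 3], [4, 5], [6, 7]]   # 4 reference blocks
parents = [[], [0], [1], [2]]               # chain-shaped mesh

n = C.shape[0]
H = np.zeros((n, n))
Dinv = np.zeros((n, n))
for k, blk in enumerate(blocks):
    pa = [j for p in parents[k] for j in blocks[p]]
    if pa:
        # kriging coefficient matrix and residual covariance for block k
        Hk = np.linalg.solve(C[np.ix_(pa, pa)], C[np.ix_(pa, blk)]).T
        H[np.ix_(blk, pa)] = Hk
        Rk = C[np.ix_(blk, blk)] - Hk @ C[np.ix_(pa, blk)]
    else:
        Rk = C[np.ix_(blk, blk)]
    Dinv[np.ix_(blk, blk)] = np.linalg.inv(Rk)

Q = (np.eye(n) - H).T @ Dinv @ (np.eye(n) - H)  # block-sparse precision
```

In the chain, block 1 and block 3 share no edge in the moral graph, so their precision block is exactly zero; this sparsity pattern is what the samplers below exploit.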
3.1 Bayesian hierarchical model and Gibbs sampler
Meshed GPs produce block-sparse precision matrices that are constructed cheaply from their block-sparse Cholesky factors by solving small linear systems. General-purpose sparse-Cholesky algorithms (Davis,, 2006; Chen et al.,, 2008) can then be used to obtain collapsed samplers as in Finley et al., (2019). Unfortunately, these algorithms can only be used on Gaussian first-stage models and are computationally impracticable for data in the millions. Hence, we develop a more general scalable Gibbs sampler for the recovery of spatial random effects in hierarchical MGP models that entirely circumvents large matrix computations.
Consider a multivariate spatiotemporally-varying regression model at ,
(13) |
where is the multivariate point-referenced outcome, is a matrix of spatially referenced predictors linked to constant coefficients , is the spatial process, is a design matrix, is measurement error such that and . A simple univariate regression model with a spatially-varying intercept can be obtained with , . For observed locations , we write the above model compactly , where , and are similarly defined, , , and .
For subsets , let , with analogous definitions for and , , and . After fixing a reference set , we obtain and . We partition the domain as above to obtain for and model using the MGP which yields . We complete the model specification by assigning , , .
The resulting full conditional distribution for is , where , . For , , the full conditional is Inverse-Gamma with parameters and where and are the subsets of corresponding to outcome (out of ).
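For the varying-intercept special case described above, the conjugate update for the regression coefficients takes the standard Bayesian linear-regression form. The sketch below is our illustration of that single Gibbs step, assuming a mean-zero Gaussian prior with covariance `V_b` and all other quantities held fixed at their current values; the function name and argument names are ours.

```python
import numpy as np

def sample_beta(y, X, w, tau2, V_b, rng):
    # full conditional: beta | y, w, tau2 ~ N(mu, Sigma) with
    # Sigma = (V_b^-1 + X'X / tau2)^-1 and mu = Sigma X'(y - w) / tau2
    Sigma = np.linalg.inv(np.linalg.inv(V_b) + X.T @ X / tau2)
    mu = Sigma @ (X.T @ (y - w)) / tau2
    return rng.multivariate_normal(mu, Sigma)
```

The same conditioning-on-residuals logic applies in the general multivariate case, with the spatial term subtracted from the outcome before the conjugate draw.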
The Gibbs update of the components can proceed simultaneously as all blocks in have no children and their parents are in . The full conditional for for is thus where and , where is the spatial process at locations corresponding to the parents of .
We update for via its full conditional . Let be the vector of indicators that identify locations with non-missing outputs, and let be the node in corresponding to . Then,
(14)
where with , and and denotes the Hadamard or Schur (element-by-element) product. Finally, is updated via a Metropolis step with target density using (10) and (12). The Gibbs sampling algorithm will iterate across the above steps and, upon convergence, will produce samples from .
We obtain posterior predictive inference at arbitrary by evaluating . If , then we draw one sample of for each draw of the parameters from . Otherwise, considering that for some and thus , with parent nodes and children , we sample from the full conditional , where and , then draw .
3.2 Non-separable multivariate spatiotemporal covariances
We provide an account of the computational cost of general MGPs as a starting point to motivate the introduction of more efficient tessellated MGPs, and specifically Q-MGPs, in Section 4. We consider (13) and take to simplify our exposition. In the resulting model, is the regression coefficient on the point-referenced regressors with a static effect on the outcome, whereas the -variate spatiotemporal process captures the dynamic effect of the regressors. Typically in geostatistical modeling and are small, hence sampling and carries a negligible computational cost. The cost of each Gibbs iteration is dominated by updates of and . Let us assume, solely for expository purposes, that each of the blocks comprises the same number of locations, i.e. , for all . Thus, and the graph nodes have or fewer parents and or fewer children.
The evaluation of and dominates the computation. Each term in the product entails and , both of size , and their determinants. These require of size or less, resulting in flops via Cholesky decomposition. Reasonably, and are fixed so may grow linearly with sample size and the cost is considering . The total computing time is with processors for computing the densities. Sampling and from their full conditional distributions requires flops, assuming and are stored in the previous step. The first term in the complexity order is due to the Cholesky decomposition of covariance matrices, the second is due to sampling the reference nodes, and the third comes from sampling other nodes. Without further assumptions, parallelization reduces complexity to , since the covariances can be computed beforehand and the components of are independent given . With fixed block size , the overall complexity for a Gibbs iteration is , linear in the sample size and cubic in , highlighting the computational speedup of sparse graphs ( small), the negative impact of large , and the serial sampling of .
In terms of storage, and correspond to a storage requirement of . The matrix of size can be represented as a list of block-diagonal (hence sparse) matrices. Furthermore, computing (dimension ) can be vectorized as the row-wise sum of where and are matrices with th column representing the th space-time varying predictor. The cost of storing is thus .
Complexity is further reduced by considering a graph with small or a finer partition resulting in large and small , whereas the overall time can be reduced by distributing computations on processors. Possible choices for include nearest-neighbor graphs and multiresolution trees. In settings with large , adjusting and may be insufficient to reduce the computational burden. Covariance functions that are separable in the variables (but perhaps non-separable in space and time) bring the cost of Cholesky factorizations of matrices from to because , where is the space-time component of the cross-covariance, and the variable component. Savings accrue when evaluating the likelihood and in sampling from the full conditionals, at the cost of realism in describing the spatial process.
The next section develops a novel MGP design based on domain tessellations or tiling – i.e. partitions of the domain forming repeated patterns – to which we associate similarly patterned meshes. If observations are also located in patterns, the bulk of the largest linear solvers will be redundant, resulting in a significant reduction in computational time. In either scenario, sampling will also proceed in parallel with improved mixing.
4 MGPs based on domain tessellation or tiling
We construct MGPs based on a tessellation or tiling of the domain. For spatial domains (, Figure 2), regular tiling results in triangular, quadratic, or hexagonal tessellations; mixed designs are also possible. These partition schemes can be linked to a DAG by drawing directed outgoing edges starting from an originating node/tile. The same fixed pattern can be repeated over a surface of any size. In dimensions , which may include time, space-filling tessellations or honeycombs can be constructed analogously, along with their corresponding meshes. Constructing MGPs based on these ideas simply requires partitioning the locations into subsets based on the chosen tessellation.
This subclass of MGP models corresponds by design to graphs with known coloring, with each color linked to a subgraph conditionally independent of all nodes of other colors, regardless of the dimension of the domain. This feature enables large-scale parallel sampling of and improves mixing without the need to implement heuristic graph-coloring algorithms. Furthermore, regions in a tessellated domain are typically translations and/or rotations of a single geometric shape. By carefully choosing , it becomes possible to avoid computing the bulk of linear solvers, resulting in substantial computational gains. Subsequently, we focus on axis-parallel partitioning (quadratic or cubic tessellation) and cubic meshes, but analogous constructions and the same properties hold with other tessellation schemes.

A cubic MGP (Q-MGP) is constructed by partitioning each coordinate axis into intervals. In dimensions, splitting each axis into intervals results in regions. Consider a spatiotemporal domain , where is the time dimension. We partition each coordinate axis into disjoint sets: , where if and denotes the th interval in the th coordinate axis. Solely for exposition, and without loss of generality, assume that and for . Any location will be such that for some and with , where . We refer to this axis-parallel partition scheme as a cubic tessellation and denote it by . We use to partition the reference set as for .
Next, we define , where if . Then, let be a directed acyclic graph with and reference nodes . Therefore, for any if then . We write each node as . The directed edges are constructed using a “line-of-sight” strategy. Suppose . The th parent of is defined as , where is the smallest integer such that . Consequently if . Thus, the parents of node are the ones that precede it along each of the coordinates. If , then and where is a reference node. To avoid we set . The two parents along the th dimension are , where is the smallest positive integer such that , . In this setting . The construction is finalized by fixing the cross-covariance function ; Figure 4 shows that the same basic structure can be immediately extended to higher dimensions, including time.
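A minimal sketch of the axis-parallel construction may help fix ideas. The code below is our illustration for a two-dimensional unit-square domain (function names are ours): locations are binned into `M` intervals per axis, and each occupied block takes as parents the nearest preceding occupied block along each coordinate, skipping over empty blocks as in the "line-of-sight" rule above.

```python
import numpy as np

def block_coords(locs, M):
    # bin each coordinate of points in [0, 1]^2 into M intervals
    return np.clip(np.floor(locs * M).astype(int), 0, M - 1)

def line_of_sight_parents(occupied):
    # parents of block (i, j): nearest preceding occupied block along each axis
    par = {}
    for (i, j) in occupied:
        pa = []
        k = i - 1
        while k >= 0 and (k, j) not in occupied:
            k -= 1                      # skip empty blocks along axis 1
        if k >= 0:
            pa.append((k, j))
        k = j - 1
        while k >= 0 and (i, k) not in occupied:
            k -= 1                      # skip empty blocks along axis 2
        if k >= 0:
            pa.append((i, k))
        par[(i, j)] = pa
    return par
```

Each block thus has at most two parents in two dimensions (at most $d$ parents in $d$ dimensions), fixed by the tessellation geometry rather than by a neighbor-count parameter.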
4.1 Caching redundant expensive matrix operations
The key computational bottleneck for the Gibbs sampler in Section 3.2 is calculating, for , (i) ( flops) and (ii) , ( flops). The former is costlier than the latter by a factor of . Q-MGPs are designed to greatly reduce this cost. We start with an axis-parallel tessellation of the domain in equally-sized regions , storing observed locations in to create , which we assume, for simplicity, to be no larger than in size. Taking a stationary base covariance function implies that , where is used to shift all locations in the sets. Recall that the reference set of MGPs can include unobserved locations. Hence, we can build on a lattice of regularly spaced locations. Since domain partitions have the same size, we have for , where is a single “prototype set” using which one can locate all other reference subsets. Also, since , there will be prototype sets for parents, i.e. for some and . Then, we can build maps and linking each of and to a parent prototype. This ensures that for each . One only needs to store the maps and , cache the unique inverses, and reuse them. The same method applies to caching on reference sets, but not on other locations, since no redundancy arises in for . See Figure 4 for an illustration. Compared to general MGPs (see Table 1), the number of large linear system solves is now constant in the sample size, which significantly reduces computational cost.
Furthermore, Q-MGPs automatically adjust to settings where observed locations are on partly regular lattices, i.e., they are located at patterns repeating in space or time which emerge after initial inspections of the data. Appendix G, available online, outlines a simple algorithm to identify such patterns and create maps and . In such cases, we fix and . In addition to the above mentioned savings, we now do not have to compute and . If is not a regular lattice over the whole domain, is a lower bound and in general there are inverses to compute. If is a fully observed regular lattice and if (a varying intercept model), then we save in computing the full conditional covariances as well, since all . See Appendix C, available online, for details on choosing and .
Table 1: Number of expensive matrix operations per Gibbs iteration for MGPs (all cases) and for Q-MGPs under different data designs: irregular locations; patterned lattice with missing data; lattice with missing data; full lattice.


4.2 Improved mixing via parallel sampling
With caching, a much larger proportion of time is spent on sampling; parallelization may in general be achieved via appropriate node coloring algorithms (see e.g. Molloy and Reed,, 2002; Gonzalez et al.,, 2011; Lewis,, 2016), but this step is unnecessary in Q-MGPs as the colors in are set in advance independently of the data and result in efficient parallel sampling of the latent effects. Reference nodes of are colored to achieve independence conditional on realizations of nodes of all other colors. For example, we partition spatial domains () into regions and link each region to a reference node in a quadratic mesh. A “central” reference node will have two parents and two children, i.e. and , with respectively denoting left, bottom, right, top – refer to Figure 4 (left). We have and . The Markov blanket of , denoted as , is the set of neighbors of in the undirected “moral” graph , hence . The corresponding spatial process is such that . Denoting and , we note that . We partition reference nodes into four groups , such that , , , and . This pattern is repeated over the whole graph. Then, if , . Denoting by the other variables in the Gibbs sampler, we get:
Since parallelization is possible within each of the groups, only four serial steps are needed; time savings arise because each group is typically orders of magnitude larger than the number of available processors. Extensions to other tessellation schemes, higher-dimensional domains, and the associated graphs follow analogously.
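As a concrete sketch of this grouping (illustrative code, not the meshgp implementation), consider a two-dimensional quadratic mesh: coloring block (i, j) by the parities of i and j yields exactly four groups, and no two blocks sharing the same color are adjacent in the moral graph, since axis neighbors differ in one parity and co-parents differ in both.

```python
# Hypothetical sketch of four-color parallel Gibbs sweeps on a quadratic mesh;
# function names are illustrative, not the paper's code.
def block_color(i, j):
    """Color of reference block (i, j): the parities of the row and
    column indices define four groups."""
    return 2 * (i % 2) + (j % 2)

def gibbs_sweep(blocks, update_block):
    """One Gibbs sweep over all reference blocks: four serial color
    steps; blocks within a color step can be updated in parallel
    because they are not adjacent in the moral graph."""
    for color in range(4):
        for b in [b for b in blocks if block_color(*b) == color]:
            update_block(b)  # parallelizable across b (e.g. OpenMP threads)
```

Within each color step the per-block updates touch disjoint Markov blankets, so they can be dispatched to as many threads as are available.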
5 Data analysis
Satellite imaging and remote sensing data are nowadays collected in large quantities and processed for use in geology, ecology, forestry, and other fields, but clouds and atmospheric conditions obstruct aerial views and corrupt the data, creating gaps. Recovery of the underlying signal and quantification of the associated uncertainty are thus the major goals enabling practitioners in the natural sciences to fully exploit these data sources. Several scalable geostatistical models based on Gaussian processes have been implemented on tens or hundreds of thousands of data points, with few exceptions. For larger data sizes, one must either have a large time budget – usually several days – or reduce model flexibility and richness. Scalability concerns become the single most important issue in multivariate spatiotemporal settings. In fact, repeated collection of aerial images and multiple spatially-referenced predictors modeled to have a varying effect on the outcome have a multiplicative effect on data size. With no separability assumptions, the dimension of the latent spatial random effects that one may wish to recover will be huge even when the individual images would be manageable on their own.
The lack of software to implement scalable models for spatiotemporal data makes it difficult to compare our proposed approach with others in these settings. On the other hand, a recent article (Heaton et al.,, 2019) pits many state-of-the-art models against each other in a spatial () prediction contest. On the same data, we show in Appendix E, available online, that Q-MGPs can outperform all competitors in terms of predictive performance and coverage while using a similar computational budget.
5.1 Non-separable multivariate spatiotemporal base covariance
In our analyses, we choose a class of multivariate space-time cross-covariances that models the covariance between variables and at the space-time lags as:
(15) |
where (and with ) is the latent dissimilarity between variables and . In the resulting cross-covariance function in , each component of the -variate spatial process is represented by a point in a -dimensional latent space, . Refer to Apanasovich and Genton, (2010) for a more in-depth discussion. We set and , ; see Gneiting, (2002) for alternatives. We also fix , and seek to estimate a posteriori. The usual exponential covariance arises in univariate spatial settings.
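Equation (15) follows the Apanasovich and Genton (2010) construction; as a univariate illustration of the nonseparable space-time class it builds on, the sketch below evaluates a Gneiting (2002)-type covariance. The parameterization and names (a, c, alpha, beta) are an assumed textbook form, not the paper's exact notation.

```python
import math

def gneiting_cov(h, u, sigma2=1.0, a=1.0, c=1.0, alpha=1.0, beta=0.5):
    """Nonseparable space-time covariance of Gneiting (2002) type.
    h = spatial distance, u = temporal lag; beta in [0, 1] controls
    space-time interaction (beta = 0 gives a separable model), and
    u = 0 recovers the usual exponential spatial covariance."""
    psi = a * abs(u) ** (2.0 * alpha) + 1.0
    return sigma2 / psi * math.exp(-c * h / psi ** (beta / 2.0))
```

At zero temporal lag this reduces to sigma2 * exp(-c * h), consistent with the remark that the usual exponential covariance arises in univariate spatial settings.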

5.2 Synthetic data
We mimic real-world satellite imaging data analyzed later in Section 5.3 at a much smaller scale by generating 81 datasets from the model , where with and is a regular grid of size , resulting in total locations. We take where is as in (15), and . We generate one dataset for each combination of , temporal range , space-time separability , and spatial range .
We compare Q-MGPs with the similarly-targeted Gapfill method of Gerber et al., (2018) as implemented in the R package gapfill. We create “synthetic clouds” of radius and with center where to cover the outcomes at six randomly selected times for each of the 81 datasets. Outcomes at two of the remaining four time periods were then randomly selected to be completely unobserved at all but 10 locations in order to avoid errors from gapfill. Refer to Figure 5 for an illustration.
A Q-MGP model with was fit by partitioning each spatial axis into 10 intervals and the time axis into 5 intervals. The priors were , , , ; 7000 iterations of Gibbs sampling were run, of which 5000 were discarded as burn-in; thinning the remaining 2000 yielded a posterior sample of size 1000. For each of the 81 datasets we calculate the mean absolute prediction error (MAE) and the root mean squared prediction error (RMSE). Figure 6 compares Gapfill’s 90% intervals with 90% posterior equal-tailed credible intervals for the Q-MGP predictions obtained from 1000 posterior samples. In terms of MAE, the Q-MGP model outperformed Gapfill in all datasets; in terms of RMSE, it outperformed Gapfill in all but one dataset. The average MAE of Q-MGP was 0.4094 against Gapfill’s 0.5366; the average RMSE was 0.5308 against Gapfill’s 0.6820. The Q-MGP also yielded improved coverage of the prediction intervals, although some under-coverage was observed, possibly due to the large . This comparison may favor Q-MGPs as the data were generated from a Gaussian process. Appendix K, available online, confirms similar findings on non-Gaussian data (a GIF image).

5.3 NDVI data from the Serengeti ecosystem
Time series of Normalized Difference Vegetation Index (NDVI) derived from satellite imagery are used to understand spatial patterns in vegetation phenology. For such studies, image pixel-level NDVI values are observed over time to assess seasonal trends in vegetation green-up, growing season length and peak, and senescence. These analyses typically require NDVI values for all pixels over the region and time period of interest. As noted in the beginning of this section, atmospheric conditions, e.g., cloud cover, and sensor malfunction cause missing NDVI pixel values and hence predicting such values, i.e., gap-filling, is of key interest to the remote sensing community. Here, we consider NDVI data derived from the Landsat 8 sensor (which provides a 30×30 m spatial resolution pixel) taken over Serengeti National Park, Tanzania, Africa. These data were part of a larger study conducted by Desanker et al., (2019) that looked at environmental drivers of vegetation phenology change. The data cover an area of 30×30 km and 34 months, and correspond to 64 images of size 1000×1000 collected at 16-day intervals. Data on NDVI are complemented with elevation and soil moisture data, for a total of three spatially referenced predictors. We are thus interested in understanding their varying effect in space and time, in addition to predicting NDVI output at missing locations. We achieve both these goals by implementing model (13), where includes the intercept and three predictors; their varying effect will be represented by , which we recover by implementing Q-MGP models. Storing posterior samples of the multivariate spatially-varying coefficients for the full data with is impossible using our computing resources as each sample would be of size . For this reason, we consider two feasible setups. Denote by the number of observed and missing locations.
In model (1), we subsample each image to obtain 64 frames of size 500×500, and fit a regression model with resulting in a spatially-varying intercept model on observed locations, a total of locations for prediction, and a latent spatial random effect of the same size. The Q-MGP was fit using space-time regions of size 48.
The base covariance of (15) becomes a univariate non-separable spatiotemporal covariance as in Gneiting, (2002). In model (2), we aim to estimate the varying effect of elevation on NDVI. We subsample each image to obtain 64 frames of size 278×278, each covering an area of 25×25 km, and take resulting in and targeting the recovery of latent effects of size . Considering the additional computational burden of the multivariate latent effects, in this case we used , corresponding to smaller space-time regions of average size 31. In this model there is a single unknown in (15) which corresponds to the latent dissimilarity between the intercept and elevation. We thus consider as the unknown parameter. We assign priors for , , , and uniform priors to other covariance parameters (their support is reported in Table 2).



In both cases, approximate posterior samples of the latent random effects and the other unknown parameters were obtained by running the proposed Gibbs sampler for a total of 25,000 iterations. A posterior sample of size 500 was obtained by using the first 22,000 iterations as burn-in, and thinning the remaining 3,000 by a factor of 6. Additional computational details are in Appendix F, available online. Posterior summaries of the unknown parameters for these models are reported in Table 2, along with RMSE in predicting NDVI at 10,000 left-out locations, 95% posterior coverage at those locations, and run times. Both models achieved similar out-of-sample predictive performance and coverage. Figure 7 shows the NDVI predictions of model (2) at one of the 64 time points. This reveals that the varying effect of elevation on NDVI output (see e.g. Figure 8) is credibly different from zero at 42.54% of the space-time locations (95% c. i.). In particular, it highlights the extent to which higher elevation reduces vegetation. The spatial range is approximately 4km; the time range is about 8 days. The large estimated indicates that the correlation between the two covariates of the latent random process is very small at all spatial and temporal lags. The predicted NDVI and latent spatiotemporal effects are supplied as animated GIF images in the supplement.
| Q-MGP Model (1) | Q-MGP Model (2) |
---|---|---|
328125 | 156800 | |
1 | 2 | |
– | ||
95% coverage | ||
RMSE | ||
Time/it. (s) | ||
Time (hours) |
6 Discussion
We have developed a class of Bayesian hierarchical models for large spatial and spatiotemporal datasets based on linking domain partitions to directed acyclic graphs. These models can be tailored for specific algorithmic needs, and we have demonstrated the advantages of using a cubic tessellation scheme (Q-MGP) when targeting the efficient recovery of spatial random effects in Bayesian hierarchical models using Gibbs samplers.
When considering alternative computational strategies, the proposed Q-MGP may not be optimal. For example, Gaussian first-stage models enable marginalization of the latent spatial effects; posterior sampling of unknown covariance parameters via MCMC is then typically associated with better mixing. Future research may thus focus on identifying “optimal” DAGs for collapsed samplers. Furthermore, the blocked conditional independence structure of Q-MGPs may be suboptimal, as it corresponds to possibly restrictive conditional independence assumptions at neighboring locations. While we have not focused on the effect of different tessellations or partitioning choices in this article, alternative tessellation schemes (e.g. hexagonal) may be associated with less stringent assumptions and possibly better performance, while retaining all the desirable features of Q-MGP.
Other natural extensions to high-dimensional spatiotemporal statistics include settings where there is a very large number of spatiotemporal outcomes in addition to a large number of spatial and temporal locations. Here there are a few different avenues. One approach is in the same spirit as the joint modeling pursued here, but instead of modeling the cross-covariance functions explicitly, as has been done here, we pursue dimension reduction using factor models (see, e.g., Christensen and Amemiya,, 2003; Lopes et al.,, 2008; Ren and Banerjee,, 2013; Taylor-Rodriguez et al.,, 2019). The aforementioned references model the latent factors using spatial processes, some relying on scalable low-rank predictive processes or the NNGP. We believe that modeling latent factors using spatiotemporal MGPs will impart some of the computational and inferential benefits seen here. However, this will need further development especially with regard to identifiability of loading matrices (Lopes et al.,, 2008; Ren and Banerjee,, 2013) and process parameters.
A different approach to multivariate spatial modeling has relied upon conditional or hierarchical specifications. This has been well articulated in the text by Cressie and Wikle, (2011); see also Royle and Berliner, (1999) and the recent developments in Cressie and Zammit-Mangion, (2016). An advantage of the hierarchical approach is that the multivariate processes are valid stochastic processes, essentially by construction and without requiring spectral representations, and can also impart considerable computational benefits. It will be interesting to extend the ideas in Cressie and Zammit-Mangion, (2016) to augmented spaces of DAGs to further introduce conditional independence, and therefore sparsity, in MGP models with high-dimensional outcomes.
Finally, it is worth pointing out that alternate computational algorithms, particularly tuned for high-dimensional Bayesian models, should also be explored. Recent developments on algorithms based upon classes of piecewise deterministic Markov processes (see e.g., Fearnhead et al.,, 2018; Bierkens et al.,, 2019, and references therein) that avoid Gibbs samplers and even reversible MCMC algorithms are being shown to be increasingly effective for high-dimensional Bayesian inference. Adapting such algorithms to MGP and Q-MGP for scalable Bayesian spatial process models will constitute natural extensions of our current offering.
Acknowledgements
Finley and Peruzzi were supported by National Science Foundation (NSF) EF-1253225 and DMS-1916395, and National Aeronautics and Space Administration’s Carbon Monitoring System project. Peruzzi was supported in part by 1R01ES028804 of the National Institute of Environmental Health Sciences of the National Institutes of Health and European Union project 856506. Banerjee was supported by NSF DMS-1513654, IIS-1562303, and DMS-1916349.
Appendix
Appendix A Spatial Meshed Process
Let be the base process, the fixed reference set, and . Then the joint density is proper. In fact, using the definitions of and we get
We now verify the Kolmogorov consistency conditions. Take as any permutation of . Then call and we get
Set membership is invariant to permutations, so , and therefore the order of the locations in no way changes . Therefore , i.e.
Next, take another location . Call . We want to show that . If then hence
If
Appendix B Meshed Gaussian Process
In graph , the number of parents of node is , i.e. . Therefore is a matrix which can be partitioned by column in blocks:
whose th block is of size and corresponds to the th block of . is then proportional to
and we can define as the matrix such that . Analogously to , we can partition by column in blocks,
where each block
and notice that implies . Then,
where is a block-matrix whose th block-row is for , and . The resulting precision matrix is thus a matrix made of blocks of sizes for such that . We denote its block as and note that by symmetry its transpose is . We find the nonzero blocks of in its block-diagonal, and for in the block such that and are connected in the moralized subgraph of obtained by subsetting to the nodes and linking those that share a child; in other words for all , either (1) or vice-versa, or (2) there exists such that . The sparsity of thus depends on the specific choice for and the corresponding moralized graph (Cowell et al.,, 1999). Suppose that has vertices and induces whose number of undirected edges is . This means that the number of “missing” connections is . Assume that each subset of the reference set is of size , so that each block of is of size . Then the precision matrix will have zeros. Given a partition for which induces a partition for into disjoint subsets, one can thus increase sparsity in the precision matrix at by choosing a graph which induces a lower . But note that all nodes linked to locations outside the reference set, i.e. any , do not influence the precision matrix .
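The nonzero block pattern just described can be computed directly from the DAG by moralization; the following is a structural sketch (illustrative code with assumed names; the actual block values require the covariance function and are not computed here).

```python
def precision_blocks(parents):
    """Nonzero block pattern of the MGP precision matrix: block (i, j)
    is nonzero iff i == j, i and j are linked in the DAG, or they share
    a child ("married" co-parents in the moral graph). `parents` maps
    each node to its list of parents. A structural sketch only."""
    nz = {(v, v) for v in parents}
    for child, pa in parents.items():
        for p in pa:
            nz.add((child, p))
            nz.add((p, child))           # DAG edges (symmetrized)
        for p in pa:
            for q in pa:
                if p != q:
                    nz.add((p, q))       # co-parents become adjacent
    return nz
```

Sparser DAGs (fewer edges and fewer shared children) directly yield more zero blocks in the precision matrix, matching the counting argument above.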
The covariance function of the new process will then be defined using : consider two locations and in . If both are in then , for some , and . If with parents and then
Finally if and then
Domain tessellations in meshed GPs will induce discontinuities in the covariance function when evaluated at locations that span different regions; however, is continuous if and are in the same domain region. To show this, we adapt a result from Datta et al., 2016a . Let denote the Euclidean distance matrix at pairs of locations from sets . Take as the chosen domain partition, and let be the finite union of their boundaries, each of which has measure zero. Then consider and let be the parent set for , i.e. the set of locations that are mapped to parents of the node to which is mapped. Now take and for some (other cases follow). We assume that and is isotropic and continuous. Then for and , implying that . Taking small enough implies thus . Then suppose i.e. is the first element of the parent set of . Consequently and therefore
Appendix C Choosing the tessellation, , and

Tessellation
Meshed GPs based on tessellations correspond to fixed DAGs constructed from the immediate neighbors bordering each region. For this reason they are unlike “nearest-neighbor” models in which the number of neighbors is a parameter to be fixed. Increasing the parent set in tessellated meshed GPs amounts to changing the tessellation or increasing the size of each region. As a reviewer points out, estimating the unknown covariance parameter may be facilitated by choosing parent sets that include more distant points (see e.g. Stein,, 2014). Doing so in tessellated meshed GPs amounts to choosing a tessellation with elongated regions. With axis-parallel partitioning, one can partition one of the axes into a much smaller number of intervals than the others, resulting in parent sets that are elongated in that direction. Other shapes, such as a “herringbone” tessellation pattern, might be better at capturing variation in more spatial directions.
Multivariate data pose additional challenges, as one can potentially elect to use different tessellations on different margins of the multivariate process. However, considering different tessellations may induce complicated connections in the underlying DAG; this goes against our modeling approach, which is to keep a simple, manageable graph structure in order to speed up computations. In cases in which one seeks to maintain good spatial coverage and is willing to make conditional independence assumptions on the variables, we suggest using a single tessellation, but perhaps dropping inter-variable edges across nodes. To clarify, we may assume that , the th margin of the multivariate spatial process at location , is conditionally independent of if and and are not in the same region. This design choice maintains long-range spatial dependence along the same margin of , but reduces it across different margins.
M – the number of regions
In a meshed GP based on axis-parallel domain partitioning, the number of partitioning regions is defined as where is the number of intervals into which axis is partitioned. As seen above, determines the number of Cholesky factors to compute, as well as the complexity of each factor computation (given by the size of the individual matrices). A small corresponds to larger regions, which in turn will enlarge the parent sets of each block. For this reason, should be chosen as small as possible, given the allotted computational budget.
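As a minimal sketch of this axis-parallel partitioning step (argument names are illustrative), each location can be mapped to its region by binning each coordinate into one of the intervals along that axis:

```python
def assign_blocks(coords, n_intervals, bounds):
    """Map each location to an axis-parallel block index. n_intervals[k]
    is the number of intervals along axis k, so the total number of
    regions is the product of the entries of n_intervals. A sketch of
    the partitioning idea, not the meshgp implementation."""
    keys = []
    for x in coords:
        key = []
        for k, (lo, hi) in enumerate(bounds):
            t = (x[k] - lo) / (hi - lo)
            # Clamp so the upper boundary falls in the last interval.
            key.append(min(int(t * n_intervals[k]), n_intervals[k] - 1))
        keys.append(tuple(key))
    return keys
```

Coarser partitions (fewer intervals per axis) produce larger blocks and hence larger, more expensive matrix factorizations per region.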
and
The two sets of locations and differ in that the former are mapped to nodes of graph , whereas the latter to with no children. Since nodes are independent of each other given realizations of , they provide little information on spatial dependence; in fact, the set is akin to the set of knots in a predictive process model. Unlike those models, can be chosen to be as large as the set of observations, and left empty. Generally, it is possible to choose to smoothly model spatial dependence across larger regions of unobserved locations, but convergence of the Gibbs sampler may be slower at locations that are very far from the nearest observed ones. Therefore, a convenient choice which is possibly safer for MCMC convergence is to let them coincide, i.e. .
Matters are less straightforward when redundancies occur. Consider locations such that where and for . Denote the set of such locations as ; this is a grid of equally-spaced locations. If observed locations are on a quantized grid of coordinates, i.e. there are such that , choosing may be inefficient as it may reduce the number of redundant ’s and thus increase computation time. Instead, choosing such that (possibly ) may be much more efficient.
In Figure 9 we show a small subsample of a time slice of the Serengeti data. In that case, we either extend to the whole domain (Fig. 9, center) or extend it just enough to cover all observed locations (Fig. 9, right). For predictions, in the former case we use samples from the full conditional distribution . In the latter case, we map empty blocks to nodes : predictions thus easily proceed in parallel since nodes are independent of each other given nodes in , with being composed of the nearest nodes along each axis-parallel direction.
Appendix D Application: Q-MGP on multivariate response
We consider the multivariate regression model we previously defined:
(16) |
with and and , and where is the multivariate point-referenced outcome. In this section, we specifically consider the case , , and drop and from the model, resulting in
(17) |
where some of the components of may be unobserved. We construct a cross-covariance function based on latent distances among the variables. To do so, we set
and for , . We then model as a Q-MGP with base cross-covariance
(18) |
which is a valid cross-covariance derived from eq. (7) in Apanasovich and Genton, (2010).
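A simplified, separable version of this latent-distance construction can be sketched as follows: each variable is embedded as a point in a latent space, and covariance decays exponentially in both the spatial distance and the latent inter-variable distance. Parameter names (phi, zeta) are illustrative; eq. (18) itself uses the richer nonseparable Apanasovich and Genton (2010) form.

```python
import math

def latent_cross_cov(h, vi, vj, latent_coords, phi=1.0, zeta=1.0, sigma2=1.0):
    """Separable sketch of a latent-distance cross-covariance between
    variables vi and vj at spatial distance h: the Euclidean distance
    between their latent coordinates plays the role of the latent
    dissimilarity. Valid, since exp(-zeta * d) is a covariance kernel
    on the latent embedding."""
    d = math.dist(latent_coords[vi], latent_coords[vj])
    return sigma2 * math.exp(-phi * h - zeta * d)
```

Variables embedded close together in the latent space are strongly cross-correlated at all spatial lags; moving a variable's latent coordinate far from the others decouples it, mirroring the interpretation of the estimated latent dissimilarity in Section 5.3.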
D.1 Synthetic data
We generate multivariate outcomes with at spatial locations on a regular grid on for a total sample size of , sampling the latent effects from a Gaussian Process with cross-covariance (18) and setting the parameters as
with measurement errors fixed at . For predictions, we set if , if , and let be observed at all spatial locations. This setup highlights the advantages of a multivariate approach, given the small latent distance between the second and third margins (given by ) and the non-separability of the covariance function on the variables, as given by . We fit a Q-MGP model on regions. We compare the predictions of this multivariate specification with those obtained from univariate Q-MGP models applied to each margin independently. Domain partitioning for the univariate models along each margin was chosen to match the total computing time of the multivariate model. We run MCMC for 7000 iterations, of which 5000 are discarded as burn-in; thinning the remaining 2000 at a 2:1 ratio results in an approximate posterior sample of size 1000. The results are reported in Table 3, whereas Figure 10 shows the data along with the results from the two approaches.
MAE | RMSE | Coverage | Run Time | |
---|---|---|---|---|
Q-MGP (multivariate) | 0.4439 | 0.3200 | 96.39 | 1.04 min |
Q-MGP (univariate) | 0.5056 | 0.4297 | 95.14 | 1.11 min |

D.2 Spatial multivariate analysis of NDVI in the Serengeti region
We consider the Serengeti dataset previously described; in this section, we build a spatial dataset by considering the NDVI measurement on December 17, 2016, along with elevation and soil moisture at the same spatial locations. The resulting dataset corresponds to variables and spatial locations, for a total size of ; cloud cover obstructed the sensor view, resulting in NDVI being missing at spatial locations. The goal is thus to fill those gaps. Unlike the previous analysis, we are not interested in mapping the varying effect of covariates on the outcome; instead, we model all variables jointly, aiming to recover the conditional distribution of NDVI given the others. We ran a total of 20,000 MCMC iterations, thinning the last 5,000 by keeping one in every five draws to obtain a posterior sample of size 1000. Run-time averaged about 1 second per iteration using 50 threads.

Appendix E Comparisons with methods in Heaton et al., (2019)
In recent work, Heaton et al., (2019) have reviewed and compared 13 state-of-the-art models for large spatial datasets in a predictive challenge involving (1) simulated and (2) real-world spatial data (daytime land surface temperatures as measured by the Terra instrument onboard the MODIS satellite on August 4, 2016, Level-3 data). The two datasets are available at github.com/finnlindgren/heatoncomparison. The total number of locations is the same in both cases (), with the goal of predicting outcomes when missing. Both datasets include available locations; the test set is of size for the simulated data, for the MODIS data due to cloud cover.

We estimate two MGP models for each dataset by partitioning the spatial domain into (resp. ) rectangular regions for an average block size of (resp. ) spatial locations, and fix as a cubic mesh. We assign blocks to to cover the observed locations; the remaining ones will be used for prediction as in the right plot of Figure 9. Finally, we use an exponential base covariance function. We ran our Gibbs sampler using , , as priors, and respectively as starting values. A log-Normal proposal function was used for the Metropolis update of . We ran a total of 6,000 Monte Carlo iterations, with 4,000 dropped as burn-in, and thinning 2:1 the remaining ones to obtain a sample of size 1,000 that approximates the joint posterior distribution which we then use to obtain predictions. In both cases, the algorithm ran on 40 threads. We report the results of our analysis at the top of Table 5 along with results from select models tested in Heaton et al., (2019) to facilitate comparisons, whereas prediction plots for both datasets are in Figure 12.
Refer to the cited work for a more in-depth overview. We selected: nearest-neighbor Gaussian process (NNGP), conjugate (Finley et al.,, 2019) and response (Datta et al., 2016a, ) algorithms; the multiresolution approximation (MRA) model of Katzfuss, (2017); a stochastic partial differential equations (SPDE) approach estimated via integrated nested Laplace approximations (Rue et al.,, 2009; Lindgren et al.,, 2011); metakriging (Guhaniyogi and Banerjee,, 2018).
Simulated data | MAE | RMSE | Coverage | Run Time | N. Cores | Code Lang. |
Q-MGP, | 0.6270 | 0.8721 | 95.46 | 16.4 min | 40 | C++ (Armadillo) |
Q-MGP, | 0.6136 | 0.8407 | 95.62 | 133.2 min | 40 | C++ (Armadillo) |
NNGP Conjugate | 0.65 | 0.88 | 96 | 1.99 min | 10 | C++ |
NNGP Response | 0.65 | 0.88 | 96 | 45.56 min | 10 | C++ |
MRA | 0.61 | 0.83 | 93 | 13.57 min | 1 | Matlab |
SPDE | 0.62 | 0.86 | 100 | 138.34 min | 2 | R (inla) |
Metakriging | 0.74 | 0.97 | 99 | 2888.89 min | 30 | R (spBayes)
MODIS data | MAE | RMSE | Coverage | Run Time | N. Cores | Code Lang. |
Q-MGP, | 1.1151 | 1.5598 | 95.14 | 16.5 min | 40 | C++ (Armadillo) |
Q-MGP, | 1.0729 | 1.5034 | 95.20 | 133.7 min | 40 | C++ (Armadillo) |
NNGP Conjugate | 1.21 | 1.64 | 95 | 2.06 min | 10 | C++ |
NNGP Response | 1.24 | 1.68 | 94 | 42.85 min | 10 | C++ |
MRA | 1.33 | 1.85 | 92 | 15.61 min | 1 | Matlab |
SPDE | 1.10 | 1.53 | 97 | 120.33 min | 2 | R (inla) |
Metakriging | 2.08 | 2.50 | 89 | 2888.52 min | 30 | R (spBayes) |
In terms of predictive performance, coverage, and computation time, both implemented models are competitive with the top-ranking ones in Heaton et al., (2019). In particular, our approach (with ) is much faster than Bayesian models estimated via MCMC – note that NNGP Conjugate uses cross-validation to select suitable covariance parameters. The slower alternative with – corresponding to a much coarser partitioning, i.e. larger regions – was made possible by caching the redundant matrix operations, given that the data are on a regular lattice. We chose by targeting a total computing time similar to the SPDE model, which achieved the lowest predictive error in the MODIS dataset. Computing times should not differ dramatically on a dual-CPU Intel Xeon server as in Heaton et al., (2019).
Appendix F C++ implementation of Q-MGP in meshgp
All applications were run on a 512GB-memory, dual-CPU server with two Intel Xeon E5-2699v3 processors, each with 18 cores at 2.3GHz, running Debian Linux and R version 3.6 linked to the OpenBLAS libraries. Source code for implementing Q-MGP models is available as R package meshgp ( github.com/mkln/meshgp ). The meshgp package is written in Armadillo (Sanderson and Curtin, 2016, a C++ library for linear algebra) which can easily be linked to high-performance libraries. Parallelization of all high-dimensional operations is implemented via OpenMP (Dagum and Menon,, 1998). R packages Rcpp (Eddelbuettel and François,, 2011) and RcppArmadillo (Eddelbuettel and Sanderson,, 2014) are used for interfacing the C++ source with R.
Appendix G Caching algorithm
• Sort using column 1; resolve ties using columns 2 to .
• Calculate as the matrix of size such that
• Calculate .
Algorithm 1 can be used by taking for , for , and for . The resulting dictionaries can be used to build a dictionary for with unique values. Each iteration of the Gibbs sampler becomes cheaper if the number of unique values in is .
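The dictionary construction above can be sketched as follows: blocks whose parent sets have identical relative-coordinate patterns map to the same cache entry and can thus share a single expensive factorization (illustrative code with assumed names, not the meshgp implementation).

```python
def build_cache_map(parent_offsets):
    """parent_offsets[b] lists the coordinates of block b's parent
    locations relative to the block's own anchor. Blocks with identical
    offset patterns share one cached covariance factorization. Returns
    (cache key per block, number of unique patterns)."""
    cache, keys = {}, []
    for off in parent_offsets:
        # Sort lexicographically (with rounding against floating-point
        # jitter) so identical patterns hash identically.
        key = tuple(sorted(tuple(round(c, 9) for c in pt) for pt in off))
        cache.setdefault(key, len(cache))
        keys.append(cache[key])
    return keys, len(cache)
```

On a fully observed regular lattice almost every block shares the same pattern, so the number of unique factorizations collapses to a handful regardless of the number of regions.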
Appendix H Compute time comparisons with NNGPs
Figure 13 shows the time-per-iteration of Q-MGP, on a fully-observed slice of the Serengeti data, , choosing , and with or without caching. corresponds to blocks with locations (all blocks are of dimension ), whereas to blocks of dimension corresponding to squares of size , respectively. Caching is optional; in this case we showcase the speed differential between a favorable case (regular lattice) and the worst case (completely irregularly-spaced observations) scenarios by disabling caching.

We compare our results with the NNGP sequential sampler of Datta et al., 2016a as implemented in the spNNGP package for R, choosing . This fully Bayesian NNGP recovers the spatial random effects and is also estimated via a Gibbs sampler, but unlike a Q-MGP, it does not use caching or sample in parallel. Computing times of Q-MGP models without caching are comparable to those of NNGP models; the Q-MGP model with block size is about 30% faster than the NNGP model with 10 neighbors.
The optional caching feature results in a much smaller computational penalty when considering large regions. The large-region model is 25% slower than the small-region model when caching is activated, but 250% slower if caching is disabled because many more large matrix inverses are computed. The caching feature of Q-MGP models allowed us to design the best-performing model for the real-world application of Heaton et al., (2019), see Table 5.
Appendix I Effective sample size of MCMC posterior samples
Markov-chain Monte Carlo algorithms output correlated samples from the posterior distribution, and their efficiency is decreased when successive samples are highly autocorrelated. The effective sample size (ESS) is the size of an independent sample equivalent to the correlated sample at hand. Highly autocorrelated Markov chains result in lower ESS. This scenario arises in geostatistical settings with large-scale spatial dependency, as the high-dimensional latent spatial random effects are highly correlated at nearby locations. For this reason, Gibbs samplers such as the one proposed in this article and the sequential NNGP sampler of Datta et al., 2016a are expected to be negatively affected by slowly-decaying spatial correlation.


With this in mind, we now compare the parallel sampler of Q-MGPs to a sequential NNGP sampler (from R package spNNGP, Finley et al., 2020) in terms of ESS of the latent random effects. We generate on a regular grid where , , the spatial process is , for , for all and using parameters and , and nugget . The goal is to sample a posteriori. We assign priors , , and use the true values of as starting values for the Gibbs sampler of both NNGP and Q-MGP. The starting value of was set to a vector of zeros of size . We let the Markov chain run for a total of 2000 iterations, discarding the first 1000. The NNGP model used neighbors, whereas the Q-MGP model used regions obtained by axis-parallel partitioning the spatial domain along the two dimensions into and intervals, respectively. We chose to achieve about the same computation time when caching was disabled (actual Q-MGP run times were about 2.6 to 3 times faster than NNGP-sequential once caching was turned on). Figure 14 shows the effective sample size of the resulting approximated posterior sample of , averaged across all spatial locations. While both models evidence a degradation of performance with long-range spatial dependence (small ), the NNGP model shows significantly worse performance. With low spatial decay, the Q-MGP model exhibits effective sample sizes up to 2.5 times larger than the equivalent NNGP model. The smaller effective sample size of NNGPs implies a longer effective runtime to achieve similar performance. Coupled with speed advantages due to caching, we estimate Q-MGP models to result in an effective improvement in compute times up to one order of magnitude over sequential NNGPs.
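For reference, the per-location ESS summarized in Figure 14 can be computed with a standard initial-positive-sequence estimator (Geyer, 1992); a plain-Python sketch follows (this is a textbook estimator, not the specific ESS routine used in the paper).

```python
def effective_sample_size(chain):
    """ESS via S / (1 + 2 * sum of autocorrelations), truncating the
    sum at the first non-positive autocovariance estimate."""
    S = len(chain)
    mean = sum(chain) / S
    c0 = sum((x - mean) ** 2 for x in chain) / S
    if c0 == 0:
        return float(S)
    acsum = 0.0
    for lag in range(1, S):
        ck = sum((chain[t] - mean) * (chain[t + lag] - mean)
                 for t in range(S - lag)) / S
        if ck <= 0:
            break  # stop at the first non-positive autocovariance
        acsum += ck / c0
    return S / (1.0 + 2.0 * acsum)
```

Chains with slowly decaying positive autocorrelation, as produced by long-range spatial dependence, accumulate a large autocorrelation sum and hence a small ESS.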
Appendix J Using MGPs with

MGPs can be implemented on a unidimensional domain by splitting the real line into disjoint intervals, then choosing a finite reference set of locations , linking each of them to a reference node based on the interval it falls into, and using the DAG (see e.g. Figure 15) to establish a process at all other locations. Grouping the reference locations is not an appealing strategy if the chosen covariance exhibits the screening effect (see e.g. Stein (2002) for a more general discussion), in which case there may be numerical errors in calculating . In fact, suppose for all and and in . If exhibits the screening effect, where . Therefore, one must either choose a valid covariance function, or reduce the size of the reference sets to 1.
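The construction in the first sentence can be sketched in a few lines: group 1-D reference locations into the disjoint intervals delimited by a set of cut points, and link each interval-node to its predecessor, giving a path-shaped DAG. This is a minimal illustration of the grouping step only, under the simplifying assumption of a left-to-right ordering of nodes.

```python
import numpy as np

def interval_dag(locations, cut_points):
    """Group 1-D reference locations into disjoint intervals and link
    each interval-node to its predecessor, forming a path DAG.
    Returns the node index of each location and a parent map."""
    locations = np.asarray(locations, dtype=float)
    node_of = np.digitize(locations, cut_points)  # node index per location
    n_nodes = len(cut_points) + 1
    parents = {j: ([j - 1] if j > 0 else []) for j in range(n_nodes)}
    return node_of, parents
```

With cut points at 1.0 and 2.0, locations 0.1 and 0.5 share node 0, location 1.2 belongs to node 1 (parent 0), and 2.7 to node 2 (parent 1).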
Appendix K Recovering missing pixels of an animated GIF image
| Lion GIF | MAE | RMSE | Coverage (90%) |
|---|---|---|---|
| Q-MGP, | 0.0341 | 0.0715 | 93.28 |
| Gapfill | 0.0480 | 0.0923 | 62.76 |
We compare Q-MGPs with Gapfill (Gerber et al., 2018) on non-Gaussian data in the form of an animated GIF image. The original GIF collects 30 frames, each of size , for a total of 1.5 million locations. We reduce the data size to by downsampling each of the 30 frames at a 16:1 ratio. We simulate cloud cover by placing random clouds in 5 of the 30 frames, covering all but 10 locations in another 5 frames, and covering 50% of locations at random in a further 5 frames; the remaining 15 frames were left untouched. See Figure 16. The resulting dataset includes non-missing observations and missing ones to be predicted. We implement a Q-MGP model with found by partitioning the spatial coordinates into and intervals, and time into intervals. We run the Gibbs sampler for 15,000 iterations; we discard the first 10,000 and keep one in five of the remaining 5,000 to obtain a sample of size 1,000 from the approximate posterior distribution, which we use for predictions. We used the same base covariance, prior distributions, and starting values as in previous sections, for a total run time of 35 minutes (0.14 s/iteration) on 11 threads. As seen in Table 6 and from a visual inspection of Figure 16, Q-MGP outperforms Gapfill.
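The scores reported in Table 6 can be computed from posterior predictive samples as sketched below: MAE and RMSE of the predictive mean, and empirical coverage of the central 90% credible interval. This is a generic illustration of the metrics, not the evaluation code used for the experiment.

```python
import numpy as np

def prediction_scores(samples, truth, level=0.90):
    """MAE and RMSE of the posterior predictive mean, plus empirical
    coverage (in percent) of the central `level` credible interval.
    `samples` has shape (n_draws, n_locations)."""
    samples = np.asarray(samples, dtype=float)
    truth = np.asarray(truth, dtype=float)
    pred_mean = samples.mean(axis=0)
    mae = np.mean(np.abs(pred_mean - truth))
    rmse = np.sqrt(np.mean((pred_mean - truth) ** 2))
    alpha = (1.0 - level) / 2.0
    lo = np.quantile(samples, alpha, axis=0)
    hi = np.quantile(samples, 1.0 - alpha, axis=0)
    coverage = 100.0 * np.mean((truth >= lo) & (truth <= hi))
    return mae, rmse, coverage
```

When the held-out values are drawn from the same distribution as the predictive samples, the empirical coverage of the 90% interval lands near the nominal 90%, as in the well-calibrated Q-MGP row of the table.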

References
- Apanasovich and Genton, (2010) Apanasovich, T. V. and Genton, M. G. (2010). Cross-covariance functions for multivariate random fields based on latent dimensions. Biometrika, 97:15–30. doi:10.1093/biomet/asp078.
- Banerjee, (2017) Banerjee, S. (2017). High-dimensional Bayesian geostatistics. Bayesian Analysis, 12(2):583–614. doi:10.1214/17-BA1056R.
- Banerjee, (2020) Banerjee, S. (2020). Modeling massive spatial datasets using a conjugate Bayesian linear modeling framework. Spatial Statistics, in press. doi:10.1016/j.spasta.2020.100417.
- Banerjee et al., (2010) Banerjee, S., Finley, A. O., Waldmann, P., and Ericsson, T. (2010). Hierarchical spatial process models for multiple traits in large genetic trials. Journal of American Statistical Association, 105(490):506–521. doi:10.1198/jasa.2009.ap09068.
- Banerjee et al., (2008) Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society, Series B, 70:825–848. doi:10.1111/j.1467-9868.2008.00663.x.
- Bierkens et al., (2019) Bierkens, J., Fearnhead, P., and Roberts, G. (2019). The zig-zag process and super-efficient sampling for Bayesian analysis of big data. The Annals of Statistics, 47(3):1288–1320. doi:10.1214/18-AOS1715.
- Chen et al., (2008) Chen, Y., Davis, T. A., Hager, W. W., and Rajamanickam, S. (2008). Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate. ACM Trans. Math. Softw., 35(3). doi:10.1145/1391989.1391995.
- Christensen and Amemiya, (2003) Christensen, W. F. and Amemiya, Y. (2003). Modeling and prediction for multivariate spatial factor analysis. Journal of Statistical Planning and Inference, 115(2):543 – 564. doi:10.1016/S0378-3758(02)00173-8.
- Cover and Thomas, (1991) Cover, T. M. and Thomas, J. A. (1991). Elements of information theory. Wiley Series in Telecommunications and Signal Processing. Wiley Interscience.
- Cowell et al., (1999) Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999). Probabilistic Networks and Expert Systems. Springer-Verlag, New York.
- Cressie and Johannesson, (2008) Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society, Series B, 70:209–226. doi:10.1111/j.1467-9868.2007.00633.x.
- Cressie and Zammit-Mangion, (2016) Cressie, N. and Zammit-Mangion, A. (2016). Multivariate spatial covariance models: a conditional approach. Biometrika, 103(4):915–935. doi:10.1093/biomet/asw045.
- Cressie and Wikle, (2011) Cressie, N. A. C. and Wikle, C. K. (2011). Statistics for spatio-temporal data. Wiley series in probability and statistics. Hoboken, N.J. Wiley.
- Dagum and Menon, (1998) Dagum, L. and Menon, R. (1998). OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55.
- Datta et al., (2016a) Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016a). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111:800–812. doi:10.1080/01621459.2015.1044091.
- Datta et al., (2016b) Datta, A., Banerjee, S., Finley, A. O., Hamm, N. A. S., and Schaap, M. (2016b). Nonseparable dynamic nearest neighbor Gaussian process models for large spatio-temporal data with an application to particulate matter analysis. The Annals of Applied Statistics, 10:1286–1316. doi:10.1214/16-AOAS931.
- Davis, (2006) Davis, T. A. (2006). Direct Methods for Sparse Linear Systems. SIAM, Philadelphia, PA. doi:10.1137/1.9780898718881.
- Desanker et al., (2019) Desanker, G., Dahlin, K. M., and Finley, A. O. (2019). Environmental controls on Landsat-derived phenoregions across an East African megatransect. Ecosphere, in press.
- Eddelbuettel and François, (2011) Eddelbuettel, D. and François, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8):1–18. doi:10.18637/jss.v040.i08.
- Eddelbuettel and Sanderson, (2014) Eddelbuettel, D. and Sanderson, C. (2014). RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Computational Statistics and Data Analysis, 71:1054–1063. doi:10.1016/j.csda.2013.02.005.
- Eidsvik et al., (2014) Eidsvik, J., Shaby, B. A., Reich, B. J., Wheeler, M., and Niemi, J. (2014). Estimation and prediction in spatial models with block composite likelihoods. Journal of Computational and Graphical Statistics, 23:295–315. doi:10.1080/10618600.2012.760460.
- Fearnhead et al., (2018) Fearnhead, P., Bierkens, J., Pollock, M., and Roberts, G. O. (2018). Piecewise deterministic Markov processes for continuous-time Monte Carlo. Statistical Science, 33(3):386–412. doi:10.1214/18-STS648.
- Finley et al., (2012) Finley, A. O., Banerjee, S., and Gelfand, A. E. (2012). Bayesian dynamic modeling for large space-time datasets using Gaussian predictive processes. Journal of Geographical Systems, 14:29–47. doi:10.1007/s10109-011-0154-8.
- Finley et al., (2020) Finley, A. O., Datta, A., and Banerjee, S. (2020). R package for Nearest Neighbor Gaussian Process models. arXiv:2001.09111.
- Finley et al., (2019) Finley, A. O., Datta, A., Cook, B. D., Morton, D. C., Andersen, H. E., and Banerjee, S. (2019). Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics, 28:401–414. doi:10.1080/10618600.2018.1537924.
- Furrer et al., (2006) Furrer, R., Genton, M. G., and Nychka, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15:502–523. doi:10.1198/106186006X132178.
- Genton and Kleiber, (2015) Genton, M. G. and Kleiber, W. (2015). Cross-covariance functions for multivariate geostatistics. Statistical Science, 30:147–163. doi:10.1214/14-STS487.
- Gerber et al., (2018) Gerber, F., Furrer, R., Schaepman-Strub, G., de Jong, R., and Schaepman, M. E. (2018). Predicting missing values in spatio-temporal remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 56(5):2841–2853. doi:10.1109/TGRS.2017.2785240.
- Gneiting, (2002) Gneiting, T. (2002). Nonseparable, stationary covariance functions for space-time data. Journal of the American Statistical Association, 97:590–600. doi:10.1198/016214502760047113.
- Gonzalez et al., (2011) Gonzalez, J., Low, Y., Gretton, A., and Guestrin, C. (2011). Parallel Gibbs sampling: From colored fields to thin junction trees. In Gordon, G., Dunson, D., and Dudík, M., editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 324–332, Fort Lauderdale, FL, USA. PMLR.
- Gramacy and Apley, (2015) Gramacy, R. B. and Apley, D. W. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24:561–578. doi:10.1080/10618600.2014.914442.
- Gramacy and Lee, (2008) Gramacy, R. B. and Lee, H. K. H. (2008). Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103:1119–1130. doi:10.1198/016214508000000689.
- Guhaniyogi and Banerjee, (2018) Guhaniyogi, R. and Banerjee, S. (2018). Meta-kriging: Scalable Bayesian modeling and inference for massive spatial datasets. Technometrics, 60(4):430–444. doi:10.1080/00401706.2018.1437474.
- Guhaniyogi et al., (2011) Guhaniyogi, R., Finley, A. O., Banerjee, S., and Gelfand, A. E. (2011). Adaptive Gaussian predictive process models for large spatial datasets. Environmetrics, 22:997–1007. doi:10.1002/env.1131.
- Guinness, (2018) Guinness, J. (2018). Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics, 60(4):415–429. doi:10.1080/00401706.2018.1437476.
- Heaton et al., (2019) Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F., and Zammit-Mangion, A. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 24(3):398–425. doi:10.1007/s13253-018-00348-w.
- Katzfuss, (2017) Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, 112:201–214. doi:10.1080/01621459.2015.1123632.
- Katzfuss and Guinness, (2017) Katzfuss, M. and Guinness, J. (2017). A general framework for Vecchia approximations of Gaussian processes. arXiv:1708.06302.
- Kaufman et al., (2008) Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103:1545–1555. doi:10.1198/016214508000000959.
- Lauritzen, (1996) Lauritzen, S. L. (1996). Graphical Models. Clarendon Press, Oxford, UK.
- Lewis, (2016) Lewis, R. (2016). A guide to graph colouring. Springer International Publishing. doi:10.1007/978-3-319-25730-3.
- Lindgren et al., (2011) Lindgren, F., Rue, H., and Lindström, J. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B, 73:423–498. doi:10.1111/j.1467-9868.2011.00777.x.
- Lopes et al., (2008) Lopes, H. F., Salazar, E., and Gamerman, D. (2008). Spatial dynamic factor analysis. Bayesian Analysis, 3(4):759 – 792. doi:10.1214/08-BA329.
- Molloy and Reed, (2002) Molloy, M. and Reed, B. (2002). Graph colouring and the probabilistic method. Springer-Verlag Berlin Heidelberg. doi:10.1007/978-3-642-04016-0.
- Quiroz et al., (2019) Quiroz, Z. C., Prates, M. O., and Dey, D. K. (2019). Block nearest neighbor Gaussian processes for large datasets. arXiv:1908.06437.
- Ren and Banerjee, (2013) Ren, Q. and Banerjee, S. (2013). Hierarchical factor models for large spatially misaligned data: A low-rank predictive process approach. Biometrics, 69(1):19–30. doi:10.1111/j.1541-0420.2012.01832.x.
- Royle and Berliner, (1999) Royle, J. A. and Berliner, L. M. (1999). A hierarchical approach to multivariate spatial modeling and prediction. Journal of Agricultural, Biological, and Environmental Statistics, 1(4):29–56. doi:10.2307/1400420.
- Rue and Held, (2005) Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC. doi:10.1201/9780203492024.
- Rue et al., (2009) Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B, 71:319–392. doi:10.1111/j.1467-9868.2008.00700.x.
- Sanderson and Curtin, (2016) Sanderson, C. and Curtin, R. (2016). Armadillo: a template-based C++ library for linear algebra. Journal of Open Source Software, 1:26.
- Sang and Huang, (2012) Sang, H. and Huang, J. Z. (2012). A full scale approximation of covariance functions for large spatial data sets. Journal of the Royal Statistical Society, Series B, 74:111–132. doi:10.1111/j.1467-9868.2011.01007.x.
- Stein, (2002) Stein, M. L. (2002). The screening effect in kriging. The Annals of Statistics, 30(1):298–323. doi:10.1214/aos/1015362194.
- Stein, (2014) Stein, M. L. (2014). Limitations on low rank approximations for covariance matrices of spatial data. Spatial Statistics, 8:1–19. doi:10.1016/j.spasta.2013.06.003.
- Stein et al., (2004) Stein, M. L., Chi, Z., and Welty, L. J. (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society, Series B, 66:275–296. doi:10.1046/j.1369-7412.2003.05512.x.
- Sun et al., (2011) Sun, Y., Li, B., and Genton, M. (2011). Geostatistics for large datasets. In Montero, J., Porcu, E., and Schlather, M., editors, Advances and Challenges in Space-time Modelling of Natural Events, pages 55–77. Springer-Verlag, Berlin Heidelberg. doi:10.1007/978-3-642-17086-7.
- Taylor-Rodriguez et al., (2019) Taylor-Rodriguez, D., Finley, A. O., Datta, A., Babcock, C., Andersen, H. E., Cook, B. D., Morton, D. C., and Banerjee, S. (2019). Spatial factor models for high-dimensional and large spatial data: An application in forest variable mapping. Statistica Sinica, 29(3):1155–1180. doi:10.5705/ss.202018.0005.
- Vecchia, (1988) Vecchia, A. V. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society, Series B, 50:297–312. doi:10.1111/j.2517-6161.1988.tb01729.x.