Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains

We introduce a class of scalable Bayesian hierarchical models for the analysis of massive geostatistical datasets. The underlying idea builds on recent work in high-dimensional geostatistics: we partition the spatial domain and model the regions in the partition using a sparsity-inducing directed acyclic graph (DAG). We extend the model over the DAG to a well-defined spatial process, which we call the Meshed Gaussian Process (MGP). A major contribution is the development of MGPs on tessellated domains, accompanied by a Gibbs sampler for the efficient recovery of spatial random effects. In particular, the cubic MGP (Q-MGP) can harness high-performance computing resources by executing all large-scale operations in parallel within the Gibbs sampler, improving mixing and computing time compared to sequential updating schemes. Unlike some existing models for large spatial data, a Q-MGP facilitates massive caching of expensive matrix operations, making it particularly apt for spatiotemporal remote-sensing data. We compare Q-MGPs against state-of-the-art methods on large synthetic and real-world data. We also illustrate using Normalized Difference Vegetation Index (NDVI) data from the Serengeti park region to recover latent multivariate spatiotemporal random effects at millions of locations. The source code is available at https://github.com/mkln/meshgp.


Introduction
Collecting large quantities of spatial and spatiotemporal data is now commonplace in many fields. In ecology and forestry, very large datasets are collected using satellite imaging and other remote sensing instruments, such as LiDAR, that periodically record high-resolution images. Unfortunately, clouds frequently obstruct the view, resulting in large regions with missing information. Figure 1 shows this phenomenon using Normalized Difference Vegetation Index (NDVI) data from the Serengeti region. Filling such gaps in the data is thus an important goal, as is quantifying the associated uncertainty in predictions. This goal can be achieved via probabilistic modeling of the underlying phenomenon, which involves the specification of a spatial or spatiotemporal process that characterizes dependence among any finite set of random variables. Gaussian processes (GPs) are a common, flexible choice to characterize spatial dependence, but a standard implementation is notoriously burdened by its O(n^3) computational complexity. As a consequence, intense research has been devoted in recent years to the development of scalable models for large spatial datasets; see the in-depth reviews by Sun et al. (2011) and Banerjee (2017).
Computational complexity can be reduced by considering low-rank models; among these, knot-based methods motivated by kriging ideas enjoy some optimality properties but oversmooth the estimates of spatial random effects unless the number of knots is large, and require corrections to avoid overestimation of the nugget (Banerjee et al., 2008; Cressie and Johannesson, 2008; Banerjee et al., 2010; Guhaniyogi et al., 2011; Finley et al., 2012). Other methods reduce the computational burden by introducing sparsity in the covariance matrix; strategies include tapering (Furrer et al., 2006; Kaufman et al., 2008) or partitioning of the spatial domain into regions, typically assumed independent of one another (Sang and Huang, 2012; Stein, 2014). These can be improved by considering a recursive partitioning scheme, resulting in a multi-resolution approximation (MRA; Katzfuss 2017). Other conditional independence assumptions also have a good track record in terms of scalability to large spatial datasets: Gaussian Markov random fields (GMRFs; Rue and Held, 2005), composite likelihood methods (Eidsvik et al., 2014), and neighbor-based likelihood approximations (Vecchia, 1988) belong to this family.
In fact, the recent literature has witnessed substantial activity surrounding the so-called Vecchia approximation (Vecchia, 1988). This approximation can be viewed as a special case of GMRF approximations with a simplified neighborhood structure motivated by a directed acyclic graph (DAG) representation of a Gaussian process likelihood. Extensions leading to well-defined spatial processes, which accommodate inference at arbitrary locations by extending the DAG representation to the entire domain, include nearest-neighbor Gaussian processes (NNGP; Datta et al. 2016a,b) and further generalizations and enhancements that construct DAGs over the augmented space of outcomes and spatial effects (Katzfuss and Guinness, 2017). These approaches achieve computational scalability by introducing sparsity in the precision matrix. The DAG relies upon a specific topological ordering of the locations, which also determines the construction of the neighborhood sets, and certain orderings tend to deliver improved performance (Katzfuss and Guinness, 2017; Guinness, 2018).
When inference on the latent process is sought, Bayesian inference has the benefits of providing direct probability statements based upon the posterior distribution of the process.
Inference based on asymptotic approximations is avoided, but challenges remain in computing the posterior distribution, given that inference is sought on a very high-dimensional parameter space (including the realizations of the latent process). One possibility, available for Gaussian first-stage likelihoods, is to work with a collapsed or marginalized likelihood obtained by integrating out the spatial random effects. However, Gibbs samplers and other MCMC algorithms for collapsed models can be inexorably slow and are impractical when data are in the millions. A sequential Gibbs sampler that updates the latent spatial effects (Datta et al., 2016a) is faster in updating the parameters but suffers from high autocorrelation and slow mixing. Another possibility emerges when interest lies only in prediction or imputation of the outcome variable and not in the latent process. Here, a so-called "response" model that models the outcome itself using an NNGP can be constructed. This model is much faster and enjoys superior convergence properties, but we lose inference on the latent process, and its predictive performance tends to be inferior to that of the latent process model. Furthermore, these options are unavailable in non-Gaussian first-stage hierarchical models or when the focus is not uniquely on prediction. A detailed comparison of different approaches for computing Bayesian NNGP models is presented in Finley et al. (2019).
Our current contribution introduces a class of Meshed Gaussian Process (MGP) models for Bayesian hierarchical modeling of large spatial datasets. This class builds upon the aforementioned DAG-based approaches originating with Vecchia (1988). The inferential focus remains within the context of massive spatial datasets over very large domains. We exploit the demonstrated benefits of DAG-based models, but we now adapt them to partitioned domains. We describe dependence across regions of a partitioned domain using a small directed acyclic graph, which we refer to as a mesh. Within each region, some locations are selected as reference and collectively mapped to a single node in the DAG. Relationships among nodes are governed by kriging ideas. In the resulting MGP, regions in the spatial domain depend on each other through the reference locations. Realizations at all other locations are assumed independent, conditional upon the reference locations. This construction leads to a valid standalone spatial process.
While the idea of partitioning domains to create approximations is not new (Katzfuss, 2017; Gramacy and Lee, 2008), construction of DAG-based approximations over partitioned domains has received considerably less attention. In recent work, Quiroz et al. (2019) consider a similar approach in building their block-NNGP. However, the advantages of such an approach are unclear when implemented using collapsed samplers, because block-NNGPs do not necessarily reduce the precision matrix fill-in. We posit that alternate graphs may be preferable when considering collapsed samplers. Crucially, collapsed samplers are impractically slow when data are in the millions. As a result, there appears to be no sizable gain in replacing NNGP models with their block version. In this regard, a major contribution of this paper is to demonstrate the substantial scalability and performance gains, relative to state-of-the-art alternatives, obtained by considering MGPs in which domain partitioning via tessellation or tiling is coupled with similarly-patterned meshes to describe dependence across regions.
In particular, we focus on the desirable properties of MGPs based on domain tessellations for the efficient recovery of latent spatial random effects. First, they induce repeated patterns in the associated DAG that can be used to massively reduce the number of expensive matrix operations. Second, they enable parallel sampling of the latent spatial random effects, promote mixing, and can be flexibly embedded in larger Bayesian hierarchical models. Finally, extensions to spatiotemporal or higher-dimensional data are straightforward once a suitable covariance function has been defined. We use axis-parallel domain partitioning coupled with cubic meshes, resulting in cubic MGPs or Q-MGPs, to show substantial improvements in computing time and inferential performance relative to other models, with data sizes ranging from the thousands to the several millions, for both spatial and spatiotemporal data and using multivariate spatial processes.
The balance of this paper proceeds as follows. Section 2 introduces our general framework for hierarchical Bayesian modeling of spatial processes using networks of grouped spatial locations. The MGP is outlined in Section 3, where we provide a general, scalable computing algorithm in Section 3.1. Tessellation-based schemes and the specific case of Q-MGPs are outlined in Section 4, which highlights their properties and computational advantages. We illustrate the performance of our proposed approach in Section 5 using simulation experiments and an application on a massive dataset with millions of spatiotemporal locations. We conclude the paper with a discussion and pointers to further research. Supplementary material accompanying this manuscript as an Appendix contains further comparisons of Q-MGPs with several state-of-the-art methods for spatial data.

Spatial processes on partitioned domains
A q × 1 spatial process assigns a probability law on {w(ℓ) : ℓ ∈ D}, where w(ℓ) is a q × 1 random vector with elements w_i(ℓ) for i = 1, 2, . . . , q. In the following general discussion we will not distinguish between spatial (D ⊂ ℝ^d) and spatiotemporal domains (D ⊂ ℝ^{d+1}), and denote spatial or spatiotemporal locations as ℓ, s, or u.
For any finite set of spatial locations L = {ℓ_1, ℓ_2, . . . , ℓ_{n_L}} ⊂ D of size n_L, let P(·) denote the probability law of the n_L q × 1 random vector w_L = (w(ℓ_1)^⊤, w(ℓ_2)^⊤, . . . , w(ℓ_{n_L})^⊤)^⊤ with probability density p(·). The joint density of w_L can be expressed as a DAG (or a Bayesian network model) with respect to the ordered set of locations L as

p(w_L) = p(w(ℓ_1)) ∏_{i=2}^{n_L} p(w(ℓ_i) | w(ℓ_1), . . . , w(ℓ_{i−1})),    (1)

where the conditioning set for each w(ℓ_i) can be interpreted as the set of its parents in a large, dense Bayesian network. Defining a simplified valid joint density on L by reducing the size of the conditioning sets is a popular strategy for fast likelihood approximations in the context of large spatial datasets. One typically limits dependence to "past" neighboring locations with respect to the ordering in (1) (Vecchia, 1988; Gramacy and Apley, 2015; Stein et al., 2004; Datta et al., 2016a; Katzfuss and Guinness, 2017). The neighbors are defined and fixed, and model performance may benefit from the addition of some distant locations (Stein et al., 2004). The ordering of L is also fixed, and inferential performance may benefit from the use of certain fixed permutations (Guinness, 2018). Shrinking the conditioning sets to a smaller set of neighbors from the past yields a sparse DAG or Bayesian network, with potentially massive computational gains.
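To fix ideas, the following self-contained Python sketch (our own illustration, not part of any package; it assumes a simple exponential covariance and a univariate process) evaluates the DAG factorization of a Gaussian density. With full conditioning sets it recovers the exact joint density; with reduced sets of "past" neighbors it yields a Vecchia-type approximation.

```python
import numpy as np

def exp_cov(locs1, locs2, phi=1.0):
    # Exponential covariance C(l, l') = exp(-phi * ||l - l'||) (illustrative choice).
    d = np.linalg.norm(locs1[:, None, :] - locs2[None, :, :], axis=-1)
    return np.exp(-phi * d)

def dag_logdensity(w, locs, parents, phi=1.0):
    # log p~(w) = sum_i log N(w_i | kriging mean, kriging var), where each
    # conditional uses only the parent set parents[i] within {0, ..., i-1}.
    logp = 0.0
    for i in range(len(w)):
        pa = parents[i]
        Cii = exp_cov(locs[[i]], locs[[i]], phi)[0, 0]
        if len(pa) == 0:
            mean, var = 0.0, Cii
        else:
            Cip = exp_cov(locs[[i]], locs[pa], phi)      # 1 x |pa|
            Cpp = exp_cov(locs[pa], locs[pa], phi)       # |pa| x |pa|
            H = np.linalg.solve(Cpp, Cip.T).T            # kriging weights
            mean = float(H @ w[pa])
            var = float(Cii - H @ Cip.T)
        logp += -0.5 * (np.log(2 * np.pi * var) + (w[i] - mean) ** 2 / var)
    return logp

rng = np.random.default_rng(0)
locs = rng.uniform(size=(30, 2))
C = exp_cov(locs, locs)
w = np.linalg.cholesky(C) @ rng.standard_normal(30)

full = [list(range(i)) for i in range(30)]                 # dense DAG: exact density
near = [list(range(max(0, i - 5), i)) for i in range(30)]  # 5 "past" neighbors
exact = dag_logdensity(w, locs, full)
approx = dag_logdensity(w, locs, near)
```

With the full conditioning sets, the sequential factorization reproduces the multivariate normal log-density exactly; the neighbor-based version trades a small approximation error for much smaller conditioning sets.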
We proceed in a similar manner, but instead of defining a sparse DAG at the level of each individual location, we map entire groups of locations to nodes in a much smaller graph; the same graph will be used to model the dependence between any location in the spatial domain and, therefore, to define a spatial process. Let P = {D_1, . . . , D_M} be a partition of D into M mutually exclusive subsets, so that D = ∪_{i=1}^M D_i and D_i ∩ D_j = ∅ whenever i ≠ j. Similar to the nomenclature of the NNGP, we fix a reference set S = {s_1, . . . , s_{n_S}} ⊂ D, which is itself partitioned using P by letting S_j = D_j ∩ S. The set of non-reference locations is similarly partitioned with U_j = D_j \ S_j, so that D_j = S_j ∪ U_j for each j = 1, 2, . . . , M. We now construct a DAG to model dependence within and between S and U. Let G = {V, E} be a graph with nodes V = A ∪ B, where we refer to A = {a_1, . . . , a_M} as the reference nodes and to B = {b_1, . . . , b_M} as the non-reference, or simply "other", nodes, with A ∩ B = ∅.
We introduce a map η : D → V such that η(s) = a_j if s ∈ S_j and η(u) = b_j if u ∈ U_j; this surjective many-to-one map links each location in S_j and U_j to a node in G. We assume that all parents are reference nodes and that Ch[b_j] = ∅ for all j, meaning that only reference nodes have children; this distinguishes the reference nodes A from the other nodes B. Apart from the assumption that a_j ∈ Pa[b_j], we refrain from defining the parents of a node, thereby retaining flexibility. Later, in Section 4, we will consider meshes associated with domain tessellations.

Consider the enumeration S_i = {s_{i1}, . . . , s_{in_i}} and let w_i = (w(s_{i1})^⊤, . . . , w(s_{in_i})^⊤)^⊤ be the n_i q × 1 random vector listing the elements of w(s) for each s ∈ S_i. We now rewrite (1) as a product of M conditional densities,

p(w_S) = p(w_1) ∏_{i=2}^M p(w_i | w_1, . . . , w_{i−1}).

The conditioning sets are then reduced based on the graph G:

p̃(w_S) = ∏_{i=1}^M p(w_i | w_[i]),    (4)

where we denote w_[i] = {w_j : a_j ∈ Pa[a_i]} and Pa[a_i] ⊂ {a_1, . . . , a_{i−1}} ⊂ A. This is a proper multivariate joint density since the graph is acyclic (Lauritzen, 1996). It is instructive to note how the above approximation behaves when the size of the parent sets shrinks, for a given domain partitioning scheme. To this end, we adapt a result in Banerjee (2020) and show that sparser DAGs correspond to a larger Kullback-Leibler (KL) divergence from the base density p. This result has been proved earlier for Gaussian likelihoods by Guinness (2018), but the argument given below is free of distributional assumptions.
Consider the random vector w and a partition of the domain P with corresponding nodes v_1, . . . , v_M, and let G_1 and G_2 be two DAGs on these nodes that differ by a single edge: graph G_2 is obtained by removing the directed edge v_* → v_{i_*} from G_1. We approximate p using densities p̃_1 and p̃_2 based on G_1 and G_2, respectively. Considering the Kullback-Leibler divergence of each density from p, one obtains KL(p ‖ p̃_1) ≤ KL(p ‖ p̃_2); the argument uses the two factorizations, the fact that the conditioning variables involved are disjoint from {i_*} ∪ Pa_1[i_*], and Jensen's inequality.
This result implies that larger parent sets are preferable, as they correspond to better approximations of the full model; the choice of sparser graphs will instead be driven by computational considerations (see Section 3.2).
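The monotonicity of the approximation in the parent sets can be verified numerically in the Gaussian case. The sketch below (our own illustration, assuming a Matérn-3/2 covariance on a one-dimensional grid) builds the joint covariance implied by a DAG density and computes the Gaussian KL divergence from the base density; removing parent edges never decreases the divergence, and the fully connected DAG recovers the base density exactly.

```python
import numpy as np

def matern32(x1, x2, a=5.0):
    # Matern-3/2 covariance (illustrative; not Markov, so parents matter).
    d = np.abs(x1[:, None] - x2[None, :])
    return (1 + a * d) * np.exp(-a * d)

def dag_cov(C, parents):
    # Joint covariance implied by the DAG density: w = H w + omega with
    # omega ~ N(0, diag(R)), hence Cov = (I - H)^{-1} R (I - H)^{-T}.
    n = C.shape[0]
    H = np.zeros((n, n)); R = np.zeros(n)
    for i in range(n):
        pa = parents[i]
        if len(pa) == 0:
            R[i] = C[i, i]
        else:
            Hi = np.linalg.solve(C[np.ix_(pa, pa)], C[pa, i])
            H[i, pa] = Hi
            R[i] = C[i, i] - C[i, pa] @ Hi
    IH = np.linalg.inv(np.eye(n) - H)
    return IH @ np.diag(R) @ IH.T

def kl_gauss(Sp, Sq):
    # KL( N(0, Sp) || N(0, Sq) ) for zero-mean Gaussians.
    n = Sp.shape[0]
    sol = np.linalg.solve(Sq, Sp)
    return 0.5 * (np.trace(sol) - n
                  + np.linalg.slogdet(Sq)[1] - np.linalg.slogdet(Sp)[1])

x = np.linspace(0, 1, 25)
C = matern32(x, x)
g1 = [list(range(max(0, i - 2), i)) for i in range(25)]  # two "past" parents
g2 = [list(range(max(0, i - 1), i)) for i in range(25)]  # edges removed: one parent
kl1 = kl_gauss(C, dag_cov(C, g1))
kl2 = kl_gauss(C, dag_cov(C, g2))
```

Here g2 is nested in g1, so the result above guarantees kl1 ≤ kl2.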
We construct the spatial process over arbitrary locations by enumerating the other locations as U = {u_1, . . . , u_{n_U}} ⊂ D \ S and extending (4) to the non-reference locations. Given the partition of U defined earlier, with components U_j for j = 1, 2, . . . , M, for each u ∈ U_j we set η(u) = b_j and recall that Pa[b_j] ⊂ A by construction. For each i = 1, . . . , n_U, we denote w_[u_i] = {w_j : a_j ∈ Pa[η(u_i)]} ⊂ w_S and define the conditional density of w_U given w_S as

p̃(w_U | w_S) = ∏_{i=1}^{n_U} p(w(u_i) | w_[u_i]).

Therefore, for any finite subset of spatial locations L ⊂ D we can let U = L \ S and obtain

p̃(w_L) = ∫ p̃(w_U | w_S) p̃(w_S) ∏_{s_i ∈ S \ L} d w(s_i).

We show (see Appendix A) that this is a well-defined process by verifying the Kolmogorov consistency conditions. The new process can be built starting from a base process, a fixed reference set S, a domain partition P, and a graph G. In the next section, we use this construction with Gaussian base processes.

Meshed Gaussian Processes
Let {w(ℓ) : ℓ ∈ D} be a q-variate multivariate Gaussian process, denoted as w(ℓ) ∼ GP(0, C(·, · | θ)). The cross-covariance C(·, · | θ), indexed by parameters θ, is a function such that the (i, j)-th entry of C(ℓ, ℓ' | θ) evaluates the covariance between the i-th and j-th elements of w(ℓ) and w(ℓ'), respectively, i.e., cov(w_i(ℓ), w_j(ℓ')). We omit dependence on θ to simplify notation. The cross-covariance function itself needs to be neither symmetric nor positive-definite, but must satisfy the following two properties: (i) C(ℓ, ℓ') = C(ℓ', ℓ)^⊤; and (ii) ∑_{i=1}^n ∑_{j=1}^n z_i^⊤ C(ℓ_i, ℓ_j) z_j > 0 for any integer n, any finite collection of points {ℓ_1, ℓ_2, . . . , ℓ_n}, and all z_i ∈ ℝ^q \ {0}. See Genton and Kleiber (2015) for a review of cross-covariance functions for multivariate processes. The (partial) realization of the multivariate process over any finite set L has a multivariate normal distribution w_L ∼ N(0, C_L), where w_L is the q n_L × 1 column vector and C_L is the q n_L × q n_L block matrix with the q × q matrix C(ℓ_i, ℓ_j) as its (i, j) block for i, j = 1, . . . , n_L.
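The two properties above can be checked numerically on a candidate cross-covariance. The sketch below uses a separable cross-covariance C(ℓ, ℓ') = ρ(ℓ, ℓ') T, with T a q × q positive-definite matrix; this particular form is an illustrative assumption, not a requirement of the framework.

```python
import numpy as np

def cross_cov(l1, l2, T, phi=1.0):
    # Separable cross-covariance: an exponential spatial correlation rho(l, l')
    # scaling a fixed q x q positive-definite matrix T (illustrative choice).
    rho = np.exp(-phi * np.linalg.norm(l1 - l2))
    return rho * T

rng = np.random.default_rng(1)
q = 3
A = rng.standard_normal((q, q))
T = A @ A.T + q * np.eye(q)          # positive definite by construction
locs = rng.uniform(size=(8, 2))

# Property (i): C(l, l') = C(l', l)^T.
C12 = cross_cov(locs[0], locs[1], T)
C21 = cross_cov(locs[1], locs[0], T)

# Property (ii): the stacked nq x nq matrix is positive definite.
n = len(locs)
big = np.block([[cross_cov(locs[i], locs[j], T) for j in range(n)]
                for i in range(n)])
eigmin = np.linalg.eigvalsh(big).min()
```

For this separable form the stacked matrix is a Kronecker product of two positive-definite matrices, so its smallest eigenvalue is strictly positive.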
The construction of the MGP starts with a base, or parent, multivariate GP for w(ℓ) and then, using the graph G defined in Section 2, represents the joint density at the reference set S as

p̃(w_S) = ∏_{i=1}^M N(w_i; H_{i[i]} w_[i], R_i),    (8)

where H_{i[i]} = C_{i,[i]} C_{[i],[i]}^{−1} and R_i = C_{i,i} − C_{i,[i]} C_{[i],[i]}^{−1} C_{[i],i}. The resulting joint density p̃(w_S) is multivariate normal with covariance C̃_S and precision matrix C̃_S^{−1}. The precision matrix for Gaussian graphical models is easily derived using customary linear model representations of each conditional regression. Consider the DAG in (4). Each w_i is n_i q × 1; let J_i = |Pa[a_i]| be the number of parents of a_i in the graph G. Furthermore, let C_{i,j} be the n_i q × n_j q covariance matrix between w_i and w_j, C_{i,[i]} the n_i q × J_i q covariance matrix between w_i and w_[i], and C_{[i],[i]} the J_i q × J_i q covariance matrix of w_[i] with itself.
Representing each conditional density in (4) as a linear regression for w_i, we get

w_i = ∑_{j : a_j ∈ Pa[a_i]} H_{ij} w_j + ω_i = H_{i[i]} w_[i] + ω_i,

where each H_{ij} is an n_i q × n_j q coefficient matrix representing the multivariate regression of w_i on w_j, ω_i ∼ind N(0, R_i) for i = 1, 2, . . . , M, and each R_i is an n_i q × n_i q residual covariance matrix. Each H_{ij} can be obtained from the respective submatrix of H_{i[i]} = C_{i,[i]} C_{[i],[i]}^{−1}, and we also obtain R_i = C_{i,i} − H_{i[i]} C_{[i],i}. Therefore, all the H_{ij}'s and R_i's can be computed from the base cross-covariance function.
The distribution of w_S = (w_1^⊤, w_2^⊤, . . . , w_M^⊤)^⊤ can be obtained by noting that w_S = H w_S + ω, where H is the block matrix whose (i, j) block is H_{ij} if a_j ∈ Pa[a_i] and zero otherwise, and ω ∼ N(0, R_S) with R_S = diag(R_1, . . . , R_M). Hence C̃_S = (I − H)^{−1} R_S (I − H)^{−⊤}, and the precision matrix C̃_S^{−1} = (I − H)^⊤ R_S^{−1} (I − H) inherits its sparsity from H; sparsity can thus be induced by building G with few, carefully placed directed edges among the nodes in A. Appendix B contains a more in-depth treatment. We extend (8) to the collection of non-reference locations U ⊂ D \ S:

p̃(w_U | w_S) = ∏_{i=1}^{n_U} N(w(u_i); H_{u_i} w_[u_i], R_{u_i}),    (10)

where H_{u_i} = C_{u_i,[u_i]} C_{[u_i],[u_i]}^{−1} and R_{u_i} = C_{u_i,u_i} − H_{u_i} C_{[u_i],u_i}, analogously to (8), while H_U and R_U are defined analogously to H and R_S. Clearly, given that all the densities above are Gaussian, all finite-dimensional distributions will also be Gaussian. We have thus constructed a Gaussian process whose cross-covariance function C̃(ℓ_1, ℓ_2) between any two locations ℓ_1, ℓ_2 ∈ D is available in closed form (Appendix B). For a given base Gaussian cross-covariance function C, domain partitioning P, mesh G, and reference set S, we denote the corresponding meshed Gaussian process as MGP(G, P, S, C).
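The construction of C̃_S^{−1} from the base covariance can be sketched numerically (our own illustration with q = 1, an exponential base covariance, and a chain-shaped mesh): with fully connected parent sets the precision equals C_S^{−1} exactly, while a sparse mesh zeroes out the precision blocks between non-neighboring nodes.

```python
import numpy as np

def exp_cov(L1, L2, phi=1.0):
    d = np.linalg.norm(L1[:, None, :] - L2[None, :, :], axis=-1)
    return np.exp(-phi * d)

def mesh_precision(C, blocks, parents):
    # Assemble the block coefficients H and block-diagonal R from the base
    # covariance C, then return the precision (I - H)^T R^{-1} (I - H).
    n = C.shape[0]
    H = np.zeros((n, n)); Rinv = np.zeros((n, n))
    for i, bi in enumerate(blocks):
        pa = [k for j in parents[i] for k in blocks[j]]
        Ri = C[np.ix_(bi, bi)].copy()
        if pa:
            Hi = np.linalg.solve(C[np.ix_(pa, pa)], C[np.ix_(pa, bi)]).T
            H[np.ix_(bi, pa)] = Hi
            Ri -= Hi @ C[np.ix_(pa, bi)]
        Rinv[np.ix_(bi, bi)] = np.linalg.inv(Ri)
    IH = np.eye(n) - H
    return IH.T @ Rinv @ IH

rng = np.random.default_rng(5)
n, m = 20, 5                                       # M = 4 blocks of m locations
locs = rng.uniform(size=(n, 2))
C = exp_cov(locs, locs)
blocks = [list(range(i, i + m)) for i in range(0, n, m)]

P_chain = mesh_precision(C, blocks, [[], [0], [1], [2]])      # sparse chain mesh
P_full = mesh_precision(C, blocks, [list(range(i)) for i in range(4)])
```

The chain mesh leaves the precision block between the first and last nodes exactly zero, reflecting the conditional independence encoded by the graph.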

Bayesian hierarchical model and Gibbs sampler
Meshed GPs correspond to block-sparse precision matrices that are constructed cheaply from their block-sparse Cholesky factors by solving small linear systems. General-purpose sparse-Cholesky algorithms (Davis, 2006; Chen et al., 2008) can then be used to obtain collapsed samplers as in Finley et al. (2019). Unfortunately, these algorithms can only be used with Gaussian first-stage models and tend to be too slow to be practically viable when data are in the millions. Hence, we develop a more general scalable Gibbs sampler for the recovery of spatial random effects in hierarchical MGP models, whose implementation obviates computing with large matrices.
Consider a linear multivariate spatiotemporally-varying regression model at ℓ ∈ D ⊂ ℝ^{d+1},

y(ℓ) = X(ℓ)^⊤ β + Z(ℓ) w(ℓ) + ε(ℓ),    (11)

where y(ℓ) ∈ ℝ^l is the multivariate point-referenced outcome, X(ℓ) holds spatially-referenced predictors with constant coefficients β, Z(ℓ) is an l × q matrix of spatially-varying predictors, w(ℓ) is the q × 1 latent spatial process, and ε(ℓ) is measurement error. A simple univariate regression model with a spatially-varying intercept is obtained with l = 1 and Z(ℓ) = 1. For a given set of observed locations T = {ℓ_1, . . . , ℓ_n} we can write the above model compactly as y = X β + Z w + ε, where y = (y(ℓ_1)^⊤, . . . , y(ℓ_n)^⊤)^⊤, w and ε are similarly defined, and X = [X(ℓ_1) : · · · : X(ℓ_n)]^⊤. After fixing a reference set S, we obtain S* = T ∩ S and U = T \ S. We partition the domain as above to obtain S_j, S*_j, U_j for j = 1, . . . , M and model w(ℓ) using the MGP, which yields w_S ∼ N(0, C̃_S). We complete the model specification with conjugate priors on β and the measurement-error variances (Gaussian and Inverse-Gamma, respectively). For τ_r^2, r = 1, . . . , q, the full conditional is Inverse-Gamma with parameters a_τr + n/2 and b_τr + (1/2) E_r^⊤ E_r, where E_r = y_·r − X_·r β − Z_·r w and y_·r, X_·r, Z_·r are the subsets of y, X, Z corresponding to the r-th outcome (out of q) of the multivariate output.
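As an illustration of one Gibbs step, the sketch below samples τ_r^2 from the Inverse-Gamma full conditional above; the residuals and prior parameters are simulated placeholders, and the Inverse-Gamma draw is taken as the reciprocal of a Gamma draw.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder residuals E_r = y_r - X_r beta - Z_r w for one outcome r.
n = 500
E_r = rng.normal(scale=0.5, size=n)
a_tau, b_tau = 2.0, 1.0            # hypothetical Inverse-Gamma prior parameters

# Full conditional: tau_r^2 | rest ~ IG(a_tau + n/2, b_tau + E_r' E_r / 2).
shape = a_tau + n / 2
rate = b_tau + 0.5 * E_r @ E_r
tau2_draw = 1.0 / rng.gamma(shape, 1.0 / rate)   # IG draw via reciprocal Gamma
```

Note that `numpy` parameterizes the Gamma distribution by shape and scale, so the scale argument is the reciprocal of the Inverse-Gamma rate parameter.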
The Gibbs update of the w_U components can proceed simultaneously, since all blocks in U have no children and their parents are in S. The full conditional for each w_{U_j}, j = 1, . . . , M, is Gaussian and depends only on w at the parent nodes and on the data observed at U_j. For the reference blocks, let 1_j = (In(s_1 ∈ S*_j), . . . , In(s_{n_j} ∈ S*_j))^⊤ be the vector of indicators identifying locations with non-missing outcomes, and let a_j ∈ V be the node in G corresponding to S_j. The Gaussian full conditional of w_j is then built using D̃_{n_j}^{−1} = I_j ⊙ D_{n_j}^{−1}, where I_j = 1_j 1_j^⊤, ỹ_j = 1_j ⊙ (y_j − X_j β), and ⊙ denotes the Hadamard or Schur (element-by-element) product. Finally, θ is updated via a Metropolis step based on (8) and (10). The Gibbs sampling algorithm iterates across the above steps and, upon convergence, produces samples from p(β, {τ_r^2}_{r=1}^q, w | y). We also use samples from this posterior distribution for prediction. For ℓ ∈ D, we obtain a sample of y(ℓ) | y as follows: if ℓ ∈ S ∪ U, then y(ℓ) = X(ℓ)^⊤ β + Z(ℓ) w(ℓ), where β and w are sampled from the posterior. Otherwise, ℓ ∈ D_j for some j and thus η(ℓ) = b_j; we then sample w(ℓ) from its conditional density given the parent values via (10) and compute y(ℓ) accordingly.

Computations with non-separable multivariate space-time base covariances
We provide an account of the computational cost of general MGPs as a starting point to motivate the introduction of more efficient tessellated MGPs, and specifically Q-MGPs, in Section 4. We consider (11) and take l = 1 to simplify our exposition. In the resulting univariate regression, β is the usual linear regression coefficient on the p point-referenced regressors assumed to have a constant effect on the outcome, whereas the q-variate spatiotemporal process w(·) captures the varying effect of the Z regressors on the outcome. If p and q are small, as is the case in typical geostatistical modeling, sampling β and τ^2 carries a negligible computational cost. The cost of each Gibbs iteration is dominated by the updates of θ and w. Let us assume, solely for expository purposes, that each of the M blocks comprises the same number of locations, i.e. |S_j| = |U_j| = m for all j = 1, . . . , M. Thus, m = n/(2M), and the graph nodes have J or fewer parents and L or fewer children.
Evaluating the density p̃(w_S) p̃(w_U | w_S), as required to update θ, dominates the overall computational expense. Each term in the product requires computing R_j^{−1} and R_{U_j}^{−1}, both of size qm × qm, and their determinants, at a cost of O(J^3 q^3 m^3) flops per term via Cholesky decompositions. Reasonably, J and m are fixed, so M may grow linearly with sample size, and the cost is O(n q^3 J^3) considering M ∝ n. The total computing time is approximately O(n q^3 J^3 / K) when using K processors to compute the 2M densities.
Sampling w_S and w_U from their full conditional distributions requires O(2M q^3 m^3 + M L q^2 m^2 + M q^2 m^2) flops, assuming R_j^{−1} and R_{U_j}^{−1} have been stored in the previous step. The first term in the complexity order is due to the Cholesky decomposition of covariance matrices in the full conditionals, the second to sampling the reference nodes, and the third to sampling the other nodes. Without additional assumptions, parallelization reduces the wall-clock time for the w_U updates by a factor of up to K, since the covariances can be computed beforehand and the M components of w_U are independent given w_S. With fixed block size m, the overall time for a Gibbs iteration is thus linear in the sample size and cubic in J, highlighting the computational speedup of sparse graphs (J small), the negative impact of large q, and the cost of serially sampling w_S.
In terms of storage, no large matrices (dense or sparse) need to be computed or stored. The H_j and R_j matrices correspond to a storage requirement of O(4M q^2 m^2) = O(q^2 n).
The matrix Z, of size n × qn, is never explicitly computed. Instead, it can be represented as a list of 2M block-diagonal (thus sparse) Z_j matrices. Furthermore, computing Zw (an n × 1 vector) can be vectorized as the row-wise sum of the elementwise product Z* ⊙ w*, where Z* and w* are n × q matrices whose j-th columns hold the j-th space-time varying predictor and the corresponding process component, respectively. Storing Z both as a list and as Z* results in total storage of O(2qn).
Complexity is further reduced by considering a graph with small J or a finer partition resulting in large M and small m, whereas the overall time can be reduced by distributing computations across K processors. Possible choices for G include nearest-neighbor graphs and multi-resolution trees. The former considers very fine domain partitioning (M ≈ n and m = 1) and restricts graph connectedness by limiting the number of neighbors (i.e. parents), typically to J < 20. Serial updates of w_S will be prone to slow mixing when spatial correlation is high; sparse-Cholesky methods for block updates depend on the precision matrix fill-in (Yannakakis, 1981) and are generally much more expensive. A multi-resolution tree associated with domain partitioning in J + 1 recursive steps results in "last-generation" children (leaves) with J parents; J may grow sublinearly with sample size, but m will have to be very small to avoid excessive slow-downs. In settings with large q, adjusting J and M may insufficiently reduce the computational burden. In such cases, at the cost of flexibility and model richness, one may opt for a cross-covariance function that is separable in the variables (but perhaps non-separable in space and time), i.e. C = C_{h,u} ⊗ C_v, where C_{h,u} is the Jm × Jm space-time component of the cross-covariance and C_v the q × q variable component; this reduces the cost of inverting Jqm × Jqm matrices, since (C_{h,u} ⊗ C_v)^{−1} = C_{h,u}^{−1} ⊗ C_v^{−1}. This also applies to any Cholesky decomposition, so there will be savings when evaluating the likelihood as well as in sampling from the full conditionals.
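The savings from a variable-separable cross-covariance follow from the Kronecker identity (A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1}: one inverts an m × m and a q × q matrix instead of a single mq × mq matrix. A minimal sketch, with illustrative sizes and randomly generated positive-definite factors:

```python
import numpy as np

rng = np.random.default_rng(3)
m, q = 40, 3   # space-time block size and number of variables (illustrative)

A = rng.standard_normal((m, m))
C_hu = A @ A.T + m * np.eye(m)     # space-time component (positive definite)
B = rng.standard_normal((q, q))
C_v = B @ B.T + q * np.eye(q)      # variable component (positive definite)

# Separable covariance: C = C_hu (x) C_v, of size mq x mq.
C = np.kron(C_hu, C_v)

# Invert the two small factors instead of the large Kronecker product:
# O(m^3 + q^3) flops rather than O(m^3 q^3).
C_inv_kron = np.kron(np.linalg.inv(C_hu), np.linalg.inv(C_v))
```

The same factorization applies to Cholesky decompositions and log-determinants, since chol(A ⊗ B) = chol(A) ⊗ chol(B).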
As a cheaper alternative, we develop algorithms based on domain tessellation, i.e. partitioning into similarly-shaped regions that form patterns, to which we associate similarly patterned meshes. With irregularly spaced observations, the bulk of the largest linear solvers will be redundant, resulting in a significant reduction in computational time. These redundancies are even larger if data are on partly observed regular lattices. In either scenario, sampling w_S will also proceed in parallel with improved mixing.

MGPs based on domain tessellation or tiling
We partition the domain using a tessellation or tiling. For spatial domains (d = 2, Figure 2), regular tiling results in triangular, quadratic, or hexagonal tessellations, but mixed designs are also possible. These partition schemes can be linked to a DAG G by drawing directed outgoing edges starting from an originating node/tile. The same fixed pattern can be repeated over a surface of any size. In dimensions d > 2, which may include time, space-filling tessellations or honeycombs can be constructed analogously, along with their corresponding meshes. Constructing MGPs based on these tessellation and graph designs is immediate and simply requires partitioning the locations S and U into subsets based on the chosen tessellation.
This subclass of MGP models enjoys two properties. First, each node in G can be assigned a color; if node v is colored c and we remove from G all nodes of other colors, v will be disconnected from all other nodes of the same color in the moral graph G^M. This means that the graph itself can be partitioned into conditionally independent subgraphs, one for each color. The number of colors is small, e.g. 3, 4, and 6 for triangles, squares, and hexagons, respectively. This enables large-scale parallel sampling of w_S and improves mixing. Second, each region in the tessellated domain is a translation and/or rotation of a single geometric shape; it is linked to nodes in G with a fixed number of parents and children, and the parent regions are also translations and/or rotations of the same geometric shape. By carefully choosing S, it is possible to avoid computing the bulk of the linear solvers, resulting in substantial computational gains. Subsequently, we focus on axis-parallel partitioning (quadratic or cubic tessellation) and cubic meshes, but analogous constructions and the same properties hold with other tessellation schemes.
A cubic MGP (Q-MGP) can be constructed by partitioning each coordinate axis into intervals. In d + 1 dimensions, splitting each axis into L intervals results in L^{d+1} regions.
Non-reference locations may have parents along all 2(d + 1) line-of-sight directions. This scenario corresponds to J = 2(d + 1) and can be avoided by choosing S to cover the observed locations, in which case J = d + 2.
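Assigning locations to the L^{d+1} axis-parallel regions is a cheap operation. A minimal sketch (assuming a unit-cube domain; the function name is ours):

```python
import numpy as np

def assign_blocks(locs, L):
    # Axis-parallel partition of [0,1]^k: split each axis into L intervals,
    # giving L^k regions; return a single flat block index per location.
    k = locs.shape[1]
    idx = np.minimum((locs * L).astype(int), L - 1)    # per-axis interval index
    return np.ravel_multi_index(idx.T, dims=(L,) * k)

rng = np.random.default_rng(4)
locs = rng.uniform(size=(1000, 3))   # e.g. two spatial axes plus time (d + 1 = 3)
blocks = assign_blocks(locs, L=4)    # 4^3 = 64 regions
```

Locations on the boundary of the last interval are clamped to the final region, so the partition covers the closed domain.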
The construction is finalized by fixing the cross-covariance function C(ℓ, ℓ'), resulting in MGP(Q, T, S, C) = Q-MGP(S, C). Figure 3 shows how the same basic structure can be immediately extended to higher dimensions, including time.

Caching redundant expensive matrix operations
The key computational bottleneck of the Gibbs sampler in Section 3.2 is the calculation, for j = 1, . . . , 2M, of (1) C_{[j],[j]}^{−1} (2M J^3 q^3 m^3 flops) and (2) R_j^{−1} and Σ*_j^{−1} (4M q^3 m^3 flops). The former is costlier than the latter by a factor of J^3/2. Q-MGPs can be designed to greatly reduce this cost. Start with an axis-parallel tessellation of the domain into equally-sized regions D_1, . . . , D_M, sorting observed locations into U to create U_1, . . . , U_M, which we assume for simplicity to be no larger than m in size. Take a stationary base covariance function C.
For any two sets of locations L_1 and L_2, stationarity implies that C(L_1, L_2) = C(L_1 + h, L_2 + h), where h ∈ ℝ^{d+1} is used to shift all locations in the sets. When the reference locations are chosen so that the relative geometry of a node and its parent set repeats across the domain, only a small number r of distinct matrices C_{[j],[j]} actually need to be inverted. In other words, one only needs to create the maps ξ_S and ξ_U linking each block to one of the r unique inverses, cache those r inverses, and reuse them to avoid recalculating the rest. We can use the same method to cache R_{S_j}^{−1} on the reference sets, but not at the other locations, since no redundancy generally arises in C_{U_j,U_j} for j = 1, . . . , M, making the R_{U_j}^{−1} all different in general. Compared to general MGPs (see Table 1), the number of large linear system solvers is now constant in the sample size; since r can be as small as 4(d + 1) ≪ M, this implies a large reduction in computational cost.
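A minimal sketch of the caching scheme (our own illustration with an exponential covariance; keying the cache on parent coordinates shifted to the origin is one possible way to implement the maps ξ_S and ξ_U):

```python
import numpy as np

def exp_cov(L1, L2, phi=1.0):
    d = np.linalg.norm(L1[:, None, :] - L2[None, :, :], axis=-1)
    return np.exp(-phi * d)

def cached_inverse(parent_locs, cache, decimals=10):
    # Stationarity: C(L1, L2) = C(L1 + h, L2 + h), so two parent sets with the
    # same *relative* geometry share one inverse. Key the cache on coordinates
    # shifted so the set starts at the origin.
    shifted = parent_locs - parent_locs.min(axis=0)
    key = np.round(shifted, decimals).tobytes()
    if key not in cache:
        cache[key] = np.linalg.inv(exp_cov(shifted, shifted))
    return cache[key]

# Two parent sets that are translates of each other hit the same cache entry.
base = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]])
cache = {}
inv1 = cached_inverse(base, cache)
inv2 = cached_inverse(base + np.array([2.0, 3.0]), cache)   # shifted copy
```

On a regular grid of reference locations, most blocks share one of a handful of relative-geometry patterns, so the cache stays small while M grows with the sample size.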
Furthermore, Q-MGPs can automatically adjust to settings where the observed locations T are on partly regular lattices, meaning that they follow patterns which repeat in space and time. These patterns emerge after initial inspections of the data; we outline a simple algorithm to identify such patterns and create the maps ξ_S and ξ_U in Appendix E. In those cases, we fix S ⊇ T and U = ∅. In addition to the above-mentioned savings, we then do not have to compute R_{U_j}^{−1} and Σ*_{U_j}^{−1}. If T is not on a regular lattice over the whole domain, 4(d + 1) is a lower bound and in general there will be a larger number M* of unique matrices to invert.
[Figure: Parent locations of the orange (resp. purple) region are in green (resp. blue). Using a stationary covariance, C_blue,blue = C_green,green; therefore only one inversion is necessary, and it can be replicated at no cost across 9 of the 16 regions.]

Improved mixing via parallel sampling
Caching in Q-MGPs greatly reduces computing cost and time; as a consequence, a much larger proportion of time is spent on sampling. Fortunately, MGPs on tessellated domains such as Q-MGPs allow efficient parallel sampling of the latent spatial random effects via a "chromatic" sampler (Gonzalez et al., 2011). The reference nodes A of the mesh are partitioned into groups, or colors, such that nodes of the same color are conditionally independent given realizations at reference nodes of the other colors. To see this, partition a spatial domain (d = 2) into M_1 × M_2 regions and link each region to a reference node in a quadratic mesh; a "central" reference node is one that has two parents and two children. Assigning colors to regions according to the parity of their row and column indices then yields four groups and, denoting by D the other conditioning variables in the Gibbs sampler, the full conditional density of the nodes of one color factorizes into independent node-wise densities given the other colors and D. Analogous results can be obtained for other tessellation schemes. Extensions to higher dimensions or space-time domains and the associated graphs also follow analogously.
Since parallelization is possible within each of the four groups, there will only be four serial steps. Compute times are now improved, given that M/4 is typically several orders of magnitude larger than the number of available processors K. In addition to faster computations, this four-group design improves mixing of the Markov chain. We compare Q-MGPs with sequential NNGPs in terms of effective sample size in Appendix F.2.
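As a minimal illustration (helper names are ours, not part of meshgp), the four-color schedule on an M_1 × M_2 grid of blocks can be obtained by coloring blocks according to the parity of their grid coordinates; same-color blocks do not appear in each other's Markov blankets, so each color can be updated fully in parallel, with four serial steps per sweep.

```python
from concurrent.futures import ThreadPoolExecutor

def chromatic_colors(M1, M2):
    """Group the blocks of an M1 x M2 partition into 4 colors by the parity
    of their grid coordinates; same-color blocks are never neighbors and
    never co-parents, hence conditionally independent given the rest."""
    colors = {c: [] for c in range(4)}
    for i in range(M1):
        for j in range(M2):
            colors[2 * (i % 2) + (j % 2)].append((i, j))
    return colors

def gibbs_sweep(M1, M2, update_block, executor):
    """One Gibbs sweep: four serial color steps, fully parallel within each."""
    for blocks in chromatic_colors(M1, M2).values():
        list(executor.map(update_block, blocks))   # barrier between colors

# toy sweep over a 4 x 4 partition; each "update" just records its block
visited = []
with ThreadPoolExecutor(max_workers=4) as ex:
    gibbs_sweep(4, 4, visited.append, ex)
```

With M blocks, each color holds roughly M/4 blocks, so all available processors can be saturated within each of the four steps.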

Data analysis
Satellite imaging and remote sensing data are nowadays frequently collected in large quantities and processed for use in geology, ecology, forestry, and other fields. Unfortunately, cloud cover and atmospheric conditions obstruct aerial views and corrupt the data, creating gaps. Recovery of the underlying signal, reconstruction of the missing areas in the images, and quantification of prediction uncertainty are thus the major goals in enabling practitioners in the natural sciences to fully exploit these data sources. Geostatistical models based on Gaussian processes deal with these issues rigorously, but many of the recently developed scalable models have only been implemented on tens or hundreds of thousands of data points, with few exceptions. For larger data sizes, one must either allow a large time budget (usually several days) or reduce model flexibility and richness. Scalability concerns become the single most important issue in multivariate spatiotemporal settings: repeated collection of aerial images, together with multiple spatially referenced predictors modeled as having a varying effect on the outcome, has a multiplicative effect on data size. Without separability assumptions, the dimension of the latent spatial random effects one may wish to recover will be huge even when the individual images, considered in isolation, would be manageable.
There is currently a lack of software implementing scalable models for spatial data that extend to spatiotemporal settings, making it difficult to compare our proposed approach with others in these settings. On the other hand, a recent competition paper (Heaton et al., 2019) has pitted many state-of-the-art models against each other in a spatial (d = 2) prediction contest. In order to provide comparisons with those models, we analyze the same data in Appendix D, where we show that Q-MGPs can be set up to outperform all competitors in terms of predictive performance and coverage when using a similar computational budget.

Non-separable multivariate spatiotemporal base covariance
In our analyses, we choose a class of multivariate space-time cross-covariances modeling the covariance between variables i and j at space-time lag (h, u) ∈ ℝ^{d+1} in terms of a latent dissimilarity δ_ij > 0 (with δ_ij = δ_ji) between variables i and j.
In the resulting cross-covariance function C(h, u, v) on ℝ^{d+1+k}, each component of the q-variate spatial process is represented by a point in a k-dimensional latent space, k ≤ q.
We create "synthetic clouds" of radius √0.1 with centers (c_{1,t}, c_{2,t}), where c_{1,t}, c_{2,t} ∼ iid U[0, 1], to cover the outcomes at six randomly selected times in each of the 81 datasets. Outcomes at two of the remaining four time periods were then randomly selected to be completely unobserved, save for 10 locations (this was necessary to avoid fitting errors with Gapfill). Figure 5 depicts the three cloud cover types.
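The synthetic-cloud mechanism can be sketched as follows (a simplified stand-in for the actual simulation code; function and parameter names are hypothetical):

```python
import numpy as np

def add_clouds(frames, n_cloudy, radius, rng):
    """Mask a disc of pixels (a "synthetic cloud") at a uniformly drawn
    center in each of n_cloudy distinct, randomly chosen frames."""
    T, H, W = frames.shape
    out = frames.astype(float)            # astype copies; NaN marks missing
    yy, xx = np.mgrid[0:H, 0:W]
    for t in rng.choice(T, size=n_cloudy, replace=False):
        cy, cx = rng.uniform(0, H), rng.uniform(0, W)
        out[t][(yy - cy) ** 2 + (xx - cx) ** 2 < radius ** 2] = np.nan
    return out

rng = np.random.default_rng(1)
frames = np.ones((30, 20, 25))            # 30 frames of 20 x 25 pixels
masked = add_clouds(frames, n_cloudy=5, radius=6.0, rng=rng)
```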
We implemented a Q-MGP model with M = 500, found by partitioning each spatial axis into 10 intervals and the time axis into 5 intervals, and assigned the prior τ² ∼ Inv.G.(2, 1). Q-MGP attained good coverage of the prediction intervals, although some under-coverage was observed; Figure 6 summarizes these findings. We note that this comparison may favor our Q-MGP model, since it directly approximates the true Gaussian process generating the data. For this reason, we also consider the problem of recovering missing pixels of an animated GIF image in Appendix G, where our findings are confirmed.

NDVI data from the Serengeti ecosystem

The data include a total of three spatially referenced predictors. We are thus interested in understanding their varying effect in space and time, in addition to predicting the NDVI output at missing locations. We achieve both goals by implementing the model

y(ℓ) = Z(ℓ) w(ℓ) + ε(ℓ),

where Z(ℓ) = X(ℓ) includes the intercept and the three predictors; their varying effect is represented by w(ℓ), which we recover by implementing Q-MGP models. Storing posterior samples of the multivariate spatially varying coefficients for the full data with q = 4 is impossible with our computing resources, as each sample would be of size 1000 × 1000 × 64 × 4 = 2.56e+8. For this reason, we consider two feasible setups. Denote by n_all the number of locations, including those corresponding to missing NDVI output. In model (1), the base covariance of equation (13) becomes a univariate non-separable spatiotemporal covariance as in Gneiting (2002). In model (2), we aim to estimate the varying effect of elevation on NDVI. We subsample each image to obtain 64 frames of size 278 × 278, each covering an area of 25 × 25 km, and take Z(ℓ) = (1, X_elev(ℓ)), resulting in q = 2 and targeting the recovery of a latent spatial random effect vector of size 9,892,352. Considering the additional computational burden of the multivariate latent effects, in this case we used M = 156,800, corresponding to smaller space-time regions of average size ∼31. In this model there is a single unknown δ_ij in eq. (13), corresponding to the latent dissimilarity between the intercept and elevation; we thus consider ψ₂ = (a₂ δ_ij + 1)^{β₂} as the unknown parameter.
In both cases, approximate posterior samples of the latent random effects and the other unknown parameters were obtained by running the Gibbs sampler, discarding initial iterations as burn-in, and thinning the remaining 3,000 by a factor of 6. Additional computational details are in Appendix C. Posterior summaries of the unknown parameters for these models are reported in Table 2, along with the root mean square error (RMSE) in predicting NDVI at 10,000 left-out locations, the 95% posterior coverage at those locations, and run times.
Both models achieved similar out-of-sample predictive performance and coverage. Figure 7 shows the NDVI predictions of model (2) at one of the 64 time points. The varying effect of elevation on NDVI output is credibly different from zero at 42.54% of the space-time locations (95% credible intervals). Figure 8 shows the space-time varying effect of elevation on NDVI; in particular, it highlights the extent to which higher elevation reduces vegetation.
We also note that the spatial range is approximately 4 km, while the temporal range is about 8 days. The large estimated ψ₂ indicates that the estimated correlation between the two components of the latent random process is very small at all spatial and temporal lags.
The predicted NDVI and latent spatiotemporal effects are available as animated GIF images as part of the supplementary material.

Discussion
We have developed a class of Bayesian hierarchical models for large spatial and spatiotemporal datasets based on linking domain partitions to directed acyclic graphs. These models can be tailored for specific algorithmic needs, and we have demonstrated the advantages of using a cubic tessellation scheme (Q-MGP) when targeting the efficient recovery of spatial random effects in Bayesian hierarchical models using Gibbs samplers.
When considering alternative computational strategies, the proposed Q-MGP may not be optimal. For example, Gaussian first-stage models enable marginalization of the latent spatial effects, and posterior sampling of the unknown covariance parameters via MCMC is then typically associated with better mixing. Future research may thus focus on identifying "optimal" DAGs for collapsed samplers. Furthermore, the blocked conditional independence structure of Q-MGPs may be suboptimal, as it corresponds to possibly restrictive conditional independence assumptions at neighboring locations. While we have not focused on the effect of different tessellations or partitioning choices in this article, alternative tessellation schemes (e.g., hexagonal) may be associated with less stringent assumptions and possibly better performance, while retaining all the desirable features of Q-MGP.
Other natural extensions to high-dimensional spatiotemporal statistics include settings with a very large number of spatiotemporal outcomes in addition to a large number of spatial and temporal locations. Here there are a few different avenues. One approach is in the same spirit of the joint modeling pursued here but, instead of modeling the cross-covariance functions explicitly, pursues dimension reduction using factor models (see, e.g., Christensen and Amemiya, 2003; Lopes et al., 2008; Ren and Banerjee, 2013; Taylor-Rodriguez et al., 2019). The aforementioned references model the latent factors using spatial processes, some of which rely on scalable low-rank predictive processes or the NNGP. We believe that modeling latent factors using spatiotemporal MGPs will impart some of the computational and inferential benefits seen here. However, this will need further development, especially with regard to identifiability of the loading matrices (Lopes et al., 2008; Ren and Banerjee, 2013) and process parameters, and should generate relevant questions for future research.
A different approach to multivariate spatial modeling has relied upon conditional or hierarchical specifications. This has been well articulated in the text by Cressie and Wikle (2011); see also Royle and Berliner (1999) and the recent developments in Cressie and Zammit-Mangion (2016). An advantage of the hierarchical approach is that the multivariate processes are valid stochastic processes, essentially by construction and without requiring spectral representations, and can also impart considerable computational benefits. It will be interesting to extend the ideas in Cressie and Zammit-Mangion (2016) to augmented spaces of DAGs to further introduce conditional independence, and therefore sparsity, in MGP models with high-dimensional outcomes.
Finally, it is worth pointing out that alternative computational algorithms, particularly those tuned for high-dimensional Bayesian models, should also be explored. Recent developments on algorithms based upon classes of piecewise deterministic Markov processes (see, e.g., Fearnhead et al., 2018; Bierkens et al., 2019, and references therein), which avoid Gibbs samplers and even reversible MCMC algorithms, are proving increasingly effective for high-dimensional Bayesian inference. Adapting such algorithms to MGPs and Q-MGPs for scalable Bayesian spatial process modeling will constitute a natural extension of our current offering.

A Spatial Meshed Process
Let w(s), s ∈ D, be the base process, S the fixed reference set, L = {ℓ_1, . . . , ℓ_n} ⊂ D, and U = L \ S. Then the joint density p(w_L) is proper: it follows directly from the definitions of p(w_S) and p(w_U | w_S) by marginalizing over the reference locations not in L. We now verify the Kolmogorov consistency conditions. Take L_π = {ℓ_{π(1)}, . . . , ℓ_{π(n)}} as any permutation of L and let U_π = L_π \ S. Set membership is invariant to permutations, so U_π = L_π \ S = L \ S = U, and therefore the ordering of the locations in L in no way changes p(w_L). Hence p(w_{L_π}) = p(w_L), i.e. the process is invariant to permutations of the locations.
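In compact form (restating the argument above with the same notation):

```latex
% p(w_L) is obtained by marginalizing the reference locations not in L:
p(w_{\mathcal{L}}) \;=\; \int p(w_{\mathcal{U}} \mid w_{\mathcal{S}})\,
                          p(w_{\mathcal{S}})\, \mathrm{d}w_{\mathcal{S}\setminus\mathcal{L}} .
% Permutation invariance of set membership then gives consistency:
\mathcal{U}_{\pi} \;=\; \mathcal{L}_{\pi}\setminus\mathcal{S}
                  \;=\; \mathcal{L}\setminus\mathcal{S} \;=\; \mathcal{U}
\quad\Longrightarrow\quad p(w_{\mathcal{L}_{\pi}}) = p(w_{\mathcal{L}}) .
```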

B Meshed Gaussian Process
In graph G, let b_j denote the number of parents of node v_j. Then H_j is a matrix that can be partitioned by column into b_j blocks, whose rth block is of size q × q and corresponds to the rth block of w_{Pa[j]}.
We can then define H*_j as the q n_j × q n_S matrix such that w_j − H_j w_{Pa[j]} = H*_j w_S. Analogously to H_j, we can partition H*_j by column into M blocks. Let H*_S be the q n_S × q n_S block matrix whose jth block-row is H*_j, for j = 1, . . . , M, and let R_S = blockdiag(R_1, . . . , R_M). The resulting precision matrix C̃_S^{-1} = H*_S' R_S^{-1} H*_S is thus a q n_S × q n_S matrix made of M × M blocks of sizes q n_i × q n_j, for i, j = 1, . . . , M, with Σ_j n_j = n_S. We denote its (i, j) block as C̃_S^{-1}(i, j) and note that by symmetry C̃_S^{-1}(j, i) = C̃_S^{-1}(i, j)'. The sparsity of C̃_S^{-1} thus depends on the specific choice of G(A) and the corresponding moralized graph G^m(A) (Cowell et al., 1999). The covariance function of the new process is then defined using C̃: consider two locations ℓ_1 and ℓ_2 in D. If both are in S, then ℓ_1 = s_i and ℓ_2 = s_j for some i, j, and the covariance is the corresponding entry of C̃_S.

B.1 Choice of S and U

The two sets of locations S and U differ in that the former is linked to the nodes A of graph G, whereas the latter is linked to the nodes B, which have no children. Since the B nodes are independent of each other given realizations of A, they provide little information on spatial dependence; in this sense, the set S is akin to the set of knots in a predictive process model. Unlike in those models, however, S can be chosen to be as large as the set of observations, with U left empty. In general, it is possible to choose S ⊃ T to smoothly model spatial dependence across larger regions of unobserved locations, but convergence of the Gibbs sampler may be slower at locations that are very far from the nearest observed ones. Therefore, a convenient choice, possibly safer for MCMC convergence, is to let the two sets coincide, i.e. S = T.
Matters are less straightforward when redundancies occur. Consider locations t = (t_1, . . . , t_d) ∈ D such that t_j = t*_j δ_j, where t*_j = 1, . . . , N_j and δ_j ∈ [0, 1/N_j], for j = 1, . . . , d. Denote the set of such locations as T*; this is an N_1 × · · · × N_d grid of equally spaced locations. If the observed locations are on a quantized grid of coordinates, i.e. there are {t*_j, δ_j}_{j=1,...,d} such that T ⊂ T*, then choosing S = T may be inefficient, as it may reduce the number of redundant R_j's and thus increase computation time. Instead, choosing S such that T ⊂ S ⊂ T* (possibly S = T*) may be much more efficient.
In Figure 9 we show a small subsample of a time slice of the Serengeti data. Large portions of the spatial domain are missing, but the observed locations are at quantized coordinates. We may thus choose to extend S to the whole domain (Fig. 9, center) or to extend it just enough to cover all observed locations (Fig. 9, right). For predictions, in the former case we use samples from the full conditional distribution p(w_j | w_{−j}, D). In the latter case, we map empty blocks to nodes B: predictions then easily proceed in parallel, since the B nodes are independent of each other given the nodes in A, with Pa[b] composed of the nearest A nodes along each axis-parallel direction.
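The precision-matrix construction earlier in this appendix can be checked numerically on a toy example (a sketch, not the meshgp code): for a chain DAG over three single-location blocks with an exponential covariance, which is Markov in one dimension, the meshed density recovers the full GP exactly.

```python
import numpy as np

# Toy DAG over three single-location blocks on a line, 1 -> 2 -> 3, with an
# exponential base covariance; since this kernel is Markov in one dimension,
# using only the previous block as parent loses nothing in this special case.
locs = np.array([0.0, 1.0, 2.0])
C = np.exp(-np.abs(locs[:, None] - locs[None, :]))

n = len(locs)
Hstar = np.eye(n)                  # row j encodes w_j - H_j w_{Pa[j]}
Rdiag = np.empty(n)
Rdiag[0] = C[0, 0]                 # root block: no parents
for j in range(1, n):
    pa = [j - 1]                   # parent set of block j
    Hj = np.linalg.solve(C[np.ix_(pa, pa)], C[np.ix_(pa, [j])]).ravel()
    Hstar[j, pa] = -Hj
    Rdiag[j] = C[j, j] - Hj @ C[np.ix_(pa, [j])].ravel()

# precision of the meshed density, sparse according to the moralized DAG
Prec = Hstar.T @ np.diag(1.0 / Rdiag) @ Hstar
```

For general DAGs the inverse of Prec only approximates C, but the zero pattern of Prec is always dictated by the moralized graph.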

C C++ implementation of Q-MGP in meshgp
All applications were run on a 512 GB-memory, dual-CPU server with two Intel Xeon E5-2699v3 processors (18 cores each, at 2.3 GHz), running Debian Linux and R version 3.6 linked to the OpenBLAS libraries. Source code implementing Q-MGP models is available as the R package meshgp (github.com/mkln/meshgp). The meshgp package is written in C++ using Armadillo (Sanderson and Curtin, 2016), a library for linear algebra that can easily be linked to high-performance libraries. Parallelization of all high-dimensional operations is implemented via OpenMP (Dagum and Menon, 1998). The R packages Rcpp (Eddelbuettel and François, 2011) and RcppArmadillo (Eddelbuettel and Sanderson, 2014) are used to interface the C++ source with R.

D Comparisons with methods in Heaton et al. (2019)

In recent work, Heaton et al. (2019) reviewed and compared 13 state-of-the-art models for large spatial datasets in a predictive challenge involving (1) simulated and (2) real-world spatial data (daytime land surface temperatures as measured by the Terra instrument onboard the MODIS satellite on August 4, 2016; Level-3 data). The two datasets are available at github.com/finnlindgren/heatoncomparison. The total number of locations is the same in both cases (n_all = 150,000), with the goal of predicting outcomes where missing.

Both datasets include n_train = 105,569 observed locations; the test set is of size n_test = 44,431 for the simulated data and n_test = 42,740 for the MODIS data, due to cloud cover.
We estimate two MGP models for each dataset by partitioning the spatial domain into M = 1500 (resp. 375) rectangular regions, for an average block size of 100 (resp. 400) spatial locations, and fix G as a cubic mesh. We assign blocks to S so as to cover the observed locations; the remaining blocks are used for prediction as in the right panel of Figure 9. Finally, we use an exponential base covariance function. We ran our Gibbs sampler using σ² ∼ Inv.Gamma(2.01, 1), τ² ∼ Inv.Gamma(2.01, 1), and φ ∼ U[1/10, 30] as priors, with 10, 1, and 10 as the respective starting values. A log-normal proposal was used for the Metropolis update of φ. We ran a total of 6,000 Monte Carlo iterations, dropped 4,000 as burn-in, and thinned the remaining ones 2:1 to obtain a sample of size 1,000 that approximates the joint posterior distribution p(w, φ, σ², τ² | y), which we then use to obtain predictions. In both cases, the algorithm ran on 40 threads. We report the results of our analysis in Table 3; prediction plots for both datasets are in Figure 10.
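A Metropolis step with a log-normal proposal can be sketched as follows (a generic illustration with a toy Gamma target standing in for the posterior of a decay parameter; names are ours, not from the meshgp source). Note the Jacobian correction log(prop) − log(phi) required by the asymmetric proposal.

```python
import numpy as np

def metropolis_lognormal(phi, log_post, step, rng):
    """One Metropolis step for a positive parameter with a log-normal
    random-walk proposal; log(prop) - log(phi) is the proposal-asymmetry
    (Jacobian) correction entering the acceptance ratio."""
    prop = phi * np.exp(step * rng.normal())
    log_alpha = log_post(prop) - log_post(phi) + np.log(prop) - np.log(phi)
    return prop if np.log(rng.uniform()) < log_alpha else phi

# toy target: Gamma(3, 1) log-density, up to an additive constant
rng = np.random.default_rng(0)
log_post = lambda p: 2.0 * np.log(p) - p
phi = 10.0
draws = []
for _ in range(5000):
    phi = metropolis_lognormal(phi, log_post, 0.5, rng)
    draws.append(phi)
```

The log-normal proposal keeps every proposed value positive, which suits parameters such as φ with support bounded below by zero.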
In terms of predictive performance, coverage, and computation time, both implemented models are competitive with the top-ranking ones in Heaton et al. (2019).
• Calculate Γ_{0,i} as the matrix of size n_{Γ_i} × d such that

F Benefits of MGPs on tessellated domains
The block conditional independence structure of tessellated MGPs favors parallelization and improves MCMC mixing; moreover, when data are on semi-regular lattices, standard base covariance functions give rise to regular patterns, so that many redundant, expensive matrix operations can be avoided. We now show the practical benefits of this class of models using a cubic MGP design.

F.1 Compute time with caching and parallelization
As mentioned earlier, our approach enables parallelization of all high-dimensional operations and facilitates caching when the locations are at least partly on a regular grid. Figure 11 shows the time per iteration of Q-MGP on a fully observed slice of the Serengeti data, for blocks of size 36, 42, and 49, corresponding to squares of size 6 × 6; 6 × 7 or 7 × 6; and 7 × 7, respectively. Note that caching is optional; here we showcase the speed differential between the best case (regular lattice) and the worst case (completely irregularly spaced observations) by disabling caching. To put the numbers in perspective, we compare our results with NNGP models estimated via the sequential sampler of Datta et al. (2016a), as implemented in the highly optimized R package spNNGP, choosing m ∈ {10, 15}. Like our method, this alternative is fully Bayesian, recovers the spatial random effects, and is estimated via a Gibbs sampler; unlike Q-MGP, it does not use caching or sample w in parallel. Computing times of Q-MGP models without caching are comparable to those of NNGP models; the Q-MGP model with block size 16 is about 30% faster than the NNGP model with 10 neighbors.
However, caching allows one to consider much larger regions at little additional computational cost: without caching, the large-region Q-MGP model (M = 150²) is about 2.5 times slower than its small-region counterpart (M = 250²). With caching turned on, the small-region model is only 25% faster, and the large-region one is still about 4 times faster than NNGP with m = 10. Similarly, the large savings in computing time from caching allowed us to design the best-performing model for the real-world application of Heaton et al. (2019); see Table 3.
There is little evidence of marked advantages specifically due to parallel sampling on this 60-core virtual machine, other than a seemingly slightly steeper downward slope for the Q-MGP models. With this in mind, we now compare the parallel sampler of Q-MGPs to a sequential NNGP sampler (from the R package spNNGP; Finley et al., 2020) in terms of the ESS of the latent random effects. We generate y = Zw + ε on a regular 80 × 80 grid, where Z = I_n, n = 6400, the spatial process is w ∼ GP(0, C) with C(h) = σ² exp{−φ‖h‖} for h = ℓ − ℓ′, and ε(ℓ) ∼ N(0, τ²) for all ℓ, using parameters σ² = 1, φ ∈ {0.01, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8}, and nugget τ² = 0.01. The goal is to sample w a posteriori. We assign the prior φ ∼ U[0.01, 30]; Q-MGP models were about 2.6 to 3 times faster than sequential NNGP once caching was turned on. Figure 12 shows the effective sample size of the resulting approximate posterior sample of w, averaged across all spatial locations. While both models show degraded performance under long-range spatial dependence (small φ), the NNGP model performs significantly worse: with low spatial decay, the Q-MGP model exhibits effective sample sizes up to 2.5 times larger than the equivalent NNGP model. The smaller effective sample size of NNGPs implies that a much larger number of MCMC samples would be necessary to appropriately approximate an independent posterior sample of size 1,000, resulting in significantly longer "effective" compute times; we estimate Q-MGP models to be up to ten times faster than sequential NNGPs. As expected, given the small sample size, the differences shrink for larger values of φ, in which case both models output posterior samples with small autocorrelation and large ESS. Larger data sizes are generally associated with worse mixing, so we expect these findings to generalize to those settings.

Table 4: Comparative performance of Q-MGP and Gapfill on recovering missing pixels of an animated GIF image.
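Effective sample size can be estimated from a single chain by truncating the empirical autocorrelations at the first negative lag; the sketch below (a standard, simple estimator, not necessarily the one used in our comparisons) illustrates how strong autocorrelation deflates ESS.

```python
import numpy as np

def effective_sample_size(x):
    """ESS of one chain via an initial-positive-sequence truncation of the
    empirical autocorrelations: ESS = n / (1 + 2 * sum of positive ACF)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    acf = np.correlate(xc, xc, "full")[n - 1:] / (np.arange(n, 0, -1) * xc.var())
    s = 0.0
    for k in range(1, n):
        if acf[k] < 0:             # truncate at the first negative lag
            break
        s += acf[k]
    return n / (1.0 + 2.0 * s)

rng = np.random.default_rng(0)
iid = rng.normal(size=4000)                    # well-mixed chain
ar = np.zeros(4000)
for t in range(1, 4000):                       # sticky AR(1) chain
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
```

The AR(1) chain yields a far smaller ESS than the independent draws, which is the sense in which "effective" compute time depends on mixing as well as on time per iteration.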

G Recovering missing pixels of an animated GIF image
We compare Q-MGPs with Gapfill (Gerber et al., 2018) on non-Gaussian data in the form of an animated GIF image. The original GIF image collects 30 frames, each of size 200 × 250, for a total data size of 1.5 million locations. We reduce this to n_all = 94,500 by downsampling each of the 30 frames at a 16:1 ratio. Similarly to Section 5.2, we simulate cloud cover by placing random clouds in 5 of the 30 frames, covering all but 10 locations in another 5 frames, and covering 50% of locations at random in 5 further frames; 15 frames were thus left untouched (see Figure 13). The resulting dataset includes n_obs = 67,363 non-missing observations and n_test = 27,137 missing ones to be predicted. We implement a Q-MGP model with M = 1050, found by partitioning the two spatial coordinates into 5 and 21 intervals, respectively, and time into 10 intervals. We run the Gibbs sampler for 15,000 iterations, discard the first 10,000, and keep one of every five of the remaining 5,000 to obtain a sample of size 1,000 from the approximate posterior distribution, which we use for predictions. We used the same base covariance, prior distributions, and starting values as in the previous sections, for a total run time of 35 minutes (0.14 s/iteration) on 11 threads. As seen in Table 4 and from a visual inspection of Figure 13, Q-MGP outperforms Gapfill. Figure 13 shows predictions from Q-MGP and Gapfill alongside the observed image frames at three selected times (of 30 total) of the GIF image.