
Accelerated Gibbs sampling of normal distributions using matrix splittings and polynomials

Colin Fox (fox@physics.otago.ac.nz), Department of Physics, University of Otago, Dunedin, New Zealand
Albert Parker (parker@math.montana.edu), Center for Biofilm Engineering, Department of Mathematical Sciences, Montana State University, Bozeman, MT, USA
Abstract

Standard Gibbs sampling applied to a multivariate normal distribution with a specified precision matrix is equivalent in fundamental ways to the Gauss-Seidel iterative solution of linear equations in the precision matrix. Specifically, the iteration operators, the conditions under which convergence occurs, and the geometric convergence factors (and rates) are identical. These results hold for arbitrary matrix splittings from classical iterative methods in numerical linear algebra, giving easy access to mature results in that field, including existing convergence results for antithetic-variable Gibbs sampling, REGS sampling, and generalizations. Hence, efficient deterministic stationary relaxation schemes lead to efficient generalizations of Gibbs sampling. The technique of polynomial acceleration that significantly improves the convergence rate of an iterative solver derived from a symmetric matrix splitting may be applied to accelerate the equivalent generalized Gibbs sampler. Equality of error polynomials guarantees convergence of the inhomogeneous Markov chain, while equality of convergence factors ensures that the optimal solver leads to the optimal sampler. Numerical examples are presented, including a Chebyshev accelerated SSOR Gibbs sampler applied to a stylized demonstration of low-level Bayesian image reconstruction in a large 3-dimensional linear inverse problem.

Keywords: Bayesian inference, Gaussian Markov random field, Gibbs sampling, matrix splitting, multivariate normal distribution, non-stationary stochastic iteration, polynomial acceleration


1 Introduction

The Metropolis-Hastings algorithm for MCMC was introduced to mainstream statistics around 1990 (Robert and Casella, 2011), though prior to that the Gibbs sampler provided a coherent approach to investigating distributions with Markov random field structure (Turčin, 1971; Grenander, 1983; Geman and Geman, 1984; Gelfand and Smith, 1990; Besag and Green, 1993; Sokal, 1993). The Gibbs sampler may be thought of as a particular Metropolis-Hastings algorithm that uses the conditional distributions as proposal distributions, with acceptance probability always equal to 1 (Geyer, 2011).

In statistics the Gibbs sampler is popular because of its ease of implementation when conditional distributions are available, in the sense that samples may be drawn from the full conditionals (see, e.g., Roberts and Sahu, 1997). However, the Gibbs sampler is not often presented as an efficient algorithm, particularly for massive models. In this work we show that generalized and accelerated Gibbs samplers are contenders for the fastest sampling algorithms for normal target distributions, because they are equivalent to the fastest algorithms for solving systems of linear equations.

Almost all current MCMC algorithms, including Gibbs samplers, simulate a fixed transition kernel that induces a homogeneous Markov chain that converges geometrically in distribution to the desired target distribution. In this aspect, modern variants of the Metropolis-Hastings algorithm are unchanged from the Metropolis algorithm as first implemented in the 1950’s. The adaptive Metropolis algorithm of Haario et al. (2001) (see also Roberts and Rosenthal, 2007) is an exception, though it converges to a geometrically convergent Metropolis-Hastings algorithm that bounds convergence behaviour.

We focus on the application of Gibbs sampling to drawing samples from a multivariate normal distribution with a given covariance or precision matrix. Our concern is to develop generalized Gibbs samplers with optimal geometric, or better than geometric, distributional convergence by drawing on ideas in numerical computation, particularly the mature field of computational linear algebra. We apply the matrix-splitting formalism to show that fixed-scan Gibbs sampling from a multivariate normal is equivalent in fundamental ways to the stationary linear iterative solvers applied to systems of equations in the precision matrix.

Stationary iterative solvers are now considered to be very slow precisely because of their geometric rate of convergence, and are no longer used for large systems. However, they remain a basic building block in the most efficient linear solvers. By establishing equivalence of error polynomials we provide a route whereby acceleration techniques from numerical linear algebra may be applied to Gibbs sampling from normal distributions. The fastest solvers employ non-stationary iterations, hence the equivalent generalized Gibbs sampler induces an inhomogeneous Markov chain. Explicit calculation of the error polynomial guarantees convergence, while control of the error polynomial gives optimal performance.

The adoption of the matrix splitting formalism gives the following practical benefits in the context of fixed-scan Gibbs sampling from normal targets:

  1. a one-to-one equivalence between generalized Gibbs samplers and classical linear iterative solvers;

  2. rates of convergence and error polynomials for the Markov chain induced by a generalized Gibbs sampler;

  3. acceleration of the Gibbs sampler to induce an inhomogeneous Markov chain that achieves the optimal error polynomial, and hence has optimal convergence rate;

  4. numerical estimates of the convergence rate of the Gibbs sampler in a single chain, and a priori estimates of the number of iterations to convergence;

  5. access to preconditioning, whereby the sampling problem is transformed into an equivalent problem for which the accelerated Gibbs sampler has an improved convergence rate.

Some direct linear solvers have already been adapted to sampling from multivariate normal distributions, with Rue (2001) demonstrating the use of solvers based on Cholesky factorization to allow computationally efficient sampling. This paper extends the connection to the iterative linear solvers. Since iterative methods are the most efficient for massive linear systems, the associated samplers will be the most efficient for very high dimensional normal targets.

1.1 Context and overview of results

The Cholesky factorization is the conventional way to produce samples from a moderately sized multivariate normal distribution (Rue, 2001; Rue and Held, 2005), and is also the preferred method for solving moderately sized linear systems. For large linear systems, iterative solvers are the methods of choice due to their inexpensive cost per iteration, and small computer memory requirements.

Gibbs samplers applied to normal distributions are essentially identical to stationary iterative methods from numerical linear algebra. This connection was exploited by Adler (1981), and independently by Barone and Frigessi (1990), who noted that the component-wise Gibbs sampler is a stochastic version of the Gauss-Seidel linear solver, and accelerated the Gibbs sampler by introducing a relaxation parameter to implement the stochastic version of the successive over-relaxation (SOR) of Gauss-Seidel. This pairing was further analyzed by Goodman and Sokal (1989).

This equivalence is depicted in panels A and B of Figure 1. Panel B shows the contours of a normal density $\pi(\mathbf{x})$, and a sequence of coordinate-wise conditional samples taken by the Gibbs sampler applied to $\pi$. Panel A shows the contours of the quadratic $-\log\left(\pi(\mathbf{x})\right)$ and the Gauss-Seidel sequence of coordinate optimizations (Gauss-Seidel optimization was rediscovered by Besag (1986) as iterated conditional modes), or, equivalently, solves of the normal equations $\nabla\log\pi(\mathbf{x})=0$. Note how in Gauss-Seidel the step sizes decrease towards convergence, which is a tell-tale sign that convergence (in value) is geometric. In Section 4 we will show that the iteration operator is identical to that of the Gibbs sampler in panel B, and hence the Gibbs sampler also converges geometrically (in distribution). Slow convergence of these algorithms is usually understood with the same intuition: high correlations correspond to long narrow contours, and lead to small steps in coordinate directions, so that many iterations are required to move appreciably along the long axis of the target.

[Figure 1 panels. Left column, solving $\mathbf{A}\mathbf{x}=\mathbf{b}$: A: Gauss-Seidel; C: Chebyshev-SSOR; E: CG. Right column, sampling from $\operatorname{N}({\boldsymbol{\mu}},\mathbf{A}^{-1})$: B: Gibbs; D: Chebyshev-SSOR sampler; F: CG Gibbs.]

Figure 1: The panels in the left column show the contours of a quadratic function $\frac{1}{2}\mathbf{x}^{T}\mathbf{A}\mathbf{x}-\mathbf{b}^{T}\mathbf{x}$ in two dimensions and the iteration paths of some common optimizers towards the minimizer ${\boldsymbol{\mu}}=\mathbf{A}^{-1}\mathbf{b}$, or equivalently the paths of iterative linear solvers of $\mathbf{A}\mathbf{x}=\mathbf{b}$. The right column presents the iteration paths of the samplers corresponding to each linear solver, along with the contours of the normal density $k\exp\left\{-\frac{1}{2}\mathbf{x}^{T}\mathbf{A}\mathbf{x}+\mathbf{b}^{T}\mathbf{x}\right\}$, where $k$ is the normalizing constant. In all panels, the matrix $\mathbf{A}$ has eigenpairs $\{\left(10,[1~1]^{T}\right),\left(1,[1~-1]^{T}\right)\}$. The Gauss-Seidel solver took 45 iterations to converge to ${\boldsymbol{\mu}}$ (shown are the 90 coordinate steps; each iteration is a "sweep" of the two coordinate directions), the Chebyshev polynomial accelerated SSOR solver required just 16 iterations to converge, while CG finds the minimizer in 2 iterations. For each of the samplers, 10 iterations are shown (the 20 coordinate steps are shown for the Gibbs sampler). The correspondence between these linear solvers/optimizers and samplers is treated in the text (CG in the supplementary material).

Roberts and Sahu (1997) considered forward then backward sweeps of coordinate-wise Gibbs sampling, with a relaxation parameter, to give a sampler they termed REGS. This is a stochastic version of the symmetric-SOR (SSOR) iteration, which comprises forward then backward sweeps of SOR.

The equality of iteration operators and error polynomials, for these pairs of fixed-scan Gibbs samplers and iterative solvers, allows existing convergence results in numerical analysis texts (for example Axelsson, 1996; Golub and Loan, 1989; Nevanlinna, 1993; Saad, 2003; Young, 1971) to be used to establish convergence results for the corresponding Gibbs sampler. Existing results for rates of distributional convergence by fixed-sweep Gibbs samplers (Adler, 1981; Barone and Frigessi, 1990; Liu et al., 1995; Roberts and Sahu, 1997) may be established this way.

The methods of Gauss-Seidel, SOR, and SSOR give stationary linear iterations that were used as linear solvers in the 1950's, and are now considered very slow. The corresponding fixed-scan Gibbs samplers are slow for precisely the same reason. The last fifty years have seen an explosion of theoretical results and algorithmic development that have made linear solvers faster and more efficient, so that for large problems stationary methods are used as preconditioners at best, while preconditioned conjugate gradients, GMRES, multigrid, and fast-multipole methods are the current state of the art for solving linear systems in a finite number of steps (Saad and van der Vorst, 2000).

Linear iterations derived from a symmetric splitting may be sped up by polynomial acceleration, particularly Chebyshev acceleration that results in optimal error reduction amongst methods that have a fixed non-stationary iteration structure (Fox and Parker, 1968; Axelsson, 1996). The Chebyshev accelerated SSOR solver and corresponding Chebyshev accelerated SSOR sampler (Fox and Parker, 2014) are depicted in panels C and D of Figure 1. Both the solver and sampler take steps that are more aligned with the long axis of the target, compared to the coordinate-wise algorithms, and hence achieve faster convergence. However, the step size of Chebyshev-SSOR solving still decreases towards convergence, and hence convergence for both solver and sampler is still asymptotically geometric, albeit with much improved rate.

Fox and Parker (2014) considered point-wise convergence of the mean and variance of a Gibbs SSOR sampler accelerated by Chebyshev polynomials. In this paper we prove convergence in distribution for Gibbs samplers corresponding to any matrix splitting and accelerated by any polynomial that is independent of the Gibbs iterations. We then apply a polynomial accelerated sampler to solve a massive Bayesian linear inverse problem that is infeasible to solve using conventional techniques.

Chebyshev acceleration requires estimates of the extreme eigenvalues of the error operator, which we obtain via a conjugate-gradient (CG) algorithm at no significant computational cost (Meurant, 2006). The CG algorithm itself may be adapted to sample from normal distributions; the CG solver and corresponding sampler, depicted in panels E and F of Figure 1, were analysed by Parker and Fox (2012) and are discussed in the supplementary material.

1.2 Structure of the paper

In Section 2 we review efficient methods for sampling from normal distributions, highlighting Gibbs sampling in various algorithmic forms. Standard results for stationary iterative solvers are presented in Section 3. Theorems in Section 4 establish equivalence of convergence and convergence factors for iterative solvers and Gibbs samplers. Application of polynomial acceleration methods to linear solvers and Gibbs sampling is given in Section 5, including a proof of convergence of the first and second moments of a polynomial accelerated sampler. Numerical verification of convergence results is presented in Section 6.

2 Sampling from multivariate normal distributions

We consider the problem of sampling from an $n$-dimensional normal distribution $\operatorname{N}({\boldsymbol{\mu}},{\boldsymbol{\Sigma}})$ defined by the mean $n$-vector ${\boldsymbol{\mu}}$, and the $n\times n$ symmetric and positive definite (SPD) covariance matrix ${\boldsymbol{\Sigma}}$. Since $\mathbf{z}\sim\operatorname{N}(\mathbf{0},{\boldsymbol{\Sigma}})$ implies $\mathbf{z}+{\boldsymbol{\mu}}\sim\operatorname{N}({\boldsymbol{\mu}},{\boldsymbol{\Sigma}})$, it often suffices to consider drawing samples from normal distributions with zero mean. An exception is when ${\boldsymbol{\mu}}$ is defined implicitly, which we discuss in Section 4.1.

In Bayesian formulations of inverse problems that use a GMRF as a prior distribution, typically the precision matrix $\mathbf{A}={\boldsymbol{\Sigma}}^{-1}$ is explicitly modeled and available (Rue and Held, 2005; Higdon, 2006), perhaps as part of a hierarchical model (Banerjee et al., 2003). Typically then the precision matrix (conditioned on hyperparameters) is large though sparse, if the neighborhoods specifying conditional independence are small. We are particularly interested in this case, and throughout the paper will focus on sampling from $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$ when $\mathbf{A}$ is sparse and large, or when some other property makes operating by $\mathbf{A}$ easy, i.e., one can evaluate $\mathbf{A}\mathbf{x}$ for any vector $\mathbf{x}$.

Standard sampling methods for moderately sized normal distributions utilize the Cholesky factorization (Rue, 2001; Rue and Held, 2005) since it is fast, incurring approximately $(1/3)n^{3}$ floating point operations (flops), and is backwards stable (Watkins, 2002). Samples can also be drawn using the more expensive eigen-decomposition (Rue and Held, 2005), that costs approximately $(10/3)n^{3}$ flops, or more generally using mutually conjugate vectors (Fox, 2008; Parker and Fox, 2012). For stationary Gaussian random fields defined on the lattice, Fourier methods can lead to efficient sampling for large problems (Gneiting et al., 2005).

Algorithm 1 shows the steps for sampling from $\operatorname{N}(\mathbf{0},{\boldsymbol{\Sigma}})$ using Cholesky factorization, when the covariance matrix ${\boldsymbol{\Sigma}}$ is available (Neal, 1997; MacKay, 2003; Higdon, 2006).

input: covariance matrix ${\boldsymbol{\Sigma}}$
output: $\mathbf{y}\sim\operatorname{N}(\mathbf{0},{\boldsymbol{\Sigma}})$
Cholesky factor ${\boldsymbol{\Sigma}}=\mathbf{C}\mathbf{C}^{T}$;
sample $\mathbf{z}\sim\operatorname{N}(\mathbf{0},\mathbf{I})$;
$\mathbf{y}=\mathbf{C}\mathbf{z}$;
Algorithm 1: Cholesky sampling using a covariance matrix ${\boldsymbol{\Sigma}}$
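
As a concrete illustration, here is a minimal NumPy sketch of Algorithm 1; the matrix Sigma below is an arbitrary small SPD covariance chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative SPD covariance matrix (stand-in for the Sigma of interest)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

C = np.linalg.cholesky(Sigma)              # Sigma = C C^T, C lower triangular
z = rng.standard_normal(Sigma.shape[0])    # z ~ N(0, I)
y = C @ z                                  # y ~ N(0, Sigma)
```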

When the precision matrix $\mathbf{A}$ is available, a sample $\mathbf{y}\sim\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$ may be drawn using Algorithm 2, given by Rue (2001) (see also Rue and Held, 2005).

input: precision matrix $\mathbf{A}$
output: $\mathbf{y}\sim\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$
Cholesky factor $\mathbf{A}=\mathbf{B}\mathbf{B}^{T}$;
sample $\mathbf{z}\sim\operatorname{N}(\mathbf{0},\mathbf{I})$;
solve $\mathbf{B}^{T}\mathbf{y}=\mathbf{z}$ by back substitution;
Algorithm 2: Cholesky sampling using a precision matrix $\mathbf{A}$
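
A corresponding sketch of Algorithm 2, with an illustrative precision matrix A; the sample is obtained by back substitution with the transposed Cholesky factor.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)

# Illustrative SPD precision matrix (stand-in for the A of interest)
A = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])

B = np.linalg.cholesky(A)                  # A = B B^T, B lower triangular
z = rng.standard_normal(A.shape[0])        # z ~ N(0, I)
y = solve_triangular(B.T, z, lower=False)  # solve B^T y = z, so y ~ N(0, A^{-1})
```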

The computational cost of Algorithm 2 depends on the bandwidth of $\mathbf{A}$, which also bounds the bandwidth of the Cholesky factor $\mathbf{B}$. For a bandwidth $b$, calculation of the Cholesky factorization requires $\mathcal{O}(b^{2}n)$ flops, which provides savings over the full-bandwidth case when $b\ll n/2$ (Golub and Loan, 1989; Rue, 2001; Watkins, 2002). For GMRFs defined over 2-dimensional domains, the use of a bandwidth reducing permutation often leads to substantial computational savings (Rue, 2001; Watkins, 2002). In 3 dimensions and above, typically no permutation exists that can significantly reduce the bandwidth below $n^{2/3}$, hence the cost of sampling by Cholesky factoring is at least $\mathcal{O}(n^{7/3})$ flops. Further, Cholesky factorizing requires that the precision matrix and the Cholesky factor be stored in computer memory, which can be prohibitive for large problems. In Section 6 we give an example of sampling from a large GMRF for which Cholesky factorization is prohibitively expensive.

2.1 Gibbs sampling from a normal distribution

Iterative samplers, such as Gibbs, are an attractive option when drawing samples from high dimensional multivariate normal distributions due to their inexpensive cost per iteration and small computer memory requirements (only vectors of size $n$ need be stored). If the precision matrix is sparse with $\mathcal{O}(n)$ non-zero elements, then, regardless of the bandwidth, iterative methods cost only about $2n$ flops per iteration, which is comparable with sparse Cholesky factorizations. However, when the bandwidth is $\mathcal{O}(n)$, the cost of the Cholesky factorization is high at $\mathcal{O}(n^{3})$ flops, while iterative methods maintain their inexpensive cost per iteration. Iterative methods are then preferable when requiring significantly fewer than $\mathcal{O}(n^{2})$ iterations for adequate convergence. In the examples presented in Section 6 we find that $\mathcal{O}(n)$ iterations give convergence to machine precision, so the iterative methods are preferable for large problems.

2.1.1 Componentwise formulation

One of the simplest iterative sampling methods is the component-sweep Gibbs sampler (Geman and Geman, 1984; Gelman et al., 1995; Gilks et al., 1996; Rue and Held, 2005). Let $\mathbf{y}=(y_{1},y_{2},\ldots,y_{n})^{T}\in\Re^{n}$ denote a vector in terms of its components, and let $\mathbf{A}$ be an $n\times n$ precision matrix with elements $\{a_{ij}\}$. One sweep over all $n$ components can be written as in Algorithm 3 (Barone and Frigessi, 1990), showing that the algorithm can be implemented using vector and scalar operations only, and storage or inversion of the precision matrix $\mathbf{A}$ is not required.

input: precision matrix $\mathbf{A}$, initial state $\mathbf{y}^{(0)}=\left(y_{1}^{(0)},y_{2}^{(0)},\ldots,y_{n}^{(0)}\right)^{T}$, and maximum iteration $k_{\text{max}}$
output: $\left\{\mathbf{y}^{(0)},\mathbf{y}^{(1)},\mathbf{y}^{(2)},\ldots,\mathbf{y}^{(k_{\text{max}})}\right\}$ where $\mathbf{y}^{(k)}\overset{\mathcal{D}}{\to}\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$ as $k\to\infty$
for $k=1,2,\ldots,k_{\text{max}}$ do
  for $i=1,2,\ldots,n$ do
    sample $z\sim\operatorname{N}(0,1)$;
    $y_{i}^{(k)}=\displaystyle\frac{z}{\sqrt{a_{ii}}}-\frac{1}{a_{ii}}\left(\sum_{j>i}a_{ij}y_{j}^{(k-1)}+\sum_{j<i}a_{ij}y_{j}^{(k)}\right)$
  end
end
Algorithm 3: Component-sweep Gibbs sampling using a precision matrix $\mathbf{A}$

The index $k$ may be omitted (with '$=$' interpreted as assignment) to give an algorithm that can be evaluated in place, requiring minimal storage.
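
An in-place NumPy sketch of this component sweep (dropping the index k as just described); the loop is written for a dense A purely for clarity.

```python
import numpy as np

def gibbs_sweeps(A, y, k_max, rng):
    """Component-sweep Gibbs sampler targeting N(0, A^{-1}); updates y in place."""
    n = A.shape[0]
    for _ in range(k_max):
        for i in range(n):
            z = rng.standard_normal()
            # sum_{j != i} a_ij y_j, using already-updated components for j < i
            s = A[i, :] @ y - A[i, i] * y[i]
            y[i] = z / np.sqrt(A[i, i]) - s / A[i, i]
    return y
```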

2.1.2 Matrix formulation

One iteration in Algorithm 3 consists of a sweep over all $n$ components of $\mathbf{y}^{(k)}$ in sequence. The iteration can be written succinctly in the matrix form (Goodman and Sokal, 1989)

$$\mathbf{y}^{(k+1)}=-\mathbf{D}^{-1}\mathbf{L}\mathbf{y}^{(k+1)}-\mathbf{D}^{-1}\mathbf{L}^{T}\mathbf{y}^{(k)}+\mathbf{D}^{-1/2}\mathbf{z}^{(k)}, \qquad (1)$$

where $\mathbf{z}^{(k)}\sim\operatorname{N}(\mathbf{0},\mathbf{I})$, $\mathbf{D}=\operatorname{diag}(\mathbf{A})$, and $\mathbf{L}$ is the strictly lower triangular part of $\mathbf{A}$. This equation makes clear that the computational cost of each sweep is about $2n^{2}$ flops when $\mathbf{A}$ is dense, due to multiplication by the triangular matrices $\mathbf{L}$ and $\mathbf{L}^{T}$, and $\mathcal{O}(n)$ flops when $\mathbf{A}$ is sparse.

Extending this formulation to sweeps over any other fixed sequence of coordinates is achieved by putting $\mathbf{P}\mathbf{A}\mathbf{P}^{T}$ in place of $\mathbf{A}$ for some permutation matrix $\mathbf{P}$. The use of random sweep Gibbs sampling has also been suggested (Amit and Grenander, 1991; Fishman, 1996; Liu et al., 1995; Roberts and Sahu, 1997), though we do not consider that here.

2.1.3 Convergence

If the iterates $\mathbf{y}^{(k)}$ in (1) converge in distribution to a distribution $\Pi$ which is independent of the starting state $\mathbf{y}^{(0)}$, then the sampler is convergent, and we write

$$\mathbf{y}^{(k)}\overset{\mathcal{D}}{\to}\Pi.$$

It is well known that the iterates $\mathbf{y}^{(k)}$ in the Gibbs sampler (1) converge in distribution geometrically to $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})=\operatorname{N}(\mathbf{0},{\boldsymbol{\Sigma}})$ (Roberts and Sahu, 1997). We consider geometric convergence in detail in Section 4.

3 Linear stationary iterative methods as linear equation solvers

Our work draws heavily on existing results for stationary linear iterative methods for solving linear systems. Here we briefly review the main results that we will use.

Consider a system of linear equations written as the matrix equation

$$\mathbf{A}\mathbf{x}=\mathbf{b} \qquad (2)$$

where $\mathbf{A}$ is a given $n\times n$ nonsingular matrix and $\mathbf{b}$ is a given $n$-dimensional vector. The problem is to find an $n$-dimensional vector $\mathbf{x}$ satisfying equation (2). Later we will consider the case where $\mathbf{A}$ is symmetric positive definite (SPD), as holds for covariance and precision matrices (Feller, 1968).

3.1 Matrix splitting form of stationary iterative algorithms

A common class of methods for solving (2) are the linear iterative methods based on a splitting of $\mathbf{A}$ into $\mathbf{A}=\mathbf{M}-\mathbf{N}$. The matrix splitting is the standard way of expressing and analyzing linear iterative algorithms, following its introduction by Varga (1962). The system (2) is then transformed to $\mathbf{M}\mathbf{x}=\mathbf{N}\mathbf{x}+\mathbf{b}$ or, if $\mathbf{M}$ is nonsingular, $\mathbf{x}=\mathbf{M}^{-1}\mathbf{N}\mathbf{x}+\mathbf{M}^{-1}\mathbf{b}$. The iterative methods use this equation to compute successively better approximations $\mathbf{x}^{(k)}$ to the solution using the iteration step

$$\mathbf{x}^{(k+1)}=\mathbf{M}^{-1}\mathbf{N}\mathbf{x}^{(k)}+\mathbf{M}^{-1}\mathbf{b}=\mathbf{G}\mathbf{x}^{(k)}+\mathbf{g}. \qquad (3)$$

We follow the standard terminology used for these methods (see, e.g., Axelsson, 1996; Golub and Loan, 1989; Saad, 2003; Young, 1971). Such methods are termed linear stationary iterative methods (of the first degree); they are stationary because the iteration matrix $\mathbf{G}=\mathbf{M}^{-1}\mathbf{N}$ and the vector $\mathbf{g}=\mathbf{M}^{-1}\mathbf{b}$ do not depend on $k$. (This use of 'stationary' corresponds to the term 'homogeneous' when referring to a Markov chain. It is not to be confused with a stationary distribution that is invariant under the iteration. Later we will develop non-stationary iterations, inducing a non-homogeneous Markov chain that will, however, preserve the target distribution at each iterate.) The splitting is symmetric when both $\mathbf{M}$ and $\mathbf{N}$ are symmetric matrices. The iteration, and splitting, is convergent if $\mathbf{x}^{(k)}$ tends to a limit independent of $\mathbf{x}^{(0)}$, the limit being $\mathbf{A}^{-1}\mathbf{b}$ (see, e.g., Young, 1971, Theorem 5.2).

The iteration (3) is often written in the residual form, so that convergence may be monitored in terms of the norm of the residual vector; this form emphasizes that $\mathbf{M}^{-1}$ is acting as an approximation to $\mathbf{A}^{-1}$, as in Algorithm 4.

input: the splitting $\mathbf{M}$, $\mathbf{N}$ of $\mathbf{A}$, initial state $\mathbf{x}^{(0)}$
output: $\mathbf{x}^{(k)}$ approximating $\mathbf{x}^{\ast}=\mathbf{A}^{-1}\mathbf{b}$
$k=0$;
repeat
  $\mathbf{r}^{(k)}=\mathbf{b}-\mathbf{A}\mathbf{x}^{(k)}$;
  $\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}+\mathbf{M}^{-1}\mathbf{r}^{(k)}$;
  increment $k$;
until $\left\|\mathbf{r}^{(k)}\right\|$ is sufficiently small;
Algorithm 4: Stationary iterative solve of $\mathbf{A}\mathbf{x}=\mathbf{b}$

In computational algorithms it is important to note that the symbol $\mathbf{M}^{-1}\mathbf{r}$ is interpreted as "solve the system $\mathbf{M}\mathbf{u}=\mathbf{r}$ for $\mathbf{u}$" rather than "form the inverse of $\mathbf{M}$ and multiply $\mathbf{r}$ by $\mathbf{M}^{-1}$", since the latter is much more computationally expensive (about $2n^{3}$ flops (Watkins, 2002)). Thus, the practicality of a splitting depends on the ease with which one can solve $\mathbf{M}\mathbf{u}=\mathbf{r}$ for any vector $\mathbf{r}$.
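
A NumPy sketch of Algorithm 4, with the solve against M left generic; solve_M is a caller-supplied function (a name introduced only for this sketch) that returns the solution of M u = r.

```python
import numpy as np

def stationary_solve(A, b, solve_M, x0, tol=1e-10, max_iter=1000):
    """Stationary iterative solve of A x = b for a splitting A = M - N.

    solve_M(r) must return the solution u of M u = r.
    """
    x = x0.copy()
    for _ in range(max_iter):
        r = b - A @ x            # residual r^{(k)} = b - A x^{(k)}
        if np.linalg.norm(r) < tol:
            break
        x = x + solve_M(r)       # x^{(k+1)} = x^{(k)} + M^{-1} r^{(k)}
    return x
```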

3.1.1 The Gauss-Seidel algorithm

Many splittings of the matrix $\mathbf{A}$ use the terms in the expansion $\mathbf{A}=\mathbf{L}+\mathbf{D}+\mathbf{U}$, where $\mathbf{L}$ is the strictly lower triangular part of $\mathbf{A}$, $\mathbf{D}$ is the diagonal of $\mathbf{A}$, and $\mathbf{U}$ is the strictly upper triangular part.

For example, choosing $\mathbf{M}=\mathbf{L}+\mathbf{D}$ (so $\mathbf{N}=-\mathbf{U}$) allows $\mathbf{M}\mathbf{u}=\mathbf{r}$ to be solved by "forward substitution" (at a cost of $n^{2}$ flops when $\mathbf{A}$ is dense), and hence does not require inversion or Gauss-elimination of $\mathbf{M}$ (which would cost $(2/3)n^{3}$ flops when $\mathbf{A}$ is dense). Using this splitting in Algorithm 4 results in the Gauss-Seidel iterative algorithm. When $\mathbf{A}$ is symmetric, $\mathbf{U}=\mathbf{L}^{T}$, and the Gauss-Seidel iteration can be written as

$$\mathbf{x}^{(k+1)}=-\mathbf{D}^{-1}\mathbf{L}\mathbf{x}^{(k+1)}-\mathbf{D}^{-1}\mathbf{L}^{T}\mathbf{x}^{(k)}+\mathbf{D}^{-1}\mathbf{b}. \qquad (4)$$

Just as we pointed out for the Gibbs sampler, variants of the Gauss-Seidel algorithm, such as "red-black" coordinate updates (Saad, 2003), may be written in this form using a suitable permutation matrix.

The component-wise form of the Gauss-Seidel algorithm can be written in ‘equation’ form just as the Gibbs sampler (1) was in Algorithm 3. The component-wise form emphasizes that Gauss-Seidel can be implemented using vector and scalar operations only, and neither storage nor inversion of the splitting is required.
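
Using the stationary_solve routine sketched above, the Gauss-Seidel iteration (4) amounts to supplying a forward substitution with the lower triangle of A; a brief illustration (the matrices here are again arbitrary stand-ins):

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])     # illustrative SPD matrix
b = np.array([1.0, 0.0])

# Gauss-Seidel: M = D + L is the lower triangle of A (including the diagonal),
# so M u = r is solved by forward substitution.
solve_M = lambda r: solve_triangular(np.tril(A), r, lower=True)

x = stationary_solve(A, b, solve_M, x0=np.zeros(2))   # x approximates A^{-1} b
```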

3.2 Convergence

A fundamental theorem of linear stationary iterative methods states that the splitting $\mathbf{A}=\mathbf{M}-\mathbf{N}$, where $\mathbf{M}$ is nonsingular, is convergent (i.e., $\mathbf{x}^{(k)}\to\mathbf{A}^{-1}\mathbf{b}$ for any $\mathbf{x}^{(0)}$) if and only if $\varrho\left(\mathbf{M}^{-1}\mathbf{N}\right)<1$, where $\varrho\left(\cdot\right)$ denotes the spectral radius of a matrix (Young, 1971, Theorem 3.5.1). This characterization is often used as a definition (Axelsson, 1996; Golub and Loan, 1989; Saad, 2003).

The error at step $k$ is $\mathbf{e}^{(k+1)}=\mathbf{x}^{(k+1)}-\mathbf{x}^{\ast}$, where $\mathbf{x}^{\ast}=\mathbf{A}^{-1}\mathbf{b}$. It follows that

$$\mathbf{e}^{(k+1)}=(\mathbf{M}^{-1}\mathbf{N})^{k}\mathbf{e}^{(0)} \qquad (5)$$

and hence the asymptotic average reduction in error per iteration is the multiplicative factor

$$\lim_{k\to\infty}\left(\frac{\|\mathbf{e}^{(k+1)}\|_{2}}{\|\mathbf{e}^{(0)}\|_{2}}\right)^{1/k}=\varrho(\mathbf{M}^{-1}\mathbf{N}) \qquad (6)$$

(Axelsson, 1996, p. 166). In numerical analysis this is called the (asymptotic average) convergence factor (Axelsson, 1996; Saad, 2003). Later, we will show that this is exactly the same as the quantity called the geometric convergence rate in the statistics literature (see e.g. Robert and Casella, 1999), for the equivalent Gibbs sampler. We will use the term ‘convergence factor’ throughout this paper to avoid a clash of terminology, since in numerical analysis the rate of convergence is minus the log of the convergence factor (see e.g. Axelsson, 1996, p. 166).
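
For a small test matrix the convergence factor in (6) can be checked numerically by comparing the spectral radius of M^{-1}N with the observed error decay; a sketch using an illustrative Gauss-Seidel splitting:

```python
import numpy as np

A = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])
M = np.tril(A)                    # Gauss-Seidel: M = D + L
N = M - A                         # so that A = M - N
G = np.linalg.solve(M, N)         # iteration matrix G = M^{-1} N

rho = np.max(np.abs(np.linalg.eigvals(G)))    # spectral radius (convergence factor)

# Empirical check: iterate the error e <- G e and measure the average reduction
e = np.random.default_rng(2).standard_normal(2)
e0_norm, k = np.linalg.norm(e), 50
for _ in range(k):
    e = G @ e
print(rho, (np.linalg.norm(e) / e0_norm) ** (1 / k))   # the two numbers roughly agree
```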

3.3 Common matrix splittings

We now present the matrix splittings corresponding to some common stationary linear iterative solvers, with details for the case where $\mathbf{A}$ is symmetric, as holds for precision or covariance matrices.

We have seen that the Gauss-Seidel iteration uses the splitting $\mathbf{M}_{\text{GS}}=\mathbf{L}+\mathbf{D}$ and $\mathbf{N}_{\text{GS}}=-\mathbf{L}^{T}$. Gauss-Seidel is one of the simplest splittings and solvers, but is also quite slow. Other splittings have been developed, though the speed of each method is often problem specific. Some common splittings are shown in Table 1, listed with, roughly, greater speed downwards. Speed of convergence in a numerical example is presented later in Section 6.

Table 1: Common stationary linear solvers generated by splittings $\mathbf{A}=\mathbf{M}-\mathbf{N}$, and conditions that guarantee convergence when $\mathbf{A}$ is SPD

  splitting          | $\mathbf{M}$                                                                               | convergence
  Richardson (R)     | $\frac{1}{\omega}\mathbf{I}$                                                               | $0<\omega<\frac{2}{\varrho(\mathbf{A})}$
  Jacobi (J)         | $\mathbf{D}$                                                                               | $\mathbf{A}$ strictly diagonally dominant
  Gauss-Seidel (GS)  | $\mathbf{D}+\mathbf{L}$                                                                    | always
  SOR                | $\frac{1}{\omega}\mathbf{D}+\mathbf{L}$                                                    | $0<\omega<2$
  SSOR               | $\frac{\omega}{2-\omega}\mathbf{M}_{\text{SOR}}\mathbf{D}^{-1}\mathbf{M}_{\text{SOR}}^{T}$ | $0<\omega<2$
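
For small dense matrices the splittings in Table 1 are simple to form explicitly; a sketch (the function name and interface are of course only for this illustration):

```python
import numpy as np

def splitting(A, method, omega=1.0):
    """Return (M, N) with A = M - N for the named splitting of an SPD matrix A."""
    D = np.diag(np.diag(A))
    L = np.tril(A, k=-1)                       # strictly lower triangular part of A
    if method == "richardson":
        M = np.eye(A.shape[0]) / omega
    elif method == "jacobi":
        M = D
    elif method == "gauss-seidel":
        M = D + L
    elif method == "sor":
        M = D / omega + L
    elif method == "ssor":
        M_sor = D / omega + L
        M = (omega / (2.0 - omega)) * M_sor @ np.linalg.solve(D, M_sor.T)
    else:
        raise ValueError(f"unknown splitting: {method}")
    return M, M - A
```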

The method of successive over-relaxation (SOR) uses the splitting

$$\mathbf{M}_{\text{SOR}}=\frac{1}{\omega}\mathbf{D}+\mathbf{L}\quad\text{and}\quad\mathbf{N}_{\text{SOR}}=\frac{1-\omega}{\omega}\mathbf{D}-\mathbf{L}^{T} \qquad (7)$$

in which $\omega$ is a relaxation parameter chosen with $0<\omega<2$. SOR with $\omega=1$ is Gauss-Seidel. For values of $\omega$ such that $\varrho(\mathbf{M}^{-1}_{\text{SOR}}\mathbf{N}_{\text{SOR}})<\varrho(\mathbf{M}^{-1}_{\text{GS}}\mathbf{N}_{\text{GS}})$, SOR is an accelerated Gauss-Seidel iteration. Unfortunately, there is no closed form for the optimal value of $\omega$ for an arbitrary matrix $\mathbf{A}$, and the interval of values of $\omega$ which admits accelerated convergence can be quite narrow (Young, 1971; Golub and Loan, 1989; Saad, 2003).

The symmetric-SOR method (SSOR) incorporates both a forward and a backward sweep of SOR, so that if $\mathbf{A}$ is symmetric then the splitting is symmetric (Golub and Loan, 1989; Saad, 2003),

$$\mathbf{M}_{\text{SSOR}}=\frac{\omega}{2-\omega}\mathbf{M}_{\text{SOR}}\mathbf{D}^{-1}\mathbf{M}_{\text{SOR}}^{T}\quad\text{and}\quad\mathbf{N}_{\text{SSOR}}=\frac{\omega}{2-\omega}\mathbf{N}_{\text{SOR}}^{T}\mathbf{D}^{-1}\mathbf{N}_{\text{SOR}}. \qquad (8)$$

We will make use of symmetric splittings in conjunction with polynomial acceleration in Section 5.

When the matrix $\mathbf{A}$ is dense, Gauss-Seidel and SOR cost about $3n^{2}$ flops per iteration, with $2n^{2}$ due to multiplication by the matrix $\mathbf{A}$ (in order to calculate the residual) and another $n^{2}$ for the forward substitution to solve $\mathbf{M}\mathbf{u}=\mathbf{r}$. Richardson incurs no cost to solve $\mathbf{M}\mathbf{u}=\mathbf{r}$, while a solve with the diagonal Jacobi matrix incurs $n$ flops. Iterative methods are particularly attractive when the matrix $\mathbf{A}$ is sparse, since then the cost per iteration is only $\mathcal{O}(n)$ flops.

Many theorems establish convergence of splittings by utilizing properties of $\mathbf{A}$ in specific applications. Some general conditions that guarantee convergence when $\mathbf{A}$ is SPD are given in the right column of Table 1 (Golub and Loan, 1989; Saad, 2003; Young, 1971).

4 Equivalence of stationary linear solvers and Gibbs samplers

We first consider the equivalence between linear solvers and stochastic iterations in the case where the starting state and noise are not necessarily normally distributed, then in Section 4.2 et seq. we restrict consideration to normal distributions.

4.1 General noise

The striking similarity between the Gibbs sampler (1) and the Gauss-Seidel iteration (4) is no coincidence. It is an example of a general equivalence between the stationary linear solver derived from a splitting and the associated stochastic iteration used as a sampler. We will make explicit the equivalence in the following theorems and corollary. In the first theorem we show that a splitting is convergent (in the sense of stationary iterative solvers) if and only if the associated stochastic iteration is convergent in distribution.

Theorem 1

Let $\mathbf{A}=\mathbf{M}-\mathbf{N}$ be a splitting with $\mathbf{M}$ invertible, and let $\pi(\cdot)$ be some fixed probability distribution with zero mean and fixed non-zero covariance. For any fixed vector $\mathbf{b}$, and random vectors $\mathbf{c}^{(k)}\overset{\text{iid}}{\sim}\pi$, $k=0,1,2,\ldots$, the stationary linear iteration

$$\mathbf{x}^{(k+1)}=\mathbf{M}^{-1}\mathbf{N}\mathbf{x}^{(k)}+\mathbf{M}^{-1}\mathbf{b} \qquad (9)$$

converges, with $\mathbf{x}^{(k)}\rightarrow\mathbf{A}^{-1}\mathbf{b}$ as $k\rightarrow\infty$ whatever the initial vector $\mathbf{x}^{(0)}$, if and only if there exists a distribution $\Pi$ such that the stochastic iteration

$$\mathbf{y}^{(k+1)}=\mathbf{M}^{-1}\mathbf{N}\mathbf{y}^{(k)}+\mathbf{M}^{-1}\mathbf{c}^{(k)} \qquad (10)$$

converges in distribution to $\Pi$, with $\mathbf{y}^{(k)}\overset{\mathcal{D}}{\rightarrow}\Pi$ as $k\rightarrow\infty$ whatever the initial state $\mathbf{y}^{(0)}$.

Proof. If the linear iteration (9) converges, then $\varrho(\mathbf{M}^{-1}\mathbf{N})<1$ (Thm 3-5.1 in Young, 1971). Hence there exists a unique distribution $\Pi$ with $\mathbf{y}^{(k+1)}\overset{\mathcal{D}}{\rightarrow}\Pi$ (Theorem 2.3.18-4 of Duflo, 1997), which shows necessity. Conversely, if the linear solver does not converge to a limit independent of $\mathbf{x}^{(0)}$ for some $\mathbf{b}$, that also holds for $\mathbf{b}=\mathbf{0}$, and hence initializing the sampler with $\operatorname{E}\left[\mathbf{y}^{(0)}\right]=\mathbf{x}^{(0)}$ yields $\operatorname{E}\left[\mathbf{y}^{(k+1)}\right]=(\mathbf{M}^{-1}\mathbf{N})^{k}\mathbf{x}^{(0)}$, which does not converge to a value independent of $\mathbf{y}^{(0)}$. Sufficiency holds by the contrapositive. □

Convergence of the stochastic iteration (10) could also be established via the more general theory of Diaconis and Freedman (1999), which allows the iteration operator $\mathbf{G}=\mathbf{M}^{-1}\mathbf{N}$ to be random, with convergence in distribution guaranteed when $\mathbf{G}$ is contracting on average; see Diaconis and Freedman (1999) for details.

The following theorem shows how to design the noise distribution $\pi$ so that the limit distribution $\Pi$ has a desired mean ${\boldsymbol{\mu}}$ and covariance ${\boldsymbol{\Sigma}}=\mathbf{A}^{-1}$.

Theorem 2

Let $\mathbf{A}$ be SPD, $\mathbf{A}=\mathbf{M}-\mathbf{N}$ be a convergent splitting, ${\boldsymbol{\mu}}$ a fixed vector, and $\pi(\cdot)$ a fixed probability distribution with finite mean ${\boldsymbol{\nu}}$ and non-zero covariance $\mathbf{V}$. Consider the stochastic iteration (10) where $\mathbf{c}^{(k)}\overset{\text{iid}}{\sim}\pi$, $k=0,1,2,\ldots$. Then, whatever the starting state $\mathbf{y}^{(0)}$, the following are equivalent:

  1. $\operatorname{E}\left[\mathbf{c}^{(k)}\right]={\boldsymbol{\nu}}$ and $\operatorname{Var}\left(\mathbf{c}^{(k)}\right)=\mathbf{V}=\mathbf{M}^{T}+\mathbf{N}$;

  2. the iterates $\mathbf{y}^{(k)}$ converge in distribution to some distribution $\Pi$ that has mean ${\boldsymbol{\mu}}=\mathbf{A}^{-1}{\boldsymbol{\nu}}$ and covariance matrix $\mathbf{A}^{-1}$; in particular $\operatorname{E}\left[\mathbf{y}^{(k)}\right]\to{\boldsymbol{\mu}}$ and $\operatorname{Var}\left(\mathbf{y}^{(k)}\right)\to\mathbf{A}^{-1}$ as $k\to\infty$.

Proof. Appendix A.1. □

Additionally, the mean and covariance converge geometrically, with convergence factors given by the convergence factors for the linear iterative method, as established in the following corollary.

Corollary 3

The first and second moments of iterates in the stochastic iteration in Theorem 2 converge geometrically. Specifically, $\operatorname{E}\left(\mathbf{y}^{(k)}\right)\to{\boldsymbol{\mu}}$ with convergence factor $\varrho(\mathbf{M}^{-1}\mathbf{N})$, and $\operatorname{Var}\left(\mathbf{y}^{(k)}\right)=\mathbf{A}^{-1}-\mathbf{G}^{k}\left(\mathbf{A}^{-1}-\operatorname{Var}(\mathbf{y}^{(0)})\right)(\mathbf{G}^{k})^{T}\to\mathbf{A}^{-1}$ with convergence factor $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2}$.

Proof. Appendix A.1. □

Note that the matrix splitting has allowed an explicit construction of the noise covariance to give a desired precision matrix of the target distribution. We see from Theorem 2 that the stochastic iteration may be designed to converge to a distribution with non-zero target mean, essentially by adding the deterministic iteration (9) to the stochastic iteration (10). This is particularly useful when the mean is defined implicitly via solving a matrix equation. In cases where the mean is known explicitly, the mean may be added after convergence of the stochastic iteration with zero mean, giving an algorithm with faster convergence since the covariance matrix converges with factor $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2}<\varrho(\mathbf{M}^{-1}\mathbf{N})$ (this was also noted by Barone et al., 2002). Convergence in variance for non-normal targets was considered in Fox and Parker (2014).

Using Theorems 1 and 2, and Corollary 3, we can draw on the vast literature in numerical linear algebra on stationary linear iterative methods to find random iterations that are computationally efficient and provably convergent in distribution with the desired mean and covariance. In particular, results in Amit and Grenander (1991), Barone and Frigessi (1990), Galli and Gao (2001), Roberts and Sahu (1997), and Liu et al. (1995) are special cases of the general theory of matrix splittings presented here.

4.2 Sampling from normal distributions using matrix splittings

We now restrict attention to the case of normal target distributions.

Corollary 4

If in Theorem 2 we set $\pi=\operatorname{N}({\boldsymbol{\nu}},\mathbf{V})$, for some non-zero covariance matrix $\mathbf{V}$, then, whatever the starting state $\mathbf{y}^{(0)}$, the following are equivalent: (i) $\mathbf{V}=\mathbf{M}^{T}+\mathbf{N}$; (ii) $\mathbf{y}^{(k)}\overset{\mathcal{D}}{\to}\operatorname{N}({\boldsymbol{\mu}},\mathbf{A}^{-1})$ where ${\boldsymbol{\mu}}=\mathbf{A}^{-1}{\boldsymbol{\nu}}$.

Proof. Since $\pi$ is normal, $\Pi$ in Theorem 2 is normal. Since a normal distribution is sufficiently described by its first two moments, the corollary follows. □

Using Corollary 4, we found normal sampling algorithms corresponding to some common stationary linear solvers. The results are given in Table 2.

Table 2: Some generalized Gibbs samplers for drawing from $\operatorname{N}\left(\mathbf{0},\mathbf{A}^{-1}\right)$, adapted from common stationary linear solvers. Each Gibbs iteration requires sampling the noise vector $\mathbf{c}^{(k)}\sim\operatorname{N}\left(\mathbf{0},\mathbf{M}^{T}+\mathbf{N}\right)$

  Sampler               | $\mathbf{M}$                                                                               | $\operatorname{Var}\left(\mathbf{c}^{(k)}\right)=\mathbf{M}^{T}+\mathbf{N}$
  Richardson            | $\frac{1}{\omega}\mathbf{I}$                                                               | $\frac{2}{\omega}\mathbf{I}-\mathbf{A}$
  Jacobi                | $\mathbf{D}$                                                                               | $2\mathbf{D}-\mathbf{A}$
  Gibbs (Gauss-Seidel)  | $\mathbf{D}+\mathbf{L}$                                                                    | $\mathbf{D}$
  SOR                   | $\frac{1}{\omega}\mathbf{D}+\mathbf{L}$                                                    | $\frac{2-\omega}{\omega}\mathbf{D}$
  SSOR (REGS)           | $\frac{\omega}{2-\omega}\mathbf{M}_{\text{SOR}}\mathbf{D}^{-1}\mathbf{M}_{\text{SOR}}^{T}$ | $\frac{\omega}{2-\omega}\left(\mathbf{M}_{\text{SOR}}\mathbf{D}^{-1}\mathbf{M}_{\text{SOR}}^{T}+\mathbf{N}_{\text{SOR}}^{T}\mathbf{D}^{-1}\mathbf{N}_{\text{SOR}}\right)$

A sampler corresponding to a convergent splitting is implemented in Algorithm 5.

input: SPD precision matrix $\mathbf{A}$; $\mathbf{M}$ and $\mathbf{N}$ defining a convergent splitting of $\mathbf{A}$; number of steps $k_{\max}$; initial state $\mathbf{y}^{(0)}$
output: $\mathbf{y}^{(k)}$ approximately distributed as $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$
for $k=0,\ldots,k_{\max}$ do
  sample $\mathbf{c}^{(k)}\sim\operatorname{N}(\mathbf{0},\mathbf{M}^{T}+\mathbf{N})$;
  $\mathbf{y}^{(k+1)}=\mathbf{M}^{-1}(\mathbf{N}\mathbf{y}^{(k)}+\mathbf{c}^{(k)})$
end
Algorithm 5: Stationary sampler of $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$
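
A dense-matrix NumPy sketch of Algorithm 5; it factors the noise covariance M^T + N once up front, which is convenient for small examples but, as discussed below, is in general the expensive step. With the Gauss-Seidel splitting this reproduces the usual Gibbs sampler.

```python
import numpy as np

def stationary_sampler(A, M, N, k_max, rng, y0=None):
    """Stationary sampler targeting N(0, A^{-1}) for a convergent splitting A = M - N."""
    n = A.shape[0]
    y = np.zeros(n) if y0 is None else y0.copy()
    C = np.linalg.cholesky(M.T + N)         # Var(c) = M^T + N is SPD for a convergent splitting
    for _ in range(k_max):
        c = C @ rng.standard_normal(n)      # c^{(k)} ~ N(0, M^T + N)
        y = np.linalg.solve(M, N @ y + c)   # y^{(k+1)} = M^{-1}(N y^{(k)} + c^{(k)})
    return y
```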

The assignment $\mathbf{y}^{(k+1)}=\mathbf{M}^{-1}(\mathbf{N}\mathbf{y}^{(k)}+\mathbf{c}^{(k)})$ in Algorithm 5 can be replaced by the slightly more expensive steps $\mathbf{r}^{(k)}=\mathbf{c}^{(k)}-\mathbf{A}\mathbf{y}^{(k)}$ and $\mathbf{y}^{(k+1)}=\mathbf{y}^{(k)}+\mathbf{M}^{-1}\mathbf{r}^{(k)}$, which allows monitoring of the residual and emphasizes the equivalence with the stationary linear solver in Algorithm 4. Even though convergence may not be diagnosed by a decreasing norm of the residual, lack of convergence can be diagnosed when the residual diverges in magnitude. In practice, the effective convergence factor for a sampler may be calculated by solving the linear system (2) (perhaps with a random right hand side) using the iterative solver derived from the splitting, and monitoring the decrease in error to evaluate the asymptotic average convergence factor using equation (6). By Corollary 3, this estimates the convergence factor for the sampler.

The practicality of a sampler derived from a convergent splitting depends on the ease with which one can solve $\mathbf{M}\mathbf{y}=\mathbf{r}$ for any $\mathbf{r}$ (as for the stationary linear solver), and also on the ease of drawing iid noise vectors from $\operatorname{N}(\mathbf{0},\mathbf{M}^{T}+\mathbf{N})$. Sampling the noise vector is simple when a matrix square root, such as the Cholesky factorization, of $\mathbf{M}^{T}+\mathbf{N}$ is cheaply available. Thus, a sampler is at least as expensive as the corresponding linear solver since, in addition to the operations in each iteration, the sampler must factor the $n\times n$ matrix $\operatorname{Var}(\mathbf{c}^{(k)})=\mathbf{M}^{T}+\mathbf{N}$. For the samplers listed in Table 2 it is interesting that the simpler the splitting, the more complicated is the variance of the noise. Neither the Richardson nor the Jacobi splitting gives a useful sampling algorithm, since the difficulty of sampling the noise vector is no less than the original task of sampling from $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$. The Gauss-Seidel splitting, giving the usual Gibbs sampler, is at a kind of sweet spot, where solving equations in $\mathbf{M}$ is simple while the required noise variance is diagonal, so posing a simple sampling problem.

The SOR stationary sampler uses the SOR splitting $\mathbf{M}_{\text{SOR}}$ and $\mathbf{N}_{\text{SOR}}$ in (7) for $0<\omega<2$, and the noise vector $\mathbf{c}^{(k)}\sim\operatorname{N}(\mathbf{0},\mathbf{M}_{\text{SOR}}^{T}+\mathbf{N}_{\text{SOR}}=\frac{2-\omega}{\omega}\mathbf{D})$ (Table 2). This sampler was introduced by Adler (1981), rediscovered by Barone and Frigessi (1990), and has been studied extensively (Barone et al., 2002; Galli and Gao, 2001; Liu et al., 1995; Neal, 1998; Roberts and Sahu, 1997). For $\omega=1$, the SOR sampler is a Gibbs (Gauss-Seidel) sampler. For values of $\omega$ such that $\varrho(\mathbf{M}_{\text{SOR}}^{-1}\mathbf{N}_{\text{SOR}})<\varrho(\mathbf{M}^{-1}_{\text{GS}}\mathbf{N}_{\text{GS}})$, the SOR sampler is an accelerated Gibbs sampler. As for the linear solver, implementation of the Gibbs and SOR samplers by Algorithm 5 requires multiplication by the upper triangular $\mathbf{N}$ and forward substitution with respect to $\mathbf{M}$, at a cost of $2n^{2}$ flops. In addition, these samplers must take the square root of the diagonal matrix $\frac{2-\omega}{\omega}\mathbf{D}$, at a mere cost of $\mathcal{O}(n)$ flops.

Implementation of an SSOR sampler instead of a Gibbs or SOR sampler is advantageous since the Markov chain $\{\mathbf{y}^{(k)}\}$ is reversible (Roberts and Sahu, 1997). SSOR sampling uses the symmetric-SOR splitting $\mathbf{M}_{\text{SSOR}}$ and $\mathbf{N}_{\text{SSOR}}$ in (8). The SSOR stationary sampler is most easily implemented by forward and backward SOR sampling sweeps as in Algorithm 6, so the matrices $\mathbf{M}_{\text{SSOR}}$ and $\mathbf{N}_{\text{SSOR}}$ need never be explicitly formed.

input: the SOR splitting $\mathbf{M},\mathbf{N}$ of $\mathbf{A}$; relaxation parameter $\omega$; initial state $\mathbf{y}$; $k_{\max}$
output: $\mathbf{y}$ approximately distributed as $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$
set $\gamma=\left(\frac{2}{\omega}-1\right)^{1/2}$;
for $k=1,\ldots,k_{\max}$ do
  sample $\mathbf{z}\sim\operatorname{N}(\mathbf{0},\mathbf{I})$;
  $\mathbf{x}:=\mathbf{M}^{-1}(\mathbf{N}\mathbf{y}+\gamma\mathbf{D}^{1/2}\mathbf{z})$;
  sample $\mathbf{z}\sim\operatorname{N}(\mathbf{0},\mathbf{I})$;
  $\mathbf{y}:=\mathbf{M}^{-T}(\mathbf{N}^{T}\mathbf{x}+\gamma\mathbf{D}^{1/2}\mathbf{z})$
end
Algorithm 6: SSOR sampling from $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$
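
A NumPy sketch of Algorithm 6, with the forward and backward SOR sweeps written as dense triangular solves (a sparse implementation would use the corresponding sparse sweeps):

```python
import numpy as np
from scipy.linalg import solve_triangular

def ssor_sampler(A, omega, k_max, rng, y0=None):
    """SSOR (forward then backward SOR) sampler targeting N(0, A^{-1}), 0 < omega < 2."""
    n = A.shape[0]
    d = np.diag(A)
    M = np.diag(d) / omega + np.tril(A, k=-1)   # SOR splitting: M = D/omega + L
    N = M - A                                   # N = ((1 - omega)/omega) D - L^T
    gamma = np.sqrt(2.0 / omega - 1.0)
    sqrt_d = np.sqrt(d)
    y = np.zeros(n) if y0 is None else y0.copy()
    for _ in range(k_max):
        z = rng.standard_normal(n)
        x = solve_triangular(M, N @ y + gamma * sqrt_d * z, lower=True)        # forward sweep
        z = rng.standard_normal(n)
        y = solve_triangular(M.T, N.T @ x + gamma * sqrt_d * z, lower=False)   # backward sweep
    return y
```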

We first encountered restricted versions of Corollary 4 for normal distributions in Amit and Grenander (1991) and in Barone and Frigessi (1990) where geometric convergence of the covariance matrices was established for the Gauss-Seidel and SOR splittings. These and the SSOR splitting were investigated in Roberts and Sahu (1997) (who labelled the sampler REGS).

Corollary 4 and Table 1 show that the Gibbs, SOR and SSOR samplers converge for any SPD precision matrix $\mathbf{A}$. This summarizes results in Barone and Frigessi (1990) and Galli and Gao (2001), and the deterministic sweeps investigated in Amit and Grenander (1991), Roberts and Sahu (1997), and Liu et al. (1995). Corollary 4 generalizes these results to any matrix splitting $\mathbf{A}=\mathbf{M}-\mathbf{N}$ by guaranteeing convergence of the random iterates (10) to $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$ with convergence factor $\varrho(\mathbf{M}^{-1}\mathbf{N})$ (or $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2}$ if ${\boldsymbol{\mu}}=\mathbf{0}$).

5 Non-stationary iterative methods

5.1 Acceleration of linear solvers by polynomials

A common scheme in numerical linear algebra for accelerating a stationary method when $\mathbf{M}$ and $\mathbf{N}$ are symmetric is through the use of polynomial preconditioners (Axelsson, 1996; Golub and Loan, 1989; Saad, 2003). Equation (5) shows that after $k$ steps the error in the stationary method is a $k^{\text{th}}$ order polynomial of the matrix $\mathbf{I}-\mathbf{G}=\mathbf{M}^{-1}\mathbf{A}$. The idea behind polynomial acceleration is to implicitly implement a different $k^{\text{th}}$ order polynomial $P_{k}(\mathbf{M}^{-1}\mathbf{A})$ such that $\varrho(P_{k}(\mathbf{M}^{-1}\mathbf{A}))<\varrho(\left(\mathbf{I}-\mathbf{M}^{-1}\mathbf{A}\right)^{k})$. The coefficients of $P_{k}(\mathbf{M}^{-1}\mathbf{A})$ are functions of a set of acceleration parameters $\{\{\alpha_{k}\},\{\tau_{k}\}\}$, introduced by the second order iteration

$$\mathbf{x}^{(k+1)}=(1-\alpha_{k})\mathbf{x}^{(k-1)}+\alpha_{k}\mathbf{x}^{(k)}+\alpha_{k}\tau_{k}\mathbf{M}^{-1}(\mathbf{b}-\mathbf{A}\mathbf{x}^{(k)}). \qquad (11)$$

At the first step, $\alpha_{0}=1$ and $\mathbf{x}^{(1)}=\mathbf{x}^{(0)}+\tau_{0}\mathbf{M}^{-1}(\mathbf{b}-\mathbf{A}\mathbf{x}^{(0)})$. Setting $\alpha_{k}=\tau_{k}=1$ for every $k$ yields the basic un-accelerated stationary method. The accelerated iteration in (11) is implemented at a negligible increase in cost of $\mathcal{O}(n)$ flops per iteration (due to scalar-vector multiplication and vector addition) over the corresponding stationary solver (3).

It can be shown (e.g., Axelsson, 1996) that the $(k+1)^{\text{st}}$ order polynomial $P_{k+1}$ generated recursively by the second order non-stationary linear solver (11) is

$$P_{k+1}\left(\lambda\right)=\left(\alpha_{k}-\alpha_{k}\tau_{k}\lambda\right)P_{k}\left(\lambda\right)+\left(1-\alpha_{k}\right)P_{k-1}\left(\lambda\right). \qquad (12)$$

This polynomial acts on the error $\mathbf{e}^{(k)}=\mathbf{x}^{(k)}-\mathbf{A}^{-1}\mathbf{b}$ by $\mathbf{e}^{(k+1)}=P_{k}(\mathbf{M}^{-1}\mathbf{A})\mathbf{e}^{(0)}$, which can be compared directly to (5).

When estimates of the extreme eigenvalues $\lambda_{\min}$ and $\lambda_{\max}$ of $\mathbf{I}-\mathbf{G}=\mathbf{M}^{-1}\mathbf{A}$ are available ($\lambda_{\min}$ and $\lambda_{\max}$ are real when $\mathbf{M}$ and $\mathbf{N}$ are symmetric), then the coefficients $\{\tau_{k},\alpha_{k}\}$ can be chosen to generate the scaled Chebyshev polynomials $\{Q_{k}\}$, which give optimal error reduction at every step. The Chebyshev acceleration parameters are

$$\tau_{k}=\frac{2}{\lambda_{\max}+\lambda_{\min}},\quad\beta_{k}=\left(\frac{1}{\tau_{k}}-\beta_{k-1}\left(\frac{\lambda_{\max}-\lambda_{\min}}{4}\right)^{2}\right)^{-1},\quad\alpha_{k}=\frac{\beta_{k}}{\tau_{k}}, \qquad (13)$$

where $\alpha_{0}=1$ and $\beta_{0}=\tau_{0}$ (Axelsson, 1996). Note that these parameters are independent of the iterates $\{\mathbf{x}^{(k)}\}$. Since $\mathbf{M}$ is required to be symmetric, applying Chebyshev acceleration to SSOR is a common pairing; its effectiveness as a linear solver is shown later in Table 3.
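
A NumPy sketch of the Chebyshev accelerated iteration (11) with the parameter recursion (13); lam_min and lam_max are assumed to be (estimates of) the extreme eigenvalues of M^{-1}A, and with M taken as the SSOR splitting matrix of (8) this is the Chebyshev-SSOR solver.

```python
import numpy as np

def chebyshev_solve(A, b, M, lam_min, lam_max, k_max, x0=None):
    """Second-order Chebyshev accelerated iteration (11) for A x = b, splitting A = M - N."""
    n = A.shape[0]
    x_old = np.zeros(n) if x0 is None else x0.copy()
    tau = 2.0 / (lam_max + lam_min)                        # tau_k is constant
    beta = tau                                             # beta_0 = tau_0
    # First step: alpha_0 = 1
    x = x_old + tau * np.linalg.solve(M, b - A @ x_old)
    for _ in range(1, k_max):
        beta = 1.0 / (1.0 / tau - beta * ((lam_max - lam_min) / 4.0) ** 2)
        alpha = beta / tau
        x_new = (1 - alpha) * x_old + alpha * x \
                + alpha * tau * np.linalg.solve(M, b - A @ x)
        x_old, x = x, x_new
    return x
```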

Whereas the stationary methods converge with asymptotic average convergence factor $\varrho(\mathbf{M}^{-1}\mathbf{N})$, the convergence factor for the Chebyshev method depends on $\operatorname{cond}(\mathbf{M}^{-1}\mathbf{A})=\lambda_{\max}/\lambda_{\min}$. Specifically, the scaled Chebyshev polynomial $Q_{k}(\lambda)$ minimizes $\max_{\lambda\in[\lambda_{\min},\lambda_{\max}]}|P_{k}(\lambda)|$ over all $k^{\text{th}}$ order polynomials $P_{k}$, with

$$\max_{\lambda\in[\lambda_{\min},\lambda_{\max}]}|Q_{k}(\lambda)|=\frac{2\sigma^{k}}{1+\sigma^{2k}}. \qquad (14)$$

Since the error at the $k^{\text{th}}$ step of a Chebyshev accelerated linear solver is $\mathbf{e}^{(k+1)}=Q_{k}(\mathbf{M}^{-1}\mathbf{A})\mathbf{e}^{(0)}$, the asymptotic convergence factor is bounded above by

$$\sigma=\frac{1-\sqrt{\lambda_{\min}/\lambda_{\max}}}{1+\sqrt{\lambda_{\min}/\lambda_{\max}}} \qquad (15)$$

(Axelsson, 1996, p. 181). Since $\sigma\in[0,1)$, the polynomial accelerated scheme is guaranteed to converge even if the original splitting was not convergent. Further, the convergence factor of the stationary iterative solver is bounded below by $\rho=\frac{1-\lambda_{\min}/\lambda_{\max}}{1+\lambda_{\min}/\lambda_{\max}}$ (see, e.g., Axelsson, 1996, Thm 5.9). Since $\sigma<\rho$ (except when $\lambda_{\min}=\lambda_{\max}$, in which case $\sigma=0$), polynomial acceleration always reduces the convergence factor, which justifies the term acceleration. The Chebyshev accelerated iteration (11) is amenable to preconditioning that reduces the condition number, and hence reduces $\sigma$, such as incomplete Cholesky factorization or graphical methods (Axelsson, 1996; Saad, 2003). Axelsson also shows that after

$$k^{*}=\left\lceil\frac{\ln(\varepsilon/2)}{\ln\sigma}\right\rceil \qquad (16)$$

iterations of the Chebyshev solver, the error reduction is $\|\mathbf{e}^{(k^{*})}\|_{\mathbf{A}^{\nu}}/\|\mathbf{e}^{(0)}\|_{\mathbf{A}^{\nu}}\leq\varepsilon$ for some real number $\nu$ and any $0<\varepsilon<1$ (Axelsson, 1996, eqn 5.32).
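
For example, with $\sigma=0.9$ and $\varepsilon=10^{-6}$ (values chosen only for illustration), (16) gives $k^{*}=138$ iterations; a one-line check:

```python
import math

sigma, eps = 0.9, 1e-6
k_star = math.ceil(math.log(eps / 2) / math.log(sigma))   # k* = 138
```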

5.2 Acceleration of Gibbs sampling by polynomials

Any acceleration scheme devised for a stationary linear solver is a candidate for accelerating convergence of a Gibbs sampler. For example, consider the second order stochastic iteration

$$\mathbf{y}^{(k+1)}=(1-\alpha_{k})\mathbf{y}^{(k-1)}+\alpha_{k}\mathbf{y}^{(k)}+\alpha_{k}\tau_{k}\mathbf{M}^{-1}(\mathbf{c}^{(k)}-\mathbf{A}\mathbf{y}^{(k)}) \qquad (17)$$

analogous to the linear solver in (11), but now the vector $\mathbf{b}$ has been replaced by a random vector $\mathbf{c}^{(k)}$. The equivalence between polynomial accelerated linear solvers and polynomial accelerated samplers is made clear in the next three theorems.

Theorem 5

Let $\mathbf{A}$ be SPD and $\mathbf{A}=\mathbf{M}-\mathbf{N}$ be a symmetric splitting. Consider a set of independent noise vectors $\{\mathbf{c}^{(k)}\}$ with moments

$$\operatorname{E}(\mathbf{c}^{(k)})={\boldsymbol{\nu}}\quad\text{and}\quad\operatorname{Var}(\mathbf{c}^{(k)})=a_{k}\mathbf{M}+b_{k}\mathbf{N}$$

such that $a_{k}:=\frac{2-\tau_{k}}{\tau_{k}}+\left(b_{k}-1\right)\left(\frac{1}{\tau_{k}}+\frac{1}{\kappa_{k}}-1\right)$, $b_{k}:=\frac{2(1-\alpha_{k})}{\alpha_{k}}\left(\frac{\kappa_{k}}{\tau_{k}}\right)+1$, $\kappa_{k+1}:=\alpha_{k}\tau_{k}+(1-\alpha_{k})\kappa_{k}$, and $\kappa_{1}=\tau_{0}$. If the polynomial accelerated linear solver (11) converges to $\mathbf{A}^{-1}\mathbf{b}$ with a set of parameters $\{\{\alpha_{k}\},\{\tau_{k}\}\}$ that are independent of $\{\mathbf{x}^{(k)}\}$, then the polynomial accelerated stochastic iteration (17) converges in distribution to a distribution $\Pi$ with mean $\mathbf{A}^{-1}{\boldsymbol{\nu}}$ and covariance matrix $\mathbf{A}^{-1}$. Furthermore, if the $\{\mathbf{c}^{(k)}\}$ are normal, then

$$\mathbf{y}^{(k)}\overset{\mathcal{D}}{\to}\operatorname{N}({\boldsymbol{\mu}}=\mathbf{A}^{-1}{\boldsymbol{\nu}},\mathbf{A}^{-1}).$$

Proof. Appendix A.2. □

Given a second order linear solver (11) that converges, Theorem 5 makes clear how to construct a second order sampler (17) that is guaranteed to converge. The next Corollary shows that the polynomial $P_{k}$ that acts on the linear solver error $\mathbf{x}^{(k)}-\mathbf{A}^{-1}\mathbf{b}$ is the same polynomial that acts on the errors in the first and second moments of the sampler, $\operatorname{E}(\mathbf{y}^{(k)})-\mathbf{A}^{-1}\boldsymbol{\nu}$ and $\operatorname{Var}(\mathbf{y}^{(k)})-\mathbf{A}^{-1}$ respectively. In other words, the convergence factors for a polynomial accelerated solver and sampler are the same.
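As a concrete illustration of Theorem 5, the following Python sketch (our illustration, not the paper's implementation) runs the second order stochastic iteration (17) with the noise moments of the theorem. The parameter sequences $\{\alpha_{k}\},\{\tau_{k}\}$ are assumed to be supplied by a convergent second order solver, e.g. the Chebyshev recursion (13); drawing $\mathbf{c}^{(k)}$ from a dense Cholesky factor of $a_{k}\mathbf{M}+b_{k}\mathbf{N}$ is for small examples only and ignores the efficient structure exploited by the SSOR implementation.

```python
import numpy as np

def second_order_sampler(A, M, y0, nu, alphas, taus, rng):
    """Stochastic iteration (17) with Var(c^(k)) = a_k M + b_k N from Theorem 5.
    alphas, taus: solver parameters (alpha_0 = 1 for Chebyshev, as in the paper)."""
    N = M - A                                   # splitting A = M - N
    y_prev, y = y0.copy(), y0.copy()
    kappa = taus[0]                             # kappa_1 = tau_0; unused while alpha_0 = 1
    for alpha, tau in zip(alphas, taus):
        b = 2 * (1 - alpha) / alpha * (kappa / tau) + 1
        a = (2 - tau) / tau + (b - 1) * (1 / tau + 1 / kappa - 1)
        cov_c = a * M + b * N                   # assumed positive definite here
        c = nu + np.linalg.cholesky(cov_c) @ rng.standard_normal(len(nu))
        y_next = ((1 - alpha) * y_prev + alpha * y
                  + alpha * tau * np.linalg.solve(M, c - A @ y))   # iteration (17)
        y_prev, y = y, y_next
        kappa = alpha * tau + (1 - alpha) * kappa   # kappa_{k+1}
    return y
```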

Corollary 6

Suppose that the polynomial accelerated linear solver (11) converges with asymptotic convergence factor $\sigma=\lim_{k\to\infty}\left(\max_{\lambda}|P_{k}(\lambda)|\right)^{1/k}$, where $P_{k}$ is the $k$th order polynomial recursively generated by (12). Then under the conditions of Theorem 5,

\operatorname{E}\left(\mathbf{y}^{(k+1)}\right)=\mathbf{A}^{-1}\boldsymbol{\nu}+P_{k}(\mathbf{M}^{-1}\mathbf{A})\left(\operatorname{E}(\mathbf{y}^{(0)})-\mathbf{A}^{-1}\boldsymbol{\nu}\right)\to\mathbf{A}^{-1}\boldsymbol{\nu}

with asymptotic convergence factor $\sigma$, and

\operatorname{Var}\left(\mathbf{y}^{(k)}\right)=\mathbf{A}^{-1}-P_{k}(\mathbf{M}^{-1}\mathbf{A})\left(\mathbf{A}^{-1}-\operatorname{Var}(\mathbf{y}^{(0)})\right)\left(P_{k}(\mathbf{M}^{-1}\mathbf{A})\right)^{T}\to\mathbf{A}^{-1}

with asymptotic convergence factor $\sigma^{2}$.

Proof. Appendix A.2. \Box

Corollary 6 allows a direct comparison of the convergence factor for a polynomial accelerated sampler ($\sigma$, or $\sigma^{2}$ if $\boldsymbol{\mu}=\mathbf{0}$) to the convergence factor given previously for the corresponding un-accelerated stationary sampler ($\varrho(\mathbf{M}^{-1}\mathbf{N})$, or $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2}$ if $\boldsymbol{\mu}=\mathbf{0}$). In particular, given a second order linear solver with accelerated convergence compared to the corresponding stationary iteration, the corollary guarantees that the second order Gibbs sampler (17) will converge faster than the stationary Gibbs sampler (10).

Just as Chebyshev polynomials are guaranteed to accelerate linear solvers, Corollary 6 assures that Chebyshev polynomials can also accelerate a Gibbs sampler. Using Theorem 5, we derived the Chebyshev accelerated SSOR sampler (Fox and Parker, 2014) by iteratively updating parameters via (13) and then generating a sampler via (17). Explicit implementation details of the Chebyshev accelerated sampler are provided in the supplementary materials. The polynomial accelerated sampler is implemented at a negligible increase in cost of $\mathcal{O}(n)$ flops per iteration over the cost ($4n^{2}$ flops) of the SSOR sampler (Algorithm 6). The asymptotic convergence factor is given by the next Corollary, which follows from Corollary 6 and equation (15).
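The parameter sequences themselves are cheap to generate. The sketch below is ours and uses the standard Chebyshev recursion (e.g., Axelsson, 1996; Saad, 2003) written in the second order form (11)/(17), given eigenvalue bounds $\lambda_{\min},\lambda_{\max}$ of $\mathbf{M}^{-1}\mathbf{A}$; the paper's recursion (13) and the supplementary algorithm may be parameterized differently. These sequences can be fed to the second_order_sampler sketch given after Theorem 5.

```python
def chebyshev_parameters(lam_min, lam_max, n_iter):
    """Standard Chebyshev parameters for the second-order form
    y^(k+1) = (1 - alpha_k) y^(k-1) + alpha_k y^(k) + alpha_k tau_k M^{-1} r^(k)."""
    theta = (lam_max + lam_min) / 2.0
    delta = (lam_max - lam_min) / 2.0
    alphas, taus = [1.0], [1.0 / theta]              # first step: alpha_0 = 1, tau_0 = 1/theta
    rho_prev = delta / theta
    for _ in range(1, n_iter):
        rho = 1.0 / (2.0 * theta / delta - rho_prev)
        alpha = 1.0 + rho * rho_prev                 # momentum-like weight alpha_k
        alphas.append(alpha)
        taus.append(2.0 * rho / (delta * alpha))     # relaxation parameter tau_k
        rho_prev = rho
    return alphas, taus
```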

Corollary 7

If the Chebyshev accelerated linear solver converges, then the mean $\operatorname{E}(\mathbf{y}^{(k)})$ of the corresponding Chebyshev accelerated stochastic iteration (17) converges to $\boldsymbol{\mu}=\mathbf{A}^{-1}\boldsymbol{\nu}$ with asymptotic convergence factor $\left(\frac{1-\sqrt{\lambda_{\min}/\lambda_{\max}}}{1+\sqrt{\lambda_{\min}/\lambda_{\max}}}\right)$, and the covariance matrix $\operatorname{Var}(\mathbf{y}^{(k)})$ converges to $\mathbf{A}^{-1}$ with asymptotic convergence factor $\left(\frac{1-\sqrt{\lambda_{\min}/\lambda_{\max}}}{1+\sqrt{\lambda_{\min}/\lambda_{\max}}}\right)^{2}$.

Corollary 7 and (14) show that a Chebyshev accelerated normal sampler is guaranteed to converge faster than any other acceleration scheme that has the parameters $\{\{\tau_{k},\alpha_{k}\}\}$ independent of the iterates $\{\mathbf{y}^{(k)}\}$. This result also shows that the preconditioning ideas presented in section 5.1 to reduce ${\rm cond}(\mathbf{M}^{-1}\mathbf{A})=\lambda_{\max}/\lambda_{\min}$ can also be used to speed up Chebyshev accelerated samplers. We do not investigate such preconditioning here.

Corollary 7 and equation (16) suggest that, for any $\varepsilon>0$, after $k^{*}$ iterations the Chebyshev error reduction for the mean is smaller than $\varepsilon$. But even sooner, after $k^{**}=k^{*}/2$ iterations, the Chebyshev error reduction for the variance is predicted to be smaller than $\varepsilon$ (Fox and Parker, 2014).


Figure 2: Left panel: Location of non-zero elements in the $100\times 100$ precision matrix $\mathbf{A}$. Right panel: Relative error in covariance $||\mathbf{A}^{-1}-\mathbf{S}^{(k)}_{y}||_{2}/||\mathbf{A}^{-1}||_{2}$ versus number of floating point operations (flops) for a sampler implemented with SSOR and $\omega=1$, SSOR with optimal relaxation $\omega=1.6641$, and SSOR with Chebyshev acceleration. Also shown is the relative error and flop count for a sample drawn using Cholesky factoring.

6 Computed Examples

The iterative sampling algorithms we have investigated are designed for problems where operating by the precision matrix is cheap. A common such case is when the precision matrix is sparse, as occurs when modeling a GMRF with a local neighbourhood structure. Then, typically, the precision matrix has $\mathcal{O}(n)$ non-zero elements, so direct matrix-vector multiplication has $\mathcal{O}(n)$ cost. We give two examples of sampling using sparse precision matrices: first, we present a small $n=100$ example where complete diagnostics can be computed for evaluating the quality of convergence; and second, we present an $n=10^{6}$ Bayesian linear inverse problem that demonstrates computational feasibility for large problems. The samplers are initialized with $\mathbf{y}^{(0)}=\mathbf{0}$ in both examples.

6.1 A $10\times 10$ lattice example ($n=100$)

A first order locally linear sparse precision matrix $\mathbf{A}$, considered by Higdon (2006) and Rue and Held (2005), is

[\mathbf{A}]_{ij}=10^{-4}\delta_{ij}+\begin{cases}n_{i}&{\rm if}~~i=j\\ -1&{\rm if}~~i\neq j~~{\rm and}~~||s_{i}-s_{j}||_{2}\leq 1\\ 0&{\rm otherwise.}\end{cases}

The discrete points $\{s_{i}\}$ are on a regular $10\times 10$ lattice ($n=100$) over the two dimensional domain ${\cal S}=[1,10]\times[1,10]$. Thus $\mathbf{A}$ is $100\times 100$, with $||\mathbf{A}||_{2}=7.8$ and $||\boldsymbol{\Sigma}||_{2}=||\mathbf{A}^{-1}||_{2}=10^{4}$. The sparsity of $\mathbf{A}$ is shown in the left panel of Figure 2. The scalar $n_{i}$ is the number of points neighbouring $s_{i}$, i.e., at distance $1$ from $s_{i}$. Although $n^{2}=10^{4}$, the number of non-zero elements of $\mathbf{A}$ is $\mathcal{O}(n)$ ($460$ in this example). Since the bandwidth of $\mathbf{A}$ is $\mathcal{O}(n^{1/2})$, a Cholesky factorization costs $\mathcal{O}(n^{2})$ flops (Rue, 2001) and each iteration of an iterative method costs $\mathcal{O}(n)$ flops.
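A sparse construction of this precision matrix is straightforward; the following Python sketch (ours, using scipy for illustration) reproduces the quantities quoted above.

```python
import numpy as np
import scipy.sparse as sp

m = 10                                            # 10 x 10 lattice, n = 100
xs, ys = np.meshgrid(np.arange(1, m + 1), np.arange(1, m + 1), indexing="ij")
s = np.column_stack([xs.ravel(), ys.ravel()])     # lattice points s_i
n = s.shape[0]

rows, cols, vals = [], [], []
for i in range(n):
    dist = np.linalg.norm(s - s[i], axis=1)
    nbrs = np.where((dist <= 1) & (np.arange(n) != i))[0]
    rows += [i] * (len(nbrs) + 1)
    cols += [i] + nbrs.tolist()
    vals += [1e-4 + len(nbrs)] + [-1.0] * len(nbrs)   # n_i on the diagonal, -1 for neighbours
A = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

print(A.nnz)                                      # 460 non-zeros, as reported
print(np.linalg.norm(A.toarray(), 2))             # ||A||_2 ~ 7.8
```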

To provide a comparison between linear solvers and samplers, we solved the system $\mathbf{Ax}=\mathbf{b}$ using linear solvers with different matrix splittings (Table 1), where $\mathbf{b}$ is fixed and non-zero, all initialized with $\mathbf{x}^{(0)}=\mathbf{0}$. The results are given in Table 3. The Richardson method does not converge (DNC) since the spectral radius of the iteration operator is greater than 1. The SOR iteration was run at the optimal relaxation parameter value of $\omega=1.9852$. SSOR was run at its optimal value of $\omega=1.6641$. Chebyshev accelerated SSOR (Cheby-SSOR), CG accelerated SSOR (CG-SSOR) (both run with $\omega=1.6641$) and CG utilize a different implicit operator for each iteration, and so the spectral radius given in these cases is the geometric mean spectral radius of these operators (estimated using (5)). Even for this small example, Chebyshev acceleration reduces the computational effort required for convergence by about two orders of magnitude, while CG acceleration reduces work by nearly two more orders of magnitude.

Table 3: The number of iterations and the total number of floating point operations performed by some common stationary and accelerated linear solvers, and the Cholesky factorization, used to solve $\mathbf{Ax}=\mathbf{b}$ for fixed non-zero $\mathbf{b}$. Each solver was run until the residual became sufficiently small, $||\mathbf{b}-\mathbf{A}\mathbf{x}^{(k)}||_{2}<10^{-8}$. Details in section 6.

solver         ω        ϱ(M⁻¹N)    number of iterations    flops
Richardson     1        6.8        DNC                     -
Jacobi         -        .999972    4.01×10^5               5.69×10^7
Gauss-Seidel   -        .999944    2.44×10^5               4.34×10^8
SSOR           1.6641   .999724    6.7×10^4                2.39×10^8
SOR            1.9852   .985210    1655                    2.95×10^6
Cheby-SSOR     1        .9786      958                     3.41×10^6
Cheby-SSOR     1.6641   .9673      622                     2.21×10^6
CG             -        .6375      48                      9.22×10^4
CG-SSOR        1.6641   .4471      29                      6.66×10^4
Cholesky       -        -          -                       1.35×10^4

We investigated the following Gibbs samplers: SOR, SSOR, and the Chebyshev accelerated SSOR. These samplers are guaranteed to converge since the corresponding solvers converge (Theorem 1). Since the convergence factor for a sampler is equal to the convergence factor for the corresponding solver (Corollaries 3 and 6), Gibbs samplers implemented with any of the matrix splittings in Table 1 exhibit the same convergence behavior as shown for the linear solvers in Table 3. Convergence of the sample covariance $\mathbf{S}^{(k)}_{y}\approx\operatorname{Var}(\mathbf{y}^{(k)})\to\mathbf{A}^{-1}$, calculated using $10^{4}$ samples, is shown in the right panel of Figure 2, which displays the relative error $||\mathbf{A}^{-1}-\mathbf{S}^{(k)}_{y}||_{2}/||\mathbf{A}^{-1}||_{2}$ as a function of the flop count. Each sampler iteration costs about $2.24\times 10^{3}$ flops. This performance is compared to samples constructed by a Cholesky factorization, which cost $1.34\times 10^{4}$ flops (depicted as the green vertical line in the right panel of Figure 2). Since the sample means were uniformly close to zero, error in the mean is not shown.
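For illustration, the following Python sketch (ours) runs a matrix-splitting Gibbs sampler of the form (10) on the lattice precision matrix $\mathbf{A}$ constructed in the sketch above and computes the covariance diagnostic plotted in Figure 2. For brevity it uses the Gauss-Seidel splitting $\mathbf{M}=\mathbf{D}+\mathbf{L}$, for which the required noise covariance $\mathbf{M}^{T}+\mathbf{N}$ (Theorem 2, Appendix A.1) reduces to the diagonal $\mathbf{D}$, rather than the SOR/SSOR splittings reported in the figure.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

def gauss_seidel_sampler(A, n_sweeps, n_chains, rng):
    """Stationary sampler y^(k+1) = M^{-1}(N y^(k) + c^(k)) with c^(k) ~ N(0, D)."""
    n = A.shape[0]
    D = A.diagonal()
    M = sp.tril(A, format="csr")                    # M = D + L, N = M - A
    Y = np.zeros((n, n_chains))                     # y^(0) = 0 in every chain
    for _ in range(n_sweeps):
        C = np.sqrt(D)[:, None] * rng.standard_normal((n, n_chains))
        Y = spsolve_triangular(M, C + M @ Y - A @ Y, lower=True)
    return Y

# A: the sparse lattice precision from the previous sketch.
rng = np.random.default_rng(0)
Y = gauss_seidel_sampler(A, n_sweeps=10**3, n_chains=10**4, rng=rng)  # far from converged; cf. Table 3
S = np.cov(Y)                                       # sample covariance from 10^4 chains
A_inv = np.linalg.inv(A.toarray())
print(np.linalg.norm(A_inv - S, 2) / np.linalg.norm(A_inv, 2))        # relative covariance error
```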

The benchmark for evaluating the convergence of the iterative samplers in finite precision is the Cholesky factorization, whose relative error is depicted as the green horizontal line in the right panel of Figure 2. For this example, the iterative samplers produce better samples than a Cholesky sampler, since the iterative sample covariances become more precise with more computing time.

The geometric convergence in distribution of the un-accelerated SSOR samples $\mathbf{y}^{(k)}$ to $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$ is clear in Figure 2, and even after $5\times 10^{5}$ flops, convergence in distribution has not been attained. This is not surprising given the large number of iterations ($\mathcal{O}(n^{2})$) necessary for the same stationary method to converge to a solution of $\mathbf{Ax}=\mathbf{b}$ (see Table 3). The accelerated convergence of the Chebyshev polynomial samplers, suggested by the fast convergence of the corresponding linear solvers shown in Table 3, is also evident in Figure 2, with convergence after $1.70\times 10^{5}$ flops (76 iterations) for the Cheby-SSOR sampler with optimal relaxation parameter $\omega=1.6641$, and the somewhat slower convergence at $2.37\times 10^{5}$ flops (106 iterations) when $\omega=1$.

6.2 A $100\times 100\times 100$ ($n=10^{6}$) linear inverse problem in biofilm imaging

We now perform accelerated sampling from a GMRF in 3-dimensions, as a stylized example of estimating a voxel image of a biofilm from confocal scanning laser microscope (CSLM) data (Lewandowski and Beyenal, 2014). This large example illustrates the feasibility of Chebyshev accelerated sampling in large problems for which sampling by Cholesky factorization of the precision matrix is too computationally and memory intensive to be performed on a standard desktop computer.

We consider the problem of reconstructing a $100\times 100\times 100$ voxel image $\mathbf{x}$ of a bacterial biofilm, i.e., a community of bacteria aggregated together as slime, given a subsampled $100\times 100\times 10$ CSLM data set $\mathbf{y}$. For this exercise, we synthesized a ‘true’ image $\mathbf{x}_{\text{t}}$ of a $90\,\mu$m tall ellipsoidal column of biofilm attached to a surface, taking value $10$ inside the biofilm column, and $0$ outside, in arbitrary units. Similar geometry has been observed experimentally for Pseudomonas aeruginosa biofilms (Swogger and Pitts, 2005), and is also predicted by mathematical models of biofilm growth (Alpkvist and Klapper, 2007). CSLM captures a set of planar ‘images’ at different distances from the bottom of the biofilm where it is attached to a surface. In nature biofilms attach to any surface over which water flows, e.g., human teeth and creek bottoms. Each horizontal planar image in this example is $100\times 100$ pixels; the distance between pixels in each plane is typically about $1\,\mu$m, with the exact spatial resolution set by the microscope user. The vertical distance between planar slices in a CSLM image is typically an order of magnitude larger than the horizontal distance between pixels; for this example, the vertical distance between CSLM planes is $10\,\mu$m.

Given the ‘true’ image $\mathbf{x}_{\text{t}}$, we generated synthetic $100\times 100\times 10$ CSLM data by

\mathbf{y}=\mathbf{F}\mathbf{x}_{\text{t}}+\epsilon

where the $10^{5}\times 10^{6}$ matrix $\mathbf{F}$ arithmetically averages over 10 pixels in the vertical dimension of $\mathbf{x}$, to approximate the point spread function (PSF) of CSLM (Sheppard and Shotton, 1997), and $\epsilon\sim\operatorname{N}(\mathbf{0},\mathbf{P}^{-1}=\mathbf{I})$. The data is displayed in the left panel of Figure 3 as layers of pixels, or ‘slices’, located at the centre of sensitivity of the CSLM, i.e. the centre of the PSF. Thus, the likelihood we consider is $\pi(\mathbf{y}|\mathbf{x})=\operatorname{N}(\mathbf{F}\mathbf{x},\mathbf{P}^{-1})$.
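For concreteness, a sketch of this synthetic-data step is given below (ours). It assumes the voxel image is flattened with the vertical axis varying fastest, so that $\mathbf{F}$ is a Kronecker product of an identity with a $1\times 10$ averaging stencil, and it leaves the construction of the ellipsoidal phantom $\mathbf{x}_{\text{t}}$ as a placeholder.

```python
import numpy as np
import scipy.sparse as sp

nx, ny, nz, block = 100, 100, 100, 10
n_vox = nx * ny * nz                               # 10^6 voxels in x
n_obs = nx * ny * (nz // block)                    # 10^5 CSLM data points in y

# F averages each run of `block` consecutive vertical voxels.
avg = sp.csr_matrix(np.full((1, block), 1.0 / block))
F = sp.kron(sp.identity(n_obs, format="csr"), avg, format="csr")   # 10^5 x 10^6

x_true = np.zeros(n_vox)                           # placeholder for the ellipsoidal phantom
rng = np.random.default_rng(1)
y = F @ x_true + rng.standard_normal(n_obs)        # y = F x_t + eps,  eps ~ N(0, P^{-1} = I)
```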


Figure 3: The left panel depicts a $100\times 100\times 10$ pixelated confocal scanning microscope image $\mathbf{y}$ of a simulated ellipsoidal column of a bacterial biofilm; the distance between horizontal pixels is $1\,\mu$m, the distance between vertical pixels is $10\,\mu$m. The right panel shows a surface rendering of a sample from the $n=10^{6}$ dimensional multivariate normal posterior distribution conditioned on hyperparameters.

To encapsulate prior knowledge that the bacteria in the biofilm aggregate together, we model $\mathbf{x}$ by the GMRF $\mathbf{x}\sim\operatorname{N}(\mathbf{0},\mathbf{Q}_{R}^{-1})$ where the precision matrix $\mathbf{Q}_{R}$ models local smoothness of the density of the biofilm and background. We construct the matrix $\mathbf{Q}_{R}$ as a sparse inverse of the dense covariance matrix corresponding to the exponential covariance function. This construction uses the relationship between stationary Gaussian random fields and partial differential equations (PDEs) that was noted by Whittle (1954) for the Matérn (or Whittle-Matérn (Guttorp and Gneiting, 2005)) class of covariance functions, and that was also exploited by Cui et al. (2011) and Lindgren et al. (2011). Rather than stating the PDE, we find it more convenient to work with the equivalent variational form, in this case (the square of)

\mathcal{Q}\left(x\right)=\int_{\mathcal{D}}\left(\frac{R}{4}\left|\nabla x\right|^{2}+\frac{1}{4R}x^{2}\right)dv+\int_{\partial\mathcal{D}}\frac{x^{2}}{2}\,ds,

where $x$ is a continuous stochastic field, $dv$ is the volume element in the domain $\mathcal{D}$, and $ds$ is the surface element on the boundary $\partial\mathcal{D}$. The Euler-Lagrange equations of this form are the Helmholtz operator with (local) Robin boundary conditions $x+R\frac{\partial x}{\partial n}=0$ on $\partial\mathcal{D}$, induced by the $\frac{x^{2}}{2}$ term. In our example we apply the Hessian of this form twice, which can be thought of as squaring the Helmholtz operator. When the quadratic form is written in the operator form $\mathcal{Q}\left(x\right)=x^{T}Hx$, where $H$ is the Hessian, the resulting Gaussian random field has density

\pi\left(x\right)\propto\exp\left\{-x^{T}H^{2}x\right\}.   (19)

We chose this operator because the discretized precision matrix is sparse, while the covariance function (after scaling) is close to $\exp\left\{-r/R\right\}$, having length-scale $R$.

The GMRF over the discrete field $\mathbf{x}$ is then defined using FEM (finite element method) discretization; we used cubic elements between nodes at voxel centres in the cubic domain, and tri-linear interpolation from nodal values within each element. To verify this construction, we show in Figure 4 contours of the resulting covariance function, between the pixel at the centre of the normalised cubic domain $\left[0,1\right]^{3}$ and all other pixels, for length scale $R=1/4$. The contours are logarithmically spaced in value, hence the evenly spaced spherical contours show that the covariance indeed has exponential dependence with distance. The contours look correct at the boundaries, indicating that the local Robin boundary conditions give the desired covariance function throughout the domain. (Local boundary conditions are approximate but preserve sparseness; the exact boundary conditions are given by the boundary integral equation for the exterior Helmholtz operator, resulting in a dense block in $H$ that is inconvenient for computation (Neumayer, 2011).) In contrast, Dirichlet conditions would make the cubic boundary a contour, while Neumann conditions as used by Lindgren et al. (2011) would make contours perpendicular to the cubic boundary; neither of these pure boundary conditions produces the desired covariance function.


Figure 4: Contours of the effective covariance function centred on the cubic domain, logarithmically spaced in value.

In the deterministic setting, this image recovery problem is an example of a linear inverse problem. In the Bayesian setting, we may write the hierarchical model in the general form

\mathbf{y}|\mathbf{x},\boldsymbol{\theta} \sim \operatorname{N}\left(\mathbf{F}\mathbf{x},\mathbf{P}_{\boldsymbol{\theta}}^{-1}\right)   (20)
\mathbf{x}|\boldsymbol{\theta} \sim \operatorname{N}\left(\boldsymbol{\mu},\mathbf{Q}_{\boldsymbol{\theta}}^{-1}\right)   (21)
\boldsymbol{\theta} \sim \pi\left(\boldsymbol{\theta}\right)   (22)

where $\boldsymbol{\theta}$ is a vector of hyperparameters. This stochastic model occurs in many settings (see, e.g., Simpson et al., 2012; Rue and Held, 2005), with $\mathbf{y}$ being observed data, $\mathbf{x}$ a latent field, and $\boldsymbol{\theta}$ a vector of hyperparameters that parameterize the precision matrices $\mathbf{P}$ and $\mathbf{Q}$. The (hyper)prior $\pi\left(\boldsymbol{\theta}\right)$ models uncertainty in the covariance of the two random fields.

There are several options for performing sample-based inference on the model (20), (21), (22). Most direct is forming the posterior distribution $\pi\left(\mathbf{x},\boldsymbol{\theta}|\mathbf{y}\right)$ via Bayes' rule and implementing Markov chain Monte Carlo (MCMC) sampling, typically employing Metropolis-Hastings dynamics with a random walk proposal on $\mathbf{x}$ and $\boldsymbol{\theta}$. Such an algorithm can be very slow due to high correlations within the latent field $\mathbf{x}$, and between the latent field and hyperparameters $\boldsymbol{\theta}$. More efficient algorithms block the latent field, noting that the distribution over $\mathbf{x}$ given everything else is a multivariate normal, and hence can be sampled efficiently as we have discussed in this paper. Higdon (2006) and Bardsley (2012) utilized this structure, along with conjugate hyperpriors on the components of $\boldsymbol{\theta}$, to demonstrate a Gibbs sampler that cycled through sampling from the conditional distributions for $\mathbf{x}$ and components of $\boldsymbol{\theta}$. When the normalizing constant for $\pi\left(\mathbf{x},\boldsymbol{\theta}|\mathbf{y}\right)$ is available, up to a multiplicative constant independent of state, a more efficient algorithm is the one-block algorithm (Rue and Held, 2005, section 4.1.2) in which a candidate $\boldsymbol{\theta}^{\prime}$ is drawn from a random walk proposal, then a draw $\mathbf{x}^{\prime}\sim\pi(\mathbf{x}^{\prime}|\mathbf{y},\boldsymbol{\theta}^{\prime})$ is made, with the joint proposal $\left(\boldsymbol{\theta}^{\prime},\mathbf{x}^{\prime}\right)$ accepted with the standard Metropolis-Hastings probability. The resulting transition kernel in $\boldsymbol{\theta}$ is in detailed balance with the distribution over $\boldsymbol{\theta}|\mathbf{y}$, and hence can improve efficiency dramatically. A further improvement can be to perform MCMC directly on $\pi\left(\boldsymbol{\theta}|\mathbf{y}\right)$ as indicated by Simpson et al. (2012), with subsequent independent sampling $\mathbf{x}\sim\pi\left(\mathbf{x}|\mathbf{y},\boldsymbol{\theta}\right)$ to facilitate Monte Carlo evaluation of statistics. In each of these schemes, computational cost is dominated by the cost of drawing samples from the large multivariate normal $\mathbf{x}\sim\pi\left(\mathbf{x}|\mathbf{y},\boldsymbol{\theta}\right)$. We now demonstrate that sampling step for this synthetic example.

In our example, the distribution over the $100\times 100\times 100$ image $\mathbf{x}$, conditioned on everything else, is the multivariate normal

\pi(\mathbf{x}|\mathbf{y},\theta=R) = \operatorname{N}(\mathbf{x};\,\boldsymbol{\mu}=\mathbf{A}^{-1}\mathbf{F}^{T}\mathbf{P}\mathbf{y},\,\boldsymbol{\Sigma}=\mathbf{A}^{-1})   (23)

with precision matrix $\mathbf{A}=\mathbf{F}^{T}\mathbf{P}\mathbf{F}+\mathbf{Q}_{R}$ (cf. Calvetti and Somersalo (2007); Higdon (2006)). For this calculation we used the same covariance matrix as shown above, so $R=1/4$ in units of the width of the domain, though for sample-based inference one would use samples from the distribution over $R|\mathbf{x},\mathbf{y}$. The right panel of Figure 3 depicts a reconstructed surface derived from a sample from the conditional distribution in (23) using the Chebyshev polynomial accelerated SSOR sampler. The sampler was initialized with the precision matrix $\mathbf{A}$, $\operatorname{E}(\mathbf{c}^{(k)})=\mathbf{F}^{T}\mathbf{P}\mathbf{y}$ for all $k$, and relaxation parameter $\omega=1$. The contour is at value $6$, after smoothing over $3\times 3\times 3$ voxels, displaying a sample surface that separates regions for which the average over $3\times 3\times 3$ voxel blocks is less than $6$ (outside the surface) and greater than $6$ (inside the surface). As can be seen, the surface makes an informative reconstruction of the ellipsoidal phantom.
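Assembling the conditional posterior (23) takes only a few lines once sparse $\mathbf{F}$, $\mathbf{P}$ and $\mathbf{Q}_{R}$ are in hand; the sketch below (ours) shows this step, with sample_ssor_chebyshev a hypothetical stand-in for the Chebyshev accelerated SSOR sampler detailed in the supplementary materials.

```python
def conditional_posterior(F, P, Q_R, y):
    """Precision and noise mean of pi(x | y, R) in (23): A = F^T P F + Q_R, nu = F^T P y."""
    A = (F.T @ P @ F + Q_R).tocsr()        # sparse posterior precision
    nu = F.T @ (P @ y)                     # E(c^(k)) = F^T P y, so the mean is A^{-1} nu
    return A, nu

# A, nu = conditional_posterior(F, sp.identity(n_obs), Q_R, y)
# x_sample = sample_ssor_chebyshev(A, nu, omega=1.0, n_iter=5000)   # hypothetical call
```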

Using CG, estimates of the extreme eigenvalues of $\mathbf{M}_{\text{SSOR}}^{-1}\mathbf{A}$ were $\hat{\lambda}_{\min}=4.38\times 10^{-6}$ and $\hat{\lambda}_{\max}=1-1.36\times 10^{-8}$. By Corollary 6, the asymptotic convergence factors for the Chebyshev sampler are $\sigma\approx 0.9958$ for the mean and $\sigma^{2}\approx 0.9917$ for the covariance matrix. Using this information, equation (16) predicts the number of sampler iterations until convergence. After $k^{*}=4566$ iterations of the Chebyshev accelerated sampler, it is predicted that the mean error is reduced by $\varepsilon=10^{-8}$; that is,

||\boldsymbol{\mu}-\operatorname{E}(\mathbf{y}^{(k^{*})})||_{2}\approx 10^{-8}\,||\boldsymbol{\mu}-\operatorname{E}(\mathbf{y}^{(0)})||_{2}.

But even sooner, after only $k^{**}=k^{*}/2=2283$ iterations, it is predicted that the covariance error is

||\mathbf{A}^{-1}-\operatorname{Var}(\mathbf{y}^{(k^{**})})||\approx 10^{-8}\,||\mathbf{A}^{-1}-\operatorname{Var}(\mathbf{y}^{(0)})||.

Contrast these Chebyshev polynomial convergence results with the performance of the non-accelerated stationary SSOR sampler, which has convergence factors $\varrho(\mathbf{M}^{-1}\mathbf{N})\approx 1-\hat{\lambda}_{\min}=1-4.38\times 10^{-6}$ for the mean error, and $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2}\approx 1-8.76\times 10^{-6}$ for the covariance error. These convergence factors suggest that after running the non-accelerated SSOR Gibbs sampler for $4566$ iterations, the covariance error will be reduced to only $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2\cdot 4566}\approx 0.96$ of the original error; $1.9\times 10^{6}$ iterations are required for a $10^{-8}$ reduction.

The cost difference between the Cholesky factorization and an iterative sampler in this example is dramatic. After finding a machine with the necessary $n^{2}$ memory requirements, the Cholesky factorization would cost about $b^{2}n=10^{16}$ flops (since the bandwidth of the precision matrix $\mathbf{A}$ is about $b=10^{5}$). Since the number of non-zero elements of $\mathbf{A}$ is $3.3\times 10^{8}$, an iterative sampler costs about $6.6\times 10^{8}$ flops per iteration, much less than $n^{2}$. The sample in Figure 3 was generated by $k_{\max}=5\times 10^{3}$ iterations of the Chebyshev accelerated SSOR sampler, at a total cost of $3.3\times 10^{12}$ flops, which is about $3\times 10^{3}$ times faster than Cholesky factoring.
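The arithmetic behind this comparison is summarized in the short sketch below (ours), using the bandwidth, non-zero count and iteration count quoted in this paragraph.

```python
n, b, nnz, k_max = 10**6, 10**5, 3.3e8, 5 * 10**3
cholesky_flops = b**2 * n                  # ~ 1e16 flops for the banded factorization
flops_per_iter = 2 * nnz                   # ~ 6.6e8 flops per sampler iteration
iterative_flops = flops_per_iter * k_max   # ~ 3.3e12 flops in total
print(cholesky_flops / iterative_flops)    # ratio ~ 3e3
```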

7 Discussion

This work began, in part, with a curiosity about the convergence of the sequence of covariance matrices in Gibbs sampling applied to multivariate normal distributions, as studied by Liu et al. (1995). Convergence of that sequence indicates that the algorithm is implicitly implementing some factorization of the target covariance or precision matrix. Which one?

The answer was given by Goodman and Sokal (1989), Amit and Grenander (1991), Barone and Frigessi (1990), and Galli and Gao (2001), that the standard component-sweep Gibbs sampler corresponds to the classical Gauss-Seidel iterative method. That result is given in section 4.2, generalized to arbitrary matrix splittings, showing that any matrix splitting used to generate a deterministic relaxation also induces a stochastic relaxation that is a generalized Gibbs sampler; the linear iterative relaxation and the stochastic relaxation share exactly the same iteration operator, conditions for convergence, and convergence factor, which may be summarized by noting that they share exactly the same error polynomial.

Equivalence of error polynomials is important because they are the central object in designing accelerated solvers including the multigrid, Krylov space, and parallel algorithms. We demonstrated that equivalence explicitly for polynomial acceleration, the basic non-stationary acceleration scheme for linear solvers, showing that this control of the error polynomial can be applied to Gibbs sampling from normal distributions. It follows that, just as for linear solvers, Chebyshev-polynomial accelerated samplers have a smaller average asymptotic convergence factor than their un-accelerated stationary counterparts.

The equivalences noted above are strictly limited to the case of normal target distributions. We are also concerned with continuous non-normal target distributions and whether acceleration of the normal case can usefully inform acceleration of sampling from non-normal distributions. Convergence of the unaccelerated, stationary, iteration applied to bounded perturbations of a normal distribution was established by Amit (1991), though carrying over convergence rates proved more problematic.

There are several possibilities for extending the acceleration techniques to non-normal distributions. A straightforward generalization is to apply Gibbs sampling to the non-normal target, assuming the required conditional distributions are easy to sample from, though using the directions determined by the accelerated algorithm. Simply applying the accelerated algorithm to the non-normal distribution does not lead to optimal acceleration, as demonstrated by Goodman and Sokal (1989).

A second route, which looks more promising to us, is to exploit the connection between Gibbs samplers and linear iterative methods that are often viewed as local solvers for non-linear problems, or equivalently, optimizers for local quadratic approximations to non-quadratic functions. Since a local quadratic approximation to $\log\pi$ is a local Gaussian approximation to $\pi$, the iterations developed here may be used to target this local approximation and hence provide local proposals in an MCMC. We imagine an algorithm along the lines of the trust-region methods from optimization in which the local quadratic (Gaussian) approximation is trusted up to some distance from the current state, implemented via a distance penalty. One or more steps of the iterative sampler would act as a proposal to a Metropolis-Hastings accept/reject step that ensures the correct target distribution. Metropolis adjusted Langevin (MALA) and hybrid Monte Carlo (HMC) turn out to be examples of this scheme (Norton and Fox, 2014), as is the algorithm presented by Green and Han (1992). This naturally raises the question of whether acceleration of the local iteration can accelerate the Metropolised algorithm. This remains a topic for ongoing research.

Acknowledgments

This work was partially funded by the New Zealand Institute for Mathematics and its Applications (NZIMA) thematic programme on Analysis, Applications and Inverse Problems in PDEs, and Marsden contract UOO1015.

Appendix A Appendix

A.1 Stationary sampler convergence (Proof of Theorem 2 and Corollary 3)

First, the theorem and corollary are established for the mean. Since $\mathbf{A}=\mathbf{M}-\mathbf{N}$ is a convergent splitting, (10) and Theorem 1 show that $\operatorname{E}(\mathbf{c}^{(k)})=\boldsymbol{\nu}$ if and only if $\operatorname{E}(\mathbf{y}^{(k)})\to\mathbf{A}^{-1}\boldsymbol{\nu}$, with the same convergence factor as for the linear solver. To establish convergence of the variance, let $\mathbf{G}=\mathbf{M}^{-1}\mathbf{N}$ in (10); then $\mathbf{y}^{(k)}=\mathbf{G}^{k}\mathbf{y}^{(0)}+\sum_{i=0}^{k-1}\mathbf{G}^{i}(\mathbf{M}^{-1}\mathbf{c}^{(k-1-i)})$. This equation and the independence of $\{\mathbf{c}^{(i)}\}$ show that $\operatorname{Var}(\mathbf{y}^{(k)}|\mathbf{y}^{(0)})=\sum_{i=0}^{k-1}\left(\mathbf{G}^{i}\mathbf{M}^{-1}\operatorname{Var}(\mathbf{c}^{(i)})\mathbf{M}^{-T}(\mathbf{G}^{i})^{T}\right)$. Theorem 1 establishes the existence of a unique limiting distribution with a non-zero covariance matrix $\mathbf{\Gamma}$. Thus, for $\mathbf{y}^{(i)},\mathbf{y}^{(i+1)}\sim\Pi$, (10) implies

\mathbf{\Gamma}=\mathbf{G}\mathbf{\Gamma}\mathbf{G}^{T}+\mathbf{M}^{-1}\operatorname{Var}(\mathbf{c}^{(i)})\mathbf{M}^{-T}   (24)

since $\mathbf{y}^{(i)}$ and $\mathbf{c}^{(i)}$ are independent. Thus $\operatorname{Var}(\mathbf{y}^{(k)}|\mathbf{y}^{(0)})=\mathbf{\Gamma}-\mathbf{G}^{k}\mathbf{\Gamma}(\mathbf{G}^{k})^{T}$, and so

\operatorname{Var}(\mathbf{y}^{(k)})=\mathbf{\Gamma}-\mathbf{G}^{k}\left(\mathbf{\Gamma}-\operatorname{Var}(\mathbf{y}^{(0)})\right)(\mathbf{G}^{k})^{T}.   (25)

That is, $\operatorname{Var}(\mathbf{y}^{(k)})\to\mathbf{\Gamma}$ with convergence factor $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2}$. To prove that part (b) of the theorem implies part (a), consider the starting vector $\mathbf{y}^{(0)}\sim\Pi$ with covariance matrix $\mathbf{\Gamma}=\mathbf{A}^{-1}$. Since $\mathbf{c}^{(k)}$ is independent of $\mathbf{y}^{(k)}$, the relation (24) shows that $\operatorname{Var}(\mathbf{c}^{(k)})=\mathbf{M}(\mathbf{A}^{-1}-\mathbf{G}\mathbf{A}^{-1}\mathbf{G}^{T})\mathbf{M}^{T}=\mathbf{M}\mathbf{A}^{-1}\mathbf{M}^{T}-\mathbf{N}\mathbf{A}^{-1}\mathbf{N}^{T}$. Substituting in $\mathbf{N}=\mathbf{M}-\mathbf{A}$ shows that $\operatorname{Var}(\mathbf{c}^{(k)})=\mathbf{M}^{T}+\mathbf{N}$. To prove that (a) implies (b), consider $\mathbf{y}^{(0)}\sim\operatorname{N}(\boldsymbol{\mu},\mathbf{A}^{-1})$. By (25), $\mathbf{\Gamma}-\operatorname{Var}(\mathbf{y}^{(1)})=\mathbf{G}(\mathbf{\Gamma}-\mathbf{A}^{-1})\mathbf{G}^{T}$. Substituting $\operatorname{Var}(\mathbf{c}^{(0)})=\mathbf{M}(\mathbf{A}^{-1}-\mathbf{G}\mathbf{A}^{-1}\mathbf{G}^{T})\mathbf{M}^{T}$ into equation (24) shows $\mathbf{\Gamma}-\mathbf{A}^{-1}=\mathbf{G}(\mathbf{\Gamma}-\mathbf{A}^{-1})\mathbf{G}^{T}$. Thus $\operatorname{Var}(\mathbf{y}^{(1)})=\mathbf{A}^{-1}$, which shows that $\operatorname{Var}(\mathbf{y}^{(k)})$ has converged to $\mathbf{A}^{-1}$. By Theorem 1, $\mathbf{\Gamma}=\mathbf{A}^{-1}$.

A.2 Polynomial accelerated sampler convergence (Proof of Theorem 5 and Corollary 6)

If the polynomial accelerated linear solver (11) converges, then $\operatorname{E}(\mathbf{y}^{(k+1)})\to\mathbf{A}^{-1}\operatorname{E}(\mathbf{c}^{(k)})=\boldsymbol{\mu}$. To determine $\operatorname{Var}(\mathbf{c}^{(k)})$, rewrite the iteration (17) as $\mathbf{y}^{(k+1)}=(1-\alpha_{k})\mathbf{y}^{(k-1)}+\alpha_{k}\mathbf{G}^{(k)}\mathbf{y}^{(k)}+\alpha_{k}\left(\mathbf{M}^{(k)}\right)^{-1}\mathbf{c}^{(k)}$, where $\mathbf{M}^{(k)}=\frac{1}{\tau_{k}}\mathbf{M}$, $\mathbf{N}^{(k)}=\mathbf{M}^{(k)}-\mathbf{A}$, and $\mathbf{G}^{(k)}=\mathbf{I}-\tau_{k}\mathbf{M}^{-1}\mathbf{A}=\left(\mathbf{M}^{(k)}\right)^{-1}\mathbf{N}^{(k)}$. First, we consider $\mathbf{y}^{(i-1)}\sim\operatorname{N}(\boldsymbol{\mu},\mathbf{A}^{-1})$ and then find the $\operatorname{Var}(\mathbf{c}^{(k)})$ that will guarantee that $\mathbf{y}^{(i)},\mathbf{y}^{(i+1)}\sim\operatorname{N}(\boldsymbol{\mu},\mathbf{A}^{-1})$. Since the $\{\mathbf{c}^{(i)}\}$ are independent of the $\{\mathbf{y}^{(i)}\}$, the above equation for $\mathbf{y}^{(k+1)}$ shows that $\operatorname{Var}(\mathbf{c}^{(k)})$ is equal to

\frac{1}{\alpha_{k}^{2}}\mathbf{M}^{(k)}\left((1-(1-\alpha_{k})^{2})\mathbf{A}^{-1}-2(1-\alpha_{k})\alpha_{k}(\mathbf{G}^{(k)}\mathbf{K}^{(k)}+\mathbf{K}^{(k)T}\mathbf{G}^{(k)T})-\alpha_{k}^{2}\mathbf{G}^{(k)}\mathbf{A}^{-1}\mathbf{G}^{(k)T}\right)\mathbf{M}^{(k)}

where $\mathbf{K}^{(k)}:=\operatorname{Cov}(\mathbf{y}^{(k-1)},\mathbf{y}^{(k)})$. To simplify this expression, we need Lemma 8, which gives $\mathbf{K}^{(k)}$ explicitly. Parts (1) and (2) of the lemma show that

\operatorname{Var}(\mathbf{c}^{(k)})=\frac{1}{\alpha_{k}^{2}}\mathbf{M}^{(k)}\left(\alpha_{k}^{2}(\mathbf{A}^{-1}-\mathbf{G}^{(k)}\mathbf{A}^{-1}\mathbf{G}^{(k)T})+2(1-\alpha_{k})\alpha_{k}(\mathbf{A}^{-1}-\mathbf{G}_{\kappa}^{(k)}\mathbf{A}^{-1}\mathbf{G}^{(k)T})\right)\mathbf{M}^{(k)}.

Part (3) of Lemma 8 shows that $\operatorname{Var}(\mathbf{c}^{(k)})$ has the form specified in the theorem.

Lemma 8

For a symmetric splitting $\mathbf{A}=\mathbf{M}-\mathbf{N}$,

1. $\mathbf{K}^{(k)}$ is symmetric.

2. $\mathbf{K}^{(k)}=\mathbf{G}_{\kappa}^{(k)}\mathbf{A}^{-1}$, where $\mathbf{G}_{\kappa}^{(k)}=\mathbf{I}-\kappa_{k}\mathbf{M}^{-1}\mathbf{A}$ and $\kappa_{k+1}:=\alpha_{k}\tau_{k}+(1-\alpha_{k})\kappa_{k}$.

3. $\mathbf{A}^{-1}-\mathbf{G}_{\tau}\mathbf{A}^{-1}\mathbf{G}_{\kappa}^{T}=\tau\kappa\,\mathbf{M}^{-1}\left((1/\tau+1/\kappa)\mathbf{M}-\mathbf{A}\right)\mathbf{M}^{-1}$.

Proof. To determine $\mathbf{K}^{(k)}$, rewrite the Chebyshev iteration (17) as

\mathbf{Y}^{(k+1)}=\left(\begin{array}{cc}\alpha_{k}\mathbf{G}^{(k)}&\left(1-\alpha_{k}\right)\mathbf{I}\\ \mathbf{I}&\mathbf{0}\end{array}\right)\mathbf{Y}^{(k)}+\alpha_{k}\left(\begin{array}{c}\mathbf{g}^{(k)}\\ \mathbf{0}\end{array}\right)

where $\mathbf{Y}^{(0)}=\left(\begin{array}{c}\mathbf{y}^{(0)}\\ \mathbf{0}\end{array}\right)$, $\mathbf{Y}^{(k+1)}=\left(\begin{array}{c}\mathbf{y}^{(k+1)}\\ \mathbf{y}^{(k)}\end{array}\right)$, and $\mathbf{g}^{(k)}=\left(\mathbf{M}^{(k)}\right)^{-1}\mathbf{c}^{(k)}$. Letting $\mathcal{G}^{(k)}=\left(\begin{array}{cc}\alpha_{k}\mathbf{G}^{(k)}&\left(1-\alpha_{k}\right)\mathbf{I}\\ \mathbf{I}&\mathbf{0}\end{array}\right)$ shows that

\operatorname{Var}\left(\mathbf{Y}^{(k+1)}\right)=\mathcal{G}^{(k)}\operatorname{Var}\left(\mathbf{Y}^{(k)}\right)\mathcal{G}^{(k)T}+\alpha_{k}^{2}\left(\begin{array}{cc}\operatorname{Var}\left(\mathbf{g}^{(k)}\right)&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{array}\right).   (28)

If $\operatorname{Var}\left(\mathbf{y}^{(0)}\right)=\mathbf{A}^{-1}$ then $\operatorname{Var}\left(\mathbf{y}^{(k)}\right)=\mathbf{A}^{-1}$ for $k\geq 1$, in which case $\operatorname{Var}\left(\mathbf{Y}^{(k+1)}\right)$ is

\left(\begin{array}{cc}\mathbf{A}^{-1}&\mathbf{K}^{(k+1)}\\ \mathbf{K}^{(k+1)T}&\mathbf{A}^{-1}\end{array}\right)=\mathcal{G}^{(k)}\left(\begin{array}{cc}\mathbf{A}^{-1}&\mathbf{K}^{(k)}\\ \mathbf{K}^{(k)T}&\mathbf{A}^{-1}\end{array}\right)\mathcal{G}^{(k)T}+\left(\begin{array}{cc}\alpha_{k}^{2}\operatorname{Var}\left(\mathbf{g}^{(k)}\right)&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{array}\right).   (35)

By definition of $\mathbf{Y}^{(0)}$, $\mathbf{K}^{(0)}=\mathbf{0}$; for $k\geq 0$,

\mathbf{K}^{(k+1)}=\alpha_{k}\mathbf{G}^{(k)}\mathbf{A}^{-1}+\left(1-\alpha_{k}\right)\mathbf{K}^{(k)T}.   (36)

Since $\alpha_{0}=1$, $\mathbf{K}^{(1)}=\mathbf{G}^{(0)}\mathbf{A}^{-1}$, which proves parts (1) and (2) of the Lemma for $k=0$ since $\kappa_{1}=\tau_{0}$ and $\mathbf{G}_{\kappa}^{(k)}\mathbf{A}^{-1}$ is symmetric. Assuming that $\mathbf{K}^{(k)}=\mathbf{G}^{(k)}_{\kappa}\mathbf{A}^{-1}$ for $k>0$, the recursion in (36) gives $\mathbf{K}^{(k+1)}=\left(\mathbf{I}-\left[\alpha_{k}\tau_{k}+\left(1-\alpha_{k}\right)\kappa_{k}\right]\mathbf{M}^{-1}\mathbf{A}\right)\mathbf{A}^{-1}$, so the expansion and recursion hold for $k+1$, and parts (1) and (2) of the Lemma follow by induction. Part (3) of the Lemma follows from the equation

\mathbf{A}^{-1}-\mathbf{G}_{\tau}\mathbf{A}^{-1}\mathbf{G}_{\kappa}^{T}=\mathbf{M}_{\tau}^{-1}\left(\mathbf{M}_{\tau}\mathbf{A}^{-1}\mathbf{M}_{\kappa}^{T}\right)\mathbf{M}_{\kappa}^{-T}-\mathbf{M}_{\tau}^{-1}\mathbf{N}_{\tau}\mathbf{A}^{-1}\left(\mathbf{M}_{\kappa}^{-1}\mathbf{N}_{\kappa}\right)^{T}. \qquad \Box

The selection of $\operatorname{Var}(\mathbf{c}^{(k)})=a_{k}\mathbf{M}+b_{k}\mathbf{N}$ assures that if $\operatorname{Var}\left(\mathbf{y}^{(0)}\right)=\mathbf{A}^{-1}$, then $\operatorname{Var}\left(\mathbf{y}^{(k)}\right)=\mathbf{A}^{-1}$ for $k\geq 1$. Thus, subtracting (35) from (28) gives

\operatorname{Var}\left(\mathbf{Y}^{(k+1)}\right)-\left(\begin{array}{cc}\mathbf{A}^{-1}&\mathbf{K}^{(k+1)}\\ \mathbf{K}^{(k+1)T}&\mathbf{A}^{-1}\end{array}\right)=\mathcal{G}^{(k)}\left(\operatorname{Var}\left(\mathbf{Y}^{(k)}\right)-\left(\begin{array}{cc}\mathbf{A}^{-1}&\mathbf{K}^{(k)}\\ \mathbf{K}^{(k)T}&\mathbf{A}^{-1}\end{array}\right)\right)\mathcal{G}^{(k)T}

or $\mathcal{E}^{(k+1)}=\mathcal{G}^{(k)}\mathcal{E}^{(k)}\mathcal{G}^{(k)T}$ for $k\geq 0$, where $\mathcal{E}^{(k)}=\operatorname{Var}\left(\mathbf{Y}^{(k)}\right)-\left(\begin{array}{cc}\mathbf{A}^{-1}&\mathbf{K}^{(k)}\\ \mathbf{K}^{(k)T}&\mathbf{A}^{-1}\end{array}\right)$. Hence, by recursion, $\mathcal{E}^{(k)}=\left(\prod_{l=0}^{k-1}\mathcal{G}^{(l)}\right)\mathcal{E}^{(0)}\left(\prod_{l=0}^{k-1}\mathcal{G}^{(l)}\right)^{T}$.

Denote the polynomial of the block matrix by $\mathcal{P}^{(k+1)}=\prod_{l=0}^{k}\mathcal{G}^{(l)}$, which satisfies

\mathcal{P}^{(k+1)}=\mathcal{G}^{(k)}\mathcal{P}^{(k)}=\left(\begin{array}{cc}\alpha_{k}\mathbf{G}^{(k)}&\left(1-\alpha_{k}\right)\mathbf{I}\\ \mathbf{I}&\mathbf{0}\end{array}\right)\left(\begin{array}{cc}\mathcal{P}_{11}^{(k)}&\mathcal{P}_{12}^{(k)}\\ \mathcal{P}_{21}^{(k)}&\mathcal{P}_{22}^{(k)}\end{array}\right)

with $\mathcal{P}^{(1)}=\mathcal{G}^{(0)}=\left(\begin{array}{cc}\mathbf{G}^{(0)}&\mathbf{0}\\ \mathbf{I}&\mathbf{0}\end{array}\right)$. Thus

\mathcal{P}_{11}^{(k+1)}=\alpha_{k}\mathbf{G}^{(k)}\mathcal{P}_{11}^{(k)}+\left(1-\alpha_{k}\right)\mathcal{P}_{11}^{(k-1)}=\alpha_{k}\left(\mathbf{I}-\tau_{k}\mathbf{M}^{-1}\mathbf{A}\right)\mathcal{P}_{11}^{(k)}+\left(1-\alpha_{k}\right)\mathcal{P}_{11}^{(k-1)}

with $\mathcal{P}_{11}^{(1)}=\mathbf{G}^{(0)}$, which shows that $\mathcal{P}_{11}^{(k+1)}=P_{k+1}$ by (12). Furthermore, this shows that the error in variance at the $k$th iteration has the specified form and convergence factor.

References

  • Adler (1981) Adler, S. L. (1981). Over-relaxation method for the Monte Carlo evaluation of the partition function for multiquadratic actions. Phys. Rev. D 23(12), 2901–2904.
  • Alpkvist and Klapper (2007) Alpkvist, E. and I. Klapper (2007). A multidimensional multispecies continuum model for heterogeneous biofilm development. Bulletin of Mathematical Biology 69, 765–789.
  • Amit (1991) Amit, Y. (1991). On rates of convergence of stochastic relaxation for Gaussian and non-Gaussian distributions. Journal of Multivariate Analysis 38(1), 82–99.
  • Amit and Grenander (1991) Amit, Y. and U. Grenander (1991). Comparing sweep strategies for stochastic relaxation. Journal of Multivariate Analysis 37, 197–222.
  • Axelsson (1996) Axelsson, O. (1996). Iterative Solution Methods. Cambridge University Press.
  • Banerjee et al. (2003) Banerjee, S., A. E. Gelfand, and B. P. Carlin (2003). Hierarchical Modeling and Analysis for Spatial Data. Chapman and Hall.
  • Bardsley (2012) Bardsley, J. M. (2012). MCMC-based image reconstruction with uncertainty quantification. SIAM J. Sci. Comput. 34(3), A1316–A1332.
  • Barone and Frigessi (1990) Barone, P. and A. Frigessi (1990). Improving stochastic relaxation for Gaussian random fields. Probability in the Engineering and Informational Sciences 23, 2901–2904.
  • Barone et al. (2002) Barone, P., G. Sebastiani, and J. Stander (2002). Over-relaxation methods and coupled Markov chains for Monte Carlo simulation. Statistics and Computing 12, 17–26.
  • Besag (1986) Besag, J. (1986). On the statistical analysis of dirty pictures. J. R. Statist. Soc. B 48(3), 259–302.
  • Besag and Green (1993) Besag, J. and P. J. Green (1993). Spatial statistics and Bayesian computation. J. R. Statist. Soc. B 55(1), 27–37.
  • Calvetti and Somersalo (2007) Calvetti, D. and E. Somersalo (2007). Introduction to Bayesian Scientific Computing. Springer.
  • Cui et al. (2011) Cui, T., C. Fox, and M. J. O’Sullivan (2011). Bayesian calibration of a large-scale geothermal reservoir model by a new adaptive delayed acceptance Metropolis Hastings algorithm. Water Resources Research 47(10), W10521.
  • Diaconis and Freedman (1999) Diaconis, P. and D. Freedman (1999). Iterated random functions. SIAM Review 41(1), 45–76.
  • Duflo (1997) Duflo, M. (1997). Random Iterative Models. Berlin: Springer-Verlag.
  • Feller (1968) Feller, W. (1968). An Introduction to Probability Theory and its Applications. Wiley.
  • Fishman (1996) Fishman, G. S. (1996). Coordinate selection rules for Gibbs sampling. The Annals of Applied Probability 6(2), 444–465.
  • Fox (2008) Fox, C. (2008). A conjugate direction sampler for normal distributions, with a few computed examples. Technical Report 2008-1, Electronics Group, University of Otago.
  • Fox and Parker (2014) Fox, C. and A. Parker (2014). Convergence in variance of Chebyshev accelerated Gibbs samplers. SIAM Journal on Scientific Computing 36(1), A124–A147.
  • Fox and Parker (1968) Fox, L. and I. Parker (1968). Chebyshev polynomials in numerical analysis. Oxford, UK: Oxford University Press.
  • Galli and Gao (2001) Galli, A. and H. Gao (2001). Rate of convergence of the Gibbs sampler in the Gaussian case. Mathematical Geology 33(6), 653–677.
  • Gelfand and Smith (1990) Gelfand, A. E. and A. Smith (1990). Sampling-based approaches to calculating marginal densities. J. Am. Stat. 85, 398–409.
  • Gelman et al. (1995) Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995). Bayesian Data Analysis. Chapman & Hall.
  • Geman and Geman (1984) Geman, S. and D. Geman (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Analysis and Machine Intelligence 6, 721–741.
  • Geyer (2011) Geyer, C. J. (2011). Introduction to Markov chain Monte Carlo. In S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng (Eds.), Handbook of Markov Chain Monte Carlo, pp.  3–48. Chapman & Hall.
  • Gilks et al. (1996) Gilks, W. R., S. Richardson, and D. Spiegelhalter (1996). Introducing Markov chain Monte Carlo. In W. R. Gilks, S. Richardson, and D. Spiegelhalter (Eds.), Markov chain Monte Carlo in Practice, pp.  1–16. Chapman & Hall.
  • Gneiting et al. (2005) Gneiting, T., H. Sevcikova, D. B. Percival, M. Schlather, and Y. Jiang (2005). Fast and exact simulation of large Gaussian lattice systems in two dimensions: Exploring the limits. Technical Report 477, Department of Statistics, University of Washington.
  • Golub and Van Loan (1989) Golub, G. H. and C. F. Van Loan (1989). Matrix Computations (2nd ed.). Baltimore: The Johns Hopkins University Press.
  • Goodman and Sokal (1989) Goodman, J. and A. D. Sokal (1989). Multigrid Monte Carlo method. Conceptual foundations. Phys. Rev. D 40(6), 2035–2071.
  • Green and Han (1992) Green, P. and X. Han (1992). Metropolis methods, Gaussian proposals and antithetic variables. In A. Frigessi, P. Barone, and M. Piccioni (Eds.), Stochastic Models, statisical methods and algorithms in image analysis, Lecture Notes in Statistics, pp.  142–164. Berlin: Springer-Verlag.
  • Grenander (1983) Grenander, U. (1983). Tutorial in pattern theory. Technical report, Division of Applied Mathematics, Brown University.
  • Guttorp and Gneiting (2005) Guttorp, P. and T. Gneiting (2005). On the Whittle-Matérn correlation family. Technical Report 80, NRCSE University of Washington.
  • Haario et al. (2001) Haario, H., E. Saksman, and J. Tamminen (2001). An adaptive Metropolis algorithm. Bernoulli 7, 223–242.
  • Higdon (2006) Higdon, D. (2006). A primer on space-time modelling from a Bayesian perspective. In B. Finkenstadt, L. Held, and V. Isham (Eds.), Statistics of Spatio-Temporal Systems, New York, pp.  217–279. Chapman & Hall/CRC.
  • Lewandowski and Beyenal (2014) Lewandowski, Z. and H. Beyenal (2014). Fundamentals of Biofilm Research. Boca Raton, FL: CRC Press.
  • Lindgren et al. (2011) Lindgren, F., H. Rue, and J. Lindström (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J. R. Statist. Soc. B 73(4), 423–498.
  • Liu et al. (1995) Liu, J. S., W. H. Wong, and A. Kong (1995). Covariance structure and convergence rate of the Gibbs sampler with various scans. J. R. Statist. Soc. B 57(1), 157–169.
  • MacKay (2003) MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge, UK: Cambridge University Press.
  • Meurant (2006) Meurant, G. (2006). The Lanczos and Conjugate Gradient Algorithms. Philadelphia: SIAM.
  • Neal (1997) Neal, R. (1997, January). Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report 9702, Department of Statistics, University of Toronto.
  • Neal (1998) Neal, R. (1998). Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation. In M. I. Jordan (Ed.), Learning in Graphical Models, pp. 205–225. Kluwer Academic Publishers.
  • Neumayer (2011) Neumayer, M. (2011). Accelerated Bayesian Inversion and Calibration for Electrical Tomography. Ph. D. thesis, Graz University of Technology.
  • Nevanlinna (1993) Nevanlinna, O. (1993). Convergence of Iterations for Linear Equations. Birkhauser.
  • Norton and Fox (2014) Norton, R. A. and C. Fox (2014). Efficiency and computability of MCMC with Langevin, Hamiltonian, and other matrix-splitting proposals. Unpublished manuscript.
  • Parker and Fox (2012) Parker, A. and C. Fox (2012). Sampling Gaussian distributions in Krylov spaces with conjugate gradients. SIAM Journal on Scientific Computing 34(3), B312–B334.
  • Robert and Casella (1999) Robert, C. P. and G. Casella (1999, August). Monte Carlo Statistical Methods (1 ed.). Springer-Verlag.
  • Robert and Casella (2011) Robert, C. P. and G. Casella (2011). A short history of Markov chain Monte Carlo: Subjective recollections from incomplete data. Statist. Sci. 26(1), 102–115.
  • Roberts and Rosenthal (2007) Roberts, G. O. and J. S. Rosenthal (2007). Coupling and ergodicity of adaptive MCMC. J. Appl. Prob. 44, 458–475.
  • Roberts and Sahu (1997) Roberts, G. O. and S. Sahu (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. J. R. Statist. Soc. B 59(2), 291–317.
  • Rue (2001) Rue, H. (2001). Fast sampling of Gaussian Markov random fields. J. R. Statist. Soc. B 63, 325–338.
  • Rue and Held (2005) Rue, H. and L. Held (2005). Gaussian Markov random fields : Theory and applications. New York: Chapman Hall.
  • Saad (2003) Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.). SIAM.
  • Saad and van der Vorst (2000) Saad, Y. and H. A. van der Vorst (2000). Iterative solution of linear systems in the 20th century. Journal of Computational and Applied Mathematics 123, 1–33.
  • Sheppard and Shotton (1997) Sheppard, C. J. R. and D. R. Shotton (1997). Confocal Laser Scanning Microscopy. Garland Science.
  • Simpson et al. (2012) Simpson, D., F. Lindgren, and H. Rue (2012). Think continuous: Markovian Gaussian models in spatial statistics. Spatial Statistics 1, 16–29.
  • Sokal (1993) Sokal, A. D. (1993). Discussion on the meeting on the Gibbs sampler and other Markov chain Monte Carlo methods. J. R. Statist. Soc. B 55(1), 87.
  • Swogger and Pitts (2005) Swogger, E. and B. Pitts (2005). CSLM 3D view of pseudomonas aeruginosa biofilm structure. Montana State University Center for Biofilm Engineering. Available at http://www.biofilm.montana.edu/resources/movies/2005/2005m06.html.
  • Turčin (1971) Turčin, V. (1971). On the computation of multidimensional integrals by the Monte Carlo method. Theory of Probability and Its Applications 16, 720–724.
  • Varga (1962) Varga, R. S. (1962). Matrix Iterative Analysis. Prentice-Hall.
  • Watkins (2002) Watkins, D. (2002). Fundamentals of Matrix Computations (2nd ed.). New York: Wiley.
  • Whittle (1954) Whittle, P. (1954). On stationary processes in the plane. Biometrika 41, 434–449.
  • Young (1971) Young, D. M. (1971). Iterative Solution of Large Linear Systems. Academic Press.