
Accelerated Gibbs sampling of normal distributions using matrix splittings and polynomials

Colin Fox (fox@physics.otago.ac.nz), Department of Physics, University of Otago, Dunedin, New Zealand
Albert Parker (parker@math.montana.edu), Center for Biofilm Engineering, Department of Mathematical Sciences, Montana State University, Bozeman, MT, USA
Abstract

Standard Gibbs sampling applied to a multivariate normal distribution with a specified precision matrix is equivalent in fundamental ways to the Gauss-Seidel iterative solution of linear equations in the precision matrix. Specifically, the iteration operators, the conditions under which convergence occurs, and the geometric convergence factors (and rates) are identical. These results hold for arbitrary matrix splittings from classical iterative methods in numerical linear algebra, giving easy access to mature results in that field, including existing convergence results for antithetic-variable Gibbs sampling, REGS sampling, and generalizations. Hence, efficient deterministic stationary relaxation schemes lead to efficient generalizations of Gibbs sampling. The technique of polynomial acceleration that significantly improves the convergence rate of an iterative solver derived from a symmetric matrix splitting may be applied to accelerate the equivalent generalized Gibbs sampler. Equality of error polynomials guarantees convergence of the inhomogeneous Markov chain, while equality of convergence factors ensures that the optimal solver leads to the optimal sampler. Numerical examples are presented, including a Chebyshev accelerated SSOR Gibbs sampler applied to a stylized demonstration of low-level Bayesian image reconstruction in a large 3-dimensional linear inverse problem.

Keywords: Bayesian inference, Gaussian Markov random field, Gibbs sampling, matrix splitting, multivariate normal distribution, non-stationary stochastic iteration, polynomial acceleration


1 Introduction

The Metropolis-Hastings algorithm for MCMC was introduced to mainstream statistics around 1990 (Robert and Casella, 2011), though prior to that the Gibbs sampler provided a coherent approach to investigating distributions with Markov random field structure (Turčin, 1971; Grenander, 1983; Geman and Geman, 1984; Gelfand and Smith, 1990; Besag and Green, 1993; Sokal, 1993). The Gibbs sampler may be thought of as a particular Metropolis-Hastings algorithm that uses the conditional distributions as proposal distributions, with acceptance probability always equal to 1 (Geyer, 2011).

In statistics the Gibbs sampler is popular because of its ease of implementation when conditional distributions are available, in the sense that samples may be drawn from the full conditionals (see, e.g., Roberts and Sahu, 1997). However, the Gibbs sampler is not often presented as an efficient algorithm, particularly for massive models. In this work we show that generalized and accelerated Gibbs samplers are contenders for the fastest sampling algorithms for normal target distributions, because they are equivalent to the fastest algorithms for solving systems of linear equations.

Almost all current MCMC algorithms, including Gibbs samplers, simulate a fixed transition kernel that induces a homogeneous Markov chain that converges geometrically in distribution to the desired target distribution. In this aspect, modern variants of the Metropolis-Hastings algorithm are unchanged from the Metropolis algorithm as first implemented in the 1950’s. The adaptive Metropolis algorithm of Haario et al. (2001) (see also Roberts and Rosenthal, 2007) is an exception, though it converges to a geometrically convergent Metropolis-Hastings algorithm that bounds convergence behaviour.

We focus on the application of Gibbs sampling to drawing samples from a multivariate normal distribution with a given covariance or precision matrix. Our concern is to develop generalized Gibbs samplers with optimal geometric, or better than geometric, distributional convergence by drawing on ideas in numerical computation, particularly the mature field of computational linear algebra. We apply the matrix-splitting formalism to show that fixed-scan Gibbs sampling from a multivariate normal is equivalent in fundamental ways to the stationary linear iterative solvers applied to systems of equations in the precision matrix.

Stationary iterative solvers are now considered to be very slow precisely because of their geometric rate of convergence, and are no longer used for large systems. However, they remain a basic building block in the most efficient linear solvers. By establishing equivalence of error polynomials we provide a route whereby acceleration techniques from numerical linear algebra may be applied to Gibbs sampling from normal distributions. The fastest solvers employ non-stationary iterations, hence the equivalent generalized Gibbs sampler induces an inhomogeneous Markov chain. Explicit calculation of the error polynomial guarantees convergence, while control of the error polynomial gives optimal performance.

The adoption of the matrix splitting formalism gives the following practical benefits in the context of fixed-scan Gibbs sampling from normal targets:

  1. a one-to-one equivalence between generalized Gibbs samplers and classical linear iterative solvers;

  2. rates of convergence and error polynomials for the Markov chain induced by a generalized Gibbs sampler;

  3. acceleration of the Gibbs sampler to induce an inhomogeneous Markov chain that achieves the optimal error polynomial, and hence has optimal convergence rate;

  4. numerical estimates of the convergence rate of the Gibbs sampler in a single chain, and a priori estimates of the number of iterations to convergence;

  5. access to preconditioning, whereby the sampling problem is transformed into an equivalent problem for which the accelerated Gibbs sampler has an improved convergence rate.

Some direct linear solvers have already been adapted to sampling from multivariate normal distributions, with Rue (2001) demonstrating the use of solvers based on Cholesky factorization to allow computationally efficient sampling. This paper extends the connection to the iterative linear solvers. Since iterative methods are the most efficient for massive linear systems, the associated samplers will be the most efficient for very high dimensional normal targets.

1.1 Context and overview of results

The Cholesky factorization is the conventional way to produce samples from a moderately sized multivariate normal distribution (Rue, 2001; Rue and Held, 2005), and is also the preferred method for solving moderately sized linear systems. For large linear systems, iterative solvers are the methods of choice due to their inexpensive cost per iteration, and small computer memory requirements.

Gibbs samplers applied to normal distributions are essentially identical to stationary iterative methods from numerical linear algebra. This connection was exploited by Adler (1981), and independently by Barone and Frigessi (1990), who noted that the component-wise Gibbs sampler is a stochastic version of the Gauss-Seidel linear solver, and accelerated the Gibbs sampler by introducing a relaxation parameter to implement the stochastic version of the successive over-relaxation (SOR) of Gauss-Seidel. This pairing was further analyzed by Goodman and Sokal (1989).

This equivalence is depicted in panels A and B of Figure 1. Panel B shows the contours of a normal density $\pi(\mathbf{x})$, and a sequence of coordinate-wise conditional samples taken by the Gibbs sampler applied to $\pi$. Panel A shows the contours of the quadratic $-\log\left(\pi(\mathbf{x})\right)$ and the Gauss-Seidel sequence of coordinate optimizations (Gauss-Seidel optimization was rediscovered by Besag (1986) as iterated conditional modes), or, equivalently, solves of the normal equations $\nabla\log\pi(\mathbf{x})=0$. Note how in Gauss-Seidel the step sizes decrease towards convergence, which is a tell-tale sign that convergence (in value) is geometric. In Section 4 we will show that the iteration operator is identical to that of the Gibbs sampler in panel B, and hence the Gibbs sampler also converges geometrically (in distribution). Slow convergence of these algorithms is usually understood with the same intuition: high correlations correspond to long narrow contours, and lead to small steps in coordinate directions, so that many iterations are required to move appreciably along the long axis of the target.

[Figure 1 panels. Left column, solving $\mathbf{A}\mathbf{x}=\mathbf{b}$: A: Gauss-Seidel; C: Chebyshev-SSOR; E: CG. Right column, sampling from $\operatorname{N}({\boldsymbol{\mu}},\mathbf{A}^{-1})$: B: Gibbs; D: Chebyshev-SSOR sampler; F: CG Gibbs.]

Figure 1: The panels in the left column show the contours of a quadratic function $\frac{1}{2}\mathbf{x}^{T}\mathbf{A}\mathbf{x}-\mathbf{b}^{T}\mathbf{x}$ in two dimensions and the iteration paths of some common optimizers towards the minimizer ${\boldsymbol{\mu}}=\mathbf{A}^{-1}\mathbf{b}$, or equivalently the paths of iterative linear solvers of $\mathbf{A}\mathbf{x}=\mathbf{b}$. The right column presents the iteration paths of the samplers corresponding to each linear solver, along with the contours of the normal density $k\exp\left\{-\frac{1}{2}\mathbf{x}^{T}\mathbf{A}\mathbf{x}+\mathbf{b}^{T}\mathbf{x}\right\}$, where $k$ is the normalizing constant. In all panels, the matrix $\mathbf{A}$ has eigenpairs $\{\left(10,[1~1]^{T}\right),\left(1,[1~-1]^{T}\right)\}$. The Gauss-Seidel solver took 45 iterations to converge to ${\boldsymbol{\mu}}$ (shown are the 90 coordinate steps; each iteration is a "sweep" of the two coordinate directions), the Chebyshev polynomial accelerated SSOR solver required just 16 iterations to converge, while CG finds the minimizer in 2 iterations. For each of the samplers, 10 iterations are shown (the 20 coordinate steps are shown for the Gibbs sampler). The correspondence between these linear solvers/optimizers and samplers is treated in the text (CG in the supplementary material).

Roberts and Sahu (1997) considered forward then backward sweeps of coordinate-wise Gibbs sampling, with a relaxation parameter, to give a sampler they termed REGS. This is a stochastic version of the symmetric-SOR (SSOR) iteration, which comprises forward then backward sweeps of SOR.

The equality of iteration operators and error polynomials, for these pairs of fixed-scan Gibbs samplers and iterative solvers, allows existing convergence results in numerical analysis texts (for example Axelsson, 1996; Golub and Loan, 1989; Nevanlinna, 1993; Saad, 2003; Young, 1971) to be used to establish convergence results for the corresponding Gibbs sampler. Existing results for rates of distributional convergence by fixed-sweep Gibbs samplers (Adler, 1981; Barone and Frigessi, 1990; Liu et al., 1995; Roberts and Sahu, 1997) may be established this way.

The methods of Gauss-Seidel, SOR, and SSOR give stationary linear iterations that were used as linear solvers in the 1950's, and are now considered very slow. The corresponding fixed-scan Gibbs samplers are slow for precisely the same reason. The last fifty years have seen an explosion of theoretical results and algorithmic development that have made linear solvers faster and more efficient, so that for large problems stationary methods are used as preconditioners at best, while preconditioned conjugate gradients, GMRES, multigrid, and fast-multipole methods are the current state of the art for solving linear systems in a finite number of steps (Saad and van der Vorst, 2000).

Linear iterations derived from a symmetric splitting may be sped up by polynomial acceleration, particularly Chebyshev acceleration that results in optimal error reduction amongst methods that have a fixed non-stationary iteration structure (Fox and Parker, 1968; Axelsson, 1996). The Chebyshev accelerated SSOR solver and corresponding Chebyshev accelerated SSOR sampler (Fox and Parker, 2014) are depicted in panels C and D of Figure 1. Both the solver and sampler take steps that are more aligned with the long axis of the target, compared to the coordinate-wise algorithms, and hence achieve faster convergence. However, the step size of Chebyshev-SSOR solving still decreases towards convergence, and hence convergence for both solver and sampler is still asymptotically geometric, albeit with much improved rate.

Fox and Parker (2014) considered point-wise convergence of the mean and variance of a Gibbs SSOR sampler accelerated by Chebyshev polynomials. In this paper we prove convergence in distribution for Gibbs samplers corresponding to any matrix splitting and accelerated by any polynomial that is independent of the Gibbs iterations. We then apply a polynomial accelerated sampler to solve a massive Bayesian linear inverse problem that is infeasible to solve using conventional techniques.

Chebyshev acceleration requires estimates of the extreme eigenvalues of the error operator, which we obtain via a conjugate-gradient (CG) algorithm at no significant computational cost (Meurant, 2006). The CG algorithm itself may be adapted to sample from normal distributions; the CG solver and corresponding sampler, depicted in panels E and F of Figure 1, were analysed by Parker and Fox (2012) and are discussed in the supplementary material.

1.2 Structure of the paper

In Section 2 we review efficient methods for sampling from normal distributions, highlighting Gibbs sampling in various algorithmic forms. Standard results for stationary iterative solvers are presented in Section 3. Theorems in Section 4 establish equivalence of convergence and convergence factors for iterative solvers and Gibbs samplers. Application of polynomial acceleration methods to linear solvers and Gibbs sampling is given in Section 5, including a proof of convergence of the first and second moments of a polynomial accelerated sampler. Numerical verification of convergence results is presented in Section 6.

2 Sampling from multivariate normal distributions

We consider the problem of sampling from an $n$-dimensional normal distribution $\operatorname{N}({\boldsymbol{\mu}},{\boldsymbol{\Sigma}})$ defined by the mean $n$-vector ${\boldsymbol{\mu}}$, and the $n\times n$ symmetric and positive definite (SPD) covariance matrix ${\boldsymbol{\Sigma}}$. Since $\mathbf{z}\sim\operatorname{N}(\mathbf{0},{\boldsymbol{\Sigma}})$ implies $\mathbf{z}+{\boldsymbol{\mu}}\sim\operatorname{N}({\boldsymbol{\mu}},{\boldsymbol{\Sigma}})$, it often suffices to consider drawing samples from normal distributions with zero mean. An exception is when ${\boldsymbol{\mu}}$ is defined implicitly, which we discuss in Section 4.1.

In Bayesian formulations of inverse problems that use a GMRF as a prior distribution, typically the precision matrix $\mathbf{A}={\boldsymbol{\Sigma}}^{-1}$ is explicitly modeled and available (Rue and Held, 2005; Higdon, 2006), perhaps as part of a hierarchical model (Banerjee et al., 2003). Typically then the precision matrix (conditioned on hyperparameters) is large though sparse, if the neighborhoods specifying conditional independence are small. We are particularly interested in this case, and throughout the paper will focus on sampling from $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$ when $\mathbf{A}$ is sparse and large, or when some other property makes operating by $\mathbf{A}$ easy, i.e., one can evaluate $\mathbf{A}\mathbf{x}$ for any vector $\mathbf{x}$.

Standard sampling methods for moderately sized normal distributions utilize the Cholesky factorization (Rue, 2001; Rue and Held, 2005) since it is fast, incurring approximately $(1/3)n^{3}$ floating point operations (flops), and is backwards stable (Watkins, 2002). Samples can also be drawn using the more expensive eigen-decomposition (Rue and Held, 2005), that costs approximately $(10/3)n^{3}$ flops, or more generally using mutually conjugate vectors (Fox, 2008; Parker and Fox, 2012). For stationary Gaussian random fields defined on the lattice, Fourier methods can lead to efficient sampling for large problems (Gneiting et al., 2005).

Algorithm 1 shows the steps for sampling from $\operatorname{N}(\mathbf{0},{\boldsymbol{\Sigma}})$ using Cholesky factorization, when the covariance matrix ${\boldsymbol{\Sigma}}$ is available (Neal, 1997; MacKay, 2003; Higdon, 2006).

input: covariance matrix ${\boldsymbol{\Sigma}}$
output: $\mathbf{y}\sim\operatorname{N}(\mathbf{0},{\boldsymbol{\Sigma}})$
Cholesky factor ${\boldsymbol{\Sigma}}=\mathbf{C}\mathbf{C}^{T}$;
sample $\mathbf{z}\sim\operatorname{N}(\mathbf{0},\mathbf{I})$;
$\mathbf{y}=\mathbf{C}\mathbf{z}$;
Algorithm 1: Cholesky sampling using a covariance matrix ${\boldsymbol{\Sigma}}$
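
As a concrete illustration, here is a minimal NumPy sketch of Algorithm 1; the matrix Sigma below is an arbitrary small SPD covariance chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative SPD covariance matrix (stand-in for the Sigma of interest)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

C = np.linalg.cholesky(Sigma)              # Sigma = C C^T, C lower triangular
z = rng.standard_normal(Sigma.shape[0])    # z ~ N(0, I)
y = C @ z                                  # y ~ N(0, Sigma)
```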

When the precision matrix $\mathbf{A}$ is available, a sample $\mathbf{y}\sim\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$ may be drawn using Algorithm 2, given by Rue (2001) (see also Rue and Held, 2005).

input: precision matrix $\mathbf{A}$
output: $\mathbf{y}\sim\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$
Cholesky factor $\mathbf{A}=\mathbf{B}\mathbf{B}^{T}$;
sample $\mathbf{z}\sim\operatorname{N}(\mathbf{0},\mathbf{I})$;
solve $\mathbf{B}^{T}\mathbf{y}=\mathbf{z}$ by back substitution;
Algorithm 2: Cholesky sampling using a precision matrix $\mathbf{A}$
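
A corresponding sketch of Algorithm 2, with an illustrative precision matrix A; the sample is obtained by back substitution with the transposed Cholesky factor.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)

# Illustrative SPD precision matrix (stand-in for the A of interest)
A = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])

B = np.linalg.cholesky(A)                  # A = B B^T, B lower triangular
z = rng.standard_normal(A.shape[0])        # z ~ N(0, I)
y = solve_triangular(B.T, z, lower=False)  # solve B^T y = z, so y ~ N(0, A^{-1})
```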

The computational cost of Algorithm 2 depends on the bandwidth of $\mathbf{A}$, which also bounds the bandwidth of the Cholesky factor $\mathbf{B}$. For a bandwidth $b$, calculation of the Cholesky factorization requires $\mathcal{O}(b^{2}n)$ flops, which provides savings over the full-bandwidth case when $b\ll n/2$ (Golub and Loan, 1989; Rue, 2001; Watkins, 2002). For GMRFs defined over 2-dimensional domains, the use of a bandwidth reducing permutation often leads to substantial computational savings (Rue, 2001; Watkins, 2002). In 3 dimensions and above, typically no permutation exists that can significantly reduce the bandwidth below $n^{2/3}$, hence the cost of sampling by Cholesky factoring is at least $\mathcal{O}(n^{7/3})$ flops. Further, Cholesky factorizing requires that the precision matrix and the Cholesky factor be stored in computer memory, which can be prohibitive for large problems. In Section 6 we give an example of sampling from a large GMRF for which Cholesky factorization is prohibitively expensive.

2.1 Gibbs sampling from a normal distribution

Iterative samplers, such as Gibbs, are an attractive option when drawing samples from high dimensional multivariate normal distributions due to their inexpensive cost per iteration and small computer memory requirements (only vectors of size $n$ need be stored). If the precision matrix is sparse with $\mathcal{O}(n)$ non-zero elements, then, regardless of the bandwidth, iterative methods cost only about $2n$ flops per iteration, which is comparable with sparse Cholesky factorizations. However, when the bandwidth is $\mathcal{O}(n)$, the cost of the Cholesky factorization is high at $\mathcal{O}(n^{3})$ flops, while iterative methods maintain their inexpensive cost per iteration. Iterative methods are then preferable when requiring significantly fewer than $\mathcal{O}(n^{2})$ iterations for adequate convergence. In the examples presented in Section 6 we find that $\mathcal{O}(n)$ iterations give convergence to machine precision, so the iterative methods are preferable for large problems.

2.1.1 Componentwise formulation

One of the simplest iterative sampling methods is the component-sweep Gibbs sampler (Geman and Geman, 1984; Gelman et al., 1995; Gilks et al., 1996; Rue and Held, 2005). Let $\mathbf{y}=(y_{1},y_{2},\ldots,y_{n})^{T}\in\Re^{n}$ denote a vector in terms of its components, and let $\mathbf{A}$ be an $n\times n$ precision matrix with elements $\{a_{ij}\}$. One sweep over all $n$ components can be written as in Algorithm 3 (Barone and Frigessi, 1990), showing that the algorithm can be implemented using vector and scalar operations only, and storage or inversion of the precision matrix $\mathbf{A}$ is not required.

input: precision matrix $\mathbf{A}$, initial state $\mathbf{y}^{(0)}=\left(y_{1}^{(0)},y_{2}^{(0)},\ldots,y_{n}^{(0)}\right)^{T}$, and maximum iteration $k_{\text{max}}$
output: $\left\{\mathbf{y}^{(0)},\mathbf{y}^{(1)},\mathbf{y}^{(2)},\ldots,\mathbf{y}^{(k_{\text{max}})}\right\}$ where $\mathbf{y}^{(k)}\overset{\mathcal{D}}{\to}\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$ as $k\to\infty$
for $k=1,2,\ldots,k_{\text{max}}$ do
  for $i=1,2,\ldots,n$ do
    sample $z\sim\operatorname{N}(0,1)$;
    $y_{i}^{(k)}=\displaystyle\frac{z}{\sqrt{a_{ii}}}-\frac{1}{a_{ii}}\left(\sum_{j>i}a_{ij}y_{j}^{(k-1)}+\sum_{j<i}a_{ij}y_{j}^{(k)}\right)$
  end
end
Algorithm 3: Component-sweep Gibbs sampling using a precision matrix $\mathbf{A}$

The index $k$ may be omitted (with '$=$' interpreted as assignment) to give an algorithm that can be evaluated in place, requiring minimal storage.
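
An in-place NumPy sketch of this component sweep (dropping the index k as just described); the loop is written for a dense A purely for clarity.

```python
import numpy as np

def gibbs_sweeps(A, y, k_max, rng):
    """Component-sweep Gibbs sampler targeting N(0, A^{-1}); updates y in place."""
    n = A.shape[0]
    for _ in range(k_max):
        for i in range(n):
            z = rng.standard_normal()
            # sum_{j != i} a_ij y_j, using already-updated components for j < i
            s = A[i, :] @ y - A[i, i] * y[i]
            y[i] = z / np.sqrt(A[i, i]) - s / A[i, i]
    return y
```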

2.1.2 Matrix formulation

One iteration in Algorithm 3 consists of a sweep over all $n$ components of $\mathbf{y}^{(k)}$ in sequence. The iteration can be written succinctly in the matrix form (Goodman and Sokal, 1989)

$$\mathbf{y}^{(k+1)}=-\mathbf{D}^{-1}\mathbf{L}\mathbf{y}^{(k+1)}-\mathbf{D}^{-1}\mathbf{L}^{T}\mathbf{y}^{(k)}+\mathbf{D}^{-1/2}\mathbf{z}^{(k)}, \qquad (1)$$

where $\mathbf{z}^{(k)}\sim\operatorname{N}(\mathbf{0},\mathbf{I})$, $\mathbf{D}=\operatorname{diag}(\mathbf{A})$, and $\mathbf{L}$ is the strictly lower triangular part of $\mathbf{A}$. This equation makes clear that the computational cost of each sweep is about $2n^{2}$ flops when $\mathbf{A}$ is dense, due to multiplication by the triangular matrices $\mathbf{L}$ and $\mathbf{L}^{T}$, and $\mathcal{O}(n)$ flops when $\mathbf{A}$ is sparse.

Extending this formulation to sweeps over any other fixed sequence of coordinates is achieved by putting $\mathbf{P}\mathbf{A}\mathbf{P}^{T}$ in place of $\mathbf{A}$ for some permutation matrix $\mathbf{P}$. The use of random sweep Gibbs sampling has also been suggested (Amit and Grenander, 1991; Fishman, 1996; Liu et al., 1995; Roberts and Sahu, 1997), though we do not consider that here.

2.1.3 Convergence

If the iterates $\mathbf{y}^{(k)}$ in (1) converge in distribution to a distribution $\Pi$ which is independent of the starting state $\mathbf{y}^{(0)}$, then the sampler is convergent, and we write

$$\mathbf{y}^{(k)}\overset{\mathcal{D}}{\to}\Pi.$$

It is well known that the iterates $\mathbf{y}^{(k)}$ in the Gibbs sampler (1) converge in distribution geometrically to $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})=\operatorname{N}(\mathbf{0},{\boldsymbol{\Sigma}})$ (Roberts and Sahu, 1997). We consider geometric convergence in detail in Section 4.

3 Linear stationary iterative methods as linear equation solvers

Our work draws heavily on existing results for stationary linear iterative methods for solving linear systems. Here we briefly review the main results that we will use.

Consider a system of linear equations written as the matrix equation

$$\mathbf{A}\mathbf{x}=\mathbf{b} \qquad (2)$$

where $\mathbf{A}$ is a given $n\times n$ nonsingular matrix and $\mathbf{b}$ is a given $n$-dimensional vector. The problem is to find an $n$-dimensional vector $\mathbf{x}$ satisfying equation (2). Later we will consider the case where $\mathbf{A}$ is symmetric positive definite (SPD), as holds for covariance and precision matrices (Feller, 1968).

3.1 Matrix splitting form of stationary iterative algorithms

A common class of methods for solving (2) are the linear iterative methods based on a splitting of $\mathbf{A}$ into $\mathbf{A}=\mathbf{M}-\mathbf{N}$. The matrix splitting is the standard way of expressing and analyzing linear iterative algorithms, following its introduction by Varga (1962). The system (2) is then transformed to $\mathbf{M}\mathbf{x}=\mathbf{N}\mathbf{x}+\mathbf{b}$ or, if $\mathbf{M}$ is nonsingular, $\mathbf{x}=\mathbf{M}^{-1}\mathbf{N}\mathbf{x}+\mathbf{M}^{-1}\mathbf{b}$. The iterative methods use this equation to compute successively better approximations $\mathbf{x}^{(k)}$ to the solution using the iteration step

$$\mathbf{x}^{(k+1)}=\mathbf{M}^{-1}\mathbf{N}\mathbf{x}^{(k)}+\mathbf{M}^{-1}\mathbf{b}=\mathbf{G}\mathbf{x}^{(k)}+\mathbf{g}. \qquad (3)$$

We follow the standard terminology used for these methods (see, e.g., Axelsson, 1996; Golub and Loan, 1989; Saad, 2003; Young, 1971). Such methods are termed linear stationary iterative methods (of the first degree); they are stationary because the iteration matrix $\mathbf{G}=\mathbf{M}^{-1}\mathbf{N}$ and the vector $\mathbf{g}=\mathbf{M}^{-1}\mathbf{b}$ do not depend on $k$. (This use of 'stationary' corresponds to the term 'homogeneous' when referring to a Markov chain. It is not to be confused with a stationary distribution that is invariant under the iteration. Later we will develop non-stationary iterations, inducing a non-homogeneous Markov chain that will, however, preserve the target distribution at each iterate.) The splitting is symmetric when both $\mathbf{M}$ and $\mathbf{N}$ are symmetric matrices. The iteration, and splitting, is convergent if $\mathbf{x}^{(k)}$ tends to a limit independent of $\mathbf{x}^{(0)}$, the limit being $\mathbf{A}^{-1}\mathbf{b}$ (see, e.g., Young, 1971, Theorem 5.2).

The iteration (3) is often written in the residual form, so that convergence may be monitored in terms of the norm of the residual vector; this form emphasizes that $\mathbf{M}^{-1}$ is acting as an approximation to $\mathbf{A}^{-1}$, as in Algorithm 4.

input: the splitting $\mathbf{M}$, $\mathbf{N}$ of $\mathbf{A}$, initial state $\mathbf{x}^{(0)}$
output: $\mathbf{x}^{(k)}$ approximating $\mathbf{x}^{\ast}=\mathbf{A}^{-1}\mathbf{b}$
$k=0$;
repeat
  $\mathbf{r}^{(k)}=\mathbf{b}-\mathbf{A}\mathbf{x}^{(k)}$;
  $\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}+\mathbf{M}^{-1}\mathbf{r}^{(k)}$;
  increment $k$;
until $\left\|\mathbf{r}^{(k)}\right\|$ is sufficiently small;
Algorithm 4: Stationary iterative solve of $\mathbf{A}\mathbf{x}=\mathbf{b}$

In computational algorithms it is important to note that the symbol $\mathbf{M}^{-1}\mathbf{r}$ is interpreted as "solve the system $\mathbf{M}\mathbf{u}=\mathbf{r}$ for $\mathbf{u}$" rather than "form the inverse of $\mathbf{M}$ and multiply $\mathbf{r}$ by $\mathbf{M}^{-1}$", since the latter is much more computationally expensive (about $2n^{3}$ flops (Watkins, 2002)). Thus, the practicality of a splitting depends on the ease with which one can solve $\mathbf{M}\mathbf{u}=\mathbf{r}$ for any vector $\mathbf{r}$.
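
A NumPy sketch of Algorithm 4, with the solve against M left generic; solve_M is a caller-supplied function (a name introduced only for this sketch) that returns the solution of M u = r.

```python
import numpy as np

def stationary_solve(A, b, solve_M, x0, tol=1e-10, max_iter=1000):
    """Stationary iterative solve of A x = b for a splitting A = M - N.

    solve_M(r) must return the solution u of M u = r.
    """
    x = x0.copy()
    for _ in range(max_iter):
        r = b - A @ x            # residual r^{(k)} = b - A x^{(k)}
        if np.linalg.norm(r) < tol:
            break
        x = x + solve_M(r)       # x^{(k+1)} = x^{(k)} + M^{-1} r^{(k)}
    return x
```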

3.1.1 The Gauss-Seidel algorithm

Many splittings of the matrix $\mathbf{A}$ use the terms in the expansion $\mathbf{A}=\mathbf{L}+\mathbf{D}+\mathbf{U}$, where $\mathbf{L}$ is the strictly lower triangular part of $\mathbf{A}$, $\mathbf{D}$ is the diagonal of $\mathbf{A}$, and $\mathbf{U}$ is the strictly upper triangular part.

For example, choosing $\mathbf{M}=\mathbf{L}+\mathbf{D}$ (so $\mathbf{N}=-\mathbf{U}$) allows $\mathbf{M}\mathbf{u}=\mathbf{r}$ to be solved by "forward substitution" (at a cost of $n^{2}$ flops when $\mathbf{A}$ is dense), and hence does not require inversion or Gauss-elimination of $\mathbf{M}$ (which would cost $(2/3)n^{3}$ flops when $\mathbf{A}$ is dense). Using this splitting in Algorithm 4 results in the Gauss-Seidel iterative algorithm. When $\mathbf{A}$ is symmetric, $\mathbf{U}=\mathbf{L}^{T}$, and the Gauss-Seidel iteration can be written as

$$\mathbf{x}^{(k+1)}=-\mathbf{D}^{-1}\mathbf{L}\mathbf{x}^{(k+1)}-\mathbf{D}^{-1}\mathbf{L}^{T}\mathbf{x}^{(k)}+\mathbf{D}^{-1}\mathbf{b}. \qquad (4)$$

Just as we pointed out for the Gibbs sampler, variants of the Gauss-Seidel algorithm, such as "red-black" coordinate updates (Saad, 2003), may be written in this form using a suitable permutation matrix.

The component-wise form of the Gauss-Seidel algorithm can be written in ‘equation’ form just as the Gibbs sampler (1) was in Algorithm 3. The component-wise form emphasizes that Gauss-Seidel can be implemented using vector and scalar operations only, and neither storage nor inversion of the splitting is required.
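
Using the stationary_solve routine sketched above, the Gauss-Seidel iteration (4) amounts to supplying a forward substitution with the lower triangle of A; a brief illustration (the matrices here are again arbitrary stand-ins):

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])     # illustrative SPD matrix
b = np.array([1.0, 0.0])

# Gauss-Seidel: M = D + L is the lower triangle of A (including the diagonal),
# so M u = r is solved by forward substitution.
solve_M = lambda r: solve_triangular(np.tril(A), r, lower=True)

x = stationary_solve(A, b, solve_M, x0=np.zeros(2))   # x approximates A^{-1} b
```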

3.2 Convergence

A fundamental theorem of linear stationary iterative methods states that the splitting $\mathbf{A}=\mathbf{M}-\mathbf{N}$, where $\mathbf{M}$ is nonsingular, is convergent (i.e., $\mathbf{x}^{(k)}\to\mathbf{A}^{-1}\mathbf{b}$ for any $\mathbf{x}^{(0)}$) if and only if $\varrho\left(\mathbf{M}^{-1}\mathbf{N}\right)<1$, where $\varrho\left(\cdot\right)$ denotes the spectral radius of a matrix (Young, 1971, Theorem 3.5.1). This characterization is often used as a definition (Axelsson, 1996; Golub and Loan, 1989; Saad, 2003).

The error at step $k$ is $\mathbf{e}^{(k+1)}=\mathbf{x}^{(k+1)}-\mathbf{x}^{\ast}$, where $\mathbf{x}^{\ast}=\mathbf{A}^{-1}\mathbf{b}$. It follows that

$$\mathbf{e}^{(k+1)}=(\mathbf{M}^{-1}\mathbf{N})^{k}\mathbf{e}^{(0)} \qquad (5)$$

and hence the asymptotic average reduction in error per iteration is the multiplicative factor

$$\lim_{k\to\infty}\left(\frac{\|\mathbf{e}^{(k+1)}\|_{2}}{\|\mathbf{e}^{(0)}\|_{2}}\right)^{1/k}=\varrho(\mathbf{M}^{-1}\mathbf{N}) \qquad (6)$$

(Axelsson, 1996, p. 166). In numerical analysis this is called the (asymptotic average) convergence factor (Axelsson, 1996; Saad, 2003). Later, we will show that this is exactly the same as the quantity called the geometric convergence rate in the statistics literature (see e.g. Robert and Casella, 1999), for the equivalent Gibbs sampler. We will use the term ‘convergence factor’ throughout this paper to avoid a clash of terminology, since in numerical analysis the rate of convergence is minus the log of the convergence factor (see e.g. Axelsson, 1996, p. 166).
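
For a small test matrix the convergence factor in (6) can be checked numerically by comparing the spectral radius of M^{-1}N with the observed error decay; a sketch using an illustrative Gauss-Seidel splitting:

```python
import numpy as np

A = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])
M = np.tril(A)                    # Gauss-Seidel: M = D + L
N = M - A                         # so that A = M - N
G = np.linalg.solve(M, N)         # iteration matrix G = M^{-1} N

rho = np.max(np.abs(np.linalg.eigvals(G)))    # spectral radius (convergence factor)

# Empirical check: iterate the error e <- G e and measure the average reduction
e = np.random.default_rng(2).standard_normal(2)
e0_norm, k = np.linalg.norm(e), 50
for _ in range(k):
    e = G @ e
print(rho, (np.linalg.norm(e) / e0_norm) ** (1 / k))   # the two numbers roughly agree
```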

3.3 Common matrix splittings

We now present the matrix splittings corresponding to some common stationary linear iterative solvers, with details for the case where $\mathbf{A}$ is symmetric, as holds for precision or covariance matrices.

We have seen that the Gauss-Seidel iteration uses the splitting $\mathbf{M}_{\text{GS}}=\mathbf{L}+\mathbf{D}$ and $\mathbf{N}_{\text{GS}}=-\mathbf{L}^{T}$. Gauss-Seidel is one of the simplest splittings and solvers, but is also quite slow. Other splittings have been developed, though the speed of each method is often problem specific. Some common splittings are shown in Table 1, listed with, roughly, greater speed downwards. Speed of convergence in a numerical example is presented later in Section 6.

Table 1: Common stationary linear solvers generated by splittings $\mathbf{A}=\mathbf{M}-\mathbf{N}$, and conditions that guarantee convergence when $\mathbf{A}$ is SPD

  splitting          | $\mathbf{M}$                                                                               | convergence
  Richardson (R)     | $\frac{1}{\omega}\mathbf{I}$                                                               | $0<\omega<\frac{2}{\varrho(\mathbf{A})}$
  Jacobi (J)         | $\mathbf{D}$                                                                               | $\mathbf{A}$ strictly diagonally dominant
  Gauss-Seidel (GS)  | $\mathbf{D}+\mathbf{L}$                                                                    | always
  SOR                | $\frac{1}{\omega}\mathbf{D}+\mathbf{L}$                                                    | $0<\omega<2$
  SSOR               | $\frac{\omega}{2-\omega}\mathbf{M}_{\text{SOR}}\mathbf{D}^{-1}\mathbf{M}_{\text{SOR}}^{T}$ | $0<\omega<2$
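
For small dense matrices the splittings in Table 1 are simple to form explicitly; a sketch (the function name and interface are of course only for this illustration):

```python
import numpy as np

def splitting(A, method, omega=1.0):
    """Return (M, N) with A = M - N for the named splitting of an SPD matrix A."""
    D = np.diag(np.diag(A))
    L = np.tril(A, k=-1)                       # strictly lower triangular part of A
    if method == "richardson":
        M = np.eye(A.shape[0]) / omega
    elif method == "jacobi":
        M = D
    elif method == "gauss-seidel":
        M = D + L
    elif method == "sor":
        M = D / omega + L
    elif method == "ssor":
        M_sor = D / omega + L
        M = (omega / (2.0 - omega)) * M_sor @ np.linalg.solve(D, M_sor.T)
    else:
        raise ValueError(f"unknown splitting: {method}")
    return M, M - A
```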

The method of successive over-relaxation (SOR) uses the splitting

$$\mathbf{M}_{\text{SOR}}=\frac{1}{\omega}\mathbf{D}+\mathbf{L}\quad\text{and}\quad\mathbf{N}_{\text{SOR}}=\frac{1-\omega}{\omega}\mathbf{D}-\mathbf{L}^{T} \qquad (7)$$

in which $\omega$ is a relaxation parameter chosen with $0<\omega<2$. SOR with $\omega=1$ is Gauss-Seidel. For values of $\omega$ such that $\varrho(\mathbf{M}^{-1}_{\text{SOR}}\mathbf{N}_{\text{SOR}})<\varrho(\mathbf{M}^{-1}_{\text{GS}}\mathbf{N}_{\text{GS}})$, SOR is an accelerated Gauss-Seidel iteration. Unfortunately, there is no closed form for the optimal value of $\omega$ for an arbitrary matrix $\mathbf{A}$, and the interval of values of $\omega$ which admits accelerated convergence can be quite narrow (Young, 1971; Golub and Loan, 1989; Saad, 2003).

The symmetric-SOR method (SSOR) incorporates both a forward and a backward sweep of SOR, so that if $\mathbf{A}$ is symmetric then the splitting is symmetric (Golub and Loan, 1989; Saad, 2003),

$$\mathbf{M}_{\text{SSOR}}=\frac{\omega}{2-\omega}\mathbf{M}_{\text{SOR}}\mathbf{D}^{-1}\mathbf{M}_{\text{SOR}}^{T}\quad\text{and}\quad\mathbf{N}_{\text{SSOR}}=\frac{\omega}{2-\omega}\mathbf{N}_{\text{SOR}}^{T}\mathbf{D}^{-1}\mathbf{N}_{\text{SOR}}. \qquad (8)$$

We will make use of symmetric splittings in conjunction with polynomial acceleration in Section 5.

When the matrix $\mathbf{A}$ is dense, Gauss-Seidel and SOR cost about $3n^{2}$ flops per iteration, with $2n^{2}$ due to multiplication by the matrix $\mathbf{A}$ (in order to calculate the residual) and another $n^{2}$ for the forward substitution to solve $\mathbf{M}\mathbf{u}=\mathbf{r}$. Richardson incurs no cost to solve $\mathbf{M}\mathbf{u}=\mathbf{r}$, while a solve with the diagonal Jacobi matrix incurs $n$ flops. Iterative methods are particularly attractive when the matrix $\mathbf{A}$ is sparse, since then the cost per iteration is only $\mathcal{O}(n)$ flops.

Many theorems establish convergence of splittings by utilizing properties of $\mathbf{A}$ in specific applications. Some general conditions that guarantee convergence when $\mathbf{A}$ is SPD are given in the right column of Table 1 (Golub and Loan, 1989; Saad, 2003; Young, 1971).

4 Equivalence of stationary linear solvers and Gibbs samplers

We first consider the equivalence between linear solvers and stochastic iterations in the case where the starting state and noise are not necessarily normally distributed, then in Section 4.2 et seq. we restrict consideration to normal distributions.

4.1 General noise

The striking similarity between the Gibbs sampler (1) and the Gauss-Seidel iteration (4) is no coincidence. It is an example of a general equivalence between the stationary linear solver derived from a splitting and the associated stochastic iteration used as a sampler. We will make explicit the equivalence in the following theorems and corollary. In the first theorem we show that a splitting is convergent (in the sense of stationary iterative solvers) if and only if the associated stochastic iteration is convergent in distribution.

Theorem 1

Let $\mathbf{A}=\mathbf{M}-\mathbf{N}$ be a splitting with $\mathbf{M}$ invertible, and let $\pi(\cdot)$ be some fixed probability distribution with zero mean and fixed non-zero covariance. For any fixed vector $\mathbf{b}$, and random vectors $\mathbf{c}^{(k)}\overset{\text{iid}}{\sim}\pi$, $k=0,1,2,\ldots$, the stationary linear iteration

$$\mathbf{x}^{(k+1)}=\mathbf{M}^{-1}\mathbf{N}\mathbf{x}^{(k)}+\mathbf{M}^{-1}\mathbf{b} \qquad (9)$$

converges, with $\mathbf{x}^{(k)}\rightarrow\mathbf{A}^{-1}\mathbf{b}$ as $k\rightarrow\infty$ whatever the initial vector $\mathbf{x}^{(0)}$, if and only if there exists a distribution $\Pi$ such that the stochastic iteration

$$\mathbf{y}^{(k+1)}=\mathbf{M}^{-1}\mathbf{N}\mathbf{y}^{(k)}+\mathbf{M}^{-1}\mathbf{c}^{(k)} \qquad (10)$$

converges in distribution to $\Pi$, with $\mathbf{y}^{(k)}\overset{\mathcal{D}}{\rightarrow}\Pi$ as $k\rightarrow\infty$ whatever the initial state $\mathbf{y}^{(0)}$.

Proof. If the linear iteration (9) converges, then $\varrho(\mathbf{M}^{-1}\mathbf{N})<1$ (Thm 3-5.1 in Young, 1971). Hence there exists a unique distribution $\Pi$ with $\mathbf{y}^{(k+1)}\overset{\mathcal{D}}{\rightarrow}\Pi$ (Theorem 2.3.18-4 of Duflo, 1997), which shows necessity. Conversely, if the linear solver does not converge to a limit independent of $\mathbf{x}^{(0)}$ for some $\mathbf{b}$, that also holds for $\mathbf{b}=\mathbf{0}$, and hence initializing the sampler with $\operatorname{E}\left[\mathbf{y}^{(0)}\right]=\mathbf{x}^{(0)}$ yields $\operatorname{E}\left[\mathbf{y}^{(k+1)}\right]=(\mathbf{M}^{-1}\mathbf{N})^{k}\mathbf{x}^{(0)}$, which does not converge to a value independent of $\mathbf{y}^{(0)}$. Sufficiency holds by the contrapositive. □

Convergence of the stochastic iteration (10) could also be established via the more general theory of Diaconis and Freedman (1999), which allows the iteration operator $\mathbf{G}=\mathbf{M}^{-1}\mathbf{N}$ to be random, with convergence in distribution guaranteed when $\mathbf{G}$ is contracting on average; see Diaconis and Freedman (1999) for details.

The following theorem shows how to design the noise distribution $\pi$ so that the limit distribution $\Pi$ has a desired mean ${\boldsymbol{\mu}}$ and covariance ${\boldsymbol{\Sigma}}=\mathbf{A}^{-1}$.

Theorem 2

Let $\mathbf{A}$ be SPD, $\mathbf{A}=\mathbf{M}-\mathbf{N}$ be a convergent splitting, ${\boldsymbol{\mu}}$ a fixed vector, and $\pi(\cdot)$ a fixed probability distribution with finite mean ${\boldsymbol{\nu}}$ and non-zero covariance $\mathbf{V}$. Consider the stochastic iteration (10) where $\mathbf{c}^{(k)}\overset{\text{iid}}{\sim}\pi$, $k=0,1,2,\ldots$. Then, whatever the starting state $\mathbf{y}^{(0)}$, the following are equivalent:

  1. $\operatorname{E}\left[\mathbf{c}^{(k)}\right]={\boldsymbol{\nu}}$ and $\operatorname{Var}\left(\mathbf{c}^{(k)}\right)=\mathbf{V}=\mathbf{M}^{T}+\mathbf{N}$;

  2. the iterates $\mathbf{y}^{(k)}$ converge in distribution to some distribution $\Pi$ that has mean ${\boldsymbol{\mu}}=\mathbf{A}^{-1}{\boldsymbol{\nu}}$ and covariance matrix $\mathbf{A}^{-1}$; in particular $\operatorname{E}\left[\mathbf{y}^{(k)}\right]\to{\boldsymbol{\mu}}$ and $\operatorname{Var}\left(\mathbf{y}^{(k)}\right)\to\mathbf{A}^{-1}$ as $k\to\infty$.

Proof. Appendix A.1. □

Additionally, the mean and covariance converge geometrically, with convergence factors given by the convergence factors for the linear iterative method, as established in the following corollary.

Corollary 3

The first and second moments of iterates in the stochastic iteration in Theorem 2 converge geometrically. Specifically, $\operatorname{E}\left(\mathbf{y}^{(k)}\right)\to{\boldsymbol{\mu}}$ with convergence factor $\varrho(\mathbf{M}^{-1}\mathbf{N})$, and $\operatorname{Var}\left(\mathbf{y}^{(k)}\right)=\mathbf{A}^{-1}-\mathbf{G}^{k}\left(\mathbf{A}^{-1}-\operatorname{Var}(\mathbf{y}^{(0)})\right)(\mathbf{G}^{k})^{T}\to\mathbf{A}^{-1}$ with convergence factor $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2}$.

Proof. Appendix A.1. □

Note that the matrix splitting has allowed an explicit construction of the noise covariance to give a desired precision matrix of the target distribution. We see from Theorem 2 that the stochastic iteration may be designed to converge to a distribution with non-zero target mean, essentially by adding the deterministic iteration (9) to the stochastic iteration (10). This is particularly useful when the mean is defined implicitly via solving a matrix equation. In cases where the mean is known explicitly, the mean may be added after convergence of the stochastic iteration with zero mean, giving an algorithm with faster convergence since the covariance matrix converges with factor $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2}<\varrho(\mathbf{M}^{-1}\mathbf{N})$ (this was also noted by Barone et al., 2002). Convergence in variance for non-normal targets was considered in Fox and Parker (2014).

Using Theorems 1 and 2, and Corollary 3, we can draw on the vast literature in numerical linear algebra on stationary linear iterative methods to find random iterations that are computationally efficient and provably convergent in distribution with the desired mean and covariance. In particular, results in Amit and Grenander (1991), Barone and Frigessi (1990), Galli and Gao (2001), Roberts and Sahu (1997), and Liu et al. (1995) are special cases of the general theory of matrix splittings presented here.

4.2 Sampling from normal distributions using matrix splittings

We now restrict attention to the case of normal target distributions.

Corollary 4

If in Theorem 2 we set $\pi=\operatorname{N}({\boldsymbol{\nu}},\mathbf{V})$, for some non-zero covariance matrix $\mathbf{V}$, then, whatever the starting state $\mathbf{y}^{(0)}$, the following are equivalent: (i) $\mathbf{V}=\mathbf{M}^{T}+\mathbf{N}$; (ii) $\mathbf{y}^{(k)}\overset{\mathcal{D}}{\to}\operatorname{N}({\boldsymbol{\mu}},\mathbf{A}^{-1})$ where ${\boldsymbol{\mu}}=\mathbf{A}^{-1}{\boldsymbol{\nu}}$.

Proof. Since $\pi$ is normal, $\Pi$ in Theorem 2 is normal. Since a normal distribution is sufficiently described by its first two moments, the corollary follows. □

Using Corollary 4, we found normal sampling algorithms corresponding to some common stationary linear solvers. The results are given in Table 2.

Table 2: Some generalized Gibbs samplers for drawing from $\operatorname{N}\left(\mathbf{0},\mathbf{A}^{-1}\right)$, adapted from common stationary linear solvers. Each Gibbs iteration requires sampling the noise vector $\mathbf{c}^{(k)}\sim\operatorname{N}\left(\mathbf{0},\mathbf{M}^{T}+\mathbf{N}\right)$

  Sampler               | $\mathbf{M}$                                                                               | $\operatorname{Var}\left(\mathbf{c}^{(k)}\right)=\mathbf{M}^{T}+\mathbf{N}$
  Richardson            | $\frac{1}{\omega}\mathbf{I}$                                                               | $\frac{2}{\omega}\mathbf{I}-\mathbf{A}$
  Jacobi                | $\mathbf{D}$                                                                               | $2\mathbf{D}-\mathbf{A}$
  Gibbs (Gauss-Seidel)  | $\mathbf{D}+\mathbf{L}$                                                                    | $\mathbf{D}$
  SOR                   | $\frac{1}{\omega}\mathbf{D}+\mathbf{L}$                                                    | $\frac{2-\omega}{\omega}\mathbf{D}$
  SSOR (REGS)           | $\frac{\omega}{2-\omega}\mathbf{M}_{\text{SOR}}\mathbf{D}^{-1}\mathbf{M}_{\text{SOR}}^{T}$ | $\frac{\omega}{2-\omega}\left(\mathbf{M}_{\text{SOR}}\mathbf{D}^{-1}\mathbf{M}_{\text{SOR}}^{T}+\mathbf{N}_{\text{SOR}}^{T}\mathbf{D}^{-1}\mathbf{N}_{\text{SOR}}\right)$

A sampler corresponding to a convergent splitting is implemented in Algorithm 5.

input: SPD precision matrix $\mathbf{A}$; $\mathbf{M}$ and $\mathbf{N}$ defining a convergent splitting of $\mathbf{A}$; number of steps $k_{\max}$; initial state $\mathbf{y}^{(0)}$
output: $\mathbf{y}^{(k)}$ approximately distributed as $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$
for $k=0,\ldots,k_{\max}$ do
  sample $\mathbf{c}^{(k)}\sim\operatorname{N}(\mathbf{0},\mathbf{M}^{T}+\mathbf{N})$;
  $\mathbf{y}^{(k+1)}=\mathbf{M}^{-1}(\mathbf{N}\mathbf{y}^{(k)}+\mathbf{c}^{(k)})$
end
Algorithm 5: Stationary sampler of $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$
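
A dense-matrix NumPy sketch of Algorithm 5; it factors the noise covariance M^T + N once up front, which is convenient for small examples but, as discussed below, is in general the expensive step. With the Gauss-Seidel splitting this reproduces the usual Gibbs sampler.

```python
import numpy as np

def stationary_sampler(A, M, N, k_max, rng, y0=None):
    """Stationary sampler targeting N(0, A^{-1}) for a convergent splitting A = M - N."""
    n = A.shape[0]
    y = np.zeros(n) if y0 is None else y0.copy()
    C = np.linalg.cholesky(M.T + N)         # Var(c) = M^T + N is SPD for a convergent splitting
    for _ in range(k_max):
        c = C @ rng.standard_normal(n)      # c^{(k)} ~ N(0, M^T + N)
        y = np.linalg.solve(M, N @ y + c)   # y^{(k+1)} = M^{-1}(N y^{(k)} + c^{(k)})
    return y
```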

The assignment $\mathbf{y}^{(k+1)}=\mathbf{M}^{-1}(\mathbf{N}\mathbf{y}^{(k)}+\mathbf{c}^{(k)})$ in Algorithm 5 can be replaced by the slightly more expensive steps $\mathbf{r}^{(k)}=\mathbf{c}^{(k)}-\mathbf{A}\mathbf{y}^{(k)}$ and $\mathbf{y}^{(k+1)}=\mathbf{y}^{(k)}+\mathbf{M}^{-1}\mathbf{r}^{(k)}$, which allows monitoring of the residual and emphasizes the equivalence with the stationary linear solver in Algorithm 4. Even though convergence may not be diagnosed by a decreasing norm of the residual, lack of convergence can be diagnosed when the residual diverges in magnitude. In practice, the effective convergence factor for a sampler may be calculated by solving the linear system (2) (perhaps with a random right hand side) using the iterative solver derived from the splitting, and monitoring the decrease in error to evaluate the asymptotic average convergence factor using equation (6). By Corollary 3, this estimates the convergence factor for the sampler.

The practicality of a sampler derived from a convergent splitting depends on the ease with which one can solve $\mathbf{M}\mathbf{y}=\mathbf{r}$ for any $\mathbf{r}$ (as for the stationary linear solver), and also on the ease of drawing iid noise vectors from $\operatorname{N}(\mathbf{0},\mathbf{M}^{T}+\mathbf{N})$. Sampling the noise vector is simple when a matrix square root, such as the Cholesky factorization, of $\mathbf{M}^{T}+\mathbf{N}$ is cheaply available. Thus, a sampler is at least as expensive as the corresponding linear solver since, in addition to the operations in each iteration, the sampler must factor the $n\times n$ matrix $\operatorname{Var}(\mathbf{c}^{(k)})=\mathbf{M}^{T}+\mathbf{N}$. For the samplers listed in Table 2 it is interesting that the simpler the splitting, the more complicated is the variance of the noise. Neither the Richardson nor the Jacobi splitting gives a useful sampling algorithm, since the difficulty of sampling the noise vector is no less than the original task of sampling from $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$. The Gauss-Seidel splitting, giving the usual Gibbs sampler, is at a kind of sweet spot, where solving equations in $\mathbf{M}$ is simple while the required noise variance is diagonal, so posing a simple sampling problem.

The SOR stationary sampler uses the SOR splitting $\mathbf{M}_{\text{SOR}}$ and $\mathbf{N}_{\text{SOR}}$ in (7) for $0<\omega<2$, and the noise vector $\mathbf{c}^{(k)}\sim\operatorname{N}(\mathbf{0},\mathbf{M}_{\text{SOR}}^{T}+\mathbf{N}_{\text{SOR}}=\frac{2-\omega}{\omega}\mathbf{D})$ (Table 2). This sampler was introduced by Adler (1981), rediscovered by Barone and Frigessi (1990), and has been studied extensively (Barone et al., 2002; Galli and Gao, 2001; Liu et al., 1995; Neal, 1998; Roberts and Sahu, 1997). For $\omega=1$, the SOR sampler is a Gibbs (Gauss-Seidel) sampler. For values of $\omega$ such that $\varrho(\mathbf{M}_{\text{SOR}}^{-1}\mathbf{N}_{\text{SOR}})<\varrho(\mathbf{M}^{-1}_{\text{GS}}\mathbf{N}_{\text{GS}})$, the SOR sampler is an accelerated Gibbs sampler. As for the linear solver, implementation of the Gibbs and SOR samplers by Algorithm 5 requires multiplication by the upper triangular $\mathbf{N}$ and forward substitution with respect to $\mathbf{M}$, at a cost of $2n^{2}$ flops. In addition, these samplers must take the square root of the diagonal matrix $\frac{2-\omega}{\omega}\mathbf{D}$, at a mere cost of $\mathcal{O}(n)$ flops.

Implementation of an SSOR sampler instead of a Gibbs or SOR sampler is advantageous since the Markov chain $\{\mathbf{y}^{(k)}\}$ is reversible (Roberts and Sahu, 1997). SSOR sampling uses the symmetric-SOR splitting $\mathbf{M}_{\text{SSOR}}$ and $\mathbf{N}_{\text{SSOR}}$ in (8). The SSOR stationary sampler is most easily implemented by forward and backward SOR sampling sweeps as in Algorithm 6, so the matrices $\mathbf{M}_{\text{SSOR}}$ and $\mathbf{N}_{\text{SSOR}}$ need never be explicitly formed.

input: the SOR splitting $\mathbf{M},\mathbf{N}$ of $\mathbf{A}$; relaxation parameter $\omega$; initial state $\mathbf{y}$; $k_{\max}$
output: $\mathbf{y}$ approximately distributed as $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$
set $\gamma=\left(\frac{2}{\omega}-1\right)^{1/2}$;
for $k=1,\ldots,k_{\max}$ do
  sample $\mathbf{z}\sim\operatorname{N}(\mathbf{0},\mathbf{I})$;
  $\mathbf{x}:=\mathbf{M}^{-1}(\mathbf{N}\mathbf{y}+\gamma\mathbf{D}^{1/2}\mathbf{z})$;
  sample $\mathbf{z}\sim\operatorname{N}(\mathbf{0},\mathbf{I})$;
  $\mathbf{y}:=\mathbf{M}^{-T}(\mathbf{N}^{T}\mathbf{x}+\gamma\mathbf{D}^{1/2}\mathbf{z})$
end
Algorithm 6: SSOR sampling from $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$
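
A NumPy sketch of Algorithm 6, with the forward and backward SOR sweeps written as dense triangular solves (a sparse implementation would use the corresponding sparse sweeps):

```python
import numpy as np
from scipy.linalg import solve_triangular

def ssor_sampler(A, omega, k_max, rng, y0=None):
    """SSOR (forward then backward SOR) sampler targeting N(0, A^{-1}), 0 < omega < 2."""
    n = A.shape[0]
    d = np.diag(A)
    M = np.diag(d) / omega + np.tril(A, k=-1)   # SOR splitting: M = D/omega + L
    N = M - A                                   # N = ((1 - omega)/omega) D - L^T
    gamma = np.sqrt(2.0 / omega - 1.0)
    sqrt_d = np.sqrt(d)
    y = np.zeros(n) if y0 is None else y0.copy()
    for _ in range(k_max):
        z = rng.standard_normal(n)
        x = solve_triangular(M, N @ y + gamma * sqrt_d * z, lower=True)        # forward sweep
        z = rng.standard_normal(n)
        y = solve_triangular(M.T, N.T @ x + gamma * sqrt_d * z, lower=False)   # backward sweep
    return y
```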

We first encountered restricted versions of Corollary 4 for normal distributions in Amit and Grenander (1991) and in Barone and Frigessi (1990) where geometric convergence of the covariance matrices was established for the Gauss-Seidel and SOR splittings. These and the SSOR splitting were investigated in Roberts and Sahu (1997) (who labelled the sampler REGS).

Corollary 4 and Table 1 show that the Gibbs, SOR and SSOR samplers converge for any SPD precision matrix $\mathbf{A}$. This summarizes results in Barone and Frigessi (1990) and Galli and Gao (2001), and the deterministic sweeps investigated in Amit and Grenander (1991), Roberts and Sahu (1997), and Liu et al. (1995). Corollary 4 generalizes these results to any matrix splitting $\mathbf{A}=\mathbf{M}-\mathbf{N}$ by guaranteeing convergence of the random iterates (10) to $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$ with convergence factor $\varrho(\mathbf{M}^{-1}\mathbf{N})$ (or $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2}$ if ${\boldsymbol{\mu}}=\mathbf{0}$).

5 Non-stationary iterative methods

5.1 Acceleration of linear solvers by polynomials

A common scheme in numerical linear algebra for accelerating a stationary method when $\mathbf{M}$ and $\mathbf{N}$ are symmetric is through the use of polynomial preconditioners (Axelsson, 1996; Golub and Loan, 1989; Saad, 2003). Equation (5) shows that after $k$ steps the error in the stationary method is a $k^{\text{th}}$ order polynomial of the matrix $\mathbf{I}-\mathbf{G}=\mathbf{M}^{-1}\mathbf{A}$. The idea behind polynomial acceleration is to implicitly implement a different $k^{\text{th}}$ order polynomial $P_{k}(\mathbf{M}^{-1}\mathbf{A})$ such that $\varrho(P_{k}(\mathbf{M}^{-1}\mathbf{A}))<\varrho(\left(\mathbf{I}-\mathbf{M}^{-1}\mathbf{A}\right)^{k})$. The coefficients of $P_{k}(\mathbf{M}^{-1}\mathbf{A})$ are functions of a set of acceleration parameters $\{\{\alpha_{k}\},\{\tau_{k}\}\}$, introduced by the second order iteration

$$\mathbf{x}^{(k+1)}=(1-\alpha_{k})\mathbf{x}^{(k-1)}+\alpha_{k}\mathbf{x}^{(k)}+\alpha_{k}\tau_{k}\mathbf{M}^{-1}(\mathbf{b}-\mathbf{A}\mathbf{x}^{(k)}). \qquad (11)$$

At the first step, $\alpha_{0}=1$ and $\mathbf{x}^{(1)}=\mathbf{x}^{(0)}+\tau_{0}\mathbf{M}^{-1}(\mathbf{b}-\mathbf{A}\mathbf{x}^{(0)})$. Setting $\alpha_{k}=\tau_{k}=1$ for every $k$ yields the basic un-accelerated stationary method. The accelerated iteration in (11) is implemented at a negligible increase in cost of $\mathcal{O}(n)$ flops per iteration (due to scalar-vector multiplication and vector addition) over the corresponding stationary solver (3).

It can be shown (e.g., Axelsson, 1996) that the $(k+1)^{\text{st}}$ order polynomial $P_{k+1}$ generated recursively by the second order non-stationary linear solver (11) is

$$P_{k+1}\left(\lambda\right)=\left(\alpha_{k}-\alpha_{k}\tau_{k}\lambda\right)P_{k}\left(\lambda\right)+\left(1-\alpha_{k}\right)P_{k-1}\left(\lambda\right). \qquad (12)$$

This polynomial acts on the error $\mathbf{e}^{(k)}=\mathbf{x}^{(k)}-\mathbf{A}^{-1}\mathbf{b}$ by $\mathbf{e}^{(k+1)}=P_{k}(\mathbf{M}^{-1}\mathbf{A})\mathbf{e}^{(0)}$, which can be compared directly to (5).

When estimates of the extreme eigenvalues $\lambda_{\min}$ and $\lambda_{\max}$ of $\mathbf{I}-\mathbf{G}=\mathbf{M}^{-1}\mathbf{A}$ are available ($\lambda_{\min}$ and $\lambda_{\max}$ are real when $\mathbf{M}$ and $\mathbf{N}$ are symmetric), then the coefficients $\{\tau_{k},\alpha_{k}\}$ can be chosen to generate the scaled Chebyshev polynomials $\{Q_{k}\}$, which give optimal error reduction at every step. The Chebyshev acceleration parameters are

$$\tau_{k}=\frac{2}{\lambda_{\max}+\lambda_{\min}},\quad\beta_{k}=\left(\frac{1}{\tau_{k}}-\beta_{k-1}\left(\frac{\lambda_{\max}-\lambda_{\min}}{4}\right)^{2}\right)^{-1},\quad\alpha_{k}=\frac{\beta_{k}}{\tau_{k}}, \qquad (13)$$

where $\alpha_{0}=1$ and $\beta_{0}=\tau_{0}$ (Axelsson, 1996). Note that these parameters are independent of the iterates $\{\mathbf{x}^{(k)}\}$. Since $\mathbf{M}$ is required to be symmetric, applying Chebyshev acceleration to SSOR is a common pairing; its effectiveness as a linear solver is shown later in Table 3.
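
A NumPy sketch of the Chebyshev accelerated iteration (11) with the parameter recursion (13); lam_min and lam_max are assumed to be (estimates of) the extreme eigenvalues of M^{-1}A, and with M taken as the SSOR splitting matrix of (8) this is the Chebyshev-SSOR solver.

```python
import numpy as np

def chebyshev_solve(A, b, M, lam_min, lam_max, k_max, x0=None):
    """Second-order Chebyshev accelerated iteration (11) for A x = b, splitting A = M - N."""
    n = A.shape[0]
    x_old = np.zeros(n) if x0 is None else x0.copy()
    tau = 2.0 / (lam_max + lam_min)                        # tau_k is constant
    beta = tau                                             # beta_0 = tau_0
    # First step: alpha_0 = 1
    x = x_old + tau * np.linalg.solve(M, b - A @ x_old)
    for _ in range(1, k_max):
        beta = 1.0 / (1.0 / tau - beta * ((lam_max - lam_min) / 4.0) ** 2)
        alpha = beta / tau
        x_new = (1 - alpha) * x_old + alpha * x \
                + alpha * tau * np.linalg.solve(M, b - A @ x)
        x_old, x = x, x_new
    return x
```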

Whereas the stationary methods converge with asymptotic average convergence factor $\varrho(\mathbf{M}^{-1}\mathbf{N})$, the convergence factor for the Chebyshev method depends on $\operatorname{cond}(\mathbf{M}^{-1}\mathbf{A})=\lambda_{\max}/\lambda_{\min}$. Specifically, the scaled Chebyshev polynomial $Q_{k}(\lambda)$ minimizes $\max_{\lambda\in[\lambda_{\min},\lambda_{\max}]}|P_{k}(\lambda)|$ over all $k^{\text{th}}$ order polynomials $P_{k}$, with

$$\max_{\lambda\in[\lambda_{\min},\lambda_{\max}]}|Q_{k}(\lambda)|=\frac{2\sigma^{k}}{1+\sigma^{2k}}. \qquad (14)$$

Since the error at the $k^{\text{th}}$ step of a Chebyshev accelerated linear solver is $\mathbf{e}^{(k+1)}=Q_{k}(\mathbf{M}^{-1}\mathbf{A})\mathbf{e}^{(0)}$, the asymptotic convergence factor is bounded above by

$$\sigma=\frac{1-\sqrt{\lambda_{\min}/\lambda_{\max}}}{1+\sqrt{\lambda_{\min}/\lambda_{\max}}} \qquad (15)$$

(Axelsson, 1996, p. 181). Since $\sigma\in[0,1)$, the polynomial accelerated scheme is guaranteed to converge even if the original splitting was not convergent. Further, the convergence factor of the stationary iterative solver is bounded below by $\rho=\frac{1-\lambda_{\min}/\lambda_{\max}}{1+\lambda_{\min}/\lambda_{\max}}$ (see, e.g., Axelsson, 1996, Thm 5.9). Since $\sigma<\rho$ (except when $\lambda_{\min}=\lambda_{\max}$, in which case $\sigma=0$), polynomial acceleration always reduces the convergence factor, which justifies the term acceleration. The Chebyshev accelerated iteration (11) is amenable to preconditioning that reduces the condition number, and hence reduces $\sigma$, such as incomplete Cholesky factorization or graphical methods (Axelsson, 1996; Saad, 2003). Axelsson also shows that after

$$k^{*}=\left\lceil\frac{\ln(\varepsilon/2)}{\ln\sigma}\right\rceil \qquad (16)$$

iterations of the Chebyshev solver, the error reduction is $\|\mathbf{e}^{(k^{*})}\|_{\mathbf{A}^{\nu}}/\|\mathbf{e}^{(0)}\|_{\mathbf{A}^{\nu}}\leq\varepsilon$ for some real number $\nu$ and any $0<\varepsilon<1$ (Axelsson, 1996, eqn 5.32).
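
For example, with $\sigma=0.9$ and $\varepsilon=10^{-6}$ (values chosen only for illustration), (16) gives $k^{*}=138$ iterations; a one-line check:

```python
import math

sigma, eps = 0.9, 1e-6
k_star = math.ceil(math.log(eps / 2) / math.log(sigma))   # k* = 138
```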

5.2 Acceleration of Gibbs sampling by polynomials

Any acceleration scheme devised for a stationary linear solver is a candidate for accelerating convergence of a Gibbs sampler. For example, consider the second order stochastic iteration

$$\mathbf{y}^{(k+1)}=(1-\alpha_{k})\mathbf{y}^{(k-1)}+\alpha_{k}\mathbf{y}^{(k)}+\alpha_{k}\tau_{k}\mathbf{M}^{-1}(\mathbf{c}^{(k)}-\mathbf{A}\mathbf{y}^{(k)}) \qquad (17)$$

analogous to the linear solver in (11), but now the vector $\mathbf{b}$ has been replaced by a random vector $\mathbf{c}^{(k)}$. The equivalence between polynomial accelerated linear solvers and polynomial accelerated samplers is made clear in the next three theorems.

Theorem 5

Let $\mathbf{A}$ be SPD and $\mathbf{A}=\mathbf{M}-\mathbf{N}$ be a symmetric splitting. Consider a set of independent noise vectors $\{\mathbf{c}^{(k)}\}$ with moments

$$\operatorname{E}(\mathbf{c}^{(k)})={\boldsymbol{\nu}}\quad\text{and}\quad\operatorname{Var}(\mathbf{c}^{(k)})=a_{k}\mathbf{M}+b_{k}\mathbf{N}$$

such that $a_{k}:=\frac{2-\tau_{k}}{\tau_{k}}+\left(b_{k}-1\right)\left(\frac{1}{\tau_{k}}+\frac{1}{\kappa_{k}}-1\right)$, $b_{k}:=\frac{2(1-\alpha_{k})}{\alpha_{k}}\left(\frac{\kappa_{k}}{\tau_{k}}\right)+1$, $\kappa_{k+1}:=\alpha_{k}\tau_{k}+(1-\alpha_{k})\kappa_{k}$, and $\kappa_{1}=\tau_{0}$. If the polynomial accelerated linear solver (11) converges to $\mathbf{A}^{-1}\mathbf{b}$ with a set of parameters $\{\{\alpha_{k}\},\{\tau_{k}\}\}$ that are independent of $\{\mathbf{x}^{(k)}\}$, then the polynomial accelerated stochastic iteration (17) converges in distribution to a distribution $\Pi$ with mean $\mathbf{A}^{-1}{\boldsymbol{\nu}}$ and covariance matrix $\mathbf{A}^{-1}$. Furthermore, if the $\{\mathbf{c}^{(k)}\}$ are normal, then

$$\mathbf{y}^{(k)}\overset{\mathcal{D}}{\to}\operatorname{N}({\boldsymbol{\mu}}=\mathbf{A}^{-1}{\boldsymbol{\nu}},\mathbf{A}^{-1}).$$

Proof. Appendix A.2. □

Given a second order linear solver (11) that converges, Theorem 5 makes clear how to construct a second order sampler (17) that is guaranteed to converge. The next Corollary shows that the polynomial $P_{k}$ that acts on the linear solver error $\mathbf{x}^{(k)}-\mathbf{A}^{-1}\mathbf{b}$ is the same polynomial that acts on the errors in the first and second moments of the sampler, $\operatorname{E}(\mathbf{y}^{(k)})-\mathbf{A}^{-1}\boldsymbol{\nu}$ and $\operatorname{Var}(\mathbf{y}^{(k)})-\mathbf{A}^{-1}$ respectively. In other words, the convergence factors for a polynomial accelerated solver and sampler are the same.
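As a concrete illustration of Theorem 5, the following Python sketch (our illustration, not the paper's implementation) runs the second order stochastic iteration (17) with the noise moments of the theorem. The parameter sequences $\{\alpha_{k}\},\{\tau_{k}\}$ are assumed to be supplied by a convergent second order solver, e.g. the Chebyshev recursion (13); drawing $\mathbf{c}^{(k)}$ from a dense Cholesky factor of $a_{k}\mathbf{M}+b_{k}\mathbf{N}$ is for small examples only and ignores the efficient structure exploited by the SSOR implementation.

```python
import numpy as np

def second_order_sampler(A, M, y0, nu, alphas, taus, rng):
    """Stochastic iteration (17) with Var(c^(k)) = a_k M + b_k N from Theorem 5.
    alphas, taus: solver parameters (alpha_0 = 1 for Chebyshev, as in the paper)."""
    N = M - A                                   # splitting A = M - N
    y_prev, y = y0.copy(), y0.copy()
    kappa = taus[0]                             # kappa_1 = tau_0; unused while alpha_0 = 1
    for alpha, tau in zip(alphas, taus):
        b = 2 * (1 - alpha) / alpha * (kappa / tau) + 1
        a = (2 - tau) / tau + (b - 1) * (1 / tau + 1 / kappa - 1)
        cov_c = a * M + b * N                   # assumed positive definite here
        c = nu + np.linalg.cholesky(cov_c) @ rng.standard_normal(len(nu))
        y_next = ((1 - alpha) * y_prev + alpha * y
                  + alpha * tau * np.linalg.solve(M, c - A @ y))   # iteration (17)
        y_prev, y = y, y_next
        kappa = alpha * tau + (1 - alpha) * kappa   # kappa_{k+1}
    return y
```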

Corollary 6

Suppose that the polynomial accelerated linear solver (11) converges with asymptotic convergence factor $\sigma=\lim_{k\to\infty}\left(\max_{\lambda}|P_{k}(\lambda)|\right)^{1/k}$, where $P_{k}$ is the $k$th order polynomial recursively generated by (12). Then under the conditions of Theorem 5,

\operatorname{E}\left(\mathbf{y}^{(k+1)}\right)=\mathbf{A}^{-1}\boldsymbol{\nu}+P_{k}(\mathbf{M}^{-1}\mathbf{A})\left(\operatorname{E}(\mathbf{y}^{(0)})-\mathbf{A}^{-1}\boldsymbol{\nu}\right)\to\mathbf{A}^{-1}\boldsymbol{\nu}

with asymptotic convergence factor $\sigma$, and

\operatorname{Var}\left(\mathbf{y}^{(k)}\right)=\mathbf{A}^{-1}-P_{k}(\mathbf{M}^{-1}\mathbf{A})\left(\mathbf{A}^{-1}-\operatorname{Var}(\mathbf{y}^{(0)})\right)\left(P_{k}(\mathbf{M}^{-1}\mathbf{A})\right)^{T}\to\mathbf{A}^{-1}

with asymptotic convergence factor $\sigma^{2}$.

Proof. Appendix A.2. \Box

Corollary 6 allows a direct comparison of the convergence factor for a polynomial accelerated sampler ($\sigma$, or $\sigma^{2}$ if $\boldsymbol{\mu}=\mathbf{0}$) to the convergence factor given previously for the corresponding un-accelerated stationary sampler ($\varrho(\mathbf{M}^{-1}\mathbf{N})$, or $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2}$ if $\boldsymbol{\mu}=\mathbf{0}$). In particular, given a second order linear solver with accelerated convergence compared to the corresponding stationary iteration, the corollary guarantees that the second order Gibbs sampler (17) will converge faster than the stationary Gibbs sampler (10).

Just as Chebyshev polynomials are guaranteed to accelerate linear solvers, Corollary 6 assures that Chebyshev polynomials can also accelerate a Gibbs sampler. Using Theorem 5, we derived the Chebyshev accelerated SSOR sampler (Fox and Parker, 2014) by iteratively updating parameters via (13) and then generating a sampler via (17). Explicit implementation details of the Chebyshev accelerated sampler are provided in the supplementary materials. The polynomial accelerated sampler is implemented at a negligible increase in cost of $\mathcal{O}(n)$ flops per iteration over the cost ($4n^{2}$ flops) of the SSOR sampler (Algorithm 6). The asymptotic convergence factor is given by the next Corollary, which follows from Corollary 6 and equation (15).
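The parameter sequences themselves are cheap to generate. The sketch below is ours and uses the standard Chebyshev recursion (e.g., Axelsson, 1996; Saad, 2003) written in the second order form (11)/(17), given eigenvalue bounds $\lambda_{\min},\lambda_{\max}$ of $\mathbf{M}^{-1}\mathbf{A}$; the paper's recursion (13) and the supplementary algorithm may be parameterized differently. These sequences can be fed to the second_order_sampler sketch given after Theorem 5.

```python
def chebyshev_parameters(lam_min, lam_max, n_iter):
    """Standard Chebyshev parameters for the second-order form
    y^(k+1) = (1 - alpha_k) y^(k-1) + alpha_k y^(k) + alpha_k tau_k M^{-1} r^(k)."""
    theta = (lam_max + lam_min) / 2.0
    delta = (lam_max - lam_min) / 2.0
    alphas, taus = [1.0], [1.0 / theta]              # first step: alpha_0 = 1, tau_0 = 1/theta
    rho_prev = delta / theta
    for _ in range(1, n_iter):
        rho = 1.0 / (2.0 * theta / delta - rho_prev)
        alpha = 1.0 + rho * rho_prev                 # momentum-like weight alpha_k
        alphas.append(alpha)
        taus.append(2.0 * rho / (delta * alpha))     # relaxation parameter tau_k
        rho_prev = rho
    return alphas, taus
```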

Corollary 7

If the Chebyshev accelerated linear solver converges, then the mean $\operatorname{E}(\mathbf{y}^{(k)})$ of the corresponding Chebyshev accelerated stochastic iteration (17) converges to $\boldsymbol{\mu}=\mathbf{A}^{-1}\boldsymbol{\nu}$ with asymptotic convergence factor $\left(\frac{1-\sqrt{\lambda_{\min}/\lambda_{\max}}}{1+\sqrt{\lambda_{\min}/\lambda_{\max}}}\right)$, and the covariance matrix $\operatorname{Var}(\mathbf{y}^{(k)})$ converges to $\mathbf{A}^{-1}$ with asymptotic convergence factor $\left(\frac{1-\sqrt{\lambda_{\min}/\lambda_{\max}}}{1+\sqrt{\lambda_{\min}/\lambda_{\max}}}\right)^{2}$.

Corollary 7 and (14) show that a Chebyshev accelerated normal sampler is guaranteed to converge faster than any other acceleration scheme that has the parameters $\{\{\tau_{k},\alpha_{k}\}\}$ independent of the iterates $\{\mathbf{y}^{(k)}\}$. This result also shows that the preconditioning ideas presented in section 5.1 to reduce ${\rm cond}(\mathbf{M}^{-1}\mathbf{A})=\lambda_{\max}/\lambda_{\min}$ can also be used to speed up Chebyshev accelerated samplers. We do not investigate such preconditioning here.

Corollary 7 and equation (16) suggest that, for any $\varepsilon>0$, after $k^{*}$ iterations the Chebyshev error reduction for the mean is smaller than $\varepsilon$. But even sooner, after $k^{**}=k^{*}/2$ iterations, the Chebyshev error reduction for the variance is predicted to be smaller than $\varepsilon$ (Fox and Parker, 2014).


Figure 2: Left panel: Location of non-zero elements in the $100\times 100$ precision matrix $\mathbf{A}$. Right panel: Relative error in covariance $||\mathbf{A}^{-1}-\mathbf{S}^{(k)}_{y}||_{2}/||\mathbf{A}^{-1}||_{2}$ versus number of floating point operations (flops) for a sampler implemented with SSOR and $\omega=1$, SSOR with optimal relaxation $\omega=1.6641$, and SSOR with Chebyshev acceleration. Also shown is the relative error and flop count for a sample drawn using Cholesky factoring.

6 Computed Examples

The iterative sampling algorithms we have investigated are designed for problems where operating by the precision matrix is cheap. A common such case is when the precision matrix is sparse, as occurs when modeling a GMRF with a local neighbourhood structure. Then, typically, the precision matrix has $\mathcal{O}(n)$ non-zero elements, so direct matrix-vector multiplication has $\mathcal{O}(n)$ cost. We give two examples of sampling using sparse precision matrices: first, we present a small $n=100$ example where complete diagnostics can be computed for evaluating the quality of convergence; and second, we present an $n=10^{6}$ Bayesian linear inverse problem that demonstrates computational feasibility for large problems. The samplers are initialized with $\mathbf{y}^{(0)}=\mathbf{0}$ in both examples.

6.1 A $10\times 10$ lattice example ($n=100$)

A first order locally linear sparse precision matrix $\mathbf{A}$, considered by Higdon (2006) and Rue and Held (2005), is

[\mathbf{A}]_{ij}=10^{-4}\delta_{ij}+\begin{cases}n_{i}&{\rm if}~~i=j\\ -1&{\rm if}~~i\neq j~~{\rm and}~~||s_{i}-s_{j}||_{2}\leq 1\\ 0&{\rm otherwise.}\end{cases}

The discrete points $\{s_{i}\}$ are on a regular $10\times 10$ lattice ($n=100$) over the two dimensional domain ${\cal S}=[1,10]\times[1,10]$. Thus $\mathbf{A}$ is $100\times 100$, with $||\mathbf{A}||_{2}=7.8$ and $||\boldsymbol{\Sigma}||_{2}=||\mathbf{A}^{-1}||_{2}=10^{4}$. The sparsity of $\mathbf{A}$ is shown in the left panel of Figure 2. The scalar $n_{i}$ is the number of points neighbouring $s_{i}$, i.e., at distance $1$ from $s_{i}$. Although $n^{2}=10^{4}$, the number of non-zero elements of $\mathbf{A}$ is $\mathcal{O}(n)$ ($460$ in this example). Since the bandwidth of $\mathbf{A}$ is $\mathcal{O}(n^{1/2})$, a Cholesky factorization costs $\mathcal{O}(n^{2})$ flops (Rue, 2001) and each iteration of an iterative method costs $\mathcal{O}(n)$ flops.
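A sparse construction of this precision matrix is straightforward; the following Python sketch (ours, using scipy for illustration) reproduces the quantities quoted above.

```python
import numpy as np
import scipy.sparse as sp

m = 10                                            # 10 x 10 lattice, n = 100
xs, ys = np.meshgrid(np.arange(1, m + 1), np.arange(1, m + 1), indexing="ij")
s = np.column_stack([xs.ravel(), ys.ravel()])     # lattice points s_i
n = s.shape[0]

rows, cols, vals = [], [], []
for i in range(n):
    dist = np.linalg.norm(s - s[i], axis=1)
    nbrs = np.where((dist <= 1) & (np.arange(n) != i))[0]
    rows += [i] * (len(nbrs) + 1)
    cols += [i] + nbrs.tolist()
    vals += [1e-4 + len(nbrs)] + [-1.0] * len(nbrs)   # n_i on the diagonal, -1 for neighbours
A = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

print(A.nnz)                                      # 460 non-zeros, as reported
print(np.linalg.norm(A.toarray(), 2))             # ||A||_2 ~ 7.8
```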

To provide a comparison between linear solvers and samplers, we solved the system $\mathbf{Ax}=\mathbf{b}$ using linear solvers with different matrix splittings (Table 1), where $\mathbf{b}$ is fixed and non-zero, all initialized with $\mathbf{x}^{(0)}=\mathbf{0}$. The results are given in Table 3. The Richardson method does not converge (DNC) since the spectral radius of the iteration operator is greater than 1. The SOR iteration was run at the optimal relaxation parameter value of $\omega=1.9852$. SSOR was run at its optimal value of $\omega=1.6641$. Chebyshev accelerated SSOR (Cheby-SSOR), CG accelerated SSOR (CG-SSOR) (both run with $\omega=1.6641$) and CG utilize a different implicit operator for each iteration, and so the spectral radius given in these cases is the geometric mean spectral radius of these operators (estimated using (5)). Even for this small example, Chebyshev acceleration reduces the computational effort required for convergence by about two orders of magnitude, while CG acceleration reduces work by nearly two more orders of magnitude.

Table 3: The number of iterations and the total number of floating point operations performed by some common stationary and accelerated linear solvers, and the Cholesky factorization, used to solve $\mathbf{Ax}=\mathbf{b}$ for fixed non-zero $\mathbf{b}$. Each solver was run until the residual became sufficiently small, $||\mathbf{b}-\mathbf{A}\mathbf{x}^{(k)}||_{2}<10^{-8}$. Details in section 6.

solver         ω        ϱ(M⁻¹N)    number of iterations    flops
Richardson     1        6.8        DNC                     -
Jacobi         -        .999972    4.01×10^5               5.69×10^7
Gauss-Seidel   -        .999944    2.44×10^5               4.34×10^8
SSOR           1.6641   .999724    6.7×10^4                2.39×10^8
SOR            1.9852   .985210    1655                    2.95×10^6
Cheby-SSOR     1        .9786      958                     3.41×10^6
Cheby-SSOR     1.6641   .9673      622                     2.21×10^6
CG             -        .6375      48                      9.22×10^4
CG-SSOR        1.6641   .4471      29                      6.66×10^4
Cholesky       -        -          -                       1.35×10^4

We investigated the following Gibbs samplers: SOR, SSOR, and the Chebyshev accelerated SSOR. These samplers are guaranteed to converge since the corresponding solvers converge (Theorem 1). Since the convergence factor for a sampler is equal to the convergence factor for the corresponding solver (Corollaries 3 and 6), Gibbs samplers implemented with any of the matrix splittings in Table 1 exhibit the same convergence behavior as shown for the linear solvers in Table 3. Convergence of the sample covariance $\mathbf{S}^{(k)}_{y}\approx\operatorname{Var}(\mathbf{y}^{(k)})\to\mathbf{A}^{-1}$, calculated using $10^{4}$ samples, is shown in the right panel of Figure 2, which displays the relative error $||\mathbf{A}^{-1}-\mathbf{S}^{(k)}_{y}||_{2}/||\mathbf{A}^{-1}||_{2}$ as a function of the flop count. Each sampler iteration costs about $2.24\times 10^{3}$ flops. This performance is compared to samples constructed by a Cholesky factorization, which cost $1.34\times 10^{4}$ flops (depicted as the green vertical line in the right panel of Figure 2). Since the sample means were uniformly close to zero, error in the mean is not shown.
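For illustration, the following Python sketch (ours) runs a matrix-splitting Gibbs sampler of the form (10) on the lattice precision matrix $\mathbf{A}$ constructed in the sketch above and computes the covariance diagnostic plotted in Figure 2. For brevity it uses the Gauss-Seidel splitting $\mathbf{M}=\mathbf{D}+\mathbf{L}$, for which the required noise covariance $\mathbf{M}^{T}+\mathbf{N}$ (Theorem 2, Appendix A.1) reduces to the diagonal $\mathbf{D}$, rather than the SOR/SSOR splittings reported in the figure.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

def gauss_seidel_sampler(A, n_sweeps, n_chains, rng):
    """Stationary sampler y^(k+1) = M^{-1}(N y^(k) + c^(k)) with c^(k) ~ N(0, D)."""
    n = A.shape[0]
    D = A.diagonal()
    M = sp.tril(A, format="csr")                    # M = D + L, N = M - A
    Y = np.zeros((n, n_chains))                     # y^(0) = 0 in every chain
    for _ in range(n_sweeps):
        C = np.sqrt(D)[:, None] * rng.standard_normal((n, n_chains))
        Y = spsolve_triangular(M, C + M @ Y - A @ Y, lower=True)
    return Y

# A: the sparse lattice precision from the previous sketch.
rng = np.random.default_rng(0)
Y = gauss_seidel_sampler(A, n_sweeps=10**3, n_chains=10**4, rng=rng)  # far from converged; cf. Table 3
S = np.cov(Y)                                       # sample covariance from 10^4 chains
A_inv = np.linalg.inv(A.toarray())
print(np.linalg.norm(A_inv - S, 2) / np.linalg.norm(A_inv, 2))        # relative covariance error
```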

The benchmark for evaluating the convergence of the iterative samplers in finite precision is the Cholesky factorization, whose relative error is depicted as the green horizontal line in the right panel of Figure 2. For this example, the iterative samplers produce better samples than a Cholesky sampler, since the iterative sample covariances become more precise with more computing time.

The geometric convergence in distribution of the un-accelerated SSOR samples $\mathbf{y}^{(k)}$ to $\operatorname{N}(\mathbf{0},\mathbf{A}^{-1})$ is clear in Figure 2, and even after $5\times 10^{5}$ flops, convergence in distribution has not been attained. This is not surprising given the large number of iterations ($\mathcal{O}(n^{2})$) necessary for the same stationary method to converge to a solution of $\mathbf{Ax}=\mathbf{b}$ (see Table 3). The accelerated convergence of the Chebyshev polynomial samplers, suggested by the fast convergence of the corresponding linear solvers shown in Table 3, is also evident in Figure 2, with convergence after $1.70\times 10^{5}$ flops (76 iterations) for the Cheby-SSOR sampler with optimal relaxation parameter $\omega=1.6641$, and the somewhat slower convergence at $2.37\times 10^{5}$ flops (106 iterations) when $\omega=1$.

6.2 A $100\times 100\times 100$ ($n=10^{6}$) linear inverse problem in biofilm imaging

We now perform accelerated sampling from a GMRF in 3-dimensions, as a stylized example of estimating a voxel image of a biofilm from confocal scanning laser microscope (CSLM) data (Lewandowski and Beyenal, 2014). This large example illustrates the feasibility of Chebyshev accelerated sampling in large problems for which sampling by Cholesky factorization of the precision matrix is too computationally and memory intensive to be performed on a standard desktop computer.

We consider the problem of reconstructing a $100\times 100\times 100$ voxel image $\mathbf{x}$ of a bacterial biofilm, i.e., a community of bacteria aggregated together as slime, given a subsampled $100\times 100\times 10$ CSLM data set $\mathbf{y}$. For this exercise, we synthesized a ‘true’ image $\mathbf{x}_{\text{t}}$ of a $90\,\mu$m tall ellipsoidal column of biofilm attached to a surface, taking value $10$ inside the biofilm column, and $0$ outside, in arbitrary units. Similar geometry has been observed experimentally for Pseudomonas aeruginosa biofilms (Swogger and Pitts, 2005), and is also predicted by mathematical models of biofilm growth (Alpkvist and Klapper, 2007). CSLM captures a set of planar ‘images’ at different distances from the bottom of the biofilm where it is attached to a surface. In nature biofilms attach to any surface over which water flows, e.g., human teeth and creek bottoms. Each horizontal planar image in this example is $100\times 100$ pixels; the distance between pixels in each plane is typically about $1\,\mu$m, with the exact spatial resolution set by the microscope user. The vertical distance between planar slices in a CSLM image is typically an order of magnitude larger than the horizontal distance between pixels; for this example, the vertical distance between CSLM planes is $10\,\mu$m.

Given the ‘true’ image $\mathbf{x}_{\text{t}}$, we generated synthetic $100\times 100\times 10$ CSLM data by

\mathbf{y}=\mathbf{F}\mathbf{x}_{\text{t}}+\epsilon

where the $10^{5}\times 10^{6}$ matrix $\mathbf{F}$ arithmetically averages over 10 pixels in the vertical dimension of $\mathbf{x}$, to approximate the point spread function (PSF) of CSLM (Sheppard and Shotton, 1997), and $\epsilon\sim\operatorname{N}(\mathbf{0},\mathbf{P}^{-1}=\mathbf{I})$. The data is displayed in the left panel of Figure 3 as layers of pixels, or ‘slices’, located at the centre of sensitivity of the CSLM, i.e. the centre of the PSF. Thus, the likelihood we consider is $\pi(\mathbf{y}|\mathbf{x})=\operatorname{N}(\mathbf{F}\mathbf{x},\mathbf{P}^{-1})$.
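For concreteness, a sketch of this synthetic-data step is given below (ours). It assumes the voxel image is flattened with the vertical axis varying fastest, so that $\mathbf{F}$ is a Kronecker product of an identity with a $1\times 10$ averaging stencil, and it leaves the construction of the ellipsoidal phantom $\mathbf{x}_{\text{t}}$ as a placeholder.

```python
import numpy as np
import scipy.sparse as sp

nx, ny, nz, block = 100, 100, 100, 10
n_vox = nx * ny * nz                               # 10^6 voxels in x
n_obs = nx * ny * (nz // block)                    # 10^5 CSLM data points in y

# F averages each run of `block` consecutive vertical voxels.
avg = sp.csr_matrix(np.full((1, block), 1.0 / block))
F = sp.kron(sp.identity(n_obs, format="csr"), avg, format="csr")   # 10^5 x 10^6

x_true = np.zeros(n_vox)                           # placeholder for the ellipsoidal phantom
rng = np.random.default_rng(1)
y = F @ x_true + rng.standard_normal(n_obs)        # y = F x_t + eps,  eps ~ N(0, P^{-1} = I)
```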


Figure 3: The left panel depicts a $100\times 100\times 10$ pixelated confocal scanning microscope image $\mathbf{y}$ of a simulated ellipsoidal column of a bacterial biofilm; the distance between horizontal pixels is $1\,\mu$m, the distance between vertical pixels is $10\,\mu$m. The right panel shows a surface rendering of a sample from the $n=10^{6}$ dimensional multivariate normal posterior distribution conditioned on hyperparameters.

To encapsulate prior knowledge that the bacteria in the biofilm aggregate together, we model $\mathbf{x}$ by the GMRF $\mathbf{x}\sim\operatorname{N}(\mathbf{0},\mathbf{Q}_{R}^{-1})$ where the precision matrix $\mathbf{Q}_{R}$ models local smoothness of the density of the biofilm and background. We construct the matrix $\mathbf{Q}_{R}$ as a sparse inverse of the dense covariance matrix corresponding to the exponential covariance function. This construction uses the relationship between stationary Gaussian random fields and partial differential equations (PDEs) that was noted by Whittle (1954) for the Matérn (or Whittle-Matérn (Guttorp and Gneiting, 2005)) class of covariance functions, and that was also exploited by Cui et al. (2011) and Lindgren et al. (2011). Rather than stating the PDE, we find it more convenient to work with the equivalent variational form, in this case (the square of)

\mathcal{Q}\left(x\right)=\int_{\mathcal{D}}\left(\frac{R}{4}\left|\nabla x\right|^{2}+\frac{1}{4R}x^{2}\right)dv+\int_{\partial\mathcal{D}}\frac{x^{2}}{2}\,ds,

where $x$ is a continuous stochastic field, $dv$ is the volume element in the domain $\mathcal{D}$, and $ds$ is the surface element on the boundary $\partial\mathcal{D}$. The Euler-Lagrange equations of this form are the Helmholtz operator with (local) Robin boundary conditions $x+R\frac{\partial x}{\partial n}=0$ on $\partial\mathcal{D}$, induced by the $\frac{x^{2}}{2}$ term. In our example we apply the Hessian of this form twice, which can be thought of as squaring the Helmholtz operator. When the quadratic form is written in the operator form $\mathcal{Q}\left(x\right)=x^{T}Hx$, where $H$ is the Hessian, the resulting Gaussian random field has density

\pi\left(x\right)\propto\exp\left\{-x^{T}H^{2}x\right\}.   (19)

We chose this operator because the discretized precision matrix is sparse, while the covariance function (after scaling) is close to $\exp\left\{-r/R\right\}$, having length-scale $R$.

The GMRF over the discrete field $\mathbf{x}$ is then defined using FEM (finite element method) discretization; we used cubic elements between nodes at voxel centres in the cubic domain, and tri-linear interpolation from nodal values within each element. To verify this construction, we show in Figure 4 contours of the resulting covariance function, between the pixel at the centre of the normalised cubic domain $\left[0,1\right]^{3}$ and all other pixels, for length scale $R=1/4$. The contours are logarithmically spaced in value, hence the evenly spaced spherical contours show that the covariance indeed has exponential dependence with distance. The contours look correct at the boundaries, indicating that the local Robin boundary conditions give the desired covariance function throughout the domain. (Local boundary conditions are approximate but preserve sparseness; the exact boundary conditions are given by the boundary integral equation for the exterior Helmholtz operator, resulting in a dense block in $H$ that is inconvenient for computation (Neumayer, 2011).) In contrast, Dirichlet conditions would make the cubic boundary a contour, while Neumann conditions as used by Lindgren et al. (2011) would make contours perpendicular to the cubic boundary; neither of these pure boundary conditions produces the desired covariance function.


Figure 4: Contours of the effective covariance function centred on the cubic domain, logarithmically spaced in value.

In the deterministic setting, this image recovery problem is an example of a linear inverse problem. In the Bayesian setting, we may write the hierarchical model in the general form

\mathbf{y}|\mathbf{x},\boldsymbol{\theta} \sim \operatorname{N}\left(\mathbf{F}\mathbf{x},\mathbf{P}_{\boldsymbol{\theta}}^{-1}\right)   (20)
\mathbf{x}|\boldsymbol{\theta} \sim \operatorname{N}\left(\boldsymbol{\mu},\mathbf{Q}_{\boldsymbol{\theta}}^{-1}\right)   (21)
\boldsymbol{\theta} \sim \pi\left(\boldsymbol{\theta}\right)   (22)

where $\boldsymbol{\theta}$ is a vector of hyperparameters. This stochastic model occurs in many settings (see, e.g., Simpson et al., 2012; Rue and Held, 2005), with $\mathbf{y}$ being observed data, $\mathbf{x}$ a latent field, and $\boldsymbol{\theta}$ a vector of hyperparameters that parameterize the precision matrices $\mathbf{P}$ and $\mathbf{Q}$. The (hyper)prior $\pi\left(\boldsymbol{\theta}\right)$ models uncertainty in the covariance of the two random fields.

There are several options for performing sample-based inference on the model (20), (21), (22). Most direct is forming the posterior distribution $\pi\left(\mathbf{x},\boldsymbol{\theta}|\mathbf{y}\right)$ via Bayes' rule and implementing Markov chain Monte Carlo (MCMC) sampling, typically employing Metropolis-Hastings dynamics with a random walk proposal on $\mathbf{x}$ and $\boldsymbol{\theta}$. Such an algorithm can be very slow due to high correlations within the latent field $\mathbf{x}$, and between the latent field and hyperparameters $\boldsymbol{\theta}$. More efficient algorithms block the latent field, noting that the distribution over $\mathbf{x}$ given everything else is a multivariate normal, and hence can be sampled efficiently as we have discussed in this paper. Higdon (2006) and Bardsley (2012) utilized this structure, along with conjugate hyperpriors on the components of $\boldsymbol{\theta}$, to demonstrate a Gibbs sampler that cycled through sampling from the conditional distributions for $\mathbf{x}$ and components of $\boldsymbol{\theta}$. When the normalizing constant for $\pi\left(\mathbf{x},\boldsymbol{\theta}|\mathbf{y}\right)$ is available, up to a multiplicative constant independent of state, a more efficient algorithm is the one-block algorithm (Rue and Held, 2005, section 4.1.2) in which a candidate $\boldsymbol{\theta}^{\prime}$ is drawn from a random walk proposal, then a draw $\mathbf{x}^{\prime}\sim\pi(\mathbf{x}^{\prime}|\mathbf{y},\boldsymbol{\theta}^{\prime})$ is made, with the joint proposal $\left(\boldsymbol{\theta}^{\prime},\mathbf{x}^{\prime}\right)$ accepted with the standard Metropolis-Hastings probability. The resulting transition kernel in $\boldsymbol{\theta}$ is in detailed balance with the distribution over $\boldsymbol{\theta}|\mathbf{y}$, and hence can improve efficiency dramatically. A further improvement can be to perform MCMC directly on $\pi\left(\boldsymbol{\theta}|\mathbf{y}\right)$ as indicated by Simpson et al. (2012), with subsequent independent sampling $\mathbf{x}\sim\pi\left(\mathbf{x}|\mathbf{y},\boldsymbol{\theta}\right)$ to facilitate Monte Carlo evaluation of statistics. In each of these schemes, computational cost is dominated by the cost of drawing samples from the large multivariate normal $\mathbf{x}\sim\pi\left(\mathbf{x}|\mathbf{y},\boldsymbol{\theta}\right)$. We now demonstrate that sampling step for this synthetic example.

In our example, the distribution over the $100\times 100\times 100$ image $\mathbf{x}$, conditioned on everything else, is the multivariate normal

\pi(\mathbf{x}|\mathbf{y},\theta=R) = \operatorname{N}(\mathbf{x};\,\boldsymbol{\mu}=\mathbf{A}^{-1}\mathbf{F}^{T}\mathbf{P}\mathbf{y},\,\boldsymbol{\Sigma}=\mathbf{A}^{-1})   (23)

with precision matrix $\mathbf{A}=\mathbf{F}^{T}\mathbf{P}\mathbf{F}+\mathbf{Q}_{R}$ (cf. Calvetti and Somersalo (2007); Higdon (2006)). For this calculation we used the same covariance matrix as shown above, so $R=1/4$ in units of the width of the domain, though for sample-based inference one would use samples from the distribution over $R|\mathbf{x},\mathbf{y}$. The right panel of Figure 3 depicts a reconstructed surface derived from a sample from the conditional distribution in (23) using the Chebyshev polynomial accelerated SSOR sampler. The sampler was initialized with the precision matrix $\mathbf{A}$, $\operatorname{E}(\mathbf{c}^{(k)})=\mathbf{F}^{T}\mathbf{P}\mathbf{y}$ for all $k$, and relaxation parameter $\omega=1$. The contour is at value $6$, after smoothing over $3\times 3\times 3$ voxels, displaying a sample surface that separates regions for which the average over $3\times 3\times 3$ voxel blocks is less than $6$ (outside the surface) and greater than $6$ (inside the surface). As can be seen, the surface makes an informative reconstruction of the ellipsoidal phantom.
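Assembling the conditional posterior (23) takes only a few lines once sparse $\mathbf{F}$, $\mathbf{P}$ and $\mathbf{Q}_{R}$ are in hand; the sketch below (ours) shows this step, with sample_ssor_chebyshev a hypothetical stand-in for the Chebyshev accelerated SSOR sampler detailed in the supplementary materials.

```python
def conditional_posterior(F, P, Q_R, y):
    """Precision and noise mean of pi(x | y, R) in (23): A = F^T P F + Q_R, nu = F^T P y."""
    A = (F.T @ P @ F + Q_R).tocsr()        # sparse posterior precision
    nu = F.T @ (P @ y)                     # E(c^(k)) = F^T P y, so the mean is A^{-1} nu
    return A, nu

# A, nu = conditional_posterior(F, sp.identity(n_obs), Q_R, y)
# x_sample = sample_ssor_chebyshev(A, nu, omega=1.0, n_iter=5000)   # hypothetical call
```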

Using CG, estimates of the extreme eigenvalues of $\mathbf{M}_{\text{SSOR}}^{-1}\mathbf{A}$ were $\hat{\lambda}_{\min}=4.38\times 10^{-6}$ and $\hat{\lambda}_{\max}=1-1.36\times 10^{-8}$. By Corollary 6, the asymptotic convergence factors for the Chebyshev sampler are $\sigma\approx 0.9958$ for the mean and $\sigma^{2}\approx 0.9917$ for the covariance matrix. Using this information, equation (16) predicts the number of sampler iterations until convergence. After $k^{*}=4566$ iterations of the Chebyshev accelerated sampler, it is predicted that the mean error is reduced by $\varepsilon=10^{-8}$; that is,

||\boldsymbol{\mu}-\operatorname{E}(\mathbf{y}^{(k^{*})})||_{2}\approx 10^{-8}\,||\boldsymbol{\mu}-\operatorname{E}(\mathbf{y}^{(0)})||_{2}.

But even sooner, after only $k^{**}=k^{*}/2=2283$ iterations, it is predicted that the covariance error is

||\mathbf{A}^{-1}-\operatorname{Var}(\mathbf{y}^{(k^{**})})||\approx 10^{-8}\,||\mathbf{A}^{-1}-\operatorname{Var}(\mathbf{y}^{(0)})||.

Contrast these Chebyshev polynomial convergence results with the performance of the non-accelerated stationary SSOR sampler, which has convergence factors $\varrho(\mathbf{M}^{-1}\mathbf{N})\approx 1-\hat{\lambda}_{\min}=1-4.38\times 10^{-6}$ for the mean error, and $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2}\approx 1-8.76\times 10^{-6}$ for the covariance error. These convergence factors suggest that after running the non-accelerated SSOR Gibbs sampler for $4566$ iterations, the covariance error will be reduced to only $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2\cdot 4566}\approx 0.96$ of the original error; $1.9\times 10^{6}$ iterations are required for a $10^{-8}$ reduction.

The cost difference between the Cholesky factorization and an iterative sampler in this example is dramatic. After finding a machine with the necessary $n^{2}$ memory requirements, the Cholesky factorization would cost about $b^{2}n=10^{16}$ flops (since the bandwidth of the precision matrix $\mathbf{A}$ is about $b=10^{5}$). Since the number of non-zero elements of $\mathbf{A}$ is $3.3\times 10^{8}$, an iterative sampler costs about $6.6\times 10^{8}$ flops per iteration, much less than $n^{2}$. The sample in Figure 3 was generated by $k_{\max}=5\times 10^{3}$ iterations of the Chebyshev accelerated SSOR sampler, at a total cost of $3.3\times 10^{12}$ flops, which is about $3\times 10^{3}$ times faster than Cholesky factoring.
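The arithmetic behind this comparison is summarized in the short sketch below (ours), using the bandwidth, non-zero count and iteration count quoted in this paragraph.

```python
n, b, nnz, k_max = 10**6, 10**5, 3.3e8, 5 * 10**3
cholesky_flops = b**2 * n                  # ~ 1e16 flops for the banded factorization
flops_per_iter = 2 * nnz                   # ~ 6.6e8 flops per sampler iteration
iterative_flops = flops_per_iter * k_max   # ~ 3.3e12 flops in total
print(cholesky_flops / iterative_flops)    # ratio ~ 3e3
```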

7 Discussion

This work began, in part, with a curiosity about the convergence of the sequence of covariance matrices in Gibbs sampling applied to multivariate normal distributions, as studied by Liu et al. (1995). Convergence of that sequence indicates that the algorithm is implicitly implementing some factorization of the target covariance or precision matrix. Which one?

The answer was given by Goodman and Sokal (1989), Amit and Grenander (1991), Barone and Frigessi (1990), and Galli and Gao (2001), that the standard component-sweep Gibbs sampler corresponds to the classical Gauss-Seidel iterative method. That result is given in section 4.2, generalized to arbitrary matrix splittings, showing that any matrix splitting used to generate a deterministic relaxation also induces a stochastic relaxation that is a generalized Gibbs sampler; the linear iterative relaxation and the stochastic relaxation share exactly the same iteration operator, conditions for convergence, and convergence factor, which may be summarized by noting that they share exactly the same error polynomial.

Equivalence of error polynomials is important because they are the central object in designing accelerated solvers including the multigrid, Krylov space, and parallel algorithms. We demonstrated that equivalence explicitly for polynomial acceleration, the basic non-stationary acceleration scheme for linear solvers, showing that this control of the error polynomial can be applied to Gibbs sampling from normal distributions. It follows that, just as for linear solvers, Chebyshev-polynomial accelerated samplers have a smaller average asymptotic convergence factor than their un-accelerated stationary counterparts.

The equivalences noted above are strictly limited to the case of normal target distributions. We are also concerned with continuous non-normal target distributions and whether acceleration of the normal case can usefully inform acceleration of sampling from non-normal distributions. Convergence of the unaccelerated, stationary, iteration applied to bounded perturbations of a normal distribution was established by Amit (1991), though carrying over convergence rates proved more problematic.

There are several possibilities for extending the acceleration techniques to non-normal distributions. A straightforward generalization is to apply Gibbs sampling to the non-normal target, assuming the required conditional distributions are easy to sample from, though using the directions determined by the accelerated algorithm. Simply applying the accelerated algorithm to the non-normal distribution does not lead to optimal acceleration, as demonstrated by Goodman and Sokal (1989).

A second route, which looks more promising to us, is to exploit the connection between Gibbs samplers and linear iterative methods that are often viewed as local solvers for non-linear problems, or equivalently, optimizers for local quadratic approximations to non-quadratic functions. Since a local quadratic approximation to $\log\pi$ is a local Gaussian approximation to $\pi$, the iterations developed here may be used to target this local approximation and hence provide local proposals in an MCMC. We imagine an algorithm along the lines of the trust-region methods from optimization in which the local quadratic (Gaussian) approximation is trusted up to some distance from the current state, implemented via a distance penalty. One or more steps of the iterative sampler would act as a proposal to a Metropolis-Hastings accept/reject step that ensures the correct target distribution. Metropolis adjusted Langevin (MALA) and hybrid Monte Carlo (HMC) turn out to be examples of this scheme (Norton and Fox, 2014), as is the algorithm presented by Green and Han (1992). This naturally raises the question of whether acceleration of the local iteration can accelerate the Metropolised algorithm. This remains a topic for ongoing research.

Acknowledgments

This work was partially funded by the New Zealand Institute for Mathematics and its Applications (NZIMA) thematic programme on Analysis, Applications and Inverse Problems in PDEs, and Marsden contract UOO1015.

Appendix A Appendix

A.1 Stationary sampler convergence (Proof of Theorem 2 and Corollary 3)

First, the theorem and corollary are established for the mean. Since $\mathbf{A}=\mathbf{M}-\mathbf{N}$ is a convergent splitting, (10) and Theorem 1 show that $\operatorname{E}(\mathbf{c}^{(k)})=\boldsymbol{\nu}$ if and only if $\operatorname{E}(\mathbf{y}^{(k)})\to\mathbf{A}^{-1}\boldsymbol{\nu}$, with the same convergence factor as for the linear solver. To establish convergence of the variance, let $\mathbf{G}=\mathbf{M}^{-1}\mathbf{N}$ in (10); then $\mathbf{y}^{(k)}=\mathbf{G}^{k}\mathbf{y}^{(0)}+\sum_{i=0}^{k-1}\mathbf{G}^{i}(\mathbf{M}^{-1}\mathbf{c}^{(k-1-i)})$. This equation and the independence of $\{\mathbf{c}^{(i)}\}$ show that $\operatorname{Var}(\mathbf{y}^{(k)}|\mathbf{y}^{(0)})=\sum_{i=0}^{k-1}\left(\mathbf{G}^{i}\mathbf{M}^{-1}\operatorname{Var}(\mathbf{c}^{(i)})\mathbf{M}^{-T}(\mathbf{G}^{i})^{T}\right)$. Theorem 1 establishes the existence of a unique limiting distribution with a non-zero covariance matrix $\mathbf{\Gamma}$. Thus, for $\mathbf{y}^{(i)},\mathbf{y}^{(i+1)}\sim\Pi$, (10) implies

\mathbf{\Gamma}=\mathbf{G}\mathbf{\Gamma}\mathbf{G}^{T}+\mathbf{M}^{-1}\operatorname{Var}(\mathbf{c}^{(i)})\mathbf{M}^{-T}   (24)

since $\mathbf{y}^{(i)}$ and $\mathbf{c}^{(i)}$ are independent. Thus $\operatorname{Var}(\mathbf{y}^{(k)}|\mathbf{y}^{(0)})=\mathbf{\Gamma}-\mathbf{G}^{k}\mathbf{\Gamma}(\mathbf{G}^{k})^{T}$, and so

\operatorname{Var}(\mathbf{y}^{(k)})=\mathbf{\Gamma}-\mathbf{G}^{k}\left(\mathbf{\Gamma}-\operatorname{Var}(\mathbf{y}^{(0)})\right)(\mathbf{G}^{k})^{T}.   (25)

That is, $\operatorname{Var}(\mathbf{y}^{(k)})\to\mathbf{\Gamma}$ with convergence factor $\varrho(\mathbf{M}^{-1}\mathbf{N})^{2}$. To prove that part (b) of the theorem implies part (a), consider the starting vector $\mathbf{y}^{(0)}\sim\Pi$ with covariance matrix $\mathbf{\Gamma}=\mathbf{A}^{-1}$. Since $\mathbf{c}^{(k)}$ is independent of $\mathbf{y}^{(k)}$, the relation (24) shows that $\operatorname{Var}(\mathbf{c}^{(k)})=\mathbf{M}(\mathbf{A}^{-1}-\mathbf{G}\mathbf{A}^{-1}\mathbf{G}^{T})\mathbf{M}^{T}=\mathbf{M}\mathbf{A}^{-1}\mathbf{M}^{T}-\mathbf{N}\mathbf{A}^{-1}\mathbf{N}^{T}$. Substituting in $\mathbf{N}=\mathbf{M}-\mathbf{A}$ shows that $\operatorname{Var}(\mathbf{c}^{(k)})=\mathbf{M}^{T}+\mathbf{N}$. To prove that (a) implies (b), consider $\mathbf{y}^{(0)}\sim\operatorname{N}(\boldsymbol{\mu},\mathbf{A}^{-1})$. By (25), $\mathbf{\Gamma}-\operatorname{Var}(\mathbf{y}^{(1)})=\mathbf{G}(\mathbf{\Gamma}-\mathbf{A}^{-1})\mathbf{G}^{T}$. Substituting $\operatorname{Var}(\mathbf{c}^{(0)})=\mathbf{M}(\mathbf{A}^{-1}-\mathbf{G}\mathbf{A}^{-1}\mathbf{G}^{T})\mathbf{M}^{T}$ into equation (24) shows $\mathbf{\Gamma}-\mathbf{A}^{-1}=\mathbf{G}(\mathbf{\Gamma}-\mathbf{A}^{-1})\mathbf{G}^{T}$. Thus $\operatorname{Var}(\mathbf{y}^{(1)})=\mathbf{A}^{-1}$, which shows that $\operatorname{Var}(\mathbf{y}^{(k)})$ has converged to $\mathbf{A}^{-1}$. By Theorem 1, $\mathbf{\Gamma}=\mathbf{A}^{-1}$.

A.2 Polynomial accelerated sampler convergence (Proof of Theorem 5 and Corollary 6)

If the polynomial accelerated linear solver (11) converges, then $\operatorname{E}(\mathbf{y}^{(k+1)})\to\mathbf{A}^{-1}\operatorname{E}(\mathbf{c}^{(k)})=\boldsymbol{\mu}$. To determine $\operatorname{Var}(\mathbf{c}^{(k)})$, rewrite the iteration (17) as $\mathbf{y}^{(k+1)}=(1-\alpha_{k})\mathbf{y}^{(k-1)}+\alpha_{k}\mathbf{G}^{(k)}\mathbf{y}^{(k)}+\alpha_{k}\left(\mathbf{M}^{(k)}\right)^{-1}\mathbf{c}^{(k)}$, where $\mathbf{M}^{(k)}=\frac{1}{\tau_{k}}\mathbf{M}$, $\mathbf{N}^{(k)}=\mathbf{M}^{(k)}-\mathbf{A}$, and $\mathbf{G}^{(k)}=\mathbf{I}-\tau_{k}\mathbf{M}^{-1}\mathbf{A}=\left(\mathbf{M}^{(k)}\right)^{-1}\mathbf{N}^{(k)}$. First, we consider $\mathbf{y}^{(i-1)}\sim\operatorname{N}(\boldsymbol{\mu},\mathbf{A}^{-1})$ and then find the $\operatorname{Var}(\mathbf{c}^{(k)})$ that will guarantee that $\mathbf{y}^{(i)},\mathbf{y}^{(i+1)}\sim\operatorname{N}(\boldsymbol{\mu},\mathbf{A}^{-1})$. Since the $\{\mathbf{c}^{(i)}\}$ are independent of the $\{\mathbf{y}^{(i)}\}$, the above equation for $\mathbf{y}^{(k+1)}$ shows that $\operatorname{Var}(\mathbf{c}^{(k)})$ is equal to

\frac{1}{\alpha_{k}^{2}}\mathbf{M}^{(k)}\left((1-(1-\alpha_{k})^{2})\mathbf{A}^{-1}-2(1-\alpha_{k})\alpha_{k}(\mathbf{G}^{(k)}\mathbf{K}^{(k)}+\mathbf{K}^{(k)T}\mathbf{G}^{(k)T})-\alpha_{k}^{2}\mathbf{G}^{(k)}\mathbf{A}^{-1}\mathbf{G}^{(k)T}\right)\mathbf{M}^{(k)}

where $\mathbf{K}^{(k)}:=\operatorname{Cov}(\mathbf{y}^{(k-1)},\mathbf{y}^{(k)})$. To simplify this expression, we need Lemma 8, which gives $\mathbf{K}^{(k)}$ explicitly. Parts (1) and (2) of the lemma show that

\operatorname{Var}(\mathbf{c}^{(k)})=\frac{1}{\alpha_{k}^{2}}\mathbf{M}^{(k)}\left(\alpha_{k}^{2}(\mathbf{A}^{-1}-\mathbf{G}^{(k)}\mathbf{A}^{-1}\mathbf{G}^{(k)T})+2(1-\alpha_{k})\alpha_{k}(\mathbf{A}^{-1}-\mathbf{G}_{\kappa}^{(k)}\mathbf{A}^{-1}\mathbf{G}^{(k)T})\right)\mathbf{M}^{(k)}.

Part (3) of Lemma 8 shows that $\operatorname{Var}(\mathbf{c}^{(k)})$ has the form specified in the theorem.

Lemma 8

For a symmetric splitting $\mathbf{A}=\mathbf{M}-\mathbf{N}$,

1. $\mathbf{K}^{(k)}$ is symmetric.

2. $\mathbf{K}^{(k)}=\mathbf{G}_{\kappa}^{(k)}\mathbf{A}^{-1}$, where $\mathbf{G}_{\kappa}^{(k)}=\mathbf{I}-\kappa_{k}\mathbf{M}^{-1}\mathbf{A}$ and $\kappa_{k+1}:=\alpha_{k}\tau_{k}+(1-\alpha_{k})\kappa_{k}$.

3. $\mathbf{A}^{-1}-\mathbf{G}_{\tau}\mathbf{A}^{-1}\mathbf{G}_{\kappa}^{T}=\tau\kappa\,\mathbf{M}^{-1}\left((1/\tau+1/\kappa)\mathbf{M}-\mathbf{A}\right)\mathbf{M}^{-1}$.

Proof. To determine $\mathbf{K}^{(k)}$, rewrite the Chebyshev iteration (17) as

\mathbf{Y}^{(k+1)}=\left(\begin{array}{cc}\alpha_{k}\mathbf{G}^{(k)}&\left(1-\alpha_{k}\right)\mathbf{I}\\ \mathbf{I}&\mathbf{0}\end{array}\right)\mathbf{Y}^{(k)}+\alpha_{k}\left(\begin{array}{c}\mathbf{g}^{(k)}\\ \mathbf{0}\end{array}\right)

where $\mathbf{Y}^{(0)}=\left(\begin{array}{c}\mathbf{y}^{(0)}\\ \mathbf{0}\end{array}\right)$, $\mathbf{Y}^{(k+1)}=\left(\begin{array}{c}\mathbf{y}^{(k+1)}\\ \mathbf{y}^{(k)}\end{array}\right)$, and $\mathbf{g}^{(k)}=\left(\mathbf{M}^{(k)}\right)^{-1}\mathbf{c}^{(k)}$. Letting $\mathcal{G}^{(k)}=\left(\begin{array}{cc}\alpha_{k}\mathbf{G}^{(k)}&\left(1-\alpha_{k}\right)\mathbf{I}\\ \mathbf{I}&\mathbf{0}\end{array}\right)$ shows that

\operatorname{Var}\left(\mathbf{Y}^{(k+1)}\right)=\mathcal{G}^{(k)}\operatorname{Var}\left(\mathbf{Y}^{(k)}\right)\mathcal{G}^{(k)T}+\alpha_{k}^{2}\left(\begin{array}{cc}\operatorname{Var}\left(\mathbf{g}^{(k)}\right)&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{array}\right).   (28)

If $\operatorname{Var}\left(\mathbf{y}^{(0)}\right)=\mathbf{A}^{-1}$ then $\operatorname{Var}\left(\mathbf{y}^{(k)}\right)=\mathbf{A}^{-1}$ for $k\geq 1$, in which case $\operatorname{Var}\left(\mathbf{Y}^{(k+1)}\right)$ is

\left(\begin{array}{cc}\mathbf{A}^{-1}&\mathbf{K}^{(k+1)}\\ \mathbf{K}^{(k+1)T}&\mathbf{A}^{-1}\end{array}\right)=\mathcal{G}^{(k)}\left(\begin{array}{cc}\mathbf{A}^{-1}&\mathbf{K}^{(k)}\\ \mathbf{K}^{(k)T}&\mathbf{A}^{-1}\end{array}\right)\mathcal{G}^{(k)T}+\left(\begin{array}{cc}\alpha_{k}^{2}\operatorname{Var}\left(\mathbf{g}^{(k)}\right)&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{array}\right).   (35)

By definition of $\mathbf{Y}^{(0)}$, $\mathbf{K}^{(0)}=\mathbf{0}$; for $k\geq 0$,

\mathbf{K}^{(k+1)}=\alpha_{k}\mathbf{G}^{(k)}\mathbf{A}^{-1}+\left(1-\alpha_{k}\right)\mathbf{K}^{(k)T}.   (36)

Since $\alpha_{0}=1$, $\mathbf{K}^{(1)}=\mathbf{G}^{(0)}\mathbf{A}^{-1}$, which proves parts (1) and (2) of the Lemma for $k=0$ since $\kappa_{1}=\tau_{0}$ and $\mathbf{G}_{\kappa}^{(k)}\mathbf{A}^{-1}$ is symmetric. Assuming that $\mathbf{K}^{(k)}=\mathbf{G}^{(k)}_{\kappa}\mathbf{A}^{-1}$ for $k>0$, the recursion in (36) gives $\mathbf{K}^{(k+1)}=\left(\mathbf{I}-\left[\alpha_{k}\tau_{k}+\left(1-\alpha_{k}\right)\kappa_{k}\right]\mathbf{M}^{-1}\mathbf{A}\right)\mathbf{A}^{-1}$, so the expansion and recursion hold for $k+1$, and parts (1) and (2) of the Lemma follow by induction. Part (3) of the Lemma follows from the equation

\mathbf{A}^{-1}-\mathbf{G}_{\tau}\mathbf{A}^{-1}\mathbf{G}_{\kappa}^{T}=\mathbf{M}_{\tau}^{-1}\left(\mathbf{M}_{\tau}\mathbf{A}^{-1}\mathbf{M}_{\kappa}^{T}\right)\mathbf{M}_{\kappa}^{-T}-\mathbf{M}_{\tau}^{-1}\mathbf{N}_{\tau}\mathbf{A}^{-1}\left(\mathbf{M}_{\kappa}^{-1}\mathbf{N}_{\kappa}\right)^{T}. \qquad \Box

The selection of $\operatorname{Var}(\mathbf{c}^{(k)})=a_{k}\mathbf{M}+b_{k}\mathbf{N}$ assures that if $\operatorname{Var}\left(\mathbf{y}^{(0)}\right)=\mathbf{A}^{-1}$, then $\operatorname{Var}\left(\mathbf{y}^{(k)}\right)=\mathbf{A}^{-1}$ for $k\geq 1$. Thus, subtracting (35) from (28) gives

\operatorname{Var}\left(\mathbf{Y}^{(k+1)}\right)-\left(\begin{array}{cc}\mathbf{A}^{-1}&\mathbf{K}^{(k+1)}\\ \mathbf{K}^{(k+1)T}&\mathbf{A}^{-1}\end{array}\right)=\mathcal{G}^{(k)}\left(\operatorname{Var}\left(\mathbf{Y}^{(k)}\right)-\left(\begin{array}{cc}\mathbf{A}^{-1}&\mathbf{K}^{(k)}\\ \mathbf{K}^{(k)T}&\mathbf{A}^{-1}\end{array}\right)\right)\mathcal{G}^{(k)T}

or $\mathcal{E}^{(k+1)}=\mathcal{G}^{(k)}\mathcal{E}^{(k)}\mathcal{G}^{(k)T}$ for $k\geq 0$, where $\mathcal{E}^{(k)}=\operatorname{Var}\left(\mathbf{Y}^{(k)}\right)-\left(\begin{array}{cc}\mathbf{A}^{-1}&\mathbf{K}^{(k)}\\ \mathbf{K}^{(k)T}&\mathbf{A}^{-1}\end{array}\right)$. Hence, by recursion, $\mathcal{E}^{(k)}=\left(\prod_{l=0}^{k-1}\mathcal{G}^{(l)}\right)\mathcal{E}^{(0)}\left(\prod_{l=0}^{k-1}\mathcal{G}^{(l)}\right)^{T}$.

Denote the polynomial of the block matrix by $\mathcal{P}^{(k+1)}=\prod_{l=0}^{k}\mathcal{G}^{(l)}$, which satisfies

\mathcal{P}^{(k+1)}=\mathcal{G}^{(k)}\mathcal{P}^{(k)}=\left(\begin{array}{cc}\alpha_{k}\mathbf{G}^{(k)}&\left(1-\alpha_{k}\right)\mathbf{I}\\ \mathbf{I}&\mathbf{0}\end{array}\right)\left(\begin{array}{cc}\mathcal{P}_{11}^{(k)}&\mathcal{P}_{12}^{(k)}\\ \mathcal{P}_{21}^{(k)}&\mathcal{P}_{22}^{(k)}\end{array}\right)

with $\mathcal{P}^{(1)}=\mathcal{G}^{(0)}=\left(\begin{array}{cc}\mathbf{G}^{(0)}&\mathbf{0}\\ \mathbf{I}&\mathbf{0}\end{array}\right)$. Thus

\mathcal{P}_{11}^{(k+1)}=\alpha_{k}\mathbf{G}^{(k)}\mathcal{P}_{11}^{(k)}+\left(1-\alpha_{k}\right)\mathcal{P}_{11}^{(k-1)}=\alpha_{k}\left(\mathbf{I}-\tau_{k}\mathbf{M}^{-1}\mathbf{A}\right)\mathcal{P}_{11}^{(k)}+\left(1-\alpha_{k}\right)\mathcal{P}_{11}^{(k-1)}

with $\mathcal{P}_{11}^{(1)}=\mathbf{G}^{(0)}$, which shows that $\mathcal{P}_{11}^{(k+1)}=P_{k+1}$ by (12). Furthermore, this shows that the error in variance at the $k$th iteration has the specified form and convergence factor.

References

  • Adler (1981) Adler, S. L. (1981). Over-relaxation method for the Monte Carlo evaluation of the partition function for multiquadratic actions. Phys. Rev. D 23(12), 2901–2904.
  • Alpkvist and Klapper (2007) Alpkvist, E. and I. Klapper (2007). A multidimensional multispecies continuum model for heterogeneous biofilm development. Bulletin of Mathematical Biology 69, 765–789.
  • Amit (1991) Amit, Y. (1991). On rates of convergence of stochastic relaxation for Gaussian and non-Gaussian distributions. Journal of Multivariate Analysis 38(1), 82–99.
  • Amit and Grenander (1991) Amit, Y. and U. Grenander (1991). Comparing sweep strategies for stochastic relaxation. Journal of Multivariate Analysis 37, 197–222.
  • Axelsson (1996) Axelsson, O. (1996). Iterative Solution Methods. Cambridge University Press.
  • Banerjee et al. (2003) Banerjee, S., A. E. Gelfand, and B. P. Carlin (2003). Hierarchical Modeling and Analysis for Spatial Data. Chapman and Hall.
  • Bardsley (2012) Bardsley, J. M. (2012). MCMC-based image reconstruction with uncertainty quantification. SIAM J. Sci. Comput. 34(3), A1316–A1332.
  • Barone and Frigessi (1990) Barone, P. and A. Frigessi (1990). Improving stochastic relaxation for Gaussian random fields. Probability in the Engineering and Informational Sciences 23, 2901–2904.
  • Barone et al. (2002) Barone, P., G. Sebastiani, and J. Stander (2002). Over-relaxation methods and coupled Markov chains for Monte Carlo simulation. Statistics and Computing 12, 17–26.
  • Besag (1986) Besag, J. (1986). On the statistical analysis of dirty pictures. J. R. Statist. Soc. B 48(3), 259–302.
  • Besag and Green (1993) Besag, J. and P. J. Green (1993). Spatial statistics and Bayesian computation. J. R. Statist. Soc. B 55(1), 27–37.
  • Calvetti and Somersalo (2007) Calvetti, D. and E. Somersalo (2007). Introduction to Bayesian Scientific Computing. Springer.
  • Cui et al. (2011) Cui, T., C. Fox, and M. J. O’Sullivan (2011). Bayesian calibration of a large-scale geothermal reservoir model by a new adaptive delayed acceptance Metropolis Hastings algorithm. Water Resources Research 47(10), W10521.
  • Diaconis and Freedman (1999) Diaconis, P. and D. Freedman (1999). Iterated random functions. SIAM Review 41(1), 45–76.
  • Duflo (1997) Duflo, M. (1997). Random Iterative Models. Berlin: Springer-Verlag.
  • Feller (1968) Feller, W. (1968). An Introduction to Probability Theory and its Applications. Wiley.
  • Fishman (1996) Fishman, G. S. (1996). Coordinate selection rules for Gibbs sampling. The Annals of Applied Probability 6(2), 444–465.
  • Fox (2008) Fox, C. (2008). A conjugate direction sampler for normal distributions, with a few computed examples. Technical Report 2008-1, Electronics Group, University of Otago.
  • Fox and Parker (2014) Fox, C. and A. Parker (2014). Convergence in variance of Chebyshev accelerated Gibbs samplers. SIAM Journal on Scientific Computing 36(1), A124–A147.
  • Fox and Parker (1968) Fox, L. and I. Parker (1968). Chebyshev polynomials in numerical analysis. Oxford, UK: Oxford University Press.
  • Galli and Gao (2001) Galli, A. and H. Gao (2001). Rate of convergence of the Gibbs sampler in the Gaussian case. Mathematical Geology 33(6), 653–677.
  • Gelfand and Smith (1990) Gelfand, A. E. and A. Smith (1990). Sampling-based approaches to calculating marginal densities. J. Am. Stat. 85, 398–409.
  • Gelman et al. (1995) Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995). Bayesian Data Analysis. Chapman & Hall.
  • Geman and Geman (1984) Geman, S. and D. Geman (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Analysis and Machine Intelligence 6, 721–741.
  • Geyer (2011) Geyer, C. J. (2011). Introduction to Markov chain Monte Carlo. In S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng (Eds.), Handbook of Markov Chain Monte Carlo, pp.  3–48. Chapman & Hall.
  • Gilks et al. (1996) Gilks, W. R., S. Richardson, and D. Spiegelhalter (1996). Introducing Markov chain Monte Carlo. In W. R. Gilks, S. Richardson, and D. Spiegelhalter (Eds.), Markov chain Monte Carlo in Practice, pp.  1–16. Chapman & Hall.
  • Gneiting et al. (2005) Gneiting, T., H. Sevcikova, D. B. Percival, M. Schlather, and Y. Jiang (2005). Fast and exact simulation of large Gaussian lattice systems in two dimensions: Exploring the limits. Technical Report 477, Department of Statistics, University of Washington.
  • Golub and Van Loan (1989) Golub, G. H. and C. F. Van Loan (1989). Matrix Computations (2nd ed.). Baltimore: The Johns Hopkins University Press.
  • Goodman and Sokal (1989) Goodman, J. and A. D. Sokal (1989). Multigrid Monte Carlo method. Conceptual foundations. Phys. Rev. D 40(6), 2035–2071.
  • Green and Han (1992) Green, P. and X. Han (1992). Metropolis methods, Gaussian proposals and antithetic variables. In A. Frigessi, P. Barone, and M. Piccioni (Eds.), Stochastic Models, statisical methods and algorithms in image analysis, Lecture Notes in Statistics, pp.  142–164. Berlin: Springer-Verlag.
  • Grenander (1983) Grenander, U. (1983). Tutorial in pattern theory. Technical report, Division of Applied Mathematics, Brown University.
  • Guttorp and Gneiting (2005) Guttorp, P. and T. Gneiting (2005). On the Whittle-Matérn correlation family. Technical Report 80, NRCSE University of Washington.
  • Haario et al. (2001) Haario, H., E. Saksman, and J. Tamminen (2001). An adaptive Metropolis algorithm. Bernoulli 7, 223–242.
  • Higdon (2006) Higdon, D. (2006). A primer on space-time modelling from a Bayesian perspective. In B. Finkenstadt, L. Held, and V. Isham (Eds.), Statistics of Spatio-Temporal Systems, New York, pp.  217–279. Chapman & Hall/CRC.
  • Lewandowski and Beyenal (2014) Lewandowski, Z. and H. Beyenal (2014). Fundamentals of Biofilm Research. Boca Raton, FL: CRC Press.
  • Lindgren et al. (2011) Lindgren, F., H. Rue, and J. Lindström (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J. R. Statist. Soc. B 73(4), 423–498.
  • Liu et al. (1995) Liu, J. S., W. H. Wong, and A. Kong (1995). Covariance structure and convergence rate of the Gibbs sampler with various scans. J. R. Statist. Soc. B 57(1), 157–169.
  • MacKay (2003) MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge, UK: Cambridge University Press.
  • Meurant (2006) Meurant, G. (2006). The Lanczos and Conjugate Gradient Algorithms. Philadelphia: SIAM.
  • Neal (1997) Neal, R. (1997, January). Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report 9702, Department of Statistics, University of Toronto.
  • Neal (1998) Neal, R. (1998). Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation. In M. I. Jordan (Ed.), Learning in Graphical Models, pp. 205–225. Kluwer Academic Publishers.
  • Neumayer (2011) Neumayer, M. (2011). Accelerated Bayesian Inversion and Calibration for Electrical Tomography. Ph. D. thesis, Graz University of Technology.
  • Nevanlinna (1993) Nevanlinna, O. (1993). Convergence of Iterations for Linear Equations. Birkhauser.
  • Norton and Fox (2014) Norton, R. A. and C. Fox (2014). Efficiency and computability of MCMC with Langevin, Hamiltonian, and other matrix-splitting proposals. Unpublished manuscript.
  • Parker and Fox (2012) Parker, A. and C. Fox (2012). Sampling Gaussian distributions in Krylov spaces with conjugate gradients. SIAM Journal on Scientific Computing 34(3), B312–B334.
  • Robert and Casella (1999) Robert, C. P. and G. Casella (1999, August). Monte Carlo Statistical Methods (1 ed.). Springer-Verlag.
  • Robert and Casella (2011) Robert, C. P. and G. Casella (2011). A short history of Markov chain Monte Carlo: Subjective recollections from incomplete data. Statist. Sci. 26(1), 102–115.
  • Roberts and Rosenthal (2007) Roberts, G. O. and J. S. Rosenthal (2007). Coupling and ergodicity of adaptive MCMC. J. Appl. Prob. 44, 458–475.
  • Roberts and Sahu (1997) Roberts, G. O. and S. Sahu (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. J. R. Statist. Soc. B 59(2), 291–317.
  • Rue (2001) Rue, H. (2001). Fast sampling of Gaussian Markov random fields. J. R. Statist. Soc. B 63, 325–338.
  • Rue and Held (2005) Rue, H. and L. Held (2005). Gaussian Markov random fields : Theory and applications. New York: Chapman Hall.
  • Saad (2003) Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.). SIAM.
  • Saad and van der Vorst (2000) Saad, Y. and H. A. van der Vorst (2000). Iterative solution of linear systems in the 20th century. Journal of Computational and Applied Mathematics 123, 1–33.
  • Sheppard and Shotton (1997) Sheppard, C. J. R. and D. R. Shotton (1997). Confocal Laser Scanning Microscopy. Garland Science.
  • Simpson et al. (2012) Simpson, D., F. Lindgren, and H. Rue (2012). Think continuous: Markovian Gaussian models in spatial statistics. Spatial Statistics 1, 16–29.
  • Sokal (1993) Sokal, A. D. (1993). Discussion on the meeting on the Gibbs sampler and other Markov chain Monte Carlo methods. J. R. Statist. Soc. B 55(1), 87.
  • Swogger and Pitts (2005) Swogger, E. and B. Pitts (2005). CSLM 3D view of pseudomonas aeruginosa biofilm structure. Montana State University Center for Biofilm Engineering. Available at http://www.biofilm.montana.edu/resources/movies/2005/2005m06.html.
  • Turčin (1971) Turčin, V. (1971). On the computation of multidimensional integrals by the Monte Carlo method. Theory of Probability and Its Applications 16, 720–724.
  • Varga (1962) Varga, R. S. (1962). Matrix Iterative Analysis. Prentice-Hall.
  • Watkins (2002) Watkins, D. (2002). Fundamentals of Matrix Computations (2nd ed.). New York: Wiley.
  • Whittle (1954) Whittle, P. (1954). On stationary processes in the plane. Biometrika 41, 434–449.
  • Young (1971) Young, D. M. (1971). Iterative Solution of Large Linear Systems. Academic Press.