
Data-Free Likelihood-Informed Dimension Reduction of Bayesian Inverse Problems

Tiangang Cui¹ and Olivier Zahm². ¹School of Mathematics, Monash University, VIC 3800, Australia. ²Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. tiangang.cui@monash.edu, olivier.zahm@inria.fr
Abstract

Identifying a low-dimensional informed parameter subspace offers a viable path to alleviating the dimensionality challenge in the sample-based solution of large-scale Bayesian inverse problems. This paper introduces a novel gradient-based dimension reduction method in which the informed subspace does not depend on the data. This permits an online-offline computational strategy in which the computationally expensive detection of the low-dimensional structure of the problem is carried out in an offline phase, that is, before observing the data. This strategy is particularly relevant for multiple inversion problems, as the same informed subspace can be reused. The proposed approach allows controlling the approximation error (in expectation over the data) of the posterior distribution. We also present sampling strategies that exploit the informed subspace to efficiently draw samples from the exact posterior distribution. The method is successfully illustrated on two numerical examples: a PDE-based inverse problem with a Gaussian process prior and a tomography problem with Poisson data and a Besov \mathcal{B}^{2}_{11} prior.

1 Introduction

The Bayesian approach to inverse problems builds a probabilistic representation of the parameter of interest conditioned on observed data. Denoting the parameter and data by x\in\mathbb{R}^{d} and y\in\mathbb{R}^{m}, respectively, the solution to the inverse problem is encapsulated in the posterior distribution, which has the probability density function (pdf)

\pi_{\text{pos}}^{y}(x)=\frac{1}{\pi_{\text{data}}(y)}\mathcal{L}^{y}(x)\pi_{\text{pr}}(x), (1)

where \pi_{\text{pr}}(x) denotes the prior density, \mathcal{L}^{y}(x) is the likelihood function, and \pi_{\text{data}}(y) is the marginal density of the data y that can be expressed as

\pi_{\text{data}}(y)=\int_{\mathbb{R}^{d}}\mathcal{L}^{y}(x)\pi_{\text{pr}}(x)\,\mathrm{d}x. (2)

This way, one can encode the posterior into summary statistics, for example, moments, quantiles, or probabilities of some events of interest [31, 51, 52], to provide parameter inference and associated uncertainty quantification.

In practice, computing these summary statistics requires dedicated methods to efficiently characterize the posterior distribution. Markov chain Monte Carlo (MCMC) methods [9, 38], originating with the Metropolis-Hastings algorithm [29, 41], and sequential Monte Carlo (SMC) methods [22] have been developed as workhorses in this context. However, many inverse problems have high-dimensional or infinite-dimensional parameter spaces, which present a significant hurdle to the applicability of MCMC, SMC, and other related sampling methods in general.

The efficiency of these sampling methods, measured by the required number of posterior density evaluations, may deteriorate with the dimension of the parameter space, see [47, 48] and references therein. Even under the rather strong log-concavity assumption, state-of-the-art MCMC methods can still be sensitive to the dimension of the problem, see for instance [20, 24, 25].

One promising way to alleviate the challenge of dimensionality is to exploit the effectively low-dimensional structures of the posterior distribution. Such low-dimensional structures can be used to construct certified low-dimensional approximations of the posterior distribution [50, 54] and efficient MCMC proposals that are robust to the parameter dimension [5, 6, 17, 16, 33]. There exist several ways to detect low-dimensional structures. A widely used approach is to exploit the regularity of the prior, in which the dominant eigenvectors of the prior covariance operator [40] are used to define such a low-dimensional subspace. This prior-based dimension reduction also plays a key role in the analysis of high-dimensional integration methods such as [26, 27]. In addition to the prior regularity, the limited accuracy of the observations and the ill-posed nature of the forward model often allow one to express the posterior as a low-dimensional update of the prior. Methods such as the active subspace (AS) method [13, 14] and the likelihood-informed subspace (LIS) method [18, 19, 54] utilize gradients of the forward model and/or of the likelihood function in order to better identify the low-dimensional structure of the problem. We refer to [19, 54] for an overview and a comparison of the existing methods.

The success of AS and LIS relies on the computation of the gradient or the Hessian of the log-likelihood function. Since the likelihood function depends on the observed data, the resulting subspaces need to be reconstructed each time a new data set is observed. This can add a significant computational burden to the solution of inverse problems. In this paper, we present a new data-free strategy for constructing the informed subspace, in which the computationally costly subspace construction is performed in an offline phase, that is, before observing any data. In the subsequent online phase, the data set is observed and the precomputed informed subspace is used to accelerate the inversion. This computational strategy is particularly relevant for real-time systems such as medical imaging, where multiple inversions are needed.

The rest of the paper is organized as follows. To begin, we introduce the problem setting in Section 2. In Section 3, we present a new data-free likelihood-informed approach to construct the subspace. Denoting the Fisher information matrix of the likelihood function by \mathcal{I}(x)=\int(\nabla\log\mathcal{L}^{y}(x))(\nabla\log\mathcal{L}^{y}(x))^{\top}\mathcal{L}^{y}(x)\mathrm{d}y, this approach defines the informed subspace as the rank-r dominant eigenspace of the matrix

H=\int\mathcal{I}(x)\,\pi_{\text{pr}}(x)\mathrm{d}x, (3)

with r\ll d. This definition makes no particular assumption on the likelihood function, so it can be applied to a wide range of measurement processes, e.g., Gaussian likelihoods and Poisson likelihoods. It also does not involve any particular data set y, and hence the informed subspace can be constructed offline. Given the informed subspace, we approximate the posterior density \pi_{\text{pos}}^{y}(x) by

\widetilde{\pi}_{\text{pos}}^{y}(x)=\widetilde{\pi}_{\text{pos}}^{y}(x_{r})\pi_{\text{pr}}(x_{\perp}|x_{r}), (4)

where x_{r} and x_{\perp} denote respectively the informed and the non-informed components of x. We prove that the expected Kullback-Leibler (KL) divergence of the full posterior from its approximation is bounded as

\mathbb{E}[D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\widetilde{\pi}_{\text{pos}}^{Y})]\leq\kappa\sum_{i>r}\lambda_{i}(H), (5)

where the expectation is taken over the data Y\sim\pi_{\text{data}}(y), \kappa is the subspace Poincaré constant of the prior [53, 54], and \lambda_{i}(H) is the i-th largest eigenvalue of H. This way, a problem with a fast decay in the spectrum of H yields an accurate low-dimensional posterior approximation in expectation over the data.

In Section 4, we restrict the analysis to Gaussian likelihoods. In this case, we show that the vector-valued extension [53] of the AS method [12], which reduces the parameter dimension by approximating the forward model, leads to the same data-free informed subspace as that obtained using (3). We further show that, although the likelihood-informed approach and AS employ different approximations of the posterior density, the resulting approximations share the same structure as in (4) and obey the same error bound as in (5).

As suggested by (4), the factorized form of the approximate posterior densities allows for dimension-robust sampling. One can explore the intractable low-dimensional parameter-reduced posterior \widetilde{\pi}_{\text{pos}}^{y}(x_{r}) using methods such as MCMC, followed by direct sampling of the high-dimensional but tractable conditional prior \pi_{\text{pr}}(x_{\perp}|x_{r}). This strategy has been previously investigated, see [18, 54] and references therein. We provide a brief summary of this existing sampling strategy in Section 5. Despite the accelerated sampling offered by the informed subspace, the resulting inference results are subject to the dimension truncation error bounded in (5). In Section 6, by integrating the pseudo-marginal approach [1, 2] and the surrogate transition approach [11, 38, 39] into the abovementioned sampling strategy, we present new exact inference algorithms that enjoy the same subspace acceleration while targeting the full posterior. Our exact inference algorithms require only minor modifications to the sampling strategy of [18, 54].

While our dimension reduction method readily applies to Gaussian priors, its application to non-Gaussian priors might not be straightforward. In Section 7, we show how to use the proposed method for problems with Besov priors [21, 32, 34], which are commonly used in image reconstruction problems.

We demonstrate the accuracy of the proposed data-free LIS and the efficiency of the new sampling strategies on a range of problems. These include the identification of the diffusion coefficient of a two-dimensional elliptic partial differential equation (PDE) with a Gaussian prior in Section 8 and positron emission tomography (PET) with Poisson data and a Besov prior in Section 9.

2 Problem setting

For high-dimensional ill-posed inverse problems, the data are often informative only along a few directions in the parameter space. To detect and exploit this low-dimensional structure, we introduce a projector P_{r}\in\mathbb{R}^{d\times d} of rank r\ll d such that \text{Im}(P_{r}) is the informed subspace and \text{Ker}(P_{r}) the non-informed one. This splits the parameter space as

\mathbb{R}^{d}=\text{Im}(P_{r})\oplus\text{Ker}(P_{r}),

where the subspaces \text{Im}(P_{r}) and \text{Ker}(P_{r}) are not necessarily orthogonal, unless P_{r} is an orthogonal projector. The fact that the data are only informative in \text{Im}(P_{r}) means that there exists an approximation to the posterior density \pi_{\text{pos}}^{y}(x)\propto\mathcal{L}^{y}(x)\pi_{\text{pr}}(x) of the form

\widetilde{\pi}_{\text{pos}}^{y}(x)\propto\widetilde{\mathcal{L}}^{y}(P_{r}x)\pi_{\text{pr}}(x), (6)

in which the likelihood function x\mapsto\mathcal{L}^{y}(x) is replaced by a ridge function x\mapsto\widetilde{\mathcal{L}}^{y}(P_{r}x). A ridge function [46] is a function which is constant along a subspace, here \text{Ker}(P_{r}). Let x_{r}=P_{r}x and x_{\perp}=(I_{d}-P_{r})x be the components of x in \text{Im}(P_{r}) and \text{Ker}(P_{r}), respectively. We have the parameter decomposition

x=x_{r}+x_{\perp}.

Using a slight abuse of notation, we factorize the prior density as \pi_{\text{pr}}(x)=\pi_{\text{pr}}(x_{r})\pi_{\text{pr}}(x_{\perp}|x_{r}), where

\pi_{\text{pr}}(x_{r})=\int_{\text{Ker}(P_{r})}\pi_{\text{pr}}(x_{r}+x_{\perp}^{\prime})\mathrm{d}x_{\perp}^{\prime}\quad{\rm and}\quad\pi_{\text{pr}}(x_{\perp}|x_{r})=\pi_{\text{pr}}(x_{r}+x_{\perp})/\pi_{\text{pr}}(x_{r})

denote the marginal prior and the conditional prior, respectively. The approximate posterior (6) then reads

\widetilde{\pi}_{\text{pos}}^{y}(x_{r}+x_{\perp})\propto\underbrace{\big{(}\widetilde{\mathcal{L}}^{y}(x_{r})\pi_{\text{pr}}(x_{r})\big{)}}_{\widetilde{\pi}_{\text{pos}}^{y}(x_{r})}\pi_{\text{pr}}(x_{\perp}|x_{r}).

This factorization shows that, under the approximate posterior density, the Bayesian update is effective on the informed subspace \text{Im}(P_{r}) (first term \widetilde{\pi}_{\text{pos}}^{y}(x_{r})), while the non-informed subspace \text{Ker}(P_{r}) is characterized by the prior (second term \pi_{\text{pr}}(x_{\perp}|x_{r})). This property will be exploited later on to design efficient sampling strategies for exploring both the approximate posterior and the full posterior.

The challenge of dimension reduction is to construct both the low-rank projector P_{r} and the ridge approximation \widetilde{\mathcal{L}}^{y} such that the KL divergence of the full posterior from its approximation,

D_{\mathrm{KL}}(\pi_{\text{pos}}^{y}||\widetilde{\pi}_{\text{pos}}^{y})=\int\log\left(\frac{\pi_{\text{pos}}^{y}(x)}{\widetilde{\pi}_{\text{pos}}^{y}(x)}\right)\pi_{\text{pos}}^{y}(x)\mathrm{d}x,

can be controlled. In this work, we specifically focus on constructing a projector P_{r} which is independent of the data y and which allows us to bound D_{\mathrm{KL}}(\pi_{\text{pos}}^{y}||\widetilde{\pi}_{\text{pos}}^{y}).

3 Dimension reduction via optimal parameter-reduced likelihood

In this section, we first briefly review the optimal parameter-reduced likelihood and the data-dependent LIS proposed in [54], and then we will introduce the data-free LIS.

3.1 Optimal parameter-reduced likelihood using a given projector

As shown in Section 2.1 of [54], for a given data set y and a given projector P_{r}, the parameter-reduced likelihood function

\mathcal{L}^{\ast,y}(x_{r})=\int_{\text{Ker}(P_{r})}\mathcal{L}^{y}(x_{r}+x_{\perp})\pi_{\text{pr}}(x_{\perp}|x_{r})\,\mathrm{d}x_{\perp}, (7)

is an optimal approximation in the sense that it minimizes \widetilde{\mathcal{L}}^{y}\mapsto D_{\mathrm{KL}}(\pi_{\text{pos}}^{y}||\widetilde{\pi}_{\text{pos}}^{y}). We denote by

\pi_{\text{pos}}^{\ast,y}(x)\propto\mathcal{L}^{\ast,y}(P_{r}x)\pi_{\text{pr}}(x),

the resulting approximate posterior density. The marginal density \pi_{\text{pos}}^{\ast,y}(x_{r})=\int_{\text{Ker}(P_{r})}\pi_{\text{pos}}^{\ast,y}(x_{r}+x_{\perp})\mathrm{d}x_{\perp} can be expressed as

\pi_{\text{pos}}^{\ast,y}(x_{r})\propto\mathcal{L}^{\ast,y}(x_{r})\pi_{\text{pr}}(x_{r})\overset{(7)}{\propto}\int_{\text{Ker}(P_{r})}\mathcal{L}^{y}(x_{r}+x_{\perp})\pi_{\text{pr}}(x_{r}+x_{\perp})\mathrm{d}x_{\perp}\overset{(1)}{\propto}\pi_{\text{pos}}^{y}(x_{r}), (8)

for all x_{r}\in\text{Im}(P_{r}), where \pi_{\text{pos}}^{y}(x_{r})=\int_{\text{Ker}(P_{r})}\pi_{\text{pos}}^{y}(x_{r}+x_{\perp}^{\prime})\mathrm{d}x_{\perp}^{\prime} is the marginal density of the full posterior. Thus, for any projector P_{r} and any data y, the approximate posterior \pi_{\text{pos}}^{\ast,y} and the full posterior \pi_{\text{pos}}^{y} have the same marginal density on \text{Im}(P_{r}). In summary, we have

\pi_{\text{pos}}^{y}(x)=\pi_{\text{pos}}^{y}(x_{r})\pi_{\text{pos}}^{y}(x_{\perp}|x_{r}),
\pi_{\text{pos}}^{\ast,y}(x)\overset{(8)}{=}\pi_{\text{pos}}^{y}(x_{r})\pi_{\text{pr}}(x_{\perp}|x_{r}),

which shows that the optimal approximation \pi_{\text{pos}}^{\ast,y}(x) to \pi_{\text{pos}}^{y}(x) replaces the conditional posterior \pi_{\text{pos}}^{y}(x_{\perp}|x_{r}) with the conditional prior \pi_{\text{pr}}(x_{\perp}|x_{r}).

3.2 Data-dependent dimension reduction

We denote by P_{r}=P_{r}^{y} a projector built by a data-dependent approach. Ideally, we would like to build P_{r}^{y} to minimize D_{\mathrm{KL}}(\pi_{\text{pos}}^{y}||\pi_{\text{pos}}^{\ast,y}) over the manifold of rank-r projectors. However, this non-convex minimization problem can be challenging to solve. Instead, the strategy proposed in [54] minimizes an upper bound of the KL divergence obtained via logarithmic Sobolev inequalities, in which the following assumption on the prior density is adopted.

Assumption 3.1 (Subspace logarithmic Sobolev inequality).

There exist a symmetric positive definite matrix \Gamma\in\mathbb{R}^{d\times d} and a scalar \kappa>0 such that, for any projector P_{r}\in\mathbb{R}^{d\times d} and any continuously differentiable function h:\mathbb{R}^{d}\rightarrow\mathbb{R}, the inequality

\int_{\mathbb{R}^{d}}h(x)^{2}\log\left(\frac{h(x)^{2}}{h_{P_{r}}(x)^{2}}\right)\pi_{\text{pr}}(x)\mathrm{d}x\leq 2\kappa\int_{\mathbb{R}^{d}}\|(I_{d}-P_{r})^{\top}\nabla h(x)\|_{\Gamma^{-1}}^{2}\pi_{\text{pr}}(x)\mathrm{d}x,

holds, where h_{P_{r}}^{2} is the conditional expectation of h^{2} given by h_{P_{r}}(x)^{2}=\int_{\text{Ker}(P_{r})}h(P_{r}x+x_{\perp})^{2}\pi_{\text{pr}}(x_{\perp}|P_{r}x)\,\mathrm{d}x_{\perp}. Here the norm \|\cdot\|_{\Gamma^{-1}} is defined by \|v\|_{\Gamma^{-1}}^{2}=v^{\top}\Gamma^{-1}v for any v\in\mathbb{R}^{d}.

Theorem 1 in [54] gives sufficient conditions on the prior density such that Assumption 3.1 holds. In particular, any Gaussian prior \pi_{\text{pr}}=\mathcal{N}(m_{\text{pr}},\Sigma_{\text{pr}}) with mean m_{\text{pr}}\in\mathbb{R}^{d} and non-singular covariance matrix \Sigma_{\text{pr}}\in\mathbb{R}^{d\times d} satisfies Assumption 3.1 with \kappa=1 and \Gamma=\Sigma_{\text{pr}}^{-1}. As shown in [54, Example 2], any Gaussian mixture also satisfies this assumption, but with a constant \kappa which might not be accessible in practice. We refer to [28, 36] for nicely written introductions to logarithmic Sobolev inequalities and examples of distributions which satisfy them.

Proposition 3.2.

Suppose \pi_{\text{pr}} satisfies Assumption 3.1 and the likelihood function \mathcal{L}^{y} is continuously differentiable. Then, for any projector P_{r}\in\mathbb{R}^{d\times d}, the posterior approximation \pi_{\text{pos}}^{\ast,y}(x)\propto\big{(}\mathcal{L}^{\ast,y}(x_{r})\pi_{\text{pr}}(x_{r})\big{)}\pi_{\text{pr}}(x_{\perp}|x_{r}) induced by the optimal parameter-reduced likelihood (7) satisfies

D_{\mathrm{KL}}(\pi_{\text{pos}}^{y}||\pi_{\text{pos}}^{\ast,y})\leq\frac{\kappa}{2}\operatorname{trace}\big{(}\Gamma^{-1}(I_{d}-P_{r}^{\top})H(y)(I_{d}-P_{r})\big{)}, (9)

where the matrix H(y)\in\mathbb{R}^{d\times d} is defined by

H(y)=\int_{\mathbb{R}^{d}}\big{(}\nabla\log\mathcal{L}^{y}(x))(\nabla\log\mathcal{L}^{y}(x)\big{)}^{\top}\,\pi_{\text{pos}}^{y}(x)\mathrm{d}x. (10)
Proof.

See the proof of Corollary 1 in [54]. ∎

Proposition 3.2 gives an upper bound on D_{\mathrm{KL}}(\pi_{\text{pos}}^{y}||\pi_{\text{pos}}^{\ast,y}). The minimizer of this bound,

P_{r}^{y}\in\underset{\begin{subarray}{c}P_{r}\in\mathbb{R}^{d\times d}\\ \text{rank-$r$ projector}\end{subarray}}{\text{arg\,min}}\,\operatorname{trace}\big{(}\Gamma^{-1}(I_{d}-P_{r}^{\top})H(y)(I_{d}-P_{r})\big{)},

can be obtained from the leading generalized eigenvectors of the matrix pair (H(y),\Gamma), see [53, Proposition 2.6]. Let (\lambda_{i}^{y},v_{i}^{y})\in\mathbb{R}_{\geq 0}\times\mathbb{R}^{d} denote the i-th eigenpair of (H(y),\Gamma), such that H(y)v_{i}^{y}=\lambda_{i}^{y}\Gamma v_{i}^{y}, with (v_{i}^{y})^{\top}\Gamma v_{j}^{y}=\delta_{i,j} and \lambda_{i}^{y}\geq\lambda_{j}^{y} for all i\leq j. The image and the kernel of P_{r}^{y} are respectively defined as

\text{Im}(P_{r}^{y})=\text{span}\{v_{1}^{y},\ldots,v_{r}^{y}\}, (11)
\text{Ker}(P_{r}^{y})=\text{span}\{v_{r+1}^{y},\ldots,v_{d}^{y}\}.

The resulting projector P_{r}^{y} yields an approximate posterior density \pi_{\text{pos}}^{\ast,y} that satisfies

D_{\mathrm{KL}}(\pi_{\text{pos}}^{y}||\pi_{\text{pos}}^{\ast,y})\leq\frac{\kappa}{2}\sum_{i=r+1}^{d}\lambda_{i}^{y}.

The above relation can be used to choose the rank r=\text{rank}(P_{r}^{y}) so as to guarantee that D_{\mathrm{KL}}(\pi_{\text{pos}}^{y}||\pi_{\text{pos}}^{\ast,y}) remains below some user-defined tolerance. A rapid decay in the spectrum (\lambda_{1}^{y},\lambda_{2}^{y},\ldots) ensures that one can choose a rank r that is much lower than the original dimension d. Note that the projector P_{r}^{y} may not be unique, unless there is a spectral gap \lambda_{r}^{y}>\lambda_{r+1}^{y}, which ensures that the r-dimensional dominant eigenspace of (H(y),\Gamma) is unique.
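
As a small illustration of this truncation rule, the following Python sketch picks the smallest rank whose weighted spectral tail meets a user-defined tolerance. The helper name choose_rank and the example numbers are ours, and the input eigenvalues are assumed to be the generalized eigenvalues of (H(y),\Gamma) sorted in decreasing order.

import numpy as np

def choose_rank(eigvals, kappa, eps, r_max):
    # Smallest r with (kappa/2) * sum_{i>r} lambda_i <= eps, capped at r_max.
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    for r in range(lam.size + 1):
        if 0.5 * kappa * lam[r:].sum() <= eps:
            return min(r, r_max)
    return r_max

# Example with a rapidly decaying spectrum: returns r = 3.
print(choose_rank([50.0, 10.0, 1.0, 0.05, 0.01], kappa=1.0, eps=0.1, r_max=4))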

Remark 3.3 (Coordinate selection).

The projector defined in (11) is, in general, not aligned with the canonical coordinates. However, for some parametrizations (for example, when different components of x represent physical quantities of different natures), we may prefer coordinate selection over subspace identification to make the dimension reduction more interpretable. Denoting the i-th canonical basis vector of \mathbb{R}^{d} by e_{i}, we let P_{r}^{y}=\sum_{i\in\mathcal{I}}e_{i}e_{i}^{\top} be the projector of rank r=\#\mathcal{I} which extracts the components of x indexed by the index set \mathcal{I}\subset\{1,\ldots,d\}, so that

\text{Im}(P_{r}^{y})=\text{span}\{x_{i}:i\in\mathcal{I}\},
\text{Ker}(P_{r}^{y})=\text{span}\{x_{i}:i\notin\mathcal{I}\}.

Using such a projector, the bound (9) becomes

D_{\mathrm{KL}}(\pi_{\text{pos}}^{y}||\pi_{\text{pos}}^{\ast,y})\leq\frac{\kappa}{2}\sum_{i\notin\mathcal{I}}(\Gamma^{-1})_{ii}H(y)_{ii},

which suggests defining the index set \mathcal{I} by selecting the r largest values of (\Gamma^{-1})_{ii}H(y)_{ii}.

Because of the dependence on the data set y, the projector P_{r}^{y} must be built after a data set has been observed, see Algorithm 1. For scenarios where one wants to solve multiple inverse problems with multiple data sets, the matrix H(y) and the resulting projector have to be reconstructed for each data set. This can be a computationally challenging task. In addition, H(y) is defined as an expectation over the high-dimensional posterior distribution, which further raises the computational burden.

Requires: \pi_{\text{pr}} satisfying Assumption 3.1, tolerance \varepsilon>0 and maximal rank r_{\max}
Online phase: given the data y do:
       Compute the matrix H(y) using (10).
       Compute the generalized eigendecomposition H(y)v_{i}^{y}=\lambda_{i}^{y}\Gamma v_{i}^{y}.
       Find the smallest r such that \frac{\kappa}{2}\sum_{i=r+1}^{d}\lambda_{i}^{y}\leq\varepsilon. If r\geq r_{\max}, set r=r_{\max}.
       Assemble the projector P_{r}^{y} using (11).
       Define the conditional expectation \mathcal{L}^{\ast,y}(x_{r}) as in (7).
Return: Approximate posterior \pi_{\text{pos}}^{\ast,y}(x)\propto\mathcal{L}^{\ast,y}(P_{r}^{y}x)\pi_{\text{pr}}(x).
Algorithm 1 Data-dependent dimension reduction.

3.3 Data-free dimension reduction

To overcome the abovementioned computational burden of recomputing the data-dependent projector for every new data set, we present a new data-free dimension reduction method. The key idea is to control the KL divergence in expectation over the marginal density of the data. We introduce an m-dimensional random vector

Y\sim\pi_{\text{data}}(y),

where \pi_{\text{data}} is the marginal density of the data defined in (2). Note that the observed data y corresponds to a particular realization of Y. For a given projector P_{r} independent of the data, replacing y with Y in (9) and taking the expectation over Y yields

\mathbb{E}\big{[}D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\pi_{\text{pos}}^{\ast,Y})\big{]}\leq\frac{\kappa}{2}\operatorname{trace}\big{(}\Gamma^{-1}(I_{d}-P_{r}^{\top})\mathbb{E}[H(Y)](I_{d}-P_{r})\big{)}. (12)

Here, the approximate posterior \pi_{\text{pos}}^{\ast,Y} depends on Y via the optimal likelihood \mathcal{L}^{\ast,Y}. As in the data-dependent case, the leading generalized eigenvectors of the matrix pair (\mathbb{E}[H(Y)],\Gamma) can be used to obtain a projector that minimizes the error bound. However, in this case, the matrix \mathbb{E}[H(Y)] is the expectation of H(y) over the marginal density of the data, and thus it is independent of the observed data. Let (\lambda_{i},v_{i})\in\mathbb{R}_{\geq 0}\times\mathbb{R}^{d} denote the i-th eigenpair of (\mathbb{E}[H(Y)],\Gamma), such that \mathbb{E}[H(Y)]v_{i}=\lambda_{i}\Gamma v_{i}, with v_{i}^{\top}\Gamma v_{j}=\delta_{i,j} and \lambda_{i}\geq\lambda_{j} for all i\leq j. The data-free projector P_{r} that minimizes the right-hand side of (12) is given by

\text{Im}(P_{r})=\text{span}\{v_{1},\ldots,v_{r}\}, (13)
\text{Ker}(P_{r})=\text{span}\{v_{r+1},\ldots,v_{d}\}.

When using this projector to define the approximate posterior \pi_{\text{pos}}^{\ast,Y}, the expectation of the KL divergence D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\pi_{\text{pos}}^{\ast,Y}) can be controlled as

\mathbb{E}\big{[}D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\pi_{\text{pos}}^{\ast,Y})\big{]}\leq\frac{\kappa}{2}\sum_{i=r+1}^{d}\lambda_{i}. (14)
Remark 3.4 (Bound in high probability).

Inequality (14) gives a bound on D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\pi_{\text{pos}}^{\ast,Y}) in expectation. In order to obtain a bound in high probability, let us use the Markov inequality \mathbb{P}\{D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\pi_{\text{pos}}^{\ast,Y})\leq\varepsilon\}\geq 1-\varepsilon^{-1}\mathbb{E}[D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\pi_{\text{pos}}^{\ast,Y})] for some \varepsilon>0. Thus, for a given 0<\eta\leq 1, the condition \frac{\kappa}{2}\sum_{i=r+1}^{d}\lambda_{i}\leq\varepsilon\eta is sufficient to ensure that

D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\pi_{\text{pos}}^{\ast,Y})\leq\varepsilon,

holds with a probability greater than 1-\eta.

Remark 3.5 (Coordinate selection).

Similarly to Remark 3.3, instead of defining P_{r} as in (13), we can define a coordinate-aligned projector P_{r}=\sum_{i\in\mathcal{I}}e_{i}e_{i}^{\top} by selecting an index set \mathcal{I} corresponding to the r largest values of (\Gamma^{-1})_{ii}\mathbb{E}[H(Y)]_{ii}.
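
The coordinate selection rules of Remarks 3.3 and 3.5 amount to a simple diagonal scoring. The sketch below is illustrative (all identifiers are ours); the input H stands for either H(y) or \mathbb{E}[H(Y)] given as a dense array.

import numpy as np

def select_coordinates(H, Gamma, r):
    # Scores (Gamma^{-1})_{ii} * H_{ii}; keep the r largest.
    scores = np.diag(np.linalg.inv(Gamma)) * np.diag(H)
    idx = np.sort(np.argsort(scores)[::-1][:r])     # index set I
    P_r = np.zeros_like(H)
    P_r[idx, idx] = 1.0                             # P_r = sum_{i in I} e_i e_i^T
    return idx, P_r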

We now show that the matrix \mathbb{E}[H(Y)] admits a simple expression in terms of the Fisher information matrix associated with the likelihood function. This leads to a computationally convenient way to construct the data-free projector. Recall that the likelihood \mathcal{L}^{y}(x), seen as a function of y, is the pdf of the data y conditioned on the parameter x\in\mathbb{R}^{d}. The Fisher information matrix associated with this family of pdfs is

\mathcal{I}(x)=\int_{\mathbb{R}^{m}}\big{(}\nabla\log\mathcal{L}^{y}(x))(\nabla\log\mathcal{L}^{y}(x)\big{)}^{\top}\,\mathcal{L}^{y}(x)\,\mathrm{d}y. (15)

We can write

\mathbb{E}[H(Y)]=\int_{\mathbb{R}^{m}}H(y)\pi_{\text{data}}(y)\mathrm{d}y
\overset{(10)}{=}\int_{\mathbb{R}^{m}}\left(\int_{\mathbb{R}^{d}}\big{(}\nabla\log\mathcal{L}^{y}(x))(\nabla\log\mathcal{L}^{y}(x)\big{)}^{\top}\pi_{\text{pos}}^{y}(x)\mathrm{d}x\right)\pi_{\text{data}}(y)\mathrm{d}y
\overset{(1)}{=}\int_{\mathbb{R}^{m}\times\mathbb{R}^{d}}\big{(}\nabla\log\mathcal{L}^{y}(x))(\nabla\log\mathcal{L}^{y}(x)\big{)}^{\top}\frac{\mathcal{L}^{y}(x)\pi_{\text{pr}}(x)}{\int_{\mathbb{R}^{d}}\mathcal{L}^{y}(x^{\prime})\pi_{\text{pr}}(x^{\prime})\mathrm{d}x^{\prime}}\,\pi_{\text{data}}(y)\mathrm{d}x\mathrm{d}y
\overset{(2)}{=}\int_{\mathbb{R}^{m}\times\mathbb{R}^{d}}\big{(}\nabla\log\mathcal{L}^{y}(x))(\nabla\log\mathcal{L}^{y}(x)\big{)}^{\top}\mathcal{L}^{y}(x)\pi_{\text{pr}}(x)\mathrm{d}x\mathrm{d}y
\overset{(15)}{=}\int_{\mathbb{R}^{d}}\mathcal{I}(x)\pi_{\text{pr}}(x)\mathrm{d}x, (16)

which shows that the matrix \mathbb{E}[H(Y)] is the expectation of the Fisher information matrix over the prior. This expression does not involve any expectation over the posterior density, which is a major advantage compared to the expression (10) of the data-dependent matrix H(y). The methodology presented here is summarized in Algorithm 2.

Requires: \pi_{\text{pr}} satisfying Assumption 3.1, Fisher information matrix \mathcal{I}(x) of \mathcal{L}^{y}, tolerance \varepsilon>0, and maximal rank r_{\max}
Offline phase
       Compute the matrix H^{I}=\int_{\mathbb{R}^{d}}\mathcal{I}(x)\pi_{\text{pr}}(x)\mathrm{d}x.
       Compute the generalized eigendecomposition H^{I}v_{i}=\lambda_{i}\Gamma v_{i}.
       Find the smallest r such that \frac{\kappa}{2}\sum_{i=r+1}^{d}\lambda_{i}\leq\varepsilon. If r\geq r_{\max}, set r=r_{\max}.
Return: Projector P_{r} defined by (13)
Online phase: given the data y do:
      Define \mathcal{L}^{\ast,y} as the conditional expectation in (7).
Return: Approximate posterior \pi_{\text{pos}}^{\ast,y}(x)\propto\mathcal{L}^{\ast,y}(P_{r}x)\pi_{\text{pr}}(x)
Algorithm 2 Data-free dimension reduction
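
To make the offline phase concrete, here is a minimal Python sketch of the offline part of Algorithm 2, assuming a user-supplied routine fisher_info(x) returning \mathcal{I}(x) and an array of i.i.d. prior samples; all identifiers are illustrative, and scipy.linalg.eigh is used for the generalized eigenproblem. The rank r could be chosen with a rule such as the choose_rank sketch above, and for a Gaussian prior one would take \Gamma=\Sigma_{\text{pr}}^{-1}.

import numpy as np
from scipy.linalg import eigh

def data_free_subspace(prior_samples, fisher_info, Gamma, r):
    # Monte Carlo estimate of E[H(Y)] = \int I(x) pi_pr(x) dx, cf. (16).
    n, d = prior_samples.shape
    H = np.zeros((d, d))
    for x in prior_samples:
        H += fisher_info(x)
    H /= n
    # Generalized eigenproblem H v = lambda Gamma v (eigh returns ascending order).
    lam, V = eigh(H, Gamma)
    lam, V = lam[::-1], V[:, ::-1]
    U = V[:, :r]                 # Gamma-orthonormal basis of Im(P_r), cf. (13)
    P_r = U @ U.T @ Gamma        # projector with Im(P_r) = span(U) and Gamma-orthogonal kernel
    return lam, U, P_r
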
Example 3.6 (Gaussian likelihood).

Consider a parameter-to-data map represented by a smooth forward model G:\mathbb{R}^{d}\rightarrow\mathbb{R}^{m} and corrupted by additive Gaussian noise \xi_{\text{obs}}\sim\mathcal{N}(0,\Sigma_{\text{obs}}) with non-singular covariance matrix \Sigma_{\text{obs}}\in\mathbb{R}^{m\times m}, i.e.,

y=G(x)+\xi_{\text{obs}},\quad{\rm where}\quad\xi_{\text{obs}}\sim\mathcal{N}(0,\Sigma_{\text{obs}}).

The likelihood function takes the form \mathcal{L}^{y}(x)=Z^{-1}\exp(-\frac{1}{2}\|G(x)-y\|_{\Sigma_{\text{obs}}^{-1}}^{2}), where Z=\sqrt{(2\pi)^{m}\,\text{det}(\Sigma_{\text{obs}})} is a normalizing constant. The Slepian-Bangs formula gives an explicit expression for the Fisher information matrix, \mathcal{I}(x)=\nabla G(x)^{\top}\Sigma_{\text{obs}}^{-1}\nabla G(x), where \nabla G(x)\in\mathbb{R}^{m\times d} denotes the Jacobian of the forward model G(x). By relation (16), we obtain

\mathbb{E}[H(Y)]=\int_{\mathbb{R}^{d}}\nabla G(x)^{\top}\Sigma_{\text{obs}}^{-1}\nabla G(x)\,\pi_{\text{pr}}(x)\mathrm{d}x. (17)

A similar matrix was considered in [18] in the context of data-dependent dimension reduction. The major difference with (17) is that, in [18], the expectation is taken over the posterior density rather than over the prior.
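
In this Gaussian setting, the Fisher information only requires the Jacobian of the forward model, so a callback compatible with the sketch after Algorithm 2 could look as follows; jac_G and Sigma_obs_inv are hypothetical stand-ins for the Jacobian routine and the inverse noise covariance.

import numpy as np

def fisher_info_gaussian(x, jac_G, Sigma_obs_inv):
    # I(x) = grad G(x)^T Sigma_obs^{-1} grad G(x), cf. Example 3.6.
    J = jac_G(x)                      # (m, d) Jacobian of the forward model
    return J.T @ Sigma_obs_inv @ J    # (d, d)

# Illustrative use with the earlier sketch:
# fisher_info = lambda x: fisher_info_gaussian(x, jac_G, Sigma_obs_inv)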

4 Dimension reduction via parameter-reduced forward model

In Section 3, the detection of the data-free informed subspace was based on an approximation of the likelihood function. In this section, we present an alternative strategy which, under a Gaussian likelihood assumption, consists in approximating the forward model instead of the likelihood itself. This approach is similar to the vector-valued extension of the AS method [53] and still yields error bounds on the expected KL divergence.

As in Example 3.6, let us start with a Gaussian likelihood of the form

\mathcal{L}^{y}(x)=\frac{1}{Z}\exp\left(-\frac{1}{2}\|G(x)-y\|_{\Sigma_{\text{obs}}^{-1}}^{2}\right), (18)

where x\mapsto G(x) is a continuously differentiable forward model, \Sigma_{\text{obs}}\in\mathbb{R}^{m\times m} is a non-singular covariance matrix, and Z=\sqrt{(2\pi)^{m}\,\text{det}(\Sigma_{\text{obs}})} is a normalizing constant. Our goal is to build a low-dimensional approximation to the likelihood (18) by replacing the forward model with a ridge approximation x\mapsto\widetilde{G}(P_{r}x). That is, we look for a likelihood approximation of the form

\widetilde{\mathcal{L}}^{y}(P_{r}x)=\frac{1}{Z}\exp\left(-\frac{1}{2}\|\widetilde{G}(P_{r}x)-y\|_{\Sigma_{\text{obs}}^{-1}}^{2}\right), (19)

where P_{r} is a low-rank projector and \widetilde{G} is a parameter-reduced function defined on \text{Im}(P_{r}). In general, this approximate likelihood (19) differs from the optimal one \mathcal{L}^{\ast,y} in (7), and therefore \widetilde{\mathcal{L}}^{y} might not be optimal with respect to the KL divergence in the sense discussed in Section 3.1. The following proposition will guide the construction of the approximate forward model.

Proposition 4.1.

Consider the posterior density \pi_{\text{pos}}^{y}(x)\propto\mathcal{L}^{y}(x)\pi_{\text{pr}}(x) with a Gaussian likelihood as in (18). For any approximate forward model \widehat{G}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{m}, the resulting approximate likelihood \widehat{\mathcal{L}}^{y}(x)=\frac{1}{Z}\exp(-\frac{1}{2}\|\widehat{G}(x)-y\|_{\Sigma_{\text{obs}}^{-1}}^{2}) defines an approximate posterior density \widehat{\pi}_{\text{pos}}^{y}(x)\propto\widehat{\mathcal{L}}^{y}(x)\pi_{\text{pr}}(x) such that

\mathbb{E}\left[D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\widehat{\pi}_{\text{pos}}^{Y})\right]+D_{\mathrm{KL}}(\pi_{\text{data}}||\widehat{\pi}_{\text{data}})=\frac{1}{2}\int_{\mathbb{R}^{d}}\|G(x)-\widehat{G}(x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}\pi_{\text{pr}}(x)\mathrm{d}x.

Here the expectation is taken over Y\sim\pi_{\text{data}}, and \widehat{\pi}_{\text{data}}(y)=\int_{\mathbb{R}^{d}}\widehat{\mathcal{L}}^{y}(x)\pi_{\text{pr}}(x)\,\mathrm{d}x is the approximate marginal density of the data.

Proof.

See A. ∎

Using an approximate forward model of the form \widehat{G}(x)=\widetilde{G}(P_{r}x), Proposition 4.1 ensures that the approximate posterior \widetilde{\pi}_{\text{pos}}^{y}(x)\propto\widetilde{\mathcal{L}}^{y}(P_{r}x)\pi_{\text{pr}}(x), with \widetilde{\mathcal{L}}^{y}(P_{r}x) as in (19), satisfies

\mathbb{E}[D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\widetilde{\pi}_{\text{pos}}^{Y})]\leq\frac{1}{2}\int_{\mathbb{R}^{d}}\|G(x)-\widetilde{G}(P_{r}x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}\pi_{\text{pr}}(x)\mathrm{d}x. (20)

This suggests constructing a ridge approximation \widetilde{G}(P_{r}x) to G(x) in the L_{\pi_{\text{pr}}}^{2} sense. To accomplish this, we follow the methodology proposed in [53] for the approximation of multivariate functions using gradient information. First, for any projector P_{r}, the optimal function \widetilde{G}^{\ast} that minimizes the right-hand side of (20) is the conditional expectation

\widetilde{G}^{\ast}(x_{r})=\int_{\text{Ker}(P_{r})}G(x_{r}+x_{\perp})\pi_{\text{pr}}(x_{\perp}|x_{r})\,\mathrm{d}x_{\perp}. (21)

Then, similarly to Assumption 3.1, we assume that \pi_{\text{pr}} satisfies the following subspace Poincaré inequality.

Assumption 4.2 (Subspace Poincaré inequality).

There exist a symmetric positive definite matrix \Gamma\in\mathbb{R}^{d\times d} and a scalar \kappa>0 such that, for any projector P_{r}\in\mathbb{R}^{d\times d} and any continuously differentiable function h:\mathbb{R}^{d}\rightarrow\mathbb{R}, the inequality

\int_{\mathbb{R}^{d}}\big{(}h(x)-h_{P_{r}}(x)\big{)}^{2}\pi_{\text{pr}}(x)\mathrm{d}x\leq\kappa\int_{\mathbb{R}^{d}}\|(I_{d}-P_{r})^{\top}\nabla h(x)\|_{\Gamma^{-1}}^{2}\pi_{\text{pr}}(x)\mathrm{d}x,

holds, where h_{P_{r}} is the conditional expectation of h defined by h_{P_{r}}(x)=\int_{\text{Ker}(P_{r})}h(P_{r}x+x_{\perp}^{\prime})\pi_{\text{pr}}(x_{\perp}^{\prime}|P_{r}x)\mathrm{d}x_{\perp}^{\prime}.

Assumption 4.2 is weaker than Assumption 3.1, in the sense that any distribution which satisfies the subspace logarithmic Sobolev inequality automatically satisfies the subspace Poincaré inequality with the same \kappa and the same \Gamma, see for instance [54, Corollary 2]. We refer to the recent contributions [4, 43, 49] for examples of probability distributions which satisfy (subspace) Poincaré inequalities. As with the logarithmic Sobolev constant, the Poincaré constant is hard to compute in practice, except in the case of a Gaussian prior. Using similar arguments as in the proof of Proposition 2.5 in [53], Assumption 4.2 allows us to write

\int_{\mathbb{R}^{d}}\|G(x)-\widetilde{G}^{\ast}(P_{r}x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}\pi_{\text{pr}}(x)\mathrm{d}x\leq\kappa\,\operatorname{trace}\big{(}\Gamma^{-1}(I_{d}-P_{r}^{\top})H^{G}(I_{d}-P_{r})\big{)}, (22)

holds for any projector P_{r}, where the matrix H^{G}\in\mathbb{R}^{d\times d} is defined by

H^{G}=\int_{\mathbb{R}^{d}}\nabla G(x)^{\top}\Sigma_{\text{obs}}^{-1}\nabla G(x)\,\pi_{\text{pr}}(x)\mathrm{d}x, (23)

with \nabla G(x) the Jacobian matrix of G(x), given by

\nabla G(x)=\begin{pmatrix}\frac{\partial G_{1}}{\partial x_{1}}(x)&\ldots&\frac{\partial G_{1}}{\partial x_{d}}(x)\\ \vdots&\ddots&\vdots\\ \frac{\partial G_{m}}{\partial x_{1}}(x)&\ldots&\frac{\partial G_{m}}{\partial x_{d}}(x)\\ \end{pmatrix}.

Again, the projector P_{r}^{G} that minimizes the right-hand side of (22) can be constructed via the generalized eigenvalue problem H^{G}v_{i}^{G}=\lambda_{i}^{G}\Gamma v_{i}^{G}:

\text{Im}(P_{r}^{G})=\text{span}\{v_{1}^{G},\ldots,v_{r}^{G}\}, (24)
\text{Ker}(P_{r}^{G})=\text{span}\{v_{r+1}^{G},\ldots,v_{d}^{G}\}.

Using this projector to construct the approximate forward model \widetilde{G}^{\ast} in (21) and the approximate likelihood as in (19), Proposition 4.1 and the inequality (22) yield

\mathbb{E}\big{[}D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\widetilde{\pi}_{\text{pos}}^{Y})\big{]}\leq\frac{\kappa}{2}\sum_{i=r+1}^{d}\lambda_{i}^{G}. (25)

The methodology is summarized in Algorithm 3.

The matrix H^{G} used in this case takes the same form as the matrix \mathbb{E}[H(Y)] of Section 3.3 with a Gaussian likelihood (cf. Example 3.6), and hence results in the same data-free projector. However, the resulting approximate likelihood functions are not the same. Indeed, in Section 3.3 the optimal approximate likelihood \mathcal{L}^{\ast,y} is given by the conditional expectation of the likelihood function (cf. (7)), whereas here \widetilde{\mathcal{L}}^{y} is defined via the conditional expectation \widetilde{G}^{\ast} of the forward model (cf. (21)). Using either the parameter-reduced likelihood in (7) or the parameter-reduced forward model in (21) results in the same parameter truncation error bound in terms of the expected KL divergence.

Requires: \pi_{\text{pr}} satisfying Assumption 3.1, Jacobian \nabla G(x), tolerance \varepsilon>0 and maximal rank r_{\max}.
Offline phase
       Compute the matrix H^{G} defined in (23)
       Compute the generalized eigendecomposition H^{G}v_{i}^{G}=\lambda_{i}^{G}\Gamma v_{i}^{G}
       Find the smallest r such that \frac{\kappa}{2}\sum_{i=r+1}^{d}\lambda_{i}^{G}\leq\varepsilon. If r\geq r_{\max}, set r=r_{\max}
       Assemble the projector P_{r}^{G} defined in (24)
       Define \widetilde{G}^{\ast} as the conditional expectation (21)
Return: Approximate forward model x\mapsto\widetilde{G}^{\ast}(P_{r}^{G}x)
Online phase: given the data y do:
       Assemble \widetilde{\mathcal{L}}^{y}(P_{r}^{G}x) as in (19)
Return: Approximate posterior \widetilde{\pi}_{\text{pos}}^{y}(x)\propto\widetilde{\mathcal{L}}^{y}(P_{r}^{G}x)\pi_{\text{pr}}(x)
Algorithm 3 Data-free dimension reduction via forward model approximation
Remark 4.3.

Despite the similarity between the approximate likelihood functions given in (7) and (19), these two approaches offer different computational characteristics. Given the data-free informed subspace, the optimal parameter-reduced forward model x_{r}\mapsto\widetilde{G}^{\ast}(x_{r}) can be further replaced by a surrogate model x_{r}\mapsto G^{\text{ROM}}(x_{r}) constructed in the offline phase. The surrogate model can be obtained using tensor methods [8, 42], the reduced basis method [44], or polynomial techniques [35], to name a few. All these approximation techniques scale poorly with the apparent parameter dimension d, and thus parameter reduction can greatly improve the scalability of surrogate models.

In contrast, the conditional expectation of the likelihood function in (7) cannot be replaced with offline surrogate models because of the data-dependency of the likelihood.

5 Sampling the approximate posterior

Given a data-free informed subspace, the approximate posterior density has the factorized form

\widetilde{\pi}_{\text{pos}}^{y}(x)\propto\widetilde{\pi}_{\text{pos}}^{y}(x_{r})\pi_{\text{pr}}(x_{\perp}|x_{r}), (26)

with either \widetilde{\pi}_{\text{pos}}^{y}(x_{r})=\pi_{\text{pos}}^{y}(x_{r}) in the optimal parameter-reduced likelihood approach of Section 3, or \widetilde{\pi}_{\text{pos}}^{y}(x_{r})=\widetilde{\mathcal{L}}^{y}(x_{r})\pi_{\text{pr}}(x_{r}) with \widetilde{\mathcal{L}}^{y}(x_{r})\propto\exp(-\frac{1}{2}\|\widetilde{G}^{\ast}(x_{r})-y\|_{\Sigma_{\text{obs}}^{-1}}^{2}) in the optimal parameter-reduced forward model approach of Section 4. The factorization (26) naturally suggests a dimension-robust way of sampling the approximate posterior. The sampling method consists in first drawing samples x_{r}^{(1)},x_{r}^{(2)},\ldots,x_{r}^{(K)} from the low-dimensional density \widetilde{\pi}_{\text{pos}}^{y}(x_{r}) using either MCMC or SMC methods. Then, for each sample x_{r}^{(j)}, we simulate a conditional prior sample x_{\perp}^{(j)} from \pi_{\text{pr}}(x_{\perp}|x_{r}^{(j)}). In the end, x^{(j)}=x_{r}^{(j)}+x_{\perp}^{(j)} are samples from the approximate posterior \widetilde{\pi}_{\text{pos}}^{y}(x).

We emphasize here that the key requirement is the ability to sample from the conditional prior \pi_{\text{pr}}(x_{\perp}|x_{r}). This task is rather easy for Gaussian priors. We show in Section 7 how to sample from \pi_{\text{pr}}(x_{\perp}|x_{r}) for non-Gaussian priors with a particular structure that can be exploited.

Remark 5.1.

If the end goal is to compute the expectation of some function h over the approximate posterior, the factorization (26) leads to

\int_{\mathbb{R}^{d}}h(x)\widetilde{\pi}_{\text{pos}}(x)\mathrm{d}x=\int_{\text{Im}(P_{r})}\left(\int_{\text{Ker}(P_{r})}h(x_{r}+x_{\perp})\pi_{\text{pr}}(x_{\perp}|x_{r})\mathrm{d}x_{\perp}\right)\widetilde{\pi}_{\text{pos}}(x_{r})\mathrm{d}x_{r}
\approx\frac{1}{K}\sum_{j=1}^{K}\int_{\text{Ker}(P_{r})}h(x_{r}^{(j)}+x_{\perp})\pi_{\text{pr}}(x_{\perp}|x_{r}^{(j)})\mathrm{d}x_{\perp},

where x_{r}^{(1)},\ldots,x_{r}^{(K)} are samples from the approximate marginal posterior \widetilde{\pi}_{\text{pos}}^{y}(x_{r}). This way, if the expectation over the conditional prior \pi_{\text{pr}}(x_{\perp}|x_{r}^{(j)}) can be carried out analytically, one can simply avoid using conditional prior samples. Alternatively, the K conditional expectations \int h(x_{r}^{(j)}+x_{\perp})\pi_{\text{pr}}(x_{\perp}|x_{r}^{(j)})\mathrm{d}x_{\perp} can also be approximated via another accurate quadrature rule for \pi_{\text{pr}}(x_{\perp}|x_{r}^{(j)}). Either way, we assume that integration with respect to the conditional prior is tractable.

In Algorithm 4 we provide the details of an MCMC-based sampling procedure in which the approximate likelihood (defined by either the optimal parameter-reduced likelihood or the optimal parameter-reduced forward model) is obtained as a sample average over the conditional prior \pi_{\text{pr}}(x_{\perp}|x_{r}). To make these approximations generally applicable, we replace the conditional prior with the marginal prior \pi_{\text{pr}}(x_{\perp}) when computing the conditional expectations in equations (27) and (28) of Algorithm 4. Note that the typical class of inverse problems equipped with a Gaussian prior \pi_{\text{pr}}=\mathcal{N}(m_{\text{pr}},\Sigma_{\text{pr}}) is a special case: since the projector P_{r} is orthogonal with respect to \Sigma_{\text{pr}}^{-1}, the marginal prior \pi_{\text{pr}}(x_{\perp}) coincides with the conditional prior \pi_{\text{pr}}(x_{\perp}|x_{r}).
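
In the Gaussian case just described, drawing from the conditional prior therefore reduces to projecting a fresh prior draw onto \text{Ker}(P_{r}). A minimal sketch, assuming a zero prior mean, a Cholesky factor L_pr of \Sigma_{\text{pr}}, and a projector P_r that is orthogonal with respect to \Sigma_{\text{pr}}^{-1} (all names illustrative):

import numpy as np

def sample_x_perp(P_r, L_pr, rng):
    # x_perp ~ pi_pr(x_perp | x_r) = pi_pr(x_perp): project a fresh prior draw onto Ker(P_r).
    d = P_r.shape[0]
    z = L_pr @ rng.standard_normal(d)        # z ~ N(0, Sigma_pr)
    return (np.eye(d) - P_r) @ z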

Requires: A projector P_{r}, a sample size N for approximating the likelihood, a total posterior sample size K, and a proposal density q(\cdot|x_{r}) on \text{Im}(P_{r}).
Sample-averaged likelihood approximation
       Draw N i.i.d. samples x_{\perp}^{(1)},\ldots,x_{\perp}^{(N)} from the marginal prior \pi_{\text{pr}}(x_{\perp})
       if the optimal parameter-reduced likelihood is used then
\widetilde{\mathcal{L}}_{N}^{y}(x_{r})=\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}^{y}(x_{r}+x_{\perp}^{(i)}) (27)
for any x_{r}\in\text{Im}(P_{r})
       end if
       if the optimal parameter-reduced forward model is used then
\widetilde{\mathcal{L}}_{N}^{y}(x_{r})\propto\exp\Big{(}-\frac{1}{2}\|\widetilde{G}_{N}(x_{r})-y\|_{\Sigma_{\text{obs}}^{-1}}^{2}\Big{)},\quad\widetilde{G}_{N}(x_{r})=\frac{1}{N}\sum_{i=1}^{N}G(x_{r}+x_{\perp}^{(i)}) (28)
for any x_{r}\in\text{Im}(P_{r})
       end if
Return: a sample-averaged approximate likelihood function \widetilde{\mathcal{L}}_{N}^{y}(x_{r})
Subspace MCMC sampling
       for j=1,2,\ldots,K do
             Given the Markov chain state X_{r}^{(j-1)}=x_{r}, propose a candidate x_{r}^{\dagger}\sim q(\cdot|x_{r})
             Evaluate the approximate likelihood function \widetilde{\mathcal{L}}_{N}^{y}(x_{r}^{\dagger})
             Compute the acceptance probability
\alpha(x_{r}^{\dagger}|x_{r})=\min\bigg{\{}1\,,\,\frac{\widetilde{\mathcal{L}}_{N}^{y}(x_{r}^{\dagger})\pi_{\text{pr}}(x_{r}^{\dagger})\,q(x_{r}|x_{r}^{\dagger})}{\widetilde{\mathcal{L}}_{N}^{y}(x_{r})\pi_{\text{pr}}(x_{r})\,q(x_{r}^{\dagger}|x_{r})}\bigg{\}}. (29)
             With probability \alpha(x_{r}^{\dagger}|x_{r}), accept x_{r}^{\dagger} by setting X_{r}^{(j)}=x_{r}^{\dagger}; otherwise reject x_{r}^{\dagger} by setting X_{r}^{(j)}=x_{r}.
       end for
Return: a Markov chain X_{r}^{(1)},X_{r}^{(2)},\ldots,X_{r}^{(K)} with invariant density \widetilde{\pi}_{\text{pos}}^{y}(x_{r})
Approximate posterior sampling
       for j=1,2,\ldots,K do
             Given the state X_{r}^{(j)}=x_{r}^{(j)}, draw a conditional prior sample x_{\perp}^{(j)}\sim\pi_{\text{pr}}(\cdot|x_{r}^{(j)})
             Compute the j-th approximate posterior sample x^{(j)}=x_{r}^{(j)}+x_{\perp}^{(j)}
       end for
Return: approximate posterior samples x^{(1)},x^{(2)},\ldots,x^{(K)}
Algorithm 4 MCMC-based approach for sampling the approximate posterior.
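
The sketch below puts the pieces of Algorithm 4 together for a Gaussian prior with zero mean, using the reduced coordinates \xi with x_{r}=U\xi, where U is a \Sigma_{\text{pr}}^{-1}-orthonormal basis of \text{Im}(P_{r}) (so the reduced prior is standard normal), a symmetric random-walk proposal, and the sample-averaged likelihood (27). All identifiers are illustrative; log_like is a user-supplied routine returning \log\mathcal{L}^{y}(x) and L_pr a Cholesky factor of \Sigma_{\text{pr}}.

import numpy as np
from scipy.special import logsumexp

def subspace_mcmc(log_like, y, U, P_r, L_pr, N, K, step, rng):
    d, r = U.shape
    # Frozen marginal-prior samples for the sample-averaged likelihood (27).
    Z = (L_pr @ rng.standard_normal((d, N))).T          # N prior draws, one per row
    X_perp = Z @ (np.eye(d) - P_r).T                    # their Ker(P_r) components

    def log_like_reduced(xi):
        x_r = U @ xi
        return logsumexp([log_like(x_r + xp, y) for xp in X_perp]) - np.log(N)

    xi = np.zeros(r)
    log_post = log_like_reduced(xi) - 0.5 * xi @ xi     # reduced prior is N(0, I_r)
    chain = np.empty((K, d))
    for j in range(K):
        xi_new = xi + step * rng.standard_normal(r)     # symmetric random-walk proposal
        log_post_new = log_like_reduced(xi_new) - 0.5 * xi_new @ xi_new
        if np.log(rng.uniform()) < log_post_new - log_post:
            xi, log_post = xi_new, log_post_new         # accept, cf. (29)
        # Complete the state with a conditional-prior draw (equal to the marginal here).
        x_perp = (np.eye(d) - P_r) @ (L_pr @ rng.standard_normal(d))
        chain[j] = U @ xi + x_perp
    return chain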

A remaining question is how to choose the sample size N for computing the conditional expectations in (27) and (28). The following heuristic is developed based on the optimal parameter-reduced forward model. Consider the exact parameter-reduced forward model \widetilde{G}^{\ast}(P_{r}x)=\widetilde{G}^{\ast}(x_{r}) and its sample-averaged approximation \widetilde{G}_{N}(P_{r}x)=\widetilde{G}_{N}(x_{r}). The sample-averaged approximation defines an approximate posterior density

\widehat{\pi}_{\text{pos}}^{y}(x)\propto\exp\Big{(}-\frac{1}{2}\|\widetilde{G}_{N}(P_{r}x)-y\|_{\Sigma_{\text{obs}}^{-1}}^{2}\Big{)}\pi_{\text{pr}}(x),

that satisfies

\mathbb{E}[D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\widehat{\pi}_{\text{pos}}^{Y})]\overset{(20)}{\leq}\mathbb{E}\left[\frac{1}{2}\int_{\mathbb{R}^{d}}\|G(x)-\widetilde{G}_{N}(P_{r}x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}\pi_{\text{pr}}(x)\mathrm{d}x\right]
=\frac{1}{2}\,\mathbb{E}\left[\int_{\mathbb{R}^{d}}\|G(x)-\widetilde{G}^{\ast}(P_{r}x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}+\|\widetilde{G}^{\ast}(P_{r}x)-\widetilde{G}_{N}(P_{r}x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}\pi_{\text{pr}}(x)\mathrm{d}x\right]
=\left(1+\frac{1}{N}\right)\frac{1}{2}\int_{\mathbb{R}^{d}}\|G(x)-\widetilde{G}^{\ast}(P_{r}x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}\pi_{\text{pr}}(x)\mathrm{d}x. (30)

Here, the expectation is taken jointly over the data Y and the samples \{x_{\perp}^{(i)}\}_{i=1}^{N}. The above inequality directly follows from Proposition 4.1 and the fact that \widetilde{G}^{\ast}(P_{r}x) is the conditional expectation of G(x) over \text{Ker}(P_{r}). We refer to Theorem 3.2 in [12] for more details on this derivation. Inequality (30) implies that the random approximate posterior \widehat{\pi}_{\text{pos}}^{y}(x) can be used in place of \widetilde{\pi}_{\text{pos}}^{y}(x), as the bounds on the expected Kullback-Leibler divergence in (20) and (30) are comparable. In addition, this suggests that the sample size N in (28) does not have to be large: even with N=1, (20) and (30) differ only by a factor of 2. For the optimal parameter-reduced likelihood function, it is not obvious how to obtain a similar bound for the sample-averaged conditional expectation in (27), see for instance the result [54, Proposition 5]. In this case, we adopt (30) as a heuristic.

6 Sampling from the exact posterior

In this section, we present new strategies for sampling the exact posterior by adding minor modifications to Algorithm 4.

6.1 Pseudo-marginal for the optimal parameter-reduced likelihood

For the optimal parameter-reduced likelihood approach, Algorithm 4 replaces the optimal likelihood \mathcal{L}^{*,y}(x_{r}) with the sample average \widetilde{\mathcal{L}}^{y}_{N}(x_{r}) defined in (27) using frozen (fixed) samples \{x_{\perp}^{(i)}\}_{i=1}^{N}. This way, Algorithm 4 produces samples from an estimate of the posterior approximation \pi_{\text{pos}}^{\ast,y}(x)=\pi_{\text{pos}}^{y}(x_{r})\pi_{\text{pr}}(x_{\perp}|x_{r}). In this section, we first show that replacing the frozen samples with freshly drawn samples \{x_{\perp}^{(i)}\}_{i=1}^{N} at each MCMC iteration yields a pseudo-marginal MCMC [1] which samples exactly from \pi_{\text{pos}}^{\ast,y}(x). In addition, we show that an appropriate recycling of the samples generated by this modified algorithm allows obtaining samples from the exact posterior \pi_{\text{pos}}^{y}(x) itself.

We propose to modify Algorithm 4 by replacing the acceptance probability \alpha(x_{r}^{\dagger}|x_{r}) in (29) with

\widehat{\alpha}_{N}(x_{r}^{\dagger}|x_{r})=\min\left\{1\,,\,\frac{\pi_{\text{pr}}(x_{r}^{\dagger})\left(\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}^{y}(x_{r}^{\dagger}+x_{\perp}^{\dagger(i)})\right)q(x_{r}|x_{r}^{\dagger})}{\pi_{\text{pr}}(x_{r})\left(\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}^{y}(x_{r}+x_{\perp}^{(i)})\right)q(x_{r}^{\dagger}|x_{r})}\right\}. (31)

Here, \{x_{\perp}^{(i)}\}_{i=1}^{N} are i.i.d. samples from \pi_{\text{pr}}(x_{\perp}|x_{r}) conditioned on the current state of the chain x_{r}, and \{x_{\perp}^{\dagger(i)}\}_{i=1}^{N} are i.i.d. samples from \pi_{\text{pr}}(x_{\perp}|x_{r}^{\dagger}) conditioned on the proposed candidate x_{r}^{\dagger}. Compared to the previous acceptance probability (29), where the same frozen sample set \{x_{\perp}^{(i)}\}_{i=1}^{N} is used for both the current state and the candidate, the new acceptance probability (31) requires redrawing fresh samples for each proposed candidate x_{r}^{\dagger}. This is summarized in Algorithm 5.

Requires: A projector P_{r}, a sample size N for approximating the likelihood, a total posterior sample size K, and a proposal density q(\cdot|x_{r}) on \text{Im}(P_{r}).
for j=1,2,\ldots,K do
       Given the previous state of the Markov chain X_{r}^{(j-1)}=x_{r} and the associated set of conditional prior samples \{X_{\perp}^{(j-1,i)}\}_{i=1}^{N}=\{x_{\perp}^{(i)}\}_{i=1}^{N}
       Propose a candidate x_{r}^{\dagger}\sim q(\cdot|x_{r})
       Draw N independent samples x_{\perp}^{\dagger(1)},\ldots,x_{\perp}^{\dagger(N)}\sim\pi_{\text{pr}}(x_{\perp}|x_{r}^{\dagger})
       Compute the acceptance probability \widehat{\alpha}_{N}(x_{r}^{\dagger}|x_{r}) as in (31)
       With probability \widehat{\alpha}_{N}(x_{r}^{\dagger}|x_{r}), accept by setting X_{r}^{(j)}=x_{r}^{\dagger} and \{X_{\perp}^{(j,i)}\}_{i=1}^{N}=\{x_{\perp}^{\dagger(i)}\}_{i=1}^{N}; otherwise reject by setting X_{r}^{(j)}=x_{r} and \{X_{\perp}^{(j,i)}\}_{i=1}^{N}=\{x_{\perp}^{(i)}\}_{i=1}^{N}.
end for
Return: the Markov chain \{(X_{r}^{(j)},\{X_{\perp}^{(j,i)}\}_{i=1}^{N})\}_{j=1}^{K}
Algorithm 5 Pseudo-marginal MCMC for sampling the exact marginal posterior.
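
A single iteration of Algorithm 5 differs from the sketch after Algorithm 4 only in that the conditional-prior samples attached to the candidate are redrawn, as in (31). The fragment below shows this step under the same illustrative assumptions (zero-mean Gaussian prior in reduced coordinates, symmetric proposal); the state carried between iterations is (xi, X_perp, log_Lhat) and all names are ours.

import numpy as np
from scipy.special import logsumexp

def pseudo_marginal_step(xi, X_perp, log_Lhat, log_like, y, U, P_r, L_pr, N, step, rng):
    d, r = U.shape
    xi_new = xi + step * rng.standard_normal(r)                        # candidate on Im(P_r)
    Z = (L_pr @ rng.standard_normal((d, N))).T
    X_perp_new = Z @ (np.eye(d) - P_r).T                               # fresh conditional-prior draws
    log_Lhat_new = logsumexp([log_like(U @ xi_new + xp, y) for xp in X_perp_new]) - np.log(N)
    # Acceptance probability (31) in log form, with the reduced prior N(0, I_r).
    log_ratio = (log_Lhat_new - 0.5 * xi_new @ xi_new) - (log_Lhat - 0.5 * xi @ xi)
    if np.log(rng.uniform()) < log_ratio:
        return xi_new, X_perp_new, log_Lhat_new                        # accept state and samples
    return xi, X_perp, log_Lhat                                        # reject: keep previous set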

In the next proposition, we apply the analysis of pseudo-marginal MCMC [1] to show that \pi_{\text{pos}}^{y}(x_{r}) is the invariant density of the Markov chain constructed by Algorithm 5. The key step is to interpret Algorithm 5 as a classical Metropolis-Hastings algorithm that operates on the product space \text{Im}(P_{r})\times\text{Ker}(P_{r})^{N}.

Proposition 6.1.

Algorithm 5 constructs an ergodic Markov chain \{(X_{r}^{(j)},\{X_{\perp}^{(j,i)}\}_{i=1}^{N})\}_{j\geq 1} on the product space \text{Im}(P_{r})\times\text{Ker}(P_{r})^{N} with invariant density

\pi_{\text{tar}}^{y,N}(x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N})\propto\pi_{\text{pr}}(x_{r})\left(\sum_{i=1}^{N}\mathcal{L}^{y}(x_{r}+x_{\perp}^{(i)})\right)\prod_{j=1}^{N}\pi_{\text{pr}}(x_{\perp}^{(j)}|x_{r}). (32)

The marginal of this target density satisfies \pi_{\text{tar}}^{y,N}(x_{r})=\pi_{\text{pos}}^{y}(x_{r}), so that the sequence \{X_{r}^{(j)}\}_{j\geq 1} is an ergodic Markov chain with invariant density \pi_{\text{pos}}^{y}(x_{r}).

Proof.

See B. ∎

Remark 6.2 (Choosing NN in Algorithm 5).

The statistical performance of pseudo-marginal methods depends on the variance of the sample-averaged estimate \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}^{y}(x_{r}+x_{\perp}^{(i)}). Since this variance is inversely proportional to the sample size N, a larger N may result in better statistical efficiency of the MCMC chain. However, the computational cost per MCMC iteration increases linearly with N, while the improvement in statistical efficiency does not follow the same rate. We refer the reader to [3, 23] for a detailed discussion of this topic and only provide the following interpretation. As N\rightarrow\infty, the Markov chain constructed by the pseudo-marginal MCMC converges to that of an idealized standard MCMC, which has the acceptance probability defined by the same proposal density and the exact evaluation of \mathcal{L}^{*,y}(x_{r}). Thus, even with a very large N, the statistical efficiency of the pseudo-marginal MCMC cannot exceed that of the idealized standard MCMC. As suggested by [23], the standard deviation of the logarithm of the parameter-reduced likelihood estimate, \text{var}[\log\widetilde{\mathcal{L}}_{N}^{y}]^{\frac{1}{2}}, can be used to monitor the quality of the sample-averaged estimator.

It is remarkable to observe that, for N=1N=1, the target density (32) becomes the true posterior πtary,1(xr,x)=πposy(xr+x)\pi_{\text{tar}}^{y,1}(x_{r},x_{\perp})=\pi_{\text{pos}}^{y}(x_{r}+x_{\perp}). This means that Algorithm 5 actually produces samples x=xr+xx=x_{r}+x_{\perp} from πposy(x)\pi_{\text{pos}}^{y}(x). For N>1N>1, we propose to recycle the Markov chain {X(1,i)}i=1N,,{X(K,i)}i=1N\{X_{\perp}^{(1,i)}\}_{i=1}^{N},\ldots,\{X_{\perp}^{(K,i)}\}_{i=1}^{N} produced by Algorithm 5 in order to generate samples from the exact posterior πposy(x)\pi_{\text{pos}}^{y}(x). This procedure is summarized in Algorithm 6 and a justification is provided in the following proposition.

Proposition 6.3.

Let {(Xr(j),{X(j,i)}i=1N)}j1\{(X_{r}^{(j)},\{X_{\perp}^{(j,i)}\}_{i=1}^{N})\}_{j\geq 1} be a Markov chain generated by Algorithm 5. For any j1j\geq 1 we randomly select X(j){X(j,i)}i=1NX_{\perp}^{(j)}\in\{X_{\perp}^{(j,i)}\}_{i=1}^{N} according to the probability

(X(j)=X(j,k)|Xr(j),{X(j,i)}i=1N)=y(Xr(j)+X(j,k))i=1Ny(Xr(j)+X(j,i)),1kN,\mathbb{P}\left(X_{\perp}^{(j)}=X_{\perp}^{(j,k)}\Big{|}X_{r}^{(j)},\{X_{\perp}^{(j,i)}\}_{i=1}^{N}\right)=\frac{\mathcal{L}^{y}(X_{r}^{(j)}+X_{\perp}^{(j,k)})}{\sum_{i=1}^{N}\mathcal{L}^{y}(X_{r}^{(j)}+X_{\perp}^{(j,i)})},\quad 1\leq k\leq N, (33)

and we let X(j)=Xr(j)+X(j)X^{(j)}=X_{r}^{(j)}+X_{\perp}^{(j)}. Then {X(j)}j1\{X^{(j)}\}_{j\geq 1} is a Markov chain with the exact posterior πposy(x)\pi_{\text{pos}}^{y}(x) as invariant density.

Proof.

See C. ∎

Requires: the Markov chain \{(X_{r}^{(j)},\{X_{\perp}^{(j,i)}\}_{i=1}^{N})\}_{j=1}^{K} generated by Algorithm 5
for j=1,2,,Kj=1,2,{.\hss.\hss.},K do
       Subsample X(j){X(j,i)}i=1NX_{\perp}^{(j)}\in\{X_{\perp}^{(j,i)}\}_{i=1}^{N} according to the probability (33)
       Assemble X(j)=Xr(j)+X(j)X^{(j)}=X_{r}^{(j)}+X_{\perp}^{(j)}
end for
Return: the Markov chain {X(j)}j=1K\{X^{(j)}\}_{j=1}^{K} with invariant density πposy(x)\pi_{\text{pos}}^{y}(x)
Algorithm 6 Recycling the Markov chain generated by Algorithm 5 to generate exact posterior samples
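A possible sketch of the recycling step (33) in Algorithm 6, reusing the chain format of the previous sketch; the function names are illustrative.

```python
import numpy as np

def recycle_chain(chain, log_likelihood, rng=np.random.default_rng(1)):
    """Sketch of Algorithm 6: turn the output of Algorithm 5 into full posterior samples.

    `chain` is a list of (x_r, [x_perp^(1), ..., x_perp^(N)]) states; x_perp^(k) is
    selected with probability proportional to L^y(x_r + x_perp^(k)), see (33)."""
    samples = []
    for x_r, xp in chain:
        logw = np.array([log_likelihood(x_r + v) for v in xp])
        w = np.exp(logw - np.max(logw))            # numerically stabilised weights
        k = rng.choice(len(xp), p=w / w.sum())     # multinomial selection according to (33)
        samples.append(x_r + xp[k])
    return samples
```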

6.2 Delayed acceptance for the optimal parameter-reduced forward model

For the optimal parameter-reduced forward model, the marginal density of the resulting approximate posterior does not, in general, coincide with that of the exact posterior. However, we can still modify the approximate inference Algorithm 4 using the delayed acceptance technique [11, 38, 39] to explore the exact posterior. Delayed acceptance modifies Algorithm 4 by adding a second-stage acceptance/rejection within each MCMC iteration. Here we consider the sample-averaged likelihood \widetilde{\mathcal{L}}_{N}^{y}(x_{r}) defined by either (27) (the optimal parameter-reduced likelihood) or (28) (the optimal parameter-reduced forward model), where the marginal prior sample set \{x_{\perp}^{(i)}\}_{i=1}^{N} is prescribed. The following proposition and Algorithm 7 detail this modification.

Proposition 6.4.

Suppose we have a proposal distribution q(\cdot|x_{r}) defined on the parameter-reduced subspace {\rm Im}(P_{r}). We consider the following two-stage Metropolis-Hastings method. In the first stage, we draw a proposal candidate x_{r}^{\dagger}\sim q(\cdot|x_{r}). Then, with the probability

α(xr|xr)=min{1,~Ny(xr)πpr(xr)q(xr|xr)~Ny(xr)πpr(xr)q(xr|xr)},\alpha(x_{r}^{\dagger}|x_{r})=\min\bigg{\{}1\,,\,\frac{\widetilde{\mathcal{L}}_{N}^{y}(x_{r}^{\dagger})\pi_{\text{pr}}(x_{r}^{\dagger})\,q(x_{r}|x_{r}^{\dagger})}{\widetilde{\mathcal{L}}_{N}^{y}(x_{r})\pi_{\text{pr}}(x_{r})\,q(x_{r}^{\dagger}|x_{r})}\bigg{\}},\vspace{-0.2cm} (34)

we move the proposal candidate x_{r}^{\dagger} to the next stage. In the second stage, we draw a proposal candidate x_{\perp}^{\dagger}\sim\pi_{\rm pr}(\cdot|x_{r}^{\dagger}) in the complement subspace {\rm Ker}(P_{r}) and then accept the pair of proposal candidates (x_{r}^{\dagger},x_{\perp}^{\dagger}) with the probability

β(xr,x|xr,x)=min[1,y(xr+x)~Ny(xr)y(xr+x)~Ny(xr)].\beta(x_{r}^{\dagger},x_{\perp}^{\dagger}|x_{r},x_{\perp})=\min\left[1,\frac{\mathcal{L}^{y}(x_{r}^{\dagger}+x_{\perp}^{\dagger})\,\widetilde{\mathcal{L}}_{N}^{y}(x_{r})}{\mathcal{L}^{y}(x_{r}+x_{\perp})\,\widetilde{\mathcal{L}}_{N}^{y}(x_{r}^{\dagger})}\right]. (35)

Then, the above procedure constructs an ergodic Markov chain with the full posterior πposy(x)\pi_{\text{pos}}^{y}(x) as the invariant density.

Proof.

This result can be derived from the standard delayed acceptance [11]. For completeness, we provide the proof in D. ∎

Requires: A projector PrP_{r}, a sample-averaged likelihood approximation defined in Algorithm 4, a total sample size KK, and a proposal density q(|xr)q(\cdot|x_{r}) on Im(Pr)\text{Im}(P_{r}).
for j=1,2,,Kj=1,2,{.\hss.\hss.},K do
       Given the Markov chain state X(j1)=xr+xX^{(j-1)}=x_{r}+x_{\perp}, propose a candidate xrq(|xr)x_{r}^{\dagger}\sim q(\cdot|x_{r})
       Compute the parameter-reduced likelihood \widetilde{\mathcal{L}}_{N}^{y}(x_{r}^{\dagger}) using either (27) or (28)
       With probability α(xr|xr)\alpha(x_{r}^{\dagger}|x_{r}) in (34) move xrx_{r}^{\dagger} to the next stage as follows
         Propose a candidate xπpr(|xr)x_{\perp}^{\dagger}\sim\pi_{\rm pr}(\cdot|x_{r}^{\dagger})
         Compute the full likelihood y(xr+x)\mathcal{L}^{y}(x_{r}^{\dagger}+x_{\perp}^{\dagger})
         With probability \beta(x_{r}^{\dagger},x^{\dagger}_{\perp}|x_{r},x_{\perp}) in (35) accept (x_{r}^{\dagger},x^{\dagger}_{\perp}), otherwise reject (x_{r}^{\dagger},x^{\dagger}_{\perp})
       Otherwise reject xrx_{r}^{\dagger}
end for
Accept: set X^{(j)}=x_{r}^{\dagger}+x_{\perp}^{\dagger}
Reject: set X^{(j)}=X^{(j-1)}
Return: a Markov chain X(1),X(2),,X(K)X^{(1)},X^{(2)},{.\hss.\hss.},X^{(K)} with the invariant density πposy(x)\pi_{\text{pos}}^{y}(x)
Algorithm 7 Delayed acceptance MCMC for sampling the exact posterior.
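The two-stage structure of Algorithm 7 can be sketched as follows, assuming a deterministic sample-averaged reduced likelihood `log_Ltilde` built from a prescribed sample set as in Algorithm 4; all routine names are placeholders.

```python
import numpy as np

def delayed_acceptance_mcmc(x_r0, x_p0, log_L, log_Ltilde, log_prior_r,
                            sample_cond_prior, propose, log_q, n_steps,
                            rng=np.random.default_rng(2)):
    """Sketch of Algorithm 7: delayed acceptance MCMC for the full posterior.

    log_L(x)        : exact log-likelihood log L^y(x)
    log_Ltilde(x_r) : deterministic sample-averaged reduced log-likelihood, (27) or (28)
    """
    x_r, x_p = x_r0, x_p0
    chain = []
    for _ in range(n_steps):
        x_r_dag = propose(x_r)
        # first stage: acceptance probability (34) on Im(P_r)
        log_alpha = (log_Ltilde(x_r_dag) + log_prior_r(x_r_dag) + log_q(x_r, x_r_dag)
                     - log_Ltilde(x_r) - log_prior_r(x_r) - log_q(x_r_dag, x_r))
        if np.log(rng.uniform()) < log_alpha:
            # second stage: acceptance probability (35) on the full space
            x_p_dag = sample_cond_prior(x_r_dag)
            log_beta = (log_L(x_r_dag + x_p_dag) + log_Ltilde(x_r)
                        - log_L(x_r + x_p) - log_Ltilde(x_r_dag))
            if np.log(rng.uniform()) < log_beta:
                x_r, x_p = x_r_dag, x_p_dag
        chain.append(x_r + x_p)
    return chain
```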
Remark 6.5.

It is worth noting that delayed acceptance also opens the door to further accelerating the exact inference using surrogate models in place of the original forward model. The approximate likelihood \widetilde{\mathcal{L}}_{N}^{y}(x_{r}) is deterministic and dimension reduced, which makes it possible to further approximate \widetilde{\mathcal{L}}_{N}^{y}(x_{r}) using computationally fast surrogate models. In this case, the same delayed acceptance MCMC (Algorithm 7) still produces ergodic Markov chains that converge to the full posterior \pi^{y}_{\rm pos}(x). In contrast, the pseudo-marginal method requires an unbiased Monte Carlo estimate of the exact marginal posterior \pi^{y}_{\rm pos}(x_{r}) at every iteration, which is not straightforward to accelerate using surrogate models.

7 Non-Gaussian priors

The dimension reduction techniques presented in Sections 3 and 4 require one to evaluate the marginal prior density \pi_{\rm pr}(x_{r})=\int_{\text{Ker}(P_{r})}\pi_{\text{pr}}(x_{r}+x_{\perp})\mathrm{d}x_{\perp} and to draw samples from the conditional prior \pi_{\rm pr}(x_{\perp}|x_{r})=\pi_{\text{pr}}(x_{r}+x_{\perp})/\pi_{\rm pr}(x_{r}). While these tasks are straightforward for Gaussian priors, this might not be the case in general. In this section, we use Besov priors as an example to present strategies that extend the proposed dimension reduction methods to some non-Gaussian priors.

7.1 Besov priors

Besov measures [21, 32, 34] naturally appear in image reconstruction problems in which the detection of edges and interfaces is important. Following [32, 34], we construct Besov priors using wavelet functions and consider functions on the one-dimensional torus \mathbb{T}=(0,1]. Starting with a suitable compactly supported mother wavelet function \psi_{\ast}\in\mathcal{L}_{2}(\mathbb{T}), we can define an orthogonal basis

ψj,k(s)=2j2ψ(2jsk),j,k0,k[0,2j1].\psi_{j,k}(s)=2^{\frac{j}{2}}\psi_{\ast}(2^{j}s-k),\quad j,k\in\mathbb{N}_{\geq 0},\;\;k\in[0,2^{j}-1].

This way, given a smoothness parameter r>0r>0 and integrability parameters 1p,q1\leq p,q\leq\infty, a function f:sf(s)f:s\mapsto f(s) in the Besov space pqr(𝕋)\mathcal{B}^{r}_{pq}(\mathbb{T}) can be written as

f(s)=c0+j=0k=02j12j(r+121p)bj,kψj,k(s),f(s)=c_{0}+\sum_{j=0}^{\infty}\sum_{k=0}^{2^{j}-1}2^{-j(r+\frac{1}{2}-\frac{1}{p})}b_{j,k}\psi_{j,k}(s), (36)

and satisfies

fpqr:=(|c0|q+j=0(k=02j1|bj,k|p)qp)1q<.\|f\|_{\mathcal{B}^{r}_{pq}}:=\bigg{(}|c_{0}|^{q}+\sum_{j=0}^{\infty}\Big{(}\sum_{k=0}^{2^{j}-1}|b_{j,k}|^{p}\Big{)}^{\frac{q}{p}}\bigg{)}^{\frac{1}{q}}<\infty.

In a Bayesian setting, we can set p=q and define the Besov-\mathcal{B}^{r}_{pp} prior with the pdf below. (This pdf is used to convey the intuition rather than as a rigorous characterization, as it is defined with respect to the non-existent infinite-dimensional Lebesgue measure. However, the finite-dimensional discretization of the Besov measure, which is used in the numerical simulations, has a pdf of this form with respect to the Lebesgue measure.)

πpr(f)exp(γfpprp),\pi_{\rm pr}(f)\propto\exp\Big{(}-\gamma\|f\|_{\mathcal{B}^{r}_{pp}}^{p}\Big{)}, (37)

where γ>0\gamma>0 is a scale parameter. One can easily generalize the above definition of Besov priors to multidimensional settings by taking tensor products of the one-dimensional basis and associated coefficients.

We can discretize the Besov prior by truncating the infinite sum in (36) to the first DD terms. This way, collecting all the coefficients into a parameter vector x=(c0,b0,0,b0,1,,b1,0,b1,1,)dx=(c_{0},b_{0,0},b_{0,1},{.\hss.\hss.},b_{1,0},b_{1,1},{.\hss.\hss.})\in\mathbb{R}^{d}, where d=2D+1d=2^{D+1}, the discretized Besov-ppr\mathcal{B}^{r}_{pp} prior can be equivalently expressed as a product-form distribution over the parameter xx with the pdf

πpr(x)=i=1dπpr(i)(xi)withπpr(i)(xi)exp(γ|xi|p).\pi_{\rm pr}(x)=\prod_{i=1}^{d}\pi_{\rm pr}^{(i)}(x_{i})\quad{\rm with}\quad\pi_{\rm pr}^{(i)}(x_{i})\propto\exp\left(-\gamma|x_{i}|^{p}\right). (38)
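As an illustration, below is a minimal sketch of working with the discretized prior (38) for p=1, in which case each coordinate is a Laplace random variable with scale 1/\gamma; the truncation level and \gamma used here are illustrative choices, not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma, p = 2 ** 7, 1.0, 1          # illustrative dimension and Besov parameters

def log_prior(x):
    """Unnormalised log-density of the discretised Besov prior (38)."""
    return -gamma * np.sum(np.abs(x) ** p)

# For p = 1 each coordinate is Laplace(0, 1/gamma), so prior samples are direct:
x_sample = rng.laplace(loc=0.0, scale=1.0 / gamma, size=d)
```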

7.2 Dimension reduction via coordinate selection

In general, we do not have closed-form expressions for the marginal \pi_{\rm pr}(x_{r}) and the conditional \pi_{\rm pr}(x_{\perp}|x_{r}), unless the projector P_{r} is aligned with the canonical basis. This leads us to construct the reduced subspace by selecting a subset of the canonical basis vectors. As discussed in Remarks 3.3 and 3.5, this task can be achieved by identifying an index set \mathcal{I}\subset\{1,\ldots,d\} with cardinality r such that \mathcal{I} contains the indices of the r largest values of i\mapsto(\Gamma^{-1})_{ii}H(y)_{ii} in the data-dependent case, or of i\mapsto(\Gamma^{-1})_{ii}\mathbb{E}[H(Y)]_{ii} in the data-free case. This leads to the projector P_{r}=\sum_{i\in\mathcal{I}}e_{i}e_{i}^{\top}, where \{e_{1},\ldots,e_{d}\} is the canonical basis of \mathbb{R}^{d}. Thus, the product form of (38) yields the marginal prior and the conditional prior

πpr(xr)=iπpr(i)(xi)andπpr(x|xr)=iπpr(i)(xi),\displaystyle\pi_{\rm pr}(x_{r})=\prod_{i\in\mathcal{I}}\pi_{\rm pr}^{(i)}(x_{i})\quad{\rm and}\quad\pi_{\rm pr}(x_{\perp}|x_{r})=\prod_{i\notin\mathcal{I}}\pi_{\rm pr}^{(i)}(x_{i}),

respectively. In this formulation, evaluating the marginal prior density and drawing samples from the conditional prior become straightforward tasks.
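A short sketch of this coordinate selection rule in the data-free case, assuming a diagonal \Gamma (consistent with the product-form prior) and a Monte Carlo estimate of the diagonal of \mathbb{E}[H(Y)]; the names are illustrative.

```python
import numpy as np

def coordinate_selection(H_diag, Gamma_diag, r):
    """Select the index set I of the r largest values of (Gamma^{-1})_{ii} E[H(Y)]_{ii}
    and return it with the coordinate-aligned projector P_r = sum_{i in I} e_i e_i^T."""
    scores = H_diag / Gamma_diag
    idx = np.argsort(scores)[::-1][:r]            # indices of the r largest scores
    d = len(H_diag)
    P_r = np.zeros((d, d))
    P_r[idx, idx] = 1.0
    return idx, P_r

# With a product-form prior, the marginal and conditional priors then factorise over I
# and its complement, e.g. log pi_pr(x_r) = sum_{i in I} log pi_pr^(i)(x_i).
```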

Remark 7.1.

For 1\leq p<2, the tails of \pi_{\rm pr}^{(i)}(x_{i}) defined in (38) are heavier than Gaussian tails, and hence Assumptions 3.1 and 4.2 may not be satisfied. Nonetheless, one can still apply the proposed dimension reduction methods numerically, albeit without the error bounds in (14) and (25). In this case, we set \Gamma to be the identity matrix, in accordance with the fact that the prior components are independent and identically distributed.

Remark 7.2 (Other sparsity-inducing priors).

There exist other shrinkage priors similar to Besov priors, in which the random function is expressed as a weighted linear combination of basis functions and the associated random weights follow other types of heavy-tailed distributions, for example, the horseshoe prior and the Student's t prior; see [10] and the references therein for further discussion. The coordinate selection technique introduced here may also be applicable to those shrinkage priors.

7.3 Dimension reduction via prior normalization

Alternatively, we consider the case where the prior can be defined as the pushforward of the standard Gaussian measure with pdf μ(x)exp(12x22)\mu(x)\propto\exp(-\frac{1}{2}\|x\|_{2}^{2}) under a C1C^{1}-diffeomorphism T:ddT:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}, which takes the form

πpr(x)=Tμ(x).\pi_{\rm pr}(x)=T_{\sharp}\mu(x). (39)

In other words, \pi_{\rm pr}(x) is the pdf of the random vector X=T(Z) where Z\sim\mu(z). For the Besov-\mathcal{B}^{r}_{pp} prior defined in (38), the diffeomorphism T has a diagonal form T(z)=(T_{1}(z_{1}),\ldots,T_{d}(z_{d})) with T_{i}(z_{i})=\Phi_{i}^{-1}(\Psi(z_{i})), where \Phi_{i}(\cdot) is the cumulative distribution function (cdf) of \pi_{\rm pr}^{(i)}(x_{i}) defined in (38) and \Psi(\cdot) is the cdf of the standard Gaussian. We provide details of the cdf \Phi_{i}(\cdot) in E.

The invertibility of TT allows us to reparametrize the Bayesian inverse problem in terms of the variable z=T1(x)z=T^{-1}(x), which is endowed with the Gaussian prior μ\mu. With this change of variable, the likelihood function becomes zy(T(z))z\mapsto\mathcal{L}^{y}(T(z)), and thus the matrix H(y)H(y) used to reduce the dimension of zz should be

H_{z}(y)=\int_{\mathbb{R}^{d}}\nabla T(z)^{\top}\,\nabla\log\mathcal{L}^{y}(T(z))\,\nabla\log\mathcal{L}^{y}(T(z))^{\top}\,\nabla T(z)\,\mu(z)\,\mathrm{d}z,

in the data-dependent case, and \mathbb{E}[H_{z}(Y)] in the data-free case. For the optimal parameter-reduced forward model in the Gaussian likelihood case (cf. Section 4), the forward model x\mapsto G(x) is replaced by z\mapsto G(T(z)). This way, the matrix H^{G} should be replaced by

HzG=dT(z)G(T(z))Σobs1G(T(z))T(z)μ(z)dz.H^{G}_{z}=\int_{\mathbb{R}^{d}}\nabla T(z)^{\top}\nabla G(T(z))^{\top}\Sigma_{\text{obs}}^{-1}\nabla G(T(z))\nabla T(z)\,\mu(z)\mathrm{d}z.

Using either of these matrices, we obtain a projector P_{r} to reduce the dimension of the variable z=z_{r}+z_{\perp}, where z_{r}=P_{r}z and z_{\perp}=(I_{d}-P_{r})z. In terms of the original variable x, the dimension reduction identifies x_{r}=T(P_{r}T^{-1}(x)) as the component informed by the observed data, while x_{\perp}=T((I_{d}-P_{r})T^{-1}(x)) is informed by the prior only. Since x_{r} and x_{\perp} depend nonlinearly on x, the resulting method can be interpreted as a nonlinear dimension reduction method.
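A possible sketch of this prior normalization for the Laplace marginals (p=1), together with a Monte Carlo estimate of H_{z}(y); the gradient routine grad_log_L (with signature grad_log_L(x, y)) and the data y are assumed to be supplied by the user, and scipy is used for the cdfs.

```python
import numpy as np
from scipy.stats import norm, laplace

gamma = 1.0   # Besov scale parameter (illustrative)

def T(z):
    """Componentwise map T_i(z_i) = Phi_i^{-1}(Psi(z_i)) pushing N(0, I) to the Laplace prior."""
    return laplace.ppf(norm.cdf(z), loc=0.0, scale=1.0 / gamma)

def grad_T(z):
    """Diagonal of nabla T(z), obtained by differentiating Phi(T(z)) = Psi(z)."""
    return norm.pdf(z) / laplace.pdf(T(z), loc=0.0, scale=1.0 / gamma)

def H_z_estimate(y, grad_log_L, d, n_mc=1000, rng=np.random.default_rng(3)):
    """Monte Carlo estimate of H_z(y) = E_z[(dT)^T g g^T dT] with g = grad_x log L^y(T(z))."""
    H = np.zeros((d, d))
    for _ in range(n_mc):
        z = rng.standard_normal(d)
        g = grad_log_L(T(z), y) * grad_T(z)       # chain rule with the diagonal Jacobian of T
        H += np.outer(g, g)
    return H / n_mc
```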

8 Example 1: elliptic PDE

We first validate our methods using an inverse problem of identifying the coefficient of a two-dimensional elliptic PDE from point observations of its solution.

8.1 Problem setup

Consider the problem domain \Omega=[0,1]\times[0,1] with boundary \partial\Omega. We denote the spatial coordinate by s=(s_{1},s_{2})\in\Omega. We model the steady-state potential field p(s), for a given conductivity field \kappa(s) and forcing function f(s), using Poisson's equation

(κ(s)p(s))=f(s),sΩ.-\nabla\cdot\left(\kappa(s)\nabla p(s)\right)=f(s),\quad s\in\Omega. (40)

Let Ωn={sΩ|s2=0}{sΩ|s2=1}\partial\Omega_{\mathrm{n}}=\{s\in\partial\Omega\,|\,s_{2}=0\}\cup\{s\in\partial\Omega\,|\,s_{2}=1\} denote the top and bottom boundaries, and Ωd={sΩ|s1=0}{sΩ|s1=1}\partial\Omega_{\mathrm{d}}=\{s\in\partial\Omega\,|\,s_{1}=0\}\cup\{s\in\partial\Omega\,|\,s_{1}=1\} denote the left and right boundaries. We impose the mixed boundary condition:

p(s)=0,\;\forall s\in\partial\Omega_{\mathrm{d}},\quad\mathrm{and}\quad(\kappa(s)\nabla p(s))\cdot\vec{n}(s)=0,\;\forall s\in\partial\Omega_{\mathrm{n}},

and let the forcing function take the form

f(s)=c\,\Big(\exp\big(-\tfrac{1}{2r^{2}}\|s-a\|^{2}\big)-\exp\big(-\tfrac{1}{2r^{2}}\|s-b\|^{2}\big)\Big),

with r=0.05r=0.05, which is the superposition of two Gaussian-shaped sink/source terms centered at a=(0.5,0.5)a=(0.5,0.5) and b=(2.5,0.5)b=(2.5,0.5), scaled by a constant c=6×104c=6\times 10^{-4}. The conductivity field κ(s)\kappa(s) is endowed with a log-normal prior. That is, letting x(s)=logκ(s)x(s)=\log\kappa(s), the Gaussian process prior for x(s)x(s) is defined by the stochastic PDE (see [37] and references therein):

x(s)+γx(s)=𝒲(s),sΩ,-\triangle x(s)+\gamma x(s)=\mathcal{W}(s),\quad s\in\Omega, (41)

where \triangle is the Laplace operator and \mathcal{W}(s) is the white noise process. We impose a no-flux boundary condition on the above SPDE and set \gamma=10. Equations (40) and (41) are solved using the finite element method with bilinear basis functions. A mesh with 80\times 80 elements is used in this example, which leads to a discretised parameter of dimension d=6400.

Refer to caption
Figure 1: Setup of three test cases for the elliptic PDE example. The observation locations are shown as dots. Each column represents a test case, in which the top row shows the true conductivity field and the bottom row shows the corresponding potential field.

We generate three “true” conductivity fields from the prior distribution and use them to simulate synthetic observed data sets. The true conductivity fields and the simulated potential fields are shown in Figure 1. Observations of the potential fields are measured at the m=36 discrete locations shown as black dots in Figure 1. We set the standard deviation of the observation noise to \sigma=0.0415, which corresponds to a signal-to-noise ratio of about 20.

8.2 Low-dimensional posterior approximations

Refer to caption
Refer to caption
Figure 2: Elliptic PDE example. KL divergence of the full posterior densities from the approximate posterior densities defined by projectors with various ranks. Each column represents posteriors conditioned on a given data set. Top row: approximate posteriors are defined by the optimal parameter-reduced likelihood. Bottom: approximate posteriors are defined by the optimal parameter-reduced forward model.

We first compare the approximate posterior densities defined by the data-free dimension reduction with those of the data-dependent dimension reduction and of the truncated Karhunen–Loève expansion of the prior. We build five sets of projectors: the data-free projectors as detailed in Section 3.3, three sets of data-dependent projectors (see Section 3.2) corresponding to the three synthetic data sets, and projectors defined by the leading eigenvectors of the prior covariance (i.e., the truncated Karhunen–Loève expansion, hereinafter referred to as the “prior-based projector”). For each of the data sets, the corresponding data-dependent projectors are constructed using the adaptive MCMC algorithm of [54]. Each set consists of projectors with ranks r=2^{2},2^{4},\ldots,2^{8}. For each projector, we compute the KL divergence of the full posterior from the approximate posterior density defined by the optimal parameter-reduced likelihood (7). The results are shown in the top row of Figure 2. Using the same set of projectors, we also compute the KL divergences for the approximate posterior densities defined by the optimal parameter-reduced forward model (21). The results are shown in the bottom row of Figure 2.

In these experiments, we estimate the KL divergence using Monte Carlo integration with NN posterior samples, which yields

DKL(πposyπ~posy)1Ni=1N(logy(x(i))log~y(x(i)))+log(1Ni=1Nexp(log~y(x(i))logy(x(i)))),D_{KL}(\pi^{y}_{\text{pos}}\|\tilde{\pi}^{y}_{\text{pos}})\approx\frac{1}{N}\sum_{i=1}^{N}\big{(}\log\mathcal{L}^{y}(x^{(i)})-\log\tilde{\mathcal{L}}^{y}(x^{(i)})\big{)}+\log\Big{(}\frac{1}{N}\sum_{i=1}^{N}\exp\big{(}\log\tilde{\mathcal{L}}^{y}(x^{(i)})-\log\mathcal{L}^{y}(x^{(i)})\big{)}\Big{)},

where the second sample average accounts for the ratio between the normalizing constants. For approximations that are close to the full posterior, using a reasonable number of (independent) posterior samples, e.g., the 10^{5} samples used here, makes the standard deviation of the estimated KL divergence insignificant compared with the mean estimate in our numerical examples.
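For reference, the estimator above can be implemented in a few lines; the sketch below assumes arrays of full and approximate log-likelihood values evaluated at posterior samples.

```python
import numpy as np
from scipy.special import logsumexp

def kl_estimate(log_L_full, log_L_approx):
    """Monte Carlo estimate of D_KL(pi_pos^y || pi~_pos^y) from N posterior samples,
    following the formula above; the second term corrects for the normalising constants."""
    diff = np.asarray(log_L_full) - np.asarray(log_L_approx)
    N = diff.size
    return diff.mean() + (logsumexp(-diff) - np.log(N))
```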

We observe that the optimal parameter-reduced likelihood and the optimal parameter-reduced forward model result in approximate posteriors with similar accuracy. For sufficiently large ranks (r8r\geq 8), the most accurate approximate posterior densities are obtained by the data-dependent projectors of the corresponding data set, followed by those obtained by the data-free projectors. We also observe that, for each data set, the data-dependent projectors constructed using other data sets result in less accurate approximations. By allowing a marginal loss of accuracy compared to the data-dependent construction, the data-free construction bypasses the computationally costly online dimension reduction process for every new data set. Compared with the prior-based dimension reduction, which is also an offline method, the data-free construction offers significantly more accurate approximations in this example.

For each of the data sets, we also compare the errors of the approximate posterior densities with the bounds defined in (14) and (25). Note that the right hand sides of (14) and (25) are the same up to the constant κ\kappa in this example. We plot the errors and the bounds (with κ=1\kappa=1 for Gaussian prior) in Figure 3, in which all approximate posterior densities are defined by the data-free projectors. In this example, we observe that the errors of the approximate posterior densities follow the same trend as their corresponding error bounds. Note that both (14) and (25) give upper bounds on the expected KL divergence, and thus they may not bound the KL divergence for a realization of the data.

Refer to caption
Figure 3: Elliptic PDE example. The bounds in (14) and (25) (with κ=1\kappa=1) are compared to KL divergence of the full posterior densities from the approximate posterior densities defined by the data-free projectors with various ranks. Left: approximate posteriors are defined by the optimal parameter-reduced likelihood. Right: approximate posteriors are defined by the optimal parameter-reduced forward model.

8.3 Subspace accelerated sampling

Table 1: Acronyms of various inference algorithms used in numerical comparisons.
OL approximate inference using Algorithm 4 and the optimal parameter-reduced likelihood function in (27)
PM exact inference using the pseudo-marginal method (Algorithms 5 and 6)
OF approximate inference using Algorithm 4 and the optimal parameter-reduced forward model in (28)
DA exact inference using the delayed acceptance algorithm (Algorithm 7) with the approximation defined by the parameter-reduced forward model in (28)
H-MALA exact inference using the Hessian preconditioned Langevin MCMC [45]
PCN exact inference using the preconditioned Crank–Nicolson MCMC [7, 15]

We demonstrate the sampling performance of various approximate and exact inference algorithms introduced in Sections 5 and 6 using the posterior density conditioned on the first data set. All the methods used in the comparison and their acronyms are summarized in Table 1.

We use the Hessian-preconditioned Metropolis-Adjusted Langevin Algorithm (H-MALA) and the preconditioned Crank–Nicolson (PCN) MCMC as reference MCMC methods for sampling the full-dimensional posterior. Since H-MALA uses the low-rank decomposition of the Hessian matrix of the logarithm of the posterior density, computed at the maximum a posteriori point, to precondition MCMC, it can also be viewed as a data-dependent subspace-accelerated method. We refer to [17, 45] for a detailed discussion. In order to make a fair comparison with H-MALA, the MCMC algorithm we use on our data-free informed subspace is based on a Langevin proposal preconditioned by the projection of the same Hessian matrix used by H-MALA onto the data-free informed subspace.

In Figure 4, the contours of the marginal posterior densities (marginalized onto the first two data-free LIS basis vectors) produced by the approximate inference methods (with r=16 and r=48) are compared with those produced by their exact inference modifications (with r=48). We observe that the results produced by the approximate inference methods approach those of their exact modifications as the rank of the informed subspace increases.

Refer to caption
Figure 4: Elliptic PDE example. Contours of the marginal posterior densities computed by various inference algorithms using the data-free projector. (a): OL (with r=16r=16 and r=48r=48) and PM (with r=48r=48). (b): OF (r=16r=16 and r=48r=48) and DA (with r=48r=48). Here (x1,x2)(x_{1},x_{2}) represent the directions spanned by the first two data-free LIS basis vectors.

To measure the efficiency of various MCMC methods, we use the average integrated autocorrelation times (IACTs) of parameters

τ=1di=1dIACT(xi),\tau=\frac{1}{d}\sum_{i=1}^{d}{\rm IACT}(x_{i}),

where {\rm IACT}(x_{i}) is the IACT of the i-th component of x; see [38, Section 12.7] for the definition of the IACT. The data-free projectors with different ranks r and two different sample sizes, N=2 and N=5, are used in this experiment. Here H-MALA and PCN are used as base cases to benchmark the MCMC methods accelerated by the informed subspace. All the methods are simulated for 5\times 10^{5} iterations and repeated 10 times to report the mean and the standard deviation of \tau. The initial states of all the simulations are randomly selected from a pre-computed Markov chain of posterior samples to avoid burn-in. The results are reported in Table 2.
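As an illustration, the average IACT diagnostic can be computed as sketched below; the simple positive-window truncation of the autocorrelation is an assumption on our part, not necessarily the estimator of [38].

```python
import numpy as np

def iact(x):
    """Integrated autocorrelation time of a 1-d chain, truncated at the first non-positive lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = x.size
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * np.var(x))
    tau = 1.0
    for rho in acf[1:]:
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau

def average_iact(chain):
    """Average IACT over all parameter components; `chain` has shape (n_samples, d)."""
    return np.mean([iact(chain[:, i]) for i in range(chain.shape[1])])
```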

For the approximate inference methods (OL and OF), the average IACTs consistently increase with the rank of the projectors, as the sampling performance of the Langevin proposal is expected to decay with the underlying parameter dimension. Both OL and OF produce significantly smaller IACTs compared with the full-dimensional H-MALA.

Table 2: Elliptic PDE example. Average IACTs of the parameters computed by various inference algorithms applied to the posterior conditioned on data set 1. Here the symbol - indicates poorly mixing Markov chains that do not yield a reliable estimate of the IACT. All the data reported here are in the form mean\pmstandard deviation.
OL (IACT)   PM (IACT)   \sqrt{\textrm{var}}[\log\widetilde{\mathcal{L}}_{N}^{y}]   OF (IACT)   DA (IACT)   \mathbb{E}[\beta]   H-MALA (IACT)   PCN (IACT)
N=2N=2 r=16r=16 18.9±\pm1.5 163±\pm29 4.45±\pm0.20 19.7±\pm1.2 - <<0.1 164±\pm17 1303±\pm139
r=24r=24 34.9±\pm1.1 106±\pm13 2.65±\pm0.19 35.7±\pm2.1 - <<0.1
r=32r=32 52.6±\pm3.0 91.8±\pm5.3 1.80±\pm0.10 57.1±\pm3.1 208±\pm39 0.31±\pm0.02
r=40r=40 59.4±\pm2.4 91.6±\pm6.1 0.93±\pm0.03 63.0±\pm2.1 208±\pm26 0.36±\pm0.02
r=48r=48 60.7±\pm2.4 83.8±\pm5.6 0.69±\pm0.02 66.9±\pm4.3 146±\pm10 0.46±\pm0.01
N=5N=5 r=16r=16 18.7±\pm1.0 102±\pm8.2 2.28±\pm0.10 19.3±\pm1.3 - <<0.1
r=24r=24 32.7±\pm1.7 72.6±\pm4.3 1.38±\pm0.05 37.8±\pm2.5 255±\pm36 0.19±\pm0.02
r=32r=32 48.8±\pm1.2 71.6±\pm3.0 0.97±\pm0.06 55.8±\pm1.1 214±\pm38 0.31±\pm0.01
r=40r=40 55.2±\pm2.1 67.4±\pm3.4 0.55±\pm0.03 61.7±\pm2.8 173±\pm21 0.39±\pm0.01
r=48r=48 56.0±\pm3.3 64.9±\pm3.2 0.41±\pm0.02 69.9±\pm3.5 148±\pm26 0.47±\pm0.01

Compared to the OL method, the PM method (the exact inference counterpart for OL) has a different behavior. Here we recall that the sample-averaged parameter-reduced likelihood, ~Ny\widetilde{\mathcal{L}}_{N}^{y}, in the PM method is a random estimator, whereas ~Ny\widetilde{\mathcal{L}}_{N}^{y} in the OL method is deterministic because of the usage of prescribed samples. The standard deviation of the logarithm of ~Ny\widetilde{\mathcal{L}}_{N}^{y} in Table 2 confirms that low-rank projectors have rather large Monte Carlo errors as the approximation accuracy is controlled by the rank truncation (cf. (14)). The exactness of the PM method comes at the cost of Monte Carlo error, which is controlled by the sample size NN and the rank of the projector. We observe that increasing either the rank or the sample size can narrow the gap between the IACTs produced by PM and its OL counterpart. This experiment clearly suggests that PM needs to balance the sample size NN and the rank of the projector to achieve the optimal performance.

Compared to the OF method, the DA method (the exact inference counterpart of OF) produces the largest IACTs among all subspace inference methods. This result is not surprising, as the second-stage acceptance/rejection of DA necessarily deteriorates the statistical performance [11]. In Table 2, we observe that the second-stage acceptance rates, \mathbb{E}[\beta], increase with more accurate approximations obtained with higher projector ranks and larger sample sizes. As a result, the gaps between the IACTs produced by OF and DA are smaller for higher projector ranks and larger sample sizes.

Overall, the approximate inference methods have better statistical performance than the other methods in this example (cf. Table 2) and can obtain reasonably accurate results, as shown in Figures 2 and 4. With the additional cost that comes in the form of either Monte Carlo error (PM) or the second-stage acceptance/rejection (DA), the exact inference modifications produce Markov chains with larger IACTs. Among all the exact inference methods, PM produces smaller IACTs than the full-dimensional H-MALA, PCN, and DA.

It is worth mentioning that each iteration of the subspace MCMC methods requires N forward model simulations, whereas H-MALA requires only one forward model simulation per iteration. In this example, the approximate inference methods (OL and OF) with N=2 still outperform H-MALA in terms of IACTs per model evaluation. The exact inference methods, however, need more model evaluations than H-MALA to obtain the same number of effective samples (we will show in the next example that H-MALA can be outperformed by PM and DA). Notice that the forward model evaluations within each iteration can be embarrassingly parallelized: with parallel computing resources available, the subspace MCMC methods can still be more efficient than H-MALA in terms of the effective sample size per wall-clock time.

9 Example 2: PET with Poisson data

The second example is a two-dimensional PET imaging problem, where we aim to reconstruct the density of an object from integer-valued Poisson observed data. We use a Besov prior here, for which we assess both the coordinate selection technique and the prior normalization method presented in Sections 7.2 and 7.3.

9.1 Problem setup

In PET imaging, the goal is to identify an object of interest located inside a domain Ω\Omega subjected to gamma rays. The rays travel through Ω\Omega from multiple sources and the detectors count the number of incident photons (thus the data are integer-valued), see Figure 5(a). The object of interest is described by its density of mass which is represented by sexp(f(s))s\mapsto\exp(f(s)), where f:Ωf:\Omega\rightarrow\mathbb{R} follows a Besov-112\mathcal{B}^{2}_{11} prior with the Haar wavelet, see Section 7.1. The change of intensity of a gamma ray along the path, i(s),sΩ\ell_{i}(s),s\in\Omega, can be modelled using Beer’s law:

Id,i=Is,iexp(i(s)exp(f(s))𝑑s),\displaystyle I_{d,i}=I_{s,i}\exp\Big{(}-\int_{\ell_{i}(s)}\exp\big{(}f(s)\big{)}ds\Big{)}, (42)

where Id,i0I_{d,i}\in\mathbb{R}_{\geq 0} and Is,i0I_{s,i}\in\mathbb{R}_{\geq 0} are the intensities at the detector and at the source, respectively. We assume that all the gamma ray sources have the same intensity, Is,i=IsI_{s,i}=I_{s} for i=1,,mi=1,{.\hss.\hss.},m.

In this example, the domain Ω\Omega is discretized into a regular grid with dd cells and the logarithm of the density is assumed to be piecewise constant. This yields the discretized parameter xdx\in\mathbb{R}^{d}. The line integrals in (42) are approximated by

\int_{\ell_{i}(s)}\exp(f(s))\,\mathrm{d}s\approx\sum_{j=1}^{d}A_{ij}\,\exp(f_{j}),

where Aij0A_{ij}\in\mathbb{R}_{\geq 0} is the length of the intersection between line i\ell_{i} and cell jj, and exp(fj)\exp(f_{j}) is the discretized density in cell jj. By discretizing the wavelet basis on the same grid and following the parametrization discussed in Section 7, we can write

f=Bx,f=Bx,

where B\in\mathbb{R}^{d\times d} consists of the discretized basis functions and x consists of the associated coefficients. In this setting, x follows a product-form Laplace distribution given by (38) with p=1 and with the scale parameter arbitrarily set to \gamma=1. Given a total of m gamma ray paths and the corresponding matrix A\in\mathbb{R}^{m\times d}, the forward model G:\mathbb{R}^{d}\rightarrow\mathbb{R}^{m} is defined as

G(x)=Isexp(Aexp(Bx)).\displaystyle G(x)=I_{s}\,\exp(-A\,\exp(Bx)). (43)

We consider a PET setup shown in Figure 5(a): the problem domain Ω=[10,10]2\Omega=[-10,10]^{2} is discretised into a d=64×64d=64\times 64 regular grid, five radiation sources with intensity Is=10I_{s}=10 are positioned with equal spaces on one side of a circle, spanning a 9090^{\circ} angle, and each source sends a fan of 30 gamma rays that are measured by detectors. This leads to m=150m=150 observations. The model setup is based on the code of [30].

Refer to caption

(a) Discretised domain of interest Ω=[10,10]2\Omega=[-10,10]^{2} (mesh), radiation sources (red dots), and detectors (blue dots).
Refer to caption
(b) Top row: three density functions generated from the prior. Bottom row: corresponding intensity function (solids lines) and measured counting data sets (dots).
Figure 5: The PET setup and three test cases.

We denote the observed data by ymy\in\mathbb{N}^{m} where each element yiy_{i} is associated with the ii-th gamma ray in the model. For the ii-th gamma ray, recall that the intensity at the detector is computed by Gi(x)G_{i}(x) for some input parameter xx, and then the probability mass function of observing the counting data Yi=yiY_{i}=y_{i} is given by

(Yi=yi|x)=Gi(x)yiexp(Gi(x))yi!.\mathbb{P}(Y_{i}=y_{i}|x)=\frac{G_{i}(x)^{y_{i}}\,\exp(-G_{i}(x))}{y_{i}!}.

Supposing that we observe the counting data at all the detectors and that the measurement processes are independent, we can write the likelihood function for the complete data as

y(x)=i=1m(Yi=yi|x)=i=1mGi(x)yiexp(Gi(x))yi!.\displaystyle\mathcal{L}^{y}(x)=\prod_{i=1}^{m}\mathbb{P}(Y_{i}=y_{i}|x)=\prod_{i=1}^{m}\frac{G_{i}(x)^{y_{i}}\,\exp(-G_{i}(x))}{y_{i}!}. (44)

As shown in F, the Fisher information matrix of the above likelihood function takes the form

(x)=G(x)M(x)G(x),\displaystyle\mathcal{I}(x)=\nabla G(x)^{\top}M(x)\nabla G(x), (45)

where M(x)M(x) is a diagonal matrix with Mii(x)=Gi(x)1M_{ii}(x)=G_{i}(x)^{-1} along its diagonal. We generate three “true” density functions from the prior distribution and use them to simulate synthetic data sets. The true density functions and the simulated data are shown in Figure 5(b).
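For concreteness, below is a sketch of the forward model (43), the Poisson log-likelihood (44), and the Fisher information (45); the ray matrix A and the discretized wavelet basis B are assumed given, and the Jacobian of G is assembled with the chain rule.

```python
import numpy as np

def forward_model(x, A, B, I_s=10.0):
    """PET forward model (43): G(x) = I_s * exp(-A exp(Bx))."""
    return I_s * np.exp(-A @ np.exp(B @ x))

def grad_forward_model(x, A, B, I_s=10.0):
    """Jacobian of G via the chain rule: dG_i/dx_j = -G_i(x) * sum_k A_ik exp((Bx)_k) B_kj."""
    f = B @ x
    G = I_s * np.exp(-A @ np.exp(f))
    return -G[:, None] * ((A * np.exp(f)[None, :]) @ B)

def poisson_log_likelihood(x, y, A, B, I_s=10.0):
    """Log of the likelihood (44), dropping the x-independent terms log(y_i!)."""
    G = forward_model(x, A, B, I_s)
    return np.sum(y * np.log(G) - G)

def fisher_information(x, A, B, I_s=10.0):
    """Fisher information (45): grad G(x)^T diag(1/G(x)) grad G(x)."""
    G = forward_model(x, A, B, I_s)
    J = grad_forward_model(x, A, B, I_s)
    return J.T @ (J / G[:, None])
```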

9.2 Numerical results using coordinate selection

We first present the results obtained by applying the coordinate selection method (cf. Section 7.2). Similar to the first example, here we will first compare the accuracy of approximate posterior densities defined by various approaches and projectors, and then benchmark the performance of MCMC methods. We adopt the same setup and acronyms as in Example 1 and Table 1.

For the approximate posterior densities, five sets of projectors built from selected coordinates, including the data-free projectors, three sets of data-dependent projectors (corresponding to three data sets), and the prior-based truncated wavelet basis, are considered. Each set consists of projectors with ranks r=23,24,,27r=2^{3},2^{4},{.\hss.\hss.},2^{7}. The KL divergences of the full posteriors from the approximated posterior densities defined by the optimal parameter-reduced likelihood (7) are shown in the top row of Figure 6, while those of the optimal parameter-reduced forward model (21) are shown in the bottom row of Figure 6.

Refer to caption
Refer to caption
Figure 6: Same as Figure 2, but for the PET test case and where we used coordinate selection.

In this example, we observe results similar to those of the elliptic PDE example. The optimal parameter-reduced likelihood and the optimal parameter-reduced forward model result in approximate posteriors with similar accuracy. The most accurate approximate posterior densities are obtained by the data-dependent projectors of the corresponding data set, followed by those obtained by the data-free projectors. For each data set, the data-dependent projectors constructed using other data sets generally result in less accurate approximations. However, the accuracy gaps between the data-free projectors and the data-dependent projectors (using other data sets) are not as significant as in the elliptic PDE example. This may be caused by either the coordinate selection method or the rather large data size in this example. Compared with the prior-based dimension reduction, which is also an offline method, the data-free construction offers significantly more accurate approximations in this example. Overall, the data-free dimension reduction provides reasonably accurate posterior approximations for the Poisson observation process considered here.

Refer to caption
Figure 7: Same as Figure 3, but for the PET test case and where we used coordinate selection.

Although it remains an open question whether the bounds in (14) and (25) apply to Besov priors, we provide a comparison of the errors of the approximate posterior densities (defined by the data-free projectors) with these bounds. The results are shown in Figure 7, with \kappa replaced by 1. Interestingly, we still observe that the errors of the approximate posterior densities follow the same trend as their corresponding bounds.

We then compare the performance of various subspace-driven inference methods. In Figure 8, the contours of the marginal posterior densities produced by the approximate inference methods (with r=16 and r=48) are compared with those produced by their exact inference modifications (with r=48). In this example, we observe that the contours produced by the approximate inference methods are visually similar to those of the exact inference methods. In addition, with increasing ranks, the contours produced by the approximate inference methods approach those of the exact inference methods.

Refer to caption
Figure 8: Same as Figure 4, but for PET and data set 1. Here r=16,48r=16,48 are used in OL and OF, and r=48r=48 is used in PM and DA. Here (x1,x2)(x_{1},x_{2}) represent the first two coordinates selected by the data-free method.
Table 3: Same as Table 2, but for PET and data set 1. The coordinate selection is used in dimension reduction. Ranks of the subspaces are chosen to be 1616, 3232, and 4848.
OL (IACT)   PM (IACT)   \sqrt{\textrm{var}}[\log\widetilde{\mathcal{L}}_{N}^{y}]   OF (IACT)   DA (IACT)   \mathbb{E}[\beta]   H-MALA (IACT)   PCN (IACT)
N=2N=2 r=16r=16 33.2±\pm1.7 85.1±\pm2.7 1.54±\pm0.02 33.9±\pm1.1 214±\pm44 0.18±\pm0.06 95.9±\pm3.3 387±\pm79
r=32r=32 40.0±\pm1.8 54.1±\pm3.1 0.61±\pm.007 41.0±\pm2.2 87.8±\pm6.5 0.55±\pm0.01
r=48r=48 45.3±\pm1.2 49.4±\pm2.6 0.45±\pm.002 46.0±\pm2.2 73.5±\pm5.8 0.66±\pm0.01
N=5N=5 r=16r=16 31.4±\pm1.9 60.0±\pm6.2 0.93±\pm.006 31.8±\pm1.4 220±\pm65 0.22±\pm0.04
r=32r=32 40.8±\pm2.5 47.6±\pm2.5 0.39±\pm.004 42.8±\pm2.4 88.0±\pm6.8 0.56±\pm0.01
r=48r=48 46.1±\pm2.2 46.5±\pm1.4 0.29±\pm.001 46.3±\pm1.9 69.5±\pm4.0 0.67±\pm0.01

We use the average IACTs of the density function, τ=1di=1dIACT(fi),\tau=\frac{1}{d}\sum_{i=1}^{d}{\rm IACT}(f_{i}), to measure the efficiency of various MCMC methods. The results are reported in Table 3. Here both PCN and H-MALA are implemented to sample the posterior in the transformed coordinate equipped with a Gaussian prior (cf. Section 7.3).

For the approximate inference methods (OL and OF), the average IACTs consistently increase with the rank of the projectors, as the sampling performance of the Langevin proposal is expected to decay with the underlying parameter dimension. Both OL and OF produce significantly smaller IACTs than the full-dimensional PCN and H-MALA methods. Compared to the OL method, the PM method has slightly higher IACTs in this example. This rather mild loss of performance (compared with the elliptic PDE example) is explained by the rather small values of \sqrt{\textrm{var}}[\log\widetilde{\mathcal{L}}_{N}^{y}] (with N=2,5) reported in Table 3. Compared to the OF method, the DA method again produces the largest IACTs among all subspace inference methods. However, the loss of performance here is not as severe as in the elliptic PDE example; this is also explained by the improved second-stage acceptance rates, \mathbb{E}[\beta].

Overall, the approximate inference methods have better statistical performance than the other methods in this example and can obtain reasonably accurate results, as shown in Figures 6 and 8. With improved approximation errors, the exact inference methods also produce Markov chains with better mixing. Among all the exact inference methods, PM produces significantly smaller IACTs than the other methods.

9.3 Numerical results using prior normalization

We now present the results obtained by applying the prior normalization method (cf. Section 7.3). The KL divergences of the full posteriors from the approximate posterior densities are shown in Figure 9. Here the result of the prior-based dimension reduction is not presented, as the prior in the transformed space has an identity covariance matrix. We observe results similar to those obtained by the coordinate selection. We notice that the accuracy gaps between the data-free projectors and the data-dependent projectors (built using other data sets) are more significant than those obtained by the coordinate selection. The comparison of the errors of the approximate posterior densities (defined by the data-free projectors) with the bounds in (14) and (25) is provided in Figure 10. Here we have \kappa=1 because the transformed coordinate is endowed with a Gaussian prior. We observe that the errors of the approximate posterior densities follow the same trend as their corresponding bounds. The IACTs of various MCMC methods are reported in Table 4. Again, the efficiency of the subspace MCMC methods defined by the prior normalization is very close to that defined by the coordinate selection. Overall, both the coordinate selection and the prior normalization can be applied in this example to obtain accurate reduced-dimensional posterior approximations and to derive efficient subspace MCMC methods.

Refer to caption
Refer to caption
Figure 9: Same as Figure 2, but for PET. The prior normalization is used in dimension reduction.
Refer to caption
Figure 10: Same as Figure 3, but for PET. The prior normalization is used in dimension reduction.
Table 4: Same as Table 2, but for PET and data set 1. The prior normalization is used in dimension reduction. Ranks of the subspaces are chosen to be 1616, 3232, and 4848.
OL (IACT)   PM (IACT)   \sqrt{\textrm{var}}[\log\widetilde{\mathcal{L}}_{N}^{y}]   OF (IACT)   DA (IACT)   \mathbb{E}[\beta]   H-MALA (IACT)   PCN (IACT)
N=2N=2 r=16r=16 35.8±\pm1.7 81.4±\pm6.0 1.48±\pm0.02 33.5±\pm1.8 168±\pm23 0.25±\pm0.02 95.9±\pm3.3 387±\pm79
r=32r=32 42.8±\pm2.0 55.1±\pm2.9 0.64±\pm.006 41.2±\pm1.6 86.3±\pm6.2 0.55±\pm0.01
r=48r=48 45.0±\pm2.4 51.8±\pm2.0 0.46±\pm.005 44.3±\pm2.2 74.6±\pm7.4 0.65±\pm0.01
N=5N=5 r=16r=16 35.1±\pm1.9 54.1±\pm4.1 0.88±\pm0.01 32.8±\pm2.8 151±\pm21 0.26±\pm0.02
r=32r=32 45.0±\pm1.7 49.0±\pm1.9 0.41±\pm.003 42.0±\pm2.6 83.1±\pm5.1 0.55±\pm0.01
r=48r=48 45.9±\pm2.9 46.3±\pm2.2 0.29±\pm.003 44.4±\pm0.8 70.6±\pm3.7 0.66±\pm0.01

10 Conclusion

We present a new data-free strategy for reducing the dimensionality of large-scale statistical inverse problems. Compared to existing gradient-based dimension reduction techniques, this new approach performs the computationally costly subspace construction in an offline phase. Our data-free dimension reduction is certified in the sense that its development is directly guided by factorizable posterior approximations and associated error bounds. The factorizable posterior approximations naturally offer dimension-robust sampling methods for exploring the approximate posterior densities. More interestingly, by adding minor modifications to these approximate inference algorithms, we further develop exact inference methods using the pseudo-marginal approach and the delayed acceptance approach. The resulting exact inference methods also scale well with the parameter dimension, as they are built on the dimension-robust approximate inference methods. We demonstrate the efficiency of our data-free dimension reduction and the various inference methods on two inverse problems: a two-dimensional elliptic PDE with a Gaussian process prior and a PET problem with Poisson data and a Besov-\mathcal{B}^{2}_{11} prior.

T. Cui and O. Zahm would like to acknowledge support from the INRIA associate team UNQUESTIONABLE. T. Cui also acknowledges support from the Australian Research Council.

Appendix A Proof of Proposition 4.1

Recall πposy(x)=y(x)πpr(x)πdata(y)\pi_{\text{pos}}^{y}(x)=\frac{\mathcal{L}^{y}(x)\pi_{\text{pr}}(x)}{\pi_{\text{data}}(y)} and π^posy(x)=^y(x)πpr(x)π^data(y)\widehat{\pi}_{\text{pos}}^{y}(x)=\frac{\widehat{\mathcal{L}}^{y}(x)\pi_{\text{pr}}(x)}{\widehat{\pi}_{\text{data}}(y)}. By definition of y(x)\mathcal{L}^{y}(x) and ^y(x)\widehat{\mathcal{L}}^{y}(x) we have

DKL(\displaystyle D_{\mathrm{KL}}( πposy||π^posy)=dlog(πposy(x)π^posy(x))πposy(x)dx\displaystyle\pi_{\text{pos}}^{y}||\widehat{\pi}_{\text{pos}}^{y})=\int_{\mathbb{R}^{d}}\log\left(\frac{\pi_{\text{pos}}^{y}(x)}{\widehat{\pi}_{\text{pos}}^{y}(x)}\right)\pi_{\text{pos}}^{y}(x)\mathrm{d}x
=logπ^data(y)πdata(y)+d(12G(x)yΣobs12+12G(x)yΣobs12)πposy(x)dx\displaystyle=\log\frac{\widehat{\pi}_{\text{data}}(y)}{\pi_{\text{data}}(y)}+\int_{\mathbb{R}^{d}}\left(-\frac{1}{2}\|G(x)-y\|_{\Sigma_{\text{obs}}^{-1}}^{2}+\frac{1}{2}\|G^{\ast}(x)-y\|_{\Sigma_{\text{obs}}^{-1}}^{2}\right)\pi_{\text{pos}}^{y}(x)\mathrm{d}x
=logπ^data(y)πdata(y)+d(12e(x)Σobs12e(x)Σobs1(G(x)y))πposy(x)dx\displaystyle=\log\frac{\widehat{\pi}_{\text{data}}(y)}{\pi_{\text{data}}(y)}+\int_{\mathbb{R}^{d}}\left(\frac{1}{2}\|e(x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}-e(x)^{\top}\Sigma_{\text{obs}}^{-1}(G(x)-y)\right)\pi_{\text{pos}}^{y}(x)\mathrm{d}x (46)

where e(x)=G(x)-G^{\ast}(x) does not depend on y. Next, we replace y by Y\sim\pi_{\text{data}} and take the expectation over Y. The first term in the above expression becomes

𝔼[logπ^data(Y)πdata(Y)]=mlog(π^data(y)πdata(y))πdata(y)dy=DKL(πdata||π^data).\mathbb{E}\left[\log\frac{\widehat{\pi}_{\text{data}}(Y)}{\pi_{\text{data}}(Y)}\right]=\int_{\mathbb{R}^{m}}\log\left(\frac{\widehat{\pi}_{\text{data}}(y)}{\pi_{\text{data}}(y)}\right)\pi_{\text{data}}(y)\mathrm{d}y=-D_{\mathrm{KL}}(\pi_{\text{data}}||\widehat{\pi}_{\text{data}}). (47)

Next, by definition of πposy(x)\pi_{\text{pos}}^{y}(x), we have

𝔼[d12e(x)Σobs12πposY(x)dx]\displaystyle\mathbb{E}\left[\int_{\mathbb{R}^{d}}\frac{1}{2}\|e(x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}\pi_{\text{pos}}^{Y}(x)\mathrm{d}x\right] =12d×me(x)Σobs12πposy(x)πdata(y)dxdy\displaystyle=\frac{1}{2}\int_{\mathbb{R}^{d}\times\mathbb{R}^{m}}\|e(x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}\pi_{\text{pos}}^{y}(x)\pi_{\text{data}}(y)\mathrm{d}x\mathrm{d}y
=12d×me(x)Σobs12y(x)πpr(x)dxdy\displaystyle=\frac{1}{2}\int_{\mathbb{R}^{d}\times\mathbb{R}^{m}}\|e(x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}\mathcal{L}^{y}(x)\pi_{\text{pr}}(x)\mathrm{d}x\mathrm{d}y
=12dG(x)G^(x)Σobs12πpr(x)dx.\displaystyle=\frac{1}{2}\int_{\mathbb{R}^{d}}\|G(x)-\widehat{G}(x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}\pi_{\text{pr}}(x)\mathrm{d}x. (48)

For the last equality we used the fact that yy(x)y\mapsto\mathcal{L}^{y}(x) is a pdf so that my(x)dy=1\int_{\mathbb{R}^{m}}\mathcal{L}^{y}(x)\mathrm{d}y=1. Using the same arguments, we have

𝔼[\displaystyle\mathbb{E}\bigg{[} de(x)Σobs1(G(x)Y)πposY(x)dx]\displaystyle\int_{\mathbb{R}^{d}}e(x)^{\top}\Sigma_{\text{obs}}^{-1}\big{(}G(x)-Y\big{)}\pi_{\text{pos}}^{Y}(x)\mathrm{d}x\bigg{]}
=d×me(x)Σobs1(G(x)y)y(x)πpr(x)dxdy\displaystyle=\int_{\mathbb{R}^{d}\times\mathbb{R}^{m}}e(x)^{\top}\Sigma_{\text{obs}}^{-1}\big{(}G(x)-y\big{)}\mathcal{L}^{y}(x)\pi_{\text{pr}}(x)\mathrm{d}x\mathrm{d}y
=de(x)Σobs1G(x)πpr(x)dxde(x)Σobs1(myy(x)dy)πpr(x)dx\displaystyle=\int_{\mathbb{R}^{d}}e(x)^{\top}\Sigma_{\text{obs}}^{-1}G(x)\pi_{\text{pr}}(x)\mathrm{d}x-\int_{\mathbb{R}^{d}}e(x)^{\top}\Sigma_{\text{obs}}^{-1}\left(\int_{\mathbb{R}^{m}}y\mathcal{L}^{y}(x)\mathrm{d}y\right)\pi_{\text{pr}}(x)\mathrm{d}x
=0.\displaystyle=0. (49)

The last equality is obtained by noting that the expectation of the data given the parameter x is \int_{\mathbb{R}^{m}}y\,\mathcal{L}^{y}(x)\mathrm{d}y=G(x). Combining (47), (48), and (49), we obtain

𝔼[DKL(πposY||π^posY)]=(46)DKL(πdata||π^data)+12dG(x)G^(x)Σobs12πpr(x)dx,\mathbb{E}\left[D_{\mathrm{KL}}(\pi_{\text{pos}}^{Y}||\widehat{\pi}_{\text{pos}}^{Y})\right]\overset{\eqref{eq:tmp68721}}{=}-D_{\mathrm{KL}}(\pi_{\text{data}}||\widehat{\pi}_{\text{data}})+\frac{1}{2}\int_{\mathbb{R}^{d}}\|G(x)-\widehat{G}(x)\|_{\Sigma_{\text{obs}}^{-1}}^{2}\pi_{\text{pr}}(x)\mathrm{d}x,

which concludes the proof.

Appendix B Proof of Proposition 6.1

Consider a Metropolis-Hastings algorithm which targets the pdf πtary,N\pi_{\text{tar}}^{y,N} defined by (32) using the following proposal density

q(xr,{x(i)}i=1N|xr,{x(i)}i=1N)=q(xr|xr)i=1Nπpr(x(i)|xr),q\left(x_{r}^{\dagger},\{x_{\perp}^{\dagger(i)}\}_{i=1}^{N}\Big{|}x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N}\right)=q(x_{r}^{\dagger}|x_{r})\prod_{i=1}^{N}\pi_{\text{pr}}(x_{\perp}^{\dagger(i)}|x_{r}^{\dagger}), (50)

where q(x_{r}^{\dagger}|x_{r}) is the same proposal density as the one used in the first step of Algorithm 5. The acceptance probability of this Metropolis-Hastings algorithm is given by

α(xr,{x(i)}i=1N|xr,{x(i)}i=1N)\displaystyle\alpha\left(x_{r}^{\dagger},\{x_{\perp}^{\dagger(i)}\}_{i=1}^{N}\Big{|}x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N}\right) =min{1,πtary,N(xr,{x(i)}i=1N)q(xr,{x(i)}i=1N|xr,{x(i)}i=1N)πtary,N(xr,{x(i)}i=1N)q(xr,{x(i)}i=1N|xr,{x(i)}i=1N)}\displaystyle=\min\left\{1,\frac{\pi_{\text{tar}}^{y,N}(x_{r}^{\dagger},\{x_{\perp}^{\dagger(i)}\}_{i=1}^{N})~{}q\left(x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N}\Big{|}x_{r}^{\dagger},\{x_{\perp}^{\dagger(i)}\}_{i=1}^{N}\right)}{\pi_{\text{tar}}^{y,N}(x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N})~{}q\left(x_{r}^{\dagger},\{x_{\perp}^{\dagger(i)}\}_{i=1}^{N}\Big{|}x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N}\right)}\right\}
=min{1,πpr(xr)(i=1Ny(xr+x(i)))q(xr|xr)πpr(xr)(i=1Ny(xr+x(i)))q(xr|xr)},\displaystyle=\min\left\{1,\frac{\pi_{\text{pr}}(x_{r}^{\dagger})\left(\sum_{i=1}^{N}\mathcal{L}^{y}(x_{r}^{\dagger}+x_{\perp}^{\dagger(i)})\right)q(x_{r}|x_{r}^{\dagger})}{\pi_{\text{pr}}(x_{r})\left(\sum_{i=1}^{N}\mathcal{L}^{y}(x_{r}+x_{\perp}^{(i)})\right)q(x_{r}^{\dagger}|x_{r})}\right\},

which is precisely \widehat{\alpha}_{N}(x_{r}^{\dagger}|x_{r}) defined in (31). Note that the first two steps of Algorithm 5 consist of drawing a sample (x_{r}^{\dagger},\{x_{\perp}^{\dagger(i)}\}_{i=1}^{N}) from the proposal (50). This way, Algorithm 5 can be interpreted as an MCMC algorithm which targets \pi_{\text{tar}}^{y,N}. It remains to show that the marginal distribution \pi_{\text{tar}}^{y,N}(x_{r}) is the marginal posterior \pi_{\text{pos}}^{y}(x_{r}). We can write

πtary,N(xr)\displaystyle\pi_{\text{tar}}^{y,N}(x_{r}) =Ker(Pr)Nπtary,N(xr,x(1),,x(N))dx(1)dx(N)\displaystyle=\int_{\text{Ker}(P_{r})^{N}}\pi_{\text{tar}}^{y,N}(x_{r},x_{\perp}^{(1)},{.\hss.\hss.},x_{\perp}^{(N)})\mathrm{d}x_{\perp}^{(1)}{.\hss.\hss.}\mathrm{d}x_{\perp}^{(N)}
πpr(xr)i=1NKer(Pr)y(xr+x(i))πpr(x(i)|xr)dx(i)\displaystyle\propto\pi_{\text{pr}}(x_{r})\sum_{i=1}^{N}\int_{\text{Ker}(P_{r})}\mathcal{L}^{y}(x_{r}+x_{\perp}^{(i)})\pi_{\text{pr}}(x_{\perp}^{(i)}|x_{r})\mathrm{d}x_{\perp}^{(i)}
\displaystyle\propto\int_{\text{Ker}(P_{r})}\mathcal{L}^{y}(x_{r}+x_{\perp})\pi_{\text{pr}}(x_{r}+x_{\perp})\mathrm{d}x_{\perp}\propto\pi_{\text{pos}}^{y}(x_{r}),
where the last line uses the fact that the N integrals are identical and that \pi_{\text{pr}}(x_{r})\pi_{\text{pr}}(x_{\perp}|x_{r})=\pi_{\text{pr}}(x_{r}+x_{\perp}),

which concludes the proof.

Appendix C Proof of Proposition 6.3

Recall that \{(X_{r}^{(j)},\{X_{\perp}^{(j,i)}\}_{i=1}^{N})\}_{j\geq 1} admits \pi_{\text{tar}}^{y,N} in (32) as its invariant density, see Proposition 6.1. It remains to prove that \{X^{(j)}\}_{j\geq 1} admits \pi_{\text{pos}}^{y}(x) as its invariant density. For a given state (X_{r}^{(j)},\{X_{\perp}^{(j,i)}\}_{i=1}^{N})=(x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N}), we have X^{(j)}=x_{r}+x_{\perp}, where x_{\perp}\in\{x_{\perp}^{(i)}\}_{i=1}^{N} is selected with respect to the probability

(x=x(k)|xr,{x(i)}i=1N)=y(xr+x(k))i=1Ny(xr+x(i)),1kN.\mathbb{P}\left(x_{\perp}=x_{\perp}^{(k)}\Big{|}x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N}\right)=\frac{\mathcal{L}^{y}(x_{r}+x_{\perp}^{(k)})}{\sum_{i=1}^{N}\mathcal{L}^{y}(x_{r}+x_{\perp}^{(i)})},\quad 1\leq k\leq N. (51)

Thus, we need to prove that the pdf \pi(x) of x=x_{r}+x_{\perp} is the posterior density, i.e., \pi(x)=\pi_{\text{pos}}^{y}(x). We can write

π(x)\displaystyle\pi(x) =π(xr,x)\displaystyle=\pi(x_{r},x_{\perp})
=Ker(Pr)Nπ(xr,{x(i)}i=1N,x)dx(1)dx(N)\displaystyle=\int_{\text{Ker}(P_{r})^{N}}\pi(x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N},x_{\perp})\,\mathrm{d}x_{\perp}^{(1)}{.\hss.\hss.}\mathrm{d}x_{\perp}^{(N)}
=Ker(Pr)Nπ(x|xr,{x(i)}i=1N)πtary,N(xr,{x(i)}i=1N)dx(1)dx(N),\displaystyle=\int_{\text{Ker}(P_{r})^{N}}\pi\left(x_{\perp}\Big{|}x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N}\right)\pi_{\text{tar}}^{y,N}\left(x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N}\right)\,\mathrm{d}x_{\perp}^{(1)}{.\hss.\hss.}\mathrm{d}x_{\perp}^{(N)},

where π(x|xr,{x(i)}i=1N)\pi(x_{\perp}|x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N}) is the pdf of xx_{\perp} conditioned on (xr,{x(i)}i=1N)(x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N}). By construction we have

π(x|xr,{x(i)}i=1N)=(51)k=1Nδx(k)(x)y(xr+x(k))i=1Ny(xr+x(i)),\pi\left(x_{\perp}\Big{|}x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N}\right)\overset{\eqref{eq:x_perpStar_PROOF}}{=}\frac{\sum_{k=1}^{N}\delta_{x_{\perp}^{(k)}}(x_{\perp})\mathcal{L}^{y}(x_{r}+x_{\perp}^{(k)})}{\sum_{i=1}^{N}\mathcal{L}^{y}(x_{r}+x_{\perp}^{(i)})}, (52)

where δx(k)\delta_{x_{\perp}^{(k)}} denotes the Dirac mass function at point x(k)x_{\perp}^{(k)}. We can write

\displaystyle\pi(x) =\int_{\text{Ker}(P_{r})^{N}}\frac{\sum_{k=1}^{N}\delta_{x_{\perp}^{(k)}}(x_{\perp})\mathcal{L}^{y}(x_{r}+x_{\perp}^{(k)})}{\sum_{i=1}^{N}\mathcal{L}^{y}(x_{r}+x_{\perp}^{(i)})}\pi_{\text{tar}}^{y,N}\left(x_{r},\{x_{\perp}^{(i)}\}_{i=1}^{N}\right)\,\mathrm{d}x_{\perp}^{(1)}\dots\mathrm{d}x_{\perp}^{(N)}
\displaystyle\overset{(32)}{\propto}\sum_{k=1}^{N}\int_{\text{Ker}(P_{r})^{N}}\delta_{x_{\perp}^{(k)}}(x_{\perp})\mathcal{L}^{y}(x_{r}+x_{\perp}^{(k)})\pi_{\text{pr}}(x_{r})\prod_{i=1}^{N}\pi_{\text{pr}}(x_{\perp}^{(i)}|x_{r})\,\mathrm{d}x_{\perp}^{(1)}\dots\mathrm{d}x_{\perp}^{(N)}
\displaystyle\propto\sum_{k=1}^{N}\int_{\text{Ker}(P_{r})}\delta_{x_{\perp}^{(k)}}(x_{\perp})\mathcal{L}^{y}(x_{r}+x_{\perp}^{(k)})\pi_{\text{pr}}(x_{r})\pi_{\text{pr}}(x_{\perp}^{(k)}|x_{r})\,\mathrm{d}x_{\perp}^{(k)}
\displaystyle\propto\sum_{k=1}^{N}\mathcal{L}^{y}(x_{r}+x_{\perp})\pi_{\text{pr}}(x_{r})\pi_{\text{pr}}(x_{\perp}|x_{r})\propto\pi_{\text{pos}}^{y}(x_{r}+x_{\perp}),

which concludes the proof.
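The selection step (51) amounts to resampling one complement sample with weights proportional to the likelihood values. Below is a minimal Python sketch of this step (an illustration, not the reference implementation), assuming a user-supplied routine log_likelihood returning logy(x)\log\mathcal{L}^{y}(x).

```python
import numpy as np

def select_x_perp(x_r, x_perps, log_likelihood, rng=np.random.default_rng()):
    """Draw one complement sample among the N candidates with probability
    proportional to L^y(x_r + x_perp^(k)), cf. (51).  Weights are formed in
    log space and normalised to avoid underflow for peaked likelihoods."""
    log_w = np.array([log_likelihood(x_r + xp) for xp in x_perps])
    w = np.exp(log_w - log_w.max())   # unnormalised, numerically safe weights
    k = rng.choice(len(x_perps), p=w / w.sum())
    return x_perps[k]
```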

Appendix D Proof of Proposition 6.4

To show the result of Proposition 6.4, we first interpret the first-stage acceptance/rejection and the conditional prior sampling πpr(x|xr)\pi_{\rm pr}(x_{\perp}^{\dagger}|x_{r}^{\dagger}) as a joint proposal acting on the full parameter space Im(Pr)Ker(Pr){\rm Im}(P_{r})\oplus{\rm Ker}(P_{r}). The proposal q(|xr)q(\cdot|x_{r}) and the acceptance probability α(xr|xr)\alpha(x_{r}^{\dagger}|x_{r}) define an effective proposal distribution

\bar{q}(x_{r}^{\dagger}|x_{r})=q(x_{r}^{\dagger}|x_{r})\alpha(x_{r}^{\dagger}|x_{r})+\Big{[}1-\int q(x_{r}^{\prime}|x_{r})\alpha(x_{r}^{\prime}|x_{r})\,\mathrm{d}x_{r}^{\prime}\Big{]}\delta_{x_{r}}(x_{r}^{\dagger}),

where δxr()\delta_{x_{r}}(\cdot) denotes the Dirac delta and the bracketed term is the probability that a proposed candidate is rejected. Then, we can define a joint proposal distribution

Q(xr,x|xr,x):=Q(xr,x|xr)=πpr(x|xr)q¯(xr|xr),Q(x_{r}^{\dagger},x_{\perp}^{\dagger}|x_{r},x_{\perp}):=Q(x_{r}^{\dagger},x_{\perp}^{\dagger}|x_{r})=\pi_{\rm pr}(x_{\perp}^{\dagger}|x_{r}^{\dagger})\,\bar{q}(x_{r}^{\dagger}|x_{r}),

for the Metropolis-Hastings (MH) algorithm that samples the full posterior.

Following exactly the same derivation as in [11], one can show that accepting (xr,x)Q(,|xr,x)(x_{r}^{\dagger},x_{\perp}^{\dagger})\sim Q(\cdot,\cdot|x_{r},x_{\perp}) with the probability

β(xr,x|xr,x)\displaystyle\beta(x_{r}^{\dagger},x_{\perp}^{\dagger}|x_{r},x_{\perp}) =min[1,πposy(xr+x)Q(xr,x|xr,x)πposy(xr+x)Q(xr,x|xr,x)]\displaystyle=\min\left[1,\frac{\pi^{y}_{\rm pos}(x_{r}^{\dagger}+x_{\perp}^{\dagger})\,Q(x_{r},x_{\perp}|x_{r}^{\dagger},x_{\perp}^{\dagger})}{\pi^{y}_{\rm pos}(x_{r}+x_{\perp})\,Q(x_{r}^{\dagger},x_{\perp}^{\dagger}|x_{r},x_{\perp})}\right]

defines a Markov transition kernel with the full posterior πposy(xr+x)\pi^{y}_{\rm pos}(x_{r}+x_{\perp}) as its invariant distribution. Since this acceptance probability is only used when the first-stage proposal candidate xrx_{r}^{\dagger} is accepted, i.e., xrxrx_{r}^{\dagger}\neq x_{r}, we do not need to consider the Dirac delta term. The acceptance probability therefore simplifies to

β(xr,x|xr,x)=min[1,πposy(xr+x)πpr(x|xr)q(xr|xr)α(xr|xr)πposy(xr+x)πpr(x|xr)q(xr|xr)α(xr|xr)].\displaystyle\beta(x_{r}^{\dagger},x_{\perp}^{\dagger}|x_{r},x_{\perp})=\min\left[1,\frac{\pi^{y}_{\rm pos}(x_{r}^{\dagger}+x_{\perp}^{\dagger})\,\pi_{\rm pr}(x_{\perp}|x_{r})\,q(x_{r}|x_{r}^{\dagger})\,\alpha(x_{r}|x_{r}^{\dagger})}{\pi^{y}_{\rm pos}(x_{r}+x_{\perp})\,\pi_{\rm pr}(x_{\perp}^{\dagger}|x_{r}^{\dagger})\,q(x_{r}^{\dagger}|x_{r})\alpha(x_{r}^{\dagger}|x_{r})}\right].

Substituting the identities

α(xr|xr)α(xr|xr)=~Ny(xr)πpr(xr)q(xr|xr)~Ny(xr)πpr(xr)q(xr|xr),\frac{\alpha(x_{r}|x_{r}^{\dagger})}{\alpha(x_{r}^{\dagger}|x_{r})}=\frac{\widetilde{\mathcal{L}}_{N}^{y}(x_{r})\pi_{\text{pr}}(x_{r})\,q(x_{r}^{\dagger}|x_{r})}{\widetilde{\mathcal{L}}_{N}^{y}(x_{r}^{\dagger})\pi_{\text{pr}}(x_{r}^{\dagger})\,q(x_{r}|x_{r}^{\dagger})},

and

πposy(xr+x)πpr(x|xr)πposy(xr+x)πpr(x|xr)=y(xr+x)πpr(xr)y(xr+x)πpr(xr),\frac{\pi^{y}_{\rm pos}(x_{r}^{\dagger}+x_{\perp}^{\dagger})\,\pi_{\rm pr}(x_{\perp}|x_{r})}{\pi^{y}_{\rm pos}(x_{r}+x_{\perp})\,\pi_{\rm pr}(x_{\perp}^{\dagger}|x_{r}^{\dagger})}=\frac{\mathcal{L}^{y}(x_{r}^{\dagger}+x_{\perp}^{\dagger})\,\pi_{\rm pr}(x_{r}^{\dagger})}{\mathcal{L}^{y}(x_{r}+x_{\perp})\,\pi_{\rm pr}(x_{r})},

into the above equation, we obtain

β(xr,x|xr,x)=min[1,y(xr+x)~Ny(xr)y(xr+x)~Ny(xr)],\beta(x_{r}^{\dagger},x_{\perp}^{\dagger}|x_{r},x_{\perp})=\min\left[1,\frac{\mathcal{L}^{y}(x_{r}^{\dagger}+x_{\perp}^{\dagger})\,\widetilde{\mathcal{L}}_{N}^{y}(x_{r})}{\mathcal{L}^{y}(x_{r}+x_{\perp})\,\widetilde{\mathcal{L}}_{N}^{y}(x_{r}^{\dagger})}\right],

which is identical to the second stage acceptance probability in (35). Thus, the result follows.
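For illustration, the second-stage test with the acceptance probability derived above can be sketched as follows in Python. This is only a sketch: log_L and log_L_tilde are assumed user-supplied evaluations of logy\log\mathcal{L}^{y} and of the surrogate log~Ny\log\widetilde{\mathcal{L}}_{N}^{y}, and the ratio is evaluated in log space to avoid under/overflow.

```python
import numpy as np

def second_stage_accept(x_r_new, x_perp_new, x_r, x_perp,
                        log_L, log_L_tilde, rng=np.random.default_rng()):
    """Second-stage (delayed-acceptance) test with
    beta = min(1, L^y(x_r'+x_perp') * Ltilde(x_r)
                  / (L^y(x_r+x_perp) * Ltilde(x_r'))),
    evaluated in log space.  Returns True if the candidate is accepted."""
    log_beta = (log_L(x_r_new + x_perp_new) + log_L_tilde(x_r)
                - log_L(x_r + x_perp) - log_L_tilde(x_r_new))
    return np.log(rng.uniform()) < min(0.0, log_beta)
```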

Appendix E Cumulative distribution function of p(x)exp(γ|x|p)p(x)\propto\exp(-\gamma|x|^{p})

Given the pdf p(x)=1cγ,pexp(γ|x|p),p(x)=\frac{1}{c_{\gamma,p}}\exp(-\gamma|x|^{p}), xx\in\mathbb{R}, we want to find its normalizing constant cγ,pc_{\gamma,p} and its cumulative distribution function (cdf). Using symmetry, the normalizing constant takes the form

cγ,p=20exp(γxp)𝑑x,c_{\gamma,p}=2\int_{0}^{\infty}\exp(-\gamma x^{p})dx,

and the cdf can be expressed as

\Phi(x)=\left\{\begin{array}{ll}\displaystyle\frac{1}{2}+\frac{1}{c_{\gamma,p}}\int_{0}^{x}\exp(-\gamma t^{p})\,\mathrm{d}t&x\geq 0\\[6pt]\displaystyle\frac{1}{2}-\frac{1}{c_{\gamma,p}}\int_{0}^{-x}\exp(-\gamma t^{p})\,\mathrm{d}t&x<0\end{array}\right..

We first introduce the change of variables z=γtpz=\gamma t^{p} so that

t=(zγ)1panddtdz=p1γ1pz1p1.t=\bigg{(}\frac{z}{\gamma}\bigg{)}^{\frac{1}{p}}\quad{\rm and}\quad\frac{\mathrm{d}t}{\mathrm{d}z}=p^{-1}\gamma^{-\frac{1}{p}}\,z^{\frac{1}{p}-1}.

This yields

0xexp(γtp)𝑑t=p1γ1p0γxpz1p1exp(z)𝑑z:=p1γ1pΓlower(p1,γxp),\int_{0}^{x}\exp(-\gamma t^{p})dt=p^{-1}\gamma^{-\frac{1}{p}}\int_{0}^{\gamma x^{p}}z^{\frac{1}{p}-1}\exp(-z)dz:=p^{-1}\gamma^{-\frac{1}{p}}\Gamma_{\text{lower}}(p^{-1},\gamma x^{p}),

where Γlower(,)\Gamma_{\text{lower}}(\cdot,\cdot) is the lower incomplete gamma function. Following a similar derivation, we obtain cγ,p=2p1γ1pΓ(p1),c_{\gamma,p}=2\,p^{-1}\gamma^{-\frac{1}{p}}\,\Gamma(p^{-1}), where Γ()\Gamma(\cdot) is the Gamma function. This way, we have the cdf

Φ(x)=12+sign(x)2Γ(p1)Γlower(p1,γ(sign(x)x)p).\Phi(x)=\frac{1}{2}+\frac{{\rm sign}(x)}{2\,\Gamma(p^{-1})}\Gamma_{\text{lower}}\Big{(}p^{-1},\gamma\,\big{(}{\rm sign}(x)\,x\big{)}^{p}\Big{)}.

There are two notable special cases. The Gaussian distribution 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) corresponds to γ=(2σ2)1\gamma=(2\sigma^{2})^{-1} and p=2p=2, in which case the cdf can be equivalently expressed using the error function. The Laplace distribution corresponds to p=1p=1, in which case the cdf admits the simpler (but equivalent) expression

Φ(x)=12+sign(x)2(1exp(γsign(x)x)).\Phi(x)=\frac{1}{2}+\frac{{\rm sign}(x)}{2}\Big{(}1-\exp\big{(}-\gamma\,{\rm sign}(x)\,x\big{)}\Big{)}.
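The cdf above is straightforward to evaluate with the regularized lower incomplete gamma function. The following Python sketch (an illustration, not part of the numerical experiments) implements it and checks it against the two special cases.

```python
import numpy as np
from scipy.special import gammainc          # regularised lower incomplete gamma
from scipy.stats import laplace, norm

def cdf(x, gamma, p):
    """CDF of p(x) ~ exp(-gamma*|x|^p); gammainc(a, z) already divides the
    lower incomplete gamma function by Gamma(a), matching the formula above."""
    x = np.asarray(x, dtype=float)
    return 0.5 + 0.5 * np.sign(x) * gammainc(1.0 / p, gamma * np.abs(x) ** p)

# Checks against the two special cases: Gaussian (p=2) and Laplace (p=1).
x = np.linspace(-3.0, 3.0, 7)
sigma = 0.8
assert np.allclose(cdf(x, gamma=1 / (2 * sigma**2), p=2), norm.cdf(x, scale=sigma))
assert np.allclose(cdf(x, gamma=2.5, p=1), laplace.cdf(x, scale=1 / 2.5))
```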

Appendix F Derivation of Fisher information matrices

Here we derive the Fisher information matrix for the Poisson likelihood case. Recall the Fisher information matrix

\mathcal{I}(x)=\int_{\mathbb{R}^{m}}\big{(}\nabla\log\mathcal{L}^{y}(x)\big{)}\big{(}\nabla\log\mathcal{L}^{y}(x)\big{)}^{\top}\,\mathcal{L}^{y}(x)\,\mathrm{d}y. (53)

Defining the predata (i.e., the noise-free model output) η=G(x)\eta=G(x), we can express the gradient of the log-likelihood as

\nabla_{x}\log\mathcal{L}^{y}(x)=\nabla G(x)^{\top}\nabla_{\eta}\log\mathcal{L}^{y}(\eta),

where

y(η)=i=1mηiyiexp(ηi)yi!,subjecttoη=G(x).\mathcal{L}^{y}(\eta)=\prod_{i=1}^{m}\frac{\eta_{i}^{y_{i}}\,\exp(-\eta_{i})}{y_{i}!},\quad{\rm subject\;to}\quad\eta=G(x).

This way, the Fisher information matrix can be rewritten as

(x)=G(x)(m(logy(η))(logy(η))y(η)dy)G(x),\mathcal{I}(x)=\nabla G(x)^{\top}\bigg{(}\int_{\mathbb{R}^{m}}\big{(}\nabla\log\mathcal{L}^{y}(\eta))(\nabla\log\mathcal{L}^{y}(\eta)\big{)}^{\top}\,\mathcal{L}^{y}(\eta)\,\mathrm{d}y\bigg{)}\nabla G(x), (54)

subject to η=G(x)\eta=G(x), where the integral over yy is understood as a sum for the discrete Poisson data. The term in the brackets of the above equation is the Fisher information matrix of the Poisson distribution. Since \partial_{\eta_{i}}\log\mathcal{L}^{y}(\eta)=y_{i}/\eta_{i}-1 and the components y_{i} are independent with \mathbb{E}[y_{i}]=\mathrm{Var}(y_{i})=\eta_{i}, this matrix is diagonal with entries

(m(logy(η))(logy(η))y(η)dy)ii=1ηi.\bigg{(}\int_{\mathbb{R}^{m}}\big{(}\nabla\log\mathcal{L}^{y}(\eta))(\nabla\log\mathcal{L}^{y}(\eta)\big{)}^{\top}\,\mathcal{L}^{y}(\eta)\,\mathrm{d}y\bigg{)}_{ii}=\frac{1}{\eta_{i}}.

Thus, the Fisher information matrix w.r.t. xx is

(x)=G(x)M(x)G(x),\displaystyle\mathcal{I}(x)=\nabla G(x)^{\top}M(x)\nabla G(x), (55)

where M(x)M(x) is the diagonal matrix with entries Mii(x)=Gi(x)1M_{ii}(x)=G_{i}(x)^{-1}.
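As an illustration, (55) can be assembled from the Jacobian of the forward map and the predicted data. The Python sketch below uses a forward finite-difference Jacobian and a hypothetical linear forward map for the usage example; both are placeholders for the actual forward solver.

```python
import numpy as np

def fisher_information_poisson(G, x, eps=1e-6):
    """Assemble I(x) = grad G(x)^T diag(1/G_i(x)) grad G(x) as in (55).
    The Jacobian of G is approximated by forward finite differences here;
    in practice one would use the tangent/adjoint of the forward solver."""
    g0 = np.asarray(G(x), dtype=float)        # predicted data, assumed positive
    J = np.empty((g0.size, x.size))
    for j in range(x.size):
        x_pert = x.copy()
        x_pert[j] += eps
        J[:, j] = (np.asarray(G(x_pert)) - g0) / eps
    return J.T @ (J / g0[:, None])            # = J^T diag(1/G(x)) J

# Usage with a hypothetical linear forward map G(x) = A x + b (illustrative only).
A = np.array([[1.0, 0.5], [0.2, 2.0], [0.3, 0.1]])
b = np.array([1.0, 2.0, 3.0])
I_x = fisher_information_poisson(lambda z: A @ z + b, np.array([0.5, 1.0]))
```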

References

  • [1] Christophe Andrieu, Gareth O Roberts, et al. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
  • [2] Christophe Andrieu, Matti Vihola, et al. Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. The Annals of Applied Probability, 25(2):1030–1077, 2015.
  • [3] Christophe Andrieu, Matti Vihola, et al. Establishing some order amongst exact approximations of MCMCs. The Annals of Applied Probability, 26(5):2661–2696, 2016.
  • [4] Mario Bebendorf. A note on the Poincaré inequality for convex domains. Zeitschrift für Analysis und ihre Anwendungen, 22(4):751–756, 2003.
  • [5] Alexandros Beskos, Mark Girolami, Shiwei Lan, Patrick E Farrell, and Andrew M Stuart. Geometric MCMC for infinite-dimensional inverse problems. Journal of Computational Physics, 335:327–351, 2017.
  • [6] Alexandros Beskos, Ajay Jasra, Kody Law, Youssef Marzouk, and Yan Zhou. Multilevel sequential Monte Carlo with dimension-independent likelihood-informed proposals. SIAM/ASA Journal on Uncertainty Quantification, 6(2):762–786, 2018.
  • [7] Alexandros Beskos, Gareth Roberts, Andrew Stuart, and Jochen Voss. MCMC methods for diffusion bridges. Stochastics and Dynamics, 8(03):319–350, 2008.
  • [8] Marie Billaud-Friess, Anthony Nouy, and Olivier Zahm. A tensor approximation method based on ideal minimal residual formulations for the solution of high-dimensional problems. ESAIM: Mathematical Modelling and Numerical Analysis, 48(6):1777–1806, 2014.
  • [9] S. Brooks, A. Gelman, G. Jones, and X. L. Meng, editors. Handbook of Markov Chain Monte Carlo. Taylor & Francis, 2011.
  • [10] Carlos M Carvalho, Nicholas G Polson, and James G Scott. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pages 73–80, 2009.
  • [11] J Andrés Christen and Colin Fox. Markov chain Monte Carlo using an approximation. Journal of Computational and Graphical Statistics, 14(4):795–810, 2005.
  • [12] Paul G Constantine, Eric Dow, and Qiqi Wang. Active subspace methods in theory and practice: applications to kriging surfaces. SIAM Journal on Scientific Computing, 36(4):A1500–A1524, 2014.
  • [13] Paul G Constantine, Carson Kent, and Tan Bui-Thanh. Accelerating Markov chain Monte Carlo with active subspaces. SIAM Journal on Scientific Computing, 38(5):A2779–A2805, 2016.
  • [14] Andrea F Cortesi, Paul G Constantine, Thierry E Magin, and Pietro M Congedo. Forward and backward uncertainty quantification with active subspaces: application to hypersonic flows around a cylinder. Journal of Computational Physics, 407:109079, 2020.
  • [15] Simon L Cotter, Gareth O Roberts, Andrew M Stuart, David White, et al. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, 28(3):424–446, 2013.
  • [16] Tiangang Cui, Gianluca Detommaso, and Robert Scheichl. Multilevel dimension-independent likelihood-informed MCMC for large-scale inverse problems. arXiv preprint arXiv:1910.12431, 2019.
  • [17] Tiangang Cui, Kody JH Law, and Youssef M Marzouk. Dimension-independent likelihood-informed MCMC. Journal of Computational Physics, 304:109–137, 2016.
  • [18] Tiangang Cui, James Martin, Youssef M Marzouk, Antti Solonen, and Alessio Spantini. Likelihood-informed dimension reduction for nonlinear inverse problems. Inverse Problems, 30(11):114015, 2014.
  • [19] Tiangang Cui and Xin T Tong. A unified performance analysis of likelihood-informed subspace methods. arXiv preprint arXiv:2101.02417, 2021.
  • [20] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
  • [21] Masoumeh Dashti, Stephen Harris, and Andrew Stuart. Besov priors for Bayesian inverse problems. Inverse Problems & Imaging, 6(2):183, 2012.
  • [22] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
  • [23] Arnaud Doucet, Michael K Pitt, George Deligiannidis, and Robert Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 102(2):295–313, 2015.
  • [24] Alain Durmus, Eric Moulines, et al. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
  • [25] Raaz Dwivedi, Yuansi Chen, Martin J Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast. Journal of Machine Learning Research, 20(183):1–42, 2019.
  • [26] Ivan G Graham, Frances Y Kuo, James A Nichols, Robert Scheichl, Ch Schwab, and Ian H Sloan. Quasi-Monte Carlo finite element methods for elliptic PDEs with lognormal random coefficients. Numerische Mathematik, 131(2):329–368, 2015.
  • [27] Ivan G Graham, Frances Y Kuo, Dirk Nuyens, Robert Scheichl, and Ian H Sloan. Quasi-Monte Carlo methods for elliptic PDEs with random coefficients and applications. Journal of Computational Physics, 230(10):3668–3694, 2011.
  • [28] Alice Guionnet and B Zegarlinski. Lectures on logarithmic Sobolev inequalities. In Séminaire de probabilités XXXVI, pages 1–134. Springer, 2003.
  • [29] W. Hastings. Monte Carlo sampling using Markov chains and their applications. Biometrika, 57:97–109, 1970.
  • [30] Jere Heikkinen. Statistical inversion theory in x-ray tomography. 2008.
  • [31] Jari Kaipio and Erkki Somersalo. Statistical and computational inverse problems, volume 160. Springer Science & Business Media, 2006.
  • [32] Ville Kolehmainen, Matti Lassas, Kati Niinimäki, and Samuli Siltanen. Sparsity-promoting Bayesian inversion. Inverse Problems, 28(2):025005, 2012.
  • [33] Shiwei Lan. Adaptive dimension reduction to accelerate infinite-dimensional geometric Markov chain Monte Carlo. Journal of Computational Physics, 392:71–95, 2019.
  • [34] Matti Lassas, Eero Saksman, and Samuli Siltanen. Discretization-invariant Bayesian inversion and Besov space priors. Inverse problems and imaging, 3(1):87–122, 2009.
  • [35] Olivier Le Maître and Omar M Knio. Spectral methods for uncertainty quantification: with applications to computational fluid dynamics. Springer Science & Business Media, 2010.
  • [36] Michel Ledoux. Logarithmic Sobolev inequalities for unbounded spin systems revisited. In Séminaire de Probabilités XXXV, pages 167–194. Springer, 2001.
  • [37] Finn Lindgren, Håvard Rue, and Johan Lindström. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011.
  • [38] Jun S Liu. Monte Carlo strategies in scientific computing. Springer Science & Business Media, 2001.
  • [39] Jun S Liu and Rong Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American statistical association, 93(443):1032–1044, 1998.
  • [40] Youssef M Marzouk and Habib N Najm. Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics, 228(6):1862–1902, 2009.
  • [41] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
  • [42] Anthony Nouy. Low-Rank Tensor Methods for Model Order Reduction, pages 857–882. Springer International Publishing, Cham, 2017.
  • [43] Mario Teixeira Parente, Jonas Wallin, Barbara Wohlmuth, et al. Generalized bounds for active subspaces. Electronic Journal of Statistics, 14(1):917–943, 2020.
  • [44] Anthony T Patera, Gianluigi Rozza, et al. Reduced basis approximation and a posteriori error estimation for parametrized partial differential equations, 2007.
  • [45] Noemi Petra, James Martin, Georg Stadler, and Omar Ghattas. A computational framework for infinite-dimensional Bayesian inverse problems, Part II: Stochastic Newton MCMC with application to ice sheet flow inverse problems. SIAM Journal on Scientific Computing, 36(4):A1525–A1555, 2014.
  • [46] Allan Pinkus. Ridge functions, volume 205. Cambridge University Press, 2015.
  • [47] G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability, 7:110–120, 1997.
  • [48] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367, 2001.
  • [49] Olivier Roustant, Franck Barthe, Bertrand Iooss, et al. Poincaré inequalities on intervals–application to sensitivity analysis. Electronic journal of statistics, 11(2):3081–3119, 2017.
  • [50] Alessio Spantini, Antti Solonen, Tiangang Cui, James Martin, Luis Tenorio, and Youssef Marzouk. Optimal low-rank approximations of Bayesian linear inverse problems. SIAM Journal on Scientific Computing, 37(6):A2451–A2487, 2015.
  • [51] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19:451–559, 2010.
  • [52] Albert Tarantola. Inverse problem theory and methods for model parameter estimation, volume 89. SIAM, 2005.
  • [53] Olivier Zahm, Paul G Constantine, Clementine Prieur, and Youssef M Marzouk. Gradient-based dimension reduction of multivariate vector-valued functions. SIAM Journal on Scientific Computing, 42(1):A534–A558, 2020.
  • [54] Olivier Zahm, Tiangang Cui, Kody Law, Alessio Spantini, and Youssef Marzouk. Certified dimension reduction in nonlinear Bayesian inverse problems. arXiv preprint arXiv:1807.03712, 2018.