
On the use of approximate Bayesian computation Markov chain Monte Carlo with inflated tolerance and post-correction

Matti Vihola  and  Jordan Franks Department of Mathematics and Statistics, University of Jyväskylä P.O.Box 35, FI-40014 University of Jyväskylä, Finland matti.s.vihola@jyu.fi, jordan.j.franks@jyu.fi
Abstract.

Approximate Bayesian computation allows for inference of complicated probabilistic models with intractable likelihoods using model simulations. The Markov chain Monte Carlo implementation of approximate Bayesian computation is often sensitive to the tolerance parameter: low tolerance leads to poor mixing and large tolerance entails excess bias. We consider an approach using a relatively large tolerance for the Markov chain Monte Carlo sampler to ensure its sufficient mixing, and post-processing the output leading to estimators for a range of finer tolerances. We introduce an approximate confidence interval for the related post-corrected estimators, and propose an adaptive approximate Bayesian computation Markov chain Monte Carlo, which finds a ‘balanced’ tolerance level automatically, based on acceptance rate optimisation. Our experiments show that post-processing based estimators can perform better than direct Markov chain Monte Carlo targeting a fine tolerance, that our confidence intervals are reliable, and that our adaptive algorithm leads to reliable inference with little user specification.

Key words and phrases:
Adaptive, approximate Bayesian computation, confidence interval, importance sampling, Markov chain Monte Carlo, tolerance choice

1. Introduction

Approximate Bayesian computation is a form of likelihood-free inference (see, e.g., the reviews Marin et al., 2012; Sunnåker et al., 2013) which is used when exact Bayesian inference of a parameter $\theta\in\mathsf{T}$ with posterior density $\pi(\theta)\propto\mathrm{pr}(\theta)L(\theta)$ is impossible, where $\mathrm{pr}(\theta)$ is the prior density and $L(\theta)=g(y^{*}\mid\theta)$ is an intractable likelihood with data $y^{*}\in\mathsf{Y}$. More specifically, when the generative model of observations $g(\,\cdot\,\mid\theta)$ cannot be evaluated, but allows for simulations, we may perform relatively straightforward approximate inference based on the following (pseudo-)posterior:

(1) $\pi_{\epsilon}(\theta)\propto\mathrm{pr}(\theta)L_{\epsilon}(\theta),\qquad\text{where}\qquad L_{\epsilon}(\theta)=\mathbb{E}[K_{\epsilon}(Y_{\theta},y^{*})],\quad Y_{\theta}\sim g(\,\cdot\,\mid\theta),$

where $\epsilon>0$ is a ‘tolerance’ parameter, and $K_{\epsilon}:\mathsf{Y}^{2}\to[0,\infty)$ is a ‘kernel’ function, which is often taken as a simple cut-off $K_{\epsilon}(y,y^{*})=1\left(\|s(y)-s(y^{*})\|\leq\epsilon\right)$, where $s:\mathsf{Y}\to\mathbb{R}^{d}$ extracts a vector of summary statistics from the (pseudo-)observations.

The summary statistics are often chosen based on the application at hand, and reflect what is relevant for the inference task; see also (Fearnhead and Prangle, 2012; Raynal et al., to appear). Because $L_{\epsilon}(\theta)$ may be regarded as a smoothed version of the true likelihood $g(y^{*}\mid\theta)$ using the kernel $K_{\epsilon}$, it is intuitive that using too large an $\epsilon$ may blur the likelihood and bias the inference. Therefore, it is generally desirable to use as small a tolerance $\epsilon>0$ as possible, but because the computational methods suffer from inefficiency with small $\epsilon$, the choice of tolerance level is difficult (cf. Bortot et al., 2007; Sisson and Fan, 2018; Tanaka et al., 2006).

We discuss a simple post-processing procedure which allows for consideration of a range of values for the tolerance $\epsilon\leq\delta$, based on a single run of approximate Bayesian computation Markov chain Monte Carlo (Marjoram et al., 2003) with tolerance $\delta$. Such post-processing was suggested in (Wegmann et al., 2009) (in the case of the simple cut-off), and similar post-processing has also been suggested with regression adjustment (Beaumont et al., 2002) (in a rejection sampling context). The method, discussed further in Section 2, can be useful for two reasons: a range of tolerances $\epsilon\leq\delta$ may be routinely inspected, which can reveal excess bias in the pseudo-posterior $\pi_{\delta}$; and the Markov chain Monte Carlo inference may be implemented with sufficiently large $\delta$ to allow for good mixing.

Our contribution is two-fold. We suggest straightforward-to-calculate approximate confidence intervals for the posterior mean estimates calculated from the post-processing output, and discuss some theoretical properties related to them. We also introduce an adaptive approximate Bayesian computation Markov chain Monte Carlo which finds a balanced $\delta$ during burn-in, using the acceptance rate as a proxy, and detail a convergence result for it.

2. Post-processing over a range of tolerances

For the rest of the paper, we assume that the kernel function in (1) has the form

$K_{\epsilon}(y,y^{*})=\phi\big(d(y,y^{*})/\epsilon\big),$

where $d:\mathsf{Y}^{2}\to[0,\infty)$ is any ‘dissimilarity’ function and $\phi:[0,\infty)\to[0,1]$ is a non-increasing ‘cut-off’ function. Typically $d(y,y^{*})=\|s(y)-s(y^{*})\|$, where $s:\mathsf{Y}\to\mathbb{R}^{d}$ are the chosen summaries, and in case of the simple cut-off discussed in Section 1, $\phi(t)=\phi_{\mathrm{simple}}(t)=1\left(t\leq 1\right)$. We will implicitly assume that the pseudo-posterior $\pi_{\epsilon}$ given in (1) is well-defined for all $\epsilon>0$ of interest, that is, $c_{\epsilon}=\int\mathrm{pr}(\theta)L_{\epsilon}(\theta)\mathrm{d}\theta>0$.

The following summarises the approximate Bayesian computation Markov chain Monte Carlo algorithm of Marjoram et al. (2003), with proposal $q$ and tolerance $\delta>0$:

Algorithm 1 (abc-mcmc($\delta$)).

Suppose $\Theta_{0}\in\mathsf{T}$ and $Y_{0}\in\mathsf{Y}$ are any starting values, such that $\mathrm{pr}(\Theta_{0})>0$ and $\phi(d(Y_{0},y^{*})/\delta)>0$. For $k=1,2,\ldots$, iterate:

  (i) Draw $\tilde{\Theta}_{k}\sim q(\Theta_{k-1},\,\cdot\,)$ and $\tilde{Y}_{k}\sim g(\,\cdot\,\mid\tilde{\Theta}_{k})$.

  (ii) With probability $\alpha_{\delta}(\Theta_{k-1},Y_{k-1};\tilde{\Theta}_{k},\tilde{Y}_{k})$ accept and set $(\Theta_{k},Y_{k})\leftarrow(\tilde{\Theta}_{k},\tilde{Y}_{k})$; otherwise reject and set $(\Theta_{k},Y_{k})\leftarrow(\Theta_{k-1},Y_{k-1})$, where

  $\alpha_{\delta}(\theta,y;\tilde{\theta},\tilde{y})=\min\bigg\{1,\frac{\mathrm{pr}(\tilde{\theta})q(\tilde{\theta},\theta)\phi\big(d(\tilde{y},y^{*})/\delta\big)}{\mathrm{pr}(\theta)q(\theta,\tilde{\theta})\phi\big(d(y,y^{*})/\delta\big)}\bigg\}.$

Algorithm 1 may be implemented by storing only $\Theta_{k}$ and the related distances $T_{k}=d(Y_{k},y^{*})$, and in what follows, we regard either $(\Theta_{k},Y_{k})_{k\geq 1}$ or $(\Theta_{k},T_{k})_{k\geq 1}$ as the output of Algorithm 1. In practice, the initial values $(\Theta_{0},Y_{0})$ should be taken as the state of Algorithm 1 after it has been run for a number of initial ‘burn-in’ iterations. We also introduce an adaptive algorithm for parameter tuning later (Section 4).
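To make the sampler concrete, the following is a minimal Julia sketch of abc-mcmc($\delta$) with the simple cut-off and a Gaussian random walk proposal; the function names (`abc_mcmc`, `logprior`, `sim`, `dist`) and the fixed proposal standard deviation are illustrative assumptions, not the implementation accompanying the paper.

```julia
# Minimal sketch of abc-mcmc(δ) with the simple cut-off and a symmetric
# Gaussian random walk proposal. `logprior(θ)`, `sim(θ)` (draws one
# pseudo-observation) and `dist(y)` (distance to the data) are user-supplied.
function abc_mcmc(logprior, sim, dist, θ0, δ; n = 10_000, prop_sd = 1.0)
    θ = copy(θ0)                               # θ is assumed to be a vector
    T = dist(sim(θ))                           # distance of the initial pseudo-data
    T <= δ || error("initial state not within tolerance δ")
    Θs = Vector{typeof(θ)}(undef, n)
    Ts = Vector{Float64}(undef, n)
    for k in 1:n
        θp = θ .+ prop_sd .* randn(length(θ))  # random walk proposal
        Tp = dist(sim(θp))                     # simulate pseudo-data at the proposal
        # simple cut-off: accept only if the proposal is within tolerance,
        # with the Metropolis-Hastings prior ratio (symmetric proposal)
        if Tp <= δ && log(rand()) < logprior(θp) - logprior(θ)
            θ, T = θp, Tp
        end
        Θs[k], Ts[k] = copy(θ), T
    end
    return Θs, Ts
end
```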

It is possible to consider a variant of Algorithm 1 where many (possibly dependent) observations $\tilde{Y}_{k}^{(1)},\ldots,\tilde{Y}_{k}^{(m)}\sim g(\,\cdot\,\mid\tilde{\Theta}_{k})$ are simulated in each iteration, and an average of their kernel values is used in the accept-reject step (cf. Andrieu et al., 2018). We focus here on the case of a single pseudo-observation per iteration, following the asymptotic efficiency result of Bornn et al. (2017), but remark that our method may be applied in a straightforward manner also with multiple observations.

Definition 2.

Suppose $(\Theta_{k},T_{k})_{k=1,\ldots,n}$ is the output of abc-mcmc($\delta$) for some $\delta>0$. For any $\epsilon\in(0,\delta]$ such that $\phi(T_{k}/\epsilon)>0$ for some $k=1,\ldots,n$, and for any function $f:\mathsf{T}\to\mathbb{R}$, define

$U_{k}^{(\delta,\epsilon)}=\phi(T_{k}/\epsilon)\big/\phi(T_{k}/\delta),\qquad W_{k}^{(\delta,\epsilon)}=U_{k}^{(\delta,\epsilon)}\big/\textstyle\sum_{j=1}^{n}U_{j}^{(\delta,\epsilon)},$
$E_{\delta,\epsilon}(f)=\textstyle\sum_{k=1}^{n}W_{k}^{(\delta,\epsilon)}f(\Theta_{k}),\qquad S_{\delta,\epsilon}(f)=\textstyle\sum_{k=1}^{n}(W_{k}^{(\delta,\epsilon)})^{2}\big\{f(\Theta_{k})-E_{\delta,\epsilon}(f)\big\}^{2}.$

Algorithm 11 in the Appendix details how $E_{\delta,\epsilon}(f)$ and $S_{\delta,\epsilon}(f)$ can be calculated in $O(n\log n)$ time simultaneously for all $\epsilon\leq\delta$ in case of the simple cut-off. The estimator $E_{\delta,\epsilon}(f)$ approximates $\mathbb{E}_{\pi_{\epsilon}}[f(\Theta)]$ and $S_{\delta,\epsilon}(f)$ may be used to construct a confidence interval; see Algorithm 5 below. Theorem 4 details consistency of $E_{\delta,\epsilon}(f)$, and relates $S_{\delta,\epsilon}(f)$ to the limiting variance, in case the following (well-known) condition ensuring a central limit theorem holds:
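As an illustration, a minimal Julia sketch of the estimators of Definition 2 for a single target tolerance $\epsilon$ is given below; it assumes the chain output as vectors `Θs` and `Ts` (e.g., as returned by the `abc_mcmc` sketch above) and a generic cut-off function `ϕ` supplied by the user.

```julia
# Post-correction estimators E_{δ,ε}(f) and S_{δ,ε}(f) of Definition 2
# for a single tolerance ε ≤ δ, with a generic cut-off ϕ (default: simple cut-off).
function post_correct(f, Θs, Ts, δ, ϵ; ϕ = t -> Float64(t <= 1))
    U = ϕ.(Ts ./ ϵ) ./ ϕ.(Ts ./ δ)      # importance weights U_k
    sum(U) > 0 || error("no positive weights: ε too small for this output")
    W = U ./ sum(U)                      # self-normalised weights W_k
    fs = f.(Θs)
    E = sum(W .* fs)                     # estimate of E_{π_ε}[f(Θ)]
    S = sum(W.^2 .* (fs .- E).^2)        # variance proxy used in the confidence interval
    return E, S
end
```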

Assumption 3 (Finite integrated autocorrelation).

Suppose that $\mathbb{E}_{\pi_{\epsilon}}[f^{2}(\Theta)]<\infty$ and $\sum_{k\geq 1}\rho_{k}^{(\delta,\epsilon)}$ is finite, with $\rho_{k}^{(\delta,\epsilon)}=\mathrm{Corr}\big(h_{\delta,\epsilon}(\Theta_{0}^{(s)},Y_{0}^{(s)}),h_{\delta,\epsilon}(\Theta_{k}^{(s)},Y_{k}^{(s)})\big)$, where $(\Theta_{k}^{(s)},Y_{k}^{(s)})_{k\geq 1}$ is a stationary version of the abc-mcmc($\delta$) chain, and

$h_{\delta,\epsilon}(\theta,y)=w_{\delta,\epsilon}(y)f(\theta)\qquad\text{where}\qquad w_{\delta,\epsilon}(y)=\phi\big(d(y,y^{*})/\epsilon\big)/\phi\big(d(y,y^{*})/\delta\big).$
Theorem 4.

Suppose $(\Theta_{k},T_{k})_{k\geq 1}$ is the output of abc-mcmc($\delta$), and denote by $E_{\delta,\epsilon}^{(n)}(f)$ and $S_{\delta,\epsilon}^{(n)}(f)$ the estimators in Definition 2. If $(\Theta_{k},T_{k})_{k\geq 1}$ is $\varphi$-irreducible (Meyn and Tweedie, 2009) then, for any $\epsilon\in(0,\delta)$, we have as $n\to\infty$:

  (i) $E_{\delta,\epsilon}^{(n)}(f)\to\mathbb{E}_{\pi_{\epsilon}}[f(\Theta)]$ almost surely, whenever the expectation is finite.

  (ii) Under Assumption 3, $n^{1/2}\big(E_{\delta,\epsilon}^{(n)}(f)-\mathbb{E}_{\pi_{\epsilon}}[f(\Theta)]\big)\to N\big(0,v_{\delta,\epsilon}(f)\tau_{\delta,\epsilon}(f)\big)$ in distribution, where $\tau_{\delta,\epsilon}(f)=\big(1+2\sum_{k\geq 1}\rho_{k}^{(\delta,\epsilon)}\big)\in[0,\infty)$ and $nS_{\delta,\epsilon}^{(n)}(f)\to v_{\delta,\epsilon}(f)\in[0,\infty)$ almost surely.

The proof of Theorem 4 is given in the Appendix. Inspired by Theorem 4, we suggest reporting the following approximate confidence intervals for the suggested estimators:

Algorithm 5.

Suppose $(\Theta_{k},T_{k})_{k=1,\ldots,n}$ is the output of abc-mcmc($\delta$) and $f:\mathsf{T}\to\mathbb{R}$ is a function; then for any $\epsilon\leq\delta$:

  (i) Calculate $E_{\delta,\epsilon}(f)$ and $S_{\delta,\epsilon}(f)$ as in Definition 2 (or with Algorithm 11).

  (ii) Calculate $\hat{\tau}_{\delta}(f)$, an estimate of the integrated autocorrelation of $\big(f(\Theta_{k})\big)_{k=1,\ldots,n}$.

  (iii) Report the confidence interval

  $\big[E_{\delta,\epsilon}(f)\pm z_{q}\big(S_{\delta,\epsilon}(f)\hat{\tau}_{\delta}(f)\big)^{1/2}\big],$

  where $z_{q}>0$ corresponds to the desired normal quantile.

The confidence interval in Algorithm 5 is a straightforward application of Theorem 4, except for using a common integrated autocorrelation estimate $\hat{\tau}_{\delta}(f)$ for all $\tau_{\delta,\epsilon}(f)$. This relies on the approximation $\tau_{\delta,\epsilon}(f)\lessapprox\tau_{\delta}(f)$, which may not always be entirely accurate, but is likely to be reasonable, as illustrated by Theorem 6 in Section 3 below. We suggest using a common $\hat{\tau}_{\delta}(f)$ for all tolerances because direct estimation of the integrated autocorrelation is computationally demanding, and likely to be unstable for small $\epsilon$.

The classical choice for $\hat{\tau}_{\delta}(f)$ in Algorithm 5(ii) is the windowed autocorrelation estimate, $\hat{\tau}_{\delta}(f)=\sum_{k=-\infty}^{\infty}\omega(k)\hat{\rho}_{k}$, with some $0\leq\omega(k)\leq 1$, where $\hat{\rho}_{k}$ is the sample autocorrelation of $\big(f(\Theta_{k})\big)$ (cf. Geyer, 1992). We employ this approach in our experiments with $\omega(k)=1\left(|k|\leq M\right)$, where the cut-off lag $M$ is chosen adaptively as the smallest integer such that $M\geq 5\big(1+2\sum_{k=1}^{M}\hat{\rho}_{k}\big)$ (Sokal, 1996). Also more sophisticated techniques for the calculation of the asymptotic variance have been suggested (e.g. Flegal and Jones, 2010).
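A minimal Julia sketch of this windowed estimate with the adaptive window rule is given below (an illustration under the stated rule, not the paper's exact implementation):

```julia
using Statistics

# Windowed estimate of the integrated autocorrelation time τ̂ of a scalar series,
# with the adaptive window: the smallest M with M ≥ 5(1 + 2 Σ_{k=1}^{M} ρ̂_k).
function integrated_autocorrelation(x::AbstractVector{<:Real})
    n = length(x)
    x̄, v = mean(x), var(x)
    ρ(k) = sum((x[1:n-k] .- x̄) .* (x[1+k:n] .- x̄)) / ((n - k) * v)  # sample autocorrelation
    τ, M = 1.0, 0
    while M < n - 1
        M += 1
        τ += 2 * ρ(M)
        M >= 5 * τ && break   # adaptive cut-off lag reached
    end
    return max(τ, 1.0)        # simple guard against negative estimates (illustrative choice)
end
```

In Algorithm 5, this would be applied to the series $f(\Theta_{k})$ and combined with $S_{\delta,\epsilon}(f)$ as in step (iii).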

We remark that, although we focus here on the case of using a common cut-off $\phi$ for both the abc-mcmc($\delta$) and the post-correction, one could also use a different cut-off $\phi_{s}$ in the simulation phase, as considered by Beaumont et al. (2002) in the regression context. The extension to Definition 2 is straightforward, setting $U_{k}^{(\delta,\epsilon)}=\phi(T_{k}/\epsilon)\big/\phi_{s}(T_{k}/\delta)$, and Theorem 4 remains valid under a support condition.

3. Theoretical justification

The following result, whose proof is given in the Appendix, gives an expression for the integrated autocorrelation in the case of the simple cut-off.

Theorem 6.

Suppose Assumption 3 holds and $\phi=\phi_{\mathrm{simple}}$; then

$\tau_{\delta,\epsilon}(f)-1=\frac{\big(\check{\tau}_{\delta,\epsilon}(f)-1\big)\mathrm{var}_{\pi_{\delta}}(f_{\delta,\epsilon})+2\int\pi_{\delta}(\theta)\bar{w}_{\delta,\epsilon}(\theta)\big(1-\bar{w}_{\delta,\epsilon}(\theta)\big)\frac{r_{\delta}(\theta)}{1-r_{\delta}(\theta)}f^{2}(\theta)\mathrm{d}\theta}{\mathrm{var}_{\pi_{\delta}}(f_{\delta,\epsilon})+\int\pi_{\delta}(\theta)\bar{w}_{\delta,\epsilon}(\theta)\big(1-\bar{w}_{\delta,\epsilon}(\theta)\big)f^{2}(\theta)\mathrm{d}\theta},$

where $\bar{w}_{\delta,\epsilon}(\theta)=L_{\epsilon}(\theta)/L_{\delta}(\theta)$, $f_{\delta,\epsilon}(\theta)=f(\theta)\bar{w}_{\delta,\epsilon}(\theta)$, $\check{\tau}_{\delta,\epsilon}(f)$ is the integrated autocorrelation of $\{f_{\delta,\epsilon}(\Theta_{k}^{(s)})\}_{k\geq 1}$ and $r_{\delta}(\theta)$ is the rejection probability of the abc-mcmc($\delta$) chain at $\theta$.

We next discuss how this loosely suggests that $\tau_{\delta,\epsilon}(f)\lessapprox\tau_{\delta,\delta}(f)=\tau_{\delta}(f)$. The weight $\bar{w}_{\delta,\delta}\equiv 1$, and under suitable regularity conditions both $\bar{w}_{\delta,\epsilon}(\theta)$ and $\check{\tau}_{\delta,\epsilon}(f)$ are continuous with respect to $\epsilon$, and $\bar{w}_{\delta,\epsilon}(\theta)\to 0$ as $\epsilon\to 0$. Then, for $\epsilon\approx\delta$, we have $\bar{w}_{\delta,\epsilon}\approx 1$ and therefore $\tau_{\delta,\delta}(f)\approx\tau_{\delta,\epsilon}(f)$. For small $\epsilon$, the terms with $\mathrm{var}_{\pi_{\delta}}(f_{\delta,\epsilon})$ are of order $O(\bar{w}_{\delta,\epsilon}^{2})$, and are dominated by the other terms of order $O(\bar{w}_{\delta,\epsilon})$. The remaining ratio may be written as

$\frac{2\int\pi_{\delta}(\theta)\bar{w}_{\delta,\epsilon}(\theta)\big(1-\bar{w}_{\delta,\epsilon}(\theta)\big)\frac{r_{\delta}(\theta)}{1-r_{\delta}(\theta)}f^{2}(\theta)\mathrm{d}\theta}{\int\pi_{\delta}(\theta)\bar{w}_{\delta,\epsilon}(\theta)\big(1-\bar{w}_{\delta,\epsilon}(\theta)\big)f^{2}(\theta)\mathrm{d}\theta}=2\mathbb{E}_{\pi_{\delta}}\Big[\bar{g}_{\delta,\epsilon}^{2}(\Theta)\frac{r_{\delta}(\Theta)}{1-r_{\delta}(\Theta)}\Big],$

where $\bar{g}_{\delta,\epsilon}\propto\{\bar{w}_{\delta,\epsilon}(1-\bar{w}_{\delta,\epsilon})\}^{1/2}f$ with $\pi_{\delta}(\bar{g}_{\delta,\epsilon}^{2})=1$. If $r_{\delta}(\theta)\leq r_{*}<1$, then the term is upper bounded by $2r_{*}(1-r_{*})^{-1}$, and we believe it to be often less than $\tau_{\delta,\delta}(f)$, because the latter expression is similar to the contribution of rejections to the integrated autocorrelation; see the proof of Theorem 6.

For general $\phi$, it appears to be hard to obtain a similar theoretical result, but we expect the approximation to still be sensible. Theorem 6 relies on $Y_{k}^{(s)}$ being independent of $(\Theta_{0}^{(s)},Y_{0}^{(s)})$ conditional on $\Theta_{k}^{(s)}$, assuming at least a single acceptance. This is not true with other cut-offs, but we believe that the dependence of $Y_{k}^{(s)}$ on $(\Theta_{0}^{(s)},Y_{0}^{(s)})$ given $\Theta_{k}^{(s)}$ is generally weaker than the dependence of $\Theta_{k}^{(s)}$ and $\Theta_{0}^{(s)}$, suggesting similar behaviour.

We conclude the section with a general (albeit pessimistic) upper bound for the asymptotic variance of the post-corrected estimators.

Theorem 7.

For any $\epsilon\leq\delta$, denote by $\sigma_{\delta,\epsilon}^{2}(f)=v_{\delta,\epsilon}(f)\tau_{\delta,\epsilon}(f)$ the asymptotic variance of the estimator of Definition 2 (see Theorem 4(ii)) and $\bar{f}(\theta)=f(\theta)-\mathbb{E}_{\pi_{\epsilon}}[f(\Theta)]$; then for any $\epsilon\leq\delta$,

$\sigma_{\delta,\epsilon}^{2}(f)\leq(c_{\delta}/c_{\epsilon})\big\{\sigma_{\epsilon}^{2}(f)+\tilde{\pi}_{\epsilon}\big(\bar{f}^{2}(1-w_{\delta,\epsilon})\big)\big\},$

where $\tilde{\pi}_{\epsilon}$ is the stationary distribution of the direct abc-mcmc($\epsilon$) and $\sigma_{\epsilon}^{2}(f)=\sigma_{\epsilon,\epsilon}^{2}(f)$ its asymptotic variance.

Theorem 7 follows directly from (Franks and Vihola, 2017, Corollary 4). The upper bound guarantees that a moderate correction, that is, $\epsilon$ close to $\delta$ and $c_{\delta}$ close to $c_{\epsilon}$, is nearly as efficient as direct abc-mcmc($\epsilon$). Indeed, typically $w_{\delta,\epsilon}\to 1$ and $c_{\epsilon}\to c_{\delta}$ as $\epsilon\to\delta$, in which case Theorem 7 implies $\limsup_{\epsilon\to\delta}\sigma_{\delta,\epsilon}^{2}(f)\leq\sigma_{\delta}^{2}(f)$. However, as $\epsilon\to 0$, the bound becomes less informative.

4. Tolerance adaptation

We propose Algorithm 8 below to adapt the tolerance $\delta$ in abc-mcmc($\delta$) during a burn-in of length $n_{b}$, in order to obtain a user-specified overall acceptance rate $\alpha^{*}\in(0,1)$. Tolerance optimisation has been suggested earlier based on quantiles of distances, with parameters simulated from the prior (e.g. Beaumont et al., 2002; Wegmann et al., 2009). This heuristic might not be satisfactory in the Markov chain Monte Carlo context, if the prior is relatively uninformative. We believe that acceptance rate optimisation is a more natural alternative, as also suggested by Sisson and Fan (2018).

Our method also requires a sequence of decreasing positive step sizes $(\gamma_{k})_{k\geq 1}$. We used $\alpha^{*}=0.1$ and $\gamma_{k}=k^{-2/3}$ in our experiments, and discuss these choices later.

Algorithm 8.

Suppose $\Theta_{0}\in\mathsf{T}$ is a starting value with $\mathrm{pr}(\Theta_{0})>0$. Initialise $\delta_{0}=d(Y_{0},y^{*})>0$ where $Y_{0}\sim g(\,\cdot\,\mid\Theta_{0})$. For $k=1,\ldots,n_{b}$, iterate:

  (i) Draw $\tilde{\Theta}_{k}\sim q(\Theta_{k-1},\,\cdot\,)$ and $\tilde{Y}_{k}\sim g(\,\cdot\,\mid\tilde{\Theta}_{k})$.

  (ii) With probability $A_{k}=\alpha_{\delta_{k-1}}(\Theta_{k-1},Y_{k-1};\tilde{\Theta}_{k},\tilde{Y}_{k})$ accept and set $(\Theta_{k},Y_{k})\leftarrow(\tilde{\Theta}_{k},\tilde{Y}_{k})$; otherwise reject and set $(\Theta_{k},Y_{k})\leftarrow(\Theta_{k-1},Y_{k-1})$.

  (iii) Set $\log\delta_{k}\leftarrow\log\delta_{k-1}+\gamma_{k}(\alpha^{*}-A_{k})$.
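A minimal Julia sketch of the tolerance adaptation is given below for the simple cut-off and a symmetric random walk proposal with a fixed standard deviation; the function name and interfaces are illustrative, and the simultaneous covariance adaptation used in our experiments is omitted.

```julia
# Sketch of Algorithm 8 (tolerance adaptation) with the simple cut-off and a
# symmetric random walk proposal; step size γ_k = k^(-2/3) and target rate α*.
function adapt_tolerance(logprior, sim, dist, θ0; nb = 10_000, αstar = 0.1, prop_sd = 1.0)
    θ = copy(θ0)
    T = dist(sim(θ))
    logδ = log(T)                          # initialise δ₀ = d(Y₀, y*)
    for k in 1:nb
        θp = θ .+ prop_sd .* randn(length(θ))
        Tp = dist(sim(θp))
        δ = exp(logδ)
        # acceptance probability A_k of abc-mcmc(δ_{k-1}); for simplicity this
        # sketch always caps the ratio at the prior ratio, even if the current
        # state has drifted outside the shrinking tolerance
        A = Tp <= δ ? min(1.0, exp(logprior(θp) - logprior(θ))) : 0.0
        if rand() < A
            θ, T = θp, Tp
        end
        logδ += k^(-2/3) * (αstar - A)     # step (iii): Robbins-Monro update of log δ
    end
    return θ, T, exp(logδ)
end
```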

In practice, we use Algorithm 8 with a Gaussian symmetric random walk proposal $q_{\Sigma_{k}}$, where the covariance parameter $\Sigma_{k}$ is adapted simultaneously (Haario et al., 2001; Andrieu and Moulines, 2006) (see Algorithm 23 of Supplement D). We only detail theory for Algorithm 8, but note that similar simultaneous adaptation has been discussed earlier (cf. Andrieu and Thoms, 2008), and expect that our results could be elaborated accordingly.

The following conditions suffice for convergence of the adaptation:

Assumption 9.

Suppose $\phi=\phi_{\mathrm{simple}}$ and the following hold:

  (i) $\gamma_{k}=Ck^{-r}$ with $r\in(\frac{1}{2},1]$ and $C>0$ a constant.

  (ii) The domain $\mathsf{T}\subset\mathbb{R}^{n_{\theta}}$, $n_{\theta}\geq 1$, is a nonempty open set and $\mathrm{pr}(\theta)$ is bounded.

  (iii) The proposal $q$ is bounded and bounded away from zero.

  (iv) The distances $D_{\theta}=d(Y_{\theta},y^{*})$, where $Y_{\theta}\sim g(\,\cdot\,\mid\theta)$, admit densities which are uniformly bounded in $\theta$.

  (v) $(\delta_{k})_{k\geq 1}$ stays in a set $[a,b]$ almost surely, where $0<a\leq b<+\infty$.

  (vi) $c_{\epsilon}=\int\mathrm{pr}(\mathrm{d}\theta)L_{\epsilon}(\theta)>0$ for all $\epsilon\in[a,b]$.

Theorem 10.

Under Assumption 9, the expected value of the acceptance probability, with respect to the stationary distribution of the chain, converges to $\alpha^{*}$.

Proof of Theorem 10 will follow from the more general Theorem 14 of Supplement B.

Polynomially decaying step size sequences as in Assumption 9(i) are common in stochastic approximation type adaptation such as ours (Andrieu and Thoms, 2008). Slower decaying step sizes such as $n^{-2/3}$ often behave better with acceptance rate adaptation (cf. Vihola, 2012, Remark 3).

Simple random walk Metropolis with covariance adaptation (Haario et al., 2001) typically leads to a limiting acceptance rate around $0.234$ (Roberts et al., 1997). In case of a pseudo-marginal algorithm such as abc-mcmc($\delta$), the acceptance rate is lower than this, and decreases when $\delta$ is decreased (see Lemma 16 of Supplement C). Markov chain Monte Carlo would typically be necessary when rejection sampling is not possible, that is, when the prior is far from the posterior. In such a case, the likelihood approximation must be accurate enough to provide a reasonable approximation $\pi_{\delta}\approx\pi_{\epsilon}$. This suggests that the desired acceptance rate should be taken substantially lower than $0.234$.

The choice of the desired acceptance rate $\alpha^{*}$ could also be motivated by theory developed for pseudo-marginal Markov chain Monte Carlo algorithms. Doucet et al. (2015) rely on log-normality of the likelihood estimators, which is problematic in our context, because the likelihood estimators take the value zero. Sherlock et al. (2015) find the acceptance rate $0.07$ to be optimal under certain conditions, but also in a quite dissimilar context. Indeed, in our context, the $0.07$ guideline assumes a fixed tolerance, and informs the choice of the number of pseudo-data per iteration. As we stick with a single pseudo-data per iteration following (Bornn et al., 2017), the $0.07$ guideline should not be taken as very informative. We recommend a slightly higher $\alpha^{*}$ such as $0.1$ to ensure sufficient mixing.

5. Post-processing with regression correction

Beaumont et al. (2002) suggested post-processing similar to that in Section 2, applying a further regression correction. Namely, in the context of Section 2, we may consider a function $\tilde{f}^{(\epsilon)}(\theta,y)=f(\theta)-\bar{s}(y)^{\mathrm{\scriptscriptstyle T}}b_{\epsilon}$ where $\bar{s}(y)=s(y)-s(y^{*})$ and $b_{\epsilon}$ is a solution of

$\min_{a_{\epsilon},b_{\epsilon}}\mathbb{E}_{\tilde{\pi}_{\epsilon}}\big[\big\{f(\Theta)-a_{\epsilon}-\bar{s}(Y)^{\mathrm{\scriptscriptstyle T}}b_{\epsilon}\big\}^{2}\big]=\min_{a_{\epsilon},b_{\epsilon}}\mathbb{E}_{\tilde{\pi}_{\delta}}\big[w_{\delta,\epsilon}(Y)\big\{f(\Theta)-a_{\epsilon}-\bar{s}(Y)^{\mathrm{\scriptscriptstyle T}}b_{\epsilon}\big\}^{2}\big],$

where $\tilde{\pi}_{\delta}$ is the stationary distribution of abc-mcmc($\delta$), with marginal $\pi_{\delta}$, given in the Appendix. When the latter expectation is replaced by its empirical version, the solution coincides with weighted least squares $(\hat{a}_{\epsilon},\hat{b}_{\epsilon}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}=(\mathrm{M}^{\mathrm{\scriptscriptstyle T}}\mathrm{W}_{\epsilon}\mathrm{M})^{-1}\mathrm{M}^{\mathrm{\scriptscriptstyle T}}\mathrm{W}_{\epsilon}v$, with $v_{k}=f(\Theta_{k})$, $\mathrm{W}_{\epsilon}=\mathrm{diag}(W_{1}^{(\delta,\epsilon)},\ldots,W_{n}^{(\delta,\epsilon)})$ and with the matrix $\mathrm{M}$ having rows $[\mathrm{M}]_{k,\,\cdot\,}=(1,\bar{s}(Y_{k})^{\mathrm{\scriptscriptstyle T}})$.

We suggest the following confidence interval for $a_{\epsilon}=E_{\tilde{\pi}_{\epsilon}}[\tilde{f}^{(\epsilon)}(\Theta,Y)]$ in the spirit of Algorithm 5:

$\big[\hat{a}_{\epsilon}\pm z_{q}\big(S_{\delta,\epsilon}^{\mathrm{reg}}\hat{\tau}^{\mathrm{reg}}_{\delta}\big)^{1/2}\big],$

where $\hat{\tau}^{\mathrm{reg}}_{\delta}$ is the integrated autocorrelation estimate for $(\hat{F}_{k}^{(\delta)})$, with $\hat{F}_{k}^{(\delta)}=f(\Theta_{k})-\bar{s}(Y_{k})^{\mathrm{\scriptscriptstyle T}}\hat{b}_{\delta}$, and $S_{\delta,\epsilon}^{\mathrm{reg}}=[(\mathrm{M}^{\mathrm{\scriptscriptstyle T}}\mathrm{W}_{\epsilon}\mathrm{M})^{-1}]_{1,1}\sum_{k=1}^{n}(W_{k}^{(\delta,\epsilon)})^{2}(\hat{F}_{k}^{(\epsilon)}-\hat{a}_{\epsilon})^{2}$, where the first factor is included as an attempt to account for the increased uncertainty due to the estimated $\hat{b}_{\epsilon}$, analogous to weighted least squares. Experimental results show some promise for this confidence interval, but we stress that we do not have better theoretical backing for it, and leave further elaboration of the confidence interval for future research.
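For illustration, a minimal Julia sketch of the weighted least squares solution $(\hat{a}_{\epsilon},\hat{b}_{\epsilon})$ is given below; the function name and the input format (values $f(\Theta_{k})$, the matrix of centred summaries $\bar{s}(Y_{k})$, and the weights $W_{k}^{(\delta,\epsilon)}$) are illustrative assumptions.

```julia
using LinearAlgebra

# Weighted least squares regression correction in the spirit of Beaumont et al. (2002):
# `fs` holds the values f(Θ_k), `S` is the n×d matrix with rows s̄(Y_k)ᵀ, and
# `W` holds the self-normalised weights W_k^{(δ,ε)}.
function regression_correction(fs::Vector{Float64}, S::Matrix{Float64}, W::Vector{Float64})
    n = length(fs)
    M = hcat(ones(n), S)               # design matrix with rows (1, s̄(Y_k)ᵀ)
    A = M' * Diagonal(W) * M           # weighted normal equations Mᵀ W M
    coef = A \ (M' * (W .* fs))        # (â_ε, b̂_εᵀ)ᵀ
    â, b̂ = coef[1], coef[2:end]
    return â, b̂
end
```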

6. Experiments

We experiment with our methods on two models: a lightweight Gaussian toy example, and a Lotka-Volterra model. Our experiments focus on three aspects: can abc-mcmc($\delta$) with a larger tolerance $\delta$ and post-correction to a desired tolerance $\epsilon<\delta$ deliver more accurate results than direct abc-mcmc($\epsilon$); does the approximate confidence interval appear reliable; and how well does the tolerance adaptation work in practice. All the experiments are implemented in Julia (Bezanson et al., 2017), and the code is available at https://bitbucket.org/mvihola/abc-mcmc.

Because we believe that Markov chain Monte Carlo is most useful when little is known about the posterior, we apply covariance adaptation (Haario et al., 2001; Andrieu and Moulines, 2006) throughout the simulation in all our experiments, using an identity covariance initially. When running the covariance adaptation alone, we employ the step size $n^{-1}$ as in the original method of Haario et al. (2001), and in case of tolerance adaptation, we use step size $n^{-2/3}$.

Regarding our first question, we investigate running abc-mcmc($\delta$) starting near the posterior mode with different pre-selected tolerances $\delta$. We first attempted to perform the experiments by initialising the chains from independent samples of the prior distribution, but in this case, most of the chains failed to accept a single move during the entire run. In contrast, our experiments with tolerance adaptation are initialised from the prior, and both the tolerances and the covariances are adjusted fully automatically by our algorithm.

6.1. One-dimensional Gaussian model

Our first model is a toy model with $\mathrm{pr}(\theta)=N(\theta;0,30^{2})$, $g(y\mid\theta)=N(y;\theta,1)$ and $d(y,y^{*})=|y|$. The true posterior without approximation is Gaussian. While this scenario is clearly academic, the prior is far from the posterior, making rejection sampling approximate Bayesian computation inefficient. It is clear that $\pi_{\epsilon}$ has zero mean for all $\epsilon$ (by symmetry), and that $\pi_{\epsilon}$ is more spread for bigger $\epsilon$. We experiment with both the simple cut-off $\phi_{\mathrm{simple}}$ and the Gaussian cut-off $\phi_{\mathrm{Gauss}}(t)=e^{-t^{2}/2}$.
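For concreteness, the ingredients of this toy model can be written as follows (a sketch using Distributions.jl and the same illustrative interfaces as in the earlier sketches; reading the distance $d(y,y^{*})=|y|$ as corresponding to observed data $y^{*}=0$ is our interpretation of the setup):

```julia
using Distributions

# Toy Gaussian model of Section 6.1 (a sketch; the observed data is taken to be
# y* = 0, so that d(y, y*) = |y| as in the text).
logprior(θ) = logpdf(Normal(0, 30), θ[1])   # pr(θ) = N(θ; 0, 30²)
sim(θ) = rand(Normal(θ[1], 1))              # g(y | θ) = N(y; θ, 1)
dist(y) = abs(y)                            # d(y, y*) = |y|

# e.g. Θs, Ts = abc_mcmc(logprior, sim, dist, [0.0], 3.0; n = 10_000)
```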

We run the experiments with 10,000 independent chains, each for 11,000 iterations including 1,000 burn-in. The chains were always started from $\theta_{0}=0$. We inspect estimates of the posterior mean $\mathbb{E}_{\pi_{\epsilon}}[f(\Theta)]$ for $f(\theta)=\theta$ and $f(\theta)=|\theta|$. Figure 1 (left) shows the estimates with their confidence intervals based on a single realisation of abc-mcmc(3). Figure 1 (right) shows box plots of the estimates calculated from each abc-mcmc($\delta$), with $\delta$ indicated by colour; the rightmost box plot (blue) corresponds to abc-mcmc(3), the second from the right (red) to abc-mcmc(2.275), and so on. For $\epsilon=0.1$, the post-corrected estimates from abc-mcmc($0.825$) and abc-mcmc($1.55$) appear slightly more accurate than direct abc-mcmc($0.1$). A similar figure for the Gaussian cut-off, with similar findings, may be found in Supplement Figure 6.

Figure 1. Gaussian model with $\phi_{\mathrm{simple}}$. Estimates from a single run of abc-mcmc($3$) (left) and estimates from 10,000 replications of abc-mcmc($\delta$) for $\delta\in\{0.1,0.825,1.55,2.275,3\}$ indicated by colours.

Table 1 shows the frequencies of the calculated 95% confidence intervals containing the ‘ground truth’, as well as mean acceptance rates. The ground truth for $\mathbb{E}_{\pi_{\epsilon}}[f_{1}(\Theta)]$ is known to be zero for all $\epsilon$, and the overall mean of all the calculated estimates is used as the ground truth for $\mathbb{E}_{\pi_{\epsilon}}[f_{2}(\Theta)]$. The frequencies appear close to ideal with the post-correction approach, being slightly pessimistic in the case of the simple cut-off, as anticipated by the theoretical considerations (cf. Theorem 6 and the discussion below it).

Table 1. Frequencies of the 95% confidence intervals, from abc-mcmc($\delta$) to tolerances $\epsilon$, containing the ground truth in the Gaussian model. The first five $\epsilon$ columns are for $f(x)=x$, the last five for $f(x)=|x|$.

| Cut-off | $\delta$ | $\epsilon$=0.10 | 0.82 | 1.55 | 2.28 | 3.00 | $\epsilon$=0.10 | 0.82 | 1.55 | 2.28 | 3.00 | Acc. rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\phi_{\mathrm{simple}}$ | 0.1 | 0.93 | | | | | 0.93 | | | | | 0.03 |
| | 0.82 | 0.97 | 0.95 | | | | 0.95 | 0.94 | | | | 0.22 |
| | 1.55 | 0.97 | 0.97 | 0.95 | | | 0.96 | 0.95 | 0.95 | | | 0.33 |
| | 2.28 | 0.98 | 0.97 | 0.96 | 0.95 | | 0.96 | 0.96 | 0.96 | 0.95 | | 0.4 |
| | 3.0 | 0.98 | 0.98 | 0.97 | 0.97 | 0.95 | 0.96 | 0.96 | 0.96 | 0.95 | 0.95 | 0.43 |
| $\phi_{\mathrm{Gauss}}$ | 0.1 | 0.93 | | | | | 0.93 | | | | | 0.05 |
| | 0.82 | 0.94 | 0.95 | | | | 0.92 | 0.95 | | | | 0.29 |
| | 1.55 | 0.94 | 0.94 | 0.95 | | | 0.94 | 0.94 | 0.95 | | | 0.38 |
| | 2.28 | 0.95 | 0.95 | 0.95 | 0.95 | | 0.95 | 0.95 | 0.96 | 0.95 | | 0.41 |
| | 3.0 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.96 | 0.95 | 0.95 | 0.95 | 0.42 |

Figure 2 shows the progress of the tolerance adaptations during the burn-in, and a histogram of the mean acceptance rates of the chains after burn-in. The lines on the left show the median, and the shaded regions indicate the 50%, 75%, 95% and 99% quantiles. The figure suggests concentration, but reveals that the adaptation has not fully converged yet. This is also visible in the mean acceptance rate over all realisations, which is $0.17$ for the simple cut-off and $0.12$ for the Gaussian cut-off (see Figure 7 in the Supplement). Table 2 shows root mean square errors for target tolerance $\epsilon=0.1$, both for abc-mcmc($\delta$) with $\delta$ fixed as above, and for the tolerance adaptive algorithm. Here, only the adaptive chains with final tolerance $\geq 0.1$ were included (9,998 and 9,993 out of 10,000 chains for $\phi_{\mathrm{simple}}$ and $\phi_{\mathrm{Gauss}}$, respectively). Tolerance adaptation (started from the prior distribution) appears to be competitive with ‘optimally’ tuned fixed tolerance abc-mcmc($\delta$).

Figure 2. Progress of tolerance adaptation (left) and histogram of acceptance rates (right) in the Gaussian model experiment with simple cut-off.
Table 2. Root mean square errors $(\times 10^{-2})$ from abc-mcmc($\delta$) for tolerance $\epsilon=0.1$, with fixed tolerance and with the adaptive algorithm, in the Gaussian model.

| $f$ | $\phi_{\mathrm{simple}}$: $\delta$=0.1 | 0.82 | 1.55 | 2.28 | 3.0 | Adapt ($\delta$=0.64) | $\phi_{\mathrm{Gauss}}$: $\delta$=0.1 | 0.82 | 1.55 | 2.28 | 3.0 | Adapt ($\delta$=0.28) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $x$ | 9.75 | 8.95 | 9.29 | 9.65 | 10.3 | 9.15 | 7.97 | 7.12 | 7.82 | 8.94 | 9.93 | 7.08 |
| $\lvert x\rvert$ | 5.49 | 5.35 | 5.51 | 5.81 | 6.24 | 5.38 | 4.47 | 4.22 | 4.68 | 5.26 | 5.95 | 4.15 |

6.2. Lotka-Volterra model

Our second experiment is a Lotka-Volterra model suggested by Boys et al. (2008), which was considered in the approximate Bayesian computation context by Fearnhead and Prangle (2012). The model is a Markov process $(X_{t},Y_{t})_{t\geq 0}$ of counts, corresponding to a reaction network $X\to 2X$ with rate $\theta_{1}$, $X+Y\to 2Y$ with rate $\theta_{2}$ and $Y\to\emptyset$ with rate $\theta_{3}$. The reaction log-rates $(\log\theta_{1},\log\theta_{2},\log\theta_{3})^{\mathrm{\scriptscriptstyle T}}$ are the parameters, which we equip with a uniform prior, $(\log\theta_{1},\log\theta_{2},\log\theta_{3})^{\mathrm{\scriptscriptstyle T}}\sim U([-6,0]^{3})$. The data is a simulated trajectory from the model with $\theta=(0.5,0.0025,0.3)^{\mathrm{\scriptscriptstyle T}}$ until time $40$. The inference is based on the Euclidean distances of five-dimensional summary statistics of the process observed every 5 time units ($\tilde{X}_{k}=X_{5k}$ and $\tilde{Y}_{k}=Y_{5k}$). The summary statistics are the sample autocorrelation of $(\tilde{X}_{k})$ at lag 2 multiplied by 100, and the 10% and 90% quantiles of $(\tilde{X}_{k})$ and $(\tilde{Y}_{k})$. The observed summary statistics are $(-51.07,29,304,65,404)^{\mathrm{\scriptscriptstyle T}}$.
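The model can be simulated with the standard Gillespie algorithm; the Julia sketch below outlines such a simulator and the five summary statistics described above. The function names and the initial counts are illustrative assumptions (they are not specified in this section), and `θ` denotes the reaction rates $(\theta_{1},\theta_{2},\theta_{3})$ (not their logarithms).

```julia
using Random, Statistics

# Gillespie simulation of the Lotka-Volterra reaction network
# X → 2X (rate θ₁X), X+Y → 2Y (rate θ₂XY), Y → ∅ (rate θ₃Y),
# recording (X, Y) every Δt = 5 time units up to t_end = 40.
function simulate_lv(θ; x0 = 50, y0 = 100, t_end = 40.0, Δt = 5.0)
    x, y, t = x0, y0, 0.0
    obs_times = Δt:Δt:t_end
    X̃ = Int[]; Ỹ = Int[]
    i = 1
    while i <= length(obs_times)
        rates = (θ[1]*x, θ[2]*x*y, θ[3]*y)
        total = sum(rates)
        τ = total > 0 ? randexp() / total : Inf        # time to the next reaction
        while i <= length(obs_times) && t + τ > obs_times[i]
            push!(X̃, x); push!(Ỹ, y); i += 1          # record state at observation times
        end
        total == 0 && continue                          # process extinct: only record
        t += τ
        u = rand() * total
        if u < rates[1]
            x += 1                                      # prey birth
        elseif u < rates[1] + rates[2]
            x -= 1; y += 1                              # predation
        else
            y -= 1                                      # predator death
        end
    end
    return X̃, Ỹ
end

# Five summary statistics: 100 × lag-2 sample autocorrelation of X̃, and the
# 10% and 90% quantiles of X̃ and Ỹ.
function lv_summaries(X̃, Ỹ)
    x̄ = mean(X̃); n = length(X̃)
    ac2 = sum((X̃[1:n-2] .- x̄) .* (X̃[3:n] .- x̄)) / sum((X̃ .- x̄).^2)
    return [100*ac2, quantile(X̃, 0.1), quantile(X̃, 0.9), quantile(Ỹ, 0.1), quantile(Ỹ, 0.9)]
end
```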

We first run comparisons similar to those in Section 6.1, but now with 1,000 independent abc-mcmc($\delta$) chains with the simple cut-off. We investigate the effect of post-correction, with 20,000 samples, including 10,000 burn-in, for each chain. All chains were started from near the posterior mode, from $(-0.55,-5.77,-1.09)^{\mathrm{\scriptscriptstyle T}}$. Figure 3

Figure 3. Lotka-Volterra model with simple cut-off. Estimates from a single run of abc-mcmc($200$) (left) and estimates from 1,000 replications of abc-mcmc($\delta$) with $\delta\in\{80,110,140,170,200\}$ indicated by colour.

shows comparisons similar to those of Section 6.1, and Figure 4

Figure 4. Lotka-Volterra model with Epanechnikov cut-off and regression correction. Estimates from a single run of abc-mcmc($200$) (left) and estimates from 1,000 replications of abc-mcmc($\delta$) with $\delta\in\{80,110,140,170,200\}$ indicated by colour.

shows results for the regression correction with the Epanechnikov cut-off $\phi_{\mathrm{Epa}}(t)=\max\{0,1-t^{2}\}$ (Beaumont et al., 2002). The results suggest that post-correction might provide slightly more accurate estimators, particularly with smaller tolerances. There is also some bias in abc-mcmc($\delta$) with smaller $\delta$, when compared to the ground truth calculated from an abc-mcmc($\delta$) chain of ten million iterations. Table 3 shows the coverages of the confidence intervals.

Table 3. Mean acceptance rates and frequencies of the 95% confidence intervals, from abc-mcmc($\delta$) to tolerances $\epsilon$, in the Lotka-Volterra model. The first five $\epsilon$ columns are for $f(\theta)=\theta_{1}$, the next five for $f(\theta)=\theta_{2}$, and the last five for $f(\theta)=\theta_{3}$.

| Cut-off | $\delta$ | $\epsilon$=80 | 110 | 140 | 170 | 200 | 80 | 110 | 140 | 170 | 200 | 80 | 110 | 140 | 170 | 200 | Acc. rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\phi_{\mathrm{simple}}$ | 80 | 0.8 | | | | | 0.73 | | | | | 0.74 | | | | | 0.05 |
| | 110 | 0.97 | 0.93 | | | | 0.94 | 0.89 | | | | 0.94 | 0.9 | | | | 0.07 |
| | 140 | 0.99 | 0.97 | 0.93 | | | 0.98 | 0.96 | 0.92 | | | 0.98 | 0.96 | 0.94 | | | 0.1 |
| | 170 | 0.99 | 0.98 | 0.96 | 0.93 | | 0.98 | 0.97 | 0.96 | 0.93 | | 0.99 | 0.98 | 0.96 | 0.95 | | 0.14 |
| | 200 | 1.0 | 0.99 | 0.98 | 0.97 | 0.94 | 0.99 | 0.99 | 0.98 | 0.97 | 0.92 | 0.99 | 0.98 | 0.98 | 0.96 | 0.94 | 0.17 |
| regr. $\phi_{\mathrm{Epa}}$ | 80 | 0.75 | | | | | 0.76 | | | | | 0.68 | | | | | 0.05 |
| | 110 | 0.92 | 0.92 | | | | 0.93 | 0.94 | | | | 0.87 | 0.91 | | | | 0.07 |
| | 140 | 0.93 | 0.94 | 0.94 | | | 0.94 | 0.96 | 0.97 | | | 0.9 | 0.92 | 0.94 | | | 0.1 |
| | 170 | 0.93 | 0.95 | 0.95 | 0.95 | | 0.96 | 0.97 | 0.97 | 0.98 | | 0.92 | 0.94 | 0.94 | 0.95 | | 0.14 |
| | 200 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.95 | 0.96 | 0.95 | 0.96 | 0.96 | 0.17 |

In addition, we experiment with the tolerance adaptation, again using 20,000 samples, out of which 10,000 are burn-in. Figure 5

Figure 5. Progress of tolerance adaptation (left) and histogram of acceptance rates (right) in the Lotka-Volterra experiment.

shows the progress of the $\log$-tolerance during the burn-in, and a histogram of the realised mean acceptance rates during the estimation phase. The realised acceptance rates are concentrated around the mean $0.10$. Table 4

Table 4. Root mean square errors of estimators from abc-mcmc($\delta$) for tolerance $\epsilon=80$, with fixed tolerance and with adaptive tolerance, in the Lotka-Volterra model.

| | Post-correction, simple cut-off: $\delta$=80 | 110 | 140 | 170 | 200 | Adapt ($\delta$=122.6) | Regression, $\phi_{\mathrm{Epa}}$: $\delta$=80 | 110 | 140 | 170 | 200 | Adapt ($\delta$=122.6) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\theta_{1}$ $(\times 10^{-2})$ | 2.37 | 1.81 | 1.75 | 1.83 | 1.93 | 1.8 | 3.1 | 2.74 | 3.02 | 3.09 | 3.19 | 2.57 |
| $\theta_{2}$ $(\times 10^{-4})$ | 1.32 | 0.99 | 0.93 | 0.96 | 1.06 | 1.04 | 1.52 | 1.39 | 1.54 | 1.61 | 1.63 | 1.28 |
| $\theta_{3}$ $(\times 10^{-2})$ | 2.94 | 2.26 | 2.11 | 2.14 | 2.37 | 2.34 | 2.77 | 2.53 | 2.76 | 2.85 | 2.91 | 2.34 |

shows the root mean square errors of the estimators from abc-mcmc($\delta$) for $\epsilon=80$, for fixed tolerance and with tolerance adaptation. Only the adaptive chains with final tolerance $\geq 80.0$ were included (999 out of 1,000 chains).

In this case, the chains run with the tolerance adaptation led to better results than those run only with the covariance adaptation (and fixed tolerance). This perhaps surprising result may be due to the initial behaviour of the covariance adaptation, which may be unstable when there are many rejections. Different initialisation strategies, for instance following (Haario et al., 2001, Remark 2), might lead to more stable behaviour compared to using the adaptation of Andrieu and Moulines (2006) from the start, as we do. The different step size sequences ($n^{-1}$ and $n^{-2/3}$) could also play a rôle. We repeated the experiment for the chains with fixed tolerances, but now with covariance adaptation step size $n^{-2/3}$. This led to more accurate estimators for abc-mcmc($\delta$) with higher $\delta$, but worse behaviour with smaller $\delta$. In any case, also here, tolerance adaptation delivered competitive results (see Supplement F).

7. Discussion

We believe that approximate Bayesian computation inference with Markov chain Monte Carlo is a useful approach, when the chosen simulation tolerance allows for good mixing. Our confidence intervals for post-processing and automatic tuning of simulation tolerance may make this approach more appealing in practice.

A related approach by Bortot et al. (2007) makes the tolerance an auxiliary variable with a user-specified prior. This approach avoids explicit tolerance selection, but the inference is based on a pseudo-posterior $\check{\pi}(\theta,\delta)$ not directly related to $\pi_{\delta}(\theta)$ in (1). Bortot et al. (2007) also provide tolerance-dependent analysis, showing parameter means and variances with respect to conditional distributions of $\check{\pi}(\theta,\delta)$ given $\delta\leq\epsilon$. We believe that our approach, where the effect of the tolerance on the expectations with respect to $\pi_{\epsilon}$ can be investigated explicitly, can be more immediate to interpret. Our confidence interval only shows the Monte Carlo uncertainty related to the posterior mean, and we are currently investigating how the overall parameter uncertainty could be summarised in a useful manner.

The convergence rates of approximate Bayesian computation have been investigated by Barber et al. (2015) in terms of cost and bias with respect to the true posterior, and recently by Li and Fearnhead (2018a, b) in the large data limit, the latter in the context of regression. It would be interesting to consider extensions of these results in the Markov chain Monte Carlo context. In fact, Li and Fearnhead (2018a) already suggest that the acceptance rate must be lower bounded, which is in line with our adaptation rule.

Automatic selection of tolerance has been considered earlier in Ratmann et al. (2007), who propose an algorithm based on tempering and a cooling schedule. Based on our experiments, the tolerance adaptation we present in this paper appears to perform well in practice, and provides reliable results with post-correction. For the adaptation to work efficiently, the Markov chains must be taken relatively long, rendering the approach difficult for the most computationally demanding models.

We conclude with a brief discussion of certain extensions of the suggested post-correction method; more details are given in Supplement E. First, in case of a non-simple cut-off, the rejected samples may be ‘recycled’ by using the acceptance probability as a weight (Ceperley et al., 1977). The accuracy of the post-corrected estimator could be enhanced for smaller values of $\epsilon$ by performing further independent simulations from $g(\,\cdot\,\mid\Theta_{k})$ (which may be calculated in parallel). The estimator is rather straightforward, but requires some care because the estimators of the pseudo-likelihood take the value zero. The latter extension, which involves additional simulations as post-processing, is similar to the ‘lazy’ version of Prangle (2016, 2015) incorporating a randomised stopping rule for simulation, and to the debiased ‘exact’ approach of Tran and Kohn (2015), which may lead to estimators which get rid of the $\epsilon$-bias entirely.

8. Acknowledgements

This work was supported by Academy of Finland (grants 274740, 284513 and 312605). The authors wish to acknowledge CSC, IT Center for Science, Finland, for computational resources, and thank Christophe Andrieu for useful discussions.

References

  • Andrieu and Moulines (2006) C. Andrieu and É. Moulines. On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab., 16(3):1462–1505, 2006.
  • Andrieu and Thoms (2008) C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Statist. Comput., 18(4):343–373, Dec. 2008.
  • Andrieu et al. (2005) C. Andrieu, É. Moulines, and P. Priouret. Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim., 44(1):283–312, 2005.
  • Andrieu et al. (2018) C. Andrieu, A. Lee, and M. Vihola. Theoretical and methodological aspects of MCMC computations with noisy likelihoods. In S. A. Sisson, Y. Fan, and M. Beaumont, editors, Handbook of Approximate Bayesian Computation. Chapman & Hall/CRC Press, 2018.
  • Barber et al. (2015) S. Barber, J. Voss, and M. Webster. The rate of convergence for approximate Bayesian computation. Electron. J. Statist., 9(1):80–105, 2015.
  • Beaumont et al. (2002) M. Beaumont, W. Zhang, and D. Balding. Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035, 2002.
  • Bezanson et al. (2017) J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah. Julia: A fresh approach to numerical computing. SIAM review, 59(1):65–98, 2017.
  • Bornn et al. (2017) L. Bornn, N. Pillai, A. Smith, and D. Woodard. The use of a single pseudo-sample in approximate Bayesian computation. Statist. Comput., 27(3):583–590, 2017.
  • Bortot et al. (2007) P. Bortot, S. Coles, and S. Sisson. Inference for stereological extremes. J. Amer. Statist. Assoc., 102(477):84–92, 2007.
  • Boys et al. (2008) R. J. Boys, D. J. Wilkinson, and T. B. Kirkwood. Bayesian inference for a discretely observed stochastic kinetic model. Stat. Comput., 18(2):125–135, 2008.
  • Burkholder et al. (1972) D. Burkholder, B. Davis, and R. Gundy. Integral inequalities for convex functions of operators on martingales. In Proc. Sixth Berkeley Symp. Math. Statist. Prob, volume 2, pages 223–240, 1972.
  • Ceperley et al. (1977) D. Ceperley, G. Chester, and M. Kalos. Monte Carlo simulation of a many-fermion study. Phys. Rev. D, 16(7):3081, 1977.
  • Delmas and Jourdain (2009) J.-F. Delmas and B. Jourdain. Does waste recycling really improve the multi-proposal Metropolis–Hastings algorithm? an analysis based on control variates. J. Appl. Probab., 46(4):938–959, 2009.
  • Doucet et al. (2015) A. Doucet, M. Pitt, G. Deligiannidis, and R. Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 102(2):295–313, 2015.
  • Fearnhead and Prangle (2012) P. Fearnhead and D. Prangle. Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J. R. Stat. Soc. Ser. B Stat. Methodol., 74(3):419–474, 2012.
  • Flegal and Jones (2010) J. M. Flegal and G. L. Jones. Batch means and spectral variance estimators in Markov chain Monte Carlo. Ann. Statist., 38(2):1034–1070, 2010.
  • Franks and Vihola (2017) J. Franks and M. Vihola. Importance sampling correction versus standard averages of reversible MCMCs in terms of the asymptotic variance. Preprint arXiv:1706.09873v3, 2017.
  • Geyer (1992) C. J. Geyer. Practical Markov chain Monte Carlo. Statist. Sci., pages 473–483, 1992.
  • Haario et al. (2001) H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242, 2001.
  • Li and Fearnhead (2018a) W. Li and P. Fearnhead. On the asymptotic efficiency of approximate Bayesian computation estimators. Biometrika, 105(2):285–299, 2018a.
  • Li and Fearnhead (2018b) W. Li and P. Fearnhead. Convergence of regression-adjusted approximate Bayesian computation. Biometrika, 105(2):301–318, 2018b.
  • Marin et al. (2012) J.-M. Marin, P. Pudlo, C. P. Robert, and R. J. Ryder. Approximate Bayesian computational methods. Statist. Comput., 22(6):1167–1180, 2012.
  • Marjoram et al. (2003) P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré. Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA, 100(26):15324–15328, 2003.
  • Meyn and Tweedie (2009) S. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Cambridge University Press, 2nd edition, 2009. ISBN 978-0-521-73182-9.
  • Prangle (2015) D. Prangle. Lazier ABC. Preprint arXiv:1501.05144, 2015.
  • Prangle (2016) D. Prangle. Lazy ABC. Statist. Comput., 26(1-2):171–185, 2016.
  • Ratmann et al. (2007) O. Ratmann, O. Jørgensen, T. Hinkley, M. Stumpf, S. Richardson, and C. Wiuf. Using likelihood-free inference to compare evolutionary dynamics of the protein networks of H. pylori and P. falciparum. PLoS Comput. Biol., 3(11):e230, 2007.
  • Raynal et al. (to appear) L. Raynal, J.-M. Marin, P. Pudlo, M. Ribatet, C. P. Robert, and A. Estoup. ABC random forests for Bayesian parameter inference. Bioinformatics, to appear.
  • Roberts et al. (1997) G. Roberts, A. Gelman, and W. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab., 7(1):110–120, 1997.
  • Roberts and Rosenthal (2006) G. O. Roberts and J. S. Rosenthal. Harris recurrence of Metropolis-within-Gibbs and trans-dimensional Markov chains. Ann. Appl. Probab., 16(4):2123–2139, 2006.
  • Rudolf and Sprungk (2018) D. Rudolf and B. Sprungk. On a Metropolis-Hastings importance sampling estimator. Preprint arXiv:1805.07174, 2018.
  • Schuster and Klebanov (2018) I. Schuster and I. Klebanov. Markov chain importance sampling - a highly efficient estimator for MCMC. Preprint arXiv:1805.07179, 2018.
  • Sherlock et al. (2015) C. Sherlock, A. H. Thiery, G. O. Roberts, and J. S. Rosenthal. On the efficiency of pseudo-marginal random walk Metropolis algorithms. Ann. Statist., 43(1):238–275, 2015.
  • Sisson and Fan (2018) S. Sisson and Y. Fan. ABC samplers. In S. Sisson, Y. Fan, and M. Beaumont, editors, Handbook of Markov chain Monte Carlo. Chapman & Hall/CRC Press, 2018.
  • Sokal (1996) A. D. Sokal. Monte Carlo methods in statistical mechanics: Foundations and new algorithms. Lecture notes, 1996.
  • Sunnåker et al. (2013) M. Sunnåker, A. G. Busetto, E. Numminen, J. Corander, M. Foll, and C. Dessimoz. Approximate Bayesian computation. PLoS computational biology, 9(1):e1002803, 2013.
  • Tanaka et al. (2006) M. Tanaka, A. Francis, F. Luciani, and S. Sisson. Using approximate Bayesian computation to estimate tuberculosis transmission parameters from genotype data. Genetics, 173(3):1511–1520, 2006.
  • Tran and Kohn (2015) M. N. Tran and R. Kohn. Exact ABC using importance sampling. Preprint arXiv:1509.08076, 2015.
  • Vihola (2012) M. Vihola. Robust adaptive Metropolis algorithm with coerced acceptance rate. Statist. Comput., 22(5):997–1008, 2012.
  • Vihola et al. (2016) M. Vihola, J. Helske, and J. Franks. Importance sampling type estimators based on approximate marginal MCMC. Preprint arXiv:1609.02541v5, 2016.
  • Wegmann et al. (2009) D. Wegmann, C. Leuenberger, and L. Excoffier. Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihoods. Genetics, 182(4):1207–1218, 2009.

Appendix

The following algorithm shows that in case of the simple (post-correction) cut-off, $E_{\delta,\epsilon}(f)$ and $S_{\delta,\epsilon}(f)$ may be calculated simultaneously for all tolerances efficiently:

Algorithm 11.

Suppose $\phi=\phi_{\mathrm{simple}}$ and $(\Theta_{k},T_{k})_{k=1,\ldots,n}$ is the output of abc-mcmc($\delta$).

  (i) Sort $(\Theta_{k},T_{k})_{k=1,\ldots,n}$ with respect to $T_{k}$:

    • Find indices $I_{1},\ldots,I_{n}$ such that $T_{I_{k}}\leq T_{I_{k+1}}$ for all $k=1,\ldots,n-1$.

    • Denote $(\hat{\Theta}_{k},\hat{T}_{k})\leftarrow(\Theta_{I_{k}},T_{I_{k}})$.

  (ii) For all unique values $\epsilon\in\{\hat{T}_{1},\ldots,\hat{T}_{n}\}$, let $m_{\epsilon}=\max\{k\geq 1\,:\,\hat{T}_{k}\leq\epsilon\}$, and define

  $E_{\delta,\epsilon}(f)=m_{\epsilon}^{-1}\textstyle\sum_{k=1}^{m_{\epsilon}}f(\hat{\Theta}_{k}),\qquad\text{and}\qquad S_{\delta,\epsilon}(f)=m_{\epsilon}^{-2}\textstyle\sum_{k=1}^{m_{\epsilon}}\big[f(\hat{\Theta}_{k})-E_{\delta,\epsilon}(f)\big]^{2}.$

  (For $\hat{T}_{k}<\epsilon<\hat{T}_{k+1}$, let $E_{\delta,\epsilon}(f)=E_{\delta,\hat{T}_{k}}(f)$ and $S_{\delta,\epsilon}(f)=S_{\delta,\hat{T}_{k}}(f)$.)

The sorting in Algorithm 11(i) may be performed in $O(n\log n)$ time, and $E_{\delta,\epsilon}(f)$ and $S_{\delta,\epsilon}(f)$ may all be calculated in $O(n)$ time by forming appropriate cumulative sums.
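A minimal Julia sketch of Algorithm 11 is given below; it returns the sorted distances $\hat{T}_{k}$ together with $E_{\delta,\hat{T}_{k}}(f)$ and $S_{\delta,\hat{T}_{k}}(f)$ computed via cumulative sums (the function name and output format are illustrative).

```julia
# Algorithm 11: post-correction estimates E_{δ,ε}(f) and S_{δ,ε}(f) for all
# tolerances ε ∈ {T̂_1, ..., T̂_n} at once, for the simple cut-off.
function post_correct_all(f, Θs, Ts)
    idx = sortperm(Ts)                       # step (i): sort by distance
    T̂ = Ts[idx]
    fs = f.(Θs[idx])
    m = 1:length(fs)
    E = cumsum(fs) ./ m                      # E_{δ,T̂_k}(f): running means
    # S_{δ,T̂_k}(f) = m⁻² Σ_{j≤k} (f(Θ̂_j) - E)² = m⁻² (Σ_{j≤k} f(Θ̂_j)² - m E²)
    S = (cumsum(fs.^2) .- m .* E.^2) ./ m.^2
    return T̂, E, S
end
```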

Proof of Theorem 4.

Algorithm 1 is a Metropolis–Hastings algorithm with compound proposal $\tilde{q}(\theta,y;\theta^{\prime},y^{\prime})=q(\theta,\theta^{\prime})g(y^{\prime}\mid\theta^{\prime})$ and with target $\tilde{\pi}_{\epsilon}(\theta,y)\propto\mathrm{pr}(\theta)g(y\mid\theta)\phi\big(d(y,y^{*})/\epsilon\big)$. The chain $(\Theta_{k},Y_{k})_{k\geq 1}$ is Harris-recurrent, as a full-dimensional Metropolis–Hastings chain which is $\varphi$-irreducible (Roberts and Rosenthal, 2006). Because $\phi$ is monotone and $\epsilon\leq\delta$, we have $\phi\big(d(y,y^{*})/\delta\big)\geq\phi\big(d(y,y^{*})/\epsilon\big)$, and therefore $\tilde{\pi}_{\epsilon}$ is absolutely continuous with respect to $\tilde{\pi}_{\delta}$, and $w_{\delta,\epsilon}(y)=c_{\delta,\epsilon}\tilde{\pi}_{\epsilon}(\theta,y)/\tilde{\pi}_{\delta}(\theta,y)$, where $c_{\delta,\epsilon}>0$ is a constant. If we denote $\xi_{k}(f)=U_{k}^{(\delta,\epsilon)}f(\Theta_{k})$ and $\xi_{k}(\mathbf{1})=U_{k}^{(\delta,\epsilon)}=w_{\delta,\epsilon}(Y_{k})$, then $E_{\delta,\epsilon}^{(n)}(f)=\sum_{k=1}^{n}\xi_{k}(f)/\sum_{j=1}^{n}\xi_{j}(\mathbf{1})\to\mathbb{E}_{\tilde{\pi}_{\epsilon}}[f(\Theta)]$ almost surely by Harris recurrence and $\tilde{\pi}_{\epsilon}$ invariance (e.g. Vihola et al., 2016). The claim (i) follows because $\pi_{\epsilon}$ is the marginal density of $\tilde{\pi}_{\epsilon}$.

The chain $(\Theta_{k},Y_{k})_{k\geq 1}$ is reversible, so (ii) follows by (Vihola et al., 2016, Theorem 7(i)), because $m_{f}^{(2)}(\theta,y)=w_{\delta,\epsilon}^{2}(y)f^{2}(\theta)$ satisfies

$\mathbb{E}_{\tilde{\pi}_{\delta}}[m_{f}^{(2)}(\Theta,Y)]=c_{\delta,\epsilon}\mathbb{E}_{\tilde{\pi}_{\epsilon}}[w_{\delta,\epsilon}(Y)f^{2}(\Theta)]\leq c_{\delta,\epsilon}\mathbb{E}_{\pi_{\epsilon}}[f^{2}(\Theta)]<\infty,$

and because the asymptotic variance of the function $h_{\delta,\epsilon}$ with respect to $(\Theta_{k},Y_{k})_{k\geq 1}$ may be expressed as $\mathrm{var}_{\tilde{\pi}_{\delta}}\big(h_{\delta,\epsilon}(\Theta,Y)\big)\tau_{\delta,\epsilon}(f)$, so $v_{\delta,\epsilon}(f)=\mathrm{var}_{\tilde{\pi}_{\delta}}\big(h_{\delta,\epsilon}(\Theta,Y)\big)/c_{\delta,\epsilon}^{2}$. The convergence $nS_{\delta,\epsilon}^{(n)}(f)\to v_{\delta,\epsilon}(f)$ follows from (Vihola et al., 2016, Theorem 9). ∎

Proof of Theorem 6.

The invariant distribution of abc-mcmc($\delta$) may be written as $\tilde{\pi}_{\delta}(\theta,y)=\pi_{\delta}(\theta)\bar{g}_{\delta}(y\mid\theta)$ where $\bar{g}_{\delta}(y\mid\theta)=g(y\mid\theta)1\left(d(y,y^{*})\leq\delta\right)/L_{\delta}(\theta)$, and note that $\int\bar{g}_{\delta}(y\mid\theta)w_{\delta,\epsilon}^{p}(y)\mathrm{d}y=\bar{w}_{\delta,\epsilon}(\theta)$ for $p\in\{1,2\}$. Consequently, $\tilde{\pi}_{\delta}(h_{\delta,\epsilon})=\pi_{\delta}(f_{\delta,\epsilon})$ and $\tilde{\pi}_{\delta}(h_{\delta,\epsilon}^{2})=\pi_{\delta}(f^{2}\bar{w}_{\delta,\epsilon})$, so $\mathrm{var}_{\tilde{\pi}_{\delta}}(h_{\delta,\epsilon})=\mathrm{var}_{\pi_{\delta}}(f_{\delta,\epsilon})+\pi_{\delta}\big(\bar{w}_{\delta,\epsilon}(1-\bar{w}_{\delta,\epsilon})f^{2}\big)$. Hereafter, let $a_{\delta,\epsilon}=\big(\mathrm{var}_{\tilde{\pi}_{\delta}}(h_{\delta,\epsilon})\big)^{-1/2}$ and denote $\tilde{h}_{\delta,\epsilon}=a_{\delta,\epsilon}h_{\delta,\epsilon}$ and $\tilde{f}_{\delta,\epsilon}=a_{\delta,\epsilon}f_{\delta,\epsilon}$. Clearly, $\mathrm{var}_{\tilde{\pi}_{\delta}}(\tilde{h}_{\delta,\epsilon})=1$ and

$\rho_{k}^{(\delta,\epsilon)}=e_{k}^{(\delta,\epsilon)}-\big(\pi_{\delta}(\tilde{f}_{\delta,\epsilon})\big)^{2},\qquad\text{where}\qquad e_{k}^{(\delta,\epsilon)}=\mathbb{E}\big[\tilde{h}_{\delta,\epsilon}(\Theta_{0}^{(s)},Y_{0}^{(s)})\tilde{h}_{\delta,\epsilon}(\Theta_{k}^{(s)},Y_{k}^{(s)})\big].$

Note that with ϕ=ϕsimple\phi=\phi_{\mathrm{simple}}, the acceptance ratio is αδ(θ,y;θ^,y^)=α˙(θ,θ^)1(d(y^,y∗)≤δ)\alpha_{\delta}(\theta,y;\hat{\theta},\hat{y})=\dot{\alpha}(\theta,\hat{\theta})1\left(d(\hat{y},y^{*})\leq\delta\right), where α˙(θ,θ^)=min{1,pr(θ^)q(θ^,θ)/(pr(θ)q(θ,θ^))},\dot{\alpha}(\theta,\hat{\theta})=\min\big{\{}1,\mathrm{pr}(\hat{\theta})q(\hat{\theta},\theta)\big{/}\big{(}\mathrm{pr}(\theta)q(\theta,\hat{\theta})\big{)}\big{\}}, which is independent of yy, so (Θk(s))(\Theta_{k}^{(s)}) is marginally a Metropolis–Hastings type chain, with proposal qq and acceptance probability α˙(θ,θ^)Lδ(θ^)\dot{\alpha}(\theta,\hat{\theta})L_{\delta}(\hat{\theta}), and

𝔼[h~δ,ϵ(Θ1(s),Y1(s))|(Θ0(s),Y0(s))=(θ,y)]rδ(θ)h~δ,ϵ(θ,y)\displaystyle\mathbb{E}\big{[}\tilde{h}_{\delta,\epsilon}(\Theta_{1}^{(s)},Y_{1}^{(s)})\;\big{|}\;(\Theta_{0}^{(s)},Y_{0}^{(s)})=(\theta,y)\big{]}-r_{\delta}(\theta)\tilde{h}_{\delta,\epsilon}(\theta,y)
=aδ,ϵq(θ,θ^)α˙(θ,θ^)g(y^θ^)wδ,ϵ(y^)f(θ^)dθ^dy^=q(θ,θ^)α˙(θ,θ^)Lδ(θ^)f~δ,ϵ(θ^)dθ^.\displaystyle=a_{\delta,\epsilon}\int q(\theta,\hat{\theta})\dot{\alpha}(\theta,\hat{\theta})g(\hat{y}\mid\hat{\theta})w_{\delta,\epsilon}(\hat{y})f(\hat{\theta})\mathrm{d}\hat{\theta}\mathrm{d}\hat{y}=\int q(\theta,\hat{\theta})\dot{\alpha}(\theta,\hat{\theta})L_{\delta}(\hat{\theta})\tilde{f}_{\delta,\epsilon}(\hat{\theta})\mathrm{d}\hat{\theta}.

Using this iteratively, we obtain that

ek(δ,ϵ)=𝔼[f~δ,ϵ(Θ0(s))f~δ,ϵ(Θk(s))]+π~δ(θ,y)[h~δ,ϵ2(θ,y)f~δ,ϵ2(θ)]rδk(θ)dθdy,\textstyle e_{k}^{(\delta,\epsilon)}=\mathbb{E}\big{[}\tilde{f}_{\delta,\epsilon}(\Theta_{0}^{(s)})\tilde{f}_{\delta,\epsilon}(\Theta_{k}^{(s)})\big{]}+\int\tilde{\pi}_{\delta}(\theta,y)\big{[}\tilde{h}_{\delta,\epsilon}^{2}(\theta,y)-\tilde{f}_{\delta,\epsilon}^{2}(\theta)\big{]}r_{\delta}^{k}(\theta)\mathrm{d}\theta\mathrm{d}y,

and therefore with γk(δ,ϵ)=aδ,ϵ2cov(fδ,ϵ(Θ0(s)),fδ,ϵ(Θk(s)))\gamma_{k}^{(\delta,\epsilon)}=a_{\delta,\epsilon}^{2}\mathrm{cov}\big{(}f_{\delta,\epsilon}(\Theta_{0}^{(s)}),f_{\delta,\epsilon}(\Theta_{k}^{(s)})\big{)},

k1ρk(δ,ϵ)=k1γk(δ,ϵ)+aδ,ϵ2πδ(θ)w¯δ,ϵ(θ)(1w¯δ,ϵ(θ))rδ(θ)(1rδ(θ))1f2(θ)dθ.\textstyle\sum_{k\geq 1}\rho_{k}^{(\delta,\epsilon)}=\sum_{k\geq 1}\gamma_{k}^{(\delta,\epsilon)}+a_{\delta,\epsilon}^{2}\int\pi_{\delta}(\theta)\bar{w}_{\delta,\epsilon}(\theta)\big{(}1-\bar{w}_{\delta,\epsilon}(\theta)\big{)}r_{\delta}(\theta)(1-r_{\delta}(\theta))^{-1}f^{2}(\theta)\mathrm{d}\theta.

We conclude by noticing that 2k1γk(δ,ϵ)=aδ,ϵ2varπδ(fδ,ϵ)(τˇδ,ϵ(f)1)2\sum_{k\geq 1}\gamma_{k}^{(\delta,\epsilon)}=a_{\delta,\epsilon}^{2}\mathrm{var}_{\pi_{\delta}}(f_{\delta,\epsilon})(\check{\tau}_{\delta,\epsilon}(f)-1). ∎

Supplement B Convergence of the tolerance adaptive ABC-MCMC under generalised conditions

This section details a convergence theorem, under weaker assumptions than those of Theorem 10, for the tolerance adaptation (Algorithm 8) of Section 4.

For convenience, we denote the distance distribution here as TQθ()T\sim Q_{\theta}(\,\cdot\,), where T=d(Y,y)T=d(Y,y^{*}) for Yg(|θ)Y\sim g(\,\cdot\,|\theta). With this notation, and re-indexing Θk=Θ~k+1\Theta_{k}^{\prime}=\tilde{\Theta}_{k+1}, we may rewrite Algorithm 8 as follows:

Algorithm 12.

Suppose Θ0∈𝖳\Theta_{0}\in\mathsf{T} is a starting value with pr(Θ0)>0\mathrm{pr}(\Theta_{0})>0. Initialise δ=T0∼QΘ0(⋅)\delta=T_{0}\sim Q_{\Theta_{0}}(\,\cdot\,). For k=0,…,nb−1k=0,\ldots,n_{b}-1, iterate the following steps (an illustrative code sketch of the procedure is given after the listing):

  1. (i)

    Draw Θk′∼q(Θk,⋅)\Theta_{k}^{\prime}\sim q(\Theta_{k},\,\cdot\,) and Tk′∼QΘk′(⋅)T_{k}^{\prime}\sim Q_{\Theta_{k}^{\prime}}(\,\cdot\,).

  2. (ii)

    Accept, by setting (Θk+1,Tk+1)(Θk,Tk)(\Theta_{k+1},T_{k+1})\leftarrow(\Theta_{k}^{\prime},T_{k}^{\prime}), with probability

    (2) αδk(Θk,Tk;Θk,Tk)=min{1,pr(Θk)q(Θk,Θk)ϕ(Tk/δk)pr(Θk)q(Θk,Θk)ϕ(Tk/δk)}\alpha_{\delta_{k}}^{\prime}(\Theta_{k},T_{k};\Theta_{k}^{\prime},T_{k}^{\prime})=\min\bigg{\{}1,\frac{\mathrm{pr}(\Theta_{k}^{\prime})q(\Theta_{k}^{\prime},\Theta_{k})\phi(T_{k}^{\prime}/\delta_{k})}{\mathrm{pr}(\Theta_{k})q(\Theta_{k},\Theta_{k}^{\prime})\phi(T_{k}/\delta_{k})}\bigg{\}}

    and otherwise reject, by setting (Θk+1,Tk+1)(Θk,Tk)(\Theta_{k+1},T_{k+1})\leftarrow(\Theta_{k},T_{k}).

  3. (iii)

    logδk+1logδk+γk+1(ααδk(Θk,Θk,Tk))\log\delta_{k+1}\leftarrow\log\delta_{k}+\gamma_{k+1}\big{(}\alpha^{*}-\alpha_{\delta_{k}}^{\prime}(\Theta_{k},\Theta_{k}^{\prime},T_{k}^{\prime})\big{)}.

Let us set β=logδ\beta=\log\delta, and consider the proposal-rejection Markov kernel

(3) P˙β(θ,dϑ)=q(θ,dϑ)αβ(θ,ϑ)+(1q(θ,dϑ)αβ(θ,ϑ))1(θdϑ),\dot{P}_{\beta}(\theta,\mathrm{d}\vartheta)=q(\theta,\mathrm{d}\vartheta)\alpha_{\beta}(\theta,\vartheta)+\bigg{(}1-\int q(\theta,\mathrm{d}\vartheta)\alpha_{\beta}(\theta,\vartheta)\bigg{)}1\left(\theta\in\mathrm{d}\vartheta\right),

where αβ(θ,ϑ)=α˙(θ,ϑ)Lβ(ϑ),\alpha_{\beta}(\theta,\vartheta)=\dot{\alpha}(\theta,\vartheta)L_{\beta}(\vartheta),

α˙(θ,ϑ)=min{1,pr(ϑ)q(ϑ,θ)pr(θ)q(θ,ϑ)},andLβ(ϑ)=Qϑ(dt)1(teβ).\dot{\alpha}(\theta,\vartheta)=\min\bigg{\{}1,\frac{\mathrm{pr}(\vartheta)q(\vartheta,\theta)}{\mathrm{pr}(\theta)q(\theta,\vartheta)}\bigg{\}},\qquad\text{and}\qquad L_{\beta}(\vartheta)=\int Q_{\vartheta}(\mathrm{d}t)1\left(t\leq e^{\beta}\right).

Then, with the simple cut-off, P˙βk\dot{P}_{\beta_{k}} is the transition of the θ\theta-coordinate chain of Algorithm 12 at iteration kk, obtained by disregarding the tt-coordinate. It is easily seen to be reversible with respect to the (pseudo-)posterior πβ(θ)∝pr(θ)Lβ(θ)\pi_{\beta}(\theta)\propto\mathrm{pr}(\theta)L_{\beta}(\theta) given in (1), written here in terms of β=logδ\beta=\log\delta instead of δ\delta.

Assumption 13.

Suppose ϕ=ϕsimple\phi=\phi_{\mathrm{simple}} and the following hold:

  1. (i)

    Step sizes (γk)k1(\gamma_{k})_{k\geq 1} satisfy γk0\gamma_{k}\geq 0, γk+1γk\gamma_{k+1}\leq\gamma_{k},

    k1γk=,andk1γk2(1+|logγk|+|logγk|2)<.\sum_{k\geq 1}\gamma_{k}=\infty,\qquad\text{and}\qquad\sum_{k\geq 1}\gamma_{k}^{2}\Big{(}1+\lvert\log\gamma_{k}\rvert+\lvert\log\gamma_{k}\rvert^{2}\Big{)}<\infty.
  2. (ii)

    The domain 𝖳nθ\mathsf{T}\subset\mathbb{R}^{n_{\theta}}, nθ1n_{\theta}\geq 1, is a nonempty open set.

  3. (iii)

    pr()\mathrm{pr}(\,\cdot\,) and q(θ,)q(\theta,\,\cdot\,) are uniformly bounded densities on nθ\mathbb{R}^{n_{\theta}} (i.e. C>0\exists C>0 s.t. q(θ,ϑ)<Cq(\theta,\vartheta)<C and pr(θ)<C\mathrm{pr}(\theta)<C for all θ,ϑnθ\theta,\,\vartheta\in\mathbb{R}^{n_{\theta}}), and pr(θ)=0\mathrm{pr}(\theta)=0 for θ𝖳\theta\notin\mathsf{T}.

  4. (iv)

    Qθ(dt)Q_{\theta}(\mathrm{d}t) admits a uniformly bounded density Qθ(t)Q_{\theta}(t).

  5. (v)

    The values {βk}\{\beta_{k}\} remain in some compact subset 𝖡\mathsf{B}\subset\mathbb{R} almost surely.

  6. (vi)

    cβ>0c_{\beta}>0 for all β𝖡\beta\in\mathsf{B}, where cβ=pr(dθ)Lβ(θ)c_{\beta}=\int\mathrm{pr}(\mathrm{d}\theta)L_{\beta}(\theta).

  7. (vii)

    There exists V˙:𝖳[1,)\dot{V}:\mathsf{T}\to[1,\infty) such that the Markov transitions P˙β\dot{P}_{\beta} are simultaneously V˙\dot{V}-geometrically ergodic: there exist C>0C>0 and ρ(0,1)\rho\in(0,1) s.t. for all k1k\geq 1 and f:𝖳f:\mathsf{T}\to\mathbb{R} with |f|V˙\lvert f\rvert\leq\dot{V}, it holds that

    |P˙βkf(θ)πβ(f)|CV˙(θ)ρk.\lvert\dot{P}_{\beta}^{k}f(\theta)-\pi_{\beta}(f)\rvert\leq C\dot{V}(\theta)\rho^{k}.
  8. (viii)

    With 𝔼[]=𝔼θ,β[]\mathbb{E}[\,\cdot\,]=\mathbb{E}_{\theta,\beta}[\,\cdot\,] denoting expectation with respect to the law of the marginal chain (Θk)(\Theta_{k}) of Algorithm 12 started at θ𝖳\theta\in\mathsf{T}, β𝖡\beta\in\mathsf{B}, and with V˙\dot{V} as in Assumption 13(vii), we have,

    supθ,β,k𝔼[V˙(Θk)2]<.\sup_{\theta,\beta,k}\mathbb{E}\big{[}\dot{V}(\Theta_{k})^{2}\big{]}<\infty.
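For concreteness, polynomially decaying step sizes such as γ_k = k^{-2/3} (the decay rate of the step size used in some of the experiments of Supplement F) satisfy Assumption 13(i): they are non-negative and non-increasing, and

\sum_{k\geq 1}k^{-2/3}=\infty,\qquad\text{while}\qquad\sum_{k\geq 1}k^{-4/3}\Big(1+\tfrac{2}{3}\log k+\tfrac{4}{9}\log^{2}k\Big)<\infty.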
Theorem 14.

Under Assumption 13, the expected value of the acceptance probability (2), taken with respect to the stationary measure of the chain, converges to α\alpha^{*}.

The proof of Theorem 14 can be found in Section C. It relies heavily on the conditions of (Andrieu et al., 2005, Theorem 2.3), which essentially require showing that the noise in the stochastic approximation update is asymptotically controlled.

We remark that Assumption 13(v) could likely be relaxed to the general, non-compact adaptation parameter case by using projections (cf. Andrieu et al., 2005).

Supplement C Analysis of the tolerance adaptive ABC-MCMC

In this section we aim to prove generalised convergence (Theorem 14 of Section B) of the tolerance adaptation, from which Theorem 10 of Section 4 will follow as a corollary. Throughout, we denote by C>0C>0 a constant which may change from line to line.

C.1. Proposal augmentation

Suppose L˙\dot{L} is a Markov kernel which can be written as

(4) L˙(x,dy)=q(x,dy)α(x,y)+(1q(x,dy)α(x,y))1(xdy),\dot{L}(x,\mathrm{d}y)=q(x,\mathrm{d}y)\alpha(x,y)+\bigg{(}1-\int q(x,\mathrm{d}y^{\prime})\alpha(x,y^{\prime})\bigg{)}1\left(x\in\mathrm{d}y\right),

where α(x,y)[0,1]\alpha(x,y)\in[0,1] is a jointly measurable function and q(x,dy)q(x,\mathrm{d}y) is a Markov proposal kernel. With x˘=(x,x)\breve{x}=(x,x^{\prime}), we define the proposal augmentation to be the Markov kernel

(5) L(x˘,dy˘)=α(x˘)1(xdy)q(x,dy)+(1α(x˘))1(xdy)q(x,dy).L(\breve{x},\mathrm{d}\breve{y})=\alpha(\breve{x})1\left(x^{\prime}\in\mathrm{d}y\right)q(x^{\prime},\mathrm{d}y^{\prime})+\big{(}1-\alpha(\breve{x})\big{)}1\left(x\in\mathrm{d}y\right)q(x,\mathrm{d}y^{\prime}).

It is easy to see that LL need not be reversible even if L˙\dot{L} is reversible. When L˙\dot{L} is reversible, however, LL still leaves a related probability measure invariant, as the following result shows.

Lemma 15.

Suppose a Markov kernel L˙\dot{L} of the form given in (4) is μ˙\dot{\mu}-reversible. Let LL be its proposal augmentation. Then the following statements hold:

  1. (i)

    μL=μ\mu L=\mu, where μ(dx,dx)=μ˙(dx)q(x,dx)\mu(\mathrm{d}x,\mathrm{d}x^{\prime})=\dot{\mu}(\mathrm{d}x)q(x,\mathrm{d}x^{\prime}).

  2. (ii)

    If L˙\dot{L} is V˙\dot{V}-geometrically ergodic with constants (C˙,ρ˙)(\dot{C},\dot{\rho}), then LL is VV-geometrically ergodic with constants (C,ρ)(C,\rho), where C=2C˙/ρ˙,C=2\dot{C}/\dot{\rho}, ρ=ρ˙,\rho=\dot{\rho}, and V(x˘)=12(V˙(x)+V˙(x′)).V(\breve{x})=\frac{1}{2}\big{(}\dot{V}(x)+\dot{V}(x^{\prime})\big{)}.

Lemma 15 extends a result of Schuster and Klebanov (2018, Theorem 4), who consider the case where L˙\dot{L} is a Metropolis–Hastings chain (see also Delmas and Jourdain, 2009; Rudolf and Sprungk, 2018). The extension to the more general class of reversible proposal-rejection chains allows one to consider, for example, jump and delayed acceptance chains, as well as the marginal chain (3) of Section B, which will be important for our analysis of the tolerance adaptation.

Proof of Lemma 15.

Part (i) follows by a direct calculation. We now consider part (ii). For f:𝖷2→ℝf:\mathsf{X}^{2}\to\mathbb{R}, we shall use the notation f˙(x)=∫f(x˘)q(x,dx′).\dot{f}(x)=\int f(\breve{x})q(x,\mathrm{d}x^{\prime}). We then have

q(x,dx)L((x,x);dy˘)f(y˘)=q(x,dx)α(x˘)f˙(x)+q(x,dx)(1α(x˘))f˙(x)=L˙f˙(x),\int q(x,\mathrm{d}x^{\prime})L\big{(}(x,x^{\prime});\mathrm{d}\breve{y}\big{)}f(\breve{y})=\int q(x,\mathrm{d}x^{\prime})\alpha(\breve{x})\dot{f}(x^{\prime})+\int q(x,\mathrm{d}x^{\prime})\big{(}1-\alpha(\breve{x})\big{)}\dot{f}(x)=\dot{L}\dot{f}(x),

and then inductively, for k1k\geq 1,

q(x,dx)Lk((x,x);dy˘)f(y˘)\displaystyle\int q(x,\mathrm{d}x^{\prime})L^{k}\big{(}(x,x^{\prime});\mathrm{d}\breve{y}\big{)}f(\breve{y}) =q(x,dx)α(x˘)q(x,dy)Lk1((x,y);dz˘)f(z˘)\displaystyle=\int q(x,\mathrm{d}x^{\prime})\alpha(\breve{x})q(x^{\prime},\mathrm{d}y^{\prime})L^{k-1}\big{(}(x^{\prime},y^{\prime});\mathrm{d}\breve{z}\big{)}f(\breve{z})
+q(x,dx)(1α(x˘))q(x,dy)Lk1((x,y);dz˘)f(z˘)\displaystyle\qquad+\int q(x,\mathrm{d}x^{\prime})\big{(}1-\alpha(\breve{x})\big{)}q(x,\mathrm{d}y^{\prime})L^{k-1}\big{(}(x,y^{\prime});\mathrm{d}\breve{z})f(\breve{z})
=q(x,dx)α(x˘)L˙k1f˙(x)+q(x,dx)(1α(x˘))L˙k1f˙(x)\displaystyle=\int q(x,\mathrm{d}x^{\prime})\alpha(\breve{x})\dot{L}^{k-1}\dot{f}(x^{\prime})+\int q(x,\mathrm{d}x^{\prime})\big{(}1-\alpha(\breve{x})\big{)}\dot{L}^{k-1}\dot{f}(x)
=L˙kf˙(x).\displaystyle=\dot{L}^{k}\dot{f}(x).

We then have the equality,

Lkf(x˘)\displaystyle L^{k}f(\breve{x}) =α(x˘)q(x,dy)Lk1((x,y);dz˘)f(z˘)+(1α(x˘))q(x,dy)Lk1((x,y);dz˘)f(z˘)\displaystyle=\alpha(\breve{x})\int q(x^{\prime},\mathrm{d}y^{\prime})L^{k-1}\big{(}(x^{\prime},y^{\prime});\mathrm{d}\breve{z}\big{)}f(\breve{z})+\big{(}1-\alpha(\breve{x})\big{)}\int q(x,\mathrm{d}y^{\prime})L^{k-1}\big{(}(x,y^{\prime});\mathrm{d}\breve{z}\big{)}f(\breve{z})
=α(x˘)L˙k1f˙(x)+(1α(x˘))L˙k1f˙(x).\displaystyle=\alpha(\breve{x})\dot{L}^{k-1}\dot{f}(x^{\prime})+\big{(}1-\alpha(\breve{x})\big{)}\dot{L}^{k-1}\dot{f}(x).

For |f|≤V\lvert f\rvert\leq V, note that |f˙|≤V˙\lvert\dot{f}\rvert\leq\dot{V} since ∥q∥∞≤1\lVert q\rVert_{\infty}\leq 1, and we conclude (ii) from

|Lkf(x˘)μ(f)|\displaystyle\lvert L^{k}f(\breve{x})-\mu(f)\rvert α(x˘)|L˙k1f˙(x)μ˙(f˙)|+(1α(x˘))|L˙k1f˙(x)μ˙(f˙)|\displaystyle\leq\alpha(\breve{x})\lvert\dot{L}^{k-1}\dot{f}(x^{\prime})-\dot{\mu}(\dot{f})\rvert+\big{(}1-\alpha(\breve{x})\big{)}\lvert\dot{L}^{k-1}\dot{f}(x)-\dot{\mu}(\dot{f})\rvert
C˙ρ˙k1(V˙(x)+V˙(x)).\displaystyle\leq\dot{C}\dot{\rho}^{k-1}\big{(}\dot{V}(x^{\prime})+\dot{V}(x)\big{)}.\qed

Consider now the θ\theta-coordinate chain P˙β\dot{P}_{\beta} presented in (3) of Section B. This transition P˙β\dot{P}_{\beta} is clearly a reversible proposal-rejection chain of the form (4). We now consider PβP_{\beta}, its proposal augmentation. This is the chain Θ˘k=(Θk,Θk)𝖳2\breve{\Theta}_{k}=(\Theta_{k},\Theta_{k}^{\prime})\in{\mathsf{T}}^{2}, formed by disregarding the tt-parameter as with P˙β\dot{P}_{\beta} before, but now augmenting by the proposal θq(θ,)\theta^{\prime}\sim q(\theta,\,\cdot\,). Its transitions are of the form θ˘=Θ˘k\breve{\theta}=\breve{\Theta}_{k} goes to ϑ˘=Θ˘k+1\breve{\vartheta}=\breve{\Theta}_{k+1} in the ABC-MCMC, with ϑ˘=(ϑ,ϑ)\breve{\vartheta}=(\vartheta,\vartheta^{\prime}) and kernel

Pβ(θ˘,dϑ˘)=αβ(θ˘)1(θdϑ)q(θ,dϑ)+(1αβ(θ˘))1(θdϑ)q(θ,dϑ)P_{\beta}(\breve{\theta},\mathrm{d}\breve{\vartheta})=\alpha_{\beta}(\breve{\theta})1\left(\theta^{\prime}\in\mathrm{d}\vartheta\right)q(\theta^{\prime},\mathrm{d}\vartheta^{\prime})+\big{(}1-\alpha_{\beta}(\breve{\theta})\big{)}1\left(\theta\in\mathrm{d}\vartheta\right)q(\theta,\mathrm{d}\vartheta^{\prime})

By Lemma 15(i), PβP_{\beta} leaves πβ=πβ,u/cβ\pi_{\beta}^{\prime}=\pi_{\beta,u}^{\prime}/c_{\beta} invariant, where πβ,u(dθ˘)=pr(dθ)Lβ(θ)q(θ,dθ)\pi_{\beta,u}^{\prime}(\mathrm{d}\breve{\theta})=\mathrm{pr}(\mathrm{d}\theta)L_{\beta}(\theta)q(\theta,\mathrm{d}\theta^{\prime}) and cβ=pr(dθ)Lβ(θ).c_{\beta}=\int\mathrm{pr}(\mathrm{d}\theta)L_{\beta}(\theta).

C.2. Monotonicity properties

The following result establishes monotonicity of the mean field acceptance rate with increasing tolerance.

Lemma 16.

Assume Assumption 13(iii) and 13(iv) hold. The mapping βπβ(αβ)\beta\mapsto\pi_{\beta}^{\prime}(\alpha_{\beta}) is monotone non-decreasing.

Proof.

Since pr(θ)\mathrm{pr}(\theta) and q(θ,θ)q(\theta,\theta^{\prime}) are uniformly bounded (Assumption 13(iii)), and Lβ(θ)1L_{\beta}(\theta)\leq 1, differentiation under the integral sign is possible in the following by the dominated convergence theorem. By the quotient rule,

(6) ddβ(πβ(αβ))=1cβ2(cβddβ(πβ,u(αβ))πβ,u(αβ)dcβdβ).\frac{\mathrm{d}}{\mathrm{d}\beta}\Big{(}\pi_{\beta}^{\prime}(\alpha_{\beta})\Big{)}=\frac{1}{c_{\beta}^{2}}\bigg{(}c_{\beta}\frac{\mathrm{d}}{\mathrm{d}\beta}\Big{(}\pi_{\beta,u}^{\prime}(\alpha_{\beta})\Big{)}-\pi_{\beta,u}^{\prime}(\alpha_{\beta})\frac{\mathrm{d}c_{\beta}}{\mathrm{d}\beta}\bigg{)}.

By reversibility of Metropolis–Hastings targeting pr(θ)\mathrm{pr}(\theta) with proposal qq,

ddβ(πβ,u(αβ))=2eβpr(dθ)Lβ(θ)q(θ,dθ)α˙(θ,θ)Qθ(eβ).\frac{\mathrm{d}}{\mathrm{d}\beta}\Big{(}\pi_{\beta,u}^{\prime}(\alpha_{\beta})\Big{)}=2e^{\beta}\int\mathrm{pr}(\mathrm{d}\theta)L_{\beta}(\theta)q(\theta,\mathrm{d}\theta^{\prime})\dot{\alpha}(\theta,\theta^{\prime})Q_{\theta^{\prime}}(e^{\beta}).

With

f(θ)=2Qθ(eβ)pr(dθ~)Lβ(θ~)Lβ(θ)pr(dθ~)Qθ~(eβ),f(\theta^{\prime})=2Q_{\theta^{\prime}}(e^{\beta})\int\mathrm{pr}(\mathrm{d}\tilde{\theta})L_{\beta}(\tilde{\theta})-L_{\beta}(\theta^{\prime})\int\mathrm{pr}(\mathrm{d}\tilde{\theta})Q_{\tilde{\theta}}(e^{\beta}),

we can then write (6) as

ddβ(πβ(αβ))=eβcβ2pr(dθ)Lβ(θ)q(θ,dθ)α˙(θ,θ)f(θ).\frac{\mathrm{d}}{\mathrm{d}\beta}\Big{(}\pi_{\beta}^{\prime}(\alpha_{\beta})\Big{)}=\frac{e^{\beta}}{c_{\beta}^{2}}\int\mathrm{pr}(\mathrm{d}\theta)L_{\beta}(\theta)q(\theta,\mathrm{d}\theta^{\prime})\dot{\alpha}(\theta,\theta^{\prime})f(\theta^{\prime}).

By the same reversibility property as before, we can write this again as

ddβ(πβ(αβ))=eβcβ2f(θ)pr(dθ)q(θ,dθ)Lβ(θ)α˙(θ,θ),\frac{\mathrm{d}}{\mathrm{d}\beta}\Big{(}\pi_{\beta}^{\prime}(\alpha_{\beta})\Big{)}=\frac{e^{\beta}}{c_{\beta}^{2}}\int f(\theta)\mathrm{pr}(\mathrm{d}\theta)\int q(\theta,\mathrm{d}\theta^{\prime})L_{\beta}(\theta^{\prime})\dot{\alpha}(\theta,\theta^{\prime}),

We then conclude, since

f(θ)pr(dθ)=Qθ(eβ)pr(dθ)Lβ(θ~)pr(dθ~)0.\int f(\theta)\mathrm{pr}(\mathrm{d}\theta)=\int Q_{\theta}(e^{\beta})\mathrm{pr}(\mathrm{d}\theta)\int L_{\beta}(\tilde{\theta})\mathrm{pr}(\mathrm{d}\tilde{\theta})\geq 0.\qed
Lemma 17.

The following statements hold:

  1. (i)

    The function βcβ\beta\mapsto c_{\beta} is monotone non-decreasing on \mathbb{R}.

  2. (ii)

    If Assumption 13(v) and 13(vi) hold, then there exist Cmin>0C_{\min}>0, Cmax>0C_{\max}>0 such that CmincβCmaxC_{\min}\leq c_{\beta}\leq C_{\max} for all β𝖡\beta\in\mathsf{B}.

Proof.

Part (i) follows, for ββ\beta\leq\beta^{\prime}, from

cβ=pr(dθ)Qθ([0,eβ])pr(dθ)Qθ([0,eβ])=cβ.c_{\beta}=\int\mathrm{pr}(\mathrm{d}\theta)Q_{\theta}([0,e^{\beta}])\leq\int\mathrm{pr}(\mathrm{d}\theta)Q_{\theta}([0,e^{\beta^{\prime}}])=c_{\beta^{\prime}}.

Consider part (ii). By part (i) and compactness of 𝖡\mathsf{B} (Assumption 13(v)), we can set Cmin=cmin(𝖡)C_{\min}=c_{\min(\mathsf{B})} and Cmax=cmax(𝖡),C_{\max}=c_{\max(\mathsf{B})}, both of which are positive by Assumption 13(vi). ∎

C.3. Stochastic approximation framework

To obtain a form common in the stochastic approximation literature (cf. Andrieu et al., 2005), we write the update in Algorithm 12 as

βk+1\displaystyle\beta_{k+1} =βk+γk+1Hβk(Θ˘k,Tk)\displaystyle=\beta_{k}+\gamma_{k+1}H_{\beta_{k}}(\breve{\Theta}_{k},T_{k}^{\prime})
=βk+γk+1h(βk)+γk+1ζk+1\displaystyle=\beta_{k}+\gamma_{k+1}h(\beta_{k})+\gamma_{k+1}\zeta_{k+1}

where Hβ(θ˘,t)=ααβ(θ˘,t),H_{\beta}(\breve{\theta},t^{\prime})=\alpha^{*}-\alpha_{\beta}^{\prime}(\breve{\theta},t^{\prime}),

αβ(θ˘,t)=min{1,pr(θ)q(θ,θ)pr(θ)q(θ,θ)}1(teβ),\alpha_{\beta}^{\prime}(\breve{\theta},t^{\prime})=\min\bigg{\{}1,\frac{\mathrm{pr}(\theta^{\prime})q(\theta^{\prime},\theta)}{\mathrm{pr}(\theta)q(\theta,\theta^{\prime})}\bigg{\}}1\left(t^{\prime}\leq e^{\beta}\right),
h(β)=πβ(H^β)=πβ(dθ)q(θ,dθ)Qθ(dt)Hβ(θ,θ,t),h(\beta)=\pi_{\beta}^{\prime}(\widehat{H}_{\beta})=\int\pi_{\beta}(\mathrm{d}\theta)q(\theta,\mathrm{d}\theta^{\prime})Q_{\theta^{\prime}}(\mathrm{d}t^{\prime})H_{\beta}(\theta,\theta^{\prime},t^{\prime}),

noise sequence ζk+1=Hβk(Θ˘k,Tk)h(βk),\zeta_{k+1}=H_{\beta_{k}}(\breve{\Theta}_{k},T_{k}^{\prime})-h(\beta_{k}), and conditional expectation

H^β(θ˘)=𝔼[Hβ(Θ˘,T)|Θ˘=θ˘],\widehat{H}_{\beta}(\breve{\theta})=\mathbb{E}[H_{\beta}(\breve{\Theta},T^{\prime})|\breve{\Theta}=\breve{\theta}],

where TQθ()T^{\prime}\sim Q_{\theta^{\prime}}(\,\cdot\,). We also set for convenience H¯β(θ˘)=H^β(θ˘)πβ(H^β).\bar{H}_{\beta}(\breve{\theta})=\widehat{H}_{\beta}(\breve{\theta})-\pi_{\beta}^{\prime}(\widehat{H}_{\beta}).
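It may be worth noting that, directly from the definitions above, the conditional expectation takes the explicit form

\widehat{H}_{\beta}(\breve{\theta})=\alpha^{*}-\dot{\alpha}(\theta,\theta^{\prime})L_{\beta}(\theta^{\prime})=\alpha^{*}-\alpha_{\beta}(\theta,\theta^{\prime}),\qquad\text{so that}\qquad h(\beta)=\alpha^{*}-\pi_{\beta}^{\prime}(\alpha_{\beta}),

which connects the mean field h to the stationary acceptance rate whose monotonicity was established in Lemma 16.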

Lemma 18.

Suppose Assumption 13(vii) holds. Then the following statements hold:

  1. (i)

    The proposal augmented kernels (Pβ)β𝖡(P_{\beta})_{\beta\in\mathsf{B}} are simultaneously VV-geometrically ergodic, where V(θ,θ)=12(V˙(θ)+V˙(θ))V(\theta,\theta^{\prime})=\frac{1}{2}\big{(}\dot{V}(\theta)+\dot{V}(\theta^{\prime})\big{)}, with V˙\dot{V} as in Assumption 13(vii).

  2. (ii)

    There exists C>0C>0, such that for all β𝖡\beta\in\mathsf{B}, the formal solution gβ=k0PβkH¯βg_{\beta}=\sum_{k\geq 0}P_{\beta}^{k}\bar{H}_{\beta} to the Poisson equation gβPβgβ=H¯βg_{\beta}-P_{\beta}g_{\beta}=\bar{H}_{\beta} satisfies |gβ(θ˘)|CV(θ˘).\lvert g_{\beta}(\breve{\theta})\rvert\leq CV(\breve{\theta}).

Proof.

(i) follows directly from the explicit parametrisation for (C,ρ)(C,\rho) given in Lemma 15(ii).

Part (ii) follows from part (i) and, since |H¯β|≤1≤V\lvert\bar{H}_{\beta}\rvert\leq 1\leq V, the bound

|gβ(θ˘)|1+Cβk1ρβkV(θ˘)(1+Cβ1ρβ)V(θ˘).\lvert g_{\beta}(\breve{\theta})\rvert\leq 1+C_{\beta}\sum_{k\geq 1}\rho_{\beta}^{k}V(\breve{\theta})\leq\bigg{(}1+\frac{C_{\beta}}{1-\rho_{\beta}}\bigg{)}V(\breve{\theta}).\qed

C.4. Contractions

We define for V:𝖳[1,)V:\mathsf{T}\rightarrow[1,\infty) and g:𝖳g:\mathsf{T}\to\mathbb{R} the VV-norm gV=supθ𝖳|g(θ)|V(θ).\lVert g\rVert_{V}=\sup_{\theta\in\mathsf{T}}\frac{|g(\theta)|}{V(\theta)}. We define for a bounded operator AA on a Banach space of bounded functions ff, the operator norm A=supfAff\lVert A\rVert_{\infty}=\sup_{f}\frac{\lVert Af\rVert_{\infty}}{\lVert f\rVert_{\infty}}.

Lemma 19.

Suppose Assumption 13(iv), 13(v) and 13(vi) hold. The following hold:

  1. (i)

    C>0\exists C>0, C𝖡+>0\exists C_{\mathsf{B}}^{+}>0 s.t. β1𝖡\forall\beta_{1}\in\mathsf{B}, β2𝖡\forall\beta_{2}\in\mathsf{B}, g:𝖳2\forall g:\mathsf{T}^{2}\to\mathbb{R} bounded, we have

    (Pβ1Pβ2)gCg|eβ1eβ2|C𝖡+g|β1β2|.\lVert(P_{\beta_{1}}-P_{\beta_{2}})g\rVert_{\infty}\leq C\lVert g\rVert_{\infty}\lvert e^{\beta_{1}}-e^{\beta_{2}}\rvert\leq C_{\mathsf{B}}^{+}\lVert g\rVert_{\infty}\lvert\beta_{1}-\beta_{2}\rvert.
  2. (ii)

    C𝖡>0\exists C_{\mathsf{B}}^{-}>0, C𝖡>0\exists C_{\mathsf{B}}>0, s.t. β1𝖡\forall\beta_{1}\in\mathsf{B}, β2𝖡\forall\beta_{2}\in\mathsf{B}, we have

    H¯β1H¯β2C𝖡|eβ1eβ2|C𝖡|β1β2|.\lVert\bar{H}_{\beta_{1}}-\bar{H}_{\beta_{2}}\rVert_{\infty}\leq C_{\mathsf{B}}^{-}\lvert e^{\beta_{1}}-e^{\beta_{2}}\rvert\leq C_{\mathsf{B}}\lvert\beta_{1}-\beta_{2}\rvert.
  3. (iii)

    C𝖡>0\exists C_{\mathsf{B}}^{-}>0, C𝖡>0\exists C_{\mathsf{B}}>0, s.t. β1𝖡\forall\beta_{1}\in\mathsf{B}, β2𝖡\forall\beta_{2}\in\mathsf{B}, g:𝖳2\forall g:\mathsf{T}^{2}\to\mathbb{R} bounded, we have

    |πβ1(g)πβ2(g)|C𝖡g|eβ1eβ2|C𝖡g|β1β2|.\lvert\pi_{\beta_{1}}^{\prime}(g)-\pi_{\beta_{2}}^{\prime}(g)\rvert\leq C_{\mathsf{B}}^{-}\lVert g\rVert_{\infty}\lvert e^{\beta_{1}}-e^{\beta_{2}}\rvert\leq C_{\mathsf{B}}\lVert g\rVert_{\infty}\lvert\beta_{1}-\beta_{2}\rvert.
Proof.

By Assumption 13(iv), we have for all β1,β2𝖡\beta_{1},\,\beta_{2}\in\mathsf{B},

|Lβ1(θ)Lβ2(θ)|=eβ1β2eβ1β2Qθ(dt)C|eβ1eβ2|.\lvert L_{\beta_{1}}(\theta)-L_{\beta_{2}}(\theta)\rvert=\int_{e^{\beta_{1}\wedge\beta_{2}}}^{e^{\beta_{1}\vee\beta_{2}}}Q_{\theta}(\mathrm{d}t)\leq C\lvert e^{\beta_{1}}-e^{\beta_{2}}\rvert.

We then obtain the first inequality of part (i) from the bound

|(Pβ1Pβ2)g(θ˘)|\displaystyle\lvert(P_{\beta_{1}}-P_{\beta_{2}})g(\breve{\theta})\rvert =|(αβ1(θ˘)αβ2(θ˘))g˙(θ)+(αβ2(θ˘)αβ1(θ˘))g˙(θ)|\displaystyle=\lvert\big{(}\alpha_{\beta_{1}}(\breve{\theta})-\alpha_{\beta_{2}}(\breve{\theta})\big{)}\dot{g}(\theta^{\prime})+\big{(}\alpha_{\beta_{2}}(\breve{\theta})-\alpha_{\beta_{1}}(\breve{\theta})\big{)}\dot{g}(\theta)\rvert
α˙(θ˘)|Lβ1(θ)Lβ2(θ)|(q(θ,dϑ)|g(θ,ϑ)|+q(θ,dϑ)|g(θ,ϑ)|),\displaystyle\leq\dot{\alpha}(\breve{\theta})\lvert L_{\beta_{1}}(\theta^{\prime})-L_{\beta_{2}}(\theta^{\prime})\rvert\int\Big{(}q(\theta^{\prime},\mathrm{d}\vartheta^{\prime})\lvert g(\theta^{\prime},\vartheta^{\prime})\rvert+q(\theta,\mathrm{d}\vartheta^{\prime})\lvert g(\theta,\vartheta^{\prime})\rvert\Big{)},

The second, Lipschitz bound follows by a mean value theorem argument for the function βeβ\beta\mapsto e^{\beta}, namely

|eβ1eβ2|supβ𝖡eβ|β1β2|C𝖡+|β1β2|,\lvert e^{\beta_{1}}-e^{\beta_{2}}\rvert\leq\sup_{\beta\in\mathsf{B}}e^{\beta}\,\lvert\beta_{1}-\beta_{2}\rvert\leq C_{\mathsf{B}}^{+}\lvert\beta_{1}-\beta_{2}\rvert,

where the last inequality follows by compactness of 𝖡\mathsf{B} (Assumption 13(v)).

We now consider part (ii). We have,

H¯β1H¯β2H^β1H^β2+|h(β1)h(β2)|.\lVert\bar{H}_{\beta_{1}}-\bar{H}_{\beta_{2}}\rVert_{\infty}\leq\lVert\widehat{H}_{\beta_{1}}-\widehat{H}_{\beta_{2}}\rVert_{\infty}+\lvert h(\beta_{1})-h(\beta_{2})\rvert.

For the first term, by Assumption 13(iv), as in (i), we have

H^β1H^β2supθ˘α˙(θ˘)eβ1β2eβ1β2Qθ(dt)C|β1β2|.\lVert\widehat{H}_{\beta_{1}}-\widehat{H}_{\beta_{2}}\rVert_{\infty}\leq\sup_{\breve{\theta}}\dot{\alpha}(\breve{\theta})\int_{e^{\beta_{1}\wedge\beta_{2}}}^{e^{\beta_{1}\vee\beta_{2}}}Q_{\theta^{\prime}}(\mathrm{d}t)\leq C\lvert\beta_{1}-\beta_{2}\rvert.

For the other term, we have

|h(β1)h(β2)|1cβ1|πβ1,u(αβ1)πβ2,u(αβ2)|+πβ2,u(αβ2)|cβ1cβ2|cβ1cβ2.\lvert h(\beta_{1})-h(\beta_{2})\rvert\leq\frac{1}{c_{\beta_{1}}}\lvert\pi_{\beta_{1},u}^{\prime}(\alpha_{\beta_{1}})-\pi_{\beta_{2},u}^{\prime}(\alpha_{\beta_{2}})\rvert+\pi_{\beta_{2},u}^{\prime}(\alpha_{\beta_{2}})\frac{\lvert c_{\beta_{1}}-c_{\beta_{2}}\rvert}{c_{\beta_{1}}c_{\beta_{2}}}.

By the triangle inequality, we have

|πβ1,u(αβ1)πβ2,u(αβ2)||πβ1,u(αβ1)πβ1,u(αβ2)|+|πβ1,u(αβ2)πβ2,u(αβ2)|\lvert\pi_{\beta_{1},u}^{\prime}(\alpha_{\beta_{1}})-\pi_{\beta_{2},u}^{\prime}(\alpha_{\beta_{2}})\rvert\leq\lvert\pi_{\beta_{1},u}^{\prime}(\alpha_{\beta_{1}})-\pi_{\beta_{1},u}^{\prime}(\alpha_{\beta_{2}})\rvert+\lvert\pi_{\beta_{1},u}^{\prime}(\alpha_{\beta_{2}})-\pi_{\beta_{2},u}^{\prime}(\alpha_{\beta_{2}})\rvert

Each term above is bounded by C|eβ1−eβ2|C\lvert e^{\beta_{1}}-e^{\beta_{2}}\rvert, as is |cβ1−cβ2|\lvert c_{\beta_{1}}-c_{\beta_{2}}\rvert. Moreover, by Lemma 17(ii), we have cβ≥Cmin>0c_{\beta}\geq C_{\min}>0 for all β∈𝖡\beta\in\mathsf{B}, and the first inequality in part (ii) follows. The second inequality follows by a mean value theorem argument as before. The proof of part (iii) is similar but simpler. ∎

C.5. Control of noise

We state a simple standard fact used repeatedly in the proof of Lemma 21 below, our key lemma.

Lemma 20.

Suppose (Xj)j1(X_{j})_{j\geq 1} are random variables with Xj0X_{j}\geq 0, Xj+1XjX_{j+1}\leq X_{j}, and limj𝔼[Xj]=0.\lim_{j\to\infty}\mathbb{E}[X_{j}]=0. Then, almost surely, limjXj=0.\lim_{j\to\infty}X_{j}=0.

Lemma 21.

Suppose Assumption 13 holds. Then, with 𝒯j,n=k=jnγkζk,\mathcal{T}_{j,n}=\sum_{k=j}^{n}\gamma_{k}\zeta_{k}, we have

limjsupnj|𝒯j,n|=0,almost surely.\lim_{j\to\infty}\sup_{n\geq j}\big{|}\mathcal{T}_{j,n}\big{|}=0,\qquad\text{almost surely.}
Proof.

Similarly to (Andrieu et al., 2005, Proof of Prop. 5.2), we write 𝒯j,n=∑i=18𝒯j,n(i)\mathcal{T}_{j,n}=\sum_{i=1}^{8}\mathcal{T}_{j,n}^{(i)}, where

H^βk1(Θ˘k1)=𝔼[Hβk1(Θ˘k1,T)|k1],\widehat{H}_{\beta_{k-1}}(\breve{\Theta}_{k-1})=\mathbb{E}[H_{\beta_{k-1}}(\breve{\Theta}_{k-1},T^{\prime})|\mathcal{F}_{k-1}^{\prime}],

with k1=σ(βk1,Θk1,Θk1)\mathcal{F}_{k-1}^{\prime}=\sigma(\beta_{k-1},\Theta_{k-1},\Theta_{k-1}^{\prime}) representing the information obtained through running Algorithm 12 up to and including iteration k2k-2 and then also generating Θk1\Theta_{k-1}^{\prime}, and

𝒯j,n(1)\displaystyle\mathcal{T}_{j,n}^{(1)} =k=jnγk(Hβk1(Θ˘k1,Tk1)H^βk1(Θ˘k1)),\displaystyle=\sum_{k=j}^{n}\gamma_{k}\Big{(}H_{\beta_{k-1}}(\breve{\Theta}_{k-1},T_{k-1}^{\prime})-\widehat{H}_{\beta_{k-1}}(\breve{\Theta}_{k-1})\Big{)},
𝒯j,n(2)\displaystyle\mathcal{T}_{j,n}^{(2)} =k=jnγk(gβk1(Θ˘k1)Pβk1gβk1(Θ˘k2)),\displaystyle=\sum_{k=j}^{n}\gamma_{k}\Big{(}g_{\beta_{k-1}}(\breve{\Theta}_{k-1})-P_{\beta_{k-1}}g_{\beta_{k-1}}(\breve{\Theta}_{k-2})\Big{)},
𝒯j,n(3)\displaystyle\mathcal{T}_{j,n}^{(3)} =γj−1Pβj−1gβj−1(Θ˘j−2)−γnPβngβn(Θ˘n−1),\displaystyle=\gamma_{j-1}P_{\beta_{j-1}}g_{\beta_{j-1}}(\breve{\Theta}_{j-2})-\gamma_{n}P_{\beta_{n}}g_{\beta_{n}}(\breve{\Theta}_{n-1}),
𝒯j,n(4)\displaystyle\mathcal{T}_{j,n}^{(4)} =k=jn(γkγk1)Pβk1gβk1(Θ˘k2),\displaystyle=\sum_{k=j}^{n}\Big{(}\gamma_{k}-\gamma_{k-1}\Big{)}P_{\beta_{k-1}}g_{\beta_{k-1}}(\breve{\Theta}_{k-2}),
𝒯j,n(5)\displaystyle\mathcal{T}_{j,n}^{(5)} =k=jnγkimk+1PβkiH¯βk(Θ˘k1),\displaystyle=\sum_{k=j}^{n}\gamma_{k}\sum_{i\geq m_{k}+1}P_{\beta_{k}}^{i}\bar{H}_{\beta_{k}}(\breve{\Theta}_{k-1}),
𝒯j,n(6)\displaystyle\mathcal{T}_{j,n}^{(6)} =k=jnγkimk+1Pβk1iH¯βk1(Θ˘k1),\displaystyle=-\sum_{k=j}^{n}\gamma_{k}\sum_{i\geq m_{k}+1}P_{\beta_{k-1}}^{i}\bar{H}_{\beta_{k-1}}(\breve{\Theta}_{k-1}),
𝒯j,n(7)\displaystyle\mathcal{T}_{j,n}^{(7)} =k=jnγki=1mk(PβkiPβk1i)H¯βk(Θ˘k1),\displaystyle=\sum_{k=j}^{n}\gamma_{k}\sum_{i=1}^{m_{k}}\Big{(}P_{\beta_{k}}^{i}-P_{\beta_{k-1}}^{i}\Big{)}\bar{H}_{\beta_{k}}(\breve{\Theta}_{k-1}),
𝒯j,n(8)\displaystyle\mathcal{T}_{j,n}^{(8)} =k=jnγki=1mkPβk1i(H¯βkH¯βk1)(Θ˘k1).\displaystyle=\sum_{k=j}^{n}\gamma_{k}\sum_{i=1}^{m_{k}}P_{\beta_{k-1}}^{i}\Big{(}\bar{H}_{\beta_{k}}-\bar{H}_{\beta_{k-1}}\Big{)}(\breve{\Theta}_{k-1}).

Here, gβg_{\beta} is the solution of the Poisson equation defined in Lemma 18(ii), and mk=⌈|logγk|⌉m_{k}=\lceil\lvert\log\gamma_{k}\rvert\rceil. Recall that H¯β=H^β−h(β)\bar{H}_{\beta}=\widehat{H}_{\beta}-h(\beta) from Section C.3.

We now show limjsupnj|𝒯j,n(i)|=0\lim_{j\to\infty}\sup_{n\geq j}\big{|}\mathcal{T}_{j,n}^{(i)}\big{|}=0 for each of the terms i{1:8}i\in\{1{:}8\} individually, which implies the result of the lemma.

(1) Since for all n>jn>j,

𝔼[𝒯j,n(1)𝒯j,n1(1)|n1]=0,\mathbb{E}[\mathcal{T}_{j,n}^{(1)}-\mathcal{T}_{j,n-1}^{(1)}|\mathcal{F}_{n-1}^{\prime}]=0,

we have that (𝒯j,n(1))nj(\mathcal{T}_{j,n}^{(1)})_{n\geq j} is a n\mathcal{F}_{n}^{\prime}-martingale for each j1j\geq 1. By the Burkholder-Davis-Gundy inequality for martingales (cf. Burkholder et al., 1972), we have

𝔼[supnj|𝒯j,n(1)|2]C𝔼[k=jγk2(Hβk1(Θ˘k1,Tk1)H^βk1(Θ˘k1))2]Ck=jγk2,\mathbb{E}[\sup_{n\geq j}|\mathcal{T}_{j,n}^{(1)}|^{2}]\leq C\mathbb{E}\Big{[}\sum_{k=j}^{\infty}\gamma_{k}^{2}\big{(}H_{\beta_{k-1}}(\breve{\Theta}_{k-1},T_{k-1}^{\prime})-\widehat{H}_{\beta_{k-1}}(\breve{\Theta}_{k-1})\big{)}^{2}\Big{]}\leq C\sum_{k=j}^{\infty}\gamma_{k}^{2},

where in the last inequality we have noted that |HβH^β|1|H_{\beta}-\widehat{H}_{\beta}|\leq 1. Since k1γk2<\sum_{k\geq 1}\gamma_{k}^{2}<\infty, we get that

limj𝔼[supnj|𝒯j,n(1)|2]=0.\lim_{j\to\infty}\mathbb{E}[\sup_{n\geq j}|\mathcal{T}_{j,n}^{(1)}|^{2}]=0.

Hence, the result follows by Lemma 20.

(2) For j2j\geq 2, we have for n>jn>j,

𝔼[𝒯j,n(2)𝒯j,n1(2)|n2]=0,\mathbb{E}[\mathcal{T}_{j,n}^{(2)}-\mathcal{T}_{j,n-1}^{(2)}|\mathcal{F}_{n-2}^{\prime}]=0,

so that (𝒯j,n(2))nj(\mathcal{T}_{j,n}^{(2)})_{n\geq j} is a n1\mathcal{F}_{n-1}^{\prime}-martingale, for j2j\geq 2. By the Burkholder-Davis-Gundy inequality again,

𝔼[supnj|𝒯j,n(2)|2]C𝔼[k=jγk2(gβk1(Θ˘k1)Pβk1gβk1(Θ˘k2))2].\mathbb{E}[\sup_{n\geq j}\lvert\mathcal{T}_{j,n}^{(2)}\rvert^{2}]\leq C\mathbb{E}\Big{[}\sum_{k=j}^{\infty}\gamma_{k}^{2}\big{(}g_{\beta_{k-1}}(\breve{\Theta}_{k-1})-P_{\beta_{k-1}}g_{\beta_{k-1}}(\breve{\Theta}_{k-2})\big{)}^{2}\Big{]}.

We then use Lemma 18(ii) and Pβ1\lVert P_{\beta}\rVert_{\infty}\leq 1, to get, after combining terms,

𝔼[supnj|𝒯j,n(2)|2]Ck=j1γk2𝔼[V(Θ˘k1)2]Ck=j1γk2,\mathbb{E}[\sup_{n\geq j}\lvert\mathcal{T}_{j,n}^{(2)}\rvert^{2}]\leq C\sum_{k=j-1}^{\infty}\gamma_{k}^{2}\mathbb{E}\Big{[}V(\breve{\Theta}_{k-1})^{2}\Big{]}\leq C\sum_{k=j-1}^{\infty}\gamma_{k}^{2},

where we have used Assumption 13(viii) in the last inequality. We then conclude by Lemma 20 as before.

(3) By Lemma 18(ii), the triangle inequality, Pβ1\lVert P_{\beta}\rVert_{\infty}\leq 1, and the dominated convergence theorem, we obtain

𝔼[supnj|𝒯j,n(3)|Cγj1𝔼[V(Θ˘j2)]+Csupnjγn𝔼[V(Θ˘n1)].\mathbb{E}[\sup_{n\geq j}\lvert\mathcal{T}_{j,n}^{(3)}\rvert\leq C\gamma_{j-1}\mathbb{E}[V(\breve{\Theta}_{j-2})]+C\sup_{n\geq j}\gamma_{n}\mathbb{E}[V(\breve{\Theta}_{n-1})].

We then apply Assumption 13(viii) and Jensen’s inequality, and use that γk\gamma_{k} go to zero, since γk2<\sum\gamma_{k}^{2}<\infty, to get that

limj𝔼[supnj|𝒯j,n(3)|]C(limjγj1+supnjγn)=0.\lim_{j\to\infty}\mathbb{E}[\sup_{n\geq j}\lvert\mathcal{T}_{j,n}^{(3)}\rvert]\leq C\Big{(}\lim_{j\to\infty}\gamma_{j-1}+\sup_{n\geq j}\gamma_{n}\Big{)}=0.

We may now conclude by Lemma 20.

(4) By Lemma 18(ii) and γkγk1\gamma_{k}\leq\gamma_{k-1}, we have for j2j\geq 2,

𝔼[supnj|𝒯j,n(4)|]Csupnjk=jn(γk1γk)𝔼[V(Θ˘k2)]Csupnjk=jn(γk1γk)\mathbb{E}[\sup_{n\geq j}\lvert\mathcal{T}_{j,n}^{(4)}\rvert]\leq C\sup_{n\geq j}\sum_{k=j}^{n}(\gamma_{k-1}-\gamma_{k})\mathbb{E}[V(\breve{\Theta}_{k-2})]\leq C\sup_{n\geq j}\sum_{k=j}^{n}(\gamma_{k-1}-\gamma_{k})

where the last inequality uses Assumption 13(viii) and Jensen’s inequality. Since the sum telescopes, we get

𝔼[supnj|𝒯j,n(4)|]Csupnj(γj1γn)Cγj1\mathbb{E}[\sup_{n\geq j}\lvert\mathcal{T}_{j,n}^{(4)}\rvert]\leq C\sup_{n\geq j}(\gamma_{j-1}-\gamma_{n})\leq C\gamma_{j-1}

We then conclude by Lemma 20, since γj0\gamma_{j}\to 0.

(5) By Lemma 18(i), |PβiH¯β(θ˘)|CρiV(θ˘),\lvert P_{\beta}^{i}\bar{H}_{\beta}(\breve{\theta})\rvert\leq C\rho^{i}V(\breve{\theta}), where C,ρC,\,\rho do not depend on β𝖡\beta\in\mathsf{B}. Hence,

𝔼[|𝒯j,n(5)|]Ck=jnγkimk+1ρi𝔼[V(Θ˘k1)]Ck=jnγkρmk,\mathbb{E}[\lvert\mathcal{T}_{j,n}^{(5)}\rvert]\leq C\sum_{k=j}^{n}\gamma_{k}\sum_{i\geq m_{k}+1}\rho^{i}\mathbb{E}[V(\breve{\Theta}_{k-1})]\leq C\sum_{k=j}^{n}\gamma_{k}\rho^{m_{k}},

where the last inequality uses Assumption 13(viii) and Jensen’s inequality. Since mkm_{k} was defined to be of order |logγk|\lvert\log\gamma_{k}\rvert, we have

𝔼[|𝒯j,n(5)|]Ck=jγk2<\mathbb{E}[\lvert\mathcal{T}_{j,n}^{(5)}\rvert]\leq C\sum_{k=j}^{\infty}\gamma_{k}^{2}<\infty

By the dominated convergence theorem, we then have

𝔼[supnj|𝒯j,n(5)|]Ck=jγk2.\mathbb{E}[\sup_{n\geq j}\lvert\mathcal{T}_{j,n}^{(5)}\rvert]\leq C\sum_{k=j}^{\infty}\gamma_{k}^{2}.

Taking the limit jj\to\infty, we can then conclude by using Lemma 20.

(6) The proof is essentially the same as for (5).

(7) We write for i1i\geq 1,

PβkiPβk1i=l=0i1Pβkil1(PβkPβk1)Pβk1l.P_{\beta_{k}}^{i}-P_{\beta_{k-1}}^{i}=\sum_{l=0}^{i-1}P_{\beta_{k}}^{i-l-1}\big{(}P_{\beta_{k}}-P_{\beta_{k-1}}\big{)}P_{\beta_{k-1}}^{l}.

Since Pβi1\lVert P_{\beta}^{i}\rVert_{\infty}\leq 1 for all i0i\geq 0, and |H¯β|1\lvert\bar{H}_{\beta}\rvert\leq 1, by Lemma 19(i), we have

(PβkiPβk1i)H¯βkCl=0i1Pβkil1|βkβk1|Pβk1lH¯βkC|βkβk1|i.\lVert(P_{\beta_{k}}^{i}-P_{\beta_{k-1}}^{i})\bar{H}_{\beta_{k}}\rVert\leq C\sum_{l=0}^{i-1}\lVert P_{\beta_{k}}^{i-l-1}\rVert_{\infty}\lvert\beta_{k}-\beta_{k-1}\rvert\lVert P_{\beta_{k-1}}^{l}\bar{H}_{\beta_{k}}\rVert_{\infty}\leq C\lvert\beta_{k}-\beta_{k-1}\rvert i.

Since |βkβk1|γk\lvert\beta_{k}-\beta_{k-1}\rvert\leq\gamma_{k} from the adaptation step in Algorithm 12, we have

|𝒯j,n(7)|Ck=jnγki=1mkiγkCk=jγk2mk(1+mk)<.\lvert\mathcal{T}_{j,n}^{(7)}\rvert\leq C\sum_{k=j}^{n}\gamma_{k}\sum_{i=1}^{m_{k}}i\gamma_{k}\leq C\sum_{k=j}^{\infty}\gamma_{k}^{2}m_{k}(1+m_{k})<\infty.

We then take supnj\sup_{n\geq j} on the left, take the expectation, and conclude by Lemma 20.

(8) Since Pβi1\lVert P_{\beta}^{i}\rVert_{\infty}\leq 1 and by Lemma 19(ii), we have that

Pβk1i(H¯βkH¯βk1)Pβk1iH¯βkH¯βk1C|βkβk1|\lVert P_{\beta_{k-1}}^{i}(\bar{H}_{\beta_{k}}-\bar{H}_{\beta_{k-1}})\rVert_{\infty}\leq\lVert P_{\beta_{k-1}}^{i}\rVert_{\infty}\lVert\bar{H}_{\beta_{k}}-\bar{H}_{\beta_{k-1}}\rVert_{\infty}\leq C\lvert\beta_{k}-\beta_{k-1}\rvert

Since |βkβk1|γk\lvert\beta_{k}-\beta_{k-1}\rvert\leq\gamma_{k}, we have

𝔼[supnj𝒯j,n(8)]Ck=jγk2mk<.\mathbb{E}[\sup_{n\geq j}\mathcal{T}_{j,n}^{(8)}]\leq C\sum_{k=j}^{\infty}\gamma_{k}^{2}m_{k}<\infty.

We then conclude by Lemma 20. ∎

C.6. Proofs of convergence theorems

Proof of Theorem 14.

We define our Lyapunov function w:ℝ→[0,∞)w:\mathbb{R}\rightarrow[0,\infty) to be the continuously differentiable function w(β)=12|eβ−eβ∗|2w(\beta)=\frac{1}{2}|e^{\beta}-e^{\beta^{*}}|^{2}. We also have that h(β)=πβ′(H^β)h(\beta)=\pi_{\beta}^{\prime}(\widehat{H}_{\beta}) is continuous, which follows from Lemma 19(iii). One can then check that Assumption 13 and Lemma 21 imply that the assumptions of (Andrieu et al., 2005, Theorem 2.3) hold. The latter result implies that |βk−β∗|→0\lvert\beta_{k}-\beta^{*}\rvert\rightarrow 0 as k→∞k\to\infty, for some β∗∈𝖡\beta^{*}\in\mathsf{B} satisfying πβ∗′(αβ∗)=α∗\pi_{\beta^{*}}^{\prime}(\alpha_{\beta^{*}})=\alpha^{*}, as desired. ∎

Lemma 22.

Suppose Assumption 9 holds. Then both (P˙β)β𝖡(\dot{P}_{\beta})_{\beta\in\mathsf{B}} and (Pβ)β𝖡(P_{\beta})_{\beta\in\mathsf{B}} are simultaneously 11-geometrically ergodic (i.e. uniformly ergodic).

Proof.

We have pr(θ)≤Cpr\mathrm{pr}(\theta)\leq C_{\mathrm{pr}} for some Cpr>0C_{\mathrm{pr}}>0, and also 0<δq≤q(θ,ϑ)0<\delta_{q}\leq q(\theta,\vartheta), for all θ,ϑ∈𝖳\theta,\,\vartheta\in\mathsf{T}. Hence, for A⊂𝖳A\subset\mathsf{T},

P˙β(θ,A)δqmin{1,pr(ϑ)pr(θ)}Lβ(ϑ)1(ϑA)δqpr(ϑ)CprLβ(ϑ)1(ϑA)\dot{P}_{\beta}(\theta,A)\geq\int\delta_{q}\min\bigg{\{}1,\frac{\mathrm{pr}(\vartheta)}{\mathrm{pr}(\theta)}\bigg{\}}L_{\beta}(\vartheta)1\left(\vartheta\in A\right)\geq\int\delta_{q}\frac{\mathrm{pr}(\vartheta)}{C_{\mathrm{pr}}}L_{\beta}(\vartheta)1\left(\vartheta\in A\right)

By Lemma 17(ii), we have cβ≥Cmin>0c_{\beta}\geq C_{\min}>0 for all β∈𝖡\beta\in\mathsf{B}. Therefore,

P˙β(θ,A)δπβ(A),\dot{P}_{\beta}(\theta,A)\geq\delta\pi_{\beta}(A),

where δP˙=δqCmin/Cpr>0\delta_{\dot{P}}=\delta_{q}C_{\min}/C_{\mathrm{pr}}>0 is independent of β\beta. As in Nummelin’s split chain construction (cf. Meyn and Tweedie, 2009), we can then define the Markov kernel Rβ(θ,A)=(1−δP˙)−1(P˙β(θ,A)−δP˙πβ(A))R_{\beta}(\theta,A)=(1-\delta_{\dot{P}})^{-1}\big{(}\dot{P}_{\beta}(\theta,A)-\delta_{\dot{P}}\pi_{\beta}(A)\big{)} with πβRβ=πβ\pi_{\beta}R_{\beta}=\pi_{\beta}. Set Πβ(θ,A)=πβ(A)\Pi_{\beta}(\theta,A)=\pi_{\beta}(A). For any ff with |f|≤1\lvert f\rvert\leq 1, β∈𝖡\beta\in\mathsf{B}, and k≥1k\geq 1, we have

P˙βkfπβ(f)\displaystyle\lVert\dot{P}_{\beta}^{k}f-\pi_{\beta}(f)\rVert_{\infty} =(1δP˙)(RβΠβ)P˙βk1f=(1δP˙)RβP˙βk1(fπβ(f))\displaystyle=(1-\delta_{\dot{P}})\lVert(R_{\beta}-\Pi_{\beta})\dot{P}_{\beta}^{k-1}f\rVert_{\infty}=(1-\delta_{\dot{P}})\lVert R_{\beta}\dot{P}_{\beta}^{k-1}\big{(}f-\pi_{\beta}(f)\big{)}\rVert_{\infty}
(1δP˙)P˙βk1(fπβ(f))=(1δP˙)P˙βk1fπβ(f)\displaystyle\leq(1-\delta_{\dot{P}})\lVert\dot{P}_{\beta}^{k-1}\big{(}f-\pi_{\beta}(f)\big{)}\rVert_{\infty}=(1-\delta_{\dot{P}})\lVert\dot{P}_{\beta}^{k-1}f-\pi_{\beta}(f)\rVert_{\infty}
(1δP˙)kfπβ(f)2(1δP˙)kf,\displaystyle\leq\ldots\leq(1-\delta_{\dot{P}})^{k}\lVert f-\pi_{\beta}(f)\rVert_{\infty}\leq 2(1-\delta_{\dot{P}})^{k}\lVert f\rVert_{\infty},

where we have used Rβ1\lVert R_{\beta}\rVert_{\infty}\leq 1 in the first inequality. Hence, (P˙β)β𝖡(\dot{P}_{\beta})_{\beta\in\mathsf{B}} are simultaneously 11-geometrically ergodic, and thus so are (Pβ)β𝖡(P_{\beta})_{\beta\in\mathsf{B}} by Lemma 18(i). ∎

Proof of Theorem 10.

Since (P˙β)β∈𝖡(\dot{P}_{\beta})_{\beta\in\mathsf{B}} are simultaneously 11-geometrically ergodic by Lemma 22, it is straightforward to see that Assumption 9 implies Assumption 13. We conclude by Theorem 14. ∎

Supplement D Simultaneous tolerance and covariance adaptation

Algorithm 23 (TA-AM(nb,αn_{b},\alpha^{*})).

Suppose Θ0∈𝖳⊂ℝnθ\Theta_{0}\in\mathsf{T}\subset\mathbb{R}^{n_{\theta}} is a starting value with pr(Θ0)>0\mathrm{pr}(\Theta_{0})>0 and Γ0=𝟏nθ×nθ\Gamma_{0}=\mathbf{1}_{n_{\theta}\times n_{\theta}} is the identity matrix. An illustrative code sketch of the covariance-adaptation recursion in steps (v) and (vi) is given after the listing.

  1. 1.

    Initialise δ=T0\delta=T_{0} where T0QΘ0()T_{0}\sim Q_{\Theta_{0}}(\,\cdot\,) and T0>0T_{0}>0. Set μ0=Θ0\mu_{0}=\Theta_{0}.

  2. 2.

    For k=0,,nb1k=0,\ldots,n_{b}-1, iterate:

    1. (i)

      Draw ΘkN(Θk,(2.382/nθ)Γk)\Theta_{k}^{\prime}\sim N(\Theta_{k},(2.38^{2}/n_{\theta})\Gamma_{k})

    2. (ii)

      Draw TkQΘk()T_{k}^{\prime}\sim Q_{\Theta_{k}^{\prime}}(\,\cdot\,).

    3. (iii)

      Accept, by setting (Θk+1,Tk+1)(Θk,Tk)(\Theta_{k+1},T_{k+1})\leftarrow(\Theta_{k}^{\prime},T_{k}^{\prime}), with probability

      αδk(Θk,Tk;Θk,Tk)=min{1,pr(Θk)ϕ(Tk/δk)pr(Θk)ϕ(Tk/δk)}.\alpha_{\delta_{k}}(\Theta_{k},T_{k};\Theta_{k}^{\prime},T_{k}^{\prime})=\min\bigg{\{}1,\frac{\mathrm{pr}(\Theta_{k}^{\prime})\phi(T_{k}^{\prime}/\delta_{k})}{\mathrm{pr}(\Theta_{k})\phi(T_{k}/\delta_{k})}\bigg{\}}.

      Otherwise reject, by setting (Θk+1,Tk+1)(Θk,Tk)(\Theta_{k+1},T_{k+1})\leftarrow(\Theta_{k},T_{k}).

    4. (iv)

      logδk+1logδk+γk+1(ααδk(Θk,Θk,Tk)).\log\delta_{k+1}\leftarrow\log\delta_{k}+\gamma_{k+1}\big{(}\alpha^{*}-\alpha_{\delta_{k}}^{\prime}(\Theta_{k},\Theta_{k}^{\prime},T_{k}^{\prime})\big{)}.

    5. (v)

      μk+1μk+γk+1(Θk+1μk).\mu_{k+1}\leftarrow\mu_{k}+\gamma_{k+1}\big{(}\Theta_{k+1}-\mu_{k}\big{)}.

    6. (vi)

      Γk+1Γk+γk+1((Θk+1μk)(Θk+1μk)TΓk).\Gamma_{k+1}\leftarrow\Gamma_{k}+\gamma_{k+1}\big{(}(\Theta_{k+1}-\mu_{k})(\Theta_{k+1}-\mu_{k})^{\mathrm{\scriptscriptstyle T}}-\Gamma_{k}\big{)}.

  3. 3.

    Output (Θnb,δnb)(\Theta_{n_{b}},\delta_{n_{b}}).
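Below is a brief Python sketch of the covariance-adaptation recursion of steps (v) and (vi), which can be combined with the tolerance-adaptation sketch given after Algorithm 12; with it, proposals are drawn as in step (i), that is, from N(Θ_k, (2.38²/n_θ)Γ_k). This is an illustrative sketch under those assumptions, not the implementation used for the experiments.

```python
import numpy as np

def am_update(theta_new, mu, Gamma, gamma):
    # One step of the mean/covariance recursions (v)-(vi) of Algorithm 23:
    #   mu_{k+1}    = mu_k + gamma_{k+1} (Theta_{k+1} - mu_k)
    #   Gamma_{k+1} = Gamma_k + gamma_{k+1} ((Theta_{k+1} - mu_k)(Theta_{k+1} - mu_k)^T - Gamma_k)
    diff = theta_new - mu                     # Theta_{k+1} - mu_k (old mean in both updates)
    mu_new = mu + gamma * diff
    Gamma_new = Gamma + gamma * (np.outer(diff, diff) - Gamma)
    return mu_new, Gamma_new
```

Within the main loop, one would call mu, Gamma = am_update(theta, mu, Gamma, gamma) after the accept/reject and tolerance-update steps, and draw the next proposal as rng.multivariate_normal(theta, (2.38 ** 2 / theta.size) * Gamma).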

Supplement E Details of extensions in Section 7

In case of a non-simple cut-off, the rejected samples may be ‘recycled’ in the estimator (Ceperley et al., 1977). This may improve the accuracy (but can also reduce accuracy in certain pathological cases; see Delmas and Jourdain (2009)). The ‘waste recycling’ estimator is

Eδ,ϵWR(f)=k=1nWk(δ,ϵ)[αδ(Θk,Yk;Θ~k+1,Y~k+1)f(Θ~k+1)+[1αδ(Θk,Yk;Θ~k+1,Y~k+1)]f(Θk)].E^{\mathrm{WR}}_{\delta,\epsilon}(f)=\sum_{k=1}^{n}W_{k}^{(\delta,\epsilon)}\big{[}\alpha_{\delta}(\Theta_{k},Y_{k};\tilde{\Theta}_{k+1},\tilde{Y}_{k+1})f(\tilde{\Theta}_{k+1})+[1-\alpha_{\delta}(\Theta_{k},Y_{k};\tilde{\Theta}_{k+1},\tilde{Y}_{k+1})]f(\Theta_{k})\big{]}.

When Eδ,ϵ(f)E_{\delta,\epsilon}(f) is consistent under Theorem 4, this is also a consistent estimator. Namely, as in the proof of Theorem 4, we find that (Θk,Yk,Θ~k+1,Y~k+1)k≥1(\Theta_{k},Y_{k},\tilde{\Theta}_{k+1},\tilde{Y}_{k+1})_{k\geq 1} is a Harris recurrent Markov chain with invariant distribution

π^δ(θ,y,θ~,y~)=π~δ(θ,y)q~(θ,y;θ~,y~),\hat{\pi}_{\delta}(\theta,y,\tilde{\theta},\tilde{y})=\tilde{\pi}_{\delta}(\theta,y)\tilde{q}(\theta,y;\tilde{\theta},\tilde{y}),

and π^ϵ(θ,y,θ~,y~)/π^δ(θ,y,θ~,y~)=cϵwδ,ϵ(y)\hat{\pi}_{\epsilon}(\theta,y,\tilde{\theta},\tilde{y})/\hat{\pi}_{\delta}(\theta,y,\tilde{\theta},\tilde{y})=c_{\epsilon}w_{\delta,\epsilon}(y), where q~(θ,y;θ,y)=q(θ,θ)g(yθ)\tilde{q}(\theta,y;\theta^{\prime},y^{\prime})=q(\theta,\theta^{\prime})g(y^{\prime}\mid\theta^{\prime}). Therefore, Eδ,ϵWR(f)E^{\mathrm{WR}}_{\delta,\epsilon}(f) is a strongly consistent estimator of

𝔼π^ϵ[αδ(Θ,Y;Θ~,Y~)f(Θ~)+[1αδ(Θ,Y;Θ~,Y~)]f(Θ)]=𝔼πϵ[f(Θ)].\mathbb{E}_{\hat{\pi}_{\epsilon}}\big{[}\alpha_{\delta}(\Theta,Y;\tilde{\Theta},\tilde{Y})f(\tilde{\Theta})+[1-\alpha_{\delta}(\Theta,Y;\tilde{\Theta},\tilde{Y})]f(\Theta)\big{]}=\mathbb{E}_{\pi_{\epsilon}}[f(\Theta)].

See (Rudolf and Sprungk, 2018; Schuster and Klebanov, 2018) for alternative waste recycling estimators based on importance sampling analogues.

A refined estimator may be formed as

E^δ,ϵ(f)=k=1nj=0mU^k,j(δ,ϵ)f(Θk)/=1ni=0mU^,i(δ,ϵ),\hat{E}_{\delta,\epsilon}(f)=\textstyle\sum_{k=1}^{n}\sum_{j=0}^{m}\hat{U}_{k,j}^{(\delta,\epsilon)}f(\Theta_{k})\big{/}\sum_{\ell=1}^{n}\sum_{i=0}^{m}\hat{U}_{\ell,i}^{(\delta,\epsilon)},

where U^k,0(δ,ϵ)=Uk(δ,ϵ)\hat{U}_{k,0}^{(\delta,\epsilon)}=U_{k}^{(\delta,\epsilon)} and U^k,j(δ,ϵ)=N^kϕ(T^k,j/ϵ)/ϕ(Tk/δ)\hat{U}_{k,j}^{(\delta,\epsilon)}=\hat{N}_{k}\phi(\hat{T}_{k,j}/\epsilon)\big{/}\phi(T_{k}/\delta), for j1j\geq 1, and where N^k\hat{N}_{k} is the number of independent random variables Z^1,Z^2,g(Θk)\hat{Z}_{1},\hat{Z}_{2},\ldots\sim g(\,\cdot\,\mid\Theta_{k}) generated before observing ϕ(T^k,N^k/δ)>0\phi(\hat{T}_{k,\hat{N}_{k}}/\delta)>0. The variables T^k,j=d(Z^j,y)\hat{T}_{k,j}=d(\hat{Z}_{j},y^{*}), and T^k=d(Y^k,y)\hat{T}_{k}=d(\hat{Y}_{k},y^{*}) with independent Y^kg(Θk)\hat{Y}_{k}\sim g(\,\cdot\,\mid\Theta_{k}). This ensures that

𝔼[N^kϕ(T^k,j/ϵ)Θk=θ,Yk=y]=Lϵ(θ)g(θ)(ϕ(d(Y,y)/δ)>0),\mathbb{E}[\hat{N}_{k}\phi(\hat{T}_{k,j}/\epsilon)\mid\Theta_{k}=\theta,Y_{k}=y]=\frac{L_{\epsilon}(\theta)}{\mathbb{P}_{g(\,\cdot\,\mid\theta)}\big{(}\phi\big{(}d(Y,y^{*})/\delta\big{)}>0\big{)}},

which is sufficient to ensure that ξk,j(f)=U^k,j(δ,ϵ)f(Θk)\xi_{k,j}(f)=\hat{U}_{k,j}^{(\delta,\epsilon)}f(\Theta_{k}) is a proper weighting scheme from π~δ\tilde{\pi}_{\delta} to πϵ\pi_{\epsilon}; see (Vihola et al., 2016, Proposition 17(ii)), and consequently the average ξk(f)=(m+1)1j=0mξk,j(f)\xi_{k}(f)=(m+1)^{-1}\sum_{j=0}^{m}\xi_{k,j}(f) is a proper weighting.
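To complement the estimators discussed above, the following is a minimal Python sketch of the basic post-corrected estimator E^{(n)}_{δ,ϵ}(f), which reweights the output (Θ_k, T_k) of abc-mcmc(δ) by U_k^{(δ,ϵ)} ∝ ϕ(T_k/ϵ)/ϕ(T_k/δ) (the proportionality constant cancels in the self-normalisation) for a range of finer tolerances ϵ ≤ δ. The function name is a placeholder, and the Gaussian-type default cut-off ϕ(u) = exp(−u²/2) is only an illustrative assumption; it need not coincide with the ϕ_Gauss used in the experiments.

```python
import numpy as np

def post_corrected_estimates(thetas, ts, delta, epsilons, f,
                             phi=lambda u: np.exp(-0.5 * u ** 2)):
    # Post-correction of abc-mcmc(delta) output (Theta_k, T_k), T_k = d(Y_k, y*):
    # weights U_k = phi(T_k/eps) / phi(T_k/delta), estimate = sum(U_k f) / sum(U_k).
    ts = np.asarray(ts, dtype=float)
    fx = np.asarray([f(th) for th in thetas], dtype=float)
    base = phi(ts / delta)                    # positive along the chain output
    estimates = {}
    for eps in epsilons:                      # finer tolerances eps <= delta
        u = phi(ts / eps) / base              # weights U_k^{(delta, eps)}
        estimates[eps] = float(np.sum(u * fx) / np.sum(u))
    return estimates
```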

Supplement F Supplementary results

Figure 6. Gaussian model with Gaussian cut-off. Estimates from a single run of abc-mcmc(33) (left) and estimates from 10,000 replications of abc-mcmc(δ\delta) for δ∈{0.1,0.825,1.55,2.275,3}\delta\in\{0.1,0.825,1.55,2.275,3\} indicated by colours.
Figure 7. Progress of tolerance adaptation (left) and histogram of acceptance rates (right) in the Gaussian model experiment with Gaussian cut-off.
Table 5. Root mean square errors (×10−2)(\times 10^{-2}) from abc-mcmc(δ\delta) for tolerances ϵ\epsilon in the Gaussian model, with the ϕsimple\phi_{\mathrm{simple}} and ϕGauss\phi_{\mathrm{Gauss}} cut-offs.
f(x)=xf(x)=x f(x)=|x|f(x)=|x| Acc.
Cut-off δ\delta \\backslash ϵ\epsilon 0.10 0.82 1.55 2.28 3.00 0.10 0.82 1.55 2.28 3.00 rate
ϕsimple\phi_{\mathrm{simple}} 0.1 9.68 5.54 0.03
0.82 8.99 3.81 5.38 2.14 0.22
1.55 9.21 3.66 3.59 5.5 2.17 1.96 0.33
2.28 9.67 3.86 3.6 3.97 5.85 2.28 2.02 2.08 0.4
3.0 10.36 4.03 3.71 3.98 4.51 6.21 2.42 2.12 2.16 2.26 0.43
ϕGauss\phi_{\mathrm{Gauss}} 0.1 7.97 4.47 0.05
0.82 7.12 3.67 4.22 2.08 0.29
1.55 7.82 3.39 4.35 4.68 1.99 2.52 0.38
2.28 8.94 3.59 3.81 5.52 5.26 2.2 2.29 3.29 0.41
3.0 9.93 4.01 3.97 4.81 6.76 5.95 2.44 2.44 2.92 4.1 0.42
Table 6. Coverage frequencies of the 95% confidence intervals in the Gaussian model for tolerance ϵ=0.1\epsilon=0.1, with fixed and with adaptive tolerance.
ϕsimple\phi_{\mathrm{simple}} ϕGauss\phi_{\mathrm{Gauss}}
Fixed tolerance Adapt Fixed tolerance Adapt
δ\delta 0.1 0.82 1.55 2.28 3.0 0.64 0.1 0.82 1.55 2.28 3.0 0.28
xx 0.93 0.97 0.97 0.98 0.98 0.96 0.93 0.94 0.94 0.95 0.95 0.93
|x||x| 0.92 0.95 0.96 0.96 0.96 0.96 0.93 0.92 0.94 0.95 0.95 0.92
Table 7. Root mean square errors and acceptance rates in the Lotka-Volterra experiment.
f(θ)=θ1f(\theta)=\theta_{1} f(θ)=θ2f(\theta)=\theta_{2} f(θ)=θ3f(\theta)=\theta_{3} Acc.
δ\delta \\backslash ϵ\epsilon 80 110 140 170 200 80 110 140 170 200 80 110 140 170 200 rate
ϕsimple\phi_{\mathrm{simple}} 80 2.37 1.32 2.94 0.05
110 1.81 1.48 0.99 0.86 2.26 1.88 0.07
140 1.75 1.41 1.22 0.93 0.77 0.68 2.11 1.69 1.4 0.1
170 1.83 1.35 1.14 1.05 0.96 0.75 0.64 0.6 2.14 1.65 1.33 1.15 0.14
200 1.93 1.41 1.11 0.97 0.95 1.06 0.75 0.61 0.56 0.6 2.37 1.74 1.36 1.16 1.09 0.17
regr. ϕEpa\phi_{\mathrm{Epa}} 80 3.1 1.52 2.77 0.05
110 2.74 1.99 1.39 1.0 2.53 1.81 0.07
140 3.02 2.08 1.56 1.54 1.05 0.79 2.76 1.9 1.39 0.1
170 3.09 2.13 1.6 1.31 1.61 1.09 0.83 0.69 2.85 1.95 1.46 1.16 0.14
200 3.19 2.2 1.68 1.36 1.15 1.63 1.1 0.84 0.71 0.63 2.91 2.04 1.52 1.21 1.01 0.17
Figure 8. Lotka-Volterra model with simple cut-off and step size n−2/3n^{-2/3}. Estimates from a single run of abc-mcmc(200200) (left) and estimates from 1,000 replications of abc-mcmc(δ\delta) for δ∈{80,110,140,170,200}\delta\in\{80,110,140,170,200\} indicated by colours.
Table 8. Coverages of confidence intervals from abc-mcmc(δ\delta) for tolerance ϵ=80\epsilon=80, with fixed tolerance and with adaptive tolerance in the Lotka-Volterra model.
Post-correction, simple cut-off Regression, Epanechnikov cut-off
Fixed tolerance Adapt Fixed tolerance Adapt
δ\delta 80.0 110.0 140.0 170.0 200.0 122.6 80.0 110.0 140.0 170.0 200.0 122.6
θ1\theta_{1} 0.8 0.97 0.99 0.99 1.0 0.93 0.75 0.92 0.93 0.93 0.96 0.9
θ2\theta_{2} 0.73 0.94 0.98 0.98 0.99 0.84 0.76 0.93 0.94 0.96 0.98 0.9
θ3\theta_{3} 0.74 0.94 0.98 0.99 0.99 0.86 0.68 0.87 0.9 0.92 0.95 0.83
Table 9. Frequencies of the 95% confidence intervals and mean acceptance rates in the Lotka-Volterra experiment with step size n2/3n^{-2/3}.
f(θ)=θ1f(\theta)=\theta_{1} f(θ)=θ2f(\theta)=\theta_{2} f(θ)=θ3f(\theta)=\theta_{3} Acc.
δ\delta \\backslash ϵ\epsilon 80 110 140 170 200 80 110 140 170 200 80 110 140 170 200 rate
ϕsimple\phi_{\mathrm{simple}} 80 0.32 0.11 0.11 0.07
110 0.91 0.78 0.76 0.52 0.79 0.56 0.09
140 0.97 0.96 0.91 0.95 0.88 0.8 0.96 0.9 0.86 0.12
170 0.98 0.98 0.97 0.94 0.97 0.97 0.93 0.87 0.97 0.97 0.95 0.91 0.15
200 0.99 0.98 0.98 0.96 0.93 0.99 0.99 0.98 0.94 0.87 0.98 0.98 0.97 0.94 0.92 0.18
regr. ϕEpa\phi_{\mathrm{Epa}} 80 0.34 0.41 0.25 0.07
110 0.86 0.81 0.88 0.87 0.81 0.82 0.09
140 0.94 0.94 0.92 0.95 0.95 0.96 0.9 0.91 0.91 0.12
170 0.95 0.95 0.95 0.95 0.96 0.97 0.97 0.97 0.92 0.94 0.94 0.95 0.15
200 0.95 0.95 0.95 0.95 0.94 0.96 0.97 0.97 0.98 0.98 0.92 0.94 0.95 0.95 0.95 0.18
Table 10. Root mean square errors and acceptance rates in the Lotka-Volterra experiment with step size n2/3n^{-2/3}.
f(θ)=θ1f(\theta)=\theta_{1} f(θ)=θ2f(\theta)=\theta_{2} f(θ)=θ3f(\theta)=\theta_{3} Acc.
δ\delta \\backslash ϵ\epsilon 80 110 140 170 200 80 110 140 170 200 80 110 140 170 200 rate
ϕsimple\phi_{\mathrm{simple}} 80 3.24 2.2 4.67 0.07
110 2.12 2.14 1.14 1.38 2.69 3.17 0.09
140 1.87 1.56 1.49 0.89 0.81 0.79 2.1 1.82 1.63 0.12
170 1.77 1.27 1.05 0.96 0.87 0.68 0.59 0.59 2.14 1.6 1.31 1.19 0.15
200 1.94 1.45 1.2 1.11 1.08 0.95 0.69 0.59 0.54 0.57 2.44 1.95 1.68 1.58 1.52 0.18
regr. ϕEpa\phi_{\mathrm{Epa}} 80 2.67 1.14 2.17 0.07
110 2.88 2.18 1.27 0.9 2.36 1.76 0.09
140 2.67 1.98 1.61 1.38 1.02 0.83 2.57 1.91 1.54 0.12
170 2.89 1.98 1.49 1.21 1.46 0.98 0.74 0.61 2.63 1.79 1.34 1.08 0.15
200 3.57 2.85 2.46 4.93 1.2 1.82 1.41 1.21 1.42 0.63 3.11 2.32 1.88 1.81 1.22 0.18
Table 11. Root mean square errors of estimators from abc-mcmc(δ\delta) for tolerance ϵ=80\epsilon=80, with fixed tolerance and with adaptive tolerance in the Lotka-Volterra model with step size n2/3n^{-2/3}.
Post-correction, simple cut-off Regression, Epanechnikov cut-off
Fixed tolerance Adapt Fixed tolerance Adapt
δ\delta 80 110 140 170 200 122.6 80 110 140 170 200 122.6
θ1\theta_{1} (×102)(\times 10^{-2}) 3.24 2.12 1.87 1.77 1.94 1.8 2.67 2.88 2.67 2.89 3.57 2.57
θ2\theta_{2} (×104)(\times 10^{-4}) 2.2 1.14 0.89 0.87 0.95 1.04 1.14 1.27 1.38 1.46 1.82 1.28
θ3\theta_{3} (×102)(\times 10^{-2}) 4.67 2.69 2.1 2.14 2.44 2.34 2.17 2.36 2.57 2.63 3.11 2.34