
Unbiased Markov chain Monte Carlo with couplings

Pierre E. Jacob (1), John O’Leary (1), Yves F. Atchadé (2)
(1) Department of Statistics, Harvard University, Cambridge, USA. Emails: pjacob@fas.harvard.edu, joleary@g.harvard.edu
(2) Department of Mathematics & Statistics, Boston University, Boston, USA. Email: atchade@bu.edu
Abstract

Markov chain Monte Carlo (MCMC) methods provide consistent approximations of integrals as the number of iterations goes to infinity. MCMC estimators are generally biased after any fixed number of iterations. We propose to remove this bias by using couplings of Markov chains together with a telescopic sum argument of Glynn and Rhee (2014). The resulting unbiased estimators can be computed independently in parallel. We discuss practical couplings for popular MCMC algorithms. We establish the theoretical validity of the proposed estimators and study their efficiency relative to the underlying MCMC algorithms. Finally, we illustrate the performance and limitations of the method on toy examples, on an Ising model around its critical temperature, on a high-dimensional variable selection problem, and on an approximation of the cut distribution arising in Bayesian inference for models made of multiple modules.

1 Introduction

Markov chain Monte Carlo (MCMC) methods constitute a popular class of algorithms to approximate high-dimensional integrals arising in statistics and other fields (Liu, 2008; Robert and Casella, 2004; Brooks et al., 2011; Green et al., 2015). These iterative methods provide estimators that are consistent as the number of iterations grows large but potentially biased for any fixed number of iterations, which discourages the parallel execution of many short chains (Rosenthal, 2000). Consequently, efforts have focused on exploiting parallel processors within each iteration (Tjelmeland, 2004; Brockwell, 2006; Lee et al., 2010; Jacob et al., 2011; Calderhead, 2014; Goudie et al., 2017; Yang et al., 2017) and on the design of parallel chains targeting different distributions (Altekar et al., 2004; Wang et al., 2015; Srivastava et al., 2015). Still, MCMC estimators are ultimately justified by asymptotics in the number of iterations, which is discordant with current trends in computing hardware, characterized by increasing parallelism but stagnating clock speeds.

In this paper we propose a general construction to produce unbiased estimators of integrals with respect to a target probability distribution from MCMC kernels. The lack of bias means that these estimators can be implemented on parallel processors in the framework of Glynn and Heidelberger (1991), without communication between processors. Confidence intervals can be constructed with asymptotic guarantees in the number of processors, in contrast with standard MCMC confidence intervals that are justified asymptotically in the number of iterations (e.g. Flegal et al., 2008; Gong and Flegal, 2016; Atchadé, 2016; Vats et al., 2018). The lack of bias has additional benefits, as discussed in Section 5.5 in which we make use of its interplay with the law of iterated expectations to perform modular inference; see also the discussion in Section 6.

Our contribution follows the path-breaking work of Glynn and Rhee (2014), which uses couplings to construct unbiased estimators of integrals with respect to an invariant distribution. They illustrate their construction on Markov chains represented by iterated random functions, leveraging the contraction properties of such functions. Glynn and Rhee (2014) also consider Harris recurrent chains for which an explicit minorization condition holds. Previously, McLeish (2011) employed similar debiasing techniques to obtain “nearly unbiased” estimators from a single MCMC chain. More recently Jacob et al. (2019) remove the bias from conditional particle filters (Andrieu et al., 2010) by coupling chains so that they meet in finite time. The present article brings this type of “Rhee–Glynn” construction to generic MCMC algorithms, with a novel analysis of estimator efficiency and a variety of examples. Our proposed construction involves couplings of MCMC algorithms, which we discuss for generic Metropolis–Hastings and Gibbs samplers.

Couplings have been used to study the convergence properties of MCMC algorithms from both theoretical and practical points of view (e.g. Reutter and Johnson, 1995; Johnson, 1996; Rosenthal, 1997; Johnson, 1998; Neal, 1999; Roberts and Rosenthal, 2004; Johnson, 2013; Johndrow and Mattingly, 2017). Couplings also underpin perfect samplers (Propp and Wilson, 1996; Murdoch and Green, 1998; Casella et al., 2001; Flegal and Herbei, 2012; Lee et al., 2014; Huber, 2016). A notable aspect of the approach of Glynn and Rhee (2014) preserved in our method is that only two chains have to be coupled for the proposed estimator to be unbiased, without further assumptions on the state space or target distribution. Thus the approach applies more broadly than perfect samplers (see Glynn, 2016) while yielding unbiased estimators rather than exact samples. Coupling pairs of Markov chains also forms the basis of the approach of Neal (1999), with a similar motivation for parallel computation. The proposed estimation technique also shares aims with regeneration methods (e.g. Mykland et al., 1995; Brockwell and Kadane, 2005), and we propose a numerical comparison in Section 5.2.

In Section 2 we introduce our estimators and present a coupling of random walk Metropolis–Hastings chains as an illustration. In Section 3 we establish the efficiency properties of these estimators, discuss the verification of key assumptions, and describe the use of the proposed estimators on parallel processors in light of results from e.g. Glynn and Heidelberger (1991). In Section 4 we describe how to couple some important MCMC algorithms and illustrate the effect of dimension on algorithm performance with a multivariate Normal target. Section 5 contains more challenging examples including a multimodal target, a comparison with regeneration methods, sampling problems in large-dimensional discrete spaces arising in Bayesian variable selection and Ising models, and an application to modular inference. We discuss our findings in Section 6. Scripts in R (R Core Team, 2015) are available at https://github.com/pierrejacob/unbiasedmcmc and supplementary materials are available online.

2 Unbiased estimation from coupled chains

2.1 Rhee–Glynn estimator

Given a target probability distribution $\pi$ on a Polish space $\mathcal{X}$ and a measurable real-valued test function $h$ integrable with respect to $\pi$, we want to estimate the expectation $\mathbb{E}_{\pi}[h(X)]=\int h(x)\pi(dx)$. Let $P$ denote a Markov transition kernel on $\mathcal{X}$ that leaves $\pi$ invariant, and let $\pi_{0}$ be some initial probability distribution on $\mathcal{X}$. Our estimators are based on a coupled pair of Markov chains $(X_{t})_{t\geq 0}$ and $(Y_{t})_{t\geq 0}$, which marginally start from $\pi_{0}$ and evolve according to $P$. In particular, we suppose that $\bar{P}$ is a transition kernel on the joint space $\mathcal{X}\times\mathcal{X}$ such that $\bar{P}((x,y),A\times\mathcal{X})=P(x,A)$ and $\bar{P}((x,y),\mathcal{X}\times A)=P(y,A)$ for any $x,y\in\mathcal{X}$ and any measurable set $A$. We then construct the coupled Markov chain $(X_{t},Y_{t})_{t\geq 0}$ as follows. We draw $(X_{0},Y_{0})$ such that $X_{0}\sim\pi_{0}$ and $Y_{0}\sim\pi_{0}$. Given $(X_{0},Y_{0})$ we draw $X_{1}\sim P(X_{0},\cdot)$, and then for any $t\geq 1$, given $X_{0},(X_{1},Y_{0}),\ldots,(X_{t},Y_{t-1})$, we draw $(X_{t+1},Y_{t})\sim\bar{P}((X_{t},Y_{t-1}),\cdot)$. We consider the following assumptions.

Assumption 2.1.

As $t\to\infty$, $\mathbb{E}[h(X_{t})]\to\mathbb{E}_{\pi}[h(X)]$. Furthermore, there exist $\eta>0$ and $D<\infty$ such that $\mathbb{E}[|h(X_{t})|^{2+\eta}]\leq D$ for all $t\geq 0$.

Assumption 2.2.

The chains are such that the meeting time $\tau:=\inf\{t\geq 1:\ X_{t}=Y_{t-1}\}$ satisfies $\mathbb{P}(\tau>t)\leq C\delta^{t}$ for all $t\geq 0$, for some constants $C<\infty$ and $\delta\in(0,1)$.

Assumption 2.3.

The chains stay together after meeting, i.e. $X_{t}=Y_{t-1}$ for all $t\geq\tau$.

By construction, each of the marginal chains $(X_{t})_{t\geq 0}$ and $(Y_{t})_{t\geq 0}$ has initial distribution $\pi_{0}$ and transition kernel $P$. Assumption 2.1 requires these chains to result in a uniformly bounded $(2+\eta)$-moment of $h$; more discussion on moments of Markov chains can be found in Tweedie (1983). Since $X_{0}$ and $Y_{0}$ may be drawn from any coupling of $\pi_{0}$ with itself, it is possible to set $X_{0}=Y_{0}$. However, $X_{1}$ is then generated from $P(X_{0},\cdot)$, so that $X_{1}\neq Y_{0}$ in general. Thus one cannot force the meeting time to be small by setting $X_{0}=Y_{0}$. Assumption 2.2 puts a condition on the coupling operated by $\bar{P}$, and would not in general be satisfied for an independent coupling. Coupled kernels must be carefully designed, using e.g. common random numbers and maximal couplings, for Assumption 2.2 to be satisfied. We present a simple case in Section 2.2 and further examples in Section 4. We stress that the state space is not assumed to be discrete, and that the constants $D$ and $\eta$ of Assumption 2.1 and $C$ and $\delta$ of Assumption 2.2 do not need to be known to implement the proposed approach. Assumption 2.3 typically holds by design; coupled chains that stay identical after meeting are termed “faithful” in Rosenthal (1997).
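As a concrete illustration of the lag-one construction and of a faithful coupling, the following sketch simulates the coupled chain until the meeting time. The generic simulator and the toy example (a uniform target on $\{0,\ldots,9\}$, where each chain resamples uniformly with probability one half and otherwise holds, coupled through a common uniform variable) are our own illustrative assumptions, not code from our released R scripts.

```python
import numpy as np

rng = np.random.default_rng(0)

def coupled_chain(sample_pi0, sample_P, sample_Pbar, max_iter=10_000):
    """Simulate the lag-one coupled chain: X runs one step ahead of Y.

    Returns the X chain, the Y chain and the meeting time
    tau = inf{t >= 1 : X_t = Y_{t-1}}, or None if the chains did not meet."""
    X = [sample_pi0()]
    Y = [sample_pi0()]
    X.append(sample_P(X[0]))                 # X_1 ~ P(X_0, .)
    for t in range(1, max_iter):
        if X[t] == Y[t - 1]:                 # meeting event
            return X, Y, t
        x_next, y_next = sample_Pbar(X[t], Y[t - 1])  # (X_{t+1}, Y_t) ~ Pbar
        X.append(x_next)
        Y.append(y_next)
    return X, Y, None

# toy example: pi = Uniform{0,...,9}; marginally each chain resamples
# uniformly with probability 1/2 and holds otherwise, so pi is invariant;
# the coupled kernel shares the coin and the fresh draw (common random
# numbers), so both chains jump to the same value and then stay together
def sample_pi0():
    return int(rng.integers(10))

def sample_P(x):
    return int(rng.integers(10)) if rng.uniform() < 0.5 else x

def sample_Pbar(x, y):
    if rng.uniform() < 0.5:                  # common coin: joint jump
        z = int(rng.integers(10))
        return z, z                          # the chains meet here
    return x, y
```

In this toy example the meeting time is geometric, so Assumption 2.2 holds; the coupling is faithful by construction, so Assumption 2.3 holds as well.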

Under these assumptions we introduce the following motivation for an unbiased estimator of $\mathbb{E}_{\pi}[h(X)]$, following Glynn and Rhee (2014). We begin by writing $\mathbb{E}_{\pi}[h(X)]$ as $\lim_{t\to\infty}\mathbb{E}[h(X_{t})]$. Then for any fixed $k\geq 0$,

\begin{align*}
\mathbb{E}_{\pi}[h(X)] &= \mathbb{E}[h(X_{k})]+\sum_{t=k+1}^{\infty}\left(\mathbb{E}[h(X_{t})]-\mathbb{E}[h(X_{t-1})]\right) && \text{expanding the limit as a telescoping sum,}\\
&= \mathbb{E}[h(X_{k})]+\sum_{t=k+1}^{\infty}\left(\mathbb{E}[h(X_{t})]-\mathbb{E}[h(Y_{t-1})]\right) && \text{since the chains have the same marginals,}\\
&= \mathbb{E}\left[h(X_{k})+\sum_{t=k+1}^{\infty}\left(h(X_{t})-h(Y_{t-1})\right)\right] && \text{swapping the expectation and the limit,}\\
&= \mathbb{E}\left[h(X_{k})+\sum_{t=k+1}^{\tau-1}\left(h(X_{t})-h(Y_{t-1})\right)\right] && \text{by Assumption 2.3.}
\end{align*}

We note that the sum in the last equation is zero if $k+1>\tau-1$. The heuristic argument above suggests that the estimator $H_{k}(X,Y)=h(X_{k})+\sum_{t=k+1}^{\tau-1}(h(X_{t})-h(Y_{t-1}))$ should have expectation $\mathbb{E}_{\pi}[h(X)]$. We observe that this estimator requires $\tau-1$ calls to $\bar{P}$ and $\max(1,k+1-\tau)$ calls to $P$; thus under Assumption 2.2 its cost has a finite expectation.

In Section 3 we establish the validity of the estimator under the three conditions above; this formally justifies the swap of expectation and limit. The estimator can be viewed as a debiased version of $h(X_{k})$. Unbiasedness is guaranteed for any choice of $k\geq 0$, but both the cost and the variance of $H_{k}(X,Y)$ are sensitive to $k$; see Section 3.1. Thanks to this unbiasedness property, we can sample $R\in\mathbb{N}$ independent copies of $H_{k}(X,Y)$ in parallel and average the results to estimate $\mathbb{E}_{\pi}[h(X)]$ consistently as $R\to\infty$; we defer further considerations on the use of unbiased estimators on parallel processors to Section 3.3.

Before presenting examples and enhancements to the estimator above, we discuss the relationship between our approach and existing work. There is a rich literature applying forward couplings to study the convergence of Markov chains (Johnson, 1996, 1998; Thorisson, 2000; Lindvall, 2002; Rosenthal, 2002; Johnson, 2013; Douc et al., 2004; Nikooienejad et al., 2016), and to obtain new algorithms such as perfect samplers (Huber, 2016) and the methods of Neal (1999) and Neal and Pinto (2001). Our approach is closely related to Glynn and Rhee (2014), who employ pairs of Markov chains to obtain unbiased estimators. The present work combines similar arguments with couplings of MCMC algorithms and proposes further improvements to remove bias at a reduced loss of efficiency.

Indeed, Glynn and Rhee (2014) did not apply their methodology to the MCMC setting. They consider chains associated with contractive iterated random functions (see also Diaconis and Freedman, 1999), and Harris recurrent chains with an explicit minorization condition. A minorization condition refers to a small set $\mathcal{C}$, a constant $\lambda>0$, an integer $m\geq 1$, and a probability measure $\nu$ such that $P^{m}(x,A)\geq\lambda\nu(A)$ for all $x\in\mathcal{C}$ and every measurable set $A$. Such a condition is said to be explicit if the set, constant and probability measure are known to the user. Finding explicit small sets that are useful in practice can present a technical challenge, even for MCMC experts (see discussion and references in Cowles and Rosenthal, 1998). When available, explicit minorization conditions can also be employed to identify regeneration times, yielding estimators amenable to parallel computation in the framework of Mykland et al. (1995) and Brockwell and Kadane (2005). By contrast, Johnson (1996, 1998) and Neal (1999) address the question of coupling MCMC algorithms so that pairs of chains meet exactly, without analytical knowledge of the target distribution. The present article focuses on the use of couplings of this type in the framework of Glynn and Rhee (2014).

2.2 Coupled Metropolis–Hastings example

Before further examination of our estimator and its properties, we present a coupling of Metropolis–Hastings (MH) chains that will typically satisfy Assumptions 2.1-2.3 in realistic settings; this coupling was proposed in Johnson (1998) as part of a method to diagnose convergence. We postpone discussion of other couplings of MCMC algorithms to Section 4. We recall that each iteration $t$ of the MH algorithm (Hastings, 1970) begins by drawing a proposal $X^{\star}$ from a Markov kernel $q(X_{t},\cdot)$, where $X_{t}$ is the current state. The next state is set to $X_{t+1}=X^{\star}$ if $U\leq \pi(X^{\star})q(X^{\star},X_{t})/(\pi(X_{t})q(X_{t},X^{\star}))$, where $U$ denotes a uniform random variable on $[0,1]$, and $X_{t+1}=X_{t}$ otherwise.

We define a pair of chains so that each proceeds marginally according to the MH algorithm and jointly so that the chains will meet exactly after a random number of steps. We suppose that the pair of chains is in states $X_{t}$ and $Y_{t-1}$, and consider how to generate $X_{t+1}$ and $Y_{t}$ so that the event $\{X_{t+1}=Y_{t}\}$ might occur.

If $X_{t}\neq Y_{t-1}$, the event $\{X_{t+1}=Y_{t}\}$ cannot occur if both chains reject their respective proposals, $X^{\star}$ and $Y^{\star}$. Meeting will occur if these proposals are identical and if both are accepted. Marginally, the proposals follow $X^{\star}\,|\,X_{t}\sim q(X_{t},\cdot)$ and $Y^{\star}\,|\,Y_{t-1}\sim q(Y_{t-1},\cdot)$. If $q(x,x^{\star})$ can be evaluated for all $x,x^{\star}$, then one can sample from a maximal coupling of the two proposal distributions, which is a coupling of $q(X_{t},\cdot)$ and $q(Y_{t-1},\cdot)$ maximizing the probability of the event $\{X^{\star}=Y^{\star}\}$. How to sample from maximal couplings of continuous distributions is described in Thorisson (2000) and in Section 4.1. One can accept or reject the two proposals using a common uniform random variable $U$. The chains will stay together after they meet: at each step after meeting, the proposals will be identical with probability one, and jointly accepted or rejected with a common uniform variable. This coupling requires neither explicit minorization conditions nor contractive properties of a random function representation of the chain.
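For illustration, the coupled MH kernel above can be sketched as follows for a univariate Normal random walk proposal, whose symmetry simplifies the acceptance ratio to $\pi(X^{\star})/\pi(X_t)$. The maximal coupling sampler follows the standard rejection scheme described in Thorisson (2000); the function names, the step size and the target are illustrative assumptions, not our released R scripts.

```python
import numpy as np

rng = np.random.default_rng(42)

def normal_logpdf(z, mu, sigma):
    return -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def maximal_coupling(mu1, mu2, sigma):
    """Draw (X*, Y*) from a maximal coupling of N(mu1, sigma^2) and
    N(mu2, sigma^2), maximizing the probability of {X* = Y*}."""
    x = rng.normal(mu1, sigma)
    # accept X* as the common value with probability min(1, q2(X*)/q1(X*))
    if np.log(rng.uniform()) + normal_logpdf(x, mu1, sigma) <= normal_logpdf(x, mu2, sigma):
        return x, x                              # identical proposals
    while True:                                  # otherwise sample Y* from the residual
        y = rng.normal(mu2, sigma)
        if np.log(rng.uniform()) + normal_logpdf(y, mu2, sigma) > normal_logpdf(y, mu1, sigma):
            return x, y

def coupled_mh_step(x, y, log_target, sigma):
    """One coupled random walk MH step: maximally coupled proposals,
    accepted or rejected with a common uniform variable (symmetric q)."""
    xs, ys = maximal_coupling(x, y, sigma)
    logu = np.log(rng.uniform())                 # common uniform U
    x_new = xs if logu <= log_target(xs) - log_target(x) else x
    y_new = ys if logu <= log_target(ys) - log_target(y) else y
    return x_new, y_new
```

Once the two states coincide, the maximal coupling returns identical proposals with probability one and the common uniform yields a common accept/reject decision, so the coupled chains are faithful in the sense of Assumption 2.3.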

2.3 Time-averaged estimator

To motivate our next estimator, we note that we can compute $H_{k}(X,Y)$ for several values of $k$ from the same realization of the coupled chains, and that the average of these estimators is unbiased as well. For any fixed integer $m$ with $m\geq k$, we can run the coupled chains for $\max(m,\tau)$ iterations, compute the estimator $H_{\ell}(X,Y)$ for each $\ell\in\{k,\ldots,m\}$, and take the average $H_{k:m}(X,Y)=(m-k+1)^{-1}\sum_{\ell=k}^{m}H_{\ell}(X,Y)$, as we summarize in Algorithm 1. We refer to $H_{k:m}(X,Y)$ as the time-averaged estimator; the estimator $H_{k}(X,Y)$ is retrieved when $m=k$. Alternatively we could average the estimators $H_{\ell}(X,Y)$ using weights $w_{\ell}\in\mathbb{R}$ for $\ell\in\{k,\ldots,m\}$, to obtain $\sum_{\ell=k}^{m}w_{\ell}H_{\ell}(X,Y)$. This will be unbiased if $\sum_{\ell=k}^{m}w_{\ell}=1$.

  1. Draw $X_{0}$, $Y_{0}$ from an initial distribution $\pi_{0}$ and draw $X_{1}\sim P(X_{0},\cdot)$.

  2. Set $t=1$. While $t<\max(m,\tau)$, where $\tau=\inf\{t\geq 1:\ X_{t}=Y_{t-1}\}$:

     • draw $(X_{t+1},Y_{t})\sim\bar{P}((X_{t},Y_{t-1}),\cdot)$;

     • set $t\leftarrow t+1$.

  3. For each $\ell\in\{k,\ldots,m\}$, compute $H_{\ell}(X,Y)=h(X_{\ell})+\sum_{t=\ell+1}^{\tau-1}(h(X_{t})-h(Y_{t-1}))$.

  4. Return $H_{k:m}(X,Y)=(m-k+1)^{-1}\sum_{\ell=k}^{m}H_{\ell}(X,Y)$; or compute $H_{k:m}(X,Y)$ with (2.1).

Algorithm 1 Unbiased “time-averaged” estimator $H_{k:m}(X,Y)$ of $\mathbb{E}_{\pi}[h(X)]$.
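A minimal end-to-end sketch of Algorithm 1 on a toy discrete chain may help fix ideas. The uniform target on $\{0,\ldots,9\}$, the common-random-number coupling and all names below are our own illustrative assumptions; after the meeting time the sketch continues the $X$ chain alone with $P$, which is equivalent by faithfulness since $Y$ is no longer needed past $\tau-2$.

```python
import numpy as np

rng = np.random.default_rng(7)

# toy target pi = Uniform{0,...,9}: marginally each chain resamples
# uniformly with probability 1/2 and holds otherwise; the coupled kernel
# shares the coin and the fresh draw, so the chains meet and stay together
def sample_pi0():
    return int(rng.integers(10))

def sample_P(x):
    return int(rng.integers(10)) if rng.uniform() < 0.5 else x

def sample_Pbar(x, y):
    if rng.uniform() < 0.5:
        z = int(rng.integers(10))
        return z, z
    return x, y

def Hkm(h, k, m):
    """One realization of the time-averaged estimator H_{k:m} (Algorithm 1)."""
    X = [sample_pi0()]
    Y = [sample_pi0()]
    X.append(sample_P(X[0]))                       # step 1: X_1 ~ P(X_0, .)
    t = 1
    while X[t] != Y[t - 1]:                        # step 2, until meeting
        x_next, y_next = sample_Pbar(X[t], Y[t - 1])
        X.append(x_next)
        Y.append(y_next)
        t += 1
    tau = t
    while len(X) < m + 1:                          # continue X alone if m >= tau
        X.append(sample_P(X[-1]))
    # steps 3-4: average the estimators H_l for l = k,...,m
    Hs = [h(X[l]) + sum(h(X[s]) - h(Y[s - 1]) for s in range(l + 1, tau))
          for l in range(k, m + 1)]
    return sum(Hs) / (m - k + 1)

# averaging R independent copies estimates E_pi[h(X)] = 4.5 for h(x) = x
estimates = [Hkm(h=lambda x: x, k=2, m=10) for _ in range(3000)]
```

Averaging the independent copies in `estimates` recovers the target expectation without burn-in bias, as Proposition 3.1 below guarantees in general.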

Rearranging terms in $(m-k+1)^{-1}\sum_{\ell=k}^{m}H_{\ell}(X,Y)$, we can write the time-averaged estimator as

H_{k:m}(X,Y)=\frac{1}{m-k+1}\sum_{\ell=k}^{m}h(X_{\ell})+\sum_{\ell=k+1}^{\tau-1}\min\left(1,\frac{\ell-k}{m-k+1}\right)\left(h(X_{\ell})-h(Y_{\ell-1})\right). \quad (2.1)

The term $(m-k+1)^{-1}\sum_{\ell=k}^{m}h(X_{\ell})$ corresponds to a standard MCMC average with $m$ total iterations and a burn-in period of $k-1$ iterations. We can interpret the other term as a bias correction. If $\tau\leq k+1$, then the correction term equals zero. This provides some intuition for the choice of $k$ and $m$: large $k$ values lead to the bias correction being equal to zero with large probability, and large values of $m$ result in $H_{k:m}(X,Y)$ being similar to an estimator obtained from a long MCMC run. Thus we expect the variance of $H_{k:m}(X,Y)$ to be similar to that of MCMC estimators for appropriate choices of $k$ and $m$.

The estimator $H_{k:m}(X,Y)$ requires $\tau-1$ calls to $\bar{P}$ and $\max(1,m+1-\tau)$ calls to $P$, which is overall comparable to $m$ calls to $P$ when $m$ is large. Indeed, for the proposed couplings, calls to $\bar{P}$ are approximately twice as expensive as calls to $P$. Therefore, the cost of $H_{k:m}(X,Y)$ is comparable to $2(\tau-1)+\max(1,m+1-\tau)$ iterations of the underlying MCMC algorithm. Thus both the variance and the cost of $H_{k:m}(X,Y)$ will approach those of MCMC estimators for large values of $k$ and $m$. This motivates the use of the estimator $H_{k:m}(X,Y)$ with $m>k$, which allows us to control the loss of efficiency associated with the removal of burn-in bias, in contrast with the basic estimator $H_{k}(X,Y)$ of Section 2.1. We discuss the choice of $k$ and $m$ in further detail in Section 3 and in the subsequent experiments. A variant of (2.1) can be obtained by considering a time lag greater than one between the two chains $(X_{t})_{t\geq 0}$ and $(Y_{t})_{t\geq 0}$, with the meeting time defined as the first time $t$ at which $\{X_{t}=Y_{t-\text{lag}}\}$ occurs. This introduces another tuning parameter but is found to be fruitful in Biswas and Jacob (2019).

We conclude this section with a few remarks on practical implementations. First, the test function $h$ does not have to be specified at run-time in Algorithm 1. One can store the coupled chains and choose the test function later. Also, one typically resorts to thinning the output of an MCMC sampler if the memory cost of storing chains is prohibitive, or if the cost of evaluating the test function of interest is significant compared to the cost of each MCMC iteration (e.g. Owen, 2017). This is feasible in the proposed framework: one could consider a variation of Algorithm 1 where each call to the Markov kernels $P$ and $\bar{P}$ would be replaced by multiple calls to them. We also observe that the proposed estimators can take values outside of the range of the test function $h$; for instance they can take negative values even if the range of the test function contains only non-negative values.

Finally, we stress the difficulty inherent in choosing an initial distribution $\pi_{0}$. The estimators are unbiased for any choice of $\pi_{0}$, including point masses, but this choice has an impact on both the computing cost and the variance. There is also a choice about whether to draw $X_{0}$ and $Y_{0}$ independently from $\pi_{0}$ or not; in our experiments we use independent draws. We will see in Section 5.1 that unfortunate choices of initial distributions can severely affect the performance of the proposed estimators. This suggests trying more than one choice of initialization, especially in the setting of multimodal targets. Overall the choice of $\pi_{0}$ and its relative importance compared to standard MCMC are open questions.

2.4 Signed measure estimator

We can formulate the proposed estimation procedure in terms of a signed measure $\hat{\pi}$ defined by

\hat{\pi}=\frac{1}{m-k+1}\sum_{\ell=k}^{m}\delta_{X_{\ell}}+\sum_{\ell=k+1}^{\tau-1}\min\left(1,\frac{\ell-k}{m-k+1}\right)\left(\delta_{X_{\ell}}-\delta_{Y_{\ell-1}}\right), \quad (2.2)

obtained by replacing test function evaluations by delta masses in (2.1), as in Section 4 of Glynn and Rhee (2014). The measure $\hat{\pi}$ is of the form $\hat{\pi}=\sum_{\ell=1}^{N}\omega_{\ell}\delta_{Z_{\ell}}$, where the weights satisfy $\sum_{\ell=1}^{N}\omega_{\ell}=1$ and where the atoms $(Z_{\ell})$ are values among the history of the coupled chains. Some of the weights $(\omega_{\ell})$ may be negative, making $\hat{\pi}$ a signed empirical measure. In this view the unbiasedness property states that $\mathbb{E}[\sum_{\ell=1}^{N}\omega_{\ell}h(Z_{\ell})]=\mathbb{E}_{\pi}[h(X)]$ for a test function $h$.

One can consider the convergence behavior of $\hat{\pi}^{R}=R^{-1}\sum_{r=1}^{R}\hat{\pi}^{(r)}$ towards $\pi$, where $(\hat{\pi}^{(r)})$ for $r\in\{1,\ldots,R\}$ are independent replications of $\hat{\pi}$. Glynn and Rhee (2014) obtain a Glivenko–Cantelli result for a similar measure related to their estimator. In the current setting, assume for simplicity that $\pi$ is univariate, or else consider only one of its marginals. To emphasize the importance of the number of replications $R$, we rewrite the weights and atoms as $\hat{\pi}^{R}=\sum_{\ell=1}^{N_{R}}\omega_{\ell}\delta_{Z_{\ell}}$. Introduce the function $s\mapsto\hat{F}^{R}(s)=\sum_{\ell=1}^{N_{R}}\omega_{\ell}\mathds{1}(Z_{\ell}\leq s)$ on $\mathbb{R}$. Proposition 3.2 below states that $\hat{F}^{R}$ converges to $F$ as $R\to\infty$ uniformly with probability one, where $F$ is the cumulative distribution function of $\pi$.

The function $s\mapsto\hat{F}^{R}(s)$ is not monotonically increasing because of the negative weights among $(\omega_{\ell})$, which motivates the following comments regarding the estimation of quantiles of $\pi$. Assume from now on that the pairs $(\omega_{\ell},Z_{\ell})$ are ordered such that $Z_{\ell}\leq Z_{\ell+1}$. For any $q\in(0,1)$ there might be more than one index $\ell$ such that $\sum_{i=1}^{\ell-1}\omega_{i}\leq q$ and $\sum_{i=1}^{\ell}\omega_{i}>q$; the quantile estimate might be defined as $Z_{\ell}$ for any such $\ell$. The convergence of $\hat{F}^{R}$ to $F$ indicates that all such estimates are expected to converge to the $q$-th quantile of $\pi$. Therefore the signed measure representation leads to a way of estimating quantiles of the target distribution in a consistent way as $R\to\infty$. The construction of confidence intervals for these quantiles, perhaps by bootstrapping the $R$ independent copies, stands as an interesting area for future research. Another route to estimate quantiles of $\pi$ would be to project marginals of $\hat{\pi}^{R}$ onto the space of probability measures, for instance using a generalization of the Wasserstein metric to signed measures (Mainini, 2012). One could also estimate $F$ using isotonic regression (Chatterjee et al., 2015), considering $\hat{F}^{R}(s)$ for various values of $s$ as noisy measurements of $F(s)$.
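The quantile rule just described can be sketched directly from the weighted atoms; the function below and the toy signed measure used to exercise it are illustrative assumptions, the weights and atoms in practice being pooled from $R$ independent copies of $\hat{\pi}$.

```python
import numpy as np

def signed_quantile(weights, atoms, q):
    """Return the smallest atom Z_l whose cumulative weight exceeds q,
    after sorting the (weight, atom) pairs by atom value. The cumulative
    sums need not be monotone, since some weights may be negative."""
    atoms = np.asarray(atoms, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(atoms)
    csum = np.cumsum(weights[order])
    idx = int(np.argmax(csum > q))   # first crossing: sum_{i<=l} w_i > q
    return atoms[order][idx]

# toy signed measure with weights summing to one, one weight negative;
# cumulative sums along sorted atoms are 0.5, 0.3, 0.7, 1.0
print(signed_quantile([0.5, -0.2, 0.4, 0.3], [0.0, 1.0, 2.0, 3.0], 0.6))  # -> 2.0
```

The first index at which the cumulative sum crosses $q$ always satisfies the defining pair of inequalities, since all earlier cumulative sums are at most $q$.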

3 Properties and parallel implementation

The proofs of the results of this section are in the supplementary materials. Our first result establishes the basic validity of the proposed estimators.

Proposition 3.1.

Under Assumptions 2.1-2.3, for all $k\geq 0$ and $m\geq k$, the estimator $H_{k:m}(X,Y)$ has expectation $\mathbb{E}_{\pi}[h(X)]$, a finite variance, and a finite expected computing time.

A direct consequence of Proposition 3.1 is that an average of $R$ independent copies of $H_{k:m}(X,Y)$ converges to $\mathbb{E}_{\pi}[h(X)]$ as $R\to\infty$. We discuss more sophisticated results on unbiased estimators and parallel processing in Section 3.3 and other uses of such estimators in Sections 5.5 and 6. Following Glynn and Rhee (2014), we provide Proposition 3.2 on the signed measure estimator of (2.2). We recall that such estimators apply to univariate target distributions or to the marginal distributions of a multivariate target.

Proposition 3.2.

Under Assumptions 2.2-2.3, for all $m\geq k\geq 0$, and assuming that $(X_{t})_{t\geq 0}$ converges to $\pi$ in total variation, introduce the function $s\mapsto\hat{F}^{R}(s)=\sum_{\ell=1}^{N_{R}}\omega_{\ell}\mathds{1}(Z_{\ell}\leq s)$, where $(\omega_{\ell},Z_{\ell})_{\ell=1}^{N_{R}}$ are the weighted atoms obtained from $R$ independent copies of $\hat{\pi}$ in (2.2). Denote by $F$ the cumulative distribution function of $\pi$. Then $\sup_{s\in\mathbb{R}}|\hat{F}^{R}(s)-F(s)|\to 0$ almost surely as $R\to\infty$.

Section 3.1 studies the variance and efficiency of $H_{k:m}(X,Y)$, Section 3.2 concerns the verification of Assumption 2.2 using drift conditions, and Section 3.3 discusses estimation on parallel processors in the presence of a budget constraint.

3.1 Variance and efficiency

We consider the impact of $k$ and $m$ on the efficiency of the proposed estimators, which will then suggest guidelines for the choice of these tuning parameters. Estimators $H^{(r)}_{k:m}(X,Y)$, for $r=1,\ldots,R$, can be generated independently and averaged. More estimators can be produced in a given computing budget if each estimator is cheaper to produce. The trade-off can be understood in the framework of Glynn and Whitt (1992), see also Rhee and Glynn (2012) and Glynn and Rhee (2014), by defining the asymptotic inefficiency as the product of the variance and the expected cost of the estimator. That product is the asymptotic variance of $R^{-1}\sum_{r=1}^{R}H^{(r)}_{k:m}(X,Y)$ as the computational budget, as opposed to the number of estimators $R$, goes to infinity (Glynn and Whitt, 1992). Of primary interest is the comparison of this asymptotic inefficiency with the asymptotic variance of standard MCMC estimators. We start by writing the time-averaged estimator of (2.1) as

H_{k:m}(X,Y)=\text{MCMC}_{k:m}+\text{BC}_{k:m},

where $\text{MCMC}_{k:m}$ is the MCMC average $(m-k+1)^{-1}\sum_{\ell=k}^{m}h(X_{\ell})$ and $\text{BC}_{k:m}$ is the bias correction term. The variance of $H_{k:m}(X,Y)$ can be written as

\mathbb{V}[H_{k:m}(X,Y)]=\mathbb{E}\left[(\text{MCMC}_{k:m}-\mathbb{E}_{\pi}[h(X)])^{2}\right]+2\,\mathbb{E}\left[(\text{MCMC}_{k:m}-\mathbb{E}_{\pi}[h(X)])\,\text{BC}_{k:m}\right]+\mathbb{E}\left[\text{BC}_{k:m}^{2}\right].

Defining the mean squared error of the MCMC estimator as $\text{MSE}_{k:m}=\mathbb{E}[(\text{MCMC}_{k:m}-\mathbb{E}_{\pi}[h(X)])^{2}]$, the Cauchy–Schwarz inequality yields

\mathbb{V}[H_{k:m}(X,Y)]\leq\text{MSE}_{k:m}+2\sqrt{\text{MSE}_{k:m}}\sqrt{\mathbb{E}\left[\text{BC}_{k:m}^{2}\right]}+\mathbb{E}\left[\text{BC}_{k:m}^{2}\right]. \quad (3.1)

To bound $\mathbb{E}[\text{BC}_{k:m}^{2}]$, we introduce a geometric drift condition on the Markov kernel $P$.

Assumption 3.1.

The Markov kernel $P$ is $\pi$-invariant, $\varphi$-irreducible and aperiodic, and there exist a measurable function $V:\mathcal{X}\to[1,\infty)$, constants $\lambda\in(0,1)$ and $b<\infty$, and a small set $\mathcal{C}$ such that for all $x\in\mathcal{X}$,

\int P(x,dy)\,V(y)\leq\lambda V(x)+b\,\mathds{1}(x\in\mathcal{C}).

We refer the reader to Meyn and Tweedie (2009) for the definitions and core theoretical tools for working with Markov chains on a general state space, in particular Chapter 5 for aperiodicity, $\varphi$-irreducibility and small sets, and Chapter 15 for geometric drift conditions; see also Roberts and Rosenthal (2004). Geometric drift conditions are known to hold for various MCMC algorithms (e.g. Roberts and Tweedie, 1996a, b; Jarner and Hansen, 2000; Atchade, 2006; Khare and Hobert, 2013; Choi and Hobert, 2013; Pal and Khare, 2014). Assumption 3.1 often plays a central role in establishing geometric ergodicity (e.g. Theorem 9 in Roberts and Rosenthal, 2004). We show next that this assumption enables an informative bound on $\mathbb{E}[\text{BC}_{k:m}^{2}]$.

Proposition 3.3.

Suppose that Assumptions 2.2-2.3 and 3.1 hold, with a function $V$ for which the integral $\int V(x)\,\pi_{0}(dx)$ is finite. If the function $h$ is such that $\sup_{x\in\mathcal{X}}|h(x)|/V(x)^{\beta}<\infty$ for some $\beta\in[0,1/2)$, then for all $m\geq k\geq 0$ we have

\mathbb{E}[\text{BC}_{k:m}^{2}]\leq\frac{C_{\delta,\beta}\,\delta_{\beta}^{k}}{(m-k+1)^{2}},

for some constant $C_{\delta,\beta}<\infty$ and $\delta_{\beta}=\delta^{1-2\beta}\in(0,1)$, with $\delta\in(0,1)$ as in Assumption 2.2.

Using Proposition 3.3, equation (3.1) becomes

\mathbb{V}[H_{k:m}(X,Y)]\leq\text{MSE}_{k:m}+2\sqrt{\text{MSE}_{k:m}}\,\frac{\sqrt{C_{\delta,\beta}\,\delta_{\beta}^{k}}}{m-k+1}+\frac{C_{\delta,\beta}\,\delta_{\beta}^{k}}{(m-k+1)^{2}}. \quad (3.2)

The variance of $H_{k:m}(X,Y)$ is thus bounded by the mean squared error of an MCMC estimator plus additive terms that vanish geometrically in $k$ and polynomially in $m-k$.

In order to facilitate the comparison between the efficiency of $H_{k:m}(X,Y)$ and that of MCMC estimators, we add simplifying assumptions. First, the right-most terms of (3.2) decrease geometrically with $k$, at a rate driven by $\delta_{\beta}=\delta^{1-2\beta}$, where $\delta$ is as in Assumption 2.2. This motivates a choice of $k$ depending on the distribution of the meeting time $\tau$. In practice, we can sample independent realizations of the meeting time and choose $k$ such that $\mathbb{P}(\tau>k)$ is small; i.e. we choose $k$ as a large quantile of the meeting times.
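This guideline can be sketched as follows; the geometric draws stand in for meeting times gathered from preliminary runs of the coupled chains, and the specific quantile level and multiplier (anticipating the choice of $m$ as a multiple of $k$ discussed below) are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(123)

# stand-in for R independent meeting times from preliminary coupled runs
meeting_times = rng.geometric(p=0.2, size=1000)

k = int(np.quantile(meeting_times, 0.99))   # k: a large quantile of tau,
                                            # so that P(tau > k) is small
m = 10 * k                                  # m: a multiple of k, e.g. 10k

print(k, m, float((meeting_times > k).mean()))
```

With these choices the bias correction term is zero with large probability, and the cost overhead relative to a standard MCMC run of $m$ iterations is modest.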

Dropping the third term on the right-hand side of (3.2), which is of smaller order than the second term, and assuming that MSEk:m>0\text{MSE}_{k:m}>0 and that m>τm>\tau with large probability, we obtain the approximate inequality

𝔼[2(τ1)+max(1,m+1τ)]𝕍[Hk:m(X,Y)](m+𝔼(τ))𝕍[Hk:m(X,Y)](mk+1)MSEk:m(1+k+𝔼(τ)mk+1)(1+2mk+1Cδ,βδβk(mk+1)MSEk:m).\mathbb{E}[2(\tau-1)+\max(1,m+1-\tau)]\mathbb{V}[H_{k:m}(X,Y)]\lessapprox\left(m+\mathbb{E}(\tau)\right)\mathbb{V}[H_{k:m}(X,Y)]\\ \lessapprox(m-k+1)\text{MSE}_{k:m}\left(1+\frac{k+\mathbb{E}(\tau)}{m-k+1}\right)\left(1+\frac{2}{\sqrt{m-k+1}}\sqrt{\frac{C_{\delta,\beta}\delta_{\beta}^{k}}{(m-k+1)\text{MSE}_{k:m}}}\right).

As kk increases we expect (mk+1)MSEk:m(m-k+1)\text{MSE}_{k:m} to converge to 𝕍[(mk+1)1/2t=kmh(Xt)]\mathbb{V}[(m-k+1)^{-1/2}\sum_{t=k}^{m}h(X_{t})], where XkX_{k} would be distributed according to π\pi. Denote this variance by Vk,mV_{k,m}. The limit of Vk,mV_{k,m} as mm\to\infty is the asymptotic variance of the MCMC estimator, denoted by VV_{\infty}. Hence, for kk and mkm-k both large, the loss of efficiency of the method compared to standard MCMC is approximately 1+(k+𝔼(τ))/(mk)1+(k+\mathbb{E}(\tau))/(m-k).

This informal series of approximations suggests that we can retrieve an asymptotic efficiency comparable to the underlying MCMC estimators with appropriate choices of kk and mm that depend on the distribution of the meeting time τ\tau. These choices are thus sensitive to the coupling of the chains, and not only to the performance of the underlying MCMC algorithm. Choosing mm as a multiple of kk, such as 5k or 10k, makes intuitive sense when considering that k/mk/m is the proportion of iterations that are simply discarded in the event that τ<k\tau<k. In other words, the bias of MCMC can be removed at the cost of an increased variance, which can in turn be reduced by choosing large enough values of kk and mm. This results in a tradeoff with the desired level of parallelism: one might prefer to keep kk and mm small, yielding a suboptimal efficiency for Hk:m(X,Y)H_{k:m}(X,Y), but enabling more independent copies to be generated in a given computing time.
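For illustration, the heuristic above can be sketched in a few lines; the function name, quantile level, and multiplier below are our own choices, not prescriptions from the analysis.

```python
import numpy as np

def choose_k_m(meeting_times, quantile=0.99, multiple=5):
    """Choose k as a large quantile of sampled meeting times, and m as a multiple of k."""
    k = int(np.ceil(np.quantile(meeting_times, quantile)))
    return k, multiple * k
```

For instance, with meeting times taking each value in {1, ..., 100} once, this returns k = 100 and m = 500.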

3.2 Verifying Assumption 2.2

We discuss how Assumption 3.1 on the Markov kernel PP can be used to verify Assumption 2.2, on the shape of the meeting time distribution. Informally, Assumption 3.1 guarantees that the bivariate chain {(Xt,Yt1),t1}\{(X_{t},Y_{t-1}),\;t\geq 1\} visits 𝒞×𝒞\mathcal{C}\times\mathcal{C} infinitely often, where 𝒞\mathcal{C} is a small set. If there is a positive probability of the event {Xt+1=Yt}\{X_{t+1}=Y_{t}\} for every tt such that (Xt,Yt1)𝒞×𝒞(X_{t},Y_{t-1})\in\mathcal{C}\times\mathcal{C}, then we expect Assumption 2.2 to hold. The next result formalizes that intuition. The proof is based on a modification of an argument by Douc et al. (2004). We introduce 𝒟={(x,y)𝒳×𝒳:x=y}\mathcal{D}=\{(x,y)\in\mathcal{X}\times\mathcal{X}:\;x=y\}. Then Assumption 2.3 reads P¯((x,x),𝒟)=1\bar{P}((x,x),\mathcal{D})=1 for all x𝒳x\in\mathcal{X}.

Proposition 3.4.

Suppose that PP satisfies Assumption 3.1 with a small set 𝒞\mathcal{C} of the form 𝒞={x:V(x)L}\mathcal{C}=\{x:V(x)\leq L\} where λ+b/(1+L)<1\lambda+b/(1+L)<1. Suppose also that there exists ϵ(0,1)\epsilon\in(0,1) such that

inf(x,y)𝒞×𝒞P¯((x,y),𝒟)ϵ.\inf_{(x,y)\in\mathcal{C}\times\mathcal{C}}\bar{P}((x,y),\mathcal{D})\geq\epsilon. (3.3)

Then there exists a finite constant CC^{\prime} and a κ(0,1)\kappa\in(0,1), such that for all n1n\geq 1,

(τ>n)Cπ0(V)κn,\mathbb{P}(\tau>n)\leq C^{\prime}\pi_{0}(V)\kappa^{n},

where π0(V)=V(x)π0(dx)\pi_{0}(V)=\int V(x)\pi_{0}(dx). Hence Assumption 2.2 holds as long as π0(V)<\pi_{0}(V)<\infty.

Note that if Assumption 3.1 holds with a small set of the form 𝒞={x:V(x)L}\mathcal{C}=\{x:\;V(x)\leq L\} for some L>0L>0, then it also holds for 𝒞={x:V(x)L}\mathcal{C}=\{x:\;V(x)\leq L^{\prime}\} for all LLL^{\prime}\geq L. In that case one can always choose LL large enough so that λ+b/(1+L)<1\lambda+b/(1+L)<1. Hence the main restriction in Proposition 3.4 is the assumption that the small sets in Assumption 3.1 are of the form {x:V(x)L}\{x:V(x)\leq L\}, i.e. level sets of VV. This holds in a number of cases. For instance, Theorem 2.2 of Roberts and Tweedie (1996b) shows that for a large class of Metropolis-Hastings algorithms any non-empty compact set is small, and therefore for these algorithms it suffices to check that the level sets of the drift function VV are compact. Common examples of drift functions include V(x)=c/π(x)V(x)=c/\sqrt{\pi(x)} (Roberts and Tweedie, 1996b; Jarner and Hansen, 2000; Atchade, 2006) and V(x)=ceb|x|V(x)=ce^{b|x|} (Roberts and Tweedie, 1996a), as well as the example in Pal and Khare (2014); all of these have compact level sets under mild regularity conditions.

The work of Middleton et al. (2018) contains results that generalize Propositions 3.3 and 3.4 to Markov chains satisfying polynomial drift conditions (e.g Andrieu and Vihola, 2015), leading to polynomial tails for the associated meeting times.
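As an empirical companion to these results, the geometric decay of (τ>n)\mathbb{P}(\tau>n) can be checked from independent meeting times by fitting log(τ>n)\log\mathbb{P}(\tau>n) linearly in nn; the sketch below, an illustration of our own, returns the fitted rate, which plays the role of κ\kappa.

```python
import numpy as np

def fitted_tail_rate(meeting_times, n_grid):
    """Estimate kappa in P(tau > n) <= C' pi_0(V) kappa^n by a log-linear fit
    of the empirical survival probabilities over a grid of n values."""
    times = np.asarray(meeting_times)
    tail = np.array([np.mean(times > n) for n in n_grid], dtype=float)
    slope, _ = np.polyfit(np.asarray(n_grid, dtype=float), np.log(tail), 1)
    return np.exp(slope)
```

The grid should stop before the empirical survival probability reaches zero, where the logarithm is undefined.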

3.3 Parallel implementation under budget constraints

Our main motivation for unbiased estimators comes from parallel processing; see Sections 5.5 and 6 for other motivations. Independent unbiased estimators with finite variance can be generated on separate machines, and combined into consistent and asymptotically Normal estimators. If the number of estimators is pre-specified, this follows from the central limit theorem for i.i.d. variables. We might prefer to specify a time budget, and generate as many estimators as possible within the budget. The lack of bias allows the application of a variety of results on budget-constrained parallel simulations, which we briefly review here, following Glynn and Heidelberger (1990, 1991).

We denote the proposed estimator by HH and its expectation, which is the object of interest here, by π(h)\pi(h). Generating HH takes a random time CC. We write N(t)N(t) for the number of independent copies of HH that can be produced by time tt. The sequence (Hn,Cn)n(H_{n},C_{n})_{n\in\mathbb{N}} refers to i.i.d. copies of (H,C)(H,C), so that we can write N(t)=sup{n0:C1++Cnt}N(t)=\sup\{n\geq 0:C_{1}+\ldots+C_{n}\leq t\}, with N(t)=0N(t)=0 if t<C1t<C_{1}. We add the subscript p\mathrm{p} to refer to objects associated with processor p{1,,P}\mathrm{p}\in\{1,\ldots,\mathrm{P}\}.

A first result is that the estimator H¯p(t)\bar{H}_{\mathrm{p}}(t), defined for all 1pP1\leq\mathrm{p}\leq\mathrm{P} as 0 if Np(t)=0N_{\mathrm{p}}(t)=0, and by the sample average of Hp1,,HpNp(t)H_{\mathrm{p}1},\ldots,H_{\mathrm{p}N_{\mathrm{p}}(t)} otherwise, is biased: 𝔼[H¯p(t)]=𝔼[H]𝔼[H𝟙(C>t)]\mathbb{E}[\bar{H}_{\mathrm{p}}(t)]=\mathbb{E}[H]-\mathbb{E}[H\mathds{1}(C>t)]. Corollary 6 of Glynn and Heidelberger (1990) states that if 𝔼[|Hexp(αC)|]<\mathbb{E}[|H\exp(\alpha C)|]<\infty for some α>0\alpha>0, then the bias is negligible compared to exp(αt)\exp(-\alpha t) as tt\to\infty. By Cauchy–Schwarz, 𝔼[|Hexp(αC)|]2\mathbb{E}[|H\exp(\alpha C)|]^{2} is less than the product of 𝔼[H2]\mathbb{E}[H^{2}] and 𝔼[exp(2αC)]\mathbb{E}[\exp(2\alpha C)]. In our context, 𝔼[H2]\mathbb{E}[H^{2}] is finite by Proposition 3.1, and 𝔼[exp(2αC)]\mathbb{E}[\exp(2\alpha C)] is finite for a range of values of α\alpha that depends on the value of δ\delta in Assumption 2.2.

We can define an unbiased estimator of π(h)\pi(h) with a slight modification of H¯p(t)\bar{H}_{\mathrm{p}}(t). For all p{1,,P}{\mathrm{p}\in\{1,\dots,\mathrm{P}\}}, set H~p(t)=H¯p(t)\tilde{H}_{\mathrm{p}}(t)=\bar{H}_{\mathrm{p}}(t) if Np(t)>0N_{\mathrm{p}}(t)>0 and H~p(t)=Hp1\tilde{H}_{\mathrm{p}}(t)=H_{\mathrm{p}1} if Np(t)=0N_{\mathrm{p}}(t)=0. With N~p(t)=max(1,Np(t))\tilde{N}_{\mathrm{p}}(t)=\max(1,N_{\mathrm{p}}(t)), H~p(t)\tilde{H}_{\mathrm{p}}(t) is the sample average of Hp1,,HpN~p(t)H_{\mathrm{p}1},\ldots,H_{\mathrm{p}\tilde{N}_{\mathrm{p}}(t)}. The computation of N~p(t)\tilde{N}_{\mathrm{p}}(t) requires the completion of Hp1H_{\mathrm{p}1}, and thus we cannot necessarily return H~p(t)\tilde{H}_{\mathrm{p}}(t) at time tt, in contrast with H¯p(t)\bar{H}_{\mathrm{p}}(t). On the other hand, we have 𝔼[H~p(t)]=𝔼[H]=π(h)\mathbb{E}[\tilde{H}_{\mathrm{p}}(t)]=\mathbb{E}[H]=\pi(h), i.e. the estimator is unbiased, provided that 𝔼[|H|]<\mathbb{E}[|H|]<\infty (Corollary 7 of Glynn and Heidelberger (1990)). We denote the average of H~p(t)\tilde{H}_{\mathrm{p}}(t) over P\mathrm{P} processors by H~(P,t)=P1p=1PH~p(t)\tilde{H}(\mathrm{P},t)=\mathrm{P}^{-1}\sum_{\mathrm{p}=1}^{\mathrm{P}}\tilde{H}_{\mathrm{p}}(t), which is unbiased for π(h)\pi(h).
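The quantities N(t)N(t), H¯p(t)\bar{H}_{\mathrm{p}}(t) and H~p(t)\tilde{H}_{\mathrm{p}}(t) for a single processor can be sketched as follows; this is an illustration of our own, under the convention that estimator values are listed in completion order.

```python
import numpy as np

def budget_estimators(h_values, costs, t):
    """Return (H_bar, H_tilde) for one processor: h_values[n] is the n-th completed
    estimator and costs[n] its cost, so N(t) counts the partial sums of costs
    that do not exceed the budget t."""
    n_t = int(np.searchsorted(np.cumsum(costs), t, side="right"))  # N(t)
    h_bar = float(np.mean(h_values[:n_t])) if n_t > 0 else 0.0     # biased version
    n_tilde = max(1, n_t)        # H_tilde always waits for the first estimator
    h_tilde = float(np.mean(h_values[:n_tilde]))
    return h_bar, h_tilde
```

With costs (2, 2, 2) and budget t = 5, two estimators complete, so both versions average the first two values; with budget t = 1, the biased version returns 0 while the unbiased one equals the first value.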

Asymptotic results on H~(P,t)\tilde{H}(\mathrm{P},t) can be found in Glynn and Heidelberger (1991) and are summarized below. We first have the consistency results: limtH~(P,t)=π(h)\lim_{t\to\infty}\tilde{H}(\mathrm{P},t)=\pi(h) almost surely for any fixed P\mathrm{P}, and limPH~(P,t)=π(h)\lim_{\mathrm{P}\to\infty}\tilde{H}(\mathrm{P},t)=\pi(h) almost surely for any fixed tt; moreover, if 𝔼[|H|1+δ]<\mathbb{E}[|H|^{1+\delta}]<\infty for some δ>0\delta>0 and if {tP}\{t_{\mathrm{P}}\} is a sequence such that limPtP=\lim_{\mathrm{P}\to\infty}t_{\mathrm{P}}=\infty, then H~(P,tP)\tilde{H}(\mathrm{P},t_{\mathrm{P}}) converges to π(h)\pi(h) in probability as P\mathrm{P}\to\infty. Next, we can construct confidence intervals for π(h)\pi(h) based on H~(P,t)\tilde{H}(\mathrm{P},t), following the end of Section 3 in Glynn and Heidelberger (1991). Indeed, define

σ^12(P,t)\displaystyle\hat{\sigma}_{1}^{2}(\mathrm{P},t) =1P1p=1P(H~p(t)H~(P,t))2,τ~(P,t)=1Pp=1P1N~p(t)n=1N~p(t)Cpn,\displaystyle=\frac{1}{\mathrm{P}-1}\sum_{\mathrm{p}=1}^{\mathrm{P}}\left(\tilde{H}_{\mathrm{p}}(t)-\tilde{H}(\mathrm{P},t)\right)^{2},\quad\tilde{\tau}(\mathrm{P},t)=\frac{1}{\mathrm{P}}\sum_{\mathrm{p}=1}^{\mathrm{P}}\frac{1}{\tilde{N}_{\mathrm{p}}(t)}\sum_{n=1}^{\tilde{N}_{\mathrm{p}}(t)}C_{\mathrm{p}n},
σ^22(P,t)\displaystyle\hat{\sigma}_{2}^{2}(\mathrm{P},t) =τ~(P,t)[(1Pp=1P1N~p(t)n=1N~p(t)Hpn2)H~(P,t)2],\displaystyle=\tilde{\tau}(\mathrm{P},t)\left[\left(\frac{1}{\mathrm{P}}\sum_{\mathrm{p}=1}^{\mathrm{P}}\frac{1}{\tilde{N}_{\mathrm{p}}(t)}\sum_{n=1}^{\tilde{N}_{\mathrm{p}}(t)}H_{\mathrm{p}n}^{2}\right)-\tilde{H}(\mathrm{P},t)^{2}\right],

where N~p(t)=max(1,Np(t))\tilde{N}_{\mathrm{p}}(t)=\max(1,N_{\mathrm{p}}(t)). Then we have the three following central limit theorems,

for fixed t and P,Pσ^1(P,t)(H~(P,t)π(h))\displaystyle\text{for fixed $t$ and $\mathrm{P}\to\infty$},\quad\frac{\sqrt{\mathrm{P}}}{\hat{\sigma}_{1}(\mathrm{P},t)}\left(\tilde{H}(\mathrm{P},t)-\pi(h)\right) \displaystyle\to 𝒩(0,1),\displaystyle\mathcal{N}(0,1), (3.4)
for fixed P and t,Ptσ^2(P,t)(H~(P,t)π(h))\displaystyle\text{for fixed $\mathrm{P}$ and $t\to\infty$},\quad\frac{\sqrt{\mathrm{P}t}}{\hat{\sigma}_{2}(\mathrm{P},t)}\left(\tilde{H}(\mathrm{P},t)-\pi(h)\right) \displaystyle\to 𝒩(0,1),\displaystyle\mathcal{N}(0,1), (3.5)
if tP as P,PtPσ^2(P,tP)(H~(P,tP)π(h))\displaystyle\text{if $t_{\mathrm{P}}\to\infty$ as $\mathrm{P}\to\infty$},\quad\frac{\sqrt{\mathrm{P}t_{\mathrm{P}}}}{\hat{\sigma}_{2}(\mathrm{P},t_{\mathrm{P}})}\left(\tilde{H}(\mathrm{P},t_{\mathrm{P}})-\pi(h)\right) \displaystyle\to 𝒩(0,1).\displaystyle\mathcal{N}(0,1). (3.6)

These results require moment conditions such as 𝔼[H~p(t)2]<\mathbb{E}[\tilde{H}_{\mathrm{p}}(t)^{2}]<\infty. The central limit theorem in (3.4) will be used to construct confidence intervals in Sections 5.3 and 5.4.
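For concreteness, a confidence interval based on (3.4) can be formed from the per-processor averages alone; the sketch below is illustrative, with a Normal quantile of 1.96 for the 95% level.

```python
import numpy as np

def confidence_interval_clt(per_processor_averages, z=1.96):
    """Interval for pi(h) from the CLT (3.4): center is the grand average
    H_tilde(P, t), half-width is z * sigma_hat_1 / sqrt(P)."""
    h = np.asarray(per_processor_averages, dtype=float)
    center = h.mean()
    sigma1 = h.std(ddof=1)   # hat sigma_1(P, t), with the 1/(P-1) normalization
    half = z * sigma1 / np.sqrt(h.size)
    return center - half, center + half
```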

We conclude this section with a remark on the setting where tt is fixed and the number of processors P\mathrm{P} goes to infinity. There, the time to obtain H~(P,t)\tilde{H}(\mathrm{P},t) would typically increase with P\mathrm{P}. Indeed at least one estimator needs to be completed on each processor for H~(P,t)\tilde{H}(\mathrm{P},t) to be available. The completion time behaves as the maximum of independent copies of the cost CC. Under Assumption 2.2, the completion time for H~(P,t)\tilde{H}(\mathrm{P},t) has expectation behaving as logP\log\mathrm{P} when P\mathrm{P}\to\infty, for fixed tt. Other tail assumptions (Middleton et al., 2018) would lead to different behavior for the completion time associated with H~(P,t)\tilde{H}(\mathrm{P},t).

4 Couplings of MCMC algorithms

We consider couplings of various MCMC algorithms that satisfy Assumptions 2.2-2.3. These couplings are widely applicable and do not require extensive analytical knowledge of the target distribution. We stress that they are not optimal in general, and we expect that other constructions would yield more efficient estimators. We begin in Section 4.1 by reviewing maximal couplings.

4.1 Sampling from maximal couplings

A maximal coupling between two distributions pp and qq on a space 𝒳\mathcal{X} is a distribution of a pair of random variables (X,Y)(X,Y) that maximizes (X=Y)\mathbb{P}(X=Y), subject to the marginal constraints XpX\sim p and YqY\sim q. We write pp and qq both for these distributions and for their probability density functions with respect to a common dominating measure, and refer to the uniform distribution on the interval [a,b][a,b] by 𝒰([a,b])\mathcal{U}([a,b]). A procedure to sample from a maximal coupling is described in Algorithm 2; see e.g. Section 4.5 of Chapter 1 of Thorisson (2000), and Johnson (1998) where it is termed γ\gamma-coupling.

We justify Algorithm 2 and compute its cost. Denote by (X,Y)(X,Y) the output of the algorithm. First, XX follows pp from step 1. To prove that YY follows qq, introduce a measurable set AA. We write (YA)=(YA,step 1)+(YA,step 2)\mathbb{P}(Y\in A)=\mathbb{P}(Y\in A,\text{step 1})+\mathbb{P}(Y\in A,\text{step 2}), where the events {step 1}\{\text{step 1}\} and {step 2}\{\text{step 2}\} refer to the algorithm terminating at step 1 or 2. We compute

(YA,step 1)\displaystyle\mathbb{P}\left(Y\in A,\text{step 1}\right) =A0+𝟙(wq(x))𝟙(0wp(x))p(x)p(x)𝑑w𝑑x=Amin(p(x),q(x))𝑑x.\displaystyle=\int_{A}\int_{0}^{+\infty}\mathds{1}\left(w\leq q(x)\right)\frac{\mathds{1}\left(0\leq w\leq p(x)\right)}{p(x)}p(x)dwdx=\int_{A}\min(p(x),q(x))dx.

We can deduce from this that (step 1)=𝒳min(p(x),q(x))𝑑x\mathbb{P}\left(\text{step 1}\right)=\int_{\mathcal{X}}\min(p(x),q(x))dx. For (YA,step 2)\mathbb{P}\left(Y\in A,\text{step 2}\right) to be equal to A(q(x)min(p(x),q(x)))𝑑x\int_{A}(q(x)-\min(p(x),q(x)))dx, we need

A(q(x)min(p(x),q(x)))𝑑x\displaystyle\int_{A}(q(x)-\min(p(x),q(x)))dx =(YA|step 2)(1𝒳min(p(x),q(x))𝑑x),\displaystyle=\mathbb{P}\left(Y\in A|\text{step 2}\right)\left(1-\int_{\mathcal{X}}\min(p(x),q(x))dx\right),

and we conclude that the distribution of YY given {step 2}\{\text{step 2}\} should for all xx have a density q~(x)\tilde{q}(x) equal to (q(x)min(p(x),q(x)))/(1min(p(x),q(x))𝑑x){(q(x)-\min(p(x),q(x)))/(1-\int\min(p(x^{\prime}),q(x^{\prime}))dx^{\prime})}. Step 2 is a standard rejection sampler using qq as a proposal distribution to target q~\tilde{q}, which concludes the proof that YqY\sim q. We also confirm that Algorithm 2 maximizes the probability of {X=Y}\{X=Y\}. Under the algorithm,

(X=Y)=(step 1)=𝒳min(p(x),q(x))𝑑x=1dTV(p,q),\mathbb{P}(X=Y)=\mathbb{P}(\text{step 1})=\int_{\mathcal{X}}\min(p(x),q(x))dx=1-d_{\text{TV}}(p,q),

where dTV(p,q)=1/2𝒳|p(x)q(x)|𝑑xd_{\text{TV}}(p,q)=\nicefrac{{1}}{{2}}\int_{\mathcal{X}}|p(x)-q(x)|dx is the total variation distance. By the coupling inequality (Lindvall, 2002), this proves that the algorithm implements a maximal coupling.

To assess the cost of Algorithm 2, note that step 1 costs one draw from pp, one evaluation of pp and one of qq. Each attempt in the rejection sampler of step 2 costs one draw from qq, one evaluation of pp and one of qq. Hereafter we refer to the cost of one draw and two evaluations as “one unit”. Observe that the probability of acceptance in step 2 is given by (W>p(Y))=1𝒳min(p(y),q(y))𝑑y\mathbb{P}(W^{\star}>p(Y^{\star}))=1-\int_{\mathcal{X}}\min\left(p(y),q(y)\right)dy. Then, the number of attempts in step 2 has a Geometric distribution with mean (1𝒳min(p(y),q(y))𝑑y)1(1-\int_{\mathcal{X}}\min\left(p(y),q(y)\right)dy)^{-1}, and step 2 itself occurs with probability 1𝒳min(p(y),q(y))𝑑y1-\int_{\mathcal{X}}\min\left(p(y),q(y)\right)dy. Therefore the overall expected cost is two units. The expected cost is the same for all distributions pp and qq, while the variance of the cost depends on dTV(p,q)d_{\text{TV}}(p,q), and in fact goes to infinity as this distance goes to zero.

  1. 1.

    Sample XpX\sim p and W|X𝒰([0,p(X)])W|X\sim\mathcal{U}([0,p(X)]). If Wq(X)W\leq q(X), output (X,X)(X,X).

  2. 2.

    Otherwise, sample YqY^{\star}\sim q and W|Y𝒰([0,q(Y)])W^{\star}|Y^{\star}\sim\mathcal{U}([0,q(Y^{\star})]) until W>p(Y)W^{\star}>p(Y^{\star}), and output (X,Y)(X,Y^{\star}).

Algorithm 2 Sampling from a maximal coupling of pp and qq.
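A minimal Python sketch of Algorithm 2 follows, with pp and qq supplied as sampler/density pairs; the interface is our own choice, and the densities are assumed normalized, as in the derivation above.

```python
import math
import numpy as np

def sample_maximal_coupling(p_sample, p_pdf, q_sample, q_pdf, rng):
    # Step 1: X ~ p and W | X ~ U([0, p(X)]); if W <= q(X), output (X, X).
    x = p_sample(rng)
    if rng.uniform(0.0, p_pdf(x)) <= q_pdf(x):
        return x, x
    # Step 2: otherwise propose Y* ~ q and W* | Y* ~ U([0, q(Y*)]) until W* > p(Y*).
    while True:
        ystar = q_sample(rng)
        if rng.uniform(0.0, q_pdf(ystar)) > p_pdf(ystar):
            return x, ystar

# standard Normal sampler and density, for illustration
std_normal_draw = lambda rng: rng.normal()
std_normal_pdf = lambda v: math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)
```

Coupling a distribution with itself always returns identical pairs, since step 1 then accepts with probability one; for distinct pp and qq the pair meets with probability 1dTV(p,q)1-d_{\text{TV}}(p,q).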

In Algorithm 2, the value of XX is not used in the generation of YY^{\star} within step 2. In other words, conditional on {XY}\{X\neq Y\}, the two output variables are independent. We might prefer to correlate the outputs in the event {XY}\{X\neq Y\}, e.g. in random walk MH as in the next section. We describe a maximal coupling presented in Bou-Rabee et al. (2018). It applies to distributions pp and qq on d\mathbb{R}^{d} such that XpX\sim p can be represented as X=μ1+Σ1/2X˙X=\mu_{1}+\Sigma^{1/2}\dot{X}, and YqY\sim q as Y=μ2+Σ1/2Y˙Y=\mu_{2}+\Sigma^{1/2}\dot{Y}, where the pair (X˙,Y˙)(\dot{X},\dot{Y}) follows a coupling of some distribution ss with itself. The construction requires that ss is spherically symmetrical: s(x)=s(y)s(x)=s(y) for all x,ydx,y\in\mathbb{R}^{d} such that x=y\|x\|=\|y\|, where \|\cdot\| denotes the Euclidean norm. For instance, if ss is a standard multivariate Normal distribution, then X𝒩(μ1,Σ)X\sim\mathcal{N}(\mu_{1},\Sigma) and Y𝒩(μ2,Σ)Y\sim\mathcal{N}(\mu_{2},\Sigma).

Let z=Σ1/2(μ1μ2)z=\Sigma^{-1/2}(\mu_{1}-\mu_{2}) and e=z/ze=z/\|z\|. We independently draw X˙s\dot{X}\sim s and U𝒰([0,1])U\sim\mathcal{U}([0,1]) and let

Y˙={X˙+zif Umin(1,s(X˙+z)s(X˙)),X˙2(eX˙)eotherwise.\dot{Y}=\begin{cases}\dot{X}+z&\text{if }U\leq\min\left(1,\frac{s(\dot{X}+z)}{s(\dot{X})}\right),\\ \dot{X}-2(e^{\prime}\dot{X})e&\text{otherwise}.\end{cases}

The above procedure outputs a pair (X˙,Y˙)(\dot{X},\dot{Y}) that follows a coupling of ss with itself. We then define (X,Y)=(μ1+Σ1/2X˙,μ2+Σ1/2Y˙)(X,Y)=(\mu_{1}+\Sigma^{1/2}\dot{X},\mu_{2}+\Sigma^{1/2}\dot{Y}). On the event {Y˙=X˙+z}\{\dot{Y}=\dot{X}+z\}, we have X=YX=Y. On the event {Y˙X˙+z}\{\dot{Y}\neq\dot{X}+z\}, the vector X˙2(eX˙)e\dot{X}-2(e^{\prime}\dot{X})e is the reflection of X˙\dot{X} through the hyperplane orthogonal to ee that passes through the origin. We show that the output (X,Y)(X,Y) follows a maximal coupling of pp and qq, which we refer to as a maximal coupling with reflection on the residuals, or a “reflection-maximal coupling”. First we show that Y˙\dot{Y} follows ss, closely following the argument in Bou-Rabee et al. (2018). For a measurable set BB, we compute

(Y˙B)\displaystyle\mathbb{P}(\dot{Y}\in B) =𝟙B(x+z)min(1,s(x+z)s(x))s(x)𝑑x\displaystyle=\int\mathds{1}_{B}(x+z)\min\left(1,\frac{s(x+z)}{s(x)}\right)s(x)dx
+𝟙B(x2(ex)e)max(0,1s(x+z)s(x))s(x)𝑑x.\displaystyle\quad+\int\mathds{1}_{B}(x-2(e^{\prime}x)e)\max\left(0,1-\frac{s(x+z)}{s(x)}\right)s(x)dx.

The first integral above becomes 𝟙B(w)min(s(wz),s(w))𝑑w\int\mathds{1}_{B}(w)\min\left(s(w-z),s(w)\right)dw, after a change of variables w:=x+zw:=x+z. To simplify the second integral we make the change of variables w:=x2(ex)ew:=x-2(e^{\prime}x)e. Since this corresponds to a reflection with respect to a plane orthogonal to ee, we have dw=dxdw=dx, and x=w2(ew)ex=w-2(e^{\prime}w)e, thus

𝟙B(x2(ex)e)max(0,s(x)s(x+z))𝑑x\displaystyle\int\mathds{1}_{B}(x-2(e^{\prime}x)e)\cdot\max\left(0,s(x)-s(x+z)\right)dx
=𝟙B(w)max(0,s(w)s(wz))𝑑w,\displaystyle=\int\mathds{1}_{B}(w)\cdot\max\left(0,s(w)-s(w-z)\right)dw,

where we have used s(w2(ew)e)=s(w)s(w-2(e^{\prime}w)e)=s(w) and s(w2(ew)e+z)=s(wz)s(w-2(e^{\prime}w)e+z)=s(w-z), respectively because w2(ew)e=w\|w-2(e^{\prime}w)e\|=\|w\| and w2(ew)e+z=wz\|w-2(e^{\prime}w)e+z\|=\|w-z\|. Summing the two integrals we obtain (Y˙B)=Bs(w)𝑑w{\mathbb{P}(\dot{Y}\in B)=\int_{B}s(w)dw}, so that Y˙s\dot{Y}\sim s.

To verify that the procedure corresponds to a maximal coupling of pp and qq, we observe that

(XY)\displaystyle\mathbb{P}(X\neq Y) =(Y˙X˙+z)=1min(s(x),s(x+z))𝑑x\displaystyle=\mathbb{P}(\dot{Y}\neq\dot{X}+z)=1-\int\min\left(s(x),s(x+z)\right)dx
=1min(s(Σ1/2(x~μ1)),s(Σ1/2(x~μ2)))|Σ1/2|𝑑x~,\displaystyle=1-\int\min\left(s(\Sigma^{-1/2}(\tilde{x}-\mu_{1})),s(\Sigma^{-1/2}(\tilde{x}-\mu_{2}))\right)|\Sigma^{-1/2}|d\tilde{x},

with the change of variable x~:=μ1+Σ1/2x\tilde{x}:=\mu_{1}+\Sigma^{1/2}x. The above is precisely the total variation distance between pp and qq, upon writing their densities in terms of the density of ss. Note that the computational cost associated with the above sampling technique is deterministic, in contrast with the cost of Algorithm 2.
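The reflection-maximal coupling admits an equally short sketch. Here we take Σ1/2\Sigma^{1/2} to be a Cholesky factor (a choice of ours; any square root works) and assume μ1μ2\mu_{1}\neq\mu_{2}, since the unit vector ee is undefined otherwise.

```python
import numpy as np

def reflection_maximal_coupling(mu1, mu2, sigma_half, rng):
    """One draw (X, Y) with X ~ N(mu1, Sigma) and Y ~ N(mu2, Sigma),
    where Sigma = sigma_half @ sigma_half.T; requires mu1 != mu2."""
    z = np.linalg.solve(sigma_half, mu1 - mu2)   # z = Sigma^{-1/2} (mu1 - mu2)
    e = z / np.linalg.norm(z)
    xdot = rng.standard_normal(mu1.size)
    # accept X_dot + z with probability min(1, s(X_dot + z) / s(X_dot)),
    # with s the standard Normal density, computed on the log scale
    log_ratio = 0.5 * np.sum(xdot ** 2) - 0.5 * np.sum((xdot + z) ** 2)
    if np.log(rng.uniform()) <= log_ratio:
        ydot = xdot + z                        # meeting: X = Y exactly
    else:
        ydot = xdot - 2.0 * (e @ xdot) * e     # reflect through the hyperplane orthogonal to e
    return mu1 + sigma_half @ xdot, mu2 + sigma_half @ ydot
```

On the meeting event the outputs agree exactly, since μ2+Σ1/2(X˙+z)=μ1+Σ1/2X˙\mu_{2}+\Sigma^{1/2}(\dot{X}+z)=\mu_{1}+\Sigma^{1/2}\dot{X}; for standard Normals one unit apart, the meeting probability 1dTV1-d_{\text{TV}} is about 0.62.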

Finally, for discrete distributions with common finite support, a procedure for sampling from a maximal coupling is described in Section 5.4, with a cost that is also deterministic.

4.2 Metropolis–Hastings

In Section 2.2 we described a coupling of MH chains due to Johnson (1998); we summarize the coupled kernel P¯((Xt,Yt1),)\bar{P}((X_{t},Y_{t-1}),\cdot) in the following procedure.

  1. 1.

    Sample (X,Y)|(Xt,Yt1)(X^{\star},Y^{\star})|(X_{t},Y_{t-1}) from a maximal coupling of q(Xt,)q(X_{t},\cdot) and q(Yt1,)q(Y_{t-1},\cdot).

  2. 2.

    Sample U𝒰([0,1])U\sim\mathcal{U}([0,1]).

  3. 3.

    If Umin(1,π(X)q(X,Xt)/π(Xt)q(Xt,X))U\leq\min(1,\pi(X^{\star})q(X^{\star},X_{t})/\pi(X_{t})q(X_{t},X^{\star})), then Xt+1=XX_{t+1}=X^{\star}, otherwise Xt+1=XtX_{t+1}=X_{t}.

  4. 4.

    If Umin(1,π(Y)q(Y,Yt1)/π(Yt1)q(Yt1,Y))U\leq\min(1,\pi(Y^{\star})q(Y^{\star},Y_{t-1})/\pi(Y_{t-1})q(Y_{t-1},Y^{\star})), then Yt=YY_{t}=Y^{\star}, otherwise Yt=Yt1Y_{t}=Y_{t-1}.

Here we address the verification of Assumptions 2.1-2.3 for this algorithm. Assumption 2.1 can be verified for MH chains under conditions on the target and the proposal (Nummelin, 2002; Roberts and Rosenthal, 2004). In some settings the explicit drift function given in Theorem 3.2 of Roberts and Tweedie (1996b) may be used to verify Assumption 2.2 as in Section 3.2. The probability of coupling at the next step given that the chains are in XtX_{t} and Yt1Y_{t-1} can be controlled as follows. First, the probability of proposing the same value XX^{\star} depends on the total variation distance between q(Xt,)q(X_{t},\cdot) and q(Yt1,)q(Y_{t-1},\cdot), which is typically strictly positive if XtX_{t} and Yt1Y_{t-1} are in bounded subsets of 𝒳\mathcal{X}. Furthermore, the probability of accepting XX^{\star} is often strictly positive on bounded subsets of 𝒳\mathcal{X}, for instance when π(x)>0\pi(x)>0 for all x𝒳x\in\mathcal{X}. Assumption 2.3 is satisfied by design thanks to the use of maximal couplings and common uniform variable UU in the above procedure.
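For a univariate random walk proposal, the four-step transition above can be sketched as follows; the maximal coupling of the proposals (Algorithm 2) is inlined using unnormalized Normal kernels, which is valid here because both proposals share the same normalizing constant, and a single uniform drives both acceptance steps. The function is an illustration of ours, with the target supplied through its possibly unnormalized log-density.

```python
import math
import numpy as np

def coupled_rwmh_step(xt, yt, log_pi, sigma, rng):
    """One transition of the coupled random walk MH kernel: maps (X_t, Y_{t-1})
    to (X_{t+1}, Y_t). Proposals are N(., sigma^2); once equal, the chains remain equal."""
    kernel = lambda v, m: math.exp(-0.5 * ((v - m) / sigma) ** 2)  # shared constant cancels
    # 1. maximal coupling of q(xt, .) and q(yt, .)
    xstar = rng.normal(xt, sigma)
    if rng.uniform(0.0, kernel(xstar, xt)) <= kernel(xstar, yt):
        ystar = xstar
    else:
        while True:
            ystar = rng.normal(yt, sigma)
            if rng.uniform(0.0, kernel(ystar, yt)) > kernel(ystar, xt):
                break
    # 2.-4. common uniform; the proposal is symmetric so the q-ratios drop out
    log_u = np.log(rng.uniform())
    x_next = xstar if log_u <= log_pi(xstar) - log_pi(xt) else xt
    y_next = ystar if log_u <= log_pi(ystar) - log_pi(yt) else yt
    return x_next, y_next
```

Iterating this kernel from distinct states, the two chains eventually propose and accept a common value, after which they coincide forever, illustrating faithfulness.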

Different considerations drive the choice of proposal distribution in standard MCMC and in our proposed estimators. In the case of random walk proposals with variance Σ\Sigma, larger variances lead to smaller total variation distances between q(Xt,)q(X_{t},\cdot) and q(Yt1,)q(Y_{t-1},\cdot) and thus larger probabilities of proposing identical values. However meeting events only occur if proposals are accepted, which is unlikely if Σ\Sigma is too large. This trade-off could lead to a different choice of Σ\Sigma than the optima known for the marginal chains (Roberts et al., 1997), and deserves further investigation.

We perform experiments with a dd-dimensional Normal target distribution 𝒩(0,V)\mathcal{N}(0,V), where VV is the inverse of a matrix drawn from a Wishart distribution with identity scale matrix and dd degrees of freedom. This setting, borrowed from Hoffman and Gelman (2014), yields Normal targets with strong correlations and a dense precision matrix. Below, each independent run is performed with an independent draw of VV. We consider Normal random walk proposals with variance Σ\Sigma set to V/dV/d. The division by dd heuristically follows from the scaling results of Roberts et al. (1997). We initialize the chains either from the target distribution, or from a Normal centered at (1,,1)(1,\ldots,1) with identity covariance matrix. We first couple the proposals with the maximal coupling given by Algorithm 2. The resulting average meeting times, based on 1,000 independent runs, are given in Figure 1a. The plot indicates an exponential increase of the average meeting times with the dimension, under both initialization strategies. In passing, this illustrates that meeting times can be large even if the chains marginally start at stationarity, i.e. in a setting where there is no burn-in bias.

Next we perform the same experiments with the reflection-maximal coupling described in the previous section. The results are shown in Figure 1b. The average meeting times now increase at a rate that appears closer to linear in the dimension. This is to be compared with established theoretical results on the linear scaling of standard MH with respect to the dimension (Roberts et al., 1997). A formal justification of the scaling observed in Figure 1b is an open question, and so is the design of more effective coupling strategies.

(a) Maximal coupling. (b) Reflection-maximal coupling.
Figure 1: Scaling of the average meeting time of a coupled MH algorithm with the dimension of the target 𝒩(0,V)\mathcal{N}(0,V), where VV is the inverse of a Wishart draw, as described in Section 4.2. The chains are either initialized from the target, or from a Normal 𝒩(1d,Id)\mathcal{N}(1_{d},I_{d}), where 1d1_{d} is a vector of ones (“offset” in the legend). Left: using the maximal coupling of Algorithm 2. Right: using the reflection-maximal coupling described in Section 4.1.

4.3 Gibbs sampling

Gibbs sampling is another popular class of MCMC algorithms, in which components of a Markov chain are updated alternately by sampling from the target conditional distributions (Chapter 10 of Robert and Casella, 2004), implemented e.g. in the software package JAGS (Plummer et al., 2003). In Bayesian statistics, these conditional distributions sometimes belong to a standard family such as Normal, Gamma, or Inverse Gamma. Otherwise, the conditional updates might require MH steps. We can introduce couplings in each conditional update, using either maximal couplings of the target conditionals, if these are standard distributions, or maximal couplings of the proposal distributions in MH steps targeting the target conditionals. Controlling the probability of meeting at the next step over a set, as required for the application of Proposition 3.4, can be done on a case-by-case basis. Drift conditions for Gibbs samplers also tend to rely on case-by-case arguments (see e.g. Rosenthal, 1996).

Gibbs samplers tend to perform well for targets with weak correlations between the components being updated; otherwise Gibbs chains are expected to mix poorly. We perform numerical experiments on Normal target distributions in varying dimensions to observe the effect of correlations on the meeting times of coupled Gibbs chains. For each target 𝒩(0,V)\mathcal{N}(0,V), we introduce an MH-within-Gibbs sampler, where each univariate component ii is updated with a single Metropolis step, using Normal proposals with variance Vi,iV_{i,i}. Here an iteration of the sampler refers to a complete scan of the components. Figure 2a presents the median meeting times as a function of the dimension, when VV is the inverse of a Wishart draw as in the previous section. In this highly correlated setting, the meeting times scale poorly with the dimension. The plot presents the median instead of the average, because we have stopped the runs after 500,000 iterations; the median is robust to this truncation, but not the average. We remark that shorter meeting times are obtained when initializing the chains away from the target distribution.

Next we consider a Normal target with covariance matrix VV defined by Vi,j=0.5|ij|V_{i,j}=0.5^{|i-j|}, which induces weak correlations among components; the inverse of VV is tridiagonal. In that case, the same Gibbs sampler performs much more favorably, as we can see from Figure 2b. The average meeting times seem to scale sub-linearly with the dimension, under both choices of initialization π0\pi_{0}. Couplings of other Gibbs samplers will be encountered in the numerical experiments of Section 5.

(a) With strong correlations. (b) With weak correlations.
Figure 2: Scaling of the median (left) and average (right) meeting time of a coupled Gibbs algorithm with the dimension of the target 𝒩(0,V)\mathcal{N}(0,V). The chains are either initialized from the target, or from a Normal 𝒩(1d,Id)\mathcal{N}(1_{d},I_{d}), where 1d1_{d} is a vector of ones (“offset” in the legend). Left: the covariance VV of the target is generated as the inverse of a Wishart sample, inducing strong correlations. Right: the covariance VV is defined as Vij=0.5|ij|V_{ij}=0.5^{|i-j|}, inducing weak correlations.

4.4 Coupling of other MCMC algorithms

Among extensions of the MH algorithm, Metropolis-adjusted Langevin algorithms (e.g. Roberts and Tweedie, 1996a) are characterized by the use of a proposal distribution given the current state XtX_{t} that is Normal with mean Xt+hlogπ(Xt)/2X_{t}+h\nabla\log\pi(X_{t})/2 and variance hΣh\Sigma, with tuning parameter h>0h>0 and covariance matrix Σ\Sigma. Maximal couplings or reflection-maximal couplings of the proposals could be readily implemented to obtain faithful chains. Going further in the use of gradient information, Hamiltonian or Hybrid Monte Carlo (HMC, Duane et al., 1987; Neal, 1993, 2011) is a popular MCMC algorithm for large-dimensional targets. In Heng and Jacob (2019), the framework of the present article is applied to pairs of Hamiltonian Monte Carlo chains, with a focus on the verification of Assumptions 2.1-2.3 in that context. Such couplings are analyzed in detail in Mangoubi and Smith (2017) and Bou-Rabee et al. (2018) to obtain convergence rates for the underlying chains. We refer to Heng and Jacob (2019) for more details, and provide for completeness some experiments on the Normal target described above in the supplementary materials.

The present article generalizes unbiased estimators obtained by coupling conditional particle filters in Jacob et al. (2019). These algorithms, introduced in Andrieu et al. (2010), target the distribution of latent processes given observations and fixed parameters for nonlinear state space models. The couplings of conditional particle filters in Jacob et al. (2019) involve a combination of common random numbers and maximal couplings. Couplings of particle independent Metropolis–Hastings, which is a particular case of Metropolis–Hastings with an independent proposal distribution, are simpler to design and considered in Middleton et al. (2019).

The design of generic and efficient MCMC kernels is a topic of active ongoing research (see e.g. Murray et al., 2010; Goodman et al., 2010; Pollock et al., 2016; Vanetti et al., 2017; Titsias and Yau, 2017, and references therein). Any new kernel could lead to unbiased estimators with the proposed framework, as long as appropriate couplings can be implemented.

5 Illustrations

Section 5.1 illustrates the impact of kk, mm, and the initial distribution π0\pi_{0}, identifying a situation where some care is required. Section 5.2 considers the removal of the bias from a Gibbs sampler previously considered for perfect sampling and regeneration methods. Section 5.3 introduces an Ising model and a coupling of a replica exchange algorithm, and we present experiments performed on parallel processors. Section 5.4 considers a high-dimensional variable selection example, with an MH algorithm previously shown to scale linearly with the number of variables. Finally, Section 5.5 focuses on the problem of approximating the cut distribution arising in modular inference, which illustrates the appeal of unbiased estimators beyond parallel computing.

5.1 Bimodal target

We use a bimodal target distribution and a random walk MH algorithm to illustrate our method and highlight some of its limitations. In particular, we consider a mixture of univariate Normal distributions with density π(x)=0.5𝒩(x;4,1)+0.5𝒩(x;+4,1)\pi(x)=0.5\cdot\mathcal{N}(x;-4,1)+0.5\cdot\mathcal{N}(x;+4,1), which we sample from using random walk MH with Normal proposal distributions of variance σq2=9\sigma_{q}^{2}=9. This enables regular jumps between the modes of π\pi. We set the initial distribution π0\pi_{0} to 𝒩(10,102)\mathcal{N}(10,10^{2}), so that chains are likely to start closer to the mode at +4+4 than the mode at 4-4. Over 1,0001,000 independent runs, we find that the meeting time τ\tau has an average of 2020 and a 99%99\% quantile of 105105.
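To make this setup concrete, here is a minimal Python sketch of the random walk MH sampler on this bimodal target (our own illustration; the paper itself provides no code):

```python
import numpy as np

def log_target(x):
    # log of 0.5*N(x; -4, 1) + 0.5*N(x; +4, 1), up to the shared constant
    return np.logaddexp(-0.5 * (x + 4.0) ** 2, -0.5 * (x - 4.0) ** 2)

def rwmh_chain(n_iters, sigma_q=3.0, seed=None):
    """Random walk MH with Normal proposals of standard deviation sigma_q,
    initialized from pi_0 = N(10, 10^2)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(10.0, 10.0)          # draw the initial state from pi_0
    chain = np.empty(n_iters)
    for t in range(n_iters):
        prop = x + sigma_q * rng.normal()
        # accept with probability min(1, pi(prop) / pi(x))
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop
        chain[t] = x
    return chain
```

With σq2=9\sigma_{q}^{2}=9 as above, long runs of this chain visit both modes regularly; the coupled version used for the unbiased estimators runs two such chains jointly.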

We consider the task of estimating 𝟙(x>3)π(dx)0.421\int\mathds{1}(x>3)\pi(dx)\approx 0.421. First, we consider the choice of kk and mm. Over 1,0001,000 independent experiments, we approximate the expected cost 𝔼[2(τ1)+max(1,mτ+1)]\mathbb{E}[2(\tau-1)+\max(1,m-\tau+1)], the variance 𝕍[Hk:m(X,Y)]\mathbb{V}[H_{k:m}(X,Y)], and compute the inefficiency as the product of the two (as in Section 3.1). We then divide the inefficiency by the asymptotic variance of the MCMC estimator, denoted by VV_{\infty}, which we obtain from 10610^{6} iterations and a burn-in period of 10410^{4} using the R package CODA (Plummer et al., 2006).

We present the results in Table 1. First, we see that the inefficiency is sensitive to the choice of kk and mm. Second, we see that when kk and mm are sufficiently large we can retrieve an inefficiency comparable to that of the underlying MCMC algorithm. The ideal choice of kk and mm will depend on tradeoffs between inefficiency, the desired level of parallelism, and the number of processors available. We present a histogram of the target distribution, obtained using k=200k=200, m=2,000m=2,000, in Figure 3a. The histograms in Figure 3 are produced by averaging unbiased estimators of the expectations of indicator functions of consecutive intervals. Confidence intervals at level 95%95\% are obtained from the central limit theorem and are represented as grey boxes, with vertical bars showing the point estimates.

kk mm Cost Variance Inefficiency / VV_{\infty}
1 1×k1\times k 37 4.1e+02 1878.4
1 10×k10\times k 39 3.6e+02 1703.5
1 20×k20\times k 45 3.0e+02 1624.8
100 1×k1\times k 119 9.0e+00 130.6
100 10×k10\times k 1019 2.3e-02 2.9
100 20×k20\times k 2019 7.9e-03 1.9
200 1×k1\times k 219 2.4e-01 6.5
200 10×k10\times k 2019 5.3e-03 1.3
200 20×k20\times k 4019 2.4e-03 1.2
Table 1: Cost, variance and inefficiency divided by MCMC asymptotic variance VV_{\infty}, for various choices of kk and mm, for the test function h:x𝟙(x>3)h:x\mapsto\mathds{1}(x>3), in the bimodal target example of Section 5.1.

Next, we consider a more challenging case by setting σq2=1\sigma_{q}^{2}=1, again with π0=𝒩(10,102)\pi_{0}=\mathcal{N}(10,10^{2}). These values make it difficult for the chains to jump between the modes of π\pi. Over R=1,000R=1,000 runs we find an average meeting time of 769769, with a 99%99\% quantile of 9,1869,186. When the chains start in different modes, the meeting times are often dramatically larger than when the chains start near the same mode. One can still recover accurate estimates of the target distribution, but kk and mm have to be set to larger values. With k=20,000k=20,000 and m=30,000m=30,000, we obtain the 95%95\% confidence interval [0.397,0.430][0.397,0.430] for 𝟙(x>3)π(dx)0.421\int\mathds{1}(x>3)\pi(dx)\approx 0.421. We show a histogram of π\pi in Figure 3b.

Finally we consider a third case, with σq2=1\sigma_{q}^{2}=1 as before but now with π0\pi_{0} set to 𝒩(10,1)\mathcal{N}(10,1). This initialization makes it unlikely for a chain to start near the mode at 4-4. The pair of chains typically converge around the mode at +4+4 and meet in a small number of iterations. Over R=1,000R=1,000 replications, we find an average meeting time of 99 and a 99%99\% quantile of 3535. A 95%95\% confidence interval on 𝟙(x>3)π(dx){\int\mathds{1}(x>3)\pi(dx)} obtained from the estimators with k=50k=50, m=500m=500 is [0.799,0.816][0.799,0.816], far from the true value of 0.4210.421. The associated histogram of π\pi is shown in Figure 3c.

Sampling 9,0009,000 additional estimators yields a 95%95\% confidence interval [0.353,1.595][-0.353,1.595], again using k=50k=50, m=500m=500. Among these extra 9,0009,000 values, a few correspond to cases where one chain jumped to the left-most mode before meeting the other. This resulted in large meeting times and thus a large empirical variance for Hk:mH_{k:m}. Upon noticing a large empirical variance one can then decide to use larger values of kk and mm. We conclude that although our estimators are unbiased and are consistent in the limit as RR\to\infty, poor performance of the underlying Markov chains combined with ill-chosen initializations can still produce misleading results for any finite RR, such as 1,0001,000 in this example.

Refer to caption
(a) σq2=32\sigma_{q}^{2}=3^{2} and π0=𝒩(10,102)\pi_{0}=\mathcal{N}(10,10^{2}).
Refer to caption
(b) σq2=12\sigma_{q}^{2}=1^{2} and π0=𝒩(10,102)\pi_{0}=\mathcal{N}(10,10^{2}).
Refer to caption
(c) σq2=12\sigma_{q}^{2}=1^{2} and π0=𝒩(10,12)\pi_{0}=\mathcal{N}(10,1^{2}).
Figure 3: Histograms of the mixture target distribution of Section 5.1, obtained with the proposed unbiased estimators, based on a Normal random walk MH algorithm, with a proposal variance σq2\sigma_{q}^{2} and an initial distribution π0\pi_{0}, over R=1,000R=1,000 experiments. The target density function is overlaid in solid lines.

5.2 Gibbs sampler for nuclear pump failure data

Next we consider a classic Gibbs sampler for a model of pump failure counts, used e.g. in Murdoch and Green (1998) to illustrate perfect samplers for continuous distributions, and in Mykland et al. (1995) to illustrate their regeneration approach. Here we focus on a comparison with the regeneration approach, which was motivated by practical concerns similar to those of this paper: avoiding an arbitrary choice of burn-in, constructing confidence intervals on the expectations of interest, and making principled use of parallel processors. Mykland et al. (1995) show how to construct regeneration times – random times between which the chain forms independent and identically distributed “tours” – and define a consistent estimator for arbitrary test functions, whose asymptotic variance takes a simple form. The estimator is then obtained by aggregating over these independent tours.

The data consist of operating times (tn)n=1K(t_{n})_{n=1}^{K} and failure counts (sn)n=1K(s_{n})_{n=1}^{K} for K=10K=10 pumps at the Farley-1 nuclear power station, as first described in Gaver and O’Muircheartaigh (1987). The model specifies snPoisson(λntn)s_{n}\sim\text{Poisson}(\lambda_{n}t_{n}) and λnGamma(α,β)\lambda_{n}\sim\text{Gamma}(\alpha,\beta), where α=1.802\alpha=1.802, βGamma(γ,δ)\beta\sim\text{Gamma}\left(\gamma,\delta\right), γ=0.01\gamma=0.01, and δ=1\delta=1. The Gibbs sampler for this model consists of the following update steps:

λn| rest\displaystyle\lambda_{n}\ |\text{ rest} Gamma(α+sn,β+tn)for n=1,,K,\displaystyle\sim\text{Gamma}(\alpha+s_{n},\beta+t_{n})\quad\text{for }n=1,\ldots,K,
β| rest\displaystyle\beta\ |\text{ rest} Gamma(γ+10α,δ+n=1Kλn).\displaystyle\sim\text{Gamma}(\gamma+10\alpha,\delta+\sum_{n=1}^{K}\lambda_{n}).

Here Gamma(α,β)\text{Gamma}(\alpha,\beta) refers to the distribution with density xΓ(α)1βαxα1exp(βx)x\mapsto\Gamma(\alpha)^{-1}\beta^{\alpha}x^{\alpha-1}\exp(-\beta x). We initialize all parameter values to 1 (the initialization is not specified in Mykland et al. (1995)). To form our estimator we apply maximal couplings at each conditional update of the Gibbs sampler, as described in Section 4.3.
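A minimal Python sketch of the single-chain Gibbs sampler follows, using the standard pump data values reported in the literature for Gaver and O’Muircheartaigh (1987) (our own illustration; note the rate-vs-scale parameterization):

```python
import numpy as np

# Pump failure data: operating times t_n and failure counts s_n for the
# K = 10 pumps, as commonly reproduced from Gaver and O'Muircheartaigh (1987).
t = np.array([94.32, 15.72, 62.88, 125.76, 5.24, 31.44, 1.048, 1.048, 2.096, 10.48])
s = np.array([5.0, 1.0, 5.0, 14.0, 3.0, 19.0, 1.0, 1.0, 4.0, 22.0])
alpha, gamma_, delta = 1.802, 0.01, 1.0

def gibbs_sweep(beta, rng):
    """One sweep: update each lambda_n given beta, then beta given the lambdas.
    numpy's gamma sampler takes (shape, scale), so scale = 1 / rate."""
    lam = rng.gamma(alpha + s, 1.0 / (beta + t))
    return rng.gamma(gamma_ + 10.0 * alpha, 1.0 / (delta + lam.sum()))

rng = np.random.default_rng(0)
beta, draws = 1.0, []              # all parameter values initialized to 1
for i in range(6000):
    beta = gibbs_sweep(beta, rng)
    if i >= 1000:                  # discard a short burn-in
        draws.append(beta)
```

The resulting posterior mean of β\beta should be close to the value of 2.47 reported below. The coupled version replaces each conditional draw with a draw from a maximal coupling of the two conditionals.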

We begin by drawing 1,0001,000 meeting times independently. Following the guidelines of Section 3.1, we set k=7k=7, corresponding to the 99%99\% quantile of τ\tau and m=10k=70m=10\cdot k=70. For the regeneration approach, Mykland et al. (1995) gives a set of tuning parameters which we adopt below. Applying the regeneration approach to 1,000 Gibbs sampler runs of 5,000 iterations each, we observe on average 1,996 complete tours per run with an average length of 2.50 iterations per tour. These values agree with the count of 1,967 tours of average length 2.56 reported in Mykland et al. (1995). We observe a posterior mean estimate for β\beta of 2.47 with a variance of 1.89×1041.89\times 10^{-4} over the 1,000 independent runs, which implies an efficiency value of (5,0001.89×104)1=1.06(5,000\cdot 1.89\times 10^{-4})^{-1}=1.06. This exceeds the efficiency of 0.940.94 achieved by our estimator with the choice of k=7k=7 and m=70m=70. On the other hand, the regeneration approach often requires more extensive analytical work with the underlying Markov chain; we refer to Mykland et al. (1995) for a detailed description. For reference, the underlying Gibbs sampler achieves an efficiency of 1.081.08, based on a long run of 5×1055\times 10^{5} iterations and a burn-in of 10310^{3} iterations. More extensive comparisons with other regeneration approaches such as that of Brockwell and Kadane (2005) would deserve investigation.

5.3 Ising model

We consider an Ising model on a 32×3232\times 32 square lattice with periodic boundaries. This provides a setting where a basic MCMC sampler can mix slowly depending on an inverse temperature parameter θ\theta, and where a replica exchange strategy as in Geyer (1991) can be helpful. We also use this example to illustrate the use of our estimators on a large computing cluster, with the considerations reviewed in Section 3.3. For ii and jj in {1,,32}2\{1,\ldots,32\}^{2} we write iji\sim j if ii and jj are neighbors in the square lattice with periodic boundaries. We write xi{1,+1}x_{i}\in\{-1,+1\} for the spin at location ii, and x={xi}x=\{x_{i}\} for the full grid. We write t(x)t(x) for the “natural statistic” t(x)=0.5i{1,,32}2jixixj{t(x)=0.5\sum_{i\in\{1,\ldots,32\}^{2}}\sum_{j\sim i}x_{i}x_{j}} summing the products of pairs of neighbors. The 0.50.5 multiplier here results in each pair of neighboring sites only being counted once. Under the model, the probability associated with a grid xx is πθ(x)exp(θt(x))\pi_{\theta}(x)\propto\exp(\theta t(x)), where θ>0\theta>0 denotes an inverse temperature parameter that calibrates the degree of correlation between neighboring sites.

We consider a single-site Gibbs sampler, called a heat bath algorithm in this context, to approximate the distribution πθ\pi_{\theta} given a value of θ\theta. One iteration of the algorithm consists of a sweep through all the locations i{1,,32}2i\in\{1,\ldots,32\}^{2}. For each ii we draw xix_{i} from its conditional distribution under πθ\pi_{\theta} given all the other spins. It can be checked that the conditional probability of {xi=+1}\{x_{i}=+1\} given the other spins equals exp(θsi)/(exp(θsi)+exp(θsi))\exp(\theta s_{i})/(\exp(\theta s_{i})+\exp(-\theta s_{i})), where sis_{i} denotes the sum of spins over the four neighbors of ii. We initialize the chains by drawing spins uniformly in {1,+1}\{-1,+1\} at each site, independently across sites.
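One sweep of the heat bath algorithm can be sketched as follows (our own Python illustration, following the description above; not code from the paper):

```python
import numpy as np

def heat_bath_sweep(x, theta, rng):
    """One sweep of single-site Gibbs (heat bath) updates over an L x L
    grid of spins in {-1, +1} with periodic boundaries."""
    L = x.shape[0]
    for i in range(L):
        for j in range(L):
            # s_i: sum of the four neighbouring spins, with wrap-around
            s_i = (x[(i - 1) % L, j] + x[(i + 1) % L, j] +
                   x[i, (j - 1) % L] + x[i, (j + 1) % L])
            p_plus = np.exp(theta * s_i) / (np.exp(theta * s_i) + np.exp(-theta * s_i))
            x[i, j] = 1 if rng.uniform() < p_plus else -1
    return x

rng = np.random.default_rng(2)
# uniform initialization, independently across sites
x = rng.choice([-1, 1], size=(32, 32))
for _ in range(10):
    x = heat_bath_sweep(x, theta=0.3, rng=rng)
```

Coupling two such chains amounts to replacing each single-site draw with a draw from the maximal coupling of the two conditional Bernoulli distributions.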

A simple strategy to couple heat bath chains consists of sampling from the maximal coupling of each conditional distribution. For a grid of θ\theta values from 0.30.3 to 0.480.48, we run 100 pairs of chains until they meet. We then plot the average meeting time as a function of θ\theta in Figure 4a, noting that the average meeting time increases sharply to values above 10610^{6} as θ\theta approaches its critical value (see the related discussion in Propp and Wilson (1996)). We conclude that it would be expensive to produce unbiased estimators based on the heat bath algorithm for values of θ\theta above 0.480.48, for reasons related to the behavior of the underlying algorithm.

There are several ways to address the degeneracy of the heat bath algorithm as θ\theta increases. Specialized algorithms have been proposed to jointly update groups of spins (Swendsen and Wang, 1987; Wolff, 1989). Here, we consider an approach based on an ensemble of NN chains that regularly exchange their states, a technique often termed replica exchange or parallel tempering. Following e.g. Geyer (1991), we introduce NN chains, x(1)x^{(1)}, …, x(N)x^{(N)}, with each x(n)x^{(n)} targeting πθ(n)\pi_{\theta^{(n)}} with different values of θ(n)\theta^{(n)} ordered as θ(1)<<θ(N)\theta^{(1)}<\ldots<\theta^{(N)}. Each iteration of the algorithm proceeds as follows. With probability pswap(0,1)p_{\text{swap}}\in(0,1), for n{1,,N1}n\in\{1,\ldots,N-1\} (sequentially), we propose exchanging the states x(n)x^{(n)} and x(n+1)x^{(n+1)} corresponding to θ(n)\theta^{(n)} and θ(n+1)\theta^{(n+1)}. We accept this swap with probability min(1,πθ(n)(x(n+1))πθ(n+1)(x(n))/(πθ(n)(x(n))πθ(n+1)(x(n+1))))\min(1,\pi_{\theta^{(n)}}(x^{(n+1)})\pi_{\theta^{(n+1)}}(x^{(n)})/(\pi_{\theta^{(n)}}(x^{(n)})\pi_{\theta^{(n+1)}}(x^{(n+1)}))), which simplifies to min(1,exp((θ(n)θ(n+1))(t(x(n+1))t(x(n))))){\min(1,\exp((\theta^{(n)}-\theta^{(n+1)})(t(x^{(n+1)})-t(x^{(n)}))))}. Otherwise we perform a full sweep of single-site Gibbs updates, independently across chains.
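The natural statistic and the sequential swap step can be sketched as follows (our own Python illustration; computing t(x)t(x) via two shifted copies of the grid is an implementation choice that counts each neighbouring pair exactly once):

```python
import numpy as np

def natural_statistic(x):
    """t(x) on a periodic grid: shifting once along each axis enumerates
    every neighbouring pair exactly once."""
    return float(np.sum(x * np.roll(x, 1, axis=0)) +
                 np.sum(x * np.roll(x, 1, axis=1)))

def swap_sweep(chains, thetas, rng):
    """Sequentially propose exchanging the states of replicas n and n+1,
    accepting with probability
    min(1, exp((theta_n - theta_{n+1}) (t(x_{n+1}) - t(x_n))))."""
    for n in range(len(chains) - 1):
        log_ratio = (thetas[n] - thetas[n + 1]) * (
            natural_statistic(chains[n + 1]) - natural_statistic(chains[n]))
        if np.log(rng.uniform()) < log_ratio:
            chains[n], chains[n + 1] = chains[n + 1], chains[n]
    return chains
```

For the all-ones 4×44\times 4 grid, each of the 16 sites has four agreeing neighbours, so t(x)=0.5164=32t(x)=0.5\cdot 16\cdot 4=32; a checkerboard gives 32-32.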

A coupling of this algorithm involves a pair of ensembles with NN chains each; the two ensembles are identical if chain nn in the first ensemble equals chain nn in the second ensemble, for all n{1,,N}n\in\{1,\ldots,N\}. We use common random numbers to decide whether to perform swap moves or single-site Gibbs moves, and whether to accept the proposed states in the event of a swap move. In the event of a single-site Gibbs move, we maximally couple each conditional update.

Throughout the following experiments we use pswap=0.01p_{\text{swap}}=0.01, and introduce an equally spaced grid of θ\theta values from θ(1)=0.3\theta^{(1)}=0.3 to θ(N)=0.55\theta^{(N)}=0.55 for several different choices of NN. We note that these grids include θ\theta values at which we have seen that the single-site Gibbs sampler mixes poorly. Figure 4b shows the resulting average meeting times over 100 independent runs, as a function of the number of chains NN. The average meeting time first decreases with the number of chains, but then increases again. A possible explanation is that the mixing of the chains first improves as NN increases, and then stabilizes; on the other hand it becomes harder for the ensembles to meet when NN increases since all chains in the ensembles have to meet. The minimum average meeting time is here attained for N=16N=16 chains per ensemble.

Refer to caption
(a) Single-site Gibbs sampler.
Refer to caption
(b) Replica exchange, with θ(1)=0.3\theta^{(1)}=0.3 and θ(N)=0.55\theta^{(N)}=0.55.
Figure 4: For the Ising model of Section 5.3 on a 32×3232\times 32 grid, average meeting times corresponding to different coupled Markov chains. Left: coupled single-site Gibbs sampler, for different inverse temperatures θ\theta. Right: coupled replica exchange algorithm with NN chains, on a grid of values θ(1)<<θ(N)\theta^{(1)}<\ldots<\theta^{(N)} with θ(1)=0.3\theta^{(1)}=0.3 and θ(N)=0.55\theta^{(N)}=0.55, for different values of NN.

Setting N=16N=16, k=105k=10^{5} and m=2×105m=2\times 10^{5} we now illustrate the use of the proposed unbiased estimators on a cluster. The test function is taken as xt(x)x\mapsto t(x) defined above, so that we estimate xt(x)πθ(x)\sum_{x}t(x)\pi_{\theta}(x) for different values of θ\theta. We use 500500 processors to generate unbiased estimates with a time budget of 3030 minutes. Within that time, each processor generated between 11 and 77 estimators, with an average of 3.73.7 estimators per processor and a total of 18581858 estimators. The chronology of the generation of these estimates is illustrated in Figure 5a. For each processor, horizontal segments of different colours indicate the duration associated with each estimator. The final estimates with standard errors are shown in Figure 5b, where we can see that the standard errors are very small compared to the values of the estimates, for each value of θ\theta. These standard errors were computed as σ^1(P,t)/P\hat{\sigma}_{1}(\mathrm{P},t)/\sqrt{\mathrm{P}} following (3.4), the CLT corresponding to the large processor count limit.

Refer to caption
(a) Chronology of the generation of unbiased estimators on 500 parallel processors over 3030 minutes.
Refer to caption
(b) Estimates (top) and standard errors (bottom) for xt(x)πθ(x)\sum_{x}t(x)\pi_{\theta}(x) at different θ\theta.
Figure 5: For the Ising model of Section 5.3 on a 32×3232\times 32 grid: on the left, chronology of the generation of unbiased estimators on 500500 processors over 3030 minutes. Right: estimates (top) of the expected natural statistic, xt(x)πθ(x)\sum_{x}t(x)\pi_{\theta}(x), and standard errors (bottom), for a grid of 1616 values 0.3=θ(1)<<θ(N)=0.550.3=\theta^{(1)}<\ldots<\theta^{(N)}=0.55. Results obtained by coupling a replica exchange algorithm with 1616 chains.

5.4 Variable selection

For our next example we consider a variable selection problem following Yang et al. (2016) to illustrate the scaling of our proposed method on high-dimensional discrete state spaces. For integers pp and nn, let YnY\in\mathbb{R}^{n} represent a response variable depending on covariates X1,,Xpn{X_{1},\ldots,X_{p}\in\mathbb{R}^{n}}. We consider the task of inferring a binary vector γ{0,1}p\gamma\in\{0,1\}^{p} representing which covariates to select as predictors of YY, with the convention that XiX_{i} is selected if γi=1\gamma_{i}=1. For any γ\gamma, we write |γ|=i=1pγi|\gamma|=\sum_{i=1}^{p}\gamma_{i} for the number of selected covariates and XγX_{\gamma} for the n×|γ|n\times|\gamma| matrix of covariates chosen by γ\gamma. Inference on γ\gamma proceeds by way of a linear regression model relating YY to XγX_{\gamma}, namely Y=Xγβγ+wY=X_{\gamma}\beta_{\gamma}+w with w𝒩(0,σ2In)w\sim\mathcal{N}(0,\sigma^{2}I_{n}).

We assume a prior on γ\gamma of π(γ)pκ|γ|𝟙(|γ|s0)\pi(\gamma)\propto p^{-\kappa|\gamma|}\mathds{1}(|\gamma|\leq s_{0}). This distribution puts mass only on vectors γ\gamma with fewer than s0s_{0} ones, imposing a degree of sparsity. Given γ\gamma we assume a Normal prior for the regression coefficient vector βγ|γ|\beta_{\gamma}\in\mathbb{R}^{|\gamma|} with zero mean and variance gσ2(XγXγ)1g\sigma^{2}(X_{\gamma}^{\prime}X_{\gamma})^{-1}. Finally, we give the precision σ2\sigma^{-2} an improper prior π(σ2)1/σ2\pi(\sigma^{-2})\propto 1/\sigma^{-2}. This leads to the marginal likelihood

π(Y|X,γ)\displaystyle\pi(Y|X,\gamma) (1+g)|γ|/2(1+g(1Rγ2))n/2,whereRγ2=YXγ(XγXγ)1XγYYY.\displaystyle\propto\frac{(1+g)^{-|\gamma|/2}}{(1+g(1-R^{2}_{\gamma}))^{n/2}},\quad\text{where}\quad R_{\gamma}^{2}=\frac{Y^{\prime}X_{\gamma}(X^{\prime}_{\gamma}X_{\gamma})^{-1}X^{\prime}_{\gamma}Y}{Y^{\prime}Y}.

To approximate the distribution π(γ|X,Y)\pi(\gamma|X,Y), Yang et al. (2016) employ an MCMC algorithm whose kernel PP is a mixture of two Metropolis kernels. The first component P1(γ,)P_{1}(\gamma,\cdot) selects a coordinate i{1,,p}i\in\{1,\ldots,p\} uniformly at random and flips γi\gamma_{i} to 1γi1-\gamma_{i}. The resulting vector γ\gamma^{\star} is then accepted with probability 1π(γ|X,Y)/π(γ|X,Y)1\wedge\pi(\gamma^{\star}|X,Y)/\pi(\gamma|X,Y), where aba\wedge b denotes min(a,b)\min(a,b) for a,ba,b\in\mathbb{R}. Sampling a vector γ\gamma^{\prime} from the second kernel P2(γ,)P_{2}(\gamma,\cdot) proceeds as follows. If |γ||\gamma| equals zero or pp, then γ\gamma^{\prime} is set to γ\gamma. Otherwise, coordinates i0i_{0}, i1i_{1} are drawn uniformly among {j:γj=0}\{j:\gamma_{j}=0\} and {j:γj=1}\{j:\gamma_{j}=1\}, respectively. The proposal γ\gamma^{\star} has γi0=γi1\gamma^{\star}_{i_{0}}=\gamma_{i_{1}}, γi1=γi0\gamma^{\star}_{i_{1}}=\gamma_{i_{0}}, and γj=γj\gamma^{\star}_{j}=\gamma_{j} for the other components. Then γ\gamma^{\prime} is set to γ\gamma^{\star} with probability 1π(γ|X,Y)/π(γ|X,Y)1\wedge\pi(\gamma^{\star}|X,Y)/\pi(\gamma|X,Y), and to γ\gamma otherwise. The MCMC kernel P(γ,)P(\gamma,\cdot) targets π(γ|X,Y)\pi(\gamma|X,Y) by sampling from P1(γ,)P_{1}(\gamma,\cdot) or from P2(γ,)P_{2}(\gamma,\cdot) with equal probability. Note that each MCMC iteration can only benefit from parallel processors to a limited extent, since |γ||\gamma| is always less than s0s_{0}, itself chosen to be a small value; thus the calculation of Rγ2R_{\gamma}^{2} only involves linear algebra of small matrices.
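For illustration, a Python sketch of the log target density and the single-flip kernel P1P_{1} follows (our own code; the helper names and the use of least squares to compute Rγ2R_{\gamma}^{2} are implementation choices, not from Yang et al. (2016)):

```python
import numpy as np

def log_posterior(gamma, X, Y, g=None, kappa=2.0, s0=100):
    """log pi(gamma | X, Y) up to an additive constant: marginal likelihood
    times the prior pi(gamma) proportional to p^(-kappa |gamma|) 1(|gamma| <= s0)."""
    n, p = X.shape
    g = p ** 3 if g is None else g
    k = int(gamma.sum())
    if k > s0:
        return -np.inf
    if k == 0:
        R2 = 0.0
    else:
        Xg = X[:, gamma.astype(bool)]
        coef, *_ = np.linalg.lstsq(Xg, Y, rcond=None)
        # Y' Xg (Xg' Xg)^{-1} Xg' Y / Y'Y, via the least-squares fit
        R2 = Y @ (Xg @ coef) / (Y @ Y)
    return (-0.5 * k * np.log(1.0 + g)
            - 0.5 * n * np.log(1.0 + g * (1.0 - R2))
            - kappa * k * np.log(p))

def flip_move(gamma, X, Y, rng, **kw):
    """Kernel P1: flip one uniformly chosen coordinate and accept
    with probability min(1, pi(gamma*)/pi(gamma))."""
    i = rng.integers(gamma.size)
    prop = gamma.copy()
    prop[i] = 1 - prop[i]
    if np.log(rng.uniform()) < (log_posterior(prop, X, Y, **kw)
                                - log_posterior(gamma, X, Y, **kw)):
        return prop
    return gamma
```

The swap kernel P2P_{2} is analogous, with the proposal exchanging a selected zero and a selected one.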

We consider the following strategy to couple the above MCMC algorithm. To sample a pair of states (γ,γ~)(\gamma^{\prime},\tilde{\gamma}^{\prime}) given (γ,γ~)(\gamma,\tilde{\gamma}), we first use a common uniform random variable to decide whether to sample from a coupling P¯1\bar{P}_{1} of P1P_{1} to itself or a coupling P¯2\bar{P}_{2} of P2P_{2} to itself. The coupled kernel P¯1((γ,γ~),)\bar{P}_{1}((\gamma,\tilde{\gamma}),\cdot) proposes flipping the same coordinate for both vectors γ\gamma and γ~\tilde{\gamma} and then uses a common uniform random variable in the acceptance step. For the coupled kernel P¯2((γ,γ~),)\bar{P}_{2}((\gamma,\tilde{\gamma}),\cdot), we need to select two pairs of indices, (i0,i~0)(i_{0},\tilde{i}_{0}) and (i1,i~1)(i_{1},\tilde{i}_{1}). We obtain the first pair by sampling from a maximal coupling of the discrete uniform distributions on {j:γj=0}\{j:\gamma_{j}=0\} and {j:γ~j=0}\{j:\tilde{\gamma}_{j}=0\}. This yields indices (i0,i~0)(i_{0},\tilde{i}_{0}) with the greatest possible probability that i0=i~0i_{0}=\tilde{i}_{0}. We use the same approach to sample a pair (i1,i~1)(i_{1},\tilde{i}_{1}) to maximize the probability that i1=i~1i_{1}=\tilde{i}_{1}. Finally we use a common uniform variable to accept or reject the proposals. If either vector γ\gamma or γ~\tilde{\gamma} has no zeros or no ones, then it is kept unchanged.

We recall that one can sample from a maximal coupling of two discrete probability distributions q=(q1,,qN)q=(q_{1},\ldots,q_{N}) and q~=(q~1,,q~N)\tilde{q}=(\tilde{q}_{1},\ldots,\tilde{q}_{N}) as follows. First, let c=(c1,,cN)c=(c_{1},\dots,c_{N}) be the distribution with probabilities cn=(qnq~n)/αc_{n}=(q_{n}\wedge\tilde{q}_{n})/\alpha for α=n=1Nqnq~n\alpha=\sum_{n=1}^{N}q_{n}\wedge\tilde{q}_{n} and define residual distributions qq^{\prime} and q~\tilde{q}^{\prime} with probabilities qn=(qnαcn)/(1α)q^{\prime}_{n}=(q_{n}-\alpha c_{n})/(1-\alpha) and q~n=(q~nαcn)/(1α)\tilde{q}^{\prime}_{n}=(\tilde{q}_{n}-\alpha c_{n})/(1-\alpha). Then with probability α\alpha, draw ici\sim c and output (i,i)(i,i). Otherwise draw iqi\sim q^{\prime} and i~q~\tilde{i}\sim\tilde{q}^{\prime} and output (i,i~)(i,\tilde{i}). The resulting pair follows a maximal coupling of qq and q~\tilde{q}, since (i=i~)=α=1dTV(q,q~)\mathbb{P}(i=\tilde{i})=\alpha=1-d_{\text{TV}}(q,\tilde{q}), and marginally (i=n)=αcn+(1α)qn=qn\mathbb{P}(i=n)=\alpha c_{n}+(1-\alpha)q^{\prime}_{n}=q_{n}, and likewise for (i~=n)\mathbb{P}(\tilde{i}=n), for all n{1,,N}n\in\{1,\ldots,N\}. The procedure involves 𝒪(N)\mathcal{O}(N) operations for NN the size of the state space.
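This procedure translates directly into code; a short Python sketch (our own illustration):

```python
import numpy as np

def coupled_discrete(q, qt, rng):
    """Draw (i, itilde) from a maximal coupling of the discrete
    distributions q and qt, following the two-step procedure above."""
    q, qt = np.asarray(q, float), np.asarray(qt, float)
    overlap = np.minimum(q, qt)          # alpha * c_n, entrywise minimum
    alpha = overlap.sum()                # alpha = 1 - d_TV(q, qt)
    if rng.uniform() < alpha:
        i = rng.choice(q.size, p=overlap / alpha)
        return i, i                      # the two draws coincide
    # otherwise draw independently from the residual distributions q', qt'
    i = rng.choice(q.size, p=(q - overlap) / (1.0 - alpha))
    it = rng.choice(qt.size, p=(qt - overlap) / (1.0 - alpha))
    return i, it
```

For instance, with q=(0.5,0.5,0)q=(0.5,0.5,0) and q~=(0,0.5,0.5)\tilde{q}=(0,0.5,0.5), the overlap is α=0.5\alpha=0.5, so the two outputs should be equal about half of the time.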

We now consider an experiment like those of Yang et al. (2016). We define

β=SNRσ02log(p)n(2,3,2,2,3,3,2,3,2,3,0,,0)p,\beta^{\star}=\text{SNR}\sqrt{\sigma_{0}^{2}\frac{\log(p)}{n}}(2,-3,2,2,-3,3,-2,3,-2,3,0,\ldots,0)^{\prime}\in\mathbb{R}^{p},

and generate YY given XX and β\beta^{\star} from the model with σ2=1\sigma^{2}=1, σ02=1\sigma_{0}^{2}=1, n{500,1000}n\in\{500,1000\}, p{1000,5000}p\in\{1000,5000\}, and signal-to-noise parameter SNR{0.5,1,2}\text{SNR}\in\{0.5,1,2\}. We also set s0=100s_{0}=100, g=p3g=p^{3}, and κ=2\kappa=2 (exactly as in Yang et al. (2016); the value of κ\kappa was obtained by personal communication) and generate the covariates XX using a multivariate normal distribution with covariance matrix Σ\Sigma either equal to a unit diagonal matrix or with entries Σij=exp(|ij|)\Sigma_{ij}=\exp(-|i-j|). We refer to these two cases as the independent design and correlated design cases, respectively. We draw from the initial distribution π0\pi_{0} by creating a vector of pp zeros, sampling s0s_{0} coordinates uniformly from {1,,p}\{1,\ldots,p\} without replacement, and setting the corresponding entries to 11 with probability 0.50.5.

For different values of nn, pp and SNR, and the two types of design, we run coupled chains 100 times independently until they meet. We report the average meeting times in Tables 2 and 3. The average meeting times are of the order of 10410^{4} to 10510^{5}, depending on the problem; the maximum is attained in the correlated design at n=500,p=1000,SNR=2n=500,p=1000,\text{SNR}=2. In contrast with this, the experiments in Yang et al. (2016) identify the scenario n=500,p=5000,SNR=1n=500,p=5000,\text{SNR}=1 as the most challenging one. This discrepancy deserves further study; it could be due to variations from one synthetic data set to another, or to differences in the criteria being reported.

n p SNR = 0.5 SNR = 1 SNR = 2
500 1000 4937 7586 6031
500 5000 24634 25602 38083
1000 1000 4729 5893 4892
1000 5000 23407 46398 24712
Table 2: Average meeting times in the independent design.
n p SNR = 0.5 SNR = 1 SNR = 2
500 1000 5536 5485 216996
500 5000 27535 28756 29083
1000 1000 4921 5451 5613
1000 5000 24101 29215 23043
Table 3: Average meeting times in the correlated design.

To illustrate the impact of dimension, we focus on the independent design setting with n=500n=500 and SNR=1\text{SNR}=1, and consider values of pp between 100100 and 10001000. For each value of pp, we run coupled chains 1,0001,000 times independently until they meet. We present violin plots representing the distributions of meeting times divided by pp in Figure 6a. The distribution of scaled meeting times appears to be approximately constant as a function of pp, suggesting that meeting times increase linearly in pp. This is consistent with the findings of Yang et al. (2016), where mixing times are shown to increase linearly in pp.

Refer to caption
(a) Meeting times divided by pp in violin plots.
Refer to caption
(b) Inclusion probabilities of the first 2020 variables.
Figure 6: Left: meeting times divided by pp for p{100,250,500,750,1000}p\in\{100,250,500,750,1000\} and n=500n=500, SNR=1\text{SNR}=1, in the variable selection example of Section 5.4 with independent design, based on R=1,000R=1,000 independent repeats. The violins represent the distributions of scaled meeting times for different pp. Right: posterior probabilities of inclusion for the first 2020 variables, in the setting n=500n=500, p=1,000p=1,000, SNR=1\text{SNR}=1, for three different values of the prior hyperparameter κ\kappa. The error bars representing 95%95\% confidence intervals were obtained after 120 minutes of calculation on 600 processors, using k=75,000k=75,000 and m=150,000m=150,000. The solid lines represent standard MCMC estimates based on 10 independent chains of length 10610^{6}.

Focusing now on the independent design case with n=500n=500, p=1,000p=1{,}000, and SNR=1\text{SNR}=1 we consider different values of the prior hyperparameter κ\kappa in {0.1,1,2}\{0.1,1,2\}. We set k=75,000k=75{,}000 and m=150,000m=150{,}000 and generate unbiased estimators on a cluster for 120120 minutes, using 200200 processors for each value of κ\kappa, and so 600600 processors in total. The test function is chosen so that the estimand π(h)\pi(h) is the vector of inclusion probabilities (γi=1|X,Y)\mathbb{P}(\gamma_{i}=1|X,Y) for i{1,,20}i\in\{1,\ldots,20\}. Within the time budget, 39,28239,282 estimates were produced, with each processor producing between 8 and 181 of these. The largest observed meeting time was 81,42381,423. The meeting times were similar across the three values of κ\kappa.

Figure 6b shows the results in the form of 95%95\% confidence intervals shown as error bars, using (3.4), the CLT relevant when the time budget is fixed and the number of processors grows large. We observe that κ\kappa has a strong impact on the probability of including the first 10 variables in this setting, and that the most satisfactory results are obtained for κ=0.1\kappa=0.1 rather than for κ=2\kappa=2, recalling that β\beta^{\star} has non-zero entries in its first 10 components. Note that the error bars are narrow but still noticeable, particularly for κ=0.1\kappa=0.1. On the same figure, the solid lines represent estimates obtained with 10 independent MCMC runs with 10610^{6} iterations each, discarding the first 10510^{5} iterations as burn-in. These MCMC estimates present noticeable variability in spite of the large number of iterations. In a standard MCMC setting, we might run chains for more iterations until the estimates agree across independent runs. In the proposed framework, we increase the precision by generating more independent unbiased estimators without necessarily modifying kk or mm.

Figure 6b suggests that the variable selection procedure considered here is sensitive to the prior hyperparameter κ\kappa; we refer to Yang et al. (2016), and to Johnson (2013); Nikooienejad et al. (2016) for related discussions on Bayesian variable selection in high dimension and convergence of MCMC.

5.5 Cut distribution

Finally, our proposed estimator can be used to approximate the cut distribution, which poses a significant challenge for existing MCMC methods (Plummer, 2014; Jacob et al., 2017). This illustrates another appeal of the unbiasedness property, beyond the motivation for parallel computation.

Consider two models, one with parameters θ1\theta_{1} and data Y1Y_{1} and another with parameters θ2\theta_{2} and data Y2Y_{2}, where the likelihood of Y2Y_{2} might depend on both θ1\theta_{1} and θ2\theta_{2}. For instance the first model could be a regression with data Y1Y_{1} and coefficients θ1\theta_{1}, and the second model could be another regression whose covariates are the residuals, coefficients, or fitted values of the first regression (Pagan, 1984; Murphy and Topel, 2002). In principle one could introduce an encompassing model and conduct joint inference on θ1\theta_{1} and θ2\theta_{2} via the posterior distribution. In that case, misspecification of either model would lead to misspecification of the encompassing model and thus to a misleading quantification of uncertainty, as noted in several studies (e.g. Liu et al., 2009; Plummer, 2014; Lunn et al., 2009; McCandless et al., 2010; Zigler, 2016; Blangiardo et al., 2011).

The cut distribution (Spiegelhalter et al., 2003; Plummer, 2014) allows the propagation of uncertainty about θ1\theta_{1} to inference on θ2\theta_{2} while preventing misspecification in the second model from affecting estimation in the first. The cut distribution is defined as

πcut(θ1,θ2)=π1(θ1)π2(θ2|θ1).\pi_{\text{cut}}\left(\theta_{1},\theta_{2}\right)=\pi_{1}(\theta_{1})\pi_{2}(\theta_{2}|\theta_{1}).

Here π1(θ1)\pi_{1}(\theta_{1}) refers to the distribution of θ1\theta_{1} given Y1Y_{1} in the first model alone, and π2(θ2|θ1)\pi_{2}(\theta_{2}|\theta_{1}) refers to the distribution of θ2\theta_{2} given Y2Y_{2} and θ1\theta_{1} in the second model. Often, the density π2(θ2|θ1)\pi_{2}(\theta_{2}|\theta_{1}) can only be evaluated up to a constant in θ2\theta_{2}, which may vary with θ1\theta_{1}. This makes the cut distribution difficult to approximate with MCMC algorithms (Plummer, 2014).

A naive approach consists of first running an MCMC algorithm targeting π1(θ1)\pi_{1}(\theta_{1}) to obtain a sample (θ1n)n=1N1(\theta_{1}^{n})_{n=1}^{N_{1}}, perhaps after discarding a burn-in period and thinning the chain. Then for each θ1n\theta_{1}^{n}, one can run an MCMC algorithm targeting π2(θ2|θ1n)\pi_{2}(\theta_{2}|\theta_{1}^{n}), yielding N2N_{2} samples. One might again discard some burn-in and thin the chains, or just keep the final state of each chain. The resulting joint samples approximate the cut distribution. However, the validity of this approach relies on a double limit in N1N_{1} and N2N_{2}. Diagnosing convergence may also be difficult given the number of chains in the second stage, each of which targets a different distribution π2(θ2|θ1n)\pi_{2}(\theta_{2}|\theta_{1}^{n}).

If one could sample θ1π1\theta_{1}\sim\pi_{1} and θ2|θ1π2(θ2|θ1)\theta_{2}|\theta_{1}\sim\pi_{2}(\theta_{2}|\theta_{1}), then the pair (θ1,θ2)(\theta_{1},\theta_{2}) would follow the cut distribution. The same two-stage rationale can be applied in the proposed framework. Consider a test function (θ1,θ2)h(θ1,θ2)(\theta_{1},\theta_{2})\mapsto h(\theta_{1},\theta_{2}). Writing 𝔼cut\mathbb{E}_{\text{cut}} for expectations with respect to πcut\pi_{\text{cut}}, the law of iterated expectations yields

𝔼cut[h(θ1,θ2)]\displaystyle\mathbb{E}_{\text{cut}}[h(\theta_{1},\theta_{2})] =(h(θ1,θ2)π2(dθ2|θ1))π1(dθ1)=h¯(θ1)π1(dθ1).\displaystyle=\int\left(\int h(\theta_{1},\theta_{2})\pi_{2}(d\theta_{2}|\theta_{1})\right)\pi_{1}(d\theta_{1})=\int\bar{h}(\theta_{1})\pi_{1}(d\theta_{1}).

Here h¯(θ1)=h(θ1,θ2)π2(dθ2|θ1)\bar{h}(\theta_{1})=\int h(\theta_{1},\theta_{2})\pi_{2}(d\theta_{2}|\theta_{1}). In the proposed framework, one can construct an unbiased estimator of h¯(θ1)\bar{h}(\theta_{1}) for all θ1\theta_{1}, then plug these estimators into an unbiased estimator of the integral h¯(θ1)π1(dθ1)\int\bar{h}(\theta_{1})\pi_{1}(d\theta_{1}). This is perhaps clearer using the signed measure representation of Section 2.4: one can obtain a signed measure π^1==1Nωδθ1,\hat{\pi}_{1}=\sum_{\ell=1}^{N}\omega_{\ell}\delta_{\theta_{1,\ell}} approximating π1\pi_{1}, and then obtain an unbiased estimator of h¯(θ1,)\bar{h}(\theta_{1,\ell}) for all \ell, denoted by H¯\bar{H}_{\ell}. Then the weighted average =1NωH¯\sum_{\ell=1}^{N}\omega_{\ell}\bar{H}_{\ell} is an unbiased estimator of 𝔼cut[h(θ1,θ2)]\mathbb{E}_{\text{cut}}[h(\theta_{1},\theta_{2})] by the law of iterated expectations. Such estimators can be generated independently in parallel, and their average provides a consistent approximation of an expectation with respect to the cut distribution.
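As a concrete sketch of this plug-in construction, the snippet below combines a weighted approximation of π1 with one-draw unbiased estimators of the inner conditional expectation. All ingredients are illustrative assumptions, not part of the example above: π1 is taken to be Beta(2, 5) and sampled exactly (so the weights are uniform), the conditional π2(·|θ1) is taken to be Normal(θ1, 1), and the test function is h(θ1, θ2) = θ2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in signed-measure approximation of pi_1: exact Beta(2, 5) draws
# with uniform weights omega_l = 1/N.
N = 1000
theta1_atoms = rng.beta(2.0, 5.0, size=N)
omega = np.full(N, 1.0 / N)

def Hbar(t1):
    # one draw from the stand-in conditional Normal(t1, 1) is an unbiased
    # estimator of hbar(t1) = E[theta2 | theta1 = t1] = t1
    return rng.normal(t1, 1.0)

# unbiased estimator of E_cut[h(theta1, theta2)] by iterated expectations;
# here the true value is E[Beta(2, 5)] = 2/7
estimate = float(np.sum(omega * np.array([Hbar(t) for t in theta1_atoms])))
```

Averaging many independent copies of `estimate` gives a consistent approximation, mirroring the parallel scheme described above.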

We consider the example described in Plummer (2014), inspired by an investigation of the international correlation between human papillomavirus (HPV) prevalence and cervical cancer incidence (Maucort-Boulch et al., 2008). The first module concerns HPV prevalence, with data independently collected in 1313 countries. The parameter θ1=(θ1,1,,θ1,13)\theta_{1}=(\theta_{1,1},\ldots,\theta_{1,13}) receives a Beta(1,1)(1,1) prior distribution independently for each component. The data (Y1,,Y13)(Y_{1},\ldots,Y_{13}) consist of 1313 pairs of integers. The first represents the number of women infected with high-risk HPV, and the second represents population sizes. The likelihood specifies a Binomial model for YiY_{i}, independently for each component ii. The posterior for this model is given by a product of Beta distributions.
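Since the Beta(1,1) prior is conjugate to the Binomial likelihood, the first-module posterior is a product of Beta(1 + yᵢ, 1 + nᵢ − yᵢ) distributions and can be sampled exactly. A minimal sketch, using made-up counts in place of the actual study data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative counts only (not the data of Maucort-Boulch et al., 2008):
# y[i] women infected with high-risk HPV among n[i] sampled in country i.
y = np.array([7, 6, 10, 10, 1, 1, 10, 4, 35, 0, 10, 8, 4])
n = np.array([111, 71, 162, 188, 145, 215, 166, 37, 173, 143, 229, 696, 93])

# Beta(1,1) prior + Binomial likelihood => independent
# Beta(1 + y_i, 1 + n_i - y_i) posteriors across the 13 countries
def draw_theta1(size):
    return rng.beta(1 + y, 1 + n - y, size=(size, 13))

theta1_draws = draw_theta1(1000)
```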

The second module concerns the relationship between HPV prevalence and cancer incidence, and posits a Poisson regression. The parameters are θ2=(θ2,1,θ2,2)2\theta_{2}=(\theta_{2,1},\theta_{2,2})\in\mathbb{R}^{2} and receive a Normal prior with zero mean and variance 10310^{3} per component. The likelihood in this module is given by

Z1,iPoisson(exp(θ2,1+θ1,iθ2,2+Z2,i))for i{1,,13},Z_{1,i}\sim\text{Poisson}(\exp\left(\theta_{2,1}+\theta_{1,i}\theta_{2,2}+Z_{2,i}\right))\quad\text{for }i\in\left\{1,\ldots,13\right\},

where the data (Z1,i,Z2,i)i=113(Z_{1,i},Z_{2,i})_{i=1}^{13} are pairs of integers. The first component represents numbers of cancer cases, while the second is a number of woman-years of follow-up. The Poisson regression model might be misspecified, motivating departures from inference based on the joint model (Plummer, 2014).
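An MH sampler for the second module only needs the unnormalized log-density of π2(θ2|θ1), combining the Normal priors with the Poisson likelihood displayed above. A sketch, with placeholder data standing in for the actual counts:

```python
import numpy as np

# Illustrative placeholders, not the study data: Z1[i] cancer cases and
# Z2[i] follow-up terms entering the rate as in the displayed formula.
Z1 = np.array([2, 1, 5, 4, 0, 0, 3, 1, 8, 0, 2, 4, 1])
Z2 = np.array([0.5, 0.2, 1.1, 0.9, 0.4, 0.3, 0.8, 0.1,
               1.5, 0.2, 0.7, 2.0, 0.3])

def log_pi2(theta2, theta1):
    # rate of the Poisson regression: exp(theta21 + theta1_i*theta22 + Z2_i)
    log_rate = theta2[0] + theta1 * theta2[1] + Z2
    # Poisson log-likelihood, dropping log(Z1_i!) terms constant in theta2
    log_lik = np.sum(Z1 * log_rate - np.exp(log_rate))
    # independent Normal(0, 10^3) priors on both components
    log_prior = -0.5 * np.sum(theta2 ** 2) / 1e3
    return log_lik + log_prior

theta1 = np.linspace(0.05, 0.3, 13)   # illustrative prevalences
val = log_pi2(np.array([0.0, 0.0]), theta1)
```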

Here we can draw directly from the first posterior, denoted by π1(θ1)\pi_{1}(\theta_{1}), and obtain a sample (θ1n)n=1N(\theta_{1}^{n})_{n=1}^{N}. For each θ1n\theta_{1}^{n} we consider an MH algorithm targeting π2(θ2|θ1n)\pi_{2}(\theta_{2}|\theta_{1}^{n}), using a Normal random walk proposal with variance Σ\Sigma. We couple this algorithm using reflection-maximal couplings of the proposals as in Section 4.1. In preliminary runs, starting with a standard bivariate Normal as an initial distribution and a proposal covariance matrix set to identity, we estimate the first two moments of the cut distribution, and we use them to refine the initial distribution π0\pi_{0} and the proposal covariance matrix Σ\Sigma. With these settings we obtain a distribution of meeting times shown in Figure 7a. We then set k=100k=100, m=10km=10k, and obtain approximations of the cut distribution represented by histograms in Figures 7b and 7c, using N=10,000N=10,000 unbiased estimators. The overlaid curves correspond to a kernel density estimate obtained by running m=1,000m=1,000 steps of MCMC targeting π2(θ2|θ1n)\pi_{2}(\theta_{2}|\theta_{1}^{n}) with θ1n\theta_{1}^{n} drawn from π1(θ1)\pi_{1}(\theta_{1}), for n{1,,N}n\in\{1,\ldots,N\}, and keeping only the final state of each chain. The proposed estimators can be refined by increasing the number NN of independent replications, whereas the MCMC estimators would converge only in the double limit of NN and mm going to infinity.
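The reflection-maximal coupling of the Normal proposals mentioned above can be sketched as follows, in the spherical case Σ = σ²I for simplicity (the general case applies the same construction after transforming by a Cholesky factor of Σ). Both draws have the correct marginals, and the probability that they coincide equals the overlap of the two proposal densities, the maximum achievable by any coupling.

```python
import numpy as np

rng = np.random.default_rng(2)

# Reflection-maximal coupling of N(mu1, sigma^2 I) and N(mu2, sigma^2 I).
def reflection_maximal(mu1, mu2, sigma):
    z = (mu1 - mu2) / sigma
    xdot = rng.standard_normal(mu1.shape)
    x = mu1 + sigma * xdot
    norm_z = np.linalg.norm(z)
    if norm_z == 0.0:
        return x, x.copy()
    # accept with probability min(1, phi(xdot + z) / phi(xdot)), where phi
    # is the standard Normal density; on acceptance the two draws coincide
    log_ratio = -0.5 * np.sum((xdot + z) ** 2) + 0.5 * np.sum(xdot ** 2)
    if np.log(rng.uniform()) < log_ratio:
        return x, x.copy()
    e = z / norm_z
    ydot = xdot - 2.0 * (e @ xdot) * e   # reflect about the hyperplane orthogonal to z
    return x, mu2 + sigma * ydot

# one-dimensional check: coupling N(0, 1) with N(1, 1); the meeting
# frequency should approach 2 * Phi(-1/2), about 0.617
draws = [reflection_maximal(np.zeros(1), np.ones(1), 1.0) for _ in range(5000)]
meet_freq = np.mean([x[0] == y[0] for x, y in draws])
```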

(a) Histogram of meeting times. (b) Histogram of θ2,1\theta_{2,1}. (c) Histogram of θ2,2\theta_{2,2}.
Figure 7: Meeting times (left) and histograms of θ2,1\theta_{2,1} and θ2,2\theta_{2,2} (middle and right), in the example of Section 5.5, computed from 10,00010,000 unbiased estimators, with k=100k=100 and m=1000m=1000. Grey boxes indicate 95%95\% confidence intervals on the histogram estimates. Overlaid solid lines indicate the marginal probability density functions of the cut distribution, obtained by running N=10,000N=10,000 MCMC chains for 1,0001,000 steps, each targeting π2(θ2|θ1n)\pi_{2}(\theta_{2}|\theta_{1}^{n}) for θ1n\theta_{1}^{n} drawn from π1(θ1)\pi_{1}(\theta_{1}), and retaining the last state only.

6 Discussion

By combining the powerful technique of Glynn and Rhee (2014) with couplings of MCMC algorithms, unbiased estimators of integrals with respect to the target distribution can be constructed. Their efficiency can be controlled with tuning parameters kk and mm, for which we have proposed guidelines: kk can be chosen as a large quantile of the meeting time τ\tau, and mm as a multiple of kk. Improving on these simple guidelines stands as a subject for future research. In numerical experiments we have argued that the proposed estimators yield a practical way of parallelizing MCMC computations in a range of settings. We stress that coupling pairs of Markov chains does not improve their marginal mixing properties, and that poor mixing of the underlying chains can lead to poor performance of the resulting estimator. The choice of initial distribution π0\pi_{0} can have undesirable effects on the estimators, as in the multimodal example of Section 5.1. Unreliable estimators would also result from stopping the chains before their meeting time.

Couplings of MCMC algorithms can be devised using maximal couplings, reflection couplings, and common random numbers. We have focused on couplings that can be implemented without further analytical knowledge about the target distribution or about the MCMC kernels. However, these couplings might result in prohibitively large meeting times, either because the marginal chains mix slowly, as in Section 5.1, or because the coupling strategy is ineffective, as in Section 4.2.

Regarding convergence diagnostics, the proposed framework yields the following representation of the total variation distance between πk\pi_{k} and π\pi, where πk\pi_{k} denotes the marginal distribution of XkX_{k}:

dTV(πk,π)\displaystyle d_{\text{TV}}(\pi_{k},\pi) =\displaystyle= 12suph:|h|1|𝔼[h(Xk)]𝔼π[h(X)]|\displaystyle\frac{1}{2}\sup_{h:\;|h|\leq 1}\left|\mathbb{E}[h(X_{k})]-\mathbb{E}_{\pi}[h(X)]\right|
=\displaystyle= 12suph:|h|1|𝔼[t=k+1τ1(h(Xt)h(Yt1))]|.\frac{1}{2}\sup_{h:\;|h|\leq 1}\left|\mathbb{E}\left[\sum_{t=k+1}^{\tau-1}\left(h(X_{t})-h(Y_{t-1})\right)\right]\right|.

Here the supremum ranges over measurable functions bounded by 1 in absolute value, under the stated assumptions. The equality above has several consequences. For instance, the triangle inequality implies dTV(πk,π)min(1,𝔼[max(0,τk1)])d_{\text{TV}}(\pi_{k},\pi)\leq\min(1,\mathbb{E}[\max(0,\tau-k-1)]), and we can approximate 𝔼[max(0,τk1)]\mathbb{E}[\max(0,\tau-k-1)] by Monte Carlo for a range of kk values. This is pursued in Biswas and Jacob (2019), where the proposed construction is extended to allow for arbitrary time lags between the coupled chains.
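The upper bound above can be approximated from independent copies of the meeting time. A minimal sketch, with geometric placeholder meeting times standing in for ones obtained by running coupled chains:

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo approximation of the upper bound
#   d_TV(pi_k, pi) <= min(1, E[max(0, tau - k - 1)])
# from independent meeting times; the geometric draws below are
# placeholders for meeting times recorded from coupled chains.
taus = 1 + rng.geometric(0.2, size=10_000)

def tv_upper_bound(k, taus):
    return min(1.0, np.mean(np.maximum(0, taus - k - 1)))

# the bound is nonincreasing in k, and small once k exceeds most of the taus
bounds = [tv_upper_bound(k, taus) for k in range(50)]
```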

Thanks to its potential for parallelization, the proposed framework can facilitate consideration of MCMC kernels that might be too expensive for serial implementation. For instance, one can improve MH-within-Gibbs samplers by performing more MH steps per component update, HMC by using smaller step-sizes in the numerical integrator (Heng and Jacob, 2019), and particle MCMC by using more particles in the particle filters (Andrieu et al., 2010; Jacob et al., 2019). We expect the optimal tuning of MCMC kernels to be different in the proposed framework than when used marginally.

On top of enabling the application of the results of Glynn and Heidelberger (1991) to accommodate budget constraints, the lack of bias of the proposed estimators can be beneficial in combination with the law of total expectation, to implement modular inference procedures as in Section 5.5. In Rischard et al. (2018) the lack of bias is exploited in new estimators of Bayesian cross-validation criteria. In Chen et al. (2018) similar unbiased estimators are used in the expectation step of an expectation-maximization algorithm. There may be other settings where the lack of bias is appealing, for instance in gradient estimation for stochastic gradient descents (Tadić et al., 2017).

Acknowledgement.

The authors are grateful to Jeremy Heng and Luc Vincent-Genod for useful discussions. The authors gratefully acknowledge support by the National Science Foundation through grants DMS-1712872 and DMS-1844695 (Pierre E. Jacob), and DMS-1513040 (Yves F. Atchadé).

References

  • Altekar et al. [2004] G. Altekar, S. Dwarkadas, J. P. Huelsenbeck, and F. Ronquist. Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics, 20(3):407–415, 2004.
  • Andrieu and Vihola [2015] C. Andrieu and M. Vihola. Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. The Annals of Applied Probability, 25(2):1030–1077, 2015.
  • Andrieu et al. [2010] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
  • Atchade [2006] Y. F. Atchade. An adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift. Methodology and Computing in applied Probability, 8:235–254, 2006.
  • Atchadé [2016] Y. F. Atchadé. Markov chain Monte Carlo confidence intervals. Bernoulli, 22(3):1808–1838, 2016.
  • Biswas and Jacob [2019] N. Biswas and P. E. Jacob. Estimating convergence of Markov chains with L-lag couplings. arXiv preprint arXiv:1905.09971, 2019.
  • Blangiardo et al. [2011] M. Blangiardo, A. Hansell, and S. Richardson. A Bayesian model of time activity data to investigate health effect of air pollution in time series studies. Atmospheric Environment, 45(2):379–386, 2011.
  • Bornn et al. [2010] L. Bornn, A. Doucet, and R. Gottardo. An efficient computational approach for prior sensitivity analysis and cross-validation. Canadian Journal of Statistics, 38(1):47–64, 2010.
  • Bou-Rabee and Sanz-Serna [2018] N. Bou-Rabee and J. M. Sanz-Serna. Geometric integrators and the Hamiltonian Monte Carlo method. Acta Numerica, 27:113–206, 2018.
  • Bou-Rabee et al. [2018] N. Bou-Rabee, A. Eberle, and R. Zimmer. Coupling and convergence for Hamiltonian Monte Carlo. arXiv preprint arXiv:1805.00452, 2018.
  • Brockwell [2006] A. E. Brockwell. Parallel Markov chain Monte Carlo simulation by pre-fetching. Journal of Computational and Graphical Statistics, 15(1):246–261, 2006.
  • Brockwell and Kadane [2005] A. E. Brockwell and J. B. Kadane. Identification of regeneration times in MCMC simulation, with application to adaptive schemes. Journal of Computational and Graphical Statistics, 14(2):436–458, 2005.
  • Brooks et al. [2011] S. P. Brooks, A. Gelman, G. Jones, and X.-L. Meng. Handbook of Markov chain Monte Carlo. CRC press, 2011.
  • Calderhead [2014] B. Calderhead. A general construction for parallelizing Metropolis–Hastings algorithms. Proceedings of the National Academy of Sciences, 111(49):17408–17413, 2014.
  • Casella et al. [2001] G. Casella, M. Lavine, and C. P. Robert. Explaining the perfect sampler. The American Statistician, 55(4):299–305, 2001.
  • Chatterjee et al. [2015] S. Chatterjee, A. Guntuboyina, B. Sen, et al. On risk bounds in isotonic and other shape restricted regression problems. The Annals of Statistics, 43(4):1774–1800, 2015.
  • Chen et al. [2018] W. Chen, L. Ma, and X. Liang. Blind identification based on expectation-maximization algorithm coupled with blocked Rhee–Glynn smoothing estimator. IEEE Communications Letters, 22(9):1838–1841, 2018.
  • Choi and Hobert [2013] H. M. Choi and J. P. Hobert. The Pólya–Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic. Electronic Journal of Statistics, 7:2054–2064, 2013.
  • Cowles and Rosenthal [1998] M. K. Cowles and J. S. Rosenthal. A simulation approach to convergence rates for Markov chain Monte Carlo algorithms. Statistics and Computing, 8(2):115–124, 1998.
  • Diaconis and Freedman [1999] P. Diaconis and D. Freedman. Iterated random functions. SIAM review, 41(1):45–76, 1999.
  • Douc et al. [2004] R. Douc, E. Moulines, and J. S. Rosenthal. Quantitative bounds on convergence of time-inhomogeneous Markov chains. Annals of Applied Probability, 14(4):1643–1665, 2004.
  • Duane et al. [1987] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
  • Efron et al. [2004] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of statistics, 32(2):407–499, 2004.
  • Flegal and Herbei [2012] J. M. Flegal and R. Herbei. Exact sampling for intractable probability distributions via a Bernoulli factory. Electronic Journal of Statistics, 6:10–37, 2012.
  • Flegal et al. [2008] J. M. Flegal, M. Haran, and G. L. Jones. Markov chain Monte Carlo: can we trust the third significant figure? Statistical Science, pages 250–260, 2008.
  • Gaver and O’Muircheartaigh [1987] D. P. Gaver and I. G. O’Muircheartaigh. Robust empirical Bayes analyses of event rates. Technometrics, 29(1):1–15, 1987.
  • Geyer [1991] C. J. Geyer. Markov chain Monte Carlo maximum likelihood. Technical report, University of Minnesota, School of Statistics, 1991.
  • Girolami and Calderhead [2011] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
  • Glynn [2016] P. W. Glynn. Exact simulation versus exact estimation. In Winter Simulation Conference (WSC), 2016, pages 193–205. IEEE, 2016.
  • Glynn and Heidelberger [1990] P. W. Glynn and P. Heidelberger. Bias properties of budget constraint simulations. Operations Research, 38(5):801–814, 1990.
  • Glynn and Heidelberger [1991] P. W. Glynn and P. Heidelberger. Analysis of parallel replicated simulations under a completion time constraint. ACM Transactions on Modeling and Computer Simulations, 1(1):3–23, 1991.
  • Glynn and Rhee [2014] P. W. Glynn and C.-H. Rhee. Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability, 51(A):377–389, 2014.
  • Glynn and Whitt [1992] P. W. Glynn and W. Whitt. The asymptotic efficiency of simulation estimators. Operations Research, 40(3):505–520, 1992.
  • Gong and Flegal [2016] L. Gong and J. M. Flegal. A practical sequential stopping rule for high-dimensional Markov chain Monte Carlo. Journal of Computational and Graphical Statistics, 25(3):684–700, 2016. doi: 10.1080/10618600.2015.1044092. URL http://dx.doi.org/10.1080/10618600.2015.1044092.
  • Goodman et al. [2010] J. Goodman, J. Weare, et al. Ensemble samplers with affine invariance. Communications in applied mathematics and computational science, 5(1):65–80, 2010.
  • Goudie et al. [2017] R. J. Goudie, R. M. Turner, D. De Angelis, and A. Thomas. Massively parallel MCMC for Bayesian hierarchical models. arXiv preprint arXiv:1704.03216, 2017.
  • Green et al. [2015] P. J. Green, K. Łatuszyński, M. Pereyra, and C. P. Robert. Bayesian computation: a summary of the current state, and samples backwards and forwards. Statistics and Computing, 25(4):835–862, 2015.
  • Hastings [1970] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
  • Heng and Jacob [2019] J. Heng and P. E. Jacob. Unbiased Hamiltonian Monte Carlo with couplings. Biometrika, 106(2):287–302, 2019.
  • Hoffman and Gelman [2014] M. D. Hoffman and A. Gelman. The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
  • Huang et al. [2007] C.-L. Huang, M.-C. Chen, and C.-J. Wang. Credit scoring with a data mining approach based on support vector machines. Expert systems with applications, 33(4):847–856, 2007.
  • Huber [2016] M. Huber. Perfect simulation, volume 148. CRC Press, 2016.
  • Jacob et al. [2011] P. E. Jacob, C. P. Robert, and M. H. Smith. Using parallel computation to improve independent Metropolis–Hastings based estimation. Journal of Computational and Graphical Statistics, 20(3):616–635, 2011.
  • Jacob et al. [2017] P. E. Jacob, L. M. Murray, C. C. Holmes, and C. P. Robert. Better together? Statistical learning in models made of modules. arXiv preprint arXiv:1708.08719, 2017.
  • Jacob et al. [2019] P. E. Jacob, F. Lindsten, and T. B. Schön. Smoothing with couplings of conditional particle filters. Journal of the American Statistical Association, 0(0):1–20, 2019. doi: 10.1080/01621459.2018.1548856. URL https://doi.org/10.1080/01621459.2018.1548856.
  • Jarner and Hansen [2000] S. F. Jarner and E. Hansen. Geometric ergodicity of Metropolis algorithms. Stochastic Processes and their Applications, 85(1):341–361, 2000.
  • Johndrow and Mattingly [2017] J. E. Johndrow and J. C. Mattingly. Coupling and decoupling to bound an approximating Markov chain. arXiv preprint arXiv:1706.02040, 2017.
  • Johnson [1996] V. E. Johnson. Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths. Journal of the American Statistical Association, 91(433):154–166, 1996.
  • Johnson [1998] V. E. Johnson. A coupling-regeneration scheme for diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 93(441):238–248, 1998.
  • Johnson [2013] V. E. Johnson. On numerical aspects of Bayesian model selection in high and ultrahigh-dimensional settings. Bayesian Analysis, 8 (4):741–758, 2013.
  • Khare and Hobert [2013] K. Khare and J. P. Hobert. Geometric ergodicity of the Bayesian lasso. Electronic Journal of Statistics, 7:2150–2163, 2013. doi: 10.1214/13-EJS841. URL http://dx.doi.org/10.1214/13-EJS841.
  • Lee et al. [2010] A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes. On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Journal of computational and graphical statistics, 19(4):769–789, 2010.
  • Lee et al. [2014] A. Lee, A. Doucet, and K. Łatuszyński. Perfect simulation using atomic regeneration with application to sequential Monte Carlo. arXiv preprint arXiv:1407.5770, 2014.
  • Lichman [2013] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
  • Lindvall [2002] T. Lindvall. Lectures on the coupling method. Dover Books on Mathematics, 2002.
  • Liu et al. [2009] F. Liu, M. Bayarri, and J. Berger. Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis, 4(1):119–150, 2009.
  • Liu [2008] J. S. Liu. Monte Carlo strategies in scientific computing. Springer Science & Business Media, 2008.
  • Lunn et al. [2009] D. Lunn, N. Best, D. Spiegelhalter, G. Graham, and B. Neuenschwander. Combining MCMC with sequential PKPD modelling. Journal of Pharmacokinetics and Pharmacodynamics, 36(1):19–38, 2009.
  • Mainini [2012] E. Mainini. A description of transport cost for signed measures. Journal of Mathematical Sciences, 181(6):837–855, 2012.
  • Mangoubi and Smith [2017] O. Mangoubi and A. Smith. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114, 2017.
  • Maucort-Boulch et al. [2008] D. Maucort-Boulch, S. Franceschi, and M. Plummer. International correlation between human papillomavirus prevalence and cervical cancer incidence. Cancer Epidemiology Biomarkers & Prevention, 17(3):717–720, 2008.
  • McCandless et al. [2010] L. C. McCandless, I. J. Douglas, S. J. Evans, and L. Smeeth. Cutting feedback in Bayesian regression adjustment for the propensity score. The International Journal of Biostatistics, 6(2), 2010.
  • McLeish [2011] D. McLeish. A general method for debiasing a Monte Carlo estimator. Monte Carlo methods and applications, 17(4):301–315, 2011.
  • Meyn and Tweedie [2009] S. Meyn and R. Tweedie. Markov chains and stochastic stability. Cambridge University Press, second edition, 2009.
  • Middleton et al. [2018] L. Middleton, G. Deligiannidis, A. Doucet, and P. E. Jacob. Unbiased Markov chain Monte Carlo for intractable target distributions. arXiv preprint arXiv:1807.08691, 2018.
  • Middleton et al. [2019] L. Middleton, G. Deligiannidis, A. Doucet, and P. E. Jacob. Unbiased smoothing using particle independent Metropolis-Hastings. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 2378–2387. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/middleton19a.html.
  • Morris [1983] C. N. Morris. Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association, 78(381):47–55, 1983.
  • Murdoch and Green [1998] D. J. Murdoch and P. J. Green. Exact sampling from a continuous state space. Scandinavian Journal of Statistics, 25(3):483–502, 1998.
  • Murphy and Topel [2002] K. M. Murphy and R. H. Topel. Estimation and inference in two-step econometric models. Journal of Business & Economic Statistics, 20(1):88–97, 2002.
  • Murray et al. [2010] I. Murray, R. P. Adams, and D. J. C. MacKay. Elliptical slice sampling. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 9:541–548, 2010.
  • Mykland et al. [1995] P. Mykland, L. Tierney, and B. Yu. Regeneration in Markov chain samplers. Journal of the American Statistical Association, 90(429):233–241, 1995.
  • Neal [1993] R. M. Neal. Bayesian learning via stochastic dynamics. Advances in neural information processing systems, pages 475–475, 1993.
  • Neal [1999] R. M. Neal. Circularly-coupled Markov chain sampling. Technical report, Department of Statistics, University of Toronto, 1999.
  • Neal [2011] R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov chain Monte Carlo, 2(11), 2011.
  • Neal and Pinto [2001] R. M. Neal and R. L. Pinto. Improving Markov chain Monte Carlo estimators by coupling to an approximating chain. Technical report, Department of Statistics, University of Toronto, 2001.
  • Nikooienejad et al. [2016] A. Nikooienejad, W. Wang, and V. E. Johnson. Bayesian variable selection for binary outcomes in high-dimensional genomic studies using non-local priors. Bioinformatics, 32(9):1338–1345, 2016.
  • Nummelin [2002] E. Nummelin. MC’s for MCMC’ists. International Statistical Review / Revue Internationale de Statistique, 70(2):pp. 215–240, 2002. ISSN 03067734.
  • Owen [2017] A. B. Owen. Statistically efficient thinning of a Markov chain sampler. Journal of Computational and Graphical Statistics, 26(3):738–744, 2017.
  • Pagan [1984] A. Pagan. Econometric issues in the analysis of regressions with generated regressors. International Economic Review, pages 221–247, 1984.
  • Pal and Khare [2014] S. Pal and K. Khare. Geometric ergodicity for Bayesian shrinkage models. Electron. J. Statist., 8(1):604–645, 2014. doi: 10.1214/14-EJS896. URL http://dx.doi.org/10.1214/14-EJS896.
  • Park and Casella [2008] T. Park and G. Casella. The Bayesian Lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
  • Plummer [2014] M. Plummer. Cuts in Bayesian graphical models. Statistics and Computing, pages 1–7, 2014.
  • Plummer et al. [2006] M. Plummer, N. Best, K. Cowles, and K. Vines. CODA: convergence diagnosis and output analysis for MCMC. R news, 6(1):7–11, 2006.
  • Plummer et al. [2003] M. Plummer et al. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd international workshop on distributed statistical computing, volume 124. Vienna, Austria, 2003.
  • Pollock et al. [2016] M. Pollock, P. Fearnhead, A. M. Johansen, and G. O. Roberts. The scalable Langevin exact algorithm: Bayesian inference for big data. arXiv preprint arXiv:1609.03436, 2016.
  • Polson et al. [2013] N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American statistical Association, 108(504):1339–1349, 2013.
  • Propp and Wilson [1996] J. G. Propp and D. B. Wilson. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random structures and Algorithms, 9(1-2):223–252, 1996.
  • R Core Team [2015] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015. URL https://www.R-project.org.
  • Reutter and Johnson [1995] A. Reutter and V. E. Johnson. General strategies for assessing convergence of MCMC algorithms using coupled sample paths. Institute of Statistics & Decision Sciences, Duke University, 1995.
  • Rhee and Glynn [2012] C.-H. Rhee and P. W. Glynn. A new approach to unbiased estimation for SDE’s. In Proceedings of the Winter Simulation Conference, page 17. Winter Simulation Conference, 2012.
  • Rischard et al. [2018] M. Rischard, P. E. Jacob, and N. Pillai. Unbiased estimation of log normalizing constants with applications to Bayesian cross-validation. arXiv preprint arXiv:1810.01382, 2018.
  • Robert and Casella [2004] C. P. Robert and G. Casella. Monte Carlo statistical methods. Springer-Verlag, New York, second edition, 2004.
  • Roberts and Rosenthal [2004] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004.
  • Roberts and Tweedie [1996a] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996a.
  • Roberts and Tweedie [1996b] G. O. Roberts and R. L. Tweedie. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83(1):95–110, 1996b.
  • Roberts et al. [1997] G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120, 1997.
  • Rosenthal [1996] J. S. Rosenthal. Analysis of the Gibbs sampler for a model related to James-Stein estimators. Statistics and Computing, 6(3):269–275, 1996.
  • Rosenthal [1997] J. S. Rosenthal. Faithful couplings of Markov chains: now equals forever. Advances in Applied Mathematics, 18(3):372 – 381, 1997. ISSN 0196-8858.
  • Rosenthal [2000] J. S. Rosenthal. Parallel computing and Monte Carlo algorithms. Far east journal of theoretical statistics, 4(2):207–236, 2000.
  • Rosenthal [2002] J. S. Rosenthal. Quantitative convergence rates of Markov chains: A simple account. Electronic Communications in Probability, 7:123–128, 2002.
  • Spiegelhalter et al. [2003] D. Spiegelhalter, A. Thomas, N. Best, and D. Lunn. WinBUGS user manual, 2003.
  • Srivastava et al. [2015] S. Srivastava, V. Cevher, Q. Dinh, and D. Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pages 912–920, 2015.
  • Swendsen and Wang [1987] R. H. Swendsen and J.-S. Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Physical review letters, 58(2):86, 1987.
  • Tadić et al. [2017] V. B. Tadić, A. Doucet, et al. Asymptotic bias of stochastic gradient search. The Annals of Applied Probability, 27(6):3255–3304, 2017.
  • Thorisson [2000] H. Thorisson. Coupling, stationarity, and regeneration, volume 14. Springer New York, 2000.
  • Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • Titsias and Yau [2017] M. K. Titsias and C. Yau. The Hamming ball sampler. Journal of the American Statistical Association, pages 1–14, 2017.
  • Tjelmeland [2004] H. Tjelmeland. Using all Metropolis–Hastings proposals to estimate mean values. Technical report, Department of Mathematical Sciences, Norwegian University of Science and Technology, 2004.
  • Tweedie [1983] R. Tweedie. The existence of moments for stationary Markov chains. Journal of Applied Probability, 20(1):191–196, 1983.
  • Vanetti et al. [2017] P. Vanetti, A. Bouchard-Côté, G. Deligiannidis, and A. Doucet. Piecewise deterministic Markov chain Monte Carlo. arXiv preprint arXiv:1707.05296, 2017.
  • Vats et al. [2018] D. Vats, J. M. Flegal, G. L. Jones, et al. Strong consistency of multivariate spectral variance estimators in Markov chain Monte Carlo. Bernoulli, 24(3):1860–1909, 2018.
  • Vihola [2017] M. Vihola. Unbiased estimators and multilevel Monte Carlo. Operations Research, 66(2):448–462, 2017.
  • Wang et al. [2015] X. Wang, F. Guo, K. A. Heller, and D. B. Dunson. Parallelizing MCMC with random partition trees. In Advances in Neural Information Processing Systems, pages 451–459, 2015.
  • West [2000] D. West. Neural network credit scoring models. Computers & Operations Research, 27(11):1131–1152, 2000.
  • Wolff [1989] U. Wolff. Comparison between cluster Monte Carlo algorithms in the Ising model. Physics Letters B, 228(3):379–382, 1989.
  • Yang et al. [2017] S. Yang, Y. Chen, E. Bernton, and J. S. Liu. On parallelizable Markov chain Monte Carlo algorithms with waste-recycling. Statistics and Computing, pages 1–9, 2017.
  • Yang et al. [2016] Y. Yang, M. J. Wainwright, and M. I. Jordan. On the computational complexity of high-dimensional Bayesian variable selection. The Annals of Statistics, 44(6):2497–2532, 2016.
  • Zigler [2016] C. M. Zigler. The central role of Bayes theorem for joint estimation of causal effects and propensity scores. The American Statistician, 70(1):47–54, 2016.

The appendices contain the proofs of the results of the article, numerical experiments with coupled Hamiltonian Monte Carlo on a Normal target of varying dimension, Gibbs sampling on baseball batting average data, Pólya-Gamma Gibbs sampling on the German credit data, and Bayesian Lasso Gibbs sampling on the diabetes data set.

Appendix A Proofs

A.1 Proof of Proposition 3.1

We present the proof for H0(X,Y)H_{0}(X,Y); similar reasoning holds for Hk(X,Y)H_{k}(X,Y) and Hk:m(X,Y)H_{k:m}(X,Y). We follow the same arguments as in Glynn and Rhee [2014], Vihola [2017]. To study H0(X,Y)=h(X0)+t=1τ1(h(Xt)h(Yt1))H_{0}(X,Y)=h(X_{0})+\sum_{t=1}^{\tau-1}(h(X_{t})-h(Y_{t-1})), we introduce Δ0=h(X0)\Delta_{0}=h\left(X_{0}\right) and Δt=h(Xt)h(Yt1)\Delta_{t}=h\left(X_{t}\right)-h\left(Y_{t-1}\right) for all t1t\geq 1, and define H0n(X,Y)=t=0nΔtH_{0}^{n}(X,Y)=\sum_{t=0}^{n}\Delta_{t}. For simplicity we assume that Δt\Delta_{t}\in\mathbb{R}, which corresponds to studying the component-wise behavior of H0(X,Y)H_{0}(X,Y).

We have $\mathbb{E}[\tau]<\infty$ from Assumption 2.2, so that the computing time of $H_{0}(X,Y)$ has a finite expectation. Together with Assumption 2.3, this implies that $H_{0}^{n}(X,Y)\to H_{0}(X,Y)$ almost surely as $n\to\infty$. We now show that $(H_{0}^{n}(X,Y))$ is a Cauchy sequence in $L_{2}$, the complete space of random variables with finite first two moments; that is, $\sup_{n^{\prime}\geq n}\mathbb{E}[(H_{0}^{n^{\prime}}(X,Y)-H_{0}^{n}(X,Y))^{2}]\to 0$ as $n\to\infty$. For $n^{\prime}\geq n$, we compute

\mathbb{E}[(H_{0}^{n^{\prime}}(X,Y)-H_{0}^{n}(X,Y))^{2}]=\sum_{s=n+1}^{n^{\prime}}\sum_{t=n+1}^{n^{\prime}}\mathbb{E}[\Delta_{s}\Delta_{t}],

and consider each term $\mathbb{E}[\Delta_{s}\Delta_{t}]$ for $(s,t)\in\{n+1,\ldots,n^{\prime}\}^{2}$. The Cauchy–Schwarz inequality implies $\mathbb{E}[\Delta_{s}\Delta_{t}]\leq\left(\mathbb{E}[\Delta_{s}^{2}]\cdot\mathbb{E}[\Delta_{t}^{2}]\right)^{1/2}$, and noting that $\mathbb{E}[\Delta_{t}^{2}]=\mathbb{E}[\Delta_{t}^{2}\cdot\mathds{1}(\tau>t)]$, we can apply Hölder's inequality with $p=1+\eta/2$, $q=(2+\eta)/\eta$ for any $\eta>0$ to obtain

\mathbb{E}[\Delta_{t}^{2}\cdot\mathds{1}(\tau>t)]\leq\mathbb{E}[|\Delta_{t}|^{2+\eta}]^{1/(1+\eta/2)}\cdot\mathbb{E}[\mathds{1}(\tau>t)]^{\eta/(2+\eta)}\leq\mathbb{E}[|\Delta_{t}|^{2+\eta}]^{1/(1+\eta/2)}\cdot\left(C\delta^{t}\right)^{\eta/(2+\eta)}.

Here we have used Assumption 2.2 to bound $\mathbb{E}[\mathds{1}(\tau>t)]$. We can also use Assumption 2.1 together with Minkowski's inequality to bound $\mathbb{E}[|\Delta_{t}|^{2+\eta}]^{1/(1+\eta/2)}$ by a constant $\tilde{C}$, for all $t\geq 0$. Defining $\tilde{\delta}=\delta^{\eta/(2+\eta)}\in(0,1)$ then gives the bound $\mathbb{E}[\Delta_{t}^{2}]\leq\tilde{C}\tilde{\delta}^{t}$ for all $t\geq 0$. This implies $\mathbb{E}[(H_{0}^{n^{\prime}}(X,Y)-H_{0}^{n}(X,Y))^{2}]\leq\bar{C}\tilde{\delta}^{n}$ for some other constant $\bar{C}$, and thus $(H_{0}^{n}(X,Y))$ is Cauchy in $L_{2}$. This proves that its limit $H_{0}(X,Y)$ has finite first and second moments. Assumption 2.1 implies that $\lim_{n\to\infty}\mathbb{E}[H_{0}^{n}(X,Y)]=\mathbb{E}_{\pi}[h(X)]$, by a telescopic sum argument, so we conclude that $\mathbb{E}[H_{0}(X,Y)]=\mathbb{E}_{\pi}[h(X)]$. We can also obtain an explicit representation of $\mathbb{E}[H_{0}(X,Y)^{2}]$ as the limit of $\mathbb{E}[H_{0}^{n}(X,Y)^{2}]$ as $n\to\infty$.

A.2 Proof of Proposition 3.2

We adopt a similar strategy to that of Glynn and Rhee [2014], Section 4. For the $H_{k}(X,Y)$ case, the unbiased estimator of $F(s)=\mathbb{P}_{\pi}(X\leq s)$ over $R$ samples is of the form

\hat{F}^{R}(s)=\frac{1}{R}\sum_{r=1}^{R}\left(\mathds{1}(X_{k}^{(r)}\leq s)+\sum_{\ell=k+1}^{\tau^{(r)}-1}\left(\mathds{1}(X_{\ell}^{(r)}\leq s)-\mathds{1}(Y_{\ell-1}^{(r)}\leq s)\right)\right).

Here $\tau^{(r)}$ denotes the meeting time of the $r$-th independent run. We want to show that as $R\to\infty$, $\sup_{s}|F(s)-\hat{F}^{R}(s)|\xrightarrow{a.s.}0$. Define $G_{X,k}^{R}(s)=R^{-1}\sum_{r=1}^{R}\mathds{1}(X_{k}^{(r)}\leq s)$. For $\ell>k$, define

G_{X,\ell}^{R}(s)=\frac{1}{R}\sum_{r=1}^{R}\mathds{1}(X_{\ell}^{(r)}\leq s)\cdot\mathds{1}(\ell\leq\tau^{(r)}),\qquad G_{Y,\ell}^{R}(s)=\frac{1}{R}\sum_{r=1}^{R}\mathds{1}(Y_{\ell-1}^{(r)}\leq s)\cdot\mathds{1}(\ell\leq\tau^{(r)}).

Then $\hat{F}^{R}(s)=G_{X,k}^{R}(s)+\sum_{\ell=k+1}^{\infty}(G_{X,\ell}^{R}(s)-G_{Y,\ell}^{R}(s))$. By the standard Glivenko–Cantelli theorem, as $R\to\infty$ we have

\sup_{s}\left|G_{X,k}^{R}(s)-\mathbb{P}(X_{k}\leq s)\right|\xrightarrow{a.s.}0,\qquad\sup_{s}\left|G_{X,\ell}^{R}(s)-\mathbb{E}[\mathds{1}(X_{\ell}\leq s)\cdot\mathds{1}(\ell\leq\tau)]\right|\xrightarrow{a.s.}0,
\sup_{s}\left|G_{Y,\ell}^{R}(s)-\mathbb{E}[\mathds{1}(Y_{\ell-1}\leq s)\cdot\mathds{1}(\ell\leq\tau)]\right|\xrightarrow{a.s.}0,

for each $\ell>k$. Next, we observe that, for all $s$ and $\ell$,

\mathbb{E}[(\mathds{1}(X_{\ell}\leq s)-\mathds{1}(Y_{\ell-1}\leq s))\cdot\mathds{1}(\tau\geq\ell)]=\mathbb{P}(X_{\ell}\leq s)-\mathbb{P}(X_{\ell-1}\leq s).

This holds for a simple reason in our setting. For any $h(\cdot)$ and any $\ell$,

\mathbb{E}[h(X_{\ell})-h(Y_{\ell-1})] =\mathbb{E}[(h(X_{\ell})-h(Y_{\ell-1}))\mathds{1}(\tau>\ell)]+\mathbb{E}[(h(X_{\ell})-h(Y_{\ell-1}))\mathds{1}(\tau\leq\ell)]
=\mathbb{E}[(h(X_{\ell})-h(Y_{\ell-1}))\mathds{1}(\tau>\ell)],

since if $\tau\leq\ell$ then $X_{\ell}=Y_{\ell-1}$ by Assumption 2.3. Applying this result with $h(\cdot)=\mathds{1}(\cdot\leq s)$ yields the desired statement.

The above implies that for any finite $i\geq k$ we have

\left|G_{X,k}^{R}(s)-\mathbb{P}(X_{k}\leq s)+\sum_{\ell=k+1}^{i}\Big(G_{X,\ell}^{R}(s)-G_{Y,\ell}^{R}(s)-\mathbb{E}[(\mathds{1}(X_{\ell}\leq s)-\mathds{1}(Y_{\ell-1}\leq s))\cdot\mathds{1}(\ell\leq\tau)]\Big)\right|
=\left|G_{X,k}^{R}(s)-\mathbb{P}(X_{k}\leq s)+\sum_{\ell=k+1}^{i}\Big(G_{X,\ell}^{R}(s)-G_{Y,\ell}^{R}(s)-(\mathbb{P}(X_{\ell}\leq s)-\mathbb{P}(X_{\ell-1}\leq s))\Big)\right|
=\left|\Big(G_{X,k}^{R}(s)+\sum_{\ell=k+1}^{i}(G_{X,\ell}^{R}(s)-G_{Y,\ell}^{R}(s))\Big)-\mathbb{P}(X_{i}\leq s)\right|.

Hence

\sup_{s}\left|\Big(G_{X,k}^{R}(s)+\sum_{\ell=k+1}^{i}(G_{X,\ell}^{R}(s)-G_{Y,\ell}^{R}(s))\Big)-\mathbb{P}(X_{i}\leq s)\right|\to 0.

We have assumed that $(X_{t})_{t\geq 0}$ converges to $\pi$ in total variation, which implies $\sup_{s}|\mathbb{P}(X_{i}\leq s)-F(s)|\to 0$ as $i\to\infty$. Also, we note that for all $s$,

\left|\sum_{\ell>i}G_{X,\ell}^{R}(s)-G_{Y,\ell}^{R}(s)\right|\leq\sum_{\ell>i}\frac{1}{R}\sum_{r=1}^{R}\mathds{1}(\ell\leq\tau^{(r)})\to\sum_{\ell>i}\mathbb{P}(\ell\leq\tau)

almost surely by the strong law of large numbers. Assumption 2.2 implies that this quantity goes to $0$ as $i\to\infty$.

Combining these observations with the result obtained for finite $i$, we conclude that $\sup_{s}|\hat{F}^{R}(s)-F(s)|\to 0$ as $R\to\infty$, almost surely. The same reasoning holds for the function $\hat{F}^{R}$ corresponding to $H_{k:m}(X,Y)$ instead of $H_{k}(X,Y)$; such a function is simply the average of a finite number of functions associated with $H_{\ell}(X,Y)$ for $\ell\in\{k,\ldots,m\}$.
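The estimator $\hat{F}^{R}$ at the heart of this proof is straightforward to compute in practice. The following sketch is ours, not from the article; it assumes each coupled run has been stored as a pair of trajectories together with its meeting time.

```python
import numpy as np

def fhat(runs, k, s):
    """Unbiased empirical CDF estimate of F(s) = P_pi(X <= s), averaged
    over R independent coupled runs. Each run is a tuple (X, Y, tau)
    with X = (X_0, ..., X_{tau-1}) and Y = (Y_0, ..., Y_{tau-2})."""
    total = 0.0
    for X, Y, tau in runs:
        term = float(X[k] <= s)
        for l in range(k + 1, tau):
            # bias-correction terms, zero once the chains have met
            term += float(X[l] <= s) - float(Y[l - 1] <= s)
        total += term
    return total / len(runs)

# A single toy run with meeting time tau = 2 (illustration only).
runs = [(np.array([0.0, 1.0]), np.array([0.5]), 2)]
```

Unlike a standard empirical CDF, individual values of $\hat{F}^{R}(s)$ can fall outside $[0,1]$ for finite $R$; the proposition shows that the estimate nonetheless converges uniformly to $F$ almost surely.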

A.3 Proof of Proposition 3.3

Throughout the proof, $C$ denotes a generic finite constant whose actual value may change from one appearance to the next. We use standard Markov chain notation; in particular, if $f:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a measurable function, then $\bar{P}[f](x,y):=\int_{\mathcal{X}\times\mathcal{X}}\bar{P}((x,y),dz)f(z)$. Note that from the construction of $\bar{P}$, if $f$ depends only on $x$ (resp. $y$), that is $f(x,y)=f(x)$ (resp. $f(x,y)=f(y)$), then $\bar{P}[f](x,y)=Pf(x)$ (resp. $\bar{P}[f](x,y)=Pf(y)$).

For $k\geq 1$, we consider the general problem of bounding $\mathbb{E}[S_{k}^{2}]$, where $S_{k}$ is of the form

S_{k}=\mathds{1}(\tau>k)\sum_{t=k}^{\tau-1}b_{t}\left(h(X_{t})-h(Y_{t-1})\right),

for some arbitrary bounded sequence $(b_{t})_{t\geq 0}$. Fix an integer $N\geq k$, and set

S_{k}^{(N)}=\mathds{1}(\tau>k)\sum_{t=k}^{N\wedge\tau-1}b_{t}\left(h(X_{t})-h(Y_{t-1})\right).

The same argument as in the proof of Proposition 3.1 applies here and shows that $(S_{k}^{(N)})$ is a Cauchy sequence in $L_{2}$ converging to $S_{k}$ as $N\to\infty$, so that

\mathbb{E}[S_{k}^{2}]=\lim_{N\to\infty}\mathbb{E}\left[(S_{k}^{(N)})^{2}\right].

Since $|h|_{V^{\beta}}:=\sup_{x}|h(x)|/V^{\beta}(x)<\infty$, and under Assumption 3.1, there exists a measurable function $g:\mathcal{X}\to\mathbb{R}$ such that $|g|_{V^{\beta}}<\infty$ and $g-Pg=h-\pi(h)$. To see this, first note that the drift condition (3.1) implies that for any $\beta\in(0,1/2)$, we have $PV^{\beta}(x)\leq\lambda^{\beta}V^{\beta}(x)+b^{\beta}\mathds{1}(x\in\mathcal{C})$ for all $x\in\mathcal{X}$. Indeed, by Jensen's inequality, $PV^{\beta}(x)\leq(PV(x))^{\beta}\leq(\lambda V(x)+b\mathds{1}(x\in\mathcal{C}))^{\beta}\leq\lambda^{\beta}V^{\beta}(x)+b^{\beta}\mathds{1}(x\in\mathcal{C})$, using the fact that for all $x,y\geq 0$ and $\alpha\in[0,1]$, $(x+y)^{\alpha}\leq x^{\alpha}+y^{\alpha}$. The drift condition in $V^{\beta}$, together with the fact that $P$ is $\phi$-irreducible and aperiodic, implies that there exist $\rho_{\beta}\in(0,1)$ and $C_{\beta}<\infty$ such that for all $x\in\mathcal{X}$ and $n\geq 0$,

\|P^{n}(x,\cdot)-\pi\|_{V^{\beta}}\leq C_{\beta}V^{\beta}(x)\rho_{\beta}^{n}, (A.1)

where, for a function $W:\mathcal{X}\to[1,\infty)$, the $W$-norm between two probability measures $\mu,\nu$ is defined as

\|\mu-\nu\|_{W}:=\sup_{f\text{ measurable}:\;|f|_{W}\leq 1}\;|\mu(f)-\nu(f)|,

and $|f|_{W}:=\sup_{x}|f(x)|/W(x)$. This result can be found in Theorem 15.0.1 of Meyn and Tweedie [2009]. It follows from (A.1) that $\sum_{j\geq 0}|P^{j}(h-\pi(h))(x)|\leq|h|_{V^{\beta}}\sum_{j\geq 0}\|P^{j}(x,\cdot)-\pi\|_{V^{\beta}}\leq C_{\beta}|h|_{V^{\beta}}V^{\beta}(x)/(1-\rho_{\beta})<\infty$. Hence the function

g(x)=\sum_{j\geq 0}P^{j}(h-\pi(h))(x),\qquad x\in\mathcal{X},

is well-defined and measurable (as a limit of a sequence of measurable functions) and satisfies $|g(x)|\leq C_{\beta}|h|_{V^{\beta}}V^{\beta}(x)/(1-\rho_{\beta})$. Since $PV^{\beta}$ is finite everywhere, Lebesgue's dominated convergence theorem implies that $Pg$ is finite everywhere as well, with

Pg(x)=\int g(y)P(x,dy)=\sum_{j\geq 0}\int P^{j}(h-\pi(h))(y)\,P(x,dy)=\sum_{j\geq 1}P^{j}(h-\pi(h))(x).

Hence $g-Pg=h-\pi(h)$, as claimed.

Hence, with $Z_{t}:=(X_{t},Y_{t-1})$ and $\bar{g}(x,y)=g(x)-g(y)$, we have

h(X_{t})-h(Y_{t-1})=\bar{g}(Z_{t})-\bar{P}[\bar{g}](Z_{t}).

Using this and a telescoping sum argument, we write

S_{k}^{(N)} =\sum_{t=k}^{N-1}b_{t}\left(h(X_{t})-h(Y_{t-1})\right)\mathds{1}(\tau>t)
=\sum_{t=k}^{N-1}b_{t}\left(\bar{g}(Z_{t})-\bar{P}[\bar{g}](Z_{t})\right)\mathds{1}(\tau>t)
=\sum_{t=k}^{N-1}b_{t}\left(\bar{g}(Z_{t+1})-\bar{P}[\bar{g}](Z_{t})\right)\mathds{1}(\tau>t)
+\sum_{t=k}^{N-1}\left(b_{t}\bar{g}(Z_{t})\mathds{1}(\tau>t)-b_{t+1}\bar{g}(Z_{t+1})\mathds{1}(\tau>t+1)\right)
+\sum_{t=k}^{N-1}\left(b_{t+1}\mathds{1}(\tau>t+1)-b_{t}\mathds{1}(\tau>t)\right)\bar{g}(Z_{t+1}).

Since $\bar{g}(Z_{t+1})=0$ on $\{\tau=t+1\}$, the last term in the above display reduces to $\sum_{t=k}^{N-1}(b_{t+1}-b_{t})\bar{g}(Z_{t+1})\mathds{1}(\tau>t+1)$, and we obtain

S_{k}^{(N)}=\sum_{t=k}^{N-1}b_{t}\left(\bar{g}(Z_{t+1})-\bar{P}[\bar{g}](Z_{t})\right)\mathds{1}(\tau>t)+b_{k}\bar{g}(Z_{k})\mathds{1}(\tau>k)-b_{N}\bar{g}(Z_{N})\mathds{1}(\tau>N)+\sum_{t=k}^{N-1}(b_{t+1}-b_{t})\bar{g}(Z_{t+1})\mathds{1}(\tau>t+1). (A.2)

Let $\mathcal{F}_{t}$ denote the sigma-algebra generated by the variables $X_{0},(X_{1},Y_{0}),\ldots,(X_{t},Y_{t-1})$. Note that $\{\tau>t\}$ belongs to $\mathcal{F}_{t}$. Hence

\mathbb{E}\left[b_{t}\left(\bar{g}(Z_{t+1})-\bar{P}[\bar{g}](Z_{t})\right)\mathds{1}(\tau>t)|\mathcal{F}_{t}\right] =b_{t}\mathds{1}(\tau>t)\mathbb{E}\left[\bar{g}(Z_{t+1})-\bar{P}[\bar{g}](Z_{t})|\mathcal{F}_{t}\right]
=b_{t}\mathds{1}(\tau>t)\left(\bar{P}[\bar{g}](Z_{t})-\bar{P}[\bar{g}](Z_{t})\right)=0.

In other words, $\left\{\left(\sum_{t=k}^{j}b_{t}\left(\bar{g}(Z_{t+1})-\bar{P}[\bar{g}](Z_{t})\right)\mathds{1}(\tau>t),\;\mathcal{F}_{j}\right),\;k\leq j\leq N-1\right\}$ is a martingale. The orthogonality of the martingale increments gives

\mathbb{E}\left[\left(\sum_{t=k}^{N-1}b_{t}\left(\bar{g}(Z_{t+1})-\bar{P}[\bar{g}](Z_{t})\right)\mathds{1}(\tau>t)\right)^{2}\right]=\sum_{t=k}^{N-1}b_{t}^{2}\mathbb{E}\left[\left(\bar{g}(Z_{t+1})-\bar{P}[\bar{g}](Z_{t})\right)^{2}\mathds{1}(\tau>t)\right].

We use this together with (A.2), the convexity of the squared norm, and Minkowski’s inequality to conclude that

\mathbb{E}\left[(S_{k}^{(N)})^{2}\right]\leq 4\sum_{t=k}^{N-1}b_{t}^{2}\mathbb{E}\left[\left(\bar{g}(Z_{t+1})-\bar{P}[\bar{g}](Z_{t})\right)^{2}\mathds{1}(\tau>t)\right]+4b_{k}^{2}\mathbb{E}\left[\bar{g}^{2}(Z_{k})\mathds{1}(\tau>k)\right]+4b_{N}^{2}\mathbb{E}\left[\bar{g}^{2}(Z_{N})\mathds{1}(\tau>N)\right]+4\left[\sum_{t=k}^{N-1}|b_{t+1}-b_{t}|\,\mathbb{E}^{1/2}\left(\bar{g}^{2}(Z_{t+1})\mathds{1}(\tau>t+1)\right)\right]^{2}. (A.3)

Assumption 3.1, together with $\pi_{0}(V)<\infty$, implies that

\sup_{n\geq 0}\mathbb{E}[V(X_{n})]\leq C, (A.4)

for some finite constant $C$. Indeed, $\mathbb{E}[V(X_{n})]=\int\pi_{0}(dx)P^{n}V(x)$, and a repeated application of the drift condition (3.1) implies that $P^{n}V(x)\leq\lambda^{n}V(x)+\frac{b}{1-\lambda}$ for all $x\in\mathcal{X}$. For any $t\geq 0$, and for $p=\frac{1}{2\beta}>1$ and $\frac{1}{p}+\frac{1}{q}=1$, we have

\mathbb{E}\left[\left(V^{\beta}(X_{t})+V^{\beta}(Y_{t-1})\right)^{2}\mathds{1}(\tau>t)\right] \leq\mathbb{E}^{1/p}\left[\left(V^{\beta}(X_{t})+V^{\beta}(Y_{t-1})\right)^{2p}\right]\mathbb{P}^{1/q}(\tau>t)\quad\text{(Hölder)}
\leq\left\{\mathbb{E}^{1/(2p)}\left[V^{2p\beta}(X_{t})\right]+\mathbb{E}^{1/(2p)}\left[V^{2p\beta}(Y_{t-1})\right]\right\}^{2}\mathbb{P}^{1/q}(\tau>t)\quad\text{(Minkowski)}
\leq C\,\mathbb{P}^{1/q}(\tau>t)\quad\text{(by (A.4))}
\leq C\,\delta_{\beta}^{t}\quad\text{(by Assumption 2.2)},

where $\delta_{\beta}=\delta^{1/q}$, with $\delta$ as in Assumption 2.2. Note that all the expectations on the right-hand side of (A.3) are of the form $\mathbb{E}\left[\bar{g}^{2}(Z_{t})\mathds{1}(\tau>t)\right]$, except for the terms $\mathbb{E}\left[\left(\bar{g}(Z_{t+1})-\bar{P}[\bar{g}](Z_{t})\right)^{2}\mathds{1}(\tau>t)\right]$. However, by the martingale difference property, $\mathbb{E}\left[\bar{P}[\bar{g}](Z_{t})\left(\bar{g}(Z_{t+1})-\bar{P}[\bar{g}](Z_{t})\right)\mathds{1}(\tau>t)\right]=0$, so that for all $t\geq k$,

\mathbb{E}\left[\left(\bar{g}(Z_{t+1})-\bar{P}[\bar{g}](Z_{t})\right)^{2}\mathds{1}(\tau>t)\right] =\mathbb{E}\left[\mathds{1}(\tau>t)\bar{g}(Z_{t+1})^{2}\right]-\mathbb{E}\left[\bar{g}(Z_{t+1})\bar{P}[\bar{g}](Z_{t})\mathds{1}(\tau>t)\right]
=\mathbb{E}\left[\mathds{1}(\tau>t)\bar{g}(Z_{t+1})^{2}\right]-\mathbb{E}\left[\left(\bar{P}[\bar{g}](Z_{t})\right)^{2}\mathds{1}(\tau>t)\right]\quad\text{(by conditioning on $\mathcal{F}_{t}$)}
\leq\mathbb{E}\left[\mathds{1}(\tau>t)\bar{g}(Z_{t+1})^{2}\right]
\leq|g|_{V^{\beta}}^{2}\,\mathbb{E}\left[\mathds{1}(\tau>t)\left(V^{\beta}(X_{t+1})+V^{\beta}(Y_{t})\right)^{2}\right]
\leq C\,\delta_{\beta}^{t},

using the same arguments as above. On the other hand, for all $t\geq k$, the term $\mathbb{E}\left[\bar{g}^{2}(Z_{t})\mathds{1}(\tau>t)\right]$ satisfies

\mathbb{E}\left[\bar{g}^{2}(Z_{t})\mathds{1}(\tau>t)\right]\leq|g|_{V^{\beta}}^{2}\,\mathbb{E}\left[\left(V^{\beta}(X_{t})+V^{\beta}(Y_{t-1})\right)^{2}\mathds{1}(\tau>t)\right]\leq C\,\delta_{\beta}^{t},

as seen above. In conclusion, all the expectations appearing in (A.3) are upper bounded by some constant times terms of the form $\delta_{\beta}^{t}$. We conclude that

\mathbb{E}\left[(S_{k}^{(N)})^{2}\right]\leq C\left(b_{k}^{2}\delta_{\beta}^{k}+b_{N}^{2}\delta_{\beta}^{N}+\sum_{t=k}^{N-1}b_{t}^{2}\delta_{\beta}^{t}+\left[\sum_{t=k}^{N-1}|b_{t+1}-b_{t}|\delta_{\beta}^{t}\right]^{2}\right).

Letting NN\to\infty, we conclude that

\mathbb{E}[S_{k}^{2}]\leq C\left(b_{k}^{2}\delta_{\beta}^{k}+\sum_{j\geq k}b_{j}^{2}\delta_{\beta}^{j}+\left[\sum_{j\geq k}|b_{j+1}-b_{j}|\delta_{\beta}^{j}\right]^{2}\right).

In the particular case of $\eta_{k:m}$, we have $\eta_{k:m}=\mathds{1}(\tau>k)\sum_{t=k}^{\tau-1}\min\left(1,\frac{t+1-k}{m+1-k}\right)\left(h(X_{t+1})-h(Y_{t})\right)$. Hence $b_{k}=0$, $b_{t}=(t-k)/(m-k+1)$ if $k<t\leq m+1$, and $b_{t}=1$ if $t>m+1$. We then obtain the bound of Proposition 3.3.

A.4 Proof of Proposition 3.4

Here $Z_{n}$ is defined as $(X_{n},Y_{n-1})$ for all $n\geq 1$. The assumption in (3.3), within the statement of Proposition 3.4, implies that for $(x,y)\in\mathcal{C}\times\mathcal{C}$, $\bar{P}$ can be written as a mixture

\bar{P}\left((x,y),dz\right)=\epsilon_{x,y}\nu_{x,y}(dz)+(1-\epsilon_{x,y})R\left((x,y),dz\right),

where $\epsilon_{x,y}\geq\epsilon$, $\nu_{x,y}(dz)$ is the normalized restriction of $\bar{P}\left((x,y),dz\right)$ to $\mathcal{D}$ (that is, for any measurable subset $A$ of $\mathcal{X}\times\mathcal{X}$, $\nu_{x,y}(A)=\bar{P}\left((x,y),A\cap\mathcal{D}\right)/\bar{P}\left((x,y),\mathcal{D}\right)$), and $R\left((x,y),dz\right)$ is the normalized restriction of $\bar{P}\left((x,y),dz\right)$ to $(\mathcal{X}\times\mathcal{X})\setminus\mathcal{D}$. This means that whenever $(x,y)\in\mathcal{C}\times\mathcal{C}$, one can sample from $\bar{P}((x,y),\cdot)$ by first drawing a Bernoulli random variable $J$ with probability of success $\epsilon_{x,y}$: if $J=1$, we draw from $\nu_{x,y}$; if $J=0$, we draw from $R\left((x,y),\cdot\right)$. From this decomposition, the proof of the proposition follows the same lines as in Douc et al. [2004], and we give the details only for completeness. We cannot directly invoke their result since their assumptions do not seem to apply to our setting.
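The Bernoulli construction in this decomposition is easy to state in code. The sketch below is ours and purely illustrative: `sample_nu` and `sample_R` are hypothetical samplers for the two components of the mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture_kernel(z, eps, sample_nu, sample_R):
    """Draw from P_bar(z, .) = eps * nu_z + (1 - eps) * R_z by first
    drawing J ~ Bernoulli(eps). If J = 1 we draw from nu_z, which is
    supported on the diagonal D, so the chains meet; otherwise we draw
    from the residual kernel R_z. Returns the draw and the event {J = 1}."""
    if rng.uniform() < eps:
        return sample_nu(z), True   # a meeting occurs at this step
    return sample_R(z), False
```

Iterating this kernel, each visit to $\mathcal{C}\times\mathcal{C}$ gives a meeting probability of at least $\epsilon$, which is exactly the geometric-trials argument used below.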

Set $\bar{V}(x,y)=\frac{1}{2}(V(x)+V(y))$. First we show that the bivariate kernel satisfies a geometric drift condition towards $\mathcal{C}\times\mathcal{C}$; that is, there exists $\alpha\in(0,1)$ such that

\bar{P}\bar{V}(x,y)\leq\alpha\bar{V}(x,y),\qquad(x,y)\notin\mathcal{C}\times\mathcal{C}. (A.5)

Indeed, for $(x,y)\notin\mathcal{C}\times\mathcal{C}$, since $V\geq 1$ and $\mathcal{C}=\{V\leq L\}$, we have $\bar{V}(x,y)\geq(1+L)/2$; in other words, $\frac{1}{2}\leq\bar{V}(x,y)/(1+L)$. Therefore,

\bar{P}\bar{V}(x,y)=\frac{1}{2}\left(PV(x)+PV(y)\right)\leq\lambda\bar{V}(x,y)+\frac{b}{2}\leq\lambda\bar{V}(x,y)+\frac{b}{1+L}\bar{V}(x,y)\leq\alpha\bar{V}(x,y),

with $\alpha=\lambda+\frac{b}{1+L}<1$. We set

B=\max\left(1,\frac{1}{\alpha}\sup_{(x,y)\in\mathcal{C}\times\mathcal{C}}\frac{\bar{P}[\bar{V}\mathds{1}_{\mathcal{D}^{c}}](x,y)}{\bar{V}(x,y)}\right)\leq\frac{\lambda+b}{\alpha}.

In this section, $\mathds{1}_{\mathcal{S}}(\cdot)$ refers to the indicator function of the set $\mathcal{S}$. Let $N_{n}$ denote the number of visits to $\mathcal{C}\times\mathcal{C}$ by time $n$. Then

\mathbb{P}(\tau>n)=\mathbb{P}(\tau>n,\;N_{n-1}\geq j)+\mathbb{P}(\tau>n,\;N_{n-1}<j).

The event $\{\tau>n,\;N_{n-1}\geq j\}$ implies that no success occurred among at least $j$ independent Bernoulli random variables, each with probability of success at least $\epsilon$. Hence

\mathbb{P}(\tau>n,\;N_{n-1}\geq j)\leq(1-\epsilon)^{j}.

For the second term, since $B\geq 1$ and the chains stay together after meeting by Assumption 2.3, we have

\mathbb{P}(\tau>n,\;N_{n-1}\leq j-1)\leq\mathbb{P}\left(Z_{n}\notin\mathcal{D},\;B^{-N_{n-1}}\geq B^{-(j-1)}\right)=\mathbb{P}\left(\mathds{1}_{\mathcal{D}^{c}}(Z_{n})B^{-N_{n-1}}\geq B^{-(j-1)}\right).

We then use Markov's inequality to conclude that

\mathbb{P}(\tau>n,\;N_{n-1}\leq j-1)\leq B^{j-1}\mathbb{E}\left[\mathds{1}_{\mathcal{D}^{c}}(Z_{n})B^{-N_{n-1}}\right]\leq B^{j-1}\mathbb{E}\left[\mathds{1}_{\mathcal{D}^{c}}(Z_{n})B^{-N_{n-1}}\bar{V}(Z_{n})\right]=\alpha^{n}B^{j-1}\mathbb{E}[M_{n}],

where $M_{n}=\mathds{1}_{\mathcal{D}^{c}}(Z_{n})\alpha^{-n}B^{-N_{n-1}}\bar{V}(Z_{n})$ (set $N_{0}=0$ so that $M_{1}$ is well-defined). The result follows by noting that $\{M_{n},\mathcal{F}_{n}\}$ is a super-martingale, where $\mathcal{F}_{n}=\sigma(Z_{1},\ldots,Z_{n})$, so that $\mathbb{E}[M_{n}]\leq\mathbb{E}[M_{1}]\leq\pi_{0}(V)+\pi_{0}(PV)\leq(1+\lambda)\pi_{0}(V)+b$, which implies that

\mathbb{P}(\tau>n)\leq(1-\epsilon)^{j}+\alpha^{n}B^{j-1}\left((1+\lambda)\pi_{0}(V)+b\right).

Since $\alpha<1$, there exists an integer $k_{0}\geq 1$ such that $\alpha B^{1/k_{0}}<1$. In that case, for $n\geq k_{0}$, one can take $j=\lceil n/k_{0}\rceil$ to get

\mathbb{P}(\tau>n)\leq\left\{(1-\epsilon)^{1/k_{0}}\right\}^{n}+\left((1+\lambda)\pi_{0}(V)+b\right)\left\{\alpha B^{1/k_{0}}\right\}^{n},

as claimed.

The argument that $\{M_{n},\mathcal{F}_{n}\}$ is a super-martingale is as follows. We need to show that for all $n\geq 1$, $\mathbb{E}[M_{n+1}|\mathcal{F}_{n}]\leq M_{n}$. Note that $\mathbb{E}[M_{n+1}|\mathcal{F}_{n}]=0\leq M_{n}$ on $\{Z_{n}\in\mathcal{D}\}$, so it is enough to consider the case $Z_{n}\notin\mathcal{D}$. Suppose first that $Z_{n}\notin\mathcal{C}\times\mathcal{C}$. Then $N_{n}=N_{n-1}$, and

\mathbb{E}[M_{n+1}|\mathcal{F}_{n}] =\alpha^{-n-1}\mathbb{E}\left[B^{-N_{n-1}}\mathds{1}_{\mathcal{D}^{c}}(Z_{n+1})\bar{V}(Z_{n+1})|\mathcal{F}_{n}\right]
=\alpha^{-n-1}B^{-N_{n-1}}\mathbb{E}\left[\mathds{1}_{\mathcal{D}^{c}}(Z_{n+1})\bar{V}(Z_{n+1})|Z_{n}\right]
\leq\alpha^{-n-1}B^{-N_{n-1}}\mathbb{E}\left[\bar{V}(Z_{n+1})|Z_{n}\right]
\leq\alpha^{-n}B^{-N_{n-1}}\bar{V}(Z_{n})
=M_{n}.

Suppose now that $Z_{n}\in\mathcal{C}\times\mathcal{C}$. Then $N_{n}=N_{n-1}+1$. Hence

\mathbb{E}[M_{n+1}|\mathcal{F}_{n}] =\alpha^{-n-1}B^{-N_{n-1}-1}\mathbb{E}\left[\mathds{1}_{\mathcal{D}^{c}}(Z_{n+1})\bar{V}(Z_{n+1})|\mathcal{F}_{n}\right]
=\alpha^{-n}B^{-N_{n-1}}\bar{V}(Z_{n})\,\frac{1}{\alpha B}\,\frac{\bar{P}[\mathds{1}_{\mathcal{D}^{c}}\bar{V}](Z_{n})}{\bar{V}(Z_{n})}
\leq\alpha^{-n}B^{-N_{n-1}}\bar{V}(Z_{n})=M_{n}.

Appendix B Hamiltonian Monte Carlo on multivariate Normals

As in Section 4 of the main document, we perform experiments with a $d$-dimensional Normal target distribution $\pi=\mathcal{N}(0,V)$, where $V$ is the inverse of a matrix drawn from a Wishart distribution with identity scale matrix and $d$ degrees of freedom. We provide average meeting times obtained in varying dimensions when using the coupled HMC algorithm described in Heng and Jacob [2019]. That article presents similar experiments but for a different choice of matrix $V$; we provide the present section for completeness.

We introduce a Markov kernel $P$ as a mixture of two kernels, an MH kernel $P_{\text{MH}}$ and an HMC kernel $P_{\text{HMC}}$. We first describe $P_{\text{MH}}$ and a coupling of it. This kernel is an MH kernel with Normal random walk proposals, with covariance matrix equal to $10^{-8}$ times the identity matrix. The coupled version of $P_{\text{MH}}$ uses a maximal coupling of the proposals (as in Algorithm 2 of the main document).

The kernel $P_{\text{HMC}}$ corresponds to an HMC algorithm with mass matrix given by the inverse of the target variance $V$. This preconditioning is motivated by considerations similar to those in Girolami and Calderhead [2011]. In the present case of Normal distributions, it is particularly advantageous as it leads to a complete decoupling of the $d$ components of the target; see Proposition 3.1 in Bou-Rabee and Sanz-Serna [2018]. We discretize the Hamiltonian equations with a leap-frog integrator, using a stepsize of $\varepsilon=0.1\times d^{-1/4}$ and a number of steps $L=1+\lfloor\varepsilon^{-1}\rfloor$, which corresponds to a trajectory length $\varepsilon L$ of approximately one. The coupling of such Hamiltonian kernels is done by using common random numbers for the momentum variables, i.e. a synchronous coupling [Bou-Rabee et al., 2018]. The kernels $P_{\text{MH}}$ and $P_{\text{HMC}}$, and their coupled counterparts, are combined into mixtures $P$ and $\bar{P}$ with weights of $0.05$ and $0.95$ respectively, i.e. an MH step is performed with probability $0.05$.
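To make the maximal coupling of the random walk proposals concrete, here is a sketch of a standard rejection-based maximal coupling sampler for two Normal proposals with a common scale, in the spirit of Algorithm 2 of the main document; the implementation below is ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def normal_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

def maximal_coupling_normals(mu1, mu2, sigma):
    """Sample (X, Y) with X ~ N(mu1, sigma^2), Y ~ N(mu2, sigma^2),
    maximizing the probability of the event {X = Y}."""
    x = rng.normal(mu1, sigma)
    # accept X as the common value with probability min(1, q(X) / p(X))
    if np.log(rng.uniform()) + normal_logpdf(x, mu1, sigma) <= normal_logpdf(x, mu2, sigma):
        return x, x
    # otherwise, sample Y from q until it falls outside the overlap region
    while True:
        y = rng.normal(mu2, sigma)
        if np.log(rng.uniform()) + normal_logpdf(y, mu2, sigma) > normal_logpdf(y, mu1, sigma):
            return x, y
```

When the two proposal means coincide, the first acceptance step always succeeds and the two proposals are identical, which is what allows the coupled chains to meet exactly.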

We consider two types of initialization $\pi_{0}$: either the target distribution $\pi$, or a Normal distribution $\mathcal{N}(1_{d},I_{d})$, with $1_{d}$ a vector of ones and $I_{d}$ the identity matrix. With the latter initialization, we observed a very low acceptance rate when using the stepsize $\varepsilon$ given above. We did not observe any issue when the chains were started from $\pi$, nor when $V$ was replaced by the identity matrix. In principle, smaller stepsizes could be chosen to increase the acceptance rate; however, this would result in more expensive iterations, as $L$ is defined as $1+\lfloor\varepsilon^{-1}\rfloor$ above. Instead, we resort to the following heuristic strategy, which appears to solve the issue in the present example. We draw an initial position from $\mathcal{N}(1_{d},I_{d})$, then perform $10$ steps of “unadjusted HMC”, with $\varepsilon=0.1\times d^{-1/4}$ and $L=1+\lfloor\varepsilon^{-1}\rfloor$ as described above. By “unadjusted HMC”, we refer to a scheme where the final point of the Hamiltonian trajectory is accepted with probability one, i.e. no MH correction is applied. This initialization procedure is simply a redefinition of $\pi_{0}$, and thus does not jeopardize the validity of the proposed estimators.
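For reference, one “unadjusted HMC” step consists of a momentum refreshment followed by a leap-frog integration whose endpoint is kept unconditionally. The sketch below is ours; it assumes an identity mass matrix for simplicity (the experiments above use the preconditioned mass matrix $V^{-1}$) and takes the gradient of the log-target as input.

```python
import numpy as np

def leapfrog(x, v, grad_logpi, eps, L):
    """Leap-frog integration of Hamiltonian dynamics: L steps of size eps."""
    v = v + 0.5 * eps * grad_logpi(x)      # initial half step on momentum
    for i in range(L):
        x = x + eps * v                    # full step on position
        if i < L - 1:
            v = v + eps * grad_logpi(x)    # full step on momentum
    v = v + 0.5 * eps * grad_logpi(x)      # final half step on momentum
    return x, v

def unadjusted_hmc_step(x, grad_logpi, eps, L, rng):
    """One unadjusted HMC move: fresh momentum, leap-frog, no MH correction."""
    v = rng.normal(size=np.shape(x))
    x_new, _ = leapfrog(x, v, grad_logpi, eps, L)
    return x_new
```

Repeating `unadjusted_hmc_step` ten times from a draw of $\mathcal{N}(1_{d},I_{d})$ implements the initialization heuristic described above.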

The results under both initializations are shown in Figure 8, where we observe average meeting times that increase very slowly with the dimension of the target distribution. Thanks to the initialization strategy described above, we obtain similar meeting times when starting the chains from $\pi$ or from $\mathcal{N}(1_{d},I_{d})$.

Figure 8: Scaling of the average meeting time of a coupled HMC algorithm with the dimension of the target $\mathcal{N}(0,V)$, where $V$ is the inverse of a Wishart draw, as described in Section B. The chains are either initialized from the target, or from a Normal $\mathcal{N}(1_{d},I_{d})$ followed by 10 steps of unadjusted HMC, as described in Section B; this is referred to as “offset” in the legend.

Appendix C Baseball batting averages

We consider a classic Gibbs sampler discussed in the context of parallel computing in Rosenthal [2000]. In Rosenthal [1996], it was proved that the chain produced by this Gibbs sampler converges in total variation to within $1\%$ of its stationary distribution after at most $140$ iterations. This type of derivation is technically challenging and has been done only in specific cases. Using this result, Rosenthal [2000] recommends running parallel chains with a burn-in of $140$ iterations. We compare this value with choices of $k$ and $m$ for the proposed unbiased estimators.

The data $(Z_{n})_{n=1}^{K}$ are baseball players' batting averages taken from Table 1 of Morris [1983], where $K=18$. In the model, each $Z_{n}$ is assumed to follow $\mathcal{N}(\theta_{n},V)$, where $V$ is fixed to $0.00434$. Then $\theta_{n}$ is assumed to follow $\mathcal{N}(\mu,A)$, where $\mu$ is given a flat prior and $A$ an Inverse Gamma$(a,b)$ prior, with $a=-1$ and $b=2$. Here the Inverse Gamma$(\alpha,\beta)$ distribution has probability density function $p(x;\alpha,\beta)=\Gamma(\alpha)^{-1}\beta^{\alpha}x^{-\alpha-1}\exp(-\beta/x)$. The Gibbs updates are as follows:

A\,|\,\text{rest} \sim\text{Inverse Gamma}\left(a+(K-1)/2,\;b+\sum_{n=1}^{K}(\theta_{n}-\bar{\theta})^{2}/2\right),\quad\text{where}\;\bar{\theta}=K^{-1}\sum_{n=1}^{K}\theta_{n},
\mu\,|\,\text{rest} \sim\mathcal{N}\left(\bar{\theta},A/K\right),
\theta_{n}\,|\,\text{rest} \sim\mathcal{N}\left((V+A)^{-1}(\mu V+Z_{n}A),\left(V+A\right)^{-1}(AV)\right),\quad\text{for all }n\in\{1,\ldots,K\}.

The initial values of the Gibbs sampler can be taken as $K^{-1}\sum_{n=1}^{K}Z_{n}$ for all $\theta_{n}$ [Rosenthal, 2000]. We couple this Gibbs sampler using maximal couplings of the conditional updates. The parameter space is $20$-dimensional, but the chains meet as soon as the components $(A,\mu)$ meet, by construction. We consider the test function $h:(A,\mu,\theta_{1},\ldots,\theta_{K})\mapsto\theta_{1}$; that is, we are interested in the posterior expectation of $\theta_{1}$, the mean batting average of the first player.
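For concreteness, a single (uncoupled) sweep of this Gibbs sampler can be sketched as follows. The code is ours and only illustrates the three conditional updates; the coupled version would additionally apply maximal couplings to each update, which we omit here, and the data below are placeholder values, not Morris's table.

```python
import numpy as np

def gibbs_sweep(theta, Z, rng, V=0.00434, a=-1.0, b=2.0):
    """One sweep of the Gibbs sampler: update A, then mu, then theta."""
    K = len(Z)
    theta_bar = theta.mean()
    # A | rest ~ Inverse Gamma(a + (K-1)/2, b + sum_n (theta_n - theta_bar)^2 / 2)
    shape = a + (K - 1) / 2
    rate = b + 0.5 * np.sum((theta - theta_bar) ** 2)
    A = rate / rng.gamma(shape)   # if G ~ Gamma(shape, 1) then rate / G ~ InvGamma(shape, rate)
    # mu | rest ~ N(theta_bar, A / K)
    mu = rng.normal(theta_bar, np.sqrt(A / K))
    # theta_n | rest ~ N((mu V + Z_n A) / (V + A), A V / (V + A))
    post_mean = (mu * V + Z * A) / (V + A)
    post_sd = np.sqrt(A * V / (V + A))
    theta = rng.normal(post_mean, post_sd)
    return A, mu, theta

rng = np.random.default_rng(0)
Z = rng.normal(0.26, 0.05, size=18)    # placeholder batting averages
theta = np.full(18, Z.mean())          # initialization recommended above
A, mu, theta = gibbs_sweep(theta, Z, rng)
```

Note that the conditional distribution of each $\theta_{n}$ depends on the current state only through $(A,\mu)$, which is why maximally coupled chains meet exactly once the components $(A,\mu)$ have met.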

We first run the coupled chains 1,0001,000 times independently in parallel. All observed meeting times were less than or equal to 44, so we consider kk equal to 11, 33 and 55. Then, we consider mm equal to kk, 5k5k and 10k10k. Over 10,00010,000 independent experiments, we approximate the expected cost 𝔼[2(τ1)+max(1,mτ+1)]\mathbb{E}[2(\tau-1)+\max(1,m-\tau+1)] and the variance 𝕍[Hk:m(X,Y)]\mathbb{V}[H_{k:m}(X,Y)], and we compute the inefficiency as the product of expected cost and variance. We then divide the inefficiency by the asymptotic variance of the MCMC estimator, denoted by VV_{\infty}, obtained from 5×1055\times 10^{5} iterations with a burn-in period of 10310^{3}, using the CODA package [Plummer et al., 2006]. The results are shown in Table 4. We see that when kk and mm are large enough we retrieve an inefficiency comparable to that of the underlying MCMC algorithm; for instance k=3k=3 and m=30m=30 yields an inefficiency ratio close to 11. The value less than 11 obtained for k=5k=5, m=50m=50 is likely due to Monte Carlo variability; we expect a value greater than or equal to 11.
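The quantities reported in Table 4 can be assembled from replicated runs as in the following sketch; the array names are ours, and `V_inf` would come from a long MCMC run as described above.

```python
import numpy as np

def inefficiency_ratio(tau, H, m, V_inf):
    """Expected cost times variance of H_{k:m}, divided by the MCMC
    asymptotic variance V_inf.  tau and H hold the meeting times and
    the unbiased estimates over R independent replicates."""
    cost = np.mean(2 * (tau - 1) + np.maximum(1, m - tau + 1))
    variance = np.var(H, ddof=1)
    return cost * variance / V_inf
```

A value near 11 indicates that the unbiased estimators cost about as much per unit of variance as the underlying MCMC algorithm.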

kk mm Cost Variance Inefficiency / VV_{\infty}
1 1×k1\times k 5.1494 0.0070 5.2152
1 5×k5\times k 8.0747 0.0013 1.5164
1 10×k10\times k 13.0747 0.0006 1.1838
3 1×k1\times k 6.0755 0.0061 5.3332
3 5×k5\times k 18.0747 0.0005 1.1990
3 10×k10\times k 33.0747 0.0002 1.0044
5 1×k1\times k 8.0747 0.0059 6.8677
5 5×k5\times k 28.0747 0.0003 1.1328
5 10×k10\times k 53.0747 0.0001 0.9816
Table 4: Cost, variance, and inefficiency divided by MCMC asymptotic variance VV_{\infty} for various choices of kk and mm, for the test function h:(A,μ,θ1,,θK)θ1h:(A,\mu,\theta_{1},\ldots,\theta_{K})\mapsto\theta_{1}, in the baseball batting averages example of Section C.

Since the efficiency of the underlying MCMC kernel is retrieved with k=3k=3, m=30m=30, for an associated cost comparable to 3333 steps of MCMC, we can run in parallel what is effectively equivalent to MCMC runs of 3333 iterations. This is considerably less than the recommended burn-in of 140140 derived in Rosenthal [1996], which strongly indicates that this recommended burn-in is conservative.

We plot histograms of θ1\theta_{1}, μ\mu and AA in Figures 9a, 9b and 9c, obtained from R=10,000R=10,000 estimators with k=3k=3 and m=30m=30. The overlaid red curves indicate the marginal target densities estimated from an MCMC run with 5×1055\times 10^{5} iterations and a burn-in of 1,0001,000 (which is unnecessarily conservative). These histograms confirm that accurate approximations of the posterior distribution are obtained with the proposed method.

[Figure 9 panels: (a) histogram of θ1\theta_{1}; (b) histogram of μ\mu; (c) histogram of AA.]
Figure 9: Gibbs sampling in the baseball batting averages example of Section C. Histograms of marginal posterior distributions of θ1\theta_{1}, μ\mu and AA in 9a, 9b, 9c, produced using R=10,000R=10,000 estimators, k=3k=3, m=30m=30.

Appendix D Pólya-Gamma Gibbs sampler for logistic regression

Next, we turn to a more modern MCMC setting and demonstrate that the proposed estimators can be constructed from the Pólya-Gamma Gibbs (PGG) sampler [Polson et al., 2013].

We consider the logistic regression of an outcome Y=(Y1,,Yn){0,1}nY=(Y_{1},\ldots,Y_{n})\in\left\{0,1\right\}^{n} on covariates X=(x1,,xn)n×pX=(x_{1},\ldots,x_{n})\in\mathbb{R}^{n\times p}. The model specifies (Yi=1|β)=expit(xiTβ)\mathbb{P}(Y_{i}=1|\beta)=\text{expit}\left(x_{i}^{T}\beta\right), where expit:z1/(1+exp(z))\text{expit}:z\mapsto 1/\left(1+\exp(-z)\right), and the vector βp\beta\in\mathbb{R}^{p} is the regression parameter. Realizations of Y=(Y1,,Yn)Y=(Y_{1},\ldots,Y_{n}) are denoted by y=(y1,,yn)y=(y_{1},\ldots,y_{n}). The prior distribution on β\beta is 𝒩(b,B)\mathcal{N}\left(b,B\right), where we set the mean bb to zero and the covariance BB as diagonal, with non-zero entries equal to 1010. The Pólya-Gamma Gibbs (PGG) sampler [Polson et al., 2013] is an MCMC algorithm targeting the posterior distribution, extended with nn Pólya-Gamma variables W=(W1,,Wn)W=(W_{1},\ldots,W_{n}). The Pólya-Gamma distribution with parameters (1,c)(1,c), denoted by PG(1,c), has density defined for all c0c\geq 0 as

x>0pg(x;c)=cosh(c2)exp(c2x2)k=0(1)k(2k+1)2πx3exp((2k+1)28x).\forall x>0\quad\text{pg}(x;c)=\text{cosh}\left(\frac{c}{2}\right)\exp\left(-\frac{c^{2}x}{2}\right)\sum_{k=0}^{\infty}\left(-1\right)^{k}\frac{\left(2k+1\right)}{\sqrt{2\pi x^{3}}}\exp\left(-\frac{(2k+1)^{2}}{8x}\right).

Under the extended target distribution the variables W=(W1,,Wn)W=(W_{1},\ldots,W_{n}) are independent of each other given β\beta, and have the property that WiW_{i} follows PG(1,|xiTβ|)(1,|x_{i}^{T}\beta|) for all 1in1\leq i\leq n. The PGG sampler is a Gibbs sampler which alternates between the following updates:

Wi|rest\displaystyle W_{i}|\text{rest} PG(1,|xiTβ|)for all i{1,,n},\displaystyle\sim\text{PG}(1,|x_{i}^{T}\beta|)\quad\text{for all }i\in\{1,\ldots,n\},
β|rest\displaystyle\beta|\text{rest} 𝒩(Σ(W)(XTy~+B1b),Σ(W)),with Σ(W)=(XTdiag(W)X+B1)1,\displaystyle\sim\mathcal{N}(\Sigma(W)(X^{T}\tilde{y}+B^{-1}b),\Sigma(W)),\quad\text{with }\Sigma(W)=(X^{T}\text{diag}(W)X+B^{-1})^{-1},

where y~=(y11/2,,yn1/2)\tilde{y}=(y_{1}-\nicefrac{{1}}{{2}},\ldots,y_{n}-\nicefrac{{1}}{{2}}). The resulting chain targets the posterior distribution and is uniformly ergodic [Choi and Hobert, 2013]. Here we initialize the algorithm with draws from the prior. We couple this chain using maximal couplings of the conditional updates. For the Pólya-Gamma updates, the probability density functions are intractable but the ratio of two density evaluations can be calculated using the identity

x>0pg(x;c2)pg(x;c1)=cosh(c2/2)cosh(c1/2)exp((c222c122)x),\forall x>0\quad\frac{\text{pg}\left(x;c_{2}\right)}{\text{pg}\left(x;c_{1}\right)}=\frac{\text{cosh}(c_{2}/2)}{\text{cosh}(c_{1}/2)}\exp\left(-\left(\frac{c_{2}^{2}}{2}-\frac{c_{1}^{2}}{2}\right)x\right),

which enables a fast implementation of the maximal coupling algorithm described in Algorithm 2 of the main document.
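This identity can be checked numerically against a truncation of the series density given above; the truncation level below is our own choice. Note that the x-dependent series factor is common to both densities, so it cancels exactly in the ratio.

```python
import numpy as np

def pg_density(x, c, terms=200):
    """Truncated series for the PG(1, c) density given above (sketch)."""
    k = np.arange(terms)
    series = np.sum((-1.0) ** k * (2 * k + 1) / np.sqrt(2 * np.pi * x ** 3)
                    * np.exp(-(2 * k + 1) ** 2 / (8 * x)))
    return np.cosh(c / 2) * np.exp(-c ** 2 * x / 2) * series

def pg_ratio(x, c1, c2):
    """Closed-form ratio pg(x; c2) / pg(x; c1)."""
    return (np.cosh(c2 / 2) / np.cosh(c1 / 2)
            * np.exp(-(c2 ** 2 / 2 - c1 ** 2 / 2) * x))
```

The closed form avoids evaluating the series at all, which is what makes the maximal coupling of two Pólya-Gamma updates fast in practice.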

We apply the proposed method to the German credit data of Lichman [2013], a common example in binary regression and machine learning studies such as Polson et al. [2013], Huang et al. [2007], West [2000]. This dataset consists of 1,0001,000 loan application records, 700 of which were rated creditworthy and 300 not. Each record includes 20 additional variables, such as loan purpose, demographic information, bank account balances, marital, housing, and employment status, and job type. Seven of these are quantitative and the rest are categorical. After translating categorical variables into indicators we obtain p=49p=49 regressors on n=1,000n=1,000 observations. A histogram of the meeting times for R=1,000R=1,000 coupled chains is shown in Figure 10a. These chains took between 18 and 164 steps to meet, with an average meeting time of 48 iterations.

[Figure 10 panels: (a) histogram of meeting times; (b) number of PG variables met; (c) PG distance WW~2||W-\tilde{W}||_{2}; (d) coefficient distance ββ~2||\beta-\tilde{\beta}||_{2}; (e) efficiency of Hk(X,Y)H_{k}(X,Y) vs. kk; (f) histogram of β\beta.]
Figure 10: PGG sampling in the German credit example of Section D. Histogram of meeting times in 10a. Trace of the number of Pólya-Gamma variables met in 10b, for one typical run of the coupled chains. Trace of the between-chain distance for Pólya-Gamma variables and regression coefficients in 10c-10d, for one typical run of the coupled chains. Efficiency of Hk(X,Y)H_{k}(X,Y) as a function of kk in 10e, computed over R=1,000R=1,000 replicates. Histogram of the marginal posterior for the ‘installment percent’ regression coefficient in 10f, based on R=1,000R=1,000 estimators, k=110k=110 and m=1,100m=1,100.

The extended space of the Gibbs sampler is of dimension n+p=1,049n+p=1,049. However, the two chains meet as soon as either all the nn auxiliary PG variables or all the pp regression coefficients meet. Since we use a maximal coupling of the update of the full vector of regression coefficients, either all or none of these coefficients meet at each iteration; MH-within-Gibbs strategies could be employed instead. As shown in Figure 10b, for one run of the coupled chains, the number of met PG variables rapidly increases to a plateau, at which point the chains are close enough for the regression coefficients to couple. Figure 10c shows the Euclidean distance between the PG variables of the two chains; it starts at zero as an artefact of the initial values of the PG variables being set to zero. Figure 10d shows the Euclidean distance between the regression coefficients. For this particular run, both the PG variables and the regression coefficients diverge at first, before converging as an increasing number of PG variables meet.

Finally, we consider the choice of kk for the estimator Hk(X,Y)H_{k}(X,Y), and of mm for the estimator H¯k:m(X,Y)\bar{H}_{k:m}(X,Y). We consider the task of estimating the posterior mean of a particular regression coefficient, corresponding to the installment payment as a percentage of disposable income. The efficiency of Hk(X,Y)H_{k}(X,Y), defined as the inverse of the product of the variance and the expected cost, is shown in Figure 10e as a function of kk. We see that choosing kk as a large quantile of the distribution of τ\tau, shown in Figure 10a, results in an efficiency close to its maximum. We choose k=110k=110, and, in line with the heuristics suggested in the main document, we take mm to be 10k=1,10010k=1,100, that is, a large multiple of kk. For the estimation of the posterior mean, we find that these values of kk and mm yield an inefficiency about 77 times that of the underlying Gibbs sampler, based on a long run; larger values of kk and mm would further reduce this inefficiency ratio.

With these tuning parameters, we produce a histogram of the posterior distribution of that coefficient in Figure 10f. We find agreement with a density estimated from a long MCMC run, depicted by the overlaid red curve. To summarize, in this example the proposed methodology effectively allows PGG chains to be run in parallel, in chunks of approximately 1,0001,000 iterations, while bypassing the usual difficulties related to the choice of burn-in and the construction of confidence intervals.

In passing, in the particular case of the coupled PGG algorithm, we can directly show that, for large enough tt, the chains meet at each step with probability at least ε\varepsilon, for some ε>0\varepsilon>0 that does not depend on the starting points of the chains. Denote the β\beta-components of the two chains by (βt)t0(\beta_{t})_{t\geq 0} and (β~t)t0(\tilde{\beta}_{t})_{t\geq 0}. From Choi and Hobert [2013], both chains are uniformly ergodic. Consider Fréchet’s inequality (AB)(A)+(B)1\mathbb{P}\left(A\cap B\right)\geq\mathbb{P}(A)+\mathbb{P}(B)-1 with the events A={βtS}A=\{\beta_{t}\in S\} and B={β~t1S}B=\{\tilde{\beta}_{t-1}\in S\}, for some compact set SS of p\mathbb{R}^{p} and some t0t\geq 0. We can define SS such that {βtS}\{\beta_{t}\in S\} has probability at least 0.5+δ0.5+\delta, for some small δ>0\delta>0, provided tt is large enough, since

|π(S|y)Kt(S|β)|dTV(π(|y),Kt(|β))<Mρt,|\pi(S|y)-K^{t}(S|\beta)|\leq d_{\text{TV}}(\pi\left(\cdot|y\right),K^{t}(\cdot|\beta))<M\rho^{t},

where β\beta is any starting point, Kt(|β)K^{t}(\cdot|\beta) is the distribution of the tt-th iterate of the chain starting from β\beta, M>0M>0 and ρ<1\rho<1 are constants independent of β\beta, and dTVd_{\text{TV}} stands for the total variation distance. We take SS to be a compact set of p\mathbb{R}^{p} such that π(S|y)>0.5+2δ\pi(S|y)>0.5+2\delta. Then there is some t0t_{0} such that tt0t\geq t_{0} implies Kt(S|β)>0.5+δK^{t}(S|\beta)>0.5+\delta; the same reasoning applies to the second chain. By Fréchet’s inequality, (βtS,β~t1S)>2δ\mathbb{P}(\beta_{t}\in S,\tilde{\beta}_{t-1}\in S)>2\delta. On these events, the differences |xiTβt|2|xiTβ~t1|2|x_{i}^{T}\beta_{t}|^{2}-|x_{i}^{T}\tilde{\beta}_{t-1}|^{2} are bounded above and below for all 1in1\leq i\leq n, which yields a strictly positive probability of coupling each pair of Pólya-Gamma variables. This, in turn, leads to a probability of at least ε>0\varepsilon>0 that βt+1\beta_{t+1} meets β~t\tilde{\beta}_{t}. Therefore, Assumption 2.2 on the meeting time in the main document is satisfied.
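As a quick numerical sanity check of Fréchet's inequality used in this argument, with hypothetical events of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.uniform(size=100_000)
A = u < 0.60            # P(A) = 0.60, i.e. 0.5 + delta with delta = 0.10
B = u > 0.35            # P(B) = 0.65; A and B are dependent through u
p_joint = np.mean(A & B)          # here exactly P(0.35 < u < 0.60) = 0.25
bound = 0.60 + 0.65 - 1.0         # Frechet lower bound P(A) + P(B) - 1 = 0.25
assert p_joint >= bound - 0.01    # the bound holds whatever the dependence
```

For these particular events the bound is tight, which illustrates that the inequality is the best that can be guaranteed without information on the joint behaviour of the two chains.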

Appendix E Bayesian Lasso

We consider the setting of Bayesian inference in regression models. The Bayesian Lasso [Park and Casella, 2008] assigns a hierarchical prior on the parameters of a linear regression in such a way that the posterior mode corresponds to the Lasso estimator [Tibshirani, 1996, Efron et al., 2004]. The posterior distribution can be approximated by Gibbs sampling, independently for a range of regularization parameters λ>0\lambda>0, which can then be selected by cross-validation or as described in Park and Casella [2008]; see also an alternative computational approach in Bornn et al. [2010]. On top of demonstrating the applicability of the proposed methodology in this setting, we illustrate the use of confidence intervals to guide the allocation of computational resources.

The model and associated Gibbs sampler are as follows. Consider an n×pn\times p matrix of standardized covariates XX, and an nn-vector of outcomes YY. The centered outcomes are denoted by Y~\tilde{Y}, i.e. Y~=YY¯1n\tilde{Y}=Y-\bar{Y}1_{n}, where Y¯=n1i=1nYi\bar{Y}=n^{-1}\sum_{i=1}^{n}Y_{i} and 1n1_{n} is an nn-vector of 11’s. Introduce a regularization parameter λ0\lambda\geq 0. The outcome YY is assumed to follow 𝒩(μ1n+Xβ,σ2In)\mathcal{N}(\mu 1_{n}+X\beta,\sigma^{2}I_{n}), where μ\mu is the intercept, β\beta is the pp-vector of regression coefficients, and InI_{n} denotes the n×nn\times n identity matrix. The hierarchical prior specifies β𝒩(0p,σ2Dτ)\beta\sim\mathcal{N}(0_{p},\sigma^{2}D_{\tau}), where 0p0_{p} is a pp-vector of 0’s, and Dτ=diag(τ12,,τp2)D_{\tau}=\text{diag}(\tau_{1}^{2},\ldots,\tau_{p}^{2}). The prior on τj2\tau_{j}^{2} is Exponential(λ2/2)\text{Exponential}(\lambda^{2}/2) for all j{1,,p}j\in\{1,\ldots,p\}, and finally σ2Inverse Gamma(a,γ)\sigma^{2}\sim\text{Inverse Gamma}(a,\gamma). A Gibbs sampler proposed in Park and Casella [2008] performs the following updates, after having integrated μ\mu out:

β|rest\displaystyle\beta|\text{rest} 𝒩(Aτ1XTY~,σ2Aτ1),whereAτ=XTX+Dτ1,\displaystyle\sim\mathcal{N}(A_{\tau}^{-1}X^{T}\tilde{Y},\sigma^{2}A_{\tau}^{-1}),\quad\text{where}\quad A_{\tau}=X^{T}X+D_{\tau}^{-1},
σ2|rest\displaystyle\sigma^{2}|\text{rest} Inverse Gamma(a+n12+p2,γ+(Y~Xβ)T(Y~Xβ)/2+βTDτ1β/2),\displaystyle\sim\text{Inverse Gamma}(a+\frac{n-1}{2}+\frac{p}{2},\gamma+\left(\tilde{Y}-X\beta\right)^{T}\left(\tilde{Y}-X\beta\right)/2+\beta^{T}D_{\tau}^{-1}\beta/2),
τj2|rest\displaystyle\tau_{j}^{-2}|\text{rest} Inverse Gaussian((λ2σ2/βj2)1/2,λ2)for all j{1,,p}.\displaystyle\sim\text{Inverse Gaussian}\left(\left(\nicefrac{{\lambda^{2}\sigma^{2}}}{{\beta_{j}^{2}}}\right)^{\nicefrac{{1}}{{2}}},\lambda^{2}\right)\quad\text{for all }j\in\left\{1,\ldots,p\right\}.

Here the Inverse Gaussian (μ,λ)(\mu,\lambda) distribution has pdf p(x;μ,λ)=(λ/2πx3)1/2exp(λ(xμ)2/(2μ2x))p(x;\mu,\lambda)=(\lambda/2\pi x^{3})^{\nicefrac{{1}}{{2}}}\exp(-\lambda(x-\mu)^{2}/(2\mu^{2}x)). We initialize the sampler by setting βj=0,τj2=1\beta_{j}=0,\tau_{j}^{2}=1 for all j{1,,p}j\in\{1,\ldots,p\} and σ2=1\sigma^{2}=1. We consider the diabetes dataset used in Efron et al. [2004], Park and Casella [2008], with n=442n=442 individuals and p=64p=64 covariates, as provided in the lars package accompanying Efron et al. [2004]. The parameter space is of dimension 2p+1=1292p+1=129. We set a=0a=0 and γ=0\gamma=0 throughout. To couple the Gibbs sampler we use a maximal coupling of each conditional update, and focus on producing unbiased estimators of the posterior means 𝔼λ[β|Y,X]\mathbb{E}_{\lambda}[\beta|Y,X] given λ\lambda.
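One sweep of these updates can be sketched as follows; this is a NumPy sketch under our own naming, with the Inverse Gaussian drawn via `rng.wald` and the inverse-gamma draw via the rate-over-gamma construction.

```python
import numpy as np

rng = np.random.default_rng(2)

def blasso_sweep(beta, sig2, tau2, X, y_tilde, lam, a=0.0, gamma=0.0):
    """One sweep of the Bayesian Lasso Gibbs updates above (sketch)."""
    n, p = X.shape
    D_inv = np.diag(1.0 / tau2)
    A_inv = np.linalg.inv(X.T @ X + D_inv)
    # beta | rest ~ N(A^{-1} X^T y_tilde, sig2 * A^{-1})
    beta = rng.multivariate_normal(A_inv @ (X.T @ y_tilde), sig2 * A_inv)
    # sigma^2 | rest ~ Inverse Gamma(a + (n-1)/2 + p/2, rate), via rate / Gamma
    resid = y_tilde - X @ beta
    rate = gamma + resid @ resid / 2 + beta @ D_inv @ beta / 2
    sig2 = rate / rng.gamma(a + (n - 1) / 2 + p / 2)
    # 1 / tau_j^2 | rest ~ Inverse Gaussian(sqrt(lam^2 sig2 / beta_j^2), lam^2)
    tau2 = 1.0 / rng.wald(np.sqrt(lam ** 2 * sig2 / beta ** 2), lam ** 2)
    return beta, sig2, tau2
```

Because β\beta is updated first, the zero initialization of the coefficients causes no division-by-zero in the Inverse Gaussian update.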

For a range of values of λ\lambda between 10210^{-2} and 10310^{3}, we run 100100 coupled chains until they meet, and plot the empirical average of the meeting times as a function of λ\lambda in Figure 11a. First, we note that the average meeting times are very small for small values of λ\lambda. Next we observe a peak located around λ=102\lambda=10^{2}, which implies a similar peak for the computational cost of the proposed estimators. This could be either a consequence of the mixing properties of the underlying MCMC, or a defect of the coupling strategy. To investigate this, we run the underlying Gibbs sampler for the same values of λ\lambda, for 50,00050,000 iterations, discard the first 5,0005,000 iterations, and compute the effective sample size using the CODA package [Plummer et al., 2006]. We do so for each of the 6464 components of β\beta, and plot the results in Figure 11b; each dot corresponds to the ESS of one of the components, for a particular value of λ\lambda. The effective sample size is divided by the number of iterations post burn-in, and thus we expect a number between 0 and 11. We see a drastic loss of efficiency around λ=102\lambda=10^{2}, indicating that the Gibbs sampler of Park and Casella [2008] mixes poorly for these values of λ\lambda.

[Figure 11 panels: (a) average meeting time vs λ\lambda; (b) effective sample sizes vs λ\lambda.]
Figure 11: Bayesian Lasso for the diabetes data as described in Section E. Figure 11a shows the average of 100100 independent meeting times against the regularization parameter λ\lambda. Figure 11b shows the effective sample size of each of the 6464 components of β\beta, from MCMC runs of length 50,00050,000 and the CODA package, as a function of λ\lambda.

For each λ\lambda, we now choose kk as the 99%99\% quantile of the 100100 meeting times previously obtained, and we choose m=10km=10k. We produce R=100R=100 unbiased estimators for each posterior mean 𝔼λ[β|Y,X]\mathbb{E}_{\lambda}[\beta|Y,X], and plot them against λ\lambda in Figure 12a, component-wise. On top of each dot lies a 95%95\% confidence interval represented by a vertical segment: these are too small to be noticeable except for the smallest values of λ\lambda.

To refine the posterior mean estimates, we can allocate computational resources based on the confidence intervals and the costs associated with each λ\lambda (there are 2525 values of λ\lambda in total). To tighten the most visible confidence intervals, we produce 1,0001,000 more estimators for the 1010 smallest values of λ\lambda, and obtain the refined estimates shown in Figure 12b. That is, these estimates are based on 1,1001,100 unbiased estimators for the 1010 smallest values of λ\lambda and 100100 for the other values. This refinement procedure could be automated, adaptively producing more estimators for values of λ\lambda with wider confidence intervals.
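Such an adaptive scheme could look like the following sketch; the function names and the greedy allocation rule are ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def adaptive_refine(estimates, draw_estimator, budget, batch=50):
    """Greedy refinement: spend `budget` new unbiased estimators on the
    lambda whose current 95% confidence interval is widest.
    `estimates` maps lambda -> list of estimates so far;
    `draw_estimator(lam)` returns one fresh unbiased estimate."""
    while budget > 0:
        # CI half-width ~ 1.96 * sd / sqrt(R) for each lambda
        widths = {lam: 1.96 * np.std(v, ddof=1) / np.sqrt(len(v))
                  for lam, v in estimates.items()}
        lam = max(widths, key=widths.get)
        n_new = min(batch, budget)
        estimates[lam].extend(draw_estimator(lam) for _ in range(n_new))
        budget -= n_new
    return estimates
```

In practice the allocation rule could also weight the interval widths by the expected cost of a coupled run at each λ\lambda, since meeting times vary with λ\lambda as shown in Figure 11a.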

[Figure 12 panels: (a) posterior mean of β\beta vs λ\lambda, with 100100 estimators per λ\lambda; (b) posterior mean of β\beta vs λ\lambda, with 1,0001,000 more estimators for the 1010 smallest values of λ\lambda.]
Figure 12: Bayesian Lasso for the diabetes data with n=442n=442 and p=64p=64, described in Section E. Figure 12a shows the paths of posterior means 𝔼λ[β|Y,X]\mathbb{E}_{\lambda}[\beta|Y,X] against the regularization parameter λ\lambda, obtained with R=100R=100 estimators, kk chosen as one plus the 90%90\% quantile of the distribution of meeting times, and m=10km=10k. Figure 12b shows the estimates obtained after having run 1,0001,000 more estimators for the first 1010 values of λ\lambda.