
Large sample scaling analysis of the Zig-Zag algorithm for Bayesian inference

Sanket Agrawal
Department of Statistics
University of Warwick
sanket.agrawal@warwick.ac.in
   Joris Bierkens
Delft Institute of Applied Mathematics
TU Delft
   Gareth O. Roberts
Department of Statistics
University of Warwick
(August 3, 2025)
Abstract

Piecewise deterministic Markov processes provide scalable methods for sampling from posterior distributions in big data settings by admitting principled sub-sampling strategies that do not bias the output. An important example is the Zig-Zag process of [Ann. Statist. 47 (2019) 1288-1320], where clever sub-sampling has been shown to produce an essentially independent sample at a cost that does not scale with the size of the data. However, sub-sampling also leads to slower convergence and poorer mixing of the process, behaviour which calls into question the promised scalability of the algorithm. We provide a large sample scaling analysis of the Zig-Zag process and its sub-sampling versions in settings of parametric Bayesian inference. In the transient phase of the algorithm, we show that the Zig-Zag trajectories are well approximated by the solution to a system of ODEs. These ODEs possess a drift in the direction of decreasing KL-divergence between the assumed model and the true distribution and are explicitly characterized in the paper. In the stationary phase, we give weak convergence results for the different versions of the Zig-Zag process. Based on our results, we estimate that for large data sets of size n, using suitable control variates with sub-sampling in Zig-Zag, the algorithm costs O(1) to obtain an essentially independent sample; a computational speed-up of O(n) over the canonical version of Zig-Zag and other traditional MCMC methods.

Keywords and phrases: MCMC, piecewise deterministic Markov processes, fluid limits, sub-sampling, multi-scale dynamics, averaging.

1 Introduction

Piecewise deterministic Markov processes [PDMPs, 21] are a class of non-reversible Markov processes which have recently emerged as alternatives to traditional MCMC methods based on the Metropolis-Hastings algorithm. Practically implementable PDMP algorithms such as the Zig-Zag sampler [11], the Bouncy Particle Sampler [16] and others [18, 31, 33, 12, 7, 8, 29] have provided highly encouraging results for Bayesian inference as well as in physics [26, for example]. As a result, there has been a surge of interest in the Monte Carlo community to better understand the theoretical properties of PDMPs [see e.g. 24, 22, 1, 9, 15, 14, 13]. This body of work demonstrates powerful convergence properties with impressive scalability as a function of the dimension.

However, the most exciting property of PDMPs, at least for statistical applications, is that of principled sub-sampling [11], whereby randomly chosen subsets of data can be used to estimate likelihood values within the algorithm without biasing its invariant distribution. The authors only cover sub-sampling for the Zig-Zag algorithm, although the unbiasedness property holds for a rather general class of PDMP MCMC algorithms. This unbiasing property is in sharp contrast to most MCMC methods [32, 28, 5], although see [19]. In fact, [11] gives experiments showing that even for data sets of large size n, subsets of size 1 can be used to estimate the likelihood, producing a computational speed-up of O(n). This leads to the property of superefficiency whereby, after initialization, the entire running cost of the algorithm is dominated for large n by the cost of evaluating the likelihood once. In [11], the authors provide two algorithms which utilise this sub-sampling idea: the straightforward ZZ-SS algorithm together with a control-variate variant ZZ-CV, both of which will be formally introduced in Section 2.

In [11], the authors also point out that although principled sub-sampling gives these dramatic improvements in implementation time, the dynamics of ZZ-SS and ZZ-CV differ from those of canonical Zig-Zag: sample paths exhibit more diffusive behaviour due to the extra velocity switches induced by the variation in the estimate of the likelihood. This might lead to slower convergence and inferior mixing of the underlying stochastic process (see Figure 2). For one-dimensional Zig-Zag, [9] studies the diffusive behaviour resulting from excess switches outside of a statistical context. In [2], the authors show that canonical Zig-Zag has superior mixing (in the sense of minimising asymptotic variance for estimating a typical L_{2} function) compared to any alternative with excess switches. Since both ZZ-SS and ZZ-CV can be viewed as Zig-Zag algorithms with excess switches, this raises the question of whether the inferior mixing of sub-sampling Zig-Zag algorithms relative to their canonical parent might negate their computational advantage.

To date, no work has studied the extent of the mixing deterioration caused by these excess switches induced by sub-sampling. The goal of this work is to give the first rigorous theoretical results to understand this problem. We shall study the behaviour of the various Zig-Zag algorithms for large data sets, interplaying closely with large-sample Bayesian asymptotics, in both the correctly specified and misspecified cases.

(a) Canonical Zig-Zag (ZZ)
(b) Zig-Zag with vanilla sub-sampling (ZZ-SS)
(c) Zig-Zag with control variates (ZZ-CV)
Figure 2: Trajectories of different versions of the Zig-Zag algorithm targeting the posterior in a 2-dimensional Bayesian logistic regression example with true parameter (1,2). The left panels display the trajectories in the transient phase, with red points marking the location of the process at different times. The right panels illustrate the process in the stationary phase.

Our contributions in this paper are as follows.

  1. For the Zig-Zag algorithm, we study the effect of sub-sampling on the dynamics of the underlying stochastic process as the data set size n increases to infinity. In Theorem 3.3, we give weak convergence results for the trajectories of the process to smooth solutions of a system of ODEs. These ODEs can be explicitly characterized, as presented in equation (3.4).

  2. We also study the effect of sub-sampling on the mixing of the algorithm in the stationary phase. We find that as n becomes infinitely large, the trajectories of ZZ-SS converge to a reversible Ornstein-Uhlenbeck process, as given in Theorem 4.2, mixing in O(1) time, whereas both canonical Zig-Zag and ZZ-CV converge to a PDMP that mixes in time O(n^{-1/2}), as presented in Theorems 4.6 and 4.11.

  3. Our theoretical study leads to further guidance for implementing sub-sampling with Zig-Zag. In particular, our results highlight that for heavy-tailed models vanilla sub-sampling outperforms ZZ-CV in the tails. In Section 5, we propose a mixed sub-sampling scheme which, in addition to being exact, outperforms both ZZ-SS and ZZ-CV.

  4. Finally, we outline how these methods can be generalised to more general classes of PDMPs.

In parametric Bayesian inference, given sufficient regularity in the model and the true data-generating distribution, it is well known that the posterior distribution converges to a degenerate distribution as the data set size n\to\infty, irrespective of whether the true data-generating distribution belongs to the assumed model or not [6, 25]. The point of concentration is the parameter value (if it exists and is unique) minimizing the Kullback-Leibler (KL) divergence between the true data-generating distribution and the assumed model. In well-specified settings, this is the same as the true parameter. Our results in the transient phase (Section 3) show that for large n, the Zig-Zag trajectories behave like deterministic smooth flows with a drift pushing them in the direction of decreasing KL-divergence. While the canonical Zig-Zag attains optimal speed (\pm 1) in the limit, the sub-sampling versions remain sub-optimal, with a damping factor that depends on the model. In particular, we find that vanilla sub-sampling starts with the optimal speed at infinity but suffers an extreme slowdown as the process gets closer to the minimizer. On the other hand, for sub-sampling with control variates, the drift remains positive at stationarity, achieving optimal speed for model densities that are log-concave in the parameter.

Let x_{0} be the unique parameter value minimizing the KL-divergence between the true data-generating distribution and the assumed model. In stationarity, the high concentration of the posterior mass around x_{0} restricts the movement of the Zig-Zag process. Indeed, the canonical Zig-Zag remains close to x_{0} once it hits x_{0} (Theorem 4.1). In general, the smooth approximations from the transient phase are no longer informative, since they only capture the large-scale movements of the Zig-Zag trajectories. Let \hat{x}_{n} be a sequence of suitable estimators of x_{0} such that the Bernstein–von Mises theorem holds: in the rescaled coordinate \xi(x)=n^{1/2}(x-\hat{x}_{n}), the posterior converges to a Gaussian distribution with zero mean and variance-covariance matrix given by the inverse of the Fisher information [25]. Under regularity conditions, \hat{x}_{n} can be taken to be the maximum likelihood estimators (MLEs) for the (possibly misspecified) model. In the \xi-coordinate, therefore, the trajectories of the Zig-Zag process have a non-trivial limit, although the canonical Zig-Zag and control variate Zig-Zag mix at a rate O(n^{-1/2}), in contrast to O(1) for vanilla sub-sampling (Section 4).

In stationarity, the computational complexity of a sampling algorithm can be decomposed as:

computational complexity := (time to obtain an essentially independent sample) × (computational cost per unit time).

For Zig-Zag algorithms this further breaks down to

computational complexity := (time to obtain an essentially independent sample) × (no. of proposed switches per unit time) × (computational cost per proposed switch).

The canonical Zig-Zag incurs an O(n) cost per proposed switch because the full data set needs to be processed to compute the switching rate at each proposed switching time. In contrast, the sub-sampling versions only cost O(1) per proposed switch, as only a small subset of the whole data set is used to estimate the switching rate at each proposed switching time. The time taken to obtain an essentially independent sample is O(n^{-1/2}) for canonical Zig-Zag and Zig-Zag with control variates, whereas it is O(1) for vanilla sub-sampling. The number of proposed switches depends on the computational bounds on the switching rates used in the implementation of the algorithm. In the rescaled coordinate, the effective switching rates are of magnitude O(n^{1/2}) for canonical Zig-Zag and Zig-Zag with control variates and O(n) for vanilla sub-sampling. Assuming the computational bounds are tight and an efficient implementation of the algorithm is possible, Table 1 summarizes the computational complexity of the different versions of the Zig-Zag algorithm in the stationary phase.

Algorithm | time to an essentially ind. sample | proposed switches per unit time | cost per proposed switch | computational complexity
ZZ    | O(n^{-1/2}) | O(n^{1/2}) | O(n) | O(n)
ZZ-SS | O(1)        | O(n)       | O(1) | O(n)
ZZ-CV | O(n^{-1/2}) | O(n^{1/2}) | O(1) | O(1)
Table 1: Computational complexity of different versions of the Zig-Zag algorithm in stationarity as a function of the data set size n.

The computational complexity of the canonical Zig-Zag and Zig-Zag with vanilla sub-sampling per independent sample is O(n), provided tight computational bounds are available. In contrast, Zig-Zag with control variates achieves a computational speed-up of O(n) when the sequence of reference points is chosen sufficiently close to the sequence of estimators \hat{x}_{n}.

The paper is organized as follows. Section 2 details the Zig-Zag algorithm applied to the Bayesian setting with i.i.d. observations, along with a discussion of the sub-sampling method. Section 3 contains our main results on the transient phase of the algorithm. We study the stationary phase separately in Section 4. In Section 5, we discuss the results of the transient phase in more detail, along with some illustrations on simple models. We conclude with a final discussion in Section 6. All proofs can be found in the Appendix.

2 Zig-Zag process for parametric Bayesian inference with i.i.d. observations

In this section, we review how the Zig-Zag process can be designed to sample from a Bayesian posterior distribution in the setting where the data are assumed to be generated independently from a fixed probability model. For details, we refer to [11].

2.1 Designing Zig-Zag process for Bayesian inference

The d-dimensional Zig-Zag process (Z_{t})_{t\geq 0}=(X_{t},V_{t})_{t\geq 0} is a continuous-time Markov process that evolves on the augmented space E=\mathbb{R}^{d}\times\mathcal{V}, where \mathcal{V}=\{-1,+1\}^{d}. We refer to X_{t} and V_{t} as the position and velocity processes respectively. The Zig-Zag process is completely characterized by a collection of continuous switching rates (\lambda_{i})_{i=1}^{d}, where \lambda_{i}:E\to[0,\infty) for all i. Given (\lambda_{i})_{i=1}^{d}, (Z_{t})_{t\geq 0} has right-continuous paths in E such that,

X_{t}=X_{0}+\int_{0}^{t}V_{s}\,ds,\quad t\geq 0,

and the i-th coordinate of (V_{t})_{t\geq 0} jumps between +1 and -1 according to an inhomogeneous Poisson process of intensity \lambda_{i}(Z_{t}), independently of the other coordinates.
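To make these dynamics concrete, the following sketch (ours, not from [11]) simulates a one-dimensional canonical Zig-Zag for a standard Gaussian target, where U(x)=x^{2}/2 and \lambda(x,v)=(vx)_{+}. For this rate the switching times can be drawn exactly by inverting the integrated intensity, so no thinning is needed; all function names are ours.

```python
import math, random

def next_switch_time(x, v, rng):
    # Integrated rate along the ray x + v t (with v^2 = 1):
    # Lambda(t) = \int_0^t (v x + s)_+ ds, inverted against an Exp(1) draw.
    e = -math.log(1.0 - rng.random())   # Exp(1) draw
    a = v * x
    if a >= 0:
        # Solve tau^2/2 + a*tau = e
        return -a + math.sqrt(a * a + 2.0 * e)
    # Zero intensity until t0 = -a, then quadratic growth: (tau + a)^2/2 = e.
    return -a + math.sqrt(2.0 * e)

def zigzag_path(t_end, x0=0.0, v0=1.0, seed=1):
    """Simulate the canonical 1-d Zig-Zag for N(0,1); return skeleton (t, x, v)."""
    rng = random.Random(seed)
    t, x, v = 0.0, x0, v0
    skeleton = [(t, x, v)]
    while t < t_end:
        tau = min(next_switch_time(x, v, rng), t_end - t)
        t, x = t + tau, x + v * tau   # deterministic linear motion
        v = -v                        # velocity flip at the event
        skeleton.append((t, x, v))
    return skeleton
```

Time averages along the piecewise-linear skeleton estimate posterior expectations; integrating x and x^2 along a long path approximately recovers mean 0 and variance 1.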

Given n independent and identically distributed data points \boldsymbol{y}^{(n)}=(y_{1},\dots,y_{n}), consider the classical Bayesian posterior density, \pi^{(n)}, given by

\pi^{(n)}(x)\propto\exp\left(\log\pi_{0}(x)+\sum_{j=1}^{n}\log f(y_{j};x)\right);\quad x\in\mathbb{R}^{d},

for some probability density function f and prior density \pi_{0}. Define the potential function U_{n} by,

U_{n}(x)=-\log\pi^{(n)}(x)=-\log\pi_{0}(x)-\sum_{j=1}^{n}\log f(y_{j};x)+C(\boldsymbol{y}^{(n)}). (2.1)

For i=1,\dots,d, define the flip operator F_{i}:\mathcal{V}\to\mathcal{V}, which flips the i-th component of v: for v\in\mathcal{V}, (F_{i}(v))_{i}=-v_{i} and (F_{i}(v))_{j}=v_{j} for j\neq i. Suppose the switching rates are linked to U_{n} via

\lambda^{n}_{i}(x,v)-\lambda^{n}_{i}(x,F_{i}(v))=v_{i}\nabla_{i}U_{n}(x),\quad(x,v)\in E,\ i=1,\dots,d. (2.2)

Then, the Zig-Zag process with switching rates (\lambda^{n}_{i})_{i=1}^{d} satisfying (2.2) has an invariant distribution whose marginal distribution on \mathbb{R}^{d} is \pi^{(n)} [11]. A necessary and sufficient condition for (2.2) to hold is that there exists a continuous \gamma^{n}:E\to[0,\infty)^{d} such that \gamma^{n}_{i}(x,v)=\gamma^{n}_{i}(x,F_{i}(v)) for all (x,v)\in E, and

\lambda^{n}_{i}(x,v)=(v_{i}\nabla_{i}U_{n}(x))_{+}+\gamma^{n}_{i}(x,v),\quad(x,v)\in E,\ i=1,\dots,d. (2.3)

The switching rates for which \gamma^{n}\equiv 0 in (2.3) are called the canonical switching rates, and the corresponding Zig-Zag process is the canonical Zig-Zag process. We will denote the canonical rates by \lambda^{n}_{\text{can},i}. The function \gamma^{n} itself is called the excess switching rate.

2.2 Sub-sampling with Zig-Zag

The implementation of Zig-Zag is usually carried out by Poisson thinning. This involves proposing to switch the velocity according to a Poisson process with a bounding jump intensity and then accepting or rejecting the proposed switch with a certain probability. Each accept-reject step requires evaluation of one of the \lambda^{n}_{i}. Observe that \nabla U_{n} admits the following additive decomposition,

\nabla U_{n}(x)=\sum_{j=1}^{n}s^{j}(x)=\sum_{j=1}^{n}\left(-n^{-1}\nabla\log\pi_{0}(x)-\nabla\log f(y_{j};x)\right),\quad x\in\mathbb{R}^{d}. (2.4)

If it costs O(1) to compute each gradient term in the summation, the canonical rates, \lambda^{n}_{i}(x,v)=(v_{i}\nabla_{i}U_{n}(x))_{+}, incur an O(n) computational cost per evaluation.

Fix x and let \zeta be random (given the data) such that it unbiasedly estimates the sum in (2.4), i.e. \mathbb{E}_{\boldsymbol{y}^{(n)}}(\zeta(x))=\nabla U_{n}(x). Provided that the cost of generating one realization of \zeta is O(1), the sub-sampling idea is to replace, at the accept-reject step, the sum (2.4) by a random realization of \zeta. The effective switching rate of this procedure is then given by \lambda^{n}_{i}(x,v)=\mathbb{E}_{\boldsymbol{y}^{(n)}}[(v_{i}\zeta_{i}(x))_{+}] [11]. Observe that the new switching rate preserves the invariance of the target distribution, i.e.

\lambda^{n}_{i}(x,v)-\lambda^{n}_{i}(x,F_{i}(v)) =\mathbb{E}_{\boldsymbol{y}^{(n)}}[(v_{i}\zeta_{i}(x))_{+}]-\mathbb{E}_{\boldsymbol{y}^{(n)}}[(v_{i}\zeta_{i}(x))_{-}]
=\mathbb{E}_{\boldsymbol{y}^{(n)}}[v_{i}\zeta_{i}(x)]
=v_{i}\nabla_{i}U_{n}(x),

which is exactly condition (2.2). Jensen’s inequality implies that

\mathbb{E}_{\boldsymbol{y}^{(n)}}[(v_{i}\zeta_{i}(x))_{+}]\geq(v_{i}\mathbb{E}_{\boldsymbol{y}^{(n)}}\zeta_{i}(x))_{+}.

Hence, any sub-sampling procedure results in a possibly larger effective switching rate. However, the computational cost per proposed switch for this procedure is bounded above by the cost of generating one random realization of ζ\zeta.

We will consider unbiased estimators for (2.4) based on random batches of size m(n)\leq n from the whole data set. In our asymptotic study in later sections, m may be chosen to depend explicitly on n, e.g. m(n)=\log n, or be held fixed, i.e. m(n)=m. For now, we will suppress the dependence on n. Let (E^{j})_{j=1}^{n} be functions from \mathbb{R}^{d} to \mathbb{R}^{d} such that,

\nabla_{i}U_{n}(x)=\sum_{j=1}^{n}E^{j}_{i}(x),\quad x\in\mathbb{R}^{d},\ i=1,\dots,d. (2.5)

For m\leq n, let S be a simple random sample (without replacement) of size m from \{1,\dots,n\}. Then, \zeta(x)=nm^{-1}\sum_{j\in S}E^{j}(x) is an unbiased estimator of \nabla U_{n}(x). The effective switching rates are given by,

\lambda^{n}_{i}(x,v)=\frac{n}{|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot m^{-1}\sum_{j\in S}E_{i}^{j}(x)\right)_{+},\quad(x,v)\in E,\ i=1,\dots,d, (2.6)

where \mathcal{S}_{(n,m)} denotes the collection of all possible simple random samples (without replacement) of size m from \{1,\dots,n\}. Given (2.5), it is easy to see that (\lambda^{n}_{i})_{i=1}^{d} as defined in (2.6) satisfy (2.2) for all n and m\leq n. Hence, the resulting process is still exact. In this paper, we analyze the following two choices of E^{j} suggested in [11]. Our results, however, can easily be extended to other choices.
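For a toy model, the exactness claim can be checked numerically by enumerating all subsets in (2.6). The sketch below (our construction) uses the one-dimensional Gaussian location model f(y;x)=N(y;x,1) with a flat prior, so E^{j}(x)=x-y_{j}; it verifies that the sub-sampled rates satisfy the identity (2.2) and, by Jensen’s inequality, dominate the canonical rate.

```python
from itertools import combinations

def subsampled_rate(x, v, y, m):
    """lambda^n(x, v) from (2.6) for f(y;x) = N(y; x, 1), flat prior,
    so E^j(x) = x - y_j; enumerates all size-m subsets (toy sizes only)."""
    n = len(y)
    subsets = list(combinations(range(n), m))
    total = 0.0
    for S in subsets:
        zeta = (n / m) * sum(x - y[j] for j in S)  # unbiased estimate of U_n'(x)
        total += max(v * zeta, 0.0)                # (v * zeta)_+
    return total / len(subsets)                    # E[(v * zeta)_+]

y_data = [0.3, -1.2, 2.0, 0.7, -0.4]   # hypothetical tiny data set
x_eval = 0.9
grad = sum(x_eval - yj for yj in y_data)           # U_n'(x) at x_eval
```

Running the check for m = 1, 2, n confirms that \lambda^{n}(x,+1)-\lambda^{n}(x,-1)=U_{n}'(x) holds exactly for every subsample size, while the sub-sampled rates weakly dominate the canonical rate (U_{n}'(x) v)_{+}, with equality at m=n.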

Zig-Zag with sub-sampling (ZZ-SS): The vanilla sub-sampling strategy where we set

E^{j}(x)=s^{j}(x)=-n^{-1}\nabla\log\pi_{0}(x)-\nabla\log f(y_{j};x);\quad j=1,\dots,n. (2.7)

This choice of E^{j} is suitable when the gradient terms s^{j} in (2.4) are globally bounded, i.e. there exist positive constants c_{i} such that |s^{j}_{i}(x)|\leq c_{i} for all i,j and x. We denote the switching rates by \lambda^{n}_{\text{ss},i} and call this process Zig-Zag with sub-sampling (ZZ-SS).

Zig-Zag with Control Variates (ZZ-CV): The second choice is suitable when the gradient terms s^{j} are globally Lipschitz. The Lipschitz condition allows the use of control variates to reduce the variance of the vanilla sub-sampling estimator of \nabla U_{n}.

Let x^{*}_{n} be a reference point in the x-space. The control variate idea is to set

E^{j}(x)=s^{j}(x)-s^{j}(x^{*}_{n})+n^{-1}\sum_{k=1}^{n}s^{k}(x^{*}_{n});\quad j=1,\dots,n. (2.8)

For a suitably chosen reference point, writing E^{j} as above reduces its variability as j varies [11]. Note that the reference point is allowed to depend on the data (y_{1},\dots,y_{n}). In practice, a common choice is the MLE or the posterior mode; however, this might not always be the best choice, as we will show in later sections. Finally, we suppose throughout this paper that x^{*}_{n} is chosen and fixed before the Zig-Zag algorithm is initialized, although our results in later sections advocate continually updating the reference point. We denote the switching rates by \lambda^{n}_{\text{cv},i} and call the corresponding process Zig-Zag with control variates (ZZ-CV).

Finally, note that when m=n, the rates in (2.6) reduce to the canonical rates irrespective of the choice of E^{j}. We will drop the subscripts “ss”, “can”, and “cv” from \lambda^{n}_{i} when the choice is understood from the context or when we are talking about switching rates in general.
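The variance-reduction mechanism behind (2.8) can be illustrated on a toy one-dimensional logistic regression (our construction, with hypothetical covariates z_j and responses y_j): near the reference point the ZZ-CV terms E^{j} are nearly constant in j, while the vanilla ZZ-SS terms retain O(1) variability; far from the reference point the advantage shrinks.

```python
import math, statistics

# Toy 1-d logistic regression: s^j(x) = -z_j (y_j - sigmoid(x z_j)), flat prior.
z = [0.5, -1.2, 2.0, 0.8, -0.7, 1.5]   # hypothetical covariates
y = [1, 0, 1, 1, 0, 1]                 # hypothetical binary responses
n = len(z)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def s_terms(x):
    # The per-observation gradient terms s^j(x) used by vanilla ZZ-SS.
    return [-zj * (yj - sigmoid(x * zj)) for zj, yj in zip(z, y)]

def cv_terms(x, x_star):
    # The control-variate terms E^j(x) from (2.8).
    mean_grad = sum(s_terms(x_star)) / n          # n^{-1} sum_k s^k(x*)
    return [sj - sj_star + mean_grad
            for sj, sj_star in zip(s_terms(x), s_terms(x_star))]

x_star = 1.0                                      # reference point (assumed)
var_ss_near = statistics.pvariance(s_terms(x_star + 0.01))
var_cv_near = statistics.pvariance(cv_terms(x_star + 0.01, x_star))
```

Near x^{*}, each E^{j}(x)\approx(x-x^{*})\,(s^{j})'(x^{*})+\text{const}, so the spread across j is O((x-x^{*})^{2}); this is precisely why a reference point close to where the process spends its time (e.g. the MLE in stationarity) makes ZZ-CV effective.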

2.3 Technical setup

Let (\Omega,\mathcal{F},\mathbb{P}) be an arbitrary complete probability space, rich enough to accommodate all the random elements defined in what follows. This probability space will serve as the primary source of randomness, and the data sequence will be defined on it via a data-generating process (see below). The Zig-Zag process will be viewed as a random element on (\Omega,\mathcal{F},\mathbb{P}) conditioned on the data. All our limit theorems will be of the type “weak convergence (in the Skorohod topology) in \mathbb{P}-probability” or “weak convergence (in the Skorohod topology) \mathbb{P}-almost surely”.

Denote the observation space by \mathcal{Y} and the space of probability measures on \mathcal{Y} by \mathcal{M}(\mathcal{Y}). Let Y:\Omega\to\mathcal{Y} be a data-generating process such that Y\sim P for some P\in\mathcal{M}(\mathcal{Y}), i.e. P is the push-forward of \mathbb{P} by Y. Fix a model \{F_{x},x\in\mathbb{R}^{d}\}\subset\mathcal{M}(\mathcal{Y}) for P. Suppose that P and \{F_{x},x\in\mathbb{R}^{d}\} are absolutely continuous with respect to a common base measure \nu on \mathcal{Y}. Let p:=dP/d\nu be the continuous density of P with respect to \nu and, for each x\in\mathbb{R}^{d}, let f(\cdot;x):=dF_{x}/d\nu be the density of F_{x}.

Let \boldsymbol{Y}^{(\mathbb{N})}\sim P^{\otimes\mathbb{N}} be a sequence of random elements defined on \Omega. Let n\in\mathbb{N} denote the data set size. For all n, let \Pi^{(n)} be the random measure on \mathbb{R}^{d} such that,

\Pi^{(n)}(dx)\propto\exp\left(\sum_{j=1}^{n}\log f(Y_{j};x)\right)dx;\quad x\in\mathbb{R}^{d}.

For any fixed n, given \boldsymbol{Y}^{(n)}=\boldsymbol{y}^{(n)}=(y_{1},\dots,y_{n}), \Pi^{(n)} is the classical Bayesian posterior for x under a uniform prior. Since we will be concerned with asymptotic behaviour for large n, we can work under the assumption of a uniform prior without loss of generality.

For each n, let T_{n}:\mathcal{Y}^{n}\to\mathbb{R}^{d} be a statistic. Define the sequence of reference points

(X^{*}_{n})_{n=1}^{\infty}:=(T_{n}\circ\boldsymbol{Y}^{(n)})_{n=1}^{\infty} (2.9)

to be a random element on \Omega taking values in (\mathbb{R}^{d})^{\mathbb{N}}. We will assume that the sequence (T_{n})_{n=1}^{\infty} is chosen and fixed as soon as a model is identified.

For each n, given \boldsymbol{y}^{(n)}, let (Z^{n}_{t})_{t\geq 0} be a Zig-Zag process targeting \Pi^{(n)} with switching rates defined as in (2.6). The choice of E^{j} is taken to be either as in (2.7) or (2.8), where the reference point is set to the observed value of X^{*}_{n}, i.e. x^{*}_{n}=T_{n}(\boldsymbol{y}^{(n)}).

2.4 Notation and assumptions

We will use \|\cdot\| to denote the Euclidean norm on \mathbb{R}^{d} as well as the induced matrix norm. For any a\in\mathbb{R}, (a)_{+} denotes the positive part of a, i.e. (a)_{+}=\max\{0,a\}. If h:\mathbb{R}^{d}\to\mathbb{R} is differentiable, we denote the gradient vector by \nabla h=(\nabla_{1}h,\dots,\nabla_{d}h). If h is twice differentiable, we use \nabla^{\otimes 2}h to denote the Hessian of h. For g:E\to\mathbb{R}, absolutely continuous in its first argument, we will use \nabla_{i}g to denote the partial derivative of g with respect to x_{i}. We denote the d-dimensional vector of all 1s by \boldsymbol{1}_{d} and the d\times d identity matrix by \boldsymbol{I}_{d}.

Define, when they exist, s(x;y):=-\nabla_{x}\log f(y;x) and s^{\prime}(x;y):=-\nabla^{\otimes 2}\log f(y;x). We will use S,S^{\prime},\dots to denote the corresponding random elements when Y\sim P. Given \boldsymbol{Y}^{(n)}=\boldsymbol{y}^{(n)}=(y_{1},\dots,y_{n}), we will use s^{j}(x) and s^{\prime j}(x) to denote s(x;y_{j}) and s^{\prime}(x;y_{j}) respectively. We make the following assumptions on the model \{F_{x},x\in\mathbb{R}^{d}\} and the observed data.

Assumption 2.1 (Smoothness).

For each y, f(y;\cdot)\in\mathcal{C}^{3}(\mathbb{R}^{d}). There exists M^{\prime}>0 such that for all i=1,\dots,d,

\nabla^{\otimes 2}s_{i}(x;y)\leq M^{\prime}\boldsymbol{I}_{d},\quad x\in\mathbb{R}^{d},\ y\in\mathcal{Y}.
Assumption 2.2 (Moments).

For all x, the following moment conditions are satisfied:

\mathbb{E}_{Y\sim P}\|S(x;Y)\|<\infty,\quad\mathbb{E}_{Y\sim P}\|S^{\prime}(x;Y)\|<\infty.
Assumption 2.3 (Unique minimizer of the KL-divergence).

The KL-divergence of the assumed model relative to the true data-generating distribution is finite and uniquely minimized at x_{0}\in\mathbb{R}^{d}, i.e.:

-\mathbb{E}_{Y\sim P}(\log f(Y;x_{0}))=\inf_{x\in\mathbb{R}^{d}}-\mathbb{E}_{Y\sim P}(\log f(Y;x))<\infty.
Assumption 2.4 (Bernstein von-Mises Theorem).

There exists a sequence of estimators \{\hat{x}_{n}\}_{n=1}^{\infty} satisfying \sum_{j=1}^{n}S(\hat{x}_{n};Y_{j})=\boldsymbol{0}\in\mathbb{R}^{d} for all n, such that:

  (i) \hat{x}_{n}\to x_{0} in \mathbb{P}-probability as n\to\infty.

  (ii) Define I(x_{0}):=\mathbb{E}_{Y\sim P}[S^{\prime}(x_{0};Y)]=-\mathbb{E}_{Y\sim P}[\nabla^{\otimes 2}\log f(Y;x_{0})]>0. Then,

      I^{n}:=n^{-1}\sum_{j=1}^{n}s^{\prime}(\hat{x}_{n};y_{j})\to I(x_{0})

      in \mathbb{P}-probability as n\to\infty.

  (iii) Let \mathcal{B} denote the collection of Borel sets on \mathbb{R}^{d}. Then,

      \sup_{B\in\mathcal{B}}\left|\Pi^{(n)}(n^{-1/2}B+\hat{x}_{n})-N_{\boldsymbol{0},I(x_{0})^{-1}}(B)\right|\to 0

      as n\to\infty in \mathbb{P}-probability. Here, N_{\theta,\Sigma} denotes the (multivariate) Gaussian distribution with mean vector \theta and variance-covariance matrix \Sigma.

Assumptions 2.1 and 2.2 are used throughout the paper and provide sufficient regularity in the model and the data. Although they can be relaxed to a certain degree (see Remark 3.5), we do not pursue that generalization in this paper. Assumptions 2.3 and 2.4 are used in the stationary phase to ensure that the posterior converges at a certain rate to a point mass as n goes to infinity. Part (iii) of Assumption 2.4 further gives asymptotic normality of the posterior, allowing for non-trivial limits in the stationary phase. Under regularity conditions, the sequence of MLEs, i.e. \arg\max_{x\in\mathbb{R}^{d}}\sum_{j=1}^{n}\log f(y_{j};x), satisfies the conditions in Assumption 2.4 [25]. For our main results in Sections 3 and 4, we do not assume that P\in\{F_{x},x\in\mathbb{R}^{d}\}, i.e. that the model is well-specified. If it is, then x_{0} in Assumption 2.3 is such that P=F_{x_{0}}.

3 Fluid limits in the transient phase

This section establishes fluid limits for the Zig-Zag process in Bayesian inference. We consider the identifiable situation where, as more data becomes available, the posterior distribution contracts to a point mass. In this case, the gradient of the log-likelihood grows linearly in n and, as a result, the switching intensities become proportionally large (see Proposition 3.1). This results in a limiting process that switches immediately to a ‘favourable’ direction, and asymptotically the Zig-Zag process behaves as the solution of an ordinary differential equation, as we detail below. Throughout this section, we assume the following about the sequence of reference points defined in (2.9).

Assumption 3.1.

There exists an X^{*} defined on \Omega such that \mathbb{E}\|X^{*}\|^{2}<\infty, X^{*} is independent of \boldsymbol{Y}^{(\mathbb{N})}, and for the sequence of reference points we have X^{*}_{n}\to X^{*}, \mathbb{P}-almost surely.

For each n, let (Z^{n}_{t})_{t\geq 0}=(X^{n}_{t},V^{n}_{t})_{t\geq 0} be a Zig-Zag process targeting \Pi^{(n)} with switching rates (\lambda^{n}_{i})_{i=1}^{d} defined as in (2.6). For each n and i=1,\dots,d, define the sets

H^{n}_{i}=\left\{x\in\mathbb{R}^{d}:\lambda^{n}_{i}(x,-\boldsymbol{1}_{d})+\lambda^{n}_{i}(x,\boldsymbol{1}_{d})=0\right\}. (3.1)

Let H^{n}=\cup_{i=1}^{d}H^{n}_{i}. Since the switching rates are continuous, (H^{n})^{c} is open. Define, for each n,

\tau_{n}=\inf\{t\geq 0:X^{n}_{t}\in H^{n}\text{ or }X^{n}_{t^{-}}\in H^{n}\},

where we set \tau_{n}=\infty if H^{n}=\emptyset. For some cemetery state \Delta, we consider the sequence of stopped processes \tilde{Z}^{n}_{t} defined as,

\tilde{Z}^{n}_{t}=(\tilde{X}^{n}_{t},\tilde{V}^{n}_{t})=\begin{cases}(X^{n}_{t},V^{n}_{t}),&t<\tau_{n},\\ \Delta,&t\geq\tau_{n}.\end{cases}

The stopped process is again a PDMP that is sent to the cemetery state as soon as the position process enters H^{n} and stays there. The corresponding semigroup can be appropriately modified [20]. However, we introduce the cemetery state only for the sake of rigour; it can be ignored for all practical purposes.

The following proposition is a consequence of the strong law of large numbers for U-statistics and applies to any choice of switching rates as defined in Section 2. We remind the reader that we omit the subscripts “can”, “ss”, and “cv” here.

Proposition 3.1.

Suppose Assumptions 2.1, 2.2, and 3.1 hold. Suppose the subsample size is either fixed, i.e. m(n)=m for some m\in\mathbb{N} and all n\geq m, or increases to infinity, i.e. m(n)\to\infty as n\to\infty. Then there exists a continuous \lambda:\Omega\times E\to[0,\infty)^{d} such that for all compact K\subset\mathbb{R}^{d} and i=1,\dots,d,

sup(x,v)K×𝒱|λin(x,v)nλi(x,v)|n0,almost surely.\sup_{(x,v)\in K\times\mathcal{V}}\left|\frac{\lambda^{n}_{i}(x,v)}{n}-\lambda_{i}(x,v)\right|\xrightarrow{n\to\infty}0,\quad\mathbb{P}-\text{almost surely}.

The proof is located in Appendix A. Proposition 3.1 indicates that the switching rates are O(n)O(n). This means that, on average, the process flips direction every O(n1)O(n^{-1}) units of time. Since the position component has unit speed in each coordinate, the process travels only O(n1)O(n^{-1}) distance between direction flips. As nn goes to infinity, the number of switches per unit time goes to infinity while the distance travelled between switches becomes infinitesimally small. Consequently, on the usual time scale, the position component (Xtn)t0(X^{n}_{t})_{t\geq 0} appears to traverse a smooth path and is well approximated by the solution to an ODE.
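As a quick numerical illustration (our own sketch, not part of the paper's formal development), the normalized-rate convergence of Proposition 3.1 can be checked for the one-dimensional Gaussian model of Section 5, where s(x;y) = x − y, so that the canonical rate satisfies λⁿ(x,v)/n → (v·(x−x₀))₊:

```python
import numpy as np

# Sketch: law-of-large-numbers scaling of the canonical switching rate for the
# 1-d Gaussian model f(y; x) ∝ exp(-(y - x)^2 / 2) with data y_j ~ N(x0, 1).
# Here λ^n(x, v) = (v · Σ_j (x - y_j))_+ and λ^n(x, v)/n → (v · (x - x0))_+.
rng = np.random.default_rng(0)
x0, x, v = 0.0, 1.5, 1.0

for n in (10**2, 10**4, 10**6):
    y = rng.normal(x0, 1.0, size=n)
    rate_n = max(v * np.sum(x - y), 0.0) / n   # normalized canonical rate

limit = max(v * (x - x0), 0.0)                 # asymptotic rate λ(x, v) = 1.5
```

The successive values of `rate_n` settle near the deterministic limit, in line with the O(n) growth of the rates.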

For each nn, define the sequence of drifts bn:d[1,1]db^{n}:\mathbb{R}^{d}\to[-1,1]^{d} such that,

bin(x)={λin(x,𝟏d)λin(x,𝟏d)λin(x,𝟏d)+λin(x,𝟏d),xHin,0,otherwise,b^{n}_{i}(x)=\begin{cases}\dfrac{\lambda^{n}_{i}(x,-\boldsymbol{1}_{d})-\lambda^{n}_{i}(x,\boldsymbol{1}_{d})}{\lambda^{n}_{i}(x,-\boldsymbol{1}_{d})+\lambda^{n}_{i}(x,\boldsymbol{1}_{d})},&x\notin H^{n}_{i},\\ 0,&\text{otherwise},\end{cases} (3.2)

for all i=1,,di=1,\dots,d. Then, bnb^{n} is well-defined. For each fixed ωΩ\omega\in\Omega, let λ\lambda be as in Proposition 3.1. Define, for each i=1,,di=1,\dots,d,

Hi={xd:λi(x,𝟏d)+λi(x,𝟏d)=0},H_{i}=\{x\in\mathbb{R}^{d}:\lambda_{i}(x,-\boldsymbol{1}_{d})+\lambda_{i}(x,\boldsymbol{1}_{d})=0\},

and let H=i=1dHiH=\cup_{i=1}^{d}H_{i}. Note that the sets HiH_{i} may vary with ω\omega e.g. in ZZ-CV where the sum of switching rates depends on the limiting reference point. For each ωΩ\omega\in\Omega, define the asymptotic drift b(ω,):dH(ω)[1,1]db(\omega,\cdot):\mathbb{R}^{d}\setminus H(\omega)\to[-1,1]^{d} as,

bi(ω,x):=λi(x,𝟏d)λi(x,𝟏d)λi(x,𝟏d)+λi(x,𝟏d),xdH(ω).b_{i}(\omega,x):=\dfrac{\lambda_{i}(x,-\boldsymbol{1}_{d})-\lambda_{i}(x,\boldsymbol{1}_{d})}{\lambda_{i}(x,-\boldsymbol{1}_{d})+\lambda_{i}(x,\boldsymbol{1}_{d})},\quad x\in\mathbb{R}^{d}\setminus H(\omega).

The following proposition follows readily from Proposition 3.1.

Proposition 3.2.

Suppose Assumptions 2.1, 2.2, and 3.1 hold. With \mathbb{P}-probability 11, bnbb^{n}\to b uniformly over all compact subsets of dH\mathbb{R}^{d}\setminus H. Furthermore, for each such ω\omega, there exists a unique maximal solution (Xt)tτ(X_{t})_{t\leq\tau} to the ODE,

dXtdt=b(ω,Xt)\frac{dX_{t}}{dt}=b(\omega,X_{t}) (3.3)

starting from X0HcX_{0}\in H^{c} with XtHcX_{t}\in H^{c} for all t<τt<\tau.

The proof is located in Appendix A. Extend the solution (Xt)tτ(X_{t})_{t\leq\tau} to (X~t)t0(\tilde{X}_{t})_{t\geq 0} as,

X~t={Xt,t<τ,Δ,tτ.\tilde{X}_{t}=\begin{cases}X_{t},&t<\tau,\\ \Delta,&t\geq\tau.\end{cases}

We now state the main result of this section.

Theorem 3.3.

Suppose Assumptions 2.1, 2.2, and 3.1 hold. Suppose for all nn, X0n(Hn)cX^{n}_{0}\in(H^{n})^{c}, and X0nX0X^{n}_{0}\to X_{0} in probability for some X0HcX_{0}\in H^{c}. Then, X~tn\tilde{X}^{n}_{t} converges weakly (in Skorohod topology) to X~t\tilde{X}_{t}, \mathbb{P}-almost surely.

The proof is located in Appendix B.

Remark 3.4.

Theorem 3.3 establishes a smooth approximation for the Zig-Zag trajectories until they enter HnH^{n}. By the Markov property, this is not a real limitation: the fluid limit applies again as soon as the process leaves HnH^{n}.

Remark 3.5.

The deterministic approximation in Theorem 3.3 is a consequence of the multi-scale dynamics in the components of the Zig-Zag process and is not special to the Bayesian setting. In general, suppose the switching rates satisfy λn=O(nβ)\lambda^{n}=O(n^{\beta}) and the position process moves at speed nαn^{\alpha}. Similar fluid limits can be obtained whenever α<β\alpha<\beta.

The asymptotic drift bb has the following form (see Appendix A):

bi(ω,x)=𝔼YP[Si(x;Y)]λi(x,𝟏d)+λi(x,𝟏d),xH(ω),b_{i}(\omega,x)=\frac{-\mathbb{E}_{Y\sim P}[S_{i}(x;Y)]}{\lambda_{i}(x,-\boldsymbol{1}_{d})+\lambda_{i}(x,\boldsymbol{1}_{d})},\quad x\notin H(\omega), (3.4)

where λ\lambda is the asymptotic switching rate from Proposition 3.1 and Si(x;y)=x,ilogf(y;x)S_{i}(x;y)=\nabla_{x,i}\log f(y;x). In the large nn limit, the function bb entirely characterizes the dynamics of the Zig-Zag trajectories. To understand the effect of different sub-sampling schemes, it is sufficient to understand how the denominator λi(x,𝟏d)+λi(x,𝟏d)\lambda_{i}(x,-\boldsymbol{1}_{d})+\lambda_{i}(x,\boldsymbol{1}_{d}) behaves in each situation. For the canonical Zig-Zag,

λcan,i(x,𝟏d)+λcan,i(x,𝟏d)=|𝔼YP[Si(x;Y)]|\lambda_{\text{can},i}(x,-\boldsymbol{1}_{d})+\lambda_{\text{can},i}(x,\boldsymbol{1}_{d})=\left|\mathbb{E}_{Y\sim P}[S_{i}(x;Y)]\right| (3.5)

for all xx such that the expectation is non-zero. This corresponds to the optimal speed achievable by the Zig-Zag process i.e. ±1\pm 1. Now suppose the subsample size mm is fixed. Then,

λss,i(x,𝟏d)+λss,i(x,𝟏d)=𝔼Y1,,Ym[|m1j=1mSi(x;Yj)|],xd,\lambda_{\text{ss},i}(x,-\boldsymbol{1}_{d})+\lambda_{\text{ss},i}(x,\boldsymbol{1}_{d})=\mathbb{E}_{Y_{1},\dots,Y_{m}}\left[\left|m^{-1}\sum_{j=1}^{m}S_{i}(x;Y_{j})\right|\right],\quad x\in\mathbb{R}^{d}, (3.6)

where Y1,,YmiidPY_{1},\dots,Y_{m}\overset{iid}{\sim}P. Similarly, for Zig-Zag with control variates the denominator is given, when the expectation is non-zero, by

λcv,i(x,𝟏d)+λcv,i(x,𝟏d)\displaystyle\lambda_{\text{cv},i}(x,-\boldsymbol{1}_{d})+\lambda_{\text{cv},i}(x,\boldsymbol{1}_{d})
=𝔼Y1,,Ym[|m1j=1m(Si(x,Yj)Si(X,Yj))+𝔼YSi(X,Y)|],\displaystyle\quad\quad\quad=\mathbb{E}_{Y_{1},\dots,Y_{m}}\left[\left|m^{-1}\sum_{j=1}^{m}\left(S_{i}(x,Y_{j})-S_{i}(X^{*},Y_{j})\right)+\mathbb{E}_{Y}S_{i}(X^{*},Y)\right|\right], (3.7)

where Y1,,YmiidPY_{1},\dots,Y_{m}\overset{iid}{\sim}P and XX^{*} is as in Assumption 3.1.

Remark 3.6.

Note that, as a consequence of the above, the asymptotic drift for ZZ-CV is random in terms of XX^{*}. However, the procedure for choosing reference points is fixed before implementing the algorithm. Hence, the value of XX^{*} is known as soon as the data is observed.

Remark 3.7.

For any m>mm^{\prime}>m, it follows from Jensen’s inequality that |bi(m,x)||bi(m,x)||b_{i}(m^{\prime},x)|\geq|b_{i}(m,x)| for all xx and i=1,,di=1,\dots,d, irrespective of the sampling scheme. When m=m(n)m=m(n)\to\infty as nn\to\infty, we show in Appendix A that bi(m,x)bcan,i(x)b_{i}(m,x)\to b_{\text{can},i}(x) for both ZZ-SS and ZZ-CV. Thus, larger subsamples lead to faster convergence. However, a subsample of size 2m2m does not double the drift, while it incurs twice the computational cost at each proposed switching time. Thus it seems optimal to select a subsample of size 11.
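The sub-linear gain from larger subsamples can be seen numerically. The following Monte Carlo check (our own sketch, assuming a well-specified model so that the scores S_i(x0; Y_j) are centred, here taken standard normal) shows that the damping denominator 𝔼|m⁻¹Σ_j S_j| shrinks only like m^{-1/2}:

```python
import numpy as np

# Sketch: the denominator E|m^{-1} Σ_j S_j| in the ZZ-SS drift decreases with
# the subsample size m, but only like m^{-1/2} for centred scores, so doubling
# m gains only a factor sqrt(2) in drift while doubling the per-switch cost.
rng = np.random.default_rng(0)
scores = rng.normal(0.0, 1.0, size=(10**6, 4))   # iid draws of S_i(x0; Y_j)

denoms = {m: np.mean(np.abs(scores[:, :m].mean(axis=1))) for m in (1, 2, 4)}
# For N(0,1) scores, E|mean of m| = sqrt(2 / (pi * m)).
gain = denoms[1] / denoms[2]                     # ≈ sqrt(2), well below 2
```

The ratio `gain` being below 2 is exactly the observation of Remark 3.7: doubling the subsample does not double the drift.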

Remark 3.8.

The expression for λ\lambda differs for different sequences of switching rates, but by the definition of H(ω)H(\omega), the denominator in (3.4) is always positive. The limiting flow thus drifts in the direction of decreasing 𝔼YP[logf(x;Y)]\mathbb{E}_{Y\sim P}[-\log f(x;Y)], i.e. decreasing KL-divergence between PP and the model {Fx,xd}\{F_{x},x\in\mathbb{R}^{d}\}. When Assumption 2.3 is satisfied, this implies that the limiting flow drifts towards x0x_{0}. This is expected behaviour: the posterior mass concentrates around x0x_{0} as nn goes to infinity and the Zig-Zag process tries to find its way towards stationarity. The denominator in (3.4) is a damping factor that quantifies the loss in speed due to sub-sampling. For canonical Zig-Zag, (3.5) implies that the corresponding ODE has the optimal drift, i.e. +1+1 or 1-1 depending on the relative location of x0x_{0}. For ZZ-SS and ZZ-CV in general, λi(x,𝟏d)+λi(x,𝟏d)|𝔼YP[Si(x;Y)]|\lambda_{i}(x,-\boldsymbol{1}_{d})+\lambda_{i}(x,\boldsymbol{1}_{d})\geq|\mathbb{E}_{Y\sim P}[S_{i}(x;Y)]|. In some cases, however, ZZ-CV achieves the optimal speed, as illustrated in Section 5.

The sets HinH^{n}_{i} take different forms for different switching rates. For canonical Zig-Zag, each HinH^{n}_{i} is the hypersurface Hin={xd:j=1nsij(x)=0}H^{n}_{i}=\{x\in\mathbb{R}^{d}:\sum_{j=1}^{n}s_{i}^{j}(x)=0\}. For ZZ-SS, each Hin=H^{n}_{i}=\emptyset since the sum is always positive. For ZZ-CV with reference point xnx^{*}_{n}, the sets HinH^{n}_{i} can be as small as the empty sets of the ZZ-SS case, e.g. if the reference point is not close to any of the conditional modes, or as large as the hypersurfaces of the canonical case, e.g. when the model density is Gaussian, in which case the ZZ-CV switching rates coincide with the canonical ones irrespective of the reference point xnx^{*}_{n}. However, since the canonical rates are the smallest, for any λin\lambda^{n}_{i} of the form (2.6),

λin(x,v)+λin(x,v)λcan,in(x,v)+λcan,in(x,v),\lambda^{n}_{i}(x,v)+\lambda^{n}_{i}(x,-v)\geq\lambda^{n}_{\text{can},i}(x,v)+\lambda^{n}_{\text{can},i}(x,-v),

for all (x,v)E(x,v)\in E. And so, for any choice of the switching rates,

Hin{xd:j=1nsij(x)=0}.H^{n}_{i}\subseteq\left\{x\in\mathbb{R}^{d}:\sum_{j=1}^{n}s^{j}_{i}(x)=0\right\}.
Refer to caption
Figure 3: Trajectories of the canonical Zig-Zag process in 22 dimensions with different starting values. The dashed lines represent the hypersurfaces Hin,i=1,2H^{n}_{i},\ i=1,2. Paths indicated in black stick to one of the hypersurfaces as soon as they hit it whereas those in orange don’t. Colours were coded after the paths were observed.
[Uncaptioned image]
[Uncaptioned image]
Figure 4: First coordinate of the trajectory of the ZZ-SS algorithm targeting posterior for the Bayesian logistic regression model. The underlying process behaves like a diffusion around the true parameter value 11.

When XtnX^{n}_{t} hits HinH^{n}_{i} for some ii, the total switching rate, and hence the momentum, in the ii-th coordinate becomes zero. However, other coordinates may still carry momentum and affect the overall dynamics of the process. This leads to interesting behaviour: the process either jumps back out of HnH^{n} with a certain velocity or remains stuck to HinH^{n}_{i} for a random amount of time (see Figure 3). It is not straightforward to extend the proof of Theorem 3.3 as the process behaves as a (d1)(d-1)-dimensional process on these sets. Consequently, the stability analysis of the sets HinH^{n}_{i} is beyond the scope of this paper.

4 The stationary phase

As noted in Section 1, the fluid limits only capture deviations in the Zig-Zag trajectories that are big enough on the usual space scale. Suppose Assumptions 2.3 and 2.4 hold. As nn becomes large, the posterior concentrates around x^n\hat{x}_{n}. Consequently, the posterior mass inside the interval (x^nϵ,x^n+ϵ)(\hat{x}_{n}-\epsilon,\hat{x}_{n}+\epsilon) goes to 11 for any ϵ>0\epsilon>0. This restricts the movement of the Zig-Zag process to regions close to x^n\hat{x}_{n} (see Figure 4). It is hence reasonable to expect that, once the Zig-Zag process targeting Π(n)\Pi^{(n)} hits x^n\hat{x}_{n}, it stays within ϵ\epsilon-distance of x^n\hat{x}_{n} with high probability. The following result illustrates this in one dimension under the additional assumption of unimodality of the posterior. The proof is located in Appendix C.

Theorem 4.1.

Suppose Assumptions 2.1 - 2.4 all hold. Suppose there exists an n0n_{0} such that for all nn0n\geq n_{0} and for all s>0s>0, vj=1nsj(x^n+vs)>0v\cdot\sum_{j=1}^{n}s^{j}(\hat{x}_{n}+vs)>0. Let ZtnZ^{n}_{t} be a 11-dimensional canonical Zig-Zag process targeting Π(n)\Pi^{(n)}. Suppose X0n=x^nX^{n}_{0}=\hat{x}_{n}. For all ϵ>0\epsilon>0 and t0t\geq 0,

limn(supst|Xsnx^n|>ϵ)=0.\lim_{n\to\infty}\mathbb{P}\left(\sup_{s\leq t}|X^{n}_{s}-\hat{x}_{n}|>\epsilon\right)=0.

A sufficient condition for the assumption in Theorem 4.1 is that f(y;x)f(y;x) be log-concave as a function of xx for every yy (where f(y;x)f(y;x) is the data-generating density for a given parameter xx), and that an x^n\hat{x}_{n} satisfying jsj(x^n)=0\sum_{j}s^{j}(\hat{x}_{n})=0 exists. Since the sequence x^n\hat{x}_{n} converges to x0x_{0} in probability, Theorem 4.1 implies that, once it gets close to x0x_{0}, the canonical Zig-Zag stays close to x0x_{0} as nn\to\infty. To obtain a non-trivial limit in stationarity, the process needs to be analyzed on its natural scale, which depends on the rate of posterior contraction. Consider the reparameterization,

ξ(x):=n1/2(xx^n),x(ξ)=x^n+n1/2ξ.\xi(x):=n^{1/2}(x-\hat{x}_{n}),\quad\quad x(\xi)=\hat{x}_{n}+n^{-1/2}\xi.

The posterior distribution Π(n)\Pi^{(n)} in terms of the ξ\xi-parameter converges to a multivariate Gaussian distribution. Under this reparameterization, it is natural to expect that the Zig-Zag process converges to a non-trivial stochastic process targeting the asymptotic Gaussian distribution.

In the ξ\xi coordinate, the Zig-Zag process has speed n1/2n^{1/2}. However, the switching rates for ZZ-SS and ZZ-CV in terms of ξ\xi may differ in magnitude and hence require separate analysis. In what follows we assume that the process starts from stationarity, i.e. (X0n,V0n)Π(n)×Uniform({1,1}d)(X^{n}_{0},V^{n}_{0})\sim\Pi^{(n)}\times\text{Uniform}(\{-1,1\}^{d}) for all nn.

4.1 Zig-Zag with sub-sampling (without control-variates)

We first consider ZZ-SS. The ii-th switching rate in the rescaled parameter is written as,

λss,in(ξ,v)=nm|𝒮(n,m)|S𝒮(n,m)(vijSsij(x^n+n1/2ξ))+.\lambda^{n}_{\text{ss},i}(\xi,v)=\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot\sum_{j\in S}s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi)\right)_{+}. (4.1)

For large nn, the above quantity is O(n)O(n). In the (ξ,v)(\xi,v)-coordinates, the velocity component still oscillates at rate O(n)O(n) while the position component moves with speed n1/2n^{1/2}. This is, once again, a situation of multi-scale dynamics as in the last section. It is tempting to scale down time by a factor n1/2n^{-1/2} and average as in Theorem 3.3. But consider,

λss,in(ξ,v)λss,in(ξ,v)=vi(j=1nsij(x^n+n1/2ξ))n1/2vi(ξj=1nsij(x^n)).\lambda^{n}_{\text{ss},i}(\xi,v)-\lambda^{n}_{\text{ss},i}(\xi,-v)=v_{i}\left(\sum_{j=1}^{n}s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi)\right)\approx n^{-1/2}v_{i}\left(\xi\cdot\sum_{j=1}^{n}\nabla s^{j}_{i}(\hat{x}_{n})\right).

Thus, the difference between the rates in opposite directions is O(n1/2)O(n^{1/2}) while their sum remains O(n)O(n). As a result, the drift bnb^{n} goes to 0 and the averaging leads to a trivial fluid limit (see Appendix C). However, in the ξ\xi parameter, the above observations imply that the infinitesimal drift and the infinitesimal variance of the position component are O(1)O(1). This suggests that the process is moving on a diffusive scale.

Let m(n)=mm(n)=m be fixed for some mm\in\mathbb{N}. For each n>mn>m, let Ztn=(Xtn,Vtn)t0Z^{n}_{t}=(X^{n}_{t},V^{n}_{t})_{t\geq 0} be a ZZ-SS process targeting the posterior Π(n)\Pi^{(n)} with fixed sub-sample size mm. For each nn, define (ξtn)t0(\xi^{n}_{t})_{t\geq 0} by

ξtn=n1/2(Xtnx^n);t0.\xi^{n}_{t}=n^{1/2}(X^{n}_{t}-\hat{x}_{n});\quad t\geq 0. (4.2)

We have the following result.

Theorem 4.2.

Suppose Assumptions 2.1 - 2.4 hold. Then, as nn\to\infty, (ξtn)t0(\xi^{n}_{t})_{t\geq 0} converges weakly (in Skorohod topology) in \mathbb{P}-probability to (ξt)t0(\xi_{t})_{t\geq 0} where (ξt)t0(\xi_{t})_{t\geq 0} is a stationary Ornstein-Uhlenbeck process satisfying the SDE,

dξt=AI(x0)2ξtdt+A1/2dBt,d\xi_{t}=-\frac{A\cdot I(x_{0})}{2}\ \xi_{t}dt+A^{1/2}\ dB_{t}, (4.3)

with initial distribution ξ0N(0,(I(x0))1)\xi_{0}\sim N(0,(I(x_{0}))^{-1}). Here, x0x_{0} is as in Assumption 2.3 and AA is a d×dd\times d diagonal matrix with

Aii=2𝔼[|m1j=1mSi(x0;Yj)|],Y1,,YmiidP,i=1,,d.A_{ii}=\frac{2}{\mathbb{E}\left[\left|m^{-1}\sum_{j=1}^{m}S_{i}(x_{0};Y_{j})\right|\right]},\ Y_{1},\dots,Y_{m}\overset{\text{iid}}{\sim}P,\quad i=1,\dots,d.

The proof is located in Appendix C.
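For the well-specified Gaussian model the damping matrix of Theorem 4.2 is explicit, since S(x₀;Y) = x₀ − Y ∼ N(0,1). A short Monte Carlo sketch (our own check, not from the paper) confirms the resulting value A = 2/𝔼|S(x₀;Y)| = √(2π) for m = 1:

```python
import numpy as np

# Sketch: for the well-specified 1-d Gaussian model, S(x0; Y) = x0 - Y ~ N(0,1),
# so with subsample size m = 1 the damping factor of Theorem 4.2 is the scalar
# A = 2 / E|S(x0; Y)| = 2 / sqrt(2/pi) = sqrt(2 * pi).
rng = np.random.default_rng(0)
S = rng.normal(0.0, 1.0, size=10**6)   # Monte Carlo draws of S(x0; Y)
A_mc = 2.0 / np.mean(np.abs(S))        # estimate of A_11
A_exact = np.sqrt(2.0 * np.pi)         # since E|N(0, 1)| = sqrt(2 / pi)
```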

Remark 4.3.

The limiting diffusion is invariant with respect to the limiting Gaussian distribution in the Bernstein-von Mises theorem. The matrix AA in (4.3) is the damping factor quantifying the loss in mixing due to sub-sampling.
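The invariance claimed in Remark 4.3 can be checked by simulation. The following sketch (our own, for the one-dimensional well-specified Gaussian model, so I(x₀) = 1 and, with m = 1, A = √(2π)) simulates the OU limit (4.3) with the exact one-step transition and compares the empirical variance with the Bernstein-von Mises variance 1/I(x₀):

```python
import numpy as np

# Sketch: simulate the limiting OU diffusion (4.3) in one dimension using the
# exact AR(1) transition of the OU process, and check that the empirical
# variance matches the Bernstein-von Mises variance 1 / I(x0) = 1.
rng = np.random.default_rng(0)
A, I = np.sqrt(2.0 * np.pi), 1.0
theta = A * I / 2.0                        # OU drift coefficient
dt, n_steps = 0.01, 500_000
phi = np.exp(-theta * dt)                  # exact one-step AR(1) coefficient
q = np.sqrt((1.0 / I) * (1.0 - phi**2))    # stationary variance A/(2*theta) = 1/I

noise = rng.normal(size=n_steps)
xi = np.empty(n_steps)
xi[0] = rng.normal(0.0, 1.0 / np.sqrt(I))  # start from N(0, I(x0)^{-1})
for k in range(1, n_steps):
    xi[k] = phi * xi[k - 1] + q * noise[k]

var_hat = xi.var()                         # should be close to 1 / I(x0)
```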

Remark 4.4.

When the model is well-specified, i.e. P=Fx0P=F_{x_{0}}, we have 𝔼Si(x0,Y)=0\mathbb{E}S_{i}(x_{0},Y)=0. By Jensen’s inequality,

m1𝔼[|Si(x0;Y1)|]𝔼[|m1j=1mSi(x0;Yj)|]𝔼[|Si(x0;Y1)|].m^{-1}\mathbb{E}\left[\left|S_{i}(x_{0};Y_{1})\right|\right]\leq\mathbb{E}\left[\left|m^{-1}\sum_{j=1}^{m}S_{i}(x_{0};Y_{j})\right|\right]\leq\mathbb{E}\left[\left|S_{i}(x_{0};Y_{1})\right|\right].

Thus, using a batch size of mm is not mm times better than using a batch size of 11. In terms of the total computational cost, it is therefore optimal to choose m=1m=1.

Remark 4.5.

Since we do not rescale time, Theorem 4.2 implies that the ZZ-SS process requires O(1)O(1) time to obtain an (approximately) independent sample. Assuming computational bounds are of the same order as the switching rate i.e. O(n)O(n), the total computational complexity of the ZZ-SS sampler is thus estimated to be O(n)O(n).

4.2 Zig-Zag with Control Variates

Next, we consider the ZZ-CV process in the (ξ,v)(\xi,v)-coordinate. Let (Xn)n=1=(Tn𝒀(n))n=1(X_{n}^{*})_{n=1}^{\infty}=(T_{n}\circ\boldsymbol{Y}^{(n)})_{n=1}^{\infty} be a random sequence of reference points in (d)(\mathbb{R}^{d})^{\mathbb{N}} as used in ZZ-CV. Define ξn=ξ(Xn)\xi^{*}_{n}=\xi(X^{*}_{n}) for all nn i.e. ξn=n1/2(Xnx^n)\xi^{*}_{n}=n^{1/2}(X^{*}_{n}-\hat{x}_{n}). For any observed value xnx^{*}_{n} of XnX^{*}_{n}, the ii-th switching rate in ξ\xi parameter is given by

λcv,in(ξ,v)=nm|𝒮(n,m)|S𝒮(n,m)(vijSEij(ξ))+,\lambda^{n}_{\text{cv},i}(\xi,v)=\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot\sum_{j\in S}E_{i}^{j}(\xi)\right)_{+}, (4.4)

where

Eij(ξ)=Eij(x(ξ))=sij(x^n+n1/2ξ)sij(x^n+n1/2ξn)+n1k=1nsik(x^n+n1/2ξn)\displaystyle E_{i}^{j}(\xi)=E_{i}^{j}(x(\xi))=s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi)-s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi_{n}^{*})+n^{-1}\sum_{k=1}^{n}s^{k}_{i}(\hat{x}_{n}+n^{-1/2}\xi_{n}^{*})
n1/2(ξξn)sij(x^n)+n1k=1nsik(x^n+n1/2ξn).\displaystyle\approx n^{-1/2}(\xi-\xi^{*}_{n})\cdot\nabla s^{j}_{i}(\hat{x}_{n})+n^{-1}\sum_{k=1}^{n}s^{k}_{i}(\hat{x}_{n}+n^{-1/2}\xi_{n}^{*}).

The magnitude of the switching rates depends explicitly on the sequence of reference points and their distance to x^n\hat{x}_{n}. A reasonable choice for TnT_{n} is to ensure XnX^{*}_{n} is sufficiently close to x^n\hat{x}_{n}. Motivated by the scaling arguments in [11], we suppose that the reference points are chosen such that Xnx^n=O(n1/2)\|X^{*}_{n}-\hat{x}_{n}\|=O(n^{-1/2}), i.e. ξn\xi^{*}_{n} converges to some finite random element. Then, it can be shown that the ZZ-CV switching rates in the ξ\xi coordinate are of magnitude O(n1/2)O(n^{1/2}). This implies that, in the (ξ,v)(\xi,v)-coordinates, the position and the velocity components mix at the same rate, O(n1/2)O(n^{1/2}). Slowing down time by a factor n1/2n^{-1/2} brings both components back to the unit time scale. Consequently, the limiting process is a Zig-Zag process with an appropriate switching rate.

Refer to caption
(a) X=(1,2)X^{*}=(1,2)
Refer to caption
(b) X=(10,1)X^{*}=(-10,1)
Refer to caption
(c) X=(5,5)X^{*}=(5,5)
Figure 5: Stationary phase trajectories of ZZ-CV for Bayesian logistic regression example with true parameter (1,2)(1,2) with different reference points.

Let m(n)=mm(n)=m be fixed for some mm\in\mathbb{N}. For each n>mn>m, let Ztn=(Xtn,Vtn)t0Z^{n}_{t}=(X^{n}_{t},V^{n}_{t})_{t\geq 0} be a ZZ-CV process targeting the posterior Π(n)\Pi^{(n)} with reference point xnx^{*}_{n} and fixed sub-sample size mm. Define

Utn:=(ξ(Xn1/2tn),Vn1/2tn)=(n1/2(Xn1/2tnx^n),Vn1/2tn);t0,n>m.U^{n}_{t}:=\left(\xi(X^{n}_{n^{-1/2}t}),V^{n}_{n^{-1/2}t}\right)=\left(n^{1/2}(X^{n}_{n^{-1/2}t}-\hat{x}_{n}),V^{n}_{n^{-1/2}t}\right);\quad t\geq 0,n>m.

We have the following result.

Theorem 4.6.

Suppose Assumptions 2.1 - 2.4 hold. Let ξ\xi^{*} be such that ξnξ\xi^{*}_{n}\to\xi^{*} in \mathbb{P}-probability, where ξ\xi^{*} is independent of 𝐘()\boldsymbol{Y}^{(\mathbb{N})} and 𝔼ξ2<\mathbb{E}\|\xi^{*}\|^{2}<\infty. Then (Utn)t0(U^{n}_{t})_{t\geq 0} converges weakly (in Skorohod topology) in \mathbb{P}-probability to (Ut)t0(U_{t})_{t\geq 0} where Ut=(ξt,Vt)U_{t}=(\xi_{t},V_{t}) is a (random) dd-dimensional stationary Zig-Zag process with, conditional on ξ\xi^{*}, ii-th switching rate

λi(ξ,v|ξ)\displaystyle\lambda_{i}(\xi,v|\xi^{*}) (4.5)
=m1𝔼Y1,,Ym(vij=1m(ξSi(x0;Yj)ξ(Si(x0;Yj)𝔼Si(x0;Yj))))+,\displaystyle\quad\quad=m^{-1}\mathbb{E}_{Y_{1},\dots,Y_{m}}\left(v_{i}\cdot\sum_{j=1}^{m}\left(\xi\cdot\nabla S_{i}(x_{0};Y_{j})-\xi^{*}\cdot(\nabla S_{i}(x_{0};Y_{j})-\mathbb{E}\nabla S_{i}(x_{0};Y_{j}))\right)\right)_{+},

where Si(x;Y)\nabla S_{i}(x;Y) is the ii-th column of the Hessian matrix S(x,Y)=2logf(Y;x)S^{\prime}(x,Y)=-\nabla^{\otimes 2}\log f(Y;x) and Y1,,YmiidPY_{1},\dots,Y_{m}\overset{\text{iid}}{\sim}P are independent of ξ\xi^{*}.

The proof is located in Appendix C.

Remark 4.7.

Observe that for any value of ξ\xi^{*},

λi(ξ,v|ξ)λi(ξ,v|ξ)=vi(ξ𝔼[Si(x0;Y)]).\lambda_{i}(\xi,v|\xi^{*})-\lambda_{i}(\xi,-v|\xi^{*})=v_{i}\left(\xi\cdot\mathbb{E}\left[\nabla S_{i}(x_{0};Y)\right]\right).

The limiting Zig-Zag process is invariant with respect to the limiting Gaussian distribution with mean vector 0 and covariance matrix given by the inverse of I(x0)I(x_{0}).

Remark 4.8.

As in the case of fluid limits, the limiting process is a random Zig-Zag process depending on the statistic TnT_{n} used to generate reference points. When TnT_{n} is such that Xnx^n=O(nα)\|X^{*}_{n}-\hat{x}_{n}\|=O(n^{-\alpha}) for α>1/2\alpha>1/2, then ξ=0\xi^{*}=0 almost surely. The limiting switching rate reduces to,

λi(ξ,v)=m1𝔼Y1,,Ym(vij=1mξSi(x0;Yj))+.\lambda_{i}(\xi,v)=m^{-1}\mathbb{E}_{Y_{1},\dots,Y_{m}}\left(v_{i}\cdot\sum_{j=1}^{m}\xi\cdot\nabla S_{i}(x_{0};Y_{j})\right)_{+}.
Remark 4.9.

In general, the limiting Zig-Zag process is not canonical. By Jensen’s inequality, for any fixed mm,

m1𝔼Y1,,Ym(vij=1mξSi(x0;Yj))+(vi(ξ𝔼Si(x0;Y)))+.m^{-1}\mathbb{E}_{Y_{1},\dots,Y_{m}}\left(v_{i}\cdot\sum_{j=1}^{m}\xi\cdot\nabla S_{i}(x_{0};Y_{j})\right)_{+}\geq\left(v_{i}\cdot\left(\xi\cdot\mathbb{E}\nabla S_{i}(x_{0};Y)\right)\right)_{+}.

However, for exponential families represented in their natural parameters, the Hessian S(x;Y)S^{\prime}(x;Y) does not depend on YY. The limiting rate then coincides with the canonical rate. In general, the excess quantifies the loss in efficiency due to sub-sampling.
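The Jensen gap of Remark 4.9 is easy to see numerically. The sketch below (our own check, with a hypothetical distribution for the Hessian entry ∇S(x₀;Y), taken N(0.5, 1)) compares the limiting ZZ-CV rate with the canonical rate, and verifies equality in the exponential-family case where ∇S does not depend on Y:

```python
import numpy as np

# Sketch of Remark 4.9: the limiting ZZ-CV rate E(v_i ξ·∇S)_+ dominates the
# canonical rate (v_i (ξ·E∇S))_+ by Jensen's inequality, with equality when
# ∇S(x0; Y) is constant in Y (e.g. natural exponential families).
rng = np.random.default_rng(0)
xi, v = 1.0, 1.0
H = rng.normal(0.5, 1.0, size=10**6)             # hypothetical draws of ∇S(x0; Y)

rate_cv = np.mean(np.maximum(v * xi * H, 0.0))   # limiting ZZ-CV rate, m = 1
rate_can = max(v * xi * H.mean(), 0.0)           # canonical rate

H_const = np.full(10**6, 0.5)                    # exponential-family case
rate_eq = np.mean(np.maximum(v * xi * H_const, 0.0))  # coincides with canonical
```

Here `rate_cv` exceeds `rate_can`, and the excess is exactly the efficiency loss due to sub-sampling mentioned above.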

Remark 4.10.

The difference in the switching rates in opposite directions for ZZ-CV is the same as for ZZ-SS, i.e. O(n1/2)O(n^{1/2}). The limiting dynamics are then governed only by the magnitude of the sum of the switching rates. The assumption that Xnx^n=O(n1/2)\|X^{*}_{n}-\hat{x}_{n}\|=O(n^{-1/2}) in Theorem 4.6 cannot be dropped. In general, suppose Xnx^n=O(nα)\|X^{*}_{n}-\hat{x}_{n}\|=O(n^{-\alpha}) for some α0\alpha\geq 0; then the sum of the switching rates is of magnitude O(nmax{1/2,1α})O(n^{\max\{1/2,1-\alpha\}}). In particular, when α=0\alpha=0, the sum of the rates is O(n)O(n) and the process is once again on the diffusive scale, as in the case of ZZ-SS. We do not pursue this any further in this paper, but see Figure 5 for examples of Zig-Zag trajectories when Xn↛x^nX^{*}_{n}\not\to\hat{x}_{n}.

Suppose now that the sub-sample size mm varies with nn. The canonical Zig-Zag corresponds to the case m(n)=nm(n)=n. If m(n)m(n) is large when nn is large, it is reasonable to expect that ZZ-CV behaves similarly to the canonical Zig-Zag, although it can be argued, as in the case of ZZ-SS, that m=1m=1 is optimal in terms of the total computational cost of the algorithm. The following result is included for completeness. It shows that if m(n)m(n)\to\infty as nn\to\infty, the limiting process does not depend on the reference points in ZZ-CV and is equivalent to the canonical Zig-Zag. Once again, the limiting process is invariant with respect to the limiting Gaussian distribution.

Theorem 4.11.

In Theorem 4.6, suppose that the sub-sample size m=m(n)m=m(n) such that for all nn, 1m(n)n1\leq m(n)\leq n and m(n)m(n)\to\infty as nn\to\infty. Then the limiting process UU is a stationary Zig-Zag process with switching rates

λi(ξ,v)=(vi(ξ𝔼Si(x0;Y)))+,\lambda_{i}(\xi,v)=\left(v_{i}\cdot\left(\xi\cdot\mathbb{E}\nabla S_{i}(x_{0};Y)\right)\right)_{+},

where YPY\sim P and Si(x;Y)\nabla S_{i}(x;Y) is the ii-th column of S(x,Y)=2logf(Y;x)S^{\prime}(x,Y)=-\nabla^{\otimes 2}\log f(Y;x).

The proof is located in Appendix C.

Remark 4.12.

Theorems 4.6 and 4.11 verify the observations presented in [11] that, starting from equilibrium, the canonical Zig-Zag and the ZZ-CV process require a time interval of O(n1/2)O(n^{-1/2}) to obtain an essentially independent sample. Provided the computational bound used in the sampler is O(n1/2)O(n^{1/2}), this can be achieved in O(1)O(1) proposed switches. However, at each proposed switch, canonical Zig-Zag incurs O(n)O(n) computational cost while, for fixed subsample sizes, ZZ-CV costs only O(1)O(1), resulting in significant benefits.

5 Illustration of fluid limits

In this section, we return to the transient phase and consider the fluid limit in more detail. Recall the expression (3.4) for the asymptotic drift,

bi(ω,x)=𝔼YP[Si(x;Y)]λi(x,𝟏d)+λi(x,𝟏d),xH(ω),b_{i}(\omega,x)=\frac{-\mathbb{E}_{Y\sim P}[S_{i}(x;Y)]}{\lambda_{i}(x,-\boldsymbol{1}_{d})+\lambda_{i}(x,\boldsymbol{1}_{d})},\quad x\notin H(\omega),

where H(ω)=i=1d{x:λi(x,𝟏d)+λi(x,𝟏d)=0}H(\omega)=\cup_{i=1}^{d}\{x:\lambda_{i}(x,-\boldsymbol{1}_{d})+\lambda_{i}(x,\boldsymbol{1}_{d})=0\}. Let bcanb_{\text{can}}, bssb_{\text{ss}} and bcvb_{\text{cv}} denote the asymptotic drifts for canonical Zig-Zag, ZZ-SS and ZZ-CV respectively, obtained by substituting (3.5), (3.6), and (3.7) into (3.4). Note that since the data is almost surely non-degenerate, the drift for ZZ-SS is well-defined on the whole space, i.e. Hss=H_{\text{ss}}=\emptyset almost surely. For canonical Zig-Zag, Hcan={x:|𝔼YSi(x;Y)|=0, for some i=1,,d}H_{\text{can}}=\{x:|\mathbb{E}_{Y}S_{i}(x;Y)|=0,\text{ for some }\ i=1,\dots,d\}. For ZZ-CV, on the other hand, the set HcvH_{\text{cv}} depends on the model and the observed value of XX^{*}; for example, Hcv=HcanH_{\text{cv}}=H_{\text{can}} when the model is Gaussian.

5.1 Toy examples

We illustrate fluid limits in some simple examples. Suppose P=Fx0P=F_{x_{0}} for some x0dx_{0}\in\mathbb{R}^{d}. Then, 𝔼YS(x0,Y)=0\mathbb{E}_{Y}S(x_{0},Y)=0. We first look at some toy models in one dimension.

  • Gaussian model: Suppose f(y;x)exp((yx)2/2)f(y;x)\propto\exp(-(y-x)^{2}/2) and P=N(x0,1)P=N(x_{0},1). Then s(x;y)=(xy)s(x;y)=(x-y) and Ej(x)=xyjE^{j}(x)=x-y_{j} for ZZ-SS. But since s(x;y)s(x;y) is linear in both xx and yy, using control variates makes the variance in EjE^{j} equal to 0. To wit,

    Ej(x)=(xyj)(xnyj)+n1k=1n(xnyk)=xx^n.E^{j}(x)=(x-y_{j})-(x^{*}_{n}-y_{j})+n^{-1}\sum_{k=1}^{n}(x^{*}_{n}-y_{k})=x-\hat{x}_{n}.

    Hence, the ZZ-CV becomes equivalent to the canonical Zig-Zag in this case. We get,

    bss(x)\displaystyle b_{\text{ss}}(x) =(xx0)𝔼|xY|;YN(x0,1/m),x.\displaystyle=\frac{-(x-x_{0})}{\mathbb{E}|x-Y|};\quad Y\sim N(x_{0},1/m),\quad x\in\mathbb{R}.
    bcv(x)\displaystyle b_{\text{cv}}(x) =(xx0)|xx0|,xx0.\displaystyle=\frac{-(x-x_{0})}{|x-x_{0}|},\quad x\neq x_{0}.
    bcan(x)\displaystyle b_{\text{can}}(x) =(xx0)|xx0|,xx0.\displaystyle=\frac{-(x-x_{0})}{|x-x_{0}|},\quad x\neq x_{0}.

    In general, for one-dimensional models with log-concave densities, it can be shown that ZZ-CV will always be equivalent to the canonical Zig-Zag.

  • Laplace model: Suppose f(y;x)exp(|yx|)f(y;x)\propto\exp(-|y-x|) and P=Laplace(x0,1)P=\text{Laplace}(x_{0},1). The log-density fails to be continuously differentiable on a null set and hence violates the smoothness assumption needed for the fluid limits. However, defining s(x;y)s(x;y) to be the weak derivative with respect to xx of |xy||x-y|, we get,

    s(x;y)={sgn(xy);yx,0;y=x.s(x;y)=\begin{cases}\mathrm{sgn}(x-y);&y\neq x,\\ 0;&y=x.\end{cases}

    Let p(x)=P(Yx)=Fx0(x)p(x)=P(Y\leq x)=F_{x_{0}}(x). Then,

    bss(x)=(2p(x)1)𝔼|2m1Y1|;YBin(m,p(x)),x.b_{\text{ss}}(x)=\frac{-(2p(x)-1)}{\mathbb{E}|2m^{-1}Y-1|};\quad Y\sim\text{Bin}(m,p(x)),\quad x\in\mathbb{R}.

    Given the form of the ss function, note that for any x,xx,x^{\prime}, s(x;y)s(x;y)s(x;y)-s(x^{\prime};y) depends only on the position of yy relative to xx and xx^{\prime}. The denominator term in bcvb_{\text{cv}} then depends only on the location of XX^{*} and the direction of xx relative to XX^{*}. Let m=1m=1 and let xx^{*} be the observed value of XX^{*}. When p(x)<1/2p(x^{*})<1/2, i.e. x<x0x^{*}<x_{0}, the drift is given by,

    bcv(x)={1xx12p(x)4p(x)(p(x)p(x))2p(x)+1x>x.b_{\text{cv}}(x)=\begin{cases}1&x\leq x^{*}\\ \dfrac{1-2p(x)}{4p(x^{*})(p(x)-p(x^{*}))-2p(x^{*})+1}&x>x^{*}.\end{cases}

    On the other hand when x>x0x^{*}>x_{0}, p(x)>1/2p(x^{*})>1/2, and we have,

    bcv(x)={12p(x)4(1p(x))(p(x)p(x))+2p(x)1x<x1xx.b_{\text{cv}}(x)=\begin{cases}\dfrac{1-2p(x)}{4(1-p(x^{*}))(p(x^{*})-p(x))+2p(x^{*})-1}&x<x^{*}\\ -1&x\geq x^{*}.\end{cases}

    Hence, the fluid limit travels with optimal speed until xx^{*}, after which it slows down until it reaches x0x_{0}, where it settles. An appropriate choice of control variates, however, can make the process more efficient. Suppose the sequence of reference points is chosen such that X=x0X^{*}=x_{0} almost surely. Then, p(X)=1/2p(X^{*})=1/2 almost surely and the asymptotic drift is equivalent to the canonical drift,

    bcv(x)=bcan(x)=(2p(x)1)|2p(x)1|,xx0.b_{\text{cv}}(x)=b_{\text{can}}(x)=\frac{-(2p(x)-1)}{|2p(x)-1|},\quad x\neq x_{0}.

    In addition, if xn=x^nx^{*}_{n}=\hat{x}_{n}, then ZZ-CV is equivalent to the canonical Zig-Zag for the corresponding Π(n)\Pi^{(n)}.

  • Cauchy model: Suppose f(y;x)(1+(yx)2)1f(y;x)\propto(1+(y-x)^{2})^{-1} and P=Cauchy(x0,1)P=\text{Cauchy}(x_{0},1). Then, s(x;y)=2(xy)/(1+(xy)2)s(x;y)=2(x-y)/(1+(x-y)^{2}). Explicit expressions are hard to calculate in this case. In particular, suppose X=x0X^{*}=x_{0} almost surely and let m=1m=1. Then,

    bss(x)\displaystyle b_{\text{ss}}(x) =2(xx0)(xx0)2+4(𝔼YFx0[|2(xY)1+(xY)2|])1,x.\displaystyle=-\frac{2(x-x_{0})}{(x-x_{0})^{2}+4}\left(\mathbb{E}_{Y\sim F_{x_{0}}}\left[\left|\frac{2(x-Y)}{1+(x-Y)^{2}}\right|\right]\right)^{-1},\quad x\in\mathbb{R}.
    bcv(x)\displaystyle b_{\text{cv}}(x) =2(xx0)(xx0)2+4(𝔼YFx0[|2(xY)1+(xY)22(x0Y)1+(x0Y)2|])1,xx0.\displaystyle=-\frac{2(x-x_{0})}{(x-x_{0})^{2}+4}\left(\mathbb{E}_{Y\sim F_{x_{0}}}\left[\left|\frac{2(x-Y)}{1+(x-Y)^{2}}-\frac{2(x_{0}-Y)}{1+(x_{0}-Y)^{2}}\right|\right]\right)^{-1},\quad x\neq x_{0}.
    bcan(x)\displaystyle b_{\text{can}}(x) =(xx0)|xx0|,xx0.\displaystyle=\frac{-(x-x_{0})}{|x-x_{0}|},\quad x\neq x_{0}.
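For the Gaussian model above, the drifts admit a simple closed form: 𝔼|x − Y| is the mean of a folded normal. The following sketch (our own, with m = 1 and x₀ = 0) implements b_ss and b_cv = b_can and illustrates that sub-sampling damps the drift below unit speed:

```python
import numpy as np
from math import erf, exp, pi, sqrt

# Sketch of the Gaussian toy model (m = 1, x0 = 0): b_ss(x) = -x / E|x - Y|
# with Y ~ N(0, 1), where E|x - Y| is the mean of a folded normal, while
# b_cv(x) = b_can(x) = -sign(x) moves at the optimal unit speed.
def folded_mean(delta, sigma=1.0):
    # E|delta - sigma * Z| for Z ~ N(0, 1)
    z = delta / sigma
    return sigma * sqrt(2.0 / pi) * exp(-z * z / 2.0) + delta * erf(z / sqrt(2.0))

def b_ss(x):
    return -x / folded_mean(x)

def b_cv(x):
    return -np.sign(x)

# |b_ss(x)| < 1 everywhere (sub-sampling damps the drift), but |b_ss(x)| -> 1
# as |x| -> infinity, where the sub-sampling noise becomes negligible.
```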
Refer to caption
(a) Normal
Refer to caption
(b) Laplace
Refer to caption
(c) Cauchy
Figure 6: Asymptotic drift for different models and sub-sampling schemes in one dimension with P=F0P=F_{0} and X=0X^{*}=0 and m=1m=1. The black curve denotes bcanb_{\text{can}}, the red curve is bssb_{\text{ss}} and the blue curve is bcvb_{\text{cv}}. Note that in the first two plots, the black and the blue curves overlap.

Figure 6 plots the asymptotic drifts for the above models for canonical Zig-Zag, ZZ-SS, and ZZ-CV (when X=x0X^{*}=x_{0} almost surely), with subsample size m=1m=1 and x0=0x_{0}=0. As the calculations above show, when the model is Normal or Laplace, ZZ-CV is equivalent to canonical Zig-Zag. For the Cauchy model, however, ZZ-CV is sub-optimal near x0x_{0}: indeed, limxx0|bcv(x)|=π/4<1\lim_{x\to x_{0}}|b_{\text{cv}}(x)|=\pi/4<1. The more striking feature of the Cauchy plot is that, contrary to expectations, ZZ-CV performs worse than ZZ-SS. Later, in Section 5.3, we will see that this is a consequence of heavy tails and holds generally.
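The value π/4 quoted above can be checked numerically. Near x₀ the numerator of b_cv behaves like |x − x₀|/2, while the denominator behaves like |x − x₀| · 𝔼|∂ₓs(x₀;Y)|; the sketch below (our own Monte Carlo check) estimates 𝔼|∂ₓs(x₀;Y)| = 2/π for Y ∼ Cauchy(x₀,1), giving the limit (1/2)/(2/π) = π/4:

```python
import numpy as np

# Sketch: for the Cauchy model, d/dx s(x; y) = 2(1 - (x-y)^2) / (1 + (x-y)^2)^2.
# With u = x0 - Y standard Cauchy, E|d/dx s(x0; Y)| = 2/pi, so that
# lim_{x -> x0} |b_cv(x)| = (1/2) / (2/pi) = pi/4 < 1.
rng = np.random.default_rng(0)
u = rng.standard_cauchy(10**7)                     # u = x0 - Y, standard Cauchy
ds = np.abs(2.0 * (1.0 - u**2) / (1.0 + u**2)**2)  # |d/dx s(x0; Y)|

mean_ds = ds.mean()           # ≈ 2 / pi
limit_bcv = 0.5 / mean_ds     # ≈ pi / 4, the sub-optimal speed near x0
```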

5.2 Bayesian logistic regression

Consider a set of observations (wj,yj)j=1n(w^{j},y^{j})_{j=1}^{n} with wjdw^{j}\in\mathbb{R}^{d} and yj{0,1}y^{j}\in\{0,1\}, j=1,,nj=1,\dots,n. We assume a Bayesian logistic regression model on the observations such that given a parameter xdx\in\mathbb{R}^{d} and covariate wjw^{j}, the binary variable yjy^{j} has distribution,

(yj=1|wj,x)=11+exp(i=1dxiwij)=1(y=0|wj,x),j=1,,n.\mathbb{P}(y^{j}=1|w^{j},x)=\frac{1}{1+\exp(-\sum_{i=1}^{d}x_{i}w^{j}_{i})}=1-\mathbb{P}(y=0|w^{j},x),\quad j=1,\dots,n.

Additionally, since we are interested in large sample asymptotics, we suppose that the covariates wjiidQ,j=1,,nw^{j}\overset{\text{iid}}{\sim}Q,j=1,\dots,n, for some probability measure QQ on d\mathbb{R}^{d} with Lebesgue density qq. Under a flat prior assumption, the posterior density π(n)\pi^{(n)} for the parameter xx is,

π(n)(x)j=1nq(wj)(y=yj|wj,x)=j=1nq(wj)exp(yji=1dxiwij)1+exp(i=1dxiwij),xd.\pi^{(n)}(x)\propto\prod_{j=1}^{n}q(w^{j})\mathbb{P}(y=y^{j}|w^{j},x)=\prod_{j=1}^{n}q(w^{j})\frac{\exp(y^{j}\cdot\sum_{i=1}^{d}x_{i}w^{j}_{i})}{1+\exp(\sum_{i=1}^{d}x_{i}w^{j}_{i})},\quad x\in\mathbb{R}^{d}.

The corresponding potential function is given by,

U(x)=j=1n{log(1+exp(i=1dxiwij))yji=1dxiwij}j=1nlogq(wj).U(x)=\sum_{j=1}^{n}\left\{\log\left(1+\exp\left(\sum_{i=1}^{d}x_{i}w^{j}_{i}\right)\right)-y^{j}\cdot\sum_{i=1}^{d}x_{i}w^{j}_{i}\right\}-\sum_{j=1}^{n}\log q(w^{j}).

and so,

s(x;y,w)=x(log(1+exp(i=1dxiwi))yi=1dxiwilogq(w)).s(x;y,w)=\nabla_{x}\left(\log\left(1+\exp\left(\sum_{i=1}^{d}x_{i}w_{i}\right)\right)-y\cdot\sum_{i=1}^{d}x_{i}w_{i}-\log q(w)\right).

Let p(x;w)p(x;w) denote the probability (Y=1;w,x)\mathbb{P}(Y=1;w,x). In the case of a well-specified model with true model parameter x0x_{0} it follows, after conditioning on WW, that

𝔼(Si(x;Y,W))\displaystyle\mathbb{E}(S_{i}(x;Y,W)) =𝔼(Wip(x;W)YWi)\displaystyle=\mathbb{E}(W_{i}p(x;W)-YW_{i})
=𝔼(Wip(x;W)Wip(x0;W)),WQ.\displaystyle=\mathbb{E}(W_{i}p(x;W)-W_{i}p(x_{0};W)),\quad W\sim Q.

Moreover, we get

𝔼[|Si(x;Y,W)|]=𝔼[|Wi|(p(x0,W)+p(x;W)2p(x0,W)p(x;W))],WQ.\mathbb{E}[|S_{i}(x;Y,W)|]=\mathbb{E}\left[|W_{i}|\cdot(p(x_{0},W)+p(x;W)-2p(x_{0},W)p(x;W))\right],\quad W\sim Q.

The ii-th coordinate of the asymptotic drift for ZZ-SS corresponding to sub-sample size m=1m=1 is then

bss,i(x)=𝔼WQ[Wip(x;W)Wip(x0;W)]𝔼WQ[|Wi|(p(x0,W)+p(x;W)2p(x0,W)p(x;W))],xd.b_{\text{ss},i}(x)=\frac{-\mathbb{E}_{W\sim Q}\left[W_{i}p(x;W)-W_{i}p(x_{0};W)\right]}{\mathbb{E}_{W\sim Q}\left[|W_{i}|\cdot(p(x_{0},W)+p(x;W)-2p(x_{0},W)p(x;W))\right]},\quad x\in\mathbb{R}^{d}.

Similarly, for ZZ-CV, observe that for any given xdx^{*}\in\mathbb{R}^{d},

𝔼[|Si(x;Y,W)Si(x;Y,W)+𝔼Si(x,Y,W)|]\displaystyle\mathbb{E}[|S_{i}(x;Y,W)-S_{i}(x^{*};Y,W)+\mathbb{E}S_{i}(x^{*},Y,W)|]
=𝔼[|Wip(x;W)Wip(x;W)+𝔼Si(x,Y,W)|].\displaystyle\quad\quad\quad=\mathbb{E}\left[\left|W_{i}p(x;W)-W_{i}p(x^{*};W)+\mathbb{E}S_{i}(x^{*},Y,W)\right|\right].

In particular, when X=x0X^{*}=x_{0} almost surely, the above simplifies and we get the asymptotic ZZ-CV drift,

bcv,i(x)=𝔼WQ[Wip(x;W)Wip(x0;W)]𝔼WQ[|Wip(x;W)Wip(x0;W)|],xx0.b_{\text{cv},i}(x)=\frac{-\mathbb{E}_{W\sim Q}\left[W_{i}p(x;W)-W_{i}p(x_{0};W)\right]}{\mathbb{E}_{W\sim Q}\left[\left|W_{i}p(x;W)-W_{i}p(x_{0};W)\right|\right]},\quad x\neq x_{0}.

Finally,

bcan,i(x)=𝔼WQ[Wip(x;W)Wip(x0;W)]|𝔼WQ[Wip(x;W)Wip(x0;W)]|,b_{\text{can},i}(x)=\frac{-\mathbb{E}_{W\sim Q}\left[W_{i}p(x;W)-W_{i}p(x_{0};W)\right]}{\left|\mathbb{E}_{W\sim Q}\left[W_{i}p(x;W)-W_{i}p(x_{0};W)\right]\right|},

for all xx for which the denominator is non-zero.
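The expressions above can be evaluated by straightforward Monte Carlo over W. The sketch below is illustrative rather than the paper's code; it assumes the covariate distribution Q = δ1 × N(0,1) and x0 = (1,2) of the synthetic example that follows, and computes the drift coordinate-wise for m = 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n_mc = 200_000
x0 = np.array([1.0, 2.0])
# Covariates W ~ Q = delta_1 x N(0, 1), as in the synthetic example
W = np.column_stack([np.ones(n_mc), rng.standard_normal(n_mc)])

def p(x):
    # p(x; w) = P(Y = 1 | w, x) under the logistic model
    return 1.0 / (1.0 + np.exp(-W @ x))

def b_ss(x, i):
    # i-th coordinate of the asymptotic ZZ-SS drift (m = 1)
    num = -np.mean(W[:, i] * (p(x) - p(x0)))
    den = np.mean(np.abs(W[:, i]) * (p(x0) + p(x) - 2.0 * p(x0) * p(x)))
    return num / den

def b_cv(x, i):
    # i-th coordinate of the asymptotic ZZ-CV drift with X* = x0
    g = W[:, i] * (p(x) - p(x0))
    return -np.mean(g) / np.mean(np.abs(g))
```

Since p0 + p - 2*p0*p >= |p - p0| pointwise, the ZZ-SS denominator dominates the ZZ-CV one, so abs(b_ss) <= abs(b_cv) coordinate-wise for this (light-tailed) model.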

As an illustrative example, we generated a synthetic data set of size 1000010000 from the above model with Q=δ1×N(0,1)Q=\delta_{1}\times N(0,1) and x0=(1,2)x_{0}=(1,2). Figure 8 plots trajectories of the Zig-Zag process under different sub-sampling schemes targeting the corresponding Bayesian posterior. The reference point for ZZ-CV was chosen to be a numerical estimate of the MLE. The black curves denote the actual Zig-Zag process trajectories and the superimposed red curves denote the solution to the asymptotic ODE.

(a) Canonical Zig-Zag (b) ZZ-SS (c) ZZ-CV
Figure 8: Bayesian logistic regression example with 10000 data points. The black curves are the Zig-Zag process trajectories and the red curves denote the theoretical asymptotic limit. The top and the bottom panels correspond to the first and second coordinates respectively.

5.3 Comparison of ZZ-SS and ZZ-CV

As noted before, comparing drifts for different sub-sampling schemes is equivalent to comparing the denominator in (3.4). Looking at the expressions for the denominator in (3.6) and (3.7), it is not obvious when one is faster than the other. Figure 6 already illustrates an example where, contrary to expectations, ZZ-CV performs worse than ZZ-SS.

Although the results so far do not assume well-specified models, in what follows we suppose P=Fx0P=F_{x_{0}} for some x0dx_{0}\in\mathbb{R}^{d}. Given the smoothness conditions, we have 𝔼YS(x0,Y)=0\mathbb{E}_{Y}S(x_{0},Y)=0. For ease of exposition, we set m=1m=1. The asymptotic drift reduces to

bcv,i(x)\displaystyle b_{\text{cv},i}(x) =𝔼Y[Si(x;Y)]𝔼Y[|Si(x,Y)Si(X,Y)+𝔼YSi(X,Y)|]\displaystyle=\frac{-\mathbb{E}_{Y}[S_{i}(x;Y)]}{\mathbb{E}_{Y}\left[\left|S_{i}(x,Y)-S_{i}(X^{*},Y)+\mathbb{E}_{Y}S_{i}(X^{*},Y)\right|\right]}
=𝔼Y[|Si(x;Y)|]𝔼Y[|Si(x,Y)Si(X,Y)+𝔼YSi(X,Y)|]bss,i(x).\displaystyle=\frac{\mathbb{E}_{Y}[|S_{i}(x;Y)|]}{\mathbb{E}_{Y}\left[\left|S_{i}(x,Y)-S_{i}(X^{*},Y)+\mathbb{E}_{Y}S_{i}(X^{*},Y)\right|\right]}\ b_{\text{ss},i}(x).

When the model is heavy-tailed in the sense that for all yy, limxs(x,y)=0\lim_{\|x\|\to\infty}\|s(x,y)\|=0, the dominated convergence theorem implies that the multiplicative factor in the above expression goes to 0 as x\|x\| goes to infinity for each fixed value of XX^{*}.

Corollary 5.1.

For heavy tailed models, i.e. when limxs(x,y)=0\lim_{\|x\|\to\infty}\|s(x,y)\|=0 for all yy, there exists an MM such that for all x>M\|x\|>M, |bcv,i(x)|<|bss,i(x)||b_{\text{cv},i}(x)|<|b_{\text{ss},i}(x)|, i=1,,di=1,\dots,d. Moreover, limxbcv,i(x)=0\lim_{\|x\|\to\infty}b_{\text{cv},i}(x)=0 for all i=1,,di=1,\dots,d.

Hence, the ZZ-SS will outperform ZZ-CV in tails in terms of the speed of convergence when the model is heavy-tailed. This happens because control variates in general require correlation between terms to offer variance reduction. For heavy-tailed models, as the s(,y)s(\cdot,y) function becomes flatter in the tails, the terms s(x,Y)s(x,Y) and s(X,Y)s(X^{*},Y) become increasingly uncorrelated. Consequently, the ZZ-CV drift becomes extremely slow. A naive but sensible thing to do in such situations is to run ZZ-SS in the tails and ZZ-CV when the process is closer to the centre. More explicitly, consider the mixed sub-sampling scheme where EijE^{j}_{i} is defined by,

Eij(x)={sij(x);x>Msij(x)sij(xn)+n1k=1nsik(xn);xM.E_{i}^{j}(x)=\begin{cases}s^{j}_{i}(x);&\|x\|>M\\ s^{j}_{i}(x)-s^{j}_{i}(x^{*}_{n})+n^{-1}\sum_{k=1}^{n}s^{k}_{i}(x^{*}_{n});&\|x\|\leq M.\end{cases} (5.1)

The resulting algorithm is still exact. This is because the switching rates only need to satisfy a local condition to guarantee the invariance of a given target distribution. Since both the ZZ-SS and ZZ-CV switching rates satisfy the condition individually, so does the rate for the mixed scheme. More importantly, the mixed sub-sampling scheme exhibits much faster convergence than both ZZ-SS and ZZ-CV. As an illustration, consider the Cauchy example from the earlier section. The point of intersection of bssb_{\text{ss}} and bcvb_{\text{cv}} in the last panel of Figure 6 was obtained numerically to be 1.605\approx 1.605. For comparison purposes, we apply the mixed scheme in (5.1) with M=1.605M=1.605. Figure 9 plots trajectories of the Zig-Zag process under different sub-sampling schemes for the Cauchy model. The solid black line corresponds to the mixed algorithm described above. The proposed method provides an improvement over both ZZ-SS and ZZ-CV.
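A minimal sketch of the mixed estimator (5.1) for the one-dimensional Cauchy model; the data, the choice of the sample median as reference point, and the threshold M = 1.605 are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.standard_cauchy(1000)        # observations from the Cauchy model, x0 = 0
M = 1.605                            # switch radius, from the intersection in Figure 6

def s(x, j):
    # Score contribution s^j(x) of observation(s) j (vectorized over j)
    return 2 * (x - y[j]) / (1 + (x - y[j]) ** 2)

x_star = np.median(y)                               # reference point x_n^*
s_bar_star = np.mean(s(x_star, np.arange(len(y))))  # n^{-1} sum_k s^k(x_n^*)

def E_mixed(x, j):
    # Mixed one-sample gradient estimate (5.1): plain sub-sampling (ZZ-SS)
    # in the tails, control variates (ZZ-CV) inside the ball of radius M
    if abs(x) > M:
        return s(x, j)
    return s(x, j) - s(x_star, j) + s_bar_star
```

Both branches are unbiased for the full-data average n^{-1} sum_j s^j(x) when j is drawn uniformly, which is why the mixed algorithm remains exact.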

Figure 9: Trajectories of Zig-Zag process for different sub-sampling schemes targeting posterior from a Cauchy model.

More sophisticated choices of control variates might achieve greater variance reduction [see 4, and references therein]. However, these may come at the cost of exactness and, in some cases, even of the Markov property of the resulting algorithm.

On the other hand, close to x0x_{0}, we argued in the previous section that the limiting drift for ZZ-SS is 0. Indeed by dominated convergence, we have that limxx0bss,i(x)=0\lim_{x\to x_{0}}b_{\text{ss},i}(x)=0 for all i=1,,di=1,\dots,d. Further, suppose the reference points are chosen such that X=x0X^{*}=x_{0} almost surely. When d=1d=1,

|bcv(x)|=|𝔼Y[S(x;Y)]𝔼Y[|S(x,Y)S(x0,Y)|]|=|𝔼Y[S(x(Y);Y)]𝔼Y[|S(x(Y);Y)|]|,\displaystyle|b_{\text{cv}}(x)|=\left|\frac{\mathbb{E}_{Y}[S(x;Y)]}{\mathbb{E}_{Y}\left[\left|S(x,Y)-S(x_{0},Y)\right|\right]}\right|=\left|\frac{\mathbb{E}_{Y}[S^{\prime}(x(Y);Y)]}{\mathbb{E}_{Y}\left[\left|S^{\prime}(x(Y);Y)\right|\right]}\right|,

where x(Y)x(Y) lies between xx and x0x_{0}. When the model is log-concave, S(x(Y),Y)>0S^{\prime}(x(Y),Y)>0 almost surely for all xx. We get |bcv(x)|=1|b_{\text{cv}}(x)|=1 for all xx0x\neq x_{0}, and ZZ-CV is equivalent to the canonical Zig-Zag. This is true for any value of mm. In general, near stationarity we get,

limxx0|bcv(x)|=|𝔼Y[S(x0;Y)]𝔼Y[|S(x0;Y)|]|>0.\lim_{x\to x_{0}}|b_{\text{cv}}(x)|=\left|\frac{\mathbb{E}_{Y}[S^{\prime}(x_{0};Y)]}{\mathbb{E}_{Y}\left[\left|S^{\prime}(x_{0};Y)\right|\right]}\right|>0.

Hence, close to x0x_{0}, the ZZ-CV process performs better than ZZ-SS. But when the model is not log-concave, the limit in the above expression is less than 11 and the ZZ-CV process is sub-optimal. Observe also that the right-hand side is equal to the absolute value of the drift defined in terms of the asymptotic switching rate (4.5) for ZZ-CV when X=0X^{*}=0.
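The log-concave case is easy to verify on the Normal location model, where s(x, y) = x - y and the control-variate integrand S(x, Y) - S(x0, Y) = x - x0 is constant in Y. A short illustrative sketch with x0 = 0 (sample size and seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, 100_000)    # data from the Normal(x0, 1) model, x0 = 0

def b_cv(x, x0=0.0):
    # ZZ-CV drift with X* = x0; for the Gaussian score the integrand
    # (x - Y) - (x0 - Y) = x - x0 is deterministic, so |b_cv(x)| = 1
    # for every x != x0, matching the canonical Zig-Zag drift
    return -np.mean(x - y) / np.mean(np.abs((x - y) - (x0 - y)))
```

Up to Monte Carlo error in the numerator, b_cv(x) equals -sign(x - x0), i.e. the canonical drift b_can.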

6 Discussion

In this paper, we study the sub-sampling versions of the Zig-Zag algorithm in the setting of Bayesian inference with big data. Our objective in undertaking this study has been to characterize the algorithmic complexity of these algorithms as a function of the data size nn. We do so by analyzing the behaviour of the underlying Zig-Zag process as nn goes to infinity. Based on our results, we estimate that ZZ-CV has a total algorithmic complexity of O(1)O(1) in stationarity while both canonical Zig-Zag and ZZ-SS have O(n)O(n). This reveals that the Zig-Zag algorithm is no worse than traditional reversible MCMC algorithms such as the random walk Metropolis, which also has a complexity of O(n)O(n) in stationarity [30]. Moreover, control variates can be utilized to gain a further order of magnitude in total computational cost. Our results also provide insights into the convergence time of Zig-Zag samplers initialized away from stationarity. As expected, the canonical Zig-Zag converges to stationarity with optimal speed. The sub-sampling versions remain sub-optimal in general. However, while the corresponding drift for ZZ-SS goes to 0 in stationarity, that of ZZ-CV remains non-zero, achieving optimal speed for log-concave model densities.

Based on the above observations, our results strongly support the use of the Zig-Zag algorithm for sampling from Bayesian posteriors for big data. However, we also caution that the superefficient performance of ZZ-CV relies heavily on good reference points. Inappropriately chosen reference points can lead to undesirable behaviour, as discussed in Sections 4 and 5. To mitigate the effect of bad reference points in the transient phase, we propose a mixed scheme whereby vanilla sub-sampling is used in the initial phase of the algorithm and control variates are introduced later, when the sampler is close to stationarity. We empirically show that such a scheme can achieve faster convergence than both ZZ-SS and ZZ-CV. Nevertheless, this also calls for investigation into more sophisticated control variates, which is a direction for future research.

The asymptotic behaviour of both ZZ-SS and ZZ-CV depends on the choice of the batch size mm. Theoretically, a large batch size mm implies faster convergence and better mixing. In practice, a larger mm also means additional cost per iteration. We show that it is optimal to choose batch size m=1m=1 in terms of total algorithmic complexity. Secondly, we assume that the subsamples are drawn randomly without replacement. As a result, we can invoke the law of large numbers for UU-statistics to obtain our limit results. There is no added generality in allowing sampling with replacement: the switching rates then resemble VV-statistics of order mm, which can be expressed explicitly as a combination of UU-statistics, and the strong law of large numbers still holds [27]. In practice, however, it is better to sample without replacement as it leads to a smaller variance in the implementation.

In this paper, we have limited ourselves to the analysis of the underlying PDMP to study the algorithmic complexity of Zig-Zag algorithms. However, in practice, this further depends on the computational bounds used in Poisson thinning. These computational bounds determine the number of proposed switches in the algorithm and directly affect the total complexity of the sampler. Consider, for example, the logistic regression model from Section 5. When the weight distribution QQ is (sub-)Gaussian, the computational bounds scale as O(nlogn)O(n\sqrt{\log n}) for ZZ-SS and O(nlogn)O(n\log n) for canonical Zig-Zag and ZZ-CV [10]. The total complexity in stationarity then scales as O(n3/2logn)O(n^{3/2}\log n) for canonical Zig-Zag, O(nlogn)O(n\sqrt{\log n}) for ZZ-SS, and O(n1/2logn)O(n^{1/2}\log n) for ZZ-CV. On the other hand, when QQ is supported on a bounded set, the complexity in stationarity scales as O(n3/2)O(n^{3/2}) for canonical Zig-Zag, O(n)O(n) for ZZ-SS, and O(n1/2)O(n^{1/2}) for ZZ-CV. In general, the magnitude of the tightest computational bounds will depend, in addition to the sub-sampling scheme, on the true data-generating distribution and the chosen model.

The dynamics of a Zig-Zag process only depend on the switching rates and the relationship between switching rates in opposite directions. To obtain our results in this paper, we only rely on the fact that switching rates are sufficiently smooth and they grow to infinity at a certain rate. Although, in this setting, it is a consequence of posterior contraction, similar results as in Sections 3 and 4 can be obtained more generally for PDMPs with growing event rates. Such results can be used to compare different PDMP samplers such as the Bouncy particle sampler [16] or the Coordinate sampler [33], and to provide further guidance to practitioners. Thus, we envision the extension of our results to other PDMP methods as important future work.

Appendix A Asymptotic switching rates and the drift

A.1 Proof of Propositions 3.1 and 3.2

First, consider the simple case when Eij(x)=sij(x)E^{j}_{i}(x)=s^{j}_{i}(x). Then for each ii, the terms Eij,j=1,,nE^{j}_{i},j=1,\dots,n are independent and identically distributed. Let m(n)=mm(n)=m be fixed for some mm. For each ii, observe that for all n>mn>m, λin(x,v)/n\lambda^{n}_{i}(x,v)/n is a UU-statistic of degree mm with kernel k(y1,,ym)=(vim1jsi(x,yj))+k(y_{1},\dots,y_{m})=\left(v_{i}\cdot m^{-1}\sum_{j}s_{i}(x,y_{j})\right)_{+} [see 3] indexed by (x,v)(x,v). By the law of large numbers for UU-statistics,

λss,in(x,v)nn𝔼Y1,,YmiidP[(vim1j=1mSi(x;Yj))+]Palmost surely,\frac{\lambda^{n}_{\text{ss},i}(x,v)}{n}\xrightarrow{n\to\infty}\mathbb{E}_{Y_{1},\dots,Y_{m}\overset{iid}{\sim}P}\left[\left(v_{i}\cdot m^{-1}\sum_{j=1}^{m}S_{i}(x;Y_{j})\right)_{+}\right]\quad P^{\otimes\mathbb{N}}-\text{almost surely}, (A.1)

for all (x,v)E(x,v)\in E and i=1,,di=1,\dots,d. Furthermore, given the smoothness assumption (2.1), the convergence is uniform on all compact sets.

Suppose m(n)m(n)\to\infty as nn\to\infty. Note that when m(n)=nm(n)=n, this corresponds to the canonical rates. Given (yj)j=1n(y_{j})_{j=1}^{n}, ES(x):=(jSEij(x))/m(n)E_{S}(x):=(\sum_{j\in S}E^{j}_{i}(x))/m(n) is the value of the SRSWOR estimator of the population mean E¯(x):=n1j=1nEij(x)\bar{E}(x):=n^{-1}\sum_{j=1}^{n}E^{j}_{i}(x) when the sample S𝒮(n,m(n))S\in\mathcal{S}_{(n,m(n))} is observed. Consider,

|λin(x,v)n(vij=1nsij(x))+n|\displaystyle\left|\frac{\lambda^{n}_{i}(x,v)}{n}-\frac{(v_{i}\cdot\sum_{j=1}^{n}s^{j}_{i}(x))_{+}}{n}\right| =|1|𝒮(n,m(n))|S𝒮(n,m(n))(viES(x))+(viE¯(x))+|\displaystyle=\left|\frac{1}{|\mathcal{S}_{(n,m(n))}|}\sum_{S\in\mathcal{S}_{(n,m(n))}}\left(v_{i}\cdot E_{S}(x)\right)_{+}-\left(v_{i}\cdot\bar{E}(x)\right)_{+}\right|
1|𝒮(n,m(n))|S𝒮(n,m(n))|(viES(x))+(viE¯(x))+|\displaystyle\leq\frac{1}{|\mathcal{S}_{(n,m(n))}|}\sum_{S\in\mathcal{S}_{(n,m(n))}}\left|\left(v_{i}\cdot E_{S}(x)\right)_{+}-\left(v_{i}\cdot\bar{E}(x)\right)_{+}\right|
1|𝒮(n,m(n))|S𝒮(n,m(n))|ES(x)E¯(x)|.\displaystyle\leq\frac{1}{|\mathcal{S}_{(n,m(n))}|}\sum_{S\in\mathcal{S}_{(n,m(n))}}\left|E_{S}(x)-\bar{E}(x)\right|.

The last term on the right in the above equation is the mean absolute deviation about the mean of the SRSWOR estimator ESE_{S}. This can be bounded above by the standard error of the estimator.

Given a population (Yi)i=1n(Y_{i})_{i=1}^{n}, the variance of the SRSWOR estimator, y¯\bar{y}, for estimating population mean, Y¯\bar{Y}, based on a sample of size mm, is given by [17],

V(y¯)=(nmnm)1n1i=1n(YiY¯)2.V(\bar{y})=\left(\frac{n-m}{n\cdot m}\right)\frac{1}{n-1}\sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2}.

Applied to the present context, this implies

|λin(x,v)n(vij=1nsij(x))+n|\displaystyle\left|\frac{\lambda^{n}_{i}(x,v)}{n}-\frac{(v_{i}\cdot\sum_{j=1}^{n}s^{j}_{i}(x))_{+}}{n}\right| (nm(n)nm(n))nn11nj=1n(EijE¯)2.\displaystyle\leq\sqrt{\left(\frac{n-m(n)}{n\cdot m(n)}\right)\frac{n}{n-1}}\sqrt{\frac{1}{n}\sum_{j=1}^{n}(E^{j}_{i}-\bar{E})^{2}}.

But also,

1nj=1n(Eij(x)E¯(x))2nVarYP(Si(x;Y))<\frac{1}{n}\sum_{j=1}^{n}(E^{j}_{i}(x)-\bar{E}(x))^{2}\xrightarrow{n\to\infty}\text{Var}_{Y\sim P}(S_{i}(x;Y))<\infty

uniformly on compact sets almost surely by the law of large numbers. And thus, if m(n)m(n)\to\infty as nn\to\infty, the right-hand side goes to 0 uniformly in xx almost surely. Consequently, we have,

λss,in(x,v)nn(vi𝔼YPSi(x;Y))+.\frac{\lambda^{n}_{\text{ss},i}(x,v)}{n}\xrightarrow{n\to\infty}\left(v_{i}\cdot\mathbb{E}_{Y\sim P}S_{i}(x;Y)\right)_{+}. (A.2)
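As an aside, the SRSWOR variance formula invoked in the bound above is easily verified by exhaustive enumeration over all samples from a toy finite population (an illustrative sketch; the population and sizes are arbitrary):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
Y = rng.normal(size=8)               # a small finite population
n, m = len(Y), 3
Ybar = Y.mean()

# Exact variance of the SRSWOR sample mean, averaging over all C(n, m) samples
sample_means = np.array([Y[list(S)].mean() for S in combinations(range(n), m)])
var_enum = np.mean((sample_means - Ybar) ** 2)

# Closed-form finite-population variance used in the proof
var_formula = (n - m) / (n * m) * np.sum((Y - Ybar) ** 2) / (n - 1)
```

The mean absolute deviation bound then follows from Jensen's inequality, E|ȳ - Ȳ| ≤ √V(ȳ).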

Now consider the ZZ-CV case i.e.

Eij(x)=sij(x)sij(xn)+n1k=1nsik(xn),E^{j}_{i}(x)=s^{j}_{i}(x)-s^{j}_{i}(x^{*}_{n})+n^{-1}\sum_{k=1}^{n}s^{k}_{i}(x^{*}_{n}),

where xn=Tn(𝒚(n))x^{*}_{n}=T_{n}(\boldsymbol{y}^{(n)}) is a realization of XnX^{*}_{n}. The sequence of EijE^{j}_{i} is no longer independent and so the switching rate does not resemble a UU-statistic. Define GijG^{j}_{i} as,

Gij(x)=sij(x)(sij(X)𝔼YPSi(X;Y)).G^{j}_{i}(x)=s^{j}_{i}(x)-(s^{j}_{i}(X^{*})-\mathbb{E}_{Y\sim P}S_{i}(X^{*};Y)).

Let xx^{*} be the realization of XX^{*} such that xnxx^{*}_{n}\to x^{*}. Given X=xX^{*}=x^{*}, GijG^{j}_{i} are conditionally independent since XX^{*} is independent of 𝒀()\boldsymbol{Y}^{(\mathbb{N})}. Consider the difference,

|1|𝒮(n,m(n))|S𝒮(n,m(n))(vijSEij(x)m(n))+1|𝒮(n,m(n))|S𝒮(n,m(n))(vijSGij(x)m(n))+|\displaystyle\left|\frac{1}{|\mathcal{S}_{(n,m(n))}|}\sum_{S\in\mathcal{S}_{(n,m(n))}}\left(v_{i}\cdot\frac{\sum_{j\in S}E^{j}_{i}(x)}{m(n)}\right)_{+}-\frac{1}{|\mathcal{S}_{(n,m(n))}|}\sum_{S\in\mathcal{S}_{(n,m(n))}}\left(v_{i}\cdot\frac{\sum_{j\in S}G^{j}_{i}(x)}{m(n)}\right)_{+}\right|
1m(n)|𝒮(n,m(n))|S𝒮(n,m(n))jS|Eij(x)Gij(x)|=n1j=1n|Eij(x)Gij(x)|.\displaystyle\quad\quad\quad\leq\frac{1}{m(n)|\mathcal{S}_{(n,m(n))}|}\sum_{S\in\mathcal{S}_{(n,m(n))}}\sum_{j\in S}\left|E^{j}_{i}(x)-G^{j}_{i}(x)\right|=n^{-1}\sum_{j=1}^{n}|E^{j}_{i}(x)-G^{j}_{i}(x)|. (A.3)

However,

|Eij(x)Gij(x)|\displaystyle|E^{j}_{i}(x)-G^{j}_{i}(x)| =|sij(xn)+sij(x)+n1k=1nsik(xn)𝔼YSi(x)|\displaystyle=\left|-s^{j}_{i}(x_{n}^{*})+s^{j}_{i}(x^{*})+n^{-1}\sum_{k=1}^{n}s^{k}_{i}(x_{n}^{*})-\mathbb{E}_{Y}S_{i}(x^{*})\right|
|sij(xn)sij(x)|+|n1k=1nsik(xn)𝔼YSi(x)|\displaystyle\leq\left|s^{j}_{i}(x_{n}^{*})-s^{j}_{i}(x^{*})\right|+\left|n^{-1}\sum_{k=1}^{n}s^{k}_{i}(x_{n}^{*})-\mathbb{E}_{Y}S_{i}(x^{*})\right|

for all xx\in\mathbb{R}. A second order Taylor’s expansion gives,

sij(xn)sij(z)=sij(z)(xnz)+12(xnz)T2sij(zj)(xnz)s_{i}^{j}(x^{*}_{n})-s^{j}_{i}(z)=\nabla s^{j}_{i}(z)\cdot(x^{*}_{n}-z)+\frac{1}{2}(x^{*}_{n}-z)^{T}\nabla^{\otimes 2}s^{j}_{i}(z^{j})(x^{*}_{n}-z)

where zj=z+θj(xnz)z^{j}=z+\theta^{j}(x^{*}_{n}-z) for some θj(0,1)\theta^{j}\in(0,1). Under the assumption that third derivatives are bounded (Assumption 2.1),

|sij(xn)sij(x)|sij(x)xnx+Mxnx2|s_{i}^{j}(x^{*}_{n})-s^{j}_{i}(x^{*})|\leq\|\nabla s^{j}_{i}(x^{*})\|\cdot\|x^{*}_{n}-x^{*}\|+M^{\prime}\|x^{*}_{n}-x^{*}\|^{2}

Then,

n1j=1n|Eij(x)Gij(x)|\displaystyle n^{-1}\sum_{j=1}^{n}|E^{j}_{i}(x)-G^{j}_{i}(x)| n1xnxj=1nsij(x)+Mxnx2\displaystyle\leq n^{-1}\|x^{*}_{n}-x^{*}\|\sum_{j=1}^{n}\|\nabla s^{j}_{i}(x^{*})\|+M^{\prime}\|x^{*}_{n}-x^{*}\|^{2}
+|n1k=1nsik(xn)𝔼YSi(x)|\displaystyle\quad\quad\quad\quad+\left|n^{-1}\sum_{k=1}^{n}s^{k}_{i}(x_{n}^{*})-\mathbb{E}_{Y}S_{i}(x^{*})\right|

Given that 𝔼[Si(x;Y)]<\mathbb{E}[\|\nabla S_{i}(x;Y)\|]<\infty for all xx, and xnxx^{*}_{n}\to x^{*}, each term on the right-hand side goes to 0. Since the above does not depend on xx, this convergence is uniform on compact sets. This implies that (A.3) goes to 0 uniformly almost surely. Now, given X=xX^{*}=x^{*}, the GijG^{j}_{i} are independent and identically distributed. And so, λin(x,v)/n\lambda^{n}_{i}(x,v)/n is asymptotically equivalent to,

1|𝒮(n,m(n))|S𝒮(n,m(n))(vi(m(n))1jSGij(x))+\frac{1}{|\mathcal{S}_{(n,m(n))}|}\sum_{S\in\mathcal{S}_{(n,m(n))}}\left(v_{i}\cdot(m(n))^{-1}\sum_{j\in S}G^{j}_{i}(x)\right)_{+}

When mm is fixed, the above is a UU-statistic indexed by xx. Then by the law of large numbers again,

λcv,in(x,v)nn𝔼Y1,,YmiidP[(vim1j=1m{Si(x,Yj)Si(X,Yj)+𝔼YPSi(X;Y)})+],\frac{\lambda^{n}_{\text{cv},i}(x,v)}{n}\xrightarrow{n\to\infty}\mathbb{E}_{Y_{1},\dots,Y_{m}\overset{iid}{\sim}P}\left[\left(v_{i}\cdot m^{-1}\sum_{j=1}^{m}\left\{S_{i}(x,Y_{j})-S_{i}(X^{*},Y_{j})+\mathbb{E}_{Y\sim P}S_{i}(X^{*};Y)\right\}\right)_{+}\right], (A.4)

uniformly on compact sets PP^{\otimes\mathbb{N}}-almost surely. When m(n)m(n)\to\infty, a similar argument as before gives,

λcv,in(x,v)nn(vi𝔼YP(Si(x;Y)))+.\frac{\lambda^{n}_{\text{cv},i}(x,v)}{n}\xrightarrow{n\to\infty}(v_{i}\cdot\mathbb{E}_{Y\sim P}(S_{i}(x;Y)))_{+}. (A.5)

This covers all the cases and, hence, proves Proposition 3.1. The limiting rate λi\lambda_{i} is given by the right hand side of equations (A.1), (A.2), (A.4), and (A.5).

For the proof of Proposition 3.2, we first recall the definition of bnb^{n} i.e.,

bin(x)={λin(x,𝟏d)λin(x,𝟏d)λin(x,𝟏d)+λin(x,𝟏d),xHin,0,otherwise.b^{n}_{i}(x)=\begin{cases}\dfrac{\lambda^{n}_{i}(x,-\boldsymbol{1}_{d})-\lambda^{n}_{i}(x,\boldsymbol{1}_{d})}{\lambda^{n}_{i}(x,-\boldsymbol{1}_{d})+\lambda^{n}_{i}(x,\boldsymbol{1}_{d})},&x\notin H^{n}_{i},\\ 0,&\text{otherwise}.\end{cases} (A.6)

Let ω\omega be a point from the almost sure set in Proposition 3.1. For this ω\omega, it is then straightforward that the numerator and the denominator in binb^{n}_{i} converge, uniformly on all compact subsets of d\mathbb{R}^{d}, to λi(x,𝟏d)λi(x,𝟏d)\lambda_{i}(x,-\boldsymbol{1}_{d})-\lambda_{i}(x,\boldsymbol{1}_{d}) and λi(x,𝟏d)+λi(x,𝟏d)\lambda_{i}(x,-\boldsymbol{1}_{d})+\lambda_{i}(x,\boldsymbol{1}_{d}) respectively. Let KK be a compact subset of HcH^{c}. Since λi\lambda_{i} is continuous, there exists an ϵ>0\epsilon>0 such that λi(x,𝟏d)+λi(x,𝟏d)>ϵ\lambda_{i}(x,-\boldsymbol{1}_{d})+\lambda_{i}(x,\boldsymbol{1}_{d})>\epsilon for all xKx\in K. Thus, the fraction binb^{n}_{i} converges uniformly to bib_{i} on KK for all i=1,,di=1,\dots,d. Finally, observe that for any choice of switching rates,

bi(x)=𝔼YP[Si(x;Y)]λi(x,𝟏d)+λi(x,𝟏d),b_{i}(x)=\frac{-\mathbb{E}_{Y\sim P}[S_{i}(x;Y)]}{\lambda_{i}(x,-\boldsymbol{1}_{d})+\lambda_{i}(x,\boldsymbol{1}_{d})},

is locally Lipschitz on HcH^{c} and hence a unique solution to the ODE (3.3) exists.

Appendix B Fluid limits in the transient phase

B.1 Generator of the Zig-Zag process

Let (Zt)t0=(Xt,Vt)t0(Z_{t})_{t\geq 0}=(X_{t},V_{t})_{t\geq 0} be a dd-dimensional Zig-Zag process evolving on E=d×{1,1}dE=\mathbb{R}^{d}\times\{-1,1\}^{d} with switching rates (λi)i=1d(\lambda_{i})_{i=1}^{d}. Define an operator \mathcal{L} with domain 𝒟()\mathcal{D}(\mathcal{L}) where

𝒟()={f𝒞(E):f(,v) is absolutely continuous for all v𝒱}\mathcal{D}(\mathcal{L})=\{f\in\mathcal{C}(E):f(\cdot,v)\text{ is absolutely continuous for all }v\in\mathcal{V}\}

by

f(x,v)=vDf(x,v)+i=1dλi(x,v){f(x,Fi(v))f(x,v)},(x,v)E,\mathcal{L}f(x,v)=v\cdot\textbf{D}f(x,v)+\sum_{i=1}^{d}\lambda_{i}(x,v)\{f(x,F_{i}(v))-f(x,v)\},\quad(x,v)\in E, (B.1)

where D denotes the weak derivative operator in d\mathbb{R}^{d} such that for all (x,v)E(x,v)\in E,

f(x+tv,v)f(x,v)=0tvDf(x+vs,v)𝑑s,t0.f(x+tv,v)-f(x,v)=\int_{0}^{t}v\cdot\textbf{D}f(x+vs,v)\ ds,\quad t\geq 0.

When ff is differentiable in its first argument, Df=f\textbf{D}f=\nabla f, where f\nabla f denotes the gradient of ff. The pair (,𝒟())(\mathcal{L},\mathcal{D}(\mathcal{L})) is the extended generator of the Markov semigroup of the Zig-Zag process (Zt)t0(Z_{t})_{t\geq 0} (see [20]).

B.2 Proof of Theorem 3.3

Fix 1/2<δ<11/2<\delta<1. Define a sequence {W^n}n=1\{\hat{W}^{n}\}_{n=1}^{\infty} of discrete time processes as follows:

W^kn=Z~k/nδn;k{0}.\hat{W}^{n}_{k}=\tilde{Z}^{n}_{k/n^{\delta}};\quad k\in\{0\}\cup\mathbb{N}.

For t0t\geq 0 and all nn, let Wtn=W^[tnδ]n=Z~[tnδ]/nδnW^{n}_{t}=\hat{W}^{n}_{[tn^{\delta}]}=\tilde{Z}^{n}_{[tn^{\delta}]/n^{\delta}}. Denote by X~tn\tilde{X}^{n}_{t} the position component of the stopped process Z~tn\tilde{Z}^{n}_{t}. For all t0t\geq 0, |t[tnδ]/nδ|nδ|t-[tn^{\delta}]/n^{\delta}|\leq n^{-\delta}. Since the position process moves with unit speed, for all T>0T>0,

sup0sTX~[snδ]/nδnX~sndnδ.\sup_{0\leq s\leq T}\|\tilde{X}^{n}_{[sn^{\delta}]/n^{\delta}}-\tilde{X}^{n}_{s}\|\leq\sqrt{d}n^{-\delta}.

Hence, it suffices to show the weak convergence of X~[tnδ]/nδn\tilde{X}^{n}_{[tn^{\delta}]/n^{\delta}}.

For an arbitrary test function f𝒞c(d)f\in\mathcal{C}_{c}^{\infty}(\mathbb{R}^{d}), define fΔf_{\Delta} to be the restriction of ff to HcΔH^{c}\cup\Delta where we set f(Δ)=0f(\Delta)=0. In fact, it is enough to consider functions in 𝒞c(d)\mathcal{C}_{c}^{\infty}(\mathbb{R}^{d}) for which the compact support KHcK\subset H^{c}. These functions will then form a core for the generator of the limiting ODE. Also, define for all nn, fΔn:d×𝒱f^{n}_{\Delta}:\mathbb{R}^{d}\times\mathcal{V}\to\mathbb{R} as,

fΔn(x,v)={f(x),(x,v)(Hn)c×𝒱,0,otherwise.f^{n}_{\Delta}(x,v)=\begin{cases}f(x),&(x,v)\in(H^{n})^{c}\times\mathcal{V},\\ 0,&\text{otherwise}.\end{cases}

It is clear that for all (x,v)(HnH)c×𝒱(x,v)\in(H^{n}\cup H)^{c}\times\mathcal{V}, fΔn(x,v)=fΔ(x)f^{n}_{\Delta}(x,v)=f_{\Delta}(x). Let n\mathcal{L}^{n} be the generator of the stopped process Z~tn\tilde{Z}^{n}_{t}. For all nn, fΔnf^{n}_{\Delta} belongs to the domain of the generator n\mathcal{L}^{n} and by (B.1),

nfΔn(x,v)={vf(x),x(Hn)c,v𝒱,0,otherwise.\mathcal{L}^{n}f^{n}_{\Delta}(x,v)=\begin{cases}v\cdot\nabla f(x),&x\in(H^{n})^{c},v\in\mathcal{V},\\ 0,&\text{otherwise}.\end{cases}

Moreover, if \mathcal{L} denotes the generator of X~t\tilde{X}_{t}, then,

fΔ(x)={b(x)f(x),xHc,0,otherwise.\mathcal{L}f_{\Delta}(x)=\begin{cases}b(x)\cdot\nabla f(x),&x\in H^{c},\\ 0,&\text{otherwise}.\end{cases}

By construction for all nn and T>0T>0, (Z~tn((Hn)c×𝒱)Δ, 0tT)=1\mathbb{P}(\tilde{Z}^{n}_{t}\in((H^{n})^{c}\times\mathcal{V})\cup\Delta,\ 0\leq t\leq T)=1. For all nn, define a sequence of sets H¯n\bar{H}^{n} as,

H¯n={xd:d(x,i=1dHin)>nδd}.\bar{H}^{n}=\{x\in\mathbb{R}^{d}:d(x,\cup_{i=1}^{d}H^{n}_{i})>n^{-\delta}\sqrt{d}\}.

For all nn, H¯n(Hn)c\bar{H}^{n}\subset(H^{n})^{c}. As nn\to\infty, (Hn)cH¯n(H^{n})^{c}\setminus\bar{H}^{n}\to\emptyset. Let En=(H¯n×𝒱)ΔE^{n}=(\bar{H}^{n}\times\mathcal{V})\cup\Delta. Then,

limn(Z~tnEn, 0tT)=1.\lim_{n\to\infty}\mathbb{P}(\tilde{Z}^{n}_{t}\in E^{n},\ 0\leq t\leq T)=1.

Moreover, since the position component moves with unit speed in each direction, for all xH¯nx\in\bar{H}^{n}, X~sn(Hn)c\tilde{X}^{n}_{s}\in(H^{n})^{c} for all snδs\leq n^{-\delta}. Now, let G^n,δ\hat{G}^{n,\delta} be the discrete time generator of W^n\hat{W}^{n}. Then, for all xH¯nx\in\bar{H}^{n},

G^n,δfΔn(x,v)\displaystyle\hat{G}^{n,\delta}f^{n}_{\Delta}(x,v) =𝔼[fΔn(W^1n)fΔn(W^0n)|W^0n=(x,v)]\displaystyle=\mathbb{E}\left[f^{n}_{\Delta}(\hat{W}^{n}_{1})-f^{n}_{\Delta}(\hat{W}^{n}_{0})\ |\hat{W}^{n}_{0}=(x,v)\right]
=𝔼[fΔn(Z~1/nδn)fΔn(Z~0n)|Z0n=(x,v)]\displaystyle=\mathbb{E}\left[f^{n}_{\Delta}(\tilde{Z}^{n}_{1/n^{\delta}})-f^{n}_{\Delta}(\tilde{Z}^{n}_{0})\ |Z^{n}_{0}=(x,v)\right]
=𝔼(x,v)01/nδnfΔn(Z~sn)𝑑s\displaystyle=\mathbb{E}_{(x,v)}\int_{0}^{1/n^{\delta}}\mathcal{L}^{n}f^{n}_{\Delta}(\tilde{Z}^{n}_{s})\ ds
=𝔼(x,v)0nδV~snf(X~sn)𝑑s.\displaystyle=\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}\tilde{V}^{n}_{s}\cdot\nabla f(\tilde{X}^{n}_{s})\ ds.

The infinitesimal generator Gn,δG^{n,\delta} of WnW^{n} is then given by,

Gn,δfΔn(x,v)=nδG^n,δfΔn(x,v)=1nδ𝔼(x,v)0nδV~snf(X~sn)𝑑s.G^{n,\delta}f^{n}_{\Delta}(x,v)=n^{\delta}\cdot\hat{G}^{n,\delta}f^{n}_{\Delta}(x,v)=\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}\tilde{V}^{n}_{s}\cdot\nabla f(\tilde{X}^{n}_{s})\ ds.

Also, we have Gn,δfΔn(Δ)=0G^{n,\delta}f^{n}_{\Delta}(\Delta)=0. Because the position process moves with unit speed in each coordinate, the change in the gradient term f(X~sn)\nabla f(\tilde{X}^{n}_{s}) during the short time interval (0,nδ)(0,n^{-\delta}) is of the order nδn^{-\delta}. Hence, it is possible to replace X~sn\tilde{X}^{n}_{s} by X~0n\tilde{X}^{n}_{0}. Define G~n,δ\tilde{G}^{n,\delta} as,

G~n,δfΔn(x,v)=1nδ𝔼(x,v)0nδV~snf(x)𝑑s,x(Hn)c,v𝒱,\tilde{G}^{n,\delta}f^{n}_{\Delta}(x,v)=\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}\tilde{V}^{n}_{s}\cdot\nabla f(x)\ ds,\quad x\in(H^{n})^{c},v\in\mathcal{V}, (B.2)

and 0 otherwise.

Lemma B.1.

The following is true:

limnsup(x,v)En|Gn,δfΔn(x,v)G~n,δfΔn(x,v)|=0.\lim_{n\to\infty}\sup_{(x,v)\in E^{n}}\left|G^{n,\delta}f^{n}_{\Delta}(x,v)-\tilde{G}^{n,\delta}f^{n}_{\Delta}(x,v)\right|=0.
Proof.

Since fCc(d)f\in C^{\infty}_{c}(\mathbb{R}^{d}), we have for some C<C<\infty and all xH¯nx\in\bar{H}^{n},

|Gn,δfΔn(x,v)G~n,δfΔn(x,v)|\displaystyle\left|G^{n,\delta}f^{n}_{\Delta}(x,v)-\tilde{G}^{n,\delta}f^{n}_{\Delta}(x,v)\right| =1nδ|𝔼(x,v)0nδV~snf(X~sn)V~snf(X0n)ds|\displaystyle=\frac{1}{n^{-\delta}}\left|\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}\tilde{V}^{n}_{s}\cdot\nabla f(\tilde{X}^{n}_{s})-\tilde{V}^{n}_{s}\cdot\nabla f(X^{n}_{0})\ ds\right|
1nδ𝔼(x,v)0nδ|V~sn(f(X~sn)f(X0n))|𝑑s\displaystyle\leq\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}\left|\tilde{V}^{n}_{s}\cdot(\nabla f(\tilde{X}^{n}_{s})-\nabla f(X^{n}_{0}))\right|\ ds
dnδ𝔼(x,v)0nδf(X~sn)f(X0n)𝑑s\displaystyle\leq\frac{\sqrt{d}}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}\left\|\nabla f(\tilde{X}^{n}_{s})-\nabla f(X^{n}_{0})\right\|\ ds
dnδ𝔼(x,v)0nδCX~snX0n𝑑s\displaystyle\leq\frac{\sqrt{d}}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}C\left\|\tilde{X}^{n}_{s}-X^{n}_{0}\right\|\ ds
Cdnδ.\displaystyle\leq Cdn^{-\delta}.

The last inequality follows because XnX^{n} moves with the unit speed in each coordinate. The right-hand side goes to 0 since δ>0\delta>0. ∎

Thus, we can replace Gn,δG^{n,\delta} by G~n,δ\tilde{G}^{n,\delta}. Recall that VtnV^{n}_{t} jumps at random times according to the time-varying rates (λin(Xtn,))i=1d(\lambda^{n}_{i}(X^{n}_{t},\cdot))_{i=1}^{d}. If (λin)i=1d(\lambda^{n}_{i})_{i=1}^{d} do not change much during a short time interval of length nδn^{-\delta}, we can replace VtnV^{n}_{t} by another process V¯tn\bar{V}^{n}_{t} (say) which evolves according to fixed rates (λin(x,))i=1d(\lambda^{n}_{i}(x,\cdot))_{i=1}^{d}.

Lemma B.2.

Suppose Assumptions 2.1 and 2.2 hold. Let bnb^{n} be as defined in equation (3.2). Then,

limnsup(x,v)En|G~n,δfΔn(x,v)bn(x)f(x)|=0,almost surely.\lim_{n\to\infty}\sup_{(x,v)\in E^{n}}\left|\tilde{G}^{n,\delta}f^{n}_{\Delta}(x,v)-b^{n}(x)\cdot\nabla f(x)\right|=0,\quad\mathbb{P}\text{-almost surely}.
Proof.

For any given (x,v)H¯n×𝒱(x,v)\in\bar{H}^{n}\times\mathcal{V}, we will reconstruct the parent Zig-Zag process ZtnZ^{n}_{t} along with the V¯tn\bar{V}^{n}_{t} process on a common probability space up until time nδn^{-\delta} and with initial conditions Z0n=(x,v)Z^{n}_{0}=(x,v) and V¯0n=v\bar{V}^{n}_{0}=v. To that end, let H(x,r)H(x,r) denote the hypercube in d\mathbb{R}^{d} of side length 2r2r centered at xx, i.e.,

H(x,r)={yd:|yixi|r,i=1,,d}.H(x,r)=\{y\in\mathbb{R}^{d}:|y_{i}-x_{i}|\leq r,\ i=1,\dots,d\}.

For all nn, H(x,nδ)(Hn)cH(x,n^{-\delta})\subset(H^{n})^{c} for all xH¯nx\in\bar{H}^{n}. Fix xH¯nx\in\bar{H}^{n}, let λn(x,v)=i=1dλin(x,v)\lambda^{n}(x,v)=\sum_{i=1}^{d}\lambda^{n}_{i}(x,v) be the total switching rate, and define νn\nu^{n} by,

νn=sup{λn(y,v):yH(x,nδ),v𝒱}<.\nu^{n}=\sup\{\lambda^{n}(y,v):y\in H(x,n^{-\delta}),\ v\in\mathcal{V}\}<\infty.

Since λin(x,𝟏d)+λin(x,𝟏d)>0,i=1,,d\lambda^{n}_{i}(x,\boldsymbol{1}_{d})+\lambda^{n}_{i}(x,-\boldsymbol{1}_{d})>0,\ i=1,\dots,d for all x(Hn)cx\in(H^{n})^{c}, it follows that νn>0\nu^{n}>0. Define, for each i=1,,di=1,\dots,d,

p0,in,x(y,v):=min{λin(y,v),λin(x,v)}2νn,\displaystyle p_{0,i}^{n,x}(y,v):=\frac{\min\{\lambda^{n}_{i}(y,v),\lambda^{n}_{i}(x,v)\}}{2\nu^{n}},
p1,in,x(y,v):=(λin(y,v)λin(x,v))+2νn,\displaystyle p_{1,i}^{n,x}(y,v):=\frac{(\lambda^{n}_{i}(y,v)-\lambda^{n}_{i}(x,v))_{+}}{2\nu^{n}},
p2,in,x(y,v):=(λin(y,v)λin(x,v))2νn,\displaystyle p_{2,i}^{n,x}(y,v):=\frac{(\lambda^{n}_{i}(y,v)-\lambda^{n}_{i}(x,v))_{-}}{2\nu^{n}},

for yH(x,nδ)y\in H(x,n^{-\delta}) and v𝒱v\in\mathcal{V}. We can choose an auxiliary measurable space (S,𝒮)(S,\mathcal{S}), a probability measure n,x\mathbb{P}^{n,x} on (S,𝒮)(S,\mathcal{S}), and a measurable function F:d×𝒱×𝒱×Sd×𝒱×𝒱F:\mathbb{R}^{d}\times\mathcal{V}\times\mathcal{V}\times S\to\mathbb{R}^{d}\times\mathcal{V}\times\mathcal{V} satisfying the following. For any yH(x,nδ)y\in H(x,n^{-\delta}) and v𝒱v\in\mathcal{V},

n,x(uS:F(y,v,v,u)=(y,Fi(v),Fi(v)))\displaystyle\mathbb{P}^{n,x}(u\in S:F(y,v,v,u)=(y,F_{i}(v),F_{i}(v))) =p0,in,x(y,v),\displaystyle=p_{0,i}^{n,x}(y,v),
n,x(uS:F(y,v,v,u)=(y,Fi(v),v))\displaystyle\mathbb{P}^{n,x}(u\in S:F(y,v,v,u)=(y,F_{i}(v),v)) =p1,in,x(y,v),\displaystyle=p_{1,i}^{n,x}(y,v),
n,x(uS:F(y,v,v,u)=(y,v,Fi(v)))\displaystyle\mathbb{P}^{n,x}(u\in S:F(y,v,v,u)=(y,v,F_{i}(v))) =p2,in,x(y,v),\displaystyle=p_{2,i}^{n,x}(y,v),
n,x(uS:F(y,v,v,u)=(y,v,v))\displaystyle\mathbb{P}^{n,x}(u\in S:F(y,v,v,u)=(y,v,v)) =1i=1d(p0,in,x(y,v)+p1,in,x(y,v)+p2,in,x(y,v))\displaystyle=1-\sum_{i=1}^{d}(p^{n,x}_{0,i}(y,v)+p^{n,x}_{1,i}(y,v)+p^{n,x}_{2,i}(y,v))
=112νni=1dmax{λin(y,v),λin(x,v)}.\displaystyle=1-\frac{1}{2\nu^{n}}\sum_{i=1}^{d}\max\{\lambda^{n}_{i}(y,v),\lambda^{n}_{i}(x,v)\}.

for all i=1,,di=1,\dots,d. Also, for any v𝒱v^{\prime}\in\mathcal{V}, vvv^{\prime}\neq v,

n,x(uS:F(y,v,v,u)=(y,Fi(v),Fj(v)))=λin(y,v)2νnλjn(x,v)2νn,\displaystyle\mathbb{P}^{n,x}(u\in S:F(y,v,v^{\prime},u)=(y,F_{i}(v),F_{j}(v^{\prime})))=\frac{\lambda^{n}_{i}(y,v)}{2\nu^{n}}\cdot\frac{\lambda^{n}_{j}(x,v^{\prime})}{2\nu^{n}},
n,x(uS:F(y,v,v,u)=(y,Fi(v),v))=λin(y,v)2νn(1λn(x,v)2νn),\displaystyle\mathbb{P}^{n,x}(u\in S:F(y,v,v^{\prime},u)=(y,F_{i}(v),v^{\prime}))=\frac{\lambda^{n}_{i}(y,v)}{2\nu^{n}}\cdot\left(1-\frac{\lambda^{n}(x,v^{\prime})}{2\nu^{n}}\right),
n,x(uS:F(y,v,v,u)=(y,v,Fj(v)))=(1λn(y,v)2νn)λjn(x,v)2νn,\displaystyle\mathbb{P}^{n,x}(u\in S:F(y,v,v^{\prime},u)=(y,v,F_{j}(v^{\prime})))=\left(1-\frac{\lambda^{n}(y,v)}{2\nu^{n}}\right)\cdot\frac{\lambda^{n}_{j}(x,v^{\prime})}{2\nu^{n}},
n,x(uS:F(y,v,v,u)=(y,v,v))=(1λn(y,v)2νn)(1λn(x,v)2νn).\displaystyle\mathbb{P}^{n,x}(u\in S:F(y,v,v^{\prime},u)=(y,v,v^{\prime}))=\left(1-\frac{\lambda^{n}(y,v)}{2\nu^{n}}\right)\cdot\left(1-\frac{\lambda^{n}(x,v^{\prime})}{2\nu^{n}}\right).

for all i,j{1,,d}i,j\in\{1,\dots,d\}. Finally, for any yH(x,nδ)y\notin H(x,n^{-\delta}) and v,v𝒱v,v^{\prime}\in\mathcal{V},

n,x(uS:F(y,v,v,u)=(x+vnδ,v,v))=1.\mathbb{P}^{n,x}(u\in S:F(y,v,v^{\prime},u)=(x+vn^{-\delta},v,v^{\prime}))=1.
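The defining property of this coupling is that each velocity component has the correct marginal flip probability while the two components jump together as often as possible. The following Python sketch (with arbitrary toy rates for d=3; all numerical values are purely illustrative and not taken from the paper) verifies this mechanically for the coupled case v=v′:

```python
# Sanity check (toy rates, illustrative only) of the coupling probabilities
# p_{0,i}, p_{1,i}, p_{2,i} defined above, in the coupled case v = v'.
d = 3
lam_y = [0.7, 1.2, 0.4]   # lambda_i^n(y, v), toy values
lam_x = [0.9, 1.0, 0.5]   # lambda_i^n(x, v), toy values
nu = max(sum(lam_y), sum(lam_x))   # any nu^n dominating both total rates works

p0 = [min(a, b) / (2 * nu) for a, b in zip(lam_y, lam_x)]
p1 = [max(a - b, 0.0) / (2 * nu) for a, b in zip(lam_y, lam_x)]
p2 = [max(b - a, 0.0) / (2 * nu) for a, b in zip(lam_y, lam_x)]

# the no-jump probability is non-negative, so these define a distribution
total = sum(p0) + sum(p1) + sum(p2)
assert total <= 1.0

for i in range(d):
    # first component (the Zig-Zag process) flips coordinate i with
    # probability lambda_i^n(y, v) / (2 nu^n) per event of the Poisson clock
    assert abs((p0[i] + p1[i]) - lam_y[i] / (2 * nu)) < 1e-12
    # the frozen-rate component flips with probability lambda_i^n(x, v) / (2 nu^n)
    assert abs((p0[i] + p2[i]) - lam_x[i] / (2 * nu)) < 1e-12
```

Since p_{0,i}+p_{1,i}=λ_i^n(y,v)/(2ν^n) and p_{0,i}+p_{2,i}=λ_i^n(x,v)/(2ν^n), thinning the rate-2ν^n Poisson clock with these probabilities reproduces the two target jump rates while maximizing the chance of a simultaneous flip.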

Let NtN_{t} be a time-homogeneous Poisson process with intensity 2νn2\nu^{n}. Let {Uk}\{U_{k}\} be a sequence of independent and identically distributed random variables taking values in (S,𝒮)(S,\mathcal{S}) with common law n,x\mathbb{P}^{n,x}. Let T1,T2,T_{1},T_{2},\dots be the sequence of arrival times of NtN_{t}. Given (x,v)(x,v), set (X0n,V0n)=(x,v)(X^{n}_{0},V^{n}_{0})=(x,v) and V¯0n=v\bar{V}^{n}_{0}=v. Recursively define for k0k\geq 0,

(Xtn,Vtn,V¯tn)=(XTkn+VTkn(tTk),VTkn,V¯Tkn),t[Tk,Tk+1),\displaystyle(X^{n}_{t},V^{n}_{t},\bar{V}^{n}_{t})=(X^{n}_{T_{k}}+V^{n}_{T_{k}}(t-T_{k}),V^{n}_{T_{k}},\bar{V}^{n}_{T_{k}}),\quad t\in[T_{k},T_{k+1}),
(XTkn,VTkn,V¯Tkn)=F(XTk1n+VTk1n(TkTk1),VTk1n,V¯Tk1n,Uk).\displaystyle(X^{n}_{T_{k}},V^{n}_{T_{k}},\bar{V}^{n}_{T_{k}})=F(X^{n}_{T_{k-1}}+V^{n}_{T_{k-1}}(T_{k}-T_{k-1}),V^{n}_{T_{k-1}},\bar{V}^{n}_{T_{k-1}},U_{k}).

By construction, the process (Xtn,Vtn,V¯tn)t<nδ(X^{n}_{t},V^{n}_{t},\bar{V}^{n}_{t})_{t<n^{-\delta}} is a PDMP with event rate 2νn2\nu^{n} and transition kernel determined by the law of the random variable FF. At each event time TkT_{k}, if VTkn=V¯TknV^{n}_{T_{k}}=\bar{V}^{n}_{T_{k}}, there are three possibilities: both VnV^{n} and V¯n\bar{V}^{n} jump together to the same location, or only one of them jumps, or neither of them jumps. On the other hand if VTknV¯TknV^{n}_{T_{k}}\neq\bar{V}^{n}_{T_{k}}, then they both jump independently of each other with certain probabilities. The probabilities are chosen such that marginally, (Xtn,Vtn)t<nδ(X^{n}_{t},V^{n}_{t})_{t<n^{-\delta}} is a Zig-Zag process on d×𝒱\mathbb{R}^{d}\times\mathcal{V} with initial state (X0n,V0n)=(x,v)(X^{n}_{0},V^{n}_{0})=(x,v) and rates (λin)i=1d(\lambda^{n}_{i})_{i=1}^{d}. Moreover, the process (V¯tn)t0(\bar{V}^{n}_{t})_{t\geq 0} is marginally a continuous time Markov process on the 𝒱\mathcal{V} space with V¯0n=v\bar{V}^{n}_{0}=v and generator given by,

Gxnf(v)=i=1dλin(x,v)(f(Fi(v))f(v)).G^{n}_{x}f(v)=\sum_{i=1}^{d}\lambda^{n}_{i}(x,v)(f(F_{i}(v))-f(v)).

Now, let Z~tn\tilde{Z}^{n}_{t} be the process stopped at the random time at which XtnX^{n}_{t} enters HnH^{n}. For xH¯nx\in\bar{H}^{n}, by construction XtnHnX^{n}_{t}\notin H^{n} for all tnδt\leq n^{-\delta}, and so (X~tn,V~tn)=(Xtn,Vtn)(\tilde{X}^{n}_{t},\tilde{V}^{n}_{t})=(X^{n}_{t},V^{n}_{t}) for all tnδt\leq n^{-\delta}. Consider, for (x,v)H¯n×𝒱(x,v)\in\bar{H}^{n}\times\mathcal{V},

|G~n,δfΔn(x,v)bn(x)f(x)|\displaystyle\left|\tilde{G}^{n,\delta}f^{n}_{\Delta}(x,v)-b^{n}(x)\cdot\nabla f(x)\right| =|1nδ𝔼(x,v)0nδV~snf(x)𝑑sbn(x)f(x)|\displaystyle=\left|\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}\tilde{V}^{n}_{s}\cdot\nabla f(x)\ ds-b^{n}(x)\cdot\nabla f(x)\right|
=|1nδ𝔼(x,v)0nδVsnf(x)𝑑sbn(x)f(x)|.\displaystyle=\left|\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}V^{n}_{s}\cdot\nabla f(x)\ ds-b^{n}(x)\cdot\nabla f(x)\right|. (B.3)

For each nn, let τcn\tau^{n}_{c} be the first time less than or equal to nδn^{-\delta} such that VsnV^{n}_{s} and V¯sn\bar{V}^{n}_{s} are not equal, i.e.,

τcn:=inf{0s<nδ:VsnV¯sn}.\tau^{n}_{c}:=\inf\{0\leq s<n^{-\delta}:V^{n}_{s}\neq\bar{V}^{n}_{s}\}.

If no such time exists, set τcn=nδ\tau^{n}_{c}=n^{-\delta}. We call this the decoupling time. We will show that the probability of decoupling before time nδn^{-\delta} goes to 0 as nn\to\infty. For any x,ydx,y\in\mathbb{R}^{d} and all i=1,,di=1,\dots,d,

|λin(y,v)λin(x,v)|\displaystyle\left|\lambda^{n}_{i}(y,v)-\lambda^{n}_{i}(x,v)\right| nm|𝒮(n,m)|S𝒮(n,m)jS|Eij(y)Eij(x)|\displaystyle\leq\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\sum_{j\in S}\left|E^{j}_{i}(y)-E^{j}_{i}(x)\right|
=j=1n|Eij(y)Eij(x)|\displaystyle=\sum_{j=1}^{n}\left|E^{j}_{i}(y)-E^{j}_{i}(x)\right|
=j=1n|sij(y)sij(x)|.\displaystyle=\sum_{j=1}^{n}\left|s^{j}_{i}(y)-s^{j}_{i}(x)\right|.

The last equality is true irrespective of whether the process is ZZ-SS or ZZ-CV. A second-order Taylor expansion gives,

sij(y)sij(x)=sij(x)(yx)+12(yx)T2sij(xj)(yx)s_{i}^{j}(y)-s^{j}_{i}(x)=\nabla s^{j}_{i}(x)\cdot(y-x)+\frac{1}{2}(y-x)^{T}\nabla^{\otimes 2}s^{j}_{i}(x^{j})(y-x)

where xj=x+θj(yx)x^{j}=x+\theta^{j}(y-x) for some θj(0,1)\theta^{j}\in(0,1). Under the assumption that the second derivatives of sijs^{j}_{i} are uniformly bounded,

|sij(y)sij(x)|sij(x)yx+Myx2|s_{i}^{j}(y)-s^{j}_{i}(x)|\leq\|\nabla s^{j}_{i}(x)\|\cdot\|y-x\|+M^{\prime}\|y-x\|^{2}

Consequently,

|λin(y,v)λin(x,v)|yxj=1nsij(x)+Mnyx2\left|\lambda^{n}_{i}(y,v)-\lambda^{n}_{i}(x,v)\right|\leq\|y-x\|\sum_{j=1}^{n}\|\nabla s^{j}_{i}(x)\|+M^{\prime}n\|y-x\|^{2}

for all x,ydx,y\in\mathbb{R}^{d} and i=1,,di=1,\dots,d. Let pn(s)p^{n}(s) be the probability that VnV^{n} and V¯n\bar{V}^{n} decouple at time ss given that they remained coupled up to time ss. Then, for snδs\leq n^{-\delta},

pn(s)\displaystyle p^{n}(s) =i=1d(p1,in,x(Xsn,v)+p2,in,x(Xsn,v))\displaystyle=\sum_{i=1}^{d}\left(p_{1,i}^{n,x}(X^{n}_{s},v)+p_{2,i}^{n,x}(X^{n}_{s},v)\right)
=12νni=1d|λin(Xsn,v)λin(x,v)|\displaystyle=\frac{1}{2\nu^{n}}\sum_{i=1}^{d}\left|\lambda^{n}_{i}(X^{n}_{s},v)-\lambda^{n}_{i}(x,v)\right|
12νni=1d(Xsnxj=1nsij(x)+MnXsnx2)\displaystyle\leq\frac{1}{2\nu^{n}}\sum_{i=1}^{d}\left(\|X^{n}_{s}-x\|\sum_{j=1}^{n}\|\nabla s^{j}_{i}(x)\|+M^{\prime}n\|X^{n}_{s}-x\|^{2}\right)
12νni=1d(dnδj=1nsij(x)+Mdn12δ)\displaystyle\leq\frac{1}{2\nu^{n}}\sum_{i=1}^{d}\left(\sqrt{d}n^{-\delta}\sum_{j=1}^{n}\|\nabla s^{j}_{i}(x)\|+M^{\prime}dn^{1-2\delta}\right)

This gives,

(τcn<nδ)=𝔼[k=1Nn(nδ)(τcn=Tk)]\displaystyle\mathbb{P}(\tau^{n}_{c}<n^{-\delta})=\mathbb{E}\left[\sum_{k=1}^{N^{n}(n^{-\delta})}\mathbb{P}(\tau^{n}_{c}=T_{k})\right] =𝔼[k=1Nn(nδ)pn(Tk)(τcnTk)]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{N^{n}(n^{-\delta})}p^{n}(T_{k})\mathbb{P}(\tau^{n}_{c}\geq T_{k})\right]
𝔼[k=1Nn(nδ)pn(Tk)]\displaystyle\leq\mathbb{E}\left[\sum_{k=1}^{N^{n}(n^{-\delta})}p^{n}(T_{k})\right]
12νni=1d(dnδj=1nsij(x)+Mdn12δ)𝔼(Nn(nδ))\displaystyle\leq\frac{1}{2\nu^{n}}\sum_{i=1}^{d}\left(\sqrt{d}n^{-\delta}\sum_{j=1}^{n}\|\nabla s^{j}_{i}(x)\|+M^{\prime}dn^{1-2\delta}\right)\mathbb{E}\left(N^{n}(n^{-\delta})\right)
dn2δi=1dj=1nsij(x)+Md2n13δ.\displaystyle\leq\sqrt{d}n^{-2\delta}\sum_{i=1}^{d}\sum_{j=1}^{n}\|\nabla s^{j}_{i}(x)\|+M^{\prime}d^{2}n^{1-3\delta}. (B.4)

The last inequality uses the fact that 𝔼(Nn(nδ))=2νnnδ\mathbb{E}(N^{n}(n^{-\delta}))=2\nu^{n}n^{-\delta}. Consider,

1nδ𝔼0nδVsnf(x)𝑑s1nδ𝔼0nδV¯snf(x)𝑑s\displaystyle\left\|\frac{1}{n^{-\delta}}\mathbb{E}\int_{0}^{n^{-\delta}}V^{n}_{s}\cdot\nabla f(x)\ ds-\frac{1}{n^{-\delta}}\mathbb{E}\int_{0}^{n^{-\delta}}\bar{V}^{n}_{s}\cdot\nabla f(x)\ ds\right\|
=1nδ𝔼0nδ(VsnV¯sn)f(x)𝑑s\displaystyle\quad\quad\quad\quad=\frac{1}{n^{-\delta}}\left\|\mathbb{E}\int_{0}^{n^{-\delta}}(V^{n}_{s}-\bar{V}^{n}_{s})\cdot\nabla f(x)\ ds\right\|
f(x)nδ𝔼τcnnδVsnV¯sn𝑑s\displaystyle\quad\quad\quad\quad\leq\frac{\|\nabla f(x)\|}{n^{-\delta}}\mathbb{E}\int_{\tau^{n}_{c}}^{n^{-\delta}}\left\|V^{n}_{s}-\bar{V}^{n}_{s}\right\|\ ds
2df(x)𝔼(1τcnnδ)\displaystyle\quad\quad\quad\quad\leq 2\sqrt{d}\|\nabla f(x)\|\cdot\mathbb{E}\left(1-\frac{\tau^{n}_{c}}{n^{-\delta}}\right)
2df(x)(τcn<nδ).\displaystyle\quad\quad\quad\quad\leq 2\sqrt{d}\|\nabla f(x)\|\cdot\mathbb{P}(\tau^{n}_{c}<n^{-\delta}). (B.5)

Combining (B.4) and (B.5), we get for all (x,v)En(x,v)\in E^{n},

1nδ𝔼(x,v)0nδVsnf(x)𝑑s1nδ𝔼(x,v)0nδV¯snf(x)𝑑s\displaystyle\left\|\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}V^{n}_{s}\cdot\nabla f(x)\ ds-\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}\bar{V}^{n}_{s}\cdot\nabla f(x)\ ds\right\|
f(x)(2dn2δi=1dj=1nsij(x)+2Md5/2n13δ).\displaystyle\quad\quad\quad\quad\leq\|\nabla f(x)\|\left(2dn^{-2\delta}\sum_{i=1}^{d}\sum_{j=1}^{n}\|\nabla s^{j}_{i}(x)\|+2M^{\prime}d^{5/2}n^{1-3\delta}\right). (B.6)

For fixed xx, recall that V¯n\bar{V}^{n} is a continuous time Markov process on {1,1}d\{-1,1\}^{d} which flips the sign of its ii-th component at rate λin(x,v)\lambda^{n}_{i}(x,v). Moreover, given the form of λin\lambda^{n}_{i}, this rate depends only on the ii-th coordinate of vv. Hence, for each ii, (V¯sn)i(\bar{V}^{n}_{s})_{i} evolves as a continuous time Markov process on {1,1}\{-1,1\} with rate λin\lambda^{n}_{i}, independently of the other coordinates. Suppose λin(x,v)>0\lambda^{n}_{i}(x,v^{\prime})>0 for all v𝒱v^{\prime}\in\mathcal{V} and all i=1,,di=1,\dots,d. Then for each ii, the (V¯sn)i(\bar{V}^{n}_{s})_{i} process is ergodic with stationary distribution πin,x\pi^{n,x}_{i} given by,

πin,x(1)=1πin,x(1)=λin(x,𝟏d)λin(x,𝟏d)+λin(x,𝟏d).\pi^{n,x}_{i}(1)=1-\pi^{n,x}_{i}(-1)=\frac{\lambda^{n}_{i}(x,-\boldsymbol{1}_{d})}{\lambda^{n}_{i}(x,\boldsymbol{1}_{d})+\lambda^{n}_{i}(x,-\boldsymbol{1}_{d})}.

Additionally, the V¯n\bar{V}^{n} process is ergodic with stationary distribution πn,x\pi^{n,x} given by,

πn,x(v)=i=1dλin(x,Fi(v))λin(x,v)+λin(x,Fi(v))=i=1dπin,x(vi),v𝒱.\pi^{n,x}(v^{\prime})=\prod_{i=1}^{d}\frac{\lambda^{n}_{i}(x,F_{i}(v^{\prime}))}{\lambda^{n}_{i}(x,v^{\prime})+\lambda^{n}_{i}(x,F_{i}(v^{\prime}))}=\prod_{i=1}^{d}\pi^{n,x}_{i}(v^{\prime}_{i}),\quad v^{\prime}\in\mathcal{V}.

For all i=1,,di=1,\dots,d, 𝔼πin,x(V)=bin(x)\mathbb{E}_{\pi^{n,x}_{i}}(V)=b^{n}_{i}(x). Moreover, observe that, for all v𝒱v^{\prime}\in\mathcal{V},

bin(x)=λin(x,𝟏d)λin(x,𝟏d)λin(x,𝟏d)+λin(x,𝟏d)=viλin(x,Fi(v))viλin(x,v)λin(x,Fi(v))+λin(x,v).b^{n}_{i}(x)=\frac{\lambda^{n}_{i}(x,-\boldsymbol{1}_{d})-\lambda^{n}_{i}(x,\boldsymbol{1}_{d})}{\lambda^{n}_{i}(x,-\boldsymbol{1}_{d})+\lambda^{n}_{i}(x,\boldsymbol{1}_{d})}=\frac{v^{\prime}_{i}\lambda^{n}_{i}(x,F_{i}(v^{\prime}))-v^{\prime}_{i}\lambda^{n}_{i}(x,v^{\prime})}{\lambda^{n}_{i}(x,F_{i}(v^{\prime}))+\lambda^{n}_{i}(x,v^{\prime})}.
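These stationarity claims are elementary to verify numerically. The sketch below (Python, with toy rates for d=2; illustrative only, not from the paper) checks that the product law π^{n,x} integrates the generator G_x^n to zero and that its coordinate means recover b_i^n(x):

```python
import itertools

# Toy check that the product measure pi^{n,x} is invariant for the
# generator G_x^n of the frozen-rate velocity process, and that its
# i-th coordinate mean equals b_i^n(x). Rates are illustrative only.
d = 2
# lambda_i^n(x, v) depends on v only through v_i: lam[i][v_i]
lam = {0: {1: 0.8, -1: 1.3}, 1: {1: 0.4, -1: 2.1}}

def flip(v, i):
    w = list(v); w[i] = -w[i]
    return tuple(w)

def pi(v):
    # product of the one-coordinate stationary laws pi_i^{n,x}
    p = 1.0
    for i in range(d):
        p *= lam[i][-v[i]] / (lam[i][1] + lam[i][-1])
    return p

V = list(itertools.product([1, -1], repeat=d))
assert abs(sum(pi(v) for v in V) - 1.0) < 1e-12

def G(f, v):
    # generator applied to f at v: sum_i lam_i(v_i) (f(F_i v) - f(v))
    return sum(lam[i][v[i]] * (f(flip(v, i)) - f(v)) for i in range(d))

# invariance: the integral of G f against pi vanishes for an arbitrary f
f = lambda v: 1.7 * v[0] - 0.3 * v[1] + 0.5 * v[0] * v[1]
residual = sum(pi(v) * G(f, v) for v in V)
assert abs(residual) < 1e-12

# stationary coordinate means reproduce b_i^n(x)
for i in range(d):
    b_i = (lam[i][-1] - lam[i][1]) / (lam[i][-1] + lam[i][1])
    assert abs(sum(pi(v) * v[i] for v in V) - b_i) < 1e-12
```

Invariance here is just coordinate-wise detailed balance: π_i(v_i)λ_i(v_i) = π_i(−v_i)λ_i(−v_i), which the product structure inherits.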

For all ii, define hin(x):=1/(λin(x,𝟏d)+λin(x,𝟏d))h^{n}_{i}(x):=1/(\lambda^{n}_{i}(x,-\boldsymbol{1}_{d})+\lambda^{n}_{i}(x,\boldsymbol{1}_{d})). Let PtiP^{i}_{t} be the semigroup associated with the process in the ii-th coordinate. Then, for any ff,

Ptif(v)=𝔼πin,xf(V)+πin,x(v)et/hin(x)(f(v)f(v)),v{1,1}.P^{i}_{t}f(v^{\prime})=\mathbb{E}_{\pi^{n,x}_{i}}f(V)+\pi^{n,x}_{i}(-v^{\prime})\cdot e^{-t/h^{n}_{i}(x)}\cdot(f(v^{\prime})-f(-v^{\prime})),\quad v^{\prime}\in\{-1,1\}.

Thus,

1nδ𝔼(x,v)0nδ(V¯sn)i𝑑s\displaystyle\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}(\bar{V}^{n}_{s})_{i}\ ds =1nδ0nδPsivi𝑑s\displaystyle=\frac{1}{n^{-\delta}}\int_{0}^{n^{-\delta}}P^{i}_{s}v_{i}\ ds
=bin(x)+2viπin,x(vi)nδ0nδes/hin(x)𝑑s\displaystyle=b^{n}_{i}(x)+\frac{2v_{i}\pi^{n,x}_{i}(-v_{i})}{n^{-\delta}}\int_{0}^{n^{-\delta}}e^{-s/h^{n}_{i}(x)}\ ds
=bin(x)+2viπin,x(vi)hin(x)nδ(1enδ/hin(x)).\displaystyle=b^{n}_{i}(x)+2v_{i}\pi^{n,x}_{i}(-v_{i})\cdot\frac{h^{n}_{i}(x)}{n^{-\delta}}(1-e^{-n^{-\delta}/h^{n}_{i}(x)}).

The above implies, for all i=1,,di=1,\dots,d,

|1nδ𝔼(x,v)0nδ(V¯sn)i𝑑sbin(x)|nδhin(x)(1e1/(hin(x)nδ))nδhin(x).\left|\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}(\bar{V}^{n}_{s})_{i}\ ds-b^{n}_{i}(x)\right|\leq n^{\delta}h^{n}_{i}(x)(1-e^{-1/(h^{n}_{i}(x)n^{\delta})})\leq n^{\delta}h^{n}_{i}(x). (B.7)
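The closed-form semigroup used in this computation can be cross-checked numerically. The sketch below (toy rates, illustrative only, not from the paper) compares P_t f from the displayed formula against a direct Euler integration of the backward equation for the two-state chain:

```python
import math

# Numerical check (toy rates) of the closed-form semigroup of a two-state
# chain on {-1, 1}: the flip rate out of state v is lam[v], and
# h = 1/(lam[1] + lam[-1]) is the relaxation time.
lam = {1: 0.9, -1: 1.6}
h = 1.0 / (lam[1] + lam[-1])
pi = {1: lam[-1] * h, -1: lam[1] * h}     # stationary law
f = {1: 2.0, -1: -0.5}                    # arbitrary test function
Ef = pi[1] * f[1] + pi[-1] * f[-1]

def P_closed(t, v):
    # P_t f(v) = E_pi f + pi(-v) e^{-t/h} (f(v) - f(-v))
    return Ef + pi[-v] * math.exp(-t / h) * (f[v] - f[-v])

def P_euler(t, v, steps=50000):
    # explicit Euler for the backward equation d/dt u(v) = lam(v)(u(-v) - u(v))
    dt = t / steps
    u = dict(f)
    for _ in range(steps):
        u = {w: u[w] + dt * lam[w] * (u[-w] - u[w]) for w in (1, -1)}
    return u[v]

for t in (0.3, 1.0, 2.5):
    for v in (1, -1):
        assert abs(P_closed(t, v) - P_euler(t, v)) < 1e-4
```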

For arbitrary ii, suppose λin(x,v)=0\lambda^{n}_{i}(x,v^{\prime})=0 for some v𝒱v^{\prime}\in\mathcal{V}. Then, for all θ𝒱\theta\in\mathcal{V} such that θi=vi\theta_{i}=v^{\prime}_{i}, λin(x,θ)=0\lambda^{n}_{i}(x,\theta)=0. Note that bin(x)=vib^{n}_{i}(x)=v^{\prime}_{i} in this case. If vi=viv^{\prime}_{i}=v_{i}, the ii-th coordinate of the V¯n\bar{V}^{n} process never jumps out of viv_{i}. And we have,

|1nδ𝔼(x,v)0nδ(V¯sn)i𝑑sbin(x)|=|vivi|=0.\left|\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}(\bar{V}^{n}_{s})_{i}\ ds-b^{n}_{i}(x)\right|=\left|v_{i}-v_{i}\right|=0. (B.8)

If vi=viv^{\prime}_{i}=-v_{i}, the ii-th coordinate jumps from viv_{i} to vi-v_{i} and stays there forever. Let TT be the time it first jumps to vi-v_{i}. Then TExp(λin(x,v))T\sim\text{Exp}(\lambda^{n}_{i}(x,v)). Thus,

|1nδ𝔼(x,v)0nδ(V¯sn)i𝑑sbin(x)|\displaystyle\left|\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}(\bar{V}^{n}_{s})_{i}\ ds-b^{n}_{i}(x)\right| =|1nδ𝔼(x,v)0nδ(V¯sn)i𝑑s(vi)|\displaystyle=\left|\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}(\bar{V}^{n}_{s})_{i}\ ds-(-v_{i})\right|
=|1nδ𝔼(x,v)0nδ((V¯sn)i+vi)𝑑s|\displaystyle=\left|\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}((\bar{V}^{n}_{s})_{i}+v_{i})\ ds\right|
=|1nδ𝔼(x,v)(2viT1(Tnδ)+2vinδ1(T>nδ))|\displaystyle=\left|\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\left(2v_{i}T1(T\leq n^{-\delta})+2v_{i}n^{-\delta}1(T>n^{-\delta})\right)\right|
=nδλin(x,v)(1eλin(x,v)/nδ)nδhin(x).\displaystyle=\frac{n^{\delta}}{\lambda^{n}_{i}(x,v)}(1-e^{-\lambda^{n}_{i}(x,v)/n^{\delta}})\leq n^{\delta}h^{n}_{i}(x). (B.9)

From (B.7), (B.8), and (B.9), we get for all i=1,,di=1,\dots,d,

|1nδ𝔼(x,v)0nδ(V¯sn)i𝑑sbin(x)|nδhin(x).\left|\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}(\bar{V}^{n}_{s})_{i}\ ds-b^{n}_{i}(x)\right|\leq n^{\delta}h^{n}_{i}(x). (B.10)

Consequently,

1nδ𝔼(x,v)0nδV¯snf(x)𝑑sbn(x)f(x)\displaystyle\left\|\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}\bar{V}^{n}_{s}\cdot\nabla f(x)\ ds-b^{n}(x)\cdot\nabla f(x)\right\|
i=1d|1nδ𝔼(x,v)0nδ(V¯sn)iif(x)𝑑sbin(x)if(x)|\displaystyle\quad\quad\quad\quad\leq\sum_{i=1}^{d}\left|\frac{1}{n^{-\delta}}\mathbb{E}_{(x,v)}\int_{0}^{n^{-\delta}}(\bar{V}^{n}_{s})_{i}\cdot\nabla_{i}f(x)\ ds-b^{n}_{i}(x)\cdot\nabla_{i}f(x)\right|
i=1dnδ|if(x)|hin(x)\displaystyle\quad\quad\quad\quad\leq\sum_{i=1}^{d}n^{\delta}|\nabla_{i}f(x)|h^{n}_{i}(x) (B.11)

Combining (B.3), (B.6), and (B.11), we get,

|G~n,δfΔn(x,v)bn(x)f(x)|\displaystyle\left|\tilde{G}^{n,\delta}f^{n}_{\Delta}(x,v)-b^{n}(x)\cdot\nabla f(x)\right|
f(x)(2dn2δi=1dj=1nsij(x)+2Md5/2n13δ+i=1dnδhin(x)).\displaystyle\quad\quad\quad\quad\leq\|\nabla f(x)\|\left(2dn^{-2\delta}\sum_{i=1}^{d}\sum_{j=1}^{n}\|\nabla s^{j}_{i}(x)\|+2M^{\prime}d^{5/2}n^{1-3\delta}+\sum_{i=1}^{d}n^{\delta}h^{n}_{i}(x)\right).

But by our assumptions on ff, there exists a compact set KHcK\subset H^{c} and a constant C1>0C_{1}>0 such that,

supxH¯n|G~n,δfΔn(x,v)bn(x)f(x)|\displaystyle\sup_{x\in\bar{H}^{n}}\left|\tilde{G}^{n,\delta}f^{n}_{\Delta}(x,v)-b^{n}(x)\cdot\nabla f(x)\right|
C1supxKH¯n(2dn2δi=1dj=1nsij(x)+2Md5/2n13δ+i=1dnδhin(x)).\displaystyle\hskip 50.0pt\leq C_{1}\sup_{x\in K\cap\bar{H}^{n}}\left(2dn^{-2\delta}\sum_{i=1}^{d}\sum_{j=1}^{n}\|\nabla s^{j}_{i}(x)\|+2M^{\prime}d^{5/2}n^{1-3\delta}+\sum_{i=1}^{d}n^{\delta}h^{n}_{i}(x)\right).

By Assumption 2.2, 𝔼YP[|Si(x)|]<\mathbb{E}_{Y\sim P}[|\nabla S_{i}(x)|]<\infty for all xx. Given the smoothness (Assumption 2.1), n1j=1nsij(x)n^{-1}\sum_{j=1}^{n}\|\nabla s^{j}_{i}(x)\| converges uniformly on compact sets by the law of large numbers. Moreover, in the proof of Proposition 3.1, we show that nhin(x)nh^{n}_{i}(x) converges to a finite quantity uniformly on compact sets. Consequently, the right-hand side goes to 0 since 1/2<δ<11/2<\delta<1. ∎


Proof of Theorem 3.3. For arbitrary test function f𝒞c(d)f\in\mathcal{C}_{c}^{\infty}(\mathbb{R}^{d}) with support KHcK\subset H^{c},

sup(x,v)En|Gn,δfΔn(x,v)fΔ(x)|\displaystyle\sup_{(x,v)\in E^{n}}\left|G^{n,\delta}f^{n}_{\Delta}(x,v)-\mathcal{L}f_{\Delta}(x)\right| =supxKH¯n|Gn,δfΔn(x,v)b(x)f(x)|.\displaystyle=\sup_{x\in K\cap\bar{H}^{n}}\left|G^{n,\delta}f^{n}_{\Delta}(x,v)-b(x)\cdot\nabla f(x)\right|.

Lemma B.1 and Lemma B.2 together imply that

limnsupxKH¯n|Gn,δfΔn(x,v)bn(x)f(x)|=0,\lim_{n\to\infty}\sup_{x\in K\cap\bar{H}^{n}}|G^{n,\delta}f^{n}_{\Delta}(x,v)-b^{n}(x)\cdot\nabla f(x)|=0,

\mathbb{P}-almost surely. Further from Proposition 3.2, we know that bnb^{n} converges to bb uniformly on compact sets with \mathbb{P}-probability 11. Since the sets EnE^{n} have limiting probability 11 and fnf^{n} and ff agree on EnE^{n}, the result then follows by Corollary 8.7(f), Chapter 4 of [23]. ∎

Appendix C Proofs for the stationary phase

C.1 Proof of Theorem 4.1

Let ϵ>0\epsilon>0 be arbitrary. Let n1n_{1} be such that n11/2<ϵn^{-1/2}_{1}<\epsilon, and let nmax{n0,n1}n\geq\max\{n_{0},n_{1}\}. Since vj=1nsj(x^n+vs)>0v\cdot\sum_{j=1}^{n}s^{j}(\hat{x}_{n}+vs)>0 for all s>0s>0, the process visits x^n\hat{x}_{n} exactly once between consecutive velocity flips.

Fix V0n=v{1,1}V^{n}_{0}=v^{*}\in\{-1,1\} arbitrarily. Iteratively define random times (Tk±)k(T_{k}^{\pm})_{k\in\mathbb{N}} and (Sk±)k(S_{k}^{\pm})_{k\in\mathbb{N}} as follows:

T0+\displaystyle T_{0}^{+} =0\displaystyle=0
Tk\displaystyle T_{k}^{-} =inf{t>Tk1+:(Xtn,Vtn)=(x^n,v)},\displaystyle=\inf\{t>T_{k-1}^{+}:(X^{n}_{t},V^{n}_{t})=(\hat{x}_{n},-v^{*})\}, k=1,2,3,,\displaystyle k=1,2,3,\cdots,
Tk+\displaystyle T_{k}^{+} =inf{t>Tk:(Xtn,Vtn)=(x^n,v)},\displaystyle=\inf\{t>T_{k}^{-}:(X^{n}_{t},V^{n}_{t})=(\hat{x}_{n},v^{*})\}, k=1,2,3,,\displaystyle k=1,2,3,\cdots,
Sk+\displaystyle S_{k}^{+} =inf{t>Tk1+:Vtn=v},\displaystyle=\inf\{t>T_{k-1}^{+}:V^{n}_{t}=-v^{*}\}, k=1,2,3,,\displaystyle k=1,2,3,\cdots,
Sk\displaystyle S_{k}^{-} =inf{t>Tk:Vtn=v},\displaystyle=\inf\{t>T_{k}^{-}:V^{n}_{t}=v^{*}\}, k=1,2,3,.\displaystyle k=1,2,3,\cdots.

Now for k=1,2,,k=1,2,\dots, define the random variables

Zk+\displaystyle Z_{k}^{+} :=Sk+Tk1+,\displaystyle:=S_{k}^{+}-T^{+}_{k-1},
Zk\displaystyle Z_{k}^{-} :=SkTk,\displaystyle:=S_{k}^{-}-T_{k}^{-},
Zk\displaystyle Z_{k} :=Tk+Tk1+=2(Zk++Zk).\displaystyle:=T_{k}^{+}-T_{k-1}^{+}=2(Z_{k}^{+}+Z_{k}^{-}).

Let N(t):=sup{k:Tk+t}N(t):=\sup\{k:T^{+}_{k}\leq t\} where N(t)=0N(t)=0 if the set is empty. Since ZtnZ^{n}_{t} is a canonical Zig-Zag process, by the Markov property, it follows that Zk±iidZ±Z_{k}^{\pm}\overset{iid}{\sim}Z^{\pm} where the distribution of Z±Z^{\pm} is the same as that of the time to switch velocity starting from (x^n,±v)(\hat{x}_{n},\pm v^{*}), i.e.,

(Z±>t)=exp(0tλn(x^n±vs,±v)𝑑s),t0.\mathbb{P}(Z^{\pm}>t)=\exp\left(-\int_{0}^{t}\lambda^{n}(\hat{x}_{n}\pm v^{*}s,\pm v^{*})ds\right),\quad t\geq 0.

Consequently, N(t)N(t) is a renewal process with inter-arrival times distributed as 2(Z++Z)2(Z^{+}+Z^{-}). By construction, Tk±T^{\pm}_{k} and Sk±S^{\pm}_{k} alternate, i.e., for all kk, Tk1+<Sk+<Tk<Sk<Tk+T^{+}_{k-1}<S^{+}_{k}<T^{-}_{k}<S^{-}_{k}<T^{+}_{k}. Then, for all t0t\geq 0,

supst|Xsnx^n|\displaystyle\sup_{s\leq t}|X^{n}_{s}-\hat{x}_{n}| max{|XSk+nXTk1+n|,|XSknXTkn|;kN(t)+1}\displaystyle\leq\max\left\{|X^{n}_{S_{k}^{+}}-X^{n}_{T_{k-1}^{+}}|,|X^{n}_{S_{k}^{-}}-X^{n}_{T_{k}^{-}}|;\ k\leq N(t)+1\right\}
=max{Zk+,Zk;kN(t)+1}\displaystyle=\max\left\{Z^{+}_{k},Z_{k}^{-};\ k\leq N(t)+1\right\} (C.1)
(supst|Xsnx^n|>ϵ/2)(max{Zk+,Zk;kN(t)+1}>ϵ/2).\mathbb{P}\left(\sup_{s\leq t}|X^{n}_{s}-\hat{x}_{n}|>\epsilon/2\right)\leq\mathbb{P}\left(\max\left\{Z^{+}_{k},Z_{k}^{-};\ k\leq N(t)+1\right\}>\epsilon/2\right). (C.2)
Lemma C.1.

For any fixed tt,

(max{Zk+,Zk;kN(t)+1}>ϵ/2)\displaystyle\mathbb{P}\left(\max\left\{Z^{+}_{k},Z_{k}^{-};\ k\leq N(t)+1\right\}>\epsilon/2\right) 𝔼(N(t)+1)((Z+>ϵ/2)+(Z>ϵ/2)).\displaystyle\leq\mathbb{E}(N(t)+1)(\mathbb{P}(Z^{+}>\epsilon/2)+\mathbb{P}(Z^{-}>\epsilon/2)). (C.3)
Proof.

Observe that N(t)+1=inf{k:Tk+>t}N(t)+1=\inf\{k:T^{+}_{k}>t\} is a stopping time with respect to the filtration generated by {Zk±;k}\{Z^{\pm}_{k};k\in\mathbb{N}\}. Then,

(max{Zk+,Zk;kN(t)+1}>ϵ/2)\displaystyle\mathbb{P}\left(\max\left\{Z^{+}_{k},Z_{k}^{-};\ k\leq N(t)+1\right\}>\epsilon/2\right)
=(kN(t)+1{Zk+>ϵ/2}{Zk>ϵ/2})\displaystyle\quad\quad=\mathbb{P}\left(\cup_{k\leq N(t)+1}\{Z^{+}_{k}>\epsilon/2\}\cup\{Z^{-}_{k}>\epsilon/2\}\right)
=𝔼[1(kN(t)+1{Zk+>ϵ/2}{Zk>ϵ/2})]\displaystyle\quad\quad=\mathbb{E}\left[1\left(\cup_{k\leq N(t)+1}\{Z^{+}_{k}>\epsilon/2\}\cup\{Z^{-}_{k}>\epsilon/2\}\right)\right]
𝔼[kN(t)+11({Zk+>ϵ/2})+kN(t)+11({Zk>ϵ/2})]\displaystyle\quad\quad\leq\mathbb{E}\left[\sum_{k\leq N(t)+1}1\left(\{Z^{+}_{k}>\epsilon/2\}\right)+\sum_{k\leq N(t)+1}1\left(\{Z^{-}_{k}>\epsilon/2\}\right)\right]
=𝔼(N(t)+1)((Z+>ϵ/2)+(Z>ϵ/2)),\displaystyle\quad\quad=\mathbb{E}(N(t)+1)(\mathbb{P}(Z^{+}>\epsilon/2)+\mathbb{P}(Z^{-}>\epsilon/2)),

where the last equality follows by Wald's identity since N(t)+1N(t)+1 is a stopping time. ∎

Recall that the canonical switching rate λn(x,v)=(vUn(x))+=(vj=1nsj(x))+\lambda^{n}(x,v)=\left(v\cdot U^{\prime}_{n}(x)\right)_{+}=\left(v\cdot\sum_{j=1}^{n}s^{j}(x)\right)_{+}. Then, for all s>0s>0, λn(x^n+vs,v)>0\lambda^{n}(\hat{x}_{n}+vs,v)>0. Consequently, for all t<ϵt<\epsilon,

(Z±>t)\displaystyle\mathbb{P}(Z^{\pm}>t) =exp(0tλn(x^n±vs,±v)𝑑s)\displaystyle=\exp\left(-\int_{0}^{t}\lambda^{n}(\hat{x}_{n}\pm v^{*}s,\pm v^{*})\ ds\right)
=exp(0t±vUn(x^n±vs)ds)\displaystyle=\exp\left(-\int_{0}^{t}\pm v^{*}U^{\prime}_{n}(\hat{x}_{n}\pm v^{*}s)\ ds\right)
=exp((Un(x^n±vt)Un(x^n))).\displaystyle=\exp\left(-\left(U_{n}(\hat{x}_{n}\pm v^{*}t)-U_{n}(\hat{x}_{n})\right)\right). (C.4)
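Identity (C.4) reflects that, along the ray leaving the mode, the positive part in the rate is inactive, so the rate integrates exactly to the potential difference. The sketch below (a toy one-dimensional potential with mode at 0; illustrative only, not from the paper) compares a midpoint quadrature of the switching rate with the potential difference:

```python
import math

# Illustrative check of (C.4): for the canonical rate
# lambda(x, v) = (v U'(x))_+, the survival probability of the switching
# time started at the mode equals exp(-(U(x_hat + v t) - U(x_hat))).
U = lambda x: 1.5 * x**2 + 0.25 * x**4    # toy potential, mode x_hat = 0
dU = lambda x: 3.0 * x + x**3
x_hat, v = 0.0, 1

def survival(t, steps=20000):
    # exp(- int_0^t (v U'(x_hat + v s))_+ ds) via midpoint quadrature
    dt = t / steps
    acc = sum(max(v * dU(x_hat + v * (k + 0.5) * dt), 0.0) for k in range(steps))
    return math.exp(-acc * dt)

for t in (0.2, 0.7, 1.5):
    assert abs(survival(t) - math.exp(-(U(x_hat + v * t) - U(x_hat)))) < 1e-6
```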

A Taylor expansion about x^n\hat{x}_{n} gives,

Un(x^n±vt)Un(x^n)=12Un′′(x^n)t2±v6Un′′′(xt)t3U_{n}(\hat{x}_{n}\pm v^{*}t)-U_{n}(\hat{x}_{n})=\frac{1}{2}U^{\prime\prime}_{n}(\hat{x}_{n})t^{2}\pm\frac{v^{*}}{6}U^{\prime\prime\prime}_{n}(x^{\prime}_{t})t^{3}

for some xtx^{\prime}_{t} between x^n\hat{x}_{n} and x^n±vt\hat{x}_{n}\pm v^{*}t. Let t=n1/2<ϵt=n^{-1/2}<\epsilon. Recall that J(x^n)=n1Un′′(x^n)J(\hat{x}_{n})=n^{-1}U^{\prime\prime}_{n}(\hat{x}_{n}). Then by Assumption 2.4, Un′′(x^n)n1𝔼S(x0)>0U^{\prime\prime}_{n}(\hat{x}_{n})n^{-1}\to\mathbb{E}S^{\prime}(x_{0})>0 in \mathbb{P}-probability as nn\to\infty. Moreover, the boundedness assumption on s′′s^{\prime\prime} implies that the second term goes to 0 as nn\to\infty. Consequently, (Z±>n1/2)exp(𝔼S(x0)/2)\mathbb{P}(Z^{\pm}>n^{-1/2})\to\exp(-\mathbb{E}S^{\prime}(x_{0})/2) in \mathbb{P}-probability as nn\to\infty.

Define Z:=2(Z++Z)Z:=2(Z^{+}+Z^{-}). Then this implies, since Z+Z^{+} and ZZ^{-} are independent,

limn(Z>n1/2)exp(𝔼S(x0))>0\lim_{n\to\infty}\mathbb{P}(Z>n^{-1/2})\geq\exp(-\mathbb{E}S^{\prime}(x_{0}))>0

in \mathbb{P}-probability. Recall that N(t)N(t) is a renewal process with inter-arrival times ZkiidZZ_{k}\overset{iid}{\sim}Z. Using the truncation method from the proof of the elementary renewal theorem, it can be shown that for any t>0t>0,

𝔼(N(t)+1)tn1/2+1(Z>n1/2).\mathbb{E}(N(t)+1)\leq\frac{tn^{1/2}+1}{\mathbb{P}(Z>n^{-1/2})}.

Hence, it follows that,

limnn1/2𝔼(N(t)+1)texp(𝔼S(x0))<.\lim_{n\to\infty}n^{-1/2}\mathbb{E}(N(t)+1)\leq\frac{t}{\exp(-\mathbb{E}S^{\prime}(x_{0}))}<\infty. (C.5)
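The truncation bound on 𝔼(N(t)+1) used above can be illustrated by simulation. The sketch below (a toy inter-arrival law with a fixed seed; illustrative only, not from the paper) checks 𝔼(N(t)+1) ≤ (t/a + 1)/ℙ(Z > a) with a generic truncation level a in place of n^{-1/2}:

```python
import random

# Monte Carlo illustration of the truncation bound
# E(N(t)+1) <= (t/a + 1) / P(Z > a),
# here with a = 0.75 and Z ~ Uniform(0.5, 1.5), so P(Z > a) = 0.75.
random.seed(0)
t, a, p_tail = 5.0, 0.75, 0.75

def n_plus_one():
    # N(t)+1 is the index of the first renewal epoch exceeding t
    clock, k = 0.0, 0
    while clock <= t:
        clock += random.uniform(0.5, 1.5)
        k += 1
    return k

m = 20000
emp = sum(n_plus_one() for _ in range(m)) / m
bound = (t / a + 1.0) / p_tail
assert emp <= bound   # bound holds with ample margin here
```

In the proof the same bound is applied with a = n^{-1/2} and the inter-arrival law of Z = 2(Z⁺ + Z⁻).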

Consider (Z±>t)\mathbb{P}(Z^{\pm}>t) again. Let π(n)\pi^{(n)} be the Lebesgue density associated with the Bayesian posterior Π(n)\Pi^{(n)}. Recall that π(n)(x)exp(Un(x))\pi^{(n)}(x)\propto\exp(-U_{n}(x)). Then, by (C.4),

(Z±>t)=exp((Un(x^n±vt)Un(x^n)))=π(n)(x^n±vt)π(n)(x^n)\mathbb{P}(Z^{\pm}>t)=\exp\left(-\left(U_{n}(\hat{x}_{n}\pm v^{*}t)-U_{n}(\hat{x}_{n})\right)\right)=\frac{\pi^{(n)}(\hat{x}_{n}\pm v^{*}t)}{\pi^{(n)}(\hat{x}_{n})}

Denote by An+A^{+}_{n} and AnA^{-}_{n}, the intervals [x^n+ϵ/4,x^n+ϵ/2][\hat{x}_{n}+\epsilon/4,\hat{x}_{n}+\epsilon/2] and [x^nϵ/2,x^nϵ/4][\hat{x}_{n}-\epsilon/2,\hat{x}_{n}-\epsilon/4] respectively. Also, denote by AnA_{n}, the interval [x^nn1/2/2,x^n+n1/2/2][\hat{x}_{n}-n^{-1/2}/2,\hat{x}_{n}+n^{-1/2}/2]. Since the density is monotonically non-increasing on either side of x^n\hat{x}_{n}, it follows that,

Π(n)(An±)π(n)(x^n±ϵ/2)ϵ/4;Π(n)(An)π(n)(x^n)n1/2.\Pi^{(n)}\left(A^{\pm}_{n}\right)\geq\pi^{(n)}(\hat{x}_{n}\pm\epsilon/2)\epsilon/4;\quad\Pi^{(n)}(A_{n})\leq\pi^{(n)}(\hat{x}_{n})n^{-1/2}.

These yield,

(Z+>ϵ/2)+(Z>ϵ/2)\displaystyle\mathbb{P}(Z^{+}>\epsilon/2)+\mathbb{P}(Z^{-}>\epsilon/2) =π(n)(x^n+ϵ/2)+π(n)(x^nϵ/2)π(n)(x^n)\displaystyle=\frac{\pi^{(n)}(\hat{x}_{n}+\epsilon/2)+\pi^{(n)}(\hat{x}_{n}-\epsilon/2)}{\pi^{(n)}(\hat{x}_{n})}
4n1/2ϵ(Π(n)(An+)+Π(n)(An)Π(n)(An)).\displaystyle\leq\frac{4n^{-1/2}}{\epsilon}\left(\frac{\Pi^{(n)}(A^{+}_{n})+\Pi^{(n)}(A^{-}_{n})}{\Pi^{(n)}(A_{n})}\right).

The Bernstein-von Mises theorem (Assumption 2.4) then implies that Π(n)(An±)0\Pi^{(n)}(A^{\pm}_{n})\to 0 in \mathbb{P}-probability, while Π(n)(An)\Pi^{(n)}(A_{n}) converges in \mathbb{P}-probability to a strictly positive constant, as nn\to\infty. Combining these observations with (C.2), (C.3), and (C.5), we get

limn(supst|Xsnx^n|>ϵ/2)=0.\lim_{n\to\infty}\mathbb{P}\left(\sup_{s\leq t}|X^{n}_{s}-\hat{x}_{n}|>\epsilon/2\right)=0. (C.6) ∎
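The interval-mass bounds driving this argument can also be checked concretely. The sketch below uses a Gaussian density N(x̂_n, 1/n) as a stand-in for the posterior (a toy choice motivated by Bernstein-von Mises, not taken from the paper) and verifies the three inequalities used above:

```python
import math

# Illustration of the interval-mass bounds with a Gaussian stand-in
# N(x_hat, 1/n) for the posterior Pi^(n). All values are illustrative.
n, x_hat, eps = 400, 0.3, 0.5
sd = 1.0 / math.sqrt(n)
pdf = lambda x: math.exp(-0.5 * ((x - x_hat) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
cdf = lambda x: 0.5 * (1.0 + math.erf((x - x_hat) / (sd * math.sqrt(2.0))))

# Pi(A_n^+) >= pdf(x_hat + eps/2) * eps/4: the density is smallest at the
# right endpoint of A_n^+ = [x_hat + eps/4, x_hat + eps/2]
A_plus = cdf(x_hat + eps / 2) - cdf(x_hat + eps / 4)
A_minus = cdf(x_hat - eps / 4) - cdf(x_hat - eps / 2)
assert A_plus >= pdf(x_hat + eps / 2) * eps / 4

# Pi(A_n) <= pdf(x_hat) * n^{-1/2}: A_n has length n^{-1/2}, peak at x_hat
A_n = cdf(x_hat + sd / 2) - cdf(x_hat - sd / 2)
assert A_n <= pdf(x_hat) * sd

# the resulting density-ratio bound from the proof
ratio = (pdf(x_hat + eps / 2) + pdf(x_hat - eps / 2)) / pdf(x_hat)
assert ratio <= (4 / (eps * math.sqrt(n))) * (A_plus + A_minus) / A_n
```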

C.2 Proof of Theorem 4.2

Recall that the ZZ-SS switching rates in terms of the reparameterized coordinate for a sub-sample of size mm are given by,

λin(ξ,v)=nm|𝒮(n,m)|S𝒮(n,m)(vijSsij(x^n+n1/2ξ))+,(ξ,v)d×𝒱,i=1,,d.\lambda^{n}_{i}(\xi,v)=\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot\sum_{j\in S}s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi)\right)_{+},\quad(\xi,v)\in\mathbb{R}^{d}\times\mathcal{V},\ i=1,\dots,d.

Define once again, in terms of the reparameterized coordinate, bn:d[0,1]db^{n}:\mathbb{R}^{d}\to[0,1]^{d} by,

bin(ξ)=λin(ξ,𝟏d)λin(ξ,𝟏d)λin(ξ,𝟏d)+λin(ξ,𝟏d),i=1,,d.b^{n}_{i}(\xi)=\frac{\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})-\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})}{\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})+\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})},\quad i=1,\dots,d.

Also define for each i=1,,di=1,\dots,d, hin:d(0,)h^{n}_{i}:\mathbb{R}^{d}\to(0,\infty) by

hin(ξ)=1λin(ξ,𝟏d)+λin(ξ,𝟏d).h^{n}_{i}(\xi)=\frac{1}{\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})+\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})}.

Then, for each ii, hin(ξ)h^{n}_{i}(\xi) is continuous. Moreover, hin(ξ)h^{n}_{i}(\xi) is finite since the sum of the switching rates is non-zero. Indeed,

λin(ξ,𝟏d)+λin(ξ,𝟏d)=nm|𝒮(n,m)|S𝒮(n,m)|jSsij(x^n+n1/2ξ)|.\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})+\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})=\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left|\sum_{j\in S}s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi)\right|.

The above is 0 at a point if and only if jSsij(x^n+n1/2ξ)=0\sum_{j\in S}s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi)=0 for every SS. This happens with probability 0 since the data are non-degenerate.

Lemma C.2.

For all R>0R>0, and each i=1,,di=1,\dots,d,

limnsupξR|n1/2bin(ξ)ξ𝔼[Si(x0;Y)]𝔼[|m1j=1mSi(x0;Yj)|]|=0,\lim_{n\to\infty}\sup_{\|\xi\|\leq R}\left|n^{1/2}b^{n}_{i}(\xi)-\frac{-\xi\cdot\mathbb{E}[\nabla S_{i}(x_{0};Y)]}{\mathbb{E}\left[\left|m^{-1}\sum_{j=1}^{m}S_{i}(x_{0};Y_{j})\right|\right]}\right|=0,

in \mathbb{P}-probability, where, Y,Y1,,YmiidPY,Y_{1},\dots,Y_{m}\overset{\text{iid}}{\sim}P.

Proof.

Let ii be arbitrary and decompose binb^{n}_{i} as,

n1/2bin(ξ)\displaystyle n^{1/2}b^{n}_{i}(\xi) =n1/2λin(ξ,𝟏d)λin(ξ,𝟏d)λin(ξ,𝟏d)+λin(ξ,𝟏d)\displaystyle=n^{1/2}\frac{\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})-\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})}{\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})+\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})}
=(λin(ξ,𝟏d)λin(ξ,𝟏d)n1/2)(nλin(ξ,𝟏d)+λin(ξ,𝟏d)).\displaystyle=\left(\frac{\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})-\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})}{n^{1/2}}\right)\left(\frac{n}{\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})+\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})}\right).

A second-order Taylor expansion gives, for all i,ji,j,

sij(x^n+n1/2ξ)=sij(x^n)+n1/2ξsij(x^n)+n12ξT2sij(x^n+θi,jn1/2ξ)ξ,s_{i}^{j}(\hat{x}_{n}+n^{-1/2}\xi)=s_{i}^{j}(\hat{x}_{n})+n^{-1/2}\xi\cdot\nabla s_{i}^{j}(\hat{x}_{n})+\frac{n^{-1}}{2}\xi^{T}\nabla^{\otimes 2}s_{i}^{j}(\hat{x}_{n}+\theta_{i,j}n^{-1/2}\xi)\xi,

where θi,j(0,1)\theta_{i,j}\in(0,1). Consider the first term. Then,

λin(ξ,𝟏d)λin(ξ,𝟏d)n1/2\displaystyle\frac{\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})-\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})}{n^{1/2}} =1n1/2(nm|𝒮(n,m)|S𝒮(n,m)jSsij(x^n+n1/2ξ))\displaystyle=\frac{-1}{n^{1/2}}\left(\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\sum_{j\in S}s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi)\right)
=1n1/2j=1nsij(x^n+n1/2ξ)\displaystyle=\frac{-1}{n^{1/2}}\sum_{j=1}^{n}s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi)
=1n1/2j=1n(sij(x^n)+n1/2ξsij(x^n)+n12ξT2sij(x^n+θi,jn1/2ξ)ξ)\displaystyle=\frac{-1}{n^{1/2}}\sum_{j=1}^{n}\left(s_{i}^{j}(\hat{x}_{n})+n^{-1/2}\xi\cdot\nabla s_{i}^{j}(\hat{x}_{n})+\frac{n^{-1}}{2}\xi^{T}\nabla^{\otimes 2}s_{i}^{j}(\hat{x}_{n}+\theta_{i,j}n^{-1/2}\xi)\xi\right)
=1nξj=1nsij(x^n)12n3/2ξT(j=1n2sij(x^n+θi,jn1/2ξ))ξ.\displaystyle=\frac{-1}{n}\xi\cdot\sum_{j=1}^{n}\nabla s^{j}_{i}(\hat{x}_{n})-\frac{1}{2n^{3/2}}\xi^{T}\left(\sum_{j=1}^{n}\nabla^{\otimes 2}s_{i}^{j}(\hat{x}_{n}+\theta_{i,j}n^{-1/2}\xi)\right)\xi.

Here we used that j=1nsij(x^n)=0\sum_{j=1}^{n}s^{j}_{i}(\hat{x}_{n})=0, since x^n\hat{x}_{n} is a stationary point of UnU_{n}.

A further Taylor’s expansion about x0x_{0} gives,

sij(x^n)=sij(x0)+2sij(x0+θi,j(x^nx0))(x^nx0),\nabla s^{j}_{i}(\hat{x}_{n})=\nabla s^{j}_{i}(x_{0})+\nabla^{\otimes 2}s^{j}_{i}(x_{0}+\theta^{\prime}_{i,j}(\hat{x}_{n}-x_{0}))(\hat{x}_{n}-x_{0}),

for some θi,j(0,1)\theta^{\prime}_{i,j}\in(0,1). Then,

\frac{\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})-\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})}{n^{1/2}} =\frac{-1}{n}\xi\cdot\sum_{j=1}^{n}\left(\nabla s^{j}_{i}(x_{0})+\nabla^{\otimes 2}s^{j}_{i}(x_{0}+\theta^{\prime}_{i,j}(\hat{x}_{n}-x_{0}))(\hat{x}_{n}-x_{0})\right)
\quad\quad-\frac{n^{-3/2}}{2}\xi^{T}\left(\sum_{j=1}^{n}\nabla^{\otimes 2}s_{i}^{j}(\hat{x}_{n}+\theta_{i,j}n^{-1/2}\xi)\right)\xi
=-\xi\cdot\left(\frac{1}{n}\sum_{j=1}^{n}\nabla s^{j}_{i}(x_{0})\right)
\quad\quad-\xi^{T}\left(\frac{1}{n}\sum_{j=1}^{n}\nabla^{\otimes 2}s^{j}_{i}(x_{0}+\theta^{\prime}_{i,j}(\hat{x}_{n}-x_{0}))\right)(\hat{x}_{n}-x_{0})
\quad\quad-\frac{n^{-1/2}}{2}\xi^{T}\left(\frac{1}{n}\sum_{j=1}^{n}\nabla^{\otimes 2}s_{i}^{j}(\hat{x}_{n}+\theta_{i,j}n^{-1/2}\xi)\right)\xi.

By the smoothness Assumption 2.1, the second and the third terms above converge to 0 uniformly over \xi in a compact set. Furthermore, the law of large numbers gives

λin(ξ,𝟏d)λin(ξ,𝟏d)n1/2ξ𝔼[Si(x0;Y)]in probability,\frac{\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})-\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})}{n^{1/2}}\to-\xi\cdot\mathbb{E}[\nabla S_{i}(x_{0};Y)]\quad\text{in }\mathbb{P}-\text{probability}, (C.7)

uniformly on compact sets. Next, observe that,

\frac{1}{nh^{n}_{i}(\xi)}=\frac{\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})+\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})}{n}=\frac{1}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left|\sum_{j\in S}s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi)\right|.

Consider,

\left|\frac{1}{nh^{n}_{i}(\xi)}-\frac{1}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left|\sum_{j\in S}s^{j}_{i}(x_{0})\right|\right|
\quad\quad\leq\frac{1}{n}\sum_{j=1}^{n}\left|s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi)-s^{j}_{i}(x_{0})\right|
\quad\quad=\frac{1}{n}\sum_{j=1}^{n}\left|\nabla s^{j}_{i}(x_{0})\cdot x^{\prime}_{n}+\frac{1}{2}(x^{\prime}_{n})^{T}\nabla^{\otimes 2}s^{j}_{i}(x_{0}+\theta_{i,j}x^{\prime}_{n})x^{\prime}_{n}\right|\quad\quad(x^{\prime}_{n}=\hat{x}_{n}+n^{-1/2}\xi-x_{0})
\quad\quad\leq\|x^{\prime}_{n}\|\cdot\left(\frac{1}{n}\sum_{j=1}^{n}\left\|\nabla s^{j}_{i}(x_{0})\right\|\right)+M^{\prime}\|x^{\prime}_{n}\|^{2}.

By the consistency of x^n\hat{x}_{n} (Assumption 2.4), xnx^{\prime}_{n} goes to 0 in \mathbb{P}-probability. As a result, the right hand side goes to 0 uniformly in ξ\xi. Then, by the law of large numbers for UU-statistics [3], for all i=1,,di=1,\dots,d,

1nhin(ξ)𝔼[|m1j=1mSi(x0;Yj)|],in probability,\frac{1}{nh^{n}_{i}(\xi)}\to\mathbb{E}\left[\left|m^{-1}\sum_{j=1}^{m}S_{i}(x_{0};Y_{j})\right|\right],\quad\text{in }\mathbb{P}-\text{probability},

uniformly on compact sets, where Y_{1},\dots,Y_{m}\overset{\text{iid}}{\sim}P. Since the limit is a strictly positive constant, 1/(nh^{n}_{i}) is eventually bounded away from 0 uniformly on compact sets, and so this implies,

nλin(ξ,𝟏d)+λin(ξ,𝟏d)=nhin(ξ)1𝔼[|m1j=1mSi(x0;Yj)|],in probability,\frac{n}{\lambda^{n}_{i}(\xi,-\boldsymbol{1}_{d})+\lambda^{n}_{i}(\xi,\boldsymbol{1}_{d})}=nh^{n}_{i}(\xi)\to\frac{1}{\mathbb{E}\left[\left|m^{-1}\sum_{j=1}^{m}S_{i}(x_{0};Y_{j})\right|\right]},\quad\text{in }\mathbb{P}-\text{probability}, (C.8)

uniformly on compact sets. Combining (C.7) and (C.8) gives the convergence of n^{1/2}b_{i}^{n}(\xi). Since i was arbitrary, the result follows. ∎

We will prove Theorem 4.6 following the approach of [9]. For each i=1,,di=1,\dots,d, define,

fin(ξ,v)=ξi+nvihin(ξ),(ξ,v)E.f^{n}_{i}(\xi,v)=\xi_{i}+\sqrt{n}\cdot v_{i}h^{n}_{i}(\xi),\quad(\xi,v)\in E.
Lemma C.3.

For each nn and ii, the function finf^{n}_{i} belongs to the domain of the extended generator n\mathcal{L}^{n} of the Zig-Zag process.

Proof.

Fix i and n arbitrarily. We first check that the function t\mapsto f^{n}_{i}(\xi+n^{1/2}vt,v) is absolutely continuous on [0,\infty) for all (\xi,v)\in\mathbb{R}^{d}\times\mathcal{V}. Given the form of f^{n}_{i}, it suffices to check that t\mapsto h^{n}_{i}(\xi+n^{1/2}vt) is absolutely continuous on [0,\infty) for all (\xi,v). But this holds given the smoothness assumptions on s and the fact that the reciprocal of h^{n}_{i} is non-zero everywhere.

Let T1,T2,T_{1},T_{2},\dots be the switching times for the Zig-Zag Zn=(ξtn,vtn)t0Z^{n}=(\xi^{n}_{t},v^{n}_{t})_{t\geq 0}. The jump measure of the process is given by,

\mu=\sum_{k=1}^{\infty}\delta_{(Z^{n}_{T_{k}},T_{k})}.

Define 𝔅fn\mathfrak{B}f^{n} by,

𝔅fin(ξ,v,s)=fin(ξ,v)fin(ξsn,vsn).\mathfrak{B}f^{n}_{i}(\xi,v,s)=f^{n}_{i}(\xi,v)-f^{n}_{i}(\xi^{n}_{s^{-}},v^{n}_{s^{-}}).

Then for any t0t\geq 0,

\int_{E\times[0,t)}|\mathfrak{B}f^{n}_{i}(z,s)|\mu(dz,ds) =\sum_{T_{k}\leq t}|f^{n}_{i}(Z_{T_{k}})-f^{n}_{i}(Z_{T_{k}^{-}})|
=\sum_{T_{k}\leq t}|f^{n}_{i}(\xi^{n}_{T_{k}},v^{n}_{T_{k}})-f^{n}_{i}(\xi^{n}_{T_{k}},v^{n}_{T_{k-1}})|
=\sum_{T_{k}\leq t}\sqrt{n}\,h^{n}_{i}(\xi^{n}_{T_{k}})\left|(v^{n}_{T_{k}})_{i}-(v^{n}_{T_{k-1}})_{i}\right|
\leq 2\sqrt{n}\sum_{T_{k}\leq t}h^{n}_{i}(\xi^{n}_{T_{k}}).

And thus,

𝔼E×[0,t)|𝔅fin(z,s)|μ(dz,ds)2n𝔼Tkthin(ξTkn)<\mathbb{E}\int_{E\times[0,t)}|\mathfrak{B}f^{n}_{i}(z,s)|\mu(dz,ds)\leq 2\sqrt{n}\mathbb{E}\sum_{T_{k}\leq t}h^{n}_{i}(\xi^{n}_{T_{k}})<\infty

because h^{n}_{i} is finite everywhere and the switching rates are continuous, hence bounded along the trajectory on [0,t], so the expected number of switches up to time t is finite. By [21, Theorem 5.5], f^{n}_{i} belongs to the domain of the generator. ∎

Observe that for all ii,

fin(ξ,Fj(v))fin(ξ,v)={0;ji,2n1/2vihin(ξ);j=i.f^{n}_{i}(\xi,F_{j}(v))-f^{n}_{i}(\xi,v)=\begin{cases}0;&j\neq i,\\ -2n^{1/2}v_{i}h^{n}_{i}(\xi);&j=i.\end{cases}

Let \textbf{D} denote the weak derivative operator with respect to the \xi co-ordinate. For i=1,\dots,d, let e_{i}\in\mathbb{R}^{d} denote the i-th basis vector, i.e. (e_{i})_{i}=1 and (e_{i})_{j}=0 for j\neq i. Then, for all i, \textbf{D}f^{n}_{i}(\xi,v)=e_{i}+n^{1/2}v_{i}\cdot\textbf{D}h^{n}_{i}(\xi), and therefore,

nfin(ξ,v)\displaystyle\mathcal{L}^{n}f^{n}_{i}(\xi,v) =n1/2vDfin(ξ,v)+j=1dλjn(ξ,v){fin(ξ,Fj(v))fin(ξ,v)}\displaystyle=n^{1/2}v\cdot\textbf{D}f^{n}_{i}(\xi,v)+\sum_{j=1}^{d}\lambda^{n}_{j}(\xi,v)\{f^{n}_{i}(\xi,F_{j}(v))-f^{n}_{i}(\xi,v)\}
=n1/2v(ei+n1/2viDhin(ξ))+λin(ξ,v){2n1/2vihin(ξ)}\displaystyle=n^{1/2}v\cdot(e_{i}+n^{1/2}v_{i}\cdot\textbf{D}h^{n}_{i}(\xi))+\lambda^{n}_{i}(\xi,v)\{-2n^{1/2}v_{i}h^{n}_{i}(\xi)\}
=n1/2(vi2viλin(ξ,v)hin(ξ))+nvi(vDhin(ξ))\displaystyle=n^{1/2}(v_{i}-2v_{i}\lambda^{n}_{i}(\xi,v)h^{n}_{i}(\xi))+nv_{i}(v\cdot\textbf{D}h^{n}_{i}(\xi))
=n1/2vi(λin(ξ,Fi(v))λin(ξ,v)λin(ξ,Fi(v))+λin(ξ,v))+nvi(vDhin(ξ))\displaystyle=n^{1/2}v_{i}\left(\frac{\lambda^{n}_{i}(\xi,F_{i}(v))-\lambda^{n}_{i}(\xi,v)}{\lambda^{n}_{i}(\xi,F_{i}(v))+\lambda^{n}_{i}(\xi,v)}\right)+nv_{i}(v\cdot\textbf{D}h^{n}_{i}(\xi))
=n1/2bin(ξ)+nvi(vDhin(ξ)).\displaystyle=n^{1/2}b^{n}_{i}(\xi)+nv_{i}(v\cdot\textbf{D}h^{n}_{i}(\xi)).
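The step into the penultimate display rearranges v_{i}-2v_{i}\lambda^{n}_{i}(\xi,v)h^{n}_{i}(\xi) into v_{i}(\lambda^{n}_{i}(\xi,F_{i}(v))-\lambda^{n}_{i}(\xi,v))h^{n}_{i}(\xi), using that h^{n}_{i}(\xi)=1/(\lambda^{n}_{i}(\xi,F_{i}(v))+\lambda^{n}_{i}(\xi,v)). A numerical sketch of this algebra, with hypothetical rate values lam_v and lam_f:

```python
# Sanity check of the identity v_i - 2*v_i*lam_v*h == v_i*(lam_f - lam_v)*h,
# where lam_v, lam_f stand for the rates at (xi, v) and (xi, F_i(v)),
# and h = 1/(lam_f + lam_v). Values are hypothetical.
lam_v, lam_f = 0.4, 1.7
h = 1.0 / (lam_f + lam_v)
for v_i in (-1.0, 1.0):
    lhs = v_i - 2.0 * v_i * lam_v * h
    rhs = v_i * (lam_f - lam_v) * h
    assert abs(lhs - rhs) < 1e-12
```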

Define,

Ytn,i:=fin(ξtn,vtn)andjtn,i:=nfin(ξtn,vtn)=n1/2bin(ξtn)+n(vtn)i(vtnDhin(ξtn)).Y^{n,i}_{t}:=f^{n}_{i}(\xi^{n}_{t},v^{n}_{t})\quad\text{and}\quad j^{n,i}_{t}:=\mathcal{L}^{n}f^{n}_{i}(\xi^{n}_{t},v^{n}_{t})=n^{1/2}b^{n}_{i}(\xi^{n}_{t})+n(v^{n}_{t})_{i}(v^{n}_{t}\cdot\textbf{D}h^{n}_{i}(\xi^{n}_{t})).

It follows that, for all i=1,,di=1,\dots,d,

Mtn,i=Ytn,i0tjsn,i𝑑s,M^{n,i}_{t}=Y^{n,i}_{t}-\int_{0}^{t}j^{n,i}_{s}\ ds,

is a local martingale with respect to the filtration tn\mathcal{F}^{n}_{t} generated by (ξtn,vtn)t0(\xi^{n}_{t},v^{n}_{t})_{t\geq 0}. Consequently, the vector process Mtn=(Mtn,1,,Mtn,d)M^{n}_{t}=(M^{n,1}_{t},\dots,M^{n,d}_{t}) is a local martingale. For all i,ji,j, define,

gijn(ξ,v):=fin(ξ,v)fjn(ξ,v).g^{n}_{ij}(\xi,v):=f^{n}_{i}(\xi,v)f^{n}_{j}(\xi,v).

Then gijng^{n}_{ij} belongs to the domain of n\mathcal{L}^{n}. We get that,

Ntn,ij:=Ytn,iYtn,j0tngijn(ξsn,vsn)𝑑s,N^{n,ij}_{t}:=Y^{n,i}_{t}Y^{n,j}_{t}-\int_{0}^{t}\mathcal{L}^{n}g^{n}_{ij}(\xi^{n}_{s},v^{n}_{s})ds,

is a local martingale with respect to the filtration tn\mathcal{F}^{n}_{t}. We now decompose the outer product of the local martingale MtnM^{n}_{t} into a local martingale term and a remainder. To this end, for all ii, define Jtn,i=0tjsn,i𝑑sJ^{n,i}_{t}=\int_{0}^{t}j^{n,i}_{s}\ ds. Using integration by parts,

Mtn,iMtn,j\displaystyle M^{n,i}_{t}M^{n,j}_{t} =(Ytn,iJtn,i)(Ytn,jJtn,j)\displaystyle=(Y^{n,i}_{t}-J^{n,i}_{t})(Y^{n,j}_{t}-J^{n,j}_{t})
=Ytn,iYtn,jJtn,iYtn,jYtn,iJtn,j+Jtn,iJtn,j\displaystyle=Y^{n,i}_{t}Y^{n,j}_{t}-J^{n,i}_{t}Y^{n,j}_{t}-Y^{n,i}_{t}J^{n,j}_{t}+J^{n,i}_{t}J^{n,j}_{t}
=Ytn,iYtn,j0tJsn,i𝑑Ysn,j0tYsn,jjsn,i𝑑s0tJsn,j𝑑Ysn,i0tYsn,ijsn,j𝑑s\displaystyle=Y^{n,i}_{t}Y^{n,j}_{t}-\int_{0}^{t}J^{n,i}_{s}\ dY^{n,j}_{s}-\int_{0}^{t}Y^{n,j}_{s}j^{n,i}_{s}\ ds-\int_{0}^{t}J^{n,j}_{s}\ dY^{n,i}_{s}-\int_{0}^{t}Y^{n,i}_{s}j^{n,j}_{s}\ ds
+0tJsn,ijsn,j𝑑s+0tJsn,jjsn,i𝑑s\displaystyle\quad\quad+\int_{0}^{t}J^{n,i}_{s}j^{n,j}_{s}\ ds+\int_{0}^{t}J^{n,j}_{s}j^{n,i}_{s}\ ds
=Ytn,iYtn,j0tYsn,jjsn,i𝑑s0tYsn,ijsn,j𝑑s0tJsn,i𝑑Msn,j0tJsn,j𝑑Msn,i\displaystyle=Y^{n,i}_{t}Y^{n,j}_{t}-\int_{0}^{t}Y^{n,j}_{s}j^{n,i}_{s}\ ds-\int_{0}^{t}Y^{n,i}_{s}j^{n,j}_{s}\ ds-\int_{0}^{t}J^{n,i}_{s}\ dM^{n,j}_{s}-\int_{0}^{t}J^{n,j}_{s}\ dM^{n,i}_{s}
=Ntn,ij0tJsn,i𝑑Msn,j0tJsn,j𝑑Msn,i\displaystyle=N^{n,ij}_{t}-\int_{0}^{t}J^{n,i}_{s}\ dM^{n,j}_{s}-\int_{0}^{t}J^{n,j}_{s}\ dM^{n,i}_{s}
+0tngijn(ξsn,vsn)𝑑s0tYsn,jjsn,i𝑑s0tYsn,ijsn,j𝑑s.\displaystyle\quad\quad+\int_{0}^{t}\mathcal{L}^{n}g^{n}_{ij}(\xi^{n}_{s},v^{n}_{s})ds-\int_{0}^{t}Y^{n,j}_{s}j^{n,i}_{s}\ ds-\int_{0}^{t}Y^{n,i}_{s}j^{n,j}_{s}\ ds.

It follows that, for all i,ji,j,

Htn,ij:=Mtn,iMtn,j0tngijn(ξsn,vsn)𝑑s+0tYsn,jjsn,i𝑑s+0tYsn,ijsn,j𝑑s,H^{n,ij}_{t}:=M^{n,i}_{t}M^{n,j}_{t}-\int_{0}^{t}\mathcal{L}^{n}g^{n}_{ij}(\xi^{n}_{s},v^{n}_{s})ds+\int_{0}^{t}Y^{n,j}_{s}j^{n,i}_{s}\ ds+\int_{0}^{t}Y^{n,i}_{s}j^{n,j}_{s}\ ds,

is a local martingale with respect to the filtration tn\mathcal{F}^{n}_{t}.

The weak convergence result will be proven using Theorem 4.1, Chapter 7 of [23]. To this end, define, for all nn, Btn:=ξtnMtnB^{n}_{t}:=\xi^{n}_{t}-M^{n}_{t}. Write Bn,iB^{n,i} for the ii-th component. Then,

Btn,i=n(vtn)ihin(ξtn)+0tjsn,i𝑑s,i=1,,d.B^{n,i}_{t}=-\sqrt{n}(v^{n}_{t})_{i}h^{n}_{i}(\xi^{n}_{t})+\int_{0}^{t}j^{n,i}_{s}\ ds,\quad i=1,\dots,d.

By properties of the Zig-Zag process, BtnB^{n}_{t} has right-continuous paths. Moreover, for all ii, nn, and t0t\geq 0,

|B^{n,i}_{t}-B^{n,i}_{t^{-}}|=|-\sqrt{n}(v^{n}_{t})_{i}h^{n}_{i}(\xi^{n}_{t})+\sqrt{n}(v^{n}_{t^{-}})_{i}h^{n}_{i}(\xi^{n}_{t})|\leq 2\sqrt{n}h^{n}_{i}(\xi^{n}_{t}).

Then, for any compact set KdK\subset\mathbb{R}^{d}, it follows from the proof of Lemma C.2 that,

supξKBtnBtn2i=1dsupξK4n(hin(ξ))20,\sup_{\xi\in K}\|B^{n}_{t}-B^{n}_{t^{-}}\|^{2}\leq\sum_{i=1}^{d}\sup_{\xi\in K}4n(h^{n}_{i}(\xi))^{2}\to 0,
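To spell out this step, by the uniform convergence (C.8), nh^{n}_{i} is uniformly bounded in probability on K, so

```latex
\sup_{\xi\in K}4n\big(h^{n}_{i}(\xi)\big)^{2}
  \;=\;\frac{4}{n}\,\sup_{\xi\in K}\big(nh^{n}_{i}(\xi)\big)^{2}
  \;\longrightarrow\;0,\quad\text{in }\mathbb{P}-\text{probability}.
```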

as nn\to\infty. Also define, for all nn, a symmetric d×dd\times d matrix valued process Atn=((Atn,ij))A^{n}_{t}=((A^{n,ij}_{t})) as,

Atn,ij\displaystyle A^{n,ij}_{t} =Mtn,iMtn,jHtn,ij(i,j=1,,d)\displaystyle=M^{n,i}_{t}M^{n,j}_{t}-H^{n,ij}_{t}\quad\quad(i,j=1,\dots,d)
=0tngijn(ξsn,vsn)𝑑s0tYsn,jjsn,i𝑑s0tYsn,ijsn,j𝑑s.\displaystyle=\int_{0}^{t}\mathcal{L}^{n}g^{n}_{ij}(\xi^{n}_{s},v^{n}_{s})ds-\int_{0}^{t}Y^{n,j}_{s}j^{n,i}_{s}\ ds-\int_{0}^{t}Y^{n,i}_{s}j^{n,j}_{s}\ ds.

Recall that, jtn,i=nfin(ξtn,vtn)j^{n,i}_{t}=\mathcal{L}^{n}f^{n}_{i}(\xi^{n}_{t},v^{n}_{t}). Moreover,

ngijn(ξ,v)\displaystyle\mathcal{L}^{n}g^{n}_{ij}(\xi,v) =n1/2vDgijn(ξ,v)+k=1dλkn(ξ,v){gijn(ξ,Fk(v))gijn(ξ,v)}\displaystyle=n^{1/2}v\cdot\textbf{D}g^{n}_{ij}(\xi,v)+\sum_{k=1}^{d}\lambda^{n}_{k}(\xi,v)\{g^{n}_{ij}(\xi,F_{k}(v))-g^{n}_{ij}(\xi,v)\}

Calculations yield, for jij\neq i, Atn,ij=0A^{n,ij}_{t}=0 for all tt and nn. Additionally, for i=1,,di=1,\dots,d,

Atn,ii=0t4nλin(ξsn,vsn)(hin(ξsn))2𝑑s=0t(2nhin(ξsn)2n(vsn)ihin(ξsn)bin(ξsn))𝑑s.A^{n,ii}_{t}=\int_{0}^{t}4n\lambda^{n}_{i}(\xi^{n}_{s},v^{n}_{s})(h^{n}_{i}(\xi^{n}_{s}))^{2}\ ds=\int_{0}^{t}\left(2nh^{n}_{i}(\xi^{n}_{s})-2n(v^{n}_{s})_{i}h^{n}_{i}(\xi^{n}_{s})b^{n}_{i}(\xi^{n}_{s})\right)\ ds.
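The second expression for A^{n,ii}_{t} follows from the pointwise identity 4\lambda h^{2}=2h-2v_{i}hb with b=v_{i}(\lambda^{F}-\lambda)h and h=1/(\lambda^{F}+\lambda). A numerical sketch of this identity, with hypothetical rates:

```python
# Check of the rewriting 4*lam_v*h**2 == 2*h - 2*v_i*h*b, where
# b = v_i*(lam_f - lam_v)*h and h = 1/(lam_f + lam_v); lam_v, lam_f
# are hypothetical rates at (xi, v) and (xi, F_i(v)).
lam_v, lam_f = 0.9, 2.3
h = 1.0 / (lam_f + lam_v)
for v_i in (-1.0, 1.0):
    b = v_i * (lam_f - lam_v) * h
    lhs = 4.0 * lam_v * h ** 2
    rhs = 2.0 * h - 2.0 * v_i * h * b
    assert abs(lhs - rhs) < 1e-12
```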

Since the integrand 4n\lambda^{n}_{i}(h^{n}_{i})^{2} in the middle expression is non-negative, it further follows that for all t>s\geq 0, A^{n}_{t}-A^{n}_{s} is non-negative definite.

Also define, b:ddb:\mathbb{R}^{d}\to\mathbb{R}^{d} by,

bi(ξ)=ξ𝔼[Si(x0;Y)]𝔼[|m1j=1mSi(x0;Yj)|],ξd,i=1,,d,b_{i}(\xi)=\frac{-\xi\cdot\mathbb{E}[\nabla S_{i}(x_{0};Y)]}{\mathbb{E}\left[\left|m^{-1}\sum_{j=1}^{m}S_{i}(x_{0};Y_{j})\right|\right]},\quad\xi\in\mathbb{R}^{d},i=1,\dots,d,

where Y,Y1,,YmiidPY,Y_{1},\dots,Y_{m}\overset{\text{iid}}{\sim}P. Let AA be a d×dd\times d diagonal matrix with ii-th diagonal entry, AiiA_{ii}, given by

Aii=2𝔼[|m1j=1mSi(x0;Yj)|].A_{ii}=\frac{2}{\mathbb{E}\left[\left|m^{-1}\sum_{j=1}^{m}S_{i}(x_{0};Y_{j})\right|\right]}.

Define the stopping time \tau^{n}_{r}:=\inf\{t\geq 0:\|\xi^{n}_{t}\|\geq r\text{ or }\|\xi^{n}_{t^{-}}\|\geq r\}. Let T,R\geq 0 and t\leq T\wedge\tau^{n}_{R}. For all i\neq j, we trivially have A^{n,ij}_{t}-tA_{ij}=0. For any fixed i, consider,

|Atn,ii0tAii𝑑s|\displaystyle\left|A^{n,ii}_{t}-\int_{0}^{t}A_{ii}\ ds\right| =|0t(2nhin(ξsn)2n(vsn)ihin(ξsn)bin(ξsn))𝑑stAii|\displaystyle=\left|\int_{0}^{t}\left(2nh^{n}_{i}(\xi^{n}_{s})-2n(v^{n}_{s})_{i}h^{n}_{i}(\xi^{n}_{s})b^{n}_{i}(\xi^{n}_{s})\right)\ ds-tA_{ii}\right|
|0t2nhin(ξsn)𝑑stAii|+|0t2n(vsn)ihin(ξsn)bin(ξsn)𝑑s|\displaystyle\leq\left|\int_{0}^{t}2nh^{n}_{i}(\xi^{n}_{s})\ ds-tA_{ii}\right|+\left|\int_{0}^{t}2n(v^{n}_{s})_{i}h^{n}_{i}(\xi^{n}_{s})b^{n}_{i}(\xi^{n}_{s})\ ds\right|
0t|2nhin(ξsn)Aii|𝑑s+0t|2nhin(ξsn)bin(ξsn)|𝑑s.\displaystyle\leq\int_{0}^{t}\left|2nh^{n}_{i}(\xi^{n}_{s})-A_{ii}\right|\ ds+\int_{0}^{t}\left|2nh^{n}_{i}(\xi^{n}_{s})b^{n}_{i}(\xi^{n}_{s})\right|\ ds.

This gives,

suptTτRn|Atn,ii0tAii𝑑s|TsupξR|2nhin(ξ)Aii|+TsupξR|2nhin(ξ)bin(ξ)|.\sup_{t\leq T\wedge\tau^{n}_{R}}\left|A^{n,ii}_{t}-\int_{0}^{t}A_{ii}\ ds\right|\leq T\sup_{\|\xi\|\leq R}\left|2nh^{n}_{i}(\xi)-A_{ii}\right|+T\sup_{\|\xi\|\leq R}\left|2nh^{n}_{i}(\xi)b^{n}_{i}(\xi)\right|.

From the proof of Lemma C.2, it follows that each term on the right hand side converges to 0 in \mathbb{P}^{\otimes\mathbb{N}}-probability. Thus, for all i,ji,j,

suptTτRn|Atn,ij0tAij𝑑s|n0,in probability.\sup_{t\leq T\wedge\tau^{n}_{R}}\left|A^{n,ij}_{t}-\int_{0}^{t}A_{ij}\ ds\right|\xrightarrow{n\to\infty}0,\quad\text{in }\mathbb{P}-\text{probability}. (C.9)

Secondly, consider,

|Btn,i0tbi(ξsn)𝑑s|\displaystyle\left|B^{n,i}_{t}-\int_{0}^{t}b_{i}(\xi^{n}_{s})\ ds\right|
=|n(vtn)ihin(ξtn)+0tn1/2bin(ξsn)+n(vsn)i(vsnDhin(ξsn))ds0tbi(ξsn)𝑑s|\displaystyle\quad\quad=\left|-\sqrt{n}(v^{n}_{t})_{i}h^{n}_{i}(\xi^{n}_{t})+\int_{0}^{t}n^{1/2}b^{n}_{i}(\xi^{n}_{s})+n(v^{n}_{s})_{i}(v^{n}_{s}\cdot\textbf{D}h^{n}_{i}(\xi^{n}_{s}))\ ds-\int_{0}^{t}b_{i}(\xi^{n}_{s})\ ds\right|
|0tn1/2bin(ξsn)bi(ξsn)ds|+|0tn(vsn)i(vsnDhin(ξsn))𝑑sn(vtn)ihin(ξtn)|\displaystyle\quad\quad\leq\left|\int_{0}^{t}n^{1/2}b^{n}_{i}(\xi^{n}_{s})-b_{i}(\xi^{n}_{s})\ ds\right|+\left|\int_{0}^{t}n(v^{n}_{s})_{i}(v^{n}_{s}\cdot\textbf{D}h^{n}_{i}(\xi^{n}_{s}))\ ds-\sqrt{n}(v^{n}_{t})_{i}h^{n}_{i}(\xi^{n}_{t})\right|
0t|n1/2bin(ξsn)bi(ξsn)|𝑑s+0tndDhin(ξsn)𝑑s+|n1/2hin(ξtn)|\displaystyle\quad\quad\leq\int_{0}^{t}\left|n^{1/2}b^{n}_{i}(\xi^{n}_{s})-b_{i}(\xi^{n}_{s})\right|\ ds+\int_{0}^{t}n\sqrt{d}\left\|\textbf{D}h^{n}_{i}(\xi^{n}_{s})\right\|\ ds+\left|n^{1/2}h^{n}_{i}(\xi^{n}_{t})\right|

This gives,

suptTτRn|Btn,i0tbi(ξsn)𝑑s|\displaystyle\sup_{t\leq T\wedge\tau^{n}_{R}}\left|B^{n,i}_{t}-\int_{0}^{t}b_{i}(\xi^{n}_{s})\ ds\right| TsupξR|n1/2bin(ξ)bi(ξ)|+TdsupξRnDhin(ξ)\displaystyle\leq T\sup_{\|\xi\|\leq R}\left|n^{1/2}b^{n}_{i}(\xi)-b_{i}(\xi)\right|+T\sqrt{d}\sup_{\|\xi\|\leq R}\left\|n\textbf{D}h^{n}_{i}(\xi)\right\|
+TsupξR|n1/2hin(ξ)|.\displaystyle\quad\quad\quad+T\sup_{\|\xi\|\leq R}\left|n^{1/2}h^{n}_{i}(\xi)\right|.

Once again from the proof of Lemma C.2, the first and the third terms on the right hand side go to 0 in probability. Let \textbf{D}_{k} denote the k-th partial weak derivative operator. For S\in\mathcal{S}_{(n,m)}, let E_{S}(\xi) denote the sum \sum_{j\in S}s_{i}^{j}(\hat{x}_{n}+n^{-1/2}\xi). Let k be arbitrary. Then,

\textbf{D}_{k}h^{n}_{i}(\xi) =\textbf{D}_{k}\left(\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left|E_{S}(\xi)\right|\right)^{-1}
=-(h_{i}^{n}(\xi))^{2}\cdot\left(\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\textbf{D}_{k}\left(\left|E_{S}(\xi)\right|\right)\right)
=-(h_{i}^{n}(\xi))^{2}\cdot\left(\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\mathrm{sgn}(E_{S})\cdot\textbf{D}_{k}(E_{S})\right)
=-(h_{i}^{n}(\xi))^{2}\cdot\left(\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\mathrm{sgn}(E_{S})\cdot n^{-1/2}\sum_{j\in S}s^{\prime j}_{ki}(\hat{x}_{n}+n^{-1/2}\xi)\right),

where sj(x)=2logf(yj;x)s^{\prime j}(x)=-\nabla^{\otimes 2}\log f(y_{j};x). The above gives,

n|Dkhin(ξ)|\displaystyle n|\textbf{D}_{k}h^{n}_{i}(\xi)| n1/2|hin(ξ)|2(j=1n|skij(x^n+n1/2ξ)|)\displaystyle\leq n^{1/2}|h_{i}^{n}(\xi)|^{2}\left(\sum_{j=1}^{n}\left|s^{\prime j}_{ki}(\hat{x}_{n}+n^{-1/2}\xi)\right|\right)
n1/2|hin(ξ)|2(j=1n|skij(x0)|+|skij(x0+θjxn)xn|)\displaystyle\leq n^{1/2}|h_{i}^{n}(\xi)|^{2}\left(\sum_{j=1}^{n}|s^{\prime j}_{ki}(x_{0})|+|\nabla s^{\prime j}_{ki}(x_{0}+\theta^{j}x^{\prime}_{n})\cdot x^{\prime}_{n}|\right)
n3/2|hin(ξ)|2(1nj=1n|skij(x0)|+Mxn),\displaystyle\leq n^{3/2}|h_{i}^{n}(\xi)|^{2}\left(\frac{1}{n}\sum_{j=1}^{n}|s^{\prime j}_{ki}(x_{0})|+M^{\prime}\|x^{\prime}_{n}\|\right),

where x^{\prime}_{n}=\hat{x}_{n}+n^{-1/2}\xi-x_{0}. By Assumption 2.4, x^{\prime}_{n} converges to 0 in probability, uniformly over \xi in a compact set. By the law of large numbers and the moment assumption (Assumption 2.2), the term inside the parentheses converges to a finite quantity. Also, n^{3/2}|h^{n}_{i}|^{2} goes to 0 uniformly from the proof of Lemma C.2. Since the above is true for arbitrary k, it follows that n\|\textbf{D}h^{n}_{i}(\xi)\| goes to 0 uniformly in \xi in probability. Thus, for all i,

suptTτRn|Btn,i0tbi(ξsn)𝑑s|n0,in probability.\sup_{t\leq T\wedge\tau^{n}_{R}}\left|B^{n,i}_{t}-\int_{0}^{t}b_{i}(\xi^{n}_{s})\ ds\right|\xrightarrow{n\to\infty}0,\quad\text{in }\mathbb{P}-\text{probability}. (C.10)

For any subsequence (nk)(n_{k}), let (nkl)(n_{k_{l}}) be a further sub-sequence such that the convergence in (C.9) and (C.10) is almost sure. Then by Theorem 4.1, Chapter 7 of [23], along subsequence (nkl)(n_{k_{l}}), (ξtn)(\xi^{n}_{t}) converges weakly in Skorohod topology to the solution of the martingale problem of the operator (𝒜,𝒟(𝒜))(\mathcal{A},\mathcal{D}(\mathcal{A})), where 𝒟(𝒜)=Cc2(d)\mathcal{D}(\mathcal{A})=C^{2}_{c}(\mathbb{R}^{d}) and for h𝒟(𝒜)h\in\mathcal{D}(\mathcal{A}),

𝒜h(ξ)=b(ξ)h(ξ)+12Tr(2h(ξ)A).\mathcal{A}h(\xi)=b(\xi)\cdot\nabla h(\xi)+\frac{1}{2}\text{Tr}(\nabla^{\otimes 2}h(\xi)A).

Since the martingale problem above is well-posed, the result follows. ∎

C.3 Proof of Theorems 4.6 and 4.11

From Section 4, the i-th ZZ-CV switching rate in the \xi parametrization is,

λcv,in(ξ,v)=nm|𝒮(n,m)|S𝒮(n,m)(vijSEij(ξ))+,\lambda^{n}_{\text{cv},i}(\xi,v)=\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot\sum_{j\in S}E_{i}^{j}(\xi)\right)_{+}, (C.11)

where,

Eij(ξ)=Eij(x(ξ))=sij(x^n+n1/2ξ)sij(x^n+n1/2ξn)+n1k=1nsik(x^n+n1/2ξn).E_{i}^{j}(\xi)=E_{i}^{j}(x(\xi))=s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi)-s^{j}_{i}(\hat{x}_{n}+n^{-1/2}\xi_{n}^{*})+n^{-1}\sum_{k=1}^{n}s^{k}_{i}(\hat{x}_{n}+n^{-1/2}\xi_{n}^{*}).

Let sj(x)=2logf(yj;x)s^{\prime j}(x)=-\nabla^{\otimes 2}\log f(y_{j};x) for all jj. Then sij\nabla s^{j}_{i} is the ii-th column of sjs^{\prime j}. Let n\mathcal{L}^{n} be the infinitesimal generator for the rescaled process UnU^{n}. Then,

\mathcal{L}^{n}f(\xi,v)=v\cdot\nabla_{\xi}f(\xi,v)+\sum_{i=1}^{d}n^{-1/2}\lambda^{n}_{\text{cv},i}(\xi,v)\{f(\xi,F_{i}(v))-f(\xi,v)\},

for arbitrary test function ff. Given ξ\xi^{*}, the infinitesimal generator \mathcal{L} of the Zig-Zag process with switching rates given by the equation (4.5) is,

\mathcal{L}f(\xi,v)=v\cdot\nabla_{\xi}f(\xi,v)+\sum_{i=1}^{d}\lambda_{\text{cv},i}(\xi,v|\xi^{*})\{f(\xi,F_{i}(v))-f(\xi,v)\}.

We will establish uniform convergence of the sequence of generators n\mathcal{L}^{n} to \mathcal{L}. From the expressions for both n\mathcal{L}^{n} and \mathcal{L}, it is enough to show the uniform convergence of switching rates n1/2λcv,inn^{-1/2}\lambda^{n}_{\text{cv},i} to λcv,i\lambda_{\text{cv},i} for all ii. Towards that end, define, for i=1,,di=1,\dots,d,

Fij(ξ)=ξsij(x^n)ξn(sij(x^n)Jin(x^n)),j=1,,n,F_{i}^{j}(\xi)=\xi\cdot\nabla s^{j}_{i}(\hat{x}_{n})-\xi_{n}^{*}\cdot\left(\nabla s^{j}_{i}(\hat{x}_{n})-J^{n}_{\cdot i}(\hat{x}_{n})\right),\quad j=1,\dots,n, (C.12)

and correspondingly,

λ~cv,in(ξ,v)=1m|𝒮(n,m)|S𝒮(n,m)(vijSFij(ξ))+.\tilde{\lambda}^{n}_{\text{cv},i}(\xi,v)=\frac{1}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot\sum_{j\in S}F_{i}^{j}(\xi)\right)_{+}.
Lemma C.4.

Suppose Assumptions 2.1 - 2.4 all hold. Let KdK\subset\mathbb{R}^{d} be compact. Then, for all i=1,,di=1,\dots,d,

supξK|n1/2λcv,in(ξ,v)λ~cv,in(ξ,v)|0,in probability.\sup_{\xi\in K}\left|n^{-1/2}\lambda^{n}_{\text{cv},i}(\xi,v)-\tilde{\lambda}^{n}_{\text{cv},i}(\xi,v)\right|\to 0,\quad\text{in }\mathbb{P}-\text{probability}.
Proof.

Let ii and nn be arbitrary. Observe that,

|n1/2λcv,in(ξ,v)λ~cv,in(ξ,v)|\displaystyle\left|n^{-1/2}\lambda^{n}_{\text{cv},i}(\xi,v)-\tilde{\lambda}^{n}_{\text{cv},i}(\xi,v)\right|
=|n1/2(nm|𝒮(n,m)|S𝒮(n,m)(vijSEij(ξ))+)1m|𝒮(n,m)|S𝒮(n,m)(vijSFij(ξ))+|\displaystyle\quad\quad=\left|n^{-1/2}\left(\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot\sum_{j\in S}E_{i}^{j}(\xi)\right)_{+}\right)-\frac{1}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot\sum_{j\in S}F_{i}^{j}(\xi)\right)_{+}\right|
=nm|𝒮(n,m)||S𝒮(n,m)(vijSn1/2Eij(ξ))+S𝒮(n,m)(vijSn1Fij(ξ))+|\displaystyle\quad\quad=\frac{n}{m|\mathcal{S}_{(n,m)}|}\left|\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot\sum_{j\in S}n^{-1/2}E_{i}^{j}(\xi)\right)_{+}-\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot\sum_{j\in S}n^{-1}F_{i}^{j}(\xi)\right)_{+}\right|
nm|𝒮(n,m)|S𝒮(n,m)|(vijSn1/2Eij(ξ))+(vijSn1Fij(ξ))+|\displaystyle\quad\quad\leq\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left|\left(v_{i}\cdot\sum_{j\in S}n^{-1/2}E_{i}^{j}(\xi)\right)_{+}-\left(v_{i}\cdot\sum_{j\in S}n^{-1}F_{i}^{j}(\xi)\right)_{+}\right|
\quad\quad\leq\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left|v_{i}\cdot\sum_{j\in S}n^{-1/2}E_{i}^{j}(\xi)-v_{i}\cdot\sum_{j\in S}n^{-1}F_{i}^{j}(\xi)\right|
nm|𝒮(n,m)|S𝒮(n,m)jS|n1/2Eij(ξ)n1Fij(ξ)|.\displaystyle\quad\quad\leq\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\sum_{j\in S}\left|n^{-1/2}E_{i}^{j}(\xi)-n^{-1}F_{i}^{j}(\xi)\right|.
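The penultimate inequality uses that the positive part is 1-Lipschitz, |a_{+}-b_{+}|\leq|a-b|. A quick grid check over all sign patterns:

```python
# The positive part x -> max(x, 0) is 1-Lipschitz:
# |max(a,0) - max(b,0)| <= |a - b|, checked on a small grid of values.
vals = [-2.0, -0.5, 0.0, 0.5, 2.0]
for a in vals:
    for b in vals:
        assert abs(max(a, 0.0) - max(b, 0.0)) <= abs(a - b) + 1e-15
```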

Using a second-order Taylor approximation about x^n\hat{x}_{n}, we get, for all i,ji,j,

sij(x^n+n1/2ξ)=sij(x^n)+n1/2ξsij(x^n)+n12ξT2sij(x^n+θi,jn1/2ξ)ξ,\displaystyle s_{i}^{j}(\hat{x}_{n}+n^{-1/2}\xi)=s_{i}^{j}(\hat{x}_{n})+n^{-1/2}\xi\cdot\nabla s_{i}^{j}(\hat{x}_{n})+\frac{n^{-1}}{2}\xi^{T}\nabla^{\otimes 2}s_{i}^{j}(\hat{x}_{n}+\theta_{i,j}n^{-1/2}\xi)\xi,
sij(x^n+n1/2ξn)=sij(x^n)+n1/2ξnsij(x^n)+n12(ξn)T2sij(x^n+θi,jn1/2ξn)ξn\displaystyle s_{i}^{j}(\hat{x}_{n}+n^{-1/2}\xi^{*}_{n})=s_{i}^{j}(\hat{x}_{n})+n^{-1/2}\xi_{n}^{*}\cdot\nabla s_{i}^{j}(\hat{x}_{n})+\frac{n^{-1}}{2}(\xi_{n}^{*})^{T}\nabla^{\otimes 2}s_{i}^{j}(\hat{x}_{n}+\theta^{*}_{i,j}n^{-1/2}\xi^{*}_{n})\xi_{n}^{*}

where θi,j,θi,j(0,1)\theta_{i,j},\theta^{*}_{i,j}\in(0,1). Using the above, Eij(ξ)E^{j}_{i}(\xi) can be re-written as,

Eij(ξ)=n1/2Fij(ξ)+Rij(ξ),E^{j}_{i}(\xi)=n^{-1/2}F^{j}_{i}(\xi)+R^{j}_{i}(\xi),

where Rij(ξ)R^{j}_{i}(\xi) is a remainder term given by,

Rij(ξ)\displaystyle R^{j}_{i}(\xi) =n12ξT2sij(x^n+θi,jn1/2ξ)ξ\displaystyle=\frac{n^{-1}}{2}\xi^{T}\nabla^{\otimes 2}s_{i}^{j}(\hat{x}_{n}+\theta_{i,j}n^{-1/2}\xi)\xi
n12((ξn)T(2sij(x^n+θi,jn1/2ξn)n1k=1n2sik(x^n+θi,kn1/2ξn))ξn).\displaystyle\quad\quad-\frac{n^{-1}}{2}\left((\xi_{n}^{*})^{T}\left(\nabla^{\otimes 2}s_{i}^{j}(\hat{x}_{n}+\theta^{*}_{i,j}n^{-1/2}\xi^{*}_{n})-n^{-1}\sum_{k=1}^{n}\nabla^{\otimes 2}s_{i}^{k}(\hat{x}_{n}+\theta^{*}_{i,k}n^{-1/2}\xi^{*}_{n})\right)\xi_{n}^{*}\right).

For all jj, by Assumption 2.1,

|n1/2Eij(ξ)n1Fij(ξ)|=|n1/2Rij(ξ)|n3/22ξ2M+n3/2ξn2M.\left|n^{-1/2}E^{j}_{i}(\xi)-n^{-1}F^{j}_{i}(\xi)\right|=\left|n^{-1/2}R_{i}^{j}(\xi)\right|\leq\frac{n^{-3/2}}{2}\|\xi\|^{2}M^{\prime}+n^{-3/2}\|\xi_{n}^{*}\|^{2}M^{\prime}.

Thus,

\left|n^{-1/2}\lambda^{n}_{\text{cv},i}(\xi,v)-\tilde{\lambda}^{n}_{\text{cv},i}(\xi,v)\right| \leq\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\sum_{j\in S}\left|n^{-1/2}E_{i}^{j}(\xi)-n^{-1}F_{i}^{j}(\xi)\right|
nm|𝒮(n,m)|S𝒮(n,m)jS(n3/22ξ2M+n3/2ξn2M)\displaystyle\leq\frac{n}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\sum_{j\in S}\left(\frac{n^{-3/2}}{2}\|\xi\|^{2}M^{\prime}+n^{-3/2}\|\xi_{n}^{*}\|^{2}M^{\prime}\right)
=n1/22ξ2M+n1/2ξn2M.\displaystyle=\frac{n^{-1/2}}{2}\|\xi\|^{2}M^{\prime}+n^{-1/2}\|\xi_{n}^{*}\|^{2}M^{\prime}.

Then, since \xi ranges over the compact set K and \xi^{*}_{n}\to\xi^{*} in probability with \mathbb{E}\|\xi^{*}\|^{2}<\infty, the right hand side converges to 0 in probability uniformly over \xi\in K, and the result follows for all i=1,\dots,d. ∎

Lemma C.5.

Suppose Assumptions 2.1 - 2.4 all hold. Let K\subset\mathbb{R}^{d} be compact. Then, for all i=1,\dots,d,

supξK|n1/2λcv,in(ξ,v)λcv,i(ξ,v|ξ)|0,in probability.\sup_{\xi\in K}\left|n^{-1/2}\lambda^{n}_{\text{cv},i}(\xi,v)-\lambda_{\text{cv},i}(\xi,v|\xi^{*})\right|\to 0,\quad\text{in }\mathbb{P}-\text{probability}.
Proof.

Due to Lemma C.4, it is sufficient to show the result for \tilde{\lambda}^{n}_{\text{cv},i}(\xi,v) in place of n^{-1/2}\lambda^{n}_{\text{cv},i}(\xi,v). For an arbitrary n, a first order Taylor's expansion about x_{0} gives,

sij(x^n)=sij(x0)+2sij(x0+θi,j(x^nx0))(x^nx0),\nabla s^{j}_{i}(\hat{x}_{n})=\nabla s^{j}_{i}(x_{0})+\nabla^{\otimes 2}s^{j}_{i}(x_{0}+\theta_{i,j}(\hat{x}_{n}-x_{0}))\cdot(\hat{x}_{n}-x_{0}),

for some θi,j(0,1)\theta_{i,j}\in(0,1). Consequently, we write FijF^{j}_{i} as,

F^{j}_{i}(\xi) =\xi\cdot\nabla s^{j}_{i}(\hat{x}_{n})-\xi_{n}^{*}\cdot\left(\nabla s^{j}_{i}(\hat{x}_{n})-J^{n}_{\cdot i}(\hat{x}_{n})\right)
=(\xi-\xi^{*}_{n})\cdot\left(\nabla s^{j}_{i}(x_{0})+\nabla^{\otimes 2}s^{j}_{i}(x_{0}+\theta_{i,j}(\hat{x}_{n}-x_{0}))\cdot(\hat{x}_{n}-x_{0})\right)+\xi^{*}_{n}\cdot J_{\cdot i}^{n}(\hat{x}_{n}).

Define,

Gij(ξ)=ξsij(x0)ξ(sij(x0)𝔼Si(x0)),G^{j}_{i}(\xi)=\xi\cdot\nabla s^{j}_{i}(x_{0})-\xi^{*}\cdot\left(\nabla s^{j}_{i}(x_{0})-\mathbb{E}\nabla S_{i}(x_{0})\right), (C.13)

where we write \mathbb{E}\nabla S_{i}(x_{0}) as shorthand for \mathbb{E}\nabla S_{i}(x_{0};Y_{1}). Then,

|Fij(ξ)Gij(ξ)|\displaystyle\left|F^{j}_{i}(\xi)-G^{j}_{i}(\xi)\right| =|(ξξn)2sij(x0+θi,j(x^nx0))(x^nx0)\displaystyle=\left|(\xi-\xi_{n}^{*})\cdot\nabla^{\otimes 2}s^{j}_{i}(x_{0}+\theta_{i,j}(\hat{x}_{n}-x_{0}))\cdot(\hat{x}_{n}-x_{0})\right.
(ξnξ)sij(x0)+ξnJin(x^n)ξ𝔼Si(x0)|\displaystyle\quad\quad\left.-(\xi_{n}^{*}-\xi^{*})\cdot\nabla s^{j}_{i}(x_{0})+\xi^{*}_{n}\cdot J^{n}_{\cdot i}(\hat{x}_{n})-\xi^{*}\cdot\mathbb{E}\nabla S_{i}(x_{0})\right|
Mξξnx^nx0+ξnξsij(x0)+ξnJin(x^n)ξ𝔼Si(x0),\displaystyle\leq M^{\prime}\|\xi-\xi_{n}^{*}\|\|\hat{x}_{n}-x_{0}\|+\|\xi^{*}_{n}-\xi^{*}\|\|\nabla s^{j}_{i}(x_{0})\|+\|\xi^{*}_{n}\cdot J^{n}_{\cdot i}(\hat{x}_{n})-\xi^{*}\cdot\mathbb{E}\nabla S_{i}(x_{0})\|,

for all i,ji,j. Consider,

|λ~cv,in(ξ,v)1m|𝒮(n,m)|S𝒮(n,m)(vijSGij(ξ))+|\displaystyle\left|\tilde{\lambda}^{n}_{\text{cv},i}(\xi,v)-\frac{1}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot\sum_{j\in S}G^{j}_{i}(\xi)\right)_{+}\right|
\quad\quad\leq\frac{1}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\sum_{j\in S}\left|F_{i}^{j}(\xi)-G_{i}^{j}(\xi)\right|
Mξξnx^nx0+ξnJin(x^n)ξ𝔼Si(x0)\displaystyle\quad\quad\leq M^{\prime}\|\xi-\xi_{n}^{*}\|\|\hat{x}_{n}-x_{0}\|+\|\xi^{*}_{n}\cdot J^{n}_{\cdot i}(\hat{x}_{n})-\xi^{*}\cdot\mathbb{E}\nabla S_{i}(x_{0})\|
+ξnξ1m|𝒮(n,m)|S𝒮(n,m)jSsij(x0)\displaystyle\quad\quad\quad\quad+\|\xi^{*}_{n}-\xi^{*}\|\frac{1}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\sum_{j\in S}\|\nabla s^{j}_{i}(x_{0})\|
=Mξξnx^nx0+ξnJin(x^n)ξ𝔼Si(x0)+ξnξ1nj=1nsij(x0)\displaystyle\quad\quad=M^{\prime}\|\xi-\xi_{n}^{*}\|\|\hat{x}_{n}-x_{0}\|+\|\xi^{*}_{n}\cdot J^{n}_{\cdot i}(\hat{x}_{n})-\xi^{*}\cdot\mathbb{E}\nabla S_{i}(x_{0})\|+\|\xi^{*}_{n}-\xi^{*}\|\frac{1}{n}\sum_{j=1}^{n}\|\nabla s^{j}_{i}(x_{0})\|

Under Assumption 2.2, n1j=1nsij(x0)𝔼Si(x0)<n^{-1}\sum_{j=1}^{n}\|\nabla s^{j}_{i}(x_{0})\|\to\mathbb{E}\|\nabla S_{i}(x_{0})\|<\infty almost surely as nn\to\infty. Hence, under the hypotheses,

supξK|λ~cv,in(ξ,v)1m|𝒮(n,m)|S𝒮(n,m)(vijSGij(ξ))+|0,\sup_{\xi\in K}\left|\tilde{\lambda}^{n}_{\text{cv},i}(\xi,v)-\frac{1}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot\sum_{j\in S}G^{j}_{i}(\xi)\right)_{+}\right|\to 0,

in \mathbb{P}-probability. Finally, observe that given \xi^{*} and for each fixed i, the G^{j}_{i} are conditionally independent and identically distributed. Also given \xi^{*}, for each fixed (\xi,v) and i,

1m|𝒮(n,m)|S𝒮(n,m)(vijSGij(ξ))+\frac{1}{m|\mathcal{S}_{(n,m)}|}\sum_{S\in\mathcal{S}_{(n,m)}}\left(v_{i}\cdot\sum_{j\in S}G^{j}_{i}(\xi)\right)_{+}

is a UU-statistic [see 3] of degree mm with kernel fm(y1,,ym)f_{m}(y_{1},\dots,y_{m}) given by,

f_{m}(y_{1},\dots,y_{m})=\frac{1}{m}\left(v_{i}\cdot\sum_{j=1}^{m}\left[\xi\cdot\nabla s_{i}(x_{0};y_{j})-\xi^{*}\cdot(\nabla s_{i}(x_{0};y_{j})-\mathbb{E}\nabla S_{i}(x_{0}))\right]\right)_{+}.

Then, conditioned on ξ\xi^{*}, the strong law of large numbers for UU-statistics [3] gives convergence for each (ξ,v)(\xi,v). Furthermore, since ξ\xi and vv appear only as multiplicative factors, the strong law is uniform on compact sets and the result follows. ∎

Lemma C.6.

Let ff be such that f(,v)𝒞c(d)f(\cdot,v)\in\mathcal{C}_{c}^{\infty}(\mathbb{R}^{d}) for each vv. Suppose Assumptions 2.1 - 2.4 all hold. Given ξ\xi^{*},

supξd|nf(ξ,v)f(ξ,v)|0,in probability.\sup_{\xi\in\mathbb{R}^{d}}\left|\mathcal{L}^{n}f(\xi,v)-\mathcal{L}f(\xi,v)\right|\to 0,\quad\text{in }\mathbb{P}-\text{probability}.
Proof.

Let KK be the compact support of ff. Then, for an arbitrary nn,

\sup_{\xi\in\mathbb{R}^{d}}\left|\mathcal{L}^{n}f(\xi,v)-\mathcal{L}f(\xi,v)\right| \leq\sup_{\xi\in K}2\|f\|_{\infty}\sum_{i=1}^{d}\left|n^{-1/2}\lambda^{n}_{\text{cv},i}(\xi,v)-\lambda_{\text{cv},i}(\xi,v|\xi^{*})\right|.

Since f<\|f\|_{\infty}<\infty, the result follows by Lemma C.4 and Lemma C.5. ∎


Proof of Theorem 4.6. Lemma C.6 shows that the infinitesimal generators converge uniformly in \mathbb{P}-probability. For any subsequence (n_{k}), let (n_{k_{l}}) be a further sub-sequence such that this convergence is almost sure along (n_{k_{l}}). Then by Corollary 8.7(f), Chapter 4 of [23] and since \mathcal{C}_{c}^{\infty} is a core for \mathcal{L} [24], U^{n} converges weakly (in Skorohod topology) to U, \mathbb{P}-almost surely along the subsequence (n_{k_{l}}). This in turn implies weak convergence (in Skorohod topology) in \mathbb{P}-probability of U^{n} to U as n\to\infty. ∎


Proof of Theorem 4.11. Following the steps in Lemma C.4 and Lemma C.5, it is possible to once again show that for any compact $K\subset\mathbb{R}^{d}$,

\[
\sup_{\xi\in K}\left|\tilde{\lambda}^{n}_{\text{cv},i}(\xi,v)-\frac{1}{m(n)|\mathcal{S}_{(n,m(n))}|}\sum_{S\in\mathcal{S}_{(n,m(n))}}\left(v_{i}\cdot\sum_{j\in S}G^{j}_{i}(\xi)\right)_{+}\right|\to 0,
\]

for all $i=1,\dots,d$. By an argument similar to that in the proof of Proposition 2.1, one can further show that whenever $m(n)\to\infty$ as $n\to\infty$, the difference

\[
\left|\frac{1}{|\mathcal{S}_{(n,m(n))}|}\sum_{S\in\mathcal{S}_{(n,m(n))}}\left(v_{i}\cdot m(n)^{-1}\sum_{j\in S}G^{j}_{i}(\xi)\right)_{+}-\left(v_{i}\cdot n^{-1}\sum_{j=1}^{n}G^{j}_{i}(\xi)\right)_{+}\right|
\]

tends to $0$ uniformly on compact sets in $\mathbb{P}$-probability. This implies that for all $i$ and compact $K\subset\mathbb{R}^{d}$,

\[
\sup_{\xi\in K}\left|n^{-1/2}\lambda^{n}_{\text{cv},i}(\xi,v)-n^{-1}\left(v_{i}\cdot\sum_{j=1}^{n}G^{j}_{i}(\xi)\right)_{+}\right|\to 0,\quad\text{in }\mathbb{P}\text{-probability}.
\]

Also, the strong law of large numbers gives

\[
\frac{1}{n}\left(v_{i}\cdot\sum_{j=1}^{n}G^{j}_{i}(\xi)\right)_{+}\xrightarrow{\ n\to\infty\ }\left(v_{i}\,\mathbb{E}_{Y}\left[\xi\cdot\nabla S_{i}(x_{0};Y)-\xi^{*}\cdot\left(\nabla S_{i}(x_{0};Y)-\mathbb{E}\nabla S_{i}(x_{0})\right)\right]\right)_{+},
\]

almost surely. But $\mathbb{E}_{Y}\left[\xi\cdot\nabla S_{i}(x_{0};Y)-\xi^{*}\cdot\left(\nabla S_{i}(x_{0};Y)-\mathbb{E}\nabla S_{i}(x_{0})\right)\right]=\xi\cdot\mathbb{E}_{Y}\nabla S_{i}(x_{0};Y)$ almost surely, since the control-variate term $\nabla S_{i}(x_{0};Y)-\mathbb{E}\nabla S_{i}(x_{0})$ has mean zero under $\mathbb{E}_{Y}$. Thus we get, for all $i$,

\[
\sup_{\xi\in K}\left|n^{-1/2}\lambda^{n}_{\text{cv},i}(\xi,v)-\left(v_{i}\,\xi\cdot\mathbb{E}\nabla S_{i}(x_{0};Y)\right)_{+}\right|\to 0,\quad\text{in }\mathbb{P}\text{-probability}.
\]

The result follows. ∎
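The averaging step in the proof — that the mean-zero control-variate fluctuations vanish inside the positive part — can be illustrated with a toy numerical sketch. Here the constant $a$ is a hypothetical stand-in for $\xi\cdot\mathbb{E}\nabla S_{i}(x_{0};Y)$, and the $Z_{j}$ play the role of the mean-zero terms $\nabla S_{i}(x_{0};Y_{j})-\mathbb{E}\nabla S_{i}(x_{0})$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: a deterministic part a plus mean-zero fluctuations Z_j.
a, n = 0.7, 100_000
Z = rng.standard_normal(n)

for v in (+1.0, -1.0):
    empirical = max(v * (a + Z).mean(), 0.0)  # analogue of n^{-1}(v_i sum_j G_i^j)_+
    limit = max(v * a, 0.0)                   # analogue of (v_i xi . E grad S_i)_+
    print(v, empirical, limit)                # the two agree for large n
```

For $v=+1$ the empirical rate concentrates near $a$, while for $v=-1$ both the empirical and limiting rates are $0$, matching the one-sided behaviour of the switching rates.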

References

  • Andrieu et al., [2021] Andrieu, C., Durmus, A., Nüsken, N., and Roussel, J. (2021). Hypocoercivity of piecewise deterministic Markov process-Monte Carlo. The Annals of Applied Probability, 31(5):2478–2517.
  • Andrieu and Livingstone, [2021] Andrieu, C. and Livingstone, S. (2021). Peskun–Tierney ordering for Markovian Monte Carlo: beyond the reversible scenario. The Annals of Statistics, 49(4):1958–1981.
  • Arcones and Giné, [1993] Arcones, M. A. and Giné, E. (1993). Limit theorems for U-processes. The Annals of Probability, pages 1494–1542.
  • Baker et al., [2019] Baker, J., Fearnhead, P., Fox, E. B., and Nemeth, C. (2019). Control variates for stochastic gradient MCMC. Statistics and Computing, 29:599–615.
  • Bardenet et al., [2017] Bardenet, R., Doucet, A., and Holmes, C. (2017). On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18(47):1–43.
  • Berk, [1970] Berk, R. H. (1970). Consistency a posteriori. The Annals of Mathematical Statistics, 41(3):894–906.
  • Bertazzi and Bierkens, [2022] Bertazzi, A. and Bierkens, J. (2022). Adaptive schemes for piecewise deterministic monte carlo algorithms. Bernoulli, 28(4):2404–2430.
  • Bertazzi et al., [2022] Bertazzi, A., Bierkens, J., and Dobson, P. (2022). Approximations of piecewise deterministic markov processes and their convergence properties. Stochastic Processes and their Applications, 154:91–153.
  • Bierkens and Duncan, [2017] Bierkens, J. and Duncan, A. (2017). Limit theorems for the zig-zag process. Advances in Applied Probability, 49(3):791–825.
  • Bierkens et al., [2019a] Bierkens, J., Fearnhead, P., and Roberts, G. O. (2019a). Supplement to “The zig-zag process and super-efficient sampling for Bayesian analysis of big data”.
  • Bierkens et al., [2019b] Bierkens, J., Fearnhead, P., and Roberts, G. O. (2019b). The zig-zag process and super-efficient sampling for Bayesian analysis of big data. The Annals of Statistics, 47(3):1288–1320.
  • Bierkens et al., [2020] Bierkens, J., Grazzi, S., Kamatani, K., and Roberts, G. (2020). The boomerang sampler. In International Conference on Machine Learning, pages 908–918. PMLR.
  • Bierkens et al., [2022] Bierkens, J., Kamatani, K., and Roberts, G. O. (2022). High-dimensional scaling limits of piecewise deterministic sampling algorithms. The Annals of Applied Probability, 32(5):3361–3407.
  • Bierkens et al., [2021] Bierkens, J., Nyquist, P., and Schlottke, M. C. (2021). Large deviations for the empirical measure of the zig-zag process. The Annals of Applied Probability, 31(6):2811–2843.
  • Bierkens et al., [2019c] Bierkens, J., Roberts, G. O., and Zitt, P.-A. (2019c). Ergodicity of the zigzag process. The Annals of Applied Probability, 29(4):2266–2301.
  • Bouchard-Côté et al., [2018] Bouchard-Côté, A., Vollmer, S. J., and Doucet, A. (2018). The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association, 113(522):855–867.
  • Cochran, [1977] Cochran, W. G. (1977). Sampling techniques. John Wiley & Sons.
  • Corbella et al., [2022] Corbella, A., Spencer, S. E., and Roberts, G. O. (2022). Automatic Zig-Zag sampling in practice. arXiv preprint arXiv:2206.11410.
  • Cornish et al., [2019] Cornish, R., Vanetti, P., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. (2019). Scalable Metropolis–Hastings for exact Bayesian inference with large datasets. In International Conference on Machine Learning, pages 1351–1360. PMLR.
  • Davis, [1993] Davis, M. (1993). Markov Models & Optimization. Taylor & Francis.
  • Davis, [1984] Davis, M. H. (1984). Piecewise-deterministic Markov processes: A general class of non-diffusion stochastic models. Journal of the Royal Statistical Society: Series B (Methodological), 46(3):353–376.
  • Deligiannidis et al., [2021] Deligiannidis, G., Paulin, D., Bouchard-Côté, A., and Doucet, A. (2021). Randomized Hamiltonian Monte Carlo as scaling limit of the bouncy particle sampler and dimension-free convergence rates. The Annals of Applied Probability, 31(6):2612–2662.
  • Ethier and Kurtz, [1986] Ethier, S. N. and Kurtz, T. G. (1986). Markov processes: Characterization and convergence. John Wiley & Sons.
  • Holderrieth, [2021] Holderrieth, P. (2021). Cores for piecewise-deterministic Markov processes used in Markov chain Monte Carlo. Electronic Communications in Probability, 26:1–12.
  • Kleijn and van der Vaart, [2012] Kleijn, B. and van der Vaart, A. (2012). The Bernstein-Von-Mises theorem under misspecification. Electronic Journal of Statistics, 6:354–381.
  • Krauth, [2021] Krauth, W. (2021). Event-chain Monte Carlo: foundations, applications, and prospects. Frontiers in Physics, 9:663457.
  • Lee, [2019] Lee, A. J. (2019). U-statistics: Theory and Practice. Routledge.
  • Nemeth and Fearnhead, [2021] Nemeth, C. and Fearnhead, P. (2021). Stochastic gradient Markov chain Monte Carlo. Journal of the American Statistical Association, 116(533):433–450.
  • Pagani et al., [2024] Pagani, F., Chevallier, A., Power, S., House, T., and Cotter, S. (2024). NuZZ: numerical Zig-Zag for general models. Statistics and Computing, 34(1):61.
  • Schmon and Gagnon, [2022] Schmon, S. M. and Gagnon, P. (2022). Optimal scaling of random walk Metropolis algorithms using Bayesian large-sample asymptotics. Statistics and Computing, 32:1–16.
  • Vasdekis and Roberts, [2021] Vasdekis, G. and Roberts, G. O. (2021). Speed up zig-zag. arXiv preprint arXiv:2103.16620.
  • Welling and Teh, [2011] Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.
  • Wu and Robert, [2020] Wu, C. and Robert, C. P. (2020). Coordinate sampler: a non-reversible Gibbs-like MCMC sampler. Statistics and Computing, 30(3):721–730.