Random forward models and log-likelihoods in Bayesian inverse problems
Abstract
Abstract: We consider the use of randomised forward models and log-likelihoods within the Bayesian approach to inverse problems. Such random approximations to the exact forward model or log-likelihood arise naturally when a computationally expensive model is approximated using a cheaper stochastic surrogate, as in Gaussian process emulation (kriging), or in the field of probabilistic numerical methods. We show that the Hellinger distance between the exact and approximate Bayesian posteriors is bounded by moments of the difference between the true and approximate log-likelihoods. Example applications of these stability results are given for randomised misfit models in large data applications and the probabilistic solution of ordinary differential equations.
Keywords: Bayesian inverse problem, random likelihood, surrogate model, posterior consistency, uncertainty quantification, randomised misfit, probabilistic numerics.
2010 Mathematics Subject Classification: 62F15, 62G08, 65C99, 65D05, 65D30, 65J22, 68W20.
1 Introduction
Inverse problems are ubiquitous in the applied sciences and in recent years renewed attention has been paid to their mathematical and statistical foundations (Evans and Stark, 2002; Kaipio and Somersalo, 2005; Stuart, 2010). Questions of well-posedness — i.e. the existence, uniqueness, and stability of solutions — have been of particular interest for infinite-dimensional/non-parametric inverse problems because of the need to ensure stable and discretisation-independent inferences (Lassas and Siltanen, 2004) and develop algorithms that scale well with respect to high discretisation dimension (Cotter et al., 2013).
This paper considers the stability of the posterior distribution in a Bayesian inverse problem (BIP) when an accurate but computationally intractable forward model or likelihood is replaced by a random surrogate or emulator. Such stochastic surrogates arise often in practice. For example, an expensive forward model such as the solution of a PDE may replaced by a kriging/Gaussian process (GP) model (Stuart and Teckentrup, 2017). In the realm of “big data” a residual vector of prohibitively high dimension may be randomly subsampled or orthogonally projected onto a randomly-chosen low-dimensional subspace (Le et al., 2017; Nemirovski et al., 2008). In the field of probabilistic numerical methods (Hennig et al., 2015), a deterministic dynamical system may be solved stochastically, with the stochasticity representing epistemic uncertainty about the behaviour of the system below the temporal or spatial grid scale (Conrad et al., 2016; Lie et al., 2017).
In each of the above-mentioned settings, the stochasticity in the forward model propagates to associated inverse problems, so that the Bayesian posterior becomes a random measure, , which we define precisely in (3.1). Alternatively, one may choose to average over the randomness to obtain a marginal posterior, , which we define precisely in (3.2). It is natural to ask in which sense the approximate posterior (either the random or the marginal version) is close to the ideal posterior of interest, .
In earlier work, Stuart and Teckentrup (2017) examined the case in which the random surrogate was a GP. More precisely, the object subjected to GP emulation was either the forward model (i.e. the parameter-to-observation map) or the negative log-likelihood. The prior GP was assumed to be continuous, and was then conditioned upon finitely many observations (i.e. pointwise evaluations) of the parameter-to-observation map or negative log-likelihood as appropriate. That paper provided error bounds on the Hellinger distance between the BIP’s exact posterior distribution and various approximations based on the GP emulator, namely approximations based on the mean of the predictive (i.e. conditioned) GP, as well as approximations based on the full GP emulator. Those results showed that the Hellinger distance between the exact BIP posterior and its approximations can be bounded by moments of the error in the emulator.
In this paper, we extend the analysis of Stuart and Teckentrup (2017) to consider more general (i.e. non-Gaussian) random approximations to forward models and log-likelihoods, and quantify the impact upon the posterior measure in a BIP. After establishing some notation in Section 2, we state the main approximation theorems in Section 3. Section 4 gives an application of the general theory to random misfit models, in which high-dimensional data are rendered tractable by projection into a randomly-chosen low-dimensional subspace. Section 5 gives an application to the stochastic numerical solution of deterministic dynamical systems, in which the stochasticity is a device used to represent the impact of numerical discretisation uncertainty. The proofs of all theorems are deferred to an appendix located after the bibliographic references.
2 Setup and notation
2.1 Spaces of probability measures
Throughout, is a fixed probability space that is rich enough to serve as a common domain for all random variables of interest.
The space of probability measures on the Borel -algebra of a topological space will be denoted by ; in practice, will be a separable Banach space.
When , integration of a measurable function (random variable) will also be denoted by expectation, i.e. .
The space will be endowed with the Hellinger metric : for that are both absolutely continuous with respect to a reference measure ,
(2.1) |
The Hellinger distance is in fact independent of the choice of reference measure and defines a metric on (Bogachev, 2007, Lemma 4.7.35–36) with respect to which evidently has diameter at most . The Hellinger topology coincides with the total variation topology (Kraft, 1955) and is strictly weaker than the Kullback–Leibler (relative entropy) topology (Pinsker, 1964); all these topologies are strictly stronger than the topology of weak convergence of measures.
As used in Sections 3–5, the Hellinger metric is useful for uncertainty quantification when assessing the similarity of Bayesian posterior probability distributions, since expected values of square-integrable functions are Lipschitz continuous with respect to the Hellinger metric:
(2.2) |
when . In particular, for bounded , .
2.2 Bayesian inverse problems
By an inverse problem we mean the recovery of from an imperfect observation of , for a known forward operator . In practice, the operator may arise as the composition of the solution operator of a system of ordinary or partial differential equations with an observation operator , and it is typically the case that for some , whereas and can have infinite dimension. For simplicity, we assume an additive noise model
(2.3) |
where the statistics but not the realisation of are known. In the strict sense, this inverse problem is ill-posed in the sense that there may be no element for which , or there may be multiple such that are highly sensitive to the observed data .
The Bayesian perspective eases these problems by interpreting , , and all as random variables or fields. Through knowledge of the distribution of , (2.3) defines the conditional distribution of . After positing a prior probability distribution for , the Bayesian solution to the inverse problem is nothing other than the posterior distribution for the conditioned random variable . This posterior measure, which we denote , is from the Bayesian point of view the proper synthesis of the prior information in with the observed data . The same posterior can also be arrived at via the minimisation of penalised Kullback–Leibler, , or Dirichlet energies (Dupuis and Ellis, 1997; Jordan and Kinderlehrer, 1996; Ohta and Takatsu, 2011), where the penalisation again expresses compromise between fidelity to the prior and fidelity to the data.
The rigorous formulation of Bayes’ formula for this context requires careful treatment and some further notation (Stuart, 2010). The pair is assumed to be a well-defined random variable with values in . The marginal distribution of is the Bayesian prior . The observational noise is distributed according to , independently of . The random variable is distributed according to , the translate of by , which is assumed to be absolutely continuous with respect to , with
The function is called the negative log-likelihood or simply potential. In the elementary setting of centred Gaussian noise, on , the potential is the non-negative quadratic misfit666Hereafter, to reduce notational clutter, we write both and as . . However, particularly for cases in which , it may be necessary to allow to take negative values and even to be unbounded below (Stuart, 2010, Remark 3.8).
With this notation, Bayes’ theorem is then as follows (Dashti and Stuart, 2016, Theorem 3.4):
Theorem 2.1 (Generalised Bayesian formula).
Suppose that is -measurable and that
satisfies for -almost all . Then, for such , the conditional distribution of exists and is absolutely continuous with respect to with density
(2.4) |
Note that, for (2.4) to make sense, it is essential to check that . Hereafter, to save space, we regard the data as fixed, and hence write in place of , in place of , and in place of . In particular, we shall redefine the negative log-likelihood as a function , instead of a function as in Theorem 2.1 above.
From the perspective of numerical analysis, it is natural to ask about the well-posedness of the Bayesian posterior : is it stable when the prior , the potential , or the observed data are slightly perturbed, e.g. due to discretisation, truncation, or other numerical errors? For example, what is the impact of using an approximate numerical forward operator in place of , and hence an approximate in place of ? Here, we quantify stability in the Hellinger metric from (2.1).
Stability of the posterior with respect to the observed data and the log-likelihood was established for Gaussian priors by Stuart (2010) and for more general priors by many later contributions (Dashti et al., 2012; Hosseini, 2017; Hosseini and Nigam, 2017; Sullivan, 2017). (We note in passing that the stability of BIPs with respect to perturbation of the prior is possible but much harder to establish, particularly when the data are highly informative and the normalisation constant is close to zero; see e.g. the “brittleness” phenomenon of (Owhadi and Scovel, 2017; Owhadi et al., 2015).) Typical approximation theorems for the replacement of the potential by a deterministic approximate potential , leading to an approximate posterior , aim to transfer the convergence rate of the forward problem to the inverse problem, i.e. to prove an implication of the form
where is suitably well behaved, quantifies the convergence rate of the forward problem, and is a constant. Following Stuart and Teckentrup (2017), the purpose of this article is to extend this paradigm and these approximation results to the case in which the approximation is a random object.
3 Well-posed Bayesian inverse problems with random likelihoods
In many practical applications, the negative log-likelihood is computationally too expensive or impossible to evaluate exactly; one therefore often uses an approximation of . This leads to an approximation of the exact posterior , and a key desideratum is convergence, in a suitable sense, of to as the approximation error in the potential tends to zero.
The focus of this work is on random approximations . One particular example of such random approximations are the GP emulators analysed in Stuart and Teckentrup (2017); other examples include the randomised misfit models in Section 4 and the probabilistic numerical methods in Section 5. The present section extends the analysis of Stuart and Teckentrup (2017) from the case of GP approximations of forward models or log-likelihoods to more general non-Gaussian approximations. In doing so, more precise conditions are obtained for the exact Bayesian posterior to be well approximated by its random counterpart.
Let now be a measurable function that provides a random approximation to , where we recall that we have fixed the data . Let be a probability measure on such that the distribution of the inputs of is given by ; we sometimes abuse notation and think of itself as being -distributed. We assume throughout that the randomness in the approximation of is independent of the randomness in the parameters being inferred.
Replacing by in (2.4), we obtain the sample approximation , the random measure given by
(3.1) | ||||
(Henceforth, we will omit the argument for brevity.) Thus, the measure is approximated by the random measure , and the normalisation constant is a random variable. A deterministic approximation of the posterior distribution can now be obtained either by fixing , i.e. by taking one particular realisation of the random posterior , or by taking the expected value of the random likelihood , i.e. by averaging over different realisations of . This yields the marginal approximation defined by
(3.2) |
where . We note that an alternative averaged, deterministic approximation can be obtained by taking the expected value of the density in (3.1) as a whole, i.e. by taking the expected value of the ratio rather than the ratio of expected values. A result very similar to Theorem 3.1, with slightly modified assumptions, holds also in this case, with the proof following the same steps. However, the marginal approximation presented here appears more intuitive and more amenable to applications. Firstly, the marginal approximation provides a clear interpretation as the posterior distribution obtained by the approximation of the true data likelihood by . Secondly, the marginal approximation is more amenable to sampling methods such as Markov chain Monte Carlo, with clear connections to the pseudo-marginal approach (Andrieu and Roberts, 2009; Beaumont, 2003).
3.1 Random misfit models
This section considers the general setting in which the deterministic potential is approximated by a random potential . Recall from (2.4) that is the normalisation constant of , and that for to be well-defined, we must have that . The following two results, Theorems 3.1 and 3.2, extend Theorems 4.9 and 4.11 respectively of Stuart and Teckentrup (2017), in which the approximation is a GP model:
Theorem 3.1 (Deterministic convergence of the marginal posterior).
Suppose that there exist scalars , independent of , such that, for the Hölder-conjugate exponent pairs , , and , we have
-
(a)
;
-
(b)
;
-
(c)
.
Then there exists , independent of , such that
(3.3a) | ||||
(3.3b) |
In the proof of Theorem 3.1, we show that hypothesis (a) arises as an upper bound on the quantity . In order for the conclusion of Theorem 3.1 to hold, we need the latter to be finite. Thus, hypothesis (a) is an exponential decay condition on the positive tails of either or , with respect to the appropriate measures. Alternatively, by applying Jensen’s inequality to , one can strengthen hypothesis (a) into the hypothesis of exponential integrability of either with respect to or with respect to ; this yields the same interpretation. Thus, the parameter quantifies the exponential decay of the positive tail of either or .
By comparing the quantity from hypothesis (a) with the quantity in hypothesis (b), it follows that hypothesis (b) is an exponential decay condition on the negative tails of both and . The two new parameters in this decay condition arise because we apply Hölder’s inequality twice in order to develop the desired bound (3.3) on . The key desideratum here is that the bound is multiplicative in some -norm of . The two new parameters and quantify the decay with respect to and respectively. Note that the interaction between the hypotheses (a) and (b) as described by the conjugate exponent pair implies that one can trade off faster exponential decay of one tail with slower exponential decay of the other.
The two-sided condition on in hypothesis (c) ensures that both tails of with respect to decay sufficiently quickly. This hypothesis ensures that the Radon–Nikodym derivative in (3.2) is well-defined.
Finally, we note that the quantity on the right hand side of (3.3a) depends directly on the conjugate exponents of and appearing in hypotheses (a) and (b). The more well behaved the quantities in these hypotheses are, the weaker the norm we can choose on the right hand side of (3.3a).
Theorem 3.2 (Mean-square convergence of the sample posterior).
Suppose that there exist scalars , independent of , such that, for Hölder-conjugate exponent pairs and , we have
-
(a)
;
-
(b)
.
Then
(3.4) |
Hypothesis (a) of Theorem 3.2 arises during the proof as a result of developing an upper bound on that is multiplicative in some -norm of . Thus, it describes an exponential decay condition of the negative tails of both or ; in particular, hypothesis (a) is always satisfied when the potentials or are non-negative, as is usually the case for finite-dimensional data. The appearance of and arises due to one application of Hölder’s inequality for fulfulling the desideratum of multiplicativity, and and quantify the decay with respect to and respectively.
Hypothesis (b) of Theorem 3.2 arises as a result of developing an upper bound on the quantity that fulfills the desideratum of multiplicativity mentioned above. The presence of both and its reciprocal indicates that hypothesis (b) is analogous to hypothesis (c) of Theorem 3.1, in that hypothesis (b) is a condition on the tails of with respect to . The difference between hypothesis (b) of Theorem 3.2 and hypothesis (c) of Theorem 3.1 arises due to the fact that the Radon–Nikodym derivative in (3.1) features instead of .
We now show that the assumptions of Theorems 3.1 and 3.2 are satisfied when the exact potential and the approximation quality are suitably well behaved. Since , it follows that for some .
Assumption 3.3.
There exists that does not depend on , such that, for all ,
(3.5) |
and for any with the property that , there exists such that, for all ,
(3.6) |
The lower bound conditions in (3.5) ensure that the hypothesised exponential decay conditions on the negative tails of the true likelihood and the random likelihoods from Theorems 3.1 and 3.2 are satisfied. The uniform lower bound on translates into a uniform upper bound of the Radon–Nikodym derivative of the posterior with respect to the prior, and is a very mild condition that is satisfied in many, if not most, BIPs. Given this fact, it is reasonable to demand that the satisfy the same uniform lower bound, -almost surely and for all ; this is the content of the second condition in (3.5). Condition (3.6) expresses the condition that, by choosing sufficiently large, one can approximate arbitrarily well using the random , with respect to the topology. This assumption ensures that the stated aims of this work are reasonable.
Lemma 3.4.
The uniform upper bound condition on with respect to in (3.7) is rather strong; we use it to ensure that is bounded away from zero, uniformly with respect to and . Together with the condition on in (3.5), this translates to uniform lower and upper bounds on ; the latter implies that hypothesis (b) in Theorem 3.2 holds with the stated values of and . A sufficient condition for (3.7) is that the are themselves uniformly bounded. This condition is of interest when the misfit is associated to a bounded forward model and the data take values in a bounded subset.
Lemma 3.5.
By comparing the hypotheses and conclusions of Lemma 3.4 and Lemma 3.5, we observe that, by reducing the exponent of integrability from to , we can replace the strong uniform upper bound condition (3.7) on from Lemma 3.4 with the weaker condition that in Lemma 3.5, and thus increase the scope of applicability of the conclusion.
In Lemmas 3.4 and 3.5 above, we have specified the largest possible values of the exponents that are compatible with the hypotheses. This is because later, in Theorem 3.9, we will want to use the smallest possible values of the corresponding conjugate exponents in the resulting inequalities (3.3a) and (3.4).
3.2 Random forward models in quadratic potentials
In many settings, the potentials and have a common form and differ only in the parameter-to-observable map. In this section we shall assume that and are quadratic misfits of the form
(3.8) |
corresponding to centred Gaussian observational noise with symmetric positive-definite covariance . Again, we assume that is deterministic while is random. In this section, for this setting, we show how the quality of the approximation transfers to the approximation , and hence to the approximation (for either the sample or marginal approximate posterior).
Pointwise in and , the errors in the misfit and the forward model are related according to the following proposition.
Proposition 3.6.
Let and be defined as in (3.8), where for some and the eigenvalues of the operator are bounded away from zero. Then, for some , for all , and -almost surely
(3.9) |
Hence, for and all ,
(3.10) | ||||
By assuming that , we assume that the data live in a finite-dimensional space. This is a standard assumption in the area, and implies that the operator is simply a matrix. The assumption of the eigenvalues of being bounded away from zero is equivalent to assuming that is invertible, which follows immediately from the assumption stated earlier that is a symmetric and positive-definite covariance matrix.
Corollary 3.7.
Let , and suppose that . If there exists an such that, for all ,
then, there exists some that does not depend on such that for all ,
where and is as in Proposition 3.6.
The hypotheses ensure that the integrability of the misfit determines the highest degree of integrability of the forward operators and , and that for sufficiently large , we may make the norm of the difference of in an appropriate topology small enough. The constraint (3.7) is used to combine the and terms in (3.9). The resulting simplification ensures that we may apply Lemma 3.8.
The lemma states that if the random forward model converges to the true forward model in the appropriate topology, then the conditions in Assumption 3.3 are satisfied by the corresponding random misfits. Since the misfits were assumed to be quadratic in (3.8), the key contribution of Lemma 3.8 is to ensure that the approximation quality condition (3.6) is satisfied.
We shall use the preceding results to obtain bounds on the Hellinger distance in terms of errors in the forward model, of the following form: for and that do not depend on ,
(3.12) | ||||
(3.13) |
For brevity and simplicity, the following result uses one pair in (3.11) in order to obtain convergence statements for both and . If one is interested in only one of these measures, then one may optimise and accordingly.
Theorem 3.9 (Convergence of posteriors for randomised forward models in quadratic potentials).
The proof of Theorem 3.9 consists of tracking the dependence of the parameters over the sequential application of the preceding results, all of which are used.
Case (a) applies in the situation where the random approximations are uniformly bounded from above; as discussed earlier, this condition is satisfied in the case that the misfit is associated to a bounded forward model and the data take values in a bounded subset of . Note that the topology of the convergence of to is quantified by and , and that depends on the parameter that quantifies the exponential -integrability of the misfit . In particular, the faster the exponential decay of the positive tail of (i.e. the larger the value of ), the stronger the topology of convergence of to .
In contrast to case (a), case (b) does not assume that the misfit is exponentially integrable or that the random approximations are uniformly bounded from above -almost surely. Instead, exponential integrability of the random misfit is required. Another difference is that the exponential integrability parameter determines the strength of the topology of convergence of the random forward models, not only with respect to the -topology, but also to the -topology as well.
4 Application: randomised misfit models
This section considers a particular Monte Carlo approximation of a quadratic potential , proposed by Nemirovski et al. (2008); Shapiro et al. (2009), and further applied and analysed in the context of BIPs by Le et al. (2017). This approximation is particularly useful when the data has very high dimension, so that one does not wish to interrogate every component of the data vector , or evaluate every component of the model prediction and compare it with the corresponding component of .
Let be an -valued random vector with mean zero and identity covariance, and let be independent and identically distributed copies (samples) of . We then have the following approximation:
The analysis and numerical studies in Le et al. (2017, Sections 3–4) suggest that a good choice for the random vector would be one with independent and identically distributed (i.i.d.) entries from a sub-Gaussian probability distribution on . Examples of sub-Gaussian distributions considered include
-
(a)
the standard Gaussian distribution: , for ; and
-
(b)
the -sparse distribution: for , let and set, for ,
The randomised misfit can provide computational benefits in two ways. Firstly, a single evaluation of can be made cheap by choosing the -sparse distribution for , with large sparsity parameter . This choice ensures that a large proportion of the entries of each sample will be zero, significantly reducing the cost to compute the required inner products in , since there is no need to compute the components of the data or model vector that will be eliminated by the sparsity pattern. The value of of course also influences the computational cost. It is observed by Le et al. (2017) that, for large and moderate , the random potential and the original potential are already very similar, in particular having approximately the same minimisers and minimum values. Statistically, these correspond to the maximum likelihood estimators under and being very similar; after weighting by a prior, this corresponds to similarity of maximum a posteriori (MAP) estimators.
The second benefit of the randomised misfit approach, and the main motivation for its use in Le et al. (2017), is the reduction in computational effort needed to compute the MAP estimate. This task involves the solution of a large-scale optimisation problem involving in the objective function, which is typically done using inexact Newton methods. It is shown by Le et al. (2017) that the required number of evaluations of the forward model and its adjoint is drastically reduced when using the randomised misfit as opposed to using the true misfit , approximately by a factor of .
The aim of this section is to show that the use of the randomised misfit does not only lead to the MAP estimate being well-approximated, but in fact the whole Bayesian posterior distribution. Thus, the corresponding conjecture is that the ideal and deterministic posterior is well approximated by the random posterior . Indeed, via Theorem 3.2, we have the following convergence result for the case of a sparsifying distribution:
Proposition 4.1.
Suppose that the entries of are i.i.d. -sparse, for some , and that . Then there exists a constant , independent of , such that
(4.1) |
(In this section, plays the role of the distribution of .) As the proof reveals, a valid choice of the constant in (4.1) is
(4.2) |
where the constant is as in Theorem 3.2. Thus, as one would expect, the accuracy of the approximation decreases as approaches the complete sparsification case or as the data dimension increases, but always with the same convergence rate in terms of the approximation dimension .
Remark 4.2.
The proof of Proposition 4.1 can be modified to yield the same result for arbitrary i.i.d. with bounded support, though the sparsifying case is obviously the one with the easiest interpretation. However, extending Proposition 4.1 to the case of i.i.d. Gaussian random variables appears to be problematic. In the proof, we crucially make use of the bound to verify Assumption (b) of Theorem 3.2. For Gaussian random variables, we would similarly need an -independent bound on the exponential moments of
which is not possible. We leave this as an interesting question for future work: would a different proof strategy yield convergence in the Gaussian case, or is the Gaussian setting genuinely one in which the MAP problem is well approximated but the BIP is not?
5 Application: probabilistic integration of dynamical systems
The data-based inference of initial conditions or governing parameters for dynamical problems arises frequently in scientific applications, a prime example being data assimilation in numerical weather prediction (Law et al., 2015; Reich and Cotter, 2015). In this setting, the Bayesian likelihood involves a solution of the mathematical model for the dynamics, which is typically an ODE or time-dependent PDE; we focus here on the ODE situation. Even when the governing ODE is deterministic, it may be profitable to perform a probabilistic numerical solution: possible motivations for doing so include the representation of model error (model inadequacy) in the ODE itself, and the impact of discretisation uncertainty. When such a probabilistic solver is used for the ODE, the likelihood becomes random in the sense considered in this paper.
Random approximate solution of deterministic ODEs is an old idea (Diaconis, 1988; Skilling, 1992) that has received renewed attention in recent years (Conrad et al., 2016; Hennig et al., 2015; Lie et al., 2017; Schober et al., 2014). As random forward models, these probabilistic ODE solvers are amenable to the analysis of Section 3. Let and consider the following parameter-dependent initial value problem for a fixed, parameter-independent duration :
for , | (5.1) | ||||
In the context of the BIP presented in Section 2, the unknown parameter will appear in the definition of the initial condition or the right-hand side , resulting in the parameter-dependent solution . Define the solution operator
(5.2) |
where solves (5.1). We equip with the supremum norm.
For notational convenience, we will for the majority of this section not indicate the dependence of or on . We will, however, explicitly track the dependence on and of the error analysis below.
Let be the flow map associated to the initial value problem (5.1), i.e. . Fix a time step such that , and a time grid
(5.3) |
We denote by the value of the exact solution to (5.1) at time . We shall sometimes abuse notation and write or .
To a single-step numerical integration method (e.g. a Runge–Kutta method of some order) we shall associate a numerical flow map . The numerical flow map approximates the sequence by a sequence , where . A fundamental task in numerical analysis is to determine sufficient conditions for convergence of the sequence to . The investigations of Conrad et al. (2016) and Lie et al. (2017) concern a similar task in the context of uncertainty quantification. Given , consider a collection of stochastic processes having almost-surely continuous paths. Define a stochastic process in terms of a new randomised integrator
(5.4) |
The stochastic processes are intended to capture the effect of uncertainties, e.g. those that arise due to properties of the vector field that are not resolved by the time grid (5.3) associated to the time step . We extend the definition (5.4) to continuous time via
(5.5) |
We shall use the to construct our random approximations to . Note therefore that, in order to be consistent with our assumption (see the third paragraph of Section 3) that the randomness in the approximation of is independent of the randomness in the parameter being inferred, we shall assume that the do not depend on the parameter . However, the map does depend on the parameter , because involves the vector field .
Define the random solution operator associated to the randomised integrator (5.5):
(5.6) |
where satisfies (5.5), and is almost surely continuous.
Let be a strictly increasing sequence of time points, indexed by a finite, nonempty index set with cardinality . Note that may coincide with the time grid defined in (5.3); to increase the scope of the subsequent analysis however, we allow for to differ from (5.3). Let , and equip it with the topology induced by the standard Euclidean inner product. Define the observation operator
(5.7) |
which projects some to a finite-dimensional vector in constructed by stacking the -valued vectors that result from evaluating at the time points in . We take the norm on to be .
Given the operators , , and defined in (5.2), (5.7), and (5.6), we define the forward operators by
(5.8) |
The associated likelihoods are the quadratic misfits given by (3.8) with some fixed, positive-definite matrix .
We define the continuous-time error process by
(5.9) |
Since is a proper subset of , it follows that
(5.10) |
This completes our formulation of the probabilistic numerical integration of the ODE (5.1) as a random likelihood model of the type considered in Section 3.
5.1 Convergence in continuous time for Lipschitz flows
In this section, we quote some assumptions and results from Lie et al. (2017). The vector field in (5.1) induces a flow by
(5.11) |
Assumption 5.1 (Assumption 3.1, Lie et al. (2017)).
The vector field admits and , such that for , the flow defined by (5.11) is globally Lipschitz, with
A globally Lipschitz vector field in (5.1) yields a flow map that satisfies Assumption 5.1. However, vector fields that satisfy a one-sided Lipschitz condition also have the same property. Such vector fields have been studied in the numerical analysis literature for both ordinary and stochastic differential equations in the last four decades; see, e.g. (Butcher, 1975), and the references cited in Section 3.1 of (Higham et al., 2002).
Recall that represents the numerical method that we use to integrate (5.1).
Assumption 5.2 (Assumption 3.2, Lie et al. (2017)).
The numerical method has uniform local truncation error of order : for some constant that does not depend on ,
The assumption above is satisfied for both single-step and multistep numerical methods that are obtained by considering vector fields in , provided that the derivatives are bounded; see Section III.2 of (Hairer et al., 2009). We emphasise that the above assumption is made to simplify the analysis, and that the convergence results below extend to the case where the uniform bound does not hold; see Section 4 of (Lie et al., 2017).
Now recall the collection of random variables, where is used in (5.4).
Assumption 5.3 (Assumption 5.1, Lie et al. (2017)).
The stochastic processes admit , , and , independent of and , such that for all and all ,
(In this section, plays the role of the distribution of the ’s.) The assumption above quantifies the regularity of the by specifying how many moments each has and how quickly these decay with . We do not require the to have zero mean, to be independent, or to be identically distributed. It is shown in Section 5.2 of (Lie et al., 2017) that, for example, the integrated Brownian motion process satisfies Assumption 5.3. The integrated Brownian motion process has been used as a state-independent model of the uncertainty in the off-grid behaviour of solutions to ODEs in (Conrad et al., 2016; Schober et al., 2014; Chkrebtii et al., 2016).
We now consider the following convergence theorem:
Theorem 5.4 (Theorem 5.2, Lie et al. (2017)).
Note that the scalars and depend on , since and depend on the vector field , which in turn depends on the parameter .
Recall that the random variable defined in (5.5) is a random surrogate for the true solution of the ODE (5.1). The law of is thus a probability measure on the space of continuous paths defined on the interval . With this in mind, the interpretation of Theorem 5.4 is that the law of contracts to the Dirac distribution located at the true solution of (5.1), as the spacing in the time grid (5.3) decreases to zero. Equivalently, given the true solution operator and its random counterpart , Theorem 5.4 implies that the random solution operator converges in the topology to the true solution operator. Thus, Theorem 5.4 guarantees that by refining the time grid, one reduces the uncertainty over the solution of (5.1). This is a desirable feature for uncertainty quantification, since estimates of the solution uncertainty can also be fed forward to obtain estimates of the uncertainty of functionals of the solution, and since the probabilistic description allows for a more nuanced description of the uncertainty compared to the usual worst-case description that is common in the numerical analysis of deterministic methods.
Corollary 5.5 (Corollary 5.3, Lie et al. (2017)).
Since the exponential integrability of a random variable is related to the exponential concentration of its values about its mean or median, the above result shows that strong assumptions on the model of uncertainty translate to strong conclusions about the behaviour of the corresponding error. In the context of random approximations of BIPs, we shall use Corollary 5.5 in order to establish the convergence of the random approximations in the Hellinger sense.
5.2 Effect of probabilistic integration on Bayesian posterior distribution
Define the approximate posteriors and according to (3.2) and (3.1), using the quadratic misfits and from (3.8) and the forward models and given in (5.2) and (5.6) respectively, where denotes the observation operator associated to a fixed, finite sequence of observation times in .
As we saw in the last section, the results of Lie et al. (2017) guarantee convergence in the topology of the random solution operator to the true solution operator . It is of interest to determine whether one can use this result to guarantee that one can perform inference over using the probabilistic integrator. In particular, given that the probabilistic integrator provides a random approximation, it is of interest to determine whether one can obtain results that are not only reasonable, but that improve as the time resolution of the probabilistic integrator increases. The following result shows that this is indeed the case: as the time resolution increases, the random forward model yields a random posterior over the parameter space that converges in the Hellinger topology to the true posterior at the expected rate.
Theorem 5.6.
The parameter above plays the same role as in Theorem 3.9, case (b); quantifies the exponential decay of the random misfit with respect to . For this reason, is constrained to the same range of values as given there, and determines the parameters and that partly describe the convergence rates in (3.12) and (3.13). As shown in the proof, the reason why does not appear to play any further role is due to (5.10) and Corollary 5.5. In particular, since Assumption 5.3 holds with , Corollary 5.5 ensures that the exponential decay parameter need not be constrained to any bounded interval.
The continuous dependence on of the parameters of Assumptions 5.1, 5.2 and 5.3 also allows for parameters that do not depend on , e.g. . The assumption of continuous dependence on of the parameters, and the assumptions on , ensure that the map is uniformly bounded by a scalar that depends only on ; from this the exponential integrability hypothesis on of Theorem 3.9(b) holds, and we can apply the corresponding conclusions. While these assumptions may appear to be strong, they simplify the analysis considerably and thus are not uncommon in the literature on parameter inference for dynamical systems. We leave the investigation of weaker assumptions for future work.
6 Concluding remarks
In this paper we have considered the impact upon a BIP of replacing the log-likelihood function by a random function . Such approximations occur for example when a cheap stochastic emulator is used in place of an expensive exact log-likelihood, or when a probabilistic solver is used to simulate the forward model.
Our results show that such approximations are well-posed, with the approximate Bayesian posterior distribution converging to the true Bayesian posterior as the error between and , measured in a suitable sense, goes to zero. More precisely, we have shown that the convergence rate of the random log-likelihood to — as assessed in a nested norm with respect to the distribution of and the Bayesian prior distribution of the unknown — transfers to convergence of two natural approximations to the exact Bayesian posterior , namely (a) the randomised posterior measure that simply has in place of , and (b) the deterministic pseudo-marginal posterior measure , in which the likelihood function and marginal likelihood of are individually averaged with respect to .
Since the hypotheses that are required for these results operate directly at the level of finite-order moments of the error , the convergence results in this paper automatically apply to GP approximations, as previously considered by Stuart and Teckentrup (2017), which have moments of all orders. However, in a substantial generalisation, Theorems 3.1 and 3.2 show that, in the case,
for general approximations . This optimal bound requires that the random misfit allows pointwise bounds on and with respect to the distribution on and the Bayesian prior on the unknown . If the distribution of does not allow pointwise () bounds on and , but only bounds in for some , then the norms in the bounds above need to be strengthened to higher order norms and/or higher order moments of the error , resulting in the quantity appearing on the right hand sides, for some and (respectively ).
Our error bounds are explicit in the sense that the aforementioned exponents and can typically be calculated explicitly given the structure of . This is the case for the GP emulators considered in Stuart and Teckentrup (2017), and also the randomised misfit models and probabilistic numerical solvers considered here. The constant in the error bounds, on the other hand, is typically not computable in advance; it involves quantities such as the normalising constants and , which for most forward models are not known analytically and very expensive to compute numerically. In a sense, this is similar to the everyday situation of using an ODE or PDE solver of known order but unknown constant prefactor.
A significant open question in this work is the one highlighted at the end of Section 4: in contrast to randomised dimension reduction using bounded random variables, is the case of Gaussian randomly-projected misfits one in which the MAP problem and BIP genuinely have different convergence properties?
Acknowledgements
ALT is partially supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. HCL and TJS are partially supported by the Freie Universität Berlin within the Excellence Initiative of the German Research Foundation (DFG). This work was partially supported by the DFG through grant CRC 1114 “Scaling Cascades in Complex Systems”, and by the National Science Foundation (NSF) under grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute (SAMSI) and SAMSI’s QMC Working Group II “Probabilistic Numerics”. Any opinions, findings, and conclusions or recommendations expressed in this article are those of the authors and do not necessarily reflect the views of the above-named funding agencies and institutions.
References
- Andrieu and Roberts [2009] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist., 37(2):697–725, 2009. 10.1214/07-AOS574.
- Beaumont [2003] M. A. Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160, 2003. URL http://www.genetics.org/content/164/3/1139.
- Bogachev [2007] V. I. Bogachev. Measure Theory. Vol. I, II. Springer-Verlag, Berlin, 2007. 10.1007/978-3-540-34514-5.
- Butcher [1975] J. C. Butcher. A stability property of implicit Runge–Kutta methods. BIT Num. Math., 15(4):358–361, Dec 1975. 10.1007/BF01931672.
- Chkrebtii et al. [2016] O. A. Chkrebtii, D. A. Campbell, B. Calderhead, and M. A. Girolami. Bayesian solution uncertainty quantification for differential equations. Bayesian Anal., 11(4):1239–1267, 2016. 10.1214/16-BA1017.
- Conrad et al. [2016] P. R. Conrad, M. Girolami, S. Särkkä, A. M. Stuart, and K. C. Zygalakis. Statistical analysis of differential equations: introducing probability measures on numerical solutions. Stat. Comput., 27(4):1065–1082, 2016. 10.1007/s11222-016-9671-0.
- Cotter et al. [2013] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: modifying old algorithms to make them faster. Statist. Sci., 28(3):424–446, 2013. 10.1214/13-STS421.
- Dashti and Stuart [2016] M. Dashti and A. M. Stuart. The Bayesian approach to inverse problems. In R. Ghanem, D. Higdon, and H. Owhadi, editors, Handbook of Uncertainty Quantification, pages 311–428. Springer, 2016. 10.1007/978-3-319-11259-6_7-1.
- Dashti et al. [2012] M. Dashti, S. Harris, and A. M. Stuart. Besov priors for Bayesian inverse problems. Inverse Probl. Imaging, 6(2):183–200, 2012. 10.3934/ipi.2012.6.183.
- Diaconis [1988] P. Diaconis. Bayesian numerical analysis. In Statistical Decision Theory and Related Topics, IV, Vol. 1 (West Lafayette, Ind., 1986), pages 163–175. Springer, New York, 1988.
- Dupuis and Ellis [1997] P. Dupuis and R. S. Ellis. A Weak Convergence Approach to the Theory of Large Deviations. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons, Inc., New York, 1997. 10.1002/9781118165904.
- Evans and Stark [2002] S. N. Evans and P. B. Stark. Inverse problems as statistics. Inverse Probl., 18(4):R55–R97, 2002. 10.1088/0266-5611/18/4/201.
- Hairer et al. [2009] E. Hairer, S. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems, volume 8 of Springer Series in Computational Mathematics. Springer-Verlag, New York, 2009. 10.1007/978-3-540-78862-1.
- Hennig et al. [2015] P. Hennig, M. A. Osborne, and M. Girolami. Probabilistic numerics and uncertainty in computations. Proc. A., 471(2179):20150142, 17, 2015. 10.1098/rspa.2015.0142.
- Higham et al. [2002] D. J. Higham, X. Mao, and A. M. Stuart. Strong convergence of Euler-type methods for nonlinear stochastic differential equations. SIAM J. Numer. Anal., 40(3):1041–1063, 2002. 10.1137/S0036142901389530.
- Hosseini [2017] B. Hosseini. Well-posed Bayesian inverse problems with infinitely-divisible and heavy-tailed prior measures. SIAM/ASA J. Uncertain. Quantif., 5(1):1024–1060, 2017. 10.1137/16M1096372.
- Hosseini and Nigam [2017] B. Hosseini and N. Nigam. Well-posed Bayesian inverse problems: priors with exponential tails. SIAM/ASA J. Uncertain. Quantif., 5(1):436–465, 2017. 10.1137/16M1076824.
- Jordan and Kinderlehrer [1996] R. Jordan and D. Kinderlehrer. An extended variational principle. In Partial Differential Equations and Applications, volume 177 of Lecture Notes in Pure and Appl. Math., pages 187–200. Dekker, New York, 1996. 10.5006/1.3292113.
- Kaipio and Somersalo [2005] J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems, volume 160 of Applied Mathematical Sciences. Springer-Verlag, New York, 2005. 10.1007/b138659.
- Kraft [1955] C. H. Kraft. Some conditions for consistency and uniform consistency of statistical procedures. Univ. California Publ. Statist., 2:125–141, 1955.
- Lassas and Siltanen [2004] M. Lassas and S. Siltanen. Can one use total variation prior for edge-preserving Bayesian inversion? Inverse Probl., 20(5):1537–1563, 2004. 10.1088/0266-5611/20/5/013.
- Law et al. [2015] K. Law, A. Stuart, and K. Zygalakis. Data Assimilation: A Mathematical Introduction, volume 62 of Texts in Applied Mathematics. Springer, 2015. 10.1007/978-3-319-20325-6.
- Le et al. [2017] E. B. Le, A. Myers, T. Bui-Thanh, and Q. P. Nguyen. A data-scalable randomized misfit approach for solving large-scale PDE-constrained inverse problems. Inverse Probl., 33(6):065003, 2017. 10.1088/1361-6420/aa6cbd.
- Lie et al. [2017] H. C. Lie, A. M. Stuart, and T. J. Sullivan. Strong convergence rates of probabilistic integrators for ordinary differential equations, 2017. arXiv:1703.03680.
- Nemirovski et al. [2008] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2008. 10.1137/070704277.
- Ohta and Takatsu [2011] S. Ohta and A. Takatsu. Displacement convexity of generalized relative entropies. Adv. Math., 228(3):1742–1787, 2011. 10.1016/j.aim.2011.06.029.
- Owhadi and Scovel [2017] H. Owhadi and C. Scovel. Qualitative robustness in Bayesian inference. ESAIM Probab. Stat., 21:251–274, 2017. 10.1051/ps/2017014.
- Owhadi et al. [2015] H. Owhadi, C. Scovel, and T. J. Sullivan. Brittleness of Bayesian inference under finite information in a continuous world. Electron. J. Stat., 9(1):1–79, 2015. 10.1214/15-EJS989.
- Pinsker [1964] M. S. Pinsker. Information and Information Stability of Random Variables and Processes. Holden-Day, Inc., San Francisco, Calif.-London-Amsterdam, 1964.
- Reich and Cotter [2015] S. Reich and C. Cotter. Probabilistic Forecasting and Bayesian Data Assimilation. Cambridge University Press, New York, 2015. 10.1017/CBO9781107706804.
- Robert and Casella [1999] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, New York, 1999. 10.1007/978-1-4757-3071-5.
- Schober et al. [2014] M. Schober, D. K. Duvenaud, and P. Hennig. Probabilistic ODE solvers with Runge–Kutta means. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 739–747. Curran Associates, Inc., 2014. https://papers.nips.cc/paper/5451-probabilistic-ode-solvers-with-runge-kutta-means.
- Shapiro et al. [2009] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory, volume 9 of MPS/SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Programming Society (MPS), Philadelphia, PA, 2009. 10.1137/1.9780898718751.
- Skilling [1992] J. Skilling. Bayesian solution of ordinary differential equations. In C. R. Smith, G. J. Erickson, and P. O. Neudorfer, editors, Maximum Entropy and Bayesian Methods, volume 50 of Fundamental Theories of Physics, pages 23–37. Springer, 1992. 10.1007/978-94-017-2219-3.
- Stuart [2010] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numer., 19:451–559, 2010. 10.1017/S0962492910000061.
- Stuart and Teckentrup [2017] A. M. Stuart and A. L. Teckentrup. Posterior consistency for Gaussian process approximations of Bayesian posterior distributions. Math. Comput., 2017. 10.1090/mcom/3244.
- Sullivan [2017] T. J. Sullivan. Well-posed Bayesian inverse problems and heavy-tailed stable quasi-Banach space priors. Inverse Probl. Imaging, 11(5):857–874, 2017. 10.3934/ipi.2017040.
Appendix: Proofs of Results
The proofs in this section will make repeated use of the following inequalities for real and :
(A.1) | |||||
(A.2) | |||||
(A.3) | |||||
for . | (A.4) |
We also have, for arbitrary and (not necessarily integer-valued), by the triangle inequality and Jensen’s inequality,
(A.5) |
Proof of Theorem 3.1.
Using (2.4) and (3.1), we have
Inequality (A.1) with and and the definition (2.1) of the Hellinger distance yield
For the first term, we use inequality (A.2) with and , together with Hölder’s inequality with conjugate exponents and , to derive
(A.6) |
We estimate the second factor on the right-hand side of (A.6). Using the facts that is decreasing on , that for all , and that both and are strictly positive, we obtain
For , the partition and the corresponding integral inequalities on and imply that . Hence,
(A.7) |
where is the constant specified in assumption (a). This completes our estimate for the second factor on the right-hand side of (A.6). For the first factor, the linearity of expectation, inequality (A.3), and Hölder’s inequality with conjugate exponents with respect to and with respect to give
(A.8) |
Letting be the constant in assumption (b), and using (A.7), it follows that
Proof of Theorem 3.2.
This proof is similar to the proof of Theorem 3.1. Since
Tonelli’s theorem, inequality (A.1), and Jensen’s inequality yield
For the first term , inequality (A.3), and Hölder’s inequality with conjugate exponent pairs and give
By (a), we may bound the first factor on the right-hand side of the last inequality by . Now by (A.2) with and , and by inequality (A.4), we obtain (see the proof of Theorem 3.1 after (A.8)) that
Jensen’s inequality and another application of inequality (A.2) yield
Combining the preceding two estimates, using Tonelli’s theorem and Hölder’s inequality with the same conjugate exponent pairs and as used in the bound for , and using (b), we get
Combining the preceding estimates yields (3.4). ∎
Proof of Lemma 3.4.
Since , examination of assumption (a) of Theorem 3.1 indicates that we may set and . By (3.5), it follows that ; thus assumption (b) of Theorem 3.1 holds with (so that ) and . We now prove that Assumption (c) of Theorem 3.1 holds. It follows by setting and in inequality (A.3) that . Thus
(A.9) |
Using Jensen’s inequality, (A.9), Tonelli’s theorem, and (3.6),
The last inequality implies that assumption (c) of Theorem 3.1 holds with the same as in (3.6), since for any that satisfies and (3.6), we have
and
and combining both the implied statements yields assumption (c) of Theorem 3.1; thus (3.3) holds, as desired.
Now note that (3.5) implies that assumption (a) of Theorem 3.2 holds with and . Furthermore, (3.5) also implies that for all . Thus, given that is -a.s. constant, and given that there exists some such that ,
(A.10) |
A necessary and sufficient condition for setting above (and therefore also in assumption (b) of Theorem 3.2) is that is -a.s. bounded away from zero by a constant that does not depend on . By the convexity and monotonicity of ,
for as in (3.7). In particular, if (3.7) holds, then so does assumption (b) of Theorem 3.2, with and , by inequality (A.10). ∎
Proof of Lemma 3.5.
The proof proceeds in the same way as the proof of Lemma 3.4, with the exception that we need to prove that the assumption that for some implies that assumption (a) of Theorem 3.1 and assumption (b) of Theorem 3.2 hold with the stated parameters. Therefore, the proof will only concern these two assertions. Since is strictly convex on for any , Jensen’s inequality yields that . Therefore, setting , we find that assumption (a) of Theorem 3.1 holds, with and . The inequality for implies that
while Jensen’s inequality, Tonelli’s theorem, and the definition of the -norm yield that
Since the last term is finite for by the hypothesis that , it follows that assumption (b) of Theorem 3.2 holds with the parameters , , and the scalar . ∎
Proof of Proposition 3.6.
Recall (3.8), and fix an arbitrary . We have
Adding and subtracting inside the absolute value, rearranging terms, applying the Cauchy–Schwarz inequality, and letting be the largest eigenvalue of yields
(A.11) |
By the triangle inequality,
and the triangle inequality and (3.8) yield
Together, these inequalities yield
and substituting the above into (A.11) yields
thus proving (3.9). Using (A.5) yields
Now take expectations with respect to : since and are constant with respect to ,
and taking the th root of both sides proves (3.10). ∎
Proof of Corollary 3.7.
Proof of Lemma 3.8.
Proof of Theorem 3.9.
We first verify that Assumption 3.3 holds. Since and satisfy (3.8), it follows that we may set in (3.5). Since we assume throughout that , it follows that has moments of all orders, and hence belongs to for all . Therefore, given that (3.11) holds for , it follows from Jensen’s inequality and Corollary 3.7 that we can make as small as desired. In particular, for any that satisfies , there exists a such that, for all , (3.6) holds.
Case (a). The hypotheses in this case ensure that we may apply Lemma 3.4. Set and , so that and . Substituting these exponents into (3.3a) and applying Corollary 3.7 with and (note that ), we obtain
where changes value between inequalities. Thus we have shown that (3.12) holds with and .
To prove that (3.13) holds with the desired exponents, we again use Lemma 3.4 to set , so that . Substituting these exponents into (3.4), and applying Corollary 3.7 with and , we obtain
where changes value between inequalities. Thus we have shown that (3.13) holds with .
It remains to ensure that both the rightmost terms above converge to zero. Since (3.11) holds with and , the desired convergence follows from the nesting property of finite-measure -spaces. Therefore, both and converge to as claimed.
Case (b). Since the arguments in this case are the same as in the previous case, we only record the different material.
The hypotheses ensure that we may apply Lemma 3.5. Set and , so that and . Substituting these exponents into (3.3a) and applying Corollary 3.7 with and , we obtain
where changes value between inequalities. Thus we have shown that (3.12) holds with and .
To prove that (3.13) holds with the desired exponents, we again use Lemma 3.5 to set and , so that and . Substituting these exponents into (3.4), and applying Corollary 3.7 with and , we obtain
where changes value between inequalities. Thus (3.13) holds with and . Since (3.11) holds with and , it follows from the nesting property of -spaces defined on finite measure spaces that both
converge to zero. ∎
Proof of Proposition 4.1.
We start by verifying the assumptions of Theorem 3.2. First, since for all , and for all and all , assumption (a) is satisfied for . For assumption (b), we then have, for any ,
Since for all and all , we have for any
(A.12) |
Using the -sparse distribution of , we further have and
which implies that . It follows that, for any ,
and assumption (b) is hence also satisfied for . Hence, by Theorem 3.2,
Using standard properties of Monte Carlo estimators (see e.g. Robert and Casella [1999]), we have
Now, using , , the linearity of expectation, the -sparse distribution of , and , we have
The claim (4.1) now follows, with the choice of constant as in (4.2). ∎
Proof of Theorem 5.6.
Recall that is a set of time points in , indexed by an index set with cardinality . In (5.10), we observed that
Fix . Omitting the argument of , , and , we have
where the last two inequalities follow from (3.9) and Young’s inequality for . Using (5.10), we therefore obtain
where we note that we have suppressed the -dependence of and simply written . Since is compact and is continuous, it follows that and hence are continuous on ; by the extreme value theorem, is bounded on , i.e. is finite. Using this fact and taking expectations with respect to we obtain
By Corollary 5.5, the two terms on the right-hand side are finite for every . Given the continuous dependence of the parameters of Assumptions 5.1, 5.2, and 5.3 on , and given that is a compact subset of a finite-dimensional Euclidean space, it follows that the right-hand side can be bounded by a scalar that does not depend on any . Hence, the function belongs to , so that the first hypothesis of Theorem 3.9(b) holds. For the second hypothesis, observe that, since Assumption 5.3 holds for , it follows that (5.12) holds for any , and thus (3.11) holds for any . Therefore the hypotheses of Theorem 3.9(b) are satisfied, and the desired conclusion follows from Theorem 3.9. ∎