This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Convergence of the Deep BSDE Method
for Coupled FBSDEs

Jiequn Han jiequnh@princeton.edu Jihao Long jihaol@pku.edu.cn School of Mathematical Sciences, Peking University, Beijing 100871, China
Abstract

The recently proposed numerical algorithm, deep BSDE method, has shown remarkable performance in solving high-dimensional forward-backward stochastic differential equations (FBSDEs) and parabolic partial differential equations (PDEs). This article lays a theoretical foundation for the deep BSDE method in the general case of coupled FBSDEs. In particular, a posteriori error estimation of the solution is provided and it is proved that the error converges to zero given the universal approximation capability of neural networks. Numerical results are presented to demonstrate the accuracy of the analyzed algorithm in solving high-dimensional coupled FBSDEs.

1 Introduction

Forward-backward stochastic differential equations (FBSDEs) and partial differential equations (PDEs) of parabolic type have found numerous applications in stochastic control, finance, physics, etc., as a ubiquitous modeling tool. In most situations encountered in practice the equations cannot be solved analytically but require certain numerical algorithms to provide approximate solutions. On the one hand, the dominant choices of numerical algorithms for PDEs are mesh-based methods, such as finite differences, finite elements, etc. On the other hand, FBSDEs can be tackled directly through probabilistic means, with appropriate methods for the approximation of conditional expectation. Since these two kinds of equations are intimately connected through the nonlinear Feynman–Kac formula [1], the algorithms designed for one kind of equation can often be used to solve another one.

However, the aforementioned numerical algorithms become more and more difficult, if not impossible, when the dimension increases. They are doomed to run into the so-called “curse of dimensionality” [2] when the dimension is high, namely, the computational complexity grows exponentially as the dimension grows. The classical mesh-based algorithms for PDEs require a mesh of size O(Nd)O(N^{d}). The simulation of FBSDEs faces a similar difficulty in the general nonlinear cases, due to the need to compute conditional expectation in high dimension. The conventional methods, including the least squares regression [3], Malliavin approach [4], and kernel regression [5], are all of exponential complexity. There are a limited number of cases where practical high-dimensional algorithms are available. For example, in the linear case, Feynman–Kac formula and Monte Carlo simulation together provide an efficient approach to solving PDEs and associated BSDEs numerically. In addition, methods based on the branching diffusion process [6, 7] and multilevel Picard iteration [8, 9, 10] overcome the curse of dimensionality in their considered settings. We refer [9] for the detailed discussion on the complexity of the algorithms mentioned above. Overall there is no numerical algorithm in literature so far proved to overcome the curse of dimensionality for general quasilinear parabolic PDEs and the corresponding FBSDEs.

A recently developed algorithm, called the deep BSDE method [11, 12], has shown astonishing power in solving general high-dimensional FBSDEs and parabolic PDEs [13, 14, 15]. In contrast to conventional methods, the deep BSDE method employs neural networks to approximate unknown gradients and reformulates the original equation-solving problem into a stochastic optimization problem. Thanks to the universal approximation capability and parsimonious parameterization of neural networks, in practice the objective function can be effectively optimized in high-dimensional cases, and the function values of interests are obtained quite accurately.

The deep BSDE method was initially proposed for decoupled FBSDEs. In this paper, we extend the method to deal with coupled FBSDEs and a broader class of quasilinear parabolic PDEs. Furthermore, we present an error analysis of the proposed scheme, including decoupled FBSDEs as a special case. Our theoretical result consists of two theorems. Theorem 1 provides a posteriori error estimation of the deep BSDE method. As long as the objective function is optimized to be close to zero under fine time discretization, the approximate solution is close to the true solution. In other words, in practice, the accuracy of the numerical solution is effectively indicated by the value of the objective function. Theorem 2 shows that such a situation is attainable, by relating the infimum of the objective function to the expression ability of neural networks. As an implication of the universal approximation property (in the L2L^{2} sense), there exist neural networks with suitable parameters such that the obtained numerical solution is approximately accurate. To the best of our knowledge, this is the first theoretical result of the deep BSDE method for solving FBSDEs and parabolic PDEs. Although our numerical algorithm is based on neural networks, the theoretical result provided here is equally applicable to the algorithms based on other forms of function approximations.

The article is organized as follows. In section 2, we precisely state our numerical scheme for coupled FBSDEs and quasilinear parabolic PDEs and give the main theoretical results of the proposed numerical scheme. In section 3, the basic assumptions and some useful results from the literature are given for later use. The proofs of the two main theorems are provided in section 4 and section 5, respectively. Some numerical experiments with the proposed scheme are presented in section 6.

2 A Numerical Scheme for Coupled FBSDEs and Main Results

Let T(0,+)T\in(0,+\infty) be the terminal time, (Ω,𝔽,{t}0tT,)(\Omega,\mathbb{F},\{\mathcal{F}_{t}\}_{0\leq t\leq T},\mathbb{P}) be a filtered probability space equipped with a dd-dimensional standard Brownian motion {Wt}0tT\{W_{t}\}_{0\leq t\leq T} starting from 0. ξ\xi is a square-integrable random variable independent of {Wt}0tT\{W_{t}\}_{0\leq t\leq T}. We use the same notation (Ω,𝔽,{t}0tT,)(\Omega,\mathbb{F},\{\mathcal{F}_{t}\}_{0\leq t\leq T},\mathbb{P}) to denote the filtered probability space generated by {Wt+ξ}0tT\{W_{t}+\xi\}_{0\leq t\leq T}. The notation |x||x| denotes the Euclidean norm of a vector xx and A=trace(ATA)\|A\|=\sqrt{\mathrm{trace}(A^{\operatorname{T}}A)} denotes the Frobenius norm of a matrix AA.

Consider the following coupled FBSDEs

Xt=ξ+0tb(s,Xs,Ys)ds+0tσ(s,Xs,Ys)dWs,\displaystyle X_{t}=\xi+\int_{0}^{t}b(s,X_{s},Y_{s})\,\mathrm{d}s+\int_{0}^{t}\sigma(s,X_{s},Y_{s})\,\mathrm{d}W_{s}, (2.1)
Yt=g(XT)+tTf(s,Xs,Ys,Zs)dstT(Zs)TdWs,\displaystyle Y_{t}=g(X_{T})+\int_{t}^{T}f(s,X_{s},Y_{s},Z_{s})\,\mathrm{d}s-\int_{t}^{T}(Z_{s})^{\operatorname{T}}\,\mathrm{d}W_{s}, (2.2)

in which XtX_{t} takes values in m\mathbb{R}^{m}, YtY_{t} takes values in \mathbb{R}, and ZtZ_{t} takes values in d\mathbb{R}^{d}. Here we assume YtY_{t} to be one-dimensional to simplify the presentation. The result can be extended without any difficulty to the case where YtY_{t} is multi-dimensional. We say (Xt,Yt,Zt)(X_{t},Y_{t},Z_{t}) is a solution of the above FBSDEs, if all its components are t\mathcal{F}_{t}-adapted and square-integrable, together satisfying equations (2.1)(2.2).

Solving coupled FBSDEs numerically is more difficult than solving decoupled FBSDEs. Except the Picard iteration method developed in [16], most methods exploit the relation to quasilinear parabolic PDEs via the four-time-step-scheme in [17]. This type of methods suffers from high dimensionality due to spatial discretization of PDEs. In contrast, our strategy, starting from simulating the coupled FBSDEs directly, is a new purely probabilistic scheme. To state the numerical algorithm precisely, we consider a partition of the time interval [0,T][0,T], π:0=t0<t1<<tN=T\pi:0=t_{0}<t_{1}<\dots<t_{N}=T with h=T/Nh=T/N and ti=iht_{i}=ih. Let ΔWiWti+1Wti\Delta W_{i}\coloneqq W_{t_{i+1}}-W_{t_{i}} for i=0,1,,N1i=0,1,\dots,N-1. Inspired by the nonlinear Feynman–Kac formula that will be introduced below, we view Y0Y_{0} as a function of X0X_{0} and view ZtZ_{t} as a function of XtX_{t} and YtY_{t}. Equipped with this viewpoint, our goal becomes finding appropriate functions μ0π:m\mu_{0}^{\pi}:\mathbb{R}^{m}\rightarrow\mathbb{R} and ϕiπ:m×d\phi_{i}^{\pi}:\mathbb{R}^{m}\times\mathbb{R}\rightarrow\mathbb{R}^{d} for i=0,1,,N1i=0,1,\dots,N-1 such that μ0π(ξ)\mu_{0}^{\pi}(\xi) and ϕiπ(Xtiπ,Ytiπ)\phi_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}) can serve as good surrogates of Y0Y_{0} and ZtiZ_{t_{i}}, respectively. To this end, we consider the classical Euler scheme

{X0π=ξ,Y0π=μ0π(ξ),Xti+1π=Xtiπ+b(ti,Xtiπ,Ytiπ)h+σ(ti,Xtiπ,Ytiπ)ΔWi,Ztiπ=ϕiπ(Xtiπ,Ytiπ),Yti+1π=Ytiπf(ti,Xtiπ,Ytiπ,Ztiπ)h+(Ztiπ)TΔWi.\displaystyle\begin{dcases}X_{0}^{\pi}=\xi,\quad Y_{0}^{\pi}=\mu_{0}^{\pi}(\xi),\\ X_{t_{i+1}}^{\pi}=X_{t_{i}}^{\pi}+b(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})h+\sigma(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})\Delta W_{i},\\ Z_{t_{i}}^{\pi}=\phi_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}),\\ Y_{t_{i+1}}^{\pi}=Y_{t_{i}}^{\pi}-f(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi})h+(Z_{t_{i}}^{\pi})^{\operatorname{T}}\Delta W_{i}.\end{dcases} (2.3)

Without loss of clarity, here we use the notation X0πX_{0}^{\pi} as Xt0πX_{t_{0}}^{\pi}, XTπX_{T}^{\pi} as XtNπX_{t_{N}}^{\pi}, etc.

Following the spirit of the deep BSDE method, we employ a stochastic optimizer to solve the following stochastic optimization problem

infμ0π𝒩0,ϕiπ𝒩iF(μ0π,ϕ0π,,ϕN1π)E|g(XTπ)YTπ|2,\displaystyle\inf_{\mu_{0}^{\pi}\in\mathcal{N}^{\prime}_{0},\phi_{i}^{\pi}\in\mathcal{N}_{i}}F(\mu_{0}^{\pi},\phi_{0}^{\pi},\dots,\phi_{N-1}^{\pi})\coloneqq E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}, (2.4)

where 𝒩0\mathcal{N}^{\prime}_{0} and 𝒩i(0iN1)\mathcal{N}_{i}\leavevmode\nobreak\ (0\leq i\leq{N-1}) are parametric function spaces generated by neural networks. To see intuitively where the objective function (2.4) comes from, we consider the following variational problem:

infY0,{Zt}0tTE|g(XT)YT|2,\displaystyle\inf_{Y_{0},\{Z_{t}\}_{0\leq t\leq T}}E|g(X_{T})-Y_{T}|^{2}, (2.5)
s.t.Xt=ξ+0tb(s,Xs,Ys)ds+0tσ(s,Xs,Ys)dWs,\displaystyle s.t.\leavevmode\nobreak\ \leavevmode\nobreak\ X_{t}=\xi+\int_{0}^{t}b(s,X_{s},Y_{s})\,\mathrm{d}s+\int_{0}^{t}\sigma(s,X_{s},Y_{s})\,\mathrm{d}W_{s},
Yt=Y00tf(s,Xs,Ys,Zs)ds+0t(Zs)TdWs,\displaystyle\qquad Y_{t}=Y_{0}-\int_{0}^{t}f(s,X_{s},Y_{s},Z_{s})\,\mathrm{d}s+\int_{0}^{t}(Z_{s})^{\operatorname{T}}\,\mathrm{d}W_{s},

where Y0Y_{0} is 0\mathcal{F}_{0}-measurable and square-integrable, and ZtZ_{t} is a t\mathcal{F}_{t}-adapted square-integrable process. The solution of the FBSDEs (2.1)(2.2) is a minimizer of the above problem since the loss function attains zero when it is evaluated at the solution. In addition, the wellposedness of the FBSDEs (under some regularity conditions) ensures the existence and uniqueness of the minimizer. Therefore, we expect (2.4), as a discretized counterpart of (2.5), defines a benign optimization problem and the associated near-optimal solution provides us a good approximate solution of the original FBSDEs. The reason we do not represent ZtiZ_{t_{i}} as a function of XtiX_{t_{i}} only is that the process {Xtiπ}0iN\{X_{t_{i}}^{\pi}\}_{0\leq i\leq N} is not Markovian, while the process {Xtiπ,Ytiπ}0iN\{X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\}_{0\leq i\leq N} is Markovian, which facilitates our analysis considerably. If bb and σ\sigma are both independent of YY, then the FBSDEs (2.1)(2.2) are decoupled, we can take ϕiπ\phi_{i}^{\pi} as a function of XtiπX_{t_{i}}^{\pi} only, as the numerical scheme introduced in [11, 12].

Our two main theorems regarding the deep BSDE method are the following, mainly on the justification and property of the objective function (2.4) in the general coupled case, regardless of the specific choice of parametric function spaces. An important assumption for the two theorems is the so-called weak coupling or monotonicity condition, which will be explained in detail in section 3. The precise statement of the theorems can be found in Theorem 1 (section 4) and Theorem 2 (section 5), respectively.

Theorem 1.

Under some assumptions, there exists a constant C, independent of h, d, and m, such that for sufficiently small h,

supt[0,T](E|XtX^tπ|2+E|YtY^tπ|2)+0TE|ZtZ^tπ|2dtC[h+E|g(XTπ)YTπ|2],\sup_{t\in[0,T]}(E|X_{t}-\hat{X}_{t}^{\pi}|^{2}+E|Y_{t}-\hat{Y}_{t}^{\pi}|^{2})+\int_{0}^{T}E|Z_{t}-\hat{Z}_{t}^{\pi}|^{2}\,\mathrm{d}t\leq C[h+E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}], (2.6)

where X^tπ=Xtiπ\hat{X}_{t}^{\pi}=X_{t_{i}}^{\pi}, Y^tπ=Ytiπ\hat{Y}_{t}^{\pi}=Y_{t_{i}}^{\pi}, Z^tπ=Ztiπ\hat{Z}_{t}^{\pi}=Z_{t_{i}}^{\pi} for t[ti,ti+1)t\in[t_{i},t_{i+1}).

Theorem 2.

Under some assumptions, there exists a constant C, independent of h, d and m, such that for sufficiently small h,

infμ0π𝒩0,ϕiπ𝒩iE|g(XTπ)YTπ|2\displaystyle\inf_{\mu_{0}^{\pi}\in\mathcal{N}^{\prime}_{0},\phi_{i}^{\pi}\in\mathcal{N}_{i}}E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}
\displaystyle\leq\leavevmode\nobreak\ C{h+infμ0π𝒩0,ϕiπ𝒩i[E|Y0μ0π(ξ)|2\displaystyle C\Big{\{}h+\inf_{\mu_{0}^{\pi}\in\mathcal{N}^{\prime}_{0},\phi_{i}^{\pi}\in\mathcal{N}_{i}}\big{[}E|Y_{0}-\mu_{0}^{\pi}(\xi)|^{2}
+i=0N1E|E[Z~ti|Xtiπ,Ytiπ]ϕiπ(Xtiπ,Ytiπ)|2h]},\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\leavevmode\nobreak\ \leavevmode\nobreak\ +\sum_{i=0}^{N-1}E|E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-\phi_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})|^{2}h\big{]}\Big{\}},

where Z~ti=h1E[titi+1Ztdt|ti]\tilde{Z}_{t_{i}}=h^{-1}E[\int_{t_{i}}^{t_{i+1}}Z_{t}\,\mathrm{d}t|\mathcal{F}_{t_{i}}]. If bb and σ\sigma are independent of YY, the term E[Z~ti|Xtiπ,Ytiπ]E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}] can be replaced with E[Z~ti|Xtiπ]E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi}].

Briefly speaking, Theorem 1 states that the simulation error (left side of equation (2.6)) can be bounded through the value of the objective function (2.4). To the best of our knowledge, this is the first result for the error estimation of the coupled FBSDEs, concerning both time discretization error and terminal distance. Theorem 2 states that the optimal value of the objective function can be small if the approximation capability of the parametric function spaces (𝒩0\mathcal{N}^{\prime}_{0} and 𝒩i\mathcal{N}_{i} above) is high. Neural networks are a promising candidate for such a requirement, especially in high-dimensional problems. There are numerous results, dating back to the 90s (see, e.g., [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]), in regard to the universal approximation and complexity of neural networks. There are also some recent analysis [28, 29, 30, 31] on approximating the solutions of certain parabolic partial differential equations with neural networks. However, the problem is still far from resolved. Theorem 2 implies that if the involved conditional expectations can be approximated by neural networks whose numbers of parameters growing at most polynomially both in the dimension and the reciprocal of the required accuracy, then the solutions of the considered FBSDEs can be represented in practice without the curse of dimensionality. Under what conditions this assumption is true is beyond the scope of this work and remains for further investigation.

The above-mentioned scheme in (2.3)(2.4) is for solving FBSDEs. The so-called nonlinear Feynman–Kac formula, connecting FBSDEs with the quasilinear parabolic PDEs, provides an approach to numerically solve quasilinear parabolic PDEs (2.7) below through the same scheme. We recall a concrete version of the nonlinear Feynman–Kac formula in Theorem 3 below and refer interested readers to e.g., [32] for more details. According to this formula, the term E|Y0Y0π|2E|Y_{0}-Y_{0}^{\pi}|^{2} can be interpreted as E|u(0,ξ)μ0π(ξ)|2E|u(0,\xi)-\mu_{0}^{\pi}(\xi)|^{2}. Therefore, we can choose the random variable ξ\xi with a delta distribution, a uniform distribution in a bounded region, or any other distribution we are interested in. After solving the optimization problem, we obtain μ0π(ξ)\mu_{0}^{\pi}(\xi) as an approximation of u(0,ξ)u(0,\xi). See [11, 12] for more details.

Theorem 3.

Assume

  1. 1.

    m=dm=d and b(t,x,y)b(t,x,y), σ(t,x,y)\sigma(t,x,y), f(t,x,y,z)f(t,x,y,z) are smooth functions with bounded first-order derivatives with respect to x,y,zx,y,z.

  2. 2.

    There exist a positive continuous function ν\nu and a constant μ\mu, satisfying that

    ν(|y|)𝐈σσT(t,x,y)μ𝐈,\displaystyle\nu(|y|)\mathbf{I}\leq\sigma\sigma^{\operatorname{T}}(t,x,y)\leq\mu\mathbf{I},
    |b(t,x,0)|+|f(t,x,0,z)|μ.\displaystyle|b(t,x,0)|+|f(t,x,0,z)|\leq\mu.
  3. 3.

    There exists a constant α(0,1)\alpha\in(0,1) such that gg is bounded in the Hölder space C2,α(m)C^{2,\alpha}(\mathbb{R}^{m}).

Then the following quasilinear PDE has a unique classical solution u(t,x)u(t,x) that is bounded with bounded utu_{t}, xu\nabla_{x}u, and x2u\nabla^{2}_{x}u,

{ut+12trace(σσT(t,x,u)x2u)+bT(t,x,u)xu+f(t,x,u,σT(t,x,u)xu)=0,u(T,x)=g(x).\displaystyle\begin{dcases}u_{t}+\frac{1}{2}\text{trace}(\sigma\sigma^{\operatorname{T}}(t,x,u)\nabla^{2}_{x}u)\\ \hphantom{u_{t}}+\leavevmode\nobreak\ b^{\operatorname{T}}(t,x,u)\nabla_{x}u+f(t,x,u,\sigma^{\operatorname{T}}(t,x,u)\nabla_{x}u)=0,\\ u(T,x)=g(x).\end{dcases} (2.7)

The associated FBSDEs (2.1)(2.2) have a unique solution (Xt,Yt,Zt)(X_{t},Y_{t},Z_{t}) with Yt=u(t,Xt)Y_{t}=u(t,X_{t}), Zt=σT(t,Xt,u(t,Xt))xu(t,Xt)Z_{t}=\sigma^{\operatorname{T}}(t,X_{t},u(t,X_{t}))\nabla_{x}u(t,X_{t}), and XtX_{t} is the solution of the following SDE

Xt=ξ+0tb(s,Xs,u(s,Xs))ds+0tσ(s,Xs,u(s,Xs))dWs.X_{t}=\xi+\int_{0}^{t}b(s,X_{s},u(s,X_{s}))\,\mathrm{d}s+\int_{0}^{t}\sigma(s,X_{s},u(s,X_{s}))\,\mathrm{d}W_{s}.
Remark.

The statement regarding FBSDEs  (2.1)(2.2) in Theorem 3 is developed through a PDE-based argument, which requires m=dm=d, uniform ellipticity of σ\sigma, and high-order smoothness of b,σ,fb,\sigma,f, and gg. An analogous result through probabilistic argument is given below in Theorem 4 (point 4). In that case, we only need the Lipschitz condition for all of the involved functions, in addition to some weak coupling or monotonicity conditions demonstrated in Assumption 3. Note that the Lipschitz condition alone does not guarantee the existence of a solution to the coupled FBSDEs, even in the situation when b,f,σb,f,\sigma are linear (see [16, 32] for a concrete counterexample).

Remark.

Theorem 3 also implies that the assumption that the drift function bb only depends on x,yx,y is general. If bb depends on zz as well, one can move the associated term in (2.7) into the nonlinearity ff and apply the nonlinear Feynman–Kac formula back to obtain an equivalent system of coupled FBSDEs, in which the new drift function is independent of zz.

3 Preliminaries

In this section, we introduce our assumptions and two useful results in [16]. We use the notation Δx=x1x2\Delta x=x_{1}-x_{2}, Δy=y1y2\Delta y=y_{1}-y_{2}, Δz=z1z2\Delta z=z_{1}-z_{2}.

Assumption 1.
  1. (i)

    There exist (possibly negative) constants kbk_{b}, kfk_{f} such that

    [b(t,x1,y)b(t,x2,y)]TΔx\displaystyle[b(t,x_{1},y)-b(t,x_{2},y)]^{\operatorname{T}}\Delta x kb|Δx|2,\displaystyle\leq k_{b}|\Delta x|^{2},
    [f(t,x,y1,z)f(t,x,y2,z)]Δy\displaystyle[f(t,x,y_{1},z)-f(t,x,y_{2},z)]\Delta y kf|Δy|2.\displaystyle\leq k_{f}|\Delta y|^{2}.
  2. (ii)

    b, σ\sigma, f, g are uniformly Lipschitz continuous with respect to (x,y,z). In particular, there are non-negative constants K, byb_{y}, σx\sigma_{x}, σy\sigma_{y}, fxf_{x}, fzf_{z}, and gxg_{x} such that

    |b(t,x1,y1)b(t,x2,y2)|2\displaystyle|b(t,x_{1},y_{1})-b(t,x_{2},y_{2})|^{2} K|Δx|2+by|Δy|2,\displaystyle\leq K|\Delta x|^{2}+b_{y}|\Delta y|^{2},
    σ(t,x1,y1)σ(t,x2,y2)2\displaystyle\|\sigma(t,x_{1},y_{1})-\sigma(t,x_{2},y_{2})\|^{2} σx|Δx|2+σy|Δy|2,\displaystyle\leq\sigma_{x}|\Delta x|^{2}+\sigma_{y}|\Delta y|^{2},
    |f(t,x1,y1,z1)f(t,x2,y2,z2)|2\displaystyle|f(t,x_{1},y_{1},z_{1})-f(t,x_{2},y_{2},z_{2})|^{2} fx|Δx|2+K|Δy|2+fz|Δz|2,\displaystyle\leq f_{x}|\Delta x|^{2}+K|\Delta y|^{2}+f_{z}|\Delta z|^{2},
    |g(x1)g(x2)|2\displaystyle|g(x_{1})-g(x_{2})|^{2} gx|Δx|2.\displaystyle\leq g_{x}|\Delta x|^{2}.
  3. (iii)

    b(t,0,0)b(t,0,0), f(t,0,0,0)f(t,0,0,0), and σ(t,0,0)\sigma(t,0,0) are bounded. In particular, there are constants b0b_{0}, σ0\sigma_{0}, f0f_{0}, and g0g_{0} such that

    |b(t,x,y)|2\displaystyle|b(t,x,y)|^{2} b0+K|x|2+by|y|2,\displaystyle\leq b_{0}+K|x|^{2}+b_{y}|y|^{2},
    σ(t,x,y)2\displaystyle\|\sigma(t,x,y)\|^{2} σ0+σx|x|2+σy|y|2,\displaystyle\leq\sigma_{0}+\sigma_{x}|x|^{2}+\sigma_{y}|y|^{2},
    |f(t,x,y,z)|2\displaystyle|f(t,x,y,z)|^{2} f0+fx|x|2+K|y|2+fz|z|2,\displaystyle\leq f_{0}+f_{x}|x|^{2}+K|y|^{2}+f_{z}|z|^{2},
    |g(x)|2\displaystyle|g(x)|^{2} g0+gx|x|2.\displaystyle\leq g_{0}+g_{x}|x|^{2}.

We note here byb_{y} et al. are all constants, not partial derivatives. For convenience, we use \mathscr{L} to denote the set of all the constants mentioned above and assume K is the upper bound of \mathscr{L}.

Assumption 2.

b,σ,fb,\sigma,f are uniformly Hölder-12\frac{1}{2} continuous with respect to tt. We assume the same constant K to be the upper bound of the square of the Hölder constants as well.

Assumption 3.

One of the following five cases holds:

  1. 1.

    Small time duration, that is, T is small.

  2. 2.

    Weak coupling of Y into the forward SDE (2.1), that is, byb_{y} and σy\sigma_{y} are small. In particular, if by=σy=0b_{y}=\sigma_{y}=0, then the forward equation does not depend on the backward one and, thus, equations (2.1)(2.2) are decoupled.

  3. 3.

    Weak coupling of X into the backward SDE (2.2), that is, fxf_{x} and gxg_{x} are small. In particular, if fx=gx=0f_{x}=g_{x}=0, then the backward equation does not depend on the forward one and, thus, equations (2.1)(2.2) are also decoupled. In fact, in this case, Z = 0 and (2.2) reduces to an ODE.

  4. 4.

    f is strongly decreasing in y, that is, kfk_{f} is very negative.

  5. 5.

    b is strongly decreasing in x, that is, kbk_{b} is very negative.

The assumptions stated above are usually called weak coupling and monotonicity conditions in literature [16, 33, 34]. To make it more precise, we define

L0=[by+σy][gx+fxT]Te[by+σy][gx+fxT]T+[2kb+2kf+2+σx+fz]T,\displaystyle L_{0}=[b_{y}+\sigma_{y}][g_{x}+f_{x}T]Te^{[b_{y}+\sigma_{y}][g_{x}+f_{x}T]T+[2k_{b}+2k_{f}+2+\sigma_{x}+f_{z}]T},
L1=[gx+fxT][e[by+σy][gx+fxT]T+[2kb+2kf+2+σx+fz]T+11],\displaystyle L_{1}=[g_{x}+f_{x}T][e^{[b_{y}+\sigma_{y}][g_{x}+f_{x}T]T+[2k_{b}+2k_{f}+2+\sigma_{x}+f_{z}]T+1}\vee 1],
Γ0(x)=ex1x,(x>0),\displaystyle\Gamma_{0}(x)=\frac{e^{x}-1}{x},\leavevmode\nobreak\ \leavevmode\nobreak\ (x>0),
Γ1(x,y)=sup0<θ<1θeθxΓ0(y),\displaystyle\Gamma_{1}(x,y)=\sup_{0<\theta<1}\theta e^{\theta x}\Gamma_{0}(y),
c=infλ1>0{[e[2kb+1+σx+[by+σy]L1]T1](1+λ11)[by+σy]T\displaystyle c=\inf_{\lambda_{1}>0}\Big{\{}[e^{[2k_{b}+1+\sigma_{x}+[b_{y}+\sigma_{y}]L_{1}]T}\vee 1](1+\lambda_{1}^{-1})[b_{y}+\sigma_{y}]T
×[gxΓ1([2kf+1+fz]T,[2kb+1+σx+(1+λ1)[by+σy]L1]T)\displaystyle\qquad\qquad\quad\leavevmode\nobreak\ \times[g_{x}\Gamma_{1}([2k_{f}+1+f_{z}]T,[2k_{b}+1+\sigma_{x}+(1+\lambda_{1})[b_{y}+\sigma_{y}]L_{1}]T)
+fxTΓ0([2kf+1+fz]T)\displaystyle\qquad\qquad\leavevmode\nobreak\ +f_{x}T\Gamma_{0}([2k_{f}+1+f_{z}]T)
×Γ0(2kb+1+σx+(1+λ1)[by+σy]L1]T)}.\displaystyle\qquad\qquad\quad\leavevmode\nobreak\ \times\Gamma_{0}(2k_{b}+1+\sigma_{x}+(1+\lambda_{1})[b_{y}+\sigma_{y}]L_{1}]T)\Big{\}}.

Then, a specific quantitative form of the above five conditions can be summarized as:

L0<e1 and c<1.L_{0}<e^{-1}\text{\leavevmode\nobreak\ \leavevmode\nobreak\ and\leavevmode\nobreak\ \leavevmode\nobreak\ }c<1. (3.1)

In other words, if any of the five conditions of the weak coupling and monotonicity conditions holds to certain extent, the two inequalities in (3.1) hold. Below, we refer to  (3.1) as Assumption 3 and the five general qualitative conditions described above as the weak coupling and monotonicity conditions.

The above three assumptions are basic assumptions in [16], which we need in order to use the results from [16], as stated in Theorems 4 and 5 below. Theorem 4 gives the connections between coupled FBSDEs and quasilinear parabolic PDEs under weaker conditions. Theorem 5 provides the convergence of the implicit scheme for coupled FBSDEs. Our work primarily uses the same set of assumptions except that we assume some further quantitative restrictions related to the weak coupling and monotonicity conditions, which will be transparent through the extra constants we define in proofs. Our aim is to provide explicit conditions on which our results hold and more clearly present the relationship between these constants and the error estimates. As will be seen in the proof, roughly speaking, the weaker the coupling (resp., the stronger the monotonicity, the smaller the time horizon) is, the easier the condition is satisfied, and the smaller the constant CC related with error estimates are.

Theorem 4.

Under Assumptions 1, 2, and 3, there exists a function u: ×m\mathbb{R}\times\mathbb{R}^{m}\rightarrow\mathbb{R} that satisfies the following statements.

  1. 1.

    |u(t,x1)u(t,x2)|2L1|x1x2|2|u(t,x_{1})-u(t,x_{2})|^{2}\leq L_{1}|x_{1}-x_{2}|^{2}.

  2. 2.

    |u(s,x)u(t,x)|2C(1+|x|2)|st||u(s,x)-u(t,x)|^{2}\leq C(1+|x|^{2})|s-t| with some constant C depending on \mathscr{L} and TT.

  3. 3.

    u is a viscosity solution of the PDE (2.7).

  4. 4.

    The FBSDEs (2.1)(2.2) have a unique solution (Xt,Yt,Zt)(X_{t},Y_{t},Z_{t}) and Yt=u(t,Xt)Y_{t}=u(t,X_{t}). Thus, (Xt,Yt,Zt)(X_{t},Y_{t},Z_{t}) satisfies decoupled FBSDEs

    Xt\displaystyle X_{t} =ξ+0tb(s,Xs,u(s,Xs))ds+0tσ(s,Xs,u(s,Xs))dWs,\displaystyle=\xi+\int_{0}^{t}b(s,X_{s},u(s,X_{s}))\,\mathrm{d}s+\int_{0}^{t}\sigma(s,X_{s},u(s,X_{s}))\,\mathrm{d}W_{s},
    Yt\displaystyle Y_{t} =g(XT)+tTf(s,Xs,Ys,Zs)dstT(Zs)TdWs.\displaystyle=g(X_{T})+\int_{t}^{T}f(s,X_{s},Y_{s},Z_{s})\,\mathrm{d}s-\int_{t}^{T}(Z_{s})^{\operatorname{T}}\,\mathrm{d}W_{s}.

Furthermore, the solution of the FBSDEs satisfies the path regularity with some constant C depending on \mathscr{L} and T

supt[0,T](E|XtX~t|2+E|YtY~t|2)+0TE|ZtZ~t|2dtC(1+E|ξ|2)h,\sup_{t\in[0,T]}(E|X_{t}-\tilde{X}_{t}|^{2}+E|Y_{t}-\tilde{Y}_{t}|^{2})+\int_{0}^{T}E|Z_{t}-\tilde{Z}_{t}|^{2}\,\mathrm{d}t\leq C(1+E|\xi|^{2})h, (3.2)

in which X~t=Xti\tilde{X}_{t}=X_{t_{i}}, Y~t=Yti\tilde{Y}_{t}=Y_{t_{i}}, Z~t=h1E[titi+1Ztdt|ti]\tilde{Z}_{t}=h^{-1}E[\int_{t_{i}}^{t_{i+1}}Z_{t}\,\mathrm{d}t|\mathcal{F}_{t_{i}}] for t[ti,ti+1)t\in[t_{i},t_{i+1}). If ZtZ_{t} is càdlàg, we can replace h1E[titi+1Ztdt|ti]h^{-1}E[\int_{t_{i}}^{t_{i+1}}Z_{t}\,\mathrm{d}t|\mathcal{F}_{t_{i}}] with ZtiZ_{t_{i}}.

Remark.

Several conditions can guarantee ZtZ_{t} admits a càdlàg version, such as m=dm=d and σσTδI\sigma\sigma^{\operatorname{T}}\geq\delta I with some δ>0\delta>0, see e.g., [35].

Theorem 5.

Under Assumptions 1, 2, and 3, for sufficiently small h, the following discrete-time equation (0iN10\leq i\leq N-1)

{X¯0π=ξ,X¯ti+1π=X¯tiπ+b(ti,X¯tiπ,Y¯tiπ)h+σ(ti,X¯tiπ,Y¯tiπ)ΔWi,Y¯Tπ=g(X¯Tπ),Z¯tiπ=1hE[Y¯ti+1πΔWi|ti],Y¯tiπ=E[Y¯ti+1π+f(ti,X¯tiπ,Y¯tiπ,Z¯tiπ)h|ti],\displaystyle\begin{dcases}\overline{X}_{0}^{\pi}=\xi,\\ \overline{X}_{t_{i+1}}^{\pi}=\overline{X}_{t_{i}}^{\pi}+b(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi})h+\sigma(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi})\Delta W_{i},\\ \overline{Y}_{T}^{\pi}=g(\overline{X}_{T}^{\pi}),\\ \overline{Z}_{t_{i}}^{\pi}=\frac{1}{h}E[\overline{Y}_{t_{i+1}}^{\pi}\Delta W_{i}|\mathcal{F}_{t_{i}}],\\ \overline{Y}_{t_{i}}^{\pi}=E[\overline{Y}_{t_{i+1}}^{\pi}+f(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi},\overline{Z}_{t_{i}}^{\pi})h|\mathcal{F}_{t_{i}}],\end{dcases} (3.3)

has a solution (X¯tiπ,Y¯tiπ,Z¯tiπ)(\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi},\overline{Z}_{t_{i}}^{\pi}) such that X¯tiπL2(Ω,ti,)\overline{X}_{t_{i}}^{\pi}\in L^{2}(\Omega,\mathcal{F}_{t_{i}},\mathbb{P}) and

supt[0,T](E|XtX¯tπ|2+E|YtY¯tπ|2)+0TE|ZtZ¯tπ|2dtC(1+E|ξ|2)h,\sup_{t\in[0,T]}(E|X_{t}-\overline{X}_{t}^{\pi}|^{2}+E|Y_{t}-\overline{Y}_{t}^{\pi}|^{2})+\int_{0}^{T}E|Z_{t}-\overline{Z}_{t}^{\pi}|^{2}\,\mathrm{d}t\leq C(1+E|\xi|^{2})h, (3.4)

where X¯tπ=X¯tiπ\overline{X}_{t}^{\pi}=\overline{X}_{t_{i}}^{\pi}, Y¯tπ=Y¯tiπ\overline{Y}_{t}^{\pi}=\overline{Y}_{t_{i}}^{\pi}, Z¯tπ=Z¯tiπ\overline{Z}_{t}^{\pi}=\overline{Z}_{t_{i}}^{\pi} for t[ti,ti+1)t\in[t_{i},t_{i+1}), and C is a constant depending on \mathscr{L} and T.

Remark.

In [16], the above result (existence and convergence) is proved for the explicit scheme, which is formulated as replacing f(ti,X¯tiπ,Y¯tiπ,Z¯tiπ)f(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi},\overline{Z}_{t_{i}}^{\pi}) with f(ti,X¯tiπ,Y¯ti+1π,Z¯tiπ)f(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i+1}}^{\pi},\overline{Z}_{t_{i}}^{\pi}) in the last equation of (3.3). The same techniques can be used to prove the implicit scheme, as we state in Theorem 5.

Finally, to make sure the system in (2.3) is well-defined, we restrict our parametric function spaces 𝒩0\mathcal{N}^{\prime}_{0} and 𝒩i\mathcal{N}_{i} as in Assumption 4 below. Note that neural networks with common activation functions, including ReLU and sigmoid function, satisfy this assumption. Under Assumption 1 and 4, one can easily prove by induction that {Xtiπ}0iN\{X_{t_{i}}^{\pi}\}_{0\leq i\leq N}, {Ytiπ}0iN\{Y_{t_{i}}^{\pi}\}_{0\leq i\leq N} and {Ztiπ}0iN1\{Z_{t_{i}}^{\pi}\}_{0\leq i\leq N-1} defined in (2.3) are all measurable and square-integrable random variables.

Assumption 4.

𝒩0\mathcal{N}_{0}^{{}^{\prime}} and 𝒩i(0iN1)\mathcal{N}_{i}(0\leq i\leq N-1) are subsets of measurable functions from m\mathbb{R}^{m} to \mathbb{R} and m×\mathbb{R}^{m}\times\mathbb{R} to d\mathbb{R}^{d} with linear growth, namely, μ0π\mu_{0}^{\pi} and {ϕiπ}0iN1\{\phi_{i}^{\pi}\}_{0\leq i\leq N-1} in (2.3) satisfy |μ0π(x)|2A0+B0|x|2|\mu_{0}^{\pi}(x)|^{2}\leq A^{\prime}_{0}+B^{\prime}_{0}|x|^{2} and |ϕiπ(x,y)|2Ai+Bi|x|2+Ci|y|2|\phi_{i}^{\pi}(x,y)|^{2}\leq A_{i}+B_{i}|x|^{2}+C_{i}|y|^{2} for 0iN10\leq i\leq N-1.

4 A Posteriori Estimation of the Simulation Error

We prove Theorem 1 in this section. Comparing the statements of Theorem 1 and Theorem 5, we wish to bound the differences between (Xtiπ,Ytiπ,Ztiπ)(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi}) and (X¯tiπ,Y¯tiπ,Z¯tiπ)(\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi},\overline{Z}_{t_{i}}^{\pi}) with the objective function E|g(XTπ)YTπ|2E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}. Recalling the definition of the system of equations (2.3), we have

Xti+1π\displaystyle X_{t_{i+1}}^{\pi} =Xtiπ+b(ti,Xtiπ,Ytiπ)h+σ(ti,Xtiπ,Ytiπ)ΔWi,\displaystyle=X_{t_{i}}^{\pi}+b(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})h+\sigma(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})\Delta W_{i}, (4.1)
Yti+1π\displaystyle Y_{t_{i+1}}^{\pi} =Ytiπf(ti,Xtiπ,Ytiπ,Ztiπ)h+(Ztiπ)TΔWi.\displaystyle=Y_{t_{i}}^{\pi}-f(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi})h+(Z_{t_{i}}^{\pi})^{\operatorname{T}}\Delta W_{i}. (4.2)

Taking the expectation E[|ti]E[\cdot|\mathcal{F}_{t_{i}}] on both sides of (4.2), we obtain

Ytiπ=E[Yti+1π+f(ti,Xtiπ,Ytiπ,Ztiπ)h|ti].Y_{t_{i}}^{\pi}=E[Y_{t_{i+1}}^{\pi}+f(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi})h|\mathcal{F}_{t_{i}}].

Right multiplying (ΔWi)T(\Delta W_{i})^{\operatorname{T}} on both sides of (4.2) and taking the expectation E[|ti]E[\cdot|\mathcal{F}_{t_{i}}] again, we obtain

Ztiπ=1h[Yti+1πΔWi|ti].Z_{t_{i}}^{\pi}=\frac{1}{h}[Y_{t_{i+1}}^{\pi}\Delta W_{i}|\mathcal{F}_{t_{i}}].

The above observation motivates us to consider the following system of equations

{X0π=ξ,Xti+1π=Xtiπ+b(ti,Xtiπ,Ytiπ)h+σ(ti,Xtiπ,Ytiπ)ΔWi,Ztiπ=1hE[Yti+1πΔWi|ti],Ytiπ=E[Yti+1π+f(ti,Xtiπ,Ytiπ,Ztiπ)h|ti].\displaystyle\begin{dcases}X_{0}^{\pi}=\xi,\\ X_{t_{i+1}}^{\pi}=X_{t_{i}}^{\pi}+b(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})h+\sigma(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})\Delta W_{i},\\ Z_{t_{i}}^{\pi}=\frac{1}{h}E[Y_{t_{i+1}}^{\pi}\Delta W_{i}|\mathcal{F}_{t_{i}}],\\ Y_{t_{i}}^{\pi}=E[{Y}_{t_{i+1}}^{\pi}+f(t_{i},{X}_{t_{i}}^{\pi},{Y}_{t_{i}}^{\pi},{Z}_{t_{i}}^{\pi})h|\mathcal{F}_{t_{i}}].\end{dcases} (4.3)

Note that (4.3) is defined just like the FBSDEs (2.1)(2.2), where the XX component is defined forwardly and the Y,ZY,Z components are defined backwardly. However, since we do not specify the terminal condition of YTπY_{T}^{\pi}, the system of equations (4.3) has infinitely many solutions. The following lemma gives an estimate of the difference between two such solutions.

Lemma 1.

For j=1,2j=1,2, suppose ({Xtiπ,j}0iN,{Ytiπ,j}0iN,{Ztiπ,j}0iN1)(\{X_{t_{i}}^{\pi,j}\}_{0\leq i\leq N},\{Y_{t_{i}}^{\pi,j}\}_{0\leq i\leq N},\{Z_{t_{i}}^{\pi,j}\}_{0\leq i\leq N-1}) are two solutions of (4.3), with Xtiπ,j,Ytiπ,jL2(Ω,ti,)X_{t_{i}}^{\pi,j},Y_{t_{i}}^{\pi,j}\in L^{2}(\Omega,\mathcal{F}_{t_{i}},\mathbb{P}), 0iN0\leq i\leq N. For any λ1>0,λ2fz\lambda_{1}>0,\lambda_{2}\geq f_{z}, and sufficiently small h, denote

A1\displaystyle A_{1} 2kb+λ1+σx+Kh,\displaystyle\coloneqq 2k_{b}+\lambda_{1}+\sigma_{x}+Kh, (4.4)
A2\displaystyle A_{2} (λ11+h)by+σy,\displaystyle\coloneqq(\lambda_{1}^{-1}+h)b_{y}+\sigma_{y},
A3\displaystyle A_{3} ln[1(2kf+λ2)h]h,\displaystyle\coloneqq-\frac{\ln[1-(2k_{f}+\lambda_{2})h]}{h},
A4\displaystyle A_{4} fx[1(2kf+λ2)h]λ2.\displaystyle\coloneqq\frac{f_{x}}{[1-(2k_{f}+\lambda_{2})h]\lambda_{2}}.

Let δXi=Xtiπ,1Xtiπ,2,δYi=Ytiπ,1Ytiπ,2\delta X_{i}=X_{t_{i}}^{\pi,1}-X_{t_{i}}^{\pi,2},\delta Y_{i}=Y_{t_{i}}^{\pi,1}-Y_{t_{i}}^{\pi,2}, then we have, for 0nN0\leq n\leq N,

E|δXn|2\displaystyle E|\delta X_{n}|^{2} A2i=0n1eA1(ni1)hE|δYi|2h,\displaystyle\leq A_{2}\sum_{i=0}^{n-1}e^{A_{1}(n-i-1)h}E|\delta Y_{i}|^{2}h,
E|δYn|2\displaystyle E|\delta Y_{n}|^{2} eA3(Nn)hE|δYN|2+A4i=nN1eA3(in)hE|δXi|2h.\displaystyle\leq e^{A_{3}(N-n)h}E|\delta Y_{N}|^{2}+A_{4}\sum_{i=n}^{N-1}e^{A_{3}(i-n)h}E|\delta X_{i}|^{2}h.

To prove Lemma 1, we need the following lemma to handle the ZZ component.

Lemma 2.

Let 0s1<s20\leq s_{1}<s_{2}, given QL2(Ω,s2,)Q\in L^{2}(\Omega,\mathcal{F}_{s_{2}},\mathbb{P}), by the martingale representation theorem, there exists an t\mathcal{F}_{t}-adapted process {Hs}s1ss2\{H_{s}\}_{s_{1}\leq s\leq s_{2}} such that s1s2E|Hs|2ds<\int_{s_{1}}^{s_{2}}E|H_{s}|^{2}\,\mathrm{d}s<\infty and Q=E[Q|s1]+s1s2HsdWsQ=E[Q|\mathcal{F}_{s_{1}}]+\int_{s_{1}}^{s_{2}}H_{s}\,\mathrm{d}W_{s}. Then we have E[Q(Ws2Ws1)|s1]=E[s1s2Hsds|s1]E[Q(W_{s_{2}}-W_{s_{1}})|\mathcal{F}_{s_{1}}]=E[\int_{s_{1}}^{s_{2}}H_{s}\,\mathrm{d}s|\mathcal{F}_{s_{1}}].

Proof.

Consider the auxiliary stochastic process Qs=(E[Q|s1]+s1sHtdWt)(WsWs1)Q_{s}=(E[Q|\mathcal{F}_{s_{1}}]+\int_{s_{1}}^{s}H_{t}\,\mathrm{d}W_{t})(W_{s}-W_{s_{1}}) for s[s1,s2]s\in[s_{1},s_{2}]. By Itô’s formula,

dQs=(WsWs1)HsdWs+(E[Q|s1]+s1sHtdWt)dWs+Hsds.\mathrm{d}Q_{s}=(W_{s}-W_{s_{1}})H_{s}\,\mathrm{d}W_{s}+(E[Q|\mathcal{F}_{s_{1}}]+\int_{s_{1}}^{s}H_{t}\,\mathrm{d}W_{t})\,\mathrm{d}W_{s}+H_{s}\,\mathrm{d}s.

Noting that Qs1=0Q_{s_{1}}=0, we have

E[Q(Ws2Ws1)|s1]=E[Qs2|s1]=E[s1s2Hsds|s1].E[Q(W_{s_{2}}-W_{s_{1}})|\mathcal{F}_{s_{1}}]=E[Q_{s_{2}}|\mathcal{F}_{s_{1}}]=E[\int_{s_{1}}^{s_{2}}H_{s}\,\mathrm{d}s|\mathcal{F}_{s_{1}}].

Proof of Lemma 1.

Let

δZi\displaystyle\delta Z_{i} =Ztiπ,1Ztiπ,2,\displaystyle=Z_{t_{i}}^{\pi,1}-Z_{t_{i}}^{\pi,2},
δbi\displaystyle\delta b_{i} =b(ti,Xtiπ,1,Ytiπ,1)b(ti,Xtiπ,2,Ytiπ,2),\displaystyle=b(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1})-b(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2}),
δσi\displaystyle\delta\sigma_{i} =σ(ti,Xtiπ,1,Ytiπ,1)σ(ti,Xtiπ,2,Ytiπ,2),\displaystyle=\sigma(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1})-\sigma(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2}),
δfi\displaystyle\delta f_{i} =f(ti,Xtiπ,1,Ytiπ,1,Ztiπ,1)f(ti,Xtiπ,2,Ytiπ,2,Ztiπ,2).\displaystyle=f(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1},Z_{t_{i}}^{\pi,1})-f(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2},Z_{t_{i}}^{\pi,2}).

Then we have

δXi+1\displaystyle\delta X_{i+1} =δXi+δbih+δσiΔWi,\displaystyle=\delta X_{i}+\delta b_{i}h+\delta\sigma_{i}\Delta W_{i}, (4.5)
δZi\displaystyle\delta Z_{i} =1hE[δYi+1ΔWi|ti],\displaystyle=\frac{1}{h}E[\delta Y_{i+1}\Delta W_{i}|\mathcal{F}_{t_{i}}], (4.6)
δYi\displaystyle\delta Y_{i} =E[δYi+1+δfih|ti].\displaystyle=E[\delta Y_{i+1}+\delta f_{i}h|\mathcal{F}_{t_{i}}]. (4.7)

By the martingale representation theorem, there exists an t\mathscr{F}_{t}-adapted square-integrable process {δZt}titti+1\{\delta Z_{t}\}_{t_{i}\leq t\leq t_{i+1}} such that

δYi+1=E[δYi+1|ti]+titi+1(δZt)TdWt,\delta Y_{i+1}=E[\delta Y_{i+1}|\mathcal{F}_{t_{i}}]+\int_{t_{i}}^{t_{i+1}}(\delta Z_{t})^{\operatorname{T}}\,\mathrm{d}W_{t},

or, equivalently,

δYi+1\displaystyle\delta Y_{i+1} =δYiδfih+titi+1(δZt)TdWt.\displaystyle=\delta Y_{i}-\delta f_{i}h+\int_{t_{i}}^{t_{i+1}}(\delta Z_{t})^{\operatorname{T}}\,\mathrm{d}W_{t}. (4.8)

From equations (4.5) and (4.8), noting that δXi\delta X_{i}, δYi\delta Y_{i}, δbi\delta b_{i}, δσi\delta\sigma_{i}, and δfi\delta f_{i} are all ti\mathcal{F}_{t_{i}}-measurable, and E[ΔWi|ti]=0E[\Delta W_{i}|\mathcal{F}_{t_{i}}]=0, E[titi+1(δZt)TdWt|ti]=0E[\int_{t_{i}}^{t_{i+1}}(\delta Z_{t})^{\operatorname{T}}\,\mathrm{d}W_{t}|\mathcal{F}_{t_{i}}]=0, we have

E|δXi+1|2\displaystyle E|\delta X_{i+1}|^{2} =E|δXi+δbih|2+E[(ΔWi)T(δσi)TδσiΔWi]\displaystyle=E|\delta X_{i}+\delta b_{i}h|^{2}+E[(\Delta W_{i})^{\operatorname{T}}(\delta\sigma_{i})^{\operatorname{T}}\delta\sigma_{i}\Delta W_{i}]
=E|δXi+δbih|2+hEδσi2,\displaystyle=E|\delta X_{i}+\delta b_{i}h|^{2}+hE\|\delta\sigma_{i}\|^{2}, (4.9)
E|δYi+1|2\displaystyle E|\delta Y_{i+1}|^{2} =E|δYiδfih|2+titi+1E|δZt|2dt.\displaystyle=E|\delta Y_{i}-\delta f_{i}h|^{2}+\int_{t_{i}}^{t_{i+1}}E|\delta Z_{t}|^{2}\,\mathrm{d}t. (4.10)

From equation (4.9), by Assumptions 1, 2 and the root-mean square and geometric mean inequality (RMS-GM inequality), for any λ1>0\lambda_{1}>0, we have

E|δXi+1|2\displaystyle E|\delta X_{i+1}|^{2}
=\displaystyle=\leavevmode\nobreak\ E|δXi|2+E|δbi|2h2+hEδσi2\displaystyle E|\delta X_{i}|^{2}+E|\delta b_{i}|^{2}h^{2}+hE\|\delta\sigma_{i}\|^{2}
+2hE[(b(ti,Xtiπ,1,Ytiπ,1)b(ti,Xtiπ,2,Ytiπ,1))TδXi]\displaystyle+2hE[(b(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1})-b(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,1}))^{\operatorname{T}}\delta X_{i}]
+2hE[(b(ti,Xtiπ,2,Ytiπ,1)b(ti,Xtiπ,2,Ytiπ,2))TδXi]\displaystyle+2hE[(b(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,1})-b(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2}))^{\operatorname{T}}\delta X_{i}]
\displaystyle\leq\leavevmode\nobreak\ E|δXi|2+(KE|δXi|2+byE|δYi|2)h2+2kbhE|δXi|2\displaystyle E|\delta X_{i}|^{2}+(KE|\delta X_{i}|^{2}+b_{y}E|\delta Y_{i}|^{2})h^{2}+2k_{b}hE|\delta X_{i}|^{2}
+(λ1E|δXi|2+λ11byE|δYi|2)h+(σxE|δXi|2+σyE|δYi|2)h\displaystyle+(\lambda_{1}E|\delta X_{i}|^{2}+\lambda_{1}^{-1}b_{y}E|\delta Y_{i}|^{2})h+(\sigma_{x}E|\delta X_{i}|^{2}+\sigma_{y}E|\delta Y_{i}|^{2})h
=\displaystyle=\leavevmode\nobreak\ [1+(2kb+λ1+σx+Kh)h]E|δXi|2+[(λ11+h)by+σy]E|δYi|2h.\displaystyle[1+(2k_{b}+\lambda_{1}+\sigma_{x}+Kh)h]E|\delta X_{i}|^{2}+[(\lambda_{1}^{-1}+h)b_{y}+\sigma_{y}]E|\delta Y_{i}|^{2}h.

Recall A1=2kb+λ1+σx+KhA_{1}=2k_{b}+\lambda_{1}+\sigma_{x}+Kh, A2=(λ11+h)by+σyA_{2}=(\lambda_{1}^{-1}+h)b_{y}+\sigma_{y}, E|δX0|2=0E|\delta X_{0}|^{2}=0. By induction we can obtain that, for 0nN0\leq n\leq N,

E|δXn|2A2i=0n1eA1(ni1)hE|δYi|2h.E|\delta X_{n}|^{2}\leq A_{2}\sum_{i=0}^{n-1}e^{A_{1}(n-i-1)h}E|\delta Y_{i}|^{2}h.

Similarly, from equation (4.10), for any λ2>0\lambda_{2}>0, we have

E|δYi+1|2\displaystyle E|\delta Y_{i+1}|^{2}
\displaystyle\geq\leavevmode\nobreak\ E|δYi|2+titi+1E|δZt|2dt\displaystyle E|\delta Y_{i}|^{2}+\int_{t_{i}}^{t_{i+1}}E|\delta Z_{t}|^{2}\,\mathrm{d}t
2hE[(f(ti,Xi1,π,Yi1,π,Zi1,π)f(ti,Xi1,π,Yi2,π,Zi1,π))TδYi]\displaystyle-2hE[(f(t_{i},X_{i}^{1,\pi},Y_{i}^{1,\pi},Z_{i}^{1,\pi})-f(t_{i},X_{i}^{1,\pi},Y_{i}^{2,\pi},Z_{i}^{1,\pi}))^{\operatorname{T}}\delta Y_{i}]
2hE[(f(ti,Xi1,π,Yi2,π,Zi1,π)f(ti,Xi2,π,Yi2,π,Zi2,π))TδYi]\displaystyle-2hE[(f(t_{i},X_{i}^{1,\pi},Y_{i}^{2,\pi},Z_{i}^{1,\pi})-f(t_{i},X_{i}^{2,\pi},Y_{i}^{2,\pi},Z_{i}^{2,\pi}))^{\operatorname{T}}\delta Y_{i}]
\displaystyle\geq\leavevmode\nobreak\ E|δYi|2+titi+1E|δZt|2dt2kfhE|δYi|2\displaystyle E|\delta Y_{i}|^{2}+\int_{t_{i}}^{t_{i+1}}E|\delta Z_{t}|^{2}\,\mathrm{d}t-2k_{f}hE|\delta Y_{i}|^{2}
[λ2E|δYi|2+λ21(fxE|δXi|2+fzE|δZi|2)]h.\displaystyle-[\lambda_{2}E|\delta Y_{i}|^{2}+\lambda_{2}^{-1}(f_{x}E|\delta X_{i}|^{2}+f_{z}E|\delta Z_{i}|^{2})]h. (4.11)

To deal with the integral term in (4.11), we apply Lemma 2 to (4.6)(4.8) and get

δZi\displaystyle\delta Z_{i} =1hE[titi+1δZtdt|ti],\displaystyle=\frac{1}{h}E[\int_{t_{i}}^{t_{i+1}}\delta Z_{t}\,\mathrm{d}t|\mathcal{F}_{t_{i}}],

which implies, by the Cauchy inequality,

E|δZi|2h\displaystyle E|\delta Z_{i}|^{2}h =k=1dE|(δZi)k|2h=k=1d1hE|E[titi+1(δZt)kdt|ti]|2\displaystyle=\sum_{k=1}^{d}E|(\delta Z_{i})_{k}|^{2}h=\sum_{k=1}^{d}\frac{1}{h}E\Big{|}E[\int_{t_{i}}^{t_{i+1}}(\delta Z_{t})_{k}\,\mathrm{d}t|\mathcal{F}_{t_{i}}]\Big{|}^{2}
k=1d1hE|titi+1(δZt)kdt|2k=1dtiti+1E|(δZt)k|2dt\displaystyle\leq\sum_{k=1}^{d}\frac{1}{h}E\Big{|}\int_{t_{i}}^{t_{i+1}}(\delta Z_{t})_{k}\,\mathrm{d}t\Big{|}^{2}\leq\sum_{k=1}^{d}\int_{t_{i}}^{t_{i+1}}E|(\delta Z_{t})_{k}|^{2}\,\mathrm{d}t
=titi+1E|δZt|2dt,\displaystyle=\int_{t_{i}}^{t_{i+1}}E|\delta Z_{t}|^{2}\,\mathrm{d}t,

where ()k(\cdot)_{k} denotes the kk-th component of the vector. Plugging it into (4.11) gives us

E|δYi+1|2[1(2kf+λ2)h]E|δYi|2+(1fzλ21)E|δZi|2hfxλ21E|δXi|2h.\displaystyle E|\delta Y_{i+1}|^{2}\geq[1-(2k_{f}+\lambda_{2})h]E|\delta Y_{i}|^{2}+(1-f_{z}\lambda_{2}^{-1})E|\delta Z_{i}|^{2}h-f_{x}\lambda_{2}^{-1}E|\delta X_{i}|^{2}h. (4.12)

Then for any λ2fz\lambda_{2}\geq f_{z} and sufficiently small hh satisfying (2kf+λ2)h<1(2k_{f}+\lambda_{2})h<1, we have

E|δYi|2[1(2kf+λ2)h]1[E|δYi+1|2+fxλ21E|δXi|2h].E|\delta Y_{i}|^{2}\leq[1-(2k_{f}+\lambda_{2})h]^{-1}[E|\delta Y_{i+1}|^{2}+f_{x}\lambda_{2}^{-1}E|\delta X_{i}|^{2}h].

Recall A3=h1ln[1(2kf+λ2)h]A_{3}=-h^{-1}\ln[1-(2k_{f}+\lambda_{2})h], A4=fxλ21[1(2kf+λ2)h]1A_{4}=f_{x}\lambda_{2}^{-1}[1-(2k_{f}+\lambda_{2})h]^{-1}. By induction we obtain that, for 0nN0\leq n\leq N,

E|δYn|2eA3(Nn)hE|δYN|2+A4i=nN1eA3(in)hE|δXi|2h.E|\delta Y_{n}|^{2}\leq e^{A_{3}(N-n)h}E|\delta Y_{N}|^{2}+A_{4}\sum_{i=n}^{N-1}e^{A_{3}(i-n)h}E|\delta X_{i}|^{2}h.

Now we are ready to prove Theorem 1, whose precise statement is given below. Note that its conditions are satisfied if any of the five cases in the weak coupling and monotonicity conditions holds.

Theorem 1.

Suppose Assumptions 1, 2, 3, and 4 hold true and there exist λ1>0,λ2fz\lambda_{1}>0,\lambda_{2}\geq f_{z} such that A0¯<1\overline{A_{0}}<1, where

A1¯2kb+λ1+σx,\displaystyle\overline{A_{1}}\coloneqq 2k_{b}+\lambda_{1}+\sigma_{x}, (4.13)
A2¯byλ11+σy,\displaystyle\overline{A_{2}}\coloneqq b_{y}\lambda_{1}^{-1}+\sigma_{y},
A3¯2kf+λ2,\displaystyle\overline{A_{3}}\coloneqq 2k_{f}+\lambda_{2},
A4¯fxλ21,\displaystyle\overline{A_{4}}\coloneqq f_{x}\lambda_{2}^{-1},
A0¯A2¯1e(A1¯+A3¯)TA1¯+A3¯{gxe(A1¯+A3¯)T+A4¯e(A1¯+A3¯)T1A1¯+A3¯}.\displaystyle\overline{A_{0}}\coloneqq\overline{A_{2}}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}\Big{\{}g_{x}e^{(\overline{A_{1}}+\overline{A_{3}})T}+\overline{A_{4}}\frac{e^{(\overline{A_{1}}+\overline{A_{3}})T}-1}{\overline{A_{1}}+\overline{A_{3}}}\Big{\}}.

Then there exists a constant C>0C>0, depending on E|ξ|2E|\xi|^{2}, \mathscr{L}, TT, λ1\lambda_{1}, and λ2\lambda_{2}, such that for sufficiently small hh,

supt[0,T](E|XtX^tπ|2+E|YtY^tπ|2)+0TE|ZtZ^tπ|2dtC[h+E|g(XTπ)YTπ|2],\sup_{t\in[0,T]}(E|X_{t}-\hat{X}_{t}^{\pi}|^{2}+E|Y_{t}-\hat{Y}_{t}^{\pi}|^{2})+\int_{0}^{T}E|Z_{t}-\hat{Z}_{t}^{\pi}|^{2}\,\mathrm{d}t\leq C[h+E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}], (4.14)

where X^tπ=Xtiπ\hat{X}_{t}^{\pi}=X_{t_{i}}^{\pi}, Y^tπ=Ytiπ\hat{Y}_{t}^{\pi}=Y_{t_{i}}^{\pi}, Z^tπ=Ztiπ\hat{Z}_{t}^{\pi}=Z_{t_{i}}^{\pi} for t[ti,ti+1)t\in[t_{i},t_{i+1}).

Remark.

The above theorem also implies the coercivity of the objective function (2.4) used in the deep BSDE method. Formally speaking, the coercivity means that if i=0N1E|Ztiπ|2+E|Y0π|2+\sum_{i=0}^{N-1}E|Z_{t_{i}}^{\pi}|^{2}+E|Y_{0}^{\pi}|^{2}\rightarrow+\infty, we have E|g(XTπ)YTπ|2+E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\rightarrow+\infty, which is a direct result from Theorem 1.

Remark.

If any of the weak coupling and monotonicity conditions introduced in Assumption 3 holds to a sufficient extent, there must exist λ1,λ2\lambda_{1},\lambda_{2} satisfying the conditions in Theorem 1. We discuss the 5 cases in what follows.

  1. 1.

    Suppose all other constants and λ1>0,λ2fz\lambda_{1}>0,\lambda_{2}\geq f_{z} are fixed, if T>0T>0 is sufficiently small, then the second factor of A0¯\overline{A_{0}} could be sufficiently close to 0 such that A0¯<1\overline{A_{0}}<1.

  2. 2.

    Suppose all other constants and λ1>0,λ2fz\lambda_{1}>0,\lambda_{2}\geq f_{z} are fixed, if by0b_{y}\geq 0 and σy0\sigma_{y}\geq 0 are sufficiently small, then A2¯0\overline{A_{2}}\geq 0 could be sufficiently small such that A0¯<1\overline{A_{0}}<1.

  3. 3.

    Suppose all other constants and λ1>0,λ2fz\lambda_{1}>0,\lambda_{2}\geq f_{z} are fixed, if fx0f_{x}\geq 0 and gx0g_{x}\geq 0 are sufficiently small, then A4¯\overline{A_{4}} and thus the last factor in A0¯\overline{A_{0}} could be sufficiently close to 0 such that A0¯<1\overline{A_{0}}<1.

  4. 4.

    Suppose all constants except kfk_{f} and λ2>0\lambda_{2}>0 are fixed. Let A1¯A1¯+A3¯=2kb+2kf+σx+λ1+λ2\overline{A_{1}}^{\prime}\coloneqq\overline{A_{1}}+\overline{A_{3}}=2k_{b}+2k_{f}+\sigma_{x}+\lambda_{1}+\lambda_{2} and rewrite A0¯\overline{A_{0}} as

    A0¯=A2¯{gxeA1¯T1A1¯+A4¯eA1¯T+eA1¯T2(A1¯)2}.\displaystyle\overline{A_{0}}=\overline{A_{2}}\Big{\{}g_{x}\frac{e^{\overline{A_{1}}^{\prime}T}-1}{\overline{A_{1}}^{\prime}}+\overline{A_{4}}\frac{e^{\overline{A_{1}}^{\prime}T}+e^{-\overline{A_{1}}^{\prime}T}-2}{(\overline{A_{1}}^{\prime})^{2}}\Big{\}}.

    It is straightforward to check that there exists a negative constant C1C_{1} such that when A1¯C1\overline{A_{1}}^{\prime}\leq C_{1}, (eA1¯T1)/A1¯<1/(2A2¯gx)(e^{\overline{A_{1}}^{\prime}T}-1)/\overline{A_{1}}^{\prime}<1/(2\overline{A_{2}}g_{x}). By the definition of A1¯\overline{A_{1}}^{\prime}, if kfk_{f} is sufficiently negative, there exists λ2fx\lambda_{2}\geq f_{x} such that A1¯=C1\overline{A_{1}}^{\prime}=C_{1} and λ2\lambda_{2} is sufficiently large to ensure

    A2¯A4¯eC1T+eC1T2C12=fxA2¯(eC1T+eC1T2)λ2C12<12.\displaystyle\overline{A_{2}}\,\overline{A_{4}}\frac{e^{C_{1}T}+e^{-C_{1}T}-2}{C_{1}^{2}}=\frac{f_{x}\overline{A_{2}}(e^{C_{1}T}+e^{-C_{1}T}-2)}{\lambda_{2}C_{1}^{2}}<\frac{1}{2}.

    Combining these two estimates gives A0¯<1\overline{A_{0}}<1.

  5. 5.

    Noting that kbk_{b} and kfk_{f} play the same role in A1¯\overline{A_{1}}^{\prime}, we use the same argument as above to show that when kbk_{b} is sufficiently negative, there exists λ2fx\lambda_{2}\geq f_{x} such that A0¯<1\overline{A_{0}}<1.

Proof of Theorem 1.

From the proof of this theorem and throughout the remainder of the paper, we use CC to generally denote a constant that only depends on E|ξ|2E|\xi|^{2}, \mathscr{L}, and TT, whose value may change from line to line when there is no need to distinguish. We also use C()C(\cdot) to generally denote a constant depending on E|ξ|2E|\xi|^{2}, \mathscr{L}, TT and the constants represented by \cdot.

We use the same notations as Lemma 1. Let Xtiπ,1=XtiπX_{t_{i}}^{\pi,1}=X_{t_{i}}^{\pi}, Ytiπ,1=Ytiπ,Ztiπ,1=ZtiπY_{t_{i}}^{\pi,1}=Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi,1}=Z_{t_{i}}^{\pi} (defined in system (2.3)) and Xtiπ,2=X¯tiπX_{t_{i}}^{\pi,2}=\overline{X}_{t_{i}}^{\pi}, Ytiπ,2=Y¯tiπY_{t_{i}}^{\pi,2}=\overline{Y}_{t_{i}}^{\pi}, Ztiπ,2=Z¯tiπZ_{t_{i}}^{\pi,2}=\overline{Z}_{t_{i}}^{\pi} (defined in system (3.3)). It can be easily checked that both ({Xtiπ,j}0iN(\{X_{t_{i}}^{\pi,j}\}_{0\leq i\leq N}, {Ytiπ,j}0iN\{Y_{t_{i}}^{\pi,j}\}_{0\leq i\leq N}, {Ztiπ,j}0iN1)\{Z_{t_{i}}^{\pi,j}\}_{0\leq i\leq N-1}), j=1,2j=1,2 satisfy the system of equations (4.3). Our proof strategy is to use Lemma 1 to bound the difference between two solutions through the objective function E|g(XTπ)YTπ|2E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}. This allows us to apply Theorem 5 to derive the desired estimates.

To begin with, note that for any λ3>0\lambda_{3}>0, the RMS-GM inequality yields

E|δYN|2=E|g(X¯Tπ)YTπ|2(1+λ31)E|g(XTπ)YTπ|2+gx(1+λ3)E|δXN|2.E|\delta Y_{N}|^{2}=E|g(\overline{X}_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq(1+\lambda_{3}^{-1})E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}+g_{x}(1+\lambda_{3})E|\delta X_{N}|^{2}.

Let

P=max0nNeA1nhE|δXn|2,S=max0nNeA3nhE|δYn|2.\displaystyle P=\displaystyle{\max_{0\leq n\leq N}e^{-A_{1}nh}E|\delta X_{n}|^{2}},\quad S=\displaystyle{\max_{0\leq n\leq N}e^{A_{3}nh}E|\delta Y_{n}|^{2}}.

Lemma 1 tells us

eA1nhE|δXn|2A2i=0n1eA1(i+1)hE|δYi|2hA2Si=0n1eA1(i+1)hA3ihh,e^{-A_{1}nh}E|\delta X_{n}|^{2}\leq A_{2}\sum_{i=0}^{n-1}e^{-A_{1}(i+1)h}E|\delta Y_{i}|^{2}h\leq A_{2}S\sum_{i=0}^{n-1}e^{-A_{1}(i+1)h-A_{3}ih}h,

and

eA3nhE|δYn|2\displaystyle e^{A_{3}nh}E|\delta Y_{n}|^{2}
\displaystyle\leq eA3TE|δYN|2+A4i=nN1eA3ihE|δXi|2h\displaystyle e^{A_{3}T}E|\delta Y_{N}|^{2}+A_{4}\sum_{i=n}^{N-1}e^{A_{3}ih}E|\delta X_{i}|^{2}h
\displaystyle\leq eA3T[(1+λ31)E|g(XTπ)YTπ|2+gx(1+λ3)E|δXN|2]+A4i=nN1eA3ihE|δXi|2h\displaystyle e^{A_{3}T}[(1+\lambda_{3}^{-1})E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}+g_{x}(1+\lambda_{3})E|\delta X_{N}|^{2}]+A_{4}\sum_{i=n}^{N-1}e^{A_{3}ih}E|\delta X_{i}|^{2}h
\displaystyle\leq eA3T(1+λ31)E|g(XTπ)YTπ|2+[gx(1+λ3)e(A1+A3)T+A4i=nN1e(A1+A3)ihh]P.\displaystyle e^{A_{3}T}(1+\lambda_{3}^{-1})E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}+[g_{x}(1+\lambda_{3})e^{(A_{1}+A_{3})T}+A_{4}\sum_{i=n}^{N-1}e^{(A_{1}+A_{3})ih}h]P.

Therefore by definition of PP and SS, we have

P\displaystyle P A2heA1he(A1+A3)T1e(A1+A3)h1S,\displaystyle\leq A_{2}he^{-A_{1}h}\frac{e^{-(A_{1}+A_{3})T}-1}{e^{-(A_{1}+A_{3})h}-1}S,
S\displaystyle S eA3T(1+λ31)E|g(XTπ)YTπ|2+[gx(1+λ3)e(A1+A3)T+A4he(A1+A3)T1e(A1+A3)h1]P.\displaystyle\leq e^{A_{3}T}(1+\lambda_{3}^{-1})E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}+[g_{x}(1+\lambda_{3})e^{(A_{1}+A_{3})T}+A_{4}h\frac{e^{(A_{1}+A_{3})T}-1}{e^{(A_{1}+A_{3})h}-1}]P.

Consider the function

A(h)\displaystyle A(h) =A2heA1he(A1+A3)T1e(A1+A3)h1[gx(1+λ3)e(A1+A3)T+A4he(A1+A3)T1e(A1+A3)h1].\displaystyle=A_{2}he^{-A_{1}h}\frac{e^{-(A_{1}+A_{3})T}-1}{e^{-(A_{1}+A_{3})h}-1}[g_{x}(1+\lambda_{3})e^{(A_{1}+A_{3})T}+A_{4}h\frac{e^{(A_{1}+A_{3})T}-1}{e^{(A_{1}+A_{3})h}-1}].

When A(h)<1A(h)<1, we have

P\displaystyle P [1A(h)]1eA3T(1+λ31)A2heA1he(A1+A3)T1e(A1+A3)h1E|g(XTπ)YTπ|2,\displaystyle\leq[1-A(h)]^{-1}e^{A_{3}T}(1+\lambda_{3}^{-1})A_{2}he^{-A_{1}h}\frac{e^{-(A_{1}+A_{3})T}-1}{e^{-(A_{1}+A_{3})h}-1}E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2},
S\displaystyle S [1A(h)]1eA3T(1+λ31)E|g(XTπ)YTπ|2.\displaystyle\leq[1-A(h)]^{-1}e^{A_{3}T}(1+\lambda_{3}^{-1})E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}.

Let

P¯=max0nNeA1¯nhE|δXn|2,S¯=max0nNeA3¯nhE|δYn|2.\overline{P}=\max_{0\leq n\leq N}e^{-\overline{A_{1}}nh}E|\delta X_{n}|^{2},\quad\overline{S}=\max_{0\leq n\leq N}e^{\overline{A_{3}}nh}E|\delta Y_{n}|^{2}. (4.15)

Recall

limh0Ai=Ai¯,i=1,2,3,4,\displaystyle\lim_{h\rightarrow 0}A_{i}=\overline{A_{i}},\quad i=1,2,3,4,

and note that

limh0A(h)=A2¯1e(A1¯+A3¯)TA1¯+A3¯[gx(1+λ3)e(A1¯+A3¯)T+A4¯e(A1¯+A3¯)T1A1¯+A3¯].\displaystyle\lim_{h\rightarrow 0}A(h)=\overline{A_{2}}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}[g_{x}(1+\lambda_{3})e^{(\overline{A_{1}}+\overline{A_{3}})T}+\overline{A_{4}}\frac{e^{(\overline{A_{1}}+\overline{A_{3}})T}-1}{\overline{A_{1}}+\overline{A_{3}}}].

When A0¯<1\overline{A_{0}}<1, comparing limh0A(h)\lim_{h\rightarrow 0}A(h) and A0¯\overline{A_{0}}, we know that, for any ϵ>0\epsilon>0, there exists λ3>0\lambda_{3}>0 and sufficiently small hh such that

P¯\displaystyle\overline{P} (1+ϵ)[1A0¯]1A2¯eA3¯T(1+λ31)1e(A1¯+A3¯)TA1¯+A3¯E|g(XTπ)YTπ|2,\displaystyle\leq(1+\epsilon)[1-\overline{A_{0}}]^{-1}\overline{A_{2}}e^{\overline{A_{3}}T}(1+\lambda_{3}^{-1})\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}, (4.16)
S¯\displaystyle\overline{S} (1+ϵ)[1A0¯]1eA3¯T(1+λ31)E|g(XTπ)YTπ|2.\displaystyle\leq(1+\epsilon)[1-\overline{A_{0}}]^{-1}e^{\overline{A_{3}}T}(1+\lambda_{3}^{-1})E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}. (4.17)

By fixing ϵ=1\epsilon=1 and choosing suitable λ3\lambda_{3}, we obtain our error estimates of E|δXn|2E|\delta X_{n}|^{2} and E|δYn|2E|\delta Y_{n}|^{2} as

max0nNE|δXn|2\displaystyle\max_{0\leq n\leq N}E|\delta X_{n}|^{2} eA1¯T0P¯C(λ1,λ2)E|g(XTπ)YTπ|2,\displaystyle\leq e^{\overline{A_{1}}T\vee 0}\overline{P}\leq C(\lambda_{1},\lambda_{2})E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}, (4.18)
max0nNE|δYn|2\displaystyle\max_{0\leq n\leq N}E|\delta Y_{n}|^{2} e(A3¯T)0S¯C(λ1,λ2)E|g(XTπ)YTπ|2.\displaystyle\leq e^{(-\overline{A_{3}}T)\vee 0}\overline{S}\leq C(\lambda_{1},\lambda_{2})E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}. (4.19)

To estimate E|δZn|2E|\delta Z_{n}|^{2}, we consider estimate (4.12), in which λ2\lambda_{2} can take any value no smaller than fzf_{z}. If fz0f_{z}\neq 0, we choose λ2=2fz\lambda_{2}=2f_{z} and obtain

12E|δZi|2hfx2fzE|δXi|2h+E|δYi+1|2[1(2kf+2fz)h]E|δYi|2.\frac{1}{2}E|\delta Z_{i}|^{2}h\leq\frac{f_{x}}{2f_{z}}E|\delta X_{i}|^{2}h+E|\delta Y_{i+1}|^{2}-[1-(2k_{f}+2f_{z})h]E|\delta Y_{i}|^{2}.

Summing from 0 to N1N-1 gives us

i=0N1E|δZi|2h\displaystyle\sum_{i=0}^{N-1}E|\delta Z_{i}|^{2}h\leq\leavevmode\nobreak\ fxTfzmax0nNE|δXn|2+[4(kf+fz)T0+2]max0nNE|δYn|2\displaystyle\frac{f_{x}T}{f_{z}}\max_{0\leq n\leq N}E|\delta X_{n}|^{2}+[4(k_{f}+f_{z})T\vee 0+2]\max_{0\leq n\leq N}E|\delta Y_{n}|^{2}
\displaystyle\leq\leavevmode\nobreak\ C(λ1,λ2)E|g(XTπ)YTπ|2.\displaystyle C(\lambda_{1},\lambda_{2})E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}. (4.20)

The case fz=0f_{z}=0 can be dealt with similarly by choosing λ2=1\lambda_{2}=1 and the same type of estimate can be derived. Finally, combining estimates (4.18)(4.19)(4.20) with Theorem 5, we prove the statement in Theorem 1. ∎

5 An Upper Bound for the Minimized Objective Function

We prove Theorem 2 in this section. We first state three useful lemmas. Theorem 2, as a detailed statement of Theorem 2, and Theorem 6, as an variation of Theorem 2 under stronger conditions, are then provided, followed by their proofs. The proofs of three lemmas are given at the end of the section.

The main process we analyze is (2.3). Lemma 3 gives an estimate of the final distance E|g(XTπ)YTπ|2E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2} provided by (2.3) in terms of the deviation between the approximated variables Y0π,ZtiπY_{0}^{\pi},Z_{t_{i}}^{\pi} and the true solutions.

Lemma 3.

Suppose Assumptions 1, 2, and 3 hold true. Let XTπ,Y0π,YTπX_{T}^{\pi},Y_{0}^{\pi},Y_{T}^{\pi}, {Ztiπ}0iN1\{Z_{t_{i}}^{\pi}\}_{0\leq i\leq N-1} be defined as in system (2.3) and Z~ti=h1E[titi+1Ztdt|ti]\tilde{Z}_{t_{i}}=h^{-1}E[\int_{t_{i}}^{t_{i+1}}Z_{t}\,\mathrm{d}t|\mathcal{F}_{t_{i}}]. Given λ4>0\lambda_{4}>0, there exists a constant C>0C>0 depending on E|ξ|2E|\xi|^{2}, \mathscr{L}, TT, and λ4\lambda_{4}, such that for sufficiently small hh,

E|g(XTπ)YTπ|2(1+λ4)Hmini=0N1E|δZ~ti|2h+C[h+E|Y0Y0π|2],\displaystyle E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq(1+\lambda_{4})H_{\mathrm{min}}\sum_{i=0}^{N-1}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C[h+E|Y_{0}-Y_{0}^{\pi}|^{2}],

where δZ~ti=Z~tiZtiπ\delta\tilde{Z}_{t_{i}}=\tilde{Z}_{t_{i}}-Z_{t_{i}}^{\pi}, H(x)=(1+gx)2e(2K+2Kx1+x)T(1+fzx1)H(x)=(1+\sqrt{g_{x}})^{2}e^{(2K+2Kx^{-1}+x)T}(1+f_{z}x^{-1}), and Hmin=minxR+H(x)H_{\mathrm{min}}=\min_{x\in R^{+}}H(x).

Lemma 3 is close to Theorem 2, except that Z~ti\tilde{Z}_{t_{i}} is not a function of XtiπX_{t_{i}}^{\pi} and YtiπY_{t_{i}}^{\pi} defined in (2.3). To bridge this gap, we need the following two lemmas. First, similar to the proof of Theorem 1, an estimate of the distance between the process defined in (2.3) and the process defined in (3.3) is also needed here. Lemma 4 is a general result to serve this purpose, providing an estimate of the difference between two backward processes driven by different forward processes.

Lemma 4.

Let Xtiπ,jL2(Ω,ti,)X_{t_{i}}^{\pi,j}\in L^{2}(\Omega,\mathcal{F}_{t_{i}},\mathbb{P}) for 0iN0\leq i\leq N, j=1,2j=1,2. Suppose {Ytiπ,j}0iN\{Y_{t_{i}}^{\pi,j}\}_{0\leq i\leq N} and {Ztiπ,j}0iN1\{Z_{t_{i}}^{\pi,j}\}_{0\leq i\leq N-1} satisfy

{YTπ,j=g(XTπ,j),Ztiπ,j=1hE[Yti+1π,jΔWi|ti],Ytiπ,j=E[Yti+1π,j+f(ti,Xtiπ,j,Ytiπ,j,Ztiπ,j)h|ti],\displaystyle\begin{dcases}Y_{T}^{\pi,j}=g(X_{T}^{\pi,j}),\\ Z_{t_{i}}^{\pi,j}=\frac{1}{h}E[Y_{t_{i+1}}^{\pi,j}\Delta W_{i}|\mathcal{F}_{t_{i}}],\\ Y_{t_{i}}^{\pi,j}=E[{Y}_{t_{i+1}}^{\pi,j}+f(t_{i},{X}_{t_{i}}^{\pi,j},{Y}_{t_{i}}^{\pi,j},{Z}_{t_{i}}^{\pi,j})h|\mathcal{F}_{t_{i}}],\end{dcases} (5.1)

for 0iN10\leq i\leq N-1, j=1,2j=1,2. Let δXi=Xtiπ,1Xtiπ,2\delta X_{i}=X_{t_{i}}^{\pi,1}-X_{t_{i}}^{\pi,2}, δZi=Ztiπ,1Ztiπ,2\delta Z_{i}=Z_{t_{i}}^{\pi,1}-Z_{t_{i}}^{\pi,2}, then for any λ7>fz\lambda_{7}>f_{z}, and sufficiently small hh, we have

i=0N1E|δZi|2hλ7(eA5T1)λ7fz{gxeA5TA5hE|δXN|2+fxλ7i=0N1eA5ihE|δXi|2h},\displaystyle\sum_{i=0}^{N-1}E|\delta Z_{i}|^{2}h\leq\frac{\lambda_{7}(e^{-A_{5}T}\vee 1)}{\lambda_{7}-f_{z}}\Big{\{}g_{x}e^{A_{5}T-A_{5}h}E|\delta X_{N}|^{2}+\frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{A_{5}ih}E|\delta X_{i}|^{2}h\Big{\}},

where A5h1ln[1(2kf+λ7)h]A_{5}\coloneqq-h^{-1}\ln[1-(2k_{f}+\lambda_{7})h].

Lemma 5 shows that, similar to the nonlinear Feynman–Kac formula, the discrete stochastic process defined in (2.3) can also be linked to some deterministic functions.

Lemma 5.

Let {Xtiπ}0iN,{Ytiπ}0iN\{X_{t_{i}}^{\pi}\}_{0\leq i\leq N},\{Y_{t_{i}}^{\pi}\}_{0\leq i\leq N} be defined in (2.3). When h<1/Kh<1/\sqrt{K}, there exist deterministic functions Uiπ:m×,Viπ:m×dU_{i}^{\pi}:\mathbb{R}^{m}\times\mathbb{R}\rightarrow\mathbb{R},V_{i}^{\pi}:\mathbb{R}^{m}\times\mathbb{R}\rightarrow\mathbb{R}^{d} for 0iN0\leq i\leq N such that Ytiπ,=Uiπ(Xtiπ,Ytiπ)Y_{t_{i}}^{\pi,^{\prime}}=U_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}), Ztiπ,=Viπ(Xtiπ,Ytiπ)Z_{t_{i}}^{\pi,^{\prime}}=V_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}) satisfy

{YtNπ,=g(XtNπ),Ztiπ,=1hE[Yti+1π,ΔWi|ti],Ytiπ,=E[Yti+1π,+f(ti,Xtiπ,Ytiπ,,Ztiπ,)h|ti],\displaystyle\begin{dcases}Y_{t_{N}}^{\pi,^{\prime}}=g(X_{t_{N}}^{\pi}),\\ Z_{t_{i}}^{\pi,^{\prime}}=\frac{1}{h}E[Y_{t_{i+1}}^{\pi,^{\prime}}\Delta W_{i}|\mathcal{F}_{t_{i}}],\\ Y_{t_{i}}^{\pi,^{\prime}}=E[{Y}_{t_{i+1}}^{\pi,^{\prime}}+f(t_{i},{X}_{t_{i}}^{\pi},{Y}_{t_{i}}^{\pi,^{\prime}},{Z}_{t_{i}}^{\pi,^{\prime}})h|\mathcal{F}_{t_{i}}],\end{dcases} (5.2)

for 0iN10\leq i\leq N-1. If bb and σ\sigma are independent of yy, then there exist deterministic functions Uiπ:m,Viπ:mdU_{i}^{\pi}:\mathbb{R}^{m}\rightarrow\mathbb{R},V_{i}^{\pi}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{d} for 0iN0\leq i\leq N such that Ytiπ,=Uiπ(Xtiπ)Y_{t_{i}}^{\pi,^{\prime}}=U_{i}^{\pi}(X_{t_{i}}^{\pi}), Ztiπ,=Viπ(Xtiπ)Z_{t_{i}}^{\pi,^{\prime}}=V_{i}^{\pi}(X_{t_{i}}^{\pi}) satisfy (5.2).

Now we are ready to prove Theorem 2, with a precise statement given below. Like Theorem 1, the conditions below are satisfied if any of the five cases of the weak coupling and monotonicity conditions holds to certain extent.

Theorem 2.

Suppose Assumptions 1, 2, 3, and 4 hold true. Given any λ1,λ3>0\lambda_{1},\lambda_{3}>0, λ2fz\lambda_{2}\geq f_{z}, and λ7>fz\lambda_{7}>f_{z}, let Ai¯,(i=1,2,3,4)\overline{A_{i}},(i=1,2,3,4) be defined in (4.4) and

A5¯\displaystyle\overline{A_{5}}\coloneqq λ7+2kf,\displaystyle\leavevmode\nobreak\ \lambda_{7}+2k_{f}, (5.3)
A0¯\displaystyle\overline{A_{0}}^{\prime}\coloneqq A2¯1e(A1¯+A3¯)TA1¯+A3¯{gx(1+λ3)e(A1¯+A3¯)T+A4¯e(A1¯+A3¯)T1A1¯+A3¯},\displaystyle\leavevmode\nobreak\ \overline{A_{2}}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}\Big{\{}g_{x}(1+\lambda_{3})e^{(\overline{A_{1}}+\overline{A_{3}})T}+\overline{A_{4}}\frac{e^{(\overline{A_{1}}+\overline{A_{3}})T}-1}{\overline{A_{1}}+\overline{A_{3}}}\Big{\}},
B0¯\displaystyle\overline{B_{0}}\coloneqq HminA2¯eA3¯T1e(A1¯+A3¯)TA1¯+A3¯[1A0¯]1(1+λ31)\displaystyle\leavevmode\nobreak\ H_{\mathrm{min}}\overline{A_{2}}e^{\overline{A_{3}}T}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}[1-\overline{A_{0}}^{\prime}]^{-1}(1+\lambda_{3}^{-1})
×λ7(eA5¯T1)λ7fz{gxe(A1¯+A5¯)T+fxλ7e(A1¯+A5¯)T1A1¯+A5¯}.\displaystyle\leavevmode\nobreak\ \times\frac{\lambda_{7}(e^{-\overline{A_{5}}T}\vee 1)}{\lambda_{7}-f_{z}}\Big{\{}g_{x}e^{(\overline{A_{1}}+\overline{A_{5}})T}+\frac{f_{x}}{\lambda_{7}}\frac{e^{(\overline{A_{1}}+\overline{A_{5}})T}-1}{\overline{A_{1}}+\overline{A_{5}}}\Big{\}}.

If there exist λ1,λ2,λ3,λ7\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{7} satisfying A0¯<1\overline{A_{0}}^{\prime}<1 and B0¯<1\overline{B_{0}}<1, then there exists a constant C depending on E|ξ|2E|\xi|^{2}, \mathscr{L}, TT, λ1\lambda_{1}, λ2\lambda_{2}, λ3\lambda_{3}, and λ7\lambda_{7}, such that for sufficiently small hh,

E|g(XTπ)YTπ|2C{h+E|Y0Y0π|2+i=0N1E|E[Z~ti|Xtiπ,Ytiπ]Ztiπ|2h},E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq C\Big{\{}h+E|Y_{0}-Y_{0}^{\pi}|^{2}+\sum_{i=0}^{N-1}E|E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-Z_{t_{i}}^{\pi}|^{2}h\Big{\}}, (5.4)

where Z~ti=h1E[titi+1Ztdt|ti]\tilde{Z}_{t_{i}}=h^{-1}E[\int_{t_{i}}^{t_{i+1}}Z_{t}\,\mathrm{d}t|\mathcal{F}_{t_{i}}]. If ZtZ_{t} is cádlag, we can replace Z~ti\tilde{Z}_{t_{i}} with ZtiZ_{t_{i}}. If bb and σ\sigma are independent of yy, we can replace E[Z~ti|Xtiπ,Ytiπ]E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}] with E[Z~ti|Xtiπ]E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi}].

Remark.

If we take the infimum within the domains of Y0πY_{0}^{\pi} and ZtiπZ_{t_{i}}^{\pi} on both sides, we recover the original statement in Theorem 2.

Remark.

If any of the weak coupling and monotonicity conditions introduced in Assumption 3 holds to a sufficient extent, there must exist λ1,λ2,λ3,λ7\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{7} satisfying the conditions in Theorem 2. The arguments are very similar to those provided in Remark Remark. Hence, we omit the details here for the sake of brevity.

Proof of Theorem 2.

Using Lemma 3 with λ4>0\lambda_{4}>0, we obtain

E|g(XTπ)YTπ|2(1+λ4)Hmini=0N1E|δZ~ti|2h+C(λ4)[h+E|Y0Y0π|2].\displaystyle E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq(1+\lambda_{4})H_{\mathrm{min}}\sum_{i=0}^{N-1}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C(\lambda_{4})[h+E|Y_{0}-Y_{0}^{\pi}|^{2}]. (5.5)

Splitting the term δZ~ti=Z~tiZtiπ\delta\tilde{Z}_{t_{i}}=\tilde{Z}_{t_{i}}-Z_{t_{i}}^{\pi} and applying the generalized mean inequality, we have (recall Z¯tiπ\overline{Z}_{t_{i}}^{\pi} is defined in Theorem 5)

E|δZ~ti|2\displaystyle\leavevmode\nobreak\ E|\delta\tilde{Z}_{t_{i}}|^{2}
\displaystyle\leq (1+λ4)E|Z¯tiπE[Z¯tiπ|Xtiπ,Ytiπ]|2\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})E|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]|^{2}
+(1+λ41){E|(Z~tiZ¯tiπ)E[(Z~tiZ¯tiπ)|Xtiπ,Ytiπ]\displaystyle\leavevmode\nobreak\ +(1+\lambda_{4}^{-1})\Big{\{}E|(\tilde{Z}_{t_{i}}-\overline{Z}_{t_{i}}^{\pi})-E[(\tilde{Z}_{t_{i}}-\overline{Z}_{t_{i}}^{\pi})|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]
+(E[Z~ti|Xtiπ,Ytiπ]Ztiπ)|2}\displaystyle\leavevmode\nobreak\ \qquad\qquad\qquad+(E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-Z_{t_{i}}^{\pi})|^{2}\Big{\}}
\displaystyle\leq (1+λ4)E|Z¯tiπE[Z¯tiπ|Xtiπ,Ytiπ]|2\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})E|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]|^{2}
+3(1+λ41){E|Z~tiZ¯tiπ|2+E|E[(Z~tiZ¯tiπ)|Xtiπ,Ytiπ]|2\displaystyle\leavevmode\nobreak\ +3(1+\lambda_{4}^{-1})\Big{\{}E|\tilde{Z}_{t_{i}}-\overline{Z}_{t_{i}}^{\pi}|^{2}+E|E[(\tilde{Z}_{t_{i}}-\overline{Z}_{t_{i}}^{\pi})|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]|^{2}
+E|E[Z~ti|Xtiπ,Ytiπ]Ztiπ|2}\displaystyle\leavevmode\nobreak\ \qquad\qquad\qquad\leavevmode\nobreak\ +E|E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-Z_{t_{i}}^{\pi}|^{2}\Big{\}}
\displaystyle\leq (1+λ4)E|Z¯tiπE[Z¯tiπ|Xtiπ,Ytiπ]|2\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})E|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]|^{2}
+3(1+λ41){2E|Z~tiZ¯tiπ|2+E|E[Z~ti|Xtiπ,Ytiπ]Ztiπ|2}.\displaystyle\leavevmode\nobreak\ +3(1+\lambda_{4}^{-1})\Big{\{}2E|\tilde{Z}_{t_{i}}-\overline{Z}_{t_{i}}^{\pi}|^{2}+E|E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-Z_{t_{i}}^{\pi}|^{2}\Big{\}}. (5.6)

From equations (3.2)(3.4), we know that

i=0N1E|Z~tiZ¯tiπ|2h\displaystyle\sum_{i=0}^{N-1}E|\tilde{Z}_{t_{i}}-\overline{Z}_{t_{i}}^{\pi}|^{2}h 2i=0N1titi+1[E|ZtZ~ti|2+E|ZtZ¯tiπ|2dt\displaystyle\leq 2\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}[E|Z_{t}-\tilde{Z}_{t_{i}}|^{2}+E|Z_{t}-\overline{Z}_{t_{i}}^{\pi}|^{2}\mathrm{d}t
=20T[E|ZtZ~ti|2+E|ZtZ¯tiπ|2dt\displaystyle=2\int_{0}^{T}[E|Z_{t}-\tilde{Z}_{t_{i}}|^{2}+E|Z_{t}-\overline{Z}_{t_{i}}^{\pi}|^{2}\mathrm{d}t
C(1+E|ξ|2)h.\displaystyle\leq C(1+E|\xi|^{2})h. (5.7)

Plugging estimates (5.6)(5.7) into (5.5) gives us

E|g(XTπ)YTπ|2\displaystyle\leavevmode\nobreak\ E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}
\displaystyle\leq (1+λ4)2Hmini=0N1E|Z¯tiπE[Z¯tiπ|Xtiπ,Ytiπ]|2h\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})^{2}H_{\mathrm{min}}\sum_{i=0}^{N-1}E|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]|^{2}h
+C(λ4){h+E|Y0Y0π|2+i=0N1E|E[Z~ti|Xtiπ,Ytiπ]Ztiπ|2h}.\displaystyle\leavevmode\nobreak\ +C(\lambda_{4})\Big{\{}h+E|Y_{0}-Y_{0}^{\pi}|^{2}+\sum_{i=0}^{N-1}E|E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-Z_{t_{i}}^{\pi}|^{2}h\Big{\}}. (5.8)

It remains to estimate the term i=0N1E|Z¯tiπE[Z¯tiπ|Xtiπ,Ytiπ]|2h\sum_{i=0}^{N-1}E|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]|^{2}h, to which we intend to apply Lemma 4. Let Xtiπ,1=XtiπX_{t_{i}}^{\pi,1}=X_{t_{i}}^{\pi} and Xtiπ,2=X¯tiπX_{t_{i}}^{\pi,2}=\overline{X}_{t_{i}}^{\pi}. The associated Ztiπ,1Z_{t_{i}}^{\pi,1} and Ztiπ,2Z_{t_{i}}^{\pi,2} are then defined according to equation (5.1). Note that Ztiπ,2=Z¯tiπZ_{t_{i}}^{\pi,2}=\overline{Z}_{t_{i}}^{\pi} but Ztiπ,1Z_{t_{i}}^{\pi,1} is not necessarily equal to ZtiπZ_{t_{i}}^{\pi}, due to the possible violation of the terminal condition. From Lemma 5, we know Ztiπ,1Z_{t_{i}}^{\pi,1} can be represented as Viπ(Xtiπ,Ytiπ)V_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}) with ViπV_{i}^{\pi} being a deterministic function. By the property of conditional expectation, we have

E|Z¯tiπE[Z¯tiπ|Xtiπ,Ytiπ]|2E|Z¯tiπVi(Xtiπ,Ytiπ)|2,\displaystyle E|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]|^{2}\leq E|\overline{Z}_{t_{i}}^{\pi}-V_{i}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})|^{2},

for any ViV_{i}. Therefore we have the estimate

i=0N1E|Z¯tiπE[Z¯tiπ|Xtiπ,Ytiπ]|2hi=0N1E|δZi|2h\displaystyle\leavevmode\nobreak\ \sum_{i=0}^{N-1}E|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]|^{2}h\leq\sum_{i=0}^{N-1}E|\delta Z_{i}|^{2}h
\displaystyle\leq λ7(eA5T1)λ7fz{gxeA5TA5hE|δXN|2+fxλ7i=0N1eA5ihE|δXi|2h}.\displaystyle\leavevmode\nobreak\ \frac{\lambda_{7}(e^{-A_{5}T}\vee 1)}{\lambda_{7}-f_{z}}\Big{\{}g_{x}e^{A_{5}T-A_{5}h}E|\delta X_{N}|^{2}+\frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{A_{5}ih}E|\delta X_{i}|^{2}h\Big{\}}. (5.9)

Recall that δXi=XtiπX¯tiπ,δZi=Ztiπ,1Z¯tiπ\delta X_{i}=X_{t_{i}}^{\pi}-\overline{X}_{t_{i}}^{\pi},\delta Z_{i}=Z_{t_{i}}^{\pi,1}-\overline{Z}_{t_{i}}^{\pi}. Similar to the derivation of estimate (4.16) (using a given λ3>0\lambda_{3}>0 without final specification) in the proof of Theorem 1, when A0¯<1\overline{A_{0}}^{\prime}<1, we have

P¯\displaystyle\overline{P} (1+λ4)A2¯eA3¯T1e(A1¯+A3¯)TA1¯+A3¯[1A0¯]1(1+λ31)E|YTπg(XTπ)|2,\displaystyle\leq(1+\lambda_{4})\overline{A_{2}}e^{\overline{A_{3}}T}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}[1-\overline{A_{0}}^{\prime}]^{-1}(1+\lambda_{3}^{-1})E|Y_{T}^{\pi}-g(X_{T}^{\pi})|^{2}, (5.10)

in which P¯=max0nNeA1¯nhE|δXi|2\overline{P}=\max_{0\leq n\leq N}e^{-\overline{A_{1}}nh}E|\delta X_{i}|^{2}. Plugging (5.10) into (5.9), and then into (5.8), we get

i=0N1E|δZi|2h\displaystyle\sum_{i=0}^{N-1}E|\delta Z_{i}|^{2}h λ7(eA5T1)P¯λ7fz{gxe(A1¯+A5)TA5h+fxλ7i=0N1e(A1¯+A5)ihh},\displaystyle\leq\frac{\lambda_{7}(e^{-A_{5}T}\vee 1)\overline{P}}{\lambda_{7}-f_{z}}\Big{\{}g_{x}e^{(\overline{A_{1}}+A_{5})T-A_{5}h}+\frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{(\overline{A_{1}}+A_{5})ih}h\Big{\}},

and

E|g(XTπ)YTπ|2\displaystyle\leavevmode\nobreak\ E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}
\displaystyle\leq (1+λ4)3B(h)E|g(XTπ)YTπ|2\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})^{3}B(h)E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}
+C(λ4){h+E|Y0Y0π|2+i=0N1E|E[Z~ti|Xtiπ,Ytiπ]Ztiπ|2h},\displaystyle\leavevmode\nobreak\ +C(\lambda_{4})\Big{\{}h+E|Y_{0}-Y_{0}^{\pi}|^{2}+\sum_{i=0}^{N-1}E|E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-Z_{t_{i}}^{\pi}|^{2}h\Big{\}}, (5.11)

for sufficiently small hh. Here B(h)B(h) is defined as

B(h)=\displaystyle B(h)= HminA2¯eA3¯T1e(A1¯+A3¯)TA1¯+A3¯[1A0¯]1(1+λ31)\displaystyle\leavevmode\nobreak\ H_{\mathrm{min}}\overline{A_{2}}e^{\overline{A_{3}}T}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}[1-\overline{A_{0}}^{\prime}]^{-1}(1+\lambda_{3}^{-1})
×λ7(eA5T1)λ7fz{gxe(A1¯+A5)TA5h+fxλ7i=0N1e(A1¯+A5)ihh}.\displaystyle\leavevmode\nobreak\ \times\frac{\lambda_{7}(e^{-A_{5}T}\vee 1)}{\lambda_{7}-f_{z}}\Big{\{}g_{x}e^{(\overline{A_{1}}+A_{5})T-A_{5}h}+\frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{(\overline{A_{1}}+A_{5})ih}h\Big{\}}.

The forms of inequalities (5.4) and (5.11) are already very close. When limh0B(h)=B0¯<1\displaystyle{\lim_{h\rightarrow 0}B(h)}=\overline{B_{0}}<1, there exists λ4>0\lambda_{4}>0 such that for sufficiently small hh, we have 1(1+λ4)3B(h)>12(1B0¯)1-(1+\lambda_{4})^{3}B(h)>\frac{1}{2}(1-\overline{B_{0}}). Rearranging the term E|g(XTπ)YTπ|2E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2} in inequality (5.11) yields our final estimate. ∎

We shall briefly discuss how the universal approximation theorem can be applied based on Theorem 2. For instance, Theorem 2.1 in [22] states that every continuous and piecewise linear function with mm-dimensional input can be represented by a deep neural network with rectified linear units and at most 1+log2(m+1)\lceil 1+\log_{2}(m+1)\rceil depth. Now we view Y0Y_{0} as a target function with input ξ\xi and E[Z~ti|Xtiπ,Ytiπ]E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}] as another target function with input (Xtiπ,Ytiπ)(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}). Since E|Y0|2<+E|Y_{0}|^{2}<+\infty and E|E[Z~ti|Xtiπ,Ytiπ]|2E|Z~ti|2<+E|E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]|^{2}\leq E|\tilde{Z}_{t_{i}}|^{2}<+\infty, we know that both target functions can be approximated in the L2L^{2} sense by continuous and piecewise linear functions with arbitrary accuracy. Then the aforementioned statement implies that the two target functions can be approximated by two neural networks with rectified linear units and at most 1+log2(m+1)\lceil 1+\log_{2}(m+1)\rceil depth, although the width might go to infinity as the approximation error decreases to 0. Therefore, according to Theorem 2, there exist good neural networks such that the value of the objective function is small.

Note that there still exist some concerns about the result in Theorem 2. First, the function E[Z~ti|Xtiπ,Ytiπ]E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}] changes when ZtjπZ_{t_{j}}^{\pi} changes for j<ij<i. Second, the function may depend on YtiπY_{t_{i}}^{\pi}. Even if the FBSDEs are decoupled so that the above two concerns do not exist, we know nothing a priori about the property of E[Z~ti|Xtiπ,Ytiπ]E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]. In the next theorem, we replace E[Z~ti|Xtiπ,Ytiπ]E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}] with σT(ti,Xtiπ,u(ti,Xtiπ))xu(ti,Xtiπ)\sigma^{\operatorname{T}}(t_{i},X_{t_{i}}^{\pi},u(t_{i},X_{t_{i}}^{\pi}))\nabla_{x}u(t_{i},X_{t_{i}}^{\pi}), which can resolve these problems. However, meanwhile we require more regularity for the coefficients of the FBSDEs.

Theorem 6.

Suppose Assumptions 1, 2, 3, 4 and the assumptions in Theorem 3 hold true. Let uu be the solution of corresponding quasilinear PDEs (2.7) and LL be the squared Lipschitz constant of σT(t,x,u(t,x))xu(t,x)\sigma^{\operatorname{T}}(t,x,u(t,x))\nabla_{x}u(t,x) with respect to x. With the same notations of Theorem 2, when A0¯<1\overline{A_{0}}^{\prime}<1 and

B0¯HminLA2¯eA3¯T(eA1¯T1)(1e(A1¯+A3¯)T)A1¯(A1¯+A3¯)[1A0¯]1(1+λ31)<1,\displaystyle\overline{B_{0}}^{\prime}\coloneqq H_{\mathrm{min}}L\overline{A_{2}}e^{\overline{A_{3}}T}\frac{(e^{\overline{A_{1}}T}-1)(1-e^{-(\overline{A_{1}}+\overline{A_{3}})T})}{\overline{A_{1}}(\overline{A_{1}}+\overline{A_{3}})}[1-\overline{A_{0}}^{\prime}]^{-1}(1+\lambda_{3}^{-1})<1,

there exists a constant C>0C>0 depending on E|ξ|2E|\xi|^{2}, TT, \mathscr{L}, LL, λ1\lambda_{1}, λ2\lambda_{2}, and λ3\lambda_{3}, such that for sufficiently small hh,

E|g(XTπ)YTπ|2C{h+E|Y0Y0π|2+i=0N1E|fi(Xtiπ)Ztiπ|2h},E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq C\Big{\{}h+E|Y_{0}-Y_{0}^{\pi}|^{2}+\sum_{i=0}^{N-1}E|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}|^{2}h\Big{\}}, (5.12)

where fi(x)=σT(ti,x,u(ti,x))xu(ti,x)f_{i}(x)=\sigma^{\operatorname{T}}(t_{i},x,u(t_{i},x))\nabla_{x}u(t_{i},x).

Proof.

By Theorem 3, we have Zti=fi(Xti)Z_{t_{i}}=f_{i}(X_{t_{i}}), in which XtX_{t} is the solution of

Xt=ξ+0tb(s,Xs,u(s,Xs))ds+0tσ(s,Xs,u(s,Xs))dWs.X_{t}=\xi+\int_{0}^{t}b(s,X_{s},u(s,X_{s}))\,\mathrm{d}s+\int_{0}^{t}\sigma(s,X_{s},u(s,X_{s}))\,\mathrm{d}W_{s}.

Using Lemma 3 again with λ4>0\lambda_{4}>0 gives us

E|g(XTπ)YTπ|2(1+λ4)Hmini=0N1E|δZ~ti|2h+C(λ4)[h+E|Y0Y0π|2].\displaystyle E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq(1+\lambda_{4})H_{\mathrm{min}}\sum_{i=0}^{N-1}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C(\lambda_{4})[h+E|Y_{0}-Y_{0}^{\pi}|^{2}].

Given the continuity of σT(t,x,u(t,x))xu(t,x)\sigma^{\operatorname{T}}(t,x,u(t,x))\nabla_{x}u(t,x), we know ZtZ_{t} admits a continuous version. Hence the term Z~ti\tilde{Z}_{t_{i}} in δZ~ti=Z~tiZtiπ\delta\tilde{Z}_{t_{i}}=\tilde{Z}_{t_{i}}-Z_{t_{i}}^{\pi} can be replaced with ZtiZ_{t_{i}}, i.e.,

E|g(XTπ)YTπ|2(1+λ4)Hmini=0N1E|ZtiZtiπ|2h+C(λ4)[h+E|Y0Y0π|2].\displaystyle E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq(1+\lambda_{4})H_{\mathrm{min}}\sum_{i=0}^{N-1}E|Z_{t_{i}}-Z_{t_{i}}^{\pi}|^{2}h+C(\lambda_{4})[h+E|Y_{0}-Y_{0}^{\pi}|^{2}]. (5.13)

Similar to the arguments in inequalities (5.6)(5.7), we have

E|ZtiZtiπ|2\displaystyle\leavevmode\nobreak\ E|Z_{t_{i}}-Z_{t_{i}}^{\pi}|^{2}
\displaystyle\leq (1+λ41)E|fi(Xtiπ)Ztiπ|2+(1+λ4)E|Ztifi(Xtiπ)|2\displaystyle\leavevmode\nobreak\ (1+\lambda_{4}^{-1})E|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}|^{2}+(1+\lambda_{4})E|Z_{t_{i}}-f_{i}(X_{t_{i}}^{\pi})|^{2}
\displaystyle\leq (1+λ41)E|fi(Xtiπ)Ztiπ|2+(1+λ4)LE|XtiXtiπ|2\displaystyle\leavevmode\nobreak\ (1+\lambda_{4}^{-1})E|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}|^{2}+(1+\lambda_{4})LE|X_{t_{i}}-X_{t_{i}}^{\pi}|^{2}
\displaystyle\leq (1+λ41)E|fi(Xtiπ)Ztiπ|2\displaystyle\leavevmode\nobreak\ (1+\lambda_{4}^{-1})E|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}|^{2}
+(1+λ4)L[(1+λ4)E|XtiπX¯tiπ|2+(1+λ41)E|XtiX¯tiπ|2]\displaystyle\leavevmode\nobreak\ +(1+\lambda_{4})L[(1+\lambda_{4})E|X_{t_{i}}^{\pi}-\overline{X}_{t_{i}}^{\pi}|^{2}+(1+\lambda_{4}^{-1})E|X_{t_{i}}-\overline{X}_{t_{i}}^{\pi}|^{2}]
\displaystyle\leq (1+λ4)2LE|XtiπX¯tiπ|2+C(L,λ4){E|fi(Xtiπ)Ztiπ|2+h},\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})^{2}LE|X_{t_{i}}^{\pi}-\overline{X}_{t_{i}}^{\pi}|^{2}+C(L,\lambda_{4})\Big{\{}E|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}|^{2}+h\Big{\}},

where the last equality uses the convergence result (3.4). Plugging it into (5.13), we have

E|g(XTπ)YTπ|2\displaystyle E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq (1+λ4)3HminLi=0N1E|XtiπX¯tiπ|2h\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})^{3}H_{\mathrm{min}}L\sum_{i=0}^{N-1}E|X_{t_{i}}^{\pi}-\overline{X}_{t_{i}}^{\pi}|^{2}h
+C(L,λ4){h+E|Y0Y0π|2+i=0N1E|fi(Xtiπ)Ztiπ|2h}\displaystyle\leavevmode\nobreak\ +C(L,\lambda_{4})\Big{\{}h+E|Y_{0}-Y_{0}^{\pi}|^{2}+\sum_{i=0}^{N-1}E|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}|^{2}h\Big{\}} (5.14)

for sufficiently small hh.

We employ the estimate (5.10) again to rewrite inequality (5.14) as

E|g(XTπ)YTπ|2\displaystyle E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq (1+λ4)4B~(h)E|g(XTπ)YTπ|2\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})^{4}\widetilde{B}(h)E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}
+C(L,λ4){h+E|Y0Y0π|2+i=0N1E|fi(Xtiπ)Ztiπ|2h},\displaystyle\leavevmode\nobreak\ +C(L,\lambda_{4})\Big{\{}h+E|Y_{0}-Y_{0}^{\pi}|^{2}+\sum_{i=0}^{N-1}E|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}|^{2}h\Big{\}}, (5.15)

where

B~(h)=HminLA2¯eA3¯T1e(A1¯+A3¯)TA1¯+A3¯[1A0¯]1(1+λ31)i=0N1eiA1¯hh.\displaystyle\widetilde{B}(h)=H_{\mathrm{min}}L\overline{A_{2}}e^{\overline{A_{3}}T}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}[1-\overline{A_{0}}^{\prime}]^{-1}(1+\lambda_{3}^{-1})\sum_{i=0}^{N-1}e^{i\overline{A_{1}}h}h.

Arguing in the same way as that in the proof of Theorem 2, when B~(h)\tilde{B}(h) is strictly bounded above by 11 for sufficiently small hh, we can choose λ4\lambda_{4} small enough and rearrange the terms in inequality (5.15) to obtain the result in inequality (5.12).

Remark.

The Lipschitz constant used in Theorem 6 may be further estimated a priori. Denote the Lipschitz constant of function ff with respect to xx as Lx(f)L_{x}(f), and the bound of function ff as M(f)M(f). Loosely speaking, we have

Lx(σT(t,x,u(t,x))xu(t,x))M(σ)Lx(xu)+M(xu)[Lx(σ)+Ly(σ)Lx(u)].L_{x}(\sigma^{\operatorname{T}}(t,x,u(t,x))\nabla_{x}u(t,x))\leq M(\sigma)L_{x}(\nabla_{x}u)+M(\nabla_{x}u)[L_{x}(\sigma)+L_{y}(\sigma)L_{x}(u)].

Here Lx(u)=M(xu(t,x))L_{x}(u)=M(\nabla_{x}u(t,x)) can be estimated from the first point of Theorem 4 and L(xu(t,x))=M(xxu)L(\nabla_{x}u(t,x))=M(\nabla_{xx}u) can be estimated through the Schauder estimate (see, e.g., [32, Chapter 4, Lemma 2.1]). Note that the resulting estimate may depend on the dimension dd.

5.1 Proof of Lemmas

Proof of Lemma 3.

We construct continuous processes Xtπ,YtπX_{t}^{\pi},Y_{t}^{\pi} as follows. For t[ti,ti+1)t\in[t_{i},t_{i+1}), let

Xtπ\displaystyle X_{t}^{\pi} =Xtiπ+b(ti,Xtiπ,Ytiπ)(tti)+σ(ti,Xtiπ,Ytiπ)(WtWti),\displaystyle=X_{t_{i}}^{\pi}+b(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})(t-t_{i})+\sigma(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})(W_{t}-W_{t_{i}}),
Ytπ\displaystyle Y_{t}^{\pi} =Ytiπf(ti,Xtiπ,Ytiπ,Ztiπ)(tti)+(Ztiπ)T(WtWti).\displaystyle=Y_{t_{i}}^{\pi}-f(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi})(t-t_{i})+(Z_{t_{i}}^{\pi})^{\operatorname{T}}(W_{t}-W_{t_{i}}).

From system (2.3), we see this definition also works at ti+1t_{i+1}. We are interested in again the estimates of the following terms

δXt\displaystyle\delta X_{t} =XtXtπ,δYt=YtYtπ,δZt=ZtZtiπ,t[ti,ti+1).\displaystyle=X_{t}-X_{t}^{\pi},\quad\delta Y_{t}=Y_{t}-Y_{t}^{\pi},\quad\delta Z_{t}=Z_{t}-Z_{t_{i}}^{\pi},\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ t\in[t_{i},t_{i+1}).

For t[ti,ti+1)t\in[t_{i},t_{i+1}), let

δbt\displaystyle\delta b_{t} =b(t,Xt,Yt)b(ti,Xtiπ,Ytiπ),\displaystyle=b(t,X_{t},Y_{t})-b(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}),
δσt\displaystyle\delta\sigma_{t} =σ(t,Xt,Yt)σ(ti,Xtiπ,Ytiπ),\displaystyle=\sigma(t,X_{t},Y_{t})-\sigma(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}),
δft\displaystyle\delta f_{t} =f(t,Xt,Yt,Zt)f(ti,Xtiπ,Ytiπ,Ztiπ).\displaystyle=f(t,X_{t},Y_{t},Z_{t})-f(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi}).

By definition,

d(δXt)\displaystyle\mathrm{d}(\delta X_{t}) =δbtdt+δσtdWt,\displaystyle=\delta b_{t}\,\mathrm{d}t+\delta\sigma_{t}\,\mathrm{d}W_{t},
d(δYt)\displaystyle\mathrm{d}(\delta Y_{t}) =δftdt+(δZt)TdWt.\displaystyle=-\delta f_{t}\,\mathrm{d}t+(\delta Z_{t})^{\operatorname{T}}\,\mathrm{d}W_{t}.

Then by Itô’s formula, we have

d|δXt|2\displaystyle\mathrm{d}|\delta X_{t}|^{2} =[2(δbt)TδXt+δσt2]dt+2(δXt)TδσtdWt,\displaystyle=[2(\delta b_{t})^{T}\delta X_{t}+\|\delta\sigma_{t}\|^{2}]\mathrm{d}t+2(\delta X_{t})^{\operatorname{T}}\delta\sigma_{t}\,\mathrm{d}W_{t},
d|δYt|2\displaystyle\mathrm{d}|\delta Y_{t}|^{2} =[2(δft)TδYt+|δZt|2]dt+2δYt(δZt)TdWt.\displaystyle=[-2(\delta f_{t})^{\operatorname{T}}\delta Y_{t}+|\delta Z_{t}|^{2}]\,\mathrm{d}t+2\delta Y_{t}(\delta Z_{t})^{\operatorname{T}}\,\mathrm{d}W_{t}.

Thus,

E|δXt|2\displaystyle E|\delta X_{t}|^{2} =E|δXti|2+titE[2(δbs)TδXs+δσs2]ds,\displaystyle=E|\delta X_{t_{i}}|^{2}+\int_{t_{i}}^{t}E[2(\delta b_{s})^{\operatorname{T}}\delta X_{s}+\|\delta\sigma_{s}\|^{2}]\,\mathrm{d}s,
E|δYt|2\displaystyle E|\delta Y_{t}|^{2} =E|δYti|2+titE[2(δfs)TδYs+|δZs|2]ds.\displaystyle=E|\delta Y_{t_{i}}|^{2}+\int_{t_{i}}^{t}E[-2(\delta f_{s})^{\operatorname{T}}\delta Y_{s}+|\delta Z_{s}|^{2}]\,\mathrm{d}s.

For any λ5,λ6>0\lambda_{5},\lambda_{6}>0, using Assumptions 1, 2 and the RMS-GM inequality, we have

E|δXt|2\displaystyle\leavevmode\nobreak\ E|\delta X_{t}|^{2}
\displaystyle\leq E|δXti|2+tit[λ5E|δXs|2+λ51E|δbs|2+Eδσs2]ds\displaystyle\leavevmode\nobreak\ E|\delta X_{t_{i}}|^{2}+\int_{t_{i}}^{t}[\lambda_{5}E|\delta X_{s}|^{2}+\lambda_{5}^{-1}E|\delta b_{s}|^{2}+E\|\delta\sigma_{s}\|^{2}]\,\mathrm{d}s
\displaystyle\leq E|δXti|2+λ5titE|δXs|2ds+titK(λ51+1)|sti|ds\displaystyle\leavevmode\nobreak\ E|\delta X_{t_{i}}|^{2}+\lambda_{5}\int_{t_{i}}^{t}E|\delta X_{s}|^{2}\,\mathrm{d}s+\int_{t_{i}}^{t}K(\lambda_{5}^{-1}+1)|s-t_{i}|\,\mathrm{d}s
+tit[(Kλ51+σx)E|XsXtiπ|2+(byλ51+σy)E|YsYtiπ|2]ds.\displaystyle\leavevmode\nobreak\ +\int_{t_{i}}^{t}[(K\lambda_{5}^{-1}+\sigma_{x})E|X_{s}-X_{t_{i}}^{\pi}|^{2}+(b_{y}\lambda_{5}^{-1}+\sigma_{y})E|Y_{s}-Y_{t_{i}}^{\pi}|^{2}]\,\mathrm{d}s. (5.16)

By the RMS-GM inequality, we also have

E|XsXtiπ|2\displaystyle E|X_{s}-X_{t_{i}}^{\pi}|^{2} (1+ϵ1)E|δXti|2+(1+ϵ11)E|XsXti|2,\displaystyle\leq(1+\epsilon_{1})E|\delta X_{t_{i}}|^{2}+(1+\epsilon_{1}^{-1})E|X_{s}-X_{t_{i}}|^{2}, (5.17)
E|YsYtiπ|2\displaystyle E|Y_{s}-Y_{t_{i}}^{\pi}|^{2} (1+ϵ2)E|δYti|2+(1+ϵ21)E|YsYti|2,\displaystyle\leq(1+\epsilon_{2})E|\delta Y_{t_{i}}|^{2}+(1+\epsilon_{2}^{-1})E|Y_{s}-Y_{t_{i}}|^{2}, (5.18)

in which we choose ϵ1=λ6(Kλ51+σx)1\epsilon_{1}=\lambda_{6}(K\lambda_{5}^{-1}+\sigma_{x})^{-1} and ϵ2=λ6(byλ51+σy)1\epsilon_{2}=\lambda_{6}(b_{y}\lambda_{5}^{-1}+\sigma_{y})^{-1}. The path regularity in Theorem 4 tells us

sups[ti,ti+1](E|XsXti|2+E|YsYti|2)Ch.\displaystyle\sup_{s\in[t_{i},t_{i+1}]}(E|X_{s}-X_{t_{i}}|^{2}+E|Y_{s}-Y_{t_{i}}|^{2})\leq Ch. (5.19)

Plugging inequalities (5.17)(5.18)(5.19) into (5.16) with simplification, we obtain

E|δXt|2\displaystyle E|\delta X_{t}|^{2} [1+(Kλ51+σx+λ6)h]E|δXti|2+λ5titE|δXs|2ds\displaystyle\leq[1+(K\lambda_{5}^{-1}+\sigma_{x}+\lambda_{6})h]E|\delta X_{t_{i}}|^{2}+\lambda_{5}\int_{t_{i}}^{t}E|\delta X_{s}|^{2}\,\mathrm{d}s
+(byλ51+σy+λ6)E|δYti|2h+C(λ5,λ6)h2.\displaystyle\hphantom{=\leavevmode\nobreak\ }+(b_{y}\lambda_{5}^{-1}+\sigma_{y}+\lambda_{6})E|\delta Y_{t_{i}}|^{2}h+C(\lambda_{5},\lambda_{6})h^{2}. (5.20)

Then, by Grönwall inequality, we have

E|δXti+1|2\displaystyle E|\delta X_{t_{i+1}}|^{2}
\displaystyle\leq\leavevmode\nobreak\ eλ5h{[1+(Kλ51+σx+λ6)h]E|δXti|2\displaystyle e^{\lambda_{5}h}\{[1+(K\lambda_{5}^{-1}+\sigma_{x}+\lambda_{6})h]E|\delta X_{t_{i}}|^{2}
+(byλ51+σy+λ6)E|δYti|2h+C(λ5,λ6)h2}\displaystyle\quad\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +(b_{y}\lambda_{5}^{-1}+\sigma_{y}+\lambda_{6})E|\delta Y_{t_{i}}|^{2}h+C(\lambda_{5},\lambda_{6})h^{2}\}
\displaystyle\leq\leavevmode\nobreak\ eA6hE|δXti|2+eλ5h(byλ51+σy+λ6)E|δYti|2h+C(λ5,λ6)h2\displaystyle e^{A_{6}h}E|\delta X_{t_{i}}|^{2}+e^{\lambda_{5}h}(b_{y}\lambda_{5}^{-1}+\sigma_{y}+\lambda_{6})E|\delta Y_{t_{i}}|^{2}h+C(\lambda_{5},\lambda_{6})h^{2}
\displaystyle\leq\leavevmode\nobreak\ eA6hE|δXti|2+A7E|δYti|2h+C(λ5,λ6)h2,\displaystyle e^{A_{6}h}E|\delta X_{t_{i}}|^{2}+A_{7}E|\delta Y_{t_{i}}|^{2}h+C(\lambda_{5},\lambda_{6})h^{2}, (5.21)

where A6Kλ51+σx+λ5+λ6A_{6}\coloneqq K\lambda_{5}^{-1}+\sigma_{x}+\lambda_{5}+\lambda_{6}, A7byλ51+σy+2λ6A_{7}\coloneqq b_{y}\lambda_{5}^{-1}+\sigma_{y}+2\lambda_{6}, and hh is sufficiently small.

Similarly, with the same type of estimates in (5.16)(5.20), for any λ5,λ6>0\lambda_{5},\lambda_{6}>0, we have

E|δYt|2\displaystyle\leavevmode\nobreak\ E|\delta Y_{t}|^{2}
\displaystyle\leq E|δYti|2+tit[λ5E|δYs|2+λ51E|δfs|2+E|δZs|2]ds\displaystyle\leavevmode\nobreak\ E|\delta Y_{t_{i}}|^{2}+\int_{t_{i}}^{t}[\lambda_{5}E|\delta Y_{s}|^{2}+\lambda_{5}^{-1}E|\delta f_{s}|^{2}+E|\delta Z_{s}|^{2}]\,\mathrm{d}s
\displaystyle\leq E|δYti|2+λ5titE|δYs|2ds+titKλ51|sti|ds\displaystyle\leavevmode\nobreak\ E|\delta Y_{t_{i}}|^{2}+\lambda_{5}\int_{t_{i}}^{t}E|\delta Y_{s}|^{2}\,\mathrm{d}s+\int_{t_{i}}^{t}K\lambda_{5}^{-1}|s-t_{i}|\,\mathrm{d}s
+titλ51[fxE|XsXtiπ|2+KE|YsYtiπ|2]ds+(1+fzλ51)titE|δZs|2ds\displaystyle\leavevmode\nobreak\ +\int_{t_{i}}^{t}\lambda_{5}^{-1}[f_{x}E|X_{s}-X_{t_{i}}^{\pi}|^{2}+KE|Y_{s}-Y_{t_{i}}^{\pi}|^{2}]\,\mathrm{d}s+(1+f_{z}\lambda_{5}^{-1})\int_{t_{i}}^{t}E|\delta Z_{s}|^{2}\,\mathrm{d}s
\displaystyle\leq [1+(Kλ51+λ6)h]E|δYti|2+λ5titE|δYs|2ds+(fxλ51+λ6)E|δXtiπ|2h\displaystyle\leavevmode\nobreak\ [1+(K\lambda_{5}^{-1}+\lambda_{6})h]E|\delta Y_{t_{i}}|^{2}+\lambda_{5}\int_{t_{i}}^{t}E|\delta Y_{s}|^{2}\,\mathrm{d}s+(f_{x}\lambda_{5}^{-1}+\lambda_{6})E|\delta X_{t_{i}}^{\pi}|^{2}h
+(1+fzλ51)titE|δZs|2ds+C(λ5,λ6)h2.\displaystyle\leavevmode\nobreak\ +(1+f_{z}\lambda_{5}^{-1})\int_{t_{i}}^{t}E|\delta Z_{s}|^{2}\,\mathrm{d}s+C(\lambda_{5},\lambda_{6})h^{2}.

Arguing in the same way of (5.21), by Grönwall inequality, for sufficiently small hh, we have

E|δYti+1|2\displaystyle E|\delta Y_{t_{i+1}}|^{2}
\displaystyle\leq\leavevmode\nobreak\ eA8hE|δYti|2+A9E|δXti|2h+(1+fzλ51+λ6)titE|δZs|2ds+C(λ5,λ6)h2,\displaystyle e^{A_{8}h}E|\delta Y_{t_{i}}|^{2}+A_{9}E|\delta X_{t_{i}}|^{2}h+(1+f_{z}\lambda_{5}^{-1}+\lambda_{6})\int_{t_{i}}^{t}E|\delta Z_{s}|^{2}\,\mathrm{d}s+C(\lambda_{5},\lambda_{6})h^{2},

with A8Kλ51+λ5+λ6A_{8}\coloneqq K\lambda_{5}^{-1}+\lambda_{5}+\lambda_{6}, A9fxλ51+2λ6A_{9}\coloneqq f_{x}\lambda_{5}^{-1}+2\lambda_{6}. Choosing ϵ3=(1+fzλ51+λ6)1λ6\epsilon_{3}=(1+f_{z}\lambda_{5}^{-1}+\lambda_{6})^{-1}\lambda_{6} and using

titi+1E|δZt|2dt(1+ϵ3)E|δZ~ti|2h+(1+ϵ31)Ezi,\int_{t_{i}}^{t_{i+1}}E|\delta Z_{t}|^{2}\,\mathrm{d}t\leq(1+\epsilon_{3})E|\delta\tilde{Z}_{t_{i}}|^{2}h+(1+\epsilon_{3}^{-1})E_{z}^{i},

where δZ~ti=Z~tiZtiπ\delta\tilde{Z}_{t_{i}}=\tilde{Z}_{t_{i}}-Z_{t_{i}}^{\pi} and Ezi=titi+1E|ZtZ~ti|2dtE_{z}^{i}=\int_{t_{i}}^{t_{i+1}}E|Z_{t}-\tilde{Z}_{t_{i}}|^{2}\,\mathrm{d}t, we furthermore obtain

E|δYti+1|2eA8hE|δYti|2+A9E|δXti|2h+A10E|δZ~ti|2h+C(λ5,λ6)(h2+Ezi),E|\delta Y_{t_{i+1}}|^{2}\leq e^{A_{8}h}E|\delta Y_{t_{i}}|^{2}+A_{9}E|\delta X_{t_{i}}|^{2}h+A_{10}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C(\lambda_{5},\lambda_{6})(h^{2}+E_{z}^{i}), (5.22)

with A101+fzλ51+2λ6A_{10}\coloneqq 1+f_{z}\lambda_{5}^{-1}+2\lambda_{6}.

Define

Mi=max{E|δXi|2,E|δYi|2},0iN.\displaystyle M_{i}=\max\{E|\delta X_{i}|^{2},E|\delta Y_{i}|^{2}\},\quad 0\leq i\leq N.

Combining inequalities (5.21)(5.22) together yields

Mi+1\displaystyle M_{i+1}
\displaystyle\leq\leavevmode\nobreak\ (emax{A6,A8}h+max{A7,A9}h)Mi+A10E|δZ~ti|2h+C(λ5,λ6)(h2+Ezi)\displaystyle(e^{\max\{A_{6},A_{8}\}h}+\max\{A_{7},A_{9}\}h)M_{i}+A_{10}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C(\lambda_{5},\lambda_{6})(h^{2}+E_{z}^{i})
\displaystyle\leq\leavevmode\nobreak\ e(max{A6,A8}+max{A7,A9})hMi+A10E|δZ~ti|2h+C(λ5,λ6)(h2+Ezi).\displaystyle e^{(\max\{A_{6},A_{8}\}+\max\{A_{7},A_{9}\})h}M_{i}+A_{10}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C(\lambda_{5},\lambda_{6})(h^{2}+E_{z}^{i}).

Letting A11max{A6,A8}+max{A7,A9}A_{11}\coloneqq\max\{A_{6},A_{8}\}+\max\{A_{7},A_{9}\}, we have

Mi+1eA11hMi+A10E|δZ~ti|2h+C(λ5,λ6)(h2+Ezi).M_{i+1}\leq e^{A_{11}h}M_{i}+A_{10}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C(\lambda_{5},\lambda_{6})(h^{2}+E_{z}^{i}). (5.23)

We start from M0=E|Y0Y0π|2M_{0}=E|Y_{0}-Y_{0}^{\pi}|^{2} and apply inequality (5.23) repeatedly to obtain

MNA10eA11Ti=0N1E|δZ~ti|2h+C(λ5,λ6)[h+E|Y0Y0π|2],M_{N}\leq A_{10}e^{A_{11}T}\sum_{i=0}^{N-1}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C(\lambda_{5},\lambda_{6})[h+E|Y_{0}-Y_{0}^{\pi}|^{2}], (5.24)

in which for the last term we use the fact i=0N1EziCh\sum_{i=0}^{N-1}E_{z}^{i}\leq Ch from inequality (3.2). Note that

A10\displaystyle A_{10} =1+fzλ51+2λ6,\displaystyle=1+f_{z}\lambda_{5}^{-1}+2\lambda_{6},
A11\displaystyle A_{11} 2K+2Kλ51+λ5+3λ6.\displaystyle\leq 2K+2K\lambda_{5}^{-1}+\lambda_{5}+3\lambda_{6}.

Given any λ4>0\lambda_{4}>0, we can choose λ6\lambda_{6} small enough such that

(1+fzλ51+2λ6)eA11T\displaystyle(1+f_{z}\lambda_{5}^{-1}+2\lambda_{6})e^{A_{11}T} (1+λ4)(1+fzλ51)e(2K+2Kλ51+λ5)T.\displaystyle\leq(1+\lambda_{4})(1+f_{z}\lambda_{5}^{-1})e^{(2K+2K\lambda_{5}^{-1}+\lambda_{5})T}.

This condition and inequality (5.24) together give us

MN\displaystyle M_{N}\leq\leavevmode\nobreak\ (1+λ4)(1+fzλ51)e(2K+2Kλ51+λ5)Ti=0N1E|δZ~ti|2h\displaystyle(1+\lambda_{4})(1+f_{z}\lambda_{5}^{-1})e^{(2K+2K\lambda_{5}^{-1}+\lambda_{5})T}\sum_{i=0}^{N-1}E|\delta\tilde{Z}_{t_{i}}|^{2}h
+C(λ4,λ5)[h+E|Y0Y0π|2].\displaystyle+C(\lambda_{4},\lambda_{5})[h+E|Y_{0}-Y_{0}^{\pi}|^{2}]. (5.25)

Finally, by decomposing the objective function, we have

E|g(XTπ)YTπ|2\displaystyle\leavevmode\nobreak\ E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}
=\displaystyle= E|g(XTπ)g(XT)+YTYTπ|2\displaystyle\leavevmode\nobreak\ E|g(X_{T}^{\pi})-g(X_{T})+Y_{T}-Y_{T}^{\pi}|^{2}
\displaystyle\leq (1+(gx)1)E|g(XTπ)g(XT)|2+(1+gx)E|δYN|2\displaystyle\leavevmode\nobreak\ (1+(\sqrt{g_{x}})^{-1})E|g(X_{T}^{\pi})-g(X_{T})|^{2}+(1+\sqrt{g_{x}})E|\delta Y_{N}|^{2}
\displaystyle\leq (gx+gx)E|δXN|2+(1+gx)E|δYN|2\displaystyle\leavevmode\nobreak\ (g_{x}+\sqrt{g_{x}})E|\delta X_{N}|^{2}+(1+\sqrt{g_{x}})E|\delta Y_{N}|^{2}
\displaystyle\leq (1+gx)2MN.\displaystyle\leavevmode\nobreak\ (1+\sqrt{g_{x}})^{2}M_{N}. (5.26)

We complete our proof by combining inequalities (5.25)(5.26) and choosing λ5=argminx+H(x)\lambda_{5}=\mathrm{argmin}_{x\in\mathbb{R}^{+}}H(x). ∎

Proof of Lemma 4.

We use the same notations as in the proof of Lemma 1. As derived in (4.12), for any λ7>fz0\lambda_{7}>f_{z}\geq 0, we have

E|δYi+1|2[1(2kf+λ7)h]E|δYi|2+(1fzλ71)E|δZi|2hfxλ71E|δXi|2h.E|\delta Y_{i+1}|^{2}\geq[1-(2k_{f}+\lambda_{7})h]E|\delta Y_{i}|^{2}+(1-f_{z}\lambda_{7}^{-1})E|\delta Z_{i}|^{2}h-f_{x}\lambda_{7}^{-1}E|\delta X_{i}|^{2}h. (5.27)

Multiplying both sides of (5.27) by eA5ih(eA5T1)/(1fzλ71)e^{A_{5}ih}(e^{-A_{5}T}\vee 1)/(1-f_{z}\lambda_{7}^{-1}) gives us

λ7(eA5T1)λ7fz{eA5ihE|δYi+1|2eA5(i1)hE|δYi|2+eA5ihfxλ7E|δXi|2h}\displaystyle\leavevmode\nobreak\ \frac{\lambda_{7}(e^{-A_{5}T}\vee 1)}{\lambda_{7}-f_{z}}\Big{\{}e^{A_{5}ih}E|\delta Y_{i+1}|^{2}-e^{A_{5}(i-1)h}E|\delta Y_{i}|^{2}+e^{A_{5}ih}\frac{f_{x}}{\lambda_{7}}E|\delta X_{i}|^{2}h\Big{\}}
\displaystyle\geq eA5ih(eA5T1)E|δZi|2h\displaystyle\leavevmode\nobreak\ e^{A_{5}ih}(e^{-A_{5}T}\vee 1)E|\delta Z_{i}|^{2}h
\displaystyle\geq E|δZi|2h.\displaystyle E|\delta Z_{i}|^{2}h. (5.28)

Summing (5.28) up from i=0i=0 to N1N-1, we obtain

i=0N1E|δZi|2hλ7(eA5T1)λ7fz{eA5TA5hE|δYN|2+fxλ7i=0N1eA5ihE|δXi|2h}.\displaystyle\sum_{i=0}^{N-1}E|\delta Z_{i}|^{2}h\leq\frac{\lambda_{7}(e^{-A_{5}T}\vee 1)}{\lambda_{7}-f_{z}}\Big{\{}e^{A_{5}T-A_{5}h}E|\delta Y_{N}|^{2}+\frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{A_{5}ih}E|\delta X_{i}|^{2}h\Big{\}}. (5.29)

Note that E|δYN|2gxE|δXN|2E|\delta Y_{N}|^{2}\leq g_{x}E|\delta X_{N}|^{2} by Assumption 1. Plugging it into (5.29), we arrive at the desired result. ∎

Proof of Lemma 5.

We prove by induction backwardly. Let ZtNπ,=0Z_{t_{N}}^{\pi,^{\prime}}=0 for convenience. It is straightforward to see that the statement holds for i=Ni=N. Assume the statement holds for i=k+1i=k+1. For i=ki=k, we know Ytk+1π,=Uk+1(Xtk+1π,Ytk+1π)Y_{t_{k+1}}^{\pi,^{\prime}}=U_{k+1}(X_{t_{k+1}}^{\pi},Y_{t_{k+1}}^{\pi}). Recalling the definition of {Xtiπ}0iN,{Ytiπ}0iN\{X_{t_{i}}^{\pi}\}_{0\leq i\leq N},\{Y_{t_{i}}^{\pi}\}_{0\leq i\leq N} in (2.3), we can rewrite Ytk+1π,=U¯k(Xtkπ,Ytkπ,ΔWk)Y_{t_{k+1}}^{\pi,^{\prime}}=\overline{U}_{{k}}(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi},\Delta W_{k}), with U¯k:m××d\overline{U}_{k}:\mathbb{R}^{m}\times\mathbb{R}\times\mathbb{R}^{d}\rightarrow\mathbb{R} being a deterministic function. Note Ztkπ,=h1E[U¯k(Xtkπ,Ytkπ,ΔWk)ΔWk|tk]Z_{t_{k}}^{\pi,^{\prime}}=h^{-1}E[\overline{U}_{k}(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi},\Delta W_{k})\Delta W_{k}|\mathcal{F}_{t_{k}}]. Since ΔWk\Delta W_{k} is independent of tk\mathcal{F}_{t_{k}}, there exists a deterministic function Vkπ:m×dV_{k}^{\pi}:\mathbb{R}^{m}\times\mathbb{R}\rightarrow\mathbb{R}^{d} such that Ztkπ,=Vkπ(Xtkπ,Ytkπ)Z_{t_{k}}^{\pi,^{\prime}}=V_{k}^{\pi}(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi}).

Next we consider Ytkπ,Y_{t_{k}}^{\pi,^{\prime}}. Let Hk=L2(Ω,σ(Xtkπ,Ytkπ),)H_{k}=L^{2}(\Omega,\sigma(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi}),\mathbb{P}), where σ(Xtkπ,Ytkπ)\sigma(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi}) denotes the σ\sigma-algebra generated by Xtkπ,YtkπX_{t_{k}}^{\pi},Y_{t_{k}}^{\pi}. We know HkH_{k} is a Banach space and another equivalent representation is

Hk={Y=ϕ(Xtkπ,Ytkπ)|ϕ is measurable and E|Y|2<}.H_{k}=\{Y=\phi(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi})\leavevmode\nobreak\ |\leavevmode\nobreak\ \phi\text{ is measurable and }E|Y|^{2}<\infty\}.

Consider the following map defined on HkH_{k}:

Φk(Y)=E[Ytk+1π,+f(tk,Xtkπ,Y,Ztkπ,)h|tk].\Phi_{k}(Y)=E[{Y}_{t_{k+1}}^{\pi,^{\prime}}+f(t_{k},{X}_{t_{k}}^{\pi},Y,{Z}_{t_{k}}^{\pi,^{\prime}})h|\mathcal{F}_{t_{k}}].

By Assumption 3, Φk(Y)\Phi_{k}(Y) is square-integrable. Furthermore, following the same argument for Ztkπ,Z_{t_{k}}^{\pi,^{\prime}}, Φk(Y)\Phi_{k}(Y) can also be represented as a deterministic function of Xtkπ,YtkπX_{t_{k}}^{\pi},Y_{t_{k}}^{\pi}. Hence, Φk(Y)Hk\Phi_{k}(Y)\in H_{k}. Note that Assumption 1 implies E|Φk(Y1)Φk(Y2)|2Kh2E|Y1Y2|2E|\Phi_{k}(Y_{1})-\Phi_{k}(Y_{2})|^{2}\leq Kh^{2}E|Y_{1}-Y_{2}|^{2}. Therefore Φk\Phi_{k} is a contraction map on HkH_{k} when h<1/Kh<1/\sqrt{K}. By the Banach fixed-point theorem, there exists a unique fixed-point Y=ϕk(Xtkπ,Ytkπ)HkY^{*}=\phi_{k}^{*}(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi})\in H_{k} satisfying Y=Φk(Y)Y^{*}=\Phi_{k}(Y^{*}). We choose Ukπ=ϕkU_{k}^{\pi}=\phi_{k}^{*} to validate the statement for Ytkπ,Y_{t_{k}}^{\pi,^{\prime}}.

When bb and σ\sigma are independent of yy, all of the arguments above can be made similarly with Uiπ,ViπU_{i}^{\pi},V_{i}^{\pi} also being independent of YY. ∎

6 Numerical Examples

6.1 General Setting

In this section, we illustrate the proposed numerical scheme by solving two high-dimensional coupled FBSDEs adapted from the literature. The common setting for these two numerical examples is as follows. We assume d=m=100d=m=100, that is, Xt,Zt,Wt100X_{t},Z_{t},W_{t}\in\mathbb{R}^{100}. Assume ξ\xi is deterministic and we are interested in the approximation error of Y0Y_{0}, which is also a deterministic scalar.

We use N1N-1 fully-connected feedforward neural networks to parameterize ϕiπ,i=0,1,,N1\phi_{i}^{\pi},i=0,1,\dots,N-1. Each of the neural networks has 2 hidden layers with dimension d+10d+10. The input has dimension d+1(Xid,Yi)d+1\leavevmode\nobreak\ (X_{i}\in\mathbb{R}^{d},Y_{i}\in\mathbb{R}) and the output has dimension dd. In practice, one can of course choose XiX_{i} only as the input. We additionally test this input for the two examples and find no difference in terms of the relative error of Y0Y_{0} (up to second decimal place). We use the rectifier function (ReLU) as the activation function and adopt batch normalization [36] right after each matrix multiplication and before activation. We employ the Adam optimizer [37] to optimize the parameters with batch-size being 64. The loss function is computed based on 256 validation sample paths. We initialize all the parameters using a uniform or normal distribution and run each experiment 5 times to report the average result.

6.2 Example 1

The first problem is adapted from [38, 39], in which the original spatial dimension of the problem is 1. We consider the following coupled FBSDEs

{Xj,t=xj,0+0tXj,s(1+Xj,s2)(2+Xj,s)3ds+0t1+Xj,s22+Xj,s21+2Ys21+Ys2+exp(2|Xs|2d(s+5))dWj,s,j=1,,d,Yt=exp(|XT|2d(T+5))+tTa(s,Xs,Ys)+j=1db(s,Xj,s,Ys)Zj,sdstT(Zs)TdWs,\displaystyle\begin{dcases}X_{j,t}=x_{j,0}+\int_{0}^{t}\frac{X_{j,s}(1+X_{j,s}^{2})}{(2+X_{j,s})^{3}}\,\mathrm{d}s\\ \qquad\quad+\int_{0}^{t}\frac{1+X_{j,s}^{2}}{2+X_{j,s}^{2}}\sqrt{\frac{1+2Y_{s}^{2}}{1+Y_{s}^{2}+\exp{\Big{(}-\frac{2|X_{s}|^{2}}{d(s+5)}\Big{)}}}}\,\mathrm{d}W_{j,s},\quad j=1,\dots,d,\\ Y_{t}=\exp{\Big{(}-\frac{|X_{T}|^{2}}{d(T+5)}\Big{)}}\\ \qquad\leavevmode\nobreak\ +\int_{t}^{T}a(s,X_{s},Y_{s})+\sum_{j=1}^{d}b(s,X_{j,s},Y_{s})Z_{j,s}\,\mathrm{d}s-\int_{t}^{T}(Z_{s})^{\operatorname{T}}\,\mathrm{d}W_{s},\end{dcases} (6.1)

where Xj,t,Zj,t,Wj,tX_{j,t},Z_{j,t},W_{j,t} denote the jj-th components of Xt,Yt,WtX_{t},Y_{t},W_{t}, and the coefficient functions are given as

a(t,x,u)=\displaystyle a(t,x,u)= 1d(t+5)exp(|x|2d(t+5))\displaystyle\leavevmode\nobreak\ \frac{1}{d(t+5)}\exp{\Big{(}-\frac{|x|^{2}}{d(t+5)}\Big{)}}
×j=1d{4xj2(1+xj2)(2+xj2)3+(1+xj2)2(2+xj2)22xj2(1+xj2)2d(t+5)(2+xj2)2xj2t+5},\displaystyle\leavevmode\nobreak\ \times\sum_{j=1}^{d}\Bigg{\{}\frac{4x_{j}^{2}(1+x_{j}^{2})}{(2+x_{j}^{2})^{3}}+\frac{(1+x_{j}^{2})^{2}}{(2+x_{j}^{2})^{2}}-\frac{2x_{j}^{2}(1+x_{j}^{2})^{2}}{d(t+5)(2+x_{j}^{2})^{2}}-\frac{x_{j}^{2}}{t+5}\Bigg{\}},
b(t,xj,u)=\displaystyle b(t,x_{j},u)= xj(2+xj2)21+u2+exp(|x|2d(t+5))1+2u2.\displaystyle\leavevmode\nobreak\ \frac{x_{j}}{(2+x_{j}^{2})^{2}}\sqrt{\frac{1+u^{2}+\exp{\Big{(}-\frac{|x|^{2}}{d(t+5)}\Big{)}}}{1+2u^{2}}}.

It can be verified by Itô’s formula that the YY part of the solution of (6.1) is given by

Yt=exp(|Xt|2d(t+5)).\displaystyle Y_{t}=\exp{\Big{(}-\frac{|X_{t}|^{2}}{d(t+5)}\Big{)}}.

Let ξ=(1,1,,1)\xi=(1,1,\dots,1) (100100-dimensional), T=5,N=160T=5,N=160. The initial guess of Y0Y_{0} is generated from a uniform distribution on the interval [2,4][2,4] while the true value of Y00.81873Y_{0}\approx 0.81873. We train 25000 steps with an exponential decay learning rate that decays every 100 steps, with the starting learning rate being 1e-2 and ending learning rate being 1e-5. Figure 1 illustrates the mean of the loss function and relative approximation error of Y0Y_{0} against the number of iteration steps. All runs converged and the average final relative error of Y0Y_{0} is 0.39%0.39\%.

Refer to caption
Refer to caption
Figure 1: Loss function (left) and relative approximation error of Y0Y_{0} (right) against the number of iteration steps in the case of Example 1 (100100-dimensional). The proposed deep BSDE method achieves a relative error of size 0.39%0.39\%. The shaded area depicts the mean ±\pm the standard deviation of the associated quantity in 5 runs.

6.3 Example 2

The second problem is adapted from [16], in which the spatial dimension is originally tested up to 10. The coupled FBSDEs are given by

{Xj,t=xj,0+0tσYsdWj,s,j=1,,d,Yt=Dj=1dsin(Xj,T)+tTrYs+12e3r(Ts)σ2(Dj=1dsin(Xj,s))3dstT(Zs)TdWs,\displaystyle\begin{dcases}X_{j,t}=x_{j,0}+\int_{0}^{t}\sigma Y_{s}\,\mathrm{d}W_{j,s},\quad j=1,\dots,d,\\ Y_{t}=D\sum_{j=1}^{d}\sin(X_{j,T})\\ \qquad\leavevmode\nobreak\ +\int_{t}^{T}-rY_{s}+\frac{1}{2}e^{-3r(T-s)\sigma^{2}}\Big{(}D\sum_{j=1}^{d}\sin(X_{j,s})\Big{)}^{3}\,\mathrm{d}s-\int_{t}^{T}(Z_{s})^{\operatorname{T}}\,\mathrm{d}W_{s},\end{dcases} (6.2)

where σ>0,r,D\sigma>0,r,D are constants. One can easily check by Itô’s formula that the YY part of the solution of (6.2) is

Yt=er(Tt)Dj=1dsin(Xj,t).Y_{t}=e^{-r(T-t)}D\sum_{j=1}^{d}\sin(X_{j,t}).

Let ξ=(π/2,π/2,,π/2)\xi=(\pi/2,\pi/2,\dots,\pi/2) (100100-dimensional), T=1,r=0.1,σ=0.3,D=0.1T=1,r=0.1,\sigma=0.3,D=0.1. The initial guess of Y0Y_{0} is generated from a uniform distribution on the interval [0,1][0,1] while the true value of Y09.04837Y_{0}\approx 9.04837. We train 5000 steps with an exponential decay learning rate that decays every 100 steps, with the starting learning rate being 1e-2 and the ending learning rate being 1e-3. When h=0.005(N=200)h=0.005\leavevmode\nobreak\ (N=200), the relative approximation error of Y0Y_{0} is 0.09%0.09\%. Furthermore, we test the influence of the time partition by choosing different values of NN. In all cases, the training converged, and we plot in Figure 2 the mean of relative error of Y0Y_{0} against the number of time steps NN. It is clearly shown that the error decreases as NN increases (hh decreases).

Refer to caption
Figure 2: Relative approximation error of Y0Y_{0} against the time step size hh in the case of Example 2 (100100-dimensional). The proposed deep BSDE method achieves a relative error of size 0.09%0.09\% when N=200(h=0.005)N=200\leavevmode\nobreak\ (h=0.005).

References

  • [1] Etienne Pardoux and Shige Peng. Backward stochastic differential equations and quasilinear parabolic partial differential equations. In Stochastic Partial Differential Equations and Their Applications, pages 200–217. Springer, 1992.
  • [2] Richard E. Bellman. Dynamic Programming. Princeton University Press, 1957.
  • [3] Christian Bender and Jessica Steiner. Least-squares Monte Carlo for backward SDEs. In Numerical Methods in Finance, pages 257–289. Springer, 2012.
  • [4] Bruno Bouchard, Ivar Ekeland, and Nizar Touzi. On the Malliavin approach to Monte Carlo approximation of conditional expectations. Finance and Stochastics, 8(1):45–71, 2004.
  • [5] Bruno Bouchard and Nizar Touzi. Discrete-time approximation and Monte-Carlo simulation of backward stochastic differential equations. Stochastic Processes and their Applications, 111(2):175–206, 2004.
  • [6] Pierre Henry-Labordere. Counterparty risk valuation: A marked branching diffusion approach. Available at SSRN 1995503, 2012.
  • [7] Pierre Henry-Labordere, Nadia Oudjane, Xiaolu Tan, Nizar Touzi, Xavier Warin, et al. Branching diffusion representation of semilinear PDEs and Monte Carlo approximation. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 55, pages 184–210. Institut Henri Poincaré, 2019.
  • [8] Martin Hutzenthaler, Arnulf Jentzen, and Thomas Kruse. Multilevel Picard iterations for solving smooth semilinear parabolic heat equations. arXiv preprint arXiv:1607.03295, 2016.
  • [9] Weinan E, Martin Hutzenthaler, Arnulf Jentzen, and Thomas Kruse. On multilevel Picard numerical approximations for high-dimensional nonlinear parabolic partial differential equations and high-dimensional nonlinear backward stochastic differential equations. Journal of Scientific Computing, 79(3):1534–1571, 2019.
  • [10] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, Tuan Anh Nguyen, and Philippe von Wurstemberger. Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations. arXiv preprint arXiv:1807.01212, 2018.
  • [11] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
  • [12] Weinan E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.
  • [13] Christian Beck, Weinan E, and Arnulf Jentzen. Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. arXiv preprint arXiv:1709.05963, 2017.
  • [14] Jiequn Han and Ruimeng Hu. Deep fictitious play for finding Markovian Nash equilibrium in multi-agent games. arXiv preprint arXiv:1912.01809, 2019.
  • [15] Jiequn Han, Jianfeng Lu, and Mo Zhou. Solving high-dimensional eigenvalue problems using deep neural networks: A diffusion Monte Carlo like approach. arXiv preprint arXiv:2002.02600, 2020.
  • [16] Christian Bender and Jianfeng Zhang. Time discretization and Markovian iteration for coupled FBSDEs. The Annals of Applied Probability, 18(1):143–177, 2008.
  • [17] Jin Ma, Philip Protter, and Jiongmin Yong. Solving forward-backward stochastic differential equations explicitly–a four step scheme. Probability Theory and Related Fields, 98(3):339–359, 1994.
  • [18] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
  • [19] Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.
  • [20] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • [21] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
  • [22] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • [23] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.
  • [24] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.
  • [25] Hrushikesh N Mhaskar and Tomaso Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848, 2016.
  • [26] Helmut Bölcskei, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. Optimal approximation with sparsely connected deep neural networks. arXiv preprint arXiv:1705.01714, 2017.
  • [27] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • [28] Philipp Grohs, Fabian Hornung, Arnulf Jentzen, and Philippe von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv preprint arXiv:1809.02362, 2018.
  • [29] Arnulf Jentzen, Diyora Salimova, and Timo Welti. A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients. arXiv preprint arXiv:1809.07321, 2018.
  • [30] Julius Berner, Philipp Grohs, and Arnulf Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv preprint arXiv:1809.03062, 2018.
  • [31] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, and Tuan Anh Nguyen. A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations. SN Partial Differential Equations and Applications, 1:1–34, 2020.
  • [32] Jin Ma and Jiongmin Yong. Forward-Backward Stochastic Differential Equations and their Applications. Springer Berlin Heidelberg, 2007.
  • [33] Fabio Antonelli. Backward-forward stochastic differential equations. The Annals of Applied Probability, pages 777–793, 1993.
  • [34] Etienne Pardoux and Shanjian Tang. Forward-backward stochastic differential equations and quasilinear parabolic PDEs. Probability Theory and Related Fields, 114(2):123–150, 1999.
  • [35] Jianfeng Zhang. A numerical scheme for BSDEs. The Annals of Applied Probability, 14(1):459–488, 2004.
  • [36] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
  • [37] Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • [38] G.N. Milstein and M.V. Tretyakov. Numerical algorithms for forward-backward stochastic differential equations. SIAM Journal on Scientific Computing, 28(2):561–582, 2006.
  • [39] T.P. Huijskens, M.J. Ruijter, and C.W. Oosterlee. Efficient numerical Fourier methods for coupled forward–backward SDEs. Journal of Computational and Applied Mathematics, 296:593–612, 2016.