Convergence of the Deep BSDE Method
for Coupled FBSDEs

Jiequn Han jiequnh@princeton.edu Jihao Long jihaol@pku.edu.cn School of Mathematical Sciences, Peking University, Beijing 100871, China

Abstract

The recently proposed numerical algorithm, deep BSDE method, has shown remarkable performance in solving high-dimensional forward-backward stochastic differential equations (FBSDEs) and parabolic partial differential equations (PDEs). This article lays a theoretical foundation for the deep BSDE method in the general case of coupled FBSDEs. In particular, a posteriori error estimation of the solution is provided and it is proved that the error converges to zero given the universal approximation capability of neural networks. Numerical results are presented to demonstrate the accuracy of the analyzed algorithm in solving high-dimensional coupled FBSDEs.

1 Introduction

Forward-backward stochastic differential equations (FBSDEs) and partial differential equations (PDEs) of parabolic type have found numerous applications in stochastic control, finance, physics, etc., as a ubiquitous modeling tool. In most situations encountered in practice the equations cannot be solved analytically but require certain numerical algorithms to provide approximate solutions. On the one hand, the dominant choices of numerical algorithms for PDEs are mesh-based methods, such as finite differences, finite elements, etc. On the other hand, FBSDEs can be tackled directly through probabilistic means, with appropriate methods for the approximation of conditional expectation. Since these two kinds of equations are intimately connected through the nonlinear Feynman–Kac formula [1], the algorithms designed for one kind of equation can often be used to solve another one.

However, the aforementioned numerical algorithms become more and more difficult, if not impossible, when the dimension increases. They are doomed to run into the so-called “curse of dimensionality” [2] when the dimension is high, namely, the computational complexity grows exponentially as the dimension grows. The classical mesh-based algorithms for PDEs require a mesh of size $O(N^{d})$ . The simulation of FBSDEs faces a similar difficulty in the general nonlinear cases, due to the need to compute conditional expectation in high dimension. The conventional methods, including the least squares regression [3], Malliavin approach [4], and kernel regression [5], are all of exponential complexity. There are a limited number of cases where practical high-dimensional algorithms are available. For example, in the linear case, Feynman–Kac formula and Monte Carlo simulation together provide an efficient approach to solving PDEs and associated BSDEs numerically. In addition, methods based on the branching diffusion process [6, 7] and multilevel Picard iteration [8, 9, 10] overcome the curse of dimensionality in their considered settings. We refer [9] for the detailed discussion on the complexity of the algorithms mentioned above. Overall there is no numerical algorithm in literature so far proved to overcome the curse of dimensionality for general quasilinear parabolic PDEs and the corresponding FBSDEs.

A recently developed algorithm, called the deep BSDE method [11, 12], has shown astonishing power in solving general high-dimensional FBSDEs and parabolic PDEs [13, 14, 15]. In contrast to conventional methods, the deep BSDE method employs neural networks to approximate unknown gradients and reformulates the original equation-solving problem into a stochastic optimization problem. Thanks to the universal approximation capability and parsimonious parameterization of neural networks, in practice the objective function can be effectively optimized in high-dimensional cases, and the function values of interests are obtained quite accurately.

The deep BSDE method was initially proposed for decoupled FBSDEs. In this paper, we extend the method to deal with coupled FBSDEs and a broader class of quasilinear parabolic PDEs. Furthermore, we present an error analysis of the proposed scheme, including decoupled FBSDEs as a special case. Our theoretical result consists of two theorems. Theorem 1 provides a posteriori error estimation of the deep BSDE method. As long as the objective function is optimized to be close to zero under fine time discretization, the approximate solution is close to the true solution. In other words, in practice, the accuracy of the numerical solution is effectively indicated by the value of the objective function. Theorem 2 shows that such a situation is attainable, by relating the infimum of the objective function to the expression ability of neural networks. As an implication of the universal approximation property (in the $L^{2}$ sense), there exist neural networks with suitable parameters such that the obtained numerical solution is approximately accurate. To the best of our knowledge, this is the first theoretical result of the deep BSDE method for solving FBSDEs and parabolic PDEs. Although our numerical algorithm is based on neural networks, the theoretical result provided here is equally applicable to the algorithms based on other forms of function approximations.

The article is organized as follows. In section 2, we precisely state our numerical scheme for coupled FBSDEs and quasilinear parabolic PDEs and give the main theoretical results of the proposed numerical scheme. In section 3, the basic assumptions and some useful results from the literature are given for later use. The proofs of the two main theorems are provided in section 4 and section 5, respectively. Some numerical experiments with the proposed scheme are presented in section 6.

2 A Numerical Scheme for Coupled FBSDEs and Main Results

Let $T\in(0,+\infty)$ be the terminal time, $(\Omega,\mathbb{F},\{\mathcal{F}_{t}\}_{0\leq t\leq T},\mathbb{P})$ be a filtered probability space equipped with a $d$ -dimensional standard Brownian motion $\{W_{t}\}_{0\leq t\leq T}$ starting from $0$ . $\xi$ is a square-integrable random variable independent of $\{W_{t}\}_{0\leq t\leq T}$ . We use the same notation $(\Omega,\mathbb{F},\{\mathcal{F}_{t}\}_{0\leq t\leq T},\mathbb{P})$ to denote the filtered probability space generated by $\{W_{t}+\xi\}_{0\leq t\leq T}$ . The notation $|x|$ denotes the Euclidean norm of a vector $x$ and $\|A\|=\sqrt{\mathrm{trace}(A^{\operatorname{T}}A)}$ denotes the Frobenius norm of a matrix $A$ .

Consider the following coupled FBSDEs

	$\displaystyle X_{t}=\xi+\int_{0}^{t}b(s,X_{s},Y_{s})\,\mathrm{d}s+\int_{0}^{t}\sigma(s,X_{s},Y_{s})\,\mathrm{d}W_{s},$		(2.1)
	$\displaystyle Y_{t}=g(X_{T})+\int_{t}^{T}f(s,X_{s},Y_{s},Z_{s})\,\mathrm{d}s-\int_{t}^{T}(Z_{s})^{\operatorname{T}}\,\mathrm{d}W_{s},$		(2.2)

in which $X_{t}$ takes values in $\mathbb{R}^{m}$ , $Y_{t}$ takes values in $\mathbb{R}$ , and $Z_{t}$ takes values in $\mathbb{R}^{d}$ . Here we assume $Y_{t}$ to be one-dimensional to simplify the presentation. The result can be extended without any difficulty to the case where $Y_{t}$ is multi-dimensional. We say $(X_{t},Y_{t},Z_{t})$ is a solution of the above FBSDEs, if all its components are $\mathcal{F}_{t}$ -adapted and square-integrable, together satisfying equations (2.1)(2.2).

Solving coupled FBSDEs numerically is more difficult than solving decoupled FBSDEs. Except the Picard iteration method developed in [16], most methods exploit the relation to quasilinear parabolic PDEs via the four-time-step-scheme in [17]. This type of methods suffers from high dimensionality due to spatial discretization of PDEs. In contrast, our strategy, starting from simulating the coupled FBSDEs directly, is a new purely probabilistic scheme. To state the numerical algorithm precisely, we consider a partition of the time interval $[0,T]$ , $\pi:0=t_{0}<t_{1}<\dots<t_{N}=T$ with $h=T/N$ and $t_{i}=ih$ . Let $\Delta W_{i}\coloneqq W_{t_{i+1}}-W_{t_{i}}$ for $i=0,1,\dots,N-1$ . Inspired by the nonlinear Feynman–Kac formula that will be introduced below, we view $Y_{0}$ as a function of $X_{0}$ and view $Z_{t}$ as a function of $X_{t}$ and $Y_{t}$ . Equipped with this viewpoint, our goal becomes finding appropriate functions $\mu_{0}^{\pi}:\mathbb{R}^{m}\rightarrow\mathbb{R}$ and $\phi_{i}^{\pi}:\mathbb{R}^{m}\times\mathbb{R}\rightarrow\mathbb{R}^{d}$ for $i=0,1,\dots,N-1$ such that $\mu_{0}^{\pi}(\xi)$ and $\phi_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})$ can serve as good surrogates of $Y_{0}$ and $Z_{t_{i}}$ , respectively. To this end, we consider the classical Euler scheme

\displaystyle\begin{dcases}X_{0}^{\pi}=\xi,\quad Y_{0}^{\pi}=\mu_{0}^{\pi}(\xi),\\ X_{t_{i+1}}^{\pi}=X_{t_{i}}^{\pi}+b(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})h+\sigma(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})\Delta W_{i},\\ Z_{t_{i}}^{\pi}=\phi_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}),\\ Y_{t_{i+1}}^{\pi}=Y_{t_{i}}^{\pi}-f(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi})h+(Z_{t_{i}}^{\pi})^{\operatorname{T}}\Delta W_{i}.\end{dcases}

(2.3)

Without loss of clarity, here we use the notation $X_{0}^{\pi}$ as $X_{t_{0}}^{\pi}$ , $X_{T}^{\pi}$ as $X_{t_{N}}^{\pi}$ , etc.

Following the spirit of the deep BSDE method, we employ a stochastic optimizer to solve the following stochastic optimization problem

\displaystyle\inf_{\mu_{0}^{\pi}\in\mathcal{N}^{\prime}_{0},\phi_{i}^{\pi}\in\mathcal{N}_{i}}F(\mu_{0}^{\pi},\phi_{0}^{\pi},\dots,\phi_{N-1}^{\pi})\coloneqq E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2},

(2.4)

where $\mathcal{N}^{\prime}_{0}$ and $\mathcal{N}_{i}\leavevmode\nobreak\ (0\leq i\leq{N-1})$ are parametric function spaces generated by neural networks. To see intuitively where the objective function (2.4) comes from, we consider the following variational problem:

	$\displaystyle\inf_{Y_{0},\{Z_{t}\}_{0\leq t\leq T}}E\|g(X_{T})-Y_{T}\|^{2},$		(2.5)
	$\displaystyle s.t.\leavevmode\nobreak\ \leavevmode\nobreak\ X_{t}=\xi+\int_{0}^{t}b(s,X_{s},Y_{s})\,\mathrm{d}s+\int_{0}^{t}\sigma(s,X_{s},Y_{s})\,\mathrm{d}W_{s},$
	$\displaystyle\qquad Y_{t}=Y_{0}-\int_{0}^{t}f(s,X_{s},Y_{s},Z_{s})\,\mathrm{d}s+\int_{0}^{t}(Z_{s})^{\operatorname{T}}\,\mathrm{d}W_{s},$

where $Y_{0}$ is $\mathcal{F}_{0}$ -measurable and square-integrable, and $Z_{t}$ is a $\mathcal{F}_{t}$ -adapted square-integrable process. The solution of the FBSDEs (2.1)(2.2) is a minimizer of the above problem since the loss function attains zero when it is evaluated at the solution. In addition, the wellposedness of the FBSDEs (under some regularity conditions) ensures the existence and uniqueness of the minimizer. Therefore, we expect (2.4), as a discretized counterpart of (2.5), defines a benign optimization problem and the associated near-optimal solution provides us a good approximate solution of the original FBSDEs. The reason we do not represent $Z_{t_{i}}$ as a function of $X_{t_{i}}$ only is that the process $\{X_{t_{i}}^{\pi}\}_{0\leq i\leq N}$ is not Markovian, while the process $\{X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\}_{0\leq i\leq N}$ is Markovian, which facilitates our analysis considerably. If $b$ and $\sigma$ are both independent of $Y$ , then the FBSDEs (2.1)(2.2) are decoupled, we can take $\phi_{i}^{\pi}$ as a function of $X_{t_{i}}^{\pi}$ only, as the numerical scheme introduced in [11, 12].

Our two main theorems regarding the deep BSDE method are the following, mainly on the justification and property of the objective function (2.4) in the general coupled case, regardless of the specific choice of parametric function spaces. An important assumption for the two theorems is the so-called weak coupling or monotonicity condition, which will be explained in detail in section 3. The precise statement of the theorems can be found in Theorem 1^′ (section 4) and Theorem 2^′ (section 5), respectively.

Theorem 1.

Under some assumptions, there exists a constant C, independent of h, d, and m, such that for sufficiently small h,

\sup_{t\in[0,T]}(E|X_{t}-\hat{X}_{t}^{\pi}|^{2}+E|Y_{t}-\hat{Y}_{t}^{\pi}|^{2})+\int_{0}^{T}E|Z_{t}-\hat{Z}_{t}^{\pi}|^{2}\,\mathrm{d}t\leq C[h+E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}],

(2.6)

where $\hat{X}_{t}^{\pi}=X_{t_{i}}^{\pi}$ , $\hat{Y}_{t}^{\pi}=Y_{t_{i}}^{\pi}$ , $\hat{Z}_{t}^{\pi}=Z_{t_{i}}^{\pi}$ for $t\in[t_{i},t_{i+1})$ .

Theorem 2.

Under some assumptions, there exists a constant C, independent of h, d and m, such that for sufficiently small h,

		$\displaystyle\inf_{\mu_{0}^{\pi}\in\mathcal{N}^{\prime}_{0},\phi_{i}^{\pi}\in\mathcal{N}_{i}}E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}$
	$\displaystyle\leq\leavevmode\nobreak\$	$\displaystyle C\Big{\{}h+\inf_{\mu_{0}^{\pi}\in\mathcal{N}^{\prime}_{0},\phi_{i}^{\pi}\in\mathcal{N}_{i}}\big{[}E\|Y_{0}-\mu_{0}^{\pi}(\xi)\|^{2}$
		$\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\leavevmode\nobreak\ \leavevmode\nobreak\ +\sum_{i=0}^{N-1}E\|E[\tilde{Z}_{t_{i}}\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-\phi_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})\|^{2}h\big{]}\Big{\}},$

where $\tilde{Z}_{t_{i}}=h^{-1}E[\int_{t_{i}}^{t_{i+1}}Z_{t}\,\mathrm{d}t|\mathcal{F}_{t_{i}}]$ . If $b$ and $\sigma$ are independent of $Y$ , the term $E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]$ can be replaced with $E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi}]$ .

Briefly speaking, Theorem 1 states that the simulation error (left side of equation (2.6)) can be bounded through the value of the objective function (2.4). To the best of our knowledge, this is the first result for the error estimation of the coupled FBSDEs, concerning both time discretization error and terminal distance. Theorem 2 states that the optimal value of the objective function can be small if the approximation capability of the parametric function spaces ( $\mathcal{N}^{\prime}_{0}$ and $\mathcal{N}_{i}$ above) is high. Neural networks are a promising candidate for such a requirement, especially in high-dimensional problems. There are numerous results, dating back to the 90s (see, e.g., [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]), in regard to the universal approximation and complexity of neural networks. There are also some recent analysis [28, 29, 30, 31] on approximating the solutions of certain parabolic partial differential equations with neural networks. However, the problem is still far from resolved. Theorem 2 implies that if the involved conditional expectations can be approximated by neural networks whose numbers of parameters growing at most polynomially both in the dimension and the reciprocal of the required accuracy, then the solutions of the considered FBSDEs can be represented in practice without the curse of dimensionality. Under what conditions this assumption is true is beyond the scope of this work and remains for further investigation.

The above-mentioned scheme in (2.3)(2.4) is for solving FBSDEs. The so-called nonlinear Feynman–Kac formula, connecting FBSDEs with the quasilinear parabolic PDEs, provides an approach to numerically solve quasilinear parabolic PDEs (2.7) below through the same scheme. We recall a concrete version of the nonlinear Feynman–Kac formula in Theorem 3 below and refer interested readers to e.g., [32] for more details. According to this formula, the term $E|Y_{0}-Y_{0}^{\pi}|^{2}$ can be interpreted as $E|u(0,\xi)-\mu_{0}^{\pi}(\xi)|^{2}$ . Therefore, we can choose the random variable $\xi$ with a delta distribution, a uniform distribution in a bounded region, or any other distribution we are interested in. After solving the optimization problem, we obtain $\mu_{0}^{\pi}(\xi)$ as an approximation of $u(0,\xi)$ . See [11, 12] for more details.

Theorem 3.

Assume

1.

$m=d$ and $b(t,x,y)$ , $\sigma(t,x,y)$ , $f(t,x,y,z)$ are smooth functions with bounded first-order derivatives with respect to $x,y,z$ .

There exist a positive continuous function $\nu$ and a constant $\mu$ , satisfying that

	$\displaystyle\nu(\|y\|)\mathbf{I}\leq\sigma\sigma^{\operatorname{T}}(t,x,y)\leq\mu\mathbf{I},$
	$\displaystyle\|b(t,x,0)\|+\|f(t,x,0,z)\|\leq\mu.$

3.

There exists a constant $\alpha\in(0,1)$ such that $g$ is bounded in the Hölder space $C^{2,\alpha}(\mathbb{R}^{m})$ .

Then the following quasilinear PDE has a unique classical solution $u(t,x)$ that is bounded with bounded $u_{t}$ , $\nabla_{x}u$ , and $\nabla^{2}_{x}u$ ,

\displaystyle\begin{dcases}u_{t}+\frac{1}{2}\text{trace}(\sigma\sigma^{\operatorname{T}}(t,x,u)\nabla^{2}_{x}u)\\ \hphantom{u_{t}}+\leavevmode\nobreak\ b^{\operatorname{T}}(t,x,u)\nabla_{x}u+f(t,x,u,\sigma^{\operatorname{T}}(t,x,u)\nabla_{x}u)=0,\\ u(T,x)=g(x).\end{dcases}

(2.7)

The associated FBSDEs (2.1)(2.2) have a unique solution $(X_{t},Y_{t},Z_{t})$ with $Y_{t}=u(t,X_{t})$ , $Z_{t}=\sigma^{\operatorname{T}}(t,X_{t},u(t,X_{t}))\nabla_{x}u(t,X_{t})$ , and $X_{t}$ is the solution of the following SDE

X_{t}=\xi+\int_{0}^{t}b(s,X_{s},u(s,X_{s}))\,\mathrm{d}s+\int_{0}^{t}\sigma(s,X_{s},u(s,X_{s}))\,\mathrm{d}W_{s}.

Remark.

The statement regarding FBSDEs (2.1)(2.2) in Theorem 3 is developed through a PDE-based argument, which requires $m=d$ , uniform ellipticity of $\sigma$ , and high-order smoothness of $b,\sigma,f$ , and $g$ . An analogous result through probabilistic argument is given below in Theorem 4 (point 4). In that case, we only need the Lipschitz condition for all of the involved functions, in addition to some weak coupling or monotonicity conditions demonstrated in Assumption 3. Note that the Lipschitz condition alone does not guarantee the existence of a solution to the coupled FBSDEs, even in the situation when $b,f,\sigma$ are linear (see [16, 32] for a concrete counterexample).

Remark.

Theorem 3 also implies that the assumption that the drift function $b$ only depends on $x,y$ is general. If $b$ depends on $z$ as well, one can move the associated term in (2.7) into the nonlinearity $f$ and apply the nonlinear Feynman–Kac formula back to obtain an equivalent system of coupled FBSDEs, in which the new drift function is independent of $z$ .

3 Preliminaries

In this section, we introduce our assumptions and two useful results in [16]. We use the notation $\Delta x=x_{1}-x_{2}$ , $\Delta y=y_{1}-y_{2}$ , $\Delta z=z_{1}-z_{2}$ .

Assumption 1.

(i)

There exist (possibly negative) constants $k_{b}$ , $k_{f}$ such that

	$\displaystyle[b(t,x_{1},y)-b(t,x_{2},y)]^{\operatorname{T}}\Delta x$	$\displaystyle\leq k_{b}\|\Delta x\|^{2},$
	$\displaystyle[f(t,x,y_{1},z)-f(t,x,y_{2},z)]\Delta y$	$\displaystyle\leq k_{f}\|\Delta y\|^{2}.$

(ii)

b, $\sigma$ , f, g are uniformly Lipschitz continuous with respect to (x,y,z). In particular, there are non-negative constants K, $b_{y}$ , $\sigma_{x}$ , $\sigma_{y}$ , $f_{x}$ , $f_{z}$ , and $g_{x}$ such that

	$\displaystyle\|b(t,x_{1},y_{1})-b(t,x_{2},y_{2})\|^{2}$	$\displaystyle\leq K\|\Delta x\|^{2}+b_{y}\|\Delta y\|^{2},$
	$\displaystyle\\|\sigma(t,x_{1},y_{1})-\sigma(t,x_{2},y_{2})\\|^{2}$	$\displaystyle\leq\sigma_{x}\|\Delta x\|^{2}+\sigma_{y}\|\Delta y\|^{2},$
	$\displaystyle\|f(t,x_{1},y_{1},z_{1})-f(t,x_{2},y_{2},z_{2})\|^{2}$	$\displaystyle\leq f_{x}\|\Delta x\|^{2}+K\|\Delta y\|^{2}+f_{z}\|\Delta z\|^{2},$
	$\displaystyle\|g(x_{1})-g(x_{2})\|^{2}$	$\displaystyle\leq g_{x}\|\Delta x\|^{2}.$

(iii)

$b(t,0,0)$ , $f(t,0,0,0)$ , and $\sigma(t,0,0)$ are bounded. In particular, there are constants $b_{0}$ , $\sigma_{0}$ , $f_{0}$ , and $g_{0}$ such that

	$\displaystyle\|b(t,x,y)\|^{2}$	$\displaystyle\leq b_{0}+K\|x\|^{2}+b_{y}\|y\|^{2},$
	$\displaystyle\\|\sigma(t,x,y)\\|^{2}$	$\displaystyle\leq\sigma_{0}+\sigma_{x}\|x\|^{2}+\sigma_{y}\|y\|^{2},$
	$\displaystyle\|f(t,x,y,z)\|^{2}$	$\displaystyle\leq f_{0}+f_{x}\|x\|^{2}+K\|y\|^{2}+f_{z}\|z\|^{2},$
	$\displaystyle\|g(x)\|^{2}$	$\displaystyle\leq g_{0}+g_{x}\|x\|^{2}.$

We note here $b_{y}$ et al. are all constants, not partial derivatives. For convenience, we use $\mathscr{L}$ to denote the set of all the constants mentioned above and assume K is the upper bound of $\mathscr{L}$ .

Assumption 2.

$b,\sigma,f$ are uniformly Hölder- $\frac{1}{2}$ continuous with respect to $t$ . We assume the same constant K to be the upper bound of the square of the Hölder constants as well.

Assumption 3.

One of the following five cases holds:

1.

Small time duration, that is, T is small.
2.

Weak coupling of Y into the forward SDE (2.1), that is, $b_{y}$ and $\sigma_{y}$ are small. In particular, if $b_{y}=\sigma_{y}=0$ , then the forward equation does not depend on the backward one and, thus, equations (2.1)(2.2) are decoupled.
3.

Weak coupling of X into the backward SDE (2.2), that is, $f_{x}$ and $g_{x}$ are small. In particular, if $f_{x}=g_{x}=0$ , then the backward equation does not depend on the forward one and, thus, equations (2.1)(2.2) are also decoupled. In fact, in this case, Z = 0 and (2.2) reduces to an ODE.
4.

f is strongly decreasing in y, that is, $k_{f}$ is very negative.
5.

b is strongly decreasing in x, that is, $k_{b}$ is very negative.

The assumptions stated above are usually called weak coupling and monotonicity conditions in literature [16, 33, 34]. To make it more precise, we define

	$\displaystyle L_{0}=[b_{y}+\sigma_{y}][g_{x}+f_{x}T]Te^{[b_{y}+\sigma_{y}][g_{x}+f_{x}T]T+[2k_{b}+2k_{f}+2+\sigma_{x}+f_{z}]T},$
	$\displaystyle L_{1}=[g_{x}+f_{x}T][e^{[b_{y}+\sigma_{y}][g_{x}+f_{x}T]T+[2k_{b}+2k_{f}+2+\sigma_{x}+f_{z}]T+1}\vee 1],$
	$\displaystyle\Gamma_{0}(x)=\frac{e^{x}-1}{x},\leavevmode\nobreak\ \leavevmode\nobreak\ (x>0),$
	$\displaystyle\Gamma_{1}(x,y)=\sup_{0<\theta<1}\theta e^{\theta x}\Gamma_{0}(y),$
	$\displaystyle c=\inf_{\lambda_{1}>0}\Big{\{}[e^{[2k_{b}+1+\sigma_{x}+[b_{y}+\sigma_{y}]L_{1}]T}\vee 1](1+\lambda_{1}^{-1})[b_{y}+\sigma_{y}]T$
	$\displaystyle\qquad\qquad\quad\leavevmode\nobreak\ \times[g_{x}\Gamma_{1}([2k_{f}+1+f_{z}]T,[2k_{b}+1+\sigma_{x}+(1+\lambda_{1})[b_{y}+\sigma_{y}]L_{1}]T)$
	$\displaystyle\qquad\qquad\leavevmode\nobreak\ +f_{x}T\Gamma_{0}([2k_{f}+1+f_{z}]T)$
	$\displaystyle\qquad\qquad\quad\leavevmode\nobreak\ \times\Gamma_{0}(2k_{b}+1+\sigma_{x}+(1+\lambda_{1})[b_{y}+\sigma_{y}]L_{1}]T)\Big{\}}.$

Then, a specific quantitative form of the above five conditions can be summarized as:

L_{0}<e^{-1}\text{\leavevmode\nobreak\ \leavevmode\nobreak\ and\leavevmode\nobreak\ \leavevmode\nobreak\ }c<1.

(3.1)

In other words, if any of the five conditions of the weak coupling and monotonicity conditions holds to certain extent, the two inequalities in (3.1) hold. Below, we refer to (3.1) as Assumption 3 and the five general qualitative conditions described above as the weak coupling and monotonicity conditions.

The above three assumptions are basic assumptions in [16], which we need in order to use the results from [16], as stated in Theorems 4 and 5 below. Theorem 4 gives the connections between coupled FBSDEs and quasilinear parabolic PDEs under weaker conditions. Theorem 5 provides the convergence of the implicit scheme for coupled FBSDEs. Our work primarily uses the same set of assumptions except that we assume some further quantitative restrictions related to the weak coupling and monotonicity conditions, which will be transparent through the extra constants we define in proofs. Our aim is to provide explicit conditions on which our results hold and more clearly present the relationship between these constants and the error estimates. As will be seen in the proof, roughly speaking, the weaker the coupling (resp., the stronger the monotonicity, the smaller the time horizon) is, the easier the condition is satisfied, and the smaller the constant $C$ related with error estimates are.

Theorem 4.

Under Assumptions 1, 2, and 3, there exists a function u: $\mathbb{R}\times\mathbb{R}^{m}\rightarrow\mathbb{R}$ that satisfies the following statements.

1.

$|u(t,x_{1})-u(t,x_{2})|^{2}\leq L_{1}|x_{1}-x_{2}|^{2}$ .
2.

$|u(s,x)-u(t,x)|^{2}\leq C(1+|x|^{2})|s-t|$ with some constant C depending on $\mathscr{L}$ and $T$ .
3.

u is a viscosity solution of the PDE (2.7).

The FBSDEs (2.1)(2.2) have a unique solution $(X_{t},Y_{t},Z_{t})$ and $Y_{t}=u(t,X_{t})$ . Thus, $(X_{t},Y_{t},Z_{t})$ satisfies decoupled FBSDEs

	$\displaystyle X_{t}$	$\displaystyle=\xi+\int_{0}^{t}b(s,X_{s},u(s,X_{s}))\,\mathrm{d}s+\int_{0}^{t}\sigma(s,X_{s},u(s,X_{s}))\,\mathrm{d}W_{s},$
	$\displaystyle Y_{t}$	$\displaystyle=g(X_{T})+\int_{t}^{T}f(s,X_{s},Y_{s},Z_{s})\,\mathrm{d}s-\int_{t}^{T}(Z_{s})^{\operatorname{T}}\,\mathrm{d}W_{s}.$

Furthermore, the solution of the FBSDEs satisfies the path regularity with some constant C depending on $\mathscr{L}$ and T

\sup_{t\in[0,T]}(E|X_{t}-\tilde{X}_{t}|^{2}+E|Y_{t}-\tilde{Y}_{t}|^{2})+\int_{0}^{T}E|Z_{t}-\tilde{Z}_{t}|^{2}\,\mathrm{d}t\leq C(1+E|\xi|^{2})h,

(3.2)

in which $\tilde{X}_{t}=X_{t_{i}}$ , $\tilde{Y}_{t}=Y_{t_{i}}$ , $\tilde{Z}_{t}=h^{-1}E[\int_{t_{i}}^{t_{i+1}}Z_{t}\,\mathrm{d}t|\mathcal{F}_{t_{i}}]$ for $t\in[t_{i},t_{i+1})$ . If $Z_{t}$ is càdlàg, we can replace $h^{-1}E[\int_{t_{i}}^{t_{i+1}}Z_{t}\,\mathrm{d}t|\mathcal{F}_{t_{i}}]$ with $Z_{t_{i}}$ .

Remark.

Several conditions can guarantee $Z_{t}$ admits a càdlàg version, such as $m=d$ and $\sigma\sigma^{\operatorname{T}}\geq\delta I$ with some $\delta>0$ , see e.g., [35].

Theorem 5.

Under Assumptions 1, 2, and 3, for sufficiently small h, the following discrete-time equation ( $0\leq i\leq N-1$ )

\displaystyle\begin{dcases}\overline{X}_{0}^{\pi}=\xi,\\ \overline{X}_{t_{i+1}}^{\pi}=\overline{X}_{t_{i}}^{\pi}+b(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi})h+\sigma(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi})\Delta W_{i},\\ \overline{Y}_{T}^{\pi}=g(\overline{X}_{T}^{\pi}),\\ \overline{Z}_{t_{i}}^{\pi}=\frac{1}{h}E[\overline{Y}_{t_{i+1}}^{\pi}\Delta W_{i}|\mathcal{F}_{t_{i}}],\\ \overline{Y}_{t_{i}}^{\pi}=E[\overline{Y}_{t_{i+1}}^{\pi}+f(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi},\overline{Z}_{t_{i}}^{\pi})h|\mathcal{F}_{t_{i}}],\end{dcases}

(3.3)

has a solution $(\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi},\overline{Z}_{t_{i}}^{\pi})$ such that $\overline{X}_{t_{i}}^{\pi}\in L^{2}(\Omega,\mathcal{F}_{t_{i}},\mathbb{P})$ and

\sup_{t\in[0,T]}(E|X_{t}-\overline{X}_{t}^{\pi}|^{2}+E|Y_{t}-\overline{Y}_{t}^{\pi}|^{2})+\int_{0}^{T}E|Z_{t}-\overline{Z}_{t}^{\pi}|^{2}\,\mathrm{d}t\leq C(1+E|\xi|^{2})h,

(3.4)

where $\overline{X}_{t}^{\pi}=\overline{X}_{t_{i}}^{\pi}$ , $\overline{Y}_{t}^{\pi}=\overline{Y}_{t_{i}}^{\pi}$ , $\overline{Z}_{t}^{\pi}=\overline{Z}_{t_{i}}^{\pi}$ for $t\in[t_{i},t_{i+1})$ , and C is a constant depending on $\mathscr{L}$ and T.

Remark.

In [16], the above result (existence and convergence) is proved for the explicit scheme, which is formulated as replacing $f(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi},\overline{Z}_{t_{i}}^{\pi})$ with $f(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i+1}}^{\pi},\overline{Z}_{t_{i}}^{\pi})$ in the last equation of (3.3). The same techniques can be used to prove the implicit scheme, as we state in Theorem 5.

Finally, to make sure the system in (2.3) is well-defined, we restrict our parametric function spaces $\mathcal{N}^{\prime}_{0}$ and $\mathcal{N}_{i}$ as in Assumption 4 below. Note that neural networks with common activation functions, including ReLU and sigmoid function, satisfy this assumption. Under Assumption 1 and 4, one can easily prove by induction that $\{X_{t_{i}}^{\pi}\}_{0\leq i\leq N}$ , $\{Y_{t_{i}}^{\pi}\}_{0\leq i\leq N}$ and $\{Z_{t_{i}}^{\pi}\}_{0\leq i\leq N-1}$ defined in (2.3) are all measurable and square-integrable random variables.

Assumption 4.

$\mathcal{N}_{0}^{{}^{\prime}}$ and $\mathcal{N}_{i}(0\leq i\leq N-1)$ are subsets of measurable functions from $\mathbb{R}^{m}$ to $\mathbb{R}$ and $\mathbb{R}^{m}\times\mathbb{R}$ to $\mathbb{R}^{d}$ with linear growth, namely, $\mu_{0}^{\pi}$ and $\{\phi_{i}^{\pi}\}_{0\leq i\leq N-1}$ in (2.3) satisfy $|\mu_{0}^{\pi}(x)|^{2}\leq A^{\prime}_{0}+B^{\prime}_{0}|x|^{2}$ and $|\phi_{i}^{\pi}(x,y)|^{2}\leq A_{i}+B_{i}|x|^{2}+C_{i}|y|^{2}$ for $0\leq i\leq N-1$ .

4 A Posteriori Estimation of the Simulation Error

We prove Theorem 1 in this section. Comparing the statements of Theorem 1 and Theorem 5, we wish to bound the differences between $(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi})$ and $(\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi},\overline{Z}_{t_{i}}^{\pi})$ with the objective function $E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}$ . Recalling the definition of the system of equations (2.3), we have

	$\displaystyle X_{t_{i+1}}^{\pi}$	$\displaystyle=X_{t_{i}}^{\pi}+b(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})h+\sigma(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})\Delta W_{i},$		(4.1)
	$\displaystyle Y_{t_{i+1}}^{\pi}$	$\displaystyle=Y_{t_{i}}^{\pi}-f(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi})h+(Z_{t_{i}}^{\pi})^{\operatorname{T}}\Delta W_{i}.$		(4.2)

Taking the expectation $E[\cdot|\mathcal{F}_{t_{i}}]$ on both sides of (4.2), we obtain

Y_{t_{i}}^{\pi}=E[Y_{t_{i+1}}^{\pi}+f(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi})h|\mathcal{F}_{t_{i}}].

Right multiplying $(\Delta W_{i})^{\operatorname{T}}$ on both sides of (4.2) and taking the expectation $E[\cdot|\mathcal{F}_{t_{i}}]$ again, we obtain

Z_{t_{i}}^{\pi}=\frac{1}{h}[Y_{t_{i+1}}^{\pi}\Delta W_{i}|\mathcal{F}_{t_{i}}].

The above observation motivates us to consider the following system of equations

\displaystyle\begin{dcases}X_{0}^{\pi}=\xi,\\ X_{t_{i+1}}^{\pi}=X_{t_{i}}^{\pi}+b(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})h+\sigma(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})\Delta W_{i},\\ Z_{t_{i}}^{\pi}=\frac{1}{h}E[Y_{t_{i+1}}^{\pi}\Delta W_{i}|\mathcal{F}_{t_{i}}],\\ Y_{t_{i}}^{\pi}=E[{Y}_{t_{i+1}}^{\pi}+f(t_{i},{X}_{t_{i}}^{\pi},{Y}_{t_{i}}^{\pi},{Z}_{t_{i}}^{\pi})h|\mathcal{F}_{t_{i}}].\end{dcases}

(4.3)

Note that (4.3) is defined just like the FBSDEs (2.1)(2.2), where the $X$ component is defined forwardly and the $Y,Z$ components are defined backwardly. However, since we do not specify the terminal condition of $Y_{T}^{\pi}$ , the system of equations (4.3) has infinitely many solutions. The following lemma gives an estimate of the difference between two such solutions.

Lemma 1.

For $j=1,2$ , suppose $(\{X_{t_{i}}^{\pi,j}\}_{0\leq i\leq N},\{Y_{t_{i}}^{\pi,j}\}_{0\leq i\leq N},\{Z_{t_{i}}^{\pi,j}\}_{0\leq i\leq N-1})$ are two solutions of (4.3), with $X_{t_{i}}^{\pi,j},Y_{t_{i}}^{\pi,j}\in L^{2}(\Omega,\mathcal{F}_{t_{i}},\mathbb{P})$ , $0\leq i\leq N$ . For any $\lambda_{1}>0,\lambda_{2}\geq f_{z}$ , and sufficiently small h, denote

$\displaystyle A_{1}$	$\displaystyle\coloneqq 2k_{b}+\lambda_{1}+\sigma_{x}+Kh,$	(4.4)
$\displaystyle A_{2}$	$\displaystyle\coloneqq(\lambda_{1}^{-1}+h)b_{y}+\sigma_{y},$
$\displaystyle A_{3}$	$\displaystyle\coloneqq-\frac{\ln[1-(2k_{f}+\lambda_{2})h]}{h},$
$\displaystyle A_{4}$	$\displaystyle\coloneqq\frac{f_{x}}{[1-(2k_{f}+\lambda_{2})h]\lambda_{2}}.$

Let $\delta X_{i}=X_{t_{i}}^{\pi,1}-X_{t_{i}}^{\pi,2},\delta Y_{i}=Y_{t_{i}}^{\pi,1}-Y_{t_{i}}^{\pi,2}$ , then we have, for $0\leq n\leq N$ ,

	$\displaystyle E\|\delta X_{n}\|^{2}$	$\displaystyle\leq A_{2}\sum_{i=0}^{n-1}e^{A_{1}(n-i-1)h}E\|\delta Y_{i}\|^{2}h,$
	$\displaystyle E\|\delta Y_{n}\|^{2}$	$\displaystyle\leq e^{A_{3}(N-n)h}E\|\delta Y_{N}\|^{2}+A_{4}\sum_{i=n}^{N-1}e^{A_{3}(i-n)h}E\|\delta X_{i}\|^{2}h.$

To prove Lemma 1, we need the following lemma to handle the $Z$ component.

Lemma 2.

Let $0\leq s_{1}<s_{2}$ , given $Q\in L^{2}(\Omega,\mathcal{F}_{s_{2}},\mathbb{P})$ , by the martingale representation theorem, there exists an $\mathcal{F}_{t}$ -adapted process $\{H_{s}\}_{s_{1}\leq s\leq s_{2}}$ such that $\int_{s_{1}}^{s_{2}}E|H_{s}|^{2}\,\mathrm{d}s<\infty$ and $Q=E[Q|\mathcal{F}_{s_{1}}]+\int_{s_{1}}^{s_{2}}H_{s}\,\mathrm{d}W_{s}$ . Then we have $E[Q(W_{s_{2}}-W_{s_{1}})|\mathcal{F}_{s_{1}}]=E[\int_{s_{1}}^{s_{2}}H_{s}\,\mathrm{d}s|\mathcal{F}_{s_{1}}]$ .

Proof.

Consider the auxiliary stochastic process $Q_{s}=(E[Q|\mathcal{F}_{s_{1}}]+\int_{s_{1}}^{s}H_{t}\,\mathrm{d}W_{t})(W_{s}-W_{s_{1}})$ for $s\in[s_{1},s_{2}]$ . By Itô’s formula,

\mathrm{d}Q_{s}=(W_{s}-W_{s_{1}})H_{s}\,\mathrm{d}W_{s}+(E[Q|\mathcal{F}_{s_{1}}]+\int_{s_{1}}^{s}H_{t}\,\mathrm{d}W_{t})\,\mathrm{d}W_{s}+H_{s}\,\mathrm{d}s.

Noting that $Q_{s_{1}}=0$ , we have

E[Q(W_{s_{2}}-W_{s_{1}})|\mathcal{F}_{s_{1}}]=E[Q_{s_{2}}|\mathcal{F}_{s_{1}}]=E[\int_{s_{1}}^{s_{2}}H_{s}\,\mathrm{d}s|\mathcal{F}_{s_{1}}].

∎

Proof of Lemma 1.

Let

	$\displaystyle\delta Z_{i}$	$\displaystyle=Z_{t_{i}}^{\pi,1}-Z_{t_{i}}^{\pi,2},$
	$\displaystyle\delta b_{i}$	$\displaystyle=b(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1})-b(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2}),$
	$\displaystyle\delta\sigma_{i}$	$\displaystyle=\sigma(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1})-\sigma(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2}),$
	$\displaystyle\delta f_{i}$	$\displaystyle=f(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1},Z_{t_{i}}^{\pi,1})-f(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2},Z_{t_{i}}^{\pi,2}).$

Then we have

$\displaystyle\delta X_{i+1}$	$\displaystyle=\delta X_{i}+\delta b_{i}h+\delta\sigma_{i}\Delta W_{i},$	(4.5)
$\displaystyle\delta Z_{i}$	$\displaystyle=\frac{1}{h}E[\delta Y_{i+1}\Delta W_{i}\|\mathcal{F}_{t_{i}}],$	(4.6)
$\displaystyle\delta Y_{i}$	$\displaystyle=E[\delta Y_{i+1}+\delta f_{i}h\|\mathcal{F}_{t_{i}}].$	(4.7)

By the martingale representation theorem, there exists an $\mathscr{F}_{t}$ -adapted square-integrable process $\{\delta Z_{t}\}_{t_{i}\leq t\leq t_{i+1}}$ such that

\delta Y_{i+1}=E[\delta Y_{i+1}|\mathcal{F}_{t_{i}}]+\int_{t_{i}}^{t_{i+1}}(\delta Z_{t})^{\operatorname{T}}\,\mathrm{d}W_{t},

or, equivalently,

\displaystyle\delta Y_{i+1}

\displaystyle=\delta Y_{i}-\delta f_{i}h+\int_{t_{i}}^{t_{i+1}}(\delta Z_{t})^{\operatorname{T}}\,\mathrm{d}W_{t}.

(4.8)

From equations (4.5) and (4.8), noting that $\delta X_{i}$ , $\delta Y_{i}$ , $\delta b_{i}$ , $\delta\sigma_{i}$ , and $\delta f_{i}$ are all $\mathcal{F}_{t_{i}}$ -measurable, and $E[\Delta W_{i}|\mathcal{F}_{t_{i}}]=0$ , $E[\int_{t_{i}}^{t_{i+1}}(\delta Z_{t})^{\operatorname{T}}\,\mathrm{d}W_{t}|\mathcal{F}_{t_{i}}]=0$ , we have

$\displaystyle E\|\delta X_{i+1}\|^{2}$	$\displaystyle=E\|\delta X_{i}+\delta b_{i}h\|^{2}+E[(\Delta W_{i})^{\operatorname{T}}(\delta\sigma_{i})^{\operatorname{T}}\delta\sigma_{i}\Delta W_{i}]$
	$\displaystyle=E\|\delta X_{i}+\delta b_{i}h\|^{2}+hE\\|\delta\sigma_{i}\\|^{2},$	(4.9)
$\displaystyle E\|\delta Y_{i+1}\|^{2}$	$\displaystyle=E\|\delta Y_{i}-\delta f_{i}h\|^{2}+\int_{t_{i}}^{t_{i+1}}E\|\delta Z_{t}\|^{2}\,\mathrm{d}t.$	(4.10)

From equation (4.9), by Assumptions 1, 2 and the root-mean square and geometric mean inequality (RMS-GM inequality), for any $\lambda_{1}>0$ , we have

		$\displaystyle E\|\delta X_{i+1}\|^{2}$
	$\displaystyle=\leavevmode\nobreak\$	$\displaystyle E\|\delta X_{i}\|^{2}+E\|\delta b_{i}\|^{2}h^{2}+hE\\|\delta\sigma_{i}\\|^{2}$
		$\displaystyle+2hE[(b(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1})-b(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,1}))^{\operatorname{T}}\delta X_{i}]$
		$\displaystyle+2hE[(b(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,1})-b(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2}))^{\operatorname{T}}\delta X_{i}]$
	$\displaystyle\leq\leavevmode\nobreak\$	$\displaystyle E\|\delta X_{i}\|^{2}+(KE\|\delta X_{i}\|^{2}+b_{y}E\|\delta Y_{i}\|^{2})h^{2}+2k_{b}hE\|\delta X_{i}\|^{2}$
		$\displaystyle+(\lambda_{1}E\|\delta X_{i}\|^{2}+\lambda_{1}^{-1}b_{y}E\|\delta Y_{i}\|^{2})h+(\sigma_{x}E\|\delta X_{i}\|^{2}+\sigma_{y}E\|\delta Y_{i}\|^{2})h$
	$\displaystyle=\leavevmode\nobreak\$	$\displaystyle[1+(2k_{b}+\lambda_{1}+\sigma_{x}+Kh)h]E\|\delta X_{i}\|^{2}+[(\lambda_{1}^{-1}+h)b_{y}+\sigma_{y}]E\|\delta Y_{i}\|^{2}h.$

Recall $A_{1}=2k_{b}+\lambda_{1}+\sigma_{x}+Kh$ , $A_{2}=(\lambda_{1}^{-1}+h)b_{y}+\sigma_{y}$ , $E|\delta X_{0}|^{2}=0$ . By induction we can obtain that, for $0\leq n\leq N$ ,

E|\delta X_{n}|^{2}\leq A_{2}\sum_{i=0}^{n-1}e^{A_{1}(n-i-1)h}E|\delta Y_{i}|^{2}h.

Similarly, from equation (4.10), for any $\lambda_{2}>0$ , we have

	$\displaystyle E\|\delta Y_{i+1}\|^{2}$
$\displaystyle\geq\leavevmode\nobreak\$	$\displaystyle E\|\delta Y_{i}\|^{2}+\int_{t_{i}}^{t_{i+1}}E\|\delta Z_{t}\|^{2}\,\mathrm{d}t$
	$\displaystyle-2hE[(f(t_{i},X_{i}^{1,\pi},Y_{i}^{1,\pi},Z_{i}^{1,\pi})-f(t_{i},X_{i}^{1,\pi},Y_{i}^{2,\pi},Z_{i}^{1,\pi}))^{\operatorname{T}}\delta Y_{i}]$
	$\displaystyle-2hE[(f(t_{i},X_{i}^{1,\pi},Y_{i}^{2,\pi},Z_{i}^{1,\pi})-f(t_{i},X_{i}^{2,\pi},Y_{i}^{2,\pi},Z_{i}^{2,\pi}))^{\operatorname{T}}\delta Y_{i}]$
$\displaystyle\geq\leavevmode\nobreak\$	$\displaystyle E\|\delta Y_{i}\|^{2}+\int_{t_{i}}^{t_{i+1}}E\|\delta Z_{t}\|^{2}\,\mathrm{d}t-2k_{f}hE\|\delta Y_{i}\|^{2}$
	$\displaystyle-[\lambda_{2}E\|\delta Y_{i}\|^{2}+\lambda_{2}^{-1}(f_{x}E\|\delta X_{i}\|^{2}+f_{z}E\|\delta Z_{i}\|^{2})]h.$	(4.11)

To deal with the integral term in (4.11), we apply Lemma 2 to (4.6)(4.8) and get

\displaystyle\delta Z_{i}

\displaystyle=\frac{1}{h}E[\int_{t_{i}}^{t_{i+1}}\delta Z_{t}\,\mathrm{d}t|\mathcal{F}_{t_{i}}],

which implies, by the Cauchy inequality,

	$\displaystyle E\|\delta Z_{i}\|^{2}h$	$\displaystyle=\sum_{k=1}^{d}E\|(\delta Z_{i})_{k}\|^{2}h=\sum_{k=1}^{d}\frac{1}{h}E\Big{\|}E[\int_{t_{i}}^{t_{i+1}}(\delta Z_{t})_{k}\,\mathrm{d}t\|\mathcal{F}_{t_{i}}]\Big{\|}^{2}$
		$\displaystyle\leq\sum_{k=1}^{d}\frac{1}{h}E\Big{\|}\int_{t_{i}}^{t_{i+1}}(\delta Z_{t})_{k}\,\mathrm{d}t\Big{\|}^{2}\leq\sum_{k=1}^{d}\int_{t_{i}}^{t_{i+1}}E\|(\delta Z_{t})_{k}\|^{2}\,\mathrm{d}t$
		$\displaystyle=\int_{t_{i}}^{t_{i+1}}E\|\delta Z_{t}\|^{2}\,\mathrm{d}t,$

where $(\cdot)_{k}$ denotes the $k$ -th component of the vector. Plugging it into (4.11) gives us

\displaystyle E|\delta Y_{i+1}|^{2}\geq[1-(2k_{f}+\lambda_{2})h]E|\delta Y_{i}|^{2}+(1-f_{z}\lambda_{2}^{-1})E|\delta Z_{i}|^{2}h-f_{x}\lambda_{2}^{-1}E|\delta X_{i}|^{2}h.

(4.12)

Then for any $\lambda_{2}\geq f_{z}$ and sufficiently small $h$ satisfying $(2k_{f}+\lambda_{2})h<1$ , we have

E|\delta Y_{i}|^{2}\leq[1-(2k_{f}+\lambda_{2})h]^{-1}[E|\delta Y_{i+1}|^{2}+f_{x}\lambda_{2}^{-1}E|\delta X_{i}|^{2}h].

Recall $A_{3}=-h^{-1}\ln[1-(2k_{f}+\lambda_{2})h]$ , $A_{4}=f_{x}\lambda_{2}^{-1}[1-(2k_{f}+\lambda_{2})h]^{-1}$ . By induction we obtain that, for $0\leq n\leq N$ ,

E|\delta Y_{n}|^{2}\leq e^{A_{3}(N-n)h}E|\delta Y_{N}|^{2}+A_{4}\sum_{i=n}^{N-1}e^{A_{3}(i-n)h}E|\delta X_{i}|^{2}h.

∎

Now we are ready to prove Theorem 1, whose precise statement is given below. Note that its conditions are satisfied if any of the five cases in the weak coupling and monotonicity conditions holds.

Theorem 1^′.

Suppose Assumptions 1, 2, 3, and 4 hold true and there exist $\lambda_{1}>0,\lambda_{2}\geq f_{z}$ such that $\overline{A_{0}}<1$ , where

		$\displaystyle\overline{A_{1}}\coloneqq 2k_{b}+\lambda_{1}+\sigma_{x},$		(4.13)
		$\displaystyle\overline{A_{2}}\coloneqq b_{y}\lambda_{1}^{-1}+\sigma_{y},$
		$\displaystyle\overline{A_{3}}\coloneqq 2k_{f}+\lambda_{2},$
		$\displaystyle\overline{A_{4}}\coloneqq f_{x}\lambda_{2}^{-1},$
		$\displaystyle\overline{A_{0}}\coloneqq\overline{A_{2}}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}\Big{\{}g_{x}e^{(\overline{A_{1}}+\overline{A_{3}})T}+\overline{A_{4}}\frac{e^{(\overline{A_{1}}+\overline{A_{3}})T}-1}{\overline{A_{1}}+\overline{A_{3}}}\Big{\}}.$

Then there exists a constant $C>0$ , depending on $E|\xi|^{2}$ , $\mathscr{L}$ , $T$ , $\lambda_{1}$ , and $\lambda_{2}$ , such that for sufficiently small $h$ ,

\sup_{t\in[0,T]}(E|X_{t}-\hat{X}_{t}^{\pi}|^{2}+E|Y_{t}-\hat{Y}_{t}^{\pi}|^{2})+\int_{0}^{T}E|Z_{t}-\hat{Z}_{t}^{\pi}|^{2}\,\mathrm{d}t\leq C[h+E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}],

(4.14)

where $\hat{X}_{t}^{\pi}=X_{t_{i}}^{\pi}$ , $\hat{Y}_{t}^{\pi}=Y_{t_{i}}^{\pi}$ , $\hat{Z}_{t}^{\pi}=Z_{t_{i}}^{\pi}$ for $t\in[t_{i},t_{i+1})$ .

Remark.

The above theorem also implies the coercivity of the objective function (2.4) used in the deep BSDE method. Formally speaking, the coercivity means that if $\sum_{i=0}^{N-1}E|Z_{t_{i}}^{\pi}|^{2}+E|Y_{0}^{\pi}|^{2}\rightarrow+\infty$ , we have $E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\rightarrow+\infty$ , which is a direct result from Theorem 1^′.

Remark.

If any of the weak coupling and monotonicity conditions introduced in Assumption 3 holds to a sufficient extent, there must exist $\lambda_{1},\lambda_{2}$ satisfying the conditions in Theorem 1^′. We discuss the 5 cases in what follows.

1.

Suppose all other constants and $\lambda_{1}>0,\lambda_{2}\geq f_{z}$ are fixed, if $T>0$ is sufficiently small, then the second factor of $\overline{A_{0}}$ could be sufficiently close to 0 such that $\overline{A_{0}}<1$ .
2.

Suppose all other constants and $\lambda_{1}>0,\lambda_{2}\geq f_{z}$ are fixed, if $b_{y}\geq 0$ and $\sigma_{y}\geq 0$ are sufficiently small, then $\overline{A_{2}}\geq 0$ could be sufficiently small such that $\overline{A_{0}}<1$ .
3.

Suppose all other constants and $\lambda_{1}>0,\lambda_{2}\geq f_{z}$ are fixed, if $f_{x}\geq 0$ and $g_{x}\geq 0$ are sufficiently small, then $\overline{A_{4}}$ and thus the last factor in $\overline{A_{0}}$ could be sufficiently close to 0 such that $\overline{A_{0}}<1$ .

Suppose all constants except $k_{f}$ and $\lambda_{2}>0$ are fixed. Let $\overline{A_{1}}^{\prime}\coloneqq\overline{A_{1}}+\overline{A_{3}}=2k_{b}+2k_{f}+\sigma_{x}+\lambda_{1}+\lambda_{2}$ and rewrite $\overline{A_{0}}$ as

\displaystyle\overline{A_{0}}=\overline{A_{2}}\Big{\{}g_{x}\frac{e^{\overline{A_{1}}^{\prime}T}-1}{\overline{A_{1}}^{\prime}}+\overline{A_{4}}\frac{e^{\overline{A_{1}}^{\prime}T}+e^{-\overline{A_{1}}^{\prime}T}-2}{(\overline{A_{1}}^{\prime})^{2}}\Big{\}}.

It is straightforward to check that there exists a negative constant $C_{1}$ such that when $\overline{A_{1}}^{\prime}\leq C_{1}$ , $(e^{\overline{A_{1}}^{\prime}T}-1)/\overline{A_{1}}^{\prime}<1/(2\overline{A_{2}}g_{x})$ . By the definition of $\overline{A_{1}}^{\prime}$ , if $k_{f}$ is sufficiently negative, there exists $\lambda_{2}\geq f_{x}$ such that $\overline{A_{1}}^{\prime}=C_{1}$ and $\lambda_{2}$ is sufficiently large to ensure

\displaystyle\overline{A_{2}}\,\overline{A_{4}}\frac{e^{C_{1}T}+e^{-C_{1}T}-2}{C_{1}^{2}}=\frac{f_{x}\overline{A_{2}}(e^{C_{1}T}+e^{-C_{1}T}-2)}{\lambda_{2}C_{1}^{2}}<\frac{1}{2}.

Combining these two estimates gives $\overline{A_{0}}<1$ .

5.

Noting that $k_{b}$ and $k_{f}$ play the same role in $\overline{A_{1}}^{\prime}$ , we use the same argument as above to show that when $k_{b}$ is sufficiently negative, there exists $\lambda_{2}\geq f_{x}$ such that $\overline{A_{0}}<1$ .

Proof of Theorem 1^′.

From the proof of this theorem and throughout the remainder of the paper, we use $C$ to generally denote a constant that only depends on $E|\xi|^{2}$ , $\mathscr{L}$ , and $T$ , whose value may change from line to line when there is no need to distinguish. We also use $C(\cdot)$ to generally denote a constant depending on $E|\xi|^{2}$ , $\mathscr{L}$ , $T$ and the constants represented by $\cdot$ .

We use the same notations as Lemma 1. Let $X_{t_{i}}^{\pi,1}=X_{t_{i}}^{\pi}$ , $Y_{t_{i}}^{\pi,1}=Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi,1}=Z_{t_{i}}^{\pi}$ (defined in system (2.3)) and $X_{t_{i}}^{\pi,2}=\overline{X}_{t_{i}}^{\pi}$ , $Y_{t_{i}}^{\pi,2}=\overline{Y}_{t_{i}}^{\pi}$ , $Z_{t_{i}}^{\pi,2}=\overline{Z}_{t_{i}}^{\pi}$ (defined in system (3.3)). It can be easily checked that both $(\{X_{t_{i}}^{\pi,j}\}_{0\leq i\leq N}$ , $\{Y_{t_{i}}^{\pi,j}\}_{0\leq i\leq N}$ , $\{Z_{t_{i}}^{\pi,j}\}_{0\leq i\leq N-1})$ , $j=1,2$ satisfy the system of equations (4.3). Our proof strategy is to use Lemma 1 to bound the difference between two solutions through the objective function $E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}$ . This allows us to apply Theorem 5 to derive the desired estimates.

To begin with, note that for any $\lambda_{3}>0$ , the RMS-GM inequality yields

E|\delta Y_{N}|^{2}=E|g(\overline{X}_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq(1+\lambda_{3}^{-1})E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}+g_{x}(1+\lambda_{3})E|\delta X_{N}|^{2}.

Let

\displaystyle P=\displaystyle{\max_{0\leq n\leq N}e^{-A_{1}nh}E|\delta X_{n}|^{2}},\quad S=\displaystyle{\max_{0\leq n\leq N}e^{A_{3}nh}E|\delta Y_{n}|^{2}}.

Lemma 1 tells us

e^{-A_{1}nh}E|\delta X_{n}|^{2}\leq A_{2}\sum_{i=0}^{n-1}e^{-A_{1}(i+1)h}E|\delta Y_{i}|^{2}h\leq A_{2}S\sum_{i=0}^{n-1}e^{-A_{1}(i+1)h-A_{3}ih}h,

and

		$\displaystyle e^{A_{3}nh}E\|\delta Y_{n}\|^{2}$
	$\displaystyle\leq$	$\displaystyle e^{A_{3}T}E\|\delta Y_{N}\|^{2}+A_{4}\sum_{i=n}^{N-1}e^{A_{3}ih}E\|\delta X_{i}\|^{2}h$
	$\displaystyle\leq$	$\displaystyle e^{A_{3}T}[(1+\lambda_{3}^{-1})E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}+g_{x}(1+\lambda_{3})E\|\delta X_{N}\|^{2}]+A_{4}\sum_{i=n}^{N-1}e^{A_{3}ih}E\|\delta X_{i}\|^{2}h$
	$\displaystyle\leq$	$\displaystyle e^{A_{3}T}(1+\lambda_{3}^{-1})E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}+[g_{x}(1+\lambda_{3})e^{(A_{1}+A_{3})T}+A_{4}\sum_{i=n}^{N-1}e^{(A_{1}+A_{3})ih}h]P.$

Therefore by definition of $P$ and $S$ , we have

	$\displaystyle P$	$\displaystyle\leq A_{2}he^{-A_{1}h}\frac{e^{-(A_{1}+A_{3})T}-1}{e^{-(A_{1}+A_{3})h}-1}S,$
	$\displaystyle S$	$\displaystyle\leq e^{A_{3}T}(1+\lambda_{3}^{-1})E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}+[g_{x}(1+\lambda_{3})e^{(A_{1}+A_{3})T}+A_{4}h\frac{e^{(A_{1}+A_{3})T}-1}{e^{(A_{1}+A_{3})h}-1}]P.$

Consider the function

\displaystyle A(h)

\displaystyle=A_{2}he^{-A_{1}h}\frac{e^{-(A_{1}+A_{3})T}-1}{e^{-(A_{1}+A_{3})h}-1}[g_{x}(1+\lambda_{3})e^{(A_{1}+A_{3})T}+A_{4}h\frac{e^{(A_{1}+A_{3})T}-1}{e^{(A_{1}+A_{3})h}-1}].

When $A(h)<1$ , we have

	$\displaystyle P$	$\displaystyle\leq[1-A(h)]^{-1}e^{A_{3}T}(1+\lambda_{3}^{-1})A_{2}he^{-A_{1}h}\frac{e^{-(A_{1}+A_{3})T}-1}{e^{-(A_{1}+A_{3})h}-1}E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2},$
	$\displaystyle S$	$\displaystyle\leq[1-A(h)]^{-1}e^{A_{3}T}(1+\lambda_{3}^{-1})E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}.$

Let

\overline{P}=\max_{0\leq n\leq N}e^{-\overline{A_{1}}nh}E|\delta X_{n}|^{2},\quad\overline{S}=\max_{0\leq n\leq N}e^{\overline{A_{3}}nh}E|\delta Y_{n}|^{2}.

(4.15)

Recall

\displaystyle\lim_{h\rightarrow 0}A_{i}=\overline{A_{i}},\quad i=1,2,3,4,

and note that

\displaystyle\lim_{h\rightarrow 0}A(h)=\overline{A_{2}}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}[g_{x}(1+\lambda_{3})e^{(\overline{A_{1}}+\overline{A_{3}})T}+\overline{A_{4}}\frac{e^{(\overline{A_{1}}+\overline{A_{3}})T}-1}{\overline{A_{1}}+\overline{A_{3}}}].

When $\overline{A_{0}}<1$ , comparing $\lim_{h\rightarrow 0}A(h)$ and $\overline{A_{0}}$ , we know that, for any $\epsilon>0$ , there exists $\lambda_{3}>0$ and sufficiently small $h$ such that

	$\displaystyle\overline{P}$	$\displaystyle\leq(1+\epsilon)[1-\overline{A_{0}}]^{-1}\overline{A_{2}}e^{\overline{A_{3}}T}(1+\lambda_{3}^{-1})\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2},$		(4.16)
	$\displaystyle\overline{S}$	$\displaystyle\leq(1+\epsilon)[1-\overline{A_{0}}]^{-1}e^{\overline{A_{3}}T}(1+\lambda_{3}^{-1})E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}.$		(4.17)

By fixing $\epsilon=1$ and choosing suitable $\lambda_{3}$ , we obtain our error estimates of $E|\delta X_{n}|^{2}$ and $E|\delta Y_{n}|^{2}$ as

	$\displaystyle\max_{0\leq n\leq N}E\|\delta X_{n}\|^{2}$	$\displaystyle\leq e^{\overline{A_{1}}T\vee 0}\overline{P}\leq C(\lambda_{1},\lambda_{2})E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2},$		(4.18)
	$\displaystyle\max_{0\leq n\leq N}E\|\delta Y_{n}\|^{2}$	$\displaystyle\leq e^{(-\overline{A_{3}}T)\vee 0}\overline{S}\leq C(\lambda_{1},\lambda_{2})E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}.$		(4.19)

To estimate $E|\delta Z_{n}|^{2}$ , we consider estimate (4.12), in which $\lambda_{2}$ can take any value no smaller than $f_{z}$ . If $f_{z}\neq 0$ , we choose $\lambda_{2}=2f_{z}$ and obtain

\frac{1}{2}E|\delta Z_{i}|^{2}h\leq\frac{f_{x}}{2f_{z}}E|\delta X_{i}|^{2}h+E|\delta Y_{i+1}|^{2}-[1-(2k_{f}+2f_{z})h]E|\delta Y_{i}|^{2}.

Summing from 0 to $N-1$ gives us

	$\displaystyle\sum_{i=0}^{N-1}E\|\delta Z_{i}\|^{2}h\leq\leavevmode\nobreak\$	$\displaystyle\frac{f_{x}T}{f_{z}}\max_{0\leq n\leq N}E\|\delta X_{n}\|^{2}+[4(k_{f}+f_{z})T\vee 0+2]\max_{0\leq n\leq N}E\|\delta Y_{n}\|^{2}$
	$\displaystyle\leq\leavevmode\nobreak\$	$\displaystyle C(\lambda_{1},\lambda_{2})E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}.$		(4.20)

The case $f_{z}=0$ can be dealt with similarly by choosing $\lambda_{2}=1$ and the same type of estimate can be derived. Finally, combining estimates (4.18)(4.19)(4.20) with Theorem 5, we prove the statement in Theorem 1^′. ∎

5 An Upper Bound for the Minimized Objective Function

We prove Theorem 2 in this section. We first state three useful lemmas. Theorem 2^′, as a detailed statement of Theorem 2, and Theorem 6, as an variation of Theorem 2^′ under stronger conditions, are then provided, followed by their proofs. The proofs of three lemmas are given at the end of the section.

The main process we analyze is (2.3). Lemma 3 gives an estimate of the final distance $E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}$ provided by (2.3) in terms of the deviation between the approximated variables $Y_{0}^{\pi},Z_{t_{i}}^{\pi}$ and the true solutions.

Lemma 3.

Suppose Assumptions 1, 2, and 3 hold true. Let $X_{T}^{\pi},Y_{0}^{\pi},Y_{T}^{\pi}$ , $\{Z_{t_{i}}^{\pi}\}_{0\leq i\leq N-1}$ be defined as in system (2.3) and $\tilde{Z}_{t_{i}}=h^{-1}E[\int_{t_{i}}^{t_{i+1}}Z_{t}\,\mathrm{d}t|\mathcal{F}_{t_{i}}]$ . Given $\lambda_{4}>0$ , there exists a constant $C>0$ depending on $E|\xi|^{2}$ , $\mathscr{L}$ , $T$ , and $\lambda_{4}$ , such that for sufficiently small $h$ ,

\displaystyle E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq(1+\lambda_{4})H_{\mathrm{min}}\sum_{i=0}^{N-1}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C[h+E|Y_{0}-Y_{0}^{\pi}|^{2}],

where $\delta\tilde{Z}_{t_{i}}=\tilde{Z}_{t_{i}}-Z_{t_{i}}^{\pi}$ , $H(x)=(1+\sqrt{g_{x}})^{2}e^{(2K+2Kx^{-1}+x)T}(1+f_{z}x^{-1})$ , and $H_{\mathrm{min}}=\min_{x\in R^{+}}H(x)$ .

Lemma 3 is close to Theorem 2, except that $\tilde{Z}_{t_{i}}$ is not a function of $X_{t_{i}}^{\pi}$ and $Y_{t_{i}}^{\pi}$ defined in (2.3). To bridge this gap, we need the following two lemmas. First, similar to the proof of Theorem 1^′, an estimate of the distance between the process defined in (2.3) and the process defined in (3.3) is also needed here. Lemma 4 is a general result to serve this purpose, providing an estimate of the difference between two backward processes driven by different forward processes.

Lemma 4.

Let $X_{t_{i}}^{\pi,j}\in L^{2}(\Omega,\mathcal{F}_{t_{i}},\mathbb{P})$ for $0\leq i\leq N$ , $j=1,2$ . Suppose $\{Y_{t_{i}}^{\pi,j}\}_{0\leq i\leq N}$ and $\{Z_{t_{i}}^{\pi,j}\}_{0\leq i\leq N-1}$ satisfy

\displaystyle\begin{dcases}Y_{T}^{\pi,j}=g(X_{T}^{\pi,j}),\\ Z_{t_{i}}^{\pi,j}=\frac{1}{h}E[Y_{t_{i+1}}^{\pi,j}\Delta W_{i}|\mathcal{F}_{t_{i}}],\\ Y_{t_{i}}^{\pi,j}=E[{Y}_{t_{i+1}}^{\pi,j}+f(t_{i},{X}_{t_{i}}^{\pi,j},{Y}_{t_{i}}^{\pi,j},{Z}_{t_{i}}^{\pi,j})h|\mathcal{F}_{t_{i}}],\end{dcases}

(5.1)

for $0\leq i\leq N-1$ , $j=1,2$ . Let $\delta X_{i}=X_{t_{i}}^{\pi,1}-X_{t_{i}}^{\pi,2}$ , $\delta Z_{i}=Z_{t_{i}}^{\pi,1}-Z_{t_{i}}^{\pi,2}$ , then for any $\lambda_{7}>f_{z}$ , and sufficiently small $h$ , we have

\displaystyle\sum_{i=0}^{N-1}E|\delta Z_{i}|^{2}h\leq\frac{\lambda_{7}(e^{-A_{5}T}\vee 1)}{\lambda_{7}-f_{z}}\Big{\{}g_{x}e^{A_{5}T-A_{5}h}E|\delta X_{N}|^{2}+\frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{A_{5}ih}E|\delta X_{i}|^{2}h\Big{\}},

where $A_{5}\coloneqq-h^{-1}\ln[1-(2k_{f}+\lambda_{7})h]$ .

Lemma 5 shows that, similar to the nonlinear Feynman–Kac formula, the discrete stochastic process defined in (2.3) can also be linked to some deterministic functions.

Lemma 5.

Let $\{X_{t_{i}}^{\pi}\}_{0\leq i\leq N},\{Y_{t_{i}}^{\pi}\}_{0\leq i\leq N}$ be defined in (2.3). When $h<1/\sqrt{K}$ , there exist deterministic functions $U_{i}^{\pi}:\mathbb{R}^{m}\times\mathbb{R}\rightarrow\mathbb{R},V_{i}^{\pi}:\mathbb{R}^{m}\times\mathbb{R}\rightarrow\mathbb{R}^{d}$ for $0\leq i\leq N$ such that $Y_{t_{i}}^{\pi,^{\prime}}=U_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})$ , $Z_{t_{i}}^{\pi,^{\prime}}=V_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})$ satisfy

\displaystyle\begin{dcases}Y_{t_{N}}^{\pi,^{\prime}}=g(X_{t_{N}}^{\pi}),\\ Z_{t_{i}}^{\pi,^{\prime}}=\frac{1}{h}E[Y_{t_{i+1}}^{\pi,^{\prime}}\Delta W_{i}|\mathcal{F}_{t_{i}}],\\ Y_{t_{i}}^{\pi,^{\prime}}=E[{Y}_{t_{i+1}}^{\pi,^{\prime}}+f(t_{i},{X}_{t_{i}}^{\pi},{Y}_{t_{i}}^{\pi,^{\prime}},{Z}_{t_{i}}^{\pi,^{\prime}})h|\mathcal{F}_{t_{i}}],\end{dcases}

(5.2)

for $0\leq i\leq N-1$ . If $b$ and $\sigma$ are independent of $y$ , then there exist deterministic functions $U_{i}^{\pi}:\mathbb{R}^{m}\rightarrow\mathbb{R},V_{i}^{\pi}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{d}$ for $0\leq i\leq N$ such that $Y_{t_{i}}^{\pi,^{\prime}}=U_{i}^{\pi}(X_{t_{i}}^{\pi})$ , $Z_{t_{i}}^{\pi,^{\prime}}=V_{i}^{\pi}(X_{t_{i}}^{\pi})$ satisfy (5.2).

Now we are ready to prove Theorem 2, with a precise statement given below. Like Theorem 1^′, the conditions below are satisfied if any of the five cases of the weak coupling and monotonicity conditions holds to certain extent.

Theorem 2^′.

Suppose Assumptions 1, 2, 3, and 4 hold true. Given any $\lambda_{1},\lambda_{3}>0$ , $\lambda_{2}\geq f_{z}$ , and $\lambda_{7}>f_{z}$ , let $\overline{A_{i}},(i=1,2,3,4)$ be defined in (4.4) and

$\displaystyle\overline{A_{5}}\coloneqq$	$\displaystyle\leavevmode\nobreak\ \lambda_{7}+2k_{f},$	(5.3)
$\displaystyle\overline{A_{0}}^{\prime}\coloneqq$	$\displaystyle\leavevmode\nobreak\ \overline{A_{2}}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}\Big{\{}g_{x}(1+\lambda_{3})e^{(\overline{A_{1}}+\overline{A_{3}})T}+\overline{A_{4}}\frac{e^{(\overline{A_{1}}+\overline{A_{3}})T}-1}{\overline{A_{1}}+\overline{A_{3}}}\Big{\}},$
$\displaystyle\overline{B_{0}}\coloneqq$	$\displaystyle\leavevmode\nobreak\ H_{\mathrm{min}}\overline{A_{2}}e^{\overline{A_{3}}T}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}[1-\overline{A_{0}}^{\prime}]^{-1}(1+\lambda_{3}^{-1})$
	$\displaystyle\leavevmode\nobreak\ \times\frac{\lambda_{7}(e^{-\overline{A_{5}}T}\vee 1)}{\lambda_{7}-f_{z}}\Big{\{}g_{x}e^{(\overline{A_{1}}+\overline{A_{5}})T}+\frac{f_{x}}{\lambda_{7}}\frac{e^{(\overline{A_{1}}+\overline{A_{5}})T}-1}{\overline{A_{1}}+\overline{A_{5}}}\Big{\}}.$

If there exist $\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{7}$ satisfying $\overline{A_{0}}^{\prime}<1$ and $\overline{B_{0}}<1$ , then there exists a constant C depending on $E|\xi|^{2}$ , $\mathscr{L}$ , $T$ , $\lambda_{1}$ , $\lambda_{2}$ , $\lambda_{3}$ , and $\lambda_{7}$ , such that for sufficiently small $h$ ,

E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq C\Big{\{}h+E|Y_{0}-Y_{0}^{\pi}|^{2}+\sum_{i=0}^{N-1}E|E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-Z_{t_{i}}^{\pi}|^{2}h\Big{\}},

(5.4)

where $\tilde{Z}_{t_{i}}=h^{-1}E[\int_{t_{i}}^{t_{i+1}}Z_{t}\,\mathrm{d}t|\mathcal{F}_{t_{i}}]$ . If $Z_{t}$ is cádlag, we can replace $\tilde{Z}_{t_{i}}$ with $Z_{t_{i}}$ . If $b$ and $\sigma$ are independent of $y$ , we can replace $E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]$ with $E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi}]$ .

Remark.

If we take the infimum within the domains of $Y_{0}^{\pi}$ and $Z_{t_{i}}^{\pi}$ on both sides, we recover the original statement in Theorem 2.

Remark.

If any of the weak coupling and monotonicity conditions introduced in Assumption 3 holds to a sufficient extent, there must exist $\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{7}$ satisfying the conditions in Theorem 2^′. The arguments are very similar to those provided in Remark Remark. Hence, we omit the details here for the sake of brevity.

Proof of Theorem 2^′.

Using Lemma 3 with $\lambda_{4}>0$ , we obtain

\displaystyle E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq(1+\lambda_{4})H_{\mathrm{min}}\sum_{i=0}^{N-1}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C(\lambda_{4})[h+E|Y_{0}-Y_{0}^{\pi}|^{2}].

(5.5)

Splitting the term $\delta\tilde{Z}_{t_{i}}=\tilde{Z}_{t_{i}}-Z_{t_{i}}^{\pi}$ and applying the generalized mean inequality, we have (recall $\overline{Z}_{t_{i}}^{\pi}$ is defined in Theorem 5)

	$\displaystyle\leavevmode\nobreak\ E\|\delta\tilde{Z}_{t_{i}}\|^{2}$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})E\|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]\|^{2}$
	$\displaystyle\leavevmode\nobreak\ +(1+\lambda_{4}^{-1})\Big{\{}E\|(\tilde{Z}_{t_{i}}-\overline{Z}_{t_{i}}^{\pi})-E[(\tilde{Z}_{t_{i}}-\overline{Z}_{t_{i}}^{\pi})\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]$
	$\displaystyle\leavevmode\nobreak\ \qquad\qquad\qquad+(E[\tilde{Z}_{t_{i}}\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-Z_{t_{i}}^{\pi})\|^{2}\Big{\}}$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})E\|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]\|^{2}$
	$\displaystyle\leavevmode\nobreak\ +3(1+\lambda_{4}^{-1})\Big{\{}E\|\tilde{Z}_{t_{i}}-\overline{Z}_{t_{i}}^{\pi}\|^{2}+E\|E[(\tilde{Z}_{t_{i}}-\overline{Z}_{t_{i}}^{\pi})\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]\|^{2}$
	$\displaystyle\leavevmode\nobreak\ \qquad\qquad\qquad\leavevmode\nobreak\ +E\|E[\tilde{Z}_{t_{i}}\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-Z_{t_{i}}^{\pi}\|^{2}\Big{\}}$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})E\|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]\|^{2}$
	$\displaystyle\leavevmode\nobreak\ +3(1+\lambda_{4}^{-1})\Big{\{}2E\|\tilde{Z}_{t_{i}}-\overline{Z}_{t_{i}}^{\pi}\|^{2}+E\|E[\tilde{Z}_{t_{i}}\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-Z_{t_{i}}^{\pi}\|^{2}\Big{\}}.$	(5.6)

From equations (3.2)(3.4), we know that

$\displaystyle\sum_{i=0}^{N-1}E\|\tilde{Z}_{t_{i}}-\overline{Z}_{t_{i}}^{\pi}\|^{2}h$	$\displaystyle\leq 2\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}[E\|Z_{t}-\tilde{Z}_{t_{i}}\|^{2}+E\|Z_{t}-\overline{Z}_{t_{i}}^{\pi}\|^{2}\mathrm{d}t$
	$\displaystyle=2\int_{0}^{T}[E\|Z_{t}-\tilde{Z}_{t_{i}}\|^{2}+E\|Z_{t}-\overline{Z}_{t_{i}}^{\pi}\|^{2}\mathrm{d}t$
	$\displaystyle\leq C(1+E\|\xi\|^{2})h.$	(5.7)

Plugging estimates (5.6)(5.7) into (5.5) gives us

	$\displaystyle\leavevmode\nobreak\ E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})^{2}H_{\mathrm{min}}\sum_{i=0}^{N-1}E\|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]\|^{2}h$
	$\displaystyle\leavevmode\nobreak\ +C(\lambda_{4})\Big{\{}h+E\|Y_{0}-Y_{0}^{\pi}\|^{2}+\sum_{i=0}^{N-1}E\|E[\tilde{Z}_{t_{i}}\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-Z_{t_{i}}^{\pi}\|^{2}h\Big{\}}.$	(5.8)

It remains to estimate the term $\sum_{i=0}^{N-1}E|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]|^{2}h$ , to which we intend to apply Lemma 4. Let $X_{t_{i}}^{\pi,1}=X_{t_{i}}^{\pi}$ and $X_{t_{i}}^{\pi,2}=\overline{X}_{t_{i}}^{\pi}$ . The associated $Z_{t_{i}}^{\pi,1}$ and $Z_{t_{i}}^{\pi,2}$ are then defined according to equation (5.1). Note that $Z_{t_{i}}^{\pi,2}=\overline{Z}_{t_{i}}^{\pi}$ but $Z_{t_{i}}^{\pi,1}$ is not necessarily equal to $Z_{t_{i}}^{\pi}$ , due to the possible violation of the terminal condition. From Lemma 5, we know $Z_{t_{i}}^{\pi,1}$ can be represented as $V_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})$ with $V_{i}^{\pi}$ being a deterministic function. By the property of conditional expectation, we have

\displaystyle E|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]|^{2}\leq E|\overline{Z}_{t_{i}}^{\pi}-V_{i}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})|^{2},

for any $V_{i}$ . Therefore we have the estimate

		$\displaystyle\leavevmode\nobreak\ \sum_{i=0}^{N-1}E\|\overline{Z}_{t_{i}}^{\pi}-E[\overline{Z}_{t_{i}}^{\pi}\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]\|^{2}h\leq\sum_{i=0}^{N-1}E\|\delta Z_{i}\|^{2}h$
	$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ \frac{\lambda_{7}(e^{-A_{5}T}\vee 1)}{\lambda_{7}-f_{z}}\Big{\{}g_{x}e^{A_{5}T-A_{5}h}E\|\delta X_{N}\|^{2}+\frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{A_{5}ih}E\|\delta X_{i}\|^{2}h\Big{\}}.$		(5.9)

Recall that $\delta X_{i}=X_{t_{i}}^{\pi}-\overline{X}_{t_{i}}^{\pi},\delta Z_{i}=Z_{t_{i}}^{\pi,1}-\overline{Z}_{t_{i}}^{\pi}$ . Similar to the derivation of estimate (4.16) (using a given $\lambda_{3}>0$ without final specification) in the proof of Theorem 1^′, when $\overline{A_{0}}^{\prime}<1$ , we have

\displaystyle\overline{P}

\displaystyle\leq(1+\lambda_{4})\overline{A_{2}}e^{\overline{A_{3}}T}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}[1-\overline{A_{0}}^{\prime}]^{-1}(1+\lambda_{3}^{-1})E|Y_{T}^{\pi}-g(X_{T}^{\pi})|^{2},

(5.10)

in which $\overline{P}=\max_{0\leq n\leq N}e^{-\overline{A_{1}}nh}E|\delta X_{i}|^{2}$ . Plugging (5.10) into (5.9), and then into (5.8), we get

\displaystyle\sum_{i=0}^{N-1}E|\delta Z_{i}|^{2}h

\displaystyle\leq\frac{\lambda_{7}(e^{-A_{5}T}\vee 1)\overline{P}}{\lambda_{7}-f_{z}}\Big{\{}g_{x}e^{(\overline{A_{1}}+A_{5})T-A_{5}h}+\frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{(\overline{A_{1}}+A_{5})ih}h\Big{\}},

and

	$\displaystyle\leavevmode\nobreak\ E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})^{3}B(h)E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}$
	$\displaystyle\leavevmode\nobreak\ +C(\lambda_{4})\Big{\{}h+E\|Y_{0}-Y_{0}^{\pi}\|^{2}+\sum_{i=0}^{N-1}E\|E[\tilde{Z}_{t_{i}}\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-Z_{t_{i}}^{\pi}\|^{2}h\Big{\}},$	(5.11)

for sufficiently small $h$ . Here $B(h)$ is defined as

	$\displaystyle B(h)=$	$\displaystyle\leavevmode\nobreak\ H_{\mathrm{min}}\overline{A_{2}}e^{\overline{A_{3}}T}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}[1-\overline{A_{0}}^{\prime}]^{-1}(1+\lambda_{3}^{-1})$
		$\displaystyle\leavevmode\nobreak\ \times\frac{\lambda_{7}(e^{-A_{5}T}\vee 1)}{\lambda_{7}-f_{z}}\Big{\{}g_{x}e^{(\overline{A_{1}}+A_{5})T-A_{5}h}+\frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{(\overline{A_{1}}+A_{5})ih}h\Big{\}}.$

The forms of inequalities (5.4) and (5.11) are already very close. When $\displaystyle{\lim_{h\rightarrow 0}B(h)}=\overline{B_{0}}<1$ , there exists $\lambda_{4}>0$ such that for sufficiently small $h$ , we have $1-(1+\lambda_{4})^{3}B(h)>\frac{1}{2}(1-\overline{B_{0}})$ . Rearranging the term $E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}$ in inequality (5.11) yields our final estimate. ∎

We shall briefly discuss how the universal approximation theorem can be applied based on Theorem 2^′. For instance, Theorem 2.1 in [22] states that every continuous and piecewise linear function with $m$ -dimensional input can be represented by a deep neural network with rectified linear units and at most $\lceil 1+\log_{2}(m+1)\rceil$ depth. Now we view $Y_{0}$ as a target function with input $\xi$ and $E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]$ as another target function with input $(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})$ . Since $E|Y_{0}|^{2}<+\infty$ and $E|E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]|^{2}\leq E|\tilde{Z}_{t_{i}}|^{2}<+\infty$ , we know that both target functions can be approximated in the $L^{2}$ sense by continuous and piecewise linear functions with arbitrary accuracy. Then the aforementioned statement implies that the two target functions can be approximated by two neural networks with rectified linear units and at most $\lceil 1+\log_{2}(m+1)\rceil$ depth, although the width might go to infinity as the approximation error decreases to 0. Therefore, according to Theorem 2^′, there exist good neural networks such that the value of the objective function is small.

Note that there still exist some concerns about the result in Theorem 2^′. First, the function $E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]$ changes when $Z_{t_{j}}^{\pi}$ changes for $j<i$ . Second, the function may depend on $Y_{t_{i}}^{\pi}$ . Even if the FBSDEs are decoupled so that the above two concerns do not exist, we know nothing a priori about the property of $E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]$ . In the next theorem, we replace $E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]$ with $\sigma^{\operatorname{T}}(t_{i},X_{t_{i}}^{\pi},u(t_{i},X_{t_{i}}^{\pi}))\nabla_{x}u(t_{i},X_{t_{i}}^{\pi})$ , which can resolve these problems. However, meanwhile we require more regularity for the coefficients of the FBSDEs.

Theorem 6.

Suppose Assumptions 1, 2, 3, 4 and the assumptions in Theorem 3 hold true. Let $u$ be the solution of corresponding quasilinear PDEs (2.7) and $L$ be the squared Lipschitz constant of $\sigma^{\operatorname{T}}(t,x,u(t,x))\nabla_{x}u(t,x)$ with respect to x. With the same notations of Theorem 2^′, when $\overline{A_{0}}^{\prime}<1$ and

\displaystyle\overline{B_{0}}^{\prime}\coloneqq H_{\mathrm{min}}L\overline{A_{2}}e^{\overline{A_{3}}T}\frac{(e^{\overline{A_{1}}T}-1)(1-e^{-(\overline{A_{1}}+\overline{A_{3}})T})}{\overline{A_{1}}(\overline{A_{1}}+\overline{A_{3}})}[1-\overline{A_{0}}^{\prime}]^{-1}(1+\lambda_{3}^{-1})<1,

there exists a constant $C>0$ depending on $E|\xi|^{2}$ , $T$ , $\mathscr{L}$ , $L$ , $\lambda_{1}$ , $\lambda_{2}$ , and $\lambda_{3}$ , such that for sufficiently small $h$ ,

E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq C\Big{\{}h+E|Y_{0}-Y_{0}^{\pi}|^{2}+\sum_{i=0}^{N-1}E|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}|^{2}h\Big{\}},

(5.12)

where $f_{i}(x)=\sigma^{\operatorname{T}}(t_{i},x,u(t_{i},x))\nabla_{x}u(t_{i},x)$ .

Proof.

By Theorem 3, we have $Z_{t_{i}}=f_{i}(X_{t_{i}})$ , in which $X_{t}$ is the solution of

X_{t}=\xi+\int_{0}^{t}b(s,X_{s},u(s,X_{s}))\,\mathrm{d}s+\int_{0}^{t}\sigma(s,X_{s},u(s,X_{s}))\,\mathrm{d}W_{s}.

Using Lemma 3 again with $\lambda_{4}>0$ gives us

\displaystyle E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq(1+\lambda_{4})H_{\mathrm{min}}\sum_{i=0}^{N-1}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C(\lambda_{4})[h+E|Y_{0}-Y_{0}^{\pi}|^{2}].

Given the continuity of $\sigma^{\operatorname{T}}(t,x,u(t,x))\nabla_{x}u(t,x)$ , we know $Z_{t}$ admits a continuous version. Hence the term $\tilde{Z}_{t_{i}}$ in $\delta\tilde{Z}_{t_{i}}=\tilde{Z}_{t_{i}}-Z_{t_{i}}^{\pi}$ can be replaced with $Z_{t_{i}}$ , i.e.,

\displaystyle E|g(X_{T}^{\pi})-Y_{T}^{\pi}|^{2}\leq(1+\lambda_{4})H_{\mathrm{min}}\sum_{i=0}^{N-1}E|Z_{t_{i}}-Z_{t_{i}}^{\pi}|^{2}h+C(\lambda_{4})[h+E|Y_{0}-Y_{0}^{\pi}|^{2}].

(5.13)

Similar to the arguments in inequalities (5.6)(5.7), we have

		$\displaystyle\leavevmode\nobreak\ E\|Z_{t_{i}}-Z_{t_{i}}^{\pi}\|^{2}$
	$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (1+\lambda_{4}^{-1})E\|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}\|^{2}+(1+\lambda_{4})E\|Z_{t_{i}}-f_{i}(X_{t_{i}}^{\pi})\|^{2}$
	$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (1+\lambda_{4}^{-1})E\|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}\|^{2}+(1+\lambda_{4})LE\|X_{t_{i}}-X_{t_{i}}^{\pi}\|^{2}$
	$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (1+\lambda_{4}^{-1})E\|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}\|^{2}$
		$\displaystyle\leavevmode\nobreak\ +(1+\lambda_{4})L[(1+\lambda_{4})E\|X_{t_{i}}^{\pi}-\overline{X}_{t_{i}}^{\pi}\|^{2}+(1+\lambda_{4}^{-1})E\|X_{t_{i}}-\overline{X}_{t_{i}}^{\pi}\|^{2}]$
	$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})^{2}LE\|X_{t_{i}}^{\pi}-\overline{X}_{t_{i}}^{\pi}\|^{2}+C(L,\lambda_{4})\Big{\{}E\|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}\|^{2}+h\Big{\}},$

where the last equality uses the convergence result (3.4). Plugging it into (5.13), we have

	$\displaystyle E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}\leq$	$\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})^{3}H_{\mathrm{min}}L\sum_{i=0}^{N-1}E\|X_{t_{i}}^{\pi}-\overline{X}_{t_{i}}^{\pi}\|^{2}h$
		$\displaystyle\leavevmode\nobreak\ +C(L,\lambda_{4})\Big{\{}h+E\|Y_{0}-Y_{0}^{\pi}\|^{2}+\sum_{i=0}^{N-1}E\|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}\|^{2}h\Big{\}}$		(5.14)

for sufficiently small $h$ .

We employ the estimate (5.10) again to rewrite inequality (5.14) as

	$\displaystyle E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}\leq$	$\displaystyle\leavevmode\nobreak\ (1+\lambda_{4})^{4}\widetilde{B}(h)E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}$
		$\displaystyle\leavevmode\nobreak\ +C(L,\lambda_{4})\Big{\{}h+E\|Y_{0}-Y_{0}^{\pi}\|^{2}+\sum_{i=0}^{N-1}E\|f_{i}(X_{t_{i}}^{\pi})-Z_{t_{i}}^{\pi}\|^{2}h\Big{\}},$		(5.15)

where

\displaystyle\widetilde{B}(h)=H_{\mathrm{min}}L\overline{A_{2}}e^{\overline{A_{3}}T}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}}[1-\overline{A_{0}}^{\prime}]^{-1}(1+\lambda_{3}^{-1})\sum_{i=0}^{N-1}e^{i\overline{A_{1}}h}h.

Arguing in the same way as that in the proof of Theorem 2^′, when $\tilde{B}(h)$ is strictly bounded above by $1$ for sufficiently small $h$ , we can choose $\lambda_{4}$ small enough and rearrange the terms in inequality (5.15) to obtain the result in inequality (5.12).

∎

Remark.

The Lipschitz constant used in Theorem 6 may be further estimated a priori. Denote the Lipschitz constant of function $f$ with respect to $x$ as $L_{x}(f)$ , and the bound of function $f$ as $M(f)$ . Loosely speaking, we have

L_{x}(\sigma^{\operatorname{T}}(t,x,u(t,x))\nabla_{x}u(t,x))\leq M(\sigma)L_{x}(\nabla_{x}u)+M(\nabla_{x}u)[L_{x}(\sigma)+L_{y}(\sigma)L_{x}(u)].

Here $L_{x}(u)=M(\nabla_{x}u(t,x))$ can be estimated from the first point of Theorem 4 and $L(\nabla_{x}u(t,x))=M(\nabla_{xx}u)$ can be estimated through the Schauder estimate (see, e.g., [32, Chapter 4, Lemma 2.1]). Note that the resulting estimate may depend on the dimension $d$ .

5.1 Proof of Lemmas

Proof of Lemma 3.

We construct continuous processes $X_{t}^{\pi},Y_{t}^{\pi}$ as follows. For $t\in[t_{i},t_{i+1})$ , let

	$\displaystyle X_{t}^{\pi}$	$\displaystyle=X_{t_{i}}^{\pi}+b(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})(t-t_{i})+\sigma(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})(W_{t}-W_{t_{i}}),$
	$\displaystyle Y_{t}^{\pi}$	$\displaystyle=Y_{t_{i}}^{\pi}-f(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi})(t-t_{i})+(Z_{t_{i}}^{\pi})^{\operatorname{T}}(W_{t}-W_{t_{i}}).$

From system (2.3), we see this definition also works at $t_{i+1}$ . We are interested in again the estimates of the following terms

\displaystyle\delta X_{t}

\displaystyle=X_{t}-X_{t}^{\pi},\quad\delta Y_{t}=Y_{t}-Y_{t}^{\pi},\quad\delta Z_{t}=Z_{t}-Z_{t_{i}}^{\pi},\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ t\in[t_{i},t_{i+1}).

For $t\in[t_{i},t_{i+1})$ , let

	$\displaystyle\delta b_{t}$	$\displaystyle=b(t,X_{t},Y_{t})-b(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}),$
	$\displaystyle\delta\sigma_{t}$	$\displaystyle=\sigma(t,X_{t},Y_{t})-\sigma(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}),$
	$\displaystyle\delta f_{t}$	$\displaystyle=f(t,X_{t},Y_{t},Z_{t})-f(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi}).$

By definition,

	$\displaystyle\mathrm{d}(\delta X_{t})$	$\displaystyle=\delta b_{t}\,\mathrm{d}t+\delta\sigma_{t}\,\mathrm{d}W_{t},$
	$\displaystyle\mathrm{d}(\delta Y_{t})$	$\displaystyle=-\delta f_{t}\,\mathrm{d}t+(\delta Z_{t})^{\operatorname{T}}\,\mathrm{d}W_{t}.$

Then by Itô’s formula, we have

	$\displaystyle\mathrm{d}\|\delta X_{t}\|^{2}$	$\displaystyle=[2(\delta b_{t})^{T}\delta X_{t}+\\|\delta\sigma_{t}\\|^{2}]\mathrm{d}t+2(\delta X_{t})^{\operatorname{T}}\delta\sigma_{t}\,\mathrm{d}W_{t},$
	$\displaystyle\mathrm{d}\|\delta Y_{t}\|^{2}$	$\displaystyle=[-2(\delta f_{t})^{\operatorname{T}}\delta Y_{t}+\|\delta Z_{t}\|^{2}]\,\mathrm{d}t+2\delta Y_{t}(\delta Z_{t})^{\operatorname{T}}\,\mathrm{d}W_{t}.$

Thus,

	$\displaystyle E\|\delta X_{t}\|^{2}$	$\displaystyle=E\|\delta X_{t_{i}}\|^{2}+\int_{t_{i}}^{t}E[2(\delta b_{s})^{\operatorname{T}}\delta X_{s}+\\|\delta\sigma_{s}\\|^{2}]\,\mathrm{d}s,$
	$\displaystyle E\|\delta Y_{t}\|^{2}$	$\displaystyle=E\|\delta Y_{t_{i}}\|^{2}+\int_{t_{i}}^{t}E[-2(\delta f_{s})^{\operatorname{T}}\delta Y_{s}+\|\delta Z_{s}\|^{2}]\,\mathrm{d}s.$

For any $\lambda_{5},\lambda_{6}>0$ , using Assumptions 1, 2 and the RMS-GM inequality, we have

	$\displaystyle\leavevmode\nobreak\ E\|\delta X_{t}\|^{2}$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ E\|\delta X_{t_{i}}\|^{2}+\int_{t_{i}}^{t}[\lambda_{5}E\|\delta X_{s}\|^{2}+\lambda_{5}^{-1}E\|\delta b_{s}\|^{2}+E\\|\delta\sigma_{s}\\|^{2}]\,\mathrm{d}s$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ E\|\delta X_{t_{i}}\|^{2}+\lambda_{5}\int_{t_{i}}^{t}E\|\delta X_{s}\|^{2}\,\mathrm{d}s+\int_{t_{i}}^{t}K(\lambda_{5}^{-1}+1)\|s-t_{i}\|\,\mathrm{d}s$
	$\displaystyle\leavevmode\nobreak\ +\int_{t_{i}}^{t}[(K\lambda_{5}^{-1}+\sigma_{x})E\|X_{s}-X_{t_{i}}^{\pi}\|^{2}+(b_{y}\lambda_{5}^{-1}+\sigma_{y})E\|Y_{s}-Y_{t_{i}}^{\pi}\|^{2}]\,\mathrm{d}s.$	(5.16)

By the RMS-GM inequality, we also have

	$\displaystyle E\|X_{s}-X_{t_{i}}^{\pi}\|^{2}$	$\displaystyle\leq(1+\epsilon_{1})E\|\delta X_{t_{i}}\|^{2}+(1+\epsilon_{1}^{-1})E\|X_{s}-X_{t_{i}}\|^{2},$		(5.17)
	$\displaystyle E\|Y_{s}-Y_{t_{i}}^{\pi}\|^{2}$	$\displaystyle\leq(1+\epsilon_{2})E\|\delta Y_{t_{i}}\|^{2}+(1+\epsilon_{2}^{-1})E\|Y_{s}-Y_{t_{i}}\|^{2},$		(5.18)

in which we choose $\epsilon_{1}=\lambda_{6}(K\lambda_{5}^{-1}+\sigma_{x})^{-1}$ and $\epsilon_{2}=\lambda_{6}(b_{y}\lambda_{5}^{-1}+\sigma_{y})^{-1}$ . The path regularity in Theorem 4 tells us

\displaystyle\sup_{s\in[t_{i},t_{i+1}]}(E|X_{s}-X_{t_{i}}|^{2}+E|Y_{s}-Y_{t_{i}}|^{2})\leq Ch.

(5.19)

Plugging inequalities (5.17)(5.18)(5.19) into (5.16) with simplification, we obtain

	$\displaystyle E\|\delta X_{t}\|^{2}$	$\displaystyle\leq[1+(K\lambda_{5}^{-1}+\sigma_{x}+\lambda_{6})h]E\|\delta X_{t_{i}}\|^{2}+\lambda_{5}\int_{t_{i}}^{t}E\|\delta X_{s}\|^{2}\,\mathrm{d}s$
		$\displaystyle\hphantom{=\leavevmode\nobreak\ }+(b_{y}\lambda_{5}^{-1}+\sigma_{y}+\lambda_{6})E\|\delta Y_{t_{i}}\|^{2}h+C(\lambda_{5},\lambda_{6})h^{2}.$		(5.20)

Then, by Grönwall inequality, we have

	$\displaystyle E\|\delta X_{t_{i+1}}\|^{2}$
$\displaystyle\leq\leavevmode\nobreak\$	$\displaystyle e^{\lambda_{5}h}\{[1+(K\lambda_{5}^{-1}+\sigma_{x}+\lambda_{6})h]E\|\delta X_{t_{i}}\|^{2}$
	$\displaystyle\quad\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +(b_{y}\lambda_{5}^{-1}+\sigma_{y}+\lambda_{6})E\|\delta Y_{t_{i}}\|^{2}h+C(\lambda_{5},\lambda_{6})h^{2}\}$
$\displaystyle\leq\leavevmode\nobreak\$	$\displaystyle e^{A_{6}h}E\|\delta X_{t_{i}}\|^{2}+e^{\lambda_{5}h}(b_{y}\lambda_{5}^{-1}+\sigma_{y}+\lambda_{6})E\|\delta Y_{t_{i}}\|^{2}h+C(\lambda_{5},\lambda_{6})h^{2}$
$\displaystyle\leq\leavevmode\nobreak\$	$\displaystyle e^{A_{6}h}E\|\delta X_{t_{i}}\|^{2}+A_{7}E\|\delta Y_{t_{i}}\|^{2}h+C(\lambda_{5},\lambda_{6})h^{2},$	(5.21)

where $A_{6}\coloneqq K\lambda_{5}^{-1}+\sigma_{x}+\lambda_{5}+\lambda_{6}$ , $A_{7}\coloneqq b_{y}\lambda_{5}^{-1}+\sigma_{y}+2\lambda_{6}$ , and $h$ is sufficiently small.

Similarly, with the same type of estimates in (5.16)(5.20), for any $\lambda_{5},\lambda_{6}>0$ , we have

		$\displaystyle\leavevmode\nobreak\ E\|\delta Y_{t}\|^{2}$
	$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ E\|\delta Y_{t_{i}}\|^{2}+\int_{t_{i}}^{t}[\lambda_{5}E\|\delta Y_{s}\|^{2}+\lambda_{5}^{-1}E\|\delta f_{s}\|^{2}+E\|\delta Z_{s}\|^{2}]\,\mathrm{d}s$
	$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ E\|\delta Y_{t_{i}}\|^{2}+\lambda_{5}\int_{t_{i}}^{t}E\|\delta Y_{s}\|^{2}\,\mathrm{d}s+\int_{t_{i}}^{t}K\lambda_{5}^{-1}\|s-t_{i}\|\,\mathrm{d}s$
		$\displaystyle\leavevmode\nobreak\ +\int_{t_{i}}^{t}\lambda_{5}^{-1}[f_{x}E\|X_{s}-X_{t_{i}}^{\pi}\|^{2}+KE\|Y_{s}-Y_{t_{i}}^{\pi}\|^{2}]\,\mathrm{d}s+(1+f_{z}\lambda_{5}^{-1})\int_{t_{i}}^{t}E\|\delta Z_{s}\|^{2}\,\mathrm{d}s$
	$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ [1+(K\lambda_{5}^{-1}+\lambda_{6})h]E\|\delta Y_{t_{i}}\|^{2}+\lambda_{5}\int_{t_{i}}^{t}E\|\delta Y_{s}\|^{2}\,\mathrm{d}s+(f_{x}\lambda_{5}^{-1}+\lambda_{6})E\|\delta X_{t_{i}}^{\pi}\|^{2}h$
		$\displaystyle\leavevmode\nobreak\ +(1+f_{z}\lambda_{5}^{-1})\int_{t_{i}}^{t}E\|\delta Z_{s}\|^{2}\,\mathrm{d}s+C(\lambda_{5},\lambda_{6})h^{2}.$

Arguing in the same way of (5.21), by Grönwall inequality, for sufficiently small $h$ , we have

		$\displaystyle E\|\delta Y_{t_{i+1}}\|^{2}$
	$\displaystyle\leq\leavevmode\nobreak\$	$\displaystyle e^{A_{8}h}E\|\delta Y_{t_{i}}\|^{2}+A_{9}E\|\delta X_{t_{i}}\|^{2}h+(1+f_{z}\lambda_{5}^{-1}+\lambda_{6})\int_{t_{i}}^{t}E\|\delta Z_{s}\|^{2}\,\mathrm{d}s+C(\lambda_{5},\lambda_{6})h^{2},$

with $A_{8}\coloneqq K\lambda_{5}^{-1}+\lambda_{5}+\lambda_{6}$ , $A_{9}\coloneqq f_{x}\lambda_{5}^{-1}+2\lambda_{6}$ . Choosing $\epsilon_{3}=(1+f_{z}\lambda_{5}^{-1}+\lambda_{6})^{-1}\lambda_{6}$ and using

\int_{t_{i}}^{t_{i+1}}E|\delta Z_{t}|^{2}\,\mathrm{d}t\leq(1+\epsilon_{3})E|\delta\tilde{Z}_{t_{i}}|^{2}h+(1+\epsilon_{3}^{-1})E_{z}^{i},

where $\delta\tilde{Z}_{t_{i}}=\tilde{Z}_{t_{i}}-Z_{t_{i}}^{\pi}$ and $E_{z}^{i}=\int_{t_{i}}^{t_{i+1}}E|Z_{t}-\tilde{Z}_{t_{i}}|^{2}\,\mathrm{d}t$ , we furthermore obtain

E|\delta Y_{t_{i+1}}|^{2}\leq e^{A_{8}h}E|\delta Y_{t_{i}}|^{2}+A_{9}E|\delta X_{t_{i}}|^{2}h+A_{10}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C(\lambda_{5},\lambda_{6})(h^{2}+E_{z}^{i}),

(5.22)

with $A_{10}\coloneqq 1+f_{z}\lambda_{5}^{-1}+2\lambda_{6}$ .

Define

\displaystyle M_{i}=\max\{E|\delta X_{i}|^{2},E|\delta Y_{i}|^{2}\},\quad 0\leq i\leq N.

Combining inequalities (5.21)(5.22) together yields

		$\displaystyle M_{i+1}$
	$\displaystyle\leq\leavevmode\nobreak\$	$\displaystyle(e^{\max\{A_{6},A_{8}\}h}+\max\{A_{7},A_{9}\}h)M_{i}+A_{10}E\|\delta\tilde{Z}_{t_{i}}\|^{2}h+C(\lambda_{5},\lambda_{6})(h^{2}+E_{z}^{i})$
	$\displaystyle\leq\leavevmode\nobreak\$	$\displaystyle e^{(\max\{A_{6},A_{8}\}+\max\{A_{7},A_{9}\})h}M_{i}+A_{10}E\|\delta\tilde{Z}_{t_{i}}\|^{2}h+C(\lambda_{5},\lambda_{6})(h^{2}+E_{z}^{i}).$

Letting $A_{11}\coloneqq\max\{A_{6},A_{8}\}+\max\{A_{7},A_{9}\}$ , we have

M_{i+1}\leq e^{A_{11}h}M_{i}+A_{10}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C(\lambda_{5},\lambda_{6})(h^{2}+E_{z}^{i}).

(5.23)

We start from $M_{0}=E|Y_{0}-Y_{0}^{\pi}|^{2}$ and apply inequality (5.23) repeatedly to obtain

M_{N}\leq A_{10}e^{A_{11}T}\sum_{i=0}^{N-1}E|\delta\tilde{Z}_{t_{i}}|^{2}h+C(\lambda_{5},\lambda_{6})[h+E|Y_{0}-Y_{0}^{\pi}|^{2}],

(5.24)

in which for the last term we use the fact $\sum_{i=0}^{N-1}E_{z}^{i}\leq Ch$ from inequality (3.2). Note that

	$\displaystyle A_{10}$	$\displaystyle=1+f_{z}\lambda_{5}^{-1}+2\lambda_{6},$
	$\displaystyle A_{11}$	$\displaystyle\leq 2K+2K\lambda_{5}^{-1}+\lambda_{5}+3\lambda_{6}.$

Given any $\lambda_{4}>0$ , we can choose $\lambda_{6}$ small enough such that

\displaystyle(1+f_{z}\lambda_{5}^{-1}+2\lambda_{6})e^{A_{11}T}

\displaystyle\leq(1+\lambda_{4})(1+f_{z}\lambda_{5}^{-1})e^{(2K+2K\lambda_{5}^{-1}+\lambda_{5})T}.

This condition and inequality (5.24) together give us

	$\displaystyle M_{N}\leq\leavevmode\nobreak\$	$\displaystyle(1+\lambda_{4})(1+f_{z}\lambda_{5}^{-1})e^{(2K+2K\lambda_{5}^{-1}+\lambda_{5})T}\sum_{i=0}^{N-1}E\|\delta\tilde{Z}_{t_{i}}\|^{2}h$
		$\displaystyle+C(\lambda_{4},\lambda_{5})[h+E\|Y_{0}-Y_{0}^{\pi}\|^{2}].$		(5.25)

Finally, by decomposing the objective function, we have

	$\displaystyle\leavevmode\nobreak\ E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ E\|g(X_{T}^{\pi})-g(X_{T})+Y_{T}-Y_{T}^{\pi}\|^{2}$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (1+(\sqrt{g_{x}})^{-1})E\|g(X_{T}^{\pi})-g(X_{T})\|^{2}+(1+\sqrt{g_{x}})E\|\delta Y_{N}\|^{2}$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (g_{x}+\sqrt{g_{x}})E\|\delta X_{N}\|^{2}+(1+\sqrt{g_{x}})E\|\delta Y_{N}\|^{2}$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (1+\sqrt{g_{x}})^{2}M_{N}.$	(5.26)

We complete our proof by combining inequalities (5.25)(5.26) and choosing $\lambda_{5}=\mathrm{argmin}_{x\in\mathbb{R}^{+}}H(x)$ . ∎

Proof of Lemma 4.

We use the same notations as in the proof of Lemma 1. As derived in (4.12), for any $\lambda_{7}>f_{z}\geq 0$ , we have

E|\delta Y_{i+1}|^{2}\geq[1-(2k_{f}+\lambda_{7})h]E|\delta Y_{i}|^{2}+(1-f_{z}\lambda_{7}^{-1})E|\delta Z_{i}|^{2}h-f_{x}\lambda_{7}^{-1}E|\delta X_{i}|^{2}h.

(5.27)

Multiplying both sides of (5.27) by $e^{A_{5}ih}(e^{-A_{5}T}\vee 1)/(1-f_{z}\lambda_{7}^{-1})$ gives us

	$\displaystyle\leavevmode\nobreak\ \frac{\lambda_{7}(e^{-A_{5}T}\vee 1)}{\lambda_{7}-f_{z}}\Big{\{}e^{A_{5}ih}E\|\delta Y_{i+1}\|^{2}-e^{A_{5}(i-1)h}E\|\delta Y_{i}\|^{2}+e^{A_{5}ih}\frac{f_{x}}{\lambda_{7}}E\|\delta X_{i}\|^{2}h\Big{\}}$
$\displaystyle\geq$	$\displaystyle\leavevmode\nobreak\ e^{A_{5}ih}(e^{-A_{5}T}\vee 1)E\|\delta Z_{i}\|^{2}h$
$\displaystyle\geq$	$\displaystyle E\|\delta Z_{i}\|^{2}h.$	(5.28)

Summing (5.28) up from $i=0$ to $N-1$ , we obtain

\displaystyle\sum_{i=0}^{N-1}E|\delta Z_{i}|^{2}h\leq\frac{\lambda_{7}(e^{-A_{5}T}\vee 1)}{\lambda_{7}-f_{z}}\Big{\{}e^{A_{5}T-A_{5}h}E|\delta Y_{N}|^{2}+\frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{A_{5}ih}E|\delta X_{i}|^{2}h\Big{\}}.

(5.29)

Note that $E|\delta Y_{N}|^{2}\leq g_{x}E|\delta X_{N}|^{2}$ by Assumption 1. Plugging it into (5.29), we arrive at the desired result. ∎

Proof of Lemma 5.

We prove by induction backwardly. Let $Z_{t_{N}}^{\pi,^{\prime}}=0$ for convenience. It is straightforward to see that the statement holds for $i=N$ . Assume the statement holds for $i=k+1$ . For $i=k$ , we know $Y_{t_{k+1}}^{\pi,^{\prime}}=U_{k+1}(X_{t_{k+1}}^{\pi},Y_{t_{k+1}}^{\pi})$ . Recalling the definition of $\{X_{t_{i}}^{\pi}\}_{0\leq i\leq N},\{Y_{t_{i}}^{\pi}\}_{0\leq i\leq N}$ in (2.3), we can rewrite $Y_{t_{k+1}}^{\pi,^{\prime}}=\overline{U}_{{k}}(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi},\Delta W_{k})$ , with $\overline{U}_{k}:\mathbb{R}^{m}\times\mathbb{R}\times\mathbb{R}^{d}\rightarrow\mathbb{R}$ being a deterministic function. Note $Z_{t_{k}}^{\pi,^{\prime}}=h^{-1}E[\overline{U}_{k}(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi},\Delta W_{k})\Delta W_{k}|\mathcal{F}_{t_{k}}]$ . Since $\Delta W_{k}$ is independent of $\mathcal{F}_{t_{k}}$ , there exists a deterministic function $V_{k}^{\pi}:\mathbb{R}^{m}\times\mathbb{R}\rightarrow\mathbb{R}^{d}$ such that $Z_{t_{k}}^{\pi,^{\prime}}=V_{k}^{\pi}(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi})$ .

Next we consider $Y_{t_{k}}^{\pi,^{\prime}}$ . Let $H_{k}=L^{2}(\Omega,\sigma(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi}),\mathbb{P})$ , where $\sigma(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi})$ denotes the $\sigma$ -algebra generated by $X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi}$ . We know $H_{k}$ is a Banach space and another equivalent representation is

H_{k}=\{Y=\phi(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi})\leavevmode\nobreak\ |\leavevmode\nobreak\ \phi\text{ is measurable and }E|Y|^{2}<\infty\}.

Consider the following map defined on $H_{k}$ :

\Phi_{k}(Y)=E[{Y}_{t_{k+1}}^{\pi,^{\prime}}+f(t_{k},{X}_{t_{k}}^{\pi},Y,{Z}_{t_{k}}^{\pi,^{\prime}})h|\mathcal{F}_{t_{k}}].

By Assumption 3, $\Phi_{k}(Y)$ is square-integrable. Furthermore, following the same argument for $Z_{t_{k}}^{\pi,^{\prime}}$ , $\Phi_{k}(Y)$ can also be represented as a deterministic function of $X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi}$ . Hence, $\Phi_{k}(Y)\in H_{k}$ . Note that Assumption 1 implies $E|\Phi_{k}(Y_{1})-\Phi_{k}(Y_{2})|^{2}\leq Kh^{2}E|Y_{1}-Y_{2}|^{2}$ . Therefore $\Phi_{k}$ is a contraction map on $H_{k}$ when $h<1/\sqrt{K}$ . By the Banach fixed-point theorem, there exists a unique fixed-point $Y^{*}=\phi_{k}^{*}(X_{t_{k}}^{\pi},Y_{t_{k}}^{\pi})\in H_{k}$ satisfying $Y^{*}=\Phi_{k}(Y^{*})$ . We choose $U_{k}^{\pi}=\phi_{k}^{*}$ to validate the statement for $Y_{t_{k}}^{\pi,^{\prime}}$ .

When $b$ and $\sigma$ are independent of $y$ , all of the arguments above can be made similarly with $U_{i}^{\pi},V_{i}^{\pi}$ also being independent of $Y$ . ∎

6 Numerical Examples

6.1 General Setting

In this section, we illustrate the proposed numerical scheme by solving two high-dimensional coupled FBSDEs adapted from the literature. The common setting for these two numerical examples is as follows. We assume $d=m=100$ , that is, $X_{t},Z_{t},W_{t}\in\mathbb{R}^{100}$ . Assume $\xi$ is deterministic and we are interested in the approximation error of $Y_{0}$ , which is also a deterministic scalar.

We use $N-1$ fully-connected feedforward neural networks to parameterize $\phi_{i}^{\pi},i=0,1,\dots,N-1$ . Each of the neural networks has 2 hidden layers with dimension $d+10$ . The input has dimension $d+1\leavevmode\nobreak\ (X_{i}\in\mathbb{R}^{d},Y_{i}\in\mathbb{R})$ and the output has dimension $d$ . In practice, one can of course choose $X_{i}$ only as the input. We additionally test this input for the two examples and find no difference in terms of the relative error of $Y_{0}$ (up to second decimal place). We use the rectifier function (ReLU) as the activation function and adopt batch normalization [36] right after each matrix multiplication and before activation. We employ the Adam optimizer [37] to optimize the parameters with batch-size being 64. The loss function is computed based on 256 validation sample paths. We initialize all the parameters using a uniform or normal distribution and run each experiment 5 times to report the average result.

6.2 Example 1

The first problem is adapted from [38, 39], in which the original spatial dimension of the problem is 1. We consider the following coupled FBSDEs

\displaystyle\begin{dcases}X_{j,t}=x_{j,0}+\int_{0}^{t}\frac{X_{j,s}(1+X_{j,s}^{2})}{(2+X_{j,s})^{3}}\,\mathrm{d}s\\ \qquad\quad+\int_{0}^{t}\frac{1+X_{j,s}^{2}}{2+X_{j,s}^{2}}\sqrt{\frac{1+2Y_{s}^{2}}{1+Y_{s}^{2}+\exp{\Big{(}-\frac{2|X_{s}|^{2}}{d(s+5)}\Big{)}}}}\,\mathrm{d}W_{j,s},\quad j=1,\dots,d,\\ Y_{t}=\exp{\Big{(}-\frac{|X_{T}|^{2}}{d(T+5)}\Big{)}}\\ \qquad\leavevmode\nobreak\ +\int_{t}^{T}a(s,X_{s},Y_{s})+\sum_{j=1}^{d}b(s,X_{j,s},Y_{s})Z_{j,s}\,\mathrm{d}s-\int_{t}^{T}(Z_{s})^{\operatorname{T}}\,\mathrm{d}W_{s},\end{dcases}

(6.1)

where $X_{j,t},Z_{j,t},W_{j,t}$ denote the $j$ -th components of $X_{t},Y_{t},W_{t}$ , and the coefficient functions are given as

	$\displaystyle a(t,x,u)=$	$\displaystyle\leavevmode\nobreak\ \frac{1}{d(t+5)}\exp{\Big{(}-\frac{\|x\|^{2}}{d(t+5)}\Big{)}}$
		$\displaystyle\leavevmode\nobreak\ \times\sum_{j=1}^{d}\Bigg{\{}\frac{4x_{j}^{2}(1+x_{j}^{2})}{(2+x_{j}^{2})^{3}}+\frac{(1+x_{j}^{2})^{2}}{(2+x_{j}^{2})^{2}}-\frac{2x_{j}^{2}(1+x_{j}^{2})^{2}}{d(t+5)(2+x_{j}^{2})^{2}}-\frac{x_{j}^{2}}{t+5}\Bigg{\}},$
	$\displaystyle b(t,x_{j},u)=$	$\displaystyle\leavevmode\nobreak\ \frac{x_{j}}{(2+x_{j}^{2})^{2}}\sqrt{\frac{1+u^{2}+\exp{\Big{(}-\frac{\|x\|^{2}}{d(t+5)}\Big{)}}}{1+2u^{2}}}.$

It can be verified by Itô’s formula that the $Y$ part of the solution of (6.1) is given by

\displaystyle Y_{t}=\exp{\Big{(}-\frac{|X_{t}|^{2}}{d(t+5)}\Big{)}}.

Let $\xi=(1,1,\dots,1)$ ( $100$ -dimensional), $T=5,N=160$ . The initial guess of $Y_{0}$ is generated from a uniform distribution on the interval $[2,4]$ while the true value of $Y_{0}\approx 0.81873$ . We train 25000 steps with an exponential decay learning rate that decays every 100 steps, with the starting learning rate being 1e-2 and ending learning rate being 1e-5. Figure 1 illustrates the mean of the loss function and relative approximation error of $Y_{0}$ against the number of iteration steps. All runs converged and the average final relative error of $Y_{0}$ is $0.39\%$ .

Refer to caption — Figure 1: Loss function (left) and relative approximation error of $Y_{0}$ (right) against the number of iteration steps in the case of Example 1 ( $100$ -dimensional). The proposed deep BSDE method achieves a relative error of size $0.39\%$ . The shaded area depicts the mean $\pm$ the standard deviation of the associated quantity in 5 runs.

6.3 Example 2

The second problem is adapted from [16], in which the spatial dimension is originally tested up to 10. The coupled FBSDEs are given by

\displaystyle\begin{dcases}X_{j,t}=x_{j,0}+\int_{0}^{t}\sigma Y_{s}\,\mathrm{d}W_{j,s},\quad j=1,\dots,d,\\ Y_{t}=D\sum_{j=1}^{d}\sin(X_{j,T})\\ \qquad\leavevmode\nobreak\ +\int_{t}^{T}-rY_{s}+\frac{1}{2}e^{-3r(T-s)\sigma^{2}}\Big{(}D\sum_{j=1}^{d}\sin(X_{j,s})\Big{)}^{3}\,\mathrm{d}s-\int_{t}^{T}(Z_{s})^{\operatorname{T}}\,\mathrm{d}W_{s},\end{dcases}

(6.2)

where $\sigma>0,r,D$ are constants. One can easily check by Itô’s formula that the $Y$ part of the solution of (6.2) is

Y_{t}=e^{-r(T-t)}D\sum_{j=1}^{d}\sin(X_{j,t}).

Let $\xi=(\pi/2,\pi/2,\dots,\pi/2)$ ( $100$ -dimensional), $T=1,r=0.1,\sigma=0.3,D=0.1$ . The initial guess of $Y_{0}$ is generated from a uniform distribution on the interval $[0,1]$ while the true value of $Y_{0}\approx 9.04837$ . We train 5000 steps with an exponential decay learning rate that decays every 100 steps, with the starting learning rate being 1e-2 and the ending learning rate being 1e-3. When $h=0.005\leavevmode\nobreak\ (N=200)$ , the relative approximation error of $Y_{0}$ is $0.09\%$ . Furthermore, we test the influence of the time partition by choosing different values of $N$ . In all cases, the training converged, and we plot in Figure 2 the mean of relative error of $Y_{0}$ against the number of time steps $N$ . It is clearly shown that the error decreases as $N$ increases ( $h$ decreases).

References

[1] Etienne Pardoux and Shige Peng. Backward stochastic differential equations and quasilinear parabolic partial differential equations. In Stochastic Partial Differential Equations and Their Applications, pages 200–217. Springer, 1992.
[2] Richard E. Bellman. Dynamic Programming. Princeton University Press, 1957.
[3] Christian Bender and Jessica Steiner. Least-squares Monte Carlo for backward SDEs. In Numerical Methods in Finance, pages 257–289. Springer, 2012.
[4] Bruno Bouchard, Ivar Ekeland, and Nizar Touzi. On the Malliavin approach to Monte Carlo approximation of conditional expectations. Finance and Stochastics, 8(1):45–71, 2004.
[5] Bruno Bouchard and Nizar Touzi. Discrete-time approximation and Monte-Carlo simulation of backward stochastic differential equations. Stochastic Processes and their Applications, 111(2):175–206, 2004.
[6] Pierre Henry-Labordere. Counterparty risk valuation: A marked branching diffusion approach. Available at SSRN 1995503, 2012.
[7] Pierre Henry-Labordere, Nadia Oudjane, Xiaolu Tan, Nizar Touzi, Xavier Warin, et al. Branching diffusion representation of semilinear PDEs and Monte Carlo approximation. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 55, pages 184–210. Institut Henri Poincaré, 2019.
[8] Martin Hutzenthaler, Arnulf Jentzen, and Thomas Kruse. Multilevel Picard iterations for solving smooth semilinear parabolic heat equations. arXiv preprint arXiv:1607.03295, 2016.
[9] Weinan E, Martin Hutzenthaler, Arnulf Jentzen, and Thomas Kruse. On multilevel Picard numerical approximations for high-dimensional nonlinear parabolic partial differential equations and high-dimensional nonlinear backward stochastic differential equations. Journal of Scientific Computing, 79(3):1534–1571, 2019.
[10] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, Tuan Anh Nguyen, and Philippe von Wurstemberger. Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations. arXiv preprint arXiv:1807.01212, 2018.
[11] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
[12] Weinan E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.
[13] Christian Beck, Weinan E, and Arnulf Jentzen. Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. arXiv preprint arXiv:1709.05963, 2017.
[14] Jiequn Han and Ruimeng Hu. Deep fictitious play for finding Markovian Nash equilibrium in multi-agent games. arXiv preprint arXiv:1912.01809, 2019.
[15] Jiequn Han, Jianfeng Lu, and Mo Zhou. Solving high-dimensional eigenvalue problems using deep neural networks: A diffusion Monte Carlo like approach. arXiv preprint arXiv:2002.02600, 2020.
[16] Christian Bender and Jianfeng Zhang. Time discretization and Markovian iteration for coupled FBSDEs. The Annals of Applied Probability, 18(1):143–177, 2008.
[17] Jin Ma, Philip Protter, and Jiongmin Yong. Solving forward-backward stochastic differential equations explicitly–a four step scheme. Probability Theory and Related Fields, 98(3):339–359, 1994.
[18] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[19] Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.
[20] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
[21] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[22] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[23] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.
[24] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.
[25] Hrushikesh N Mhaskar and Tomaso Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848, 2016.
[26] Helmut Bölcskei, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. Optimal approximation with sparsely connected deep neural networks. arXiv preprint arXiv:1705.01714, 2017.
[27] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[28] Philipp Grohs, Fabian Hornung, Arnulf Jentzen, and Philippe von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv preprint arXiv:1809.02362, 2018.
[29] Arnulf Jentzen, Diyora Salimova, and Timo Welti. A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients. arXiv preprint arXiv:1809.07321, 2018.
[30] Julius Berner, Philipp Grohs, and Arnulf Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv preprint arXiv:1809.03062, 2018.
[31] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, and Tuan Anh Nguyen. A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations. SN Partial Differential Equations and Applications, 1:1–34, 2020.
[32] Jin Ma and Jiongmin Yong. Forward-Backward Stochastic Differential Equations and their Applications. Springer Berlin Heidelberg, 2007.
[33] Fabio Antonelli. Backward-forward stochastic differential equations. The Annals of Applied Probability, pages 777–793, 1993.
[34] Etienne Pardoux and Shanjian Tang. Forward-backward stochastic differential equations and quasilinear parabolic PDEs. Probability Theory and Related Fields, 114(2):123–150, 1999.
[35] Jianfeng Zhang. A numerical scheme for BSDEs. The Annals of Applied Probability, 14(1):459–488, 2004.
[36] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
[37] Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[38] G.N. Milstein and M.V. Tretyakov. Numerical algorithms for forward-backward stochastic differential equations. SIAM Journal on Scientific Computing, 28(2):561–582, 2006.
[39] T.P. Huijskens, M.J. Ruijter, and C.W. Oosterlee. Efficient numerical Fourier methods for coupled forward–backward SDEs. Journal of Computational and Applied Mathematics, 296:593–612, 2016.

		$\displaystyle\inf_{\mu_{0}^{\pi}\in\mathcal{N}^{\prime}_{0},\phi_{i}^{\pi}\in\mathcal{N}_{i}}E\|g(X_{T}^{\pi})-Y_{T}^{\pi}\|^{2}$
	$\displaystyle\leq\leavevmode\nobreak\$	$\displaystyle C\Big{\{}h+\inf_{\mu_{0}^{\pi}\in\mathcal{N}^{\prime}_{0},\phi_{i}^{\pi}\in\mathcal{N}_{i}}\big{[}E\|Y_{0}-\mu_{0}^{\pi}(\xi)\|^{2}$
		$\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\leavevmode\nobreak\ \leavevmode\nobreak\ +\sum_{i=0}^{N-1}E\|E[\tilde{Z}_{t_{i}}\|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}]-\phi_{i}^{\pi}(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi})\|^{2}h\big{]}\Big{\}},$

	$\displaystyle\|b(t,x_{1},y_{1})-b(t,x_{2},y_{2})\|^{2}$	$\displaystyle\leq K\|\Delta x\|^{2}+b_{y}\|\Delta y\|^{2},$
	$\displaystyle\\|\sigma(t,x_{1},y_{1})-\sigma(t,x_{2},y_{2})\\|^{2}$	$\displaystyle\leq\sigma_{x}\|\Delta x\|^{2}+\sigma_{y}\|\Delta y\|^{2},$
	$\displaystyle\|f(t,x_{1},y_{1},z_{1})-f(t,x_{2},y_{2},z_{2})\|^{2}$	$\displaystyle\leq f_{x}\|\Delta x\|^{2}+K\|\Delta y\|^{2}+f_{z}\|\Delta z\|^{2},$
	$\displaystyle\|g(x_{1})-g(x_{2})\|^{2}$	$\displaystyle\leq g_{x}\|\Delta x\|^{2}.$

	$\displaystyle\|b(t,x,y)\|^{2}$	$\displaystyle\leq b_{0}+K\|x\|^{2}+b_{y}\|y\|^{2},$
	$\displaystyle\\|\sigma(t,x,y)\\|^{2}$	$\displaystyle\leq\sigma_{0}+\sigma_{x}\|x\|^{2}+\sigma_{y}\|y\|^{2},$
	$\displaystyle\|f(t,x,y,z)\|^{2}$	$\displaystyle\leq f_{0}+f_{x}\|x\|^{2}+K\|y\|^{2}+f_{z}\|z\|^{2},$
	$\displaystyle\|g(x)\|^{2}$	$\displaystyle\leq g_{0}+g_{x}\|x\|^{2}.$

	$\displaystyle E\|\delta X_{n}\|^{2}$	$\displaystyle\leq A_{2}\sum_{i=0}^{n-1}e^{A_{1}(n-i-1)h}E\|\delta Y_{i}\|^{2}h,$
	$\displaystyle E\|\delta Y_{n}\|^{2}$	$\displaystyle\leq e^{A_{3}(N-n)h}E\|\delta Y_{N}\|^{2}+A_{4}\sum_{i=n}^{N-1}e^{A_{3}(i-n)h}E\|\delta X_{i}\|^{2}h.$

$\displaystyle E\|\delta X_{i+1}\|^{2}$	$\displaystyle=E\|\delta X_{i}+\delta b_{i}h\|^{2}+E[(\Delta W_{i})^{\operatorname{T}}(\delta\sigma_{i})^{\operatorname{T}}\delta\sigma_{i}\Delta W_{i}]$
	$\displaystyle=E\|\delta X_{i}+\delta b_{i}h\|^{2}+hE\\|\delta\sigma_{i}\\|^{2},$	(4.9)
$\displaystyle E\|\delta Y_{i+1}\|^{2}$	$\displaystyle=E\|\delta Y_{i}-\delta f_{i}h\|^{2}+\int_{t_{i}}^{t_{i+1}}E\|\delta Z_{t}\|^{2}\,\mathrm{d}t.$	(4.10)

Convergence of the Deep BSDE Method for Coupled FBSDEs

Abstract

1 Introduction

2 A Numerical Scheme for Coupled FBSDEs and Main Results

Theorem 1.

Theorem 2.

Theorem 3.

Remark.

Remark.

3 Preliminaries

Assumption 1.

Assumption 2.

Assumption 3.

Theorem 4.

Remark.

Theorem 5.

Remark.

Assumption 4.

4 A Posteriori Estimation of the Simulation Error

Lemma 1.

Lemma 2.

Proof.

Proof of Lemma 1.

Theorem 1′.

Remark.

Remark.

Proof of Theorem 1′.

5 An Upper Bound for the Minimized Objective Function

Lemma 3.

Lemma 4.

Lemma 5.

Theorem 2′.

Remark.

Remark.

Proof of Theorem 2′.

Theorem 6.

Proof.

Remark.

5.1 Proof of Lemmas

Proof of Lemma 3.

Proof of Lemma 4.

Proof of Lemma 5.

6 Numerical Examples

6.1 General Setting

6.2 Example 1

6.3 Example 2

References

Convergence of the Deep BSDE Method
for Coupled FBSDEs

Theorem 1^′.

Proof of Theorem 1^′.

Theorem 2^′.

Proof of Theorem 2^′.