Non-asymptotic estimation of risk measures using stochastic gradient Langevin dynamics

Jiarui Chu (corresponding author; Princeton University, ORFE; jiaruic@princeton.edu)
Ludovic Tangpi (Princeton University, ORFE; ludovic.tangpi@princeton.edu)

Abstract

In this paper we study the approximation of some law-invariant risk measures. As a starting point, we approximate the average value at risk using stochastic gradient Langevin dynamics, which can be seen as a variant of the stochastic gradient descent algorithm. Further, Kusuoka’s spectral representation allows us to bootstrap the estimation of the average value at risk and extend the algorithm to general law-invariant risk measures. We present both theoretical, non-asymptotic convergence rates of the approximation algorithm and numerical simulations.

Keywords Convex risk measure $\cdot$ Stochastic Optimization $\cdot$ Risk minimization $\cdot$ Average value at risk $\cdot$ Stochastic gradient Langevin

Mathematics Subject Classification 91G70 $\cdot$ 90C90

Statements and Declarations The authors gratefully acknowledge support from the NSF grant DMS-2005832. The authors have no competing interests to declare.

1 Introduction

Every financial decision involves some degree of risk. Quantifying risk associated with a future random outcome allows organizations to compare financial decisions and develop risk management plans to prepare for potential loss and uncertainty. By the seminal work of Artzner, Delbaen, Eber, and Heath [3], the canonical way to quantify the riskiness of a random financial position $X$ is to compute the number $\rho(X)$ for a convex risk measure $\rho$ , whose definition we recall.

Definition 1.1.

A mapping $\rho:\mathbb{L}^{\infty}\to\mathbb{R}$ is a convex risk measure if it satisfies the following conditions for all $X,Y\in\mathbb{L}^{\infty}$ :

• translation invariance: $\rho(X+m)=\rho(X)-m$ for all $m\in\mathbb{R}$

• monotonicity: $\rho(X)\geq\rho(Y)$ if $X\leq Y$

• convexity: $\rho(\lambda X+(1-\lambda)Y)\leq\lambda\rho(X)+(1-\lambda)\rho(Y)$ for $\lambda\in[0,1]$ .

Intuitively, $\rho(X)$ measures the minimum amount of capital that should be added to the current financial position $X$ to make it acceptable. (In the rest of the paper, for notational simplicity, we will assume risk measures to be increasing; that is, we work with $\rho(-X)$ . This does not restrict the generality.) Due to its fundamental importance in quantitative finance, the theory of risk measures (sometimes called quantitative risk management) has been extensively developed. We refer for instance to [18, 28, 42, 16, 45, 15, 33, 46] for a few milestones and to the influential textbooks of McNeil, Frey, and Embrechts [48] and Föllmer and Schied [30] for overviews.

An important problem for risk managers in practice is to efficiently simulate the number $\rho(X)$ for a financial position $X$ and a risk measure $\rho$ . The difficulty here stems from the fact that, unless $\rho$ is a “simple enough” risk measure and the law of $X$ belongs to a tractable family of distributions, there is no closed-form formula for computing $\rho(X)$ . The goal of this paper is to develop a method to numerically simulate the riskiness $\rho(X)$ for general convex risk measures when the law of $X$ is not necessarily known (as is the case in practical applications).

One commonly used measure of the riskiness of a financial position is the value at risk (VaR). For a given risk intolerance $u\in(0,1)$ , the value at risk $\mathrm{VaR}_{u}(X)$ of $X$ is the $(1-u)$ -quantile of the distribution of $X$ . Despite the various shortcomings of this measure of risk documented by the academic community [48], $\mathrm{VaR}$ remains the standard in the banking industry and, due to its widespread use, its computation has been extensively studied. We refer interested readers for instance to [35, 38, 10, 25] and references therein for various simulation techniques. Recommendations [50] from the Basel Committee on Banking Supervision, which advises on risk management for financial institutions, have revived the development of convex risk measures such as the average value at risk (AVaR), also called conditional value at risk or expected shortfall. This risk measure is the expected loss given that losses are greater than or equal to the $\mathrm{VaR}$ . That is, $\mathrm{AVaR}_{u}$ is given by:

\mathrm{AVaR}_{u}(X):=\mathbb{E}[X|X>\mathrm{VaR}_{u}(X)].

(1.1)

For general distributions of $X$ , $\mathrm{AVaR}_{u}(X)$ usually does not have a closed-form expression. Therefore, in practice, numerical estimations are often required. As a result, the estimation of $\mathrm{AVaR}$ has received considerable attention. We refer for instance to works by Eckstein and Kupper [24] and Bühler, Gonon, Teichmann, and Wood [9] in which (among other things) the simulation of optimized certainty equivalents (of which $\mathrm{AVaR}$ is a particular case) is considered using deep learning techniques. One approximation technique for the $\mathrm{AVaR}$ is based on Monte-Carlo type algorithms. In this direction, let us refer for instance to Hong and Liu [37] and Chen [13] on Monte Carlo estimation of $\mathrm{VaR}$ and $\mathrm{AVaR}$ , and to Zhu and Zhou [64] on nested Monte Carlo estimation. More recently, motivated by developments of gradient descent methods in stochastic optimization (in particular the stochastic gradient Langevin dynamics (SGLD) technique), Sabanis and Zhang [56] provide non-asymptotic error bounds for the estimation of $\mathrm{AVaR}$ . Other works developing such gradient descent techniques in the context of risk management include Iyengar and Ma [41], Tamar, Glassner, and Mannor [59] and Soma and Yoshida [58]. Essentially, these papers take advantage of new developments in machine learning and optimization, see e.g. Allen-Zhu [2], Gelfand and Mitter [34], Nesterov [49] and Raginsky, Rakhlin, and Telgarsky [53]. Let us also mention the recent work of Reppen and Soner [54], who develop a data–driven approach based on ideas from learning theory.

In this work we go beyond the numerical simulation of $\mathrm{AVaR}$ by extending stochastic gradient descent type techniques to compute a large family of risk measures, including the $\mathrm{AVaR}$ . We are interested in deriving explicit (non–asymptotic) error estimates for the approximation. We will restrict our attention to law-invariant convex risk measures (whose definition we recall below), since in practice, only the law of a financial position can be (approximately) observed. In fact, the requirement for a risk measure to be law-invariant is natural and is satisfied by most risk measures. (All risk measures considered in this work will be implicitly assumed to be convex law–invariant risk measures.)

Definition 1.2.

[32] A risk measure $\rho$ is law-invariant if for all $X,Y$ with the same distribution, we have $\rho(X)=\rho(Y).$

To the best of our knowledge, the papers considering (non-parametric) estimation of general convex risk measures are Weber [61], Belomestny and Krätschmer [6] and Bartl and Tangpi [4]. These papers consider a (data-driven) Monte-Carlo estimation method by proposing a plug-in estimator based on the empirical measure of the historical observations of the underlying distribution of the random outcome. Weber [61] proves a large deviation theorem, and Belomestny and Krätschmer [6] provide a central limit theorem. Note that both of these papers give asymptotic estimation results. Bartl and Tangpi [4] provide sharp non-asymptotic convergence rates for the estimation.

To estimate general law-invariant convex risk measures, we rely on Kusuoka’s spectral representation [46]. Intuitively, this representation says that any law-invariant risk measure can be constructed as an integral of the $\mathrm{AVaR}$ risk measure. Therefore, the first step of our approximation of general law–invariant risk measures is to estimate $\mathrm{AVaR}$ . Since we would like to analyze approximation algorithms for the risk of claims with possibly non-convex payoffs, we employ the idea of Raginsky, Rakhlin, and Telgarsky [53] and use stochastic gradient Langevin dynamics which, essentially, adds Gaussian noise to the unbiased estimate of the gradient in stochastic gradient descent. To quantify the distance between the estimator and the true value of the risk measure, we present non–asymptotic rates on the mean squared estimation error both in the case of $\mathrm{AVaR}$ and of general law–invariant risk measures. The proof of the bound on the mean squared error of estimating the $\mathrm{AVaR}$ makes use of the observation that the SGLD algorithm is a variant of the Euler-Maruyama discretization of the solution of the Langevin stochastic differential equation (SDE). This observation allows us to use results on the convergence rate of the Euler-Maruyama scheme and classical techniques for deriving the convergence rate of the solution of the Langevin SDE to the invariant measure. For the rate on the mean squared estimation error in the general case, our proof relies heavily on Kusuoka’s representation, which allows one to build general law–invariant risk measures from $\mathrm{AVaR}$ .

Beyond our theoretical guarantees for the convergence of approximation algorithms for general convex risk measures, the present work also contributes to the non-convex optimization literature in that we propose a new proof for the convergence of SGLD algorithms for some non–convex objective functions. The idea is essentially to reduce the problem to the analysis of contractivity properties of the semi–group originating from a Langevin diffusion with non–convex potential. This problem was notably investigated by Eberle [23].

The paper is organized as follows: We start by describing the approximation techniques and presenting the main results in Section 2. In the same section, we also present numerical results on the estimation of AVaR. In Section 3, we prove the rates on the mean squared error for the estimation of AVaR. The derivation of the mean squared error for the estimation of a general law-invariant risk measure is done in Section 4.

Notations: Let $\mathbb{N}^{\star}:=\mathbb{N}\setminus\{0\}$ and let $\mathbb{R}_{+}^{\star}$ be the set of real positive numbers. Fix an arbitrary Polish space $E$ endowed with a metric $d_{E}$ . Throughout this paper, for every $p$ -dimensional $E$ -valued vector $e$ with $p\in\mathbb{N}^{\star}$ , we denote by $e^{1},\ldots,e^{p}$ its coordinates. For $(\alpha,\beta)\in\mathbb{R}^{p}\times\mathbb{R}^{p}$ , we also denote by $\alpha\cdot\beta$ the usual inner product, with associated norm $\|\cdot\|$ , which we simplify to $|\cdot|$ when $p$ is equal to $1$ . For any $(\ell,c)\in\mathbb{N}^{\star}\times\mathbb{N}^{\star}$ , $E^{\ell\times c}$ will denote the space of $\ell\times c$ matrices with $E$ -valued entries.

Let $\mathcal{B}(E)$ be the Borel $\sigma$ -algebra on $E$ (for the topology generated by the metric $d_{E}$ on $E)$ . For any $p\geq 1$ , for any two probability measures $\mu$ and $\nu$ on $(E,\mathcal{B}(E))$ with finite $p$ -moments, we denote by $\mathcal{W}_{p}(\mu,\nu)$ the $p$ -Wasserstein distance between $\mu$ and $\nu$ , that is

\mathcal{W}_{p}(\mu,\nu):=\bigg{(}\inf_{\alpha\in\Gamma(\mu,\nu)}\int_{E\times E}d_{E}(x,y)^{p}\alpha(\mathrm{d}x,\mathrm{d}y)\bigg{)}^{1/p},

where the infimum is taken over the set $\Gamma(\mu,\nu)$ of all couplings $\alpha$ of $\mu$ and $\nu$ , that is, probability measures on $\big{(}E^{2},\mathcal{B}(E)^{\otimes 2}\big{)}$ with marginals $\mu$ and $\nu$ on the first and second factors respectively.
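For intuition, the infimum over couplings can be computed explicitly in one special case: on the real line, for two empirical measures with the same number of equally weighted atoms, an optimal coupling pairs the sorted samples. The following minimal Python sketch illustrates this; the function name `wasserstein_1d` and the toy data are illustrative assumptions and are not part of the paper.

```python
import random

def wasserstein_1d(xs, ys, p=2):
    """p-Wasserstein distance between two empirical measures on the real
    line with equally many, equally weighted atoms (sorted-sample pairing)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    # Average transport cost of the monotone (sorted) coupling.
    cost = sum(abs(x - y) ** p for x, y in zip(xs_sorted, ys_sorted)) / len(xs)
    return cost ** (1.0 / p)

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [x + 0.5 for x in sample]  # same sample translated by 0.5
print(wasserstein_1d(sample, shifted, p=2))
```

Translating every atom by the same amount moves each sorted pair by exactly that amount, so the printed distance should agree with the shift up to floating-point error.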

2 Approximation technique and main results

In this section we rigorously describe the approximation method developed in this article as well as our main results. Throughout, we fix a probability space $(\Omega,\mathcal{F},\mathbb{P})$ on which all random variables will be defined, unless otherwise stated. Let us denote by $\mathbb{L}^{\infty}$ the space of essentially bounded random variables on this probability space. The starting point of our method is based on the following spectral representation of law-invariant risk measures due to Kusuoka [46]:

Theorem 2.1.

A mapping $\rho:\mathbb{L}^{\infty}\to\mathbb{R}$ is a law-invariant risk measure if and only if it satisfies

\rho(X)=\sup_{\gamma\in\mathcal{M}}\bigg{(}\int_{[0,1)}\text{AVaR}_{u}(X)\gamma(du)-\beta(\gamma)\bigg{)},\text{ for all }X\in\mathbb{L}^{\infty}

(2.1)

for some functional $\beta:\mathcal{M}\to[0,\infty)$ , where $\mathcal{M}$ is the set of all Borel probability measures on $[0,1]$ .

In fact, this spectral representation suggests that the risk measure $\mathrm{AVaR}$ is the “basic building block” from which all convex law–invariant risk measures can be constructed. Thus, the idea will be to propose an approximation algorithm for $\mathrm{AVaR}$ that will later be bootstrapped to derive an algorithm for general law-invariant risk measures. This approach is also used in [4] for a very different approximation method.

2.1 Approximation of average value at risk

Let us first focus on estimating $\mathrm{AVaR}$ . For this purpose, recall (see e.g. [30, Proposition 4.51]) that for every $X\in\mathbb{L}^{\infty}$ and $u\in(0,1)$ , $\mathrm{AVaR}_{u}(X)$ takes the form

\mathrm{AVaR}_{u}(X)=\inf_{q\in\mathbb{R}}\Big{(}\frac{1}{1-u}\mathbb{E}[(X-q)^{+}]+q\Big{)}.

In other words, $\mathrm{AVaR}_{u}(X)$ is nothing but the value of a stochastic optimization problem. In most financial applications, the contingent claim $X$ whose risk is assessed is of the form $X=f(r,S)$ where $S=(S^{1},\dots,S^{d})$ is a $d$ –dimensional random vector of risk factors, and $r\in\mathcal{X}$ , see e.g. [48, Section 2.1] for details. Note that here the space $\mathcal{X}$ can be infinite dimensional. A standard practice in the infinite dimensional case is to use neural networks for approximation, which leads to non-convex objective functions. Therefore, we allow $f$ to be non-convex, subject to certain regularity conditions. A standard example arises when $X$ is the profit and loss (P\&L) of an investment strategy. In this case, $r$ is the portfolio and the random vector $S$ represents (increments of) the stock prices. That is, $f(r,S):=\sum_{i=1}^{d}r_{i}(S_{1}^{i}-S^{i}_{0})$ where $S^{i}_{1}$ and $S^{i}_{0}$ are the values of the stock $i$ at times $1$ and $0$ , respectively. Hence, we let $r$ be in a compact and convex set $A\subseteq\mathbb{R}^{d-1}$ , and our goal will be to estimate the value of the (multi-dimensional) risk minimization problem

\overline{\mathrm{AVaR}_{u}}(f):=\inf_{r\in A}\mathrm{AVaR}_{u}(f(r,S))=\inf_{r\in A,q\in\mathbb{R}}\Big{(}\frac{1}{1-u}\mathbb{E}\Big{[}\big{(}f(r,S)-q\big{)}^{+}\Big{]}+q\Big{)}.

(2.2)

A natural way to numerically solve such problems is by gradient descent. However, when the dataset is large, gradient descent usually does not perform well, since computing the gradient on the full dataset at each iteration is computationally expensive.
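To make the gradient in question explicit, the following formal computation may help. It is only a sketch under assumptions that are not stated in the text: the law of $f(r,S)$ has no atom at $q$ , and differentiation and expectation can be interchanged. Under these assumptions,

\partial_{q}\Big{(}\frac{1}{1-u}\mathbb{E}\big{[}(f(r,S)-q)^{+}\big{]}+q\Big{)}=1-\frac{\mathbb{P}(f(r,S)>q)}{1-u},\qquad\nabla_{r}\Big{(}\frac{1}{1-u}\mathbb{E}\big{[}(f(r,S)-q)^{+}\big{]}+q\Big{)}=\frac{1}{1-u}\mathbb{E}\big{[}\nabla_{r}f(r,S)1_{\{f(r,S)>q\}}\big{]}.

In particular, for fixed $r$ , the first-order condition in $q$ reads $\mathbb{P}(f(r,S)>q)=1-u$ , so that an optimal $q$ is a quantile of $f(r,S)$ . Both expressions are expectations, and can therefore be estimated without bias by averaging over samples of $S$ .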

Among others, one method that has been proposed to get around the high computational cost of gradient descent is the stochastic gradient descent (SGD) algorithm, which replaces the true gradient with an unbiased estimate calculated from a random subset of the data. A more recent approach, called stochastic gradient Langevin dynamics (SGLD), injects random noise into the unbiased estimate of the gradient at each iteration of the SGD algorithm. Originally introduced by Welling and Teh [62] as a tool for Bayesian posterior sampling on large scale and high dimensional datasets, SGLD maintains the scalability of SGD and has a few advantages over it: by adding noise to SGD, SGLD navigates out of saddle points and local minima more easily [7], outperforms SGD in terms of accuracy [19], and overcomes the curse of dimensionality [14]. Moreover, SGLD also applies to cases where the objective function is non-convex but sufficiently regular [34, 53].

We will apply the SGLD in the present context of estimation of $\mathrm{AVaR}$ . Recall that our goal is to solve the optimization problem given in equation (2.2). Let $z:=(r,q)$ , and consider the (objective) function

\widetilde{L}(r,q):=\frac{1}{1-u}\mathbb{E}\Big{[}\big{(}f(r,S)-q\big{)}^{+}\Big{]}+q

and, given a strictly positive constant $\gamma>0$ , let

L(r,q):=\widetilde{L}(r,q)+\frac{\gamma}{2}\|q\|^{2},\quad\text{and }\overline{L}(r,q):=L(r,q)+\frac{\gamma}{2}\text{dist}^{2}(r,A)

(2.3)

be the usual penalized objective function, where

\text{dist}^{2}(r,A):=\inf_{x\in A}\|r-x\|^{2}

denotes the squared distance from $r$ to the set $A$ . Since for $\gamma$ small we have

\inf_{(r,q)\in\mathbb{R}^{d}}\overline{L}(r,q)\approx\overline{\mathrm{AVaR}}_{u}(f),

we will approximate the left hand side above using the SGLD algorithm, which consists in approximating its minimizer by the (support of the) invariant measure of the Markov chain $(Z^{\lambda}_{m,h})$ given by

Z_{m+1,h}^{\lambda}=Z_{m,h}^{\lambda}-\nabla\overline{L}(Z_{m,h}^{\lambda})h+\sqrt{2\lambda^{-1}}\xi_{m},

(2.4)

where $(\xi_{m})_{m\geq 1}$ are independent Gaussian random variables. In the practice of financial risk management, the distribution of $S$ is typically unknown. This is a well-studied issue in quantitative finance, see for instance [5, 17, 44] and the references therein. In particular, $\nabla L$ cannot be directly computed, and it will be replaced by an unbiased estimator. Following Monte–Carlo simulation ideas, we let $(S^{1},\dots,S^{P})$ be independent copies of $S$ and $(\widetilde{W}^{1},\dots,\widetilde{W}^{N})$ be independent Brownian motions, and we thus let

\widetilde{\ell}(z):=\frac{1}{P}\sum_{p=1}^{P}\frac{1}{1-u}(f(r,S^{p})-q)^{+}+q,\quad\ell(z):=\widetilde{\ell}(z)+\frac{\gamma}{2}\|q\|^{2},\quad\text{and }\overline{\ell}(z):=\ell(z)+\text{dist}^{2}(r,A),\quad\text{with }z=(r,q)\in\mathbb{R}^{d}.

In the following, we will take $P=N$ for simplicity. Put

\widetilde{Z}_{m+1,h}^{\lambda,n}=\widetilde{Z}_{m,h}^{\lambda,n}-\nabla L(\widetilde{Z}_{m,h}^{\lambda,n})h+\sqrt{2\lambda^{-1}}\Delta\widetilde{W}_{h}^{n},\quad\text{with}\quad\Delta\widetilde{W}_{h}^{n}:=\widetilde{W}^{n}_{m+1}-\widetilde{W}^{n}_{m},

(2.5)

\widetilde{Z}_{m+1,h}^{\prime\lambda,n}=\widetilde{Z}_{m,h}^{\prime\lambda,n}-\nabla\ell(\widetilde{Z}_{m,h}^{\prime\lambda,n})h+\sqrt{2\lambda^{-1}}\Delta\widetilde{W}_{h}^{n},\quad\text{with}\quad\Delta\widetilde{W}_{h}^{n}:=\widetilde{W}^{n}_{m+1}-\widetilde{W}^{n}_{m},

(2.6)

and

\overline{Z}_{m+1,h}^{\lambda,n}=\overline{Z}_{m,h}^{\lambda,n}-\nabla\overline{\ell}(\overline{Z}_{m,h}^{\lambda,n})h+\sqrt{2\lambda^{-1}}\Delta\overline{W}_{h}^{n},\quad\text{with}\quad\Delta\overline{W}_{h}^{n}:=\overline{W}^{n}_{m+1}-\overline{W}^{n}_{m}.

(2.7)

Hence we will show that

\widetilde{\mathrm{AVaR}}_{u}(f):=\frac{1}{N}\sum_{n=1}^{N}\overline{\ell}(\widetilde{Z}^{\prime\lambda,n}_{M,h})

(2.8)

approximates $\overline{\mathrm{AVaR}}_{u}(f)$ . Note that the optimal portfolio $r$ can be easily recovered: since $z=(r,q)$ , it is given by the first $d-1$ coordinates of $\overline{Z}^{\lambda,n}_{M,h}$ . Similarly, the value-at-risk can be obtained from the Markov chain $\overline{Z}^{\lambda,n}_{M,h}$ . See Remark 3.1 for details. Let us now formulate the assumptions we make on $f$ and the random vector $S$ .

Assumption 2.2.

The random variable $S$ takes values in $\mathbb{R}^{d}$ and the function $f:\mathbb{R}^{d}\times\mathbb{R}^{d-1}\to\mathbb{R}$ is Borel measurable, and they satisfy
$(i)$ $S$ has finite fourth moment.
$(ii)$ The function $(r,s)\mapsto f(r,s)$ is Lipschitz–continuous and continuously differentiable.
$(iii)$ $\inf_{r}\mathbb{E}[f(r,S)]>0$ .
$(iv)$ The random variable $\nabla_{r}f(r,S)$ is bounded, uniformly in $S$ , and $\nabla_{s}f(r,\cdot)$ is Lipschitz, uniformly in $r$ .
$(v)$ Consider the function $\kappa$ defined as

\kappa(u):=\inf\Big{\{}-\sqrt{2\lambda}\frac{(z-z^{\prime})\cdot(\nabla L(z^{\prime})-\nabla L(z))}{\|z-z^{\prime}\|},\,\,z,z^{\prime}\in\mathbb{R}^{d}:\|z-z^{\prime}\|=u\Big{\}}.

It holds

\liminf_{u\to\infty}\kappa(u)>0\quad\text{and}\quad\int_{0}^{1}u\kappa(u)^{-}\,\mathrm{d}u<\infty.

Let us briefly comment on these conditions before stating the result. The integrability, regularity, and lower boundedness conditions $(i)-(iii)$ ensure that the problem is well-posed. The boundedness condition $(iv)$ is assumed mostly to simplify the exposition. Most of our statements will remain true if it is replaced by a suitable integrability condition. We introduce the more involved condition $(v)$ to make up for the possible lack of convexity of the objective function $L$ . This condition is by now standard when employing coupling by reflection techniques to prove contractivity of diffusion semigroups. We refer for instance to Eberle [23, 22] or the earlier work of Chen and Li [12]. Note, for instance, that this condition is automatically satisfied if $f$ is convex (since in this case $L$ is strongly convex) or when $L$ is strictly convex outside a given ball (see [23, Example 1]).

The following is the first main result of this work:

Theorem 2.3.

Let Assumption 2.2 hold. Let $t,M,h>0$ be such that $h=\frac{t}{M^{2}}$ . For all $t,\lambda>0,0<\gamma<1,$ and $M,N\in\mathbb{N}^{\star}$ , we have

\displaystyle\mathbb{E}\bigg{[}\Big{|}\frac{1}{N}\sum_{n=1}^{N}\overline{\ell}(\widetilde{Z}_{M,h}^{\prime\lambda,n})-\overline{\text{AVaR}}_{u}(f)\Big{|}^{2}\bigg{]}\leq C^{1}_{(u,t,\lambda,t)}\frac{1}{N}+C^{2}_{(u,t,\lambda)}\gamma^{2}+C^{3}_{(u,t,\lambda)}h^{2}+C^{4}_{(u,\lambda)}e^{-tC^{5}_{(\lambda)}}+C^{6}_{(u)}\frac{1}{\lambda^{2}},

(2.9)

where the constants are given in the appendix.

Theorem 2.3 provides a non-asymptotic rate for the convergence of the estimator $\frac{1}{N}\sum_{n=1}^{N}\overline{\ell}(\widetilde{Z}^{\prime\lambda,n}_{M,h})$ to the (optimized) average value at risk. Such a rate is crucial in applications since it gives a precise order of magnitude for the choice of the parameters $M,N,\gamma$ and $\lambda$ needed to achieve a desired order of accuracy. Moreover, the rate is independent of the dimension $d$ , implying in particular that the rate is not made worse when increasing the size of the portfolio $S=(S^{1},\dots,S^{d})$ (or in general the number of risk factors). Furthermore, observe that the estimator $\widetilde{\mathrm{AVaR}}_{u}(f)$ is rather easy to simulate: one only needs to simulate $N$ independent Gaussian random variables, run the iterative scheme (2.6) for each of them, and compute the empirical average of the outcomes. We provide numerical results on the estimation of AVaR in Section 2.3 below.
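The simulation procedure just described can be sketched in a few lines of Python in the one-dimensional case $f(r,S)=S$ , where only $q$ is optimized. This is a hedged illustration and not the authors' implementation (which is linked in Section 2.3): the function name `sgld_avar`, the mini-batch gradient, the noise scaling $\sqrt{2h/\lambda}$ (the usual Euler–Maruyama scaling of a Brownian increment over a step of size $h$ ), and all parameter values are assumptions chosen so that the example runs quickly.

```python
import math
import random

def sgld_avar(samples, u=0.95, lam=1e4, gamma=1e-8, h=1e-2,
              n_steps=1000, n_chains=20, batch=50, seed=0):
    """Average, over independent SGLD chains, of the penalized empirical
    objective evaluated at the final iterate (1-d case f(r, S) = S)."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * h / lam)  # assumed Euler-Maruyama scaling

    def objective(q):
        # Empirical version of E[(S - q)^+] / (1 - u) + q + (gamma / 2) q^2.
        excess = sum(max(s - q, 0.0) for s in samples) / len(samples)
        return excess / (1.0 - u) + q + 0.5 * gamma * q * q

    values = []
    for _ in range(n_chains):
        q = 0.0  # arbitrary starting point
        for _ in range(n_steps):
            mini = rng.sample(samples, batch)  # mini-batch (an assumption)
            tail_freq = sum(1 for s in mini if s > q) / batch
            # Unbiased estimate of the derivative of the objective in q.
            grad = 1.0 - tail_freq / (1.0 - u) + gamma * q
            q += -h * grad + noise_scale * rng.gauss(0.0, 1.0)
        values.append(objective(q))
    return sum(values) / n_chains

data_rng = random.Random(1)
samples = [data_rng.gauss(0.0, 1.0) for _ in range(5000)]
estimate = sgld_avar(samples)
print(estimate)
```

For standard Gaussian data the printed value can be compared with the closed-form average value at risk of a Gaussian distribution, which is roughly 2.06 at level $u=0.95$ ; the remaining gap combines sampling error with the discretization and noise effects quantified in Theorem 2.3.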

Remark 2.4.

Observe that the method developed here can also be used (with minor changes) to simulate the value function of utility maximization problems of the form

\sup_{r\in A}\mathbb{E}^{\mu}[U(f(r,S))]

where $U$ is a concave utility function and $\mathbb{E}^{\mu}$ the expectation when $S\sim\mu$ , or even of the robust utility maximization problem

\sup_{r\in A}\inf_{\mu\in\mathcal{P}}\mathbb{E}^{\mu}[U(f(r,S))]

where $\mathcal{P}$ is the set of possible distributions of $S$ . In the latter case, one will need to compute (or find an appropriate unbiased estimator of) $\nabla L$ with

L(r):=\inf_{\mu\in\mathcal{P}}\mathbb{E}^{\mu}[U(f(r,S))].

This is easily done for instance when $\mathcal{P}$ is a ball with respect to the Wasserstein metric around a given distribution $\mu_{0}$ , see e.g. [5].

2.2 Approximation of general convex risk measures

Let us return to the problem of approximating general law-invariant convex risk measures. In this context (as in the case of $\mathrm{AVaR}$ ) our goal is to simulate the optimized risk measure

\overline{\rho}(f):=\inf_{r\in A}\rho(f(r,S)).

To that end, let us recall a notion of regularity of risk measures introduced in [4] that will be needed to derive an explicit non-asymptotic convergence rate. Recall that a random variable $X^{*}$ is said to follow the Pareto distribution with scale parameter $x>0$ and shape parameter $q>0$ if

\mathbb{P}(X^{*}\geq t)=\begin{cases}(x/t)^{q}&\text{ if }t\geq x\\ 1&\text{ if }t<x.\end{cases}
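As a quick numerical illustration of this tail formula, a Pareto random variable can be simulated by inverse transform sampling: if $U$ is uniform on $(0,1]$ , then $xU^{-1/q}$ has the tail above. The Python sketch below checks this empirically; the helper name `sample_pareto` and the chosen parameters are illustrative assumptions.

```python
import random

def sample_pareto(scale, shape, rng):
    """Draw one Pareto(scale, shape) variate by inverse transform sampling."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
    return scale * u ** (-1.0 / shape)

rng = random.Random(0)
scale, shape, t = 1.0, 2.0, 2.0
draws = [sample_pareto(scale, shape, rng) for _ in range(20000)]
tail = sum(1 for x in draws if x >= t) / len(draws)
# Empirical tail probability next to the theoretical value (scale / t) ** shape.
print(tail, (scale / t) ** shape)
```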

Definition 2.5.

[4] Let $q\in(1,\infty)$ , and let $X^{*}$ follow the Pareto distribution with scale parameter 1 and shape parameter $q$ . A convex risk measure $\rho:\mathbb{L}^{\infty}\to\mathbb{R}$ is said to be $q$ -regular if it satisfies

\sup_{n\in\mathbb{N}}\rho(X^{*}\wedge n)<\infty.

We refer to [4] for a discussion on this notion of regularity, but note for instance that $\mathrm{AVaR}$ is $q$ –regular for all $q>1$ and that this notion of regularity is slightly stronger than the well-known Fatou property and the Lebesgue property often assumed for risk measures, see e.g. Föllmer and Schied [30]. Moreover, one consequence of $q$ –regularity is the following slight refinement of Kusuoka’s representation: The risk measure $\rho$ satisfies

\rho(f(r,S))=\sup_{\gamma\in\mathcal{M}:\beta(\gamma)\leq b}\left(\int_{[0,1)}\text{AVaR}_{u}(f(r,S))\gamma(du)-\beta(\gamma)\right)

(2.10)

for some $b>0$ , see [4, Lemma 4.4] for details. Thus, the estimator we consider for $\overline{\rho}$ is given by

\widetilde{\rho}^{\delta}(f):=\operatorname*{ess\,sup}_{\gamma\in\mathcal{M}:\beta(\gamma)\leq b}\left(\int_{[0,\delta)}\widetilde{\text{AVaR}_{u}}(f)\gamma(du)-\beta(\gamma)\right)

(2.11)

for some $\delta\in(0,1)$ , and where $\widetilde{\text{AVaR}_{u}}(f)$ is the estimator of $\overline{\mathrm{AVaR}_{u}}(f)$ given by (2.8), which implicitly depends on $u$ through the objective functions $L$ and $\overline{\ell}$ . The following theorem gives a convergence rate for the approximation of the general law-invariant convex risk measure $\bar{\rho}(f)$ by $\widetilde{\rho}^{\delta}(f)$ .

Theorem 2.6.

Let $\rho$ be a $q$ –regular convex risk measure with $q>1$ . Let $f$ be bounded and satisfy the assumptions of Theorem 2.3. Let $h=\frac{t}{M^{2}}$ . For all $t,\lambda>0,0<\gamma<1,$ and $M,N\in\mathbb{N}^{\star}$ , we have

\mathbb{E}\Big{[}|\bar{\rho}(f)-\widetilde{\rho}^{\delta}(f)|^{2}\Big{]}\leq C(1-\delta)^{1/q}+C^{7}_{(\delta,t,\lambda)}\frac{1}{N}+C^{2}_{(\delta,t,\lambda)}\gamma^{2}+C^{3}_{(\delta,t,\lambda)}h^{2}+C^{4}_{(\delta,\lambda)}e^{-tC^{5}_{(\lambda)}}+C^{6}_{(\delta)}\frac{1}{\lambda^{2}},

where $C^{7}_{(\delta,t,\lambda)}$ is given in the Appendix, and the constants $C^{2}_{(\delta,t,\lambda)},C^{3}_{(\delta,t,\lambda)},C^{4}_{(\delta,\lambda)}$ and $C^{6}_{(\delta)}$ correspond to those given in the Appendix, with $u$ replaced by $\delta$ .

2.3 Numerical results on AVaR

Let us complement the above theoretical guarantees with empirical experiments (code available at https://github.com/jiaruic/sgld_risk_measures). We first focus on the approximation of the average value at risk and the value at risk with respect to the time evolution of the Markov chain in the SGLD algorithm. Thus, for the numerical computations, we set

A=[0,1]^{d},\quad\lambda=10^{8},\quad\gamma=10^{-8},\quad h=10^{-4},\quad\text{and}\quad u=0.95.

We will consider two cases in our experiments. In the first case we assume the underlying distribution to be known and use Monte–Carlo simulation, and in the second case we use real historical stock price data.

2.3.1 Monte Carlo simulation

For the Monte Carlo experiments, we set $N=5000$ . Figure 1(a) shows the convergence of AVaR in the one-dimensional case with $f(r,S)=S$ , where $S$ is sampled from a Gaussian distribution. Figure 1(b) shows the estimation error, $\widetilde{\text{AVaR}}_{u}-\overline{\text{AVaR}}_{u}$ , where $\overline{\text{AVaR}}_{u}$ is the theoretical average value at risk for one-dimensional Gaussian distributions given by

\overline{\text{AVaR}}_{u}=\mu+\sigma\frac{\phi(\Phi^{-1}(u))}{1-u},

(2.12)

where $\phi$ and $\Phi$ are respectively, the PDF and the CDF of a standard Gaussian distribution.
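Formula (2.12) is straightforward to evaluate and to cross-check against a plain Monte Carlo estimate. The following Python sketch does so with the standard library only; the function names `gaussian_avar` and `empirical_avar` are illustrative, and the empirical estimate evaluates the infimum representation of $\mathrm{AVaR}_{u}$ from Section 2.1 at an order statistic of the sample rather than running SGLD.

```python
import random
from statistics import NormalDist

def gaussian_avar(mu, sigma, u):
    """Closed-form AVaR of N(mu, sigma^2) at level u, as in (2.12)."""
    std = NormalDist()
    return mu + sigma * std.pdf(std.inv_cdf(u)) / (1.0 - u)

def empirical_avar(samples, u):
    """Empirical objective q + mean((s - q)^+) / (1 - u), evaluated at the
    order statistic of rank int(u * n), used here as the candidate q."""
    ordered = sorted(samples)
    q = ordered[int(u * len(ordered))]
    excess = sum(max(s - q, 0.0) for s in ordered) / len(ordered)
    return q + excess / (1.0 - u)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(50000)]
print(gaussian_avar(0.0, 1.0, 0.95), empirical_avar(samples, 0.95))
```

The two printed numbers should be close to each other, with a gap that shrinks as the sample size grows.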

For the multi-dimensional case, we take the function

f(r,S)=\sum_{i=1}^{d}\frac{e^{r_{i}}}{\sum_{j=1}^{d}e^{r_{j}}}S_{i},\text{ for }i=1,\cdots,d.

Figure 2(a) and Figure 2(b) show the convergence of VaR and AVaR in the 2-dimensional case, where $S^{1}$ is sampled from $\mathcal{N}(1,4)$ and $S^{2}$ is sampled from $\mathcal{N}(0,1)$ .
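For concreteness, the payoff function used in the multi-dimensional experiments is a softmax-weighted average of the risk factors, so that any $r$ is mapped to portfolio weights that are positive and sum to one. A minimal Python sketch of this payoff follows; the helper names `softmax` and `payoff` are illustrative, and the subtraction of the maximum is a standard numerical-stability device that is not mentioned in the text.

```python
import math

def softmax(r):
    """Map a real vector r to positive weights that sum to one."""
    m = max(r)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - m) for x in r]
    total = sum(exps)
    return [e / total for e in exps]

def payoff(r, s):
    """Softmax-weighted payoff f(r, s) = sum_i softmax(r)_i * s_i."""
    return sum(w * s_i for w, s_i in zip(softmax(r), s))

# Equal scores give equal weights, hence the plain average of the factors.
print(payoff([0.0, 0.0], [1.0, 3.0]))
```

Since the weights depend on $r$ through the softmax, this $f$ is in general not convex in $r$ , which is the situation that motivates the use of SGLD.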

2.3.2 Numerical results with real data

In this subsection, we compute AVaR for a portfolio of 106 stocks using real aggregated stock prices over 15-minute time intervals from January 2, 2015 to August 31, 2015. Among 128 NASDAQ stocks that are "sufficiently liquid", we remove the ones with missing values, and use the remaining 106 stocks. For a detailed description of the data used and for a definition of "sufficiently liquid", please refer to Section 3.2 of Pohl, Ristig, Schachermayer, and Tangpi [52]. We use changes in stock prices instead of stock prices themselves, because stock prices are highly dependent. We present paths of the estimated optimized VaR and AVaR of the portfolio of 106 stocks in Figures 3(a) and 3(b) respectively.

In addition, our approach can also be easily applied to a fixed portfolio of stocks. We take 20 stocks from the 106 stocks described above, and consider a fixed portfolio of equal weights, i.e., $r_{i}=\frac{1}{20}$ for each $i$ . We present paths of the estimated AVaR in Figure 4.

2.4 Numerical results on general risk measures

In order to simulate general risk measures one needs to specify the penalty function $\beta$ , or alternatively the precise form of $\rho$ since $\beta$ is given by [31]

\beta(\gamma)=\sup_{X\in\mathcal{A}_{\rho}}\int_{[0,1]}\mathrm{AVaR}_{u}(X)\gamma(du),\quad\text{with}\quad\mathcal{A}_{\rho}:=\{X\in\mathbb{L}^{\infty}:\rho(X)\leq 0\}.

In general, the simulation of $\tilde{\rho}^{\delta}(f)$ as given in (2.11) will probably require introducing neural networks since it is the value of an infinite dimensional optimization problem. This will be addressed in future research. We will focus here on a case where the problem can be simplified.

In fact, denote by $\partial_{\mu}\beta$ the so–called linear functional derivative of $\beta$ . It is defined as the function $\partial_{\mu}\beta:\mathcal{M}([0,1])\times[0,1]\to\mathbb{R}$ such that

\beta(\mu^{\prime})-\beta(\mu)=\int_{0}^{1}\int_{0}^{1}\partial_{\mu}\beta((1-\lambda)\mu+\lambda\mu^{\prime},x)(\mu^{\prime}-\mu)(\mathrm{d}x)\mathrm{d}\lambda.
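As a sanity check of this definition, consider the following illustrative example (which is not taken from the references): for a bounded measurable $g:[0,1]\to\mathbb{R}$ , let $\beta(\mu):=\big{(}\int_{0}^{1}g\,\mathrm{d}\mu\big{)}^{2}$ and write $a:=\int_{0}^{1}g\,\mathrm{d}\mu$ and $a^{\prime}:=\int_{0}^{1}g\,\mathrm{d}\mu^{\prime}$ . The candidate $\partial_{\mu}\beta(\mu,x):=2a\,g(x)$ satisfies

\int_{0}^{1}\int_{0}^{1}2\big{(}(1-\lambda)a+\lambda a^{\prime}\big{)}g(x)(\mu^{\prime}-\mu)(\mathrm{d}x)\mathrm{d}\lambda=(a^{\prime}-a)\int_{0}^{1}2\big{(}(1-\lambda)a+\lambda a^{\prime}\big{)}\mathrm{d}\lambda=(a^{\prime}-a)(a+a^{\prime})=\beta(\mu^{\prime})-\beta(\mu),

so it is a linear functional derivative of $\beta$ in the above sense.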

Up to an additive constant, there exists a unique such derivative $\partial_{\mu}\beta$ , see e.g. [11]. We have the following:

Proposition 2.7.

Let the assumptions of Theorem 2.6 hold and assume that $\beta$ admits a second order linear functional derivative that is jointly continuous and such that

\sup_{\eta_{1},\eta_{2}\in\mathbb{L}^{2}}\mathbb{E}\Big{[}\sup_{\mu\in\mathcal{M}([0,1])}|\partial_{\mu}\beta(\mu,\eta_{1})|+\sup_{\mu\in\mathcal{M}([0,1])}|\partial_{\mu}^{2}\beta(\mu,\eta_{1},\eta_{2})|\Big{]}\leq K<\infty.

Then, it holds

\mathbb{E}\bigg{[}\Big{\|}\inf_{(x_{i})_{i=1,\dots,J}\subset[0,1]}F\Big{(}\frac{1}{J}\sum_{i=1}^{J}\delta_{x_{i}}\Big{)}-\rho(f)\Big{\|}^{2}\bigg{]}\leq\frac{4K^{2}}{J^{2}}+C(1-\delta)^{1/q}+C(1-\delta)^{1/q}+C^{7}_{(\delta,t,\lambda)}\frac{1}{N}+C^{2}_{(\delta,t,\lambda)}\gamma^{2}+C^{3}_{(\delta,t,\lambda)}h^{2}+C^{4}_{(\delta,\lambda)}e^{-tC^{5}_{(\lambda)}}+C^{6}_{(\delta)}\frac{1}{\lambda^{2}},

with $F(\mu):=\int_{0}^{1}\widetilde{\mathrm{AVaR}}_{u}(f)\mu(\mathrm{d}u)-\beta(\mu)$ (recall Equation (2.8)).

This is a direct consequence of Theorem 2.6 and [39, Theorem 2.4]. The proof is omitted.

As an illustrative example, we consider the so–called entropic value-at-risk introduced by Ahmadi-Javid [1] and studied e.g. by Pichler and Schlotter [51] and Föllmer and Knispel [27] in connection to large portfolio asymptotics. This is a risk measure based on the Rényi entropy given by

\rho_{u}(X)=\sup\Big{\{}\mathbb{E}[ZX]:Z\geq 0,\,\,\mathbb{E}[Z]=1,\,\,H_{q}(Z)\leq\log\frac{1}{1-u}\Big{\}}

with $H_{q}(Z):=\frac{1}{q-1}\log\mathbb{E}Z^{q}$ for $q\in\mathbb{R}^{*}_{+}\setminus\{1\}$ . The associated penalty function takes the form

\beta(\gamma):=\begin{cases}0\text{ if }\int_{0}^{1}\sigma_{\gamma}(x)^{q}\mathrm{d}x\leq\big{(}\frac{1}{1-u}\big{)}^{q-1}\\ +\infty\text{ else}\end{cases},\quad\text{with}\quad\sigma_{\gamma}(x):=\int_{0}^{x}\frac{1}{1-v}\gamma(\mathrm{d}v).

To numerically compute the entropic value-at-risk, we fix a large penalty parameter $k$ and simulate

\sup_{(x_{i})_{i=1,\dots,J}\subset[0,1]}F\Big{(}\frac{1}{J}\sum_{i=1}^{J}\delta_{x_{i}}\Big{)}:=\sup_{(x_{i})_{i=1,\dots,J}\subset[0,1]}\bigg{(}\frac{1}{J}\sum_{i=1}^{J}\widetilde{\mathrm{AVaR}}_{x_{i}}(f)-k\Big{\{}\int_{0}^{1}\Big{(}\frac{1}{J}\sum_{i=1}^{J}\frac{1}{1-x_{i}}1_{x_{i}\leq x}\Big{)}^{q}\mathrm{d}x-\Big{(}\frac{1}{1-u}\Big{)}^{q-1}\Big{\}}^{+}\bigg{)}.

For the Monte Carlo simulation, we set $J=5000$ , $k=10^{18}$ , $q=1.00001$ , and estimate the supremum over $(x_{i})_{i=1,\dots,J}\subset[0,1]$ by the maximum over 5000 random partitions of the interval $[0,1]$ , each consisting of $J$ points. Figure 5(a) shows the convergence of the entropic value-at-risk in the one-dimensional case with $f(r,S)=S$ , where $S$ is sampled from $\mathcal{N}(1,2)$ . Figure 5(b) shows the estimation error relative to the theoretical entropic value-at-risk for $\mathcal{N}(1,2)$ , given by $\rho_{u}(X)=1+\sqrt{-2\log((1-u)^{2})}$ .
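As a sanity check on the Gaussian benchmark, the following minimal sketch compares the closed form for a normal law with mean $1$ and variance $2$, namely $1+\sqrt{2}\sqrt{-2\log(1-u)}$, against a direct numerical minimization. It assumes the classical entropic value-at-risk, i.e. the $q\to 1$ limit, written through the moment generating function as $\inf_{z>0}z^{-1}\log(M_{X}(z)/(1-u))$. The function names and the ternary search are illustrative choices, not part of the algorithm studied in the paper.

```python
import math


def evar_normal_closed_form(mu, sigma, u):
    """Closed form mu + sigma * sqrt(-2 log(1 - u)) for X ~ N(mu, sigma^2)."""
    return mu + sigma * math.sqrt(-2.0 * math.log(1.0 - u))


def evar_normal_numeric(mu, sigma, u, lo=1e-6, hi=50.0, iters=200):
    """Minimize g(z) = log(M_X(z) / (1 - u)) / z over z > 0 by ternary search.

    For a normal law, log M_X(z) = mu * z + sigma^2 * z^2 / 2, so
    g(z) = mu + sigma^2 * z / 2 - log(1 - u) / z, which is convex on z > 0.
    The search interval [lo, hi] is an assumption that must contain the minimizer.
    """
    def g(z):
        return mu + 0.5 * sigma ** 2 * z - math.log(1.0 - u) / z

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            hi = m2
        else:
            lo = m1
    return g(0.5 * (lo + hi))


if __name__ == "__main__":
    mu, sigma = 1.0, math.sqrt(2.0)  # N(1, 2) read as mean 1, variance 2
    for u in (0.5, 0.9, 0.95, 0.99):
        print(u, evar_normal_closed_form(mu, sigma, u), evar_normal_numeric(mu, sigma, u))
```

Both columns should agree to numerical precision, which gives an independent reference value when checking a simulated estimate.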

3 Rates for the average value at risk

This section is dedicated to the proof of Theorem 2.3. We start with some preliminary considerations that introduce the ideas used in the proofs. The details of the proofs are given in Subsection 3.2.

3.1 Preliminaries

The starting point of our method is to recognize (2.6) as the $m$ –th step of the Euler-Maruyama scheme that discretizes the stochastic differential equation

\,\mathrm{d}Z_{t}^{\lambda}=-\nabla{L}(Z_{t}^{\lambda})\,\mathrm{d}t+\sqrt{2\lambda^{-1}}\,\mathrm{d}W_{t},\quad Z_{0}^{\lambda}=z\in\mathbb{R}^{d}

(3.1)

where $W$ is a $d$ -dimensional Brownian motion. This SDE is the Langevin SDE with inverse temperature parameter $\lambda$ . The Langevin SDE is widely studied in physics [57] and is used for sampling from Gibbs distributions via Markov chain Monte–Carlo methods [21]. Equipping the probability space $(\Omega,\mathcal{F},\mathbb{P})$ with the $\mathbb{P}$ –completion of the filtration of $W$ , the equation (3.1) admits a unique strong solution. It is well known that this solution has a unique invariant measure, which we denote by $\mu_{\infty}^{\lambda}$ . As usual in this article, we use the same notation for a probability measure on $\mathbb{R}^{n}$ , for any $n\in\mathbb{N}$ , and its density function. The density of $\mu_{\infty}^{\lambda}$ reads

\mu_{\infty}^{\lambda}(x)=\frac{e^{-\lambda{L}(x)}}{\int_{\mathbb{R}^{d}}e^{-\lambda{L}(z)}\,\mathrm{d}z},

(3.2)

see e.g. [47, Lemma 2.1]. In this work, the interest of the Langevin equation (aside from its analytical tractability) stems from the fact that the limiting measure $\mu_{\infty}$ of $\mu_{\infty}^{\lambda}$ as $\lambda\to\infty$ concentrates on the minimizers of $L$ , which we will show exist. This follows from results of Hwang [40]. Intuitively, this means that if $(r^{*},q^{*})$ is the minimizer of $L$ , then as $\lambda\to\infty$

\int_{\mathbb{R}^{d}}L(z)\mu_{\infty}^{\lambda}(\,\mathrm{d}z)\approx L(r^{*},q^{*}).

(3.3)

Moreover, the Langevin equation allows us to exploit classical techniques in order to derive explicit convergence rates to the invariant measure in the present non–convex potential case.
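The concentration effect behind (3.3) can be illustrated with a minimal sketch of the Euler-Maruyama discretization of (3.1). The one-dimensional double-well potential $L(z)=(z^{2}-1)^{2}/4$, the step size, the horizon, the starting point and the number of particles are all hypothetical choices made for illustration only; they are not the loss function or the parameters used in this paper.

```python
import math
import random


def L(z):
    """Toy non-convex potential (an assumption): minimizers at z = -1 and z = 1, min value 0."""
    return 0.25 * (z * z - 1.0) ** 2


def grad_L(z):
    """Gradient of the toy potential."""
    return z * (z * z - 1.0)


def mean_potential(lam, h=0.01, M=1000, N=200, z0=2.0, seed=0):
    """Average of L over N independent Euler-Maruyama chains after M steps.

    Each step is z <- z - h * grad_L(z) + sqrt(2 / lam) * (Brownian increment),
    where the Brownian increment over a step of length h is N(0, h).
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * h / lam)
    total = 0.0
    for _ in range(N):
        z = z0
        for _ in range(M):
            z = z - h * grad_L(z) + noise_scale * rng.gauss(0.0, 1.0)
        total += L(z)
    return total / N


if __name__ == "__main__":
    for lam in (1.0, 10.0, 50.0):
        print(lam, mean_potential(lam))
```

In this toy setting the particle average of $L$ moves toward the minimal value $0$ as the inverse temperature $\lambda$ grows, which is the behavior that (3.3) describes heuristically.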

Remark 3.1.

One interesting byproduct of our method is that the simulation of $\mathrm{AVaR}$ directly allows us to compute the value at risk and the optimal portfolios, as well as to derive non–asymptotic rates. Let us illustrate this on the problem of simulating the optimal portfolios $r^{*}$ in Equation 2.2. As observed above, $\mu^{\lambda}_{\infty}$ converges to a measure $\mu_{\infty}$ supported on the optimal portfolios. Now, let $G:\mathbb{R}^{d}\to\mathbb{R}$ be a strictly convex function such that the gradient $\nabla G$ is invertible. Then, by Taylor’s expansion we have

G(\overline{Z}^{\lambda,n}_{M,h})-G(q^{*},r^{*})\geq\nabla G(K)(\overline{Z}^{\lambda,n}_{M,h}-(q^{*},r^{*}))

for some random variable $K$ , showing that

\|\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}-(q^{*},r^{*})\|\leq\|\nabla G(K)^{-1}\||G(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})-G(q^{*},r^{*})|.

Therefore, provided that the inverse of $\nabla G$ does not grow too fast, the argument we give below to prove Theorem 2.3 would also allow us to derive theoretical guarantees for the optimal portfolio, replacing $L$ by $G$ .

3.2 Proof of Theorem 2.3

Throughout this section we assume that the assumptions of Theorem 2.3 are satisfied. We split the proof into several intermediate lemmas. The first one, which is probably well known, asserts that the optimization problem defining $\overline{\mathrm{AVaR}}(f)$ admits a solution.

Lemma 3.2.

The function $\widetilde{L}$ defined in Equation 2.3 admits a minimum.

Proof.

In [40, Proposition 2.1], Hwang gives a sufficient condition for $\widetilde{L}$ to admit a minimum: the family $\{\mu_{\infty}^{\lambda}\}_{\lambda>0}$ is tight. A sufficient condition for the tightness of $\{\mu_{\infty}^{\lambda}\}_{\lambda>0}$ is that there exists $\varepsilon>0$ such that the set $B:=\{z\in\mathbb{R}^{d}:\widetilde{L}(z)\leq\varepsilon\}$ is compact [40, Proposition 2.3]. The rest of the proof checks the compactness of the set $B$ for any $\varepsilon>0$ .

Since $\widetilde{L}$ is continuous, the set $B=\{z\in\mathbb{R}^{d}:\widetilde{L}(z)\leq\varepsilon\}$ is closed as the pre-image of the closed set $(-\infty,\varepsilon]$ . In addition, since $\frac{1}{1-u}(x)^{+}>x$ for $|x|$ large enough, we have that $B=\{z\in\mathbb{R}^{d}:\frac{1}{1-u}\mathbb{E}[(f(r,S)-q)^{+}]+q\leq\varepsilon\}$ is bounded. To see this, assume to the contrary that $B$ is unbounded. Then there exists a sequence $\{z_{i}\}\subset B$ such that $\|z_{i}\|\to\infty$ . Then for the subsequence of $\{z_{i}\}$ with $\frac{1}{1-u}(f(r,S)-q)^{+}>f(r,S)$ , we have $\widetilde{L}(z_{i})\to\infty$ , which contradicts $\widetilde{L}(z_{i})\leq\varepsilon$ . Thus, the set $B$ is bounded, and is therefore compact. ∎

To derive the claimed convergence rate, we decompose the expected error into terms that will be handled independently. First, we will exploit the approximation (3.3). Next, using the $N$ independent Brownian motions $\widetilde{W}^{n}$ introduced just before (2.6), we construct $N$ i.i.d. copies $\widehat{Z}_{t}^{\lambda,n}$ of the solution of the Langevin equation as solutions of the SDEs

\,\mathrm{d}\widehat{Z}^{\lambda,n}_{t}=-\nabla L(\widehat{Z}^{\lambda,n}_{t})\,\mathrm{d}t+\sqrt{2\lambda^{-1}}\,\mathrm{d}\widetilde{W}^{n}_{t},\quad n=1,\dots,N.

(3.4)

Recall that

\widetilde{Z}_{m+1,h}^{\lambda,n}=\widetilde{Z}_{m,h}^{\lambda,n}-\nabla L(\widetilde{Z}_{m,h}^{\lambda,n})h+\sqrt{2\lambda^{-1}}\Delta\widetilde{W}_{h}^{n},\quad\text{with}\quad\Delta\widetilde{W}_{h}^{n}:=\widetilde{W}^{n}_{m+1}-\widetilde{W}^{n}_{m},

\widetilde{Z}_{m+1,h}^{\prime\lambda,n}=\widetilde{Z}_{m,h}^{\prime\lambda,n}-\nabla\ell(\widetilde{Z}_{m,h}^{\prime\lambda,n})h+\sqrt{2\lambda^{-1}}\Delta\widetilde{W}_{h}^{n},\quad\text{with}\quad\Delta\widetilde{W}_{h}^{n}:=\widetilde{W}^{n}_{m+1}-\widetilde{W}^{n}_{m},

and

\overline{Z}_{m+1,h}^{\lambda,n}=\overline{Z}_{m,h}^{\lambda,n}-\nabla\overline{\ell}(\overline{Z}_{m,h}^{\lambda,n})h+\sqrt{2\lambda^{-1}}\Delta\overline{W}_{h}^{n},\quad\text{with}\quad\Delta\overline{W}_{h}^{n}:=\overline{W}^{n}_{m+1}-\overline{W}^{n}_{m}.

We decompose the error as

	$\displaystyle\mathbb{E}\bigg{[}\Big{\|}\frac{1}{N}\sum_{n=1}^{N}\overline{\ell}(\overline{Z}^{\lambda,n}_{M,h})-\overline{\text{AVaR}}_{u}(f)\Big{\|}^{2}\bigg{]}\leq 2^{6}\mathbb{E}\bigg{[}\Big{\|}\frac{1}{N}\sum_{n=1}^{N}\overline{\ell}(\overline{Z}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\Big{\|}^{2}\bigg{]}+2^{6}\mathbb{E}\bigg{[}\Big{\|}\frac{1}{N}\sum_{n=1}^{N}\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}{L}(\widetilde{Z}^{\lambda,n}_{M,h})\Big{\|}^{2}\bigg{]}$
	$\displaystyle\qquad+2^{6}\mathbb{E}\bigg{[}\Big{\|}\frac{1}{N}\sum_{n=1}^{N}{L}(\widetilde{Z}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}{L}(\widehat{Z}^{\lambda,n}_{t})\Big{\|}^{2}\bigg{]}+2^{6}\mathbb{E}\bigg{[}\Big{\|}\frac{1}{N}\sum_{n=1}^{N}{L}(\widehat{Z}^{\lambda,n}_{t})-\mathbb{E}[L(Z_{t}^{\lambda})]\Big{\|}^{2}\bigg{]}+2^{6}\Big{\|}\mathbb{E}[L(Z^{\lambda}_{t})]-\int_{\mathbb{R}^{d}}{L}\,\mathrm{d}\mu^{\lambda}_{\infty}\Big{\|}^{2}$
	$\displaystyle\qquad+2^{6}\Big{\|}\int_{\mathbb{R}^{d}}{L}\,\mathrm{d}\mu_{\infty}^{\lambda}-\int_{\mathbb{R}^{d}}\widetilde{L}\,\mathrm{d}\mu_{\infty}^{\lambda}\Big{\|}^{2}+2^{6}\Big{\|}\int_{\mathbb{R}^{d}}\widetilde{L}d\mu_{\infty}^{\lambda}-\overline{\text{AVaR}}_{u}(f)\Big{\|}^{2}.$		(3.5)

The rest of the proof consists in controlling each term above separately.

Lemma 3.3.

Under the conditions of Theorem 2.3, for all $t,0<\gamma<1,u\in(0,1),N,M\in\mathbb{N}^{\star}$ and $\lambda>1$ , we have

\displaystyle\mathbb{E}\bigg{[}\Big{|}\frac{1}{N}\sum_{n=1}^{N}\overline{\ell}(\overline{Z}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\Big{|}^{2}\bigg{]}\leq C^{{}^{\prime}1}_{(u,t,\lambda)}h^{2}+C^{{}^{\prime}2}_{(u,t,\lambda)}\gamma^{2},

where $C^{{}^{\prime}1}_{(u,t,\lambda)}$ , and $C^{{}^{\prime}2}_{(u,t,\lambda)}$ are given in equations (5.1) and (5.2).

Proof.

Let $\overline{Z}^{\lambda,n,d-1}_{M,h}$ denote the last $d-1$ coordinates of $\overline{Z}^{\lambda,n}_{M,h}$ . By the definition of $\overline{\ell}$ and Jensen’s inequality, we have

	$\displaystyle\mathbb{E}\bigg{[}\Big{\|}\frac{1}{N}\sum_{n=1}^{N}\overline{\ell}(\overline{Z}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\Big{\|}^{2}\bigg{]}$
	$\displaystyle\quad=\mathbb{E}\bigg{[}\Big{\|}\frac{1}{N}\sum_{n=1}^{N}\left\{\ell(\overline{Z}^{\lambda,n}_{M,h})+\frac{\gamma}{2}\text{dist}^{2}(\overline{Z}^{\lambda,n,d-1}_{M,h},A)-\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\right\}\Big{\|}^{2}\bigg{]}$
	$\displaystyle\quad\leq C\bigg{(}\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\Big{[}\Big{(}\ell(\overline{Z}^{\lambda,n}_{M,h})-\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\Big{)}^{2}\Big{]}+\frac{\gamma^{2}}{N}\sum_{n=1}^{N}\mathbb{E}\text{dist}^{4}(\overline{Z}^{\lambda,n,d-1}_{M,h},A)\bigg{)}.$		(3.6)

For the first term in (3.6), using the definition of $\ell$ , we have

	$\displaystyle\mathbb{E}\Big{[}\Big{(}\ell(\overline{Z}^{\lambda,n}_{M,h})-\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\Big{)}^{2}\Big{]}$
$\displaystyle=$	$\displaystyle\mathbb{E}\Big{[}\Big{(}\widetilde{\ell}(\overline{Z}^{\lambda,n}_{M,h})-\widetilde{\ell}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\Big{)}^{2}\Big{]}+\frac{\gamma^{2}}{2}\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{M,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{2}\Big{]}$
$\displaystyle\leq$	$\displaystyle\Bigg{(}\Big{(}\frac{2}{1-u}\Big{)}^{2}+\frac{\gamma^{2}}{2}\Bigg{)}\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{M,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{2}\Big{]},$	(3.7)

where we used that $\widetilde{\ell}$ is $\frac{2}{1-u}$ -Lipschitz in the last step.

To control $\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{M,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{2}\Big{]}$ , using the definitions of $\overline{Z}^{\lambda,n}_{M,h}$ and $\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}$ , we have

\displaystyle\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{M,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{2}\Big{]}\leq C\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{M-1,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M-1,h}\big{\|}^{2}\Big{]}+Ch^{2}\mathbb{E}\Big{[}\big{\|}\nabla\overline{\ell}(\overline{Z}^{\lambda,n}_{M-1,h})-\nabla\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M-1,h})\big{\|}^{2}\Big{]}.

Using this recursive relationship, it can be checked by induction that we have

\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{M,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{2}\Big{]}\leq C\sum_{m=0}^{M-1}h^{2}\mathbb{E}\Big{[}\big{\|}\nabla\overline{\ell}(\overline{Z}^{\lambda,n}_{m,h})-\nabla\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{m,h})\big{\|}^{2}\Big{]}.

(3.8)

Note that the derivative of $\text{dist}^{2}(r,A)$ is $2(r-A(r)),$ where we denote $A(r):=\operatorname*{arg\,min}{\text{dist}^{2}(r,A)}$ . In addition, $\nabla A(r)=1$ . Therefore, $\nabla\overline{\ell}$ is $\Big{(}\frac{1}{1-u}+2\Big{)}-$ Lipschitz. By adding and then subtracting $\nabla\overline{\ell}(\widetilde{Z^{\prime}}^{\lambda,n}_{m,h})$ , and then using the definitions of $\nabla\overline{\ell}$ and $\nabla\ell$ , we have

	$\displaystyle\mathbb{E}\Big{[}\big{\|}\nabla\overline{\ell}(\overline{Z}^{\lambda,n}_{m,h})-\nabla\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{m,h})\big{\|}^{2}\Big{]}$
$\displaystyle\leq$	$\displaystyle C\mathbb{E}\Big{[}\big{\|}\nabla\overline{\ell}(\overline{Z}^{\lambda,n}_{m,h})-\nabla\overline{\ell}(\widetilde{Z^{\prime}}^{\lambda,n}_{m,h})\big{\|}^{2}\Big{]}+C\mathbb{E}\Big{[}\big{\|}\nabla\overline{\ell}(\widetilde{Z^{\prime}}^{\lambda,n}_{m,h})-\nabla\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{m,h})\big{\|}^{2}\Big{]}$
$\displaystyle\leq$	$\displaystyle\Big{(}\frac{1}{(1-u)^{2}}+2\Big{)}\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{m,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{m,h}\big{\|}^{2}\Big{]}+C\mathbb{E}\Big{[}\big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{m,h}-A(\widetilde{Z^{\prime}}^{\lambda,n}_{m,h})\big{\|}^{2}\Big{]}.$	(3.9)

Combining equations (3.8) and (3.9), we have

\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{M-1,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M-1,h}\big{\|}^{2}\Big{]}\leq Ch^{2}\sum_{m=0}^{M-1}\Bigg{\{}\Big{(}\frac{1}{(1-u)^{2}}+2\Big{)}\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{m,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{m,h}\big{\|}^{2}\Big{]}+\mathbb{E}\Big{[}\big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{m,h}\big{\|}^{2}\Big{]}+\mathbb{E}\Big{[}\big{\|}A(\widetilde{Z^{\prime}}^{\lambda,n}_{m,h})\big{\|}^{2}\Big{]}\Bigg{\}}.

Using the discrete version of Grönwall’s inequality [36, Proposition 5], we have

\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{M-1,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M-1,h}\big{\|}^{2}\Big{]}\leq Ch^{2}M\bigg{(}\mathbb{E}\Big{[}\big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{2}\Big{]}+\mathbb{E}\Big{[}\big{\|}A(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\big{\|}^{2}\Big{]}\bigg{)}\exp\Bigg{(}C\bigg{(}\frac{1}{(1-u)^{2}}+2\bigg{)}Mh^{2}\Bigg{)}.

(3.10)
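To make the Grönwall step concrete, here is a minimal numerical sketch of one standard discrete form of the inequality: if $a_{n}\leq\beta+\alpha\sum_{m<n}a_{m}$ with $\alpha,\beta\geq 0$, then $a_{n}\leq\beta(1+\alpha)^{n}\leq\beta e^{\alpha n}$. This form is an assumption made for illustration and may differ from the exact statement of [36, Proposition 5]; the values of `alpha` and `beta` are arbitrary.

```python
import math


def extremal_sequence(alpha, beta, n_max):
    """Worst case of the hypothesis: a_n = beta + alpha * sum_{m<n} a_m with equality."""
    a, running_sum = [], 0.0
    for _ in range(n_max + 1):
        a_n = beta + alpha * running_sum
        a.append(a_n)
        running_sum += a_n
    return a


alpha, beta, n_max = 0.03, 2.0, 100  # illustrative values only
a = extremal_sequence(alpha, beta, n_max)
for n, a_n in enumerate(a):
    # The extremal sequence equals beta * (1 + alpha)^n ...
    assert math.isclose(a_n, beta * (1.0 + alpha) ** n, rel_tol=1e-9)
    # ... and is dominated by the exponential Gronwall bound.
    assert a_n <= beta * math.exp(alpha * n) * (1.0 + 1e-12)
```

In the estimate above, the role of $\alpha$ is played by a multiple of $h^{2}$ and the sum runs over $M$ steps, which is how an exponential factor in $Mh^{2}$ arises.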

For the second term in (3.6), we rewrite as

\displaystyle\frac{\gamma^{2}}{N}\sum_{n=1}^{N}\mathbb{E}\text{dist}^{4}(\overline{Z}^{\lambda,n,d-1}_{M,h},A)=\frac{\gamma^{2}}{N}\sum_{n=1}^{N}\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n,d-1}_{M,h}-A(\overline{Z}^{\lambda,n,d-1}_{M,h})\big{\|}^{4}\Big{]}\leq\frac{\gamma^{2}}{N}\sum_{n=1}^{N}\mathbb{E}\big{\|}\overline{Z}^{\lambda,n}_{M,h}\big{\|}^{4}+C\gamma^{2},

(3.11)

where the last step follows because the set $A$ is compact.

It remains to bound the fourth moments of $\overline{Z}^{\lambda,n}_{M,h}$ and $\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}$ . Using the definition of $\overline{Z}^{\lambda,n}_{m+1,h}$ , and letting $C_{d}=\Big{(}\frac{d}{2}+1\Big{)}\Big{(}\frac{d}{2}\Big{)}$ , we have

	$\displaystyle\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{m,h}\big{\|}^{4}\Big{]}$	$\displaystyle=\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{m-1,h}-h\nabla\overline{\ell}(\overline{Z}^{\lambda,n}_{m-1,h})+\sqrt{2\lambda^{-1}}\Delta\overline{W}^{n}_{h}\big{\|}^{4}\Big{]}$
		$\displaystyle\leq C\bigg{(}\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{m-1,h}\big{\|}^{4}\Big{]}+h^{4}\mathbb{E}\Big{[}\|\nabla\overline{\ell}(\overline{Z}^{\lambda,n}_{m-1,h})\|^{4}\Big{]}+\frac{h^{2}}{\lambda^{2}}\Big{(}\frac{d}{2}+1\Big{)}\Big{(}\frac{d}{2}\Big{)}\bigg{)}$
		$\displaystyle\leq C\bigg{(}\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{m-1,h}\big{\|}^{4}\Big{]}+h^{4}\mathbb{E}\Big{[}\big{\|}\nabla\ell(\overline{Z}^{\lambda,n}_{m-1,h})\big{\|}^{4}\Big{]}+h^{4}\mathbb{E}\Big{[}\big{\|}2(\overline{Z}^{\lambda,n}_{m-1,h}-A(\overline{Z}^{\lambda,n}_{m-1,h}))\big{\|}^{4}\Big{]}+\frac{h^{2}}{\lambda^{2}}C_{d}\bigg{)}$
		$\displaystyle\leq C\bigg{(}\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{m-1,h}\big{\|}^{4}\Big{]}+h^{4}\mathbb{E}\bigg{[}\Big{\|}\frac{2}{1-u}+\gamma\overline{Z}^{\lambda,n}_{m-1,h}\Big{\|}^{4}\bigg{]}+h^{4}\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{m-1,h}\big{\|}^{4}\Big{]}+h^{4}\mathbb{E}\Big{[}\big{\|}A(\overline{Z}^{\lambda,n}_{m-1,h})\big{\|}^{4}\Big{]}+\frac{h^{2}}{\lambda^{2}}C_{d}\bigg{)}$
		$\displaystyle\leq C\left((1+\gamma^{4}h^{4}+h^{4})\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{m-1,h}\big{\|}^{4}\Big{]}+\frac{h^{4}}{(1-u)^{4}}+h^{4}+\frac{h^{2}}{\lambda^{2}}C_{d}\right),$

where we used the compactness of $A$ in the last step. Using this recursive relationship, it can be checked by induction that for all $m\in\mathbb{N}$ , we have

$\displaystyle\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{m,h}\big{\|}^{4}\Big{]}$	$\displaystyle\leq C\bigg{(}(1+\gamma^{4}h^{4}+h^{4})^{m}\mathbb{E}\Big{[}\big{\|}\overline{Z}^{\lambda,n}_{0,h}\big{\|}^{4}\Big{]}+\Big{(}\frac{h^{4}}{(1-u)^{4}}+h^{4}+\frac{h^{2}}{\lambda^{2}}C_{d}\Big{)}\sum_{i=1}^{m}(1+\gamma^{4}h^{4}+h^{4})^{i}\bigg{)}$
	$\displaystyle\leq C\bigg{(}(1+\gamma^{4}h^{4}+h^{4})^{m}+\Big{(}\frac{h^{4}}{(1-u)^{4}}+h^{4}+\frac{h^{2}}{\lambda^{2}}C_{d}\Big{)}\frac{(1+\gamma^{4}h^{4}+h^{4})((1+\gamma^{4}h^{4}+h^{4})^{m}-1)}{(1+\gamma^{4}h^{4}+h^{4})-1}\bigg{)}$
	$\displaystyle\leq C\bigg{(}(1+\gamma^{4}h^{4}+h^{4})^{m}+\Big{(}\frac{h^{4}}{(1-u)^{4}}+h^{4}+\frac{h^{2}}{\lambda^{2}}C_{d}\Big{)}(1+\gamma^{4}h^{4}+h^{4})^{m}\bigg{)},$	(3.12)

where the second inequality uses the formula for the sum of a geometric series. The bound on the fourth moment of $\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}$ is given in the proof of Lemma 3.4 below. Combining (3.6), (3.7), (3.10) and (3.11) with the fourth-moment bounds on $\overline{Z}^{\lambda,n}_{M,h}$ and $\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}$ , and taking $h,\gamma\leq 1$ and $M\geq 1$ , we obtain the result of the lemma. ∎
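The induction behind the fourth-moment bound is the unrolling of a linear recursion $a_{m}\leq c\,a_{m-1}+b$ with $c=1+\gamma^{4}h^{4}+h^{4}$ and $b=\frac{h^{4}}{(1-u)^{4}}+h^{4}+\frac{h^{2}}{\lambda^{2}}C_{d}$. The sketch below ignores the generic constant $C$, uses arbitrary illustrative parameter values, and checks numerically that iterating the recursion with equality matches the geometric-series closed form and is dominated by the slightly larger sum $\sum_{i=1}^{m}c^{i}$ used in the proof.

```python
import math


def iterate(c, b, a0, m):
    """Iterate a_k = c * a_{k-1} + b for k = 1, ..., m, starting from a0."""
    a = a0
    for _ in range(m):
        a = c * a + b
    return a


def unrolled(c, b, a0, m):
    """Exact unrolling: c^m a0 + b * sum_{i=0}^{m-1} c^i, via the geometric series."""
    return c ** m * a0 + b * (c ** m - 1.0) / (c - 1.0)


def proof_style_bound(c, b, a0, m):
    """Bound with sum_{i=1}^{m} c^i = c (c^m - 1) / (c - 1), as in the displayed estimate."""
    return c ** m * a0 + b * c * (c ** m - 1.0) / (c - 1.0)


# Illustrative (hypothetical) parameter values.
gamma, h, u, lam, d, m = 0.5, 0.1, 0.95, 10.0, 2, 500
C_d = (d / 2 + 1) * (d / 2)
c = 1.0 + gamma ** 4 * h ** 4 + h ** 4
b = h ** 4 / (1.0 - u) ** 4 + h ** 4 + h ** 2 / lam ** 2 * C_d

exact = iterate(c, b, 1.0, m)
assert math.isclose(exact, unrolled(c, b, 1.0, m), rel_tol=1e-8)
assert exact <= proof_style_bound(c, b, 1.0, m)
```

Since $c\geq 1$, replacing $\sum_{i=0}^{m-1}c^{i}$ by $\sum_{i=1}^{m}c^{i}$ only enlarges the right-hand side, so the estimate remains valid.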

Lemma 3.4.

Under the conditions of Theorem 2.3, for all $t,\gamma>0,u\in(0,1),N,M\in\mathbb{N}^{\star}$ and $\lambda>1$ if $h<\frac{1}{\left(\frac{2}{1-u}+\gamma\right)}$ then we have

\displaystyle\mathbb{E}\bigg{[}\Big{|}\frac{1}{N}\sum_{n=1}^{N}\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}L(\widetilde{Z}^{\lambda,n}_{M,h})\Big{|}^{2}\bigg{]}\leq C^{{}^{\prime}3}_{(u,t)}h^{2}+\big{(}1+C^{{}^{\prime}4}_{(u,t)}\big{)}\frac{C}{N}+C^{{}^{\prime}5}_{(u,t,\lambda)}\gamma^{2},

for some constants $C^{{}^{\prime}3}_{(u,t)},C^{{}^{\prime}4}_{(u,t)}$ and $C^{{}^{\prime}5}_{(u,t,\lambda)}$ given in equations (5.3) - (5.5).

Proof.

By the definition of $L$ and Jensen’s inequality, we have

	$\displaystyle\mathbb{E}\bigg{[}\Big{\|}\frac{1}{N}\sum_{n=1}^{N}\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}L(\widetilde{Z}^{\lambda,n}_{M,h})\Big{\|}^{2}\bigg{]}$
	$\displaystyle\quad=\mathbb{E}\Bigg{[}\bigg{\|}\frac{1}{N}\sum_{n=1}^{N}\left\{\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})+\frac{\gamma}{2}\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}\big{\|}^{2}-\widetilde{\ell}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})-\frac{\gamma}{2}\big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{2}\right\}\bigg{\|}^{2}\Bigg{]}$
	$\displaystyle\quad\leq C\bigg{(}\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\Big{[}\big{(}\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})-\widetilde{\ell}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\big{)}^{2}\Big{]}+\frac{\gamma^{2}}{N}\sum_{n=1}^{N}\mathbb{E}\Big{[}\big{(}\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}\big{\|}^{2}-\big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{2}\big{)}^{2}\Big{]}\bigg{)}$
	$\displaystyle\quad\leq\frac{C}{N}\sum_{n=1}^{N}\mathbb{E}\Big{[}\big{(}\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})-\widetilde{\ell}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\big{)}^{2}\Big{]}+C\frac{\gamma^{2}}{N}\sum_{n=1}^{N}\Big{(}\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}\big{\|}^{4}\Big{]}+\mathbb{E}\Big{[}\big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{4}\Big{]}\Big{)},$		(3.13)

where the last inequality follows from the Cauchy-Schwarz inequality.

Next, we will bound the fourth moments of $\widetilde{Z}^{\lambda,n}_{M,h}$ and $\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}$ . Recall $C_{d}=\Big{(}\frac{d}{2}+1\Big{)}\Big{(}\frac{d}{2}\Big{)}$ . Using the definition of $\widetilde{Z}^{\lambda,n}_{m+1,h}$ , we have

	$\displaystyle\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{m,h}\big{\|}^{4}\Big{]}$	$\displaystyle=\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{m-1,h}-h\nabla L(\widetilde{Z}^{\lambda,n}_{m-1,h})+\sqrt{2\lambda^{-1}}\Delta\widetilde{W}^{n}_{h}\big{\|}^{4}\Big{]}$
		$\displaystyle\leq C\bigg{(}\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{m-1,h}\big{\|}^{4}\Big{]}+h^{4}\mathbb{E}\Big{[}\|\nabla L(\widetilde{Z}^{\lambda,n}_{m-1,h})\|^{4}\Big{]}+\frac{h^{2}}{\lambda^{2}}C_{d}\bigg{)}$
		$\displaystyle\leq C\bigg{(}\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{m-1,h}\big{\|}^{4}\Big{]}+h^{4}\mathbb{E}\bigg{[}\Big{\|}\frac{2}{1-u}+\gamma\widetilde{Z}^{\lambda,n}_{m-1,h}\Big{\|}^{4}\bigg{]}+\frac{h^{2}}{\lambda^{2}}C_{d}\bigg{)}$
		$\displaystyle\leq C\left(\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{m-1,h}\big{\|}^{4}\Big{]}+\frac{h^{4}}{(1-u)^{4}}+\gamma^{4}h^{4}\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{m-1,h}\big{\|}^{4}\Big{]}+\frac{h^{2}}{\lambda^{2}}C_{d}\right)$
		$\displaystyle\leq C\left((1+\gamma^{4}h^{4})\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{m-1,h}\big{\|}^{4}\Big{]}+\frac{h^{4}}{(1-u)^{4}}+\frac{h^{2}}{\lambda^{2}}C_{d}\right),$

where the bound on $\nabla L$ follows from the fact that $\widetilde{L}$ is $\frac{2}{1-u}$ -Lipschitz. Using this recursive relationship, it can be checked by induction that for all $m\in\mathbb{N}$ ,

$\displaystyle\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{m,h}\big{\|}^{4}\Big{]}$	$\displaystyle\leq C\bigg{(}(1+\gamma^{4}h^{4})^{m}\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{0,h}\big{\|}^{4}\Big{]}+\Big{(}\frac{h^{4}}{(1-u)^{4}}+\frac{h^{2}}{\lambda^{2}}C_{d}\Big{)}\sum_{i=1}^{m}(1+\gamma^{4}h^{4})^{i}\bigg{)}$
	$\displaystyle\leq C\bigg{(}(1+\gamma^{4}h^{4})^{m}+\Big{(}\frac{h^{4}}{(1-u)^{4}}+\frac{h^{2}}{\lambda^{2}}C_{d}\Big{)}\frac{(1+\gamma^{4}h^{4})((1+\gamma^{4}h^{4})^{m}-1)}{(1+\gamma^{4}h^{4})-1}\bigg{)}$
	$\displaystyle\leq C\bigg{(}(1+\gamma^{4}h^{4})^{m}+\Big{(}\frac{h^{4}}{(1-u)^{4}}+\frac{h^{2}}{\lambda^{2}}C_{d}\Big{)}(1+\gamma^{4}h^{4})^{m}\bigg{)}$	(3.14)

where the second inequality follows by properties of geometric series. By the same argument, we also have

\displaystyle\mathbb{E}\Big{[}\big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{4}\Big{]}\leq C\bigg{(}(1+\gamma^{4}h^{4})^{M}+\Big{(}\frac{h^{4}}{(1-u)^{4}}+\frac{h^{2}}{\lambda^{2}}C_{d}\Big{)}(1+\gamma^{4}h^{4})^{M}\bigg{)}.

(3.15)

Thus, we have

\displaystyle\frac{\gamma^{2}}{N}\sum_{n=1}^{N}\Big{(}\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}\big{\|}^{4}\Big{]}+\mathbb{E}\Big{[}\big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{4}\Big{]}\Big{)}\leq\gamma^{2}C^{{}^{\prime}5}_{(u,t,\gamma)},

(3.16)

where $C^{{}^{\prime}5}_{(u,t,\gamma)}$ is given in (5.5).

Let us now turn to the first term on the right hand side of (3.13). Adding and subtracting $\widetilde{L}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})$ , we have

\displaystyle\mathbb{E}\Big{[}\Big{(}\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})-\widetilde{\ell}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\Big{)}^{2}\Big{]}\leq 2\mathbb{E}\Big{[}\Big{(}\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})-\widetilde{L}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\Big{)}^{2}\Big{]}+2\mathbb{E}\Big{[}\Big{(}\widetilde{L}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})-\widetilde{\ell}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\Big{)}^{2}\Big{]}.

(3.17)

Using that $\widetilde{L}$ is $\frac{2}{1-u}$ –Lipschitz, it follows that

\displaystyle\mathbb{E}\Big{[}\Big{(}\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})-\widetilde{L}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\Big{)}^{2}\Big{]}

\displaystyle\leq\frac{4}{(1-u)^{2}}\mathbb{E}\left[\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{2}\right]

Next, we will bound the fourth moment of the difference between $\widetilde{Z}^{\lambda,n}_{M,h}$ and $\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}$ . Using the definitions of $\widetilde{Z}^{\lambda,n}_{M,h}$ and $\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}$ , given in Equation 2.4 and Equation 2.6 respectively, we have

		$\displaystyle\mathbb{E}\left[\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{4}\right]=\mathbb{E}\left[\big{\|}\widetilde{Z}^{\lambda,n}_{M-1,h}-h\nabla L(\widetilde{Z}^{\lambda,n}_{M-1,h})-\widetilde{Z^{\prime}}^{\lambda,n}_{M-1,h}+h\nabla\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M-1,h})\big{\|}^{4}\right]$
		$\displaystyle\leq C\mathbb{E}\left[\big{\|}\widetilde{Z}^{\lambda,n}_{M-1,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M-1,h}\big{\|}^{4}\right]+Ch^{4}\mathbb{E}\left[\big{\|}\nabla L(\widetilde{Z}^{\lambda,n}_{M-1,h})-\nabla\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M-1,h})\big{\|}^{4}\right].$

Using this recursive relationship, it can be checked by induction that we have

\mathbb{E}\left[\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{4}\right]\leq C\sum_{m=0}^{M-1}h^{4}\mathbb{E}\left[\big{\|}\nabla L(\widetilde{Z}^{\lambda,n}_{m,h})-\nabla\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{m,h})\big{\|}^{4}\right].

(3.18)

Let $\widetilde{Z}^{\lambda,n,d-1}_{M,h}$ (resp. $\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h}$ ) denote the last $d-1$ coordinates of $\widetilde{Z}^{\lambda,n}_{M,h}$ (resp. $\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}$ ). Define the vector $\boldsymbol{S}:=(-1,S^{1},\cdots,S^{d})$ , and let $e_{1}\in\mathbb{R}^{d}$ be a vector with $1+\gamma m$ in the first entry and $0$ everywhere else. Adding and subtracting $\nabla L(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})$ , we get

	$\displaystyle\mathbb{E}\left[\big{\|}\nabla L(\widetilde{Z}^{\lambda,n}_{M,h})-\nabla\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\big{\|}^{4}\right]$
$\displaystyle\leq$	$\displaystyle C\mathbb{E}\left[\big{\|}\nabla L(\widetilde{Z}^{\lambda,n}_{M,h})-\nabla L(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\big{\|}^{4}\right]+C\mathbb{E}\left[\big{\|}\nabla L(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})-\nabla\ell(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\big{\|}^{4}\right]$
$\displaystyle\leq$	$\displaystyle C\left(\frac{1}{1-u}+\gamma\right)^{4}\mathbb{E}\left[\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{4}\right]$
	$\displaystyle+C\mathbb{E}\bigg{[}\bigg{\|}\Big{(}e_{1}+\frac{1}{1-u}\mathbb{E}\big{[}\nabla_{r}f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S)1_{\{f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S)\geq\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\}}\|\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{]}\Big{)}$	(3.19)
	$\displaystyle-\Big{(}e_{1}+\frac{1}{1-u}\frac{1}{N}\sum_{i=1}^{N}(\nabla f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S^{i})1_{\{f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S^{i})\geq\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}(1)\}})\Big{)}\bigg{\|}^{4}\bigg{]}$
$\displaystyle\leq$	$\displaystyle C\bigg{(}\frac{1}{1-u}+\gamma\bigg{)}^{4}\mathbb{E}\bigg{[}\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{4}\bigg{]}+C\bigg{(}\frac{1}{1-u}\bigg{)}^{4},$	(3.20)

where the last inequality follows by Assumption 2.2 $(iv)$ . Putting equations (3.18), (3.20) together, we have

\mathbb{E}\left[\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{4}\right]\leq Ch^{4}\left(\frac{1}{1-u}+\gamma\right)^{4}\sum_{m=0}^{M-1}\mathbb{E}\left[\big{\|}\widetilde{Z}^{\lambda,n}_{m,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{m,h}\big{\|}^{4}\right]+CMh^{4}\left(\frac{1}{1-u}\right)^{4}.

Using the discrete version of Grönwall’s inequality [36, Proposition 5], we have

\mathbb{E}\left[\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\big{\|}^{4}\right]\leq CMh^{4}\left(\frac{1}{1-u}\right)^{4}\exp\left(CMh^{4}\left(\frac{1}{1-u}+\gamma\right)^{4}\right).

(3.21)

Therefore, it follows that

\displaystyle\mathbb{E}\Big{[}\Big{(}\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})-\widetilde{L}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\Big{)}^{2}\Big{]}

\displaystyle\leq\frac{4}{(1-u)^{4}}M^{1/2}h^{2}\exp\left(CMh^{4}\left(\frac{1}{1-u}+\gamma\right)^{4}\right).

For the second term on the right hand side of (3.17), we use a law-of-large-numbers type argument. In fact, we have

		$\displaystyle\mathbb{E}\Big{[}(\widetilde{L}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})-\widetilde{\ell}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}))^{2}\Big{]}=\mathbb{E}\bigg{[}\mathbb{E}\Big{[}(\widetilde{L}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})-\widetilde{\ell}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}))^{2}\|\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\bigg{]}$
		$\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\left(\frac{1}{1-u}\mathbb{E}\Big{[}(f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S)-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}\Big{]}-\frac{1}{1-u}\frac{1}{N}\sum_{i=1}^{N}(f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S^{i})-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}\right)^{2}\Bigg{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\right]\right]$
		$\displaystyle=\left(\frac{1}{1-u}\right)^{2}\frac{1}{N^{2}}\mathbb{E}\Bigg{[}\mathbb{E}\Bigg{[}\sum_{i,j=1}^{N}\left((f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S^{i})-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}-\mathbb{E}\Big{[}(f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S)-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}\|\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\right)$
		$\displaystyle\qquad\qquad\cdot\left((f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S^{j})-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}-\mathbb{E}\Big{[}(f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S)-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}(1))^{+}\|\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\right)\Big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Bigg{]}\Bigg{]}$		(3.22)

For $i\neq j$ , using that $S^{i}$ and $S^{j}$ are independent, we have

		$\displaystyle\mathbb{E}\Bigg{[}\mathbb{E}\Bigg{[}\left((f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S^{i})-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}-\mathbb{E}\Big{[}(f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S)-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}\|\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\right)$
		$\displaystyle\cdot\left((f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S^{j})-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}-\mathbb{E}\Big{[}(f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S)-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}\|\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\right)\Big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\Bigg{]}$
	$\displaystyle=$	$\displaystyle\mathbb{E}\Bigg{[}\mathbb{E}\Big{[}(f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S^{i})-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}-\mathbb{E}\Big{[}(f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S)-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}(1))^{+}\|\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\Big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\Bigg{]}$
		$\displaystyle\cdot\mathbb{E}\Bigg{[}\mathbb{E}\Big{[}(f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S^{j})-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}(1))^{+}-\mathbb{E}\Big{[}(f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S)-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}\|\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\Big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\Bigg{]}=0.$

Therefore, we can estimate the desired term as

	$\displaystyle\mathbb{E}\Big{[}\Big{(}\widetilde{L}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})-\widetilde{\ell}(\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})\Big{)}^{2}\Big{]}$		(3.23)
	$\displaystyle=\left(\frac{1}{1-u}\right)^{2}\frac{1}{N^{2}}\sum_{i=1}^{N}\mathbb{E}\Bigg{[}\mathbb{E}\Big{[}\left((f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S^{i})-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}-\mathbb{E}\Big{[}(f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S)-\widetilde{Z^{\prime}}^{\lambda,n}_{M,h})^{+}\|\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\right)^{2}\Big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\Bigg{]}$
	$\displaystyle\leq\left(\frac{1}{1-u}\right)^{2}\frac{C}{N}\mathbb{E}\bigg{[}\mathbb{E}\Big{[}\big{\|}f(\widetilde{Z^{\prime}}^{\lambda,n,d-1}_{M,h},S)\big{\|}^{2}\Big{\|}\widetilde{Z^{\prime}}^{\lambda,n}_{M,h}\Big{]}\bigg{]}\leq\left(\frac{1}{1-u}\right)^{2}\frac{C}{N}\mathbb{E}\Big{[}1+\|S\|^{2}\Big{]}\leq\frac{C}{N}\big{(}1+C^{{}^{\prime}4}_{(u,t)}\big{)},$		(3.24)

with $C^{{}^{\prime}4}_{(u,t)}$ given in (5.4). Here the third-to-last step follows from Jensen’s inequality and the tower property. The penultimate step follows from the boundedness of $\nabla_{r}f(r,S)$ and the Lipschitz continuity of $\nabla_{s}f(r,S)$ . In the last step, we used that $S$ has a finite fourth moment, together with equation (3.2). Finally, putting (3.13), (3.16) and (3.24) together yields the lemma. ∎

We now investigate the second term on the right hand side of (3.5).

Lemma 3.5.

Under the conditions of Theorem 2.3, for all $t,\gamma>0,u\in(0,1),M,N\in\mathbb{N}^{\star}$ and $\lambda>1$ , we have

\displaystyle\mathbb{E}\bigg{[}\bigg{|}\frac{1}{N}\sum_{n=1}^{N}L(\widetilde{Z}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}L(\widehat{Z}^{\lambda,n}_{t})\bigg{|}^{2}\bigg{]}\leq C^{{}^{\prime}6}_{(u)}h^{2}+C^{{}^{\prime}7}_{(u,M,\lambda)}\gamma^{2},

where the constants $C^{{}^{\prime}6}_{(u)}$ and $C^{{}^{\prime}7}_{(u,M,\lambda)}$ are given in Equations (5.6) and (5.7).

Proof.

Following the same argument as in the proof of Lemma 3.3, we have

\displaystyle\mathbb{E}\bigg{[}\bigg{|}\frac{1}{N}\sum_{n=1}^{N}L(\widetilde{Z}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}L(\widehat{Z}^{\lambda,n}_{t})\bigg{|}^{2}\bigg{]}\leq C\bigg{(}\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\Big{[}\Big{(}\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})-\widetilde{L}(\widehat{Z}^{\lambda,n}_{t})\Big{)}^{2}\Big{]}+\frac{\gamma^{2}}{N}\sum_{n=1}^{N}\mathbb{E}\Big{[}\Big{(}\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}\big{\|}^{2}-\big{\|}\widehat{Z}^{\lambda,n}_{t}\big{\|}^{2}\Big{)}^{2}\Big{]}\bigg{)}.

(3.25)

Let $Z^{\lambda}(s)$ be a continuous time approximation of the Euler-Maruyama scheme in (2.6). One way to define such an approximation is by setting

Z^{\lambda}(s):=z-\int_{0}^{n_{s}}\nabla L(Z^{\lambda}(r))\,\mathrm{d}r+\int_{0}^{n_{s}}\sqrt{2\lambda^{-1}}\,\mathrm{d}W_{r},

with $n_{s}:=\max\{\frac{t}{M^{2}}n:\frac{t}{M^{2}}n\leq s,n\in\mathbb{Z}\}$ . Note that for each $1\leq i\leq N,1\leq m\leq M$ , and $t>0$ , we have $Z^{\lambda}(s_{m})=\widetilde{Z}_{m,h}^{\lambda,i}$ . In other words, $Z^{\lambda}(s)$ coincides with $\widetilde{Z}_{m,h}^{\lambda,i}$ at the time discretization points. For the first term in (3.25), we have

	$\displaystyle\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\Big{[}\|\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})-\widetilde{L}(\widehat{Z}^{\lambda,n}_{t})\|^{2}\Big{]}$	$\displaystyle\leq\frac{4}{(1-u)^{2}N}\sum_{n=1}^{N}\mathbb{E}\Big{[}\big{\|}\widetilde{Z}_{M,\frac{t}{M^{2}}}^{\lambda,n}-\widehat{Z}_{t}^{\lambda,n}\big{\|}^{2}\Big{]}$
		$\displaystyle\leq\frac{4}{(1-u)^{2}N}\sum_{n=1}^{N}\mathbb{E}\Big{[}\sup_{0\leq s\leq t}\big{\|}Z^{\lambda}(s)-\widehat{Z}_{s}^{\lambda,n}\big{\|}^{2}\Big{]},$

where we used that $\widetilde{L}$ is $\frac{2}{1-u}$ –Lipschitz in the first inequality. By standard results on the error estimation for SDE approximations, see e.g. [43, Theorems 10.3.5 and 10.6.3], we have

\mathbb{E}\Big{[}\sup_{0\leq s\leq t}|Z^{\lambda}(s)-\widehat{Z}_{s}^{\lambda,n}|^{4}\Big{]}\leq Ch^{4}

(3.26)

for some constant $C>0$ . Thus, we have

\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\Big{[}|\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})-\widetilde{L}(\widehat{Z}^{\lambda,n}_{t})|^{2}\Big{]}\leq\frac{Ch^{2}}{(1-u)^{2}}.

(3.27)

For the second term on the right hand side of (3.25), by the Cauchy-Schwarz inequality, we have

$\displaystyle\mathbb{E}\Big{[}\Big{(}\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}\big{\|}^{2}-\big{\|}\widehat{Z}^{\lambda,n}_{t}\big{\|}^{2}\Big{)}^{2}\Big{]}$	$\displaystyle\leq\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}\big{\|}^{4}+\big{\|}\widehat{Z}^{\lambda,n}_{t}\big{\|}^{4}\Big{]}^{\frac{1}{2}}\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}-\widehat{Z}^{\lambda,n}_{t}\big{\|}^{4}\Big{]}^{\frac{1}{2}}$
	$\displaystyle\leq\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}\big{\|}^{4}+\big{\|}\widehat{Z}^{\lambda,n}_{t}\big{\|}^{4}\Big{]}^{\frac{1}{2}}\mathbb{E}\Big{[}\sup_{0\leq s\leq t}\big{\|}Z^{\lambda}(s)-\widehat{Z}^{\lambda,n}_{s}\big{\|}^{4}\Big{]}^{\frac{1}{2}}$
	$\displaystyle\leq C\mathbb{E}\Big{[}\big{\|}\widetilde{Z}^{\lambda,n}_{M,h}\big{\|}^{4}+\big{\|}\widehat{Z}^{\lambda,n}_{t}\big{\|}^{4}\Big{]}^{\frac{1}{2}}h^{2}$	(3.28)

where we used equation (3.26) in the last step.

It remains to control the fourth moment of $\widehat{Z}^{\lambda,n}_{t}$ , since that of $\widetilde{Z}^{\lambda,n}_{M,h}$ was bounded in Equation (3.2). Since $\widehat{Z}^{\lambda,n}_{t}$ solves the SDE

\,\mathrm{d}\widehat{Z}^{\lambda,n}_{t}=-\nabla L(\widehat{Z}^{\lambda,n}_{t})\,\mathrm{d}t+\sqrt{2\lambda^{-1}}\,\mathrm{d}\widetilde{W}^{n}_{t},

with a linearly growing drift, the following bound on the fourth moment of the solution follows by standard arguments:

\displaystyle\mathbb{E}\Big{[}\big{\|}\widehat{Z}^{\lambda,n}_{t}\big{\|}^{4}\Big{]}

\displaystyle\leq C\left(1+\frac{t}{(1-u)^{4}}+\frac{1}{\lambda^{2}}t^{2}\right)e^{\gamma^{4}Ct},

(3.29)

where $C$ is a constant depending only on the Lipschitz constant of $\nabla L$ . We omit the proof. Putting equations (3.27), (3.28), (3.29) and (3.2) together, and recalling that $t=hM^{2}$ , we have the result of the lemma. ∎
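The discretization error controlled in this lemma can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical one-dimensional potential $L(z)=\sqrt{1+z^{2}}+\gamma z^{2}/2$ (chosen only because its gradient is Lipschitz), uses an Euler-Maruyama run on a much finer grid as a stand-in for the exact solution $\widehat{Z}_{t}^{\lambda}$, and drives all schemes with the same Brownian increments. The names `grad_L`, `euler_maruyama` and `coarsen` are illustrative.

```python
import numpy as np

def grad_L(z, gamma=0.5):
    # Gradient of the hypothetical potential L(z) = sqrt(1 + z^2) + gamma * z^2 / 2.
    return z / np.sqrt(1.0 + z * z) + gamma * z

def euler_maruyama(z0, dW, h, lam):
    """Iterate Z_{m+1} = Z_m - h * grad_L(Z_m) + sqrt(2 / lam) * dW_m for all paths."""
    z = np.full(dW.shape[0], z0, dtype=float)
    for m in range(dW.shape[1]):
        z = z - h * grad_L(z) + np.sqrt(2.0 / lam) * dW[:, m]
    return z

def coarsen(dW, factor):
    """Sum consecutive fine Brownian increments to get increments on a coarser grid."""
    n_paths, n_steps = dW.shape
    return dW.reshape(n_paths, n_steps // factor, factor).sum(axis=2)

rng = np.random.default_rng(0)
t, lam, n_paths, n_fine = 1.0, 2.0, 2000, 1024
h_fine = t / n_fine
dW_fine = rng.normal(0.0, np.sqrt(h_fine), size=(n_paths, n_fine))

# Reference solution: the same scheme on the finest grid (a proxy, not the exact SDE solution).
z_ref = euler_maruyama(1.0, dW_fine, h_fine, lam)

def mse(factor):
    """Mean-square terminal error of the scheme with step h_fine * factor."""
    z = euler_maruyama(1.0, coarsen(dW_fine, factor), h_fine * factor, lam)
    return float(np.mean((z - z_ref) ** 2))

mse_coarse, mse_finer = mse(128), mse(16)
print(mse_coarse, mse_finer)
```

Under these assumptions the mean-square error should shrink as the step size decreases, in line with the $h^{2}$ term of the lemma; the sketch does not attempt to reproduce the constants.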

Now we move on to analyzing the third term in (3.5).

Lemma 3.6.

Under the conditions of Theorem 2.3, for all $h,t>0,u\in(0,1),N\in\mathbb{N}^{\star}$ , $0<\gamma<1$ , and $\lambda>1$ , we have

\displaystyle\mathbb{E}\bigg{[}\bigg{|}\frac{1}{N}\sum_{n=1}^{N}L(\widehat{Z}^{\lambda,n}_{t})-\mathbb{E}[L(Z_{t}^{\lambda})]\bigg{|}^{2}\bigg{]}\leq\frac{1}{N}C^{{}^{\prime}8}_{(u,M,\lambda,t)}

for $C^{{}^{\prime}8}_{(u,M,\lambda,t)}$ given in Equation (5.8).

Proof.

Since $(\widehat{Z}^{\lambda,n})_{n\geq 1}$ are i.i.d. copies of $Z_{t}^{\lambda}$ , a standard law of large numbers argument gives

\displaystyle\mathbb{E}\bigg{[}\bigg{|}\frac{1}{N}\sum_{n=1}^{N}L(\widehat{Z}^{\lambda,n}_{t})-\mathbb{E}[L(Z_{t}^{\lambda})]\bigg{|}^{2}\bigg{]}\leq\frac{1}{N}\mathrm{Var}(L(Z^{\lambda}_{t}))

where $\mathrm{Var}(L(Z^{\lambda}_{t}))$ is the variance of $L(Z^{\lambda}_{t})$ . Since $\nabla L$ is $(\frac{1}{1-u}+\gamma)$ –Lipschitz, it follows by [20, Corollary 5.11], that the law of $Z^{\lambda}_{t}$ satisfies the Poincaré inequality. That is,

\mathrm{Var}(L(Z_{t}^{\lambda}))\leq\frac{2}{\lambda(\frac{1}{1-u}+\gamma)^{2}}\mathbb{E}\Big{[}\big{\|}\nabla L(Z_{t}^{\lambda})\big{\|}^{2}\Big{]}.

Now using that $\nabla\widetilde{L}$ is $\frac{2}{1-u}$ –Lipschitz, we have $\big{\|}\nabla L(x)\big{\|}\leq\|\nabla\widetilde{L}(0)\|+(\frac{2}{1-u}+\gamma)\|x\|$ . Thus, it follows that

	$\displaystyle\mathrm{Var}(L(Z_{t}^{\lambda}))$	$\displaystyle\leq\frac{2}{\lambda(\frac{1}{1-u}+\gamma)^{2}}\mathbb{E}\bigg{[}\bigg{(}\|\nabla\widetilde{L}(0)\|+\Big{(}\frac{2}{1-u}+\gamma\Big{)}\big{\|}Z^{\lambda}_{t}\big{\|}\bigg{)}^{2}\bigg{]}$
		$\displaystyle\leq C\bigg{(}\frac{1}{\lambda(\frac{1}{1-u}+\gamma)^{2}}+\frac{1}{\lambda}\Big{(}\mathbb{E}\Big{[}\big{\|}Z_{t}^{\lambda}\big{\|}^{4}\Big{]}\Big{)}^{\frac{1}{2}}\bigg{)}$
		$\displaystyle\leq C\bigg{(}\frac{1}{\lambda(1+\gamma)^{2}}+\frac{1}{\lambda}\bigg{(}1+\frac{\sqrt{t}}{(1-u)^{2}}+\frac{t}{\lambda}\bigg{)}e^{\gamma^{4}Ct}\bigg{)},$

where the last inequality follows from (3.29). Using $0<\gamma<1$ to bound the factors depending on $\gamma$ yields the result of the lemma. ∎
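The first step of this proof, the $\frac{1}{N}\mathrm{Var}$ scaling of the mean-square error of an i.i.d. sample average, can be checked exactly on a toy example. The sketch below assumes a hypothetical three-point distribution standing in for the law of $L(Z^{\lambda}_{t})$ and enumerates all samples of size $N=3$ with exact rational arithmetic; none of the numbers come from the paper.

```python
from fractions import Fraction
from itertools import product

# Hypothetical values taken by L(Z), each with probability 1/3.
support = [Fraction(0), Fraction(1), Fraction(2)]
N = 3

mean = sum(support) / len(support)
variance = sum((v - mean) ** 2 for v in support) / len(support)

# All N-tuples of independent draws are equally likely, so the expectation
# of the squared error of the sample average is a plain average over tuples.
samples = list(product(support, repeat=N))
mse = sum((sum(s) / N - mean) ** 2 for s in samples) / len(samples)

print(mse, variance / N)
```

In this toy case the two printed quantities coincide, so the bound holds with equality; the Poincaré inequality is only needed afterwards to control the variance itself.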

In the next lemma we analyze the fourth term in (3.5). Essentially, this concerns the rate of convergence of the law $\mu_{t}^{\lambda}$ of the solution of the Langevin equation to its invariant measure $\mu^{\lambda}_{\infty}$ .

Lemma 3.7.

For all $t>0,u,\gamma\in(0,1)$ , $\lambda>1$ and an initial position $Z^{\lambda}_{0}=z\in\mathbb{R}^{d+1}$ , we have

\displaystyle\bigg{|}\mathbb{E}L(Z^{\lambda}_{t})-\int L(x)\,\mathrm{d}\mu_{\infty}^{\lambda}\bigg{|}^{2}

\displaystyle\leq C^{4}_{(u,\lambda)}e^{-tC^{5}_{(\lambda)}}+\gamma^{2}C^{{}^{\prime}9}_{(\lambda,t)},

where the constants $C^{4}_{(u,\lambda)},C^{5}_{(\lambda)}$ , and $C^{{}^{\prime}9}_{(\lambda,t)}$ are given in Equations (5.9)-(5.11).

Proof.

The investigation of convergence rates to the invariant measure is an active research area, see e.g. [8]. In the present case with a non-convex potential function $L$ , this follows from the so-called coupling by reflection arguments of Eberle [22]. In fact, by 2.2 $(iv)$ , it follows from [22, Corollary 2.1] (see also [23, Corollary 2]) that there is a constant $C_{(\lambda)}>0$ depending only on $d,\gamma$ and $\lambda$ such that

\mathcal{W}_{1}(\mu^{\lambda}_{t},\mu_{\infty}^{\lambda})\leq C_{(\lambda)}e^{-tC_{(\lambda)}}\mathcal{W}_{1}(\delta_{z},\mu^{\lambda}_{\infty})\quad\text{for all }t>0.

(3.30)

It now remains to bound $|\mathbb{E}L(Z^{\lambda}_{t})-\int L(x)d\mu_{\infty}^{\lambda}|$ by $\mathcal{W}_{1}(\mu^{\lambda}_{t},\mu^{\lambda}_{\infty})$ . Let $\hat{\alpha}\in\Gamma(\mu_{\infty}^{\lambda},\mu_{t}^{\lambda})$ be an optimal coupling of $\mu_{\infty}^{\lambda}$ and $\mu_{t}^{\lambda}$ , i.e. such that

\mathbb{E}_{\hat{\alpha}}\|X-Y\|=\inf_{\alpha\in\Gamma(\mu_{\infty}^{\lambda},\mu_{t}^{\lambda})}\mathbb{E}_{\alpha}\|X-Y\|,

see e.g. [60] for existence of $\hat{\alpha}$ . Above, we denote by $\mathbb{E}_{\alpha}[\|X-Y\|]$ the expectation under $\alpha$ of $\|X-Y\|$ , with $(X,Y)\sim\alpha$ . By definition of $L$ and Lipschitz–continuity of $\widetilde{L}$ , it holds

|L(x)-L(y)|\leq\frac{2}{1-u}\|x-y\|+\frac{\gamma}{2}(\|x\|^{2}+\|y\|^{2}).

Taking the expectation with respect to $\hat{\alpha}\in\Gamma(\mu_{\infty}^{\lambda},\mu_{t}^{\lambda})$ we have

	$\displaystyle\mathbb{E}_{\hat{\alpha}}\Big{[}\|L(X)-L(Y)\|\Big{]}$	$\displaystyle\leq\frac{2}{1-u}\mathbb{E}_{\hat{\alpha}}\Big{[}\|X-Y\|\Big{]}+\frac{\gamma}{2}\mathbb{E}_{\hat{\alpha}}\Big{[}\|X\|^{2}+\|Y\|^{2}\Big{]}$
		$\displaystyle\leq\frac{2}{1-u}\mathcal{W}_{1}(\mu_{\infty}^{\lambda},\mu_{t}^{\lambda})+\frac{\gamma}{2}\Big{(}\mathbb{E}[\|Z^{\lambda}_{\infty}\|^{2}]+\mathbb{E}[\big{\|}Z^{\lambda}_{t}\big{\|}^{2}]\Big{)},$		(3.31)

where $Z^{\lambda}_{\infty}\sim\mu^{\lambda}_{\infty}$ . As in the proof of Lemma 3.5, see e.g. Equation (3.29), $Z^{\lambda}_{t}$ has second moment bounded by a constant $C_{(\lambda,t)}>0$ . Concerning the term $\mathbb{E}[\|Z^{\lambda}_{\infty}\|^{2}]$ , note that

\mathbb{E}[\|Z^{\lambda}_{\infty}\|^{2}]=\int\|x\|^{2}\,\mathrm{d}\mu_{\infty}^{\lambda}=\int\|x\|^{2}\frac{e^{-\lambda L(x)}}{\int e^{-\lambda L(a)}\,\mathrm{d}a}\,\mathrm{d}x.

Since $\widetilde{L}$ is $\frac{2}{1-u}$ -Lipschitz, $|\widetilde{L}(x)|\leq C+\frac{2}{1-u}\|x\|$ for some constant $C$ . We thus have

\displaystyle\int\|x\|^{2}e^{-\lambda L(x)}\,\mathrm{d}x=\int\|x\|^{2}e^{-\lambda(\widetilde{L}(x)+\gamma\|x\|^{2}/2)}\,\mathrm{d}x\leq\int\|x\|^{2}e^{C\lambda+2\lambda\|x\|/(1-u)-\|x\|^{2}/2}\,\mathrm{d}x<\infty.

(3.32)

Using the same argument, $0<\int e^{-\lambda L(a)}\,\mathrm{d}a<\infty$ . Therefore, we have $\mathbb{E}[\|Z^{\lambda}_{\infty}\|^{2}]<\infty$ . Combining this with (3.30) and (3.31) yields the desired result. ∎

Remark 3.8.

If the function $f$ is convex, it follows that $L$ is strongly convex in the sense that $\nabla^{2}L\geq\gamma I_{d}$ , where $I_{d}$ is the $d\times d$ identity matrix. In this case, the exponential convergence to equilibrium follows by standard arguments, see e.g. [8]. In fact, we have the following bound in the second-order Wasserstein distance:

\mathcal{W}_{2}^{2}(\mu^{\lambda}_{t},\mu^{\lambda}_{\infty})\leq e^{-2\gamma t}\mathcal{W}_{2}^{2}(\delta_{z},\mu^{\lambda}_{\infty}).

(3.33)

In this case, a slight modification of the above arguments allows us to obtain the bound

\displaystyle\bigg{|}\mathbb{E}L(Z^{\lambda}_{t})-\int L(x)\,\mathrm{d}\mu_{\infty}^{\lambda}\bigg{|}^{2}

\displaystyle\leq C\left(\frac{1}{(1-u)^{2}}+1\right)e^{-\gamma t}\mathcal{W}_{2}^{2}(\delta_{z},\mu^{\lambda}_{\infty}).
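The contraction (3.33) can be verified in closed form in the simplest strongly convex case. The sketch below assumes a hypothetical one-dimensional quadratic potential $L(x)=\gamma x^{2}/2$, for which the Langevin equation is an Ornstein-Uhlenbeck process with Gaussian marginals, and uses the standard formula for the 2-Wasserstein distance between one-dimensional Gaussians. The function names are illustrative and the example is not taken from the paper.

```python
import math

def w2_squared_gaussian(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2) in one dimension."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

def contraction_sides(z, t, gamma, lam):
    """Both sides of (3.33) for dZ = -gamma * Z dt + sqrt(2 / lam) dW started at z."""
    s_inf = math.sqrt(1.0 / (lam * gamma))                       # invariant standard deviation
    m_t = z * math.exp(-gamma * t)                               # mean of Z_t
    s_t = s_inf * math.sqrt(1.0 - math.exp(-2.0 * gamma * t))    # standard deviation of Z_t
    lhs = w2_squared_gaussian(m_t, s_t, 0.0, s_inf)
    # A Dirac mass at z is treated as a Gaussian with zero standard deviation.
    rhs = math.exp(-2.0 * gamma * t) * w2_squared_gaussian(z, 0.0, 0.0, s_inf)
    return lhs, rhs

for params in [(1.0, 0.1, 0.5, 2.0), (3.0, 1.0, 0.2, 5.0), (-2.0, 5.0, 1.0, 1.5)]:
    lhs, rhs = contraction_sides(*params)
    print(params, lhs, rhs)
```

For each parameter tuple $(z,t,\gamma,\lambda)$ the left-hand side should not exceed the right-hand side, which is the content of (3.33) in this special case.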

The estimation of the fifth term in the decomposition (3.5) is an immediate consequence of the existence of second moment of the invariant measure $\mu_{\infty}^{\lambda}$ obtained in the proof of the preceding lemma. In fact, by definition of $L$ and $\widetilde{L}$ we have

\bigg{|}\int L(x)\,\mathrm{d}\mu_{\infty}^{\lambda}-\int\widetilde{L}(x)\,\mathrm{d}\mu_{\infty}^{\lambda}\bigg{|}^{2}=\left(\frac{\gamma}{2}\right)^{2}\left(\int\|x\|^{2}\,\mathrm{d}\mu_{\infty}^{\lambda}\right)^{2}\leq C\gamma^{2},

(3.34)

for a constant $C>0$ . We conclude the proof of the theorem with the following lemma estimating the last term on the right hand side of (3.5).

Lemma 3.9.

Under the conditions of Theorem 2.3, we have

\displaystyle\bigg{|}\int\widetilde{L}(x)\,\mathrm{d}\mu_{\infty}^{\lambda}-\overline{\mathrm{AVaR}}(f)\bigg{|}^{2}\leq C\gamma^{2}+\frac{C^{6}_{(u)}}{\lambda^{2}},

(3.35)

where $C^{6}_{(u)}$ is given in Equation (5.13).

Proof.

First recall (see Equation (2.2) and Lemma 3.2) that $\overline{\mathrm{AVaR}}(f)=\widetilde{L}(z^{*})$ where $z^{*}=(r^{*},m^{*})$ is the optimizer in (2.2). Now, consider the differential entropy

	$\displaystyle-\int\mu_{\infty}^{\lambda}\log\mu_{\infty}^{\lambda}\,\mathrm{d}x$	$\displaystyle=-\int\frac{e^{-\lambda L(x)}}{\int e^{-\lambda L(u)}\,\mathrm{d}u}\log\frac{e^{-\lambda L(x)}}{\int e^{-\lambda L(u)}\,\mathrm{d}u}\,\mathrm{d}x$
		$\displaystyle=-\int\frac{e^{-\lambda L(x)}}{\int e^{-\lambda L(u)}\,\mathrm{d}u}\left(-\lambda L(x)-\log\int e^{-\lambda L(u)}\,\mathrm{d}u\right)\,\mathrm{d}x$
		$\displaystyle=\lambda\int L(x)\,\mathrm{d}\mu_{\infty}^{\lambda}+\log\int e^{-\lambda L(x)}\,\mathrm{d}x$
		$\displaystyle=\lambda\int\left(\widetilde{L}(x)+\frac{\gamma}{2}\|x\|^{2}\right)\,\mathrm{d}\mu_{\infty}^{\lambda}+\log\int e^{-\lambda\widetilde{L}(x)-\frac{\lambda\gamma}{2}\|x\|^{2}}\,\mathrm{d}x.$

Rearranging the terms gives the following expression for the integral of $\widetilde{L}$ :

\int\widetilde{L}(x)\,\mathrm{d}\mu_{\infty}^{\lambda}=-\frac{\gamma}{2}\int\|x\|^{2}\,\mathrm{d}\mu_{\infty}^{\lambda}-\frac{1}{\lambda}\int\mu_{\infty}^{\lambda}\log\mu_{\infty}^{\lambda}\,\mathrm{d}x-\frac{1}{\lambda}\log\int e^{-\lambda\widetilde{L}(x)-\frac{\lambda\gamma}{2}\|x\|^{2}}\,\mathrm{d}x.

(3.36)

Since, among all continuous distributions with a given second moment, the Gaussian distribution maximizes the differential entropy ([63, Theorem 10.48]), it holds

-\int\mu_{\infty}^{\lambda}\log\mu_{\infty}^{\lambda}\,\mathrm{d}x\leq\frac{1}{2}\log\left(({2\pi e})^{d+1}\int\|x\|^{2}\,\mathrm{d}\mu_{\infty}^{\lambda}\right),

(3.37)

and using that $\int\|x\|^{2}\,\mathrm{d}\mu_{\infty}^{\lambda}<\infty$ (see equation (3.32)) and subtracting $\widetilde{L}(z^{*})$ from both sides of (3.36), we have

	$\displaystyle\int\widetilde{L}(x)\,\mathrm{d}\mu_{\infty}^{\lambda}-\widetilde{L}(z^{*})$	$\displaystyle\leq C\left(-\gamma+\frac{1}{\lambda}-\frac{1}{\lambda}\log\int e^{-\lambda\widetilde{L}(x)-\frac{\lambda\gamma}{2}\|x\|^{2}}\,\mathrm{d}x-\widetilde{L}(z^{*})\right)$
		$\displaystyle\leq C\left(\gamma+\frac{1}{\lambda}-\frac{1}{\lambda}\log\left(e^{-\lambda\widetilde{L}(z^{*})}\int e^{-\lambda(\widetilde{L}(x)-\widetilde{L}(z^{*}))-\frac{\gamma\lambda}{2}\|x\|^{2}}\,\mathrm{d}x\right)-\widetilde{L}(z^{*})\right)$
		$\displaystyle=C\left(\gamma+\frac{1}{\lambda}-\frac{1}{\lambda}\log\int e^{-\lambda(\widetilde{L}(x)-\widetilde{L}(z^{*}))-\frac{\lambda\gamma}{2}\|x\|^{2}}\,\mathrm{d}x\right).$

Hence, using the fact that $\nabla\widetilde{L}$ is $\frac{1}{1-u}$ -Lipschitz, we can estimate the exponent in the integral above as

|\widetilde{L}(x)-\widetilde{L}(z^{*})-\nabla\widetilde{L}(z^{*})\cdot(x-z^{*})|\leq\frac{1}{2(1-u)}\|x-z^{*}\|^{2}.

Thanks to the above inequality, using $\nabla\widetilde{L}(z^{*})=0$ we obtain

	$\displaystyle\int\widetilde{L}(x)\,\mathrm{d}\mu_{\infty}^{\lambda}-\widetilde{L}(z^{*})$	$\displaystyle\leq C\bigg{(}\gamma+\frac{1}{\lambda}-\frac{1}{\lambda}\log\int e^{-\frac{\lambda}{2(1-u)}\|x-z^{*}\|^{2}-\frac{\gamma\lambda}{2}\|x\|^{2}}\,\mathrm{d}x\bigg{)}$
		$\displaystyle=C\bigg{(}\gamma+\frac{1}{\lambda}-\frac{1}{\lambda}\log\bigg{(}\sqrt{\frac{2\pi}{\lambda(\gamma+\frac{1}{(1-u)})}}e^{-\frac{\gamma\lambda\|z^{*}\|^{2}}{2(1-u)(1/(1-u)+\gamma)}}\bigg{)}\bigg{)}$
		$\displaystyle\leq C\bigg{(}\gamma+\frac{1}{\lambda}-\frac{1}{\lambda}\log\bigg{(}\sqrt{\frac{2\pi}{\lambda(\gamma+\frac{1}{1-u})}}\bigg{)}+\frac{\gamma}{(1-u)(\frac{1}{1-u}+\gamma)}\bigg{)}.$

For the other side of the inequality, since $z^{*}$ is a minimizer of $\widetilde{L}$ , we have

\displaystyle\widetilde{L}(z^{*})-\int\widetilde{L}(x)\,\mathrm{d}\mu_{\infty}^{\lambda}

\displaystyle\leq\widetilde{L}(z^{*})-\widetilde{L}(z^{*})\int\,\mathrm{d}\mu_{\infty}^{\lambda}=0.

Using $0<\gamma<1$ and $\lambda\geq 1$ concludes the proof. ∎
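The concentration of the invariant measure around the minimizer as $\lambda$ grows can be observed with a short quadrature experiment. The sketch below assumes a hypothetical one-dimensional function $\widetilde{L}(x)=\sqrt{1+x^{2}}-1$ (minimum value $0$ at $x=0$, Lipschitz gradient) and a fixed $\gamma=0.1$; it evaluates $\int\widetilde{L}\,\mathrm{d}\mu_{\infty}^{\lambda}-\widetilde{L}(z^{*})$ by a Riemann sum on a uniform grid. It illustrates the qualitative behaviour only and does not reproduce the constants of the lemma.

```python
import numpy as np

def gibbs_gap(lam, gamma=0.1):
    """Approximate the integral of L_tilde against the Gibbs measure with density ~ exp(-lam * L)."""
    x = np.linspace(-10.0, 10.0, 200001)
    L_tilde = np.sqrt(1.0 + x * x) - 1.0        # hypothetical smooth function, min value 0 at x = 0
    L = L_tilde + 0.5 * gamma * x * x           # regularized potential
    w = np.exp(-lam * (L - L.min()))            # unnormalized density, shifted for stability
    # On a uniform grid the spacing cancels in the ratio of the two Riemann sums.
    return float(np.sum(w * L_tilde) / np.sum(w))

gaps = {lam: gibbs_gap(lam) for lam in (1.0, 10.0, 100.0, 1000.0)}
for lam, gap in gaps.items():
    print(lam, gap)
```

Since $\widetilde{L}(z^{*})=0$ here, the printed value is the gap itself; in this example it is nonnegative and decreases as $\lambda$ increases.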

4 Rate for general law invariant convex risk measures

We now focus on the estimation of the approximation error of general convex risk measures. As explained in Section 2, the main argument for the derivation of the rate is the representation of the (law invariant) convex risk measure with respect to $\mathrm{AVaR}$ . One technical difficulty is that this representation involves an integral with respect to the risk aversion level $u$ of $\mathrm{AVaR}$ . Notice that both the functions $L$ and $\widetilde{L}$ as well as the Markov chain $\widetilde{Z}^{\lambda,n}_{m,h}$ in the approximation scheme depend on $u$ . We will make this dependence explicit in this section by writing $L^{u}$ , $\widehat{Z}^{\lambda,n,u}_{t}$ , and $\widetilde{Z}^{\lambda,n,u}_{M,h}$ for the function and processes defined in (2.3), (3.4) and (2.6) respectively.

4.1 Proof of Theorem 2.6

This subsection covers the proof of Theorem 2.6. Recall that we approximate a general law invariant convex risk measure by

\widetilde{\rho}^{\delta}(f):=\operatorname*{ess\,sup}_{\gamma\in\mathcal{M}:\beta(\gamma)\leq b}\bigg{(}\int_{0}^{\delta}\widetilde{\text{AVaR}}_{u}(f)\gamma(\,\mathrm{d}u)-\beta(\gamma)\bigg{)},

with $\widetilde{\text{AVaR}}_{u}(f)=\frac{1}{N}\sum_{n=1}^{N}\overline{\ell^{u}}(\widetilde{Z}_{M,h}^{\prime,\lambda,n,u})$ . Further define

\rho^{\delta}(f):=\inf_{r\in A}\sup_{\gamma\in\mathcal{M}:\beta(\gamma)\leq b}\bigg{(}\int_{0}^{\delta}\text{AVaR}_{u}(f(r,S))\gamma(\,\mathrm{d}u)-\beta(\gamma)\bigg{)},\quad\delta\in(0,1).

(4.1)
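To make the structure of the truncated functional concrete, here is a minimal sketch under simplifying assumptions that are not in the paper: the position is a fixed finite sample (so there is no infimum over $r$), $\mathrm{AVaR}_{u}$ is computed from the representation $\min_{m}\{m+\mathbb{E}[(X-m)^{+}]/(1-u)\}$, the supremum runs over a small hypothetical family of discrete measures $\gamma$ with penalties $\beta(\gamma)$, and levels $u\geq\delta$ are simply dropped. The helper names `avar` and `rho_delta` are illustrative.

```python
import numpy as np

def avar(x, u):
    """Empirical AVaR_u as the minimum over m of m + mean((x - m)^+) / (1 - u).

    The objective is convex and piecewise linear in m with kinks at the
    sample points, so it suffices to minimize over the sample itself.
    """
    x = np.asarray(x, dtype=float)
    return min(m + np.mean(np.maximum(x - m, 0.0)) / (1.0 - u) for m in x)

def rho_delta(x, measures, delta):
    """Maximum over (levels, weights, penalty) triples of the truncated mixture minus the penalty."""
    values = []
    for levels, weights, penalty in measures:
        total = sum(w * avar(x, u) for u, w in zip(levels, weights) if u < delta)
        values.append(total - penalty)
    return max(values)

sample = [1.0, 2.0, 3.0, 4.0]
family = [
    ([0.0], [1.0], 0.0),             # Dirac mass at u = 0: the plain mean, no penalty
    ([0.0, 0.5], [0.5, 0.5], 0.2),   # equal mixture of levels 0 and 0.5, penalty 0.2
]
value = rho_delta(sample, family, delta=0.9)
print(avar(sample, 0.0), avar(sample, 0.5), value)
```

In this toy setting $\mathrm{AVaR}_{0}$ is the sample mean and $\mathrm{AVaR}_{0.5}$ is the average of the two largest observations, so the second measure attains the maximum.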

To begin, we decompose the approximation error into two parts:

\displaystyle\mathbb{E}\Big{[}|\rho(f)-\widetilde{\rho}^{\delta}(f)|^{2}\Big{]}\leq 2|\rho(f)-\rho^{\delta}(f)|^{2}+2\mathbb{E}\Big{[}|\rho^{\delta}(f)-\widetilde{\rho}^{\delta}(f)|^{2}\Big{]}.

(4.2)

Let us estimate the first term. First, we will show that for all $\delta\in(0,1)$ , $\rho^{\delta}$ satisfies

	$\displaystyle\rho^{\delta}(f)$	$\displaystyle=\sup_{\gamma\in\mathcal{M}:\beta(\gamma)\leq b}\bigg{(}\int_{0}^{\delta}\inf_{r\in A}\text{AVaR}_{u}(f(r,S))\gamma(\,\mathrm{d}u)-\beta(\gamma)\bigg{)}$		(4.3)
		$\displaystyle=\sup_{\gamma\in\mathcal{M}:\beta(\gamma)\leq b}\bigg{(}\int_{0}^{\delta}\overline{\text{AVaR}}_{u}(f)\gamma(\,\mathrm{d}u)-\beta(\gamma)\bigg{)}.$

Note that going from the definition in (4.1) to the expression in (4.3) requires interchanging the infimum and the supremum, and then the infimum and the integral. Since the supremum is taken over a compact set and the function $(r,\gamma)\mapsto\int_{0}^{\delta}\text{AVaR}_{u}(f(r,S))\gamma(du)-\beta(\gamma)$ is continuous in $r$ and upper semi–continuous and concave in $\gamma$ , by Fan’s minimax theorem, see [26, Theorem 2], we can interchange the supremum and the infimum. To interchange the infimum and the integral, we apply Rockafellar’s interchange theorem [55]. Thus, using Equations (2.10) and (4.3), it holds that

|\rho(f)-\rho^{\delta}(f)|^{2}\leq\sup_{\gamma\in\mathcal{M}:\beta(\gamma)\leq b}\int_{\delta}^{1}|\overline{\text{AVaR}}_{u}(f)|^{2}\gamma(\,\mathrm{d}u).

Let us partition the interval $[\delta,1)$ into $\bigcup_{n\geq 1}I_{n}$ , where $I_{n}=\big{[}\delta+(1-2^{-n+1})(1-\delta),\delta+(1-2^{-n})(1-\delta)\big{)}$ for every $n\geq 1$ . Defining

\Gamma_{b}(I_{n}):=\sup_{\gamma\in\mathcal{M}\text{ s.t }\beta(\gamma)\leq b}\gamma(I_{n})

we obtain the estimate

\displaystyle|\bar{\rho}(f)-\rho^{\delta}(f)|^{2}

\displaystyle\leq\sum_{n\geq 1}\Gamma_{b}(I_{n})\sup_{u\in I_{n}}|\overline{\text{AVaR}}_{u}(f)|^{2}.

(4.4)

Now by [4, Lemma 4.5 and Lemma 4.3], we have

\Gamma_{b}(I_{n})\leq C((1-\delta)2^{-n})^{1/q}\quad\text{and}\quad\left|\overline{\text{AVaR}}_{u}(f)\right|\leq\frac{C}{(1-u)^{1/p}},

(4.5)

for every $p\in(1,\infty)$ . Therefore, for any given $p\in(1,\infty)$ and $n\in\mathbb{N}$ , it holds

\sup_{u\in I_{n}}\left|\overline{\mathrm{AVaR}}_{u}(f)\right|^{2}\leq\frac{C}{((1-\delta)2^{-n})^{2/p}}.

(4.6)

Choosing $p=4q$ and using (4.4), (4.5) and (4.6), we have

	$\displaystyle\|\bar{\rho}(f)-\rho^{\delta}(f)\|^{2}$	$\displaystyle\leq\sum_{n\geq 1}((1-\delta)2^{-n})^{1/q}\frac{C}{((1-\delta)2^{-n})^{2/p}}$
		$\displaystyle\leq C(1-\delta)^{1/q-2/p}\sum_{n\geq 1}2^{n(2/p-1/q)}$
		$\displaystyle\leq C(1-\delta)^{1/2q}\sum_{n\geq 1}2^{-n/2q}.$

Since $q\in(1,\infty)$ , it holds $\sum_{n\geq 1}2^{-n/2q}\leq C$ for some universal constant $C$ , and thus

|\bar{\rho}(f)-\rho^{\delta}(f)|^{2}\leq C(1-\delta)^{1/2q}.

(4.7)
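The bookkeeping behind (4.7) can be checked mechanically. The sketch below uses exact rational arithmetic to confirm that the intervals $I_{n}$ tile $[\delta,1)$ with lengths $(1-\delta)2^{-n}$, that the choice $p=4q$ gives the exponent $1/q-2/p=1/(2q)$, and that the series $\sum_{n\geq 1}2^{-n/2q}$ has a finite geometric limit. The values $\delta=3/4$ and $q=3$ are arbitrary illustrative choices.

```python
from fractions import Fraction

def dyadic_interval(n, delta):
    """Endpoints of I_n = [delta + (1 - 2^{-n+1})(1 - delta), delta + (1 - 2^{-n})(1 - delta))."""
    left = delta + (1 - Fraction(1, 2 ** (n - 1))) * (1 - delta)
    right = delta + (1 - Fraction(1, 2 ** n)) * (1 - delta)
    return left, right

delta = Fraction(3, 4)
intervals = [dyadic_interval(n, delta) for n in range(1, 21)]

# Exponent bookkeeping for p = 4q: 1/q - 2/p should equal 1/(2q).
q = Fraction(3)
p = 4 * q
exponent = 1 / q - 2 / p

# Partial sums of sum_{n >= 1} 2^{-n/(2q)} against the geometric-series limit.
ratio = 2.0 ** (-1.0 / (2 * float(q)))
partial_sum = sum(ratio ** n for n in range(1, 200))
limit = ratio / (1.0 - ratio)

print(intervals[:3], exponent, partial_sum, limit)
```

The limit $2^{-1/2q}/(1-2^{-1/2q})$ is finite for every fixed $q$, which is all that the derivation of (4.7) uses.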

For the second term in equation (4.2), we use the error rate for the estimation of AVaR. Let $\mathcal{M}^{R}$ be the set of all random probability measures on $[0,1)$ . We have

	$\displaystyle\mathbb{E}\Big{[}\|\rho^{\delta}(f)-\widetilde{\rho}^{\delta}(f)\|^{2}\Big{]}$	$\displaystyle\leq\mathbb{E}\Big{[}\operatorname*{ess\,sup}_{\gamma\in\mathcal{M}}\int_{0}^{\delta}\|\widetilde{\text{AVaR}}_{u}-\text{AVaR}_{u}\|^{2}\gamma(du)\Big{]}$
		$\displaystyle\leq\mathbb{E}\Big{[}\operatorname*{ess\,sup}_{\gamma\in\mathcal{M}^{R}}\int_{0}^{\delta}\|\widetilde{\text{AVaR}}_{u}-\text{AVaR}_{u}\|^{2}\gamma(du)\Big{]}.$

For each random measure $\gamma^{i}\in\mathcal{M}^{R}$ , define the corresponding random variable $A^{i}:=\int_{0}^{\delta}|\widetilde{\text{AVaR}}_{u}-\text{AVaR}_{u}|\gamma^{i}(du)$ , and let $\mathcal{A}$ be the set of all such random variables. To make use of the error rate for the estimation of AVaR, we first show that the set of random variables $\mathcal{A}$ is directed upward, i.e. for any pair of random variables $A^{i},A^{j}\in\mathcal{A}$ , there exists $\widetilde{A}\in\mathcal{A}$ with $\widetilde{A}\geq\max\{A^{i},A^{j}\}$ . Then, by the following theorem, we can write $\operatorname*{ess\,sup}\mathcal{A}=\lim_{n}A^{n}$ for some increasing sequence $(A^{n})_{n}$ in $\mathcal{A}$ .

Theorem 4.1.

[29] If $\mathcal{A}$ is directed upward, there exists an increasing sequence $A^{1}\leq A^{2}\leq\cdots\in\mathcal{A}$ such that $\operatorname*{ess\,sup}\mathcal{A}=\lim_{n}A^{n}$ $\mathbb{P}$ -almost surely.

To show that $\mathcal{A}$ is directed upward, for any $\gamma^{i},\gamma^{j}\in\mathcal{M}^{R}$ define the event

B:=\Bigg{\{}\omega:\int_{0}^{\delta}|\widetilde{\text{AVaR}}_{u}-\text{AVaR}_{u}|\gamma^{i}(\omega,du)\geq\int_{0}^{\delta}|\widetilde{\text{AVaR}}_{u}-\text{AVaR}_{u}|\gamma^{j}(\omega,du)\Bigg{\}},

(4.8)

and the random measure

\widetilde{\gamma}(\omega,du):=\gamma^{i}(\omega,du)\mathds{1}_{\omega\in B}+\gamma^{j}(\omega,du)\mathds{1}_{\omega\in B^{C}}.

(4.9)

Then, we have $\widetilde{A}:=\int_{0}^{\delta}|\widetilde{\text{AVaR}}_{u}-\text{AVaR}_{u}|\widetilde{\gamma}(\omega,du)\geq\max\{A^{i},A^{j}\}$ and $\widetilde{A}\in\mathcal{A}$ . Therefore, $\mathcal{A}$ is directed upward. By Theorem 4.1, there exists an increasing sequence $(A^{n})_{n}$ in $\mathcal{A}$ with $\operatorname*{ess\,sup}\mathcal{A}=\lim_{n}A^{n}$ $\mathbb{P}$ -almost surely, and we have

	$\displaystyle\mathbb{E}\Big{[}\|\rho^{\delta}(f)-\widetilde{\rho}^{\delta}(f)\|^{2}\Big{]}$	$\displaystyle=\mathbb{E}\Big{[}\lim_{n}\int_{0}^{\delta}\|\widetilde{\text{AVaR}}_{u}-\text{AVaR}_{u}\|^{2}\gamma^{n}(du)\Big{]}$
		$\displaystyle=\lim_{n}\int_{0}^{\delta}\mathbb{E}\Big{[}\|\widetilde{\text{AVaR}}_{u}-\text{AVaR}_{u}\|^{2}\Big{]}\gamma^{n}(du)$
		$\displaystyle\leq\sup_{n\in\mathbb{N}}\int_{0}^{\delta}\mathbb{E}\Big{[}\|\widetilde{\text{AVaR}}_{u}-\text{AVaR}_{u}\|^{2}\Big{]}\gamma^{n}(du)$
		$\displaystyle\leq\sup_{u\in[0,\delta]}\mathbb{E}\Big{[}\|\widetilde{\text{AVaR}}_{u}-\text{AVaR}_{u}\|^{2}\Big{]}\sup_{n\in\mathbb{N}}\int_{0}^{\delta}\gamma^{n}(du),$

where we used the monotone convergence theorem and Fubini’s theorem in the second line. Using Theorem 2.3 and taking the supremum over $u\in(0,\delta)$ , we have the result of the theorem.
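The pasting construction in (4.8) and (4.9), used to show that $\mathcal{A}$ is directed upward, can be illustrated on a finite toy model. The sketch below assumes a finite set of scenarios $\omega$ and a finite grid of levels $u$ (both hypothetical simplifications), draws two random probability measures and a nonnegative error array at random, and checks that the pasted measure reproduces the pointwise maximum of the two integrals. The array names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_omega, n_levels = 6, 4   # finite scenario set and finite grid of levels (assumptions)

# Stand-in for |AVaR~_u - AVaR_u| evaluated at each scenario and level.
err = rng.random((n_omega, n_levels))

# Two random probability measures on the level grid, one row per scenario.
gamma_i = rng.dirichlet(np.ones(n_levels), size=n_omega)
gamma_j = rng.dirichlet(np.ones(n_levels), size=n_omega)

A_i = (err * gamma_i).sum(axis=1)
A_j = (err * gamma_j).sum(axis=1)

# Event B from (4.8) and the pasted measure from (4.9).
B = A_i >= A_j
gamma_tilde = np.where(B[:, None], gamma_i, gamma_j)
A_tilde = (err * gamma_tilde).sum(axis=1)

print(A_tilde, np.maximum(A_i, A_j))
```

Each row of `gamma_tilde` is again a probability vector, and `A_tilde` agrees with the scenario-wise maximum of `A_i` and `A_j`, which is the directedness property in this discrete setting.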

5 Appendix

Here we give explicit formulas for constants in the proof of Theorem 2.3.

$\displaystyle C^{{}^{\prime}1}_{(u,t,\lambda)}$	$\displaystyle=C\bigg{(}\Big{(}\frac{1}{1-u}\Big{)}^{2}+1\bigg{)}\frac{1}{t}\bigg{(}2^{1/t}+\Big{(}\frac{1}{(1-u)^{4}}+\frac{C_{d}}{\lambda^{2}}\Big{)}\cdot 2^{1/t}+1\bigg{)}\exp\bigg{(}C\frac{1}{t}\Big{(}\frac{1}{(1-u)^{2}}+1\Big{)}\bigg{)}$	(5.1)
$\displaystyle C^{{}^{\prime}2}_{(u,t,\lambda)}$	$\displaystyle=C\bigg{(}3^{1/t}+\Big{(}\frac{1}{(1-u)^{4}}+1+\frac{C_{d}}{\lambda^{2}}\Big{)}\cdot 3^{1/t}+1\bigg{)}$	(5.2)
$\displaystyle C^{{}^{\prime}3}_{(u,t)}$	$\displaystyle=C\frac{1}{(1-u)^{4}}\frac{1}{t}\exp\bigg{(}\frac{C}{t}\Big{(}\frac{1}{1-u}+1\Big{)}^{4}\bigg{)}$	(5.3)
$\displaystyle C^{{}^{\prime}4}_{(u,t)}$	$\displaystyle=C\frac{1}{(1-u)^{2}}\bigg{(}2^{1/t}+\Big{(}\frac{1}{(1-u)^{4}}+\frac{1}{\lambda^{2}}C_{d}\Big{)}\cdot 2^{1/t}\bigg{)}$	(5.4)
$\displaystyle C^{{}^{\prime}5}_{(u,t,\lambda)}$	$\displaystyle=\bigg{(}2^{1/t}+\Big{(}\frac{1}{(1-u)^{4}}+\frac{1}{\lambda^{2}}C_{d}\Big{)}\cdot 2^{1/t}\bigg{)}$	(5.5)
$\displaystyle C^{{}^{\prime}6}_{(u)}$	$\displaystyle=C\frac{1}{(1-u)^{4}}$	(5.6)
$\displaystyle C^{{}^{\prime}7}_{(u,t,\lambda)}$	$\displaystyle=C\bigg{(}2^{1/t}+\Big{(}\frac{1}{(1-u)^{4}}+\frac{1}{\lambda^{2}}C_{d}\Big{)}\cdot 2^{1/t}+\Big{(}1+\frac{1}{t(1-u)^{4}}+\frac{1}{t\lambda^{2}}e^{1/t}\Big{)}^{1/2}\bigg{)}$	(5.7)
$\displaystyle C^{{}^{\prime}8}_{(u,t,\lambda)}$	$\displaystyle=C\bigg{(}\frac{1}{\lambda(\frac{1}{1-u})^{2}}+\frac{1}{\lambda(\frac{1}{1-u})^{2}}\Big{(}1+\frac{1}{t(1-u)^{4}}+\frac{1}{t^{2}\lambda^{2}}\Big{)}^{1/2}e^{Ct}\bigg{)}$	(5.8)
$\displaystyle C^{4}_{(u,\lambda)}$	$\displaystyle=\frac{C_{(\lambda)}}{(1-u)^{2}}\mathcal{W}^{2}_{1}(\delta_{z},\mu^{\lambda}_{\infty})$	(5.9)
$\displaystyle C^{5}_{(\lambda)}$	$\displaystyle=2C_{(\lambda)}$	(5.10)
$\displaystyle C^{{}^{\prime}9}_{(u,t,\lambda)}$	$\displaystyle=(C^{{}^{\prime}10})^{2}+C\bigg{(}1+\frac{1}{t(1-u)^{4}}+\frac{1}{t^{2}\lambda^{2}}\bigg{)}e^{C/t}$	(5.11)
$\displaystyle C^{{}^{\prime}10}_{(u,\lambda)}$	$\displaystyle=\int\|x\|^{2}e^{C\lambda+2\lambda\|x\|/(1-u)-\|x\|^{2}/2}\,\mathrm{d}x$	(5.12)
$\displaystyle C^{6}_{(u)}$	$\displaystyle=C\Big{(}\log\sqrt{2\pi(1-u)}+1\Big{)}^{2}$	(5.13)
$\displaystyle C^{1}_{(u,t,\lambda)}$	$\displaystyle=1+C^{{}^{\prime}4}_{(u,t)}+C^{{}^{\prime}8}_{(u,t,\lambda)}$	(5.14)
$\displaystyle C^{2}_{(u,t,\lambda)}$	$\displaystyle=C^{{}^{\prime}2}_{(u,t,\lambda)}+C^{{}^{\prime}5}_{(u,t,\lambda)}+C^{{}^{\prime}7}_{(u,t,\lambda)}+C^{{}^{\prime}11}_{(u,t,\lambda)}+C$	(5.15)
$\displaystyle C^{3}_{(u,t,\lambda)}$	$\displaystyle=C^{{}^{\prime}1}_{(u,t,\lambda)}+C^{{}^{\prime}3}_{(u,t)}+C^{{}^{\prime}6}_{(u)}$	(5.16)
$\displaystyle C^{7}_{(\delta,t,\lambda)}$	$\displaystyle=C\frac{1}{(1-\delta)^{2}}\bigg{(}2^{1/t}+\Big{(}\frac{1}{(1-\delta)^{4}}+\frac{1}{\lambda^{2}}C_{d}\Big{)}\cdot 2^{1/t}\bigg{)}+C\bigg{(}\frac{1}{\lambda}+\frac{1}{\lambda}\Big{(}1+\frac{1}{t(1-\delta)^{4}}+\frac{1}{t^{2}\lambda^{2}}\Big{)}^{1/2}e^{Ct}\bigg{)}.$	(5.17)

References

Ahmadi-Javid [2012] A. Ahmadi-Javid. Entropic value-at-risk: A new coherent risk measure. J Optimiz Theory App, 155(3):1105–1123, 2012.
Allen-Zhu [2018] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than sgd. Advances in neural information processing systems, 31, 2018.
Artzner et al. [1999] Philippe Artzner, Freddy Delbaen, Jean Marc Eber, and David Heath. Coherent measures of risk. Math. Finance, 9:203–228, 1999.
Bartl and Tangpi [2022] Daniel Bartl and Ludovic Tangpi. Nonasymptotic convergence rates for the plug-in estimation of risk measures. Mathematics of Operations Research, 2022.
Bartl et al. [2020] Daniel Bartl, Samuel Drapeau, and Ludovic Tangpi. Computational aspects of robust optimized certainty equivalents and option pricing. Math. Finance, 30:287–309, 2020.
Belomestny and Krätschmer [2012] Denis Belomestny and Volker Krätschmer. Central limit theorems for law-invariant coherent risk measures. Journal of Applied Probability, 49(1):1–21, 2012.
Bhardwaj [2019] Chandrasekaran Anirudh Bhardwaj. Adaptively preconditioned stochastic gradient langevin dynamics. arXiv preprint arXiv:1906.04324, 2019.
Bolley et al. [2012] François Bolley, Ivan Gentil, and Armand Guillin. Convergence to equilibrium in Wasserstein distance for Fokker–Planck equations. J. Funct. Anal., 263(8):2430–2457, 2012.
Bühler et al. [2019] Hans Bühler, Lukas Gonon, Josef Teichmann, and Ben Wood. Deep hedging. Quantitative Finance, 19(8):1271–1291, 2019.
Butler and Schachter [1997] JS Butler and Barry Schachter. Estimating value-at-risk with a precision measure by combining kernel estimation with historical simulation. Rev. Deriv. Res., 1:371–390, 1997.
Carmona and Delarue [2018] René Carmona and François Delarue. Probabilistic theory of mean field games with applications. I, volume 83 of Probability Theory and Stochastic Modelling. Springer, Cham, 2018. ISBN 978-3-319-56437-1; 978-3-319-58920-6. Mean field FBSDEs, control, and games.
Chen and Li [1989] Mu-Fa Chen and Shao-Fu Li. Coupling methods for multidimensional diffusion processes. The Ann. Probab., pages 151–177, 1989.
Chen [2008] Song Xi Chen. Nonparametric estimation of expected shortfall. Journal of Financial Econometrics, 6(1):87–107, 2008.
Chen et al. [2014] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient hamiltonian monte carlo. In International conference on machine learning, pages 1683–1691. PMLR, 2014.
Cheridito and Li [2009] Patrick Cheridito and Tianhui Li. Risk measures on Orlicz hearts. Math. Finance, 19(2):189–214, 2009.
Cheridito et al. [2006] Patrick Cheridito, Freddy Delbaen, and Michael Kupper. Dynamic monetary risk measures for bounded discrete-time processes. Electron. J. Probab., 11(3):57–106, 2006.
Cont et al. [2010] Rama Cont, Romain Deguest, and Giacomo Scandolo. Robustness and sensitivity analysis of risk measurement procedures. Quantitative finance, 10(6):593–606, 2010.
Delbaen [2012] Freddy Delbaen. Monetary Utility Functions. Osaka University Press, 2012.
Deng et al. [2020] Wei Deng, Guang Lin, and Faming Liang. A contour stochastic gradient langevin dynamics algorithm for simulations of multi-modal distributions. Advances in neural information processing systems, 33:15725–15736, 2020.
Djellout et al. [2004] H. Djellout, A. Guillin, and L. Wu. Transportation cost-information inequalities and applications to random dynamical systems and diffusions. The Annals of Probability, 32(3B):2702–2732, 2004.
Durmus et al. [2019] Alain Durmus, Szymon Majewski, and Blażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. The Journal of Machine Learning Research, 20(1):2666–2711, 2019.
Eberle [2011] Andreas Eberle. Reflection coupling and wasserstein contractivity without convexity. C. R. Math, 349(19-20):1101–1104, 2011.
Eberle [2016] Andreas Eberle. Reflection couplings and contraction rates for diffusions. Probab. Theory Relat. Fields, 166(3):851–886, 2016.
Eckstein and Kupper [2021] Stephan Eckstein and Michael Kupper. Computation of optimal transport and related hedging problems via penalization and neural networks. Applied Mathematics & Optimization, 83:639–667, 2021.
Fan and Gu [2003] Jianqing Fan and Juan Gu. Semiparametric estimation of value at risk. Econom. J., 6(2):261–290, 2003.
Fan [1953] Ky Fan. Minimax theorems. PNAS USA, 39(1):42, 1953.
Föllmer and Knispel [2011] Hans Föllmer and T. Knispel. Entropic risk measures: Coherence vs. convexity, model ambiguity and robust large deviations. Stoch. Dyn., 11(02n03):333–351, 2011.
Föllmer and Schied [2002] Hans Föllmer and Alexander Schied. Convex measures of risk and trading constraint. Finance Stoch., 6(4):429–447, 2002.
Föllmer and Schied [2004] Hans Föllmer and Alexander Schied. Stochastic Finance. An Introduction in Discrete Time. de Gruyter Studies in Mathematics. Walter de Gruyter, Berlin, New York, 2nd edition, 2004.
Föllmer and Schied [2016] Hans Föllmer and Alexander Schied. Stochastic finance. de Gruyter, 2016.
Frittelli and Gianin [2005] Marco Frittelli and Emanuela Rosazza Gianin. Law invariant convex risk measures. Advances in Mathematical Economics, pages 33–46, 2005.
Frittelli and Rosazza Gianin [2002] Marco Frittelli and Emanuela Rosazza Gianin. Putting order in risk measures. Journal of Banking & Finance, 26(7):1473–1486, July 2002.
Gelfand and Mitter [1991] Saul B Gelfand and Sanjoy K Mitter. Recursive stochastic algorithms for global optimization in $\mathbb{R}^{d}$. SIAM J Control Optim, 29(5):999–1018, 1991.
Glasserman et al. [2000] Paul Glasserman, Philip Heidelberger, and Perwez Shahabuddin. Variance reduction techniques for estimating value-at-risk. Manag. Sci, 46(10):1349–1364, 2000.
Holte [2009] John M Holte. Discrete Gronwall lemma and applications. In MAA-NCS meeting at the University of North Dakota, volume 24, pages 1–7, 2009.
Hong and Liu [2011] L. Jeff Hong and Guangwu Liu. Monte Carlo estimation of value-at-risk, conditional value-at-risk and their sensitivities. In Proceedings of the 2011 Winter Simulation Conference (WSC). IEEE, 2011.
Hoogerheide and van Dijk [2010] Lennart Hoogerheide and Herman K van Dijk. Bayesian forecasting of value at risk and expected shortfall using adaptive importance sampling. Int. J. Forecast., 26(2):231–247, 2010.
Hu et al. [2020] Kaitong Hu, Shenjie Ren, David Siska, and Lukasz Szpruch. Mean–field Langevin dynamics and energy landscape of neural networks. Annales de l’Institut Henri Poincaré (B) Probabilités et Statistiques, to appear, 2020.
Hwang [1980] C. R. Hwang. Laplace’s method revisited: weak convergence of probability measures. Annals of Probability, 8(6):1177–1182, 1980.
Iyengar and Ma [2013] Garud Iyengar and Alfred Ka Chun Ma. Fast gradient descent method for mean-CVaR optimization. Ann. Oper. Res., 205(1):203–212, 2013.
Jouini et al. [2006] Elyès Jouini, Walter Schachermayer, and Nizar Touzi. Law invariant risk measures have the Fatou property. In Shigeo Kusuoka and Akira Yamazaki, editors, Advances in Mathematical Economics, volume 9 of Advances in Mathematical Economics, pages 49–71. Springer Japan, 2006.
Kloeden and Platen [2013] P. E. Kloeden and E. Platen. Numerical solution of stochastic differential equations. Springer Science and Business Media, 2013.
Krätschmer et al. [2014] Volker Krätschmer, Alexander Schied, and Henryk Zähle. Comparative and quantitative robustness for law-invariant risk measures. Finance Stoch., 2014.
Kupper and Schachermayer [2009] Michael Kupper and Walter Schachermayer. Representation results for law invariant time consistent functions. Math. Financ. Econ., 2(3):189–210, 2009.
Kusuoka [2001] Shigeo Kusuoka. On law invariant coherent risk measures. In Advances in mathematical economics, pages 83–95. Springer, 2001.
Lacker et al. [2020] D. Lacker, M. Shkolnikov, and J. Zhang. Inverting the Markovian projection, with an application to local stochastic volatility models. Annals of Probability, 48(5):2189–2211, 2020.
McNeil et al. [2015] Alexander J. McNeil, Rüdiger Frey, and Paul Embrechts. Quantitative Risk Management. Princeton University Press, 2015.
Nesterov [2005] Yu Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152, 2005.
Basel Committee on Banking Supervision [2014] Basel Committee on Banking Supervision. Fundamental review of the trading book: A revised market risk framework. Technical report, Bank for International Settlements, 2014.
Pichler and Schlotter [2020] Alois Pichler and R. Schlotter. Entropic based risk measures. European Journal of Operational Research, 285(1):223–236, 2020.
Pohl et al. [2020] Mathias Pohl, Alexander Ristig, Walter Schachermayer, and Ludovic Tangpi. Theoretical and empirical analysis of trading activity. Math. Program., 181(2):405–434, 2020.
Raginsky et al. [2017] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In COLT, pages 1674–1703. PMLR, 2017.
Reppen and Soner [2023] Anders Max Reppen and Halil Mete Soner. Deep empirical risk minimization in finance: looking into the future. Mathematical Finance, 33(1):116–145, 2023.
Rockafellar [1968] Ralph Rockafellar. Integrals which are convex functionals. Pacific journal of mathematics, 24(3):525–539, 1968.
Sabanis and Zhang [2020] Sotirios Sabanis and Ying Zhang. A fully data-driven approach to minimizing CVaR for portfolio of assets via SGLD with discontinuous updating. arXiv preprint arXiv:2007.01672, 2020.
Sekimoto [2010] Ken Sekimoto. Stochastic energetics, volume 799. Springer, 2010.
Soma and Yoshida [2020] Tasuku Soma and Yuichi Yoshida. Statistical learning with conditional value at risk. arXiv preprint arXiv:2002.05826, 2020.
Tamar et al. [2015] Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In AAAI-15, 2015.
Villani [2009] C. Villani. Optimal Transport. Old and New, volume 338 of Grundlehren der mathematischen Wissenschaften. Springer, 2009.
Weber [2007] Stefan Weber. Distribution-invariant risk measures, Entropy, and large deviations. J. Appl. Prob., 44:16–40, 2007.
Welling and Teh [2011] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
Yeung [2008] R. W. Yeung. Information theory and network coding. Springer Science and Business Media, 2008.
Zhu and Zhou [2015] Helin Zhu and Enlu Zhou. Estimation of conditional value-at-risk for input uncertainty with budget allocation. In Proc. 2015 WSC, pages 655–666. IEEE, 2015.

\begin{align*}
\mathbb{E}\bigg[\Big\|\frac{1}{N}\sum_{n=1}^{N}\overline{\ell}(\overline{Z}^{\lambda,n}_{M,h})-\overline{\text{AVaR}}_{u}(f)\Big\|^{2}\bigg]
&\leq 2^{6}\,\mathbb{E}\bigg[\Big\|\frac{1}{N}\sum_{n=1}^{N}\overline{\ell}(\overline{Z}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}\ell(\widetilde{Z'}^{\lambda,n}_{M,h})\Big\|^{2}\bigg]\\
&\quad+2^{6}\,\mathbb{E}\bigg[\Big\|\frac{1}{N}\sum_{n=1}^{N}\ell(\widetilde{Z'}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}L(\widetilde{Z}^{\lambda,n}_{M,h})\Big\|^{2}\bigg]\\
&\quad+2^{6}\,\mathbb{E}\bigg[\Big\|\frac{1}{N}\sum_{n=1}^{N}L(\widetilde{Z}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}L(\widehat{Z}^{\lambda,n}_{t})\Big\|^{2}\bigg]\\
&\quad+2^{6}\,\mathbb{E}\bigg[\Big\|\frac{1}{N}\sum_{n=1}^{N}L(\widehat{Z}^{\lambda,n}_{t})-\mathbb{E}[L(Z_{t}^{\lambda})]\Big\|^{2}\bigg]\\
&\quad+2^{6}\Big\|\mathbb{E}[L(Z^{\lambda}_{t})]-\int_{\mathbb{R}^{d}}L\,\mathrm{d}\mu^{\lambda}_{\infty}\Big\|^{2}\\
&\quad+2^{6}\Big\|\int_{\mathbb{R}^{d}}L\,\mathrm{d}\mu_{\infty}^{\lambda}-\int_{\mathbb{R}^{d}}\widetilde{L}\,\mathrm{d}\mu_{\infty}^{\lambda}\Big\|^{2}\\
&\quad+2^{6}\Big\|\int_{\mathbb{R}^{d}}\widetilde{L}\,\mathrm{d}\mu_{\infty}^{\lambda}-\overline{\text{AVaR}}_{u}(f)\Big\|^{2}. \tag{3.5}
\end{align*}
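The factor $2^{6}$ in front of each of the seven terms in (3.5) can be justified by an elementary inequality. The sketch below is one way to see it and is not taken from the source; the labels $a_{1},\dots,a_{7}$ are hypothetical names for the seven successive differences whose sum telescopes to the left-hand side. Applying $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$ six times, peeling off one term at a time, gives

$$\Big\|\sum_{i=1}^{7}a_{i}\Big\|^{2}\leq\sum_{i=1}^{6}2^{i}\|a_{i}\|^{2}+2^{6}\|a_{7}\|^{2}\leq 2^{6}\sum_{i=1}^{7}\|a_{i}\|^{2}.$$

The Cauchy–Schwarz inequality would give the smaller constant $7$ in place of $2^{6}$; since only the rate in $N$, $M$, $h$ and $\lambda$ matters, either constant serves.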

\begin{align*}
\mathbb{E}\Big[\big\|\overline{Z}^{\lambda,n}_{m,h}\big\|^{4}\Big]
&=\mathbb{E}\Big[\big\|\overline{Z}^{\lambda,n}_{m-1,h}-h\nabla\overline{\ell}(\overline{Z}^{\lambda,n}_{m-1,h})+\sqrt{2\lambda^{-1}}\Delta\widetilde{W_{h}}\big\|^{4}\Big]\\
&\leq C\bigg(\mathbb{E}\Big[\big\|\overline{Z}^{\lambda,n}_{m-1,h}\big\|^{4}\Big]+h^{4}\mathbb{E}\Big[\big\|\nabla\overline{\ell}(\overline{Z}^{\lambda,n}_{m-1,h})\big\|^{4}\Big]+\frac{h^{2}}{\lambda^{2}}\Big(\frac{d}{2}+1\Big)\Big(\frac{d}{2}\Big)\bigg)\\
&\leq C\bigg(\mathbb{E}\Big[\big\|\overline{Z}^{\lambda,n}_{m-1,h}\big\|^{4}\Big]+h^{4}\mathbb{E}\Big[\big\|\nabla\ell(\overline{Z}^{\lambda,n}_{m-1,h})\big\|^{4}\Big]+h^{4}\mathbb{E}\Big[\big\|2\big(\overline{Z}^{\lambda,n}_{m-1,h}-A(\overline{Z}^{\lambda,n}_{m-1,h})\big)\big\|^{4}\Big]+\frac{h^{2}}{\lambda^{2}}C_{d}\bigg)\\
&\leq C\bigg(\mathbb{E}\Big[\big\|\overline{Z}^{\lambda,n}_{m-1,h}\big\|^{4}\Big]+h^{4}\mathbb{E}\bigg[\Big\|\frac{2}{1-u}+\gamma\overline{Z}^{\lambda,n}_{m-1,h}\Big\|^{4}\bigg]+h^{4}\mathbb{E}\Big[\big\|\overline{Z}^{\lambda,n}_{m-1,h}\big\|^{4}\Big]+h^{4}\mathbb{E}\Big[\big\|A(\overline{Z}^{\lambda,n}_{m-1,h})\big\|^{4}\Big]+\frac{h^{2}}{\lambda^{2}}C_{d}\bigg)\\
&\leq C\bigg((1+\gamma^{4}h^{4}+h^{4})\,\mathbb{E}\Big[\big\|\overline{Z}^{\lambda,n}_{m-1,h}\big\|^{4}\Big]+\frac{h^{4}}{(1-u)^{4}}+h^{4}+\frac{h^{2}}{\lambda^{2}}C_{d}\bigg),
\end{align*}
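The noise contribution $\frac{h^{2}}{\lambda^{2}}\big(\frac{d}{2}+1\big)\big(\frac{d}{2}\big)$ in the bound above is, up to the constant $C$, the fourth moment of the scaled Gaussian increment. The following sketch checks this numerically under the assumption, not stated in this excerpt, that $\Delta\widetilde{W_{h}}\sim N(0,hI_{d})$. It uses the standard identity $\mathbb{E}\|\xi\|^{4}=d(d+2)$ for $\xi\sim N(0,I_{d})$; both function names are hypothetical.

```python
import numpy as np


def gaussian_increment_fourth_moment(d, h, lam):
    """Closed form of E||sqrt(2/lam) * dW||^4, assuming dW ~ N(0, h * I_d)."""
    # E||xi||^4 = d(d+2) for xi ~ N(0, I_d), so the scaled increment gives
    # (2h/lam)^2 * d(d+2) = 16 * (h^2/lam^2) * (d/2 + 1) * (d/2).
    return 16.0 * (h**2 / lam**2) * (d / 2 + 1) * (d / 2)


def monte_carlo_fourth_moment(d, h, lam, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the same quantity, for comparison."""
    rng = np.random.default_rng(seed)
    dW = np.sqrt(h) * rng.standard_normal((n_samples, d))  # assumed N(0, h I_d)
    incr = np.sqrt(2.0 / lam) * dW
    # ||incr||^4 is the squared Euclidean norm, squared once more.
    return np.mean(np.sum(incr**2, axis=1) ** 2)


exact = gaussian_increment_fourth_moment(d=3, h=0.1, lam=2.0)
estimate = monte_carlo_fourth_moment(d=3, h=0.1, lam=2.0)
```

The two values should agree up to Monte Carlo error, which illustrates why the term scales as $h^{2}/\lambda^{2}$ with a purely dimensional constant $C_{d}$.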

\begin{align*}
&\mathbb{E}\bigg[\Big\|\frac{1}{N}\sum_{n=1}^{N}\ell(\widetilde{Z'}^{\lambda,n}_{M,h})-\frac{1}{N}\sum_{n=1}^{N}L(\widetilde{Z}^{\lambda,n}_{M,h})\Big\|^{2}\bigg]\\
&\quad=\mathbb{E}\Bigg[\bigg\|\frac{1}{N}\sum_{n=1}^{N}\Big\{\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})+\frac{\gamma}{2}\big\|\widetilde{Z}^{\lambda,n}_{M,h}\big\|^{2}-\widetilde{\ell}(\widetilde{Z'}^{\lambda,n}_{M,h})-\frac{\gamma}{2}\big\|\widetilde{Z'}^{\lambda,n}_{M,h}\big\|^{2}\Big\}\bigg\|^{2}\Bigg]\\
&\quad\leq C\bigg(\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\Big[\big(\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})-\widetilde{\ell}(\widetilde{Z'}^{\lambda,n}_{M,h})\big)^{2}\Big]+\frac{\gamma^{2}}{N}\sum_{n=1}^{N}\mathbb{E}\Big[\big(\big\|\widetilde{Z}^{\lambda,n}_{M,h}\big\|^{2}-\big\|\widetilde{Z'}^{\lambda,n}_{M,h}\big\|^{2}\big)^{2}\Big]\bigg)\\
&\quad\leq\frac{C}{N}\sum_{n=1}^{N}\mathbb{E}\Big[\big(\widetilde{L}(\widetilde{Z}^{\lambda,n}_{M,h})-\widetilde{\ell}(\widetilde{Z'}^{\lambda,n}_{M,h})\big)^{2}\Big]+C\frac{\gamma^{2}}{N}\sum_{n=1}^{N}\Big(\mathbb{E}\Big[\big\|\widetilde{Z}^{\lambda,n}_{M,h}\big\|^{4}\Big]+\mathbb{E}\Big[\big\|\widetilde{Z'}^{\lambda,n}_{M,h}\big\|^{4}\Big]\Big), \tag{3.13}
\end{align*}

\begin{align*}
\mathbb{E}\Big[\big\|\widetilde{Z}^{\lambda,n}_{m,h}\big\|^{4}\Big]
&=\mathbb{E}\Big[\big\|\widetilde{Z}^{\lambda,n}_{m-1,h}-h\nabla L(\widetilde{Z}^{\lambda,n}_{m-1,h})+\sqrt{2\lambda^{-1}}\Delta\widetilde{W_{h}}\big\|^{4}\Big]\\
&\leq C\bigg(\mathbb{E}\Big[\big\|\widetilde{Z}^{\lambda,n}_{m-1,h}\big\|^{4}\Big]+h^{4}\mathbb{E}\Big[\big\|\nabla L(\widetilde{Z}^{\lambda,n}_{m-1,h})\big\|^{4}\Big]+\frac{h^{2}}{\lambda^{2}}C_{d}\bigg)\\
&\leq C\bigg(\mathbb{E}\Big[\big\|\widetilde{Z}^{\lambda,n}_{m-1,h}\big\|^{4}\Big]+h^{4}\mathbb{E}\bigg[\Big\|\frac{2}{1-u}+\gamma\widetilde{Z}^{\lambda,n}_{m-1,h}\Big\|^{4}\bigg]+\frac{h^{2}}{\lambda^{2}}C_{d}\bigg)\\
&\leq C\bigg(\mathbb{E}\Big[\big\|\widetilde{Z}^{\lambda,n}_{m-1,h}\big\|^{4}\Big]+\frac{h^{4}}{(1-u)^{4}}+\gamma^{4}h^{4}\,\mathbb{E}\Big[\big\|\widetilde{Z}^{\lambda,n}_{m-1,h}\big\|^{4}\Big]+\frac{h^{2}}{\lambda^{2}}C_{d}\bigg)\\
&\leq C\bigg((1+\gamma^{4}h^{4})\,\mathbb{E}\Big[\big\|\widetilde{Z}^{\lambda,n}_{m-1,h}\big\|^{4}\Big]+\frac{h^{4}}{(1-u)^{4}}+\frac{h^{2}}{\lambda^{2}}C_{d}\bigg),
\end{align*}
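The first line of the display above is the Euler-type update $\widetilde{Z}_{m,h}=\widetilde{Z}_{m-1,h}-h\nabla L(\widetilde{Z}_{m-1,h})+\sqrt{2\lambda^{-1}}\Delta\widetilde{W_{h}}$. A minimal sketch of that recursion follows, assuming $\Delta\widetilde{W_{h}}\sim N(0,hI_{d})$ and a user-supplied gradient. The names `sgld_step`, `run_chain` and `toy_grad` are hypothetical, and the quadratic toy objective is only a stand-in for $L$, not the objective analysed here.

```python
import numpy as np


def sgld_step(z, grad, h, lam, rng):
    """One update z - h * grad(z) + sqrt(2/lam) * dW, with dW ~ N(0, h * I) (assumed)."""
    dW = np.sqrt(h) * rng.standard_normal(z.shape)
    return z - h * grad(z) + np.sqrt(2.0 / lam) * dW


def run_chain(z0, grad, h, lam, n_steps, seed=0):
    """Iterate sgld_step n_steps times from z0 and return the final iterate."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    for _ in range(n_steps):
        z = sgld_step(z, grad, h, lam, rng)
    return z


def toy_grad(z):
    # Gradient of the toy objective 0.5 * ||z - 1||^2 (illustration only).
    return z - 1.0


# With a very large inverse temperature lam the noise is negligible and the
# chain behaves like plain gradient descent towards the minimiser z = 1.
z_final = run_chain(np.zeros(2), toy_grad, h=0.1, lam=1e12, n_steps=500)
```

Smaller values of `lam` inject more noise per step, which is the regime where the $\frac{h^{2}}{\lambda^{2}}C_{d}$ term in the moment bound becomes visible.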

\begin{align*}
&\mathbb{E}\Big[\big\|\nabla L(\widetilde{Z}^{\lambda,n}_{M,h})-\nabla\ell(\widetilde{Z'}^{\lambda,n}_{M,h})\big\|^{4}\Big]\\
&\quad\leq C\,\mathbb{E}\Big[\big\|\nabla L(\widetilde{Z}^{\lambda,n}_{M,h})-\nabla L(\widetilde{Z'}^{\lambda,n}_{M,h})\big\|^{4}\Big]+C\,\mathbb{E}\Big[\big\|\nabla L(\widetilde{Z'}^{\lambda,n}_{M,h})-\nabla\ell(\widetilde{Z'}^{\lambda,n}_{M,h})\big\|^{4}\Big]\\
&\quad\leq C\Big(\frac{1}{1-u}+\gamma\Big)^{4}\mathbb{E}\Big[\big\|\widetilde{Z}^{\lambda,n}_{M,h}-\widetilde{Z'}^{\lambda,n}_{M,h}\big\|^{4}\Big]\\
&\qquad+C\,\mathbb{E}\bigg[\bigg\|\Big(e_{1}+\frac{1}{1-u}\mathbb{E}\Big[\nabla_{r}f(\widetilde{Z'}^{\lambda,n,d-1}_{M,h},S)1_{\{f(\widetilde{Z'}^{\lambda,n,d-1}_{M,h},S)\geq\widetilde{Z'}^{\lambda,n}_{M,h}(1)\}}\,\Big|\,\widetilde{Z'}^{\lambda,n}_{M,h}\Big]\Big) \tag{3.19}\\
&\qquad\qquad-\Big(e_{1}+\frac{1}{1-u}\frac{1}{N}\sum_{i=1}^{N}\nabla_{r}f(\widetilde{Z'}^{\lambda,n,d-1}_{M,h},S^{i})1_{\{f(\widetilde{Z'}^{\lambda,n,d-1}_{M,h},S^{i})\geq\widetilde{Z'}^{\lambda,n}_{M,h}(1)\}}\Big)\bigg\|^{4}\bigg]\\
&\quad\leq C\Big(\frac{1}{1-u}+\gamma\Big)^{4}\mathbb{E}\Big[\big\|\widetilde{Z}^{\lambda,n}_{M,h}-\widetilde{Z'}^{\lambda,n}_{M,h}\big\|^{4}\Big]+C\Big(\frac{1}{1-u}\Big)^{4}, \tag{3.20}
\end{align*}