Reinforcement Learning for the Optimal Dividend Problem under a Diffusion Model
Lihua Bai, Thejani Gamage, Jin Ma, Pengxu Xie
School of Mathematics, Nankai University, Tianjin, 300071, China. Email: lbai@nankai.edu.cn. This author is supported in part by Chinese NSF grants #11931018, #12272274, and #12171257.
Department of Mathematics, University of Southern California, Los Angeles, CA 90089. Email: gamage@usc.edu.
Department of Mathematics, University of Southern California, Los Angeles, CA 90089. Email: jinma@usc.edu. This author is supported in part by NSF grants #DMS-1908665 and 2205972.
School of Mathematics, Nankai University, Tianjin, 300071, China. Email: 1120180026@mail.nankai.edu.cn.
Abstract
In this paper, we study the optimal dividend problem under the continuous time diffusion model with the dividend rate restricted to a given interval. Unlike the standard literature, we shall be particularly interested in the case when the parameters (e.g., drift and diffusion coefficients) of the model are not specified, so that the optimal control cannot be explicitly determined. We therefore follow the recently developed method via Reinforcement Learning (RL) to find the optimal strategy. Specifically, we shall design a corresponding RL-type entropy-regularized exploratory control problem, which randomizes the control actions and balances exploitation and exploration. We shall first carry out a theoretical analysis of the new relaxed control problem and prove that the value function is the unique bounded classical solution to the corresponding HJB equation. We will then use a policy improvement argument, along with policy evaluation devices (e.g., the Temporal Difference (TD)-based algorithm and the Martingale Loss (ML)-based algorithms), to construct approximating sequences of the optimal strategy.
We present some numerical results using different parametrization families for the cost functional to illustrate the effectiveness of the approximation schemes.
Keywords. Optimal dividend problem, entropy-regularized exploratory control problem, policy improvement, policy evaluation, temporal difference (TD) algorithm, martingale loss (ML).
The problem of maximizing the cumulative discounted dividend payment can be traced back to the work of de Finetti [9].
Since then the problem has been widely studied in the literature under different models, and in many cases the problem can be explicitly solved when the model parameters are known. In particular, for the optimal dividend problem and its many variations in continuous time under diffusion models, we refer to the works of, among others, [1, 3, 4, 5, 8, 21, 25] and the references cited therein.
The main motivation of this paper is to study the optimal dividend problems in which the model parameters are not specified so the optimal control cannot be explicitly determined. Following the recently developed method using Reinforcement Learning (RL), we shall try to find the optimal strategy for a corresponding entropy-regularized control problem and solve it along the lines of both policy improvement and evaluation schemes.
The method of using reinforcement learning to solve discrete Markov decision problems has been well studied, but the extension of these concepts to the continuous time and space setting is still fairly new. Roughly speaking, in RL the learning agent uses a sequence of trial and errors to simultaneously identify the environment or the model parameters, and to determine the optimal control and optimal value function. Such a learning process has been characterized by a mixture of
exploration and exploitation, which repeatedly tries new actions to improve the collective outcomes.
A critical point in this process is to balance the exploration and exploitation levels since the former
is usually computationally expensive and time consuming, while the latter may lead to suboptimal outcomes. In RL theory a typical idea to balance exploration and exploitation in an optimal control problem is to
“randomize” the control action and add a (Shannon) entropy term to the cost functional, weighted by a temperature parameter. By maximizing the entropy one encourages exploration, and by decreasing the temperature parameter one gives more weight to exploitation. The resulting optimal control problem is often referred to as the entropy-regularized exploratory optimal control problem, which will be the starting point of our investigation.
As in any reinforcement learning scheme, we shall solve the entropy-regularized exploratory optimal dividend problem via a sequence of Policy Evaluation (PE) and Policy Improvement (PI) procedures. The former evaluates the cost functional for a given policy, and the latter produces a new policy that is “better” than the current one. We note that the idea of Policy Improvement Algorithms (PIA) as well as their convergence analysis is not new in the numerical optimal control literature (see, e.g., [13, 17, 16, 19]). The main difference in the current RL setting is the involvement of the entropy regularization, which causes some technical complications in the convergence analysis. In the continuous time entropy-regularized exploratory control problem with diffusion models a successful convergence analysis of PIA was first established for a particular Linear-Quadratic (LQ) case in [23], in which
the exploratory HJB equation (i.e., the HJB equation corresponding to the entropy-regularized problem) can be directly solved, and the Gaussian nature of the optimal exploratory control is known. A more general case was recently investigated in [12], in which the convergence of PIA is proved in a general infinite horizon setting, without requiring the knowledge of the explicit form of the optimal control. The problem studied in this paper is very close to the one in [12], but not identical. While some of the analysis in this paper benefits from the fact that the spatial variable is one dimensional, there are particular technical subtleties caused by the presence of the ruin time, even though the problem is essentially an infinite horizon one, like the one studied in [12].
There are two main issues that this paper will focus on. The first is to design PE and PI algorithms that are suitable for continuous time optimal dividend problems. We shall follow some of the “popular” schemes in RL, such as the well-understood Temporal Difference (TD) methods, combined with the so-called martingale approach, to design the PE learning procedure.
Two technical points are worth noting: 1) since the cost functional involves ruin time, and the observation of the ruin time of the state process is sometimes practically impossible (especially in the cases where ruin time actually occurs beyond the time horizon we can practically observe), we shall propose algorithms that are insensitive to the ruin time; 2) although the infinite horizon nature of the problem somewhat prevents the use of the so-called “batch” learning method, we shall nevertheless try to study the temporally “truncated” problem so that the batch learning method can be applied. It should also be noted that one of the main difficulties in PE methods is to find an effective parameterization family of functions from which the best approximation for the cost functional is chosen, and the choice of the parameterization family directly affects the accuracy of the approximation. Since there are no proven standard methods of finding a suitable parameterization family, except for the LQ (Gaussian) case when the optimal value function is explicitly known, we shall use the classical “barrier”-type (restricted) optimal dividend strategy in [1] to propose the parametrization family, and carry out numerical experiments using the corresponding families.
The second main issue is the convergence analysis of the PIA. Similar to [12], in this paper we focus on the regularity analysis on the solution to the exploratory HJB equation and some related PDEs.
Compared to the heavy PDE arguments in [12], we shall take advantage of the fact that in this paper the state process is one dimensional and takes nonnegative values, so that some stability arguments for 2-dimensional first-order nonlinear systems can be applied to conclude that the exploratory HJB equation has a concave, bounded classical solution, which coincides with the viscosity solution (of class (L)) of the HJB equation and the value function of the optimal dividend problem.
With the help of these regularity results, we prove the convergence of PIA to the value function along the line of [12] but with a much simpler argument.
The rest of the paper is organized as follows. In §2 we give the preliminary description of the problem and all the necessary notations, definitions, and assumptions. In §3 we study the value function and its regularity, and prove that it is a concave, bounded classical solution to the exploratory HJB equation. In §4 we study the issue of policy update. We shall introduce our PIA and prove its convergence.
In §5 and §6 we discuss the methods of Policy Evaluation, that is, the methods for approximating the cost functional for a given policy, using a martingale loss function based approach and (online) CTD(λ) methods, respectively.
In §7 we propose parametrization families for PE and present numerical experiments using the proposed PI and PE methods.
2 Preliminaries and Problem Formulation
Throughout this paper we consider a filtered probability space on which is defined a standard Brownian motion . We assume that the filtration , with the usual augmentation so that it satisfies the usual conditions. For any metric space with topological Borel sets , we denote to be all -measurable functions, and , , to be the space of -th integrable functions. The spaces
and , , etc., are defined in the usual ways. Furthermore, for a given domain , we denote to be the space of all -th order continuously differentiable functions on , and . In particular, for , we denote to be the space of all bounded and -th continuously differentiable functions on with all derivatives being bounded.
Consider the following simplest diffusion approximation of a Cramer-Lundberg model with dividend:
(2.1)
where is the initial state, and are constants determined by the premium rate and the claim frequency and size (cf., e.g., [1]), and is the dividend rate at time . We denote if necessary, and say that
is admissible if it is -adapted and takes values in
a given “action space” . Furthermore, let us define the ruin time to be . Clearly,
, and the problem is considered “ruined” as no dividend will be paid after . Our aim is to maximize the expected total discounted dividend given the initial condition :
(2.2)
where is the discount rate, and is the set of admissible dividend rates taking values in . The problem (2.1)-(2.2) is often referred to as the classical optimal restricted dividend problem, meaning that the dividend rate is restricted in a given interval .
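For concreteness, the following is a minimal Monte Carlo sketch of the classical problem (2.1)-(2.2): it simulates the surplus process under a threshold (“bang-bang”) dividend strategy that pays at the maximal rate whenever the surplus exceeds a barrier, and estimates the expected discounted dividends up to the ruin time. All numerical values (drift, volatility, maximal rate, discount rate, barrier) are placeholder choices for illustration only, not quantities taken from the paper.

```python
import numpy as np

def discounted_dividends(x0, b, mu=0.5, sigma=1.0, a_max=1.5, c=0.1,
                         dt=0.01, T=50.0, n_paths=10000, seed=0):
    """Monte Carlo estimate of E[ int_0^tau e^{-c t} pi_t dt ] under the
    threshold strategy pi_t = a_max * 1{X_t > b} (Euler scheme, absorbed at 0)."""
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, float(x0))
    alive = np.ones(n_paths, dtype=bool)        # paths that have not been ruined yet
    value = np.zeros(n_paths)
    for i in range(int(T / dt)):
        rate = np.where(x > b, a_max, 0.0)      # bang-bang dividend rate
        value[alive] += np.exp(-c * i * dt) * rate[alive] * dt
        dW = rng.standard_normal(n_paths) * np.sqrt(dt)
        x = np.where(alive, x + (mu - rate) * dt + sigma * dW, x)
        alive &= (x > 0.0)                      # no dividends are paid after ruin
    return value.mean()

print(discounted_dividends(x0=3.0, b=2.0))
```

Varying the barrier b in such a simulation already indicates why the threshold level plays a central role in the parametrizations used later in §7.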
It is well-understood that when the parameters and are known, then the optimal control is of the “feedback” form: , where is the corresponding state process and is a deterministic function taking values in , often in the form of a threshold control (see, e.g., [1]). However, in practice
the exact form of is not implementable since the model parameters are usually not known, thus the “parameter insensitive”
method through Reinforcement Learning (RL) becomes a much more desirable alternative, which we now elaborate.
In the RL formulation, the agent follows a process of exploration and exploitation via a sequence of trial and error evaluation. A key element is to randomize the control action as a probability distribution over ,
similar to the notion of relaxed control in control theory, and the classical control is considered as a special point-mass (or Dirac -measure) case.
To make this idea mathematically precise, let us
denote to be the Borel field on , and to be the space of all probability measure on , endowed with, say, the Wasserstein metric.
A “relaxed control” is a randomized policy defined as a measure-valued progressively measurable process
. Assuming that it has a density, denoted by , , we can write
In what follows we shall often identify a relaxed control with its density process .
Now, for , we define a probability measure on as follows: for and
,
(2.3)
We call a function the “canonical representation” of a relaxed control , if . Then, for we have
(2.4)
We can now derive the exploratory dynamics of the state process along the lines of entropy-regularized relaxed stochastic control arguments (see, e.g., [22]). Roughly speaking, consider the discrete version of the dynamics (2.1): for small ,
(2.5)
Let and be independent samples of under the distribution ,
and the corresponding samples of , respectively.
Then, the law of large numbers and (2.4) imply that
(2.6)
as . This, together with the fact , leads to the following form
of the exploratory version of the state dynamics:
(2.7)
where is the (density of) relaxed control process, and we shall often denote to specify its dependence on control and
the initial state .
To formulate the entropy-regularized optimal dividend problem, we first give a heuristic argument. Similar to (2.6), for large and small we should have
Therefore, in light of [22] we shall define the entropy-regularized cost functional of the optimal expected dividend control problem under the relaxed control as
(2.8)
where ,
, and is the so-called temperature parameter balancing the exploration and exploitation.
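To make the structure of the running reward in (2.8) concrete, the snippet below numerically evaluates, for a Gibbs-type density on a bounded action set [0, M], the quantity “mean dividend rate plus λ times the differential entropy of the policy”. The Gibbs shape and all constants are illustrative assumptions; only the structure (mean reward plus temperature-weighted entropy) mirrors (2.8).

```python
import numpy as np

def entropy_regularized_reward(beta, M=1.5, lam=0.5, n=4000):
    """Mean dividend rate plus lam times the differential entropy of the
    density pi(a) proportional to exp(beta * a) on [0, M] (Riemann sum)."""
    da = M / n
    a = (np.arange(n) + 0.5) * da              # midpoints of the grid on [0, M]
    w = np.exp(beta * a)
    pi = w / (w.sum() * da)                    # normalized density
    mean_dividend = float((a * pi).sum() * da)
    entropy = float(-(pi * np.log(pi)).sum() * da)
    return mean_dividend + lam * entropy

for beta in (-2.0, 0.0, 2.0):                  # from "pay little" to "pay a lot"
    print(round(entropy_regularized_reward(beta), 4))
```

As the temperature increases, flatter (more exploratory) densities become more rewarding, which is exactly the exploration-exploitation trade-off described above.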
We now define the set of open-loop admissible controls as follows.
Definition 2.1.
A measurable (density) process is called an open-loop admissible relaxed control if
1. , for -a.e. ;
2. for each , the process is -progressively measurable;
3. .
We shall denote to be the set of open-loop admissible relaxed controls.
An important type of is of the
“feedback” nature, that is,
for some deterministic function , where
satisfies the SDE:
(2.10)
Definition 2.2.
A function is called a closed-loop admissible relaxed control if, for every ,
1. The SDE admits a unique strong solution ,
2. The process .
We denote to be the set of closed-loop admissible relaxed controls.
The following properties of the value function are straightforward.
Proposition 2.3.
Assume . Then the value function satisfies the following properties:
(1) , if ;
(2) , .
Proof. (1) Let , and . Consider , .
Then, it is readily seen that , for . Thus , proving (1), as is arbitrary.
(2) By definition and by the well-known Kullback-Leibler divergence property. Thus, , and then . On the other hand, since , , is admissible and for , the conclusion follows.
We remark that in optimal dividend control problems it is often assumed that the maximal dividend rate is greater than the average return rate (that is, ), and that the average return of a surplus process ,
including the safety loading, is higher than the interest rate . These, together with Proposition 2.3,
lead to the following standing assumption that
will be used throughout the paper.
Assumption 2.4.
(i) The maximal dividend rate satisfies ; and
(ii) the average return satisfies .
3 The Value Function and Its Regularity
In this section we study the value function of the relaxed control problem (2.9). We note that while most of the results are well-understood, some details still require justification, especially concerning the regularity, due particularly to the non-smoothness of the exit time .
We begin by recalling the Bellman optimality principle (cf. e.g., [24]):
Noting that , we can (formally) argue that satisfies the HJB equation:
(3.1)
Next, by a Lagrange multiplier argument and the calculus of variations (see [10]), we can find the maximizer of the right-hand side of (3.1) and obtain the optimal feedback control, which has the following Gibbs form, assuming all derivatives exist:
(3.2)
where for .
Plugging (3.2) into (3.1), we see that the HJB equation (3.1) becomes the
following second order ODE:
(3.3)
where the function is defined by
(3.4)
The following result regarding the function is important in our discussion.
Proposition 3.1.
The function defined by (3.4) enjoys the following properties:
(1) for all , where , . In particular,
;
(2) the function is convex and has a unique intersection point with , . Moreover, the abscissa value of intersection point .
Proof.
(1) Since the function is an entire function and for , is infinitely differentiable for all . On the other hand, since , by the continuity of , there exists such that , whenever . Thus is infinitely differentiable for as well. Consequently, , whence by extension.
(2) The convexity of the function follows from a direct calculation of for . Define . It is straightforward to show that , , and . Thus for some (unique) , proving (2).
We should note that (3.3) can be viewed as either a boundary value problem of an elliptic PDE with unbounded domain or a second order ODE defined on . But in either case, there is missing information on boundary/initial conditions.
Therefore the well-posedness of the classical solution is actually not trivial.
Let us first consider the equation (3.3) as an ODE defined on . Since the value function is non-decreasing by Proposition 2.3, for the sake of argument let us first consider (3.3) as an ODE with initial condition
and .
By denoting and , we see that (3.3) is equivalent to the following system of first order ODEs: for ,
(3.5)
Here is an entire function. Let us define , , and where .
Then, satisfies the following system of ODEs:
(3.6)
It is easy to check has eigenvalues , with
.
Now, let , where is
such that . Then satisfies
(3.7)
where . Since exists and tends to 0 as , and ,
we can follow the argument of [6, Theorem 13.4.3] to construct a solution
to (3.7) such that for some constant , so that
, as . Consequently, the function is a solution to (3.6)
satisfying , as . In other words, (3.5) has a solution such that as .
We summarize the discussion above as the following result.
Proposition 3.2.
The differential equation (3.3) has a classical solution
that enjoys the following properties:
(i) ;
(ii) and ;
(iii) is increasing and concave.
Proof. Following the discussion preceding the proposition we know that the classical solution to (3.3) satisfying (i) and (ii) exists. We need only check (iii).
To this end, we shall follow
an argument of [20]. Let us first formally differentiate (3.3) to get
, .
Since ,
denoting , we can write
Now, noting Proposition 3.1, we define a change of variables such that for ,
,
and denote , . Since , and , we can define as well. Then we see that,
(3.8)
Since (3.8) is a homogeneous linear ODE, uniqueness implies that , . That is, , , and is (strictly) increasing.
Finally, from (3.8) we see that , also implies that
, . Thus is convex on , and hence would be unbounded unless for all .
This, together with the fact that is a bounded and increasing function, shows that (i.e., )
can only be decreasing and convex, thus (i.e.,
) , proving the concavity of , whence the proposition.
Viscosity Solution of (3.3).
We note that Proposition 3.2 requires that
exists, which is not known a priori. We now consider (3.3) as an elliptic PDE defined on , and argue that it possesses a unique bounded viscosity solution. We will then
identify its value and argue that it must coincide with the classical solution identified in Proposition 3.2.
To begin with, let us first recall the notion of viscosity solution to (3.3). For , we denote
the set of all upper (resp. lower) semicontinuous functions in by USC (resp. LSC).
Definition 3.3.
We say that is a viscosity sub-(resp. super-)solution of (3.3) on , if and for any and such that (resp. ), it holds that
We say that is a viscosity solution of (3.3) on if it is
both a viscosity subsolution
and a viscosity supersolution of (3.3) on .
We first show that both viscosity subsolution and viscosity supersolution to (3.3) exist. To see this, for , consider the following two functions:
(3.9)
where , , are constants satisfying and the following constraints:
(3.12)
Proposition 3.4.
Assume that Assumption 2.4 holds, and let , be defined by (3.9). Then is a viscosity subsolution of (3.3) on , is a viscosity supersolution of (3.3) on . Furthermore, it holds that on .
Proof. We first show that is a viscosity subsolution. To see this, note that and on . By Assumption 2.4, Proposition 3.1, and the fact , we have
To prove that is a supersolution of (3.3), we take the following three steps:
(i) Note that and for all .
Let be the abscissa value of intersection point of and , then , thanks to Proposition 3.1.
Since and (i.e. ), we have and
. Also since and , we have and .
By Assumption 2.4, , and
, we have
for . That is, is a viscosity supersolution of (3.3) on .
(ii) Next, note that and , we see that for , and it follows that is a viscosity supersolution of (3.3) on .
(iii) Finally, for , it is clear that there is no test function satisfying the definition of supersolution. We thus conclude that
is a viscosity supersolution of (3.3) on .
Furthermore, noting Assumption 2.4, and , some direct calculations show that on , proving the proposition.
We now follow Perron’s method to prove the existence of the (bounded) viscosity solution for (3.3). We first recall the following definition (see, e.g., [2]).
Definition 3.5.
A function is said to be of class if
(1) is increasing with respect to on ;
(2) is bounded on .
Now let and be defined by (3.9), and consider the set
(3.13)
Clearly, , so . Define
, ,
and let (resp. ) be the USC (resp. LSC) envelope of , defined respectively by
Theorem 1.
(resp. ) is a viscosity sub-(resp. super-)solution of class to (3.3) on .
Proof. The proof is the same as a similar result in [2]. We omit it here.
Note that by definition we have , . Thus, given Theorem 1, one can derive the existence and uniqueness of the viscosity solution to (3.3) of class (L) by the following comparison principle, which can be argued along the lines of [7, Theorem 5.1]; we omit the proof.
Theorem 2 (Comparison Principle).
Let be a viscosity supersolution and a viscosity subsolution of (3.3), both of class . Then . Consequently, is the unique viscosity solution of class to (3.3) on .
Following our discussion we can easily raise the regularity of the viscosity solution.
Corollary 3.6.
Let be a viscosity solution of class (L) to the
HJB equation (3.1). Then, has a right-derivative , and consequently . Furthermore, is concave and satisfies
and .
Proof. Let be a viscosity solution of class (L) to (3.1). We first claim that exists. Indeed, consider the subsolution and supersolution defined by (3.9). Applying Theorem 2, for any but small enough we have
Sending we obtain that . Since is of class (L), it is increasing; thus exists, and .
Then, it follows from Proposition 3.2 that the ODE (3.3) has a bounded classical solution in
satisfying , and is increasing and concave. Hence it is also a viscosity solution to (3.3) of class (). But by Theorem 2, the bounded viscosity solution to (3.3) of class () is unique, thus the viscosity solution . The rest of the properties are the consequences of Proposition 3.2.
Verification Theorem and Optimal Strategy.
Having argued the well-posedness of ODE (3.3) from both classical and viscosity sense, we now look at its connection to the value function. We have the following Verification Theorem.
Theorem 3.
Assume that Assumption 2.4 is in force. Then, the value function defined in (2.9) is a viscosity solution of class () to the HJB equation (3.3). More precisely, it holds that
(3.14)
where the set is defined by (3.13).
Moreover, coincides with the classical solution of (3.3) described in Proposition 3.2, and the optimal control has the following form:
(3.15)
Proof. The proof that is a viscosity solution satisfying (3.14) is more or less standard (see, e.g., [24]), and Proposition 2.3 shows that must be of class (). It then follows from Corollary 3.6 that exists and is the (unique) classical solution of (3.3).
It remains to show that defined by (3.15) is optimal. To this end, note that , where
. Thus
as , thanks to the concavity of . Consequently .
Finally, since and is defined by (3.15) is obviously the maximizer of the Hamiltonian in HJB equation
(3.1), the optimality of follows from a standard argument via Itô’s formula. We omit it.
4 Policy Update
We now turn to an important step in the RL scheme, that is, the so-called Policy Update. More precisely, we prove a Policy Improvement Theorem which states that for any closed-loop policy , we can construct another , such that . Furthermore, we argue that such a
policy updating procedure can be constructed without using the system parameters, and we shall discuss the convergence of the iterations to the optimal policy.
To begin with, for and
, let be the unique strong solution to the SDE (2.10). For , we consider
the process , . Then is an -Brownian motion, where , . Since the SDE (2.10) is time-homogeneous, the path-wise uniqueness then renders the flow property: , , where satisfies the SDE
(4.1)
Now we denote to be the open-loop strategy induced by the closed-loop control .
Then the corresponding cost functional can be written as (denoting )
(4.2)
where
.
It is clear that, by flow property, we have , -a.s. on .
Next, for any admissible policy , we formally define a new feedback control policy as follows:
for ,
(4.3)
where is the Gibbs function defined by (3.2).
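To illustrate the improvement step (4.3), here is a minimal sketch of the map from (an approximation of) the derivative of the current cost functional to the density of the improved Gibbs policy on the action set [0, M]. The exponent a(1 − J′(x))/λ is our reading of the Gibbs function in (3.2) for this dividend model and should be treated as an assumption; note that the model coefficients never enter the computation.

```python
import numpy as np

def improved_policy_density(dJ_dx, lam=0.5, M=1.5, n=2001):
    """Return (grid, pi') with pi'(a|x) proportional to exp(a * (1 - dJ_dx) / lam)
    on [0, M].  Only the derivative of the current cost functional and the
    temperature enter; the coefficients of the state dynamics are never used."""
    a = np.linspace(0.0, M, n)
    logits = a * (1.0 - dJ_dx) / lam
    logits -= logits.max()                     # for numerical stability
    w = np.exp(logits)
    da = a[1] - a[0]
    return a, w / (w.sum() * da)               # normalized density on [0, M]

# If J'(x) < 1 the mass concentrates near the maximal rate M, and if J'(x) > 1
# it concentrates near 0, recovering the bang-bang intuition as lam -> 0.
a, pi_new = improved_policy_density(dJ_dx=0.4)
print(float((a * pi_new).sum() * (a[1] - a[0])))   # mean dividend rate under pi'
```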
We would like to emphasize that the new policy in (4.3)
depends on and , but is
independent of the coefficients (!). To facilitate the argument we introduce the following definition.
Definition 4.1.
A function is called “Strongly Admissible” if its density function enjoys the following properties:
(i) there exist such that , and ;
(ii) there exists such that , , uniformly in .
The set of strongly admissible controls is denoted by .
Lemma 1. Suppose that is a function whose density takes the form , where .
Then .
Proof.
Since , let be such that , .
Next, note that is positive and continuous, and for fixed , , there exist constants , such that and , . Consequently,
we have
, , and is uniformly Lipschitz on , uniformly in , proving the lemma.
In what follows we shall use the following notations. For any ,
(4.4)
Clearly, for , and are bounded and are Lipschitz continuous. We denote
to be the solution to SDE (2.10), and
rewrite the cost function (2.8) as
(4.5)
where .
Thus, in light of the Feynman-Kac formula, for any , is the probabilistic solution to the following ODE on :
(4.6)
Now let us denote to be the solution to the linear elliptic equation
(4.6) on a finite interval with boundary conditions and ; then, by the regularity and the boundedness of and ,
and using only interior-type Schauder estimates (cf. [11]), one can show that and the bounds of and depend only on those of the coefficients , and , but are uniform in . By sending and applying the standard diagonalization argument (cf. e.g., [18]), one shows that , which satisfies (4.6). We summarize the above discussion as the following proposition for ready reference.
Proposition 4.2.
If , then , and the bounds
of and depend only on those of , , and .
Our main result of this section is the following Policy Improvement Theorem.
Theorem 4.
Assume that Assumption 2.4 is in force. Let and let be defined by (4.3) associated to ; then it holds that
, .
Proof. Let be given, and let be the corresponding control defined by (4.3).
Since , and are uniformly bounded, and by Proposition 4.2,
. Thus Lemma 1 (with ) implies that
as well. Moreover, since ,
is a -solution to the ODE (4.6).
Now, recalling that is the maximizer of , we have
(4.7)
Now, let us consider the process , the solution to (4.1) with being replaced by . Applying Itô’s formula
to from to , for any , and noting the definitions of and , we deduce from (4.7) that
Taking expectation on both sides above, sending and noting that , we obtain that ,
, proving the theorem.
In light of Theorem 4 we can naturally define a “learning sequence” as follows. We start with , and define , and ,
(4.9)
Also for each , let . The natural question is whether this learning sequence is actually a “maximizing sequence”, that is, , as . Such a result would obviously justify the policy improvement scheme, and was proved in the LQ case in [23].
Before we proceed, we note that by Proposition 4.2
the learning sequence , , but the bounds may depend on the coefficients
, , thus may not be uniform in . But by definition and Proposition 2.3, we see that for some . Moreover, since for each , , if we choose to be such that (e.g., ), then we have for all . That is, ’s are uniformly bounded, and uniformly in , provided that ’s are. The following result, based on the recent work [12], is thus crucial.
Proposition 4.3.
The functions , are uniformly bounded, uniformly in . Consequently,
the learning sequence , , and the bounds of ’s, up to their second derivatives, are uniform in .
Our main result of this section is the following.
Theorem 5.
Assume that Assumption 2.4 is in force. Then the sequence is a maximizing sequence. Furthermore,
the sequence converges to the optimal policy .
Proof.
We first observe that by Lemma 1 the sequence , provided . Since
, Proposition 4.3 guarantees that , and the bounds are independent of .
Thus a simple application of
Arzelà-Ascoli Theorem shows that there exist subsequences and such that and converge uniformly on compacts.
Let us fix any compact set , and assume , uniformly on , for some function . By definition of ’s we know that is monotonically increasing, thanks to Theorem 4, thus the whole sequence must converge uniformly on to .
Next, let us assume that , uniformly on , for some function . Since obviously
as well, and noting that the derivative operator is a closed operator, it follows that , . Applying the same argument one shows that, for any subsequence of , there exists a sub-subsequence that converges uniformly on to the same limit ; we conclude that the sequence itself converges uniformly on to .
Since is arbitrary, this shows that converges uniformly on compacts to .
Since is a continuous function of , we see that converges uniformly to defined by
.
Finally, applying Lemma 1 we see that , and the structure of the guarantees that satisfies the HJB equation (3.1)
on the compact set .
Extending the result to by using the fact that is arbitrary, we conclude that satisfies the HJB equation (3.1) (or equivalently
(3.3)).
Now, by using a slightly modified version of the verification argument in [12, Theorem 4.1], we conclude that is the unique solution to the HJB equation (3.1), and thus by definition is the optimal control.
Remark 4.4.
An alternate policy improvement method is the so-called Policy Gradient (PG) method introduced in [15], applicable for both finite and infinite horizon problems. Roughly speaking, a PG method parametrizes the policies and then solves for via the equation , using stochastic approximation method. The advantage of a PG method is that
it does not depend on the system parameter, whereas in theory
Theorem 4 is based on finding the maximizer of the Hamiltonian, and thus the learning strategy (4.9) may depend on the system parameter. However, a closer look at the learning parameters and in (4.9) shows that they depend only on , but not on directly. In fact, we believe that in our case the PG method would not be
advantageous, especially given the convergence result in Theorem 5 and the fact that the PG method also requires a proper choice of the parameterization family, which, to the best of our knowledge, remains a challenging issue in practice. We shall therefore content ourselves
with algorithms using the learning strategy (4.9) for our numerical analysis in §7.
5 Policy Evaluation — A Martingale Approach
Having proved the policy improvement theorem, we turn our attention to an equally important issue in the learning process, that is, the evaluation of the cost (value) functional, or the Policy Evaluation. The main idea of the policy evaluation in reinforcement learning literature usually refers to a process of approximating the cost functional , for a given feedback control , by approximating by a parametric family of functions , where .
Throughout this section, we shall consider a fixed feedback control policy . Thus for simplicity of notation, we shall drop the superscript and thus write and .
We note that for , the functions and .
Now let be the solution to the SDE (2.10), and
satisfies the ODE (4.6). Then, applying Itô’s formula we see that
(5.1)
is an -martingale. Furthermore, the following result is more or less standard.
Proposition 5.1.
Assume that Assumption 4.3 holds, and suppose that is such that , and for all , the process is an -martingale. Then .
Proof. First note that , and
. By (5.1) and definition of
we have
.
Now, since , , and are bounded, both and are uniformly integrable -martingales, by optional sampling it holds that
The result follows.
We now consider a family of functions , where is a certain index set.
For the sake of argument, we shall assume further that is compact.
Moreover, we shall make the following assumptions for the parameterized family .
Assumption 5.2.
(i) The mapping is sufficiently smooth, so that all the derivatives required exist
in the classical sense.
(ii) For all , are square-integrable continuous processes, and the mappings are continuous, where .
(iii) There exists a continuous function , such that .
In what follows we shall often drop the superscript from the processes , etc., if there is no danger of confusion.
Also, for practical purpose we shall consider a finite time horizon , for an arbitrarily fixed and sufficiently large . Denoting the
stopping time , by optional sampling theorem, we know that , for , is an -martingale on , where . Let us also denote
, .
We now follow the idea of [14] to construct the so-called Martingale Loss Function.
For any , consider the parametrized approximation of the process :
(5.2)
In light of the Martingale Loss function introduced in [14], we denote
(5.3)
We should note that the last equality above indicates that the martingale loss function is actually independent of the function , which is one of the main features of this algorithm.
Furthermore, inspired by the mean-squared and discounted mean-squared value errors we define
(5.4)
(5.5)
The following result shows the connection between the minimizers of and .
Theorem 6.
Assume that Assumption 5.2 is in force. Then, it holds that
(5.6)
Proof. First, note that , and , we see that
Here in the above we use the convention that if , and the identities become trivial. Consequently, by definition (5.3) and noting that , for , we can write
Combining above we see that (5) becomes
.
Since is independent of , we conclude the result.
Remark 5.3.
Since the minimizers of MSVE and DMSVE are obviously identical, Theorem 6 suggests that if
is a minimizer of either one of , , , then would be an acceptable approximation of . In the rest of the section we shall therefore focus on the identification of .
We now propose an algorithm that provides a numerical approximation of the policy evaluation (or equivalently the martingale ), by discretizing the integrals in
the loss functional . To this end, let
be an arbitrary but fixed time horizon, and consider the partition , and denote , .
Now for , we define ,
and so that . Finally, we define
. Clearly, both and are integer-valued random variables, and we shall often drop the subscript if there is no danger of confusion.
where , , are defined in an obvious way.
Furthermore, for , we define .
Now note that , and . Denoting , we have
(5.8)
Since
,
denoting and ,
we obtain,
Combining (5.8) and (5), similar to (5), we can now rewrite (5.3) as
(5.9)
We are now ready to give the main result of this section.
Theorem 7.
Let Assumptions 4.3 and 5.2 be in force.
Then it holds that
Proof. Fix a partition . By (5.9) and (5) we have, for ,
(5.10)
Let us first check , .
First, by Assumption 5.2, we see that
where is a continuous function, and is the bound of .
Thus we have
(5.11)
Next, note that implies , and since we are considering the case when , we might assume . Thus by definitions of and we have
(5.12)
Since is a diffusion, one can easily check that .
Furthermore, noting that is
uniformly bounded for in any compact set, from (5.11) and (5.12) we conclude that
(5.13)
It remains to show that
(5.14)
uniformly in on compacta. To this end, we first note that
(5.15)
where and .
From definition (5) we see that under Assumption 5.2 it holds that , . Furthermore, we denote
Since is bounded, we see that for some constant .
Then it holds that
where
, , is a bounded and continuous process.
Now for any , choose so that , for , and define
, we have
Sending we obtain that . Since is arbitrary, we conclude that as . Since the argument above is uniform in , it follows that as .
Consequently, we have
Since is continuous in , we see that the convergence above is uniform in on compacta.
Similarly, note that by Assumption 5.2 the process is also a square-integrable continuous process, and uniform in , we have
where is the modulus of continuity of in . Therefore , as , uniformly in on compacta.
Finally, combining (5.15)–(5) we obtain (5.14). This, together with (5.13) as well as
(5.10), proves the theorem.
Now let us denote , and consider the functions
Then , and by Assumption 5.2 we can easily check that the mappings are continuous functions. Applying Theorem 7 we see that , uniformly in on compacta, as .
Note that if is compact, then for any , there exists .
In general, we have the following corollary of Theorem 7.
Corollary 5.4.
Assume that all assumptions in Theorem 7 are in force. If there exists a sequence , such that , then any limit point of the sequence must satisfy
.
Proof. This is a direct consequence of [14, Lemma 1.1].
Remark 5.5.
We should note that, by Remark 5.3, the set of minimizers of the martingale loss function ML is the same as
that of DMSVE. Thus Corollary 5.4 indicates that we have a reasonable approach for approximating the unknown function . Indeed,
if has a convergent subsequence that converges to some , then is the best approximation for by either the measures of MSVE or DMSVE.
To end this section we discuss the ways to fulfill our last task: finding the optimal parameter . There are usually two learning methods for this task in RL, often referred to as online and batch learning, respectively.
Roughly speaking, the batch learning methods use multiple sample trajectories of over a given finite time horizon to update parameter at each step,
whereas in online learning, one observes only a single sample trajectory
to continuously update the parameter until it converges. Clearly, the online learning is particularly suitable for infinite horizon problem, whereas the ML function is by definition better suited for batch learning.
Although our problem is by nature an infinite horizon one, we shall first create a batch learning algorithm via the ML function by restricting ourselves to an arbitrarily fixed finite horizon , so as to convert it to a finite time horizon problem.
To this end, we note that
However, since may be unbounded, we shall consider instead the function:
We observe that the difference
for some continuous function . Thus if is compact, for large enough or small enough, the difference between and is negligible. Furthermore,
we note that
we can now follow the method of Stochastic Gradient Descent (SGD) to minimize and obtain the updating rule,
where denotes the learning rate for the iteration (using the simulated sample trajectory). Here is chosen so as to help guarantee the convergence of the algorithm, based on the literature on the convergence of SGD.
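The following is a minimal sketch of one such SGD step on a truncated, discretized martingale loss, using a single simulated trajectory per update. Both the hypothetical two-parameter family V_theta(x) = theta0(1 − exp(−theta1 x)) and the discretized loss 0.5·Σ_i Δt·(G_i − e^{−c t_i} V_theta(x_i))², with G_i the realized discounted payoff from t_i to the truncated end of the path, are stand-ins chosen for illustration; they are not the exact expressions in (5.3) or the parametrizations of §7.

```python
import numpy as np

def V(theta, x):
    """Hypothetical parametric family V_theta(x) = theta0 * (1 - exp(-theta1 * x))."""
    return theta[0] * (1.0 - np.exp(-theta[1] * x))

def grad_V(theta, x):
    e = np.exp(-theta[1] * x)
    return np.array([1.0 - e, theta[0] * x * e])

def sgd_step(theta, xs, rewards, ts, dt, c=0.1, lr=0.01):
    """One SGD step on the discretized martingale loss of one path:
    ML(theta) ~ 0.5 * sum_i dt * (G_i - e^{-c t_i} V_theta(x_i))^2, where
    G_i = sum_{j >= i} e^{-c t_j} * rewards[j] * dt is the realized payoff
    from t_i to the (truncated) end of the path, discounted to time 0."""
    disc = np.exp(-c * ts)
    G = np.cumsum((disc * rewards * dt)[::-1])[::-1]     # discounted payoff-to-go
    grad = np.zeros_like(theta)
    for ti, xi, gi, di in zip(ts, xs, G, disc):
        residual = di * V(theta, xi) - gi
        grad += dt * residual * di * grad_V(theta, xi)
    return theta - lr * grad

# usage with a dummy trajectory (states, running rewards, time grid)
dt, T = 0.1, 5.0
ts = np.arange(0.0, T, dt)
rng = np.random.default_rng(0)
xs = 3.0 + 0.2 * np.sqrt(dt) * np.cumsum(rng.standard_normal(ts.size))
rewards = np.full(ts.size, 1.0)            # placeholder running reward r_t
theta = np.array([1.0, 0.5])
print(sgd_step(theta, xs, rewards, ts, dt))
```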
6 Temporal Difference (TD) Based Online Learning
In this section we consider another policy evaluation method utilizing the parametric family . The starting point of this method is Proposition 5.1, which states that the best approximation is one whose corresponding approximating process defined by (5.2) is a martingale (in which case (!)).
Next, we recall the following simple fact (see, e.g., [14] for a proof).
Proposition 6.1.
An Itô process is a martingale if and only if
(6.1)
The functions are called test functions.
Proposition 6.1 suggests that a reasonable approach for approximating the optimal could be solving the martingale orthogonality condition (6.1).
However, since (6.1) involves infinitely many equations, for numerical approximations we should only choose a finite number of test functions, often referred to as moment conditions.
There are many ways to choose test functions.
In the finite horizon case, [14] proposed some algorithms for solving equation (6.1) with certain test functions. By using the well-known Robbins-Monro stochastic approximation (cf. Robbins and Monro (1951)), they suggested some continuous analogs of the well-known discrete time Temporal Difference (TD) algorithms, such as the TD method and the (linear) least squares TD(0) (or LSTD) method, which are often referred to as the CTD method and CLSTD method, respectively, for obvious reasons.
We should note that although our problem is essentially an infinite horizon one, we could consider a sufficiently large truncated time horizon , as we did in previous section, so that offline CTD methods similar to [14] can also be applied. However, in what follows we shall focus only on an online version of CTD method that is more suitable to the infinite horizon case.
We begin by recalling the following fact:
where is the probability measure defined in §2, and is the trajectory corresponding to the action “sampled” from the policy distribution . Now let , be a sequence of discrete time points, and the action sampled at time . Denote
By the same argument as in [14], we have the discrete time approximation of (6):
(6.3)
In what follows we summarize the updating rules for the CTD(λ) methods using (6.3).
CTD(0) (). In this case we let , with the updating rule:
(6.4)
CTD(λ) (). In this case we choose
, with the updating rule:
(6.5)
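For illustration, the sketch below performs one online update in the spirit of the CTD(0) rule (6.4): it uses the gradient of V_theta with respect to theta as the test function and the increment of the approximating martingale ∫_0^t e^{−cs} r_s ds + e^{−ct} V_theta(X_t) as the temporal difference. The parametric family and the exact form of the increment are assumptions in the spirit of §5, not expressions copied from the paper.

```python
import numpy as np

def V(theta, x):
    """Hypothetical parametric family V_theta(x) = theta0 * (1 - exp(-theta1 * x))."""
    return theta[0] * (1.0 - np.exp(-theta[1] * x))

def grad_V(theta, x):
    e = np.exp(-theta[1] * x)
    return np.array([1.0 - e, theta[0] * x * e])

def ctd0_update(theta, t, x, x_next, reward, dt, c=0.1, lr=0.05):
    """theta <- theta + lr * xi_t * dM_t, with test function xi_t = grad_theta V_theta(x_t)
    and dM_t = e^{-c(t+dt)} V_theta(x_{t+dt}) - e^{-c t} V_theta(x_t) + e^{-c t} * reward * dt,
    i.e. the increment of the approximating martingale, which has zero mean when
    V_theta equals the true cost functional."""
    dM = (np.exp(-c * (t + dt)) * V(theta, x_next)
          - np.exp(-c * t) * V(theta, x)
          + np.exp(-c * t) * reward * dt)
    return theta + lr * grad_V(theta, x) * dM

theta = np.array([1.0, 0.5])
print(ctd0_update(theta, t=0.0, x=3.0, x_next=3.05, reward=1.2, dt=0.1))
```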
Remark 6.2.
(i) We note that the updating rule (6.4) can be viewed as the special case of (6.5) when λ = 0, if we make the convention that , for .
(ii) Although we are considering the infinite horizon case, in practice, in order to prevent an infinite loop, one always has to stop the iteration at a finite time. In other words, for both CTD(0) and CTD(λ), we assume that for a large .
(iii) The constants ’s are often referred to as the learning rates for the -th iteration. In light of the convergence conditions of the Stochastic Approximation methods discussed at the end of the previous section, we shall choose ’s so as to help guarantee the convergence of the algorithm.
We observe that the convergence analysis of the above method for each fixed , coincides with that of the stochastic approximation methods. It would naturally be interesting to see the convergence with respect to . To this end, let us first define a special subspace of :
where , , and .
The following result is adapted from [14, §4.2 Theorem 3] with an almost identical proof. We thus only state it and omit the proof.
Proposition 6.3.
Assume that for some , and that , are such that .
Then for any sequence such that exists, it must hold that . Furthermore, there exists such that .
Finally, we remark that although there are other PE methods analogous to well-known TD methods (e.g., CLSTD(0)), which are particularly
well suited for linear parameterization families, in this paper we are interested in parameterized families that are nonlinear in nature. Thus we shall focus only on the CTD methods as well as the Martingale Loss Function based PE method developed in the previous section (which will be referred to as the ML-algorithm in what follows), and present the detailed algorithms and numerical results in the next section.
7 Numerical Results
In this section we present the numerical results along the lines of the PE and PI schemes discussed in the previous sections. In particular, we shall consider the CTD methods and the ML Algorithm, with some special parametrizations based on the knowledge of the explicit solution of the original optimal dividend problem (with ), but without specifying the market parameters and .
To test the effectiveness of the learning procedure, we shall use the so-called environment simulator: , that takes the current state and action as inputs and generates the state at time , and we shall use the outcome of the simulator as the dynamics of . We note that the environment simulator will be problem specific, and should be created using historic data pertaining to the problem, without using the environment coefficients, which are considered unknown in the RL setting. But in our
testing procedure, we shall use some dummy values of and , along with the following Euler–Maruyama discretization of the SDE (2.10):
(7.1)
where is a normal random variable, and at each time , is calculated by the environment simulator recursively via the given and , and to be specified below.
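A minimal environment-simulator sketch matching the Euler–Maruyama step (7.1) is given below: the dummy coefficients are hidden inside the simulator and are never exposed to the learning agent, which only supplies the current state and a sampled dividend action. The numerical values are placeholders.

```python
import numpy as np

class EnvironmentSimulator:
    """Black-box simulator of one step of (7.1):
    X_{t+dt} = X_t + (mu - a) * dt + sigma * sqrt(dt) * Z,
    where the "dummy" coefficients mu and sigma are internal and unknown to the agent."""

    def __init__(self, mu=0.5, sigma=1.0, dt=0.01, seed=0):
        self._mu, self._sigma, self.dt = mu, sigma, dt
        self._rng = np.random.default_rng(seed)

    def step(self, x, a):
        z = self._rng.standard_normal()
        x_next = x + (self._mu - a) * self.dt + self._sigma * np.sqrt(self.dt) * z
        ruined = x_next <= 0.0                  # ruin: surplus hits zero
        return max(x_next, 0.0), ruined

env = EnvironmentSimulator()
print(env.step(x=3.0, a=1.0))
```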
Sampling of the optimal strategy.
We recall from (3.15) that the optimal policy function has the form , where is a continuous function. It can be easily calculated that the inverse of the cumulative distribution function, denoted by , is of the form:
Thus, by the inversion method, if , the uniform distribution on , then the random variable , and we need only sample , which is much simpler.
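Concretely, for a Gibbs-type density π(a|x) proportional to exp(βa) on a bounded action set [0, M] (where, under our reading of (3.2), β = (1 − V′(x))/λ), the cumulative distribution function inverts in closed form, so each action requires only one uniform draw. The sketch below is based on that assumed form; the constants are illustrative.

```python
import numpy as np

def sample_gibbs_action(dV_dx, lam=0.5, M=1.5, rng=None):
    """Inverse-CDF sampling from pi(a|x) proportional to exp(beta * a) on [0, M],
    with beta = (1 - dV_dx) / lam (assumed Gibbs exponent).  The CDF is
    F(a) = (e^{beta a} - 1) / (e^{beta M} - 1), hence
    F^{-1}(u) = log(1 + u * (e^{beta M} - 1)) / beta."""
    rng = rng if rng is not None else np.random.default_rng()
    u = rng.uniform()
    beta = (1.0 - dV_dx) / lam
    if abs(beta) < 1e-8:                        # beta ~ 0: uniform on [0, M]
        return u * M
    return np.log1p(u * np.expm1(beta * M)) / beta

rng = np.random.default_rng(1)
print([round(sample_gibbs_action(0.4, rng=rng), 3) for _ in range(5)])
```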
Parametrization of the cost functional. The next step is to choose the parametrization of .
In light of the well-known result (cf. e.g., [1]), we know that if are given, and , (thanks to Assumption 2.4), the classical solution for the optimal dividend problem is given by
(7.2)
where , ,
, and .
We should note that the threshold in (7.2) is most critical in the value function, as it determines the switching barrier of the optimal dividend rate. That is, the optimal dividend rate is of the “bang-bang” form: , where is the reserve process (see, e.g., [1]).
We therefore consider the following two parametrizations based on the initial state .
where represent and of the classical solution respectively.
In particular, the bounds for and are due to the fact and under Assumption 2.4.
We should note that these bounds alone are not sufficient for the algorithms to converge, and we actually enforced some additional bounds. In practice, the range of and should be obtained from historical data for this method to be effective in real life applications.
Finally, it is worth noting that (7.2) actually implies that and . We can therefore approximate by and , respectively, whenever the limit can be obtained. The threshold can then be approximated via and as well.
where represent and respectively, and the bounds for are the bounds of parameter in (7.2). To obtain an upper bound of , we note that is necessary to ensure for each , and thus the upper bound of leads to that of . For the lower bound of , note that and hence so is .
Using , we approximate by .
Remark 7.1.
The parametrization above depends heavily on the knowledge of the explicit solution for the classical optimal control problem. In general, it is natural to consider using the viscosity solution of the entropy regularized relaxed control problem as the basis for the parameterization family. However, although we did identify both viscosity super- and sub-solutions in (3.9), we found that
the specific super-solution does not
work effectively, due to the computational complexities resulting from the piecewise nature of the function, as well as the complicated nature of the bounds of the parameters involved (see (3.12)); whereas the viscosity sub-solution, being a simple function independent of all the parameters we consider, does not seem to be an effective choice for a parameterization family in this case either.
We shall leave the study of the effective parametrization using viscosity solutions to our future research.
In the following two subsections we summarize our numerical experiments following the analysis so far.
For testing purposes, we choose “dummy” parameters and , so that Assumption 2.4 holds. We use to limit the
number of iterations, and we observe that on average the ruin time of the path simulations occurs in the interval . We also use the error bound , and make the convention that whenever .
7.1 CTD methods
Data: Initial state , Initial temperature , Initial learning rate , functional forms of , number of simulated paths , Variable , an environment simulator
Learning Procedure
Initialize , and set .
while do
Set .
if AND then
Compute and store over the last iterations.
Set .
if then
End iteration if the absolute difference .
end if
end if
Initialize .
Observe and store .
while do
.
Compute and generate action
Apply to to observe and store .
end iteration if .
Compute .
end iteration if .
Update .
Update .
end while
Set and update .
end while
Set .
Algorithm 1 Algorithm
In Algorithm 1 below we carry out the PE procedure using the method. We choose as
a function of the iteration number: . This particular function is chosen so that and the entropy-regularized control problem converges to the classical problem, but is still bounded away from so as to ensure that is well defined. We shall initialize the learning rate at 1 and decrease it using the function so as to ensure that the conditions and are satisfied.
We note that Algorithm 1 is designed as a combination of online learning and the so-called batch learning, which updates the parameter at each temporal partition point, but only updates the policy after a certain number (the parameter “” in Algorithm 1) of path simulations. This particular design is to allow the PE method to better approximate before updating .
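As an illustration of schedules compatible with these requirements, the snippet below shows a temperature that decreases in the iteration count but stays bounded away from zero, together with a learning rate satisfying the Robbins-Monro conditions (the sum of the rates diverges while the sum of their squares converges). The functional forms are hypothetical examples, not the ones used to produce the results reported below.

```python
import numpy as np

def temperature(k, lam0=1.0, lam_min=0.05):
    """Decreasing temperature, bounded away from 0 so the Gibbs policy stays well defined."""
    return max(lam0 / (1.0 + np.log1p(k)), lam_min)

def learning_rate(k):
    """Robbins-Monro type rate 1/(1+k)^0.75: sum_k alpha_k = inf, sum_k alpha_k^2 < inf."""
    return 1.0 / (1.0 + k) ** 0.75

print([round(temperature(k), 3) for k in (0, 10, 100, 1000)])
print([round(learning_rate(k), 3) for k in (0, 10, 100, 1000)])
```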
Convergence Analysis. To analyze the convergence as , we consider , 0.001, 0.0005, 0.0001, 0.00005, respectively. We take path simulations and in the implementation. Note that with the choice of dummy parameters and , the classical solution is given by , and . We thus consider
two parameterization families, for initial values and respectively.
Table 1: Results for the method

                 family (i)                          family (ii)
             x=3              x=10               x=3               x=10
      15.49    5.383    31.276   3.476     32.489   19.359    55.667    9.635
      17.188   4.099    22.217   4.292     31.108   18.532    53.262   11.942
      16.68    4.474    23.082   3.931     37.58    11.925    60.948   10.691
      16.858   4.444    23.079   4.049     40.825   11.797    65.179   10.5
      17.261   4.392    23.094   4.505     38.899   18.994    55.341   18.142

(Each row corresponds to one of the five values listed above, in the same order.)
Table 2: Convergence results for the method (plots w.r.t. of the results obtained using family (i) for x=3 and x=10, and using family (ii) for x=3 and x=10; graphs omitted).
Case 1. . As we can observe from Tables 1 and 2, in this case using the approximation (7.3) (family (i)) shows a reasonably satisfactory level of convergence towards the known classical solution values of and as , despite some mild fluctuations.
We believe that such fluctuations are due to the randomness of the data we observe and that averaging over the paths in our algorithm reduced the occurrence of these fluctuations to a satisfactory level. As we can see, despite the minor anomalies, the general trajectory of these graphs tends towards the classical solution as . We should also observe that using family (ii) (7.4) does not produce any satisfactory convergent results. But this is as expected, since the function (7.4) is based on the classical solution for .
Case 2. . Even though the family (7.3) is based on the classical solution for , as we can see from Tables 1 and 2, the algorithm using family (7.3) converges to the values of the classical solution even in the case , whereas the algorithm using family (ii) (7.4) does not. While a bit counterintuitive, this is actually not entirely unexpected, since the state process can be seen to reach in the considered time interval in general, but the parameterization (7.4) is not suitable when the value of the state reaches below .
Consequently, it seems that the parameterization (7.3) is better suited for the method, regardless of the initial value.
Finally, we would like to point out that the case for methods for is much more complicated,
and the algorithms are computationally much slower than method.
We believe that the proper choice of the learning rate in this case
warrants some deeper investigation, but we prefer not to discuss these issues in this paper.
7.2 The ML Algorithm
In Algorithm 2 we present the so-called ML-algorithm in which we use a batch learning approach where we update the parameters by at the end of each simulated path using the information from the time interval . We use path simulations and initial temperature . In the -th simulated path we decrease the temperature parameter by . We also initialize the learning rate at 1 and decrease it using the function . Finally, we consider , respectively, to represent the convergence as .
Data: Initial state , Initial temperature , Initial learning rate , functional forms of , number of simulated paths , Variable , an environment simulator
Learning Procedure
Initialize , .
while do
.
if AND then
Compute and store over the last iterations.
if then
End iteration if the absolute difference .
end if
end if
Initialize , observe and store .
while do
Compute and generate action
Apply to to observe and store .
end iteration if .
observe and store .
Update .
end while
Compute using the ML algorithm and
Update .
Update .
end while
Set .
Algorithm 2: ML Algorithm
Using parameterized family (i) with both the initial values and , we obtain the optimal as the lower bound of each parameter , for . Using parameterized family (ii) with both the initial values and , we obtain the optimal as the average of the lower and upper bounds of each parameter for , since in each iteration is updated as the upper and lower boundary alternately. This is due to the fact that the learning rate is too large for this particular algorithm. Decreasing the size of the learning rate results in optimal values that are away from the boundaries, but the algorithms in these cases were shown empirically not to converge, and thus the final result depends on the number of iterations used (M).
In general, the reason for this could be the loss of efficiency caused by decreasing the learning rates, since Gradient Descent Algorithms are generally sensitive to learning rates. Specific to our problem, among many possible reasons, we believe that the limiting behavior of the optimal strategy when is a serious issue, as is not well defined when and a Dirac δ-measure is supposed to be involved. Furthermore, the “bang-bang” nature and the jump of the optimal control could also affect the convergence of the algorithm. Finally, the algorithm seems to be quite sensitive to the value of , since the value function is a piecewise smooth function depending on . Thus, to rigorously analyze the effectiveness of the ML-algorithm with parameterization families (i) and (ii), further empirical analysis is needed, which involves finding effective learning rates.
All these issues call for further investigation, but based on our numerical experiment we can nevertheless conclude that the CTD(0) method using the parameterization family (i) is effective in finding the value and , provided that the effective upper and lower bounds for the parameters can be identified using historic data.
References
[1]
Søren Asmussen and Michael Taksar, Controlled diffusion models for
optimal dividend pay-out, Insurance Math. Econom. 20 (1997), no. 1,
1–15.
[2]
Lihua Bai and Jin Ma, Optimal investment and dividend strategy under
renewal risk model, SIAM J. Control Optim. 59 (2021), no. 6,
4590–4614.
[3]
Lihua Bai and Jostein Paulsen, Optimal dividend policies with transaction
costs for a class of diffusion processes, SIAM J. Control Optim. 48
(2010), no. 8, 4987–5008.
[4]
Jun Cai, Hans U. Gerber, and Hailiang Yang, Optimal dividends in an
Ornstein-Uhlenbeck type model with credit and debit interest, N. Am.
Actuar. J. 10 (2006), no. 2, 94–119.
[5]
Tahir Choulli, Michael Taksar, and Xun Yu Zhou, A diffusion model for
optimal dividend distribution for a company with constraints on risk
control, SIAM J. Control Optim. 41 (2003), no. 6, 1946–1979.
[6]
Earl A. Coddington and Norman Levinson, Theory of ordinary differential
equations, McGraw-Hill Book Co., Inc., New York-Toronto-London, 1955.
[7]
Michael G. Crandall, Hitoshi Ishii, and Pierre-Louis Lions, User’s guide
to viscosity solutions of second order partial differential equations, Bull.
Amer. Math. Soc. (N.S.) 27 (1992), no. 1, 1–67.
[8]
Tiziano De Angelis and Erik Ekström, The dividend problem with a
finite horizon, Ann. Appl. Probab. 27 (2017), no. 6, 3525–3546.
[9]
B. de Finetti, Su un’impostazione alternativa della teoria collettiva
del rischio, Transactions of the XVth International Congress of Actuaries, New York 2 (1957), 433–443.
[10]
James Ferguson, A brief survey of the history of the calculus of
variations and its applications, 2004.
[11]
David Gilbarg and Neil S. Trudinger, Elliptic partial differential equations of second order, Classics in Mathematics,
Springer-Verlag, Berlin, 2001, Reprint of the 1998 edition.
[12]
Yu-Jui Huang, Zhenhua Wang, and Zhou Zhou, Convergence of policy
improvement for entropy-regularized stochastic control problems, 2023.
[13]
Saul D. Jacka and Aleksandar Mijatović, On the policy improvement
algorithm in continuous time, Stochastics 89 (2017), no. 1, 348–359.
[14]
Yanwei Jia and Xun Yu Zhou, Policy evaluation and temporal-difference
learning in continuous time and space: a martingale approach, J. Mach.
Learn. Res. 23 (2022), Paper No. [154], 55.
[15]
Yanwei Jia and Xun Yu Zhou, Policy gradient and actor-critic learning in continuous time and
space: theory and algorithms, J. Mach. Learn. Res. 23 (2022), Paper
No. [275], 50.
[16]
B. Kerimkulov, D. Šiška, and L. Szpruch, A modified MSA for
stochastic control problems, Appl. Math. Optim. 84 (2021), no. 3,
3417–3436.
[17]
Bekzhan Kerimkulov, David Šiška, and Lukasz Szpruch, Exponential
convergence and stability of Howard’s policy improvement algorithm for
controlled diffusions, SIAM J. Control Optim. 58 (2020), no. 3,
1314–1340.
[18]
O. A. Ladyženskaja, V. A. Solonnikov, and N. N. Ural’ceva, Linear and
quasilinear equations of parabolic type, AMS, Providence, RI, 1968.
[19]
M. L. Puterman, On the convergence of policy iteration for controlled
diffusions, J. Optim. Theory Appl. 33 (1981), no. 1, 137–144.
[20]
S. E. Shreve, J. P. Lehoczky, and D. P. Gaver, Optimal consumption for
general diffusions with absorbing and reflecting barriers, SIAM J. Control
Optim. 22 (1984), no. 1, 55–75.
[21]
Stefan Thonhauser and Hansjörg Albrecher, Dividend maximization under
consideration of the time value of ruin, Insurance Math. Econom. 41
(2007), no. 1, 163–184.
[22]
Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou, Reinforcement
learning in continuous time and space: a stochastic control approach, J.
Mach. Learn. Res. 21 (2020), Paper No. 198, 34.
[23]
Haoran Wang and Xun Yu Zhou, Continuous-time mean-variance portfolio
selection: a reinforcement learning framework, Math. Finance 30
(2020), no. 4, 1273–1308.
[24]
Jiongmin Yong and Xun Yu Zhou, Stochastic controls, Applications of
Mathematics (New York), vol. 43, Springer-Verlag, New York, 1999, Hamiltonian
systems and HJB equations.
[25]
Jinxia Zhu and Hailiang Yang, Optimal financing and dividend distribution
in a general diffusion model with regime switching, Adv. in Appl. Probab.
48 (2016), no. 2, 406–422.