
Reinforcement Learning for optimal dividend problem under diffusion model

Lihua Bai, Thejani Gamage, Jin Ma, Pengxu Xie

Lihua Bai: School of Mathematics, Nankai University, Tianjin, 300071, China. Email: lbai@nankai.edu.cn. This author is supported in part by Chinese NSF grants #11931018, #12272274, and #12171257.
Thejani Gamage: Department of Mathematics, University of Southern California, Los Angeles, CA 90089. Email: gamage@usc.edu.
Jin Ma: Department of Mathematics, University of Southern California, Los Angeles, CA 90089. Email: jinma@usc.edu. This author is supported in part by NSF grants #DMS-1908665 and 2205972.
Pengxu Xie: School of Mathematics, Nankai University, Tianjin, 300071, China. Email: 1120180026@mail.nankai.edu.cn.
Abstract

In this paper, we study the optimal dividend problem under the continuous time diffusion model with the dividend rate restricted to a given interval $[0,a]$. Unlike the standard literature, we shall particularly be interested in the case when the parameters (e.g., drift and diffusion coefficients) of the model are not specified, so that the optimal control cannot be explicitly determined. We therefore follow the recently developed method via Reinforcement Learning (RL) to find the optimal strategy. Specifically, we shall design a corresponding RL-type entropy-regularized exploratory control problem, which randomizes the control actions and balances exploitation and exploration. We shall first carry out a theoretical analysis of the new relaxed control problem and prove that the value function is the unique bounded classical solution to the corresponding HJB equation. We will then use a policy improvement argument, along with policy evaluation devices (e.g., Temporal Difference (TD)-based algorithms and Martingale Loss (ML) algorithms), to construct approximating sequences of the optimal strategy. We present some numerical results using different parametrization families for the cost functional to illustrate the effectiveness of the approximation schemes.


Keywords. Optimal dividend problem, entropy-regularized exploratory control problem, policy improvement, policy evaluation, temporal difference (TD) algorithm, martingale loss (ML).


2020 AMS Mathematics subject classification: 93E20,35; 91G50; 93B47.

1 Introduction

The problem of maximizing the cumulative discounted dividend payment can be traced back to the work of de Finetti [9]. Since then the problem has been widely studied in the literature under different models, and in many cases the problem can be explicitly solved when the model parameters are known. In particular, for the optimal dividend problem and its many variations in continuous time under diffusion models, we refer to the works of, among others, [1, 3, 4, 5, 8, 21, 25] and the references cited therein. The main motivation of this paper is to study optimal dividend problems in which the model parameters are not specified, so that the optimal control cannot be explicitly determined. Following the recently developed method using Reinforcement Learning (RL), we shall try to find the optimal strategy for a corresponding entropy-regularized control problem and solve it along the lines of both policy improvement and policy evaluation schemes.

The method of using reinforcement learning to solve discrete Markov decision problems has been well studied, but the extension of these concepts to the continuous time and space setting is still fairly new. Roughly speaking, in RL the learning agent uses a sequence of trials and errors to simultaneously identify the environment or the model parameters, and to determine the optimal control and optimal value function. Such a learning process is characterized by a mixture of exploration and exploitation, which repeatedly tries new actions to improve the collective outcomes. A critical point in this process is to balance the exploration and exploitation levels, since the former is usually computationally expensive and time consuming, while the latter may lead to sub-optimal outcomes. In RL theory a typical idea to balance exploration and exploitation in an optimal control problem is to “randomize” the control action and add a (Shannon) entropy term in the cost function, weighted by a temperature parameter. By maximizing the entropy one encourages exploration, and by decreasing the temperature parameter one gives more weight to exploitation. The resulting optimal control problem is often referred to as the entropy-regularized exploratory optimal control problem, which will be the starting point of our investigation.

As in any reinforcement learning scheme, we shall solve the entropy-regularized exploratory optimal dividend problem via a sequence of Policy Evaluation (PE) and Policy Improvement (PI) procedures. The former evaluates the cost functional for a given policy, and the latter produces a new policy that is “better” than the current one. We note that the idea of Policy Improvement Algorithms (PIA), as well as their convergence analysis, is not new in the numerical optimal control literature (see, e.g., [13, 17, 16, 19]). The main difference in the current RL setting is the involvement of the entropy regularization, which causes some technical complications in the convergence analysis. In the continuous time entropy-regularized exploratory control problem with diffusion models, a successful convergence analysis of PIA was first established for a particular Linear-Quadratic (LQ) case in [23], in which the exploratory HJB equation (i.e., the HJB equation corresponding to the entropy-regularized problem) can be directly solved, and the Gaussian nature of the optimal exploratory control is known. A more general case was recently investigated in [12], in which the convergence of PIA is proved in a general infinite horizon setting, without requiring the knowledge of the explicit form of the optimal control. The problem studied in this paper is very close to the one in [12], but not identical. While some of the analysis in this paper benefits from the fact that the spatial variable is one dimensional, there are particular technical subtleties because of the presence of the ruin time, although the problem is essentially an infinite horizon one, like the one studied in [12].

There are two main issues that this paper will focus on. The first is to design the PE and PI algorithms that are suitable for continuous time optimal dividend problems. We shall follow some of the “popular” schemes in RL, such as the well-understood Temporal Difference (TD) methods, combined with the so-called martingale approach, to design the PE learning procedure. Two technical points are worth noting: 1) since the cost functional involves the ruin time, and the observation of the ruin time of the state process is sometimes practically impossible (especially in the cases where ruin actually occurs beyond the time horizon we can practically observe), we shall propose algorithms that are insensitive to the ruin time; 2) although the infinite horizon nature of the problem somewhat prevents the use of the so-called “batch” learning method, we shall nevertheless study the temporally “truncated” problem so that the batch learning method can be applied. It should also be noted that one of the main difficulties in PE methods is to find an effective parameterization family of functions from which the best approximation for the cost functional is chosen, and the choice of the parameterization family directly affects the accuracy of the approximation. Since there are no proven standard methods of finding a suitable parameterization family, except for the LQ (Gaussian) case when the optimal value function is explicitly known, we shall use the classical “barrier”-type (restricted) optimal dividend strategy in [1] to propose the parametrization family, and carry out numerical experiments using the corresponding families.

The second main issue is the convergence analysis of the PIA. Similar to [12], in this paper we focus on the regularity analysis of the solution to the exploratory HJB equation and some related PDEs. Compared to the heavy PDE arguments in [12], we take advantage of the fact that in this paper the state process is one dimensional and takes nonnegative values, so that some stability arguments for 2-dimensional first-order nonlinear systems can be applied to conclude that the exploratory HJB equation has a concave, bounded classical solution, which coincides with the viscosity solution (of class (L)) of the HJB equation and the value function of the optimal dividend problem. With the help of these regularity results, we prove the convergence of PIA to the value function along the lines of [12], but with a much simpler argument.

The rest of the paper is organized as follows. In §2 we give the preliminary description of the problem and all the necessary notations, definitions, and assumptions. In §3 we study the value function and its regularity, and prove that it is a concave, bounded classical solution to the exploratory HJB equation. In §4 we study the issue of policy update; we shall introduce our PIA and prove its convergence. In §5 and §6 we discuss the methods of Policy Evaluation, that is, the methods for approximating the cost functional for a given policy, using a martingale loss function based approach and (online) CTD($\gamma$) methods, respectively. In §7 we propose parametrization families for PE and present numerical experiments using the proposed PI and PE methods.

2 Preliminaries and Problem Formulation

Throughout this paper we consider a filtered probability space $(\Omega,{\cal F},\{{\cal F}_{t}\}_{t\geq 0},\mathbb{P})$ on which is defined a standard Brownian motion $\{W_{t},t\geq 0\}$. We assume that the filtration $\mathbb{F}:=\{{\cal F}_{t}\}=\{{\cal F}^{W}_{t}\}$, with the usual augmentation so that it satisfies the usual conditions. For any metric space $\mathbb{X}$ with topological Borel sets $\mathscr{B}(\mathbb{X})$, we denote $\mathbb{L}^{0}(\mathbb{X})$ to be all $\mathscr{B}(\mathbb{X})$-measurable functions, and $\mathbb{L}^{p}(\mathbb{X})$, $p\geq 1$, to be the space of $p$-th integrable functions. The spaces $\mathbb{L}^{0}_{\mathbb{F}}([0,T];\mathbb{R})$ and $\mathbb{L}^{p}_{\mathbb{F}}([0,T];\mathbb{R})$, $p\geq 1$, etc., are defined in the usual ways. Furthermore, for a given domain $\mathbb{D}\subset\mathbb{R}$, we denote $\mathbb{C}^{k}(\mathbb{D})$ to be the space of all $k$-th order continuously differentiable functions on $\mathbb{D}$, and $\mathbb{C}(\mathbb{D})=\mathbb{C}^{0}(\mathbb{D})$. In particular, for $\mathbb{R}_{+}:=[0,\infty)$, we denote $\mathbb{C}^{k}_{b}(\mathbb{R}_{+})$ to be the space of all bounded, $k$-th continuously differentiable functions on $\mathbb{R}_{+}$ with all derivatives being bounded.

Consider the simplest diffusion approximation of a Cramér-Lundberg model with dividend:

$$dX_{t}=(\mu-\alpha_{t})dt+\sigma dW_{t},\quad t>0,\qquad X_{0}=x\in\mathbb{R},\tag{2.1}$$

where $x$ is the initial state, $\mu$ and $\sigma$ are constants determined by the premium rate and the claim frequency and size (cf., e.g., [1]), and $\alpha_{t}$ is the dividend rate at time $t\geq 0$. We denote $X=X^{\alpha}$ if necessary, and say that $\alpha=\{\alpha_{t},t\geq 0\}$ is admissible if it is $\mathbb{F}$-adapted and takes values in a given “action space” $[0,a]$. Furthermore, let us define the ruin time to be $\tau^{\alpha}_{x}:=\inf\{t>0:X^{\alpha}_{t}<0\}$. Clearly, $X^{\alpha}_{\tau^{\alpha}}=0$, and the problem is considered “ruined” as no dividend will be paid after $\tau^{\alpha}$. Our aim is to maximize the expected total discounted dividend given the initial condition $X^{\alpha}_{0}=x\in\mathbb{R}$:

$$V(x):=\sup_{\alpha\in\mathscr{U}[0,a]}\mathbb{E}_{x}\Big[\int_{0}^{\tau_{x}^{\alpha}}e^{-ct}\alpha_{t}\,dt\Big]:=\sup_{\alpha\in\mathscr{U}[0,a]}\mathbb{E}\Big[\int_{0}^{\tau^{\alpha}_{x}}e^{-ct}\alpha_{t}\,dt\,\Big|\,x_{0}=x\Big],\tag{2.2}$$

where $c>0$ is the discount rate, and $\mathscr{U}[0,a]$ is the set of admissible dividend rates taking values in $[0,a]$. The problem (2.1)-(2.2) is often referred to as the classical optimal restricted dividend problem, meaning that the dividend rate is restricted in a given interval $[0,a]$.

It is well understood that when the parameters $\mu$ and $\sigma$ are known, the optimal control is of the “feedback” form: $\alpha_{t}^{*}=\boldsymbol{a}^{*}(X_{t}^{*})$, where $X_{t}^{*}$ is the corresponding state process and $\boldsymbol{a}^{*}(\cdot)$ is a deterministic function taking values in $[0,a]$, often in the form of a threshold control (see, e.g., [1]). However, in practice the exact form of $\boldsymbol{a}^{*}(\cdot)$ is not implementable since the model parameters are usually not known, thus the “parameter insensitive” method through Reinforcement Learning (RL) becomes a much more desirable alternative, which we now elaborate.

In the RL formulation, the agent follows a process of exploration and exploitation via a sequence of trial-and-error evaluations. A key element is to randomize the control action as a probability distribution over $[0,a]$, similar to the notion of relaxed control in control theory, and the classical control is considered as a special point-mass (or Dirac $\delta$-measure) case. To make the idea mathematically precise, let us denote $\mathscr{B}([0,a])$ to be the Borel field on $[0,a]$, and $\mathscr{P}([0,a])$ to be the space of all probability measures on $([0,a],\mathscr{B}([0,a]))$, endowed with, say, the Wasserstein metric. A “relaxed control” is a randomized policy defined as a measure-valued progressively measurable process $(t,\omega)\mapsto{\pi}(\cdot;t,\omega)\in\mathscr{P}([0,a])$. Assuming that ${\pi}(\cdot;t,\omega)$ has a density, denoted by $\pi_{t}(\cdot,\omega)\in\mathbb{L}^{1}_{+}([0,a])\subset\mathbb{L}^{1}([0,a])$, $(t,\omega)\in[0,T]\times\Omega$, then we can write

$${\pi}(A;t,\omega)=\int_{A}\pi_{t}(w,\omega)\,dw,\qquad A\in\mathscr{B}([0,a]),\quad(t,\omega)\in[0,T]\times\Omega.$$

In what follows we shall often identify a relaxed control with its density process $\pi=\{\pi_{t},t\geq 0\}$. Now, for $t\in[0,T]$, we define a probability measure on $([0,a]\times\Omega,\mathscr{B}([0,a])\otimes{\cal F})$ as follows: for $A\in\mathscr{B}([0,a])$ and $B\in{\cal F}$,

$$\mathbb{Q}_{t}(A\times B):=\int_{A}\int_{B}\pi(dw;t,\omega)\,\mathbb{P}(d\omega)=\int_{A}\int_{B}\pi_{t}(w,\omega)\,dw\,\mathbb{P}(d\omega).\tag{2.3}$$

We call a function $A^{\pi}:[0,T]\times[0,a]\times\Omega\mapsto[0,a]$ the “canonical representation” of a relaxed control $\pi=\{\pi(\cdot,t,\cdot)\}_{t\geq 0}$ if $A^{\pi}_{t}(w,\omega)=w$. Then, for $t\geq 0$ we have

$$\mathbb{E}^{\mathbb{Q}_{t}}[A^{\pi}_{t}]=\int_{\Omega}\int_{0}^{a}A^{\pi}_{t}(w,\omega)\,\pi(dw;t,\omega)\,\mathbb{P}(d\omega)=\mathbb{E}^{\mathbb{P}}\Big[\int_{0}^{a}w\,\pi_{t}(w)\,dw\Big].\tag{2.4}$$

We can now derive the exploratory dynamics of the state process $X$ along the lines of entropy-regularized relaxed stochastic control arguments (see, e.g., [22]). Roughly speaking, consider the discrete version of the dynamics (2.1): for small $\Delta t>0$,

$$\Delta x_{t}:=x_{t+\Delta t}-x_{t}\approx(\mu-a_{t})\Delta t+\sigma(W_{t+\Delta t}-W_{t}),\qquad t\geq 0.\tag{2.5}$$

Let $\{a_{t}^{i}\}_{i=1}^{N}$ and $\{(x^{i}_{t},W_{t}^{i})\}_{i=1}^{N}$ be $N$ independent samples of $(a_{t})$ under the distribution $\pi_{t}$, and the corresponding samples of $(X_{t}^{\pi},W_{t})$, respectively. Then, the law of large numbers and (2.4) imply that

$$\sum_{i=1}^{N}\frac{\Delta x_{t}^{i}}{N}\approx\sum_{i=1}^{N}(\mu-a^{i}_{t})\frac{\Delta t}{N}\approx\mathbb{E}^{\mathbb{Q}_{t}}[\mu-A^{\pi}_{t}]\Delta t=\mathbb{E}^{\mathbb{P}}\Big[\mu-\int_{0}^{a}w\,\pi_{t}(w,\cdot)\,dw\Big]\Delta t,\tag{2.6}$$

as $N\to\infty$. This, together with the fact that $\frac{1}{N}\sum_{i=1}^{N}(\Delta x_{t}^{i})^{2}\approx\sigma^{2}\Delta t$, leads to the following form of the exploratory version of the state dynamics:

$$dX_{t}=\Big(\mu-\int_{0}^{a}w\,\pi_{t}(w,\cdot)\,dw\Big)dt+\sigma dW_{t},\qquad X_{0}=x,\tag{2.7}$$

where $\{\pi_{t}(w,\cdot)\}$ is the (density of the) relaxed control process, and we shall often denote $X=X^{\pi,0,x}=X^{\pi,x}$ to specify its dependence on the control $\pi$ and the initial state $x$.

To formulate the entropy-regularized optimal dividend problem, we first give a heuristic argument. Similar to (2.6), for $N$ large and $\Delta t$ small we should have

$$\frac{1}{N}\sum_{i=1}^{N}e^{-ct}a_{t}^{i}{\bf 1}_{[t\leq\tau^{i}]}\Delta t\approx\mathbb{E}^{\mathbb{Q}_{t}}\Big[e^{-ct}A_{t}^{\pi}{\bf 1}_{[t\leq\tau^{\pi}_{x}]}\Delta t\Big]=\mathbb{E}^{\mathbb{P}}\Big[{\bf 1}_{[t\leq\tau^{\pi}_{x}]}e^{-ct}\int_{0}^{a}w\,\pi_{t}(w)\,dw\,\Delta t\Big].$$

Therefore, in light of [22] we shall define the entropy-regularized cost functional of the optimal expected dividend control problem under the relaxed control $\pi$ as

$$J(x,{\pi})=\mathbb{E}_{x}\Big[\int_{0}^{\tau_{x}^{{\pi}}}e^{-ct}{\cal H}_{\lambda}^{\pi}(t)\,dt\Big],\tag{2.8}$$

where ${\cal H}_{\lambda}^{\pi}(t):=\int_{0}^{a}\big(w-\lambda\ln\pi_{t}(w)\big)\pi_{t}(w)\,dw$, $\tau^{\pi}_{x}=\inf\{t>0:X_{t}^{\pi,x}<0\}$, and $\lambda>0$ is the so-called temperature parameter balancing the exploration and exploitation.
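The following is a minimal Monte Carlo sketch of (2.7)-(2.8): an Euler-Maruyama simulation of the exploratory dynamics under a feedback density $\pi(w,x)$, accumulating the discounted running reward ${\cal H}_{\lambda}^{\pi}$ up to the ruin time. All parameter values (and the uniform test density) are illustrative and not taken from the paper; the finite horizon $T$ merely truncates the integral once $e^{-cT}$ is negligible.

```python
import numpy as np

A_MAX = 2.5   # maximal dividend rate a (illustrative value)

def estimate_cost(x0, density, mu=1.0, sigma=1.0, c=0.5, lam=0.1,
                  dt=0.01, T=20.0, n_paths=200, seed=0):
    """Monte Carlo estimate of the entropy-regularized cost (2.8): simulate the
    exploratory dynamics (2.7) by Euler-Maruyama and accumulate
    e^{-ct} H_lambda^pi(t) dt until ruin (or until the truncation time T)."""
    rng = np.random.default_rng(seed)
    w = np.linspace(0.0, A_MAX, 201)            # quadrature grid on the action space [0, a]
    n_steps = int(T / dt)
    costs = np.zeros(n_paths)
    for i in range(n_paths):
        x, cost = x0, 0.0
        for k in range(n_steps):
            p = density(w, x)                   # density pi(w, x) on the grid
            mean_w = np.trapz(w * p, w)         # \int_0^a w pi(w, x) dw
            entropy = -np.trapz(np.log(p) * p, w)
            cost += np.exp(-c * k * dt) * (mean_w + lam * entropy) * dt
            x += (mu - mean_w) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
            if x < 0.0:                         # ruin: no dividends afterwards
                break
        costs[i] = cost
    return costs.mean()

# uniform relaxed control pi(w, x) = 1/a, as in the proof of Proposition 2.3 below
print(estimate_cost(x0=1.0, density=lambda w, x: np.full_like(w, 1.0 / A_MAX)))
```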

We now define the set of open-loop admissible controls as follows.

Definition 2.1.

A measurable (density) process ${\pi}=\{\pi_{t}(\cdot,\cdot)\}_{t\geq 0}\in\mathbb{L}^{0}([0,\infty)\times[0,a]\times\Omega)$ is called an open-loop admissible relaxed control if

1. ${\pi}_{t}(\cdot;\omega)\in\mathbb{L}^{1}([0,a])$, for $dt\otimes d\mathbb{P}$-a.e. $(t,\omega)\in[0,\infty)\times\Omega$;

2. for each $w\in[0,a]$, the process $(t,\omega)\mapsto{\pi}_{t}(w,\omega)$ is $\mathbb{F}$-progressively measurable;

3. $\mathbb{E}_{x}\big[\int_{0}^{\tau^{\pi}_{x}}e^{-ct}|{\cal H}_{\lambda}^{\pi}(t)|\,dt\big]<+\infty$.

We shall denote $\mathscr{A}(x)$ to be the set of open-loop admissible relaxed controls.

Consequently, the value function (2.2) now reads

$$V(x)=\sup_{{\pi}\in\mathscr{A}(x)}\mathbb{E}_{x}\Big\{\int_{0}^{\tau_{x}^{\pi}}e^{-ct}{\cal H}_{\lambda}^{\pi}(t)\,dt\Big\},\qquad x\geq 0.\tag{2.9}$$

An important type of $\pi\in\mathscr{A}(x)$ is of the “feedback” nature, that is, $\pi_{t}(w,\omega)=\boldsymbol{\pi}(w,X^{\boldsymbol{\pi},x}_{t}(\omega))$ for some deterministic function $\boldsymbol{\pi}$, where $X^{\boldsymbol{\pi},x}$ satisfies the SDE:

$$dX_{t}=\Big(\mu-\int_{0}^{a}w{\boldsymbol{\pi}}(w,X_{t})\,dw\Big)dt+\sigma dW_{t},\quad t\geq 0;\qquad X_{0}=x.\tag{2.10}$$
Definition 2.2.

A function $\boldsymbol{\pi}\in\mathbb{L}^{0}([0,a]\times\mathbb{R})$ is called a closed-loop admissible relaxed control if, for every $x>0$,

1. the SDE (2.10) admits a unique strong solution $X^{\boldsymbol{\pi},x}$,

2. the process $\pi=\{{\pi}_{t}(\cdot;\omega):={\boldsymbol{\pi}}(\cdot,X^{\boldsymbol{\pi},x}_{t}(\omega));(t,\omega)\in[0,T]\times\Omega\}\in\mathscr{A}(x)$.

We denote $\mathscr{A}_{cl}\subset\mathscr{A}(x)$ to be the set of closed-loop admissible relaxed controls.

The following properties of the value function are straightforward.

Proposition 2.3.

Assume $a>1$. Then the value function $V$ satisfies the following properties:

(1) $V(x)\geq V(y)$, if $x\geq y>0$;

(2) $0\leq V(x)\leq\frac{\lambda\ln a+a}{c}$, $x\in\mathbb{R}_{+}$.

Proof. (1) Let $x\geq y$, and $\pi\in\mathscr{A}(y)$. Consider $\hat{\pi}_{t}(w,\omega):=\pi_{t}(w,\omega){\bf 1}_{\{t<\tau^{\pi}_{y}(\omega)\}}+\frac{e^{w/\lambda}}{\lambda(e^{a/\lambda}-1)}{\bf 1}_{\{t\geq\tau^{\pi}_{y}(\omega)\}}$, $(t,w,\omega)\in[0,\infty)\times[0,a]\times\Omega$. Then, it is readily seen that $J(x,\hat{\pi})\geq J(y,\pi)$, for $a>1$. Thus $V(x)\geq J(x,\hat{\pi})\geq V(y)$, proving (1), as $\pi\in\mathscr{A}(y)$ is arbitrary.

(2) By definition $\int_{0}^{a}w\pi_{t}(w)\,dw\leq a$, and $-\int_{0}^{a}\ln(\pi_{t}(w))\pi_{t}(w)\,dw\leq\ln a$ by the well-known Kullback-Leibler divergence property. Thus ${\cal H}_{\lambda}^{\pi}(t)\leq\lambda\ln a+a$, and then $V(x)\leq\frac{\lambda\ln a+a}{c}$. On the other hand, since $\hat{\boldsymbol{\pi}}(w,x)\equiv\frac{1}{a}$, $(x,w)\in\mathbb{R}_{+}\times[0,a]$, is admissible and $J(x,\hat{\boldsymbol{\pi}})\geq 0$ for $a>1$, the conclusion follows.

We remark that in optimal dividend control problems it is often assumed that the maximal dividend rate is greater than the average return rate (that is, $a>2\mu$), and that the average return of a surplus process $X$, including the safety loading, is higher than the interest rate $c$. These, together with Proposition 2.3, lead to the following standing assumption that will be used throughout the paper.

Assumption 2.4.

(i) The maximal dividend rate $a$ satisfies $a>\max\{1,2\mu\}$; and

(ii) the average return $\mu$ satisfies $\mu>\max\{c,\sigma^{2}/2\}$.

3 The Value Function and Its Regularity

In this section we study the value function of the relaxed control problem (2.9). We note that while most of the results are well-understood, some details still require justification, especially concerning the regularity, due particularly to the non-smoothness of the exit time $\tau_{x}$.

We begin by recalling the Bellman optimality principle (cf. e.g., [24]):

$$V(x)=\sup_{{\pi}(\cdot)\in\mathscr{A}(x)}\mathbb{E}_{x}\Big[\int_{0}^{s\wedge\tau^{\pi}}e^{-ct}{\cal H}_{\lambda}^{\pi}(t)\,dt+e^{-c(s\wedge\tau^{\pi})}V(X_{s\wedge\tau^{\pi}}^{\pi})\Big],\qquad s>0.$$

Noting that $V(0)=0$, we can (formally) argue that $V$ satisfies the HJB equation:

$$\begin{cases}\displaystyle cv(x)=\sup_{{\pi}\in\mathbb{L}^{1}[0,a]}\int_{0}^{a}\Big[w-\lambda\ln\pi(w)+\frac{1}{2}\sigma^{2}v^{\prime\prime}(x)+(\mu-w)v^{\prime}(x)\Big]\pi(w)\,dw;\\ v(0)=0.\end{cases}\tag{3.1}$$

Next, by a Lagrange multiplier argument and the calculus of variations (see [10]), we can find the maximizer of the right hand side of (3.1) and obtain the optimal feedback control, which has the following Gibbs form, assuming all derivatives exist:

$${\boldsymbol{\pi}}^{*}(w,x)=G\big(w,1-v^{\prime}(x)\big),\tag{3.2}$$

where $G(w,y)=\frac{y}{\lambda[e^{\frac{a}{\lambda}y}-1]}\,e^{\frac{w}{\lambda}y}{\bf 1}_{\{y\neq 0\}}+\frac{1}{a}{\bf 1}_{\{y=0\}}$ for $y\in\mathbb{R}$. Plugging (3.2) into (3.1), we see that the HJB equation (3.1) becomes the following second order ODE:

$$\frac{1}{2}\sigma^{2}v^{\prime\prime}(x)+f(v^{\prime}(x))-cv(x)=0,\quad x\geq 0;\qquad v(0)=0,\tag{3.3}$$

where the function $f$ is defined by

$$f(z):=\Big\{\mu z+\lambda\ln\Big[\frac{\lambda(e^{\frac{a}{\lambda}(1-z)}-1)}{1-z}\Big]\Big\}{\bf 1}_{\{z\neq 1\}}+[\mu+\lambda\ln a]{\bf 1}_{\{z=1\}}.\tag{3.4}$$
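The Gibbs kernel $G$ is the basic building block of all the randomized policies below (cf. (3.2), (4.3) and (4.9)). The following is a small numerical sketch of $G$ and of exact sampling from it via its closed-form CDF; the parameter values $a=2.5$, $\lambda=0.1$ are purely illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gibbs_density(w, y, a=2.5, lam=0.1):
    """Gibbs density G(w, y) on [0, a] from (3.2); y plays the role of 1 - v'(x)."""
    if abs(y) < 1e-12:                                   # removable singularity at y = 0
        return np.full_like(np.asarray(w, dtype=float), 1.0 / a)
    return y / (lam * (np.exp(a * y / lam) - 1.0)) * np.exp(np.asarray(w) * y / lam)

def gibbs_sample(y, size, a=2.5, lam=0.1, rng=None):
    """Exact samples from G(., y) by inverting the CDF
    F(w) = (e^{wy/lam} - 1) / (e^{ay/lam} - 1)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=size)
    if abs(y) < 1e-12:
        return a * u                                     # uniform case
    return (lam / y) * np.log1p(u * np.expm1(a * y / lam))

w_grid = np.linspace(0.0, 2.5, 1001)
print(np.trapz(gibbs_density(w_grid, y=-0.5), w_grid))   # ~ 1.0: normalization check
print(gibbs_sample(y=-0.5, size=5, rng=np.random.default_rng(0)).round(3))
```

For $y<0$ (i.e. $v^{\prime}(x)>1$) the mass concentrates near $w=0$ (pay little), while for $y>0$ it concentrates near $w=a$, consistent with the intuition behind classical threshold strategies.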

The following result regarding the function $f$ is important in our discussion.

Proposition 3.1.

The function $f$ defined by (3.4) enjoys the following properties:

(1) $f(z)=\mu z+\lambda\ln w(z)$ for all $z\in\mathbb{R}$, where $w(z)=a+\sum_{n=2}^{\infty}\frac{a^{n}(1-z)^{n-1}}{n!\lambda^{n-1}}$, $z\in\mathbb{R}$. In particular, $f\in\mathbb{C}^{\infty}(\mathbb{R})$;

(2) the function $f(\cdot)$ is convex and has a unique intersection point with $k(x)=\mu x$, $x\in\mathbb{R}$. Moreover, the abscissa $H$ of the intersection point satisfies $H\in(1,1+\lambda)$.

Proof. (1) Since the function $w(z)$ is an entire function and $w(z)>0$ for $z\neq 1$, $\ln[w(z)]$ is infinitely many times differentiable for all $z\neq 1$. On the other hand, since $w(1)=a>1$, by the continuity of $w(z)$, there exists $r>0$ such that $w(z)\in(\frac{a}{2},\frac{3a}{2})$ whenever $|z-1|<r$. Thus $\ln[w(z)]$ is infinitely many times differentiable for $|z-1|<r$ as well. Consequently, $\ln[w(z)]\in\mathbb{C}^{\infty}(\mathbb{R})$, whence $f\in\mathbb{C}^{\infty}(\mathbb{R})$ by extension.

(2) The convexity of the function $f$ follows from a direct calculation of $f^{\prime\prime}(z)$ for $z\in\mathbb{R}$. Define $\tilde{w}(z):=\lambda\ln[w(z)]=f(z)-\mu z$. It is straightforward to show that $\tilde{w}(1)>0$, $\tilde{w}(1+\lambda)<0$, and $\tilde{w}^{\prime}(z)<0$. Thus $\tilde{w}(H)=0$ for some (unique) $H\in(1,1+\lambda)$, proving (2).
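As a quick numerical illustration of Proposition 3.1 (not part of the proof), one can evaluate $f$ on a grid, check convexity via second differences, and locate $H$ by a root search on $(1,1+\lambda)$; the parameter values below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq

mu, a, lam = 1.0, 2.5, 0.1          # illustrative values with a > max{1, 2*mu}

def f(z):
    """The function f in (3.4); the case z = 1 is the removable singularity."""
    if abs(z - 1.0) < 1e-10:
        return mu + lam * np.log(a)
    return mu * z + lam * np.log(lam * np.expm1(a * (1.0 - z) / lam) / (1.0 - z))

z = np.linspace(0.0, 2.0, 2001)
fz = np.array([f(zi) for zi in z])
print(np.all(np.diff(fz, 2) >= -1e-9))       # convexity: second differences >= 0

# H solves f(H) = mu*H, i.e. lam*ln(w(H)) = 0, and lies in (1, 1 + lam)
H = brentq(lambda s: f(s) - mu * s, 1.0 + 1e-8, 1.0 + lam)
print(H)                                      # indeed falls inside (1, 1 + lam)
```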

We should note that (3.3) can be viewed as either a boundary value problem of an elliptic PDE with unbounded domain $[0,\infty)$ or a second order ODE defined on $[0,\infty)$. But in either case, there is missing information on boundary/initial conditions. Therefore the well-posedness of the classical solution is actually not trivial.

Let us first consider the equation (3.3) as an ODE defined on $[0,\infty)$. Since the value function is non-decreasing by Proposition 2.3, for the sake of argument let us first consider (3.3) as an ODE with initial condition $v(0)=0$ and $v^{\prime}(0)=\alpha>0$. By denoting $X_{1}(x)=v(x)$ and $X_{2}(x)=v^{\prime}(x)$, we see that (3.3) is equivalent to the following system of first order ODEs: for $x\in[0,\infty)$,

$$\begin{cases}X_{1}^{\prime}=X_{2},&X_{1}(0)=v(0)=0;\\ X_{2}^{\prime}=\dfrac{2c}{\sigma^{2}}X_{1}-\dfrac{2}{\sigma^{2}}f(X_{2}),&X_{2}(0)=v^{\prime}(0).\end{cases}\tag{3.5}$$

Here $f$ is an entire function. Let us define $\tilde{X}_{1}:=X_{1}-\frac{f(0)}{c}$, $X:=(\tilde{X}_{1},X_{2})^{T}$, $A:=\begin{bmatrix}0&1\\ \frac{2c}{\sigma^{2}}&-\frac{2}{\sigma^{2}}h^{\prime}(0)\end{bmatrix}$ and $q(X)=\begin{bmatrix}0\\ -\frac{2}{\sigma^{2}}k(X_{2})\end{bmatrix}$, where $h(y):=f(y)-f(0)=yh^{\prime}(0)+\sum_{n=2}^{\infty}\frac{h^{(n)}(0)y^{n}}{n!}=yh^{\prime}(0)+k(y)$. Then $X$ satisfies the following system of ODEs:

$$X^{\prime}=AX+q(X),\qquad X(0)=\big(-f(0)/c,\;v^{\prime}(0)\big)^{T}.\tag{3.6}$$

It is easy to check that $A$ has eigenvalues $\lambda_{1,2}=\frac{-h^{\prime}(0)\mp\sqrt{2c\sigma^{2}+h^{\prime}(0)^{2}}}{\sigma^{2}}$, with $\lambda_{1}<0<\lambda_{2}$. Now, let $Y=PX$, where $P$ is such that $PAP^{-1}=\mathrm{diag}[\lambda_{1},\lambda_{2}]:=B$. Then $Y$ satisfies

$$Y^{\prime}=BY+g(Y),\qquad Y(0)=PX(0),\tag{3.7}$$

where $g(Y)=Pq(P^{-1}Y)$. Since $\nabla_{Y}g(Y)$ exists and tends to $0$ as $|Y|\to 0$, and $\lambda_{1}<0<\lambda_{2}$, we can follow the argument of [6, Theorem 13.4.3] to construct a solution $\tilde{\phi}$ to (3.7) such that $|\tilde{\phi}(x)|\leq Ce^{-\alpha x}$ for some constant $C>0$, so that $|\tilde{\phi}(x)|\to 0$ as $x\to\infty$. Consequently, the function $\phi(x):=P^{-1}\tilde{\phi}(x)$ is a solution to (3.6) satisfying $|\phi(x)|\to 0$ as $x\to\infty$. In other words, (3.5) has a solution such that $(X_{1}(x),X_{2}(x))\to(0+\frac{f(0)}{c},0)=(\frac{f(0)}{c},0)$ as $x\to\infty$. We summarize the discussion above as the following result.

Proposition 3.2.

The differential equation (3.3) has a classical solution $v$ that enjoys the following properties:

(i) $\lim_{x\to\infty}v(x)=\frac{f(0)}{c}$;

(ii) $v^{\prime}(0)>0$ and $\lim_{x\to\infty}v^{\prime}(x)=0$;

(iii) $v$ is increasing and concave.

Proof. Following the discussion preceding the proposition, we know that a classical solution $v$ to (3.3) satisfying (i) and (ii) exists. We need only check (iii).

To this end, we shall follow an argument of [20]. Let us first formally differentiate (3.3) to get $v^{\prime\prime\prime}(x)=\frac{2c}{\sigma^{2}}v^{\prime}(x)-\frac{2}{\sigma^{2}}f^{\prime}(v^{\prime}(x))v^{\prime\prime}(x)$, $x\in[0,\infty)$. Since $v\in\mathbb{C}^{2}_{b}([0,\infty))$, denoting $m(x):=v^{\prime}(x)$, we can write

$$m^{\prime\prime}(x)=\frac{2c}{\sigma^{2}}m(x)-\frac{2}{\sigma^{2}}f^{\prime}(m(x))m^{\prime}(x),\qquad x\in[0,\infty).$$

Now, noting Proposition 3.1, we define a change of variables such that for $x\in[0,\infty)$, $\varphi(x):=\int_{0}^{x}\exp\big[-\int_{0}^{v}\frac{2}{\sigma^{2}}f^{\prime}(m(w))\,dw\big]\,dv$, and denote $l(y)=m(\varphi^{-1}(y))$, $y\in(0,\infty)$. Since $\varphi(0)=0$ and $\varphi^{\prime}(0)=1$, we can define $\varphi^{-1}(0)=0$ as well. Then we see that

$$l^{\prime\prime}(y)=[\varphi^{\prime}(\varphi^{-1}(y))]^{-2}\frac{2c}{\sigma^{2}}l(y),\quad y\in(0,\infty);\qquad l(0)=m(0)=v^{\prime}(0)=\alpha>0.\tag{3.8}$$

Since (3.8) is a homogeneous linear ODE, by uniqueness $l(0)=\alpha>0$ implies that $l(y)>0$, $y\geq 0$. That is, $m(x)=v^{\prime}(x)>0$, $x\geq 0$, and $v$ is (strictly) increasing.

Finally, from (3.8) we see that $l(y)>0$, $y\in[0,\infty)$, also implies that $l^{\prime\prime}(y)>0$, $y\in[0,\infty)$. Thus $l(\cdot)$ is convex on $[0,+\infty)$, and hence would be unbounded unless $l^{\prime}(y)\leq 0$ for all $y\in[0,\infty)$. This, together with the fact that $v(x)$ is a bounded and increasing function, shows that $l(\cdot)$ (i.e., $v^{\prime}(\cdot)$) can only be decreasing and convex, thus $v^{\prime\prime}(x)$ (i.e., $l^{\prime}(y)$) $\leq 0$, proving the concavity of $v$, whence the proposition.
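Numerically, the classical solution of Proposition 3.2 can be approximated by a shooting method on the system (3.5): integrate from $x=0$ with a trial slope $v^{\prime}(0)=\alpha$ and bisect on $\alpha$ until the trajectory neither crosses $\{v^{\prime}=0\}$ nor overshoots $f(0)/c$. The sketch below is only a heuristic search for the stable manifold under illustrative parameters (the bracket is assumed to contain the true $v^{\prime}(0)$); it is not a substitute for the existence argument above.

```python
import numpy as np
from scipy.integrate import solve_ivp

mu, sigma, a, lam, c = 1.0, 1.0, 2.5, 0.1, 0.5    # illustrative, satisfying Assumption 2.4

def f(z):
    if abs(z - 1.0) < 1e-10:
        return mu + lam * np.log(a)
    return mu * z + lam * np.log(lam * np.expm1(a * (1.0 - z) / lam) / (1.0 - z))

v_inf = f(0.0) / c                                # the limit value in Proposition 3.2(i)

def shoot(alpha, x_max=50.0):
    """Integrate (3.5) with v(0)=0, v'(0)=alpha; report on which side of the
    stable manifold the trajectory leaves."""
    hit_low = lambda x, y: y[1]                   # v' reaches 0: alpha too small
    hit_high = lambda x, y: y[0] - v_inf          # v exceeds f(0)/c: alpha too large
    hit_low.terminal = hit_high.terminal = True
    sol = solve_ivp(lambda x, y: [y[1], (2 * c / sigma**2) * y[0] - (2 / sigma**2) * f(y[1])],
                    (0.0, x_max), [0.0, alpha], events=[hit_low, hit_high],
                    max_step=0.01, rtol=1e-8, atol=1e-10)
    return -1 if sol.t_events[0].size else +1

lo, hi = 1e-2, 50.0                               # bisection bracket for v'(0)
for _ in range(40):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if shoot(mid) < 0 else (lo, mid)
print("estimated v'(0) ~", round(0.5 * (lo + hi), 4))
```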

Viscosity Solution of (3.3). We note that Proposition 3.2 requires that $v^{\prime}(0)$ exists, which is not a priori known. We now consider (3.3) as an elliptic PDE defined on $[0,\infty)$ and argue that it possesses a unique bounded viscosity solution. We will then identify its value $v^{\prime}(0)$ and argue that it must coincide with the classical solution identified in Proposition 3.2.

To begin with, let us first recall the notion of viscosity solution to (3.3). For $\mathbb{D}\subseteq\mathbb{R}$, we denote the set of all upper (resp. lower) semicontinuous functions in $\mathbb{D}$ by USC$(\mathbb{D})$ (resp. LSC$(\mathbb{D})$).

Definition 3.3.

We say that $u\in{\rm USC}([0,+\infty))$ is a viscosity sub-(resp. super-)solution of (3.3) on $[0,+\infty)$, if $u(0)=0$ and for any $x\in(0,+\infty)$ and $\varphi\in\mathbb{C}^{2}(\mathbb{R})$ such that $0=[u-\varphi](x)=\max_{y\in(0,+\infty)}$ (resp. $\min_{y\in(0,+\infty)}$) $[u-\varphi](y)$, it holds that

$$\frac{1}{2}\sigma^{2}\varphi^{\prime\prime}(x)+f(\varphi^{\prime}(x))-cu(x)\geq(\text{resp.}\ \leq)\ 0.$$

We say that $u\in\mathbb{C}([0,+\infty))$ is a viscosity solution of (3.3) on $[0,+\infty)$ if it is both a viscosity subsolution and a viscosity supersolution of (3.3) on $[0,+\infty)$.

We first show that both a viscosity subsolution and a viscosity supersolution to (3.3) exist. To see this, for $x\in[0,\infty)$, consider the following two functions:

$$\underline{\psi}(x):=1-e^{-x},\qquad\overline{\psi}(x):=\frac{A}{M}\big(1-e^{-M(x\wedge b)}\big),\tag{3.9}$$

where $A$, $M$, $b>0$ are constants satisfying $M>2\mu/\sigma^{2}$ and the following constraints:

$$\left\{\begin{array}{l}\dfrac{1}{M}\Big\{\ln\Big(\dfrac{A}{A-M}\Big)\vee\ln\Big(\dfrac{A}{A-\frac{f(0)}{c}M}\Big)\Big\}<b<\dfrac{1}{M}\Big\{\ln\dfrac{A}{H}\wedge\ln\Big(\dfrac{\sigma^{2}}{2\mu}M\Big)\Big\};\\[8pt] A>\max\Big\{M+H,~\dfrac{f(0)}{c}M+H,~\dfrac{\sigma^{2}M^{2}}{\sigma^{2}M-2\mu},~\dfrac{f(0)}{c}\cdot\dfrac{\sigma^{2}M^{2}}{\sigma^{2}M-2\mu}\Big\}.\end{array}\right.\tag{3.12}$$

Proposition 3.4.

Assume that Assumption 2.4 holds, and let $\underline{\psi}$, $\overline{\psi}$ be defined by (3.9). Then $\underline{\psi}(\cdot)$ is a viscosity subsolution of (3.3) on $[0,\infty)$, and $\overline{\psi}(\cdot)$ is a viscosity supersolution of (3.3) on $[0,\infty)$. Furthermore, it holds that $\underline{\psi}(x)\leq\overline{\psi}(x)$ on $[0,\infty)$.

Proof. We first show that $\underline{\psi}$ is a viscosity subsolution. To see this, note that $\underline{\psi}^{\prime}(x)=e^{-x}$ and $\underline{\psi}^{\prime\prime}(x)=-e^{-x}$ on $(0,+\infty)$. By Assumption 2.4, Proposition 3.1, and the fact $f^{\prime}(1)<0$, we have

$$\frac{1}{2}\sigma^{2}\underline{\psi}^{\prime\prime}(x)+f(\underline{\psi}^{\prime}(x))-c\underline{\psi}(x)=-\frac{1}{2}\sigma^{2}e^{-x}+f(e^{-x})-c(1-e^{-x})$$
$$\geq-\frac{1}{2}\sigma^{2}e^{-x}+\mu+\lambda\ln a-c(1-e^{-x})=\Big(c-\frac{1}{2}\sigma^{2}\Big)e^{-x}+\mu+\lambda\ln a-c>0.$$

That is, $\underline{\psi}(x)$ is a viscosity subsolution of (3.3) on $[0,\infty)$.

To prove that $\overline{\psi}$ is a supersolution of (3.3), we take the following three steps:

(i) Note that $\overline{\psi}^{\prime}(x)=Ae^{-Mx}$ and $\overline{\psi}^{\prime\prime}(x)=-AMe^{-Mx}$ for all $x\in(0,b)$. Let $H$ be the abscissa of the intersection point of $f(x)$ and $k(x)$; then $H\in(1,1+\lambda)$, thanks to Proposition 3.1.

Since $A>H$ and $b<(1/M)\ln(A/H)$ (i.e. $Ae^{-Mb}>H$), we have $f(A)<\mu A$ and $f(Ae^{-Mb})<\mu Ae^{-Mb}$. Also, since $M>2\mu/\sigma^{2}$ and $b<(1/M)\ln(\sigma^{2}M/2\mu)$, we have $f(A)<\frac{1}{2}\sigma^{2}AMe^{-Mb}$ and $f(Ae^{-Mb})<\frac{1}{2}\sigma^{2}AMe^{-Mb}$.

Thus, by Assumption 2.4 and the constraints $A>\max\{M,H\}$, $M>\frac{2\mu}{\sigma^{2}}$ and $b<\frac{1}{M}\min\big\{\ln\frac{A}{H},\,\ln\big(\frac{\sigma^{2}}{2\mu}M\big)\big\}$, we have $-\frac{1}{2}\sigma^{2}AMe^{-Mx}+f(Ae^{-Mx})-c\frac{A}{M}(1-e^{-Mx})<0$ for $x\in(0,b)$. That is, $\overline{\psi}(x)$ is a viscosity supersolution of (3.3) on $[0,b)$.

(ii) Next, noting that $A>\frac{f(0)M}{c}$ and $b>\frac{1}{M}\ln\big(\frac{A}{A-\frac{f(0)M}{c}}\big)$, we see that $f(0)-c\frac{A}{M}(1-e^{-Mb})<0$ for $x\in(b,\infty)$, and it follows that $\overline{\psi}(x)$ is a viscosity supersolution of (3.3) on $(b,\infty)$.

(iii) Finally, for $x=b$, it is clear that there is no test function satisfying the definition of supersolution. We thus conclude that $\overline{\psi}(x)$ is a viscosity supersolution of (3.3) on $[0,\infty)$.

Furthermore, noting Assumption 2.4, $A>M$ and $b>\frac{1}{M}\ln\big(\frac{A}{A-M}\big)$, some direct calculations show that $\underline{\psi}(x)\leq\overline{\psi}(x)$ on $[0,\infty)$, proving the proposition.

We now follow Perron's method to prove the existence of a (bounded) viscosity solution for (3.3). We first recall the following definition (see, e.g., [2]).

Definition 3.5.

A function $\varphi\in\mathbb{L}^{1}(\mathbb{R}_{+})$ is said to be of class $(L)$ if

(1) $\varphi$ is increasing with respect to $x$ on $[0,+\infty)$;

(2) $\varphi$ is bounded on $[0,+\infty)$.

Now let $\underline{\psi}$ and $\overline{\psi}$ be defined by (3.9), and consider the set

$$\mathfrak{F}:=\{u\in\mathbb{C}(\mathbb{R}_{+})\ |\ \underline{\psi}\leq u\leq\overline{\psi};~u\ \text{is a class}\ (L)\ \text{viscosity subsolution to (3.3)}\}.\tag{3.13}$$

Clearly, $\underline{\psi}\in\mathfrak{F}$, so $\mathfrak{F}\neq\emptyset$. Define $\hat{v}(x)=\sup_{u\in\mathfrak{F}}u(x)$, $x\in[0,+\infty)$, and let $v^{*}$ (resp. $v_{*}$) be the USC (resp. LSC) envelope of $\hat{v}$, defined respectively by

$$v^{*}(x)\ \big(\text{resp. }v_{*}(x)\big):=\varlimsup_{r\downarrow 0}\ \big(\text{resp. }\varliminf_{r\downarrow 0}\big)\ \{\hat{v}(y):y\in(0,+\infty),\,|y-x|\leq r\}.$$
Theorem 1.

$v^{*}$ (resp. $v_{*}$) is a viscosity sub-(resp. super-)solution of class $(L)$ to (3.3) on $\mathbb{R}_{+}$.

Proof. The proof is the same as a similar result in [2]. We omit it here.  

Note that by definition we have $v_{*}(x)\leq v^{*}(x)$, $x\in[0,\infty)$. Thus, given Theorem 1, one can derive the existence and uniqueness of the viscosity solution to (3.3) of class (L) by the following comparison principle, which can be argued along the lines of [7, Theorem 5.1]; we omit the proof.

Theorem 2 (Comparison Principle).

Let $\bar{v}$ be a viscosity supersolution and $\underline{v}$ a viscosity subsolution of (3.3), both of class $(L)$. Then $\underline{v}\leq\bar{v}$. Consequently, $v^{*}=v_{*}=\hat{v}$ is the unique viscosity solution of class $(L)$ to (3.3) on $[0,+\infty)$.

Following our discussion we can easily raise the regularity of the viscosity solution.

Corollary 3.6.

Let $v$ be a viscosity solution of class (L) to the HJB equation (3.1). Then $v$ has a right-derivative $v^{\prime}(0+)>0$, and consequently $v\in\mathbb{C}^{2}_{b}([0,\infty))$. Furthermore, $v$ is concave and satisfies $\lim_{x\to\infty}v(x)=f(0)/c$ and $\lim_{x\to+\infty}v^{\prime}(x)=0$.

Proof. Let $v$ be a viscosity solution of class (L) to (3.1). We first claim that $v^{\prime}(0+)>0$ exists. Indeed, consider the subsolution $\underline{\psi}$ and supersolution $\overline{\psi}$ defined by (3.9). Applying Theorem 2, for any $x>0$ small enough we have

$$\frac{1-e^{-x}}{x}=\frac{\underline{\psi}(x)}{x}\leq\frac{v(x)}{x}\leq\frac{\overline{\psi}(x)}{x}=\frac{A}{M}\,\frac{1-e^{-Mx}}{x}.$$

Sending $x\searrow 0$ we obtain that $1\leq\varliminf_{x\searrow 0}\frac{v(x)}{x}\leq\varlimsup_{x\searrow 0}\frac{v(x)}{x}\leq A$. Since $v$ is of class (L), whence increasing, $v^{\prime}(0+)$ exists and $v^{\prime}(0+)=\alpha\geq 1>0$. Then it follows from Proposition 3.2 that the ODE (3.3) has a bounded classical solution in $\mathbb{C}^{2}_{b}([0,\infty))$ satisfying $v^{\prime}(0+)=\alpha$, which is increasing and concave; hence it is also a viscosity solution to (3.3) of class ($L$). But by Theorem 2 the bounded viscosity solution to (3.3) of class ($L$) is unique, thus the viscosity solution $v\in\mathbb{C}^{2}_{b}([0,\infty))$. The rest of the properties are consequences of Proposition 3.2.

Verification Theorem and Optimal Strategy. Having argued the well-posedness of ODE (3.3) in both the classical and the viscosity sense, we now look at its connection to the value function. We have the following Verification Theorem.

Theorem 3.

Assume that Assumption 2.4 is in force. Then, the value function $V$ defined in (2.9) is a viscosity solution of class ($L$) to the HJB equation (3.3). More precisely, it holds that

$$V(x)=\sup_{v\in\mathfrak{F}}v(x):=\hat{v}(x),\qquad x\in[0,+\infty),\tag{3.14}$$

where the set $\mathfrak{F}$ is defined by (3.13). Moreover, $V$ coincides with the classical solution of (3.3) described in Proposition 3.2, and the optimal control has the following form:

$$\pi^{*}_{t}(w)=G\big(w,1-V^{\prime}(X^{\pi^{*}}_{t})\big).\tag{3.15}$$

Proof. The proof that $V$ is a viscosity solution satisfying (3.14) is more or less standard (see, e.g., [24]), and Proposition 2.3 shows that $V$ must be of class ($L$). It then follows from Corollary 3.6 that $V^{\prime}(0+)$ exists and $V$ is the (unique) classical solution of (3.3). It remains to show that $\pi^{*}$ defined by (3.15) is optimal. To this end, note that $|{\cal H}_{\lambda}^{\pi^{*}}(t)|=\big|\int_{0}^{a}\bar{f}(w,V^{\prime}(X_{t}^{*}))\pi_{t}^{*}(w)\,dw\big|$, where $\bar{f}(w,z):=\big\{wz+\lambda\ln\big[\frac{\lambda(e^{\frac{a}{\lambda}(1-z)}-1)}{1-z}\big]\big\}{\bf 1}_{\{z\neq 1\}}+[w+\lambda\ln a]{\bf 1}_{\{z=1\}}$. Thus

$$\mathbb{E}_{x}\Big[\int_{0}^{\tau^{\pi^{*}}}e^{-ct}|{\cal H}_{\lambda}^{\pi^{*}}(t)|\,dt\Big]=\mathbb{E}_{x}\Big[\int_{0}^{\tau^{\pi^{*}}}e^{-ct}\Big|\int_{0}^{a}\bar{f}(w,V^{\prime}(X_{t}^{*}))\pi_{t}^{*}(w)\,dw\Big|\,dt\Big]<+\infty,$$

as $V^{\prime}(X_{t}^{*})\in(0,V^{\prime}(0+)]$, thanks to the concavity of $V$. Consequently $\pi^{*}\in\mathscr{A}(x)$.

Finally, since $V\in\mathbb{C}^{2}_{b}([0,\infty))$ and $\pi^{*}$ defined by (3.15) is obviously the maximizer of the Hamiltonian in the HJB equation (3.1), the optimality of $\pi^{*}$ follows from a standard argument via Itô's formula. We omit it.

4 Policy Update

We now turn to an important step in the RL scheme, namely the so-called Policy Update. More precisely, we prove a Policy Improvement Theorem which states that for any closed-loop policy $\boldsymbol{\pi}\in\mathscr{A}_{cl}(x)$, we can construct another policy $\tilde{\boldsymbol{\pi}}\in\mathscr{A}_{cl}(x)$ such that $J(x,\tilde{\boldsymbol{\pi}})\geq J(x,\boldsymbol{\pi})$. Furthermore, we argue that such a policy updating procedure can be constructed without using the system parameters, and we shall discuss the convergence of the iterations to the optimal policy.

To begin with, for $x\in\mathbb{R}$ and $\boldsymbol{\pi}\in\mathscr{A}_{cl}(x)$, let $X^{\boldsymbol{\pi},x}$ be the unique strong solution to the SDE (2.10). For $t>0$, we consider the process $\hat{W}_{s}:=W_{s+t}-W_{t}$, $s>0$. Then $\hat{W}$ is an $\hat{\mathbb{F}}$-Brownian motion, where $\hat{\cal F}_{s}={\cal F}_{s+t}$, $s>0$. Since the SDE (2.10) is time-homogeneous, the path-wise uniqueness then renders the flow property: $X_{r+t}^{\boldsymbol{\pi},x}=\hat{X}_{r}^{\boldsymbol{\pi},X_{t}^{\boldsymbol{\pi},x}}$, $r\geq 0$, where $\hat{X}$ satisfies the SDE

$$d\hat{X}_{s}=\Big(\mu-\int_{0}^{a}w\boldsymbol{\pi}(w,\hat{X}_{s})\,dw\Big)ds+\sigma d\hat{W}_{s},\quad s\geq 0;\qquad\hat{X}_{0}=X_{t}^{\boldsymbol{\pi},x}.\tag{4.1}$$

Now we denote $\hat{\pi}:=\boldsymbol{\pi}(\cdot,\hat{X}_{\cdot})\in\mathscr{A}_{ol}(X_{t}^{\boldsymbol{\pi},x})$ to be the open-loop strategy induced by the closed-loop control $\boldsymbol{\pi}$. Then the corresponding cost functional can be written as (denoting $X^{\boldsymbol{\pi}}=X^{\boldsymbol{\pi},x}$)

$$J(X_{t}^{\boldsymbol{\pi}};\boldsymbol{\pi})=\mathbb{E}_{X_{t}^{\boldsymbol{\pi}}}\Big[\int_{0}^{\tau^{\boldsymbol{\pi}}_{X_{t}^{\boldsymbol{\pi}}}}e^{-cr}\Big[\int_{0}^{a}\big(w-\lambda\ln\hat{\pi}_{r}(w)\big)\hat{\pi}_{r}(w)\,dw\Big]dr\Big],\quad t\geq 0,\tag{4.2}$$

where $\tau^{\boldsymbol{\pi}}_{X_{t}^{\boldsymbol{\pi},x}}=\inf\{r>0:\hat{X}_{r}^{\boldsymbol{\pi},X_{t}^{\boldsymbol{\pi},x}}<0\}$. It is clear that, by the flow property, we have $\tau^{\boldsymbol{\pi}}_{x}=\tau^{\boldsymbol{\pi}}_{X_{t}^{\boldsymbol{\pi},x}}+t$, $\mathbb{P}$-a.s. on $\{\tau^{\boldsymbol{\pi}}_{x}>t\}$. Next, for any admissible policy $\boldsymbol{\pi}\in\mathscr{A}_{cl}$, we formally define a new feedback control policy as follows: for $(w,x)\in[0,a]\times\mathbb{R}^{+}$,

$$\tilde{\boldsymbol{\pi}}(w,x):=G\big(w,1-J^{\prime}(x;\boldsymbol{\pi})\big),\tag{4.3}$$

where $G(\cdot,\cdot)$ is the Gibbs function defined by (3.2). We would like to emphasize that the new policy $\tilde{\boldsymbol{\pi}}$ in (4.3) depends on $J$ and $\boldsymbol{\pi}$, but is independent of the coefficients $(\mu,\sigma)$(!). To facilitate the argument we introduce the following definition.

Definition 4.1.

A function $x\mapsto{\boldsymbol{\pi}}(\cdot;x)\in\mathscr{P}([0,a])$ is called “Strongly Admissible” if its density function enjoys the following properties:

(i) there exist $u,l>0$ such that $l\leq\boldsymbol{\pi}(x,w)\leq u$, $x\in\mathbb{R}^{+}$ and $w\in[0,a]$;

(ii) there exists $K>0$ such that $|\boldsymbol{\pi}(x,w)-\boldsymbol{\pi}(y,w)|\leq K|x-y|$, $x,y\in\mathbb{R}^{+}$, uniformly in $w$.

The set of strongly admissible controls is denoted by $\mathscr{A}^{s}_{cl}$.

The following lemma justifies Definition 4.1.

Lemma 1.

Suppose that a function $x\mapsto{\boldsymbol{\pi}}(\cdot;x)\in\mathscr{P}([0,a])$ has a density of the form $\boldsymbol{\pi}(w,x)=G(w,c(x))$, where $c\in\mathbb{C}^{1}_{b}(\mathbb{R}_{+})$. Then $\boldsymbol{\pi}\in\mathscr{A}^{s}_{cl}$.

Proof. Since $c\in\mathbb{C}^{1}_{b}(\mathbb{R}_{+})$, let $K>0$ be such that $|c(x)|+|c^{\prime}(x)|\leq K$, $x\in\mathbb{R}_{+}$. Next, note that $G$ is positive and continuous, and for fixed $w$, $G(w,\cdot)\in\mathbb{C}^{\infty}(\mathbb{R})$; hence there exist constants $0<l<u$ such that $l\leq G(w,y)\leq u$ and $|G_{y}(w,y)|\leq u$, $(w,y)\in[0,a]\times[-K,K]$. Consequently, we have $l\leq\boldsymbol{\pi}(x,w)=G(w,c(x))\leq u$, $(w,x)\in[0,a]\times\mathbb{R}^{+}$, and $\boldsymbol{\pi}(\cdot,w)=G(w,c(\cdot))$ is uniformly Lipschitz on $\mathbb{R}^{+}$, uniformly in $w\in[0,a]$, proving the lemma.

In what follows we shall use the following notations: for any $\boldsymbol{\pi}\in\mathscr{A}_{cl}$,

$$r^{\boldsymbol{\pi}}(x):=\int_{0}^{a}\big(w-\lambda\ln\boldsymbol{\pi}(w,x)\big)\boldsymbol{\pi}(w,x)\,dw;\qquad b^{\boldsymbol{\pi}}(x)=\mu-\int_{0}^{a}w\boldsymbol{\pi}(w,x)\,dw.\tag{4.4}$$

Clearly, for $\boldsymbol{\pi}\in\mathscr{A}_{cl}^{s}$, $b^{\boldsymbol{\pi}}$ and $r^{\boldsymbol{\pi}}$ are bounded and Lipschitz continuous. We denote $X:=X^{\boldsymbol{\pi},x}$ to be the solution to SDE (2.10), and rewrite the cost functional (2.8) as

$$J(x,\boldsymbol{\pi})=\mathbb{E}_{x}\Big[\int_{0}^{\tau_{x}^{\boldsymbol{\pi}}}e^{-cs}r^{\boldsymbol{\pi}}(X^{\boldsymbol{\pi},x}_{s})\,ds\Big],\tag{4.5}$$

where $\tau^{\boldsymbol{\pi}}_{x}=\inf\{t>0:X^{\boldsymbol{\pi},x}_{t}<0\}$. Thus, in light of the Feynman-Kac formula, for any $\boldsymbol{\pi}\in\mathscr{A}^{s}_{cl}$, $J(\cdot,\boldsymbol{\pi})$ is the probabilistic solution to the following ODE on $\mathbb{R}_{+}$:

$$L^{\boldsymbol{\pi}}[u](x)+r^{\boldsymbol{\pi}}(x):=\frac{1}{2}\sigma^{2}u_{xx}(x)+b^{\boldsymbol{\pi}}(x)u_{x}(x)-cu(x)+r^{\boldsymbol{\pi}}(x)=0,\qquad u(0)=0.\tag{4.6}$$

Now let us denote $u^{\boldsymbol{\pi}}_{R}$ to be the solution to the linear elliptic equation (4.6) on the finite interval $[0,R]$ with boundary conditions $u(0)=0$ and $u(R)=J(R,\boldsymbol{\pi})$. Then, by the regularity and the boundedness of $b^{\boldsymbol{\pi}}$ and $r^{\boldsymbol{\pi}}$, and using only the interior type Schauder estimates (cf. [11]), one can show that $u^{\boldsymbol{\pi}}_{R}\in\mathbb{C}^{2}_{b}([0,R])$ and the bounds of $(u^{\boldsymbol{\pi}}_{R})^{\prime}$ and $(u^{\boldsymbol{\pi}}_{R})^{\prime\prime}$ depend only on those of the coefficients $b^{\boldsymbol{\pi}}$, $r^{\boldsymbol{\pi}}$ and $J(\cdot,\boldsymbol{\pi})$, but are uniform in $R>0$. By sending $R\to\infty$ and applying the standard diagonalization argument (cf. e.g., [18]) one shows that $\lim_{R\to\infty}u^{\boldsymbol{\pi}}_{R}(\cdot)=J(\cdot,\boldsymbol{\pi})$, which satisfies (4.6). We summarize the above discussion as the following proposition for ready reference.

Proposition 4.2.

If $\boldsymbol{\pi}\in\mathscr{A}^{s}_{cl}$, then $J(\cdot,\boldsymbol{\pi})\in\mathbb{C}^{2}_{b}(\mathbb{R}^{+})$, and the bounds of $J^{\prime}$ and $J^{\prime\prime}$ depend only on those of $b^{\boldsymbol{\pi}}$, $r^{\boldsymbol{\pi}}$, and $J(\cdot,\boldsymbol{\pi})$.
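When the coefficients $(\mu,\sigma)$ are treated as known (e.g. to sanity-check the data-driven policy evaluation methods of §5-§6 on simulated data), $J(\cdot,\boldsymbol{\pi})$ can also be obtained directly by discretizing (4.6). Below is a minimal finite-difference sketch on a truncated interval $[0,R]$; the far-field condition $u(R)\approx r^{\boldsymbol{\pi}}(R)/c$ and all parameter values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def evaluate_policy(density, mu=1.0, sigma=1.0, a=2.5, c=0.5, lam=0.1, R=30.0, n=600):
    """Solve the linear ODE (4.6) for J(., pi) by central finite differences on [0, R],
    with J(0) = 0 and the crude far-field value J(R) ~ r^pi(R)/c."""
    x = np.linspace(0.0, R, n + 1)
    h = x[1] - x[0]
    w = np.linspace(0.0, a, 201)
    # b^pi(x) and r^pi(x) from (4.4), by quadrature over the action space
    b = np.empty(n + 1); r = np.empty(n + 1)
    for i, xi in enumerate(x):
        p = density(w, xi)
        b[i] = mu - np.trapz(w * p, w)
        r[i] = np.trapz((w - lam * np.log(p)) * p, w)
    # tridiagonal system: (sigma^2/2) u'' + b u' - c u = -r at the interior nodes
    main = np.full(n - 1, -sigma**2 / h**2 - c)
    lower = sigma**2 / (2 * h**2) - b[1:n] / (2 * h)      # coefficient of u_{i-1}
    upper = sigma**2 / (2 * h**2) + b[1:n] / (2 * h)      # coefficient of u_{i+1}
    A = np.diag(main) + np.diag(lower[1:], -1) + np.diag(upper[:-1], 1)
    rhs = -r[1:n]
    rhs[-1] -= upper[-1] * (r[n] / c)                     # fold in the boundary value at x = R
    u = np.concatenate(([0.0], np.linalg.solve(A, rhs), [r[n] / c]))
    return x, u

x, J = evaluate_policy(lambda w, xi: np.full_like(w, 1.0 / 2.5))   # uniform test policy
print(np.interp([0.5, 1.0, 2.0, 5.0], x, J).round(3))
```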

Our main result of this section is the following Policy Improvement Theorem.

Theorem 4.

Assume that Assumption 2.4 is in force. Let $\boldsymbol{\pi}\in\mathscr{A}^{s}_{cl}$ and let $\tilde{\boldsymbol{\pi}}$ be defined by (4.3) associated with $\boldsymbol{\pi}$. Then it holds that $J(x,\tilde{\boldsymbol{\pi}})\geq J(x,\boldsymbol{\pi})$, $x\in\mathbb{R}_{+}$.

Proof. Let $\boldsymbol{\pi}\in\mathscr{A}^{s}_{cl}$ be given, and let $\tilde{\boldsymbol{\pi}}$ be the corresponding control defined by (4.3). Since $\boldsymbol{\pi}\in\mathscr{A}^{s}_{cl}$, $b^{\boldsymbol{\pi}}$ and $r^{\boldsymbol{\pi}}$ are uniformly bounded, and by Proposition 4.2, $(1-J^{\prime}(\cdot,\boldsymbol{\pi}))\in\mathbb{C}^{1}_{b}(\mathbb{R}^{+})$. Thus Lemma 1 (with $c(x)=1-J^{\prime}(x,\boldsymbol{\pi})$) implies that $\tilde{\boldsymbol{\pi}}\in\mathscr{A}^{s}_{cl}$ as well. Moreover, since $\boldsymbol{\pi}\in\mathscr{A}^{s}_{cl}$, $J(\cdot,\boldsymbol{\pi})$ is a $\mathbb{C}^{2}$-solution to the ODE (4.6). Now, recalling that $\tilde{\boldsymbol{\pi}}\in\mathscr{A}^{s}_{cl}$ is the maximizer of $\sup_{\widehat{\boldsymbol{\pi}}\in\mathscr{A}^{s}_{cl}}[b^{\widehat{\boldsymbol{\pi}}}(x)J^{\prime}(x,\boldsymbol{\pi})+r^{\widehat{\boldsymbol{\pi}}}(x)]$, we have

$$L^{\tilde{\boldsymbol{\pi}}}[J(\cdot,\boldsymbol{\pi})](x)+r^{\tilde{\boldsymbol{\pi}}}(x)\geq 0,\qquad x\in\mathbb{R}_{+}.\tag{4.7}$$

Now, let us consider the process $X^{\tilde{\boldsymbol{\pi}}}$, the solution to (4.1) with $\boldsymbol{\pi}$ being replaced by $\tilde{\boldsymbol{\pi}}$. Applying Itô's formula to $e^{-ct}J(X^{\tilde{\boldsymbol{\pi}}}_{t},{\boldsymbol{\pi}})$ from $0$ to $\tau^{\tilde{\boldsymbol{\pi}}}_{x}\wedge T$, for any $T>0$, and noting the definitions of $b^{\tilde{\boldsymbol{\pi}}}$ and $r^{\tilde{\boldsymbol{\pi}}}$, we deduce from (4.7) that

$$e^{-c(\tau^{\tilde{\boldsymbol{\pi}}}_{x}\wedge T)}J(X_{\tau^{\tilde{\boldsymbol{\pi}}}_{x}\wedge T}^{\tilde{\boldsymbol{\pi}}},\boldsymbol{\pi})=J(x,\boldsymbol{\pi})+\int_{0}^{\tau^{\tilde{\boldsymbol{\pi}}}_{x}\wedge T}e^{-cr}L^{\tilde{\boldsymbol{\pi}}}[J(\cdot,\boldsymbol{\pi})](X_{r}^{\tilde{\boldsymbol{\pi}}})\,dr+\int_{0}^{\tau^{\tilde{\boldsymbol{\pi}}}_{x}\wedge T}e^{-cr}J^{\prime}(X_{r}^{\tilde{\boldsymbol{\pi}}},\boldsymbol{\pi})\,\sigma\,dW_{r}$$
$$\geq J(x,\boldsymbol{\pi})-\int_{0}^{\tau^{\tilde{\boldsymbol{\pi}}}_{x}\wedge T}e^{-cr}r^{\tilde{\boldsymbol{\pi}}}(X_{r}^{\tilde{\boldsymbol{\pi}}})\,dr+\int_{0}^{\tau^{\tilde{\boldsymbol{\pi}}}_{x}\wedge T}e^{-cr}J^{\prime}(X_{r}^{\tilde{\boldsymbol{\pi}}},\boldsymbol{\pi})\,\sigma\,dW_{r}.$$

Taking expectation on both sides above, sending $T\to\infty$ and noting that $J(X_{\tau^{\tilde{\boldsymbol{\pi}}}_{x}}^{\tilde{\boldsymbol{\pi}}},\boldsymbol{\pi})=J(0,\boldsymbol{\pi})=0$, we obtain that $J(x,\boldsymbol{\pi})\leq J(x,\tilde{\boldsymbol{\pi}})$, $x\in\mathbb{R}^{+}$, proving the theorem.

In light of Theorem 4 we can naturally define a “learning sequence” as follows. We start with $c_{0}\in\mathbb{C}^{1}_{b}(\mathbb{R}^{+})$, define $\boldsymbol{\pi}_{0}(x,w):=G(w,c_{0}(x))$ and $v_{0}(x):=J(x,\boldsymbol{\pi}_{0})$, and set

$$\boldsymbol{\pi}_{n}(x,w):=G\big(w,1-J^{\prime}(x,\boldsymbol{\pi}_{n-1})\big),\qquad(w,x)\in[0,a]\times\mathbb{R}^{+},\ \text{ for }n\geq 1.\tag{4.9}$$

Also for each $n\geq 1$, let $v_{n}(x):=J(x,\boldsymbol{\pi}_{n})$. The natural question is whether this learning sequence is actually a “maximizing sequence”, that is, $v_{n}(x)\nearrow v(x)$, as $n\to\infty$. Such a result would obviously justify the policy improvement scheme, and was proved in the LQ case in [23].
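To make the iteration concrete, the following sketch wires the improvement step (4.9) to a policy evaluation routine. It reuses the illustrative helpers `gibbs_density` and `evaluate_policy` from the earlier sketches (both assumptions of this illustration); in particular, the evaluation step here is model-based, solving (4.6) with known $(\mu,\sigma)$, whereas a genuine RL implementation would replace it with the data-driven schemes of §5-§6.

```python
import numpy as np

def policy_iteration(n_iter=10, a=2.5, lam=0.1):
    """Learning sequence (4.9): alternate a policy-evaluation step (here the
    finite-difference solver `evaluate_policy` above) with the Gibbs-form
    improvement pi_n(w, x) = G(w, 1 - J'(x, pi_{n-1}))."""
    density = lambda w, x: np.full_like(w, 1.0 / a)      # pi_0(w, x) = 1/a = G(w, 0)
    for n in range(n_iter):
        x, J = evaluate_policy(density)                  # PE step: J(., pi_n)
        dJ = np.gradient(J, x)                           # numerical J'(., pi_n)
        # PI step (4.9): freeze the current derivative and form the new Gibbs density
        density = lambda w, xi, dJ=dJ, x=x: gibbs_density(
            w, 1.0 - np.interp(xi, x, dJ), a=a, lam=lam)
        print(f"n = {n}:  J(1.0, pi_n) ~ {np.interp(1.0, x, J):.4f}")
    return density

pi_approx = policy_iteration()   # by Theorem 4 the printed values should be nondecreasing
```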

Before we proceed, we note that by Proposition 4.2 the learning sequence $v_{n}=J(\cdot,\boldsymbol{\pi}^{n})\in\mathbb{C}^{2}_{b}(\mathbb{R}_{+})$, $n\geq 1$, but the bounds may depend on the coefficients $b^{\boldsymbol{\pi}^{n}}$, $r^{\boldsymbol{\pi}^{n}}$, and thus may not be uniform in $n$. However, by the definition of $b^{\boldsymbol{\pi}^{n}}$ and Proposition 2.3, we see that $\sup_{n}\|b^{\boldsymbol{\pi}^{n}}\|_{\mathbb{L}^{\infty}(\mathbb{R}_{+})}+\|V\|_{\mathbb{L}^{\infty}(\mathbb{R}_{+})}\leq C$ for some $C>0$. Moreover, since for each $n\geq 1$, $J(\cdot,\boldsymbol{\pi}^{0})\leq J(\cdot,\boldsymbol{\pi}^{n})\leq V(\cdot)$, if we choose $\boldsymbol{\pi}^{0}\in\mathscr{A}^{s}_{cl}$ such that $J(x,\boldsymbol{\pi}^{0})\geq 0$ (e.g., $\boldsymbol{\pi}^{0}_{t}\equiv\frac{1}{a}$), then we have $\|J(\cdot,\boldsymbol{\pi}^{n})\|_{\mathbb{L}^{\infty}(\mathbb{R}_{+})}\leq\|V\|_{\mathbb{L}^{\infty}(\mathbb{R}_{+})}\leq C$ for all $n\geq 1$. That is, the $v_{n}$'s are uniformly bounded, uniformly in $n$, provided that the $r^{\boldsymbol{\pi}^{n}}$'s are. The following result, based on the recent work [12], is thus crucial.

Proposition 4.3.

The functions $r^{\boldsymbol{\pi}_{n}}$, $n\geq 1$, are uniformly bounded, uniformly in $n$. Consequently, the learning sequence $v_{n}=J(\cdot,\boldsymbol{\pi}^{n})\in\mathbb{C}^{2}_{b}(\mathbb{R}_{+})$, $n\geq 1$, and the bounds of the $v_{n}$'s, up to their second derivatives, are uniform in $n$.

Our main result of this section is the following.

Theorem 5.

Assume that the Assumption 2.4 is in force. Then the sequence {vn}n0\{v_{n}\}_{n\geq 0} is a maximizing sequence. Furthermore, the sequence {πn}n0\{\pi_{n}\}_{n\geq 0} converges to the optimal policy π\pi^{*}.

Proof. We first observe that by Lemma 1 the sequence {𝝅n}𝒜cls\{\boldsymbol{\pi}_{n}\}\subset\mathscr{A}_{cl}^{s}, provided 𝝅0𝒜cls\boldsymbol{\pi}_{0}\in\mathscr{A}_{cl}^{s}. Since vn=J(,𝝅n)v_{n}=J(\cdot,\boldsymbol{\pi}_{n}), Proposition 4.3 guarantees that vnb2(+)v_{n}\in\mathbb{C}^{2}_{b}(\mathbb{R}_{+}), and the bounds are independent of nn. Thus a simple application of Arzella-Ascolli Theorem shows that there exist subsequences {nk}k1\{n_{k}\}_{k\geq 1} and {nk}k1\{n^{\prime}_{k}\}_{k\geq 1} such that {vnk}k0\{v_{n_{k}}\}_{k\geq 0} and {vnk}k0\{v^{\prime}_{n^{\prime}_{k}}\}_{k\geq 0} converge uniformly on compacts.

Let us fix any compact set E\subset\mathbb{R}_{+}, and assume \lim_{k\to\infty}v_{n_{k}}(\cdot)=v^{*}(\cdot), uniformly on E, for some function v^{*}. By the definition of the \boldsymbol{\pi}_{n}'s we know that \{v_{n}\} is monotonically increasing, thanks to Theorem 4, thus the whole sequence \{v_{n}\}_{n\geq 0} must converge uniformly on E to v^{*}. Next, let us assume that \lim_{k\to\infty}v^{\prime}_{n^{\prime}_{k}}(\cdot)=v^{**}(\cdot), uniformly on E, for some function v^{**}. Since obviously \lim_{k\to\infty}v_{n^{\prime}_{k}}(\cdot)=v^{*}(\cdot) as well, and the derivative operator is a closed operator, it follows that v^{**}(x)=(v^{*})^{\prime}(x), x\in E. Applying the same argument, one shows that any subsequence of \{v^{\prime}_{n}\} has a further subsequence that converges uniformly on E to the same limit (v^{*})^{\prime}; we therefore conclude that the sequence \{v^{\prime}_{n}\} itself converges uniformly on E to (v^{*})^{\prime}. Since E is arbitrary, this shows that \{(v_{n},v^{\prime}_{n})\}_{n\geq 0} converges uniformly on compacts to (v^{*},(v^{*})^{\prime}). Since \boldsymbol{\pi}_{n} is a continuous function of v^{\prime}_{n}, we see that \{\boldsymbol{\pi}_{n}\}_{n\geq 0} converges uniformly to \boldsymbol{\pi}^{*}\in\mathscr{A}_{cl} defined by \boldsymbol{\pi}^{*}(x,w):=G(w,1-(v^{*})^{\prime}(x)).

Finally, applying Lemma 1 we see that 𝝅𝒜cls\boldsymbol{\pi}^{*}\in\mathscr{A}^{s}_{cl}, and the structure of the 𝝅(,)\boldsymbol{\pi}^{*}(\cdot,\cdot) guarantees that vv^{*} satisfies the HJB equation (3.1) on the compact set EE. By expanding the result to +\mathbb{R}^{+} using the fact that EE is arbitrary, vv^{*} satisfies the HJB equation (3.1) (or equivalently (3.3)). Now by using the slightly modified verification argument in Theorem 4.1 in [12] we conclude that v=Vv^{*}=V^{*} is the unique solution to the HJB equation (3.1) and thus π\pi^{*} by definition is the optimal control.  

Remark 4.4.

An alternate policy improvement method is the so-called Policy Gradient (PG) method introduced in [15], applicable for both finite and infinite horizon problems. Roughly speaking, a PG method parametrizes the policies \boldsymbol{\pi}^{\phi}\in{\cal A}_{cl}^{s} and then solves for \phi via the equation \nabla_{\phi}J(x,\boldsymbol{\pi}^{\phi})=0, using stochastic approximation methods. The advantage of a PG method is that it does not depend on the system parameters, whereas in theory Theorem 4 is based on finding the maximizer of the Hamiltonian, and thus the learning strategy (4.9) may depend on the system parameters. However, a closer look at the learning parameters c_{n} and d_{n} in (4.9) shows that they depend only on v_{n}, but not on (\mu,\sigma) directly. In fact, we believe that in our case the PG method would not be advantageous, especially given the convergence result in Theorem 5 and the fact that the PG method also requires a proper choice of the parameterization family which, to the best of our knowledge, remains a challenging issue in practice. We shall therefore content ourselves with algorithms using the learning strategy (4.9) for our numerical analysis in §7.

5 Policy Evaluation — A Martingale Approach

Having proved the policy improvement theorem, we turn our attention to an equally important issue in the learning process, that is, the evaluation of the cost (value) functional, or the Policy Evaluation. In the reinforcement learning literature, policy evaluation usually refers to the process of approximating the cost functional J(\cdot,\boldsymbol{\pi}), for a given feedback control \boldsymbol{\pi}, by a parametric family of functions J^{\theta}, where \theta\in\Theta\subseteq\mathbb{R}^{l}. Throughout this section, we shall consider a fixed feedback control policy \boldsymbol{\pi}\in\mathscr{A}^{s}_{cl}. Thus for simplicity of notation, we shall drop the superscript \boldsymbol{\pi} and write r(x)=r^{\boldsymbol{\pi}}(x), b(x)=b^{\boldsymbol{\pi}}(x), J(x,\boldsymbol{\pi})=J(x), and \tau_{x}=\tau^{\boldsymbol{\pi}}_{x}.

We note that for 𝝅𝒜cls\boldsymbol{\pi}\in\mathscr{A}^{s}_{cl}, the functions r,bb(+)r,b\in\mathbb{C}_{b}(\mathbb{R}_{+}) and Jb2(+)J\in\mathbb{C}_{b}^{2}(\mathbb{R}_{+}). Now let Xx=X𝝅,xX^{x}=X^{\boldsymbol{\pi},x} be the solution to the SDE (LABEL:Xpi), and J()J(\cdot) satisfies the ODE (4.6). Then, applying Itô’s formula we see that

Mtx:=ectJ(Xtx)+0tecsr(Xsx)𝑑s,t0,\displaystyle M^{x}_{t}:=e^{-ct}J(X^{x}_{t})+\int_{0}^{t}e^{-cs}r(X^{x}_{s})ds,\qquad t\geq 0, (5.1)

is an 𝔽\mathbb{F}-martingale. Furthermore, the following result is more or less standard.

Proposition 5.1.

Assume that Assumption 4.3 holds, and suppose that \tilde{J}(\cdot)\in\mathbb{C}_{b}(\mathbb{R}_{+}) is such that \tilde{J}(0)=0, and for all x\in\mathbb{R}_{+}, the process \tilde{M}^{x}:=\{\tilde{M}^{x}_{t}=e^{-ct}\tilde{J}(X^{x}_{t})+\int_{0}^{t}e^{-cs}r(X^{x}_{s})ds;~t\geq 0\} is an \mathbb{F}-martingale. Then J\equiv\tilde{J}.

Proof. First note that J(0)=\tilde{J}(0)=0, and X^{x}_{\tau_{x}}=0. By (5.1) and the definition of \tilde{M}^{x} we have \tilde{M}^{x}_{\tau_{x}}=\int_{0}^{\tau_{x}}e^{-cs}r(X^{x}_{s})ds=M^{x}_{\tau_{x}}. Now, since r, J, and \tilde{J} are bounded, both \tilde{M}^{x} and M^{x} are uniformly integrable \mathbb{F}-martingales; hence, by optional sampling, it holds that

J~(x)=M~0x=𝔼[M~τxx|0]=𝔼[Mτxx|0]=M0x=J(x),x+.\tilde{J}(x)=\tilde{M}^{x}_{0}=\mathbb{E}[\tilde{M}^{x}_{\tau_{x}}|{\cal F}_{0}]=\mathbb{E}[M^{x}_{\tau_{x}}|{\cal F}_{0}]=M^{x}_{0}=J(x),\qquad x\in\mathbb{R}_{+}.

The result follows.  

We now consider a family of functions {Jθ(x):(x,θ)+×Θ}\{J^{\theta}(x):(x,\theta)\in\mathbb{R}_{+}\times\Theta\}, where Θl\Theta\subseteq\mathbb{R}^{l} is a certain index set. For the sake of argument, we shall assume further that Θ\Theta is compact. Moreover, we shall make the following assumptions for the parameterized family {Jθ}\{J^{\theta}\}.

Assumption 5.2.

(i) The mapping (x,θ)Jθ(x)(x,\theta)\mapsto J^{\theta}(x) is sufficiently smooth, so that all the derivatives required exist in the classical sense.

(ii) For all θΘ\theta\in\Theta, φθ(Xx)\varphi^{\theta}(X^{x}_{\cdot}) are square-integrable continuous processes, and the mappings θφθ𝕃2([0,T])\theta\mapsto\|\varphi^{\theta}\|_{\mathbb{L}^{2}_{{\cal F}}([0,T])} are continuous, where φθ=Jθ,(Jθ),(Jθ)′′\varphi^{\theta}=J^{\theta},(J^{\theta})^{\prime},(J^{\theta})^{\prime\prime}.

(iii) There exists a continuous function K()>0K(\cdot)>0, such that JθK(θ)\|J^{\theta}\|_{\infty}\leq K(\theta).  

In what follows we shall often drop the superscript xx from the processes XxX^{x}, MxM^{x} etc., if there is no danger of confusion. Also, for practical purpose we shall consider a finite time horizon [0,T][0,T], for an arbitrarily fixed and sufficiently large T>0T>0. Denoting the stopping time τ~x=τxT:=τxT\tilde{\tau}_{x}=\tau^{T}_{x}:=\tau_{x}\wedge T, by optional sampling theorem, we know that M~t:=Mτ~xt=Mτxt\tilde{M}_{t}:=M_{\tilde{\tau}_{x}\wedge t}=M_{\tau_{x}\wedge t}, for t[0,T]t\in[0,T], is an 𝔽~\tilde{\mathbb{F}}-martingale on [0,T][0,T], where 𝔽~={τ~xt}t[0,T]\tilde{\mathbb{F}}=\{{\cal F}_{\tilde{\tau}_{x}\wedge t}\}_{t\in[0,T]}. Let us also denote M~tθ:=Mτxtθ\tilde{M}^{\theta}_{t}:=M^{\theta}_{\tau_{x}\wedge t}, t[0,T]t\in[0,T].

We now follow the idea of [14] to construct the so-called Martingale Loss Function. For any θΘ\theta\in\Theta, consider the parametrized approximation of the process M=MxM=M^{x}:

Mtθ=Mtθ,x:=ectJθ(Xtx)+0tecsr(Xsx)𝑑s,t[0,T].\displaystyle M^{\theta}_{t}=M^{\theta,x}_{t}:=e^{-ct}J^{\theta}(X^{x}_{t})+\int_{0}^{t}e^{-cs}r(X^{x}_{s})ds,\qquad t\in[0,T]. (5.2)

In light of the Martingale Loss function introduced in [14], we denote

ML(θ)=12𝔼[0τ~x|MτxM~tθ|2𝑑t]=12𝔼[0τ~x|ectJθ(Xt)tτxecsr(Xs)𝑑s|2𝑑t].\displaystyle{ML}(\theta)\negthinspace=\negthinspace\frac{1}{2}\mathbb{E}\Big{[}\negthinspace\int_{0}^{\tilde{\tau}_{x}}\negthinspace\negthinspace|M_{\tau_{x}}\negthinspace\negthinspace-\negthinspace\tilde{M}_{t}^{\theta}|^{2}dt\Big{]}\negthinspace=\negthinspace\frac{1}{2}\mathbb{E}\Big{[}\negthinspace\int_{0}^{\tilde{\tau}_{x}}\negthinspace\negthinspace\negthinspace\big{|}e^{-ct}J^{\theta}(X_{t})\negthinspace-\negthinspace\negthinspace\int_{t}^{\tau_{x}}\negthinspace\negthinspace\negthinspace e^{-cs}r(X_{s})ds\big{|}^{2}dt\Big{]}. (5.3)

We should note that the last equality above indicates that the martingale loss function is actually independent of the function JJ, which is one of the main features of this algorithm. Furthermore, inspired by the mean-squared and discounted mean-squared value errors we define

MSVE(θ)\displaystyle\mbox{\it MSVE}(\theta) =\displaystyle= 12𝔼[0τ~x|Jθ(Xt)J(Xt)|2𝑑t],\displaystyle\frac{1}{2}\mathbb{E}\Big{[}\int_{0}^{\tilde{\tau}_{x}}|J^{\theta}(X_{t})-J(X_{t})|^{2}dt\Big{]}, (5.4)
DMSVE(θ)\displaystyle\mbox{\it DMSVE}(\theta) =\displaystyle= 12𝔼[0τ~xe2ct|Jθ(Xt)J(Xt)|2𝑑t].\displaystyle\frac{1}{2}\mathbb{E}\Big{[}\int_{0}^{\tilde{\tau}_{x}}e^{-2ct}|J^{\theta}(X_{t})-J(X_{t})|^{2}dt\Big{]}. (5.5)
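For intuition, a minimal sketch of how the three criteria could be evaluated on a single discretized sample path is given below, assuming the path of X, the running reward r(X_{t}), and (for MSVE/DMSVE only) the true J are available on a time grid covering [0,\tau_{x}\wedge T]; note that, as remarked above, the ML criterion itself never uses J. The function names and the grid handling are illustrative only.

```python
import numpy as np

def losses_one_path(t, X, rX, J_theta, J_true=None, c=0.02):
    """t: time grid on [0, tau_x ∧ T]; X: path values; rX: r(X_t) on the grid;
    J_theta: vectorized parametrized candidate; J_true: the (usually unknown) J."""
    dt = np.diff(t, append=t[-1] + (t[-1] - t[-2]))
    disc = np.exp(-c * t)
    # M_{tau_x} - M~_t^theta = int_t^{tau_x} e^{-cs} r(X_s) ds - e^{-ct} J^theta(X_t)
    tail = np.cumsum((disc * rX * dt)[::-1])[::-1]            # int_t^{tau} e^{-cs} r ds
    ml = 0.5 * np.sum((disc * J_theta(X) - tail) ** 2 * dt)   # cf. (5.3); no J needed
    if J_true is None:
        return ml, None, None
    err = J_theta(X) - J_true(X)
    msve = 0.5 * np.sum(err ** 2 * dt)                         # cf. (5.4)
    dmsve = 0.5 * np.sum(np.exp(-2 * c * t) * err ** 2 * dt)   # cf. (5.5)
    return ml, msve, dmsve
```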

The following result shows the connection between the minimizers of ML()ML(\cdot) and DMSVE()DMSVE(\cdot).

Theorem 6.

Assume that Assumption 5.2 is in force. Then, it holds that

argminθΘML(θ)=argminθΘDMSVE(θ).\displaystyle\arg\min_{\theta\in\Theta}\mbox{\it ML}(\theta)=\arg\min_{\theta\in\Theta}\mbox{\it DMSVE}(\theta). (5.6)

Proof. First, noting that J(0)=J^{\theta}(0)=0 and X_{\tau_{x}}=0, we see that

M~tθ=Mτxθ=Mτx=0τxecsr(Xs)𝑑s,t(τ~x,T).\tilde{M}^{\theta}_{t}=M^{\theta}_{\tau_{x}}=M_{\tau_{x}}=\int_{0}^{\tau_{x}}e^{-cs}r(X_{s})ds,\qquad t\in(\tilde{\tau}_{x},T).

Here in the above we use the convention that (\tilde{\tau}_{x},T)=\emptyset if \tilde{\tau}_{x}=T, in which case the identities become trivial. Consequently, by definition (5.3) and noting that \tilde{M}^{\theta}_{t}=M^{\theta}_{t}, for t\in[0,\tilde{\tau}_{x}], we can write

2ML(θ)\displaystyle\qquad 2\mbox{\it ML}(\theta)\negthinspace\negthinspace =\displaystyle\negthinspace\negthinspace=\negthinspace\negthinspace 𝔼[0τ~x|MτxM~tθ|2𝑑t]=𝔼[0τ~x|MτxMtθ|2𝑑t]\displaystyle\negthinspace\negthinspace\mathbb{E}\Big{[}\int_{0}^{\tilde{\tau}_{x}}|M_{\tau_{x}}-\tilde{M}_{t}^{\theta}|^{2}dt\Big{]}=\mathbb{E}\Big{[}\int_{0}^{\tilde{\tau}_{x}}|M_{\tau_{x}}-M_{t}^{\theta}|^{2}dt\Big{]}
=\displaystyle\negthinspace\negthinspace=\negthinspace\negthinspace 𝔼[0τ~x[|MτxMt|2+|MtMtθ|2+2(MτxMt)(MtMtθ)]𝑑t].\displaystyle\negthinspace\negthinspace\mathbb{E}\Big{[}\int_{0}^{\tilde{\tau}_{x}}\big{[}|M_{\tau_{x}}-M_{t}|^{2}+|M_{t}-M_{t}^{\theta}|^{2}+2(M_{\tau_{x}}-M_{t})(M_{t}-M_{t}^{\theta})\big{]}dt\Big{]}.

Next, noting (5.1) and (5.2), we see that

𝔼[0τ~x|MtMtθ|2𝑑t]=𝔼[0τ~xe2ct|J(Xt)Jθ(Xt)|2𝑑t]=2DMSVE(θ).\displaystyle\mathbb{E}\Big{[}\negthinspace\int_{0}^{\tilde{\tau}_{x}}\negthinspace|M_{t}-M_{t}^{\theta}|^{2}dt\Big{]}=\mathbb{E}\Big{[}\negthinspace\int_{0}^{\tilde{\tau}_{x}}\negthinspace\negthinspace e^{-2ct}|J(X_{t})-J^{\theta}(X_{t})|^{2}dt\Big{]}=2\text{DMSVE}(\theta).

Also, applying optional sampling we can see that

𝔼[0τ~x(MτxMt)(MtMtθ)𝑑t]=0T𝔼[𝔼[(MτxMt)|t]𝟏{τxt}(MtMtθ)]𝑑t\displaystyle\mathbb{E}\Big{[}\negthinspace\int_{0}^{\tilde{\tau}_{x}}\negthinspace\negthinspace(M_{\tau_{x}}\negthinspace-\negthinspace M_{t})(M_{t}\negthinspace-\negthinspace M_{t}^{\theta})dt\Big{]}=\int_{0}^{T}\negthinspace\negthinspace\mathbb{E}\big{[}\mathbb{E}\big{[}({M}_{\tau_{x}}\negthinspace-\negthinspace{M}_{t})|{\cal F}_{t}\big{]}{\bf 1}_{\{\tau_{x}\geq t\}}({M}_{t}-{M}_{t}^{\theta})\big{]}dt
=𝔼[0T𝔼[(MτxMt)|tτx]𝟏{τxt}(M~tM~tθ)𝑑t]=0.\displaystyle=\mathbb{E}\Big{[}\int_{0}^{T}\mathbb{E}\big{[}({M}_{\tau_{x}}-{M}_{t})|{\cal F}_{t\wedge\tau_{x}}\big{]}{\bf 1}_{\{\tau_{x}\geq t\}}(\tilde{M}_{t}-\tilde{M}_{t}^{\theta})dt\Big{]}=0.

Combining the above we see that (5) becomes 2ML(\theta)=2\text{DMSVE}(\theta)+\mathbb{E}\big[\int_{0}^{\tilde{\tau}_{x}}|M_{\tau_{x}}-M_{t}|^{2}dt\big]. Since \mathbb{E}[\int_{0}^{\tilde{\tau}_{x}}|M_{\tau_{x}}-M_{t}|^{2}dt] is independent of \theta, we conclude the result.

Remark 5.3.

Since the minimizers of MSVE(θ)(\theta) and DMSVE(θ)(\theta) are obviously identical, Theorem 6 suggests that if θ\theta^{*} is a minimizer of either one of ML()ML(\cdot), MSVE()MSVE(\cdot), DMSVE()DMSVE(\cdot), then JθJ^{\theta^{*}} would be an acceptable approximation of JJ. In the rest of the section we shall therefore focus on the identification of θ\theta^{*}.  

We now propose an algorithm that provides a numerical approximation of the policy evaluation J()J(\cdot) (or equivalently the martingale MxM^{x}), by discretizing the integrals in the loss functional ML()ML(\cdot). To this end, let T>0T>0 be an arbitrary but fixed time horizon, and consider the partition 0=t0<<tn=T0=t_{0}<\cdots<t_{n}=T, and denote Δt=titi1\Delta t=t_{i}-t_{i-1}, i=1,ni=1,\cdots n. Now for x+x\in\mathbb{R}_{+}, we define Kx=min{l:t[lΔt,(l+1)Δt):Xtx<0}K_{x}=\min\{l\in\mathbb{N}:\exists t\in[l\Delta t,(l+1)\Delta t):X^{x}_{t}<0\}, and τx:=KxΔt\lfloor\tau_{x}\rfloor:=K_{x}\Delta t so that τx[KxΔt,(Kx+1)Δt)\tau_{x}\in[K_{x}\Delta t,(K_{x}+1)\Delta t). Finally, we define Nx=min{Kx,n}N_{x}=\min\{K_{x},n\}. Clearly, both KxK_{x} and NxN_{x} are integer-valued random variables, and we shall often drop the subscript xx if there is no danger of confusion.

In light of (5.3), let us define

MLΔt(θ)=12𝔼[i=0N1|ectiJθ(Xti)j=iK1ectjr(Xtj)Δt|2Δt]=:12𝔼[i=0N1|ΔM~tiθ|2Δt],\displaystyle\mbox{\it ML}_{\Delta t}(\theta)=\frac{1}{2}\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\Big{|}e^{-c{t_{i}}}J^{\theta}(X_{t_{i}})-\sum_{j=i}^{K-1}e^{-c{{t_{j}}}}r(X_{t_{j}})\Delta t\Big{|}^{2}\Delta t\Big{]}=:\frac{1}{2}\mathbb{E}\Big{[}\sum_{i=0}^{N-1}|\Delta\tilde{M}^{\theta}_{t_{i}}|^{2}\Delta t\Big{]},

where ΔM~tiθ\Delta\tilde{M}^{\theta}_{t_{i}}, i=1,,ni=1,\cdots,n, are defined in an obvious way. Furthermore, for t[0,τx]t\in[0,\tau_{x}], we define m(t,θ):=ectJθ(Xt)+tτxecsr(Xs)𝑑sm(t,\theta):=-e^{-ct}J^{\theta}(X_{t})+\int_{t}^{\tau_{x}}e^{-cs}r(X_{s})ds. Now note that {τxT}={τx<Tτx}{τxT}\{{\tau_{x}}\geq T\}=\{\lfloor\tau_{x}\rfloor<T\leq\tau_{x}\}\cup\{\lfloor\tau_{x}\rfloor\geq T\}, and {τxT}={N=n}\{\lfloor\tau_{x}\rfloor\geq T\}=\{N=n\}. Denoting F~1=𝔼[0T|m(t,θ)|2dt𝟏{τx<Tτx}]\tilde{F}_{1}=\mathbb{E}\Big{[}\int_{0}^{T}|m(t,\theta)|^{2}dt{\bf 1}_{\{\lfloor\tau_{x}\rfloor<T\ \leq\tau_{x}}\}\Big{]}, we have

𝔼[0T|m(t,θ)|2𝑑t𝟏{τxT}]\displaystyle\mathbb{E}\Big{[}\int_{0}^{T}|m(t,\theta)|^{2}dt{\bf 1}_{\{{\tau_{x}}\geq T\}}\Big{]} =\displaystyle= 𝔼[0T|m(t,θ)|2𝑑t(𝟏{τx<Tτx}+𝟏{τxT})]\displaystyle\mathbb{E}\Big{[}\int_{0}^{T}|m(t,\theta)|^{2}dt\big{(}{\bf 1}_{\{\lfloor\tau_{x}\rfloor<T\leq\tau_{x}\}}+{\bf 1}_{\{\lfloor\tau_{x}\rfloor\geq T\}}\big{)}\Big{]} (5.8)
=\displaystyle= F~1+𝔼[i=0N1titi+1|m(t,θ)|2𝑑t𝟏{N=n}].\displaystyle\tilde{F}_{1}+\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}|m(t,\theta)|^{2}dt{\bf 1}_{\{N=n\}}\Big{]}.

Since {τx<T}={N<n}={τx<T}{τx<Tτx}\{{\lfloor\tau_{x}\rfloor}<T\}=\negthinspace\{N<n\}=\{{\tau_{x}}<T\}\cup\{\lfloor\tau_{x}\rfloor<T\ \leq\tau_{x}\}, denoting F~2=𝔼[0τx|m(t,θ)|2dt𝟏{τx<Tτx}]\tilde{F}_{2}=\mathbb{E}\Big{[}\int_{0}^{\tau_{x}}|m(t,\theta)|^{2}dt{\bf 1}_{\{\lfloor\tau_{x}\rfloor<T\ \leq\tau_{x}}\}\Big{]} and F~3=𝔼[τxτx|m(t,θ)|2𝑑t𝟏{τx<T}]\tilde{F}_{3}=\mathbb{E}\Big{[}\int_{\lfloor\tau_{x}\rfloor}^{\tau_{x}}|m(t,\theta)|^{2}dt{\bf 1}_{\{\lfloor\tau_{x}\rfloor<T\}}\Big{]}, we obtain,

𝔼[0τx|m(t,θ)|2𝑑t𝟏{τx<T}]\displaystyle\mathbb{E}\Big{[}\int_{0}^{\tau_{x}}|m(t,\theta)|^{2}dt{\bf 1}_{\{{\tau_{x}}<T\}}\Big{]} =\displaystyle= 𝔼[0τx|m(t,θ)|2𝑑t𝟏{N<n}]+F~3F~2\displaystyle\mathbb{E}\Big{[}\int_{0}^{\lfloor\tau_{x}\rfloor}|m(t,\theta)|^{2}dt{\bf 1}_{\{N<n\}}\Big{]}+\tilde{F}_{3}-\tilde{F}_{2}
=\displaystyle= 𝔼[i=0N1titi+1|m(t,θ)|2𝑑t𝟏{N<n}]+F~3F~2\displaystyle\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}|m(t,\theta)|^{2}dt{\bf 1}_{\{N<n\}}\Big{]}+\tilde{F}_{3}-\tilde{F}_{2}

Combining (5.8) and (5), and arguing similarly to (5), we can now rewrite (5.3) as

2ML(θ)\displaystyle 2\mbox{\it ML}(\theta) =\displaystyle= 𝔼[0τ~x|m(t,θ)|2𝑑t]=𝔼[0T|m(t,θ)|2𝑑t(𝟏{τxT}+1{τx<T})]\displaystyle\mathbb{E}\Big{[}\int_{0}^{\tilde{\tau}_{x}}|m(t,\theta)|^{2}dt\Big{]}=\mathbb{E}\Big{[}\int_{0}^{T}|m(t,\theta)|^{2}dt\big{(}{\bf 1}_{\{{\tau_{x}}\geq T\}}+1_{\{{\tau_{x}}<T\}}\big{)}\Big{]} (5.9)
=\displaystyle\negthinspace=\negthinspace 𝔼[i=0N1titi+1|m(t,θ)|2𝑑t(𝟏{N=n}+𝟏{N<n})]+F~1F~2+F~3\displaystyle\negthinspace\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}|m(t,\theta)|^{2}dt\big{(}{\bf 1}_{\{N=n\}}+{\bf 1}_{\{N<n\}}\big{)}\Big{]}+\tilde{F}_{1}-\tilde{F}_{2}+\tilde{F}_{3}
=\displaystyle\negthinspace=\negthinspace 𝔼[i=0N1titi+1|m(t,θ)|2𝑑t]+F~1F~2+F~3.\displaystyle\negthinspace\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}|m(t,\theta)|^{2}dt\Big{]}+\tilde{F}_{1}-\tilde{F}_{2}+\tilde{F}_{3}.
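As a companion to the bookkeeping above, the following sketch computes a Monte-Carlo estimate of ML_{\Delta t}(\theta) from a batch of simulated paths, with K_{x} taken as the first grid index at which the path goes below zero and N_{x}=\min\{K_{x},n\}; when no ruin is observed before T the inner sum is simply truncated at the end of the stored path, as in the modified loss \widetilde{ML}_{\Delta t} used later in this section. All names are illustrative.

```python
import numpy as np

def ml_delta_t(paths, rewards, J_theta, dt, T, c=0.02):
    """Monte-Carlo estimate of ML_{Δt}(θ): `paths` is a list of arrays (X_{t_0}, X_{t_1}, ...)
    sampled until ruin or time T; `rewards` holds the matching arrays r(X_{t_i})."""
    n = int(round(T / dt))
    total = 0.0
    for X, rX in zip(paths, rewards):
        K = next((i for i, x in enumerate(X) if x < 0), len(X))  # first ruin index K_x
        N = min(K, n)                                             # N_x = min{K_x, n}
        disc = np.exp(-c * dt * np.arange(len(X)))
        tail = np.cumsum((disc * rX)[::-1])[::-1] * dt            # sum_{j>=i} e^{-c t_j} r(X_{t_j}) dt
        tail_K = tail[K] if K < len(X) else 0.0
        for i in range(N):
            inner = disc[i] * J_theta(X[i]) - (tail[i] - tail_K)  # sum over j = i, ..., K-1
            total += inner ** 2 * dt
    return 0.5 * total / len(paths)
```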

We are now ready to give the main result of this section.

Theorem 7.

Let Assumptions 4.3 and 5.2 be in force. Then it holds that

limΔt0 MLΔt(θ)=ML(θ),uniformly in θ on compacta. \lim_{\Delta t\to 0}\mbox{ ML}_{\Delta t}(\theta)=\mbox{ML}(\theta),\quad\mbox{\rm uniformly in $\theta$ on compacta. }

Proof. Fix a partition 0=t0<<tn=T0=t_{0}<\cdots<t_{n}=T. By (5.9) and (5) we have, for θΘ\theta\in\Theta,

2|ML(θ)MLΔt(θ)|\displaystyle 2|M\negthinspace L(\theta)-M\negthinspace L_{\Delta t}(\theta)| =\displaystyle= 𝔼|i=0N1titi+1|m(t,θ)|2𝑑t+F~1+F~3F~2i=0N1|ΔM~tiθ|2Δt|\displaystyle\mathbb{E}\bigg{|}\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}|m(t,\theta)|^{2}dt+\tilde{F}_{1}+\tilde{F}_{3}-\tilde{F}_{2}-\sum_{i=0}^{N-1}|\Delta\tilde{M}^{\theta}_{t_{i}}|^{2}\Delta t\bigg{|} (5.10)
\displaystyle\leq 𝔼[i=0N1titi+1||m(t,θ)|2|ΔM~tiθ|2|𝑑t]+i=13|F~i|.\displaystyle\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}\big{|}|m(t,\theta)|^{2}-|\Delta\tilde{M}_{t_{i}}^{\theta}|^{2}\big{|}dt\Big{]}+\sum_{i=1}^{3}|\tilde{F}_{i}|.

Let us first check |F~i||\tilde{F}_{i}|, i=1,2,3i=1,2,3. First, by Assumption 5.2, we see that

|m(t,θ)||Jθ(Xt)|+0ecs|r(Xs)|dsK(θ)+Rc=:C1(θ), for t>0,-a.s. ,|m(t,\theta)|\leq|J^{\theta}(X_{t})|+\int_{0}^{\infty}e^{-cs}|r(X_{s})|ds\leq K(\theta)+\frac{R}{c}=:C_{1}(\theta),\qquad\text{ for }t>0,~{}\mathbb{P}\hbox{\rm-a.s.{ }},

where C1()C_{1}(\cdot) is a continuous function, and RR is the bound of r()r(\cdot). Thus we have

|F~3|=𝔼[τxτx|m(t,θ)|2𝑑t𝟏{N<n}]|C1(θ)|2Δt,\displaystyle|\tilde{F}_{3}|=\mathbb{E}\Big{[}\int_{\lfloor\tau_{x}\rfloor}^{\tau_{x}}|m(t,\theta)|^{2}dt{\bf 1}_{\{N<n\}}\Big{]}\leq|C_{1}(\theta)|^{2}\Delta t, (5.11)

Next, note that \lfloor\tau_{x}\rfloor\leq T implies \tau_{x}\leq\lfloor\tau_{x}\rfloor+\Delta t\leq T+\Delta t, and since we are considering the case when \Delta t\to 0, we may assume \Delta t<1. Thus by the definitions of \tilde{F}_{1} and \tilde{F}_{2} we have

|F~1|+|F~2|\displaystyle|\tilde{F}_{1}|\negthinspace+\negthinspace|\tilde{F}_{2}|\negthinspace\negthinspace \displaystyle\negthinspace\leq\negthinspace 2𝔼[0T+1|m(t,θ)|2dt𝟏{τxTτx}]2|C1(θ)|2(T+1){τxTτx}\displaystyle\negthinspace\negthinspace 2\mathbb{E}\Big{[}\negthinspace\negthinspace\int_{0}^{T+1}\negthinspace\negthinspace\negthinspace\negthinspace|m(t,\theta)|^{2}dt{\bf 1}_{\{\lfloor\tau_{x}\rfloor\leq T\leq\tau_{x}}\}\Big{]}\leq 2|C_{1}(\theta)|^{2}(T+1)\mathbb{P}\{\lfloor\tau_{x}\rfloor\leq T\leq\tau_{x}\} (5.12)
\displaystyle\leq 2|C1(θ)|2(T+1){|Tτx|Δt}.\displaystyle 2|C_{1}(\theta)|^{2}(T+1)\mathbb{P}\{|T-\tau_{x}|\leq\Delta t\}.

Since XX is a diffusion, one can easily check that limΔt0{|Tτx|Δt}={T=τx}{XT=0}=0\lim_{\Delta t\to 0}\mathbb{P}\{|T-\tau_{x}|\leq\Delta t\}=\mathbb{P}\{T=\tau_{x}\}\leq\mathbb{P}\{X_{T}=0\}=0. Furthermore, noting that |C1(θ)|2|C_{1}(\theta)|^{2} is uniformly bounded for θ\theta in any compact set, from (5.11) and (5.12) we conclude that

limΔt0(|F~1|+|F~2|+|F~3|)=0,uniformly in θ on compacta.\displaystyle\lim_{\Delta t\to 0}(|\tilde{F}_{1}|+|\tilde{F}_{2}|+|\tilde{F}_{3}|)=0,\qquad\mbox{uniformly in $\theta$ on compacta.} (5.13)

It remains to show that

limΔt0𝔼[i=0N1titi+1||m(t,θ)|2|ΔM~tiθ|2|𝑑t]=0,\displaystyle\lim_{\Delta t\to 0}\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}\big{|}|m(t,\theta)|^{2}-|\Delta\tilde{M}_{t_{i}}^{\theta}|^{2}\big{|}dt\Big{]}=0, (5.14)

uniformly in θ\theta on compacta. To this end, we first note that

𝔼[i=0N1titi+1||m(t,θ)|2|ΔM~tiθ|2|𝑑t]E~1Δt+E~2Δt,\displaystyle\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}\big{|}|m(t,\theta)|^{2}-|\Delta\tilde{M}^{\theta}_{t_{i}}|^{2}\big{|}dt\Big{]}\leq\tilde{E}^{\Delta t}_{1}+\tilde{E}^{\Delta t}_{2}, (5.15)

where E~1Δt:=𝔼[i=0N1titi+1(|m(t,θ)|2|m(ti,θ)|2)𝑑t]\tilde{E}^{\Delta t}_{1}\negthinspace\negthinspace:=\negthinspace\negthinspace\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}\negthinspace\negthinspace\negthinspace\negthinspace\big{(}|m(t,\theta)|^{2}\negthinspace\negthinspace-\negthinspace\negthinspace|m(t_{i},\theta)|^{2}\big{)}dt\Big{]} and E~2Δt:=𝔼[i=0N1||m(ti,θ)|2|ΔM~tiθ|2|Δt]\tilde{E}^{\Delta t}_{2}\negthinspace\negthinspace:=\negthinspace\negthinspace\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\Big{|}|m(t_{i},\theta)|^{2}\negthinspace\negthinspace-\negthinspace\negthinspace|\Delta\tilde{M}^{\theta}_{t_{i}}|^{2}\Big{|}\Delta t\Big{]}. From definition (5) we see that under Assumption 5.2 it holds that |ΔM~tiθ|C1(θ)|\Delta\tilde{M}^{\theta}_{t_{i}}|\leq C_{1}(\theta), i=1,,ni=1,\cdots,n. Furthermore, we denote

D_{i} := m(t_{i},\theta)-\Delta\tilde{M}^{\theta}_{t_{i}}=\int_{t_{i}}^{\tau_{x}}e^{-cs}r(X_{s})ds-\sum_{j=i}^{K-1}e^{-ct_{j}}r(X_{t_{j}})\Delta t
= \sum_{j=i}^{K-1}\int_{t_{j}}^{t_{j+1}}\big[e^{-cs}r(X_{s})-e^{-ct_{j}}r(X_{t_{j}})\big]ds+\int_{\lfloor\tau_{x}\rfloor}^{\tau_{x}}e^{-cs}r(X_{s})ds.

Since r()r(\cdot) is bounded, we see that |𝔼(τxτxecsr(Xs)𝑑s)|K1Δt|\mathbb{E}(\int_{\lfloor\tau_{x}\rfloor}^{\tau_{x}}e^{-cs}r(X_{s})ds)|\leq K_{1}\Delta t for some constant K1>0K_{1}>0. Then it holds that

𝔼|Di|𝔼[j=iK1tjtj+1|r~sr~tj|𝑑s]+K1Δt𝔼[j=itjtj+1|r~sr~tj|𝑑s]+K1Δt,\displaystyle\mathbb{E}|D_{i}|\leq\mathbb{E}\Big{[}\sum_{j=i}^{K-1}\int_{t_{j}}^{t_{j+1}}|\tilde{r}_{s}-\tilde{r}_{t_{j}}|ds\Big{]}+K_{1}\Delta t\leq\mathbb{E}\Big{[}\sum_{j=i}^{\infty}\int_{t_{j}}^{t_{j+1}}|\tilde{r}_{s}-\tilde{r}_{t_{j}}|ds\Big{]}+K_{1}\Delta t,

where \tilde{r}_{t}:=e^{-ct}r(X_{t}), t\geq 0, is a bounded and continuous process. Now for any \varepsilon>0, choose \tilde{M}\in\mathbb{Z}^{+} so that e^{-ct}\leq\frac{\varepsilon c}{4R} for t\geq\tilde{M}, and define \rho^{\tilde{M}}_{2}(\tilde{r},\Delta t):=\sup_{|t-s|\leq\Delta t,\,t,s\in[0,\tilde{M}]}\|\tilde{r}_{t}-\tilde{r}_{s}\|_{\mathbb{L}^{2}(\Omega)}; then we have

𝔼[j=itjtj+1|r~sr~tj|𝑑s]j=iM~1tjtj+1𝔼|r~sr~tj|𝑑s+j=M~tjtj+1𝔼|r~sr~tj|𝑑s\displaystyle\mathbb{E}\Big{[}\sum_{j=i}^{\infty}\int_{t_{j}}^{t_{j+1}}|\tilde{r}_{s}-\tilde{r}_{t_{j}}|ds\Big{]}\leq\sum_{j=i}^{\tilde{M}-1}\int_{t_{j}}^{t_{j+1}}\mathbb{E}|\tilde{r}_{s}-\tilde{r}_{t_{j}}|ds+\sum_{j=\tilde{M}}^{\infty}\int_{t_{j}}^{t_{j+1}}\mathbb{E}|\tilde{r}_{s}-\tilde{r}_{t_{j}}|ds
\displaystyle\leq j=iM~1tjtj+1ρ2(r~,Δt)𝑑s+4RM~ecs𝑑s=Δt(M~1)ρ2M~(r~,Δt)+4RecM~c\displaystyle\sum_{j=i}^{\tilde{M}-1}\int_{t_{j}}^{t_{j+1}}\rho_{2}(\tilde{r},\Delta t)ds+4R\int_{\tilde{M}}^{\infty}e^{-cs}ds=\Delta t(\tilde{M}-1)\rho^{\tilde{M}}_{2}(\tilde{r},\Delta t)+4R\frac{e^{-c\tilde{M}}}{c}
\displaystyle\leq Δt(M~1)ρ2M~(r~,Δt)+ϵ.\displaystyle\Delta t(\tilde{M}-1)\rho^{\tilde{M}}_{2}(\tilde{r},\Delta t)+\epsilon.

Sending \Delta t\to 0 we obtain that \mathop{\overline{\rm lim}}_{\Delta t\to 0}\mathbb{E}\Big[\sum_{j=i}^{\infty}\int_{t_{j}}^{t_{j+1}}|\tilde{r}_{s}-\tilde{r}_{t_{j}}|ds\Big]\leq\varepsilon. Since \varepsilon>0 is arbitrary, we conclude that \mathbb{E}\Big[\sum_{j=i}^{\infty}\int_{t_{j}}^{t_{j+1}}|\tilde{r}_{s}-\tilde{r}_{t_{j}}|ds\Big]\to 0 as \Delta t\to 0. Since the argument above is uniform in i, it follows that \sup_{i\geq 0}\mathbb{E}|D_{i}|\to 0 as \Delta t\to 0. Consequently, we have

E~2Δt\displaystyle\tilde{E}^{\Delta t}_{2} =\displaystyle= 𝔼[i=0N1[|m(ti,θ)+ΔM~tiθ|Di]Δt]2Δt|C1(θ)|𝔼[i=0N1|Di|]\displaystyle\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\big{[}|m(t_{i},\theta)+\Delta\tilde{M}^{\theta}_{t_{i}}|D_{i}\big{]}\Delta t\Big{]}\leq 2\Delta t|C_{1}(\theta)|\mathbb{E}\Big{[}\sum_{i=0}^{N-1}|D_{i}|\Big{]}
\displaystyle\leq 2nΔt|C1(θ)|supi0𝔼|Di|0,as Δt0.\displaystyle 2n\Delta t|C_{1}(\theta)|\sup_{i\geq 0}\mathbb{E}|D_{i}|\to 0,\quad\mbox{as $\Delta t\to 0$.}

Since C_{1}(\cdot) is continuous in \theta, we see that the convergence above is uniform in \theta on compacta. Similarly, noting that by Assumption 5.2 the process m(\cdot,\theta) is also a square-integrable continuous process, uniformly in \theta, we have

E~1Δt\displaystyle\tilde{E}^{\Delta t}_{1} \displaystyle\leq 𝔼[i=0N1titi+1||m(t,θ)|2|m(ti,θ)|2|𝑑t]\displaystyle\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}\big{|}|m(t,\theta)|^{2}-|m(t_{i},\theta)|^{2}\big{|}dt\Big{]}
\displaystyle\leq 2C1(θ)𝔼[i=0n1titi+1|m(t,θ)m(ti,θ)|𝑑t]\displaystyle 2C_{1}(\theta)\mathbb{E}\Big{[}\sum_{i=0}^{n-1}\int_{t_{i}}^{t_{i+1}}|m(t,\theta)-m(t_{i},\theta)|dt\Big{]}
\displaystyle\leq 2C1(θ)i=0n1titi+1ρ(m(,θ),Δt)|dt=2C1(θ)Tρ(m(,θ),Δt),\displaystyle 2C_{1}(\theta)\sum_{i=0}^{n-1}\int_{t_{i}}^{t_{i+1}}\rho(m(\cdot,\theta),\Delta t)|dt=2C_{1}(\theta)T\rho(m(\cdot,\theta),\Delta t),

where ρ(m(,θ),Δt)=sup|ts|Δt,t,s[0,T]m(t,θ)m(s,θ)𝕃2(Ω)\rho(m(\cdot,\theta),\Delta t)=\sup_{|t-s|\leq\Delta t,t,s\in[0,T]}\|m(t,\theta)-m(s,\theta)\|_{\mathbb{L}^{2}(\Omega)} is the modulus of continuity of m(,θ)m(\cdot,\theta) in 𝕃2(Ω)\mathbb{L}^{2}(\Omega). Therefore ρ(m(,θ),Δt)0\rho(m(\cdot,\theta),\Delta t)\to 0, as Δt0\Delta t\to 0, uniformly in θ\theta on compacta.

Finally, combining (5.15)–(5) we obtain (5.14). This, together with (5.13) as well as (5.10), proves the theorem.  

Now let us denote h=Δth=\Delta t, and consider the functions

f(θ):=ML(θ),fh(θ):=MLh(θ),rh(θ):=MLh(θ)ML(θ).\displaystyle f(\theta):=\mbox{\it ML}(\theta),\qquad f_{h}(\theta):=\mbox{\it ML}_{h}(\theta),\qquad r_{h}(\theta):=\mbox{\it ML}_{h}(\theta)-\mbox{\it ML}(\theta).

Then fh(θ)=f(θ)+rh(θ)f_{h}(\theta)=f(\theta)+r_{h}(\theta), and by Assumption 5.2 we can easily check that the mappings θfh(θ),rh(θ)\theta\mapsto f_{h}(\theta),r_{h}(\theta) are continuous functions. Applying Theorem 7 we see that rh(θ)0r_{h}(\theta)\to 0, uniformly in θ\theta on compacta, as h0h\to 0. Note that if Θ\Theta is compact, then for any h>0h>0, there exists θhargminθΘfh(θ)\theta^{*}_{h}\in\arg\min_{\theta\in\Theta}f_{h}(\theta). In general, we have the following corollary of Theorem 7.

Corollary 5.4.

Assume that all assumptions in Theorem 7 are in force. If there exists a sequence {hn}n00\{h_{n}\}_{n\geq 0}\searrow 0, such that Θn:=argminθΘfhn(θ)\Theta_{n}:=\arg\min_{\theta\in\Theta}f_{h_{n}}(\theta)\neq\emptyset, then any limit point θ\theta^{*} of the sequence {θn}θnΘn\{\theta^{*}_{n}\}_{\theta^{*}_{n}\in\Theta_{n}} must satisfy θargminθΘf(θ)\theta^{*}\in\arg\min_{\theta\in\Theta}f(\theta).

Proof. This is a direct consequence of [14, Lemma 1.1].

Remark 5.5.

We should note that, by Remark 5.3, the set of minimizers of the martingale loss function ML(\theta) is the same as that of DMSVE(\theta). Thus Corollary 5.4 indicates that we have a reasonable approach for approximating the unknown function J. Indeed, if \{\theta^{*}_{n}\} has a convergent subsequence that converges to some \theta^{*}\in\Theta, then J^{\theta^{*}} is the best approximation for J in terms of either the MSVE or the DMSVE measure.

To end this section we discuss the ways to fulfill our last task: finding the optimal parameter \theta^{*}. There are usually two learning methods for this task in RL, often referred to as online and batch learning, respectively. Roughly speaking, batch learning methods use multiple sample trajectories of X over a given finite time horizon [0,T] to update the parameter \theta at each step, whereas in online learning one observes only a single sample trajectory of X and continuously updates the parameter \theta until it converges. Clearly, online learning is particularly suitable for infinite horizon problems, whereas the ML function is by definition better suited for batch learning.

Although our problem is by nature an infinite horizon one, we shall first create a batch learning algorithm via the ML function by restricting ourselves to an arbitrarily fixed finite horizon T>0, so as to convert it to a finite time horizon problem. To this end, we note that

2MLΔt(θ)\displaystyle 2ML_{\Delta t}(\theta) =\displaystyle= 𝔼[i=0N1(ectiJθ(Xti)j=iK1ectjr(Xtj)Δt)2Δt].\displaystyle\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\big{(}e^{-c{t_{i}}}J^{\theta}(X_{t_{i}})-\sum_{j=i}^{K-1}e^{-c{{t_{j}}}}r(X_{t_{j}})\Delta t\Big{)}^{2}\Delta t\Big{]}.

However, since KxK_{x} may be unbounded, we shall consider instead the function:

2ML~Δt(θ)\displaystyle 2\widetilde{ML}_{\Delta t}(\theta) =\displaystyle= 𝔼[i=0N1(ectiJθ(Xti)j=iN1ectjr(Xtj)Δt)2Δt].\displaystyle\mathbb{E}\Big{[}\sum_{i=0}^{N-1}\big{(}e^{-c{t_{i}}}J^{\theta}(X_{t_{i}})-\sum_{j=i}^{N-1}e^{-c{{t_{j}}}}r(X_{t_{j}})\Delta t\Big{)}^{2}\Delta t\Big{]}.

We observe that the difference

|2MLΔt(θ)2ML~Δt(θ)|\displaystyle|2ML_{\Delta t}(\theta)-2\widetilde{ML}_{\Delta t}(\theta)|
=\displaystyle= |i=0N1[(ectiJθ(Xti)j=iN1r~tjΔt)2Δt(ectiJθ(Xti)j=iK1r~tjΔt)2Δt]|\displaystyle\Big{|}\sum_{i=0}^{N-1}\Big{[}\Big{(}e^{-c{t_{i}}}J^{\theta}(X_{t_{i}})-\sum_{j=i}^{N-1}\tilde{r}_{t_{j}}\Delta t\Big{)}^{2}\Delta t-\Big{(}e^{-c{t_{i}}}J^{\theta}(X_{t_{i}})-\sum_{j=i}^{K-1}\tilde{r}_{t_{j}}\Delta t\Big{)}^{2}\Delta t\Big{]}\Big{|}
=\displaystyle= |i=0N1[(j=NK1r~tjΔt)(2ectiJθ(Xti)j=iN1r~tjΔtj=iK1r~tjΔt)]|Δt\displaystyle\Big{|}\sum_{i=0}^{N-1}\Big{[}\Big{(}\sum_{j=N}^{K-1}\tilde{r}_{t_{j}}\Delta t\Big{)}\Big{(}2e^{-c{t_{i}}}J^{\theta}(X_{t_{i}})-\sum_{j=i}^{N-1}\tilde{r}_{t_{j}}\Delta t-\sum_{j=i}^{K-1}\tilde{r}_{t_{j}}\Delta t\Big{)}\Big{]}\Big{|}\Delta t
\displaystyle\leq K(θ)Δt|i=0N1(j=NK1r~tjΔt)|K~(θ)Δt|i=0N1(j=NK1ectjΔt)|K~(θ)ecTTΔt,\displaystyle K(\theta)\Delta t\Big{|}\sum_{i=0}^{N-1}\Big{(}\sum_{j=N}^{K-1}\tilde{r}_{t_{j}}\Delta t\Big{)}\Big{|}\leq\widetilde{K}(\theta)\Delta t\Big{|}\sum_{i=0}^{N-1}\Big{(}\sum_{j=N}^{K-1}e^{-c{t_{j}}}\Delta t\Big{)}\Big{|}\leq\widetilde{K}(\theta)e^{-cT}T\Delta t,

for some continuous function K~(θ)\widetilde{K}(\theta). Thus if Θ\Theta is compact, for TT large enough or Δt\Delta t small enough, the difference between MLΔt(θ)ML_{\Delta t}(\theta) and ML~Δt(θ)\widetilde{ML}_{\Delta t}(\theta) is negligible. Furthermore, we note that

2ML~Δt(θ)\displaystyle 2\widetilde{ML}_{\Delta t}(\theta) =\displaystyle= 𝔼[i=0N1(ectiJθ(X~ti)j=iN1ectj(atj𝝅λlnπ(atj𝝅,X~tj))Δt)2Δt],\displaystyle\mathbb{E}^{\mathbb{Q}}\Big{[}\sum_{i=0}^{N-1}\Big{(}e^{-c{t_{i}}}J^{\theta}(\tilde{X}_{t_{i}})-\sum_{j=i}^{N-1}e^{-c{{t_{j}}}}(a^{\boldsymbol{\pi}}_{t_{j}}-\lambda\ln\pi(a^{\boldsymbol{\pi}}_{t_{j}},\tilde{X}_{t_{j}}))\Delta t\Big{)}^{2}\Delta t\Big{]},

and hence we can follow the method of Stochastic Gradient Descent (SGD) to minimize \widetilde{ML}_{\Delta t}(\theta) and obtain the updating rule

\theta^{(k+1)}\leftarrow\theta^{(k)}-\alpha_{(k)}\nabla_{\theta}\widetilde{ML}_{\Delta t}(\theta^{(k)}),

where α(k)\alpha_{(k)} denotes the learning rate for the kthk^{\text{th}} iteration (using the kthk^{\text{th}} simulated sample trajectory). Here α(k)\alpha_{(k)} is chosen so that k=0α(k)=,k=0α(k)2<\sum_{k=0}^{\infty}\alpha_{(k)}=\infty,\sum_{k=0}^{\infty}\alpha_{(k)}^{2}<\infty to help guarantee the convergence of the algorithm, based on the literature on the convergence of SGD.
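A minimal sketch of such an SGD loop is given below, with step sizes \alpha_{(k)}=\alpha_{0}/k (which satisfy the two conditions above). For brevity the gradient of the per-path loss is approximated by central finite differences rather than via the explicit \nabla_{\theta}J^{\theta}, and the routines loss_one_path and simulate_path are assumed to be supplied by the user; all names are illustrative.

```python
import numpy as np

def sgd(loss_one_path, simulate_path, theta0, n_iter=1000, alpha0=1.0, eps=1e-6):
    """theta^{(k+1)} = theta^{(k)} - alpha_k * grad_theta ML~_{Δt}(theta^{(k)}),
    with the gradient estimated from one fresh sample path per iteration and
    approximated by central finite differences (a stand-in for the explicit gradient)."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, n_iter + 1):
        path = simulate_path(theta)                  # path simulated under the current policy
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = eps
            grad[i] = (loss_one_path(theta + e, path)
                       - loss_one_path(theta - e, path)) / (2 * eps)
        theta -= (alpha0 / k) * grad                 # Robbins-Monro step sizes: sum = inf, sum of squares < inf
    return theta
```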

6 Temporal Difference (TD) Based Online Learning

In this section we consider another policy evaluation method utilizing the parametric family {Jθ}θΘ\{J^{\theta}\}_{\theta\in\Theta}. The starting point of this method is Proposition 5.1, which states that the best approximation JθJ^{\theta} is one whose corresponding approximating process MθM^{\theta} defined by (5.2) is a martingale (in which case Jθ=JJ^{\theta}=J(!)). Next, we recall the following simple fact (see, e.g., [14] for a proof).

Proposition 6.1.

An Itô process Mθ𝕃𝔽2([0,T])M^{\theta}\in\mathbb{L}^{2}_{\mathbb{F}}([0,T]) is a martingale if and only if

𝔼[0Tξt𝑑Mtθ]=0,for anyξ𝕃2([0,T];Mθ).\displaystyle\mathbb{E}\Big{[}\int_{0}^{T}\xi_{t}dM^{\theta}_{t}\Big{]}=0,\quad\mbox{for any}~{}\xi\in\mathbb{L}^{2}_{{\cal F}}([0,T];M^{\theta}). (6.1)

The functions ξ𝕃𝔽2([0,T];Mθ)\xi\in\mathbb{L}^{2}_{\mathbb{F}}([0,T];M^{\theta}) are called test functions.  

Proposition 6.1 suggests that a reasonable approach for approximating the optimal \theta^{*} could be solving the martingale orthogonality condition (6.1). However, since (6.1) involves infinitely many equations, for numerical approximations we should only choose a finite number of test functions, often referred to as moment conditions. There are many ways to choose test functions. In the finite horizon case, [14] proposed some algorithms for solving equation (6.1) with certain test functions. By using the well-known Robbins–Monro stochastic approximation (1951), they suggested some continuous analogs of the well-known discrete time Temporal Difference (TD) algorithms such as the TD(\gamma), \gamma\in[0,1], method and the (linear) least square TD(0) (or LSTD(0)) method, which are often referred to as the CTD(\gamma), \gamma\in[0,1], method and the CLSTD(0) method, respectively, for obvious reasons. We should note that although our problem is essentially an infinite horizon one, we could consider a sufficiently large truncated time horizon [0,T], as we did in the previous section, so that offline CTD methods similar to [14] can also be applied. However, in what follows we shall focus only on an online version of the CTD method that is more suitable to the infinite horizon case.

We begin by recalling the following fact:

𝔼[dMtθ]\displaystyle\mathbb{E}[dM^{\theta}_{t}] =\displaystyle= ect𝔼[dJθ(Xt)cJθ(Xt)dt+rtdt]\displaystyle e^{-ct}\mathbb{E}[dJ^{\theta}(X_{t})-cJ^{\theta}(X_{t})dt+r_{t}dt]
=\displaystyle= ect𝔼[dJθ(X~t)cJθ(X~t)dt+(atλln𝝅(at,X~t))dt],\displaystyle e^{-ct}\mathbb{E}^{\mathbb{Q}}[dJ^{\theta}(\tilde{X}_{t})-cJ^{\theta}(\tilde{X}_{t})dt+(a_{t}-\lambda\ln\boldsymbol{\pi}(a_{t},\tilde{X}_{t}))dt],

where \mathbb{Q} is the probability measure defined in Remark LABEL:remark2.0, and X~\tilde{X} is the trajectory corresponding to the action a={at}a=\{a_{t}\} “sampled” from the policy distribution {𝝅(,X~t)}\{\boldsymbol{\pi}(\cdot,\tilde{X}_{t})\}. Now let ti+1=ti+Δtt_{i+1}=t_{i}+\Delta t, i=1,2,i=1,2,\cdots be a sequence of discrete time points, and atia_{t_{i}} the action sampled at time tit_{i}. Denote

{ΔJθ(X~ti):=Jθ(X~ti+1)Jθ(X~ti);Δi:=ΔJθ(X~ti)+[cJθ(X~ti)+(atiλln𝝅(ati,X~ti))]Δt,i=0,1,2,.\left\{\begin{array}[]{lll}\Delta J^{\theta}(\tilde{X}_{t_{i}}):=J^{\theta}(\tilde{X}_{t_{i+1}})-J^{\theta}(\tilde{X}_{t_{i}});\\ \Delta_{i}:=\Delta J^{\theta}(\tilde{X}_{t_{i}})+[-cJ^{\theta}(\tilde{X}_{t_{i}})+(a_{t_{i}}-\lambda\ln\boldsymbol{\pi}(a_{t_{i}},\tilde{X}_{t_{i}}))]\Delta t,\end{array}\right.\quad i=0,1,2,\cdots.

By the same argument as in [14], we have the discrete time approximation of (6):

Mti+1θMtiθecti{ΔJθ(X~ti)[cJθ(X~ti)(atiλln𝝅(ati,X~ti))]Δt}=ectiΔi.\displaystyle\qquad M_{t_{i+1}}^{\theta}\negthinspace\negthinspace-\negthinspace M_{t_{i}}^{\theta}\approx e^{-ct_{i}}\negthinspace\big{\{}\negthinspace\Delta J^{\theta}(\tilde{X}_{t_{i}}\negthinspace)\negthinspace-\negthinspace[cJ^{\theta}(\tilde{X}_{t_{i}}\negthinspace)\negthinspace-\negthinspace(a_{t_{i}}\negthinspace-\negthinspace\lambda\ln\boldsymbol{\pi}(a_{t_{i}}\negthinspace,\negthinspace\tilde{X}_{t_{i}}\negthinspace)\negthinspace)]\Delta t\big{\}}\negthinspace=\negthinspace e^{-ct_{i}}\negthinspace\Delta_{i}. (6.3)

In what follows we summarize the updating rules for CTD(γ\gamma) method using (6.3).

CTD(0) (γ=0\gamma=0). In this case we let ξt=θJθ(X~t)\xi_{t}=\nabla_{\theta}J^{\theta}(\tilde{X}_{t}), with the updating rule:

θ(i+1)=θ(i)+α(i)(θJθ(i)(X~ti))ectiΔi.\displaystyle\theta^{(i+1)}=\theta^{(i)}+\alpha_{(i)}\big{(}\nabla_{\theta}J^{\theta^{(i)}}(\tilde{X}_{t_{i}})\big{)}e^{-ct_{i}}\Delta_{i}. (6.4)

CTD(γ\gamma) (γ(0,1]\gamma\in(0,1]) In this case we choose ξt=0tγtsθJθ(Xs)𝑑s\xi_{t}=\int_{0}^{t}\gamma^{t-s}\nabla_{\theta}J^{\theta}(X_{s})ds, with the updating rule:

\theta^{(i+1)}=\theta^{(i)}+\alpha_{(i)}\Big[\sum_{j=1}^{i}\gamma^{\Delta t(i-j)}\big(\nabla_{\theta}J^{\theta^{(i)}}(\tilde{X}_{t_{j}})\big)\Delta t\Big]e^{-ct_{i}}\Delta_{i}.\qquad(6.5)
Remark 6.2.

(i) We note that the updating rule (6.4) can be viewed as the special case of (6.5) when γ=0\gamma=0 if we make the convention that 0α:=𝟏{α=0}0^{\alpha}:={\bf 1}_{\{\alpha=0\}}, for α0\alpha\geq 0.

(ii) Although we are considering the infinite horizon case, in practice in order to prevent the infinite loop one always has to stop the iteration at a finite time. In other words, for both CTD(0) and CTD(γ\gamma), we assume that tn=Tt_{n}=T for a large T>0T>0.

(iii) The constants \alpha_{(i)} are often referred to as the learning rates for the i-th iteration. In light of the convergence conditions for stochastic approximation methods discussed at the end of the previous section, we shall choose the \alpha_{(i)}'s so that \sum_{i=0}^{\infty}\alpha_{(i)}=\infty, \sum_{i=0}^{\infty}\alpha_{(i)}^{2}<\infty to help guarantee the convergence of the algorithm.
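For concreteness, a sketch of a single CTD update consistent with (6.3)–(6.5) is given below; the functions J, gradJ and log_pi, the sampled action a_i, and the step size alpha are assumed given, and the running sum in (6.5) is maintained recursively as an eligibility-trace-like variable. This is an illustrative sketch, not the paper's exact implementation.

```python
import numpy as np

def ctd_update(theta, X_i, X_ip1, a_i, i, dt, J, gradJ, log_pi, alpha,
               c=0.02, lam=2.0, gamma=0.0, trace=None):
    """One CTD(gamma) step.  J(theta, x), gradJ(theta, x), log_pi(a, x) are user-supplied;
    `trace` carries the running sum in (6.5) and should start as zero (or None)."""
    dJ = J(theta, X_ip1) - J(theta, X_i)
    # temporal difference Delta_i, cf. (6.3)
    delta_i = dJ + (-c * J(theta, X_i) + (a_i - lam * log_pi(a_i, X_i))) * dt
    if gamma == 0.0:                                   # CTD(0), rule (6.4)
        xi = gradJ(theta, X_i)
    else:                                              # CTD(gamma), rule (6.5)
        prev = trace if trace is not None else 0.0
        trace = (gamma ** dt) * prev + gradJ(theta, X_i) * dt
        xi = trace
    theta = theta + alpha * xi * np.exp(-c * i * dt) * delta_i
    return theta, trace
```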

We observe that the convergence analysis of the above method for each fixed Δt\Delta t, coincides with that of the stochastic approximation methods. It would naturally be interesting to see the convergence with respect to Δt\Delta t. To this end, let us first define a special subspace of 𝕃𝔽2([0,T],Mθ)\mathbb{L}^{2}_{\mathbb{F}}([0,T],M^{\theta}):

H𝔽2,α([0,T],Mθ):={ξL𝔽2([0,T],Mθ):𝔼|ξtξs|2C(θ)|ts|α,C()C(Θ)},H^{2,\alpha}_{\mathbb{F}}([0,T],M^{\theta}):=\{\xi\in L^{2}_{\mathbb{F}}([0,T],M^{\theta}):~{}\mathbb{E}|\xi_{t}-\xi_{s}|^{2}\leq C(\theta)|t-s|^{\alpha},C(\cdot)\in C(\Theta)\},

where α(0,1)\alpha\in(0,1), θΘ\theta\in\Theta, and T>0T>0. The following result is adapted from [14, §4.2 Theorem 3] with an almost identical proof. We thus only state it and omit the proof.

Proposition 6.3.

Assume that \xi\in H^{2,\alpha}([0,T],M^{\theta}) for some \alpha\in(0,1], and that \theta^{*} (resp. \theta^{*}_{\Delta t}) is such that \mathbb{E}[\int_{0}^{T}\xi_{t}dM_{t}^{\theta^{*}}]=0 (resp. \mathbb{E}[\sum_{i=0}^{n-1}\xi_{t_{i}}(M^{\theta^{*}_{\Delta t}}_{t_{i+1}}-M^{\theta^{*}_{\Delta t}}_{t_{i}})]=0). Then for any sequence [\Delta t]^{(n)}\to 0 such that \lim_{n\to\infty}\theta^{*}_{[\Delta t]^{(n)}}=\bar{\theta} exists, it must hold that \bar{\theta}=\theta^{*}. Furthermore, there exists C>0 such that \mathbb{E}[\int_{0}^{T}\xi_{t}dM_{t}^{\theta^{*}_{\Delta t}}]\leq C[\Delta t]^{\frac{\alpha}{2}}.

Finally, we remark that although there are other PE methods analogous to well known TD methods (e.g., CLSTD(0)), which are particularly well suited for linear parameterization families, in this paper we are interested in parameterized families that are nonlinear in nature. Thus we shall only focus on the CTD(γ)(\gamma) methods as well as the Martingale Loss Function based PE methods developed in the previous section, (which will be referred to as the ML-algorithm in what follows) and present the detailed algorithms and numerical results in the next section.

7 Numerical Results

In this section we present the numerical results along the lines of the PE and PI schemes discussed in the previous sections. In particular, we shall consider the CTD(\gamma) methods and the ML Algorithm, and some special parametrizations based on the knowledge of the explicit solution of the original optimal dividend problem (with \lambda=0), but without specifying the market parameters \mu and \sigma. To test the effectiveness of the learning procedure, we shall use the so-called environment simulator (x^{\prime})=ENV_{\Delta t}(x,a), which takes the current state x and action a as inputs and generates the state x^{\prime} at time t+\Delta t, and we shall use the outcome of the simulator as the dynamics of X. We note that the environment simulator is problem specific, and should be created using historic data pertaining to the problem, without using the environment coefficients, which are considered unknown in the RL setting. But in our testing procedure, we shall use some dummy values of \mu and \sigma, along with the following Euler–Maruyama discretization for the SDE (2.10):

xti+1(k)=xti(k)+(μati(k))Δt+σZ,i=1,2,,\displaystyle x_{t_{i+1}}^{(k)}=x_{t_{i}}^{(k)}+(\mu-a_{t_{i}}^{(k)})\Delta t+\sigma Z,\qquad i=1,2,\cdots, (7.1)

where ZN(0,Δt)Z\sim N(0,\sqrt{\Delta t}) is a normal random variable, and at each time tit_{i}, xti+1(k)x_{t_{i+1}}^{(k)} is calculated by the environment simulator recursively via the given xti(k)x_{t_{i}}^{(k)} and ati(k)a_{t_{i}}^{(k)}, and to be specified below.
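A minimal sketch of such an environment simulator, using the dummy coefficients of this section, could look as follows; in a real application the simulator would instead be built from observed data, and the placeholder action below would be sampled from the current policy.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.4, 0.8          # dummy environment coefficients (unknown to the learner)

def ENV(x, a, dt):
    """Environment simulator ENV_{Δt}(x, a): one Euler-Maruyama step of (7.1)."""
    Z = rng.normal(0.0, np.sqrt(dt))          # Gaussian increment with variance Δt
    return x + (mu - a) * dt + sigma * Z

# example: simulate one path until ruin or the horizon T
x, dt, T, path = 3.0, 0.001, 10.0, []
for i in range(int(T / dt)):
    a = 1.5                                    # placeholder action; sampled from pi in the algorithms
    x = ENV(x, a, dt)
    path.append(x)
    if x < 0:
        break
```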

Sampling of the optimal strategy. We recall from (LABEL:optimalpi) the optimal policy function has the form 𝝅(x,w)=G(w,c~(x))\boldsymbol{\pi}^{*}(x,w)=G(w,\tilde{c}(x)) where c~(x)\tilde{c}(x) is a continuous function. It can be easily calculated that the inverse of the cumulative distribution function, denoted by (F𝝅)1\big{(}F^{\boldsymbol{\pi}}\big{)}^{-1}, is of the form:

(F𝝅)1(x,w)=λln(w(eac~(x)λ1)+1)c~(x)𝟏{c~(x)0}+aw𝟏{c~(x)=0},(w,x)[0,a]×+.\displaystyle\big{(}F^{\boldsymbol{\pi}}\big{)}^{-1}(x,w)=\frac{\lambda\ln\Big{(}w(e^{\frac{a\tilde{c}(x)}{\lambda}}-1)+1\Big{)}}{\tilde{c}(x)}{\bf 1}_{\{\tilde{c}(x)\neq 0\}}+aw{\bf 1}_{\{\tilde{c}(x)=0\}},(w,x)\in[0,a]\times\mathbb{R}^{+}.

Thus, by the inversion method, if 𝐔U[0,1]{\bf U}\sim U[0,1], the uniform distribution on [0,1][0,1], then the random variable 𝝅^(𝐔,x):=(F𝝅)1(x,𝐔)𝝅(,x)\hat{\boldsymbol{\pi}}({\bf U},x):=\big{(}F^{\boldsymbol{\pi}}\big{)}^{-1}(x,{\bf U})\sim\boldsymbol{\pi}^{*}(\cdot,x), and we need only sample 𝐔{\bf U}, which is much simpler.
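A sketch of this sampling step, with a hypothetical \tilde{c}(\cdot), is given below; in the algorithms \tilde{c}(x)=1-J^{\theta}_{x}(x).

```python
import numpy as np

a, lam = 3.0, 2.0
rng = np.random.default_rng(1)

def sample_action(c_tilde, x):
    """Sample a ~ pi*(., x) by the inversion method, using the closed-form
    inverse CDF (F^pi)^{-1} above."""
    u = rng.uniform()
    c = c_tilde(x)
    if abs(c) < 1e-12:
        return a * u                                   # uniform case when c_tilde(x) = 0
    return lam * np.log(u * (np.exp(a * c / lam) - 1.0) + 1.0) / c

# example with a hypothetical c_tilde
action = sample_action(lambda x: 1.0 - np.exp(-x), 3.0)
```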

Parametrization of the cost functional. The next step is to choose the parametrization of JθJ^{\theta}. In light of the well-known result (cf. e.g., [1]), we know that if (μ,σ)(\mu,\sigma) are given, and β=ac1β3>0\beta=\frac{a}{c}-\frac{1}{\beta_{3}}>0, (thanks to Assumption 2.4), the classical solution for the optimal dividend problem is given by

V(x)=K(eβ1xeβ2x)𝟏{xm}+[aceβ3(xm)β3]𝟏{x>m}.\displaystyle V(x)=K(e^{\beta_{1}x}-e^{-\beta_{2}x}){\bf 1}_{\{x\leq m\}}+\Big{[}\frac{a}{c}-\frac{e^{-\beta_{3}(x-m)}}{\beta_{3}}\Big{]}{\bf 1}_{\{x>m\}}. (7.2)

where K=\frac{\beta}{e^{\beta_{1}m}-e^{-\beta_{2}m}}, \beta_{1,2}=\frac{\mp\mu+\sqrt{2c\sigma^{2}+\mu^{2}}}{\sigma^{2}}, \beta_{3}=\frac{\mu-a+\sqrt{2c\sigma^{2}+(a-\mu)^{2}}}{\sigma^{2}}, and m=\frac{\log(\frac{1+\beta\beta_{2}}{1-\beta\beta_{1}})}{\beta_{1}+\beta_{2}}. We should note that the threshold m>0 in (7.2) is the most critical quantity in the value function, as it determines the switching barrier of the optimal dividend rate. That is, the optimal dividend rate is of the “bang-bang” form: \alpha_{t}=a{\bf 1}_{\{X(t)>m\}}, where X is the reserve process (see, e.g., [1]). We therefore consider the following two parametrizations based on the initial state x=X_{0}.
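For reference, the following sketch evaluates the classical solution (7.2); with the dummy parameters used later in this section it returns approximately the benchmark values m\approx 4.78, V(3)\approx 17.95 and V(10)\approx 24.99 quoted below.

```python
import numpy as np

def classical_solution(mu, sigma, a, c):
    """Coefficients and value function of the classical solution (7.2)."""
    s2 = sigma ** 2
    b1 = (-mu + np.sqrt(2 * c * s2 + mu ** 2)) / s2
    b2 = ( mu + np.sqrt(2 * c * s2 + mu ** 2)) / s2
    b3 = (mu - a + np.sqrt(2 * c * s2 + (a - mu) ** 2)) / s2
    beta = a / c - 1.0 / b3
    m = np.log((1 + beta * b2) / (1 - beta * b1)) / (b1 + b2)
    K = beta / (np.exp(b1 * m) - np.exp(-b2 * m))
    def V(x):
        if x <= m:
            return K * (np.exp(b1 * x) - np.exp(-b2 * x))
        return a / c - np.exp(-b3 * (x - m)) / b3
    return V, m

V, m = classical_solution(mu=0.4, sigma=0.8, a=3.0, c=0.02)
print(m, V(3.0), V(10.0))   # roughly 4.78, 17.95, 24.99 for the dummy parameters of this section
```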

(i) x<mx<m. By (7.2) we use the approximation family:

Jθ(x)=θ3(eθ1xeθ2x),\displaystyle J^{\theta}(x)=\theta_{3}(e^{\theta_{1}x}-e^{-\theta_{2}x}), (7.3)
θ1[4c(1+5)a,1];θ2[1+4c(1+5)a,2];θ3[15,16],\displaystyle\theta_{1}\in\Big{[}\frac{4c}{(1+\sqrt{5})a},1\Big{]};~{}\theta_{2}\in\Big{[}1+\frac{4c}{(1+\sqrt{5})a},2\Big{]};~{}\theta_{3}\in[15,16],

where \theta_{1},\theta_{2},\theta_{3} represent \beta_{1},\beta_{2} and K of the classical solution, respectively. In particular, the bounds for \theta_{1} and \theta_{2} are due to the fact that \beta_{1}\in[\frac{4c}{(1+\sqrt{5})a},1] and \beta_{2}\in[1+\frac{4c}{(1+\sqrt{5})a},\infty) under Assumption 2.4. We should note that these bounds alone are not sufficient for the algorithms to converge, and we actually enforced some additional bounds. In practice, the ranges of \theta_{2} and \theta_{3} should be obtained from historical data for this method to be effective in real life applications.

Finally, it is worth noting that (7.2) actually implies that μ=c(β2β1)β2β1\mu=\frac{c(\beta_{2}-\beta_{1})}{\beta_{2}\beta_{1}} and σ2=2cβ2β1\sigma^{2}=\frac{2c}{\beta_{2}\beta_{1}}. We can therefore approximate μ,σ\mu,\sigma by c(θ2θ1)θ2θ1\frac{c(\theta_{2}^{*}-\theta_{1}^{*})}{\theta_{2}^{*}\theta_{1}^{*}} and 2cθ2θ1\sqrt{\frac{2c}{\theta_{2}^{*}\theta_{1}^{*}}}, respectively, whenever the limit θ\theta^{*} can be obtained. The threshold mm can then be approximated via μ\mu and σ\sigma as well.

(ii) x>mx>m. Again, in this case by (7.2) we choose

Jθ(x)=acθ1θ2eθ2x,θ1[1,2],θ2[ca,2ca],\displaystyle J^{\theta}(x)=\frac{a}{c}-\frac{\theta_{1}}{\theta_{2}}e^{-\theta_{2}x},\quad\theta_{1}\in[1,2],~{}\theta_{2}\in\Big{[}\frac{c}{a},\frac{2c}{a}\Big{]}, (7.4)

where θ1,θ2\theta_{1},\theta_{2} represent (em)β3(e^{m})^{\beta_{3}} and β3\beta_{3} respectively, and the bounds for θ2\theta_{2} are the bounds of parameter β3\beta_{3} in (7.2). To obtain an upper bound of θ1\theta_{1}, we note that θ1θ2ac\theta_{1}\leq\frac{\theta_{2}a}{c} is necessary to ensure Jθ(x)>0J^{\theta}(x)>0 for each x>0x>0, and thus the upper bound of θ2\theta_{2} leads to that of θ1\theta_{1}. For the lower bound of θ1\theta_{1}, note that em>1e^{m}>1 and hence so is (em)β3(e^{m})^{\beta_{3}}. Using Jxθ(m)=1J^{\theta^{*}}_{x}(m)=1, we approximate mm by ln(θ1)θ2\frac{\ln(\theta_{1})}{\theta_{2}}.
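A sketch of the two parametrization families and their gradients (both are inputs to Algorithms 1 and 2), together with the recovery of \mu, \sigma and m described above, is given below; the numerical values of \theta^{*} used there are purely illustrative.

```python
import numpy as np

c, a = 0.02, 3.0

# family (i): J^theta(x) = theta3 (e^{theta1 x} - e^{-theta2 x}), cf. (7.3)
def J1(theta, x):
    t1, t2, t3 = theta
    return t3 * (np.exp(t1 * x) - np.exp(-t2 * x))

def gradJ1(theta, x):
    t1, t2, t3 = theta
    return np.array([t3 * x * np.exp(t1 * x),          # d/d theta1
                     t3 * x * np.exp(-t2 * x),         # d/d theta2
                     np.exp(t1 * x) - np.exp(-t2 * x)]) # d/d theta3

# family (ii): J^theta(x) = a/c - (theta1/theta2) e^{-theta2 x}, cf. (7.4)
def J2(theta, x):
    t1, t2 = theta
    return a / c - (t1 / t2) * np.exp(-t2 * x)

def gradJ2(theta, x):
    t1, t2 = theta
    return np.array([-np.exp(-t2 * x) / t2,
                     (t1 / t2 ** 2 + t1 * x / t2) * np.exp(-t2 * x)])

# recovering the environment from a learned theta* (illustrative values only)
t1, t2, t3 = 0.05, 1.30, 15.8           # hypothetical limit for family (i)
mu_hat    = c * (t2 - t1) / (t1 * t2)   # mu = c(beta2 - beta1)/(beta1 beta2)
sigma_hat = np.sqrt(2 * c / (t1 * t2))  # sigma^2 = 2c/(beta1 beta2)
s1, s2 = 1.04, 0.0077                   # hypothetical limit for family (ii)
m_hat = np.log(s1) / s2                 # from J^theta_x(m) = 1
```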

Remark 7.1.

The parametrization above depends heavily on the knowledge of the explicit solution of the classical optimal control problem. In general, it is natural to consider using the viscosity solution of the entropy regularized relaxed control problem as the basis for the parameterization family. However, although we did identify both viscosity super- and sub-solutions in (3.9), we found that the specific super-solution does not work effectively, due to the computational complexities resulting from the piecewise nature of the function, as well as the complicated nature of the bounds of the parameters involved (see (3.12)); whereas the viscosity sub-solution, being a simple function independent of all the parameters we consider, does not seem to be an effective choice for a parameterization family in this case either. We shall leave the study of effective parametrizations using viscosity solutions to our future research.

In the following two subsections we summarize our numerical experiments following the analysis so far. For testing purpose, we choose “dummy” parameters a=3,μ=0.4,σ=0.8a=3,\mu=0.4,\sigma=0.8 and c=0.02c=0.02, so that Assumption 2.4 holds. We use T=10T=10 to limit the number of iterations, and we observe that on average the ruin time of the path simulations occurs in the interval [0,10][0,10]. We also use the error bound ϵ=108\epsilon=10^{-8}, and make the convention that d0d\sim 0 whenever |d|<ϵ|d|<\epsilon.

7.1 CTD(γ)CTD(\gamma) methods

Data: Initial state x0,Time horizon T, time scale Δt,K=TΔtx_{0},\textit{Time horizon }T,\textit{ time scale }\Delta t,K=\frac{T}{\Delta t} , Initial temperature λ\lambda, Initial learning rate α\alpha, functional forms of l(.),p(.),Jθ(.),θJθ(.),c~(.)=1Jxθ(.)l(.),p(.),J^{\theta}(.),\nabla_{\theta}J^{\theta}(.),\tilde{c}(.)=1-J_{x}^{\theta}(.), number of simulated paths MM, Variable szsz , an environment simulator ENVΔt(t,x,a).ENV_{\Delta t}(t,x,a).
Learning Procedure Initialize θ\theta , m=1m=1 and set Var=θVar=\theta.
while m<Mm<M do
       Set θ=Var\theta=Var.
       if mod(m1,sz)=0\mod(m-1,sz)=0 AND m>1m>1 then
            Compute and store Am=Average(θ)A_{m}=Average(\theta^{*}) over the last szsz iterations.
             Set Var=AmVar=A_{m}.
             if m>szm>sz then
                   End iteration if the absolute difference DA=|AmA(msz)|<ϵDA=|A_{m}-A_{(m-sz)}|<\epsilon.
             end if
            
       end if
      Initialize j=0j=0.
       Observe x0x_{0} and store xtjx0x_{t_{j}}\Leftarrow x_{0}.
      
      while j<Kj<K do
             λ=λl(j)\lambda=\lambda l(j).
            
            Compute 𝝅(.,xtj)=G(.,1JxVar(xtj))\boldsymbol{\pi}(.,x_{t_{j}})=G(.,1-J_{x}^{Var}(x_{t_{j}})) and generate action atj𝝅(.,xtj).a_{t_{j}}\sim\boldsymbol{\pi}(.,x_{t_{j}}).
             Apply atja_{t_{j}} to ENVΔt(tj,xtj,atj)ENV_{\Delta t}(t_{j},x_{t_{j}},a_{t_{j}}) to observe and store xtj+1x_{t_{j+1}} .
             end iteration if xtj+1<ϵx_{t_{j+1}}<\epsilon.
Compute \Delta\theta=\big(\nabla_{\theta}J^{\theta}(x_{t_{j}})\big)e^{-ct_{j}}\Delta_{j}.
            
            end iteration if Δθ2<ϵ\|\Delta\theta\|_{2}<\epsilon .
             Update θθ+αp(j)Δθ\theta\leftarrow\theta+\alpha p(j)\Delta\theta.
             Update jj+1j\leftarrow j+1.
       end while
      Set θ=θ\theta^{*}=\theta and update mm+1m\leftarrow m+1.
end while
Set θ=Am\theta^{*}=A_{m}.
Algorithm 1 CTD(0)CTD(0) Algorithm

In Algorithm 1 we carry out the PE procedure using the CTD(0) method. We choose \lambda=\lambda(j) as a function of the iteration number: \lambda(j)=2l(j)=2(0.2)^{j\Delta t}\geq 2\cdot 0.2^{T}=2.048\times 10^{-7}. This particular function is chosen so that \lambda\to 0 and the entropy regularized control problem converges to the classical problem, but \lambda is still bounded away from 0 so as to ensure that \boldsymbol{\pi} is well defined. We shall initialize the learning rate at 1 and decrease it using the function p(j)=1/j so as to ensure that the conditions \sum_{i=0}^{\infty}\alpha_{i}=\infty and \sum_{i=0}^{\infty}(\alpha_{i})^{2}<\infty are satisfied.

We note that Algorithm 1 is designed as a combination of online learning and the so-called batch learning: it updates the parameter \theta at each temporal partition point, but only updates the policy after a certain number (the parameter “sz” in Algorithm 1) of path simulations. This particular design is to allow the PE method to better approximate J(\cdot,\boldsymbol{\pi}) before updating \boldsymbol{\pi}.

Convergence Analysis. To analyze the convergence as Δt0\Delta t\to 0, we consider Δt=0.005\Delta t=0.005, 0.001, 0.0005, 0.0001, 0.00005, respectively. We take M=40000M=40000 path simulations and sz=250sz=250 in the implementation. Note that with the choice of dummy parameters a,μa,\mu and σ\sigma, the classical solution is given by m=4.7797m=4.7797 , V(3)=17.9522V(3)=17.9522 and V(10)=24.9940V(10)=24.9940. We thus consider two parameterization families, for initial values x=3<mx=3<m and x=10>mx=10>m respectively.

Table 1: Results for the CTD(0) method

                 family (i)                      family (ii)
             x=3            x=10            x=3             x=10
Δt       J^{θ*}   m     J^{θ*}   m      J^{θ*}   m      J^{θ*}   m
0.01     15.49   5.383  31.276  3.476   32.489  19.359  55.667   9.635
0.005    17.188  4.099  22.217  4.292   31.108  18.532  53.262  11.942
0.001    16.68   4.474  23.082  3.931   37.58   11.925  60.948  10.691
0.0005   16.858  4.444  23.079  4.049   40.825  11.797  65.179  10.5
0.0001   17.261  4.392  23.094  4.505   38.899  18.994  55.341  18.142
Table 2: Convergence results for the CTD(0) method: plots of J^{θ*} w.r.t. Δt and of m w.r.t. Δt (figures not reproduced here), obtained using family (i) for x=3 and x=10, and using family (ii) for x=3 and x=10.

Case 1. x=3<m. As we can observe from Tables 1 and 2, in this case the approximation (7.3) (family (i)) shows a reasonably satisfactory level of convergence towards the known classical solution values of J(x_{0}) and m as \Delta t\to 0, despite some mild fluctuations. We believe that such fluctuations are due to the randomness of the data we observe, and that averaging over the sz paths in our algorithm reduced the occurrence of these fluctuations to a satisfactory level. As we can see, despite the minor anomalies, the general trajectory of these graphs tends towards the classical solution as \Delta t\to 0. We should also observe that using family (ii) (7.4) does not produce any satisfactory convergent results. But this is as expected, since the function (7.4) is based on the classical solution for x>m.

Case 2. x=10>m. Even though the family (7.3) is based on the classical solution for x<m, as we can see from Tables 1 and 2, the algorithm using family (7.3) converges to the values of the classical solution even in the case x=10>m, whereas the algorithm using family (ii) (7.4) does not. While a bit counterintuitive, this is actually not entirely unexpected, since the state process generally reaches 0 within the considered time interval, and the parameterization (7.4) is not suitable once the value of the state falls below m. Consequently, it seems that the parameterization (7.3) suits the CTD(0) method better, regardless of the initial value.

Finally, we would like to point out that the case for CTD(γ)CTD(\gamma) methods for γ0\gamma\neq 0 is much more complicated, and the algorithms are computationally much slower than CTD(0)CTD(0) method. We believe that the proper choice of the learning rate in this case warrants some deeper investigation, but we prefer not to discuss these issues in this paper.

7.2 The ML Algorithm

In Algorithm 2 we present the so-called ML-algorithm, in which we use a batch learning approach: we update the parameter \theta at the end of each simulated path, using the information from the time interval [0,T\wedge\tau_{x}]. We use M=40000 path simulations and initial temperature \lambda=2. In the m-th simulated path we decrease the temperature parameter to \lambda=2\times(0.9)^{m}. We also initialize the learning rate at 1 and decrease it using the function p(m)=1/m. Finally, we consider \Delta t=0.005,0.001,0.0005,0.0001, respectively, to examine the convergence as \Delta t\to 0.

Data: Initial state x0,Time horizon T, time scale Δt,K=TΔtx_{0},\textit{Time horizon }T,\textit{ time scale }\Delta t,K=\frac{T}{\Delta t} , Initial temperature λ\lambda, Initial learning rate α\alpha, functional forms of l(.),p(.),Jθ(.),θJθ(.),c~(.)=1Jxθ(.)l(.),p(.),J^{\theta}(.),\nabla_{\theta}J^{\theta}(.),\tilde{c}(.)=1-J_{x}^{\theta}(.), number of simulated paths MM, Variable szsz , an environment simulator ENVΔt(t,x,a).ENV_{\Delta t}(t,x,a).
Learning Procedure Initialize θ\theta , m=1m=1 .
while m<Mm<M do
      
      λ=λl(m)\lambda=\lambda l(m).
      
      if mod(m1,sz)=0\mod(m-1,sz)=0 AND m>1m>1 then
            Compute and store Am=Average(θ)A_{m}=Average(\theta) over the last szsz iterations.
             if m>szm>sz then
                   End iteration if the absolute difference DA=|AmA(msz)|<ϵDA=|A_{m}-A_{(m-sz)}|<\epsilon.
             end if
            
       end if
      Initialize k=0k=0, observe x0x_{0} and store xtkx0x_{t_{k}}\Leftarrow x_{0}.
      
      while k<Kk<K do
            
            Compute 𝝅(.,xtk)=G(.,1Jxθ(xtk))\boldsymbol{\pi}(.,x_{t_{k}})=G(.,1-J_{x}^{\theta}(x_{t_{k}})) and generate action atk𝝅(.,xtk).a_{t_{k}}\sim\boldsymbol{\pi}(.,x_{t_{k}}).
             Apply atka_{t_{k}} to ENVΔt(tk,xtk,atk)ENV_{\Delta t}(t_{k},x_{t_{k}},a_{t_{k}}) to observe and store xtk+1x_{t_{k+1}} .
             end iteration if xtk+1<ϵx_{t_{k+1}}<\epsilon.
             observe and store Jθ(xtk+1),θJθ(xtk+1)J^{\theta}(x_{t_{k+1}}),\nabla_{\theta}J^{\theta}(x_{t_{k+1}}) .
            
            Update kk+1k\leftarrow k+1.
       end while
Compute \Delta\theta using the ML-algorithm and update \theta\leftarrow\theta-\alpha p(m)\Delta\theta.
       Update mm+1m\leftarrow m+1.
end while
Set θ=Am\theta^{*}=A_{m} .
Algorithm 2 ML Algorithm
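To make the step "Compute Δθ using the ML algorithm" more concrete, the following Python sketch implements one plausible form of the batch update: it treats the realized discounted payoff-to-go along the stored path as the target for J^θ(x_{t_k}) and returns the gradient of the corresponding squared (martingale) loss. The discount rate c, the entropy contribution of π in the running reward, and the array layout are our own illustrative assumptions and should be adapted to the exploratory cost functional defined earlier in the paper.

import numpy as np

def ml_gradient(xs, acts, ents, J, grad_J, theta, dt, c, lam):
    """
    Batch (martingale-loss) gradient for one simulated path -- a sketch, not
    the paper's exact formula.

    xs     : states x_{t_0}, ..., x_{t_K} observed along the path
    acts   : actions a_{t_0}, ..., a_{t_{K-1}} generated from pi(., x_{t_k})
    ents   : differential entropies of pi(., x_{t_k}) (exploration bonus, assumed stored)
    J      : callable, J(theta, x) -> parameterized value J^theta(x)
    grad_J : callable, grad_J(theta, x) -> gradient of J^theta(x) w.r.t. theta
    dt, c, lam : time step, (assumed) discount rate, temperature
    """
    K = len(acts)
    # running reward of the exploratory problem: dividend rate + lam * entropy
    rewards = np.array([acts[k] + lam * ents[k] for k in range(K)])
    grad = np.zeros_like(np.asarray(theta, dtype=float))
    for k in range(K):
        # realized discounted payoff-to-go from t_k (target for J^theta(x_{t_k}))
        disc = np.exp(-c * dt * np.arange(K - k))
        G_k = np.sum(disc * rewards[k:]) * dt
        # gradient of (1/2) * sum_k (G_k - J^theta(x_{t_k}))^2 * dt
        grad += -(G_k - J(theta, xs[k])) * grad_J(theta, xs[k]) * dt
    return grad   # used as Delta-theta in:  theta <- theta - alpha * p(m) * Delta-theta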

Using parameterized family (i), with both initial values x = 3 and x = 10 we obtain the optimal θ*_i as the lower bound of each parameter θ_i, for i = 1, 2, 3. Using parameterized family (ii), with both initial values x = 3 and x = 10 we obtain the optimal θ*_i as the average of the lower and upper bounds of each parameter θ_i, i = 1, 2, since in each iteration θ_i is updated to the upper and lower boundaries alternately. This is due to the fact that the learning rate 1/m is too large for this particular algorithm. Decreasing the learning rate results in optimal θ values that are away from the boundaries, but in these cases the algorithms were empirically shown not to converge, and thus the final result depends on the number of iterations M.

In general, the reason for this could be the loss of efficiency caused by decreasing the learning rate, since gradient descent algorithms are generally sensitive to learning rates. Specific to our problem, among many possible reasons, we believe that the limiting behavior of the optimal strategy as λ → 0 is a serious issue, as π is not well defined when λ = 0 and a Dirac δ-measure is supposed to be involved. Furthermore, the "bang-bang" nature and the jump of the optimal control could also affect the convergence of the algorithm. Finally, the algorithm seems to be quite sensitive to the value of m, since the value function V(x) is a piecewise smooth function depending on m. Thus, to rigorously analyze the effectiveness of the ML-algorithm with parameterization families (i) and (ii), further empirical analysis is needed, which involves finding effective learning rates.
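To illustrate the last point, the sketch below samples an action from the exploratory policy under the assumption (ours, not restated here) that G(·, y) is the truncated exponential (Gibbs) density on [0, a] with exponent proportional to y/λ, where y = 1 − J^θ_x(x); the function name and numerical guards are purely illustrative. It makes the degeneracy explicit: as λ → 0 the density collapses to a Dirac mass at one of the endpoints, mirroring the bang-bang control.

import numpy as np

def gibbs_policy_sample(y, lam, a_max, rng=None):
    """
    Sample a dividend rate from the assumed truncated exponential density
        pi(u) = b * exp(b * u) / (exp(b * a_max) - 1),   u in [0, a_max],
    with b = y / lam and y = 1 - J^theta_x(x).
    """
    rng = rng or np.random.default_rng()
    b = y / lam
    if abs(b) < 1e-12:               # nearly uniform when the exponent vanishes
        return rng.uniform(0.0, a_max)
    u = rng.uniform()
    # inverse transform of the truncated exponential CDF on [0, a_max]
    return np.log(1.0 + u * (np.exp(b * a_max) - 1.0)) / b

# As lam -> 0 the mass concentrates at a_max if y > 0 and at 0 if y < 0,
# i.e., the policy degenerates to the bang-bang dividend strategy.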

All these issues call for further investigation, but based on our numerical experiments we can nevertheless conclude that the CTD(0) method using parameterization family (i) is effective in finding the value m and V(x), provided that effective upper and lower bounds for the parameters can be identified from historical data.

References

  • [1] Søren Asmussen and Michael Taksar, Controlled diffusion models for optimal dividend pay-out, Insurance Math. Econom. 20 (1997), no. 1, 1–15.
  • [2] Lihua Bai and Jin Ma, Optimal investment and dividend strategy under renewal risk model, SIAM J. Control Optim. 59 (2021), no. 6, 4590–4614.
  • [3] Lihua Bai and Jostein Paulsen, Optimal dividend policies with transaction costs for a class of diffusion processes, SIAM J. Control Optim. 48 (2010), no. 8, 4987–5008.
  • [4] Jun Cai, Hans U. Gerber, and Hailiang Yang, Optimal dividends in an Ornstein-Uhlenbeck type model with credit and debit interest, N. Am. Actuar. J. 10 (2006), no. 2, 94–119.
  • [5] Tahir Choulli, Michael Taksar, and Xun Yu Zhou, A diffusion model for optimal dividend distribution for a company with constraints on risk control, SIAM J. Control Optim. 41 (2003), no. 6, 1946–1979.
  • [6] Earl A. Coddington and Norman Levinson, Theory of ordinary differential equations, McGraw-Hill Book Co., Inc., New York-Toronto-London, 1955.
  • [7] Michael G. Crandall, Hitoshi Ishii, and Pierre-Louis Lions, User’s guide to viscosity solutions of second order partial differential equations, Bull. Amer. Math. Soc. (N.S.) 27 (1992), no. 1, 1–67.
  • [8] Tiziano De Angelis and Erik Ekström, The dividend problem with a finite horizon, Ann. Appl. Probab. 27 (2017), no. 6, 3525–3546.
  • [9] B. De Finetti, Su un’impostazione alternativa della teoria collettiva del rischio, Transactions of the 15th congress of actuaries, New York 2 (1957), 433–443.
  • [10] James Ferguson, A brief survey of the history of the calculus of variations and its applications, 2004.
  • [11] David Gilbarg and Neil S. Trudinger, Elliptic partial differential equations of second order, Classics in Mathematics, Springer-Verlag, Berlin, 2001, Reprint of the 1998 edition.
  • [12] Yu-Jui Huang, Zhenhua Wang, and Zhou Zhou, Convergence of policy improvement for entropy-regularized stochastic control problems, 2023.
  • [13] Saul D. Jacka and Aleksandar Mijatović, On the policy improvement algorithm in continuous time, Stochastics 89 (2017), no. 1, 348–359.
  • [14] Yanwei Jia and Xun Yu Zhou, Policy evaluation and temporal-difference learning in continuous time and space: a martingale approach, J. Mach. Learn. Res. 23 (2022), Paper No. [154], 55.
  • [15] Yanwei Jia and Xun Yu Zhou, Policy gradient and actor-critic learning in continuous time and space: theory and algorithms, J. Mach. Learn. Res. 23 (2022), Paper No. [275], 50.
  • [16] B. Kerimkulov, D. Šiška, and L. Szpruch, A modified MSA for stochastic control problems, Appl. Math. Optim. 84 (2021), no. 3, 3417–3436.
  • [17] Bekzhan Kerimkulov, David Šiška, and Lukasz Szpruch, Exponential convergence and stability of Howard’s policy improvement algorithm for controlled diffusions, SIAM J. Control Optim. 58 (2020), no. 3, 1314–1340.
  • [18] O. A. Ladyženskaja, V. A. Solonnikov, and N. N. Ural’ceva, Linear and quasilinear equations of parabolic type, AMS, Providence, RI, 1968.
  • [19] M. L. Puterman, On the convergence of policy iteration for controlled diffusions, J. Optim. Theory Appl. 33 (1981), no. 1, 137–144.
  • [20] S. E. Shreve, J. P. Lehoczky, and D. P. Gaver, Optimal consumption for general diffusions with absorbing and reflecting barriers, SIAM J. Control Optim. 22 (1984), no. 1, 55–75.
  • [21] Stefan Thonhauser and Hansjörg Albrecher, Dividend maximization under consideration of the time value of ruin, Insurance Math. Econom. 41 (2007), no. 1, 163–184.
  • [22] Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou, Reinforcement learning in continuous time and space: a stochastic control approach, J. Mach. Learn. Res. 21 (2020), Paper No. 198, 34.
  • [23] Haoran Wang and Xun Yu Zhou, Continuous-time mean-variance portfolio selection: a reinforcement learning framework, Math. Finance 30 (2020), no. 4, 1273–1308.
  • [24] Jiongmin Yong and Xun Yu Zhou, Stochastic controls, Applications of Mathematics (New York), vol. 43, Springer-Verlag, New York, 1999, Hamiltonian systems and HJB equations.
  • [25] Jinxia Zhu and Hailiang Yang, Optimal financing and dividend distribution in a general diffusion model with regime switching, Adv. in Appl. Probab. 48 (2016), no. 2, 406–422.