Reinforcement Learning for the Optimal Dividend Problem under a Diffusion Model
Lihua Bai, Thejani Gamage, Jin Ma, Pengxu Xie
School of Mathematics, Nankai University, Tianjin, 300071, China. Email: lbai@nankai.edu.cn. This author is supported in part by Chinese NSF grants #11931018, #12272274, and #12171257.
Department of Mathematics, University of Southern California, Los Angeles, CA 90089. Email: gamage@usc.edu.
Department of Mathematics, University of Southern California, Los Angeles, CA 90089. Email: jinma@usc.edu. This author is supported in part by NSF grants #DMS-1908665 and 2205972.
School of Mathematics, Nankai University, Tianjin, 300071, China. Email: 1120180026@mail.nankai.edu.cn.
Abstract
In this paper, we study the optimal dividend problem under the continuous time diffusion model with the dividend rate restricted to a given interval. Unlike the standard literature, we shall be particularly interested in the case when the parameters (e.g., drift and diffusion coefficients) of the model are not specified, so that the optimal control cannot be explicitly determined. We therefore follow the recently developed method via Reinforcement Learning (RL) to find the optimal strategy. Specifically, we shall design a corresponding RL-type entropy-regularized exploratory control problem, which randomizes the control actions and balances exploitation and exploration. We shall first carry out a theoretical analysis of the new relaxed control problem and prove that the value function is the unique bounded classical solution to the corresponding HJB equation. We will then use a policy improvement argument, along with policy evaluation devices (e.g., the Temporal Difference (TD)-based algorithm and the Martingale Loss (ML)-based algorithms), to construct approximating sequences of the optimal strategy.
We present some numerical results using different parametrization families for the cost functional to illustrate the effectiveness of the approximation schemes.
Keywords. Optimal dividend problem, entropy-regularized exploratory control problem, policy improvement, policy evaluation, temporal difference (TD) algorithm, martingale loss (ML).
The problem of maximizing the cumulative discounted dividend payment can be traced back to the work of de Finetti [9].
Since then the problem has been widely studied in the literature under different models, and in many cases the problem can be explicitly solved when the model parameters are known. In particular, for the optimal dividend problem and its many variations in continuous time under diffusion models, we refer to the works of, among others, [1, 3, 4, 5, 8, 21, 25] and the references cited therein.
The main motivation of this paper is to study the optimal dividend problems in which the model parameters are not specified so the optimal control cannot be explicitly determined. Following the recently developed method using Reinforcement Learning (RL), we shall try to find the optimal strategy for a corresponding entropy-regularized control problem and solve it along the lines of both policy improvement and evaluation schemes.
The method of using reinforcement learning to solve discrete Markov decision problems has been well studied, but the extension of these concepts to the continuous time and space setting is still fairly new. Roughly speaking, in RL the learning agent uses a sequence of trial and errors to simultaneously identify the environment or the model parameters, and to determine the optimal control and optimal value function. Such a learning process has been characterized by a mixture of
exploration and exploitation, which repeatedly tries new actions to improve the collective outcomes.
A critical point in this process is to balance the exploration and exploitation levels since the former
is usually computationally expensive and time consuming, while the latter may lead to suboptimal outcomes. In RL theory a typical idea to balance exploration and exploitation in an optimal control problem is to
“randomize” the control action and add a (Shannon) entropy term to the cost functional, weighted by a temperature parameter. By maximizing the entropy one encourages exploration, and by decreasing the temperature parameter one gives more weight to exploitation. The resulting optimal control problem is often referred to as the entropy-regularized exploratory optimal control problem, which will be the starting point of our investigation.
As in any reinforcement learning scheme, we shall solve the entropy-regularized exploratory optimal dividend problem via a sequence of Policy Evaluation (PE) and Policy Improvement (PI) procedures. The former evaluates the cost functional for a given policy, and the latter produces a new policy that is “better” than the current one. We note that the idea of Policy Improvement Algorithms (PIA) as well as their convergence analysis is not new in the numerical optimal control literature (see, e.g., [13, 17, 16, 19]). The main difference in the current RL setting is the involvement of the entropy regularization, which causes some technical complications in the convergence analysis. In the continuous time entropy-regularized exploratory control problem with diffusion models a successful convergence analysis of PIA was first established for a particular Linear-Quadratic (LQ) case in [23], in which
the exploratory HJB equation (i.e., the HJB equation corresponding to the entropy-regularized problem) can be directly solved, and the Gaussian nature of the optimal exploratory control is known. A more general case was recently investigated in [12], in which the convergence of PIA is proved in a general infinite horizon setting, without requiring the knowledge of the explicit form of the optimal control. The problem studied in this paper is very close to the one in [12], but not identical. While some of the analysis in this paper benefits from the fact that the spatial variable is one dimensional, there are particular technical subtleties caused by the presence of the ruin time, even though the problem is essentially an infinite horizon one, like the one studied in [12].
There are two main issues that this paper will focus on. The first is to design PE and PI algorithms that are suitable for continuous time optimal dividend problems. We shall follow some of the “popular” schemes in RL, such as the well-understood Temporal Difference (TD) methods, combined with the so-called martingale approach, to design the PE learning procedure.
Two technical points are worth noting: 1) since the cost functional involves ruin time, and the observation of the ruin time of the state process is sometimes practically impossible (especially in the cases where ruin time actually occurs beyond the time horizon we can practically observe), we shall propose algorithms that are insensitive to the ruin time; 2) although the infinite horizon nature of the problem somewhat prevents the use of the so-called “batch” learning method, we shall nevertheless try to study the temporally “truncated” problem so that the batch learning method can be applied. It should also be noted that one of the main difficulties in PE methods is to find an effective parameterization family of functions from which the best approximation for the cost functional is chosen, and the choice of the parameterization family directly affects the accuracy of the approximation. Since there are no proven standard methods of finding a suitable parameterization family, except for the LQ (Gaussian) case when the optimal value function is explicitly known, we shall use the classical “barrier”-type (restricted) optimal dividend strategy in [1] to propose the parametrization family, and carry out numerical experiments using the corresponding families.
The second main issue is the convergence analysis of the PIA. Similar to [12], in this paper we focus on the regularity analysis on the solution to the exploratory HJB equation and some related PDEs.
Compared to the heavy PDE arguments in [12], we shall take advantage of the fact that in this paper the state process is one dimensional and takes nonnegative values, so that some stability arguments for 2-dimensional first-order nonlinear systems can be applied to conclude that the exploratory HJB equation has a concave, bounded classical solution, which coincides with the viscosity solution (of class (L)) of the HJB equation and the value function of the optimal dividend problem.
With the help of these regularity results, we prove the convergence of PIA to the value function along the line of [12] but with a much simpler argument.
The rest of the paper is organized as follows. In §2 we give the preliminary description of the problem and all the necessary notations, definitions, and assumptions. In §3 we study the value function and its regularity, and prove that it is a concave, bounded classical solution to the exploratory HJB equation. In §4 we study the issue of policy update. We shall introduce our PIA and prove its convergence.
In §5 and §6 we discuss the methods of Policy Evaluation, that is, the methods for approximating the cost functional for a given policy, using a martingale loss function based approach and (online) CTD(λ) methods, respectively.
In §7 we propose parametrization families for PE and present numerical experiments using the proposed PI and PE methods.
2 Preliminaries and Problem Formulation
Throughout this paper we consider a filtered probability space on which is defined a standard Brownian motion . We assume that the filtration , with the usual augmentation so that it satisfies the usual conditions. For any metric space with topological Borel sets , we denote to be all -measurable functions, and , , to be the space of -th integrable functions. The spaces
and , , etc., are defined in the usual ways. Furthermore, for a given domain , we denote to be the space of all -th order continuously differentiable functions on , and . In particular, for , we denote to be the space of all bounded and -th continuously differentiable functions on with all derivatives being bounded.
Consider the following simplest diffusion approximation of a Cramer-Lundberg model with dividend:
(2.1)
where is the initial state, and are constants determined by the premium rate and the claim frequency and size (cf., e.g., [1]), and is the dividend rate at time . We denote if necessary, and say that
is admissible if it is -adapted and takes values in
a given “action space” . Furthermore, let us define the ruin time to be . Clearly,
, and the problem is considered “ruined” as no dividend will be paid after . Our aim is to maximize the expected total discounted dividend given the initial condition :
(2.2)
where is the discount rate, and is the set of admissible dividend rates taking values in . The problem (2.1)-(2.2) is often referred to as the classical optimal restricted dividend problem, meaning that the dividend rate is restricted in a given interval .
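For concreteness, the following is a minimal Monte Carlo sketch of the classical problem (2.1)-(2.2): it simulates the surplus process under a threshold (“bang-bang”) dividend strategy that pays at the maximal rate whenever the surplus exceeds a barrier, and estimates the expected discounted dividends up to the ruin time. All numerical values (drift, volatility, maximal rate, discount rate, barrier) are placeholder choices for illustration only, not quantities taken from the paper.

```python
import numpy as np

def discounted_dividends(x0, b, mu=0.5, sigma=1.0, a_max=1.5, c=0.1,
                         dt=0.01, T=50.0, n_paths=10000, seed=0):
    """Monte Carlo estimate of E[ int_0^tau e^{-c t} pi_t dt ] under the
    threshold strategy pi_t = a_max * 1{X_t > b} (Euler scheme, absorbed at 0)."""
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, float(x0))
    alive = np.ones(n_paths, dtype=bool)        # paths that have not been ruined yet
    value = np.zeros(n_paths)
    for i in range(int(T / dt)):
        rate = np.where(x > b, a_max, 0.0)      # bang-bang dividend rate
        value[alive] += np.exp(-c * i * dt) * rate[alive] * dt
        dW = rng.standard_normal(n_paths) * np.sqrt(dt)
        x = np.where(alive, x + (mu - rate) * dt + sigma * dW, x)
        alive &= (x > 0.0)                      # no dividends are paid after ruin
    return value.mean()

print(discounted_dividends(x0=3.0, b=2.0))
```

Varying the barrier b in such a simulation already indicates why the threshold level plays a central role in the parametrizations used later in §7.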
It is well-understood that when the parameters and are known, then the optimal control is of the “feedback” form: , where is the corresponding state process and is a deterministic function taking values in , often in the form of a threshold control (see, e.g., [1]). However, in practice
the exact form of is not implementable since the model parameters are usually not known, thus the “parameter insensitive”
method through Reinforcement Learning (RL) becomes a much more desirable alternative, which we now elaborate.
In the RL formulation, the agent follows a process of exploration and exploitation via a sequence of trial and error evaluation. A key element is to randomize the control action as a probability distribution over ,
similar to the notion of relaxed control in control theory, and the classical control is considered as a special point-mass (or Dirac -measure) case.
To make this idea mathematically precise, let us
denote to be the Borel field on , and to be the space of all probability measure on , endowed with, say, the Wasserstein metric.
A “relaxed control” is a randomized policy defined as a measure-valued progressively measurable process
. Assuming that it has a density, denoted by , , we can write
In what follows we shall often identify a relaxed control with its density process .
Now, for , we define a probability measure on as follows: for and
,
(2.3)
We call a function the “canonical representation” of a relaxed control , if . Then, for we have
(2.4)
We can now derive the exploratory dynamics of the state process along the lines of entropy-regularized relaxed stochastic control arguments (see, e.g., [22]). Roughly speaking, consider the discrete version of the dynamics (2.1): for small ,
(2.5)
Let and be independent samples of under the distribution ,
and the corresponding samples of , respectively.
Then, the law of large numbers and (2.4) imply that
(2.6)
as . This, together with the fact , leads to the following form
of the exploratory version of the state dynamics:
(2.7)
where is the (density of) relaxed control process, and we shall often denote to specify its dependence on control and
the initial state .
To formulate the entropy-regularized optimal dividend problem, we first give a heuristic argument. Similar to (2.6), for large and small we should have
Therefore, in light of [22] we shall define the entropy-regularized cost functional of the optimal expected dividend control problem under the relaxed control as
(2.8)
where ,
, and is the so-called temperature parameter balancing the exploration and exploitation.
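To make the structure of the running reward in (2.8) concrete, the snippet below numerically evaluates, for a Gibbs-type density on a bounded action set [0, M], the quantity “mean dividend rate plus λ times the differential entropy of the policy”. The Gibbs shape and all constants are illustrative assumptions; only the structure (mean reward plus temperature-weighted entropy) mirrors (2.8).

```python
import numpy as np

def entropy_regularized_reward(beta, M=1.5, lam=0.5, n=4000):
    """Mean dividend rate plus lam times the differential entropy of the
    density pi(a) proportional to exp(beta * a) on [0, M] (Riemann sum)."""
    da = M / n
    a = (np.arange(n) + 0.5) * da              # midpoints of the grid on [0, M]
    w = np.exp(beta * a)
    pi = w / (w.sum() * da)                    # normalized density
    mean_dividend = float((a * pi).sum() * da)
    entropy = float(-(pi * np.log(pi)).sum() * da)
    return mean_dividend + lam * entropy

for beta in (-2.0, 0.0, 2.0):                  # from "pay little" to "pay a lot"
    print(round(entropy_regularized_reward(beta), 4))
```

As the temperature increases, flatter (more exploratory) densities become more rewarding, which is exactly the exploration-exploitation trade-off described above.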
We now define the set of open-loop admissible controls as follows.
Definition 2.1.
A measurable (density) process is called an open-loop admissible relaxed control if
1. , for -a.e. ;
2. for each , the process is -progressively measurable;
3. .
We shall denote to be the set of open-loop admissible relaxed controls.
An important type of is of the
“feedback” nature, that is,
for some deterministic function , where
satisfies the SDE:
(2.10)
Definition 2.2.
A function is called a closed-loop admissible relaxed control if, for every ,
1. The SDE admits a unique strong solution ,
2. The process .
We denote to be the set of closed-loop admissible relaxed controls.
The following properties of the value function are straightforward.
Proposition 2.3.
Assume . Then the value function satisfies the following properties:
(1) , if ;
(2) , .
Proof. (1) Let , and . Consider , .
Then, it is readily seen that , for . Thus , proving (1), as is arbitrary.
(2) By definition and by the well-known Kullback-Leibler divergence property. Thus, , and then . On the other hand, since , , is admissible and for , the conclusion follows.
We remark that in optimal dividend control problems it is often assumed that the maximal dividend rate is greater than the average return rate (that is, ), and that the average return of a surplus process ,
including the safety loading, is higher than the interest rate . These, together with Proposition 2.3,
lead to the following standing assumption that
will be used throughout the paper.
Assumption 2.4.
(i) The maximal dividend rate satisfies ; and
(ii) the average return satisfies .
3 The Value Function and Its Regularity
In this section we study the value function of the relaxed control problem (2.9). We note that while most of the results are well-understood, some details still require justification, especially concerning the regularity, due particularly to the non-smoothness of the exit time .
We begin by recalling the Bellman optimality principle (cf. e.g., [24]):
Noting that , we can (formally) argue that satisfies the HJB equation:
(3.1)
Next, by a Lagrange multiplier argument and the calculus of variations (see [10]), we can find the maximizer of the right-hand side of (3.1) and obtain the optimal feedback control, which has the following Gibbs form, assuming all derivatives exist:
(3.2)
where for .
Plugging (3.2) into (3.1), we see that the HJB equation (3.1) becomes the
following second order ODE:
(3.3)
where the function is defined by
(3.4)
The following result regarding the function is important in our discussion.
Proposition 3.1.
The function defined by (3.4) enjoys the following properties:
(1) for all , where , . In particular,
;
(2) the function is convex and has a unique intersection point with , . Moreover, the abscissa value of intersection point .
Proof.
(1) Since the function is an entire function and for , is infinitely differentiable for all . On the other hand, since , by the continuity of , there exists such that , whenever . Thus is infinitely differentiable for as well. Consequently, , whence by extension.
(2) The convexity of the function follows from a direct calculation of for . Define . It is straightforward to show that , , and . Thus for some (unique) , proving (2).
We should note that (3.3) can be viewed as either a boundary value problem of an elliptic PDE with unbounded domain or a second order ODE defined on . But in either case, there is missing information on boundary/initial conditions.
Therefore the well-posedness of the classical solution is actually not trivial.
Let us first consider the equation (3.3) as an ODE defined on . Since the value function is non-decreasing by Proposition 2.3, for the sake of argument let us first consider (3.3) as an ODE with initial condition
and .
By denoting and , we see that (3.3) is equivalent to the following system of first order ODEs: for ,
(3.5)
Here is an entire function. Let us define , , and where .
Then, satisfies the following system of ODEs:
(3.6)
It is easy to check has eigenvalues , with
.
Now, let , where is
such that . Then satisfies
(3.7)
where . Since exists and tends to 0 as , and ,
we can follow the argument of [6, Theorem 13.4.3] to construct a solution
to (3.7) such that for some constant , so that
, as . Consequently, the function is a solution to (3.6)
satisfying , as . In other words, (3.5) has a solution such that as .
We summarize the discussion above as the following result.
Proposition 3.2.
The differential equation (3.3) has a classical solution
that enjoys the following properties:
(i) ;
(ii) and ;
(iii) is increasing and concave.
Proof. Following the discussion preceding the proposition we know that the classical solution to (3.3) satisfying (i) and (ii) exists. We need only check (iii).
To this end, we shall follow
an argument of [20]. Let us first formally differentiate (3.3) to get
, .
Since ,
denoting , we can write
Now, noting Proposition 3.1, we define a change of variables such that for ,
,
and denote , . Since , and , we can define as well. Then we see that,
(3.8)
Since (3.8) is a homogeneous linear ODE, uniqueness implies that , . That is, , , and is (strictly) increasing.
Finally, from (3.8) we see that , also implies that
, . Thus is convex on , and hence would be unbounded unless for all .
This, together with the fact that is a bounded and increasing function, shows that (i.e., )
can only be decreasing and convex, thus (i.e.,
) , proving the concavity of , whence the proposition.
Viscosity Solution of (3.3).
We note that Proposition 3.2 requires that
exists, which is not known a priori. We now consider (3.3) as an elliptic PDE defined on , and argue that it possesses a unique bounded viscosity solution. We will then
identify its value and argue that it must coincide with the classical solution identified in Proposition 3.2.
To begin with, let us first recall the notion of viscosity solution to (3.3). For , we denote
the set of all upper (resp. lower) semicontinuous functions in by USC (resp. LSC).
Definition 3.3.
We say that is a viscosity sub-(resp. super-)solution of (3.3) on , if and for any and such that (resp. ), it holds that
We say that is a viscosity solution of (3.3) on if it is
both a viscosity subsolution
and a viscosity supersolution of (3.3) on .
We first show that both viscosity subsolution and viscosity supersolution to (3.3) exist. To see this, for , consider the following two functions:
(3.9)
where , , are constants satisfying and the following constraints:
(3.12)
Proposition 3.4.
Assume that Assumption 2.4 holds, and let , be defined by (3.9). Then is a viscosity subsolution of (3.3) on , is a viscosity supersolution of (3.3) on . Furthermore, it holds that on .
Proof. We first show that is a viscosity subsolution. To see this, note that and on . By Assumption 2.4, Proposition 3.1, and the fact , we have
To prove that is a supersolution of (3.3), we take the following three steps:
(i) Note that and for all .
Let be the abscissa value of intersection point of and , then , thanks to Proposition 3.1.
Since and (i.e. ), we have and
. Also since and , we have and .
By Assumption 2.4, , and
, we have
for . That is, is a viscosity supersolution of (3.3) on .
(ii) Next, note that and , we see that for , and it follows that is a viscosity supersolution of (3.3) on .
(iii) Finally, for , it is clear that there is no test function satisfying the definition of supersolution. We thus conclude that
is a viscosity supersolution of (3.3) on .
Furthermore, noting Assumption 2.4, and , some direct calculations show that on , proving the proposition.
We now follow Perron’s method to prove the existence of the (bounded) viscosity solution for (3.3). We first recall the following definition (see, e.g., [2]).
Definition 3.5.
A function is said to be of class if
(1) is increasing with respect to on ;
(2) is bounded on .
Now let and be defined by (3.9), and consider the set
(3.13)
Clearly, , so . Define
, ,
and let (resp. ) be the USC (resp. LSC) envelope of , defined respectively by
Theorem 1.
(resp. ) is a viscosity sub-(resp. super-)solution of class to (3.3) on .
Proof. The proof is the same as a similar result in [2]. We omit it here.
Note that by definition we have , . Thus, given Theorem 1, one can derive the existence and uniqueness of the viscosity solution to (3.3) of class (L) by the following comparison principle, which can be argued along the lines of [7, Theorem 5.1]; we omit the proof.
Theorem 2 (Comparison Principle).
Let be a viscosity supersolution and a viscosity subsolution of (3.3), both of class . Then . Consequently, is the unique viscosity solution of class to (3.3) on .
Following our discussion we can easily raise the regularity of the viscosity solution.
Corollary 3.6.
Let be a viscosity solution of class (L) to the
HJB equation (3.1). Then, has a right-derivative , and consequently . Furthermore, is concave and satisfies
and .
Proof. Let be a viscosity solution of class (L) to (3.1). We first claim that exists. Indeed, consider the subsolution and supersolution defined by (3.9). Applying Theorem 2, for any but small enough we have
Sending we obtain that . Since is of class (L), it is increasing; thus exists, and .
Then, it follows from Proposition 3.2 that the ODE (3.3) has a bounded classical solution in
satisfying , and is increasing and concave. Hence it is also a viscosity solution to (3.3) of class (). But by Theorem 2, the bounded viscosity solution to (3.3) of class () is unique, thus the viscosity solution . The rest of the properties are the consequences of Proposition 3.2.
Verification Theorem and Optimal Strategy.
Having argued the well-posedness of ODE (3.3) from both classical and viscosity sense, we now look at its connection to the value function. We have the following Verification Theorem.
Theorem 3.
Assume that Assumption 2.4 is in force. Then, the value function defined in (2.9) is a viscosity solution of class () to the HJB equation (3.3). More precisely, it holds that
(3.14)
where the set is defined by (3.13).
Moreover, coincides with the classical solution of (3.3) described in Proposition 3.2, and the optimal control has the following form:
(3.15)
Proof. The proof that is a viscosity solution satisfying (3.14) is more or less standard (see, e.g., [24]), and Proposition 2.3 shows that must be of class (). It then follows from Corollary 3.6 that exists and is the (unique) classical solution of (3.3).
It remains to show that defined by (3.15) is optimal. To this end, note that , where
. Thus
as , thanks to the concavity of . Consequently .
Finally, since and is defined by (3.15) is obviously the maximizer of the Hamiltonian in HJB equation
(3.1), the optimality of follows from a standard argument via Itô’s formula. We omit it.
4 Policy Update
We now turn to an important step in the RL scheme, that is, the so-called Policy Update. More precisely, we prove a Policy Improvement Theorem which states that for any closed-loop policy , we can construct another , such that . Furthermore, we argue that such a
policy updating procedure can be constructed without using the system parameters, and we shall discuss the convergence of the iterations to the optimal policy.
To begin with, for and
, let be the unique strong solution to the SDE (2.10). For , we consider
the process , . Then is an -Brownian motion, where , . Since the SDE (2.10) is time-homogeneous, the path-wise uniqueness then renders the flow property: , , where satisfies the SDE
(4.1)
Now we denote to be the open-loop strategy induced by the closed-loop control .
Then the corresponding cost functional can be written as (denoting )
(4.2)
where
.
It is clear that, by flow property, we have , -a.s. on .
Next, for any admissible policy , we formally define a new feedback control policy as follows:
for ,
(4.3)
where is the Gibbs function defined by (3.2).
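To illustrate the improvement step (4.3), here is a minimal sketch of the map from (an approximation of) the derivative of the current cost functional to the density of the improved Gibbs policy on the action set [0, M]. The exponent a(1 − J′(x))/λ is our reading of the Gibbs function in (3.2) for this dividend model and should be treated as an assumption; note that the model coefficients never enter the computation.

```python
import numpy as np

def improved_policy_density(dJ_dx, lam=0.5, M=1.5, n=2001):
    """Return (grid, pi') with pi'(a|x) proportional to exp(a * (1 - dJ_dx) / lam)
    on [0, M].  Only the derivative of the current cost functional and the
    temperature enter; the coefficients of the state dynamics are never used."""
    a = np.linspace(0.0, M, n)
    logits = a * (1.0 - dJ_dx) / lam
    logits -= logits.max()                     # for numerical stability
    w = np.exp(logits)
    da = a[1] - a[0]
    return a, w / (w.sum() * da)               # normalized density on [0, M]

# If J'(x) < 1 the mass concentrates near the maximal rate M, and if J'(x) > 1
# it concentrates near 0, recovering the bang-bang intuition as lam -> 0.
a, pi_new = improved_policy_density(dJ_dx=0.4)
print(float((a * pi_new).sum() * (a[1] - a[0])))   # mean dividend rate under pi'
```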
We would like to emphasize that the new policy in (4.3)
depends on and , but is
independent of the coefficients (!). To facilitate the argument we introduce the following definition.
Definition 4.1.
A function is called “Strongly Admissible” if its density function enjoys the following properties:
(i) there exist such that , and ;
(ii) there exists such that , , uniformly in .
The set of strongly admissible controls is denoted by .
Lemma 1. Suppose that is a function whose density takes the form , where .
Then .
Proof.
Since , let be such that , .
Next, note that is positive and continuous, and for fixed , , there exist constants , such that and , . Consequently,
we have
, , and is uniformly Lipschitz on , uniformly in , proving the lemma.
In what follows we shall use the following notations. For any ,
(4.4)
Clearly, for , and are bounded and are Lipschitz continuous. We denote
to be the solution to SDE (2.10), and
rewrite the cost function (2.8) as
(4.5)
where .
Thus, in light of the Feynman-Kac formula, for any , is the probabilistic solution to the following ODE on :
(4.6)
Now let us denote to be the solution to the linear elliptic equation
(4.6) on a finite interval with boundary conditions and ; then, by the regularity and the boundedness of and ,
and using only interior-type Schauder estimates (cf. [11]), one can show that and the bounds of and depend only on those of the coefficients , and , but are uniform in . By sending and applying the standard diagonalization argument (cf. e.g., [18]), one shows that , which satisfies (4.6). We summarize the above discussion as the following proposition for ready reference.
Proposition 4.2.
If , then , and the bounds
of and depend only on those of , , and .
Our main result of this section is the following Policy Improvement Theorem.
Theorem 4.
Assume that Assumption 2.4 is in force. Let and let be defined by (4.3) associated to ; then it holds that
, .
Proof. Let be given, and let be the corresponding control defined by (4.3).
Since , and are uniformly bounded, and by Proposition 4.2,
. Thus Lemma 1 (with ) implies that
as well. Moreover, since ,
is a -solution to the ODE (4.6).
Now, recalling that is the maximizer of , we have
(4.7)
Now, let us consider the process , the solution to (4.1) with being replaced by . Applying Itô’s formula
to from to , for any , and noting the definitions of and , we deduce from (4.7) that
Taking expectation on both sides above, sending and noting that , we obtain that ,
, proving the theorem.
In light of Theorem 4 we can naturally define a “learning sequence” as follows. We start with , and define , and ,
(4.9)
Also for each , let . The natural question is whether this learning sequence is actually a “maximizing sequence”, that is, , as . Such a result would obviously justify the policy improvement scheme, and was proved in the LQ case in [23].
Before we proceed, we note that by Proposition 4.2
the learning sequence , , but the bounds may depend on the coefficients
, , thus may not be uniform in . But by definition and Proposition 2.3, we see that for some . Moreover, since for each , , if we choose to be such that (e.g., ), then we have for all . That is, ’s are uniformly bounded, and uniformly in , provided that ’s are. The following result, based on the recent work [12], is thus crucial.
Proposition 4.3.
The functions , are uniformly bounded, uniformly in . Consequently,
the learning sequence , , and the bounds of ’s, up to their second derivatives, are uniform in .
Our main result of this section is the following.
Theorem 5.
Assume that Assumption 2.4 is in force. Then the sequence is a maximizing sequence. Furthermore,
the sequence converges to the optimal policy .
Proof.
We first observe that by Lemma 1 the sequence , provided . Since
, Proposition 4.3 guarantees that , and the bounds are independent of .
Thus a simple application of
Arzelà-Ascoli Theorem shows that there exist subsequences and such that and converge uniformly on compacts.
Let us fix any compact set , and assume , uniformly on , for some function . By definition of ’s we know that is monotonically increasing, thanks to Theorem 4, thus the whole sequence must converge uniformly on to .
Next, let us assume that , uniformly on , for some function . Since obviously
as well, and noting that the derivative operator is a closed operator, it follows that , . Applying the same argument one shows that, for any subsequence of , there exists a sub-subsequence that converges uniformly on to the same limit ; we conclude that the sequence itself converges uniformly on to .
Since is arbitrary, this shows that converges uniformly on compacts to .
Since is a continuous function of , we see that converges uniformly to defined by
.
Finally, applying Lemma 1 we see that , and the structure of the guarantees that satisfies the HJB equation (3.1)
on the compact set .
Extending the result to by using the fact that is arbitrary, we conclude that satisfies the HJB equation (3.1) (or equivalently
(3.3)).
Now, by using a slightly modified version of the verification argument in [12, Theorem 4.1], we conclude that is the unique solution to the HJB equation (3.1), and thus by definition is the optimal control.
Remark 4.4.
An alternate policy improvement method is the so-called Policy Gradient (PG) method introduced in [15], applicable for both finite and infinite horizon problems. Roughly speaking, a PG method parametrizes the policies and then solves for via the equation , using stochastic approximation method. The advantage of a PG method is that
it does not depend on the system parameter, whereas in theory
Theorem 4 is based on finding the maximizer of the Hamiltonian, and thus the learning strategy (4.9) may depend on the system parameter. However, a closer look at the learning parameters and in (4.9) shows that they depend only on , but not on directly. In fact, we believe that in our case the PG method would not be
advantageous, especially given the convergence result in Theorem 5 and the fact that the PG method also requires a proper choice of the parameterization family, which, to the best of our knowledge, remains a challenging issue in practice. We shall therefore content ourselves
with algorithms using the learning strategy (4.9) for our numerical analysis in §7.
5 Policy Evaluation — A Martingale Approach
Having proved the policy improvement theorem, we turn our attention to an equally important issue in the learning process, that is, the evaluation of the cost (value) functional, or the Policy Evaluation. The main idea of the policy evaluation in reinforcement learning literature usually refers to a process of approximating the cost functional , for a given feedback control , by approximating by a parametric family of functions , where .
Throughout this section, we shall consider a fixed feedback control policy . Thus for simplicity of notation, we shall drop the superscript and thus write and .
We note that for , the functions and .
Now let be the solution to the SDE (2.10), and
satisfies the ODE (4.6). Then, applying Itô’s formula we see that
(5.1)
is an -martingale. Furthermore, the following result is more or less standard.
Proposition 5.1.
Assume that Assumption 4.3 holds, and suppose that is such that , and for all , the process is an -martingale. Then .
Proof. First note that , and
. By (5.1) and definition of
we have
.
Now, since , , and are bounded, both and are uniformly integrable -martingales, by optional sampling it holds that
The result follows.
We now consider a family of functions , where is a certain index set.
For the sake of argument, we shall assume further that is compact.
Moreover, we shall make the following assumptions for the parameterized family .
Assumption 5.2.
(i) The mapping is sufficiently smooth, so that all the derivatives required exist
in the classical sense.
(ii) For all , are square-integrable continuous processes, and the mappings are continuous, where .
(iii) There exists a continuous function , such that .
In what follows we shall often drop the superscript from the processes , etc., if there is no danger of confusion.
Also, for practical purpose we shall consider a finite time horizon , for an arbitrarily fixed and sufficiently large . Denoting the
stopping time , by optional sampling theorem, we know that , for , is an -martingale on , where . Let us also denote
, .
We now follow the idea of [14] to construct the so-called Martingale Loss Function.
For any , consider the parametrized approximation of the process :
(5.2)
In light of the Martingale Loss function introduced in [14], we denote
(5.3)
We should note that the last equality above indicates that the martingale loss function is actually independent of the function , which is one of the main features of this algorithm.
Furthermore, inspired by the mean-squared and discounted mean-squared value errors we define
(5.4)
(5.5)
The following result shows the connection between the minimizers of and .
Theorem 6.
Assume that Assumption 5.2 is in force. Then, it holds that
(5.6)
Proof. First, note that , and , we see that
Here in the above we use the convention that if , and the identities become trivial. Consequently, by definition (5.3) and noting that , for , we can write
Combining above we see that (5) becomes
.
Since is independent of , we conclude the result.
Remark 5.3.
Since the minimizers of MSVE and DMSVE are obviously identical, Theorem 6 suggests that if
is a minimizer of either one of , , , then would be an acceptable approximation of . In the rest of the section we shall therefore focus on the identification of .
We now propose an algorithm that provides a numerical approximation of the policy evaluation (or equivalently the martingale ), by discretizing the integrals in
the loss functional . To this end, let
be an arbitrary but fixed time horizon, and consider the partition , and denote , .
Now for , we define ,
and so that . Finally, we define
. Clearly, both and are integer-valued random variables, and we shall often drop the subscript if there is no danger of confusion.
where , , are defined in an obvious way.
Furthermore, for , we define .
Now note that , and . Denoting , we have
(5.8)
Since
,
denoting and ,
we obtain,
Combining (5.8) and (5), similar to (5), we can now rewrite (5.3) as
(5.9)
We are now ready to give the main result of this section.
Theorem 7.
Let Assumptions 4.3 and 5.2 be in force.
Then it holds that
Proof. Fix a partition . By (5.9) and (5) we have, for ,
(5.10)
Let us first check , .
First, by Assumption 5.2, we see that
where is a continuous function, and is the bound of .
Thus we have
(5.11)
Next, note that implies , and since we are considering the case when , we might assume . Thus by definitions of and we have
(5.12)
Since is a diffusion, one can easily check that .
Furthermore, noting that is
uniformly bounded for in any compact set, from (5.11) and (5.12) we conclude that
(5.13)
It remains to show that
(5.14)
uniformly in on compacta. To this end, we first note that
(5.15)
where and .
From definition (5) we see that under Assumption 5.2 it holds that , . Furthermore, we denote
Since is bounded, we see that for some constant .
Then it holds that
where
, , is a bounded and continuous process.
Now for any , choose so that , for , and define
, we have
Sending we obtain that . Since is arbitrary, we conclude that as . Since the argument above is uniform in , it follows that as .
Consequently, we have
Since is continuous in , we see that the convergence above is uniform in on compacta.
Similarly, note that by Assumption 5.2 the process is also a square-integrable continuous process, and uniform in , we have
where is the modulus of continuity of in . Therefore , as , uniformly in on compacta.
Finally, combining (5.15)–(5) we obtain (5.14). This, together with (5.13) as well as
(5.10), proves the theorem.
Now let us denote , and consider the functions
Then , and by Assumption 5.2 we can easily check that the mappings are continuous functions. Applying Theorem 7 we see that , uniformly in on compacta, as .
Note that if is compact, then for any , there exists .
In general, we have the following corollary of Theorem 7.
Corollary 5.4.
Assume that all assumptions in Theorem 7 are in force. If there exists a sequence , such that , then any limit point of the sequence must satisfy
.
Proof. This is a direct consequence of [14, Lemma 1.1].
Remark 5.5.
We should note that, by Remark 5.3, the set of minimizers of the martingale loss function ML is the same as
that of DMSVE. Thus Corollary 5.4 indicates that we have a reasonable approach for approximating the unknown function . Indeed,
if has a convergent subsequence that converges to some , then is the best approximation for by either the measures of MSVE or DMSVE.
To end this section we discuss the ways to fulfill our last task: finding the optimal parameter . There are usually two learning methods for this task in RL, often referred to as online and batch learning, respectively.
Roughly speaking, the batch learning methods use multiple sample trajectories of over a given finite time horizon to update parameter at each step,
whereas in online learning, one observes only a single sample trajectory
to continuously update the parameter until it converges. Clearly, the online learning is particularly suitable for infinite horizon problem, whereas the ML function is by definition better suited for batch learning.
Although our problem is by nature an infinite horizon one, we shall first create a batch learning algorithm via the ML function by restricting ourselves to an arbitrarily fixed finite horizon , so as to convert it to a finite time horizon problem.
To this end, we note that
However, since may be unbounded, we shall consider instead the function:
We observe that the difference
for some continuous function . Thus if is compact, for large enough or small enough, the difference between and is negligible. Furthermore,
we note that
we can now follow the method of Stochastic Gradient Descent (SGD) to minimize and obtain the updating rule,
where denotes the learning rate for the iteration (using the simulated sample trajectory). Here is chosen so as to help guarantee the convergence of the algorithm, based on the literature on the convergence of SGD.
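The following is a minimal sketch of one such SGD step on a truncated, discretized martingale loss, using a single simulated trajectory per update. Both the hypothetical two-parameter family V_theta(x) = theta0(1 − exp(−theta1 x)) and the discretized loss 0.5·Σ_i Δt·(G_i − e^{−c t_i} V_theta(x_i))², with G_i the realized discounted payoff from t_i to the truncated end of the path, are stand-ins chosen for illustration; they are not the exact expressions in (5.3) or the parametrizations of §7.

```python
import numpy as np

def V(theta, x):
    """Hypothetical parametric family V_theta(x) = theta0 * (1 - exp(-theta1 * x))."""
    return theta[0] * (1.0 - np.exp(-theta[1] * x))

def grad_V(theta, x):
    e = np.exp(-theta[1] * x)
    return np.array([1.0 - e, theta[0] * x * e])

def sgd_step(theta, xs, rewards, ts, dt, c=0.1, lr=0.01):
    """One SGD step on the discretized martingale loss of one path:
    ML(theta) ~ 0.5 * sum_i dt * (G_i - e^{-c t_i} V_theta(x_i))^2, where
    G_i = sum_{j >= i} e^{-c t_j} * rewards[j] * dt is the realized payoff
    from t_i to the (truncated) end of the path, discounted to time 0."""
    disc = np.exp(-c * ts)
    G = np.cumsum((disc * rewards * dt)[::-1])[::-1]     # discounted payoff-to-go
    grad = np.zeros_like(theta)
    for ti, xi, gi, di in zip(ts, xs, G, disc):
        residual = di * V(theta, xi) - gi
        grad += dt * residual * di * grad_V(theta, xi)
    return theta - lr * grad

# usage with a dummy trajectory (states, running rewards, time grid)
dt, T = 0.1, 5.0
ts = np.arange(0.0, T, dt)
rng = np.random.default_rng(0)
xs = 3.0 + 0.2 * np.sqrt(dt) * np.cumsum(rng.standard_normal(ts.size))
rewards = np.full(ts.size, 1.0)            # placeholder running reward r_t
theta = np.array([1.0, 0.5])
print(sgd_step(theta, xs, rewards, ts, dt))
```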
6 Temporal Difference (TD) Based Online Learning
In this section we consider another policy evaluation method utilizing the parametric family . The starting point of this method is Proposition 5.1, which states that the best approximation is one whose corresponding approximating process defined by (5.2) is a martingale (in which case (!)).
Next, we recall the following simple fact (see, e.g., [14] for a proof).
Proposition 6.1.
An Itô process is a martingale if and only if
(6.1)
The functions are called test functions.
Proposition 6.1 suggests that a reasonable approach for approximating the optimal could be solving the martingale orthogonality condition (6.1).
However, since (6.1) involves infinitely many equations, for numerical approximations we should only choose a finite number of test functions, often referred to as moment conditions.
There are many ways to choose test functions.
In the finite horizon case, [14] proposed some algorithms for solving equation (6.1) with certain test functions. By using the well-known Robbins-Monro stochastic approximation (cf. Robbins and Monro (1951)), they suggested some continuous analogs of the well-known discrete time Temporal Difference (TD) algorithms, such as the TD method and the (linear) least squares TD(0) (or LSTD) method, which are often referred to as the CTD method and CLSTD method, respectively, for obvious reasons.
We should note that although our problem is essentially an infinite horizon one, we could consider a sufficiently large truncated time horizon , as we did in previous section, so that offline CTD methods similar to [14] can also be applied. However, in what follows we shall focus only on an online version of CTD method that is more suitable to the infinite horizon case.
We begin by recalling the following fact:
where is the probability measure defined in §2, and is the trajectory corresponding to the action “sampled” from the policy distribution . Now let , be a sequence of discrete time points, and the action sampled at time . Denote
By the same argument as in [14], we have the discrete time approximation of (6):
(6.3)
In what follows we summarize the updating rules for the CTD(λ) methods using (6.3).
CTD(0) (). In this case we let , with the updating rule:
(6.4)
CTD(λ) (). In this case we choose
, with the updating rule:
(6.5)
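For illustration, the sketch below performs one online update in the spirit of the CTD(0) rule (6.4): it uses the gradient of V_theta with respect to theta as the test function and the increment of the approximating martingale ∫_0^t e^{−cs} r_s ds + e^{−ct} V_theta(X_t) as the temporal difference. The parametric family and the exact form of the increment are assumptions in the spirit of §5, not expressions copied from the paper.

```python
import numpy as np

def V(theta, x):
    """Hypothetical parametric family V_theta(x) = theta0 * (1 - exp(-theta1 * x))."""
    return theta[0] * (1.0 - np.exp(-theta[1] * x))

def grad_V(theta, x):
    e = np.exp(-theta[1] * x)
    return np.array([1.0 - e, theta[0] * x * e])

def ctd0_update(theta, t, x, x_next, reward, dt, c=0.1, lr=0.05):
    """theta <- theta + lr * xi_t * dM_t, with test function xi_t = grad_theta V_theta(x_t)
    and dM_t = e^{-c(t+dt)} V_theta(x_{t+dt}) - e^{-c t} V_theta(x_t) + e^{-c t} * reward * dt,
    i.e. the increment of the approximating martingale, which has zero mean when
    V_theta equals the true cost functional."""
    dM = (np.exp(-c * (t + dt)) * V(theta, x_next)
          - np.exp(-c * t) * V(theta, x)
          + np.exp(-c * t) * reward * dt)
    return theta + lr * grad_V(theta, x) * dM

theta = np.array([1.0, 0.5])
print(ctd0_update(theta, t=0.0, x=3.0, x_next=3.05, reward=1.2, dt=0.1))
```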
Remark 6.2.
(i) We note that the updating rule (6.4) can be viewed as the special case of (6.5) when λ = 0, if we make the convention that , for .
(ii) Although we are considering the infinite horizon case, in practice, in order to prevent an infinite loop, one always has to stop the iteration at a finite time. In other words, for both CTD(0) and CTD(λ), we assume that for a large .
(iii) The constants ’s are often referred to as the learning rates for the -th iteration. In light of the convergence conditions of the Stochastic Approximation methods discussed at the end of the previous section, we shall choose ’s so as to help guarantee the convergence of the algorithm.
We observe that the convergence analysis of the above method for each fixed , coincides with that of the stochastic approximation methods. It would naturally be interesting to see the convergence with respect to . To this end, let us first define a special subspace of :
where , , and .
The following result is adapted from [14, §4.2 Theorem 3] with an almost identical proof. We thus only state it and omit the proof.
Proposition 6.3.
Assume that for some , and that , are such that .
Then for any sequence such that exists, it must hold that . Furthermore, there exists such that .
Finally, we remark that although there are other PE methods analogous to well-known TD methods (e.g., CLSTD(0)), which are particularly
well suited for linear parameterization families, in this paper we are interested in parameterized families that are nonlinear in nature. Thus we shall focus only on the CTD methods as well as the Martingale Loss Function based PE method developed in the previous section (which will be referred to as the ML-algorithm in what follows), and present the detailed algorithms and numerical results in the next section.
7 Numerical Results
In this section we present the numerical results along the lines of the PE and PI schemes discussed in the previous sections. In particular, we shall consider the CTD methods and the ML Algorithm, with some special parametrizations based on the knowledge of the explicit solution of the original optimal dividend problem (with ), but without specifying the market parameters and .
To test the effectiveness of the learning procedure, we shall use the so-called environment simulator: , that takes the current state and action as inputs and generates the state at time , and we shall use the outcome of the simulator as the dynamics of . We note that the environment simulator will be problem specific, and should be created using historic data pertaining to the problem, without using the environment coefficients, which are considered unknown in the RL setting. But in our
testing procedure, we shall use some dummy values of and , along with the following Euler–Maruyama discretization of the SDE (2.10):
(7.1)
where is a normal random variable, and at each time , is calculated by the environment simulator recursively via the given and , and to be specified below.
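A minimal environment-simulator sketch matching the Euler–Maruyama step (7.1) is given below: the dummy coefficients are hidden inside the simulator and are never exposed to the learning agent, which only supplies the current state and a sampled dividend action. The numerical values are placeholders.

```python
import numpy as np

class EnvironmentSimulator:
    """Black-box simulator of one step of (7.1):
    X_{t+dt} = X_t + (mu - a) * dt + sigma * sqrt(dt) * Z,
    where the "dummy" coefficients mu and sigma are internal and unknown to the agent."""

    def __init__(self, mu=0.5, sigma=1.0, dt=0.01, seed=0):
        self._mu, self._sigma, self.dt = mu, sigma, dt
        self._rng = np.random.default_rng(seed)

    def step(self, x, a):
        z = self._rng.standard_normal()
        x_next = x + (self._mu - a) * self.dt + self._sigma * np.sqrt(self.dt) * z
        ruined = x_next <= 0.0                  # ruin: surplus hits zero
        return max(x_next, 0.0), ruined

env = EnvironmentSimulator()
print(env.step(x=3.0, a=1.0))
```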
Sampling of the optimal strategy.
We recall from (3.15) that the optimal policy function has the form , where is a continuous function. It can be easily calculated that the inverse of the cumulative distribution function, denoted by , is of the form:
Thus, by the inversion method, if , the uniform distribution on , then the random variable , and we need only sample , which is much simpler.
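Concretely, for a Gibbs-type density π(a|x) proportional to exp(βa) on a bounded action set [0, M] (where, under our reading of (3.2), β = (1 − V′(x))/λ), the cumulative distribution function inverts in closed form, so each action requires only one uniform draw. The sketch below is based on that assumed form; the constants are illustrative.

```python
import numpy as np

def sample_gibbs_action(dV_dx, lam=0.5, M=1.5, rng=None):
    """Inverse-CDF sampling from pi(a|x) proportional to exp(beta * a) on [0, M],
    with beta = (1 - dV_dx) / lam (assumed Gibbs exponent).  The CDF is
    F(a) = (e^{beta a} - 1) / (e^{beta M} - 1), hence
    F^{-1}(u) = log(1 + u * (e^{beta M} - 1)) / beta."""
    rng = rng if rng is not None else np.random.default_rng()
    u = rng.uniform()
    beta = (1.0 - dV_dx) / lam
    if abs(beta) < 1e-8:                        # beta ~ 0: uniform on [0, M]
        return u * M
    return np.log1p(u * np.expm1(beta * M)) / beta

rng = np.random.default_rng(1)
print([round(sample_gibbs_action(0.4, rng=rng), 3) for _ in range(5)])
```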
Parametrization of the cost functional. The next step is to choose the parametrization of .
In light of the well-known result (cf. e.g., [1]), we know that if are given, and , (thanks to Assumption 2.4), the classical solution for the optimal dividend problem is given by
(7.2)
where , ,
, and .
We should note that the threshold in (7.2) is most critical in the value function, as it determines the switching barrier of the optimal dividend rate. That is, the optimal dividend rate is of the “bang-bang” form: , where is the reserve process (see, e.g., [1]).
We therefore consider the following two parametrizations based on the initial state .
where represent and of the classical solution respectively.
In particular, the bounds for and are due to the fact and under Assumption 2.4.
We should note that these bounds alone are not sufficient for the algorithms to converge, and we actually enforced some additional bounds. In practice, the range of and should be obtained from historical data for this method to be effective in real life applications.
Finally, it is worth noting that (7.2) actually implies that and . We can therefore approximate by and , respectively, whenever the limit can be obtained. The threshold can then be approximated via and as well.
where represent and respectively, and the bounds for are the bounds of parameter in (7.2). To obtain an upper bound of , we note that is necessary to ensure for each , and thus the upper bound of leads to that of . For the lower bound of , note that and hence so is .
Using , we approximate by .
Remark 7.1.
The parametrization above depends heavily on the knowledge of the explicit solution for the classical optimal control problem. In general, it is natural to consider using the viscosity solution of the entropy regularized relaxed control problem as the basis for the parameterization family. However, although we did identify both viscosity super- and sub-solutions in (3.9), we found that
the specific super-solution does not
work effectively, due to the computational complexities resulting from the piecewise nature of the function, as well as the complicated nature of the bounds of the parameters involved (see (3.12)); whereas the viscosity sub-solution, being a simple function independent of all the parameters we consider, does not seem to be an effective choice for a parameterization family in this case either.
We shall leave the study of the effective parametrization using viscosity solutions to our future research.
In the following two subsections we summarize our numerical experiments following the analysis so far.
For testing purposes, we choose “dummy” parameters and , so that Assumption 2.4 holds. We use to limit the
number of iterations, and we observe that on average the ruin time of the path simulations occurs in the interval . We also use the error bound , and make the convention that whenever .
7.1 CTD methods
Data: Initial state , Initial temperature , Initial learning rate , functional forms of , number of simulated paths , Variable , an environment simulator
Learning Procedure
Initialize , and set .
while do
Set .
if AND then
Compute and store over the last iterations.
Set .
if then
End iteration if the absolute difference .
end if
end if
Initialize .
Observe and store .
while do
.
Compute and generate action
Apply to to observe and store .
end iteration if .
Compute .
end iteration if .
Update .
Update .
end while
Set and update .
end while
Set .
Algorithm 1 Algorithm
In Algorithm 1 below we carry out the PE procedure using the method. We choose as
a function of the iteration number: . This particular function is chosen so that and the entropy-regularized control problem converges to the classical problem, but is still bounded away from so as to ensure that is well defined. We shall initialize the learning rate at 1 and decrease it using the function so as to ensure that the conditions and are satisfied.
We note that Algorithm 1 is designed as a combination of online learning and the so-called batch learning, which updates the parameter at each temporal partition point, but only updates the policy after a certain number (the parameter “” in Algorithm 1) of path simulations. This particular design is to allow the PE method to better approximate before updating .
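As an illustration of schedules compatible with these requirements, the snippet below shows a temperature that decreases in the iteration count but stays bounded away from zero, together with a learning rate satisfying the Robbins-Monro conditions (the sum of the rates diverges while the sum of their squares converges). The functional forms are hypothetical examples, not the ones used to produce the results reported below.

```python
import numpy as np

def temperature(k, lam0=1.0, lam_min=0.05):
    """Decreasing temperature, bounded away from 0 so the Gibbs policy stays well defined."""
    return max(lam0 / (1.0 + np.log1p(k)), lam_min)

def learning_rate(k):
    """Robbins-Monro type rate 1/(1+k)^0.75: sum_k alpha_k = inf, sum_k alpha_k^2 < inf."""
    return 1.0 / (1.0 + k) ** 0.75

print([round(temperature(k), 3) for k in (0, 10, 100, 1000)])
print([round(learning_rate(k), 3) for k in (0, 10, 100, 1000)])
```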
Convergence Analysis. To analyze the convergence as , we consider , 0.001, 0.0005, 0.0001, 0.00005, respectively. We take path simulations and in the implementation. Note that with the choice of dummy parameters and , the classical solution is given by , and . We thus consider
two parameterization families, for initial values and respectively.
Table 1: Results for the method

                 family (i)                          family (ii)
             x=3              x=10               x=3               x=10
      15.49    5.383    31.276   3.476     32.489   19.359    55.667    9.635
      17.188   4.099    22.217   4.292     31.108   18.532    53.262   11.942
      16.68    4.474    23.082   3.931     37.58    11.925    60.948   10.691
      16.858   4.444    23.079   4.049     40.825   11.797    65.179   10.5
      17.261   4.392    23.094   4.505     38.899   18.994    55.341   18.142

(Each row corresponds to one of the five values listed above, in the same order.)
Table 2: Convergence results for the method (plots w.r.t. of the results obtained using family (i) for x=3 and x=10, and using family (ii) for x=3 and x=10; graphs omitted).
Case 1. . As we can observe from Tables 1 and 2, in this case using the approximation (7.3) (family (i)) shows a reasonably satisfactory level of convergence towards the known classical solution values of and as , despite some mild fluctuations.
We believe that such fluctuations are due to the randomness of the data we observe and that averaging over the paths in our algorithm reduced the occurrence of these fluctuations to a satisfactory level. As we can see, despite the minor anomalies, the general trajectory of these graphs tends towards the classical solution as . We should also observe that using family (ii) (7.4) does not produce any satisfactory convergent results. But this is as expected, since the function (7.4) is based on the classical solution for .
Case 2. . Even though the family (7.3) is based on the classical solution for , as we can see from Tables 1 and 2, the algorithm using family (7.3) converges to the values of the classical solution even in the case , whereas the algorithm using family (ii) (7.4) does not. While a bit counterintuitive, this is actually not entirely unexpected, since the state process can be seen to reach in the considered time interval in general, but the parameterization (7.4) is not suitable when the value of the state reaches below .
Consequently, it seems that the parameterization (7.3) is better suited for the method, regardless of the initial value.
Finally, we would like to point out that the case for methods for is much more complicated,
and the algorithms are computationally much slower than method.
We believe that the proper choice of the learning rate in this case
warrants some deeper investigation, but we prefer not to discuss these issues in this paper.
7.2 The ML Algorithm
In Algorithm 2 we present the so-called ML-algorithm in which we use a batch learning approach where we update the parameters by at the end of each simulated path using the information from the time interval . We use path simulations and initial temperature . In the -th simulated path we decrease the temperature parameter by . We also initialize the learning rate at 1 and decrease it using the function . Finally, we consider , respectively, to represent the convergence as .
Data: Initial state , Initial temperature , Initial learning rate , functional forms of , number of simulated paths , Variable , an environment simulator
Learning Procedure
Initialize , .
while do
.
if AND then
Compute and store over the last iterations.
if then
End iteration if the absolute difference .
end if
end if
Initialize , observe and store .
while do
Compute and generate action
Apply to to observe and store .
end iteration if .
observe and store .
Update .
end while
Compute using the ML algorithm and
Update .
Update .
end while
Set .
Algorithm 2: ML Algorithm
Using parameterized family (i) with both the initial values and , we obtain the optimal as the lower bound of each parameter , for . Using parameterized family (ii) with both the initial values and , we obtain the optimal as the average of the lower and upper bounds of each parameter for , since in each iteration is updated as the upper and lower boundary alternately. This is due to the fact that the learning rate is too large for this particular algorithm. Decreasing the size of the learning rate results in optimal values that are away from the boundaries, but the algorithms in these cases were shown empirically not to converge, and thus the final result depends on the number of iterations used (M).
In general, the reason for this could be the loss of efficiency caused by decreasing the learning rates, since Gradient Descent Algorithms are generally sensitive to learning rates. Specific to our problem, among many possible reasons, we believe that the limiting behavior of the optimal strategy when is a serious issue, as is not well defined when and a Dirac δ-measure is supposed to be involved. Furthermore, the “bang-bang” nature and the jump of the optimal control could also affect the convergence of the algorithm. Finally, the algorithm seems to be quite sensitive to the value of , since the value function is a piecewise smooth function depending on . Thus, to rigorously analyze the effectiveness of the ML-algorithm with parameterization families (i) and (ii), further empirical analysis is needed, which involves finding effective learning rates.
All these issues call for further investigation, but based on our numerical experiment we can nevertheless conclude that the CTD(0) method using the parameterization family (i) is effective in finding the value and , provided that the effective upper and lower bounds for the parameters can be identified using historic data.
References
[1]
Søren Asmussen and Michael Taksar, Controlled diffusion models for
optimal dividend pay-out, Insurance Math. Econom. 20 (1997), no. 1,
1–15.
[2]
Lihua Bai and Jin Ma, Optimal investment and dividend strategy under
renewal risk model, SIAM J. Control Optim. 59 (2021), no. 6,
4590–4614.
[3]
Lihua Bai and Jostein Paulsen, Optimal dividend policies with transaction
costs for a class of diffusion processes, SIAM J. Control Optim. 48
(2010), no. 8, 4987–5008.
[4]
Jun Cai, Hans U. Gerber, and Hailiang Yang, Optimal dividends in an
Ornstein-Uhlenbeck type model with credit and debit interest, N. Am.
Actuar. J. 10 (2006), no. 2, 94–119.
[5]
Tahir Choulli, Michael Taksar, and Xun Yu Zhou, A diffusion model for
optimal dividend distribution for a company with constraints on risk
control, SIAM J. Control Optim. 41 (2003), no. 6, 1946–1979.
[6]
Earl A. Coddington and Norman Levinson, Theory of ordinary differential
equations, McGraw-Hill Book Co., Inc., New York-Toronto-London, 1955.
[7]
Michael G. Crandall, Hitoshi Ishii, and Pierre-Louis Lions, User’s guide
to viscosity solutions of second order partial differential equations, Bull.
Amer. Math. Soc. (N.S.) 27 (1992), no. 1, 1–67.
[8]
Tiziano De Angelis and Erik Ekström, The dividend problem with a
finite horizon, Ann. Appl. Probab. 27 (2017), no. 6, 3525–3546.
[9]
B. de Finetti, Su un’impostazione alternativa della teoria collettiva
del rischio, Transactions of the XVth International Congress of Actuaries, New York 2 (1957), 433–443.
[10]
James Ferguson, A brief survey of the history of the calculus of
variations and its applications, 2004.
[11]
David Gilbarg and Neil S. Trudinger, Elliptic partial differential equations of second order, Classics in Mathematics,
Springer-Verlag, Berlin, 2001, Reprint of the 1998 edition.
[12]
Yu-Jui Huang, Zhenhua Wang, and Zhou Zhou, Convergence of policy
improvement for entropy-regularized stochastic control problems, 2023.
[13]
Saul D. Jacka and Aleksandar Mijatović, On the policy improvement
algorithm in continuous time, Stochastics 89 (2017), no. 1, 348–359.
[14]
Yanwei Jia and Xun Yu Zhou, Policy evaluation and temporal-difference
learning in continuous time and space: a martingale approach, J. Mach.
Learn. Res. 23 (2022), Paper No. [154], 55.
[15]
Yanwei Jia and Xun Yu Zhou, Policy gradient and actor-critic learning in continuous time and
space: theory and algorithms, J. Mach. Learn. Res. 23 (2022), Paper
No. [275], 50.
[16]
B. Kerimkulov, D. Šiška, and L. Szpruch, A modified MSA for
stochastic control problems, Appl. Math. Optim. 84 (2021), no. 3,
3417–3436.
[17]
Bekzhan Kerimkulov, David Šiška, and Lukasz Szpruch, Exponential
convergence and stability of Howard’s policy improvement algorithm for
controlled diffusions, SIAM J. Control Optim. 58 (2020), no. 3,
1314–1340.
[18]
O. A. Ladyženskaja, V. A. Solonnikov, and N. N. Ural’ceva, Linear and
quasilinear equations of parabolic type, AMS, Providence, RI, 1968.
[19]
M. L. Puterman, On the convergence of policy iteration for controlled
diffusions, J. Optim. Theory Appl. 33 (1981), no. 1, 137–144.
[20]
S. E. Shreve, J. P. Lehoczky, and D. P. Gaver, Optimal consumption for
general diffusions with absorbing and reflecting barriers, SIAM J. Control
Optim. 22 (1984), no. 1, 55–75.
[21]
Stefan Thonhauser and Hansjörg Albrecher, Dividend maximization under
consideration of the time value of ruin, Insurance Math. Econom. 41
(2007), no. 1, 163–184.
[22]
Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou, Reinforcement
learning in continuous time and space: a stochastic control approach, J.
Mach. Learn. Res. 21 (2020), Paper No. 198, 34.
[23]
Haoran Wang and Xun Yu Zhou, Continuous-time mean-variance portfolio
selection: a reinforcement learning framework, Math. Finance 30
(2020), no. 4, 1273–1308.
[24]
Jiongmin Yong and Xun Yu Zhou, Stochastic controls, Applications of
Mathematics (New York), vol. 43, Springer-Verlag, New York, 1999, Hamiltonian
systems and HJB equations.
[25]
Jinxia Zhu and Hailiang Yang, Optimal financing and dividend distribution
in a general diffusion model with regime switching, Adv. in Appl. Probab.
48 (2016), no. 2, 406–422.