This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Inverse stochastic optimal controls

Yumiharu Nakano
Department of Mathematical and Computing Science, School of Computing
Tokyo Institute of Technology
W8-28, 2-12-1, Ookayama, Meguro-ku, Tokyo 152-8550, Japan
e-mail: nakano@c.titech.ac.jp
Abstract

We study an inverse problem of the stochastic optimal control of general diffusions with performance index having the quadratic penalty term of the control process. Under mild conditions on the system dynamics, the cost functions, and the optimal control process, we show that our inverse problem is well-posed using a stochastic maximum principle. Then, with the well-posedness, we reduce the inverse problem to some root finding problem of the expectation of a random variable involved with the value function, which has a unique solution. Based on this result, we propose a numerical method for our inverse problem by replacing the expectation above with arithmetic mean of observed optimal control processes and the corresponding state processes. The recent progress of numerical analysis of Hamilton-Jacobi-Bellman equations enables the proposed method to be implementable for multi-dimensional cases. In particular, with the help of the kernel-based collocation method for Hamilton-Jacobi-Bellman equations, our method for the inverse problems still works well even when an explicit form of the value function is unavailable. Several numerical experiments show that the numerical method recovers the unknown penalty parameter with high accuracy.

Key words: Inverse problems, stochastic control, stochastic maximum principle, Hamilton-Jacobi-Bellman equations, Kernel-based collocation methods.

1 Introduction

The inverse optimal controls refers to the problem of determining the performance index that makes a given control law optimal. Kalman [18] first analyses the case of linear quadratic models in infinite horizon in details. Since then, a vast amount of research has been done on this subject. To name a few, Bellman and Kalaba [4] and Casti [8] adopt the dynamic programming approach to derive some equations for cost functions using the feedback function of the optimal control for finite horizon problems. Thau [27] and Anderson and Moore [3] extend Kalman’s results to the case of a nonlinear cost structure and of a multi input, respectively. In the literature of reinforcement learning, Ng and Russell [24] studies an inverse optimal control in the framework of Markov decision processes. See, e.g., Abbeel and Ng [2] and Ziebart et.al [29] for subsequent studies. In continuous-time stochastic optimal control, Dvijotham and Todorov [12] works under a linearly solvable class of optimal control, where explicit forms of the optimal feedback laws are available, and Deng and Krstić [10] and Do [11] adopt Lyapunov function approaches in infinite horizon problems. See the recent review paper Ab Azar et.al [1] for a comprehensive account.

In this paper, we are concerned with an inverse problem of optimal stochastic controls of diffusion processes that are more general than those in [12] and [10]. Our control problem is described as the minimization problem of

𝔼[g(X(T))+0T(f(t,X(t))+θ|u(t)|2)𝑑t]\mathbb{E}\left[g(X(T))+\int_{0}^{T}\left(f(t,X(t))+\theta|u(t)|^{2}\right)dt\right]

over an admissible class of control processes uu, where {X(t)}0tT\{X(t)\}_{0\leq t\leq T} denotes the controlled diffusion process corresponding to uu. The performance of the state is measured by the functions gg and ff, and θ>0\theta>0 is a parameter penalizing the magnitude of the control. See Section 2 below for a precise formulation. It should be emphasized that the performance criterion of this form appears in many applications, where it is trivial to determine the functions gg and ff, but difficult to choose an appropriate value of θ\theta. For example, in trajectory control problems, we seek a state process that is as close as possible to a given curve, with a practical input process. Thus, in this case, gg and ff are given by distances between the state and the desired curve. Moreover, the penalty term θ|u(t)|2\theta|u(t)|^{2} for the control is naturally incorporated, which affects the overall performance depending on a choice of θ\theta. Motivated by this, we consider the problem of determining θ\theta from given optimal controls for our minimization problem with unknown penalty parameter.

Our first aim is, for the inverse problem above, to study the well-posedness defined by Hadamard [16]. It is widely accepted that a modern interpretation of Hadamard’s principle is: the solution of a given problem exists, unique, and depends continuously on data. In our framework, the existence is interpreted as the surjectivity of the mapping from the set of unknown penalty parameters to the set of optimal controls, which is outside the scope of this paper since observed controls are assumed to be optimal for the problem above for some θ>0\theta>0. The uniqueness is nothing but the bijectivity of the mapping. The continuous dependence on data, referred to as the stability in this paper, is the continuity of its inverse in some sense. See the statement of Theorem 3.5 below for a precise meaning of our stability. As one might predict, the uniqueness and the stability do not hold in general in our continuous time optimal control problems with finite horizons. We will give such an example. On the other hand, under additional mild conditions on the drift, the diffusion, and the cost functions, and the optimal control process, we will show that the uniqueness and the stability do hold using the stochastic maximum principle. It should be emphasized that this is a first attempt to reveal sufficient conditions for which the inverse optimal control problems are well-posed for stochastic controls with finite horizons.

Our second aim is to propose a numerical method for the inverse problem above. The problem is, given NN independent samples {(u(j)(ti),X(j)(ti))}i=0n\{(u^{(j)}(t_{i}),X^{(j)}(t_{i}))\}_{i=0}^{n}, j=1,,Nj=1,\ldots,N, of the pair {(u(ti),X(ti))}i=0n\{(u^{*}(t_{i}),X^{*}(t_{i}))\}_{i=0}^{n} of the optimal control and the corresponding state processes, to determine θ\theta computationally, where 0=t0<t1<<tn=T0=t_{0}<t_{1}<\ldots<t_{n}=T. Our approach is based on the simple optimality condition and is a reduction to some root-finding problem. This of course involves the value function computations, and so we rely on the recent progress of numerical analysis of Hamilton-Jacobi-Bellman equations. Thanks to the uniqueness result of the inverse problem, in our numerical experiments performed in Section 3, the positive root θ\theta’s are recovered with high accuracy.

The present paper is organized as follows: we introduce notation and formulate our inverse problem in Section 2. Section 3 is devoted to the issue of the well-posedness. In Section 4, we propose a numerical method for our inverse problems and validate it. Section 5 concludes.

2 Problem formulation

First, we collect some notation used in this paper. Denote by int(U)\mathrm{int}(U) the interior of UU. Denote also by DξφD_{\xi}\varphi and Dξ2φD_{\xi}^{2}\varphi the gradient vector and the Hessian matrix of φ\varphi with respect to a variable ξ\xi, respectively. We write 𝕊d\mathbb{S}^{d} for the totality of symmetric d×dd\times d real matrices, and a𝖳a^{\mathsf{T}} for the transpose of a vector or matrix aa. Let T>0T>0 and mm\in\mathbb{N}. Let {W(t)}0tT\{W(t)\}_{0\leq t\leq T} be an mm-dimensional standard Brownian motion on a complete probability space (Ω,,)(\Omega,\mathcal{F},\mathbb{P}). Further, let X0X_{0} be a random variable that is independent of {W(t)}0tT\{W(t)\}_{0\leq t\leq T}. Denote by {(t)}0tT\{\mathcal{F}(t)\}_{0\leq t\leq T} the augmented filtration generated by {W(t)}0tT\{W(t)\}_{0\leq t\leq T} and X0X_{0}. The control processes are assumed to take values in a closed set UU of k\mathbb{R}^{k}. The class 𝒰\mathcal{U} of controls is then defined by the set of all UU-valued {(t)}\{\mathcal{F}(t)\}-adapted processes {u(t)}0tT\{u(t)\}_{0\leq t\leq T} satisfying

𝔼0T|u(t)|2𝑑t<.\mathbb{E}\int_{0}^{T}|u(t)|^{2}dt<\infty.

For any given u𝒰u\in\mathcal{U}, consider the dd-dimensional controlled stochastic differential equation

(2.1) dX(t)=b(t,X(t),u(t))dt+σ(t,X(t),u(t))dW(t)dX(t)=b(t,X(t),u(t))dt+\sigma(t,X(t),u(t))dW(t)

with initial condition X0X_{0}, where the functions bb and σ\sigma are assumed to be measurable such that the existence and uniqueness of (2.1) is guaranteed. That is, we assume that for any u𝒰u\in\mathcal{U} there exists a unique continuous {(t)}\{\mathcal{F}(t)\}-adapted process {X(t)}0tT\{X(t)\}_{0\leq t\leq T} satisfying

X(t)=X0+0tb(s,X(s),u(s))𝑑s+0tσ(s,X(s),u(s))𝑑W(s),0tT,X(t)=X_{0}+\int_{0}^{t}b(s,X(s),u(s))ds+\int_{0}^{t}\sigma(s,X(s),u(s))dW(s),\quad 0\leq t\leq T,

almost surely. The functions g:dg:\mathbb{R}^{d}\to\mathbb{R} and f:[0,T]×df:[0,T]\times\mathbb{R}^{d}\to\mathbb{R}, the cost functions introduced in Section 1, are both assumed to be measurable such that

𝔼[g(X(T))+0Tf(t,X(t))𝑑t]>\mathbb{E}\left[g(X(T))+\int_{0}^{T}f(t,X(t))dt\right]>-\infty

for any u𝒰u\in\mathcal{U}. Then, we are concerned with the following optimal control problem:

Problem (𝒞θ)(\mathcal{C}_{\theta}).

Minimize

J[u]:=𝔼[g(X(T))+0T{f(t,X(t))+θ|u(t)|2}𝑑t]J[u]:=\mathbb{E}\left[g(X(T))+\int_{0}^{T}\left\{f(t,X(t))+\theta|u(t)|^{2}\right\}dt\right]

over u𝒰u\in\mathcal{U}.

Here we have assumed that the running cost is decomposed into that for the state and that for the penalty of the control input with weight parameter θ>0\theta>0. Then, we consider the inverse problem with respect to θ\theta given an optimal control and the corresponding state trajectory, i.e., our problem is to recover θ\theta from a solution {u(t)}𝒰\{u^{*}(t)\}\in\mathcal{U} for (𝒞θ)(\mathcal{C}_{\theta}) and the corresponding state process {X(t)}\{X^{*}(t)\} under the assumption that bb, σ\sigma, gg and ff are known.

3 Well-posedness

We shall discuss Hadamard’s well-posedness of the inverse problem of the stochastic optimal control. As stated in Section 1, we skip the existence issue. Then, as in many other inverse problems, the uniqueness and the stability do not hold in general for continuous-time optimal controls.

Example 3.1.

Consider the state equation

dX(t)dt=u2(t)X(t),\frac{dX(t)}{dt}=u^{2}(t)X(t),

with nonrandom initial condition X0X_{0}. The control processes {u(t)}\{u(t)\} are taken from the class of \mathbb{R}-valued Borel functions on [0,T][0,T]. The objective function J[u]J[u] is assumed to be given by

J[u]=X(T)+θ0Tu2(t)𝑑t,J[u]=X(T)+\theta\int_{0}^{T}u^{2}(t)dt,

where θ>0\theta>0. Since X(T)=X0e0Tu2(t)𝑑tX(T)=X_{0}e^{\int_{0}^{T}u^{2}(t)dt}, we obtain infuJ[u]=minγ0{X0eγ+θγ}\inf_{u}J[u]=\min_{\gamma\geq 0}\{X_{0}e^{\gamma}+\theta\gamma\}. Thus, if X00X_{0}\geq 0 then γ0\gamma\equiv 0, i.e., u0u\equiv 0, is optimal for any problem (𝒞θ)(\mathcal{C}_{\theta}) with θ>0\theta>0.

To discuss the uniqueness, we impose the following:

Assumption 3.2.
  1. (i)

    The random variable X0X_{0} is a constant.

  2. (ii)

    The set UU is compact and int(U)\mathrm{int}(U) is nonempty.

  3. (iii)

    The functions bb, σ\sigma, ff, and gg are of C2C^{2}-class in xx.

  4. (iv)

    There exists a constant C0>0C_{0}>0 and a modulus of continuity ρ:[0,)[0,)\rho:[0,\infty)\to[0,\infty) such that for φ(t,x,u)=b(t,x,u)\varphi(t,x,u)=b(t,x,u), σ(t,x,u)\sigma(t,x,u), f(t,x)f(t,x), g(x)g(x), we have

    |φ(t,x,u)φ(t,x,u)|\displaystyle|\varphi(t,x,u)-\varphi(t,x^{\prime},u^{\prime})| C0|xx|+ρ(|uu|),\displaystyle\leq C_{0}|x-x^{\prime}|+\rho(|u-u^{\prime}|),
    |φ(t,0,u)|\displaystyle|\varphi(t,0,u)| C0,\displaystyle\leq C_{0},
    |Dxφ(t,x,u)Dxφ(t,x,u)|\displaystyle|D_{x}\varphi(t,x,u)-D_{x}\varphi(t,x^{\prime},u^{\prime})| C0|xx|+ρ(|uu|),\displaystyle\leq C_{0}|x-x^{\prime}|+\rho(|u-u^{\prime}|),
    |Dx2φ(t,x,u)Dx2φ(t,x,u)|\displaystyle|D_{x}^{2}\varphi(t,x,u)-D_{x}^{2}\varphi(t,x^{\prime},u^{\prime})| ρ(|xx|+|uu|),\displaystyle\leq\rho(|x-x^{\prime}|+|u-u^{\prime}|),

    for t[0,T]t\in[0,T], x,xdx,x^{\prime}\in\mathbb{R}^{d}, and u,uUu,u^{\prime}\in U.

It should be noted that Assumption 3.2 (i) means that the filtration {(t)}0tT\{\mathcal{F}(t)\}_{0\leq t\leq T} is generated by the Brownian motion only. Assumption 3.2 (ii) excludes the case where UU is countable. It should be noted that whether this assumption holds true or not is easily confirmed in practice. The first two conditions in Assumption 3.2 (iv) are standard ones for stochastic optimal control problems. Under these two requirements, there exists a unique solution {X(t)}0tT\{X(t)\}_{0\leq t\leq T} of (2.1) for each u𝒰u\in\mathcal{U}. See Fleming and Soner [15].

Under Assumption 3.2, we can apply a stochastic maximum principle, as described in Yong and Zhou [28], to obtain the uniqueness result for our inverse problem in the following sense:

Theorem 3.3.

Let Assumption 3.2 hold. Suppose that {u(t)}0tT𝒰\{u^{*}(t)\}_{0\leq t\leq T}\in\mathcal{U} is optimal both for the problem (𝒞θ1)(\mathcal{C}_{\theta_{1}}) and (𝒞θ2)(\mathcal{C}_{\theta_{2}}) for some θ1,θ2>0\theta_{1},\theta_{2}>0. Moreover, suppose that (u(t0)int(U){0})>0\mathbb{P}(u^{*}(t_{0})\in\mathrm{int}(U)\setminus\{0\})>0 for some t0[0,T]t_{0}\in[0,T]. Then θ1=θ2\theta_{1}=\theta_{2}.

Proof.

Step (i). Since (X(t),u(t))(X^{*}(t),u^{*}(t)) is optimal both for (𝒞θ1)(\mathcal{C}_{\theta_{1}}) and (𝒞θ2)(\mathcal{C}_{\theta_{2}}), for each θ=θ1\theta=\theta_{1}, θ2\theta_{2}, there exists a unique solution (p(t),q(t))(p(t),q(t)), 0tT0\leq t\leq T, of the backward stochastic differential equation (BSDE)

(3.1) dp(t)\displaystyle dp(t) ={Dxb(t,X(t),u(t))𝖳p(t)+j=1mDxσj(t,X(t),u(t))𝖳qj(t)\displaystyle=-\bigg{\{}D_{x}b(t,X^{*}(t),u^{*}(t))^{\mathsf{T}}p(t)+\sum_{j=1}^{m}D_{x}\sigma_{j}(t,X^{*}(t),u^{*}(t))^{\mathsf{T}}q_{j}(t)
Dxf(t,X(t))}dt+q(t)dW(t),\displaystyle\hskip 25.00003pt-D_{x}f(t,X^{*}(t))\bigg{\}}dt+q(t)dW(t),
p(T)\displaystyle p(T) =Dxg(X(T)),\displaystyle=-D_{x}g(X^{*}(T)),

as well as there exists a unique solution (P(t),Q(t))(P(t),Q(t)), 0tT0\leq t\leq T, of the BSDE

(3.2) dP(t)\displaystyle dP(t) ={Dxb(t,X(t),u(t))𝖳P(t)+P(t)Dxb(t,X(t),u(t))𝖳\displaystyle=-\bigg{\{}D_{x}b(t,X^{*}(t),u^{*}(t))^{\mathsf{T}}P(t)+P(t)D_{x}b(t,X^{*}(t),u^{*}(t))^{\mathsf{T}}
+j=1mDxσj(t,X(t),u(t))𝖳P(t)Dxσj(t,X(t),u(t))\displaystyle\hskip 25.00003pt+\sum_{j=1}^{m}D_{x}\sigma_{j}(t,X^{*}(t),u^{*}(t))^{\mathsf{T}}P(t)D_{x}\sigma_{j}(t,X^{*}(t),u^{*}(t))
+j=1m{Dxσj(t,X(t),u(t))𝖳Qj(t)+Qj(t)Dxσj(t,X(t),u(t))}\displaystyle\hskip 25.00003pt+\sum_{j=1}^{m}\Big{\{}D_{x}\sigma_{j}(t,X^{*}(t),u^{*}(t))^{\mathsf{T}}Q_{j}(t)+Q_{j}(t)D_{x}\sigma_{j}(t,X^{*}(t),u^{*}(t))\Big{\}}
+Dx2H0(t,X(t),u(t),p(t),q(t))}dt+j=1mQj(t)dW(t),\displaystyle\hskip 25.00003pt+D_{x}^{2}H_{0}(t,X^{*}(t),u^{*}(t),p(t),q(t))\bigg{\}}dt+\sum_{j=1}^{m}Q_{j}(t)dW(t),
P(T)\displaystyle P(T) =Dx2g(X(T)),\displaystyle=-D_{x}^{2}g(X^{*}(T)),

where q(t)=(q1(t),,qm(t))q(t)=(q_{1}(t),\ldots,q_{m}(t)), Q(t)=(Q1(t),,Qm(t))Q(t)=(Q_{1}(t),\ldots,Q_{m}(t)), σj(t,x,u)d\sigma_{j}(t,x,u)\in\mathbb{R}^{d} for each j=1,,mj=1,\ldots,m such that σ(t,x,u)=(σ1(t,x,u),,σm(t,x,u))\sigma(t,x,u)=(\sigma_{1}(t,x,u),\ldots,\sigma_{m}(t,x,u)), and

H0(t,x,u,p,q):=p𝖳b(t,x,u)+tr(q𝖳σ(t,x,u))f(t,x)θ|u|2.H_{0}(t,x,u,p,q):=p^{\mathsf{T}}b(t,x,u)+\mathrm{tr}(q^{\mathsf{T}}\sigma(t,x,u))-f(t,x)-\theta|u|^{2}.

In particular, p(t)p(t), q1(t),,qm(t)q_{1}(t),\ldots,q_{m}(t) are d\mathbb{R}^{d}-valued and Q1(t),,Qm(t)Q_{1}(t),\ldots,Q_{m}(t) are 𝕊d\mathbb{S}^{d}-valued, all of which are adapted processes satisfying

𝔼0T|φ(t)|2𝑑t<\mathbb{E}\int_{0}^{T}|\varphi(t)|^{2}dt<\infty

for φ=p,q1,,qm,Q1,,Qm\varphi=p,q_{1},\ldots,q_{m},Q_{1},\ldots,Q_{m}. Moreover, with the generalized Hamiltonian

(t,x,u)\displaystyle\mathcal{H}(t,x,u) :=H0(t,x,u,p(t),q(t))12tr[σ(t,X(t),u(t))𝖳P(t)σ(t,X(t),u(t))]\displaystyle:=H_{0}(t,x,u,p(t),q(t))-\frac{1}{2}\mathrm{tr}\left[\sigma(t,X^{*}(t),u^{*}(t))^{\mathsf{T}}P(t)\sigma(t,X^{*}(t),u^{*}(t))\right]
+12tr{[σ(t,x,u)σ(t,X(t),u(t))]𝖳P(t)[σ(t,x,u)σ(t,X(t),u(t))]}\displaystyle\quad+\frac{1}{2}\mathrm{tr}\left\{[\sigma(t,x,u)-\sigma(t,X^{*}(t),u^{*}(t))]^{\mathsf{T}}P(t)[\sigma(t,x,u)-\sigma(t,X^{*}(t),u^{*}(t))]\right\}

we have

(3.3) (t,X(t),u(t))=maxuU(t,X(t),u).\mathcal{H}(t,X^{*}(t),u^{*}(t))=\max_{u\in U}\mathcal{H}(t,X^{*}(t),u).

See Chapter 3 in [28].

Step (ii). We write (pi(t),qi(t),Pi(t),Qi(t))(p_{i}(t),q_{i}(t),P_{i}(t),Q_{i}(t)) for the corresponding (p(t),q(t),P(t),Q(t))(p(t),q(t),P(t),Q(t)) for θ=θi\theta=\theta_{i} where i=1,2i=1,2. Similarly, we write i\mathcal{H}_{i} for the corresponding \mathcal{H} for θ=θi\theta=\theta_{i} where i=1,2i=1,2. By the optimality condition (3.3) and our assumptions, we obtain

Dui(t0,X(t0),u(t0))=0,i=1,2,D_{u}\mathcal{H}_{i}(t_{0},X^{*}(t_{0}),u^{*}(t_{0}))=0,\quad i=1,2,

with positive probability. So,

0\displaystyle 0 =K(t0,X(t0),u(t0),p1(t),P1(t))2θ1u(t0)\displaystyle=K(t_{0},X^{*}(t_{0}),u^{*}(t_{0}),p_{1}(t),P_{1}(t))-2\theta_{1}u^{*}(t_{0})
=K(t0,X(t0),u(t0),p2(t),P2(t))2θ2u(t0)\displaystyle=K(t_{0},X^{*}(t_{0}),u^{*}(t_{0}),p_{2}(t),P_{2}(t))-2\theta_{2}u^{*}(t_{0})

where for t[0,T]t\in[0,T], x,pdx,p\in\mathbb{R}^{d}, and Pd×dP\in\mathbb{R}^{d\times d},

K(t,x,u,p,P)\displaystyle K(t,x,u,p,P)
=Du[12tr(σ(t,x,u)𝖳Pσ(t,x,u))+p𝖳b(t,x,u)\displaystyle=D_{u}\Big{[}\frac{1}{2}\mathrm{tr}\left(\sigma(t,x,u)^{\mathsf{T}}P\sigma(t,x,u)\right)+p^{\mathsf{T}}b(t,x,u)
+12tr{[σ(t,x,u)σ(t,X(t),u(t))]𝖳P(t)[σ(t,x,u)σ(t,X(t),u(t))]}].\displaystyle\quad+\frac{1}{2}\mathrm{tr}\left\{[\sigma(t,x,u)-\sigma(t,X^{*}(t),u^{*}(t))]^{\mathsf{T}}P(t)[\sigma(t,x,u)-\sigma(t,X^{*}(t),u^{*}(t))]\right\}\Big{]}.

Since the uniqueness of the BSDEs immediately leads to p1=p2p_{1}=p_{2} and P1=P2P_{1}=P_{2}, the equalities above yield θ1=θ2\theta_{1}=\theta_{2}, as claimed. ∎

To study the stability, we impose additional conditions.

Assumption 3.4.

The functions bb and σ\sigma are of C1C^{1}-class in uu. Further, for φ(t,x,u)=b(t,x,u)\varphi(t,x,u)=b(t,x,u), σ(t,x,u)\sigma(t,x,u), f(t,x)f(t,x), g(x)g(x), we have

|φ(t,x,u)|+|Duφ(t,x,u)|\displaystyle|\varphi(t,x,u)|+|D_{u}\varphi(t,x,u)| C0,\displaystyle\leq C_{0},
|Duφ(t,x,u)Duφ(t,x,u)|\displaystyle|D_{u}\varphi(t,x,u)-D_{u}\varphi(t,x^{\prime},u^{\prime})| ρ(|xx|+|uu|),x,xd,u,uU,\displaystyle\leq\rho(|x-x^{\prime}|+|u-u^{\prime}|),\quad x,x^{\prime}\in\mathbb{R}^{d},\;\;u,u^{\prime}\in U,

where C0C_{0} and ρ\rho are as in Assumption 3.2.

Then, again by the stochastic maximum principle, we have the stability result in the following sense:

Theorem 3.5.

Let Assumptions 3.2 and 3.4 hold. Let {u(t)}0tT𝒰\{u^{*}(t)\}_{0\leq t\leq T}\in\mathcal{U} be optimal for the problem (𝒞θ)(\mathcal{C}_{\theta^{*}}) for some θ>0\theta^{*}>0, and for each nn\in\mathbb{N} let {un(t)}0tT𝒰\{u_{n}(t)\}_{0\leq t\leq T}\in\mathcal{U} be optimal for (𝒞θn)(\mathcal{C}_{\theta_{n}}) for some θn>0\theta_{n}>0. Suppose that

𝔼0T|un(t)u(t)|2𝑑t0,n,\mathbb{E}\int_{0}^{T}|u_{n}(t)-u^{*}(t)|^{2}dt\to 0,\quad n\to\infty,

and that there exist n0n_{0}\in\mathbb{N} and a measurable set E[0,T]×ΩE\subset[0,T]\times\Omega with positive measure satisfying

un(t,ω),u(t,ω)int(U){0},(t,ω)E,u_{n}(t,\omega),\;u^{*}(t,\omega)\in\mathrm{int}(U)\setminus\{0\},\quad(t,\omega)\in E,

for nn0n\geq n_{0}. Then we have limnθn=θ\lim_{n\to\infty}\theta_{n}=\theta^{*}.

Proof.

By CC we denote a generic positive constant that may vary from line to line. For any nn0n\geq n_{0}, by (3.3),

(3.4) 2θnun(t,ω)2θu(t,ω)=Ln(t,ω)L(t,ω),(t,ω)E,2\theta_{n}u_{n}(t,\omega)-2\theta^{*}u^{*}(t,\omega)=L_{n}(t,\omega)-L^{*}(t,\omega),\quad(t,\omega)\in E,

where Ln(t,ω)=K(t,Xn(t,ω),un(t,ω),pn(t,ω),Pn(t,ω))L_{n}(t,\omega)=K(t,X_{n}(t,\omega),u_{n}(t,\omega),p_{n}(t,\omega),P_{n}(t,\omega)) with XnX_{n} being the state process corresponding to unu_{n}, and (pn,qn)(p_{n},q_{n}) and (Pn,Qn)(P_{n},Q_{n}) the solutions of the BSDE (3.1) and (3.2) with u=unu=u_{n} and X=XnX=X_{n}, respectively. LL^{*} is similarly defined. The usual arguments with Gronwall’s lemma yield

(3.5) sup0tT𝔼|Xn(t)X(t)|2C𝔼0Tρ(|un(t)u(t)|)2𝑑t.\sup_{0\leq t\leq T}\mathbb{E}|X_{n}(t)-X^{*}(t)|^{2}\leq C\mathbb{E}\int_{0}^{T}\rho(|u_{n}(t)-u^{*}(t)|)^{2}dt.

Since ρ\rho is a modulus of continuity for Dx2b(t,,u)D_{x}^{2}b(t,\cdot,u) uniformly in tt and uu, we may assume that ρ\rho is bounded. Hence, for any ε>0\varepsilon>0 there exists δ>0\delta>0 such that ρ(r)ε+δr\rho(r)\leq\varepsilon+\delta r, r0r\geq 0. From this and the L2L^{2}-convergence of unu_{n} we obtain

(3.6) sup0tT𝔼|Xn(t)X(t)|20,n.\sup_{0\leq t\leq T}\mathbb{E}|X_{n}(t)-X^{*}(t)|^{2}\to 0,\quad n\to\infty.

Furthermore, by the a priori estimates for the BSDE (see, e.g., [28, Theorem 3.3, Chapter 7]),

𝔼0T|pn(t)p(t)|2𝑑t\displaystyle\mathbb{E}\int_{0}^{T}|p_{n}(t)-p^{*}(t)|^{2}dt
C𝔼|Dxg(Xn(T))Dxg(X(T))|2\displaystyle\leq C\mathbb{E}|D_{x}g(X_{n}(T))-D_{x}g(X^{*}(T))|^{2}
+C𝔼0T|Dxb(t,Xn(t),un(t))Dxb(t,X(t),u(t))|2|p(t)|2𝑑t\displaystyle\quad+C\mathbb{E}\int_{0}^{T}|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{*}(t),u^{*}(t))|^{2}|p^{*}(t)|^{2}dt
+C𝔼0T|Dxσ(t,Xn(t),un(t))Dxσ(t,X(t),u(t))|2|q(t)|2𝑑t\displaystyle\quad+C\mathbb{E}\int_{0}^{T}|D_{x}\sigma(t,X_{n}(t),u_{n}(t))-D_{x}\sigma(t,X^{*}(t),u^{*}(t))|^{2}|q^{*}(t)|^{2}dt
+C𝔼0T|Dxf(t,Xn(t))Dxf(t,X(t))|2𝑑t.\displaystyle\quad+C\mathbb{E}\int_{0}^{T}|D_{x}f(t,X_{n}(t))-D_{x}f(t,X^{*}(t))|^{2}dt.

By Assumption 3.2 (iv) and (3.5),

𝔼|Dxg(Xn(T))Dxg(X(T))|2+𝔼0T|Dxf(t,Xn(t))Dxf(t,X(t))|2𝑑t\displaystyle\mathbb{E}|D_{x}g(X_{n}(T))-D_{x}g(X^{*}(T))|^{2}+\mathbb{E}\int_{0}^{T}|D_{x}f(t,X_{n}(t))-D_{x}f(t,X^{*}(t))|^{2}dt
C𝔼0Tρ(|un(t)u(t)|)2𝑑t0,n.\displaystyle\leq C\mathbb{E}\int_{0}^{T}\rho(|u_{n}(t)-u^{*}(t)|)^{2}dt\to 0,\quad n\to\infty.

Using Assumption 3.2 (iv) again, for ε>0\varepsilon>0, we observe

𝔼0T|Dxb(t,Xn(t),un(t))Dxb(t,X(t),u(t))|2|p(t)|2𝑑t\displaystyle\mathbb{E}\int_{0}^{T}|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{*}(t),u^{*}(t))|^{2}|p^{*}(t)|^{2}dt
𝔼0T|Dxb(t,Xn(t),un(t))Dxb(t,X(t),u(t))|2|p(t)|2\displaystyle\leq\mathbb{E}\int_{0}^{T}|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{*}(t),u^{*}(t))|^{2}|p^{*}(t)|^{2}
×1{|Xn(t)X(t)|+ρ(|un(t)u(t)|)>ε}dt\displaystyle\hskip 30.00005pt\times 1_{\{|X_{n}(t)-X^{*}(t)|+\rho(|u_{n}(t)-u^{*}(t)|)>\varepsilon\}}dt
+𝔼0T|Dxb(t,Xn(t),un(t))Dxb(t,X(t),u(t))|2|p(t)|2\displaystyle\quad+\mathbb{E}\int_{0}^{T}|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{*}(t),u^{*}(t))|^{2}|p^{*}(t)|^{2}
×1{|Xn(t)X(t)|+ρ(|un(t)u(t)|)ε}dt\displaystyle\hskip 30.00005pt\times 1_{\{|X_{n}(t)-X^{*}(t)|+\rho(|u_{n}(t)-u^{*}(t)|)\leq\varepsilon\}}dt
C𝔼0T|p(t)|21{|Xn(t)X(t)|+ρ(|un(t)u(t)|)>ε}𝑑t+Cε𝔼0T|p(t)|2𝑑t.\displaystyle\leq C\mathbb{E}\int_{0}^{T}|p^{*}(t)|^{2}1_{\{|X_{n}(t)-X^{*}(t)|+\rho(|u_{n}(t)-u^{*}(t)|)>\varepsilon\}}dt+C\varepsilon\mathbb{E}\int_{0}^{T}|p^{*}(t)|^{2}dt.

Thus, letting nn\to\infty and then ε0\varepsilon\to 0, we have

limn𝔼0T|Dxb(t,Xn(t),un(t))Dxb(t,X(t),u(t))|2|p(t)|2𝑑t=0.\lim_{n\to\infty}\mathbb{E}\int_{0}^{T}|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{*}(t),u^{*}(t))|^{2}|p^{*}(t)|^{2}dt=0.

Similarly,

limn𝔼0T|Dxσ(t,Xn(t),un(t))Dxσ(t,X(t),u(t))|2|q(t)|2𝑑t=0.\lim_{n\to\infty}\mathbb{E}\int_{0}^{T}|D_{x}\sigma(t,X_{n}(t),u_{n}(t))-D_{x}\sigma(t,X^{*}(t),u^{*}(t))|^{2}|q^{*}(t)|^{2}dt=0.

Therefore,

(3.7) limn𝔼0T|pn(t)p(t)|2=0.\lim_{n\to\infty}\mathbb{E}\int_{0}^{T}|p_{n}(t)-p^{*}(t)|^{2}=0.

By the same way, we obtain

(3.8) limn𝔼0T|Pn(t)P(t)|2=0.\lim_{n\to\infty}\mathbb{E}\int_{0}^{T}|P_{n}(t)-P^{*}(t)|^{2}=0.

For notational simplicity we denote Dubn(t)=Dub(t,Xn(t),un(t))D_{u}b_{n}(t)=D_{u}b(t,X_{n}(t),u_{n}(t)). Analogously we use the notation Dub(t)D_{u}b^{*}(t), σn(t)\sigma_{n}(t), and Duσn(t)D_{u}\sigma_{n}(t). With this notation, by Assumption 3.4 we see

|Ln(t)L(t)|\displaystyle|L_{n}(t)-L^{*}(t)|
|Dubn(t)||pn(t)p(t)|+|p(t)||Dubn(t)Dub(t)|\displaystyle\leq|D_{u}b_{n}(t)||p_{n}(t)-p^{*}(t)|+|p^{*}(t)||D_{u}b_{n}(t)-D_{u}b^{*}(t)|
+C|σn(t)||Pn(t)||Duσn(t)Duσ(t)|+C|Duσ(t)||Pn(t)||σn(t)σ(t)|\displaystyle\quad+C|\sigma_{n}(t)||P_{n}(t)||D_{u}\sigma_{n}(t)-D_{u}\sigma^{*}(t)|+C|D_{u}\sigma^{*}(t)||P_{n}(t)||\sigma_{n}(t)-\sigma^{*}(t)|
+|Duσ(t)||σ(t)|)|Pn(t)P(t)|\displaystyle\quad+|D_{u}\sigma^{*}(t)||\sigma^{*}(t)|)|P_{n}(t)-P^{*}(t)|
C(1+|Xn(t)|+|un(t)|)(|pn(t)p(t)|+|Pn(t)P(t)|)\displaystyle\leq C(1+|X_{n}(t)|+|u_{n}(t)|)(|p_{n}(t)-p^{*}(t)|+|P_{n}(t)-P^{*}(t)|)
+Cε(1+|p(t)|+|pn(t)|)(ε+|Xn(t)X(t)|+|un(t)u(t)|)\displaystyle\quad+C_{\varepsilon}(1+|p^{*}(t)|+|p_{n}(t)|)(\varepsilon+|X_{n}(t)-X^{*}(t)|+|u_{n}(t)-u^{*}(t)|)

for any ε>0\varepsilon>0 with constant CεC_{\varepsilon} depending on ε\varepsilon. Thus, by Cauchy-Schwartz inequality and (3.6)–(3.8),

limn𝔼0T|Ln(t)L(t)|𝑑t=0.\lim_{n\to\infty}\mathbb{E}\int_{0}^{T}|L_{n}(t)-L^{*}(t)|dt=0.

From this and (3.4) it follows that

2|θnθ|E|un|𝑑t×𝑑2|θ|𝔼0T|un(t)u(t)|𝑑t+𝔼0T|Ln(t)L(t)|𝑑t0,n.2|\theta_{n}-\theta^{*}|\int_{E}|u_{n}|dt\times d\mathbb{P}\leq 2|\theta^{*}|\mathbb{E}\int_{0}^{T}|u_{n}(t)-u^{*}(t)|dt+\mathbb{E}\int_{0}^{T}|L_{n}(t)-L^{*}(t)|dt\to 0,\quad n\to\infty.

On the other hand,

limnE|un|𝑑t×𝑑=E|u|𝑑t×𝑑>0.\lim_{n\to\infty}\int_{E}|u_{n}|dt\times d\mathbb{P}=\int_{E}|u^{*}|dt\times d\mathbb{P}>0.

Therefore, θnθ\theta_{n}\to\theta^{*}, as claimed. ∎

Next we consider the case of the linear quadratic regulator problems. We need special treatment since Assumptions 3.2 and 3.4 (iii) exclude the case where the state processes are affine in controls taking values in unbounded sets.

Assumption 3.6.
  1. (i)

    The random variable X0X_{0} satisfies 𝔼|X0|2<\mathbb{E}|X_{0}|^{2}<\infty.

  2. (ii)

    The set UU is given by U=kU=\mathbb{R}^{k}.

  3. (iii)

    The functions bb, σ\sigma, ff, and gg are given respectively by

    b(t,x,u)=b0(t)x+b1(t)u,σ(t,x,u)=σ0(t),\displaystyle b(t,x,u)=b_{0}(t)x+b_{1}(t)u,\quad\sigma(t,x,u)=\sigma_{0}(t),
    f(t,x)=x𝖳S(t)x,g(x)=x𝖳Rx,\displaystyle f(t,x)=x^{\mathsf{T}}S(t)x,\quad g(x)=x^{\mathsf{T}}Rx,

    for t[0,T]t\in[0,T], xdx\in\mathbb{R}^{d}, and uku\in\mathbb{R}^{k}, where b0b_{0} is d×d\mathbb{R}^{d\times d}-valued, b1b_{1} is d×k\mathbb{R}^{d\times k}-valued, σ0\sigma_{0} is d×m\mathbb{R}^{d\times m}, and SS is d×d\mathbb{R}^{d\times d}-valued, all of which are continuous on [0,T][0,T], and Rd×dR\in\mathbb{R}^{d\times d}. Further RR and S(t)S(t) are positive semidefinite for any t[0,T]t\in[0,T].

In the linear quadratic cases, explicit forms of the optimal controls are available. This helps us to obtain the following uniqueness:

Theorem 3.7.

Let Assumption 3.6 hold. Suppose that {u(t)}0tT𝒰\{u^{*}(t)\}_{0\leq t\leq T}\in\mathcal{U} is optimal both for the problem (𝒞θ1)(\mathcal{C}_{\theta_{1}}) and (𝒞θ2)(\mathcal{C}_{\theta_{2}}) for some θ1,θ2>0\theta_{1},\theta_{2}>0. Moreover, suppose that (u(t0)0)>0\mathbb{P}(u^{*}(t_{0})\neq 0)>0 for some t0[0,T]t_{0}\in[0,T]. Then θ1=θ2\theta_{1}=\theta_{2}.

Proof.

For each θ=θ1\theta=\theta_{1} and θ2\theta_{2}, an optimal control for (𝒞θ)(\mathcal{C}_{\theta}) uniquely exists and we necessarily have

(3.9) u(t)=1θb1(t)𝖳F(t)X(t),0tT,u^{*}(t)=-\frac{1}{\theta}b_{1}(t)^{\mathsf{T}}F(t)X^{*}(t),\quad 0\leq t\leq T,

where {F(t)}0tT\{F(t)\}_{0\leq t\leq T} is a unique solution of the matrix Riccati equation

(3.10) ddtF(t)+b0𝖳(t)F(t)+F(t)b0(t)1θF(t)b1(t)b1(t)𝖳F(t)+S(t)=0,F(T)=R.\frac{d}{dt}F(t)+b_{0}^{\mathsf{T}}(t)F(t)+F(t)b_{0}(t)-\frac{1}{\theta}F(t)b_{1}(t)b_{1}(t)^{\mathsf{T}}F(t)+S(t)=0,\quad F(T)=R.

We refer to, e.g., Bensoussan [5] for this result. A simple application of Itô formula then yields

dF(t)X(t)=(b0(t)𝖳F(t)+S(t))X(t)dt+F(t)σ0(t)dW(t).dF(t)X^{*}(t)=-(b_{0}(t)^{\mathsf{T}}F(t)+S(t))X^{*}(t)dt+F(t)\sigma_{0}(t)dW(t).

Let F1F_{1} and F2F_{2} be the solution of the Riccati equation (3.10) corresponding to θ1\theta_{1} and θ2\theta_{2}, respectively. Then we have

d(F1(t)F2(t))X(t)=b0(t)𝖳X(t)dt+(F1(t)F2(t))σ0(t)dW(t).d(F_{1}(t)-F_{2}(t))X^{*}(t)=-b_{0}(t)^{\mathsf{T}}X^{*}(t)dt+(F_{1}(t)-F_{2}(t))\sigma_{0}(t)dW(t).

From F1(T)=F2(T)F_{1}(T)=F_{2}(T) it follows that

(F1(t)F2(t))X(t)=𝔼[tTb0(s)𝖳(F1(s)F2(s))X(s)𝑑s|(t)],(F_{1}(t)-F_{2}(t))X^{*}(t)=\mathbb{E}\left[\left.\int_{t}^{T}b_{0}(s)^{\mathsf{T}}(F_{1}(s)-F_{2}(s))X^{*}(s)ds\right|\mathcal{F}(t)\right],

whence

𝔼|(F1(t)F2(t))X(t)|2CtT𝔼|(F1(s)F2(s))X(s)|2𝑑s\mathbb{E}|(F_{1}(t)-F_{2}(t))X^{*}(t)|^{2}\leq C\int_{t}^{T}\mathbb{E}|(F_{1}(s)-F_{2}(s))X^{*}(s)|^{2}ds

for some positive constant C>0C>0. Therefore we have F1X=F2XF_{1}X^{*}=F_{2}X^{*} and so θ1=θ2\theta_{1}=\theta_{2} since b1(t0)𝖳F1(t0)X(t0)0b_{1}(t_{0})^{\mathsf{T}}F_{1}(t_{0})X^{*}(t_{0})\neq 0 with positive probability. Thus the theorem follows. ∎

As in the previous theorem, thanks to the explicit representation of the optimal controls, we have the following stability result:

Theorem 3.8.

Let Assumption 3.6 hold. Let {u(t)}0tT𝒰\{u^{*}(t)\}_{0\leq t\leq T}\in\mathcal{U} be optimal for the problem (𝒞θ)(\mathcal{C}_{\theta^{*}}) for some θ>0\theta^{*}>0, and for each nn\in\mathbb{N} let {un(t)}0tT𝒰\{u_{n}(t)\}_{0\leq t\leq T}\in\mathcal{U} be optimal for (𝒞θn)(\mathcal{C}_{\theta_{n}}) for some θn>0\theta_{n}>0. Suppose that

𝔼0T|un(t)u(t)|2𝑑t0,n,\mathbb{E}\int_{0}^{T}|u_{n}(t)-u^{*}(t)|^{2}dt\to 0,\quad n\to\infty,

and that there exists a measurable set E[0,T]×ΩE\subset[0,T]\times\Omega with positive measure satisfying

u(t,ω)0,(t,ω)E.u^{*}(t,\omega)\neq 0,\quad(t,\omega)\in E.

Then we have limnθn=θ\lim_{n\to\infty}\theta_{n}=\theta^{*}.

Proof.

By CC we denote a generic positive constant that may vary from line to line. Let X(t)X^{*}(t) and Xn(t)X_{n}(t) be the state processes corresponding to uu^{*} and unu_{n}, respectively. Further, let FF^{*} and FnF_{n} be the solution of the Riccati equation (3.10) corresponding to θ\theta^{*} and θn\theta_{n}, respectively, nn\in\mathbb{N}. Then, as in the proof of Theorem 3.7,

F(t)X(t)Fn(t)Xn(t)\displaystyle F^{*}(t)X^{*}(t)-F_{n}(t)X_{n}(t)
=𝔼[R(X(T)Xn(T))+tTb0(s)𝖳(F(s)X(s)Fn(s)Xn(s))𝑑s|(t)],\displaystyle=\mathbb{E}\left[\left.R(X^{*}(T)-X_{n}(T))+\int_{t}^{T}b_{0}(s)^{\mathsf{T}}(F^{*}(s)X^{*}(s)-F_{n}(s)X_{n}(s))ds\right|\mathcal{F}(t)\right],

whence by Gronwall inequality,

sup0tT𝔼|F(t)X(t)Fn(t)Xn(t)|2C𝔼|X(T)Xn(T)|20,n.\sup_{0\leq t\leq T}\mathbb{E}|F^{*}(t)X^{*}(t)-F_{n}(t)X_{n}(t)|^{2}\leq C\mathbb{E}|X^{*}(T)-X_{n}(T)|^{2}\to 0,\quad n\to\infty.

Thus,

|θnθ|Eun𝑑t×𝑑\displaystyle|\theta_{n}-\theta^{*}|\int_{E}u_{n}dt\times d\mathbb{P} |θ|𝔼0T|un(t)u(t)|𝑑t\displaystyle\leq|\theta^{*}|\mathbb{E}\int_{0}^{T}|u_{n}(t)-u^{*}(t)|dt
+C𝔼0T|F(t)X(t)Fn(t)Xn(t)|𝑑t0,n.\displaystyle\quad+C\mathbb{E}\int_{0}^{T}|F^{*}(t)X^{*}(t)-F_{n}(t)X_{n}(t)|dt\to 0,\quad n\to\infty.

Consequently, θnθ\theta_{n}\to\theta^{*}, as required. ∎

4 Numerical method

Here we propose a method for determining the weight parameter θ\theta given observed data of optimal controls. To this end, recall that the value function VV for the problem (𝒞θ)(\mathcal{C}_{\theta}) is given by

(4.1) V(t,x;θ)=infu𝒰t𝔼[g(X(T))+tT(f(s,X(s))+θ|u(s)|2)𝑑s|X(t)=x],V(t,x;\theta)=\inf_{u\in\mathcal{U}_{t}}\mathbb{E}\left[\left.g(X(T))+\int_{t}^{T}(f(s,X(s))+\theta|u(s)|^{2})ds\right|X(t)=x\right],

for (t,x)[0,T]×d(t,x)\in[0,T]\times\mathbb{R}^{d}, where 𝒰t\mathcal{U}_{t} is the set of all UU-valued {(s)}\{\mathcal{F}(s)\}-adapted processes {u(s)}tsT\{u(s)\}_{t\leq s\leq T} satisfying

𝔼tT|u(s)|2𝑑s<.\mathbb{E}\int_{t}^{T}|u(s)|^{2}ds<\infty.

It is well-known that under Assumption 3.2 the value fuction V(t,x)V(t,x;θ)V(t,x)\equiv V(t,x;\theta) is a unique continuous viscosity solution of the Hamilton-Jacobi-Bellman (HJB) equation

(4.2) {tV(t,x)+H(t,x,DxV(t,x),Dx2V(t,x))=0,(t,x)[0,T)×d,V(T,x)=g(x),xd,\left\{\begin{split}&\partial_{t}V(t,x)+H(t,x,D_{x}V(t,x),D^{2}_{x}V(t,x))=0,\quad(t,x)\in[0,T)\times\mathbb{R}^{d},\\ &V(T,x)=g(x),\quad x\in\mathbb{R}^{d},\end{split}\right.

where

H(t,x,p,M)=infuU[b(t,x,u)𝖳p+12tr(σ(t,x,u)σ(t,x,u)𝖳M)+f(t,x)+θ|u|2]H(t,x,p,M)=\inf_{u\in U}\left[b(t,x,u)^{\mathsf{T}}p+\frac{1}{2}\mathrm{tr}(\sigma(t,x,u)\sigma(t,x,u)^{\mathsf{T}}M)+f(t,x)+\theta|u|^{2}\right]

for (t,x,p,M)d×d×𝕊d(t,x,p,M)\in\mathbb{R}^{d}\times\mathbb{R}^{d}\times\mathbb{S}^{d} (see, e.g., [15]). Under Assumption 3.6, i.e., the case of the linear quadratic regulator problems, the value function VV is given by

(4.3) V(t,x)=x𝖳F(t)x+G(t),(t,x)[0,T]×d,V(t,x)=x^{\mathsf{T}}F(t)x+G(t),\quad(t,x)\in[0,T]\times\mathbb{R}^{d},

where FF is as in (3.10) and GG is the unique solution of

ddtG(t)+tr(σ0(t)𝖳F(t)σ0(t))=0,G(T)=0.\frac{d}{dt}G(t)+\mathrm{tr}(\sigma_{0}(t)^{\mathsf{T}}F(t)\sigma_{0}(t))=0,\quad G(T)=0.

Then we have the following basic result:

Theorem 4.1.

Let {u(t)}0tT𝒰\{u^{*}(t)\}_{0\leq t\leq T}\in\mathcal{U} solves (𝒞θ)(\mathcal{C}_{\theta^{*}}) for some θ>0\theta^{*}>0, and {X(t)}0tT\{X^{*}(t)\}_{0\leq t\leq T} the corresponding state trajectory. Suppose that Assumption 3.4 holds and (u(t0)int(U){0})>0\mathbb{P}(u^{*}(t_{0})\in\mathrm{int}(U)\setminus\{0\})>0 for some t0[0,T]t_{0}\in[0,T]. Then, θ\theta^{*} is a unique positive root of the equation

(4.4) Φ(θ):=𝔼[g(X(T))+0T(f(t,X(t))+θ|u(t)|2)𝑑tV(0,X0;θ)]=0.\Phi(\theta):=\mathbb{E}\left[g(X^{*}(T))+\int_{0}^{T}(f(t,X^{*}(t))+\theta|u^{*}(t)|^{2})dt-V(0,X_{0};\theta)\right]=0.
Proof.

Since X0X_{0} is a constant and {u(t)}\{u^{*}(t)\} is optimal, clearly we have V(0,X0;θ)=infu𝒰J[u]=J[u]V(0,X_{0};\theta^{*})=\inf_{u\in\mathcal{U}}J[u]=J[u^{*}], whence Φ(θ)=0\Phi(\theta^{*})=0. If θ>0\theta^{\prime}>0 satisfies Φ(θ)=0\Phi(\theta^{\prime})=0, then {u(t)}\{u^{*}(t)\} is also optimal for (𝒞θ)(\mathcal{C}_{\theta^{\prime}}). By Theorem 3.3, we obtain θ=θ\theta^{\prime}=\theta^{*}. ∎

We have the same result in the case of the linear quadratic regulator problems.

Theorem 4.2.

Let {u(t)}0tT𝒰\{u^{*}(t)\}_{0\leq t\leq T}\in\mathcal{U} solves (𝒞θ)(\mathcal{C}_{\theta^{*}}) for some θ>0\theta^{*}>0, and {X(t)}0tT\{X^{*}(t)\}_{0\leq t\leq T} the corresponding state trajectory. Suppose that Assumption 3.6 holds and (u(t0)0)>0\mathbb{P}(u^{*}(t_{0})\neq 0)>0 for some t0[0,T]t_{0}\in[0,T]. Then, θ\theta^{*} is a unique positive root of the equation Φ(θ)=0\Phi(\theta)=0, defined as in (4.4)\eqref{eq:3.3}.

Proof.

Since {u(t)}\{u^{*}(t)\} is necessarily given by (3.9), it is clear that the process {u(t)|X0=x}\{u^{*}(t)|_{X_{0}=x}\} is optimal for any xdx\in\mathbb{R}^{d}. Moreover, the process {X(t)|X0=x}\{X^{*}(t)|_{X_{0}=x}\} is the state process corresponding to {u(t)|X0=x}\{u^{*}(t)|_{X_{0}=x}\}. Thus,

J[u]\displaystyle J[u^{*}] =𝔼[X(T)𝖳RX(T)+0T(X(t)𝖳P(t)X(t)+θ|u(t)|2)𝑑t]\displaystyle=\mathbb{E}\left[X^{*}(T)^{\mathsf{T}}RX^{*}(T)+\int_{0}^{T}\left(X^{*}(t)^{\mathsf{T}}P(t)X^{*}(t)+\theta^{*}|u^{*}(t)|^{2}\right)dt\right]
=𝔼[𝔼[X(T)𝖳RX(T)+0T(X(t)𝖳P(t)X(t)+θ|u(t)|2)𝑑t|X0]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\left.X^{*}(T)^{\mathsf{T}}RX^{*}(T)+\int_{0}^{T}\left(X^{*}(t)^{\mathsf{T}}P(t)X^{*}(t)+\theta^{*}|u^{*}(t)|^{2}\right)dt\right|X_{0}\right]\right]
=𝔼[V(0,X0;θ)].\displaystyle=\mathbb{E}[V(0,X_{0};\theta^{*})].

The rest of the proof can be done by the same way as in the proof of Theorem 4.1. ∎

Now, suppose that the NN independent samples {u(j)(ti),X(j)(ti)}\{u^{(j)}(t_{i}),X^{(j)}(t_{i})\}, i=0,,ni=0,\ldots,n, j=1,,Nj=1,\ldots,N, of optimal control process for (𝒞θ)(\mathcal{C}_{\theta^{*}}) at time tit_{i} and the corresponding state at time tit_{i} are available, where 0=t0<<tn=T0=t_{0}<\cdots<t_{n}=T. An inverse control problem is then to determine the unknown θ\theta^{*} from the observations. Our approach is to focus on the following problem:

Problem ()(\mathcal{I}).

Find a positive root of the equation

1Nj=1N{g(X(j)(T))+i=0n1(f(ti,X(j)(ti))+θ|u(j)(ti)|2)(ti+1ti)V(0,X(j)(0);θ)}=0.\frac{1}{N}\sum_{j=1}^{N}\left\{g(X^{(j)}(T))+\sum_{i=0}^{n-1}(f(t_{i},X^{(j)}(t_{i}))+\theta|u^{(j)}(t_{i})|^{2})(t_{i+1}-t_{i})-V(0,X^{(j)}(0);\theta)\right\}=0.

In views of the strong law of large number and Theorems 4.1 and 4.2, we can obtain an approximate solution of the inverse problem by solving the problem ()(\mathcal{I}) for sufficiently large NN and nn.

Example 4.3.

Consider the case where the state is described by the one-dimensional equation

dX(t)=u(t)dt+110dW(t)dX(t)=u(t)dt+\frac{1}{10}dW(t)

with initial condition X0X_{0} having the standard normal distribution, and the control objective J[u]J[u] is given by

J[u]=𝔼01(10|X(t)|2+θ|u(t)|2)𝑑t,J[u]=\mathbb{E}\int_{0}^{1}\left(10|X(t)|^{2}+\theta|u(t)|^{2}\right)dt,

where U=U=\mathbb{R}. This problem is of course a particular case of the linear quadratic regulator ones. The value function VV is explicitly given by V(t,x;θ)=F(t;θ)x2+G(t;θ)V(t,x;\theta)=F(t;\theta)x^{2}+G(t;\theta), where F(t;θ)=10θtanh((1t)10/θ)F(t;\theta)=\sqrt{10\theta}\tanh((1-t)\sqrt{10/\theta}) and G(t;θ)=(θ/100)log(cosh((1t)10/θ))G(t;\theta)=(\theta/100)\log(\cosh((1-t)\sqrt{10/\theta})). The unique optimal control u(t)u^{*}(t) is given by

u(t)=10θtanh(10θ(1t))X(t).u^{*}(t)=-\sqrt{\frac{10}{\theta}}\tanh\left(\sqrt{\frac{10}{\theta}}(1-t)\right)X^{*}(t).

To test our approach, we independently generate the samples (X(j)(ti),u(j)(ti))(X^{(j)}(t_{i}),u^{(j)}(t_{i})), j=1,,10000j=1,\ldots,10000, i=0,,1000i=0,\ldots,1000, of the optimal state and control for θ=1\theta=1, where ti=i/1000t_{i}=i/1000, and consider these samples as observed data. We solve the root-finding problem by minimizing

110000|j=110000i=0999(10|X(j)(ti)|2+θ|u(j)(ti)|2)ΔtV(0,X0(j);θ)|\frac{1}{10000}\left|\sum_{j=1}^{10000}\sum_{i=0}^{999}\left(10|X^{(j)}(t_{i})|^{2}+\theta|u^{(j)}(t_{i})|^{2}\right)\Delta t-V(0,X_{0}^{(j)};\theta)\right|

over θ[0.0001,10]\theta\in[0.0001,10]. The estimated θ\theta over 100100 trials has the mean 0.99710.9971 and the standard deviation 4.4181×1044.4181\times 10^{-4}.

In most of nonlinear problems, analytical solutions of HJB equations are rarely available. So we need to numerically solve (4.4) to approximate the value functions. Existing numerical methods applicable to (4.4) include the finite difference methods (see, e.g., Kushner and Dupuis [21] and Bonnans and Zidani [6]), the finite-element like methods (see, e.g., Camilli and Falcone [7] and Debrabant and Jakobsen [9]), the kernel-based collocation methods (see, e.g., Kansa [19, 20], Huang et.al [17], and Nakano [23]), and the regression-based methods (see, e.g., Fahim et al. [14], E et al. [13], Sirignano and Spiliopoulos [26], and the references therein). It is well-known that the use of the finite difference methods is limited to low dimensional problems since the number of the spatial grids points has an exponential growth in dimension. In the kernel-based collocation methods, we seek an approximate solution of the form of a linear combination of a radial basis function (e.g., multiquadrics in the Kansa’s original work) by solving finite dimensional linear equations. In general, this procedure allows for a simpler numerical implementation compared to the finite-element like methods and the regression-based methods. The regression-based methods, in particular the ones with neural networks, are prominent in high dimensional problems although they are computationally expensive.

In Example 4.4 below, we will deal with a population control problem having a 3-dimensional state space. We adopt the kernel-based collocation methods to compute the value function for that problem.

The kernel-based collocation methods are obtained by the interpolations with positive definite kernels applied backward recursively in time. Given a points set Γ={x(1),,x(N)}d\Gamma=\{x^{(1)},\ldots,x^{(N)}\}\subset\mathbb{R}^{d} such that x(j)x^{(j)}’s are pairwise distinct, and a positive definite function Φ:d\Phi:\mathbb{R}^{d}\to\mathbb{R}, the function

j=1N(A1ψ|Γ)jΦ(xx(j)),xd,\sum_{j=1}^{N}(A^{-1}\psi|_{\Gamma})_{j}\Phi(x-x^{(j)}),\quad x\in\mathbb{R}^{d},

interpolates ψ\psi on Γ\Gamma. Here, A={Φ(x(j)x())}j,=1,,NA=\{\Phi(x^{(j)}-x^{(\ell)})\}_{j,\ell=1,\ldots,N}, ψ|Γ\psi|_{\Gamma} is the column vector composed of ψ(xj)\psi(x_{j}), j=1,,Nj=1,\ldots,N, and (z)j(z)_{j} denotes the jj-th component of zNz\in\mathbb{R}^{N}. Thus, with time grid {t0,,tn}\{t_{0},\ldots,t_{n}\} such that 0=t0<<tn=T0=t_{0}<\cdots<t_{n}=T, the function V~(tn,)\tilde{V}(t_{n},\cdot) defined by

V~(tn,x)=j=1N(A1V~n)jΦ(xx(j)),xd\tilde{V}(t_{n},x)=\sum_{j=1}^{N}(A^{-1}\tilde{V}_{n})_{j}\Phi(x-x^{(j)}),\quad x\in\mathbb{R}^{d}

approximates gg, where V~n=g|Γ\tilde{V}_{n}=g|_{\Gamma}. Then, for any k=0,1,,n1k=0,1,\ldots,n-1, with a vector V~k+1N\tilde{V}_{k+1}\in\mathbb{R}^{N} of an approximate solution at {tk+1}×Γ\{t_{k+1}\}\times\Gamma, we set

V~(tk+1,x)\displaystyle\tilde{V}(t_{k+1},x) =j=1N(A1V~k+1)jΦ(xx(j)),xd,\displaystyle=\sum_{j=1}^{N}(A^{-1}\tilde{V}_{k+1})_{j}\Phi(x-x^{(j)}),\quad x\in\mathbb{R}^{d},
V~k\displaystyle\tilde{V}_{k} =V~k+1+(tk+1tk)Hk+1(vk+1h),\displaystyle=\tilde{V}_{k+1}+(t_{k+1}-t_{k})H_{k+1}(v_{k+1}^{h}),

where Hk+1(V~k+1)=(Hk+1,1(V~k+1),,Hk+1,N(V~k+1))NH_{k+1}(\tilde{V}_{k+1})=(H_{k+1,1}(\tilde{V}_{k+1}),\ldots,H_{k+1,N}(\tilde{V}_{k+1}))\in\mathbb{R}^{N} with

Hk+1,j=H(tk+1,x(j),DV~(tk+1,x(j)),D2V~(tk+1,x(j))).H_{k+1,j}=H(t_{k+1},x^{(j)},D\tilde{V}(t_{k+1},x^{(j)}),D^{2}\tilde{V}(t_{k+1},x^{(j)})).

It should be mentioned that in view of the accuracy of interpolation, it is necessary to take Γ\Gamma densely. In particular, as the spatial dimension increases, NN must also become larger. This creates difficulties in numerically solving the linear equations Aξ=ψ|ΓA\xi=\psi|_{\Gamma} or in computing the inverse matrix A1A^{-1}. On the other hand, basically there are no restrictions on the structure of Γ\Gamma in implementing the kernel-based collocation methods. As a result, they are useful in multi but relatively low dimensional problems. It should also be noted that the method above can be described in a matrix form. See Nakano [22] for details.

Example 4.4.

Consider the following simple SIR (Susceptible-Infectious-Recovery) epidemic model with vaccination studied in Rachah and Torres [25]:

dS(t)dt\displaystyle\frac{dS(t)}{dt} =βS(t)I(t)u(t)S(t),\displaystyle=-\beta S(t)I(t)-u(t)S(t),
dI(t)dt\displaystyle\frac{dI(t)}{dt} =βS(t)I(t)μI(t),\displaystyle=\beta S(t)I(t)-\mu I(t),
dR(t)dt\displaystyle\frac{dR(t)}{dt} =μI(t)+u(t)S(t),\displaystyle=\mu I(t)+u(t)S(t),

with initial conditions S(0)=0.95S(0)=0.95, I(0)=0.05I(0)=0.05, R(0)=0R(0)=0, where β=0.2\beta=0.2 and μ=0.1\mu=0.1. The control objective is given by

J[u]=010(10I(t)+θu2(t))𝑑tJ[u]=\int_{0}^{10}(10I(t)+\theta u^{2}(t))dt

with U=[0,1]U=[0,1]. We assume that θ=1\theta=1 is a true parameter and solve the corresponding control problem by the kernel-based collocation method to obtain an approximate value function V~\tilde{V}, nearly optimal trajectories S(ti)S^{*}(t_{i}), I(ti)I^{*}(t_{i}), R(ti)R^{*}(t_{i}), and a nearly optimal control u(ti)u^{*}(t_{i}), where ti=i/1000t_{i}=i/1000, i=0,1,,10000i=0,1,\ldots,10000. Specifically, the kernel-based method is performed with the following ingredients: the Gaussian kernel with α=1\alpha=1 and the collocation points Γ\Gamma consisting of 838^{3} uniform grids points in [S(0)0.5,S(0)+0.5]×[I(0)0.5,I(0)+0.5]×[R(0)0.5,R(0)+0.5][S(0)-0.5,S(0)+0.5]\times[I(0)-0.5,I(0)+0.5]\times[R(0)-0.5,R(0)+0.5] including the boundary. With this method, we generate 100100 independent copies X(j)(ti)X^{(j)}(t_{i}), j=1,,100j=1,\ldots,100, of (S(ti),I(ti),R(ti))+0.01×Zi(S^{*}(t_{i}),I^{*}(t_{i}),R^{*}(t_{i}))+0.01\times Z_{i}, i=0,1,,ni=0,1,\ldots,n, where Z=(Z0,,Zn)Z=(Z_{0},\ldots,Z_{n}) is an (n+1)×3(n+1)\times 3-dimensional Gaussian vector with all components being independent of each other and having zero mean and unit variance. Then we consider these samples with noises as observed data. As in Example 4.3, we solve the root-finding problem ()(\mathcal{I}) by minimizing

1100|j=1100i=09999(10|X(j)(ti)|2+θ|u(j)(ti)|2)ΔtV~(0,X0(j);θ)|\frac{1}{100}\left|\sum_{j=1}^{100}\sum_{i=0}^{9999}\left(10|X^{(j)}(t_{i})|^{2}+\theta|u^{(j)}(t_{i})|^{2}\right)\Delta t-\tilde{V}(0,X_{0}^{(j)};\theta)\right|

subject to θ[0.0001,10]\theta\in[0.0001,10]. The estimated θ\theta over 100100 trials has the mean 0.94960.9496 and the standard deviation 0.05580.0558.

5 Conclusion

We study the inverse problem of the stochastic control of general diffusions with performance index

𝔼[g(X(T))+0T(f(t,X(t))+θ|u(t)|2)𝑑t].\mathbb{E}\left[g(X(T))+\int_{0}^{T}\left(f(t,X(t))+\theta|u(t)|^{2}\right)dt\right].

Precisely, we formulate the inverse problem as the one of determining the weight parameter θ>0\theta>0 given the dynamics, gg, ff, and an optimal control process. Under mild conditions on the drift, the volatility, gg, and ff, and under the assumption that the optimal control belongs to the interior of the control set, we show that our inverse problem is well-posed. It should be noted that whether the latter assumption holds true or not is easily confirmed in practice. Then, with the well-posedness, we reduce the inverse problem to some root finding problem of the expectation of a random variable involved with the value function, which has a unique solution. Based on this result, we propose a numerical method for our inverse problem by replacing the expectation above with arithmetic mean of observed optimal control processes and the corresponding state processes. Several numerical experiments show that the numerical method recovers the unknown θ\theta with high accuracy. In particular, with the help of the kernel-based collocation method for Hamilton-Jacobi-Bellman equations, our method for the inverse problems still works well even when an explicit form of the value function is unavailable.

Possible future studies include the well-posedness when the function gg or ff is also unknown, and numerical methods under non-uniqueness of the inverse problems. In the latter issue, we may need to use an additional criterion to determine unknowns from multiple candidates such as the maximum entropy principle adopted in [29].

Acknowledgements

This study is supported by JSPS KAKENHI Grant Number JP17K05359.

References

  • [1] N. Ab Azar, A. Shahmansoorian, and M. Davoudi. From inverse optimal control to inverse reinforcement learning: A historical review. Annu. Rev. Control, 50:119–138, 2020.
  • [2] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. ICML, 2004.
  • [3] B. D. O. Anderson and J. B. Moore. Optimal control: linear quadratic methods. Prentice-Hall, Inc., Upper Saddle River, 1989.
  • [4] R. Bellman and R. Kalaba. An inverse problem in dynamic programming and automatic control. J. Math. Anal. Appl., 7:322–325, 1963.
  • [5] A. Bensoussan. Stochastic control of partially observable systems. Cambridge University Press, Cambridge, 1992.
  • [6] F. Bonnans and H. Zidani. Consistency of generalized finite difference schemes for the stochastic HJB equation. SIAM J. Numer. Anal., 41:1008–1021, 2003.
  • [7] F. Camilli and M. Falcone. An approximation scheme for the optimal control of diffusion processes. Math. Model. Numer. Anal., 29:97–122, 1995.
  • [8] J. Casti. On the general inverse problem of optimal control theory. J. Optim. Theory Appl., 32:491–497, 1980.
  • [9] K. Debrabant and E. R. Jakobsen. Semi-Lagrangian schemes for linear and fully non-linear diffusion equations. Math. Comp., 82:1433–1462, 2013.
  • [10] H. Deng and M. Krstić. Stochastic nonlinear stabilization―II: inverse optimality. Systems Control lett., 32:151–159, 1997.
  • [11] K. D. Do. Inverse optimal control of stochastic systems driven by Lévy processes. Automatica, 107:539–550, 2019.
  • [12] K. Dvijotham and E. Todorov. Inverse optimal control with linearly-solvable MDPs. In Proc. ICML, 2010.
  • [13] W. E, J. Han, and A. Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Commun. Math. Stat., 5:349–380, 2017.
  • [14] A. Fahim, N. Touzi, and X. Warin. A probabilistic numerical method for fully nonlinear parabolic PDEs. Ann. Appl. Probab., 21:1322–1364, 2011.
  • [15] W. H. Fleming and H. M. Soner. Controlled Markov processes and viscosity solutions. Springer-Verlag, New York, 2nd edition, 2006.
  • [16] J. Hadamard. Sur les problèmes aux dérivées partielles et leur signification physique. Princeton University Bulletin, 13:49–52, 1902.
  • [17] C.-S. Huang, S. Wang, C. S. Chen, and Z.-C. Li. A radial basis collocation method for Hamilton-Jacobi-Bellman equations. Automatica, 42:2201–2207, 2006.
  • [18] R. E. Kalman. When is a linear control system optimal? Trans. ASME Ser. D: J. Basic Eng., 86:51–60, 1964.
  • [19] E. J. Kansa. Multiquadrics―a scattered data approximation scheme with application to computational fluid-dynamics―I. Computers Math.  Applic., 19:127–145, 1990.
  • [20] E. J. Kansa. Multiquadrics―a scattered data approximation scheme with application to computational fluid-dynamics―II. Computers Math.  Applic., 19:147–161, 1990.
  • [21] H. J. Kushner and P. Dupuis. Numerical methods for stochastic control problems in continuous time. Springer-Verlag, New York, 2001.
  • [22] Y. Nakano. Convergent kernel-based methods for parabolic equations. arXiv:1803.09446[Math.NA].
  • [23] Y. Nakano. Convergence of meshfree collocation methods for fully nonlinear parabolic equations. Numer. Math., 136:703–723, 2017.
  • [24] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In Proc. ICML, 2000.
  • [25] A. Rachah and D. F. Torres. Mathematical modelling, simulation, and optimal control of the 2014 ebola outbreak in west africa. Discrete Dyn. Nat. Soc., 2015, 2015.
  • [26] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. J. Comput. Phys., 375:1339–1364, 2018.
  • [27] F. Thau. On the inverse optimum control problem for a class of nonlinear autonomous systems. IEEE Trans. Automat. Control, 12:674–681, 1967.
  • [28] J. Yong and X. Y. Zhou. Stochastic controls: Hamiltonian systems and HJB equations. Springer-Verlag, New York, 1999.
  • [29] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. D. Dey. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438, 2008.