Inverse stochastic optimal controls

Yumiharu Nakano
Department of Mathematical and Computing Science, School of Computing
Tokyo Institute of Technology
W8-28, 2-12-1, Ookayama, Meguro-ku, Tokyo 152-8550, Japan
e-mail: nakano@c.titech.ac.jp

Abstract

We study an inverse problem of the stochastic optimal control of general diffusions with performance index having the quadratic penalty term of the control process. Under mild conditions on the system dynamics, the cost functions, and the optimal control process, we show that our inverse problem is well-posed using a stochastic maximum principle. Then, with the well-posedness, we reduce the inverse problem to some root finding problem of the expectation of a random variable involved with the value function, which has a unique solution. Based on this result, we propose a numerical method for our inverse problem by replacing the expectation above with arithmetic mean of observed optimal control processes and the corresponding state processes. The recent progress of numerical analysis of Hamilton-Jacobi-Bellman equations enables the proposed method to be implementable for multi-dimensional cases. In particular, with the help of the kernel-based collocation method for Hamilton-Jacobi-Bellman equations, our method for the inverse problems still works well even when an explicit form of the value function is unavailable. Several numerical experiments show that the numerical method recovers the unknown penalty parameter with high accuracy.

Key words: Inverse problems, stochastic control, stochastic maximum principle, Hamilton-Jacobi-Bellman equations, Kernel-based collocation methods.

1 Introduction

The inverse optimal controls refers to the problem of determining the performance index that makes a given control law optimal. Kalman [18] first analyses the case of linear quadratic models in infinite horizon in details. Since then, a vast amount of research has been done on this subject. To name a few, Bellman and Kalaba [4] and Casti [8] adopt the dynamic programming approach to derive some equations for cost functions using the feedback function of the optimal control for finite horizon problems. Thau [27] and Anderson and Moore [3] extend Kalman’s results to the case of a nonlinear cost structure and of a multi input, respectively. In the literature of reinforcement learning, Ng and Russell [24] studies an inverse optimal control in the framework of Markov decision processes. See, e.g., Abbeel and Ng [2] and Ziebart et.al [29] for subsequent studies. In continuous-time stochastic optimal control, Dvijotham and Todorov [12] works under a linearly solvable class of optimal control, where explicit forms of the optimal feedback laws are available, and Deng and Krstić [10] and Do [11] adopt Lyapunov function approaches in infinite horizon problems. See the recent review paper Ab Azar et.al [1] for a comprehensive account.

In this paper, we are concerned with an inverse problem of optimal stochastic controls of diffusion processes that are more general than those in [12] and [10]. Our control problem is described as the minimization problem of

\mathbb{E}\left[g(X(T))+\int_{0}^{T}\left(f(t,X(t))+\theta|u(t)|^{2}\right)dt\right]

over an admissible class of control processes $u$ , where $\{X(t)\}_{0\leq t\leq T}$ denotes the controlled diffusion process corresponding to $u$ . The performance of the state is measured by the functions $g$ and $f$ , and $\theta>0$ is a parameter penalizing the magnitude of the control. See Section 2 below for a precise formulation. It should be emphasized that the performance criterion of this form appears in many applications, where it is trivial to determine the functions $g$ and $f$ , but difficult to choose an appropriate value of $\theta$ . For example, in trajectory control problems, we seek a state process that is as close as possible to a given curve, with a practical input process. Thus, in this case, $g$ and $f$ are given by distances between the state and the desired curve. Moreover, the penalty term $\theta|u(t)|^{2}$ for the control is naturally incorporated, which affects the overall performance depending on a choice of $\theta$ . Motivated by this, we consider the problem of determining $\theta$ from given optimal controls for our minimization problem with unknown penalty parameter.

Our first aim is, for the inverse problem above, to study the well-posedness defined by Hadamard [16]. It is widely accepted that a modern interpretation of Hadamard’s principle is: the solution of a given problem exists, unique, and depends continuously on data. In our framework, the existence is interpreted as the surjectivity of the mapping from the set of unknown penalty parameters to the set of optimal controls, which is outside the scope of this paper since observed controls are assumed to be optimal for the problem above for some $\theta>0$ . The uniqueness is nothing but the bijectivity of the mapping. The continuous dependence on data, referred to as the stability in this paper, is the continuity of its inverse in some sense. See the statement of Theorem 3.5 below for a precise meaning of our stability. As one might predict, the uniqueness and the stability do not hold in general in our continuous time optimal control problems with finite horizons. We will give such an example. On the other hand, under additional mild conditions on the drift, the diffusion, and the cost functions, and the optimal control process, we will show that the uniqueness and the stability do hold using the stochastic maximum principle. It should be emphasized that this is a first attempt to reveal sufficient conditions for which the inverse optimal control problems are well-posed for stochastic controls with finite horizons.

Our second aim is to propose a numerical method for the inverse problem above. The problem is, given $N$ independent samples $\{(u^{(j)}(t_{i}),X^{(j)}(t_{i}))\}_{i=0}^{n}$ , $j=1,\ldots,N$ , of the pair $\{(u^{*}(t_{i}),X^{*}(t_{i}))\}_{i=0}^{n}$ of the optimal control and the corresponding state processes, to determine $\theta$ computationally, where $0=t_{0}<t_{1}<\ldots<t_{n}=T$ . Our approach is based on the simple optimality condition and is a reduction to some root-finding problem. This of course involves the value function computations, and so we rely on the recent progress of numerical analysis of Hamilton-Jacobi-Bellman equations. Thanks to the uniqueness result of the inverse problem, in our numerical experiments performed in Section 3, the positive root $\theta$ ’s are recovered with high accuracy.

The present paper is organized as follows: we introduce notation and formulate our inverse problem in Section 2. Section 3 is devoted to the issue of the well-posedness. In Section 4, we propose a numerical method for our inverse problems and validate it. Section 5 concludes.

2 Problem formulation

First, we collect some notation used in this paper. Denote by $\mathrm{int}(U)$ the interior of $U$ . Denote also by $D_{\xi}\varphi$ and $D_{\xi}^{2}\varphi$ the gradient vector and the Hessian matrix of $\varphi$ with respect to a variable $\xi$ , respectively. We write $\mathbb{S}^{d}$ for the totality of symmetric $d\times d$ real matrices, and $a^{\mathsf{T}}$ for the transpose of a vector or matrix $a$ . Let $T>0$ and $m\in\mathbb{N}$ . Let $\{W(t)\}_{0\leq t\leq T}$ be an $m$ -dimensional standard Brownian motion on a complete probability space $(\Omega,\mathcal{F},\mathbb{P})$ . Further, let $X_{0}$ be a random variable that is independent of $\{W(t)\}_{0\leq t\leq T}$ . Denote by $\{\mathcal{F}(t)\}_{0\leq t\leq T}$ the augmented filtration generated by $\{W(t)\}_{0\leq t\leq T}$ and $X_{0}$ . The control processes are assumed to take values in a closed set $U$ of $\mathbb{R}^{k}$ . The class $\mathcal{U}$ of controls is then defined by the set of all $U$ -valued $\{\mathcal{F}(t)\}$ -adapted processes $\{u(t)\}_{0\leq t\leq T}$ satisfying

\mathbb{E}\int_{0}^{T}|u(t)|^{2}dt<\infty.

For any given $u\in\mathcal{U}$ , consider the $d$ -dimensional controlled stochastic differential equation

(2.1)

dX(t)=b(t,X(t),u(t))dt+\sigma(t,X(t),u(t))dW(t)

with initial condition $X_{0}$ , where the functions $b$ and $\sigma$ are assumed to be measurable such that the existence and uniqueness of (2.1) is guaranteed. That is, we assume that for any $u\in\mathcal{U}$ there exists a unique continuous $\{\mathcal{F}(t)\}$ -adapted process $\{X(t)\}_{0\leq t\leq T}$ satisfying

X(t)=X_{0}+\int_{0}^{t}b(s,X(s),u(s))ds+\int_{0}^{t}\sigma(s,X(s),u(s))dW(s),\quad 0\leq t\leq T,

almost surely. The functions $g:\mathbb{R}^{d}\to\mathbb{R}$ and $f:[0,T]\times\mathbb{R}^{d}\to\mathbb{R}$ , the cost functions introduced in Section 1, are both assumed to be measurable such that

\mathbb{E}\left[g(X(T))+\int_{0}^{T}f(t,X(t))dt\right]>-\infty

for any $u\in\mathcal{U}$ . Then, we are concerned with the following optimal control problem:

Problem $(\mathcal{C}_{\theta})$ .

Minimize

J[u]:=\mathbb{E}\left[g(X(T))+\int_{0}^{T}\left\{f(t,X(t))+\theta|u(t)|^{2}\right\}dt\right]

over $u\in\mathcal{U}$ .

Here we have assumed that the running cost is decomposed into that for the state and that for the penalty of the control input with weight parameter $\theta>0$ . Then, we consider the inverse problem with respect to $\theta$ given an optimal control and the corresponding state trajectory, i.e., our problem is to recover $\theta$ from a solution $\{u^{*}(t)\}\in\mathcal{U}$ for $(\mathcal{C}_{\theta})$ and the corresponding state process $\{X^{*}(t)\}$ under the assumption that $b$ , $\sigma$ , $g$ and $f$ are known.

3 Well-posedness

We shall discuss Hadamard’s well-posedness of the inverse problem of the stochastic optimal control. As stated in Section 1, we skip the existence issue. Then, as in many other inverse problems, the uniqueness and the stability do not hold in general for continuous-time optimal controls.

Example 3.1.

Consider the state equation

\frac{dX(t)}{dt}=u^{2}(t)X(t),

with nonrandom initial condition $X_{0}$ . The control processes $\{u(t)\}$ are taken from the class of $\mathbb{R}$ -valued Borel functions on $[0,T]$ . The objective function $J[u]$ is assumed to be given by

J[u]=X(T)+\theta\int_{0}^{T}u^{2}(t)dt,

where $\theta>0$ . Since $X(T)=X_{0}e^{\int_{0}^{T}u^{2}(t)dt}$ , we obtain $\inf_{u}J[u]=\min_{\gamma\geq 0}\{X_{0}e^{\gamma}+\theta\gamma\}$ . Thus, if $X_{0}\geq 0$ then $\gamma\equiv 0$ , i.e., $u\equiv 0$ , is optimal for any problem $(\mathcal{C}_{\theta})$ with $\theta>0$ .

To discuss the uniqueness, we impose the following:

Assumption 3.2.

(i)

The random variable $X_{0}$ is a constant.
(ii)

The set $U$ is compact and $\mathrm{int}(U)$ is nonempty.
(iii)

The functions $b$ , $\sigma$ , $f$ , and $g$ are of $C^{2}$ -class in $x$ .

(iv)

There exists a constant $C_{0}>0$ and a modulus of continuity $\rho:[0,\infty)\to[0,\infty)$ such that for $\varphi(t,x,u)=b(t,x,u)$ , $\sigma(t,x,u)$ , $f(t,x)$ , $g(x)$ , we have

	$\displaystyle\|\varphi(t,x,u)-\varphi(t,x^{\prime},u^{\prime})\|$	$\displaystyle\leq C_{0}\|x-x^{\prime}\|+\rho(\|u-u^{\prime}\|),$
	$\displaystyle\|\varphi(t,0,u)\|$	$\displaystyle\leq C_{0},$
	$\displaystyle\|D_{x}\varphi(t,x,u)-D_{x}\varphi(t,x^{\prime},u^{\prime})\|$	$\displaystyle\leq C_{0}\|x-x^{\prime}\|+\rho(\|u-u^{\prime}\|),$
	$\displaystyle\|D_{x}^{2}\varphi(t,x,u)-D_{x}^{2}\varphi(t,x^{\prime},u^{\prime})\|$	$\displaystyle\leq\rho(\|x-x^{\prime}\|+\|u-u^{\prime}\|),$

for $t\in[0,T]$ , $x,x^{\prime}\in\mathbb{R}^{d}$ , and $u,u^{\prime}\in U$ .

It should be noted that Assumption 3.2 (i) means that the filtration $\{\mathcal{F}(t)\}_{0\leq t\leq T}$ is generated by the Brownian motion only. Assumption 3.2 (ii) excludes the case where $U$ is countable. It should be noted that whether this assumption holds true or not is easily confirmed in practice. The first two conditions in Assumption 3.2 (iv) are standard ones for stochastic optimal control problems. Under these two requirements, there exists a unique solution $\{X(t)\}_{0\leq t\leq T}$ of (2.1) for each $u\in\mathcal{U}$ . See Fleming and Soner [15].

Under Assumption 3.2, we can apply a stochastic maximum principle, as described in Yong and Zhou [28], to obtain the uniqueness result for our inverse problem in the following sense:

Theorem 3.3.

Let Assumption 3.2 hold. Suppose that $\{u^{*}(t)\}_{0\leq t\leq T}\in\mathcal{U}$ is optimal both for the problem $(\mathcal{C}_{\theta_{1}})$ and $(\mathcal{C}_{\theta_{2}})$ for some $\theta_{1},\theta_{2}>0$ . Moreover, suppose that $\mathbb{P}(u^{*}(t_{0})\in\mathrm{int}(U)\setminus\{0\})>0$ for some $t_{0}\in[0,T]$ . Then $\theta_{1}=\theta_{2}$ .

Proof.

Step (i). Since $(X^{*}(t),u^{*}(t))$ is optimal both for $(\mathcal{C}_{\theta_{1}})$ and $(\mathcal{C}_{\theta_{2}})$ , for each $\theta=\theta_{1}$ , $\theta_{2}$ , there exists a unique solution $(p(t),q(t))$ , $0\leq t\leq T$ , of the backward stochastic differential equation (BSDE)

(3.1)	$\displaystyle dp(t)$	$\displaystyle=-\bigg{\{}D_{x}b(t,X^{}(t),u^{}(t))^{\mathsf{T}}p(t)+\sum_{j=1}^{m}D_{x}\sigma_{j}(t,X^{}(t),u^{}(t))^{\mathsf{T}}q_{j}(t)$
		$\displaystyle\hskip 25.00003pt-D_{x}f(t,X^{*}(t))\bigg{\}}dt+q(t)dW(t),$
	$\displaystyle p(T)$	$\displaystyle=-D_{x}g(X^{*}(T)),$

as well as there exists a unique solution $(P(t),Q(t))$ , $0\leq t\leq T$ , of the BSDE

(3.2)	$\displaystyle dP(t)$	$\displaystyle=-\bigg{\{}D_{x}b(t,X^{}(t),u^{}(t))^{\mathsf{T}}P(t)+P(t)D_{x}b(t,X^{}(t),u^{}(t))^{\mathsf{T}}$
		$\displaystyle\hskip 25.00003pt+\sum_{j=1}^{m}D_{x}\sigma_{j}(t,X^{}(t),u^{}(t))^{\mathsf{T}}P(t)D_{x}\sigma_{j}(t,X^{}(t),u^{}(t))$
		$\displaystyle\hskip 25.00003pt+\sum_{j=1}^{m}\Big{\{}D_{x}\sigma_{j}(t,X^{}(t),u^{}(t))^{\mathsf{T}}Q_{j}(t)+Q_{j}(t)D_{x}\sigma_{j}(t,X^{}(t),u^{}(t))\Big{\}}$
		$\displaystyle\hskip 25.00003pt+D_{x}^{2}H_{0}(t,X^{}(t),u^{}(t),p(t),q(t))\bigg{\}}dt+\sum_{j=1}^{m}Q_{j}(t)dW(t),$
	$\displaystyle P(T)$	$\displaystyle=-D_{x}^{2}g(X^{*}(T)),$

where $q(t)=(q_{1}(t),\ldots,q_{m}(t))$ , $Q(t)=(Q_{1}(t),\ldots,Q_{m}(t))$ , $\sigma_{j}(t,x,u)\in\mathbb{R}^{d}$ for each $j=1,\ldots,m$ such that $\sigma(t,x,u)=(\sigma_{1}(t,x,u),\ldots,\sigma_{m}(t,x,u))$ , and

H_{0}(t,x,u,p,q):=p^{\mathsf{T}}b(t,x,u)+\mathrm{tr}(q^{\mathsf{T}}\sigma(t,x,u))-f(t,x)-\theta|u|^{2}.

In particular, $p(t)$ , $q_{1}(t),\ldots,q_{m}(t)$ are $\mathbb{R}^{d}$ -valued and $Q_{1}(t),\ldots,Q_{m}(t)$ are $\mathbb{S}^{d}$ -valued, all of which are adapted processes satisfying

\mathbb{E}\int_{0}^{T}|\varphi(t)|^{2}dt<\infty

for $\varphi=p,q_{1},\ldots,q_{m},Q_{1},\ldots,Q_{m}$ . Moreover, with the generalized Hamiltonian

	$\displaystyle\mathcal{H}(t,x,u)$	$\displaystyle:=H_{0}(t,x,u,p(t),q(t))-\frac{1}{2}\mathrm{tr}\left[\sigma(t,X^{}(t),u^{}(t))^{\mathsf{T}}P(t)\sigma(t,X^{}(t),u^{}(t))\right]$
		$\displaystyle\quad+\frac{1}{2}\mathrm{tr}\left\{[\sigma(t,x,u)-\sigma(t,X^{}(t),u^{}(t))]^{\mathsf{T}}P(t)[\sigma(t,x,u)-\sigma(t,X^{}(t),u^{}(t))]\right\}$

we have

(3.3)

\mathcal{H}(t,X^{*}(t),u^{*}(t))=\max_{u\in U}\mathcal{H}(t,X^{*}(t),u).

See Chapter 3 in [28].

Step (ii). We write $(p_{i}(t),q_{i}(t),P_{i}(t),Q_{i}(t))$ for the corresponding $(p(t),q(t),P(t),Q(t))$ for $\theta=\theta_{i}$ where $i=1,2$ . Similarly, we write $\mathcal{H}_{i}$ for the corresponding $\mathcal{H}$ for $\theta=\theta_{i}$ where $i=1,2$ . By the optimality condition (3.3) and our assumptions, we obtain

D_{u}\mathcal{H}_{i}(t_{0},X^{*}(t_{0}),u^{*}(t_{0}))=0,\quad i=1,2,

with positive probability. So,

	$\displaystyle 0$	$\displaystyle=K(t_{0},X^{}(t_{0}),u^{}(t_{0}),p_{1}(t),P_{1}(t))-2\theta_{1}u^{*}(t_{0})$
		$\displaystyle=K(t_{0},X^{}(t_{0}),u^{}(t_{0}),p_{2}(t),P_{2}(t))-2\theta_{2}u^{*}(t_{0})$

where for $t\in[0,T]$ , $x,p\in\mathbb{R}^{d}$ , and $P\in\mathbb{R}^{d\times d}$ ,

	$\displaystyle K(t,x,u,p,P)$
	$\displaystyle=D_{u}\Big{[}\frac{1}{2}\mathrm{tr}\left(\sigma(t,x,u)^{\mathsf{T}}P\sigma(t,x,u)\right)+p^{\mathsf{T}}b(t,x,u)$
	$\displaystyle\quad+\frac{1}{2}\mathrm{tr}\left\{[\sigma(t,x,u)-\sigma(t,X^{}(t),u^{}(t))]^{\mathsf{T}}P(t)[\sigma(t,x,u)-\sigma(t,X^{}(t),u^{}(t))]\right\}\Big{]}.$

Since the uniqueness of the BSDEs immediately leads to $p_{1}=p_{2}$ and $P_{1}=P_{2}$ , the equalities above yield $\theta_{1}=\theta_{2}$ , as claimed. ∎

To study the stability, we impose additional conditions.

Assumption 3.4.

The functions $b$ and $\sigma$ are of $C^{1}$ -class in $u$ . Further, for $\varphi(t,x,u)=b(t,x,u)$ , $\sigma(t,x,u)$ , $f(t,x)$ , $g(x)$ , we have

	$\displaystyle\|\varphi(t,x,u)\|+\|D_{u}\varphi(t,x,u)\|$	$\displaystyle\leq C_{0},$
	$\displaystyle\|D_{u}\varphi(t,x,u)-D_{u}\varphi(t,x^{\prime},u^{\prime})\|$	$\displaystyle\leq\rho(\|x-x^{\prime}\|+\|u-u^{\prime}\|),\quad x,x^{\prime}\in\mathbb{R}^{d},\;\;u,u^{\prime}\in U,$

where $C_{0}$ and $\rho$ are as in Assumption 3.2.

Then, again by the stochastic maximum principle, we have the stability result in the following sense:

Theorem 3.5.

Let Assumptions 3.2 and 3.4 hold. Let $\{u^{*}(t)\}_{0\leq t\leq T}\in\mathcal{U}$ be optimal for the problem $(\mathcal{C}_{\theta^{*}})$ for some $\theta^{*}>0$ , and for each $n\in\mathbb{N}$ let $\{u_{n}(t)\}_{0\leq t\leq T}\in\mathcal{U}$ be optimal for $(\mathcal{C}_{\theta_{n}})$ for some $\theta_{n}>0$ . Suppose that

\mathbb{E}\int_{0}^{T}|u_{n}(t)-u^{*}(t)|^{2}dt\to 0,\quad n\to\infty,

and that there exist $n_{0}\in\mathbb{N}$ and a measurable set $E\subset[0,T]\times\Omega$ with positive measure satisfying

u_{n}(t,\omega),\;u^{*}(t,\omega)\in\mathrm{int}(U)\setminus\{0\},\quad(t,\omega)\in E,

for $n\geq n_{0}$ . Then we have $\lim_{n\to\infty}\theta_{n}=\theta^{*}$ .

Proof.

By $C$ we denote a generic positive constant that may vary from line to line. For any $n\geq n_{0}$ , by (3.3),

(3.4)

2\theta_{n}u_{n}(t,\omega)-2\theta^{*}u^{*}(t,\omega)=L_{n}(t,\omega)-L^{*}(t,\omega),\quad(t,\omega)\in E,

where $L_{n}(t,\omega)=K(t,X_{n}(t,\omega),u_{n}(t,\omega),p_{n}(t,\omega),P_{n}(t,\omega))$ with $X_{n}$ being the state process corresponding to $u_{n}$ , and $(p_{n},q_{n})$ and $(P_{n},Q_{n})$ the solutions of the BSDE (3.1) and (3.2) with $u=u_{n}$ and $X=X_{n}$ , respectively. $L^{*}$ is similarly defined. The usual arguments with Gronwall’s lemma yield

(3.5)

\sup_{0\leq t\leq T}\mathbb{E}|X_{n}(t)-X^{*}(t)|^{2}\leq C\mathbb{E}\int_{0}^{T}\rho(|u_{n}(t)-u^{*}(t)|)^{2}dt.

Since $\rho$ is a modulus of continuity for $D_{x}^{2}b(t,\cdot,u)$ uniformly in $t$ and $u$ , we may assume that $\rho$ is bounded. Hence, for any $\varepsilon>0$ there exists $\delta>0$ such that $\rho(r)\leq\varepsilon+\delta r$ , $r\geq 0$ . From this and the $L^{2}$ -convergence of $u_{n}$ we obtain

(3.6)

\sup_{0\leq t\leq T}\mathbb{E}|X_{n}(t)-X^{*}(t)|^{2}\to 0,\quad n\to\infty.

Furthermore, by the a priori estimates for the BSDE (see, e.g., [28, Theorem 3.3, Chapter 7]),

	$\displaystyle\mathbb{E}\int_{0}^{T}\|p_{n}(t)-p^{*}(t)\|^{2}dt$
	$\displaystyle\leq C\mathbb{E}\|D_{x}g(X_{n}(T))-D_{x}g(X^{*}(T))\|^{2}$
	$\displaystyle\quad+C\mathbb{E}\int_{0}^{T}\|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{}(t),u^{}(t))\|^{2}\|p^{*}(t)\|^{2}dt$
	$\displaystyle\quad+C\mathbb{E}\int_{0}^{T}\|D_{x}\sigma(t,X_{n}(t),u_{n}(t))-D_{x}\sigma(t,X^{}(t),u^{}(t))\|^{2}\|q^{*}(t)\|^{2}dt$
	$\displaystyle\quad+C\mathbb{E}\int_{0}^{T}\|D_{x}f(t,X_{n}(t))-D_{x}f(t,X^{*}(t))\|^{2}dt.$

By Assumption 3.2 (iv) and (3.5),

	$\displaystyle\mathbb{E}\|D_{x}g(X_{n}(T))-D_{x}g(X^{}(T))\|^{2}+\mathbb{E}\int_{0}^{T}\|D_{x}f(t,X_{n}(t))-D_{x}f(t,X^{}(t))\|^{2}dt$
	$\displaystyle\leq C\mathbb{E}\int_{0}^{T}\rho(\|u_{n}(t)-u^{*}(t)\|)^{2}dt\to 0,\quad n\to\infty.$

Using Assumption 3.2 (iv) again, for $\varepsilon>0$ , we observe

	$\displaystyle\mathbb{E}\int_{0}^{T}\|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{}(t),u^{}(t))\|^{2}\|p^{*}(t)\|^{2}dt$
	$\displaystyle\leq\mathbb{E}\int_{0}^{T}\|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{}(t),u^{}(t))\|^{2}\|p^{*}(t)\|^{2}$
	$\displaystyle\hskip 30.00005pt\times 1_{\{\|X_{n}(t)-X^{}(t)\|+\rho(\|u_{n}(t)-u^{}(t)\|)>\varepsilon\}}dt$
	$\displaystyle\quad+\mathbb{E}\int_{0}^{T}\|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{}(t),u^{}(t))\|^{2}\|p^{*}(t)\|^{2}$
	$\displaystyle\hskip 30.00005pt\times 1_{\{\|X_{n}(t)-X^{}(t)\|+\rho(\|u_{n}(t)-u^{}(t)\|)\leq\varepsilon\}}dt$
	$\displaystyle\leq C\mathbb{E}\int_{0}^{T}\|p^{}(t)\|^{2}1_{\{\|X_{n}(t)-X^{}(t)\|+\rho(\|u_{n}(t)-u^{}(t)\|)>\varepsilon\}}dt+C\varepsilon\mathbb{E}\int_{0}^{T}\|p^{}(t)\|^{2}dt.$

Thus, letting $n\to\infty$ and then $\varepsilon\to 0$ , we have

\lim_{n\to\infty}\mathbb{E}\int_{0}^{T}|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{*}(t),u^{*}(t))|^{2}|p^{*}(t)|^{2}dt=0.

Similarly,

\lim_{n\to\infty}\mathbb{E}\int_{0}^{T}|D_{x}\sigma(t,X_{n}(t),u_{n}(t))-D_{x}\sigma(t,X^{*}(t),u^{*}(t))|^{2}|q^{*}(t)|^{2}dt=0.

Therefore,

(3.7)

\lim_{n\to\infty}\mathbb{E}\int_{0}^{T}|p_{n}(t)-p^{*}(t)|^{2}=0.

By the same way, we obtain

(3.8)

\lim_{n\to\infty}\mathbb{E}\int_{0}^{T}|P_{n}(t)-P^{*}(t)|^{2}=0.

For notational simplicity we denote $D_{u}b_{n}(t)=D_{u}b(t,X_{n}(t),u_{n}(t))$ . Analogously we use the notation $D_{u}b^{*}(t)$ , $\sigma_{n}(t)$ , and $D_{u}\sigma_{n}(t)$ . With this notation, by Assumption 3.4 we see

	$\displaystyle\|L_{n}(t)-L^{*}(t)\|$
	$\displaystyle\leq\|D_{u}b_{n}(t)\|\|p_{n}(t)-p^{}(t)\|+\|p^{}(t)\|\|D_{u}b_{n}(t)-D_{u}b^{*}(t)\|$
	$\displaystyle\quad+C\|\sigma_{n}(t)\|\|P_{n}(t)\|\|D_{u}\sigma_{n}(t)-D_{u}\sigma^{}(t)\|+C\|D_{u}\sigma^{}(t)\|\|P_{n}(t)\|\|\sigma_{n}(t)-\sigma^{*}(t)\|$
	$\displaystyle\quad+\|D_{u}\sigma^{}(t)\|\|\sigma^{}(t)\|)\|P_{n}(t)-P^{*}(t)\|$
	$\displaystyle\leq C(1+\|X_{n}(t)\|+\|u_{n}(t)\|)(\|p_{n}(t)-p^{}(t)\|+\|P_{n}(t)-P^{}(t)\|)$
	$\displaystyle\quad+C_{\varepsilon}(1+\|p^{}(t)\|+\|p_{n}(t)\|)(\varepsilon+\|X_{n}(t)-X^{}(t)\|+\|u_{n}(t)-u^{*}(t)\|)$

for any $\varepsilon>0$ with constant $C_{\varepsilon}$ depending on $\varepsilon$ . Thus, by Cauchy-Schwartz inequality and (3.6)–(3.8),

\lim_{n\to\infty}\mathbb{E}\int_{0}^{T}|L_{n}(t)-L^{*}(t)|dt=0.

From this and (3.4) it follows that

2|\theta_{n}-\theta^{*}|\int_{E}|u_{n}|dt\times d\mathbb{P}\leq 2|\theta^{*}|\mathbb{E}\int_{0}^{T}|u_{n}(t)-u^{*}(t)|dt+\mathbb{E}\int_{0}^{T}|L_{n}(t)-L^{*}(t)|dt\to 0,\quad n\to\infty.

On the other hand,

\lim_{n\to\infty}\int_{E}|u_{n}|dt\times d\mathbb{P}=\int_{E}|u^{*}|dt\times d\mathbb{P}>0.

Therefore, $\theta_{n}\to\theta^{*}$ , as claimed. ∎

Next we consider the case of the linear quadratic regulator problems. We need special treatment since Assumptions 3.2 and 3.4 (iii) exclude the case where the state processes are affine in controls taking values in unbounded sets.

Assumption 3.6.

(i)

The random variable $X_{0}$ satisfies $\mathbb{E}|X_{0}|^{2}<\infty$ .
(ii)

The set $U$ is given by $U=\mathbb{R}^{k}$ .

(iii)

The functions $b$ , $\sigma$ , $f$ , and $g$ are given respectively by

	$\displaystyle b(t,x,u)=b_{0}(t)x+b_{1}(t)u,\quad\sigma(t,x,u)=\sigma_{0}(t),$
	$\displaystyle f(t,x)=x^{\mathsf{T}}S(t)x,\quad g(x)=x^{\mathsf{T}}Rx,$

for $t\in[0,T]$ , $x\in\mathbb{R}^{d}$ , and $u\in\mathbb{R}^{k}$ , where $b_{0}$ is $\mathbb{R}^{d\times d}$ -valued, $b_{1}$ is $\mathbb{R}^{d\times k}$ -valued, $\sigma_{0}$ is $\mathbb{R}^{d\times m}$ , and $S$ is $\mathbb{R}^{d\times d}$ -valued, all of which are continuous on $[0,T]$ , and $R\in\mathbb{R}^{d\times d}$ . Further $R$ and $S(t)$ are positive semidefinite for any $t\in[0,T]$ .

In the linear quadratic cases, explicit forms of the optimal controls are available. This helps us to obtain the following uniqueness:

Theorem 3.7.

Let Assumption 3.6 hold. Suppose that $\{u^{*}(t)\}_{0\leq t\leq T}\in\mathcal{U}$ is optimal both for the problem $(\mathcal{C}_{\theta_{1}})$ and $(\mathcal{C}_{\theta_{2}})$ for some $\theta_{1},\theta_{2}>0$ . Moreover, suppose that $\mathbb{P}(u^{*}(t_{0})\neq 0)>0$ for some $t_{0}\in[0,T]$ . Then $\theta_{1}=\theta_{2}$ .

Proof.

For each $\theta=\theta_{1}$ and $\theta_{2}$ , an optimal control for $(\mathcal{C}_{\theta})$ uniquely exists and we necessarily have

(3.9)

u^{*}(t)=-\frac{1}{\theta}b_{1}(t)^{\mathsf{T}}F(t)X^{*}(t),\quad 0\leq t\leq T,

where $\{F(t)\}_{0\leq t\leq T}$ is a unique solution of the matrix Riccati equation

(3.10)

\frac{d}{dt}F(t)+b_{0}^{\mathsf{T}}(t)F(t)+F(t)b_{0}(t)-\frac{1}{\theta}F(t)b_{1}(t)b_{1}(t)^{\mathsf{T}}F(t)+S(t)=0,\quad F(T)=R.

We refer to, e.g., Bensoussan [5] for this result. A simple application of Itô formula then yields

dF(t)X^{*}(t)=-(b_{0}(t)^{\mathsf{T}}F(t)+S(t))X^{*}(t)dt+F(t)\sigma_{0}(t)dW(t).

Let $F_{1}$ and $F_{2}$ be the solution of the Riccati equation (3.10) corresponding to $\theta_{1}$ and $\theta_{2}$ , respectively. Then we have

d(F_{1}(t)-F_{2}(t))X^{*}(t)=-b_{0}(t)^{\mathsf{T}}X^{*}(t)dt+(F_{1}(t)-F_{2}(t))\sigma_{0}(t)dW(t).

From $F_{1}(T)=F_{2}(T)$ it follows that

(F_{1}(t)-F_{2}(t))X^{*}(t)=\mathbb{E}\left[\left.\int_{t}^{T}b_{0}(s)^{\mathsf{T}}(F_{1}(s)-F_{2}(s))X^{*}(s)ds\right|\mathcal{F}(t)\right],

whence

\mathbb{E}|(F_{1}(t)-F_{2}(t))X^{*}(t)|^{2}\leq C\int_{t}^{T}\mathbb{E}|(F_{1}(s)-F_{2}(s))X^{*}(s)|^{2}ds

for some positive constant $C>0$ . Therefore we have $F_{1}X^{*}=F_{2}X^{*}$ and so $\theta_{1}=\theta_{2}$ since $b_{1}(t_{0})^{\mathsf{T}}F_{1}(t_{0})X^{*}(t_{0})\neq 0$ with positive probability. Thus the theorem follows. ∎

As in the previous theorem, thanks to the explicit representation of the optimal controls, we have the following stability result:

Theorem 3.8.

Let Assumption 3.6 hold. Let $\{u^{*}(t)\}_{0\leq t\leq T}\in\mathcal{U}$ be optimal for the problem $(\mathcal{C}_{\theta^{*}})$ for some $\theta^{*}>0$ , and for each $n\in\mathbb{N}$ let $\{u_{n}(t)\}_{0\leq t\leq T}\in\mathcal{U}$ be optimal for $(\mathcal{C}_{\theta_{n}})$ for some $\theta_{n}>0$ . Suppose that

\mathbb{E}\int_{0}^{T}|u_{n}(t)-u^{*}(t)|^{2}dt\to 0,\quad n\to\infty,

and that there exists a measurable set $E\subset[0,T]\times\Omega$ with positive measure satisfying

u^{*}(t,\omega)\neq 0,\quad(t,\omega)\in E.

Then we have $\lim_{n\to\infty}\theta_{n}=\theta^{*}$ .

Proof.

By $C$ we denote a generic positive constant that may vary from line to line. Let $X^{*}(t)$ and $X_{n}(t)$ be the state processes corresponding to $u^{*}$ and $u_{n}$ , respectively. Further, let $F^{*}$ and $F_{n}$ be the solution of the Riccati equation (3.10) corresponding to $\theta^{*}$ and $\theta_{n}$ , respectively, $n\in\mathbb{N}$ . Then, as in the proof of Theorem 3.7,

	$\displaystyle F^{}(t)X^{}(t)-F_{n}(t)X_{n}(t)$
	$\displaystyle=\mathbb{E}\left[\left.R(X^{}(T)-X_{n}(T))+\int_{t}^{T}b_{0}(s)^{\mathsf{T}}(F^{}(s)X^{*}(s)-F_{n}(s)X_{n}(s))ds\right\|\mathcal{F}(t)\right],$

whence by Gronwall inequality,

\sup_{0\leq t\leq T}\mathbb{E}|F^{*}(t)X^{*}(t)-F_{n}(t)X_{n}(t)|^{2}\leq C\mathbb{E}|X^{*}(T)-X_{n}(T)|^{2}\to 0,\quad n\to\infty.

Thus,

	$\displaystyle\|\theta_{n}-\theta^{*}\|\int_{E}u_{n}dt\times d\mathbb{P}$	$\displaystyle\leq\|\theta^{}\|\mathbb{E}\int_{0}^{T}\|u_{n}(t)-u^{}(t)\|dt$
		$\displaystyle\quad+C\mathbb{E}\int_{0}^{T}\|F^{}(t)X^{}(t)-F_{n}(t)X_{n}(t)\|dt\to 0,\quad n\to\infty.$

Consequently, $\theta_{n}\to\theta^{*}$ , as required. ∎

4 Numerical method

Here we propose a method for determining the weight parameter $\theta$ given observed data of optimal controls. To this end, recall that the value function $V$ for the problem $(\mathcal{C}_{\theta})$ is given by

(4.1)

V(t,x;\theta)=\inf_{u\in\mathcal{U}_{t}}\mathbb{E}\left[\left.g(X(T))+\int_{t}^{T}(f(s,X(s))+\theta|u(s)|^{2})ds\right|X(t)=x\right],

for $(t,x)\in[0,T]\times\mathbb{R}^{d}$ , where $\mathcal{U}_{t}$ is the set of all $U$ -valued $\{\mathcal{F}(s)\}$ -adapted processes $\{u(s)\}_{t\leq s\leq T}$ satisfying

\mathbb{E}\int_{t}^{T}|u(s)|^{2}ds<\infty.

It is well-known that under Assumption 3.2 the value fuction $V(t,x)\equiv V(t,x;\theta)$ is a unique continuous viscosity solution of the Hamilton-Jacobi-Bellman (HJB) equation

(4.2)

\left\{\begin{split}&\partial_{t}V(t,x)+H(t,x,D_{x}V(t,x),D^{2}_{x}V(t,x))=0,\quad(t,x)\in[0,T)\times\mathbb{R}^{d},\\ &V(T,x)=g(x),\quad x\in\mathbb{R}^{d},\end{split}\right.

where

H(t,x,p,M)=\inf_{u\in U}\left[b(t,x,u)^{\mathsf{T}}p+\frac{1}{2}\mathrm{tr}(\sigma(t,x,u)\sigma(t,x,u)^{\mathsf{T}}M)+f(t,x)+\theta|u|^{2}\right]

for $(t,x,p,M)\in\mathbb{R}^{d}\times\mathbb{R}^{d}\times\mathbb{S}^{d}$ (see, e.g., [15]). Under Assumption 3.6, i.e., the case of the linear quadratic regulator problems, the value function $V$ is given by

(4.3)

V(t,x)=x^{\mathsf{T}}F(t)x+G(t),\quad(t,x)\in[0,T]\times\mathbb{R}^{d},

where $F$ is as in (3.10) and $G$ is the unique solution of

\frac{d}{dt}G(t)+\mathrm{tr}(\sigma_{0}(t)^{\mathsf{T}}F(t)\sigma_{0}(t))=0,\quad G(T)=0.

Then we have the following basic result:

Theorem 4.1.

Let $\{u^{*}(t)\}_{0\leq t\leq T}\in\mathcal{U}$ solves $(\mathcal{C}_{\theta^{*}})$ for some $\theta^{*}>0$ , and $\{X^{*}(t)\}_{0\leq t\leq T}$ the corresponding state trajectory. Suppose that Assumption 3.4 holds and $\mathbb{P}(u^{*}(t_{0})\in\mathrm{int}(U)\setminus\{0\})>0$ for some $t_{0}\in[0,T]$ . Then, $\theta^{*}$ is a unique positive root of the equation

(4.4)

\Phi(\theta):=\mathbb{E}\left[g(X^{*}(T))+\int_{0}^{T}(f(t,X^{*}(t))+\theta|u^{*}(t)|^{2})dt-V(0,X_{0};\theta)\right]=0.

Proof.

Since $X_{0}$ is a constant and $\{u^{*}(t)\}$ is optimal, clearly we have $V(0,X_{0};\theta^{*})=\inf_{u\in\mathcal{U}}J[u]=J[u^{*}]$ , whence $\Phi(\theta^{*})=0$ . If $\theta^{\prime}>0$ satisfies $\Phi(\theta^{\prime})=0$ , then $\{u^{*}(t)\}$ is also optimal for $(\mathcal{C}_{\theta^{\prime}})$ . By Theorem 3.3, we obtain $\theta^{\prime}=\theta^{*}$ . ∎

We have the same result in the case of the linear quadratic regulator problems.

Theorem 4.2.

Let $\{u^{*}(t)\}_{0\leq t\leq T}\in\mathcal{U}$ solves $(\mathcal{C}_{\theta^{*}})$ for some $\theta^{*}>0$ , and $\{X^{*}(t)\}_{0\leq t\leq T}$ the corresponding state trajectory. Suppose that Assumption 3.6 holds and $\mathbb{P}(u^{*}(t_{0})\neq 0)>0$ for some $t_{0}\in[0,T]$ . Then, $\theta^{*}$ is a unique positive root of the equation $\Phi(\theta)=0$ , defined as in $\eqref{eq:3.3}$ .

Proof.

Since $\{u^{*}(t)\}$ is necessarily given by (3.9), it is clear that the process $\{u^{*}(t)|_{X_{0}=x}\}$ is optimal for any $x\in\mathbb{R}^{d}$ . Moreover, the process $\{X^{*}(t)|_{X_{0}=x}\}$ is the state process corresponding to $\{u^{*}(t)|_{X_{0}=x}\}$ . Thus,

	$\displaystyle J[u^{*}]$	$\displaystyle=\mathbb{E}\left[X^{}(T)^{\mathsf{T}}RX^{}(T)+\int_{0}^{T}\left(X^{}(t)^{\mathsf{T}}P(t)X^{}(t)+\theta^{}\|u^{}(t)\|^{2}\right)dt\right]$
		$\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\left.X^{}(T)^{\mathsf{T}}RX^{}(T)+\int_{0}^{T}\left(X^{}(t)^{\mathsf{T}}P(t)X^{}(t)+\theta^{}\|u^{}(t)\|^{2}\right)dt\right\|X_{0}\right]\right]$
		$\displaystyle=\mathbb{E}[V(0,X_{0};\theta^{*})].$

The rest of the proof can be done by the same way as in the proof of Theorem 4.1. ∎

Now, suppose that the $N$ independent samples $\{u^{(j)}(t_{i}),X^{(j)}(t_{i})\}$ , $i=0,\ldots,n$ , $j=1,\ldots,N$ , of optimal control process for $(\mathcal{C}_{\theta^{*}})$ at time $t_{i}$ and the corresponding state at time $t_{i}$ are available, where $0=t_{0}<\cdots<t_{n}=T$ . An inverse control problem is then to determine the unknown $\theta^{*}$ from the observations. Our approach is to focus on the following problem:

Problem $(\mathcal{I})$ .

Find a positive root of the equation

\frac{1}{N}\sum_{j=1}^{N}\left\{g(X^{(j)}(T))+\sum_{i=0}^{n-1}(f(t_{i},X^{(j)}(t_{i}))+\theta|u^{(j)}(t_{i})|^{2})(t_{i+1}-t_{i})-V(0,X^{(j)}(0);\theta)\right\}=0.

In views of the strong law of large number and Theorems 4.1 and 4.2, we can obtain an approximate solution of the inverse problem by solving the problem $(\mathcal{I})$ for sufficiently large $N$ and $n$ .

Example 4.3.

Consider the case where the state is described by the one-dimensional equation

dX(t)=u(t)dt+\frac{1}{10}dW(t)

with initial condition $X_{0}$ having the standard normal distribution, and the control objective $J[u]$ is given by

J[u]=\mathbb{E}\int_{0}^{1}\left(10|X(t)|^{2}+\theta|u(t)|^{2}\right)dt,

where $U=\mathbb{R}$ . This problem is of course a particular case of the linear quadratic regulator ones. The value function $V$ is explicitly given by $V(t,x;\theta)=F(t;\theta)x^{2}+G(t;\theta)$ , where $F(t;\theta)=\sqrt{10\theta}\tanh((1-t)\sqrt{10/\theta})$ and $G(t;\theta)=(\theta/100)\log(\cosh((1-t)\sqrt{10/\theta}))$ . The unique optimal control $u^{*}(t)$ is given by

u^{*}(t)=-\sqrt{\frac{10}{\theta}}\tanh\left(\sqrt{\frac{10}{\theta}}(1-t)\right)X^{*}(t).

To test our approach, we independently generate the samples $(X^{(j)}(t_{i}),u^{(j)}(t_{i}))$ , $j=1,\ldots,10000$ , $i=0,\ldots,1000$ , of the optimal state and control for $\theta=1$ , where $t_{i}=i/1000$ , and consider these samples as observed data. We solve the root-finding problem by minimizing

\frac{1}{10000}\left|\sum_{j=1}^{10000}\sum_{i=0}^{999}\left(10|X^{(j)}(t_{i})|^{2}+\theta|u^{(j)}(t_{i})|^{2}\right)\Delta t-V(0,X_{0}^{(j)};\theta)\right|

over $\theta\in[0.0001,10]$ . The estimated $\theta$ over $100$ trials has the mean $0.9971$ and the standard deviation $4.4181\times 10^{-4}$ .

In most of nonlinear problems, analytical solutions of HJB equations are rarely available. So we need to numerically solve (4.4) to approximate the value functions. Existing numerical methods applicable to (4.4) include the finite difference methods (see, e.g., Kushner and Dupuis [21] and Bonnans and Zidani [6]), the finite-element like methods (see, e.g., Camilli and Falcone [7] and Debrabant and Jakobsen [9]), the kernel-based collocation methods (see, e.g., Kansa [19, 20], Huang et.al [17], and Nakano [23]), and the regression-based methods (see, e.g., Fahim et al. [14], E et al. [13], Sirignano and Spiliopoulos [26], and the references therein). It is well-known that the use of the finite difference methods is limited to low dimensional problems since the number of the spatial grids points has an exponential growth in dimension. In the kernel-based collocation methods, we seek an approximate solution of the form of a linear combination of a radial basis function (e.g., multiquadrics in the Kansa’s original work) by solving finite dimensional linear equations. In general, this procedure allows for a simpler numerical implementation compared to the finite-element like methods and the regression-based methods. The regression-based methods, in particular the ones with neural networks, are prominent in high dimensional problems although they are computationally expensive.

In Example 4.4 below, we will deal with a population control problem having a 3-dimensional state space. We adopt the kernel-based collocation methods to compute the value function for that problem.

The kernel-based collocation methods are obtained by the interpolations with positive definite kernels applied backward recursively in time. Given a points set $\Gamma=\{x^{(1)},\ldots,x^{(N)}\}\subset\mathbb{R}^{d}$ such that $x^{(j)}$ ’s are pairwise distinct, and a positive definite function $\Phi:\mathbb{R}^{d}\to\mathbb{R}$ , the function

\sum_{j=1}^{N}(A^{-1}\psi|_{\Gamma})_{j}\Phi(x-x^{(j)}),\quad x\in\mathbb{R}^{d},

interpolates $\psi$ on $\Gamma$ . Here, $A=\{\Phi(x^{(j)}-x^{(\ell)})\}_{j,\ell=1,\ldots,N}$ , $\psi|_{\Gamma}$ is the column vector composed of $\psi(x_{j})$ , $j=1,\ldots,N$ , and $(z)_{j}$ denotes the $j$ -th component of $z\in\mathbb{R}^{N}$ . Thus, with time grid $\{t_{0},\ldots,t_{n}\}$ such that $0=t_{0}<\cdots<t_{n}=T$ , the function $\tilde{V}(t_{n},\cdot)$ defined by

\tilde{V}(t_{n},x)=\sum_{j=1}^{N}(A^{-1}\tilde{V}_{n})_{j}\Phi(x-x^{(j)}),\quad x\in\mathbb{R}^{d}

approximates $g$ , where $\tilde{V}_{n}=g|_{\Gamma}$ . Then, for any $k=0,1,\ldots,n-1$ , with a vector $\tilde{V}_{k+1}\in\mathbb{R}^{N}$ of an approximate solution at $\{t_{k+1}\}\times\Gamma$ , we set

	$\displaystyle\tilde{V}(t_{k+1},x)$	$\displaystyle=\sum_{j=1}^{N}(A^{-1}\tilde{V}_{k+1})_{j}\Phi(x-x^{(j)}),\quad x\in\mathbb{R}^{d},$
	$\displaystyle\tilde{V}_{k}$	$\displaystyle=\tilde{V}_{k+1}+(t_{k+1}-t_{k})H_{k+1}(v_{k+1}^{h}),$

where $H_{k+1}(\tilde{V}_{k+1})=(H_{k+1,1}(\tilde{V}_{k+1}),\ldots,H_{k+1,N}(\tilde{V}_{k+1}))\in\mathbb{R}^{N}$ with

H_{k+1,j}=H(t_{k+1},x^{(j)},D\tilde{V}(t_{k+1},x^{(j)}),D^{2}\tilde{V}(t_{k+1},x^{(j)})).

It should be mentioned that in view of the accuracy of interpolation, it is necessary to take $\Gamma$ densely. In particular, as the spatial dimension increases, $N$ must also become larger. This creates difficulties in numerically solving the linear equations $A\xi=\psi|_{\Gamma}$ or in computing the inverse matrix $A^{-1}$ . On the other hand, basically there are no restrictions on the structure of $\Gamma$ in implementing the kernel-based collocation methods. As a result, they are useful in multi but relatively low dimensional problems. It should also be noted that the method above can be described in a matrix form. See Nakano [22] for details.

Example 4.4.

Consider the following simple SIR (Susceptible-Infectious-Recovery) epidemic model with vaccination studied in Rachah and Torres [25]:

	$\displaystyle\frac{dS(t)}{dt}$	$\displaystyle=-\beta S(t)I(t)-u(t)S(t),$
	$\displaystyle\frac{dI(t)}{dt}$	$\displaystyle=\beta S(t)I(t)-\mu I(t),$
	$\displaystyle\frac{dR(t)}{dt}$	$\displaystyle=\mu I(t)+u(t)S(t),$

with initial conditions $S(0)=0.95$ , $I(0)=0.05$ , $R(0)=0$ , where $\beta=0.2$ and $\mu=0.1$ . The control objective is given by

J[u]=\int_{0}^{10}(10I(t)+\theta u^{2}(t))dt

with $U=[0,1]$ . We assume that $\theta=1$ is a true parameter and solve the corresponding control problem by the kernel-based collocation method to obtain an approximate value function $\tilde{V}$ , nearly optimal trajectories $S^{*}(t_{i})$ , $I^{*}(t_{i})$ , $R^{*}(t_{i})$ , and a nearly optimal control $u^{*}(t_{i})$ , where $t_{i}=i/1000$ , $i=0,1,\ldots,10000$ . Specifically, the kernel-based method is performed with the following ingredients: the Gaussian kernel with $\alpha=1$ and the collocation points $\Gamma$ consisting of $8^{3}$ uniform grids points in $[S(0)-0.5,S(0)+0.5]\times[I(0)-0.5,I(0)+0.5]\times[R(0)-0.5,R(0)+0.5]$ including the boundary. With this method, we generate $100$ independent copies $X^{(j)}(t_{i})$ , $j=1,\ldots,100$ , of $(S^{*}(t_{i}),I^{*}(t_{i}),R^{*}(t_{i}))+0.01\times Z_{i}$ , $i=0,1,\ldots,n$ , where $Z=(Z_{0},\ldots,Z_{n})$ is an $(n+1)\times 3$ -dimensional Gaussian vector with all components being independent of each other and having zero mean and unit variance. Then we consider these samples with noises as observed data. As in Example 4.3, we solve the root-finding problem $(\mathcal{I})$ by minimizing

\frac{1}{100}\left|\sum_{j=1}^{100}\sum_{i=0}^{9999}\left(10|X^{(j)}(t_{i})|^{2}+\theta|u^{(j)}(t_{i})|^{2}\right)\Delta t-\tilde{V}(0,X_{0}^{(j)};\theta)\right|

subject to $\theta\in[0.0001,10]$ . The estimated $\theta$ over $100$ trials has the mean $0.9496$ and the standard deviation $0.0558$ .

5 Conclusion

We study the inverse problem of the stochastic control of general diffusions with performance index

\mathbb{E}\left[g(X(T))+\int_{0}^{T}\left(f(t,X(t))+\theta|u(t)|^{2}\right)dt\right].

Precisely, we formulate the inverse problem as the one of determining the weight parameter $\theta>0$ given the dynamics, $g$ , $f$ , and an optimal control process. Under mild conditions on the drift, the volatility, $g$ , and $f$ , and under the assumption that the optimal control belongs to the interior of the control set, we show that our inverse problem is well-posed. It should be noted that whether the latter assumption holds true or not is easily confirmed in practice. Then, with the well-posedness, we reduce the inverse problem to some root finding problem of the expectation of a random variable involved with the value function, which has a unique solution. Based on this result, we propose a numerical method for our inverse problem by replacing the expectation above with arithmetic mean of observed optimal control processes and the corresponding state processes. Several numerical experiments show that the numerical method recovers the unknown $\theta$ with high accuracy. In particular, with the help of the kernel-based collocation method for Hamilton-Jacobi-Bellman equations, our method for the inverse problems still works well even when an explicit form of the value function is unavailable.

Possible future studies include the well-posedness when the function $g$ or $f$ is also unknown, and numerical methods under non-uniqueness of the inverse problems. In the latter issue, we may need to use an additional criterion to determine unknowns from multiple candidates such as the maximum entropy principle adopted in [29].

Acknowledgements

This study is supported by JSPS KAKENHI Grant Number JP17K05359.

References

[1] N. Ab Azar, A. Shahmansoorian, and M. Davoudi. From inverse optimal control to inverse reinforcement learning: A historical review. Annu. Rev. Control, 50:119–138, 2020.
[2] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. ICML, 2004.
[3] B. D. O. Anderson and J. B. Moore. Optimal control: linear quadratic methods. Prentice-Hall, Inc., Upper Saddle River, 1989.
[4] R. Bellman and R. Kalaba. An inverse problem in dynamic programming and automatic control. J. Math. Anal. Appl., 7:322–325, 1963.
[5] A. Bensoussan. Stochastic control of partially observable systems. Cambridge University Press, Cambridge, 1992.
[6] F. Bonnans and H. Zidani. Consistency of generalized finite difference schemes for the stochastic HJB equation. SIAM J. Numer. Anal., 41:1008–1021, 2003.
[7] F. Camilli and M. Falcone. An approximation scheme for the optimal control of diffusion processes. Math. Model. Numer. Anal., 29:97–122, 1995.
[8] J. Casti. On the general inverse problem of optimal control theory. J. Optim. Theory Appl., 32:491–497, 1980.
[9] K. Debrabant and E. R. Jakobsen. Semi-Lagrangian schemes for linear and fully non-linear diffusion equations. Math. Comp., 82:1433–1462, 2013.
[10] H. Deng and M. Krstić. Stochastic nonlinear stabilization―II: inverse optimality. Systems Control lett., 32:151–159, 1997.
[11] K. D. Do. Inverse optimal control of stochastic systems driven by Lévy processes. Automatica, 107:539–550, 2019.
[12] K. Dvijotham and E. Todorov. Inverse optimal control with linearly-solvable MDPs. In Proc. ICML, 2010.
[13] W. E, J. Han, and A. Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Commun. Math. Stat., 5:349–380, 2017.
[14] A. Fahim, N. Touzi, and X. Warin. A probabilistic numerical method for fully nonlinear parabolic PDEs. Ann. Appl. Probab., 21:1322–1364, 2011.
[15] W. H. Fleming and H. M. Soner. Controlled Markov processes and viscosity solutions. Springer-Verlag, New York, 2nd edition, 2006.
[16] J. Hadamard. Sur les problèmes aux dérivées partielles et leur signification physique. Princeton University Bulletin, 13:49–52, 1902.
[17] C.-S. Huang, S. Wang, C. S. Chen, and Z.-C. Li. A radial basis collocation method for Hamilton-Jacobi-Bellman equations. Automatica, 42:2201–2207, 2006.
[18] R. E. Kalman. When is a linear control system optimal? Trans. ASME Ser. D: J. Basic Eng., 86:51–60, 1964.
[19] E. J. Kansa. Multiquadrics―a scattered data approximation scheme with application to computational fluid-dynamics―I. Computers Math. Applic., 19:127–145, 1990.
[20] E. J. Kansa. Multiquadrics―a scattered data approximation scheme with application to computational fluid-dynamics―II. Computers Math. Applic., 19:147–161, 1990.
[21] H. J. Kushner and P. Dupuis. Numerical methods for stochastic control problems in continuous time. Springer-Verlag, New York, 2001.
[22] Y. Nakano. Convergent kernel-based methods for parabolic equations. arXiv:1803.09446[Math.NA].
[23] Y. Nakano. Convergence of meshfree collocation methods for fully nonlinear parabolic equations. Numer. Math., 136:703–723, 2017.
[24] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In Proc. ICML, 2000.
[25] A. Rachah and D. F. Torres. Mathematical modelling, simulation, and optimal control of the 2014 ebola outbreak in west africa. Discrete Dyn. Nat. Soc., 2015, 2015.
[26] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. J. Comput. Phys., 375:1339–1364, 2018.
[27] F. Thau. On the inverse optimum control problem for a class of nonlinear autonomous systems. IEEE Trans. Automat. Control, 12:674–681, 1967.
[28] J. Yong and X. Y. Zhou. Stochastic controls: Hamiltonian systems and HJB equations. Springer-Verlag, New York, 1999.
[29] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. D. Dey. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438, 2008.

	$\displaystyle\|\varphi(t,x,u)-\varphi(t,x^{\prime},u^{\prime})\|$	$\displaystyle\leq C_{0}\|x-x^{\prime}\|+\rho(\|u-u^{\prime}\|),$
	$\displaystyle\|\varphi(t,0,u)\|$	$\displaystyle\leq C_{0},$
	$\displaystyle\|D_{x}\varphi(t,x,u)-D_{x}\varphi(t,x^{\prime},u^{\prime})\|$	$\displaystyle\leq C_{0}\|x-x^{\prime}\|+\rho(\|u-u^{\prime}\|),$
	$\displaystyle\|D_{x}^{2}\varphi(t,x,u)-D_{x}^{2}\varphi(t,x^{\prime},u^{\prime})\|$	$\displaystyle\leq\rho(\|x-x^{\prime}\|+\|u-u^{\prime}\|),$

	$\displaystyle\|\varphi(t,x,u)\|+\|D_{u}\varphi(t,x,u)\|$	$\displaystyle\leq C_{0},$
	$\displaystyle\|D_{u}\varphi(t,x,u)-D_{u}\varphi(t,x^{\prime},u^{\prime})\|$	$\displaystyle\leq\rho(\|x-x^{\prime}\|+\|u-u^{\prime}\|),\quad x,x^{\prime}\in\mathbb{R}^{d},\;\;u,u^{\prime}\in U,$

	$\displaystyle\mathbb{E}\int_{0}^{T}\|p_{n}(t)-p^{*}(t)\|^{2}dt$
	$\displaystyle\leq C\mathbb{E}\|D_{x}g(X_{n}(T))-D_{x}g(X^{*}(T))\|^{2}$
	$\displaystyle\quad+C\mathbb{E}\int_{0}^{T}\|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{}(t),u^{}(t))\|^{2}\|p^{*}(t)\|^{2}dt$
	$\displaystyle\quad+C\mathbb{E}\int_{0}^{T}\|D_{x}\sigma(t,X_{n}(t),u_{n}(t))-D_{x}\sigma(t,X^{}(t),u^{}(t))\|^{2}\|q^{*}(t)\|^{2}dt$
	$\displaystyle\quad+C\mathbb{E}\int_{0}^{T}\|D_{x}f(t,X_{n}(t))-D_{x}f(t,X^{*}(t))\|^{2}dt.$

	$\displaystyle\mathbb{E}\int_{0}^{T}\|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{}(t),u^{}(t))\|^{2}\|p^{*}(t)\|^{2}dt$
	$\displaystyle\leq\mathbb{E}\int_{0}^{T}\|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{}(t),u^{}(t))\|^{2}\|p^{*}(t)\|^{2}$
	$\displaystyle\hskip 30.00005pt\times 1_{\{\|X_{n}(t)-X^{}(t)\|+\rho(\|u_{n}(t)-u^{}(t)\|)>\varepsilon\}}dt$
	$\displaystyle\quad+\mathbb{E}\int_{0}^{T}\|D_{x}b(t,X_{n}(t),u_{n}(t))-D_{x}b(t,X^{}(t),u^{}(t))\|^{2}\|p^{*}(t)\|^{2}$
	$\displaystyle\hskip 30.00005pt\times 1_{\{\|X_{n}(t)-X^{}(t)\|+\rho(\|u_{n}(t)-u^{}(t)\|)\leq\varepsilon\}}dt$
	$\displaystyle\leq C\mathbb{E}\int_{0}^{T}\|p^{}(t)\|^{2}1_{\{\|X_{n}(t)-X^{}(t)\|+\rho(\|u_{n}(t)-u^{}(t)\|)>\varepsilon\}}dt+C\varepsilon\mathbb{E}\int_{0}^{T}\|p^{}(t)\|^{2}dt.$

	$\displaystyle\|L_{n}(t)-L^{*}(t)\|$
	$\displaystyle\leq\|D_{u}b_{n}(t)\|\|p_{n}(t)-p^{}(t)\|+\|p^{}(t)\|\|D_{u}b_{n}(t)-D_{u}b^{*}(t)\|$
	$\displaystyle\quad+C\|\sigma_{n}(t)\|\|P_{n}(t)\|\|D_{u}\sigma_{n}(t)-D_{u}\sigma^{}(t)\|+C\|D_{u}\sigma^{}(t)\|\|P_{n}(t)\|\|\sigma_{n}(t)-\sigma^{*}(t)\|$
	$\displaystyle\quad+\|D_{u}\sigma^{}(t)\|\|\sigma^{}(t)\|)\|P_{n}(t)-P^{*}(t)\|$
	$\displaystyle\leq C(1+\|X_{n}(t)\|+\|u_{n}(t)\|)(\|p_{n}(t)-p^{}(t)\|+\|P_{n}(t)-P^{}(t)\|)$
	$\displaystyle\quad+C_{\varepsilon}(1+\|p^{}(t)\|+\|p_{n}(t)\|)(\varepsilon+\|X_{n}(t)-X^{}(t)\|+\|u_{n}(t)-u^{*}(t)\|)$

Inverse stochastic optimal controls

Abstract

1 Introduction

2 Problem formulation

Problem (𝒞θ)(\mathcal{C}_{\theta}).

3 Well-posedness

Example 3.1.

Assumption 3.2.

Theorem 3.3.

Proof.

Assumption 3.4.

Theorem 3.5.

Proof.

Assumption 3.6.

Theorem 3.7.

Proof.

Theorem 3.8.

Proof.

4 Numerical method

Theorem 4.1.

Proof.

Theorem 4.2.

Proof.

Problem (ℐ)(\mathcal{I}).

Example 4.3.

Example 4.4.

5 Conclusion

Acknowledgements

References

Problem $(\mathcal{C}_{\theta})$ .

Problem $(\mathcal{I})$ .