Inferring the Unknown Parameters in Differential Equations by Gaussian Process Regression with Constraint
Abstract
Differential equations (DEs) are a commonly used modeling tool in various scientific subjects such as finance and biology. The parameters in DE models often have interesting scientific interpretations, but their values are often unknown and need to be estimated from measurements of the DE. In this work, we propose a Bayesian inference framework for estimating the parameters of a DE model from noisy and scarce observations of the solution only. A key issue in this problem is to robustly estimate the derivatives of a function from noisy observations of only the function values at given location points, under the assumption of a physical model in the form of a differential equation governing the function and its derivatives. To address this issue, we use the Gaussian Process Regression with Constraint (GPRC) method, which jointly models the solution, the derivatives, and the parametric differential equation, to estimate the solution and its derivatives. For nonlinear differential equations, a Picard-iteration-like linearization is applied so that GPRC remains applicable in an iterative fashion. A new potential, which combines the data and the equation information, is proposed and used in the likelihood for our inference. With numerical examples, we illustrate that the proposed method has competitive performance against existing approaches for estimating the unknown parameters in DEs.
Keywords: Gaussian process, Bayesian inference, Inverse problems
1 Introduction
Differential equations (DEs), which include ordinary differential equations (ODEs) and partial differential equations (PDEs), are used to model a wide variety of physical phenomena. ODEs describe the dynamics of continuously changing processes by relating a process and its rate of change, in biomedical research and other scientific areas. PDE models are commonly used to model complex dynamic systems in applied sciences such as physics, biology and finance. The forms of DE models are usually proposed by experts based on their prior knowledge and understanding of the dynamic system, but the values of the parameters in DE models are often unknown and need to be estimated from noisy measurements of the dynamic system. Because most DEs have no analytic solutions and can only be solved numerically, traditional methods for estimating DE parameters require repeatedly solving the DEs numerically under thousands of candidate parameter values.
Many statistical methods have been developed to estimate the parameters in DE models by repeatedly solving the DE models numerically, which can be time consuming. Specifically, a series of methods in the study of HIV dynamics [11][25][27][26][28] and some hierarchical Bayesian methods [17][12] were proposed that rely on numerical ODE solvers. As for PDEs, Müller and Timmer [15] solved the target least-squares type minimization problem using an extended multiple shooting method, which requires careful parameterization of the initial condition and then numerically solves the PDEs. This leads to a high computational load, and there is a growing need for efficient estimation methods for DE models.
Another strategy to estimate the parameters of a DE is the two-stage method, which in the first stage estimates the solution and its derivatives from the noisy observations using data-smoothing methods without considering the differential equation model, and in the second stage obtains estimates of the DE parameters by least squares. Liang and Wu [13] developed a two-stage method for a general first-order ODE model, using local polynomial regression in the first stage, and derived asymptotic properties of the estimator. Similarly, Chen and Wu [5] developed local estimation for time-varying coefficients. Bär, Hegger, and Kantz [2] modeled unknown PDEs using multivariate polynomials of sufficiently high order, with the best fit chosen by minimizing the least squares error of the polynomial approximation. Based on the estimated functions, the PDE parameters were estimated using least squares [14]. The issues of noise level and data resolution were addressed extensively in this approach. See also Parlitz and Merkwirth [16] and Voss et al. [22] for more examples.
As mentioned before, a straightforward two-stage strategy, though easy to implement, has difficulty in estimating the derivatives of the dynamic process accurately, leading to biased estimates of the DE parameters. There are more advanced methods [19][3][4] that approximate the solution and its derivatives using basis functions while taking both the noisy observations and the equation constraint into account; the parameters are then estimated by optimizing certain criteria. In particular, Xun et al. [29] proposed two joint modeling schemes: (a) a parameter cascading or penalized profile likelihood approach and (b) a fully Bayesian treatment. Methods of this type usually have some sensitive hyper-parameters that need to be carefully designed, and the parameter estimates may converge to a local minimum, while global optimization is computationally intensive. In addition, the ability of basis functions to fit a complex solution function and its derivatives is also a challenge.
Here we propose using a new statistical approach based on Gaussian processes, the Gaussian process regression with constraint (GPRC) method [24], to accurately and robustly predict the solution and the derivatives appearing in the differential equation. GPRC efficiently makes predictions by considering both the observations and the DE constraint in a joint Bayesian framework. Compared with basis-function expansion methods, Gaussian process (GP) models [21] provide a class of flexible and efficient non-parametric fits both for fitting noisy data and for solving noisy DEs; prior knowledge can also be easily encoded in the covariance of the GP. To achieve parameter estimation, we construct a likelihood by employing a new potential term that contains the DE and data residuals, and the Metropolis-Hastings (MH) algorithm is used to infer the unknown parameters via Bayes' rule.
The structure of the paper is organized as follows. We first review the problem setup in Section 2. A review of the GPRC method for estimating the solution and its derivatives is presented in Section 3; specifically, the Picard iterative linearization for nonlinear DEs is carefully formulated in Section 3.2, based on the last example in [24]. The Bayesian inference framework is introduced in Section 4, where we propose a new potential for the likelihood. Numerical examples are presented in Section 5 to demonstrate the effectiveness of the proposed method compared with the two-stage method. Finally, Section 6 offers some closing remarks.
2 Problem setup
Inferring the unknown parameters of a differential equation is a classical inverse problem. A differential equation describes the relationship between the solution function and its derivatives. We consider a multidimensional dynamic process $u(x)$, where $x$ is a multidimensional argument. The general form of the differential equation can be expressed as
$$\mathcal{F}\big(x;\, u(x), \partial u(x), \partial^{2}u(x), \ldots;\, \theta\big) = 0, \qquad (2.1)$$
where $\theta$ is the vector of unknown parameters. In the following, we denote $u(x_i)$ by the short-hand notation $u_i$. The left-hand side of Eq. (2.1) consists of the solution $u$ and its derivative terms ($\partial u, \partial^{2}u, \ldots$). In practice, we may not directly observe $u$, but we can observe the noisy measurement $y$. For simplicity, we assume that there is a measurement error between $y$ and $u$; the noisy measurements satisfy
$$y_i = u(x_i) + \epsilon_i, \qquad i = 1, \ldots, N, \qquad (2.2)$$
where $x_i$ is the observation location, $i$ is the observation index and $\epsilon_i$ is the measurement error. We assume the measurement errors are independent and identically distributed Gaussian with zero mean and variance $\sigma^2$.
In practice, the parameters $\theta$ are often unknown and need to be determined. Our objective is to estimate the parameters $\theta$ from the limited noisy data $\{(x_i, y_i)\}_{i=1}^{N}$.
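To make the observation model (2.2) concrete, the following minimal Python sketch generates synthetic noisy measurements; the test solution $u(x) = \sin(x)$, the number of observations and the noise level are illustrative assumptions rather than settings used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def u_true(x):
    # Hypothetical closed-form solution, used only to illustrate Eq. (2.2).
    return np.sin(x)

N = 20                                  # number of observations (assumed)
x_obs = np.linspace(0.0, 10.0, N)       # observation locations x_i
sigma = 0.1                             # noise standard deviation (assumed)
y_obs = u_true(x_obs) + sigma * rng.normal(size=N)   # y_i = u(x_i) + eps_i
```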
3 GPRC
For estimating the unknown parameter $\theta$, accurate estimates of the solution and its derivatives are necessary and helpful. Inferring a nonlinear function from its noisy measurements at a given set of inputs is a classical statistical learning problem, and a vast array of well-established methods, ranging from polynomials and splines to kernel methods and neural networks, are available for many important applications. However, if the interest is in the derivatives and there are no additional observations of the derivatives, the problem of estimating derivatives is more challenging and subtle than estimating the function itself, and less attention has been given to this issue. In this section, we briefly introduce the GPRC estimator [24], which has shown the ability to accurately estimate the derivatives in ordinary and partial differential equation problems, and is used here to estimate the solution and its derivatives with a guessed parameter value $\theta$.
3.1 GPRC in linear differential equation
Specifically, the solution $u$ is assumed to be a Gaussian process with zero mean and covariance function $k_{uu}(x, x'; \eta)$, which is denoted as
$$u(x) \sim \mathcal{GP}\big(0,\; k_{uu}(x, x'; \eta)\big), \qquad (3.1)$$
where $\eta$ denotes the hyper-parameters of the kernel function. One property of the Gaussian process in our favor is that any linear transformation, such as differentiation and integration, of a Gaussian process is still a Gaussian process. With the assumption (3.1), we consider the case where the differential equation is defined by a linear operator or a combination of linear operators, denoted as $\mathcal{L}_x^{\theta}$ with given $\theta$. We introduce a random function $g$ as the residual of the linear differential equation for convenience: $g(x) = \mathcal{L}_x^{\theta} u(x)$.
Then the function $g$ is also a mean-zero Gaussian process,
$$g(x) \sim \mathcal{GP}\big(0,\; k_{gg}(x, x'; \eta, \theta)\big),$$
where $k_{gg}$ denotes the covariance function of $g$ between $x$ and $x'$. The following fundamental relationship between the kernels $k_{uu}$, $k_{gg}$, $k_{ug}$ and $k_{gu}$ is well-known (see, e.g., [21, 9]):
$$k_{gg}(x, x') = \mathcal{L}_x^{\theta}\mathcal{L}_{x'}^{\theta}\, k_{uu}(x, x'), \qquad k_{ug}(x, x') = \mathcal{L}_{x'}^{\theta}\, k_{uu}(x, x'), \qquad k_{gu}(x, x') = \mathcal{L}_x^{\theta}\, k_{uu}(x, x'). \qquad (3.2)$$
Based on the Gaussian assumption and the covariance expressions between $u$ and $g$ in (3.2), a joint Gaussian process regression framework for the available observation data of $u$ and the zero-valued residual $g$ can be naturally constructed. By interpreting the equation as a constraint on the function $u$, GPRC significantly improves the accuracy of the estimates of the solution and its derivatives thanks to the additional equation information.
Training a Gaussian Process
The training process finds the optimal hyper-parameters by maximum likelihood estimation (MLE). Given the noisy observations $\mathbf{y} = [y_1, \ldots, y_N]^{\top}$ of the state at the points $X = \{x_i\}_{i=1}^{N}$, we denote the state vector $\mathbf{u} = [u(x_1), \ldots, u(x_N)]^{\top}$, the residual vector $\mathbf{g} = [g(x_1), \ldots, g(x_N)]^{\top}$ and the training point matrix $X$. Let $\mathbf{z} = [\mathbf{y}^{\top}, \mathbf{0}^{\top}]^{\top}$ collect the noisy observations of $\mathbf{u}$ and the zero values imposed on the residual $\mathbf{g}$ by the equation. Then the negative log marginal likelihood of $\mathbf{z}$ has the following expression
$$-\log p(\mathbf{z} \mid X, \eta, \sigma^2) = \frac{1}{2}\,\mathbf{z}^{\top} K^{-1}\mathbf{z} + \frac{1}{2}\log\det K + N\log 2\pi, \qquad (3.3)$$
where the matrix $K$ is defined by
$$K = \begin{pmatrix} K_{uu} + \sigma^2 I & K_{ug} \\ K_{gu} & K_{gg} \end{pmatrix}.$$
The matrices $K_{uu}$, $K_{ug}$, $K_{gu}$ and $K_{gg}$ correspond to the kernel functions $k_{uu}$, $k_{ug}$, $k_{gu}$ and $k_{gg}$, respectively, in (3.1) and (3.2), evaluated at the points $X$.
The minimization of the negative log marginal likelihood is a non-convex optimization task. However, gradients are easily obtained:
$$\frac{\partial}{\partial \eta_j}\Big(-\log p(\mathbf{z} \mid X, \eta, \sigma^2)\Big) = \frac{1}{2}\operatorname{tr}\!\left(K^{-1}\frac{\partial K}{\partial \eta_j}\right) - \frac{1}{2}\,\mathbf{z}^{\top} K^{-1}\frac{\partial K}{\partial \eta_j}K^{-1}\mathbf{z}, \qquad (3.4)$$
where $\eta_j$ indicates the $j$th hyper-parameter of the kernel; therefore standard gradient optimizers can be used to minimize the negative log marginal likelihood (3.3) with the gradient (3.4).
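The following Python sketch illustrates this joint training step for an assumed toy linear operator $\mathcal{L}_t^{\theta} u = u' + \theta u$ with an RBF kernel; the operator, the kernel, the synthetic data and all function names are illustrative assumptions, and for brevity the optimizer here uses numerical gradients instead of the analytic expression (3.4).

```python
import numpy as np
from scipy.optimize import minimize

theta = 1.5  # assumed (fixed) equation parameter for the toy operator L u = u' + theta*u

def rbf(t, s, sf, l):
    d = t[:, None] - s[None, :]
    return sf**2 * np.exp(-0.5 * d**2 / l**2)

def d_rbf_dt(t, s, sf, l):      # d/dt k(t, s)
    d = t[:, None] - s[None, :]
    return -(d / l**2) * rbf(t, s, sf, l)

def d2_rbf_dtds(t, s, sf, l):   # d^2/(dt ds) k(t, s)
    d = t[:, None] - s[None, :]
    return (1.0 / l**2 - d**2 / l**4) * rbf(t, s, sf, l)

def joint_cov(t, sf, l, sigma):
    """Covariance of z = [u(t) + noise ; g(t)] with g = u' + theta*u, cf. Eq. (3.3)."""
    Kuu = rbf(t, t, sf, l)
    Kgu = d_rbf_dt(t, t, sf, l) + theta * Kuu            # operator applied to the first argument
    Kgg = (d2_rbf_dtds(t, t, sf, l) + theta * d_rbf_dt(t, t, sf, l)
           + theta * d_rbf_dt(t, t, sf, l).T + theta**2 * Kuu)
    top = np.hstack([Kuu + sigma**2 * np.eye(len(t)), Kgu.T])
    bot = np.hstack([Kgu, Kgg + 1e-8 * np.eye(len(t))])  # small jitter for stability
    return np.vstack([top, bot])

def nlml(log_params, t, y):
    """Negative log marginal likelihood of Eq. (3.3) for z = [y; 0]."""
    sf, l, sigma = np.exp(log_params)
    z = np.concatenate([y, np.zeros_like(y)])            # residual g is "observed" to be zero
    K = joint_cov(t, sf, l, sigma)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, z))
    return 0.5 * z @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(z) * np.log(2 * np.pi)

# Example usage with synthetic data: u(t) = exp(-theta*t) solves u' + theta*u = 0.
t_obs = np.linspace(0.0, 2.0, 15)
y_obs = np.exp(-theta * t_obs) + 0.05 * np.random.default_rng(1).normal(size=15)
res = minimize(nlml, x0=np.log([1.0, 0.5, 0.1]), args=(t_obs, y_obs))
sf_hat, l_hat, sigma_hat = np.exp(res.x)
```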
Prediction
After the hyper-parameters in the GPRC are computed, the prediction of the function or its derivatives of interest, denoted as a function $h$ (e.g., $h = u$ or $h = \partial u$), at a new test point $x_*$ is described below. The covariance function of the GP $h$ is denoted by $k_{hh}$ for convenience.
For a given point $x_*$, the differential equation provides the fact that $g(x_*) = 0$, which is a useful observation to incorporate into the Bayesian inference. To enforce this condition locally, several equally-spaced points around $x_*$, referred to as the extended set $S_*$, are also considered in the prediction. The set $S_*$ is similar to the window/scale concept in local polynomial regression, and thus adaptive strategies are possible. The values of the equation constraint on the extended set should therefore be zero, i.e., $g(S_*) = \mathbf{0}$. These points are supposed to resolve a characteristic length of the residual process $g$.
The posterior distribution of $h$ at a specific test point is
$$\bar{h}(x_*) = K_*\, K^{-1}\,\mathbf{z}, \qquad (3.5)$$
$$\operatorname{Var}\big[h(x_*)\big] = k_{hh}(x_*, x_*) - K_*\, K^{-1} K_*^{\top}, \qquad (3.6)$$
where $\mathbf{z} = [\mathbf{y}^{\top}, \mathbf{0}^{\top}]^{\top}$ stacks the noisy observations of $u$ at $X$ and the zero equation residuals on the extended set $S_*$, $K_* = [\,k_{hu}(x_*, X),\; k_{hg}(x_*, S_*)\,]$ is the cross-covariance between $h(x_*)$ and $(\mathbf{u}, \mathbf{g}(S_*))$, and $K$ is the joint covariance matrix of $(\mathbf{u}, \mathbf{g}(S_*))$ with the noise variance $\sigma^2$ added to the $u$-block. The posterior variances of the solution and its derivatives can be used as good indicators of how confident the predictions are.
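Continuing the hypothetical sketch above (and reusing its `rbf`, `d_rbf_dt`, `d2_rbf_dtds`, `theta`, `t_obs`, `y_obs` and the fitted `sf_hat`, `l_hat`, `sigma_hat`), the prediction step of Eqs. (3.5)-(3.6) conditions jointly on the noisy data and on zero equation residuals over a small extended set around the test point; the extended-set size and width below are arbitrary illustrative choices.

```python
def gprc_predict(t_star, n_ext=3, half_width=0.1):
    """Posterior mean and variance of u(t*) and u'(t*), as in Eqs. (3.5)-(3.6)."""
    t_star = np.atleast_1d(float(t_star))
    s = t_star + half_width * np.linspace(-1.0, 1.0, n_ext)     # extended set around t*

    # Joint covariance of [u(t_obs) + noise ; g(s)] and the stacked "observations" z.
    Kuu = rbf(t_obs, t_obs, sf_hat, l_hat) + sigma_hat**2 * np.eye(len(t_obs))
    Kgu = d_rbf_dt(s, t_obs, sf_hat, l_hat) + theta * rbf(s, t_obs, sf_hat, l_hat)
    Kgg = (d2_rbf_dtds(s, s, sf_hat, l_hat) + theta * d_rbf_dt(s, s, sf_hat, l_hat)
           + theta * d_rbf_dt(s, s, sf_hat, l_hat).T + theta**2 * rbf(s, s, sf_hat, l_hat))
    C = np.block([[Kuu, Kgu.T], [Kgu, Kgg + 1e-8 * np.eye(n_ext)]])
    z = np.concatenate([y_obs, np.zeros(n_ext)])

    # Cross-covariances of u(t*) and u'(t*) with [u(t_obs); g(s)].
    k_u = np.hstack([rbf(t_star, t_obs, sf_hat, l_hat),
                     -d_rbf_dt(t_star, s, sf_hat, l_hat) + theta * rbf(t_star, s, sf_hat, l_hat)]).ravel()
    k_du = np.hstack([d_rbf_dt(t_star, t_obs, sf_hat, l_hat),
                      d2_rbf_dtds(t_star, s, sf_hat, l_hat) + theta * d_rbf_dt(t_star, s, sf_hat, l_hat)]).ravel()

    w = np.linalg.solve(C, z)
    mean_u, mean_du = k_u @ w, k_du @ w                          # Eq. (3.5)
    var_u = sf_hat**2 - k_u @ np.linalg.solve(C, k_u)            # Eq. (3.6)
    var_du = sf_hat**2 / l_hat**2 - k_du @ np.linalg.solve(C, k_du)
    return (mean_u, var_u), (mean_du, var_du)

(u_star, u_var), (du_star, du_var) = gprc_predict(1.0)
```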
3.2 Picard iterative linearization for nonlinear differential equation
For nonlinear differential equations, the equation constraint cannot be formulated as a Gaussian process, since the product of Gaussian processes is no longer a Gaussian process. Here an iterative linearization method is introduced for nonlinear equations, motivated by the Picard iteration method [7]; it is an extension of the method used in the Van der Pol equation example in [24].
Assume we have an initial guess of the solution and its derivatives, which can be fitted by any classical method, such as Gaussian process regression or basis-function expansion. Here the hat notation represents an estimate, and we are only interested in the derivative terms that are useful for linearization; for simplicity, the $k$th derivative term is used to represent all derivatives of interest. Any nonlinear term of the nonlinear equation, which is composed of the solution or its derivative terms, can then be approximated by replacing all but one of its factors with the corresponding initial guesses of the solution and its derivatives, keeping only one factor as an unknown. This is a quite straightforward linearization strategy and can be easily applied. A useful trick is that replacing the solution or low-order derivatives is preferred over replacing the high-order derivatives in the nonlinear term, because the instability of numerical differentiation of a function fitted solely from noisy measurements of the solution becomes more serious for higher-order derivatives. The linearization of a nonlinear term is expressed as
$$\mathcal{N}\big(u, \partial u, \ldots, \partial^{k}u\big) \;\approx\; \mathcal{N}\big(\hat{u}, \widehat{\partial u}, \ldots, \partial^{k}u\big), \qquad (3.7)$$
where all factors except the retained one are replaced by their current estimates.
Based on the linearization in Eq. (3.7), the GPRC introduced in Section 3.1 can be used to estimate a new solution and its derivatives, and an iterative scheme can be constructed with the new pair. Roughly speaking, in the $n$th cycle, given the current guess of the solution and derivatives $\big(\hat{u}^{(n)}, \widehat{\partial^{k}u}^{(n)}\big)$, we can compute a new (and possibly better) approximation using
$$\big(\hat{u}^{(n+1)}, \widehat{\partial^{k}u}^{(n+1)}\big) = \mathcal{G}\big(\hat{u}^{(n)}, \widehat{\partial^{k}u}^{(n)}\big), \qquad (3.8)$$
where $\mathcal{G}$ denotes the GPRC estimator whose output is the posterior mean functions of the solution and its derivatives. Our stopping criterion is the sum of the RMSE of the solution and the RMSEs of all derivatives of interest between successive iterations, expressed as
$$e = \big\|\hat{u}^{(n+1)} - \hat{u}^{(n)}\big\|_2 + \sum_{k}\big\|\widehat{\partial^{k}u}^{(n+1)} - \widehat{\partial^{k}u}^{(n)}\big\|_2,$$
where $\|\cdot\|_2$ denotes the $L_2$ norm. If $e$ is smaller than a given small positive threshold $\epsilon$, we consider the iteration to have reached stability.
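A minimal sketch of the iterative scheme (3.8) and the stopping rule is given below; `gprc_fit` is a hypothetical callable standing in for one linearize-and-refit GPRC pass, not part of any library.

```python
import numpy as np

def picard_gprc(u0, du0, gprc_fit, tol=1e-4, max_iter=20):
    """Picard-like iteration of Eq. (3.8).

    u0, du0  : initial guesses of u and its derivative evaluated on a fixed grid
    gprc_fit : hypothetical callable (u_hat, du_hat) -> (u_new, du_new) that linearizes
               the nonlinear terms with the current guess and returns GPRC posterior means
    """
    u_hat, du_hat = np.asarray(u0, float), np.asarray(du0, float)
    for _ in range(max_iter):
        u_new, du_new = gprc_fit(u_hat, du_hat)
        # Stopping criterion: change between successive iterates of u and its derivative.
        e = np.linalg.norm(u_new - u_hat) + np.linalg.norm(du_new - du_hat)
        u_hat, du_hat = u_new, du_new
        if e < tol:
            break
    return u_hat, du_hat
```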
4 Inference
We are interested in the problem of estimating the unknown parameters from the indirect observations and the DE constraint. For inferring the unknown parameters, both the data equation (2.2) and the DE constraint (2.1) should be satisfied.
In the Bayesian setting, the prior belief about the parameter $\theta$ is encoded in the probability distribution $\pi(\theta)$. Here we use a uniform distribution as the prior to highlight the action of the likelihood function. Our aim is to infer the distribution of $\theta$ conditioned on the given data $\mathbf{y}$, i.e., the posterior distribution $\pi(\theta \mid \mathbf{y})$. By Bayes' rule, we have
$$\pi(\theta \mid \mathbf{y}) \;\propto\; L(\theta; \mathbf{y})\,\pi(\theta) \;=\; \exp\!\big(-\beta\,\Phi(\theta)\big)\,\pi(\theta), \qquad (4.1)$$
where $\beta$ is a scale hyperparameter used to adjust the likelihood value to a reasonable scale, and $\Phi(\theta)$ is called the potential of the likelihood $L(\theta; \mathbf{y})$; specifically,
$$\Phi(\theta) \;=\; \Phi_{\mathrm{DE}}(\theta) + \Phi_{\mathrm{data}}(\theta) \;=\; \sum_{i=1}^{N}\mathcal{F}\big(x_i;\, \hat{u}_{\theta}, \partial\hat{u}_{\theta}, \ldots;\, \theta\big)^{2} \;+\; \sum_{i=1}^{N}\big(\hat{u}_{\theta}(x_i) - y_i\big)^{2}. \qquad (4.2)$$
Fidelity to the DE is measured by $\Phi_{\mathrm{DE}}$, while fidelity to the data is measured by $\Phi_{\mathrm{data}}$; here $\hat{u}_{\theta}$ and its derivatives denote the GPRC posterior-mean estimates obtained under the candidate parameter $\theta$. Equal weights for the data and the equation information are assumed in Eq. (4.2).
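As a sketch, the potential (4.2) and the corresponding unnormalized log-likelihood can be coded as follows; `gprc_estimate` and `de_residual` are hypothetical callables wrapping the GPRC fit under a candidate $\theta$ and the evaluation of the equation residual, and are not defined by the paper.

```python
import numpy as np

def potential(theta, y_obs, gprc_estimate, de_residual):
    """Potential of Eq. (4.2): equation residual plus data residual, with equal weights."""
    est = gprc_estimate(theta)                       # GPRC posterior means of u and derivatives
    phi_de = np.sum(de_residual(theta, est) ** 2)    # fidelity to the differential equation
    phi_data = np.sum((est["u"] - y_obs) ** 2)       # fidelity to the observed data
    return phi_de + phi_data

def log_likelihood(theta, y_obs, gprc_estimate, de_residual, beta=1.0):
    """Unnormalized log-likelihood appearing in Eq. (4.1)."""
    return -beta * potential(theta, y_obs, gprc_estimate, de_residual)
```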
The unnormalized posterior (4.1) can be easily sampled using MCMC methods such as the Metropolis-Hastings (MH) algorithm [6], Gibbs sampling [8] and DRAM [10], among others. Here we use the MH sampling algorithm to draw samples from the unnormalized posterior (4.1). The MH algorithm is one of the most popular MCMC methods. An MH step with invariant distribution $\pi(\theta \mid \mathbf{y})$ and proposal distribution $q(\theta^{*} \mid \theta)$ involves sampling a candidate value $\theta^{*}$ given the current value $\theta$ according to $q(\theta^{*} \mid \theta)$. The Markov chain then moves to $\theta^{*}$ with acceptance probability
$$\mathcal{A}(\theta, \theta^{*}) = \min\!\left\{1,\; \frac{\pi(\theta^{*} \mid \mathbf{y})\, q(\theta \mid \theta^{*})}{\pi(\theta \mid \mathbf{y})\, q(\theta^{*} \mid \theta)}\right\};$$
otherwise it remains at $\theta$. In our work, we draw from $\pi(\theta \mid \mathbf{y})$ using MH sampling:
1. Initialise $\theta^{(0)}$.
2. For $i = 0$ to $N_s - 1$:
(a) Sample a candidate $\theta^{*}$ from the proposal distribution $q(\theta^{*} \mid \theta^{(i)})$.
(b) Compute the acceptance probability $\mathcal{A}(\theta^{(i)}, \theta^{*})$ and sample $v$ from the uniform distribution $\mathcal{U}(0, 1)$.
(c) If $v < \mathcal{A}(\theta^{(i)}, \theta^{*})$, accept the candidate and set $\theta^{(i+1)} = \theta^{*}$. Otherwise, set $\theta^{(i+1)} = \theta^{(i)}$.
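These steps translate directly into a short random-walk MH sampler; since the Gaussian proposals used in the examples are symmetric, the proposal densities cancel in the acceptance ratio. The function below is a generic sketch, with names chosen here for illustration.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_samples, prop_std, rng=None):
    """Random-walk Metropolis-Hastings sampler for the unnormalized posterior (4.1).

    log_post : callable theta -> log of the unnormalized posterior (log-prior + log-likelihood)
    prop_std : standard deviation(s) of the Gaussian random-walk proposal
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    samples = np.empty((n_samples, theta.size))
    for i in range(n_samples):
        candidate = theta + prop_std * rng.normal(size=theta.size)   # symmetric Gaussian proposal
        lp_cand = log_post(candidate)
        if np.log(rng.uniform()) < lp_cand - lp:                      # accept with prob min(1, ratio)
            theta, lp = candidate, lp_cand
        samples[i] = theta
    return samples
```

Combined with the `log_likelihood` sketch above and a flat log-prior, a call like `metropolis_hastings(log_post, theta0, 5000, 0.1)` (with illustrative settings) would produce posterior samples analogous to those used in Section 5.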
5 Numerical implementation
The GPRC method can be used to estimate the solution and its derivatives in either ODE or PDE problems [24]. Here we investigate the parameter estimation problem in three scenarios, a linear ODE, a nonlinear ODE and a nonlinear PDE, to verify the effectiveness of our method and to compare it with the two-stage method. It is worth noting that only one iteration is applied in the Picard linearization procedure of Eq. (3.8), because we only replace the solution in the nonlinear terms, which is always estimated well directly from the data. In the two-stage method, we also use Gaussian process regression as the statistical model to approximate the state and its derivatives from the observations [18]; using the same Gaussian-process-based regression model in the two-stage method and in our method eliminates the error caused by model differences. In the two-stage method, the data-information term in Eq. (4.2) is fixed, because the estimates of the solution and its derivatives do not vary with $\theta$. The unknown parameters can then be detected by minimizing only the parametric equation-constraint residual, so the potential is reduced to
$$\Phi_{2\text{-stage}}(\theta) = \sum_{i=1}^{N}\mathcal{F}\big(x_i;\, \hat{u}, \partial\hat{u}, \ldots;\, \theta\big)^{2},$$
where $\hat{u}$ and its derivatives are the fixed GPR estimates obtained from the data alone.
The same radial basis function (RBF) kernel is shared by GPRC and the GPR used in the two-stage method,
$$k_{uu}(x, x') = \sigma_f^{2}\exp\!\left(-\frac{\|x - x'\|^{2}}{2 l^{2}}\right),$$
with hyper-parameters $\sigma_f$ and $l$. The kernels involving derivative terms can be computed based on (3.2). For instance, the kernel between the first-order derivative term in an ODE and the solution can be expressed as
$$k_{u'u}(t, t') = \frac{\partial}{\partial t} k_{uu}(t, t') = -\sigma_f^{2}\,\frac{t - t'}{l^{2}}\exp\!\left(-\frac{(t - t')^{2}}{2 l^{2}}\right).$$
The expressions of the other kernels for a general differential operator can be computed similarly.
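Kernels for other derivative terms can be generated mechanically. As a small illustration (using the `sympy` symbolic library, an assumption of this sketch), differentiating the RBF kernel reproduces the expression above and also gives the kernel between two first-order derivative terms.

```python
import sympy as sp

t, s, sf, l = sp.symbols("t s sigma_f l", positive=True)
k_uu = sf**2 * sp.exp(-(t - s)**2 / (2 * l**2))       # RBF kernel k_uu(t, t')

k_du_u = sp.simplify(sp.diff(k_uu, t))        # Cov(u'(t), u(t'))  = d/dt k_uu
k_du_du = sp.simplify(sp.diff(k_uu, t, s))    # Cov(u'(t), u'(t')) = d^2/(dt dt') k_uu

# k_du_u  equals -sigma_f^2 * (t - s)/l^2 * exp(-(t - s)^2/(2 l^2))
# k_du_du equals  sigma_f^2 * (l^2 - (t - s)^2)/l^4 * exp(-(t - s)^2/(2 l^2))
print(k_du_u)
print(k_du_du)
```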
5.1 1-D toy example
First we consider the following one-dimensional linear ordinary differential equation,
$$u''(t) + \theta_1\, u'(t) + \theta_2\, u(t) = 0, \qquad (5.1)$$
where $u$ is a function of time $t$, and $'$ and $''$ denote the first- and second-order derivative operators, respectively. $\theta = (\theta_1, \theta_2)$ is the vector of unknown parameters to be inferred, and its true value is fixed when generating the data. The observations are evenly measured in a time interval and corrupted with Gaussian noise. The observations and the true solution are shown in Fig. 1.
[Figure 1: The true solution and the noisy observations for the linear ODE example.]
We assume the prior of the state function $u$ is a zero-mean Gaussian process as expressed in (3.1). As discussed in Section 3, the equation constraint $g(t) = \mathcal{L}_t^{\theta} u(t)$ is also a Gaussian process. By the bilinearity of the covariance, the covariance between $g(t)$ and $g(t')$ can be expanded as
$$k_{gg}(t, t') = \mathcal{L}_t^{\theta}\mathcal{L}_{t'}^{\theta}\, k_{uu}(t, t').$$
Similarly, the kernel functions corresponding to the covariances between $u$ and $g$ are expressed as
$$k_{ug}(t, t') = \mathcal{L}_{t'}^{\theta}\, k_{uu}(t, t'), \qquad k_{gu}(t, t') = \mathcal{L}_t^{\theta}\, k_{uu}(t, t').$$
By equations (3.5) and (3.6), the posterior distributions of $u$, $u'$ and $u''$ at a test point then follow from conditioning the joint Gaussian of the data and the zero equation residuals.
This gives the posterior estimates of the state and derivative functions. In the GPRC algorithm, the extended set $S_*$ is chosen to contain only the predicted point itself. Although we may sacrifice some precision of the prediction due to this small extended set, the setting efficiently speeds up the calculation. We can then obtain the posterior estimates of the solution and its derivatives for any given $\theta$. A Gaussian random-walk distribution is used as the proposal in our MH sampler, together with the scale hyperparameter $\beta$ in our likelihood function (4.1), and samples are drawn from the posterior. As a comparison, the same number of samples is obtained by the two-stage method, and both posterior results are shown in Figure 2. From the figure on the left, we can see that the samples (red scattered points) computed by our method have a wider spread, which means a bigger uncertainty of the posterior is estimated. In the marginal pdf figures, the first-dimension marginal pdf value at the true parameter computed by our method (red line) is larger than the one computed by the two-stage method (blue dash), while the second-dimension marginal pdf values of the two methods at the true parameter are equal. These results indicate that the posterior estimate computed by our method is more credible. This point becomes even clearer in a small-noise case, because the bias caused by the numerical error of the derivative estimation is highlighted when the measurement noise is small.
We then consider the small-noise case, using the same implementation configuration as in the large-noise case but with a reduced noise level, and collect the same number of observations at the same locations. The samples from the posterior are shown in Figure 2 (right). Similar to the large-noise case, the figure shows that the results of the two-stage method are of low accuracy, while our method yields rather good results.
[Figure 2: Posterior samples and marginal PDFs computed by our method and the two-stage method for the large-noise case (left) and the small-noise case (right).]
5.2 Van der Pol equation
The Van der Pol model is described by a one-dimensional nonlinear ODE,
$$u''(t) - \mu\big(1 - u(t)^{2}\big)\,u'(t) + u(t) = 0, \qquad (5.2)$$
where $\mu$ is the unknown parameter. For this nonlinear differential equation, we first use the GPR method to obtain an initial guess $\hat{u}$ of the solution from the data only. We then obtain a linearized form of (5.2) by using the approximation $\hat{u}$ to replace $u$ in the nonlinear term, expressed as
$$u''(t) - \mu\big(1 - \hat{u}(t)^{2}\big)\,u'(t) + u(t) = 0.$$
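As a small sketch (assuming the standard Van der Pol form written above, with `u_hat`, `du_hat`, `d2u_hat` standing for the GPRC posterior means and `u_lin` for the fixed estimate used in the linearization), the corresponding potential (4.2) for a candidate $\mu$ reads:

```python
import numpy as np

def vdp_potential(mu, y_obs, u_hat, du_hat, d2u_hat, u_lin):
    """Potential (4.2) for the linearized Van der Pol equation (assumed standard form)."""
    eq_residual = d2u_hat - mu * (1.0 - u_lin**2) * du_hat + u_hat   # linearized equation residual
    data_residual = u_hat - y_obs                                     # data residual
    return np.sum(eq_residual**2) + np.sum(data_residual**2)
```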
In this example, 41 observations are measured every 0.5 time units on the interval $[0, 20]$, with the true parameter $\mu$ fixed. The clean data are corrupted with additive Gaussian noise. The figure on the left of Fig. 3 shows the true solution and the observations. In GPRC, the extended set $S_*$ consists of points evenly selected in a small domain around the prediction point. In the MH sampling procedure, a Gaussian random-walk proposal distribution is used. The histograms of the posterior samples computed by our method and by the two-stage method are shown in Fig. 3 (right). In the figure, the true parameter (black line) is closer to the mean of the red histogram of samples computed by our method than to that of the blue histogram computed by the two-stage method.
[Figure 3: The true solution and the noisy observations of the Van der Pol model (left); histograms of the posterior samples of $\mu$ computed by our method and the two-stage method (right).]
5.3 The KdV Equation
The KdV equation is a nonlinear, dispersive partial differential equation for a function $u$ of two real variables, space $x$ and time $t$:
$$u_t + \lambda_1\, u\, u_x + \lambda_2\, u_{xxx} = 0.$$
It is an asymptotic simplification of the Euler equations used to model waves in shallow water. It can also be viewed as Burgers' equation with an added dispersive term. In this equation, the two parameters $\lambda_1$ and $\lambda_2$ need to be estimated from noisy observations of the function $u$. The true solution is created using a spectral method on a fine space-time grid; more details about this model can be found in [20]. This is a highly nonlinear PDE. Solution values are randomly selected from the given discrete solution over the time and spatial domains, and Gaussian noise with zero mean is added to the selected solution values. The true solution and the observations are shown in Fig. 4 (left).
In the MH sampling procedure, samples are obtained with a Gaussian proposal distribution. The posterior samples and marginal distributions are shown in the right panel of Fig. 4. As shown, the true parameter (white star) is closer to the center of the posterior samples computed by our method than to that of the two-stage method. In addition, at the true parameters $\lambda_1$ and $\lambda_2$, the marginal PDF values computed by our method are larger than those computed by the two-stage method, indicating that our method yields a better posterior estimation.
[Figure 4: The true solution and the noisy observations of the KdV equation (left); posterior samples and marginal distributions of $(\lambda_1, \lambda_2)$ computed by our method and the two-stage method (right).]
6 Conclusion
In conclusion, we apply a powerful tool, the Gaussian Process Regression with Constraint (GPRC) method, to the problem of parameter estimation. We propose a new potential function that combines the data information and the equation information by using the accurate estimates of the solution and its derivatives computed by GPRC. Posterior samples are then drawn from the product of the prior and the likelihood built on this potential by an MH sampler. A two-stage method is also illustrated in each example for comparison. With both linear and nonlinear examples, we illustrate that the proposed method can be a competitive tool for estimating unknown differential equation parameters. Finally, we discuss some limitations of the proposed method as well as some issues that need to be addressed in the future. Firstly, for any given parameter value, GPRC estimates the solution and its derivatives through a training procedure, which is time-consuming; an efficient surrogate could be constructed for the likelihood, and an adaptive procedure [23] could be employed for efficient sampling in the future. Secondly, observations with large bias would render our method ineffective: GPRC reduces the numerical error of the derivative estimates by exploiting the DE information, but strongly biased observations could still cause it to fail.
References
- [1] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to mcmc for machine learning. Machine Learning, 50(1-2):5–43, 2003.
- [2] Markus Bär, Rainer Hegger, and Holger Kantz. Fitting partial differential equations to space-time dynamics. Physical Review E, 59(1):337, 1999.
- [3] J Cao, L Wang, and J Xu. Robust estimation for ordinary differential equation models. Biometrics, 67(4):1305–1313, 2011.
- [4] Jiguo Cao, Jianhua Z Huang, and Hulin Wu. Penalized nonlinear least squares estimation of time-varying parameters in ordinary differential equations. Journal of computational and graphical statistics, 21(1):42–56, 2012.
- [5] Jianwei Chen and Hulin Wu. Efficient local estimation for time-varying coefficients in deterministic dynamic models with applications to hiv-1 dynamics. Journal of the American Statistical Association, 103(481):369–384, 2008.
- [6] Siddhartha Chib and Edward Greenberg. Understanding the metropolis-hastings algorithm. The american statistician, 49(4):327–335, 1995.
- [7] Earl A Coddington and Norman Levinson. Theory of ordinary differential equations. Tata McGraw-Hill Education, 1955.
- [8] Walter R Gilks and Pascal Wild. Adaptive rejection sampling for gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):337–348, 1992.
- [9] Thore Graepel. Solving noisy linear operator equations by gaussian process: Application to ordinary and partial differential equations. In International Conference on Machine Learning, pages 234–241, 2003.
- [10] Heikki Haario, Marko Laine, Antonietta Mira, and Eero Saksman. Dram: efficient adaptive mcmc. Statistics and computing, 16(4):339–354, 2006.
- [11] David D Ho, Avidan U Neumann, Alan S Perelson, Wen Chen, John M Leonard, and Martin Markowitz. Rapid turnover of plasma virions and cd4 lymphocytes in hiv-1 infection. Nature, 373(6510):123–126, 1995.
- [12] Yangxin Huang, Dacheng Liu, and Hulin Wu. Hierarchical bayesian methods for estimation of parameters in a longitudinal hiv dynamic system. Biometrics, 62(2):413–423, 2006.
- [13] Hua Liang and Hulin Wu. Parameter estimation for differential equation models using a framework of measurement error in regression models. Journal of the American Statistical Association, 103(484):1570–1583, 2008.
- [14] TG Müller and Jens Timmer. Parameter identification techniques for partial differential equations. International Journal of Bifurcation and Chaos, 14(06):2053–2060, 2004.
- [15] Thorsten G Müller and Jens Timmer. Fitting parameters in partial differential equations from partially observed noisy data. Physica D: Nonlinear Phenomena, 171(1-2):1–7, 2002.
- [16] Ulrich Parlitz and Christian Merkwirth. Prediction of spatiotemporal time series based on reconstructed local states. Physical review letters, 84(9):1890, 2000.
- [17] Hein Putter, SH Heisterkamp, JMA Lange, and F De Wolf. A bayesian approach to parameter estimation in hiv dynamical models. Statistics in medicine, 21(15):2199–2214, 2002.
- [18] Pankaj Kumar Rai and Shivam Tripathi. Gaussian process for estimating parameters of partial differential equations and its application to the richards equation. Stochastic Environmental Research and Risk Assessment, 33(8-9):1629–1649, 2019.
- [19] Jim O Ramsay, Giles Hooker, David Campbell, and Jiguo Cao. Parameter estimation for differential equations: a generalized smoothing approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(5):741–796, 2007.
- [20] Samuel H Rudy, Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Data-driven discovery of partial differential equations. Science Advances, 3(4):e1602614, 2017.
- [21] Matthias Seeger. Gaussian processes for machine learning. International journal of neural systems, 14(02):69–106, 2004.
- [22] Henning U Voss, Paul Kolodner, Markus Abel, and Jürgen Kurths. Amplitude equations from spatiotemporal binary-fluid convection data. Physical review letters, 83(17):3422, 1999.
- [23] Hongqiao Wang and Jinglai Li. Adaptive gaussian process approximation for bayesian inference with expensive likelihood functions. Neural computation, 30(11):3072–3094, 2018.
- [24] Hongqiao Wang and Xiang Zhou. Explicit estimation of derivatives from data and differential equations by gaussian process regression. arXiv preprint arXiv:2004.05796, 2020.
- [25] Xiping Wei, Sajal K Ghosh, Maria E Taylor, Victoria A Johnson, Emilio A Emini, Paul Deutsch, Jeffrey D Lifson, Sebastian Bonhoeffer, Martin A Nowak, Beatrice H Hahn, et al. Viral dynamics in human immunodeficiency virus type 1 infection. Nature, 373(6510):117–122, 1995.
- [26] Hulin Wu. Statistical methods for hiv dynamic studies in aids clinical trials. Statistical methods in medical research, 14(2):171–192, 2005.
- [27] Hulin Wu and A Adam Ding. Population hiv-1 dynamics in vivo: applicable models and inferential tools for virological data from aids clinical trials. Biometrics, 55(2):410–418, 1999.
- [28] Hulin Wu, A Adam Ding, and Victor De Gruttola. Estimation of hiv dynamic parameters. Statistics in medicine, 17(21):2463–2485, 1998.
- [29] Xiaolei Xun, Jiguo Cao, Bani Mallick, Arnab Maity, and Raymond J Carroll. Parameter estimation of partial differential equation models. Journal of the American Statistical Association, 108(503):1009–1020, 2013.