
Sparse dynamical system identification with simultaneous structural parameters and initial condition estimation

Baolei Wei weiblei@yeah.net
(March 12, 2021)
Abstract

Sparse Identification of Nonlinear Dynamics (SINDy) has been shown to successfully recover governing equations from data; however, this approach assumes the initial condition to be exactly known in advance and is sensitive to noise. In this work we propose an integral SINDy (ISINDy) method to simultaneously identify the model structure and parameters of nonlinear ordinary differential equations (ODEs) from noisy time-series observations. First, the states are estimated via penalized spline smoothing and then substituted into an integral-form numerical discretization solver, leading to a sparse pseudo-linear regression. Then, sequential threshold least squares is performed to extract the fewest active terms from the overdetermined set of candidate features, thereby estimating the structural parameters and initial condition simultaneously and making the identified dynamics parsimonious and interpretable. Simulations detail the method's recovery accuracy and robustness to noise. Examples include a logistic equation, the Lotka-Volterra system, and the Lorenz system.

keywords:
nonlinear dynamics, penalized spline smoothing, integral matching, sparse regression, initial condition

1 Introduction

There is a long and fruitful history of modeling time-series data with dynamical systems, resulting in many powerful techniques for system identification [1, 2]. In practice, many branches of science and engineering represent dynamical systems as sets of ODEs, in which each equation describes the change of a single variable as a function of the other variables. The structures of these ODEs are usually determined according to known physical laws (e.g. Newton's laws, conservation of mass and energy) that provide first principles from which equations of motion can be derived [3]. However, it may be difficult or impossible to derive such equations, especially in emerging fields that lack known physical laws. The problem of discovering governing equations from time-series observations is one of the central challenges in modern dynamical systems research [4].

There is a host of techniques for the discovery of governing equations from time-series data, including state-dependent parameter identification [2, 5], symbolic regression [6, 7, 8, 9], automated inference [10], empirical dynamic modelling [11], and deep learning [12, 13]; see [14] for an (incomplete) review. These approaches can be viewed as special cases of the fourth paradigm: data-intensive scientific discovery [15]. Inspired by the key observation that most dynamical systems consist of only a few terms, a recent breakthrough has produced a compressive sensing paradigm for the structure identification of underlying nonlinear dynamics [16, 17]. By using sparse regression to implement compressive sensing, the SINDy framework was proposed [18]. This approach has three principal procedures: (1) estimating the states together with their derivatives from noisy time-series observations; (2) constructing a library of candidate features to approximate the vector field; (3) conducting sparse regression to generate parsimonious dynamics. A series of extensions and applications of SINDy now exist, such as the discovery of partial differential equations [19], lasso-type sparse optimization [20], and information criterion-based structure selection [21]. However, SINDy has two limitations: (1) accurate derivatives of the states must be estimated from noisy time-series observations, which may be hard in practice [22], and (2) the initial condition is assumed to be noise-free and exactly known in advance when simulating the fitted trajectory, which cannot be satisfied in certain cases.

In a broad sense, the aforementioned SINDy framework can be considered as sparse parameter estimation of ODEs. The subproblem (i.e. estimating parameters of ODEs with known model structures under a measurement-error framework) has been a hot topic in the statistics community, especially after Ramsay's work [23]; see [24] for a comprehensive review. Integral matching [25] is an effective approach to this subproblem. A similar idea to integral matching was first introduced to SINDy by Schaeffer and McCalla [26], leading to an integral variant of SINDy. In this variant, the derivative estimation of states is bypassed, one advantage over standard SINDy. However, both this variant and standard SINDy share the drawbacks that they assume the initial condition to be precisely known in advance rather than estimating it from observations, and that they do not take measurement noise into account, which can lead to misspecified model structures. To this end, the present research improves the SINDy framework by integrating integral matching as a parameter estimation approach. The principal contributions are as follows:

  (i) Penalized spline smoothing is introduced to estimate the dynamical states from noisy observations. This provides a flexible data preprocessing approach for identifying the underlying dynamical system from noisy observations.

  (ii) Matrix computation tricks are utilized to separate the structural parameters from the initial condition, and a modified sequential threshold least squares is proposed to generate simultaneous estimates of the sparse structural parameters and initial condition.

The remainder is structured as follows: section 2 formulates the problem and presents the details of integral SINDy; section 3 demonstrates the finite-sample performance via simulations; and section 4 presents conclusions.

2 Integral SINDy

2.1 Problem formalization

Consider the structure identification and parameter estimation problem for the homogeneous nonlinear ODE system

\begin{cases}\frac{d}{dt}\boldsymbol{x}(t)=\boldsymbol{g}\left(\boldsymbol{x}(t);\boldsymbol{\beta}\right),~t\geq t_{1}\\ \boldsymbol{x}(t_{1})=\boldsymbol{\eta}\end{cases}  (1)

where $\boldsymbol{x}(t)\in\mathbb{R}^{d}$ is a $d$-dimensional state vector, $\boldsymbol{g}(\cdot)$ is an unknown vector function, and $\boldsymbol{\beta}\in\mathbb{R}^{p}$ and $\boldsymbol{\eta}\in\mathbb{R}^{d}$ denote the unknown structural parameters and initial condition, respectively.

In practice, the states are always observed with measurement errors, i.e. the observation equation is

\boldsymbol{y}(t_{k})=\boldsymbol{x}(t_{k})+\boldsymbol{e}(k),~k=1,\cdots,n  (2)

where $\boldsymbol{e}(k)\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma})$ is independent Gaussian noise with mean $\boldsymbol{0}$ and finite covariance matrix $\boldsymbol{\Sigma}$. For simplicity of presentation, the $n\times d$-dimensional time-series observations are collectively denoted as $[\boldsymbol{y}_{i}]_{n\times d}$, where each column $\boldsymbol{y}_{i}$, $1\leq i\leq d$, corresponds to the time series $\{y_{i}(t_{k})\}_{k=1}^{n}$.

Note that the integral form of the state equation (1) is

\boldsymbol{x}(t)=\int_{t_{1}}^{t}\boldsymbol{g}\left(\boldsymbol{x}(s);\boldsymbol{\beta}\right)ds+\boldsymbol{\eta},~t\geq t_{1}  (3)

which allows for the simultaneous estimation of the structural parameters $\boldsymbol{\beta}$ and the initial condition $\boldsymbol{\eta}$. That is, equation (3) offers an alternative, integral SINDy (ISINDy) approach for sparse identification of nonlinear dynamical systems, as shown in Figure 1.

Figure 1: Diagrammatic representation of information flow in integral SINDy.

ISINDy learns from a hybrid information source consisting of data and prior knowledge, and thus belongs to the emerging field known as informed machine learning [27, 28]. The prior knowledge comes from an independent source, such as domain information, empirical experience, or physical laws, and is converted into a formal representation, a library of candidate features, which is used to induce an explicit, likely redundant expression of the vector field. The subsequent sparse learning consists of three steps: preprocessing the time-series observations to estimate the noise-free states, determining the candidate features included in the library, and generating estimates from the sparse regression.

2.2 Penalized spline smoothing for state estimation

Since the observations are contaminated with noise, penalized spline smoothing [29] is utilized to estimate the states without reference to the state equation. For simplicity of presentation, we treat the single-variable case, since smoothing is applied to each variable in turn. Approximating $x(t)$ by a linear combination of basis functions gives

x(t)=\sum_{j=1}^{J}\varphi_{j}(t)b_{j}=\boldsymbol{\varphi}^{\top}(t)\boldsymbol{b}  (4)

where $\boldsymbol{\varphi}(t)=\left[\varphi_{1}(t),\varphi_{2}(t),\cdots,\varphi_{J}(t)\right]^{\top}$ is the known basis function vector and $\boldsymbol{b}=\left[b_{1},b_{2},\cdots,b_{J}\right]^{\top}$ is the unknown coefficient vector to be estimated from the observations in equation (2). In order to balance the fidelity and smoothness of the fitted curve, the objective function is a penalized sum of squared errors

\mathcal{L}(\boldsymbol{b})=(1-\rho)\sum_{k=1}^{n}\left[y(t_{k})-\boldsymbol{\varphi}^{\top}(t_{k})\boldsymbol{b}\right]^{2}+\rho\int_{t_{1}}^{t_{n}}\left\{\frac{d^{2}}{dt^{2}}\left[\boldsymbol{\varphi}^{\top}(t)\boldsymbol{b}\right]\right\}^{2}dt  (5)

where $\rho\in[0,1)$ is a hyper-parameter; on the right-hand side, the first term is the squared-error criterion measuring fidelity, and the second term is a roughness penalty measuring smoothness. The objective function $\mathcal{L}(\boldsymbol{b})$ trades off these two terms: when $\rho=0$, it yields rough least-squares estimates; conversely, as $\rho\rightarrow 1$, it yields very smooth estimates and thus a nearly linear curve.

Since the basis functions together with their derivatives are known in advance, the penalty term can be reformulated as

\int_{t_{1}}^{t_{n}}\left\{\frac{d^{2}}{dt^{2}}\left[\boldsymbol{\varphi}^{\top}(t)\boldsymbol{b}\right]\right\}^{2}dt=\boldsymbol{b}^{\top}\left[\int_{t_{1}}^{t_{n}}\frac{d^{2}}{dt^{2}}\boldsymbol{\varphi}(t)\,\frac{d^{2}}{dt^{2}}\boldsymbol{\varphi}^{\top}(t)\,dt\right]\boldsymbol{b}:=\boldsymbol{b}^{\top}\boldsymbol{Q}\boldsymbol{b}  (6)

where each element of $\boldsymbol{Q}$ is $Q_{i,j}=\int_{t_{1}}^{t_{n}}\frac{d^{2}}{dt^{2}}\varphi_{i}(t)\,\frac{d^{2}}{dt^{2}}\varphi_{j}(t)\,dt$; these definite integrals can be approximated using Simpson's rule. By substituting equation (6) into (5), differentiating the objective function with respect to $\boldsymbol{b}$, and setting the derivative to zero, one obtains the estimates of the coefficients and states:

\hat{\boldsymbol{b}}=\left[(1-\rho)\boldsymbol{R}^{\top}\boldsymbol{R}+\rho\boldsymbol{Q}\right]^{-1}(1-\rho)\boldsymbol{R}^{\top}\boldsymbol{y}\quad\text{and}\quad\mathring{\boldsymbol{x}}=\boldsymbol{R}\hat{\boldsymbol{b}}  (7)

where $\boldsymbol{R}=[\varphi_{j}(t_{k})]$ is an $n\times J$ matrix, and $\boldsymbol{y}=\left[y(t_{1}),y(t_{2}),\cdots,y(t_{n})\right]^{\top}$ and $\mathring{\boldsymbol{x}}=\left[\mathring{x}(t_{1}),\mathring{x}(t_{2}),\cdots,\mathring{x}(t_{n})\right]^{\top}$ are both $n\times 1$ column vectors. To tune the hyper-parameter $\rho$, the generalized cross validation criterion [30] is used, i.e. minimizing

\text{GCV}(\rho)=\frac{1}{n}\left\|(\boldsymbol{I}-\boldsymbol{S}(\rho))\boldsymbol{y}\right\|_{2}^{2}\Big/\left[\frac{1}{n}\mathrm{Trace}(\boldsymbol{I}-\boldsymbol{S}(\rho))\right]^{2}  (8)

where $\boldsymbol{S}(\rho)=(1-\rho)\boldsymbol{R}\left[(1-\rho)\boldsymbol{R}^{\top}\boldsymbol{R}+\rho\boldsymbol{Q}\right]^{-1}\boldsymbol{R}^{\top}$ is the $n\times n$ smoothing matrix satisfying $\mathring{\boldsymbol{x}}=\boldsymbol{S}(\rho)\boldsymbol{y}$. A grid search on the logit scale is commonly used to find a reasonable value of $\rho$.

When implementing penalized spline smoothing, it is vital to select appropriate basis functions. In practice, cubic B-spline basis functions are often chosen because of their good properties [29, 31]; for example, their local compact support ensures efficient computation. See Appendix A for details of the cubic B-spline basis functions. A minimal sketch of the smoothing procedure is given below.
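The following Python sketch implements equations (4)-(8) under stated assumptions: cubic B-splines built with scipy, a clamped uniform knot vector, Simpson's rule for the penalty matrix $\boldsymbol{Q}$, and a logit-scale grid for $\rho$. The function names (design_matrix, pspline_smooth) and the knot count are our illustrative choices, not the paper's reference implementation.

```python
import numpy as np
from scipy.integrate import simpson
from scipy.interpolate import BSpline

def design_matrix(t, knots, degree=3, deriv=0):
    """Evaluate each B-spline basis function (or its deriv-th derivative) at t."""
    n_basis = len(knots) - degree - 1
    cols = []
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        spl = BSpline(knots, c, degree, extrapolate=False)
        cols.append(spl.derivative(deriv)(t) if deriv else spl(t))
    return np.nan_to_num(np.column_stack(cols))  # guard NaNs outside the support

def pspline_smooth(t, y, n_interior=20, degree=3):
    """Penalized-spline smoothing of one noisy series; rho tuned by GCV, eq. (8)."""
    knots = np.r_[[t[0]] * degree, np.linspace(t[0], t[-1], n_interior), [t[-1]] * degree]
    R = design_matrix(t, knots, degree)
    grid = np.linspace(t[0], t[-1], 401)                     # odd grid for Simpson's rule
    D2 = design_matrix(grid, knots, degree, deriv=2)
    Q = simpson(D2[:, :, None] * D2[:, None, :], x=grid, axis=0)   # Q_{ij} of eq. (6)
    best_gcv, best_fit = np.inf, None
    for rho in 1.0 / (1.0 + np.exp(-np.linspace(-8, 8, 33))):      # logit-scale grid
        A = (1 - rho) * R.T @ R + rho * Q
        S = (1 - rho) * R @ np.linalg.solve(A, R.T)          # smoothing matrix S(rho)
        resid = y - S @ y
        gcv = np.mean(resid**2) / (np.trace(np.eye(len(y)) - S) / len(y))**2
        if gcv < best_gcv:
            best_gcv, best_fit = gcv, S @ y
    return best_fit                                          # smoothed states
```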

2.3 Candidate features for vector field representation

From the prior knowledge that the underlying dynamics have only a few terms, the vector field $\boldsymbol{g}(\boldsymbol{x};\boldsymbol{\beta})$ is sparse with respect to the candidate feature space $\boldsymbol{\Theta}(\boldsymbol{x})$, i.e.

\boldsymbol{g}\left(\boldsymbol{x};\boldsymbol{\beta}\right)=\boldsymbol{\Theta}\left(\boldsymbol{x}\right)\boldsymbol{\Xi}  (9)

where $\boldsymbol{\Theta}(\boldsymbol{x})=\left[\theta_{1}(\boldsymbol{x})~\theta_{2}(\boldsymbol{x})~\cdots~\theta_{m}(\boldsymbol{x})\right]$ consists of all possible features and $\boldsymbol{\Xi}\in\mathbb{R}^{m\times d}$ is a sparse matrix with many zeros in each column. Subsequently, without loss of generality, each component of $\boldsymbol{g}\left(\boldsymbol{x};\boldsymbol{\beta}\right)$ can be expressed as a linear combination of the candidate features

\left[\boldsymbol{g}\left(\boldsymbol{x};\boldsymbol{\beta}\right)\right]_{i}=\boldsymbol{\Theta}(\boldsymbol{x})\boldsymbol{\xi}_{i}=\theta_{1}(\boldsymbol{x})\xi_{1,i}+\theta_{2}(\boldsymbol{x})\xi_{2,i}+\cdots+\theta_{m}(\boldsymbol{x})\xi_{m,i}  (10)

where $\boldsymbol{\xi}_{i}\in\mathbb{R}^{m}$, $i=1,\cdots,d$, is the $i$th column of the sparse matrix $\boldsymbol{\Xi}$.

By substituting the sparse expression (9) into the state equation (3), it follows that

\boldsymbol{x}(t)=\left(\int_{t_{1}}^{t}\boldsymbol{\Theta}\left(\boldsymbol{x}(s)\right)ds\right)\boldsymbol{\Xi}+\boldsymbol{\eta},~t\geq t_{1}  (11)

which, by combining equation (10), leads to $d$ separate integral equations

x_{i}(t)=\left(\int_{t_{1}}^{t}\boldsymbol{\Theta}\left(\boldsymbol{x}(s)\right)ds\right)\boldsymbol{\xi}_{i}+\eta_{i},~1\leq i\leq d  (12)

where $\eta_{i}\in\mathbb{R}$ is the $i$th component of the initial condition $\boldsymbol{\eta}$. By using the piecewise-linear (trapezoidal) quadrature, the definite integral at each time instant can be approximated as

\int_{t_{1}}^{t_{k}}\theta_{\ell}(\boldsymbol{x}(t))dt\approx\frac{h}{2}\sum_{j=1}^{k-1}\theta_{\ell}(\boldsymbol{x}(t_{j}))+\frac{h}{2}\sum_{j=2}^{k}\theta_{\ell}(\boldsymbol{x}(t_{j})),~\ell=1,2,\cdots,m  (13)

where $h=t_{k}-t_{k-1}$ is the (uniform) sampling interval. In code, this is simply a cumulative trapezoidal sum, as sketched below.
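A minimal NumPy sketch of the quadrature (13); the function name is ours:

```python
import numpy as np

def cumulative_trapezoid_matrix(theta_vals, h):
    """Approximate the integrals in eq. (13) for k = 2, ..., n.

    theta_vals: (n, m) array of the candidate features along the trajectory.
    Returns an (n-1, m) array whose k-th row (1-based) holds the integral
    from t_1 to t_{k+1}.
    """
    increments = 0.5 * h * (theta_vals[:-1] + theta_vals[1:])  # trapezoid per step
    return np.cumsum(increments, axis=0)

# Equivalently: scipy.integrate.cumulative_trapezoid(theta_vals, dx=h, axis=0)
```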

Note that the construction of the candidate feature space $\boldsymbol{\Theta}(\boldsymbol{x})$ is extremely important to guarantee the accuracy of system identification. Ideally, $\boldsymbol{\Theta}(\boldsymbol{x})$ includes all terms in the underlying dynamical system; then $\boldsymbol{g}(\boldsymbol{x};\boldsymbol{\beta})$ can be well represented as a linear combination of the candidate features. But this is not a simple task, because $\boldsymbol{\Theta}(\boldsymbol{x})$ is not known in advance, even if prior knowledge provides some guidance. As a result, the number of candidate features $m$ should be large enough to include all possible terms appearing in the vector field. A common strategy, used throughout the literature [4], is to combine polynomial and trigonometric functions.

To illustrate the construction of $\boldsymbol{\Theta}(\boldsymbol{x})$, we give an example with three state variables. Denoting the state vector as $\boldsymbol{x}=[x_{1}~x_{2}~x_{3}]^{\top}$, the 3 linear features are $\theta_{1}(\boldsymbol{x})=x_{1}$, $\theta_{2}(\boldsymbol{x})=x_{2}$, $\theta_{3}(\boldsymbol{x})=x_{3}$; the 6 quadratic features are $\theta_{4}(\boldsymbol{x})=x_{1}x_{1}$, $\theta_{5}(\boldsymbol{x})=x_{1}x_{2}$, $\theta_{6}(\boldsymbol{x})=x_{1}x_{3}$, $\theta_{7}(\boldsymbol{x})=x_{2}x_{2}$, $\theta_{8}(\boldsymbol{x})=x_{2}x_{3}$, $\theta_{9}(\boldsymbol{x})=x_{3}x_{3}$; and the 10 cubic features are $\theta_{10}(\boldsymbol{x})=x_{1}x_{1}x_{1}$, $\theta_{11}(\boldsymbol{x})=x_{1}x_{1}x_{2}$, $\theta_{12}(\boldsymbol{x})=x_{1}x_{1}x_{3}$, $\theta_{13}(\boldsymbol{x})=x_{1}x_{2}x_{2}$, $\theta_{14}(\boldsymbol{x})=x_{1}x_{2}x_{3}$, $\theta_{15}(\boldsymbol{x})=x_{1}x_{3}x_{3}$, $\theta_{16}(\boldsymbol{x})=x_{2}x_{2}x_{2}$, $\theta_{17}(\boldsymbol{x})=x_{2}x_{2}x_{3}$, $\theta_{18}(\boldsymbol{x})=x_{2}x_{3}x_{3}$, $\theta_{19}(\boldsymbol{x})=x_{3}x_{3}x_{3}$. Trigonometric features are likewise easy to include; the cosine-based ones are $\theta_{20}(\boldsymbol{x})=\cos(\omega x_{1})$, $\theta_{21}(\boldsymbol{x})=\cos(\omega x_{2})$, $\theta_{22}(\boldsymbol{x})=\cos(\omega x_{3})$, where $\omega$ is a hyper-parameter. Finally, we obtain a library of candidate features $\boldsymbol{\Theta}(\boldsymbol{x})=\left[\theta_{\ell}(\boldsymbol{x})\right]$, $\ell=1,\cdots,22$.

If there is no specific prior knowledge of candidate features, polynomial functions are suggested for the following reasons: (1) it is hard to obtain prior coefficients in the trigonometric functions (e.g. the hyper-parameter $\omega$ in $\cos(\omega x_{i})$), and (2) on a finite time interval there always exists a truncated Taylor series that approximates the vector field. A sketch of the polynomial library construction follows.
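This sketch enumerates the monomials $\theta_{\ell}(\boldsymbol{x})=\prod_{i}x_{i}^{r_{i}}$ up to a given degree; the interface is our illustration, not the paper's code:

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_library(X, degree=3):
    """Monomial features with 1 <= sum(r_i) <= degree (constant term excluded).

    X: (n, d) state matrix. Returns the (n, m) data matrix Theta(X) and the
    exponent tuples (r_1, ..., r_d), one per feature.
    """
    n, d = X.shape
    exponents = [tuple(np.bincount(combo, minlength=d))
                 for deg in range(1, degree + 1)
                 for combo in combinations_with_replacement(range(d), deg)]
    Theta = np.column_stack([np.prod(X ** np.array(r), axis=1) for r in exponents])
    return Theta, exponents
```

For $d=3$ and degree 3, this yields exactly the $3+6+10=19$ polynomial features listed in the example above.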

2.4 Sparse identification by sequential threshold least squares

In the following, the candidate features are set to polynomial functions unless otherwise stated, that is, $\theta_{\ell}(\boldsymbol{x})=\prod_{i=1}^{d}x_{i}^{r_{i}}$ with $1\leq\sum_{i=1}^{d}r_{i}\leq p$ and $r_{i}\in\mathbb{N}$, where $p$ is the largest polynomial degree, typically set to be no larger than 3 for the sake of numerical stability.

Denote the smoothed time series in matrix form as

\mathring{\boldsymbol{X}}=\begin{pmatrix}\mathring{\boldsymbol{x}}_{1}&\mathring{\boldsymbol{x}}_{2}&\cdots&\mathring{\boldsymbol{x}}_{d}\end{pmatrix}=\begin{pmatrix}\mathring{x}_{1}(t_{1})&\mathring{x}_{2}(t_{1})&\cdots&\mathring{x}_{d}(t_{1})\\ \mathring{x}_{1}(t_{2})&\mathring{x}_{2}(t_{2})&\cdots&\mathring{x}_{d}(t_{2})\\ \vdots&\vdots&\ddots&\vdots\\ \mathring{x}_{1}(t_{n})&\mathring{x}_{2}(t_{n})&\cdots&\mathring{x}_{d}(t_{n})\end{pmatrix}

Then, using the element-wise product (denoted by $\odot$), we obtain the data vectors corresponding to the candidate features. For example, if $\theta_{\ell}(\boldsymbol{x})=x_{1}x_{2}x_{3}$, then $\boldsymbol{\theta}_{\ell}(\mathring{\boldsymbol{X}})=\mathring{\boldsymbol{x}}_{1}\odot\mathring{\boldsymbol{x}}_{2}\odot\mathring{\boldsymbol{x}}_{3}$. The resulting data matrix is $\boldsymbol{\Theta}(\mathring{\boldsymbol{X}})=\left[\boldsymbol{\theta}_{1}(\mathring{\boldsymbol{X}}),~\cdots,~\boldsymbol{\theta}_{m}(\mathring{\boldsymbol{X}})\right]$, denoted by $\mathring{\boldsymbol{\Theta}}$ for ease of notation.

Substituting the data matrices into equation (12) gives a pseudo-linear regression whose estimates can be obtained by minimizing the least-squares objective function

\mathop{\arg\min}_{\boldsymbol{\xi}_{i},\eta_{i}}\left\|\boldsymbol{S}\mathring{\boldsymbol{x}}_{i}-\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\boldsymbol{\xi}_{i}-\boldsymbol{1}\eta_{i}\right\|_{2}^{2}  (14)

where $\boldsymbol{S}=\left[\boldsymbol{0},~\boldsymbol{I}_{n-1}\right]$ selects the last $n-1$ rows, $\boldsymbol{C}=\frac{h}{2}\left[\boldsymbol{L}_{n-1},~\boldsymbol{0}\right]+\frac{h}{2}\left[\boldsymbol{0},~\boldsymbol{L}_{n-1}\right]$ is the cumulative (trapezoidal) operator, $\boldsymbol{I}_{n-1}$ is the identity matrix, and $\boldsymbol{L}_{n-1}$ is the lower triangular matrix with zeros above the main diagonal and ones elsewhere. It is easy to show that the simultaneous least-squares estimates of the structural parameters and initial condition are

\begin{pmatrix}\hat{\boldsymbol{\xi}}_{i}\\ \hat{\eta}_{i}\end{pmatrix}=\begin{pmatrix}\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}&\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{1}\\ \boldsymbol{1}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}&\boldsymbol{1}^{\top}\boldsymbol{1}\end{pmatrix}^{-1}\begin{pmatrix}\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{S}\mathring{\boldsymbol{x}}_{i}\\ \boldsymbol{1}^{\top}\boldsymbol{S}\mathring{\boldsymbol{x}}_{i}\end{pmatrix}  (15)

then, according to the block matrix inversion formula [32], the inverse matrix can be rewritten as

\begin{pmatrix}\left(\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\right)^{-1}+\frac{1}{\varrho}\left(\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\right)^{-1}\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{1}\boldsymbol{1}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\left(\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\right)^{-1}&-\frac{1}{\varrho}\left(\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\right)^{-1}\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{1}\\ -\frac{1}{\varrho}\boldsymbol{1}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\left(\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\right)^{-1}&\frac{1}{\varrho}\end{pmatrix}  (16)

where

\varrho=\boldsymbol{1}^{\top}\left[\boldsymbol{I}_{n-1}-\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\left(\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\right)^{-1}\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\right]\boldsymbol{1}.

By plugging equation (16) back into (15), we can separate the estimates of the structural parameters and the initial condition, expressed respectively as

\hat{\boldsymbol{\xi}}_{i}=\left(\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\right)^{-1}\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\left\{\boldsymbol{I}_{n-1}-\frac{1}{\varrho}\boldsymbol{1}\boldsymbol{1}^{\top}\left[\boldsymbol{I}_{n-1}-\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\left(\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\right)^{-1}\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\right]\right\}\boldsymbol{S}\mathring{\boldsymbol{x}}_{i}  (17)

and

\hat{\eta}_{i}=\frac{1}{\varrho}\boldsymbol{1}^{\top}\left[\boldsymbol{I}_{n-1}-\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\left(\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\right)^{-1}\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\right]\boldsymbol{S}\mathring{\boldsymbol{x}}_{i}.  (18)
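In practice one need not form the block inverse explicitly: solving the augmented regression (14) with an ordinary least-squares routine returns the same $(\hat{\boldsymbol{\xi}}_{i},\hat{\eta}_{i})$ as (15), and hence as (17) and (18), whenever the design has full column rank. A minimal sketch (function name ours):

```python
import numpy as np

def isindy_lstsq(x_i, Theta, h):
    """Joint least-squares estimate of (xi_i, eta_i), cf. equations (14)-(15).

    x_i: (n,) smoothed state series; Theta: (n, m) library on the smoothed states.
    """
    n = len(x_i)
    L = np.tril(np.ones((n - 1, n - 1)))                     # lower triangular of ones
    C = 0.5 * h * (np.hstack([L, np.zeros((n - 1, 1))])      # cumulative trapezoidal
                   + np.hstack([np.zeros((n - 1, 1)), L]))   # operator C, (n-1) x n
    design = np.hstack([C @ Theta, np.ones((n - 1, 1))])     # [C Theta, 1]
    coef, *_ = np.linalg.lstsq(design, x_i[1:], rcond=None)  # S x_i drops the first sample
    return coef[:-1], coef[-1]                               # (xi_i, eta_i)
```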

Note that none of the resulting least-squares estimates of the structural parameters are exactly zero, so equation (17) cannot select the correct model structure; it always yields a full model including all candidate features. In order to perform structure selection, the objective function (14) is constrained to generate sparse estimates, that is

\mathop{\arg\min}_{\boldsymbol{\xi}_{i},\eta_{i}}\left\|\boldsymbol{S}\mathring{\boldsymbol{x}}_{i}-\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\boldsymbol{\xi}_{i}-\boldsymbol{1}\eta_{i}\right\|_{2}^{2}\quad\text{subject to}\quad\left\|\boldsymbol{\xi}_{i}\right\|_{0}\leq\lambda_{i}  (19)

where $\lambda_{i}\geq 0$ is a hyper-parameter that controls the sparsity of the parameter vector $\boldsymbol{\xi}_{i}$.

By selecting an appropriate hyper-parameter $\lambda_{i}$, equation (19) leads to a parsimonious model structure that balances model complexity against descriptive capability. Equation (19) can be solved by many algorithms, such as the least absolute shrinkage and selection operator (LASSO) [33] and least squares approximation (LSA) [34]. Here, we adopt sequential threshold least squares (STLS) [35] to generate sparse structural estimates, as shown in Algorithm 1.

Input: smoothed state matrix $\mathring{\boldsymbol{X}}$; candidate data matrix $\mathring{\boldsymbol{\Theta}}$; threshold vector $\boldsymbol{\lambda}$
for $i=1,2,\cdots,d$ do
  $\boldsymbol{P}^{-1}=(\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}})^{-1}$;  ▷ initialization based on the full model
  $\varrho=(n-1)-\boldsymbol{1}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\boldsymbol{P}^{-1}\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{1}$;
  $\hat{\boldsymbol{\xi}}_{i}=\boldsymbol{P}^{-1}\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}\{\boldsymbol{I}_{n-1}-\frac{1}{\varrho}\boldsymbol{1}\boldsymbol{1}^{\top}[\boldsymbol{I}_{n-1}-\boldsymbol{C}\mathring{\boldsymbol{\Theta}}\boldsymbol{P}^{-1}\mathring{\boldsymbol{\Theta}}^{\top}\boldsymbol{C}^{\top}]\}\boldsymbol{S}\mathring{\boldsymbol{X}}(:,i)$;
  repeat
    binds $=(\mathrm{abs}(\hat{\boldsymbol{\xi}}_{i})>\lambda_{i})$;  ▷ threshold small coefficients
    $\hat{\boldsymbol{\xi}}_{i}(\sim\text{binds})=0$;
    $\mathring{\boldsymbol{\Omega}}=\mathring{\boldsymbol{\Theta}}(:,\text{binds})$;  ▷ refit on the active features
    $\boldsymbol{P}^{-1}=(\mathring{\boldsymbol{\Omega}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Omega}})^{-1}$;
    $\varrho=(n-1)-\boldsymbol{1}^{\top}\boldsymbol{C}\mathring{\boldsymbol{\Omega}}\boldsymbol{P}^{-1}\mathring{\boldsymbol{\Omega}}^{\top}\boldsymbol{C}^{\top}\boldsymbol{1}$;
    $\hat{\boldsymbol{\xi}}_{i}(\text{binds})=\boldsymbol{P}^{-1}\mathring{\boldsymbol{\Omega}}^{\top}\boldsymbol{C}^{\top}\{\boldsymbol{I}_{n-1}-\frac{1}{\varrho}\boldsymbol{1}\boldsymbol{1}^{\top}[\boldsymbol{I}_{n-1}-\boldsymbol{C}\mathring{\boldsymbol{\Omega}}\boldsymbol{P}^{-1}\mathring{\boldsymbol{\Omega}}^{\top}\boldsymbol{C}^{\top}]\}\boldsymbol{S}\mathring{\boldsymbol{X}}(:,i)$;
  until convergence;
  $\hat{\eta}_{i}=\frac{1}{\varrho}\boldsymbol{1}^{\top}[\boldsymbol{I}_{n-1}-\boldsymbol{C}\mathring{\boldsymbol{\Omega}}\boldsymbol{P}^{-1}\mathring{\boldsymbol{\Omega}}^{\top}\boldsymbol{C}^{\top}]\boldsymbol{S}\mathring{\boldsymbol{X}}(:,i)$;  ▷ generate the initial condition estimate
end for
Output: sparse structural estimates $\hat{\boldsymbol{\Xi}}=[\hat{\boldsymbol{\xi}}_{1},\cdots,\hat{\boldsymbol{\xi}}_{d}]$; initial condition estimates $\hat{\boldsymbol{\eta}}=[\hat{\eta}_{1},\cdots,\hat{\eta}_{d}]^{\top}$
Algorithm 1: Pseudo-code for the STLS algorithm ($\lambda_{i}$ is the $i$th entry of $\boldsymbol{\lambda}$).
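A runnable Python rendering of Algorithm 1 is sketched below. For clarity it refits each restricted subproblem with numpy.linalg.lstsq instead of maintaining the explicit $\boldsymbol{P}^{-1}$ and $\varrho$ updates; the estimates coincide, and the names (stls, lam) are our own.

```python
import numpy as np

def stls(X_smooth, Theta, h, lam, max_iter=20):
    """Sequential threshold least squares (Algorithm 1), one component at a time."""
    n, d = X_smooth.shape
    m = Theta.shape[1]
    lam = np.broadcast_to(np.asarray(lam, dtype=float), (d,))
    L = np.tril(np.ones((n - 1, n - 1)))
    C = 0.5 * h * (np.hstack([L, np.zeros((n - 1, 1))])
                   + np.hstack([np.zeros((n - 1, 1)), L]))
    ones = np.ones((n - 1, 1))
    Xi, eta = np.zeros((m, d)), np.zeros(d)
    for i in range(d):
        target = X_smooth[1:, i]
        active = np.ones(m, dtype=bool)                  # start from the full model
        for _ in range(max_iter):
            design = np.hstack([C @ Theta[:, active], ones])
            coef, *_ = np.linalg.lstsq(design, target, rcond=None)
            keep = np.abs(coef[:-1]) > lam[i]            # threshold small coefficients
            if keep.all():                               # support unchanged: converged
                break
            active[np.flatnonzero(active)[~keep]] = False
        Xi[active, i] = coef[:-1][keep]                  # sparse structural estimates
        eta[i] = coef[-1]                                # initial condition estimate
    return Xi, eta
```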

3 Simulations

In this section the finite-sample performance of ISINDy is investigated via Monte Carlo simulations; three nonlinear ODE systems, namely the logistic, Lotka-Volterra, and Lorenz dynamics, are used as examples. The underlying time series $\boldsymbol{x}(t)$ is approximated numerically with a Runge-Kutta scheme. For each component of the state variable, the noise level is measured by the noise-variance ratio

nvr=\sqrt{\frac{\mathrm{Var}\left(\boldsymbol{e}_{i}\right)}{\mathrm{Var}\left(\boldsymbol{x}_{i}\right)}}\times 100\%=\frac{\sigma_{i}}{\sqrt{\mathrm{Var}\left(\boldsymbol{x}_{i}\right)}}\times 100\%,~i=1,\cdots,d  (20)

where $\boldsymbol{e}_{i}=\left[e_{i}(t_{1}),~e_{i}(t_{2}),~\cdots,~e_{i}(t_{n})\right]^{\top}$ and $\boldsymbol{x}_{i}=\left[x_{i}(t_{1}),~x_{i}(t_{2}),~\cdots,~x_{i}(t_{n})\right]^{\top}$. A small helper for generating noise at a prescribed nvr is sketched below.
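This helper assumes independent Gaussian noise per component, as in equation (2); the helper name is ours:

```python
import numpy as np

def add_noise(X, nvr, rng=None):
    """Corrupt each state series with Gaussian noise so that eq. (20) holds."""
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = nvr * X.std(axis=0, ddof=1)      # sigma_i = nvr * sqrt(Var(x_i))
    return X + sigma * rng.normal(size=X.shape)
```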

The proposed ISINDy is compared to two benchmark methods: the standard SINDy [18] and its integral variant presented in [26] (referred to as InSINDy hereafter). Note that these two benchmark methods neither account for measurement noise in the observations nor produce estimates of the initial condition.

3.1 Logistic equation

Consider the logistic growth equation

\frac{d}{dt}x(t)=1.6x(t)-[x(t)]^{2}

with initial condition $x(0)=\eta=0.1$.

The trajectories are generated under the configuration $t\in[0,6]$ with $h=0.01$ and noise levels $nvr\in\{10\%,30\%,50\%\}$. When implementing Algorithm 1, the threshold value is set to $\lambda=0.1$. For each noise scenario, the estimates of the sparse structural parameters and initial condition are summarised in Table 1 (Appendix B); the results show that only the proposed ISINDy identifies the correct active terms together with high-accuracy estimates. Substituting the ISINDy-based estimates and running the ode45 routine yields the identified trajectories in Figure 2. An end-to-end sketch of this experiment is given below.
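The sketch reuses the illustrative helpers defined in the previous sections (pspline_smooth, polynomial_library, stls, add_noise, all our own names, not the paper's released code); the printed coefficients should roughly match the ISINDy column of Table 1.

```python
import numpy as np
from scipy.integrate import solve_ivp

h = 0.01
t = np.arange(0.0, 6.0 + 1e-12, h)                       # t in [0, 6], as in the text
sol = solve_ivp(lambda _, x: 1.6 * x - x**2, (0.0, 6.0), [0.1],
                t_eval=t, rtol=1e-10, atol=1e-12)        # logistic trajectory
X = sol.y.T                                              # (n, 1) true states

Y = add_noise(X, nvr=0.30)                               # 30% noise level
X_smooth = pspline_smooth(t, Y[:, 0])[:, None]           # penalized-spline estimate
Theta, exps = polynomial_library(X_smooth, degree=3)     # features x, x^2, x^3
Xi, eta = stls(X_smooth, Theta, h, lam=0.1)              # threshold lambda = 0.1
print(exps, Xi.ravel(), eta)                             # expect roughly [1.6, -1, 0], 0.1
```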

Figure 2: Noisy observations, true trajectories, and identified trajectories of the logistic equation. The noisy observations are shown as green dots; the true and identified trajectories are drawn as solid red and dashed black lines, respectively.

Figure 2 shows that the identified trajectories are very close to the true ones, even when the noise level is as high as $nvr=50\%$, indicating that ISINDy is robust to measurement noise. Moreover, both InSINDy and ISINDy identify the correct active terms when the noise level is $nvr=10\%$, but ISINDy produces more accurate estimates; as the noise level increases, InSINDy fails to recover the correct active terms. For example, when the noise level is $nvr=30\%$, the identified equations of ISINDy and InSINDy are

\frac{d}{dt}x(t)=1.6609x(t)-1.0434[x(t)]^{2}

and

\frac{d}{dt}x(t)=2.3752x(t)-2.1302[x(t)]^{2}+0.4054[x(t)]^{3},

respectively. It is obvious that the InSINDy-based equation includes a redundant active term in the vector field.

3.2 Lotka-Volterra system

Consider the Lotka-Volterra system

\begin{cases}\frac{d}{dt}x_{1}(t)=\frac{2}{3}x_{1}(t)-\frac{4}{3}x_{1}(t)x_{2}(t)\\ \frac{d}{dt}x_{2}(t)=-x_{2}(t)+x_{1}(t)x_{2}(t)\end{cases}

with initial condition $[x_{1}(0),x_{2}(0)]^{\top}=[\eta_{1},\eta_{2}]^{\top}=[1.8,1.8]^{\top}$.

The true trajectories are generated under the configuration $t\in[0,10]$ with $h=0.01$ and noise levels $nvr\in\{5\%,10\%,15\%\}$. Setting the threshold value to $\lambda=0.3$, the estimates of the sparse structural parameters and initial conditions are summarised in Table 2 (Appendix B). The results show that in all three noise scenarios, only the proposed ISINDy identifies the correct active terms and generates high-accuracy parameter estimates. Substituting these ISINDy-based estimates and running the ode23 routine yields the identified trajectories in Figure 3.

Figure 3: Noisy observations, true trajectories, and identified trajectories of the Lotka-Volterra system. The noisy observations are shown as green dots; the true and identified trajectories are drawn as solid red and dashed black lines, respectively.

Figure 3 shows that the proposed ISINDy captures the correct active terms in all three noise scenarios, even when the noise level is up to $nvr=15\%$. Increasing the noise level leads to larger errors in the estimates of the structural parameters and initial condition, resulting in more biased trajectories. Besides, to compare the identified and true dynamical systems in the presence of measurement noise, we simulate the true and identified trajectories over the time range $t\in[0,30]$. Figure 4 shows that the trajectory of the identified dynamical system almost coincides with the true one, indicating that ISINDy can also generate accurate long-term forecasts.

Figure 4: Noisy observations, true trajectories, and identified trajectories of the Lotka-Volterra system at noise level $nvr=15\%$. The noisy observations are shown in green; the true and identified trajectories are drawn as solid red and dashed black lines, respectively.

3.3 Lorenz system

Consider the chaotic Lorenz system

\begin{cases}\frac{d}{dt}x_{1}(t)=-10x_{1}(t)+10x_{2}(t)\\ \frac{d}{dt}x_{2}(t)=28x_{1}(t)-x_{1}(t)x_{3}(t)-x_{2}(t)\\ \frac{d}{dt}x_{3}(t)=x_{1}(t)x_{2}(t)-\frac{8}{3}x_{3}(t)\end{cases}

with initial condition $[x_{1}(0),x_{2}(0),x_{3}(0)]^{\top}=[\eta_{1},\eta_{2},\eta_{3}]^{\top}=[-5,10,30]^{\top}$.

The true trajectories are generated under the configuration $t\in[0,5]$, $h=0.005$, and $nvr\in\{5\%,10\%,15\%\}$. The threshold value is set to $\lambda=0.8$, and the estimates of the sparse structural parameters and initial conditions are summarised in Table 3 (Appendix B). The results show that SINDy, InSINDy, and ISINDy all correctly identify the active terms of the first component in all three noise scenarios, but SINDy and InSINDy fail to identify the correct active terms in the second and third components, let alone produce accurate parameter estimates. Only ISINDy picks out the correct active terms and generates high-accuracy estimates in all three noise scenarios. Substituting the ISINDy-based estimates and running the ode45 routine yields the identified trajectories in Figure 5.

Figure 5: Noisy observations, true trajectories, and identified trajectories of the Lorenz system. The noisy observations are shown in green; the true and identified trajectories are drawn as solid red and dashed black lines, respectively.

It is worth noting, however, that although the identified equations have the correct model structure and high-accuracy estimates, the identified trajectories may still stray from the true ones quickly. For instance, in the $nvr=15\%$ scenario, the trajectories of the three state variables are plotted in Figure 6: the identified trajectories almost coincide with the true ones from $t=0$ to about $t=4$ and then stray from about $t=5$ onward. The reason for this phenomenon is that the Lorenz system has chaotic solutions for certain structural parameters and initial conditions, so the identified trajectory is very sensitive to the estimates of the structural parameters and initial condition, and a small bias in the estimates results in large offsets in the trajectory. Specifically, even assuming the observations are noise-free, i.e. $nvr=0$, the identified attractor and the trajectories of the three states, shown in Figure 7, start to stray rapidly from the true ones from about $t=12$ onward, indicating unreliable long-term forecasts even in a noise-free environment.

Figure 6: Noisy observations, true trajectories, and identified trajectories of the Lorenz system at noise level $nvr=15\%$. The noisy observations are shown in green; the true and identified trajectories are drawn as solid red and dashed black lines, respectively.
Figure 7: True and identified trajectories of the Lorenz system when the observations are noise-free. The observations are shown in green; the true and identified trajectories are drawn as solid red and dashed black lines, respectively.

3.4 Short discussions

The results of the three cases show that the proposed ISINDy outperforms the standard SINDy and its integral variant InSINDy in terms of both structure identification and parameter estimation. When the observations are noise-free, i.e. $nvr=0\%$, all three methods identify the correct model structure, but the parameter estimates obtained from ISINDy are much closer to their true values than those obtained from SINDy and InSINDy (see Tables 1-3 in Appendix B). Specifically, SINDy and ISINDy outperform InSINDy because the former two use a trapezoidal formula rather than Euler's formula; however, this advantage does not carry over to the noisy scenarios, because the derivative approximation-based SINDy is sensitive to measurement noise and thus generates poor results, especially in the Lorenz system case (see Table 3 in Appendix B).

It is worth noting, however, that the aforementioned simulations are conducted under two conditions: (1) the time range is fixed to a specified interval, and (2) the candidate features include all terms appearing in the vector field. Both factors have important effects on the performance of ISINDy.

  (i) For factor (1), taking the Lorenz system as an example, the fixed time interval is set to $[0,3]$, $[3,7]$, and $[7,12]$ in turn. When the observations are noise-free, ISINDy identifies the correct model structure for all three time intervals (see Figure 9 in Appendix C). But when the observations are contaminated by noise, the noise robustness of ISINDy differs across intervals: Figure 10 in Appendix C shows that ISINDy produces a wrong model structure on $[0,3]$ already at $nvr=2\%$, so the noise tolerance there is below 2%; similarly, the tolerance is below 5% on $[3,7]$ and below 10% on $[7,12]$. A strategy to improve robustness, suggested by the noise-free cases, is to improve the quality of the smoothing step.

  (ii) For factor (2), a simple ODE $\frac{d}{dt}x=-\sin(x)$ is employed to explore the effect of the candidate features on identification accuracy. Note that the Taylor expansion of the vector field is

    g(x)=-\sin(x)=-x+\frac{1}{3!}x^{3}-\frac{1}{5!}x^{5}+\mathcal{O}(x^{7})=-x+0.16667x^{3}-0.0083x^{5}+\mathcal{O}(x^{7}).

    On this test problem, ISINDy with different candidate features is investigated: polynomial features, trigonometric features, and a combination of both, as shown in Table 4 (Appendix D). With polynomial features, the degree is set to 3 and then to 5; in both settings ISINDy identifies the correct active terms together with high-accuracy coefficients of the Taylor expansion of $-\sin(x)$. With trigonometric features, or with a combination of polynomial and trigonometric features, the correct active term $\sin(x)$ is identified.

Furthermore, the time interval and the design of the candidate features always interact. Taking the simple ODE $\frac{d}{dt}x=-\sin(x)$ as an example, over the time range $t\in[0,5]$ the resulting trajectories lie in the range $x(t)\in[-1.0,1.0]$. This guarantees that the truncated Taylor series can approximate the vector field, thereby producing an excellent dynamic reconstruction, as shown in Figure 11 (Appendix D).

4 Conclusion

In this work, an integral SINDy approach that leverages penalized spline smoothing is proposed for simultaneously (1) denoising time-series observations, (2) generating accurate estimates of the structural parameters and initial condition, and (3) identifying parsimonious systems of differential equations responsible for generating the time series. Using the integral formulation, noisy time-series observations are handled in a stable and robust way: even at high noise levels, the algorithm produces the correct model structure and high-accuracy estimates of the structural parameters and initial condition, demonstrating improved robustness to noise with less training data in comparison with the previous integral variant of SINDy [26]. Simulations show that the approach is accurate and robust to noise whenever the relevant dynamic behavior is present in the observations, suggesting its ability to extract interpretable equations from data.

The current research assumes that all terms appearing in the vector field are included in the candidate features. We must acknowledge that the proposed approach has several limitations, e.g., (1) the penalized spline smoothing process is time-consuming, especially for large-sample observations, and (2) the vector field is identified incorrectly if the candidate features are improperly designed. In practice, how to design an appropriate library of candidate features is still a challenging problem [36], although physical intuition or domain knowledge may be leveraged in many systems. Besides, this work does not address the identifiability of the structural parameters and initial condition [37, 38, 39], or the uncertainty quantification of model structure inference [40]; these are central topics of our future theoretical research.

Data availability

All scripts to reproduce the results are available at https://github.com/weibl9/integralSINDy.

Declaration of competing interest

The author declares no competing interests.

References

  • Ljung [1999] L. Ljung, System identification: Theory for the user, Prentice Hall PTR, 1999.
  • Young [2012] P. C. Young, Recursive estimation and time-series analysis: an introduction for the student and practitioner, 2nd ed., Springer Science & Business Media, 2012.
  • Ahsan [2016] Z. Ahsan, Differential equations and their applications, PHI Learning Pvt. Ltd., 2016.
  • Brunton and Kutz [2019] S. L. Brunton, J. N. Kutz, Data-driven science and engineering: Machine learning, dynamical systems, and control, Cambridge University Press, 2019.
  • Young [2013] P. C. Young, Hypothetico-inductive data-based mechanistic modeling of hydrological systems, Water Resources Research 49 (2013) 915–935.
  • Chen et al. [2011] Y. Chen, B. Yang, Q. Meng, Y. Zhao, A. Abraham, Time-series forecasting using a system of ordinary differential equations, Information Sciences 181 (2011) 106–114.
  • Bongard and Lipson [2007] J. Bongard, H. Lipson, Automated reverse engineering of nonlinear dynamical systems, Proceedings of the National Academy of Sciences 104 (2007) 9943–9948.
  • Schmidt and Lipson [2009] M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data, Science 324 (2009) 81–85.
  • Quade et al. [2016] M. Quade, M. Abel, K. Shafi, R. K. Niven, B. R. Noack, Prediction of dynamical systems by symbolic regression, Physical Review E 94 (2016) 012214.
  • Daniels and Nemenman [2015] B. C. Daniels, I. Nemenman, Automated adaptive inference of phenomenological dynamical models, Nature Communications 6 (2015) 1–8.
  • Ye et al. [2015] H. Ye, R. J. Beamish, S. M. Glaser, S. C. Grant, C.-h. Hsieh, L. J. Richards, J. T. Schnute, G. Sugihara, Equation-free mechanistic ecosystem forecasting using empirical dynamic modeling, Proceedings of the National Academy of Sciences 112 (2015) E1569–E1576.
  • Raissi [2018] M. Raissi, Deep hidden physics models: Deep learning of nonlinear partial differential equations, Journal of Machine Learning Research 19 (2018) 932–955.
  • Qin et al. [2019] T. Qin, K. Wu, D. Xiu, Data driven governing equations approximation using deep neural networks, Journal of Computational Physics 395 (2019) 620–635.
  • Wang et al. [2016] W.-X. Wang, Y.-C. Lai, C. Grebogi, Data based identification and prediction of nonlinear and complex dynamical systems, Physics Reports 644 (2016) 1–76.
  • Tolle et al. [2011] K. M. Tolle, D. S. W. Tansley, A. J. Hey, The fourth paradigm: Data-intensive scientific discovery [point of view], Proceedings of the IEEE 99 (2011) 1334–1337.
  • Wang et al. [2011] W.-X. Wang, R. Yang, Y.-C. Lai, V. Kovanis, C. Grebogi, Predicting catastrophes in nonlinear dynamical systems by compressive sensing, Physical Review Letters 106 (2011) 154101.
  • Schaeffer et al. [2013] H. Schaeffer, R. Caflisch, C. D. Hauck, S. Osher, Sparse dynamics for partial differential equations, Proceedings of the National Academy of Sciences 110 (2013) 6634–6639.
  • Brunton et al. [2016] S. L. Brunton, J. L. Proctor, J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proceedings of the National Academy of Sciences 113 (2016) 3932–3937.
  • Rudy et al. [2017] S. H. Rudy, S. L. Brunton, J. L. Proctor, J. N. Kutz, Data-driven discovery of partial differential equations, Science Advances 3 (2017) e1602614.
  • Rudy et al. [2019] S. Rudy, A. Alla, S. L. Brunton, J. N. Kutz, Data-driven identification of parametric partial differential equations, SIAM Journal on Applied Dynamical Systems 18 (2019) 643–660.
  • Mangan et al. [2017] N. M. Mangan, J. N. Kutz, S. L. Brunton, J. L. Proctor, Model selection for dynamical systems via sparse regression and information criteria, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 473 (2017) 20170009.
  • Chou and Voit [2009] I.-C. Chou, E. O. Voit, Recent developments in parameter estimation and structure identification of biochemical and genomic systems, Mathematical biosciences 219 (2009) 57–83.
  • Ramsay et al. [2007] J. O. Ramsay, G. Hooker, D. Campbell, J. Cao, Parameter estimation for differential equations: a generalized smoothing approach, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (2007) 741–796.
  • Ramsay and Hooker [2017] J. Ramsay, G. Hooker, Dynamic data analysis: modeling data with differential equations, Springer Science & Business Media, 2017.
  • Dattner and Klaassen [2015] I. Dattner, C. A. J. Klaassen, Optimal rate of direct estimators in systems of ordinary differential equations linear in functions of the parameters, Electronic Journal of Statistics 9 (2015) 1939–1973.
  • Schaeffer and McCalla [2017] H. Schaeffer, S. G. McCalla, Sparse model selection via integral terms, Physical Review E 96 (2017) 023302.
  • von Rueden et al. [2021] L. von Rueden, S. Mayer, K. Beckh, B. Georgiev, S. Giesselbach, R. Heese, B. Kirsch, J. Pfrommer, A. Pick, R. Ramamurthy, M. Walczak, J. Garcke, C. Bauckhage, J. Schuecker, Informed Machine Learning – A taxonomy and survey of integrating prior knowledge into learning systems, IEEE Transactions on Knowledge and Data Engineering (2021) 1–20.
  • Karniadakis et al. [2021] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, L. Yang, Physics-informed machine learning, Nature Reviews Physics 3 (2021) 422–440.
  • Cao et al. [2012] J. Cao, J. Cai, L. Wang, Estimating curves and derivatives with parametric penalized spline smoothing, Statistics and Computing 22 (2012) 1059–1067.
  • Wahba [1990] G. Wahba, Spline models for observational data, Siam, 1990.
  • Patrikalakis and Maekawa [2002] N. M. Patrikalakis, T. Maekawa, Representation of curves and surfaces, in: N. M. Patrikalakis, T. Maekawa (Eds.), Shape Interrogation for Computer Aided Design and Manufacturing, Springer, Berlin, Heidelberg, 2002, pp. 1–33. doi:10.1007/978-3-642-04074-0_1.
  • Lu and Shiou [2002] T.-T. Lu, S.-H. Shiou, Inverses of 2 × 2 block matrices, Computers and Mathematics with Applications 43 (2002) 119–129.
  • Tibshirani [2011] R. Tibshirani, Regression shrinkage and selection via the lasso: a retrospective, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (2011) 273–282.
  • Wang and Leng [2007] H. Wang, C. Leng, Unified LASSO estimation by least squares approximation, Journal of the American Statistical Association 102 (2007) 1039–1048.
  • Zhang and Schaeffer [2019] L. Zhang, H. Schaeffer, On the convergence of the SINDy algorithm, Multiscale Modeling and Simulation 17 (2019) 948–972.
  • Delahunt and Kutz [2022] C. B. Delahunt, J. N. Kutz, A toolkit for data-driven discovery of governing equations in high-noise regimes, IEEE Access 10 (2022) 31210–31234.
  • Pia Saccomani et al. [2003] M. Pia Saccomani, S. Audoly, L. D’Angiò, Parameter identifiability of nonlinear systems: The role of initial conditions, Automatica 39 (2003) 619–632.
  • Miao et al. [2011] H. Miao, X. Xia, A. S. Perelson, H. Wu, On identifiability of nonlinear ODE models and applications in viral dynamics, SIAM Review 53 (2011) 3–39.
  • Peng et al. [2011] H. Peng, L. Li, Y. Yang, F. Sun, Conditions of parameter identification from time series, Physical Review E 83 (2011) 036202.
  • DeGennaro et al. [2019] A. M. DeGennaro, N. M. Urban, B. T. Nadiga, T. Haut, Model structural inference using local dynamic operators, International Journal for Uncertainty Quantification 9 (2019) 59–83.

Appendix A. The cubic B-spline basis functions

A cubic B-spline function is formed by joining piecewise polynomials of degree 3 with at most $\mathcal{C}^{2}$ continuity at the knots. The basis functions between two knots are defined by the Cox-de Boor recursion formula [31]:

b_{i,0}(t)=\begin{cases}1,&t\in[\tau_{i},\tau_{i+1})\\ 0,&t\notin[\tau_{i},\tau_{i+1})\end{cases}
b_{i,p}(t)=\frac{t-\tau_{i}}{\tau_{i+p}-\tau_{i}}b_{i,p-1}(t)+\frac{\tau_{i+p+1}-t}{\tau_{i+p+1}-\tau_{i+1}}b_{i+1,p-1}(t)

where the $\tau_{i}$ denote the knots and $p=3$ is the polynomial degree.

For example, let the knots be $\{0,0.1,\cdots,1.0\}$. Then 13 basis functions are defined over the interval $[0,1]$, and at each knot the polynomial values and their first two derivatives match, as shown in Figure 8. A sketch of the recursion is given below.
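A direct transcription of the recursion, assuming a clamped knot vector (boundary knots repeated) so that exactly 13 cubic basis functions arise on the knots above; the helper name is ours:

```python
import numpy as np

def cox_de_boor(tau, i, p, t):
    """Evaluate the basis function b_{i,p}(t) on the knot vector tau by recursion."""
    if p == 0:
        return np.where((tau[i] <= t) & (t < tau[i + 1]), 1.0, 0.0)
    left = right = 0.0
    if tau[i + p] > tau[i]:                              # skip zero-width spans
        left = (t - tau[i]) / (tau[i + p] - tau[i]) * cox_de_boor(tau, i, p - 1, t)
    if tau[i + p + 1] > tau[i + 1]:
        right = ((tau[i + p + 1] - t) / (tau[i + p + 1] - tau[i + 1])
                 * cox_de_boor(tau, i + 1, p - 1, t))
    return left + right

# Clamped knot vector on {0, 0.1, ..., 1.0}: 17 knots give 17 - 3 - 1 = 13 cubics.
tau = np.r_[[0.0] * 3, np.arange(0.0, 1.0 + 1e-12, 0.1), [1.0] * 3]
t = np.linspace(0.0, 1.0, 200, endpoint=False)           # half-open knot intervals
basis = np.column_stack([cox_de_boor(tau, i, 3, t) for i in range(len(tau) - 4)])
assert basis.shape == (200, 13) and np.allclose(basis.sum(axis=1), 1.0)
```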

Figure 8: Left panel shows the basis functions, and middle (resp. right) panel shows the first-order (resp. second-order) derivatives of basis functions.

Appendix B. The estimates of structural parameters and initial condition under varying noise levels

Table 1: Sparse structural parameters and initial condition of the logistic equation under varying noise levels. Each noise-level column reports SINDy / InSINDy / ISINDy; the initial condition $\eta$ is estimated by ISINDy only.

Terms  | nvr = 0%                  | nvr = 10%                 | nvr = 30%                 | nvr = 50%
$\eta$ | 0.1000                    | 0.0985                    | 0.0962                    | 0.0935
$x$    | 1.6000 / 1.5975 / 1.6000  | 1.8604 / 1.6831 / 1.6203  | 3.7477 / 2.3752 / 1.6609  | 5.1682 / 3.4485 / 1.7047
$x^2$  | -1.0000 / -0.9980 / -1.0000 | -1.4848 / -1.0551 / -1.0145 | -5.0504 / -2.1302 / -1.0434 | -7.4255 / -3.5429 / -1.0744
$x^3$  | 0 / 0 / 0                 | 0.2016 / 0 / 0            | 1.6942 / 0.4054 / 0       | 2.5584 / 0.8550 / 0
Table 2: Sparse structural parameters and initial condition of the Lotka-Volterra system under varying noise levels. Each noise-level column reports SINDy / InSINDy / ISINDy; $\eta_1$ and $\eta_2$ are estimated by ISINDy only.

First component:
Terms        | nvr = 0%                    | nvr = 5%                    | nvr = 10%                   | nvr = 15%
$\eta_1$     | 1.8000                      | 1.8017                      | 1.8030                      | 1.8032
$x_1$        | 0.6667 / 0.6692 / 0.6667    | 0.6664 / 0.6698 / 0.6720    | 0.6589 / 0.6702 / 0.6774    | 0.6425 / 0.5575 / 0.6825
$x_1^2$      | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0
$x_1^3$      | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0
$x_2$        | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / -0.7683 / 0
$x_1x_2$     | -1.3334 / -1.3430 / -1.3333 | -1.3292 / -1.3277 / -1.3388 | -1.3114 / -1.3122 / -1.3444 | -1.2771 / 2.7419 / -1.3488
$x_1^2x_2$   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / -0.9956 / 0
$x_2^2$      | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0.7534 / 0
$x_1x_2^2$   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / -0.9186 / 0
$x_2^3$      | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / -0.4481 / 0

Second component:
Terms        | nvr = 0%                    | nvr = 5%                    | nvr = 10%                   | nvr = 15%
$\eta_2$     | 1.8000                      | 1.8037                      | 1.8077                      | 1.8121
$x_1$        | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0
$x_1^2$      | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0
$x_1^3$      | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0
$x_2$        | -0.9999 / -0.9968 / -1.0000 | -1.1301 / -1.0956 / -0.9954 | -0.9973 / -0.9677 / -0.9908 | -0.8007 / -1.0828 / -0.9863
$x_1x_2$     | 0.9990 / 0.9978 / 1.0000    | 1.7761 / 0 / 0.9956         | 1.8408 / 0.9851 / 0.9911    | 1.9741 / 0 / 0.9865
$x_1^2x_2$   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0
$x_2^2$      | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | -0.5601 / 0 / 0
$x_1x_2^2$   | 0 / 0 / 0                   | 0 / 0.7589 / 0              | 0 / 0 / 0                   | -0.4198 / 0.7636 / 0
$x_2^3$      | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0                   | 0 / 0 / 0
Table 3: Sparse structural parameters and initial condition of the Lorenz system under varying noise levels. Each noise-level column reports SINDy / InSINDy / ISINDy; $\eta_1$, $\eta_2$, $\eta_3$ are estimated by ISINDy only. Rows with all entries equal to 0 are omitted.

First component:
Terms      | nvr = 0%                       | nvr = 5%                       | nvr = 10%                      | nvr = 15%
$\eta_1$   | -5.0046                        | -4.9798                        | -4.9215                        | -4.8467
$x_1$      | -10.0013 / -10.2411 / -10.0013 | -10.0200 / -10.3430 / -10.0180 | -9.9638 / -10.4420 / -10.0100  | -9.8377 / -10.5390 / -9.9915
$x_2$      | 10.0012 / 10.2435 / 10.0013    | 10.0000 / 10.3270 / 9.9987     | 9.9286 / 10.4070 / 9.9728      | 9.7906 / 10.4840 / 9.9353
(all remaining terms through $x_3^3$: 0 for every method)

Second component:
Terms      | nvr = 0%                       | nvr = 5%                       | nvr = 10%                      | nvr = 15%
$\eta_2$   | 9.9983                         | 9.9822                         | 9.9451                         | 9.8771
$x_1$      | 27.9523 / 27.9406 / 28.0061    | 59.9790 / 25.8860 / 28.0770    | 52.5500 / 26.0150 / 28.0720    | 66.2880 / 26.1250 / 27.9900
$x_1^2$    | 0 / 0 / 0                      | 5.1065 / 0 / 0                 | 1.8500 / 0 / 0                 | -2.4958 / 0 / 0
$x_2$      | -0.9820 / -0.9683 / -1.0002    | 7.9855 / 0 / -0.9592           | 47.4620 / 0 / -0.8955          | 68.9500 / 0 / -0.8065
$x_1x_2$   | 0 / 0 / 0                      | -3.5886 / 0 / 0                | 5.3775 / 0 / 0                 | 12.8470 / 0 / 0
$x_2^2$    | 0 / 0 / 0                      | -1.0011 / 0 / 0                | -5.1637 / 0 / 0                | -7.4029 / 0 / 0
$x_3$      | 0 / 0 / 0                      | -12.214 / 0 / 0                | -27.9080 / 0 / 0               | -44.8650 / 0 / 0
$x_1x_3$   | -0.9970 / -0.9990 / -1.0002    | -5.5552 / -0.96061 / -1.0047   | -7.0064 / -0.9654 / -1.0073    | -9.1726 / -0.9695 / -1.0079
$x_2x_3$   | 0 / 0 / 0                      | 1.0570 / 0 / 0                 | -1.4457 / 0 / 0                | -3.2327 / 0 / 0
$x_3^2$    | 0 / 0 / 0                      | 0.9825 / 0 / 0                 | 2.4698 / 0 / 0                 | 4.0385 / 0 / 0
(all remaining terms through $x_3^3$: 0 for every method)

Third component:
Terms        | nvr = 0%                     | nvr = 5%                       | nvr = 10%                      | nvr = 15%
$\eta_3$     | 30.0043                      | 30.0420                        | 30.0700                        | 30.0950
$x_1$        | 0 / 0 / 0                    | 50.8140 / 0 / 0                | 96.1630 / 1.8294 / 0           | 130.3000 / -10.6670 / 0
$x_1^2$      | 0 / 0 / 0                    | 5.3069 / 0 / 0                 | 12.2670 / 0 / 0                | 17.3530 / 0 / 0
$x_2$        | 0 / 0 / 0                    | -58.8090 / 0 / 0               | -121.9800 / 0 / 0              | -157.9900 / 0 / 0
$x_1x_2$     | 0.9949 / 1.0088 / 1.0002     | -8.3655 / 1.0142 / 0.9964      | -21.4980 / 0.9183 / 0.9923     | -29.0380 / 1.1835 / 0.9879
$x_2^2$      | 0 / 0 / 0                    | 4.5377 / 0 / 0                 | 9.6943 / 0 / 0                 | 12.2090 / 0 / 0
$x_3$        | -2.6710 / -2.6934 / -2.6672  | 6.1171 / -2.7172 / -2.6625     | 15.8890 / -3.0193 / -2.6572    | 17.9310 / 0 / -2.6507
$x_1x_3$     | 0 / 0 / 0                    | -3.5181 / 0 / 0                | -6.5122 / 0 / 0                | -9.5474 / 0 / 0
$x_2x_3$     | 0 / 0 / 0                    | 4.0361 / 0 / 0                 | 9.3839 / 0 / 0                 | 12.7760 / 0 / 0
$x_1x_2x_3$  | 0 / 0 / 0                    | 0 / 0 / 0                      | 0 / 0 / 0                      | 1.1170 / 0 / 0
$x_3^2$      | 0 / 0 / 0                    | 0 / 0 / 0                      | -1.5798 / 0 / 0                | -1.7850 / 0 / 0
(all remaining terms through $x_3^3$: 0 for every method)

Appendix C. The identified trajectories of Lorenz system under varying time intervals

Figure 9: Noise-free condition: true and identified trajectories of the Lorenz system under varying time intervals. The observations are shown in green; the true and identified trajectories are drawn as solid red and dashed black lines, respectively.
Figure 10: Noisy condition: identified equations of the Lorenz system under varying time intervals. The misspecified components are in blue.

Appendix D. The effect of candidate features on structure identification

Table 4: Sparse structural parameters and initial condition for the equation $\frac{d}{dt}x=-\sin(x)$ under different candidate feature libraries and varying initial conditions. Entries marked "(all columns)" are identical across the ten initial conditions.

Initial condition | -1.00 | -0.80 | -0.60 | -0.40 | -0.20 | 0.20 | 0.40 | 0.60 | 0.80 | 1.00

Polynomial features, degree 3:
$\eta$   | -0.9997  -0.7999  -0.6000  -0.4000  -0.2000  0.2000  0.4000  0.6000  0.7999  0.9997
$x$      | -0.9996  -0.9998  -0.9999  -1.0000  -1.0000  -1.0000  -1.0000  -0.9999  -0.9998  -0.9996
$x^2$    | 0 (all columns)
$x^3$    | 0.1612  0.1632  0.1647  0.1658  0.1664  0.1664  0.1658  0.1647  0.1632  0.1612

Polynomial features, degree 5:
$\eta$   | -1.0000  -0.8000  -0.6000  -0.4000  -0.2000  0.2000  0.4000  0.6000  0.8000  1.0000
$x$      | -1.0000 (all columns)
$x^2$    | 0 (all columns)
$x^3$    | 0.1666 (all columns)
$x^4$    | 0 (all columns)
$x^5$    | -0.0081  -0.0082  -0.0082  -0.0083  -0.0083  -0.0083  -0.0083  -0.0082  -0.0082  -0.0081

Trigonometric features:
$\eta$      | -1.0000  -0.8000  -0.6000  -0.4000  -0.2000  0.2000  0.4000  0.6000  0.8000  1.0000
$\sin(x)$   | -1.0000 (all columns)
$\cos(x)$, $\sin(2x)$, $\cos(2x)$ | 0 (all columns)

Combination of polynomial and trigonometric features:
$\eta$      | -1.0000  -0.8000  -0.6000  -0.4000  -0.2000  0.2000  0.4000  0.6000  0.8000  1.0000
$\sin(x)$   | -1.0000 (all columns)
$x$, $x^2$, $x^3$, $\cos(x)$, $\sin(2x)$, $\cos(2x)$ | 0 (all columns)
Figure 11: Left panel shows the Taylor series expansion of $-\sin(x)$ up to the fifth order on $[-\pi,\pi]$. Right panel shows the sparse dynamics reconstruction for a sequence of initial conditions, initialized every 5 time units. Initial conditions are chosen from -1.0 to 1.0 in increments of 0.2 (excluding 0.0).