Learning Deep Input-Output Stable Dynamics

Ryosuke Kojima
Graduate School of Medicine
Kyoto University
Kyoto, 606-8501
kojima.ryosuke.8e@kyoto-u.ac.jp
&Yuji Okamoto^∗
Graduate School of Medicine
Kyoto University
Kyoto, 606-8501
okamoto.yuji.2c@kyoto-u.ac.jp
Equal contribution.

Abstract

Learning stable dynamics from observed time-series data is an essential problem in robotics, physical modeling, and systems biology. Many of these dynamics are represented as an inputs-output system to communicate with the external environment. In this study, we focus on input-output stable systems, exhibiting robustness against unexpected stimuli and noise. We propose a method to learn nonlinear systems guaranteeing the input-output stability. Our proposed method utilizes the differentiable projection onto the space satisfying the Hamilton-Jacobi inequality to realize the input-output stability. The problem of finding this projection can be formulated as a quadratic constraint quadratic programming problem, and we derive the particular solution analytically. Also, we apply our method to a toy bistable model and the task of training a benchmark generated from a glucose-insulin simulator. The results show that the nonlinear system with neural networks by our method achieves the input-output stability, unlike naive neural networks. Our code is available at https://github.com/clinfo/DeepIOStability.

1 Introduction

Learning dynamics from time-series data has many applications such as industrial robot systems [1], physical systems[2], and biological systems [3, 4]. Many of these real-world systems equipped with inputs and outputs to connect for each other, which are called input-output systems [5]. For example, biological systems sustain life by obtaining energy from the external environment through their inputs. Such real-world systems have various properties such as stability, controllability, and observability, which provide clues to analyze the complex systems.

Our purpose is to learn a complex system with “desired properties” from a dataset consisting of pairs of input and output signals. To represent the target system, this paper considers the following nonlinear dynamics:

\displaystyle\begin{aligned} \dot{x}&=f(x)+G(x)u,\quad x(0)=x_{0}\\ y&=h(x).\end{aligned}

(1)

where the inner state $x$ , the input $u$ , and the output $y$ belong to a signal space that maps time interval $[0,\infty)$ to the Euclidean space. We denote the dimension of $x$ , $u$ , and $y$ as $n$ , $m$ , and $l$ , respectively.

Recently, with the development of deep learning, many methods to learn systems from time-series data using neural networks have been proposed [6, 7, 8]. By representing the maps $(f,G,h)$ in Eq. (1) as neural networks, complex systems can be modeled and trained from a given dataset. However, guaranteeing that a trained system has the desired properties is challenging.

Refer to caption — Figure 1: (A) The model prediction of neural networks (B) the reaction of trained model. These results are min-max normalized.

A naively trained system fits the input signals contained in the training dataset, but does not always fit for new input signals. For example, Figure 1 shows our preliminary experiments where we naively learned neural networks $(f,G,h)$ in Eq. (1) from input and output signals. The trained neural networks provided small predictive errors for an input signal in the training dataset (Figure 1 (A) ). Contrastingly, the unbounded output signals were computed by the trained system with new step input signals (Figure 1 (B) ). The reason can be expected that the magnitude (integral value) of this input signal was larger than that of the input signals in the training dataset.

The internal stability is known as one of the attractive properties that should be often satisfied in real-world dynamical systems. The conventional methods to train internal stable systems consisting of neural networks adopt the Lyapunov-based approaches [9, 10, 11]. These methods focus on the internal system, $\dot{x}=f(x)$ in the input-output system (1). Thus, how to learn the entire system (1) with the desired property related to the influence of the input signal is still challenging.

We propose a novel method to learn a dynamical system consisting of neural networks considering the input-output stability. The notion of the input-output stability is often used together with the Hamilton-Jacobi inequality for controller design of the target system in the field of control theory. The Hamilton-Jacobi inequality is one of the sufficient conditions for the input-output stability. The feature of this condition is that the variable for an input signal $u$ does not appear in the expression, i.e., we do not need to evaluate the condition for unknown inputs [12]. To the best of our knowledge, this is the first work that establishes a learning method for the dynamical systems consisting of neural networks using the Hamilton-Jacobi inequality.

The contributions of this paper are as follows:

•

This paper derives differentiable projections to the space satisfying the Hamilton-Jacobi inequality.
•

This paper presents the learning method for the input-output system proven to always satisfy the Hamilton-Jacobi inequality.
•

This paper also provides a loss function derived from the Hamilton-Jacobi inequality. By combining this loss function with the projection described above, efficient learning can be expected.
•

This paper presents experiments using two types of benchmarks to evaluate our method.

2 Background

This section describes the $\mathcal{L}_{2}$ stability, a standard definition of the input-output stability, and the Hamilton-Jacobi inequality.

First, we define the $\mathcal{L}_{2}$ stability of the nonlinear dynamical system (1). If the norm ratio of the output signal to the input signal is bounded, the system is $\mathcal{L}_{2}$ stable and this norm ratio is called the $\mathcal{L}_{2}$ gain. The $\mathcal{L}_{2}$ norm on the input and output signal space is used for the definition of the $\mathcal{L}_{2}$ stability. To deal with the $\mathcal{L}_{2}$ stability on the nonlinear system (1), we assume that the origin $x\equiv 0$ of the internal system $\dot{x}=f(x)$ is an asymptotically stable equilibrium point and $h(0)=0$ . Since the translation of any asymptotically stable equilibrium points is possible, this assumption can be satisfied without loss of generality.

Definition 1.

The $\mathcal{L}_{2}$ norm is defined as $\|x\|_{\mathcal{L}_{2}}:=\sqrt{\int_{0}^{\infty}\|x(t)\|^{2}\differential t}$ . If there exists $\gamma\geq 0$ and a function $\beta$ on a domain $D\subset\mathbb{R}^{n}$ such that

\displaystyle\|y\|_{\mathcal{L}_{2}}\leq\gamma\|u\|_{\mathcal{L}_{2}}+\beta(x_{0}),

(2)

then the system (1) is $\mathcal{L}_{2}$ stable, where the function $\beta(\cdot)$ is non-negative and $\beta(0)=0$ . Furthermore, the minimum $\gamma$ satisfying (2) is called the $\mathcal{L}_{2}$ gain of the nonlinear system (1).

Next, we describe a sufficient condition of the $\mathcal{L}_{2}$ stability using the Lyapunov function $V:D\rightarrow\mathbb{R}$ , where the function $V$ is positive semi-definite, i.e., $V(x)\geq 0$ and $V(0)=0$ . The Hamilton-Jacobi inequality is an input-independent sufficient condition.

Proposition 1 ([5, Theorem 5.5]).

Let $f$ be locally Lipschitz and let $G$ and $h$ be continuous. If there exist a constant $\gamma>0$ and a continuously differentiable positive semi-definite function $V:D\subset\mathbb{R}^{n}\rightarrow\mathbb{R}$ such that

\displaystyle\begin{aligned} &\nabla V^{\mathrm{T}}(x)f(x)+\frac{1}{2\gamma^{2}}\|G^{\mathrm{T}}(x)\nabla V(x)\|^{2}+\frac{1}{2}\|h(x)\|^{2}\leq 0,\quad\forall x\in D\setminus\{0\},\end{aligned}

(3)

then the system (1) is $\mathcal{L}_{2}$ stable and the $\mathcal{L}_{2}$ gain is less than or equal to $\gamma$ . This condition is called the Hamilton-Jacobi inequality.

The above proposition can be generalized to allow the more complicated situations such as limit-cycle and bistable cases, where the domain $D$ contains multiple asymptotically stable equilibrium points. The equilibrium point assumed in this proposition can be replaced with positive invariant sets by extending the $\mathcal{L}_{2}$ norm [13]. Furthermore, by mixing multiple Lyapunov functions, this proposition can be generalized around multiple isolated equilibrium points.

3 Method

The goal of this paper is to learn the $\mathcal{L}_{2}$ stable system represented by using neural networks $(f,G,h)$ . The Hamiltonian-Jacobi inequality, which implies the $\mathcal{L}_{2}$ stability, is expressed by $(f,G,h)$ . We present a method to project $(f,G,h)$ onto the space where the Hamilton-Jacobi inequality holds.

3.1 Modification of nonlinear systems

Supposing $f_{\bf n}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ , $G_{\bf n}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n\times m}$ , and $h_{\bf n}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{l}$ , a triplet map $(f_{\bf n},G_{\bf n},h_{\bf n})$ denote as nominal dynamics. Introducing a distance on the triplet maps, the nearest triplet satisfying the Hamilton-Jacobi inequality from the nominal dynamics $(f_{\bf n},G_{\bf n},h_{\bf n})$ is called modified dynamics $(f_{\bf m},G_{\bf m},h_{\bf m})$ . The purpose of this section is to describe the modified dynamics $(f_{\bf m},G_{\bf m},h_{\bf m})$ associated with the nominal dynamics $(f_{\bf n},G_{\bf n},h_{\bf n})$ by analytically deriving a projection onto the space satisfying the Hamilton-Jacobi inequality.

The problem of finding the modified dynamics $(f_{\bf m},G_{\bf m},h_{\bf m})$ is written as a quadratic constraint quadratic programming (QCQP) problem for the nominal dynamics $(f_{\bf n},G_{\bf n},h_{\bf n})$ . Since there is generally no analytical solution for QCQP problems, we aim to find the particular solution by adjusting the distance on the triplets.

To prepare for the following theorem, we define the ramp and the clamp functions as

\displaystyle\mathrm{R}(x)\triangleq\begin{cases}0,&x\leq 0\\ x,&x>0\end{cases},\quad\mathrm{C}(x;a,b)\triangleq\begin{cases}a,&x\leq a\\ x,&a<x\leq b\\ b,&x>b\end{cases},

and define the Hamilton-Jacobi function as

\displaystyle\mathrm{HJ}(f,G,h)\triangleq\nabla V^{\mathrm{T}}f+\frac{1}{2\gamma^{2}}\|G^{\mathrm{T}}\nabla V\|^{2}+\frac{1}{2}\|h\|^{2},

(4)

where $V$ is a given positive definite function. A way to design $V$ is to determine the desired positive invariant set and design the increasing function around this set. For example, if the target system has a unique stable point, $x=0$ can be regarded as the positive invariant set and the increasing function $V$ can be designed as $V(x)=\tfrac{1}{2}x^{2}$ .

Theorem 1.

Consider that the following optimal problem:


	$\displaystyle\underset{f_{\bf m},G_{\bf m},h_{\bf m}}{\text{\rm\bf minimize}}\quad\frac{(1-k)}{\\|\nabla V\\|}\\|f_{\bf m}-f_{\bf n}\\|+\frac{k}{2\gamma^{2}}\\|G_{\bf m}-G_{\bf n}\\|^{2}+\frac{k}{\\|\nabla V\\|^{2}}\\|h_{\bf m}-h_{\bf n}\\|^{2}$		(5a)
	$\displaystyle\text{\rm\bf subject to}\quad\mathrm{HJ}(f_{\bf m},G_{\bf m},h_{\bf m})\leq 0,$		(5b)

where $k\in[0,1]$ . The solution of (5) is given by

	$\displaystyle f_{\bf m}$	$\displaystyle=f_{\bf n}-\frac{1}{\\|\nabla V\\|^{2}}\mathrm{R}\left(V_{f}+k^{2}V_{Gh}\right)\nabla V,$
	$\displaystyle G_{\bf m}$	$\displaystyle=G_{\bf n}-\left(1-\sqrt{\mathrm{C}\left(-\tfrac{V_{f}}{V_{Gh}};k^{2},1\right)}\right)P_{V}G_{\bf n},$
	$\displaystyle h_{\bf m}$	$\displaystyle=\sqrt{\mathrm{C}\left(-\tfrac{V_{f}}{V_{Gh}};k^{2},1\right)}h_{\bf n},$

where

\displaystyle V_{f}\triangleq\nabla V^{\mathrm{T}}f_{\bf n},\quad V_{Gh}\triangleq\frac{1}{2\gamma^{2}}\|G_{\bf n}^{\mathrm{T}}\nabla V\|^{2}+\frac{1}{2}\|h_{\bf n}\|^{2},\quad P_{V}\triangleq\frac{\nabla V\nabla V^{\mathrm{T}}}{\|\nabla V\|^{2}}.

Proof:.

See Appendix A. ∎

The objective function of (5a) is a new distance between the nominal dynamics $(f_{\bf n},G_{\bf n},h_{\bf n})$ and the modified dynamics $(f_{\bf m},G_{\bf m},h_{\bf m})$ . This new distance allows the derivation of analytical solutions by combining three distances of $f$ , $G$ , and $h$ .

Focusing on the objective function (5a), the constant $k$ represents the ratio of the distance scale of $f$ to $G$ and $h$ . When $k=0$ , the result of this problem (5) are consistent with the projection of the conventional method that guarantee internal stability [9]. As the constant $k$ converges $1$ , the modification method of Theorem 1 approaches $G_{\bf m}$ and $h_{\bf m}$ to $G_{\bf n}$ and $h_{\bf n}$ , respectively. In this case, the objective function (5a) becomes the distance between $f_{\bf m}$ to $f_{\bf n}$ . Therefore, the following corollary is satisfied.

Corollary 1.

The solution of


	$\displaystyle\underset{f_{\bf m}}{\text{\rm\bf minimize}}\quad\\|f_{\bf m}-f_{\bf n}\\|$		(6a)
	$\displaystyle\text{\rm\bf subject to}\quad\mathrm{HJ}(f_{\bf m},G_{\bf n},h_{\bf n})\leq 0,$		(6b)

is given by

\displaystyle f_{\bf m}

\displaystyle=f_{\bf n}-\frac{1}{\|\nabla V\|^{2}}\mathrm{R}\left({\rm HJ}(f_{\bf n},G_{\bf n},h_{\bf n})\right)\nabla V.

Proof:.

This solution is easily derived from Theorem 1. ∎

Corollary 1 derives a solution of a linear programming problem rather than QCQP problems. Replacing the Hamilton-Jacobi function $\mathrm{HJ}(f_{\bf m},G_{\bf n},h_{\bf n})$ with the time derivative of a positive definite function $\nabla V^{\mathrm{T}}f$ , this corollary matches the result of the conventional study [9].

When the map $h_{\bf m}$ is fixed as $h_{\bf n}$ , a similar solution as Theorem 1 is derived. Although Corollary 1 is proved by changing $k$ , the modified dynamics with the fixed $h_{\bf n}$ are not derived. We reprove this modified dynamics in a similar way to Theorem 1.

Corollary 2.

Consider the following problem:


	$\displaystyle\underset{f_{\bf m},G_{\bf m}}{\text{\rm\bf minimize}}\quad\frac{(1-k)}{\\|\nabla V\\|}\\|f_{\bf m}-f_{\bf n}\\|+\frac{k}{2\gamma^{2}}\\|G_{\bf m}-G_{\bf n}\\|^{2}$		(7a)
	$\displaystyle\text{\rm\bf subject to}\quad\mathrm{HJ}(f_{\bf m},G_{\bf m},h_{\bf n})\leq 0,$		(7b)

where $k\in[0,1]$ . The solution of (7) is given by

\displaystyle f_{\bf m}=f_{\bf n}-\frac{1}{\|\beta\|^{2}}\mathrm{R}\left(V_{fh}+k^{2}V_{G}\right)\nabla V,\quad G_{\bf m}=G_{\bf n}-\left(1-\sqrt{\mathrm{C}\Big{(}-\tfrac{V_{fh}}{V_{G}};k^{2},1\Big{)}}\right)P_{V}G_{\bf n},

(8)

where

\displaystyle V_{fh}\triangleq\nabla V^{\mathrm{T}}f_{\bf n}+\frac{1}{2}\|h_{\bf n}\|^{2},\quad V_{G}\triangleq\frac{1}{2\gamma^{2}}\|G_{\bf n}^{\mathrm{T}}\nabla V\|^{2}.

Proof:.

See Appendix A. ∎

3.2 Loss function

We represent $(f_{\bf n},G_{\bf n},h_{\bf n})$ as neural networks and denote $(f_{\bf m},G_{\bf m},h_{\bf m})$ as the $\mathcal{L}_{2}$ stable dynamics modified by Theorem 1, Corollary 1, or 2. Note that the modification depends on the candidate of $\mathcal{L}_{2}$ gain $\gamma$ . Figure 2 (A) shows the sketch of this modification, where the blue region satisfies the Hamilton-Jacobi inequality.

Since the nonlinear system of the modified dynamics $(f_{\bf m},G_{\bf m},h_{\bf m})$ is represented as ordinary differential equations (ODEs) consisting of the differentiable functions $f_{\bf n}$ , $G_{\bf n}$ , and $h_{\bf n}$ , the techniques of training neural ODEs can be applied [14]. Once a loss function is designed, the parameters of neural networks in the modified system can be learned from given data.

\displaystyle\mathrm{Loss}

\displaystyle=\mathrm{E}_{(x_{0},u,\hat{y})\in\cal{D}}[||y-\hat{y}||_{\mathcal{L}_{2}}^{2}]+\lambda L_{\rm HJ}+\alpha\gamma^{2},

(9)

where $\lambda$ and $\alpha$ are non-negative coefficients.

The first term shows the prediction error of the output signal $y$ ( Figure 2 (A)). A dataset $\cal{D}$ consists of tuples $(x_{0},u,y)$ where the initial value $x_{0}$ , the input signal $u$ , and the output signal $y$ . The predicted output $\hat{y}$ is calculated from $x_{0}$ , $u$ , and the modified dynamics $(f_{\bf m},G_{\bf m},h_{\bf m})$ .

The second term aims to improve the nominal dynamics $(f_{\bf n},G_{\bf n},h_{\bf n})$ closer to the modified dynamics $(f_{\bf m},G_{\bf m},h_{\bf m})$ and is defined as

\displaystyle L_{\mathrm{HJ}}

\displaystyle=\mathrm{E}_{x}[\mathrm{R}(\mathrm{HJ}(f_{\bf n},G_{\bf n},h_{\bf n})(x)+\varepsilon)],

where $\varepsilon$ is a positive constant ( Figure 2 (B)). Since this term is a form of the hinge loss, $\varepsilon$ represents the magnitude of the penalties for the Hamilton-Jacobi inequality. To evaluate this inequality for any $x\in D$ , we introduce a distribution of $x$ over the domain $D$ . In our experiments, this distribution is decided as a Gaussian distribution ${\cal N}(\mu,\sigma^{2})$ where the mean $\mu$ is placed at the asymptotically stable point and the variance $\sigma^{2}$ is an experimental parameter. Without this second term of the loss function, there are degrees of freedom in the nominal dynamics, i.e., multiple nominal dynamics give the same loss by the projection, which negatively affects the parameter learning.

The modifying parameter $\gamma$ can be manually designed for the application or automatically trained from data by introducing the last term. This training explores smaller $\gamma$ while keeping the same level of the prediction error (Figure 2 (C)).

4 Related work

Estimating parameters of a given system is traditionally studied as system identification. In the field of system identification, much research on the identification of linear systems have been done, where the maps $(f,G,h)$ in Eq. (1) assumes to be linear [15]. Linear state-space models include identification methods by impulse response like the Eigensystem Realization Algorithm (ERA)[16], by the state-space representation like Multivariable Output Error State sPace (MOESP)[17] and ORThogonal decomposition method (ORT)[18]. System identification methods for non-state-space models, unlike our target system, contain Dynamic Mode Decomposition with control (DMDc)[19] and its nonlinear version, Sparse Identification of Nonlinear DYnamics with control (SINDYc)[20]. Also, the nonlinear models such as the nonlinear ARX model [21] and the Hammerstein-Wiener model [22] have been developed and often trained by error minimization. For example, the system identification method using a piece-wise ARX model allows more complicated functions by relaxing the assumption of linearity [21].

These traditional methods often concern gray-box systems, where the maps $(f,G,h)$ in Eq. (1) are partially known [23]. This paper deals with a case of black-box systems, where $f$ , $G$ , and $h$ in the system (1) are represented by neural networks. Our method can be used regardless of whether all $f$ , $G$ , and $h$ functions are parameterized using neural networks. So, our method can be easily applied to application-dependent gray-box systems when the functions $f$ , $G$ , and $h$ are differentiable with respect to the parameters.

Because the input-output system (1) can be regarded as a differential equation, our study is closely related to the method of combining neural networks and ODEs [7, 14]. These techniques have been improved in recent years, including discretization errors and computational complexity. Although we used an Euler method for simplicity, we can expect that learning efficiency would be further improved by using these techniques.

The internal stability is a fundamental property in ODEs, and learning a system with this property using neural networks plays an important role in the identification of real-world systems [24]. In particular, the first method to guarantee the internal stability of the trained system has been proposed in [9]. Furthermore, another method [25] extends this method to apply positive invariant sets, e.g. limit cycles and line attractors. Encouraged by these methods based on the Lyapunov function, our method further generalizes these methods using the Hamilton-Jacobi inequality to guarantee the input-output stability.

In this paper, the Lyapunov function $V$ is considered to be given, but this function can be learned from data. Since a pair of dynamics (1) and $V$ has redundant degrees of freedom, additional assumptions are required to determine $V$ uniquely. In [9], it is realized by limiting the dynamics to the internal system and restricting $V$ to a convex function [26].

Lyapunov functions are also used to design controllers, where the whole system should satisfy the stability condition. A method for learning such a controller using neural nets has been proposed [27]. This method deals with optimization problems over the space that satisfies the stability condition, which is similar to our method. Whereas we solve the QCQP problem to derive the projection onto the space, this method uses a Satisfiability Modulo Theories (SMT) solver to satisfy this condition. The method has also been extended to apply unknown systems [28].

Although this paper deals with deterministic systems, neural networks for stochastic dynamics with the variational inference is well studied [6, 29, 30, 31]. Guaranteeing the internal stability of trained stochastic dynamics is important for noise filtering and robust controller design [10].

5 Experiments

We conduct two experiments to evaluate our proposed method. The first experiment uses a benchmark dataset generated from a nonlinear model with multiple asymptotically equilibrium points. In the next experiment, we applied our method to a biological system using a simulator of the glucose-insulin system.

5.1 Experimental setting

Before describing the results of the experiments, this section explains the evaluation metrics and our experimental setting.

For evaluation metrics, we define the root mean square error (RMSE) and the average $\mathcal{L}_{2}$ gain (GainIO) for the given input and output signals as follows:

\displaystyle\text{RMSE}\triangleq\sqrt{\frac{1}{N}\sum_{i=1}^{N}\|y_{i}-\hat{y}_{i}\|^{2}_{\mathcal{L}_{2}}},\quad\text{GainIO}\triangleq\frac{1}{N}\sum^{N}_{i=1}\frac{\|\hat{y}_{i}\|_{\mathcal{L}_{2}}}{\|u_{i}\|_{\mathcal{L}_{2}}},

where $N$ is the number of signals in the dataset. $u_{i}(\cdot)$ and $y_{i}(\cdot)$ are the input and output signal at the $i$ -th index, respectively. The prediction signal $\hat{y}_{i}(\cdot)$ is computed from $u_{i}(\cdot)$ , the trained dynamics, and the initial state. Note that the integral contained in the $\mathcal{L}_{2}$ norm is approximated by a finite summation. The RMSE is a metric of the prediction errors related to the output signal, and the GainIO is a metric of the property of the $\mathcal{L}_{2}$ stability. Whether the target system satisfies or does not satisfy the $\mathcal{L}_{2}$ stability, the GainIO with a given finite-size dataset can be calculated. The GainIO error is defined by the absolute error between the GainIO of the test dataset and that of the prediction.

In our experiments, 90 $\%$ of the dataset is used for training and the remaining 10 $\%$ is used for testing. We retry five times for all experiments and show the mean and standard deviations of the metrics.

For simplicity in our experiments, the sampling step $\Delta t$ for the output $y$ is set as constant and the Euler method is used to solve ODEs. $x_{0}$ is put at an asymptotically stable point for each benchmark and is known. In this experiment, to prevent the state from diverging during learning of dynamics, the clipping operation is used so that the absolute values of the states are less than ten.

For comparative methods, we use vanilla neural networks, ARX, ORT [18], MOESP [17], and piece-wise ARX[21]. In the method of vanilla neural networks, the maps $(f,G,h)$ in the nonlinear system (1) is represented by using three neural networks, i.e., this method is consistent with a method used in Figure 1. To determine the hyperparameters of comparative methods except for neural networks, the grid search is used. Note that these comparative methods only consider the prediction errors.

For training each method with neural networks, an NVIDIA Tesla T4 GPU was used. Our experiments are totally run on 20 GPUs over about three days.

5.2 Bistable model benchmark

The first experiment is carried out using a bistable model, which is known as a bounded system with multiple asymptotically equilibrium points. This bistable model is defined as

\displaystyle\dot{x}=x(1-x^{2})+u,\quad x(0)=-1,\quad y=x.

The internal system of this model has two asymptotically stable equilibrium points $x\equiv 1,-1$ .

We generate 1000 input and output signals for this experiment. To construct this dataset, we prepare input signals using positive and negative pulse wave signals whose pulse width is changed at random. The input and output signals on the period $[0,10]$ are sampled with an interval $\Delta t=0.1$ . In this benchmark, we set the number of dimensions of the internal system as one and use a fixed function $V(x)=\min((x-1)^{2},(x+1)^{2})$ , a mixture of the two positive definite functions.

In the result of our experiments, we name the proposed methods modified by Theorem 1, Corollary 1, and 2 as DIOS-fgh, DIOS-f, and DIOS-fg, respectively. In these methods, the parameters of our loss function are set as $\lambda=0$ and $\alpha=0.01$ . Also, DIOS-fgh+ uses $\lambda=0.01$ and $\alpha=0.01$ under the same conditions as DIOS-fgh. For this example (1000 input signal), it took about 1 hour using 1 GPU to learn one model training.

The results of the RMSE and the GainIO error in this experiment are shown in Figure 3. Figures 3 (A) and (B) demonstrate that a piece-wise ARX model (PWARX) gives very low RMSE but its GainIO error is high. Our proposed methods achieve a small GainIO error while keeping the RMSE. Note that linear models such as MOESP and ORT only approach one point although the bistable model has two asymptotically stable equilibrium points. Figures 3 (C) and (D) display the effect of dataset sizes. When the dataset size was varied to 100, 1000, and 3000, the result of the larger dataset provided the smaller RMSE for all methods. The vanilla neural networks do not consider the $\mathcal{L}_{2}$ gain, so the GainIO error was large in the case of the low RMSE.

Figure 4 shows the relationship between $x$ and $f(x)$ in the trained system in (1) to compare the vanilla neural networks, DIOS-fgh and DIOS-fgh+. DIOS-fgh and DIOS-fgh+ successfully find two stable points, i.e., the trained function $f(x)$ had roots of two stable points $x=\pm 1$ and an unstable point $x=0$ . Especially, these stable points were robustly estimated in the trained system using our presented loss function (DIOS-fgh+) even when the dataset size was small. The vanilla neural networks failed to obtain the two stable points in all cases.

5.3 Glucose-insulin benchmark

This section addresses an example of the identification of biological systems. We consider the task of learning the glucose-insulin system using a simulator [32] to construct responses for various inputs and evaluate the robustness of the proposed method for unexpected inputs. This simulator outputs the concentrations of plasma glucose $y_{1}$ and insulin $y_{2}$ for the appearance of plasma glucose per minute $u$ . To determine the realistic input $u$ , we adopt another model, an oral glucose absorption model [33].

Using this simulator, 1000 input and output signals are synthesized for this experiment. The input and output signals are sampled with a sampling interval $\Delta t=1$ and 1000 steps for each sequence. In this benchmark, we set the number of dimensions of the internal system as six, fix a positive definite function $V(x)=x^{2}$ , and use our loss function with $\lambda=0.001$ and $\alpha=0.001$ . Training one model for this examples (1000 input signal) tooks about 7.5hours using 1GPU. The hyperparameters including the number of layers in the neural networks, the learning rate, optimizer, and the weighted decay are determined using the tree-structured Parzen estimator (TPE) implemented in Optuna [34].

The RMSE of vanilla and our proposed method are 0.0103 and 0.0050, respectively. So, from the perspective of RMSE, these methods achieved almost the same performance. Figure 5 shows input and output signals in the test dataset and the predicted output by vanilla neural networks and our method (DIOS-fgh+). The $\mathcal{L}_{2}$ stability of the system using the vanilla neural networks is not guaranteed. Since the proposed method guarantees the $\mathcal{L}_{2}$ stability, the output signals of DIOS-fgh+ are bounded even if the input signals are unexpectedly large.

To demonstrate this, we conducted an additional experiment using the trained system. Figure 6 shows the transition of output behavior caused by the magnitude of the input signal changing from 2 to 10. Note that the maximum magnitude in the training dataset is one. In this experiment, $\Delta t$ was changed to $0.01$ and the clipping operation was removed to deal with the large values of the state.

This result shows the output of the vanilla neural networks quickly diverged with an unexpectedly large input. Contrastingly, the output behavior of our proposed method is always bounded. Therefore, we actually confirmed that our proposed method satisfies the $\mathcal{L}_{2}$ stability.

6 Conclusion

This paper proposed a learning method for nonlinear dynamical system guaranteeing the $\mathcal{L}_{2}$ stability. By theoretically deriving the projection of a triplet $(f,G,h)$ to the space satisfying the Hamilton-Jacobi inequality, our proposed method realized the $\mathcal{L}_{2}$ stability of trained systems. Also, we introduced a loss function to empirically achieve a smaller $\mathcal{L}_{2}$ gain while reducing prediction errors. We conducted two experiments to learn dynamical systems consisting of neural networks. The first experiment used a nonlinear model with multiple asymptotically equilibrium points. The result of this experiment showed that our proposed method can robustly estimate a system with multiple stable points. In the next experiment, we applied our method to a biological system using a simulator of the glucose-insulin system. It was confirmed that the proposed method can successfully learn a proper system that works well under unexpectedly large inputs due to the $\mathcal{L}_{2}$ stability. There is a limitation that our method cannot apply the system without the $\mathcal{L}_{2}$ stability. Future work will expand the $\mathcal{L}_{2}$ stability-based method to a more generalized learning method by dissipativity and apply our approach of this study to stochastic systems.

Acknowledgements

This research was supported by JST Moonshot R&D Grant Number JPMJMS2021 and JPMJMS2024. This work was also supported by JSPS KAKENHI Grant No.21H04905 and CREST Grant Number JPMJCR22D3, Japan. This paper was also based on a part of results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).

References

[1] Jan Swevers, Walter Verdonck, and Joris De Schutter. Dynamic model identification for industrial robots. IEEE control systems magazine, 27(5):58–71, 2007.
[2] Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the national academy of sciences, 2016.
[3] Timothy S. Gardner, Diego Di Bernardo, David Lorenz, and James J. Collins. Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301(5629):102–105, 2003.
[4] Geoffrey Roeder, Paul K Grant, Andrew Phillips, Neil Dalchau, and Edwards Meeds. Efficient amortised bayesian inference for hierarchical and nonlinear dynamical systems. Proceeding of International Conference on Machine Learning (ICML), 2019.
[5] Hassan K. Khalil. Nonlinear systems. Prentice-Hall, 3rd edition, 2002.
[6] Rahul G. Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In Proceedings of the AAAI Conference on Artificial Intelligence, page 2101–2109, 2017.
[7] Ricky T.Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.
[8] Yaofeng Desmond Zhong, Biswadip Dey, and Amit Chakraborty. Symplectic ode-net: Learning hamiltonian dynamics with control. In Proceeding of International Conference on Learning Representations, 2019.
[9] Gaurav Manek and J. Zico Kolter. Learning stable deep dynamics models. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.
[10] Nathan Lawrence, Philip Loewen, Michael Forbes, Johan Backstrom, and Bhushan Gopaluni. Almost surely stable deep dynamics. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, 2020.
[11] Andreas Schlaginhaufen, Philippe Wenk, Andreas Krause, and Florian Dörfler. Learning stable deep dynamics models for partially observed or delayed dynamical systems. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, 2021.
[12] Arjan van der Schaft. $\mathcal{L}_{2}$ -Gain and Passivity Techniques in Nonlinear Control. Springer Publishing Company, Incorporated, 3rd edition, 2016.
[13] W.M. Haddad and V.S. Chellaboina. Nonlinear Dynamical Systems and Control: A Lyapunov-Based Approach. Princeton University Press, 2008.
[14] Xinshi Chen. Review: Ordinary differential equations for deep learning. CoRR, abs/1911.00502, 2019.
[15] Tohru Katayama. Subspace Method for System Identification. Springer, 2005.
[16] Jer-Nan Juang and Richard S Pappa. An eigensystem realization algorithm for modal parameter identification and model reduction. Journal of guidance, control, and dynamics, 8(5):620–627, 1985.
[17] Michel Verhaegen and Patrick Dewilde. Subspace model identification part 1. the output-error state-space model identification class of algorithms. International journal of control, 56(5):1187–1210, 1992.
[18] Tohru Katayama, Hideyuki Tanaka, and Takeya Enomoto. A simple subspace identification method of closed-loop systems using orthogonal decomposition. IFAC Proceedings Volumes, 38(1):512–517, 2005.
[19] Joshua L Proctor, Steven L Brunton, and J Nathan Kutz. Dynamic mode decomposition with control. SIAM Journal on Applied Dynamical Systems, 15(1):142–161, 2016.
[20] Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Sparse identification of nonlinear dynamics with control (SINDYc). IFAC-PapersOnLine, 49(18):710–715, 2016.
[21] Andrea Garulli, Simone Paoletti, and Antonio Vicino. A survey on switched and piecewise affine system identification. In IFAC Proceedings Volumes, volume 45, pages 344–355. IFAC, 2012.
[22] Adrian Wills, Thomas B. Schon, Lennart Ljung, and Brett Ninness. Identification of hammerstein-wiener models. Automatica, 49(1):70–81, 2013.
[23] Lennart Ljung. System identification. In Signal analysis and prediction, pages 163–173. Springer, 1998.
[24] Martin Braun and Martin Golubitsky. Differential equations and their applications. Springer, 1983.
[25] Naoya Takeishi and Yoshinobu Kawahara. Learning dynamics models with stable invariant sets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9782–9790, 2021.
[26] Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In International Conference on Machine Learning, pages 146–155, 2017.
[27] Ya-Chien Chang, Nima Roohi, and Sicun Gao. Neural lyapunov control. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.
[28] Vrushabh Zinage and Efstathios Bakolas. Neural koopman lyapunov control. arXiv:2201.05098, 2022.
[29] Daniel Gedon, Niklas Wahlström, Thomas B. Schön, and Lennart Ljung. Deep state space models for nonlinear system identification. In Proceedings of the 19th IFAC Symposium on System Identification (SYSID), 2021.
[30] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems (NeurIPS), volume 28, 2015.
[31] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. ArXiv, abs/1411.7610, 2014.
[32] A. De Gaetano and O. Arino. A statistical approach to the determination of stability for dynamical systems modelling physiological processes. Mathematical and Computer Modelling, 31(4):41–51, 2000.
[33] Chiara Dalla Man, Robert A. Rizza, and Claudio Cobelli. Meal simulation model of the glucose-insulin system. IEEE Transactions on Biomedical Engineering, 54(10):1740–1749, 2007.
[34] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.

Appendix A Proof of Theorem 1 and Corollary 2

To prove Theorem 1 and Corollary 2, we use the following particular solution of QCQP problems.

Lemma 1.

Suppose that $A\in\mathbb{R}^{n\times n}$ is a positive definite matrix, a vector $b\in\mathbb{R}^{n}$ , a scalar $c$ , and positive constants $k_{\rm x}$ , $k_{\rm y}$ . The solution of this problem

	minimize	$\displaystyle k_{\rm x}x^{\mathrm{T}}Ax+k_{\rm y}\|y\|$
	subject to	$\displaystyle y\geq x^{\mathrm{T}}Ax-2b^{\mathrm{T}}x+c.$

is given by

	$\displaystyle x^{*}$	$\displaystyle=\left(1-\sqrt{\mathrm{C}(1-\tilde{c};\tilde{k}_{\rm x}^{2},1)}\right)A^{-1}b,$
	$\displaystyle y^{*}$	$\displaystyle=\mathrm{R}\left(c-(1-\tilde{k}_{x}^{2})b^{\mathrm{T}}A^{-1}b\right),$

where

\displaystyle\tilde{c}=\frac{c}{b^{\mathrm{T}}A^{-1}b},\quad\tilde{k}_{\rm x}=\frac{k_{\rm x}}{k_{\rm x}+k_{\rm y}}.

Proof:.

See Appendix B. ∎

To keep the notation short in the following proofs, we define

\displaystyle\beta\triangleq\nabla V.

Proof of Theorem 2: For the Hamilton-Jacobi function $\mathrm{HJ}(f_{\bf m},G_{\bf m},h_{\bf m})$ of the modified dynamics $(f_{\bf m},G_{\bf m},h_{\bf m})$ , the map $f_{\bf m}$ and each row of the map $G_{\bf m}$ don’t depend on the orthogonal vector of $\beta$ . Hence, $f_{\bf m}$ and $G_{\bf m}$ are written as

\displaystyle f_{\bf m}=f_{\bf n}-\beta p,\quad G_{\bf m}=G_{\bf n}-\beta q^{\mathrm{T}},

where $p$ is scalar and $q\in\mathbb{R}^{m}$ . $h_{\bf m}$ depends only on the norm in $\mathrm{HJ}(f_{\bf m},G_{\bf m},h_{\bf m})$ , i. e.,

\displaystyle h_{\bf m}=(1-r)hn,

where $r$ is scalar.

The optimal function (5a) is rewritten as the $p$ , $q$ , and $r$ , i.e.,

	$\displaystyle\frac{k_{1}}{\\|\beta\\|}\\|f_{\bf m}-f_{\bf n}\\|+\frac{k_{2}}{2\gamma^{2}}\\|G_{\bf m}-G_{\bf n}\\|^{2}+\frac{k_{2}}{2\\|\beta\\|^{2}}\\|h_{\bf m}-h_{\bf n}\\|$
	$\displaystyle=k_{1}\|p\|+k_{2}\frac{\\|\beta\\|^{2}}{2\gamma^{2}}\\|q\\|^{2}+k_{2}\frac{\\|h_{\bf n}\\|}{2\\|\beta\\|^{2}}r^{2}.$

Also, the condition of constraint (5b) is rewritten as

	$\displaystyle\mathrm{HJ}_{\beta}(f_{\bf m},G_{\bf m},h_{\bf m})$
	$\displaystyle=\beta^{\mathrm{T}}f_{\bf m}+\frac{1}{2\gamma^{2}}\\|G_{\bf m}^{\mathrm{T}}\beta\\|^{2}+\frac{1}{2}\\|h_{\bf m}\\|^{2}$
	$\displaystyle=\beta^{\mathrm{T}}(f_{\bf n}-\beta p)+\frac{1}{2\gamma^{2}}\\|(G_{\bf n}-\beta q^{\mathrm{T}})^{\mathrm{T}}\beta\\|^{2}+\frac{1}{2}\Big{\\|}(1-r)h_{\bf n}\\|^{2}$
	$\displaystyle=-\\|\beta\\|^{2}p+\frac{1}{2\gamma^{2}}\\|\beta\\|^{4}\\|q\\|^{2}+\frac{1}{2}\\|\hat{h}\\|^{2}r^{2}-\frac{1}{\gamma^{2}}\\|\beta\\|^{2}\beta^{\mathrm{T}}G_{\bf n}q-\\|h_{\bf n}\\|^{2}r+\mathrm{HJ}_{\beta}(f_{\bf n},G_{\bf n},h_{\bf n})$
	$\displaystyle\leq 0.$

If $x\triangleq[q^{\mathrm{T}},r]^{\mathrm{T}}$ , $y\triangleq p$ , and

\displaystyle A\triangleq\begin{bmatrix}\frac{\|\beta\|^{2}}{2\gamma^{2}}I_{m}&0\\ 0&\frac{\|h_{\bf n}\|^{2}}{2\|\beta\|^{2}}\end{bmatrix},\quad b\triangleq\begin{bmatrix}\frac{1}{2\gamma^{2}}G_{\bf n}^{\mathrm{T}}\beta\\ \frac{\|h_{\bf n}\|^{2}}{2\|\beta\|^{2}}\end{bmatrix}\quad c\triangleq\frac{1}{\|\beta\|^{2}}\Big{(}\mathrm{HJ}_{\beta}(f_{\bf n},G_{\bf n},h_{\bf n})\Big{)},

then this optimal problem (5) of this theorem becomes a problem used in Lemma 1. The optimal point of $p$ , $q$ and $r$ is given by

	$\displaystyle p^{*}$	$\displaystyle=\frac{1}{\\|\beta\\|^{2}}\mathrm{R}\left(V_{\hat{f}}+\hat{k}^{2}V_{G_{\bf n},h_{\bf n}}\right),$
	$\displaystyle q^{*}$	$\displaystyle=\frac{1}{\\|\beta\\|^{2}}\left(1-\sqrt{\mathrm{C}\Big{(}-\tfrac{V_{f_{\bf n}}}{V_{G_{\bf n},h_{\bf n}}};\tilde{k}^{2},1\Big{)}}\right)G_{\bf n}^{\mathrm{T}}\beta,$
	$\displaystyle r^{*}$	$\displaystyle=1-\sqrt{\mathrm{C}\Big{(}-\tfrac{V_{f_{\bf n}}}{V_{G_{\bf n},h_{\bf n}}};\tilde{k}^{2},1\Big{)}}.$

Therefore, Theorem 1 is derived.

Proof of Theorem 2:

Optimal maps $f_{\bf m}$ and $G_{\bf m}$ are also written as

\displaystyle f_{\bf m}=f_{\bf n}-\beta p,\quad G_{\bf m}=G_{\bf n}-\beta q^{\mathrm{T}}.

Hence, the objective function (7a) is rewritten as $p$ and $q$ function, i.e.,

\displaystyle\frac{k_{1}}{\|\beta\|}\|f_{\bf m}-f_{\bf n}\|+\frac{k_{2}}{2\gamma^{2}}\|G_{\bf m}-G_{\bf n}\|^{2}=k_{1}|p|+k_{2}\frac{\|\beta\|^{2}}{2\gamma^{2}}\|q\|^{2}.

Also the condition of constraint (7b) is written

	$\displaystyle\mathrm{HJ}_{\beta}(f_{\bf m},G_{\bf m},h_{\bf n})$
	$\displaystyle=\beta^{\mathrm{T}}f_{\bf m}+\frac{1}{2\gamma^{2}}\\|G_{\bf m}^{\mathrm{T}}\beta\\|^{2}+\frac{1}{2}\\|h_{\bf m}\\|^{2}$
	$\displaystyle=\beta^{\mathrm{T}}(f_{\bf n}-\beta p)+\frac{1}{2\gamma^{2}}\\|(G_{\bf n}-\beta q^{\mathrm{T}})^{\mathrm{T}}\beta\\|^{2}+\frac{1}{2}\\|h_{\bf n}\\|^{2}$
	$\displaystyle=-\\|\beta\\|^{2}p+\frac{1}{2\gamma^{2}}\\|\beta\\|^{4}\\|q\\|^{2}-\frac{1}{\gamma^{2}}\\|\beta\\|^{2}\beta^{\mathrm{T}}G_{\bf n}q+\mathrm{HJ}_{\beta}(f_{\bf n},G_{\bf n},h_{\bf n})$
	$\displaystyle\leq 0.$

If $x\triangleq q$ , $y\triangleq p$ , and

\displaystyle A\triangleq\frac{\|\beta\|^{2}}{2\gamma^{2}}I_{m},\quad b\triangleq\frac{1}{2\gamma^{2}}G_{\bf m}^{\mathrm{T}}\beta,\quad c\triangleq\frac{1}{\|\beta\|^{2}}\Big{(}\mathrm{HJ}_{\beta}(f_{\bf n},G_{\bf n},h_{\bf n})\Big{)},

then this optimal problem (7) becomes the problem used in Lemma 1. The optimal points of $p$ and $q$ are given by

	$\displaystyle p^{*}$	$\displaystyle=\frac{1}{\\|\beta\\|^{2}}\mathrm{R}\left(V_{f_{\bf n},h_{\bf n}}+\hat{k}_{2}^{2}(V_{G_{\bf n}})\right),$
	$\displaystyle q^{*}$	$\displaystyle=\frac{1}{\\|\beta\\|^{2}}\left(1-\sqrt{\mathrm{C}\Big{(}-\tfrac{V_{f_{\bf n},h_{\bf n}}}{V_{G_{\bf n}}};\tilde{k}^{2},1\Big{)}}\right)G_{\bf n}^{\mathrm{T}}\beta,$

Therefore, the solution of problem (7) becomes Eqs. (8).

Appendix B Proof of the particular solution of QCQP

This section presents the proof of the following QCQP problem.


minimize	$\displaystyle k_{\rm x}x^{\mathrm{T}}Ax+k_{\rm y}\|y\|$	(10a)
subject to	$\displaystyle y\geq x^{\mathrm{T}}Ax-2b^{\mathrm{T}}x+c,$	(10b)

where $A$ is a positive definite matrix. We classify this problem according to the parameter $A,b$ and $c$ .

First, the solution of this optimal problem is switched depending on the positive or negative value of $c$ .

Lemma 2.

If $c\leq 0$ then the solution of Eq. (10) is $(x,y)=(0,0)$ . If $c>0$ , the solution $x^{*}$ of Eq. (10) equals the solution of the following problem:

minimize

\displaystyle k_{x}x^{\mathrm{T}}Ax+k_{y}|x^{\mathrm{T}}Ax-2b^{\mathrm{T}}x+c|.

(11)

Furthermore, the solution $y^{*}$ is given by

\displaystyle y^{*}=k_{x}x^{*\mathrm{T}}Ax^{*}-2b^{\mathrm{T}}x^{*}+c.

(12)

Proof:.

The objective function (10a) is strictly convex and the minimum point is $(x,y)=(0,0)$ . If $c\leq 0$ , the origin $(x,y)=(0,0)$ satisfies the condition of constraint (10b). Therefore, the solution of Eq. (10) is $(x,y)=(0,0)$ .

If $c>0$ , $(x,y)=(0,0)$ does not satisfies the constraint condition (10b), and the optimal solution belong the boundary of the region satisfying (10b). Therefore, the solution point $(x^{*},y^{*})$ satisfies Eq. (12). ∎

Also, the solution of the new optimal problem (11) is switched by the ratio between $c$ and $b^{\mathrm{T}}A^{-1}b$ .

Lemma 3.

If $\tilde{c}>1-\tilde{k}_{x}^{2}$ then the solution $x^{*}$ of the optimal problem (11) is given by

\displaystyle x^{*}=(1-\tilde{k}_{x})A^{-1}b,

(13)

where

\displaystyle\tilde{c}=\frac{c}{b^{\mathrm{T}}A^{-1}b},\quad\tilde{k}_{x}=\frac{k_{x}}{k_{x}+k_{y}}.

If $\tilde{c}\leq 1-\tilde{k}_{x}^{2}$ , the solution $x^{*}$ equals the solution of the following QCQP problem, such that


minimize	$\displaystyle x^{\mathrm{T}}Ax$	(14a)
subject to	$\displaystyle x^{\mathrm{T}}Ax-2b^{\mathrm{T}}x+c=0.$	(14b)

Proof:.

The optimal problem (11) is split on the sign of $x^{*\mathrm{T}}Ax^{*}-2b^{\mathrm{T}}x^{*}+c$ .

Case $x^{*\mathrm{T}}Ax^{*}-2b^{\mathrm{T}}x^{*}+c<0$ :

The optimal problem of this case is written as

minimize

\displaystyle(k_{x}-k_{y})x^{\mathrm{T}}Ax+2k_{y}b^{\mathrm{T}}x-k_{y}c.

If $k_{x}-k_{y}\leq 0$ , there is no optimal point. Otherwise $k_{x}-k_{y}>0$ , the optimal point is written as

\displaystyle x^{*}=-\frac{k_{y}}{k_{x}-k_{y}}A^{-1}b.

The optimal point does not satisfies the condition $x^{*\mathrm{T}}Ax^{*}-2b^{\mathrm{T}}x^{*}+c<0$ , because

	$\displaystyle x^{\mathrm{T}}Ax^{}-2b^{\mathrm{T}}x^{*}+c$
	$\displaystyle=\frac{k_{y}^{2}}{(k_{x}-k_{y})^{2}}b^{\mathrm{T}}A^{-1}b+\frac{2k_{y}}{k_{x}-k_{y}}b^{\mathrm{T}}A^{-1}b+c>0,$

where $A^{-1}$ is a positive definite matrix and $c>0$ .

Case $x^{*\mathrm{T}}Ax^{*}-2b^{\mathrm{T}}x^{*}+c>0$ :

The problem (11) is written as

minimize

\displaystyle(k_{x}+k_{y})x^{\mathrm{T}}Ax-2k_{y}b^{\mathrm{T}}x+k_{y}c.

Hence, this optimal point is written as

	$\displaystyle x^{*}$	$\displaystyle=\frac{k_{y}}{k_{x}+k_{y}}A^{-1}b$
		$\displaystyle=(1-\tilde{k}_{x})A^{-1}b,$

where $\sqrt{A}$ is a square root of the positive definite matrix $A$ . The condition $x^{*\mathrm{T}}Ax^{*}-2b^{\mathrm{T}}x^{*}+c>0$ is rewritten by the previous optimal solution, such that

	$\displaystyle x^{\mathrm{T}}Ax^{}-2b^{\mathrm{T}}x^{*}+c$
	$\displaystyle=(1-\tilde{k}_{x})^{2}b^{\mathrm{T}}A^{-1}b-2(1-\tilde{k}_{x})b^{\mathrm{T}}A^{-1}b+c$
	$\displaystyle=-\left(1-\tilde{k}_{x}^{2}\right)b^{\mathrm{T}}A^{-1}b+c>0$
	$\displaystyle\Leftrightarrow\tilde{c}>1-\tilde{k}_{x}^{2}.$

Case $x^{\mathrm{T}}Ax^{*}-2b^{\mathrm{T}}x^{*}+c=0$ : The problem (11) is written as the QCQP problem (14). ∎

The solution of the simple QCQP problem (14) is easily derived using the method of Lagrange multiplier.

Lemma 4.

The solution of the simple QCQP problem (14) is given by

\displaystyle x^{*}=\left(1-\sqrt{1-\tilde{c}}\right)A^{-1}b.

Proof:.

Supposing a Lagrange multiplier $\lambda>0$ , the Lagrange function is written as

\displaystyle L(x,\lambda)=x^{\mathrm{T}}Ax+\lambda(x^{\mathrm{T}}Ax-2b^{\mathrm{T}}x+c).

The KKT condition is given by

	$\displaystyle\frac{\partial L(x^{*},\lambda)}{\partial x}$	$\displaystyle=2(1+\lambda)Ax^{*}-2\lambda b=0.$		(15)
	$\displaystyle\frac{\partial L(x^{*},\lambda)}{\partial\lambda}$	$\displaystyle=x^{\mathrm{T}}Ax^{}-2b^{\mathrm{T}}x^{*}+c=0,$		(16)

Eq. (15) is written as

\displaystyle x^{*}

\displaystyle=\frac{\lambda}{1+\lambda}A^{-1}b,

and the Eq. (16) is given by

		$\displaystyle x^{\mathrm{T}}Ax^{}-2b^{\mathrm{T}}x^{*}+c=0$
	$\displaystyle\Leftrightarrow$	$\displaystyle\frac{\lambda^{2}}{(1+\lambda)^{2}}b^{\mathrm{T}}A^{-1}b-\frac{2\lambda}{1+\lambda}b^{\mathrm{T}}A^{-1}b+c=0$
	$\displaystyle\Leftrightarrow$	$\displaystyle-\frac{\lambda^{2}+2\lambda}{(1+\lambda)^{2}}b^{\mathrm{T}}A^{-1}b+c=0$
	$\displaystyle\Leftrightarrow$	$\displaystyle-(\lambda^{2}+2\lambda)b^{\mathrm{T}}A^{-1}b+(1+\lambda)^{2}c=0$
	$\displaystyle\Leftrightarrow$	$\displaystyle(c-b^{\mathrm{T}}A^{-1}b)\lambda^{2}+2(c-b^{\mathrm{T}}A^{-1}b)\lambda+c=0$
	$\displaystyle\Leftrightarrow$	$\displaystyle\lambda^{2}+2\lambda+\frac{\tilde{c}}{(\tilde{c}-1)}=0$
	$\displaystyle\Leftrightarrow$	$\displaystyle\lambda=-1\pm\sqrt{1-\frac{\tilde{c}}{\tilde{c}-1}}$
	$\displaystyle\Leftrightarrow$	$\displaystyle\lambda=-1\pm\sqrt{\frac{1}{1-\tilde{c}}}.$

As $\lambda>0$ and $\tilde{c}>0$ , the Lagrange multiplier is written as

\displaystyle\lambda=-1+\sqrt{\frac{1}{1-\tilde{c}}}.

Therefore, the optimal point $x^{*}$ is given by

	$\displaystyle x^{*}$	$\displaystyle=\frac{\lambda}{1+\lambda}A^{-1}b$
		$\displaystyle=\frac{-1+\sqrt{\frac{1}{1-\tilde{c}}}}{\sqrt{\frac{1}{1-\tilde{c}}}}A^{-1}b$
		$\displaystyle=(1-\sqrt{1-\tilde{c}})A^{-1}b.$

∎

Finally, we summarize three Lemmas 2-4 and solve the QCQP problem (10).

Lemma 1.

The solution of the problem (10) is given by


$\displaystyle x^{*}$	$\displaystyle=\left(1-\sqrt{\mathrm{C}\left(1-\tilde{c};\tilde{k}_{x}^{2},1\right)}\right)A^{-1}b,$	(17a)
$\displaystyle y^{*}$	$\displaystyle=\mathrm{R}\left(c-(1-\tilde{k}_{x}^{2})b^{\mathrm{T}}A^{-1}b\right).$	(17b)

Proof:.

Lemmas 2-4 split the solution $x^{*}$ as three cases:

	$\displaystyle x^{*}$	$\displaystyle:=\begin{cases}(1-\tilde{k}_{x})A^{-1}b&1-\tilde{k}_{x}^{2}<\tilde{c}\\ \left(1-\sqrt{1-\tilde{c}}\right)A^{-1}b&0<\tilde{c}\leq 1-\tilde{k}_{x}^{2}\\ 0&c\leq 0\end{cases}$
		$\displaystyle:=\begin{cases}\left(1-\sqrt{\tilde{k}_{x}^{2}}\right)A^{-1}b&1-\tilde{c}<\tilde{k}_{x}^{2}\\ \left(1-\sqrt{1-\tilde{c}}\right)A^{-1}b&\tilde{k}_{x}^{2}\leq 1-\tilde{c}<1\\ \left(1-\sqrt{1}\right)A^{-1}b&1\leq 1-\tilde{c}\end{cases}$
		$\displaystyle=\left(1-\sqrt{\mathrm{C}\left(1-\tilde{c};\tilde{k}_{x}^{2},1\right)}\right)A^{-1}b.$

Furthermore, the solution of $y^{*}$ is written as,

	$\displaystyle y^{*}$	$\displaystyle:=\begin{cases}x^{\mathrm{T}}Ax^{}-2b^{\mathrm{T}}x^{}+c&1-\tilde{k}_{x}^{2}<\tilde{c}\\ x^{\mathrm{T}}Ax^{}-2b^{\mathrm{T}}x^{}+c&0<\tilde{c}\leq 1-\tilde{k}_{x}^{2}\\ 0&c\leq 0\end{cases}$
		$\displaystyle=\begin{cases}c-(1-\tilde{k}_{x}^{2})b^{\mathrm{T}}A^{-1}b&1-\tilde{k}_{x}^{2}<\tilde{c}\\ 0&0<\tilde{c}\leq 1-\tilde{k}_{x}^{2}\\ 0&\tilde{c}\leq 0\end{cases}$
		$\displaystyle=\mathrm{R}(c-(1-\tilde{k}_{x}^{2})b^{\mathrm{T}}A^{-1}b).$

Therefore, the solution become Eq. (17). ∎

Appendix C Overall schematic of the learning process

Algorithm 1 shows the overall schematic of the learning process. The first line defines the modified dynamics $(f_{\bf m},G_{\bf m},h_{\bf m})$ from the nominal dynamics $(f_{\bf n},G_{\bf n},h_{\bf n})$ , defined by the neural network, where $\phi$ is a set of parameters of the nominal dynamics. The 2-7 line represents a training loop, where the gradient-based optimization methods can be used by using the forward and backward calculation. Note that an ODE solver is used for forward calculation, and Algorithm 2 shows the forward calculation when the Euler method is used. For simplicity, mini-batch computation omitted in this schematic.

Algorithm 1 Training process

x_{0}

: initial state,

u

: input signal,

y

: output signal,

(f_{\bf n},G_{\bf n},h_{\bf n})

: nominal dynamics,

V

: a designed function

1: define modified functions

(f_{\bf m},G_{\bf m},h_{\bf m})

from

(f_{\bf n},G_{\bf n},h_{\bf n})

and

V

2: for

1

\#

iterations do

\hat{y}\leftarrow

ODE with

(f_{\bf m},G_{\bf m},h_{\bf m})

from

x_{0},u

(Algorithm 2)

4: forward computation of

{\rm Loss}

function (9) from

y

\nabla_{\phi}{\rm Loss}\leftarrow

backward computation with

{\rm Loss}

\phi\leftarrow

Optimizer(

\phi

\nabla_{\phi}{\rm Loss}

)

7: end for

Algorithm 2 Forward computation for dynamics Eq. (1)

x_{0}

: initial state,

u

: input signal,

(f_{\bf m},G_{\bf m},h_{\bf m})

: dynamics

\hat{y}

: output signal

1: for t

\leftarrow 0

T

x_{t+1}\leftarrow x_{t}+\Delta t(f_{\bf m}(x_{t})+G_{\bf m}(x_{t})u_{t})

\hat{y}_{t}\leftarrow h_{\bf m}(x_{t})

4: end for

5: return

\hat{y}

Appendix D Neural network architecture and hyper parameters

Table 1: The search space of Bayesian optimization

parameter name	range	type
learning rate	$10^{-5}$ – $10^{-3}$	log scale
weight decay	$10^{-10}$ – $10^{-6}$	log scale
batch size	$10$ – $100$	integer
optimizer	$\{$ AdamW, Adam, RMSProp $\}$	categorical
$\#$ layer for $f_{\bf n}$	$0$ – $3$	integer
$\#$ layer for $G_{\bf n}$	$0$ – $3$	integer
$\#$ layer for $h_{\bf n}$	$0$ – $3$	integer
$\#$ dim. for a hidden layer of $f_{\bf n}$	$8-32$	integer
$\#$ dim. for a hidden layer of $G_{\bf n}$	$8-32$	integer
$\#$ dim. for a hidden layer of $h_{\bf n}$	$8-32$	integer
$\epsilon$	$0$ – $1.0$	log scale
Initial scale parameter for $f_{\bf n}$	$10^{-5}$ – $0.1$	log scale
Stop gradient for projection	true, false	boolean

This section details how to determine the neural network architecture. The architecture and hyper parameters of the neural networks were basically determined by Bayesian optimizing using the validation dataset.

Table 1 shows the search space of Bayesian optimization. The first three parameters: learning rate, weight decay, and batch size are parameters for training the neural networks. Also, an optimizer is selected from AdamW, Adam, and RMSProp. The structure of neural network is determined from the number of intermediate layers and dimensions for each layer. One layer in our setting consists of a fully connected layer with a ReLU activation. The last three rows represent parameters related to our proposed methods. $\epsilon$ is a parameter of the loss function $L_{{\rm HJ}}$ . Initial scale parameter is multiplied with the output of $f_{\bf n}$ to prevent the value of $f_{\bf n}(x)$ from becoming large in the initial stages of learning. When $f_{\bf n}(x)$ , which determine the behavior of the internal system, outputs a large value, it diverges due to time evolution, and the learning of the entire system may not progress. Therefore, it is empirically preferable to start with a small value for $f_{\bf n}(x)$ at the initial stage of learning. When the flag of ‘stop gradient for projection’ is false, backward computation related to the second term of modification of $f_{\bf m}$ and $G_{\bf m}$ is disabled. Note that modification related to $f_{\bf m}$ and $G_{\bf m}$ consists of two terms (see Theorem 1). Setting this parameter to false resulted in better performance in our all experiments.

We ran 300 trials using the Bayesian optimization for the bistable model benchmark and the glucose insulin benchmark with the above settings, and the hyper parameters obtained are shown in Table 2, where the number of dimensions for each hidden layer is shown in tuple from the order closest to the input layer.

Hyperparameters of comparative methods were determined by grid search using the validation dataset. ARX and PWARX have an order parameter $n$ of the autoregressive model, and this parameter is searched in the range of $1-5$ . The number of iterations was set to 10000 so that the optimization of PWARX converges sufficiently. MOESP and ORT have an internal dimension $n$ ( $1\leq n\leq 20$ ) and the number of subsequences used for estimation $k$ ( $2n<k\leq 20$ ).

Table 2: Selected parameters for each benchmark

parameter name	bistable	glucose insulin
learning rate	$3.01\times 10^{-4}$	$3.28\times 10^{-4}$
weight decay	$4.76\times 10^{-9}$	$2.28\times 10^{-9}$
batch size	$100$	$100$
optimizer	RMSProp	RMSProp
$\#$ layer for $f_{\bf n}$	$3$	$1$
$\#$ layer for $G_{\bf n}$	$1$	$2$
$\#$ layer for $h_{\bf n}$	$3$	$2$
$\#$ dim. for a hidden layer of $f_{\bf n}$	(17,10,22)	(8)
$\#$ dim. for a hidden layer of $G_{\bf n}$	(34)	(27,29)
$\#$ dim. for a hidden layer of $h_{\bf n}$	(10,62,58)	(35,18)
$\epsilon$	$0.63$	$0.75$
Initial scale parameter for $f_{\bf n}$	$9.64\times 10^{-2}$	$8.94\times 10^{-2}$
Stop gradient for projection	false	false

Appendix E Glucose-Insulin system

Glucose concentration in the blood is modeled as a time-delay system regulated by insulin concentration (See Fig. 7 (A)) [32]. Suppose that $G$ $I$ , and $X$ are the glucose, insulin, and accumulated glucose plasma concentration ([mg/100ml],[ $\mu$ UI/ml],and [min mg/100ml], respectively) and $u$ is the amount of ingested glucose per minute [min^-1 mg/100ml]. The dynamics of each concentration is given by

	$\displaystyle\dot{G}(t)$	$\displaystyle=-k_{1}G(t)-k_{2}G(t)I(t)+g_{0}+u(t),$
	$\displaystyle\dot{I}(t)$	$\displaystyle=-k_{3}I(t)+\frac{k_{4}}{\tau}X(t),$
	$\displaystyle\dot{X}(t)$	$\displaystyle=G(t)-G(t-\tau),$
	$\displaystyle y(t)$	$\displaystyle=[G(t),I(t)]^{\mathrm{T}},$

where $k_{1}$ is a spontaneous glucose disappearance rate, $k_{2}$ is an insulin-dependent glucose disappearance rate, $g_{0}$ is a constant increase in plasma glucose concentration, $k_{3}$ is an insulin disappearance rate, $k_{4}$ is an insulin release rate per the average glucose concentration within the last $\tau$ minute.

This system has a unique asymptotically stable equilibrium point $(G,I,X)\equiv(G^{*},I^{*},X^{*})$ on the nonlinear plain such that

	$\displaystyle G^{*}$	$\displaystyle=\frac{-k_{1}k_{3}+\sqrt{(k_{1}k_{3})^{2}+4k_{2}k_{3}k_{4}g_{0}}}{2k_{2}k_{4}},$
	$\displaystyle I^{*}$	$\displaystyle=\frac{-k_{1}k_{3}+\sqrt{(k_{1}k_{3})^{2}+4k_{2}k_{3}k_{4}g_{0}}}{2k_{2}k_{3}},$
	$\displaystyle X^{*}$	$\displaystyle=\frac{-k_{1}k_{3}+\sqrt{(k_{1}k_{3})^{2}+4k_{2}k_{3}k_{4}g_{0}}}{2k_{2}k_{4}}\tau.$

Here we set the initial state of this model as $G(0)=G^{*},I(0)=I^{*},X(0)=X^{*}$ and set the model parameters as shown in the following Table 3. Furthermore, we adopt the output of the previous oral glucose absorption system [33] as $u$ (See Fig. 7 (B)), and the glucose absorption amount $u$ is normalized based on human blood volume per body weight (0.80[100 ml/kg]).

Table 3: Model parameters.

Parameter	Value	Unit
$k_{1}$	$3.35\times 10^{-2}$	$\frac{1}{\text{min}}$
$k_{2}$	$5.22\times 10^{-5}$	$\frac{1}{\text{min}(\mu\text{UI/ml})}$
$k_{3}$	$1.055$	$\frac{1}{\text{min}}$
$k_{4}$	$0.293$	$\frac{(\mu\text{UI/ml})}{\text{min}\text{(mg/100ml)}}$
$g_{0}$	3.13	$\frac{\text{(mg/100ml)}}{\text{min}}$
$\tau$	$6$	min

	$\displaystyle\mathrm{HJ}_{\beta}(f_{\bf m},G_{\bf m},h_{\bf m})$
	$\displaystyle=\beta^{\mathrm{T}}f_{\bf m}+\frac{1}{2\gamma^{2}}\\|G_{\bf m}^{\mathrm{T}}\beta\\|^{2}+\frac{1}{2}\\|h_{\bf m}\\|^{2}$
	$\displaystyle=\beta^{\mathrm{T}}(f_{\bf n}-\beta p)+\frac{1}{2\gamma^{2}}\\|(G_{\bf n}-\beta q^{\mathrm{T}})^{\mathrm{T}}\beta\\|^{2}+\frac{1}{2}\Big{\\|}(1-r)h_{\bf n}\\|^{2}$
	$\displaystyle=-\\|\beta\\|^{2}p+\frac{1}{2\gamma^{2}}\\|\beta\\|^{4}\\|q\\|^{2}+\frac{1}{2}\\|\hat{h}\\|^{2}r^{2}-\frac{1}{\gamma^{2}}\\|\beta\\|^{2}\beta^{\mathrm{T}}G_{\bf n}q-\\|h_{\bf n}\\|^{2}r+\mathrm{HJ}_{\beta}(f_{\bf n},G_{\bf n},h_{\bf n})$
	$\displaystyle\leq 0.$

	$\displaystyle\mathrm{HJ}_{\beta}(f_{\bf m},G_{\bf m},h_{\bf n})$
	$\displaystyle=\beta^{\mathrm{T}}f_{\bf m}+\frac{1}{2\gamma^{2}}\\|G_{\bf m}^{\mathrm{T}}\beta\\|^{2}+\frac{1}{2}\\|h_{\bf m}\\|^{2}$
	$\displaystyle=\beta^{\mathrm{T}}(f_{\bf n}-\beta p)+\frac{1}{2\gamma^{2}}\\|(G_{\bf n}-\beta q^{\mathrm{T}})^{\mathrm{T}}\beta\\|^{2}+\frac{1}{2}\\|h_{\bf n}\\|^{2}$
	$\displaystyle=-\\|\beta\\|^{2}p+\frac{1}{2\gamma^{2}}\\|\beta\\|^{4}\\|q\\|^{2}-\frac{1}{\gamma^{2}}\\|\beta\\|^{2}\beta^{\mathrm{T}}G_{\bf n}q+\mathrm{HJ}_{\beta}(f_{\bf n},G_{\bf n},h_{\bf n})$
	$\displaystyle\leq 0.$