
Mouad ELAARABI (1,2)

1. Nantes Université, Ecole Centrale Nantes, CNRS, GeM, UMR 6183, Nantes, 44321, France
2. Nantes Université, IRT Jules Verne, Bouguenais, 44340, France

Adaptive parameters identification for nonlinear dynamics using deep permutation invariant networks

Mouad ELAARABI (mouad.elaarabi@ec-nantes.fr, mouad.el-aarabi@irt-jules-verne.fr), Domenico BORZACCHIELLO (domenico.borzacchiello@ec-nantes.fr), Philippe LE BOT (philippe.le-bot@irt-jules-verne.fr), Yves LE GUENNEC (yves.le-guennec@irt-jules-verne.fr), Sebastien COMAS-CARDONA (sebastien.comas@ec-nantes.fr)
Abstract

The promising outcomes of dynamical system identification techniques, such as SINDy [1], highlight their advantages in providing qualitative interpretability and extrapolation compared to non-interpretable deep neural networks [2]. However, these techniques struggle with real-time parameter updating, especially when the system parameters are likely to change during or between processes. Recently, the OASIS [3] framework introduced a data-driven technique to address the limitations of real-time updating of dynamical system parameters, yielding interesting results. Nevertheless, we show in this work that superior performance can be achieved using more advanced model architectures.
We present an innovative encoding approach, based mainly on the application of Set Encoding methods to sequence data, which provides accurate adaptive model identification for complex dynamic systems with variable input time series lengths. Two Set Encoding methods are used: the first is Deep Set [4], and the second is Set Transformer [5]. Comparing the Set Transformer to the OASIS framework on the Lotka-Volterra system for real-time local dynamical system identification and time series forecasting, we find that the Set Transformer architecture is well adapted to learning relationships within data sets. We then compare the two Set Encoding methods on the Lorenz system for online global dynamical system identification. Finally, we train a Deep Set model to perform identification and characterization of abnormalities for a 1D heat-transfer problem.

keywords:
data-driven dynamical system identification, neural networks, deep permutation invariant networks, online parameter updating, sparse regression

1 Introduction

Exploring dynamic systems through differential equations is intriguing, as it provides valuable insights into variable interactions and system behavior. These insights can be useful for many tasks, such as monitoring and controlling industrial processes. However, this can be a challenging undertaking due to the inherent complexity of such systems (such as the variability of parameters and the representativeness of the model). The direct construction of these equations from theoretical principles is a significant challenge and may be sensitive to changes in the parameters or modeling hypotheses, leading researchers to rely on alternative methods such as analyzing empirical data obtained through simulations or experiments. This process involves generating datasets and conducting meticulous analyses using appropriate methods, offering valuable insights into the intrinsic interactions between variables and enhancing the understanding of physical system dynamics.

While machine learning methods offer a potential solution, they suffer from issues such as a lack of interpretability, since the trained model is a black box, and they generally require extensive training data for high accuracy, which is not always readily available. These methods may also struggle to generalize to unseen data [2]. On the other hand, model discovery methods [6, 7] aim to identify the true governing equations from empirical data, providing insights into the interactions within the data, and can be applied to a wide range of engineering tasks.

Table 1: Abbreviation table
Abbreviation Definition
CAE Computer-Aided Engineering
DMD Dynamic Mode Decomposition
DMDc Dynamic Mode Decomposition with control
DNN Deep Neural Network
EQL EQuation Learning
ISAB Induced Self-Attention Block
LASSO Least Absolute Shrinkage and Selection Operator
MAB Multi-head Attention Block
MAPE Mean Absolute Percentage Error
MLP MultiLayer Perceptron
MPC Model Predictive Control
NSODE Neural Symbolic Ordinary Differential Equation
OASIS Operable Adaptive Sparse Identification of Systems
OASIS-P Online Adaptive Sparse Identification of Systems for Fault Prognosis
ODE Ordinary Differential Equation
PINN Physics Informed Neural Network
PINN-SR Physics Informed Neural Network with Sparse Regression
PDE Partial Differential Equation
PMA Pooling by Multihead Attention
PySR Python Symbolic Regression
SAB Self-Attention Block
SENDI Set Encoding for Nonlinear Dynamics Identification
SGA-PDE Symbolic Genetic Algorithm for discovering open-form Partial Differential Equations
SINDYc Sparse Identification of Nonlinear Dynamics with Control
SINDy Sparse Identification of Nonlinear Dynamics
SR3 Sparse Relaxed Regularized Regression
rFF row-wise FeedForward
sMAPE symmetric Mean Absolute Percentage Error

1.1 Background and motivations

In this work, we focus on dynamic system identification methods [8] that have demonstrated excellent results. SINDy [1] (please refer to Tab. 1 for abbreviations), based on sparse regression, aims to extract the dynamical links within data. The main idea is to select a few optimal candidate functions from pre-defined over-complete dictionaries, which can be defined based on prior knowledge about the physical laws governing the process and which effectively describe the dynamics. This is achieved by using a sparse regression method [9, 10, 11]. On the other hand, PDE-Net, proposed by [12], aims to identify PDEs by discretizing the PDE using forward Euler in time and finite differences in space. This method introduces a different approach, mainly based on a succession of symbolic neural networks that link multiple candidates using the EQuation Learning (EQL) architecture [13] and select the best ones using learnable convolutional kernels. Other methods take a different route: SGA-PDE [14] proposes the use of symbolic mathematics tree representations with classical and differential operators to represent the PDE and a genetic algorithm to find the best representation tree. In the same context, PySR [15] introduces a Python library for symbolic regression based on a multi-population evolutionary algorithm. This library is particularly well-suited for identifying ODEs, but has limited capabilities for identifying PDEs. This limitation arises from the fact that PySR's operators are restricted to basic algebraic operators. In contrast, SGA-PDE [14] extends this capability by incorporating differential operators. These methods offer high potential, as the identified parameters can be transferred and used within traditional CAE simulation software such as finite element solvers. Additionally, they can be coupled with DNNs such as PINNs [16] to identify and simulate the dynamics simultaneously, as proposed by [17].

Nevertheless, these methods are not well suited (in terms of inference time, adaptability, and leveraging new data to enhance accuracy) to online identification, where abrupt changes may occur during the process. In such scenarios, the model needs to rapidly adjust to these changes. SINDy-Model Predictive Control (MPC) [18] demonstrates that a combination of SINDy with control [19], built on classical SINDy and DMDc [20], allows for better control of the system dynamics after a change. This work compared SINDYc, DMDc, and a DNN, revealing that the combination of SINDYc with DMDc yields interesting results. Moreover, [21] suggests an adaptive model for abrupt changes based on SINDy. This method involves two processes: the first detects the changes, while the second identifies and then modifies, adds, or removes parameters from the dynamic system.
These methods present significant potential but have limitations when the identification time is crucial. A slight delay in identification can lead to substantial changes, such as in process system prognosis [22]. Developing an accurate dynamics identification model using limited data is not always viable and may take a considerably long time, hindering its use for interactive applications. Therefore, using SINDYc or abrupt-SINDy online for real applications is challenging.

Recently, [23] presented a sequence-to-sequence Transformer [24] architecture to train a general model that performs symbolic regression identification with faster inference time. This technique was then generalized to ODE identification, as presented in ODEFormer [25] and NSODE [26]. These works introduced transformer-based methods to identify ODE structures and parameters, utilizing symbolic regression techniques. While the first method targets univariate ODEs, the second extends to multivariate ODEs. These methods focus on training a general model to conduct forward identification of symbolic representations of ODEs based on time series observations. ODEFormer [25] exhibits promising results compared to alternative methods. However, this approach is limited when partial or full knowledge of the dynamics terms is available. Moreover, these methods require an embedding step and an additional fine-tuning step to refine the identified ODE parameters, posing challenges for real-world applications. The high complexity of the architecture may also pose limitations, especially with long time series. Finally, since the model output uses natural language, it would be challenging to couple these methods with other deep learning methods, such as PINN-SR [17].
In the same context, OASIS [3] presented a framework that integrates both the strengths of SINDy for identifying the dynamics parameters and a DNN trained offline on parameters identified by SINDy. The latter is subsequently used online to provide parameter predictions for predictive control. OASIS has also been applied to real-time fault prognosis and demonstrated interesting and accurate results [22]. However, the trained neural network uses a single time frame as input and therefore cannot exploit information contained in sequences of data. Additionally, classical neural network architectures may fail to encode the interactions between data, and therefore struggle to capture the dynamical system, especially for complex systems.

1.2 Paper contributions

We introduce a deep learning architecture for evolving online dynamics identification based on Set Encoding. Our goal is to enhance the encoding of interactions within data, leading to improved prediction accuracy of system parameters. We employ two distinct architectures: Deep Set [4] and Set Transformer [5]. These methods use two different encoding techniques that can learn both simple pairwise relationships and higher-order links between set elements, and they can be applied to diverse datasets, making them suitable for a wide range of applications including supervised (our work) and unsupervised learning. In this work, we are particularly interested in applying these techniques to sequential data, notably time series. Our main focus is to combine these methods with SINDy to provide an identification framework for the dynamics parameters of a fixed ODE for online use cases such as time series forecasting.
The problem addressed in this article is defined in Eq. 1:

\left\{\begin{aligned}
\frac{\mathrm{d}X}{\mathrm{d}t} &= \Theta(X)\,\Xi \quad \text{where} \quad X = [x_{0},\ \dots,\ x_{n}]\\
\Theta(X) &= [1,\ X,\ X^{2},\ \dots,\ X^{2}\otimes\sin(X),\ \cos(X),\ \dots]\\
\mathbf{G}(\Theta(X)) &= \Xi
\end{aligned}\right. \qquad (1)

where $\Theta(X)$ is a pre-defined search library based on knowledge about the dynamical system, $X$ is a vector containing the variables of the dynamical system, and $\Xi$ represents the parameters of the dynamical system to be calculated. The main goal is to train the Set Encoding methods, represented as $\mathbf{G}$, to predict the $\Xi$ matrix, thereby identifying the dynamical system.

The main contributions of the study are to :

  • Propose an appropriate encoding and decoding approach to effectively learn the interactions within the data, providing accurate predictions of the dynamics parameters. This architecture can handle sequential data of variable size, offering enhanced predictions compared to single-time-point predictions.

  • Illustrate that integrating physics loss in offline learning contributes to a better convergence and generalization of the model.

  • Demonstrate that these methods can provide robust predictions with data containing noise.

  • Provide accurate predictions (after offline training) with an inference time of milliseconds, making it suitable for real-time applications.

This article is structured as follows: Section 2 is dedicated to a brief presentation of related works, such as SINDYc, which is used in the following sections, and the OASIS framework along with its training method. Then, in Section 3, we introduce the two Set Encoding architectures and discuss their use for online dynamics identification. Section 4 focuses on comparing OASIS and Set Encoding for local dynamics identification using the Lotka-Volterra system, followed by a comparison of the two Set Encoding architectures for global dynamics identification using the Lorenz system, and then by the use of Deep Set for 1D heat-transfer identification and characterization. Finally, conclusions of this work and areas for improvement are given.

2 Related works

In this section, we briefly introduce SINDYc (Sec. 2.1) and OASIS (Sec. 2.2). SINDYc is utilized to generate training data for comparing OASIS and our methods in the first application with the Lotka-Volterra system (Sec. 4.2), as well as for comparing the two Set Encoding methods presented in Sec. 3 in the second application with the Lorenz system (Sec. 4.3).

2.1 SINDYc

The SINDYc [19] method allows the identification of the parameters of a dynamical system with control variables. Drawing upon prior knowledge of the system's dynamic behavior, a set of candidate functions, referred to as the search space or library and denoted by $\Theta$, is defined. These functions can collectively represent the dynamical system. The underlying idea is primarily based on the notion that an identified system should encompass a limited and concise set of functions. This approach can be applied to various dynamic systems (ODEs and PDEs), as well as to dynamic systems with control variables (Eq. 2), which are external variables.

\dot{X}=f(X,U),\qquad X(t_{0})=X_{0} \qquad (2)

where $X=[x_{1}~~x_{2}~~\cdots~~x_{l}]\in\mathbb{R}^{n\times l}$ is a matrix of $l$ state variables with $n$ time snapshots, $\dot{X}=[\dot{x}_{1}~~\dot{x}_{2}~~\cdots~~\dot{x}_{l}]\in\mathbb{R}^{n\times l}$ is the time derivative of $X$, usually approximated using total variation regularized derivatives [27], $U=[u_{1}~~u_{2}~~\cdots~~u_{p}]\in\mathbb{R}^{n\times p}$ contains the $p$ control variables with $n$ time snapshots, and $X(t_{0})$ are the initial conditions. The equation identified by SINDYc is defined as:

\dot{X}=\Theta(X,U)\,\Xi, \qquad (3)

where $\Xi$ is a coefficient matrix representing the selected functions from the search library $\Theta(X,U)$, which is composed of multiple candidate functions (Eq. 4) usually chosen using partial knowledge of the physics. For example, the library could be defined as:

\Theta(X,U)=[1~~X^{T}~~U^{T}~~(X\otimes U)^{T}~~\cdots~~\sin(X^{T})~~\cdots~~U^{T}\otimes\exp(X^{T})~~\cdots] \qquad (4)

To identify the dynamical system while avoiding overfitting, the $\Xi$ matrix must be sparse, implying that the less significant coefficients need to be equal to zero. This is achieved by solving the LASSO optimization problem (Eq. 5) defined as:

\Xi=\underset{\widetilde{\Xi}}{\mathrm{argmin}}\left[\left\|\dot{X}-\Theta(X,U)\widetilde{\Xi}\right\|_{2}+\lambda\left\|\widetilde{\Xi}\right\|_{1}\right], \qquad (5)

where the first term represents the residual error, the second term is a regularizer applied to the coefficient matrix $\widetilde{\Xi}$ (the parameters of the ODE/PDE), and $\lambda$ is a weight that controls the regularization. The larger it is, the stronger the regularization becomes, leading to the elimination of more terms from the search library while potentially reducing the accuracy. Other methods can also be used, such as Sparse Relaxed Regularized Regression (SR3) [11], or thresholded least squares, which use the $\ell_{0}$ norm to eliminate less significant coefficients [1, 28, 10].
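As an illustration, the sketch below shows how such a controlled identification can be set up with the PySINDy package used later in this work; the trajectory, threshold, and library degree are placeholder assumptions rather than the settings used in this paper.

```python
import numpy as np
import pysindy as ps

# Placeholder data: replace with a measured state trajectory X and control input U.
dt = 0.01
t = np.arange(0, 10, dt)
X = np.random.rand(len(t), 2)   # (n_snapshots, n_state_variables)
U = np.random.rand(len(t), 1)   # (n_snapshots, n_control_variables)

model = ps.SINDy(
    optimizer=ps.STLSQ(threshold=0.1),              # sparsity-promoting regression
    feature_library=ps.PolynomialLibrary(degree=3),
)
model.fit(X, t=dt, u=U)     # solves X_dot = Theta(X, U) Xi with a sparse Xi
Xi = model.coefficients()   # one row of library coefficients per state equation
model.print()
```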

Despite all the advantages of SINDYc, this method is restricted in cases where system parameters tend to vary during the process, or whenever the system must be identified online based on sensor data. While abrupt-SINDy [21] presents a method to address this issue, it has drawbacks, notably the quantity of data required for identification and the computational cost, which can be significant for real-time cases. In this study, we focus on using SINDy and SINDYc as offline identification methods to generate datasets, which are then used to train deep neural networks to approximate the parameters online.

2.2 OASIS

The OASIS framework is primarily based on SINDYc, but unlike the latter, it relies on a DNN trained offline on the results of SINDYc, making it more suitable for real-time parameter identification.
The dynamical system (Eq. 2) is represented by a set of measured or generated time series. Each set is divided into multiple sub domains (Fig. 1. a), which are then used for offline identification through SINDYc (Eq. 3) using an identical search library (Fig. 1. b).

Refer to caption
Figure 1: OASIS training process. $X_{jk}\in\mathbb{R}^{N'\times l}$ and $U_{jk}\in\mathbb{R}^{N'\times p}$ represent respectively the state variables and control variables for the $j$-th set and the $k$-th sub domain, where $j\in[1,S]$ and $k\in[1,m]$. Here, $S$ is the number of sets and $m$ is the total number of sub domains. $l$, $p$, and $N'$ represent respectively the number of state variables, the number of control variables, and the time window length.

This dataset is then utilized for training the DNN (Fig. 2) model (Fig. 1. c), where the input comprises the state and control variables and the output corresponds to the predicted $\Xi$. The training process leverages the information contained in the datasets, enabling the model to learn and capture the underlying patterns and relationships within the data and allowing accurate online identification (Fig. 1. d). This is achieved by minimizing the error between the predictions of the DNN and the true coefficients (Eq. 6).

\underset{\Theta}{\mathrm{argmin}}~~\left[\left\|\mathrm{F}(X,U;~\Theta)-\Xi\right\|+\beta\left\|\Theta\right\|_{2}\right]~~~\mathrm{where}~~~\mathrm{F}(X,U;~\Theta)=\hat{\Xi} \qquad (6)

Here, $\mathrm{F}$ is the output prediction of the DNN (Fig. 2), and $\Theta$ denotes its trainable parameters. Additionally, it is important to include the regularization term $\beta\left\|\Theta\right\|_{2}$ to mitigate overfitting problems.
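A minimal sketch of this training step is given below, assuming a PyTorch implementation with illustrative layer sizes; it is not the authors' exact network.

```python
import torch
import torch.nn as nn

n_inputs, n_coeffs = 4, 20        # e.g. one snapshot [x, y, c, 1] -> 20 library coefficients (illustrative)
model = nn.Sequential(
    nn.Linear(n_inputs, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, n_coeffs),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
beta = 1e-4                       # weight of the penalty on the trainable parameters Theta

def training_step(x_u, xi_true):
    xi_pred = model(x_u)                                    # F(X, U; Theta) = Xi_hat
    penalty = sum((p ** 2).sum() for p in model.parameters())
    loss = nn.functional.mse_loss(xi_pred, xi_true) + beta * penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```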

The trained model is then employed in real time for online identification of the coefficients $\Xi$ of the dynamical system (Eq. 2), enabling accurate extrapolation or prediction of optimal control parameters.

Refer to caption
Figure 2: Deep neural network with two hidden layers, two inputs denoted as $X$ and $U$, and five outputs denoted as $\Xi$. The trainable function $F$ depends on a set of variables denoted as $\Theta$ that governs the connections between these hidden layers.

Because the DNN inputs are the current states of the system $X$ and the controller $U$, the OASIS framework does not aim to identify the global dynamics. In other words, the primary goal is not to identify the true, global coefficients of the governing ODE, but rather a local dynamical system valid over a small time range, which provides limited extrapolation or prediction of optimal control parameters. This approach is motivated by the fact that accurate identification of the global dynamics is not straightforward and not always feasible [22].
Since the identified local dynamics are only valid within a limited time interval, frequent model updates are necessary, which can lead to computational issues. To address this concern, the strategy of OASIS-P [22] is to choose an optimal frequency for model updating, minimizing the identification error and consequently the extrapolation error.

The fact that the network uses only a single state to predict the parameters defining the local system behavior is a potential restriction whenever the dynamics cannot be captured from a single time frame but require a longer observation window. This approach is expected to yield accurate predictions in simple cases (example in Fig. 3, top left). However, it is limited by the non-uniqueness of the solution given a single state, as demonstrated by the example in Fig. 3, top right, where the intersection of the two time series may introduce confusion during the model training process. The bottom of Fig. 3 presents two examples of Lotka-Volterra dynamics with similar initial conditions and control variables but with small changes in the dynamics parameters. The black arrows show cases where the model inputs ($x(t_{i})$ and $y(t_{i})$) are similar for the two examples (example 1 in blue and example 2 in red), but the dynamics differ. This reflects the model's inadequate capacity to capture temporal information effectively, resulting in less accurate predictions. The following section presents a more suitable architecture capable of effectively encoding temporal information, thereby providing better predictions.

Refer to caption
Figure 3: Top: two examples of dynamics, one where the model can easily capture the local behavior from only one point in time (left) and another where the model fails to capture the local behavior due to confusion (right). Bottom: example of a two-state-variable system (Lotka–Volterra) where the model could fail to capture the local dynamics at certain points.

3 Invariant Deep Learning model for Nonlinear Dynamics Identification

In this section, the proposed architecture is founded on a concept termed Set Encoding. In contrast to classical neural network models (Fig. 2), where the primary goal is to learn a function $F$ connecting a fixed-dimensional input tensor to its corresponding output value, set encoding (Fig. 4) can be used to establish links between input data sets of variable shape (variable time series length and variable time step $\Delta t$) and their corresponding labels, while maintaining permutation invariance. This objective is realized by encoding the input data sets through equivariant layers (Fig. 4 a), followed by pooling steps to attain permutation-invariant properties (Fig. 4 b). Subsequently, the information is decoded (Fig. 4 c) to yield the appropriate outputs.

In this section, we explore the applicability of two set encoding methods for online and adaptive dynamics identification, utilizing time series of variable shape. The first method, Deep Set, is based on a DNN (Fig. 2) architecture, while the second method, Set Transformer, is built upon the transformer architecture [24].

Refer to caption
Figure 4: Set encoding overall architecture. The input sets $X_{jk}$ and $U_{jk}$ represent respectively the state variables and control variables for the $j$-th set and the $k$-th sub domain, corresponding to the output coefficients $\Xi_{jk}$, with $j\in[1,N]$ and $k\in[1,m]$. $N$ is the number of sets and $m$ is the total number of sub domains.

3.1 Deep Set

The Deep Set architecture (Fig. 5) can be divided into three transformations. The first step involves encoding the inputs into a latent space through a series of layers with non-linear activation functions (Eq. 8). In our case, the input is a set of $X_{jk}\in\mathbb{R}^{N'\times l}$ and $U_{jk}\in\mathbb{R}^{N'\times p}$, where $l$, $p$, and $N'$ represent respectively the number of state variables, the number of control variables, and the time series length. This transformation can be made permutation-equivariant by employing suitable layers, which implies that the output remains equivariant to permutations of the set elements. This can be accomplished by defining the neural network layers as in Eq. 7, as demonstrated by [4].

\left\{\begin{aligned}
\mathbf{f(x)} &\doteq \sigma(\lambda\,\mathbf{I}\mathbf{x}+\gamma\,\mathbf{pool(x)})\\
\sigma &:~~\text{nonlinear activation function}\\
\lambda,~\gamma &:~~\text{learnable variables}\\
\mathbf{pool(x)} &:~~\text{max, sum, mean ...}
\end{aligned}\right. \qquad (7)

Following this encoding, a pooling transformation (such as max, mean, sum, etc.) is applied along the time dimension, which yields a set of latent variables $Z\in\mathbb{R}^{F\times L}$, as defined in Eq. 8:

\left\{\begin{aligned}
\textbf{Encoder}(X_{jk},U_{jk}) &= \textbf{Pool}(\textbf{DNN}(\{X_{jk}(t_{0}),~\ldots,~X_{jk}(t_{N'}),~U_{jk}(t_{0}),~\ldots,~U_{jk}(t_{N'})\})) = Z\\
\textbf{Decoder}(Z) &= \textbf{DNN}(Z) = \Xi_{jk},
\end{aligned}\right. \qquad (8)

where $F$ in $Z\in\mathbb{R}^{F\times L}$ represents the number of encoded features. This operation enables the model to manage inputs of variable shape, thereby achieving a permutation-invariant (Eq. 9) encoding.

f(\{x_{1},\cdots,x_{n}\})=f(\{x_{\pi(1)},\cdots,x_{\pi(n)}\})\qquad\text{for any permutation }\pi \qquad (9)

The final step involves decoding the latent vector to obtain the desired output. In this application, we will only utilize the invariant property.
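The sketch below gives a minimal PyTorch implementation of this encoder-pool-decoder structure; the layer widths and the mean pooling are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class DeepSet(nn.Module):
    def __init__(self, n_features=4, latent=128, n_coeffs=3):
        super().__init__()
        self.encoder = nn.Sequential(      # applied independently to every time sample
            nn.Linear(n_features, latent), nn.ReLU(),
            nn.Linear(latent, latent), nn.ReLU(),
        )
        self.decoder = nn.Sequential(      # maps the pooled latent vector to the coefficients
            nn.Linear(latent, latent), nn.ReLU(),
            nn.Linear(latent, n_coeffs),
        )

    def forward(self, x):                  # x: (batch, time, n_features), time length may vary
        z = self.encoder(x)                # (batch, time, latent), permutation-equivariant
        z = z.mean(dim=1)                  # pooling over time -> permutation-invariant
        return self.decoder(z)             # (batch, n_coeffs)

xi_hat = DeepSet()(torch.randn(8, 50, 4))  # the same model accepts any sequence length
```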

Refer to caption
Figure 5: Deep Set Architecture : the encoder is a DNN followed by a pooling transformation (max, mean, min …), which is then decoded with another DNN.

3.2 Set Transformer

The Set Transformer architecture (Fig. 6) is based on the transformer model [24]. The motivation behind the use of this architecture lies in its ability to encode links and pairwise interactions among the elements of sets through the attention mechanism. The architecture is primarily built on layers called self-attention layers, constructed to be permutation-equivariant.

The attention mechanism can be defined by Eq. 10:

\mathbf{Att}(Q,K,V;\sigma)=\sigma(QK^{T})\,V. \qquad (10)

where $Q$ is a matrix of $n$ set elements of the same dimension $d_{q}$, and $K$ and $V$, referred to as the Keys and Values, are two matrices of set elements with the same number of elements as each other. The dimension of $K$ is the same as that of $Q$, while $V$ has a different dimension $d_{v}$. The first factor, $\sigma(QK^{T})$, measures the similarity between the two sets of matrices, where $\sigma$ usually denotes the softmax function applied to the product to convert it into a probability distribution. The result of this product is a weight matrix, which is used to assign different importance to the elements of the matrix $V$.
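In code, this operation reduces to a softmax-weighted matrix product, as in the short illustrative sketch below (the shapes are assumptions for the example).

```python
import torch

def attention(Q, K, V):
    # Q: (n_q, d_q), K: (n_kv, d_q), V: (n_kv, d_v)
    weights = torch.softmax(Q @ K.T, dim=-1)   # sigma(Q K^T): similarity of queries and keys
    return weights @ V                          # weighted combination of the values

out = attention(torch.randn(5, 8), torch.randn(10, 8), torch.randn(10, 16))  # shape (5, 16)
```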

Refer to caption
Figure 6: Set Transformer Architecture: the encoder is a succession of self-attention layers followed by a pooling layer, which is then decoded with self-attention layers and a DNN.

Utilizing this mechanism empowers the model to capture relevant patterns and dependencies in the data. To achieve this objective, the mechanism is extended to Multi-head Attention (Eq. 11) defined by three transformations.

\left\{\begin{aligned}
\mathbf{MultiHead}(Q,K,V;\sigma) &= \mathrm{concatenation}(O_{1},\cdots,O_{j},\cdots,O_{L})\,W^{o}\\
O_{j} &= \mathbf{Att}(QW^{q}_{j},KW^{k}_{j},VW^{v}_{j};\sigma_{j})
\end{aligned}\right. \qquad (11)

The initial step involves projecting the $Q$, $K$, and $V$ matrices to higher dimensions $d_{q}^{H}=d_{k}^{H}$ and $d_{v}^{H}$ using learnable parameters $W^{q}_{j},W^{k}_{j}\in\mathbb{R}^{d_{k}\times d_{k}^{H}}$ and $W^{v}_{j}\in\mathbb{R}^{d_{v}\times d_{v}^{H}}$. Subsequently, the attention mechanism is applied to the $L$ resulting projections, referred to as attention heads $O_{j}$. These heads are then concatenated and multiplied by a trainable matrix $W^{o}\in\mathbb{R}^{Ld_{v}^{H}\times d}$. In this application, we consider $d_{q}=d_{v}=d$ and $d_{q}^{H}=d_{v}^{H}=d/L$. The Set Transformer proposes set operation blocks, utilizing the Multi-Head Attention mechanism to encode the data and subsequently aggregate and decode the encoded features. The data is encoded through Multihead Attention Blocks (MAB) and Set-Attention Blocks (SAB). Given two input set matrices $X,Y\in\mathbb{R}^{n\times d}$, the MAB is defined as:

\left\{\begin{aligned}
\mathbf{MAB}(X,Y) &= \mathbf{Normalization}(H+\mathbf{rFF}(H)),\\
H &= \mathbf{Normalization}(X+\mathbf{MultiHead}(X,Y,Y)).
\end{aligned}\right. \qquad (12)

The $\mathbf{Normalization}$ is layer normalization [29], and $\mathbf{rFF}$ represents a row-wise feed-forward neural network without activation functions. The SAB block is defined as:

\mathbf{SAB}(X)=\mathbf{MAB}(X,X) \qquad (13)

where the Keys and Queries are identical, employing the self-attention mechanism to capture interactions between input features. The encoding is then defined by a succession of SAB blocks (Eq. 14).

\mathbf{ENCODER}(X)=\mathbf{SAB}(\mathbf{SAB}(X)) \qquad (14)

The output of the latter is determined by the function $F:X\to Z\in\mathbb{R}^{n\times d}$, where $X\in\mathbb{R}^{n\times d_{x}}$. This output is subsequently aggregated through the PMA (Pooling by Multihead Attention) block defined as:

\left\{\begin{aligned}
\mathbf{DECODER}(Z) &= \mathbf{rFF}(\mathbf{SAB}(\mathbf{PMA}(Z))),\\
\mathbf{PMA}(Z) &= \mathbf{MAB}(S,\mathbf{rFF}(Z)).
\end{aligned}\right. \qquad (15)

where $S\in\mathbb{R}^{k\times d}$ is a trainable matrix, with $k$ being the number of desired output features (1 in our case). This allows for an aggregation based on the importance of the set elements rather than equal importance (as in max, mean, etc., aggregation). Following this transformation, a SAB is applied to better capture the connections between the $k$ output features (Eq. 15).
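A compact sketch of these blocks, written with PyTorch's built-in multi-head attention and illustrative dimensions, is given below; it follows Eqs. 12-15 but is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MAB(nn.Module):
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.att = nn.MultiheadAttention(d, heads, batch_first=True)
        self.rff = nn.Linear(d, d)                      # row-wise feed-forward (rFF)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, X, Y):
        H = self.norm1(X + self.att(X, Y, Y)[0])        # H = Norm(X + MultiHead(X, Y, Y))
        return self.norm2(H + self.rff(H))              # MAB = Norm(H + rFF(H))

class SAB(nn.Module):                                   # self-attention: keys = queries
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.mab = MAB(d, heads)

    def forward(self, X):
        return self.mab(X, X)

class PMA(nn.Module):                                   # pooling by multi-head attention
    def __init__(self, d=64, heads=4, k=1):
        super().__init__()
        self.S = nn.Parameter(torch.randn(1, k, d))     # k learnable seed vectors
        self.mab, self.rff = MAB(d, heads), nn.Linear(d, d)

    def forward(self, Z):
        return self.mab(self.S.expand(Z.size(0), -1, -1), self.rff(Z))
```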

Considering the quadratic time complexity $O(n^{2})$ of the Set Attention Block (SAB), where $n$ represents the number of set elements, [5] introduced an improved version named the Induced Set Attention Block (ISAB). The ISAB offers reduced computational complexity while preserving the same performance.

The architectures of both models are more complex compared to OASIS, which may raise concerns, particularly regarding overfitting problems. We address these concerns in some examples in the next section. The selection of hyperparameters can be managed using classical testing and validation approaches, which we have employed in this work. In the following section, we demonstrate that this complexity offers significant advantages through three different applications. Additionally, we provide solutions and perspectives to mitigate potential overfitting issues.

4 Results

4.1 Tests cases

To demonstrate the advantages of Set Encoding methods, we present three test cases with three different systems. In Section 4.2, we present a comparison of the OASIS framework and the Set Transformer for online nonlinear dynamics identification. The Lotka–Volterra system [30] (Eq. 16) is used to compare these two methods.

\left\{\begin{aligned}
\dot{x} &= \alpha x-\beta xy\\
\dot{y} &= \delta xy-\gamma y+c
\end{aligned}\right. \qquad (16)

First, we generate a dataset with various initial conditions and control variables. We then divide the time domain into multiple sub domains, and SINDYc is subsequently employed to identify the local dynamics model for each sub domain. Both methods are then trained to capture the local dynamics using the time series and their corresponding identified local SINDYc models.

While OASIS methods are limited to online local system identification, Set Encoding methods can also be applied to global nonlinear dynamics identification. In Section 4.3, we compare the two Set Encoding methods. These methods can handle variable input sizes, which makes it possible to capture the true dynamics from a sequence of data. Furthermore, they can capture the interactions between input data, all of which helps enhance the identification process. To study the usefulness of these methods, the Set Transformer and Deep Set are compared on the Lorenz system (Eq. 17).

\left\{\begin{aligned}
\dot{x} &= \sigma\,(y(t)-x(t))\\
\dot{y} &= \rho\,x(t)-y(t)-x(t)z(t)\\
\dot{z} &= x(t)y(t)-\beta z(t)
\end{aligned}\right. \qquad (17)

Monitoring and control of industrial processes are generally based on fixing the theoretical parameters of the process and material properties. However, these parameters can change during the process. Therefore, it is important to use sensor data to identify these parameters in real-time. If any deviations are detected, control systems can be used to correct the process by adjusting the parameters, thereby improving its performance. To address these challenges, Set Encoders can be applied for industrial process monitoring. For example, in the thermo-stamping process, a thin composite plate is heated to make it less rigid before being formed using a specific mold. Using real-time sensor data, the Set Encoder can approximate the material properties during the heating step, which helps, for example, to ensure the temperature is homogeneous throughout the depth of the plate to avoid defects during forming. In this application, we begin with a preliminary study of 1D heating using synthetic data before applying it to real data of the heating step in the thermo-stamping process in future work.
In Section 4.4, we study the 1D heat-transfer problem (a PDE example) using a Deep Set model. We train the model to approximate the diffusivity and to characterize an abnormality, defined as a local decrease in diffusivity, using noisy data. The equation is given by Eq. 18:

\left\{\begin{aligned}
\dot{T} &= \frac{\partial}{\partial z}\!\left(\alpha(z)\,\frac{\partial T}{\partial z}\right)\\
-\frac{\partial T}{\partial z}\Big|_{z=0} &= \frac{q_{0}}{k(T)}\\
\frac{\partial T}{\partial z}\Big|_{z=L} &= 0\\
T(t=0,z) &= T_{0}
\end{aligned}\right. \qquad (18)

4.2 Application 1 : local model identifications (OASIS vs Set Transformer)

4.2.1 Data generation

The Lotka–Volterra system (Eq. 16) is used to compare these two methods, where $x$ and $y$ respectively represent the populations of prey and predators, and $c$ is a control variable that limits the rate of increase or decrease of the predators. The parameters $\alpha=0.5$, $\beta=0.025$, $\delta=0.5$, and $\gamma=0.005$ represent respectively the rates of increase or decrease of prey and predators, as well as the effect of predators on prey (and vice versa). An example of a solution of Eq. 16 is given in Fig. 7.

Refer to caption
Figure 7: Example of a Lotka–Volterra dynamics solution for $\alpha=0.5$, $\beta=0.025$, $\delta=0.5$, and $\gamma=0.005$ with a constant control variable $c=0$, $x(t_{0})=27.5$ and $y(t_{0})=10.0$.

The dataset is generated through various combinations of the control variable ($c$) and the initial conditions $x_{0}$ and $y_{0}$, using the Sobol [31, 32] sampling method as given in Tab. 2. The data generation is performed using the LSODA [32] method from the SciPy Python package. In this comparison (Fig. 8), 278 datasets are generated, of which 166 are used for model training and the remaining 112 are employed for model validation (to mitigate overfitting issues) and as a test set to compare the two methods.
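The sketch below illustrates this generation step under the ranges of Tab. 2 (Sobol samples of (c, x0, y0) integrated with LSODA); the sample count and time discretization are illustrative assumptions.

```python
import numpy as np
from scipy.stats import qmc
from scipy.integrate import solve_ivp

alpha, beta, delta, gamma = 0.5, 0.025, 0.5, 0.005

def lotka_volterra(t, X, c):
    x, y = X
    return [alpha * x - beta * x * y, delta * x * y - gamma * y + c]

# Sobol samples of (c, x0, y0) scaled to the ranges of Tab. 2.
samples = qmc.scale(qmc.Sobol(d=3, scramble=True).random(256),
                    [-1.0, 5.0, 5.0], [5.0, 50.0, 15.0])

t_eval = np.linspace(0.0, 30.0, 300)
dataset = [
    solve_ivp(lotka_volterra, (0.0, 30.0), [x0, y0], args=(c,),
              method="LSODA", t_eval=t_eval).y.T            # (time, 2) trajectory
    for c, x0, y0 in samples
]
```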

Table 2: Range of variables
Variable                          Range
c : control variable              [-1; 5]
x0 : initial condition for x      [5; 50]
y0 : initial condition for y      [5; 15]
t : time interval                 [0; 30]
Refer to caption
Figure 8: Left: distribution of the initial values $(x(t_{0}),~y(t_{0}))$ for training, validation, and test. Right: distribution of the control value $c$ for training, validation, and test.

Each dataset is divided into several sub domains over time. SINDYc is then used to identify the local models of the different sub domains, which may differ from the global dynamics but can be used within a short time interval. In this case, we assume that there is no initial information about the search library. The identification is performed using the PySINDy Python package [33]. We use a basis of polynomials of degree 3 defined as:

\left\{\begin{aligned}
\frac{\mathrm{d}\widetilde{X}}{\mathrm{d}t} &= \Theta(\widetilde{X},C)~\Xi\\
\Theta(\widetilde{X},C) &= [1,~X,~Y,~\ldots,~Y^{3},~Y^{2}C,~YC^{2},~C^{3}]\in\mathbb{R}^{N\times 20}\\
\widetilde{X} &= [X,~Y]
\end{aligned}\right. \qquad (19)

where $\Theta(\widetilde{X},C)\in\mathbb{R}^{N\times 20}$ is the search library, $C\in\mathbb{R}^{N\times 1}$ represents the control variable, and $\Xi\in\mathbb{R}^{2\times 20}$ is a sparse tensor representing the parameters to be fitted.
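A minimal sketch of this per-sub-domain identification is shown below, assuming fixed windows of 10 samples and an illustrative sparsity threshold (the exact settings used in this work may differ).

```python
import numpy as np
import pysindy as ps

def local_models(X, C, t, window=10):
    # X: (n_time, 2) states [x, y], C: (n_time, 1) control, t: (n_time,) time stamps
    library = ps.PolynomialLibrary(degree=3)          # 20 candidate terms in (x, y, c)
    coeffs = []
    for start in range(0, len(t) - window + 1, window):
        sl = slice(start, start + window)
        model = ps.SINDy(optimizer=ps.STLSQ(threshold=0.05), feature_library=library)
        model.fit(X[sl], t=t[sl], u=C[sl])
        coeffs.append(model.coefficients())           # local Xi for this sub domain, shape (2, 20)
    return np.stack(coeffs)                           # (n_subdomains, 2, 20)
```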

4.2.2 Training process and Criteria for comparison

Given the utility and significance of these methods for real-time control and defect detection, it is crucial to evaluate their extrapolation capabilities. For a fair comparison, both methods underwent training on the same dataset and were then assessed and validated using a separate portion designated for validation. Tab. 3 summarizes the training process for both methods. The error function is defined as the sum of three parts: the first represents the difference between the true parameters $\Xi$ and the predicted parameters $\hat{\Xi}$; the second represents the local dynamics loss, obtained by calculating the difference between the known derivatives of $X$ and $Y$ and $\Theta(\widetilde{X},C)\hat{\Xi}$, where $\Theta(\widetilde{X},C)$ is known; and the final term represents the regularization, contributing to the effective generalization of the models. Both methods are trained using an identical optimizer and a fixed number of epochs, incorporating variable learning rates during the training process. A noteworthy distinction pertains to the input of the two models: the first maps a single time step to $\hat{\Xi}$, whereas the Set Transformer encodes a temporal series of variable length. We also train the models to predict the dynamics of the subsequent sub domain based on the current sub domain's time series $\widetilde{X}_{t_{i}}$ and control variable $C_{t_{i}}$, as defined in Eq. 20:

F(\widetilde{X}_{t_{i}},C_{t_{i}})=\hat{\Xi}_{t_{i}+\Delta t}\qquad\textbf{where}\qquad\widetilde{X}_{t_{i}}\in\mathbb{R}^{N'\times 2},~\widetilde{C}_{t_{i}}\in\mathbb{R}^{N'\times 1},~\hat{\Xi}_{t_{i}+\Delta t}\in\mathbb{R}^{20\times 2} \qquad (20)

where $N'$ is the number of time snapshots, with $N'=1$ for the OASIS framework and $N'=\Delta t=10$ for the Set Transformer. This choice helps achieve better extrapolation results.

Table 3: Training process summary
                  Set Transformer                                              OASIS
Optimizer         Adam
Loss function     $\mathcal{L}_{\Xi}+\lambda_{0}\mathcal{L}_{ode}+\lambda_{1}\mathcal{L}_{reg}=\left\|\Xi-\hat{\Xi}\right\|_{2}+\lambda_{0}\left\|\frac{d\widetilde{X}}{dt}-\Theta(\widetilde{X},C)\hat{\Xi}\right\|_{1}+\lambda_{1}\left\|W\right\|_{1}$
Learning rate     $[10^{-3},~4\times 10^{-4},~10^{-4},~10^{-5},~10^{-6},~10^{-7},~10^{-8}]$
Epochs            [2000, 2000, 4000, 2000, 2000, 2000, 2000]
Input shape       [Batch size, time, features]: [Batch size, 10, 4]            [Batch size, 1, 4]
Output shape      [Batch size, coefficients]: [Batch size, 20]

To ensure a fair and balanced comparison, we used hyper-parameter tuning for training both methods, varying hyper-parameters such as the number of hidden layers, the number of neurons per layer, the regularization coefficient $\lambda_{1}$, and the activation functions. We then select the best architecture by assessing performance through the validation error.

4.2.3 Comparison and Conclusion

The test set is not utilized during the training process; the validation portion is employed solely for selecting the best model during training. We also employ two models for each method to predict the coefficients of $\frac{dx}{dt}$ and $\frac{dy}{dt}$, which we believe facilitates the training process.

Refer to caption
Figure 9: Method for comparing OASIS and the Set Transformer: extrapolation using local dynamics identification and an ODE solver.

The models are then compared based on their ability to capture the local dynamics and extrapolate (Fig. 10) over a limited time range. This process is summarized in Fig. 9. The input of the Set Transformer is a sequence, while the input of OASIS is a single point in time. First, the models are used to identify the local dynamics $\hat{\Xi}^{j}_{t+\Delta t}$, which corresponds to the local dynamics of the following sub domain for the $j$-th test dataset. This prediction is then used, together with the initial values $\widetilde{X}^{j}_{t_{i}+\Delta t}$ and the control value $C_{t_{i}+\Delta t}$, to extrapolate using LSODA as the ODE solver. The output of the solver is a time series $\widetilde{X}^{j}_{prediction}=[x_{j}(t)_{p},~y_{j}(t)_{p}]$ with $t\in[t_{i}+\Delta t:t_{i}+2\Delta t]$, where $\Delta t$ corresponds to the sub domain time window. This output is then compared to the ground truth (simulation data) using the MAPE (Eq. 21 and Eq. 22).
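Before computing these errors, the extrapolation step itself can be sketched as follows; the `library` callable evaluating $\Theta(x, y, c)$ is a hypothetical helper introduced only for this example.

```python
import numpy as np
from scipy.integrate import solve_ivp

def extrapolate(xi_pred, library, x0, c, t_span, t_eval):
    # xi_pred: (20, 2) predicted local coefficients; library(x, y, c) -> (20,) evaluated candidates
    def rhs(t, X):
        return library(X[0], X[1], c) @ xi_pred        # dX/dt = Theta(X, C) Xi_hat
    sol = solve_ivp(rhs, t_span, x0, method="LSODA", t_eval=t_eval)
    return sol.y.T                                     # predicted time series [x(t)_p, y(t)_p]
```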

\textbf{MAPE}_{t_{i}+\Delta t}^{x_{j}}=\left\|100\times\frac{x_{j}(t)_{p}-x_{j}(t)_{gt}}{x_{j}(t)_{gt}}\right\|_{1}\quad\text{where}\quad t\in[t_{i}+\Delta t:t_{i}+2\Delta t]~~\text{and}~~j\in[1,N] \qquad (21)
\textbf{MAPE}_{t_{i}+\Delta t}^{y_{j}}=\left\|100\times\frac{y_{j}(t)_{p}-y_{j}(t)_{gt}}{y_{j}(t)_{gt}}\right\|_{1}\quad\text{where}\quad t\in[t_{i}+\Delta t:t_{i}+2\Delta t]~~\text{and}~~j\in[1,N] \qquad (22)

This metric is usually used to measure the prediction accuracy. It quantifies the average percentage deviation between the predicted solution and the ground truth, providing a measure of how well the predictions align with the actual outcomes. Based on the analysis by [34] of time series forecasting evaluation, MAPE is a suitable metric in the absence of outliers and intermittence. We also utilize sMAPE, which is more suitable in the presence of outliers. This metric quantifies the percentage difference between predicted and actual values, with adjustments made to ensure equal treatment of both small and large values in the evaluation process (see Appendix A.1).
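For reference, minimal helper functions for these two scores (as used in Eqs. 21-24 and Appendix A.1) could look as follows; they are illustrative implementations rather than the evaluation code of the paper.

```python
import numpy as np

def mape(pred, truth):
    # mean absolute percentage error between predicted and ground-truth series
    return float(np.mean(np.abs(100.0 * (pred - truth) / truth)))

def smape(pred, truth):
    # symmetric variant, less sensitive to outliers and small denominators
    return float(np.mean(200.0 * np.abs(pred - truth) / (np.abs(pred) + np.abs(truth))))
```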

The two models are trained using fixed sub domain windows (sequences of 10 time points, referred to as $\Delta t$), as explained in Sections 4.2.2 and 4.2.1, and are then compared using the test dataset with three different sub domain windows (10, 15 and 20 points) to evaluate the ability of the models to generalize and to capture the local dynamics.
Applying the same logic as explained earlier, we compute the total MAPE for each dataset (Eq. 23 and Eq. 24).

\textbf{MAPE}_{Total}^{x_{j}}=\left\|100\times\frac{x_{j}(t)_{p}-x_{j}(t)_{gt}}{x_{j}(t)_{gt}}\right\|_{1}\quad\text{where}\quad t\in[\Delta t:T_{f}]~~\text{and}~~j\in[1,N] \qquad (23)
\textbf{MAPE}_{Total}^{y_{j}}=\left\|100\times\frac{y_{j}(t)_{p}-y_{j}(t)_{gt}}{y_{j}(t)_{gt}}\right\|_{1}\quad\text{where}\quad t\in[\Delta t:T_{f}]~~\text{and}~~j\in[1,N] \qquad (24)

This metric is then used to compare the two methods. Based on [34], MAPE is not a suitable metric in the presence of outliers. To ensure a fair comparison, we first eliminate outlier results using the normal Z-score (or six-sigma rule). This means that if the MAPE of a dataset is significantly better or worse compared to the others, it is not included in the comparison of the two methods. Then we calculate the average of $\textup{MAPE}^{x_{j}}$ and $\textup{MAPE}^{y_{j}}$ over all test datasets for $x(t)$ and $y(t)$ respectively. In addition, we calculate the median and the 90th percentile, which provides a more comprehensive understanding of the accuracy of the models. We conduct the same comparison without eliminating outlier results, using the sMAPE metric, which is more suitable in the presence of outliers (see Appendix A.1).

Refer to caption
Figure 10: Example of extrapolation using the Set Transformer (blue) and OASIS (red) compared to the ground truth (HF simulation results) in black.

During these tests, we observed that OASIS and the Set Transformer may fail to predict the local dynamics, which can result in divergence. In some cases, this divergence is acceptable (Samples 11 and 21 in Fig. 10), but we also found that the model can completely fail to capture the dynamics, leading to infinite divergences. In such cases, the dataset is not included in the comparison using the MAPE metric. To account for this, we added a divergence metric to highlight each model's ability to avoid complete divergences.
The results are presented in Tabs. 4, 5 and 6, which correspond to the three sub domain window sizes explained previously ($\Delta t_{10}$, $\Delta t_{15}$ and $\Delta t_{20}$), with four different extrapolation windows. This means that based on one sub domain window, we try to extrapolate over larger windows, which we refer to as extrapolation windows.

The primary observation concerns the difference in divergence between the Set Transformer and OASIS frameworks, which increases with larger sub domain windows, OASIS reaching 40-55% of infinite divergences. In contrast, the Set Transformer remains stable with significantly lower divergence percentages. This behavior is understandable because the set encoding architecture attempts to encode all the time series data to generate predictions; additionally, as the quantity of data increases, so does the accuracy. We can also see that both models give very acceptable results for small extrapolation windows, which is highly related to the training process of the two models, where we focus on predicting only the next sub domain dynamics. The average MAPE remains good for large extrapolation windows ($2\times\Delta t$, $3\times\Delta t$ and $4\times\Delta t$). The 90th percentile, which represents the maximum MAPE error observed in 90% of the test sets, increases significantly for OASIS, while it increases reasonably for the Set Transformer, which maintains acceptable extrapolation.

Table 4: Comparison between the two methods, Set Transformer and OASIS, in terms of MAPE error for extrapolation through local dynamics identification with a time window of 10 steps
                     mean               90%                divergence (%)
                SETTR    OASIS     SETTR    OASIS     SETTR    OASIS
1×Δt_10   x      0.43     1.01      0.76     3.75      5.36    44.64
          y      0.67     1.90      1.07     3.27      5.36    44.64
2×Δt_10   x      1.35     2.35      3.63     9.73      5.36    42.86
          y      3.65     6.31     17.01    21.60      5.36    42.86
3×Δt_10   x      1.93     3.21      4.21    10.28      7.14    41.07
          y      3.28     6.24     11.06    19.00      7.14    41.07
4×Δt_10   x      4.09     5.44     12.20    15.40      3.57    39.29
          y      5.34     8.45     18.76    24.83      3.57    39.29
Table 5: Comparison between the two methods, Set Transformer and OASIS, in terms of MAPE error for extrapolation through local dynamics identification with a time window of 15 steps
                     mean               90%                divergence (%)
                SETTR    OASIS     SETTR    OASIS     SETTR    OASIS
1×Δt_15   x      1.00     0.83      1.54     1.22      7.14    51.79
          y      0.88     2.92      1.50     4.63      7.14    51.79
2×Δt_15   x      2.41     2.81      3.86     5.12      7.14    51.79
          y      2.26     5.49      2.75     8.98      7.14    51.79
3×Δt_15   x      3.49     5.11      5.84    11.79      5.36    44.64
          y      2.88     7.83      5.56    19.44      5.36    44.64
4×Δt_15   x      6.24     8.09     14.70    22.14      3.57    44.64
          y      5.20    11.80     16.38    38.38      3.57    44.64
Table 6: Comparison between the two methods, Set Transformer and OASIS, in terms of MAPE error for extrapolation through local dynamics identification with a time window of 20 steps
                     mean               90%                divergence (%)
                SETTR    OASIS     SETTR    OASIS     SETTR    OASIS
1×Δt_20   x      2.42     1.09      3.99     1.84      7.14    55.36
          y      1.03     3.89      1.78     6.15      7.14    55.36
2×Δt_20   x      5.10     4.41      8.79     9.15      7.14    50.00
          y      2.08     7.14      3.79    11.86      7.14    50.00
3×Δt_20   x      9.90    10.17     15.92    23.57      7.14    44.64
          y      6.51    14.75     16.66    39.65      7.14    44.64
4×Δt_20   x      9.28    10.25     14.16    24.59      3.57    53.57
          y      4.46     8.95      9.60    23.78      3.57    53.57

On the other hand, we concluded that adding the ODE loss $\mathcal{L}_{ode}=\left\|\frac{d\widetilde{X}}{dt}-\Theta(\widetilde{X},C)\hat{\Xi}\right\|_{1}$, as given in Tab. 3, significantly improves the results, especially in cases where the dynamics terms are unknown or only partially known, as well as in cases where the goal is to identify local dynamics valid over short time windows. Fig. 11 shows the ODE loss for training and validation using the Set Transformer method with the same architecture, when $\mathcal{L}_{ode}$ is used with $\lambda=1$ and when it is not used ($\lambda=0$). The difference in convergence is easy to observe. This difference can be explained by the fact that $\mathcal{L}_{ode}$ assigns weighted attention to each coefficient: if the contribution of a coefficient is significant, its loss $\mathcal{L}_{ode}$ will also be significant compared to other coefficients, and vice versa.

Refer to caption
Figure 11: Loss comparison when using $\mathcal{L}_{\text{ode}}$ with $\lambda=1$ versus when $\mathcal{L}_{\text{ode}}$ is not used ($\lambda=0$), for the Set Transformer method with the same architecture.

Based on this comparison, we can see the importance of using Set Encoding for local dynamics identification, which can be utilized for extrapolation, as demonstrated in the example above, or for other use cases such as predictive control. We demonstrated that OASIS may fail to capture the local dynamics, resulting in inaccurate results and divergence, while the Set Transformer remains stable and leverages all available data to predict the local dynamics. These results could be improved by using variable sub domain windows during the training process, which would give the Set Transformer method the ability to generate more accurate results and enhance extrapolation capabilities. This approach is demonstrated in the next example, where we aim to identify the global dynamics using an expanding time window. Furthermore, using this approach with noisy data is challenging, owing to the high sensitivity of SINDy to noise [35]. Additionally, in this case, we only aim to identify local dynamics, which makes the identification process more challenging. Moreover, in extrapolation use cases, it is essential to update the model frequently to obtain accurate results. Using the predictions of the Set Transformer (or OASIS) and then an ODE solver to extrapolate poses two challenges. The first is related to the training process, which involves the generation of training data and the use of SINDy to identify local dynamics from noisy data, which is challenging. Furthermore, even slight noise in the initial conditions may lead to inaccurate extrapolation or complete divergence.

4.3 Application 2 : Global model identifications (Deep Set vs Set Transformer)

4.3.1 Data generation

We generate a dataset of 1024 Lorenz systems with variable parameters $[\sigma,~\rho,~\beta]$ and initial conditions $[x_{0},~y_{0},~z_{0}]$ as described in Tab. 7, using the LSODA [32] method implemented in the SciPy Python package. The parameters and initial conditions are sampled using the Sobol [31, 32] method implemented in the SciPy Python package. As the number of possible combinations of parameters and initial conditions is large, and taking into consideration the chaotic behavior of the Lorenz system, we use 70% of the data to train the model and the remaining 30% for validation and testing.

Refer to caption
Figure 12: Example of the Lorenz system with different noise levels: left 0% ($\xi=0$), middle 2% ($\xi=0.02$), and right 5% ($\xi=0.05$).
Table 7: Range of variables
Variable                      Range
σ variations                  [8; 12]
ρ variations                  [20; 35]
β variations                  [0; 4]
x0 variations                 [-5; 5]
y0 variations                 [4; 50]
z0 variations                 [5; 15]
t (train and validation)      [0, 10[
t (test)                      [0, 20[

To study the robustness of the models with respect to data quality, we generate three datasets with three levels of Gaussian noise, $\xi\in[0,~0.02,~0.05]$, as described in Eq. 25:

\left\{\begin{aligned}
x(t) &= x(t)+\xi\epsilon_{x}\quad\text{where}\quad\epsilon_{x}\sim\mathcal{N}(0,\,\sigma^{2}_{x})\\
y(t) &= y(t)+\xi\epsilon_{y}\quad\text{where}\quad\epsilon_{y}\sim\mathcal{N}(0,\,\sigma^{2}_{y})\\
z(t) &= z(t)+\xi\epsilon_{z}\quad\text{where}\quad\epsilon_{z}\sim\mathcal{N}(0,\,\sigma^{2}_{z})
\end{aligned}\right. \qquad (25)
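A small sketch of this corruption step is given below, under the assumption that $\sigma_{x}$, $\sigma_{y}$, $\sigma_{z}$ are the standard deviations of each state variable over the trajectory.

```python
import numpy as np

def add_noise(X, xi, rng=np.random.default_rng(0)):
    # X: (n_time, 3) clean Lorenz trajectory; xi in {0, 0.02, 0.05} (Eq. 25)
    return X + xi * rng.normal(0.0, X.std(axis=0), size=X.shape)
```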

Fig. 12 shows an example with the three levels of noise for the same parameters and initial conditions. We then compute the approximation of the time derivative $\widetilde{\frac{\mathrm{d}X}{\mathrm{d}t}}=\left[\widetilde{\frac{\mathrm{d}x}{\mathrm{d}t}},~\widetilde{\frac{\mathrm{d}y}{\mathrm{d}t}},~\widetilde{\frac{\mathrm{d}z}{\mathrm{d}t}}\right]$ using the total variation method [27] for noisy data and central differences (Eq. 26) for clean data.

f^{\prime}(x)\simeq\frac{f(x+h)-f(x-h)}{2h} \qquad (26)

Afterwards, we compute $\widetilde{X}=\left[\widetilde{x},~\widetilde{y},~\widetilde{z}\right]$, which represents the approximation of the integral of $\widetilde{\frac{\partial X}{\partial t}}$ using the cumulative trapezoidal integration method (Eq. 27) (with the SciPy Python package).

\widetilde{X}=\int_{a}^{b}\widetilde{\frac{\partial X}{\partial t}}\,dt\simeq\frac{1}{2}\sum_{n=1}^{N-1}(t_{n+1}-t_{n})\left[\widetilde{\frac{\partial X}{\partial t}}_{n+1}+\widetilde{\frac{\partial X}{\partial t}}_{n}\right] \qquad (27)

This approach contributes to noise reduction in the data (Fig. 13) and also enables the models to learn effective denoising (further explanations are given in Section 4.3.2). PySINDy (Python package) is then used to compute the approximation of the parameters for each dataset. In this application we assume that the ODE search library $\Theta$ is known and well defined, and our main focus is on calculating the parameters $[\sigma_{k},~\rho_{k},~\beta_{k}]$. The identification process is defined as in Eq. 28.

Refer to caption
Figure 13: Example of the Lorenz system: comparison of noisy data and denoised data using the Total Variation method
\left[\begin{matrix}\widetilde{\frac{\mathrm{d}x}{\mathrm{d}t}}\\ \widetilde{\frac{\mathrm{d}y}{\mathrm{d}t}}+\widetilde{y}+\widetilde{x}\widetilde{z}\\ \widetilde{\frac{\mathrm{d}z}{\mathrm{d}t}}-\widetilde{x}\widetilde{y}\end{matrix}\right]=\left[\widetilde{x}-\widetilde{y},~\widetilde{y},~\widetilde{z}\right]\widetilde{\Xi}=\Theta(\widetilde{x},~\widetilde{y},~\widetilde{z})\,\widetilde{\Xi} \qquad (28)
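A sketch of the derivative/re-integration preprocessing of Eqs. 26-27 is given below for the clean-data branch (central differences); for noisy data, the total variation regularized derivative of [27] would replace the gradient call.

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

def smooth_by_reintegration(X, t):
    # X: (n_time, 3) trajectory, t: (n_time,) time stamps
    dX_dt = np.gradient(X, t, axis=0)                              # central differences (Eq. 26)
    X_tilde = cumulative_trapezoid(dX_dt, t, axis=0, initial=0.0)  # cumulative trapezoid (Eq. 27)
    return dX_dt, X_tilde + X[0]                                   # re-add the initial condition
```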

4.3.2 Train process and criteria of comparison

We implement a consistent training regime for both models (Deep Set and Set Transformer), employing several architectures for each method. Following this, we compare the optimal architecture configurations of the two methods. A core principle underlying the use of Set Encoding models is adaptability, enabling predictions that dynamically adjust to an increasing volume of data over time. We therefore employ a sliding window approach with increasing widths. Through this approach, the models are designed to adaptively incorporate new information to update the predictions of the Lorenz parameters.

Table 8: Training process summary
Deep Set Set Transformer
Optimizer Adam
Loss function Ξ~Ξ^2+λ0dX~dtΘ(X~,C)Ξ^1\left\|\widetilde{\Xi}-\hat{\Xi}\right\|_{2}+\lambda_{0}\left\|\frac{d\widetilde{X}}{dt}-\Theta(\widetilde{X},C)\hat{\Xi}\right\|_{1}
Learning rate [4×104,5×105,105,5×106,106]\left[4\times 10^{-4},~5\times 10^{-5},~10^{-5},~5\times 10^{-6},~10^{-6}\right]
Epochs [400, 400, 400, 400, 400]
Input Shape [Batchsize,time,Features]=[Batchsize,variable,4]\left[Batchsize,~time,~Features\right]=[Batchsize,~variable,~4]
Output Shape [Batchsize,3]\left[Batchsize,~3\right]
\botrule

The model inputs are [t, Xk(t)] = [t, xk(t), yk(t), zk(t)], which represent the noisy data. We add the time dimension, which is encoded during the pooling phase; this enables the model to better comprehend the dynamics as the input size grows and to maintain good accuracy with variable time steps Δt. The output is defined as [σk, ρk, βk]. The training process is summarized in Tab. 8. We employ the Adam [36] optimizer for training both models, with a step-decay learning rate. The loss function comprises two components. The first is the parameter loss, where Ξ̃ and Ξ̂ denote, respectively, the parameters fitted with SINDy (Eq. 28) on the denoised data and the predicted parameters. Note that Ξ̃ may differ from Ξ, the true parameter vector. The second is the ODE loss, computed from the denoised variables; in this study we fix λ0 to 1. Training both methods to predict the denoised parameters enables the models to additionally learn denoising while processing noisy input data. Both methods are trained and validated on the same dataset. The best model is selected based on its validation accuracy; the best model instances are then chosen for a comparative analysis between the two methods (for more details see appendix A.4).
The comparison of the instances of the two models is based on their ability to deliver accurate and stable predictions while accounting for variability in both the quality and the quantity of data. Note that we trained individual instances to predict each parameter of the Lorenz system, which helps the models converge faster.
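For clarity, the composite loss of Tab. 8 can be sketched as follows for a single sample, with λ0 = 1 as in the text. This NumPy form is only illustrative of the two terms (it is not the training implementation), and Theta_X stands for the evaluated library terms Θ(X̃, C).

```python
import numpy as np

def composite_loss(xi_pred, xi_sindy, dXdt_denoised, Theta_X, lambda_0=1.0):
    """Parameter loss (L2) + ODE residual loss (L1), cf. Tab. 8."""
    parameter_loss = np.linalg.norm(xi_sindy - xi_pred, ord=2)
    ode_residual = dXdt_denoised - Theta_X @ xi_pred   # residual of the governing equation
    ode_loss = np.linalg.norm(np.ravel(ode_residual), ord=1)
    return parameter_loss + lambda_0 * ode_loss
```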

4.3.3 Comparison and discussion

The comparison of the instances is based on three criteria:

  1.

    Accuracy : the accuracy of the selected instances is assessed using the R²-score metric [37], defined as:

    R2=1i(yiy^i)2i(yiy¯i)2,R^{2}=1-\frac{\sum_{i}(y_{i}-\hat{y}_{i})^{2}}{\sum_{i}(y_{i}-\bar{y}_{i})^{2}}, (29)

    which measures how well the predicted values match the actual values; in our context, it quantifies the precision of the predicted Lorenz parameters. The closer the R² score is to 1, the better the accuracy. It is important to note that our comparison evaluates the model predictions against the true Lorenz parameters, not the SINDy-fitted parameters Ξ̃. To assess the instances' capability to provide accurate predictions with variable input sizes, we employ the weighted average R²_{w.a} (a minimal computation sketch is given after this list), defined as:

    Rw.a2=i1/niFTRni2whereFT=i1/ni,R^{2}_{w.a}=\sum_{i}\frac{1/n_{i}}{F_{T}}R^{2}_{n_{i}}~~~\textup{where}~~F_{T}=\sum_{i}1/n_{i}, (30)

    which assigns more importance to predictions based on fewer data points, where n_i ∈ [100, 300, 500, 700, 900]. This study uses an expanding window approach with two windows: one for the interpolation study, where t ∈ [0, 10], and another for extrapolation, where t ∈ [10, 20], which is never used during training or validation.

    Results of R²_{w.a} are presented in Fig. 14. The overall performance of both methods is good for both interpolation and extrapolation (R²_{w.a} > 0.8). The accuracy drops for extrapolation, which is related to new Lorenz patterns unseen during the training process, and to the time dimension used during training being limited to t ∈ [0, 10]. In interpolation, the Deep Set method yields slightly better results, while in extrapolation, the Set Transformer method tends to be more stable (results change little). An exception is the prediction of σ, for which the accuracy drops for both methods. This can be explained by the data generation process (Eq. 28), where the prediction of σ is based on x − y, whereas the predictions of β and ρ are based directly on y and z, respectively, which are inputs of the models and therefore easier to learn. Furthermore, in extrapolation, Set Transformer performs better in the prediction of σ, while in interpolation, Deep Set is slightly better.

    We can also observe that, for both interpolation and extrapolation, both methods demonstrate very good robustness to the different noise levels. The R²_{n_i} results with respect to the quantity of data are presented in appendix A.2 (Figs. 18, 19, and 20).

    Refer to caption
    Figure 14: Accuracy results using Rw.a2R^{2}_{w.a} metric for Deep Set and Set Transformer instances for σ\sigma, β\beta and ρ\rho. On the left interpolation results for t[0,10[t\in[0,10[, and on the right extrapolation results for t[10,20[t\in[10,20[.
  2.

    Inference time : both trained models can be used directly for online applications, since the inference time is of the order of a few milliseconds on an AMD EPYC 9474F 48-core processor at 3.6 GHz. This makes these models suitable for applications such as the extrapolation of state variables [22] or online control [3]. These methods can also dynamically use new data to improve the accuracy of the prediction or to reevaluate it in case of changes in the system dynamics.

  3.

    Training : based on the comparison of both methods, we observed that Deep Set is easier to implement than Set Transformer, as it relies on simple dense layers, resulting in fewer hyperparameters and thus easier hyperparameter tuning. More details about the instance selection are provided in Appendix A.4.
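The weighted-average metric of Eq. 30 referenced in the first criterion can be sketched as follows, using scikit-learn's r2_score; the dictionary-based interface and window sizes below are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score

def weighted_r2(y_true_by_n, y_pred_by_n, sizes=(100, 300, 500, 700, 900)):
    """Eq. 30: R^2 per window size n_i, weighted by 1/n_i and normalized by F_T."""
    weights = np.array([1.0 / n for n in sizes])
    weights /= weights.sum()                       # normalization by F_T
    scores = [r2_score(y_true_by_n[n], y_pred_by_n[n]) for n in sizes]
    return float(np.dot(weights, scores))
```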

4.4 Application 3: Characterizing Abnormalities in 1D Heat-Transfer Using Deep Set Model

4.4.1 Data generation

For this application we aim to train a Deep Set model to approximate online the diffusivity value α from temperature data; we also challenge the model to characterize a local decrease of diffusivity. We start by generating a dataset using a finite difference method (implicit Euler) with different diffusivity values α. The 1D heat equation is defined as in Eq. 18, where α(z) is space-dependent in this case. The initial temperature is T(t=0, z) = 20 °C, L = 0.01 m, and the heat flux is defined as follows (Eq. 31):

{Tz|z=0=q0k=30000K/mifT(z=0)<TmaxTz|z=0=0K/mifT(z=0)>Tmax\left\{\begin{matrix}-\frac{\partial T}{\partial z}|_{z=0}&=&\frac{q_{0}}{k}=30000~K/m~~~&\textbf{if}&~~T(z=0)<T_{max}\\ -\frac{\partial T}{\partial z}|_{z=0}&=&0~K/m~~~&\textbf{if}&~~T(z=0)>T_{max}\\ \end{matrix}\right. (31)

where Tmax = 500 °C, which simply means that we stop the heating as soon as T > Tmax. The diffusivity α(z) is defined as in Fig. 15; this function can be summarized by three characteristics: αref, the nominal diffusivity value; G = center_abnormality, the center of the abnormality; and ratio = αref/αabnormality, the ratio between the normal and abnormal diffusivities. The function α(z) is defined in Eq. 32, where A and B are two constants that define αabnormality:

α(z)=αrefAeB(Gz)6\alpha(z)=\alpha_{ref}-Ae^{-B(\mathrm{G}-z)^{6}} (32)
Refer to caption
Figure 15: Example of the function diffusivity α(z)\alpha(z) with respect to space zz, which allows the characterization of abnormalities.
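A minimal sketch of this data-generation step is given below: an implicit-Euler, conservative finite-difference solver with the space-dependent α(z) of Eq. 32 and the switched flux of Eq. 31. The adiabatic condition at z = L, the grid size, the time step, the abnormality constants A and B, and all variable names are assumptions made for illustration.

```python
import numpy as np

L, nz, dt, nt = 0.01, 101, 1.0, 200
z = np.linspace(0.0, L, nz)
dz = z[1] - z[0]
T_max, g = 500.0, 30000.0                       # Eq. 31: cutoff (deg C) and imposed gradient (K/m)

alpha_ref, A_abn, B_abn, G = 1e-6, 9.8e-7, 1e16, 0.005
alpha = alpha_ref - A_abn * np.exp(-B_abn * (G - z) ** 6)   # Eq. 32

# face diffusivities and the conservative diffusion operator
a_face = 0.5 * (alpha[:-1] + alpha[1:])
M = np.zeros((nz, nz))
for i in range(nz):
    a_w = a_face[i - 1] if i > 0 else 0.0       # imposed flux at z = 0 enters via the source term
    a_e = a_face[i] if i < nz - 1 else 0.0      # assumed adiabatic at z = L
    M[i, i] = -(a_w + a_e) / dz ** 2
    if i > 0:
        M[i, i - 1] = a_w / dz ** 2
    if i < nz - 1:
        M[i, i + 1] = a_e / dz ** 2

T = np.full(nz, 20.0)                           # initial temperature (deg C)
history = [T.copy()]
identity = np.eye(nz)
for _ in range(nt):
    b = np.zeros(nz)
    if T[0] < T_max:                            # flux switch of Eq. 31
        b[0] = alpha[0] * g / dz
    T = np.linalg.solve(identity - dt * M, T + dt * b)   # implicit Euler step
    history.append(T.copy())
history = np.array(history)                     # shape (nt + 1, nz)
```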

We generate N = 300 datasets with different combinations of the three characteristics defined previously; 70% of these datasets are used for training and 30% for validation. We also generate, separately, Ntest = 150 test datasets to evaluate the performance of the model. The distributions of the training, validation, and test characteristics are given in Tab. 9:

Table 9: Range of characteristics for train, validation and test
Variables Range Distribution
αref\alpha_{ref} [8×107;1.2×1068\times 10^{-7};~1.2\times 10^{-6}] 𝒩(106,(5×108)2)\mathcal{N}(10^{-6},~~(5\times 10^{-8})^{2})
ratio\mathrm{ratio} [40;8040;~80] 𝒩(60,(8)2)\mathcal{N}(60,~~(8)^{2})
G\mathrm{G} [0.001;0.0090.001;~0.009] Sobol
Time tt [0;2000;~200] -
\botrule

4.4.2 Training process

Using these datasets, we train a Deep Set model whose input Xi combines time series of temperature at the two positions z = 0 and z = L = 0.01 m and the corresponding times t (example in Fig. 16), with Δt = 1 s. In this study we limit the temperature information given to the model to these two positions only. The input Xi corresponding to the i-th dataset is defined as (Eq. 33):

Xi=[zi,ti,Ti]=[zi,1ti,1T(ti,1,zi,1)zi,1ti,2T(ti,2,zi,1)zi,kti,jT(ti,j,zi,k)zi,kti,j+1T(ti,j+1,zi,k)zi,Lti,T1T(ti,T1,zi,L)zi,Lti,TT(ti,T,zi,L)]wherek{1,L},j[1,T]X_{i}~=~\left[z_{i},~t_{i},~T_{i}\right]~=~\begin{bmatrix}z_{i,1}&t_{i,1}&T(t_{i,1},~z_{i,1})\\ z_{i,1}&t_{i,2}&T(t_{i,2},~z_{i,1})\\ &\cdots&\\ &\cdots&\\ z_{i,k}&t_{i,j}&T(t_{i,j},~z_{i,k})\\ z_{i,k}&t_{i,j+1}&T(t_{i,j+1},~z_{i,k})\\ &\cdots&\\ &\cdots&\\ z_{i,L}&t_{i,T-1}&T(t_{i,T-1},~z_{i,L})\\ z_{i,L}&t_{i,T}&T(t_{i,T},~z_{i,L})\end{bmatrix}~~\text{where}~~k\in\{1,L\},~~j\in[1,T] (33)

Thus the model input and output can be defined as F(Xi; Θ) = [center_i, ratio_i, α_i], where Θ are the trainable parameters of the Deep Set model. Details about the architecture used are given in appendix A.5. We use a time-expanding window, which gives the model a flexible approximation of the characteristics from the available information and improves as new data becomes available. The loss function is defined as (Eq. 34):

{Total=center+ratio+αref+λΘcenter=GtrueGprediction2ratio=ratiotrueratioprediction2αref=αtrueαprediction2Θ=Θ1\left\{\begin{matrix}\mathscr{L}_{Total}&=&\mathscr{L}_{\text{center}}+\mathscr{L}_{\text{ratio}}+\mathscr{L}_{\alpha_{ref}}+\lambda\mathscr{L}_{\Theta}\\ \mathscr{L}_{\text{center}}&=&\left\|\mathrm{G}_{true}-\mathrm{G}_{prediction}\right\|_{2}\\ \mathscr{L}_{\text{ratio}}&=&\left\|\mathrm{ratio}_{true}-\mathrm{ratio}_{prediction}\right\|_{2}\\ \mathscr{L}_{\alpha_{ref}}&=&\left\|\alpha_{true}-\alpha_{prediction}\right\|_{2}\\ \mathscr{L}_{\Theta}&=&\left\|\Theta\right\|_{1}\\ \end{matrix}\right. (34)

It is composed of four terms: ℒ_center, ℒ_ratio, ℒ_αref, and ℒ_Θ, which correspond to the losses on each characteristic and to the regularization of the model, the latter helping to avoid overfitting. In this application we found that λ = 0.01 gives better generalization.

Refer to caption
Figure 16: Example of the input temperature with noise of the Deep Set model.
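As an illustration, the input of Eq. 33 (of which Fig. 16 shows an example) can be assembled from the two monitored positions as sketched below. Here `history` stands for a simulated temperature field of shape (n_times, n_z), as produced for instance by the data-generation sketch above, and the helper name and signature are hypothetical.

```python
import numpy as np

def build_input(history, z_sensors=(0.0, 0.01), dt=1.0, n_available=None):
    """Stack [z, t, T] rows for the two sensors (Eq. 33), optionally truncated
    to the first n_available time snapshots (expanding-window setting)."""
    n_times = history.shape[0] if n_available is None else n_available
    t = np.arange(n_times) * dt
    rows = []
    for zk, col in zip(z_sensors, (0, -1)):      # sensor at z = 0 and at z = L
        T_k = history[:n_times, col]
        rows.append(np.column_stack([np.full(n_times, zk), t, T_k]))
    return np.concatenate(rows, axis=0)          # shape (2 * n_times, 3): [z, t, T]
```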

We also study the robustness of the model to noise. For that, we test the model for four noise levels, defined in Eq. 35, where ξ ∈ [0, 0.02, 0.05, 0.1] and i ∈ [0, N]:

Ti(t)=Ti(t)+ξϵTwhereϵT𝒩(0,σTi2)T_{i}(t)=T_{i}(t)+\xi\epsilon_{T}\quad\text{where}\quad\epsilon_{T}\sim\mathcal{N}(0,\,\sigma^{2}_{T_{i}}) (35)

4.4.3 Results and discussion

We trained different models for the four noise levels ξ, then selected the best model based on the loss on the validation datasets; the selected models are then evaluated on the test datasets. The selected model architecture is given in appendix A.5. We assess the model accuracy with respect to the quantity of data and its robustness to noise using the R² metric (Eq. 29), using the same temperature positions as for training and validation. The results are:

  • Accuracy and robustness : We use the same accuracy assessment Rw.a2R^{2}_{w.a} as in application 4.3 :

    Rw.a2=i1/niFTRni2whereFT=i1/ni,R^{2}_{w.a}=\sum_{i}\frac{1/n_{i}}{F_{T}}R^{2}_{n_{i}}~~~\textup{where}~~F_{T}=\sum_{i}1/n_{i}, (36)

    For this application, we test the models for different data quantities, where t ∈ [0, 200] and n_i ∈ [50, 100, 150, 200].

    Refer to caption
    Figure 17: Accuracy results using Rw.a2R^{2}_{w.a} metric for Deep Set for abnormalities characterization : G\mathrm{G}, αref\alpha_{ref} and ratio\mathrm{ratio}, for 4 different noise levels from ξ=0.0\xi=0.0 to ξ=0.1\xi=0.1

    From these results (Figs. 17 and 21), we can see that the model easily predicts the center G of the abnormality and remains stable with respect to noise. The accuracy of the prediction of αref is also good and decreases reasonably as the noise increases; we believe this is related to the added noise, which makes the precise estimation of the diffusivity difficult. It is also related to the fact that the range of diffusivity is small, whereas the noise levels correspond to a rather large temperature perturbation ΔT_noise (for ξ = 0.1, ΔT_noise ∈ [−40, 40]), which confuses the training of the model. Finally, the prediction of the ratio is acceptable for low noise levels but degrades as the noise level increases. We believe this is due to the lack of information (only two temperature positions) and to the confusion caused by the noise. We also believe that increasing the number N of datasets would improve the model accuracy. More details on the accuracy of the model with respect to the quantity of data are given in appendix A.3.

  • Discussion : in this application we focused on the Deep Set architecture; nevertheless, Set Transformer, which gives very good results for the chaotic Lorenz system, could also be applied here and is expected to perform well. In addition, we did not add a physics loss ℒ_PDE, simply because approximating the diffusion term ∂/∂z(α(z) ∂T/∂z) in the PDE (Eq. 18) is not an easy task: with only two separated positions, approximating the second derivative with respect to space is challenging. This results in less robustness to noise compared to the second application. This problem will be addressed in a subsequent study, where we will attempt to improve this method for application to PDE problems.

5 Conclusion and perspectives

In this paper, we presented a combination of two Set Encoding architectures with the SINDy and SINDYc methods. The Set Encoding models are first trained on a dataset generated by SINDy (or SINDYc) and subsequently employed in real time to predict system dynamics parameters quickly and accurately, which can be used for applications such as extrapolating state variables or system control. Our primary contribution lies in encoding variable-sized data sequences using an appropriate architecture, thereby enabling the model to better exploit all the temporal information to enhance precision. Additionally, we propose a more suitable loss function: in application 1, we illustrate the need to incorporate the ℒ_ode component in the loss function, which enhances the model training process.

Our approach can be applied to local identification and to global dynamics, as demonstrated in applications 2 and 3. In the second application, we propose a training approach on noisy data, demonstrating that both methods yield robust and stable results for different noise levels. However, this approach relies on datasets generated by SINDy, implying that accuracy depends directly on the precision of SINDy (or SINDYc). In the third application, we addressed a PDE example and showed that the model gives interesting results. Nevertheless, the generation of the training datasets is more challenging, especially when dealing with limited and noisy data in the context of partial differential equation (PDE) systems, where computing derivatives in the presence of noise is particularly difficult. To address this issue, an approach akin to PINN-SR [17] could be employed, extending our methodology to handle ODEs and PDEs with noisy and limited data.

In addition, the current Set Encoding architecture cannot handle cases where terms of the differential equations are prone to deletion or addition in real-time applications; that is, the models cannot activate or deactivate terms. To address this challenge, one could integrate a new decoding block with an appropriate activation function (such as a sigmoid), responsible for activating and deactivating dynamic system terms. Finally, we believe that a comparison between Deep Set and Set Transformer for online nonlinear dynamics identification is necessary to better understand the difference between these two methods. While Deep Set encodes data using a classical neural network architecture, Set Transformer employs attention mechanisms, which enable better learning of the interactions among input data and have shown interesting results in many application fields. However, these advantages come with drawbacks such as increased training time and architectural complexity.

Appendix A Appendix

A.1 Using sMAPE metrics for comparison between OASIS and Set Transformer

We present the results based on the sMAPE metric (Eqs. 37 and 38). This metric is better suited to evaluating time-series forecasts containing outliers; in this case, it is not necessary to eliminate outliers, and otherwise we follow the same approach as in section 4.2.2. According to the results presented in Tabs. 10, 11 and 12, we observe a close resemblance to the results obtained using MAPE with outlier elimination.

sMAPETotalxj=200×xj(t)pxj(t)gt|xj(t)gt|+|xj(t)p|1wheret[Δt:Tf]andj[1,N]\textbf{sMAPE}_{Total}^{x_{j}}=\left\|200\times\frac{x_{j}(t)_{p}-x_{j}(t)_{gt}}{\lvert x_{j}(t)_{gt}\rvert+\lvert x_{j}(t)_{p}\rvert}\right\|_{1}~~\textup{where}~~t\in[\Delta t:T_{f}]~~\textup{and}~~j\in[1,N] (37)
sMAPETotalyj=200×yj(t)pyj(t)gt|yj(t)gt|+|yj(t)p|1wheret[Δt:Tf]andj[1,N]\textbf{sMAPE}_{Total}^{y_{j}}=\left\|200\times\frac{y_{j}(t)_{p}-y_{j}(t)_{gt}}{\lvert y_{j}(t)_{gt}\rvert+\lvert y_{j}(t)_{p}\rvert}\right\|_{1}~~\textup{where}~~t\in[\Delta t:T_{f}]~~\textup{and}~~j\in[1,N] (38)
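For reference, a minimal sketch of this metric is given below, written as a mean over the forecast horizon; the function name is illustrative.

```python
import numpy as np

def smape(pred, gt):
    """Symmetric MAPE in percent (Eqs. 37-38), averaged over the forecast horizon."""
    return float(np.mean(200.0 * np.abs(pred - gt) / (np.abs(gt) + np.abs(pred))))
```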
Table 10: Comparison between the two methods Set-Transformer and OASIS in terms of sMAPE error for extrapolation through local dynamics identification with a time window of 10 steps
mean 90% divergence (%)
SETTR OASIS SETTR OASIS SETTR OASIS
1×Δt10\times\Delta t_{10} xx 0.47 1.23 1.11 3.58 5.36 44.64
yy 1.46 3.61 1.23 6.06 5.36 44.64
2×Δt10\times\Delta t_{10} xx 1.96 3.40 6.07 12.20 5.36 42.86
yy 4.45 6.44 18.69 19.46 5.36 42.86
3×Δt10\times\Delta t_{10} xx 2.31 4.23 5.78 13.59 7.14 41.07
yy 4.16 6.23 14.40 18.15 7.14 41.07
4×Δt10\times\Delta t_{10} xx 4.51 5.87 13.44 16.14 3.57 39.29
yy 6.85 9.07 24.34 21.02 3.57 39.29
\botrule
Table 11: Comparison between the two methods Set-Transformer and OASIS in terms of sMAPE error for extrapolation through local dynamics identification with a time window of 15 steps
mean 90% divergence (%)
SETTR OASIS SETTR OASIS SETTR OASIS
1×Δt15\times\Delta t_{15} xx 1.14 1.76 1.87 3.01 7.14 51.79
yy 1.84 4.62 1.81 8.41 7.14 51.79
2×Δt15\times\Delta t_{15} xx 2.69 3.78 5.07 10.55 7.14 51.79
yy 3.37 5.76 7.32 12.71 7.14 51.79
3×Δt15\times\Delta t_{15} xx 4.60 6.13 6.03 13.73 5.36 44.64
yy 3.71 8.48 9.53 20.58 5.36 44.64
4×Δt15\times\Delta t_{15} xx 7.40 9.56 14.88 24.25 3.57 44.64
yy 7.05 10.83 18.18 31.84 3.57 44.64
\botrule
Table 12: Comparison between the two methods Set-Transformer and OASIS in terms of sMAPE error for extrapolation through local dynamics identification with a time window of 20 steps
mean 90% divergence (%)
SETTR OASIS SETTR OASIS SETTR OASIS
1×Δt20\times\Delta t_{20} xx 2.63 2.66 3.92 2.24 7.14 55.36
yy 2.79 5.81 2.05 7.52 7.14 55.36
2×Δt20\times\Delta t_{20} xx 6.21 7.35 11.32 24.77 7.14 50.00
yy 4.77 9.57 9.95 21.72 7.14 50.00
3×Δt20\times\Delta t_{20} xx 9.82 11.83 23.89 25.61 7.14 44.64
yy 7.08 12.76 18.22 33.02 7.14 44.64
4×Δt20\times\Delta t_{20} xx 9.88 12.51 16.45 35.09 3.57 53.57
yy 4.92 9.72 12.82 23.82 3.57 53.57
\botrule

A.2 Detailed accuracy results for the comparison between Deep Set and Set Transformer

Refer to caption
Figure 18: Accuracy comparison of σ\sigma prediction between Deep Set and Set Transformer using the R2R^{2} metric for the test dataset as a function of data quantity with different noise levels: 0 on the left, 0.02 in the center, and 0.05 on the right for interpolation (in the top) where t[0,10]t\in[0,~10] and for extrapolation (in the bottom) where t[10,20]t\in[10,~20].
Refer to caption
Figure 19: Accuracy comparison of β\beta prediction between Deep Set and Set Transformer using the R2R^{2} metric for the test dataset as a function of data quantity with different noise levels: 0 on the left, 0.02 in the center, and 0.05 on the right for interpolation (in the top) where t[0,10]t\in[0,~10] and for extrapolation (in the bottom) where t[10,20]t\in[10,~20].
Refer to caption
Figure 20: Accuracy comparison of ρ\rho prediction between Deep Set and Set Transformer using the R2R^{2} metric for the test dataset as a function of data quantity with different noise levels: 0 on the left, 0.02 in the center, and 0.05 on the right for interpolation (in the top) where t[0,10]t\in[0,~10] and for extrapolation (in the bottom) where t[10,20]t\in[10,~20].

A.3 Detailed model accuracy with respect to data quantity and noise levels

Refer to caption
Figure 21: Accuracy results using the R2R^{2} metric and noise robustness with respect to data quantity for the Deep Set model for abnormalities characterization: G\mathrm{G}, αref\alpha_{ref}, and ratio\mathrm{ratio}, for four different noise levels ranging from ξ=0.0\xi=0.0 to ξ=0.1\xi=0.1.

A.4 Instance selection for the comparison between Deep Set and Set Transformer

The two Set Encoding methods are trained independently on different architectures, considering three levels of noise and the three parameters of the Lorenz system (σ, β, and ρ), with variable time windows, and are validated on the validation dataset during training. A check-pointing process saves an instance of the architecture whenever the validation error (Tab. 8) reaches a new minimum. The same process is conducted with different architectures for both methods, and the best instances are selected for each method, noise level, and Lorenz parameter. These instances are then used to compare the two methods. The selected architectures are presented in Tabs. 13 and 14.

A.4.1 Deep Set

The Deep Set model, as presented in section 3, can be summarized into three parts:

  • The first part is the encoding part, which takes as input the state variables of the Lorenz system [x(t),y(t),z(t)][x(t),~y(t),~z(t)] and the time dimension tt. It projects these state variables into a higher dimension using a simple MLP. The architecture of this MLP is defined in (Tab. 13), where we have 5 layers with 320 perceptrons each and a ReLU activation function.

  • The second part consists of choosing a pooling function over the time dimension. In this application, we found that applying the absolute value function followed by mean aggregation gives better results. The output of the aggregation function is a timeless feature vector.

  • The last part consists of decoding the feature vector into the dynamics parameters. In this application, we used an MLP with the same architecture as for encoding.

Table 13: Selected architecture for Deep Set method for all noise levels and all the parameters of the Lorenz system
activation Layers Pooling Trainable parameter count
Encode ReLU 320×5\times 5
Decode ReLU 320×5\times 5 abs, mean 927043
\botrule
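A compact sketch of this selected Deep Set instance (Tab. 13) is given below, written in PyTorch for brevity; the original implementation details (framework, masking of the variable-length input, exact output head) may differ.

```python
import torch
import torch.nn as nn

class DeepSet(nn.Module):
    def __init__(self, n_features=4, width=320, depth=5, n_outputs=1):
        super().__init__()
        enc, d = [], n_features
        for _ in range(depth):                    # encoder phi: 5 x 320, ReLU
            enc += [nn.Linear(d, width), nn.ReLU()]
            d = width
        self.encoder = nn.Sequential(*enc)
        dec = []
        for _ in range(depth):                    # decoder rho: 5 x 320, ReLU
            dec += [nn.Linear(width, width), nn.ReLU()]
        self.decoder = nn.Sequential(*dec, nn.Linear(width, n_outputs))

    def forward(self, x):                         # x: (batch, time, [t, x, y, z])
        h = self.encoder(x)                       # per-snapshot embedding
        h = h.abs().mean(dim=1)                   # abs + mean pooling over the time axis
        return self.decoder(h)                    # one Lorenz parameter per instance

model = DeepSet()
y = model(torch.randn(8, 500, 4))                 # works for any number of time snapshots
```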

A.4.2 Set Transformer

The Set Transformer architecture is more sophisticated compared to Deep Set, as it uses the attention mechanism, which means different encoding, pooling, and decoding approaches. Nevertheless, we can also summarize it as a three-part model:

  • As for the Deep Set, this part consists of encoding the Lorenz system variables. For that, we use ISAB (section 3.2), which performs attention-based encoding over the time snapshots and the state variables of the Lorenz system. These variables are projected into a higher dimension d=45, and the attention mechanism is then applied with respect to a trainable matrix II called inducing points [5], with m=128 rows and d=45 columns. This means that ISAB starts by projecting the nn time snapshots onto mm snapshots and then performs the attention mechanism with respect to the state variables of the Lorenz system. This method decreases the computation time while providing similar results; more details are given in [5]. For the attention mechanism, we use h=40 attention heads (Tab. 14). In this part, we also apply a feedforward neural network to the output of the attention mechanism, adding more nonlinearity; for this application, it is simply an MLP with 2 layers of 45 perceptrons each and a ReLU activation function.

  • The second part consists of applying pooling over the encoder feature vector, which is based on PMA, a Multi-Head Attention block that performs aggregation using the multi-head attention mechanism. For this, we used a pooling seed vector Sk×dS\in\mathbb{R}^{k\times d}, where d=40d=40 and k=1k=1.

  • Finally, we use an MLP with 2 layers, 40 perceptrons each, and a ReLU activation function, with a final layer of output dimension dout=3d_{out}=3 that corresponds to the 3 parameters of the Lorenz system.

Table 14: Selected architecture for Set Transformer method for all noise levels and all the parameters of the Lorenz system
Attention blocks rFF Trainable parameter count
Encode ISAB(d=45,m=128,h=40d=45,m=128,h=40) rFF(45, 2, ReLU)
Pooling PMA(k=1, d=40, h=40) rFF(40, 2, ReLU) 1045733
\botrule
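Similarly, the ISAB encoder and PMA pooling of Tab. 14 can be sketched as follows in PyTorch, following the MAB(X, MAB(I, X)) construction of [5]. The embedding dimension and head count are adjusted here so that the dimension is divisible by the number of heads, and the other details are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MAB(nn.Module):
    """Multihead Attention Block: LayerNorm(H + rFF(H)), H = LayerNorm(Q + Att(Q, K, V))."""
    def __init__(self, d, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, q, kv):
        h = self.ln1(q + self.attn(q, kv, kv, need_weights=False)[0])
        return self.ln2(h + self.ff(h))

class ISAB(nn.Module):
    def __init__(self, d=48, m=128, heads=8, n_features=4):
        super().__init__()
        self.proj = nn.Linear(n_features, d)
        self.I = nn.Parameter(torch.randn(1, m, d))            # inducing points
        self.mab1, self.mab2 = MAB(d, heads), MAB(d, heads)

    def forward(self, x):                                      # x: (batch, time, features)
        x = self.proj(x)
        h = self.mab1(self.I.expand(x.size(0), -1, -1), x)     # project n snapshots onto m
        return self.mab2(x, h)                                 # attend back to the full set

class PMA(nn.Module):
    def __init__(self, d=48, heads=8, k=1):
        super().__init__()
        self.S = nn.Parameter(torch.randn(1, k, d))            # pooling seed vector
        self.mab = MAB(d, heads)

    def forward(self, z):
        return self.mab(self.S.expand(z.size(0), -1, -1), z)   # (batch, k, d), timeless feature

encoder, pooling = ISAB(), PMA()
z = pooling(encoder(torch.randn(8, 500, 4)))                   # variable-length time input
```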

A.5 Deep Set Architecture for the Third Application

The selected architecture is defined in Tab. 15. As detailed in A.4, we select the best model using the validation datasets and a checkpoint that saves an instance whenever the validation loss reaches a new minimum. The selected architecture consists of three blocks. In the first block, we use an MLP with 5 layers of 256 perceptrons each and a GeLU activation function. For the second part, we use a sum aggregation function. The encoded vector is then decoded by three different MLPs with the same architecture as the encoder, each corresponding to one of the characteristics (G, ratio, and αref).

Table 15: Selected architecture for the Deep Set method for all noise levels in the third application
Activation Layers Pooling Trainable Parameter Count
Encode GeLU 256×\times5
Decode GeLU 3×\times256×\times5 sum 1262083
\botrule

Declarations

  • Funding : The Jules Verne Institute of Research and Technology (IRT Jules Verne) (PhD PERFORM Program).

  • Conflict of interest : We affirm that there are no known conflicts of interest that have influenced this work.

  • Ethics approval and consent to participate : We confirm that this research complies with the authors' ethical standards. All authors consent to participate in the submitted work.

  • Data availability Not applicable.

  • Code availability Not applicable.

  • Author contributions Not applicable.

References

  • \bibcommenthead
  • Brunton et al. [2016] Brunton, S.L., Proctor, J.L., Kutz, J.N.: Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the national academy of sciences 113(15), 3932–3937 (2016)
  • Rudin [2019] Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature machine intelligence 1(5), 206–215 (2019)
  • Bhadriraju et al. [2020] Bhadriraju, B., Bangi, M.S.F., Narasingam, A., Kwon, J.S.-I.: Operable adaptive sparse identification of systems: Application to chemical processes. AIChE Journal 66(11), 16980 (2020)
  • Zaheer et al. [2017] Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep sets. Advances in neural information processing systems 30 (2017)
  • Lee et al. [2019] Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., Teh, Y.W.: Set transformer: A framework for attention-based permutation-invariant neural networks. In: International Conference on Machine Learning, pp. 3744–3753 (2019). PMLR
  • Makke and Chawla [2024] Makke, N., Chawla, S.: Interpretable scientific discovery with symbolic regression: a review. Artificial Intelligence Review 57(1), 2 (2024)
  • Lejeune et al. [2023] Lejeune, C., Mothe, J., Soubki, A., Teste, O.: Data driven discovery of systems of ordinary differential equations using nonconvex multitask learning. Machine Learning 112(5), 1523–1549 (2023)
  • North et al. [2023] North, J.S., Wikle, C.K., Schliep, E.M.: A review of data-driven discovery for dynamic systems. International Statistical Review 91(3), 464–492 (2023)
  • Tibshirani [1996] Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology 58(1), 267–288 (1996)
  • Manucci et al. [2022] Manucci, M., Aguado, J.V., Borzacchiello, D.: Sparse data-driven quadrature rules via l p-quasi-norm minimization. Computational Methods in Applied Mathematics 22(2), 389–411 (2022)
  • Zheng et al. [2018] Zheng, P., Askham, T., Brunton, S.L., Kutz, J.N., Aravkin, A.Y.: A unified framework for sparse relaxed regularized regression: Sr3. IEEE Access 7, 1404–1423 (2018)
  • Long et al. [2019] Long, Z., Lu, Y., Dong, B.: Pde-net 2.0: Learning pdes from data with a numeric-symbolic hybrid deep network. Journal of Computational Physics 399, 108925 (2019)
  • Martius and Lampert [2016] Martius, G., Lampert, C.H.: Extrapolation and learning equations. arXiv preprint arXiv:1610.02995 (2016)
  • Chen et al. [2022] Chen, Y., Luo, Y., Liu, Q., Xu, H., Zhang, D.: Symbolic genetic algorithm for discovering open-form partial differential equations (sga-pde). Physical Review Research 4(2), 023174 (2022)
  • Cranmer [2023] Cranmer, M.: Interpretable machine learning for science with PySR and SymbolicRegression.jl. arXiv preprint arXiv:2305.01582 (2023)
  • Raissi et al. [2019] Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational physics 378, 686–707 (2019)
  • Chen et al. [2021] Chen, Z., Liu, Y., Sun, H.: Physics-informed learning of governing equations from scarce data. Nature communications 12(1), 6136 (2021)
  • Kaiser et al. [2018] Kaiser, E., Kutz, J.N., Brunton, S.L.: Sparse identification of nonlinear dynamics for model predictive control in the low-data limit. Proceedings of the Royal Society A 474(2219), 20180335 (2018)
  • Brunton et al. [2016] Brunton, S.L., Proctor, J.L., Kutz, J.N.: Sparse identification of nonlinear dynamics with control (sindyc). IFAC-PapersOnLine 49(18), 710–715 (2016)
  • Narasingam and Kwon [2017] Narasingam, A., Kwon, J.S.-I.: Development of local dynamic mode decomposition with control: Application to model predictive control of hydraulic fracturing. Computers & Chemical Engineering 106, 501–511 (2017)
  • Quade et al. [2018] Quade, M., Abel, M., Nathan Kutz, J., Brunton, S.L.: Sparse identification of nonlinear dynamics for rapid model recovery. Chaos: An Interdisciplinary Journal of Nonlinear Science 28(6) (2018)
  • Bhadriraju et al. [2021] Bhadriraju, B., Kwon, J.S.-I., Khan, F.: Oasis-p: Operable adaptive sparse identification of systems for fault prognosis of chemical processes. Journal of Process Control 107, 114–126 (2021)
  • Kamienny et al. [2022] Kamienny, P.-A., d’Ascoli, S., Lample, G., Charton, F.: End-to-end symbolic regression with transformers. Advances in Neural Information Processing Systems 35, 10269–10281 (2022)
  • Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • d’Ascoli et al. [2023] d’Ascoli, S., Becker, S., Mathis, A., Schwaller, P., Kilbertus, N.: Odeformer: Symbolic regression of dynamical systems with transformers. arXiv preprint arXiv:2310.05573 (2023)
  • Becker et al. [2023] Becker, S., Klein, M., Neitz, A., Parascandolo, G., Kilbertus, N.: Predicting ordinary differential equations with transformers. In: International Conference on Machine Learning, pp. 1978–2002 (2023). PMLR
  • Chartrand [2011] Chartrand, R.: Numerical differentiation of noisy, nonsmooth data. International Scholarly Research Notices 2011 (2011)
  • Rudy et al. [2017] Rudy, S.H., Brunton, S.L., Proctor, J.L., Kutz, J.N.: Data-driven discovery of partial differential equations. Science advances 3(4), 1602614 (2017)
  • Ba et al. [2016] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
  • Wangersky [1978] Wangersky, P.J.: Lotka-volterra population models. Annual Review of Ecology and Systematics 9(1), 189–218 (1978)
  • Sobol [1967] Sobol, I.: The distribution of points in a cube and the accurate evaluation of integrals (in Russian). Zh. Vychisl. Mat. i Mat. Fiz. 7, 784–802 (1967)
  • Virtanen et al. [2020] Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., et al.: Scipy 1.0: fundamental algorithms for scientific computing in python. Nature methods 17(3), 261–272 (2020)
  • de Silva et al. [2020] de Silva, B.M., Champion, K., Quade, M., Loiseau, J.-C., Kutz, J.N., Brunton, S.L.: PySINDy: a Python package for the sparse identification of nonlinear dynamics from data. arXiv preprint arXiv:2004.08424 (2020)
  • Hewamalage et al. [2023] Hewamalage, H., Ackermann, K., Bergmeir, C.: Forecast evaluation for data scientists: common pitfalls and best practices. Data Mining and Knowledge Discovery 37(2), 788–832 (2023)
  • Gelß et al. [2019] Gelß, P., Klus, S., Eisert, J., Schütte, C.: Multidimensional approximation of nonlinear dynamical systems. Journal of Computational and Nonlinear Dynamics 14(6), 061006 (2019)
  • Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • Chicco et al. [2021] Chicco, D., Warrens, M.J., Jurman, G.: The coefficient of determination r-squared is more informative than smape, mae, mape, mse and rmse in regression analysis evaluation. Peerj computer science 7, 623 (2021)
  • O’Malley et al. [2019] O’Malley, T., Bursztein, E., Long, J., Chollet, F., Jin, H., Invernizzi, L., et al.: KerasTuner. https://github.com/keras-team/keras-tuner (2019)
  • Fasel et al. [2021] Fasel, U., Kaiser, E., Kutz, J.N., Brunton, B.W., Brunton, S.L.: Sindy with control: A tutorial. In: 2021 60th IEEE Conference on Decision and Control (CDC), pp. 16–21 (2021). IEEE
  • Proctor et al. [2016] Proctor, J.L., Brunton, S.L., Kutz, J.N.: Dynamic mode decomposition with control. SIAM Journal on Applied Dynamical Systems 15(1), 142–161 (2016)