Unsupervised physics-informed neural network in reaction-diffusion biology models (Ulcerative colitis and Crohn's disease cases): A preliminary study
Abstract
We propose to explore the potential of physics-informed neural networks (PINNs) in solving a class of partial differential equations (PDEs) used to model the propagation of chronic inflammatory bowel diseases, such as Crohn's disease and ulcerative colitis. An unsupervised approach was favored during the training of the deep neural network. Given the complexity of the underlying biological system, characterized by intricate feedback loops and limited availability of high-quality data, the aim of this preliminary study is to assess what PINNs can deliver in this setting. In addition to this exploratory assessment, we also aim to emphasize the principles of reproducibility and transparency in our approach, with a specific focus on ensuring robustness and generalizability through the use of artificial intelligence. We quantify the relevance of the PINN method on several linear and non-linear PDEs relevant to biology. However, it is important to note that the final solution depends on the initial conditions, the chosen boundary conditions, and the neural network architecture.
1 This work was carried out under the scientific direction of the mathematician Pr. Hatem Zaag.
2 Corresponding author: Ahmed Rebai, ahmed.rebai@value.com.tn
Keywords: Unsupervised PINN, Deep Neural Networks, Coupled Nonlinear PDEs, IBD (Inflammatory Bowel Diseases), Ulcerative Colitis, Crohn’s Disease, Computer Vision, Machine Learning Classification, AI Reproducibility.
1 Introduction
The current work focuses on a new multidisciplinary field at the intersection of three disciplines: artificial intelligence (AI) via deep learning, applied mathematics via partial differential equations, and the biology of inflammatory bowel diseases (as illustrated in Figure 1). While writing this paper, we encountered several difficulties due to the unique nature of this multidisciplinary subject, which is both innovative and surrounded by AI hype, resulting in a genuine debate in the community between partisans who are optimistic about the potential of this new technique [1, 2] and skeptics who point out its limitations [3, 4]. For this reason, we believe it is best to begin with a brief overview of each discipline before discussing progress at the various intersections between these disciplines, followed by a discussion of the resulting controversy.
2 PINN in biology: crossroads of several disciplines
2.1 Artificial intelligence
In its general definition, artificial intelligence allows computers to partially or totally perform intelligent tasks usually associated with human intelligence. Nowadays, artificial intelligence learns and generalizes patterns in high-dimensional and highly non-linear spaces without being specifically guided [6, 7]. This learning process is based on various types of data (tabular data, image data, sound data, text data…) and has led to success in various fields, such as nuclear energy, where AI has recently been used to control a fusion reactor [8], and Earth science, where AI enabled short-term weather prediction within the "nowcasting" project using a deep reinforcement learning algorithm [9]. Continuing with this progress, in this paper we will see how neural networks can also learn the dynamics of a complex biological system from the structure of the partial differential equations describing these dynamics.
2.2 Partial differential equations
Many engineering fields use partial differential equations as models, including combustion theory, weather prediction, financial markets and industrial machine design. A partial differential equation or a system of partial differential equations can be solved analytically [10], numerically [11, 12] and now with artificial intelligence using techniques such as deep neural networks (DNNs) [1, 13]. In practice, the analytical method works for some simple equations, but its application is difficult in most cases of coupled and nonlinear PDE systems. Numerical resolution is therefore preferred and often requires expensive commercial solvers based on the finite element method (FEM) or the finite difference method (FDM). It can be summarized in five steps: modeling, meshing, discretizing, numerical computing and post-processing:
- Modeling: mathematical modeling of the physical, chemical or biological processes.
- Mesh: creation of a mesh or grid, also called the computational domain, which consists of an equivalent system of multiple sub-domains (finite differences, elements or volumes). Given the mesh, the basis functions are predetermined. This step is characterized by its great temporal complexity. At this level, the PINN method could offer a solution to reduce the execution time, since it only requires faster random sampling of the working domain.
- Discretization: discretize the governing equations by turning them into a system of algebraic equations, simply by approximating the derivatives. The functions used are linear or low-order polynomial functions, which sometimes do not capture the non-linear character of the underlying phenomena. Given the non-linear nature of neural networks, they could help overcome this deficiency.
- Solution: solve the resulting set of linear equations by numerical computing, using extensive parallel IT resources such as CPUs, GPUs and large amounts of RAM.
- Post-processing: find the desired quantities, such as position or velocity, by analyzing the obtained data. The use of machine learning models can make this analysis more refined and robust.
As previously stated, numerical methods have some drawbacks, such as high time consumption, repetitiveness and lack of autonomy. In fact, creating a mesh to simulate an airplane turbo-reactor can take months in some industrial cases. Numerical solving is also repetitive, because the five steps must be reproduced each time the domain is changed. Furthermore, unlike AI models, this procedure does not learn from previous trials, even if we keep the same domain or the same grid. Moreover, the basis functions do not always capture the non-linear phenomena that cause real or artificial blow-up, such as numerical explosion [14]. Finally, when we consider how difficult it is to reduce human intervention, it is clear that this technique lacks the autonomy sought in the normal use of artificial intelligence. As a result, several unanswered questions arise: Is it possible for solvers to learn the basis functions from partial differential equations automatically? Is it feasible to develop autonomous flow solvers for fluid mechanics?
2.3 Biology of the inflammatory bowel diseases
Now let us move to biology, the science of living organisms, extending from the molecular level to mesoscopic ecosystems. In this paper, we focus on the modelling of the inflammatory process affecting the bowel. Crohn's disease and ulcerative colitis are both inflammatory bowel diseases, but they differ in important ways [15, 16]. Ulcerative colitis (UC) is a chronic inflammatory bowel disease resulting from an overreaction of the natural defenses of the digestive immune system, with an estimated prevalence of 1 in 1,500 people and an annual incidence of 6 to 8 new cases per 100,000 inhabitants in Australia [17], Western Europe and the United States. In Tunisia, the incidence is estimated at 2.11 per 100,000 inhabitants per year [18, 19]. UC is not a rare disease among Tunisian adults, although it is among children. It is characterized by smooth ulceration of the inner lining of the colon. The inflammation begins in the lower region of the colon, just above the anus, and progresses upward over varying distances. One of the most important indicators of the severity of this disease is the spatial distribution of the intestinal lesions, combined with a severity score introduced by gastroenterologists. While individuals with moderate to high severity scores have a concentration of lesions around the rectum, those with low severity scores frequently have a homogeneous spatial distribution of colonic lesions. UC manifests as lesions such as bleeding rectal and colonic ulcers. It is currently an incurable disease characterized by inflammatory relapses of varying intensity interspersed with remission periods. This puts the patient at a higher risk of colon cancer than the general population, and may ultimately require removal of the organ (colectomy). Currently available treatments aim to control pain, reduce the frequency and duration of relapses, and thereby relieve symptoms.

Crohn's disease is a painful type of inflammatory bowel disease (IBD) that is not well understood. In Tunisia, this serious disease affects both children over 10 and adults [18]. It consists of the appearance of several asymmetrical segments of deep lesions separated by intact areas. In the worst cases, these areas can turn into fissures or even holes in the wall of the intestine. Unlike other IBDs, it can affect any part of the gastrointestinal tract, from top (the mouth) to bottom (the anus), in contiguous or isolated segments. The inflammation can affect the inner lining and even extend through the entire thickness of the intestinal wall; it manifests as blood vessel dilation and tissue fluid loss. It is usually present in the lower part of the small intestine that connects to the colon. The inflamed portions of the intestine can reach the deep panniculus and are not necessarily adjacent to one another, but rather distributed throughout the gastrointestinal tract, with an erratic inflammation pattern. The diagnosis of this disease requires advanced technological tools, which makes it difficult to collect the data needed to predict its spread. For this reason, mathematical modeling has been increasingly utilized as a tool to understand the complex and dynamic processes involved in both diseases, as shown in [5, 20].
In the evaluation and management of both Crohn’s disease and ulcerative colitis, doctors typically use a combination of biological, clinical, and spatial indicators to assess a patient’s condition, predict its progression, and determine the most appropriate treatment. Clinical indicators may include a physician’s examination and questioning of the patient, as well as video examination of the colon through colonoscopy. Biopsy samples taken during colonoscopy can also provide valuable histological images. In addition, biological or chemical indicators such as the measurement of calprotectin levels in stool (as an indicator of inflammation) and analysis of the intestinal microbiota through DNA and RNA analysis can provide important insights into the disease. Additionally, analysis of RNA expression in the intestine can also be used as an indicator.
| Data type | Data requirements | Tasks |
|---|---|---|
| Clinical data | Doctor's questionnaires | Classification |
| Biological data | Physico-clinical analysis | Scoring and Classification |
| Images and videos data | Computer vision treatment | Classification and PDE |
2.4 The importance of spatial information
Gastro-enterologists and surgeons lack spatial information on the anatomical sites involved, since the indicators above are not spatial and the information they provide is never localized at a specific position. The diagnosis of these diseases is based on the analysis of colonoscopy videos. Physicians therefore assess the severity of the disease according to the presence of inflammation, bleeding or ulcers on the intestinal wall, which requires an advanced level of expertise. In the same way, the extent of the lesions is currently ignored in medical practice, for lack of a validated method for analyzing this information. The same remark applies to other indicators such as the numerical severity score [21], the speed of inflammation propagation and the choice of treatment [22]. Gastroenterologists recognize the significance of spatial information in the development of complications such as esophageal and colon cancer in patients with Crohn's disease and ulcerative colitis. However, current guidelines fail to fully consider the quantity and distribution of lesions, often focusing solely on the most severe lesion identified. This is due to the scarcity of software tools and scientific literature. Additionally, the intricate feedback loops and the technical challenges in collecting high-quality data for the calibration of numerical and mathematical models (see Figure 2) further highlight the need for innovative methods. This necessity is the driving force behind our current study, which aims to address the limitations of current approaches and provide a more comprehensive understanding of the disease.
Endoscopic video analysis [24] plays a crucial role in evaluating the severity of ulcerative colitis and monitoring the progression of the disease. Colonoscopy is widely used as the reference examination to assess the intensity of the disease and the extent of intestinal lesions. During this routine procedure, a gastroenterologist inserts a camera-equipped endoscope into the colon to visualize the inner lining and take biopsies if necessary. It should be noted that this technique has a very strong impact on the quality of life of the patients.
Wireless capsule endoscopy (WCE) [24] is another commonly used technique, in which patients swallow a small, intelligent capsule that contains a camera and a light source. The capsule sends images of the intestinal mucosa to a wearable sensor, making it a less invasive alternative to colonoscopy. This method is especially useful for accessing regions of the small intestine that are difficult to reach with endoscopy. However, it is more expensive, as the capsule can only be used once. Unfortunately, this technique has been abandoned in Tunisian hospitals because of its high cost.
Both colonoscopy and WCE allow for the detection of important lesions in the videos, such as: loss of visibility of the vascular framework, indicated by the disappearance of blood vessels and the formation of fibrous tissue that impedes nutrient absorption; inflammation and bleeding, which appear as red areas on the intestinal wall; and ulcers and indentations in the wall, which appear white or gray. The precise collection and examination of the endoscopic video data is essential not only for identifying the disease presence and advancement, but also for categorizing the different types of IBDs and classifying the subtypes within the same disease.
2.5 PIML: A new and growing discipline with challenges
Physics-Informed Machine Learning (PIML), also known as Physics-Informed Neural Networks (PINNs), is an emerging discipline that merges the laws of physics with the advanced techniques of machine learning and neural networks. The objective of PIML is to harness the laws of physics to increase the precision of machine learning models, particularly in cases where the systems being modeled are governed by partial differential equations. In PIML, the governing equations of a physical system are integrated into the training process of a machine learning model, resulting in predictions that are not only accurate but also physically meaningful and interpretable. This approach also helps prevent overfitting. PIML has been successfully applied in diverse fields such as fluid dynamics, structural mechanics, quantum mechanics, cosmology and quantitative finance. With many advancements and applications yet to be discovered, PIML is a rapidly growing field that promises exciting new possibilities. According to the Gartner AI Hype Cycle diagrams for 2021 [25] and 2022 [26], PINN and PIML are currently in the innovation trigger phase, gaining increasing attention and applications in the scientific community. With continued growth and development, these technologies are expected to reach the peak of inflated expectations before settling into a plateau of productivity. As PINN and PIML become more widely adopted, we can expect to see their use in solving real-world problems in biology and medicine (see Figure 3).
The establishment of neural networks in mathematics can be traced back to the seminal work of Cybenko in 1989 [27]. In this paper, Cybenko presented the concept of universal approximation, which demonstrated that a single hidden layer feedforward neural network with a sigmoid activation function is capable of approximating any continuous bounded function with a sufficient number of hidden units. This foundational work was further advanced by the studies of Hornik and Barron [28, 29], which provided additional insights into the concept of universal approximation. Various architectures have since been developed, starting with the original PINN and followed by DeepFNet, DeepONet and DeepM&MNet.
- DeepFNet [30] is a neural network architecture that is well-suited for functional approximation tasks because it is flexible, able to model complex relationships, and scalable. Its hierarchical structure allows it to learn and represent multi-scale features in the data, improving its ability to generalize and make accurate predictions on unseen data. Generally, it requires significantly fewer neurons than shallow networks to achieve a given degree of function approximation.
- DeepONet [31] uses a deep learning approach to learn nonlinear operators. The advantage is that it can capture complex relationships between variables that are not easily modeled using linear techniques. DeepONet uses seq2seq and fractional algorithms. Seq2seq (or "sequence-to-sequence") is a type of algorithm used to map input sequences to output sequences. Data with long-range relationships are analyzed using fractional (or "fractionally-differentiated") approaches.
- DeepM&MNet [32, 33] is a neural network framework for simulating complex, multiphysics systems. It uses pre-trained neural networks to make predictions about the different fields in a coupled system, such as the flow, electric and concentration fields. The framework is designed to be fast and efficient, and can be used to build models with very little data. DeepM and MNet are versatile algorithms for modeling complex, multiphysics and multiscale dynamic systems. DeepM uses a multilayer perceptron architecture, while MNet uses a combination of convolutional and long short-term memory networks. Both algorithms are able to capture intricate patterns and trends in time series data.

Figure 3: For the second consecutive year, the 2022 Gartner hype cycle for artificial intelligence mentions physics-informed AI. We notice that PIML is currently in the innovation trigger regime, with an improved outlook: the plateau of productivity is now expected to be reached in 2 to 5 years instead of 5 to 10 years.
However, the best approach for using PINN in a particular biological system depends on the available data and knowledge of the physics of the system. We will explore three possible scenarios in which PINN can be applied in biology.
- First scenario: In this case, a Physics-Informed Neural Network (PINN) is used to make predictions about a system based on both data and known physics information. The neural network is trained on the data and also incorporates the known physics through the use of constraints or regularization terms in the loss function. This allows the network to make predictions that are consistent with the known physics and improve accuracy by utilizing the available data. Fluid dynamics represents a classic example where the neural network is trained on experimental or numerical data of the fluid flow and incorporates the governing equations of fluid dynamics as constraints or regularization terms in the loss function.
- Second scenario: In this scenario, there is a large amount of data available, but there is no physical model to describe the dynamics. For example, consider the physics of jets produced by terrestrial accelerators in heavy ion experiments such as ALICE at the LHC-CERN accelerator. The lack of a physical model can make it difficult to understand the underlying dynamics of the system. This is a good use case for traditional machine learning techniques, as there is no physics information to incorporate into the model.
- Third scenario: In this case, data is limited and the system is described by several physical models. The limited data and the presence of multiple physical models can make it challenging to determine which model is most appropriate for describing the system. In this case, it may be necessary to use a combination of approaches, such as combining physical models with machine learning techniques, to gain a more complete understanding of the system. It is also important to carefully validate the results and ensure that the chosen model is able to accurately describe the observed behavior. This is the case considered in this article, in which we attempt to model a biological phenomenon caused by loops of reactions and counter-reactions between bacteria and immune cells.
Having discussed the various techniques and scenarios involved in PINNs, it is now important to evaluate and quantify the performance of the model. From a general point of view, the neural network performance can be characterized into three main types:
1. Approximation error to the ground truth function.
2. Generalization to unseen data.
3. Trainability of the model.
In fact, the universal function approximation theorem only considers the approximation error of a neural network to the ground truth function. It does not consider other important factors, such as the generalization error and the model trainability. Generalization measures a model's ability to make accurate predictions on new, unseen data. Meanwhile, the model trainability is determined by factors such as its size, its complexity, and the amount and quality of training data. These factors can influence the model's ability to be effectively trained, as larger and more complex models may require more resources and may be harder to converge. As a result, this theorem cannot guarantee generalization and trainability in complex biological processes [34] or in the modeling of nonlinear two-phase transport in porous media [4]. These limitations include the availability and quality of data and the potential for the models to fail to capture the full range of possible behaviors or phenomena. In the following points, we aim to shed light on the challenges faced in our modeling efforts.
- Complexity of the underlying processes: biological processes are often characterized by complex interactions and dynamics, making it challenging to accurately model them using traditional mathematical or physical approaches. The non-linear nature of the differential equations involved, including the presence of non-linear terms such as square or cubic terms, only adds to the difficulty. These non-linear terms can even lead to blow-up phenomena, a common challenge for mathematicians working with PDEs [5]. This can also make it challenging for PINNs to learn and represent the underlying patterns and relationships in the data.
- Data availability and quality: the data used to train PINNs may be limited in quantity or quality, or may not be representative of the full range of behaviors or phenomena. This can affect the model's ability to learn and generalize, and can reduce its accuracy and reliability. Spatial data requires exhaustive examination, which can be expensive, and hospitals are often reluctant to share their data. Our attempts to contact digestive disease institutes for data resulted in a refusal to collaborate. However, we were able to find a more accessible dataset called Kvasir that we plan to use for our study [35].
- Limited range of behaviors: PINNs may not be able to capture the full range of behaviors and phenomena that can occur in biological systems. This is because the models are typically trained on a limited set of data and are not able to capture the full range of possible behaviors or situations that can arise.
- Necessity for simple and parsimonious PINN models: the complexity of the PINN model itself can also be a limitation. These models can be computationally expensive to train and may require a large amount of data and computational resources. This can make their use challenging in certain contexts, such as when data is limited or when computational resources are constrained, as in the case of automatic lesion detection techniques where the gastroenterologist manipulates the colonoscope on the patient and works with the software in real time [36, 37].
2.6 Our approach
The discussion above has provided the necessary foundation for our main work. We attempt to predict the evolution of two bowel diseases with poor-quality data collected only on the edges of the domain, by integrating physical constraints from nonlinear PDEs into the simplest possible deep neural networks. We are inspired by the first work on PINNs by M. Raissi et al. [38]. In that paper, the authors dealt with four equations:

- the two-dimensional Navier-Stokes system,
- the Schrödinger equation,
- the Korteweg-de Vries equation,
- the Burgers' equation.
Then, we will start with the simplest PDEs and gradually add more complexity, including nonlinear terms. By following this progression, we hope to build a comprehensive model for predicting the evolution of these diseases. We will therefore apply this approach to the following equations:

- a simple partial differential equation (a toy model),
- the diffusion (heat) equation in a 2D domain,
- the Fisher-KPP equation in a 1D domain,
- the Korteweg-de Vries equations,
- a traditional Turing system: a nonlinear coupled system of reaction-diffusion equations,
- the Turing mechanism for Crohn's disease presented in [5].
3 Benchmarking our approach with some chosen PDEs
This benchmarking refers to the process of evaluating the performance and accuracy of the DNN against a series of PDEs with a minimum of data. We begin by testing the DNN on a simple PDE and observe that it solves it with a high degree of accuracy. We will see in the next section that this resolution relates to the determination of the severity score for IBDs, incorporating the crucial aspect of spatial information distribution. Next, we apply the DNN to the Burgers' equation, which is a more complex nonlinear PDE; the DNN is still able to solve it with a good level of accuracy. We then move on to the heat equation, which is a linear PDE and therefore relatively easier to solve. Finally, we test the DNN on a nonlinear system of Korteweg-de Vries equations.
3.1 Neural networks architecture
Choosing the hyperparameters of a DNN is an important step in its design and training. These parameters are the values that control the overall behaviour of the DNN, such as the number of layers, the number of neurons per layer, the learning rate and the regularization strength. Then, it is important to consider the nature of the problem being solved and the available data. For example, if the data is limited or noisy, it may be necessary to use a smaller or simpler network to avoid overfitting. The choice of optimization algorithm is another important factor in the training of the network. There are many different optimization algorithms available, each with its own strengths and weaknesses. Two commonly used optimization algorithms are L-BFGS and Adam.
We can control overfitting using dropout, which works by randomly "dropping out" a fraction of the neurons during training: a proportion of neurons is temporarily excluded from the network and does not contribute to the forward or backward passes. This has the effect of reducing the complexity of the DNN and forcing the remaining neurons to learn more robust and generalizable features. The notion of parsimony is important in deep learning because it helps ensure that the models we build are as simple as possible while still being able to effectively capture the underlying patterns in the data. By favoring parsimony, we can avoid overfitting and build models that are more likely to perform well on new, unseen data.
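As a concrete illustration of the kind of parsimonious architecture discussed above, the sketch below builds a small fully-connected network in PyTorch that maps (x, t) to u(x, t), with a configurable number of layers and neurons, a hyperbolic tangent activation and optional dropout. The layer and neuron counts and the dropout rate are illustrative choices, not the exact settings used in our experiments.

```python
import torch
import torch.nn as nn

class PINNNet(nn.Module):
    """Small fully-connected network mapping (x, t) -> u(x, t)."""
    def __init__(self, n_layers=4, n_neurons=16, dropout=0.0):
        super().__init__()
        layers, in_dim = [], 2  # inputs: x and t
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, n_neurons), nn.Tanh()]
            if dropout > 0:
                layers.append(nn.Dropout(dropout))
            in_dim = n_neurons
        layers.append(nn.Linear(in_dim, 1))  # output: the scalar field u
        self.net = nn.Sequential(*layers)

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=1))

model = PINNNet(n_layers=4, n_neurons=16, dropout=0.1)
```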
3.1.1 Adam vs L-BFGS
Stochastic gradient descent (SGD) is an optimization algorithm for finding model parameters that minimize the loss function, which measures the difference between the expected and actual output of a model [39]. There are many variations of SGD, including Adagrad, RMSprop, and Adam. Adam optimization is a stochastic gradient descent method that adaptively estimates first-order and second-order moments [40].
In Adam optimization, the parameter update is given by:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,\nabla_\theta L(\theta_{t-1}), \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\big(\nabla_\theta L(\theta_{t-1})\big)^2,$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon},$$

where $\theta$ are the model parameters, $L$ is the loss function, $t$ is the current training iteration, $\eta$ is the learning rate, and $\beta_1$ and $\beta_2$ are the forgetting factors for the gradients and the second-order gradient moments, respectively.
On the other hand, the BFGS algorithm is a quasi-Newton method for optimization, which approximates the Hessian matrix using a series of updates [41, 42]. One of the most widely used quasi-Newton methods is L-BFGS (limited-memory BFGS [43, 44]), which is more memory-efficient than BFGS: instead of storing the full $n \times n$ estimate of the inverse Hessian, it stores only a few vectors that represent the approximation implicitly, which makes it practical for machine learning settings with small to mid-sized datasets. L-BFGS is computationally more demanding than Adam, but it can move faster towards the minimum when second-order information is available. The reason is that first-order methods approximate the error function at a point with a tangent hyperplane, while second-order methods use a quadratic hypersurface, which follows the error surface more closely when the weights are updated at each iteration.
The L-BFGS update is as follows:

$$\theta_{k+1} = \theta_k - \alpha_k H_k \nabla_\theta L(\theta_k),$$

where $\theta$ are the model parameters, $L$ is the loss function, $\alpha_k$ is the step size, $H_k$ is an approximation of the inverse Hessian matrix built from the most recent pairs $(s_k, y_k)$, and $s_k$ and $y_k$ are the differences between the current and previous values of $\theta$ and $\nabla_\theta L$, respectively. The L-BFGS and Adam optimization algorithms are both commonly used in the context of training neural networks, including PINNs. Here is a comparison of some key features of these algorithms:
- Convergence rate: L-BFGS typically has a faster convergence rate than Adam. However, Adam can still be effective in practice and may be preferred in some cases due to its simplicity and ability to adapt to changing data [40].
- Memory requirements: L-BFGS requires storing a set of past gradients in memory, which can be costly for large datasets or networks. Adam, on the other hand, only requires storing running averages of the past gradients, which is typically much cheaper in terms of memory usage [41].
- Robustness: L-BFGS can be sensitive to the initialization of the parameters and may require a good initial guess to find the optimal solution. Adam can be more robust to the initialization, but may be less likely to reach the true global minimum, as shown in [42].
Both L-BFGS and Adam can be effective in training PINN models, and it may be useful to try both algorithms and compare their performance to determine which is the best fit for a particular problem. In our study, we found that Adam gives the best results.
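The sketch below shows how the two optimizers can be set up in PyTorch; `model` is the network from the previous sketch and `loss_fn` is assumed to evaluate the full PINN loss. Note that `torch.optim.LBFGS` requires a closure that re-evaluates the loss at each step, whereas Adam performs one gradient step per call. Learning rates and iteration counts are illustrative.

```python
import torch

# Adam: first-order method with adaptive moment estimates
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5000):
    adam.zero_grad()
    loss = loss_fn()        # assumed: returns the scalar PINN loss
    loss.backward()
    adam.step()

# L-BFGS: limited-memory quasi-Newton method, driven through a closure
lbfgs = torch.optim.LBFGS(model.parameters(), max_iter=500,
                          history_size=50, line_search_fn="strong_wolfe")

def closure():
    lbfgs.zero_grad()
    loss = loss_fn()
    loss.backward()
    return loss

lbfgs.step(closure)
```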
3.1.2 Root Mean Square Error (RMSE) Loss
In the context of a biology process where the values are continuous and expected to be situated in a small range, using the mean squared error (MSE) or root mean squared error (RMSE) loss function can help ensure that the model is able to accurately predict the values within that range. Both MSE and RMSE measure the difference between the predicted values and the true values, but MSE is simply the average squared difference while RMSE is the square root of the average squared difference. Both of these loss functions penalize large errors more heavily than small errors, which can be useful for preventing the DNN model from making large errors in its predictions.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

with $y_i$ the observed value, $\hat{y}_i$ the predicted value and $n$ the number of available observations.
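For reference, a one-line PyTorch implementation of these losses is sketched below; `y_pred` and `y_true` are assumed to be tensors of the same shape.

```python
import torch

def mse(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """Mean squared error (the square of the RMSE)."""
    return torch.mean((y_pred - y_true) ** 2)

def rmse(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """Root mean squared error between predicted and observed values."""
    return torch.sqrt(mse(y_pred, y_true))
```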
3.1.3 ReLU vs Sigmoid vs Tanh
In the context of tuning a DNN, the choice of activation function can significantly impact the performance. The Rectified Linear Unit (ReLU) activation function is widely used due to its simplicity and ability to converge faster than other activation functions. It is defined as $f(x) = \max(0, x)$ and is generally used in the hidden layers of the network. One disadvantage of ReLU is that it can suffer from the "dying ReLU" problem, where the weights of the neurons become very small and the activation function becomes inactive. The Sigmoid activation function is defined as $\sigma(x) = 1/(1+e^{-x})$ and is often used in the output layer of binary classification problems. However, it has a slow convergence rate and can suffer from vanishing gradients, where the gradients of the weights become very small, hindering the model's ability to learn. The Hyperbolic Tangent (Tanh) activation function is defined as $\tanh(x) = 2\sigma(2x) - 1$, where $\sigma$ is the Sigmoid function. It is often used in the hidden layers of the network and can perform well in certain tasks, but it can also suffer from the vanishing gradients problem. The choice of activation function can also depend on the range of the values being predicted, especially in a biology context. For example, if the predicted values are expected to be within a small range, such as between 0 and 1, the Sigmoid or Tanh activation functions may be more appropriate. However, if the predicted values are expected to have a larger range, the ReLU activation function may be more suitable. It is important to keep in mind that the choice of activation function is just one of many hyperparameters that can impact the performance of the model and should be tuned accordingly. In this work, we experimented with three activation functions: ReLU, Sigmoid and Hyperbolic tangent. Our models converged for Sigmoid and Hyperbolic tangent, but we could not obtain satisfying results with ReLU.
3.2 Toy model: simple partial differential equation
Let’s consider this first-order partial differential equation:
and let’s fix the values of the constants a, b, and c to be a = 1, b = -2, and c = -1. The modified equation becomes:
defined for the temporal interval and the spatial interval . The initial condition is given by:
and the PDE boundary conditions are given by:
(Note that during the numerical and PINN resolutions, we will choose and ).
To solve this partial differential equation analytically, the method of separation of variables is used: we separate the variables x and t and write the solution in the form:
Substituting this expression into the PDE gives:
where the separation constant is arbitrary.
Solving for X(x) and T(t) separately gives:
Let us solve the left-hand side:
The integration gives:
The right-hand side is written
The integration gives:
The general solution is then given by:
Using the initial condition at :
Then and
Therefore, the solution is:
Analytical solution
Numerical scheme
Now, we are interested in numerically solving the previous PDE by using the centered finite difference method. We need to discretize the spatial and temporal domains and approximate the derivatives using finite differences. Here is an outline of the steps involved in the numerical scheme:
- Choose a spatial discretization step size and a temporal discretization step size.
- Define the grid points as follows:
- Substitute the initial condition to find the solution at the initial time level.
- Substitute the boundary conditions to find the solution at the two ends of the spatial grid, where the number of grid points in the spatial domain is fixed by the mesh.
PINN approach
In the case of numerical methods, the approach is to convert the PDE into numerical schemes while ensuring properties such as convergence, consistency, and numerical stability. In the PINN approach, the problem is reformulated as a numerical optimization problem whose goal is to end up with a loss function that contains most of the system's dynamic information. To achieve this, the equation is first rearranged so that all its terms are gathered on one side, which forms the first term of the loss function. The remaining two terms of the loss function consist of the initial and boundary conditions, respectively. The loss function is thus the sum of three positive terms that must be minimized through an optimization algorithm. Second, the training data are generated by random sampling of the phase space, with points located on the boundary, within the interior of the domain, and subject to the initial conditions.
Let us define the PDE residual as the left-hand side of the rearranged equation. The solution u will be approximated by a neural network whose objective is to minimize the loss composed of three terms: the first is evaluated on the initial data at t = 0, the second on the boundary data, and the third on collocation points sampled inside the domain, where the PDE residual is enforced.
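The sketch below assembles the three loss terms and trains the network with Adam. Since only the constants a = 1, b = -2 and c = -1 are fixed above, the residual uses, purely as a placeholder, the linear form a u_t + b u_x + c u = 0, and the initial and boundary values are placeholders as well; what matters is the structure of the unsupervised loss (PDE residual at collocation points, plus initial and boundary terms).

```python
import torch

# `model` is assumed to be the fully-connected network sketched in Section 3.1.
a, b, c = 1.0, -2.0, -1.0  # constants of the toy equation

def pde_residual(x, t):
    """Placeholder residual a*u_t + b*u_x + c*u evaluated with autograd."""
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = model(x, t)
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    return a * u_t + b * u_x + c * u

# Randomly sampled training points on an illustrative domain [0, 1] x [0, 1]
x_f, t_f = torch.rand(1000, 1), torch.rand(1000, 1)            # collocation points
x_0, t_0 = torch.rand(100, 1), torch.zeros(100, 1)             # initial line t = 0
u_0 = torch.exp(-x_0)                                          # placeholder initial profile
x_b = torch.cat([torch.zeros(100, 1), torch.ones(100, 1)])     # spatial boundaries
t_b = torch.rand(200, 1)
u_b = torch.zeros(200, 1)                                      # placeholder boundary values

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5000):
    optimizer.zero_grad()
    loss = (pde_residual(x_f, t_f).pow(2).mean()       # PDE residual term
            + (model(x_0, t_0) - u_0).pow(2).mean()    # initial-condition term
            + (model(x_b, t_b) - u_b).pow(2).mean())   # boundary-condition term
    loss.backward()
    optimizer.step()
```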
Comparison between PINN, analytical solutions and finite difference method
RMSE of the PINN solution with respect to the analytical and finite difference solutions, for different numbers of layers and neurons per layer:

| Layers \ Neurons | Analytical: 8 | Analytical: 16 | Analytical: 32 | Finite difference: 8 | Finite difference: 16 | Finite difference: 32 |
|---|---|---|---|---|---|---|
| 2 | 0.000472 | 0.000416 | 0.000746 | 0.000702 | 0.000332 | 0.000572 |
| 4 | 0.000366 | 0.000465 | 0.000412 | 0.000614 | 0.000710 | 0.000644 |
| 8 | 0.000790 | 0.001780 | 0.001695 | 0.000979 | 0.001777 | 0.001548 |
Execution time (min:s) for different numbers of layers and neurons per layer:

| Layers \ Neurons | 8 | 16 | 32 |
|---|---|---|---|
| 2 | 2:05 | 2:02 | 2:55 |
| 4 | 3:58 | 3:46 | 3:43 |
| 8 | 4:44 | 5:16 | 5:26 |
3.3 Burgers' equation
The Burgers' equation is a nonlinear PDE that is often used to model a variety of physical, biological, and chemical phenomena, including incompressible fluid flow, population dynamics, and chemical reactions. It expresses the balance between convective transport in the fluid and diffusive transport due to viscosity. Solving the Burgers' equation allows one to determine the fluid velocity field at a given time and spatial location. This equation has several applications in biology, such as modeling blood flow in the cardiovascular system [45], modeling pattern formation in biological systems, and modeling population dynamics [46]. Two types of Burgers equations are considered: the inviscid form, obtained by neglecting particle interactions, whose solution can be obtained with the finite difference method by approximating derivatives through Taylor expansions on a discretized phase space; and the viscous form, whose nonlinear problem can be solved with the help of the Cole-Hopf transformation. The numerical resolution carried out here relies on a finite difference method. The first limitation of the Burgers equation is that it is a simplified model that makes certain assumptions about the system being studied. For example, it may assume that the fluid is incompressible or that the reaction rate is constant, which may not hold true in all cases. As a result, the Burgers equation may not be able to accurately capture the full complexity of a heterogeneous system. The second limitation is that the equation is a deterministic model, which does not take into account the inherent randomness that is often present in biological systems.
Equation
The Burgers' equation is the following:
with the initial condition:
and the boundary conditions:
where $\nu = 1$ is the coefficient of kinematic viscosity.
Using the Hopf-Cole transformation
the Burgers' equation transforms into the linear heat equation:
with the initial condition
and the boundary conditions
Analytical solution
The Fourier solution of the Burgers' equation is given by
with
Numerical scheme
The solution domain $(x, t)$ is discretized into cells described by the node set $(x_i, t_j)$, in which $x_i = ih$ and $t_j = jk$ (i = 0, 1, …, N; j = 0, 1, …, J, with Nh = 1 and Jk = 0.1); h is the spatial mesh size and k is the time step.
A standard explicit finite difference approximation of the heat equation is
The inverse Hopf-Cole transformation is then used to recover the solution of the Burgers' equation from that of the heat equation.
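A minimal NumPy sketch of this explicit scheme on the grid described above (Nh = 1, Jk = 0.1) is given below; the initial profile and the boundary treatment are placeholders, and the grid sizes are chosen so that the stability condition k ≤ h²/2 holds.

```python
import numpy as np

N, J = 50, 1000                       # grid sizes (illustrative)
h, k = 1.0 / N, 0.1 / J               # Nh = 1, Jk = 0.1; here k < h**2 / 2
x = np.linspace(0.0, 1.0, N + 1)
nu = 1.0                              # kinematic viscosity

phi = np.exp(-x)                      # placeholder initial profile phi(x, 0)
for _ in range(J):
    lap = (phi[2:] - 2.0 * phi[1:-1] + phi[:-2]) / h**2
    phi[1:-1] += k * lap              # explicit Euler step for phi_t = phi_xx
    phi[0], phi[-1] = phi[1], phi[-2] # placeholder zero-flux boundary treatment

# Inverse Hopf-Cole transform: u = -2 * nu * phi_x / phi
u = -2.0 * nu * np.gradient(phi, h) / phi
```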
PINN approach
As before, we define the PDE residual from the rearranged equation; u is approximated by a neural network trained to minimize a loss composed of three terms: the initial data at t = 0, the boundary data, and the PDE residual at collocation points inside the domain.
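Compared with the toy model, only the residual changes: it now contains the nonlinear convection term u u_x and the second derivative. The sketch below computes it with automatic differentiation, reusing the network of Section 3.1 and nu = 1.

```python
import torch

nu = 1.0  # kinematic viscosity

def burgers_residual(model, x, t):
    """Residual of u_t + u*u_x - nu*u_xx at the collocation points."""
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = model(x, t)
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t + u * u_x - nu * u_xx
```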
Comparison between PINN, analytical solution and finite difference method
RMSE of the PINN solution with respect to the analytical and finite difference solutions:

| Layers \ Neurons | Analytical: 8 | Analytical: 16 | Analytical: 32 | Finite difference: 8 | Finite difference: 16 | Finite difference: 32 |
|---|---|---|---|---|---|---|
| 2 | 0.020769 | 0.014624 | 0.011578 | 0.202538 | 0.021090 | 0.018698 |
| 4 | 0.010308 | 0.029504 | 0.033058 | 0.018142 | 0.034560 | 0.036699 |
| 8 | 0.212536 | 0.212404 | 0.009440 | 0.209136 | 0.209136 | 0.018677 |
Execution time (min:s):

| Layers \ Neurons | 8 | 16 | 32 |
|---|---|---|---|
| 2 | 2:00 | 2:03 | 2:38 |
| 4 | 2:37 | 4:32 | 5:02 |
| 8 | 4:39 | 8:04 | 9:52 |
3.4 Heat equation: good for diffusion but poor for spatial heterogeneity
In order to compare the three resolution approaches (the analytical approach, the numerical approach with finite differences, and the PINN), we propose to start with the analytical method. We solve the heat equation with a constant diffusion parameter.
with the initial condition
and the boundary conditions
We take .
Analytical solution
The analytical solution is
With the following Fourier transform applied:
Numerical scheme
The solution domain is discretized into cells described by the node set $(x_i, y_j, t_n)$, in which $x_i = i\Delta x$, $y_j = j\Delta y$ and $t_n = n\Delta t$ (i = 0, 1, …, I; j = 0, 1, …, J; n = 0, 1, …, N); $\Delta x$ and $\Delta y$ are the spatial mesh sizes and $\Delta t$ is the time step.
PINN approach
As before, we define the PDE residual from the rearranged equation; u is approximated by a neural network trained to minimize a loss composed of three terms: the initial data at t = 0, the boundary data, and the PDE residual at collocation points inside the domain.
Comparison between PINN, analytical solution and finite difference method
RMSE of the PINN solution with respect to the analytical and finite difference solutions:

| Layers \ Neurons | Analytical: 8 | Analytical: 16 | Analytical: 32 | Finite difference: 8 | Finite difference: 16 | Finite difference: 32 |
|---|---|---|---|---|---|---|
| 2 | 0.01104 | 0.02231 | 0.00507 | 0.01102 | 0.02230 | 0.00506 |
| 4 | 0.00289 | 0.02391 | 0.00601 | 0.00289 | 0.02391 | 0.00599 |
| 8 | 0.04588 | 0.04613 | 0.04307 | 0.04586 | 0.04612 | 0.04308 |
Time of execution (min:s):

| Neurons \ Layers | 2 | 4 | 8 |
|---|---|---|---|
| 8 | 1:56 | 3:22 | 5:40 |
| 16 | 1:39 | 3:19 | 5:58 |
| 32 | 1:34 | 2:17 | 9:48 |
The RMSE between the analytical solution and the solution obtained from the PINN is found to be on average of the order of . This indicates that the PINN method is able to approximate the analytical solution with a certain level of accuracy. Furthermore, the RMSE between the finite difference solution and the PINN solution is also found to be of the same order. However, a direct correlation between the RMSE and the increase in the number of layers and neurons per layer in the PINN is not observed. Nevertheless, some configurations of the PINN are able to approximate the analytical solution as well as the finite difference, demonstrating the effectiveness of this method for solving the heat equation.
3.5 System of Korteweg-De Vries Equations
The Korteweg-de Vries (KdV) equation is a non-linear partial differential equation that describes solitary waves, also known as solitons; it can be derived from the equations of fluid mechanics in the shallow-water limit. Solitons are responsible for tsunamis and tidal bores, which can cause significant damage and threaten marine safety. To prevent such disasters, geophysicists study the production process and formation conditions of these waves. Solitary waves are rare maritime occurrences that were first observed by John Scott Russell in 1834. They are exceedingly dangerous due to their unexpected nature and their amplitude, which far exceeds the height of swell waves. Characterizing the propagation mode of solitons can help identify them, as they propagate while preserving nearly all of the initial energy. Solitons can only be explained using nonlinear models, as they do not conform to linear approaches like the Airy model. The speed of a solitary wave is proportional to its amplitude, a phenomenon known as wave-front steepening, which underlies its nonlinear behavior. The Korteweg-de Vries equation is derived from the assumption that the non-linear effect is balanced by the linear effect reflecting the wave's dispersion, the soliton being the compensation of the two effects. Soliton solutions can be obtained using numerical analysis, such as the finite difference method or the Lax-Wendroff approach, which can solve the KdV and Burgers' equations. Solitons can be found in various fields of physics, including solid mechanics and optics, and have real-world applications such as data transmission in telecommunications. The PINN approach can also be used to solve such equations [48].
Equation
The coupled Korteweg-De Vries Equations are given by
with the initial conditions:
and the boundary conditions:
Analytical solution
The analytical solution of the equation is the couple (u,v) with
where:
In this paper, we take , b=-3,
Numerical scheme
The solution domain is discretized into cells described by the node set $(x_i, t_n)$, in which $x_i = i\Delta x$ and $t_n = n\Delta t$ (i = 0, 1, …, I; n = 0, 1, …, N); $\Delta x$ is the spatial mesh size and $\Delta t$ is the time step.
Comparison between PINN, analytical solution and finite difference method
RMSE of the PINN solution with respect to the analytical solution, for u and v:

| Neurons | 2 layers: u | 2 layers: v | 4 layers: u | 4 layers: v | 8 layers: u | 8 layers: v |
|---|---|---|---|---|---|---|
| 8 | 0.007 | 0.002 | 0.008 | 0.0037 | 0.036 | 0.012 |
| 16 | 0.005 | 0.0019 | 0.0028 | 0.007 | 0.036 | 0.012 |
| 32 | 0.005 | 0.0063 | 0.0029 | 0.0012 | 0.037 | 0.0065 |

RMSE of the PINN solution with respect to the finite difference solution, for u and v:

| Neurons | 2 layers: u | 2 layers: v | 4 layers: u | 4 layers: v | 8 layers: u | 8 layers: v |
|---|---|---|---|---|---|---|
| 8 | 0.01 | 0.006 | 0.0101 | 0.005 | 0.01 | 0.009 |
| 16 | 0.009 | 0.006 | 0.0102 | 0.006 | 0.033 | 0.009 |
| 32 | 0.01 | 0.006 | 0.0107 | 0.006 | 0.033 | 0.009 |
The value of the RMSE between the analytical solution and the finite difference is approximately 0.022733 and 0.011562 for u and v, respectively. The RMSE in relation to the number of layers and neurons is on average of the order of for u and v. The neural network is able to approximate the finite difference solution of the coupled KDV equation. The RMSE value does not vary linearly with the increase or decrease of the number of layers and neurons per layer.
4 PINN framework for the IBDs spread modeling by PDEs
In this section, we discuss the combination of concepts from image analysis and mathematical modeling; both areas are treated using the PINN method. Indeed, Figure 15 shows a global approach that unifies the mathematical models exploiting the spatial information of the two diseases.

- The estimation of a disease-severity score using a first-order ODE for the spatial distribution. A PINN is used to solve this equation and estimate the score.
- A transfer learning model for the classification of the visual appearance of the different types of lesions. The deep learning model can use an image dataset annotated by gastroenterologists.
- Based on the spatial information extracted from the images, a PINN is used to predict the evolution and spread of the two diseases.
- A PINN is used to predict the lesion distributions along the colon.
4.1 First-order ODE for spatial distribution
In this part, an ordinary differential equation (ODE) is used to describe the spatial distribution of colonic lesions in individuals with ulcerative colitis. This model was found effective in describing the disease state, and its parameters were correlated with severity assessments provided by gastroenterologists [49, 50]. Our contribution consists in finding a PINN architecture that solves this equation while identifying its two parameters, the growth rate and the abundance, for the following first-order ODE model:
with
then the analytical solution is
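The sketch below shows how the two parameters can be exposed as trainable variables next to the network weights and recovered from edge data. Since the exact ODE form comes from [49, 50] and is not reproduced here, the residual below assumes, purely for illustration, an exponential model du/dx = k u with u(0) = u0, where k stands for the growth rate and u0 for the abundance; the observed lesion distribution is synthetic placeholder data.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 16), nn.Tanh(),
                    nn.Linear(16, 16), nn.Tanh(),
                    nn.Linear(16, 1))
k = nn.Parameter(torch.tensor(-1.0))    # growth-rate parameter (illustrative initial value)
u0 = nn.Parameter(torch.tensor(1.0))    # abundance at x = 0 (illustrative initial value)

x_obs = torch.linspace(0.0, 1.0, 20).unsqueeze(1)   # observation positions (placeholder)
u_obs = 0.8 * torch.exp(-2.0 * x_obs)               # observed distribution (placeholder)

opt = torch.optim.Adam(list(net.parameters()) + [k, u0], lr=1e-2)
for _ in range(3000):
    opt.zero_grad()
    x_c = torch.rand(200, 1, requires_grad=True)    # collocation points in [0, 1]
    u = net(x_c)
    u_x = torch.autograd.grad(u, x_c, torch.ones_like(u), create_graph=True)[0]
    residual = u_x - k * u                          # ODE residual du/dx - k*u
    loss = (residual.pow(2).mean()
            + (net(torch.zeros(1, 1)) - u0).pow(2).mean()   # ties the network to u(0) = u0
            + (net(x_obs) - u_obs).pow(2).mean())           # data term from edge observations
    loss.backward()
    opt.step()
# After training, k and u0 hold the identified growth rate and abundance.
```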
4.2 A possible explanation of the Crohn’s disease with a Turing mechanism
4.2.1 A first Turing mechanism presented in [5]
Crohn's disease is a chronic inflammatory bowel disease characterized by patchy inflammation throughout the gastrointestinal tract. In this study [5], the authors propose a reaction-diffusion system involving bacteria and phagocytic cells to model the dysfunctional immune responses that cause IBD. They demonstrate that, under specific conditions, the system can generate activator-inhibitor dynamics that lead to the formation of spatially periodic and time-persistent Turing patterns. This is the first time Turing patterns have been applied to an inflammation model, and the study compares the model parameters with realistic parameters from the literature.

The model represents the intestine as an interval on the real axis, taking into account only two components: external bacteria (microbiota, pathogens, or antigens) and immune cells (phagocytes). In a healthy gut, the immune response controls the inflammatory response, while in disease states the intestinal immune system becomes unbalanced, resulting in excessive migration of immune cells to damaged areas and increased epithelial barrier permeability. These changes allow further infiltration of the microbiota, which can exacerbate inflammation. A complex network of interactions between these factors initiates the inflammatory cascade that leads to Crohn's disease.
With the following initial conditions
and
In our study, Neumann boundary conditions have been considered, whereas the original paper considered periodic boundary conditions.
| Parameter | Description | Value |
|---|---|---|
| | The bacteria reproduction rate per minute | |
| | The phagocytes' intrinsic death rate per minute | |
| | The bacteria diffusion rate per minute | |
| | The phagocytes' diffusion rate per minute | |
| | The bacteria density in the lumen | |
| | The rate of the immune response per minute | |
| | The product between and | |
| | A coefficient of proportionality between and | |
| | The rate per minute of the epithelium porosity | |
4.2.2 A second Turing mechanism with cubic term
Let us discuss another activator-inhibitor dynamic in a Turing system, where the diffusion rate of the activator is much slower than that of the inhibitor. In this system, the inhibitor suppresses the production of both components, while the activator enhances its own production. The system is subject to small perturbations that encourage the emergence of large-scale patterns, according to the Turing model.
The system is described by two coupled non-linear partial differential equations. The first equation describes the time evolution of the activator component, u, as a function of time t and two spatial dimensions x and y. The left-hand side of the equation represents the rate of change of u with respect to time (irreversible over time), while the right-hand side consists of four terms: the first represents the diffusion of u with a diffusion coefficient a, the second represents the self-enhancement of u (a non-linear term with a cubic dependency), the third represents the inhibition of u by v, and the fourth is a constant offset, c. The second equation describes the time evolution of the inhibitor component, v, with an additional time constant $\tau$. The left-hand side of the equation represents the rate of change of v with respect to time, while the right-hand side consists of three terms: the first represents the diffusion of v with a diffusion coefficient b, the second represents the production of v by u, and the third represents the inhibition of v by itself. The coupling of the two equations via the inhibition term of the activator and the production term of the inhibitor results in the formation of spatial patterns in the system.
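To make the described system concrete, the sketch below integrates it with an explicit finite-difference scheme, matching the terms described above (u_t = a Δu + u − u³ − v + c and τ v_t = b Δv + u − v) with zero-flux boundaries. The parameter values, grid and time step are illustrative assumptions known to produce Turing-type patterns, not the values used in our experiments.

```python
import numpy as np

# Explicit simulation of the activator-inhibitor system described above:
#   u_t = a*Lap(u) + u - u**3 - v + c
#   tau * v_t = b*Lap(v) + u - v
# Parameter values and grid sizes are illustrative assumptions.
a, b, c, tau = 2.8e-4, 5e-3, -0.005, 0.1
n, dt, steps = 100, 0.001, 10000
dx = 2.0 / n

u = np.random.rand(n, n) - 0.5   # small random perturbations
v = np.random.rand(n, n) - 0.5

def laplacian(z, dx):
    """Five-point Laplacian with zero-flux (Neumann) boundaries."""
    zp = np.pad(z, 1, mode="edge")
    return (zp[:-2, 1:-1] + zp[2:, 1:-1] + zp[1:-1, :-2] + zp[1:-1, 2:] - 4 * z) / dx**2

for _ in range(steps):
    du = a * laplacian(u, dx) + u - u**3 - v + c
    dv = (b * laplacian(v, dx) + u - v) / tau
    u += dt * du
    v += dt * dv
# After enough steps, u develops stationary, spatially periodic (Turing) patterns.
```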
RMSE of the PINN solution with respect to the finite difference solution, for u and v:

| Layers | 8 neurons: u | 8 neurons: v | 16 neurons: u | 16 neurons: v | 32 neurons: u | 32 neurons: v |
|---|---|---|---|---|---|---|
| 2 | 0.987 | 1.233 | 0.994 | 1.165 | 0.857 | 1.137 |
| 4 | 0.993 | 1.195 | 0.993 | 1.092 | 1.115 | 1.238 |
| 8 | 0.989 | 1.132 | 1.003 | 1.120 | 1.238 | 1.125 |
The root mean squared error (RMSE) between the solution obtained with the physics-informed neural network (PINN) and the finite difference solution tends towards 1 for both u and v. At first glance, this RMSE might suggest that the neural network has converged to a solution. However, as evidenced in Figure 19, this is not the case in practice. This equation models the formation of patterns as a result of the interaction between different pigments, but the two solutions do not show the same patterns. Therefore, in this case, the PINN diverges and fails to approximate the numerical solution. Continued research aims to uncover the root causes behind the inability of PINNs to adequately capture the intricate blow-up phenomena in IBDs.
4.3 Fisher-KPP equation: modeling the bacteria/phagocytes couple with one PDE
The Fisher [52] Kolmogorov-Petrovsky-Piskunov [53] (Fisher-KPP) equation is a reaction-diffusion equation that plays a significant role in describing various chemical, physical and biological phenomena. It is often used to model the propagation of a single wave front, such as the spread of a disease or the front of a chemical reaction. In the case of ulcerative colitis and Crohn's disease, the Fisher-KPP equation can be used to model the spread of these intestinal diseases through the incorporation of factors such as the concentration of bacteria and the presence of phagocytes. One advantage of using the Fisher-KPP equation to model the spread of IBDs is that it is a relatively simple equation that can be solved analytically or numerically. This makes it a good choice for modeling the spread of these diseases, especially in cases where limited data are available [34].
Fisher-KPP Equation
$$\frac{\partial u}{\partial t} = D\,\frac{\partial^2 u}{\partial x^2} + r\,u\,(1-u),$$

with the boundary conditions u(-50, t) = 1 and u(50, t) = 0, D = 1 and r = 1. The diffusion term $D\,\partial^2 u/\partial x^2$ represents the rate at which bacteria in the medium migrate through a linear diffusion process with diffusivity D. The reaction term $r\,u\,(1-u)$ represents the bacteria proliferation rate in the medium, which is assumed to be proportional to the bacterial density u and to the remaining carrying capacity of the environment, (1 - u). The parameter r represents the growth rate and the quantity $u(1-u)$ models the logistic growth of the bacteria. The factor $(1-u)$ represents the limiting factor, which means that the growth rate decreases as the population of bacteria and phagocytes approaches the carrying capacity of the biological medium.
Numerical scheme
The solution domain is discretized into cells described by the node set $(x_i, t_n)$, in which $x_i = i\Delta x$ and $t_n = n\Delta t$ (i = 0, 1, …, I; n = 0, 1, …, N); $\Delta x$ is the spatial mesh size and $\Delta t$ is the time step.
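An explicit finite-difference sketch for the Fisher-KPP equation above, with the boundary conditions u(-50, t) = 1 and u(50, t) = 0 and D = r = 1, is given below; the initial front profile is a placeholder and the time step is chosen to respect the explicit stability limit.

```python
import numpy as np

D, r = 1.0, 1.0
nx, nt = 401, 20000
x = np.linspace(-50.0, 50.0, nx)
dx = x[1] - x[0]                       # 0.25
dt = 0.2 * dx**2 / (2.0 * D)           # well below the stability limit dx**2 / (2*D)

u = 0.5 * (1.0 - np.tanh(x))           # placeholder initial front between 1 and 0
for _ in range(nt):
    u_xx = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dx**2
    u[1:-1] += dt * (D * u_xx + r * u[1:-1] * (1.0 - u[1:-1]))
    u[0], u[-1] = 1.0, 0.0             # boundary conditions u(-50, t) = 1, u(50, t) = 0
# The front travels to the right at roughly the classical minimal speed 2*sqrt(r*D).
```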
Comparison
RMSE of the PINN solution with respect to the analytical solution:

| Layers \ Neurons | 8 | 16 | 32 |
|---|---|---|---|
| 2 | 0.028321 | 0.074042 | 0.040836 |
| 4 | 0.037587 | 0.009196 | 0.027296 |
| 8 | 0.043136 | 0.022766 | 0.013606 |
Execution time (min:s):

| Layers \ Neurons | 8 | 16 | 32 |
|---|---|---|---|
| 2 | 1:39 | 1:43 | 2:06 |
| 4 | 2:41 | 2:39 | 4:02 |
| 8 | 4:24 | 4:48 | 8:00 |
The lower the RMSE, the better the approximation. In our case, the RMSE in relation to the number of layers and neurons is on average of the order of . Thus, the PINN is able to approximate the finite difference solution of the Fisher-KPP equation. Note that increasing or decreasing the number of layers and neurons does not systematically improve the RMSE.
| Phenomenon | Modelling | Remarks |
|---|---|---|
| Unidimensional modelling | First-order differential equation | Simplistic approach; allows a score to be obtained from low-quality data |
| Diffusion modelling | Heat equation | Unique; works well without data; does not take the pigmentation into consideration |
| Modelling of viscous diffusion | Burgers' equation | Works well without data |
| Front wave transport modelling | Fisher-KPP | Works well without data |
| M | Korteweg-de Vries | Works well without data |
| Turing pattern modelling | Turing equation | Numerical instability problem |
5 Spatial distribution extraction by computer vision
The extraction of the spatial distribution of ulcerative colitis and Crohn's disease lesions using computer vision can be very useful for our study. By using image processing algorithms, it is possible to accurately map out the areas affected by the two diseases. This can help doctors better understand the progression of the disease and adjust treatment accordingly. In addition, it can also allow for faster and more accurate evaluation of treatment response, which can be particularly useful in the context of testing new medications. Traditional detection methods based on the expertise of gastroenterologists are time- and resource-intensive. Early detection and treatment of these diseases can therefore reduce complications and improve the patient's quality of life later on. With recent advances in deep learning, powerful approaches for both detection and classification that can handle complex environments have been developed. In this paper, we propose a deep learning based architecture for object classification in the context of Crohn's disease. The goal is to supply the PINN framework developed above with input data for the PDEs' initial and boundary conditions. The proposed solution combines deep learning and tweaked transfer learning models for object classification and detection with balanced data for each image class. It can operate in a more complex environment and takes the state of the input into consideration. Its aim is to automatically detect damage, locate it and classify the disease type.
5.1 Transfer learning for image transformation
Transfer learning is a deep learning technique that reuses pre-trained models for new image tasks. The pre-trained models have already learned general features from a large dataset, making them suitable for transfer to related tasks. In the context of spatial distribution extraction, transfer learning can be used to improve the accuracy of feature extraction for specific image classes. The residual network ResNet [54] has been used for this transfer learning task. Its architecture is available in several depths; the 50-layer variant is known as ResNet50. The ResNet50 and ResNet34 architectures differ in one key aspect: to keep the training time of deeper networks manageable, the basic building block was replaced by a bottleneck design, so each 2-layer block of ResNet34 becomes a 3-layer bottleneck block in ResNet50. Compared with the 34-layer ResNet, the accuracy of this model is noticeably higher (as shown in Figure 24).
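As an illustration, a minimal transfer-learning sketch with a frozen ResNet50 backbone and a new binary head (normal vs. diseased) is given below, assuming Keras; the input size, dropout rate and optimizer settings are assumptions, not the exact configuration of our experiments.

```python
import tensorflow as tf

# Transfer-learning sketch: pretrained ResNet50 features with a new binary head.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False                                  # freeze the pretrained backbone

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # normal vs. diseased
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)   # datasets built from the Kvasir images
```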
In addition, inpainting (the reconstruction of missing image regions, here driven by a discrete gradient criterion) and clustering techniques can be used to fill in missing data and to group similar pixels, respectively. Principal component analysis (PCA) can also be applied to reduce the dimensionality of the extracted features, enabling more efficient processing of the data. Together, these image transformation techniques enable accurate and efficient computer vision solutions for analyzing complex biological images.

The data used to test this approach is the Kvasir dataset [35], a collection of annotated medical images of the gastrointestinal (GI) tract designed for computer-aided disease detection. The dataset is important for research on medical detection and retrieval, especially single- and multi-disease computer-aided detection. It contains images classified into three important anatomical landmarks and three clinically significant findings, as well as two categories of images related to endoscopic polyp removal. The sorting and annotation of the dataset were done by medical doctors. The Kvasir dataset may improve medical practice and refine health care systems globally, since it includes sufficient numbers of images for tasks such as machine learning, deep learning, image retrieval, and transfer learning. The images were collected using endoscopic equipment at the Vestre Viken Hospital Trust in Norway and carefully annotated by medical experts from Vestre Viken and the Cancer Registry of Norway (CRN). The CRN is responsible for the national cancer screening programmes, whose goal is to prevent cancer deaths by discovering cancers or pre-cancerous lesions as early as possible. The Kvasir dataset contains 8 classes with 2000 images per class. As a first step, we opted for binary classification by feeding the model both normal images and images showing the disease.

5.2 Inpainting and clustering
One major issue with our data is the glare produced by the endoscopic camera’s light. To tackle this problem, we first inpainted the images. We treated the over-exposed areas as missing values, so inpainting, the task of reconstructing missing regions in an image, comes into play. It is an important problem in computer vision and an essential functionality in many imaging and graphics applications, e.g. object removal, image restoration, manipulation, re-targeting, compositing, and image-based rendering. The technique consists of detecting the bright pixels (a mask built from pixel values between 221 and 255) and filling them in from the surrounding pixels, in order to preserve as much information as possible. After inpainting, the images still contained some noise, so we applied k-means clustering [55]. Clustering algorithms are unsupervised, meaning they do not use labelled data; they assign data points from a population to groups in which points share similar traits. In our case, clustering the image enabled us to blend similar image segments together and thus lessen the noise that persisted after inpainting.
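A minimal sketch of this preprocessing, assuming OpenCV, is shown below; the file name, the inpainting radius and the number of clusters are illustrative assumptions.

```python
import cv2
import numpy as np

# Bright (glare) pixels are treated as missing data and inpainted, then k-means
# on the colours blends similar segments to reduce the remaining noise.
img = cv2.imread("frame.jpg")                                  # hypothetical endoscopic frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
mask = ((gray >= 221) & (gray <= 255)).astype(np.uint8) * 255  # glare mask (values 221-255)
restored = cv2.inpaint(img, mask, 5, cv2.INPAINT_TELEA)        # fill masked pixels from neighbours

pixels = restored.reshape(-1, 3).astype(np.float32)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, centers = cv2.kmeans(pixels, 5, None, criteria, 10, cv2.KMEANS_PP_CENTERS)
smoothed = centers[labels.flatten()].astype(np.uint8).reshape(img.shape)
```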




5.3 Image classification with PCA
Image classification using principal component analysis (PCA) was performed in this study to explore an unconventional approach to this computer vision problem. While deep learning is commonly used to address such challenges, we opted to apply gradient boosting. First, we extracted the palette of each image to obtain a correspondence table of selected colors in the RGB color space. We kept the five dominant colors and transformed each image into a vector, and the vectors were arranged in a dataframe. Before proceeding with the classification, we applied PCA to reduce the dimensionality of the data and ensure that the points were represented optimally, thus increasing the chances of successful classification.

To identify the optimal number of palettes for the images, we plotted the accuracy of the model against the number of palettes used and selected the best value for further tests. We then evaluated our gradient boosting method, specifically XGBoost [56], a distributed tree boosting library that implements the gradient boosting framework to provide efficient, flexible, and portable machine learning algorithms. Overall, this study demonstrates the potential of combining gradient boosting and PCA to solve computer vision problems and offers new insights into the classification of image data. The following table shows the confusion matrix of our results; a sketch of the full pipeline is given after the table.
| Predicted \ Actual | Positive | Negative |
|---|---|---|
| Positive | 66 | 10 |
| Negative | 6 | 68 |
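A sketch of the palette-PCA-XGBoost pipeline is given below, assuming scikit-learn and XGBoost; the file names, the number of principal components and the booster hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Each image is summarised by its 5 dominant RGB colours (a 15-dimensional vector),
# PCA compresses these features, and XGBoost performs the binary classification.
X = np.load("palettes.npy")              # hypothetical array of shape (n_images, 15)
y = np.load("labels.npy")                # hypothetical labels: 0 = normal, 1 = diseased

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pca = PCA(n_components=5).fit(X_train)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(pca.transform(X_train), y_train)
print("test accuracy:", clf.score(pca.transform(X_test), y_test))
```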

5.4 Computer vision solution perspectives
In this part, we discuss how the previous subsections fit into the complete architecture of the solution. While the partial differential equations give us a time-dependent solution, classification models only provide the current state of the disease. The proposed solution is therefore a tailored deep learning architecture combining regression and classification tasks, in order to get the full picture of the evolution of the disease. The regression part of the solution consists of a convolutional long short-term memory network (ConvLSTM). This network is fed a time-ordered sequence of images and outputs the next sequence, i.e. the next state of the disease. With this architecture we obtain both the current state and its evolution, in time and over the affected surface.
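A minimal ConvLSTM sketch, assuming Keras, is shown below; the sequence length, frame size and number of filters are assumed values rather than a validated configuration.

```python
import tensorflow as tf

# ConvLSTM regression sketch: a time sequence of frames in, the next frame
# (the next disease state) out.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10, 64, 64, 1)),       # (time steps, height, width, channels)
    tf.keras.layers.ConvLSTM2D(32, kernel_size=3, padding="same", return_sequences=True),
    tf.keras.layers.ConvLSTM2D(32, kernel_size=3, padding="same", return_sequences=False),
    tf.keras.layers.Conv2D(1, kernel_size=3, padding="same", activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")
```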

While the previous proposition might seem somewhat simplistic, we also propose another approach based on image segmentation, which can likewise be treated as a regression problem. We apply an image segmentation model to our images that returns the surface affected by the disease. This information is then treated as tabular data and, with sufficient data, a state-of-the-art regression model can be applied to determine the evolution of that surface, as sketched below. Keep in mind that the output of the segmentation model should be treated carefully, because the mask on the detected object can introduce a significant amount of disturbance.
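A toy sketch of this segmentation-to-regression idea follows; the masks are synthetic placeholders and the linear trend is only one possible choice of regression model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def diseased_area(mask: np.ndarray) -> float:
    # Fraction of pixels flagged as diseased in a binary segmentation mask.
    return float(mask.astype(bool).sum()) / mask.size

rng = np.random.default_rng(0)
masks = [rng.random((64, 64)) < 0.1 * (t + 1) for t in range(4)]   # placeholder masks growing over time
times = np.arange(len(masks), dtype=float).reshape(-1, 1)          # hypothetical examination times
areas = np.array([diseased_area(m) for m in masks])

trend = LinearRegression().fit(times, areas)
print("predicted diseased fraction at t=4:", trend.predict([[4.0]])[0])
```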
6 Conclusion
This study was carried out with the ultimate goal of being reproducible and usable by the scientific community in a challenging Tunisian context. Digestive cancers are a significant public health concern due to their high mortality rates. Moreover, mortality depends on the stage at the time of diagnosis: it is high when the disease is detected at a late stage, whereas the survival rate is high when detection occurs at a pre-cancerous stage. Ulcerative colitis and Crohn’s disease may indicate a potential evolution towards cancer. However, the current diagnostic methods for these diseases are demanding: they require expert gastroenterologists and expensive equipment, and, being invasive, they have a major impact on the quality of life of patients, making early detection challenging. This brings us to several open questions: can we improve on current methods, making them less expensive and less invasive, and thereby improve the patient’s quality of life?
This paper aimed to demonstrate the potential of AI in providing a more cost-effective and less invasive solution for detection, by solving the partial differential equations that model the propagation of these diseases. Our findings come with open-source data and code, promoting transparency and encouraging further research. Our preliminary study used unsupervised learning to model ulcerative colitis and Crohn’s disease because of the lack of good-quality data. Further investigations may be necessary to better understand the modelling of these diseases and the potential of combining computer vision techniques with regression models. We have shown that it is possible to solve the Fisher-KPP equation with low-quality data and a deep neural network. The difficulty appeared rather with the non-linear, coupled Turing PDE system; an investigation is necessary to understand the poorly modelled blow-up phenomenon. Finally, we believe that this work was an opportunity to bring together three communities, gastroenterologists, mathematicians, and AI specialists, through our proposed experimental framework (Figure 15).
References
- [1] S. Cuomo, V.S. Di Cola, F. Giampaolo, et al. Scientific machine learning through physics–informed neural networks: Where we are and what’s next. J Sci Comput, 92:88, 2022.
- [2] G.E. Karniadakis, I.G. Kevrekidis, L. Lu, P. Perdikaris, et al. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.
- [3] Aditi Krishnapriyan, Amir Gholami, Shandian Zhe, et al. Characterizing possible failure modes in physics-informed neural networks. Advances in Neural Information Processing Systems, 34:26548–26560, 2021.
- [4] O. Fuks and H.A. Tchelepi. Limitations of physics informed machine learning for nonlinear two-phase transport in porous media. Journal of Machine Learning for Modeling and Computing, 1(1):19–37, 2020.
- [5] Grégoire Nadin, Eric Ogier-Denis, Ana I. Toledo, and Hatem Zaag. A Turing mechanism in order to explain the patchy nature of Crohn’s disease. Journal of Mathematical Biology, 83(2):12, 2021.
- [6] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- [7] Hao-nan Wang, Ning Liu, Yi-yun Zhang, Da-wei Feng, Feng Huang, Dong-sheng Li, and Yi-ming Zhang. Deep reinforcement learning: a survey. Frontiers of Information Technology & Electronic Engineering, 21(12):1726–1744, 2020.
- [8] Jonas Degrave, Federico Felici, Jonas Buchli, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414–419, 2022.
- [9] Suman Ravuri, Karel Lenc, Matthew Willson, et al. Skilful precipitation nowcasting using deep generative models of radar. Nature, 597(7878):672–677, 09 2021.
- [10] L.C. Evans. Partial Differential Equations, volume 19 of Graduate Studies in Mathematics. AMS, 2nd edition, 2010.
- [11] Jean-Pierre Demailly. Analyse numérique et équations différentielles : nouvelle édition avec exercices corrigés. Grenoble Sciences. EDP Sciences, 05 2016.
- [12] Sören Bartels. Numerical Methods for Nonlinear Partial Differential Equations. Springer Series in Computational Mathematics. Springer Cham, 1 edition, 2015.
- [13] Jan Blechschmidt and Oliver G. Ernst. Three ways to solve partial differential equations with neural networks — a review. GAMM-Mitteilungen, 44(2):e202100006, 06 2021. "Special Issue: Scientific Machine Learning - Part II".
- [14] Frank Merle and Hatem Zaag. On degenerate blow-up profiles for the subcritical semilinear heat equation. arXiv, 2021.
- [15] Catherine Le Berre, Ashwin N Ananthakrishnan, Silvio Danese, et al. Ulcerative colitis and Crohn’s disease have similar burden and goals for treatment. Clin Gastroenterol Hepatol, 18(1):14–23, Jan 2020.
- [16] Johan Burisch. Crohn’s disease and ulcerative colitis. Occurrence, course and prognosis during the first year of disease in a European population-based inception cohort. Danish Medical Journal, 61(1):B4778, Jan 2014.
- [17] Doreen Busingye et al. Prevalence of inflammatory bowel disease in the australian general practice population: A cross-sectional study. PloS one, 16(5):e0252458, 2021.
- [18] M. Mosli, S. Alawadhi, F. Hasan, et al. Incidence, prevalence, and clinical epidemiology of inflammatory bowel disease in the arab world: A systematic review and meta-analysis. Inflamm Intest Dis, 6(3):123–131, 2021.
- [19] Sami Karoui et al. Fréquence et facteurs prédictifs de colectomie et de coloproctectomie au cours de la rectocolite hémorragique. La Tunisie Médicale, 87(02):115–119, 2009.
- [20] Maxime Collard. Les mathématiques au secours de la biologie. Société Nationale Française de Colo-Proctologie (SNFCP), Nov 2021.
- [21] S Kraszewski, W Szczurek, J Szymczak, et al. Machine learning prediction model for inflammatory bowel disease based on laboratory markers. working model in a discovery cohort study. J Clin Med, 10(20):4745, Oct 16 2021.
- [22] R. Makkar and S. Bo. Colonoscopic perforation in inflammatory bowel disease. Gastroenterol Hepatol (N Y), 9(9):573–83, 2013.
- [23] Ana Isis Toledo Marrero. Reaction-diffusion equations and applications to biological control of dengue and inflammation. PhD thesis, Université Paris-Nord - Paris XIII, 2021.
- [24] Safaa Al Ali. Mathematical modelling of chronic inflammatory bowel diseases. PhD thesis, Université Paris-Nord - Paris XIII, 2021.
- [25] Gartner. Gartner hype cycle for artificial intelligence, 2021.
- [26] Gartner. Gartner hype cycle for artificial intelligence, 2022.
- [27] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
- [28] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
- [29] Andrew R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1):115–133, January 1994.
- [30] Shiyu Liang and R. Srikant. Why deep neural networks for function approximation? In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
- [31] Lu Lu, Pengzhan Jin, Guofei Pang, et al. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.
- [32] Zhiping Mao, Lu Lu, Olaf Marxen, et al. DeepM&Mnet for hypersonics: Predicting the coupled flow and finite-rate chemistry behind a normal shock using neural-network approximation of operators. Journal of Computational Physics, 447:110698, 2021.
- [33] Shengze Cai, Zhicheng Wang, Lu Lu, et al. DeepM&Mnet: Inferring the electroconvection multiphysics fields based on operator approximation by neural networks. Journal of Computational Physics, 436:110296, 2021.
- [34] John H. Lagergren, John T. Nardini, Michael Lavigne G., Erica M. Rutter, and Kevin B. Flores. Learning partial differential equations for biological transport models from noisy spatio-temporal data. Proc. R. Soc. A., 476(20190800), 2020.
- [35] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, et al. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference, MMSys’17, pages 164–169, New York, NY, USA, 2017. ACM.
- [36] J. Y. Lee, J. Jeong, E. M. Song, et al. Real-time detection of colon polyps during colonoscopy using deep learning: systematic validation with four independent datasets. Scientific Reports, 10:8379, 2020.
- [37] Marco Chierici, Nicolae Puica, Matteo Pozzi, et al. Automatically detecting Crohn’s disease and ulcerative colitis from endoscopic imaging. BMC Medical Informatics and Decision Making, 22(6):300, 2022.
- [38] M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
- [39] L. Bottou. Online Algorithms and Stochastic Approximations. Cambridge University Press, Cambridge, UK, 1998.
- [40] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
- [41] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer New York, NY, 2 edition, July 2006.
- [42] James Martens. Deep learning via hessian-free optimization. In International Conference on Machine Learning, 2010.
- [43] Dong C. Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical Programming, 45(1):503–528, August 1989.
- [44] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011.
- [45] Tony Lyons. The 2-component dispersionless burgers equation arising in the modelling of blood flow. Communications on Pure and Applied Analysis, 11(4):1563–1576, 2012.
- [46] Matylda Jabłońska, Robert Sitarz, and Andrzej Kraslawski. Forecasting research trends using population dynamics model with burgers’ type interaction. 32:619–624, 2013.
- [47] Mahesh Gajendran, Priyadarshini Loganathan, Guillermo Jimenez, et al. A comprehensive review and update on ulcerative colitis. Disease-a-Month, 65(12):100851, 2019.
- [48] Chaohao Xiao, Xiaoqian Zhu, Fukang Yin, and Xiaoqun Cao. Physics-informed neural network for solving coupled Korteweg-de Vries equations. Journal of Physics: Conference Series, 2031:012056, 2021.
- [49] Safaa Al-Ali, John Chaussard, Sébastien Li-Thiao-Té, et al. Automatic bleeding and ulcer detection from limited quality annotations in ulcerative colitis. Inflammatory Bowel Diseases, 28(Supplement 1):S19–S20, 01 2022.
- [50] Safaa Al-Ali, Sébastien Li-Thiao-Té, John Chaussard, and Hatem Zaag. Mathematical modeling of the spatial distribution of lesions in inflammatory bowel disease. feb 2020.
- [51] Cyrille Rossant. IPython Interactive Computing and Visualization Cookbook, Second Edition. Packt Publishing, 2018.
- [52] R. A. Fisher. The wave of advance of advantageous genes. Annals of Eugenics, 7(4):355–369, 1937.
- [53] A.N. Kolmogorov, I.G. Petrovsky, and N.S. Piskunov. Investigation of the equation of diffusion combined with increasing of the substance and its application to a biology problem. Bulletin of Moscow State University Series A: Mathematics and Mechanics, 1:1–25, 1937.
- [54] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- [55] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’07), pages 1027–1035. Society for Industrial and Applied Mathematics, 01 2007.
- [56] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), pages 785–794. ACM, 08 2016.