Convergence analysis of wide shallow neural operators within the framework of Neural Tangent Kernel
Abstract.
Neural operators aim to approximate operators mapping between Banach spaces of functions and have achieved considerable success in scientific computing. Compared with deep learning-based solvers such as Physics-Informed Neural Networks (PINNs) and the Deep Ritz Method (DRM), neural operators can solve an entire class of Partial Differential Equations (PDEs). Although much work has analyzed the approximation and generalization errors of neural operators, an analysis of their training error is still lacking. In this work, we carry out a convergence analysis of gradient descent for wide shallow neural operators and physics-informed shallow neural operators within the framework of the Neural Tangent Kernel (NTK). The core idea lies in the fact that over-parameterization and random initialization together ensure that each weight vector remains near its initialization throughout all iterations, which yields the linear convergence of gradient descent. We demonstrate that, in the over-parameterized regime, gradient descent finds the global minimum in both continuous and discrete time.
1. Introduction
Partial Differential Equations (PDEs) are essential for modeling a wide range of phenomena in physics, biology, and engineering. Nonetheless, the numerical solution of PDEs has always been a significant challenge in scientific computing. Traditional numerical approaches, such as finite difference, finite element, finite volume, and spectral methods, can suffer from the curse of dimensionality when applied to high-dimensional PDEs. In recent years, the impressive achievements of deep learning in various domains, including computer vision, natural language processing, and reinforcement learning, have led to an increased interest in using machine learning techniques to tackle PDE-related problems.
For scientific problems, neural network-based methods fall primarily into two categories: neural solvers and neural operators. Neural solvers, such as PINNs [1] and DRM [2], use neural networks to represent the solutions of PDEs, minimizing some form of residual so that the networks closely approximate the true solutions. Neural solvers have two potential advantages. First, they are unsupervised, so they do not require the costly collection of large numbers of labels as in supervised learning. Second, as powerful representation tools, neural networks are known to be effective for approximating continuous functions [3], smooth functions [4], and Sobolev functions [5]. This offers a potentially viable avenue for addressing high-dimensional PDEs. Nevertheless, existing neural solvers face several limitations compared with classical numerical solvers such as FEM and FVM, particularly regarding accuracy and convergence. In addition, neural solvers are typically limited to solving a fixed PDE: if certain parameters of the PDE change, the neural network must be retrained.
Neural operators (also called operator learning) aim to approximate an unknown operator, which often takes the form of the solution operator associated with a differential equation. Unlike most supervised learning methods in machine learning, where both inputs and outputs are finite-dimensional, operator learning can be regarded as supervised learning in function spaces. Because the inputs and outputs of neural networks are finite-dimensional, some operator learning methods, such as PCA-Net and DeepONet, use an encoder to convert infinite-dimensional inputs into finite-dimensional ones and a decoder to convert finite-dimensional outputs back into infinite-dimensional outputs. The PCA-Net architecture was proposed as an operator learning framework in [6], where principal component analysis (PCA) is employed to obtain data-driven encoders and decoders, combined with a neural network mapping between the finite-dimensional latent spaces. Building on early work by [7], DeepONet [8] consists of a deep neural network for encoding the discrete input function space and another deep neural network for encoding the domain of the output functions. The encoding network is conventionally referred to as the “branch-net”, while the decoding network is referred to as the “trunk-net”. In contrast to the PCA-Net and DeepONet architectures mentioned above, the Fourier Neural Operator (FNO), introduced in [9], does not follow the encoder-decoder paradigm. Instead, FNO is a composition of linear integral operators and nonlinear activation functions, which can be seen as a generalization of the structure of finite-dimensional neural networks to a function space setting.
The theoretical research on neural operators mostly focuses on approximation and generalization errors. As is well known, the theoretical basis for the application of neural networks lies in the fact that they are universal approximators, and the same holds for neural operators. The analysis of approximation errors for neural operators aims to determine whether neural operators also possess a universal approximation property, i.e., the ability to approximate a wide class of operators to any given accuracy. As shown in [7], (shallow) operator networks can approximate continuous operators mapping between spaces of continuous functions with arbitrary accuracy. Building on this result, DeepONets have also been proven to be universal approximators. For neural operators following the encoder-decoder paradigm, such as DeepONet and PCA-Net, Lemma 22 in [10] provides a consistent approximation result, which states that if two Banach spaces have the approximation property, then continuous maps between them can be approximated in a finite-dimensional manner. The universal approximation capability of the FNO was initially established in [11], drawing on concepts from Fourier analysis, specifically the density of Fourier series, to demonstrate the FNO’s ability to approximate a broad spectrum of operators. For a more quantitative analysis of the approximation error of neural operators, see [12]. In addition to approximation errors, the error analysis of encoder-decoder style neural operators also includes encoding and reconstruction errors. [13] provides both lower and upper bounds on the total error for DeepONets by using the spectral decay properties of the covariance operators associated with the underlying measures. By employing tools from non-parametric regression, [14] provides an analysis of the generalization error of neural operators with basis encoders and decoders. The results in [14] hold for neural operators with popular encoders and decoders, such as those using Legendre polynomials, trigonometric functions, and PCA. For more details on recent advances and theoretical research in operator learning, refer to the review [12].
Up to this point, the convergence and optimization aspects of neural operators have received relatively little theoretical attention. To the best of our knowledge, only [15] and [16] have touched upon the optimization of neural operators. Based on restricted strong convexity (RSC), [15] presents a unified framework for gradient descent and applies it to DeepONets and FNOs, establishing convergence guarantees for both. [16] briefly analyzes the training of physics-informed DeepONets and derives a weighting scheme guided by NTK theory to balance the data and PDE residual terms in the loss function. In this paper, we focus on the training error of the shallow neural operator in [7] within the framework of the NTK, showing that gradient descent converges to the global optimum at a linear rate.
1.1. Notations
We denote for . Given a set , we denote the uniform distribution on by . We use to denote the indicator function of the event . We use to denote an estimate that , where is a universal constant. A universal constant means a constant independent of any variables.
2. Preliminaries
The neural operator considered in this paper was originally introduced in [7], aiming to approximate a nonlinear operator. Specifically, suppose that is a continuous and non-polynomial function, is a Banach space, , are two compact sets in and , respectively, and is a compact set in . Assume that is a nonlinear continuous operator. Then, an operator net can be formulated in terms of two shallow neural networks. The first is the so-called branch net , defined for as
where are the so-called sensors and are weights of the neural network.
The second neural network is the so-called trunk net , defined as
where and are weights of the neural network. Then the branch net and trunk net are combined to approximate the non-linear operator , i.e.,
As shown in [7], (shallow) operator networks can approximate, to arbitrary accuracy, continuous operators mapping between spaces of continuous functions. Specifically, for any , there are positive integers and , constants , , , and , such that
holds for all and .
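To fix ideas, here is a minimal numerical sketch of this branch–trunk architecture. All concrete choices (the number of sensors, the width, the weight distributions, and the ReLU activation) are illustrative assumptions for this snippet rather than quantities taken from the statements above.

```python
import numpy as np

# Minimal sketch of the shallow operator net of [7]: a branch net acting on
# the sensor values u(x_1), ..., u(x_m) and a trunk net acting on the query
# point y, combined bilinearly. All sizes and distributions are illustrative.

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

m, p, d = 20, 50, 2                              # sensors, width, dim of y
sensors = np.linspace(0.0, 1.0, m)               # sensor points x_1, ..., x_m

A_branch = rng.normal(size=(p, m))               # branch weights
b_branch = rng.normal(size=p)
W_trunk = rng.normal(size=(p, d))                # trunk weights
b_trunk = rng.normal(size=p)
c = rng.normal(size=p) / np.sqrt(p)              # outer coefficients

def G(u, y):
    """Value of the operator net at input function u and query point y."""
    branch = relu(A_branch @ u(sensors) + b_branch)    # shape (p,)
    trunk = relu(W_trunk @ y + b_trunk)                # shape (p,)
    return np.sum(c * branch * trunk)

# example: evaluate on u(x) = sin(2*pi*x) at the point y = (0.3, 0.7)
print(G(lambda x: np.sin(2 * np.pi * x), np.array([0.3, 0.7])))
```

The bilinear combination of branch and trunk features is the structure whose training dynamics are analyzed below.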
The training of neural networks is performed via supervised learning. It involves minimizing the mean-squared error between the predicted output and the actual output . Specifically, assume we have samples , where is a probability measure supported on . The aim is to minimize the following loss function:
In this paper, we primarily focus on the shallow neural operators with ReLU activation functions. Formally, we consider a shallow operator of the following form.
(1) |
where we equate the function with its value vector at the points .
We denote the loss function by . The main focus of this paper is to analyze gradient descent in training the shallow neural operator. We fix the weights and apply gradient descent (GD) to optimize the weights . Specifically,
where is the learning rate and is an abbreviation of .
At this point, the loss function is
where . Throughout this paper, we consider the initialization
(2) |
and assume that for all . Note that here we treat vector and function as equivalent.
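The following sketch puts the pieces above together: the branch weights and the outer coefficients are frozen at a symmetric random initialization, and only the first-layer trunk weights are updated by plain gradient descent on the squared loss. The synthetic target operator, sample sizes, width, and step size are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

# Hedged sketch of the training scheme analyzed here: gradient descent on the
# first-layer trunk weights W only, with the branch weights and the outer
# coefficients c_k frozen at their (symmetric) random initialization.
# The data, target operator, width p and step size eta are illustrative.

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

m, p, d = 20, 2000, 2                 # sensors, width, query dimension
n_u, n_y = 5, 10                      # input functions, query points

U = rng.normal(size=(n_u, m))         # sensor values of the input functions
Y = rng.normal(size=(n_y, d))         # query points
target = U.mean(axis=1)[:, None] * (Y ** 2).sum(axis=1)[None, :]  # toy operator

A_branch = rng.normal(size=(p, m)); b_branch = rng.normal(size=p)
W = rng.normal(size=(p, d))                        # trained weights
c = rng.choice([-1.0, 1.0], size=p) / np.sqrt(p)   # fixed outer weights

B = relu(U @ A_branch.T + b_branch)                # branch features, (n_u, p)

def predict(W):
    T = relu(Y @ W.T)                              # trunk features, (n_y, p)
    return (B * c) @ T.T                           # predictions, (n_u, n_y)

eta = 1e-3
for step in range(1001):
    R = predict(W) - target                        # residuals, (n_u, n_y)
    act = (Y @ W.T > 0).astype(float)              # ReLU activation pattern
    # dL/dw_k = sum_{i,j} R_ij * c_k * B_ik * 1{w_k . y_j > 0} * y_j
    grad = (((B * c).T @ R) * act.T) @ Y           # shape (p, d)
    W -= eta * grad
    if step % 250 == 0:
        print(step, 0.5 * np.sum(R ** 2))          # loss should decrease
```

For a sufficiently small step size and a sufficiently large width, the printed loss should decrease steadily, mirroring the linear convergence established below.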
3. Continuous Time Analysis
In this section, we present our result for gradient flow, which can be viewed as gradient descent with an infinitesimal step size. The analysis of gradient flow in continuous time is a foundational step toward understanding the discrete gradient descent algorithm. We prove that gradient flow converges to the global optimum of the loss under over-parameterization and mild conditions on the training samples. The continuous-time dynamics can be characterized as
for . We denote the prediction on at time under , i.e., with weights .
Thus, we can deduce that
(3) |
and
(4) |
Then, the dynamics of each prediction can be calculated as follows.
(5) | ||||
We let and . Then, we have
(6) |
where , whose -th entry is defined as
and , whose -th entry is defined as
Thus, we can write the dynamics of predictions as follows.
where . We can divide into blocks, the -th block of is and the -th block of is . From the form of , we can derive the Gram matrices induced by the random initialization, which we denote by and . Note that although and are large matrices, we can divide and into blocks, where each block is a matrix. Following the notation above, the -th entry of -th block of is
Thus, the -th block can be written as
where and the -th entry of is
Thus, can be seen as a Kronecker product of matrices and , where and the -th entry of is . Similarly, we have that is a Kronecker product of matrices and , where and the -th entry of is , the -th entry of is .
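The Kronecker structure can be checked numerically. In the sketch below, the two factors are simply random symmetric positive definite matrices standing in for the branch-side and trunk-side Gram matrices; the point is only the algebraic fact that the least eigenvalue of a Kronecker product is the product of the least eigenvalues, which is what makes the full Gram matrix strictly positive definite (the same fact is used in the proof of Lemma 1).

```python
import numpy as np

# Numerical check of the Kronecker structure: if the full Gram matrix is
# K = A kron B, with A the branch-side factor and B the trunk-side factor,
# then lambda_min(K) = lambda_min(A) * lambda_min(B), so K is strictly
# positive definite iff both factors are. A and B below are placeholder
# SPD matrices, not the actual kernels.

rng = np.random.default_rng(2)
n1, n2 = 4, 6                                   # #input functions, #query points

A = rng.normal(size=(n1, n1)); A = A @ A.T + n1 * np.eye(n1)
B = rng.normal(size=(n2, n2)); B = B @ B.T + n2 * np.eye(n2)

K = np.kron(A, B)                               # (n1*n2) x (n1*n2), block structure

lam_K = np.linalg.eigvalsh(K)[0]
lam_A = np.linalg.eigvalsh(A)[0]
lam_B = np.linalg.eigvalsh(B)[0]
print(np.isclose(lam_K, lam_A * lam_B))         # True
```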
Similar to the situation in regression, we can show two essential facts: (1) , and (2) for all , , .
Therefore, roughly speaking, as , the dynamics of the predictions can be written as
which results in the linear convergence.
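To make the limiting statement explicit, write $u(t)$ for the vector of all predictions at time $t$, $y$ for the corresponding targets, and $H^{\infty}$ for the limiting Gram matrix with least eigenvalue $\lambda_0>0$ (this notation is ours and only illustrates the mechanism):
\[
\frac{d}{dt}\bigl(u(t)-y\bigr)=-H^{\infty}\bigl(u(t)-y\bigr)
\;\Longrightarrow\;
u(t)-y=e^{-H^{\infty}t}\bigl(u(0)-y\bigr),
\qquad
\|u(t)-y\|_2^2\le e^{-2\lambda_0 t}\,\|u(0)-y\|_2^2 ,
\]
so the squared loss decays exponentially at rate $2\lambda_0$, which is exactly the linear convergence claimed above.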
We first show that the Gram matrices are strictly positive definite under mild assumptions.
Lemma 1.
If no two samples in are parallel and no two samples in are parallel, then and are strictly positive definite. We denote the least eigenvalue of and as and respectively.
Remark 1.
In fact, when we consider neural networks with bias, it is natural that Lemma 1 holds. Specifically, for , we can replace and by and . Thus Lemma 1 holds under the condition that no two samples in are identical, which holds naturally.
Then we can verify the two facts that are close to and are close to by the following two lemmas.
Lemma 2.
If , we have with probability at least , , and , .
Lemma 3.
Let . If are i.i.d. generated . For any set of weight vectors that satisfy for any , and , then we have with probability at least ,
(7) |
and with probability at least ,
(8) |
where the -th entry of the -th block of is
and the -th entry of the -th block of is
With these preparations, we come to the final conclusion.
Theorem 1.
Suppose the condition in Lemma 1 holds and under initialization as described in (2), then with probability at least , we have
where
Proof Sketch: Note that
(9) |
thus if and , we have
This yields that , i.e., is non-increasing, thus we have
(10) |
On the other hand, roughly speaking, the continuous dynamics of and , i.e., (3) and (4), show that
Thus if the prediction decays like (10), we can deduce that
Combining this with the stability of the discrete Gram matrices, i.e., Lemma 3, we have , and , , when is sufficiently large.
From such equivalence, we can arrive at the desired conclusion.
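The weight-movement part of this argument can be summarized schematically in our own notation, with constants suppressed ($m$ denotes the width and $n$ the total number of training pairs): since each hidden weight vector moves with speed at most of order $\|y-u(t)\|_2/\sqrt{m}$, the exponential decay of the residual gives
\[
\|w_r(t)-w_r(0)\|_2
\le\int_0^t\Bigl\|\frac{d w_r(s)}{ds}\Bigr\|_2\,ds
\lesssim\frac{\sqrt{n}}{\sqrt{m}}\int_0^t e^{-\lambda_0 s/2}\,\|y-u(0)\|_2\,ds
\lesssim\frac{\sqrt{n}\,\|y-u(0)\|_2}{\sqrt{m}\,\lambda_0},
\]
which is $O(1/\sqrt{m})$ and therefore small for wide networks; this is what keeps the time-dependent Gram matrices close to their initial counterparts via Lemma 3.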
Remark 2.
The result in Theorem 1 indicates that , which may lead to a strict requirement on . In fact, from (11), we can see that or is enough. Thus, is sufficient.
4. Discrete Time Analysis
In this section, we demonstrate that randomly initialized gradient descent for training shallow neural operators converges to the global minimum at a linear rate. Unlike the continuous-time case, the discrete-time case requires a more refined analysis. In the following, we first present our main result and then outline the proof strategy.
Theorem 2.
Under the setting of Theorem 1, if we set , then with probability at least , we have
where
For regression problems, [17] demonstrated that if the learning rate , then randomly initialized gradient descent converges to a globally optimal solution at a linear rate when is large enough. The requirement on is derived from the decomposition of the residual at the -th iteration, i.e.,
where is the prediction vector at the -th iteration under the shallow neural network and is the true prediction. Although the method from [17] can also yield linear convergence of gradient descent in training shallow neural operators, the resulting requirements on the learning rate would be very stringent due to the dependency on and . Thus, instead of decomposing the residual into the two terms above, we write it as follows, which serves as a recursion formula.
Lemma 4.
For all , we have
where is the residual term. We can divide it into blocks, each block belongs to and the -th component of -th block is defined as
(11) |
Just as in the case of regression, we prove our conclusion by induction. From the recursive formula above, it can be seen that the estimation of and , as well as the estimation of the residual , depends on and . Therefore, our inductive hypothesis concerns the following differences between the weights and their initializations.
Condition 1.
At the -th iteration, we have
(12) |
and
(13) |
and for all and , where .
This condition leads to the linear convergence of gradient descent, i.e., the result in Theorem 2.
Corollary 1.
If Condition 1 holds for , then we have that
holds for , where is required to satisfy that
Proof Sketch: Under the setting of over-parameterization, we can show that the weights , stay close to their initializations , . Thus, by the stability of the discrete Gram matrices, i.e., Lemma 3, we can deduce that and . Then, combining this with Lemma 4, we have
(14) | ||||
where the inequality requires that is positive definite. Since and , is sufficient to ensure that is positive definite, when is large enough.
From (14), if , which can be obtained from the following lemma, we can obtain that
which directly yields the desired conclusion.
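Schematically, in our own notation ($u(k)$ the stacked predictions after $k$ steps, $H(k)$ the discrete Gram matrix, $r(k)$ the residual term of Lemma 4, and with illustrative constants), the recursion behind this argument reads
\[
y-u(k+1)=\bigl(I-\eta H(k)\bigr)\bigl(y-u(k)\bigr)+r(k),
\]
and once $\lambda_{\min}(H(k))\ge\lambda_0/2$, the step size $\eta$ is small enough, and $\|r(k)\|_2\le\tfrac{\eta\lambda_0}{8}\|y-u(k)\|_2$, it yields
\[
\|y-u(k+1)\|_2\le\Bigl(1-\tfrac{\eta\lambda_0}{2}\Bigr)\|y-u(k)\|_2+\|r(k)\|_2
\le\Bigl(1-\tfrac{\eta\lambda_0}{4}\Bigr)\|y-u(k)\|_2,
\]
which, iterated over $k$, gives the linear rate of Theorem 2; the paper's exact constants appear in Lemma 5 and Corollary 1.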
Lemma 5.
Under Condition 1, for , we have
(15) |
where
(16) |
5. Physics-Informed Neural Operators
Let be a bounded open subset of . In this section, we consider the PDE of the following form:
(17) | ||||
where , , and is a differential operator,
We consider the shallow neural operator of the following form
where are the activation functions, respectively.
Given samples in the interior and on the boundary, the loss function of PINN is
Let
and
then the loss function can be written as
where , .
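For concreteness, the sketch below implements such a physics-informed loss for a one-dimensional model problem: $-G(u)''(x)=u(x)$ on $(0,1)$ with zero boundary values. The model PDE, the tanh trunk activation, and the finite-difference approximation of the second derivative are all illustrative assumptions made only for this snippet; the analysis above works with the exact derivatives of the network.

```python
import numpy as np

# Hedged sketch of a physics-informed loss for a shallow operator net on the
# toy problem  -d^2/dx^2 G(u)(x) = u(x) on (0, 1),  G(u)(0) = G(u)(1) = 0.
# The PDE, the tanh trunk activation, and the finite-difference Laplacian are
# illustrative choices, not the setting analyzed in the paper.

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0.0)

m, p = 20, 200
sensors = np.linspace(0.0, 1.0, m)
A_b = rng.normal(size=(p, m)); b_b = rng.normal(size=p)   # branch weights
w_t = rng.normal(size=p);      b_t = rng.normal(size=p)   # trunk weights (1-D query)
c = rng.choice([-1.0, 1.0], size=p) / np.sqrt(p)

def G(u_vals, x):
    """Operator net at query points x, given sensor values u_vals of u."""
    branch = relu(A_b @ u_vals + b_b)                     # (p,)
    trunk = np.tanh(np.outer(x, w_t) + b_t)               # (len(x), p)
    return trunk @ (c * branch)                           # (len(x),)

def pinn_loss(u, x_int, x_bdy, h=1e-3):
    u_vals = u(sensors)
    # interior residual: -G''(x) - u(x), with a central-difference Laplacian
    lap = (G(u_vals, x_int + h) - 2 * G(u_vals, x_int)
           + G(u_vals, x_int - h)) / h ** 2
    res_int = -lap - u(x_int)
    res_bdy = G(u_vals, x_bdy)                            # boundary residual
    return 0.5 * np.mean(res_int ** 2) + 0.5 * np.mean(res_bdy ** 2)

x_int = rng.uniform(0.05, 0.95, size=50)                  # interior samples
x_bdy = np.array([0.0, 1.0])                              # boundary samples
print(pinn_loss(lambda x: np.sin(np.pi * x), x_int, x_bdy))
```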
We first consider the continuous setting, which is a stepping stone towards understanding discrete algorithms. For and , we have
(18) | ||||
and
(19) | ||||
Thus, for the predictions, we have
(20) | ||||
and
(21) | ||||
Let and , then
where are Gram matrices at time . We can divide them into blocks and each block is a matrix in . Specifically, the -th block of and are and respectively, where
and
Recall that
and
Then, the -th () entry of is
and the -th () entry of is
where we omit the index for simplicity.
From the forms of and , we can derive the corresponding Gram matrices that are induced by the random initialization, which we denote by and , respectively. Specifically, is a Kronecker product of and , where , , the -th entry of is , the -th entry of is
And is a Kronecker product of and , where , , the -th entry of is , the -th entry of is
The Gram matrices play important roles in the convergence analysis. Similar to the setting of Section 4, we can demonstrate the strict positive definiteness of the Gram matrices under mild conditions.
Lemma 6.
If no two samples in are parallel and no two samples in are parallel, then and are both strictly positive definite. We denote their least eigenvalues by and , respectively.
Similar to Section 4, the convergence of gradient descent relies on the stability of the Gram matrices, which is demonstrated by the following two lemmas.
Lemma 7.
If , we have with probability at least , , and , .
Lemma 8.
Let . If are i.i.d. generated . For any set of weight vectors that satisfy for any , and , then we have with probability at least ,
(22) | ||||
and with probability at least ,
(23) |
Similar to training neural operators, we can derive the training dynamics of physics-informed neural operators.
Lemma 9.
For all , we have
where , can be divided into blocks, where each block is an dimensional vector. The -th () component of -th block is
The -th () component of -th block is
With these preparations in place, we can now arrive at the final convergence theorem.
Theorem 3.
If we set , then with probability at least , we have
where
and indicates that some terms involving , and are omitted.
We prove Theorem 3 by induction. Our induction hypothesis is just the following condition:
Condition 2.
At the -th iteration, we have
and , and holds for all , where
This condition directly yields the following bound of deviation from the initialization.
Corollary 2.
If Condition 2 holds for , then we have that
and
holds for all .
Lemma 10.
If Condition 2 holds for , then we have that
holds for .
6. Conclusion and Future Work
In this paper, we have analyzed the convergence of gradient descent (GD) for training wide shallow neural operators within the NTK framework, demonstrating the linear convergence of GD. The core idea is that over-parameterization ensures that all weights stay close to their initializations throughout the iterations, so that training behaves similarly to a kernel method. Several directions remain for future work. First, extending our theory to other neural operators, such as the FNO; the main difficulty is how to meet the requirements of NTK theory. Second, extending the analysis to DeepONets, which we expect to be similar to the extension from the results in [17] to those in [18].
References
- [1] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” Journal of Computational Physics, vol. 378, pp. 686–707, 2019.
- [2] B. Yu et al., “The deep ritz method: a deep learning-based numerical algorithm for solving variational problems,” Communications in Mathematics and Statistics, vol. 6, no. 1, pp. 1–12, 2018.
- [3] Z. Shen, H. Yang, and S. Zhang, “Optimal approximation rate of relu networks in terms of width and depth,” Journal de Mathématiques Pures et Appliquées, vol. 157, pp. 101–135, 2022.
- [4] J. Lu, Z. Shen, H. Yang, and S. Zhang, “Deep network approximation for smooth functions,” SIAM Journal on Mathematical Analysis, vol. 53, no. 5, pp. 5465–5506, 2021.
- [5] D. Yarotsky, “Error bounds for approximations with deep relu networks,” Neural Networks, vol. 94, pp. 103–114, 2017.
- [6] K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart, “Model reduction and neural networks for parametric pdes,” The SMAI Journal of Computational Mathematics, vol. 7, pp. 121–157, 2021.
- [7] T. Chen and H. Chen, “Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems,” IEEE Transactions on Neural Networks, vol. 6, no. 4, pp. 911–917, 1995.
- [8] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis, “Learning nonlinear operators via deeponet based on the universal approximation theorem of operators,” Nature Machine Intelligence, vol. 3, no. 3, pp. 218–229, 2021.
- [9] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar, “Fourier neural operator for parametric partial differential equations,” arXiv preprint arXiv:2010.08895, 2020.
- [10] N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar, “Neural operator: Learning maps between function spaces with applications to pdes,” Journal of Machine Learning Research, vol. 24, no. 89, pp. 1–97, 2023.
- [11] N. Kovachki, S. Lanthaler, and S. Mishra, “On universal approximation and error bounds for fourier neural operators,” Journal of Machine Learning Research, vol. 22, no. 290, pp. 1–76, 2021.
- [12] N. B. Kovachki, S. Lanthaler, and A. M. Stuart, “Operator learning: Algorithms and analysis,” arXiv preprint arXiv:2402.15715, 2024.
- [13] S. Lanthaler, S. Mishra, and G. E. Karniadakis, “Error estimates for deeponets: A deep learning framework in infinite dimensions,” Transactions of Mathematics and Its Applications, vol. 6, no. 1, p. tnac001, 2022.
- [14] H. Liu, H. Yang, M. Chen, T. Zhao, and W. Liao, “Deep nonparametric estimation of operators between infinite dimensional spaces,” Journal of Machine Learning Research, vol. 25, no. 24, pp. 1–67, 2024.
- [15] B. Shrimali, A. Banerjee, and P. Cisneros-Velarde, “Optimization for neural operator learning: Wider networks are better.”
- [16] S. Wang, H. Wang, and P. Perdikaris, “Improved architectures and training algorithms for deep operator networks,” Journal of Scientific Computing, vol. 92, no. 2, p. 35, 2022.
- [17] S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over-parameterized neural networks,” arXiv preprint arXiv:1810.02054, 2018.
- [18] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep neural networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 1675–1685.
- [19] J. He, L. Li, J. Xu, and C. Zheng, “Relu deep neural networks and linear finite elements,” arXiv preprint arXiv:1807.03973, 2018.
- [20] Y. Gao, Y. Gu, and M. Ng, “Gradient descent finds the global optima of two-layer physics-informed neural networks,” in International Conference on Machine Learning. PMLR, 2023, pp. 10676–10707.
- [21] E. Giné and R. Nickl, Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2016, vol. 40.
- [22] A. K. Kuchibhotla and A. Chakrabortty, “Moving beyond sub-gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression,” Information and Inference: A Journal of the IMA, vol. 11, no. 4, pp. 1389–1456, 2022.
- [23] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes. Springer, 1996.
Appendix
Before the proofs, we first define the events
(24) |
and
(25) |
for all .
Note that the event happens if and only if ; thus, by the anti-concentration inequality for the Gaussian distribution (Lemma 15), we have
(26) |
Similarly, we have
Moreover, we let , and , .
7. Proof of Continuous Time Analysis
7.1. Proof of Lemma 1
Proof.
First, recall that is a Kronecker product of and . The -th entry of is and the -th entry of is . As we know, the Kronecker product of two strictly positive definite matrices is also strictly positive definite. Thus, it suffices to demonstrate that and are both strictly positive definite.
The proof relies on standard functional analysis. Let be the Hilbert space of integrable -dimensional vector fields on , i.e., if . Then the inner product of this Hilbert space is for . With these preparations in place, to prove that is strictly positive definite, it is equivalent to show that are linearly independent, where for . This is exactly the result of Theorem 3.1 in [17]. Similarly, Theorem 2.1 in [19] shows that are linearly independent if no two samples in are parallel. Thus it follows directly that is strictly positive definite. Similarly, we can deduce that is also strictly positive definite.
As for other activation functions, such as or smooth activation functions, similar conclusions hold true. For specific details, refer to [18] and [20].
∎
7.2. Proof of Lemma 2
Proof.
First, let
then .
Note that Lemma 13 implies that for all , , which yields that
Thus, by applying Lemma 12, we can deduce for fixed and , with probability at least ,
Taking a union bound yields that with probability at least ,
Thus, if , we have , resulting in that .
On the other hand,
Similarly, applying Lemma 12 yields that with probability at least ,
which leads to the desired conclusion.
∎
7.3. Proof of Lemma 3
Proof.
First, for , recall that the -th entry of -th block is
We let
and let be the initialized parts corresponding to respectively.
Note that we can decompose as follows:
For the first part , from the boundedness of and , we have
For and , we have
i.e., . Moreover, Lemma 13 shows that . Thus, combining these facts yields that
(27) |
For the second part , note that
(28) | ||||
From the Bernstein inequality (Lemma 11), we have that with probability at least ,
(29) |
holds for any .
Therefore, we can deduce that
(30) | ||||
Summing yields that
(31) | ||||
holds with probability at least .
Second, for , recall that -th entry of -th block is
Let and be the corresponding initialized parts.
Similarly, we decompose as follows:
For the first part , we have
(32) | ||||
thus we have
(33) |
Note that for all , then applying Lemma 12 yields that with probability at least ,
holds for all .
For the second part , we cannot directly apply the Bernstein inequality. Instead, we first truncate . Note that for , we have , i.e., with probability at least , . Thus, taking a union bound yields that with probability at least ,
holds for any .
Therefore, under this event,
(34) | ||||
From the Bernstein inequality, we have that with probability at least ,
Thus, with probability at least ,
Summing yields that
(35) | ||||
holds with probability at least .
∎
7.4. Proof of Theorem 1
The proof of Theorem 1 consists of the following Lemma 6, Lemma 7, Lemma 8 and Lemma 9. First, we assume that the following lemmas are considered in the setting of events in Lemma 13, Lemma 14 and , where .
Lemma 11.
If for , , , then we have
Proof.
From the conditions and , we can deduce that
From this, we have
which yields that
i.e.,
∎
Lemma 12.
Suppose for , , and holds for any , then we have that
holds for any , where is a universal constant.
Proof.
For , we have
(36) | ||||
where the last inequality follows from Lemma 6 and the first inequality follows from that
Therefore, we have
where is a universal constant. ∎
Lemma 13.
Suppose for , , and holds for any , then we have that
holds for any , where is a universal constant.
Proof.
For , we have
(37) | ||||
where the last inequality follows from Lemma 6. Then, similar to that in Lemma 7, the conclusion holds.
∎
Lemma 14.
If and , we have that for all , the following two conclusions hold:
- and ;
- and for any .
Proof.
The proof is based on contradiction. Suppose is the smallest time that the two conclusions do not hold, then either conclusion 1 does not hold or conclusion 2 does not hold.
If conclusion 1 does not hold, i.e., either or , then Lemma 3 implies that there exists , or there exists , . This shows that conclusion 2 does not hold, which contradicts the minimality of .
If conclusion 2 does not hold, then either there exists , or there exists , . If , then Lemma 7 implies that there exist such that or , or there exists , , which contradicts the minimality of . The remaining case is similar.
∎
Proof of Theorem 1.
Theorem 1 is a direct corollary of Lemma 6 and Lemma 9. Thus, it remains only to clarify the requirements for so that Lemma 6, Lemma 7 and Lemma 8 hold. First, and should ensure that and , i.e.,
(38) |
Combining this with the requirement that , we can deduce that
Moreover, the requirement for also leads to that
which are the confidence levels appearing in Lemma 3.
∎
8. Proof of Discrete Time Analysis
8.1. Proof of Lemma 4
Proof.
First, we can decompose as follows.
(39) | ||||
Note that
(40) | ||||
and
(41) | ||||
Plugging (33) and (34) into (32) yields that
where represents the -th of the matrix and , we can divide it into blocks, the -th component of -th block is defined as
Thus, we have
By using a simple algebraic transformation, we have
(42) | ||||
∎
8.2. Proof of Lemma 5
Proof.
We first express explicitly the -component of the -th block of the residual term as follows.
(43) | ||||
From the forms of , and , we have
(44) | ||||
and
(45) | ||||
and
(46) | ||||
Thus, we can decompose as follows
where
(47) | ||||
and
(48) | ||||
For , we replace in the definition of by and still denote the event as for simplicity, i.e.,
and .
From the induction hypothesis, we know and . Thus, holds for any . From this fact, we can deduce that for any ,
(49) | ||||
On the other hand, for any , we have
(50) |
Thus, combining (41), (42) and (43) yields that
(51) | ||||
where the second inequality follows from Cauchy’s inequality and the form of , i.e.,
From the Bernstein inequality, we have that with probability at least ,
This leads to the final upper bound:
(52) | ||||
It remains to bound , which can be written as follows.
(53) | ||||
Note that
thus we can bound the first term and second term by
(54) |
For the third term in (46), we also replace in the definition of by and still denote the event by . Recall that
(55) |
and . Note that and , thus for , we have . Combining this fact with (46) and (47), we can deduce that
(56) |
Thus, we have to bound . From the gradient descent update formula, we have
(57) | ||||
Recall that
Therefore,
(58) | ||||
where the last inequality follows from taking sufficiently large in the end.
Combining (50) and (51) yields that
(59) |
Plugging this into (49) leads to that
By applying the Bernstein’s inequality, we have that with probability at least ,
Therefore,
(60) | ||||
From (45) and (53), we have
(61) | ||||
Therefore,
(62) |
where
(63) |
∎
8.3. Proof of Corollary 1
Proof.
Note that when , , and is positive definite, we have that for ,
(64) | ||||
Thus,
holds for .
Now, we have to derive the requirement for such that these conditions hold. First, from Lemma 3, when
we have , and . Thus, when and , we have and . Specifically, need to satisfy that
(65) |
Moreover, at this point, we can deduce that
and similarly, . Thus, is sufficient to ensure that is positive definite.
Second, we need to make sure that . From (56), suffices, i.e.,
(66) |
Combining these requirements for , i.e., (58), (59) and the condition in Lemma 2, leads to the desired conclusion.
∎
8.4. Proof of Theorem 2
Proof.
From Corollary 1, it remains only to verify that Condition 1 also holds for . Note that in (52), we have proven that
holds for and .
Combining this with Corollary 1 yields that
Similarly, in (44), we have proven that
which yields that
Moreover, from the triangle inequality, we have
∎
8.5. Proof of Lemma 6
Proof.
First, for , recall that the Kronecker product of two strictly positive definite matrices is also strictly positive definite; thus it suffices to demonstrate that and are both strictly positive definite. For , similar to the proof of Lemma 2, let be the Hilbert space of integrable functions on , i.e., if . Now, to prove that is strictly positive definite, it is equivalent to show that are linearly independent, where . This has been proved in [19]; we provide a different proof for completeness, and this proof also establishes the strict positive definiteness of . Suppose that there are such that
which implies that
holds for all due to the continuity of .
Let for ; then Lemma A.1 in [17] implies that, when no two samples in are parallel, for any . Thus, we can choose . Since is a closed set, there is a positive constant such that . This fact implies that is differentiable in for each , and thus is also differentiable in . However, is not differentiable in . Thus we can deduce that . Similarly, we have for all , which implies that is strictly positive definite. As for , it can be seen as a Gram matrix of a PINN, and Lemma 3.2 in [20] implies that it is strictly positive definite.
Second, for , recall that . Note that the -th entry of is . Thus, Theorem 3.1 in [4] implies that is strictly positive definite. For , let
and be the Hilbert space of integrable -dimensional vector fields on . Suppose that there are such that
which yields that
holds for all .
Let for and for . Thus for any . Similarly, we can choose and such that . Note that and are differentiable in , thus is also differentiable in , implying that . Therefore, for all . Moreover, similar to the proof of the strict positive definiteness of , we can also deduce that for all . Finally, is strictly positive definite.
∎
8.6. Proof of Lemma 7
Proof.
First, for , we consider its -th block, whose entry has the following form
where
for ,
for ,
for ,
To use the concentration inequality, we need to clarify the order of the sub-Weibull random variable . Note that Lemma 18 implies that
On the other hand, from
we have
and
Therefore, we can deduce that
and
Note that for , thus , . Thus Lemma 20 implies that
On the other hand, Lemma 21 implies that
Note that and . From the Taylor expansion of the function , we have that for any ,
which implies that . Therefore, . Finally, applying Lemma 17 leads to that with probability at least ,
Taking a union bound yields that with probability at least ,
Second, for , we consider its -th block, whose entry has the following form
where
for ,
for ,
for ,
From Lemma 21, we have
Note that
and
From Lemma 21, we can deduce that
and , thus
Therefore, with probability at least ,
∎
8.7. Proof of Lemma 8
Proof.
For , from the form of -th entry of the -th block, we focus on the form , where
and the notation means replacing in the definitions of and with and , respectively.
For , (25) implies that
(67) |
For , when , we have
(68) | ||||
Note with probability at least , we have , , holds for all , , where
Under these events, we can deduce that
and
Thus, for , we have
(69) |
From Bernstein’s inequality, we have with probability at least ,
holds for all .
Thus, summing yields that
(70) | ||||
When , we can obtain the same estimation, since
and
Note that we can decompose as follows
Therefore, we have
(71) | ||||
Summing yields that
For , from the form of -th entry of the -th block, we focus on the form , where
Similarly, we can deduce that
(72) | ||||
where the last inequality holds with probability at least due to the use of Bernstein inequality.
For , note that
and
Thus, similar to (66), we have
(73) |
Therefore, similar to (69), we have
(74) | ||||
Summing yields that
∎
8.8. Proof of Lemma 9
Proof.
Note that
(75) | ||||
From the updating formula of gradient, we have
Similarly, we can obtain that
For , we can derive a similar result, which is omitted for simplicity. Similar to the derivation in the section on neural operators, we have
(76) |
where , can be divided into blocks, where each block is an dimensional vector. The -th () component of -th block is
The -th () component of -th block is
Finally, applying a simple algebraic transformation to (76), we have
∎
8.9. Proof of Corollary 2
Proof.
Let . We first estimate . The gradient updating rule yields that
For the gradient term, we have
and
Therefore, we obtain that
(77) | ||||
where the first inequality follows from Cauchy’s inequality.
Summing from to yields that
(78) | ||||
where the second inequality follows from the induction hypothesis.
Then we estimate ; the gradient descent updating rule yields that
(79) | ||||
Recall that
Note that
and
Thus, we have
(80) | ||||
where in the last inequality, we assume that .
Similarly, since
we can obtain that
(81) |
Combining (79), (80) and (81) yields that
(82) | ||||
where the first inequality follows from Cauchy’s inequality.
Summing from to yields that
(83) | ||||
∎
8.10. Proof of Lemma 10
Proof.
From the form of the residual , it suffices to estimate
and
which we denote by and , respectively. In fact, we only need to estimate , since includes the term , which is the same as the boundary term.
Recall that the shallow neural operator has the form
We first estimate . We can explicitly express the difference as follows:
(84) | ||||
where in the last equality, we split into two terms in order to estimate them separately later.
On the other hand, from the form of neural operator, we can obtain that
(85) | ||||
and
(86) | ||||
With the explicit expressions for each term of , namely (84), (85), and (86), we can split into two parts: the first part is the second term of (84) minus (86), and the second part is the first term of (84) minus (85). Specifically, let
(87) | ||||
and
(88) | ||||
where in the definition, we have omitted the indices for simplicity.
Then
(89) |
To estimate , since
it suffices to estimate
With a little abuse of notation, we let , and . Then we have that . Note that , thus for , we have . At this point,
(90) | ||||
On the other hand, for all ,
Therefore, for , we have
Combining with (77) yields that
(91) | ||||
where the last inequality follows from the Bernstein inequality and holds with probability at least .
For , we can rewrite it as follows:
(92) | ||||
In the following, we estimate the three terms and separately.
For and , note that for , we have
(93) |
Moreover, we can deduce that
(94) |
and
(95) |
Combining (93), (94) and (95) yields that
(96) |
and
(97) |
It remains to estimate . From its form, it suffices to estimate
(98) | ||||
where and are respectively related to the first-order term, the second-order term, and the zeroth-order term of the PDE. Specifically,
and
Note that both and have the form
However, due to the non-differentiability of the ReLU function, we cannot perform a second-order expansion; instead, we decompose it into
(99) | ||||
Thus, for , we have
(100) | ||||
We apply the mean value theorem to the first term in (100) and obtain that
Similarly, for the second term, we have
Thus, for , we have
(101) | ||||
For , with the same decomposition as in (99), we have
(102) | ||||
For the first term, similar to (90), for , we have
thus
For the second term, the mean value theorem yields that
For the third term, we have
Combining these results for and yields that
(103) | ||||
With these estimations for and , i.e., (96), (97) and (98), we obtain that
(104) | ||||
Recall that (82) shows that
and
Therefore, combining this with the Bernstein inequality and summing yields that
(105) | ||||
Recall that (91) implies that
Therefore, we have
(106) | ||||
∎
8.11. Proof of Theorem 3
Proof.
It suffices to show that Condition 2 also holds for . From the iteration formula in Lemma 9, we have
(107) | ||||
From the stability of the Gram matrices, i.e., Lemma 8, when satisfy that
we have
Thus, with Lemma 7, we can deduce that
implying that and .
Therefore, when , we have that is positive definite and then
(108) |
Let , then combining (107) and (108) yields that
(109) | ||||
where the last inequality requires that .
Finally, we need to specify the requirements for to ensure that the aforementioned conditions are satisfied. Recall that first, needs to satisfy that and , i.e.,
(110) |
Simple algebraic operations yield that
(111) |
Second, needs to satisfy that , i.e.,
implying that
(112) |
Finally, combining (111), (112) and the estimation of in Lemma 17, we have that
where indicates that some terms involving , and are omitted.
∎
9. Auxiliary Lemmas
Lemma 15 (Anti-concentration of Gaussian distribution).
Let , then for any ,
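In explicit form (with our own normalization of the constant), the bound we have in mind reads: if $Z\sim\mathcal{N}(0,\sigma^2)$, then for any $r>0$,
\[
\mathbb{P}\bigl(|Z|\le r\bigr)\le\frac{2r}{\sqrt{2\pi}\,\sigma},
\]
which follows from bounding the Gaussian density by its value at the origin.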
Lemma 16 (Bernstein inequality, Theorem 3.1.7 in [21]).
Let , be independent centered random variables a.s. bounded by in absolute value. Set and . Then, for all ,
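A standard form of this inequality, written out explicitly in our own notation, is: for independent centered random variables $X_1,\dots,X_n$ with $|X_i|\le c$ almost surely and $\sigma^2=\sum_{i=1}^n\mathbb{E}X_i^2$, for all $t>0$,
\[
\mathbb{P}\Bigl(\Bigl|\sum_{i=1}^n X_i\Bigr|\ge t\Bigr)\le 2\exp\Bigl(-\frac{t^2}{2(\sigma^2+ct/3)}\Bigr).
\]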
First, we provide some preliminaries about Orlicz norms.
Let be a non-decreasing convex function with . The -Orlicz norm of a real-valued random variable is given by
If , we say that is sub-Weibull of order , where
Note that when , is a norm, and when , is a quasi-norm. In the related proofs, we frequently use the fact that, for a real-valued random variable , we have and . Moreover, when , we have . Since, without loss of generality, we can assume that , then
where the first inequality and the second inequality follow from the inequality for .
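For the reader's convenience, the standard conventions we have in mind are $\psi_\alpha(x)=e^{x^\alpha}-1$ for $\alpha>0$ and
\[
\|X\|_{\psi_\alpha}=\inf\bigl\{t>0:\ \mathbb{E}\,\psi_\alpha\bigl(|X|/t\bigr)\le 1\bigr\},
\]
with $X$ called sub-Weibull of order $\alpha$ when $\|X\|_{\psi_\alpha}<\infty$; the cases $\alpha=2$ and $\alpha=1$ recover sub-Gaussian and sub-exponential random variables, and products satisfy $\|XY\|_{\psi_{\alpha/2}}\le\|X\|_{\psi_\alpha}\|Y\|_{\psi_\alpha}$ (a special case of Lemma 20).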
Lemma 17 (Theorem 3.1 in [22]).
If are independent mean zero random variables with for all and some , then for any vector , the following holds true:
where ,
and for when and when ,
and .
Lemma 18.
For any , we have that with probability at least ,
Moreover, its -norm is a universal constant.
Proof.
Note that and , then applying Lemma 17 yields that with probability at least ,
Moreover, from the equivalence of the norm and the concentration inequality (see Lemma 2.2.1 in [23]), it follows that
∎
Lemma 19.
With probability at least , we have
Proof.
Recall that
Note that
thus
Then from Lemma 12, we have that with probability at least ,
∎
Lemma 20.
If with , then we have , where satisfies that
Proof.
Without loss of generality, we can assume that . To prove this, let us use Young’s inequality, which states that
Let , then
where the first and second inequality follow from Young’s inequality. From this, we have that .
∎
Lemma 21.
With probability at least , we have
Proof.
Recall that the loss function of the PINN is
and the shallow neural operator has the following form
In order to estimate the initial value, it suffices to consider
and
Note that , thus Lemma 20 implies that
Therefore, combining with Lemma 18 yields that
Similarly, we can deduce that
Finally, applying Lemma 17 leads to that with probability at least ,
∎