
Neural Tangent Kernel of Neural Networks with Loss Informed by Differential Operators

Weiye Gan (gwy22@mails.tsinghua.edu.cn), Department of Mathematical Sciences, Tsinghua University, Beijing, China; Yicheng Li (liyc22@mails.tsinghua.edu.cn), Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing, China; Qian Lin (qianlin@tsinghua.edu.cn), Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing, China; Zuoqiang Shi (corresponding author, zqshi@tsinghua.edu.cn), Yau Mathematical Sciences Center, Tsinghua University, Beijing, 100084, China & Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, 101408, China.
Abstract

Spectral bias is a significant phenomenon in neural network training and can be explained by neural tangent kernel (NTK) theory. In this work, we develop the NTK theory for deep neural networks with physics-informed loss, providing insights into the convergence of the NTK during initialization and training, and revealing its explicit structure. We find that, in most cases, the differential operators in the loss function do not induce a faster eigenvalue decay rate or a stronger spectral bias. Some experimental results are also presented to verify the theory.

Keywords: neural tangent kernel, physics-informed neural networks, spectral bias, differential operator

1 Introduction

In recent years, Physics-Informed Neural Networks (PINNs) [31] have gained popularity as a promising alternative for solving Partial Differential Equations (PDEs). PINNs leverage the universal approximation capabilities of neural networks to approximate solutions while incorporating physical laws directly into the loss function. This approach eliminates the need for discretization and can handle high-dimensional problems more efficiently than traditional methods [14, 33]. Moreover, PINNs are mesh-free, making them particularly suitable for problems with irregular geometries [19, 39]. Despite these advantages, PINNs are not without limitations. One major challenge is their difficulty in training, often resulting in slow convergence or suboptimal solutions [6, 22]. This issue is particularly pronounced in problems whose underlying PDE solutions contain high-frequency or multiscale features [11, 30, 40].

To explain the obstacles in training PINNs, a significant aspect is the deficiency of neural networks in learning multi-frequency functions, referred to as spectral bias [29, 38, 12, 32], which means that neural networks tend to learn the components of "lower complexity" faster during training [29]. This phenomenon is intrinsically linked to Neural Tangent Kernel (NTK) theory [18], as the NTK's spectrum directly governs the convergence rates of different frequency components during training [7]. Specifically, neural networks are shown to converge faster along the directions defined by eigenfunctions of the NTK with larger eigenvalues. Therefore, the components that are empirically considered to have "low complexity" are actually eigenfunctions of the NTK with large eigenvalues, and vice versa for components of high complexity. The detrimental effects of spectral bias can be exacerbated by two primary factors: first, the target function inherently possesses significant components of high complexity, and second, there is a substantial disparity in the magnitudes of the NTK's eigenvalues.

In the context of PINNs, the target function corresponds to the solution of the PDE, rendering improvements in the first aspect particularly challenging. A more promising avenue lies in refining the network architecture to ensure that the NTK exhibits a more favorable eigenvalue distribution. Several efforts have been made in this domain, such as the implementation of Fourier feature embedding [35, 20] and strategic weight balancing to harmonize the disparate components of the loss function [36].

Different from the $l_2$ loss in standard NTK theory, PINNs generally consider the following physics-informed loss

\mathcal{L}(u)=\frac{1}{2n}\sum_{i=1}^{n}(\mathcal{T}u(x_{i};\theta)-f_{i})^{2}+\frac{1}{2m}\sum_{j=1}^{m}(\mathcal{B}u(x_{j};\theta)-g_{j})^{2}

for PDE

\left\{\begin{array}{ll}\mathcal{D}u(x)=f(x), & x\in\Omega,\\ \mathcal{B}u(x)=g(x), & x\in\partial\Omega.\end{array}\right.

In this paper, we only focus on the loss related to the interior equation and neglect the boundary conditions, i.e.

\mathcal{L}(u;\mathcal{T})\coloneqq\frac{1}{2n}\sum_{i=1}^{n}(\mathcal{T}u(x_{i};\theta)-f_{i})^{2}.

This loss function may be adopted when $\Omega$ is a closed manifold or the boundary conditions are already imposed as hard constraints on the neural network [8]. We first demonstrate the convergence of the neural network kernel at initialization, whereas most previous works only consider shallow networks. We apply the functional-analytic idea of [15] so that we can handle arbitrary high-order differential operators $\mathcal{T}$ and deep neural networks. Another benefit of this approach is that we can show that the NTK related to $\mathcal{L}(u;\mathcal{T})$ is exactly $\mathcal{T}_{x}\mathcal{T}_{x^{\prime}}K^{NT}(x,x^{\prime})$, where $K^{NT}(x,x^{\prime})$ is the NTK for the $l_2$ loss $\mathcal{L}(u;Id)$. With this connection, we analyze the impact of $\mathcal{T}$ on the decay rate of the NTK's eigenvalues, which affects the convergence and generalization of the related kernel regression [26, 25]. We find that the additional differential operator in the loss function does not lead to a stronger spectral bias. Therefore, to improve the performance of PINNs from a spectral bias perspective, particular attention should be paid to the equilibrium among distinct components within the loss function [36, 27]. For the convergence in training, we present a sufficient condition for general $\mathcal{T}$ and neural networks; this condition is verified in a simple but specific case. Through these results, we hope to advance the theoretical foundations of PINNs and pave the way for their broader application in scientific computing.

The remainder of the paper is organized as follows. In Section 2, we introduce the function space we consider, the settings of neural networks, and some previous results of the NTK theory. All theoretical results are demonstrated in Section 3, including the convergence of NTK during initialization and training and the impact of the differential operator within the loss on the spectrum of NTK. In Section 4, we design some experiments to verify our theory. Some conclusions are drawn in Section 5.

2 Preliminary

In this section, we introduce the basic problem setup including the function space considered and the neural network structure, as well as some background on the NTK theory.

2.1 Continuously Differential Function Spaces

Let $T\subset\mathbb{R}^{n}$ be a compact set, $k\geq 0$ be an integer and $Z_{n}$ be the $n$-fold index set,

Z_{n}=\{\alpha=(\alpha_{1},\dots,\alpha_{n})\,\big|\,\alpha_{i}\text{ is a non-negative integer, }\forall i=1,\dots,n\}.

We denote $\absolutevalue{\alpha}=\sum_{i=1}^{n}\alpha_{i}$ and $D^{\alpha}=\frac{\partial^{\absolutevalue{\alpha}}}{\partial x_{1}^{\alpha_{1}}\dots\partial x_{n}^{\alpha_{n}}}$. Then, the $k$-times continuously differentiable function space $C^{k}(T;\mathbb{R}^{m})$ is defined as

C^{k}(T;\mathbb{R}^{m})=\{u:T\rightarrow\mathbb{R}^{m}\,\big|\,D^{\alpha}u\text{ is continuous on }T,\ \forall\alpha:\absolutevalue{\alpha}\leq k\}.

If $m=1$, we omit $\mathbb{R}^{m}$ for simplicity. The same applies to the following function spaces. $C^{k}(T;\mathbb{R}^{m})$ can be equipped with the norm

\norm{u}_{C^{k}(T;\mathbb{R}^{m})}=\max_{\alpha\in Z_{n},\absolutevalue{\alpha}\leq k}\sup_{x\in T}\absolutevalue{D^{\alpha}u(x)}.

For a constant $\beta\in[0,1]$, we also define

[u]_{C^{0,\beta}(T;\mathbb{R}^{m})}=\sup_{x,y\in T,\,x\neq y}\frac{\absolutevalue{u(x)-u(y)}}{\absolutevalue{x-y}^{\beta}}

and

C^{0,\beta}(T;\mathbb{R}^{m})=\{u\in C^{0}(T;\mathbb{R}^{m})\,\big|\,[u]_{C^{0,\beta}(T;\mathbb{R}^{m})}<\infty\},
C^{k,\beta}(T;\mathbb{R}^{m})=\{u\in C^{k}(T;\mathbb{R}^{m})\,\big|\,[D^{\alpha}u]_{C^{0,\beta}(T;\mathbb{R}^{m})}<\infty,\ \forall\alpha:\absolutevalue{\alpha}\leq k\}.

$C^{0,1}(T;\mathbb{R}^{m})$ is the familiar Lipschitz function space. $C^{k,\beta}(T;\mathbb{R}^{m})$ can also be equipped with the norm

\norm{u}_{C^{k,\beta}(T;\mathbb{R}^{m})}=\norm{u}_{C^{k}(T;\mathbb{R}^{m})}+\max_{\alpha\in Z_{n},\absolutevalue{\alpha}\leq k}[D^{\alpha}u]_{C^{0,\beta}(T;\mathbb{R}^{m})}.

For a function of two variables $u(x,x^{\prime})$, we denote by $D_{x}^{\alpha}$ a differential operator with respect to $x$, and similarly for $D_{x^{\prime}}^{\alpha}$. We have the analogous definitions

C^{k\times k}(T\times T;\mathbb{R}^{m})=\{u:T\times T\rightarrow\mathbb{R}^{m}\,\big|\,D_{x}^{\alpha}D_{x^{\prime}}^{\alpha^{\prime}}u\text{ is continuous on }T\times T,\ \forall\alpha,\alpha^{\prime}:\absolutevalue{\alpha},\absolutevalue{\alpha^{\prime}}\leq k\}

with norm

\norm{u}_{C^{k\times k}(T\times T;\mathbb{R}^{m})}=\max_{\alpha,\alpha^{\prime}\in Z_{n},\absolutevalue{\alpha},\absolutevalue{\alpha^{\prime}}\leq k}\sup_{x,x^{\prime}\in T}\absolutevalue{D_{x}^{\alpha}D_{x^{\prime}}^{\alpha^{\prime}}u(x,x^{\prime})}

and

C^{k\times k,\beta}(T\times T;\mathbb{R}^{m})=\{u\in C^{k\times k}(T\times T;\mathbb{R}^{m})\,\big|\,[D_{x}^{\alpha}D_{x^{\prime}}^{\alpha^{\prime}}u]_{C^{0,\beta}(T\times T;\mathbb{R}^{m})}<\infty,\ \forall\alpha,\alpha^{\prime}:\absolutevalue{\alpha},\absolutevalue{\alpha^{\prime}}\leq k\}

with norm

\norm{u}_{C^{k\times k,\beta}(T\times T;\mathbb{R}^{m})}=\norm{u}_{C^{k\times k}(T\times T;\mathbb{R}^{m})}+\max_{\alpha,\alpha^{\prime}\in Z_{n},\absolutevalue{\alpha},\absolutevalue{\alpha^{\prime}}\leq k}[D_{x}^{\alpha}D_{x^{\prime}}^{\alpha^{\prime}}u]_{C^{0,\beta}(T\times T;\mathbb{R}^{m})}.

With the chain rule and the fact that the composition of Lipschitz functions is still Lipschitz, the following lemmas can be verified.

Lemma 2.1.

Let $T_{0}\subset\mathbb{R}^{n_{0}}$ and $T_{1}\subset\mathbb{R}^{n_{1}}$ be two compact sets. Let $\varphi:T_{0}\rightarrow T_{1}$ and $\psi:T_{1}\rightarrow\mathbb{R}^{n_{2}}$ be two $C^{k,1}$ maps. Then, $\psi\circ\varphi:T_{0}\rightarrow\mathbb{R}^{n_{2}}$ is also a $C^{k,1}$ map. Moreover, there exists a constant $C$ depending only on $k$, $\norm{\varphi}_{C^{k,1}(T_{0};T_{1})}$ and $\norm{\psi}_{C^{k,1}(T_{1};\mathbb{R}^{n_{2}})}$ such that

\norm{\psi\circ\varphi}_{C^{k,1}(T_{0};\mathbb{R}^{n_{2}})}\leq C.
Lemma 2.2.

Let $T_{0}\subset\mathbb{R}^{n_{0}}$ and $T_{1}\subset\mathbb{R}^{n_{1}}$ be two compact sets. Let $\varphi:T_{0}\times T_{0}\rightarrow T_{1}$ and $\psi:T_{1}\rightarrow\mathbb{R}^{n_{2}}$ be of class $C^{k\times k,1}$ and $C^{k,1}$, respectively. Then, $\psi\circ\varphi:T_{0}\times T_{0}\rightarrow\mathbb{R}^{n_{2}}$ is also a $C^{k\times k,1}$ map. Moreover, there exists a constant $C$ depending only on $k$, $\norm{\varphi}_{C^{k\times k,1}(T_{0}\times T_{0};T_{1})}$ and $\norm{\psi}_{C^{k,1}(T_{1};\mathbb{R}^{n_{2}})}$ such that

\norm{\psi\circ\varphi}_{C^{k\times k,1}(T_{0}\times T_{0};\mathbb{R}^{n_{2}})}\leq C.

2.2 Settings of Neural Network

Let the input be $x\in\mathcal{X}\subset\mathbb{R}^{d}$, where $\mathcal{X}$ is a convex bounded domain, and the output be $y\in\mathbb{R}$. Let $m_{0}=d$, let $m_{1},\dots,m_{L}$ be the widths of the $L$ hidden layers and let $m_{L+1}=1$. Define the pre-activations $z^{l}(x)\in\mathbb{R}^{m_{l}}$ for $l=1,\dots,L+1$ by

\begin{aligned} &z^{1}(x)=W^{0}x,\\ &z^{l+1}(x)=\frac{1}{\sqrt{m_{l}}}W^{l}\sigma(z^{l}(x)),\quad\text{for }l=1,\dots,L,\end{aligned} \qquad (1)

where $W^{l}\in\mathbb{R}^{m_{l+1}\times m_{l}}$ are the weights and $\sigma$ is the activation function, satisfying the following assumption for some non-negative integer $k$.

Assumption 1.

There exist a nonnegative integer $k$ and positive constants $l_{1},\dots,l_{k}$ such that the activation $\sigma\in C^{k}(\mathbb{R})$ and

\norm{\frac{\sigma^{(j)}}{1+|x|^{l_{j}}}}_{L^{\infty}}<\infty

for all $j=1,\dots,k$.

The output of the neural network is then given by $u^{\mathrm{NN}}(x;\theta)=z^{L+1}(x)$. Moreover, denoting by $z^{l}_{i}$ the $i$-th component of $z^{l}$, we have

\begin{aligned} z^{1}_{i}(x)&=\sum_{j=1}^{m_{0}}W^{0}_{ij}x_{j},\\ z^{l+1}_{i}(x)&=\frac{1}{\sqrt{m_{l}}}\sum_{j=1}^{m_{l}}W^{l}_{ij}\sigma(z^{l}_{j}(x)).\end{aligned}

We denote by $\theta=(W^{0},\dots,W^{L})$ the collection of all parameters, flattened into a column vector. For simplicity, we also write $u(x)=u^{\mathrm{NN}}(x;\theta)$. The neural network is initialized by i.i.d. random variables. Specifically, all elements of $W^{l}$ are i.i.d. with mean $0$ and variance $1$.

In this paper, we consider training the neural network (1) with gradient descent and the following physics-informed loss

\mathcal{L}(u;\mathcal{T})\coloneqq\frac{1}{2n}\sum_{i=1}^{n}(\mathcal{T}u(x_{i};\theta)-y_{i})^{2}, \qquad (2)

where the samples $x_{i}\in\mathcal{X}$ and $\mathcal{T}$ is a known linear differential operator.
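For concreteness, the following is a minimal PyTorch sketch (not the code used in our experiments) of the network (1) with the $1/\sqrt{m_{l}}$ scaling and of the loss (2) for the illustrative choice $\mathcal{T}=-\frac{d^{2}}{dx^{2}}$ with $d=1$; the widths, sample size and right-hand side are placeholders.

```python
import torch

torch.manual_seed(0)

class NTKNet(torch.nn.Module):
    """Network (1): z^{l+1} = W^l sigma(z^l) / sqrt(m_l), all weights ~ N(0, 1)."""
    def __init__(self, widths=(1, 256, 256, 1), activation=torch.tanh):
        super().__init__()
        self.weights = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.randn(m_out, m_in))
             for m_in, m_out in zip(widths[:-1], widths[1:])]
        )
        self.activation = activation

    def forward(self, x):
        z = x @ self.weights[0].T                                # z^1(x) = W^0 x
        for W in self.weights[1:]:
            z = self.activation(z) @ W.T / W.shape[1] ** 0.5     # z^{l+1}(x)
        return z

def physics_informed_loss(net, x, y):
    """Loss (2) with T u = -u'' obtained by nested automatic differentiation (d = 1)."""
    x = x.clone().requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return 0.5 * ((-d2u - y) ** 2).mean()

net = NTKNet()
x = torch.rand(100, 1)                  # samples x_i drawn from [0, 1]
y = torch.sin(2 * torch.pi * x)         # a placeholder right-hand side f(x_i)
print(physics_informed_loss(net, x, y))
```

Higher-order or mixed operators $\mathcal{T}$ can be handled in the same way by nesting further `torch.autograd.grad` calls with `create_graph=True`.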

2.3 Training Dynamics

When training the neural network with loss (2), the gradient flow is given by

\dot{\theta}=-\nabla_{\theta}\mathcal{L}(u;\mathcal{T})=-\frac{1}{n}\sum_{i=1}^{n}\nabla_{\theta}\mathcal{T}u(x_{i};\theta)\,(\mathcal{T}u(x_{i};\theta)-y_{i}).

Assuming that $u$ is sufficiently smooth and denoting $v=\mathcal{T}u$, we have

\begin{aligned}\dot{v}&=\mathcal{T}\dot{u}=\mathcal{T}(\nabla_{\theta}u)^{T}\dot{\theta}=-[\nabla_{\theta}(\mathcal{T}u)]^{T}\nabla_{\theta}\mathcal{L}(u;\mathcal{T})\\ &=-\frac{1}{n}\sum_{i=1}^{n}[\nabla_{\theta}v]^{T}[\nabla_{\theta}v(x_{i};\theta)][v(x_{i};\theta)-y_{i}].\end{aligned}

Define the time-varying neural network kernel (NNK)

K_{\mathcal{T},\theta}(x,x^{\prime})=\left\langle{\nabla_{\theta}(\mathcal{T}u)(x;\theta),\ \nabla_{\theta}(\mathcal{T}u)(x^{\prime};\theta)}\right\rangle=\left\langle{\nabla_{\theta}v(x;\theta),\ \nabla_{\theta}v(x^{\prime};\theta)}\right\rangle. \qquad (3)

Then the gradient flow of $v$ is just

\dot{v}=-\frac{1}{n}K_{\mathcal{T},\theta}(x,X)(v(X;\theta)-Y). \qquad (4)
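As an aside, the empirical kernel (3) can be evaluated directly with automatic differentiation. The sketch below assumes a `model` with parameters $\theta$ and a routine `Tu(model, x)` returning the scalar $\mathcal{T}u(x;\theta)$, built with `create_graph=True` as in the earlier sketch; the names are illustrative, not part of our method.

```python
import torch

def empirical_nnk(Tu, model, x1, x2):
    """Empirical kernel (3): <grad_theta Tu(x1), grad_theta Tu(x2)>."""
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad(x):
        # Tu must build any x-derivatives with create_graph=True so that the
        # result still depends differentiably on the parameters theta.
        out = Tu(model, x)
        grads = torch.autograd.grad(out, params)
        return torch.cat([g.reshape(-1) for g in grads])

    return torch.dot(flat_grad(x1), flat_grad(x2))
```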

NTK theory suggests that this training dynamic of $v$ is very similar to that of kernel regression when the neural network is wide enough. Moreover, $K_{\mathcal{T},\theta}(x,x^{\prime})$ is expected to converge to a time-invariant kernel $K_{\mathcal{T}}^{\mathrm{NT}}(x,x^{\prime})$ as the width $m$ tends to infinity. If this assertion is true, we can consider the approximate kernel gradient flow of $v^{\mathrm{NTK}}$ given by

\dot{v}^{\mathrm{NTK}}(x)=-\frac{1}{n}K_{\mathcal{T}}^{\mathrm{NT}}(x,X)(v^{\mathrm{NTK}}(X)-Y), \qquad (5)

where the initialization $v^{\mathrm{NTK}}_{0}$ is not necessarily identically zero. This gradient flow can be solved explicitly by

v^{\mathrm{NTK}}(x)=v^{\mathrm{NTK}}_{0}(x)+\frac{1}{n}K_{\mathcal{T}}^{\mathrm{NT}}(x,X)\,\varphi^{\mathrm{GF}}_{t}\!\left(\frac{1}{n}K_{\mathcal{T}}^{\mathrm{NT}}(X,X)\right)(Y-v^{\mathrm{NTK}}_{0}(X)),

where $\varphi^{\mathrm{GF}}_{t}(z)\coloneqq(1-e^{-tz})/z$.
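For reference, this closed-form flow can be evaluated numerically through an eigendecomposition of the kernel matrix; the sketch below assumes the Gram matrix $K_{\mathcal{T}}^{\mathrm{NT}}(X,X)$ and the cross kernel $K_{\mathcal{T}}^{\mathrm{NT}}(x,X)$ have already been assembled (variable names and the numerical tolerance are ours).

```python
import numpy as np

def phi_gf(z, t):
    """phi_t^GF(z) = (1 - exp(-t z)) / z, with limit value t as z -> 0."""
    z = np.asarray(z, dtype=float)
    return np.where(z > 1e-12, (1.0 - np.exp(-t * z)) / np.maximum(z, 1e-12), t)

def ntk_flow(K_xX, K_XX, Y, v0_x, v0_X, t):
    """v_t(x) = v_0(x) + (1/n) K(x,X) phi_t^GF((1/n) K(X,X)) (Y - v_0(X))."""
    n = K_XX.shape[0]
    lam, U = np.linalg.eigh(K_XX / n)          # spectral decomposition of (1/n) K(X, X)
    phi = U @ np.diag(phi_gf(lam, t)) @ U.T    # phi_t^GF applied spectrally
    return v0_x + (K_xX / n) @ phi @ (Y - v0_X)
```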

2.4 NTK Theory

When $\mathcal{T}$ in (2) is the identity map (denoted by $Id$), the training dynamic (4) has been widely studied with the NTK theory [18]. This theory describes the evolution of neural networks during training in the infinite-width limit, providing insight into their convergence and generalization properties. It shows that the training dynamics of neural networks under gradient descent can be approximated by a kernel method defined by the inner product of the network's gradients. This theory has spurred extensive research, including studies on the convergence of neural networks with kernel dynamics [4, 2, 9, 24, 1], the properties of the NTK [13, 5, 26], and its statistical performance [2, 17, 23]. By bridging the empirical behavior of neural networks with their theoretical foundations, the NTK provides a framework for understanding gradient descent dynamics in function space. In this section, we review some existing conclusions that are significant in deriving our results.

The random feature and neural tangent kernel

For the finite-width neural network (1), let us define the random feature kernel $K^{\mathrm{RF},m}_{l}$ and the neural tangent kernel $K^{\mathrm{NT},\theta}_{l}$ for $l=1,\dots,L+1$ by

\begin{aligned} K^{\mathrm{RF},m}_{l}(x,x^{\prime})&=\mathrm{Cov}\left(z^{l}(x),z^{l}(x^{\prime})\right),\\ K^{\mathrm{NT},\theta}_{l,ij}(x,x^{\prime})&=\left\langle{\nabla_{\theta}z^{l}_{i}(x),\nabla_{\theta}z^{l}_{j}(x^{\prime})}\right\rangle,\quad\text{for }i,j=1,\dots,m_{l}.\end{aligned}

Note here that $K^{\mathrm{RF},m}_{l}$ is deterministic and $K^{\mathrm{NT},\theta}_{l}$ is random. Since $m_{L+1}=1$, we denote $K^{\mathrm{NT},\theta}_{L+1}(x,x^{\prime})=K^{\mathrm{NT},\theta}_{L+1,11}(x,x^{\prime})$.

Moreover, for the kernels associated with the infinite-width limit of the neural network, let us define

K^{\mathrm{RF}}_{1}(x,x^{\prime})=K^{\mathrm{NT}}_{1}(x,x^{\prime})=\left\langle{x,x^{\prime}}\right\rangle

and the recurrence formula for $l=2,\dots,L+1$,

\begin{aligned} K^{\mathrm{RF}}_{l}(x,x^{\prime})&=\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{B}_{l-1}(x,x^{\prime}))}\left[\sigma(u)\sigma(v)\right],\\ K^{\mathrm{NT}}_{l}(x,x^{\prime})&=K^{\mathrm{RF}}_{l}(x,x^{\prime})+K^{\mathrm{NT}}_{l-1}(x,x^{\prime})\,\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{B}_{l-1}(x,x^{\prime}))}\left[\sigma^{(1)}(u)\sigma^{(1)}(v)\right],\end{aligned} \qquad (6)

where the matrix $\bm{B}_{l-1}(x,x^{\prime})\in\mathbb{R}^{2\times 2}$ is defined as

\bm{B}_{l-1}(x,x^{\prime})=\begin{pmatrix}K^{\mathrm{RF}}_{l-1}(x,x)&K^{\mathrm{RF}}_{l-1}(x,x^{\prime})\\ K^{\mathrm{RF}}_{l-1}(x,x^{\prime})&K^{\mathrm{RF}}_{l-1}(x^{\prime},x^{\prime})\end{pmatrix}.
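The recurrence (6) can also be evaluated numerically. The following is a rough sketch that approximates the Gaussian expectations by Monte Carlo for a user-supplied activation (for specific activations such as ReLU or erf, closed-form expressions are known and would be preferable); function and variable names are ours.

```python
import numpy as np

def limit_kernels(x, xp, depth, sigma, dsigma, n_mc=200_000, seed=0):
    """Monte Carlo evaluation of the recurrence (6) for K^RF_l and K^NT_l at (x, x')."""
    rng = np.random.default_rng(seed)
    # layer 1: K^RF_1 = K^NT_1 = <x, x'>
    kxx, kxxp, kxpxp = np.dot(x, x), np.dot(x, xp), np.dot(xp, xp)
    k_nt = kxxp
    for _ in range(2, depth + 1):
        B = np.array([[kxx, kxxp], [kxxp, kxpxp]])              # B_{l-1}(x, x')
        u, v = rng.multivariate_normal(np.zeros(2), B, size=n_mc).T
        new_kxxp = np.mean(sigma(u) * sigma(v))                 # K^RF_l(x, x')
        k_nt = new_kxxp + k_nt * np.mean(dsigma(u) * dsigma(v)) # K^NT_l(x, x')
        kxx = np.mean(sigma(rng.normal(0.0, np.sqrt(kxx), n_mc)) ** 2)     # K^RF_l(x, x)
        kxpxp = np.mean(sigma(rng.normal(0.0, np.sqrt(kxpxp), n_mc)) ** 2)
        kxxp = new_kxxp
    return kxxp, k_nt      # K^RF_depth(x, x'), K^NT_depth(x, x')

# Example: two unit inputs, tanh activation, kernels at level L + 1 = 3.
x, xp = np.array([1.0, 0.0]), np.array([0.6, 0.8])
print(limit_kernels(x, xp, depth=3, sigma=np.tanh, dsigma=lambda t: 1 - np.tanh(t) ** 2))
```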

The smoothness of the kernels $K_{l}^{\mathrm{RF}}$ and $K_{l}^{\mathrm{NT}}$ can be derived from the regularity of the activation function $\sigma$. The proof is presented in Section A.

Lemma 2.3.

Let $k\geq 1$ be an integer and let $\sigma$ satisfy Assumption 1 for $k$. Then the kernels $K_{l}^{\mathrm{RF}}$ and $K_{l}^{\mathrm{NT}}$ defined in (6) are of classes $C^{k\times k}$ and $C^{(k-1)\times(k-1)}$, respectively.

Convergence of NNK

The most basic and vital conclusion of NTK theory is that the NNK defined in (3) converges to a time-invariant kernel when the width of the neural network tends to infinity.

Lemma 2.4 (Proposition 34 in [26]).

Consider the NNK $K_{\mathcal{T},\theta}$ defined in (3), where $\mathcal{T}\equiv Id$ is the identity map. Let $\delta\in(0,1)$. Under proper assumptions on the parameters $\theta_{t}$, there exist constants $C_{1}>0$ and $C_{2}\geq 1$ such that, with probability at least $1-\delta$,

\sup_{t\geq 0}\absolutevalue{K_{Id,\theta_{t}}(z,z^{\prime})-K_{Id}^{\mathrm{NT}}(x,x^{\prime})}=O\left(m^{-\frac{1}{12}}\sqrt{\ln m}\right)

when $m\geq C_{1}\left(\ln\left(C_{2}/\delta\right)\right)^{5}$ and $\norm{z-x}_{2},\norm{z^{\prime}-x^{\prime}}_{2}\leq O(1/m)$.

Convergence at initialization

As the width tends to infinity, it has been demonstrated that the neural network converges to a Gaussian process at initialization.

Lemma 2.5 ([15]).

Fix a compact set $T\subseteq\mathbb{R}^{n_{0}}$. As the hidden layer width $m$ tends to infinity, the sequence of stochastic processes $x\mapsto u^{\mathrm{NN}}(x;\theta)$ converges weakly in $C^{0}(T)$ to a centered Gaussian process with covariance function $K^{\mathrm{RF}}_{L}$.

3 Main Results

In this section, we present our main results. We first establish a general theorem concerning the convergence of NTKs at initialization. A sufficient condition for convergence in training is proposed and validated in some simple cases. Finally, leveraging the aforementioned results, we examine the impact of differential operators within the loss function on the spectral properties of the NTK.

3.1 Convergence at initialization

Let $S$ be a metric space and $X,X^{n},~n\geq 1$ be random variables taking values in $S$. We recall that $X^{n}$ converges weakly to $X$ in $S$, denoted by $X^{n}\xrightarrow{w}X$, iff

\mathbb{E}f(X^{n})\to\mathbb{E}f(X)\quad\text{for any continuous bounded function }f:S\to\mathbb{R}.

The first result shows that if the activation function has higher regularity, we can generalize Lemma 2.5 to weak convergence in $C^{k}$.

Theorem 3.1 (Convergence of initial function).

Let $T\subset\mathcal{X}$ be a compact set, let $k\geq 0$ be an integer and let $\sigma$ satisfy Assumption 1 for $k$. For any $\alpha$ satisfying $|\alpha|\leq k$ and any fixed $l=2,\dots,L+1$ with $m_{l}$ fixed, as $m_{1},\dots,m_{l-1}\to\infty$, the random process

x\in\mathcal{X}\mapsto D^{\alpha}z^{l}(x)\in\mathbb{R}^{m_{l}}

converges weakly in $C^{0}(T;\mathbb{R}^{m_{l}})$ to a Gaussian process in $\mathbb{R}^{m_{l}}$ whose components are i.i.d. with mean zero and covariance kernel $D_{x}^{\alpha}D_{x^{\prime}}^{\alpha}K^{\mathrm{RF}}_{l}(x,x^{\prime})$.

Proof.

With Lemma B.1, Lemma B.5 and Proposition 2.1 in [15], we obtain the conclusion. ∎

With the aid of Theorem 3.1, the uniform convergence of NNK at initialization is demonstrated as follows.

Theorem 3.2.

Let $T\subset\mathcal{X}$ be a convex compact set, let $k\geq 0$ be an integer and let $\sigma$ satisfy Assumption 1 for $k+2$. For any $\alpha,\beta$ satisfying $|\alpha|,|\beta|\leq k$ and any fixed $l=2,\dots,L+1$ with $m_{l}$ fixed, as $m_{1},\dots,m_{l-1}\to\infty$ sequentially, we have

D_{x}^{\alpha}D_{x^{\prime}}^{\beta}K^{\mathrm{NT},\theta}_{l,ij}(x,x^{\prime})\xrightarrow{p}\delta_{ij}D_{x}^{\alpha}D_{x^{\prime}}^{\beta}K^{\mathrm{NT}}_{l}(x,x^{\prime})\quad\text{for }i,j=1,\dots,m_{l} \qquad (7)

under $C^{0}(T\times T;\mathbb{R})$.

Proof.

The proof is completed by induction. When l=1l=1, for any i,j=1,,m1i,j=1,\dots,m_{1},

K1,ijNT,θ(x,x)\displaystyle\quad K_{1,ij}^{\mathrm{NT},\theta}(x,x^{\prime}) =θzi1(x),θzj1(x)\displaystyle=\left\langle{\nabla_{\theta}z_{i}^{1}(x),\nabla_{\theta}z_{j}^{1}(x^{\prime})}\right\rangle
=θWi0x,θWj0x\displaystyle=\left\langle{\nabla_{\theta}W_{i}^{0}x,\nabla_{\theta}W_{j}^{0}x^{\prime}}\right\rangle
=δijx,x\displaystyle=\delta_{ij}\left\langle{x,x^{\prime}}\right\rangle
=δijK1NT(x,x).\displaystyle=\delta_{ij}K_{1}^{\mathrm{NT}}(x,x^{\prime}).

Suppose that as m1,,ml1m_{1},\dots,m_{l-1}\to\infty, we have

Kl,ijNT,θ(x,x)𝑝δijKlNT(x,x) under Ck×k(T×T;) for i,j=1,,ml.K^{\mathrm{NT},\theta}_{l,ij}(x,x^{\prime})\xrightarrow{p}\delta_{ij}K^{\mathrm{NT}}_{l}(x,x^{\prime})\mbox{\quad under\quad}C^{k\times k}(T\times T;\mathbb{R})\mbox{\quad for\quad}i,j=1,\dots,m_{l}.

For any i,j=1,,ml+1i,j=1,\dots,m_{l+1},

Kl+1,ijNT,θ(x,x)\displaystyle\quad K_{l+1,ij}^{\mathrm{NT},\theta}(x,x^{\prime}) (8)
=θzil+1(x),θzjl+1(x)\displaystyle=\left\langle{\nabla_{\theta}z_{i}^{l+1}(x),\nabla_{\theta}z_{j}^{l+1}(x^{\prime})}\right\rangle
=δij1mlq=1mlσ(zql(x))σ(zql(x))\displaystyle=\delta_{ij}\frac{1}{m_{l}}\sum_{q=1}^{m_{l}}\sigma(z_{q}^{l}(x))\sigma(z_{q}^{l}(x^{\prime}))
+1mlq1=1mlq2=1mlWiq1lWjq2lσ(1)(zq1l(x))σ(1)(zq2l(x))Kl,q1q2NT,θ(x,x).\displaystyle+\frac{1}{m_{l}}\sum_{q_{1}=1}^{m_{l}}\sum_{q_{2}=1}^{m_{l}}W_{iq_{1}}^{l}W_{jq_{2}}^{l}\sigma^{(1)}(z_{q_{1}}^{l}(x))\sigma^{(1)}(z_{q_{2}}^{l}(x^{\prime}))K_{l,q_{1}q_{2}}^{\mathrm{NT},\theta}(x,x^{\prime}).

With the induction hypothesis and Theorem 3.1, for any fixed mlm_{l}, as m1,,ml1m_{1},\dots,m_{l-1}\to\infty sequentially,

Kl+1,ijNT,θ(x,x)\displaystyle K_{l+1,ij}^{\mathrm{NT},\theta}(x,x^{\prime}) 𝑤δijmlq=1mlσ(Gql(x))σ(Gql(x))\displaystyle\xrightarrow{w}\frac{\delta_{ij}}{m_{l}}\sum_{q=1}^{m_{l}}\sigma(G_{q}^{l}(x))\sigma(G_{q}^{l}(x^{\prime}))
+KlNT(x,x)mlq=1mlWiqlWjqlσ(1)(Gql(x))σ(1)(Gql(x))\displaystyle+\frac{K^{\mathrm{NT}}_{l}(x,x^{\prime})}{m_{l}}\sum_{q=1}^{m_{l}}W_{iq}^{l}W_{jq}^{l}\sigma^{(1)}(G_{q}^{l}(x))\sigma^{(1)}(G_{q}^{l}(x^{\prime}))

under $C^{0}(T\times T;\mathbb{R})$, where $G^{l}$ is a Gaussian process in $\mathbb{R}^{m_{l}}$ whose components are i.i.d. with mean zero and covariance kernel $K^{\mathrm{RF}}_{l}$. With the weak law of large numbers, we obtain the finite-dimensional convergence

(Kl+1,ijNT,θ(xα,xα))αA𝑝(δijKl+1NT(xα,xα))αA\left(K_{l+1,ij}^{\mathrm{NT},\theta}(x_{\alpha},x_{\alpha}^{\prime})\right)_{\alpha\in A}\xrightarrow{p}\left(\delta_{ij}K^{\mathrm{NT}}_{l+1}(x_{\alpha},x_{\alpha}^{\prime})\right)_{\alpha\in A}

for any finite set {(xα,xα)T×T|αA}\{(x_{\alpha},x_{\alpha}^{\prime})\in T\times T\big{|}\alpha\in A\}. What we still need to prove is that for any δ>0\delta>0,

Kl+1,ijNT,θCk×k,1(T×T;)C\norm{K_{l+1,ij}^{\mathrm{NT},\theta}}_{C^{k\times k,1}(T\times T;\mathbb{R})}\leq C

for some $C$ not depending on $m_{1},m_{2},\dots,m_{l}$ with probability at least $1-\delta$. Suppose that this control holds. Then, with the finite-dimensional convergence and Lemma B.2, we obtain the conclusion (7). Note that $T$ is convex. We have

Kl+1,ijNT,θCk×k,1(T×T;)Kl+1,ijNT,θC(k+1)×(k+1)(T×T;).\norm{K_{l+1,ij}^{\mathrm{NT},\theta}}_{C^{k\times k,1}(T\times T;\mathbb{R})}\leq\norm{K_{l+1,ij}^{\mathrm{NT},\theta}}_{C^{(k+1)\times(k+1)}(T\times T;\mathbb{R})}.

With the basic inequality ab12(a2+b2)ab\leq\frac{1}{2}(a^{2}+b^{2}), assumption for σ\sigma and Proposition B.7, we have

𝔼[supx,xT|Dασ(G1l(x))Dβσ(G1l(x))|]<,\displaystyle\mathbb{E}\left[\sup_{x,x^{\prime}\in T}\absolutevalue{D^{\alpha}\sigma(G_{1}^{l}(x))D^{\beta}\sigma(G_{1}^{l}(x^{\prime}))}\right]<\infty,
𝔼[supx,xT|W11l2DxαDxβ{σ(1)(G1l(x))σ(1)(G1l(x))KlNT(x,x)}|]<\displaystyle\mathbb{E}\left[\sup_{x,x^{\prime}\in T}\absolutevalue{{W_{11}^{l}}^{2}D^{\alpha}_{x}D^{\beta}_{x^{\prime}}\left\{\sigma^{(1)}(G_{1}^{l}(x))\sigma^{(1)}(G_{1}^{l}(x^{\prime}))K_{l}^{\mathrm{NT}}(x,x^{\prime})\right\}}\right]<\infty

and

𝔼[supx,xT|DxαDxβ{σ(1)(G1l(x))σ(1)(G1l(x))KlNT(x,x)}|2]<\displaystyle\mathbb{E}\left[\sup_{x,x^{\prime}\in T}\absolutevalue{D^{\alpha}_{x}D^{\beta}_{x^{\prime}}\left\{\sigma^{(1)}(G_{1}^{l}(x))\sigma^{(1)}(G_{1}^{l}(x^{\prime}))K_{l}^{\mathrm{NT}}(x,x^{\prime})\right\}}^{2}\right]<\infty

for any α,β\alpha,\beta satisfying |α|,|β|k+1|\alpha|,|\beta|\leq k+1. With the induction hypothesis and Theorem 3.1, for any M>0M>0,

limm1,,ml1supml𝔼[supx,xT|δijmlq=1mlDασ(zql(x))Dβσ(zql(x))|M]\displaystyle\quad\lim_{m_{1},\dots,m_{l-1}\rightarrow\infty}\sup_{m_{l}}\mathbb{E}\left[\sup_{x,x^{\prime}\in T}\absolutevalue{\frac{\delta_{ij}}{m_{l}}\sum_{q=1}^{m_{l}}D^{\alpha}\sigma(z_{q}^{l}(x))D^{\beta}\sigma(z_{q}^{l}(x^{\prime}))}\land M\right]
limm1,,ml1𝔼[supx,xT|Dασ(z1l(x))Dβσ(z1l(x))|M]\displaystyle\leq\lim_{m_{1},\dots,m_{l-1}\rightarrow\infty}\mathbb{E}\left[\sup_{x,x^{\prime}\in T}\absolutevalue{D^{\alpha}\sigma(z_{1}^{l}(x))D^{\beta}\sigma(z_{1}^{l}(x^{\prime}))}\land M\right]
=𝔼[supx,xT|Dασ(G1l(x))Dβσ(G1l(x))|M]\displaystyle=\mathbb{E}\left[\sup_{x,x^{\prime}\in T}\absolutevalue{D^{\alpha}\sigma(G_{1}^{l}(x))D^{\beta}\sigma(G_{1}^{l}(x^{\prime}))}\land M\right]
𝔼[supx,xT|Dασ(G1l(x))Dβσ(G1l(x))|].\displaystyle\leq\mathbb{E}\left[\sup_{x,x^{\prime}\in T}\absolutevalue{D^{\alpha}\sigma(G_{1}^{l}(x))D^{\beta}\sigma(G_{1}^{l}(x^{\prime}))}\right].

Hence, there exists a constant C1C_{1} not depending on m1,,mlm_{1},\dots,m_{l} and MM such that

supm1,,ml𝔼[supx,xT|δijmlq=1mlDασ(zql(x))Dβσ(zql(x))|M]C1.\sup_{m_{1},\dots,m_{l}}\mathbb{E}\left[\sup_{x,x^{\prime}\in T}\absolutevalue{\frac{\delta_{ij}}{m_{l}}\sum_{q=1}^{m_{l}}D^{\alpha}\sigma(z_{q}^{l}(x))D^{\beta}\sigma(z_{q}^{l}(x^{\prime}))}\land M\right]\leq C_{1}.

and

(supx,xT|δijmlq=1mlDασ(zql(x))Dβσ(zql(x))|>M)\displaystyle\quad\mathbb{P}\left(\sup_{x,x^{\prime}\in T}\absolutevalue{\frac{\delta_{ij}}{m_{l}}\sum_{q=1}^{m_{l}}D^{\alpha}\sigma(z_{q}^{l}(x))D^{\beta}\sigma(z_{q}^{l}(x^{\prime}))}>M\right)
1M𝔼[supx,xT|δijmlq=1mlDασ(zql(x))Dβσ(zql(x))|M]\displaystyle\leq\frac{1}{M}\mathbb{E}\left[\sup_{x,x^{\prime}\in T}\absolutevalue{\frac{\delta_{ij}}{m_{l}}\sum_{q=1}^{m_{l}}D^{\alpha}\sigma(z_{q}^{l}(x))D^{\beta}\sigma(z_{q}^{l}(x^{\prime}))}\land M\right]
C1M.\displaystyle\leq\frac{C_{1}}{M}.

For the second term on the right of (8), we do the decomposition,

1mlq1=1mlq2=1mlWiq1lWjq2lσ(1)(zq1l(x))σ(1)(zq2l(x))Kl,q1q2NT,θ(x,x)\displaystyle\quad\frac{1}{m_{l}}\sum_{q_{1}=1}^{m_{l}}\sum_{q_{2}=1}^{m_{l}}W_{iq_{1}}^{l}W_{jq_{2}}^{l}\sigma^{(1)}(z_{q_{1}}^{l}(x))\sigma^{(1)}(z_{q_{2}}^{l}(x^{\prime}))K_{l,q_{1}q_{2}}^{\mathrm{NT},\theta}(x,x^{\prime})
=1mlq=1mlWiqlWjqlσ(1)(zql(x))σ(1)(zql(x))Kl,qqNT,θ(x,x)\displaystyle=\frac{1}{m_{l}}\sum_{q=1}^{m_{l}}W_{iq}^{l}W_{jq}^{l}\sigma^{(1)}(z_{q}^{l}(x))\sigma^{(1)}(z_{q}^{l}(x^{\prime}))K_{l,qq}^{\mathrm{NT},\theta}(x,x^{\prime})
+1mlq1q2Wiq1lWjq2lσ(1)(zq1l(x))σ(1)(zq2l(x))Kl,q1q2NT,θ(x,x).\displaystyle+\frac{1}{m_{l}}\sum_{q_{1}\neq q_{2}}W_{iq_{1}}^{l}W_{jq_{2}}^{l}\sigma^{(1)}(z_{q_{1}}^{l}(x))\sigma^{(1)}(z_{q_{2}}^{l}(x^{\prime}))K_{l,q_{1}q_{2}}^{\mathrm{NT},\theta}(x,x^{\prime}).

For the terms on the right, it can similarly be demonstrated that for any $\delta>0$, there exist constants $C_{2}$ and $C_{3}$ such that

(supx,xT|1mlq=1mlWiqlWjqlDxαDxβ{σ(1)(zql(x))σ(1)(zql(x))Kl,qqNT,θ(x,x)}|>C2)δ\mathbb{P}\left(\sup_{x,x^{\prime}\in T}\absolutevalue{\frac{1}{m_{l}}\sum_{q=1}^{m_{l}}W_{iq}^{l}W_{jq}^{l}D_{x}^{\alpha}D_{x^{\prime}}^{\beta}\left\{\sigma^{(1)}(z_{q}^{l}(x))\sigma^{(1)}(z_{q}^{l}(x^{\prime}))K_{l,qq}^{\mathrm{NT},\theta}(x,x^{\prime})\right\}}>C_{2}\right)\leq\delta

and

(supx,xT{1ml2q1q2(DxαDxβ{σ(1)(zq1l(x))σ(1)(zq2l(x))Kl,q1q2NT,θ(x,x)})2}>C3)δ.\mathbb{P}\left(\sup_{x,x^{\prime}\in T}\left\{\frac{1}{m_{l}^{2}}\sum_{q_{1}\neq q_{2}}\left(D^{\alpha}_{x}D^{\beta}_{x^{\prime}}\left\{\sigma^{(1)}(z_{q_{1}}^{l}(x))\sigma^{(1)}(z_{q_{2}}^{l}(x^{\prime}))K_{l,q_{1}q_{2}}^{\mathrm{NT},\theta}(x,x^{\prime})\right\}\right)^{2}\right\}>C_{3}\right)\leq\delta. (9)

where $C_{2},C_{3}$ both do not depend on $m_{1},m_{2},\dots,m_{l}$. We define a map $F:T\times T\rightarrow\mathbb{R}^{m_{l}^{2}-m_{l}}$ whose components are given by

Fq1q2(x,x)=1mlσ(1)(zq1l(x))σ(1)(zq2l(x))Kl,q1q2NT,θ(x,x).F_{q_{1}q_{2}}(x,x^{\prime})=\frac{1}{m_{l}}\sigma^{(1)}(z_{q_{1}}^{l}(x))\sigma^{(1)}(z_{q_{2}}^{l}(x^{\prime}))K_{l,q_{1}q_{2}}^{\mathrm{NT},\theta}(x,x^{\prime}).

Then, with (9), for any $\delta>0$, there exists a constant $C_{4}$ such that

(FC(k+1)×(k+1)(T×T)C4)1δ2.\mathbb{P}\left(\norm{F}_{C^{(k+1)\times(k+1)}(T\times T)}\leq C_{4}\right)\geq 1-\frac{\delta}{2}.

Using Lemma B.4 for the map $\varphi:\mathbb{R}^{m_{l}^{2}-m_{l}}\rightarrow\mathbb{R}$, $\varphi(x)=\sum_{q_{1}\neq q_{2}}W_{iq_{1}}^{l}W_{jq_{2}}^{l}x_{q_{1}q_{2}}$, and Lemma 2.2 (the details are similar to the proof of Lemma B.5), there also exists a constant $C_{5}$ such that

(1mlq1q2Wiq1lWjq2lσ(1)(zq1l(x))σ(1)(zq2l(x))Kl,q1q2NT,θ(x,x)Ck×k,1(T×T)C5)\displaystyle\quad\mathbb{P}\left(\norm{\frac{1}{m_{l}}\sum_{q_{1}\neq q_{2}}W_{iq_{1}}^{l}W_{jq_{2}}^{l}\sigma^{(1)}(z_{q_{1}}^{l}(x))\sigma^{(1)}(z_{q_{2}}^{l}(x^{\prime}))K_{l,q_{1}q_{2}}^{\mathrm{NT},\theta}(x,x^{\prime})}_{C^{k\times k,1}(T\times T)}\geq C_{5}\right)
=(φF(x,x)Ck×k,1(T×T)C5)\displaystyle=\mathbb{P}\left(\norm{\varphi\circ F(x,x^{\prime})}_{C^{k\times k,1}(T\times T)}\geq C_{5}\right)
1δ.\displaystyle\geq 1-\delta.

There is no additional obstacle to generalizing Theorem 3.2 to general linear differential operators.

Proposition 3.3.

Let $T\subset\mathcal{X}$ be a convex compact set, let $k\geq 0$ be an integer and let $\sigma$ satisfy Assumption 1 for $k+2$. For any $\mathcal{T}=\sum_{r=1}^{p}a_{r}D^{\alpha_{r}}$ satisfying $|\alpha_{r}|\leq k$ and $a_{r}\in C^{0}(T)$, any fixed $l=2,\dots,L+1$ with $m_{l}$ fixed, as $m_{1},\dots,m_{l-1}\to\infty$ sequentially, we have

\mathcal{T}_{x}\mathcal{T}_{x^{\prime}}K^{\mathrm{NT},\theta}_{l,ij}(x,x^{\prime})\xrightarrow{p}\delta_{ij}\mathcal{T}_{x}\mathcal{T}_{x^{\prime}}K^{\mathrm{NT}}_{l}(x,x^{\prime})\quad\text{for }i,j=1,\dots,m_{l}

under $C^{0}(T\times T;\mathbb{R})$.

3.2 Convergence in training

In the following, we will use $\norm{v}_{2}$ for the 2-norm of a vector and $\norm{A}_{2}$ for the Frobenius norm of a matrix. Moreover, let us write $\lambda_{0}=\lambda_{\min}(K^{\mathrm{NT}}_{\mathcal{T}}(X,X))$ for short. We also define $\tilde{v}^{\mathrm{NTK}}_{t}(x)$ as the NTK dynamics (5) with initialization $\tilde{v}^{\mathrm{NTK}}_{0}(x)=u^{\mathrm{NN}}_{0}(x)$.

To control the training process, we first assume the following condition and then establish convergence during training. We will show in Lemma 3.8 that this condition is verified for the neural network with $l=1$ and $d=1$; the extension to general cases is straightforward but very cumbersome.

Condition 3.4 (Continuity of the gradient).

There exist a function $B(m):\mathbb{R}_{+}\to\mathbb{R}_{+}$ satisfying $B(m)\to\infty$ as $m\to\infty$ and a monotonically increasing function $\bar{\eta}(\varepsilon):\mathbb{R}_{+}\to\mathbb{R}_{+}$ such that for any $\varepsilon>0$, when

\norm{W^{l}-W^{l}(0)}_{2}\leq B(m)\bar{\eta}(\varepsilon),\quad\forall l=0,\dots,L, \qquad (10)

we have

\sup_{x\in\mathcal{X}}\norm{\nabla_{\theta}\mathcal{T}u^{\mathrm{NN}}(x;\theta)-\nabla_{\theta}\mathcal{T}u^{\mathrm{NN}}(x;\theta_{0})}_{2}\leq\varepsilon.

The following theorem shows the uniform approximation (over $x\in\mathcal{X}$) between the training dynamics of the neural network and the corresponding kernel regression under the physics-informed loss. The general proof idea follows the perturbation analysis in the NTK literature [4, 3, 1, 26], as long as we regard $\mathcal{T}u^{\mathrm{NN}}(x;\theta)$ as a whole.

Theorem 3.5 (NTK training dynamics).

Suppose that $\mathcal{T}$ is a differential operator up to order $k$, $\sigma$ satisfies Assumption 1 for $k+2$, $\lambda_{0}>0$ and Condition 3.4 holds. Then, it holds in probability w.r.t. the randomness of the initialization that

\lim_{m\to\infty}\sup_{x\in\mathcal{X}}\absolutevalue{K_{\mathcal{T},\theta}^{m}(x,x)-K_{\mathcal{T},\theta_{0}}^{m}(x,x)}=0 \qquad (11)

and thus

\lim_{m\to\infty}\sup_{t\geq 0}\sup_{x\in\mathcal{X}}\absolutevalue{v^{\mathrm{NN}}_{t}(x)-\tilde{v}^{\mathrm{NTK}}_{t}(x)}=0. \qquad (12)
Remark 3.6.

We remark here that the assumption $\lambda_{0}>0$ is not restrictive and is also a common assumption in the literature [2, 1]. If the kernel $K^{\mathrm{NT}}_{\mathcal{T}}$ is strictly positive definite, this assumption is satisfied with probability one when the data is drawn from a continuous distribution.

To prove Theorem 3.5, we first show that a perturbation bound of the weights can imply a bound on the kernel function and the empirical kernel matrix.

Proposition 3.7.

Let Condition 3.4 hold and let $\varepsilon>0$ be arbitrary. Then, when (10) holds, we have

\absolutevalue{K_{\mathcal{T},\theta}^{m}(x,x^{\prime})-K_{\mathcal{T},\theta_{0}}^{m}(x,x^{\prime})}=O(\varepsilon), \qquad (13)

and thus

\norm{K_{\mathcal{T},\theta}^{m}(X,X)-K_{\mathcal{T},\theta_{0}}^{m}(X,X)}_{\mathrm{op}}=O(n\varepsilon).
Proof.

We note that

K𝒯,θm(x,x)=θ𝒯uNN(x;θ),θ𝒯uNN(x;θ),\displaystyle K_{\mathcal{T},\theta}^{m}(x,x^{\prime})=\left\langle{\nabla_{\theta}\mathcal{T}u^{\mathrm{NN}}(x;\theta),\nabla_{\theta}\mathcal{T}u^{\mathrm{NN}}(x^{\prime};\theta)}\right\rangle,

so the result just follows from the fact that

|v,vw,w|\displaystyle\absolutevalue{\left\langle{v,v^{\prime}}\right\rangle-\left\langle{w,w^{\prime}}\right\rangle} =|vw,vw+w,vw+vw,w|\displaystyle=\absolutevalue{\left\langle{v-w,v^{\prime}-w^{\prime}}\right\rangle+\left\langle{w,v^{\prime}-w^{\prime}}\right\rangle+\left\langle{v-w,w^{\prime}}\right\rangle}
vwvw+wvw+vww,\displaystyle\leq\norm{v-w}\norm{v^{\prime}-w^{\prime}}+\norm{w}\norm{v^{\prime}-w^{\prime}}+\norm{v-w}\norm{w^{\prime}},

where we can substitute v=θ𝒯uNN(x;θ)v=\nabla_{\theta}\mathcal{T}u^{\mathrm{NN}}(x;\theta), v=θ𝒯uNN(x;θ)v^{\prime}=\nabla_{\theta}\mathcal{T}u^{\mathrm{NN}}(x^{\prime};\theta), w=θ𝒯uNN(x;θ0)w=\nabla_{\theta}\mathcal{T}u^{\mathrm{NN}}(x;\theta_{0}), w=θ𝒯uNN(x;θ0)w^{\prime}=\nabla_{\theta}\mathcal{T}u^{\mathrm{NN}}(x^{\prime};\theta_{0}) and apply the conditions. ∎

Proof of Theorem 3.5.

The proof resembles the perturbation analysis in [4] but with some modifications. Using Theorem 3.1 and Theorem 3.2, there are constants L1,L2L_{1},L_{2} such that the following holds at initialization with probability at least 1δ1-\delta when mm is large enough:

v(x;θ0)C0L1,\displaystyle\norm{v(x;\theta_{0})}_{C^{0}}\leq L_{1}, (14)
K𝒯,θ0m(X,X)K𝒯NT(X,X)opλ0/4\displaystyle\norm{K_{\mathcal{T},\theta_{0}}^{m}(X,X)-K^{\mathrm{NT}}_{\mathcal{T}}(X,X)}_{\mathrm{op}}\leq\lambda_{0}/4 (15)
supx𝒳θ𝒯uNN(x;θ0)2L2.\displaystyle\sup_{x\in\mathcal{X}}\norm{\nabla_{\theta}\mathcal{T}u^{\mathrm{NN}}(x;\theta_{0})}_{2}\leq L_{2}. (16)

Let $\varepsilon>0$ be arbitrary. Using (15) and Proposition 3.7, we can choose some $\eta>0$ such that when $\norm{W^{l}(t)-W^{l}(0)}\leq\eta B(m)$,

θ𝒯uNN(x;θt)θ𝒯uNN(x;θ0)21,K𝒯,θtm(X,X)K𝒯,θ0m(X,X)opλ0/4supx𝒳|K𝒯,θtm(x,x)K𝒯,θ0m(x,x)|ε.\displaystyle\begin{aligned} &\norm{\nabla_{\theta}\mathcal{T}u^{\mathrm{NN}}(x;\theta_{t})-\nabla_{\theta}\mathcal{T}u^{\mathrm{NN}}(x;\theta_{0})}_{2}\leq 1,\\ &\norm{K_{\mathcal{T},\theta_{t}}^{m}(X,X)-K_{\mathcal{T},\theta_{0}}^{m}(X,X)}_{\mathrm{op}}\leq\lambda_{0}/4\\ &\sup_{x\in\mathcal{X}}\absolutevalue{K_{\mathcal{T},\theta_{t}}^{m}(x,x)-K_{\mathcal{T},\theta_{0}}^{m}(x,x)}\leq\varepsilon.\end{aligned} (17)

Combining them with (14) and (15) , we have

θ𝒯uNN(x;θt)2C0,\displaystyle\norm{\nabla_{\theta}\mathcal{T}u^{\mathrm{NN}}(x;\theta_{t})}_{2}\leq C_{0}, (18)

for some absolute constant C0>0C_{0}>0, and

λmin(K𝒯,θtm(X,X))λ0/2.\displaystyle\lambda_{\min}(K_{\mathcal{T},\theta_{t}}^{m}(X,X))\geq\lambda_{0}/2. (19)

Now, we define

T0=inf{t0:Wl(t)Wl(0)ηB(m)for somel}.\displaystyle T_{0}=\inf\left\{t\geq 0:\norm{W^{l}(t)-W^{l}(0)}\geq\eta B(m)~{}\text{for some}~{}l\right\}.

Then, (18) and (19) hold when tT0t\leq T_{0}.

Using the gradient flow (4), we find that

v˙t(X)=1nK𝒯,θtm(X,X)(vt(X;θ)Y),\displaystyle\dot{v}_{t}(X)=-\frac{1}{n}K_{\mathcal{T},\theta_{t}}^{m}(X,X)(v_{t}(X;\theta)-Y),

so (19) implies that

vt(X;θ)Y2exp(14nλ0t)v0(X)Y2, for tT0.\displaystyle\norm{v_{t}(X;\theta)-Y}_{2}\leq\exp(-\frac{1}{4n}\lambda_{0}t)\norm{v_{0}(X)-Y}_{2},\mbox{\quad for\quad}t\leq T_{0}.

Furthermore, we recall the gradient flow equation for WlW^{l} that

WtlW0l=0t1ni=1n(𝒯ut(xi)yi)Wl𝒯ut(xi)dt,\displaystyle W^{l}_{t}-W^{l}_{0}=\int_{0}^{t}\frac{1}{n}\sum_{i=1}^{n}\left(\mathcal{T}u_{t}(x_{i})-y_{i}\right)\nabla_{W^{l}}\mathcal{T}u_{t}(x_{i})\differential t,

so when tT0t\leq T_{0}, for any ll, we have

WtlW0l2\displaystyle\norm{W^{l}_{t}-W^{l}_{0}}_{2} 1ni=1n0t|𝒯us(xi)yi|Wl𝒯us(xi)2dt\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{t}\absolutevalue{\mathcal{T}u_{s}(x_{i})-y_{i}}\norm{\nabla_{W^{l}}\mathcal{T}u_{s}(x_{i})}_{2}\differential t
1ni=1nsups[0,t]Wl𝒯us(xi)20t|𝒯us(xi)yi|dt\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sup_{s\in[0,t]}\norm{\nabla_{W^{l}}\mathcal{T}u_{s}(x_{i})}_{2}\int_{0}^{t}\absolutevalue{\mathcal{T}u_{s}(x_{i})-y_{i}}\differential t
4v0(X)Y2λ0i=1nsups[0,t]Wl𝒯us(xi)2\displaystyle\leq\frac{4\norm{v_{0}(X)-Y}_{2}}{\lambda_{0}}\sum_{i=1}^{n}\sup_{s\in[0,t]}\norm{\nabla_{W^{l}}\mathcal{T}u_{s}(x_{i})}_{2}
4nC0v0(X)Y2λ0\displaystyle\leq\frac{4nC_{0}\norm{v_{0}(X)-Y}_{2}}{\lambda_{0}}
4nC0(Y2+nL)λ0\displaystyle\leq\frac{4nC_{0}\left(\norm{Y}_{2}+\sqrt{n}L\right)}{\lambda_{0}}

where we used (18) in the last inequality. Now, as long as $m$ is large enough that

ηB(m)>4nC0(Y2+nL)λ0,\displaystyle\eta B(m)>\frac{4nC_{0}\left(\norm{Y}_{2}+\sqrt{n}L\right)}{\lambda_{0}},

an argument by contradiction shows that T0=T_{0}=\infty.

Now we have shown that Wl(t)Wl(0)ηB(m)\norm{W^{l}(t)-W^{l}(0)}\leq\eta B(m) holds for all t0t\geq 0, so the last inequality in (17) gives

supt0supx𝒳|K𝒯,θtm(x,x)K𝒯,θ0m(x,x)|ε.\displaystyle\sup_{t\geq 0}\sup_{x\in\mathcal{X}}\absolutevalue{K_{\mathcal{T},\theta_{t}}^{m}(x,x)-K_{\mathcal{T},\theta_{0}}^{m}(x,x)}\leq\varepsilon.

Therefore, a standard perturbation analysis comparing the ODEs (4) and (5) yields the conclusion; see, e.g., the proof of Lemma F.1 in [4]. ∎

In some simple cases, we can verify Condition 3.4 in a direct way.

Lemma 3.8.

Consider the neural network (1) with $l=1$ and $d=1$. Let $T\subset\mathcal{X}$ be a convex compact set, $k\geq 1$ be an integer and $\sigma$ satisfy Assumption 1 for $k+2$. Then, for any $\delta>0$, there exist functions $B(m)$ and $\eta(\varepsilon)$ as in Condition 3.4 such that, with probability at least $1-\delta$ over the initialization, we have

\sup_{x\in T}\norm{\nabla_{\theta}\frac{d^{k}u^{\mathrm{NN}}}{dx^{k}}(x;\theta)-\nabla_{\theta}\frac{d^{k}u^{\mathrm{NN}}}{dx^{k}}(x;\theta_{0})}_{2}\leq\varepsilon

for any $W^{0},W^{1}$ satisfying

\norm{W^{0}-W^{0}(0)}_{2},\norm{W^{1}-W^{1}(0)}_{2}\leq B(m)\eta(\varepsilon).
Proof.

In this case, the neural network is defined as

uNN(x;θ)z(x)=1mi=1mWi1σ(Wi0x).u^{\mathrm{NN}}(x;\theta)\coloneqq z(x)=\frac{1}{\sqrt{m}}\sum_{i=1}^{m}W_{i}^{1}\sigma\left(W_{i}^{0}x\right).

Hence,

z(k)(x)=1mi=1mWi1Wi0kσ(k)(Wi0x)z^{(k)}(x)=\frac{1}{\sqrt{m}}\sum_{i=1}^{m}W_{i}^{1}{W_{i}^{0}}^{k}\sigma^{(k)}\left(W_{i}^{0}x\right)

and

θz(k)(x;θ)θz(k)(x;θ0)22\displaystyle\quad\norm{\nabla_{\theta}z^{(k)}(x;\theta)-\nabla_{\theta}z^{(k)}(x;\theta_{0})}_{2}^{2} (20)
1mi=1m(Wi0kσ(k)(Wi0x)Wi0(0)kσ(k)(Wi0(0)x))2\displaystyle\leq\frac{1}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}^{k}\sigma^{(k)}\left(W_{i}^{0}x\right)-{W_{i}^{0}(0)}^{k}\sigma^{(k)}\left(W_{i}^{0}(0)x\right)\right)^{2}
+k2mi=1m(Wi1Wi0k1σ(k)(Wi0x)Wi1(0)Wi0(0)k1σ(k)(Wi0(0)x))2\displaystyle+\frac{k^{2}}{m}\sum_{i=1}^{m}\left(W_{i}^{1}{W_{i}^{0}}^{k-1}\sigma^{(k)}\left(W_{i}^{0}x\right)-W_{i}^{1}(0){W_{i}^{0}(0)}^{k-1}\sigma^{(k)}\left(W_{i}^{0}(0)x\right)\right)^{2}
+x2mi=1m(Wi1Wi0kσ(k+1)(Wi0x)Wi1(0)Wi0(0)kσ(k+1)(Wi0(0)x))2.\displaystyle+\frac{x^{2}}{m}\sum_{i=1}^{m}\left(W_{i}^{1}{W_{i}^{0}}^{k}\sigma^{(k+1)}\left(W_{i}^{0}x\right)-W_{i}^{1}(0){W_{i}^{0}(0)}^{k}\sigma^{(k+1)}\left(W_{i}^{0}(0)x\right)\right)^{2}.

We only need to demonstrate that for any $\delta>0$, there exist $B(m)$ and $\eta(\varepsilon)$ such that with probability at least $1-\delta$, we have

supx{1mi=1m(Wi0kσ(k)(Wi0x)Wi0(0)kσ(k)(Wi0(0)x))2}ε\sup_{x}\left\{\frac{1}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}^{k}\sigma^{(k)}\left(W_{i}^{0}x\right)-{W_{i}^{0}(0)}^{k}\sigma^{(k)}\left(W_{i}^{0}(0)x\right)\right)^{2}\right\}\leq\varepsilon

for any ε>0\varepsilon>0 and W0,W1W^{0},W^{1} satisfying

W0W0(0)2,W1W1(0)2B(m)η(ε).\norm{W^{0}-W^{0}(0)}_{2},\norm{W^{1}-W^{1}(0)}_{2}\leq B(m)\eta(\varepsilon).

For the last two terms to the right of (20), we can draw a similar conclusion using the same method since kk is a fixed integer and TT is a compact set. In fact, we first do the decomposition,

1mi=1m(Wi0kσ(k)(Wi0x)Wi0(0)kσ(k)(Wi0(0)x))2\displaystyle\quad\frac{1}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}^{k}\sigma^{(k)}\left(W_{i}^{0}x\right)-{W_{i}^{0}(0)}^{k}\sigma^{(k)}\left(W_{i}^{0}(0)x\right)\right)^{2}
3mi=1m(Wi0kWi0(0)k)2(σ(k)(Wi0x)σ(k)(Wi0(0)x))2\displaystyle\leq\frac{3}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}^{k}-{W_{i}^{0}}(0)^{k}\right)^{2}\left(\sigma^{(k)}\left(W_{i}^{0}x\right)-\sigma^{(k)}\left(W_{i}^{0}(0)x\right)\right)^{2}
+3mi=1m(Wi0kWi0(0)k)2σ(k)(Wi0(0)x)2\displaystyle+\frac{3}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}^{k}-{W_{i}^{0}}(0)^{k}\right)^{2}\sigma^{(k)}\left(W_{i}^{0}(0)x\right)^{2}
+3mi=1mWi0(0)2k(σ(k)(Wi0x)σ(k)(Wi0(0)x))2\displaystyle+\frac{3}{m}\sum_{i=1}^{m}{W_{i}^{0}}(0)^{2k}\left(\sigma^{(k)}\left(W_{i}^{0}x\right)-\sigma^{(k)}\left(W_{i}^{0}(0)x\right)\right)^{2}
3{1mi=1m(Wi0kWi0(0)k)41mi=1m(σ(k)(Wi0x)σ(k)(Wi0(0)x))4}12\displaystyle\leq 3\left\{\frac{1}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}^{k}-{W_{i}^{0}}(0)^{k}\right)^{4}\frac{1}{m}\sum_{i=1}^{m}\left(\sigma^{(k)}\left(W_{i}^{0}x\right)-\sigma^{(k)}\left(W_{i}^{0}(0)x\right)\right)^{4}\right\}^{\frac{1}{2}}
+3{1mi=1m(Wi0kWi0(0)k)41mi=1mσ(k)(Wi0(0)x)4}12\displaystyle+3\left\{\frac{1}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}^{k}-{W_{i}^{0}}(0)^{k}\right)^{4}\frac{1}{m}\sum_{i=1}^{m}\sigma^{(k)}\left(W_{i}^{0}(0)x\right)^{4}\right\}^{\frac{1}{2}}
+3{1mi=1mWi0(0)4k1mi=1m(σ(k)(Wi0x)σ(k)(Wi0(0)x))4}12\displaystyle+3\left\{\frac{1}{m}\sum_{i=1}^{m}{W_{i}^{0}}(0)^{4k}\frac{1}{m}\sum_{i=1}^{m}\left(\sigma^{(k)}\left(W_{i}^{0}x\right)-\sigma^{(k)}\left(W_{i}^{0}(0)x\right)\right)^{4}\right\}^{\frac{1}{2}}

Note that

1mi=1m(Wi0kWi0(0)k)4\displaystyle\quad\frac{1}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}^{k}-{W_{i}^{0}}(0)^{k}\right)^{4}
k4mi=1m(Wi0Wi0(0))4(|Wi0(0)|+|Wi0Wi0(0)|)4k4\displaystyle\leq\frac{k^{4}}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}-{W_{i}^{0}}(0)\right)^{4}(\absolutevalue{W_{i}^{0}(0)}+\absolutevalue{{W_{i}^{0}}-{W_{i}^{0}}(0)})^{4k-4}
C1(k)mi=1m(Wi0Wi0(0))4|Wi0(0)|4k4+C1(k)mi=1m(Wi0Wi0(0))4k\displaystyle\leq\frac{C_{1}(k)}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}-{W_{i}^{0}}(0)\right)^{4}\absolutevalue{W_{i}^{0}(0)}^{4k-4}+\frac{C_{1}(k)}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}-{W_{i}^{0}}(0)\right)^{4k}
C1(k){1mi=1m(Wi0Wi0(0))81mi=1m|Wi0(0)|8k8}12+C1(k)mi=1m(Wi0Wi0(0))4k\displaystyle\leq C_{1}(k)\left\{\frac{1}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}-{W_{i}^{0}}(0)\right)^{8}\frac{1}{m}\sum_{i=1}^{m}\absolutevalue{W_{i}^{0}(0)}^{8k-8}\right\}^{\frac{1}{2}}+\frac{C_{1}(k)}{m}\sum_{i=1}^{m}\left({W_{i}^{0}}-{W_{i}^{0}}(0)\right)^{4k}
C1(k){1mW0W0(0)281mi=1m|Wi0(0)|8k8}12+C1(k)mW0W0(0)24k.\displaystyle\leq C_{1}(k)\left\{\frac{1}{m}\norm{W^{0}-W^{0}(0)}_{2}^{8}\frac{1}{m}\sum_{i=1}^{m}\absolutevalue{W_{i}^{0}(0)}^{8k-8}\right\}^{\frac{1}{2}}+\frac{C_{1}(k)}{m}\norm{W^{0}-W^{0}(0)}_{2}^{4k}.

And with Assumption 1, denoting DsupxT|x|D\coloneqq\sup_{x\in T}|x|, we have

1mi=1m(σ(k)(Wi0x)σ(k)(Wi0(0)x))4\displaystyle\quad\frac{1}{m}\sum_{i=1}^{m}\left(\sigma^{(k)}\left(W_{i}^{0}x\right)-\sigma^{(k)}\left(W_{i}^{0}(0)x\right)\right)^{4}
3D4mi=1m(Wi0Wi0(0))4(1+Dlk+1(|Wi0Wi0(0)|+|Wi0(0)|)lk+1)4\displaystyle\leq\frac{3D^{4}}{m}\sum_{i=1}^{m}\left(W_{i}^{0}-W_{i}^{0}(0)\right)^{4}\left(1+D^{l_{k+1}}\left(\absolutevalue{W_{i}^{0}-W_{i}^{0}(0)}+\absolutevalue{W_{i}^{0}(0)}\right)^{l_{k+1}}\right)^{4}
C2(D,lk+1)mi=1m{|Wi0Wi0(0)|4+|Wi0Wi0(0)|4+4lk+1+(Wi0Wi0(0))4Wi0(0)4lk+1}\displaystyle\leq\frac{C_{2}(D,l_{k+1})}{m}\sum_{i=1}^{m}\left\{\absolutevalue{W_{i}^{0}-W_{i}^{0}(0)}^{4}+\absolutevalue{W_{i}^{0}-W_{i}^{0}(0)}^{4+4l_{k+1}}+\left(W_{i}^{0}-W_{i}^{0}(0)\right)^{4}W_{i}^{0}(0)^{4l_{k+1}}\right\}
C2(D,lk+1)m(W0W0(0)24+W0W0(0)24+4lk+1)\displaystyle\leq\frac{C_{2}(D,l_{k+1})}{m}\left(\norm{W^{0}-W^{0}(0)}_{2}^{4}+\norm{W^{0}-W^{0}(0)}_{2}^{4+4l_{k+1}}\right)
+C2(D,lk+1){1mW0W0(0)281mi=1m|Wi0(0)|8lk+1}12\displaystyle+C_{2}(D,l_{k+1})\left\{\frac{1}{m}\norm{W^{0}-W^{0}(0)}_{2}^{8}\frac{1}{m}\sum_{i=1}^{m}\absolutevalue{W_{i}^{0}(0)}^{8l_{k+1}}\right\}^{\frac{1}{2}}

With Assumption 1, the fact that $T$ is compact and the fact that all moments of $W_{i}^{0}$ are finite, we have, for any $m$,

𝔼[1mi=1mσ(k)(Wi0(0)x)4]=𝔼[σ(k)(W10(0)x)4]<\mathbb{E}\left[\frac{1}{m}\sum_{i=1}^{m}\sigma^{(k)}\left(W_{i}^{0}(0)x\right)^{4}\right]=\mathbb{E}\left[\sigma^{(k)}\left(W_{1}^{0}(0)x\right)^{4}\right]<\infty

and similar conclusions for any summation above not depending on W0W^{0}. Hence, for any δ>0\delta>0, there exists a constant M>0M>0 not depending on mm such that with probability at least 1δ1-\delta,

1mi=1mσ(k)(Wi0(0)x)4,1mi=1mWi0(0)4k,1mi=1mWi0(0)8k8,1mi=1m|Wi0(0)|8lk+1M.\frac{1}{m}\sum_{i=1}^{m}\sigma^{(k)}\left(W_{i}^{0}(0)x\right)^{4},\frac{1}{m}\sum_{i=1}^{m}{W_{i}^{0}}(0)^{4k},\frac{1}{m}\sum_{i=1}^{m}{W_{i}^{0}}(0)^{8k-8},\frac{1}{m}\sum_{i=1}^{m}\absolutevalue{{W_{i}^{0}}(0)}^{8l_{k+1}}\leq M.

Therefore, selecting B(m)=m1dkB(m)=m^{\frac{1}{d_{k}}} where dkmax{8,4k,4+4lk+1}d_{k}\coloneqq\max\left\{8,4k,4+4l_{k+1}\right\} and η(ε)\eta(\varepsilon) sufficiently small, we can obtain the conclusion. ∎

Corollary 3.9.

Consider the neural network (1) with $l=1$ and $d=1$. Let $T$ be a bounded closed interval in $\mathbb{R}$, let $\mathcal{T}$ be a differential operator up to order $k$, let $\sigma$ satisfy Assumption 1 for $k+2$ and let $\lambda_{0}>0$. Then, in probability with respect to the randomness of the initialization, we have

\lim_{m\to\infty}\sup_{t\geq 0}\sup_{x\in T}\absolutevalue{v^{\mathrm{NN}}_{t}(x)-\tilde{v}^{\mathrm{NTK}}_{t}(x)}=0.

3.3 Impact on the Spectrum

In the previous section, we demonstrated that the NTK related to the physics-informed loss (2) is $K_{\mathcal{T}}^{\mathrm{NT}}=\mathcal{T}_{x}\mathcal{T}_{x^{\prime}}K^{\mathrm{NT}}(x,x^{\prime})$, where $K^{\mathrm{NT}}\coloneqq K_{Id}^{\mathrm{NT}}$ is the NTK of the traditional $l_2$ loss. In this section, we present some analysis and numerical experiments to explore the impact of the differential operator $\mathcal{T}$ on the spectrum of the integral operator with kernel $K_{\mathcal{T}}^{\mathrm{NT}}$. Define $\mathcal{I}_{\mathcal{T}}:L^{2}(\mathcal{X})\rightarrow L^{2}(\mathcal{X})$ as the integral operator with kernel $K_{\mathcal{T}}^{\mathrm{NT}}$, which means that for all $f\in L^{2}(\mathcal{X})$,

\mathcal{I}_{\mathcal{T}}f(x)=\int_{\mathcal{X}}\mathcal{T}_{x}\mathcal{T}_{x^{\prime}}K^{\mathrm{NT}}(x,x^{\prime})f(x^{\prime})\,dx^{\prime}.

Since $\mathcal{I}_{\mathcal{T}}$ is compact and self-adjoint, it has a sequence of real eigenvalues $\{\mu_{j}\}_{j=1}^{\infty}$ tending to zero. In addition, we denote by $\{\lambda_{j}\}_{j=1}^{\infty}$ the eigenvalues of $\mathcal{I}_{Id}$. Then, the following lemma shows that for a large class of $\mathcal{T}$, the decay rate of $\{\mu_{j}\}_{j=1}^{\infty}$ is not faster than that of $\{\lambda_{j}\}_{j=1}^{\infty}$.

Lemma 3.10.

Suppose that $\mu_{j}>0$ for all $j\geq 1$. Let $\mathcal{T}|_{C_{0}^{\infty}(\mathcal{X})}$ be symmetric, i.e.

\left\langle{\mathcal{T}u,v}\right\rangle_{L^{2}(\mathcal{X})}=\left\langle{u,\mathcal{T}v}\right\rangle_{L^{2}(\mathcal{X})}

for all $u,v\in C_{0}^{\infty}(\mathcal{X})$, and satisfy

C_{\mathcal{T}}\coloneqq\sup_{v\neq 0,\,v\in C_{0}^{\infty}(\mathcal{X})}\frac{\norm{v}_{L^{2}(\mathcal{X})}}{\norm{\mathcal{T}v}_{L^{2}(\mathcal{X})}}<\infty.

Then,

\sup_{j}\frac{\lambda_{j}}{\mu_{j}}\leq C_{\mathcal{T}}^{2}.
Proof.

By the min-max characterization of the eigenvalues, for all $j=1,2,\dots$,

\mu_{j}=\min_{\dim V=j-1}\max_{v\in V^{\perp}}\frac{\left\langle{\mathcal{I}_{\mathcal{T}}v,v}\right\rangle}{\norm{v}^{2}}.

Note that $C_{0}^{\infty}(\mathcal{X})$ is dense in $L^{2}(\mathcal{X})$. Hence,

\begin{aligned}\mu_{j}&=\min_{\dim V=j-1}\sup_{v\in V^{\perp}\cap C_{0}^{\infty}(\mathcal{X})}\frac{\left\langle{\mathcal{I}_{\mathcal{T}}v,v}\right\rangle}{\norm{v}^{2}}\\ &=\min_{\dim V=j-1}\sup_{v\in V^{\perp}\cap C_{0}^{\infty}(\mathcal{X})}\frac{\left\langle{\mathcal{I}_{Id}\mathcal{T}v,\mathcal{T}v}\right\rangle}{\norm{v}^{2}}\\ &=:\sup_{v\in V_{j}^{\perp}\cap C_{0}^{\infty}(\mathcal{X})}\frac{\left\langle{\mathcal{I}_{Id}\mathcal{T}v,\mathcal{T}v}\right\rangle}{\norm{v}^{2}}.\end{aligned}

Moreover,

\begin{aligned}\lambda_{j}&=\min_{\dim V=j-1}\max_{v\in V^{\perp}}\frac{\left\langle{\mathcal{I}_{Id}v,v}\right\rangle}{\norm{v}^{2}}\\ &\leq\sup_{v\in(\mathcal{T}(V_{j}\cap C_{0}^{\infty}(\mathcal{X})))^{\perp}}\frac{\left\langle{\mathcal{I}_{Id}v,v}\right\rangle}{\norm{v}^{2}}\\ &=\sup_{v\in(V_{j}\cap C_{0}^{\infty}(\mathcal{X}))^{\perp}\cap C_{0}^{\infty}(\mathcal{X})}\frac{\left\langle{\mathcal{I}_{Id}\mathcal{T}v,\mathcal{T}v}\right\rangle}{\norm{\mathcal{T}v}^{2}}\\ &=\sup_{v\in V_{j}^{\perp}\cap C_{0}^{\infty}(\mathcal{X})}\frac{\left\langle{\mathcal{I}_{Id}\mathcal{T}v,\mathcal{T}v}\right\rangle}{\norm{\mathcal{T}v}^{2}}.\end{aligned}

Therefore,

\lambda_{j}\leq\sup_{v\in V_{j}^{\perp}\cap C_{0}^{\infty}(\mathcal{X})}\frac{\left\langle{\mathcal{I}_{Id}\mathcal{T}v,\mathcal{T}v}\right\rangle}{\norm{v}^{2}}\;\sup_{v\in V_{j}^{\perp}\cap C_{0}^{\infty}(\mathcal{X})}\frac{\norm{v}^{2}}{\norm{\mathcal{T}v}^{2}}\leq C_{\mathcal{T}}^{2}\mu_{j}. ∎

According to the Poincaré inequality, the gradient operator $\nabla$ fulfills the assumptions in Lemma 3.10. These assumptions also hold for a large class of elliptic operators, since their smallest eigenvalues are positive (see Section 6.5.1 in [10]).
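As a simple illustration (our own computation, with $\mathcal{X}=(0,1)$, $d=1$ and $\mathcal{T}=-\frac{d^{2}}{dx^{2}}$), expanding $v\in C_{0}^{\infty}((0,1))$ in the Dirichlet eigenbasis, $v(x)=\sum_{k\geq 1}a_{k}\sin(k\pi x)$, gives

\norm{\mathcal{T}v}_{L^{2}}^{2}=\frac{1}{2}\sum_{k\geq 1}(k\pi)^{4}a_{k}^{2}\geq\pi^{4}\cdot\frac{1}{2}\sum_{k\geq 1}a_{k}^{2}=\pi^{4}\norm{v}_{L^{2}}^{2},

so $C_{\mathcal{T}}\leq\pi^{-2}$ and Lemma 3.10 yields $\sup_{j}\lambda_{j}/\mu_{j}\leq\pi^{-4}$: the eigenvalues of the physics-informed NTK cannot decay faster than those of the standard NTK in this case.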

4 Experiments

In this section, we present experimental results to verify our theory. In all experiments, the data $\{x_{i}\}_{i=1}^{n}$ are sampled uniformly from $[0,1]^{d}$.

In Section 3.3, we show that the differential operator in the loss function does not make the eigenvalues of the integral operator associated with the NTK decay faster. A natural inquiry is whether this phenomenon persists for the NTK matrix $K_{\mathcal{T},\theta}(X,X)$, which is closer to the neural network training dynamics (4). We employ the network structure described in (1) with depth $l=1$ and width $m=1024$. All parameters are initialized as independent standard normal samples. Let $n=1000$. For $d=1$, we select $\mathcal{T}u(x)=u,\ \frac{\partial^{2}}{\partial x^{2}}u,\ u+\frac{\partial^{2}}{\partial x^{2}}u,\ \frac{\partial^{4}}{\partial x^{4}}u$. For $d=2$, we select $\mathcal{T}u(x,y)=u,\ \Delta u,\ u+\Delta u,\ \frac{\partial^{2}}{\partial x^{2}}u-\frac{\partial^{2}}{\partial y^{2}}u,\ \Delta^{2}u$. The activation function is Tanh or $\mathrm{ReLU}^{6}$, i.e. $\sigma(x)=\max\{0,x\}^{6}$. The eigenvalues of $K_{\mathcal{T},\theta}(X,X)$ at initialization are shown in Figure 1. Normalization is adopted to ensure that the largest eigenvalues are equal. A common phenomenon is that the higher the order of the differential operator $\mathcal{T}$, the slower the decay of the eigenvalues of $K_{\mathcal{T},\theta}(X,X)$, which aligns with our theoretical predictions.

Figure 1: Eigenvalues of $K_{\mathcal{T},\theta}(X,X)$ at initialization. (a) $d=1$, $\sigma(x)=\tanh(x)$; (b) $d=1$, $\sigma(x)=\mathrm{ReLU}(x)^{6}$; (c) $d=2$, $\sigma(x)=\tanh(x)$; (d) $d=2$, $\sigma(x)=\mathrm{ReLU}(x)^{6}$.

The influence of differential operators within the loss function on the actual training process is also of considerable interest. We consider approximating $\sin(2\pi ax)$ on $[0,1]$ for positive $a$ with three distinct loss functions,

1(u)=1ni=1n(D(xi)u(xi;θ)sin(2πaxi))2,\displaystyle\mathcal{L}_{1}(u)=\frac{1}{n}\sum_{i=1}^{n}\left(D(x_{i})u(x_{i};\theta)-\sin(2\pi ax_{i})\right)^{2},
2(u)=1ni=1n(Δ(D(xi)u(xi;θ))sin(2πaxi))2,\displaystyle\mathcal{L}_{2}(u)=\frac{1}{n}\sum_{i=1}^{n}\left(-\Delta\left(D(x_{i})u(x_{i};\theta)\right)-\sin(2\pi ax_{i})\right)^{2},
3(u;w)=1wni=1n(Δu(xi;θ)sin(2πaxi))2+w(u(0;θ)2+u(1;θ)2)\displaystyle\mathcal{L}_{3}(u;w)=\frac{1-w}{n}\sum_{i=1}^{n}\left(-\Delta u(x_{i};\theta)-\sin(2\pi ax_{i})\right)^{2}+w\left(u(0;\theta)^{2}+u(1;\theta)^{2}\right)

where $D(x)\coloneqq x(1-x)$ is a smooth distance function ensuring that $D(x)u(x;\theta)$ fulfills the homogeneous Dirichlet boundary condition [8], and $\mathcal{L}_{3}$ is the standard PINN loss with weight $w\in(0,1)$. In this task, we adopt a neural network setting that is more closely aligned with practice. The network architecture in (1) is still used, but with the addition of bias terms. The parameters are initialized with the default of nn.Linear() in PyTorch. Let $l=4$, $m=512$, $n=100$ and the activation function be Tanh. We employ the Adam algorithm to train the network with learning rate $10^{-5}$ and the normalized loss function $\mathcal{L}_{j}(u(x;\theta))/\mathcal{L}_{j}(u(x;\theta_{0}))$ for $j=1,2,3$. The training loss for different $a$ is shown in Figure 2. For small $a$ such as $a=0.5,1$, the $L^{2}$ loss $\mathcal{L}_{1}$ decays fastest at the beginning of training. As $a$ increases, the decay rate of $\mathcal{L}_{1}$ slows down due to the constraints imposed by the spectral bias. In comparison, the loss $\mathcal{L}_{2}$ is less affected. This corroborates that the additional differential operator in the loss function does not impose a stronger spectral bias on the neural network during training. The same behavior is also observed in standard PINNs with different weights $w$.
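To make the training setup concrete, here is a minimal PyTorch sketch (our illustration, not the authors' code) of the protocol above for $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ with the hard-constrained ansatz $D(x)u(x;\theta)$; the one-dimensional Laplacian is obtained by automatic differentiation, and the loss is reported normalized by its value at initialization. The number of steps and the printing interval are arbitrary choices.

import math
import torch

torch.manual_seed(0)
a, n, lr, steps = 3, 100, 1e-5, 2000
x = torch.rand(n, 1)                             # uniform samples from [0, 1]
rhs = torch.sin(2 * math.pi * a * x)             # right-hand side sin(2 pi a x)

# depth l = 4, width m = 512, Tanh activation, bias terms, default nn.Linear initialization
dims = [1] + [512] * 4 + [1]
layers = []
for i in range(len(dims) - 1):
    layers.append(torch.nn.Linear(dims[i], dims[i + 1]))
    if i < len(dims) - 2:
        layers.append(torch.nn.Tanh())
net = torch.nn.Sequential(*layers)

def loss1():
    # L_1: fit D(x) u(x; theta) directly to the target
    D = x * (1 - x)                              # smooth distance function D(x) = x(1 - x)
    return ((D * net(x) - rhs) ** 2).mean()

def loss2():
    # L_2: match -Laplacian of D(x) u(x; theta) to the target, Laplacian via autograd (1D)
    xr = x.clone().requires_grad_(True)
    Du = xr * (1 - xr) * net(xr)
    du = torch.autograd.grad(Du.sum(), xr, create_graph=True)[0]
    lap = torch.autograd.grad(du.sum(), xr, create_graph=True)[0]
    return ((-lap - torch.sin(2 * math.pi * a * xr)) ** 2).mean()

loss_fn = loss2                                  # switch to loss1 (or a weighted PINN loss) as needed
opt = torch.optim.Adam(net.parameters(), lr=lr)
loss_init = loss_fn().item()                     # value at initialization, used for normalization
for step in range(steps):
    opt.zero_grad()
    loss = loss_fn()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(step, loss.item() / loss_init)     # normalized loss L_j(theta) / L_j(theta_0)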

Figure 2: Normalized training losses $\mathcal{L}_{1}(u)$, $\mathcal{L}_{2}(u)$ and $\mathcal{L}_{3}(u)$ for different $a$ during training. Panels: (a) $a=0.5$; (b) $a=1$; (c) $a=3$; (d) $a=5$.

5 Conclusion

In this paper, we develop the NTK theory for deep neural networks with physics-informed loss. We not only clarify the convergence of the NTK at initialization and during training, but also reveal its explicit structure. Using this structure, we prove that, in most cases, the differential operators in the loss function do not cause the neural network to face a stronger spectral bias during training, and our experiments support this conclusion. Therefore, to improve the performance of PINNs from the perspective of spectral bias, it may be more beneficial to focus on the spectral bias caused by the different terms of the loss, as demonstrated in [36]. This does not mean that PINNs can fit high-frequency functions well: their training loss still decays more slowly when fitting a function with higher frequency, as shown in Figure 2. In instances where the solution exhibits pronounced high-frequency or multifrequency components, interventions are still needed to enhance the performance of PINNs [16, 20, 35]. It is also important to emphasize that spectral bias is merely one aspect of the limitations of PINNs; the physics-informed loss has other drawbacks, such as making the optimization problem more ill-conditioned [22]. Hence, adding higher-order differential operators to the loss function as a remedy for spectral bias is not recommended.

References

  • [1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization, June 2019.
  • [2] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.
  • [3] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International conference on machine learning, pages 322–332. PMLR, 2019.
  • [4] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [5] Alberto Bietti and Francis Bach. Deep equals shallow for ReLU networks in kernel regimes. arXiv preprint arXiv:2009.14397, 2020.
  • [6] Andrea Bonfanti, Giuseppe Bruno, and Cristina Cipriani. The challenges of the nonlinear regime for physics-informed neural networks. Advances in Neural Information Processing Systems, 37:41852–41881, 2025.
  • [7] Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198, 2019.
  • [8] Jiaxin Deng, Jinran Wu, Shaotong Zhang, Weide Li, You-Gan Wang, et al. Physical informed neural networks with soft and hard boundary constraints for solving advection-diffusion equations using Fourier expansions. Computers & Mathematics with Applications, 159:60–75, 2024.
  • [9] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, September 2018.
  • [10] Lawrence C Evans. Partial differential equations, volume 19. American Mathematical Society, 2022.
  • [11] Olga Fuks and Hamdi A Tchelepi. Limitations of physics informed machine learning for nonlinear two-phase transport in porous media. Journal of Machine Learning for Modeling and Computing, 1(1), 2020.
  • [12] Amnon Geifman, Meirav Galun, David Jacobs, and Basri Ronen. On the spectral bias of convolutional neural tangent and gaussian process kernels. Advances in Neural Information Processing Systems, 35:11253–11265, 2022.
  • [13] Amnon Geifman, Abhay Yadav, Yoni Kasten, Meirav Galun, David Jacobs, and Basri Ronen. On the similarity between the Laplace and neural tangent kernels. In Advances in Neural Information Processing Systems, volume 33, pages 1451–1461, 2020.
  • [14] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
  • [15] Boris Hanin. Random neural networks in the infinite width limit as Gaussian processes. arXiv preprint arXiv:2107.01562, 2021.
  • [16] Saeid Hedayatrasa, Olga Fink, Wim Van Paepegem, and Mathias Kersemans. k-space physics-informed neural network (k-pinn) for compressed spectral mapping and efficient inversion of vibrations in thin composite laminates. Mechanical Systems and Signal Processing, 223:111920, 2025.
  • [17] Tianyang Hu, Wenjia Wang, Cong Lin, and Guang Cheng. Regularization matters: A nonparametric perspective on overparametrized neural network. In International Conference on Artificial Intelligence and Statistics, pages 829–837. PMLR, 2021.
  • [18] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • [19] Ameya D Jagtap and George Em Karniadakis. Extended physics-informed neural networks (XPINNs): A generalized space-time domain decomposition based deep learning framework for nonlinear partial differential equations. Communications in Computational Physics, 28(5), 2020.
  • [20] Ge Jin, Jian Cheng Wong, Abhishek Gupta, Shipeng Li, and Yew-Soon Ong. Fourier warm start for physics-informed neural networks. Engineering Applications of Artificial Intelligence, 132:107887, 2024.
  • [21] Olav Kallenberg. Foundations of Modern Probability. Number 99 in Probability Theory and Stochastic Modelling. Springer, Cham, Switzerland, 2021.
  • [22] Aditi Krishnapriyan, Amir Gholami, Shandian Zhe, Robert Kirby, and Michael W Mahoney. Characterizing possible failure modes in physics-informed neural networks. Advances in neural information processing systems, 34:26548–26560, 2021.
  • [23] Jianfa Lai, Manyun Xu, Rui Chen, and Qian Lin. Generalization ability of wide neural networks on R, February 2023.
  • [24] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [25] Yicheng Li, Weiye Gan, Zuoqiang Shi, and Qian Lin. Generalization error curves for analytic spectral algorithms under power-law decay. arXiv preprint arXiv:2401.01599, 2024.
  • [26] Yicheng Li, Zixiong Yu, Guhan Chen, and Qian Lin. On the eigenvalue decay rates of a class of neural-network related kernel functions defined on general domains. Journal of Machine Learning Research, 25(82):1–47, 2024.
  • [27] Qiang Liu, Mengyu Chu, and Nils Thuerey. Config: Towards conflict-free training of physics informed neural networks. arXiv preprint arXiv:2408.11104, 2024.
  • [28] Athanasios Papoulis and S Unnikrishna Pillai. Probability, random variables, and stochastic processes. McGraw-Hill Europe: New York, NY, USA, 2002.
  • [29] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International conference on machine learning, pages 5301–5310. PMLR, 2019.
  • [30] Maziar Raissi. Deep hidden physics models: Deep learning of nonlinear partial differential equations. Journal of Machine Learning Research, 19(25):1–24, 2018.
  • [31] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational physics, 378:686–707, 2019.
  • [32] Basri Ronen, David Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. Advances in Neural Information Processing Systems, 32, 2019.
  • [33] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.
  • [34] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge university press, 2018.
  • [35] Sifan Wang, Hanwen Wang, and Paris Perdikaris. On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering, 384:113938, 2021.
  • [36] Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. Journal of Computational Physics, 449:110768, 2022.
  • [37] Jon Wellner et al. Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media, 2013.
  • [38] Zhi-Qin John Xu, Yaoyu Zhang, and Yanyang Xiao. Training behavior of deep neural network in frequency domain. In Neural Information Processing: 26th International Conference, ICONIP 2019, Sydney, NSW, Australia, December 12–15, 2019, Proceedings, Part I 26, pages 264–274. Springer, 2019.
  • [39] Lei Yuan, Yi-Qing Ni, Xiang-Yun Deng, and Shuo Hao. A-PINN: Auxiliary physics informed neural networks for forward and inverse problems of nonlinear integro-differential equations. Journal of Computational Physics, 462:111260, 2022.
  • [40] Yinhao Zhu, Nicholas Zabaras, Phaedon-Stelios Koutsourelakis, and Paris Perdikaris. Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data. Journal of Computational Physics, 394:56–81, 2019.

Appendix A Smoothness of the NTK

In this section, we verify the smoothness of the kernels $K^{\mathrm{RF}}_{l}$ and $K^{\mathrm{NT}}_{l}$ defined as (6). For the convenience of the reader, we first recall the definitions.

K1RF(x,x)=K1NT(x,x)=x,x.\displaystyle K^{\mathrm{RF}}_{1}(x,x^{\prime})=K^{\mathrm{NT}}_{1}(x,x^{\prime})=\left\langle{x,x^{\prime}}\right\rangle.

and for l=2,,L+1l=2,\dots,L+1,

KlRF(x,x)\displaystyle K^{\mathrm{RF}}_{l}(x,x^{\prime}) =𝔼(u,v)N(𝟎,𝑩l1(x,x))[σ(u)σ(v)],\displaystyle=\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{B}_{l-1}(x,x^{\prime}))}\left[\sigma(u)\sigma(v)\right],
KlNT(x,x)\displaystyle K^{\mathrm{NT}}_{l}(x,x^{\prime}) =KlRF(x,x)+Kl1NT(x,x)𝔼(u,v)N(𝟎,𝑩l1(x,x))[σ(1)(u)σ(1)(v)],\displaystyle=K^{\mathrm{RF}}_{l}(x,x^{\prime})+K^{\mathrm{NT}}_{l-1}(x,x^{\prime})\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{B}_{l-1}(x,x^{\prime}))}\left[\sigma^{(1)}(u)\sigma^{(1)}(v)\right],

where the matrix 𝑩l1(x,x)2×2\bm{B}_{l-1}(x,x^{\prime})\in\mathbb{R}^{2\times 2} is defined as:

\bm{B}_{l-1}(x,x^{\prime})=\begin{pmatrix}K^{\mathrm{RF}}_{l-1}(x,x)&K^{\mathrm{RF}}_{l-1}(x,x^{\prime})\\ K^{\mathrm{RF}}_{l-1}(x,x^{\prime})&K^{\mathrm{RF}}_{l-1}(x^{\prime},x^{\prime})\end{pmatrix}.
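For readers who prefer a computational view, the recursion above can be evaluated numerically. The following is a minimal Monte Carlo sketch (our illustration, assuming a Tanh activation) that tracks the three entries of $\bm{B}_{l-1}$ and the scalar $K^{\mathrm{NT}}_{l}(x,x^{\prime})$ layer by layer; the inputs and the sample size are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
sigma = np.tanh                                    # assumed activation
dsigma = lambda u: 1.0 / np.cosh(u) ** 2           # its first derivative

def ntk_recursion(x, xp, L, n_mc=200_000):
    # K_1^RF(x, x') = K_1^NT(x, x') = <x, x'>
    kxx, kxpxp, kxxp = float(x @ x), float(xp @ xp), float(x @ xp)
    k_nt = kxxp
    for _ in range(2, L + 2):                      # layers l = 2, ..., L + 1
        B = np.array([[kxx, kxxp], [kxxp, kxpxp]])
        uv = rng.multivariate_normal(np.zeros(2), B, size=n_mc)
        new_kxxp = np.mean(sigma(uv[:, 0]) * sigma(uv[:, 1]))     # K_l^RF(x, x')
        dot = np.mean(dsigma(uv[:, 0]) * dsigma(uv[:, 1]))
        k_nt = new_kxxp + k_nt * dot                              # K_l^NT(x, x')
        z = rng.standard_normal(n_mc)
        kxx = np.mean(sigma(np.sqrt(kxx) * z) ** 2)               # K_l^RF(x, x)
        kxpxp = np.mean(sigma(np.sqrt(kxpxp) * z) ** 2)           # K_l^RF(x', x')
        kxxp = new_kxxp
    return k_nt

x, xp = np.array([0.6, 0.8]), np.array([1.0, 0.0])
print(ntk_recursion(x, xp, L=2))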
Proposition A.1.

Let $f,g$ be two functions such that $f,g,f^{\prime},g^{\prime}\in L^{2}(\mathbb{R},e^{-x^{2}/2}\differential x)$. Denote

F(ρ)=𝔼(u,v)N(𝟎,𝚺)f(u)g(v),Σ=(1ρρ1).\displaystyle F(\rho)=\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{\Sigma})}f(u)g(v),\quad\Sigma=\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}.

Then,

F(ρ)=𝔼(u,v)N(𝟎,𝚺)f(u)g(v).\displaystyle F^{\prime}(\rho)=\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{\Sigma})}f^{\prime}(u)g^{\prime}(v).
Proof.

Denote by hi(x)h_{i}(x) the Hermite polynomial basis with respect to the normal distribution. Let us consider the Hermite expansion of f,gf,g:

f(u)=i=0αihi(u),g(v)=j=0βjhj(v),\displaystyle f(u)=\sum_{i=0}^{\infty}\alpha_{i}h_{i}(u),\quad g(v)=\sum_{j=0}^{\infty}\beta_{j}h_{j}(v),

and also f,gf^{\prime},g^{\prime}:

f(u)=i=0αihi(u),g(v)=j=0βjhj(v).\displaystyle f^{\prime}(u)=\sum_{i=0}^{\infty}\alpha_{i}^{\prime}h_{i}(u),\quad g^{\prime}(v)=\sum_{j=0}^{\infty}\beta_{j}^{\prime}h_{j}(v).

Using the fact that hi(u)=ihi1(u)h_{i}^{\prime}(u)=\sqrt{i}h_{i-1}(u), we derive αi1=iαi\alpha_{i-1}^{\prime}=\sqrt{i}\alpha_{i} and βi1=iβi\beta_{i-1}^{\prime}=\sqrt{i}\beta_{i} for i1i\geq 1.

Now, we use the fact that 𝔼hi(u)hj(v)=ρiδij\mathbb{E}h_{i}(u)h_{j}(v)=\rho^{i}\delta_{ij} for (u,v)N(𝟎,𝚺)(u,v)\sim N(\bm{0},\bm{\Sigma}) to obtain

𝔼(u,v)f(u)g(v)=i,jαiβj𝔼hi(u)hj(v)=i=0αiβiρi,\displaystyle\mathbb{E}_{(u,v)}f(u)g(v)=\sum_{i,j}\alpha_{i}\beta_{j}\mathbb{E}h_{i}(u)h_{j}(v)=\sum_{i=0}^{\infty}\alpha_{i}\beta_{i}\rho^{i},

and thus

ρ𝔼(u,v)f(u)g(v)=i=0iαiβiρi1=i=1αi1βi1ρi1=i=0αiβiρi=𝔼(u,v)f(u)g(v).\displaystyle\partialderivative{\rho}\mathbb{E}_{(u,v)}f(u)g(v)=\sum_{i=0}^{\infty}i\alpha_{i}\beta_{i}\rho^{i-1}=\sum_{i=1}^{\infty}\alpha_{i-1}^{\prime}\beta_{i-1}^{\prime}\rho^{i-1}=\sum_{i=0}^{\infty}\alpha_{i}^{\prime}\beta_{i}^{\prime}\rho^{i}=\mathbb{E}_{(u,v)}f^{\prime}(u)g^{\prime}(v).

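The identity of Proposition A.1 can also be checked numerically. Below is a small Monte Carlo sanity check (our illustration, for the hypothetical choice $f=\tanh$, $g=\sin$) that compares a finite-difference approximation of $F^{\prime}(\rho)$ with $\mathbb{E}[f^{\prime}(u)g^{\prime}(v)]$; common random numbers are reused across values of $\rho$ so that the finite difference is stable.

import numpy as np

rng = np.random.default_rng(0)
rho, eps, n_mc = 0.4, 1e-3, 1_000_000
z1, z2 = rng.standard_normal((2, n_mc))            # common random numbers

def F(r):
    # E[f(u) g(v)] with (u, v) ~ N(0, [[1, r], [r, 1]]), f = tanh, g = sin
    u, v = z1, r * z1 + np.sqrt(1 - r ** 2) * z2
    return np.mean(np.tanh(u) * np.sin(v))

lhs = (F(rho + eps) - F(rho - eps)) / (2 * eps)    # finite-difference estimate of F'(rho)
u, v = z1, rho * z1 + np.sqrt(1 - rho ** 2) * z2
rhs = np.mean((1.0 / np.cosh(u) ** 2) * np.cos(v)) # E[f'(u) g'(v)]
print(lhs, rhs)                                    # the two estimates should roughly agree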
Lemma A.2.

Let Assumption 1 hold for both σ1,σ2\sigma_{1},\sigma_{2}. Let us define a function F(𝐀)F(\bm{A}) on positive semi-definite matrices 𝐀=(Aij)2×2PSD(2)\bm{A}=(A_{ij})_{2\times 2}\in\mathrm{PSD}(2) as

F(𝑨)=𝔼(u,v)N(𝟎,𝑨)σ1(u)σ2(v).F(\bm{A})=\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{A})}\sigma_{1}(u)\sigma_{2}(v).

Then, denoting 𝒟={𝐀PSD(2):A11,A22[c1,c]}\mathcal{D}=\left\{\bm{A}\in\mathrm{PSD}(2):A_{11},A_{22}\in[c^{-1},c]\right\} for some constant c>1c>1, we have F(𝐀)Ck(𝒟)F(\bm{A})\in C^{k}(\mathcal{D}) and for all |α|k\absolutevalue{\alpha}\leq k,

|DαF(𝑨)|C,\absolutevalue{D^{\alpha}F(\bm{A})}\leq C,

where CC is a constant depending only on the constants in Assumption 1 and cc.

Proof.

We first prove the case for k=1k=1. Let us consider parameterizing

𝑨=(a2ρabρabb2),\displaystyle\bm{A}=\begin{pmatrix}a^{2}&\rho ab\\ \rho ab&b^{2}\end{pmatrix},

where ρ[1,1]\rho\in[-1,1] and a,b[c1/2,c1/2]a,b\in[c^{-1/2},c^{1/2}]. Then,

F(𝑨)=𝔼(u,v)N(𝟎,𝑩)σ1(au)σ2(bv),𝑩=(1ρρ1).\displaystyle F(\bm{A})=\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{B})}\sigma_{1}(au)\sigma_{2}(bv),\quad\bm{B}=\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}.

Consequently,

Fa=𝔼(u,v)N(𝟎,𝑩)(uσ1(au)σ2(bv)),Fb=𝔼(u,v)N(𝟎,𝑩)(vσ1(au)σ2(bv)).\displaystyle\partialderivative{F}{a}=\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{B})}\left(u\sigma_{1}^{\prime}(au)\sigma_{2}(bv)\right),\quad\partialderivative{F}{b}=\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{B})}\left(v\sigma_{1}(au)\sigma_{2}^{\prime}(bv)\right).

Then, using Assumption 1, we have

FaC02[𝔼(uσ1(au))2][𝔼(σ2(bv))2]C\displaystyle\norm{\partialderivative{F}{a}}_{C^{0}}^{2}\leq\left[\mathbb{E}\left(u\sigma_{1}^{\prime}(au)\right)^{2}\right]\left[\mathbb{E}\left(\sigma_{2}(bv)\right)^{2}\right]\leq C

for some constant CC. For ρ\rho, we apply Proposition A.1 to get

Fρ=ab𝔼(u,v)N(𝟎,𝑩)(σ1(au))(σ2(bv)).\displaystyle\partialderivative{F}{\rho}=ab\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{B})}\left(\sigma_{1}^{\prime}(au)\right)\left(\sigma_{2}^{\prime}(bv)\right).

Hence,

|Fρ|ab[𝔼(σ1(au))2]1/2[𝔼(σ2(bv))2]1/2C\displaystyle\absolutevalue{\partialderivative{F}{\rho}}\leq ab\left[\mathbb{E}\left(\sigma_{1}^{\prime}(au)\right)^{2}\right]^{1/2}\left[\mathbb{E}\left(\sigma_{2}^{\prime}(bv)\right)^{2}\right]^{1/2}\leq C

for some constant CC.

Now, using a=A11a=\sqrt{A_{11}}, b=A22b=\sqrt{A_{22}} and ρ=A12/A11A22\rho=A_{12}/\sqrt{A_{11}A_{22}}, it is easy to see that the partial derivatives aAij\partialderivative{a}{A_{ij}}, bAij\partialderivative{b}{A_{ij}} and ρAij\partialderivative{\rho}{A_{ij}} are bounded by a constant depending only on cc. Applying the chain rule finishes the proof for k=1k=1.

Finally, the case of general $k$ follows by induction, with $\sigma_{1},\sigma_{2}$ replaced by their derivatives. ∎

Proof of Lemma 2.3.

Using Lemma A.2, we see that the mappings

𝚺𝔼(u,v)N(𝟎,𝚺)σ(u)σ(v),𝚺𝔼(u,v)N(𝟎,𝚺)σ(u)σ(v)\displaystyle\bm{\Sigma}\mapsto\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{\Sigma})}\sigma(u)\sigma(v),\quad\bm{\Sigma}\mapsto\mathbb{E}_{(u,v)\sim N(\bm{0},\bm{\Sigma})}\sigma^{\prime}(u)\sigma^{\prime}(v)

belong to the classes $C^{k}$ and $C^{k-1}$, respectively. Hence, the result follows from the recurrence formula and the fact that compositions of $C^{k}$ mappings are still $C^{k}$. ∎

Appendix B Auxiliary results

The following lemma gives a sufficient condition for the weak convergence in Ck(T)C^{k}(T).

Lemma B.1 (Convergence of CkC^{k} processes).

Let $\mathcal{X}\subseteq\mathbb{R}^{d}$ be an open set and let $T\subset\mathcal{X}$ be a compact set. Let $(X^{n}_{t})_{t\in\mathcal{X}}$, $n\geq 1$, be random processes with $C^{k}(\mathcal{X})$ paths a.s., and let $(X_{t})_{t\in\mathcal{X}}$ be a Gaussian field with mean zero and a $C^{k\times k}$ covariance kernel. Denote by $D^{\alpha}$ the derivative with respect to $t$ for a multi-index $\alpha$. Suppose that

  1. For any $t_{1},\dots,t_{m}\in T$, the finite dimensional convergence holds:

     \left(X^{n}_{t_{1}},\dots,X^{n}_{t_{m}}\right)\xrightarrow{w}\left(X_{t_{1}},\dots,X_{t_{m}}\right).

  2. For any $\alpha$ satisfying $|\alpha|\leq k$ and $\delta>0$, there exists $L>0$ such that

     \sup_{n}\mathbb{P}\left\{\sup_{t,s\in T}\frac{\absolutevalue{D^{\alpha}(X_{t}^{n}-X_{s}^{n})}}{\norm{t-s}_{2}}>L\right\}<\delta.

  3. For any $\alpha$ satisfying $|\alpha|\leq k$ and $\delta>0$, there exists $C>0$ such that

     \sup_{n}\mathbb{P}\left\{\sup_{t\in T}\absolutevalue{D^{\alpha}X_{t}^{n}}>C\right\}<\delta.

Then, for any $\alpha$ satisfying $|\alpha|\leq k$, we have

DαXtn𝑤DαXt in C0(T).\displaystyle D^{\alpha}X^{n}_{t}\xrightarrow{w}D^{\alpha}X_{t}\mbox{\quad in\quad}C^{0}(T).
Proof.

By conditions 2 and 3 for $\alpha=\mathbf{0}$ together with condition 1 (we refer to [21, Section 23] for more details),

Xtn𝑤Xt in C0(T).X^{n}_{t}\xrightarrow{w}X_{t}\mbox{\quad in\quad}C^{0}(T).

which is equivalent to

lim infn𝔼[φ(Xtn)]𝔼[φ(Xt)]\liminf_{n\rightarrow\infty}\mathbb{E}\left[\varphi(X_{t}^{n})\right]\geq\mathbb{E}\left[\varphi(X_{t})\right]

for any bounded, nonnegative and Lipschitz continuous $\varphi\in C(C^{0}(T);\mathbb{R})$ (see, for example, Theorem 1.3.4 of [37]). For $\alpha$ satisfying $1\leq|\alpha|\leq k$, conditions 2 and 3 also yield the tightness of $\left\{D^{\alpha}X_{t}^{n}\right\}_{n=1}^{\infty}$. Hence, we only need to show the finite dimensional convergence:

(DαXt1n,,DαXtln)𝑤(DαXt1,,DαXtl)\left(D^{\alpha}X^{n}_{t_{1}},\dots,D^{\alpha}X^{n}_{t_{l}}\right)\xrightarrow{w}\left(D^{\alpha}X_{t_{1}},\dots,D^{\alpha}X_{t_{l}}\right) (21)

for any $t_{1},\dots,t_{l}\in T$. For large enough $m$, define the smoothing map $J_{m}:C^{0}(T)\rightarrow C^{k}(T)$ by

Jmf(x)𝒳Rm(|xy|)f(y)𝑑yJ_{m}f(x)\coloneqq\int_{\mathcal{X}}R_{m}(|x-y|)f(y)dy

where $R:[0,\infty)\rightarrow[0,\infty)$ is a $C^{k}$ kernel with compact support satisfying $\int_{\mathbb{R}^{d}}R(|y|)dy=1$, and $R_{m}(s)\coloneqq m^{d}R(ms)$ is the scaled kernel. For any $\alpha$ satisfying $|\alpha|\leq k$, denote $D_{m}^{\alpha}\coloneqq D^{\alpha}\circ J_{m}$. Define $A_{n}\coloneqq\left\{\sup_{t,s\in T}\frac{\absolutevalue{D^{\alpha}(X_{t}^{n}-X_{s}^{n})}}{\norm{t-s}_{2}}\leq L\right\}$. By condition 2, for any $\delta>0$, we can select $L$ large enough so that $\mathbb{P}(A_{n})>1-\delta$ for all $n$. For any bounded, nonnegative and Lipschitz continuous $\varphi\in C(\mathbb{R}^{l};\mathbb{R})$, we have

𝔼[|φ(DmαXt1n,,DmαXtln)φ(DαXt1n,,DαXtln)|]\displaystyle\quad\mathbb{E}[\absolutevalue{\varphi(D_{m}^{\alpha}X^{n}_{t_{1}},\dots,D_{m}^{\alpha}X^{n}_{t_{l}})-\varphi(D^{\alpha}X^{n}_{t_{1}},\dots,D^{\alpha}X^{n}_{t_{l}})}] (22)
𝔼[|φ(DmαXt1n,,DmαXtln)φ(DαXt1n,,DαXtln)|𝟏An]\displaystyle\leq\mathbb{E}[\absolutevalue{\varphi(D_{m}^{\alpha}X^{n}_{t_{1}},\dots,D_{m}^{\alpha}X^{n}_{t_{l}})-\varphi(D^{\alpha}X^{n}_{t_{1}},\dots,D^{\alpha}X^{n}_{t_{l}})}\mathbf{1}_{A_{n}}]
+𝔼[|φ(DmαXt1n,,DmαXtln)φ(DαXt1n,,DαXtln)|𝟏Anc]\displaystyle+\mathbb{E}[\absolutevalue{\varphi(D_{m}^{\alpha}X^{n}_{t_{1}},\dots,D_{m}^{\alpha}X^{n}_{t_{l}})-\varphi(D^{\alpha}X^{n}_{t_{1}},\dots,D^{\alpha}X^{n}_{t_{l}})}\mathbf{1}_{A_{n}^{c}}]
C1(φ,l)(𝔼[max1il|DmαXtinDαXtin|𝟏An]+(Anc))\displaystyle\leq C_{1}(\varphi,l)\left(\mathbb{E}[\max_{1\leq i\leq l}\absolutevalue{D_{m}^{\alpha}X^{n}_{t_{i}}-D^{\alpha}X^{n}_{t_{i}}}\mathbf{1}_{A_{n}}]+\mathbb{P}(A_{n}^{c})\right)
C2(φ,R,L)m+C1(φ)δ.\displaystyle\leq\frac{C_{2}(\varphi,R,L)}{m}+C_{1}(\varphi)\delta.

For the next part, we first consider $\alpha$ satisfying $|\alpha|=1$. Without loss of generality, we can assume that $D^{\alpha}=\frac{\partial}{\partial t^{1}}$. Since $X_{t}$ is a Gaussian field, $\frac{\partial X}{\partial t^{1}}$ exists in the sense of mean square (see, for example, Appendix 9A in [28]), i.e.

𝔼[|Xt+εvXtεXtt1|2]0,as ε0+\mathbb{E}\left[\absolutevalue{\frac{X_{t+\varepsilon v}-X_{t}}{\varepsilon}-\frac{\partial X_{t}}{\partial t^{1}}}^{2}\right]\longrightarrow 0,\quad\text{as }\varepsilon\rightarrow 0^{+}

where $v\coloneqq(1,0,\dots,0)\in\mathbb{R}^{d}$. Moreover, this convergence is uniform with respect to $t$ since the covariance kernel of $X_{t}$ belongs to $C^{1\times 1}(T)$. Therefore, we can show that

𝔼[|JmXtt1JmXt1(t)|2]=0.\mathbb{E}\left[\absolutevalue{\frac{\partial J_{m}X_{t}}{\partial t^{1}}-J_{m}\frac{\partial X}{\partial t^{1}}(t)}^{2}\right]=0. (23)

for any tTt\in T. In fact,

t1𝒳Rm(|ts|)Xs𝑑s\displaystyle\quad\frac{\partial}{\partial t^{1}}\int_{\mathcal{X}}R_{m}(|t-s|)X_{s}ds
=limε0+1ε{𝒳Rm(|t+εvs|)Xs𝑑s𝒳Rm(|ts|)Xs𝑑s}\displaystyle=\lim_{\varepsilon\rightarrow 0^{+}}\frac{1}{\varepsilon}\left\{\int_{\mathcal{X}}R_{m}(|t+\varepsilon v-s|)X_{s}ds-\int_{\mathcal{X}}R_{m}(|t-s|)X_{s}ds\right\}
=limε0+1ε{dRm(|t+εvs|)Xs𝑑sdRm(|ts|)Xs𝑑s}\displaystyle=\lim_{\varepsilon\rightarrow 0^{+}}\frac{1}{\varepsilon}\left\{\int_{\mathbb{R}^{d}}R_{m}(|t+\varepsilon v-s|)X_{s}ds-\int_{\mathbb{R}^{d}}R_{m}(|t-s|)X_{s}ds\right\}
=limε0+dRm(|ts|)Xs+εvXsε𝑑s.\displaystyle=\lim_{\varepsilon\rightarrow 0^{+}}\int_{\mathbb{R}^{d}}R_{m}(|t-s|)\frac{X_{s+\varepsilon v}-X_{s}}{\varepsilon}ds.

Hence,

𝔼[|JmXtt1JmXt1(t)|2]\displaystyle\quad\mathbb{E}\left[\absolutevalue{\frac{\partial J_{m}X_{t}}{\partial t^{1}}-J_{m}\frac{\partial X}{\partial t^{1}}(t)}^{2}\right]
=𝔼[limε0+|dRm(|ts|)(Xs+εvXsεXss1)𝑑s|2]\displaystyle=\mathbb{E}\left[\lim_{\varepsilon\rightarrow 0^{+}}\absolutevalue{\int_{\mathbb{R}^{d}}R_{m}(|t-s|)\left(\frac{X_{s+\varepsilon v}-X_{s}}{\varepsilon}-\frac{\partial X_{s}}{\partial s_{1}}\right)ds}^{2}\right]
lim infε0+𝔼[|dRm(|ts|)(Xs+εvXsεXss1)𝑑s|2]\displaystyle\leq\liminf_{\varepsilon\rightarrow 0^{+}}\mathbb{E}\left[\absolutevalue{\int_{\mathbb{R}^{d}}R_{m}(|t-s|)\left(\frac{X_{s+\varepsilon v}-X_{s}}{\varepsilon}-\frac{\partial X_{s}}{\partial s_{1}}\right)ds}^{2}\right]
CRlim infε0+dRm(|ts|)𝔼[|Xs+εvXsεXss1|2]𝑑s\displaystyle\leq C_{R}\liminf_{\varepsilon\rightarrow 0^{+}}\int_{\mathbb{R}^{d}}R_{m}(|t-s|)\mathbb{E}\left[\absolutevalue{\frac{X_{s+\varepsilon v}-X_{s}}{\varepsilon}-\frac{\partial X_{s}}{\partial s_{1}}}^{2}\right]ds
=0.\displaystyle=0.

Since $\frac{\partial X_{t}}{\partial t^{1}}$ is also a Gaussian field, it a.s. has uniformly continuous paths. As a result,

JmXt1(t)a.s.Xt1(t)J_{m}\frac{\partial X}{\partial t^{1}}(t)\xrightarrow{a.s.}\frac{\partial X}{\partial t^{1}}(t)

as mm\rightarrow\infty. Note that with Proposition B.7,

supm𝔼[|JmXt1(t)|2]CR𝔼[suptT|Xtt1|2]<\displaystyle\sup_{m}\mathbb{E}\left[\absolutevalue{J_{m}\frac{\partial X}{\partial t^{1}}(t)}^{2}\right]\leq C_{R}\mathbb{E}\left[\sup_{t\in T}\absolutevalue{\frac{\partial X_{t}}{\partial t^{1}}}^{2}\right]<\infty

which means that {JmXt1(t)}m=1\{J_{m}\frac{\partial X}{\partial t^{1}}(t)\}_{m=1}^{\infty} is uniformly integrable and

JmXt1(t)L1Xt1(t).J_{m}\frac{\partial X}{\partial t^{1}}(t)\xrightarrow{L^{1}}\frac{\partial X}{\partial t^{1}}(t).

Combining with (23), we have

JmXtt1L1Xt1(t)\frac{\partial J_{m}X_{t}}{\partial t^{1}}\xrightarrow{L^{1}}\frac{\partial X}{\partial t^{1}}(t) (24)

as mm\rightarrow\infty. For any fixed mm, note that fφ(Jmft1(t1),,Jmft1(tl))f\mapsto\varphi\left(\frac{\partial J_{m}f}{\partial t^{1}}(t_{1}),\dots,\frac{\partial J_{m}f}{\partial t^{1}}(t_{l})\right) is still a bounded, nonnegative and Lipschitz continuous functional in C(C0(T);)C(C^{0}(T);\mathbb{R}). We have

lim infn𝔼[φ(JmXnt1(t1),,JmXnt1(tl))]𝔼[φ(JmXt1(t1),,JmXt1(tl))]\liminf_{n\rightarrow\infty}\mathbb{E}\left[\varphi\left(\frac{\partial J_{m}X^{n}}{\partial t^{1}}(t_{1}),\dots,\frac{\partial J_{m}X^{n}}{\partial t^{1}}(t_{l})\right)\right]\geq\mathbb{E}\left[\varphi\left(\frac{\partial J_{m}X}{\partial t^{1}}(t_{1}),\dots,\frac{\partial J_{m}X}{\partial t^{1}}(t_{l})\right)\right]

Because of (22) and (24),

lim infn𝔼[φ(Xnt1(t1),,Xnt1(tl))]𝔼[φ(Xt1(t1),,Xt1(tl))].\liminf_{n\rightarrow\infty}\mathbb{E}\left[\varphi\left(\frac{\partial X^{n}}{\partial t^{1}}(t_{1}),\dots,\frac{\partial X^{n}}{\partial t^{1}}(t_{l})\right)\right]\geq\mathbb{E}\left[\varphi\left(\frac{\partial X}{\partial t^{1}}(t_{1}),\dots,\frac{\partial X}{\partial t^{1}}(t_{l})\right)\right].

The finite dimensional convergence (21) is thus obtained for $\alpha$ with $|\alpha|=1$. The general case follows by induction on $|\alpha|$. ∎

Lemma B.2 (Convergence of Ck×kC^{k\times k} processes).

Let $\mathcal{X}\subseteq\mathbb{R}^{d}$ be an open set and let $T\subset\mathcal{X}$ be a compact set. Let $(X^{n}_{t,t^{\prime}})_{t,t^{\prime}\in\mathcal{X}}$, $n\geq 1$, be random processes with $C^{k\times k}(\mathcal{X}\times\mathcal{X})$ paths a.s., and let $(X_{t,t^{\prime}})_{t,t^{\prime}\in\mathcal{X}}\in C^{k\times k}(\mathcal{X}\times\mathcal{X})$ be a deterministic function. Denote by $D_{t}^{\alpha},D_{t^{\prime}}^{\beta}$ the derivatives with respect to $t,t^{\prime}$ for multi-indices $\alpha,\beta$. Suppose that

  1. For any $(t_{1},t_{1}^{\prime}),\dots,(t_{m},t_{m}^{\prime})\in T\times T$, the finite dimensional convergence holds:

     \left(X^{n}_{t_{1},t_{1}^{\prime}},\dots,X^{n}_{t_{m},t_{m}^{\prime}}\right)\xrightarrow{w}\left(X_{t_{1},t_{1}^{\prime}},\dots,X_{t_{m},t_{m}^{\prime}}\right).

  2. For any $\alpha,\beta$ satisfying $|\alpha|,|\beta|\leq k$ and $\delta>0$, there exists $L>0$ such that

     \sup_{n}\mathbb{P}\left\{\sup_{t,t^{\prime},s,s^{\prime}\in T}\frac{\absolutevalue{D_{t}^{\alpha}D_{t^{\prime}}^{\beta}(X_{t,t^{\prime}}^{n}-X_{s,s^{\prime}}^{n})}}{\norm{(t,t^{\prime})-(s,s^{\prime})}_{2}}>L\right\}<\delta.

  3. For any $\alpha,\beta$ satisfying $|\alpha|,|\beta|\leq k$ and $\delta>0$, there exists $C>0$ such that

     \sup_{n}\mathbb{P}\left\{\sup_{t,t^{\prime}\in T}\absolutevalue{D_{t}^{\alpha}D_{t^{\prime}}^{\beta}X_{t,t^{\prime}}^{n}}>C\right\}<\delta.

Then, for any $\alpha,\beta$ satisfying $|\alpha|,|\beta|\leq k$, we have

DtαDtβXt,tn𝑤DtαDtβXt,t in C0(T).\displaystyle D_{t}^{\alpha}D_{t^{\prime}}^{\beta}X^{n}_{t,t^{\prime}}\xrightarrow{w}D_{t}^{\alpha}D_{t^{\prime}}^{\beta}X_{t,t^{\prime}}\mbox{\quad in\quad}C^{0}(T).
Proof.

Note that Xt,tCk×k(T×T)X_{t,t^{\prime}}\in C^{k\times k}(T\times T). For any bounded, nonnegative and Lipschitz continuous φC(l;)\varphi\in C(\mathbb{R}^{l};\mathbb{R}), α,β\alpha,\beta satisfying |α|,|β|k|\alpha|,|\beta|\leq k and (t1,t1),,(tl,tl)T×T(t_{1},t_{1}^{\prime}),\dots,(t_{l},t_{l}^{\prime})\in T\times T, we have

limmφ(DtαDtβJmXt1,t1,,DtαDtβJmXtl,tl)=φ(DtαDtβXt1,t1,,DtαDtβXtl,tl)\lim_{m\rightarrow\infty}\varphi(D_{t}^{\alpha}D_{t^{\prime}}^{\beta}J_{m}X_{t_{1},t_{1}^{\prime}},\dots,D_{t}^{\alpha}D_{t^{\prime}}^{\beta}J_{m}X_{t_{l},t_{l}^{\prime}})=\varphi(D_{t}^{\alpha}D_{t^{\prime}}^{\beta}X_{t_{1},t_{1}^{\prime}},\dots,D_{t}^{\alpha}D_{t^{\prime}}^{\beta}X_{t_{l},t_{l}^{\prime}})

where JmJ_{m} is defined in the proof of Lemma B.1. The remaining proof is the same as Lemma B.1. ∎

Let us recall the structure of the neural network (1),

z1(x)=W0x,zl+1(x)=1mlWlσ(zl(x)),for l=1,,L\displaystyle\begin{aligned} &z^{1}(x)=W^{0}x,\\ &z^{l+1}(x)=\frac{1}{\sqrt{m_{l}}}W^{l}\sigma(z^{l}(x)),\quad\text{for }l=1,\dots,L\end{aligned}

where Wlml+1×mlW^{l}\in\mathbb{R}^{m_{l+1}\times m_{l}}. For the components,

zi1(x)\displaystyle z^{1}_{i}(x) =j=1m0Wij0xj,\displaystyle=\sum_{j=1}^{m_{0}}W^{0}_{ij}x_{j},
zil+1(x)\displaystyle z^{l+1}_{i}(x) =1mlj=1mlWijlσ(zjl(x)).\displaystyle=\frac{1}{\sqrt{m_{l}}}\sum_{j=1}^{m_{l}}W^{l}_{ij}\sigma(z^{l}_{j}(x)).

And the NTK is defined as

Kl,ijNT,θ(x,x)=θzil(x),θzjl(x), for i,j=1,,ml.K^{\mathrm{NT},\theta}_{l,ij}(x,x^{\prime})=\left\langle{\nabla_{\theta}z^{l}_{i}(x),\nabla_{\theta}z^{l}_{j}(x^{\prime})}\right\rangle,\mbox{\quad for\quad}i,j=1,\dots,m_{l}.
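As an illustration of this definition, the following minimal sketch (toy widths, our illustration rather than the authors' code) assembles $K^{\mathrm{NT},\theta}_{2,ij}(x,x^{\prime})$ directly from parameter gradients; the Gram matrix of the two parameter Jacobians reproduces the inner products above.

import torch

torch.manual_seed(0)
m0, m1, m2 = 2, 8, 4                               # toy widths
W0 = torch.randn(m1, m0, requires_grad=True)
W1 = torch.randn(m2, m1, requires_grad=True)

def z2(x):
    # z^2(x) = (1 / sqrt(m1)) W^1 sigma(z^1(x)) with z^1(x) = W^0 x and sigma = tanh
    return W1 @ torch.tanh(W0 @ x) / m1 ** 0.5

def param_jacobian(out):
    # rows are grad_theta of each component of a vector-valued output
    rows = []
    for i in range(out.shape[0]):
        grads = torch.autograd.grad(out[i], (W0, W1), retain_graph=True)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    return torch.stack(rows)

x, xp = torch.randn(m0), torch.randn(m0)
K = param_jacobian(z2(x)) @ param_jacobian(z2(xp)).T   # the m2 x m2 matrix K^{NT,theta}_{2,ij}(x, x')
print(K)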
Lemma B.3.

Fix an even integer $p\geq 2$. Suppose that $\mu$ is a probability measure on $\mathbb{R}$ with mean $0$ and finite higher moments. Assume also that $w=(w_{1},\dots,w_{m_{1}})$ is a vector with i.i.d. components, each with distribution $\mu$. Fix an integer $m_{0}\geq 1$, let $T_{0}$ be a compact set in $\mathbb{R}^{m_{0}}$, and let $T_{1}\subset\mathbb{R}^{m_{1}}$ be the image of $T_{0}$ under a $C^{0,1}$ map $f$ with $\norm{f}_{C^{0,1}(T_{0})}\leq\lambda$. Then, there exists a constant $C=C(T_{0},p,\mu,\lambda)$ such that for all $m_{1}\geq 1$,

𝔼[supyT1|wy|p]C.\mathbb{E}\left[\sup_{y\in T_{1}}|w\cdot y|^{p}\right]\leq C.
Proof.

For any fixed y0T1y_{0}\in T_{1},

𝔼[supyT1|wy|p]C1(p)(𝔼[|wy0|p]+𝔼[supyT1|w(yy0)|p]).\mathbb{E}\left[\sup_{y\in T_{1}}|w\cdot y|^{p}\right]\leq C_{1}(p)\left(\mathbb{E}\left[|w\cdot y_{0}|^{p}\right]+\mathbb{E}\left[\sup_{y\in T_{1}}|w\cdot(y-y_{0})|^{p}\right]\right).

With Lemma 2.9 and Lemma 2.10 in [15], both terms on the right can be bounded by a constant not depending on $m_{1}$. ∎

Lemma B.4.

Fix an integer $m_{0}\geq 1$ and let $T_{0}$ be a compact set in $\mathbb{R}^{m_{0}}$. Consider a map $\varphi:T_{1}\rightarrow T_{2}$ defined as

φ(x)=1mlσ(Wx)\varphi(x)=\frac{1}{\sqrt{m_{l}}}\sigma(Wx)

where $W\in\mathbb{R}^{m_{l}\times m_{1}}$ has components drawn i.i.d. from a distribution $\mu$ with mean $0$, variance $1$ and finite higher moments, $T_{2}\subset\mathbb{R}^{m_{l}}$, and $T_{1}\subset\mathbb{R}^{m_{1}}$ is the image of $T_{0}$ under a $C^{0,1}$ map $f$ with $\norm{f}_{C^{0,1}(T_{0})}\leq\lambda$. The activation $\sigma$ satisfies Assumption 1. Then, for any $\delta>0$, there exists a positive constant $C=C(k,\lambda,\sigma,\mu,\delta)$ such that

φCk,1(T1)C.\norm{\varphi}_{C^{k,1}(T_{1})}\leq C.

with probability at least 1δ1-\delta.

Proof.

For the components of φ\varphi,

φi(x)=1mlσ(Wix).\varphi_{i}(x)=\frac{1}{\sqrt{m_{l}}}\sigma(W_{i}\cdot x).

Hence, for a fixed α:|α|k\alpha:|\alpha|\leq k,

Dαφi(x)=1mlσ(|α|)(Wix)Wiα.D^{\alpha}\varphi_{i}(x)=\frac{1}{\sqrt{m_{l}}}\sigma^{(|\alpha|)}(W_{i}\cdot x)W_{i}^{\alpha}.

And

Dαφ(x)22=1mli=1mlσ(|α|)(Wix)2Wi2α.\norm{D^{\alpha}\varphi(x)}_{2}^{2}=\frac{1}{m_{l}}\sum_{i=1}^{m_{l}}\sigma^{(|\alpha|)}(W_{i}\cdot x)^{2}W_{i}^{2\alpha}.

With the basic inequality ab12(a2+b2)ab\leq\frac{1}{2}(a^{2}+b^{2}), Assumption 1 and Lemma B.3, there exists a constant Mα=Mα(k,λ,σ,μ)M_{\alpha}=M_{\alpha}(k,\lambda,\sigma,\mu) such that

𝔼[supxT1Dαφ(x)22]𝔼[supxT1σ(|α|)(Wix)2Wi2α]Mα\mathbb{E}\left[\sup_{x\in T_{1}}\norm{D^{\alpha}\varphi(x)}_{2}^{2}\right]\leq\mathbb{E}\left[\sup_{x\in T_{1}}\sigma^{(|\alpha|)}(W_{i}\cdot x)^{2}W_{i}^{2\alpha}\right]\leq M_{\alpha}

which means that

(supxT1Dαφ(x)2(Mαδ)12)δ.\mathbb{P}\left(\sup_{x\in T_{1}}\norm{D^{\alpha}\varphi(x)}_{2}\geq\left(\frac{M_{\alpha}}{\delta}\right)^{\frac{1}{2}}\right)\leq\delta.

Moreover, for all x1,x2T1x_{1},x_{2}\in T_{1},

Dαφ(x1)Dαφ(x2)22=1mli=1ml(σ(|α|)(Wix1)σ(|α|)(Wix2))2Wi2α\displaystyle\norm{D^{\alpha}\varphi(x_{1})-D^{\alpha}\varphi(x_{2})}_{2}^{2}=\frac{1}{m_{l}}\sum_{i=1}^{m_{l}}\left(\sigma^{(|\alpha|)}(W_{i}\cdot x_{1})-\sigma^{(|\alpha|)}(W_{i}\cdot x_{2})\right)^{2}W_{i}^{2\alpha}

With the same procedure in the proof of Lemma 2.11 in [15], there exists a positive constant C~α=C~α(k,λ,σ,μ,δ)\tilde{C}_{\alpha}=\tilde{C}_{\alpha}(k,\lambda,\sigma,\mu,\delta) such that

(supx1,x2T1Dαφ(x1)Dαφ(x2)2x1x22C~α)δ.\mathbb{P}\left(\sup_{x_{1},x_{2}\in T_{1}}\frac{\norm{D^{\alpha}\varphi(x_{1})-D^{\alpha}\varphi(x_{2})}_{2}}{\norm{x_{1}-x_{2}}_{2}}\geq\tilde{C}_{\alpha}\right)\leq\delta.

Lemma B.5.

Let σ\sigma be an activation satisfying Assumption 1 and T𝒳T\subset\mathcal{X} be a compact set in m0\mathbb{R}^{m_{0}}. For fixed l=2,,L+1l=2,\dots,L+1 and any δ>0\delta>0, considering zl+1(x)z^{l+1}(x) defined as (1), there exists a positive constant Cl=Cl(k,T,ml+1,σ,μ,δ)C_{l}=C_{l}(k,T,m_{l+1},\sigma,\mu,\delta) such that

zl+1(x)Ck,1(T)Cl\norm{z^{l+1}(x)}_{C^{k,1}(T)}\leq C_{l}

with probability at least 1δ1-\delta.

Proof.

For h=1,2,,lh=1,2,\dots,l, define

φh(x)=1mhσ(Wh1x).\varphi^{h}(x)=\frac{1}{\sqrt{m_{h}}}\sigma(W^{h-1}x).

and

φl+1(x)=1ml+1Wlx\varphi^{l+1}(x)=\frac{1}{\sqrt{m_{l+1}}}W^{l}x

Note that

zl+1(x)=φl+1φlφ1(x).z^{l+1}(x)=\varphi^{l+1}\circ\varphi^{l}\circ\cdots\circ\varphi^{1}(x).

With Lemma B.4, there exists a constant C~1=C~1(k,T,σ,μ,δ)\tilde{C}_{1}=\tilde{C}_{1}(k,T,\sigma,\mu,\delta) such that

φ1(x)Ck,1(T)C~1\norm{\varphi^{1}(x)}_{C^{k,1}(T)}\leq\tilde{C}_{1}

with probability at least $1-\frac{\delta}{l+1}$. Define $A_{1}\coloneqq\left\{\norm{\varphi^{1}(x)}_{C^{k,1}(T)}\leq\tilde{C}_{1}\right\}$. Then, with Lemma B.4, Lemma 2.1 and the fact that $W^{1},W^{0}$ are independent, there exists a constant $\tilde{C}_{2}=\tilde{C}_{2}(k,T,\sigma,\mu,\delta)$ such that

(φ2φ1(x)Ck,1(T)C~2)\displaystyle\mathbb{P}\left(\norm{\varphi^{2}\circ\varphi^{1}(x)}_{C^{k,1}(T)}\leq\tilde{C}_{2}\right) (φ2φ1(x)Ck,1(T)C~2,A1)\displaystyle\geq\mathbb{P}\left(\norm{\varphi^{2}\circ\varphi^{1}(x)}_{C^{k,1}(T)}\leq\tilde{C}_{2},A_{1}\right)
=𝔼[𝟏A1(φ2φ1(x)Ck,1(T)C~2|W0)]\displaystyle=\mathbb{E}\left[\mathbf{1}_{A_{1}}\mathbb{P}\left(\norm{\varphi^{2}\circ\varphi^{1}(x)}_{C^{k,1}(T)}\leq\tilde{C}_{2}\bigg{|}W^{0}\right)\right]
(1δl+1)(A1)\displaystyle\geq(1-\frac{\delta}{l+1})\mathbb{P}(A_{1})
12δl+1.\displaystyle\geq 1-\frac{2\delta}{l+1}.

By induction, we obtain the conclusion. ∎

Lemma B.6.

Suppose that σ\sigma satisfies Assumption 1 for k+1k+1. T𝒳T\subset\mathcal{X} is a compact set. Then, KlNT(x,x)Ck×k(T×T)K_{l}^{\mathrm{NT}}(x,x^{\prime})\in C^{k\times k}(T\times T) and KlRF(x,x)C(k+1)×(k+1)(T×T)K_{l}^{\mathrm{RF}}(x,x^{\prime})\in C^{(k+1)\times(k+1)}(T\times T) where KlNT,KlRFK_{l}^{\mathrm{NT}},K_{l}^{\mathrm{RF}} are defined as (6).

Proposition B.7.

Let $(X_{t})_{t\in T}$ be a centered Gaussian process on a compact set $T\subset\mathbb{R}^{d}$ with covariance function $k(s,t):T\times T\to\mathbb{R}$. Suppose that $k(s,t)$ is Hölder-continuous. Then for any $p\geq 0$ we have

𝔼suptT|Xt|p<.\displaystyle\mathbb{E}\sup_{t\in T}\absolutevalue{X_{t}}^{p}<\infty. (25)
Proof.

It is a standard application of Dudley’s integral; see Theorem 8.1.6 in [34]. Since $k(s,t)$ is Hölder-continuous, the canonical metric of this Gaussian process

d(s,t)=[𝔼(XsXt)2]1/2=k(s,s)+k(t,t)2k(s,t)\displaystyle d(s,t)=\left[\mathbb{E}(X_{s}-X_{t})^{2}\right]^{1/2}=\sqrt{k(s,s)+k(t,t)-2k(s,t)}

is also Hölder-continuous. Consequently, the covering number satisfies $\log\mathcal{N}(T,d,\varepsilon)\lesssim\log(1/\varepsilon)$, and Dudley’s integral $\int_{0}^{\infty}\sqrt{\log\mathcal{N}(T,d,\varepsilon)}\differential\varepsilon$ is finite. The result then follows from the tail bound

{sups,t|XsXt|C0log𝒩(T,d,ε)dε+udiam(T)}2exp(u2).\displaystyle\mathbb{P}\left\{\sup_{s,t}\absolutevalue{X_{s}-X_{t}}\geq C\int_{0}^{\infty}\sqrt{\log\mathcal{N}(T,d,\varepsilon)}\differential\varepsilon+u\mathrm{diam}(T)\right\}\leq 2\exp(-u^{2}).