
Convergence analysis of wide shallow neural operators within the framework of Neural Tangent Kernel

Xianliang Xu$^1$, Ye Li$^2$ and Zhongyi Huang$^1$
$^1$ Tsinghua University, Beijing, China.
$^2$ Nanjing University of Aeronautics and Astronautics, Nanjing, China.
Abstract.

Neural operators aim to approximate operators mapping between Banach spaces of functions and have achieved much success in the field of scientific computing. Compared to deep learning-based solvers such as Physics-Informed Neural Networks (PINNs) and the Deep Ritz Method (DRM), which are trained for one fixed equation, neural operators can solve a whole class of Partial Differential Equations (PDEs). Although much work has been done to analyze the approximation and generalization errors of neural operators, there is still a lack of analysis of their training error. In this work, we conduct a convergence analysis of gradient descent for wide shallow neural operators and physics-informed shallow neural operators within the framework of the Neural Tangent Kernel (NTK). The core idea lies in the fact that over-parameterization and random initialization together ensure that each weight vector remains near its initialization throughout all iterations, which yields the linear convergence of gradient descent. We demonstrate that, under over-parameterization, gradient descent finds the global minimum in both continuous and discrete time.

This work was partially supported by the NSFC Projects No. 12025104, 11871298, 81930119, 62106103.

1. Introduction

Partial Differential Equations (PDEs) are essential for modeling a wide range of phenomena in physics, biology, and engineering. Nonetheless, the numerical solution of PDEs has long been a significant challenge in scientific computation. Traditional numerical approaches, such as finite difference, finite element, finite volume, and spectral methods, can encounter difficulties due to the curse of dimensionality when applied to high-dimensional PDEs. In recent years, the impressive achievements of deep learning in various domains, including computer vision, natural language processing, and reinforcement learning, have led to increased interest in utilizing machine learning techniques to tackle PDE-related problems.

For scientific problems, neural network-based methods fall primarily into two categories: neural solvers and neural operators. Neural solvers, such as PINNs [1] and DRM [2], utilize neural networks to represent the solutions of PDEs, minimizing some form of residual so that the neural networks closely approximate the true solutions. Neural solvers have two potential advantages. First, they are unsupervised, which means they do not require the costly process of obtaining large numbers of labels as in supervised learning. Second, as a powerful representation tool, neural networks are known to be effective for approximating continuous functions [3], smooth functions [4], and Sobolev functions [5]. This presents a potentially viable avenue for addressing the challenges of high-dimensional PDEs. Nevertheless, existing neural solvers face several limitations compared to classical numerical solvers like FEM and FVM, particularly in terms of accuracy and convergence. In addition, neural solvers are typically limited to solving a fixed PDE: if certain parameters of the PDE change, the neural network must be retrained.

Neural operators (a setting also called operator learning) aim to approximate an unknown operator, which often takes the form of the solution operator associated with a differential equation. Unlike most supervised learning methods in machine learning, where both inputs and outputs are finite-dimensional, operator learning can be regarded as a form of supervised learning in function spaces. Because the inputs and outputs of neural networks are finite-dimensional, some operator learning methods, such as PCA-Net and DeepONet, use an encoder to convert infinite-dimensional inputs into finite-dimensional ones and a decoder to convert finite-dimensional outputs back into infinite-dimensional ones. The PCA-Net architecture was proposed as an operator learning framework in [6], where principal component analysis (PCA) is employed to obtain data-driven encoders and decoders, combined with a neural network mapping between the finite-dimensional latent spaces. Building on early work by [7], DeepONet [8] consists of a deep neural network for encoding the discrete input function space and another deep neural network for encoding the domain of the output functions. The encoding network is conventionally referred to as the “branch-net”, while the decoding network is referred to as the “trunk-net”. In contrast to the PCA-Net and DeepONet architectures mentioned above, the Fourier Neural Operator (FNO), introduced in [9], does not follow the encoder-decoder paradigm. Instead, FNO is a composition of linear integral operators and nonlinear activation functions, which can be seen as a generalization of the structure of finite-dimensional neural networks to a function-space setting.

The theoretical research on neural operators mostly focuses on approximation and generalization errors. As is well known, the theoretical basis for the application of neural networks lies in the fact that they are universal approximators, and this also holds for neural operators. The analysis of approximation errors in neural operators aims to identify whether neural operators also possess a universal approximation property, i.e., the ability to approximate a wide class of operators to any given accuracy. As shown in [7], (shallow) operator networks can approximate continuous operators mapping between spaces of continuous functions with arbitrary accuracy. Building on this result, DeepONets have also been proven to be universal approximators. For neural operators following the encoder-decoder paradigm, like DeepONet and PCA-Net, Lemma 22 in [10] provides a consistent approximation result, which states that if two Banach spaces have the approximation property, then continuous maps between them can be approximated in a finite-dimensional manner. The universal approximation capability of the FNO was initially established in [11], drawing on concepts from Fourier analysis and specifically leveraging the density of Fourier series to demonstrate the FNO’s ability to approximate a broad spectrum of operators. For a more quantitative analysis of the approximation error of neural operators, see [12]. In addition to approximation errors, the error analysis of encoder-decoder style neural operators also includes encoding and reconstruction errors. [13] provides both lower and upper bounds on the total error for DeepONets by using the spectral decay properties of the covariance operators associated with the underlying measures. By employing tools from non-parametric regression, [14] provides an analysis of the generalization error for neural operators with basis encoders and decoders; the results in [14] hold for neural operators with several popular encoders and decoders, such as those based on Legendre polynomials, trigonometric functions, and PCA. For more details on recent advances and theoretical research in operator learning, refer to the review [12].

Up to this point, the theoretical exploration of the convergence and optimization aspects of neural operators has received relatively little attention. To the best of our knowledge, only [15] and [16] have touched upon the optimization of neural operators. Based on restricted strong convexity (RSC), [15] presents a unified framework for gradient descent and applies it to DeepONets and FNOs, establishing convergence guarantees for both. [16] briefly analyzes the training of physics-informed DeepONets and derives a weighting scheme, guided by NTK theory, to balance the data and PDE residual terms in the loss function. In this paper, we focus on the training error of the shallow neural operator of [7] within the framework of the NTK, showing that gradient descent converges to the global optimum at a linear rate.

1.1. Notations

We denote $[n]=\{1,2,\cdots,n\}$ for $n\in\mathbb{N}$. Given a set $S$, we denote the uniform distribution on $S$ by $\mathrm{Unif}\{S\}$. We use $I\{E\}$ to denote the indicator function of the event $E$. We use $A\lesssim B$ to denote an estimate $A\leq cB$, where $c$ is a universal constant, i.e., a constant independent of any variables.

2. Preliminaries

The neural operator considered in this paper was originally introduced in [7], aiming to approximate a non-linear operator. Specifically, suppose that $\sigma$ is a continuous and non-polynomial function, $X$ is a Banach space, $K_1\subset X$ and $K_2\subset\mathbb{R}^d$ are compact sets in $X$ and $\mathbb{R}^d$, respectively, and $V$ is a compact set in $C(K_1)$. Assume that $G^*:V\rightarrow C(K_2)$ is a nonlinear continuous operator. Then an operator net can be formulated in terms of two shallow neural networks. The first is the so-called branch net $\beta(u)=(\beta_1(u),\cdots,\beta_m(u))$, defined for $1\leq r\leq m$ as

$$\beta_r(u):=\sum_{k=1}^{p}a_{rk}\,\sigma\left(\sum_{s=1}^{q}\xi_{rk}^{s}u(x_s)+\theta_{rk}\right),$$

where $\{x_s\}_{1\leq s\leq q}\subset K_1$ are the so-called sensors and $a_{rk},\xi_{rk}^{s},\theta_{rk}$ are the weights of the neural network.

The second neural network is the so-called trunk net $\tau(y)=(\tau_1(y),\cdots,\tau_m(y))$, defined as

$$\tau_r(y):=\sigma(w_r^T y+\zeta_r),\quad 1\leq r\leq m,$$

where $y\in K_2$ and $w_r,\zeta_r$ are the weights of the neural network. The branch net and trunk net are then combined to approximate the non-linear operator $G^*$, i.e.,

$$G^*(u)(y)\approx\sum_{r=1}^{m}\beta_r(u)\tau_r(y):=G(u)(y),\quad u\in V,\ y\in K_2.$$

As shown in [7], (shallow) operator networks can approximate, to arbitrary accuracy, continuous operators mapping between spaces of continuous functions. Specifically, for any $\epsilon>0$, there are positive integers $p,m$ and $q$, constants $a_{rk},\xi_{rk}^{s},\theta_{rk},\zeta_r\in\mathbb{R}$, $w_r\in\mathbb{R}^d$ and $x_s\in K_1$ for $s=1,\cdots,q$, $r=1,\cdots,m$ and $k=1,\cdots,p$, such that

$$\left|G^*(u)(y)-\sum_{r=1}^{m}\sum_{k=1}^{p}a_{rk}\sigma\left(\sum_{s=1}^{q}\xi_{rk}^{s}u(x_s)+\theta_{rk}\right)\sigma(w_r^T y+\zeta_r)\right|<\epsilon$$

holds for all $u\in V$ and $y\in K_2$.

The training of neural operators is performed via supervised learning, minimizing the mean-squared error between the predicted output $G(u)(y)$ and the true output $G^*(u)(y)$. Specifically, assume we have samples $\{(u_i,G^*(u_i))\}_{i=1}^{N}$ with $u_i\sim\mu$, where $\mu$ is a probability measure supported on $V$. The aim is to minimize the following loss function:

$$\frac12\sum_{i=1}^{N}\sum_{j=1}^{n}|G(u_i)(y_j)-G^*(u_i)(y_j)|^2.$$

In this paper, we primarily focus on shallow neural operators with ReLU activation functions. Formally, we consider a shallow neural operator of the following form:

$$G(u)(y)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u)\right]\sigma(w_r^{T}y), \qquad (1)$$

where we equate the function $u$ with its value vector $(u(x_1),\cdots,u(x_q))^T\in\mathbb{R}^q$ at the sensors $\{x_s\}_{s=1}^{q}$.
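For concreteness, the following minimal NumPy sketch (our own illustration; all names are hypothetical) evaluates the shallow neural operator (1) at a single pair $(u,y)$.

```python
import numpy as np

def forward(u, y, W, W_tilde, a_tilde):
    """Evaluate the shallow neural operator (1) at one input pair.

    u:        (q,)      value vector (u(x_1), ..., u(x_q)) of the input function
    y:        (d,)      query point
    W:        (m, d)    trunk weights w_r
    W_tilde:  (m, p, q) branch weights w~_rk
    a_tilde:  (m, p)    fixed +/-1 output weights a~_rk
    """
    m, p = a_tilde.shape
    relu = lambda x: np.maximum(x, 0.0)
    # branch_r(u) = (1/sqrt(p)) sum_k a~_rk sigma(w~_rk^T u)
    branch = (a_tilde * relu(W_tilde @ u)).sum(axis=1) / np.sqrt(p)  # (m,)
    # trunk_r(y) = sigma(w_r^T y)
    trunk = relu(W @ y)                                              # (m,)
    # G(u)(y) = (1/sqrt(m)) sum_r branch_r(u) * trunk_r(y)
    return branch @ trunk / np.sqrt(m)
```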

We denote the loss function by $L(W,\tilde{W},a)$. The main focus of this paper is to analyze gradient descent for training the shallow neural operator. We fix the output weights $\{\tilde{a}_{rk}\}_{r=1,k=1}^{m,p}$ and apply gradient descent (GD) to optimize the weights $\{w_r\}_{r=1}^{m}$ and $\{\tilde{w}_{rk}\}_{r=1,k=1}^{m,p}$. Specifically,

$$w_r(t+1)=w_r(t)-\eta\frac{\partial L(W(t),\tilde{W}(t))}{\partial w_r},\qquad \tilde{w}_{rk}(t+1)=\tilde{w}_{rk}(t)-\eta\frac{\partial L(W(t),\tilde{W}(t))}{\partial\tilde{w}_{rk}},$$

where $\eta>0$ is the learning rate and $L(W,\tilde{W})$ abbreviates $L(W,\tilde{W},a)$.

At this point, the loss function is

$$L(W,\tilde{W})=\frac12\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}(G(u_i)(y_j)-z_j^i)^2,$$

where $z_j^i=G^*(u_i)(y_j)$. Throughout this paper, we consider the initialization

$$w_r(0)\sim\mathcal{N}(\bm{0},\bm{I}),\quad \tilde{w}_{rk}(0)\sim\mathcal{N}(\bm{0},\bm{I}),\quad \tilde{a}_{rk}\sim \mathrm{Unif}\{-1,1\}, \qquad (2)$$

and assume that $\|u_i\|_2=\mathcal{O}(1)$ and $\|y_j\|_2=\mathcal{O}(1)$ for all $i\in[n_1]$ and $j\in[n_2]$. Note that here we again treat the vector $u_i=(u_i(x_1),\cdots,u_i(x_q))$ and the function $u_i$ as equivalent.
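A minimal sketch of the initialization (2) and one GD step of the update rule above; the gradients are the closed-form ReLU derivatives of (1), and the names are again ours (`forward` is the sketch shown earlier).

```python
def init_params(m, p, q, d, rng):
    """Random initialization (2): Gaussian weights, +/-1 output weights."""
    W = rng.standard_normal((m, d))
    W_tilde = rng.standard_normal((m, p, q))
    a_tilde = rng.choice([-1.0, 1.0], size=(m, p))
    return W, W_tilde, a_tilde

def gd_step(us, ys, z, W, W_tilde, a_tilde, eta):
    """One gradient-descent step on the squared loss; a_tilde stays fixed."""
    m, p = a_tilde.shape
    gW, gWt = np.zeros_like(W), np.zeros_like(W_tilde)
    for i, u in enumerate(us):
        branch = (a_tilde * np.maximum(W_tilde @ u, 0)).sum(1) / np.sqrt(p)  # (m,)
        act_u = (W_tilde @ u >= 0)                                           # (m, p)
        for j, y in enumerate(ys):
            trunk = np.maximum(W @ y, 0)                                     # (m,)
            res = branch @ trunk / np.sqrt(m) - z[i, j]                      # G - z
            # dG/dw_r = branch_r * I{w_r^T y >= 0} * y / sqrt(m)
            gW += res * (branch * (W @ y >= 0))[:, None] * y / np.sqrt(m)
            # dG/dw~_rk = (a~_rk/sqrt(p)) * I{w~_rk^T u >= 0} * trunk_r * u / sqrt(m)
            coef = a_tilde * act_u * trunk[:, None] / np.sqrt(m * p)         # (m, p)
            gWt += res * coef[:, :, None] * u
    return W - eta * gW, W_tilde - eta * gWt
```

Here `us` is the list of input value vectors, `ys` the list of query points, and `z[i, j]` the label $G^*(u_i)(y_j)$; iterating `gd_step` with a small `eta` is the scheme analyzed below.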

3. Continuous Time Analysis

In this section, we present our result for gradient flow, which can be viewed as gradient descent with an infinitesimal step size. The analysis of gradient flow serves as a foundational step for understanding discrete gradient descent. We prove that gradient flow converges to the global optimum of the loss under over-parameterization and mild conditions on the training samples. The continuous-time dynamics can be characterized as

$$\frac{dw_r(t)}{dt}=-\frac{\partial L(W(t),\tilde{W}(t))}{\partial w_r},\qquad \frac{d\tilde{w}_{rk}(t)}{dt}=-\frac{\partial L(W(t),\tilde{W}(t))}{\partial\tilde{w}_{rk}}$$

for $r\in[m],k\in[p]$. We denote by $G^t(u_i)(y_j)$ the prediction at $y_j$ under $u_i$ at time $t$, i.e., with weights $w_r(t),\tilde{w}_{rk}(t)$.

Thus, we can deduce that

$$\frac{dw_r(t)}{dt}=-\frac{\partial L(W(t),\tilde{W}(t))}{\partial w_r}=-\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\frac{\partial G^t(u_i)(y_j)}{\partial w_r}\,(G^t(u_i)(y_j)-z_j^i) \qquad (3)$$

and

$$\frac{d\tilde{w}_{rk}(t)}{dt}=-\frac{\partial L(W(t),\tilde{W}(t))}{\partial\tilde{w}_{rk}}=-\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\frac{\partial G^t(u_i)(y_j)}{\partial\tilde{w}_{rk}}\,(G^t(u_i)(y_j)-z_j^i). \qquad (4)$$

Then, the dynamics of each prediction can be calculated as follows.

$$\begin{aligned}
\frac{dG^t(u_i)(y_j)}{dt}&=\sum_{r=1}^{m}\left\langle\frac{\partial G^t(u_i)(y_j)}{\partial w_r},\frac{dw_r(t)}{dt}\right\rangle+\sum_{r=1}^{m}\sum_{k=1}^{p}\left\langle\frac{\partial G^t(u_i)(y_j)}{\partial\tilde{w}_{rk}},\frac{d\tilde{w}_{rk}(t)}{dt}\right\rangle\\
&=-\sum_{r=1}^{m}\sum_{i_1=1}^{n_1}\sum_{j_1=1}^{n_2}\left\langle\frac{\partial G^t(u_i)(y_j)}{\partial w_r},\frac{\partial G^t(u_{i_1})(y_{j_1})}{\partial w_r}\right\rangle(G^t(u_{i_1})(y_{j_1})-z_{j_1}^{i_1})\\
&\quad-\sum_{r=1}^{m}\sum_{k=1}^{p}\sum_{i_1=1}^{n_1}\sum_{j_1=1}^{n_2}\left\langle\frac{\partial G^t(u_i)(y_j)}{\partial\tilde{w}_{rk}},\frac{\partial G^t(u_{i_1})(y_{j_1})}{\partial\tilde{w}_{rk}}\right\rangle(G^t(u_{i_1})(y_{j_1})-z_{j_1}^{i_1}).
\end{aligned} \qquad (5)$$

We let $G^t(u_i):=(G^t(u_i)(y_1),\cdots,G^t(u_i)(y_{n_2}))\in\mathbb{R}^{n_2}$ and $G^t(u):=(G^t(u_1),\cdots,G^t(u_{n_1}))\in\mathbb{R}^{n_1 n_2}$. Then we have

$$\frac{dG^t(u_i)}{dt}=\left[(H_1^i(t),\cdots,H_{n_1}^i(t))+(\tilde{H}_1^i(t),\cdots,\tilde{H}_{n_1}^i(t))\right](z-G^t(u)), \qquad (6)$$

where $H_j^i(t)\in\mathbb{R}^{n_2\times n_2}$, whose $(i_1,j_1)$-th entry is defined as

$$\sum_{r=1}^{m}\left\langle\frac{\partial G^t(u_i)(y_{i_1})}{\partial w_r},\frac{\partial G^t(u_j)(y_{j_1})}{\partial w_r}\right\rangle$$

and $\tilde{H}_j^i(t)\in\mathbb{R}^{n_2\times n_2}$, whose $(i_1,j_1)$-th entry is defined as

$$\sum_{r=1}^{m}\sum_{k=1}^{p}\left\langle\frac{\partial G^t(u_i)(y_{i_1})}{\partial\tilde{w}_{rk}},\frac{\partial G^t(u_j)(y_{j_1})}{\partial\tilde{w}_{rk}}\right\rangle.$$

Thus, we can write the dynamics of predictions as follows.

$$\frac{dG^t(u)}{dt}=\left[H(t)+\tilde{H}(t)\right](z-G^t(u)),$$

where $H(t),\tilde{H}(t)\in\mathbb{R}^{n_1 n_2\times n_1 n_2}$. We can divide $H(t),\tilde{H}(t)$ into $n_1\times n_1$ blocks; the $(i,j)$-th block of $H(t)$ is $H_j^i(t)$ and the $(i,j)$-th block of $\tilde{H}(t)$ is $\tilde{H}_j^i(t)$. From the form of $H(t),\tilde{H}(t)$, we can derive the Gram matrices induced by the random initialization, which we denote by $H^\infty$ and $\tilde{H}^\infty$. Although $H^\infty$ and $\tilde{H}^\infty$ are large matrices, we can divide them into $n_1\times n_1$ blocks, where each block is an $n_2\times n_2$ matrix. Following the notation above, the $(i_1,j_1)$-th entry of the $(i,j)$-th block of $H^\infty$ is

$$\mathbb{E}[\sigma(\tilde{w}^T u_i)\sigma(\tilde{w}^T u_j)]\,\mathbb{E}[y_{i_1}^T y_{j_1}I\{w^T y_{i_1}\geq 0,w^T y_{j_1}\geq 0\}].$$

Thus, the $(i,j)$-th block can be written as

$$\mathbb{E}[\sigma(\tilde{w}^T u_i)\sigma(\tilde{w}^T u_j)]\,H_2^\infty,$$

where $H_2^\infty\in\mathbb{R}^{n_2\times n_2}$ and the $(i_1,j_1)$-th entry of $H_2^\infty$ is

$$\mathbb{E}[y_{i_1}^T y_{j_1}I\{w^T y_{i_1}\geq 0,w^T y_{j_1}\geq 0\}].$$

Thus, $H^\infty$ can be seen as the Kronecker product of matrices $H_1^\infty$ and $H_2^\infty$, where $H_1^\infty\in\mathbb{R}^{n_1\times n_1}$ and the $(i,j)$-th entry of $H_1^\infty$ is $\mathbb{E}[\sigma(\tilde{w}^T u_i)\sigma(\tilde{w}^T u_j)]$. Similarly, $\tilde{H}^\infty$ is the Kronecker product of matrices $\tilde{H}_1^\infty$ and $\tilde{H}_2^\infty$, where $\tilde{H}_1^\infty\in\mathbb{R}^{n_1\times n_1}$, the $(i,j)$-th entry of $\tilde{H}_1^\infty$ is $\mathbb{E}[u_i^T u_j I\{\tilde{w}^T u_i\geq 0,\tilde{w}^T u_j\geq 0\}]$, and the $(i_1,j_1)$-th entry of $\tilde{H}_2^\infty$ is $\mathbb{E}[\sigma(w^T y_{i_1})\sigma(w^T y_{j_1})]$.
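The Kronecker structure can also be checked numerically. The sketch below (a hypothetical helper of ours) forms Monte Carlo estimates of $H_1^\infty$ and $H_2^\infty$ from fresh Gaussian weights and assembles $H^\infty=H_1^\infty\otimes H_2^\infty$.

```python
def gram_infinity(us, ys, n_mc=100_000, seed=0):
    """Monte Carlo estimate of H^infty = H1^infty (Kronecker) H2^infty."""
    rng = np.random.default_rng(seed)
    U = np.stack(us)                                 # (n1, q)
    Y = np.stack(ys)                                 # (n2, d)
    Wt = rng.standard_normal((n_mc, U.shape[1]))     # samples of w~ ~ N(0, I)
    W = rng.standard_normal((n_mc, Y.shape[1]))      # samples of w  ~ N(0, I)
    # H1[i, j] = E[sigma(w~^T u_i) sigma(w~^T u_j)]
    S = np.maximum(Wt @ U.T, 0.0)                    # (n_mc, n1)
    H1 = S.T @ S / n_mc
    # H2[i1, j1] = y_i1^T y_j1 * P(w^T y_i1 >= 0, w^T y_j1 >= 0)
    A = (W @ Y.T >= 0).astype(float)                 # (n_mc, n2)
    H2 = (Y @ Y.T) * (A.T @ A / n_mc)
    return np.kron(H1, H2)
```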

Similar to the situation in $L^2$ regression, we can show two essential facts: (1) $\|H(0)-H^\infty\|_2=\mathcal{O}(1/\sqrt{m})$ and $\|\tilde{H}(0)-\tilde{H}^\infty\|_2=\mathcal{O}(1/\sqrt{m})$; and (2) for all $t\geq 0$, $\|H(t)-H(0)\|_2=\mathcal{O}(1/\sqrt{m})$ and $\|\tilde{H}(t)-\tilde{H}(0)\|_2=\mathcal{O}(1/\sqrt{m})$.

Therefore, roughly speaking, as $m\to\infty$, the dynamics of the predictions can be written as

$$\frac{d}{dt}G^t(u)=\left(H^\infty+\tilde{H}^\infty\right)(z-G^t(u)),$$

which results in linear convergence.
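To make the rate explicit, note that this limiting ODE is linear with a constant coefficient matrix, so it can be solved in closed form:

$$z-G^t(u)=e^{-(H^\infty+\tilde{H}^\infty)t}(z-G^0(u)),\qquad \|z-G^t(u)\|_2\leq e^{-(\lambda_0+\tilde{\lambda}_0)t}\|z-G^0(u)\|_2,$$

since the least eigenvalue of $H^\infty+\tilde{H}^\infty$ is at least $\lambda_0+\tilde{\lambda}_0$ (defined below); Theorem 1 establishes a comparable rate at finite width.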

We first show that the Gram matrices are strictly positive definite under mild assumptions.

Lemma 1.

If no two samples in $\{y_j\}_{j=1}^{n_2}$ are parallel and no two samples in $\{u_i\}_{i=1}^{n_1}$ are parallel, then $H^\infty$ and $\tilde{H}^\infty$ are strictly positive definite. We denote the least eigenvalues of $H^\infty$ and $\tilde{H}^\infty$ by $\lambda_0$ and $\tilde{\lambda}_0$, respectively.

Remark 1.

In fact, when we consider neural networks with bias, Lemma 1 holds naturally. Specifically, for $r\in[m],j\in[n_2]$, we can replace $w_r$ and $y_j$ by $(w_r^T,1)^T$ and $(y_j^T,1)^T$. Then Lemma 1 holds under the condition that no two samples in $\{y_j\}_{j=1}^{n_2}$ are identical, which is naturally satisfied.

We can then verify the two facts, that $H(0),\tilde{H}(0)$ are close to $H^\infty,\tilde{H}^\infty$ and that $H(t),\tilde{H}(t)$ stay close to $H(0),\tilde{H}(0)$, via the following two lemmas.

Lemma 2.

If $m=\Omega\left(\frac{n_1^2 n_2^2}{\min(\lambda_0^2,\tilde{\lambda}_0^2)}\log\left(\frac{n_1 n_2}{\delta}\right)\right)$, then with probability at least $1-\delta$, $\|H(0)-H^\infty\|_2\leq\frac{\lambda_0}{4}$, $\|\tilde{H}(0)-\tilde{H}^\infty\|_2\leq\frac{\tilde{\lambda}_0}{4}$, $\lambda_{\min}(H(0))\geq\frac34\lambda_0$ and $\lambda_{\min}(\tilde{H}(0))\geq\frac34\tilde{\lambda}_0$.

Lemma 3.

Let $R,\tilde{R}\in(0,1)$ and suppose $w_1(0),\cdots,w_m(0),\tilde{w}_{11}(0),\cdots,\tilde{w}_{mp}(0)$ are generated i.i.d. from $\mathcal{N}(\bm{0},\bm{I})$. For any set of weight vectors $w_1,\cdots,w_m\in\mathbb{R}^d$ and $\tilde{w}_{11},\cdots,\tilde{w}_{mp}\in\mathbb{R}^q$ that satisfy $\|w_r-w_r(0)\|_2\leq R$ and $\|\tilde{w}_{rk}-\tilde{w}_{rk}(0)\|_2\leq\tilde{R}$ for all $r\in[m],k\in[p]$, we have with probability at least $1-\delta-n_2\exp(-mR)$,

$$\|H-H(0)\|_F\lesssim n_1 n_2 p\tilde{R}^2+n_1 n_2\sqrt{p}\,\tilde{R}\sqrt{\log\left(\frac{mn_1}{\delta}\right)}+n_1 n_2 R\log\left(\frac{mn_1}{\delta}\right) \qquad (7)$$

and with probability at least $1-\delta-n_2\exp(-mp\tilde{R})$,

$$\|\tilde{H}-\tilde{H}(0)\|_F\lesssim n_1 n_2 R\sqrt{\log\left(\frac{n_2}{\delta}\right)}+n_1 n_2\tilde{R}\log\left(\frac{mn_2}{\delta}\right), \qquad (8)$$

where the $(i_1,j_1)$-th entry of the $(i,j)$-th block of $H$ is

$$H_{i,j}^{i_1,j_1}=\frac1m\sum_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^T u_i)\right]\left[\frac{1}{\sqrt{p}}\sum_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^T u_j)\right]y_{i_1}^T y_{j_1}I\{w_r^T y_{i_1}\geq 0,w_r^T y_{j_1}\geq 0\}$$

and the $(i_1,j_1)$-th entry of the $(i,j)$-th block of $\tilde{H}$ is

$$\tilde{H}_{i,j}^{i_1,j_1}=\frac{1}{mp}\sum_{r=1}^{m}\sum_{k=1}^{p}u_i^T u_j I\{\tilde{w}_{rk}^T u_i\geq 0,\tilde{w}_{rk}^T u_j\geq 0\}\,\sigma(w_r^T y_{i_1})\sigma(w_r^T y_{j_1}).$$
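For reference, the finite-width blocks above can be computed directly from the weights; the following vectorized sketch (a hypothetical helper of ours) returns the $(i,j)$-th blocks of $H$ and $\tilde{H}$.

```python
def gram_blocks(ui, uj, Y, W, W_tilde, a_tilde):
    """(i,j)-th n2 x n2 blocks of H and H~ for given weights."""
    m, p = a_tilde.shape
    bi = (a_tilde * np.maximum(W_tilde @ ui, 0)).sum(1) / np.sqrt(p)   # (m,)
    bj = (a_tilde * np.maximum(W_tilde @ uj, 0)).sum(1) / np.sqrt(p)
    Ay = (W @ Y.T >= 0).astype(float)                                  # (m, n2)
    # H block: (1/m) sum_r b_i,r b_j,r (y_i1^T y_j1) I_r(i1) I_r(j1)
    H_ij = (Y @ Y.T) * ((Ay * (bi * bj)[:, None]).T @ Ay) / m
    # H~ block: (u_i^T u_j/(mp)) sum_{r,k} I{w~_rk active for both inputs}
    #           * sigma(w_r^T y_i1) sigma(w_r^T y_j1)
    Sy = np.maximum(W @ Y.T, 0.0)                                      # (m, n2)
    c = ((W_tilde @ ui >= 0) & (W_tilde @ uj >= 0)).sum(1)             # (m,)
    Ht_ij = (ui @ uj) * (Sy.T @ (c[:, None] * Sy)) / (m * p)
    return H_ij, Ht_ij
```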

With these preparations, we come to the final conclusion.

Theorem 1.

Suppose the condition in Lemma 1 holds and the initialization is as described in (2). Then with probability at least $1-\delta$, we have

$$\|z-G^t(u)\|_2^2\leq \exp(-(\lambda_0+\tilde{\lambda}_0)t)\,\|z-G^0(u)\|_2^2,$$

where

$$m=\Omega\left(\frac{n_1^4 n_2^4\log\left(\frac{n_1 n_2}{\delta}\right)\log^3\left(\frac{m}{\delta}\right)}{(\min(\lambda_0,\tilde{\lambda}_0))^2(\lambda_0+\tilde{\lambda}_0)^2}\right).$$

Proof Sketch: Note that

$$\frac{d}{dt}\|z-G^t(u)\|_2^2=-2(z-G^t(u))^T(H(t)+\tilde{H}(t))(z-G^t(u)), \qquad (9)$$

thus if $\lambda_{\min}(H(t))\geq\lambda_0/2$ and $\lambda_{\min}(\tilde{H}(t))\geq\tilde{\lambda}_0/2$, we have

$$\frac{d}{dt}\|z-G^t(u)\|_2^2\leq-(\lambda_0+\tilde{\lambda}_0)\|z-G^t(u)\|_2^2.$$

This yields $\frac{d}{dt}\left(\exp((\lambda_0+\tilde{\lambda}_0)t)\|z-G^t(u)\|_2^2\right)\leq 0$, i.e., $\exp((\lambda_0+\tilde{\lambda}_0)t)\|z-G^t(u)\|_2^2$ is non-increasing, and thus

$$\|z-G^t(u)\|_2^2\leq \exp(-(\lambda_0+\tilde{\lambda}_0)t)\,\|z-G^0(u)\|_2^2. \qquad (10)$$

On the other hand, roughly speaking, the continuous dynamics of $w_r(t)$ and $\tilde{w}_{rk}(t)$, i.e., (3) and (4), show that

$$\left\|\frac{dw_r(t)}{dt}\right\|_2\sim\frac{\|z-G^t(u)\|_2}{\sqrt{m}},\qquad \left\|\frac{d\tilde{w}_{rk}(t)}{dt}\right\|_2\sim\frac{\|z-G^t(u)\|_2}{\sqrt{mp}}.$$

Thus, if the residual decays like (10), we can deduce that

$$\|w_r(t)-w_r(0)\|_2\sim\frac{1}{\sqrt{m}},\qquad \|\tilde{w}_{rk}(t)-\tilde{w}_{rk}(0)\|_2\sim\frac{1}{\sqrt{mp}}.$$
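The step from the decay (10) to these bounds is a direct integration, written out here up to the constants hidden in $\sim$:

$$\|w_r(t)-w_r(0)\|_2\leq\int_0^t\left\|\frac{dw_r(s)}{ds}\right\|_2 ds\lesssim\frac{1}{\sqrt{m}}\int_0^\infty e^{-\frac{(\lambda_0+\tilde{\lambda}_0)s}{2}}\|z-G^0(u)\|_2\,ds=\frac{2\|z-G^0(u)\|_2}{\sqrt{m}(\lambda_0+\tilde{\lambda}_0)},$$

and the same computation with $\sqrt{mp}$ in place of $\sqrt{m}$ gives the bound for $\tilde{w}_{rk}(t)$.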

Combining this with the stability of the discrete Gram matrices, i.e., Lemma 3, we have $\|H(t)-H^\infty\|_2\leq\lambda_0/4$, $\|\tilde{H}(t)-\tilde{H}^\infty\|_2\leq\tilde{\lambda}_0/4$ and $\lambda_{\min}(H(t))\geq\lambda_0/2$, $\lambda_{\min}(\tilde{H}(t))\geq\tilde{\lambda}_0/2$ when $m$ is sufficiently large.

Combining these estimates, we arrive at the desired conclusion.

Remark 2.

The result in Theorem 1 requires $m=\Omega(\mathrm{Poly}(1/\min(\lambda_0,\tilde{\lambda}_0)))$, which may be a stringent requirement on $m$. In fact, from (9) we can see that $\lambda_{\min}(H(t))\geq\lambda_0/2$ or $\lambda_{\min}(\tilde{H}(t))\geq\tilde{\lambda}_0/2$ alone is enough, since both matrices are positive semi-definite. Thus, $m=\Omega(\mathrm{Poly}(1/\max(\lambda_0,\tilde{\lambda}_0)))$ is sufficient.

4. Discrete Time Analysis

In this section, we demonstrate that randomly initialized gradient descent for training shallow neural operators converges to the global minimum at a linear rate. Unlike the continuous-time case, the discrete-time case requires a more refined analysis. In the following, we first present our main result and then outline the proof.

Theorem 2.

Under the setting of Theorem 1, if we set $\eta=\mathcal{O}\left(\frac{1}{\|H^\infty\|_2+\|\tilde{H}^\infty\|_2}\right)$, then with probability at least $1-\delta$, we have

$$\|z-G^t(u)\|_2^2\leq\left(1-\eta\frac{\lambda_0+\tilde{\lambda}_0}{2}\right)^t\|z-G^0(u)\|_2^2,$$

where

$$m=\Omega\left(\frac{n_1^4 n_2^4\log\left(\frac{n_1 n_2}{\delta}\right)\log^3\left(\frac{m}{\delta}\right)}{(\min(\lambda_0,\tilde{\lambda}_0))^2(\lambda_0+\tilde{\lambda}_0)^2}\right).$$

For $L^2$ regression problems, [17] demonstrated that if the learning rate $\eta=\mathcal{O}(\lambda_0/n^2)$, then randomly initialized gradient descent converges to a globally optimal solution at a linear rate when $m$ is large enough. The requirement on $\eta$ is derived from the decomposition of the residual at the $(k+1)$-th iteration, i.e.,

$$y-u(k+1)=y-u(k)-(u(k+1)-u(k)),$$

where $u(t)$ is the prediction vector of the shallow neural network at the $t$-th iteration and $y$ is the vector of labels. Although the method of [17] can also yield linear convergence of gradient descent for training shallow neural operators, the resulting requirement on the learning rate $\eta$ would be very stringent due to the dependence on $\lambda_0,\tilde{\lambda}_0$ and $n$. Thus, instead of decomposing the residual into the two terms above, we write it as follows, which serves as a recursion formula.

Lemma 4.

For all $t\in\mathbb{N}$, we have

$$z-G^{t+1}(u)=\left[I-\eta\left(H(t)+\tilde{H}(t)\right)\right]\left(z-G^t(u)\right)-I(t),$$

where $I(t)\in\mathbb{R}^{n_1 n_2}$ is a residual term. We can divide it into $n_1$ blocks, each belonging to $\mathbb{R}^{n_2}$, and the $j$-th component of the $i$-th block is defined as

$$I_{i,j}(t)=G^{t+1}(u_i)(y_j)-G^t(u_i)(y_j)-\left\langle\frac{\partial G^t(u_i)(y_j)}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial G^t(u_i)(y_j)}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle. \qquad (11)$$

Just as in the case of $L^2$ regression, we prove our conclusion by induction. From the recursive formula above, it can be seen that the estimation of $H(t)$ and $\tilde{H}(t)$, as well as of the residual $I(t)$, depends on $\|w_r(t)-w_r(0)\|_2$ and $\|\tilde{w}_{rk}(t)-\tilde{w}_{rk}(0)\|_2$. Therefore, our inductive hypothesis concerns the following differences between the weights and their initializations.

Condition 1.

At the $s$-th iteration, we have

$$\|w_r(s)-w_r(0)\|_2\lesssim\frac{\sqrt{n_1 n_2}\|z-G^0(u)\|_2}{\sqrt{m}(\lambda_0+\tilde{\lambda}_0)}\sqrt{\log\left(\frac{m}{\delta}\right)}:=R' \qquad (12)$$

and

$$\|\tilde{w}_{rk}(s)-\tilde{w}_{rk}(0)\|_2\lesssim\frac{\sqrt{n_1 n_2}\|z-G^0(u)\|_2}{\sqrt{mp}(\lambda_0+\tilde{\lambda}_0)}\sqrt{\log\left(\frac{m}{\delta}\right)}:=\tilde{R}' \qquad (13)$$

and $|w_r(s)^T y_j|\leq B$ for all $r\in[m]$ and $j\in[n_2]$, where $B=2\sqrt{\log\left(\frac{mn_2}{\delta}\right)}$.

This condition leads to the linear convergence of gradient descent, i.e., the result in Theorem 2.

Corollary 1.

If Condition 1 holds for $s=0,\cdots,t$, then we have that

$$\|z-G^s(u)\|_2^2\leq\left(1-\eta\frac{\lambda_0+\tilde{\lambda}_0}{2}\right)^s\|z-G^0(u)\|_2^2$$

holds for $s=0,\cdots,t$, where $m$ is required to satisfy

$$m=\Omega\left(\frac{n_1^4 n_2^4\log\left(\frac{n_1 n_2}{\delta}\right)\log^3\left(\frac{m}{\delta}\right)}{(\min(\lambda_0,\tilde{\lambda}_0))^2(\lambda_0+\tilde{\lambda}_0)^2}\right).$$

Proof Sketch: Under the setting of over-parameterization, we can show that the weights $w_r(s),\tilde{w}_{rk}(s)$ stay close to their initializations $w_r(0),\tilde{w}_{rk}(0)$. Thus, by the stability of the discrete Gram matrices, i.e., Lemma 3, we can deduce that $\lambda_{\min}(H(s))\geq\lambda_0/2$ and $\lambda_{\min}(\tilde{H}(s))\geq\tilde{\lambda}_0/2$. Then, combining with Lemma 4, we have

$$\begin{aligned}
\|z-G^{s+1}(u)\|_2^2&=\left\|\left(I-\eta\left(H(s)+\tilde{H}(s)\right)\right)(z-G^s(u))\right\|_2^2+\|I(s)\|_2^2-2\left\langle\left(I-\eta(H(s)+\tilde{H}(s))\right)(z-G^s(u)),I(s)\right\rangle\\
&\leq\left(1-\eta\frac{\lambda_0+\tilde{\lambda}_0}{2}\right)^2\|z-G^s(u)\|_2^2+\|I(s)\|_2^2+2\left(1-\eta\frac{\lambda_0+\tilde{\lambda}_0}{2}\right)\|z-G^s(u)\|_2\|I(s)\|_2,
\end{aligned} \qquad (14)$$

where the inequality requires that $I-\eta(H(s)+\tilde{H}(s))$ is positive definite. Since $\|H(s)-H^\infty\|_2=\mathcal{O}(1/\sqrt{m})$ and $\|\tilde{H}(s)-\tilde{H}^\infty\|_2=\mathcal{O}(1/\sqrt{m})$, the choice $\eta=\mathcal{O}\left(\frac{1}{\|H^\infty\|_2+\|\tilde{H}^\infty\|_2}\right)$ is sufficient to ensure that $I-\eta(H(s)+\tilde{H}(s))$ is positive definite when $m$ is large enough.
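Spelling this step out: for such $\eta$ and large $m$, every eigenvalue $\lambda$ of $I-\eta(H(s)+\tilde{H}(s))$ satisfies

$$0<1-\eta\left(\|H(s)\|_2+\|\tilde{H}(s)\|_2\right)\leq\lambda\leq 1-\eta\,\lambda_{\min}(H(s)+\tilde{H}(s))\leq 1-\eta\,\frac{\lambda_0+\tilde{\lambda}_0}{2},$$

using $\lambda_{\min}(H(s))\geq\lambda_0/2$, $\lambda_{\min}(\tilde{H}(s))\geq\tilde{\lambda}_0/2$ and $\|H(s)\|_2+\|\tilde{H}(s)\|_2\leq\|H^\infty\|_2+\|\tilde{H}^\infty\|_2+\mathcal{O}(1/\sqrt{m})$.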

From (14), if $\|I(s)\|_2\lesssim\eta(\lambda_0+\tilde{\lambda}_0)\|z-G^s(u)\|_2$, which follows from the lemma below, we obtain

$$\|z-G^{s+1}(u)\|_2^2\leq\left(1-\eta\frac{\lambda_0+\tilde{\lambda}_0}{2}\right)\|z-G^s(u)\|_2^2,$$

which directly yields the desired conclusion.

Lemma 5.

Under Condition 1, for $s=0,\cdots,t-1$, we have

$$\|I(s)\|_2\lesssim\bar{R}\,\|z-G^s(u)\|_2, \qquad (15)$$

where

$$\bar{R}=\frac{\eta(n_1 n_2)^{\frac32}\|z-G^0(u)\|_2}{\sqrt{m}(\lambda_0+\tilde{\lambda}_0)}\log^{\frac32}\left(\frac{m}{\delta}\right). \qquad (16)$$

5. Physics-Informed Neural Operators

Let $\Gamma$ be a bounded open subset of $\mathbb{R}^d$. In this section, we consider a PDE of the following form:

$$\begin{aligned}
\mathcal{L}u(y)&=f(y),&& y\in(0,T)\times\Gamma\\
u(\tilde{y})&=g(\tilde{y}),&& \tilde{y}\in(\{0\}\times\Gamma)\cup([0,T]\times\partial\Gamma),
\end{aligned} \qquad (17)$$

where $y=(y_0,y_1,\cdots,y_d)$ with $y_0\in(0,T)$ and $(y_1,\cdots,y_d)\in\bar{\Gamma}$, and $\mathcal{L}$ is the differential operator

$$\mathcal{L}u(y)=\frac{\partial u}{\partial y_0}(y)-\sum_{i=1}^{d}\frac{\partial^2 u}{\partial y_i^2}(y)+u(y).$$

In this section, we consider the shallow neural operator of the following form:

$$G(u)(y)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^T u)\right]\sigma_3(w_r^T y),$$

where $\sigma_3(\cdot),\sigma_2(\cdot),\sigma(\cdot)$ denote the $\text{ReLU}^3$, $\text{ReLU}^2$ and ReLU activation functions, respectively.
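The cubic activation makes $\mathcal{L}G(u)(y)$ available in closed form; a short computation (which is where $\sigma_2$ and $\sigma$ enter) gives, for a single trunk neuron,

$$\mathcal{L}\sigma_3(w^T y)=3w_0\,\sigma_2(w^T y)-6\left(\sum_{i=1}^{d}w_i^2\right)\sigma(w^T y)+\sigma_3(w^T y),$$

since $\frac{\partial}{\partial y_0}\sigma_3(w^T y)=3\sigma_2(w^T y)\,w_0$ and $\frac{\partial^2}{\partial y_i^2}\sigma_3(w^T y)=6\sigma(w^T y)\,w_i^2$.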

Given samples $y_1,\cdots,y_{n_2}$ in the interior and $\tilde{y}_1,\cdots,\tilde{y}_{n_3}$ on the boundary, the physics-informed loss function is

$$L(W,\tilde{W}):=\sum_{i=1}^{n_1}\sum_{j_1=1}^{n_2}\frac{1}{n_2}(\mathcal{L}G(u_i)(y_{j_1})-f(y_{j_1}))^2+\sum_{i=1}^{n_1}\sum_{j_2=1}^{n_3}\frac{1}{n_3}(G(u_i)(\tilde{y}_{j_2})-g(\tilde{y}_{j_2}))^2.$$
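Thanks to the closed form above, the interior residual $\mathcal{L}G(u_i)(y_{j_1})-f(y_{j_1})$ can be evaluated without automatic differentiation. A minimal sketch (names hypothetical; $y=(y_0,\cdots,y_d)$ stacks the time and space coordinates, so $W$ has $d+1$ columns):

```python
def pinn_residual(u, y, f_val, W, W_tilde, a_tilde):
    """Interior residual L G(u)(y) - f(y) for the ReLU^3-trunk operator.

    Uses L sigma3(w^T y) = 3 w_0 sigma2(w^T y)
                           - 6 (sum_{i>=1} w_i^2) sigma(w^T y) + sigma3(w^T y).
    """
    m, p = a_tilde.shape
    branch = (a_tilde * np.maximum(W_tilde @ u, 0)).sum(1) / np.sqrt(p)  # (m,)
    s = np.maximum(W @ y, 0.0)                                           # sigma(w_r^T y)
    L_trunk = 3 * W[:, 0] * s**2 - 6 * (W[:, 1:] ** 2).sum(1) * s + s**3
    return branch @ L_trunk / np.sqrt(m) - f_val
```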

Let

$$s(u_i)(y_{j_1})=\frac{1}{\sqrt{n_2}}(\mathcal{L}G(u_i)(y_{j_1})-f(y_{j_1}))$$

and

$$h(u_i)(\tilde{y}_{j_2})=\frac{1}{\sqrt{n_3}}(G(u_i)(\tilde{y}_{j_2})-g(\tilde{y}_{j_2})),$$

then the loss function can be written as

$$L(W,\tilde{W})=\sum_{i=1}^{n_1}\left(\|s(u_i)\|_2^2+\|h(u_i)\|_2^2\right),$$

where $s(u_i)=(s(u_i)(y_1),\cdots,s(u_i)(y_{n_2}))$ and $h(u_i)=(h(u_i)(\tilde{y}_1),\cdots,h(u_i)(\tilde{y}_{n_3}))$.

We first consider the continuous setting, which is a stepping stone towards understanding discrete algorithms. For $w_r(t)$ and $\tilde{w}_{rk}(t)$, we have

$$\begin{aligned}
\frac{dw_r(t)}{dt}&=-\frac{\partial L(W(t),\tilde{W}(t))}{\partial w_r}\\
&=-\sum_{i=1}^{n_1}\sum_{j_1=1}^{n_2}s^t(u_i)(y_{j_1})\frac{\partial s^t(u_i)(y_{j_1})}{\partial w_r}-\sum_{i=1}^{n_1}\sum_{j_2=1}^{n_3}h^t(u_i)(\tilde{y}_{j_2})\frac{\partial h^t(u_i)(\tilde{y}_{j_2})}{\partial w_r}
\end{aligned} \qquad (18)$$

and

$$\begin{aligned}
\frac{d\tilde{w}_{rk}(t)}{dt}&=-\frac{\partial L(W(t),\tilde{W}(t))}{\partial\tilde{w}_{rk}}\\
&=-\sum_{i=1}^{n_1}\sum_{j_1=1}^{n_2}s^t(u_i)(y_{j_1})\frac{\partial s^t(u_i)(y_{j_1})}{\partial\tilde{w}_{rk}}-\sum_{i=1}^{n_1}\sum_{j_2=1}^{n_3}h^t(u_i)(\tilde{y}_{j_2})\frac{\partial h^t(u_i)(\tilde{y}_{j_2})}{\partial\tilde{w}_{rk}}.
\end{aligned} \qquad (19)$$

Thus, for the predictions, we have

$$\begin{aligned}
\frac{ds^t(u_i)(y_j)}{dt}&=\sum_{r=1}^{m}\left\langle\frac{\partial s^t(u_i)(y_j)}{\partial w_r},\frac{dw_r(t)}{dt}\right\rangle+\sum_{r=1}^{m}\sum_{k=1}^{p}\left\langle\frac{\partial s^t(u_i)(y_j)}{\partial\tilde{w}_{rk}},\frac{d\tilde{w}_{rk}(t)}{dt}\right\rangle\\
&=-\sum_{r=1}^{m}\sum_{i_1=1}^{n_1}\sum_{j_1=1}^{n_2}\left\langle\frac{\partial s^t(u_i)(y_j)}{\partial w_r},\frac{\partial s^t(u_{i_1})(y_{j_1})}{\partial w_r}\right\rangle s^t(u_{i_1})(y_{j_1})-\sum_{r=1}^{m}\sum_{i_1=1}^{n_1}\sum_{j_2=1}^{n_3}\left\langle\frac{\partial s^t(u_i)(y_j)}{\partial w_r},\frac{\partial h^t(u_{i_1})(\tilde{y}_{j_2})}{\partial w_r}\right\rangle h^t(u_{i_1})(\tilde{y}_{j_2})\\
&\quad-\sum_{r,k=1}^{m,p}\sum_{i_1=1}^{n_1}\sum_{j_1=1}^{n_2}\left\langle\frac{\partial s^t(u_i)(y_j)}{\partial\tilde{w}_{rk}},\frac{\partial s^t(u_{i_1})(y_{j_1})}{\partial\tilde{w}_{rk}}\right\rangle s^t(u_{i_1})(y_{j_1})-\sum_{r,k=1}^{m,p}\sum_{i_1=1}^{n_1}\sum_{j_2=1}^{n_3}\left\langle\frac{\partial s^t(u_i)(y_j)}{\partial\tilde{w}_{rk}},\frac{\partial h^t(u_{i_1})(\tilde{y}_{j_2})}{\partial\tilde{w}_{rk}}\right\rangle h^t(u_{i_1})(\tilde{y}_{j_2}),
\end{aligned} \qquad (20)$$

and

$$\begin{aligned}
\frac{dh^t(u_i)(\tilde{y}_j)}{dt}&=\sum_{r=1}^{m}\left\langle\frac{\partial h^t(u_i)(\tilde{y}_j)}{\partial w_r},\frac{dw_r(t)}{dt}\right\rangle+\sum_{r=1}^{m}\sum_{k=1}^{p}\left\langle\frac{\partial h^t(u_i)(\tilde{y}_j)}{\partial\tilde{w}_{rk}},\frac{d\tilde{w}_{rk}(t)}{dt}\right\rangle\\
&=-\sum_{r=1}^{m}\sum_{i_1=1}^{n_1}\sum_{j_1=1}^{n_2}\left\langle\frac{\partial h^t(u_i)(\tilde{y}_j)}{\partial w_r},\frac{\partial s^t(u_{i_1})(y_{j_1})}{\partial w_r}\right\rangle s^t(u_{i_1})(y_{j_1})-\sum_{r=1}^{m}\sum_{i_1=1}^{n_1}\sum_{j_2=1}^{n_3}\left\langle\frac{\partial h^t(u_i)(\tilde{y}_j)}{\partial w_r},\frac{\partial h^t(u_{i_1})(\tilde{y}_{j_2})}{\partial w_r}\right\rangle h^t(u_{i_1})(\tilde{y}_{j_2})\\
&\quad-\sum_{r,k=1}^{m,p}\sum_{i_1=1}^{n_1}\sum_{j_1=1}^{n_2}\left\langle\frac{\partial h^t(u_i)(\tilde{y}_j)}{\partial\tilde{w}_{rk}},\frac{\partial s^t(u_{i_1})(y_{j_1})}{\partial\tilde{w}_{rk}}\right\rangle s^t(u_{i_1})(y_{j_1})-\sum_{r,k=1}^{m,p}\sum_{i_1=1}^{n_1}\sum_{j_2=1}^{n_3}\left\langle\frac{\partial h^t(u_i)(\tilde{y}_j)}{\partial\tilde{w}_{rk}},\frac{\partial h^t(u_{i_1})(\tilde{y}_{j_2})}{\partial\tilde{w}_{rk}}\right\rangle h^t(u_{i_1})(\tilde{y}_{j_2}).
\end{aligned} \qquad (21)$$

Let $G^t(u)=((s^t(u_1),h^t(u_1)),\cdots,(s^t(u_{n_1}),h^t(u_{n_1})))$ denote the stacked residual vector; then

$$\frac{d}{dt}G^t(u)=-\left(H(t)+\tilde{H}(t)\right)G^t(u),$$

where $H(t),\tilde{H}(t)\in\mathbb{R}^{n_1(n_2+n_3)\times n_1(n_2+n_3)}$ are Gram matrices at time $t$. We can divide them into $n_1\times n_1$ blocks, each block being a matrix in $\mathbb{R}^{(n_2+n_3)\times(n_2+n_3)}$. Specifically, the $(i,j)$-th blocks of $H(t)$ and $\tilde{H}(t)$ are $H_{i,j}(t):=D_i(t)^T D_j(t)$ and $\tilde{H}_{i,j}(t):=\tilde{D}_i(t)^T\tilde{D}_j(t)$, respectively, where

$$D_i(t)=\left[\frac{\partial s^t(u_i)(y_1)}{\partial w},\cdots,\frac{\partial s^t(u_i)(y_{n_2})}{\partial w},\frac{\partial h^t(u_i)(\tilde{y}_1)}{\partial w},\cdots,\frac{\partial h^t(u_i)(\tilde{y}_{n_3})}{\partial w}\right]$$

and

$$\tilde{D}_i(t)=\left[\frac{\partial s^t(u_i)(y_1)}{\partial\tilde{w}},\cdots,\frac{\partial s^t(u_i)(y_{n_2})}{\partial\tilde{w}},\frac{\partial h^t(u_i)(\tilde{y}_1)}{\partial\tilde{w}},\cdots,\frac{\partial h^t(u_i)(\tilde{y}_{n_3})}{\partial\tilde{w}}\right].$$

Recall that

$$\frac{\partial s(u)(y)}{\partial w_r}=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^T u)\right]\frac{1}{\sqrt{n_2}}\frac{\partial\mathcal{L}(\sigma_3(w_r^T y))}{\partial w_r},$$
$$\frac{\partial s(u)(y)}{\partial\tilde{w}_{rk}}=\frac{1}{\sqrt{m}}\frac{\tilde{a}_{rk}}{\sqrt{p}}\,uI\{\tilde{w}_{rk}^T u\geq 0\}\frac{\mathcal{L}(\sigma_3(w_r^T y))}{\sqrt{n_2}},$$
$$\frac{\partial h(u)(\tilde{y})}{\partial w_r}=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^T u)\right]\frac{1}{\sqrt{n_3}}\frac{\partial(\sigma_3(w_r^T\tilde{y}))}{\partial w_r},$$

and

$$\frac{\partial h(u)(\tilde{y})}{\partial\tilde{w}_{rk}}=\frac{1}{\sqrt{m}}\frac{\tilde{a}_{rk}}{\sqrt{p}}\,uI\{\tilde{w}_{rk}^T u\geq 0\}\frac{\sigma_3(w_r^T\tilde{y})}{\sqrt{n_3}}.$$

Then the $(j_1,j_2)$-th ($j_1,j_2\in[n_2]$) entry of $H_{i,j}(t)$ is

$$\frac1m\sum_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^T u_i)\right]\left[\frac{1}{\sqrt{p}}\sum_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^T u_j)\right]\frac{1}{n_2}\left\langle\frac{\partial\mathcal{L}(\sigma_3(w_r^T y_{j_1}))}{\partial w_r},\frac{\partial\mathcal{L}(\sigma_3(w_r^T y_{j_2}))}{\partial w_r}\right\rangle,$$

and the $(j_1,j_2)$-th ($j_1,j_2\in[n_2]$) entry of $\tilde{H}_{i,j}(t)$ is

$$\frac1m\sum_{r=1}^{m}\left[\frac1p\sum_{k=1}^{p}u_i^T u_j I\{\tilde{w}_{rk}^T u_i\geq 0,\tilde{w}_{rk}^T u_j\geq 0\}\right]\frac{1}{n_2}\mathcal{L}(\sigma_3(w_r^T y_{j_1}))\,\mathcal{L}(\sigma_3(w_r^T y_{j_2})),$$

where we omit the index tt for simplicity.

From the forms of $H(t)$ and $\tilde{H}(t)$, we can derive the corresponding Gram matrices induced by the random initialization, which we denote by $H^\infty$ and $\tilde{H}^\infty$, respectively. Specifically, $H^\infty$ is the Kronecker product of $H_1^\infty$ and $H_2^\infty$, where $H_1^\infty\in\mathbb{R}^{n_1\times n_1}$, $H_2^\infty\in\mathbb{R}^{(n_2+n_3)\times(n_2+n_3)}$, the $(i,j)$-th entry of $H_1^\infty$ is $\mathbb{E}[\sigma(\tilde{w}^T u_i)\sigma(\tilde{w}^T u_j)]$, and, for $j_1,j_2\in[n_2]$, the $(j_1,j_2)$-th entry of $H_2^\infty$ is

$$\frac{1}{n_2}\mathbb{E}\left[\left\langle\frac{\partial\mathcal{L}(\sigma_3(w^T y_{j_1}))}{\partial w},\frac{\partial\mathcal{L}(\sigma_3(w^T y_{j_2}))}{\partial w}\right\rangle\right].$$

Similarly, $\tilde{H}^\infty$ is the Kronecker product of $\tilde{H}_1^\infty$ and $\tilde{H}_2^\infty$, where $\tilde{H}_1^\infty\in\mathbb{R}^{n_1\times n_1}$, $\tilde{H}_2^\infty\in\mathbb{R}^{(n_2+n_3)\times(n_2+n_3)}$, the $(i,j)$-th entry of $\tilde{H}_1^\infty$ is $\mathbb{E}[u_i^T u_j I\{\tilde{w}^T u_i\geq 0,\tilde{w}^T u_j\geq 0\}]$, and, for $j_1,j_2\in[n_2]$, the $(j_1,j_2)$-th entry of $\tilde{H}_2^\infty$ is

$$\frac{1}{n_2}\mathbb{E}[\mathcal{L}(\sigma_3(w^T y_{j_1}))\,\mathcal{L}(\sigma_3(w^T y_{j_2}))].$$

The Gram matrices play important roles in the convergence analysis. Similar to the setting of Section 3, we can demonstrate the strict positive definiteness of the Gram matrices under mild conditions.

Lemma 6.

If no two samples in $\{u_i\}_{i=1}^{n_1}$ are parallel and no two samples in $\{y_j\}_{j=1}^{n_2}\cup\{\tilde{y}_j\}_{j=1}^{n_3}$ are parallel, then $H^\infty$ and $\tilde{H}^\infty$ are both strictly positive definite. We denote their least eigenvalues by $\lambda_0$ and $\tilde{\lambda}_0$, respectively.

As in Section 3, the convergence of gradient descent relies on the stability of the Gram matrices, which is established by the following two lemmas.

Lemma 7.

If $m=\Omega\left(\frac{d^4 n_1^2}{\min\{\lambda_0^2,\tilde{\lambda}_0^2\}}\log^2\left(\frac{n_1(n_2+n_3)}{\delta}\right)\right)$, then with probability at least $1-\delta$, $\|H(0)-H^\infty\|_2\leq\frac{\lambda_0}{4}$, $\|\tilde{H}(0)-\tilde{H}^\infty\|_2\leq\frac{\tilde{\lambda}_0}{4}$, $\lambda_{\min}(H(0))\geq\frac34\lambda_0$ and $\lambda_{\min}(\tilde{H}(0))\geq\frac34\tilde{\lambda}_0$.

Lemma 8.

Let $R,\tilde{R}\in(0,1)$ and suppose $w_1(0),\cdots,w_m(0),\tilde{w}_{11}(0),\cdots,\tilde{w}_{mp}(0)$ are generated i.i.d. from $\mathcal{N}(\bm{0},\bm{I})$. For any set of weight vectors $w_1,\cdots,w_m,\tilde{w}_{11},\cdots,\tilde{w}_{mp}$ that satisfy $\|w_r-w_r(0)\|_2\leq R$ and $\|\tilde{w}_{rk}-\tilde{w}_{rk}(0)\|_2\leq\tilde{R}$ for all $r\in[m],k\in[p]$, we have with probability at least $1-\delta-n_2\exp(-mR)$,

$$\begin{aligned}
\|H-H(0)\|_F&\lesssim n_1 d\log\left(\frac{m}{\delta}\right)\sqrt{\log\left(\frac{m(n_2+n_3)}{\delta}\right)}\,R\log\left(\frac{mn_1}{\delta}\right)\\
&\quad+n_1 d\log\left(\frac{m}{\delta}\right)\log\left(\frac{m(n_2+n_3)}{\delta}\right)\left(p\tilde{R}^2+\sqrt{p}\,\tilde{R}\sqrt{\log\left(\frac{mn_1}{\delta}\right)}\right)
\end{aligned} \qquad (22)$$

and with probability at least $1-\delta-n_1\exp(-mp\tilde{R})$,

$$\|\tilde{H}-\tilde{H}(0)\|_F\lesssim n_1 d\log\left(\frac{m}{\delta}\right)\sqrt{\log\left(\frac{m(n_2+n_3)}{\delta}\right)}\,\tilde{R}+n_1\left(d\log\left(\frac{m}{\delta}\right)\right)^{\frac32}\log\left(\frac{m(n_2+n_3)}{\delta}\right)R. \qquad (23)$$

As in the case of plain neural operators, we can derive the training dynamics of physics-informed neural operators.

Lemma 9.

For all $t\in\mathbb{N}$, we have

$$G^{t+1}(u)=[I-\eta(H(t)+\tilde{H}(t))]G^t(u)+I(t),$$

where $I(t)\in\mathbb{R}^{n_1(n_2+n_3)}$ can be divided into $n_1$ blocks, each block being an $(n_2+n_3)$-dimensional vector. The $j_1$-th ($j_1\in[n_2]$) component of the $i$-th block is

$$s^{t+1}(u_i)(y_{j_1})-s^t(u_i)(y_{j_1})-\left\langle\frac{\partial s^t(u_i)(y_{j_1})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial s^t(u_i)(y_{j_1})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle,$$

and the $(n_2+j_2)$-th ($j_2\in[n_3]$) component of the $i$-th block is

$$h^{t+1}(u_i)(\tilde{y}_{j_2})-h^t(u_i)(\tilde{y}_{j_2})-\left\langle\frac{\partial h^t(u_i)(\tilde{y}_{j_2})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial h^t(u_i)(\tilde{y}_{j_2})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle.$$

With these preparations in place, we can now arrive at the final convergence theorem.

Theorem 3.

If we set $\eta=\mathcal{O}\left(\frac{1}{\|H^\infty\|_2+\|\tilde{H}^\infty\|_2}\right)$, then with probability at least $1-\delta$, we have

$$\|G^t(u)\|_2^2\leq\left(1-\frac{\eta(\lambda_0+\tilde{\lambda}_0)}{2}\right)^t\|G^0(u)\|_2^2,$$

where

$$m=\tilde{\Omega}\left(\frac{n_1^4 d^7}{(\lambda_0+\tilde{\lambda}_0)^2\min(\lambda_0^2,\tilde{\lambda}_0^2)}\right)$$

and $\tilde{\Omega}$ indicates that some factors involving $\log(n_1)$, $\log(n_2)$ and $\log(m)$ are omitted.

We prove Theorem 3 by induction. Our induction hypothesis is just the following condition:

Condition 2.

At the $s$-th iteration, we have

$$\|G^s(u)\|_2^2\leq\left(1-\frac{\eta(\lambda_0+\tilde{\lambda}_0)}{2}\right)^s\|G^0(u)\|_2^2$$

and $\|w_r(s)\|_2\leq B_1$, $|w_r(s)^T y_j|\leq B_2$ and $|w_r(s)^T\tilde{y}_{j_1}|\leq B_2$ hold for all $r\in[m]$, $j\in[n_2]$, $j_1\in[n_3]$, where

$$B_1=2\sqrt{d\log(m/\delta)},\qquad B_2=2\sqrt{\log(m(n_2+n_3)/\delta)}.$$

This condition directly yields the following bounds on the deviation from initialization.

Corollary 2.

If Condition 2 holds for $s=0,\cdots,T$, then we have that

$$\|\tilde{w}_{rk}(T+1)-\tilde{w}_{rk}(0)\|_2\lesssim\frac{\sqrt{n_1}B_1^2 B_2\|G^0(u)\|_2}{\sqrt{mp}(\lambda_0+\tilde{\lambda}_0)}$$

and

$$\|w_r(T+1)-w_r(0)\|_2\lesssim\frac{\sqrt{n_1}B_1 B_2\|G^0(u)\|_2}{\sqrt{mp}(\lambda_0+\tilde{\lambda}_0)}\sqrt{\log\left(\frac{mn_1}{\delta}\right)}$$

hold for all $r\in[m],k\in[p]$.

Lemma 10.

If Condition 2 holds for $s=0,\cdots,T$, then

$$\|I(s)\|_2\lesssim\frac{\eta\,n_1^{3/2}B_1^6 B_2^3\|G^0(u)\|_2}{\sqrt{m}(\lambda_0+\tilde{\lambda}_0)}\log^{3/2}\left(\frac{mn_1}{\delta}\right)\|G^s(u)\|_2$$

holds for $s=0,\cdots,T-1$.

6. Conclusion and Future Work

In this paper, we have analyzed the convergence of gradient descent (GD) for training wide shallow neural operators within the framework of the NTK, demonstrating the linear convergence of GD. The core idea is that over-parameterization keeps all weights close to their initializations throughout training, so that GD behaves similarly to a kernel method. Several directions remain for future work. First, extending our theory to other neural operators, such as FNO; the main difficulty lies in meeting the requirements of NTK theory. Second, extending the analysis to DeepONets, which we expect to be similar to the extension from the results in [17] to those in [18].

References

  • [1] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” Journal of Computational Physics, vol. 378, pp. 686–707, 2019.
  • [2] B. Yu et al., “The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems,” Communications in Mathematics and Statistics, vol. 6, no. 1, pp. 1–12, 2018.
  • [3] Z. Shen, H. Yang, and S. Zhang, “Optimal approximation rate of relu networks in terms of width and depth,” Journal de Mathématiques Pures et Appliquées, vol. 157, pp. 101–135, 2022.
  • [4] J. Lu, Z. Shen, H. Yang, and S. Zhang, “Deep network approximation for smooth functions,” SIAM Journal on Mathematical Analysis, vol. 53, no. 5, pp. 5465–5506, 2021.
  • [5] D. Yarotsky, “Error bounds for approximations with deep relu networks,” Neural networks, vol. 94, pp. 103–114, 2017.
  • [6] K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart, “Model reduction and neural networks for parametric pdes,” The SMAI journal of computational mathematics, vol. 7, pp. 121–157, 2021.
  • [7] T. Chen and H. Chen, “Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems,” IEEE transactions on neural networks, vol. 6, no. 4, pp. 911–917, 1995.
  • [8] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis, “Learning nonlinear operators via deeponet based on the universal approximation theorem of operators,” Nature machine intelligence, vol. 3, no. 3, pp. 218–229, 2021.
  • [9] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar, “Fourier neural operator for parametric partial differential equations,” arXiv preprint arXiv:2010.08895, 2020.
  • [10] N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar, “Neural operator: Learning maps between function spaces with applications to pdes,” Journal of Machine Learning Research, vol. 24, no. 89, pp. 1–97, 2023.
  • [11] N. Kovachki, S. Lanthaler, and S. Mishra, “On universal approximation and error bounds for fourier neural operators,” Journal of Machine Learning Research, vol. 22, no. 290, pp. 1–76, 2021.
  • [12] N. B. Kovachki, S. Lanthaler, and A. M. Stuart, “Operator learning: Algorithms and analysis,” arXiv preprint arXiv:2402.15715, 2024.
  • [13] S. Lanthaler, S. Mishra, and G. E. Karniadakis, “Error estimates for deeponets: A deep learning framework in infinite dimensions,” Transactions of Mathematics and Its Applications, vol. 6, no. 1, p. tnac001, 2022.
  • [14] H. Liu, H. Yang, M. Chen, T. Zhao, and W. Liao, “Deep nonparametric estimation of operators between infinite dimensional spaces,” Journal of Machine Learning Research, vol. 25, no. 24, pp. 1–67, 2024.
  • [15] B. Shrimali, A. Banerjee, and P. Cisneros-Velarde, “Optimization for neural operator learning: Wider networks are better.”
  • [16] S. Wang, H. Wang, and P. Perdikaris, “Improved architectures and training algorithms for deep operator networks,” Journal of Scientific Computing, vol. 92, no. 2, p. 35, 2022.
  • [17] S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over-parameterized neural networks,” arXiv preprint arXiv:1810.02054, 2018.
  • [18] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep neural networks,” in International conference on machine learning.   PMLR, 2019, pp. 1675–1685.
  • [19] J. He, L. Li, J. Xu, and C. Zheng, “Relu deep neural networks and linear finite elements,” arXiv preprint arXiv:1807.03973, 2018.
  • [20] Y. Gao, Y. Gu, and M. Ng, “Gradient descent finds the global optima of two-layer physics-informed neural networks,” in International Conference on Machine Learning.   PMLR, 2023, pp. 10 676–10 707.
  • [21] E. Giné and R. Nickl, Mathematical foundations of infinite-dimensional statistical models.   Cambridge university press, 2016, vol. 40.
  • [22] A. K. Kuchibhotla and A. Chakrabortty, “Moving beyond sub-gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression,” Information and Inference: A Journal of the IMA, vol. 11, no. 4, pp. 1389–1456, 2022.
  • [23] A. W. Van Der Vaart, J. A. Wellner, A. W. van der Vaart, and J. A. Wellner, Weak convergence.   Springer, 1996.

Appendix

Before the proofs, we first define the events

$$A_{jr}:=\{\exists w:\|w-w_r(0)\|_2\leq R,\ I\{w^T y_j\geq 0\}\neq I\{w_r(0)^T y_j\geq 0\}\} \qquad (24)$$

and

A~rki:={w:wwrk(0)2R~,I{wTu0}I{wrk(0)Tui0}}\tilde{A}_{rk}^{i}:=\{\exists w:\|w-w_{rk}(0)\|_{2}\leq\tilde{R},I\{w^{T}u\geq 0\}\neq I\{w_{rk}(0)^{T}u_{i}\geq 0\}\} (25)

for all i\in[n_{1}], j\in[n_{2}], r\in[m], k\in[p].

Note that the event A_{jr} happens if and only if |w_{r}(0)^{T}y_{j}|<\|y_{j}\|_{2}R; thus, by the anti-concentration inequality of the Gaussian distribution (Lemma 10), we have

P(A_{jr})=P_{z\sim\mathcal{N}(0,\|y_{j}\|_{2}^{2})}(|z|<\|y_{j}\|_{2}R)=P_{z\sim\mathcal{N}(0,1)}(|z|<R)\lesssim R. (26)

Similarly, we have P(\tilde{A}_{rk}^{i})\lesssim\tilde{R}.
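The estimate (26) is the standard Gaussian anti-concentration bound. For the reader who wishes to check it numerically, the following is a minimal Python sketch (illustrative only; the sample size, seed and test values of R are ours, not part of the analysis):

```python
import numpy as np

# Monte Carlo check of (26): for z ~ N(0,1), P(|z| < R) <= sqrt(2/pi) * R,
# so each event A_{jr} indeed has probability O(R).
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)
for R in [0.01, 0.05, 0.1, 0.5]:
    print(f"R={R}: empirical P(|z|<R) = {np.mean(np.abs(z) < R):.4f}, "
          f"bound sqrt(2/pi)*R = {np.sqrt(2 / np.pi) * R:.4f}")
```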

Moreover, we let S_{j}:=\{r\in[m]:I\{A_{jr}\}=0\}, S_{j}^{\perp}:=[m]\backslash S_{j}, and \tilde{S}_{i}:=\{(r,k)\in[m]\times[p]:I\{\tilde{A}_{rk}^{i}\}=0\}, \tilde{S}_{i}^{\perp}:=[m]\times[p]\backslash\tilde{S}_{i}.

7. Proof of Continuous Time Analysis

7.1. Proof of Lemma 1

Proof.

First, recall that H^{\infty} is the Kronecker product of H_{1}^{\infty} and H_{2}^{\infty}. The (i,j)-th entry of H_{1}^{\infty} is \mathbb{E}[\sigma(\tilde{w}^{T}u_{i})\sigma(\tilde{w}^{T}u_{j})] and the (i_{1},j_{1})-th entry of H_{2}^{\infty} is \mathbb{E}[y_{i_{1}}^{T}y_{j_{1}}I\{w^{T}y_{i_{1}}\geq 0,w^{T}y_{j_{1}}\geq 0\}]. As is well known, the Kronecker product of two strictly positive definite matrices is also strictly positive definite. Thus, it suffices to demonstrate that H_{1}^{\infty} and H_{2}^{\infty} are both strictly positive definite.

The proof relies on standard functional analysis. Let \mathcal{H} be the Hilbert space of integrable d-dimensional vector fields on \mathbb{R}^{d}, i.e., f\in\mathcal{H} if \mathbb{E}_{w\sim\mathcal{N}(\bm{0},\bm{I})}[\|f(w)\|_{2}^{2}]<\infty. The inner product of this Hilbert space is \langle f,g\rangle_{\mathcal{H}}=\mathbb{E}_{w\sim\mathcal{N}(\bm{0},\bm{I})}[f(w)^{T}g(w)] for f,g\in\mathcal{H}. With these preparations in place, proving that H_{2}^{\infty} is strictly positive definite is equivalent to showing that \psi(y_{1}),\cdots,\psi(y_{n_{2}})\in\mathcal{H} are linearly independent, where \psi(y_{j})=y_{j}I\{w^{T}y_{j}\geq 0\} for j\in[n_{2}]. This is exactly the result of Theorem 3.1 in [17]. Similarly, Theorem 2.1 in [19] shows that \{\sigma(\tilde{w}^{T}u_{i})\}_{i=1}^{n_{1}} are linearly independent if no two samples in \{u_{i}\}_{i=1}^{n_{1}} are parallel. Thus it follows directly that H^{\infty} is strictly positive definite, and the same argument shows that \tilde{H}^{\infty} is also strictly positive definite.

As for other activation functions, such as \text{ReLU}^{p} or smooth activation functions, similar conclusions hold true. For specific details, refer to [18] and [20].
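The Kronecker-product fact invoked above is elementary linear algebra and can be verified numerically. The sketch below (an illustration with randomly generated matrices of our choosing, not part of the proof) confirms that the smallest eigenvalue of A\otimes B is \lambda_{min}(A)\lambda_{min}(B) for symmetric positive definite A,B:

```python
import numpy as np

# Eigenvalues of the Kronecker product A (x) B are pairwise products of the
# eigenvalues of A and B, so the product of two strictly positive definite
# matrices is strictly positive definite.
rng = np.random.default_rng(1)

def random_spd(n: int) -> np.ndarray:
    M = rng.standard_normal((n, n))
    return M @ M.T + np.eye(n)  # M M^T is PSD; the identity shift makes it PD

A, B = random_spd(3), random_spd(4)
lmin = lambda S: np.linalg.eigvalsh(S).min()
print(lmin(np.kron(A, B)), lmin(A) * lmin(B))  # equal up to roundoff
```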

7.2. Proof of Lemma 2

Proof.

First, let

X_{r}=\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{j})\right]y_{i_{1}}^{T}y_{j_{1}}I\{w_{r}(0)^{T}y_{i_{1}}\geq 0,w_{r}(0)^{T}y_{j_{1}}\geq 0\},

then |H_{i,j}^{i_{1},j_{1}}(0)-H_{i,j}^{i_{1},j_{1},\infty}|=\left|\frac{1}{m}\sum\limits_{r=1}^{m}X_{r}-\mathbb{E}[X_{1}]\right|.

Note that Lemma 13 implies that for all i\in[n_{1}], \left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right\|_{\psi_{2}}=\mathcal{O}(1), which yields that

\|X_{r}\|_{\psi_{1}}\lesssim\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right\|_{\psi_{2}}\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{j})\right\|_{\psi_{2}}=\mathcal{O}(1).

Thus, by applying Lemma 12, we can deduce that for fixed (i,j) and (i_{1},j_{1}), with probability at least 1-\delta,

|H_{i,j}^{i_{1},j_{1}}(0)-H_{i,j}^{i_{1},j_{1},\infty}|\lesssim\sqrt{\frac{\log(\frac{1}{\delta})}{m}}+\frac{\log(\frac{1}{\delta})}{m}.

Taking a union bound yields that with probability at least 1-\delta,

\|H(0)-H^{\infty}\|_{F}^{2} =\sum\limits_{i,j=1}^{n_{1}}\sum\limits_{i_{1},j_{1}=1}^{n_{2}}|H_{i,j}^{i_{1},j_{1}}(0)-H_{i,j}^{i_{1},j_{1},\infty}|^{2}
\lesssim n_{1}^{2}n_{2}^{2}\left(\sqrt{\frac{\log(\frac{n_{1}n_{2}}{\delta})}{m}}+\frac{\log(\frac{n_{1}n_{2}}{\delta})}{m}\right)^{2}
\lesssim\frac{n_{1}^{2}n_{2}^{2}}{m}\log(\frac{n_{1}n_{2}}{\delta}).

Thus, if m=\Omega(\frac{n_{1}^{2}n_{2}^{2}\log(n_{1}n_{2}/\delta)}{\lambda_{0}^{2}}), we have \|H(0)-H^{\infty}\|_{2}\leq\|H(0)-H^{\infty}\|_{F}\leq\lambda_{0}/4, which implies \lambda_{min}(H(0))\geq 3\lambda_{0}/4.

On the other hand,

\left\|\frac{1}{p}\sum\limits_{k=1}^{p}u_{i}^{T}u_{j}I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}(0)^{T}u_{j}\geq 0\}\sigma(w_{r}(0)^{T}y_{i_{1}})\sigma(w_{r}(0)^{T}y_{j_{1}})\right\|_{\psi_{1}}\leq\|\sigma(w_{r}(0)^{T}y_{i_{1}})\sigma(w_{r}(0)^{T}y_{j_{1}})\|_{\psi_{1}}=\mathcal{O}(1).

Similarly, applying Lemma 12 yields that with probability at least 1-\delta,

\|\tilde{H}(0)-\tilde{H}^{\infty}\|_{2}\leq\|\tilde{H}(0)-\tilde{H}^{\infty}\|_{F}\lesssim n_{1}n_{2}\sqrt{\frac{\log(\frac{n_{1}n_{2}}{\delta})}{m}},

which leads to the desired conclusion.
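To see the \mathcal{O}(1/\sqrt{m}) concentration of Lemma 2 in action, one can simulate a single entry. The sketch below uses only the indicator part of an H_{2} entry, whose expectation has a closed form by the standard arc-cosine kernel identity; the dimension, seed and sample sizes are our own illustrative choices:

```python
import numpy as np

# For unit vectors y_i, y_j with angle theta, the arc-cosine kernel identity
# gives E[I{w^T y_i >= 0, w^T y_j >= 0}] = (pi - theta)/(2 pi).  The empirical
# average over m i.i.d. Gaussian weights deviates by roughly 1/sqrt(m).
rng = np.random.default_rng(2)
d = 5
yi, yj = rng.standard_normal(d), rng.standard_normal(d)
yi, yj = yi / np.linalg.norm(yi), yj / np.linalg.norm(yj)
theta = np.arccos(np.clip(yi @ yj, -1.0, 1.0))
exact = (yi @ yj) * (np.pi - theta) / (2 * np.pi)
for m in [100, 10_000, 1_000_000]:
    W = rng.standard_normal((m, d))
    emp = (yi @ yj) * np.mean((W @ yi >= 0) & (W @ yj >= 0))
    print(f"m={m}: deviation = {abs(emp - exact):.5f}")
```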

7.3. Proof of Lemma 3

Proof.

First, for H, recall that the (i_{1},j_{1})-th entry of the (i,j)-th block is

\frac{1}{m}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{i})\right]\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{j})\right]y_{i_{1}}^{T}y_{j_{1}}I\{w_{r}^{T}y_{i_{1}}\geq 0,w_{r}^{T}y_{j_{1}}\geq 0\}.

We let

a=\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{i}),\quad b=\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{j}),\quad c=y_{i_{1}}^{T}y_{j_{1}}I\{w_{r}^{T}y_{i_{1}}\geq 0,w_{r}^{T}y_{j_{1}}\geq 0\}

and let a(0),b(0),c(0) be the initialized quantities corresponding to a,b,c respectively.

Note that we can decompose abc-a(0)b(0)c(0) as follows:

abc-a(0)b(0)c(0)=(ab-a(0)b(0))c+a(0)b(0)(c-c(0)).

For the first part (ab-a(0)b(0))c, from the boundedness of c and c(0), we have

|ab-a(0)b(0)|\leq|a-a(0)||b|+|b-b(0)||a(0)|\leq|a-a(0)||b-b(0)|+|a-a(0)||b(0)|+|b-b(0)||a(0)|.

For a-a(0) and b-b(0), we have

\left|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{i})-\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right|\leq\sqrt{p}\tilde{R},

i.e., |a-a(0)|,|b-b(0)|\lesssim\sqrt{p}\tilde{R}. Moreover, Lemma 13 shows that |a(0)|,|b(0)|\lesssim\sqrt{\log(mn_{1}/\delta)}. Thus, combining these facts yields that

|(ab-a(0)b(0))c|\lesssim p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}. (27)

For the second part a(0)b(0)(c-c(0)), note that

\left|I\{w_{r}^{T}y_{i_{1}}\geq 0,w_{r}^{T}y_{j_{1}}\geq 0\}-I\{w_{r}(0)^{T}y_{i_{1}}\geq 0,w_{r}(0)^{T}y_{j_{1}}\geq 0\}\right| (28)
\leq\left|I\{w_{r}^{T}y_{i_{1}}\geq 0\}-I\{w_{r}(0)^{T}y_{i_{1}}\geq 0\}\right|+\left|I\{w_{r}^{T}y_{j_{1}}\geq 0\}-I\{w_{r}(0)^{T}y_{j_{1}}\geq 0\}\right|
\leq I\{A_{i_{1},r}\}+I\{A_{j_{1},r}\}.

From the Bernstein inequality (Lemma 11), we have that with probability at least 1-n_{2}\exp(-mR),

\frac{1}{m}\sum\limits_{r=1}^{m}I\{A_{j,r}\}\lesssim R (29)

holds for any j\in[n_{2}].
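Each indicator I\{A_{jr}\} is a Bernoulli variable with success probability \mathcal{O}(R) by (26), which is exactly what Bernstein's inequality exploits. A quick simulation (with hypothetical sizes of our own choosing) illustrates (29):

```python
import numpy as np

# A_{jr} happens iff |w_r(0)^T y_j| < ||y_j||_2 R, so the empirical fraction of
# "flippable" neurons concentrates at a value of order R.
rng = np.random.default_rng(3)
d, m, R = 10, 50_000, 0.01
y = rng.standard_normal(d)
y /= np.linalg.norm(y)
W0 = rng.standard_normal((m, d))
print(np.mean(np.abs(W0 @ y) < R), "vs R =", R)  # fraction is of order R
```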

Therefore, we can deduce that

|H_{i,j}^{i_{1},j_{1}}-H_{i,j}^{i_{1},j_{1}}(0)| (30)
\lesssim p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}+\frac{1}{m}\sum\limits_{r=1}^{m}\log\left(\frac{mn_{1}}{\delta}\right)(I\{A_{i_{1},r}\}+I\{A_{j_{1},r}\})
\lesssim p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}+R\log\left(\frac{mn_{1}}{\delta}\right).

Summing over i,j,i_{1},j_{1} yields that

\|H-H(0)\|_{F} =\sqrt{\sum\limits_{i,j=1}^{n_{1}}\sum\limits_{i_{1},j_{1}=1}^{n_{2}}|H_{i,j}^{i_{1},j_{1}}-H_{i,j}^{i_{1},j_{1}}(0)|^{2}} (31)
\lesssim n_{1}n_{2}p\tilde{R}^{2}+n_{1}n_{2}\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}+n_{1}n_{2}R\log\left(\frac{mn_{1}}{\delta}\right)

holds with probability at least 1-\delta-n_{2}\exp(-mR).

Second, for \tilde{H}, recall that the (i_{1},j_{1})-th entry of the (i,j)-th block is

\frac{1}{m}\frac{1}{p}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}u_{i}^{T}u_{j}I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}\sigma(w_{r}^{T}y_{i_{1}})\sigma(w_{r}^{T}y_{j_{1}}).

Let a=\sigma(w_{r}^{T}y_{i_{1}}), b=\sigma(w_{r}^{T}y_{j_{1}}), c=u_{i}^{T}u_{j}I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}, and let a(0),b(0),c(0) be the corresponding initialized quantities.

Similarly, we decompose abc-a(0)b(0)c(0) as follows:

abc-a(0)b(0)c(0)=(ab-a(0)b(0))c+a(0)b(0)(c-c(0)).

For the first part (ab-a(0)b(0))c, we have

|\sigma(w_{r}^{T}y_{i_{1}})\sigma(w_{r}^{T}y_{j_{1}})-\sigma(w_{r}(0)^{T}y_{i_{1}})\sigma(w_{r}(0)^{T}y_{j_{1}})| (32)
=|[\sigma(w_{r}^{T}y_{i_{1}})-\sigma(w_{r}(0)^{T}y_{i_{1}})]\sigma(w_{r}^{T}y_{j_{1}})+\sigma(w_{r}(0)^{T}y_{i_{1}})[\sigma(w_{r}^{T}y_{j_{1}})-\sigma(w_{r}(0)^{T}y_{j_{1}})]|
\lesssim R(|\sigma(w_{r}^{T}y_{j_{1}})|+|\sigma(w_{r}(0)^{T}y_{i_{1}})|)
\lesssim R(|\sigma(w_{r}(0)^{T}y_{j_{1}})|+|\sigma(w_{r}(0)^{T}y_{i_{1}})|+R),

thus we have

|(ab-a(0)b(0))c|\lesssim\frac{1}{m}\sum\limits_{r=1}^{m}R(|\sigma(w_{r}(0)^{T}y_{j_{1}})|+|\sigma(w_{r}(0)^{T}y_{i_{1}})|)+R^{2}. (33)

Note that \left\|\sigma(w_{r}(0)^{T}y_{j})\right\|_{\psi_{2}}=\mathcal{O}(1) for all j\in[n_{2}]; then applying Lemma 12 yields that with probability at least 1-\delta,

\frac{1}{m}\sum\limits_{r=1}^{m}|\sigma(w_{r}(0)^{T}y_{j})|\lesssim\mathbb{E}[|\sigma(w_{1}(0)^{T}y_{j})|]+\sqrt{\log\left(\frac{n_{2}}{\delta}\right)}\lesssim\sqrt{\log\left(\frac{n_{2}}{\delta}\right)}

holds for all j\in[n_{2}].

For the second part a(0)b(0)(c-c(0)), we cannot directly apply the Bernstein inequality. Instead, we first truncate |\sigma(w_{r}(0)^{T}y_{i_{1}})\sigma(w_{r}(0)^{T}y_{j_{1}})|. Note that for w_{r}(0)^{T}y_{j}, we have P(|w_{r}(0)^{T}y_{j}|>\|y_{j}\|_{2}t)\leq 2e^{-t^{2}/2}, i.e., with probability at least 1-\delta, |w_{r}(0)^{T}y_{j}|\leq\sqrt{2\log(\frac{2}{\delta})}. Thus, taking a union bound yields that with probability at least 1-\delta,

|\sigma(w_{r}(0)^{T}y_{j})|\leq|w_{r}(0)^{T}y_{j}|\lesssim\sqrt{\log\left(\frac{mn_{2}}{\delta}\right)}:=M

holds for any r\in[m], j\in[n_{2}].

Therefore, under this event,

|\sigma(w_{r}(0)^{T}y_{i_{1}})\sigma(w_{r}(0)^{T}y_{j_{1}})(I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}(0)^{T}u_{j}\geq 0\})| (34)
\lesssim M^{2}|I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}(0)^{T}u_{j}\geq 0\}|
\leq M^{2}|I\{\tilde{w}_{rk}^{T}u_{i}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0\}|+M^{2}|I\{\tilde{w}_{rk}^{T}u_{j}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{j}\geq 0\}|
\leq M^{2}(I\{\tilde{A}_{rk}^{i}\}+I\{\tilde{A}_{rk}^{j}\}).

From the Bernstein inequality, we have that with probability at least 1-n_{1}\exp(-mp\tilde{R}),

\frac{1}{m}\frac{1}{p}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}I\{\tilde{A}_{rk}^{i}\}\lesssim\tilde{R}.

Thus, with probability at least 1-\delta-n_{1}\exp(-mp\tilde{R}),

|\tilde{H}_{i,j}^{i_{1},j_{1}}-\tilde{H}_{i,j}^{i_{1},j_{1}}(0)| \lesssim R\sqrt{\log\left(\frac{n_{2}}{\delta}\right)}+\tilde{R}\log\left(\frac{mn_{2}}{\delta}\right).

Summing over i,j,i_{1},j_{1} yields that

\|\tilde{H}-\tilde{H}(0)\|_{F} =\sqrt{\sum\limits_{i,j=1}^{n_{1}}\sum\limits_{i_{1},j_{1}=1}^{n_{2}}|\tilde{H}_{i,j}^{i_{1},j_{1}}-\tilde{H}_{i,j}^{i_{1},j_{1}}(0)|^{2}} (35)
\lesssim n_{1}n_{2}R\sqrt{\log\left(\frac{n_{2}}{\delta}\right)}+n_{1}n_{2}\tilde{R}\log\left(\frac{mn_{2}}{\delta}\right)

holds with probability at least 1-\delta-n_{1}\exp(-mp\tilde{R}).

7.4. Proof of Theorem 1

The proof of Theorem 1 consists of the following Lemma 6, Lemma 7, Lemma 8 and Lemma 9. Throughout, these lemmas are considered on the intersection of the events in Lemma 13, Lemma 14 and \{|w_{r}(0)^{T}y_{j}|\leq B,\ \forall j\in[n_{2}],\forall r\in[m]\}, where B=2\sqrt{\log(mn/\delta)}.

Lemma 6.

If for 0\leq s\leq t, \lambda_{min}(H(s))\geq\frac{\lambda_{0}}{2} and \lambda_{min}(\tilde{H}(s))\geq\frac{\tilde{\lambda}_{0}}{2}, then we have

\|z-G^{t}(u)\|_{2}^{2}\leq\exp(-(\lambda_{0}+\tilde{\lambda}_{0})t)\|z-G^{0}(u)\|_{2}^{2}.
Proof.

From the conditions \lambda_{min}(H(s))\geq\frac{\lambda_{0}}{2} and \lambda_{min}(\tilde{H}(s))\geq\frac{\tilde{\lambda}_{0}}{2}, we can deduce that

\frac{d}{dt}\|z-G^{t}(u)\|_{2}^{2} =-2(z-G^{t}(u))^{T}(H(t)+\tilde{H}(t))(z-G^{t}(u))
\leq-(\lambda_{0}+\tilde{\lambda}_{0})\|z-G^{t}(u)\|_{2}^{2}.

From this, we have

\frac{d}{dt}\left(\exp((\lambda_{0}+\tilde{\lambda}_{0})t)\|z-G^{t}(u)\|_{2}^{2}\right)\leq 0,

which yields that

\exp((\lambda_{0}+\tilde{\lambda}_{0})t)\|z-G^{t}(u)\|_{2}^{2}\leq\|z-G^{0}(u)\|_{2}^{2},

i.e.,

\|z-G^{t}(u)\|_{2}^{2}\leq\exp(-(\lambda_{0}+\tilde{\lambda}_{0})t)\|z-G^{0}(u)\|_{2}^{2}.
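The mechanism of Lemma 6 is a linear ODE comparison. The following sketch freezes H(t) and \tilde{H}(t) at fixed symmetric positive definite matrices (an assumption made only for this illustration) and checks the claimed exponential decay along a forward-Euler discretization of the flow:

```python
import numpy as np

# dr/dt = -(H + H~) r with lambda_min(H) >= lam0/2, lambda_min(H~) >= lam0t/2
# forces ||r(t)||^2 <= exp(-(lam0 + lam0t) t) ||r(0)||^2.
rng = np.random.default_rng(4)
n, lam0, lam0t = 6, 0.8, 0.6

def spd_with_min(lmin: float) -> np.ndarray:
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q @ np.diag(lmin + rng.random(n)) @ Q.T  # all eigenvalues >= lmin

H, Ht = spd_with_min(lam0 / 2), spd_with_min(lam0t / 2)
r = rng.standard_normal(n)
r0_sq, dt, T = r @ r, 1e-3, 5.0
for _ in range(int(T / dt)):  # forward Euler on the gradient flow
    r = r - dt * (H + Ht) @ r
print(r @ r, "<=", np.exp(-(lam0 + lam0t) * T) * r0_sq)
```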

Lemma 7.

Suppose that for 0\leq s\leq t, \lambda_{min}(H(s))\geq\frac{\lambda_{0}}{2}, \lambda_{min}(\tilde{H}(s))\geq\frac{\tilde{\lambda}_{0}}{2} and \|\tilde{w}_{rk}(s)-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R} hold for any r\in[m], k\in[p]. Then we have that

\|w_{r}(s)-w_{r}(0)\|_{2}\leq\frac{C\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\left(\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right):=R^{\prime}

holds for any r\in[m], where C is a universal constant.

Proof.

For 0\leq s\leq t, we have

\left\|\frac{d}{ds}w_{r}(s)\right\|_{2} (36)
=\left\|\frac{\partial L(W(s),\tilde{W}(s))}{\partial w_{r}}\right\|_{2}
=\left\|\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\left(\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]y_{j}I\{w_{r}(s)^{T}y_{j}\geq 0\}\right)(G^{s}(u_{i})(y_{j})-z_{j}^{i})\right\|_{2}
\leq\frac{\sqrt{n_{1}n_{2}}}{\sqrt{m}}\left(\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right)\|G^{s}(u)-z\|_{2}
\leq\frac{\sqrt{n_{1}n_{2}}}{\sqrt{m}}\left(\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right)\exp(-\frac{(\lambda_{0}+\tilde{\lambda}_{0})s}{2})\|G^{0}(u)-z\|_{2},

where the last inequality follows from Lemma 6 and the first inequality follows from the fact that

\left|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right|
=\left|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\left[\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]+\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right|
\leq\left|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\left[\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right|+\left|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right|
\lesssim\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}.

Therefore, we have

\|w_{r}(t)-w_{r}(0)\|_{2}\leq\int_{0}^{t}\left\|\frac{d}{ds}w_{r}(s)\right\|_{2}ds\leq\frac{C\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\left(\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right):=R^{\prime},

where C is a universal constant. ∎

Lemma 8.

Suppose that for 0\leq s\leq t, \lambda_{min}(H(s))\geq\frac{\lambda_{0}}{2}, \lambda_{min}(\tilde{H}(s))\geq\frac{\tilde{\lambda}_{0}}{2} and \|w_{r}(s)-w_{r}(0)\|_{2}\leq R hold for any r\in[m]. Then we have that

\|\tilde{w}_{rk}(t)-\tilde{w}_{rk}(0)\|_{2}\leq\frac{C\sqrt{n_{1}n_{2}}(R+B)\|z-G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}:=\tilde{R}^{\prime}

holds for any r\in[m], k\in[p], where C is a universal constant.

Proof.

For 0\leq s\leq t, we have

\left\|\frac{d}{ds}\tilde{w}_{rk}(s)\right\|_{2} =\left\|\frac{\partial L(W(s),\tilde{W}(s))}{\partial\tilde{w}_{rk}}\right\|_{2} (37)
=\left\|\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\frac{1}{\sqrt{m}}\frac{\tilde{a}_{rk}}{\sqrt{p}}u_{i}I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}\sigma(w_{r}(s)^{T}y_{j})(G^{s}(u_{i})(y_{j})-z_{j}^{i})\right\|_{2}
\lesssim\frac{1}{\sqrt{mp}}\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\left|\sigma(w_{r}(s)^{T}y_{j})(G^{s}(u_{i})(y_{j})-z_{j}^{i})\right|
\leq\frac{1}{\sqrt{mp}}\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\left(\left|\sigma(w_{r}(s)^{T}y_{j})-\sigma(w_{r}(0)^{T}y_{j})\right|+\left|\sigma(w_{r}(0)^{T}y_{j})\right|\right)\left|G^{s}(u_{i})(y_{j})-z_{j}^{i}\right|
\leq\frac{1}{\sqrt{mp}}\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}(R+B)\left|G^{s}(u_{i})(y_{j})-z_{j}^{i}\right|
\leq\frac{\sqrt{n_{1}n_{2}}(R+B)}{\sqrt{mp}}\|G^{s}(u)-z\|_{2}
\leq\frac{\sqrt{n_{1}n_{2}}(R+B)}{\sqrt{mp}}\exp(-(\lambda_{0}+\tilde{\lambda}_{0})s/2)\|z-G^{0}(u)\|_{2},

where the last inequality follows from Lemma 6. Then, arguing as in the proof of Lemma 7, the conclusion follows. ∎

Lemma 9.

If R^{\prime}<R and \tilde{R}^{\prime}<\tilde{R}, then for all t\geq 0 the following two conclusions hold:

  • \lambda_{min}(H(t))\geq\frac{\lambda_{0}}{2} and \lambda_{min}(\tilde{H}(t))\geq\frac{\tilde{\lambda}_{0}}{2};

  • \|w_{r}(t)-w_{r}(0)\|_{2}\leq R^{\prime} and \|\tilde{w}_{rk}(t)-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R}^{\prime} for any r\in[m], k\in[p].

Proof.

The proof is by contradiction. Suppose t>0 is the smallest time at which the two conclusions fail; then either conclusion 1 fails or conclusion 2 fails.

If conclusion 1 fails, i.e., either \lambda_{min}(H(t))<\frac{\lambda_{0}}{2} or \lambda_{min}(\tilde{H}(t))<\frac{\tilde{\lambda}_{0}}{2}, then Lemma 3 implies that there exists r\in[m] with \|w_{r}(s)-w_{r}(0)\|_{2}>R>R^{\prime}, or there exist r\in[m], k\in[p] with \|\tilde{w}_{rk}(s)-\tilde{w}_{rk}(0)\|_{2}>\tilde{R}>\tilde{R}^{\prime}. This shows that conclusion 2 fails as well, which contradicts the minimality of t.

If conclusion 2 fails, then either there exists r\in[m] with \|w_{r}(t)-w_{r}(0)\|_{2}>R^{\prime}, or there exist r\in[m], k\in[p] with \|\tilde{w}_{rk}(t)-\tilde{w}_{rk}(0)\|_{2}>\tilde{R}^{\prime}. If \|w_{r}(t)-w_{r}(0)\|_{2}>R^{\prime}, then Lemma 7 implies that there exists s<t such that \lambda_{min}(H(s))<\frac{\lambda_{0}}{2} or \lambda_{min}(\tilde{H}(s))<\frac{\tilde{\lambda}_{0}}{2}, or there exist r\in[m], k\in[p] with \|\tilde{w}_{rk}(s)-\tilde{w}_{rk}(0)\|_{2}>\tilde{R}^{\prime}; either way this contradicts the minimality of t. The remaining case is similar. ∎

Proof of Theorem 1.

Theorem 1 is a direct corollary of Lemma 6 and Lemma 9. Thus, it remains only to clarify the requirements for m so that Lemma 6, Lemma 7 and Lemma 8 hold. First, R and \tilde{R} should ensure that \|H(0)-H^{\infty}\|_{2}\leq\lambda_{0}/4 and \|\tilde{H}(0)-\tilde{H}^{\infty}\|_{2}\leq\tilde{\lambda}_{0}/4, i.e.,

R\lesssim\frac{\min(\lambda_{0},\tilde{\lambda}_{0})}{n_{1}n_{2}\log\left(\frac{m}{\delta}\right)},\quad\tilde{R}\lesssim\frac{\min(\lambda_{0},\tilde{\lambda}_{0})}{n_{1}n_{2}\sqrt{p}\log\left(\frac{m}{\delta}\right)}. (38)

Combining this with the requirement that R^{\prime}<R and \tilde{R}^{\prime}<\tilde{R}, we can deduce that

m=\Omega\left(\frac{n_{1}^{4}n_{2}^{4}\log\left(\frac{n}{\delta}\right)\log^{3}\left(\frac{m}{\delta}\right)}{(\min(\lambda_{0},\tilde{\lambda}_{0}))^{2}(\lambda_{0}+\tilde{\lambda}_{0})^{2}}\right).

Moreover, this requirement on m also implies that

n_{2}\exp(-mR)\lesssim\delta,\quad n_{1}\exp(-mp\tilde{R})\lesssim\delta,

which are the confidence levels appearing in Lemma 3. ∎

8. Proof of Discrete Time Analysis

8.1. Proof of Lemma 4

Proof.

First, we can decompose G^{t+1}(u_{i})(y_{j})-G^{t}(u_{i})(y_{j}) as follows.

G^{t+1}(u_{i})(y_{j})-G^{t}(u_{i})(y_{j}) (39)
=G^{t+1}(u_{i})(y_{j})-G^{t}(u_{i})(y_{j})-\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle
\quad+\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle+\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle.

Note that

\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle (40)
=\sum\limits_{r=1}^{m}\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle
=-\eta\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\sum\limits_{r=1}^{m}\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w_{r}},\frac{\partial G^{t}(u_{i_{1}})(y_{j_{1}})}{\partial w_{r}}\right\rangle(G^{t}(u_{i_{1}})(y_{j_{1}})-z_{j_{1}}^{i_{1}})

and

\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle (41)
=\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\right\rangle
=-\eta\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\frac{\partial G^{t}(u_{i_{1}})(y_{j_{1}})}{\partial\tilde{w}_{rk}}\right\rangle(G^{t}(u_{i_{1}})(y_{j_{1}})-z_{j_{1}}^{i_{1}}).

Plugging (40) and (41) into (39) yields that

G^{t+1}(u_{i})(y_{j})-G^{t}(u_{i})(y_{j}) =I_{i,j}(t)-\eta[(H_{1}^{i}(t),\cdots,H_{n_{1}}^{i}(t))+(\tilde{H}_{1}^{i}(t),\cdots,\tilde{H}_{n_{1}}^{i}(t))]_{j}(G^{t}(u)-z),

where [A]_{j} denotes the j-th row of the matrix A and I(t)\in\mathbb{R}^{n_{1}n_{2}} is divided into n_{1} blocks, the j-th component of the i-th block being defined as

I_{i,j}(t)=G^{t+1}(u_{i})(y_{j})-G^{t}(u_{i})(y_{j})-\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle.

Thus, we have

G^{t+1}(u)-G^{t}(u)=I(t)-\eta(H(t)+\tilde{H}(t))(G^{t}(u)-z).

By using a simple algebraic transformation, we have

z-G^{t+1}(u) =z-G^{t}(u)-I(t)-\eta(H(t)+\tilde{H}(t))(z-G^{t}(u)) (42)
=\left(I-\eta(H(t)+\tilde{H}(t))\right)(z-G^{t}(u))-I(t).
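The recursion just derived is a contraction up to the perturbation I(t). The sketch below simulates it with frozen SPD matrices and a perturbation of relative size \epsilon (both assumptions made only for this illustration, the latter standing in for the bound of Lemma 5):

```python
import numpy as np

# Iterate z - G^{t+1} = (I - eta(H + H~))(z - G^t) - I(t) with
# ||I(t)||_2 <= eps ||z - G^t||_2, and watch the residual decay geometrically.
rng = np.random.default_rng(5)
n, eta, eps = 6, 0.05, 1e-3
M = rng.standard_normal((n, n)); H = M @ M.T / n + 0.5 * np.eye(n)
M = rng.standard_normal((n, n)); Ht = M @ M.T / n + 0.5 * np.eye(n)
r = rng.standard_normal(n)
for t in range(201):
    if t % 50 == 0:
        print(t, np.linalg.norm(r))  # geometric decay of the residual
    g = rng.standard_normal(n)
    noise = eps * np.linalg.norm(r) * g / np.linalg.norm(g)
    r = (np.eye(n) - eta * (H + Ht)) @ r - noise
```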

8.2. Proof of Lemma 5

Proof.

We first write out explicitly the j-th component of the i-th block of the residual term I(s) as follows.

I_{i,j}(s) =G^{s+1}(u_{i})(y_{j})-G^{s}(u_{i})(y_{j})-\left\langle\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w},w(s+1)-w(s)\right\rangle-\left\langle\frac{\partial G^{s}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(s+1)-\tilde{w}(s)\right\rangle (43)
=G^{s+1}(u_{i})(y_{j})-G^{s}(u_{i})(y_{j})-\sum\limits_{r=1}^{m}\left\langle\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w_{r}},w_{r}(s+1)-w_{r}(s)\right\rangle
\quad-\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\langle\frac{\partial G^{s}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s)\right\rangle.

From the forms of G^{s}(u_{i})(y_{j}), \frac{\partial G^{s}(u_{i})(y_{j})}{\partial w_{r}} and \frac{\partial G^{s}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}}, we have

G^{s+1}(u_{i})(y_{j})-G^{s}(u_{i})(y_{j}) (44)
=\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})\right]\sigma(w_{r}(s+1)^{T}y_{j})-\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]\sigma(w_{r}(s)^{T}y_{j})
=\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})\right](\sigma(w_{r}(s+1)^{T}y_{j})-\sigma(w_{r}(s)^{T}y_{j}))
\quad+\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\left[\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]\right]\sigma(w_{r}(s)^{T}y_{j}).

and

\left\langle\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w_{r}},w_{r}(s+1)-w_{r}(s)\right\rangle (45)
=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]I\{w_{r}(s)^{T}y_{j}\geq 0\}(w_{r}(s+1)-w_{r}(s))^{T}y_{j}
and

\left\langle\frac{\partial G^{s}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s)\right\rangle (46)
=\frac{1}{\sqrt{m}}\frac{\tilde{a}_{rk}}{\sqrt{p}}(\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s))^{T}u_{i}I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}\sigma(w_{r}(s)^{T}y_{j}).

Thus, we can decompose I_{i,j}(s) as follows:

I_{i,j}(s)=\sum\limits_{r=1}^{m}\left(\tilde{I}_{i,j}^{r}(s)+\bar{I}_{i,j}^{r}(s)\right),

where

\tilde{I}_{i,j}^{r}(s) =\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\left[\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s))^{T}u_{i}\right]\right] (47)
\quad\cdot\sigma(w_{r}(s)^{T}y_{j})

and

\bar{I}_{i,j}^{r}(s) =\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})\right](\sigma(w_{r}(s+1)^{T}y_{j})-\sigma(w_{r}(s)^{T}y_{j})) (48)
\quad-\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]I\{w_{r}(s)^{T}y_{j}\geq 0\}(w_{r}(s+1)-w_{r}(s))^{T}y_{j}.

For \tilde{I}_{i,j}^{r}(s), we replace \tilde{R} in the definition of \tilde{A}_{rk}^{i} by \tilde{R}^{\prime} and, for simplicity, still denote the event by \tilde{A}_{rk}^{i}, i.e.,

\tilde{A}_{rk}^{i}:=\{\exists w:\|w-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R}^{\prime},\ I\{w^{T}u_{i}\geq 0\}\neq I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0\}\}

and \tilde{S}_{i}=\{(r,k)\in[m]\times[p]:I\{\tilde{A}_{rk}^{i}\}=0\}.

From the induction hypothesis, we know \|\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R}^{\prime} and \|\tilde{w}_{rk}(s)-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R}^{\prime}. Thus, I\{\tilde{w}_{rk}(s+1)^{T}u_{i}\geq 0\}=I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\} holds for any (r,k)\in\tilde{S}_{i}. From this fact, we can deduce that for any (r,k)\in\tilde{S}_{i},

\left|\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s))^{T}u_{i}\right| (49)
=|(\tilde{w}_{rk}(s+1)^{T}u_{i})I\{\tilde{w}_{rk}(s+1)^{T}u_{i}\geq 0\}-(\tilde{w}_{rk}(s)^{T}u_{i})I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}
\quad-I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s))^{T}u_{i}|
=|(\tilde{w}_{rk}(s+1)^{T}u_{i})I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}-(\tilde{w}_{rk}(s)^{T}u_{i})I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}
\quad-I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s))^{T}u_{i}|
=0.

On the other hand, for any (r,k)\in[m]\times[p], we have

|\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s))^{T}u_{i}|\lesssim\|\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s)\|_{2}. (50)
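The displays (49) and (50) encode the familiar property of the first-order ReLU expansion: the error vanishes when the activation pattern does not flip, and is at most the step size otherwise. A direct pointwise check (purely illustrative, with our own sample sizes):

```python
import numpy as np

# For the ReLU sigma, the expansion error |sigma(b) - sigma(a) - I{a>=0}(b-a)|
# is zero whenever a and b share a sign, and never exceeds |b - a|.
rng = np.random.default_rng(6)
a, b = rng.standard_normal(100_000), rng.standard_normal(100_000)
relu = lambda x: np.maximum(x, 0.0)
err = np.abs(relu(b) - relu(a) - (a >= 0) * (b - a))
same_sign = (a >= 0) == (b >= 0)
print(err[same_sign].max(), bool(np.all(err <= np.abs(b - a) + 1e-12)))
```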

Thus, combining (47), (49) and (50) yields that

\sum\limits_{r=1}^{m}|\tilde{I}_{i,j}^{r}(s)| \leq\frac{B}{\sqrt{mp}}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\|\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s)\|_{2}I\{(r,k)\in\tilde{S}_{i}^{\perp}\} (51)
=\frac{B}{\sqrt{mp}}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\|-\eta\frac{\partial L(W(s),\tilde{W}(s))}{\partial\tilde{w}_{rk}}\right\|_{2}I\{(r,k)\in\tilde{S}_{i}^{\perp}\}
=\frac{B}{\sqrt{mp}}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\|-\eta\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\frac{\partial G^{s}(u_{i_{1}})(y_{j_{1}})}{\partial\tilde{w}_{rk}}(G^{s}(u_{i_{1}})(y_{j_{1}})-z_{j_{1}}^{i_{1}})\right\|_{2}I\{(r,k)\in\tilde{S}_{i}^{\perp}\}
\leq\eta B^{2}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\frac{\sqrt{n_{1}n_{2}}}{mp}\|z-G^{s}(u)\|_{2}I\{(r,k)\in\tilde{S}_{i}^{\perp}\}
=\eta B^{2}\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\frac{1}{mp}I\{(r,k)\in\tilde{S}_{i}^{\perp}\}
=\eta B^{2}\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\frac{1}{mp}I\{\tilde{A}_{rk}^{i}\},

where the second inequality follows from the Cauchy–Schwarz inequality and the form of \frac{\partial G^{s}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}}, i.e.,

\frac{\partial G^{s}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}}=\frac{1}{\sqrt{m}}\frac{\tilde{a}_{rk}}{\sqrt{p}}u_{i}I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}\sigma(w_{r}(s)^{T}y_{j}).

From the Bernstein inequality, we have that with probability at least 1-n_{1}\exp(-mp\tilde{R}^{\prime}),

\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\frac{1}{mp}I\{\tilde{A}_{rk}^{i}\}\lesssim\tilde{R}^{\prime}.

This leads to the final upper bound:

\sum\limits_{r=1}^{m}|\tilde{I}_{i,j}^{r}(s)| \lesssim\eta B^{2}\sqrt{n_{1}n_{2}}\tilde{R}^{\prime}\|z-G^{s}(u)\|_{2} (52)
\lesssim\frac{\eta n_{1}n_{2}\|z-G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right)\|z-G^{s}(u)\|_{2}.

It remains to bound \bar{I}_{i,j}^{r}(s), which can be written as follows.

\bar{I}_{i,j}^{r}(s)=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})\right](\sigma(w_{r}(s+1)^{T}y_{j})-\sigma(w_{r}(s)^{T}y_{j})) (53)
\quad-\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]I\{w_{r}(s)^{T}y_{j}\geq 0\}(w_{r}(s+1)-w_{r}(s))^{T}y_{j}
=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i}))\right]\left(\sigma(w_{r}(s+1)^{T}y_{j})-\sigma(w_{r}(s)^{T}y_{j})\right)
\quad-\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i}))\right]I\{w_{r}(s)^{T}y_{j}\geq 0\}(w_{r}(s+1)-w_{r}(s))^{T}y_{j}
\quad+\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\left(\sigma(w_{r}(s+1)^{T}y_{j})-\sigma(w_{r}(s)^{T}y_{j})-I\{w_{r}(s)^{T}y_{j}\geq 0\}(w_{r}(s+1)-w_{r}(s))^{T}y_{j}\right).

Note that

|\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i})|\lesssim\tilde{R}^{\prime},\quad|\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i})|\lesssim\tilde{R}^{\prime},

thus we can bound the first and second terms by

\frac{\sqrt{p}\tilde{R}^{\prime}}{\sqrt{m}}\|w_{r}(s+1)-w_{r}(s)\|_{2}\lesssim\frac{\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{m(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{m}{\delta}\right)}\|w_{r}(s+1)-w_{r}(s)\|_{2}. (54)

For the third term in (53), we also replace R in the definition of A_{jr} by R^{\prime} and still denote the event by A_{jr}. Recall that

A_{jr}:=\{\exists w:\|w-w_{r}(0)\|_{2}\leq R^{\prime},\ I\{w^{T}y_{j}\geq 0\}\neq I\{w_{r}(0)^{T}y_{j}\geq 0\}\} (55)

and S_{j}=\{r\in[m]:I\{A_{jr}\}=0\}. Note that \|w_{r}(s+1)-w_{r}(0)\|_{2}\leq R^{\prime} and \|w_{r}(s)-w_{r}(0)\|_{2}\leq R^{\prime}; thus for r\in S_{j}, we have I\{w_{r}(s+1)^{T}y_{j}\geq 0\}=I\{w_{r}(s)^{T}y_{j}\geq 0\}. Combining this fact with (53) and (54), we can deduce that

|\bar{I}_{i,j}^{r}(s)|\lesssim\frac{\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{m(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{m}{\delta}\right)}\|w_{r}(s+1)-w_{r}(s)\|_{2}+\frac{1}{\sqrt{m}}\sqrt{\log\left(\frac{m}{\delta}\right)}\|w_{r}(s+1)-w_{r}(s)\|_{2}I\{r\in S_{j}^{\perp}\}. (56)

Thus, we have to bound \|w_{r}(s+1)-w_{r}(s)\|_{2}. From the gradient descent update formula, we have

\|w_{r}(s+1)-w_{r}(s)\|_{2} =\left\|-\eta\frac{\partial L(W(s),\tilde{W}(s))}{\partial w_{r}}\right\|_{2} (57)
\leq\eta\|z-G^{s}(u)\|_{2}\sqrt{\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\left\|\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w_{r}}\right\|_{2}^{2}}.

Recall that

\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w_{r}}=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]y_{j}I\{w_{r}(s)^{T}y_{j}\geq 0\}.

Therefore,

\left\|\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w_{r}}\right\|_{2} (58)
=\left\|\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\left[\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i})+\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right]y_{j}I\{w_{r}(s)^{T}y_{j}\geq 0\}\right\|_{2}
\lesssim\frac{\sqrt{p}\tilde{R}^{\prime}}{\sqrt{m}}+\frac{1}{\sqrt{m}}\sqrt{\log\left(\frac{m}{\delta}\right)}
\lesssim\frac{1}{\sqrt{m}}\frac{\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{m}{\delta}\right)}+\frac{1}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}
\lesssim\frac{1}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)},

where the last inequality holds because m is taken sufficiently large in the end.

Combining (57) and (58) yields that

\|w_{r}(s+1)-w_{r}(s)\|_{2}\lesssim\frac{\eta\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}. (59)

Plugging this into (56) leads to

|\bar{I}_{i,j}^{r}(s)|\lesssim\frac{\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{m(\lambda_{0}+\tilde{\lambda}_{0})}\frac{\eta\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}}{\sqrt{m}}\log\left(\frac{m}{\delta}\right)+\frac{\eta\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}}{m}I\{r\in S_{j}^{\perp}\}\log\left(\frac{m}{\delta}\right).

By applying Bernstein's inequality, we have that with probability at least 1-n_{2}\exp(-mR^{\prime}),

\frac{1}{m}\sum\limits_{r=1}^{m}I\{r\in S_{j}^{\perp}\}=\frac{1}{m}\sum\limits_{r=1}^{m}I\{A_{jr}\}\lesssim R^{\prime}.

Therefore,

\sum\limits_{r=1}^{m}|\bar{I}_{i,j}^{r}(s)| \lesssim\frac{\eta n_{1}n_{2}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log\left(\frac{m}{\delta}\right)\|z-G^{s}(u)\|_{2}+\frac{\eta n_{1}n_{2}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right)\|z-G^{s}(u)\|_{2} (60)
\lesssim\frac{\eta n_{1}n_{2}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right)\|z-G^{s}(u)\|_{2}.

From (52) and (60), we have

|I_{i,j}(s)| \leq\sum\limits_{r=1}^{m}|\tilde{I}_{i,j}^{r}(s)|+\sum\limits_{r=1}^{m}|\bar{I}_{i,j}^{r}(s)| (61)
\lesssim\left(\frac{\eta n_{1}n_{2}\|z-G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right)+\frac{\eta n_{1}n_{2}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right)\right)\|z-G^{s}(u)\|_{2}
\lesssim\frac{\eta n_{1}n_{2}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right)\|z-G^{s}(u)\|_{2}.

Therefore,

\|I(s)\|_{2}\lesssim\bar{R}\|z-G^{s}(u)\|_{2}, (62)

where

\bar{R}=\frac{\eta(n_{1}n_{2})^{\frac{3}{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right). (63)

8.3. Proof of Corollary 1

Proof.

Note that when \|I(s)\|_{2}\leq\eta(\lambda_{0}+\tilde{\lambda}_{0})\|z-G^{s}(u)\|_{2}/12, \lambda_{min}(H(s))\geq\lambda_{0}/2, \lambda_{min}(\tilde{H}(s))\geq\tilde{\lambda}_{0}/2 and I-\eta(H(s)+\tilde{H}(s)) is positive definite, we have that for s=0,\cdots,t-1,

\|z-G^{s+1}(u)\|_{2}^{2} (64)
=\|\left[I-\eta(H(s)+\tilde{H}(s))\right](z-G^{s}(u))\|_{2}^{2}+\|I(s)\|_{2}^{2}-2\left\langle\left[I-\eta(H(s)+\tilde{H}(s))\right](z-G^{s}(u)),I(s)\right\rangle
\leq(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2})^{2}\|z-G^{s}(u)\|_{2}^{2}+\|I(s)\|_{2}^{2}+2(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2})\|z-G^{s}(u)\|_{2}\|I(s)\|_{2}
\leq\left(1-\eta(\lambda_{0}+\tilde{\lambda}_{0})+\frac{\eta^{2}(\lambda_{0}+\tilde{\lambda}_{0})^{2}}{4}+\left(\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{12}\right)^{2}+2\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{12}\right)\|z-G^{s}(u)\|_{2}^{2}
\leq\left(1-\eta(\lambda_{0}+\tilde{\lambda}_{0})+\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{4}+\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{12}+2\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{12}\right)\|z-G^{s}(u)\|_{2}^{2}
=\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)\|z-G^{s}(u)\|_{2}^{2}.

Thus,

\|z-G^{s}(u)\|_{2}^{2}\leq\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)^{s}\|z-G^{0}(u)\|_{2}^{2}

holds for s=0,\cdots,t.
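The chain of inequalities in (64) reduces to elementary arithmetic in x=\eta(\lambda_{0}+\tilde{\lambda}_{0}); the step from the third to the fourth line uses \eta(\lambda_{0}+\tilde{\lambda}_{0})\leq 1, which is guaranteed by the step size chosen later in this proof. A one-line numerical verification of the arithmetic:

```python
import numpy as np

# Check: 1 - x + x^2/4 + (x/12)^2 + 2x/12 <= 1 - x/2 for x in (0, 1].
x = np.linspace(1e-6, 1.0, 10_000)
lhs = 1 - x + x ** 2 / 4 + (x / 12) ** 2 + 2 * x / 12
print(bool(np.all(lhs <= 1 - x / 2)))  # True on the whole range
```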

Now, we have to derive the requirement on m such that these conditions hold. First, from Lemma 3, when

R\lesssim\frac{\min(\lambda_{0},\tilde{\lambda}_{0})}{n_{1}n_{2}\log\left(\frac{m}{\delta}\right)},\quad\tilde{R}\lesssim\frac{\min(\lambda_{0},\tilde{\lambda}_{0})}{n_{1}n_{2}\sqrt{p}\sqrt{\log\left(\frac{m}{\delta}\right)}},

we have \|H(s)-H(0)\|_{2}\leq\frac{\lambda_{0}}{4}, \|\tilde{H}(s)-\tilde{H}(0)\|_{2}\leq\frac{\tilde{\lambda}_{0}}{4}, and hence \lambda_{min}(H(s))\geq\lambda_{0}/2, \lambda_{min}(\tilde{H}(s))\geq\tilde{\lambda}_{0}/2. Thus, when R^{\prime}<R and \tilde{R}^{\prime}<\tilde{R}, we have \lambda_{min}(H(s))\geq\lambda_{0}/2 and \lambda_{min}(\tilde{H}(s))\geq\tilde{\lambda}_{0}/2. Specifically, m needs to satisfy

m=\Omega\left(\frac{n_{1}^{4}n_{2}^{4}\log^{3}\left(\frac{m}{\delta}\right)\log\left(\frac{n_{1}n_{2}}{\delta}\right)}{(\min(\lambda_{0},\tilde{\lambda}_{0}))^{2}(\lambda_{0}+\tilde{\lambda}_{0})^{2}}\right). (65)

Moreover, at this point, we can deduce that

\|H(s)\|_{2} \leq\|H(s)-H(0)\|_{2}+\|H(0)\|_{2}
\leq\|H(s)-H(0)\|_{2}+\|H(0)-H^{\infty}\|_{2}+\|H^{\infty}\|_{2}
\leq\frac{\lambda_{0}}{4}+\frac{\lambda_{0}}{4}+\|H^{\infty}\|_{2}
\leq\frac{3}{2}\|H^{\infty}\|_{2}

and similarly, \|\tilde{H}(s)\|_{2}\leq\frac{3}{2}\|\tilde{H}^{\infty}\|_{2}. Thus, \eta=\mathcal{O}\left(\frac{1}{\|H^{\infty}\|_{2}+\|\tilde{H}^{\infty}\|_{2}}\right) is sufficient to ensure that I-\eta(H(s)+\tilde{H}(s)) is positive definite.

Second, we need to ensure that \|I(s)\|_{2}\leq\eta(\lambda_{0}+\tilde{\lambda}_{0})\|z-G^{s}(u)\|_{2}/12. From (63), it suffices that \bar{R}\lesssim\eta(\lambda_{0}+\tilde{\lambda}_{0}), i.e.,

m=\Omega\left(\frac{n_{1}^{4}n_{2}^{4}\log(\frac{n}{\delta})\log^{3}(\frac{m}{\delta})}{(\lambda_{0}+\tilde{\lambda}_{0})^{4}}\right). (66)

Combining these requirements on m, i.e., (65), (66) and the condition in Lemma 2, leads to the desired conclusion. ∎

8.4. Proof of Theorem 2

Proof.

From Corollary 1, it remains only to verify that Condition 1 also holds for s=t+1. Note that in (59), we have proven that

\|w_{r}(s+1)-w_{r}(s)\|_{2}\lesssim\frac{\eta\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{m}{\delta}\right)}

holds for s=0,\cdots,t and r\in[m].

Combining this with Corollary 1 yields that

\|w_{r}(t+1)-w_{r}(0)\|_{2} \leq\sum\limits_{s=0}^{t}\|w_{r}(s+1)-w_{r}(s)\|_{2}
\lesssim\sum\limits_{s=0}^{t}\frac{\eta\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{m}{\delta}\right)}
\lesssim\sum\limits_{s=0}^{t}\frac{\eta\sqrt{n_{1}n_{2}}}{\sqrt{m}}\sqrt{\log\left(\frac{m}{\delta}\right)}\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)^{s/2}\|z-G^{0}(u)\|_{2}
\lesssim\frac{\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{m}{\delta}\right)}.

Similarly, in (51), we have proven that

\|\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s)\|_{2}\lesssim\frac{\eta B\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}}{\sqrt{mp}},

which yields that

\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(0)\|_{2}\lesssim\frac{\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{m}{\delta}\right)}.

Moreover, from the triangle inequality, we have

|w_{r}(t+1)^{T}y_{j}| \leq|(w_{r}(t+1)-w_{r}(0))^{T}y_{j}|+|w_{r}(0)^{T}y_{j}|
\leq\|w_{r}(t+1)-w_{r}(0)\|_{2}+\sqrt{2\log\left(\frac{2mn_{2}}{\delta}\right)}
\leq 2\sqrt{\log\left(\frac{mn_{2}}{\delta}\right)}.

8.5. Proof of Lemma 6

Proof.

First, for H^{\infty}=H_{1}^{\infty}\otimes H_{2}^{\infty}, recall that the Kronecker product of two strictly positive definite matrices is also strictly positive definite; thus it suffices to demonstrate that H_{1}^{\infty} and H_{2}^{\infty} are both strictly positive definite. For H_{1}^{\infty}, similar to the proof of Lemma 1, let \mathcal{H} be the Hilbert space of integrable functions on \mathbb{R}^{d+1}, i.e., f\in\mathcal{H} if \mathbb{E}_{\tilde{w}\sim\mathcal{N}(\bm{0},\bm{I})}[|f(\tilde{w})|^{2}]<\infty. Now, proving that H_{1}^{\infty} is strictly positive definite is equivalent to showing that \phi(u_{1})(\tilde{w}),\cdots,\phi(u_{n_{1}})(\tilde{w})\in\mathcal{H} are linearly independent, where \phi(u_{i})(\tilde{w})=\sigma(\tilde{w}^{T}u_{i}). This has been proved in [19]; we provide a different proof for completeness, which also establishes the strict positive definiteness of \tilde{H}^{\infty}. Suppose that there are \alpha_{1},\cdots,\alpha_{n_{1}}\in\mathbb{R} such that

\alpha_{1}\phi(u_{1})+\cdots+\alpha_{n_{1}}\phi(u_{n_{1}})=0\ \text{in}\ \mathcal{H},

which implies that

\alpha_{1}\phi(u_{1})(\tilde{w})+\cdots+\alpha_{n_{1}}\phi(u_{n_{1}})(\tilde{w})=0

holds for all \tilde{w}\in\mathbb{R}^{d+1}, due to the continuity of each \phi(u_{i})(\cdot).

Let \tilde{D}_{i}=\{\tilde{w}\in\mathbb{R}^{d+1}:\tilde{w}^{T}u_{i}=0\} for i\in[n_{1}]; then Lemma A.1 in [17] implies that when no two samples in \{u_{i}\}_{i=1}^{n_{1}} are parallel, \tilde{D}_{i}\not\subset\cup_{j\neq i}\tilde{D}_{j} for any i\in[n_{1}]. Thus, we can choose \tilde{w}_{0}\in\tilde{D}_{i}\backslash\cup_{j\neq i}\tilde{D}_{j}. Since \cup_{j\neq i}\tilde{D}_{j} is a closed set, there is a positive constant r_{0} such that B_{r_{0}}(\tilde{w}_{0})\cap(\cup_{j\neq i}\tilde{D}_{j})=\emptyset. This fact implies that \phi(u_{j})(\cdot) is differentiable in B_{r_{0}}(\tilde{w}_{0}) for each j\neq i, and hence \alpha_{i}\phi(u_{i})(\cdot) is also differentiable in B_{r_{0}}(\tilde{w}_{0}). However, \phi(u_{i})(\tilde{w})=\sigma(\tilde{w}^{T}u_{i}) is not differentiable in B_{r_{0}}(\tilde{w}_{0}). Thus we can deduce that \alpha_{i}=0. Similarly, \alpha_{j}=0 for all j\in[n_{1}], which implies that H_{1}^{\infty} is strictly positive definite. For H_{2}^{\infty}, which can be viewed as the Gram matrix of a PINN, Lemma 3.2 in [20] implies that it is strictly positive definite.
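The linear-independence argument can also be sanity-checked by Monte Carlo: estimate the Gram matrix \mathbb{E}[\sigma(w^{T}u_{i})\sigma(w^{T}u_{j})] for non-parallel unit samples and confirm that its smallest eigenvalue is strictly positive (the sizes below are arbitrary illustrative choices, not part of the proof):

```python
import numpy as np

# Monte Carlo estimate of the Gram matrix E[sigma(w^T u_i) sigma(w^T u_j)]
# for n non-parallel unit samples; its smallest eigenvalue is strictly positive.
rng = np.random.default_rng(7)
d, n, m = 4, 6, 500_000
U = rng.standard_normal((n, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
W = rng.standard_normal((m, d))
feats = np.maximum(W @ U.T, 0.0)      # sigma(w_s^T u_i), s = 1..m
G = feats.T @ feats / m               # Monte Carlo estimate of the Gram matrix
print(np.linalg.eigvalsh(G).min())    # strictly positive
```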

Second, for \tilde{H}^{\infty}, recall that \tilde{H}^{\infty}=\tilde{H}_{1}^{\infty}\otimes\tilde{H}_{2}^{\infty}. Note that the (i,j)-th entry of \tilde{H}_{1}^{\infty} is \mathbb{E}[u_{i}^{T}u_{j}I\{\tilde{w}^{T}u_{i}\geq 0,\tilde{w}^{T}u_{j}\geq 0\}]. Thus, Theorem 3.1 in [17] implies that \tilde{H}_{1}^{\infty} is strictly positive definite. For \tilde{H}_{2}^{\infty}, let

\phi_{1}(w)=\mathcal{L}(\sigma_{3}(w^{T}y_{1})),\cdots,\phi_{n_{2}}(w)=\mathcal{L}(\sigma_{3}(w^{T}y_{n_{2}})),\quad\psi_{1}(w)=\sigma_{3}(w^{T}\tilde{y}_{1}),\cdots,\psi_{n_{3}}(w)=\sigma_{3}(w^{T}\tilde{y}_{n_{3}})

and let \mathcal{H} be the Hilbert space of integrable (d+1)-dimensional vector fields on \mathbb{R}^{d+1}. Suppose that there are \alpha_{1},\cdots,\alpha_{n_{2}},\beta_{1},\cdots,\beta_{n_{3}}\in\mathbb{R} such that

\alpha_{1}\phi_{1}+\cdots+\alpha_{n_{2}}\phi_{n_{2}}+\beta_{1}\psi_{1}+\cdots+\beta_{n_{3}}\psi_{n_{3}}=0\ \text{in}\ \mathcal{H},

which yields that

\alpha_{1}\phi_{1}(w)+\cdots+\alpha_{n_{2}}\phi_{n_{2}}(w)+\beta_{1}\psi_{1}(w)+\cdots+\beta_{n_{3}}\psi_{n_{3}}(w)=0

holds for all w\in\mathbb{R}^{d+1}.

Let D_{i}=\{w\in\mathbb{R}^{d+1}:w^{T}y_{i}=0\} for i\in[n_{2}] and \bar{D}_{i}=\{w\in\mathbb{R}^{d+1}:w^{T}\tilde{y}_{i}=0\} for i\in[n_{3}]. As before, D_{i}\not\subset(\cup_{j\neq i}D_{j})\cup(\cup_{j}\bar{D}_{j}) for any i\in[n_{2}]. Similarly, we can choose w_{0}\in D_{i} and r_{0}>0 such that B_{r_{0}}(w_{0})\cap\left((\cup_{j\neq i}D_{j})\cup(\cup_{j}\bar{D}_{j})\right)=\emptyset. Note that \phi_{j} (j\neq i) and \psi_{j} are differentiable in B_{r_{0}}(w_{0}); thus \alpha_{i}\phi_{i} is also differentiable in B_{r_{0}}(w_{0}), implying that \alpha_{i}=0. Therefore, \alpha_{i}=0 for all i\in[n_{2}]. Moreover, similar to the proof of the strict positive definiteness of H_{1}^{\infty}, we can also deduce that \beta_{i}=0 for all i\in[n_{3}]. Finally, \tilde{H}^{\infty} is strictly positive definite.

8.6. Proof of Lemma 7

Proof.

First, for H(0)-H^{\infty}, we consider its (i,j)-th block, whose entries have the following form:

\frac{1}{m}\sum\limits_{r=1}^{m}X_{r}Y_{r}-\mathbb{E}[X_{1}Y_{1}],

where

X_{r}=\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u_{j})\right],

for j_{1}\in[n_{2}], j_{2}\in[n_{2}],

Y_{r}=\frac{1}{n_{2}}\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))}{\partial w_{r}}\right\rangle,

for j_{1}\in[n_{2}], n_{2}+j_{2}\in[n_{2},n_{2}+n_{3}],

Y_{r}=\frac{1}{\sqrt{n_{2}n_{3}}}\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle,

and for n_{2}+j_{1}\in[n_{2},n_{2}+n_{3}], n_{2}+j_{2}\in[n_{2},n_{2}+n_{3}],

Yr=1n3σ3(wr(0)Ty~j1)wr,σ3(wr(0)Ty~j2)wr.Y_{r}=\frac{1}{n_{3}}\left\langle\frac{\partial\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j_{1}})}{\partial w_{r}},\frac{\partial\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle.

To use the concentration inequality, we need to clarify the order of the sub-Weil random variable XrYr𝔼[X1Y1]X_{r}Y_{r}-\mathbb{E}[X_{1}Y_{1}]. Note that Lemma 18 implies that

Xrψ11pk=1pa~rk(0)σ(w~rk(0)Tui)ψ21pk=1pa~rk(0)σ(w~rk(0)Tuj)ψ2=𝒪(1).\|X_{r}\|_{\psi_{1}}\leq\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right\|_{\psi_{2}}\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u_{j})\right\|_{\psi_{2}}=\mathcal{O}(1).

On the other hand, from

\mathcal{L}(\sigma_{3}(w_{r}^{T}y))=w_{r0}\sigma_{2}(w_{r}^{T}y)+\|w_{r1}\|_{2}^{2}\sigma(w_{r}^{T}y)+\sigma_{3}(w_{r}^{T}y),

we have

\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y))}{\partial w_{r}}=\begin{pmatrix}1\\ 0_{d}\end{pmatrix}\sigma_{2}(w_{r}^{T}y)+w_{r0}y\sigma(w_{r}^{T}y)+2\begin{pmatrix}0\\ w_{r1}\end{pmatrix}\sigma(w_{r}^{T}y)+\|w_{r1}\|_{2}yI\{w_{r}^{T}y\geq 0\}+y\sigma_{2}(w_{r}^{T}y)

and

\frac{\partial\sigma_{3}(w_{r}^{T}y)}{\partial w_{r}}=y\sigma_{2}(w_{r}^{T}y).

Therefore, we can deduce that

\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y))}{\partial w_{r}}\right\|_{2}\lesssim|w_{r}^{T}y|^{2}+\|w_{r}\|_{2}|w_{r}^{T}y|+|w_{r}^{T}y|^{2}\lesssim\|w_{r}\|_{2}|w_{r}^{T}y|

and

\left\|\frac{\partial\sigma_{3}(w_{r}^{T}y)}{\partial w_{r}}\right\|_{2}\lesssim|w_{r}^{T}y|^{2}\lesssim\|w_{r}\|_{2}|w_{r}^{T}y|.

Note that \|w_{r}^{T}y\|_{\psi_{2}}=\mathcal{O}(1) for \|y\|_{2}=\mathcal{O}(1), thus \|\sigma_{2}(w_{r}^{T}y)\|_{\psi_{1}}=\mathcal{O}(1) and \|\sigma_{3}(w_{r}^{T}y)\|_{\psi_{\frac{2}{3}}}=\mathcal{O}(1). Thus Lemma 20 implies that

\|\|w_{r}\|_{2}^{2}|w_{r}^{T}y_{j_{1}}||w_{r}^{T}y_{j_{2}}|\|_{\psi_{\frac{1}{2}}}\leq\|\|w_{r}\|_{2}^{2}\|_{\psi_{1}}\||w_{r}^{T}y_{j_{1}}||w_{r}^{T}y_{j_{2}}|\|_{\psi_{1}}=\mathcal{O}(d).

On the other hand, Lemma 21 implies that

\|X_{r}Y_{r}-\mathbb{E}[X_{1}Y_{1}]\|_{\psi_{\alpha}}\lesssim\|X_{r}Y_{r}\|_{\psi_{\alpha}}+\|\mathbb{E}[X_{1}Y_{1}]\|_{\psi_{\alpha}}\lesssim\|X_{r}Y_{r}\|_{\psi_{\alpha}}+\mathbb{E}[|X_{1}|]\mathbb{E}[|Y_{1}|].

Note that \mathbb{E}[|X_{1}|]\leq\|X_{1}\|_{\psi_{1}}=\mathcal{O}(1) and \|Y_{1}\|_{\psi_{\frac{1}{2}}}\lesssim\|\|w_{r}\|_{2}^{2}|w_{r}^{T}y_{j_{1}}||w_{r}^{T}y_{j_{2}}|\|_{\psi_{\frac{1}{2}}}=\mathcal{O}(d). From the Taylor expansion of the function e^{x}, which gives e^{x}-1\geq\frac{x^{2}}{2!} for x\geq 0 with x=(|Y_{1}|/C)^{\frac{1}{2}}, we have that for any C>0,

\mathbb{E}\left[e^{(\frac{|Y_{1}|}{C})^{\frac{1}{2}}}-1\right]\geq\mathbb{E}\left[\frac{1}{2!}\frac{|Y_{1}|}{C}\right],

which, taking C of the order of \|Y_{1}\|_{\psi_{\frac{1}{2}}}, implies that \mathbb{E}[|Y_{1}|]=\mathcal{O}(d). Therefore, \|X_{r}Y_{r}-\mathbb{E}[X_{1}Y_{1}]\|_{\psi_{\frac{1}{2}}}=\mathcal{O}(d). Finally, applying Lemma 17 shows that with probability at least 1-\delta,

\left|\frac{1}{m}\sum\limits_{r=1}^{m}X_{r}Y_{r}-\mathbb{E}[X_{1}Y_{1}]\right|\lesssim\frac{d}{n_{2}\sqrt{m}}\sqrt{\log\left(\frac{1}{\delta}\right)}+\frac{d}{n_{2}m}\log^{2}\left(\frac{1}{\delta}\right).

Taking a union bound yields that with probability at least 1-\delta,

\|H(0)-H^{\infty}\|_{F}\lesssim\frac{dn_{1}}{\sqrt{m}}\log\left(\frac{n_{1}(n_{2}+n_{3})}{\delta}\right).
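To spell out the counting behind this union bound (a sketch consistent with the entry-wise estimate above): within each of the n_{1}^{2} blocks, the n_{2}^{2} entries carrying the 1/n_{2} normalization contribute n_{2}^{2}\cdot\frac{d^{2}}{n_{2}^{2}m}\log^{2}(\cdot)=\frac{d^{2}}{m}\log^{2}(\cdot) to \|H(0)-H^{\infty}\|_{F}^{2}, and the entries carrying the 1/\sqrt{n_{2}n_{3}} and 1/n_{3} normalizations contribute the same order, so

\|H(0)-H^{\infty}\|_{F}^{2}\lesssim n_{1}^{2}\cdot\frac{d^{2}}{m}\log^{2}\left(\frac{n_{1}(n_{2}+n_{3})}{\delta}\right),

which gives the stated bound after taking square roots.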

Next, for \tilde{H}(0)-\tilde{H}^{\infty}, we consider its (i,j)-th block, whose (j_{1},j_{2})-th entry has the following form

\frac{1}{m}\sum\limits_{r=1}^{m}X_{r}Y_{r}-\mathbb{E}[X_{1}Y_{1}],

where

X_{r}=\frac{1}{p}\sum\limits_{k=1}^{p}u_{i}^{T}u_{j}I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}(0)^{T}u_{j}\geq 0\},

for j_{1},j_{2}\in[n_{2}] (with the same entry-index convention as for H(0)-H^{\infty}),

Y_{r}=\frac{1}{n_{2}}\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}})),

for j_{1}\in[n_{2}],j_{2}\in[n_{3}],

Y_{r}=\frac{1}{\sqrt{n_{2}n_{3}}\ }\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j_{2}}),

and for j_{1},j_{2}\in[n_{3}],

Y_{r}=\frac{1}{n_{3}}\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j_{1}})\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j_{2}}).

From Lemma 21, we have

\left\|\frac{1}{p}\sum\limits_{k=1}^{p}u_{i}^{T}u_{j}I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}(0)^{T}u_{j}\geq 0\}\right\|_{\psi_{2}}=\mathcal{O}(1).

Note that

|\mathcal{L}(\sigma_{3}(w_{r}^{T}y))|\lesssim\|w_{r}\|_{2}|w_{r}^{T}y|^{2}+\|w_{r}\|_{2}^{2}|w_{r}^{T}y|+|w_{r}^{T}y|^{3}\lesssim\|w_{r}\|_{2}^{2}|w_{r}^{T}y|

and

|\sigma_{3}(w_{r}^{T}y)|\leq|w_{r}^{T}y|^{3}\lesssim\|w_{r}\|_{2}^{2}|w_{r}^{T}y|.

From Lemma 21, we can deduce that

\|\|w_{r}\|_{2}^{4}\|_{\psi_{\frac{1}{2}}}\leq\|\|w_{r}\|_{2}^{2}\|_{\psi_{1}}^{2}=\mathcal{O}(d^{2})

and \||w_{r}^{T}y|^{2}\|_{\psi_{1}}=\mathcal{O}(1), thus

\|\|w_{r}\|_{2}^{4}|w_{r}^{T}y|^{2}\|_{\psi_{\frac{1}{3}}}=\mathcal{O}(d^{2}).

Therefore, with probability at least 1-\delta,

\|\tilde{H}(0)-\tilde{H}^{\infty}\|_{F}\lesssim\frac{d^{2}n_{1}}{\sqrt{m}}\log\left(\frac{n_{1}(n_{2}+n_{3})}{\delta}\right).

8.7. Proof of Lemma 8

Proof.

For H-H(0), from the form of the (j_{1},j_{2})-th entry of the (i,j)-th block, we focus on terms of the form a_{r}b_{r}-a_{r}(0)b_{r}(0), where

a_{r}=\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{i})\right]\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{j})\right],
b_{r}=\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))}{\partial w_{r}}\right\rangle\ \text{or}\ \left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle\ \text{or}\ \left\langle\frac{\partial\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{1}})}{\partial w_{r}},\frac{\partial\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle

and the notation a_{r}(0),b_{r}(0) means replacing w_{r},\tilde{w}_{rk} in the definitions of a_{r} and b_{r} with w_{r}(0) and \tilde{w}_{rk}(0), respectively.

For a_{r}-a_{r}(0), (25) implies that

|a_{r}-a_{r}(0)|\lesssim p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}. (67)

For b_{r}-b_{r}(0), when b_{r}=\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))}{\partial w_{r}}\right\rangle, we have

\left|\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))}{\partial w_{r}}\right\rangle-\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))}{\partial w_{r}}\right\rangle\right| (68)
\leq\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}}-\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}}\right\|_{2}\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))}{\partial w_{r}}\right\|_{2}
\quad+\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))}{\partial w_{r}}-\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))}{\partial w_{r}}\right\|_{2}\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}}\right\|_{2}
\quad+\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}}-\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}}\right\|_{2}\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))}{\partial w_{r}}-\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))}{\partial w_{r}}\right\|_{2}.

Note that with probability at least 1-\delta, the bounds \|w_{r}(0)\|_{2}\lesssim B_{1}, |w_{r}(0)^{T}y_{j}|\lesssim B_{2} and |w_{r}(0)^{T}\tilde{y}_{j_{1}}|\lesssim B_{2} hold for all r\in[m], j\in[n_{2}], j_{1}\in[n_{3}], where

B_{1}=\sqrt{d\log\left(\frac{m}{\delta}\right)},\ B_{2}=\sqrt{\log\left(\frac{m(n_{2}+n_{3})}{\delta}\right)}.

Under these events, we can deduce that

\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}}\right\|_{2},\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))}{\partial w_{r}}\right\|_{2}\lesssim B_{1}B_{2}

and

\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}}-\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}}\right\|_{2}\lesssim(B_{1}+B_{2})R+B_{1}|I\{w_{r}(0)^{T}y_{j_{1}}\geq 0\}-I\{w_{r}^{T}y_{j_{1}}\geq 0\}|
\lesssim(B_{1}+B_{2})R+B_{1}I\{A_{r,j_{1}}\}.

Thus, for b_{r}-b_{r}(0), we have

|b_{r}-b_{r}(0)|\lesssim B_{1}B_{2}(B_{1}+B_{2})R+B_{1}^{2}B_{2}[I\{A_{r,j_{1}}\}+I\{A_{r,j_{2}}\}]. (69)

From Bernstein’s inequality, we have that with probability at least 1-n_{2}\exp(-mR),

\frac{1}{m}\sum\limits_{r=1}^{m}I\{A_{r,j}\}\lesssim R

holds for all j\in[n_{2}].
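To spell out this application of Lemma 16 (a sketch, using the bound P(A_{r,j})\lesssim R, the analogue of the estimate for \tilde{A}_{r,k}^{i} below): apply the inequality to the centered variables I\{A_{r,j}\}-P(A_{r,j}) with n=m, c=1, \sigma^{2}\lesssim R and t=mR, which gives

\frac{1}{m}\sum\limits_{r=1}^{m}I\{A_{r,j}\}\leq P(A_{r,j})+\sqrt{\frac{2\sigma^{2}\cdot mR}{m}}+\frac{mR}{3m}\lesssim R+\sqrt{R\cdot R}+R\lesssim R

with probability at least 1-e^{-mR}; a union bound over j\in[n_{2}] yields the stated failure probability.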

Thus, summing over r yields that

\frac{1}{m}\sum\limits_{r=1}^{m}|b_{r}-b_{r}(0)|\lesssim B_{1}B_{2}(B_{1}+B_{2})R+B_{1}^{2}B_{2}R (70)
\lesssim B_{1}^{2}B_{2}R.

When b_{r}=\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle\ \text{or}\ \left\langle\frac{\partial\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{1}})}{\partial w_{r}},\frac{\partial\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle, we can obtain the same estimate, since

\left\|\frac{\partial\sigma_{3}(w_{r}(0)^{T}y_{j_{1}})}{\partial w_{r}}\right\|_{2}\lesssim B_{2}^{2}\lesssim B_{1}B_{2}

and

\left\|\frac{\partial\sigma_{3}(w_{r}^{T}y_{j_{1}})}{\partial w_{r}}-\frac{\partial\sigma_{3}(w_{r}(0)^{T}y_{j_{1}})}{\partial w_{r}}\right\|_{2}\lesssim B_{2}R.

Note that we can decompose a_{r}b_{r}-a_{r}(0)b_{r}(0) as follows:

a_{r}(0)[b_{r}-b_{r}(0)]+b_{r}(0)[a_{r}-a_{r}(0)]+[a_{r}-a_{r}(0)][b_{r}-b_{r}(0)].
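Expanding the three products shows that this decomposition is an exact identity: the cross terms a_{r}(0)b_{r}, a_{r}b_{r}(0) and a_{r}(0)b_{r}(0) all cancel, leaving precisely a_{r}b_{r}-a_{r}(0)b_{r}(0).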

Therefore, we have

\frac{1}{m}\sum\limits_{r=1}^{m}|a_{r}b_{r}-a_{r}(0)b_{r}(0)| (71)
=\frac{1}{m}\sum\limits_{r=1}^{m}\left|a_{r}(0)[b_{r}-b_{r}(0)]+b_{r}(0)[a_{r}-a_{r}(0)]+[a_{r}-a_{r}(0)][b_{r}-b_{r}(0)]\right|
\leq\frac{1}{m}\sum\limits_{r=1}^{m}|a_{r}(0)||b_{r}-b_{r}(0)|+\frac{1}{m}\sum\limits_{r=1}^{m}|b_{r}(0)||a_{r}-a_{r}(0)|+\frac{1}{m}\sum\limits_{r=1}^{m}|a_{r}-a_{r}(0)||b_{r}-b_{r}(0)|
\lesssim B_{1}^{2}B_{2}R\log\left(\frac{mn_{1}}{\delta}\right)+B_{1}^{2}B_{2}^{2}\left(p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right).

Summing over i,j,j_{1},j_{2} yields that

\|H-H(0)\|_{F}\lesssim n_{1}B_{1}^{2}B_{2}R\log\left(\frac{mn_{1}}{\delta}\right)+n_{1}B_{1}^{2}B_{2}^{2}\left(p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right).

For \tilde{H}-\tilde{H}(0), from the form of the (j_{1},j_{2})-th entry of the (i,j)-th block, we focus on terms of the form a_{r}b_{r}-a_{r}(0)b_{r}(0), where

a_{r}=\frac{1}{p}\sum\limits_{k=1}^{p}u_{i}^{T}u_{j}I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\},
b_{r}=\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))\ \text{or}\ \mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{2}})\ \text{or}\ \sigma_{3}(w_{r}^{T}\tilde{y}_{j_{1}})\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{2}}).

Similarly, we can deduce that

\frac{1}{m}\sum\limits_{r=1}^{m}|a_{r}-a_{r}(0)| (72)
\lesssim\frac{1}{m}\frac{1}{p}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}|I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}(0)^{T}u_{j}\geq 0\}|
\lesssim\frac{1}{m}\frac{1}{p}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left[I\{\tilde{A}_{r,k}^{i}\}+I\{\tilde{A}_{r,k}^{j}\}\right]
\lesssim\tilde{R},

where the last inequality holds with probability at least 1-n_{1}\exp(-mp\tilde{R}) by Bernstein's inequality.

For b_{r}-b_{r}(0), note that

|\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j}))|\lesssim B_{1}^{2}B_{2},\ |\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j})|\lesssim B_{2}^{3}\lesssim B_{1}^{2}B_{2}

and

|\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j}))|\lesssim B_{1}B_{2}R,\ |\sigma_{3}(w_{r}^{T}\tilde{y}_{j})-\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j})|\lesssim B_{2}^{2}R\lesssim B_{1}B_{2}R.

Thus, similar to (66), we have

|b_{r}-b_{r}(0)|\lesssim B_{1}^{3}B_{2}^{2}R. (73)

Therefore, similar to (71), we have

\frac{1}{m}\sum\limits_{r=1}^{m}|a_{r}b_{r}-a_{r}(0)b_{r}(0)| (74)
=\frac{1}{m}\sum\limits_{r=1}^{m}\left|a_{r}(0)[b_{r}-b_{r}(0)]+b_{r}(0)[a_{r}-a_{r}(0)]+[a_{r}-a_{r}(0)][b_{r}-b_{r}(0)]\right|
\lesssim B_{1}^{2}B_{2}\tilde{R}+B_{1}^{3}B_{2}^{2}R.

Summing over i,j,j_{1},j_{2} yields that

\|\tilde{H}-\tilde{H}(0)\|_{F}\lesssim n_{1}B_{1}^{2}B_{2}\tilde{R}+n_{1}B_{1}^{3}B_{2}^{2}R.

8.8. Proof of Lemma 9

Proof.

Note that

s^{t+1}(u_{i})(y_{j})-s^{t}(u_{i})(y_{j}) (75)
=s^{t+1}(u_{i})(y_{j})-s^{t}(u_{i})(y_{j})-\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle
\quad+\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle+\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle.

From the gradient descent update rule, we have

\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle
=\sum\limits_{r=1}^{m}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle
=-\eta\sum\limits_{r=1}^{m}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w_{r}},\frac{\partial s^{t}(u_{i_{1}})(y_{j_{1}})}{\partial w_{r}}\right\rangle s^{t}(u_{i_{1}})(y_{j_{1}})
\quad-\eta\sum\limits_{r=1}^{m}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w_{r}},\frac{\partial h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}}).

Similarly, we can obtain that

\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle
=\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\right\rangle
=-\eta\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\frac{\partial s^{t}(u_{i_{1}})(y_{j_{1}})}{\partial\tilde{w}_{rk}}\right\rangle s^{t}(u_{i_{1}})(y_{j_{1}})
\quad-\eta\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\frac{\partial h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}})}{\partial\tilde{w}_{rk}}\right\rangle h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}}).

For h^{t+1}(u_{i})(\tilde{y}_{j})-h^{t}(u_{i})(\tilde{y}_{j}), we can derive a similar result, which is omitted for brevity. Similar to the derivation in the section on neural operators, we have

G^{t+1}(u)-G^{t}(u)=-\eta(H(t)+\tilde{H}(t))G^{t}(u)+I(t), (76)

where I(t)\in\mathbb{R}^{n_{1}(n_{2}+n_{3})} and I(t) can be divided into n_{1} blocks, each of which is an (n_{2}+n_{3})-dimensional vector. The j_{1}-th (j_{1}\in[n_{2}]) component of the i-th block is

s^{t+1}(u_{i})(y_{j_{1}})-s^{t}(u_{i})(y_{j_{1}})-\left\langle\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle,

and the (n_{2}+j_{2})-th (j_{2}\in[n_{3}]) component of the i-th block is

h^{t+1}(u_{i})(\tilde{y}_{j_{2}})-h^{t}(u_{i})(\tilde{y}_{j_{2}})-\left\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle.

Finally, applying a simple algebraic transformation to (76), we have

G^{t+1}(u)=[I-\eta(H(t)+\tilde{H}(t))]G^{t}(u)+I(t).

8.9. Proof of Corollary 2

Proof.

Let B_{1}=2\sqrt{d\log(m/\delta)},\ B_{2}=2\sqrt{\log(m(n_{2}+n_{3})/\delta)}. We first estimate \|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\|_{2}. The gradient update rule yields that

\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)=-\eta\frac{\partial L(W(t),\tilde{W}(t))}{\partial\tilde{w}_{rk}}
=-\eta\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}s^{t}(u_{i})(y_{j_{1}})\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial\tilde{w}_{rk}}-\eta\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}h^{t}(u_{i})(\tilde{y}_{j_{2}})\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial\tilde{w}_{rk}}.

For the gradient term, we have

\left\|\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial\tilde{w}_{rk}}\right\|_{2}=\left\|\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\frac{1}{\sqrt{p}}\tilde{a}_{rk}u_{i}I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j_{1}}))\right\|_{2}\lesssim\frac{1}{\sqrt{n_{2}}}\frac{B_{1}^{2}B_{2}}{\sqrt{mp}}

and

\left\|\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial\tilde{w}_{rk}}\right\|_{2}=\left\|\frac{1}{\sqrt{n_{3}}}\frac{1}{\sqrt{m}}\frac{1}{\sqrt{p}}\tilde{a}_{rk}u_{i}I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}\sigma_{3}(w_{r}(t)^{T}\tilde{y}_{j_{2}})\right\|_{2}\lesssim\frac{1}{\sqrt{n_{3}}}\frac{B_{2}^{3}}{\sqrt{mp}}.

Therefore, we obtain that

\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\|_{2}\lesssim\frac{\eta\sqrt{n_{1}}B_{1}^{2}B_{2}}{\sqrt{mp}}\sqrt{\sum\limits_{i=1}^{n_{1}}\|s^{t}(u_{i})\|_{2}^{2}}+\frac{\eta\sqrt{n_{1}}B_{2}^{3}}{\sqrt{mp}}\sqrt{\sum\limits_{i=1}^{n_{1}}\|h^{t}(u_{i})\|_{2}^{2}} (77)
\lesssim\frac{\eta\sqrt{n_{1}}B_{1}^{2}B_{2}}{\sqrt{mp}}\|G^{t}(u)\|_{2},

where the first inequality follows from Cauchy’s inequality.

Summing over t from 0 to T yields that

\|\tilde{w}_{rk}(T+1)-\tilde{w}_{rk}(0)\|_{2}\leq\sum\limits_{t=0}^{T}\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\|_{2} (78)
\lesssim\sum\limits_{t=0}^{T}\frac{\eta\sqrt{n_{1}}B_{1}^{2}B_{2}}{\sqrt{mp}}\left(1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}\right)^{t/2}\|G^{0}(u)\|_{2}
\lesssim\frac{\sqrt{n_{1}}B_{1}^{2}B_{2}\|G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})},

where the second inequality follows from the induction hypothesis.
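For completeness, the last step sums a geometric series: with q=\eta(\lambda_{0}+\tilde{\lambda}_{0})/2\in(0,1),

\sum\limits_{t=0}^{T}(1-q)^{t/2}\leq\sum\limits_{t=0}^{\infty}(1-q)^{t/2}=\frac{1}{1-\sqrt{1-q}}=\frac{1+\sqrt{1-q}}{q}\leq\frac{2}{q}=\frac{4}{\eta(\lambda_{0}+\tilde{\lambda}_{0})},

which cancels the factor \eta and produces the 1/(\lambda_{0}+\tilde{\lambda}_{0}) dependence; the same computation is used again in (83).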

Next, we estimate \|w_{r}(t+1)-w_{r}(t)\|_{2}. The gradient descent update rule yields that

\|w_{r}(t+1)-w_{r}(t)\|_{2}=\left\|-\eta\frac{\partial L(W(t),\tilde{W}(t))}{\partial w_{r}}\right\|_{2} (79)
=\eta\left\|\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}s^{t}(u_{i})(y_{j_{1}})\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial w_{r}}+\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}h^{t}(u_{i})(\tilde{y}_{j_{2}})\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial w_{r}}\right\|_{2}.

Recall that

\left\|\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial w_{r}}\right\|_{2}=\left\|\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j_{1}}))}{\partial w_{r}}\right\|_{2}.

Note that

\left|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right|\lesssim\sqrt{p}\tilde{R}^{\prime}

and

\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j_{1}}))}{\partial w_{r}}\right\|_{2}\lesssim B_{1}B_{2}.

Thus, we have

\left\|\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial w_{r}}\right\|_{2} (80)
\leq\frac{1}{\sqrt{n_{2}m}}\left|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right|\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j_{1}}))}{\partial w_{r}}\right\|_{2}
\quad+\frac{1}{\sqrt{n_{2}m}}\left|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right|\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j_{1}}))}{\partial w_{r}}\right\|_{2}
\lesssim\frac{\sqrt{p}\tilde{R}^{\prime}B_{1}B_{2}}{\sqrt{mn_{2}}}+\frac{B_{1}B_{2}}{\sqrt{mn_{2}}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}
\lesssim\frac{B_{1}B_{2}}{\sqrt{mn_{2}}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)},

where in the last inequality, we assume that \sqrt{p}\tilde{R}^{\prime}\lesssim\sqrt{\log(mn_{1}/\delta)}.

Similarly, since

\left\|\frac{\partial\sigma_{3}(w_{r}(t)^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\|_{2}\lesssim B_{2}^{2},

we can obtain that

\left\|\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial w_{r}}\right\|_{2}\lesssim\frac{\sqrt{p}\tilde{R}^{\prime}B_{2}^{2}}{\sqrt{mn_{3}}}+\frac{B_{2}^{2}}{\sqrt{mn_{3}}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}. (81)

Combining (79), (80) and (81) yields that

\|w_{r}(t+1)-w_{r}(t)\|_{2}\lesssim\frac{\eta\sqrt{n_{1}}B_{1}B_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\sqrt{\sum\limits_{i=1}^{n_{1}}\|s^{t}(u_{i})\|_{2}^{2}}+\frac{\eta\sqrt{n_{1}}B_{2}^{2}}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\sqrt{\sum\limits_{i=1}^{n_{1}}\|h^{t}(u_{i})\|_{2}^{2}} (82)
\lesssim\frac{\eta\sqrt{n_{1}}B_{1}B_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\|G^{t}(u)\|_{2},

where the first inequality follows from Cauchy’s inequality.

Summing over t from 0 to T yields that

\|w_{r}(T+1)-w_{r}(0)\|_{2}\leq\sum\limits_{t=0}^{T}\|w_{r}(t+1)-w_{r}(t)\|_{2} (83)
\lesssim\sum\limits_{t=0}^{T}\frac{\eta\sqrt{n_{1}}B_{1}B_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\left(1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}\right)^{t/2}\|G^{0}(u)\|_{2}
\lesssim\frac{\sqrt{n_{1}}B_{1}B_{2}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}.

8.10. Proof of Lemma 10

Proof.

From the form of the residual I(t), it suffices to estimate

s^{t+1}(u_{i})(y_{j})-s^{t}(u_{i})(y_{j})-\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle

and

h^{t+1}(u_{i})(\tilde{y}_{j})-h^{t}(u_{i})(\tilde{y}_{j})-\left\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle,

which we denote by I_{i,j}^{s}(t) and I_{i,j}^{h}(t), respectively. In fact, we only need to estimate I_{i,j}^{s}(t), since \mathcal{L}(u) includes the term u, which plays the same role as the boundary term.

Recall that the shallow neural operator has the form

G(u)(y)=\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u)\right]\sigma_{3}(w_{r}^{T}y).

We first estimate I_{i,j}^{s}(t). We can explicitly express the difference s^{t+1}(u_{i})(y_{j})-s^{t}(u_{i})(y_{j}) as follows:

s^{t+1}(u_{i})(y_{j})-s^{t}(u_{i})(y_{j}) (84)
=\frac{1}{\sqrt{n_{2}}}\left[\mathcal{L}G^{t+1}(u_{i})(y_{j})-\mathcal{L}G^{t}(u_{i})(y_{j})\right]
=\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left(\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})\right]\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))\right)
=\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(t)^{T}u_{i}))\right]\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))
\quad+\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})\right]\left[\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))\right],

where in the last equality, we split s^{t+1}(u_{i})(y_{j})-s^{t}(u_{i})(y_{j}) into two terms in order to estimate them separately later.

On the other hand, from the form of the neural operator, we can obtain that

\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle (85)
=\sum\limits_{r=1}^{m}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle
=\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle,

and

\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle (86)
=\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\right\rangle
=\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\frac{1}{\sqrt{p}}\left[\sum\limits_{k=1}^{p}\tilde{a}_{rk}I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}\right]\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j})).

With the explicit expressions for each term of I_{i,j}^{s}(t), namely (84), (85), and (86), we can split I_{i,j}^{s}(t) into two parts: the first part is the first term of (84) minus (86), and the second part is the second term of (84) minus (85). Specifically, let

I_{1}^{r}(t)=\left(\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\left(\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(t)^{T}u_{i})-I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}\right)\right)\cdot\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j})) (87)

and

I_{2}^{r}(t)=\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})\right]\left[\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))\right] (88)
\quad-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle,

where in the definition, we have omitted the indices i,j,s for simplicity.

Then

I_{i,j}^{s}(t)=\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[I_{1}^{r}(t)+I_{2}^{r}(t)\right]. (89)

To estimate I_{1}^{r}(t), since

|\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))|\lesssim B_{1}^{2}B_{2},

it suffices to estimate

\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(t)^{T}u_{i})-I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}.

With a slight abuse of notation, we let \tilde{A}_{r,k}^{i}=\{\tilde{w}\in\mathbb{R}^{d+1}:\|\tilde{w}-\tilde{w}_{rk}(0)\|\leq\tilde{R}^{\prime},I\{\tilde{w}^{T}u_{i}\geq 0\}\neq I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0\}\}, \tilde{S}_{i}=\{(r,k)\in[m]\times[p]:I\{\tilde{A}_{r,k}^{i}\}=0\} and \tilde{S}_{i}^{\perp}=[m]\times[p]\backslash\tilde{S}_{i}. Then we have that P(\tilde{A}_{r,k}^{i})\lesssim\tilde{R}^{\prime}. Note that \|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R}^{\prime} and \|\tilde{w}_{rk}(t)-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R}^{\prime}, thus for (r,k)\in\tilde{S}_{i}, we have I\{\tilde{w}_{rk}(t+1)^{T}u_{i}\geq 0\}=I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}. At this point,

\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(t)^{T}u_{i})-I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i} (90)
=(\tilde{w}_{rk}(t+1)^{T}u_{i})I\{\tilde{w}_{rk}(t+1)^{T}u_{i}\geq 0\}-(\tilde{w}_{rk}(t)^{T}u_{i})I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}-I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}
=(\tilde{w}_{rk}(t+1)^{T}u_{i})I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}-(\tilde{w}_{rk}(t)^{T}u_{i})I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}-I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}
=0.

On the other hand, for all (r,k)\in[m]\times[p],

|\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(t)^{T}u_{i})-I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}|\lesssim\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\|_{2}.

Therefore, for I_{1}^{r}(t), we have

|I_{1}^{r}(t)|\lesssim\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}B_{1}^{2}B_{2}\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\|_{2}I\{(r,k)\in\tilde{S}_{i}^{\perp}\}.

Combining with (77) yields that

\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}|I_{1}^{r}(t)|\lesssim\frac{1}{\sqrt{m}}\frac{1}{\sqrt{p}}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}B_{1}^{2}B_{2}\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\|_{2}I\{(r,k)\in\tilde{S}_{i}^{\perp}\} (91)
=\frac{1}{\sqrt{m}}\frac{1}{\sqrt{p}}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}B_{1}^{2}B_{2}\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\|_{2}I\{\tilde{A}_{r,k}^{i}\}
\lesssim\eta\sqrt{n_{1}}B_{1}^{4}B_{2}^{2}\|G^{t}(u)\|_{2}\frac{1}{mp}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}I\{\tilde{A}_{r,k}^{i}\}
\lesssim\eta\sqrt{n_{1}}B_{1}^{4}B_{2}^{2}\|G^{t}(u)\|_{2}\tilde{R}^{\prime}
\lesssim\frac{\eta n_{1}B_{1}^{6}B_{2}^{3}\|G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\|G^{t}(u)\|_{2},

where the bound \frac{1}{mp}\sum_{r,k}I\{\tilde{A}_{r,k}^{i}\}\lesssim\tilde{R}^{\prime} follows from Bernstein's inequality and holds with probability at least 1-n_{1}\exp(-mp\tilde{R}^{\prime}).

For I_{2}^{r}(t), we can rewrite it as follows:

I_{2}^{r}(t)=\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})\right]\left[\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))\right] (92)
\quad-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle
=\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\left[\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right]\left[\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))\right]
\quad-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\left[\sigma(\tilde{w}_{rk}(t)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right]\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle
\quad+\left[\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))-\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle\right]\cdot\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]
:=I_{2,1}^{r}(t)+I_{2,2}^{r}(t)+I_{2,3}^{r}(t).

In the following, we estimate the three terms I_{2,1}^{r}(t),I_{2,2}^{r}(t) and I_{2,3}^{r}(t) separately.

For I_{2,1}^{r}(t) and I_{2,2}^{r}(t), note that for s=t,t+1, we have

\left|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right|\lesssim\sqrt{p}\tilde{R}^{\prime}. (93)

Moreover, we can deduce that

\left|\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))\right|\lesssim(B_{1}^{2}+B_{1}B_{2})\|w_{r}(t+1)-w_{r}(t)\|_{2}\lesssim B_{1}^{2}\|w_{r}(t+1)-w_{r}(t)\|_{2} (94)

and

\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}}\right\|_{2}\lesssim B_{1}B_{2}. (95)

Combining (93), (94) and (95) yields that

|I_{2,1}^{r}(t)|\lesssim\sqrt{p}\tilde{R}^{\prime}B_{1}^{2}\|w_{r}(t+1)-w_{r}(t)\|_{2} (96)

and

|I_{2,2}^{r}(t)|\lesssim\sqrt{p}\tilde{R}^{\prime}B_{1}B_{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}\lesssim\sqrt{p}\tilde{R}^{\prime}B_{1}^{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}. (97)

It remains to estimate I_{2,3}^{r}(t). From its form, it suffices to estimate

\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))-\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle (98)
:=\tilde{I}_{1}(t)+\tilde{I}_{2}(t)+\tilde{I}_{3}(t),

where \tilde{I}_{1}(t),\tilde{I}_{2}(t) and \tilde{I}_{3}(t) are respectively related to the first-order term, the second-order term, and the zeroth-order term of the PDE. Specifically,

\tilde{I}_{1}(t)=w_{r0}(t+1)\sigma_{3}^{\prime}(w_{r}(t+1)^{T}y_{j})-w_{r0}(t)\sigma_{3}^{\prime}(w_{r}(t)^{T}y_{j})-\left\langle\frac{\partial w_{r0}(t)\sigma_{3}^{\prime}(w_{r}(t)^{T}y_{j})}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle,
\tilde{I}_{2}(t)=-\left[\|w_{r1}(t+1)\|_{2}^{2}\sigma_{3}^{\prime\prime}(w_{r}(t+1)^{T}y_{j})-\|w_{r1}(t)\|_{2}^{2}\sigma_{3}^{\prime\prime}(w_{r}(t)^{T}y_{j})-\left\langle\frac{\partial\|w_{r1}(t)\|_{2}^{2}\sigma_{3}^{\prime\prime}(w_{r}(t)^{T}y_{j})}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle\right]

and

\tilde{I}_{3}(t)=\sigma_{3}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}(w_{r}(t)^{T}y_{j})-\left\langle\frac{\partial\sigma_{3}(w_{r}(t)^{T}y_{j})}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle.

Note that both \tilde{I}_{1}(t) and \tilde{I}_{2}(t) have the form

f(w^{\prime})g(w^{\prime})-f(w)g(w)-\left\langle\frac{\partial f(w)g(w)}{\partial w},w^{\prime}-w\right\rangle.

However, due to the non-differentiability of the ReLU function, we cannot perform a second-order expansion; instead, we decompose it into

f(w^{\prime})g(w^{\prime})-f(w)g(w)-\left\langle\frac{\partial f(w)g(w)}{\partial w},w^{\prime}-w\right\rangle (99)
=f(w^{\prime})g(w^{\prime})-f(w)g(w)-\left\langle\frac{\partial f(w)}{\partial w}g(w),w^{\prime}-w\right\rangle-\left\langle\frac{\partial g(w)}{\partial w}f(w),w^{\prime}-w\right\rangle
=f(w^{\prime})\left[g(w^{\prime})-g(w)-\left\langle\frac{\partial g(w)}{\partial w},w^{\prime}-w\right\rangle\right]+g(w)\left[f(w^{\prime})-f(w)-\left\langle\frac{\partial f(w)}{\partial w},w^{\prime}-w\right\rangle\right]
\quad+[f(w^{\prime})-f(w)]\left\langle\frac{\partial g(w)}{\partial w},w^{\prime}-w\right\rangle.
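One can verify by direct expansion that the last equality in (99) is an exact identity: the terms \pm f(w^{\prime})\left\langle\frac{\partial g(w)}{\partial w},w^{\prime}-w\right\rangle and \pm f(w^{\prime})g(w) cancel, leaving

f(w^{\prime})g(w^{\prime})-f(w)g(w)-g(w)\left\langle\frac{\partial f(w)}{\partial w},w^{\prime}-w\right\rangle-f(w)\left\langle\frac{\partial g(w)}{\partial w},w^{\prime}-w\right\rangle,

which is exactly the second line above.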

Thus, for \tilde{I}_{1}(t), we have

\tilde{I}_{1}(t)=w_{r0}(t+1)\sigma_{3}^{\prime}(w_{r}(t+1)^{T}y_{j})-w_{r0}(t)\sigma_{3}^{\prime}(w_{r}(t)^{T}y_{j}) (100)
\quad-[\sigma_{3}^{\prime}(w_{r}(t)^{T}y_{j})(w_{r0}(t+1)-w_{r0}(t))+w_{r0}(t)\sigma_{3}^{\prime\prime}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}]
=w_{r0}(t+1)[\sigma_{3}^{\prime}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}^{\prime}(w_{r}(t)^{T}y_{j})]-w_{r0}(t)\sigma_{3}^{\prime\prime}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}
=[w_{r0}(t+1)-w_{r0}(t)][\sigma_{3}^{\prime}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}^{\prime}(w_{r}(t)^{T}y_{j})]
\quad+w_{r0}(t)[\sigma_{3}^{\prime}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}^{\prime}(w_{r}(t)^{T}y_{j})-\sigma_{3}^{\prime\prime}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}].

We apply the mean value theorem to the first term in (100) and obtain that

|\sigma_{3}^{\prime}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}^{\prime}(w_{r}(t)^{T}y_{j})|=|w_{r}(t+1)^{T}y_{j}-w_{r}(t)^{T}y_{j}||\sigma_{3}^{\prime\prime}(\xi)|
\lesssim B_{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}.

Similarly, for the second term, we have

|\sigma_{3}^{\prime}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}^{\prime}(w_{r}(t)^{T}y_{j})-\sigma_{3}^{\prime\prime}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}|
=|\sigma_{3}^{\prime\prime}(\xi)-\sigma_{3}^{\prime\prime}(w_{r}(t)^{T}y_{j})||(w_{r}(t+1)-w_{r}(t))^{T}y_{j}|
\lesssim\|w_{r}(t+1)-w_{r}(t)\|_{2}^{2}.

Thus, for \tilde{I}_{1}(t), we have

|\tilde{I}_{1}(t)|\lesssim B_{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}^{2}+B_{1}\|w_{r}(t+1)-w_{r}(t)\|_{2}^{2} (101)
\lesssim B_{1}\|w_{r}(t+1)-w_{r}(t)\|_{2}^{2}.

For \tilde{I}_{2}(t), with the same decomposition as in (99), we have

\tilde{I}_{2}(t)=\|w_{r1}(t+1)\|_{2}^{2}\sigma_{3}^{\prime\prime}(w_{r}(t+1)^{T}y_{j})-\|w_{r1}(t)\|_{2}^{2}\sigma_{3}^{\prime\prime}(w_{r}(t)^{T}y_{j}) (102)
\quad-[2\sigma_{3}^{\prime\prime}(w_{r}(t)^{T}y_{j})(w_{r1}(t+1)-w_{r1}(t))^{T}w_{r1}(t)+\|w_{r1}(t)\|_{2}^{2}\sigma_{3}^{\prime\prime\prime}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}]
=\|w_{r1}(t+1)\|_{2}^{2}\left[\sigma_{3}^{\prime\prime}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}^{\prime\prime}(w_{r}(t)^{T}y_{j})-\sigma_{3}^{\prime\prime\prime}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}\right]
\quad+\sigma_{3}^{\prime\prime}(w_{r}(t)^{T}y_{j})\left[\|w_{r1}(t+1)\|_{2}^{2}-\|w_{r1}(t)\|_{2}^{2}-2w_{r1}(t)^{T}(w_{r1}(t+1)-w_{r1}(t))\right]
\quad+[\|w_{r1}(t+1)\|_{2}^{2}-\|w_{r1}(t)\|_{2}^{2}]\sigma_{3}^{\prime\prime\prime}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j},

up to the overall sign in the definition of \tilde{I}_{2}(t).

For the first term, note that \sigma_{3}^{\prime\prime}(x)=6\sigma(x) and \sigma_{3}^{\prime\prime\prime}(x)=6I\{x\geq 0\}; thus, similar to (90), for r\in S_{j}, we have

\sigma(w_{r}(t+1)^{T}y_{j})-\sigma(w_{r}(t)^{T}y_{j})-I\{w_{r}(t+1)^{T}y_{j}\geq 0\}(w_{r}(t+1)-w_{r}(t))^{T}y_{j}=0,

thus

|\sigma(w_{r}(t+1)^{T}y_{j})-\sigma(w_{r}(t)^{T}y_{j})-I\{w_{r}(t+1)^{T}y_{j}\geq 0\}(w_{r}(t+1)-w_{r}(t))^{T}y_{j}|\lesssim\|w_{r}(t+1)-w_{r}(t)\|_{2}I\{r\in S_{j}^{\perp}\}.

For the second term, a direct computation yields the exact identity

|\|w_{r1}(t+1)\|_{2}^{2}-\|w_{r1}(t)\|_{2}^{2}-2w_{r1}(t)^{T}(w_{r1}(t+1)-w_{r1}(t))|=\|w_{r1}(t+1)-w_{r1}(t)\|_{2}^{2}.

For the third term, we have

|[\|w_{r1}(t+1)\|_{2}^{2}-\|w_{r1}(t)\|_{2}^{2}]\sigma_{3}^{\prime\prime\prime}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}|\lesssim B_{1}\|w_{r}(t+1)-w_{r}(t)\|_{2}^{2}.

Combining these results for \tilde{I}_{1}(t), \tilde{I}_{2}(t) and \tilde{I}_{3}(t) (the term \tilde{I}_{3}(t) is bounded analogously by B_{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}^{2} via the mean value theorem, as in the estimate for \tilde{I}_{1}(t)) yields that

|I_{2,3}^{r}(t)|\leq\left|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right|[|\tilde{I}_{1}(t)|+|\tilde{I}_{2}(t)|+|\tilde{I}_{3}(t)|] (103)
\lesssim\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\left[B_{1}^{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}^{2}+B_{1}^{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}I\{r\in S_{j}^{\perp}\}\right].

With these estimates for I_{2,1}^{r}(t),I_{2,2}^{r}(t) and I_{2,3}^{r}(t), i.e., (96), (97) and (103), we obtain that

|I_{2}^{r}(t)|\leq|I_{2,1}^{r}(t)|+|I_{2,2}^{r}(t)|+|I_{2,3}^{r}(t)| (104)
\lesssim\sqrt{p}\tilde{R}^{\prime}B_{1}^{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}
\quad+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\left[B_{1}^{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}^{2}+B_{1}^{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}I\{r\in S_{j}^{\perp}\}\right].

Recall that (82) shows that

\|w_{r}(t+1)-w_{r}(t)\|_{2}\lesssim\frac{\eta\sqrt{n_{1}}B_{1}B_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\|G^{t}(u)\|_{2}

and

\tilde{R}^{\prime}=\frac{\sqrt{n_{1}}B_{1}^{2}B_{2}\|G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})},\ R^{\prime}=\frac{\sqrt{n_{1}}B_{1}B_{2}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}.

Therefore, combining with Bernstein's inequality and summing over r yields that

\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}|I_{2}^{r}(t)|\lesssim\frac{\eta n_{1}B_{1}^{5}B_{2}^{2}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\|G^{t}(u)\|_{2}+\frac{\eta n_{1}B_{1}^{4}B_{2}^{2}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{3/2}\left(\frac{mn_{1}}{\delta}\right)\|G^{t}(u)\|_{2} (105)
\lesssim\frac{\eta n_{1}B_{1}^{5}B_{2}^{2}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{3/2}\left(\frac{mn_{1}}{\delta}\right)\|G^{t}(u)\|_{2}.

Recall that (91) implies that

\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}|I_{1}^{r}(t)|\lesssim\frac{\eta n_{1}B_{1}^{6}B_{2}^{3}\|G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\|G^{t}(u)\|_{2}.

Therefore, we have

\|I(t)\|_{2}\lesssim\frac{\eta n_{1}^{3/2}B_{1}^{6}B_{2}^{3}\|G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\|G^{t}(u)\|_{2}+\frac{\eta n_{1}^{3/2}B_{1}^{5}B_{2}^{2}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{3/2}\left(\frac{mn_{1}}{\delta}\right)\|G^{t}(u)\|_{2} (106)
\lesssim\frac{\eta n_{1}^{3/2}B_{1}^{6}B_{2}^{3}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{3/2}\left(\frac{mn_{1}}{\delta}\right)\|G^{t}(u)\|_{2}.

8.11. Proof of Theorem 3

Proof.

It suffices to show that Condition 2 also holds for s=T+1. From the iteration formula in Lemma 9, we have

\|G^{T+1}(u)\|_{2}^{2}=\|[I-\eta(H(T)+\tilde{H}(T))]G^{T}(u)+I(T)\|_{2}^{2} (107)
=\|[I-\eta(H(T)+\tilde{H}(T))]G^{T}(u)\|_{2}^{2}+\|I(T)\|_{2}^{2}+2\left\langle[I-\eta(H(T)+\tilde{H}(T))]G^{T}(u),I(T)\right\rangle.

From the stability of the Gram matrices, i.e., Lemma 8, when R,\tilde{R} satisfy

R\lesssim\frac{\lambda_{0}}{n_{1}B_{1}^{3}B_{2}^{2}},\ \tilde{R}\lesssim\frac{\tilde{\lambda}_{0}}{n_{1}\sqrt{p}B_{1}^{2}B_{2}^{2}},

we have

\|H(0)-H(T)\|_{2}\leq\frac{\lambda_{0}}{4},\ \|\tilde{H}(0)-\tilde{H}(T)\|_{2}\leq\frac{\tilde{\lambda}_{0}}{4}.

Thus, with Lemma 7, we can deduce that

\|H(T)-H^{\infty}\|_{2}\leq\frac{\lambda_{0}}{2},\ \|\tilde{H}(T)-\tilde{H}^{\infty}\|_{2}\leq\frac{\tilde{\lambda}_{0}}{2},

implying that \lambda_{\min}(H(T))\geq\lambda_{0}/2 and \lambda_{\min}(\tilde{H}(T))\geq\tilde{\lambda}_{0}/2.

Therefore, when \eta=\mathcal{O}(1/(\|H^{\infty}\|_{2}+\|\tilde{H}^{\infty}\|_{2})), the matrix I-\eta(H(T)+\tilde{H}(T)) is positive definite, and

\|I-\eta(H(T)+\tilde{H}(T))\|_{2}\leq 1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}. (108)

Let \|I(T)\|_{2}\leq\bar{R}\|G^{T}(u)\|_{2}; then combining (107) and (108) yields that

\|G^{T+1}(u)\|_{2}^{2}\leq\|[I-\eta(H(T)+\tilde{H}(T))]G^{T}(u)\|_{2}^{2}+\|I(T)\|_{2}^{2}+2\|[I-\eta(H(T)+\tilde{H}(T))]G^{T}(u)\|_{2}\|I(T)\|_{2} (109)
\leq\left[\left(1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}\right)^{2}+\bar{R}^{2}+2\bar{R}\left(1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}\right)\right]\|G^{T}(u)\|_{2}^{2}
\leq\left(1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}\right)\|G^{T}(u)\|_{2}^{2},

where the last inequality requires that \bar{R}\lesssim\eta(\lambda_{0}+\tilde{\lambda}_{0}).
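To see why \bar{R}\lesssim\eta(\lambda_{0}+\tilde{\lambda}_{0}) suffices for this last step, write q=\eta(\lambda_{0}+\tilde{\lambda}_{0})/2 and note that the bracket is a perfect square:

\left(1-q\right)^{2}+\bar{R}^{2}+2\bar{R}(1-q)=\left(1-q+\bar{R}\right)^{2}\leq 1-q,

where the final inequality holds once \bar{R}\leq\sqrt{1-q}-(1-q); since \sqrt{1-q}-(1-q)\geq q/4 for q\in[0,3/4], which covers the step sizes considered here, \bar{R}\leq q/4 is enough.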

Finally, we need to specify the requirements on m to ensure that the aforementioned conditions are satisfied. Recall that, first, m needs to satisfy R^{\prime}\leq R and \tilde{R}^{\prime}\leq\tilde{R}, i.e.,

\frac{\sqrt{n_{1}}B_{1}B_{2}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\lesssim\frac{\lambda_{0}}{n_{1}B_{1}^{3}B_{2}^{2}},\ \frac{\sqrt{n_{1}}B_{1}^{2}B_{2}\|G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\lesssim\frac{\tilde{\lambda}_{0}}{n_{1}\sqrt{p}B_{1}^{2}B_{2}^{2}}. (110)

Simple algebraic operations yield that

m=\Omega\left(\frac{n_{1}^{3}B_{1}^{8}B_{2}^{6}\|G^{0}\|_{2}^{2}}{(\lambda_{0}+\tilde{\lambda}_{0})^{2}\min(\lambda_{0}^{2},\tilde{\lambda}_{0}^{2})}\log\left(\frac{mn_{1}}{\delta}\right)\right). (111)

Second, m needs to satisfy \bar{R}\lesssim\eta(\lambda_{0}+\tilde{\lambda}_{0}), i.e.,

\frac{\eta n_{1}^{3/2}B_{1}^{6}B_{2}^{3}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{3/2}\left(\frac{mn_{1}}{\delta}\right)\lesssim\eta(\lambda_{0}+\tilde{\lambda}_{0}),

implying that

m=\Omega\left(\frac{n_{1}^{3}B_{1}^{12}B_{2}^{6}\|G^{0}(u)\|_{2}^{2}}{(\lambda_{0}+\tilde{\lambda}_{0})^{4}}\log^{3}\left(\frac{mn_{1}}{\delta}\right)\right). (112)

Finally, combining (111), (112) and the estimate of \|G^{0}(u)\|_{2}^{2} in Lemma 17, we have that

m=\tilde{\Omega}\left(\frac{n_{1}^{4}d^{7}}{(\lambda_{0}+\tilde{\lambda}_{0})^{2}\min(\lambda_{0}^{2},\tilde{\lambda}_{0}^{2})}\right),

where \tilde{\Omega} indicates that some terms involving \log(n_{1}), \log(n_{2}) and \log(m) are omitted.

9. Auxiliary Lemmas

Lemma 15 (Anti-concentration of Gaussian distribution).

Let X\sim\mathcal{N}(0,\sigma^{2}); then for any t>0,

\frac{2}{3}\frac{t}{\sigma}<P(|X|\leq t)<\frac{4}{5}\frac{t}{\sigma}.
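As a quick consistency check of these constants: for t/\sigma small, P(|X|\leq t)=2\Phi(t/\sigma)-1\approx\sqrt{2/\pi}\,t/\sigma\approx 0.798\,t/\sigma, which indeed lies strictly between \frac{2}{3}\frac{t}{\sigma} and \frac{4}{5}\frac{t}{\sigma}; it is this small-t/\sigma regime that is used in the proofs.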
Lemma 16 (Bernstein inequality, Theorem 3.1.7 in [21]).

Let $X_{i}$, $1\leq i\leq n$, be independent centered random variables a.s. bounded by $c<\infty$ in absolute value. Set $\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}X_{i}^{2}$ and $S_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$. Then, for all $t\geq 0$,

P\left(S_{n}\geq\sqrt{\frac{2\sigma^{2}t}{n}}+\frac{ct}{3n}\right)\leq e^{-t}.
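For illustration, a Monte Carlo check of this tail bound with $X_{i}$ uniform on $[-1,1]$ (so $c=1$ and $\mathbb{E}X_{i}^{2}=1/3$); this particular instance is our assumption, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 200, 3.0, 10 ** 5
c, sigma2 = 1.0, 1.0 / 3.0                       # |X_i| <= 1, E[X_i^2] = 1/3
X = rng.uniform(-1.0, 1.0, size=(trials, n))
S = X.mean(axis=1)                               # S_n = (1/n) sum_i X_i
threshold = np.sqrt(2 * sigma2 * t / n) + c * t / (3 * n)
print("empirical tail:", np.mean(S >= threshold), " Bernstein bound:", np.exp(-t))
```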

Next, we provide some preliminaries on Orlicz norms.

Let $g:[0,\infty)\rightarrow[0,\infty)$ be a non-decreasing convex function with $g(0)=0$. The $g$-Orlicz norm of a real-valued random variable $X$ is given by

\|X\|_{g}:=\inf\left\{C>0:\mathbb{E}\left[g\left(\frac{|X|}{C}\right)\right]\leq 1\right\}.

If $\|X\|_{\psi_{\alpha}}<\infty$, we say that $X$ is sub-Weibull of order $\alpha>0$, where

\psi_{\alpha}(x):=e^{x^{\alpha}}-1.

Note that when $\alpha\geq 1$, $\|\cdot\|_{\psi_{\alpha}}$ is a norm, and when $0<\alpha<1$, $\|\cdot\|_{\psi_{\alpha}}$ is a quasi-norm. In the related proofs, we frequently use the fact that for a real-valued random variable $X\sim\mathcal{N}(0,1)$, we have $\|X\|_{\psi_{2}}\leq\sqrt{6}$ and $\|X^{2}\|_{\psi_{1}}=\|X\|_{\psi_{2}}^{2}\leq 6$. Moreover, when $\|X\|_{\psi_{2}}<\infty$ and $\|Y\|_{\psi_{2}}<\infty$, we have $\|XY\|_{\psi_{1}}\leq\|X\|_{\psi_{2}}\|Y\|_{\psi_{2}}$. Indeed, without loss of generality we can assume that $\|X\|_{\psi_{2}}=\|Y\|_{\psi_{2}}=1$; then

\mathbb{E}\left[e^{|XY|}\right]\leq\mathbb{E}\left[e^{\frac{|X|^{2}}{2}+\frac{|Y|^{2}}{2}}\right]=\mathbb{E}\left[e^{\frac{|X|^{2}}{2}}e^{\frac{|Y|^{2}}{2}}\right]\leq\frac{1}{2}\mathbb{E}\left[e^{|X|^{2}}+e^{|Y|^{2}}\right]\leq 2,

where the first and second inequalities follow from $2ab\leq a^{2}+b^{2}$ for $a\geq 0$, $b\geq 0$, and the last step uses $\mathbb{E}[e^{|X|^{2}}]\leq 2$ and $\mathbb{E}[e^{|Y|^{2}}]\leq 2$. Hence $\|XY\|_{\psi_{1}}\leq 1=\|X\|_{\psi_{2}}\|Y\|_{\psi_{2}}$.
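These Orlicz-norm facts are easy to probe numerically. The sketch below (our illustration) estimates $\|\cdot\|_{\psi_{2}}$ and $\|\cdot\|_{\psi_{1}}$ directly from the definition by bisection over $C$; Monte Carlo averages of $e^{|X|^{2}/C^{2}}$ are heavy-tailed near the critical $C$, so the estimates are crude.

```python
import numpy as np

rng = np.random.default_rng(0)

def orlicz_norm(sample, g):
    """Smallest C with E[g(|X|/C)] <= 1, found by bisection (crude Monte Carlo)."""
    lo, hi = 1e-3, 1e3
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.mean(g(np.abs(sample) / mid)) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

psi2 = lambda x: np.expm1(x ** 2)   # psi_2(x) = e^{x^2} - 1
psi1 = lambda x: np.expm1(x)        # psi_1(x) = e^{x}   - 1

X = rng.standard_normal(10 ** 6)
Y = rng.standard_normal(10 ** 6)
print("||X||_psi2  ~", orlicz_norm(X, psi2), "(exact sqrt(8/3) ~ 1.633 <= sqrt(6))")
print("||XY||_psi1 ~", orlicz_norm(X * Y, psi1),
      "<=", orlicz_norm(X, psi2) * orlicz_norm(Y, psi2))
```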

Lemma 17 (Theorem 3.1 in [22]).

If $X_{1},\cdots,X_{n}$ are independent mean-zero random variables with $\|X_{i}\|_{\psi_{\alpha}}<\infty$ for all $1\leq i\leq n$ and some $\alpha>0$, then for any vector $a=(a_{1},\cdots,a_{n})\in\mathbb{R}^{n}$, the following holds true:

P\left(\left|\sum\limits_{i=1}^{n}a_{i}X_{i}\right|\geq 2eC(\alpha)\|b\|_{2}\sqrt{t}+2eL_{n}^{*}(\alpha)t^{1/\alpha}\|b\|_{\beta(\alpha)}\right)\leq 2e^{-t}\quad\text{for all }t\geq 0,

where $b=(a_{1}\|X_{1}\|_{\psi_{\alpha}},\cdots,a_{n}\|X_{n}\|_{\psi_{\alpha}})\in\mathbb{R}^{n}$,

C(\alpha):=\max\{\sqrt{2},2^{1/\alpha}\}\times\begin{cases}\sqrt{8}(2\pi)^{1/4}e^{1/24}(e^{2/e}/\alpha)^{1/\alpha},&\text{if }\alpha<1,\\ 4e+2(\log 2)^{1/\alpha},&\text{if }\alpha\geq 1,\end{cases}

and, with $\beta(\alpha)=\infty$ when $\alpha\leq 1$ and $\beta(\alpha)=\alpha/(\alpha-1)$ when $\alpha>1$,

L_{n}(\alpha):=\frac{4^{1/\alpha}}{\sqrt{2}\|b\|_{2}}\times\begin{cases}\|b\|_{\beta(\alpha)},&\text{if }\alpha<1,\\ 4e\|b\|_{\beta(\alpha)}/C(\alpha),&\text{if }\alpha\geq 1,\end{cases}

and $L^{*}_{n}(\alpha)=L_{n}(\alpha)C(\alpha)\|b\|_{2}/\|b\|_{\beta(\alpha)}$.
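For readers who want to evaluate the deviation level in Lemma 17 concretely, the following is a direct transcription of these constants into code (a numerical reading aid; the choice of $a$ and the $\psi_{\alpha}$-norms in the demo is ours):

```python
import numpy as np

def C_alpha(alpha):
    """The constant C(alpha) of Lemma 17, transcribed verbatim."""
    pre = max(np.sqrt(2), 2 ** (1 / alpha))
    if alpha < 1:
        return pre * np.sqrt(8) * (2 * np.pi) ** 0.25 * np.exp(1 / 24) \
               * (np.exp(2 / np.e) / alpha) ** (1 / alpha)
    return pre * (4 * np.e + 2 * np.log(2) ** (1 / alpha))

def deviation_level(a, psi_norms, alpha, t):
    """2e C(a)||b||_2 sqrt(t) + 2e L*_n(a) t^{1/a} ||b||_beta, where
    b_i = a_i ||X_i||_{psi_alpha}; the tail above this level is <= 2 e^{-t}."""
    b = np.abs(np.asarray(a) * np.asarray(psi_norms))
    b2 = np.linalg.norm(b)
    # beta(alpha) = infinity when alpha <= 1, alpha/(alpha-1) when alpha > 1
    b_beta = b.max() if alpha <= 1 else np.sum(b ** (alpha / (alpha - 1))) ** ((alpha - 1) / alpha)
    C = C_alpha(alpha)
    Ln = 4 ** (1 / alpha) / (np.sqrt(2) * b2) * (b_beta if alpha < 1 else 4 * np.e * b_beta / C)
    Ln_star = Ln * C * b2 / b_beta
    return 2 * np.e * C * b2 * np.sqrt(t) + 2 * np.e * Ln_star * t ** (1 / alpha) * b_beta

# Example: an equally weighted sum of n sub-Gaussian (alpha = 2) variables.
n = 100
print(deviation_level(np.ones(n) / np.sqrt(n), np.full(n, np.sqrt(6.0)), alpha=2.0, t=3.0))
```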

Lemma 18.

For any $r\in[m]$, we have that with probability at least $1-\delta$,

\left|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\right|\lesssim\sqrt{\log\left(\frac{1}{\delta}\right)}.

Moreover, its $\psi_{2}$-norm is bounded by a universal constant.

Proof.

Note that $\mathbb{E}[\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)]=0$ and $\|\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\|_{\psi_{2}}\leq\||\tilde{w}_{rk}(0)^{T}u|\|_{\psi_{2}}=\mathcal{O}(1)$, so applying Lemma 17 yields that with probability at least $1-\delta$,

\left|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\right|\lesssim\sqrt{\log\left(\frac{1}{\delta}\right)}.

Moreover, from the equivalence between the $\psi_{2}$-norm and concentration inequalities (see Lemma 2.2.1 in [23]), it follows that

\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\right\|_{\psi_{2}}=\mathcal{O}(1).
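A small simulation consistent with Lemma 18 (the Rademacher $\tilde{a}_{rk}$, Gaussian $\tilde{w}_{rk}(0)$, and ReLU $\sigma$ below are our assumptions about the paper's initialization):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, trials = 16, 512, 2000
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                        # take ||u||_2 = 1

a = rng.choice([-1.0, 1.0], size=(trials, p))  # a~_rk ~ Rademacher (assumed)
W = rng.standard_normal((trials, p, d))        # w~_rk(0) ~ N(0, I_d) (assumed)
relu = lambda z: np.maximum(z, 0.0)
S = (a * relu(W @ u)).sum(axis=1) / np.sqrt(p)  # (1/sqrt(p)) sum_k a~_rk sigma(w~_rk(0)^T u)

delta = 0.01
print("empirical 1-delta quantile of |S|:", np.quantile(np.abs(S), 1 - delta),
      " vs sqrt(log(1/delta)):", np.sqrt(np.log(1 / delta)))
```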

Lemma 19.

With probability at least $1-\delta$, we have

L(0)\lesssim n_{1}n_{2}\log\left(\frac{n_{1}n_{2}}{\delta}\right).
Proof.
\begin{aligned}
L(0)&=\frac{1}{2}\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}(G^{0}(u_{i})(y_{j})-z_{j}^{i})^{2}\\
&\leq\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\left[(G^{0}(u_{i})(y_{j}))^{2}+(z_{j}^{i})^{2}\right].
\end{aligned}

Recall that

G^{0}(u)(y)=\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\right]\sigma(w_{r}(0)^{T}y).
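In code, this map at initialization reads as follows (a minimal sketch; the Gaussian weights, Rademacher $\tilde{a}_{rk}$, ReLU $\sigma$, and the dimensions are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Minimal sketch of G^0(u)(y): d_u and d_y are the dimensions of the
# input-function samples u and the query points y (assumed).
m, p, d_u, d_y = 256, 64, 20, 3
W = rng.standard_normal((m, d_y))             # w_r(0)
W_tilde = rng.standard_normal((m, p, d_u))    # w~_rk(0)
A_tilde = rng.choice([-1.0, 1.0], (m, p))     # a~_rk

def G0(u, y):
    branch = (A_tilde * relu(W_tilde @ u)).sum(axis=1) / np.sqrt(p)  # shape (m,)
    trunk = relu(W @ y)                                              # shape (m,)
    return branch @ trunk / np.sqrt(m)

u = rng.standard_normal(d_u)
y = rng.standard_normal(d_y)
print("G^0(u)(y) =", G0(u, y))
```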

Note that

\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\right\|_{\psi_{2}}=\mathcal{O}(1),\quad\left\|\sigma(w_{r}(0)^{T}y_{j})\right\|_{\psi_{2}}=\mathcal{O}(1),

thus

\left\|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\right]\sigma(w_{r}(0)^{T}y_{j})\right\|_{\psi_{1}}=\mathcal{O}(1).

Then from Lemma 12, we have that with probability at least $1-\delta$,

L(0)\lesssim n_{1}n_{2}\left(\sqrt{\log\left(\frac{n_{1}n_{2}}{\delta}\right)}+\frac{\log\left(\frac{n_{1}n_{2}}{\delta}\right)}{\sqrt{m}}\right)^{2}\lesssim n_{1}n_{2}\log\left(\frac{n_{1}n_{2}}{\delta}\right).

Lemma 20.

If $\|X\|_{\psi_{\alpha}},\|Y\|_{\psi_{\beta}}<\infty$ with $\alpha,\beta>0$, then we have $\|XY\|_{\psi_{\gamma}}\leq\|X\|_{\psi_{\alpha}}\|Y\|_{\psi_{\beta}}$, where $\gamma$ satisfies

\frac{1}{\gamma}=\frac{1}{\alpha}+\frac{1}{\beta}.
Proof.

Without loss of generality, we can assume that $\|X\|_{\psi_{\alpha}}=\|Y\|_{\psi_{\beta}}=1$. We use Young's inequality, which states that

xy\leq\frac{x^{p}}{p}+\frac{y^{q}}{q}\quad\text{for }x,y\geq 0\text{ and }p,q>1\text{ with }\frac{1}{p}+\frac{1}{q}=1.

Let $p=\alpha/\gamma$ and $q=\beta/\gamma$; by the definition of $\gamma$, $1/p+1/q=\gamma/\alpha+\gamma/\beta=1$. Then

\begin{aligned}
\mathbb{E}[\exp(|XY|^{\gamma})]&\leq\mathbb{E}\left[\exp\left(\frac{|X|^{\gamma p}}{p}+\frac{|Y|^{\gamma q}}{q}\right)\right]\\
&=\mathbb{E}\left[\exp\left(\frac{|X|^{\alpha}}{p}\right)\exp\left(\frac{|Y|^{\beta}}{q}\right)\right]\\
&\leq\mathbb{E}\left[\frac{\exp(|X|^{\alpha})}{p}+\frac{\exp(|Y|^{\beta})}{q}\right]\\
&\leq\frac{2}{p}+\frac{2}{q}\\
&=2,
\end{aligned}

where the first and second inequalities follow from Young's inequality, and the last two steps use $\mathbb{E}[\exp(|X|^{\alpha})]\leq 2$, $\mathbb{E}[\exp(|Y|^{\beta})]\leq 2$ and $1/p+1/q=1$. From this, we have that $\|XY\|_{\psi_{\gamma}}\leq\|X\|_{\psi_{\alpha}}\|Y\|_{\psi_{\beta}}$.
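One can probe Lemma 20 numerically with the bisection estimator used earlier; for instance (our choice of example), take $X\sim\mathcal{N}(0,1)$ with $\alpha=2$ and $Y=Z^{2}$ for $Z\sim\mathcal{N}(0,1)$ with $\beta=1$, giving $\gamma=2/3$:

```python
import numpy as np

rng = np.random.default_rng(0)

def orlicz_norm(sample, g):
    """Smallest C with E[g(|X|/C)] <= 1, by bisection (crude Monte Carlo)."""
    lo, hi = 1e-3, 1e3
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.mean(g(np.abs(sample) / mid)) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

psi = lambda a: (lambda x: np.expm1(x ** a))   # psi_alpha(x) = e^{x^alpha} - 1

X = rng.standard_normal(10 ** 6)               # alpha = 2
Y = rng.standard_normal(10 ** 6) ** 2          # beta = 1 (chi-squared)
gamma = 1 / (1 / 2 + 1 / 1)                    # gamma = 2/3
lhs = orlicz_norm(X * Y, psi(gamma))
rhs = orlicz_norm(X, psi(2.0)) * orlicz_norm(Y, psi(1.0))
print(f"||XY||_psi_{gamma:.2f} ~ {lhs:.3f} <= {rhs:.3f}")
```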

Lemma 21.

With probability at least $1-\delta$, we have

\|G^{0}(u)\|_{2}^{2}=L(W(0),\tilde{W}(0))\lesssim n_{1}d\log\left(\frac{n_{1}(n_{2}+n_{3})}{\delta}\right).
Proof.

Recall that the loss function of the physics-informed neural operator is

L(W,\tilde{W})=\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\frac{1}{n_{2}}(\mathcal{L}G(u_{i})(y_{j_{1}})-f(y_{j_{1}}))^{2}+\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}\frac{1}{n_{3}}(G(u_{i})(\tilde{y}_{j_{2}})-g(\tilde{y}_{j_{2}}))^{2}

and the shallow neural operator has the following form

G(u)(y)=\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u)\right]\sigma_{3}(w_{r}^{T}y).

In order to estimate the value of the loss at initialization, it suffices to consider

\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u)\right]\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y))

and

\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u)\right]\sigma_{3}(w_{r}(0)^{T}y).

Note that $|\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y))|\lesssim\|w_{r}(0)\|_{2}^{2}|w_{r}(0)^{T}y|$, thus Lemma 20 implies that

\|\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y))\|_{\psi_{\frac{2}{3}}}\lesssim\left\|\|w_{r}(0)\|_{2}^{2}|w_{r}(0)^{T}y|\right\|_{\psi_{\frac{2}{3}}}\leq\left\|\|w_{r}(0)\|_{2}^{2}\right\|_{\psi_{1}}\left\||w_{r}(0)^{T}y|\right\|_{\psi_{2}}=\mathcal{O}(d).
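As a concrete instance (assuming, for illustration only, that $\mathcal{L}=\Delta$ is the Laplacian and $\sigma_{3}(z)=z^{3}$), a direct computation gives

\Delta\sigma_{3}(w_{r}(0)^{T}y)=6\,(w_{r}(0)^{T}y)\,\|w_{r}(0)\|_{2}^{2},

which matches the bound $|\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y))|\lesssim\|w_{r}(0)\|_{2}^{2}|w_{r}(0)^{T}y|$; for $w_{r}(0)\sim\mathcal{N}(0,I_{d})$, the quantity $\|w_{r}(0)\|_{2}^{2}$ is a chi-squared variable with $\left\|\|w_{r}(0)\|_{2}^{2}\right\|_{\psi_{1}}=\mathcal{O}(d)$, which is where the factor $d$ enters.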

Therefore, combining with Lemma 18 yields that

\left\|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u)\right]\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y))\right\|_{\psi_{\frac{1}{2}}}=\mathcal{O}(d).

Similarly, we can deduce that

\left\|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u)\right]\sigma_{3}(w_{r}(0)^{T}y)\right\|_{\psi_{\frac{1}{2}}}=\mathcal{O}(d).

Finally, applying Lemma 17 leads to the conclusion that, with probability at least $1-\delta$,

\|G^{0}(u)\|_{2}^{2}=L(W(0),\tilde{W}(0))\lesssim n_{1}d\log\left(\frac{n_{1}(n_{2}+n_{3})}{\delta}\right).