Convergence analysis of wide shallow neural operators within the framework of Neural Tangent Kernel

Xianliang Xu¹, Ye Li² and Zhongyi Huang¹ 1 Tsinghua University, Beijing, China.
2 Nanjing University of Aeronautics and Astronautics, Nanjing, China.

Abstract.

Neural operators are aiming at approximating operators mapping between Banach spaces of functions, achieving much success in the field of scientific computing. Compared to certain deep learning-based solvers, such as Physics-Informed Neural Networks (PINNs), Deep Ritz Method (DRM), neural operators can solve a class of Partial Differential Equations (PDEs). Although much work has been done to analyze the approximation and generalization error of neural operators, there is still a lack of analysis on their training error. In this work, we conduct the convergence analysis of gradient descent for the wide shallow neural operators and physics-informed shallow neural operators within the framework of Neural Tangent Kernel (NTK). The core idea lies on the fact that over-parameterization and random initialization together ensure that each weight vector remains near its initialization throughout all iterations, yielding the linear convergence of gradient descent. In this work, we demonstrate that under the setting of over-parametrization, gradient descent can find the global minimum regardless of whether it is in continuous time or discrete time.

^∗ This work was partially supported by the NSFC Projects No. 12025104, 11871298, 81930119, 62106103.

1. Introduction

Partial Differential Equations (PDEs) are essential for modeling a wide range of phenomena in physics, biology, and engineering. Nonetheless, the numerical solution of PDEs has always been a significant challenge in the field of scientific computation. Traditional numerical approaches, such as finite difference, finite element, finite volume, and spectral methods, can encounter difficulties due to the curse of dimensionality when applied to PDEs with a high number of dimensions. In recent years, the impressive achievements of deep learning in various domains, including computer vision, natural language processing, and reinforcement learning, have led to an increased interest in utilizing machine learning techniques to tackle PDE-related problems.

For scientific problems, neural network-based methods are primarily divided into two categories: neural solvers and neural operators. Neural solvers, such as PINNs [1], DRM [2], utilize neural networks to represent the solutions of PDEs, minimizing some form of residual to enable the neural networks to approximate the true solutions closely. There are two potential advantages of neural solvers. First, this is an unsupervised learning approach, which means it does not require the costly process of obtaining a large number of labels as in supervised learning. Second, as a powerful represention tool, neural networks are known to be effictive for approximating continuous functions [3], smooth functions [4], Sobolev functions [5]. This presents a potentially viable avenue for addressing the challenges of high-dimensional PDEs. Nevertheless, existing neural solvers face several limitations when compared to classical numerical solvers like FEM and FVM, particularly in terms of accuracy, and convergence issues. In addition, neural solvers are typically limited to solving a fixed PDE. If certain parameters of the PDE change, it becomes necessary to retrain the neural network.

Neural operator (also called operator learning) aims to approximate unknown operator, which often takes the form of the solution operator associated with a differential equation. Unlike most supervised learning methods in the field of machine learning, where both inputs and outputs are of finite dimensions, operator learning can be regarded as a form of supervised learning in function spaces. Because the inputs and outputs of neural networks are of finite dimensions, some operator learning methods, such as PCA-net and DeepONet, use an encoder to convert infinite-dimensional inputs into finite-dimensional ones, and a decoder to convert finite-dimensional outputs back into infinite-dimensional outputs. The PCA-Net architecture was proposed as an operator learning framework in [6], where principal component analysis (PCA) is employed to obtain data-driven encoders and decoders, combining with a neural network mapping between the finite-dimensional latent spaces. Building on early work by [7], DeepONet [8] consists of a deep neural network for encoding the discrete input function space and another deep neural network for encoding the domain of the output functions. The encoding network is conventionally referred to as the “branch-net”, while the decoding network is referred to as the “trunk-net”. In contrast to the PCA-Net and DeepONet architectures mentioned above,, Fourier Neural Operator (FNO), introduced in [9], does not follow the encoder-decoder-net paradigm. Instead, FNO is a composition of linear integral operators and nonlinear activation functions, which can be seen as a generalization the structure of finite-dimensional neural networks to a function space setting.

The theoretical research on neural operators mostly focuses on the study of approximation errors and generalization errors. As we know, the theoretical basis for the application of neural networks lies in the fact that neural networks are universal approximators. This also holds for neural operators. Regarding the analysis of approximation errors in neural operators, the aim is to identify whether neural operator also possess a universal approximation property, i.e. the ability to approximate a wide class of operators to any given accuracy. As shown in [7], (shallow) operator networks can approximate continuous operators mapping between spaces of continuous functions with arbitrary accuracy. Building on this result, DeepONets have also been proven to be universal approximators. For neural operators following the encoder-decoder paradigm like DeepONet and PCA-Net, Lemma 22 in [10] provides a consistent approximation result, which states that if two Banach spaces have the approximation property, then continuous maps between them can be approximated in a finite-dimensional manner. The universal approximation capability of the FNO was initially established in [11], drawing on concepts from Fourier analysis, and specifically leveraging the density of Fourier series to demonstrate the FNO’s ability to approximate a broad spectrum of operators. For a more quantitative analysis of the approximation error of neural operators, see [12]. In addition to approximation errors, the error analysis of encoder-decoder style neural operators also includes encoding and reconstruction errors. [13] has provided both lower and upper bounds on the total error for DeepONets by using the spectral decay properties of the covariance operators associated with the underlying measures. By employing tools from non-parametric regression, [14] has provided an analysis of the generalization error for neural operators with basis encoders and decoders. The results in [14] holds on neural operators with some popular encoders and decoders, such as those using Legendre polynomials, trigonometric functions, and PCA. For more details on the recent advances and theoretical research in operator learning, refer to the review [12].

Up to this point, the theoretical exploration of the convergence and optimization aspects of neural operators has received relatively little attention. To our best knowledge, only [15] and [16] have touched upon the optimization of neural operators. Based on restricted strong convexity (RSC), [15] has presented a unified framework for gradient descent and apply the framework to DeepONets and FNOs, establishing convergence guarantees for both. [16] has briefly analyzed the training of physics-informed DeepONets and derived a weighting scheme guided by NTK theory to balance the data and the PDE residual terms in the loss function. In this paper, we focus on the training error of shallow neural operator in [7] with the framework of NTK, showing that gradient descent converges at a global linear rate to the global optimum.

1.1. Notations

We denote $[n]=\{1,2,\cdots,n\}$ for $n\in\mathbb{N}$ . Given a set $S$ , we denote the uniform distribution on $S$ by $Unif\{S\}$ . We use $I\{E\}$ to denote the indicator function of the event $E$ . We use $A\lesssim B$ to denote an estimate that $A\leq cB$ , where $c$ is a universal constant. A universal constant means a constant independent of any variables.

2. Preliminaries

The neural operator considered in this paper was originally introduced in [4], aimining to approximate a non-linear operator. Specifically, suppose that $\sigma$ is a continuous and non-polynomial function, $X$ is a Banach space, $K_{1}\subset X$ , $K_{2}\subset\mathbb{R}^{d}$ are two compact sets in $X$ and $\mathbb{R}^{d}$ , respectively, $V$ is a compact set in $C(K_{1})$ . Assume that $G^{*}:V\rightarrow C(K_{2})$ is a nonlinear continuous operator. Then, an operator net can be formulated in terms of two shallow neural networks. The first is the so-called branch net $\beta(u)=(\beta_{1}(u),\cdots,\beta_{m}(u))$ , defined for $1\leq r\leq m$ as

\beta_{r}(u):=\sum\limits_{k=1}^{p}a_{rk}\sigma\left(\sum\limits_{s=1}^{q}\xi_{rk}^{s}u(x_{s})+\theta_{rk}\right),

where $\{x_{s}\}_{1\leq s\leq q}\subset K_{1}$ are the so-called sensors and $a_{rk},\xi_{rk}^{s},\theta_{rk}$ are weights of the neural network.

The second neural network is the so-called trunk net $\tau(y)=(\tau_{1}(y),\cdots,\tau_{m}(y))$ , defined as

\tau_{r}(y):=\sigma(w_{r}^{T}y+\zeta_{r}),1\leq r\leq m,

where $y\in K_{2}$ and $w_{r},\zeta_{r}$ are weights of the neural network. Then the branch net and trunk net are combined to approximate the non-linear operator $G^{*}$ , i.e.,

G^{*}(u)(y)\approx\sum\limits_{r=1}^{m}\beta_{r}(u)\tau_{r}(y):=G(u)(y),u\in V,y\in K_{2}.

As shown in [4], (shallow) operator networks can approximate, to arbitrary accuracy, continuous operators mapping between spaces of continuous functions. Specifically, for any $\epsilon>0$ , there are positive integers $p,m$ and $q$ , constants $a_{rk},\xi_{rk}^{s},\theta_{rk},\zeta_{r}\in\mathbb{R}$ , $w_{r}\in\mathbb{R}^{d}$ , $x_{s}\in K_{1},s=1,\cdots q$ , $r=1,\cdots,m$ and $k=1,\cdots,m$ , such that

\left|G^{*}(u)(y)-\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}a_{rk}\sigma\left(\sum\limits_{s=1}^{q}\xi_{rk}^{s}u(x_{s})+\theta_{rk}\right)\sigma(w_{r}^{T}y+\zeta_{r})\right|<\epsilon

holds for all $u\in V$ and $y\in K_{2}$ .

The training of neural networks is performed using a supervised learning process. It involves minimizing the mean-squared error between the predicted output $G(u)(y)$ and the actual output $G^{*}(u)(y)$ . Specifically, assume we have samples $\{(u_{i},G^{*}(u_{i}))\}_{i=1}^{N},u_{i}\sim\mu$ , where $\mu$ is a probability supported on $V$ . The aim is to minimize the following loss function:

\frac{1}{2}\sum\limits_{i=1}^{N}\sum\limits_{j=1}^{n}|G(u_{i})(y_{j})-G^{*}(u_{i})(y_{j})|^{2}.

In this paper, we primarily focus on the shallow neural operators with ReLU activation functions. Formally, we consider a shallow operator of the following form.

G(u)(y)=\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u)\right]\sigma(w_{r}^{T}y),

(1)

where we equate the function $u$ with its value vector $(u(x_{1}),\cdots,u(x_{q}))^{T}\in\mathbb{R}^{q}$ at the points $\{x_{s}\}_{s=1}^{q}$ .

We denote the loss function by $L(W,\tilde{W},a)$ . The main focus of this paper is to analyze the gradient descent in training the shallw neural operator. We fix the weights $\{a_{rk}\}_{r=1,k=1}^{m,p}$ and apply gradient (GD) to optimize the weights $\{w_{r}\}_{r=1}^{m},\{\tilde{w}_{rk}\}_{r=1,k=1}^{m,p}$ . Specifically,

w_{r}(t+1)=w_{r}(t)-\eta\frac{\partial L(W(t),\tilde{W}(t))}{\partial w_{r}},\ \tilde{w}_{rk}(t+1)=\tilde{w}_{rk}(t)-\eta\frac{\partial L(W(t),\tilde{W}(t))}{\partial\tilde{w}_{rk}},

where $\eta>0$ is the learning rate and $L(W,\tilde{W})$ is an abbreviation of $L(W,\tilde{W},a)$ .

At this point, the loss function is

L(W,\tilde{W})=\frac{1}{2}\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}(G(u_{i})(y_{j})-z_{j}^{i})^{2},

where $z_{j}^{i}=G^{*}(u_{i})(y_{j})$ . Throughout this paper, we consider the initialization

w_{r}(0)\sim\mathcal{N}(\bm{0},\bm{I}),\tilde{w}_{rk}(0)\sim\mathcal{N}(\bm{0},\bm{I}),\bm{a}_{rk}(0)\sim Unif\{-1,1\}

(2)

and assume that $\|u_{i}\|_{2}=\mathcal{O}(1),\|y_{j}\|_{2}=\mathcal{O}(1)$ for all $j\in[n]$ . Note that here we treat vector $u_{i}=(u_{i}(x_{1}),\cdots,u_{x_{q}})$ and function $u_{i}$ as equivalent.

3. Continuous Time Analysis

In this section, we present our result for gradient flow, which can be viewed as a continuous form of gradient descent with an infinitesimal time step size. The analysis of gradient flow in continuous time serves as a foundational step for comprehending discrete gradient descent algorithms. We prove that the gradient flow converges to the global optima of the loss under over-parameterization and some mild conditions on training samples. The time continuous form can be characterized as the following dynamics

\frac{dw_{r}(t)}{dt}=-\frac{\partial L(W(t),\tilde{W}(t))}{\partial w_{r}},\ \frac{d}{dt}\tilde{w}_{rk}(t)=-\frac{\partial L(W(t),\tilde{W}(t))}{\partial\tilde{w}_{rk}}

for $r\in[m],k\in[p]$ . We denote $G^{t}(u_{i})(y_{j})$ the prediction on $y_{j}$ at time $t$ under $u_{i}$ , i.e., with weights $w_{r}(t),\tilde{w}_{rk}(t)$ .

Thus, we can deduce that

\frac{dw_{r}(t)}{dt}=-\frac{\partial L(W(t),\tilde{W}(t))}{\partial w_{r}}=-\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w_{r}}(G(u_{i})(y_{j})-z_{j}^{i})

(3)

and

\frac{d\tilde{w}_{rk}(t)}{dt}=-\frac{\partial L(W(t),\tilde{W}(t))}{\partial\tilde{w}_{rk}}=-\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}}(G^{t}(u)(y_{j})-z_{j}^{i}).

(4)

Then, the dynamics of each prediction can be calculated as follows.

$\displaystyle\frac{dG^{t}(u_{i})(y_{j})}{dt}$	$\displaystyle=\sum\limits_{r=1}^{m}\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w_{r}},\frac{dw_{r}(t)}{dt}\right\rangle+\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\frac{d\tilde{w}_{rk}(t)}{dt}\right\rangle$	(5)
	$\displaystyle=-\sum\limits_{r=1}^{m}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w_{r}},\frac{\partial G^{t}(u_{i_{1}})(y_{j_{1}})}{\partial w_{r}}\right\rangle(G^{t}(u_{i_{1}})(y_{j_{1}})-z_{j_{1}}^{i_{1}})$
	$\displaystyle\quad-\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\frac{\partial G^{t}(u_{i_{1}})(y_{j_{1}})}{\partial\tilde{w}_{rk}}\right\rangle(G^{t}(u_{i_{1}})(y_{j_{1}})-z_{j_{1}}^{i_{1}}).$

We let $G^{t}(u_{i}):=(G^{t}(u_{i})(y_{1}),\cdots,G^{t}(u_{i})(y_{n_{2}}))\in\mathbb{R}^{n_{2}}$ and $G^{t}(u):=(G^{t}(u_{1}),\cdots,G^{t}(u_{n_{1}}))\in\mathbb{R}^{n_{1}n_{2}}$ . Then, we have

\frac{dG^{t}(u_{i})}{dt}=\left[(H_{1}^{i}(t),\cdots,H_{n_{1}}^{i}(t))+(\tilde{H}_{1}^{i}(t),\cdots,\tilde{H}_{n_{1}}^{i}(t))\right](z-G^{t}(u)),

(6)

where $H_{j}^{i}(t)\in\mathbb{R}^{n_{2}\times n_{2}}$ , whose $(i_{1},j_{1})$ -th entry is defined as

\sum\limits_{r=1}^{m}\left\langle\frac{\partial G^{t}(u_{i})(y_{i_{1}})}{\partial w_{r}},\frac{G^{t}(u_{j})(y_{j_{1}})}{\partial w_{r}}\right\rangle

and $\tilde{H}_{j}^{i}(t)\in\mathbb{R}^{n_{2}\times n_{2}}$ , whose $(i_{1},j_{1})$ -th entry is defined as

\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\langle\frac{\partial G^{t}(u_{i})(y_{i_{1}})}{\partial\tilde{w}_{rk}},\frac{G^{t}(u_{j})(y_{j_{1}})}{\partial\tilde{w}_{rk}}\right\rangle.

Thus, we can write the dynamics of predictions as follows.

\frac{dG^{t}(u)}{dt}=\left[H(t)+\tilde{H}(t)\right](z-G^{t}(u)),

where $H(t),\tilde{H}(t)\in\mathbb{R}^{n_{1}n_{2}\times n_{2}n_{2}}$ . We can divide $H(t),\tilde{H}(t)$ into $n_{1}\times n_{1}$ blocks, the $(i,j)$ -th block of $H(t)$ is $H_{j}^{i}(t)$ and the $(i,j)$ -th block of $\tilde{H}(t)$ is $\tilde{H}_{j}^{i}(t)$ . From the form of $H(t),\tilde{H}(t)$ , we can derive the Gram matrices induced by the random initialization, which we denote by $H^{\infty}$ and $\tilde{H}^{\infty}$ . Note that although $H^{\infty}$ and $\tilde{H}^{\infty}$ are large matrices, we can divide $H^{\infty}$ and $\tilde{H}^{\infty}$ into $n_{1}\times n_{1}$ blocks, where each block is a $n_{2}\times n_{2}$ matrix. Following the notation above, the $(i_{1},j_{1})$ -th entry of $(i,j)$ -th block of $H^{\infty}$ is

\mathbb{E}[\sigma(\tilde{w}^{T}u_{i})\sigma(\tilde{w}^{T}u_{j})]\mathbb{E}[y_{i_{1}}^{T}y_{j_{1}}I\{w^{T}y_{i_{1}}\geq 0,w^{T}y_{j_{1}}\geq 0\}].

Thus, the $(i,j)$ -th block can be written as

\mathbb{E}[\sigma(\tilde{w}^{T}u_{i})\sigma(\tilde{w}^{T}u_{j})]H_{2}^{\infty},

where $H_{2}^{\infty}\in\mathbb{R}^{n_{2}\times n_{2}}$ and the $(i_{1},j_{1})$ -th entry of $H_{2}^{\infty}$ is

\mathbb{E}[y_{i_{1}}^{T}y_{j_{1}}I\{w^{T}y_{i_{1}}\geq 0,w^{T}y_{j_{1}}\geq 0\}].

Thus, $H^{\infty}$ can be seen as a Kronecker product of matrices $H_{1}^{\infty}$ and $H_{2}^{\infty}$ , where $H_{1}^{\infty}\in\mathbb{R}^{n_{1}\times n_{1}}$ and the $(i,j)$ -th entry of $H_{1}^{\infty}$ is $\mathbb{E}[\sigma(\tilde{w}^{T}u_{i})\sigma(\tilde{w}^{T}u_{j})]$ . Similarly, we have that $\tilde{H}^{\infty}$ is a Kronecker product of matrices $\tilde{H}_{1}^{\infty}$ and $\tilde{H}_{2}^{\infty}$ , where $\tilde{H}_{1}^{\infty}\in\mathbb{R}^{n_{1}\times n_{1}}$ and the $(i,j)$ -th entry of $\tilde{H}_{1}^{\infty}$ is $\mathbb{E}[u_{i}^{T}u_{j}I\{\tilde{w}^{T}u_{i}\geq 0,\tilde{w}^{T}u_{j}\geq 0\}]$ , the $(i_{1},j_{1})$ -th entry of $\tilde{H}_{2}^{\infty}$ is $\mathbb{E}[\sigma(w^{T}y_{i_{1}})\sigma(w^{T}y_{j_{1}})]$ .

Similar to the situation in $L^{2}$ regression, we can show two essential facts: (1) $\|H(0)-H^{\infty}\|_{2}=\mathcal{O}(1/\sqrt{m})$ , $\|\tilde{H}(0)-\tilde{H}^{\infty}\|_{2}=\mathcal{O}(1/\sqrt{m})$ and (2) for all $t\geq 0$ , $\|H(t)-H(0)\|_{2}=\mathcal{O}(1/\sqrt{m})$ , $\|\tilde{H}(t)-\tilde{H}(0)\|_{2}=\mathcal{O}(1/\sqrt{m})$ .

Therefore, roughly speaking, as $m\to\infty$ , the dynamics of the predictions can be written as

\frac{d}{dt}G^{t}(u)=\left(H^{\infty}+\tilde{H}^{\infty}\right)(z-G^{t}(u)),

which results in the linear convergence.

We first show that the Gram matrices are strictly positive definite under mild assumptions.

Lemma 1.

If no two samples in $\{y_{j}\}_{j=1}^{n_{2}}$ are parallel and no two samples in $\{u_{i}\}_{i=1}^{n_{1}}$ are parallel, then $H^{\infty}$ and $\tilde{H}^{\infty}$ are strictly positive definite. We denote the least eigenvalue of $H^{\infty}$ and $\tilde{H}^{\infty}$ as $\lambda_{0}$ and $\tilde{\lambda}_{0}$ respectively.

Remark 1.

In fact, when we consider neural networks with bias, it is natural that Lemma 1 holds. Specifically, for $r\in[m],j\in[n_{2}]$ , we can replace $w_{r}$ and $y_{j}$ by $(w_{r}^{T},1)^{T}$ and $(y_{j}^{T},1)^{T}$ . Thus Lemma 1 holds under the condition that no two samples in $\{y_{j}\}_{j=1}^{n_{2}}$ are identical, which holds naturally.

Then we can verify the two facts that $H(0),\tilde{H}(0)$ are close to $H^{\infty},\tilde{H}^{\infty}$ and $H(t),\tilde{H}(t)$ are close to $H(0),\tilde{H}(0)$ by following two lemmas.

Lemma 2.

If $m=\Omega\left(\frac{n_{1}^{2}n_{2}^{2}}{min(\lambda_{0}^{2},\tilde{\lambda}_{0}^{2})}\log\left(\frac{n_{1}n_{2}}{\delta}\right)\right)$ , we have with probability at least $1-\delta$ , $\|H(0)-H^{\infty}\|_{2}\leq\frac{\lambda_{0}}{4}$ , $\|\tilde{H}(0)-\tilde{H}^{\infty}\|_{2}\leq\frac{\tilde{\lambda}_{0}}{4}$ and $\lambda_{min}(H(0))\geq\frac{3}{4}\lambda_{0}$ , $\lambda_{min}(\tilde{H}(0))\geq\frac{3}{4}\tilde{\lambda}_{0}$ .

Lemma 3.

Let $R,\tilde{R}\in(0,1)$ . If $w_{1}(0),\cdots,w_{m}(0),\tilde{w}_{00}(0),\cdots\tilde{w}_{mp}(0)$ are i.i.d. generated $\mathcal{N}(\bm{0},\bm{I})$ . For any set of weight vectors $w_{1},\cdots,w_{m},\tilde{w}_{00},\cdots\tilde{w}_{mp}\in\mathbb{R}^{d}$ that satisfy for any $r\in[m],k\in[p]$ , $\|w_{r}-w_{r}(0)\|_{2}\leq R$ and $\|\tilde{w}_{rk}-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R}$ , then we have with probability at least $1-\delta-n_{2}exp(-mR)$ ,

\|H-H(0)\|_{F}\lesssim n_{1}n_{2}p\tilde{R}^{2}+n_{1}n_{2}\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}+n_{1}n_{2}R\log\left(\frac{mn_{1}}{\delta}\right)

(7)

and with probability at least $1-\delta-n_{2}exp(-mp\tilde{R})$ ,

\|\tilde{H}-\tilde{H}(0)\|_{F}\lesssim n_{1}n_{2}R\sqrt{\log\left(\frac{n_{2}}{\delta}\right)}+n_{1}n_{2}\tilde{R}\log\left(\frac{mn_{2}}{\delta}\right),

(8)

where the $(i_{1},j_{1})$ -th entry of the $(i,j)$ -th block of $H$ is

H_{i,j}^{i_{1},j_{1}}=\frac{1}{m}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{i})\right]\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{j})\right]y_{i_{1}}^{T}y_{j_{1}}I\{w_{r}^{T}y_{i_{1}}\geq 0,w_{r}^{T}y_{j_{1}}\geq 0\}

and the $(i_{1},j_{1})$ -th entry of the $(i,j)$ -th block of $\tilde{H}$ is

\tilde{H}_{i,j}^{i_{1},j_{1}}=\frac{1}{m}\frac{1}{p}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}u_{i}^{T}u_{j}I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}\sigma(w_{r}^{T}y_{i_{1}})\sigma(w_{r}^{T}y_{j_{1}}).

With these preparations, we come to the final conclusion.

Theorem 1.

Suppose the condition in Lemma 1 holds and under initialization as described in (2), then with probability at least $1-\delta$ , we have

\|z-G^{t}(u)\|_{2}^{2}\leq exp(-(\lambda_{0}+\tilde{\lambda}_{0})t)\|z-G^{0}(u)\|_{2}^{2},

where

m=\Omega\left(\frac{n_{1}^{4}n_{2}^{4}\log\left(\frac{n_{1}n_{2}}{\delta}\right)\log^{3}\left(\frac{m}{\delta}\right)}{(min(\lambda_{0},\tilde{\lambda}_{0}))^{2}(\lambda_{0}+\tilde{\lambda}_{0})^{2}}\right).

Proof Sketch: Note that

\displaystyle\frac{d}{dt}\|z-G^{t}(u)\|_{2}^{2}

\displaystyle=-2(z-G^{t}(u))^{T}(H(t)+\tilde{H}(t))(z-G^{t}(u)),

(9)

thus if $\lambda_{min}(H(t))\geq\lambda_{0}/2$ and $\lambda_{min}(\tilde{H}(t))\geq\tilde{\lambda}_{0}/2$ , we have

\frac{d}{dt}\|z-G^{t}(u)\|_{2}^{2}\leq-(\lambda_{0}+\tilde{\lambda}_{0})\|z-G^{t}(u)\|_{2}^{2}.

This yields that $\frac{d}{dt}\left(exp((\lambda_{0}+\tilde{\lambda}_{0})t)\|z-G^{t}(u)\|_{2}^{2}\right)\leq 0$ , i.e., $exp((\lambda_{0}+\tilde{\lambda}_{0})t)\|z-G^{t}(u)\|_{2}^{2}$ is non-increasing, thus we have

\|z-G^{t}(u)\|_{2}^{2}\leq exp(-(\lambda_{0}+\tilde{\lambda}_{0})t)\|z-G^{0}(u)\|_{2}^{2}.

(10)

On the other hand, roughly speaking, the continous dynamics of $w_{r}(t)$ and $\tilde{w}_{rk}(t)$ , i.e., (3) and (4), show that

\left\|\frac{dw_{r}(t)}{dt}\right\|_{2}\sim\frac{\|z-G^{t}(u)\|_{2}}{\sqrt{m}},\ \left\|\frac{d\tilde{w}_{rk}(t)}{dt}\right\|_{2}\sim\frac{\|z-G^{t}(u)\|_{2}}{\sqrt{mp}}.

Thus if the prediction decays like (10), we can deduce that

\|w_{r}(t)-w_{r}(0)\|_{2}\sim\frac{1}{\sqrt{m}},\ \|\tilde{w}_{rk}(t)-\tilde{w}_{rk}(0)\|_{2}\sim\frac{1}{\sqrt{mp}}.

Combining this with the stability of the descrete Gram matrices, i.e., Lemma 3, we have $\|H(t)-H^{\infty}\|_{2}\leq\lambda_{0}/4$ , $\|\tilde{H}(t)-\tilde{H}^{\infty}\|_{2}\leq\tilde{\lambda}_{0}/4$ and $\lambda_{min}(H(t))\geq\lambda_{0}/2$ , $\lambda_{min}(\tilde{H}(t))\geq\tilde{\lambda}_{0}/2$ , when $m$ is sufficiently large.

From such equivalence, we can arrive at the desired conclusion.

Remark 2.

The result in Theorem 1 indicates that $m=\Omega(Poly(1/min(\lambda_{0},\tilde{\lambda}_{0})))$ , which may lead to strict requirement for $m$ . In fact, from (11), we can see that $\lambda_{min}(H(t))\geq\lambda_{0}/2$ or $\lambda_{min}(\tilde{H}(t))\geq\tilde{\lambda}_{0}/2$ is enough. Thus, $m=\Omega(Poly(1/max(\lambda_{0},\tilde{\lambda}_{0})))$ is sufficient.

4. Discrete Time Analysis

In this section, we are going to demonstrate that randomly initialized gradient in training shallow neural operators converges to the golbal minimum at a linear rate. Unlike the continuous time case, the discrete time case requires a more refined analysis. In the following, we first present our main result and then outline the proof’s approach.

Theorem 2.

Under the setting of Theorem 1, if we set $\eta=\mathcal{O}\left(\frac{1}{\|H^{\infty}\|_{2}+\|\tilde{H}^{\infty}\|_{2}}\right)$ , then with probability at least $1-\delta$ , we have

\|z-G^{t}(u)\|_{2}^{2}\leq\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)^{t}\|z-G^{0}\|_{2}^{2},

where

m=\Omega\left(\frac{n_{1}^{4}n_{2}^{4}\log\left(\frac{n_{1}n_{2}}{\delta}\right)\log^{3}\left(\frac{m}{\delta}\right)}{(min(\lambda_{0},\tilde{\lambda}_{0}))^{2}(\lambda_{0}+\tilde{\lambda}_{0})^{2}}\right).

For the $L^{2}$ regression problems, [17] has demonstrated that if the learning rate $\eta=\mathcal{O}(\lambda_{0}/n^{2})$ , then randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate when $m$ is large enough. The requirement of $\eta$ is derived from the decomposition for the residual in the $(k+1)$ -th iteration, i.e.,

y-u(k+1)=y-u(k)-(u(k+1)-u(k)),

where $u(t)$ is the prediction vector at $t$ -iteration under the shallow neural network and $y$ is the true prediction. Although using the method from [17] can also yield linear convergence of gradient descent in training shallow neural operators, the requirements for the learning rate $\eta$ would be very stringent due to the dependency on $\lambda_{0},\tilde{\lambda}_{0}$ and $n$ . Thus, instead of decomposing the residual into the two terms as above, we write it as follows, which serves as a recursion formula.

Lemma 4.

For all $t\in\mathbb{N}$ , we have

z-G^{t+1}(u)=\left[I-\eta\left(H(t)+\tilde{H}(t)\right)\right]\left(z-G^{t}(u)\right)-I(t),

where $I(t)\in\mathbb{R}^{n_{1}n_{2}}$ is the residual term. We can divide it into $n_{1}$ blocks, each block belongs to $\mathbb{R}^{n}_{2}$ and the $j$ -th component of $i$ -th block is defined as

I_{i,j}(t)=G^{t+1}(u_{i})(y_{j})-G^{t}(u_{i})(y_{j})-\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle.

(11)

Just as in the case of $L^{2}$ regression, we prove our conclusion by induction. From the recursive formula above, it can be seen that both the estimation of $H(t)$ and $\tilde{H}(t)$ , as well as the estimation of the residual $I(t)$ , depend on $\|w_{r}(t)-w_{r}(0)\|_{2}$ and $\|\tilde{w}_{rk}(t)-\tilde{w}_{rk}(0)\|_{2}$ . Therefore, our inductive hypothesis is the following differences between weights and their initializations.

Condition 1.

At the $s$ -th iteration, we have

\|w_{r}(s)-w_{r}(0)\|_{2}\lesssim\frac{\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{m}{\delta}\right)}:=R^{{}^{\prime}}

(12)

and

\|\tilde{w}_{rk}(s)-\tilde{w}_{rk}(0)\|_{2}\lesssim\frac{\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{m}\sqrt{p}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{m}{\delta}\right)}:=\tilde{R}^{{}^{\prime}}

(13)

and $|w_{r}(s)^{T}y_{j}|\leq B$ for all $r\in[m]$ and $j\in[n_{2}]$ , where $B=2\sqrt{\log\left(\frac{mn_{2}}{\delta}\right)}$ .

This condition can lead to the linear convergence of gradient descent, i.e., result in Theorem 2.

Corollary 1.

If Condition 1 holds for $s=0,\cdots,t$ , then we have that

\|z-G^{s}(u)\|_{2}^{2}\leq\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)^{s}\|z-G^{0}(u)\|_{2}^{2}

holds for $s=0,\cdots,t$ , where $m$ is required to satisfy that

m=\Omega\left(\frac{n_{1}^{4}n_{2}^{4}\log\left(\frac{n_{1}n_{2}}{\delta}\right)\log^{3}\left(\frac{m}{\delta}\right)}{(min(\lambda_{0},\tilde{\lambda}_{0}))^{2}(\lambda_{0}+\tilde{\lambda}_{0})^{2}}\right).

Proof Sketch: Under the setting of over-parameterization, we can show that the weights $w_{r}(s)$ , $\tilde{w}_{rk}(s)$ stay close to the initialization $w_{r}(0)$ , $\tilde{w}_{rk}(0)$ . Thus, with the stability of the discrete Gram matrices, i.e., Lemma 3, we can deduce that $\lambda_{min}(H(s))\geq\lambda_{0}/2$ and $\lambda_{min}(\tilde{H}(s))\geq\tilde{\lambda}_{0}/2$ . Then combining with the Lemma 4, we have

		$\displaystyle\\|z-G^{s+1}(u)\\|_{2}^{2}$		(14)
		$\displaystyle=\left\\|\left(I-\eta\left(H(s)+\tilde{H}(s)\right)\right)(z-G^{s}(u))\right\\|_{2}^{2}+\\|I(s)\\|_{2}^{2}-2\left\langle\left(I-\eta(H(s)+\tilde{H}(s))\right)(z-G^{s}(u)),I(s)\right\rangle$
		$\displaystyle\leq\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)^{2}\\|z-G^{s}(u)\\|_{2}^{2}+\\|I(s)\\|_{2}^{2}+2\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)\\|z-G^{s}(u)\\|_{2}\\|I(s)\\|_{2},$

where the inequality requires that $I-\eta(H(s)+\tilde{H}(s))$ is positive definite. Since $\|H(s)-H^{\infty}\|_{2}=\mathcal{O}(1/\sqrt{m})$ and $\|\tilde{H}(s)-\tilde{H}^{\infty}\|_{2}=\mathcal{O}(1/\sqrt{m})$ , $\eta=\mathcal{O}\left(\frac{1}{\|H^{\infty}\|_{2}+\|\tilde{H}^{\infty}\|_{2}}\right)$ is sufficient to ensure that $I-\eta(H(s)+\tilde{H}(s))$ is positive definite, when $m$ is large enough.

From (14), if $\|I(s)\|_{2}\lesssim\eta(\lambda_{0}+\tilde{\lambda}_{0})\|z-G^{s}(u)\|_{2}$ , which can be obtained from the following lemma, we can obtain that

\|z-G^{s+1}(u)\|_{2}^{2}\leq\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)\|z-G^{s}(u)\|_{2},

which directly yields the desired conclusion.

Lemma 5.

Under Condition 1, for $s=0,\cdots,t-1$ , we have

\|I(s)\|_{2}\lesssim\bar{R}\|z-G^{s}(u)\|_{2},

(15)

where

\bar{R}=\frac{\eta(n_{1}n_{2})^{\frac{3}{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right).

(16)

5. Physics-Informed Neural Operators

Let $\Gamma$ be a bounded open subset of $\mathbb{R}^{d}$ , in this section, we consider the PDE with following form:

		$\displaystyle\mathcal{L}u(y)=f(y),\ y\in(0,T)\times\Gamma$		(17)
		$\displaystyle u(\tilde{y})=g(\tilde{y}),\ \tilde{y}\in\{0\}\times\Gamma\cup[0,T]\times\partial\Gamma,$		(17)

where $y=(y_{0},y_{1}\cdots,y_{d})$ , $y_{0}\in(0,T)$ , $(y_{1}\cdots,y_{d})\in\bar{\Gamma}$ and $\mathcal{L}$ is a differential operator,

\mathcal{L}u(y)=\frac{\partial u}{\partial y_{0}}(y)-\sum\limits_{i=1}^{d}\frac{\partial^{2}u}{\partial y_{i}^{2}}(y)+u(y).

In this section, we consider the shallow neural operato with following form

G(u)(y)=\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u)\right]\sigma_{3}(w_{r}^{T}y),

where $\sigma_{3}(\cdot),\sigma_{2}(\cdot),\sigma(\cdot)$ are the $\text{ReLU}^{3},\text{ReLU}^{2},\text{ReLU}$ activation functions, respectively.

Given samples $y_{1},\cdots,y_{n_{2}}$ in the interior and $\tilde{y}_{1},\cdots,\tilde{y}_{n_{3}}$ on the boundary, the loss function of PINN is

L(W,\tilde{W}):=\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\frac{1}{n_{2}}(\mathcal{L}G(u_{i})(y_{j_{1}})-f(y_{j_{1}}))^{2}+\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}\frac{1}{n_{3}}(G(u_{i})(\tilde{y}_{j_{2}})-g(\tilde{y}_{j_{2}}))^{2}.

Let

s(u_{i})(y_{j})=\frac{1}{\sqrt{n_{2}}}(\mathcal{L}G(u_{i})(y_{j_{1}})-f(y_{j_{1}}))

and

h(u_{i})(\tilde{y}_{j_{2}})=\frac{1}{\sqrt{n_{3}}}(G(u_{i})(\tilde{y}_{j_{2}})-g(\tilde{y}_{j_{2}})),

then the loss function can be written as

L(W,\tilde{W})=\sum\limits_{i=1}^{n_{1}}\|s(u_{i})\|_{2}^{2}+\|h(u_{i})\|_{2}^{2},

where $s(u_{i})=(s(u_{i})(y_{1}),\cdots,s(u_{i})(y_{n_{2}}))$ , $h(u_{i})=(h(u_{i})(\tilde{y}_{1}),\cdots,h(u_{i})(\tilde{y}_{n_{3}}))$ .

We first consider the continuous setting, which is a stepping stone towards understanding discrete algorithms. For $w_{r}(t)$ and $\tilde{w}_{rt}$ , we have

	$\displaystyle\frac{dw_{r}(t)}{dt}$	$\displaystyle=-\frac{\partial L(W(t),\tilde{W}(t))}{\partial w_{r}}$		(18)
		$\displaystyle=-\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}s^{t}(u_{i})(y_{j_{1}})\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial w_{r}}-\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}h^{t}(u_{i})(\tilde{y}_{j_{2}})\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial w_{r}}$		(18)

and

	$\displaystyle\frac{d\tilde{w}_{rk}(t)}{dt}$	$\displaystyle=-\frac{\partial L(W(t),\tilde{W}(t))}{\partial\tilde{w}_{rk}}$		(19)
		$\displaystyle=-\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}s^{t}(u_{i})(y_{j_{1}})\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial\tilde{w}_{rk}}-\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}h^{t}(u_{i})(\tilde{y}_{j_{2}})\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial\tilde{w}_{rk}}.$		(19)

Thus, for the predictions, we have

		$\displaystyle\frac{ds^{t}(u_{i})(y_{j})}{dt}$		(20)
		$\displaystyle=\sum\limits_{r=1}^{m}\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w_{r}},\frac{dw_{r}(t)}{dt}\rangle+\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{q}\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\frac{d\tilde{w}_{rk}(t)}{dt}\rangle$
		$\displaystyle=-\sum\limits_{r=1}^{m}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w_{r}},\frac{\partial s^{t}(u_{i_{1}})(y_{j_{1}})}{\partial w_{r}}\rangle s^{t}(u_{i_{1}})(y_{j_{1}})-\sum\limits_{r=1}^{m}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w_{r}},\frac{\partial h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}})}{\partial w_{r}}\rangle h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}})$
		$\displaystyle-\sum\limits_{r,k=1,1}^{m,p}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\frac{\partial s^{t}(u_{i_{1}})(y_{j_{1}})}{\partial\tilde{w}_{rk}}\rangle s^{t}(u_{i_{1}})(y_{j_{1}})-\sum\limits_{r,k=1,1}^{m,p}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\frac{\partial h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}})}{\partial\tilde{w}_{rk}}\rangle h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}}),$

and

		$\displaystyle\frac{dh^{t}(u_{i})(\tilde{y}_{j})}{dt}$		(21)
		$\displaystyle=\sum\limits_{r=1}^{m}\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j})}{\partial w_{r}},\frac{dw_{r}(t)}{dt}\rangle+\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{l}\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j})}{\partial\tilde{w}_{rk}},\frac{d\tilde{w}_{rk}(t)}{dt}\rangle$
		$\displaystyle=-\sum\limits_{r=1}^{m}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j})}{\partial w_{r}},\frac{\partial s^{t}(u_{i_{1}})(y_{j_{1}})}{\partial w_{r}}\rangle s^{t}(u_{i_{1}})(y_{j_{1}})-\sum\limits_{r=1}^{m}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j})}{\partial w_{r}},\frac{\partial h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}})}{\partial w_{r}}\rangle h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}})$
		$\displaystyle-\sum\limits_{r,k=1,1}^{m,p}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j})}{\partial\tilde{w}_{rk}},\frac{\partial s^{t}(u_{i_{1}})(y_{j_{1}})}{\partial\tilde{w}_{rk}}\rangle s^{t}(u_{i_{1}})(y_{j_{1}})-\sum\limits_{r,k=1,1}^{m,p}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j})}{\partial\tilde{w}_{rk}},\frac{\partial h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}})}{\partial\tilde{w}_{rk}}\rangle h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}}).$

Let $G^{t}(u)=((s(u_{1}),h(u_{1})),\cdots,(s(u_{n_{1}}),h(u_{n_{1}})))$ and $z=((f,g),\cdots,(f,g))$ , then

\frac{d}{dt}G^{t}(u)=(H(t)+\tilde{H}(t))(z-G^{t}(u)),

where $H(t),\tilde{H}(t)\in\mathbb{R}^{n_{1}(n_{2}+n_{3})\times n_{1}(n_{2}+n_{3})}$ are Gram matrices at time $t$ . We can divide them into $n_{1}\times n_{1}$ blocks and each block is a matrix in $\mathbb{R}^{(n_{2}+n_{3})\times(n_{2}+n_{3})}$ . Specifically, the $(i,j)$ -th block of $H(t)$ and $\tilde{H}(t)$ are $H_{i,j}(t):=D_{i}(t)^{T}D_{j}(t)$ and $\tilde{H}_{i,j}(t):=\tilde{D}_{i}(t)^{T}\tilde{D}_{j}(t)$ respectively, where

D_{i}(t)=\left[\frac{\partial s^{t}(u_{i})(y_{1})}{\partial w},\cdots,\frac{\partial s^{t}(u_{i})(y_{n_{2}})}{\partial w},\frac{\partial h^{t}(u_{i})(\tilde{y}_{1})}{\partial w},\cdots,\frac{\partial h^{t}(u_{i})(\tilde{y}_{n_{3}})}{\partial w}\right]

and

\tilde{D}_{i}(t)=\left[\frac{\partial s^{t}(u_{i})(y_{1})}{\partial\tilde{w}},\cdots,\frac{\partial s^{t}(u_{i})(y_{n_{2}})}{\partial\tilde{w}},\frac{\partial h^{t}(u_{i})(\tilde{y}_{1})}{\partial\tilde{w}},\cdots,\frac{\partial h^{t}(u_{i})(\tilde{y}_{n_{3}})}{\partial\tilde{w}}\right].

Recall that

\frac{\partial s(u)(y)}{\partial w_{r}}=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u)\right]\frac{1}{\sqrt{n_{2}}}\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y))}{\partial w_{r}},

\frac{\partial s(u)(y)}{\partial\tilde{w}_{rk}}=\frac{1}{\sqrt{m}}\frac{\tilde{a}_{rk}}{\sqrt{p}}uI\{\tilde{w}_{rk}^{T}u\geq 0\}\frac{\mathcal{L}(\sigma_{3}(w_{r}^{T}y))}{\sqrt{n_{2}}},

\frac{\partial h(u)(\tilde{y})}{\partial w_{r}}=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u)\right]\frac{1}{\sqrt{n_{3}}}\frac{\partial(\sigma_{3}(w_{r}^{T}\tilde{y}))}{\partial w_{r}},

and

\frac{\partial h(u)(\tilde{y})}{\partial\tilde{w}_{rk}}=\frac{1}{\sqrt{m}}\frac{\tilde{a}_{rk}}{\sqrt{p}}I\{\tilde{w}_{rk}^{T}u\geq 0\}\frac{\sigma_{3}(w_{r}^{T}\tilde{y})}{\sqrt{n_{3}}}.

Then, the $(j_{1},j_{2})$ -th ( $j_{1}\in[n_{1}],j_{2}\in[n_{1}]$ ) entry of $H_{i,j}(t)$ is

\frac{1}{m}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{i})\right]\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{j})\right]\frac{1}{n_{2}}\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))}{\partial w_{r}}\right\rangle.

and the $(j_{1},j_{2})$ -th ( $j_{1}\in[n_{1}],j_{2}\in[n_{1}]$ ) entry of $\tilde{H}_{i,j}(t)$ is

\frac{1}{m}\sum\limits_{r=1}^{m}\left[\frac{1}{p}\sum\limits_{k=1}^{p}u_{i}^{T}u_{j}I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}\right]\frac{1}{n_{2}}\mathcal{L}(\sigma(w_{r}^{T}y_{j_{1}}))\mathcal{L}(\sigma(w_{r}^{T}y_{j_{2}})),

where we omit the index $t$ for simplicity.

From the forms of $H(t)$ and $\tilde{H}(t)$ , we can derive the corresponding Gram matrices that are induced by the random initialization, which we denote by $H^{\infty}$ and $\tilde{H}^{\infty}$ , respectively. Specifically, $H^{\infty}$ is a Kronecker product of $H_{1}^{\infty}$ and $H_{2}^{\infty}$ , where $H_{1}^{\infty}\in\mathbb{R}^{n_{1}\times n_{1}}$ , $H_{1}^{\infty}\in\mathbb{R}^{(n_{2}+n_{3})\times(n_{2}+n_{3})}$ , the $(i,j)$ -th entry of $H^{\infty}$ is $\mathbb{E}[\sigma(\tilde{w}^{T}u_{i})\sigma(\tilde{w}^{T}u_{j})]$ , the $(j_{1},j_{2})$ -th entry of $H_{2}^{\infty}$ is

\frac{1}{n_{2}}\mathbb{E}\left[\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w^{T}y_{j_{1}}))}{\partial w},\frac{\partial\mathcal{L}(\sigma_{3}(w^{T}y_{j_{2}}))}{\partial w}\right\rangle\right].

And $\tilde{H}^{\infty}$ is a Kronecker product of $\tilde{H}_{1}^{\infty}$ and $\tilde{H}_{2}^{\infty}$ , where $\tilde{H}_{1}^{\infty}\in\mathbb{R}^{n_{1}\times n_{1}}$ , $\tilde{H}_{1}^{\infty}\in\mathbb{R}^{(n_{2}+n_{3})\times(n_{2}+n_{3})}$ , the $(i,j)$ -th entry of $\tilde{H}^{\infty}$ is $\mathbb{E}[u_{i}^{T}u_{j}I\{\tilde{w}^{T}u_{i}\geq 0,\tilde{w}^{T}u_{j}\geq 0\}]$ , the $(j_{1},j_{2})$ -th entry of $\tilde{H}_{2}^{\infty}$ is

\frac{1}{n_{2}}\mathbb{E}[\mathcal{L}(\sigma_{3}(w^{T}y_{j_{1}}))\mathcal{L}(\sigma_{3}(w^{T}y_{j_{2}}))].

The Gram matrices play important roles in the convergence analysis. Similar to the setting of Section 4, we can demonstrate the strict positive definiteness of the Gram matrices under mild conditions.

Lemma 6.

If no two samples in $\{u_{i}\}_{i=1}^{n_{1}}$ are parallel and no two samples in $\{y_{j}\}_{j=1}^{n_{2}}\cup\{\tilde{y}_{j}\}_{j=1}^{n_{3}}$ are parallel, then $H^{\infty}$ and $\tilde{H}^{\infty}$ are both strictly positive definite. We denote their least eigenvalues by $\lambda_{0}$ and $\tilde{\lambda}_{0}$ , respectively.

Similar to Section 4, the convergence of gradient descent relies on the stability of the Gram matrices, which is demonstrated by the following two lemmas.

Lemma 7.

If $m=\Omega\left(\frac{d^{4}n_{1}^{2}}{\min\{\lambda_{0}^{2},\tilde{\lambda}_{0}^{2}\}}\log^{2}\left(\frac{n_{1}(n_{2}+n_{3})}{\delta}\right)\right)$ , we have with probability at least $1-\delta$ , $\|H(0)-H^{\infty}\|_{2}\leq\frac{\lambda_{0}}{4}$ , $\|\tilde{H}(0)-\tilde{H}^{\infty}\|_{2}\leq\frac{\tilde{\lambda}_{0}}{4}$ and $\lambda_{min}(H(0))\geq\frac{3}{4}\lambda_{0}$ , $\lambda_{min}(\tilde{H}(0))\geq\frac{3}{4}\tilde{\lambda}_{0}$ .

Lemma 8.

	$\displaystyle\\|H-H(0)\\|_{F}$	$\displaystyle\lesssim n_{1}d\log\left(\frac{m}{\delta}\right)\sqrt{\log\left(\frac{m(n_{2}+n_{3})}{\delta}\right)}R\log\left(\frac{mn_{1}}{\delta}\right)$		(22)
		$\displaystyle\quad+n_{1}d\log\left(\frac{m}{\delta}\right)\log\left(\frac{m(n_{2}+n_{3})}{\delta}\right)\left(p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\left(\frac{mn_{1}}{\delta}\right)}\right)$		(22)

and with probability at least $1-\delta-n_{1}exp(-mp\tilde{R})$ ,

\|\tilde{H}-\tilde{H}(0)\|_{F}\lesssim n_{1}d\log\left(\frac{m}{\delta}\right)\sqrt{\log\left(\frac{m(n_{2}+n_{3})}{\delta}\right)}\tilde{R}+n_{1}\left(d\log\left(\frac{m}{\delta}\right)\right)^{\frac{3}{2}}\log\left(\frac{m(n_{2}+n_{3})}{\delta}\right)R.

(23)

Similar to training neural operators, we can derive the training dynamics of physics-informed neural operators.

Lemma 9.

For all $t\in\mathbb{N}$ , we have

G^{t+1}(u)=[I-\eta(H(t)+\tilde{H}(t))]G^{t}(u)+I(t),

where $I(t)\in\mathbb{R}^{n_{1}(n_{2}+n_{3})}$ , $I(t)$ can be divided into $n_{1}$ blocks, where each block is an $(n_{2}+n_{3})$ dimensional vector. The $j_{1}$ -th ( $j_{1}\in[n_{2}]$ ) component of $i$ -th block is

s^{t+1}(u_{i})(y_{j_{1}})-s^{t}(u_{i})(y_{j_{1}})-\left\langle\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle,

The $n_{2}+j_{2}$ -th ( $j_{1}\in[n_{3}]$ ) component of $i$ -th block is

h^{t+1}(u_{i})(\tilde{y}_{j_{2}})-h^{t}(u_{i})(\tilde{y}_{j_{2}})-\left\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle.

With these preparations in place, we can now arrive at the final convergence theorem.

Theorem 3.

If we set $\eta=\mathcal{O}\left(\frac{1}{\|H^{\infty}\|_{2}+\|\tilde{H}^{\infty}\|_{2}}\right)$ , then with probability at least $1-\delta$ , we have

\|G^{t}(u)\|_{2}^{2}\leq\left(1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}\right)^{t}\|G^{0}(u)\|_{2}^{2},

where

m=\tilde{\Omega}\left(\frac{n_{1}^{4}d^{7}}{(\lambda_{0}+\tilde{\lambda}_{0})^{2}min(\lambda_{0}^{2},\tilde{\lambda}_{0}^{2})}\right)

and $\tilde{\Omega}$ indicates that some terms involving $\log(n_{1})$ , $\log(n_{2})$ and $\log(m)$ are omitted.

We prove Theorem 3 by induction. Our induction hypothesis is just the following condition:

Condition 2.

At the $s$ -th iteration, we have

\|G^{s}(u)\|_{2}^{2}\leq\left(1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}\right)^{s}\|G^{0}(u)\|_{2}^{2}

and $\|w_{r}(s)\|_{2}\leq B_{1}$ , $|w_{r}(s)^{T}y_{j}|\leq B_{2}$ and $|w_{r}(s)^{T}\tilde{y}_{j_{1}}|\leq B_{2}$ holds for all $r\in[m],j\ in[n_{2}],j_{1}\in[n_{3}]$ , where

B_{1}=2\sqrt{d\log(m/\delta)},\ B_{2}=2\sqrt{\log(m(n_{2}+n_{3})/\delta)}.

This condition directly yields the following bound of deviation from the initialization.

Corollary 2.

If Condtion 2 holds for $s=0,\cdots,T$ , then we have that

\|\tilde{w}_{rk}(T+1)-\tilde{w}_{rk}(0)\|_{2}\lesssim\frac{\sqrt{n_{1}}B_{1}^{2}B_{2}\|G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}

and

\|w_{r}(T+1)-w_{r}(0)\|_{2}\lesssim\frac{\sqrt{n_{1}}B_{1}B_{2}\|G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}

holds for all $r\in[m],k\in[p]$ .

Lemma 10.

If Condtion 2 holds for $s=0,\cdots,T$ , then we have that

\|I(s)\|_{2}\lesssim\frac{\eta(n_{1})^{3/2}B_{1}^{6}B_{2}^{3}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{3/2}\left(\frac{mn_{1}}{\delta}\right)\|G^{t}(u)\|_{2}.

holds for $s=0,\cdots,T-1$ .

6. Conclusion and Future Work

In this paper, we have analyzed the convergence of gradient descent (GD) in training wide shallow neural operators within the framework of NTK, demenstrating the linear convergence of GD. The core idea is that over-parameterization ensures that all weights are close to their initializations for all iterations, which is similar to performing a certain kernel method. There are some future works. Firstly, the extension of our theory to other neural operators, like FNO. The main difficulty could be that how to meet the requirements of the NTK theory. Secondly, the extension to DeepONets, which we think may be similar to the extension from the results in [17] to [18].

References

[1] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” Journal of Computational physics, vol. 378, pp. 686–707, 2019.
[2] B. Yu et al., “The deep ritz method: a deep learning-based numerical algorithm for solving variational problems,” Communications in Mathematics and Statistics, vol. 6, no. 1, pp. 1–12, 2018.
[3] Z. Shen, H. Yang, and S. Zhang, “Optimal approximation rate of relu networks in terms of width and depth,” Journal de Mathématiques Pures et Appliquées, vol. 157, pp. 101–135, 2022.
[4] J. Lu, Z. Shen, H. Yang, and S. Zhang, “Deep network approximation for smooth functions,” SIAM Journal on Mathematical Analysis, vol. 53, no. 5, pp. 5465–5506, 2021.
[5] D. Yarotsky, “Error bounds for approximations with deep relu networks,” Neural networks, vol. 94, pp. 103–114, 2017.
[6] K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart, “Model reduction and neural networks for parametric pdes,” The SMAI journal of computational mathematics, vol. 7, pp. 121–157, 2021.
[7] T. Chen and H. Chen, “Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems,” IEEE transactions on neural networks, vol. 6, no. 4, pp. 911–917, 1995.
[8] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis, “Learning nonlinear operators via deeponet based on the universal approximation theorem of operators,” Nature machine intelligence, vol. 3, no. 3, pp. 218–229, 2021.
[9] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar, “Fourier neural operator for parametric partial differential equations,” arXiv preprint arXiv:2010.08895, 2020.
[10] N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar, “Neural operator: Learning maps between function spaces with applications to pdes,” Journal of Machine Learning Research, vol. 24, no. 89, pp. 1–97, 2023.
[11] N. Kovachki, S. Lanthaler, and S. Mishra, “On universal approximation and error bounds for fourier neural operators,” Journal of Machine Learning Research, vol. 22, no. 290, pp. 1–76, 2021.
[12] N. B. Kovachki, S. Lanthaler, and A. M. Stuart, “Operator learning: Algorithms and analysis,” arXiv preprint arXiv:2402.15715, 2024.
[13] S. Lanthaler, S. Mishra, and G. E. Karniadakis, “Error estimates for deeponets: A deep learning framework in infinite dimensions,” Transactions of Mathematics and Its Applications, vol. 6, no. 1, p. tnac001, 2022.
[14] H. Liu, H. Yang, M. Chen, T. Zhao, and W. Liao, “Deep nonparametric estimation of operators between infinite dimensional spaces,” Journal of Machine Learning Research, vol. 25, no. 24, pp. 1–67, 2024.
[15] B. Shrimali, A. Banerjee, and P. Cisneros-Velarde, “Optimization for neural operator learning: Wider networks are better.”
[16] S. Wang, H. Wang, and P. Perdikaris, “Improved architectures and training algorithms for deep operator networks,” Journal of Scientific Computing, vol. 92, no. 2, p. 35, 2022.
[17] S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over-parameterized neural networks,” arXiv preprint arXiv:1810.02054, 2018.
[18] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep neural networks,” in International conference on machine learning. PMLR, 2019, pp. 1675–1685.
[19] J. He, L. Li, J. Xu, and C. Zheng, “Relu deep neural networks and linear finite elements,” arXiv preprint arXiv:1807.03973, 2018.
[20] Y. Gao, Y. Gu, and M. Ng, “Gradient descent finds the global optima of two-layer physics-informed neural networks,” in International Conference on Machine Learning. PMLR, 2023, pp. 10 676–10 707.
[21] E. Giné and R. Nickl, Mathematical foundations of infinite-dimensional statistical models. Cambridge university press, 2016, vol. 40.
[22] A. K. Kuchibhotla and A. Chakrabortty, “Moving beyond sub-gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression,” Information and Inference: A Journal of the IMA, vol. 11, no. 4, pp. 1389–1456, 2022.
[23] A. W. Van Der Vaart, J. A. Wellner, A. W. van der Vaart, and J. A. Wellner, Weak convergence. Springer, 1996.

Appendix

Before the proofs, we first define the events

A_{jr}:=\{\exists w:\|w-w_{r}(0)\|_{2}\leq R,I\{w^{T}y_{j}\geq 0\}\neq I\{w_{r}(0)^{T}y_{j}\geq 0\}\}

(24)

and

\tilde{A}_{rk}^{i}:=\{\exists w:\|w-w_{rk}(0)\|_{2}\leq\tilde{R},I\{w^{T}u\geq 0\}\neq I\{w_{rk}(0)^{T}u_{i}\geq 0\}\}

(25)

for all $i\in[n_{1}],j\in[n_{2}],r\in[m],p\in[k]$ .

Note that the event $A_{jr}$ happens if and only if $|w_{r}(0)^{T}y_{j}|<\|y_{j}\|_{2}R$ , thus by the anti-concentration inequality of Gaussian distribution (Lemma 10), we have

P(A_{ir})=P_{z\sim\mathcal{N}(0,\|y_{j}\|_{2}^{2})}(|z|<\|y_{j}\|_{2}R)=P_{z\sim\mathcal{N}(0,1)}(|z|<R)\lesssim R.

(26)

Similarly, we have $P(\tilde{A}_{rk}^{i})\lesssim\tilde{R}.$

Moreover, we let $S_{j}:=\{r\in[m]:I\{A_{jr}\}=0\}$ , $S_{j}^{\perp}:=[m]\backslash S_{j}$ and $\tilde{S}_{i}:=\{(r,k)\in[m]\times[p]:I\{\tilde{A}_{rk}^{i}\}=0\}$ , $\tilde{S}^{\perp}_{i}:=[m]\times[p]\backslash\tilde{S}_{i}$ .

7. Proof of Continuous Time Analysis

7.1. Proof of Lemma 1

Proof.

First, recall that $H^{\infty}$ is a Kronecker product of $H_{1}^{\infty}$ and $H_{2}^{\infty}$ . The $(i,j)$ -th entry of $H_{1}^{\infty}$ is $\mathbb{E}[\sigma(\tilde{w}^{T}u_{i})\sigma(\tilde{w}^{T}u_{j})]$ and the $(i_{1},j_{1})$ -th entry of $H_{2}^{\infty}$ is $\mathbb{E}[y_{i_{1}}^{T}y_{j_{1}}I\{w^{T}y_{i_{1}}\geq 0,w^{T}y_{j_{1}}\geq 0\}]$ . As we know, the Kronecker product of two strictly positive definite matrices is also strictly positive definite. Thus, it suffices to demonstrate that $H_{1}^{\infty}$ and $H_{2}^{\infty}$ are both strictly positive definite.

The proof relies relies on standard functional analysis. Let $\mathcal{H}$ be the Hilbert space of integrable $d$ -dimensional vector fields on $\mathbb{R}^{d}$ , i.e., $f\in\mathcal{H}$ if $\mathbb{E}_{w\sim\mathcal{N}(\bm{0},\bm{I})}[\|f(w)\|_{2}^{2}]<\infty.$ Then the inner product of this Hilbert space is $\langle f,g\rangle_{\mathcal{H}}=\mathbb{E}_{w\sim\mathcal{N}(\bm{0},\bm{I})}[f(w)^{T}g(w)]$ for $f,g\in\mathcal{H}$ . With these preparations in place, to prove $H_{2}^{\infty}$ is strictlt positive definite, it is equivalent to show $\psi(y_{1}),\cdots,\psi(y_{n_{2}})\in\mathcal{H}$ are linearly independent, where $\psi(y_{j})=y_{j}I\{w^{T}y_{j}\geq 0\}$ for $j\in[n_{2}]$ . This is exactly the result of Therem 3.1 in [17]. Similarly, Theorem 2.1 in [19] shows that $\{\sigma(\tilde{w}^{T}u_{i})\}_{i=1}^{n_{1}}$ are linearly independent if no two samples in $\{u_{i}\}_{i=1}^{n_{1}}$ are parallel. Thus it can be directly deduced that $H^{\infty}$ is strictly positive definite. Similarly, we can deduce that $\tilde{H}^{\infty}$ is also strictly positive definite.

As for other activation functions, such as $\text{ReLU}^{p}$ or smooth activation functions, similar conclusions hold true. For specific details, refer to [18] and [20].

∎

7.2. Proof of Lemma 2

Proof.

First, let

X_{r}=\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{j})\right]y_{i_{1}}^{T}y_{j_{1}}I\{w_{r}(0)^{T}y_{i_{1}}\geq 0,w_{r}(0)^{T}y_{j_{1}}\geq 0\},

then $|H_{i,j}^{i_{1},j_{1}}(0)-H_{i,j}^{i_{1},j_{1},\infty}|=\left|\frac{1}{m}\sum\limits_{r=1}^{m}X_{r}-\mathbb{E}[X_{1}]\right|$ .

Note that Lemma 13 implies that for all $i\in[n_{1}]$ , $\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right\|_{\psi_{2}}=\mathcal{O}(1)$ , which yields that

\|X_{r}\|_{\psi_{1}}\lesssim\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right\|_{\psi_{2}}\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{j})\right\|_{\psi_{2}}=\mathcal{O}(1).

Thus, by applying Lemma 12, we can deduce for fixed $(i,j)$ and $(i_{1},j_{1})$ , with probability at least $1-\delta$ ,

|H_{i,j}^{i_{1},j_{1}}(0)-H_{i,j}^{i_{1},j_{1},\infty}|\lesssim\sqrt{\frac{\log(\frac{1}{\delta})}{m}}+\frac{\log(\frac{1}{\delta})}{m}.

Taking a union bound yields that with probability at least $1-\delta$ ,

	$\displaystyle\\|H(0)-H^{\infty}\\|_{F}^{2}$	$\displaystyle=\sum\limits_{i,j=1}^{n_{1}}\sum\limits_{i_{1},j_{1}=1}^{n_{2}}\|H_{i,j}^{i_{1},j_{1}}(0)-H_{i,j}^{i_{1},j_{1},\infty}\|^{2}$
		$\displaystyle\lesssim n_{1}^{2}n_{2}^{2}\left(\sqrt{\frac{\log(\frac{n_{1}n_{2}}{\delta})}{m}}+\frac{\log(\frac{n_{1}n_{2}}{\delta})}{m}\right)^{2}$
		$\displaystyle\lesssim\frac{n_{1}^{2}n_{2}^{2}}{m}\log(\frac{n_{1}n_{2}}{\delta}).$

Thus, if $m=\Omega(\frac{n_{1}^{2}n_{2}^{2}\log(n_{1}n_{2}/\delta)}{\lambda_{0}^{2}})$ , we have $\|H(0)-H^{\infty}\|_{2}\leq\|H(0)-H^{\infty}\|_{F}\leq\lambda_{0}/4$ , resulting in that $\lambda_{min}(H(0))\geq 3\lambda_{0}/4$ .

On the other hand,

\left\|\frac{1}{p}\sum\limits_{k=1}^{p}u_{i}^{T}u_{j}I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}\sigma(w_{r}^{T}y_{i})\sigma(w_{r}^{T}y_{j})\right\|_{\psi_{1}}\leq\|\sigma(w_{r}^{T}y_{i_{1}})\sigma(w_{r}^{T}y_{j_{1}})\|_{\psi_{1}}=\mathcal{O}(1).

Similarly, applying Lemma 12 yields that with probability at least $1-\delta$ ,

\|\tilde{H}(0)-\tilde{H}^{\infty}\|_{2}\leq\tilde{H}(0)-\tilde{H}^{\infty}\|_{F}\lesssim n_{1}n_{2}\sqrt{\frac{\log(\frac{n_{1}n_{2}}{\delta})}{m}},

which leads to the desired conclusion.

∎

7.3. Proof of Lemma 3

Proof.

First, for $H$ , recall that the $(i_{1},j_{1})$ -th entry of $(i,j)$ -th block is

\frac{1}{m}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{i})\right]\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{j})\right]y_{i_{1}}^{T}y_{j_{1}}I\{w_{r}^{T}y_{i_{1}}\geq 0,w_{r}^{T}y_{j_{1}}\geq 0\}.

We let

a=\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{i}),b=\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{j}),c=y_{i_{1}}^{T}y_{j_{1}}I\{w_{r}^{T}y_{i_{1}}\geq 0,w_{r}^{T}y_{j_{1}}\geq 0\}

and let $a(0),b(0),c(0)$ be the initialized parts corresponding to $a,b,c$ respectively.

Note that we can decompose $abc-a(0)b(0)c(0)$ as follows:

abc-a(0)b(0)c(0)=(ab-a(0)b(0))c+a(0)b(0)(c-c(0)).

For the first part $(ab-a(0)b(0))c$ , from the bounedness of $c$ and $c(0)$ , we have

|ab-a(0)b(0)|\leq|a-a(0)||b|+|b-b(0)||a(0)|\leq|a-a(0)||b-b(0)|+|a-a(0)||b(0)|+|b-b(0)||a(0)|.

For $a-a(0)$ and $b-b(0)$ , we have

\left|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{i})-\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right|\leq\sqrt{p}\tilde{R},

i.e., $|a-a(0)|,|b-b(0)|\lesssim\sqrt{p}\tilde{R}$ . Moreover, Lemma 13 shows that $|a(0)|,|b(0)|\lesssim\sqrt{\log(mn_{1}/\delta)}$ . Thus, combining these facts yields that

|(ab-a(0)b(0))c|\lesssim p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}.

(27)

For the second part $a(0)b(0)(c-c(0))$ , note that

		$\displaystyle\left\|I\{w_{r}^{T}y_{i_{1}}\geq 0,w_{r}^{T}y_{j_{1}}\geq 0\}-I\{w_{r}(0)^{T}y_{i_{1}}\geq 0,w_{r}(0)^{T}y_{j_{1}}\geq 0\}\right\|$		(28)
		$\displaystyle\leq\left\|I\{w_{r}^{T}y_{i_{1}}\geq 0\}-I\{w_{r}(0)^{T}y_{i_{1}}\geq 0\}\right\|+\left\|I\{w_{r}^{T}y_{j_{1}}\geq 0\}-I\{w_{r}(0)^{T}y_{j_{1}}\geq 0\}\right\|$
		$\displaystyle\leq I\{A_{i_{1},r}\}+I\{A_{j_{1},r}\}.$

From the Bernstein inequality (Lemma 11), we have that with probability at least $1-n_{2}exp(-mR)$ ,

\frac{1}{m}\sum\limits_{r=1}^{m}I\{A_{i,r}\}\lesssim R

(29)

holds for any $i\in[n_{1}]$ .

Therefore, we can deduce that

		$\displaystyle\|H_{i,j}^{i_{1},j_{1}}-H_{i,j}^{i_{1},j_{1}}(0)\|$		(30)
		$\displaystyle\lesssim p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{m}{\delta}\right)}+\frac{1}{m}\sum\limits_{r=1}^{m}\log\left(\frac{m}{\delta}\right)(I\{A_{i_{1},r}\}+I\{A_{j_{1},r}\})$
		$\displaystyle\lesssim p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}+R\log\left(\frac{mn_{1}}{\delta}\right).$

Summing $i,j,i_{1},j_{1}$ yields that

	$\displaystyle\\|H-H(0)\\|_{F}$	$\displaystyle=\sqrt{\sum\limits_{i,j=1}^{n_{1}}\sum\limits_{i_{1},j_{1}=1}^{n_{2}}\|H_{i,j}^{i_{1},j_{1}}-H_{i,j}^{i_{1},j_{1}}(0)\|^{2}}$		(31)
		$\displaystyle\lesssim n_{1}n_{2}p\tilde{R}^{2}+n_{1}n_{2}\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}+n_{1}n_{2}R\log\left(\frac{mn_{1}}{\delta}\right)$		(31)

holds with probability at least $1-\delta-n_{2}exp(-mR)$ .

Second, for $\tilde{H}$ , recall that $(i_{1},j_{1})$ -th entry of $(i,j)$ -th block is

\frac{1}{m}\frac{1}{p}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}u_{i}^{T}u_{j}I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}\sigma(w_{r}^{T}y_{i_{1}})\sigma(w_{r}^{T}y_{j_{1}}).

Let $a=\sigma(w_{r}^{T}y_{i_{1}}),b=\sigma(w_{r}^{T}y_{j_{1}}),c=u_{i}^{T}u_{j}I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}$ and $a(0),b(0),c(0)$ be the corresponding initialized parts.

Similarly, we decompose $abc-a(0)b(0)c(0)$ as follows:

abc-a(0)b(0)c(0)=(ab-a(0)b(0))c+a(0)b(0)(c-c(0)).

For the first part $(ab-a(0)b(0))c$ , we have

		$\displaystyle\|\sigma(w_{r}^{T}y_{i_{1}})\sigma(w_{r}^{T}y_{j_{1}})-\sigma(w_{r}(0)^{T}y_{i_{1}})\sigma(w_{r}(0)^{T}y_{j_{1}})\|$		(32)
		$\displaystyle=\|[(\sigma(w_{r}^{T}y_{i_{1}})-\sigma(w_{r}(0)^{T}y_{i_{1}}))]\sigma(w_{r}^{T}y_{j_{1}})+\sigma(w_{r}(0)^{T}y_{i_{1}})[\sigma(w_{r}^{T}y_{j_{1}})-\sigma(w_{r}(0)^{T}y_{j_{1}})]\|$
		$\displaystyle\lesssim R(\|\sigma(w_{r}^{T}y_{j_{1}})\|+\|\sigma(w_{r}(0)^{T}y_{i_{1}})\|)$
		$\displaystyle\lesssim R(\|\sigma(w_{r}(0)^{T}y_{j_{1}})\|+\|\sigma(w_{r}(0)^{T}y_{i_{1}})\|+R),$

thus we have

|(ab-a(0)b(0))c|\lesssim\frac{1}{m}\sum\limits_{r=1}^{m}R(|\sigma(w_{r}(0)^{T}y_{j_{1}})|+|\sigma(w_{r}(0)^{T}y_{i_{1}})|)+R^{2}.

(33)

Note that $\left\|\sigma(w_{r}(0)^{T}y_{j})\right\|_{\psi_{2}}=\mathcal{O}(1)$ for all $j\in[n_{2}]$ , then applying Lemma 12 yields that with probability at least $1-\delta$ ,

\frac{1}{m}\sum\limits_{r=1}^{m}|\sigma(w_{r}(0)^{T}y_{j})|\lesssim\mathbb{E}[|\sigma(w_{1}(0)^{T}y_{j})|]+\sqrt{\log\left(\frac{n_{2}}{\delta}\right)}\lesssim\sqrt{\log\left(\frac{n_{2}}{\delta}\right)}

holds for all $j\in[n_{2}]$ .

For the second part $a(0)b(0)(c-c(0))$ , we cannot directly apply the Bernstein inequality. Instead, we first truncate $|\sigma(w_{r}(0)^{T}y_{i_{1}})\sigma(w_{r}(0)^{T}y_{j_{1}})|$ . Note that for $w_{r}(0)^{T}y_{j}$ , we have $P(|w_{r}(0)^{T}y_{j}|>\|y_{j}\|_{2}t)\leq 2e^{-t^{2}/2}$ , i.e., with probability at least $1-\delta$ , $|w_{r}(0)^{T}y_{j}|\leq\sqrt{2\log(\frac{2}{\delta})}$ . Thus, taking a union bound yields that with probability at least $1-\delta$ ,

|\sigma(w_{r}(0)^{T}y_{j})|\leq|w_{r}(0)^{T}y_{j}|\lesssim\sqrt{\log\left(\frac{mn_{2}}{\delta}\right)}:=M

holds for any $r\in[m],j\in[n_{2}]$ .

Therefore, under this event,

		$\displaystyle\|\sigma(w_{r}(0)^{T}y_{i_{1}})\sigma(w_{r}(0)^{T}y_{j_{1}})(I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\})\|$		(34)
		$\displaystyle\lesssim M^{2}\|I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}\|$
		$\displaystyle\leq M^{2}\|I\{\tilde{w}_{rk}^{T}u_{i}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0\}\|+M^{2}\|I\{\tilde{w}_{rk}^{T}u_{i}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0\}\|$
		$\displaystyle\leq M^{2}(I\{\tilde{A}_{rk}^{i}\}+I\{\tilde{A}_{rk}^{j}\}).$

From the Bernstein inequality, we have that with probability at least $1-n_{1}exp(-mp\tilde{R})$ ,

\frac{1}{m}\frac{1}{p}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}I\{\tilde{A}_{rk}^{i}\}\lesssim\tilde{R}.

Thus, with probability at least $1-\delta-n_{1}exp(-mp\tilde{R})$ ,

\displaystyle|\tilde{H}_{i,j}^{i_{1},j_{1}}-\tilde{H}_{i,j}^{i_{1},j_{1}}(0)|

\displaystyle\lesssim R\sqrt{\log\left(\frac{n_{2}}{\delta}\right)}+\tilde{R}\log\left(\frac{mn_{2}}{\delta}\right).

Summing $i,j,i_{1},j_{1}$ yields that

	$\displaystyle\\|\tilde{H}-\tilde{H}(0)\\|_{F}$	$\displaystyle=\sqrt{\sum\limits_{i,j=1}^{n_{1}}\sum\limits_{i_{1},j_{1}=1}^{n_{2}}\|\tilde{H}_{i,j}^{i_{1},j_{1}}-\tilde{H}_{i,j}^{i_{1},j_{1}}(0)\|^{2}}$		(35)
		$\displaystyle\lesssim n_{1}n_{2}R\sqrt{\log\left(\frac{n_{2}}{\delta}\right)}+n_{1}n_{2}\tilde{R}\log\left(\frac{mn_{2}}{\delta}\right)$		(35)

holds with probability at least $1-\delta-n_{1}exp(-mp\tilde{R})$ .

∎

7.4. Proof of Theorem 1

The proof of Theorem 1 consists of the following Lemma 6, Lemma 7, Lemma 8 and Lemma 9. First, we assume that the following lemmas are considered in the setting of events in Lemma 13, Lemma 14 and $\{|w_{r}(0)^{T}y_{j}|\leq B,\forall j\in[n_{2}],\forall r\in[m]\}$ , where $B=2\sqrt{\log{(mn/\delta)}}$ .

Lemma 11.

If for $0\leq s\leq t$ , $\lambda_{min}(H(s))\geq\frac{\lambda_{0}}{2}$ , $\lambda_{min}(\tilde{H}(s))\geq\frac{\tilde{\lambda}_{0}}{2}$ , then we have

\|z-G^{t}(u)\|_{2}^{2}\leq exp(-(\lambda_{0}+\tilde{\lambda}_{0})t)\|z-G^{0}(u)\|_{2}^{2}.

Proof.

From the conditions $\lambda_{min}(H(s))\geq\frac{\lambda_{0}}{2}$ and $\lambda_{min}(\tilde{H}(s))\geq\frac{\tilde{\lambda}_{0}}{2}$ , we can deduce that

	$\displaystyle\frac{d}{dt}\\|z-G^{t}(u)\\|_{2}^{2}$	$\displaystyle=-2(z-G^{t}(u))^{T}(H(t)+\tilde{H}(t))(z-G^{t}(u))$
		$\displaystyle\leq-(\lambda_{0}+\tilde{\lambda}_{0})\\|z-G^{t}(u)\\|_{2}^{2}.$

From this, we have

\frac{d}{dt}\left(exp((\lambda_{0}+\tilde{\lambda}_{0})t)\|z-G^{t}(u)\|_{2}^{2}\right)\leq 0,

which yields that

exp((\lambda_{0}+\tilde{\lambda}_{0})t)\|z-G^{t}(u)\|_{2}^{2}\leq\|z-G^{0}(u)\|_{2}^{2},

i.e.,

\|z-G^{t}(u)\|_{2}^{2}\leq exp(-(\lambda_{0}+\tilde{\lambda}_{0})t)\|z-G^{0}(u)\|_{2}^{2}.

∎

Lemma 12.

Suppose for $0\leq s\leq t$ , $\lambda_{min}(H(s))\geq\frac{\lambda_{0}}{2}$ , $\lambda_{min}(\tilde{H}(s))\geq\frac{\tilde{\lambda}_{0}}{2}$ and $\|\tilde{w}_{rk}(s)-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R}$ holds for any $r\in[m],k\in[p]$ , then we have that

\|w_{r}(s)-w_{r}(0)\|_{2}\leq\frac{C\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\left(\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right):=R^{{}^{\prime}}

holds for any $r\in[m]$ , where $C$ is a universal constant.

Proof.

For $0\leq s\leq t$ , we have

		$\displaystyle\left\\|\frac{d}{dt}w_{r}(s)\right\\|_{2}$		(36)
		$\displaystyle=\left\\|\frac{\partial L(W(s),\tilde{W}(s))}{\partial w_{r}}\right\\|_{2}$
		$\displaystyle=\left\\|\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\left(\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]y_{j}I\{w_{r}(s)^{T}y_{j}\geq 0\}\right)(G^{s}(u_{i})(y_{j})-z_{i}^{j})\right\\|_{2}$
		$\displaystyle\leq\frac{\sqrt{n_{1}n_{2}}}{\sqrt{m}}\left(\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right)\\|G^{s}(u)-z\\|_{2}$
		$\displaystyle\leq\frac{\sqrt{n_{1}n_{2}}}{\sqrt{m}}\left(\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right)exp(-\frac{(\lambda_{0}+\tilde{\lambda}_{0})s}{2})\\|G^{0}(u)-z\\|_{2},$

where the last inequality follows from Lemma 6 and the first inequality follows from that

	$\displaystyle\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right\|$
	$\displaystyle=\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\left[\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]+\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right\|$
	$\displaystyle\leq\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\left[\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right\|+\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right\|$
	$\displaystyle\lesssim\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}.$

Therefore, we have

\|w_{r}(t)-w_{r}(0)\|_{2}\leq\int_{0}^{t}\left\|\frac{d}{ds}w_{r}(s)\right\|_{2}ds\leq\frac{C\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\left(\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right):=R^{{}^{\prime}},

where $C$ is a universal constant. ∎

Lemma 13.

Suppose for $0\leq s\leq t$ , $\lambda_{min}(H(s))\geq\frac{\lambda_{0}}{2}$ , $\lambda_{min}(\tilde{H}(s))\geq\frac{\tilde{\lambda}_{0}}{2}$ and $\|w_{r}(s)-w_{r}(0)\|_{2}\leq R$ holds for any $r\in[m]$ , then we have that

\|\tilde{w}_{rk}(t)-\tilde{w}_{rk}(0)\|_{2}\leq\frac{C\sqrt{n_{1}n_{2}}(R+B)\|z-G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}:=\tilde{R}^{{}^{\prime}}

holds for any $r\in[m],p\in[k]$ , where $C$ is a universal constant.

Proof.

For $0\leq s\leq t$ , we have

$\displaystyle\left\\|\frac{d}{dt}\tilde{w}_{rk}(s)\right\\|_{2}$	$\displaystyle=\left\\|\frac{\partial L(W(s),\tilde{W}(s))}{\partial\tilde{w}_{rk}}\right\\|_{2}$	(37)
	$\displaystyle=\left\\|\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\frac{1}{\sqrt{m}}\frac{\tilde{a}_{rk}}{\sqrt{p}}u_{i}I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}\sigma(w_{r}(s)^{T}y_{j})(G^{s}(u)(y_{j})-z_{j}^{i})\right\\|_{2}$
	$\displaystyle\lesssim\frac{1}{\sqrt{mp}}\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\left\|\sigma(w_{r}(s)^{T}y_{j})(G^{s}(u_{i})(y_{j})-z_{j}^{i})\right\|$
	$\displaystyle\leq\frac{1}{\sqrt{mp}}\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\left(\left\|\sigma(w_{r}(s)^{T}y_{j})-\sigma(w_{r}(0)^{T}y_{j})\right\|+\left\|\sigma(w_{r}(0)^{T}y_{j})\right\|\right)\left\|G^{s}(u_{i})(y_{j})-z_{j}^{i}\right\|$
	$\displaystyle\leq\frac{1}{\sqrt{mp}}\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}(R+B)\left\|G^{s}(u_{i})(y_{j})-z_{j}^{i}\right\|$
	$\displaystyle\leq\frac{\sqrt{n_{1}n_{2}}(R+B)}{\sqrt{mp}}\\|G^{s}(u)-z\\|_{2}$
	$\displaystyle\leq\frac{\sqrt{n_{1}n_{2}}(R+B)}{\sqrt{mp}}exp(-(\lambda_{0}+\tilde{\lambda}_{0})s/2)\\|z-G^{0}(u)\\|_{2},$

where the last inequality follows from Lemma 6. Then, similar to that in Lemma 7, the conclusion holds.

∎

Lemma 14.

If $R^{{}^{\prime}}<R$ and $\tilde{R}^{{}^{\prime}}<\tilde{R}$ , we have that for all $t\geq 0$ , the following two conclusions hold:

•

$\lambda_{min}(H(t))\geq\frac{\lambda_{0}}{2}$ and $\lambda_{min}(\tilde{H}(t))\geq\frac{\tilde{\lambda}_{0}}{2}$ ;
•

$\|w_{r}(t)-w_{r}(0)\|_{2}\leq R^{{}^{\prime}}$ and $\|\tilde{w}_{rk}(t)-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R}^{{}^{\prime}}$ for any $r\in[m],p\in[k]$ .

Proof.

The proof is based on contradiction. Suppose $t>0$ is the smallest time that the two conclusions do not hold, then either conclusion 1 does not hold or conclusion 2 does not hold.

If conclusion 1 does not hold, i.e., either $\lambda_{min}(H(t))<\frac{\lambda_{0}}{2}$ or $\lambda_{min}(\tilde{H}(t))<\frac{\tilde{\lambda}_{0}}{2}$ , then Lemma 3 implies that there exists $r\in[m]$ , $\|w_{r}(s)-w_{r}(0)\|_{2}>R>R^{{}^{\prime}}$ or there exists $r\in[m],k\in[p]$ , $\|\tilde{w}_{rk}(s)-\tilde{w}_{rk}(0)\|_{2}>\tilde{R}>\tilde{R}^{{}^{\prime}}$ . This fact shows that conlusion 2 does not hold and then, this contradicts with the minimality of $t$ .

If conclusion 2 does not hold, then either there exists $r\in[m]$ , $\|w_{r}(t)-w_{r}(0)\|_{2}>R^{{}^{\prime}}$ or there exists $r\in[m],k\in[p]$ , $\|\tilde{w}_{rk}(t)-\tilde{w}_{rk}(0)\|_{2}>\tilde{R}^{{}^{\prime}}$ . If $\|w_{r}(t)-w_{r}(0)\|_{2}>R^{{}^{\prime}}$ , then Lemma 7 implies that there exist $s<t$ such that $\lambda_{min}(H(s))<\frac{\lambda_{0}}{2}$ or $\lambda_{min}(\tilde{H}(s))<\frac{\tilde{\lambda}_{0}}{2}$ or there exists $r\in[m],k\in[p]$ , $\|\tilde{w}_{rk}(s)-\tilde{w}_{rk}(0)\|_{2}>\tilde{R}^{{}^{\prime}}$ , which contradicts with the minimality of $t$ . And the last case is similar to this case.

∎

Proof of Theorem 1.

Theorem 1 is a direct corollary of Lemma 6 and Lemma 9. Thus, it remains only to clarify the requirements for $m$ so that Lemma 6, Lemma 7 and Lemma 8 hold. First, $R$ and $\tilde{R}$ should ensure that $\|H(0)-H^{\infty}\|_{2}\leq\lambda_{0}/4$ and $\|\tilde{H}(0)-\tilde{H}^{\infty}\|_{2}\leq\tilde{\lambda}_{0}/4$ , i.e.,

R\lesssim\frac{\min(\lambda_{0},\tilde{\lambda}_{0})}{n_{1}n_{2}\log\left(\frac{m}{\delta}\right)},\ \tilde{R}\lesssim\frac{\min(\lambda_{0},\tilde{\lambda}_{0})}{n_{1}n_{2}\sqrt{p}\log\left(\frac{m}{\delta}\right)}.

(38)

Combining this with the requirement that $R^{{}^{\prime}}<R,\tilde{R}^{{}^{\prime}}<\tilde{R}$ , we can deduce that

m=\Omega\left(\frac{n_{1}^{4}n_{2}^{4}\log\left(\frac{n}{\delta}\right)\log^{3}\left(\frac{m}{\delta}\right)}{(min(\lambda_{0},\tilde{\lambda}_{0}))^{2}(\lambda_{0}+\tilde{\lambda}_{0})^{2}}\right).

Moreover, the requirement for $m$ also leads to that

n_{2}exp(-mR)\lesssim\delta,n_{2}exp(-mp\tilde{R})\lesssim\delta,

which are confidences in Lemma 3.

∎

8. Proof of Descrete Time Analysis

8.1. Proof of Lemma 4

Proof.

First, we can decompose $G^{t+1}(u_{i})(y_{j})-G^{t}(u_{i})(y_{j})$ as follows.

		$\displaystyle G^{t+1}(u_{i})(y_{j})-G^{t}(u_{i})(y_{j})$		(39)
		$\displaystyle=G^{t+1}(u_{i})(y_{j})-G^{t}(u_{i})(y_{j})-\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle$
		$\displaystyle\quad+\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle+\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle.$

Note that

		$\displaystyle\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle$		(40)
		$\displaystyle=\sum\limits_{r=1}^{m}\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle$
		$\displaystyle=-\eta\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\sum\limits_{r=1}^{m}\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w_{r}},\frac{\partial G^{t}(u_{i_{1}})(y_{j_{1}})}{\partial w_{r}}\right\rangle(G^{t}(u_{i_{1}})(y_{j_{1}})-z_{j_{1}}^{i_{1}})$

and

		$\displaystyle\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle$		(41)
		$\displaystyle=\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\right\rangle$
		$\displaystyle=-\eta\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\frac{\partial G^{t}(u_{i_{1}})(y_{j_{1}})}{\partial\tilde{w}_{rk}}\right\rangle(G^{t}(u_{i_{1}})(y_{j_{1}})-z_{j_{1}}^{i_{1}}).$

Plugging (33) and (34) into (32 yields that

\displaystyle G^{t+1}(u_{i})(y_{j})-G^{t}(u_{i})(y_{j})

\displaystyle=I_{i,j}(t)-\eta[(H_{1}^{i}(t),\cdots,H_{n_{1}}^{i})+(\tilde{H}_{1}^{i}(t),\cdots,\tilde{H}_{n_{1}}^{i})]_{j}(G^{t}(u)-z),

where $[A]_{j}$ represents the $j$ -th of the matrix $A$ and $I(t)\in\mathbb{R}^{n_{1}n_{2}}$ , we can divide it into $n_{1}$ blocks, the $j$ -th component of $i$ -th block is defined as

I_{i,j}(t)=G^{t+1}(u_{i})(y_{j})-G^{t}(u_{i})(y_{j})-\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial G^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle.

Thus, we have

G^{t+1}(u)-G^{t}(u)=I(t)-\eta(H(t)+\tilde{H}(t))(G^{t}(u)-z).

By using a simple algebraic transformation, we have

	$\displaystyle z-G^{t+1}(u)$	$\displaystyle=z-G^{t}(u)-I(t)-\eta(H(t)+\tilde{H}(t))(z-G^{t}(u))$		(42)
		$\displaystyle=\left(I-\eta(H(t)+\tilde{H}(t))\right)(z-G^{t}(u))-I(t).$		(42)

∎

8.2. Proof of Lemma 5

Proof.

We first express explicitly the $j$ -component of the $i$ -th bloack of the residual term $I(s)$ as follows.

$\displaystyle I_{i,j}(s)$	$\displaystyle=G^{s+1}(u_{i})(y_{j})-G^{s}(u_{i})(y_{j})-\left\langle\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w},w(s+1)-w(s)\right\rangle-\left\langle\frac{\partial G^{s}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(s+1)-\tilde{w}(s)\right\rangle$	(43)
	$\displaystyle=G^{s+1}(u_{i})(y_{j})-G^{s}(u_{i})(y_{j})-\sum\limits_{r=1}^{m}\left\langle\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w_{r}},w_{r}(s+1)-w_{r}(s)\right\rangle$
	$\displaystyle\quad-\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\langle\frac{\partial G^{s}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s)\right\rangle.$

From the forms of $G^{s}(u_{i})(y_{j})$ , $\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w_{r}}$ and $\frac{\partial G^{s}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}}$ , we have

		$\displaystyle G^{s+1}(u_{i})(y_{j})-G^{s}(u_{i})(y_{j})$		(44)
		$\displaystyle=\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})\right]\sigma(w_{r}(s+1)^{T}y_{j})-\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]\sigma(w_{r}(s)^{T}y_{j})$
		$\displaystyle=\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})\right](\sigma(w_{r}(s+1)^{T}y_{j})-\sigma(w_{r}(s)^{T}y_{j}))$
		$\displaystyle\quad+\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\left[\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]\right]\sigma(w_{r}(s)^{T}y_{j}).$

and

		$\displaystyle\left\langle\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w_{r}},w_{r}(s+1)-w_{r}(s)\right\rangle$		(45)
		$\displaystyle=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]I\{w_{r}(s)^{T}y_{j}\geq 0\}(w_{r}(s+1)-w_{r}(s))^{T}y_{j}$		(45)

and

		$\displaystyle\left\langle\frac{\partial G^{s}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s)\right\rangle$		(46)
		$\displaystyle=\frac{1}{\sqrt{m}}\frac{\tilde{a}_{rk}}{\sqrt{p}}(\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s))^{T}u_{i}I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}\sigma(w_{r}(s)^{T}y_{j}).$		(46)

Thus, we can decompose $I_{i,j}(s)$ as follows

I_{i,j}(s)=\sum\limits_{r=1}^{m}\tilde{I}_{i,j}^{r}(s)+\bar{I}_{i,j}^{r}(s),

where

	$\displaystyle\tilde{I}_{i,j}^{r}(s)$	$\displaystyle=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\left[\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s))^{T}u_{i}\right]\right]$		(47)
		$\displaystyle\quad\cdot\sigma(w_{r}(s)^{T}y_{j})$		(47)

and

	$\displaystyle\bar{I}_{i,j}^{r}(s)$	$\displaystyle=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})\right](\sigma(w_{r}(s+1)^{T}y_{j})-\sigma(w_{r}(s)^{T}y_{j}))$		(48)
		$\displaystyle\quad-\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]I\{w_{r}(s)^{T}y_{j}\geq 0\}(w_{r}(s+1)-w_{r}(s))^{T}y_{j}.$		(48)

For $\tilde{I}_{i,j}^{r}(s)$ , we replace $\tilde{R}$ in the definition of $\tilde{A}_{rk}^{i}$ by $\tilde{R}^{{}^{\prime}}$ and still denote the event as $\tilde{A}_{rk}^{i}$ for simplicity, i.e.,

\tilde{A}_{rk}^{i}:=\{\exists w:\|w-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R}^{{}^{\prime}},I\{w^{T}u_{i}\geq 0\}\neq I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0\}\}

and $\tilde{S}_{i}=\{(r,k)\in[m]\times[p]:I\{\tilde{A}_{rk}\}=0\}$ .

From the induction hypothesis, we know $\|\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R}^{{}^{\prime}}$ and $\|\tilde{w}_{rk}(s)-\tilde{w}_{rk}(0)\|_{2}\leq\tilde{R}^{{}^{\prime}}$ . Thus, $I\{\tilde{w}_{rk}(s+1)^{T}u_{i}\geq 0\}=I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}$ holds for any $(r,k)\in\tilde{S}_{i}$ . From this fact, we can deduce that for any $(r,k)\in\tilde{S}_{i}$ ,

		$\displaystyle\left\|\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s))^{T}u_{i}\right\|$		(49)
		$\displaystyle=\|(\tilde{w}_{rk}(s+1)^{T}u_{i})I\{\tilde{w}_{rk}(s+1)^{T}u_{i}\geq 0\}-(\tilde{w}_{rk}(s)^{T}u_{i})I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}$
		$\displaystyle\quad-I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s))^{T}u_{i}\|$
		$\displaystyle=\|(\tilde{w}_{rk}(s+1)^{T}u_{i})I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}-(\tilde{w}_{rk}(s)^{T}u_{i})I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}$
		$\displaystyle\quad-I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s))^{T}u_{i}\|$
		$\displaystyle=0.$

On the other hand, for any $(r,k)\in[m]\times[p]$ , we have

|\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s))^{T}u_{i}|\lesssim\|\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s)\|_{2}.

(50)

Thus, combining (41), (42) and (43) yields that

$\displaystyle\sum\limits_{r=1}^{m}\|\tilde{I}_{i,j}^{r}(s)\|$	$\displaystyle\leq\frac{B}{\sqrt{mp}}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\\|\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s)\\|_{2}I\{(r,k)\in\tilde{S}_{i}^{\perp}\}$	(51)
	$\displaystyle=\frac{B}{\sqrt{mp}}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\\|-\eta\frac{\partial L(W(s),\tilde{W}(s))}{\partial\tilde{w}_{rk}}\right\\|_{2}I\{(r,k)\in\tilde{S}_{i}^{\perp}\}$
	$\displaystyle=\frac{B}{\sqrt{mp}}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\\|-\eta\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\frac{\partial G^{s}(u_{i_{1}})(y_{j_{1}})}{\partial\tilde{w}_{rk}}(G^{s}(u_{i_{1}})(y_{j_{1}})-z_{j_{1}}^{i_{1}})\right\\|_{2}I\{(r,k)\in\tilde{S}_{i}^{\perp}\}$
	$\displaystyle\leq\eta B^{2}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\frac{\sqrt{n_{1}n_{2}}}{mp}\\|z-G^{s}(u)\\|_{2}I\{(r,k)\in\tilde{S}_{i}^{\perp}\}$
	$\displaystyle=\eta B^{2}\sqrt{n_{1}n_{2}}\\|z-G^{s}(u)\\|_{2}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\frac{1}{mp}I\{(r,k)\in\tilde{S}_{i}^{\perp}\}$
	$\displaystyle=\eta B^{2}\sqrt{n_{1}n_{2}}\\|z-G^{s}(u)\\|_{2}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\frac{1}{mp}I\{\tilde{A}_{r,k}^{i}\},$

where the second inequality follows from Cauchy’s inequality and the form of $\frac{\partial G^{s}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}}$ , i.e.,

\frac{\partial G^{s}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}}=\frac{1}{\sqrt{m}}\frac{\tilde{a}_{rk}}{\sqrt{p}}u_{i}I\{\tilde{w}_{rk}(s)^{T}u_{i}\geq 0\}\sigma(w_{r}(s)^{T}y_{j}).

From the Bernstein inequality, we have that with probability at least $1-n_{1}exp(-mp\tilde{R}^{{}^{\prime}})$ ,

\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\frac{1}{mp}I\{\tilde{A}_{r,k}^{i}\}\lesssim\tilde{R}^{{}^{\prime}}.

This leads to the final unpper bound:

	$\displaystyle\sum\limits_{r=1}^{m}\|\tilde{I}_{i,j}^{r}(s)\|$	$\displaystyle\lesssim\eta B^{2}\sqrt{n_{1}n_{2}}\tilde{R}^{{}^{\prime}}\\|z-G^{s}(u)\\|_{2}$		(52)
		$\displaystyle\lesssim\frac{\eta n_{1}n_{2}\\|z-G^{0}(u)\\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right)\\|z-G^{s}(u)\\|_{2}.$		(52)

It remains to bound $\bar{I}_{i,j}^{r}(s)$ , which can be written as follows.

		$\displaystyle\bar{I}_{i,j}^{r}(s)=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})\right](\sigma(w_{r}(s+1)^{T}y_{j})-\sigma(w_{r}(s)^{T}y_{j}))$		(53)
		$\displaystyle\quad-\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]I\{w_{r}(s)^{T}y_{j}\geq 0\}(w_{r}(s+1)-w_{r}(s))^{T}y_{j}$
		$\displaystyle=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i}))\right]\left(\sigma(w_{r}(s+1)^{T}y_{j})-\sigma(w_{r}(s)^{T}y_{j})\right)$
		$\displaystyle\quad-\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i}))\right]I\{w_{r}(s)^{T}y_{j}\geq 0\}(w_{r}(s+1)-w_{r}(s))^{T}y_{j}$
		$\displaystyle+\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\left(\sigma(w_{r}(s+1)^{T}y_{j})-\sigma(w_{r}(s)^{T}y_{j})-I\{w_{r}(s)^{T}y_{j}\geq 0\}(w_{r}(s+1)-w_{r}(s))^{T}y_{j}\right).$

Note that

\|\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\|_{2}\lesssim\tilde{R}^{{}^{\prime}},\ \|\sigma(\tilde{w}_{rk}(s+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\|_{2}\lesssim\tilde{R}^{{}^{\prime}},

thus we can bound the first term and second term by

\frac{\sqrt{p}\tilde{R}^{{}^{\prime}}}{\sqrt{m}}\|w_{r}(s+1)-w_{r}(s)\|_{2}\lesssim\frac{\sqrt{n}\|z-G^{0}(u)\|_{2}}{m(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{m}{\delta}\right)}\|w_{r}(t+1)-w_{r}(t)\|_{2}.

(54)

For the third term in (46), we also replace $R$ in the definition of $A_{jr}$ by $R^{{}^{\prime}}$ and still denote the event by $A_{jr}$ . Recall that

A_{jr}:=\{\exists w:\|w-w_{r}(0)\|_{2}\leq R,I\{w^{T}y_{j}\geq 0\}\neq I\{w_{r}(0)^{T}y_{j}\geq 0\}\}

(55)

and $S_{j}=\{r\in[m]:I\{A_{jr}\}=0\}$ . Note that $\|w_{r}(s+1)-w_{r}(0)\|_{2}\leq R^{{}^{\prime}}$ and $\|w_{r}(s)-w_{r}(0)\|_{2}\leq R^{{}^{\prime}}$ , thus for $r\in S_{j}$ , we have $I\{w_{r}(s+1)^{T}y_{j}\geq 0\}=I\{w_{r}(s)^{T}y_{j}\geq 0\}$ . Combining this fact with (46) and (47), we can deduce that

\displaystyle|\bar{I}_{i,j}^{r}(s)|\lesssim\frac{\sqrt{n}\|z-G^{0}(u)\|_{2}}{m(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{m}{\delta}\right)}\|w_{r}(s+1)-w_{r}(s)\|_{2}+\frac{1}{\sqrt{m}}\sqrt{\log\left(\frac{m}{\delta}\right)}\|w_{r}(s+1)-w_{r}(s)\|_{2}I\{r\in S_{j}^{\perp}\}.

(56)

Thus, we have to bound $\|w_{r}(s+1)-w_{r}(s)\|_{2}$ . From the gradient descent update formula, we have

	$\displaystyle\\|w_{r}(s+1)-w_{r}(s)\\|_{2}$	$\displaystyle=\left\\|-\eta\frac{\partial L(W(s),\tilde{W}(s))}{\partial w_{r}}\right\\|_{2}$		(57)
		$\displaystyle\leq\eta\\|z-G^{s}(u)\\|_{2}\sqrt{\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\left\\|\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w_{r}}\right\\|_{2}^{2}}.$		(57)

Recall that

\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w_{r}}=\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]y_{j}I\{w_{r}(s)^{T}y_{j}\geq 0\}.

Therefore,

		$\displaystyle\left\\|\frac{\partial G^{s}(u_{i})(y_{j})}{\partial w_{r}}\right\\|_{2}$		(58)
		$\displaystyle=\left\\|\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\left[\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i})+\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right]y_{j}I\{w_{r}(s)^{T}y_{j}\geq 0\}\right\\|_{2}$
		$\displaystyle\lesssim\frac{\sqrt{p}\tilde{R}^{{}^{\prime}}}{\sqrt{m}}+\frac{1}{\sqrt{m}}\sqrt{\log\left(\frac{m}{\delta}\right)}$
		$\displaystyle\lesssim\frac{1}{\sqrt{m}}\frac{\sqrt{n}\\|z-G^{0}(u)\\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{m}{\delta}\right)}+\frac{1}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}$
		$\displaystyle\lesssim\frac{1}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)},$

where the last inequality is due to us taking $m$ sufficiently large in the end.

Combining (50) and (51) yields that

\|w_{r}(s+1)-w_{r}(s)\|_{2}\lesssim\frac{\eta\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}.

(59)

Plugging this into (49) leads to that

|\bar{I}_{i,j}^{r}(s)|\lesssim\frac{\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{m(\lambda_{0}+\tilde{\lambda}_{0})}\frac{\eta\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}}{\sqrt{m}}\log\left(\frac{m}{\delta}\right)+\frac{\eta\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}}{m}I\{r\in S_{j}^{\perp}\}\log\left(\frac{m}{\delta}\right).

By applying the Bernstein’s inequality, we have that with probability at least $1-n_{2}exp(-mR^{{}^{\prime}})$ ,

\frac{1}{m}\sum\limits_{r=1}^{m}I\{r\in S_{j}^{\perp}\}=\frac{1}{m}\sum\limits_{r=1}^{m}I\{A_{jr}\}\lesssim R^{{}^{\prime}}.

Therefore,

	$\displaystyle\sum\limits_{r=1}^{m}\|\bar{I}_{i,j}^{r}(s)\|$	$\displaystyle\lesssim\frac{\eta n_{1}n_{2}\\|z-G^{0}(u)\\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log\left(\frac{m}{\delta}\right)\\|z-G^{s}(u)\\|_{2}+\frac{\eta n_{1}n_{2}\\|z-G^{0}(u)\\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right)\\|z-G^{s}(u)\\|_{2}$		(60)
		$\displaystyle\lesssim\frac{\eta n_{1}n_{2}\\|z-G^{0}(u)\\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right)\\|z-G^{s}(u)\\|_{2}.$		(60)

From (45) and (53), we have

$\displaystyle\|I_{i,j}(s)\|$	$\displaystyle\leq\sum\limits_{r=1}^{m}\|\tilde{I}_{i,j}^{r}(s)\|+\sum\limits_{r=1}^{m}\|\bar{I}_{i,j}^{r}(s)\|$	(61)
	$\displaystyle\lesssim\left(\frac{\eta n_{1}n_{2}\\|z-G^{0}(u)\\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right)+\frac{\eta n_{1}n_{2}\\|z-G^{0}(u)\\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right)\right)\\|z-G^{s}(u)\\|_{2}$
	$\displaystyle\lesssim\frac{\eta n_{1}n_{2}\\|z-G^{0}(u)\\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right)\\|z-G^{s}(u)\\|_{2}.$

Therefore,

\|I(s)\|_{2}\lesssim\bar{R}\|z-G^{s}(u)\|_{2},

(62)

where

\bar{R}=\frac{\eta(n_{1}n_{2})^{\frac{3}{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{\frac{3}{2}}\left(\frac{m}{\delta}\right).

(63)

∎

8.3. Proof of Corollary 1

Proof.

Note that when $\|I(s)\|_{2}\leq\eta(\lambda_{0}+\tilde{\lambda}_{0})/12$ , $\lambda_{min}(H(s))\geq\lambda_{0}/2$ , $\lambda_{min}(\tilde{H}(s))\geq\tilde{\lambda}_{0}/2$ and $I-\eta(H(s)+\tilde{H}(s))$ is positive definite, we have that for $s=0,\cdots,t-1$ ,

		$\displaystyle\\|z-G^{s+1}(u)\\|_{2}^{2}$		(64)
		$\displaystyle=\\|\left[I-\eta(H(s)+\tilde{H}(s))\right](z-G^{s}(u))\\|_{2}^{2}+\\|I(s)\\|_{2}^{2}-2\left\langle\left(\eta(H(s)+\tilde{H}(s))\right)(z-G^{s}(u)),I(s)\right\rangle$
		$\displaystyle\leq(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2})^{2}\\|z-G^{s}(u)\\|_{2}^{2}+\\|I(s)\\|_{2}^{2}+2(1-\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2})\\|z-G^{s}(u)\\|_{2}\\|I(s)\\|_{2}$
		$\displaystyle\leq\left(1-\eta(\lambda_{0}+\tilde{\lambda}_{0})+\frac{\eta^{2}(\lambda_{0}+\tilde{\lambda}_{0})^{2}}{4}+\left(\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{12}\right)^{2}+2\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{12}\right)\\|z-G^{s}(u)\\|_{2}^{2}$
		$\displaystyle\leq\left(1-\eta(\lambda_{0}+\tilde{\lambda}_{0})+\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{4}+\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{12}+2\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{12}\right)\\|z-G^{s}(u)\\|^{2}$
		$\displaystyle=\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)\\|z-G^{s}(u)\\|^{2}.$

Thus,

\|z-G^{s}(u)\|_{2}^{2}\leq\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)^{s}\|z-G^{0}(u)\|_{2}^{2}

holds for $s=0,\cdots,t$ .

Now, we have to derive the requirement for $m$ such that these conditions hold. First, from Lemma 3, when

R\lesssim\frac{\min(\lambda_{0},\tilde{\lambda}_{0})}{n_{1}n_{2}\log\left(\frac{m}{\delta}\right)},\ \tilde{R}\lesssim\frac{\min(\lambda_{0},\tilde{\lambda}_{0})}{n_{1}n_{2}\sqrt{p}\sqrt{\log\left(\frac{m}{\delta}\right)}},

we have $\|H(s)-H(0)\|_{2}\leq\frac{\lambda_{0}}{4}$ , $\|\tilde{H}(s)-\tilde{H}(0)\|_{2}\leq\frac{\tilde{\lambda}_{0}}{4}$ and $\lambda_{min}(H(s))\geq\lambda_{0}/2$ , $\lambda_{min}(\tilde{H}(s))\geq\tilde{\lambda}_{0}/2$ . Thus, when $R^{{}^{\prime}}<R$ and $\tilde{R}^{{}^{\prime}}<\tilde{R}$ , we have $\lambda_{min}(H(s))\geq\lambda_{0}/2$ and $\lambda_{min}(\tilde{H}(s))\geq\tilde{\lambda}_{0}/2$ . Specifically, $m$ need to satisfy that

m=\Omega\left(\frac{n_{1}^{4}n_{2}^{4}\log^{3}\left(\frac{m}{\delta}\right)\log\left(\frac{n_{1}n_{2}}{\delta}\right)}{(min(\lambda_{0},\tilde{\lambda}_{0}))^{2}(\lambda_{0}+\tilde{\lambda}_{0})^{2}}\right).

(65)

Moreover, at this point, we can deduce that

	$\displaystyle\\|H(s)\\|_{2}$	$\displaystyle\leq\\|H(s)-H(0)\\|_{2}+\\|H(0)\\|_{2}$
		$\displaystyle\leq\\|H(s)-H(0)\\|_{2}+\\|H(0)-H^{\infty}\\|_{2}+\\|H^{\infty}\\|_{2}$
		$\displaystyle\leq\frac{\lambda_{0}}{4}+\frac{\lambda_{0}}{4}+\\|H^{\infty}\\|_{2}$
		$\displaystyle\leq\frac{3}{2}\\|H^{\infty}\\|_{2}$

and similarly, $\|\tilde{H}(s)\|_{2}\leq\frac{3}{2}\|H^{\infty}\|_{2}$ . Thus, $\eta=\mathcal{O}\left(\frac{1}{\|H^{\infty}\|_{2}+\|\tilde{H}^{\infty}\|_{2}}\right)$ is sufficient to ensure that $I-\eta(H(s)+\tilde{H}(s))$ is positive definite.

Second, we need to make sure that $\|I(s)\|_{2}\leq\eta(\lambda_{0}+\tilde{\lambda}_{0})/12$ . From (56), $\bar{R}\lesssim\eta(\lambda_{0}+\tilde{\lambda}_{0})$ suffices, i.e.,

m=\Omega\left(\frac{n_{1}^{4}n_{2}^{4}\log(\frac{n}{\delta})\log^{3}(\frac{m}{\delta})}{(\lambda_{0}+\tilde{\lambda_{0}})^{4}}\right).

(66)

Combining these requirements for $m$ , i.e., (58), (59) and the condition in Lemma 2, leads to the desired conclusion.

∎

8.4. Proof of Theorem 2

Proof.

From Corollary 1, it remains only to verify that Condition 1 also holds for $s=t+1$ . Note that in (52), we have proven that

\|w_{r}(s+1)-w_{r}(s)\|_{2}\lesssim\frac{\eta\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{m}{\delta}\right)}

holds for $s=0,\cdots,t$ and $r\in[m]$ .

Combining this with Corollary 1 yields that

	$\displaystyle\\|w_{r}(t+1)-w_{r}(0)\\|_{2}$	$\displaystyle\leq\sum\limits_{s=0}^{t}\\|w_{r}(s+1)-w_{r}(s)\\|_{2}$
		$\displaystyle\lesssim\sum\limits_{s=0}^{t}\frac{\eta\sqrt{n_{1}n_{2}}\\|z-G^{s}(u)\\|_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{m}{\delta}\right)}$
		$\displaystyle\lesssim\sum\limits_{s=0}^{t}\frac{\eta\sqrt{n_{1}n_{2}}}{\sqrt{m}}\sqrt{\log\left(\frac{m}{\delta}\right)}\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)^{s/2}\\|z-G^{0}(u)\\|_{2}$
		$\displaystyle\lesssim\frac{\sqrt{n_{1}n_{2}}\\|z-G^{0}(u)\\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{m}{\delta}\right)}.$

Similarly, in (44), we have proven that

\|\tilde{w}_{rk}(s+1)-\tilde{w}_{rk}(s)\|_{2}\lesssim\frac{\eta B\sqrt{n_{1}n_{2}}\|z-G^{s}(u)\|_{2}}{\sqrt{mp}},

which yields that

\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(0)\|_{2}\lesssim\frac{\sqrt{n_{1}n_{2}}\|z-G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{m}{\delta}\right)}.

Moreover, from the triangle inequality, we have

	$\displaystyle\|w_{r}(t+1)^{T}y_{j}\|$	$\displaystyle\leq\|(w_{r}(t+1)-w_{r}(0))^{T}y_{j}\|+\|w_{r}(0)^{T}y_{j}\|$
		$\displaystyle\leq\\|w_{r}(t+1)-w_{r}(0)\\|_{2}+\sqrt{2\log\left(\frac{2mn_{2}}{\delta}\right)}$
		$\displaystyle\leq 2\sqrt{\log\left(\frac{mn_{2}}{\delta}\right)}.$

∎

8.5. Proof of Lemma 6

Proof.

First, for $H^{\infty}=H_{1}^{\infty}\otimes H_{2}^{\infty}$ , recall that the Kronecker product of two strictly definite matrices is also strictly positive definte, thus it suffices to demonstrate that $H_{1}^{\infty}$ and $H_{2}^{\infty}$ are both strictly definite. For $H_{1}^{\infty}$ , similar as that in the proof of Lemma 2, let $\mathcal{H}$ be the Hilbert space of integrable function on $\mathbb{R}^{d+1}$ , i.e., $f\in\mathcal{H}$ if $\mathbb{E}_{\tilde{w}\sim\mathcal{N}(\bm{0},\bm{I})}[|f(\tilde{w})|^{2}]<\infty$ . Now to prove $H_{1}^{\infty}$ is strictly positive definite, it is equivalent to show that $\phi(u_{1})(\tilde{w}),\cdots,\phi(u_{n_{1}})(\tilde{w})\in\mathcal{H}$ are linearly independent, where $\phi(u_{i})(\tilde{w})=\sigma(\tilde{w}^{T}u_{i})$ . It has been proved in [19], we provide a different proof for completeness and this proof also indicates the strictly positive definiteness of $\tilde{H}^{\infty}$ . Suppose that there are $\alpha_{1},\cdots,\alpha_{n_{1}}\in\mathbb{R}$ such that

\alpha_{1}\phi(u_{1})+\cdots+\alpha_{n_{1}}\phi(u_{n_{1}})=0\ in\ \mathcal{H},

which implies that

\alpha_{1}\phi(u_{1})(\tilde{w})+\cdots+\alpha_{n_{1}}\phi(u_{n_{1}})(\tilde{w})=0

holds for all $\tilde{w}\in\mathbb{R}^{d+1}$ due to the continuity of $\phi(u_{1})(\cdot)$ .

Let $\tilde{D}_{i}=\{\tilde{w}\in\mathbb{R}^{d+1}:\tilde{w}^{T}u_{i}=0\}$ for $i\in[n_{1}]$ , then Lemma A.1 in [17] implies that when no two samples in $\{u_{i}\}_{i=1}^{n_{1}}$ are parallel, $D_{i}\not\subset\cup_{j\neq i}D_{j}$ for any $i\in[n_{1}]$ . Thus, we can choose $\tilde{w}_{0}\in D_{i}\backslash\cup_{j\neq i}D_{j}$ . Since $\cup_{j\neq i}D_{j}$ is a closed set, there is positive constant $r_{0}$ such that $B_{r_{0}}(\tilde{w}_{0})\cap(\cup_{j\neq i}D_{j})=\emptyset$ . This fact implies that $\phi(u_{j})(\cdot)$ is differentiable in $B_{r_{0}}(\tilde{w}_{0})$ for each $j\neq i$ . Thus $\alpha_{i}\phi(u_{i})(\cdot)$ is also differentiable in $B_{r_{0}}(\tilde{w}_{0})$ . However $\phi(u_{i})(\tilde{w})=\sigma(\tilde{w}^{T}u_{i})$ is not differentiable in $B_{r_{0}}(\tilde{w}_{0})$ . Thus we can deduce that $\alpha_{i}=0$ . Similarly, we have $\alpha_{j}=0$ for all $j\in[n_{1}]$ , which implies that $H_{1}^{\infty}$ is strictly positive definite. For $H_{2}^{\infty}$ , it can be seen as a Gram matrix of PINN, Lemma 3.2 in [20] implies that $H_{2}^{\infty}$ is strictly positive definite.

Second, for $\tilde{H}^{\infty}$ , recall that $\tilde{H}^{\infty}=\tilde{H}_{1}^{\infty}\otimes\tilde{H}_{2}^{\infty}$ . Note that the $(i,j)$ -th entry of $\tilde{H}_{1}^{\infty}$ is $\mathbb{E}[u_{i}^{T}u_{j}I\{\tilde{w}^{T}u_{i}\geq 0,\tilde{w}^{T}u_{i}\geq 0\}]$ . Thus, Theorem 3.1 in [4] implies that $\tilde{H}_{1}^{\infty}$ is strictly positive definite. For $H_{2}^{\infty}$ , let

\phi_{1}(w)=\mathcal{L}(\sigma_{3}(w^{T}y_{1})),\cdots,\phi_{n_{2}}(w)=\mathcal{L}(\sigma_{3}(w^{T}y_{n_{2}})),\psi_{1}(w)=\sigma_{3}(w^{T}\tilde{y}_{1}),\cdots,\psi_{n_{3}}(w)=\sigma_{3}(w^{T}\tilde{y}_{n_{3}})

and $H$ be the Hilbert space of integrable $(d+1)$ -dimensional vector fields on $\mathbb{R}^{d+1}$ . Suppose that there are $\alpha_{1},\cdots,\alpha_{n_{2}},\beta_{1},\cdots,\beta_{n_{3}}$ such that

\alpha_{1}\phi_{1}+\cdots+\alpha_{n_{2}}\phi_{n_{2}}+\beta_{1}\psi_{1}+\cdots+\beta\psi_{n_{3}}=0\ in\ \mathcal{H},

which yields that

\alpha_{1}\phi_{1}+\cdots+\alpha_{n_{2}}\phi_{n_{2}}+\beta_{1}\psi_{1}+\cdots+\beta\psi_{n_{3}}=0

holds for all $w\in\mathbb{R}^{d+1}$ .

Let $D_{i}=\{w\in\mathbb{R}^{d+1}:w^{T}y_{i}=0\}$ for $i\in[n_{2}]$ and $\bar{D}_{i}=\{w\in\mathbb{R}^{d+1}:w^{T}\tilde{y}_{i}=0\}$ for $i\in[n_{3}]$ . Thus $D_{i}\not\subset(\cup_{j\neq i}D_{j})\cup(\cup_{j}\bar{D}_{j})$ for any $i\in[n_{2}]$ . Similarly, we can choose $w_{0}\in D_{i}$ and $r_{0}>0$ such that $B_{r_{0}}(w_{0})\cap\left((\cup_{j\neq i}D_{j})\cup(\cup_{j}\bar{D}_{j})\right)=\emptyset$ . Note that $\phi_{j}$ $(j\neq i)$ and $\psi_{j}$ are differentiable in $B_{r_{0}}(w_{0})$ , thus $\alpha_{i}\phi_{i}$ is also differentiable in $B_{r_{0}}(w_{0})$ , implying that $\alpha_{i}=0$ . Therefore, $\alpha_{i}=0$ for all $i\in[n_{2}]$ . Moreover, similar to the proof of the strictly positive definiteness of $H_{1}^{\infty}$ , we can also deduce that $\beta_{i}=0$ for all $i\in[n_{3}]$ . Finally, $\tilde{H}$ is strictly positive definite.

∎

8.6. Proof of Lemma 7

Proof.

First, for $H(0)-H^{\infty}$ , we consider its $(i,j)$ -th block, whose entry has the following form

\frac{1}{m}\sum\limits_{r=1}^{m}X_{r}Y_{r}-\mathbb{E}[X_{1}Y_{1}],

where

X_{r}=\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u_{j})\right],

for $j_{1}\in[n_{1}],j_{2}\in[n_{1}]$ ,

Y_{r}=\frac{1}{n_{2}}\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))}{\partial w_{r}}\right\rangle,

for $j_{1}\in[n_{1}],n_{2}+j_{2}\in[n_{2},n_{2}+n_{3}]$ ,

Y_{r}=\frac{1}{\sqrt{n_{2}n_{3}}\ }\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle,

for $n_{2}+j_{1}\in[n_{2},n_{2}+n_{3}],n_{2}+j_{2}\in[n_{2},n_{2}+n_{3}]$ ,

Y_{r}=\frac{1}{n_{3}}\left\langle\frac{\partial\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j_{1}})}{\partial w_{r}},\frac{\partial\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle.

To use the concentration inequality, we need to clarify the order of the sub-Weil random variable $X_{r}Y_{r}-\mathbb{E}[X_{1}Y_{1}]$ . Note that Lemma 18 implies that

\|X_{r}\|_{\psi_{1}}\leq\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right\|_{\psi_{2}}\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u_{j})\right\|_{\psi_{2}}=\mathcal{O}(1).

On the other hand, from

\mathcal{L}(\sigma_{3}(w_{r}^{T}y))=w_{r0}\sigma_{2}(w_{r}^{T}y)+\|w_{r1}\|_{2}^{2}\sigma(w_{r}^{T}y)+\sigma_{3}(w_{r}^{T}y),

we have

\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y))}{\partial w_{r}}=\begin{pmatrix}1\\ 0_{d}\end{pmatrix}\sigma_{2}(w_{r}^{T}y)+w_{r0}y\sigma(w_{r}^{T}y)+2\begin{pmatrix}0\\ w_{r1}\end{pmatrix}\sigma(w_{r}^{T}y)+\|w_{r1}\|_{2}yI\{w_{r}^{T}y\geq 0\}+y\sigma_{2}(w_{r}^{T}y)

and

\frac{\partial\sigma_{3}(w_{r}^{T}y)}{\partial w_{r}}=y\sigma_{2}(w_{r}^{T}y).

Therefore, we can deduce that

\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y))}{\partial w_{r}}\right\|_{2}\lesssim|w_{r}^{T}y|^{2}+\|w_{r}\|_{2}|w_{r}^{T}y|+|w_{r}^{T}y|^{2}\lesssim\|w_{r}\|_{2}|w_{r}^{T}y|

and

\left\|\frac{\partial\sigma_{3}(w_{r}^{T}y)}{\partial w_{r}}\right\|_{2}\lesssim|w_{r}^{T}y|^{2}\lesssim\|w_{r}\|_{2}|w_{r}^{T}y|.

Note that $\|w_{r}^{T}y\|_{\psi_{2}}=\mathcal{O}(1)$ for $\|y\|_{2}=\mathcal{O}(1)$ , thus $\|\sigma_{2}(w_{r}^{T}y)\|_{\psi_{1}}=\mathcal{O}(1)$ , $\|\sigma_{3}(w_{r}^{T}y)\|_{\psi_{\frac{2}{3}}}=\mathcal{O}(1)$ . Thus Lemma 20 implies that

\|\|w_{r}\|_{2}^{2}|w_{r}^{T}y_{j_{1}}||w_{r}^{T}y_{j_{2}}|\|_{\psi_{\frac{1}{2}}}\leq\|\|w_{r}\|_{2}^{2}\|_{\psi_{1}}\||w_{r}^{T}y_{j_{1}}||w_{r}^{T}y_{j_{2}}|\|_{\psi_{1}}=\mathcal{O}(d).

On the other hand, Lemma 21 implies that

\|X_{r}Y_{r}-\mathbb{E}[X_{1}Y_{1}]\|_{\psi_{\alpha}}\lesssim\|X_{r}Y_{r}\|_{\psi_{\alpha}}+\|\mathbb{E}[X_{1}Y_{1}]\|_{\psi_{\alpha}}\lesssim\|X_{r}Y_{r}\|_{\psi_{\alpha}}+\mathbb{E}[|X_{1}|]\mathbb{E}[|Y_{1}|].

Note that $\mathbb{E}[|X_{1}|]\leq|X_{1}|_{\psi_{1}}=\mathcal{O}(1)$ and $|Y_{1}|_{\psi_{\frac{1}{2}}}\lesssim\|\|w_{r}\|_{2}^{2}|w_{r}^{T}y_{j_{1}}||w_{r}^{T}y_{j_{2}}|\|_{\psi_{\frac{1}{2}}}=\mathcal{O}(d)$ . From the Taylor expansion of the function $e^{x}$ , we have that for any $C>0$ ,

\mathbb{E}[e^{(\frac{|Y_{1}|}{C})^{\frac{1}{2}}}-1]\geq\mathbb{E}\left[\frac{1}{2!}\frac{|Y|}{C}\right],

which implies that $\mathbb{E}[|Y_{1}|]=\mathcal{O}(d)$ . Therefore, $\|X_{r}Y_{r}-\mathbb{E}[X_{1}Y_{1}]\|_{\psi_{\frac{1}{2}}}=\mathcal{O}(d)$ . Finally, applying Lemma 17 leads to that with probability at least $1-\delta$ ,

\left|\frac{1}{m}\sum\limits_{r=1}^{m}X_{r}Y_{r}-\mathbb{E}[X_{1}Y_{1}]\right|\lesssim\frac{d}{n_{2}\sqrt{m}}\sqrt{\log(\frac{1}{\delta})}+\frac{d}{n_{2}m}\log^{2}(\frac{1}{\delta}).

Taking a union bound yields that with probability at least $1-\delta$ ,

\displaystyle\|H(0)-H^{\infty}\|_{F}

\displaystyle\lesssim\frac{dn_{1}}{\sqrt{m}}\log(\frac{n_{1}(n_{2}+n_{3})}{\delta}).

First, for $\tilde{H}(0)-\tilde{H}^{\infty}$ , we consider its $(i,j)$ -th block, whose entry has the following form

\frac{1}{m}\sum\limits_{r=1}^{m}X_{r}Y_{r}-\mathbb{E}[X_{1}Y_{1}],

where

X_{r}=\frac{1}{p}\sum\limits_{k=1}^{p}u_{i}^{T}u_{j}I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}(0)^{T}u_{j}\geq 0\},

for $j_{1}\in[n_{1}],j_{2}\in[n_{1}]$ ,

Y_{r}=\frac{1}{n_{2}}\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))

for $j_{1}\in[n_{1}],n_{2}+j_{2}\in[n_{2},n_{2}+n_{3}]$ ,

Y_{r}=\frac{1}{\sqrt{n_{2}n_{3}}\ }\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j_{2}}),

for $n_{2}+j_{1}\in[n_{2},n_{2}+n_{3}],n_{2}+j_{2}\in[n_{2},n_{2}+n_{3}]$ ,

Y_{r}=\frac{1}{n_{3}}\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j_{1}})\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j_{2}}).

From Lemma 21, we have

\left\|\frac{1}{p}\sum\limits_{k=1}^{p}u_{i}^{T}u_{j}I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}(0)^{T}u_{j}\geq 0\}\right\|_{\psi_{2}}=\mathcal{O}(1).

Note that

|\mathcal{L}(\sigma_{3}(w_{r}^{T}y))|\lesssim\|w_{r}\|_{2}|w_{r}^{T}y|^{2}+\|w_{r}\|_{2}^{2}|w_{r}^{T}y|+|w_{r}^{T}y|^{3}\lesssim\|w_{r}\|_{2}^{2}|w_{r}^{T}y|

and

\ |\sigma_{3}(w_{r}^{T}y)|\leq|w_{r}^{T}y|^{3}\lesssim\|w_{r}\|_{2}^{2}|w_{r}^{T}y|.

From Lemma 21, we can deduce that

\|\|w_{r}\|_{2}^{4}\|_{\psi_{\frac{1}{2}}}\leq\|\|w_{r}\|_{2}^{2}\|_{\psi_{1}}^{2}=\mathcal{O}(d^{2})

and $\||w_{r}^{T}y|^{2}\|_{\psi_{1}}=\mathcal{O}(1)$ , thus

\|\|w_{r}\|_{2}^{4}|w_{r}^{T}y|^{2}\|_{\psi_{\frac{1}{3}}}=\mathcal{O}(d^{2}).

Therefore, with probability at least $1-\delta$ ,

\displaystyle\|\tilde{H}(0)-\tilde{H}^{\infty}\|_{F}

\displaystyle\lesssim\frac{d^{2}n_{1}}{\sqrt{m}}\log(\frac{n_{1}(n_{2}+n_{3})}{\delta}).

∎

8.7. Proof of Lemma 8

Proof.

For $H-H(0)$ , from the form of $(j_{1},j_{2})$ -th entry of the $(i,j)$ -th block, we focus on the form $a_{r}b_{r}-a_{r}(0)b_{r}(0)$ , where

a_{r}=\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{i})\right]\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u_{j})\right],

b_{r}=\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))}{\partial w_{r}}\right\rangle or\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle or\left\langle\frac{\partial\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{1}})}{\partial w_{r}},\frac{\partial\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle

and the notation $a_{r}(0),b_{r}(0)$ means replacing $w_{r},\tilde{w}_{rk}$ in the definitions of $a_{r}$ and $b_{r}$ with $w_{r}(0)$ and $\tilde{w}_{rk}(0)$ , respectively.

For $a_{r}-a_{r}(0)$ , (25) implies that

|a_{r}-a_{r}(0)|\lesssim p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}.

(67)

For $b_{r}-b_{r}(0)$ , when $b_{r}=\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))}{\partial w_{r}}\right\rangle$ , we have

		$\displaystyle\left\|\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))}{\partial w_{r}}\right\rangle-\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))}{\partial w_{r}}\right\rangle\right\|$		(68)
		$\displaystyle\leq\left\\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}}-\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}}\right\\|_{2}\left\\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))}{\partial w_{r}}\right\\|_{2}$
		$\displaystyle\quad+\left\\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))}{\partial w_{r}}-\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))}{\partial w_{r}}\right\\|_{2}\left\\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}}\right\\|_{2}$
		$\displaystyle\quad+\left\\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}}-\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}}\right\\|_{2}\left\\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))}{\partial w_{r}}-\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))}{\partial w_{r}}\right\\|_{2}.$

Note with probability at least $1-\delta$ , we have $\|w_{r}(0)\|_{2}\lesssim B_{1}$ , $|w_{r}(0)^{T}y_{j}|\lesssim B_{2}$ , $|w_{r}(0)^{T}\tilde{y}_{j_{1}}|\lesssim B_{2}$ holds for all $r\in[m]$ , $j\in[n_{2}]$ , $j_{1}\in[n_{3}]$ where

B_{1}=\sqrt{d\log\left(\frac{m}{\delta}\right)},\ B_{2}=\sqrt{\log\left(\frac{m(n_{2}+n_{3})}{\delta}\right)}.

Under these events, we can deduce that

\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}}\right\|_{2},\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{2}}))}{\partial w_{r}}\right\|_{2}\lesssim B_{1}B_{2}

and

	$\displaystyle\left\\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}}-\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j_{1}}))}{\partial w_{r}}\right\\|_{2}$	$\displaystyle\lesssim(B_{1}+B_{2})R+B_{1}\|I\{w_{r}(0)^{T}y_{j_{1}}\geq 0\}-I\{w_{r}^{T}y_{j_{1}}\geq 0\}\|$
		$\displaystyle\lesssim(B_{1}+B_{2})R+B_{1}I\{A_{r,j_{1}}\}.$

Thus, for $b_{r}-b_{r}(0)$ , we have

\displaystyle|b_{r}-b_{r}(0)|

\displaystyle\lesssim B_{1}B_{2}(B_{1}+B_{2})R+B_{1}^{2}B_{2}[I\{A_{r,j_{1}}\}+I\{A_{r,j_{2}}\}].

(69)

From Bernstein’s inequality, we have with probability at least $1-n_{2}\exp(-mR)$ ,

\frac{1}{m}\sum\limits_{r=1}^{m}I\{A_{r,j}\}\lesssim R

holds for all $j\in[n_{2}]$ .

Thus, summing $r$ yields that

	$\displaystyle\frac{1}{m}\sum\limits_{r=1}^{m}\|b_{r}-b_{r}(0)\|$	$\displaystyle\lesssim B_{1}B_{2}(B_{1}+B_{2})R+B_{1}^{2}B_{2}R$		(70)
		$\displaystyle\lesssim B_{1}^{2}B_{2}R.$		(70)

When $b_{r}=\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))}{\partial w_{r}},\frac{\partial\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle or\left\langle\frac{\partial\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{1}})}{\partial w_{r}},\frac{\partial\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle$ , we can obtain the same estimation, since

\left\|\frac{\partial\sigma_{3}(w_{r}(0)^{T}y_{j_{1}})}{\partial w_{r}}\right\|_{2}\lesssim B_{2}^{2}\lesssim B_{1}B_{2}

and

\left\|\frac{\partial\sigma_{3}(w_{r}^{T}y_{j_{1}})}{\partial w_{r}}-\frac{\partial\sigma_{3}(w_{r}(0)^{T}y_{j_{1}})}{\partial w_{r}}\right\|_{2}\lesssim B_{2}R.

Note that we can decompose $a_{r}b_{r}-a_{r}(0)b_{r}(0)$ as follows

a_{r}(0)[b_{r}-b_{r}(0)]+b_{r}(0)[a_{r}-a_{r}(0)]+[a_{r}-a_{r}(0)][b_{r}-b_{r}(0)].

Therefore, we have

		$\displaystyle\frac{1}{m}\sum\limits_{r=1}^{m}\|a_{r}b_{r}-a_{r}(0)b_{r}(0)\|$		(71)
		$\displaystyle=\frac{1}{m}\sum\limits_{r=1}^{m}\left\|a_{r}(0)[b_{r}-b_{r}(0)]+b_{r}(0)[a_{r}-a_{r}(0)]+[a_{r}-a_{r}(0)][b_{r}-b_{r}(0)]\right\|$
		$\displaystyle\leq\frac{1}{m}\sum\limits_{r=1}^{m}\|a_{r}(0)\|\|b_{r}-b_{r}(0)\|+\frac{1}{m}\sum\limits_{r=1}^{m}\|b_{r}(0)\|\|a_{r}-a_{r}(0)\|+\frac{1}{m}\sum\limits_{r=1}^{m}\|a_{r}-a_{r}(0)\|\|b_{r}-b_{r}(0)\|$
		$\displaystyle\lesssim B_{1}^{2}B_{2}R\log\left(\frac{mn_{1}}{\delta}\right)+B_{1}^{2}B_{2}^{2}\left(p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right).$

Summing $i,j,i_{1},j_{1}$ yields that

\displaystyle\|H-H(0)\|_{F}

\displaystyle\lesssim n_{1}B_{1}^{2}B_{2}R\log\left(\frac{mn_{1}}{\delta}\right)+n_{1}B_{1}^{2}B_{2}^{2}\left(p\tilde{R}^{2}+\sqrt{p}\tilde{R}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right).

For $\tilde{H}-\tilde{H}(0)$ , from the form of $(j_{1},j_{2})$ -th entry of the $(i,j)$ -th block, we focus on the form $a_{r}b_{r}-a_{r}(0)b_{r}(0)$ , where

a_{r}=\frac{1}{p}\sum\limits_{k=1}^{p}u_{i}^{T}u_{j}I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\},

b_{r}=\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{2}}))\ or\ \mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j_{1}}))\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{2}})\ or\ \sigma_{3}(w_{r}^{T}\tilde{y}_{j_{1}})\sigma_{3}(w_{r}^{T}\tilde{y}_{j_{2}}).

Similarly, we can deduce that

		$\displaystyle\frac{1}{m}\sum\limits_{r=1}^{m}\|a_{r}-a_{r}(0)\|$		(72)
		$\displaystyle\lesssim\frac{1}{m}\frac{1}{p}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\|I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}(0)^{T}u_{j}\geq 0\}\|$
		$\displaystyle\lesssim\frac{1}{m}\frac{1}{p}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}I\{\tilde{A}_{r,k}^{i}\}+I\{\tilde{A}_{r,k}^{j}\}$
		$\displaystyle\lesssim\tilde{R},$

where the last inequality holds with probability at least $1-n_{1}\exp(-mp\tilde{R})$ due to the use of Bernstein inequality.

For $b_{r}-b_{r}(0)$ , note that

|\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j}))|\lesssim B_{1}^{2}B_{2},\ |\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j})|\lesssim B_{2}^{3}\lesssim B_{1}^{2}B_{2}

and

\|\mathcal{L}(\sigma_{3}(w_{r}^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y_{j}))\|_{2}\lesssim B_{1}B_{2}R,\ |\sigma_{3}(w_{r}^{T}\tilde{y}_{j})-\sigma_{3}(w_{r}(0)^{T}\tilde{y}_{j})|\lesssim B_{2}^{2}R\lesssim B_{1}B_{2}R.

Thus, similar to (66), we have

|b_{r}-b_{r}(0)|\lesssim B_{1}^{3}B_{2}^{2}R.

(73)

Therefore, similar to (69), we have

		$\displaystyle\frac{1}{m}\sum\limits_{r=1}^{m}\|a_{r}b_{r}-a_{r}(0)b_{r}(0)\|$		(74)
		$\displaystyle=\frac{1}{m}\sum\limits_{r=1}^{m}\left\|a_{r}(0)[b_{r}-b_{r}(0)]+b_{r}(0)[a_{r}-a_{r}(0)]+[a_{r}-a_{r}(0)][b_{r}-b_{r}(0)]\right\|$
		$\displaystyle\lesssim B_{1}^{2}B_{2}\tilde{R}+B_{1}^{3}B_{2}^{2}R.$

Summing $i,j,i_{1},j_{1}$ yields that

\displaystyle\|\tilde{H}-\tilde{H}(0)\|_{F}

\displaystyle\lesssim n_{1}B_{1}^{2}B_{2}\tilde{R}+n_{1}B_{1}^{3}B_{2}^{2}R.

∎

8.8. Proof of Lemma 8

Proof.

Note that

		$\displaystyle s^{t+1}(u_{i})(y_{j})-s^{t}(u_{i})(y_{j})$		(75)
		$\displaystyle=s^{t+1}(u_{i})(y_{j})-s^{t}(u_{i})(y_{j})-\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle$
		$\displaystyle\quad+\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle+\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle.$

From the updating formula of gradient, we have

	$\displaystyle\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle$
	$\displaystyle=\sum\limits_{r=1}^{m}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle$
	$\displaystyle=-\eta\sum\limits_{r=1}^{m}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w_{r}},\frac{\partial s^{t}(u_{i_{1}})(y_{j_{1}})}{\partial w_{r}}\right\rangle s^{t}(u_{i_{1}})(y_{j_{1}})$
	$\displaystyle\quad-\eta\sum\limits_{r=1}^{m}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w_{r}},\frac{\partial h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}})}{\partial w_{r}}\right\rangle h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}}).$

Similarly, we can obtain that

	$\displaystyle\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle$
	$\displaystyle=\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\right\rangle$
	$\displaystyle=-\eta\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\frac{\partial s^{t}(u_{i_{1}})(y_{j_{1}})}{\partial\tilde{w}_{rk}}\right\rangle s^{t}(u_{i_{1}})(y_{j_{1}})$
	$\displaystyle\quad-\eta\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\sum\limits_{i_{1}=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\frac{\partial h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}})}{\partial\tilde{w}_{rk}}\right\rangle h^{t}(u_{i_{1}})(\tilde{y}_{j_{2}}).$

For $h^{t+1}(u_{i})(\tilde{y}_{j})-h^{t}(u_{i})(\tilde{y}_{j})$ , we can derive similar result, which is omitted for simplicity. Similar to the derivation in the section on neural operators, we have

G^{t+1}(u)-G^{t}(u)=-\eta(H(t)+\tilde{H}(t))G^{t}(u)+I(t),

(76)

s^{t+1}(u_{i})(y_{j_{1}})-s^{t}(u_{i})(y_{j_{1}})-\left\langle\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle,

The $n_{2}+j_{2}$ -th ( $j_{1}\in[n_{3}]$ ) component of $i$ -th block is

h^{t+1}(u_{i})(\tilde{y}_{j_{2}})-h^{t}(u_{i})(\tilde{y}_{j_{2}})-\left\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle.

Finally, applying a simple algebraic transformation to (76), we have

G^{t+1}(u)=[I-\eta(H(t)+\tilde{H}(t))]G^{t}(u)+I(t).

∎

8.9. Proof of Corollary 2

Proof.

Let $B_{1}=2\sqrt{d\log(m/\delta)},B_{2}=2\sqrt{\log(m(n_{2}+n_{3})/\delta)}$ , we first estimate $\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\|_{2}$ . The gradient updating rule yields that

	$\displaystyle\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)$	$\displaystyle=-\eta\frac{\partial L(W(t),\tilde{W}(t))}{\partial\tilde{w}_{rk}}$
		$\displaystyle=-\eta\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}s^{t}(u_{i})(y_{j_{1}})\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial\tilde{w}_{rk}}--\eta\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}h^{t}(u_{i})(\tilde{y}_{j_{2}})\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial\tilde{w}_{rk}}.$

For the gradient term, we have

	$\displaystyle\left\\|\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial\tilde{w}_{rk}}\right\\|_{2}$
	$\displaystyle=\left\\|\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\frac{1}{\sqrt{p}}\tilde{a}_{rk}u_{i}I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j_{1}}))\right\\|_{2}$
	$\displaystyle\lesssim\frac{1}{\sqrt{n_{2}}}\frac{B_{1}^{2}B_{2}}{\sqrt{mp}}$

and

	$\displaystyle\left\\|\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial\tilde{w}_{rk}}\right\\|_{2}$
	$\displaystyle=\left\\|\frac{1}{\sqrt{n_{3}}}\frac{1}{\sqrt{m}}\frac{1}{\sqrt{p}}\tilde{a}_{rk}u_{i}I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}\sigma_{3}(w_{r}(t)^{T}\tilde{y}_{j_{2}})\right\\|_{2}$
	$\displaystyle\lesssim\frac{1}{\sqrt{n_{3}}}\frac{B_{2}^{3}}{\sqrt{mp}}.$

Theorefore, we obtain that

	$\displaystyle\\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\\|_{2}$	$\displaystyle\lesssim\frac{\eta\sqrt{n_{1}}B_{1}^{2}B_{2}}{\sqrt{mp}}\sqrt{\sum\limits_{i=1}^{n_{1}}\\|s(u_{i})\\|_{2}^{2}}+\frac{\eta\sqrt{n_{1}}B_{2}^{3}}{\sqrt{mp}}\sqrt{\sum\limits_{i=1}^{n_{1}}\\|h(u_{i})\\|_{2}^{2}}$		(77)
		$\displaystyle\lesssim\frac{\eta\sqrt{n_{1}}B_{1}^{2}B_{2}}{\sqrt{mp}}\\|G^{t}(u)\\|_{2},$		(77)

where the first inequality follows from Cauchy’s inequality.

Summing $t$ from $0$ to $T$ yields that

$\displaystyle\\|\tilde{w}_{rk}(T)-\tilde{w}_{rk}(0)\\|_{2}$	$\displaystyle\leq\sum\limits_{t=0}^{T}\\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\\|_{2}$	(78)
	$\displaystyle\lesssim\sum\limits_{t=1}^{T}\frac{\eta\sqrt{n_{1}}B_{1}^{2}B_{2}}{\sqrt{mp}}\left(1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}\right)^{t/2}\\|G^{0}(u)\\|_{2}$
	$\displaystyle\lesssim\frac{\sqrt{n_{1}}B_{1}^{2}B_{2}\\|G^{0}(u)\\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})},$

where the second inequality follows from the induction hypothesis.

Then, we estimate $\|w_{r}(t+1)-w_{r}(t)\|_{2}$ , the gradient descent updating rule yields that

	$\displaystyle\\|w_{r}(t+1)-w_{r}(t)\\|_{2}$	$\displaystyle=\left\\|-\eta\frac{\partial L(W(t),\tilde{W}(t))}{\partial w_{r}}\right\\|_{2}$		(79)
		$\displaystyle=\eta\left\\|\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}s^{t}(u_{i})(y_{j_{1}})\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial w_{r}}+\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}h^{t}(u_{i})(\tilde{y}_{j_{2}})\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial w_{r}}\right\\|_{2}.$		(79)

Recall that

	$\displaystyle\left\\|\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial w_{r}}\right\\|_{2}$
	$\displaystyle=\left\\|\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j_{1}}))}{\partial w_{r}}\right\\|_{2}.$

Note that

\left|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right|\lesssim\sqrt{p}\tilde{R}^{{}^{\prime}}

and

\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j_{1}}))}{\partial w_{r}}\right\|_{2}\lesssim B_{1}B_{2}.

Thus, we have

		$\displaystyle\left\\|\frac{\partial s^{t}(u_{i})(y_{j_{1}})}{\partial w_{r}}\right\\|_{2}$		(80)
		$\displaystyle\leq\frac{1}{\sqrt{n_{2}m}}\left\|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right\|\left\\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j_{1}}))}{\partial w_{r}}\right\\|_{2}$
		$\displaystyle\quad+\frac{1}{\sqrt{n_{2}m}}\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right\|\left\\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j_{1}}))}{\partial w_{r}}\right\\|_{2}$
		$\displaystyle\lesssim\frac{\sqrt{p}R^{{}^{\prime}}B_{1}B_{2}}{\sqrt{mn_{2}}}+\frac{B_{1}B_{2}}{\sqrt{mn_{2}}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}$
		$\displaystyle\lesssim\frac{B_{1}B_{2}}{\sqrt{mn_{2}}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)},$

where in the last inequality, we assume that $\sqrt{p}\tilde{R}^{{}^{\prime}}\lesssim\sqrt{\log(mn_{1}/\delta)}$ .

Similarly, since

\left\|\frac{\partial\sigma_{3}(w_{r}(t)^{T}\tilde{y}_{j_{2}})}{\partial w_{r}}\right\|_{2}\lesssim B_{2}^{2},

we can obtain that

\displaystyle\left\|\frac{\partial h^{t}(u_{i})(\tilde{y}_{j_{2}})}{\partial w_{r}}\right\|_{2}\lesssim\frac{\sqrt{p}R^{{}^{\prime}}B_{2}^{2}}{\sqrt{mn_{3}}}+\frac{B_{2}^{2}}{\sqrt{mn_{3}}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}.

(81)

Combining (79), (80) and (81) yields that

	$\displaystyle\\|w_{r}(t+1)-w_{r}(t)\\|_{2}$	$\displaystyle\lesssim\frac{\eta\sqrt{n_{1}}B_{1}B_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\sqrt{\sum\limits_{i=1}^{n_{1}}\\|s(u_{i})\\|_{2}^{2}}+\frac{\eta\sqrt{n_{1}}B_{2}^{2}}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\sqrt{\sum\limits_{i=1}^{n_{1}}\\|h(u_{i})\\|_{2}^{2}}$		(82)
		$\displaystyle\lesssim\frac{\eta\sqrt{n_{1}}B_{1}B_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\\|G^{t}(u)\\|_{2},$		(82)

where the first inequality follows from Cauchy’s inequality.

Summing $t$ from $0$ to $T$ yields that

$\displaystyle\\|w_{r}(T+1)-w_{r}(0)\\|_{2}$	$\displaystyle\leq\sum\limits_{t=1}^{T}\\|w_{r}(t+1)-w_{r}(t)\\|_{2}$	(83)
	$\displaystyle\lesssim\sum\limits_{t=1}^{T-1}\frac{\eta\sqrt{n_{1}}B_{1}^{2}B_{2}}{\sqrt{mp}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\left(1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}\right)^{t/2}\\|G^{0}(u)\\|_{2}$
	$\displaystyle\lesssim\frac{\sqrt{n_{1}}B_{1}B_{2}\\|G^{0}(u)\\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}.$

∎

8.10. Proof of Lemma 10

Proof.

From the form of the residual $I(t)$ , it suffices to estimate

s^{t+1}(u_{i})(y_{j})-s^{t}(u_{i})(y_{j})-\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle

and

h^{t+1}(u_{i})(\tilde{y}_{j})-h^{t}(u_{i})(\tilde{y}_{j})-\left\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j})}{\partial w},w(t+1)-w(t)\right\rangle-\left\langle\frac{\partial h^{t}(u_{i})(\tilde{y}_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle,

which we denote by $I_{i,j}^{s}(t)$ and $I_{i,j}^{h}(t)$ , respectively. In fact, we only need to estimate $I_{i,j}^{s}(t)$ , since $\mathcal{L}(u)$ includes the term $u$ , which is the same as the boundary term.

Recall that the shallow neural operator has the form

G(u)(y)=\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u)\right]\sigma_{3}(w_{r}^{T}y).

We first estimate $I_{i,j}^{s}(t)$ . We can explicitly express the difference $s^{t+1}(u_{i})(y_{j})-s^{t}(u_{i})(y_{j})$ as follows:

		$\displaystyle s^{t+1}(u_{i})(y_{j})-s^{t}(u_{i})(y_{j})$		(84)
		$\displaystyle=\frac{1}{\sqrt{n_{2}}}\left[\mathcal{L}G^{t+1}(u_{i})(y_{j})-\mathcal{L}G^{t}(u_{i})(y_{j})\right]$
		$\displaystyle=\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left(\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})\right]\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))\right.$
		$\displaystyle\quad\left.-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))\right)$
		$\displaystyle=\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(t)^{T}u_{i}))\right]\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))$
		$\displaystyle\quad+\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})\right]\left[\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))\right],$

where in the last equality, we split $s^{t+1}(u_{i})(y_{j})-s^{t}(u_{i})(y_{j})$ into two terms in order to estimate them separately later.

On the other hand, from the form of neural operator, we can obtain that

		$\displaystyle\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w},w(t+1)-w(t)\right\rangle$		(85)
		$\displaystyle=\sum\limits_{r=1}^{m}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle$
		$\displaystyle=\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle,$

and

		$\displaystyle\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}},\tilde{w}(t+1)-\tilde{w}(t)\right\rangle$		(86)
		$\displaystyle=\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}\left\langle\frac{\partial s^{t}(u_{i})(y_{j})}{\partial\tilde{w}_{rk}},\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\right\rangle$
		$\displaystyle=\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\frac{1}{\sqrt{p}}\left[\sum\limits_{k=1}^{p}\tilde{a}_{rk}I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}\right]\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j})).$

With the explicit expressions for each term of $I_{i,j}^{s}(t)$ , namely (84), (85), and (86), we can split $I_{i,j}^{s}(t)$ into two parts: the first part is the second term of (84) minus (86), and the second part is the first term of (84) minus (85). Specifically, let

	$\displaystyle I_{1}^{r}(t)$	$\displaystyle=\left(\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\left(\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(t)^{T}u_{i})-I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}\right)\right)$		(87)
		$\displaystyle\quad\cdot\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))$		(87)

and

	$\displaystyle I_{2}^{r}(t)$	$\displaystyle=\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})\right]\left[\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))\right]$		(88)
		$\displaystyle\quad-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle,$		(88)

where in the definition, we have omitted the indices $i,j,s$ for simplicity.

Then

I_{i,j}^{s}(t)=\frac{1}{\sqrt{n_{2}}}\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[I_{1}^{r}(t)+I_{2}^{r}(t)\right].

(89)

To estimate $I_{1}(r)$ , since

|\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))|\lesssim B_{1}^{2}B_{2},

it suffices to estimate

\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(t)^{T}u_{i})-I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}.

With a little abuse of notation, we let $\tilde{A}_{r,k}^{i}=\{\tilde{w}\in\mathbb{R}^{d+1}:\|\tilde{w}-\tilde{w}_{rk}(0)\|\leq\tilde{R}^{{}^{\prime}},I\{\tilde{w}^{T}u_{i}\geq 0\}\neq I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0\}\}$ , $\tilde{S}_{i}=\{(r,k)\in[m]\times[q]:I\{\tilde{A}_{r,k}^{i}\}=0\}$ and $\tilde{S}_{i}^{\perp}=[m]\times[p]\backslash\tilde{S}_{i}$ . Then we have that $P(\tilde{A}_{r,k}^{i})\lesssim\tilde{R}^{{}^{\prime}}$ . Note that $\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(0)\|_{2}\leq R^{{}^{\prime}},\|\tilde{w}_{rk}(t)-\tilde{w}_{rk}(0)\|_{2}\leq R^{{}^{\prime}}$ , thus for $(r,k)\in\tilde{S}_{i}$ , we have $I\{\tilde{w}_{rk}(t+1)^{T}u_{i}\geq 0\}=I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}$ . At this point,

		$\displaystyle\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(t)^{T}u_{i})-I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}$		(90)
		$\displaystyle=(\tilde{w}_{rk}(t+1)^{T}u_{i})I\{\tilde{w}_{rk}(t+1)^{T}u_{i}\geq 0\}-(\tilde{w}_{rk}(t)^{T}u_{i})I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}$
		$\displaystyle\quad-I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}$
		$\displaystyle=(\tilde{w}_{rk}(t+1)^{T}u_{i})I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}-(\tilde{w}_{rk}(t)^{T}u_{i})I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}$
		$\displaystyle\quad-I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}$
		$\displaystyle=0.$

On the other hand, for all $(r,k)\in[m]\times[p]$ ,

|\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(t)^{T}u_{i})-I\{\tilde{w}_{rk}(t)^{T}u_{i}\geq 0\}(\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t))^{T}u_{i}|\lesssim\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\|_{2}.

Therefore, for $I_{1}^{r}(t)$ , we have

\displaystyle|I_{1}^{r}(t)|

\displaystyle\lesssim\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}B_{1}^{2}B_{2}\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\|_{2}I\{(r,k)\in\tilde{S}_{i}^{\perp}\}.

Combining with (77) yields that

$\displaystyle\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\|I_{1}^{r}(t)\|$	$\displaystyle\lesssim\frac{1}{\sqrt{m}}\frac{1}{\sqrt{p}}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}B_{1}B_{2}\\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\\|_{2}I\{(r,k)\in\tilde{S}_{i}^{\perp}\}$	(91)
	$\displaystyle=\frac{1}{\sqrt{m}}\frac{1}{\sqrt{p}}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}B_{1}B_{2}\\|\tilde{w}_{rk}(t+1)-\tilde{w}_{rk}(t)\\|_{2}I\{\tilde{A}_{r,k}^{i}\}$
	$\displaystyle\lesssim\eta\sqrt{n_{1}}B_{1}^{4}B_{2}^{2}\\|G^{t}(u)\\|_{2}\frac{1}{mp}\sum\limits_{r=1}^{m}\sum\limits_{k=1}^{p}I\{\tilde{A}_{r,k}^{i}\}$
	$\displaystyle\lesssim\eta\sqrt{n_{1}}B_{1}^{4}B_{2}^{2}\\|G^{t}(u)\\|_{2}\tilde{R}^{{}^{\prime}}$
	$\displaystyle\lesssim\frac{\eta n_{1}B_{1}^{6}B_{2}^{3}\\|G^{0}(u)\\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\\|G^{t}(u)\\|_{2},$

where the last inequality follows from the Bernstein inequality and holds with probability at least $1-n_{1}\exp(-mp\tilde{R}^{{}^{\prime}})$ .

For $I_{2}^{r}(t)$ , we can rewrite it as follows:

$\displaystyle I_{2}^{r}(t)$	$\displaystyle=\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})\right]\left[\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))\right]$	(92)
	$\displaystyle\quad-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(t)^{T}u_{i})\right]\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle$
	$\displaystyle=\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\left[\sigma(\tilde{w}_{rk}(t+1)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right]\left[\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))\right]$
	$\displaystyle-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\left[\sigma(\tilde{w}_{rk}(t)^{T}u_{i})-\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right]\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle$
	$\displaystyle+\left[\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))-\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle\right]$
	$\displaystyle\quad\cdot\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]$
	$\displaystyle:=I_{2,1}^{r}(t)+I_{2,2}^{r}(t)+I_{2,3}^{r}(t).$

In the following, we will estimate the three items $I_{2,1}^{r}(t),I_{2,2}^{r}(t)$ and $I_{2,3}^{r}(t)$ separately.

For $I_{2,1}^{r}(t)$ and $I_{2,2}^{r}(t)$ , note that for $s=t,t+1$ , we have

\left|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]-\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right|\lesssim\sqrt{p}\tilde{R}^{{}^{\prime}}.

(93)

Moreover, we can deduce that

\left|\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))\right|\lesssim(B_{1}^{2}+B_{1}B_{2})\|w_{r}(t+1)-w_{r}(t)\|_{2}\lesssim B_{1}^{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}

(94)

and

\left\|\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}}\right\|_{2}\lesssim B_{1}B_{2}.

(95)

Combining (93), (94) and (95) yields that

|I_{2,1}^{r}(t)|\lesssim\sqrt{p}R^{{}^{\prime}}B_{1}^{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}

(96)

and

|I_{2,2}^{r}(t)|\lesssim\sqrt{p}R^{{}^{\prime}}B_{1}B_{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}\lesssim\sqrt{p}R^{{}^{\prime}}B_{1}^{2}\|w_{r}(t+1)-w_{r}(t)\|_{2}.

(97)

It remains to estimate $I_{2,3}^{r}(t)$ . From its form, it suffices to estimate

		$\displaystyle\mathcal{L}(\sigma_{3}(w_{r}(t+1)^{T}y_{j}))-\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))-\left\langle\frac{\partial\mathcal{L}(\sigma_{3}(w_{r}(t)^{T}y_{j}))}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle$		(98)
		$\displaystyle:=\tilde{I}_{1}(t)+\tilde{I}_{2}(t)+\tilde{I}_{3}(t),$		(98)

where $\tilde{I}_{1}(t),\tilde{I}_{2}(t)$ and $\tilde{I}_{3}(t)$ are respectively related to the first-order term, the second-order term, and the zeroth-order term of the PDE. Sepecifically,

\tilde{I}_{1}(t)=w_{r0}(t+1)\sigma_{3}^{{}^{\prime}}(w_{r}(t+1)^{T}y_{j})-w_{r0}(t)\sigma_{3}^{{}^{\prime}}(w_{r}(t)^{T}y_{j})-\left\langle\frac{\partial w_{r0}(t)\sigma_{3}^{{}^{\prime}}(w_{r}(t)^{T}y_{j})}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle,

	$\displaystyle\tilde{I}_{2}(t)$	$\displaystyle=-\left[\\|w_{r1}(t+1)\\|_{2}^{2}\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t+1)^{T}y_{j})-\\|w_{r1}(t)\\|_{2}^{2}\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t)^{T}y_{j})\right.$
		$\displaystyle\quad\left.-\left\langle\frac{\partial\\|w_{r1}(t)\\|_{2}^{2}\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t)^{T}y_{j})}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle\right]$

and

\tilde{I}_{3}(t)=\sigma_{3}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}(w_{r}(t)^{T}y_{j})-\left\langle\frac{\partial\sigma_{3}(w_{r}(t)^{T}y_{j})}{\partial w_{r}},w_{r}(t+1)-w_{r}(t)\right\rangle.

Note that both $\tilde{I}_{1}(t)$ and $\tilde{I}_{2}(t)$ have the form

f(w^{{}^{\prime}})g(w^{{}^{\prime}})-f(w)g(w)-\left\langle\frac{\partial f(w)g(w)}{\partial w},w^{{}^{\prime}}-w\right\rangle.

However, due to the non-differentiability of the ReLU function, we cannot perform a second-order expansion; instead, we decompose it into

		$\displaystyle f(w^{{}^{\prime}})g(w^{{}^{\prime}})-f(w)g(w)-\left\langle\frac{\partial f(w)g(w)}{\partial w},w^{{}^{\prime}}-w\right\rangle$		(99)
		$\displaystyle=f(w^{{}^{\prime}})g(w^{{}^{\prime}})-f(w)g(w)-\left\langle\frac{\partial f(w)}{\partial w}g(w),w^{{}^{\prime}}-w\right\rangle-\left\langle\frac{\partial g(w)}{\partial w}f(w),w^{{}^{\prime}}-w\right\rangle$
		$\displaystyle=f(w^{{}^{\prime}})\left[g(w^{{}^{\prime}})-g(w)-\left\langle\frac{\partial g(w)}{\partial w},w^{{}^{\prime}}-w\right\rangle\right]+g(w)\left[f(w^{{}^{\prime}})-f(w)-\left\langle\frac{\partial f(w)}{\partial w},w^{{}^{\prime}}-w\right\rangle\right]$
		$\displaystyle\quad+[f(w^{{}^{\prime}})-f(w)]\left\langle\frac{\partial g(w)}{\partial w},w^{{}^{\prime}}-w\right\rangle.$

Thus, for $\tilde{I}_{1}(t)$ , we have

$\displaystyle\tilde{I}_{1}(t)$	$\displaystyle=w_{r0}(t+1)\sigma_{3}^{{}^{\prime}}(w_{r}(t+1)^{T}y_{j})-w_{r0}(t)\sigma_{3}^{{}^{\prime}}(w_{r}(t)^{T}y_{j})$	(100)
	$\displaystyle\quad-[\sigma_{3}^{{}^{\prime}}(w_{r}(t)^{T}y_{j})(w_{r0}(t+1)-w_{r0}(t))+w_{r0}(t)\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}]$
	$\displaystyle=w_{r0}(t+1)[\sigma_{3}^{{}^{\prime}}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}^{{}^{\prime}}(w_{r}(t)^{T}y_{j})]-w_{r0}(t)\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}$
	$\displaystyle=[w_{r0}(t+1)-w_{r0}(t)][\sigma_{3}^{{}^{\prime}}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}^{{}^{\prime}}(w_{r}(t)^{T}y_{j})]$
	$\displaystyle\quad+w_{r0}(t)[\sigma_{3}^{{}^{\prime}}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}^{{}^{\prime}}(w_{r}(t)^{T}y_{j})-\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}].$

We apply mean value theorem for the first term in (100) and obtain that

	$\displaystyle\|\sigma_{3}^{{}^{\prime}}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}^{{}^{\prime}}(w_{r}(t)^{T}y_{j})\|$	$\displaystyle=\|w_{r}(t+1)^{T}y_{j}-w_{r}(t)^{T}y_{j}\|\|\sigma_{3}^{{}^{\prime\prime}}(\xi)\|$
		$\displaystyle\lesssim B_{2}\\|w_{r}(t+1)-w_{r}(t)\\|_{2}.$

Similarly, for the second term, we have

	$\displaystyle\|\sigma_{3}^{{}^{\prime}}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}^{{}^{\prime}}(w_{r}(t)^{T}y_{j})-\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}\|$
	$\displaystyle=\|\sigma_{3}^{{}^{\prime\prime}}(\xi)-\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t)^{T}y_{j})\|\|(w_{r}(t+1)-w_{r}(t))^{T}y_{j}\|$
	$\displaystyle\lesssim\\|w_{r}(t+1)-w_{r}(t)\\|_{2}^{2}.$

Thus, for $\tilde{I}_{1}(t)$ , we have

	$\displaystyle\|\tilde{I}_{1}(t)\|$	$\displaystyle\lesssim B_{2}\\|w_{r}(t+1)-w_{r}(t)\\|_{2}^{2}+B_{1}\\|w_{r}(t+1)-w_{r}(t)\\|_{2}^{2}$		(101)
		$\displaystyle\lesssim B_{1}\\|w_{r}(t+1)-w_{r}(t)\\|_{2}^{2}.$		(101)

For $\tilde{I}_{2}(t)$ , with same decomposition in (99), we have

$\displaystyle\tilde{I}_{2}(t)$	$\displaystyle=\\|w_{r1}(t+1)\\|_{2}^{2}\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t+1)^{T}y_{j})-\\|w_{r1}(t)\\|_{2}^{2}\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t)^{T}y_{j})$	(102)
	$\displaystyle\quad-[2\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t)^{T}y_{j})(w_{r1}(t+1)-w_{r1}(t))^{T}w_{r1}(t)+\\|w_{r1}(t)\\|_{2}^{2}\sigma_{3}^{{}^{\prime\prime\prime}}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}]$
	$\displaystyle=\\|w_{r1}(t+1)\\|_{2}^{2}\left[\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t+1)^{T}y_{j})-\sigma_{3}^{{}^{\prime\prime}}(w_{r}(t)^{T}y_{j})-\sigma_{3}^{{}^{\prime\prime\prime}}(w_{r}(t+1)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}\right]$
	$\displaystyle\quad+\\|w_{r1}(t)\\|_{2}^{2}\left[\\|w_{r1}(t+1)\\|_{2}^{2}-\\|w_{r1}(t)\\|_{2}^{2}-2w_{r1}(t)^{T}(w_{r1}(t+1)-w_{r1}(t))\right]$
	$\displaystyle\quad+[\\|w_{r1}(t+1)\\|_{2}^{2}-\\|w_{r1}(t)\\|_{2}^{2}]\sigma_{3}^{{}^{\prime\prime\prime}}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}.$

For the first term, similar to (90), for $r\in S_{j}$ , we have

\sigma(w_{r}(t+1)^{T}y_{j})-\sigma(w_{r}(t)^{T}y_{j})-I\{w_{r}(t+1)^{T}y_{j}\geq 0\}(w_{r}(t+1)-w_{r}(t))^{T}y_{j}=0,

thus

|\sigma(w_{r}(t+1)^{T}y_{j})-\sigma(w_{r}(t)^{T}y_{j})-I\{w_{r}(t+1)^{T}y_{j}\geq 0\}(w_{r}(t+1)-w_{r}(t))^{T}y_{j}|\lesssim\|w_{r}(t+1)-w_{r}(t)\|_{2}I\{r\in S_{j}^{\perp}\}.

For the second term, the mean value theorem yields that

|\|w_{r1}(t+1)\|_{2}^{2}-\|w_{r1}(t)\|_{2}^{2}-2w_{r1}(t)^{T}(w_{r1}(t+1)-w_{r1}(t))|\lesssim\|w_{r1}(t+1)-w_{r1}(t)\|_{2}.

For the third term, we have

|\|w_{r1}(t+1)\|_{2}^{2}-\|w_{r1}(t)\|_{2}^{2}]\sigma_{3}^{{}^{\prime\prime\prime}}(w_{r}(t)^{T}y_{j})(w_{r}(t+1)-w_{r}(t))^{T}y_{j}|\lesssim B_{1}\|w_{r}(t+1)-w_{r}(t)\|_{2}^{2}.

Combining these result for $\tilde{I}_{1}(t),\tilde{I}_{2}(t)$ and $\tilde{I}_{3}(t)$ yields that

	$\displaystyle\|I_{2,3}^{r}(t)\|$	$\displaystyle\leq\left\|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right\|[\|\tilde{I}_{1}(t)\|+\|\tilde{I}_{2}(t)\|+\|\tilde{I}_{3}(t)\|]$		(103)
		$\displaystyle\lesssim\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\left[B_{1}^{2}\\|w_{r}(t+1)-w_{r}(t)\\|_{2}^{2}+B_{1}^{2}\\|w_{r}(t+1)-w_{r}(t)\\|_{2}I\{r\in S_{j}^{\perp}\}\right].$		(103)

With these estimations for $I_{2,1}^{r}(t),I_{2,2}^{r}(t)$ and $I_{2,3}^{r}(t)$ , i.e., (96), (97) and (98), we obtain that

$\displaystyle\|I_{2}^{r}(t)\|$	$\displaystyle\leq\|I_{2,1}^{r}(t)\|+\|I_{2,2}^{r}(t)\|+\|I_{2,3}^{r}(t)\|$	(104)
	$\displaystyle\lesssim\sqrt{p}\tilde{R}^{{}^{\prime}}B_{1}^{2}\\|w_{r}(t+1)-w_{r}(t)\\|_{2}$
	$\displaystyle\quad+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\left[B_{1}^{2}\\|w_{r}(t+1)-w_{r}(t)\\|_{2}^{2}+B_{1}^{2}\\|w_{r}(t+1)-w_{r}(t)\\|_{2}I\{r\in S_{j}^{\perp}\}\right].$

Recall that (82) shows that

\|w_{r}(t+1)-w_{r}(t)\|_{2}\lesssim\frac{\eta\sqrt{n_{1}}B_{1}B_{2}}{\sqrt{m}}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\|G^{t}(u)\|_{2}

and

\tilde{R}^{{}^{\prime}}=\frac{\sqrt{n_{1}}B_{1}^{2}B_{2}\|G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})},\ R^{{}^{\prime}}=\frac{\sqrt{n_{1}}B_{1}B_{2}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}.

Therefore, combining with the use of Bernstein inequality, summing $r$ yields that

	$\displaystyle\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\|I_{2}^{r}(t)\|$	$\displaystyle\lesssim\frac{\eta n_{1}B_{1}^{5}B_{2}^{2}\\|G^{0}(u)\\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\\|G^{t}(u)\\|_{2}+\frac{\eta n_{1}B_{1}^{4}B_{2}^{2}\\|G^{0}(u)\\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{3/2}\left(\frac{mn_{1}}{\delta}\right)\\|G^{t}(u)\\|_{2}$		(105)
		$\displaystyle\lesssim\frac{\eta n_{1}B_{1}^{5}B_{2}^{2}\\|G^{0}(u)\\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{3/2}\left(\frac{mn_{1}}{\delta}\right)\\|G^{t}(u)\\|_{2}.$		(105)

Recall that (91) implies that

\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}|I_{1}^{r}(t)|\lesssim\frac{\eta n_{1}B_{1}^{6}B_{2}^{3}\|G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\|G^{t}(u)\|_{2}.

Therefore, we have

	$\displaystyle\\|I(t)\\|_{2}$	$\displaystyle\lesssim\frac{\eta(n_{1})^{3/2}B_{1}^{6}B_{2}^{3}\\|G^{0}(u)\\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\\|G^{t}(u)\\|_{2}+\frac{\eta(n_{1})^{3/2}B_{1}^{5}B_{2}^{2}\\|G^{0}(u)\\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{3/2}\left(\frac{mn_{1}}{\delta}\right)\\|G^{t}(u)\\|_{2}$		(106)
		$\displaystyle\lesssim\frac{\eta(n_{1})^{3/2}B_{1}^{6}B_{2}^{3}\\|G^{0}(u)\\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{3/2}\left(\frac{mn_{1}}{\delta}\right)\\|G^{t}(u)\\|_{2}.$		(106)

∎

8.11. Proof of Theorem 3

Proof.

It suffices to show that Condition 2 also holds for $s=T+1$ . From the iteration formula in Lemma 9, we have

	$\displaystyle\\|G^{T+1}(u)\\|_{2}^{2}$	$\displaystyle=\\|[I-\eta(H(T)+\tilde{H}(T))]G^{T}(u)+I(T)\\|_{2}^{2}$		(107)
		$\displaystyle=\\|[I-\eta(H(T)+\tilde{H}(T))]G^{T}(u)\\|_{2}^{2}+\\|I(T)\\|_{2}^{2}+2\left\langle[I-\eta(H(T)+\tilde{H}(T))]G^{T}(u),I(T)\right\rangle.$		(107)

From the stability of the Gram matrices, i.e., Lemma 8, when $R,\tilde{R}$ satisfy that

R\lesssim\frac{\lambda_{0}}{n_{1}B_{1}^{3}B_{2}^{2}},\ \tilde{R}\lesssim\frac{\tilde{\lambda}_{0}}{n_{1}\sqrt{p}B_{1}^{2}B_{2}^{2}},

we have

\|H(0)-H(T)\|_{2}\leq\frac{\lambda_{0}}{4},\|\tilde{H}(0)-\tilde{H}(T)\|_{2}\leq\frac{\tilde{\lambda}_{0}}{4}.

Thus, with Lemma 7, we can deduce that

\|H(T)-H^{\infty}\|_{2}\leq\frac{\lambda_{0}}{2},\|\tilde{H}(T)-\tilde{H}^{\infty}\|_{2}\leq\frac{\tilde{\lambda}_{0}}{2},

implying that $\lambda_{min}(H(T))\geq\lambda_{0}/2$ and $\lambda_{min}(\tilde{H}(T))\geq\tilde{\lambda}_{0}/2$ .

Therefore, when $\eta=\mathcal{O}(1/(\|H^{\infty}\|_{2}+\|\tilde{H}^{\infty}\|_{2}))$ , we have that $I-\eta(H(T)+\tilde{H}(T))$ is positive definite and then

\|I-\eta(H(T)+\tilde{H}(T))\|_{2}\leq\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}.

(108)

Let $I(T)\leq\bar{R}\|G^{T}(u)\|_{2}$ , then combining (107) and (108) yields that

$\displaystyle\\|G^{T+1}(u)\\|_{2}^{2}$	$\displaystyle\leq\\|[I-\eta(H(T)+\tilde{H}(T))]G^{T}(u)\\|_{2}^{2}+\\|I(T)\\|_{2}^{2}+2\\|[I-\eta(H(T)+\tilde{H}(T))]G^{T}(u)\\|_{2}\\|I(T)\\|_{2}$	(109)
	$\displaystyle\leq\left[\left(1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}\right)^{2}+\bar{R}^{2}+2\bar{R}\left(1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}\right)\right]\\|G^{T}(u)\\|_{2}^{2}$
	$\displaystyle\leq\left(1-\frac{\eta(\lambda_{0}+\tilde{\lambda}_{0})}{2}\right)\\|G^{T}(u)\\|_{2}^{2},$

where the last inequality requires that $\bar{R}\lesssim\eta(\lambda_{0}+\tilde{\lambda}_{0})$ .

Finally, we need to specify the requirements for $m$ to ensure that the aforementioned conditions are satisfied. Recall that first, $m$ needs to satisfy that $R^{{}^{\prime}}\leq R$ and $\tilde{R}^{{}^{\prime}}\leq\tilde{R}$ , i.e.,

\frac{\sqrt{n_{1}}B_{1}B_{2}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\lesssim\frac{\lambda_{0}}{n_{1}B_{1}^{3}B_{2}^{2}},\ \frac{\sqrt{n_{1}}B_{1}^{2}B_{2}\|G^{0}(u)\|_{2}}{\sqrt{mp}(\lambda_{0}+\tilde{\lambda}_{0})}\lesssim\frac{\tilde{\lambda}_{0}}{n_{1}\sqrt{p}B_{1}^{2}B_{2}^{2}}.

(110)

Simple algebraic operations yield that

m=\Omega\left(\frac{n_{1}^{3}B_{1}^{8}B_{2}^{6}\|G^{0}\|_{2}^{2}}{(\lambda_{0}+\tilde{\lambda}_{0})^{2}min(\lambda_{0}^{2},\tilde{\lambda}_{0}^{2})}\log\left(\frac{mn_{1}}{\delta}\right)\right).

(111)

Second, $m$ needs to satisfy that $\bar{R}\lesssim\eta(\lambda_{0}+\tilde{\lambda}_{0})$ , i.e.,

\frac{\eta(n_{1})^{3/2}B_{1}^{6}B_{2}^{3}\|G^{0}(u)\|_{2}}{\sqrt{m}(\lambda_{0}+\tilde{\lambda}_{0})}\log^{3/2}\left(\frac{mn_{1}}{\delta}\right)\lesssim\eta(\lambda_{0}+\tilde{\lambda}_{0}),

implying that

m=\Omega\left(\frac{n_{1}^{3}B_{1}^{12}B_{2}^{6}\|G^{0}(u)\|_{2}^{2}}{(\lambda_{0}+\tilde{\lambda}_{0})^{4}}\log^{3}\left(\frac{mn_{1}}{\delta}\right)\right).

(112)

Finally, combining (111), (112) and the estimation of $\|G^{0}(u)\|_{2}^{2}$ in Lemma 17, we have that

m=\tilde{\Omega}\left(\frac{n_{1}^{4}d^{7}}{(\lambda_{0}+\tilde{\lambda}_{0})^{2}min(\lambda_{0}^{2},\tilde{\lambda}_{0}^{2})}\right),

where $\tilde{\Omega}$ indicates that some terms involving $\log(n_{1})$ , $\log(n_{2})$ and $\log(m)$ are omitted.

∎

9. Auxiliary Lemmas

Lemma 15 (Anti-concentration of Gaussian distribution).

Let $X\sim\mathcal{N}(0,\sigma^{2})$ , then for any $t>0$ ,

\frac{2}{3}\frac{t}{\sigma}<P(|X|\leq t)<\frac{4}{5}\frac{t}{\sigma}.

Lemma 16 (Bernstein inequality, Theorem 3.1.7 in [21]).

Let $X_{i}$ , $1\leq i\leq n$ be independent centered random variables a.s. bounded by $c<\infty$ in absolute value. Set $\sigma^{2}=1/n\sum_{i=1}^{n}\mathbb{E}X_{i}^{2}$ and $S_{n}=1/n\sum_{i=1}^{n}X_{i}$ . Then, for all $t\geq 0$ ,

P\left(S_{n}\geq\sqrt{\frac{2\sigma^{2}t}{n}}+\frac{ct}{3n}\right)\leq e^{-u}.

First, we provide some preliminaries about Orlicz norms.

Let $g:[0,\infty)\rightarrow[0,\infty)$ be a non-decreasing convex function with $g(0)=0$ . The $g$ -Orlicz norm of a real-valued random variable $X$ is given by

\|X\|_{g}:=\inf\left\{C>0:\mathbb{E}\left[g\left(\frac{|X|}{C}\right)\right]\leq 1\right\}.

If $\|X\|_{\psi_{\alpha}}<\infty$ , we say that $X$ is sub-Weibull of order $\alpha>0$ , where

\psi_{\alpha}(x):=e^{x^{\alpha}}-1.

Note that when $\alpha\geq 1$ , $\|\cdot\|_{\psi_{\alpha}}$ is a norm and when $0<\alpha<1$ , $\|\cdot\|_{\psi_{\alpha}}$ is a quasi-norm. In the related proofs, we may frequently use the fact that for real-valued random variable $X\sim\mathcal{N}(0,1)$ , we have $\|X\|_{\psi_{2}}\leq\sqrt{6}$ and $\|X^{2}\|_{\psi_{1}}=\|X\|_{\psi_{2}}^{2}\leq 6$ . Moreover, when $\|X\|_{\psi_{2}}<\infty,\|Y\|_{\psi_{2}}<\infty$ , we have $\|XY\|_{\psi_{1}}\leq\|X\|_{\psi_{2}}\|Y\|_{\psi_{2}}$ . Since without loss of generality, we can assume that $\|X\|_{\psi_{2}}=\|Y\|_{\psi_{2}}=1$ , then

\mathbb{E}\left[e^{|XY|}\right]\leq\mathbb{E}\left[e^{\frac{|X|^{2}}{2}+\frac{|Y|^{2}}{2}}\right]=\mathbb{E}\left[e^{\frac{|X|^{2}}{2}}e^{\frac{|Y|^{2}}{2}}\right]\leq\frac{1}{2}\mathbb{E}\left[e^{|X|^{2}}+e^{|Y|^{2}}\right]\leq 1,

where the first inequality and the second inequality follow from the inequality $2ab\leq a^{2}+b^{2}$ for $a\geq 0,b\geq 0$ .

Lemma 17 (Theorem 3.1 in [22]).

If $X_{1},\cdots,X_{n}$ are independent mean zero random variables with $\|X_{i}\|_{\psi_{\alpha}}<\infty$ for all $1\leq i\leq n$ and some $\alpha>0$ , then for any vector $a=(a_{1},\cdots,a_{n})\in\mathbb{R}^{n}$ , the following holds true:

P\left(\left|\sum\limits_{i=1}^{n}a_{i}X_{i}\right|\geq 2eC(\alpha)\|b\|_{2}\sqrt{t}+2eL_{n}^{*}(\alpha)t^{1/\alpha}\|b\|_{\beta(\alpha)}\right)\leq 2e^{-t},\ for\ all\ t\geq 0,

where $b=(a_{1}\|X_{1}\|_{\psi_{\alpha}},\cdots,a_{n}\|X_{n}\|_{\psi_{\alpha}})\in\mathbb{R}^{n}$ ,

C(\alpha):=\max\{\sqrt{2},2^{1/\alpha}\}\left\{\begin{aligned} \sqrt{8}(2\pi)^{1/4}e^{1/24}(e^{2/e}/\alpha)^{1/\alpha}&,&if\ \alpha<1,\\ 4e+2(\log 2)^{1/\alpha}&,&if\ \alpha\geq 1.\end{aligned}\right.

and for $\beta(\alpha)=\infty$ when $\alpha\leq 1$ and $\beta(\alpha)=\alpha/(\alpha-1)$ when $\alpha>1$ ,

L_{n}(\alpha):=\frac{4^{1/\alpha}}{\sqrt{2}\|b\|_{2}}\times\left\{\begin{aligned} &\|b\|_{\beta(\alpha)},&if\ \alpha<1,\\ &4e\|b\|_{\beta(\alpha)}/C(\alpha),&if\alpha\geq 1.\end{aligned}\right.

and $L^{*}_{n}(\alpha)=L_{n}(\alpha)C(\alpha)\|b\|_{2}/\|b\|_{\beta(\alpha)}$ .

Lemma 18.

For any $r\in[m]$ , we have that with probability at least $1-\delta$ ,

\left|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\right|\lesssim\sqrt{\log\left(\frac{1}{\delta}\right)}.

Moreover, its $\psi_{2}$ -norm is a universal constant.

Proof.

Note that $\mathbb{E}[\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)]=0$ and $\|\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\|_{\psi_{2}}\leq\||\tilde{w}_{rk}(0)^{T}u|\|_{\psi_{2}}=\mathcal{O}(1)$ , then applying Lemma 17 yields that with probability at least $1-\delta$ ,

\left|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\right|\lesssim\sqrt{\log\left(\frac{1}{\delta}\right)}.

Moreover, from the equivalence of the $\psi_{2}$ norm and the concentration inequality (see Lemma 2.2.1 in [23]), it follows that

\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\right\|_{\psi_{2}}=\mathcal{O}(1).

∎

Lemma 19.

With probability at least $1-\delta$ , we have

L(0)\lesssim n_{1}n_{2}\log\left(\frac{n_{1}n_{2}}{\delta}\right).

Proof.

	$\displaystyle L(0)$	$\displaystyle=\frac{1}{2}\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}(G^{0}(u_{i})(y_{j})-z_{j}^{i})^{2}$
		$\displaystyle\leq\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}(G^{0}(u_{i})(y_{j}))^{2}+(z_{j}^{i})^{2},$

Recall that

G^{0}(u)(y)=\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\right]\sigma(w_{r}(0)^{T}y).

Note that

\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\right\|_{\psi_{2}}=\mathcal{O}(1),\ \left\|\sigma(w_{r}(0)^{T}y_{j})\right\|_{\psi_{2}}=\mathcal{O}(1),

thus

\left\|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u)\right]\sigma(w_{r}(0)^{T}y_{j})\right\|_{\psi_{1}}=\mathcal{O}(1)

Then from Lemma 12, we have that with probability at least $1-\delta$ ,

L(0)\lesssim n_{1}n_{2}\left(\sqrt{\log\left(\frac{n_{1}n_{2}}{\delta}\right)}+\frac{\log(\frac{n_{1}n_{2}}{\delta})}{\sqrt{m}}\right)^{2}\lesssim n_{1}n_{2}\log\left(\frac{n_{1}n_{2}}{\delta}\right).

∎

Lemma 20.

If $\|X\|_{\psi_{\alpha}},\|Y\|_{\psi_{\beta}}<\infty$ with $\alpha,\beta>0$ , then we have $\|XY\|_{\psi_{\gamma}}\leq\|X\|_{\psi_{\alpha}}\|Y\|_{\psi_{\beta}}$ , where $\gamma$ satisfies that

\frac{1}{\gamma}=\frac{1}{\alpha}+\frac{1}{\beta}.

Proof.

Without loss of generality, we can assume that $\|X\|_{\psi_{\alpha}}=\|Y\|_{\psi_{\beta}}=1$ . To prove this, let us use Young’s inequality, which states that

xy\leq\frac{x^{p}}{p}+\frac{y^{q}}{q},for\ x,y\geq 0,p,q>1.

Let $p=\alpha/\gamma,q=\beta/\gamma$ , then

	$\displaystyle\mathbb{E}[\exp(\|XY\|^{\gamma})]$	$\displaystyle\leq\mathbb{E}\left[\exp\left(\frac{\|X\|^{\gamma p}}{p}+\frac{\|Y\|^{\gamma q}}{q}\right)\right]$
		$\displaystyle=\mathbb{E}\left[\exp\left(\frac{\|X\|^{\alpha}}{p}\right)\exp\left(\frac{\|Y\|^{\beta}}{q}\right)\right]$
		$\displaystyle\leq\mathbb{E}\left[\frac{\exp(\|X\|^{\alpha})}{p}+\frac{\exp(\|Y\|^{\beta})}{q}\right]$
		$\displaystyle\leq\frac{2}{p}+\frac{2}{q}$
		$\displaystyle=2,$

where the first and second inequality follow from Young’s inequality. From this, we have that $\|XY\|_{\psi_{\gamma}}\leq\|X\|_{\psi_{\alpha}}\|Y\|_{\psi_{\beta}}$ .

∎

Lemma 21.

With probability at least $1-\delta$ , we have

\|G^{0}(u)\|_{2}^{2}=L(W(0),\tilde{W}(0))\lesssim n_{1}d\log\left(\frac{n_{1}(n_{2}+n_{3})}{\delta}\right).

Proof.

Recall that the loss of function of PINN is

L(W,\tilde{W})=\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{1}=1}^{n_{2}}\frac{1}{n_{2}}(\mathcal{L}G(u_{i})(y_{j_{1}})-f(y_{j_{1}}))^{2}+\sum\limits_{i=1}^{n_{1}}\sum\limits_{j_{2}=1}^{n_{3}}\frac{1}{n_{3}}(G(u_{i})(\tilde{y}_{j_{2}})-g(\tilde{y}_{j_{2}}))^{2}

and the shallow neural operator has the following form

G(u)(y)=\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}^{T}u)\right]\sigma_{3}(w_{r}^{T}y).

In order to estimate the initial value, it suffices to consider

\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u)\right]\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y))

and

\frac{1}{\sqrt{m}}\sum\limits_{r=1}^{m}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u)\right]\sigma_{3}(w_{r}(0)^{T}y).

Note that $|\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y))|\lesssim\|w_{r}(0)\|_{2}^{2}|w_{r}(0)^{T}y|$ , thus Lemma 20 implies that

\|\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y))\|_{\psi_{\frac{2}{3}}}\lesssim\|\|w_{r}(0)\|_{2}^{2}|w_{r}(0)^{T}y|\|_{\psi_{\frac{2}{3}}}\leq\|\|w_{r}(0)\|_{2}^{2}\|_{\psi_{1}}\||w_{r}(0)^{T}y|\|_{\psi_{2}}=\mathcal{O}(d).

Therefore, combining with Lemma 18 yields that

\left\|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u)\right]\mathcal{L}(\sigma_{3}(w_{r}(0)^{T}y))\right\|_{\psi_{\frac{1}{2}}}\lesssim\mathcal{O}(d).

Similarly, we can deduce that

\left\|\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}(0)\sigma(\tilde{w}_{rk}(0)^{T}u)\right]\sigma_{3}(w_{r}(0)^{T}y)\right\|_{\psi_{\frac{1}{2}}}\lesssim\mathcal{O}(d).

Finally, applying Lemma 17 leads to that with probability at least $1-\delta$ ,

\|G^{0}(u)\|_{2}^{2}=L(W(0),\tilde{W}(0))\lesssim n_{1}d\log\left(\frac{n_{1}(n_{2}+n_{3})}{\delta}\right).

∎

		$\displaystyle\\|z-G^{s+1}(u)\\|_{2}^{2}$		(14)
		$\displaystyle=\left\\|\left(I-\eta\left(H(s)+\tilde{H}(s)\right)\right)(z-G^{s}(u))\right\\|_{2}^{2}+\\|I(s)\\|_{2}^{2}-2\left\langle\left(I-\eta(H(s)+\tilde{H}(s))\right)(z-G^{s}(u)),I(s)\right\rangle$
		$\displaystyle\leq\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)^{2}\\|z-G^{s}(u)\\|_{2}^{2}+\\|I(s)\\|_{2}^{2}+2\left(1-\eta\frac{\lambda_{0}+\tilde{\lambda}_{0}}{2}\right)\\|z-G^{s}(u)\\|_{2}\\|I(s)\\|_{2},$

		$\displaystyle\|\sigma(w_{r}^{T}y_{i_{1}})\sigma(w_{r}^{T}y_{j_{1}})-\sigma(w_{r}(0)^{T}y_{i_{1}})\sigma(w_{r}(0)^{T}y_{j_{1}})\|$		(32)
		$\displaystyle=\|[(\sigma(w_{r}^{T}y_{i_{1}})-\sigma(w_{r}(0)^{T}y_{i_{1}}))]\sigma(w_{r}^{T}y_{j_{1}})+\sigma(w_{r}(0)^{T}y_{i_{1}})[\sigma(w_{r}^{T}y_{j_{1}})-\sigma(w_{r}(0)^{T}y_{j_{1}})]\|$
		$\displaystyle\lesssim R(\|\sigma(w_{r}^{T}y_{j_{1}})\|+\|\sigma(w_{r}(0)^{T}y_{i_{1}})\|)$
		$\displaystyle\lesssim R(\|\sigma(w_{r}(0)^{T}y_{j_{1}})\|+\|\sigma(w_{r}(0)^{T}y_{i_{1}})\|+R),$

		$\displaystyle\|\sigma(w_{r}(0)^{T}y_{i_{1}})\sigma(w_{r}(0)^{T}y_{j_{1}})(I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\})\|$		(34)
		$\displaystyle\lesssim M^{2}\|I\{\tilde{w}_{rk}^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0,\tilde{w}_{rk}^{T}u_{j}\geq 0\}\|$
		$\displaystyle\leq M^{2}\|I\{\tilde{w}_{rk}^{T}u_{i}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0\}\|+M^{2}\|I\{\tilde{w}_{rk}^{T}u_{i}\geq 0\}-I\{\tilde{w}_{rk}(0)^{T}u_{i}\geq 0\}\|$
		$\displaystyle\leq M^{2}(I\{\tilde{A}_{rk}^{i}\}+I\{\tilde{A}_{rk}^{j}\}).$

		$\displaystyle\left\\|\frac{d}{dt}w_{r}(s)\right\\|_{2}$		(36)
		$\displaystyle=\left\\|\frac{\partial L(W(s),\tilde{W}(s))}{\partial w_{r}}\right\\|_{2}$
		$\displaystyle=\left\\|\sum\limits_{i=1}^{n_{1}}\sum\limits_{j=1}^{n_{2}}\left(\frac{1}{\sqrt{m}}\left[\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right]y_{j}I\{w_{r}(s)^{T}y_{j}\geq 0\}\right)(G^{s}(u_{i})(y_{j})-z_{i}^{j})\right\\|_{2}$
		$\displaystyle\leq\frac{\sqrt{n_{1}n_{2}}}{\sqrt{m}}\left(\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right)\\|G^{s}(u)-z\\|_{2}$
		$\displaystyle\leq\frac{\sqrt{n_{1}n_{2}}}{\sqrt{m}}\left(\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}\right)exp(-\frac{(\lambda_{0}+\tilde{\lambda}_{0})s}{2})\\|G^{0}(u)-z\\|_{2},$

	$\displaystyle\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})\right\|$
	$\displaystyle=\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\left[\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]+\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right\|$
	$\displaystyle\leq\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\left[\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(s)^{T}u_{i})-\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right]\right\|+\left\|\frac{1}{\sqrt{p}}\sum\limits_{k=1}^{p}\tilde{a}_{rk}\sigma(\tilde{w}_{rk}(0)^{T}u_{i})\right\|$
	$\displaystyle\lesssim\sqrt{p}\tilde{R}+\sqrt{\log\left(\frac{mn_{1}}{\delta}\right)}.$