
Chenguang Duan: School of Mathematics and Statistics, Wuhan University, Wuhan 430072, P.R. China. E-mail: cgduan.math@whu.edu.cn
Yuling Jiao: School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China. E-mail: yulingjiaomath@whu.edu.cn
Yanming Lai: School of Mathematics and Statistics, Wuhan University, Wuhan 430072, P.R. China. E-mail: laiyanming@whu.edu.cn
Xiliang Lu: School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China. E-mail: xllv.math@whu.edu.cn
Qimeng Quan: School of Mathematics and Statistics, Wuhan University, Wuhan 430072, P.R. China. E-mail: quanqm@whu.edu.cn
Jerry Zhijian Yang: School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China. E-mail: zjyang.math@whu.edu.cn

Analysis of Deep Ritz Methods for Laplace Equations with Dirichlet Boundary Condition

Chenguang Duan    Yuling Jiao    Yanming Lai    Xiliang Lu    Qimeng Quan    Jerry Zhijian Yang
(Received: date / Accepted: date)
Abstract

Deep Ritz methods (DRM) have been proven numerically to be efficient in solving partial differential equations. In this paper, we present a convergence rate in the $H^{1}$ norm for deep Ritz methods for Laplace equations with Dirichlet boundary condition, where the error depends explicitly on the depth and width of the deep neural networks and on the number of samples. Furthermore, the depth and width of the networks can be chosen appropriately in terms of the number of training samples. The main idea of the proof is to decompose the total error of DRM into three parts: the approximation error, the statistical error and the error caused by the boundary penalty. We bound the approximation error in the $H^{1}$ norm with $\mathrm{ReLU}^{2}$ networks and control the statistical error via Rademacher complexity. In particular, we derive a bound on the Rademacher complexity of the non-Lipschitz composition of the gradient norm with a $\mathrm{ReLU}^{2}$ network, which is of independent interest. We also analyze the error induced by the boundary penalty method and give an a priori rule for tuning the penalty parameter.

Keywords:
deep Ritz methods · convergence rate · Dirichlet boundary condition · approximation error · Rademacher complexity
MSC:
65C20

1 Introduction

Partial differential equations (PDEs) are one of the fundamental mathematical models for studying a variety of phenomena arising in science and engineering. Many conventional numerical methods have been successfully established for solving PDEs in low dimensions ($d\leq 3$), particularly the finite element method brenner2007mathematical ; ciarlet2002finite ; Quarteroni2008Numerical ; Thomas2013Numerical ; Hughes2012the . However, one encounters difficulties in both theoretical analysis and numerical implementation when extending conventional numerical schemes to high-dimensional PDEs. The classical analysis of convergence, stability and other properties becomes troublesome due to the complex construction of finite element spaces ciarlet2002finite ; brenner2007mathematical . Moreover, in terms of practical computation, the scale of the discrete problem increases exponentially with respect to the dimension.

Motivated by the well-known fact that deep learning methods for high-dimensional data analysis have achieved great success in discriminative, generative and reinforcement learning he2015delving ; Goodfellow2014Generative ; silver2016mastering , solving high-dimensional PDEs with deep neural networks has become an extremely promising approach and has attracted much attention Cosmin2019Artificial ; Justin2018DGM ; DeepXDE ; raissi2019physics ; Weinan2017The ; Yaohua2020weak ; Berner2020Numerically ; Han2018solving . Roughly speaking, these works can be divided into three categories. The first category uses deep neural networks to improve classical numerical methods, see for example Kiwon2020Solver ; Yufei2020Learning ; hsieh2018learning ; Greenfeld2019Learning . In the second category, neural operators are introduced to learn mappings between infinite-dimensional spaces with neural networks Li2020Advances ; anandkumar2020neural ; li2021fourier . In the last category, deep neural networks are used to approximate the solutions of PDEs directly, including physics-informed neural networks (PINNs) raissi2019physics , the deep Ritz method (DRM) Weinan2017The and weak adversarial networks (WAN) Yaohua2020weak . PINNs are based on residual minimization for solving PDEs Cosmin2019Artificial ; Justin2018DGM ; DeepXDE ; raissi2019physics . Proceeding from the variational form, Weinan2017The ; Yaohua2020weak ; Xu2020finite propose neural-network-based methods related to the classical Ritz and Galerkin methods. In Yaohua2020weak , WAN are proposed, inspired by the Galerkin method. Based on the Ritz method, Weinan2017The proposes the DRM to solve variational problems corresponding to a class of PDEs.

1.1 Related works and contributions

The idea of using neural networks to solve PDEs goes back to the 1990s Isaac1998Artificial ; Dissanayake1994neural . Although there have been great empirical achievements in recent years, a challenging and interesting question is to provide a rigorous error analysis, as is available for the finite element method. Several recent efforts have been devoted to making progress along this line, see for example e2020observations ; Luo2020TwoLayerNN ; Mishra2020EstimatesOT ; Mller2021ErrorEF ; lu2021priori ; hong2021rademacher ; Shin2020ErrorEO ; Wang2020WhenAW ; e2021barron . In Luo2020TwoLayerNN , the least squares minimization method with two-layer neural networks is studied; the optimization error under the assumption of over-parametrization and the generalization error without the over-parametrization assumption are analyzed. In lu2021priori ; Xu2020finite , generalization error bounds for two-layer neural networks are derived under the assumption that the exact solutions lie in the spectral Barron space.

The Dirichlet boundary condition corresponds to a constrained minimization problem, which may cause some difficulties in computation. The penalty method has been applied in finite element methods and finite volume methods Babuska1973The ; Maury2009Numerical . It has also been used in deep PDE solvers Weinan2017The ; raissi2019physics ; Xu2020finite , since it is not easy to construct a network with given values on the boundary. In this work, we also apply the penalty method to DRM with $\mathrm{ReLU}^{2}$ activation functions and obtain an error estimate. The main contributions are listed as follows:

  • We derive a bound on the approximation error of deep $\mathrm{ReLU}^{2}$ networks in the $H^{1}$ norm, which is of independent interest, see Theorem 3.2. That is, for any $u_{\lambda}^{*}\in H^{2}(\Omega)$, there exists a $\mathrm{ReLU}^{2}$ network $\bar{u}_{\bar{\phi}}$ with depth $\mathcal{D}\leq\lceil\log_{2}d\rceil+3$ and width $\mathcal{W}\leq\mathcal{O}\left(4d\left\lceil\frac{1}{\epsilon}-4\right\rceil^{d}\right)$ (where $d$ is the dimension), such that

    \left\|u_{\lambda}^{*}-\bar{u}_{\bar{\phi}}\right\|^{2}_{H^{1}\left(\Omega\right)}\leq\epsilon^{2}\quad\mbox{and}\quad\left\|Tu_{\lambda}^{*}-T\bar{u}_{\bar{\phi}}\right\|^{2}_{L^{2}\left(\partial\Omega\right)}\leq C_{d}\epsilon^{2}.
  • We establish a bound on the statistical error in DRM with the tools of pseudo-dimension. In particular, we give a bound on

    \mathbb{E}_{Z_{i},\sigma_{i},i=1,...,n}\left[\sup_{u_{\phi}\in\mathcal{N}^{2}}\frac{1}{n}\left|\sum_{i}\sigma_{i}\left\|\nabla u_{\phi}(Z_{i})\right\|^{2}\right|\right],

    i.e., the Rademacher complexity of the non-Lipschitz composition of the gradient norm and a $\mathrm{ReLU}^{2}$ network, via calculating the pseudo-dimension of networks with both $\mathrm{ReLU}$ and $\mathrm{ReLU}^{2}$ activation functions, see Theorem 3.3. The technique used here is also helpful for bounding the statistical errors of other deep PDE solvers.

  • We give an upper bound on the error caused by the Robin approximation without additional assumptions, i.e., we bound the error between the minimizer $u_{\lambda}^{*}$ of the penalized form and the weak solution $u^{*}$ of the Laplace equation, see Theorem 3.4,

    \|u_{\lambda}^{*}-u^{*}\|_{H^{1}(\Omega)}\leq\mathcal{O}(\lambda^{-1}).

    This result improves the one established in Mller2021ErrorEF ; muller2020deep ; hong2021rademacher .

  • Based on the above error bounds, we establish a nonasymptotic convergence rate of the deep Ritz method for the Laplace equation with Dirichlet boundary condition. We prove that if we set

    \mathcal{D}\leq\lceil\log_{2}d\rceil+3,\quad\mathcal{W}\leq\mathcal{O}\left(4d\left\lceil\left(\frac{n}{\log n}\right)^{\frac{1}{2(d+2)}}-4\right\rceil^{d}\right)

    and

    \lambda\sim n^{\frac{1}{3(d+2)}}(\log n)^{-\frac{d+3}{3(d+2)}},

    then it holds that

    \mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq\mathcal{O}\left(n^{-\frac{2}{3(d+2)}}\log n\right),

    where $n$ is the number of training samples on both the domain and the boundary (see the worked instance after this list for $d=2$). Our theory sheds light on how to choose the topological structure of the employed networks and how to tune the penalty parameter to achieve the desired convergence rate in terms of the number of training samples.
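
As an illustration of these choices (our own substitution into the general result above, not part of the original statement), taking $d=2$ gives

\mathcal{D}\leq 4,\quad\mathcal{W}\leq\mathcal{O}\left(8\left\lceil\left(\frac{n}{\log n}\right)^{1/8}-4\right\rceil^{2}\right),\quad\lambda\sim n^{1/12}(\log n)^{-5/12},\quad\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq\mathcal{O}\left(n^{-1/6}\log n\right).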

Recently, Mller2021ErrorEF ; muller2020deep also studied the convergence of DRM with Dirichlet boundary condition via the penalty method. However, the results derived in Mller2021ErrorEF ; muller2020deep are quite different from ours. Firstly, the approximation results in Mller2021ErrorEF ; muller2020deep are based on the approximation error of $\mathrm{ReLU}$ networks in Sobolev norms established in guhring2019error . However, the $\mathrm{ReLU}$ network may not be suitable for solving PDEs. In this work, we derive an upper bound on the approximation error of $\mathrm{ReLU}^{2}$ networks in the $H^{1}$ norm, which is of independent interest. Secondly, to analyze the error caused by the penalty term, Mller2021ErrorEF ; muller2020deep assume some additional conditions, while we do not need these conditions to bound the error induced by the penalty. Lastly, we provide a convergence rate analysis involving the statistical error caused by the finite samples used in the SGD training, whereas Mller2021ErrorEF ; muller2020deep do not consider the statistical error at all. Moreover, to bound the statistical error we need to control the Rademacher complexity of the non-Lipschitz composition of the gradient norm and a $\mathrm{ReLU}^{2}$ network; this technique can be useful for bounding the statistical errors of other deep PDE solvers.

The rest of this paper is organized as follows. In Section 2 we briefly describe the model problem and recall some standard properties of PDEs and variational problems. We also introduce some notation for deep Ritz methods as preliminaries. Section 3 is devoted to a detailed analysis of the convergence rate of the deep Ritz method with penalty, where the various error estimates are analyzed rigorously one by one and the main results on the convergence rate are presented. Some concluding remarks and discussions are given in Section 4.

2 Preliminaries

Consider the following elliptic equation with zero Dirichlet boundary condition:

\left\{\begin{aligned} -\Delta u+wu&=f&&\text{in}\ \Omega\\ u&=0&&\text{on}\ \partial\Omega,\\ \end{aligned}\right. (1)

where $\Omega$ is a bounded open subset of $\mathbb{R}^{d}$, $d>1$, $f\in L^{2}(\Omega)$ and $w\in L^{\infty}(\Omega)$. Moreover, we suppose the coefficient $w$ satisfies $w\geq c_{1}\geq 0$ a.e. Without loss of generality, we assume $\Omega=[0,1]^{d}$. Define the bilinear form

a:H^{1}(\Omega)\times H^{1}(\Omega)\to\mathbb{R},\quad(u,v)\mapsto\int_{\Omega}\nabla u\cdot\nabla v+wuv\ \mathrm{d}x, (2)

and the corresponding quadratic energy functional by

\mathcal{L}(u)=\frac{1}{2}a(u,u)-\langle f,u\rangle_{L^{2}(\Omega)}=\frac{1}{2}|u|_{H^{1}(\Omega)}^{2}+\frac{1}{2}\|u\|_{L^{2}(\Omega;{w})}^{2}-\langle f,u\rangle_{L^{2}(\Omega)}. (3)
Lemma 1

Evans2010PartialDE The unique weak solution $u^{*}\in H_{0}^{1}(\Omega)$ of (1) is the unique minimizer of $\mathcal{L}(u)$ over $H_{0}^{1}(\Omega)$. Moreover, $u^{*}\in H^{2}(\Omega)$.

Now we introduce the Robin approximation of (1) with $\lambda>0$ as below:

\left\{\begin{aligned} -\Delta u+wu&=f&&\text{in}\ \Omega\\ \frac{1}{\lambda}\frac{\partial u}{\partial n}+u&=0&&\text{on}\ \partial\Omega.\\ \end{aligned}\right. (4)

Similarly, we define the bilinear form

a_{\lambda}:H^{1}(\Omega)\times H^{1}(\Omega)\to\mathbb{R},\quad(u,v)\mapsto a(u,v)+\lambda\int_{\partial\Omega}uv\ \mathrm{d}s,

and the corresponding quadratic energy functional with boundary penalty

\mathcal{L}_{\lambda}(u)=\frac{1}{2}a_{\lambda}(u,u)-\langle f,u\rangle_{L^{2}(\Omega)}=\mathcal{L}(u)+\frac{\lambda}{2}\|Tu\|_{L^{2}(\partial\Omega)}^{2}, (5)

where $T$ denotes the trace operator.

Lemma 2

The unique weak solution $u_{\lambda}^{*}\in H^{1}(\Omega)$ of (4) is the unique minimizer of $\mathcal{L}_{\lambda}(u)$ over $H^{1}(\Omega)$. Moreover, $u_{\lambda}^{*}\in H^{2}(\Omega)$.

Proof

See Appendix A.1.

From the perspective of infinite-dimensional optimization, $\mathcal{L}_{\lambda}$ can be seen as the penalized version of $\mathcal{L}$. The following lemma provides the relationship between their minimizers.

Lemma 3

The minimizer $u_{\lambda}^{*}$ of the penalized problem (5) converges to $u^{*}$ in $H^{1}(\Omega)$ as $\lambda\rightarrow\infty$.

Proof

This result follows from Proposition 2.1 in Maury2009Numerical directly.

The deep Ritz method can be divided into three steps. First, one uses a deep neural network to approximate the trial function. A deep neural network $u_{\phi}:\mathbb{R}^{N_{0}}\rightarrow\mathbb{R}^{N_{L}}$ is defined by

u_{0}(\boldsymbol{x})=\boldsymbol{x},
u_{\ell}(\boldsymbol{x})=\sigma_{\ell}(A_{\ell}u_{\ell-1}+b_{\ell}),\quad\ell=1,2,\ldots,L-1,
u=u_{L}(\boldsymbol{x})=A_{L}u_{L-1}+b_{L},

where $A_{\ell}\in\mathbb{R}^{N_{\ell}\times N_{\ell-1}}$, $b_{\ell}\in\mathbb{R}^{N_{\ell}}$ and the activation functions $\sigma_{\ell}$ may be different for different $\ell$. The depth $\mathcal{D}$ and the width $\mathcal{W}$ of the neural network $u_{\phi}$ are defined as

\mathcal{D}=L,\quad\mathcal{W}=\max\{N_{\ell}:\ell=1,2,\ldots,L\}.

$\sum_{\ell=1}^{L}N_{\ell}$ is called the number of units of $u_{\phi}$, and $\phi=\{A_{\ell},b_{\ell}\}_{\ell=1}^{L}$ denotes the free parameters of the network.
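
As a concrete illustration (our own sketch, not part of the original paper), the following PyTorch snippet builds a fully connected network of exactly this form with $\mathrm{ReLU}^{2}$ activations; the input dimension, depth and width are placeholder values.

```python
import torch
import torch.nn as nn

class ReLU2(nn.Module):
    """ReLU^2 activation: sigma(x) = max(x, 0)^2."""
    def forward(self, x):
        return torch.relu(x) ** 2

def make_relu2_net(dim_in, width, depth, dim_out=1):
    """Fully connected network u_phi with depth D = `depth` and width W = `width`."""
    layers = [nn.Linear(dim_in, width), ReLU2()]
    for _ in range(depth - 2):                      # hidden layers 2, ..., L-1
        layers += [nn.Linear(width, width), ReLU2()]
    layers.append(nn.Linear(width, dim_out))        # last layer is affine (no activation)
    return nn.Sequential(*layers)

# Example: d = 2, depth 4, width 8 (illustrative values only).
u_phi = make_relu2_net(dim_in=2, width=8, depth=4)
print(u_phi(torch.rand(5, 2)).shape)                # torch.Size([5, 1])
```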

Definition 1

The class $\mathcal{N}_{\mathcal{D},\mathcal{W},\mathcal{B}}^{\alpha}$ is the collection of neural networks $u_{\phi}$ which satisfy:

  • (i)

    depth and width are 𝒟\mathcal{D} and 𝒲\mathcal{W}, respectively;

  • (ii)

    the function values $u_{\phi}(\boldsymbol{x})$ and the squared norm of $\nabla u_{\phi}(\boldsymbol{x})$ are bounded by $\mathcal{B}$;

  • (iii)

    the activation functions are given by $\mathrm{ReLU}^{\alpha}$, where $\alpha$ is the (multi-)index.

For example, $\mathcal{N}_{\mathcal{D},\mathcal{W},\mathcal{B}}^{2}$ is the class of networks with $\mathrm{ReLU}^{2}$ activation functions, and $\mathcal{N}_{\mathcal{D},\mathcal{W},\mathcal{B}}^{1,2}$ is the class with both $\mathrm{ReLU}^{1}$ and $\mathrm{ReLU}^{2}$ activation functions. We may simply write $\mathcal{N}^{\alpha}$ if there is no confusion.

Second, one uses the Monte Carlo method to discretize the energy functional. We rewrite (5) as

\mathcal{L}_{\lambda}(u)=|\Omega|\mathop{\mathbb{E}}_{X\sim U(\Omega)}\left[\frac{\|\nabla u(X)\|_{2}^{2}}{2}+\frac{w(X)u^{2}(X)}{2}-u(X)f(X)\right]+\frac{\lambda}{2}|\partial\Omega|\mathop{\mathbb{E}}_{Y\sim U(\partial\Omega)}\left[Tu^{2}(Y)\right], (6)

where $U(\Omega)$ and $U(\partial\Omega)$ are the uniform distributions on $\Omega$ and $\partial\Omega$, respectively. We now introduce the discrete version of (5), replacing $u$ by the neural network $u_{\phi}$, as follows:

\widehat{\mathcal{L}}_{\lambda}(u_{\phi})=\frac{|\Omega|}{N}\sum_{i=1}^{N}\left[\frac{\|\nabla u_{\phi}(X_{i})\|_{2}^{2}}{2}+\frac{w(X_{i})u_{\phi}^{2}(X_{i})}{2}-u_{\phi}(X_{i})f(X_{i})\right]+\frac{\lambda}{2}\frac{|\partial\Omega|}{M}\sum_{j=1}^{M}\left[Tu_{\phi}^{2}(Y_{j})\right]. (7)

We denote the minimizer of (7) over $\mathcal{N}^{2}$ by $\widehat{u}_{\phi}$, that is,

\widehat{u}_{\phi}=\mathop{\arg\min}_{u_{\phi}\in\mathcal{N}^{2}}\widehat{\mathcal{L}}_{\lambda}(u_{\phi}), (8)

where $\{X_{i}\}_{i=1}^{N}\sim U(\Omega)$ i.i.d. and $\{Y_{j}\}_{j=1}^{M}\sim U(\partial\Omega)$ i.i.d.

Finally, we choose an algorithm for solving the optimization problem, and denote by $u_{\phi_{\mathcal{A}}}$ the solution returned by the optimizer $\mathcal{A}$.
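
To make the three steps concrete, here is a minimal, self-contained PyTorch sketch of the empirical loss (7) on $\Omega=[0,1]^{d}$, assuming $w\equiv 0$ and a constant right-hand side for illustration; the sample sizes, penalty $\lambda$, network architecture and optimizer settings are our own placeholder choices, not prescriptions of the paper.

```python
import torch
import torch.nn as nn

d, N, M, lam = 2, 1024, 256, 100.0           # illustrative sizes and penalty

relu2 = lambda x: torch.relu(x) ** 2          # ReLU^2 activation

class Net(nn.Module):                         # small ReLU^2 network u_phi
    def __init__(self, width=16):
        super().__init__()
        self.l1 = nn.Linear(d, width)
        self.l2 = nn.Linear(width, width)
        self.l3 = nn.Linear(width, 1)
    def forward(self, x):
        return self.l3(relu2(self.l2(relu2(self.l1(x)))))

def f(x):                                     # assumed source term (example only)
    return torch.ones(x.shape[0], 1)

def sample_boundary(m):
    """Uniform samples on the boundary of the unit cube [0,1]^d."""
    y = torch.rand(m, d)
    face = torch.randint(0, d, (m,))          # which coordinate is pinned
    side = torch.randint(0, 2, (m,)).float()  # pinned to 0 or 1
    y[torch.arange(m), face] = side
    return y

def empirical_loss(u):
    X = torch.rand(N, d, requires_grad=True)  # interior samples ~ U(Omega)
    Y = sample_boundary(M)                    # boundary samples ~ U(dOmega)
    uX = u(X)
    grad_u = torch.autograd.grad(uX.sum(), X, create_graph=True)[0]
    interior = (0.5 * (grad_u ** 2).sum(dim=1, keepdim=True) - uX * f(X)).mean()
    boundary = (u(Y) ** 2).mean()
    # |Omega| = 1 and |dOmega| = 2d for the unit cube
    return interior + 0.5 * lam * 2 * d * boundary

u = Net()
opt = torch.optim.Adam(u.parameters(), lr=1e-3)
for step in range(200):                       # a few SGD-type steps
    opt.zero_grad()
    loss = empirical_loss(u)
    loss.backward()
    opt.step()
```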

3 Error Analysis

In this section we carry out the convergence rate analysis for DRM with deep $\mathrm{ReLU}^{2}$ networks. The following theorem plays an important role by decoupling the total error into four types of errors.

Theorem 3.1
\|u_{\phi_{\mathcal{A}}}-u^{*}\|_{H^{1}(\Omega)}^{2}\leq\frac{4}{c_{1}\wedge 1}\left\{\underbrace{\inf_{\bar{u}\in\mathcal{N}^{2}}\left[\frac{\|w\|_{L^{\infty}(\Omega)}\vee 1}{2}\|\bar{u}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}+\frac{\lambda}{2}\|T\bar{u}-Tu_{\lambda}^{*}\|_{L^{2}({\partial\Omega})}^{2}\right]}_{\mathcal{E}_{app}}+\underbrace{2\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|}_{\mathcal{E}_{sta}}+\underbrace{\left[\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)\right]}_{\mathcal{E}_{opt}}\right\}+2\underbrace{\|u_{\lambda}^{*}-u^{*}\|_{H^{1}(\Omega)}^{2}}_{\mathcal{E}_{pen}}.
Proof

Given $u_{\phi_{\mathcal{A}}}\in H^{1}(\Omega)$, we can decompose its distance to the weak solution of (1) using the triangle inequality:

\|u_{\phi_{\mathcal{A}}}-u^{*}\|_{H^{1}(\Omega)}\leq\|u_{\phi_{\mathcal{A}}}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}+\|u_{\lambda}^{*}-u^{*}\|_{H^{1}(\Omega)}. (9)

First, we decouple the first term into three parts. For any $\bar{u}\in\mathcal{N}^{2}$, we have

\mathcal{L}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\mathcal{L}_{\lambda}\left(u_{\lambda}^{*}\right)=\mathcal{L}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)+\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)+\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\bar{u}\right)+\widehat{\mathcal{L}}_{\lambda}\left(\bar{u}\right)-\mathcal{L}_{\lambda}\left(\bar{u}\right)+\mathcal{L}_{\lambda}\left(\bar{u}\right)-\mathcal{L}_{\lambda}\left(u_{\lambda}^{*}\right)
\leq\left[\mathcal{L}_{\lambda}\left(\bar{u}\right)-\mathcal{L}_{\lambda}\left(u_{\lambda}^{*}\right)\right]+2\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|+\left[\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)\right],

where we used $\widehat{\mathcal{L}}_{\lambda}(\widehat{u}_{\phi})\leq\widehat{\mathcal{L}}_{\lambda}(\bar{u})$. Since $\bar{u}$ can be any element of $\mathcal{N}^{2}$, taking the infimum over $\bar{u}$ gives

\mathcal{L}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\mathcal{L}_{\lambda}\left(u_{\lambda}^{*}\right)\leq\inf_{\bar{u}\in\mathcal{N}^{2}}\left[\mathcal{L}_{\lambda}\left(\bar{u}\right)-\mathcal{L}_{\lambda}\left(u_{\lambda}^{*}\right)\right]+2\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|+\left[\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)\right]. (10)

For any $u\in\mathcal{N}$, set $v=u-u_{\lambda}^{*}$; then

\mathcal{L}_{\lambda}(u)=\mathcal{L}_{\lambda}(u_{\lambda}^{*}+v)=\frac{1}{2}\langle\nabla(u_{\lambda}^{*}+v),\nabla(u_{\lambda}^{*}+v)\rangle_{L^{2}({\Omega})}+\frac{1}{2}\langle u_{\lambda}^{*}+v,u_{\lambda}^{*}+v\rangle_{L^{2}({\Omega};w)}-\langle u_{\lambda}^{*}+v,f\rangle_{L^{2}({\Omega})}+\frac{\lambda}{2}\langle T(u_{\lambda}^{*}+v),T(u_{\lambda}^{*}+v)\rangle_{L^{2}({\partial\Omega})}
=\mathcal{L}_{\lambda}(u_{\lambda}^{*})+\langle\nabla u_{\lambda}^{*},\nabla v\rangle_{L^{2}({\Omega})}+\langle u_{\lambda}^{*},v\rangle_{L^{2}({\Omega;w})}-\langle v,f\rangle_{L^{2}({\Omega})}+\lambda\langle Tu_{\lambda}^{*},Tv\rangle_{L^{2}({\partial\Omega})}+\frac{1}{2}\langle\nabla v,\nabla v\rangle_{L^{2}({\Omega})}+\frac{1}{2}\langle v,v\rangle_{L^{2}({\Omega};w)}+\frac{\lambda}{2}\langle Tv,Tv\rangle_{L^{2}({\partial\Omega})}
=\mathcal{L}_{\lambda}(u_{\lambda}^{*})+\frac{1}{2}\langle\nabla v,\nabla v\rangle_{L^{2}({\Omega})}+\frac{1}{2}\langle v,v\rangle_{L^{2}({\Omega};w)}+\frac{\lambda}{2}\langle Tv,Tv\rangle_{L^{2}({\partial\Omega})},

where the last equality holds because $u_{\lambda}^{*}$ is the minimizer of (5), so the first-order terms vanish. Therefore

\mathcal{L}_{\lambda}(u)-\mathcal{L}_{\lambda}(u_{\lambda}^{*})=\frac{1}{2}\langle\nabla v,\nabla v\rangle_{L^{2}({\Omega})}+\frac{1}{2}\langle v,v\rangle_{L^{2}({\Omega};w)}+\frac{\lambda}{2}\langle Tv,Tv\rangle_{L^{2}({\partial\Omega})},

that is

\frac{c_{1}\wedge 1}{2}\|u-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}\leq\mathcal{L}_{\lambda}(u)-\mathcal{L}_{\lambda}(u_{\lambda}^{*})-\frac{\lambda}{2}\|Tu-Tu_{\lambda}^{*}\|_{L^{2}({\partial\Omega})}^{2}\leq\frac{\|w\|_{L^{\infty}(\Omega)}\vee 1}{2}\|u-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}. (11)

Combining (10) and (11), we obtain

\|u_{\phi_{\mathcal{A}}}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}\leq\frac{2}{c_{1}\wedge 1}\left\{\mathcal{L}_{\lambda}(u_{\phi_{\mathcal{A}}})-\mathcal{L}_{\lambda}(u_{\lambda}^{*})-\frac{\lambda}{2}\|Tu_{\phi_{\mathcal{A}}}-Tu_{\lambda}^{*}\|_{L^{2}({\partial\Omega})}^{2}\right\}
\leq\frac{2}{c_{1}\wedge 1}\left\{\inf_{\bar{u}\in\mathcal{N}^{2}}\left[\mathcal{L}_{\lambda}\left(\bar{u}\right)-\mathcal{L}_{\lambda}\left(u_{\lambda}^{*}\right)\right]+2\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|+\left[\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)\right]\right\}
\leq\frac{2}{c_{1}\wedge 1}\left\{\inf_{\bar{u}\in\mathcal{N}^{2}}\left[\frac{\|w\|_{L^{\infty}(\Omega)}\vee 1}{2}\|\bar{u}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}+\frac{\lambda}{2}\|T\bar{u}-Tu_{\lambda}^{*}\|_{L^{2}({\partial\Omega})}^{2}\right]+2\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|+\left[\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)\right]\right\}. (12)

Substituting (12) into (9), it is evident that the theorem holds.

The approximation error $\mathcal{E}_{app}$ describes the expressive power of the $\mathrm{ReLU}^{2}$ network class $\mathcal{N}^{2}$ in the $H^{1}$ norm, and corresponds to the approximation error in FEM known from Céa's lemma ciarlet2002finite . The statistical error $\mathcal{E}_{sta}$ is caused by the Monte Carlo discretization of $\mathcal{L}_{\lambda}(\cdot)$ defined in (5) by $\widehat{\mathcal{L}}_{\lambda}(\cdot)$ in (7). The optimization error $\mathcal{E}_{opt}$ indicates the performance of the solver $\mathcal{A}$ we utilize; it corresponds to the error of solving linear systems in FEM. In this paper we consider the scenario of perfect training with $\mathcal{E}_{opt}=0$. The error $\mathcal{E}_{pen}$ caused by the boundary penalty is the distance between the minimizer of the energy with zero boundary condition and the minimizer of the energy with penalty.

3.1 Approximation error

Theorem 3.2

Assume $\|u_{\lambda}^{*}\|_{H^{2}(\Omega)}\leq c_{2}$. Then there exists a $\mathrm{ReLU}^{2}$ network $\bar{u}_{\bar{\phi}}\in\mathcal{N}^{2}$ with depth and width satisfying

\mathcal{D}\leq\lceil\log_{2}d\rceil+3,\quad\mathcal{W}\leq 4d\left\lceil\frac{Cc_{2}}{\varepsilon}-4\right\rceil^{d}

such that

\mathcal{E}_{app}=\inf_{\bar{u}\in\mathcal{N}^{2}}\left[\frac{\|w\|_{L^{\infty}(\Omega)}\vee 1}{2}\|\bar{u}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}+\frac{\lambda}{2}\|T\bar{u}-Tu_{\lambda}^{*}\|_{L^{2}({\partial\Omega})}^{2}\right]\leq\left(\frac{\|w\|_{L^{\infty}(\Omega)}\vee 1}{2}+\frac{\lambda C_{d}}{2}\right)\varepsilon^{2},

where $C$ is a generic constant and $C_{d}>0$ is a constant depending only on $\Omega$.

Proof

Our proof is based on some classical approximation results for B-splines schumaker2007spline ; de1978practical . Let us recall some notation and useful results. We denote by $\pi_{l}$ the dyadic partition of $[0,1]$, i.e.,

\pi_{l}:t_{0}^{(l)}=0<t_{1}^{(l)}<\cdots<t_{2^{l}-1}^{(l)}<t_{2^{l}}^{(l)}=1,

where $t_{i}^{(l)}=i\cdot 2^{-l}$ $(0\leq i\leq 2^{l})$. The cardinal B-spline of order $3$ with respect to the partition $\pi_{l}$ is defined by

N_{l,i}^{(3)}(x)=(-1)^{k}\left[t_{i}^{(l)},\ldots,t_{i+3}^{(l)},(x-t)_{+}^{2}\right]\cdot\left(t_{i+3}^{(l)}-t_{i}^{(l)}\right),\quad i=-2,\cdots,2^{l}-1,

which can be rewritten in the following equivalent form:

N_{l,i}^{(3)}(x)=2^{2l-1}\sum_{j=0}^{3}(-1)^{j}\binom{3}{j}(x-i2^{-l}-j2^{-l})_{+}^{2},\quad i=-2,\cdots,2^{l}-1. (13)
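
Formula (13) is easy to evaluate directly; the following small numerical sketch (ours, with an illustrative level $l$) evaluates the univariate quadratic B-splines via (13) and checks the standard partition-of-unity property $\sum_{i=-2}^{2^{l}-1}N_{l,i}^{(3)}(x)=1$ on the interior of $[0,1]$.

```python
import numpy as np
from math import comb

def b_spline_order3(x, i, l):
    """Cardinal quadratic B-spline N_{l,i}^{(3)} evaluated via formula (13)."""
    total = 0.0
    for j in range(4):
        total += (-1) ** j * comb(3, j) * np.maximum(x - (i + j) * 2.0 ** (-l), 0.0) ** 2
    return 2.0 ** (2 * l - 1) * total

# Check the partition-of-unity property on (0, 1).
l = 3                                             # illustrative level
x = np.linspace(0.01, 0.99, 200)
total = sum(b_spline_order3(x, i, l) for i in range(-2, 2 ** l))
print(np.allclose(total, 1.0))                    # True
```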

The multivariate cardinal B-spline of order $3$ is defined as the product of univariate cardinal B-splines of order $3$, i.e.,

{N}_{l,\boldsymbol{i}}^{(3)}(\boldsymbol{x})=\prod_{j=1}^{d}N_{l,i_{j}}^{(3)}\left(x_{j}\right),\quad\boldsymbol{i}=\left(i_{1},\ldots,i_{d}\right),\ -3<i_{j}<2^{l}.

Denote

S_{l}^{(3)}([0,1]^{d})=\text{span}\{N_{l,\boldsymbol{i}}^{(3)},\ -3<i_{j}<2^{l},\ j=1,2,\cdots,d\}.

Then the elements of $S_{l}^{(3)}([0,1]^{d})$ are piecewise polynomial functions with respect to the partition $\pi_{l}^{d}$, with each piece of degree $2$, belonging to $C^{1}([0,1]^{d})$. Since

S_{1}^{(3)}\subset S_{2}^{(3)}\subset S_{3}^{(3)}\subset\cdots,

we can further denote

S^{(3)}([0,1]^{d})=\bigcup_{l=1}^{\infty}S_{l}^{(3)}([0,1]^{d}).

The following approximation result for cardinal B-splines in Sobolev spaces, which is a direct consequence of Theorem 3.4 in schultz1969approximation , plays an important role in the proof of this theorem.

Lemma 4

Assume $u^{*}\in H^{2}([0,1]^{d})$. Then for $l>2$ there exists $\{c_{j}\}_{j=1}^{(2^{l}-4)^{d}}\subset\mathbb{R}$ such that

\left\|u^{*}-\sum_{j=1}^{(2^{l}-4)^{d}}c_{j}{N}_{l,\boldsymbol{i}_{j}}^{(3)}\right\|_{H^{1}(\Omega)}\leq\frac{C}{2^{l}}\|u^{*}\|_{H^{2}(\Omega)},

where $C$ is a constant depending only on $d$.

Lemma 5

The multivariate B-spline ${N}_{l,\boldsymbol{i}}^{(3)}(\boldsymbol{x})$ can be implemented exactly by a $\mathrm{ReLU}^{2}$ network with depth $\lceil\log_{2}d\rceil+2$ and width $4d$.

Proof

Denote

\sigma(x)=\left\{\begin{array}[]{ll}x^{2},&x\geq 0\\ 0,&\text{else}\end{array}\right.

as the activation function of the $\mathrm{ReLU}^{2}$ network. By the definition of $N_{l,i}^{(3)}(x)$ in (13), it is clear that $N_{l,i}^{(3)}(x)$ can be implemented exactly by a $\mathrm{ReLU}^{2}$ network with depth $2$ and width $4$. On the other hand, a $\mathrm{ReLU}^{2}$ network can also realize multiplication without any error. In fact, for any $x,y\in\mathbb{R}$,

xy=\frac{1}{4}[(x+y)^{2}-(x-y)^{2}]=\frac{1}{4}[\sigma(x+y)+\sigma(-x-y)-\sigma(x-y)-\sigma(y-x)].

Hence the multivariate B-spline of order $3$ can be implemented exactly by a $\mathrm{ReLU}^{2}$ network with depth $\lceil\log_{2}d\rceil+2$ and width $4d$.
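
The exact-multiplication identity above is easy to check numerically; the following sketch (ours) verifies it on random inputs together with the companion identity $x^{2}=\sigma(x)+\sigma(-x)$, which is used again in Section 3.2.

```python
import numpy as np

def sigma(x):
    """ReLU^2 activation sigma(x) = max(x, 0)^2."""
    return np.maximum(x, 0.0) ** 2

rng = np.random.default_rng(0)
x, y = rng.standard_normal(1000), rng.standard_normal(1000)

# xy = (1/4)[sigma(x+y) + sigma(-x-y) - sigma(x-y) - sigma(y-x)]
prod = 0.25 * (sigma(x + y) + sigma(-x - y) - sigma(x - y) - sigma(y - x))
print(np.allclose(prod, x * y))                   # True

# x^2 = sigma(x) + sigma(-x)
print(np.allclose(sigma(x) + sigma(-x), x ** 2))  # True
```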

For any $\epsilon>0$, by Lemma 4 and Lemma 5 with $l$ chosen such that $2^{l}\geq\frac{C\|u_{\lambda}^{*}\|_{H^{2}}}{\epsilon}$, there exists $\bar{u}_{\bar{\phi}}\in\mathcal{N}^{2}$ such that

\left\|u_{\lambda}^{*}-\bar{u}_{\bar{\phi}}\right\|_{H^{1}(\Omega)}\leq\epsilon. (14)

By the trace theorem, we have

\|Tu_{\lambda}^{*}-T\bar{u}_{\bar{\phi}}\|_{L^{2}({\partial\Omega})}\leq C_{d}^{1/2}\left\|u_{\lambda}^{*}-\bar{u}_{\bar{\phi}}\right\|_{H^{1}(\Omega)}\leq C_{d}^{1/2}\epsilon, (15)

where $C_{d}>0$ is a constant depending only on $\Omega$. The depth $\mathcal{D}$ and width $\mathcal{W}$ of $\bar{u}_{\bar{\phi}}$ satisfy $\mathcal{D}\leq\lceil\log_{2}d\rceil+3$ and $\mathcal{W}\leq 4d\left\lceil\frac{C\|u_{\lambda}^{*}\|_{H^{2}}}{\epsilon}-4\right\rceil^{d}$, respectively. Combining (14) and (15), we arrive at the result.

3.2 Statistical error

In this section, we bound the statistical error

\mathcal{E}_{sta}=2\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|.

For simplicity of presentation, we use $c_{3}$ to denote an upper bound of $f$ and $w$, and suppose $c_{3}\geq\mathcal{B}$, that is,

\|f\|_{L^{\infty}(\Omega)}\vee\|w\|_{L^{\infty}(\Omega)}\vee\mathcal{B}\leq c_{3}<\infty.

First, we need to decompose the statistical error into four parts, and estimate each one.

Lemma 6
\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|\leq\sum_{j=1}^{3}\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,j}(u)-\widehat{\mathcal{L}}_{\lambda,j}(u)\right|+\frac{\lambda}{2}\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,4}(u)-\widehat{\mathcal{L}}_{\lambda,4}(u)\right|,

where

\mathcal{L}_{\lambda,1}(u)=|\Omega|\mathop{\mathbb{E}}_{X\sim U(\Omega)}\left[\frac{\|\nabla u(X)\|_{2}^{2}}{2}\right],\quad\widehat{\mathcal{L}}_{\lambda,1}(u)=\frac{|\Omega|}{N}\sum_{i=1}^{N}\left[\frac{\|\nabla u(X_{i})\|_{2}^{2}}{2}\right],
\mathcal{L}_{\lambda,2}(u)=|\Omega|\mathop{\mathbb{E}}_{X\sim U(\Omega)}\left[\frac{w(X)u^{2}(X)}{2}\right],\quad\widehat{\mathcal{L}}_{\lambda,2}(u)=\frac{|\Omega|}{N}\sum_{i=1}^{N}\left[\frac{w(X_{i})u^{2}(X_{i})}{2}\right],
\mathcal{L}_{\lambda,3}(u)=|\Omega|\mathop{\mathbb{E}}_{X\sim U(\Omega)}\left[u(X)f(X)\right],\quad\widehat{\mathcal{L}}_{\lambda,3}(u)=\frac{|\Omega|}{N}\sum_{i=1}^{N}\left[u(X_{i})f(X_{i})\right],
\mathcal{L}_{\lambda,4}(u)=|\partial\Omega|\mathop{\mathbb{E}}_{Y\sim U(\partial\Omega)}\left[Tu^{2}(Y)\right],\quad\widehat{\mathcal{L}}_{\lambda,4}(u)=\frac{|\partial\Omega|}{M}\sum_{j=1}^{M}\left[Tu^{2}(Y_{j})\right].
Proof

This is easily verified by the triangle inequality.

We use $\mu$ to denote $\mathrm{U}(\Omega)$ (resp. $\mathrm{U}(\partial\Omega)$). Given $n=N$ (resp. $M$) i.i.d. samples $\mathbf{Z}_{n}=\left\{Z_{i}\right\}_{i=1}^{n}$ from $\mu$, with $Z_{i}=X_{i}$ (resp. $Y_{i}$) $\sim\mu$, we need the following Rademacher complexity to measure the capacity of the given function class $\mathcal{N}$ restricted to the $n$ random samples $\mathbf{Z}_{n}$.

Definition 2

The Rademacher complexity of a set $A\subseteq\mathbb{R}^{n}$ is defined as

\mathfrak{R}(A)=\mathbb{E}_{\mathbf{Z}_{n},\Sigma_{n}}\left[\sup_{a\in A}\frac{1}{n}\left|\sum_{i}\sigma_{i}a_{i}\right|\right],

where $\Sigma_{n}=\left\{\sigma_{i}\right\}_{i=1}^{n}$ are $n$ i.i.d. Rademacher variables with $\mathbb{P}\left(\sigma_{i}=1\right)=\mathbb{P}\left(\sigma_{i}=-1\right)=1/2$. The Rademacher complexity of a function class $\mathcal{N}$ associated with the random sample $\mathbf{Z}_{n}$ is defined as

\mathfrak{R}(\mathcal{N})=\mathbb{E}_{\mathbf{Z}_{n},\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i}\sigma_{i}u\left(Z_{i}\right)\right|\right].
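
As a small illustration of this definition (ours, with a toy function class and placeholder sizes), the empirical Rademacher complexity can be estimated by Monte Carlo averaging over random signs:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
Z = rng.random(n)                                 # n samples Z_i ~ U(0, 1)

# Toy function class: u_theta(z) = sin(theta * z) for a grid of theta values.
thetas = np.linspace(0.5, 10.0, 50)
values = np.sin(np.outer(thetas, Z))              # shape (|class|, n): u_theta(Z_i)

# Monte Carlo estimate of E_sigma[ sup_u (1/n) | sum_i sigma_i u(Z_i) | ].
estimates = []
for _ in range(2000):
    sigma = rng.choice([-1.0, 1.0], size=n)       # Rademacher signs
    estimates.append(np.abs(values @ sigma).max() / n)
print(np.mean(estimates))                         # decays like O(1/sqrt(n)) as n grows
```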

For the sake of simplicity, we deal with the last three terms first.

Lemma 7

Suppose that $\psi:\mathbb{R}^{d}\times\mathbb{R}\rightarrow\mathbb{R}$, $(x,y)\mapsto\psi(x,y)$, is $\ell$-Lipschitz continuous in $y$ for all $x$. Let $\mathcal{N}$ be a class of functions on $\Omega$ and $\psi\circ\mathcal{N}=\{\psi\circ u:x\mapsto\psi(x,u(x)),\ u\in\mathcal{N}\}$. Then

\mathfrak{R}(\psi\circ\mathcal{N})\leq\ell\,\mathfrak{R}(\mathcal{N}).
Proof

Corollary 3.17 in ledoux2013probability .

Lemma 8
\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,2}(u)-\widehat{\mathcal{L}}_{\lambda,2}(u)\right|\right]\leq c_{3}^{2}\mathfrak{R}(\mathcal{N}^{2}),\quad\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,3}(u)-\widehat{\mathcal{L}}_{\lambda,3}(u)\right|\right]\leq c_{3}\mathfrak{R}(\mathcal{N}^{2}),\quad\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,4}(u)-\widehat{\mathcal{L}}_{\lambda,4}(u)\right|\right]\leq 2c_{3}\mathfrak{R}(\mathcal{N}^{2}).
Proof

Suppose $|y|\leq c_{3}$. Define

\psi_{2}(x,y)=\frac{w(x)y^{2}}{2},\quad\psi_{3}(x,y)=f(x)y,\quad\psi_{4}(x,y)=y^{2}.

According to the symmetrization method, we have

\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,2}(u)-\widehat{\mathcal{L}}_{\lambda,2}(u)\right|\right]\leq\mathfrak{R}(\psi_{2}\circ\mathcal{N}^{2}),\quad\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,3}(u)-\widehat{\mathcal{L}}_{\lambda,3}(u)\right|\right]\leq\mathfrak{R}(\psi_{3}\circ\mathcal{N}^{2}),\quad\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,4}(u)-\widehat{\mathcal{L}}_{\lambda,4}(u)\right|\right]\leq\mathfrak{R}(\psi_{4}\circ\mathcal{N}^{2}). (16)

The result then follows directly from Lemma 7 and (16), provided we show that $\psi_{i}(x,y)$, $i=2,3,4$, are $c_{3}^{2}$-, $c_{3}$- and $2c_{3}$-Lipschitz continuous in $y$ for all $x$, respectively. For arbitrary $y_{1}$, $y_{2}$ with $|y_{i}|\leq c_{3}$, $i=1,2$,

|\psi_{2}(x,y_{1})-\psi_{2}(x,y_{2})|=\left|\frac{w(x)y_{1}^{2}}{2}-\frac{w(x)y_{2}^{2}}{2}\right|=\frac{|w(x)(y_{1}+y_{2})|}{2}|y_{1}-y_{2}|\leq c_{3}^{2}|y_{1}-y_{2}|,
|\psi_{3}(x,y_{1})-\psi_{3}(x,y_{2})|=\left|f(x)y_{1}-f(x)y_{2}\right|=|f(x)||y_{1}-y_{2}|\leq c_{3}|y_{1}-y_{2}|,
|\psi_{4}(x,y_{1})-\psi_{4}(x,y_{2})|=\left|y_{1}^{2}-y_{2}^{2}\right|=|y_{1}+y_{2}||y_{1}-y_{2}|\leq 2c_{3}|y_{1}-y_{2}|.

We now turn to the most difficult term in Lemma 6. Since the gradient is not a Lipschitz operator, Lemma 7 does not apply and we cannot bound the Rademacher complexity in the same way.

Lemma 9
\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,1}(u)-\widehat{\mathcal{L}}_{\lambda,1}(u)\right|\right]\leq\mathfrak{R}(\mathcal{N}^{1,2}).
Proof

Based on the symmetrization method, we have

\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,1}(u)-\widehat{\mathcal{L}}_{\lambda,1}(u)\right|\right]\leq\mathbb{E}_{\boldsymbol{Z}_{n},\Sigma_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\frac{1}{n}\left|\sum_{i}\sigma_{i}\|\nabla u(Z_{i})\|^{2}\right|\right]. (17)

The lemma then follows from (17) together with the following claim.

Claim: Let $u$ be a function implemented by a $\mathrm{ReLU}^{2}$ network with depth $\mathcal{D}$ and width $\mathcal{W}$. Then $\|\nabla u\|_{2}^{2}$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network with depth $\mathcal{D}+3$ and width $d\left(\mathcal{D}+2\right)\mathcal{W}$.

Denote $\mathrm{ReLU}$ and $\mathrm{ReLU}^{2}$ by $\sigma_{1}$ and $\sigma_{2}$, respectively. As long as we show that each partial derivative $D_{i}u$ $(i=1,2,\cdots,d)$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network, we can easily obtain the desired network, since $\|\nabla u\|_{2}^{2}=\sum_{i=1}^{d}\left|D_{i}u\right|^{2}$ and the square function can be implemented via $x^{2}=\sigma_{2}(x)+\sigma_{2}(-x)$.

Now we show that for any $i=1,2,\cdots,d$, $D_{i}u$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network. We treat the first two layers in detail, since they differ slightly from the remaining layers, and apply induction for the layers $k\geq 3$. For the first layer, since $\sigma_{2}^{\prime}(x)=2\sigma_{1}(x)$, we have for any $q=1,2,\cdots,n_{1}$

D_{i}u_{q}^{(1)}=D_{i}\sigma_{2}\left(\sum_{j=1}^{d}a_{qj}^{(1)}x_{j}+b_{q}^{(1)}\right)=2\sigma_{1}\left(\sum_{j=1}^{d}a_{qj}^{(1)}x_{j}+b_{q}^{(1)}\right)\cdot a_{qi}^{(1)}.

Hence $D_{i}u_{q}^{(1)}$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network with depth $2$ and width $1$. For the second layer,

D_{i}u_{q}^{(2)}=D_{i}\sigma_{2}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}u_{j}^{(1)}+b_{q}^{(2)}\right)=2\sigma_{1}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}u_{j}^{(1)}+b_{q}^{(2)}\right)\cdot\sum_{j=1}^{n_{1}}a_{qj}^{(2)}D_{i}u_{j}^{(1)}.

Since $\sigma_{1}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}u_{j}^{(1)}+b_{q}^{(2)}\right)$ and $\sum_{j=1}^{n_{1}}a_{qj}^{(2)}D_{i}u_{j}^{(1)}$ can be implemented by two $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ subnetworks, respectively, and the multiplication can also be implemented by

x\cdot y=\frac{1}{4}\left[(x+y)^{2}-(x-y)^{2}\right]=\frac{1}{4}\left[\sigma_{2}(x+y)+\sigma_{2}(-x-y)-\sigma_{2}(x-y)-\sigma_{2}(-x+y)\right],

we conclude that $D_{i}u_{q}^{(2)}$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network. We have

\mathcal{D}\left(\sigma_{1}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}u_{j}^{(1)}+b_{q}^{(2)}\right)\right)=3,\quad\mathcal{W}\left(\sigma_{1}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}u_{j}^{(1)}+b_{q}^{(2)}\right)\right)\leq\mathcal{W}

and

\mathcal{D}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}D_{i}u_{j}^{(1)}\right)=2,\quad\mathcal{W}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}D_{i}u_{j}^{(1)}\right)\leq\mathcal{W}.

Thus $\mathcal{D}\left(D_{i}u_{q}^{(2)}\right)=4$ and $\mathcal{W}\left(D_{i}u_{q}^{(2)}\right)\leq\max\{2\mathcal{W},4\}$.

Now we apply induction for the layers $k\geq 3$. For the third layer,

D_{i}u_{q}^{(3)}=D_{i}\sigma_{2}\left(\sum_{j=1}^{n_{2}}a_{qj}^{(3)}u_{j}^{(2)}+b_{q}^{(3)}\right)=2\sigma_{1}\left(\sum_{j=1}^{n_{2}}a_{qj}^{(3)}u_{j}^{(2)}+b_{q}^{(3)}\right)\cdot\sum_{j=1}^{n_{2}}a_{qj}^{(3)}D_{i}u_{j}^{(2)}.

Since

\mathcal{D}\left(\sigma_{1}\left(\sum_{j=1}^{n_{2}}a_{qj}^{(3)}u_{j}^{(2)}+b_{q}^{(3)}\right)\right)=4,\quad\mathcal{W}\left(\sigma_{1}\left(\sum_{j=1}^{n_{2}}a_{qj}^{(3)}u_{j}^{(2)}+b_{q}^{(3)}\right)\right)\leq\mathcal{W}

and

\mathcal{D}\left(\sum_{j=1}^{n_{2}}a_{qj}^{(3)}D_{i}u_{j}^{(2)}\right)=4,\quad\mathcal{W}\left(\sum_{j=1}^{n_{2}}a_{qj}^{(3)}D_{i}u_{j}^{(2)}\right)\leq\max\{2\mathcal{W},4\mathcal{W}\}=4\mathcal{W},

we conclude that $D_{i}u_{q}^{(3)}$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network with $\mathcal{D}\left(D_{i}u_{q}^{(3)}\right)=5$ and $\mathcal{W}\left(D_{i}u_{q}^{(3)}\right)\leq\max\{5\mathcal{W},4\}=5\mathcal{W}$.

We assume that $D_{i}u_{q}^{(k)}$ $(q=1,2,\cdots,n_{k})$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network with $\mathcal{D}\left(D_{i}u_{q}^{(k)}\right)=k+2$ and $\mathcal{W}\left(D_{i}u_{q}^{(k)}\right)\leq(k+2)\mathcal{W}$. For the $(k+1)$-th layer,

D_{i}u_{q}^{(k+1)}=D_{i}\sigma_{2}\left(\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}u_{j}^{(k)}+b_{q}^{(k+1)}\right)=2\sigma_{1}\left(\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}u_{j}^{(k)}+b_{q}^{(k+1)}\right)\cdot\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}D_{i}u_{j}^{(k)}.

Since

\mathcal{D}\left(\sigma_{1}\left(\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}u_{j}^{(k)}+b_{q}^{(k+1)}\right)\right)=k+2,\quad\mathcal{W}\left(\sigma_{1}\left(\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}u_{j}^{(k)}+b_{q}^{(k+1)}\right)\right)\leq\mathcal{W},

and

\mathcal{D}\left(\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}D_{i}u_{j}^{(k)}\right)=k+2,\quad\mathcal{W}\left(\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}D_{i}u_{j}^{(k)}\right)\leq\max\{(k+2)\mathcal{W},4\mathcal{W}\}=(k+2)\mathcal{W},

we conclude that $D_{i}u_{q}^{(k+1)}$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network with $\mathcal{D}\left(D_{i}u_{q}^{(k+1)}\right)=k+3$ and $\mathcal{W}\left(D_{i}u_{q}^{(k+1)}\right)\leq\max\{(k+3)\mathcal{W},4\}=(k+3)\mathcal{W}$.

Hence $D_{i}u=D_{i}u_{1}^{(\mathcal{D})}$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network with $\mathcal{D}\left(D_{i}u\right)=\mathcal{D}+2$ and $\mathcal{W}\left(D_{i}u\right)\leq\left(\mathcal{D}+2\right)\mathcal{W}$. Finally, we obtain $\mathcal{D}\left(\|\nabla u\|^{2}\right)=\mathcal{D}+3$ and $\mathcal{W}\left(\|\nabla u\|^{2}\right)\leq d\left(\mathcal{D}+2\right)\mathcal{W}$.
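
The first-layer identity underlying this construction, $D_{i}\sigma_{2}(a^{\top}x+b)=2\sigma_{1}(a^{\top}x+b)\,a_{i}$, can be checked with automatic differentiation; a minimal sketch (ours):

```python
import torch

relu = torch.relu
relu2 = lambda x: torch.relu(x) ** 2

# One ReLU^2 unit u(x) = sigma_2(a^T x + b); compare its gradient with the
# explicit ReLU-ReLU^2 representation 2 * sigma_1(a^T x + b) * a.
d = 3
a = torch.randn(d)
b = torch.randn(())
x = torch.randn(d, requires_grad=True)

u = relu2(a @ x + b)
grad_autograd = torch.autograd.grad(u, x)[0]
grad_explicit = 2 * relu(a @ x + b) * a
print(torch.allclose(grad_autograd, grad_explicit))   # True
```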

We are now in a position to bound the Rademacher complexities of $\mathcal{N}^{2}$ and $\mathcal{N}^{1,2}$. To obtain the estimates, we need to introduce the covering number, the VC-dimension and the pseudo-dimension, and to recall several of their properties.

Definition 3

Suppose that $W\subset\mathbb{R}^{n}$. For any $\epsilon>0$, let $V\subset\mathbb{R}^{n}$ be an $\epsilon$-cover of $W$ with respect to the distance $d_{\infty}$, that is, for any $w\in W$ there exists a $v\in V$ such that $d_{\infty}(w,v)<\epsilon$, where $d_{\infty}$ is defined by

d_{\infty}(u,v):=\|u-v\|_{\infty}.

The covering number $\mathcal{C}\left(\epsilon,W,d_{\infty}\right)$ is defined to be the minimum cardinality among all $\epsilon$-covers of $W$ with respect to the distance $d_{\infty}$.

Definition 4

Suppose that $\mathcal{N}$ is a class of functions from $\Omega$ to $\mathbb{R}$. Given $n$ samples $\mathbf{Z}_{n}=\left(Z_{1},Z_{2},\cdots,Z_{n}\right)\in\Omega^{n}$, the set $\mathcal{N}|_{\mathbf{Z}_{n}}\subset\mathbb{R}^{n}$ is defined by

\mathcal{N}|_{\mathbf{Z}_{n}}=\left\{\left(u\left(Z_{1}\right),u\left(Z_{2}\right),\cdots,u\left(Z_{n}\right)\right):u\in\mathcal{N}\right\}.

The uniform covering number $\mathcal{C}_{\infty}(\epsilon,\mathcal{N},n)$ is defined by

\mathcal{C}_{\infty}(\epsilon,\mathcal{N},n)=\max_{\mathbf{Z}_{n}\in\Omega^{n}}\mathcal{C}\left(\epsilon,\mathcal{N}|_{\mathbf{Z}_{n}},d_{\infty}\right).

Next we give an upper bound on $\mathfrak{R}\left(\mathcal{N}\right)$ in terms of the covering number of $\mathcal{N}$ by using Dudley's entropy formula dudley . We first recall Massart's finite class lemma.

Lemma 10 (Massart’s finite class lemma boucheron2013concentration )

For any finite set $V\subset\mathbb{R}^{n}$ with diameter $D=\sup_{v\in V}\|v\|_{2}$, it holds that

\mathbb{E}_{\Sigma_{n}}\left[\sup_{v\in V}\frac{1}{n}\left|\sum_{i}\sigma_{i}v_{i}\right|\right]\leq\frac{D}{n}\sqrt{2\log(2|V|)}.


Lemma 11 (Dudley’s entropy formula dudley )

Assume $0\in\mathcal{N}$ and that the diameter of $\mathcal{N}$ is at most $\mathcal{B}$, i.e., $\|u\|_{L^{\infty}(\Omega)}\leq\mathcal{B}$ for all $u\in\mathcal{N}$. Then

\mathfrak{R}(\mathcal{N})\leq\inf_{0<\delta<\mathcal{B}}\left(4\delta+\frac{12}{\sqrt{n}}\int_{\delta}^{\mathcal{B}}\sqrt{\log(2\mathcal{C}\left(\varepsilon,\mathcal{N},n\right))}\,\mathrm{d}\varepsilon\right).
Proof

By definition,

\mathfrak{R}(\mathcal{N})=\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i}\sigma_{i}u(Z_{i})\right|\ \Big|\ \boldsymbol{Z}_{n}\right]\right].

Thus, it suffices to show

\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i}\sigma_{i}u(Z_{i})\right|\right]\leq\inf_{0<\delta<\mathcal{B}}\left(4\delta+\frac{12}{\sqrt{n}}\int_{\delta}^{\mathcal{B}}\sqrt{\log(2\mathcal{C}\left(\varepsilon,\mathcal{N},n\right))}\,\mathrm{d}\varepsilon\right)

by conditioning on $\boldsymbol{Z}_{n}$. Given a positive integer $K$, let $\varepsilon_{k}=2^{-k+1}\mathcal{B}$, $k=1,\ldots,K$. Let $C_{k}$ be an $\varepsilon_{k}$-cover of $\mathcal{N}|_{\boldsymbol{Z}_{n}}\subseteq\mathbb{R}^{n}$ of cardinality $\mathcal{C}(\varepsilon_{k},\mathcal{N}|_{\boldsymbol{Z}_{n}},d_{\infty})$. Then, by definition, for every $u\in\mathcal{N}$ there exists $c^{k}\in C_{k}$ such that

d_{\infty}(u|_{\boldsymbol{Z}_{n}},c^{k})=\max\{|u(Z_{i})-c^{k}_{i}|,\ i=1,\ldots,n\}\leq\varepsilon_{k},\quad k=1,\ldots,K.

Moreover, we denote the best approximation of $u$ in $C_{k}$ with respect to $d_{\infty}$ by $c^{k}(u)$. Then,

\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}u(Z_{i})\right|\right]=\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}(u(Z_{i})-c^{K}_{i}(u))+\sum_{j=1}^{K-1}\sum_{i=1}^{n}\sigma_{i}(c^{j}_{i}(u)-c^{j+1}_{i}(u))+\sum_{i=1}^{n}\sigma_{i}c^{1}_{i}(u)\right|\right]
\leq\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}(u(Z_{i})-c^{K}_{i}(u))\right|\right]+\sum_{j=1}^{K-1}\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}(c^{j}_{i}(u)-c^{j+1}_{i}(u))\right|\right]+\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}c^{1}_{i}(u)\right|\right].

Since $0\in\mathcal{N}$ and the diameter of $\mathcal{N}$ is at most $\mathcal{B}$, we can choose $C_{1}=\{0\}$ so that the third term above vanishes. By Hölder's inequality, the first term can be bounded by $\varepsilon_{K}$ as follows:

\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}(u(Z_{i})-c^{K}_{i}(u))\right|\right]\leq\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left(\sum_{i=1}^{n}|\sigma_{i}|\right)\max_{i=1,\ldots,n}\left|u(Z_{i})-c^{K}_{i}(u)\right|\right]\leq\varepsilon_{K}.

Let $V_{j}=\{c^{j}(u)-c^{j+1}(u):u\in\mathcal{N}\}$. Then by definition, the cardinalities of $V_{j}$ and $C_{j}$ satisfy

|V_{j}|\leq|C_{j}||C_{j+1}|\leq|C_{j+1}|^{2},

and the diameter $D_{j}$ of $V_{j}$ can be bounded as

D_{j}=\sup_{v\in V_{j}}\|v\|_{2}\leq\sqrt{n}\sup_{u\in\mathcal{N}}\|c^{j}(u)-c^{j+1}(u)\|_{\infty}\leq\sqrt{n}\sup_{u\in\mathcal{N}}\left(\|c^{j}(u)-u\|_{\infty}+\|u-c^{j+1}(u)\|_{\infty}\right)\leq\sqrt{n}(\varepsilon_{j}+\varepsilon_{j+1})\leq 3\sqrt{n}\varepsilon_{j+1}.

Then,

\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{j=1}^{K-1}\sum_{i=1}^{n}\sigma_{i}(c^{j}_{i}(u)-c^{j+1}_{i}(u))\right|\right]\leq\sum_{j=1}^{K-1}\mathbb{E}_{\Sigma_{n}}\left[\sup_{v\in V_{j}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}v_{i}\right|\right]\leq\sum_{j=1}^{K-1}\frac{D_{j}}{n}\sqrt{2\log(2|V_{j}|)}\leq\sum_{j=1}^{K-1}\frac{6\varepsilon_{j+1}}{\sqrt{n}}\sqrt{\log(2|C_{j+1}|)},

where we used the triangle inequality in the first inequality and Lemma 10 in the second. Putting all the above estimates together, we get

\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i}\sigma_{i}u(Z_{i})\right|\right]\leq\varepsilon_{K}+\sum_{j=1}^{K-1}\frac{6\varepsilon_{j+1}}{\sqrt{n}}\sqrt{\log(2|C_{j+1}|)}\leq\varepsilon_{K}+\sum_{j=1}^{K}\frac{12(\varepsilon_{j}-\varepsilon_{j+1})}{\sqrt{n}}\sqrt{\log(2\mathcal{C}\left(\varepsilon_{j},\mathcal{N},n\right))}
\leq\varepsilon_{K}+\frac{12}{\sqrt{n}}\int_{\varepsilon_{K+1}}^{\mathcal{B}}\sqrt{\log(2\mathcal{C}\left(\varepsilon,\mathcal{N},n\right))}\,\mathrm{d}\varepsilon\leq\inf_{0<\delta<\mathcal{B}}\left(4\delta+\frac{12}{\sqrt{n}}\int_{\delta}^{\mathcal{B}}\sqrt{\log(2\mathcal{C}\left(\varepsilon,\mathcal{N},n\right))}\,\mathrm{d}\varepsilon\right),

where the last inequality holds because, for $0<\delta<\mathcal{B}$, we can choose $K$ to be the largest integer such that $\varepsilon_{K+1}>\delta$, so that $\varepsilon_{K}\leq 4\varepsilon_{K+2}\leq 4\delta$.

Definition 5

Let $\mathcal{N}$ be a set of functions from $X=\Omega$ (or $\partial\Omega$) to $\{0,1\}$. Suppose that $S=\left\{x_{1},x_{2},\cdots,x_{n}\right\}\subset X$. We say that $S$ is shattered by $\mathcal{N}$ if for any $b\in\{0,1\}^{n}$, there exists a $u\in\mathcal{N}$ satisfying

u\left(x_{i}\right)=b_{i},\quad i=1,2,\ldots,n.
Definition 6

The VC-dimension of $\mathcal{N}$, denoted by $\operatorname{VCdim}(\mathcal{N})$, is defined to be the maximum cardinality among all sets shattered by $\mathcal{N}$.

The VC-dimension reflects the capability of a class of functions to perform binary classification of points: the larger the VC-dimension, the stronger this capability. For more discussion of the VC-dimension, readers are referred to anthony2009neural .

For real-valued functions, the concept of VC-dimension can be generalized to the pseudo-dimension anthony2009neural .

Definition 7

Let $\mathcal{N}$ be a set of functions from $X$ to $\mathbb{R}$. Suppose that $S=\left\{x_{1},x_{2},\cdots,x_{n}\right\}\subset X$. We say that $S$ is pseudo-shattered by $\mathcal{N}$ if there exist $y_{1},y_{2},\cdots,y_{n}$ such that for any $b\in\{0,1\}^{n}$, there exists a $u\in\mathcal{N}$ satisfying

\operatorname{sign}\left(u\left(x_{i}\right)-y_{i}\right)=b_{i},\quad i=1,2,\ldots,n,

and we say that $\left\{y_{i}\right\}_{i=1}^{n}$ witnesses the shattering.

Definition 8

The pseudo-dimension of $\mathcal{N}$, denoted by $\operatorname{Pdim}(\mathcal{N})$, is defined to be the maximum cardinality among all sets pseudo-shattered by $\mathcal{N}$.

The following lemma shows a relationship between the uniform covering number and the pseudo-dimension.

Lemma 12

Let $\mathcal{N}$ be a set of real-valued functions from a domain $X$ to the bounded interval $[0,\mathcal{B}]$, and let $\varepsilon>0$. Then

\mathcal{C}(\varepsilon,\mathcal{N},n)\leq\sum_{i=1}^{\mathrm{Pdim}(\mathcal{N})}\binom{n}{i}\left(\frac{\mathcal{B}}{\varepsilon}\right)^{i},

which is less than $\left(\frac{en\mathcal{B}}{\varepsilon\cdot\mathrm{Pdim}(\mathcal{N})}\right)^{\mathrm{Pdim}(\mathcal{N})}$ for $n\geq\mathrm{Pdim}(\mathcal{N})$.

Proof

See Theorem 12.2 in anthony2009neural .

We now present the bound on the pseudo-dimension of $\mathcal{N}^{2}$ and $\mathcal{N}^{1,2}$.

Lemma 13

Let $p_{1},\cdots,p_{m}$ be polynomials in $n$ variables of degree at most $d$. If $n\leq m$, then

|\{(\operatorname{sign}(p_{1}(x)),\cdots,\operatorname{sign}(p_{m}(x))):x\in\mathbb{R}^{n}\}|\leq 2\left(\frac{2emd}{n}\right)^{n}.
Proof

See Theorem 8.3 in anthony2009neural .

Lemma 14

Let $\mathcal{N}$ be a set of functions such that

  • (i)

    they can be implemented by a neural network with depth no more than $\mathcal{D}$ and width no more than $\mathcal{W}$, and

  • (ii)

    the activation function in each unit is either $\mathrm{ReLU}$ or $\mathrm{ReLU}^{2}$.

Then

\operatorname{Pdim}(\mathcal{N})=\mathcal{O}(\mathcal{D}^{2}\mathcal{W}^{2}(\mathcal{D}+\log\mathcal{W})).
Proof

The argument follows the proof of Theorem 6 in bartlett2019nearly . The result stated here is somewhat stronger than Theorem 6 in bartlett2019nearly , since $\mathrm{VCdim}(\operatorname{sign}(\mathcal{N}))\leq\mathrm{Pdim}(\mathcal{N})$.

We consider a new set of functions:

\mathcal{\widetilde{N}}=\{\widetilde{u}(x,y)=\operatorname{sign}(u(x)-y):u\in\mathcal{N}\}

It is clear that \mathrm{Pdim}(\mathcal{N})\leq\mathrm{VCdim}(\mathcal{\widetilde{N}}). We now bound the VC-dimension of \mathcal{\widetilde{N}}. Denoting \mathcal{M} as the total number of parameters (weights and biases) in the neural network implementing functions in \mathcal{N}, in our case we want to derive the uniform bound for

K_{\{x_{i}\},\{y_{i}\}}(m):=|\{(\operatorname{sign}(u(x_{1},a)-y_{1}),\ldots,\operatorname{sign}(u(x_{m},a)-y_{m})):a\in\mathbb{R}^{\mathcal{M}}\}|

over all \{x_{i}\}_{i=1}^{m}\subset X and \{y_{i}\}_{i=1}^{m}\subset\mathbb{R}. Actually the maximum of K_{\{x_{i}\},\{y_{i}\}}(m) over all \{x_{i}\}_{i=1}^{m}\subset X and \{y_{i}\}_{i=1}^{m}\subset\mathbb{R} is the growth function \mathcal{G}_{\mathcal{\widetilde{N}}}(m). In order to apply Lemma 13, we partition the parameter space \mathbb{R}^{\mathcal{M}} into several subsets to ensure that in each subset u(x_{i},a)-y_{i} is a polynomial with respect to a without any breakpoints. In fact, our partition is exactly the same as the partition in bartlett2019nearly . Denote the partition as \{P_{1},P_{2},\cdots,P_{N}\} with some integer N satisfying

Ni=1𝒟12(2emki(1+(i1)2i1)i)iN\leq\prod_{i=1}^{\mathcal{D}-1}2\left(\frac{2emk_{i}(1+(i-1)2^{i-1})}{\mathcal{M}_{i}}\right)^{\mathcal{M}_{i}} (18)

where k_{i} and \mathcal{M}_{i} denote the number of units at the i-th layer and the total number of parameters at the inputs to units in all the layers up to layer i of the neural network implementing functions in \mathcal{N}, respectively. See bartlett2019nearly for the construction of the partition. Obviously we have

K{xi},{yi}(m)i=1N|{(sign(u(x1,a)y1),,sign(u(xm,a)ym)):aPi}|K_{\{x_{i}\},\{y_{i}\}}(m)\leq\sum_{i=1}^{N}|\{(\operatorname{sign}(u(x_{1},a)-y_{1}),\cdots,\operatorname{sign}(u(x_{m},a)-y_{m})):a\in P_{i}\}| (19)

Note that u(xi,a)yiu(x_{i},a)-y_{i} is a polynomial with respect to aa with degree the same as the degree of u(xi,a)u(x_{i},a), which is equal to 1+(𝒟1)2𝒟11+(\mathcal{D}-1)2^{\mathcal{D}-1} as shown in bartlett2019nearly . Hence by Lemma 13, we have

|{(sign(u(x1,a)y1),,sign(u(xm,a)ym)):aPi}|\displaystyle|\{(\operatorname{sign}(u(x_{1},a)-y_{1}),\cdots,\operatorname{sign}(u(x_{m},a)-y_{m})):a\in P_{i}\}|
2(2em(1+(𝒟1)2𝒟1)𝒟)𝒟.\displaystyle\leq 2\left(\frac{2em(1+(\mathcal{D}-1)2^{\mathcal{D}-1})}{\mathcal{M}_{\mathcal{D}}}\right)^{\mathcal{M}_{\mathcal{D}}}. (20)

Combining (18),(19),(20)(\ref{pdimb1}),(\ref{pdimb2}),(\ref{pdimb3}) yields

K{xi},{yi}(m)i=1𝒟2(2emki(1+(i1)2i1)i)i.K_{\{x_{i}\},\{y_{i}\}}(m)\leq\prod_{i=1}^{\mathcal{D}}2\left(\frac{2emk_{i}(1+(i-1)2^{i-1})}{\mathcal{M}_{i}}\right)^{\mathcal{M}_{i}}.

We then have

𝒢𝒩~(m)i=1𝒟2(2emki(1+(i1)2i1)i)i,\mathcal{G}_{\mathcal{\widetilde{N}}}(m)\leq\prod_{i=1}^{\mathcal{D}}2\left(\frac{2emk_{i}(1+(i-1)2^{i-1})}{\mathcal{M}_{i}}\right)^{\mathcal{M}_{i}},

since the maximum of K_{\{x_{i}\},\{y_{i}\}}(m) over all \{x_{i}\}_{i=1}^{m}\subset X and \{y_{i}\}_{i=1}^{m}\subset\mathbb{R} is the growth function \mathcal{G}_{\mathcal{\widetilde{N}}}(m). Following the same algebraic manipulations as in the proof of Theorem 6 in bartlett2019nearly , we obtain

Pdim(𝒩)𝒪(𝒟2𝒲2log𝒰+𝒟3𝒲2)=𝒪(𝒟2𝒲2(𝒟+log𝒲))\mathrm{Pdim}(\mathcal{N})\leq\mathcal{O}\left(\mathcal{D}^{2}\mathcal{W}^{2}\log\mathcal{U}+\mathcal{D}^{3}\mathcal{W}^{2}\right)=\mathcal{O}\left(\mathcal{D}^{2}\mathcal{W}^{2}\left(\mathcal{D}+\log\mathcal{W}\right)\right)

where 𝒰\mathcal{U} refers to the number of units of the neural network implementing functions in 𝒩\mathcal{N}.
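For a given architecture the simplified bound of Lemma 14 is easy to evaluate. The sketch below drops the absolute constant hidden in \mathcal{O}(\cdot), and the depth-width pairs are hypothetical, so the outputs only indicate how the pseudo-dimension bound scales with depth and width.

import math

def pdim_bound(depth, width):
    # Simplified pseudo-dimension bound of Lemma 14 with the absolute constant omitted:
    # Pdim(N) = O(D^2 W^2 (D + log W)).
    return depth ** 2 * width ** 2 * (depth + math.log(width))

for depth, width in [(3, 50), (5, 100), (8, 500)]:
    print(depth, width, f"{pdim_bound(depth, width):.3e}")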

With the above preparations, the statistical error can be bounded by a direct, if somewhat tedious, calculation.

Theorem 3.3

Let \mathcal{D} and \mathcal{W} be the depth and width of the network, respectively. Then

sta\displaystyle\mathcal{E}_{sta}\leq Cc3d(𝒟+3)(𝒟+2)𝒲𝒟+3+log(d(𝒟+2)𝒲)(lognn)1/2\displaystyle C_{c_{3}}d(\mathcal{D}+3)(\mathcal{D}+2)\mathcal{W}\sqrt{\mathcal{D}+3+\log(d(\mathcal{D}+2)\mathcal{W})}\left(\frac{\log n}{n}\right)^{1/2}
+Cc3d𝒟𝒲𝒟+log𝒲(lognn)1/2λ.\displaystyle+C_{c_{3}}d\mathcal{D}\mathcal{W}\sqrt{\mathcal{D}+\log\mathcal{W}}\left(\frac{\log n}{n}\right)^{1/2}\lambda.

where nn is the number of training samples on both the domain and the boundary.

Proof

In order to apply Lemma 11, we need to handle the term

\begin{split}&\frac{1}{\sqrt{n}}\int_{\delta}^{\mathcal{B}}\sqrt{\log(2\mathcal{C}(\epsilon,\mathcal{N},n))}d\epsilon\\ &\leq\frac{\mathcal{B}}{\sqrt{n}}+\frac{1}{\sqrt{n}}\int_{\delta}^{\mathcal{B}}\sqrt{\log\left(\frac{en\mathcal{B}}{\epsilon\cdot\mathrm{Pdim}(\mathcal{N})}\right)^{\mathrm{Pdim}(\mathcal{N})}}d\epsilon\\ &\leq\frac{\mathcal{B}}{\sqrt{n}}+\left(\frac{\mathrm{Pdim}(\mathcal{N})}{n}\right)^{1/2}\int_{\delta}^{\mathcal{B}}\sqrt{\log\left(\frac{en\mathcal{B}}{\epsilon\cdot\mathrm{Pdim}(\mathcal{N})}\right)}d\epsilon\end{split}

where in the first inequality we use Lemma 12. Now we calculate the integral. Set

t=log(enϵPdim(𝒩))t=\sqrt{\log\left(\frac{en\mathcal{B}}{\epsilon\cdot\mathrm{Pdim}(\mathcal{N})}\right)}

then \epsilon=\frac{en\mathcal{B}}{\mathrm{Pdim}(\mathcal{N})}\cdot e^{-t^{2}}. Denote t_{1}=\sqrt{\log\left(\frac{en\mathcal{B}}{\mathcal{B}\cdot\mathrm{Pdim}(\mathcal{N})}\right)}=\sqrt{\log\left(\frac{en}{\mathrm{Pdim}(\mathcal{N})}\right)} and t_{2}=\sqrt{\log\left(\frac{en\mathcal{B}}{\delta\cdot\mathrm{Pdim}(\mathcal{N})}\right)}. Then

δlog(enϵPdim(𝒩))𝑑ϵ=2enPdim(𝒩)t1t2t2et2𝑑t=2enPdim(𝒩)t1t2t(et22)𝑑t=enPdim(𝒩)[t1et12t2et22+t1t2et2𝑑t]enPdim(𝒩)[t1et12t2et22+(t2t1)et12]enPdim(𝒩)t2et12=log(enδPdim(𝒩))\begin{split}&\int_{\delta}^{\mathcal{B}}\sqrt{\log\left(\frac{en\mathcal{B}}{\epsilon\cdot\mathrm{Pdim}(\mathcal{N})}\right)}d\epsilon=\frac{2en\mathcal{B}}{\mathrm{Pdim}(\mathcal{N})}\int_{t_{1}}^{t_{2}}t^{2}e^{-t^{2}}dt\\ &=\frac{2en\mathcal{B}}{\mathrm{Pdim}(\mathcal{N})}\int_{t_{1}}^{t_{2}}t\left(\frac{-e^{-t^{2}}}{2}\right)^{\prime}dt\\ &=\frac{en\mathcal{B}}{\mathrm{Pdim}(\mathcal{N})}\left[t_{1}e^{-t_{1}^{2}}-t_{2}e^{-t_{2}^{2}}+\int_{t_{1}}^{t_{2}}e^{-t^{2}}dt\right]\\ &\leq\frac{en\mathcal{B}}{\mathrm{Pdim}(\mathcal{N})}\left[t_{1}e^{-t_{1}^{2}}-t_{2}e^{-t_{2}^{2}}+(t_{2}-t_{1})e^{-t_{1}^{2}}\right]\\ &\leq\frac{en\mathcal{B}}{\mathrm{Pdim}(\mathcal{N})}\cdot t_{2}e^{-t_{1}^{2}}=\mathcal{B}\sqrt{\log\left(\frac{en\mathcal{B}}{\delta\cdot\mathrm{Pdim}(\mathcal{N})}\right)}\end{split}

Choosing δ=(Pdim(𝒩)n)1/2\delta=\mathcal{B}\left(\frac{\mathrm{Pdim}(\mathcal{N})}{n}\right)^{1/2}\leq\mathcal{B}, by Lemma 11 and the above display, we get for both 𝒩=𝒩2\mathcal{N}=\mathcal{N}^{2} and 𝒩=𝒩1,2\mathcal{N}=\mathcal{N}^{1,2} there holds

(𝒩)4δ+12nδlog(2𝒞(ϵ,𝒩,n))𝑑ϵ\displaystyle\mathfrak{R}(\mathcal{N})\leq 4\delta+\frac{12}{\sqrt{n}}\int_{\delta}^{\mathcal{B}}\sqrt{\log(2\mathcal{C}(\epsilon,\mathcal{N},n))}d\epsilon
4δ+12n+12(Pdim(𝒩)n)1/2log(enδPdim(𝒩))\displaystyle\leq 4\delta+\frac{12\mathcal{B}}{\sqrt{n}}+12\mathcal{B}\left(\frac{\mathrm{Pdim}(\mathcal{N})}{n}\right)^{1/2}\sqrt{\log\left(\frac{en\mathcal{B}}{\delta\cdot\mathrm{Pdim}(\mathcal{N})}\right)}
2832(Pdim(𝒩)n)1/2log(enPdim(𝒩)).\displaystyle\leq 28\sqrt{\frac{3}{2}}\mathcal{B}\left(\frac{\mathrm{Pdim}(\mathcal{N})}{n}\right)^{1/2}\sqrt{\log\left(\frac{en}{\mathrm{Pdim}(\mathcal{N})}\right)}. (21)
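As a numerical sanity check of the entropy-integral estimate and of (21), the following sketch (with hypothetical values of n, \mathcal{B} and \mathrm{Pdim}(\mathcal{N}) of our own choosing) evaluates the integral by the trapezoid rule, compares it with the closed-form upper bound, and then evaluates the resulting Rademacher complexity bound.

import numpy as np

n, B, pdim = 10**5, 1.0, 10**3                 # hypothetical values with Pdim <= n
delta = B * np.sqrt(pdim / n)                  # the choice delta = B (Pdim/n)^{1/2}

eps = np.linspace(delta, B, 100001)
integrand = np.sqrt(np.log(np.e * n * B / (eps * pdim)))
integral = np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(eps))   # trapezoid rule
closed_form = B * np.sqrt(np.log(np.e * n * B / (delta * pdim)))

print(integral, closed_form)                   # the quadrature value stays below the bound
rad = 28 * np.sqrt(1.5) * B * np.sqrt(pdim / n) * np.sqrt(np.log(np.e * n / pdim))
print(rad)                                     # the Rademacher complexity bound (21)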

Then by Lemmas 6, 8, 9 and 11 and equation (21), we have

sta=2supu𝒩2|(u)^(u)|\displaystyle{\mathcal{E}_{sta}}=2\sup_{u\in\mathcal{N}^{2}}|\mathcal{L}(u)-\widehat{\mathcal{L}}(u)|
2(𝒩1,2)+2(2c32+2c3)(𝒩2)+2c32(𝒩2)λ\displaystyle\leq 2\mathfrak{R}(\mathcal{N}^{1,2})+2(2c_{3}^{2}+2c_{3})\mathfrak{R}(\mathcal{N}^{2})+2c_{3}^{2}\mathfrak{R}(\mathcal{N}^{2})\lambda
5632(Pdim(𝒩1,2)n)1/2log(enPdim(𝒩1,2))\displaystyle\leq 56\sqrt{\frac{3}{2}}\mathcal{B}\left(\frac{\mathrm{Pdim}(\mathcal{N}^{1,2})}{n}\right)^{1/2}\sqrt{\log\left(\frac{en}{\mathrm{Pdim}(\mathcal{N}^{1,2})}\right)}
+5632(2c32+2c3)(Pdim(𝒩2)n)1/2log(enPdim(𝒩2))\displaystyle+56\sqrt{\frac{3}{2}}(2c_{3}^{2}+2c_{3})\mathcal{B}\left(\frac{\mathrm{Pdim}(\mathcal{N}^{2})}{n}\right)^{1/2}\sqrt{\log\left(\frac{en}{\mathrm{Pdim}(\mathcal{N}^{2})}\right)}
+5632c32(Pdim(𝒩2)n)1/2log(enPdim(𝒩2))λ.\displaystyle+56\sqrt{\frac{3}{2}}c_{3}^{2}\mathcal{B}\left(\frac{\mathrm{Pdim}(\mathcal{N}^{2})}{n}\right)^{1/2}\sqrt{\log\left(\frac{en}{\mathrm{Pdim}(\mathcal{N}^{2})}\right)}\lambda.

Plugging the upper bound on \mathrm{Pdim} derived in Lemma 14 into the above display and using the relationship between the depth and width of \mathcal{N}^{2} and those of \mathcal{N}^{1,2}, we get

sta\displaystyle\mathcal{E}_{sta}\leq Cc3d(𝒟+3)(𝒟+2)𝒲𝒟+3+log(d(𝒟+2)𝒲)(lognn)1/2\displaystyle C_{c_{3}}d(\mathcal{D}+3)(\mathcal{D}+2)\mathcal{W}\sqrt{\mathcal{D}+3+\log(d(\mathcal{D}+2)\mathcal{W})}\left(\frac{\log n}{n}\right)^{1/2} (22)
+Cc3d𝒟𝒲𝒟+log𝒲(lognn)1/2λ.\displaystyle+C_{c_{3}}d\mathcal{D}\mathcal{W}\sqrt{\mathcal{D}+\log\mathcal{W}}\left(\frac{\log n}{n}\right)^{1/2}\lambda.
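To see how the bound of Theorem 3.3 behaves, the following sketch evaluates its two terms for a fixed hypothetical architecture (depth, width, d and \lambda chosen by us) with the generic constant C_{c_{3}} set to one; the numbers therefore only indicate the (\log n/n)^{1/2} decay in n and the linear growth in \lambda.

import math

def sta_bound(n, depth, width, d, lam, C=1.0):
    # The two terms of Theorem 3.3 with the generic constant C_{c_3} replaced by C.
    factor = math.sqrt(math.log(n) / n)
    term1 = (C * d * (depth + 3) * (depth + 2) * width
             * math.sqrt(depth + 3 + math.log(d * (depth + 2) * width)) * factor)
    term2 = C * d * depth * width * math.sqrt(depth + math.log(width)) * factor * lam
    return term1 + term2

for n in [10**4, 10**6, 10**8]:
    print(n, f"{sta_bound(n, depth=4, width=100, d=3, lam=10.0):.3e}")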

3.3 Error from the boundary penalty method

Although Lemma 3 shows the convergence property of the Robin problem (4) as \lambda\rightarrow\infty, namely

uλu,u_{\lambda}^{*}\rightarrow u^{*},

it says nothing about the convergence rate. In this section, we consider the error from the boundary penalty method. Roughly speaking, we bound the distance between the minimizers u^{*} and u_{\lambda}^{*} in terms of the penalty parameter \lambda.

Theorem 3.4

Suppose uλu_{\lambda}^{*} is the minimizer of (5) and uu^{*} is the minimizer of (3). Then

uλuH1(Ω)Cc3,dλ1.\|u_{\lambda}^{*}-u^{*}\|_{H^{1}(\Omega)}\leq C_{c_{3},d}\lambda^{-1}.
Proof

Following the idea proposed in Maury2009Numerical (proof of Proposition 2.3), we proceed as follows. For v\in H^{1}(\Omega), we introduce

Rλ(v)=12a(uv,uv)+λ2Ω(1λunv)2𝑑s.R_{\lambda}(v)=\frac{1}{2}a(u^{*}-v,u^{*}-v)+\frac{\lambda}{2}\int_{\partial\Omega}\left(-\frac{1}{\lambda}\frac{\partial u^{*}}{\partial n}-v\right)^{2}ds. (23)

Given \varphi\in H^{1}(\Omega) such that T\varphi=-\frac{\partial u^{*}}{\partial n}, we set w=\frac{1}{\lambda}\varphi+u^{*}. Since u^{*}\in H^{1}_{0}(\Omega), it follows that

Rλ(w)=12λ2a(φ,φ)+λ2Ω(u)2𝑑s=12λ2a(φ,φ)Cλ2,R_{\lambda}(w)=\frac{1}{2\lambda^{2}}a\left(\varphi,\varphi\right)+\frac{\lambda}{2}\int_{\partial\Omega}(u^{*})^{2}ds=\frac{1}{2\lambda^{2}}a\left(\varphi,\varphi\right)\leq C\lambda^{-2}, (24)

where C depends only on \mathcal{B}, w and \Omega. On the other hand, (23) can be rewritten as

Rλ(v)=\displaystyle R_{\lambda}(v)= 12a(u,u)a(u,v)+12a(v,v)+12λΩ(un)2𝑑s+Ωunv𝑑s\displaystyle\frac{1}{2}a(u^{*},u^{*})-a(u^{*},v)+\frac{1}{2}a(v,v)+\frac{1}{2\lambda}\int_{\partial\Omega}\left(\frac{\partial u^{*}}{\partial n}\right)^{2}ds+\int_{\partial\Omega}\frac{\partial u^{*}}{\partial n}vds
+λ2Ωv2𝑑s\displaystyle+\frac{\lambda}{2}\int_{\partial\Omega}v^{2}ds
=\displaystyle= 12a(u,u)+12a(v,v)+12λΩ(un)2𝑑s+λ2Ωv2𝑑sΩfv𝑑x\displaystyle\frac{1}{2}a(u^{*},u^{*})+\frac{1}{2}a(v,v)+\frac{1}{2\lambda}\int_{\partial\Omega}\left(\frac{\partial u^{*}}{\partial n}\right)^{2}ds+\frac{\lambda}{2}\int_{\partial\Omega}v^{2}ds-\int_{\Omega}fvdx
=\displaystyle= 12a(u,u)+12λΩ(un)2𝑑s+λ(v),\displaystyle\frac{1}{2}a(u^{*},u^{*})+\frac{1}{2\lambda}\int_{\partial\Omega}\left(\frac{\partial u^{*}}{\partial n}\right)^{2}ds+\mathcal{L}_{\lambda}(v),

where the second equality follows from the fact that

a(u,v)Ωunv𝑑s=Ωfv𝑑x,vH1(Ω).a(u^{*},v)-\int_{\partial\Omega}\frac{\partial u^{*}}{\partial n}vds=\int_{\Omega}fvdx,\ \forall v\in H^{1}(\Omega). (25)

Since R_{\lambda}(v)=\mathcal{L}_{\lambda}(v)+\mathrm{const}, u_{\lambda}^{*} is also the minimizer of R_{\lambda} over H^{1}(\Omega). Recalling (24), we obtain the following estimate of R_{\lambda}(u_{\lambda}^{*}):

0Rλ(uλ)=12a(uuλ,uuλ)+λ2Ω(1λunuλ)2𝑑sRλ(w)Cλ2.0\leq R_{\lambda}(u_{\lambda}^{*})=\frac{1}{2}a(u^{*}-u_{\lambda}^{*},u^{*}-u_{\lambda}^{*})+\frac{\lambda}{2}\int_{\partial\Omega}\left(-\frac{1}{\lambda}\frac{\partial u^{*}}{\partial n}-u_{\lambda}^{*}\right)^{2}ds\leq R_{\lambda}(w)\leq C\lambda^{-2}.

Since a(\cdot,\cdot) is coercive, we arrive at

uuλH1(Ω)Cc3,dλ1.\|u^{*}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}\leq C_{c_{3},d}\lambda^{-1}.

In hong2021rademacher , the bound \|u_{\lambda}^{*}-u^{*}\|_{H^{1}(\Omega)}\leq\mathcal{O}(\lambda^{-1/2}) was proved, which is suboptimal compared with the result derived here. In Mller2021ErrorEF ; muller2020deep , the \mathcal{O}(\lambda^{-1}) bound was obtained under some conditions that are difficult to verify.
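The \mathcal{O}(\lambda^{-1}) rate of Theorem 3.4 is easy to observe in one dimension. The following sketch is only an illustration outside the deep Ritz setting and uses a sample problem and discretization of our own choosing: it solves -u''=1 on (0,1) with the penalized (Robin) boundary condition \frac{1}{\lambda}\frac{\partial u}{\partial n}+u=0 by finite differences and records the H^{1} distance to the Dirichlet solution u^{*}(x)=x(1-x)/2; the errors decay like \lambda^{-1}.

import numpy as np

def solve_penalized(lam, m=1000):
    # Finite differences for -u'' = 1 on (0, 1); the boundary rows encode the
    # penalized (Robin) condition du/dn + lam * u = 0 with one-sided differences.
    h = 1.0 / m
    x = np.linspace(0.0, 1.0, m + 1)
    A = np.zeros((m + 1, m + 1))
    rhs = np.ones(m + 1)
    for i in range(1, m):
        A[i, i - 1], A[i, i], A[i, i + 1] = -1 / h**2, 2 / h**2, -1 / h**2
    A[0, 0], A[0, 1] = 1 / h + lam, -1 / h        # -u'(0) + lam * u(0) = 0
    A[m, m], A[m, m - 1] = 1 / h + lam, -1 / h    #  u'(1) + lam * u(1) = 0
    rhs[0] = rhs[m] = 0.0
    return x, np.linalg.solve(A, rhs)

def h1_error(x, u):
    e = u - x * (1 - x) / 2                       # error against the Dirichlet solution u*
    de = np.gradient(e, x)
    trap = lambda y: np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))
    return np.sqrt(trap(e**2) + trap(de**2))

for lam in [10.0, 100.0, 1000.0]:
    x, u = solve_penalized(lam)
    print(lam, h1_error(x, u))                    # decays roughly like 1 / (2 * lam)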

3.4 Convergence rate

Note that as \lambda\rightarrow\infty the approximation error \mathcal{E}_{app} and the statistical error \mathcal{E}_{sta} blow up, while as \lambda\rightarrow 0 the error from the boundary penalty blows up. Hence, there is a trade-off in choosing a proper \lambda.

Theorem 3.5

Let u^{*} be the weak solution of (1) with bounded f\in L^{2}(\Omega) and w\in L^{\infty}(\Omega), and let \widehat{u}_{\phi} be the minimizer of the discrete version of the associated Robin energy with penalty parameter \lambda. Let n be the number of training samples on the domain and on the boundary. Then there is a \mathrm{ReLU}^{2} network with depth and width satisfying

𝒟log2d+3,𝒲𝒪(4d(nlogn)12(d+2)4d),\mathcal{D}\leq\lceil\log_{2}d\rceil+3,\quad\mathcal{W}\leq\mathcal{O}\left(4d\left\lceil\left(\frac{n}{\log n}\right)^{\frac{1}{2(d+2)}}-4\right\rceil^{d}\right),

such that

𝔼𝑿,𝒀[u^ϕuH1(Ω)2]\displaystyle\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)\displaystyle C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)
+Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)λ\displaystyle+C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)\lambda
+Cc3,dλ2.\displaystyle+C_{c_{3},d}\lambda^{-2}.

Furthermore, for

λn13(d+2)(logn)d+33(d+2),\lambda\sim n^{\frac{1}{3(d+2)}}(\log n)^{-\frac{d+3}{3(d+2)}},

it holds that

𝔼𝑿,𝒀[u^ϕuH1(Ω)2]Cc1,c2,c3,d𝒪(n23(d+2)logn).\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{2}{3(d+2)}}\log n\right).
Proof

Combining Theorem 3.1, Theorem 3.2 and Theorem 3.3, and taking \varepsilon^{2}=C_{c_{1},c_{2},c_{3}}\left(\frac{\log n}{n}\right)^{\frac{1}{d+2}}, we obtain

𝔼𝑿,𝒀[u^ϕuλH1(Ω)2]\displaystyle\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}[\|\widehat{u}_{\phi}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}]
\displaystyle\leq 2c11[Cc3d(𝒟+3)(𝒟+2)𝒲𝒟+3+log(d(𝒟+2)𝒲)(lognn)1/2+c3+12ε2]\displaystyle\frac{2}{c_{1}\wedge 1}\left[C_{c_{3}}d(\mathcal{D}+3)(\mathcal{D}+2)\mathcal{W}\sqrt{\mathcal{D}+3+\log(d(\mathcal{D}+2)\mathcal{W})}\left(\frac{\log n}{n}\right)^{1/2}+\frac{c_{3}+1}{2}\varepsilon^{2}\right]
+2c11[Cc3d𝒟𝒲𝒟+log𝒲(lognn)1/2+Cd2ε2]λ\displaystyle+\frac{2}{c_{1}\wedge 1}\left[C_{c_{3}}d\mathcal{D}\mathcal{W}\sqrt{\mathcal{D}+\log\mathcal{W}}\left(\frac{\log n}{n}\right)^{1/2}+\frac{C_{d}}{2}\varepsilon^{2}\right]\lambda
\displaystyle\leq 2c11[Cc34d2(logd+6)(logd+5)Cc2ε4d\displaystyle\frac{2}{c_{1}\wedge 1}\left[C_{c_{3}}4d^{2}(\lceil\log d\rceil+6)(\lceil\log d\rceil+5)\left\lceil\frac{Cc_{2}}{\varepsilon}-4\right\rceil^{d}\cdot\right.
logd+6+log(4d2(logd+5)Cc2ε4d)(lognn)1/2+c3+12ε2]\displaystyle\quad\left.\sqrt{\lceil\log d\rceil+6+\log\left(4d^{2}(\lceil\log d\rceil+5)\left\lceil\frac{Cc_{2}}{\varepsilon}-4\right\rceil^{d}\right)}\left(\frac{\log n}{n}\right)^{1/2}+\frac{c_{3}+1}{2}\varepsilon^{2}\right]
+2c11[Cc34d2(logd+3)Cc2ε4d\displaystyle+\frac{2}{c_{1}\wedge 1}\left[C_{c_{3}}4d^{2}(\lceil\log d\rceil+3)\left\lceil\frac{Cc_{2}}{\varepsilon}-4\right\rceil^{d}\cdot\right.
logd+3+log(4dCc2ε4d)(lognn)1/2+Cd2ε2]λ\displaystyle\quad\left.\sqrt{\lceil\log d\rceil+3+\log\left(4d\left\lceil\frac{Cc_{2}}{\varepsilon}-4\right\rceil^{d}\right)}\left(\frac{\log n}{n}\right)^{1/2}+\frac{C_{d}}{2}\varepsilon^{2}\right]\lambda
\displaystyle\leq Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)+Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)λ.\displaystyle C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)+C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)\lambda.

Using Theorem 3.1 and Theorem 3.4, it holds for all \lambda>0 that

𝔼𝑿,𝒀[u^ϕuH1(Ω)2]\displaystyle\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)\displaystyle C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right) (26)
+Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)λ\displaystyle+C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)\lambda
+Cc3,dλ2.\displaystyle+C_{c_{3},d}\lambda^{-2}.

We have derived the error estimate for fixed \lambda, and we are now in a position to choose a proper \lambda and obtain the convergence rate. Since (26) holds for any \lambda>0, we take the infimum over \lambda:

𝔼𝑿,𝒀[u^ϕuH1(Ω)2]infλ>0\displaystyle\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq\inf_{\lambda>0} {Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)\displaystyle\left\{C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)\right.
+Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)λ\displaystyle\left.+C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)\lambda\right.
+Cc3,dλ2}.\displaystyle\left.+C_{c_{3},d}\lambda^{-2}\right\}.

By taking

λn13(d+2)(logn)d+33(d+2),\lambda\sim n^{\frac{1}{3(d+2)}}(\log n)^{-\frac{d+3}{3(d+2)}},

we can obtain

𝔼𝑿,𝒀[u^ϕuH1(Ω)2]Cc1,c2,c3,d𝒪(n23(d+2)logn).\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{2}{3(d+2)}}\log n\right).
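The prescriptions of Theorem 3.5 can be packaged as a simple a priori rule for the depth, the width, the penalty parameter and the predicted rate. In the sketch below all absolute constants hidden in \mathcal{O}(\cdot) and in the C's are omitted, and the width formula is clipped at one so that moderate n gives a sensible value; the outputs therefore only indicate the scaling in n and d.

import math

def drm_prior_rule(n, d):
    # A priori choices suggested by Theorem 3.5; all absolute constants are omitted,
    # and the width formula is clipped at 1 for moderate sample sizes.
    depth = math.ceil(math.log2(d)) + 3
    width = 4 * d * max(math.ceil((n / math.log(n)) ** (1 / (2 * (d + 2))) - 4), 1) ** d
    lam = n ** (1 / (3 * (d + 2))) * math.log(n) ** (-(d + 3) / (3 * (d + 2)))
    rate = n ** (-2 / (3 * (d + 2))) * math.log(n)
    return depth, width, lam, rate

for n in [10**4, 10**6, 10**8]:
    print(n, drm_prior_rule(n, d=3))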

4 Conclusions and Extensions

This paper provided a convergence rate analysis of deep Ritz methods for Laplace equations with Dirichlet boundary condition. Specifically, our study shed light on how to set the depth and width of the networks and how to choose the penalty parameter to achieve the desired convergence rate in terms of the number of training samples. The approximation error of deep \mathrm{ReLU}^{2} networks is estimated in the H^{1} norm. The statistical error is controlled via the Rademacher complexity of the non-Lipschitz composition of the gradient norm with a \mathrm{ReLU}^{2} network. We also analyzed the error introduced by the boundary penalty method.

There are several interesting directions for further research. First, the current analysis can be extended to general second order elliptic equations with other boundary conditions. Second, the approximation and statistical error bounds derived here can be used to study the nonasymptotic convergence rate of residual-based methods, such as PINNs. Finally, similar results may be applicable to deep Ritz methods for optimal control problems and inverse problems.

Acknowledgements.
Y. Jiao is supported in part by the National Science Foundation of China under Grant 11871474 and by the research fund of KLATASDSMOE of China. X. Lu is partially supported by the National Science Foundation of China (No. 11871385), the National Key Research and Development Program of China (No.2018YFC1314600) and the Natural Science Foundation of Hubei Province (No. 2019CFA007), and by the research fund of KLATASDSMOE of China. J. Yang was supported by NSFC (Grant No. 12125103, 12071362), the National Key Research and Development Program of China (No. 2020YFA0714200) and the Natural Science Foundation of Hubei Province (No. 2019CFA007).

Appendix A Appendix

A.1 Proof of Lemma 2

We claim that aλa_{\lambda} is coercive on H1(Ω)H^{1}(\Omega). In fact,

aλ(u,u)=a(u,u)+λΩu2dsCuH1(Ω)2,uH1(Ω),a_{\lambda}(u,u)=a(u,u)+\lambda\int_{\partial\Omega}u^{2}\ \mathrm{d}s\geq C\|u\|^{2}_{H^{1}(\Omega)},\ \forall u\in H^{1}(\Omega),

where C is the constant from the Poincaré inequality Gilbarg1983Elliptic . Thus, there exists a unique weak solution u_{\lambda}^{*}\in H^{1}(\Omega) such that

aλ(uλ,v)=f(v),vH1(Ω).a_{\lambda}(u_{\lambda}^{*},v)=f(v),\ \forall v\in H^{1}(\Omega).

One can check that u_{\lambda}^{*} is the unique minimizer of \mathcal{L}_{\lambda}(u) by a standard argument.

We now study the regularity of weak solutions of (4). For the following discussion, we first recall several useful classical results on second order elliptic equations from Evans2010PartialDE ; Gilbarg1983Elliptic .

Lemma 15

Assume wL(Ω)w\in L^{\infty}(\Omega), fL2(Ω)f\in L^{2}(\Omega), gH3/2(Ω)g\in H^{3/2}(\partial\Omega) and Ω\partial\Omega is sufficiently smooth. Suppose that uH1(Ω)u\in H^{1}(\Omega) is a weak solution of the elliptic boundary-value problem

{Δu+wu=f in Ωu=g on Ω.\left\{\begin{aligned} -\Delta u+wu&=f&&\text{ in }\Omega\\ u&=g&&\text{ on }\partial\Omega.\end{aligned}\right.

Then uH2(Ω)u\in H^{2}(\Omega) and there exists a positive constant CC, depending only on Ω\Omega and ww, such that

uH2(Ω)C(fL2(Ω)+gH3/2(Ω)).\|u\|_{H^{2}(\Omega)}\leq C\left(\|f\|_{L^{2}(\Omega)}+\|g\|_{H^{3/2}(\partial\Omega)}\right).
Proof

See Evans2010PartialDE ; Gilbarg1983Elliptic .
Lemma 16

Assume wL(Ω)w\in L^{\infty}(\Omega), fL2(Ω)f\in L^{2}(\Omega), gH1/2(Ω)g\in H^{1/2}(\partial\Omega) and Ω\partial\Omega is sufficiently smooth. Suppose that uH1(Ω)u\in H^{1}(\Omega) is a weak solution of the elliptic boundary-value problem

{Δu+wu=f in Ωun=g on Ω.\left\{\begin{aligned} -\Delta u+wu&=f&&\text{ in }\Omega\\ \frac{\partial u}{\partial n}&=g&&\text{ on }\partial\Omega.\end{aligned}\right.

Then uH2(Ω)u\in H^{2}(\Omega) and there exists a positive constant CC, depending only on Ω\Omega and ww, such that

uH2(Ω)C(fL2(Ω)+gH1/2(Ω)).\|u\|_{H^{2}(\Omega)}\leq C\left(\|f\|_{L^{2}(\Omega)}+\|g\|_{H^{1/2}(\partial\Omega)}\right).
Lemma 17

Assume wL(Ω)w\in L^{\infty}(\Omega), gH1/2(Ω)g\in H^{1/2}(\partial\Omega), Ω\partial\Omega is sufficiently smooth and λ>0\lambda>0. Let uH1(Ω)u\in H^{1}(\Omega) be the weak solution of the following Robin problem

{Δu+wu=0 in Ω1λun+u=g on Ω.\left\{\begin{aligned} -\Delta u+wu&=0&&\text{ in }\Omega\\ \frac{1}{\lambda}\frac{\partial u}{\partial n}+u&=g&&\text{ on }\partial\Omega.\end{aligned}\right. (27)

Then uH2(Ω)u\in H^{2}(\Omega) and there exists a positive constant CC independent of λ\lambda such that

uH2(Ω)CλgH1/2(Ω).\left\|u\right\|_{H^{2}(\Omega)}\leq C\lambda\|g\|_{H^{1/2}(\partial\Omega)}.
Proof

We follow the idea proposed in Costabel1996ASP , where it was used in a slightly different context. We first estimate the trace Tu=\left.u\right|_{\partial\Omega}. Define the Dirichlet-to-Neumann map

T~:u|Ωun|Ω,\widetilde{T}:\left.u\right|_{\partial\Omega}\mapsto\left.\frac{\partial u}{\partial n}\right|_{\partial\Omega},

where uu satisfies Δu+wu=0-\Delta u+wu=0 in Ω\Omega, then

Tu=(1λT~+I)1g.Tu=\left(\frac{1}{\lambda}\widetilde{T}+I\right)^{-1}g.

Now we show that \frac{1}{\lambda}\widetilde{T}+I is a positive definite operator on L^{2}(\partial\Omega). Notice that the variational formulation of (27) reads as follows:

Ωuvdx+Ωwuv𝑑x+λΩuv𝑑s=λΩgv𝑑s,vH1(Ω).\int_{\Omega}\nabla u\cdot\nabla vdx+\int_{\Omega}wuvdx+\lambda\int_{\partial\Omega}uvds=\lambda\int_{\partial\Omega}gvds,\ \forall v\in H^{1}(\Omega).

Taking v=u, we have

TuL2(Ω)2(1λT~+I)Tu,Tu.\|Tu\|_{L^{2}(\partial\Omega)}^{2}\leq\left\langle\left(\frac{1}{\lambda}\widetilde{T}+I\right)Tu,Tu\right\rangle.

This means that λ1T~+I\lambda^{-1}\widetilde{T}+I is a positive definite operator in L2(Ω)L^{2}(\partial\Omega), and further, (λ1T~+I)1(\lambda^{-1}\widetilde{T}+I)^{-1} is bounded. We have the estimate

TuH1/2(Ω)CgH1/2(Ω).\|Tu\|_{H^{1/2}(\partial\Omega)}\leq C\|g\|_{H^{1/2}(\partial\Omega)}. (28)

We rewrite the Robin problem (27) as follows

{Δu+wu=0 in Ωun+u=λ(g(1λ1)u) on Ω.\left\{\begin{aligned} -\Delta u+wu&=0&&\text{ in }\Omega\\ \frac{\partial u}{\partial n}+u&=\lambda\left(g-\left(1-\lambda^{-1}\right)u\right)&&\text{ on }\partial\Omega.\end{aligned}\right.

By Lemma 16 we have

uH2(Ω)Cλg(1λ1)TuH1/2(Ω)Cλ(gH1/2(Ω)+TuH1/2(Ω)).\|u\|_{H^{2}(\Omega)}\leq C\lambda\left\|g-\left(1-\lambda^{-1}\right)Tu\right\|_{H^{1/2}(\partial\Omega)}\leq C\lambda\left(\left\|g\right\|_{H^{1/2}(\partial\Omega)}+\left\|Tu\right\|_{H^{1/2}(\partial\Omega)}\right). (29)

Combining (28) and (29), we obtain the desired estimate.

With the help of the above lemmas, we now turn to the proof of the regularity of the weak solution.

Theorem A.1

Assume wL(Ω)w\in L^{\infty}(\Omega), fL2(Ω)f\in L^{2}(\Omega). Suppose that uH1(Ω)u\in H^{1}(\Omega) is a weak solution of the boundary-value problem (4). If Ω\partial\Omega is sufficiently smooth, then uH2(Ω)u\in H^{2}(\Omega), and we have the estimate

uH2(Ω)CfL2(Ω),\|u\|_{H^{2}(\Omega)}\leq C\|f\|_{L^{2}(\Omega)},

where the constant C depends only on \Omega and w.

Proof

We decompose (4) into two equations

{Δu0+wu0=finΩu0=0onΩ,\left\{\begin{aligned} -\Delta u_{0}+wu_{0}&=f&&\text{in}\ \Omega\\ u_{0}&=0&&\text{on}\ \partial\Omega,\\ \end{aligned}\right. (30)
{Δu1+wu1=0inΩ1λu1n+u1=u0nonΩ.\left\{\begin{aligned} -\Delta u_{1}+wu_{1}&=0&&\text{in}\ \Omega\\ \frac{1}{\lambda}\frac{\partial u_{1}}{\partial n}+u_{1}&=-\frac{\partial u_{0}}{\partial n}&&\text{on}\ \partial\Omega.\\ \end{aligned}\right. (31)

and obtain the solution of (4)

u=u0+1λu1.u=u_{0}+\frac{1}{\lambda}u_{1}.

Applying Lemma 15 to (30), we have

u0H2(Ω)CfL2(Ω),\|u_{0}\|_{H^{2}(\Omega)}\leq C\|f\|_{L^{2}(\Omega)}, (32)

where CC depends on Ω\Omega and ww. Using Lemma 17, it is easy to obtain

u1H2(Ω)Cλu0nH1/2(Ω)Cλu0H2(Ω),\left\|u_{1}\right\|_{H^{2}(\Omega)}\leq C\lambda\left\|\frac{\partial u_{0}}{\partial n}\right\|_{H^{1/2}(\partial\Omega)}\leq C\lambda\|u_{0}\|_{H^{2}(\Omega)}, (33)

where the last inequality follows from the trace theorem. Combining (32) and (33), the desired estimate follows from the triangle inequality.

References

  • (1) Anandkumar, A., Azizzadenesheli, K., Bhattacharya, K., Kovachki, N., Li, Z., Liu, B., Stuart, A.: Neural operator: Graph kernel network for partial differential equations. In: ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations (2020)
  • (2) Anitescu, C., Atroshchenko, E., Alajlan, N., Rabczuk, T.: Artificial neural network methods for the solution of second order boundary value problems. Cmc-computers Materials & Continua 59(1), 345–359 (2019)
  • (3) Anthony, M., Bartlett, P.L.: Neural network learning: Theoretical foundations. Cambridge University Press (2009)
  • (4) Babuska, I.: The finite element method with penalty. Mathematics of Computation (1973)
  • (5) Bartlett, P.L., Harvey, N., Liaw, C., Mehrabian, A.: Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res. 20(63), 1–17 (2019)
  • (6) Berner, J., Dablander, M., Grohs, P.: Numerically solving parametric families of high-dimensional kolmogorov partial differential equations via deep learning. In: H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 16615–16627. Curran Associates, Inc. (2020)
  • (7) Boucheron, S., Lugosi, G., Massart, P.: Concentration inequalities: A nonasymptotic theory of independence. Oxford university press (2013)
  • (8) Brenner, S., Scott, R.: The mathematical theory of finite element methods, vol. 15. Springer Science & Business Media (2007)
  • (9) Ciarlet, P.G.: The finite element method for elliptic problems. SIAM (2002)
  • (10) Costabel, M., Dauge, M.: A singularly perturbed mixed boundary value problem. Communications in Partial Differential Equations 21 (1996)
  • (11) De Boor, C.: A practical guide to splines, vol. 27. Springer-Verlag, New York (1978)
  • (12) Dissanayake, M., Phan-Thien, N.: Neural-network-based approximations for solving partial differential equations. Communications in Numerical Methods in Engineering 10(3), 195–201 (1994)
  • (13) Dudley, R.: The sizes of compact subsets of hilbert space and continuity of gaussian processes. Journal of Functional Analysis 1(3), 290–330 (1967). DOI https://doi.org/10.1016/0022-1236(67)90017-1. URL https://www.sciencedirect.com/science/article/pii/0022123667900171
  • (14) E, W., Ma, C., Wu, L.: The Barron space and the flow-induced function spaces for neural network models (2021)
  • (15) E, W., Wojtowytsch, S.: Some observations on partial differential equations in Barron and multi-layer spaces (2020)
  • (16) Evans, L.C.: Partial differential equations, second edition (2010)
  • (17) Gühring, I., Kutyniok, G., Petersen, P.: Error bounds for approximations with deep relu neural networks in ws,pw^{s,p} norms (2019)
  • (18) Gilbarg, D., Trudinger, N.: Elliptic partial differential equations of second order, 2nd ed. Springer (1998)
  • (19) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Advances in Neural Information Processing Systems 3 (2014). DOI 10.1145/3422622
  • (20) Greenfeld, D., Galun, M., Basri, R., Yavneh, I., Kimmel, R.: Learning to optimize multigrid PDE solvers. In: K. Chaudhuri, R. Salakhutdinov (eds.) Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 97, pp. 2415–2423. PMLR (2019)
  • (21) Han, J., Jentzen, A., E, W.: Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115(34), 8505–8510 (2018). DOI 10.1073/pnas.1718942115. URL https://www.pnas.org/content/115/34/8505
  • (22) He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE international conference on computer vision, pp. 1026–1034 (2015)
  • (23) Hong, Q., Siegel, J.W., Xu, J.: Rademacher complexity and numerical quadrature analysis of stable neural networks with applications to numerical pdes (2021)
  • (24) Hsieh, J.T., Zhao, S., Eismann, S., Mirabella, L., Ermon, S.: Learning neural pde solvers with convergence guarantees. In: International Conference on Learning Representations (2018)
  • (25) Hughes, T.J.: The Finite Element Method: Linear Static and Dynamic Finite Element Analysis. Courier Corporation (2012)
  • (26) Lagaris, I.E., Likas, A., Fotiadis, D.I.: Artificial neural networks for solving ordinary and partial differential equations. IEEE Trans. Neural Networks 9(5), 987–1000 (1998). URL https://doi.org/10.1109/72.712178
  • (27) Ledoux, M., Talagrand, M.: Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media (2013)
  • (28) Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Stuart, A., Bhattacharya, K., Anandkumar, A.: Multipole graph neural operator for parametric partial differential equations. In: H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 6755–6766. Curran Associates, Inc. (2020). URL https://proceedings.neurips.cc/paper/2020/file/4b21cf96d4cf612f239a6c322b10c8fe-Paper.pdf
  • (29) Li, Z., Kovachki, N.B., Azizzadenesheli, K., liu, B., Bhattacharya, K., Stuart, A., Anandkumar, A.: Fourier neural operator for parametric partial differential equations. In: International Conference on Learning Representations (2021)
  • (30) Lu, J., Lu, Y., Wang, M.: A priori generalization analysis of the deep ritz method for solving high dimensional elliptic equations (2021)
  • (31) Lu, L., Meng, X., Mao, Z., Karniadakis, G.E.: Deepxde: A deep learning library for solving differential equations. CoRR abs/1907.04502 (2019). URL http://arxiv.org/abs/1907.04502
  • (32) Luo, T., Yang, H.: Two-layer neural networks for partial differential equations: Optimization and generalization theory. ArXiv abs/2006.15733 (2020)
  • (33) Maury, B.: Numerical analysis of a finite element/volume penalty method. Siam Journal on Numerical Analysis 47(2), 1126–1148 (2009)
  • (34) Mishra, S., Molinaro, R.: Estimates on the generalization error of physics informed neural networks (pinns) for approximating pdes. ArXiv abs/2007.01138 (2020)
  • (35) Müller, J., Zeinhofer, M.: Deep ritz revisited. In: ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations (2020)
  • (36) Müller, J., Zeinhofer, M.: Error estimates for the variational training of neural networks with boundary penalty. ArXiv abs/2103.01007 (2021)
  • (37) Quarteroni, A., Valli, A.: Numerical Approximation of Partial Differential Equations, vol. 23. Springer Science & Business Media (2008)
  • (38) Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378, 686–707 (2019)
  • (39) Schultz, M.H.: Approximation theory of multivariate spline functions in sobolev spaces. SIAM Journal on Numerical Analysis 6(4), 570–582 (1969)
  • (40) Schumaker, L.: Spline functions: basic theory. Cambridge University Press (2007)
  • (41) Shin, Y., Zhang, Z., Karniadakis, G.: Error estimates of residual minimization using neural networks for linear pdes. ArXiv abs/2010.08019 (2020)
  • (42) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
  • (43) Sirignano, J.A., Spiliopoulos, K.: Dgm: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics 375, 1339–1364 (2018)
  • (44) Thomas, J.: Numerical Partial Differential Equations: Finite Difference Methods, vol. 22. Springer Science & Business Media (2013)
  • (45) Um, K., Brand, R., Fei, Y.R., Holl, P., Thuerey, N.: Solver-in-the-loop: Learning from differentiable physics to interact with iterative pde-solvers. In: H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 6111–6122. Curran Associates, Inc. (2020)
  • (46) Wang, S., Yu, X., Perdikaris, P.: When and why pinns fail to train: A neural tangent kernel perspective. ArXiv abs/2007.14527 (2020)
  • (47) Wang, Y., Shen, Z., Long, Z., Dong, B.: Learning to discretize: Solving 1d scalar conservation laws via deep reinforcement learning. Communications in Computational Physics 28(5), 2158–2179 (2020)
  • (48) Weinan, E., Yu, B.: The deep ritz method: A deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics 6(1), 1–12 (2017)
  • (49) Xu, J.: Finite neuron method and convergence analysis. Communications in Computational Physics 28(5), 1707–1745 (2020)
  • (50) Zang, Y., Bao, G., Ye, X., Zhou, H.: Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics 411, 109409 (2020)