
Chenguang Duan: School of Mathematics and Statistics, Wuhan University, Wuhan 430072, P.R. China. E-mail: cgduan.math@whu.edu.cn
Yuling Jiao: School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China. E-mail: yulingjiaomath@whu.edu.cn
Yanming Lai: School of Mathematics and Statistics, Wuhan University, Wuhan 430072, P.R. China. E-mail: laiyanming@whu.edu.cn
Xiliang Lu: School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China. E-mail: xllv.math@whu.edu.cn
Qimeng Quan: School of Mathematics and Statistics, Wuhan University, Wuhan 430072, P.R. China. E-mail: quanqm@whu.edu.cn
Jerry Zhijian Yang: School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China. E-mail: zjyang.math@whu.edu.cn

Analysis of Deep Ritz Methods for Laplace Equations with Dirichlet Boundary Condition

Chenguang Duan    Yuling Jiao    Yanming Lai    Xiliang Lu    Qimeng Quan    Jerry Zhijian Yang
(Received: date / Accepted: date)
Abstract

Deep Ritz methods (DRM) have been proven numerically to be efficient in solving partial differential equations. In this paper, we present a convergence rate in the $H^{1}$ norm for deep Ritz methods for Laplace equations with Dirichlet boundary condition, where the error depends explicitly on the depth and width of the deep neural networks and on the number of samples. Furthermore, the depth and width of the networks can be chosen appropriately in terms of the number of training samples. The main idea of the proof is to decompose the total error of DRM into three parts: the approximation error, the statistical error and the error caused by the boundary penalty. We bound the approximation error in the $H^{1}$ norm with $\mathrm{ReLU}^{2}$ networks and control the statistical error via Rademacher complexity. In particular, we derive a bound on the Rademacher complexity of the non-Lipschitz composition of the gradient norm with a $\mathrm{ReLU}^{2}$ network, which is of independent interest. We also analyze the error induced by the boundary penalty method and give an a priori rule for tuning the penalty parameter.

Keywords:
deep Ritz methods · convergence rate · Dirichlet boundary condition · approximation error · Rademacher complexity
MSC:
65C20

1 Introduction

Partial differential equations (PDEs) are one of the fundamental mathematical models for studying a variety of phenomena arising in science and engineering. Many conventional numerical methods have been successfully established for solving PDEs in low dimensions ($d\leq 3$), particularly the finite element method brenner2007mathematical ; ciarlet2002finite ; Quarteroni2008Numerical ; Thomas2013Numerical ; Hughes2012the . However, one encounters difficulties in both theoretical analysis and numerical implementation when extending conventional numerical schemes to high-dimensional PDEs. The classical analysis of convergence, stability and other properties becomes troublesome due to the complex construction of finite element spaces ciarlet2002finite ; brenner2007mathematical . Moreover, in terms of practical computation, the scale of the discrete problem increases exponentially with respect to the dimension.

Motivated by the well-known fact that deep learning methods for high-dimensional data analysis have achieved great success in discriminative, generative and reinforcement learning he2015delving ; Goodfellow2014Generative ; silver2016mastering , solving high-dimensional PDEs with deep neural networks has become an extremely promising approach and has attracted much attention Cosmin2019Artificial ; Justin2018DGM ; DeepXDE ; raissi2019physics ; Weinan2017The ; Yaohua2020weak ; Berner2020Numerically ; Han2018solving . Roughly speaking, these works can be divided into three categories. The first category uses deep neural networks to improve classical numerical methods, see for example Kiwon2020Solver ; Yufei2020Learning ; hsieh2018learning ; Greenfeld2019Learning . In the second category, neural operators are introduced to learn mappings between infinite-dimensional spaces with neural networks Li2020Advances ; anandkumar2020neural ; li2021fourier . In the last category, deep neural networks are used to approximate the solutions of PDEs directly, including physics-informed neural networks (PINNs) raissi2019physics , the deep Ritz method (DRM) Weinan2017The and weak adversarial networks (WAN) Yaohua2020weak . PINNs are based on residual minimization for solving PDEs Cosmin2019Artificial ; Justin2018DGM ; DeepXDE ; raissi2019physics . Proceeding from the variational form, Weinan2017The ; Yaohua2020weak ; Xu2020finite propose neural-network-based methods related to the classical Ritz and Galerkin methods. In Yaohua2020weak , WAN are proposed, inspired by the Galerkin method. Based on the Ritz method, Weinan2017The proposes the DRM to solve variational problems corresponding to a class of PDEs.

1.1 Related works and contributions

The idea of using neural networks to solve PDEs goes back to the 1990s Isaac1998Artificial ; Dissanayake1994neural . Although there have been great empirical achievements in recent years, a challenging and interesting question is to provide a rigorous error analysis, as is available for the finite element method. Several recent efforts have been devoted to making progress along this line, see for example e2020observations ; Luo2020TwoLayerNN ; Mishra2020EstimatesOT ; Mller2021ErrorEF ; lu2021priori ; hong2021rademacher ; Shin2020ErrorEO ; Wang2020WhenAW ; e2021barron . In Luo2020TwoLayerNN , the least squares minimization method with two-layer neural networks is studied; the optimization error under the assumption of over-parametrization and the generalization error without the over-parametrization assumption are analyzed. In lu2021priori ; Xu2020finite , generalization error bounds for two-layer neural networks are derived under the assumption that the exact solutions lie in the spectral Barron space.

The Dirichlet boundary condition corresponds to a constrained minimization problem, which may cause some difficulties in computation. The penalty method has been applied in finite element methods and finite volume methods Babuska1973The ; Maury2009Numerical . It has also been used in deep PDE solvers Weinan2017The ; raissi2019physics ; Xu2020finite , since it is not easy to construct a network with given values on the boundary. In this work, we also apply the penalty method to DRM with $\mathrm{ReLU}^{2}$ activation functions and obtain an error estimate. The main contributions are listed as follows:

  • We derive a bound on the approximation error of deep $\mathrm{ReLU}^{2}$ networks in the $H^{1}$ norm, which is of independent interest, see Theorem 3.2. That is, for any $u_{\lambda}^{*}\in H^{2}(\Omega)$, there exists a $\mathrm{ReLU}^{2}$ network $\bar{u}_{\bar{\phi}}$ with depth $\mathcal{D}\leq\lceil\log_{2}d\rceil+3$ and width $\mathcal{W}\leq\mathcal{O}\left(4d\left\lceil\frac{1}{\epsilon}-4\right\rceil^{d}\right)$ (where $d$ is the dimension), such that

    \left\|u_{\lambda}^{*}-\bar{u}_{\bar{\phi}}\right\|^{2}_{H^{1}\left(\Omega\right)}\leq\epsilon^{2}\quad\mbox{and}\quad\left\|Tu_{\lambda}^{*}-T\bar{u}_{\bar{\phi}}\right\|^{2}_{L^{2}\left(\partial\Omega\right)}\leq C_{d}\epsilon^{2}.
  • We establish a bound on the statistical error in DRM with the tools of pseudo-dimension. In particular, we give a bound on

    \mathbb{E}_{Z_{i},\sigma_{i},i=1,...,n}\left[\sup_{u_{\phi}\in\mathcal{N}^{2}}\frac{1}{n}\left|\sum_{i}\sigma_{i}\left\|\nabla u_{\phi}(Z_{i})\right\|^{2}\right|\right],

    i.e., the Rademacher complexity of the non-Lipschitz composition of the gradient norm and a $\mathrm{ReLU}^{2}$ network, via calculating the pseudo-dimension of networks with both $\mathrm{ReLU}$ and $\mathrm{ReLU}^{2}$ activation functions, see Theorem 3.3. The technique used here is also helpful for bounding the statistical errors of other deep PDE solvers.

  • We give an upper bound on the error caused by the Robin approximation without additional assumptions, i.e., we bound the error between the minimizer $u_{\lambda}^{*}$ of the penalized form and the weak solution $u^{*}$ of the Laplace equation, see Theorem 3.4,

    \|u_{\lambda}^{*}-u^{*}\|_{H^{1}(\Omega)}\leq\mathcal{O}(\lambda^{-1}).

    This result improves the one established in Mller2021ErrorEF ; muller2020deep ; hong2021rademacher .

  • Based on the above error bounds, we establish a nonasymptotic convergence rate of the deep Ritz method for the Laplace equation with Dirichlet boundary condition. We prove that if we set

    \mathcal{D}\leq\lceil\log_{2}d\rceil+3,\quad\mathcal{W}\leq\mathcal{O}\left(4d\left\lceil\left(\frac{n}{\log n}\right)^{\frac{1}{2(d+2)}}-4\right\rceil^{d}\right)

    and

    \lambda\sim n^{\frac{1}{3(d+2)}}(\log n)^{-\frac{d+3}{3(d+2)}},

    then it holds that

    \mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq\mathcal{O}\left(n^{-\frac{2}{3(d+2)}}\log n\right),

    where $n$ is the number of training samples on both the domain and the boundary (see the worked instance after this list for $d=2$). Our theory sheds light on how to choose the topological structure of the employed networks and how to tune the penalty parameter to achieve the desired convergence rate in terms of the number of training samples.
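
As an illustration of these choices (our own substitution into the general result above, not part of the original statement), taking $d=2$ gives

\mathcal{D}\leq 4,\quad\mathcal{W}\leq\mathcal{O}\left(8\left\lceil\left(\frac{n}{\log n}\right)^{1/8}-4\right\rceil^{2}\right),\quad\lambda\sim n^{1/12}(\log n)^{-5/12},\quad\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq\mathcal{O}\left(n^{-1/6}\log n\right).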

Recently, Mller2021ErrorEF ; muller2020deep also studied the convergence of DRM with Dirichlet boundary condition via the penalty method. However, the results derived in Mller2021ErrorEF ; muller2020deep are quite different from ours. Firstly, the approximation results in Mller2021ErrorEF ; muller2020deep are based on the approximation error of $\mathrm{ReLU}$ networks in Sobolev norms established in guhring2019error . However, the $\mathrm{ReLU}$ network may not be suitable for solving PDEs. In this work, we derive an upper bound on the approximation error of $\mathrm{ReLU}^{2}$ networks in the $H^{1}$ norm, which is of independent interest. Secondly, to analyze the error caused by the penalty term, Mller2021ErrorEF ; muller2020deep assume some additional conditions, while we do not need these conditions to bound the error induced by the penalty. Lastly, we provide a convergence rate analysis involving the statistical error caused by the finite samples used in the SGD training, whereas Mller2021ErrorEF ; muller2020deep do not consider the statistical error at all. Moreover, to bound the statistical error we need to control the Rademacher complexity of the non-Lipschitz composition of the gradient norm and a $\mathrm{ReLU}^{2}$ network; this technique can be useful for bounding the statistical errors of other deep PDE solvers.

The rest of this paper is organized as follows. In Section 2 we briefly describe the model problem and recall some standard properties of PDEs and variational problems. We also introduce some notation for deep Ritz methods as preliminaries. Section 3 is devoted to a detailed analysis of the convergence rate of the deep Ritz method with penalty, where the various error estimates are analyzed rigorously one by one and the main results on the convergence rate are presented. Some concluding remarks and discussions are given in Section 4.

2 Preliminaries

Consider the following elliptic equation with zero Dirichlet boundary condition:

\left\{\begin{aligned} -\Delta u+wu&=f&&\text{in}\ \Omega\\ u&=0&&\text{on}\ \partial\Omega,\\ \end{aligned}\right. (1)

where $\Omega$ is a bounded open subset of $\mathbb{R}^{d}$, $d>1$, $f\in L^{2}(\Omega)$ and $w\in L^{\infty}(\Omega)$. Moreover, we suppose the coefficient $w$ satisfies $w\geq c_{1}\geq 0$ a.e. Without loss of generality, we assume $\Omega=[0,1]^{d}$. Define the bilinear form

a:H^{1}(\Omega)\times H^{1}(\Omega)\to\mathbb{R},\quad(u,v)\mapsto\int_{\Omega}\nabla u\cdot\nabla v+wuv\ \mathrm{d}x, (2)

and the corresponding quadratic energy functional by

\mathcal{L}(u)=\frac{1}{2}a(u,u)-\langle f,u\rangle_{L^{2}(\Omega)}=\frac{1}{2}|u|_{H^{1}(\Omega)}^{2}+\frac{1}{2}\|u\|_{L^{2}(\Omega;{w})}^{2}-\langle f,u\rangle_{L^{2}(\Omega)}. (3)
Lemma 1

Evans2010PartialDE The unique weak solution $u^{*}\in H_{0}^{1}(\Omega)$ of (1) is the unique minimizer of $\mathcal{L}(u)$ over $H_{0}^{1}(\Omega)$. Moreover, $u^{*}\in H^{2}(\Omega)$.

Now we introduce the Robin approximation of (1) with $\lambda>0$ as below:

\left\{\begin{aligned} -\Delta u+wu&=f&&\text{in}\ \Omega\\ \frac{1}{\lambda}\frac{\partial u}{\partial n}+u&=0&&\text{on}\ \partial\Omega.\\ \end{aligned}\right. (4)

Similarly, we define the bilinear form

a_{\lambda}:H^{1}(\Omega)\times H^{1}(\Omega)\to\mathbb{R},\quad(u,v)\mapsto a(u,v)+\lambda\int_{\partial\Omega}uv\ \mathrm{d}s,

and the corresponding quadratic energy functional with boundary penalty

\mathcal{L}_{\lambda}(u)=\frac{1}{2}a_{\lambda}(u,u)-\langle f,u\rangle_{L^{2}(\Omega)}=\mathcal{L}(u)+\frac{\lambda}{2}\|Tu\|_{L^{2}(\partial\Omega)}^{2}, (5)

where $T$ denotes the trace operator.

Lemma 2

The unique weak solution $u_{\lambda}^{*}\in H^{1}(\Omega)$ of (4) is the unique minimizer of $\mathcal{L}_{\lambda}(u)$ over $H^{1}(\Omega)$. Moreover, $u_{\lambda}^{*}\in H^{2}(\Omega)$.

Proof

See Appendix A.1.

From the perspective of infinite-dimensional optimization, $\mathcal{L}_{\lambda}$ can be seen as the penalized version of $\mathcal{L}$. The following lemma provides the relationship between their minimizers.

Lemma 3

The minimizer $u_{\lambda}^{*}$ of the penalized problem (5) converges to $u^{*}$ in $H^{1}(\Omega)$ as $\lambda\rightarrow\infty$.

Proof

This result follows from Proposition 2.1 in Maury2009Numerical directly.

The deep Ritz method can be divided into three steps. First, one uses a deep neural network to approximate the trial function. A deep neural network $u_{\phi}:\mathbb{R}^{N_{0}}\rightarrow\mathbb{R}^{N_{L}}$ is defined by

u_{0}(\boldsymbol{x})=\boldsymbol{x},
u_{\ell}(\boldsymbol{x})=\sigma_{\ell}(A_{\ell}u_{\ell-1}+b_{\ell}),\quad\ell=1,2,\ldots,L-1,
u=u_{L}(\boldsymbol{x})=A_{L}u_{L-1}+b_{L},

where $A_{\ell}\in\mathbb{R}^{N_{\ell}\times N_{\ell-1}}$, $b_{\ell}\in\mathbb{R}^{N_{\ell}}$ and the activation functions $\sigma_{\ell}$ may be different for different $\ell$. The depth $\mathcal{D}$ and the width $\mathcal{W}$ of the neural network $u_{\phi}$ are defined as

\mathcal{D}=L,\quad\mathcal{W}=\max\{N_{\ell}:\ell=1,2,\ldots,L\}.

$\sum_{\ell=1}^{L}N_{\ell}$ is called the number of units of $u_{\phi}$, and $\phi=\{A_{\ell},b_{\ell}\}_{\ell=1}^{L}$ denotes the free parameters of the network.
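
As a concrete illustration (our own sketch, not part of the original paper), the following PyTorch snippet builds a fully connected network of exactly this form with $\mathrm{ReLU}^{2}$ activations; the input dimension, depth and width are placeholder values.

```python
import torch
import torch.nn as nn

class ReLU2(nn.Module):
    """ReLU^2 activation: sigma(x) = max(x, 0)^2."""
    def forward(self, x):
        return torch.relu(x) ** 2

def make_relu2_net(dim_in, width, depth, dim_out=1):
    """Fully connected network u_phi with depth D = `depth` and width W = `width`."""
    layers = [nn.Linear(dim_in, width), ReLU2()]
    for _ in range(depth - 2):                      # hidden layers 2, ..., L-1
        layers += [nn.Linear(width, width), ReLU2()]
    layers.append(nn.Linear(width, dim_out))        # last layer is affine (no activation)
    return nn.Sequential(*layers)

# Example: d = 2, depth 4, width 8 (illustrative values only).
u_phi = make_relu2_net(dim_in=2, width=8, depth=4)
print(u_phi(torch.rand(5, 2)).shape)                # torch.Size([5, 1])
```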

Definition 1

The class $\mathcal{N}_{\mathcal{D},\mathcal{W},\mathcal{B}}^{\alpha}$ is the collection of neural networks $u_{\phi}$ which satisfy:

  • (i)

    depth and width are 𝒟\mathcal{D} and 𝒲\mathcal{W}, respectively;

  • (ii)

    the function values $u_{\phi}(\boldsymbol{x})$ and the squared norm of $\nabla u_{\phi}(\boldsymbol{x})$ are bounded by $\mathcal{B}$;

  • (iii)

    the activation functions are given by $\mathrm{ReLU}^{\alpha}$, where $\alpha$ is the (multi-)index.

For example, $\mathcal{N}_{\mathcal{D},\mathcal{W},\mathcal{B}}^{2}$ is the class of networks with $\mathrm{ReLU}^{2}$ activation functions, and $\mathcal{N}_{\mathcal{D},\mathcal{W},\mathcal{B}}^{1,2}$ is the class with both $\mathrm{ReLU}^{1}$ and $\mathrm{ReLU}^{2}$ activation functions. We may simply write $\mathcal{N}^{\alpha}$ if there is no confusion.

Second, one uses the Monte Carlo method to discretize the energy functional. We rewrite (5) as

\mathcal{L}_{\lambda}(u)=|\Omega|\mathop{\mathbb{E}}_{X\sim U(\Omega)}\left[\frac{\|\nabla u(X)\|_{2}^{2}}{2}+\frac{w(X)u^{2}(X)}{2}-u(X)f(X)\right]+\frac{\lambda}{2}|\partial\Omega|\mathop{\mathbb{E}}_{Y\sim U(\partial\Omega)}\left[Tu^{2}(Y)\right], (6)

where $U(\Omega)$ and $U(\partial\Omega)$ are the uniform distributions on $\Omega$ and $\partial\Omega$, respectively. We now introduce the discrete version of (5), replacing $u$ by the neural network $u_{\phi}$, as follows:

\widehat{\mathcal{L}}_{\lambda}(u_{\phi})=\frac{|\Omega|}{N}\sum_{i=1}^{N}\left[\frac{\|\nabla u_{\phi}(X_{i})\|_{2}^{2}}{2}+\frac{w(X_{i})u_{\phi}^{2}(X_{i})}{2}-u_{\phi}(X_{i})f(X_{i})\right]+\frac{\lambda}{2}\frac{|\partial\Omega|}{M}\sum_{j=1}^{M}\left[Tu_{\phi}^{2}(Y_{j})\right]. (7)

We denote the minimizer of (7) over $\mathcal{N}^{2}$ by $\widehat{u}_{\phi}$, that is,

\widehat{u}_{\phi}=\mathop{\arg\min}_{u_{\phi}\in\mathcal{N}^{2}}\widehat{\mathcal{L}}_{\lambda}(u_{\phi}), (8)

where $\{X_{i}\}_{i=1}^{N}\sim U(\Omega)$ i.i.d. and $\{Y_{j}\}_{j=1}^{M}\sim U(\partial\Omega)$ i.i.d.

Finally, we choose an algorithm for solving the optimization problem, and denote by $u_{\phi_{\mathcal{A}}}$ the solution returned by the optimizer $\mathcal{A}$.
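
To make the three steps concrete, here is a minimal, self-contained PyTorch sketch of the empirical loss (7) on $\Omega=[0,1]^{d}$, assuming $w\equiv 0$ and a constant right-hand side for illustration; the sample sizes, penalty $\lambda$, network architecture and optimizer settings are our own placeholder choices, not prescriptions of the paper.

```python
import torch
import torch.nn as nn

d, N, M, lam = 2, 1024, 256, 100.0           # illustrative sizes and penalty

relu2 = lambda x: torch.relu(x) ** 2          # ReLU^2 activation

class Net(nn.Module):                         # small ReLU^2 network u_phi
    def __init__(self, width=16):
        super().__init__()
        self.l1 = nn.Linear(d, width)
        self.l2 = nn.Linear(width, width)
        self.l3 = nn.Linear(width, 1)
    def forward(self, x):
        return self.l3(relu2(self.l2(relu2(self.l1(x)))))

def f(x):                                     # assumed source term (example only)
    return torch.ones(x.shape[0], 1)

def sample_boundary(m):
    """Uniform samples on the boundary of the unit cube [0,1]^d."""
    y = torch.rand(m, d)
    face = torch.randint(0, d, (m,))          # which coordinate is pinned
    side = torch.randint(0, 2, (m,)).float()  # pinned to 0 or 1
    y[torch.arange(m), face] = side
    return y

def empirical_loss(u):
    X = torch.rand(N, d, requires_grad=True)  # interior samples ~ U(Omega)
    Y = sample_boundary(M)                    # boundary samples ~ U(dOmega)
    uX = u(X)
    grad_u = torch.autograd.grad(uX.sum(), X, create_graph=True)[0]
    interior = (0.5 * (grad_u ** 2).sum(dim=1, keepdim=True) - uX * f(X)).mean()
    boundary = (u(Y) ** 2).mean()
    # |Omega| = 1 and |dOmega| = 2d for the unit cube
    return interior + 0.5 * lam * 2 * d * boundary

u = Net()
opt = torch.optim.Adam(u.parameters(), lr=1e-3)
for step in range(200):                       # a few SGD-type steps
    opt.zero_grad()
    loss = empirical_loss(u)
    loss.backward()
    opt.step()
```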

3 Error Analysis

In this section we carry out the convergence rate analysis for DRM with deep $\mathrm{ReLU}^{2}$ networks. The following theorem plays an important role by decoupling the total error into four types of errors.

Theorem 3.1
\|u_{\phi_{\mathcal{A}}}-u^{*}\|_{H^{1}(\Omega)}^{2}\leq\frac{4}{c_{1}\wedge 1}\left\{\underbrace{\inf_{\bar{u}\in\mathcal{N}^{2}}\left[\frac{\|w\|_{L^{\infty}(\Omega)}\vee 1}{2}\|\bar{u}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}+\frac{\lambda}{2}\|T\bar{u}-Tu_{\lambda}^{*}\|_{L^{2}({\partial\Omega})}^{2}\right]}_{\mathcal{E}_{app}}+\underbrace{2\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|}_{\mathcal{E}_{sta}}+\underbrace{\left[\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)\right]}_{\mathcal{E}_{opt}}\right\}+2\underbrace{\|u_{\lambda}^{*}-u^{*}\|_{H^{1}(\Omega)}^{2}}_{\mathcal{E}_{pen}}.
Proof

Given $u_{\phi_{\mathcal{A}}}\in H^{1}(\Omega)$, we can decompose its distance to the weak solution of (1) using the triangle inequality:

\|u_{\phi_{\mathcal{A}}}-u^{*}\|_{H^{1}(\Omega)}\leq\|u_{\phi_{\mathcal{A}}}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}+\|u_{\lambda}^{*}-u^{*}\|_{H^{1}(\Omega)}. (9)

First, we decouple the first term into three parts. For any $\bar{u}\in\mathcal{N}^{2}$, we have

\mathcal{L}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\mathcal{L}_{\lambda}\left(u_{\lambda}^{*}\right)=\mathcal{L}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)+\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)+\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\bar{u}\right)+\widehat{\mathcal{L}}_{\lambda}\left(\bar{u}\right)-\mathcal{L}_{\lambda}\left(\bar{u}\right)+\mathcal{L}_{\lambda}\left(\bar{u}\right)-\mathcal{L}_{\lambda}\left(u_{\lambda}^{*}\right)
\leq\left[\mathcal{L}_{\lambda}\left(\bar{u}\right)-\mathcal{L}_{\lambda}\left(u_{\lambda}^{*}\right)\right]+2\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|+\left[\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)\right],

where we used $\widehat{\mathcal{L}}_{\lambda}(\widehat{u}_{\phi})\leq\widehat{\mathcal{L}}_{\lambda}(\bar{u})$. Since $\bar{u}$ can be any element of $\mathcal{N}^{2}$, taking the infimum over $\bar{u}$ gives

\mathcal{L}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\mathcal{L}_{\lambda}\left(u_{\lambda}^{*}\right)\leq\inf_{\bar{u}\in\mathcal{N}^{2}}\left[\mathcal{L}_{\lambda}\left(\bar{u}\right)-\mathcal{L}_{\lambda}\left(u_{\lambda}^{*}\right)\right]+2\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|+\left[\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)\right]. (10)

For any $u\in\mathcal{N}$, set $v=u-u_{\lambda}^{*}$; then

\mathcal{L}_{\lambda}(u)=\mathcal{L}_{\lambda}(u_{\lambda}^{*}+v)=\frac{1}{2}\langle\nabla(u_{\lambda}^{*}+v),\nabla(u_{\lambda}^{*}+v)\rangle_{L^{2}({\Omega})}+\frac{1}{2}\langle u_{\lambda}^{*}+v,u_{\lambda}^{*}+v\rangle_{L^{2}({\Omega};w)}-\langle u_{\lambda}^{*}+v,f\rangle_{L^{2}({\Omega})}+\frac{\lambda}{2}\langle T(u_{\lambda}^{*}+v),T(u_{\lambda}^{*}+v)\rangle_{L^{2}({\partial\Omega})}
=\mathcal{L}_{\lambda}(u_{\lambda}^{*})+\langle\nabla u_{\lambda}^{*},\nabla v\rangle_{L^{2}({\Omega})}+\langle u_{\lambda}^{*},v\rangle_{L^{2}({\Omega;w})}-\langle v,f\rangle_{L^{2}({\Omega})}+\lambda\langle Tu_{\lambda}^{*},Tv\rangle_{L^{2}({\partial\Omega})}+\frac{1}{2}\langle\nabla v,\nabla v\rangle_{L^{2}({\Omega})}+\frac{1}{2}\langle v,v\rangle_{L^{2}({\Omega};w)}+\frac{\lambda}{2}\langle Tv,Tv\rangle_{L^{2}({\partial\Omega})}
=\mathcal{L}_{\lambda}(u_{\lambda}^{*})+\frac{1}{2}\langle\nabla v,\nabla v\rangle_{L^{2}({\Omega})}+\frac{1}{2}\langle v,v\rangle_{L^{2}({\Omega};w)}+\frac{\lambda}{2}\langle Tv,Tv\rangle_{L^{2}({\partial\Omega})},

where the last equality holds because $u_{\lambda}^{*}$ is the minimizer of (5), so the first-order terms vanish. Therefore

\mathcal{L}_{\lambda}(u)-\mathcal{L}_{\lambda}(u_{\lambda}^{*})=\frac{1}{2}\langle\nabla v,\nabla v\rangle_{L^{2}({\Omega})}+\frac{1}{2}\langle v,v\rangle_{L^{2}({\Omega};w)}+\frac{\lambda}{2}\langle Tv,Tv\rangle_{L^{2}({\partial\Omega})},

that is

\frac{c_{1}\wedge 1}{2}\|u-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}\leq\mathcal{L}_{\lambda}(u)-\mathcal{L}_{\lambda}(u_{\lambda}^{*})-\frac{\lambda}{2}\|Tu-Tu_{\lambda}^{*}\|_{L^{2}({\partial\Omega})}^{2}\leq\frac{\|w\|_{L^{\infty}(\Omega)}\vee 1}{2}\|u-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}. (11)

Combining (10) and (11), we obtain

\|u_{\phi_{\mathcal{A}}}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}\leq\frac{2}{c_{1}\wedge 1}\left\{\mathcal{L}_{\lambda}(u_{\phi_{\mathcal{A}}})-\mathcal{L}_{\lambda}(u_{\lambda}^{*})-\frac{\lambda}{2}\|Tu_{\phi_{\mathcal{A}}}-Tu_{\lambda}^{*}\|_{L^{2}({\partial\Omega})}^{2}\right\}
\leq\frac{2}{c_{1}\wedge 1}\left\{\inf_{\bar{u}\in\mathcal{N}^{2}}\left[\mathcal{L}_{\lambda}\left(\bar{u}\right)-\mathcal{L}_{\lambda}\left(u_{\lambda}^{*}\right)\right]+2\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|+\left[\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)\right]\right\}
\leq\frac{2}{c_{1}\wedge 1}\left\{\inf_{\bar{u}\in\mathcal{N}^{2}}\left[\frac{\|w\|_{L^{\infty}(\Omega)}\vee 1}{2}\|\bar{u}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}+\frac{\lambda}{2}\|T\bar{u}-Tu_{\lambda}^{*}\|_{L^{2}({\partial\Omega})}^{2}\right]+2\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|+\left[\widehat{\mathcal{L}}_{\lambda}\left(u_{\phi_{\mathcal{A}}}\right)-\widehat{\mathcal{L}}_{\lambda}\left(\widehat{u}_{\phi}\right)\right]\right\}. (12)

Substituting (12) into (9), it is evident that the theorem holds.

The approximation error $\mathcal{E}_{app}$ describes the expressive power of the $\mathrm{ReLU}^{2}$ network class $\mathcal{N}^{2}$ in the $H^{1}$ norm, and corresponds to the approximation error in FEM known from Céa's lemma ciarlet2002finite . The statistical error $\mathcal{E}_{sta}$ is caused by the Monte Carlo discretization of $\mathcal{L}_{\lambda}(\cdot)$ defined in (5) by $\widehat{\mathcal{L}}_{\lambda}(\cdot)$ in (7). The optimization error $\mathcal{E}_{opt}$ indicates the performance of the solver $\mathcal{A}$ we utilize; it corresponds to the error of solving linear systems in FEM. In this paper we consider the scenario of perfect training with $\mathcal{E}_{opt}=0$. The error $\mathcal{E}_{pen}$ caused by the boundary penalty is the distance between the minimizer of the energy with zero boundary condition and the minimizer of the energy with penalty.

3.1 Approximation error

Theorem 3.2

Assume $\|u_{\lambda}^{*}\|_{H^{2}(\Omega)}\leq c_{2}$. Then there exists a $\mathrm{ReLU}^{2}$ network $\bar{u}_{\bar{\phi}}\in\mathcal{N}^{2}$ with depth and width satisfying

\mathcal{D}\leq\lceil\log_{2}d\rceil+3,\quad\mathcal{W}\leq 4d\left\lceil\frac{Cc_{2}}{\varepsilon}-4\right\rceil^{d}

such that

\mathcal{E}_{app}=\inf_{\bar{u}\in\mathcal{N}^{2}}\left[\frac{\|w\|_{L^{\infty}(\Omega)}\vee 1}{2}\|\bar{u}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}+\frac{\lambda}{2}\|T\bar{u}-Tu_{\lambda}^{*}\|_{L^{2}({\partial\Omega})}^{2}\right]\leq\left(\frac{\|w\|_{L^{\infty}(\Omega)}\vee 1}{2}+\frac{\lambda C_{d}}{2}\right)\varepsilon^{2},

where $C$ is a generic constant and $C_{d}>0$ is a constant depending only on $\Omega$.

Proof

Our proof is based on some classical approximation results for B-splines schumaker2007spline ; de1978practical . Let us recall some notation and useful results. We denote by $\pi_{l}$ the dyadic partition of $[0,1]$, i.e.,

\pi_{l}:t_{0}^{(l)}=0<t_{1}^{(l)}<\cdots<t_{2^{l}-1}^{(l)}<t_{2^{l}}^{(l)}=1,

where $t_{i}^{(l)}=i\cdot 2^{-l}$ $(0\leq i\leq 2^{l})$. The cardinal B-spline of order $3$ with respect to the partition $\pi_{l}$ is defined by

N_{l,i}^{(3)}(x)=(-1)^{k}\left[t_{i}^{(l)},\ldots,t_{i+3}^{(l)},(x-t)_{+}^{2}\right]\cdot\left(t_{i+3}^{(l)}-t_{i}^{(l)}\right),\quad i=-2,\cdots,2^{l}-1,

which can be rewritten in the following equivalent form:

N_{l,i}^{(3)}(x)=2^{2l-1}\sum_{j=0}^{3}(-1)^{j}\binom{3}{j}(x-i2^{-l}-j2^{-l})_{+}^{2},\quad i=-2,\cdots,2^{l}-1. (13)
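
Formula (13) is easy to evaluate directly; the following small numerical sketch (ours, with an illustrative level $l$) evaluates the univariate quadratic B-splines via (13) and checks the standard partition-of-unity property $\sum_{i=-2}^{2^{l}-1}N_{l,i}^{(3)}(x)=1$ on the interior of $[0,1]$.

```python
import numpy as np
from math import comb

def b_spline_order3(x, i, l):
    """Cardinal quadratic B-spline N_{l,i}^{(3)} evaluated via formula (13)."""
    total = 0.0
    for j in range(4):
        total += (-1) ** j * comb(3, j) * np.maximum(x - (i + j) * 2.0 ** (-l), 0.0) ** 2
    return 2.0 ** (2 * l - 1) * total

# Check the partition-of-unity property on (0, 1).
l = 3                                             # illustrative level
x = np.linspace(0.01, 0.99, 200)
total = sum(b_spline_order3(x, i, l) for i in range(-2, 2 ** l))
print(np.allclose(total, 1.0))                    # True
```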

The multivariate cardinal B-spline of order $3$ is defined as the product of univariate cardinal B-splines of order $3$, i.e.,

{N}_{l,\boldsymbol{i}}^{(3)}(\boldsymbol{x})=\prod_{j=1}^{d}N_{l,i_{j}}^{(3)}\left(x_{j}\right),\quad\boldsymbol{i}=\left(i_{1},\ldots,i_{d}\right),\ -3<i_{j}<2^{l}.

Denote

S_{l}^{(3)}([0,1]^{d})=\text{span}\{N_{l,\boldsymbol{i}}^{(3)},\ -3<i_{j}<2^{l},\ j=1,2,\cdots,d\}.

Then the elements of $S_{l}^{(3)}([0,1]^{d})$ are piecewise polynomial functions with respect to the partition $\pi_{l}^{d}$, with each piece of degree $2$, belonging to $C^{1}([0,1]^{d})$. Since

S_{1}^{(3)}\subset S_{2}^{(3)}\subset S_{3}^{(3)}\subset\cdots,

we can further denote

S^{(3)}([0,1]^{d})=\bigcup_{l=1}^{\infty}S_{l}^{(3)}([0,1]^{d}).

The following approximation result for cardinal B-splines in Sobolev spaces, which is a direct consequence of Theorem 3.4 in schultz1969approximation , plays an important role in the proof of this theorem.

Lemma 4

Assume $u^{*}\in H^{2}([0,1]^{d})$. Then for $l>2$ there exists $\{c_{j}\}_{j=1}^{(2^{l}-4)^{d}}\subset\mathbb{R}$ such that

\left\|u^{*}-\sum_{j=1}^{(2^{l}-4)^{d}}c_{j}{N}_{l,\boldsymbol{i}_{j}}^{(3)}\right\|_{H^{1}(\Omega)}\leq\frac{C}{2^{l}}\|u^{*}\|_{H^{2}(\Omega)},

where $C$ is a constant depending only on $d$.

Lemma 5

The multivariate B-spline ${N}_{l,\boldsymbol{i}}^{(3)}(\boldsymbol{x})$ can be implemented exactly by a $\mathrm{ReLU}^{2}$ network with depth $\lceil\log_{2}d\rceil+2$ and width $4d$.

Proof

Denote

\sigma(x)=\left\{\begin{array}[]{ll}x^{2},&x\geq 0\\ 0,&\text{else}\end{array}\right.

as the activation function of the $\mathrm{ReLU}^{2}$ network. By the definition of $N_{l,i}^{(3)}(x)$ in (13), it is clear that $N_{l,i}^{(3)}(x)$ can be implemented exactly by a $\mathrm{ReLU}^{2}$ network with depth $2$ and width $4$. On the other hand, a $\mathrm{ReLU}^{2}$ network can also realize multiplication without any error. In fact, for any $x,y\in\mathbb{R}$,

xy=\frac{1}{4}[(x+y)^{2}-(x-y)^{2}]=\frac{1}{4}[\sigma(x+y)+\sigma(-x-y)-\sigma(x-y)-\sigma(y-x)].

Hence the multivariate B-spline of order $3$ can be implemented exactly by a $\mathrm{ReLU}^{2}$ network with depth $\lceil\log_{2}d\rceil+2$ and width $4d$.
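
The exact-multiplication identity above is easy to check numerically; the following sketch (ours) verifies it on random inputs together with the companion identity $x^{2}=\sigma(x)+\sigma(-x)$, which is used again in Section 3.2.

```python
import numpy as np

def sigma(x):
    """ReLU^2 activation sigma(x) = max(x, 0)^2."""
    return np.maximum(x, 0.0) ** 2

rng = np.random.default_rng(0)
x, y = rng.standard_normal(1000), rng.standard_normal(1000)

# xy = (1/4)[sigma(x+y) + sigma(-x-y) - sigma(x-y) - sigma(y-x)]
prod = 0.25 * (sigma(x + y) + sigma(-x - y) - sigma(x - y) - sigma(y - x))
print(np.allclose(prod, x * y))                   # True

# x^2 = sigma(x) + sigma(-x)
print(np.allclose(sigma(x) + sigma(-x), x ** 2))  # True
```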

For any $\epsilon>0$, by Lemma 4 and Lemma 5 with $l$ chosen such that $2^{l}\geq\frac{C\|u_{\lambda}^{*}\|_{H^{2}}}{\epsilon}$, there exists $\bar{u}_{\bar{\phi}}\in\mathcal{N}^{2}$ such that

\left\|u_{\lambda}^{*}-\bar{u}_{\bar{\phi}}\right\|_{H^{1}(\Omega)}\leq\epsilon. (14)

By the trace theorem, we have

\|Tu_{\lambda}^{*}-T\bar{u}_{\bar{\phi}}\|_{L^{2}({\partial\Omega})}\leq C_{d}^{1/2}\left\|u_{\lambda}^{*}-\bar{u}_{\bar{\phi}}\right\|_{H^{1}(\Omega)}\leq C_{d}^{1/2}\epsilon, (15)

where $C_{d}>0$ is a constant depending only on $\Omega$. The depth $\mathcal{D}$ and width $\mathcal{W}$ of $\bar{u}_{\bar{\phi}}$ satisfy $\mathcal{D}\leq\lceil\log_{2}d\rceil+3$ and $\mathcal{W}\leq 4d\left\lceil\frac{C\|u_{\lambda}^{*}\|_{H^{2}}}{\epsilon}-4\right\rceil^{d}$, respectively. Combining (14) and (15), we arrive at the result.

3.2 Statistical error

In this section, we bound the statistical error

\mathcal{E}_{sta}=2\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|.

For simplicity of presentation, we use $c_{3}$ to denote an upper bound of $f$ and $w$, and suppose $c_{3}\geq\mathcal{B}$, that is,

\|f\|_{L^{\infty}(\Omega)}\vee\|w\|_{L^{\infty}(\Omega)}\vee\mathcal{B}\leq c_{3}<\infty.

First, we need to decompose the statistical error into four parts, and estimate each one.

Lemma 6
\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda}(u)-\widehat{\mathcal{L}}_{\lambda}(u)\right|\leq\sum_{j=1}^{3}\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,j}(u)-\widehat{\mathcal{L}}_{\lambda,j}(u)\right|+\frac{\lambda}{2}\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,4}(u)-\widehat{\mathcal{L}}_{\lambda,4}(u)\right|,

where

\mathcal{L}_{\lambda,1}(u)=|\Omega|\mathop{\mathbb{E}}_{X\sim U(\Omega)}\left[\frac{\|\nabla u(X)\|_{2}^{2}}{2}\right],\quad\widehat{\mathcal{L}}_{\lambda,1}(u)=\frac{|\Omega|}{N}\sum_{i=1}^{N}\left[\frac{\|\nabla u(X_{i})\|_{2}^{2}}{2}\right],
\mathcal{L}_{\lambda,2}(u)=|\Omega|\mathop{\mathbb{E}}_{X\sim U(\Omega)}\left[\frac{w(X)u^{2}(X)}{2}\right],\quad\widehat{\mathcal{L}}_{\lambda,2}(u)=\frac{|\Omega|}{N}\sum_{i=1}^{N}\left[\frac{w(X_{i})u^{2}(X_{i})}{2}\right],
\mathcal{L}_{\lambda,3}(u)=|\Omega|\mathop{\mathbb{E}}_{X\sim U(\Omega)}\left[u(X)f(X)\right],\quad\widehat{\mathcal{L}}_{\lambda,3}(u)=\frac{|\Omega|}{N}\sum_{i=1}^{N}\left[u(X_{i})f(X_{i})\right],
\mathcal{L}_{\lambda,4}(u)=|\partial\Omega|\mathop{\mathbb{E}}_{Y\sim U(\partial\Omega)}\left[Tu^{2}(Y)\right],\quad\widehat{\mathcal{L}}_{\lambda,4}(u)=\frac{|\partial\Omega|}{M}\sum_{j=1}^{M}\left[Tu^{2}(Y_{j})\right].
Proof

This is easily verified by the triangle inequality.

We use $\mu$ to denote $\mathrm{U}(\Omega)$ (resp. $\mathrm{U}(\partial\Omega)$). Given $n=N$ (resp. $M$) i.i.d. samples $\mathbf{Z}_{n}=\left\{Z_{i}\right\}_{i=1}^{n}$ from $\mu$, with $Z_{i}=X_{i}$ (resp. $Y_{i}$) $\sim\mu$, we need the following Rademacher complexity to measure the capacity of the given function class $\mathcal{N}$ restricted to the $n$ random samples $\mathbf{Z}_{n}$.

Definition 2

The Rademacher complexity of a set $A\subseteq\mathbb{R}^{n}$ is defined as

\mathfrak{R}(A)=\mathbb{E}_{\mathbf{Z}_{n},\Sigma_{n}}\left[\sup_{a\in A}\frac{1}{n}\left|\sum_{i}\sigma_{i}a_{i}\right|\right],

where $\Sigma_{n}=\left\{\sigma_{i}\right\}_{i=1}^{n}$ are $n$ i.i.d. Rademacher variables with $\mathbb{P}\left(\sigma_{i}=1\right)=\mathbb{P}\left(\sigma_{i}=-1\right)=1/2$. The Rademacher complexity of a function class $\mathcal{N}$ associated with the random sample $\mathbf{Z}_{n}$ is defined as

\mathfrak{R}(\mathcal{N})=\mathbb{E}_{\mathbf{Z}_{n},\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i}\sigma_{i}u\left(Z_{i}\right)\right|\right].
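
As a small illustration of this definition (ours, with a toy function class and placeholder sizes), the empirical Rademacher complexity can be estimated by Monte Carlo averaging over random signs:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
Z = rng.random(n)                                 # n samples Z_i ~ U(0, 1)

# Toy function class: u_theta(z) = sin(theta * z) for a grid of theta values.
thetas = np.linspace(0.5, 10.0, 50)
values = np.sin(np.outer(thetas, Z))              # shape (|class|, n): u_theta(Z_i)

# Monte Carlo estimate of E_sigma[ sup_u (1/n) | sum_i sigma_i u(Z_i) | ].
estimates = []
for _ in range(2000):
    sigma = rng.choice([-1.0, 1.0], size=n)       # Rademacher signs
    estimates.append(np.abs(values @ sigma).max() / n)
print(np.mean(estimates))                         # decays like O(1/sqrt(n)) as n grows
```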

For the sake of simplicity, we deal with the last three terms first.

Lemma 7

Suppose that $\psi:\mathbb{R}^{d}\times\mathbb{R}\rightarrow\mathbb{R}$, $(x,y)\mapsto\psi(x,y)$, is $\ell$-Lipschitz continuous in $y$ for all $x$. Let $\mathcal{N}$ be a class of functions on $\Omega$ and $\psi\circ\mathcal{N}=\{\psi\circ u:x\mapsto\psi(x,u(x)),\ u\in\mathcal{N}\}$. Then

\mathfrak{R}(\psi\circ\mathcal{N})\leq\ell\,\mathfrak{R}(\mathcal{N}).
Proof

Corollary 3.17 in ledoux2013probability .

Lemma 8
\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,2}(u)-\widehat{\mathcal{L}}_{\lambda,2}(u)\right|\right]\leq c_{3}^{2}\mathfrak{R}(\mathcal{N}^{2}),\quad\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,3}(u)-\widehat{\mathcal{L}}_{\lambda,3}(u)\right|\right]\leq c_{3}\mathfrak{R}(\mathcal{N}^{2}),\quad\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,4}(u)-\widehat{\mathcal{L}}_{\lambda,4}(u)\right|\right]\leq 2c_{3}\mathfrak{R}(\mathcal{N}^{2}).
Proof

Suppose $|y|\leq c_{3}$. Define

\psi_{2}(x,y)=\frac{w(x)y^{2}}{2},\quad\psi_{3}(x,y)=f(x)y,\quad\psi_{4}(x,y)=y^{2}.

According to the symmetrization method, we have

\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,2}(u)-\widehat{\mathcal{L}}_{\lambda,2}(u)\right|\right]\leq\mathfrak{R}(\psi_{2}\circ\mathcal{N}^{2}),\quad\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,3}(u)-\widehat{\mathcal{L}}_{\lambda,3}(u)\right|\right]\leq\mathfrak{R}(\psi_{3}\circ\mathcal{N}^{2}),\quad\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,4}(u)-\widehat{\mathcal{L}}_{\lambda,4}(u)\right|\right]\leq\mathfrak{R}(\psi_{4}\circ\mathcal{N}^{2}). (16)

The result then follows directly from Lemma 7 and (16), provided we show that $\psi_{i}(x,y)$, $i=2,3,4$, are $c_{3}^{2}$-, $c_{3}$- and $2c_{3}$-Lipschitz continuous in $y$ for all $x$, respectively. For arbitrary $y_{1}$, $y_{2}$ with $|y_{i}|\leq c_{3}$, $i=1,2$,

|\psi_{2}(x,y_{1})-\psi_{2}(x,y_{2})|=\left|\frac{w(x)y_{1}^{2}}{2}-\frac{w(x)y_{2}^{2}}{2}\right|=\frac{|w(x)(y_{1}+y_{2})|}{2}|y_{1}-y_{2}|\leq c_{3}^{2}|y_{1}-y_{2}|,
|\psi_{3}(x,y_{1})-\psi_{3}(x,y_{2})|=\left|f(x)y_{1}-f(x)y_{2}\right|=|f(x)||y_{1}-y_{2}|\leq c_{3}|y_{1}-y_{2}|,
|\psi_{4}(x,y_{1})-\psi_{4}(x,y_{2})|=\left|y_{1}^{2}-y_{2}^{2}\right|=|y_{1}+y_{2}||y_{1}-y_{2}|\leq 2c_{3}|y_{1}-y_{2}|.

We now turn to the most difficult term in Lemma 6. Since the gradient is not a Lipschitz operator, Lemma 7 does not apply and we cannot bound the Rademacher complexity in the same way.

Lemma 9
\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,1}(u)-\widehat{\mathcal{L}}_{\lambda,1}(u)\right|\right]\leq\mathfrak{R}(\mathcal{N}^{1,2}).
Proof

Based on the symmetrization method, we have

\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\left|\mathcal{L}_{\lambda,1}(u)-\widehat{\mathcal{L}}_{\lambda,1}(u)\right|\right]\leq\mathbb{E}_{\boldsymbol{Z}_{n},\Sigma_{n}}\left[\sup_{u\in\mathcal{N}^{2}}\frac{1}{n}\left|\sum_{i}\sigma_{i}\|\nabla u(Z_{i})\|^{2}\right|\right]. (17)

The lemma then follows from (17) together with the following claim.

Claim: Let $u$ be a function implemented by a $\mathrm{ReLU}^{2}$ network with depth $\mathcal{D}$ and width $\mathcal{W}$. Then $\|\nabla u\|_{2}^{2}$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network with depth $\mathcal{D}+3$ and width $d\left(\mathcal{D}+2\right)\mathcal{W}$.

Denote $\mathrm{ReLU}$ and $\mathrm{ReLU}^{2}$ by $\sigma_{1}$ and $\sigma_{2}$, respectively. As long as we show that each partial derivative $D_{i}u$ $(i=1,2,\cdots,d)$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network, we can easily obtain the desired network, since $\|\nabla u\|_{2}^{2}=\sum_{i=1}^{d}\left|D_{i}u\right|^{2}$ and the square function can be implemented via $x^{2}=\sigma_{2}(x)+\sigma_{2}(-x)$.

Now we show that for any $i=1,2,\cdots,d$, $D_{i}u$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network. We treat the first two layers in detail, since they differ slightly from the remaining layers, and apply induction for the layers $k\geq 3$. For the first layer, since $\sigma_{2}^{\prime}(x)=2\sigma_{1}(x)$, we have for any $q=1,2,\cdots,n_{1}$

D_{i}u_{q}^{(1)}=D_{i}\sigma_{2}\left(\sum_{j=1}^{d}a_{qj}^{(1)}x_{j}+b_{q}^{(1)}\right)=2\sigma_{1}\left(\sum_{j=1}^{d}a_{qj}^{(1)}x_{j}+b_{q}^{(1)}\right)\cdot a_{qi}^{(1)}.

Hence $D_{i}u_{q}^{(1)}$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network with depth $2$ and width $1$. For the second layer,

D_{i}u_{q}^{(2)}=D_{i}\sigma_{2}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}u_{j}^{(1)}+b_{q}^{(2)}\right)=2\sigma_{1}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}u_{j}^{(1)}+b_{q}^{(2)}\right)\cdot\sum_{j=1}^{n_{1}}a_{qj}^{(2)}D_{i}u_{j}^{(1)}.

Since $\sigma_{1}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}u_{j}^{(1)}+b_{q}^{(2)}\right)$ and $\sum_{j=1}^{n_{1}}a_{qj}^{(2)}D_{i}u_{j}^{(1)}$ can be implemented by two $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ subnetworks, respectively, and the multiplication can also be implemented by

x\cdot y=\frac{1}{4}\left[(x+y)^{2}-(x-y)^{2}\right]=\frac{1}{4}\left[\sigma_{2}(x+y)+\sigma_{2}(-x-y)-\sigma_{2}(x-y)-\sigma_{2}(-x+y)\right],

we conclude that $D_{i}u_{q}^{(2)}$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network. We have

\mathcal{D}\left(\sigma_{1}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}u_{j}^{(1)}+b_{q}^{(2)}\right)\right)=3,\quad\mathcal{W}\left(\sigma_{1}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}u_{j}^{(1)}+b_{q}^{(2)}\right)\right)\leq\mathcal{W}

and

\mathcal{D}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}D_{i}u_{j}^{(1)}\right)=2,\quad\mathcal{W}\left(\sum_{j=1}^{n_{1}}a_{qj}^{(2)}D_{i}u_{j}^{(1)}\right)\leq\mathcal{W}.

Thus $\mathcal{D}\left(D_{i}u_{q}^{(2)}\right)=4$ and $\mathcal{W}\left(D_{i}u_{q}^{(2)}\right)\leq\max\{2\mathcal{W},4\}$.

Now we apply induction for the layers $k\geq 3$. For the third layer,

D_{i}u_{q}^{(3)}=D_{i}\sigma_{2}\left(\sum_{j=1}^{n_{2}}a_{qj}^{(3)}u_{j}^{(2)}+b_{q}^{(3)}\right)=2\sigma_{1}\left(\sum_{j=1}^{n_{2}}a_{qj}^{(3)}u_{j}^{(2)}+b_{q}^{(3)}\right)\cdot\sum_{j=1}^{n_{2}}a_{qj}^{(3)}D_{i}u_{j}^{(2)}.

Since

\mathcal{D}\left(\sigma_{1}\left(\sum_{j=1}^{n_{2}}a_{qj}^{(3)}u_{j}^{(2)}+b_{q}^{(3)}\right)\right)=4,\quad\mathcal{W}\left(\sigma_{1}\left(\sum_{j=1}^{n_{2}}a_{qj}^{(3)}u_{j}^{(2)}+b_{q}^{(3)}\right)\right)\leq\mathcal{W}

and

\mathcal{D}\left(\sum_{j=1}^{n_{2}}a_{qj}^{(3)}D_{i}u_{j}^{(2)}\right)=4,\quad\mathcal{W}\left(\sum_{j=1}^{n_{2}}a_{qj}^{(3)}D_{i}u_{j}^{(2)}\right)\leq\max\{2\mathcal{W},4\mathcal{W}\}=4\mathcal{W},

we conclude that $D_{i}u_{q}^{(3)}$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network with $\mathcal{D}\left(D_{i}u_{q}^{(3)}\right)=5$ and $\mathcal{W}\left(D_{i}u_{q}^{(3)}\right)\leq\max\{5\mathcal{W},4\}=5\mathcal{W}$.

We assume that $D_{i}u_{q}^{(k)}$ $(q=1,2,\cdots,n_{k})$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network with $\mathcal{D}\left(D_{i}u_{q}^{(k)}\right)=k+2$ and $\mathcal{W}\left(D_{i}u_{q}^{(k)}\right)\leq(k+2)\mathcal{W}$. For the $(k+1)$-th layer,

D_{i}u_{q}^{(k+1)}=D_{i}\sigma_{2}\left(\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}u_{j}^{(k)}+b_{q}^{(k+1)}\right)=2\sigma_{1}\left(\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}u_{j}^{(k)}+b_{q}^{(k+1)}\right)\cdot\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}D_{i}u_{j}^{(k)}.

Since

\mathcal{D}\left(\sigma_{1}\left(\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}u_{j}^{(k)}+b_{q}^{(k+1)}\right)\right)=k+2,\quad\mathcal{W}\left(\sigma_{1}\left(\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}u_{j}^{(k)}+b_{q}^{(k+1)}\right)\right)\leq\mathcal{W},

and

\mathcal{D}\left(\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}D_{i}u_{j}^{(k)}\right)=k+2,\quad\mathcal{W}\left(\sum_{j=1}^{n_{k}}a_{qj}^{(k+1)}D_{i}u_{j}^{(k)}\right)\leq\max\{(k+2)\mathcal{W},4\mathcal{W}\}=(k+2)\mathcal{W},

we conclude that $D_{i}u_{q}^{(k+1)}$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network with $\mathcal{D}\left(D_{i}u_{q}^{(k+1)}\right)=k+3$ and $\mathcal{W}\left(D_{i}u_{q}^{(k+1)}\right)\leq\max\{(k+3)\mathcal{W},4\}=(k+3)\mathcal{W}$.

Hence $D_{i}u=D_{i}u_{1}^{(\mathcal{D})}$ can be implemented by a $\mathrm{ReLU}$-$\mathrm{ReLU}^{2}$ network with $\mathcal{D}\left(D_{i}u\right)=\mathcal{D}+2$ and $\mathcal{W}\left(D_{i}u\right)\leq\left(\mathcal{D}+2\right)\mathcal{W}$. Finally, we obtain $\mathcal{D}\left(\|\nabla u\|^{2}\right)=\mathcal{D}+3$ and $\mathcal{W}\left(\|\nabla u\|^{2}\right)\leq d\left(\mathcal{D}+2\right)\mathcal{W}$.
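
The first-layer identity underlying this construction, $D_{i}\sigma_{2}(a^{\top}x+b)=2\sigma_{1}(a^{\top}x+b)\,a_{i}$, can be checked with automatic differentiation; a minimal sketch (ours):

```python
import torch

relu = torch.relu
relu2 = lambda x: torch.relu(x) ** 2

# One ReLU^2 unit u(x) = sigma_2(a^T x + b); compare its gradient with the
# explicit ReLU-ReLU^2 representation 2 * sigma_1(a^T x + b) * a.
d = 3
a = torch.randn(d)
b = torch.randn(())
x = torch.randn(d, requires_grad=True)

u = relu2(a @ x + b)
grad_autograd = torch.autograd.grad(u, x)[0]
grad_explicit = 2 * relu(a @ x + b) * a
print(torch.allclose(grad_autograd, grad_explicit))   # True
```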

We are now in a position to bound the Rademacher complexities of $\mathcal{N}^{2}$ and $\mathcal{N}^{1,2}$. To obtain the estimates, we need to introduce the covering number, the VC-dimension and the pseudo-dimension, and to recall several of their properties.

Definition 3

Suppose that $W\subset\mathbb{R}^{n}$. For any $\epsilon>0$, let $V\subset\mathbb{R}^{n}$ be an $\epsilon$-cover of $W$ with respect to the distance $d_{\infty}$, that is, for any $w\in W$ there exists a $v\in V$ such that $d_{\infty}(w,v)<\epsilon$, where $d_{\infty}$ is defined by

d_{\infty}(u,v):=\|u-v\|_{\infty}.

The covering number $\mathcal{C}\left(\epsilon,W,d_{\infty}\right)$ is defined to be the minimum cardinality among all $\epsilon$-covers of $W$ with respect to the distance $d_{\infty}$.

Definition 4

Suppose that $\mathcal{N}$ is a class of functions from $\Omega$ to $\mathbb{R}$. Given $n$ samples $\mathbf{Z}_{n}=\left(Z_{1},Z_{2},\cdots,Z_{n}\right)\in\Omega^{n}$, the set $\mathcal{N}|_{\mathbf{Z}_{n}}\subset\mathbb{R}^{n}$ is defined by

\mathcal{N}|_{\mathbf{Z}_{n}}=\left\{\left(u\left(Z_{1}\right),u\left(Z_{2}\right),\cdots,u\left(Z_{n}\right)\right):u\in\mathcal{N}\right\}.

The uniform covering number $\mathcal{C}_{\infty}(\epsilon,\mathcal{N},n)$ is defined by

\mathcal{C}_{\infty}(\epsilon,\mathcal{N},n)=\max_{\mathbf{Z}_{n}\in\Omega^{n}}\mathcal{C}\left(\epsilon,\mathcal{N}|_{\mathbf{Z}_{n}},d_{\infty}\right).

Next we give an upper bound on $\mathfrak{R}\left(\mathcal{N}\right)$ in terms of the covering number of $\mathcal{N}$ by using Dudley's entropy formula dudley . We first recall Massart's finite class lemma.

Lemma 10 (Massart’s finite class lemma boucheron2013concentration )

For any finite set $V\subset\mathbb{R}^{n}$ with diameter $D=\sup_{v\in V}\|v\|_{2}$, it holds that

\mathbb{E}_{\Sigma_{n}}\left[\sup_{v\in V}\frac{1}{n}\left|\sum_{i}\sigma_{i}v_{i}\right|\right]\leq\frac{D}{n}\sqrt{2\log(2|V|)}.


Lemma 11 (Dudley’s entropy formula dudley )

Assume $0\in\mathcal{N}$ and that the diameter of $\mathcal{N}$ is at most $\mathcal{B}$, i.e., $\|u\|_{L^{\infty}(\Omega)}\leq\mathcal{B}$ for all $u\in\mathcal{N}$. Then

\mathfrak{R}(\mathcal{N})\leq\inf_{0<\delta<\mathcal{B}}\left(4\delta+\frac{12}{\sqrt{n}}\int_{\delta}^{\mathcal{B}}\sqrt{\log(2\mathcal{C}\left(\varepsilon,\mathcal{N},n\right))}\,\mathrm{d}\varepsilon\right).
Proof

By definition,

\mathfrak{R}(\mathcal{N})=\mathbb{E}_{\boldsymbol{Z}_{n}}\left[\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i}\sigma_{i}u(Z_{i})\right|\ \Big|\ \boldsymbol{Z}_{n}\right]\right].

Thus, it suffices to show

\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i}\sigma_{i}u(Z_{i})\right|\right]\leq\inf_{0<\delta<\mathcal{B}}\left(4\delta+\frac{12}{\sqrt{n}}\int_{\delta}^{\mathcal{B}}\sqrt{\log(2\mathcal{C}\left(\varepsilon,\mathcal{N},n\right))}\,\mathrm{d}\varepsilon\right)

by conditioning on $\boldsymbol{Z}_{n}$. Given a positive integer $K$, let $\varepsilon_{k}=2^{-k+1}\mathcal{B}$, $k=1,\ldots,K$. Let $C_{k}$ be an $\varepsilon_{k}$-cover of $\mathcal{N}|_{\boldsymbol{Z}_{n}}\subseteq\mathbb{R}^{n}$ of cardinality $\mathcal{C}(\varepsilon_{k},\mathcal{N}|_{\boldsymbol{Z}_{n}},d_{\infty})$. Then, by definition, for every $u\in\mathcal{N}$ there exists $c^{k}\in C_{k}$ such that

d_{\infty}(u|_{\boldsymbol{Z}_{n}},c^{k})=\max\{|u(Z_{i})-c^{k}_{i}|,\ i=1,\ldots,n\}\leq\varepsilon_{k},\quad k=1,\ldots,K.

Moreover, we denote the best approximation of $u$ in $C_{k}$ with respect to $d_{\infty}$ by $c^{k}(u)$. Then,

\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}u(Z_{i})\right|\right]=\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}(u(Z_{i})-c^{K}_{i}(u))+\sum_{j=1}^{K-1}\sum_{i=1}^{n}\sigma_{i}(c^{j}_{i}(u)-c^{j+1}_{i}(u))+\sum_{i=1}^{n}\sigma_{i}c^{1}_{i}(u)\right|\right]
\leq\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}(u(Z_{i})-c^{K}_{i}(u))\right|\right]+\sum_{j=1}^{K-1}\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}(c^{j}_{i}(u)-c^{j+1}_{i}(u))\right|\right]+\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}c^{1}_{i}(u)\right|\right].

Since $0\in\mathcal{N}$ and the diameter of $\mathcal{N}$ is at most $\mathcal{B}$, we can choose $C_{1}=\{0\}$ so that the third term above vanishes. By Hölder's inequality, the first term can be bounded by $\varepsilon_{K}$ as follows:

\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}(u(Z_{i})-c^{K}_{i}(u))\right|\right]\leq\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left(\sum_{i=1}^{n}|\sigma_{i}|\right)\max_{i=1,\ldots,n}\left|u(Z_{i})-c^{K}_{i}(u)\right|\right]\leq\varepsilon_{K}.

Let $V_{j}=\{c^{j}(u)-c^{j+1}(u):u\in\mathcal{N}\}$. Then by definition, the cardinalities of $V_{j}$ and $C_{j}$ satisfy

|V_{j}|\leq|C_{j}||C_{j+1}|\leq|C_{j+1}|^{2},

and the diameter $D_{j}$ of $V_{j}$ can be bounded as

D_{j}=\sup_{v\in V_{j}}\|v\|_{2}\leq\sqrt{n}\sup_{u\in\mathcal{N}}\|c^{j}(u)-c^{j+1}(u)\|_{\infty}\leq\sqrt{n}\sup_{u\in\mathcal{N}}\left(\|c^{j}(u)-u\|_{\infty}+\|u-c^{j+1}(u)\|_{\infty}\right)\leq\sqrt{n}(\varepsilon_{j}+\varepsilon_{j+1})\leq 3\sqrt{n}\varepsilon_{j+1}.

Then,

\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{j=1}^{K-1}\sum_{i=1}^{n}\sigma_{i}(c^{j}_{i}(u)-c^{j+1}_{i}(u))\right|\right]\leq\sum_{j=1}^{K-1}\mathbb{E}_{\Sigma_{n}}\left[\sup_{v\in V_{j}}\frac{1}{n}\left|\sum_{i=1}^{n}\sigma_{i}v_{i}\right|\right]\leq\sum_{j=1}^{K-1}\frac{D_{j}}{n}\sqrt{2\log(2|V_{j}|)}\leq\sum_{j=1}^{K-1}\frac{6\varepsilon_{j+1}}{\sqrt{n}}\sqrt{\log(2|C_{j+1}|)},

where we used the triangle inequality in the first inequality and Lemma 10 in the second. Putting all the above estimates together, we get

\mathbb{E}_{\Sigma_{n}}\left[\sup_{u\in\mathcal{N}}\frac{1}{n}\left|\sum_{i}\sigma_{i}u(Z_{i})\right|\right]\leq\varepsilon_{K}+\sum_{j=1}^{K-1}\frac{6\varepsilon_{j+1}}{\sqrt{n}}\sqrt{\log(2|C_{j+1}|)}\leq\varepsilon_{K}+\sum_{j=1}^{K}\frac{12(\varepsilon_{j}-\varepsilon_{j+1})}{\sqrt{n}}\sqrt{\log(2\mathcal{C}\left(\varepsilon_{j},\mathcal{N},n\right))}
\leq\varepsilon_{K}+\frac{12}{\sqrt{n}}\int_{\varepsilon_{K+1}}^{\mathcal{B}}\sqrt{\log(2\mathcal{C}\left(\varepsilon,\mathcal{N},n\right))}\,\mathrm{d}\varepsilon\leq\inf_{0<\delta<\mathcal{B}}\left(4\delta+\frac{12}{\sqrt{n}}\int_{\delta}^{\mathcal{B}}\sqrt{\log(2\mathcal{C}\left(\varepsilon,\mathcal{N},n\right))}\,\mathrm{d}\varepsilon\right),

where the last inequality holds because, for $0<\delta<\mathcal{B}$, we can choose $K$ to be the largest integer such that $\varepsilon_{K+1}>\delta$, so that $\varepsilon_{K}\leq 4\varepsilon_{K+2}\leq 4\delta$.

Definition 5

Let $\mathcal{N}$ be a set of functions from $X=\Omega$ (or $\partial\Omega$) to $\{0,1\}$. Suppose that $S=\left\{x_{1},x_{2},\cdots,x_{n}\right\}\subset X$. We say that $S$ is shattered by $\mathcal{N}$ if for any $b\in\{0,1\}^{n}$, there exists a $u\in\mathcal{N}$ satisfying

u\left(x_{i}\right)=b_{i},\quad i=1,2,\ldots,n.
Definition 6

The VC-dimension of $\mathcal{N}$, denoted by $\operatorname{VCdim}(\mathcal{N})$, is defined to be the maximum cardinality among all sets shattered by $\mathcal{N}$.

The VC-dimension reflects the capability of a class of functions to perform binary classification of points: the larger the VC-dimension, the stronger this capability. For more discussion of the VC-dimension, readers are referred to anthony2009neural .

For real-valued functions, the concept of VC-dimension can be generalized to the pseudo-dimension anthony2009neural .

Definition 7

Let $\mathcal{N}$ be a set of functions from $X$ to $\mathbb{R}$. Suppose that $S=\left\{x_{1},x_{2},\cdots,x_{n}\right\}\subset X$. We say that $S$ is pseudo-shattered by $\mathcal{N}$ if there exist $y_{1},y_{2},\cdots,y_{n}$ such that for any $b\in\{0,1\}^{n}$, there exists a $u\in\mathcal{N}$ satisfying

\operatorname{sign}\left(u\left(x_{i}\right)-y_{i}\right)=b_{i},\quad i=1,2,\ldots,n,

and we say that $\left\{y_{i}\right\}_{i=1}^{n}$ witnesses the shattering.

Definition 8

The pseudo-dimension of $\mathcal{N}$, denoted by $\operatorname{Pdim}(\mathcal{N})$, is defined to be the maximum cardinality among all sets pseudo-shattered by $\mathcal{N}$.

The following lemma shows a relationship between the uniform covering number and the pseudo-dimension.

Lemma 12

Let $\mathcal{N}$ be a set of real-valued functions from a domain $X$ to the bounded interval $[0,\mathcal{B}]$, and let $\varepsilon>0$. Then

\mathcal{C}(\varepsilon,\mathcal{N},n)\leq\sum_{i=1}^{\mathrm{Pdim}(\mathcal{N})}\binom{n}{i}\left(\frac{\mathcal{B}}{\varepsilon}\right)^{i},

which is less than $\left(\frac{en\mathcal{B}}{\varepsilon\cdot\mathrm{Pdim}(\mathcal{N})}\right)^{\mathrm{Pdim}(\mathcal{N})}$ for $n\geq\mathrm{Pdim}(\mathcal{N})$.

Proof

See Theorem 12.2 in anthony2009neural .

We now present the bound on the pseudo-dimension of $\mathcal{N}^{2}$ and $\mathcal{N}^{1,2}$.

Lemma 13

Let $p_{1},\cdots,p_{m}$ be polynomials in $n$ variables of degree at most $d$. If $n\leq m$, then

|\{(\operatorname{sign}(p_{1}(x)),\cdots,\operatorname{sign}(p_{m}(x))):x\in\mathbb{R}^{n}\}|\leq 2\left(\frac{2emd}{n}\right)^{n}.
Proof

See Theorem 8.3 in anthony2009neural .

Lemma 14

Let $\mathcal{N}$ be a set of functions such that

  • (i)

    they can be implemented by a neural network with depth no more than $\mathcal{D}$ and width no more than $\mathcal{W}$, and

  • (ii)

    the activation function in each unit is either $\mathrm{ReLU}$ or $\mathrm{ReLU}^{2}$.

Then

\operatorname{Pdim}(\mathcal{N})=\mathcal{O}(\mathcal{D}^{2}\mathcal{W}^{2}(\mathcal{D}+\log\mathcal{W})).
Proof

The argument follows the proof of Theorem 6 in bartlett2019nearly . The result stated here is somewhat stronger than Theorem 6 in bartlett2019nearly , since $\mathrm{VCdim}(\operatorname{sign}(\mathcal{N}))\leq\mathrm{Pdim}(\mathcal{N})$.

We consider a new set of functions:

\mathcal{\widetilde{N}}=\{\widetilde{u}(x,y)=\operatorname{sign}(u(x)-y):u\in\mathcal{N}\}

It is clear that \mathrm{Pdim}(\mathcal{N})\leq\mathrm{VCdim}(\mathcal{\widetilde{N}}). We now bound the VC-dimension of \mathcal{\widetilde{N}}. Denoting \mathcal{M} as the total number of parameters (weights and biases) in the neural network implementing functions in \mathcal{N}, in our case we want to derive the uniform bound for

K_{\{x_{i}\},\{y_{i}\}}(m):=|\{(\operatorname{sign}(u(x_{1},a)-y_{1}),\ldots,\operatorname{sign}(u(x_{m},a)-y_{m})):a\in\mathbb{R}^{\mathcal{M}}\}|

over all \{x_{i}\}_{i=1}^{m}\subset X and \{y_{i}\}_{i=1}^{m}\subset\mathbb{R}. Actually the maximum of K_{\{x_{i}\},\{y_{i}\}}(m) over all \{x_{i}\}_{i=1}^{m}\subset X and \{y_{i}\}_{i=1}^{m}\subset\mathbb{R} is the growth function \mathcal{G}_{\mathcal{\widetilde{N}}}(m). In order to apply Lemma 13, we partition the parameter space \mathbb{R}^{\mathcal{M}} into several subsets to ensure that in each subset u(x_{i},a)-y_{i} is a polynomial with respect to a without any breakpoints. In fact, our partition is exactly the same as the partition in bartlett2019nearly . Denote the partition as \{P_{1},P_{2},\cdots,P_{N}\} with some integer N satisfying

Ni=1𝒟12(2emki(1+(i1)2i1)i)iN\leq\prod_{i=1}^{\mathcal{D}-1}2\left(\frac{2emk_{i}(1+(i-1)2^{i-1})}{\mathcal{M}_{i}}\right)^{\mathcal{M}_{i}} (18)

where k_{i} and \mathcal{M}_{i} denote the number of units at the i-th layer and the total number of parameters at the inputs to units in all the layers up to layer i of the neural network implementing functions in \mathcal{N}, respectively. See bartlett2019nearly for the construction of the partition. Obviously we have

K{xi},{yi}(m)i=1N|{(sign(u(x1,a)y1),,sign(u(xm,a)ym)):aPi}|K_{\{x_{i}\},\{y_{i}\}}(m)\leq\sum_{i=1}^{N}|\{(\operatorname{sign}(u(x_{1},a)-y_{1}),\cdots,\operatorname{sign}(u(x_{m},a)-y_{m})):a\in P_{i}\}| (19)

Note that u(xi,a)yiu(x_{i},a)-y_{i} is a polynomial with respect to aa with degree the same as the degree of u(xi,a)u(x_{i},a), which is equal to 1+(𝒟1)2𝒟11+(\mathcal{D}-1)2^{\mathcal{D}-1} as shown in bartlett2019nearly . Hence by Lemma 13, we have

|{(sign(u(x1,a)y1),,sign(u(xm,a)ym)):aPi}|\displaystyle|\{(\operatorname{sign}(u(x_{1},a)-y_{1}),\cdots,\operatorname{sign}(u(x_{m},a)-y_{m})):a\in P_{i}\}|
2(2em(1+(𝒟1)2𝒟1)𝒟)𝒟.\displaystyle\leq 2\left(\frac{2em(1+(\mathcal{D}-1)2^{\mathcal{D}-1})}{\mathcal{M}_{\mathcal{D}}}\right)^{\mathcal{M}_{\mathcal{D}}}. (20)

Combining (18),(19),(20)(\ref{pdimb1}),(\ref{pdimb2}),(\ref{pdimb3}) yields

K{xi},{yi}(m)i=1𝒟2(2emki(1+(i1)2i1)i)i.K_{\{x_{i}\},\{y_{i}\}}(m)\leq\prod_{i=1}^{\mathcal{D}}2\left(\frac{2emk_{i}(1+(i-1)2^{i-1})}{\mathcal{M}_{i}}\right)^{\mathcal{M}_{i}}.

We then have

𝒢𝒩~(m)i=1𝒟2(2emki(1+(i1)2i1)i)i,\mathcal{G}_{\mathcal{\widetilde{N}}}(m)\leq\prod_{i=1}^{\mathcal{D}}2\left(\frac{2emk_{i}(1+(i-1)2^{i-1})}{\mathcal{M}_{i}}\right)^{\mathcal{M}_{i}},

since the maximum of K_{\{x_{i}\},\{y_{i}\}}(m) over all \{x_{i}\}_{i=1}^{m}\subset X and \{y_{i}\}_{i=1}^{m}\subset\mathbb{R} is the growth function \mathcal{G}_{\mathcal{\widetilde{N}}}(m). Following the same algebraic manipulations as in the proof of Theorem 6 in bartlett2019nearly , we obtain

Pdim(𝒩)𝒪(𝒟2𝒲2log𝒰+𝒟3𝒲2)=𝒪(𝒟2𝒲2(𝒟+log𝒲))\mathrm{Pdim}(\mathcal{N})\leq\mathcal{O}\left(\mathcal{D}^{2}\mathcal{W}^{2}\log\mathcal{U}+\mathcal{D}^{3}\mathcal{W}^{2}\right)=\mathcal{O}\left(\mathcal{D}^{2}\mathcal{W}^{2}\left(\mathcal{D}+\log\mathcal{W}\right)\right)

where 𝒰\mathcal{U} refers to the number of units of the neural network implementing functions in 𝒩\mathcal{N}.
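For a given architecture the simplified bound of Lemma 14 is easy to evaluate. The sketch below drops the absolute constant hidden in \mathcal{O}(\cdot), and the depth-width pairs are hypothetical, so the outputs only indicate how the pseudo-dimension bound scales with depth and width.

import math

def pdim_bound(depth, width):
    # Simplified pseudo-dimension bound of Lemma 14 with the absolute constant omitted:
    # Pdim(N) = O(D^2 W^2 (D + log W)).
    return depth ** 2 * width ** 2 * (depth + math.log(width))

for depth, width in [(3, 50), (5, 100), (8, 500)]:
    print(depth, width, f"{pdim_bound(depth, width):.3e}")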

With the above preparations, the statistical error can be bounded by a direct, if somewhat tedious, calculation.

Theorem 3.3

Let \mathcal{D} and \mathcal{W} be the depth and width of the network, respectively. Then

sta\displaystyle\mathcal{E}_{sta}\leq Cc3d(𝒟+3)(𝒟+2)𝒲𝒟+3+log(d(𝒟+2)𝒲)(lognn)1/2\displaystyle C_{c_{3}}d(\mathcal{D}+3)(\mathcal{D}+2)\mathcal{W}\sqrt{\mathcal{D}+3+\log(d(\mathcal{D}+2)\mathcal{W})}\left(\frac{\log n}{n}\right)^{1/2}
+Cc3d𝒟𝒲𝒟+log𝒲(lognn)1/2λ.\displaystyle+C_{c_{3}}d\mathcal{D}\mathcal{W}\sqrt{\mathcal{D}+\log\mathcal{W}}\left(\frac{\log n}{n}\right)^{1/2}\lambda.

where nn is the number of training samples on both the domain and the boundary.

Proof

In order to apply Lemma 11, we need to handle the term

\begin{split}&\frac{1}{\sqrt{n}}\int_{\delta}^{\mathcal{B}}\sqrt{\log(2\mathcal{C}(\epsilon,\mathcal{N},n))}d\epsilon\\ &\leq\frac{\mathcal{B}}{\sqrt{n}}+\frac{1}{\sqrt{n}}\int_{\delta}^{\mathcal{B}}\sqrt{\log\left(\frac{en\mathcal{B}}{\epsilon\cdot\mathrm{Pdim}(\mathcal{N})}\right)^{\mathrm{Pdim}(\mathcal{N})}}d\epsilon\\ &\leq\frac{\mathcal{B}}{\sqrt{n}}+\left(\frac{\mathrm{Pdim}(\mathcal{N})}{n}\right)^{1/2}\int_{\delta}^{\mathcal{B}}\sqrt{\log\left(\frac{en\mathcal{B}}{\epsilon\cdot\mathrm{Pdim}(\mathcal{N})}\right)}d\epsilon\end{split}

where in the first inequality we use Lemma 12. Now we calculate the integral. Set

t=log(enϵPdim(𝒩))t=\sqrt{\log\left(\frac{en\mathcal{B}}{\epsilon\cdot\mathrm{Pdim}(\mathcal{N})}\right)}

then \epsilon=\frac{en\mathcal{B}}{\mathrm{Pdim}(\mathcal{N})}\cdot e^{-t^{2}}. Denote t_{1}=\sqrt{\log\left(\frac{en\mathcal{B}}{\mathcal{B}\cdot\mathrm{Pdim}(\mathcal{N})}\right)}=\sqrt{\log\left(\frac{en}{\mathrm{Pdim}(\mathcal{N})}\right)} and t_{2}=\sqrt{\log\left(\frac{en\mathcal{B}}{\delta\cdot\mathrm{Pdim}(\mathcal{N})}\right)}. Then

δlog(enϵPdim(𝒩))𝑑ϵ=2enPdim(𝒩)t1t2t2et2𝑑t=2enPdim(𝒩)t1t2t(et22)𝑑t=enPdim(𝒩)[t1et12t2et22+t1t2et2𝑑t]enPdim(𝒩)[t1et12t2et22+(t2t1)et12]enPdim(𝒩)t2et12=log(enδPdim(𝒩))\begin{split}&\int_{\delta}^{\mathcal{B}}\sqrt{\log\left(\frac{en\mathcal{B}}{\epsilon\cdot\mathrm{Pdim}(\mathcal{N})}\right)}d\epsilon=\frac{2en\mathcal{B}}{\mathrm{Pdim}(\mathcal{N})}\int_{t_{1}}^{t_{2}}t^{2}e^{-t^{2}}dt\\ &=\frac{2en\mathcal{B}}{\mathrm{Pdim}(\mathcal{N})}\int_{t_{1}}^{t_{2}}t\left(\frac{-e^{-t^{2}}}{2}\right)^{\prime}dt\\ &=\frac{en\mathcal{B}}{\mathrm{Pdim}(\mathcal{N})}\left[t_{1}e^{-t_{1}^{2}}-t_{2}e^{-t_{2}^{2}}+\int_{t_{1}}^{t_{2}}e^{-t^{2}}dt\right]\\ &\leq\frac{en\mathcal{B}}{\mathrm{Pdim}(\mathcal{N})}\left[t_{1}e^{-t_{1}^{2}}-t_{2}e^{-t_{2}^{2}}+(t_{2}-t_{1})e^{-t_{1}^{2}}\right]\\ &\leq\frac{en\mathcal{B}}{\mathrm{Pdim}(\mathcal{N})}\cdot t_{2}e^{-t_{1}^{2}}=\mathcal{B}\sqrt{\log\left(\frac{en\mathcal{B}}{\delta\cdot\mathrm{Pdim}(\mathcal{N})}\right)}\end{split}

Choosing δ=(Pdim(𝒩)n)1/2\delta=\mathcal{B}\left(\frac{\mathrm{Pdim}(\mathcal{N})}{n}\right)^{1/2}\leq\mathcal{B}, by Lemma 11 and the above display, we get for both 𝒩=𝒩2\mathcal{N}=\mathcal{N}^{2} and 𝒩=𝒩1,2\mathcal{N}=\mathcal{N}^{1,2} there holds

(𝒩)4δ+12nδlog(2𝒞(ϵ,𝒩,n))𝑑ϵ\displaystyle\mathfrak{R}(\mathcal{N})\leq 4\delta+\frac{12}{\sqrt{n}}\int_{\delta}^{\mathcal{B}}\sqrt{\log(2\mathcal{C}(\epsilon,\mathcal{N},n))}d\epsilon
4δ+12n+12(Pdim(𝒩)n)1/2log(enδPdim(𝒩))\displaystyle\leq 4\delta+\frac{12\mathcal{B}}{\sqrt{n}}+12\mathcal{B}\left(\frac{\mathrm{Pdim}(\mathcal{N})}{n}\right)^{1/2}\sqrt{\log\left(\frac{en\mathcal{B}}{\delta\cdot\mathrm{Pdim}(\mathcal{N})}\right)}
2832(Pdim(𝒩)n)1/2log(enPdim(𝒩)).\displaystyle\leq 28\sqrt{\frac{3}{2}}\mathcal{B}\left(\frac{\mathrm{Pdim}(\mathcal{N})}{n}\right)^{1/2}\sqrt{\log\left(\frac{en}{\mathrm{Pdim}(\mathcal{N})}\right)}. (21)
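As a numerical sanity check of the entropy-integral estimate and of (21), the following sketch (with hypothetical values of n, \mathcal{B} and \mathrm{Pdim}(\mathcal{N}) of our own choosing) evaluates the integral by the trapezoid rule, compares it with the closed-form upper bound, and then evaluates the resulting Rademacher complexity bound.

import numpy as np

n, B, pdim = 10**5, 1.0, 10**3                 # hypothetical values with Pdim <= n
delta = B * np.sqrt(pdim / n)                  # the choice delta = B (Pdim/n)^{1/2}

eps = np.linspace(delta, B, 100001)
integrand = np.sqrt(np.log(np.e * n * B / (eps * pdim)))
integral = np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(eps))   # trapezoid rule
closed_form = B * np.sqrt(np.log(np.e * n * B / (delta * pdim)))

print(integral, closed_form)                   # the quadrature value stays below the bound
rad = 28 * np.sqrt(1.5) * B * np.sqrt(pdim / n) * np.sqrt(np.log(np.e * n / pdim))
print(rad)                                     # the Rademacher complexity bound (21)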

Then by Lemmas 6, 8, 9 and 11 and equation (21), we have

sta=2supu𝒩2|(u)^(u)|\displaystyle{\mathcal{E}_{sta}}=2\sup_{u\in\mathcal{N}^{2}}|\mathcal{L}(u)-\widehat{\mathcal{L}}(u)|
2(𝒩1,2)+2(2c32+2c3)(𝒩2)+2c32(𝒩2)λ\displaystyle\leq 2\mathfrak{R}(\mathcal{N}^{1,2})+2(2c_{3}^{2}+2c_{3})\mathfrak{R}(\mathcal{N}^{2})+2c_{3}^{2}\mathfrak{R}(\mathcal{N}^{2})\lambda
5632(Pdim(𝒩1,2)n)1/2log(enPdim(𝒩1,2))\displaystyle\leq 56\sqrt{\frac{3}{2}}\mathcal{B}\left(\frac{\mathrm{Pdim}(\mathcal{N}^{1,2})}{n}\right)^{1/2}\sqrt{\log\left(\frac{en}{\mathrm{Pdim}(\mathcal{N}^{1,2})}\right)}
+5632(2c32+2c3)(Pdim(𝒩2)n)1/2log(enPdim(𝒩2))\displaystyle+56\sqrt{\frac{3}{2}}(2c_{3}^{2}+2c_{3})\mathcal{B}\left(\frac{\mathrm{Pdim}(\mathcal{N}^{2})}{n}\right)^{1/2}\sqrt{\log\left(\frac{en}{\mathrm{Pdim}(\mathcal{N}^{2})}\right)}
+5632c32(Pdim(𝒩2)n)1/2log(enPdim(𝒩2))λ.\displaystyle+56\sqrt{\frac{3}{2}}c_{3}^{2}\mathcal{B}\left(\frac{\mathrm{Pdim}(\mathcal{N}^{2})}{n}\right)^{1/2}\sqrt{\log\left(\frac{en}{\mathrm{Pdim}(\mathcal{N}^{2})}\right)}\lambda.

Plugging the upper bound on \mathrm{Pdim} derived in Lemma 14 into the above display and using the relationship between the depth and width of \mathcal{N}^{2} and those of \mathcal{N}^{1,2}, we get

sta\displaystyle\mathcal{E}_{sta}\leq Cc3d(𝒟+3)(𝒟+2)𝒲𝒟+3+log(d(𝒟+2)𝒲)(lognn)1/2\displaystyle C_{c_{3}}d(\mathcal{D}+3)(\mathcal{D}+2)\mathcal{W}\sqrt{\mathcal{D}+3+\log(d(\mathcal{D}+2)\mathcal{W})}\left(\frac{\log n}{n}\right)^{1/2} (22)
+Cc3d𝒟𝒲𝒟+log𝒲(lognn)1/2λ.\displaystyle+C_{c_{3}}d\mathcal{D}\mathcal{W}\sqrt{\mathcal{D}+\log\mathcal{W}}\left(\frac{\log n}{n}\right)^{1/2}\lambda.
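To see how the bound of Theorem 3.3 behaves, the following sketch evaluates its two terms for a fixed hypothetical architecture (depth, width, d and \lambda chosen by us) with the generic constant C_{c_{3}} set to one; the numbers therefore only indicate the (\log n/n)^{1/2} decay in n and the linear growth in \lambda.

import math

def sta_bound(n, depth, width, d, lam, C=1.0):
    # The two terms of Theorem 3.3 with the generic constant C_{c_3} replaced by C.
    factor = math.sqrt(math.log(n) / n)
    term1 = (C * d * (depth + 3) * (depth + 2) * width
             * math.sqrt(depth + 3 + math.log(d * (depth + 2) * width)) * factor)
    term2 = C * d * depth * width * math.sqrt(depth + math.log(width)) * factor * lam
    return term1 + term2

for n in [10**4, 10**6, 10**8]:
    print(n, f"{sta_bound(n, depth=4, width=100, d=3, lam=10.0):.3e}")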

3.3 Error from the boundary penalty method

Although Lemma 3 shows the convergence property of the Robin problem (4) as \lambda\rightarrow\infty, namely

uλu,u_{\lambda}^{*}\rightarrow u^{*},

it says nothing about the convergence rate. In this section, we consider the error from the boundary penalty method. Roughly speaking, we bound the distance between the minimizers u^{*} and u_{\lambda}^{*} in terms of the penalty parameter \lambda.

Theorem 3.4

Suppose uλu_{\lambda}^{*} is the minimizer of (5) and uu^{*} is the minimizer of (3). Then

uλuH1(Ω)Cc3,dλ1.\|u_{\lambda}^{*}-u^{*}\|_{H^{1}(\Omega)}\leq C_{c_{3},d}\lambda^{-1}.
Proof

Following the idea proposed in Maury2009Numerical (proof of Proposition 2.3), we proceed as follows. For v\in H^{1}(\Omega), we introduce

Rλ(v)=12a(uv,uv)+λ2Ω(1λunv)2𝑑s.R_{\lambda}(v)=\frac{1}{2}a(u^{*}-v,u^{*}-v)+\frac{\lambda}{2}\int_{\partial\Omega}\left(-\frac{1}{\lambda}\frac{\partial u^{*}}{\partial n}-v\right)^{2}ds. (23)

Given \varphi\in H^{1}(\Omega) such that T\varphi=-\frac{\partial u^{*}}{\partial n}, we set w=\frac{1}{\lambda}\varphi+u^{*}. Since u^{*}\in H^{1}_{0}(\Omega), it follows that

Rλ(w)=12λ2a(φ,φ)+λ2Ω(u)2𝑑s=12λ2a(φ,φ)Cλ2,R_{\lambda}(w)=\frac{1}{2\lambda^{2}}a\left(\varphi,\varphi\right)+\frac{\lambda}{2}\int_{\partial\Omega}(u^{*})^{2}ds=\frac{1}{2\lambda^{2}}a\left(\varphi,\varphi\right)\leq C\lambda^{-2}, (24)

where C depends only on \mathcal{B}, w and \Omega. On the other hand, (23) can be rewritten as

Rλ(v)=\displaystyle R_{\lambda}(v)= 12a(u,u)a(u,v)+12a(v,v)+12λΩ(un)2𝑑s+Ωunv𝑑s\displaystyle\frac{1}{2}a(u^{*},u^{*})-a(u^{*},v)+\frac{1}{2}a(v,v)+\frac{1}{2\lambda}\int_{\partial\Omega}\left(\frac{\partial u^{*}}{\partial n}\right)^{2}ds+\int_{\partial\Omega}\frac{\partial u^{*}}{\partial n}vds
+λ2Ωv2𝑑s\displaystyle+\frac{\lambda}{2}\int_{\partial\Omega}v^{2}ds
=\displaystyle= 12a(u,u)+12a(v,v)+12λΩ(un)2𝑑s+λ2Ωv2𝑑sΩfv𝑑x\displaystyle\frac{1}{2}a(u^{*},u^{*})+\frac{1}{2}a(v,v)+\frac{1}{2\lambda}\int_{\partial\Omega}\left(\frac{\partial u^{*}}{\partial n}\right)^{2}ds+\frac{\lambda}{2}\int_{\partial\Omega}v^{2}ds-\int_{\Omega}fvdx
=\displaystyle= 12a(u,u)+12λΩ(un)2𝑑s+λ(v),\displaystyle\frac{1}{2}a(u^{*},u^{*})+\frac{1}{2\lambda}\int_{\partial\Omega}\left(\frac{\partial u^{*}}{\partial n}\right)^{2}ds+\mathcal{L}_{\lambda}(v),

where the second equality follows from the fact that

a(u,v)Ωunv𝑑s=Ωfv𝑑x,vH1(Ω).a(u^{*},v)-\int_{\partial\Omega}\frac{\partial u^{*}}{\partial n}vds=\int_{\Omega}fvdx,\ \forall v\in H^{1}(\Omega). (25)

Since R_{\lambda}(v)=\mathcal{L}_{\lambda}(v)+\mathrm{const}, u_{\lambda}^{*} is also the minimizer of R_{\lambda} over H^{1}(\Omega). Recalling (24), we obtain the following estimate of R_{\lambda}(u_{\lambda}^{*}):

0Rλ(uλ)=12a(uuλ,uuλ)+λ2Ω(1λunuλ)2𝑑sRλ(w)Cλ2.0\leq R_{\lambda}(u_{\lambda}^{*})=\frac{1}{2}a(u^{*}-u_{\lambda}^{*},u^{*}-u_{\lambda}^{*})+\frac{\lambda}{2}\int_{\partial\Omega}\left(-\frac{1}{\lambda}\frac{\partial u^{*}}{\partial n}-u_{\lambda}^{*}\right)^{2}ds\leq R_{\lambda}(w)\leq C\lambda^{-2}.

Since a(\cdot,\cdot) is coercive, we arrive at

uuλH1(Ω)Cc3,dλ1.\|u^{*}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}\leq C_{c_{3},d}\lambda^{-1}.

In hong2021rademacher , the bound \|u_{\lambda}^{*}-u^{*}\|_{H^{1}(\Omega)}\leq\mathcal{O}(\lambda^{-1/2}) was proved, which is suboptimal compared with the result derived here. In Mller2021ErrorEF ; muller2020deep , the \mathcal{O}(\lambda^{-1}) bound was obtained under some conditions that are difficult to verify.
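The \mathcal{O}(\lambda^{-1}) rate of Theorem 3.4 is easy to observe in one dimension. The following sketch is only an illustration outside the deep Ritz setting and uses a sample problem and discretization of our own choosing: it solves -u''=1 on (0,1) with the penalized (Robin) boundary condition \frac{1}{\lambda}\frac{\partial u}{\partial n}+u=0 by finite differences and records the H^{1} distance to the Dirichlet solution u^{*}(x)=x(1-x)/2; the errors decay like \lambda^{-1}.

import numpy as np

def solve_penalized(lam, m=1000):
    # Finite differences for -u'' = 1 on (0, 1); the boundary rows encode the
    # penalized (Robin) condition du/dn + lam * u = 0 with one-sided differences.
    h = 1.0 / m
    x = np.linspace(0.0, 1.0, m + 1)
    A = np.zeros((m + 1, m + 1))
    rhs = np.ones(m + 1)
    for i in range(1, m):
        A[i, i - 1], A[i, i], A[i, i + 1] = -1 / h**2, 2 / h**2, -1 / h**2
    A[0, 0], A[0, 1] = 1 / h + lam, -1 / h        # -u'(0) + lam * u(0) = 0
    A[m, m], A[m, m - 1] = 1 / h + lam, -1 / h    #  u'(1) + lam * u(1) = 0
    rhs[0] = rhs[m] = 0.0
    return x, np.linalg.solve(A, rhs)

def h1_error(x, u):
    e = u - x * (1 - x) / 2                       # error against the Dirichlet solution u*
    de = np.gradient(e, x)
    trap = lambda y: np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))
    return np.sqrt(trap(e**2) + trap(de**2))

for lam in [10.0, 100.0, 1000.0]:
    x, u = solve_penalized(lam)
    print(lam, h1_error(x, u))                    # decays roughly like 1 / (2 * lam)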

3.4 Convergence rate

Note that as \lambda\rightarrow\infty the approximation error \mathcal{E}_{app} and the statistical error \mathcal{E}_{sta} blow up, while as \lambda\rightarrow 0 the error from the boundary penalty blows up. Hence, there is a trade-off in choosing a proper \lambda.

Theorem 3.5

Let u^{*} be the weak solution of (1) with bounded f\in L^{2}(\Omega) and w\in L^{\infty}(\Omega), and let \widehat{u}_{\phi} be the minimizer of the discrete version of the associated Robin energy with penalty parameter \lambda. Let n be the number of training samples on the domain and on the boundary. Then there is a \mathrm{ReLU}^{2} network with depth and width satisfying

𝒟log2d+3,𝒲𝒪(4d(nlogn)12(d+2)4d),\mathcal{D}\leq\lceil\log_{2}d\rceil+3,\quad\mathcal{W}\leq\mathcal{O}\left(4d\left\lceil\left(\frac{n}{\log n}\right)^{\frac{1}{2(d+2)}}-4\right\rceil^{d}\right),

such that

𝔼𝑿,𝒀[u^ϕuH1(Ω)2]\displaystyle\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)\displaystyle C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)
+Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)λ\displaystyle+C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)\lambda
+Cc3,dλ2.\displaystyle+C_{c_{3},d}\lambda^{-2}.

Furthermore, for

λn13(d+2)(logn)d+33(d+2),\lambda\sim n^{\frac{1}{3(d+2)}}(\log n)^{-\frac{d+3}{3(d+2)}},

it holds that

𝔼𝑿,𝒀[u^ϕuH1(Ω)2]Cc1,c2,c3,d𝒪(n23(d+2)logn).\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{2}{3(d+2)}}\log n\right).
Proof

Combining Theorem 3.1, Theorem 3.2 and Theorem 3.3, and taking \varepsilon^{2}=C_{c_{1},c_{2},c_{3}}\left(\frac{\log n}{n}\right)^{\frac{1}{d+2}}, we obtain

𝔼𝑿,𝒀[u^ϕuλH1(Ω)2]\displaystyle\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}[\|\widehat{u}_{\phi}-u_{\lambda}^{*}\|_{H^{1}(\Omega)}^{2}]
\displaystyle\leq 2c11[Cc3d(𝒟+3)(𝒟+2)𝒲𝒟+3+log(d(𝒟+2)𝒲)(lognn)1/2+c3+12ε2]\displaystyle\frac{2}{c_{1}\wedge 1}\left[C_{c_{3}}d(\mathcal{D}+3)(\mathcal{D}+2)\mathcal{W}\sqrt{\mathcal{D}+3+\log(d(\mathcal{D}+2)\mathcal{W})}\left(\frac{\log n}{n}\right)^{1/2}+\frac{c_{3}+1}{2}\varepsilon^{2}\right]
+2c11[Cc3d𝒟𝒲𝒟+log𝒲(lognn)1/2+Cd2ε2]λ\displaystyle+\frac{2}{c_{1}\wedge 1}\left[C_{c_{3}}d\mathcal{D}\mathcal{W}\sqrt{\mathcal{D}+\log\mathcal{W}}\left(\frac{\log n}{n}\right)^{1/2}+\frac{C_{d}}{2}\varepsilon^{2}\right]\lambda
\displaystyle\leq 2c11[Cc34d2(logd+6)(logd+5)Cc2ε4d\displaystyle\frac{2}{c_{1}\wedge 1}\left[C_{c_{3}}4d^{2}(\lceil\log d\rceil+6)(\lceil\log d\rceil+5)\left\lceil\frac{Cc_{2}}{\varepsilon}-4\right\rceil^{d}\cdot\right.
logd+6+log(4d2(logd+5)Cc2ε4d)(lognn)1/2+c3+12ε2]\displaystyle\quad\left.\sqrt{\lceil\log d\rceil+6+\log\left(4d^{2}(\lceil\log d\rceil+5)\left\lceil\frac{Cc_{2}}{\varepsilon}-4\right\rceil^{d}\right)}\left(\frac{\log n}{n}\right)^{1/2}+\frac{c_{3}+1}{2}\varepsilon^{2}\right]
+2c11[Cc34d2(logd+3)Cc2ε4d\displaystyle+\frac{2}{c_{1}\wedge 1}\left[C_{c_{3}}4d^{2}(\lceil\log d\rceil+3)\left\lceil\frac{Cc_{2}}{\varepsilon}-4\right\rceil^{d}\cdot\right.
logd+3+log(4dCc2ε4d)(lognn)1/2+Cd2ε2]λ\displaystyle\quad\left.\sqrt{\lceil\log d\rceil+3+\log\left(4d\left\lceil\frac{Cc_{2}}{\varepsilon}-4\right\rceil^{d}\right)}\left(\frac{\log n}{n}\right)^{1/2}+\frac{C_{d}}{2}\varepsilon^{2}\right]\lambda
\displaystyle\leq Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)+Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)λ.\displaystyle C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)+C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)\lambda.

Using Theorem 3.1 and Theorem 3.4, it holds for all \lambda>0 that

𝔼𝑿,𝒀[u^ϕuH1(Ω)2]\displaystyle\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)\displaystyle C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right) (26)
+Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)λ\displaystyle+C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)\lambda
+Cc3,dλ2.\displaystyle+C_{c_{3},d}\lambda^{-2}.

We have derived the error estimate for fixed \lambda, and we are now in a position to choose a proper \lambda and obtain the convergence rate. Since (26) holds for any \lambda>0, we take the infimum over \lambda:

𝔼𝑿,𝒀[u^ϕuH1(Ω)2]infλ>0\displaystyle\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq\inf_{\lambda>0} {Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)\displaystyle\left\{C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)\right.
+Cc1,c2,c3,d𝒪(n1d+2(logn)d+3d+2)λ\displaystyle\left.+C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{1}{d+2}}(\log n)^{\frac{d+3}{d+2}}\right)\lambda\right.
+Cc3,dλ2}.\displaystyle\left.+C_{c_{3},d}\lambda^{-2}\right\}.

By taking

λn13(d+2)(logn)d+33(d+2),\lambda\sim n^{\frac{1}{3(d+2)}}(\log n)^{-\frac{d+3}{3(d+2)}},

we can obtain

𝔼𝑿,𝒀[u^ϕuH1(Ω)2]Cc1,c2,c3,d𝒪(n23(d+2)logn).\mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\left[\|\widehat{u}_{\phi}-u^{*}\|_{H^{1}(\Omega)}^{2}\right]\leq C_{c_{1},c_{2},c_{3},d}{\mathcal{O}}\left(n^{-\frac{2}{3(d+2)}}\log n\right).
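The prescriptions of Theorem 3.5 can be packaged as a simple a priori rule for the depth, the width, the penalty parameter and the predicted rate. In the sketch below all absolute constants hidden in \mathcal{O}(\cdot) and in the C's are omitted, and the width formula is clipped at one so that moderate n gives a sensible value; the outputs therefore only indicate the scaling in n and d.

import math

def drm_prior_rule(n, d):
    # A priori choices suggested by Theorem 3.5; all absolute constants are omitted,
    # and the width formula is clipped at 1 for moderate sample sizes.
    depth = math.ceil(math.log2(d)) + 3
    width = 4 * d * max(math.ceil((n / math.log(n)) ** (1 / (2 * (d + 2))) - 4), 1) ** d
    lam = n ** (1 / (3 * (d + 2))) * math.log(n) ** (-(d + 3) / (3 * (d + 2)))
    rate = n ** (-2 / (3 * (d + 2))) * math.log(n)
    return depth, width, lam, rate

for n in [10**4, 10**6, 10**8]:
    print(n, drm_prior_rule(n, d=3))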

4 Conclusions and Extensions

This paper provided a convergence rate analysis of deep Ritz methods for Laplace equations with Dirichlet boundary condition. Specifically, our study shed light on how to set the depth and width of the networks and how to choose the penalty parameter to achieve the desired convergence rate in terms of the number of training samples. The approximation error of deep \mathrm{ReLU}^{2} networks is estimated in the H^{1} norm. The statistical error is controlled via the Rademacher complexity of the non-Lipschitz composition of the gradient norm with a \mathrm{ReLU}^{2} network. We also analyzed the error introduced by the boundary penalty method.

There are several interesting directions for further research. First, the current analysis can be extended to general second order elliptic equations with other boundary conditions. Second, the approximation and statistical error bounds derived here can be used to study the nonasymptotic convergence rate of residual-based methods, such as PINNs. Finally, similar results may be applicable to deep Ritz methods for optimal control problems and inverse problems.

Acknowledgements.
Y. Jiao is supported in part by the National Science Foundation of China under Grant 11871474 and by the research fund of KLATASDSMOE of China. X. Lu is partially supported by the National Science Foundation of China (No. 11871385), the National Key Research and Development Program of China (No.2018YFC1314600) and the Natural Science Foundation of Hubei Province (No. 2019CFA007), and by the research fund of KLATASDSMOE of China. J. Yang was supported by NSFC (Grant No. 12125103, 12071362), the National Key Research and Development Program of China (No. 2020YFA0714200) and the Natural Science Foundation of Hubei Province (No. 2019CFA007).

Appendix A Appendix

A.1 Proof of Lemma 2

We claim that aλa_{\lambda} is coercive on H1(Ω)H^{1}(\Omega). In fact,

aλ(u,u)=a(u,u)+λΩu2dsCuH1(Ω)2,uH1(Ω),a_{\lambda}(u,u)=a(u,u)+\lambda\int_{\partial\Omega}u^{2}\ \mathrm{d}s\geq C\|u\|^{2}_{H^{1}(\Omega)},\ \forall u\in H^{1}(\Omega),

where C is the constant from the Poincaré inequality Gilbarg1983Elliptic . Thus, there exists a unique weak solution u_{\lambda}^{*}\in H^{1}(\Omega) such that

aλ(uλ,v)=f(v),vH1(Ω).a_{\lambda}(u_{\lambda}^{*},v)=f(v),\ \forall v\in H^{1}(\Omega).

One can check that u_{\lambda}^{*} is the unique minimizer of \mathcal{L}_{\lambda}(u) by a standard argument.

We now study the regularity of weak solutions of (4). For the following discussion, we first recall several useful classical results on second order elliptic equations from Evans2010PartialDE ; Gilbarg1983Elliptic .

Lemma 15

Assume wL(Ω)w\in L^{\infty}(\Omega), fL2(Ω)f\in L^{2}(\Omega), gH3/2(Ω)g\in H^{3/2}(\partial\Omega) and Ω\partial\Omega is sufficiently smooth. Suppose that uH1(Ω)u\in H^{1}(\Omega) is a weak solution of the elliptic boundary-value problem

{Δu+wu=f in Ωu=g on Ω.\left\{\begin{aligned} -\Delta u+wu&=f&&\text{ in }\Omega\\ u&=g&&\text{ on }\partial\Omega.\end{aligned}\right.

Then uH2(Ω)u\in H^{2}(\Omega) and there exists a positive constant CC, depending only on Ω\Omega and ww, such that

uH2(Ω)C(fL2(Ω)+gH3/2(Ω)).\|u\|_{H^{2}(\Omega)}\leq C\left(\|f\|_{L^{2}(\Omega)}+\|g\|_{H^{3/2}(\partial\Omega)}\right).
Proof

See Evans2010PartialDE ; Gilbarg1983Elliptic .
Lemma 16

Assume wL(Ω)w\in L^{\infty}(\Omega), fL2(Ω)f\in L^{2}(\Omega), gH1/2(Ω)g\in H^{1/2}(\partial\Omega) and Ω\partial\Omega is sufficiently smooth. Suppose that uH1(Ω)u\in H^{1}(\Omega) is a weak solution of the elliptic boundary-value problem

{Δu+wu=f in Ωun=g on Ω.\left\{\begin{aligned} -\Delta u+wu&=f&&\text{ in }\Omega\\ \frac{\partial u}{\partial n}&=g&&\text{ on }\partial\Omega.\end{aligned}\right.

Then uH2(Ω)u\in H^{2}(\Omega) and there exists a positive constant CC, depending only on Ω\Omega and ww, such that

uH2(Ω)C(fL2(Ω)+gH1/2(Ω)).\|u\|_{H^{2}(\Omega)}\leq C\left(\|f\|_{L^{2}(\Omega)}+\|g\|_{H^{1/2}(\partial\Omega)}\right).
Lemma 17

Assume wL(Ω)w\in L^{\infty}(\Omega), gH1/2(Ω)g\in H^{1/2}(\partial\Omega), Ω\partial\Omega is sufficiently smooth and λ>0\lambda>0. Let uH1(Ω)u\in H^{1}(\Omega) be the weak solution of the following Robin problem

{Δu+wu=0 in Ω1λun+u=g on Ω.\left\{\begin{aligned} -\Delta u+wu&=0&&\text{ in }\Omega\\ \frac{1}{\lambda}\frac{\partial u}{\partial n}+u&=g&&\text{ on }\partial\Omega.\end{aligned}\right. (27)

Then uH2(Ω)u\in H^{2}(\Omega) and there exists a positive constant CC independent of λ\lambda such that

uH2(Ω)CλgH1/2(Ω).\left\|u\right\|_{H^{2}(\Omega)}\leq C\lambda\|g\|_{H^{1/2}(\partial\Omega)}.
Proof

We follow the idea proposed in Costabel1996ASP , where it was used in a slightly different context. We first estimate the trace Tu=\left.u\right|_{\partial\Omega}. Define the Dirichlet-to-Neumann map

T~:u|Ωun|Ω,\widetilde{T}:\left.u\right|_{\partial\Omega}\mapsto\left.\frac{\partial u}{\partial n}\right|_{\partial\Omega},

where uu satisfies Δu+wu=0-\Delta u+wu=0 in Ω\Omega, then

Tu=(1λT~+I)1g.Tu=\left(\frac{1}{\lambda}\widetilde{T}+I\right)^{-1}g.

Now we show that \frac{1}{\lambda}\widetilde{T}+I is a positive definite operator on L^{2}(\partial\Omega). Notice that the variational formulation of (27) reads as follows:

Ωuvdx+Ωwuv𝑑x+λΩuv𝑑s=λΩgv𝑑s,vH1(Ω).\int_{\Omega}\nabla u\cdot\nabla vdx+\int_{\Omega}wuvdx+\lambda\int_{\partial\Omega}uvds=\lambda\int_{\partial\Omega}gvds,\ \forall v\in H^{1}(\Omega).

Taking v=u, we have

TuL2(Ω)2(1λT~+I)Tu,Tu.\|Tu\|_{L^{2}(\partial\Omega)}^{2}\leq\left\langle\left(\frac{1}{\lambda}\widetilde{T}+I\right)Tu,Tu\right\rangle.

This means that λ1T~+I\lambda^{-1}\widetilde{T}+I is a positive definite operator in L2(Ω)L^{2}(\partial\Omega), and further, (λ1T~+I)1(\lambda^{-1}\widetilde{T}+I)^{-1} is bounded. We have the estimate

TuH1/2(Ω)CgH1/2(Ω).\|Tu\|_{H^{1/2}(\partial\Omega)}\leq C\|g\|_{H^{1/2}(\partial\Omega)}. (28)

We rewrite the Robin problem (27) as follows

{Δu+wu=0 in Ωun+u=λ(g(1λ1)u) on Ω.\left\{\begin{aligned} -\Delta u+wu&=0&&\text{ in }\Omega\\ \frac{\partial u}{\partial n}+u&=\lambda\left(g-\left(1-\lambda^{-1}\right)u\right)&&\text{ on }\partial\Omega.\end{aligned}\right.

By Lemma 16 we have

uH2(Ω)Cλg(1λ1)TuH1/2(Ω)Cλ(gH1/2(Ω)+TuH1/2(Ω)).\|u\|_{H^{2}(\Omega)}\leq C\lambda\left\|g-\left(1-\lambda^{-1}\right)Tu\right\|_{H^{1/2}(\partial\Omega)}\leq C\lambda\left(\left\|g\right\|_{H^{1/2}(\partial\Omega)}+\left\|Tu\right\|_{H^{1/2}(\partial\Omega)}\right). (29)

Combining (28) and (29), we obtain the desired estimate.

With the help of the above lemmas, we now turn to the proof of the regularity of the weak solution.

Theorem A.1

Assume wL(Ω)w\in L^{\infty}(\Omega), fL2(Ω)f\in L^{2}(\Omega). Suppose that uH1(Ω)u\in H^{1}(\Omega) is a weak solution of the boundary-value problem (4). If Ω\partial\Omega is sufficiently smooth, then uH2(Ω)u\in H^{2}(\Omega), and we have the estimate

uH2(Ω)CfL2(Ω),\|u\|_{H^{2}(\Omega)}\leq C\|f\|_{L^{2}(\Omega)},

where the constant C depends only on \Omega and w.

Proof

We decompose (4) into two equations

{Δu0+wu0=finΩu0=0onΩ,\left\{\begin{aligned} -\Delta u_{0}+wu_{0}&=f&&\text{in}\ \Omega\\ u_{0}&=0&&\text{on}\ \partial\Omega,\\ \end{aligned}\right. (30)
{Δu1+wu1=0inΩ1λu1n+u1=u0nonΩ.\left\{\begin{aligned} -\Delta u_{1}+wu_{1}&=0&&\text{in}\ \Omega\\ \frac{1}{\lambda}\frac{\partial u_{1}}{\partial n}+u_{1}&=-\frac{\partial u_{0}}{\partial n}&&\text{on}\ \partial\Omega.\\ \end{aligned}\right. (31)

and obtain the solution of (4)

u=u0+1λu1.u=u_{0}+\frac{1}{\lambda}u_{1}.

Applying Lemma 15 to (30), we have

u0H2(Ω)CfL2(Ω),\|u_{0}\|_{H^{2}(\Omega)}\leq C\|f\|_{L^{2}(\Omega)}, (32)

where CC depends on Ω\Omega and ww. Using Lemma 17, it is easy to obtain

u1H2(Ω)Cλu0nH1/2(Ω)Cλu0H2(Ω),\left\|u_{1}\right\|_{H^{2}(\Omega)}\leq C\lambda\left\|\frac{\partial u_{0}}{\partial n}\right\|_{H^{1/2}(\partial\Omega)}\leq C\lambda\|u_{0}\|_{H^{2}(\Omega)}, (33)

where the last inequality follows from the trace theorem. Combining (32) and (33), the desired estimate follows from the triangle inequality.

References

  • (1) Anandkumar, A., Azizzadenesheli, K., Bhattacharya, K., Kovachki, N., Li, Z., Liu, B., Stuart, A.: Neural operator: Graph kernel network for partial differential equations. In: ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations (2020)
  • (2) Anitescu, C., Atroshchenko, E., Alajlan, N., Rabczuk, T.: Artificial neural network methods for the solution of second order boundary value problems. Cmc-computers Materials & Continua 59(1), 345–359 (2019)
  • (3) Anthony, M., Bartlett, P.L.: Neural network learning: Theoretical foundations. Cambridge University Press (2009)
  • (4) Babuska, I.: The finite element method with penalty. Mathematics of Computation (1973)
  • (5) Bartlett, P.L., Harvey, N., Liaw, C., Mehrabian, A.: Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res. 20(63), 1–17 (2019)
  • (6) Berner, J., Dablander, M., Grohs, P.: Numerically solving parametric families of high-dimensional kolmogorov partial differential equations via deep learning. In: H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 16615–16627. Curran Associates, Inc. (2020)
  • (7) Boucheron, S., Lugosi, G., Massart, P.: Concentration inequalities: A nonasymptotic theory of independence. Oxford university press (2013)
  • (8) Brenner, S., Scott, R.: The mathematical theory of finite element methods, vol. 15. Springer Science & Business Media (2007)
  • (9) Ciarlet, P.G.: The finite element method for elliptic problems. SIAM (2002)
  • (10) Costabel, M., Dauge, M.: A singularly perturbed mixed boundary value problem. Communications in Partial Differential Equations 21 (1996)
  • (11) De Boor, C.: A practical guide to splines, vol. 27. Springer-Verlag, New York (1978)
  • (12) Dissanayake, M., Phan-Thien, N.: Neural-network-based approximations for solving partial differential equations. Communications in Numerical Methods in Engineering 10(3), 195–201 (1994)
  • (13) Dudley, R.: The sizes of compact subsets of hilbert space and continuity of gaussian processes. Journal of Functional Analysis 1(3), 290–330 (1967). DOI https://doi.org/10.1016/0022-1236(67)90017-1. URL https://www.sciencedirect.com/science/article/pii/0022123667900171
  • (14) E, W., Ma, C., Wu, L.: The Barron space and the flow-induced function spaces for neural network models (2021)
  • (15) E, W., Wojtowytsch, S.: Some observations on partial differential equations in Barron and multi-layer spaces (2020)
  • (16) Evans, L.C.: Partial differential equations, second edition (2010)
  • (17) Gühring, I., Kutyniok, G., Petersen, P.: Error bounds for approximations with deep relu neural networks in ws,pw^{s,p} norms (2019)
  • (18) Gilbarg, D., Trudinger, N.: Elliptic partial differential equations of second order, 2nd ed. Springer (1998)
  • (19) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Advances in Neural Information Processing Systems 3 (2014). DOI 10.1145/3422622
  • (20) Greenfeld, D., Galun, M., Basri, R., Yavneh, I., Kimmel, R.: Learning to optimize multigrid PDE solvers. In: K. Chaudhuri, R. Salakhutdinov (eds.) Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 97, pp. 2415–2423. PMLR (2019)
  • (21) Han, J., Jentzen, A., E, W.: Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115(34), 8505–8510 (2018). DOI 10.1073/pnas.1718942115. URL https://www.pnas.org/content/115/34/8505
  • (22) He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE international conference on computer vision, pp. 1026–1034 (2015)
  • (23) Hong, Q., Siegel, J.W., Xu, J.: Rademacher complexity and numerical quadrature analysis of stable neural networks with applications to numerical pdes (2021)
  • (24) Hsieh, J.T., Zhao, S., Eismann, S., Mirabella, L., Ermon, S.: Learning neural pde solvers with convergence guarantees. In: International Conference on Learning Representations (2018)
  • (25) Hughes, T.J.: The Finite Element Method: Linear Static and Dynamic Finite Element Analysis. Courier Corporation (2012)
  • (26) Lagaris, I.E., Likas, A., Fotiadis, D.I.: Artificial neural networks for solving ordinary and partial differential equations. IEEE Trans. Neural Networks 9(5), 987–1000 (1998). URL https://doi.org/10.1109/72.712178
  • (27) Ledoux, M., Talagrand, M.: Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media (2013)
  • (28) Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Stuart, A., Bhattacharya, K., Anandkumar, A.: Multipole graph neural operator for parametric partial differential equations. In: H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 6755–6766. Curran Associates, Inc. (2020). URL https://proceedings.neurips.cc/paper/2020/file/4b21cf96d4cf612f239a6c322b10c8fe-Paper.pdf
  • (29) Li, Z., Kovachki, N.B., Azizzadenesheli, K., liu, B., Bhattacharya, K., Stuart, A., Anandkumar, A.: Fourier neural operator for parametric partial differential equations. In: International Conference on Learning Representations (2021)
  • (30) Lu, J., Lu, Y., Wang, M.: A priori generalization analysis of the deep ritz method for solving high dimensional elliptic equations (2021)
  • (31) Lu, L., Meng, X., Mao, Z., Karniadakis, G.E.: Deepxde: A deep learning library for solving differential equations. CoRR abs/1907.04502 (2019). URL http://arxiv.org/abs/1907.04502
  • (32) Luo, T., Yang, H.: Two-layer neural networks for partial differential equations: Optimization and generalization theory. ArXiv abs/2006.15733 (2020)
  • (33) Maury, B.: Numerical analysis of a finite element/volume penalty method. Siam Journal on Numerical Analysis 47(2), 1126–1148 (2009)
  • (34) Mishra, S., Molinaro, R.: Estimates on the generalization error of physics informed neural networks (pinns) for approximating pdes. ArXiv abs/2007.01138 (2020)
  • (35) Müller, J., Zeinhofer, M.: Deep ritz revisited. In: ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations (2020)
  • (36) Müller, J., Zeinhofer, M.: Error estimates for the variational training of neural networks with boundary penalty. ArXiv abs/2103.01007 (2021)
  • (37) Quarteroni, A., Valli, A.: Numerical Approximation of Partial Differential Equations, vol. 23. Springer Science & Business Media (2008)
  • (38) Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378, 686–707 (2019)
  • (39) Schultz, M.H.: Approximation theory of multivariate spline functions in sobolev spaces. SIAM Journal on Numerical Analysis 6(4), 570–582 (1969)
  • (40) Schumaker, L.: Spline functions: basic theory. Cambridge University Press (2007)
  • (41) Shin, Y., Zhang, Z., Karniadakis, G.: Error estimates of residual minimization using neural networks for linear pdes. ArXiv abs/2010.08019 (2020)
  • (42) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
  • (43) Sirignano, J.A., Spiliopoulos, K.: Dgm: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics 375, 1339–1364 (2018)
  • (44) Thomas, J.: Numerical Partial Differential Equations: Finite Difference Methods, vol. 22. Springer Science & Business Media (2013)
  • (45) Um, K., Brand, R., Fei, Y.R., Holl, P., Thuerey, N.: Solver-in-the-loop: Learning from differentiable physics to interact with iterative pde-solvers. In: H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 6111–6122. Curran Associates, Inc. (2020)
  • (46) Wang, S., Yu, X., Perdikaris, P.: When and why pinns fail to train: A neural tangent kernel perspective. ArXiv abs/2007.14527 (2020)
  • (47) Wang, Y., Shen, Z., Long, Z., Dong, B.: Learning to discretize: Solving 1d scalar conservation laws via deep reinforcement learning. Communications in Computational Physics 28(5), 2158–2179 (2020)
  • (48) Weinan, E., Yu, B.: The deep ritz method: A deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics 6(1), 1–12 (2017)
  • (49) Xu, J.: Finite neuron method and convergence analysis. Communications in Computational Physics 28(5), 1707–1745 (2020)
  • (50) Zang, Y., Bao, G., Ye, X., Zhou, H.: Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics 411, 109409 (2020)