
The Finite Neuron Method and Convergence Analysis

Jinchao Xu (xu@math.psu.edu), Department of Mathematics, Pennsylvania State University, University Park, PA, 16802, USA
(August 2020)
Abstract

We study a family of $H^{m}$-conforming piecewise polynomials based on artificial neural networks, called the finite neuron method (FNM), for the numerical solution of $2m$-th order partial differential equations in $\mathbb{R}^{d}$ for any $m,d\geq 1$, and we provide a convergence analysis for this method. Given a general domain $\Omega\subset\mathbb{R}^{d}$ and a partition $\mathcal{T}_{h}$ of $\Omega$, it is in general still an open problem how to construct a conforming finite element subspace of $H^{m}(\Omega)$ that has adequate approximation properties. Using techniques from artificial neural networks, we construct a family of $H^{m}$-conforming sets of functions consisting of piecewise polynomials of degree $k$ for any $k\geq m$, and we further obtain error estimates when they are applied to solve elliptic boundary value problems of any order in any dimension. For example, the following error estimate between the exact solution $u$ and the finite neuron approximation $u_{N}$ is obtained.

\|u-u_{N}\|_{H^{m}(\Omega)}=\mathcal{O}(N^{-{1\over 2}-{1\over d}}).

Discussions will also be given on the differences and relationship between the finite neuron method and finite element methods (FEM). For example, for the finite neuron method the underlying finite element grids are not given a priori, and the discrete solution can only be obtained by solving a non-linear and non-convex optimization problem. Despite the many desirable theoretical properties of the finite neuron method analyzed in this paper, its practical value is a subject of further investigation, since the aforementioned non-linear and non-convex optimization problem can be expensive and challenging to solve. For completeness and for the convenience of readers, some basic known results and their proofs are also included in this manuscript.

1 Introduction

This paper is devoted to the study of numerical methods for high order partial differential equations in any dimension using appropriate piecewise polynomial function classes. In this introduction, we briefly describe a class of elliptic boundary value problems of any order in any dimension, give an overview of some existing numerical methods for this model and other related problems, and finally explain the motivation and objective of this paper.

1.1 Model problem

Let $\Omega\subset\mathbb{R}^{d}$ be a bounded domain with a sufficiently smooth boundary $\partial\Omega$. For any integer $m\geq 1$, we consider the following model $2m$-th order partial differential equation with certain boundary conditions:

{Lu=fin Ω,Bk(u)=0on Ω(0km1),\left\{\begin{array}[]{rccl}\displaystyle Lu&=&f&\mbox{in }\Omega,\\ B^{k}(u)&=&0&\mbox{on }\partial\Omega\quad(0\leq k\leq m-1),\end{array}\right. (1.1)

where $L$ is the partial differential operator

Lu=|α|=m(1)mα(aα(x)αu)+a0(x)u,Lu=\sum_{|\alpha|=m}(-1)^{m}\partial^{\alpha}(a_{\alpha}(x)\,\partial^{\alpha}\,u)+a_{0}(x)u, (1.2)

and $\bm{\alpha}$ denotes a $d$-dimensional multi-index $\bm{\alpha}=(\alpha_{1},\cdots,\alpha_{d})$ with

|{\bm{\alpha}}|=\sum_{i=1}^{d}\alpha_{i},\quad\partial^{\bm{\alpha}}=\frac{\partial^{|{\bm{\alpha}}|}}{\partial x_{1}^{\alpha_{1}}\cdots\partial x_{d}^{\alpha_{d}}}.

For simplicity, we assume that the coefficients $a_{\alpha}$ (for $|\alpha|=m$) and $a_{0}$ are strictly positive and smooth functions on $\Omega$; namely, there exists $\alpha_{0}>0$ such that

a_{\alpha}(x),\ a_{0}(x)\geq\alpha_{0}\quad\forall x\in\Omega,\ |\alpha|=m. (1.3)

Given a nonnegative integer $k$ and a bounded domain $\Omega\subset\mathbb{R}^{d}$, let

H^{k}(\Omega):=\left\{v\in L^{2}(\Omega):\partial^{\alpha}v\in L^{2}(\Omega),\ |\alpha|\leq k\right\}

be the standard Sobolev space with norm and seminorm given respectively by

\|v\|_{k}:=\left(\sum_{|\alpha|\leq k}\|\partial^{\alpha}v\|_{0}^{2}\right)^{1/2},\quad|v|_{k}:=\left(\sum_{|\alpha|=k}\|\partial^{\alpha}v\|_{0}^{2}\right)^{1/2}.

For $k=0$, $H^{0}(\Omega)$ is the standard $L^{2}(\Omega)$ space, with the inner product denoted by $(\cdot,\cdot)$. Similarly, for any subset $K\subset\Omega$, the $L^{2}(K)$ inner product is denoted by $(\cdot,\cdot)_{0,K}$. We note that, under the assumption (1.3),

a(v,v)\gtrsim\|v\|^{2}_{m,\Omega}\quad\forall v\in H^{m}(\Omega). (1.4)

The boundary value problem (1.1) can be cast into an equivalent minimization or variational problem, as described below, for an appropriate subspace $V\subset H^{m}(\Omega)$.

Minimization Problem M:

Find $u\in V$ such that

J(u)=\min_{v\in V}J(v), (1.5)

or

Variational Problem V:

Find $u\in V$ such that

a(u,v)=\langle f,v\rangle\quad\forall v\in V. (1.6)

The bilinear form $a(\cdot,\cdot)$ in (1.6), the objective functional $J(\cdot)$ in (1.5) and the function space $V$ depend on the type of boundary condition in (1.1).

One popular type of boundary condition is the Dirichlet boundary condition, in which $B^{k}=B_{D}^{k}$ are given by the following Dirichlet-type trace operators

B_{D}^{k}(u):=\left.\frac{\partial^{k}u}{\partial\nu^{k}}\right|_{\partial\Omega}\quad(0\leq k\leq m-1), (1.7)

with $\nu$ being the outward unit normal vector of $\partial\Omega$.

For the aforementioned Dirichlet boundary condition, the elliptic boundary value problem (1.1) is equivalent to (1.5) or (1.6) with $V=H^{m}_{0}(\Omega)$ and

a(u,v):=|α|=m(aααu,αv)0,Ω+(a0u,v)u,vV,a(u,v):=\sum_{|\alpha|=m}(a_{\alpha}\partial^{\alpha}u,\partial^{\alpha}v)_{0,\Omega}+(a_{0}u,v)\quad\forall u,v\in V, (1.8)

and

J(v)=12a(v,v)Ωfv𝑑x.J(v)=\frac{1}{2}a(v,v)-\int_{\Omega}fvdx. (1.9)

Other boundary conditions, such as Neumann and mixed boundary conditions, are a bit more complicated to describe in the general case $m\geq 2$ and will be discussed later.
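
As a concrete illustration (added here only for orientation), consider the simplest case $m=1$ with $a_{\alpha}\equiv 1$ and Dirichlet boundary conditions; then (1.1) reads

-\Delta u+a_{0}u=f\ \mbox{ in }\Omega,\qquad u=0\ \mbox{ on }\partial\Omega,

with $V=H^{1}_{0}(\Omega)$, $a(u,v)=(\nabla u,\nabla v)_{0,\Omega}+(a_{0}u,v)$ and $J(v)=\frac{1}{2}a(v,v)-\int_{\Omega}fv\,dx$, while for $m=2$ one obtains a fourth-order problem of plate-bending type whose bilinear form (1.8) involves all second-order derivatives of $u$ and $v$.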

1.2 A brief overview of existing methods

Here we briefly review some classic finite element methods and other relevant methods for the numerical solution of the elliptic boundary value problem (1.1) for all $d,m\geq 1$.

Classic finite element methods use piecewise polynomial functions, based on a given subdivision of the domain (namely, a finite element grid), to discretize the variational problem. We will mainly review three different types of finite element methods: (1) conforming finite element methods; (2) nonconforming and discontinuous Galerkin methods; and (3) virtual element methods.

Conforming finite element method.

Given a finite element grid, this type of method constructs $V_{h}\subset V$ and finds $u_{h}\in V_{h}$ such that

J(uh)=minvhVhJ(vh).J(u_{h})=\min_{v_{h}\in V_{h}}J(v_{h}). (1.10)

It is well known that a piecewise polynomial space $V_{h}\subset H^{m}(\Omega)$ if and only if $V_{h}\subset C^{m-1}(\bar{\Omega})$. For $m=1$, piecewise linear finite element spaces $V_{h}\subset H^{1}(\Omega)$ are easily constructed on simplicial finite element grids in any dimension $d\geq 1$. The construction and analysis of the linear finite element method for $m=1$ and $d=2$ can be traced back to [Feng, 1965]. The situation becomes complicated when $m\geq 2$ and $d\geq 2$.

For example, it was proved that the construction of an $H^{2}$-conforming finite element space requires polynomials of degree at least five in two dimensions [Ženíšek, 1970] and degree nine in three dimensions [Lai and Schumaker, 2007]. We refer to [Argyris et al., 1968] for the classic quintic $H^{2}$ Argyris element in two dimensions and to [Zhang, 2009] for the ninth-degree $H^{2}$ element in three dimensions.

Many other efforts have been made in the literature on constructing $H^{m}$-conforming finite element spaces. [Bramble and Zlámal, 1970] proposed 2D simplicial $H^{m}$-conforming elements ($m\geq 1$) using polynomial spaces of degree $4m-3$, which generalize the $H^{2}$ Argyris element (cf. [Argyris et al., 1968, Ciarlet, 1978]) and the $H^{3}$ Ženíšek element (cf. [Ženíšek, 1970]). Again, the degree of the polynomials used is quite high. For (1.1), an alternative in 2D is to use mixed methods based on Helmholtz decompositions for tensor-valued functions (cf. [Schedensack, 2016]). However, the general construction of $H^{m}$-conforming elements in any dimension is still an open problem.

We note that the construction of a conforming finite element space depends on the structure of the underlying grid. For example, one can construct relatively low-order $H^{2}$ finite elements on grids with special structures. Examples include the (quadratic) Powell-Sabin and (cubic) Clough-Tocher elements in two dimensions [Powell and Sabin, 1977, Clough and Tocher, 1965], and the (quintic) Alfeld splits in three dimensions [Alfeld, 1984], for which full-order accuracy, namely $\mathcal{O}(h)$, $\mathcal{O}(h^{2})$ and $\mathcal{O}(h^{4})$, respectively, can be established. For more recent developments on Alfeld splits we refer to [Fu et al., 2020] and the references cited therein. But these constructions do not apply to general grids. For example, de Boor-DeVore-Höllig [Boor and Devore, 1983, Boor and Höllig, 1983] showed that the $H^{2}$ element consisting of piecewise cubic polynomials on a sequence of uniform grids does not provide full approximation accuracy. This hints that the structure of the underlying grid plays an important role in constructing $H^{m}$-conforming finite elements.

Nonconforming finite element and discontinuous Galerkin methods:

Given a finite element grid $\mathcal{T}_{h}$, in contrast to the conforming method, the nonconforming finite element method does not require that $V_{h}\subset V$; that is, $V_{h}\nsubseteq V$ is allowed. We find $u_{h}\in V_{h}$ such that

Jh(uh)=minvhVhJh(vh)J_{h}(u_{h})=\min_{v_{h}\in V_{h}}J_{h}(v_{h}) (1.11)

with

Jh(vh)=K𝒯hJK(vh)=K𝒯h12K|α|=maα|αvh|2+a0|vh|2dxKfvh𝑑x.\displaystyle J_{h}(v_{h})=\sum_{K\in\mathcal{T}_{h}}J_{K}(v_{h})=\sum_{K\in\mathcal{T}_{h}}\frac{1}{2}\int_{K}\sum_{|\alpha|=m}a_{\alpha}|\partial^{\alpha}v_{h}|^{2}+a_{0}|v_{h}|^{2}dx-\int_{K}fv_{h}dx.

One interesting example of a nonconforming element for (1.1) is the Morley element [Morley, 1967] for $m=d=2$, which uses piecewise quadratic polynomials. For $m\leq d$, Wang and Xu [Wang and Xu, 2013] provided a universal construction and analysis for a family of nonconforming finite elements consisting of piecewise polynomials of minimal order for (1.1) on simplicial grids in $\mathbb{R}^{d}$. The elements in [Wang and Xu, 2013], now known as MWX elements in the literature, give a natural generalization of the classic Morley element to the general case $1\leq m\leq d$. Recently, there have been a number of results extending the MWX elements. [Wu and Xu, 2019] enriched the $\mathcal{P}_{m}$ polynomial space by $\mathcal{P}_{m+1}$ bubble functions to obtain a family of $H^{m}$ nonconforming elements when $m=d+1$. [Hu and Zhang, 2017] used the full $\mathcal{P}_{4}$ polynomial space for the construction of a nonconforming element when $m=3,d=2$, which has three more local degrees of freedom than the element in [Wu and Xu, 2019]. They also used the full $\mathcal{P}_{2m-3}$ polynomial space for nonconforming finite element approximations when $m\geq 4,d=2$.

In addition to the aforementioned conforming and nonconforming finite element methods, discontinuous Galerkin (DG) methods, which make use of piecewise polynomial but globally discontinuous finite element functions, have also been used for solving high order partial differential equations, cf. [Baker, 1977]. The DG method requires many stabilization terms and parameters, and their number naturally grows as the order of the PDE grows. To reduce the amount of stabilization, one approach is to introduce some continuity and smoothness into the discrete space in place of totally discontinuous spaces. Examples of such an approach include the $C^{0}$ interior penalty DG methods for fourth order elliptic problems by Brenner and Sung [Brenner and Sung, 2005] and for sixth order elliptic equations by Gudi and Neilan [Gudi and Neilan, 2011]. More recently, Wu and Xu [Wu and Xu, 2017] provided a family of interior penalty nonconforming finite element methods for (1.1) in $\mathbb{R}^{d}$, for any $m\geq 0$, $d\geq 1$. This family of elements recovers the MWX elements of [Wang and Xu, 2013] when $m\leq d$, in which case no stabilization is required.

Virtual element method

The classic definition of finite element methods [Ciarlet, 1978] based on the finite element triple can be extended in many different ways. One successful extension is the virtual element method (VEM), in which general polygons or polyhedra are used as elements and non-polynomial functions are used as shape functions. For $m=1$, we refer to [Beirão da Veiga et al., 2013] and [Brezzi et al., 2014]. For $m=2$, we refer to [Brezzi and Marini, 2013] on conforming virtual element methods for plate bending problems, and to [Antonietti et al., 2018] on nonconforming virtual element methods for biharmonic problems. For general $m\geq 1$, we refer to [Chen and Huang, 2020] for nonconforming elements that extend the MWX elements of [Wang and Xu, 2013] from simplicial elements to polyhedral elements.

1.3 Objectives

Deep neural networks (DNNs) are a tool developed for machine learning [Goodfellow et al., 2016]. A DNN provides a very special function class that has been used for the numerical solution of partial differential equations, cf. [Lagaris et al., 1998]. By using different activation functions, such as sigmoidal functions, deep neural networks give rise to a very wide range of function classes that can be drastically different from the piecewise polynomial function classes used in classic finite element methods. One advantage of the DNN approach is that it is quite easy to obtain smooth, namely $H^{m}$-conforming for any $m\geq 1$, DNN functions by simply choosing smooth activation functions. These function classes, however, do not usually form a linear vector space, and hence the usual variational principle of classic finite element methods cannot be applied easily; instead, collocation-type methods are often used. Since DNNs are known to suffer much less from the "curse of dimensionality" than traditional function classes (such as polynomials or piecewise polynomials), DNN-based methods are potentially efficient for high dimensional problems and have been studied, for example, in [E et al., 2017] and [Sirignano and Spiliopoulos, 2018].

One main motivation of this paper is to explore DNN-type methods that are most closely related to traditional finite element methods. Namely, we are interested in DNN function classes that consist of piecewise polynomials. By exploring the relationship between DNNs and FEM, we hope, on the one hand, to expand or extend the traditional FEM approach by using new tools from DNNs, and, on the other hand, to gain and develop theoretical insights into and algorithmic tools for DNNs by combining the rich mathematical theories and techniques in FEM.

In an earlier work [He et al., 2020b], we studied the relationship between deep neural networks (DNNs) using ReLU as the activation function and continuous piecewise linear functions. One conclusion that can be drawn from [He et al., 2020b] is that any ReLU-DNN function is an $H^{1}$-conforming linear finite element function, and vice versa. The current work can be considered an extension of [He et al., 2020b] in that it uses ReLU$^{k}$-DNNs for high order partial differential equations. One focus of the current work is to provide error estimates when ReLU$^{k}$-DNNs are applied to solve high order partial differential equations. More specifically, we will study a special class of $H^{m}$-conforming generalized finite element methods (consisting of piecewise polynomials) based on artificial neural networks for (1.1) for any $m\geq 1$ and $d\geq 1$, and then provide a convergence analysis for this method. For this type of method, the underlying finite element grids are not given a priori, and the discrete solution is obtained by solving a non-linear and non-convex optimization problem. In the case that the boundary $\partial\Omega$ of $\Omega\subset\mathbb{R}^{d}$ is curved, it is often an issue how to construct a good finite element grid that accurately approximates $\partial\Omega$. As it turns out, this is not an issue for the finite neuron method, which is probably one of the advantages of the finite neuron method analyzed in this paper.

We note that the numerical method studied in this paper for elliptic boundary value problems is closely related to the classic finite element method; namely, it amounts to piecewise polynomials with respect to an implicitly defined grid. One can also argue that it can be viewed as a mesh-less or even vertex-less method. But in comparison with popular meshless methods, this method does correspond to some underlying grid, although this grid is not given a priori. The underlying grid is determined by the artificial neurons, which, mathematically speaking, refer to the hyperplanes $\omega_{i}\cdot x+b_{i}=0$, together with a given activation function. By combining the names of the finite element method and artificial neural networks, for convenience of exposition, we name the method studied in this paper the finite neuron method.

The rest of the paper is organized as follows. In Section 2, we describe the Monte Carlo sampling technique, the stratified sampling technique and the Barron space. In Section 3, we construct the finite neuron functions and prove their approximation properties. In Section 5, we propose the finite neuron method and provide its convergence analysis. Finally, in Section 6, we give some summary and discussion of the results of this paper.

Following [Xu, 1992], we use the notation "$x\lesssim y$" to denote "$x\leq Cy$" for some constant $C$ independent of crucial parameters such as the mesh size.

2 Preliminaries

In this section, for clarity of exposition, we present some standard material from statistics on Monte Carlo sampling and stratified sampling and their application to the analysis of the asymptotic approximation properties of neural network functions.

2.1 Monte-Carlo and stratified sampling techniques

Let $\lambda\geq 0$ be a probability density function on a domain $G\subset\mathbb{R}^{D}$ ($D\geq 1$), so that

Gλ(θ)𝑑θ=1.\int_{G}\lambda(\theta)d\theta=1. (2.1)

We define the expectation and variance as follows

𝔼g:=Gg(θ)λ(θ)𝑑θ,𝕍g:=𝔼((g𝔼g)2)=𝔼(g2)(𝔼g)2.\mathbb{E}g:=\int_{G}g(\theta)\lambda(\theta)d\theta,\qquad\mathbb{V}g:=\mathbb{E}((g-\mathbb{E}g)^{2})=\mathbb{E}(g^{2})-(\mathbb{E}g)^{2}. (2.2)

We note that

𝕍gmaxθ,θG(g(θ)g(θ))2.\displaystyle\mathbb{V}g\leq\max_{\theta,\theta^{\prime}\in G}(g(\theta)-g(\theta^{\prime}))^{2}.

For any subset $G_{i}\subset G$, let

λ(Gi)=Giλ(θ)𝑑θ,λi(θ)=λ(θ)λ(Gi).\lambda(G_{i})=\int_{G_{i}}\lambda(\theta)d\theta,\qquad\lambda_{i}(\theta)={\lambda(\theta)\over\lambda(G_{i})}.

It holds that

𝔼Gg=i=1Mλ(Gi)𝔼Gig.\mathbb{E}_{G}g=\sum_{i=1}^{M}\lambda(G_{i})\mathbb{E}_{G_{i}}g.

For any function $h(\theta_{1},\cdots,\theta_{N}):G_{1}\times G_{2}\times\cdots\times G_{N}\mapsto\mathbb{R}$, define

𝔼Gig=Gig(θ)λi(θ)𝑑θ\mathbb{E}_{G_{i}}g=\int_{G_{i}}g(\theta)\lambda_{i}(\theta)d\theta

and

𝔼Nh:=G1×G2××GNh(θ1,,θN)λ1(θ1)λ2(θ2)λN(θN)𝑑θ1𝑑θ2𝑑θN.\mathbb{E}_{N}h:=\int_{G_{1}\times G_{2}\times\ldots\times G_{N}}h(\theta_{1},\cdots,\theta_{N})\lambda_{1}(\theta_{1})\lambda_{2}(\theta_{2})\ldots\lambda_{N}(\theta_{N})d\theta_{1}d\theta_{2}\ldots d\theta_{N}. (2.3)

For the Monte Carlo method, let $G_{i}=G$ for all $1\leq i\leq N$; namely,

𝔼Nh:=G×G××Gh(θ1,,θN)λ(θ1)λ(θ2)λ(θN)𝑑θ1𝑑θ2𝑑θN.\mathbb{E}_{N}h:=\int_{G\times G\times\ldots\times G}h(\theta_{1},\cdots,\theta_{N})\lambda(\theta_{1})\lambda(\theta_{2})\ldots\lambda(\theta_{N})d\theta_{1}d\theta_{2}\ldots d\theta_{N}. (2.4)

The following result is standard [Rubinstein and Kroese, 2016], and its proof can be obtained by direct calculation.

Lemma 2.1.

For any $g\in L^{\infty}(G)$, we have

𝔼N(𝔼g1Ni=1Ng(ωi))2={1N𝕍(g)1Nsupω,ωG|g(ω)g(ω)|21N(𝔼(g2)(𝔼(g))2)1N𝔼(g2)1NgL2,\begin{split}\mathbb{E}_{N}\Big{(}\mathbb{E}g-\frac{1}{N}\sum_{i=1}^{N}g(\omega_{i})\Big{)}^{2}&=\left\{\begin{aligned} \frac{1}{N}\mathbb{V}(g)&\leq\frac{1}{N}\sup_{\omega,\omega^{\prime}\in G}|g(\omega)-g(\omega^{\prime})|^{2}\\ \frac{1}{N}\Big{(}\mathbb{E}(g^{2})-\big{(}\mathbb{E}(g)\big{)}^{2}\Big{)}&\leq\frac{1}{N}\mathbb{E}(g^{2})\leq\frac{1}{N}\|g\|^{2}_{L^{\infty}},\end{aligned}\right.\end{split} (2.5)
Proof.

First note that

(𝔼g1Ni=1Ng(ωi))2\displaystyle\left(\mathbb{E}g-\frac{1}{N}\sum_{i=1}^{N}g(\omega_{i})\right)^{2} =1N2(i=1N(𝔼gg(ωi)))2\displaystyle=\frac{1}{N^{2}}\left(\sum_{i=1}^{N}(\mathbb{E}g-g(\omega_{i}))\right)^{2} (2.6)
=1N2i,j=1N(𝔼gg(ωi))(𝔼gg(ωj))\displaystyle=\frac{1}{N^{2}}\sum_{i,j=1}^{N}(\mathbb{E}g-g(\omega_{i}))(\mathbb{E}g-g(\omega_{j}))
=I1N2+I2N2.\displaystyle=\frac{I_{1}}{N^{2}}+\frac{I_{2}}{N^{2}}.

with

I1=i=1N(𝔼gg(ωi))2,I2=ijN((𝔼g)2𝔼(g)(g(ωi)+g(ωj))+g(ωi)g(ωj))).I_{1}=\sum_{i=1}^{N}(\mathbb{E}g-g(\omega_{i}))^{2},\quad I_{2}=\sum_{i\neq j}^{N}\left((\mathbb{E}g)^{2}-\mathbb{E}(g)(g(\omega_{i})+g(\omega_{j}))+g(\omega_{i})g(\omega_{j}))\right). (2.7)

Consider I1I_{1}, for any ii,

𝔼N(𝔼gg(ωi))2=𝔼(𝔼gg)2=𝕍(g).\mathbb{E}_{N}(\mathbb{E}g-g(\omega_{i}))^{2}=\mathbb{E}(\mathbb{E}g-g)^{2}=\mathbb{V}(g).

Thus,

\mathbb{E}_{N}(I_{1})=N\mathbb{V}(g).

For I2I_{2}, note that

𝔼Ng(ωi)=𝔼Ng(ωj)=𝔼(g)\mathbb{E}_{N}g(\omega_{i})=\mathbb{E}_{N}g(\omega_{j})=\mathbb{E}(g)

and, for iji\neq j,

\mathbb{E}_{N}(g(\omega_{i})g(\omega_{j})) =\int_{G\times G\times\ldots\times G}g(\omega_{i})g(\omega_{j})\lambda(\omega_{1})\lambda(\omega_{2})\ldots\lambda(\omega_{N})d\omega_{1}d\omega_{2}\cdots d\omega_{N} (2.8)
=\int_{G\times G}g(\omega_{i})g(\omega_{j})\lambda(\omega_{i})\lambda(\omega_{j})d\omega_{i}d\omega_{j}
=\mathbb{E}(g(\omega_{i}))\,\mathbb{E}(g(\omega_{j}))=[\mathbb{E}(g)]^{2}.

Thus

𝔼N(I2)=𝔼N(ijN((𝔼g)2𝔼(g)(𝔼(g(ωi))+𝔼(g(ωj)))+𝔼(g(ωi)g(ωj))))=0.\mathbb{E}_{N}(I_{2})=\mathbb{E}_{N}\left(\sum_{i\neq j}^{N}((\mathbb{E}g)^{2}-\mathbb{E}(g)(\mathbb{E}(g(\omega_{i}))+\mathbb{E}(g(\omega_{j})))+\mathbb{E}(g(\omega_{i})g(\omega_{j})))\right)=0. (2.9)

Consequently, there exist the following two formulas for 𝔼N(𝔼g1Ni=1Ng(ωi))2\displaystyle\mathbb{E}_{N}\left(\mathbb{E}g-\frac{1}{N}\sum_{i=1}^{N}g(\omega_{i})\right)^{2}:

𝔼N(𝔼g1Ni=1Ng(ωi))2=1N2𝔼N(I1)={1N𝔼((𝔼gg)2)1N(𝔼(g2)(𝔼g)2).\mathbb{E}_{N}\left(\mathbb{E}g-\frac{1}{N}\sum_{i=1}^{N}g(\omega_{i})\right)^{2}=\frac{1}{N^{2}}\mathbb{E}_{N}(I_{1})=\left\{\begin{aligned} \frac{1}{N}\mathbb{E}\big{(}(\mathbb{E}g-g)^{2}\big{)}\\ \frac{1}{N}(\mathbb{E}(g^{2})-(\mathbb{E}g)^{2}).\end{aligned}\right. (2.10)

Based on the first formula above, since

|g(ω)𝔼g|=|G(g(ω)g(ω~))λ(ω~)𝑑ω~|supω,ωG|g(ω)g(ω)|,|g(\omega)-\mathbb{E}g|=|\int_{G}\big{(}g(\omega)-g(\tilde{\omega})\big{)}\lambda(\tilde{\omega})d\tilde{\omega}|\leq\sup_{\omega,\omega^{\prime}\in G}|g(\omega)-g(\omega^{\prime})|,

it holds that

𝔼N(𝔼g1Ni=1Ng(ωi))21Nsupω,ωG|g(ω)g(ω)|2.\mathbb{E}_{N}\left(\mathbb{E}g-{1\over N}\sum_{i=1}^{N}g(\omega_{i})\right)^{2}\leq\frac{1}{N}\sup_{\omega,\omega\in G}|g(\omega)-g(\omega^{\prime})|^{2}. (2.11)

Due to the second formula above,

𝔼N(𝔼g1Ni=1Ng(ωi))21N𝔼(g2)1NgL2\mathbb{E}_{N}\left(\mathbb{E}g-\frac{1}{N}\sum_{i=1}^{N}g(\omega_{i})\right)^{2}\leq\frac{1}{N}\mathbb{E}(g^{2})\leq\frac{1}{N}\|g\|^{2}_{L^{\infty}} (2.12)

which completes the proof. ∎
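
The following short Python script is a minimal numerical sketch of (2.5) (the target $g$, the domain $G=[0,1]$ with the uniform density and all constants are illustrative choices, not taken from the paper): the empirical mean squared error of the Monte Carlo average matches $\mathbb{V}(g)/N$.

import numpy as np

# Minimal sketch of Lemma 2.1: G = [0,1], lambda = 1, g(theta) = sin(2*pi*theta),
# so E g = 0 and V(g) = 1/2; the mean squared error of the Monte Carlo average decays like V(g)/N.
rng = np.random.default_rng(0)
g = lambda theta: np.sin(2 * np.pi * theta)
Eg, Vg = 0.0, 0.5

for N in (10, 100, 1000):
    trials = 5000
    omega = rng.random((trials, N))        # i.i.d. samples omega_i from the uniform density on G
    mc = g(omega).mean(axis=1)             # Monte Carlo averages (1/N) sum_i g(omega_i)
    mse = np.mean((Eg - mc) ** 2)          # empirical estimate of E_N (E g - (1/N) sum_i g(omega_i))^2
    print(N, mse, Vg / N)                  # the last two columns agree up to sampling noise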

Stratified sampling [Bickel and Freedman, 1984] gives a more refined version of the Monte Carlo method.

Lemma 2.2.

For any nonoverlapping decomposition $G=G_{1}\cup G_{2}\cup\cdots\cup G_{M}$ and any positive integer $n$, let $n_{i}=\lceil\lambda(G_{i})n\rceil$ be the smallest integer not less than $\lambda(G_{i})n$ and let $N=\sum_{i=1}^{M}n_{i}$. Let $\theta_{i,j}\in G_{i}$ ($1\leq j\leq n_{i}$) and

g_{N}=\sum_{i=1}^{M}\lambda(G_{i})g_{n_{i}}^{i}\quad\mbox{with}\quad g_{n_{i}}^{i}={1\over n_{i}}\sum_{j=1}^{n_{i}}g(\theta_{i,j}). (2.13)

It holds that

𝔼N(𝔼GggN)2=i=1Mλ2(Gi)ni𝔼Gi(g𝔼Gig)21nmax1iMsupθ,θGi|g(θ)g(θ)|2.\mathbb{E}_{N}(\mathbb{E}_{G}g-g_{N})^{2}=\sum_{i=1}^{M}{\lambda^{2}(G_{i})\over n_{i}}\mathbb{E}_{G_{i}}\big{(}g-\mathbb{E}_{G_{i}}g\big{)}^{2}\leq{1\over n}\max_{1\leq i\leq M}\sup_{\theta,\theta^{\prime}\in G_{i}}\big{|}g(\theta)-g(\theta^{\prime})\big{|}^{2}. (2.14)
Proof.

It follows from definition that

g(x,θ)=i=1Mλ(Gi)g(x,θ),𝔼Gg=i=1Mλ(Gi)Gigλi(θ)𝑑θ=i=1Mλ(Gi)𝔼Gig.g(x,\theta)=\sum_{i=1}^{M}\lambda(G_{i})g(x,\theta),\qquad\mathbb{E}_{G}g=\sum_{i=1}^{M}\lambda(G_{i})\int_{G_{i}}g\lambda_{i}(\theta)d\theta=\sum_{i=1}^{M}\lambda(G_{i})\mathbb{E}_{G_{i}}g. (2.15)

Thus, the difference g𝔼Ggg-\mathbb{E}_{G}g is a linear combination of g𝔼Gigg-\mathbb{E}_{G_{i}}g on each GiG_{i} as follows

g𝔼Gg=i=1Mλ(Gi)(g𝔼Gig).g-\mathbb{E}_{G}g=\sum_{i=1}^{M}\lambda(G_{i})\big{(}g-\mathbb{E}_{G_{i}}g\big{)}. (2.16)

It follows from

𝔼Gggn=i=1Mλ(Gi)(𝔼Giggnii)\mathbb{E}_{G}g-g_{n}=\sum_{i=1}^{M}\lambda(G_{i})(\mathbb{E}_{G_{i}}g-g^{i}_{n_{i}}) (2.17)

and (2.3) that

𝔼n(𝔼Gggn)2=i,j=1Mλ(Gi)λ(Gj)𝔼n((𝔼Giggnii)(𝔼Gjggnjj))=i,j=1Mλ(Gi)λ(Gj)Iij\mathbb{E}_{n}(\mathbb{E}_{G}g-g_{n})^{2}=\sum_{i,j=1}^{M}\lambda(G_{i})\lambda(G_{j})\mathbb{E}_{n}\big{(}(\mathbb{E}_{G_{i}}g-g^{i}_{n_{i}})(\mathbb{E}_{G_{j}}g-g^{j}_{n_{j}})\big{)}=\sum_{i,j=1}^{M}\lambda(G_{i})\lambda(G_{j})I_{ij} (2.18)

with

Iij=𝔼n((𝔼Giggnii)(𝔼Gjggnjj)).I_{ij}=\mathbb{E}_{n}\big{(}(\mathbb{E}_{G_{i}}g-g^{i}_{n_{i}})(\mathbb{E}_{G_{j}}g-g^{j}_{n_{j}})\big{)}.

By Lemma 2.1,

Iij=𝔼n((𝔼Giggnii)2)δij=1ni𝔼((𝔼Gigg)2)δij.I_{ij}=\mathbb{E}_{n}\big{(}(\mathbb{E}_{G_{i}}g-g^{i}_{n_{i}})^{2}\big{)}\delta_{ij}={1\over n_{i}}\mathbb{E}\big{(}(\mathbb{E}_{G_{i}}g-g)^{2}\big{)}\delta_{ij}. (2.19)

Thus,

𝔼n(𝔼Gggn)2=i=1Mλ2(Gi)ni𝔼Gi(g𝔼Gig)2,\mathbb{E}_{n}(\mathbb{E}_{G}g-g_{n})^{2}=\sum_{i=1}^{M}{\lambda^{2}(G_{i})\over n_{i}}\mathbb{E}_{G_{i}}\big{(}g-\mathbb{E}_{G_{i}}g\big{)}^{2}, (2.20)

which completes the proof. ∎

Lemma 2.1 and Lemma 2.2 are two simple identities with subsequent inequalities that can be verified by direct calculation. In fact, Lemma 2.1 is a special case of Lemma 2.2 with $M=1$. Lemmas 2.1 and 2.2 form the basis of Monte Carlo sampling and stratified sampling in statistics. In the presentation of this paper, we choose not to use any concepts related to random sampling.
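
A minimal numerical sketch comparing the two lemmas (again with the illustrative choices $G=[0,1]$, $\lambda\equiv 1$, $g(\theta)=\sin(2\pi\theta)$ and a uniform partition into $M$ cells): the stratified estimator has a much smaller mean squared error because only the within-cell oscillation of $g$ enters the bound (2.14).

import numpy as np

# Plain Monte Carlo (Lemma 2.1) versus stratified sampling (Lemma 2.2) for g(theta) = sin(2*pi*theta).
rng = np.random.default_rng(1)
g = lambda t: np.sin(2 * np.pi * t)

def plain_mc(N):
    return g(rng.random(N)).mean()

def stratified(n, M):
    ni = int(np.ceil(n / M))                               # n_i = ceil(lambda(G_i) * n), lambda(G_i) = 1/M
    edges = np.arange(M) / M
    samples = edges[:, None] + rng.random((M, ni)) / M     # theta_{i,j} in G_i = [i/M, (i+1)/M)
    return (g(samples).mean(axis=1) / M).sum()             # g_N = sum_i lambda(G_i) * g^i_{n_i}

trials, n, M = 5000, 100, 50
err_mc = np.mean([(0.0 - plain_mc(n)) ** 2 for _ in range(trials)])
err_st = np.mean([(0.0 - stratified(n, M)) ** 2 for _ in range(trials)])
print(err_mc, err_st)      # the stratified mean squared error is markedly smaller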

Given another domain $\Omega\subset\mathbb{R}^{d}$, we consider the case that $g(x,\theta)$ is a function of both $x\in\Omega$ and $\theta\in G$. Given any function $\rho\in L^{1}(G)$, we consider

u(x)=Gg(x,θ)ρ(θ)𝑑θu(x)=\int_{G}g(x,\theta)\rho(\theta)d\theta (2.21)

with $\|\rho\|_{L^{1}(G)}<\infty$. Let $\lambda(\theta)={\rho(\theta)\over\|\rho\|_{L^{1}(G)}}$. Thus,

u(x)=ρL1(G)Gg(x,θ)λ(θ)𝑑θu(x)=\|\rho\|_{L^{1}(G)}\int_{G}g(x,\theta)\lambda(\theta)d\theta (2.22)

with $\|\lambda\|_{L^{1}(G)}=1$.

We can apply the above two lemmas to the given function $u(x)$.

Lemma 2.3.

[Monte Carlo Sampling] Consider $u(x)$ in (2.21) with $0\leq\rho(\theta)\in L^{1}(G)$. For any $N\geq 1$, there exist $\theta_{i}^{*}\in G$ such that

\|u-u_{N}\|_{L^{2}(\Omega)}^{2}\leq\frac{\|\rho\|_{L^{1}(G)}}{N}\int_{G}\|g(\cdot,\theta)\|_{L^{2}(\Omega)}^{2}\rho(\theta)d\theta=\frac{\|\rho\|^{2}_{L^{1}(G)}}{N}\mathbb{E}(\|g(\cdot,\theta)\|_{L^{2}(\Omega)}^{2}),

where $\|g(\cdot,\theta)\|_{L^{2}(\Omega)}^{2}=\int_{\Omega}[g(x,\theta)]^{2}d\mu(x)$ and

u_{N}(x)=\frac{\|\rho\|_{L^{1}(G)}}{N}\sum_{i=1}^{N}g(x,\theta_{i}^{*}). (2.23)

Similarly, if $g(\cdot,\theta)\in H^{m}(\Omega)$, then for any $N\geq 1$ there exist $\theta_{i}^{*}\in G$, with $u_{N}$ given by (2.23), such that

\|u-u_{N}\|_{H^{m}(\Omega)}^{2}\leq\frac{\|\rho\|_{L^{1}(G)}}{N}\int_{G}\|g(\cdot,\theta)\|_{H^{m}(\Omega)}^{2}\rho(\theta)d\theta=\frac{\|\rho\|^{2}_{L^{1}(G)}}{N}\mathbb{E}(\|g(\cdot,\theta)\|_{H^{m}(\Omega)}^{2}). (2.24)
Proof.

Note that

u(x)=ρL1(G)𝔼(g).u(x)=\|\rho\|_{L^{1}(G)}\mathbb{E}(g). (2.25)

By Lemma 2.1,

𝔼n((𝔼(g(x,))1Ni=1Ng(x,θi)))2)1N𝔼(g2).\mathbb{E}_{n}\left(\bigg{(}\mathbb{E}(g(x,\cdot))-\frac{1}{N}\sum_{i=1}^{N}g(x,\theta_{i}))\bigg{)}^{2}\right)\leq{1\over N}\mathbb{E}(g^{2}).

Integrating with respect to $x$ on both sides, we get

𝔼n(h(θ1,θ2,,θN))1N𝔼(Ωg2𝑑μ(x)),\mathbb{E}_{n}\left(h(\theta_{1},\theta_{2},\cdots,\theta_{N})\right)\leq{1\over N}\mathbb{E}\Big{(}\int_{\Omega}g^{2}d\mu(x)\Big{)},

where

h(θ1,θ2,,θN)=Ω(𝔼(g(x,))1Ni=1Ng(x,θi)))2dμ(x).h(\theta_{1},\theta_{2},\cdots,\theta_{N})=\int_{\Omega}\bigg{(}\mathbb{E}(g(x,\cdot))-\frac{1}{N}\sum_{i=1}^{N}g(x,\theta_{i}))\bigg{)}^{2}d\mu(x).

Since $\mathbb{E}_{N}(1)=1$ and $\mathbb{E}_{N}(h)\leq{1\over N}\mathbb{E}\big(\int_{\Omega}g^{2}d\mu(x)\big)$, there exist $\theta_{i}^{*}\in G$ such that

h(θ1,θ2,,θN)1NΩ𝔼(g2)𝑑μ(x).h(\theta_{1}^{*},\theta_{2}^{*},\cdots,\theta_{N}^{*})\leq{1\over N}\int_{\Omega}\mathbb{E}(g^{2})d\mu(x).

This implies that

\mathbb{E}_{N}\|u-u_{N}\|_{L^{2}(\Omega)}^{2}\leq\frac{\|\rho\|^{2}_{L^{1}(G)}}{N}\int_{G}\|g(\cdot,\theta)\|_{L^{2}(\Omega)}^{2}\lambda(\theta)d\theta=\frac{\|\rho\|_{L^{1}(G)}}{N}\int_{G}\|g(\cdot,\theta)\|_{L^{2}(\Omega)}^{2}\rho(\theta)d\theta.

The proof of (2.24) is similar to the above $L^{2}$-error analysis, which completes the proof. ∎

Lemma 2.4.

[Stratified Sampling] For $u(x)$ in (2.21) with positive $\rho(\theta)\in L^{1}(G)$, given any positive integers $n$ and $M\leq n$ and any nonoverlapping decomposition $G=G_{1}\cup G_{2}\cup\cdots\cup G_{M}$, there exist $\{\theta_{i}^{\ast}\}_{i=1}^{N}$ with $n\leq N\leq 2n$ such that

uuNL2(Ω)N1/2ρL1(G)max1jMsupθj,θjGjg(x,θj)g(x,θj)L2(Ω)\|u-u_{N}\|_{L^{2}(\Omega)}\leq N^{-1/2}\|\rho\|_{L^{1}(G)}\max_{1\leq j\leq M}\sup_{\theta_{j},\theta_{j}^{\prime}\in G_{j}}\|g(x,\theta_{j})-g(x,\theta_{j}^{\prime})\|_{L^{2}(\Omega)} (2.26)

where

uN(x)=2ρL1(G)Ni=1Nβig(x,θi)u_{N}(x)={2\|\rho\|_{L^{1}(G)}\over N}\sum_{i=1}^{N}\beta_{i}g(x,\theta_{i}^{\ast})

and βi[0,1]\beta_{i}\in[0,1].

Proof.

Let $n_{j}=\lceil\lambda(G_{j})n\rceil$ and $\theta_{i,j}\in G_{j}$ ($1\leq i\leq n_{j}$). Define $N=\sum_{j=1}^{M}n_{j}$ and

u_{N}(x)=\|\rho\|_{L^{1}(G)}\sum_{j=1}^{M}\lambda(G_{j})g_{n_{j}}^{j}\quad\mbox{with}\quad g_{n_{j}}^{j}={1\over n_{j}}\sum_{i=1}^{n_{j}}g(x,\theta_{i,j}).

Since $\displaystyle u(x)=\|\rho\|_{L^{1}(G)}\sum_{j=1}^{M}\lambda(G_{j})\mathbb{E}_{G_{j}}g$, by Lemma 2.2,

𝔼nuuNL2(Ω)2=ρL1(G)2j=1Mλ2(Gj)nj𝔼Gj𝔼GjggL2(Ω)2ρL1(G)2j=1Mλ2(Gj)njsupθj,θjGjg(x,θj)g(x,θj)L2(Ω)2.\begin{split}\mathbb{E}_{n}\|u-u_{N}\|_{L^{2}(\Omega)}^{2}=&\|\rho\|_{L^{1}(G)}^{2}\sum_{j=1}^{M}{\lambda^{2}(G_{j})\over n_{j}}\mathbb{E}_{G_{j}}\|\mathbb{E}_{G_{j}}g-g\|^{2}_{L^{2}(\Omega)}\\ \leq&\|\rho\|_{L^{1}(G)}^{2}\sum_{j=1}^{M}{\lambda^{2}(G_{j})\over n_{j}}\sup_{\theta_{j},\theta_{j}^{\prime}\in G_{j}}\|g(x,\theta_{j})-g(x,\theta_{j}^{\prime})\|^{2}_{L^{2}(\Omega)}.\end{split} (2.27)

Since ${\lambda(G_{j})\over n_{j}}\leq{1\over n}$ and $\sum_{j=1}^{M}\lambda(G_{j})=1$,

𝔼nuuNL2(Ω)2n1ρL1(G)2max1jMsupθj,θjGjg(x,θj)g(x,θj)L2(Ω)2.\mathbb{E}_{n}\|u-u_{N}\|_{L^{2}(\Omega)}^{2}\leq n^{-1}\|\rho\|_{L^{1}(G)}^{2}\max_{1\leq j\leq M}\sup_{\theta_{j},\theta_{j}^{\prime}\in G_{j}}\|g(x,\theta_{j})-g(x,\theta_{j}^{\prime})\|^{2}_{L^{2}(\Omega)}. (2.28)

Hence there exist $\{\theta_{i,j}^{\ast}\}$ with $\theta_{i,j}^{\ast}\in G_{j}$ such that

uuNL2(Ω)2n1ρL1(G)2max1jMsupθj,θjGjg(x,θj)g(x,θj)L2(Ω)2.\|u-u_{N}\|_{L^{2}(\Omega)}^{2}\leq n^{-1}\|\rho\|_{L^{1}(G)}^{2}\max_{1\leq j\leq M}\sup_{\theta_{j},\theta_{j}^{\prime}\in G_{j}}\|g(x,\theta_{j})-g(x,\theta_{j}^{\prime})\|^{2}_{L^{2}(\Omega)}. (2.29)

Noting that $n\leq N\leq n+M\leq 2n$, we can write

uN(x)=2ρL1(G)Nj=1MNλ(Gj)2nji=1njg(x,θi,j)=2ρL1(G)Nj=1Mβi,ji=1njg(x,θi,j)u_{N}(x)={2\|\rho\|_{L^{1}(G)}\over N}\sum_{j=1}^{M}\frac{N\lambda(G_{j})}{2n_{j}}\sum_{i=1}^{n_{j}}g(x,\theta_{i,j}^{\ast})={2\|\rho\|_{L^{1}(G)}\over N}\sum_{j=1}^{M}\beta_{i,j}\sum_{i=1}^{n_{j}}g(x,\theta_{i,j}^{\ast})

with

βi,j=Nλ(Gj)2nj2λ(Gj)n2λ(Gj)n1,\beta_{i,j}=\frac{N\lambda(G_{j})}{2n_{j}}\leq\frac{2\lambda(G_{j})n}{2\lambda(G_{j})n}\leq 1, (2.30)

which completes the proof. ∎

2.2 Barron spectral space

Let us use a simple example to motivate the Barron space. Consider the Fourier transform of a real function $u:\mathbb{R}^{d}\mapsto\mathbb{R}$:

u^(ω)=(2π)d/2deiωxu(x)𝑑x.\hat{u}(\omega)=(2\pi)^{-d/2}\int_{\mathbb{R}^{d}}e^{-i\omega\cdot x}u(x)dx. (2.31)

This gives the following integral representation of $u$ in terms of the cosine function:

u(x)=Redeiωxu^(ω)𝑑ω=dcos(ωx+b(ω))|u^(ω)|𝑑ω,u(x)=Re\int_{\mathbb{R}^{d}}e^{i\omega\cdot x}\hat{u}(\omega)d\omega=\int_{\mathbb{R}^{d}}\cos(\omega\cdot x+b(\omega))|\hat{u}(\omega)|d\omega, (2.32)

where $\hat{u}(\omega)=e^{ib(\omega)}|\hat{u}(\omega)|$. Let

g(x,ω)=cos(ωx+b(ω)) and ρ(ω)=|u^(ω)|.g(x,\omega)=\cos(\omega\cdot x+b(\omega))\quad\mbox{ and }\quad\rho(\omega)=|\hat{u}(\omega)|. (2.33)

Thus,

u(x)=dg(x,ω)ρ(ω)𝑑ω,u(x)=\int_{\mathbb{R}^{d}}g(x,\omega)\rho(\omega)d\omega, (2.34)

If

d|u^(ω)|𝑑ω<,\int_{\mathbb{R}^{d}}|\hat{u}(\omega)|d\omega<\infty,

then $\|\rho\|_{L^{1}}<\infty$. By applying Lemma 2.3, there exist $\omega_{i}\in\mathbb{R}^{d}$ such that

uuN0,ΩN1/2u^L1(d).\|u-u_{N}\|_{0,\Omega}\leq N^{-1/2}\|\hat{u}\|_{L^{1}(\mathbb{R}^{d})}. (2.35)

where

uN(x)=u^L1(d)Ni=1Ncos(ωix+b(ωi))u_{N}(x)={\|\hat{u}\|_{L^{1}(\mathbb{R}^{d})}\over N}\sum_{i=1}^{N}\cos(\omega_{i}\cdot x+b(\omega_{i})) (2.36)

More generally, we consider the approximation property in the $H^{m}$-norm. By (2.32), writing $\cos^{(j)}$ for the $j$-th derivative of the cosine function,

\partial^{\alpha}u(x)=\int_{\mathbb{R}^{d}}\cos^{(|\alpha|)}(\omega\cdot x+b(\omega))\,\omega^{\alpha}|\hat{u}(\omega)|d\omega,\quad\forall\ |\alpha|\leq m. (2.37)

For any positive integer $m$, let

gm(x,ω)=cos(ωx+b(ω))(1+ω)mandρm(ω)=(1+ω)m|u^(ω)|,g_{m}(x,\omega)={\cos(\omega\cdot x+b(\omega))\over(1+\|\omega\|)^{m}}\quad\mbox{and}\quad\rho_{m}(\omega)=(1+\|\omega\|)^{m}|\hat{u}(\omega)|, (2.38)

where

ρmL1(d)=d(1+ω)m|u^(ω)|𝑑ω<.\|\rho_{m}\|_{L^{1}(\mathbb{R}^{d})}=\int_{\mathbb{R}^{d}}(1+\|\omega\|)^{m}|\hat{u}(\omega)|d\omega<\infty.

Then $\displaystyle u(x)=\int_{\mathbb{R}^{d}}g_{m}(x,\omega)\rho_{m}(\omega)d\omega=\|\rho_{m}\|_{L^{1}(\mathbb{R}^{d})}\,\mathbb{E}\,g_{m}(x,\omega)$. Define

uN(x)=ρmL1(d)Ni=1Ngm(x,ωi)=ρmL1(d)Ni=1Ncos(ωix+b(ωi))(1+ωi)m.u_{N}(x)={\|\rho_{m}\|_{L^{1}(\mathbb{R}^{d})}\over N}\sum_{i=1}^{N}g_{m}(x,\omega_{i})={\|\rho_{m}\|_{L^{1}(\mathbb{R}^{d})}\over N}\sum_{i=1}^{N}{\cos(\omega_{i}\cdot x+b(\omega_{i}))\over(1+\|\omega_{i}\|)^{m}}. (2.39)

It holds that

α(u(x)uN(x))=ρmL1(d)Ni=1N𝔼α(gm(x,ω)gm(x,ωi)).\partial^{\alpha}(u(x)-u_{N}(x))={\|\rho_{m}\|_{L^{1}(\mathbb{R}^{d})}\over N}\sum_{i=1}^{N}\mathbb{E}\partial^{\alpha}(g_{m}(x,\omega)-g_{m}(x,\omega_{i})).

By Lemma 2.1,

\mathbb{E}_{N}\sum_{|\alpha|\leq m}\|\partial^{\alpha}(u(x)-u_{N}(x))\|_{0,\Omega}^{2} \leq\|\rho_{m}\|_{L^{1}(\mathbb{R}^{d})}^{2}\sum_{|\alpha|\leq m}\int_{\Omega}\mathbb{E}_{N}\Big(\mathbb{E}\,\partial^{\alpha}g_{m}(x,\omega)-\frac{1}{N}\sum_{i=1}^{N}\partial^{\alpha}g_{m}(x,\omega_{i})\Big)^{2}dx (2.40)
\leq{\|\rho_{m}\|_{L^{1}(\mathbb{R}^{d})}^{2}\over N}\sum_{|\alpha|\leq m}\int_{\Omega}\mathbb{E}\left(\partial^{\alpha}g_{m}(x,\omega)\right)^{2}dx. (2.41)

Note that the definitions of $g_{m}$ and $\rho_{m}$ in (2.38) guarantee that

|αgm(x,ω)|1.|\partial^{\alpha}g_{m}(x,\omega)|\leq 1.

Thus,

𝔼N|α|mα(u(x)uN(x))0,Ω2ρmL1(d)2N.\mathbb{E}_{N}\sum_{|\alpha|\leq m}\|\partial^{\alpha}(u(x)-u_{N}(x))\|_{0,\Omega}^{2}\lesssim{\|\rho_{m}\|_{L^{1}(\mathbb{R}^{d})}^{2}\over N}.

This implies that there exist $\omega_{i}\in\mathbb{R}^{d}$ such that

uuNHm(Ω)N1/2d(1+ω)m|u^(ω)|𝑑ω.\|u-u_{N}\|_{H^{m}(\Omega)}\lesssim N^{-1/2}\int_{\mathbb{R}^{d}}(1+\|\omega\|)^{m}|\hat{u}(\omega)|d\omega. (2.42)

Given $v\in L^{2}(\Omega)$, consider all possible extensions $v_{E}:\mathbb{R}^{d}\mapsto\mathbb{R}$ with $v_{E}|_{\Omega}=v$ and define the Barron spectral norm, for any $s\geq 1$, by

vBs(Ω)=infvE|Ω=vd(1+ω)s|v^E(ω)|𝑑ω\|v\|_{B^{s}(\Omega)}=\inf_{v_{E}|_{\Omega}=v}\int_{\mathbb{R}^{d}}(1+\|\omega\|)^{s}|\hat{v}_{E}(\omega)|d\omega (2.43)

and Barron spectral space

Bs(Ω)={vL2(Ω):vBs(Ω)<}.B^{s}(\Omega)=\{v\in L^{2}(\Omega):\|v\|_{B^{s}(\Omega)}<\infty\}. (2.44)

In summary, we have

uuNHm(Ω)N1/2uBm(Ω).\|u-u_{N}\|_{H^{m}(\Omega)}\lesssim N^{-1/2}\|u\|_{B^{m}(\Omega)}. (2.45)

The estimate (2.35), first obtained in [Jones, 1992] using a slightly different technique, appears to be the first asymptotic error estimate for artificial neural networks. [Barron, 1993] extended Jones's estimate (2.35) to sigmoidal activation functions in place of the cosine.

The above short discussion reflects the core idea in the analysis of the approximation properties of artificial neural networks: represent $u$ as an expectation with respect to some probability distribution, as in (2.34); a simple application of Monte Carlo sampling then leads to error estimates like (2.42) for the special neural network function (2.36) using $\sigma=\cos$ as the activation function. For a more general activation function $\sigma$, we just need to derive a corresponding representation like (2.34) with $g$ expressed in terms of $\sigma$. Quantitative estimates on the order of approximation are obtained for sigmoidal activation functions in [Barron, 1993] and for periodic activation functions in [Mhaskar and Micchelli, 1995, Mhaskar and Micchelli, 1994]. Error estimates in Sobolev norms for general activation functions can be found in [Hornik et al., 1994]. A review of a variety of known results, especially for networks with one hidden layer, can be found in [Pinkus, 1999]. More recently, these results have been improved by a factor of $N^{1/d}$ in [Klusowski and Barron, 2016] using the idea of stratified sampling, based in part on the techniques in [Makovoz, 1996]. [Siegel and Xu, 2020] provides an analysis for general activation functions under very weak assumptions, which applies to essentially all activation functions used in practice. In [E et al., 2019a, E et al., 2019b, E and Wojtowytsch, 2020], a more refined definition of the Barron norm is introduced to give sharper approximation error bounds for neural networks.
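
As a minimal illustration of this idea (the target function and sampling below are convenient choices, not taken from the cited references): the one-dimensional Gaussian $u(x)=e^{-x^{2}/2}$ satisfies $u(x)=\mathbb{E}_{\omega\sim N(0,1)}\cos(\omega x)$, since the standard normal density is proportional to $|\hat{u}|$; averaging $N$ sampled cosine neurons therefore realizes the construction (2.36) up to normalization and exhibits the $N^{-1/2}$ decay of (2.35).

import numpy as np

# Minimal sketch of the Monte Carlo argument behind (2.34)-(2.36) for u(x) = exp(-x^2/2) on [-1,1].
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 1001)
u = np.exp(-x ** 2 / 2)

def l2_error(N, trials=20):
    """Average L^2(-1,1) error of u_N(x) = (1/N) sum_i cos(omega_i x), omega_i ~ N(0,1)."""
    dx = x[1] - x[0]
    errs = []
    for _ in range(trials):
        omega = rng.standard_normal(N)                    # frequencies drawn from the density |u_hat|
        uN = np.cos(np.outer(x, omega)).mean(axis=1)      # Monte Carlo average of cosine neurons
        errs.append(np.sqrt(np.sum((u - uN) ** 2) * dx))
    return sum(errs) / len(errs)

for N in (10, 100, 1000):
    print(N, l2_error(N))      # errors decay roughly like N^{-1/2}, consistent with (2.35)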

The following lemma shows a relationship between the Sobolev norms and the Barron spectral norm.

Lemma 2.5.

Let $m\geq 0$ be an integer, $\Omega\subset\mathbb{R}^{d}$ a bounded domain and $\epsilon>0$. Then, for any Schwartz function $v$, we have

vHm(Ω)vBm(Ω)vHm+d2+ϵ(Ω).\|v\|_{H^{m}(\Omega)}\lesssim\|v\|_{{B}^{m}(\Omega)}\lesssim\|v\|_{H^{m+{d\over 2}+\epsilon}(\Omega)}. (2.46)
Proof.

The first inequality in (2.46) and its proof can be found in [Siegel and Xu, 2020]. A version of the second inequality in (2.46) and its proof can be found in [Barron, 1993]. Below is a proof: by the definition and the Cauchy-Schwarz inequality,

\|v\|_{B^{m}(\Omega)}^{2}= \inf_{v_{E}|_{\Omega}=v}\left(\int_{\mathbb{R}^{d}}(1+\|\omega\|)^{m}|\hat{v}_{E}(\omega)|d\omega\right)^{2} (2.47)
\leq \int_{\mathbb{R}^{d}}(1+\|\omega\|)^{-d-2\epsilon}d\omega\,\inf_{v_{E}|_{\Omega}=v}\int_{\mathbb{R}^{d}}(1+\|\omega\|)^{d+2m+2\epsilon}|\hat{v}_{E}(\omega)|^{2}d\omega (2.48)
\lesssim \inf_{v_{E}|_{\Omega}=v}\int_{\mathbb{R}^{d}}(1+\|\omega\|)^{d+2m+2\epsilon}|\hat{v}_{E}(\omega)|^{2}d\omega\lesssim\|v\|_{H^{m+{d\over 2}+\epsilon}(\Omega)}^{2}. (2.49)

3 Finite neuron functions and approximation properties

As mentioned before, for $m=1$ the finite element space for (5.1) can be taken to be piecewise linear functions in any dimension $d\geq 1$. As shown in [He et al., 2020b], linear finite element functions can be represented by deep neural networks with ReLU as the activation function. Here

ReLU(x)=x+=max(0,x).\mbox{ReLU}(x)=x_{+}=\max(0,x). (3.1)

In this paper, we consider powers of ReLU as activation functions:

ReLUk(x)=[x+]k=[max(0,x)]k.\mbox{ReLU}^{k}(x)=[x_{+}]^{k}=[\max(0,x)]^{k}. (3.2)

We will use the short-hand notation $x_{+}^{k}=[x_{+}]^{k}$ in the rest of the paper.

We consider the following neural network function class with one hidden layer:

VNk={i=1Nai(wix+bi)+k,ai,bi1,wi1×d}.V_{N}^{k}=\left\{\sum_{i=1}^{N}a_{i}(w_{i}x+b_{i})_{+}^{k},a_{i},b_{i}\in\mathbb{R}^{1},w_{i}\in\mathbb{R}^{1\times d}\right\}. (3.3)

We note that $V_{N}^{k}$ is not a linear vector space. The definition of neural network function classes such as (3.3) can be traced back to [McCulloch and Pitts, 1943], and their early mathematical analysis can be found in [Hornik et al., 1989, Cybenko, 1989, Funahashi, 1989].

The functions in $V^{k}_{N}$ defined in (3.3) will be called finite neuron functions in this paper.
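
A minimal sketch (illustration only) of evaluating a member of the class $V_{N}^{k}$ in (3.3), i.e. a one-hidden-layer network with ReLU$^{k}$ activation; the parameter values $a_{i},w_{i},b_{i}$ below are arbitrary placeholders, not values produced by the method of this paper.

import numpy as np

def finite_neuron_function(x, a, W, b, k):
    """Evaluate sum_i a_i * (w_i . x + b_i)_+^k at the rows of x (shape: n_points x d)."""
    z = x @ W.T + b                      # pre-activations w_i . x + b_i, shape (n_points, N)
    return np.maximum(z, 0.0) ** k @ a   # ReLU^k, then linear combination with weights a_i

d, N, k = 2, 8, 3                        # dimension, number of neurons, ReLU power
rng = np.random.default_rng(1)
a, W, b = rng.standard_normal(N), rng.standard_normal((N, d)), rng.standard_normal(N)
x = rng.random((5, d))                   # a few sample points in [0,1]^2
print(finite_neuron_function(x, a, W, b, k))   # a piecewise polynomial of degree k, in C^{k-1}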

Lemma 3.1.

For any $k\geq 1$, $V_{N}^{k}$ consists of functions that are piecewise polynomials of degree $k$ with respect to a grid whose element interfaces are contained in the hyperplanes

w_{i}x+b_{i}=0,\quad 1\leq i\leq N;

see Fig. 3.1-4.1. Furthermore, if $k\geq m$,

V_{N}^{k}(\Omega)\subset H^{k}(\Omega)\subset H^{m}(\Omega).
Figure 3.1: Hyperplanes with $\ell=1$

The main goal of this section is to prove that an error estimate of the following type holds for some $\delta\geq 0$:

infvNVNkuvNHm(Ω)N12δ.\inf_{v_{N}\in V_{N}^{k}}\|u-v_{N}\|_{H^{m}(\Omega)}\lesssim N^{-{1\over 2}-\delta}. (3.4)

We will use two different approaches to establish (3.4). The first approach, presented in §3.1, mainly follows [Hornik et al., 1994] and [Siegel and Xu, 2020], which give error estimates for a general class of activation functions. The second approach, presented in §3.2, follows [Klusowski and Barron, 2016], which gives error estimates specifically for the ReLU activation function.

We assume that $\Omega\subset\mathbb{R}^{d}$ is a given bounded domain. Thus,

T=maxxΩ¯x<.T=\max_{x\in\bar{\Omega}}\|x\|<\infty. (3.5)

3.1 B-spline as activation functions

The activation function ${\rm[ReLU]}^{k}$ in (3.2) is related to cardinal B-splines. The cardinal B-spline of degree $k\geq 0$, denoted by $b^{k}$, is defined by the convolution

bk(x)=(bk1b0)(x)=bk1(xt)b0(t)𝑑t,b^{k}(x)=(b^{k-1}*b^{0})(x)=\int_{\mathbb{R}}b^{k-1}(x-t)b^{0}(t)dt, (3.6)

where

b0(x)={1x[0,1),0otherwise.b^{0}(x)=\left\{\begin{array}[]{lr}1&x\in[0,1),\\ 0&\hbox{otherwise}.\end{array}\right. (3.7)
Figure 3.2: Plots of some B-spline basis functions

More explicitly (see [de Boor, 1971]), for any $x\in[0,k+1]$ and $k\geq 1$ we have

bk(x)=xkbk1(x)+k+1xkbk1(x1),b^{k}(x)=\frac{x}{k}b^{k-1}(x)+\frac{k+1-x}{k}b^{k-1}(x-1), (3.8)

or

bk(x)=(k+1)i=0k+1wi(ix)+k and wi=j=0,jik+11ij.b^{k}(x)=(k+1)\sum_{i=0}^{k+1}w_{i}(i-x)_{+}^{k}\hbox{~{}and~{}}w_{i}={\displaystyle\prod_{j=0,j\neq i}^{k+1}}\frac{1}{i-j}. (3.9)

We note that all $b^{k}$ are locally supported; see Fig. 3.2 for their plots.
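
A minimal sketch (illustration only) of the explicit ReLU$^{k}$ representation (3.9) of the cardinal B-spline $b^{k}$, checked numerically against the recursion (3.8) with $b^{0}$ the indicator of $[0,1)$.

import numpy as np

def bspline_relu(x, k):
    """b^k(x) via (3.9): (k+1) * sum_i w_i * (i - x)_+^k."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for i in range(k + 2):
        w_i = np.prod([1.0 / (i - j) for j in range(k + 2) if j != i])
        total += w_i * np.maximum(i - x, 0.0) ** k
    return (k + 1) * total

def bspline_recursive(x, k):
    """b^k(x) via the recursion (3.8), starting from the indicator function (3.7)."""
    x = np.asarray(x, dtype=float)
    if k == 0:
        return ((x >= 0) & (x < 1)).astype(float)
    return x / k * bspline_recursive(x, k - 1) + (k + 1 - x) / k * bspline_recursive(x - 1, k - 1)

x = np.linspace(0.0, 4.0, 9)
print(np.max(np.abs(bspline_relu(x, 3) - bspline_recursive(x, 3))))   # ~ 0 up to rounding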

For a uniform grid with mesh size $h=\frac{1}{n+1}$, we define

bj,hk(x)=bk(xhj).b^{k}_{j,h}(x)=b^{k}(\frac{x}{h}-j). (3.10)

Then the space of cardinal B-spline series of degree $k$ on the uniform grid is

Snk={v(x)=j=kncjbj,hk(x)}.S_{n}^{k}=\Big{\{}v(x)=\sum_{j=-k}^{n}c_{j}b^{k}_{j,h}(x)\Big{\}}. (3.11)
Lemma 3.2.

For $V_{N}^{k}$ and $S_{n}^{k}$ defined by (3.3) and (3.11), we have

SnkVn+k+1k.S_{n}^{k}\subset V_{n+k+1}^{k}. (3.12)

As a result, for any bounded domain $\Omega\subset\mathbb{R}^{1}$, we have

infvVNkuvm,ΩinfvSNk1kuvm,ΩNm(k+1)uk+1,Ω\inf_{v\in V_{N}^{k}}\|u-v\|_{m,\Omega}\leq\inf_{v\in S_{N-k-1}^{k}}\|u-v\|_{m,\Omega}\lesssim N^{m-(k+1)}\|u\|_{k+1,\Omega} (3.13)

Given an activation function $\sigma$, consider its Fourier transform:

σ^(a)=12πσ(t)eiat𝑑t.\hat{\sigma}(a)=\frac{1}{2\pi}\int_{\mathbb{R}}\sigma(t)e^{-iat}dt. (3.14)

For any $a\neq 0$ with $\hat{\sigma}(a)\neq 0$, making the change of variables $t=a^{-1}\omega\cdot x+b$, $dt=db$, we have

σ^(a)\displaystyle\hat{\sigma}(a) =12πσ(a1ωx+b)eia(a1ωx+b)𝑑b=eiωx12πσ(a1ωx+b)eiab𝑑b.\displaystyle=\frac{1}{2\pi}\int_{\mathbb{R}}\sigma(a^{-1}\omega\cdot x+b)e^{-ia(a^{-1}\omega\cdot x+b)}db=e^{-i\omega\cdot x}\frac{1}{2\pi}\int_{\mathbb{R}}\sigma(a^{-1}\omega\cdot x+b)e^{-iab}db. (3.15)

This implies that

eiωx=12πσ^(a)σ(a1ωx+b)eiab𝑑b.e^{i\omega\cdot x}=\frac{1}{2\pi\hat{\sigma}(a)}\int_{\mathbb{R}}\sigma(a^{-1}\omega\cdot x+b)e^{-iab}db. (3.16)

We write $\hat{u}(\omega)=e^{-i\theta(\omega)}|\hat{u}(\omega)|$ and then obtain the following integral representation:

u(x)=deiωxu^(ω)𝑑ω=d12πσ^(a)σ(a1ωx+b)|u^(ω)|ei(ab+θ(ω))𝑑b𝑑ωu(x)=\int_{\mathbb{R}^{d}}e^{i\omega\cdot x}\hat{u}(\omega)d\omega=\int_{\mathbb{R}^{d}}\int_{\mathbb{R}}\frac{1}{2\pi\hat{\sigma}(a)}\sigma\left(a^{-1}\omega\cdot x+b\right)|\hat{u}(\omega)|e^{-i(ab+\theta(\omega))}dbd\omega (3.17)

Now we take the activation function $\sigma(x)=b^{k}(x)$ and let $\hat{\sigma}$ be the Fourier transform of $\sigma$. Note that, by (3.6) and (3.7),

σ^(a)=(1eiaia)k+1=(2asina2)k+1eia(k+1)2.\displaystyle\hat{\sigma}(a)=\left({1-e^{-ia}\over ia}\right)^{k+1}=\left({2\over a}\sin{a\over 2}\right)^{k+1}e^{-{ia(k+1)\over 2}}. (3.18)

We first take $a=\pi$ in (3.18). Thus,

σ^(π)=(2π)k+1eiπ(k+1)2.\hat{\sigma}(\pi)=\left({2\over\pi}\right)^{k+1}e^{-{i\pi(k+1)\over 2}}. (3.19)

Combining (3.17) and (3.19), we obtain

u(x)=14(π2)kdσ(π1ωx+b)|u^(ω)|ei(πb+π(k+1)2+θ(ω))𝑑b𝑑ωu(x)=\frac{1}{4}\left({\pi\over 2}\right)^{k}\int_{\mathbb{R}^{d}}\int_{\mathbb{R}}\sigma\left(\pi^{-1}\omega\cdot x+b\right)|\hat{u}(\omega)|e^{-i(\pi b+{\pi(k+1)\over 2}+\theta(\omega))}dbd\omega (3.20)

An application of the Monte Carlo method in Lemma 2.1 to the integral representation (3.20) leads to the following estimate.

Theorem 3.3.

For any $0\leq m\leq k$, there exist $\omega_{i}\in\mathbb{R}^{d}$, $b_{i}\in\mathbb{R}$ and $\beta_{i}\in\mathbb{R}$ such that

uuNHm(Ω)N12uBm+1(Ω)\left\|u-u_{N}\right\|_{H^{m}(\Omega)}\lesssim N^{-{1\over 2}}\|u\|_{{B}^{m+1}(\Omega)} (3.21)

with

uN(x)=i=1Nβibk(π1ωix+bi).u_{N}(x)=\sum_{i=1}^{N}\beta_{i}b^{k}\left(\pi^{-1}\omega_{i}\cdot x+b_{i}\right). (3.22)

Based on the integral representation (3.20), a stratified sampling argument similar to the one in [Siegel and Xu, 2020] leads to the following result.

Theorem 3.4.

For any $0\leq m\leq k$ and any $\epsilon>0$, there exist $\omega_{i}\in\mathbb{R}^{d}$, $b_{i}\in\mathbb{R}$ and $\beta_{i}\in\mathbb{R}$ such that

uuNHm(Ω)N12ϵ(d+1)(2+ϵ)uBm+1+ϵ(Ω)\left\|u-u_{N}\right\|_{H^{m}(\Omega)}\leq N^{-{1\over 2}-{\epsilon\over(d+1)(2+\epsilon)}}\|u\|_{{B}^{m+1+\epsilon}(\Omega)} (3.23)

with

uN(x)=i=1Nβibk(ω¯ix+bi).u_{N}(x)=\sum_{i=1}^{N}\beta_{i}b^{k}\left(\bar{\omega}_{i}\cdot x+b_{i}\right). (3.24)

Next, we improve the estimate (3.23). Again, we use (3.17). Take $\displaystyle a_{\omega}=4\pi\lceil{\|\omega\|\over 4\pi}\rceil+\pi$ in (3.18) and set $\displaystyle\bar{\omega}={\omega\over a_{\omega}}$. We have

σ^(aω)=(2aω)k+1,ω+πaωω+5π,ω¯1,\hat{\sigma}(a_{\omega})=\left({2\over a_{\omega}}\right)^{k+1},\quad\|\omega\|+\pi\leq a_{\omega}\leq\|\omega\|+5\pi,\quad\|\bar{\omega}\|\leq 1, (3.25)

which, together with (3.17), indicates that

u(x)=d12πσ(ω¯x+b)(aω2)k+1u^(ω)eiaω(b+k+12)𝑑b𝑑ω.\begin{split}u(x)=\int_{\mathbb{R}^{d}}\int_{\mathbb{R}}\frac{1}{2\pi}\sigma\left(\bar{\omega}\cdot x+b\right)\left({a_{\omega}\over 2}\right)^{k+1}\hat{u}(\omega)e^{-ia_{\omega}(b+{k+1\over 2})}dbd\omega.\end{split} (3.26)
Theorem 3.5.

There exist $\bar{\omega}_{i}$ with $\|\bar{\omega}_{i}\|\leq 1$, $b_{i}$ with $|b_{i}|\leq T+k+1$, and $\beta_{i}\in\mathbb{R}$ such that

uuNHm(Ω)N121d+1uBk+1(Ω)\left\|u-u_{N}\right\|_{H^{m}(\Omega)}\lesssim N^{-{1\over 2}-{1\over d+1}}\|u\|_{{B}^{k+1}(\Omega)} (3.27)

with

uN(x)=i=1Nβibk(ω¯ix+bi).u_{N}(x)=\sum_{i=1}^{N}\beta_{i}b^{k}\left(\bar{\omega}_{i}\cdot x+b_{i}\right). (3.28)
Proof.

We write (3.26) as follows

u(x)=dg(x,b,ω)ρ(b,ω)𝑑b𝑑ω\displaystyle u(x)=\int_{\mathbb{R}^{d}}\int_{\mathbb{R}}g(x,b,\omega)\rho(b,\omega)dbd\omega

with

u^(ω)=eiθ(ω)|u^(ω)|,θ~(ω)=θ(ω)+aω(b+k+12)\hat{u}(\omega)=e^{-i\theta(\omega)}|\hat{u}(\omega)|,\quad\tilde{\theta}(\omega)=\theta(\omega)+a_{\omega}(b+{k+1\over 2})

and

g(x,b,ω)=σ(ω¯x+b)sgn(cosθ~(ω)),g(x,b,\omega)=\sigma\left({\bar{\omega}}\cdot x+b\right)sgn(\cos\tilde{\theta}(\omega)), (3.29)
ρ(b,ω)=1(2π)d(aω2)k+1|u^(ω)||cosθ~(ω)|.\rho(b,\omega)=\frac{1}{(2\pi)^{d}}\left({a_{\omega}\over 2}\right)^{k+1}|\hat{u}(\omega)||\cos\tilde{\theta}(\omega)|. (3.30)

Note that

ω¯1,|b|T+k+1.\|\bar{\omega}\|\leq 1,\quad|b|\leq T+k+1. (3.31)

Let

G={(ω,b):ωd,|b|T+k+1},G~={(ω¯,b):ω¯1,|b|T+k+1}.G=\{(\omega,b):\omega\in\mathbb{R}^{d},\ |b|\leq T+k+1\},\ \tilde{G}=\{(\bar{\omega},b):\|\bar{\omega}\|\leq 1,\ |b|\leq T+k+1\}.

For any positive integer $n$, divide $\tilde{G}$ into $\tilde{M}$ ($\tilde{M}\leq{n\over 2}$) nonoverlapping subdomains, say $\tilde{G}=\tilde{G}_{1}\cup\tilde{G}_{2}\cup\cdots\cup\tilde{G}_{\tilde{M}}$, such that

|bb|n1d+1,|ω¯ω¯|n1d+1,(ω¯,b),(ω¯,b)G~i, 1iM~.|b-b^{\prime}|\lesssim n^{-{1\over d+1}},\quad|\bar{\omega}-\bar{\omega}^{\prime}|\lesssim n^{-{1\over d+1}},\quad(\bar{\omega},b),\ (\bar{\omega}^{\prime},b^{\prime})\in\tilde{G}_{i},\ 1\leq i\leq\tilde{M}. (3.32)

Define $M=2\tilde{M}$ and, for $1\leq i\leq\tilde{M}$,

Gi={(ω,b):(ω¯,b)G~i,cosθ~(ω)0},GM~+i={(ω,b):(ω¯,b)G~i,cosθ~(ω)0}.G_{i}=\{(\omega,b):(\bar{\omega},b)\in\tilde{G}_{i},\ \cos\tilde{\theta}(\omega)\geq 0\},\ G_{\tilde{M}+i}=\{(\omega,b):(\bar{\omega},b)\in\tilde{G}_{i},\ \cos\tilde{\theta}(\omega)\leq 0\}.

Thus, $G=G_{1}\cup G_{2}\cup\cdots\cup G_{M}$ with $G_{i}\cap G_{j}=\varnothing$ if $i\neq j$, and

|bb|n1d+1,|ω¯ω¯|n1d+1,sgn(cosθ~(ω))=sgn(cosθ~(ω)).|b-b^{\prime}|\lesssim n^{-{1\over d+1}},\quad|\bar{\omega}-\bar{\omega}^{\prime}|\lesssim n^{-{1\over d+1}},\quad sgn(\cos\tilde{\theta}(\omega))=sgn(\cos\tilde{\theta}(\omega^{\prime})). (3.33)

Let $n_{i}=\lceil\lambda(G_{i})n\rceil$, $\displaystyle N=\sum_{i=1}^{M}n_{i}$ and

uN(x)=ρL1(G)i=1Mλ(Gi)nij=1nig(x,θi,j).u_{N}(x)=\|\rho\|_{L^{1}(G)}\sum_{i=1}^{M}\frac{\lambda(G_{i})}{n_{i}}\sum_{j=1}^{n_{i}}g(x,\theta_{i,j}). (3.34)

It holds that

𝔼(uuNHm(Ω)2)ρL1(G)i=1Mλ2(Gi)nisupθi,θiGig(x,θi)g(x,θi)Hm(Ω)2\begin{split}\mathbb{E}\left(\left\|u-u_{N}\right\|_{H^{m}(\Omega)}^{2}\right)\leq&\|\rho\|_{L^{1}(G)}\sum_{i=1}^{M}\frac{\lambda^{2}(G_{i})}{n_{i}}\sup_{\theta_{i},\theta_{i}^{\prime}\in G_{i}}\|g(x,\theta_{i})-g(x,\theta_{i}^{\prime})\|^{2}_{H^{m}(\Omega)}\end{split} (3.35)

with $\theta=(b,\omega)$. For any $(b,\omega),\ (b^{\prime},\omega^{\prime})\in G_{i}$, $1\leq i\leq M$, if $k\geq m+1$,

|g(x,θ)g(x,θ)||bb|+|ωω|n1d+1|g(x,\theta)-g(x,\theta^{\prime})|\lesssim|b-b^{\prime}|+|\omega-\omega^{\prime}|\lesssim n^{-{1\over d+1}} (3.36)

Thus,

i=1Mλ2(Gi)nisupθi,θiGig(x,θi)g(x,θi)Hm(Ω)2n2d+1|Ω|.\sum_{i=1}^{M}\frac{\lambda^{2}(G_{i})}{n_{i}}\sup_{\theta_{i},\theta_{i}^{\prime}\in G_{i}}\|g(x,\theta_{i})-g(x,\theta_{i}^{\prime})\|^{2}_{H^{m}(\Omega)}\lesssim n^{-{2\over d+1}}|\Omega|. (3.37)

Thus,

𝔼(uuNHm(Ω)2)n12d+1|Ω|ρL1(G).\begin{split}\mathbb{E}\left(\left\|u-u_{N}\right\|_{H^{m}(\Omega)}^{2}\right)\lesssim&n^{-1-{2\over d+1}}|\Omega|\|\rho\|_{L^{1}(G)}.\end{split} (3.38)

Since $a_{\omega}\leq\|\omega\|+5\pi$,

ρL1(G)G(ω+1)k+1|u^(ω)|𝑑ω𝑑buBk+1(Ω).\|\rho\|_{L^{1}(G)}\lesssim\int_{G}(\|\omega\|+1)^{k+1}|\hat{u}(\omega)|d\omega db\lesssim\|u\|_{B^{k+1}(\Omega)}.

Note that $n\leq N\leq 2n$. Thus, there exist $\omega_{i}\in\mathbb{R}^{d}$, $\beta_{i}\in\mathbb{R}$ and $b_{i}\in\mathbb{R}$ such that

uuNHm(Ω)N121d+1uBk+1(Ω),\left\|u-u_{N}\right\|_{H^{m}(\Omega)}\lesssim N^{-{1\over 2}-{1\over d+1}}\|u\|_{B^{k+1}(\Omega)}, (3.39)

which completes the proof. ∎

The above analysis can also be applied to more general activation functions with compact support.

Theorem 3.6.

Suppose that $\sigma\in W^{m+1,\infty}(\mathbb{R})$ has compact support. If for any $a>0$ there exists $\tilde{a}>0$ such that

a~a,|σ^(a~)|a,\tilde{a}\gtrsim a,\quad|\hat{\sigma}(\tilde{a})|\gtrsim a^{-\ell}, (3.40)

then there exist $\omega_{i}\in\mathbb{R}^{d}$ and $b_{i}\in\mathbb{R}$ such that

uuNHm(Ω)N121d+1uB(Ω),\left\|u-u_{N}\right\|_{H^{m}(\Omega)}\lesssim N^{-{1\over 2}-{1\over d+1}}\|u\|_{B^{\ell}(\Omega)}, (3.41)

where

uN(x)=i=1Nβiσ(ω¯ix+bi).u_{N}(x)=\sum_{i=1}^{N}\beta_{i}\sigma\left(\bar{\omega}_{i}\cdot x+b_{i}\right). (3.42)

3.2 ${\rm[ReLU]}^{k}$ as activation functions

Rather than using the general Fourier transform representation (3.16) of $e^{i\omega\cdot x}$ in terms of $\sigma(\omega\cdot x+b)$, [Klusowski and Barron, 2016] gave a different way to represent $e^{i\omega\cdot x}$ in terms of $(\omega\cdot x+b)_{+}^{k}$ for $k=1$ and $2$. The following lemma generalizes this representation to all $k\geq 0$.

Lemma 3.7.

For any $k\geq 0$ and $x\in\Omega$,

eiωx=j=0k(iωx)jj!+ik+1k!ωk+10T[(ω¯xt)+keiωt+(1)k1(ω¯xt)+keiωt]𝑑t.e^{i\omega\cdot x}=\sum_{j=0}^{k}{(i\omega\cdot x)^{j}\over j!}+{i^{k+1}\over k!}\|\omega\|^{k+1}\int_{0}^{T}\left[(\bar{\omega}\cdot x-t)_{+}^{k}e^{i\|\omega\|t}+(-1)^{k-1}(-\bar{\omega}\cdot x-t)_{+}^{k}e^{-i\|\omega\|t}\right]dt. (3.43)
Proof.

For $|z|\leq c$, by the Taylor expansion with integral remainder,

eiz=j=0k(iz)jj!+ik+1k!0zeiu(zu)k𝑑u.e^{iz}=\sum_{j=0}^{k}{(iz)^{j}\over j!}+{i^{k+1}\over k!}\int_{0}^{z}e^{iu}(z-u)^{k}du. (3.44)

Note that

(zu)k=(zu)+k(uz)+k.(z-u)^{k}=(z-u)^{k}_{+}-(u-z)^{k}_{+}.

It follows that

\begin{split}\int_{0}^{z}(z-u)^{k}e^{iu}du=&\int_{0}^{z}(z-u)_{+}^{k}e^{iu}du+\int_{0}^{z}(-1)^{k}(u-z)_{+}^{k}e^{iu}du\\ =&\int_{0}^{z}(z-u)_{+}^{k}e^{iu}du+\int_{0}^{-z}(-1)^{k-1}(-u-z)_{+}^{k}e^{-iu}du\\ =&\int_{0}^{c}(z-u)_{+}^{k}e^{iu}du+(-1)^{k-1}\int_{0}^{c}(-z-u)_{+}^{k}e^{-iu}du.\end{split} (3.45)

Thus,

eizj=0k(iz)jj!=ik+1k!0c[(zu)+keiu+(1)k1(zu)+keiu]𝑑u.e^{iz}-\sum_{j=0}^{k}{(iz)^{j}\over j!}={i^{k+1}\over k!}\int_{0}^{c}\left[(z-u)_{+}^{k}e^{iu}+(-1)^{k-1}(-z-u)_{+}^{k}e^{-iu}\right]du. (3.46)

Let

z=ωx,u=ωt,ω¯=ωω.z=\omega\cdot x,\quad u=\|\omega\|t,\quad\bar{\omega}={\omega\over\|\omega\|}. (3.47)

Since $\|x\|\leq T$ and hence $|\bar{\omega}\cdot x|\leq T$, taking $c=T$ in (3.46) we obtain

eiωxj=0k(iωx)jj!=ik+1k!ωk+10T[(ω¯xt)+keiωt+(1)k1(ω¯xt)+keiωt]𝑑t,e^{i\omega\cdot x}-\sum_{j=0}^{k}{(i\omega\cdot x)^{j}\over j!}={i^{k+1}\over k!}\|\omega\|^{k+1}\int_{0}^{T}\left[(\bar{\omega}\cdot x-t)_{+}^{k}e^{i\|\omega\|t}+(-1)^{k-1}(-\bar{\omega}\cdot x-t)_{+}^{k}e^{-i\|\omega\|t}\right]dt, (3.48)

which completes the proof. ∎
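
A minimal numerical check (illustration only) of the truncated-power identity (3.46), with the particular values of $k$, $z$ and $c$ below chosen arbitrarily and the integral evaluated by a simple trapezoidal rule.

import math
import numpy as np

def taylor_remainder(z, k):
    """Left-hand side of (3.46): e^{iz} minus its degree-k Taylor polynomial."""
    return np.exp(1j * z) - sum((1j * z) ** j / math.factorial(j) for j in range(k + 1))

def truncated_power_integral(z, k, c, n=20001):
    """Right-hand side of (3.46), computed with a trapezoidal rule on [0, c]."""
    u = np.linspace(0.0, c, n)
    f = (np.maximum(z - u, 0.0) ** k * np.exp(1j * u)
         + (-1) ** (k - 1) * np.maximum(-z - u, 0.0) ** k * np.exp(-1j * u))
    h = u[1] - u[0]
    integral = h * (f.sum() - 0.5 * (f[0] + f[-1]))
    return 1j ** (k + 1) / math.factorial(k) * integral

for k in (1, 2, 3):
    for z in (-1.3, 0.7, 2.0):
        print(k, z, abs(taylor_remainder(z, k) - truncated_power_integral(z, k, c=3.0)))
# the differences are at the level of the quadrature error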

Since $u(x)={1\over(2\pi)^{d}}\int_{\mathbb{R}^{d}}e^{i\omega\cdot x}\hat{u}(\omega)d\omega$ and $\partial^{\alpha}u(x)=\int_{\mathbb{R}^{d}}i^{|\alpha|}\omega^{\alpha}e^{i\omega\cdot x}\hat{u}(\omega)d\omega$, we have

αu(0)xα=di|α|ωαxαu^(ω)𝑑ω.\displaystyle\partial^{\alpha}u(0)x^{\alpha}=\int_{\mathbb{R}^{d}}i^{|\alpha|}\omega^{\alpha}x^{\alpha}\hat{u}(\omega)d\omega. (3.49)

Note that $\displaystyle(\omega\cdot x)^{j}=\sum_{|\alpha|=j}{j!\over\alpha!}\omega^{\alpha}x^{\alpha}$. It follows that

|α|=j1α!αu(0)xα=ij|α|=j1α!dωαxαu^(ω)𝑑ω=1j!d(iωx)ju^(ω)𝑑ω.\sum_{|\alpha|=j}{1\over\alpha!}\partial^{\alpha}u(0)x^{\alpha}=i^{j}\sum_{|\alpha|=j}{1\over\alpha!}\int_{\mathbb{R}^{d}}\omega^{\alpha}x^{\alpha}\hat{u}(\omega)d\omega={1\over j!}\int_{\mathbb{R}^{d}}(i\omega\cdot x)^{j}\hat{u}(\omega)d\omega. (3.50)

Let $\hat{u}(\omega)=|\hat{u}(\omega)|e^{ib(\omega)}$. Then $e^{i\|\omega\|t}\hat{u}(\omega)=|\hat{u}(\omega)|e^{i(\|\omega\|t+b(\omega))}$. By Lemma 3.7,

u(x)|α|k1α!αu(0)xα=d(eiωxj=0k1j!(iωx)j)u^(ω)𝑑ω.=Re(ik+1k!d0T[(ω¯xt)+keiωt+(1)k1(ω¯xt)+keiω1t]u^(ω)ωk+1𝑑t𝑑ω)=1k!{1,1}d0T(zω¯xt)+ks(zt,ω)|u^(ω)|ωk+1𝑑t𝑑ω𝑑z\begin{split}&u(x)-\sum_{|\alpha|\leq k}{1\over\alpha!}\partial^{\alpha}u(0)x^{\alpha}\\ =&\int_{\mathbb{R}^{d}}\big{(}e^{i\omega\cdot x}-\sum_{j=0}^{k}{1\over j!}(i\omega\cdot x)^{j}\big{)}\hat{u}(\omega)d\omega.\\ =&{\rm Re}\bigg{(}{i^{k+1}\over k!}\int_{\mathbb{R}^{d}}\int_{0}^{T}\left[(\bar{\omega}\cdot x-t)_{+}^{k}e^{i\|\omega\|t}+(-1)^{k-1}(-\bar{\omega}\cdot x-t)_{+}^{k}e^{-i\|\omega\|_{\ell_{1}}t}\right]\hat{u}(\omega)\|\omega\|^{k+1}dtd\omega\bigg{)}\\ =&{1\over k!}\int_{\{-1,1\}}\int_{\mathbb{R}^{d}}\int_{0}^{T}(z\bar{\omega}\cdot x-t)_{+}^{k}s(zt,\omega)|\hat{u}(\omega)|\|\omega\|^{k+1}dtd\omega dz\end{split} (3.51)

with {1,1}r(z)𝑑z=r(1)+r(1)\int_{\{-1,1\}}r(z)dz=r(-1)+r(1) and

s(zt,ω)={(1)k+12cos(zωt+b(ω))k is odd,(1)k+22sin(zωt+b(ω))k is even.s(zt,\omega)=\begin{cases}(-1)^{k+1\over 2}\cos(z\|\omega\|t+b(\omega))&k\text{ is odd},\\ (-1)^{k+2\over 2}\sin(z\|\omega\|t+b(\omega))&k\text{ is even}.\end{cases} (3.52)

Define G={1,1}×[0,T]×dG=\{-1,1\}\times[0,T]\times\mathbb{R}^{d}, θ=(z,t,ω)G\theta=(z,t,\omega)\in G,

g(x,θ)=(zω¯xt)+ksgns(zt,ω),ρ(θ)=1(2π)d|s(zt,ω)||u^(ω)|ωk+1,λ(θ)=ρ(θ)ρL1(G).g(x,\theta)=(z\bar{\omega}\cdot x-t)_{+}^{k}{\rm sgn}s(zt,\omega),\qquad\rho(\theta)={1\over(2\pi)^{d}}|s(zt,\omega)||\hat{u}(\omega)|\|\omega\|^{k+1},\quad\lambda(\theta)={\rho(\theta)\over\|\rho\|_{L^{1}(G)}}. (3.53)

Then (3.51) can be written as

u(x)=|α|k1α!Dαu(0)xα+νk!Gg(x,θ)λ(θ)𝑑θ,u(x)=\sum_{|\alpha|\leq k}{1\over\alpha!}D^{\alpha}u(0)x^{\alpha}+{\nu\over k!}\int_{G}g(x,\theta)\lambda(\theta)d\theta, (3.54)

with ν=Gρ(θ)𝑑θ\nu=\int_{G}\rho(\theta)d\theta. In summary, we have the following lemma.

Lemma 3.8.

It holds that

u(x)=|α|k1α!αu(0)xα+νk!rk(x),xΩu(x)=\sum_{|\alpha|\leq k}{1\over\alpha!}\partial^{\alpha}u(0)x^{\alpha}+{\nu\over k!}r_{k}(x),\qquad x\in\Omega (3.55)

with ν=Gρ(θ)𝑑θ\nu=\int_{G}\rho(\theta)d\theta and

rk(x)=Gg(x,θ)λ(θ)𝑑θ,G={1,1}×[0,T]×d,r_{k}(x)=\int_{G}g(x,\theta)\lambda(\theta)d\theta,\qquad G=\{-1,1\}\times[0,T]\times\mathbb{R}^{d}, (3.56)

and g(x,θ)g(x,\theta), ρ(θ)\rho(\theta) and λ(θ)\lambda(\theta) defined in (3.53).

According to (3.53), the main ingredient (z\bar{\omega}\cdot x-t)_{+}^{k} of g(x,\theta) depends on \omega only through its direction \bar{\omega}, which belongs to the bounded set \mathbb{S}^{d-1}. Thanks to the continuity of (z\bar{\omega}\cdot x-t)_{+}^{k} with respect to (z,\bar{\omega},t) and the boundedness of \mathbb{S}^{d-1}, applying stratified sampling to the residual term of the Taylor expansion leads to the approximation property in Theorem 3.9.
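To make the shallow expansion appearing in Theorem 3.9 below more concrete, the following small sketch (our own illustration, not the constructive stratified-sampling argument of Lemma 2.4) assembles an approximation of the form (3.57) for a toy target: the directions \bar{\omega}_{j} and thresholds t_{j} are drawn at random, the polynomial part is omitted, and the outer coefficients \beta_{j} are simply fitted by least squares. All names in the snippet are illustrative.

```python
# A minimal numerical sketch (not the proof's construction): approximate a smooth
# function on the ball of radius T by a sum of ReLU^k ridge functions
# (omega_bar_j . x - t_j)_+^k with random directions/thresholds and
# least-squares outer coefficients.
import numpy as np

rng = np.random.default_rng(0)
d, k, N, T = 2, 3, 200, 1.0            # dimension, ReLU power, #neurons, ||x|| <= T

def relu_k(z, k):
    return np.maximum(z, 0.0) ** k

# random unit directions omega_bar_j and thresholds t_j in [0, T]
omega = rng.normal(size=(N, d))
omega /= np.linalg.norm(omega, axis=1, keepdims=True)
t = rng.uniform(0.0, T, size=N)

def u(x):                               # a smooth test function
    return np.cos(np.pi * x[:, 0]) * np.exp(x[:, 1])

# training points in the ball of radius T
X = rng.uniform(-T, T, size=(2000, d))
X = X[np.linalg.norm(X, axis=1) <= T]

# design matrix with columns (omega_bar_j . x - t_j)_+^k
A = relu_k(X @ omega.T - t, k)
beta, *_ = np.linalg.lstsq(A, u(X), rcond=None)

err = np.sqrt(np.mean((A @ beta - u(X)) ** 2))
print(f"RMS error with N={N} ReLU^{k} ridge functions: {err:.3e}")
```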

Theorem 3.9.

Assume u\in B^{k+1}(\Omega). Then there exist \beta_{j}\in[-1,1], \|\bar{\omega}_{j}\|=1, and t_{j}\in[0,T] such that

u_{N}(x)=\sum_{|\alpha|\leq k}{1\over\alpha!}\partial^{\alpha}u(0)x^{\alpha}+{\nu\over k!N}\sum_{j=1}^{N}\beta_{j}(\bar{\omega}_{j}\cdot x-t_{j})_{+}^{k} (3.57)

with ν=Gρ(θ)𝑑θ\nu=\int_{G}\rho(\theta)d\theta and ρ(θ)\rho(\theta) defined in (3.53) satisfies the following estimate

uuNHm(Ω){N121duBk+1(Ω),m<k,N12uBk+1(Ω)m=k.\|u-u_{N}\|_{H^{m}(\Omega)}\lesssim\begin{cases}N^{-{1\over 2}-{1\over d}}\|u\|_{B^{k+1}(\Omega)},&m<k,\\ N^{-{1\over 2}}\|u\|_{B^{k+1}(\Omega)}&m=k.\end{cases} (3.58)
Proof.

Let

uN(x)=|α|k1α!αu(0)xα+νk!rk,N(x),rk,N(x)=1Nj=1Nβj(ω¯jxtj)+k.u_{N}(x)=\sum_{|\alpha|\leq k}{1\over\alpha!}\partial^{\alpha}u(0)x^{\alpha}+{\nu\over k!}r_{k,N}(x),\qquad r_{k,N}(x)={1\over N}\sum_{j=1}^{N}\beta_{j}(\bar{\omega}_{j}\cdot x-t_{j})_{+}^{k}.

Recall the representation of u(x)u(x) in (3.55) and rk(x)r_{k}(x) in (3.56). It holds that

u(x)-u_{N}(x)={\nu\over k!}(r_{k}(x)-r_{k,N}(x)). (3.59)

By Lemma 2.4, for any decomposition G=\cup_{i=1}^{N}G_{i}, there exist \{\theta_{i}\}_{i=1}^{N} and \{\beta_{i}\}_{i=1}^{N}\subset[0,1] such that

\|\partial_{x}^{\alpha}(u-u_{N})\|_{L^{2}(\Omega)}={\nu\over k!}\|\partial_{x}^{\alpha}(r_{k}-r_{k,N})\|_{L^{2}(\Omega)}\leq{\nu\over k!N^{1/2}}\max_{1\leq j\leq N}\sup_{\theta_{j},\theta_{j}^{\prime}\in G_{j}}\|\partial_{x}^{\alpha}\big{(}g(x,\theta_{j})-g(x,\theta_{j}^{\prime})\big{)}\|_{L^{2}(\Omega)}. (3.60)

Consider an \epsilon-covering decomposition G=\cup_{i=1}^{N}G_{i} such that

z=z,|tt|<ϵ,ω¯ω¯1<ϵθ=(z,t,ω),θ=(z,t,ω)Giz=z^{\prime},\ |t-t^{\prime}|<\epsilon,\ \|\bar{\omega}-\bar{\omega}^{\prime}\|_{\ell^{1}}<\epsilon\qquad\forall\theta=(z,t,\omega),\ \theta^{\prime}=(z^{\prime},t^{\prime},\omega^{\prime})\in G_{i} (3.61)

where ω¯\bar{\omega} is defined in (3.47). For any θi,θiGi\theta_{i},\theta^{\prime}_{i}\in G_{i},

|xα(g(x,θi)g(x,θi))|=k!(k|α|)!|gα(x,ω¯,t)gα(x,ω¯,t)||\partial_{x}^{\alpha}\big{(}g(x,\theta_{i})-g(x,\theta_{i}^{\prime})\big{)}|={k!\over(k-|\alpha|)!}|g_{\alpha}(x,\bar{\omega},t)-g_{\alpha}(x,\bar{\omega}^{\prime},t^{\prime})|

with

gα(x,ω¯,t)=(zω¯xt)+k|α|ω¯α.g_{\alpha}(x,\bar{\omega},t)=(z\bar{\omega}\cdot x-t)^{k-|\alpha|}_{+}\bar{\omega}^{\alpha}. (3.62)

Since

|\partial_{\bar{\omega}_{i}}g_{\alpha}|\leq(2T)^{k-|\alpha|-1}\big{(}(k-|\alpha|)|x_{i}|+2T\alpha_{i}\big{)},\qquad|\partial_{t}g_{\alpha}|\leq(k-|\alpha|)(2T)^{k-|\alpha|-1},

it follows that

|xα(g(x,θi)g(x,θi))|k!(k|α|)!(2T)k|α|1((k|α|)(|x|1+1)+2T|α|)ϵ.\big{|}\partial_{x}^{\alpha}\big{(}g(x,\theta_{i})-g(x,\theta_{i}^{\prime})\big{)}\big{|}\leq{k!\over(k-|\alpha|)!}(2T)^{k-|\alpha|-1}\bigg{(}(k-|\alpha|)(|x|_{\ell_{1}}+1)+2T|\alpha|\bigg{)}\epsilon. (3.63)

Thus, combining (3.60) and (3.63), if |\alpha|\leq m<k,

\|\partial_{x}^{\alpha}(u-u_{N})\|_{L^{2}(\Omega)}\leq{\nu|\Omega|^{1/2}\over(k-|\alpha|)!}(2T)^{k-|\alpha|-1}\bigg{(}(k-|\alpha|)(T+1)+2T|\alpha|\bigg{)}N^{-{1\over 2}}\epsilon. (3.64)

Note that \epsilon\sim N^{-{1\over d}}. Summing (3.64) over all |\alpha|\leq m, there exist \theta_{j} such that, for m<k,

\|u-u_{N}\|_{H^{m}(\Omega)}\leq C(m,k,\Omega)\,\nu\,N^{-{1\over 2}-{1\over d}} (3.65)

with νuBk+1(Ω)\nu\leq\|u\|_{B^{k+1}(\Omega)} and

C(m,k,\Omega)=|\Omega|^{1/2}\bigg{(}\sum_{|\alpha|\leq m}\Big{[}{1\over(k-|\alpha|)!}(2T)^{k-|\alpha|-1}\big{(}(k-|\alpha|)(T+1)+2T|\alpha|\big{)}\Big{]}^{2}\bigg{)}^{1/2}. (3.66)

If m=|α|=km=|\alpha|=k,

\max_{1\leq j\leq N}\sup_{\theta_{j},\theta_{j}^{\prime}\in G_{j}}\|\partial_{x}^{\alpha}\big{(}g(x,\theta_{j})-g(x,\theta_{j}^{\prime})\big{)}\|_{L^{2}(\Omega)}\lesssim 1.

This leads to

uuNHm(Ω)C(m,k,Ω)νN12for k=m.\|u-u_{N}\|_{H^{m}(\Omega)}\leq C(m,k,\Omega)\nu N^{-{1\over 2}}\quad\mbox{for }\ k=m. (3.67)

Note that uNu_{N} defined above can be written as

u_{N}(x)=\sum_{|\alpha|\leq k}{1\over\alpha!}\partial^{\alpha}u(0)x^{\alpha}+{\nu\over k!N}\sum_{j=1}^{N}\beta_{j}(\bar{\omega}_{j}\cdot x-t_{j})_{+}^{k}

with βj[1,1]\beta_{j}\in[-1,1], which completes the proof. ∎

Lemma 3.10.

There exist αi\alpha_{i}, ωi\omega_{i}, bib_{i} and N2(k+dk)N\leq 2\begin{pmatrix}k+d\\ k\end{pmatrix} such that

\sum_{|\alpha|\leq k}{1\over\alpha!}\partial^{\alpha}u(0)x^{\alpha}=\sum_{i=1}^{N}\alpha_{i}(\omega_{i}\cdot x+b_{i})_{+}^{k}

with xα=x1α1x2α2xdαd,α!=α1!α2!αd!.x^{\alpha}=x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}},\quad\alpha!=\alpha_{1}!\alpha_{2}!\cdots\alpha_{d}!.

The above result can be found in [He et al., 2020a].
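The following small check (our own illustration with explicitly chosen coefficients, not the general construction of [He et al., 2020a]) verifies a d=1, k=2 instance of this representation: on [-1,1] the monomials 1, x and x^2 can each be written as a combination of a few ReLU^2 ridge functions (\omega x+b)_{+}^{2}.

```python
# Numerical check of a d = 1, k = 2 instance of Lemma 3.10 on [-1, 1]:
# the arguments x + 2 and -x + 2 stay positive there, so the squared ReLU
# acts as a plain square and the algebra below is exact.
import numpy as np

def relu2(z):
    return np.maximum(z, 0.0) ** 2

x = np.linspace(-1.0, 1.0, 1001)

one = relu2(0.0 * x + 1.0)                        # = 1 on [-1, 1]
lin = (relu2(x + 2.0) - relu2(-x + 2.0)) / 8.0    # = x
quad = (relu2(x + 2.0) + relu2(-x + 2.0) - 8.0 * one) / 2.0   # = x^2

print(np.max(np.abs(one - 1.0)),
      np.max(np.abs(lin - x)),
      np.max(np.abs(quad - x ** 2)))              # all ~ machine precision
```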

A combination of Theorem 3.9 and the above lemma gives the following estimate in Theorem 3.11.

Theorem 3.11.

Suppose u\in B^{k+1}(\Omega). Then there exist \beta_{j},t_{j}\in\mathbb{R} and \omega_{j}\in\mathbb{R}^{d} such that

uN(x)=j=1Nβj(ω¯jxtj)+ku_{N}(x)=\sum_{j=1}^{N}\beta_{j}(\bar{\omega}_{j}\cdot x-t_{j})_{+}^{k} (3.68)

satisfies the following estimate

uuNHm(Ω){N121duBk+1(Ω),k>m,N12uBk+1(Ω),k=m,\|u-u_{N}\|_{H^{m}(\Omega)}\lesssim\begin{cases}N^{-{1\over 2}-{1\over d}}\|u\|_{B^{k+1}(\Omega)},\qquad k>m,\\ N^{-{1\over 2}}\|u\|_{B^{k+1}(\Omega)},\qquad k=m,\end{cases} (3.69)

where ω¯\bar{\omega} is defined in (3.47).

Remark 3.12.

We make the following comparisons between results in Section 3.1 and 3.2.

  1.

    The results in Section 3.1 are for the activation function \sigma=b_{k}, while the results in Section 3.2 are for the activation function \sigma={\rm ReLU}^{k}.

  2.

    By (3.9), the following relation obviously holds

    VN(bk)VN+k(ReLUk).V_{N}(b_{k})\subset V_{N+k}({\rm ReLU}^{k}).

    Thus, asymptotically speaking, the results that hold for σ=bk\sigma=b_{k} also hold for σ=ReLUk\sigma={\rm ReLU}^{k}.

  3.

    The results in Section 3.2 are in some cases asymptotically better than those in Section 3.1, but they require stronger regularity assumptions on u. For example, Theorem 3.3 only requires u\in B^{m}, whereas Theorem 3.9 requires u\in B^{k+1} even for m=0.

  4.

    The computational efficiency of solving the optimization problems (5.28) or (5.31) below may differ with the choice of activation function, namely \sigma=b_{k} or {\rm ReLU}^{k}.

4 Deep finite neuron functions, adaptivity and spectral accuracy

In this section, we will study deep finite neuron functions through the framework of deep neural networks and then discuss their adaptivity and spectral accuracy properties.

4.1 Deep finite neuron functions

Given d,+d,\ell\in\mathbb{N}^{+}, n1,,n with n0=d,n+1=1,n_{1},\dots,n_{\ell}\in\mathbb{N}\mbox{ with }n_{0}=d,n_{\ell+1}=1,

\theta^{i}(x)=\omega_{i}\cdot x+b_{i},\quad\omega_{i}\in\mathbb{R}^{n_{i+1}\times n_{i}},\ b_{i}\in\mathbb{R}^{n_{i+1}}, (4.1)

and the activation function ReLUk\mbox{ReLU}^{k}, define a deep finite neuron function u(x)u(x) from d\mathbb{R}^{d} to \mathbb{R} as follows:

f0(x)\displaystyle f^{0}(x) =θ0(x)\displaystyle=\theta^{0}(x)
fi(x)\displaystyle f^{i}(x) =[θiσ](fi1(x))i=1:\displaystyle=[\theta^{i}\circ\sigma](f^{i-1}(x))\quad i=1:\ell
f(x)\displaystyle f(x) =f(x).\displaystyle=f^{\ell}(x).

The following more concise notation is often used in computer science literature:

f(x)=θσθ1σθ1σθ0(x),f(x)=\theta^{\ell}\circ\sigma\circ\theta^{\ell-1}\circ\sigma\cdots\circ\theta^{1}\circ\sigma\circ\theta^{0}(x), (4.2)

here \theta^{i}:\mathbb{R}^{n_{i}}\to\mathbb{R}^{n_{i+1}} are the affine maps defined in (4.1). Such a deep neural network is an (\ell+1)-layer DNN, namely it has \ell hidden layers. The size of this deep neural network is n_{1}+\cdots+n_{\ell}.
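The following minimal sketch (our own illustration; the weights are random placeholders) spells out the composition (4.2) with \sigma={\rm ReLU}^{k}.

```python
# A minimal sketch of the forward map (4.2): f = theta^l o sigma o ... o theta^0
# with sigma = ReLU^k.  Parameter shapes follow (4.1); the random weights below
# are placeholders only meant to exercise the composition.
import numpy as np

def relu_k(z, k):
    return np.maximum(z, 0.0) ** k

def deep_finite_neuron(x, weights, biases, k):
    """x: (d,) input; weights[i]: (n_{i+1}, n_i); biases[i]: (n_{i+1},)."""
    f = weights[0] @ x + biases[0]              # theta^0(x)
    for W, b in zip(weights[1:], biases[1:]):
        f = W @ relu_k(f, k) + b                # theta^i(sigma(f^{i-1}))
    return f                                    # n_{l+1} = 1 -> scalar output

rng = np.random.default_rng(0)
d, widths, k = 3, [8, 8, 1], 2                  # n_0 = d, ..., n_{l+1} = 1
dims = [d] + widths
weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(widths))]
biases = [rng.normal(size=(dims[i + 1],)) for i in range(len(widths))]
print(deep_finite_neuron(rng.normal(size=d), weights, biases, k))
```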

Based on this notation, we define the deep finite neuron function class with activation function \sigma=\mbox{ReLU}^{k} by

{}_{n}{\mathcal{N}}^{k}(n_{1},n_{2},\ldots,n_{\ell})=\bigg{\{}f(x)=\theta^{\ell}\circ\sigma\circ\theta^{\ell-1}\circ\sigma\cdots\circ\theta^{0}(x),\mbox{ with }\omega_{i}\in\mathbb{R}^{n_{i+1}\times n_{i}},\ b_{i}\in\mathbb{R}^{n_{i+1}},\ i=0:\ell,\ n_{0}=d,\ n_{\ell+1}=1\bigg{\}} (4.3)

Generally, we can define the \ell-hidden layer neural network as:

{}_{n}{\mathcal{N}}_{\ell}^{k}:=\bigcup_{n_{1},n_{2},\cdots,n_{\ell}\geq 1}{}_{n}{\mathcal{N}}^{k}(n_{1},n_{2},\cdots,n_{\ell}). (4.4)

For \ell=2, functions in {}_{n}{\mathcal{N}}_{2}^{k} consist of piecewise polynomials of degree k^{2} on finite neuron grids whose element boundaries are level sets of quadratic polynomials, see Fig. 4.1.

Figure 4.1: Hyperplanes with =2\ell=2

4.2 Reproduction of polynomials and spectral accuracy

One interesting property of the ReLU^{k}-DNN is that it can reproduce polynomials of any given degree.

Lemma 4.1.

Given k2k\geq 2, q2q\geq 2, there exist 1\ell\geq 1, n1,,nn_{1},\cdots,n_{\ell} such that

\mathbb{P}_{q}\subset{}_{n}{\mathcal{N}}^{k}(n_{1},\cdots,n_{\ell}),

where q\mathbb{P}_{q} is the set of all polynomials with degree not larger than qq.

For a proof of the above result, we refer to [Li et al., 2019].
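As a concrete one-dimensional illustration of how compositions of ReLU^{k} layers generate higher-degree polynomials (our own example, not the construction of [Li et al., 2019]): since x^{2}={\rm ReLU}(x)^{2}+{\rm ReLU}(-x)^{2} and x^{2}\geq 0, one more ReLU^{2} layer gives x^{4}={\rm ReLU}(x^{2})^{2}, so a two-hidden-layer ReLU^{2} network represents x^{4} exactly.

```python
# Our own 1D illustration of the mechanism behind Lemma 4.1 with sigma = ReLU^2:
# first hidden layer gives (x)_+^2 and (-x)_+^2, whose sum is x^2; a second
# hidden layer applied to x^2 >= 0 gives (x^2)_+^2 = x^4 (theta^2 = identity).
import numpy as np

sigma = lambda z: np.maximum(z, 0.0) ** 2

x = np.linspace(-2.0, 2.0, 401)
h = sigma(np.stack([x, -x]))          # first hidden layer: (x)_+^2, (-x)_+^2
s = h.sum(axis=0)                     # theta^1: s = x^2
f = sigma(s)                          # second hidden layer: (x^2)_+^2 = x^4
print(np.max(np.abs(f - x ** 4)))     # ~ machine precision
```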

Theorem 4.2.

Let \mbox{ReLU}^{k} be the activation function and let {}_{n}{\mathcal{N}}_{\ell}^{k}(N) be the DNN model with \ell hidden layers and N neurons. There exists some \ell such that

\inf_{v_{N}\in{}_{n}{\mathcal{N}}_{\ell}^{k}(N)}\|u-v_{N}\|_{H^{m}(\Omega)}\lesssim\inf_{v_{N}\in\mathbb{P}_{k^{\ell}}}\|u-v_{N}\|_{H^{m}(\Omega)}. (4.5)

Estimate (4.5) indicates that deep finite neuron functions may provide spectral approximation accuracy.

4.3 Reproduction of linear finite element functions and adaptivity

Deep neural networks with the ReLU activation function have been much studied in the literature and are the most widely used in practice. One interesting fact is that ReLU-DNN functions are simply piecewise linear functions. More specifically, from [He et al., 2018], we have the following result:

Lemma 4.3.

Assume that {\cal T}_{h} is a simplicial finite element grid with N elements in which any union of simplexes sharing a common vertex is convex. Then any linear finite element function on this grid can be written as a ReLU-DNN with at most \mathcal{O}(d) hidden layers. The number of neurons is at most \mathcal{O}(\kappa^{d}N) for some constant \kappa\geq 2 depending on the shape regularity of \mathcal{T}_{h}. The number of non-zero parameters is at most \mathcal{O}(d\kappa^{d}N).

The above result indicates that deep finite neuron functions can reproduce any linear finite element function, as illustrated in one dimension by the sketch below. Given the adaptivity of finite element methods, we see that the finite neuron method can be at least as adaptive as the finite element method.
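The following one-dimensional sanity check (our own example, in the spirit of [He et al., 2020b]) verifies that a nodal hat basis function on a uniform grid is exactly a three-neuron ReLU combination.

```python
# On a uniform 1D grid with spacing h, the hat function at node x_i equals
#   phi_i(x) = [relu(x - x_{i-1}) - 2*relu(x - x_i) + relu(x - x_{i+1})] / h.
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

h, xi = 0.25, 0.5                      # grid spacing and the node x_i
x = np.linspace(0.0, 1.0, 2001)

hat_relu = (relu(x - (xi - h)) - 2.0 * relu(x - xi) + relu(x - (xi + h))) / h
hat_exact = np.maximum(0.0, 1.0 - np.abs(x - xi) / h)

print(np.max(np.abs(hat_relu - hat_exact)))   # ~ machine precision
```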

5 The finite neuron method for boundary value problems

In this section, we apply finite neuron functions to the numerical solution of (1.1). In §5.1, we first present some analytic results for (1.1). In §5.2, we obtain error estimates for the finite neuron method for (1.1) for both the Neumann and Dirichlet boundary conditions.

5.1 Elliptic boundary value problems of order 2m2m

As discussed in the introduction, let us rewrite the Dirichlet boundary value problem as follows:

{Lu=fin Ω,BDk(u)=0on Ω(0km1).\left\{\begin{array}[]{rccl}\displaystyle Lu&=&f&\mbox{in }\Omega,\\ B_{D}^{k}(u)&=&0&\mbox{on }\partial\Omega\quad(0\leq k\leq m-1).\end{array}\right. (5.1)

Here B_{D}^{k}(u) are given by (1.7). We next discuss the pure Neumann boundary conditions for the general PDE operator (1.2) when m\geq 2, beginning with the following simple result.

Lemma 5.1.

For each k=0,1,,m1k=0,1,\ldots,m-1, there exists a bounded linear differential operator of order 2mk12m-k-1:

BNk:H2m(Ω)L2(Ω)B_{N}^{k}:H^{2m}(\Omega)\mapsto L^{2}(\partial\Omega) (5.2)

such that the following identity holds

(Lu,v)=a(u,v)k=0m1BNk(u),BDk(v)0,Ω.(Lu,v)=a(u,v)-\sum_{k=0}^{m-1}\langle B_{N}^{k}(u),B_{D}^{k}(v)\rangle_{0,\partial\Omega}.

Namely

|α|=m(1)m(α(aααu),v)0,Ω=|𝜶|=m(aα𝜶u,𝜶v)0,Ωk=0m1BNk(u),BDk(v)0,Ω\sum_{|\alpha|=m}(-1)^{m}\left(\partial^{\alpha}(a_{\alpha}\,\partial^{\alpha}\,u),\,v\right)_{0,\Omega}=\sum_{|{\bm{\alpha}}|=m}\left(a_{\alpha}\partial^{\bm{\alpha}}\,u,\partial^{\bm{\alpha}}v\right)_{0,\Omega}-\sum_{k=0}^{m-1}\langle B_{N}^{k}(u),B_{D}^{k}(v)\rangle_{0,\partial\Omega} (5.3)

for all uH2m(Ω),vHm(Ω)u\in H^{2m}(\Omega),v\in H^{m}(\Omega). Furthermore,

k=0m1BDk(u)L2(Ω)+k=0m1BNk(u)L2(Ω)u2m,Ω.\sum_{k=0}^{m-1}\|B_{D}^{k}(u)\|_{L^{2}(\partial\Omega)}+\sum_{k=0}^{m-1}\|B_{N}^{k}(u)\|_{L^{2}(\partial\Omega)}\lesssim\|u\|_{2m,\Omega}. (5.4)

Lemma 5.1 can be proved by induction with respect to m. We refer to [Lions and Magenes, 2012] (Chapter 2) and [Chen and Huang, 2020] for a proof of a similar identity.

In general the explicit expression of BNkB_{N}^{k} can be quite complicated. Let us get some idea by looking at some simple examples with the following special operator:

Lu=(Δ)mu+u,Lu=(-\Delta)^{m}u+u, (5.5)

and

a(u,v)=|α|=m(aααu,αv)0,Ω+(a0u,v)u,vV.a(u,v)=\sum_{|\alpha|=m}(a_{\alpha}\partial^{\alpha}u,\partial^{\alpha}v)_{0,\Omega}+(a_{0}u,v)\quad\forall u,v\in V. (5.6)
  • For m=1m=1, it is easy to see that BN0u=uν|ΩB_{N}^{0}u=\frac{\partial u}{\partial\nu}|_{\partial\Omega}.

  • For m=2m=2 and d=2d=2, see [Chien, 1980]:

    BN0u=ν(Δu+2uτ2)τ(κτuτ)|ΩandBN1u=2uν2|Ω,B_{N}^{0}u=\frac{\partial}{\partial\nu}\left(\Delta u+\frac{\partial^{2}u}{\partial\tau^{2}}\right)-\frac{\partial}{\partial\tau}\left({\kappa_{\tau}}\frac{\partial u}{\partial\tau}\right)|_{\partial\Omega}~{}~{}\hbox{and}~{}~{}B_{N}^{1}u=\frac{\partial^{2}u}{\partial\nu^{2}}|_{\partial\Omega},

    with τ\tau being the anti-clockwise unit tangential vector, and κτ\kappa_{\tau} the curvature of Ω\partial\Omega.

We are now in a position to state the pure Neumann boundary value problem for the PDE operator (1.2) as follows.

{Lu=fin Ω,BNk(u)=0on Ω(0km1).\left\{\begin{array}[]{rccl}Lu&=&f&\mbox{in }\Omega,\\ B_{N}^{k}(u)&=&0&\mbox{on }\partial\Omega\quad(0\leq k\leq m-1).\end{array}\right. (5.7)

Combining the trace theorem for H^{m}(\Omega), see [Adams and Fournier, 2003], and Lemma 5.1, it is easy to see that (1.5) is equivalent to (5.7) with V=H^{m}(\Omega).

For a given parameter δ>0\delta>0, we next consider the following problem with mixed boundary condition:

{Luδ=fin Ω,BDk(uδ)+δBNk(uδ)=0, 0km1.\left\{\begin{aligned} Lu_{\delta}&=f\qquad\mbox{in }\Omega,\\ B_{D}^{k}(u_{\delta})+\delta B_{N}^{k}(u_{\delta})&=0,\ \ 0\leq k\leq m-1.\end{aligned}\right. (5.8)

It is easy to see that (5.8) is equivalent to the following problem: Find uδHm(Ω)u_{\delta}\in H^{m}(\Omega), such that

Jδ(uδ)=minvHm(Ω)Jδ(v).J_{\delta}(u_{\delta})=\min_{v\in H^{m}(\Omega)}J_{\delta}(v). (5.9)

where

Jδ(v)=12aδ(v,v)(f,v)J_{\delta}(v)={1\over 2}a_{\delta}(v,v)-(f,v) (5.10)

and

aδ(u,v)=a(u,v)+δ1k=0m1BDk(u),BDk(v)0,Ω.a_{\delta}(u,v)=a(u,v)+\delta^{-1}\sum_{k=0}^{m-1}\langle B_{D}^{k}(u),B_{D}^{k}(v)\rangle_{0,\partial\Omega}. (5.11)
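For instance, for m=1 and the model operator (5.5) (so that a(u,v)=\int_{\Omega}(\nabla u\cdot\nabla v+uv)\,dx and the only Dirichlet trace is B_{D}^{0}(u)=u|_{\partial\Omega}), the penalized form (5.11) reduces to

a_{\delta}(u,v)=\int_{\Omega}(\nabla u\cdot\nabla v+uv)\,dx+\delta^{-1}\int_{\partial\Omega}uv\,ds,

which is the classical penalty treatment of the homogeneous Dirichlet condition u=0 on \partial\Omega.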

In summary, we have

Lemma 5.2.

The following equivalences hold:

  1.

    u solves (5.1) or (5.7) if and only if u solves

    J(u)=minvVJ(v)J(u)=\min_{v\in V}J(v) (5.12)

    with V=H0m(Ω)V=H^{m}_{0}(\Omega) or V=Hm(Ω)V=H^{m}(\Omega),

  2.

    u_{\delta} solves (5.8) if and only if u_{\delta} solves

    Jδ(uδ)=minvVJδ(v)\displaystyle J_{\delta}(u_{\delta})=\min_{v\in V}J_{\delta}(v)

    with V=Hm(Ω)V=H^{m}(\Omega).

Lemma 5.3.

Assume that u\in V is the solution of (5.1) or (5.7) and u_{\delta}\in V is the solution of (5.9). Then the following identities hold:

{1\over 2}\|v-u\|_{a}^{2}=J(v)-J(u)\quad\forall v\in V, (5.13)

and

{1\over 2}\|v-u_{\delta}\|_{a,\delta}^{2}=J_{\delta}(v)-J_{\delta}(u_{\delta})\quad\forall v\in V. (5.14)

Here

va2=a(v,v),va,δ2=aδ(v,v).\|v\|_{a}^{2}=a(v,v),\quad\|v\|_{a,\delta}^{2}=a_{\delta}(v,v). (5.15)
Proof.

Let uu be the solution of (1.5). Given vVv\in V, consider the quadratic function of tt:

g(t)=J(u+t(vu)).g(t)=J(u+t(v-u)).

It is easy to see that

0=argmintg(t),g(0)=0,0=\arg\min_{t}g(t),\quad g^{\prime}(0)=0,

and

J(v)-J(u)=g(1)-g(0)=g^{\prime}(0)+{1\over 2}g^{\prime\prime}(0)={1\over 2}\|v-u\|_{a}^{2}.

This completes the proof of (5.13). The proof of (5.14) is similar. ∎

Lemma 5.4.

Let uu be the solution of (5.1) and uδu_{\delta} be the solution of (5.8). Then

uuδa,δδu2m,Ω.\displaystyle\|u-u_{\delta}\|_{a,\delta}\lesssim\sqrt{\delta}\|u\|_{2m,\Omega}. (5.16)
Proof.

Let w=uuδw=u-u_{\delta} and we have

{Lw=0in Ω,BDk(w)+δBNk(w)=δBNk(u), 0km1.\left\{\begin{aligned} Lw&=0\qquad\mbox{in }\Omega,\\ B_{D}^{k}(w)+\delta B_{N}^{k}(w)&=\delta B_{N}^{k}(u),\ \ 0\leq k\leq m-1.\end{aligned}\right. (5.17)

By Lemma 5.1, and (5.17), we have

0\displaystyle 0 =(Lw,w)\displaystyle=(Lw,w) (5.18)
=|α|=m(aααw,αw)k=0m1ΩBNk(w)BDk(w)𝑑s+(a0w,w)\displaystyle=\sum_{|\alpha|=m}(a_{\alpha}\partial^{\alpha}w,\partial^{\alpha}w)-\sum_{k=0}^{m-1}\int_{\partial\Omega}B_{N}^{k}(w)B_{D}^{k}(w)ds+(a_{0}w,w) (5.19)
=|α|=m(aααw,αw)+k=0m1Ω(δ1BDk(w)BNk(u))BDk(w)𝑑s+(a0w,w),\displaystyle=\sum_{|\alpha|=m}(a_{\alpha}\partial^{\alpha}w,\partial^{\alpha}w)+\sum_{k=0}^{m-1}\int_{\partial\Omega}(\delta^{-1}B_{D}^{k}(w)-B_{N}^{k}(u))B_{D}^{k}(w)ds+(a_{0}w,w), (5.20)

implying

a(w,w)+δ1k=0m1ΩBDk(w)2𝑑s=k=0m1ΩBNk(u)BDk(w)𝑑s.\displaystyle a(w,w)+\delta^{-1}\sum_{k=0}^{m-1}\int_{\partial\Omega}B_{D}^{k}(w)^{2}ds=\sum_{k=0}^{m-1}\int_{\partial\Omega}B_{N}^{k}(u)B_{D}^{k}(w)ds. (5.21)

By the Cauchy--Schwarz and Young inequalities, we have

a(w,w)+δ1k=0m1BDk(w)L2(Ω)2\displaystyle a(w,w)+\delta^{-1}\sum_{k=0}^{m-1}\|B_{D}^{k}(w)\|^{2}_{L^{2}(\partial\Omega)}\leq k=0m1BNk(u)L2(Ω)BDk(w)L2(Ω)\displaystyle\sum_{k=0}^{m-1}\|B_{N}^{k}(u)\|_{L^{2}(\partial\Omega)}\|B_{D}^{k}(w)\|_{L^{2}(\partial\Omega)} (5.22)
\displaystyle\leq 2δk=0m1BNk(u)L2(Ω)2+12δ1k=0m1BDk(w)L2(Ω)2,\displaystyle 2\delta\sum_{k=0}^{m-1}\|B_{N}^{k}(u)\|^{2}_{L^{2}(\partial\Omega)}+\frac{1}{2}\delta^{-1}\sum_{k=0}^{m-1}\|B_{D}^{k}(w)\|^{2}_{L^{2}(\partial\Omega)}, (5.23)

which implies

a(w,w)+12δ1k=0m1BDk(w)L2(Ω)22δk=0m1BNk(u)L2(Ω)2.\displaystyle a(w,w)+\frac{1}{2}\delta^{-1}\sum_{k=0}^{m-1}\|B_{D}^{k}(w)\|^{2}_{L^{2}(\partial\Omega)}\leq 2\delta\sum_{k=0}^{m-1}\|B_{N}^{k}(u)\|^{2}_{L^{2}(\partial\Omega)}. (5.24)

By the definition of a,δ\|\cdot\|_{a,\delta} and noting that w=uuδw=u-u_{\delta}, we have

uuδa,δ24δk=0m1BNk(u)L2(Ω)2.\displaystyle\|u-u_{\delta}\|_{a,\delta}^{2}\leq 4\delta\sum_{k=0}^{m-1}\|B_{N}^{k}(u)\|^{2}_{L^{2}(\partial\Omega)}. (5.25)

Combining this with (5.4) completes the proof. ∎

Lemma 5.5.

For any sms\geq-m and fHs(Ω)f\in H^{s}(\Omega), the solution uu of (5.1) or (5.7) satisfies uH2m+s(Ω)u\in H^{2m+s}(\Omega) and

u2m+s,Ωfs,Ω.\|u\|_{2m+s,\Omega}\lesssim\|f\|_{s,\Omega}. (5.26)

We refer to [Lions and Magenes, 2012] (Chapter 2, Theorem 5.1 therein) for a detailed proof.

Following from (5.26) and (2.46), we have

Lemma 5.6.

For any sms\geq-m, ϵ>0\epsilon>0, and fHs(Ω)f\in H^{s}(\Omega), the solution uu of (5.1) or (5.7) satisfies

\|u\|_{B^{m+1}(\Omega)}\lesssim\|f\|_{-m+\frac{d}{2}+1+\epsilon,\Omega}. (5.27)

5.2 The finite neuron method for (1.1) and error estimates

Let V_{N}\subset V be the subset of V defined by (3.3), which may not be a linear subspace. Consider the discrete problem of (5.12):

Find uNVN such that J(uN)=minvNVNJ(vN).\mbox{Find $u_{N}\in V_{N}$ such that }J(u_{N})=\min_{v_{N}\in V_{N}}J(v_{N}). (5.28)

It is easy to see that the solution to (5.28) always exists (for the deep neural network functions defined in § 4), but it may not be unique.
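To make the discrete problem (5.28) concrete, the following sketch (our own illustration; all variable and function names are hypothetical) evaluates the Ritz energy J(v_{N}) for the model case m=1, L=-\Delta+I with Neumann boundary conditions, using a shallow ReLU^{k} ansatz and plain Monte Carlo quadrature on \Omega=(0,1)^{2}. A gradient-based optimizer over the parameters (\beta,\omega,t) would then be applied to this energy; only its evaluation is shown, with \nabla v_{N} computed analytically.

```python
# Energy J(v_N) = 1/2 * int(|grad v_N|^2 + v_N^2) - int(f v_N) for a shallow
# ReLU^k ansatz v_N(x) = sum_j beta_j (omega_j . x - t_j)_+^k, estimated by
# Monte Carlo quadrature on the unit square.
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 2, 2, 50

# parameters of the shallow network (randomly initialized)
beta = rng.normal(size=N) / N
omega = rng.normal(size=(N, d))
t = rng.uniform(0.0, 1.0, size=N)

def f(x):                                        # a sample right-hand side
    return np.cos(np.pi * x[:, 0]) * np.cos(np.pi * x[:, 1])

def vN_and_grad(x):                              # v_N(x) and grad v_N(x)
    z = x @ omega.T - t                          # (M, N)
    zk = np.maximum(z, 0.0) ** k
    zk1 = k * np.maximum(z, 0.0) ** (k - 1)
    v = zk @ beta                                # (M,)
    g = (zk1 * beta) @ omega                     # (M, d)
    return v, g

def energy(M=20000):                             # Monte Carlo estimate of J(v_N)
    x = rng.uniform(0.0, 1.0, size=(M, d))       # uniform points in Omega
    v, g = vN_and_grad(x)
    integrand = 0.5 * ((g ** 2).sum(axis=1) + v ** 2) - f(x) * v
    return integrand.mean()                      # |Omega| = 1

print("J(v_N) at the initial parameters:", energy())
```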

Theorem 5.7.

Let uVu\in V and uNVNu_{N}\in V_{N} be solutions to (5.7) and (5.28) respectively. Then

uuNa=infvNVNuvNa.\|u-u_{N}\|_{a}=\inf_{v_{N}\in V_{N}}\|u-v_{N}\|_{a}. (5.29)
Proof.

By Lemma 5.3, we have

{1\over 2}\|u_{N}-u\|_{a}^{2}=J(u_{N})-J(u)\leq J(v_{N})-J(u)={1\over 2}\|v_{N}-u\|_{a}^{2},\quad\forall v_{N}\in V_{N}.

The proof is completed. ∎

We obtain the following result.

Theorem 5.8.

Let uVu\in V and uNVNu_{N}\in V_{N} be solutions to (5.7) and (5.28) respectively. Then for arbitrary ϵ>0\epsilon>0, we have

uuNa(fL2(Ω)+fk+d2+1+ϵ){N121dm<k,N12m=k.\|u-u_{N}\|_{a}\lesssim(\|f\|_{L^{2}(\Omega)}+\|f\|_{-k+\frac{d}{2}+1+\epsilon})\begin{cases}N^{-{1\over 2}-{1\over d}}&m<k,\\ N^{-{1\over 2}}&m=k.\end{cases} (5.30)

The proof follows from (5.29), Theorem 3.9, the embedding of the Barron space into the Sobolev space (Lemma 2.5), and the regularity result (5.26).

Next we consider the discrete problem of (5.9):

Find uNVN such that Jδ(uN)=minvNVNJδ(vN).\mbox{Find $u_{N}\in V_{N}$ such that }J_{\delta}(u_{N})=\min_{v_{N}\in V_{N}}J_{\delta}(v_{N}). (5.31)
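Continuing the sketch given after (5.28) (again our own illustration, reusing the definitions vN_and_grad, energy and rng from that snippet), for m=1 the penalized energy J_{\delta} in (5.31) only adds the boundary term (2\delta)^{-1}\int_{\partial\Omega}v_{N}^{2}\,ds:

```python
# Penalized energy J_delta(v_N) = J(v_N) + (2*delta)^{-1} * int_{dOmega} v_N^2 ds
# for the unit square, with the boundary integral sampled uniformly on the
# four edges.  vN_and_grad, energy and rng come from the previous sketch.
import numpy as np

def boundary_points(M, rng):
    s = rng.uniform(0.0, 1.0, size=M)
    e = rng.integers(0, 4, size=M)               # which edge of the unit square
    x = np.empty((M, 2))
    x[e == 0] = np.column_stack([s[e == 0], np.zeros(np.sum(e == 0))])
    x[e == 1] = np.column_stack([s[e == 1], np.ones(np.sum(e == 1))])
    x[e == 2] = np.column_stack([np.zeros(np.sum(e == 2)), s[e == 2]])
    x[e == 3] = np.column_stack([np.ones(np.sum(e == 3)), s[e == 3]])
    return x

def energy_delta(delta, M=20000, Mb=5000):
    xb = boundary_points(Mb, rng)
    vb, _ = vN_and_grad(xb)
    perimeter = 4.0                               # |partial Omega| for (0,1)^2
    return energy(M) + (0.5 / delta) * perimeter * np.mean(vb ** 2)

print("J_delta(v_N) with delta = 1e-2:", energy_delta(1e-2))
```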
Lemma 5.9.

For any given \delta>0, let u_{\delta} be the solution of (5.8) and u_{N} be the solution of (5.31). We have

uNuδa,δ(1+δ12)infvNVNvNuδm,Ω.\|u_{N}-u_{\delta}\|_{a,\delta}\lesssim(1+\delta^{-\frac{1}{2}})\inf_{v_{N}\in V_{N}}\|v_{N}-u_{\delta}\|_{m,\Omega}. (5.32)
Proof.

First of all, by Lemma 5.3 and the variational property, it holds that

{1\over 2}\|u_{N}-u_{\delta}\|_{a,\delta}^{2}=J_{\delta}(u_{N})-J_{\delta}(u_{\delta})\leq J_{\delta}(v_{N})-J_{\delta}(u_{\delta})={1\over 2}\|v_{N}-u_{\delta}\|_{a,\delta}^{2},\quad\forall\,v_{N}\in V_{N}. (5.33)

Further, for any vNVNv_{N}\in V_{N}, by the definition of a,δ\|\cdot\|_{a,\delta} and trace inequality, we have

vNuδa,δvNuδm,Ω+δ12vNuδ0,Ω(1+δ12)vNuδm,Ω.\|v_{N}-u_{\delta}\|_{a,\delta}\lesssim\|v_{N}-u_{\delta}\|_{m,\Omega}+\delta^{-\frac{1}{2}}\|v_{N}-u_{\delta}\|_{0,\partial\Omega}\lesssim(1+\delta^{-\frac{1}{2}})\|v_{N}-u_{\delta}\|_{m,\Omega}.

This completes the proof. ∎

Lemma 5.10.

Let u be the solution of (5.1) and u_{N} the solution of (5.31). We have

uNua,δ(1+δ12)infvNVNvNum,Ω+δfL2(Ω).\|u_{N}-u\|_{a,\delta}\lesssim(1+\delta^{-\frac{1}{2}})\inf_{v_{N}\in V_{N}}\|v_{N}-u\|_{m,\Omega}+\sqrt{\delta}\|f\|_{L^{2}(\Omega)}. (5.34)
Proof.

First, by the triangle inequality and (5.33), for any v_{N}\in V_{N}, we have

uNua,δ\displaystyle\|u_{N}-u\|_{a,\delta} uNuδa,δ+uδua,δ\displaystyle\leq\|u_{N}-u_{\delta}\|_{a,\delta}+\|u_{\delta}-u\|_{a,\delta} (5.35)
vNuδa,δ+uδua,δ\displaystyle\leq\|v_{N}-u_{\delta}\|_{a,\delta}+\|u_{\delta}-u\|_{a,\delta} (5.36)
vNua,δ+2uδua,δ.\displaystyle\leq\|v_{N}-u\|_{a,\delta}+2\|u_{\delta}-u\|_{a,\delta}. (5.37)

Then by the definition of a,δ\|\cdot\|_{a,\delta}, trace inequality and (5.16), for any vNVNv_{N}\in V_{N}, we have

uNua,δ(1+δ12)vNum,Ω+δfL2(Ω).\|u_{N}-u\|_{a,\delta}\lesssim(1+\delta^{-\frac{1}{2}})\|v_{N}-u\|_{m,\Omega}+\sqrt{\delta}\|f\|_{L^{2}(\Omega)}. (5.38)

This completes the proof. ∎

Theorem 5.11.

Let u be the solution of (5.1) and u_{N} the solution of (5.31) with \delta\sim N^{-1/2-1/{d}}. Then

uuNa(fL2(Ω)+fk+d2+1+ϵ){N1412dm<k,N14m=k.\|u-u_{N}\|_{a}\lesssim(\|f\|_{L^{2}(\Omega)}+\|f\|_{-k+\frac{d}{2}+1+\epsilon})\begin{cases}N^{-{1\over 4}-{1\over{2d}}}&m<k,\\ N^{-{1\over 4}}&m=k.\end{cases} (5.39)
Proof.

Let us only consider the case that k>mk>m. By Theorem 3.9,

infvNVNuvNm,ΩN121duBm+1(Ω).\inf_{v_{N}\in V_{N}}\|u-v_{N}\|_{m,\Omega}\lesssim N^{-\frac{1}{2}-{1\over d}}\|u\|_{B^{m+1}(\Omega)}.

Thus, by (5.34),

\|u-u_{N}\|_{m,\Omega}\lesssim\|u_{N}-u\|_{a,\delta}\lesssim\delta^{-\frac{1}{2}}N^{-\frac{1}{2}-{1\over d}}\|u\|_{B^{m+1}(\Omega)}+\delta^{\frac{1}{2}}\|f\|_{L^{2}(\Omega)}\leq(\delta^{-\frac{1}{2}}N^{-\frac{1}{2}-{1\over d}}+\delta^{\frac{1}{2}})(\|u\|_{B^{m+1}(\Omega)}+\|f\|_{L^{2}(\Omega)}),\ \ \forall\,\delta>0. (5.40)

Set δN1/21/d\delta\sim N^{-1/2-1/{d}}, and it follows that

uuNm,ΩN1412d(uBm+1(Ω)+fL2(Ω)).\|u-u_{N}\|_{m,\Omega}\lesssim N^{-{1\over 4}-{1\over 2d}}(\|u\|_{B^{m+1}(\Omega)}+\|f\|_{L^{2}(\Omega)}). (5.41)

Now, by the embedding of the Barron space into the Sobolev space (Lemma 2.5) and the regularity result (5.26), the proof is completed. ∎

We note that (5.31) was studied in [E and Yu, 2018] for m=1 and k=3. The convergence analysis for (5.28) and (5.31) seems to be new in this paper. For other convergence analyses of DNN-based methods for numerical PDEs, we refer to [Shin et al., 2020] and [Mishra and Rusch, 2020, Mishra and Molinaro, 2020] for the convergence analysis of PINN (physics-informed neural networks).

6 Summary and discussions

In this paper, we consider a very special class of neural network functions based on ReLU^{k} as the activation function. This function class consists of piecewise polynomials which closely resemble finite element functions. For elliptic boundary value problems of 2m-th order in any dimension, it is still unknown in general how to construct H^{m}-conforming finite element spaces in the classic finite element setting. In contrast, it is rather straightforward to construct H^{m}-conforming piecewise polynomials using neural networks, known as the finite neuron method, and we further proved that the finite neuron method provides good approximation properties.

It is still a subject of debate and of further investigation whether it is practically efficient to use artificial neural networks for the numerical solution of partial differential equations. One major challenge for this type of method is that the resulting optimization problem is hard to solve, as we shall discuss below.

6.1 Solution of the non-convex optimization problem

Problem (5.28) or (5.31) is a highly nonlinear and non-convex optimization problem with respect to the parameters defining the functions in V_{N}, see (3.3). How to solve this type of optimization problem efficiently is a topic of intensive research in deep learning. For example, the stochastic gradient method is used in [E and Yu, 2018] to solve (5.31) for m=1 and k=3. The multi-scale deep neural network (MscaleDNN) [Liu et al., 2020] and the phase shift DNN (PhaseDNN) [Cai et al., 2019] are developed to convert high frequency components of the solution to low frequency ones before training. A randomized Newton's method is developed to train the neural network from a nonlinear computation point of view [Chen and Hao, 2019]. More refined algorithms still need to be developed to solve (5.28) or (5.31) with high accuracy so that the convergence order, (5.30) or (5.39), of the finite neuron method can be achieved.

6.2 Competitions between locality and global smoothness

One insight gained from the studies in this paper is that the challenge in constructing classic H^{m}-conforming finite element subspaces seems to lie in the competition between local d.o.f. (degrees of freedom) and global smoothness. In the classic finite element method, one needs to define d.o.f. on each element and then glue the local d.o.f. together to obtain a globally H^{m}-smooth function. This process has proven to be very difficult to realize in general when m\geq 2. But if we relax the locality, as in the Powell-Sabin element [Powell and Sabin, 1977], we can use piecewise polynomials of lower degree to construct globally smooth functions. The neural network approach studied in this paper can be considered as a global construction without any use of a grid in the first place (even though an implicitly defined grid exists). As a result, it is quite easy to construct globally smooth functions that are piecewise polynomials. It is quite remarkable that such a global construction leads to a function class that has very good approximation properties. This is an attractive property of the function classes from artificial neural networks. One natural question to ask is whether it is possible to develop finite element construction techniques that are more global than the classic finite element method but more local than the finite neuron method, which may be an interesting topic for further research.

| Local d.o.f. | Slightly more global | \cdots | Global |
| General grid | Special grid | \cdots | No grid |
| Conjecture: k=(m-1)2^{d}+1 | Powell-Sabin [Powell and Sabin, 1977], k=2 | \cdots | ReLU^{m}-DNN, k=m |
| True for d=1, m\geq 1 and d=2, m=2 (still open in general) | d=2, m=2 | \cdots | any d and m |

Observation: More global d.o.f. lead to easier construction of conforming elements for high order PDEs.

6.3 Piecewise PmP_{m} for Hm(Ω)H^{m}(\Omega): from finite element to finite neuron method

As noted above, in the classic finite element setting it is challenging to construct H^{m}-conforming finite element spaces for any m,d\geq 1. But if we relax the conformity, as shown in [Wang and Xu, 2013], it is possible to give a universal construction of convergent H^{m}-nonconforming finite elements consisting of piecewise polynomials of degree m. In the finite neuron method setting, by relaxing the constraints from the a priori given finite element grid, the construction of H^{m}-conforming piecewise polynomials of degree m becomes straightforward. In fact, the finite neuron method can be considered as a mesh-less method, or even a vertex-less method, although there is a hidden grid for any finite neuron function. This raises the question of whether it is possible to develop some "in-between" method that has the advantages of both the classic finite element method and the finite neuron method.

6.4 Adaptivity and spectral accuracy

One of the important properties of the traditional finite element method is its ability to locally adapt the finite element grids to provide accurate approximation of PDE solutions that may have local singularities (such as corner singularities and interface singularities). In contrast, the traditional spectral method (using high order polynomials) can provide very high order accuracy for solutions that are globally smooth. The finite neuron method analyzed in this paper seems to possess both the adaptivity feature of the traditional finite element method and the global spectral accuracy of traditional spectral methods. The adaptivity of the finite neuron method is expected since, as shown in § 4, the deep finite neuron method can recover locally adaptive finite element spaces for m=1. The spectral accuracy of the finite neuron method is illustrated in Theorem 4.2. As a result, it is conceivable that the finite neuron method may have both local and global adaptive features, or perhaps even adaptive features on all scales. Nevertheless, such highly adaptive features of the finite neuron method come at a potentially high price, namely the solution of a nonlinear and non-convex optimization problem.

6.5 Comparison with PINN

One important class of methods that is related to the FNM analyzed in this paper is the method of physics-informed neural networks (PINN) introduced in [Raissi et al., 2019]. By minimizing certain norms of the PDE residual together with penalizations of boundary conditions and other relevant quantities, PINN is a very general approach that can be directly applied to a wide range of problems. In comparison, the FNM can only be applied to special classes of problems that admit some special physical law such as the principle of energy minimization or the principle of least action, see [Feynman et al.]. Because of the special physical law associated with our underlying minimization problems, the Neumann boundary conditions are naturally enforced in the minimization problem and, unlike in the PINN method, no penalization is needed to enforce this type of boundary condition.

6.6 On the sharpness of the error estimates

In this paper, we provide a number of error estimates for our FNM, such as (3.23), (3.27) and (3.69), which give increasingly better asymptotic orders but also require more regularity. Even for a sufficiently regular solution u, the best asymptotic estimate (3.69) may still not be optimal. In the finite element method, piecewise polynomials of degree k usually give rise to increasingly better asymptotic errors as k increases. But the asymptotic rate in the estimate (3.69) does not improve as k increases. On the other hand, if k>j, the ReLU^{k}-DNN should conceivably give better accuracy than the ReLU^{j}-DNN, since ReLU^{j} can be approximated arbitrarily accurately by certain finite differences of ReLU^{k}. How to obtain better asymptotic estimates than (3.69) is still a subject of further investigation.

We also note that our error estimate (5.39) for the Dirichlet boundary condition is not as good as the estimate (5.30) for the Neumann boundary conditions. This is undesirable and may not be optimal. In comparison, the Nitsche trick does not suffer a loss of accuracy when used in the traditional finite element method.

6.7 Neural splines in multi-dimensions

The spline functions described in § 3.1 are widely used in scientific and engineering computing, but their generalization to multiple dimensions is non-trivial, especially when \Omega has a curved boundary. In [Hu and Zhang, 2015], using tensor products, the authors extended 1D splines to multiple dimensions on rectangular grids. Other approaches involve rational functions, such as NURBS [Cottrell et al., 2009]. But the generalization of {}_{n}{\mathcal{N}}_{1}^{k} or {}_{n}{\mathcal{N}}_{1}(b^{k}) to multiple dimensions is straightforward, and the resulting (nonlinear) space has very good approximation properties. It is conceivable that this neural network extension of B-splines to multiple dimensions, which is locally polynomial and globally smooth, may find useful applications in computer-aided design (CAD) and isogeometric analysis [Cottrell et al., 2009]. This is potentially an interesting research direction.

Acknowledgements

Main results in this manuscript were prepared for and reported in the "International Conference on Computational Mathematics and Scientific Computing" (August 17-20, 2020, http://lsec.cc.ac.cn/~iccmsc/Home.html), and the author is grateful for the invitation of the conference organizers and for the helpful feedback from the audience. The author also wishes to thank Limin Ma, Qingguo Hong and Shuo Zhang for their help in preparing this manuscript. This work was partially supported by the Verne M. William Professorship Fund from Penn State University and the National Science Foundation (Grant No. DMS-1819157).

References

  • [Adams and Fournier, 2003] Adams, R. A. and Fournier, J. (2003). Sobolev spaces, volume 140. Academic press.
  • [Alfeld, 1984] Alfeld, P. (1984). A trivariate Clough-Tocher scheme for tetrahedral data. Computer Aided Geometric Design, 1(2):169–181.
  • [Antonietti et al., 2018] Antonietti, P. F., Manzini, G., and Verani, M. (2018). The fully nonconforming virtual element method for biharmonic problems. Mathematical Models and Methods in Applied Sciences, 28(02):387–407.
  • [Argyris et al., 1968] Argyris, J. H., Fried, I., and Scharpf, D. W. (1968). The tuba family of plate elements for the matrix displacement method. The Aeronautical Journal, 72(692):701–709.
  • [Baker, 1977] Baker, G. A. (1977). Finite element methods for elliptic equations using nonconforming elements. Mathematics of Computation, 31(137):45–59.
  • [Barron, 1993] Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945.
  • [Beirão da Veiga et al., 2013] Beirão da Veiga, L., Brezzi, F., Cangiani, A., Manzini, G., Marini, L. D., and Russo, A. (2013). Basic principles of virtual element methods. Mathematical Models and Methods in Applied Sciences, 23(01):199–214.
  • [Bickel and Freedman, 1984] Bickel, P. J. and Freedman, D. A. (1984). Asymptotic normality and the bootstrap in stratified sampling. The annals of statistics, pages 470–482.
  • [Boor and Devore, 1983] Boor, C. D. and Devore, R. (1983). Approximation by smooth multivariate splines. Transactions of the American Mathematical Society, 276:775–788.
  • [Boor and Höllig, 1983] Boor, C. D. and Höllig, K. (1983). Approximation order from bivariate c1c^{1}-cubics: A counterexample. Proceedings of the American Mathematical Society, 87(4):649–655.
  • [Bramble and Zlámal, 1970] Bramble, J. H. and Zlámal, M. (1970). Triangular elements in the finite element method. Mathematics of Computation, 24(112):809–820.
  • [Brenner and Sung, 2005] Brenner, S. C. and Sung, L. (2005). C0C^{0} interior penalty methods for fourth order elliptic boundary value problems on polygonal domains. Journal of Scientific Computing, 22(1-3):83–118.
  • [Brezzi et al., 2014] Brezzi, F., Falk, R. S., and Marini, L. D. (2014). Basic principles of mixed virtual element methods. ESAIM: Mathematical Modelling and Numerical Analysis, 48(4):1227–1240.
  • [Brezzi and Marini, 2013] Brezzi, F. and Marini, L. D. (2013). Virtual element methods for plate bending problems. Computer Methods in Applied Mechanics and Engineering, 253:455–462.
  • [Cai et al., 2019] Cai, W., Li, X., and Liu, L. (2019). A phase shift deep neural network for high frequency wave equations in inhomogeneous media. arXiv preprint arXiv:1909.11759.
  • [Chen and Huang, 2020] Chen, L. and Huang, X. (2020). Nonconforming virtual element method for 2m2mth order partial differential equations in Rn{R}^{n}. Mathematics of Computation, 89(324):1711–1744.
  • [Chen and Hao, 2019] Chen, Q. and Hao, W. (2019). A randomized Newton's method for solving differential equations based on the neural network discretization. arXiv preprint arXiv:1912.03196.
  • [Chien, 1980] Chien, W. Z. (1980). Variational methods and finite elements.
  • [Ciarlet, 1978] Ciarlet, P. G. (1978). The finite element method for elliptic problems. North-Holland.
  • [Clough and Tocher, 1965] Clough, R. and Tocher, J. (1965). Finite-element stiffness analysis of plate bending. In Proc. First Conf. Matrix Methods in Struct. Mech., Wright-Patterson AFB, Dayton (Ohio).
  • [Cottrell et al., 2009] Cottrell, J. A., Hughes, T. J., and Bazilevs, Y. (2009). Isogeometric analysis: toward integration of CAD and FEA. John Wiley & Sons.
  • [Cybenko, 1989] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314.
  • [de Boor, 1971] de Boor, C. (1971). Subroutine package for calculating with B-splines. Los Alamos Scient. Lab. Report LA-4728-MS.
  • [E et al., 2017] E, W., Han, J., and Jentzen, A. (2017). Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380.
  • [E et al., 2019a] E, W., Ma, C., and Wu, L. (2019a). Barron spaces and the compositional function spaces for neural network models. arXiv preprint arXiv:1906.08039.
  • [E et al., 2019b] E, W., Ma, C., and Wu, L. (2019b). A priori estimates of the population risk for two-layer neural networks. Communications in Mathematical Sciences, 17(5):1407–1425.
  • [E and Wojtowytsch, 2020] E, W. and Wojtowytsch, S. (2020). Representation formulas and pointwise properties for Barron functions. arXiv preprint arXiv:2006.05982.
  • [E and Yu, 2018] E, W. and Yu, B. (2018). The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12.
  • [Feng, 1965] Feng, K. (1965). Finite difference schemes based on variational principles. Appl. Math. Comput. Math, 2:238–262.
  • [Feynman et al.] Feynman, R., Leighton, R., and Sands, M. The Feynman Lectures on Physics, 3 volumes, 1964, 1966. Library of Congress Catalog Card No. 63-20717.
  • [Fu et al., 2020] Fu, G., Guzmán, J., and Neilan, M. (2020). Exact smooth piecewise polynomial sequences on alfeld splits. Mathematics of Computation, 89(323):1059–1091.
  • [Funahashi, 1989] Funahashi, K.-I. (1989). On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192.
  • [Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep learning, volume 1. MIT press Cambridge.
  • [Gudi and Neilan, 2011] Gudi, T. and Neilan, M. (2011). An interior penalty method for a sixth-order elliptic equation. IMA Journal of Numerical Analysis, 31(4):1734–1753.
  • [He et al., 2020a] He, J., Li, L., and Xu, J. (2020a). DNN with Heaviside, ReLU and ReQU activation functions. preprint.
  • [He et al., 2018] He, J., Li, L., Xu, J., and Zheng, C. (2018). Relu deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973.
  • [He et al., 2020b] He, J., Li, L., Xu, J., and Zheng, C. (2020b). ReLU deep neural networks and linear finite elements. Journal of Computational Mathematics, 38(3):502–527.
  • [Hornik et al., 1989] Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366.
  • [Hornik et al., 1994] Hornik, K., Stinchcombe, M., White, H., and Auer, P. (1994). Degree of approximation results for feedforward networks approximating unknown mappings and their derivatives. Neural Computation, 6(6):1262–1275.
  • [Hu and Zhang, 2015] Hu, J. and Zhang, S. (2015). The minimal conforming HkH^{k} finite element spaces on n\mathbb{R}^{n} rectangular grids. Mathematics of Computation, 84(292):563–579.
  • [Hu and Zhang, 2017] Hu, J. and Zhang, S. (2017). A canonical construction of HmH^{m}-nonconforming triangular finite elements. Annals of Applied Mathematics, 33(3):266–288.
  • [Jones, 1992] Jones, L. K. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. The Annals of Statistics, 20(1):608–613.
  • [Klusowski and Barron, 2016] Klusowski, J. M. and Barron, A. R. (2016). Uniform approximation by neural networks activated by first and second order ridge splines. arXiv preprint arXiv:1607.07819.
  • [Lagaris et al., 1998] Lagaris, I. E., Likas, A., and Fotiadis, D. I. (1998). Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987–1000.
  • [Lai and Schumaker, 2007] Lai, M. and Schumaker, L. L. (2007). Spline functions on triangulations. Number 110. Cambridge University Press.
  • [Li et al., 2019] Li, B., Tang, S., and Yu, H. (2019). Better approximations of high dimensional smooth functions by deep neural networks with rectified power units. arXiv preprint arXiv:1903.05858.
  • [Lions and Magenes, 2012] Lions, J. L. and Magenes, E. (2012). Non-homogeneous boundary value problems and applications, volume 1. Springer Science & Business Media.
  • [Liu et al., 2020] Liu, Z., Cai, W., and Xu, Z. (2020). Multi-scale deep neural network (MscaleDNN) for solving the Poisson-Boltzmann equation in complex domains. arXiv preprint arXiv:2007.11207.
  • [Makovoz, 1996] Makovoz, Y. (1996). Random approximants and neural networks. Journal of Approximation Theory, 85(1):98–109.
  • [McCulloch and Pitts, 1943] McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133.
  • [Mhaskar and Micchelli, 1995] Mhaskar, H. and Micchelli, C. A. (1995). Degree of approximation by neural and translation networks with a single hidden layer. Advances in applied mathematics, 16(2):151–183.
  • [Mhaskar and Micchelli, 1994] Mhaskar, H. N. and Micchelli, C. A. (1994). Dimension-independent bounds on the degree of approximation by neural networks. IBM Journal of Research and Development, 38(3):277–284.
  • [Mishra and Molinaro, 2020] Mishra, S. and Molinaro, R. (2020). Estimates on the generalization error of physics informed neural networks (PINNs) for approximating PDEs II: A class of inverse problems. arXiv preprint arXiv:2007.01138.
  • [Mishra and Rusch, 2020] Mishra, S. and Rusch, T. K. (2020). Enhancing accuracy of deep learning algorithms by training with low-discrepancy sequences. arXiv preprint arXiv:2005.12564.
  • [Morley, 1967] Morley, L. S. D. (1967). The triangular equilibrium element in the solution of plate bending problems. RAE.
  • [Pinkus, 1999] Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta numerica, 8:143–195.
  • [Powell and Sabin, 1977] Powell, M. J. and Sabin, M. A. (1977). Piecewise quadratic approximations on triangles. ACM Transactions on Mathematical Software (TOMS), 3(4):316–325.
  • [Raissi et al., 2019] Raissi, M., Perdikaris, P., and Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707.
  • [Rubinstein and Kroese, 2016] Rubinstein, R. Y. and Kroese, D. P. (2016). Simulation and the Monte Carlo method, volume 10. John Wiley & Sons.
  • [Schedensack, 2016] Schedensack, M. (2016). A new discretization for mmth-Laplace equations with arbitrary polynomial degrees. SIAM Journal on Numerical Analysis, 54(4):2138–2162.
  • [Siegel and Xu, 2020] Siegel, J. W. and Xu, J. (2020). Approximation rates for neural networks with general activation functions. Neural Networks.
  • [Sirignano and Spiliopoulos, 2018] Sirignano, J. and Spiliopoulos, K. (2018). DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364.
  • [Wang and Xu, 2013] Wang, M. and Xu, J. (2013). Minimal finite element spaces for 2m2m-th-order partial differential equations in n\mathbb{R}^{n}. Mathematics of Computation, 82(281):25–43.
  • [Wu and Xu, 2017] Wu, S. and Xu, J. (2017). PmP_{m} interior penalty nonconforming finite element methods for 2m-th order PDEs in n\mathbb{R}^{n}. arXiv preprint arXiv:1710.07678.
  • [Wu and Xu, 2019] Wu, S. and Xu, J. (2019). Nonconforming finite element spaces for 2m2m-th order partial differential equations on n\mathbb{R}^{n} simplicial grids when m=n+1m=n+1. Mathematics of Computation, 88(316):531–551.
  • [Xu, 1992] Xu, J. (1992). Iterative methods by space decomposition and subspace correction. SIAM review, 34(4):581–613.
  • [Ženíšek, 1970] Ženíšek, A. (1970). Interpolation polynomials on the triangle. Numerische Mathematik, 15(4):283–296.
  • [Zhang, 2009] Zhang, S. (2009). A family of 3D continuously differentiable finite elements on tetrahedral grids. Applied Numerical Mathematics, 59(1):219–233.