
Besov Function Approximation and Binary Classification on Low-Dimensional Manifolds Using Convolutional Residual Networks

Hao Liu     Minshuo Chen     Tuo Zhao     Wenjing Liao Hao Liu is affiliated with the Department of Mathematics at Hong Kong Baptist University; Wenjing Liao is affiliated with the School of Mathematics at Georgia Tech; Minshuo Chen and Tuo Zhao are affiliated with the ISYE department at Georgia Tech. Email: haoliu@hkbu.edu.hk, {mchen393, wliao60, tzhao80}@gatech.edu.
Abstract

Most existing statistical theories on deep neural networks have sample complexities cursed by the data dimension, and therefore cannot well explain the empirical success of deep learning on high-dimensional data. To bridge this gap, we propose to exploit the low-dimensional geometric structures of real-world data sets. We establish theoretical guarantees of convolutional residual networks (ConvResNets) in terms of function approximation and statistical estimation for binary classification. Specifically, given data lying on a $d$-dimensional manifold isometrically embedded in $\mathbb{R}^D$, we prove that if the network architecture is properly chosen, ConvResNets can (1) approximate Besov functions on manifolds with arbitrary accuracy, and (2) learn a classifier by minimizing the empirical logistic risk, which gives an excess risk in the order of $n^{-\frac{s}{2s+2(s\vee d)}}$, where $s$ is a smoothness parameter. This implies that the sample complexity depends on the intrinsic dimension $d$, instead of the data dimension $D$. Our results demonstrate that ConvResNets are adaptive to low-dimensional structures of data sets.

1 Introduction

Deep learning has achieved significant success in various practical applications with high-dimensional data sets, such as computer vision (Krizhevsky et al., 2012), natural language processing (Graves et al., 2013; Young et al., 2018; Wu et al., 2016), health care (Miotto et al., 2018; Jiang et al., 2017) and bioinformatics (Alipanahi et al., 2015; Zhou and Troyanskaya, 2015).

The success of deep learning clearly demonstrates the great power of neural networks in representing complex data. In the past decades, the representation power of neural networks has been extensively studied. The most commonly studied architecture is the feedforward neural network (FNN), as it has a simple composition form. The representation theory of FNNs has been developed with smooth activation functions (e.g., sigmoid) in Cybenko (1989); Barron (1993); McCaffrey and Gallant (1994); Hamers and Kohler (2006); Kohler and Krzyżak (2005); Kohler and Mehnert (2011) or nonsmooth activations (e.g., ReLU) in Lu et al. (2017); Yarotsky (2017); Lee et al. (2017); Suzuki (2019). These works show that if the network architecture is properly chosen, FNNs can approximate uniformly smooth functions (e.g., Hölder or Sobolev) with arbitrary accuracy.

Figure 1: Illustration of (a) convolution and (b) skip-layer connection.

In real-world applications, convolutional neural networks (CNNs) are more popular than FNNs (LeCun et al., 1989; Krizhevsky et al., 2012; Sermanet et al., 2013; He et al., 2016; Chen et al., 2017; Long et al., 2015; Simonyan and Zisserman, 2014; Girshick, 2015). In a CNN, each layer consists of several filters (channels) which are convolved with the input, as demonstrated in Figure 1(a). Due to the complexity of the CNN architecture, there are only limited works on the representation theory of CNNs (Zhou, 2020b, a; Fang et al., 2020; Petersen and Voigtlaender, 2020). The CNNs constructed in these works become extremely wide (in terms of the size of each layer's output) as the approximation error goes to 0. In most real-life applications, the network width does not exceed 2048 (Zagoruyko and Komodakis, 2016; Zhang et al., 2020).

Convolutional residual networks (ConvResNets) are a special class of CNNs with skip-layer connections, as shown in Figure 1(b). Specifically, in addition to the convolutional layers of CNNs, ConvResNets have identity connections between inconsecutive layers. In many applications, ConvResNets outperform CNNs in terms of generalization performance and computational efficiency, and alleviate the vanishing gradient issue. Using this architecture, He et al. (2016) won first place on the ImageNet classification task in 2015 with a 3.57% top-5 error.

Table 1: Comparison of our approximation theory and existing theoretical results.

| | Network type | Function class | Low dim. structure | Fixed width | Training |
|---|---|---|---|---|---|
| Yarotsky (2017) | FNN | Sobolev | ✗ | ✗ | difficult to train due to the cardinality constraint |
| Suzuki (2019) | FNN | Besov | ✗ | ✗ | |
| Chen et al. (2019b) | FNN | Hölder | ✓ | ✗ | |
| Petersen and Voigtlaender (2020) | CNN | same as FNN | ✗ | ✗ | |
| Zhou (2020b) | CNN | Sobolev | ✗ | ✗ | can be trained without the cardinality constraint |
| Oono and Suzuki (2019) | ConvResNet | Hölder | ✗ | ✓ | |
| Ours | ConvResNet | Besov | ✓ | ✓ | |

Recently, Oono and Suzuki (2019) developed the only representation and statistical estimation theory of ConvResNets. They prove that if the network architecture is properly set, ConvResNets with a fixed filter size and a fixed number of channels can universally approximate Hölder functions with arbitrary accuracy. However, the sample complexity in Oono and Suzuki (2019) grows exponentially with respect to the data dimension and therefore cannot well explain the empirical success of ConvResNets on high-dimensional data. In order to estimate a $C^s$ function in $\mathbb{R}^D$ with accuracy $\varepsilon$, the sample size required by Oono and Suzuki (2019) scales as $\varepsilon^{-\frac{2s+D}{s}}$, which is far beyond the sample size used in practical applications. For example, the ImageNet data set consists of 1.2 million labeled images of size $224\times 224\times 3$. According to this theory, to achieve a 0.1 error, the sample size would need to be in the order of $10^{224\times 224\times 3}$, which greatly exceeds 1.2 million. Due to the curse of dimensionality, there is a huge gap between theory and practice.

We bridge such a gap by taking the low-dimensional geometric structures of data sets into consideration. It is commonly believed that real-world data sets exhibit low-dimensional structures due to rich local regularities, global symmetries, or repetitive patterns (Hinton and Salakhutdinov, 2006; Osher et al., 2017; Tenenbaum et al., 2000). For example, the ImageNet data set contains many images of the same object with certain transformations, such as rotation, translation, projection and skeletonization. As a result, the degree of freedom of the ImageNet data set is significantly smaller than the data dimension (Gong et al., 2019).

The function space considered in Oono and Suzuki (2019) is the Hölder space, in which functions are required to be differentiable everywhere up to a certain order. In practice, the target function may not have high-order derivatives. Function spaces with less restrictive conditions are more desirable. In this paper, we consider the Besov space $B^s_{p,q}$, which is more general than the Hölder space. In particular, the Hölder and Sobolev spaces are special cases of the Besov space:

$W^{s+\alpha,\infty}=\mathcal{H}^{s,\alpha}\subseteq B^{s+\alpha}_{\infty,\infty}\subseteq B^{s+\alpha}_{p,q}$

for any $0<p,q\leq\infty$, $s\in\mathbb{N}$ and $\alpha\in(0,1]$. For practical applications, it has been demonstrated in image processing that Besov norms can capture important features, such as edges (Jaffard et al., 2001). Due to the generality of the Besov space, it is shown in Suzuki and Nitanda (2019) that kernel ridge estimators have a sub-optimal rate when estimating Besov functions.

In this paper, we establish theoretical guarantees of ConvResNets for the approximation of Besov functions on a low-dimensional manifold, and a statistical theory on binary classification. Let $\mathcal{M}$ be a $d$-dimensional compact Riemannian manifold isometrically embedded in $\mathbb{R}^D$. Denote the Besov space on $\mathcal{M}$ by $B_{p,q}^{s}(\mathcal{M})$ for $0<p,q\leq\infty$ and $0<s<\infty$. Our function approximation theory is established for $B_{p,q}^{s}(\mathcal{M})$. For binary classification, we are given $n$ i.i.d. samples $\{(\mathbf{x}_i,y_i)\}_{i=1}^n$, where $\mathbf{x}_i\in\mathcal{M}$ and $y_i\in\{-1,1\}$ is the label. The label $y$ follows the Bernoulli-type distribution

$\mathbb{P}(y=1|\mathbf{x})=\eta(\mathbf{x}),\quad \mathbb{P}(y=-1|\mathbf{x})=1-\eta(\mathbf{x})$

for some $\eta:\mathcal{M}\rightarrow[0,1]$. Our results (Theorems 1 and 2) are summarized as follows:

Theorem (informal).

Assume $s\geq d/p+1$.

  1. Given $\varepsilon\in(0,1)$, we construct a ConvResNet architecture such that, for any $f^{*}\in B_{p,q}^{s}(\mathcal{M})$, if the weight parameters of this ConvResNet are properly chosen, it gives rise to $\bar{f}$ satisfying

     $\|\bar{f}-f^{*}\|_{L^{\infty}}\leq\varepsilon.$
  2. Assume $\eta\in B_{p,q}^{s}(\mathcal{M})$. Let $f_{\phi}^{*}$ be the minimizer of the population logistic risk. If the ConvResNet architecture is properly chosen, minimizing the empirical logistic risk gives rise to $\widehat{f}_{\phi,n}$ with the following excess risk bound:

     $\mathbb{E}(\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*}))\leq Cn^{-\frac{s}{2s+2(s\vee d)}}\log^{4}n,$

     where $\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*})$ denotes the excess logistic risk of $\widehat{f}_{\phi,n}$ against $f_{\phi}^{*}$ and $C$ is a constant independent of $n$.

We remark that the first part of the theorem above requires the network size to depend on the intrinsic dimension $d$ and only weakly depend on $D$. The second part is built upon the first part and shows a fast convergence rate of the excess risk in terms of $n$, where the exponent depends on $d$ instead of $D$. Our results demonstrate that ConvResNets are adaptive to low-dimensional structures of data sets.

Related work. Approximation theories of FNNs with the ReLU activation have been established for Sobolev (Yarotsky, 2017), Hölder (Schmidt-Hieber, 2017) and Besov (Suzuki, 2019) spaces. The networks in these works have a cardinality constraint, i.e., the number of nonzero parameters is bounded by a certain constant, which requires substantial effort in training.

Approximation theories of CNNs are developed in Zhou (2020b); Petersen and Voigtlaender (2020); Oono and Suzuki (2019). Among these works, Zhou (2020b) shows that CNNs can approximate Sobolev functions in $W^{s,2}$ for $s\geq D/2+2$ with an arbitrary accuracy $\varepsilon\in(0,1)$. The network in Zhou (2020b) has width increasing linearly with respect to depth, and depth growing in the order of $\varepsilon^{-2}$ as $\varepsilon$ decreases to 0. It is shown in Petersen and Voigtlaender (2020); Zhou (2020a) that any approximation error achieved by FNNs can be achieved by CNNs. Combining Zhou (2020a) and Yarotsky (2017), we can show that CNNs can approximate $W^{s,\infty}$ functions in $\mathbb{R}^D$ with arbitrary accuracy $\varepsilon$. Such CNNs have a number of channels in the order of $\varepsilon^{-D/s}$ and a cardinality constraint. The only theory on ConvResNets can be found in Oono and Suzuki (2019), where an approximation theory for Hölder functions is proved for ConvResNets with fixed width.

Statistical theories for binary classification by FNNs are established with the hinge loss (Ohn and Kim, 2019; Hu et al., 2020) and the logistic loss (Kim et al., 2018). Among these works, Hu et al. (2020) uses a parametric model given by a teacher-student network. The nonparametric results in Ohn and Kim (2019); Kim et al. (2018) are cursed by the data dimension, and therefore require a large number of samples for high-dimensional data.

Binary classification by CNNs has been studied in Kohler et al. (2020); Kohler and Langer (2020); Nitanda and Suzuki (2018); Huang et al. (2018). Binary image classification is studied in Kohler et al. (2020); Kohler and Langer (2020), in which the probability function is assumed to belong to a hierarchical max-pooling model class. ResNet-type classifiers are considered in Nitanda and Suzuki (2018); Huang et al. (2018), but the generalization error is not given explicitly.

Low-dimensional structures of data sets are explored for neural networks in Chui and Mhaskar (2018); Shaham et al. (2018); Chen et al. (2019b, a); Schmidt-Hieber (2019); Nakada and Imaizumi (2019); Cloninger and Klock (2020); Chen et al. (2020); Montanelli and Yang (2020). These works show that, if data are near a low-dimensional manifold, the performance of FNNs depends on the intrinsic dimension of the manifold and only weakly depends on the data dimension. Our work focuses on ConvResNets for practical applications.

The networks in many aforementioned works have a cardinality constraint. From the computational perspective, training such networks requires substantial effort (Han et al., 2016, 2015; Blalock et al., 2020). In comparison, the ConvResNets in Oono and Suzuki (2019) and in this paper do not require any cardinality constraint. Additionally, our constructed network has a fixed filter size and a fixed number of channels, which is desirable for practical applications.

As a summary, we compare our approximation theory and existing results in Table 1.

The rest of this paper is organized as follows: In Section 2, we briefly introduce manifolds, Besov functions on manifolds and convolution. Our main results are presented in Section 3. We give a proof sketch in Section 4 and conclude this paper in Section 5.

2 Preliminaries

Notations: We use bold lower-case letters to denote vectors, upper-case letters to denote matrices, and calligraphic letters to denote tensors, sets and manifolds. For any $x>0$, we use $\lceil x\rceil$ to denote the smallest integer that is no less than $x$ and $\lfloor x\rfloor$ to denote the largest integer that is no larger than $x$. For any $a,b\in\mathbb{R}$, we denote $a\vee b=\max(a,b)$. For a function $f:\mathbb{R}^d\rightarrow\mathbb{R}$ and a set $\Omega\subset\mathbb{R}^d$, we denote the restriction of $f$ to $\Omega$ by $f|_{\Omega}$. We use $\|f\|_{L^p}$ to denote the $L^p$ norm of $f$. We denote the Euclidean ball centered at $\mathbf{c}$ with radius $\omega$ by $B_{\omega}(\mathbf{c})$.

2.1 Low-dimensional manifolds

We first introduce some concepts on manifolds, including charts, atlases and the partition of unity; we refer the readers to Tu (2010); Lee (2006) for details. Throughout this paper, we let $\mathcal{M}$ be a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$ with $d\leq D$.

Definition 1 (Chart).

A chart on $\mathcal{M}$ is a pair $(U,\phi)$, where $U\subset\mathcal{M}$ is open and $\phi:U\to\mathbb{R}^d$ is a homeomorphism (i.e., bijective, with $\phi$ and $\phi^{-1}$ both continuous).

In a chart $(U,\phi)$, $U$ is called a coordinate neighborhood and $\phi$ is a coordinate system on $U$. A collection of charts which covers $\mathcal{M}$ is called an atlas of $\mathcal{M}$.

Definition 2 ($C^k$ Atlas).

A $C^k$ atlas for $\mathcal{M}$ is a collection of charts $\{(U_\alpha,\phi_\alpha)\}_{\alpha\in\mathcal{A}}$ which satisfies $\bigcup_{\alpha\in\mathcal{A}}U_\alpha=\mathcal{M}$, and are pairwise $C^k$ compatible:

$\phi_{\alpha}\circ\phi_{\beta}^{-1}:\phi_{\beta}(U_{\alpha}\cap U_{\beta})\to\phi_{\alpha}(U_{\alpha}\cap U_{\beta})\quad\textrm{and}\quad\phi_{\beta}\circ\phi_{\alpha}^{-1}:\phi_{\alpha}(U_{\alpha}\cap U_{\beta})\to\phi_{\beta}(U_{\alpha}\cap U_{\beta})$

are both $C^k$ for any $\alpha,\beta\in\mathcal{A}$. An atlas is called finite if it contains finitely many charts.

Definition 3 (Smooth Manifold).

A smooth manifold is a manifold $\mathcal{M}$ together with a $C^\infty$ atlas.

The Euclidean space, the torus and the unit sphere are examples of smooth manifolds. $C^s$ functions on a smooth manifold $\mathcal{M}$ are defined as follows:

Definition 4 ($C^s$ Functions on $\mathcal{M}$).

Let $\mathcal{M}$ be a smooth manifold and $f:\mathcal{M}\rightarrow\mathbb{R}$ a function on $\mathcal{M}$. We say $f$ is a $C^s$ function on $\mathcal{M}$ if, for every chart $(U,\phi)$ on $\mathcal{M}$, the function $f\circ\phi^{-1}:\phi(U)\rightarrow\mathbb{R}$ is a $C^s$ function.

We next define the $C^\infty$ partition of unity, which is an important tool for the study of functions on manifolds.

Definition 5 (Partition of Unity).

A $C^\infty$ partition of unity on a manifold $\mathcal{M}$ is a collection of $C^\infty$ functions $\{\rho_\alpha\}_{\alpha\in\mathcal{A}}$ with $\rho_\alpha:\mathcal{M}\to[0,1]$ such that for any $\mathbf{x}\in\mathcal{M}$,

  1. there is a neighbourhood of $\mathbf{x}$ where only a finite number of the functions in $\{\rho_\alpha\}_{\alpha\in\mathcal{A}}$ are nonzero, and

  2. $\sum_{\alpha\in\mathcal{A}}\rho_\alpha(\mathbf{x})=1$.

An open cover of a manifold $\mathcal{M}$ is called locally finite if every $\mathbf{x}\in\mathcal{M}$ has a neighbourhood which intersects with a finite number of sets in the cover. The following proposition shows that a $C^\infty$ partition of unity for a smooth manifold always exists (Spivak, 1970, Chapter 2, Theorem 15).

Proposition 1 (Existence of a $C^\infty$ Partition of Unity).

Let $\{U_\alpha\}_{\alpha\in\mathcal{A}}$ be a locally finite cover of a smooth manifold $\mathcal{M}$. There is a $C^\infty$ partition of unity $\{\rho_\alpha\}_{\alpha=1}^{\infty}$ such that $\mathrm{supp}(\rho_\alpha)\subset U_\alpha$.

Let $\{(U_\alpha,\phi_\alpha)\}_{\alpha\in\mathcal{A}}$ be a $C^\infty$ atlas of $\mathcal{M}$. Proposition 1 guarantees the existence of a partition of unity $\{\rho_\alpha\}_{\alpha\in\mathcal{A}}$ such that $\rho_\alpha$ is supported on $U_\alpha$.

The reach of $\mathcal{M}$, introduced by Federer (1959), is an important quantity defined below. Let $d(\mathbf{x},\mathcal{M})=\inf_{\mathbf{y}\in\mathcal{M}}\|\mathbf{x}-\mathbf{y}\|_2$ be the distance from $\mathbf{x}$ to $\mathcal{M}$.

Definition 6 (Reach (Federer, 1959; Niyogi et al., 2008)).

Define the set

$G=\left\{\mathbf{x}\in\mathbb{R}^{D}:\exists\mbox{ distinct }\mathbf{p},\mathbf{q}\in\mathcal{M}\mbox{ such that }d(\mathbf{x},\mathcal{M})=\|\mathbf{x}-\mathbf{p}\|_{2}=\|\mathbf{x}-\mathbf{q}\|_{2}\right\}.$

The closure of $G$ is called the medial axis of $\mathcal{M}$. The reach of $\mathcal{M}$ is defined as

$\tau=\inf_{\mathbf{x}\in\mathcal{M}}\ \inf_{\mathbf{y}\in G}\|\mathbf{x}-\mathbf{y}\|_{2}.$
Figure 2: Illustration of manifolds with large and small reach.

We illustrate large and small reach in Figure 2. For example, a sphere of radius $r$ in $\mathbb{R}^D$ has reach $r$, while a manifold that nearly intersects itself has a small reach.

2.2 Besov functions on a smooth manifold

We next define Besov function spaces on $\mathcal{M}$, which generalize more elementary function spaces such as the Sobolev and Hölder spaces. To define Besov functions, we first introduce the modulus of smoothness.

Definition 7 (Modulus of Smoothness (DeVore and Lorentz, 1993; Suzuki, 2019)).

Let $\Omega\subset\mathbb{R}^D$. For a function $f:\mathbb{R}^D\rightarrow\mathbb{R}$ in $L^p(\Omega)$ with $p>0$, the $r$-th modulus of smoothness of $f$ is defined by

$w_{r,p}(f,t)=\sup_{\|\mathbf{h}\|_{2}\leq t}\|\Delta_{\mathbf{h}}^{r}(f)\|_{L^{p}},\quad\mbox{where}$

$\Delta_{\mathbf{h}}^{r}(f)(\mathbf{x})=\begin{cases}\sum_{j=0}^{r}\binom{r}{j}(-1)^{r-j}f(\mathbf{x}+j\mathbf{h})&\mbox{ if }\mathbf{x}\in\Omega,\ \mathbf{x}+r\mathbf{h}\in\Omega,\\ 0&\mbox{ otherwise}.\end{cases}$
Definition 8 (Besov Space $B_{p,q}^{s}(\Omega)$).

For $0<p,q\leq\infty$, $s>0$ and $r=\lfloor s\rfloor+1$, define the seminorm $|\cdot|_{B_{p,q}^{s}}$ as

$|f|_{B_{p,q}^{s}(\Omega)}:=\begin{cases}\left(\displaystyle\int_{0}^{\infty}(t^{-s}w_{r,p}(f,t))^{q}\frac{dt}{t}\right)^{\frac{1}{q}}&\mbox{ if }q<\infty,\\ \sup_{t>0}t^{-s}w_{r,p}(f,t)&\mbox{ if }q=\infty.\end{cases}$

The norm of the Besov space $B_{p,q}^{s}(\Omega)$ is defined as $\|f\|_{B_{p,q}^{s}(\Omega)}:=\|f\|_{L^{p}(\Omega)}+|f|_{B_{p,q}^{s}(\Omega)}$. The Besov space is $B_{p,q}^{s}(\Omega)=\{f\in L^{p}(\Omega):\|f\|_{B_{p,q}^{s}}<\infty\}$.
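As a concrete special case (a standard fact stated informally here; it is not used in our proofs): for $0<s<1$ and $p=q=\infty$, one has $r=1$, and on a convex domain $\Omega$ the seminorm reduces to the Hölder seminorm, so $B^{s}_{\infty,\infty}(\Omega)$ coincides with the Hölder space $C^{0,s}(\Omega)$:

$|f|_{B^{s}_{\infty,\infty}(\Omega)}=\sup_{t>0}t^{-s}\sup_{\|\mathbf{h}\|_{2}\leq t}\|\Delta_{\mathbf{h}}^{1}(f)\|_{L^{\infty}}=\sup_{\mathbf{x}\neq\mathbf{y}\in\Omega}\frac{|f(\mathbf{x})-f(\mathbf{y})|}{\|\mathbf{x}-\mathbf{y}\|_{2}^{s}}.$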

We next define $B_{p,q}^{s}$ functions on $\mathcal{M}$ (Geller and Pesenson, 2011; Triebel, 1983, 1992).

Definition 9 ($B_{p,q}^{s}$ Functions on $\mathcal{M}$).

Let $\mathcal{M}$ be a compact smooth manifold of dimension $d$. Let $\{(U_i,\phi_i)\}_{i=1}^{C_{\mathcal{M}}}$ be a finite atlas on $\mathcal{M}$ and $\{\rho_i\}_{i=1}^{C_{\mathcal{M}}}$ be a partition of unity on $\mathcal{M}$ such that $\mathrm{supp}(\rho_i)\subset U_i$. A function $f:\mathcal{M}\to\mathbb{R}$ is in $B_{p,q}^{s}(\mathcal{M})$ if

$\|f\|_{B_{p,q}^{s}(\mathcal{M})}:=\sum_{i=1}^{C_{\mathcal{M}}}\|(f\rho_{i})\circ\phi_{i}^{-1}\|_{B_{p,q}^{s}(\mathbb{R}^{d})}<\infty.$ (1)

Since $\rho_i$ is supported on $U_i$, the function $(f\rho_i)\circ\phi_i^{-1}$ is supported on $\phi_i(U_i)$. We can extend $(f\rho_i)\circ\phi_i^{-1}$ from $\phi_i(U_i)$ to $\mathbb{R}^d$ by setting the function to be 0 on $\mathbb{R}^d\setminus\phi_i(U_i)$. The extended function lies in the Besov space $B^s_{p,q}(\mathbb{R}^d)$ (Triebel, 1992, Chapter 7).

2.3 Convolution and residual block

In this paper, we consider one-sided stride-one convolution in our network. Let $\mathcal{W}=\{\mathcal{W}_{j,k,l}\}\in\mathbb{R}^{C^{\prime}\times K\times C}$ be a filter, where $C^{\prime}$ is the output channel size, $K$ is the filter size and $C$ is the input channel size. For $z\in\mathbb{R}^{D\times C}$, the convolution of $\mathcal{W}$ with $z$ gives $y\in\mathbb{R}^{D\times C^{\prime}}$ such that

$y=\mathcal{W}*z,\quad y_{i,j}=\sum_{k=1}^{K}\sum_{l=1}^{C}\mathcal{W}_{j,k,l}z_{i+k-1,l},$ (2)

where $1\leq i\leq D$, $1\leq j\leq C^{\prime}$, and we set $z_{i+k-1,l}=0$ for $i+k-1>D$, as demonstrated in Figure 3(a).
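To make the indexing in (2) concrete, the following NumPy sketch (for illustration only; it is not part of the construction in our proofs) computes $\mathcal{W}*z$ by zero-padding $z$ on one side and contracting over the filter index $k$ and the input-channel index $l$.

```python
import numpy as np

def one_sided_conv(W, z):
    """One-sided stride-one convolution as in Eq. (2).

    W: filter of shape (C_out, K, C_in); z: input of shape (D, C_in).
    Returns y of shape (D, C_out) with y[i, j] = sum_{k, l} W[j, k, l] * z[i + k - 1, l],
    where rows of z beyond index D are treated as zero.
    """
    C_out, K, C_in = W.shape
    D = z.shape[0]
    z_pad = np.vstack([z, np.zeros((K - 1, C_in))])  # one-sided zero padding
    y = np.zeros((D, C_out))
    for i in range(D):
        window = z_pad[i:i + K, :]                   # K consecutive rows of z
        y[i] = np.einsum('kl,jkl->j', window, W)     # contract over k and l
    return y
```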

The building blocks of ConvResNets are residual blocks. For an input $\mathbf{x}$, each residual block computes

$\mathbf{x}+F(\mathbf{x}),$

where $F$ is a subnetwork consisting of convolutional layers (see more details in Section 3.1). A residual block is demonstrated in Figure 3(b).

Figure 3: (a) Demonstration of $\mathcal{W}*z$, where the input is $z\in\mathbb{R}^{D\times C}$ and the output is $\mathcal{W}*z\in\mathbb{R}^{D\times C^{\prime}}$. Here $\mathcal{W}=\{\mathcal{W}_{j,k,l}\}\in\mathbb{R}^{C^{\prime}\times K\times C}$ is a filter, where $C^{\prime}$ is the output channel size, $K$ is the filter size and $C$ is the input channel size. $\mathcal{W}_{j,:,:}$ is a $K\times C$ matrix for the $j$-th output channel. (b) Demonstration of a residual block.

3 Theory

In this section, we first introduce the ConvResNet architecture, and then present our main results.

3.1 Convolutional residual neural network

We study the ConvResNet with the rectified linear unit ($\mathrm{ReLU}$) activation function: $\mathrm{ReLU}(z)=\max(z,0)$. The ConvResNet we consider consists of a padding layer and several residual blocks, followed by a fully connected feedforward layer.

We first define the padding layer. Given an input $A\in\mathbb{R}^{D\times C_1}$, the network first applies a padding operator $P:\mathbb{R}^{D\times C_1}\rightarrow\mathbb{R}^{D\times C_2}$ for some integer $C_2\geq C_1$ such that

$Z=P(A)=\begin{bmatrix}A&{\bm{0}}&\cdots&{\bm{0}}\end{bmatrix}\in\mathbb{R}^{D\times C_{2}}.$

Then the matrix $Z$ is passed through $M$ residual blocks.

In the $m$-th block, let $\mathcal{W}_m=\{\mathcal{W}_m^{(1)},\dots,\mathcal{W}_m^{(L_m)}\}$ and $\mathcal{B}_m=\{B_m^{(1)},\dots,B_m^{(L_m)}\}$ be a collection of filters and biases. The $m$-th residual block maps a matrix from $\mathbb{R}^{D\times C}$ to $\mathbb{R}^{D\times C}$ by

$\mathrm{Conv}_{\mathcal{W}_{m},\mathcal{B}_{m}}+\mathrm{id},$

where $\mathrm{id}$ is the identity operator and

$\mathrm{Conv}_{\mathcal{W}_{m},\mathcal{B}_{m}}(Z)=\mathrm{ReLU}\Big(\mathcal{W}_{m}^{(L_{m})}*\cdots*\mathrm{ReLU}\left(\mathcal{W}_{m}^{(1)}*Z+B_{m}^{(1)}\right)\cdots+B_{m}^{(L_{m})}\Big),$ (3)

with $\mathrm{ReLU}$ applied entrywise. Denote

$Q(\mathbf{x})=\left(\mathrm{Conv}_{\mathcal{W}_{M},\mathcal{B}_{M}}+\mathrm{id}\right)\circ\cdots\circ\left(\mathrm{Conv}_{\mathcal{W}_{1},\mathcal{B}_{1}}+\mathrm{id}\right)\circ P(\mathbf{x}).$ (4)

For networks only consisting of residual blocks, we define the network class as

$\mathcal{C}^{\mathrm{Conv}}(M,L,J,K,\kappa)=\big\{Q\ \big|\ Q(\mathbf{x})\mbox{ is in the form of (4) with $M$ residual blocks, where each block has filter size bounded by $K$, number of channels bounded by $J$, }\max_{m}L_{m}\leq L,\ \max_{m,l}\|\mathcal{W}_{m}^{(l)}\|_{\infty}\vee\|B_{m}^{(l)}\|_{\infty}\leq\kappa\big\},$ (5)

where $\|\cdot\|_{\infty}$ denotes the $\ell^{\infty}$ norm of a vector, and for a tensor $\mathcal{W}$, $\|\mathcal{W}\|_{\infty}=\max_{j,k,l}|\mathcal{W}_{j,k,l}|$.

Based on the network $Q$ in (4), a ConvResNet has an additional fully connected layer and can be expressed as

$f(\mathbf{x})=WQ(\mathbf{x})+b,$ (6)

where $W$ and $b$ are the weight matrix and the bias in the fully connected layer. The class of ConvResNets is defined as

$\mathcal{C}(M,L,J,K,\kappa_{1},\kappa_{2},R)=\big\{f\ \big|\ f(\mathbf{x})=WQ(\mathbf{x})+b\mbox{ with }Q\in\mathcal{C}^{\mathrm{Conv}}(M,L,J,K,\kappa_{1}),\ \|W\|_{\infty}\vee|b|\leq\kappa_{2},\ \|f\|_{L^{\infty}}\leq R\big\}.$ (7)

When there is no restriction on the output, we omit the parameter $R$ and denote the network class by $\mathcal{C}(M,L,J,K,\kappa_{1},\kappa_{2})$.
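The following PyTorch sketch illustrates the architecture class in (6)-(7): a channel-padding layer $P$, $M$ residual blocks of one-sided convolutions as in (3), and a final fully connected layer. The class names and hyperparameters are placeholders for illustration; this is not the specific network constructed in our proofs, and the weight-magnitude and output constraints ($\kappa_1,\kappa_2,R$) are not enforced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """One residual block Conv_{W_m,B_m} + id from (3): a stack of one-sided,
    stride-one convolutions with ReLU, added to the identity map."""
    def __init__(self, channels, filter_size, depth):
        super().__init__()
        self.filter_size = filter_size
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, filter_size) for _ in range(depth)])

    def forward(self, z):                                # z: (batch, channels, D)
        h = z
        for conv in self.convs:
            h = F.pad(h, (0, self.filter_size - 1))      # one-sided zero padding keeps length D
            h = torch.relu(conv(h))
        return z + h                                     # skip connection

class ConvResNet(nn.Module):
    """Sketch of (6)-(7): channel padding P, M residual blocks, and a fully connected layer."""
    def __init__(self, D, in_channels=1, channels=16, filter_size=4,
                 num_blocks=3, block_depth=2):
        super().__init__()
        self.pad_channels = channels - in_channels       # the padding operator P
        self.blocks = nn.ModuleList(
            [ResidualBlock(channels, filter_size, block_depth) for _ in range(num_blocks)])
        self.fc = nn.Linear(D * channels, 1)             # f(x) = W Q(x) + b

    def forward(self, x):                                # x: (batch, in_channels, D)
        z = F.pad(x, (0, 0, 0, self.pad_channels))       # append zero channels
        for block in self.blocks:
            z = block(z)
        return self.fc(z.flatten(1))
```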

3.2 Approximation theory

Our approximation theory is based on the following assumptions on $\mathcal{M}$ and the target function $f^{*}:\mathcal{M}\rightarrow\mathbb{R}$.

Assumption 1.

$\mathcal{M}$ is a $d$-dimensional compact smooth Riemannian manifold isometrically embedded in $\mathbb{R}^D$. There is a constant $B$ such that for any $\mathbf{x}\in\mathcal{M}$, $\|\mathbf{x}\|_{\infty}\leq B$.

Assumption 2.

The reach of $\mathcal{M}$ is $\tau>0$.

Assumption 3.

Let $0<p,q\leq\infty$ and $d/p+1\leq s<\infty$. Assume $f^{*}\in B_{p,q}^{s}(\mathcal{M})$ and $\|f^{*}\|_{B_{p,q}^{s}(\mathcal{M})}\leq c_{0}$ for a constant $c_{0}>0$. Additionally, we assume $\|f^{*}\|_{L^{\infty}}\leq R$ for a constant $R>0$.

Assumption 3 implies that $f^{*}$ is Lipschitz continuous (Triebel, 1983, Section 2.7.1 Remark 2 and Section 3.3.1).

Our first result is the following universal approximation error of ConvResNets for Besov functions on \mathcal{M}.

Figure 4: The ConvResNet in Theorem 1 contains a padding layer, $M$ residual blocks, and a fully connected (FC) layer.
Theorem 1.

Assume Assumptions 1-3. For any $\varepsilon\in(0,1)$ and positive integer $K\in[2,D]$, there is a ConvResNet architecture $\mathcal{C}(M,L,J,K,\kappa_1,\kappa_2)$ such that, for any $f^{*}\in B_{p,q}^{s}(\mathcal{M})$, if the weight parameters of this ConvResNet are properly chosen, the network yields a function $\bar{f}\in\mathcal{C}(M,L,J,K,\kappa_1,\kappa_2)$ satisfying

$\|\bar{f}-f^{*}\|_{L^{\infty}}\leq\varepsilon.$ (8)

Such a network architecture has

$M=O\left(\varepsilon^{-d/s}\right),\ L=O(\log(1/\varepsilon)+D+\log D),\ J=O(D),\ \kappa_{1}=O(1),\ \log\kappa_{2}=O(\log^{2}(1/\varepsilon)).$ (9)

The constants hidden in $O(\cdot)$ depend on $d$, $s$, $\frac{2d}{sp-d}$, $p$, $q$, $c_0$, $\tau$ and the surface area of $\mathcal{M}$.

The architecture of the ConvResNet in Theorem 1 is illustrated in Figure 4. It has the following properties:

  • The network has a fixed filter size and a fixed number of channels.

  • There is no cardinality constraint.

  • The network size depends on the intrinsic dimension $d$, and only weakly depends on $D$.

Theorem 1 can be compared with Suzuki (2019) on the approximation theory for Besov functions in $\mathbb{R}^D$ by FNNs as follows: (1) To universally approximate Besov functions in $\mathbb{R}^D$ with $\varepsilon$ error, the FNN constructed in Suzuki (2019) requires $O(\log(1/\varepsilon))$ depth, $O(\varepsilon^{-D/s})$ width and $O(\varepsilon^{-D/s}\log(1/\varepsilon))$ nonzero parameters. By exploiting the manifold model, our network size depends on the intrinsic dimension $d$ and weakly depends on $D$. (2) The ConvResNet in Theorem 1 does not require any cardinality constraint, while such a constraint is needed in Suzuki (2019).
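To make this comparison concrete, consider a purely hypothetical setting with smoothness $s=2$, intrinsic dimension $d=4$, ambient dimension $D=1000$ and target accuracy $\varepsilon=0.1$. Then

$\varepsilon^{-d/s}=10^{2}\qquad\textrm{versus}\qquad\varepsilon^{-D/s}=10^{500},$

so the number of residual blocks in Theorem 1 scales with the former, while the width of the FNN in Suzuki (2019) scales with the latter.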

3.3 Statistical theory

We next consider binary classification on $\mathcal{M}$. For any $\mathbf{x}\in\mathcal{M}$, denote its label by $y\in\{-1,1\}$. The label $y$ follows the Bernoulli-type distribution

$\mathbb{P}(y=1|\mathbf{x})=\eta(\mathbf{x}),\quad \mathbb{P}(y=-1|\mathbf{x})=1-\eta(\mathbf{x})$ (10)

for some $\eta:\mathcal{M}\rightarrow[0,1]$.

We assume the following data model:

Assumption 4.

We are given i.i.d. samples $\{(\mathbf{x}_i,y_i)\}_{i=1}^n$, where $\mathbf{x}_i\in\mathcal{M}$ and the $y_i$'s are sampled according to (10).

In binary classification, a classifier $f$ predicts the label of $\mathbf{x}$ as $\mathop{\mathrm{sign}}(f(\mathbf{x}))$. To learn the optimal classifier, we consider the logistic loss $\phi(z)=\log(1+\exp(-z))$. The logistic risk $\mathcal{E}_{\phi}(f)$ of a classifier $f$ is defined as

$\mathcal{E}_{\phi}(f)=\mathbb{E}(\phi(yf(\mathbf{x}))).$ (11)

The minimizer of $\mathcal{E}_{\phi}(f)$ is denoted by $f_{\phi}^{*}$, which satisfies

$f_{\phi}^{*}(\mathbf{x})=\log\frac{\eta(\mathbf{x})}{1-\eta(\mathbf{x})}.$ (12)
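To see why (12) holds, one can minimize the conditional risk pointwise (a short calculation included for completeness): for a fixed $\mathbf{x}$,

$\mathbb{E}\big[\phi(yf(\mathbf{x}))\,\big|\,\mathbf{x}\big]=\eta(\mathbf{x})\log(1+e^{-f(\mathbf{x})})+(1-\eta(\mathbf{x}))\log(1+e^{f(\mathbf{x})}),$

and setting the derivative with respect to $f(\mathbf{x})$ to zero gives $e^{f(\mathbf{x})}=\frac{\eta(\mathbf{x})}{1-\eta(\mathbf{x})}$, which is (12).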

For any classifier $f$, we define its logistic excess risk as

$\mathcal{E}_{\phi}(f,f_{\phi}^{*})=\mathcal{E}_{\phi}(f)-\mathcal{E}_{\phi}(f_{\phi}^{*}).$ (13)

In this paper, we consider ConvResNets with the following architecture:

$\mathcal{C}^{(n)}=\big\{f\ \big|\ f=\bar{g}_{2}\circ\bar{h}\circ\bar{g}_{1}\circ\bar{\eta},\mbox{ where }\bar{\eta}\in\mathcal{C}^{\mathrm{Conv}}\left(M_{1},L_{1},J_{1},K,\kappa_{1}\right),\ \bar{g}_{1}\in\mathcal{C}^{\mathrm{Conv}}\left(1,4,8,1,\kappa_{2}\right),\ \bar{h}\in\mathcal{C}^{\mathrm{Conv}}\left(M_{2},L_{2},J_{2},1,\kappa_{1}\right),\ \bar{g}_{2}\in\mathcal{C}\left(1,3,8,1,\kappa_{3},1,R\right)\big\},$ (14)

where $M_1,M_2,L_1,L_2,J_1,J_2,K,\kappa_1,\kappa_2,\kappa_3,R$ are parameters to be determined.

The empirical classifier is learned by minimizing the empirical logistic risk:

$\widehat{f}_{\phi,n}=\mathop{\mathrm{argmin}}_{f\in\mathcal{C}^{(n)}}\frac{1}{n}\sum_{i=1}^{n}\phi(y_{i}f(\mathbf{x}_{i})).$ (15)
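As an illustration of the estimator in (15), the following PyTorch sketch evaluates the empirical logistic risk for labels $y_i\in\{-1,1\}$; the helper name `empirical_logistic_risk` and the commented training step are hypothetical, and any ConvResNet-type model of the form (14) could be plugged in for `model`.

```python
import torch
import torch.nn.functional as F

def empirical_logistic_risk(model, x, y):
    """Empirical logistic risk (15): (1/n) * sum_i log(1 + exp(-y_i * f(x_i))).

    model: a network mapping a batch of inputs to scalar scores;
    y: float tensor of labels in {-1.0, +1.0}.
    """
    scores = model(x).squeeze(-1)
    return F.softplus(-y * scores).mean()   # softplus(t) = log(1 + exp(t))

# A (hypothetical) training step minimizing (15) over the network parameters:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
# loss = empirical_logistic_risk(model, x_batch, y_batch)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```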

We establish an upper bound on the excess risk of $\widehat{f}_{\phi,n}$:

Theorem 2.

Assume Assumptions 1, 2 and 4. Assume $0<p,q\leq\infty$, $0<s<\infty$, $s\geq d/p+1$ and $\eta\in B_{p,q}^{s}(\mathcal{M})$ with $\|\eta\|_{B_{p,q}^{s}}\leq c_{0}$ for some constant $c_{0}$. For any $2\leq K\leq D$, we set

$M_{1}=O\left(n^{\frac{2d}{s+2(s\vee d)}}\right),\ M_{2}=O\left(n^{\frac{2s}{s+2(s\vee d)}}\right),\ L_{1}=O(\log(1/\varepsilon)+D+\log D),\ L_{2}=O(\log(1/\varepsilon)),$
$J_{1}=O(D),\ J_{2}=O(1),\ \kappa_{1}=O(1),\ \log\kappa_{2}=O(\log^{2}n),\ \kappa_{3}=O(\log n),\ R=O(\log n)$

for $\mathcal{C}^{(n)}$. Then

$\mathbb{E}(\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*}))\leq Cn^{-\frac{s}{2s+2(s\vee d)}}\log^{4}n$ (16)

for some constant $C$. Here $C$ is linear in $D\log D$ and additionally depends on $d,s,\frac{2d}{sp-d},p,q,c_0,\tau$ and the surface area of $\mathcal{M}$. The constants hidden in $O(\cdot)$ depend on $d,s,\frac{2d}{sp-d},p,q,c_0,\tau$ and the surface area of $\mathcal{M}$.

Theorem 2 shows that a properly designed ConvResNet gives rise to an empirical classifier whose excess risk converges at a fast rate, with an exponent depending on the intrinsic dimension $d$ instead of $D$.

Theorem 2 is proved in Appendix A. Each building block of $\mathcal{C}^{(n)}$ is constructed for the following purpose:

  • $\bar{g}_{1}\circ\bar{\eta}$ is designed to approximate a truncated $\eta$ on $\mathcal{M}$, which is realized by Theorem 1.

  • $\bar{g}_{2}\circ\bar{h}$ is designed to approximate a truncated univariate function $\log\frac{z}{1-z}$.

4 Proof of Theorem 1

We provide a proof sketch of Theorem 1 in this section. More technical details are deferred to Appendix C.

We prove Theorem 1 in the following four steps:

  1. Decompose $f^{*}=\sum_{i}f_{i}$ as a sum of locally supported functions according to the manifold structure.

  2. Locally approximate each $f_{i}$ using cardinal B-splines.

  3. Implement the cardinal B-splines using CNNs.

  4. Implement the sum of all CNNs by a ConvResNet for approximating $f^{*}$.

Step 1: Decomposition of $f^{*}$.
• Construct an atlas on $\mathcal{M}$.
Since the manifold $\mathcal{M}$ is compact, we can cover $\mathcal{M}$ by a finite collection of open balls $B_{\omega}(\mathbf{c}_i)$ for $i=1,\dots,C_{\mathcal{M}}$, where $\mathbf{c}_i$ is the center of the ball and $\omega$ is the radius to be chosen later. Accordingly, the manifold is partitioned as $\mathcal{M}=\bigcup_i U_i$ with $U_i=B_{\omega}(\mathbf{c}_i)\cap\mathcal{M}$. We choose $\omega<\tau/2$ such that $U_i$ is diffeomorphic to an open subset of $\mathbb{R}^d$ (Niyogi et al., 2008, Lemma 5.4). The total number of partitions is then bounded by $C_{\mathcal{M}}\leq\left\lceil\frac{{\rm SA}(\mathcal{M})}{\omega^{d}}T_{d}\right\rceil$, where ${\rm SA}(\mathcal{M})$ is the surface area of $\mathcal{M}$ and $T_d$ is the average number of $U_i$'s that contain a given point on $\mathcal{M}$ (Conway et al., 1987, Chapter 2, Equation (1)).

On each partition, we define a projection-based transformation $\phi_i$ as

$\phi_{i}(\mathbf{x})=a_{i}V_{i}^{\top}(\mathbf{x}-\mathbf{c}_{i})+\mathbf{b}_{i},$

where the scaling factor $a_i\in\mathbb{R}$ and the shifting vector $\mathbf{b}_i\in\mathbb{R}^d$ ensure $\phi_i(U_i)\subset[0,1]^d$, and the column vectors of $V_i\in\mathbb{R}^{D\times d}$ form an orthonormal basis of the tangent space $T_{\mathbf{c}_i}(\mathcal{M})$. The atlas on $\mathcal{M}$ is the collection $(U_i,\phi_i)$ for $i=1,\dots,C_{\mathcal{M}}$. See Figure 5 for a graphical illustration of the atlas.

Figure 5: An atlas given by covering $\mathcal{M}$ using Euclidean balls.

• Decompose $f^{*}$ according to the atlas. We decompose $f^{*}$ as

$f^{*}=\sum_{i=1}^{C_{\mathcal{M}}}f_{i}\quad\textrm{with}\quad f_{i}=f^{*}\rho_{i},$ (17)

where $\{\rho_i\}_{i=1}^{C_{\mathcal{M}}}$ is a $C^\infty$ partition of unity with $\mathrm{supp}(\rho_i)\subset U_i$. The existence of such a $\{\rho_i\}_{i=1}^{C_{\mathcal{M}}}$ is guaranteed by Proposition 1. As a result, each $f_i$ is supported on a subset of $U_i$, and therefore, we can rewrite (17) as

$f^{*}=\sum_{i=1}^{C_{\mathcal{M}}}(f_{i}\circ\phi_{i}^{-1})\circ\phi_{i}\times\mathds{1}_{U_{i}}\quad\textrm{with}\quad f_{i}=f^{*}\rho_{i},$ (18)

where $\mathds{1}_{U_i}$ is the indicator function of $U_i$. Since $\phi_i$ is a bijection between $U_i$ and $\phi_i(U_i)$, $f_i\circ\phi_i^{-1}$ is supported on $\phi_i(U_i)\subset[0,1]^d$. We extend $f_i\circ\phi_i^{-1}$ to $[0,1]^d\backslash\phi_i(U_i)$ by 0. The extended function is in $B_{p,q}^{s}([0,1]^d)$ (see Lemma 4 in Appendix C.1). This allows us to use cardinal B-splines to locally approximate each $f_i\circ\phi_i^{-1}$, as detailed in Step 2.

Step 2: Local cardinal B-spline approximation. We approximate $f_i\circ\phi_i^{-1}$ using cardinal B-splines $\widetilde{f}_i$ as

$f_{i}\circ\phi_{i}^{-1}\approx\widetilde{f}_{i}\equiv\sum_{j=1}^{N}\widetilde{f}_{i,j}\quad\textrm{with}\quad\widetilde{f}_{i,j}=\alpha^{(i)}_{k,\mathbf{j}}M_{k,\mathbf{j},m}^{d},$ (19)

where $\alpha^{(i)}_{k,\mathbf{j}}\in\mathbb{R}$ is a coefficient and $M_{k,\mathbf{j},m}^{d}:[0,1]^d\rightarrow\mathbb{R}$ denotes a cardinal B-spline with indices $k,m\in\mathbb{N}^{+}$, $\mathbf{j}\in\mathbb{R}^d$. Here $k$ is a scaling factor, $\mathbf{j}$ is a shifting vector, $m$ is the degree of the B-spline and $d$ is the dimension (see a formal definition in Appendix C.2).
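For orientation, one common form of such tensor-product cardinal B-splines (the formal definition we use is given in Appendix C.2 and may differ in details) is

$M_{k,\mathbf{j},m}^{d}(\mathbf{x})=\prod_{i=1}^{d}M_{m}(2^{k}x_{i}-j_{i}),$

where $M_m$ is the univariate cardinal B-spline of degree $m$, obtained by convolving the indicator $\mathds{1}_{[0,1]}$ with itself $m$ times.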

Since $s\geq d/p+1$ (by Assumption 3), setting $r=+\infty$, $m=\lceil s\rceil+1$ in Lemma 5 (see Appendix C.3) and applying Lemma 4 gives

$\left\|\widetilde{f}_{i}-f_{i}\circ\phi_{i}^{-1}\right\|_{L^{\infty}}\leq Cc_{0}N^{-s/d}$ (20)

for some constant $C$ depending on $s,p,q$ and $d$.

Combining (18) and (19), we approximate $f^{*}$ by

$\widetilde{f}^{*}\equiv\sum_{i=1}^{C_{\mathcal{M}}}\widetilde{f}_{i}\circ\phi_{i}\times\mathds{1}_{U_{i}}=\sum_{i=1}^{C_{\mathcal{M}}}\sum_{j=1}^{N}\widetilde{f}_{i,j}\circ\phi_{i}\times\mathds{1}_{U_{i}}.$ (21)

Such an approximation has error

$\|\widetilde{f}^{*}-f^{*}\|_{L^{\infty}}\leq CC_{\mathcal{M}}c_{0}N^{-s/d}.$

Step 3: Implement local approximations in Step 2 by CNNs. In Step 2, (21) gives a natural approximation of $f^{*}$. In the sequel, we aim to implement all ingredients of $\widetilde{f}_{i,j}\circ\phi_{i}\times\mathds{1}_{U_{i}}$ using CNNs. In particular, we show that CNNs can implement the cardinal B-spline $\widetilde{f}_{i,j}$, the linear projection $\phi_i$, the indicator function $\mathds{1}_{U_i}$, and the multiplication operation.

• Implement $\mathds{1}_{U_i}$ by CNNs. Recall our construction of $U_i$ in Step 1. For any $\mathbf{x}\in\mathcal{M}$, we have $\mathds{1}_{U_i}(\mathbf{x})=1$ if $d_i^2(\mathbf{x})=\|\mathbf{x}-\mathbf{c}_i\|_2^2\leq\omega^2$; otherwise $\mathds{1}_{U_i}(\mathbf{x})=0$.

To implement $\mathds{1}_{U_i}$, we rewrite it as the composition of a univariate indicator function $\mathds{1}_{[0,\omega^2]}$ and the squared distance function $d_i^2$:

$\mathds{1}_{U_{i}}(\mathbf{x})=\mathds{1}_{[0,\omega^{2}]}\circ d_{i}^{2}(\mathbf{x})\quad\textrm{for}\quad\mathbf{x}\in\mathcal{M}.$ (22)

We show that CNNs can efficiently implement both $\mathds{1}_{[0,\omega^2]}$ and $d_i^2$. Specifically, given $\theta\in(0,1)$ and $\Delta\geq 8DB^2\theta$, there exist CNNs that yield functions $\widetilde{\mathds{1}}_{\Delta}$ and $\widetilde{d}_i^2$ satisfying

$\|\widetilde{d}_{i}^{2}-d_{i}^{2}\|_{L^{\infty}}\leq 4B^{2}D\theta$ (23)

and

$\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2}(\mathbf{x})=\begin{cases}1,&\mbox{ if }\mathbf{x}\in U_{i},\ d_{i}^{2}(\mathbf{x})\leq\omega^{2}-\Delta,\\ 0,&\mbox{ if }\mathbf{x}\notin U_{i},\\ \mbox{between 0 and 1},&\mbox{ otherwise}.\end{cases}$ (24)

We also characterize the network sizes for realizing $\widetilde{\mathds{1}}_{\Delta}$ and $\widetilde{d}_i^2$: the network for $\widetilde{\mathds{1}}_{\Delta}$ has $O(\log(\omega^2/\Delta))$ layers, 2 channels and all weight parameters bounded by $\max(2,|\omega^2-4B^2D\theta|)$; the network for $\widetilde{d}_i^2$ has $O(\log(1/\theta)+D)$ layers, $6D$ channels and all weight parameters bounded by $4B^2$. More technical details are provided in Lemma 9 in Appendix C.6.
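As a simple illustration of why such a $\widetilde{\mathds{1}}_{\Delta}$ is cheap to realize with ReLU units (a sketch of the idea, not the exact construction in Lemma 9), the piecewise-linear surrogate of $\mathds{1}_{[0,\omega^2]}$ can be written with two ReLUs:

$\widetilde{\mathds{1}}_{\Delta}(a)=\frac{1}{\Delta}\Big(\mathrm{ReLU}(\omega^{2}-a)-\mathrm{ReLU}(\omega^{2}-\Delta-a)\Big),$

which equals 1 for $a\leq\omega^{2}-\Delta$, equals 0 for $a\geq\omega^{2}$, and interpolates linearly in between.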

• Implement $\widetilde{f}_{i,j}\circ\phi_i$ by CNNs. Since $\phi_i$ is a linear projection, it can be realized by a single-layer perceptron. By Lemma 8 (see Appendix C.5), this single-layer perceptron can be realized by a CNN, denoted by $\phi_i^{\rm CNN}$.

For $\widetilde{f}_{i,j}$, Proposition 3 (see Appendix C.8) shows that for any $\delta\in(0,1)$ and $2\leq K\leq d$, there exists a CNN $\widetilde{f}_{i,j}^{\rm CNN}\in\mathcal{F}^{\rm CNN}(L,J,K,\kappa,\kappa)$ with

$L=O\left(\log\frac{1}{\delta}\right),\ J=O(1),\ \kappa=O\left(\delta^{-(\log 2)(\frac{2d}{sp-d}+\frac{c_{1}}{d})}\right)$

such that when setting $N=C_1\delta^{-d/s}$, we have

$\Big\|\sum_{j=1}^{N}\widetilde{f}^{\rm CNN}_{i,j}-f_{i}\circ\phi_{i}^{-1}\Big\|_{L^{\infty}(\phi_{i}(U_{i}))}\leq\delta,$ (25)

where $C_1$ is a constant depending on $s,p,q$ and $d$. The constant hidden in $O(\cdot)$ depends on $d,s,\frac{2d}{sp-d},p,q,c_0$. The CNN class $\mathcal{F}^{\rm CNN}$ is defined in Appendix B.

• Implement the multiplication $\times$ by a CNN. According to Lemma 7 (see Appendix C.4) and Lemma 8, for any $\eta\in(0,1)$, the multiplication operation $\times$ can be approximated by a CNN $\widetilde{\times}$ with $L^\infty$ error $\eta$:

$\|a\times b-\widetilde{\times}(a,b)\|_{L^{\infty}}\leq\eta.$ (26)

Such a CNN has $O(\log 1/\eta)$ layers and 6 channels. All parameters are bounded by $\max(2c_0^2,1)$.
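A common route to such an approximate multiplication (following the construction of Yarotsky (2017); Lemma 7 may differ in details) is to reduce products to squares via

$a\times b=\tfrac{1}{2}\big((a+b)^{2}-a^{2}-b^{2}\big),$

and to approximate $z\mapsto z^{2}$ on a bounded interval by a depth-$O(\log 1/\eta)$ ReLU network built from sawtooth functions.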

Step 4: Implement $\widetilde{f}^{*}$ by a ConvResNet. We assemble all CNN approximations in Step 3 together and show that the whole approximation can be realized by a ConvResNet.

• Assemble all ingredients together. Assembling all CNN approximations together gives an approximation of $\widetilde{f}_{i,j}\circ\phi_{i}\times\mathds{1}_{U_{i}}$ as

$\mathring{f}_{i,j}\equiv\widetilde{\times}\left(\widetilde{f}_{i,j}^{\rm CNN}\circ\phi_{i}^{\rm CNN},\ \widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2}\right).$ (27)

After substituting (27) into (21), we approximate the target function $f^{*}$ by

$\mathring{f}\equiv\sum_{i=1}^{C_{\mathcal{M}}}\sum_{j=1}^{N}\mathring{f}_{i,j}.$ (28)

The approximation error of $\mathring{f}$ is analyzed in Lemma 12 (see Appendix C.9). According to Lemma 12, the approximation error can be bounded as follows:

$\|\mathring{f}-f^{*}\|_{L^{\infty}}\leq\sum_{i=1}^{C_{\mathcal{M}}}(A_{i,1}+A_{i,2}+A_{i,3})\quad\mbox{with}$

$A_{i,1}=\sum_{j=1}^{N}\Big\|\widetilde{\times}(\widetilde{f}_{i,j}^{\rm CNN}\circ\phi_{i}^{\rm CNN},\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2})-(\widetilde{f}_{i,j}^{\rm CNN}\circ\phi_{i}^{\rm CNN})\times(\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2})\Big\|_{L^{\infty}}\leq N\eta,$

$A_{i,2}=\Big\|\Big(\sum_{j=1}^{N}\big(\widetilde{f}_{i,j}^{\rm CNN}\circ\phi_{i}^{\rm CNN}\big)\Big)\times(\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2})-f_{i}\times(\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2})\Big\|_{L^{\infty}}\leq\delta,$

$A_{i,3}=\|f_{i}\times(\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2})-f_{i}\times\mathds{1}_{U_{i}}\|_{L^{\infty}}\leq\frac{c(\pi+1)}{\omega(1-\omega/\tau)}\Delta,$

where $\delta,\eta,\Delta$ and $\theta$ are defined in (25), (26), (24) and (23), respectively. For any $\varepsilon\in(0,1)$, with $\delta,\eta,\Delta$ and $\theta$ properly chosen as in (53) in Lemma 12, one has

$\|\mathring{f}-f^{*}\|_{L^{\infty}}\leq\varepsilon.$ (29)

With these choices, the network size of each CNN is quantified in Appendix C.10.

• Realize $\mathring{f}$ by a ConvResNet. Lemma 17 (see Appendix C.15) shows that for every $\mathring{f}_{i,j}$, there exists $\bar{f}^{\rm CNN}_{i,j}\in\mathcal{F}^{\rm CNN}(L,J,K,\kappa_1,\kappa_2)$ with $L=O(\log(1/\varepsilon)+D+\log D)$, $J=O(D)$, $\kappa_1=O(1)$, $\log\kappa_2=O\left(\log^2(1/\varepsilon)\right)$ such that $\bar{f}^{\rm CNN}_{i,j}(\mathbf{x})=\mathring{f}_{i,j}(\mathbf{x})$ for any $\mathbf{x}\in\mathcal{M}$. As a result, the function $\mathring{f}$ in (28) can be expressed as a sum of CNNs:

$\mathring{f}=\bar{f}^{\rm CNN}\equiv\sum_{i=1}^{C_{\mathcal{M}}}\sum_{j=1}^{N}\bar{f}^{\rm CNN}_{i,j},$ (30)

where $N$ is chosen of order $O\left(\varepsilon^{-d/s}\right)$ (see Proposition 3 and Lemma 12). Lemma 18 (see Appendix C.16) shows that $\bar{f}^{\rm CNN}$ can be realized by $\bar{f}\in\mathcal{C}(M,L,J,K,\kappa_1,\kappa_2)$ with

$M=O\left(\varepsilon^{-d/s}\right),\ L=O(\log(1/\varepsilon)+D+\log D),\ J=O(D),\ \kappa_{1}=O(1),\ \log\kappa_{2}=O\left(\log^{2}(1/\varepsilon)\right).$

5 Conclusion

Our results show that ConvResNets are adaptive to low-dimensional geometric structures of data sets. Specifically, we establish a universal approximation theory of ConvResNets for Besov functions on a $d$-dimensional manifold $\mathcal{M}$. Our network size depends on the intrinsic dimension $d$ and only weakly depends on $D$. We also establish a statistical theory of ConvResNets for binary classification when the given data are located on $\mathcal{M}$. The classifier is learned by minimizing the empirical logistic loss. We prove that if the ConvResNet architecture is properly chosen, the excess risk of the learned classifier decays at a fast rate depending on the intrinsic dimension of the manifold.

Our ConvResNet has many practical properties: it has a fixed filter size and a fixed number of channels. Moreover, it does not require any cardinality constraint, which is beneficial to training.

Our analysis can be extended to multinomial logistic regression for multi-class classification. In this case, the network outputs a vector where each component represents the likelihood of an input belonging to a certain class. By assuming that each likelihood function is in the Besov space, we can apply our analysis to approximate each function by a ConvResNet.

References

  • Alipanahi et al. (2015) Alipanahi, B., Delong, A., Weirauch, M. T. and Frey, B. J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33 831–838.
  • Barron (1993) Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39 930–945.
  • Blalock et al. (2020) Blalock, D., Ortiz, J. J. G., Frankle, J. and Guttag, J. (2020). What is the state of neural network pruning? arXiv preprint arXiv:2003.03033.
  • Chen et al. (2017) Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. and Yuille, A. L. (2017). DeepLAB: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40 834–848.
  • Chen et al. (2019a) Chen, M., Jiang, H., Liao, W. and Zhao, T. (2019a). Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. In Advances in Neural Information Processing Systems.
  • Chen et al. (2019b) Chen, M., Jiang, H., Liao, W. and Zhao, T. (2019b). Nonparametric regression on low-dimensional manifolds using deep ReLU networks. arXiv preprint arXiv:1908.01842.
  • Chen et al. (2020) Chen, M., Liu, H., Liao, W. and Zhao, T. (2020). Doubly robust off-policy learning on low-dimensional manifolds by deep neural networks. arXiv preprint arXiv:2011.01797.
  • Chui and Mhaskar (2018) Chui, C. K. and Mhaskar, H. N. (2018). Deep nets for local manifold learning. Frontiers in Applied Mathematics and Statistics, 4 12.
  • Cloninger and Klock (2020) Cloninger, A. and Klock, T. (2020). ReLU nets adapt to intrinsic dimensionality beyond the target domain. arXiv preprint arXiv:2008.02545.
  • Conway et al. (1987) Conway, J. H., Sloane, N. J. A. and Bannai, E. (1987). Sphere-packings, Lattices, and Groups. Springer-Verlag, Berlin, Heidelberg.
  • Cybenko (1989) Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2 303–314.
  • DeVore and Lorentz (1993) DeVore, R. A. and Lorentz, G. G. (1993). Constructive Approximation, vol. 303. Springer Science & Business Media.
  • DeVore and Popov (1988) DeVore, R. A. and Popov, V. A. (1988). Interpolation of Besov spaces. Transactions of the American Mathematical Society, 305 397–414.
  • Dispa (2003) Dispa, S. (2003). Intrinsic characterizations of Besov spaces on lipschitz domains. Mathematische Nachrichten, 260 21–33.
  • Dũng (2011) Dũng, D. (2011). Optimal adaptive sampling recovery. Advances in Computational Mathematics, 34 1–41.
  • Fang et al. (2020) Fang, Z., Feng, H., Huang, S. and Zhou, D.-X. (2020). Theory of deep convolutional neural networks II: Spherical analysis. Neural Networks, 131 154–162.
  • Federer (1959) Federer, H. (1959). Curvature measures. Transactions of the American Mathematical Society, 93 418–491.
  • Geer and van de Geer (2000) Geer, S. A. and van de Geer, S. (2000). Empirical Processes in M-estimation, vol. 6. Cambridge University press.
  • Geller and Pesenson (2011) Geller, D. and Pesenson, I. Z. (2011). Band-limited localized parseval frames and Besov spaces on compact homogeneous manifolds. Journal of Geometric Analysis, 21 334–371.
  • Girshick (2015) Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision.
  • Gong et al. (2019) Gong, S., Boddeti, V. N. and Jain, A. K. (2019). On the intrinsic dimensionality of image representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Graves et al. (2013) Graves, A., Mohamed, A.-r. and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE.
  • Hamers and Kohler (2006) Hamers, M. and Kohler, M. (2006). Nonasymptotic bounds on the $L_2$ error of neural network regression estimates. Annals of the Institute of Statistical Mathematics, 58 131–151.
  • Han et al. (2016) Han, S., Pool, J., Narang, S., Mao, H., Gong, E., Tang, S., Elsen, E., Vajda, P., Paluri, M., Tran, J. et al. (2016). Dsd: Dense-sparse-dense training for deep neural networks. arXiv preprint arXiv:1607.04381.
  • Han et al. (2015) Han, S., Pool, J., Tran, J. and Dally, W. J. (2015). Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626.
  • He et al. (2016) He, K., Zhang, X., Ren, S. and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Hinton and Salakhutdinov (2006) Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313 504–507.
  • Hu et al. (2020) Hu, T., Shang, Z. and Cheng, G. (2020). Sharp rate of convergence for deep neural network classifiers under the teacher-student setting. arXiv preprint arXiv:2001.06892.
  • Huang et al. (2018) Huang, F., Ash, J., Langford, J. and Schapire, R. (2018). Learning deep resnet blocks sequentially using boosting theory. In International Conference on Machine Learning.
  • Jaffard et al. (2001) Jaffard, S., Meyer, Y. and Ryan, R. D. (2001). Wavelets: tools for science and technology. SIAM.
  • Jiang et al. (2017) Jiang, F., Jiang, Y., Zhi, H., Dong, Y., Li, H., Ma, S., Wang, Y., Dong, Q., Shen, H. and Wang, Y. (2017). Artificial intelligence in healthcare: past, present and future. Stroke and vascular neurology, 2 230–243.
  • Kim et al. (2018) Kim, Y., Ohn, I. and Kim, D. (2018). Fast convergence rates of deep neural networks for classification. arXiv preprint arXiv:1812.03599.
  • Kohler and Krzyżak (2005) Kohler, M. and Krzyżak, A. (2005). Adaptive regression estimation with multilayer feedforward neural networks. Nonparametric Statistics, 17 891–913.
  • Kohler et al. (2020) Kohler, M., Krzyzak, A. and Walter, B. (2020). On the rate of convergence of image classifiers based on convolutional neural networks. arXiv preprint arXiv:2003.01526.
  • Kohler and Langer (2020) Kohler, M. and Langer, S. (2020). Statistical theory for image classification using deep convolutional neural networks with cross-entropy loss. arXiv preprint arXiv:2011.13602.
  • Kohler and Mehnert (2011) Kohler, M. and Mehnert, J. (2011). Analysis of the rate of convergence of least squares neural network regression estimates in case of measurement errors. Neural Networks, 24 273–279.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
  • LeCun et al. (1989) LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1 541–551.
  • Lee et al. (2017) Lee, H., Ge, R., Ma, T., Risteski, A. and Arora, S. (2017). On the ability of neural nets to express distributions. In Conference on Learning Theory.
  • Lee (2006) Lee, J. M. (2006). Riemannian manifolds: an introduction to curvature, vol. 176. Springer Science & Business Media.
  • Long et al. (2015) Long, J., Shelhamer, E. and Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Lu et al. (2017) Lu, Z., Pu, H., Wang, F., Hu, Z. and Wang, L. (2017). The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems.
  • McCaffrey and Gallant (1994) McCaffrey, D. F. and Gallant, A. R. (1994). Convergence rates for single hidden layer feedforward networks. Neural Networks, 7 147–158.
  • Mhaskar and Micchelli (1992) Mhaskar, H. N. and Micchelli, C. A. (1992). Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied mathematics, 13 350–373.
  • Miotto et al. (2018) Miotto, R., Wang, F., Wang, S., Jiang, X. and Dudley, J. T. (2018). Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics, 19 1236–1246.
  • Montanelli and Yang (2020) Montanelli, H. and Yang, H. (2020). Error bounds for deep ReLU networks using the Kolmogorov–Arnold superposition theorem. Neural Networks, 129 1–6.
  • Nakada and Imaizumi (2019) Nakada, R. and Imaizumi, M. (2019). Adaptive approximation and estimation of deep neural network with intrinsic dimensionality. arXiv preprint arXiv:1907.02177.
  • Nitanda and Suzuki (2018) Nitanda, A. and Suzuki, T. (2018). Functional gradient boosting based on residual network perception. In International Conference on Machine Learning.
  • Niyogi et al. (2008) Niyogi, P., Smale, S. and Weinberger, S. (2008). Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39 419–441.
  • Ohn and Kim (2019) Ohn, I. and Kim, Y. (2019). Smooth function approximation by deep neural networks with general activation functions. Entropy, 21 627.
  • Oono and Suzuki (2019) Oono, K. and Suzuki, T. (2019). Approximation and non-parametric estimation of ResNet-type convolutional neural networks. In International Conference on Machine Learning.
  • Osher et al. (2017) Osher, S., Shi, Z. and Zhu, W. (2017). Low dimensional manifold model for image processing. SIAM Journal on Imaging Sciences, 10 1669–1690.
  • Park (2009) Park, C. (2009). Convergence rates of generalization errors for margin-based classification. Journal of Statistical Planning and Inference, 139 2543–2551.
  • Petersen and Voigtlaender (2020) Petersen, P. and Voigtlaender, F. (2020). Equivalence of approximation by convolutional neural networks and fully-connected networks. Proceedings of the American Mathematical Society, 148 1567–1581.
  • Schmidt-Hieber (2017) Schmidt-Hieber, J. (2017). Nonparametric regression using deep neural networks with ReLU activation function. arXiv preprint arXiv:1708.06633.
  • Schmidt-Hieber (2019) Schmidt-Hieber, J. (2019). Deep ReLU network approximation of functions on a manifold. arXiv preprint arXiv:1908.00695.
  • Sermanet et al. (2013) Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R. and LeCun, Y. (2013). Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
  • Shaham et al. (2018) Shaham, U., Cloninger, A. and Coifman, R. R. (2018). Provable approximation properties for deep neural networks. Applied and Computational Harmonic Analysis, 44 537–557.
  • Shen and Wong (1994) Shen, X. and Wong, W. H. (1994). Convergence rate of sieve estimates. The Annals of Statistics 580–615.
  • Simonyan and Zisserman (2014) Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Spivak (1970) Spivak, M. D. (1970). A comprehensive introduction to differential geometry. Publish or Perish.
  • Suzuki (2019) Suzuki, T. (2019). Adaptivity of deep ReLU network for learning in besov and mixed smooth besov spaces: optimal rate and curse of dimensionality. In International Conference on Learning Representations.
  • Suzuki and Nitanda (2019) Suzuki, T. and Nitanda, A. (2019). Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic besov space. arXiv preprint arXiv:1910.12799.
  • Tenenbaum et al. (2000) Tenenbaum, J. B., De Silva, V. and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290 2319–2323.
  • Triebel (1983) Triebel, H. (1983). Theory of Function Spaces. Modern Birkhäuser Classics, Birkhäuser Basel.
  • Triebel (1992) Triebel, H. (1992). Theory of function spaces II. Monographs in Mathematics, Birkhäuser Basel.
  • Tu (2010) Tu, L. (2010). An Introduction to Manifolds. Universitext, Springer New York.
  • Wu et al. (2016) Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K. et al. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Yarotsky (2017) Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks, 94 103–114.
  • Young et al. (2018) Young, T., Hazarika, D., Poria, S. and Cambria, E. (2018). Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13 55–75.
  • Zagoruyko and Komodakis (2016) Zagoruyko, S. and Komodakis, N. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
  • Zhang et al. (2020) Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Zhang, Z., Lin, H., Sun, Y., He, T., Mueller, J., Manmatha, R. et al. (2020). ResNeSt: Split-attention networks. arXiv preprint arXiv:2004.08955.
  • Zhou (2020a) Zhou, D.-X. (2020a). Theory of deep convolutional neural networks: Downsampling. Neural Networks, 124 319–327.
  • Zhou (2020b) Zhou, D.-X. (2020b). Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48 787–794.
  • Zhou and Troyanskaya (2015) Zhou, J. and Troyanskaya, O. G. (2015). Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 12 931–934.

Supplementary Materials for Besov Function Approximation and Binary Classification on
Low-Dimensional Manifolds Using Convolutional Residual Networks

Notations: Throughout our proofs, we define the following notations: For two functions f:Ωf:\Omega\rightarrow\mathbb{R} and g:Ωg:\Omega\rightarrow\mathbb{R} defined on some domain Ω\Omega, we denote fgf\lesssim g if there is a constant CC such that f(𝐱)Cg(𝐱)f(\mathbf{x})\leq Cg(\mathbf{x}) for all 𝐱Ω\mathbf{x}\in\Omega. Similarly, we denote fgf\gtrsim g if there is a constant CC such that f(𝐱)Cg(𝐱)f(\mathbf{x})\geq Cg(\mathbf{x}) for all 𝐱Ω\mathbf{x}\in\Omega. We denote fgf\asymp g if fgf\lesssim g and fgf\gtrsim g. We use \mathbb{N} to denote the set of all nonnegative integers. For a real number aa, we denote a+=max(a,0)a_{+}=\max(a,0) and a=min(a,0)a_{-}=\min(a,0).

The proof of Theorem 1 is sketched in Section 4. In this supplementary material, we prove Theorem 2 in Section A. We define convolutional neural network and multi-layer perceptron classes in Section B; based on these classes, the lemmas used in Section 4 are proved in Section C. The lemmas used in Section A are proved in Section D.

Appendix A Proof of Theorem 2

A.1 Basic definitions and tools

We first define the bracketing entropy and covering number which are used in the proof of Theorem 2.

Definition 10 (Bracketing entropy).

A set of function pairs {(fiL,fiU)}i=1N\{(f_{i}^{L},f_{i}^{U})\}_{i=1}^{N} is called a δ\delta-bracketing of a function class \mathcal{F} with respect to the norm \|\cdot\| if for any ii, fiUfiLδ\|f_{i}^{U}-f_{i}^{L}\|\leq\delta and for any ff\in\mathcal{F}, there exists a pair (fiL,fiU)(f_{i}^{L},f_{i}^{U}) such that fiLffiUf_{i}^{L}\leq f\leq f_{i}^{U}. The δ\delta-bracketing number is defined as the cardinality of the minimal δ\delta-bracketing set and is denoted by 𝒩B(δ,,)\mathcal{N}_{B}(\delta,\mathcal{F},\|\cdot\|). The δ\delta-bracketing entropy, denoted by B(δ,,)\mathcal{H}_{B}(\delta,\mathcal{F},\|\cdot\|), is defined as

B(δ,,)=log𝒩B(δ,,).\displaystyle\mathcal{H}_{B}(\delta,\mathcal{F},\|\cdot\|)=\log\mathcal{N}_{B}(\delta,\mathcal{F},\|\cdot\|).
Definition 11 (Covering number).

Let \mathcal{F} be a set with metric ρ\rho. A δ\delta-cover of \mathcal{F} is a set {f1,,fN}\{f_{1}^{*},...,f_{N}^{*}\}\subset\mathcal{F} such that for any ff\in\mathcal{F}, there exists fkf_{k}^{*} for some kk such that ρ(f,fk)δ\rho(f,f_{k}^{*})\leq\delta. The δ\delta-covering number of \mathcal{F} is defined as

𝒩(δ,,ρ)=inf{N: there exists a δcover {f1,,fN} of }.\displaystyle\mathcal{N}(\delta,\mathcal{F},\rho)=\inf\{N:\mbox{ there exists a }\delta-\mbox{cover }\{f_{1}^{*},...,f_{N}^{*}\}\mbox{ of }\mathcal{F}\}.

It has been shown (Geer and van de Geer, 2000, Lemma 2.1) that for any δ>0,p1\delta>0,p\geq 1,

B(δ,,Lp)log𝒩(δ/2,,L).\displaystyle\mathcal{H}_{B}(\delta,\mathcal{F},\|\cdot\|_{L^{p}})\leq\log\mathcal{N}(\delta/2,\mathcal{F},\|\cdot\|_{L^{\infty}}).
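As a purely illustrative aside (not part of our analysis), the following minimal Python sketch constructs a δ\delta-cover, in the sense of Definition 11, for the toy class of constant functions on [0,1][0,1] bounded by FF; two constant functions are within δ\delta in LL^{\infty} exactly when their constants are, so the covering number is of order F/δF/\delta. The names F and delta are arbitrary illustrative choices.

```python
import numpy as np

def cover_constant_class(F, delta):
    """Centers of a delta-cover (in L-infinity) of the class of constant
    functions {f_c : x -> c, |c| <= F}."""
    # constants spaced 2*delta apart cover [-F, F] within radius delta
    return np.arange(-F, F + 2 * delta, 2 * delta)

F, delta = 1.0, 0.1
centers = cover_constant_class(F, delta)
cs = np.linspace(-F, F, 1001)
# every |c| <= F is within delta of some center
assert np.all(np.min(np.abs(cs[:, None] - centers[None, :]), axis=1) <= delta + 1e-12)
print(len(centers), np.log(len(centers)))  # covering number ~ F/delta, and its log
```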

The proof of Theorem 2 relies on the following proposition which is a modified version of Kim et al. (2018, Theorem 5):

Proposition 2.

Let ϕ\phi be a surrogate loss function for binary classification. Let fϕ,ϕ(fn,fϕ)f_{\phi}^{*},\mathcal{E}_{\phi}(f_{n},f_{\phi}^{*}) be defined as in (12) and (13), respectively. Assume the following regularity conditions:

  1. (A1)

    ϕ\phi is Lipschitz: |ϕ(z1)ϕ(z2)|C1|z1z2||\phi(z_{1})-\phi(z_{2})|\leq C_{1}|z_{1}-z_{2}| for any z1,z2z_{1},z_{2} and some constant C1C_{1}.

  2. (A2)

    For a positive sequence an=O(na0)a_{n}=O(n^{-a_{0}}) for some a0>0a_{0}>0, there exists a sequence of function classes {n}n\{\mathcal{F}_{n}\}_{n\in\mathbb{N}} such that as nn\rightarrow\infty,

    ϕ(fn,fϕ)an\displaystyle\mathcal{E}_{\phi}(f_{n},f_{\phi}^{*})\leq a_{n}

    for some fnnf_{n}\in\mathcal{F}_{n}.

  3. (A3)

    There exists a sequence {Fn}n\{F_{n}\}_{n\in\mathbb{N}} with Fn1F_{n}\gtrsim 1 such that supfnfLFn\sup_{f\in\mathcal{F}_{n}}\|f\|_{L^{\infty}}\leq F_{n}.

  4. (A4)

    There exists a constant ν(0,1]\nu\in(0,1] such that for any fnf\in\mathcal{F}_{n} and any nn\in\mathbb{N},

    𝔼(ϕ(yf(𝐱))ϕ(yfϕ(𝐱)))2C2Fn2νeFn(ϕ(f,fϕ))ν\displaystyle\mathbb{E}\left(\phi(yf(\mathbf{x}))-\phi(yf_{\phi}^{*}(\mathbf{x}))\right)^{2}\leq C_{2}F_{n}^{2-\nu}e^{F_{n}}\left(\mathcal{E}_{\phi}(f,f_{\phi}^{*})\right)^{\nu}

    for some constant C2>0C_{2}>0 only depending on ϕ\phi and η\eta.

  5. (A5)

    For a positive constant C3>0C_{3}>0, there exists a sequence {δn}n\{\delta_{n}\}_{n\in\mathbb{N}} such that

    B(δn,n,L2)C3eFnn(δnFn)2ν\displaystyle\mathcal{H}_{B}(\delta_{n},\mathcal{F}_{n},\|\cdot\|_{L^{2}})\leq C_{3}e^{-F_{n}}n\left(\frac{\delta_{n}}{F_{n}}\right)^{2-\nu}

    for {n}n\{\mathcal{F}_{n}\}_{n\in\mathbb{N}} in (A2), {Fn}n\{F_{n}\}_{n\in\mathbb{N}} in (A3) and ν\nu in (A4).

Let ϵn2max(an,δn)\epsilon^{2}_{n}\asymp\max(a_{n},\delta_{n}). Then the empirical ϕ\phi-risk minimizer f^ϕ,n\widehat{f}_{\phi,n} over n\mathcal{F}_{n} satisfies

(ϕ(f^ϕ,n,fϕ)ϵn)C5exp(C4eFnn(ϵn2/(Fn))2ν)\displaystyle\mathbb{P}\left(\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*})\geq\epsilon_{n}\right)\leq C_{5}\exp\left(-C_{4}e^{-F_{n}}n\left(\epsilon_{n}^{2}/\left(F_{n}\right)\right)^{2-\nu}\right) (31)

for some constants C4,C5>0C_{4},C_{5}>0.

Proposition 2 is proved in Appendix D.1. In Proposition 2, condition (A1) requires the surrogate loss function ϕ\phi to be Lipschitz. This condition is satisfied in Theorem 2 since ϕ\phi is the logistic loss. (A2) is a condition on the bias of f^ϕ,n\widehat{f}_{\phi,n}: taking nn as the number of samples, (A2) requires the bias to decay in the order of O(na0)O(n^{-a_{0}}) for some a0a_{0}. (A3) requires all functions in the class n\mathcal{F}_{n} to be bounded. (A4) and (A5) are conditions related to the variance of f^ϕ,n\widehat{f}_{\phi,n}. Condition (A4) for the logistic loss can be verified using the following lemma:

Lemma 1 (Lemma 6.1 in Park (2009)).

Let ϕ\phi be the logistic loss. Given a function class \mathcal{F} which is uniformly bounded by FF, for any function ff\in\mathcal{F}, we have

𝔼[ϕ(yf)ϕ(yfϕ)]2CeFϕ(f,fϕ)\displaystyle\mathbb{E}\left[\phi(yf)-\phi(yf_{\phi}^{*})\right]^{2}\leq Ce^{F}\mathcal{E}_{\phi}(f,f_{\phi}^{*})

for some constant CC.

According to Lemma 1, (A4) is verified with ν=1\nu=1. Now we are ready to prove Theorem 2.

A.2 Proof of Theorem 2

Proof of Theorem 2.

The main idea of the proof is to construct a sequence of network architectures, depending on nn, such that Conditions (A1)-(A5) in Proposition 2 are satisfied. The excess risk is then derived from (31). In particular, we choose

n=𝒞(n),an=ns2s+2(sd)log2n,Fn=s2s+2(sd)logn,δn=ns2s+2(sd)log4n\displaystyle\mathcal{F}_{n}=\mathcal{C}^{(n)},a_{n}=n^{-\frac{s}{2s+2(s\vee d)}}\log^{2}n,F_{n}=\frac{s}{2s+2(s\vee d)}\log n,\delta_{n}=n^{-\frac{s}{2s+2(s\vee d)}}\log^{4}n (32)

where 𝒞(n)\mathcal{C}^{(n)} is the network architecture in Theorem 2.

We first prove the probability bound of ϕ(f^ϕ,n,fϕ)\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*}) by checking conditions (A1)-(A5) in Proposition 2. Note that ϕ\phi is the logistic loss which is Lipschitz continuous with Lipschitz constant 1. Thus (A1) is verified. According to Lemma 1, (A4) is verified with ν=1\nu=1. We next verify (A2), (A3) and (A5).

A truncation technique.

Recall that f=logη1ηf^{*}=\log\frac{\eta}{1-\eta}. As η\eta goes to 1 (resp. 0), ff^{*} goes to \infty (resp. -\infty). Note that (A3) requires the function class n\mathcal{F}_{n} to be bounded by FnF_{n}. To study the approximation error of n\mathcal{F}_{n} with respect to ff^{*}, we consider a truncated version of ff^{*} defined as

fϕ,n={Fn, if fϕ>Fn,fϕ, if FnfϕFn,Fn, if fϕ<Fn.\displaystyle f_{\phi,n}^{*}=\begin{cases}F_{n},&\mbox{ if }f_{\phi}^{*}>F_{n},\\ f_{\phi}^{*},&\mbox{ if }-F_{n}\leq f_{\phi}^{*}\leq F_{n},\\ -F_{n},&\mbox{ if }f_{\phi}^{*}<-F_{n}.\end{cases} (33)
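The truncation in (33) is simply a clipping of fϕf_{\phi}^{*} at level FnF_{n}. A minimal illustrative sketch is given below; the values of η\eta used there are hypothetical and the code is not part of the proof.

```python
import numpy as np

def f_phi_star(eta):
    # minimizer of the logistic risk: the logit of the regression function
    return np.log(eta / (1.0 - eta))

def truncate(values, F_n):
    # the truncated target in (33): clip to the interval [-F_n, F_n]
    return np.clip(values, -F_n, F_n)

eta = np.array([0.01, 0.3, 0.5, 0.9, 0.999])  # hypothetical values of eta(x)
print(truncate(f_phi_star(eta), F_n=2.0))
```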

Verification of (A2) and (A3).

The following lemma establishes the approximation of fϕ,nf_{\phi,n}^{*} by ConvResNets. It also bounds the covering number of the network class, which will be used to verify (A5).

Lemma 2.

Assume Assumptions 1 and 2 hold. Assume 0<p,q0<p,q\leq\infty, 0<s<0<s<\infty, sd/p+1s\geq d/p+1. For any ε(0,1)\varepsilon\in(0,1) and any KDK\leq D, there exists a ConvResNet architecture

𝒞(Fn)={f|\displaystyle\mathcal{C}^{(F_{n})}=\big{\{}f| f=g¯2h¯g¯1η¯ where η¯𝒞Conv(M1,L1,J1,K,κ1),g¯1𝒞Conv(1,4,8,1,κ2),\displaystyle f=\bar{g}_{2}\circ\bar{h}\circ\bar{g}_{1}\circ\bar{\eta}\mbox{ where }\bar{\eta}\in\mathcal{C}^{\mathrm{Conv}}\left(M_{1},L_{1},J_{1},K,\kappa_{1}\right),\ \bar{g}_{1}\in\mathcal{C}^{\mathrm{Conv}}\left(1,4,8,1,\kappa_{2}\right),
h¯𝒞Conv(M2,L2,J2,1,κ1),g¯2𝒞(1,3,8,1,κ3,1,R)}\displaystyle\bar{h}\in\mathcal{C}^{\mathrm{Conv}}\left(M_{2},L_{2},J_{2},1,\kappa_{1}\right),\ \bar{g}_{2}\in\mathcal{C}\left(1,3,8,1,\kappa_{3},1,R\right)\big{\}}

with

M1=O(εd/s),M2=O(eFnε1),L1=O(log(1/ε)+D+logD),L2=O(log(1/ε)),\displaystyle M_{1}=O\left(\varepsilon^{-d/s}\right),\ M_{2}=O\left(e^{-F_{n}}\varepsilon^{-1}\right),L_{1}=O(\log(1/\varepsilon)+D+\log D),\ L_{2}=O(\log(1/\varepsilon)),
J1=O(D),J2=O(1),κ1=O(1),logκ2=O(log2(1/ε)),κ3=O(log(Fn/ε)+Fn),R=Fn,\displaystyle J_{1}=O(D),\ J_{2}=O(1),\ \kappa_{1}=O(1),\ \log\kappa_{2}=O(\log^{2}(1/\varepsilon)),\ \kappa_{3}=O(\log(F_{n}/\varepsilon)+F_{n}),\ R=F_{n},

such that for any ηBp,qs()\eta\in B_{p,q}^{s}(\mathcal{M}) with ηBp,qs()c0\|\eta\|_{B_{p,q}^{s}(\mathcal{M})}\leq c_{0} for some constant c0c_{0}, and for fϕ,nf_{\phi,n}^{*} defined as in (33), there exists f¯ϕ,n𝒞(Fn)\bar{f}_{\phi,n}\in\mathcal{C}^{(F_{n})} with

f¯ϕ,nfϕ,nL4eFnε.\displaystyle\|\bar{f}_{\phi,n}-f_{\phi,n}^{*}\|_{L^{\infty}}\leq 4e^{F_{n}}\varepsilon.

Moreover, the logarithm of the covering number of 𝒞(Fn)\mathcal{C}^{(F_{n})} is bounded by

log𝒩(δ,𝒞(Fn),L)=O(D3ε(ds1)log(1/ε)(log2(1/ε)+logD+Fn+log(1/δ))).\displaystyle\log\mathcal{N}(\delta,\mathcal{C}^{(F_{n})},\|\cdot\|_{L^{\infty}})=O\left(D^{3}\varepsilon^{-\left(\frac{d}{s}\vee 1\right)}\log(1/\varepsilon)\left(\log^{2}(1/\varepsilon)+\log D+F_{n}+\log(1/\delta)\right)\right).

The constant hidden in O()O(\cdot) depends on d,s,2dspd,p,q,c0,τd,s,\frac{2d}{sp-d},p,q,c_{0},\tau and the surface area of \mathcal{M}.

Lemma 2 is proved in Section D.2. By Lemma 2, for any ε1(0,1)\varepsilon_{1}\in(0,1), once the network architecture 𝒞(Fn)\mathcal{C}^{(F_{n})} is fixed accordingly, there exists a ConvResNet f¯ϕ,n𝒞(Fn)\bar{f}_{\phi,n}\in\mathcal{C}^{(F_{n})} such that f¯ϕ,nfϕ,nL4eFnε1\|\bar{f}_{\phi,n}-f_{\phi,n}^{*}\|_{L^{\infty}}\leq 4e^{F_{n}}\varepsilon_{1}. In the following, we choose ε1=n2s2s+2(sd)logn\varepsilon_{1}=n^{-\frac{2s}{2s+2(s\vee d)}}\log n.

Next we check conditions (A2) and (A3) by estimating ϕ(f¯ϕ,n,fϕ)\mathcal{E}_{\phi}(\bar{f}_{\phi,n},f_{\phi}^{*}). Denote

An={𝐱:|fϕ|Fn},An={𝐱:|fϕ|>Fn}.\displaystyle A_{n}=\{\mathbf{x}\in\mathcal{M}:|f_{\phi}^{*}|\leq F_{n}\},\ A_{n}^{\complement}=\{\mathbf{x}\in\mathcal{M}:|f_{\phi}^{*}|>F_{n}\}.

We have

ϕ(f¯ϕ,n,fϕ)\displaystyle\mathcal{E}_{\phi}(\bar{f}_{\phi,n},f_{\phi}^{*}) =η(ϕ(f¯ϕ,n)ϕ(fϕ))+(1η)(ϕ(f¯ϕ,n)ϕ(fϕ))μ(d𝐱)\displaystyle=\int_{\mathcal{M}}\eta\left(\phi(\bar{f}_{\phi,n})-\phi(f_{\phi}^{*})\right)+(1-\eta)\left(\phi(-\bar{f}_{\phi,n})-\phi(-f_{\phi}^{*})\right)\mu(d\mathbf{x})
=Anη(ϕ(f¯ϕ,n)ϕ(fϕ,n))+(1η)(ϕ(f¯ϕ,n)ϕ(fϕ,n))μ(d𝐱)T1\displaystyle=\underbrace{\int_{A_{n}}\eta\left(\phi(\bar{f}_{\phi,n})-\phi(f_{\phi,n}^{*})\right)+(1-\eta)\left(\phi(-\bar{f}_{\phi,n})-\phi(-f_{\phi,n}^{*})\right)\mu(d\mathbf{x})}_{\rm T_{1}}
+Anη(ϕ(f¯ϕ,n)ϕ(fϕ))+(1η)(ϕ(f¯ϕ,n)ϕ(fϕ))μ(d𝐱)T2,\displaystyle\quad+\underbrace{\int_{A_{n}^{\complement}}\eta\left(\phi(\bar{f}_{\phi,n})-\phi(f_{\phi}^{*})\right)+(1-\eta)\left(\phi(-\bar{f}_{\phi,n})-\phi(-f_{\phi}^{*})\right)\mu(d\mathbf{x})}_{\rm T_{2}}, (34)

where we used fϕ=fϕ,nf_{\phi}^{*}=f_{\phi,n}^{*} on AnA_{n}. In (34), T1{\rm T_{1}} represents the approximation error of f¯ϕ,n\bar{f}_{\phi,n}, and T2{\rm T_{2}} is the truncation error. Since f¯ϕ,nfϕ,nL4eFnε1\|\bar{f}_{\phi,n}-f_{\phi,n}^{*}\|_{L^{\infty}}\leq 4e^{F_{n}}\varepsilon_{1},

T1\displaystyle{\rm T_{1}} Anη|ϕ(f¯ϕ,n)ϕ(fϕ,n)|+(1η)|ϕ(f¯ϕ,n)ϕ(fϕ,n)|μ(d𝐱)\displaystyle\leq\int_{A_{n}}\eta|\phi(\bar{f}_{\phi,n})-\phi(f_{\phi,n}^{*})|+(1-\eta)|\phi(-\bar{f}_{\phi,n})-\phi(-f_{\phi,n}^{*})|\mu(d\mathbf{x})
ϕ(f¯ϕ,n)ϕ(fϕ,n)L4eFnε1.\displaystyle\leq\|\phi(\bar{f}_{\phi,n})-\phi(f_{\phi,n}^{*})\|_{L^{\infty}}\leq 4e^{F_{n}}\varepsilon_{1}. (35)

A bound of T2{\rm T_{2}} is provided by the following lemma (see a proof in Appendix D.3):

Lemma 3.

Assume Assumptions 1 and 2 hold. Assume 0<p,q0<p,q\leq\infty, 0<s<0<s<\infty, sd/p+1s\geq d/p+1, ηBp,qs()\eta\in B_{p,q}^{s}(\mathcal{M}) with ηBp,qs()c0\|\eta\|_{B_{p,q}^{s}(\mathcal{M})}\leq c_{0} for some constant c0c_{0}. Let T2{\rm T_{2}} be defined as in (34). If 4eFnε1<14e^{F_{n}}\varepsilon_{1}<1, the following bound holds:

T28FneFn.\displaystyle{\rm T_{2}}\leq 8F_{n}e^{-F_{n}}. (36)

According to our choices of ε1\varepsilon_{1} and FnF_{n}, 4eFnε1<14e^{F_{n}}\varepsilon_{1}<1 is satisfied. Combining (35) and (36) gives

ϕ(f¯ϕ,n,fϕ)T1+T24eFnε1+8FneFn.\displaystyle\mathcal{E}_{\phi}(\bar{f}_{\phi,n},f_{\phi}^{*})\leq{\rm T_{1}}+{\rm T_{2}}\leq 4e^{F_{n}}\varepsilon_{1}+8F_{n}e^{-F_{n}}.

Substituting ε1=n2s2s+2(sd)logn,Fn=s2s+2(sd)logn\varepsilon_{1}=n^{-\frac{2s}{2s+2(s\vee d)}}\log n,F_{n}=\frac{s}{2s+2(s\vee d)}\log n gives

ϕ(f¯ϕ,n,fϕ)C6ns2s+2(sd)log2n\displaystyle\mathcal{E}_{\phi}(\bar{f}_{\phi,n},f_{\phi}^{*})\leq C_{6}n^{-\frac{s}{2s+2(s\vee d)}}\log^{2}n

and 𝒞(Fn)=𝒞(n)\mathcal{C}^{(F_{n})}=\mathcal{C}^{(n)}, where 𝒞(n)\mathcal{C}^{(n)} is defined in Theorem 2. Here C6C_{6} is a constant depending on ss and dd. Thus (A2) and (A3) are satisfied with an=ns2s+2(sd)log2n,Fn=s2s+2(sd)logna_{n}=n^{-\frac{s}{2s+2(s\vee d)}}\log^{2}n,F_{n}=\frac{s}{2s+2(s\vee d)}\log n.

Verification of (A5).

For (A5), since ν=1\nu=1, we only need to check that log𝒩(δn,𝒞(n),L)C3eFnnFn1δn\log\mathcal{N}(\delta_{n},\mathcal{C}^{(n)},\|\cdot\|_{L^{\infty}})\leq C_{3}e^{-F_{n}}nF_{n}^{-1}\delta_{n} for some constant C3C_{3}. According to Lemma 2 with our choices of ε1\varepsilon_{1} and FnF_{n}, we have

log𝒩(δ,𝒞(n),L)=O(D3n2(sd)2s+2(sd)log(n)(log2n+logn+logD+log(1/δ))).\displaystyle\log\mathcal{N}(\delta,\mathcal{C}^{(n)},\|\cdot\|_{L^{\infty}})=O\left(D^{3}n^{\frac{2(s\vee d)}{2s+2(s\vee d)}}\log(n)\left(\log^{2}n+\log n+\log D+\log(1/\delta)\right)\right).

Substituting our choice δn=ns2s+2(sd)log4n\delta_{n}=n^{-\frac{s}{2s+2(s\vee d)}}\log^{4}n gives rise to

log𝒩(δn,𝒞(n),L)=O((D3logD)n2(sd)2s+2(sd)log2n)C3nFn1eFnδn\displaystyle\log\mathcal{N}(\delta_{n},\mathcal{C}^{(n)},\|\cdot\|_{L^{\infty}})=O\left((D^{3}\log D)n^{\frac{2(s\vee d)}{2s+2(s\vee d)}}\log^{2}n\right)\leq C_{3}nF_{n}^{-1}e^{-F_{n}}\delta_{n} (37)

for some C3C_{3} depending on d,D3logD,s,dspd,p,q,c0,τd,D^{3}\log D,s,\frac{d}{sp-d},p,q,c_{0},\tau and the surface area of \mathcal{M}. Therefore (A5) is satisfied.

Estimate the excess risk.

Since (A1)-(A5) are satisfied, Proposition 2 gives

(ϕ(f^ϕ,n,fϕ)ϵn)C5exp(C42s+2(sd)sns+2(sd)2s+2(sd)ϵn2logn)\displaystyle\mathbb{P}\left(\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*})\geq\epsilon_{n}\right)\leq C_{5}\exp\left(-C_{4}\frac{2s+2(s\vee d)}{s}\frac{n^{\frac{s+2(s\vee d)}{2s+2(s\vee d)}}\epsilon^{2}_{n}}{\log n}\right) (38)

with ϵn2max(an,δn)=C7ns2s+2(sd)log4n\epsilon^{2}_{n}\asymp\max(a_{n},\delta_{n})=C_{7}n^{-\frac{s}{2s+2(s\vee d)}}\log^{4}n and f^ϕ,nn\widehat{f}_{\phi,n}\in\mathcal{F}_{n} being the minimizer of the empirical risk in (11). Here C7C_{7} is a constant depending on d,D,logD,s,dspd,p,q,c0,τd,D,\log D,s,\frac{d}{sp-d},p,q,c_{0},\tau and the surface area of \mathcal{M}.

Note that (A5) is also satisfied for any δnC7ns2s+2(sd)log4n\delta_{n}\geq C_{7}n^{-\frac{s}{2s+2(s\vee d)}}\log^{4}n. Thus

(ϕ(f^ϕ,n,fϕ)t)C5exp(C42s+2(sd)sns+2(sd)2s+2(sd)tlogn)\displaystyle\mathbb{P}\left(\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*})\geq t\right)\leq C_{5}\exp\left(-C_{4}\frac{2s+2(s\vee d)}{s}\frac{n^{\frac{s+2(s\vee d)}{2s+2(s\vee d)}}t}{\log n}\right) (39)

for any tC7ns2s+2(sd)log4nt\geq C_{7}n^{-\frac{s}{2s+2(s\vee d)}}\log^{4}n. Integrating (39), we estimate the expected excess risk as

𝔼(ϕ(f^ϕ,n,fϕ))\displaystyle\mathbb{E}(\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*})) =ϕ(f^ϕ,n,fϕ)μ(d𝐱)\displaystyle=\int_{\mathcal{M}}\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*})\mu(d\mathbf{x})
C7(ϕ(f^ϕ,n,fϕ)C7ns2s+2(sd)log4n)ns2s+2(sd)log4n\displaystyle\leq C_{7}\mathbb{P}\left(\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*})\leq C_{7}n^{-\frac{s}{2s+2(s\vee d)}}\log^{4}n\right)n^{-\frac{s}{2s+2(s\vee d)}}\log^{4}n
+C5C7ns2s+2(sd)log4nexp(C42s+2(sd)sns+2(sd)2s+2(sd)tlogn)𝑑t\displaystyle\quad+C_{5}\int_{C_{7}n^{-\frac{s}{2s+2(s\vee d)}}\log^{4}n}^{\infty}\exp\left(-C_{4}\frac{2s+2(s\vee d)}{s}\frac{n^{\frac{s+2(s\vee d)}{2s+2(s\vee d)}}t}{\log n}\right)dt
C8ns2s+2(sd)log4n\displaystyle\leq C_{8}n^{-\frac{s}{2s+2(s\vee d)}}\log^{4}n (40)

for some constants C7,C8C_{7},C_{8} depending on d,D,logD,s,2dspd,p,q,c0,τd,D,\log D,s,\frac{2d}{sp-d},p,q,c_{0},\tau and the surface area of \mathcal{M}. ∎
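To make the rate in (40) concrete, the following small numerical sketch (purely illustrative; the values of ss and dd are hypothetical) evaluates the exponent s2s+2(sd)\frac{s}{2s+2(s\vee d)}, which depends only on the smoothness ss and the intrinsic dimension dd; the ambient dimension DD enters the bound only through the constants and logarithmic factors.

```python
def excess_risk_exponent(s, d):
    # exponent in the bound E[excess risk] = O(n^{-s/(2s + 2(s v d))} log^4 n)
    return s / (2 * s + 2 * max(s, d))

# illustrative (hypothetical) smoothness / intrinsic-dimension pairs
for s, d in [(2, 4), (2, 16), (8, 4)]:
    print(f"s={s}, d={d}: excess risk ~ n^(-{excess_risk_exponent(s, d):.3f})")
```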

Appendix B Convolutional neural networks and multi-layer perceptrons

The proofs of the main results utilize properties of convolutional neural networks (CNNs) and multi-layer perceptrons (MLPs) with the ReLU activation. We consider CNNs in the form of

f(𝐱)=WConv𝒲,(𝐱)\displaystyle f(\mathbf{x})=W\cdot\mathrm{Conv}_{\mathcal{W},\mathcal{B}}(\mathbf{x}) (41)

where Conv𝒲,(Z)\mathrm{Conv}_{\mathcal{W},\mathcal{B}}(Z) is defined in (3), WW is the weight matrix of the fully connected layer, 𝒲,\mathcal{W},\mathcal{B} are sets of filters and biases, respectively. We define the class of CNNs as

CNN(L,J,K,κ1,κ2)={f|\displaystyle\mathcal{F}^{\rm CNN}(L,J,K,\kappa_{1},\kappa_{2})=\big{\{}f~{}| f(𝐱) in the form (41) with L layers.\displaystyle f(\mathbf{x})\textrm{ in the form \eqref{eq:convfCNN} with $L$ layers.} (42)
Each convolutional layer has filter size bounded by KK.
The number of channels of each layer is bounded by J.\displaystyle\mbox{The number of channels of each layer is bounded by $J$}.
maxl𝒲(l)B(l)κ1,Wκ2}.\displaystyle\max_{l}\|\mathcal{W}^{(l)}\|_{\infty}\vee\|B^{(l)}\|_{\infty}\leq\kappa_{1},\ \|W\|_{\infty}\leq\kappa_{2}\big{\}}.

For MLP, we consider the following form

f(𝐱)=WLReLU(WL1ReLU(W1𝐱+𝐛1)+𝐛L1)+𝐛L,\displaystyle f(\mathbf{x})=W_{L}\cdot\textrm{ReLU}(W_{L-1}\cdots\textrm{ReLU}(W_{1}\mathbf{x}+\mathbf{b}_{1})\cdots+\mathbf{b}_{L-1})+\mathbf{b}_{L}, (43)

where W1,,WLW_{1},\dots,W_{L} and 𝐛1,,𝐛L\mathbf{b}_{1},\dots,\mathbf{b}_{L} are weight matrices and bias vectors of proper sizes, respectively. The class of MLP is defined as

MLP(L,J,κ)={f|\displaystyle\mathcal{F}^{\rm MLP}(L,J,\kappa)=\big{\{}f~{}| f(𝐱) in the form (43) with L-layers and width bounded by J.\displaystyle f(\mathbf{x})\textrm{ in the form \eqref{eq:reluf} with $L$-layers and width bounded by $J$}. (44)
Wi,κ,𝐛iκfori=1,,L}.\displaystyle\left\lVert W_{i}\right\rVert_{\infty,\infty}\leq\kappa,\left\lVert\mathbf{b}_{i}\right\rVert_{\infty}\leq\kappa~{}\textrm{for}~{}i=1,\dots,L\big{\}}.

In some cases it is necessary to enforce the output of the MLP to be bounded. We define such a class as

MLP(L,J,κ,R)={f|f(𝐱)MLP(L,J,κ) and fR}.\displaystyle\mathcal{F}^{\rm MLP}(L,J,\kappa,R)=\left\{f~{}|f(\mathbf{x})\in\mathcal{F}^{\rm MLP}(L,J,\kappa)\mbox{ and }\|f\|_{\infty}\leq R\right\}.

In some cases we do not need the constraint on the output; we denote such an MLP class by MLP(L,J,κ)\mathcal{F}^{\rm MLP}(L,J,\kappa).
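For concreteness, a minimal numpy sketch of a forward pass in the form (43) is given below. The widths, depth and random weights are placeholders chosen for illustration only and do not correspond to any network constructed in the proofs.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases):
    """Forward pass of an MLP in the form (43):
    W_L ReLU(W_{L-1} ... ReLU(W_1 x + b_1) ... + b_{L-1}) + b_L."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

rng = np.random.default_rng(0)
D, J, L = 5, 8, 3                      # input dimension, width, depth (illustrative)
dims = [D] + [J] * (L - 1) + [1]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(L)]
biases = [rng.standard_normal(dims[i + 1]) for i in range(L)]
print(mlp_forward(rng.standard_normal(D), weights, biases))
```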

Appendix C Lemmas and proofs in Section 4

C.1 Lemma 4 and its proof

Lemma 4.

Define fi,ϕif_{i},\phi_{i} as in (17). We extend fiϕi1f_{i}\circ\phi_{i}^{-1} by 0 on [0,1]d\ϕi(Ui)[0,1]^{d}\backslash\phi_{i}(U_{i}) and denote the extended function by fiϕi1|[0,1]df_{i}\circ\phi_{i}^{-1}|_{[0,1]^{d}} . Under Assumption 3, we have fiϕi1|[0,1]dBp,qs([0,1]d)f_{i}\circ\phi_{i}^{-1}|_{[0,1]^{d}}\in B_{p,q}^{s}([0,1]^{d}) with

fiϕi1Bp,qs([0,1]d)<Cc0\|f_{i}\circ\phi_{i}^{-1}\|_{B_{p,q}^{s}([0,1]^{d})}<Cc_{0}

where CC is a constant depending on s,p,qs,p,q and dd.

To prove Lemma 4, we first give an equivalent definition of Besov functions:

Definition 12.

Let Ω\Omega be a Lipschitz domain in d\mathbb{R}^{d}. For 0<p,q0<p,q\leq\infty and s>0s>0, p,qs(Ω)\mathcal{B}_{p,q}^{s}(\Omega) is the set of functions

p,qs(Ω)={f:Ω|gBp,qs(d) with g|Ω=f},\displaystyle\mathcal{B}_{p,q}^{s}(\Omega)=\{f:\Omega\rightarrow\mathbb{R}|\exists g\in B_{p,q}^{s}(\mathbb{R}^{d})\mbox{ with }g|_{\Omega}=f\},

where g|Ωg|_{\Omega} denotes the restriction of gg on Ω\Omega. The norm is defined as fp,qs(Ω)=infggp,qs(d)\|f\|_{\mathcal{B}_{p,q}^{s}(\Omega)}=\inf_{g}\|g\|_{\mathcal{B}_{p,q}^{s}(\mathbb{R}^{d})}.

According to Dispa (2003, Theorem 3.18), for any Lipschitz domain Ωd\Omega\subset\mathbb{R}^{d}, the norm Bp,qs(Ω)\|\cdot\|_{B_{p,q}^{s}(\Omega)} in Definition 8 is equivalent to p,qs(Ω)\|\cdot\|_{\mathcal{B}_{p,q}^{s}(\Omega)} in Definition 12. Thus Bp,qs(Ω)=p,qs(Ω)B_{p,q}^{s}(\Omega)=\mathcal{B}_{p,q}^{s}(\Omega).

Proof of Lemma 4.

Since fBp,qs()f\in B_{p,q}^{s}(\mathcal{M}), according to Definition 9, fiϕi1Bp,qs(d)f_{i}\circ\phi_{i}^{-1}\in B_{p,q}^{s}(\mathbb{R}^{d}) in the sense of extending fiϕi1f_{i}\circ\phi_{i}^{-1} by zero on d\ϕi(Ui)\mathbb{R}^{d}\backslash\phi_{i}(U_{i}), see Triebel (1983, Section 3.2.3) for details. From Assumption 3, fiϕi1Bp,qs(d)<c0\|f_{i}\circ\phi_{i}^{-1}\|_{B_{p,q}^{s}(\mathbb{R}^{d})}<c_{0}. We next restrict fiϕi1f_{i}\circ\phi_{i}^{-1} on [0,1]d[0,1]^{d} and denote the restriction by fiϕi1|[0,1]df_{i}\circ\phi_{i}^{-1}|_{[0,1]^{d}}.

Using Definition 12 and Assumption 3, we have

fiϕi1|[0,1]dp,qs([0,1]d)fiϕi1Bp,qs(d)<c0,\|f_{i}\circ\phi_{i}^{-1}|_{[0,1]^{d}}\|_{\mathcal{B}_{p,q}^{s}([0,1]^{d})}\leq\|f_{i}\circ\phi_{i}^{-1}\|_{B_{p,q}^{s}(\mathbb{R}^{d})}<c_{0},

and we next show fiϕi1|[0,1]dBp,qs([0,1]d)f_{i}\circ\phi_{i}^{-1}|_{[0,1]^{d}}\in B_{p,q}^{s}([0,1]^{d}) with norm bounded by Cc0Cc_{0}. Since [0,1]d[0,1]^{d} is a Lipschitz domain, Dispa (2003, Theorem 3.18) implies Bp,qs([0,1]d)=p,qs([0,1]d)B_{p,q}^{s}([0,1]^{d})=\mathcal{B}_{p,q}^{s}([0,1]^{d}). Therefore, there exists a constant CC depending on s,p,qs,p,q and dd such that

fiϕi1|[0,1]dBp,qs([0,1]d)Cfiϕi1|[0,1]dp,qs([0,1]d)Cc0.\|f_{i}\circ\phi_{i}^{-1}|_{[0,1]^{d}}\|_{B_{p,q}^{s}([0,1]^{d})}\leq C\|f_{i}\circ\phi_{i}^{-1}|_{[0,1]^{d}}\|_{\mathcal{B}_{p,q}^{s}([0,1]^{d})}\leq Cc_{0}.

C.2 Cardinal B-splines

We give a brief introduction of cardinal B-splines.

Definition 13 (Cardinal B-spline).

Let ψ(x)=𝟙[0,1](x)\psi(x)=\mathds{1}_{[0,1]}(x) be the indicator function of [0,1][0,1]. The cardinal B-spline of order m is defined by taking m+1m+1-times convolution of ψ\psi:

ψm(x)=(ψψψm+1 times)(x)\displaystyle\psi_{m}(x)=(\underbrace{\psi\ast\psi\ast\cdots\ast\psi}_{m+1\mbox{ times}})(x)

where fg(x)f(xt)g(t)𝑑tf\ast g(x)\equiv\int f(x-t)g(t)dt.

Note that ψm\psi_{m} is a piecewise polynomial with degree mm and support [0,m+1][0,m+1]. It can be expressed as (Mhaskar and Micchelli, 1992)

ψm(x)=1m!j=0m+1(1)j(m+1j)(xj)+m.\displaystyle\psi_{m}(x)=\frac{1}{m!}\sum_{j=0}^{m+1}(-1)^{j}\binom{m+1}{j}(x-j)_{+}^{m}.

For any k,jk,j\in\mathbb{N}, let Mk,j,m(x)=ψm(2kxj)M_{k,j,m}(x)=\psi_{m}(2^{k}x-j), which is the rescaled and shifted cardinal B-spline with resolution 2k2^{-k} and support 2k[j,j+(m+1)]2^{-k}[j,j+(m+1)]. For 𝐤=(k1,,kd)d\mathbf{k}=(k_{1},\dots,k_{d})\in\mathbb{N}^{d} and 𝐣=(j1,,jd)d\mathbf{j}=(j_{1},\dots,j_{d})\in\mathbb{N}^{d}, we define the dd dimensional cardinal B-spline as M𝐤,𝐣,md(𝐱)=i=1dψm(2kixiji)M_{\mathbf{k},\mathbf{j},m}^{d}(\mathbf{x})=\prod_{i=1}^{d}\psi_{m}(2^{k_{i}}x_{i}-j_{i}). When k1==kd=kk_{1}=\ldots=k_{d}=k\in\mathbb{N}, we denote Mk,𝐣,md(𝐱)=i=1dψm(2kxiji)M_{k,\mathbf{j},m}^{d}(\mathbf{x})=\prod_{i=1}^{d}\psi_{m}(2^{k}x_{i}-j_{i}).
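As a sanity check of Definition 13 and the closed-form expression above, the following sketch (purely illustrative, not part of the proofs) evaluates ψm\psi_{m} via the closed form and compares it with a discretized (m+1)(m+1)-fold convolution of the indicator of [0,1][0,1]; the step size h is an arbitrary choice and the mismatch is the discretization error.

```python
import numpy as np
from math import comb, factorial

def bspline_closed_form(x, m):
    # psi_m(x) = (1/m!) * sum_{j=0}^{m+1} (-1)^j binom(m+1, j) (x - j)_+^m
    x = np.asarray(x, dtype=float)
    return sum((-1) ** j * comb(m + 1, j) * np.maximum(x - j, 0.0) ** m
               for j in range(m + 2)) / factorial(m)

m, h = 2, 1e-3
psi = np.ones(int(1 / h))              # indicator of [0, 1] sampled with step h
conv = psi.copy()
for _ in range(m):                     # m further convolutions give m + 1 factors of psi
    conv = np.convolve(conv, psi) * h  # Riemann-sum approximation of the integral
x = np.arange(len(conv)) * h
print(np.max(np.abs(conv - bspline_closed_form(x, m))))  # small error, of order h
```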

C.3 Lemma 5

For any mm\in\mathbb{N}, let J(k)={m,m+1,,2k1,2k}dJ(k)=\{-m,-m+1,\dots,2^{k}-1,2^{k}\}^{d} and the quasi-norm of the coefficient {αk,j}\{\alpha_{k,j}\} for k,𝐣J(k)k\in\mathbb{N},\mathbf{j}\in J(k) be

{αk,𝐣}bp,qs=(k[2k(sd/p)(𝐣J(k)|αk,𝐣|p)1/p]q)1/q.\displaystyle\|\{\alpha_{k,\mathbf{j}}\}\|_{b_{p,q}^{s}}=\left(\sum_{k\in\mathbb{N}}\left[2^{k(s-d/p)}\left(\sum_{\mathbf{j}\in J(k)}|\alpha_{k,\mathbf{j}}|^{p}\right)^{1/p}\right]^{q}\right)^{1/q}. (45)

The following lemma, resulted from DeVore and Popov (1988); Dũng (2011), gives an error bound for the approximation of functions in Bp,qs([0,1]d)B_{p,q}^{s}([0,1]^{d}) by cardinal B-splines.

Lemma 5 (Lemma 2 in Suzuki (2019); DeVore and Popov (1988); Dũng (2011)).

Assume that 0<p,q,r0<p,q,r\leq\infty and 0<s<0<s<\infty satisfying s>d(1/p1/r)+s>d(1/p-1/r)_{+}. Let mm\in\mathbb{N} be the order of the Cardinal B-spline basis such that 0<s<min(m,m1+1/p)0<s<\min(m,m-1+1/p). For any fBp,qs([0,1]d)f\in B_{p,q}^{s}([0,1]^{d}), there exists fNf_{N} satisfying

ffNLr([0,1]d)CNs/dfBp,qs([0,1]d)\displaystyle\|f-f_{N}\|_{L^{r}([0,1]^{d})}\leq CN^{-s/d}\|f\|_{B_{p,q}^{s}([0,1]^{d})}

for some constant CC with N1N\gg 1. fNf_{N} is in the form of

fN(𝐱)=k=0H𝐣J(k)αk,𝐣Mk,𝐣,md(𝐱)+k=H+1Hi=1nkαk,𝐣iMk,𝐣i,md(𝐱),\displaystyle f_{N}(\mathbf{x})=\sum_{k=0}^{H}\sum_{\mathbf{j}\in J(k)}\alpha_{k,\mathbf{j}}M_{k,\mathbf{j},m}^{d}(\mathbf{x})+\sum_{k=H+1}^{H^{*}}\sum_{i=1}^{n_{k}}\alpha_{k,\mathbf{j}_{i}}M_{k,\mathbf{j}_{i},m}^{d}(\mathbf{x}), (46)

where {𝐣i}i=1nkJ(k),H=c1log(N)/d,H=ν1log(λN)+H+1,nk=λN2ν(kH)\{\mathbf{j}_{i}\}_{i=1}^{n_{k}}\subset J(k),H=\lceil c_{1}\log(N)/d\rceil,H^{*}=\lceil\nu^{-1}\log(\lambda N)\rceil+H+1,n_{k}=\lceil\lambda N2^{-\nu(k-H)}\rceil for k=H+1,,H,u=d(1/p1/r)+k=H+1,\dots,H^{*},u=d(1/p-1/r)_{+} and ν=(su)/(2u)\nu=(s-u)/(2u). The real numbers c1>0c_{1}>0 and λ>0\lambda>0 are two absolute constants chosen to satisfy k=1H(2k+m)d+k=H+1HnkN\sum_{k=1}^{H}(2^{k}+m)^{d}+\sum_{k=H+1}^{H^{*}}n_{k}\leq N. Moreover, we can choose the coefficients {αk,𝐣}\{\alpha_{k,\mathbf{j}}\} such that

{αk,𝐣}bp,qsC1fBp,qs([0,1]d)\displaystyle\|\{\alpha_{k,\mathbf{j}}\}\|_{b_{p,q}^{s}}\leq C_{1}\|f\|_{B_{p,q}^{s}([0,1]^{d})}

for some constant C1C_{1}.

Lemma 6.

Let αk,𝐣(i)\alpha^{(i)}_{k,\mathbf{j}} be defined as in (19). Under Assumption 3, for any i,k,𝐣i,k,\mathbf{j}, we have

|αk,𝐣|Cc0N(log2)(ν1+c1d1)(d/ps)+\displaystyle|\alpha_{k,\mathbf{j}}|\leq Cc_{0}N^{(\log 2)(\nu^{-1}+c_{1}d^{-1})(d/p-s)_{+}} (47)

for some CC depending on (d/ps)+ν1,s(d/p-s)_{+}\nu^{-1},s and dd, where ν,c1\nu,c_{1} are defined in Lemma 5.

Proof of Lemma 6.

According to (45) and Lemma 5,

2k(sd/p)|αk,𝐣|{αk,𝐣}bp,qsC1fiϕi1Bp,qs([0,1]d).\displaystyle 2^{k(s-d/p)}|\alpha_{k,\mathbf{j}}|\leq\|\{\alpha_{k,\mathbf{j}}\}\|_{b_{p,q}^{s}}\leq C_{1}\|f_{i}\circ\phi_{i}^{-1}\|_{B_{p,q}^{s}([0,1]^{d})}.

Using Lemma 4 and since kHk\leq H^{*} (from Lemma 5), we have

|αk,𝐣(i)|\displaystyle|\alpha^{(i)}_{k,\mathbf{j}}| C12k(d/ps)+fiϕi1Bp,qs([0,1]d)C22H(d/ps)+c1c0\displaystyle\leq C_{1}2^{k(d/p-s)_{+}}\|f_{i}\circ\phi_{i}^{-1}\|_{B_{p,q}^{s}([0,1]^{d})}\leq C_{2}2^{H^{*}(d/p-s)_{+}}c_{1}c_{0} (48)

for some C2C_{2} depending on ss and dd. From the expression of HH^{*}, we can compute

2HC3N(log2)(ν1+c1d1)\displaystyle 2^{H^{*}}\leq C_{3}N^{(\log 2)(\nu^{-1}+c_{1}d^{-1})} (49)

for some C3C_{3} depending on (d/ps)+ν1(d/p-s)_{+}\nu^{-1}. Substituting (49) into (48) finishes the proof.

C.4 Lemma 7

Lemma 7 (Proposition 3 in Yarotsky (2017)).

For any C>0C>0 and 0<η<10<\eta<1, if |x|C,|y|C|x|\leq C,|y|\leq C, there is an MLP, denoted by ×~(,)\widetilde{\times}(\cdot,\cdot), such that

|×~(x,y)xy|<η,×~(x,0)=×~(0,y)=0.|\widetilde{\times}(x,y)-xy|<\eta,\ \widetilde{\times}(x,0)=\widetilde{\times}(0,y)=0.

Such a network has O(log1η)O\left(\log\frac{1}{\eta}\right) layers and parameters. The width of each layer is bounded by 6 and all parameters are bounded by C2C^{2}.
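Lemma 7 is based on Yarotsky's approximation of the square function by compositions of a ReLU "tooth" function, combined with the polarization identity xy=((x+y)2x2y2)/2xy=((x+y)^{2}-x^{2}-y^{2})/2. The following numpy sketch illustrates this idea as a plain function rather than an explicit network; the depth used below is an arbitrary illustrative choice, and the error decreases geometrically in it. As in the lemma, the sketch returns exactly 0 when one argument is 0.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tooth(z):
    # hat function g: [0, 1] -> [0, 1], expressible with three ReLUs
    return 2.0 * relu(z) - 4.0 * relu(z - 0.5) + 2.0 * relu(z - 1.0)

def approx_square(z, depth):
    # Yarotsky: z^2 = z - sum_{s >= 1} g^{(s)}(z) / 4^s on [0, 1], truncated at `depth`
    out, g = z, z
    for s in range(1, depth + 1):
        g = tooth(g)
        out = out - g / 4.0 ** s
    return out

def approx_mul(x, y, C=1.0, depth=20):
    # |x|, |y| <= C; rescale to [0, 1] and apply the polarization identity
    sq = lambda z: 4 * C * C * approx_square(np.abs(z) / (2 * C), depth)
    return 0.5 * (sq(x + y) - sq(x) - sq(y))

x, y = 0.37, -0.81
print(approx_mul(x, y), x * y)   # the two values agree to high accuracy
```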

C.5 Lemma 8

The following lemma is a special case of Oono and Suzuki (2019, Theorem 1). It shows that each MLP can be realized by a CNN:

Lemma 8 (Theorem 1 in Oono and Suzuki (2019)).

Let DD be the dimension of the input. Let L,JL,J be positive integers and κ>0\kappa>0. For any 2KD2\leq K^{\prime}\leq D, any MLP architecture MLP(L,J,κ)\mathcal{F}^{\rm MLP}(L,J,\kappa) can be realized by a CNN architecture CNN(L,J,K,κ1,κ2)\mathcal{F}^{\rm CNN}(L^{\prime},J^{\prime},K^{\prime},\kappa_{1}^{\prime},\kappa_{2}^{\prime}) with

L=L+D,J=4J,κ1=κ2=κ.L^{\prime}=L+D,J^{\prime}=4J,\kappa^{\prime}_{1}=\kappa^{\prime}_{2}=\kappa.

Specifically, any f¯MLPMLP(L,J,κ)\bar{f}^{\rm MLP}\in\mathcal{F}^{\rm MLP}(L,J,\kappa) can be realized by a CNN f¯CNNCNN(L,J,K,κ1,κ2)\bar{f}^{\rm CNN}\in\mathcal{F}^{\rm CNN}(L^{\prime},J^{\prime},K^{\prime},\kappa_{1}^{\prime},\kappa_{2}^{\prime}). Furthermore, the weight matrix in the fully connected layer of f¯CNN\bar{f}^{\rm CNN} has nonzero entries only in the first row.

C.6 Lemma 9 and its proof

Lemma 9.

Let di2d_{i}^{2} and 𝟙[0,ω2]\mathds{1}_{[0,\omega^{2}]} be defined as in (22). For any θ(0,1)\theta\in(0,1) and Δ8B2Dθ\Delta\geq 8B^{2}D\theta, there exists a CNN d~i2\widetilde{d}_{i}^{2} approximating di2d_{i}^{2} such that

d~i2di2L4B2Dθ,\|\widetilde{d}_{i}^{2}-d_{i}^{2}\|_{L^{\infty}}\leq 4B^{2}D\theta,

and a CNN 𝟙~Δ\widetilde{\mathds{1}}_{\Delta} approximating 𝟙[0,ω2]\mathds{1}_{[0,\omega^{2}]} with

𝟙~Δ(a)={1, if a(12k)(ω24B2Dθ),0, if aω24B2Dθ,2k(1(ω24B2Dθ)1a), otherwise.\displaystyle\widetilde{\mathds{1}}_{\Delta}(a)=\begin{cases}1,&\mbox{ if }a\leq(1-2^{-k})(\omega^{2}-4B^{2}D\theta),\\ 0,&\mbox{ if }a\geq\omega^{2}-4B^{2}D\theta,\\ 2^{k}(1-(\omega^{2}-4B^{2}D\theta)^{-1}a),&\mbox{ otherwise}.\end{cases}

where aa denotes the input of 𝟙~Δ\widetilde{\mathds{1}}_{\Delta} and k=log(ω2/Δ)k=\lceil\log(\omega^{2}/\Delta)\rceil. The CNN for d~i2\widetilde{d}_{i}^{2} has O(log(1/θ))O(\log(1/\theta)) layers, 6D6D channels and all weight parameters are bounded by 4B24B^{2}. The CNN for 𝟙~Δ\widetilde{\mathds{1}}_{\Delta} has log(ω2/Δ)\left\lceil\log(\omega^{2}/\Delta)\right\rceil layers, 22 channels. All weight parameters are bounded by max(2,|ω24B2Dθ|)\max(2,|\omega^{2}-4B^{2}D\theta|).

As a result, for any 𝐱\mathbf{x}\in\mathcal{M}, 𝟙~Δd~i2(𝐱)\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2}(\mathbf{x}) gives an approximation of 𝟙Ui\mathds{1}_{U_{i}} satisfying

𝟙~Δd~i2(𝐱)={1, if 𝐱Ui and di2(𝐱)ω2Δ;0, if 𝐱Ui; between 0 and 1, otherwise.\displaystyle\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2}(\mathbf{x})=\begin{cases}1,&\mbox{ if }\mathbf{x}\in U_{i}\mbox{ and }d_{i}^{2}(\mathbf{x})\leq\omega^{2}-\Delta;\\ 0,&\mbox{ if }\mathbf{x}\notin U_{i};\\ \mbox{ between 0 and 1},&\mbox{ otherwise}.\end{cases}
Proof.

We first show the existence of d~i2\widetilde{d}_{i}^{2}. Here di2(𝐱)d_{i}^{2}(\mathbf{x}) is the sum of DD univariate quadratic functions. Each quadratic function can be approximated by a multi-layer perceptron (MLP, see Appendix B for the definition) according to Lemma 7. Let h̊(x)\mathring{h}(x) be an MLP approximation of x2x^{2} for x[0,1]x\in[0,1] with error θ\theta, i.e., h̊(x)x2θ\|\mathring{h}(x)-x^{2}\|_{\infty}\leq\theta. We define

d̊i2(𝐱)=4B2j=1Dh̊(|xjci,j2B|)\displaystyle\mathring{d}_{i}^{2}(\mathbf{x})=4B^{2}\sum_{j=1}^{D}\mathring{h}\left(\left|\frac{x_{j}-c_{i,j}}{2B}\right|\right)

as an approximation of di2(𝐱)d_{i}^{2}(\mathbf{x}), which gives rise to the approximation error d̊i2di24B2Dθ\|\mathring{d}_{i}^{2}-d_{i}^{2}\|_{\infty}\leq 4B^{2}D\theta. Such an MLP has O(log1/θ)O(\log 1/\theta) layers, and width 6D6D. All weight parameters are bounded by 4B24B^{2}. According to Lemma 8, d̊i2\mathring{d}_{i}^{2} can be realized by a CNN, which is denoted by d~i2\widetilde{d}_{i}^{2}. Such a CNN has O(log1/θ)O(\log 1/\theta) layers, 6D6D channels. All weight parameters are bounded by 4B24B^{2}.

To show the existence of 𝟙~Δ\widetilde{\mathds{1}}_{\Delta}, we use the following function to approximate 𝟙[0,ω2]\mathds{1}_{[0,\omega^{2}]}:

𝟙~Δ(a)={1, if aω2Δ+4B2Dθ,0, if aω24B2Dθ,1Δ8B2Dθa+ω24B2DθΔ8B2Dθ, otherwise.\displaystyle\widetilde{\mathds{1}}_{\Delta}(a)=\begin{cases}1,&\mbox{ if }a\leq\omega^{2}-\Delta+4B^{2}D\theta,\\ 0,&\mbox{ if }a\geq\omega^{2}-4B^{2}D\theta,\\ -\frac{1}{\Delta-8B^{2}D\theta}a+\frac{\omega^{2}-4B^{2}D\theta}{\Delta-8B^{2}D\theta},&\mbox{ otherwise. }\end{cases}

We implement 𝟙~Δ(a)\widetilde{\mathds{1}}_{\Delta}(a) based on the basic step function defined as: g(a)=2ReLU(a0.5(ω24B2Dθ))2ReLU(aω2+4B2Dθ)g(a)=2\mathrm{ReLU}(a-0.5(\omega^{2}-4B^{2}D\theta))-2\mathrm{ReLU}(a-\omega^{2}+4B^{2}D\theta). Define

gk(a)\displaystyle g_{k}(a) =ggk(a)\displaystyle=\underbrace{g\circ\cdots\circ g}_{k}(a)
={0, if a(12k)(ω24B2Dθ),ω24B2Dθ, if aω24B2Dθ,2k(aω2+4B2Dθ)+ω24B2Dθ, otherwise.\displaystyle=\begin{cases}0,&\mbox{ if }a\leq(1-2^{-k})(\omega^{2}-4B^{2}D\theta),\\ \omega^{2}-4B^{2}D\theta,&\mbox{ if }a\geq\omega^{2}-4B^{2}D\theta,\\ 2^{k}(a-\omega^{2}+4B^{2}D\theta)+\omega^{2}-4B^{2}D\theta,&\mbox{ otherwise}.\end{cases}

We set 𝟙~Δ=1(ω24B2Dθ)1gk\widetilde{\mathds{1}}_{\Delta}=1-(\omega^{2}-4B^{2}D\theta)^{-1}g_{k} which can be realized by a CNN (according to Lemma 8). Such a CNN has kk layers, 22 channels. All weight parameters are bounded by max(2,|ω24B2Dθ|)\max(2,|\omega^{2}-4B^{2}D\theta|). The number of compositions kk is chosen to satisfy (12k)(ω24B2Dθ)ω2Δ+4B2Dθ(1-2^{-k})(\omega^{2}-4B^{2}D\theta)\geq\omega^{2}-\Delta+4B^{2}D\theta which gives k=log(ω2/Δ)k=\lceil\log(\omega^{2}/\Delta)\rceil.
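A small numerical illustration of this construction is given below; the values of ω2\omega^{2} and B2DθB^{2}D\theta are hypothetical, and the function is written in plain numpy rather than as a CNN.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def g(a, c):
    # basic step unit: 0 for a <= c/2, ramps from 0 to c on [c/2, c], then stays at c
    return 2.0 * relu(a - 0.5 * c) - 2.0 * relu(a - c)

def indicator_approx(a, c, k):
    # 1 - c^{-1} g_k(a): equals 1 for a <= (1 - 2^{-k}) c and 0 for a >= c
    out = a
    for _ in range(k):
        out = g(out, c)
    return 1.0 - out / c

omega2, B2Dtheta = 1.0, 0.01           # hypothetical values of omega^2 and B^2 * D * theta
c = omega2 - 4.0 * B2Dtheta
a = np.array([0.0, 0.5 * c, (1 - 2.0 ** -6) * c, c, 2.0])
print(indicator_approx(a, c, k=6))     # approximately [1, 1, 1, 0, 0]
```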

C.7 Lemma 10 and its proof

Lemma 10 shows that each cardinal B-spline can be approximated by a CNN with arbitrary accuracy. This lemma is used to prove Proposition 3.

Lemma 10.

Let kk be any number in \mathbb{N} and 𝐣\mathbf{j} be any element in d\mathbb{N}^{d}. There exists a constant CC depending only on dd and mm such that, for any ε(0,1)\varepsilon\in(0,1) and 2Kd2\leq K\leq d, there exists a CNN M~k,𝐣,mdCNN(L,J,K,κ,κ)\widetilde{M}_{k,\mathbf{j},m}^{d}\in\mathcal{F}^{\rm CNN}(L,J,K,\kappa,\kappa) with L=3+2log2(3mCε)+5log2(dm)+d,J=24dm(m+2)+8dL=3+2\lceil\log_{2}\left(\frac{3\vee m}{C\varepsilon}\right)+5\rceil\lceil\log_{2}(d\vee m)\rceil+d,J=24dm(m+2)+8d and κ=2(m+1)m2k\kappa=2(m+1)^{m}\vee 2^{k} such that for any kk\in\mathbb{N} and 𝐣d\mathbf{j}\in\mathbb{N}^{d},

Mk,𝐣,mdM~k,𝐣,mdL([0,1]d)ε,\displaystyle\|M_{k,\mathbf{j},m}^{d}-\widetilde{M}_{k,\mathbf{j},m}^{d}\|_{L^{\infty}([0,1]^{d})}\leq\varepsilon,

and M~k,𝐣,md(𝐱)=0\widetilde{M}_{k,\mathbf{j},m}^{d}(\mathbf{x})=0 for all 𝐱2k[0,m+1]d\mathbf{x}\notin 2^{-k}[0,m+1]^{d}.

The proof of Lemma 10 is based on the following lemma:

Lemma 11 (Lemma 1 in Suzuki (2019)).

Let kk be any number in \mathbb{N} and 𝐣\mathbf{j} be any element in d\mathbb{N}^{d}. There exists a constant CC depending only on dd and mm such that, for all ε>0\varepsilon>0, there exists an MLP M¯k,𝐣,mdMLP(L,J,κ,1)\bar{M}_{k,\mathbf{j},m}^{d}\in\mathcal{F}^{\rm MLP}(L,J,\kappa,1) with L=3+2log2(3mCε)+5log2(dm),J=6dm(m+2)+2dL=3+2\lceil\log_{2}\left(\frac{3\vee m}{C\varepsilon}\right)+5\rceil\lceil\log_{2}(d\vee m)\rceil,J=6dm(m+2)+2d and κ=2(m+1)m2k\kappa=2(m+1)^{m}\vee 2^{k} such that for any kk\in\mathbb{N} and 𝐣d\mathbf{j}\in\mathbb{N}^{d},

Mk,𝐣,mdM¯k,𝐣,mdL([0,1]d)ε,\displaystyle\|M_{k,\mathbf{j},m}^{d}-\bar{M}_{k,\mathbf{j},m}^{d}\|_{L^{\infty}([0,1]^{d})}\leq\varepsilon,

and M¯k,𝐣,md(𝐱)=0\bar{M}_{k,\mathbf{j},m}^{d}(\mathbf{x})=0 for all 𝐱2k[0,m+1]d\mathbf{x}\notin 2^{-k}[0,m+1]^{d}.

Proof of Lemma 10.

According to Lemma 11, there exists an MLP M¯k,𝐣,mdMLP(L,J,κ,1)\bar{M}_{k,\mathbf{j},m}^{d}\in\mathcal{F}^{\rm MLP}(L^{\prime},J^{\prime},\kappa^{\prime},1) with L=3+2log2(3mCε)+5log2(dm),J=6dm(m+2)+2dL^{\prime}=3+2\lceil\log_{2}\left(\frac{3\vee m}{C\varepsilon}\right)+5\rceil\lceil\log_{2}(d\vee m)\rceil,J^{\prime}=6dm(m+2)+2d and κ=2(m+1)m2k\kappa^{\prime}=2(m+1)^{m}\vee 2^{k} such that

Mk,𝐣,mdM¯k,𝐣,mdL([0,1]d)ε,\displaystyle\|M_{k,\mathbf{j},m}^{d}-\bar{M}_{k,\mathbf{j},m}^{d}\|_{L^{\infty}([0,1]^{d})}\leq\varepsilon,

and M¯k,𝐣,md(𝐱)=0\bar{M}_{k,\mathbf{j},m}^{d}(\mathbf{x})=0 for all 𝐱2k[0,m+1]d\mathbf{x}\notin 2^{-k}[0,m+1]^{d}.

Lemma 8 shows that such an MLP can be realized by a CNN M~k,𝐣,mdCNN(L,J,K,κ,κ)\widetilde{M}_{k,\mathbf{j},m}^{d}\in\mathcal{F}^{\rm CNN}(L,J,K,\kappa,\kappa). ∎

C.8 Proposition 3 and its proof

Proposition 3 shows that if NN and ε1\varepsilon_{1} are properly chosen, j=1Nf~i,jCNN\sum_{j=1}^{N}\widetilde{f}^{\rm CNN}_{i,j} can approximate fiϕi1f_{i}\circ\phi_{i}^{-1} with arbitrary accuracy.

Proposition 3.

Let fiϕi1f_{i}\circ\phi_{i}^{-1} be defined as in (18). For any δ(0,1)\delta\in(0,1), set N=C1δd/sN=C_{1}\delta^{-d/s}. Suppose Assumption 3. For any 2Kd2\leq K\leq d, there exists a set of CNNs {f~i,jCNN}j=1N\left\{\widetilde{f}_{i,j}^{\rm CNN}\right\}_{j=1}^{N} such that

j=1Nf~i,jCNNfiϕi1Lδ,\displaystyle\left\|\sum_{j=1}^{N}\widetilde{f}^{\rm CNN}_{i,j}-f_{i}\circ\phi_{i}^{-1}\right\|_{L^{\infty}}\leq\delta,

where C1C_{1} is a constant depending on s,p,qs,p,q and dd.

f~i,jCNN\widetilde{f}_{i,j}^{\rm CNN} is a CNN approximation of f~i,j\widetilde{f}_{i,j} (defined in (19)) and is in CNN(L,J,K,κ,κ)\mathcal{F}^{\rm CNN}(L,J,K,\kappa,\kappa) with

L=O(log(1/δ)),J=24d(s+1)(s+3)+8d,κ=O(δ(log2)(2dspd+c1d1)).\displaystyle\begin{aligned} &L=O\left(\log(1/\delta)\right),J=\lceil 24d(s+1)(s+3)+8d\rceil,\kappa=O\left(\delta^{-(\log 2)(\frac{2d}{sp-d}+c_{1}d^{-1})}\right).\end{aligned}

The constant hidden in O()O(\cdot) depends on d,s,2dspd,p,q,c0d,s,\frac{2d}{sp-d},p,q,c_{0}.

Proof of Proposition 3.

Based on the approximation (19), for each f~i,j\widetilde{f}_{i,j}, we construct CNN f~i,jCNN\widetilde{f}_{i,j}^{\rm CNN} to approximate it.

Note that f~i,j=αk,𝐣(i)Mk,𝐣,md\widetilde{f}_{i,j}=\alpha^{(i)}_{k,\mathbf{j}}M_{k,\mathbf{j},m}^{d} with some coefficient αk,𝐣(i)\alpha^{(i)}_{k,\mathbf{j}} and index k,𝐣,mk,\mathbf{j},m where Mk,𝐣,mdM_{k,\mathbf{j},m}^{d} is a dd-dimensional cardinal B-spline. Lemma 10 shows that Mk,𝐣,mdM_{k,\mathbf{j},m}^{d} can be approximated by a CNN M~k,𝐣,md\widetilde{M}_{k,\mathbf{j},m}^{d} with arbitrary accuracy. Therefore, f~i,j\widetilde{f}_{i,j} can be approximated by a CNN f~i,jCNN\widetilde{f}_{i,j}^{\rm CNN} with arbitrary accuracy. Assume M~k,𝐣,mdMk,𝐣,mdLε1\|\widetilde{M}_{k,\mathbf{j},m}^{d}-M_{k,\mathbf{j},m}^{d}\|_{L^{\infty}}\leq\varepsilon_{1} for some ε1(0,1)\varepsilon_{1}\in(0,1). Then f~i,jCNNCNN(L,J,K,κ,κ)\widetilde{f}_{i,j}^{\rm CNN}\in\mathcal{F}^{\rm CNN}(L,J,K,\kappa,\kappa) with

L=O(log(1/ε1)),J=24dm(m+2)+8d,κ=max(|αk,𝐣(i)|,2k)=O(N(log2)(ν1+c1d1)(1(d/ps)+)),\displaystyle\begin{aligned} &L=O\left(\log(1/\varepsilon_{1})\right),J=24dm(m+2)+8d,\kappa=\max\left(|\alpha^{(i)}_{k,\mathbf{j}}|,2^{k}\right)=O\left(N^{(\log 2)(\nu^{-1}+c_{1}d^{-1})(1\vee(d/p-s)_{+})}\right),\end{aligned} (50)

where the value of κ\kappa comes from Lemma 6 and (49).

The rest of the proof follows that of Suzuki (2019, Proposition 1), in which we show that, with properly chosen NN and ε1\varepsilon_{1}, j=1Nf~i,jCNN\sum_{j=1}^{N}\widetilde{f}^{\rm CNN}_{i,j} can approximate fiϕi1f_{i}\circ\phi_{i}^{-1} with arbitrary accuracy.

We decompose the error as

j=1Nf~i,jCNNfiϕi1Lf~ifiϕi1L+j=1Nf~i,jCNNf~iL.\displaystyle\left\|\sum_{j=1}^{N}\widetilde{f}^{\rm CNN}_{i,j}-f_{i}\circ\phi_{i}^{-1}\right\|_{L^{\infty}}\leq\left\|\widetilde{f}_{i}-f_{i}\circ\phi_{i}^{-1}\right\|_{L^{\infty}}+\left\|\sum_{j=1}^{N}\widetilde{f}^{\rm CNN}_{i,j}-\widetilde{f}_{i}\right\|_{L^{\infty}}. (51)

where f~i\widetilde{f}_{i} is defined in (19). We next derive an error bound for each term.

Let mm be the order of the cardinal B-spline basis. Set m=s+1m=\lceil s\rceil+1. According to (20) and Lemma 5,

f~ifiϕi1LCc0Ns/d,{αk,𝐣(i)}bp,qsC1fBp,qs,\displaystyle\left\|\widetilde{f}_{i}-f_{i}\circ\phi_{i}^{-1}\right\|_{L^{\infty}}\leq Cc_{0}N^{-s/d},\quad\|\{\alpha^{(i)}_{k,\mathbf{j}}\}\|_{b_{p,q}^{s}}\leq C_{1}\|f\|_{B_{p,q}^{s}}, (52)

for some constant CC depending on s,p,qs,p,q and dd, some universal constant C1C_{1} with {𝐣i}i=1nkJ(k),H=c1log(N)/d,H=ν1log(λN)+H+1,nk=λN2ν(kH)\{\mathbf{j}_{i}\}_{i=1}^{n_{k}}\subset J(k),\ H=\lceil c_{1}\log(N)/d\rceil,\ H^{*}=\lceil\nu^{-1}\log(\lambda N)\rceil+H+1,\ n_{k}=\lceil\lambda N2^{-\nu(k-H)}\rceil for k=H+1,,H,u=d(1/p1/r)+k=H+1,\dots,H^{*},\ u=d(1/p-1/r)_{+}, ν=(su)/(2u)\nu=(s-u)/(2u) and k=1H(2k+m)d+k=H+1HnkN\sum_{k=1}^{H}(2^{k}+m)^{d}+\sum_{k=H+1}^{H^{*}}n_{k}\leq N. By setting N=(δ2Cc0)d/sN=\left\lceil\left(\frac{\delta}{2Cc_{0}}\right)^{-d/s}\right\rceil, we have f~ifiϕi1δ/2\|\widetilde{f}_{i}-f_{i}\circ\phi_{i}^{-1}\|_{\infty}\leq\delta/2.

Next we consider the second term in (51). For any 𝐱[0,1]d\mathbf{x}\in[0,1]^{d}, we have

|j=1Nf~i,jCNN(𝐱)f~i(𝐱)|(k,𝐣)𝒮N|αk,𝐣(i)||Mk,𝐣,md(𝐱)M~k,𝐣,md(𝐱)|\displaystyle\left|\sum_{j=1}^{N}\widetilde{f}^{\rm CNN}_{i,j}(\mathbf{x})-\widetilde{f}_{i}(\mathbf{x})\right|\leq\sum_{(k,\mathbf{j})\in{\mathcal{S}}_{N}}|\alpha^{(i)}_{k,\mathbf{j}}||M_{k,\mathbf{j},m}^{d}(\mathbf{x})-\widetilde{M}_{k,\mathbf{j},m}^{d}(\mathbf{x})|
ε1(k,𝐣)𝒮N|αk,𝐣(i)|𝟏Mk,𝐣,md(𝐱)0ε1(m+1)d(1+H)2H(d/ps)+fBp,qs\displaystyle\leq\varepsilon_{1}\sum_{(k,\mathbf{j})\in{\mathcal{S}}_{N}}|\alpha^{(i)}_{k,\mathbf{j}}|{\bm{1}}_{M_{k,\mathbf{j},m}^{d}(\mathbf{x})\neq 0}\leq\varepsilon_{1}(m+1)^{d}(1+H^{*})2^{H^{*}(d/p-s)_{+}}\|f\|_{B_{p,q}^{s}}
ε1(m+1)d(1+log(λN)ν1+c1log(N)/d+3)(e3(λN)ν1Nc1/d)(log2)(d/ps)+fBp,qs\displaystyle\leq\varepsilon_{1}(m+1)^{d}\left(1+\log(\lambda N)\nu^{-1}+c_{1}\log(N)/d+3\right)\left(e^{3}(\lambda N)^{\nu^{-1}}N^{c_{1}/d}\right)^{(\log 2)(d/p-s)_{+}}\|f\|_{B_{p,q}^{s}}
C2c0log(N)N(log2)(ν1+c1d1)(d/ps)+ε1\displaystyle\leq C_{2}c_{0}\log(N)N^{(\log 2)(\nu^{-1}+c_{1}d^{-1})(d/p-s)_{+}}\varepsilon_{1}
C2c0log(2δ)(δ2)(log2)(ν1+c1d1)(d2spd)+ε1\displaystyle\leq C_{2}c_{0}\log\left(\frac{2}{\delta}\right)\left(\frac{\delta}{2}\right)^{-(\log 2)(\nu^{-1}+c_{1}d^{-1})\left(\frac{d^{2}}{sp}-d\right)_{+}}\varepsilon_{1}

with C2C_{2} being some constant depending on m,d,s,p,q,ν1m,d,s,p,q,\nu^{-1}. In the second inequality, 𝟏Mk,𝐣,md(𝐱)0=1{\bm{1}}_{M_{k,\mathbf{j},m}^{d}(\mathbf{x})\neq 0}=1 if Mk,𝐣,md(𝐱)0M_{k,\mathbf{j},m}^{d}(\mathbf{x})\neq 0 and equals 0 otherwise. The third inequality follows from the fact that for each kk, there are (m+1)d(m+1)^{d} basis functions which are non-zero at 𝐱\mathbf{x} and 2k(sd/p)|αk,𝐣(i)|{αk,𝐣(i)}bp,qsC1fBp,qs2^{k(s-d/p)}\left|\alpha^{(i)}_{k,\mathbf{j}}\right|\leq\left\|\{\alpha^{(i)}_{k,\mathbf{j}}\}\right\|_{b_{p,q}^{s}}\leq C_{1}\|f\|_{B_{p,q}^{s}}. In the fourth inequality we use H=log(λN)ν1+H+1H^{*}=\left\lceil\log(\lambda N)\nu^{-1}\right\rceil+H+1 and H=c1log(N)/dH=\left\lceil c_{1}\log(N)/d\right\rceil, and the last inequality follows from the choice N=(δ2Cc0)d/sN=\left\lceil\left(\frac{\delta}{2Cc_{0}}\right)^{-d/s}\right\rceil. Setting
ε1=1C2c0log(2/δ)(δ2)12+(log2)(ν1+c1d1)(d2spd)+\varepsilon_{1}=\frac{1}{C_{2}c_{0}\log\left(2/\delta\right)}\left(\frac{\delta}{2}\right)^{\frac{1}{2}+(\log 2)(\nu^{-1}+c_{1}d^{-1})\left(\frac{d^{2}}{sp}-d\right)_{+}} proves the error bound.

Under Assumption 3, sd/p+1s\geq d/p+1. Therefore (d2spd)+=0\left(\frac{d^{2}}{sp}-d\right)_{+}=0 and ν=spd2d\nu=\frac{sp-d}{2d}. Substituting these expressions into (50) gives the network architectures. ∎

C.9 Lemma 12

Lemma 12 estimates the approximation error of f̊\mathring{f}.

Lemma 12.

Let η\eta be the approximation error of the multiplication operator ×~(,)\widetilde{\times}(\cdot,\cdot), δ\delta be defined as in Proposition 3, Δ\Delta and θ\theta be defined as in Lemma 9. Assume NN is chosen according to Proposition 3. Then f̊fLi=1C(Ai,1+Ai,2+Ai,3)\|\mathring{f}-f^{*}\|_{L^{\infty}}\leq\sum_{i=1}^{C_{\mathcal{M}}}(A_{i,1}+A_{i,2}+A_{i,3}), where for each i=1,,Ci=1,...,C_{\mathcal{M}},

Ai,1=j=1N×~(f~i,jCNNϕiCNN,𝟙~Δd~i2)f~i,jCNNϕiCNN×(𝟙~Δd~i2)LCδd/sη,\displaystyle A_{i,1}=\sum_{j=1}^{N}\left\|\widetilde{\times}(\widetilde{f}_{i,j}^{\rm CNN}\circ\phi_{i}^{\rm CNN},\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2})-\widetilde{f}_{i,j}^{\rm CNN}\circ\phi_{i}^{\rm CNN}\times(\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2})\right\|_{L^{\infty}}\leq C\delta^{-d/s}\eta,
Ai,2=(j=1N(f~i,jCNNϕiCNN))×(𝟙~Δd~i2)fi×(𝟙~Δd~i2)Lδ,\displaystyle A_{i,2}=\left\|\left(\sum_{j=1}^{N}\left(\widetilde{f}_{i,j}^{\rm CNN}\circ\phi_{i}^{\rm CNN}\right)\right)\times(\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2})-f_{i}\times(\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2})\right\|_{L^{\infty}}\leq\delta,
Ai,3=fi×(𝟙~Δd~i2)fi×𝟙UiLc(π+1)ω(1ω/τ)Δ\displaystyle A_{i,3}=\|f_{i}\times(\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2})-f_{i}\times\mathds{1}_{U_{i}}\|_{L^{\infty}}\leq\frac{c(\pi+1)}{\omega(1-\omega/\tau)}\Delta

for some constant CC depending on d,s,p,qd,s,p,q and some constant cc. Furthermore, for any ε(0,1)\varepsilon\in(0,1), setting

δ=ε3C,η=1C(ε3C)ds+1,Δ=ω(1ω/τ)ε3c(π+1)C,θ=Δ16B2D\displaystyle\delta=\frac{\varepsilon}{3C_{\mathcal{M}}},\ \eta=\frac{1}{C}\left(\frac{\varepsilon}{3C_{\mathcal{M}}}\right)^{\frac{d}{s}+1},\Delta=\frac{\omega(1-\omega/\tau)\varepsilon}{3c(\pi+1)C_{\mathcal{M}}},\ \theta=\frac{\Delta}{16B^{2}D} (53)

gives rise to

f̊fLε.\|\mathring{f}-f^{*}\|_{L^{\infty}}\leq\varepsilon.

The choice in (53) satisfies the condition Δ>8B2Dθ\Delta>8B^{2}D\theta in Lemma 9.

Proof of Lemma 12.

In the error decomposition, Ai,1A_{i,1} measures the error from ×~\widetilde{\times}:

Ai,1=j=1N×~(f~i,jCNNϕiCNN,𝟙~Δd~i2)f~i,jCNNϕiCNN×(𝟙~Δd~i2)LNηCδd/sη,A_{i,1}=\sum_{j=1}^{N}\left\|\widetilde{\times}(\widetilde{f}_{i,j}^{\rm CNN}\circ\phi_{i}^{\rm CNN},\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2})-\widetilde{f}_{i,j}^{\rm CNN}\circ\phi_{i}^{\rm CNN}\times(\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2})\right\|_{L^{\infty}}\leq N\eta\leq C\delta^{-d/s}\eta,

for some constant CC depending on d,s,p,qd,s,p,q.

Ai,2A_{i,2} measures the error from CNN approximation of Besov functions. According to Proposition 3, Ai,2δA_{i,2}\leq\delta.

Ai,3A_{i,3} measures the error from CNN approximation of the chart determination function. The bound of Ai,3A_{i,3} can be derived using Chen et al. (2019b, Proof of Lemma 4.5) since fiϕi1f_{i}\circ\phi_{i}^{-1} is a Lipschitz function and its domain is in [0,1]d[0,1]^{d}. ∎

C.10 CNN size quantification of f̊i,j\mathring{f}_{i,j}

Let f̊i,j\mathring{f}_{i,j} be defined as in (28). Under the choices of δ,η,Δ,θ\delta,\eta,\Delta,\theta in Lemma 12, we quantify the size of each CNN in f̊i,j\mathring{f}_{i,j} as follows:

  • d~i2\widetilde{d}_{i}^{2} has O(log(1/ε)+D+logD)O(\log(1/\varepsilon)+D+\log D) layers, 6D6D channels and all weight parameters are bounded by 4B24B^{2}.

  • 𝟙~Δ\widetilde{\mathds{1}}_{\Delta} has O(log(1/ε))O(\log(1/\varepsilon)) layers with 22 channels. All weights are bounded by max(2,ω2)\max(2,\omega^{2}).

  • ×~\widetilde{\times} has O(log1/ε)O(\log 1/\varepsilon) layers with 66 channels. All weights are bounded by max(c02,1)\max(c_{0}^{2},1).

  • f~i,jCNN\widetilde{f}_{i,j}^{\rm CNN} has O(log1/ε)O(\log 1/\varepsilon) layers with 24d(s+1)(s+3)+8d\lceil 24d(s+1)(s+3)+8d\rceil channels. All weights are in the order of O(ε(log2)ds(2dspd+c1d1))O\left(\varepsilon^{-(\log 2)\frac{d}{s}(\frac{2d}{sp-d}+c_{1}d^{-1})}\right) where c1c_{1} is defined in Lemma 5.

  • ϕiCNN\phi_{i}^{\rm CNN} has 2+D2+D layers and dd channels. All weights are bounded by 2B2B.

In the above network architectures, the constant hidden in O()O(\cdot) depends on d,s,2dspd,p,q,c0,τd,s,\frac{2d}{sp-d},p,q,c_{0},\tau and the surface area of \mathcal{M}. In particular, the constant depends linearly on DlogDD\log D.

C.11 Lemma 13 and its proof

Lemma 13 shows that the composition of two CNNs can be realized by another CNN. Lemma 13 is used to prove Lemma 17.

Lemma 13.

Let 1CNN(L1,J1,K1,κ1,κ1)\mathcal{F}_{1}^{\rm CNN}(L_{1},J_{1},K_{1},\kappa_{1},\kappa_{1}) be a CNN architecture from D\mathbb{R}^{D}\rightarrow\mathbb{R} and 2CNN(L2,J2,K2,κ2,κ2)\mathcal{F}_{2}^{\rm CNN}(L_{2},J_{2},K_{2},\kappa_{2},\kappa_{2}) be a CNN architecture from \mathbb{R}\rightarrow\mathbb{R}. Assume the weight matrix in the fully connected layer of 1CNN(L1,J1,K1,κ1,κ1)\mathcal{F}_{1}^{\rm CNN}(L_{1},J_{1},K_{1},\kappa_{1},\kappa_{1}) and 2CNN(L2,J2,K2,κ2,κ2)\mathcal{F}_{2}^{\rm CNN}(L_{2},J_{2},K_{2},\kappa_{2},\kappa_{2}) has nonzero entries only in the first row. Then there exists a CNN architecture CNN(L,J,K,κ,κ)\mathcal{F}^{\rm CNN}(L,J,K,\kappa,\kappa) from D\mathbb{R}^{D}\rightarrow\mathbb{R} with

L=L1+L2,J=max(J1,J2),K=max(K1,K2),κ=max(κ1,κ2)\displaystyle L=L_{1}+L_{2},\ J=\max(J_{1},J_{2}),\ K=\max(K_{1},K_{2}),\kappa=\max(\kappa_{1},\kappa_{2})

such that for any f1CNN(L1,J1,K1,κ1,κ1)f_{1}\in\mathcal{F}^{\rm CNN}(L_{1},J_{1},K_{1},\kappa_{1},\kappa_{1}) and f2CNN(L2,J2,K2,κ2,κ2)f_{2}\in\mathcal{F}^{\rm CNN}(L_{2},J_{2},K_{2},\kappa_{2},\kappa_{2}), there exists fCNN(L,J,K,κ,κ)f\in\mathcal{F}^{\rm CNN}(L,J,K,\kappa,\kappa) such that f(𝐱)=f2f1(𝐱)f(\mathbf{x})=f_{2}\circ f_{1}(\mathbf{x}). Furthermore, the weight matrix in the fully connected layer of CNN(L,J,K,κ,κ)\mathcal{F}^{\rm CNN}(L,J,K,\kappa,\kappa) has nonzero entries only in the first row.

In Lemma 13 and the following lemmas, the subscripts of CNN\mathcal{F}^{\rm CNN} are used to distinguish different network architectures.

Proof of Lemma 13.

Compared to a CNN, directly composing f1f_{1} and f2f_{2} gives a network with an additional intermediate fully connected layer. In our network construction, we will design two convolutional layers to replace and realize this fully connected layer.

Denote f1f_{1} and f2f_{2} by

f1(𝐱)=W1Conv𝒲1,1(𝐱) and f2(𝐱)=W2Conv𝒲2,2(𝐱).f_{1}(\mathbf{x})=W_{1}\cdot\mathrm{Conv}_{\mathcal{W}_{1},\mathcal{B}_{1}}(\mathbf{x})\mbox{ and }f_{2}(\mathbf{x})=W_{2}\cdot\mathrm{Conv}_{\mathcal{W}_{2},\mathcal{B}_{2}}(\mathbf{x}).

where 𝒲1={𝒲1(l)}i=1L1,1={B1(l)}l=1L1,𝒲2={𝒲2(l)}i=1L2,2={B2(l)}l=1L2,\mathcal{W}_{1}=\left\{\mathcal{W}_{1}^{(l)}\right\}_{i=1}^{L_{1}},\mathcal{B}_{1}=\left\{B_{1}^{(l)}\right\}_{l=1}^{L_{1}},\mathcal{W}_{2}=\left\{\mathcal{W}_{2}^{(l)}\right\}_{i=1}^{L_{2}},\mathcal{B}_{2}=\left\{B_{2}^{(l)}\right\}_{l=1}^{L_{2}}, are sets of filters and biases and Conv𝒲1,1,Conv𝒲2,2\mathrm{Conv}_{\mathcal{W}_{1},\mathcal{B}_{1}},\mathrm{Conv}_{\mathcal{W}_{2},\mathcal{B}_{2}} are defined in (3). In the rest of this proof, we will choose proper weight parameters in 𝒲,\mathcal{W},\mathcal{B} and WW such that f(𝐱)CNN(L,J,K,κ,κ)f(\mathbf{x})\in\mathcal{F}^{\rm CNN}(L,J,K,\kappa,\kappa) is in the form of

f(𝐱)=WConv𝒲,(𝐱)f(\mathbf{x})=W\cdot\mathrm{Conv}_{\mathcal{W},\mathcal{B}}(\mathbf{x})

and satisfies f(𝐱)=f2f1(𝐱)f(\mathbf{x})=f_{2}\circ f_{1}(\mathbf{x}).

For 1lL111\leq l\leq L_{1}-1, we set 𝒲(l)=𝒲1(l),B(l)=B1(l)\mathcal{W}^{(l)}=\mathcal{W}^{(l)}_{1},B^{(l)}=B^{(l)}_{1}.

For l=L1l=L_{1}, to realize the fully connected layer of f1f_{1} by a convolutional layer, we set

𝒲1,:,:(L1)=(W1)1,:,𝒲2,:,:(L1)=(W1)1,:\displaystyle\mathcal{W}^{(L_{1})}_{1,:,:}=(W_{1})_{1,:},\ \mathcal{W}^{(L_{1})}_{2,:,:}=-(W_{1})_{1,:}

and B(L1)=𝟎B^{(L_{1})}=\mathbf{0}. Here 𝒲(L1)2×1×M\mathcal{W}^{(L_{1})}\in\mathbb{R}^{2\times 1\times M} is a size-one filter with two output channels, where MM is the number of input channels of W1W_{1}. The output of the L1L_{1}-th layer of ff has the form

[(f1(𝐱))+(f1(𝐱))]\begin{bmatrix}(f_{1}(\mathbf{x}))_{+}&(f_{1}(\mathbf{x}))_{-}\\ \star&\star\end{bmatrix}

where \star denotes some elements that will not affect the result.

Since the input of f2f_{2} is a real number, all filters of f2f_{2} have size 1. The weight matrix in the fully connected layer and all biases only have one row. For the (L1+1)(L_{1}+1)-th layer, we set

𝒲i,:,:(L1+1)=[(𝒲2(1))i,:,:(𝒲2(1))i,:,:],B(L1+1)=[B2(1)𝟎]\displaystyle\mathcal{W}^{(L_{1}+1)}_{i,:,:}=\begin{bmatrix}(\mathcal{W}^{(1)}_{2})_{i,:,:}&-(\mathcal{W}^{(1)}_{2})_{i,:,:}\end{bmatrix},\ B^{(L_{1}+1)}=\begin{bmatrix}B^{(1)}_{2}\\ \mathbf{0}\end{bmatrix}

where ii varies from 1 to the number of output channels of 𝒲2(1)\mathcal{W}^{(1)}_{2}. Here 𝒲(L1+1)\mathcal{W}^{(L_{1}+1)} is a size-one filter whose number of output channels is the same as that of 𝒲2(1)\mathcal{W}^{(1)}_{2}.

For L_{1}+2\leq l\leq L-1, we set

\mathcal{W}^{(l)}=\mathcal{W}^{(l-L_{1})}_{2},\ B^{(l)}=\begin{bmatrix}B^{(l-L_{1})}_{2}\\ \mathbf{0}\end{bmatrix},\ \mbox{ for }l=L_{1}+2,\dots,L_{1}+L_{2}-1.

For l=Ll=L, we set

W=[W2𝟎].W=\begin{bmatrix}W_{2}\\ \mathbf{0}\end{bmatrix}.

With the above settings, the lemma is proved. ∎
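The key identity behind this construction is that a signed scalar u can be carried through a ReLU layer as the pair (u_{+},u_{-}) and recovered by the next layer through paired weights (v,-v), since v\cdot u_{+}-v\cdot u_{-}=v\cdot u. The following minimal numerical sketch (our own illustration, not part of the proof; dense layers stand in for the convolution operation defined in (3), and all sizes are arbitrary toy values) checks that a network composed in this way reproduces f_{2}\circ f_{1}.

import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(0)

# toy "CNN" f1: one hidden layer + fully connected output (scalar)
A1, b1 = rng.standard_normal((5, 8)), rng.standard_normal(5)
w1 = rng.standard_normal(5)                      # fully connected row of f1
f1 = lambda x: w1 @ relu(A1 @ x + b1)

# toy "CNN" f2: acts on a scalar input
a2, b2 = rng.standard_normal(4), rng.standard_normal(4)
w2 = rng.standard_normal(4)                      # fully connected row of f2
f2 = lambda u: w2 @ relu(a2 * u + b2)

# composed network: replace f1's fully connected layer by a layer with paired
# weights (w1, -w1), so the ReLU outputs ((f1)_+, (f1)_-); f2's first layer
# then uses paired weights (a2, -a2) to recover f1(x).
def composed(x):
    h = relu(A1 @ x + b1)
    u_pm = relu(np.array([w1 @ h, -(w1 @ h)]))         # = ((f1)_+, (f1)_-)
    h2 = relu(np.outer(a2, [1.0, -1.0]) @ u_pm + b2)   # a2*(f1)_+ - a2*(f1)_- + b2
    return w2 @ h2

x = rng.standard_normal(8)
assert np.allclose(composed(x), f2(f1(x)))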

C.12 Lemma 14 and its proof

Lemma 14 is used to prove Lemma 17.

Lemma 14.

Let f1CNN(L1,J1,K1,κ1,κ1)f_{1}\in\mathcal{F}^{\rm CNN}(L_{1},J_{1},K_{1},\kappa_{1},\kappa_{1}) be a CNN from D\mathbb{R}^{D}\rightarrow\mathbb{R} and f2CNN(L2,J2,K2,κ2,κ2)f_{2}\in\mathcal{F}^{\rm CNN}(L_{2},J_{2},K_{2},\kappa_{2},\kappa_{2}) be a CNN from D\mathbb{R}^{D}\rightarrow\mathbb{R}. Assume the weight matrix in the fully connected layer of f1f_{1} and f2f_{2} have nonzero entries only in the first row. Then there exists a set of filters 𝒲\mathcal{W} and biases \mathcal{B} such that

Conv𝒲,(𝐱)=[(f1(𝐱))+(f1(𝐱))(f2(𝐱))+(f2(𝐱))]D×4.\mathrm{Conv}_{\mathcal{W},\mathcal{B}}(\mathbf{x})=\begin{bmatrix}(f_{1}(\mathbf{x}))_{+}&(f_{1}(\mathbf{x}))_{-}&(f_{2}(\mathbf{x}))_{+}&(f_{2}(\mathbf{x}))_{-}\\ \star&\star&\star&\star\end{bmatrix}\in\mathbb{R}^{D\times 4}.

Such a network has \max(L_{1},L_{2}) layers, each filter has size at most \max(K_{1},K_{2}) and at most J_{1}+J_{2} channels. All parameters are bounded by \max(\kappa_{1},\kappa_{2}).

Proof of Lemma 14.

For simplicity, we assume all convolutional layers of f_{1} have J_{1} channels and all convolutional layers of f_{2} have J_{2} channels. If some filters in f_{1} (or f_{2}) have fewer than J_{1} (or J_{2}) channels, we can add additional channels with zero filters and biases. Without loss of generality, we assume L_{1}>L_{2}.

Denote f_{1} and f_{2} by

f1(𝐱)=W1Conv𝒲1,1(𝐱) and f2(𝐱)=W2Conv𝒲2,2(𝐱),f_{1}(\mathbf{x})=W_{1}\cdot\mathrm{Conv}_{\mathcal{W}_{1},\mathcal{B}_{1}}(\mathbf{x})\mbox{ and }f_{2}(\mathbf{x})=W_{2}\cdot\mathrm{Conv}_{\mathcal{W}_{2},\mathcal{B}_{2}}(\mathbf{x}),

where \mathcal{W}_{1}=\left\{\mathcal{W}_{1}^{(l)}\right\}_{l=1}^{L_{1}},\mathcal{B}_{1}=\left\{B_{1}^{(l)}\right\}_{l=1}^{L_{1}},\mathcal{W}_{2}=\left\{\mathcal{W}_{2}^{(l)}\right\}_{l=1}^{L_{2}},\mathcal{B}_{2}=\left\{B_{2}^{(l)}\right\}_{l=1}^{L_{2}} are the sets of filters and biases, and \mathrm{Conv}_{\mathcal{W}_{1},\mathcal{B}_{1}},\mathrm{Conv}_{\mathcal{W}_{2},\mathcal{B}_{2}} are defined in (3). In the rest of this proof, we will choose proper weight parameters in \mathcal{W},\mathcal{B} such that

Conv𝒲,(𝐱)=[(f1(𝐱))+(f1(𝐱))(f2(𝐱))+(f2(𝐱))]D×4.\mathrm{Conv}_{\mathcal{W},\mathcal{B}}(\mathbf{x})=\begin{bmatrix}(f_{1}(\mathbf{x}))_{+}&(f_{1}(\mathbf{x}))_{-}&(f_{2}(\mathbf{x}))_{+}&(f_{2}(\mathbf{x}))_{-}\\ \star&\star&\star&\star\end{bmatrix}\in\mathbb{R}^{D\times 4}.

For 1lL211\leq l\leq L_{2}-1, we set

𝒲i,:,:(l)=[(𝒲1(l))i,:,:𝟎] for i=1,,J1,\displaystyle\mathcal{W}^{(l)}_{i,:,:}=\begin{bmatrix}(\mathcal{W}_{1}^{(l)})_{i,:,:}&\mathbf{0}\end{bmatrix}\mbox{ for }i=1,...,J_{1},
𝒲i,:,:(l)=[𝟎(𝒲2(l))iJ1,:,:] for i=J1+1,,J1+J2,\displaystyle\mathcal{W}^{(l)}_{i,:,:}=\begin{bmatrix}\mathbf{0}&(\mathcal{W}_{2}^{(l)})_{i-J_{1},:,:}\end{bmatrix}\mbox{ for }i=J_{1}+1,...,J_{1}+J_{2},
B(l)=[B1(l)B2(l)].\displaystyle B^{(l)}=\begin{bmatrix}B_{1}^{(l)}&B_{2}^{(l)}\end{bmatrix}.

Each 𝒲(l)\mathcal{W}^{(l)} is a filter with size max(K1,K2)\max(K_{1},K_{2}) and J1+J2J_{1}+J_{2} output channels. When K1K2K_{1}\neq K_{2}, we pad the smaller filter by zeros. For example when K1<K2K_{1}<K_{2}, we set

𝒲i,:,:(l)=[(𝒲1(l))i,:,:𝟎𝟎𝟎] for i=1,,J1,\displaystyle\mathcal{W}^{(l)}_{i,:,:}=\begin{bmatrix}(\mathcal{W}_{1}^{(l)})_{i,:,:}&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{bmatrix}\mbox{ for }i=1,...,J_{1},

so that \mathcal{W}^{(l)} consists of filters of size K_{2}.

For the L2L_{2}-th layer, we set

𝒲i,:,:(L2)=[(𝒲1(L2))i,:,:𝟎] for i=1,,J1,\displaystyle\mathcal{W}^{(L_{2})}_{i,:,:}=\begin{bmatrix}(\mathcal{W}_{1}^{(L_{2})})_{i,:,:}&\mathbf{0}\end{bmatrix}\mbox{ for }i=1,...,J_{1},
𝒲J1+1,:,:(L2)=[𝟎(W2)1,:𝟎𝟎],𝒲J1+2,:,:(L2)=[𝟎(W2)1,:𝟎𝟎],\displaystyle\mathcal{W}^{(L_{2})}_{J_{1}+1,:,:}=\begin{bmatrix}\mathbf{0}&(W_{2})_{1,:}\\ \mathbf{0}&\mathbf{0}\end{bmatrix},\ \mathcal{W}^{(L_{2})}_{J_{1}+2,:,:}=\begin{bmatrix}\mathbf{0}&-(W_{2})_{1,:}\\ \mathbf{0}&\mathbf{0}\end{bmatrix},
B(L2)=[B1(L2)𝟎].\displaystyle B^{(L_{2})}=\begin{bmatrix}B_{1}^{(L_{2})}&\mathbf{0}\end{bmatrix}.

𝒲(L2)\mathcal{W}^{(L_{2})} is a filter with size K1K_{1} and J1+2J_{1}+2 output channels.

For L2+1lL11L_{2}+1\leq l\leq L_{1}-1, we set

𝒲i,:,:(l)=[(𝒲1(l))i,:,:𝟎] for i=1,,J1,\displaystyle\mathcal{W}^{(l)}_{i,:,:}=\begin{bmatrix}(\mathcal{W}_{1}^{(l)})_{i,:,:}&\mathbf{0}\end{bmatrix}\mbox{ for }i=1,...,J_{1},
𝒲J1+1,:,:(l)=[𝟎10𝟎𝟎𝟎],𝒲J1+2,:,:(l)=[𝟎01𝟎𝟎𝟎],\displaystyle\mathcal{W}^{(l)}_{J_{1}+1,:,:}=\begin{bmatrix}\mathbf{0}&1&0\\ \mathbf{0}&\mathbf{0}&\mathbf{0}\end{bmatrix},\ \mathcal{W}^{(l)}_{J_{1}+2,:,:}=\begin{bmatrix}\mathbf{0}&0&1\\ \mathbf{0}&\mathbf{0}&\mathbf{0}\end{bmatrix},
B^{(l)}=\begin{bmatrix}B_{1}^{(l)}&\mathbf{0}\end{bmatrix}.

For the L1L_{1}-th layer, we set

𝒲1,:,:(L1)=[(W1)1,:00𝟎𝟎𝟎],𝒲2,:,:(L1)=[(W1)1,:00𝟎𝟎𝟎],𝒲3,:,:(L1)=[𝟎10𝟎𝟎𝟎],𝒲4,:,:(L1)=[𝟎01𝟎𝟎𝟎],\mathcal{W}^{(L_{1})}_{1,:,:}=\begin{bmatrix}(W_{1})_{1,:}&0&0\\ \mathbf{0}&\mathbf{0}&\mathbf{0}\end{bmatrix},\ \mathcal{W}^{(L_{1})}_{2,:,:}=\begin{bmatrix}-(W_{1})_{1,:}&0&0\\ \mathbf{0}&\mathbf{0}&\mathbf{0}\end{bmatrix},\mathcal{W}^{(L_{1})}_{3,:,:}=\begin{bmatrix}\mathbf{0}&1&0\\ \mathbf{0}&\mathbf{0}&\mathbf{0}\end{bmatrix},\ \mathcal{W}^{(L_{1})}_{4,:,:}=\begin{bmatrix}\mathbf{0}&0&1\\ \mathbf{0}&\mathbf{0}&\mathbf{0}\end{bmatrix},

and B(L1)=𝟎B^{(L_{1})}=\mathbf{0}. ∎
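The construction above stacks the two networks channel-wise through block-structured filters and emits the four signed parts in the last convolutional layer through the paired rows (W_{1})_{1,:},-(W_{1})_{1,:} and (W_{2})_{1,:},-(W_{2})_{1,:}. A minimal numerical sketch of this idea (our own illustration; dense layers stand in for the convolutions in (3), with arbitrary toy widths) is given below.

import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(4)

# two toy one-hidden-layer networks sharing the same input (dense stand-ins for CNNs)
A1, b1, w1 = rng.standard_normal((3, 6)), rng.standard_normal(3), rng.standard_normal(3)
A2, b2, w2 = rng.standard_normal((4, 6)), rng.standard_normal(4), rng.standard_normal(4)
f1 = lambda x: w1 @ relu(A1 @ x + b1)
f2 = lambda x: w2 @ relu(A2 @ x + b2)

# stack the hidden layers channel-wise, then emit ((f1)_+, (f1)_-, (f2)_+, (f2)_-)
# in the last layer through the paired rows (w1, -w1) and (w2, -w2)
A = np.vstack([A1, A2])
b = np.concatenate([b1, b2])
V = np.vstack([np.concatenate([w1, np.zeros(4)]),
               np.concatenate([-w1, np.zeros(4)]),
               np.concatenate([np.zeros(3), w2]),
               np.concatenate([np.zeros(3), -w2])])

x = rng.standard_normal(6)
out = relu(V @ relu(A @ x + b))
assert np.allclose(out, [max(f1(x), 0), max(-f1(x), 0), max(f2(x), 0), max(-f2(x), 0)])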

C.13 Lemma 15 and its proof

Lemma 15 is used to prove Lemma 17.

Lemma 15.

Let M,NM,N be positive integers. For any 𝐱=[x1xM]M\mathbf{x}=\begin{bmatrix}x_{1}&\cdots&x_{M}\end{bmatrix}^{\top}\in\mathbb{R}^{M}, define

X=[(x1)+(x1)(xM)+(xM)]N×(2M).X=\begin{bmatrix}(x_{1})_{+}&(x_{1})_{-}&\cdots&(x_{M})_{+}&(x_{M})_{-}\\ \star&\star&\cdots&\star&\star\end{bmatrix}\in\mathbb{R}^{N\times(2M)}.

For any CNN architecture 1CNN(L,J,K,κ,κ)\mathcal{F}_{1}^{\rm CNN}(L,J,K,\kappa,\kappa) from M\mathbb{R}^{M}\rightarrow\mathbb{R}, there exists a CNN architecture CNN(L,MJ,K,κ,κ)\mathcal{F}^{\rm CNN}(L,MJ,K,\kappa,\kappa) from N×(2M)\mathbb{R}^{N\times(2M)}\rightarrow\mathbb{R} such that for any f11CNN(L,J,K,κ,κ)f_{1}\in\mathcal{F}_{1}^{\rm CNN}(L,J,K,\kappa,\kappa), there exists fCNN(L,MJ,K,κ,κ)f\in\mathcal{F}^{\rm CNN}(L,MJ,K,\kappa,\kappa) with f(X)=f1(𝐱)f(X)=f_{1}(\mathbf{x}). Furthermore, the fully connected layer of CNN(L,MJ,K,κ,κ)\mathcal{F}^{\rm CNN}(L,MJ,K,\kappa,\kappa) has nonzero entries only in the first row.

Proof of Lemma 15.

Denote f1f_{1} as

f1=W1Conv𝒲1,1(𝐱)\displaystyle f_{1}=W_{1}\cdot\mathrm{Conv}_{\mathcal{W}_{1},\mathcal{B}_{1}}(\mathbf{x})

where \mathcal{W}_{1}=\left\{\mathcal{W}_{1}^{(l)}\right\}_{l=1}^{L},\mathcal{B}_{1}=\left\{B_{1}^{(l)}\right\}_{l=1}^{L} are the sets of filters and biases, and \mathrm{Conv}_{\mathcal{W}_{1},\mathcal{B}_{1}} is defined in (3). For simplicity, we assume all convolutional layers of f_{1} have J channels. If some filters in f_{1} have fewer than J channels, we can add additional channels with zero filters and biases.

We next choose proper weight parameters in 𝒲,\mathcal{W},\mathcal{B} and WW such that

f=WConv𝒲,(x)f=W\cdot\mathrm{Conv}_{\mathcal{W},\mathcal{B}}(x)

and f(X)=f1(𝐱)f(X)=f_{1}(\mathbf{x}).

For the first layer, i.e., l=1, we design \mathcal{W}^{(1)} and B^{(1)} as follows. Since \mathbf{x} is a vector in \mathbb{R}^{M}, the filter \mathcal{W}_{1}^{(1)} has one input channel and J output channels. For 1\leq i\leq J and 1\leq j\leq M-K+1, we set

𝒲(i1)M+j,:,:(1)=[𝟎(𝒲1(1))i,1,:(𝒲1(1))i,1,:(𝒲1(1))i,K,:(𝒲1(1))i,K,:𝟎~]\displaystyle\mathcal{W}^{(1)}_{(i-1)M+j,:,:}=\begin{bmatrix}\mathbf{0}&(\mathcal{W}^{(1)}_{1})_{i,1,:}&-(\mathcal{W}^{(1)}_{1})_{i,1,:}&\cdots&(\mathcal{W}^{(1)}_{1})_{i,K,:}&-(\mathcal{W}^{(1)}_{1})_{i,K,:}&\widetilde{\mathbf{0}}\end{bmatrix}

where 𝟎\mathbf{0} is of size 1×(j1)1\times(j-1), 𝟎~\widetilde{\mathbf{0}} is a zero matrix of size 1×(MjK+1)1\times(M-j-K+1).

For 1iJ1\leq i\leq J and MK+2jMM-K+2\leq j\leq M, we set

𝒲(i1)M+j,:,:(1)=[𝟎(𝒲1(1))i,1,:(𝒲1(1))i,1,:(𝒲1(1))i,Kj+1,:(𝒲1(1))i,Kj+1,:]\displaystyle\mathcal{W}^{(1)}_{(i-1)M+j,:,:}=\begin{bmatrix}\mathbf{0}&(\mathcal{W}^{(1)}_{1})_{i,1,:}&-(\mathcal{W}^{(1)}_{1})_{i,1,:}&\cdots&(\mathcal{W}^{(1)}_{1})_{i,K-j+1,:}&-(\mathcal{W}^{(1)}_{1})_{i,K-j+1,:}\end{bmatrix}

where 𝟎\mathbf{0} is of size 1×(j1)1\times(j-1). The bias is set as

B(1)=[((B1(1)):,1)((B1(1)):,J)𝟎𝟎].B^{(1)}=\begin{bmatrix}\left((B_{1}^{(1)})_{:,1}\right)^{\top}&\cdots&\left((B_{1}^{(1)})_{:,J}\right)^{\top}\\ \mathbf{0}&\cdots&\mathbf{0}\end{bmatrix}.

We next choose weight parameters for 2\leq l\leq L-1. For 1\leq i\leq J and 1\leq j\leq M-K+1, we set

𝒲(i1)M+j,:,:(l)=[𝟎((𝒲1(l))i,:,1)𝟎~𝟎((𝒲1(l))i,:,J)𝟎~]\displaystyle\mathcal{W}^{(l)}_{(i-1)M+j,:,:}=\begin{bmatrix}\mathbf{0}&\left((\mathcal{W}^{(l)}_{1})_{i,:,1}\right)^{\top}&\widetilde{\mathbf{0}}&\cdots&\mathbf{0}&\left((\mathcal{W}^{(l)}_{1})_{i,:,J}\right)^{\top}&\widetilde{\mathbf{0}}\end{bmatrix}

where 𝟎\mathbf{0} is of size 1×(j1)1\times(j-1), 𝟎~\widetilde{\mathbf{0}} is a zero matrix of size 1×(MjK+1)1\times(M-j-K+1).

For 1iJ1\leq i\leq J and MK+2jMM-K+2\leq j\leq M, we set

𝒲(i1)M+j,:,:(l)=[𝟎((𝒲1(l))i,1:Kj+1,1)𝟎((𝒲1(l))i,1:Kj+1,J)]\displaystyle\mathcal{W}^{(l)}_{(i-1)M+j,:,:}=\begin{bmatrix}\mathbf{0}&\left((\mathcal{W}^{(l)}_{1})_{i,1:K-j+1,1}\right)^{\top}&\cdots&\mathbf{0}&\left((\mathcal{W}^{(l)}_{1})_{i,1:K-j+1,J}\right)^{\top}\end{bmatrix}

where 𝟎\mathbf{0} is of size 1×(j1)1\times(j-1). The bias is set as

B(l)=[((B1(l)):,1)((B1(l)):,J)𝟎𝟎].B^{(l)}=\begin{bmatrix}\left((B_{1}^{(l)})_{:,1}\right)^{\top}&\cdots&\left((B_{1}^{(l)})_{:,J}\right)^{\top}\\ \mathbf{0}&\cdots&\mathbf{0}\end{bmatrix}.

For the fully connected layer, we set

W=[((W1):,1)((W1):,J)𝟎𝟎].W=\begin{bmatrix}\left((W_{1})_{:,1}\right)^{\top}&\cdots&\left((W_{1})_{:,J}\right)^{\top}\\ \mathbf{0}&\cdots&\mathbf{0}\end{bmatrix}.

With these choices, the lemma is proved. ∎
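The essential identity behind the interleaved weights chosen above is that a filter row applied to the pattern ((x_{1})_{+},(x_{1})_{-},\dots,(x_{M})_{+},(x_{M})_{-}) with interleaved signs (w_{1},-w_{1},\dots,w_{M},-w_{M}) recovers the inner product with \mathbf{x}. A short numerical check of this identity (our own illustration, independent of the convolution convention in (3)) follows.

import numpy as np

rng = np.random.default_rng(5)
M = 6
x = rng.standard_normal(M)
w = rng.standard_normal(M)                          # one filter row acting on x

# first row of X: ((x1)_+, (x1)_-, ..., (xM)_+, (xM)_-)
X_row = np.ravel(np.column_stack([np.maximum(x, 0.0), np.maximum(-x, 0.0)]))
# interleaved weights (w1, -w1, ..., wM, -wM) recover the inner product w @ x
w_pm = np.ravel(np.column_stack([w, -w]))
assert np.isclose(w_pm @ X_row, w @ x)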

C.14 Lemma 16 and its proof

Lemma 16 shows that for any CNN, if we scale all weight parameters in convolutional layers by some factors and scale the weight parameters in the fully connected layer properly, the output will remain the same. Lemma 16 is used to prove Lemma 17.

Lemma 16.

Let α1\alpha\geq 1. For any fCNN(L,J,K,κ1,κ2)f\in\mathcal{F}^{\rm CNN}(L,J,K,\kappa_{1},\kappa_{2}), there exists f~CNN(L,J,K,α1κ1,αLκ2)\widetilde{f}\in\mathcal{F}^{\rm CNN}(L,J,K,\alpha^{-1}\kappa_{1},\alpha^{L}\kappa_{2}) such that f~(𝐱)=f(𝐱)\widetilde{f}(\mathbf{x})=f(\mathbf{x}).

Proof of Lemma 16.

This lemma is proved using the linear property of ReLU\mathrm{ReLU} and convolution. Let ff be any CNN in CNN(L,J,K,κ1,κ2)\mathcal{F}^{\rm CNN}(L,J,K,\kappa_{1},\kappa_{2}). Denote its architecture as

f(𝐱)=WConv𝒲,(𝐱).f(\mathbf{x})=W\cdot\mathrm{Conv}_{\mathcal{W},\mathcal{B}}(\mathbf{x}).

Define W¯=αLW\bar{W}=\alpha^{L}W and 𝒲¯,¯\bar{\mathcal{W}},\bar{\mathcal{B}} as

𝒲¯(l)=α1𝒲(l),¯(l)=αl(l)\bar{\mathcal{W}}^{(l)}=\alpha^{-1}\mathcal{W}^{(l)},\bar{\mathcal{B}}^{(l)}=\alpha^{-l}\mathcal{B}^{(l)}

for l=1,\dots,L. Set

f~(𝐱)=W¯Conv𝒲¯,¯(𝐱)\widetilde{f}(\mathbf{x})=\bar{W}\cdot\mathrm{Conv}_{\bar{\mathcal{W}},\bar{\mathcal{B}}}(\mathbf{x})

We have f~CNN(L,J,K,α1κ1,αLκ2)\widetilde{f}\in\mathcal{F}^{\rm CNN}(L,J,K,\alpha^{-1}\kappa_{1},\alpha^{L}\kappa_{2}) and f~(𝐱)=f(𝐱)\widetilde{f}(\mathbf{x})=f(\mathbf{x}) since ReLU(c𝐱)=cReLU(𝐱)\mathrm{ReLU}(c\mathbf{x})=c\mathrm{ReLU}(\mathbf{x}) for any c>0c>0. ∎
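A quick numerical sanity check of this rescaling (our own sketch; dense layers stand in for the convolutional layers, and the depth, width and scaling factor are arbitrary toy values) is given below; it relies only on the positive homogeneity \mathrm{ReLU}(cz)=c\,\mathrm{ReLU}(z) for c>0 used in the proof.

import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(1)

L, alpha = 3, 10.0
Ws = [rng.standard_normal((6, 6)) for _ in range(L)]   # stand-ins for the filters W^(l)
Bs = [rng.standard_normal(6) for _ in range(L)]        # biases B^(l)
W_fc = rng.standard_normal(6)                          # fully connected layer W

def net(x, Ws, Bs, W_out):
    h = x
    for Wl, Bl in zip(Ws, Bs):
        h = relu(Wl @ h + Bl)
    return W_out @ h

# rescaled network of Lemma 16: W^(l) -> W^(l)/alpha, B^(l) -> B^(l)/alpha^l, W -> alpha^L * W
Ws_bar = [Wl / alpha for Wl in Ws]
Bs_bar = [Bl / alpha ** (l + 1) for l, Bl in enumerate(Bs)]
W_bar = alpha ** L * W_fc

x = rng.standard_normal(6)
assert np.allclose(net(x, Ws, Bs, W_fc), net(x, Ws_bar, Bs_bar, W_bar))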

C.15 Lemma 17 and its proof

Lemma 17 shows that each f̊i,j\mathring{f}_{i,j} defined in (27) can be realized by a CNN.

Lemma 17.

Let f̊i,j\mathring{f}_{i,j} be defined as in (27). Assume each CNN in f̊i,j\mathring{f}_{i,j} has architecture discussed in Appendix C.10. Then there exists a CNN f¯i,jCNNCNN(L,J,K,κ1,κ2)\bar{f}^{\rm CNN}_{i,j}\in\mathcal{F}^{\rm CNN}(L,J,K,\kappa_{1},\kappa_{2}) with

L=O(log(1/ε)+D+logD),J=48d(s+1)(s+3)+28d+6D,κ1=1,logκ2=O(log21ε)\displaystyle L=O(\log(1/\varepsilon)+D+\log D),J=\lceil 48d(s+1)(s+3)+28d+6D\rceil,\kappa_{1}=1,\log\kappa_{2}=O\left(\log^{2}\frac{1}{\varepsilon}\right)

such that f¯i,jCNN(𝐱)=f̊i,j(𝐱)\bar{f}^{\rm CNN}_{i,j}(\mathbf{x})=\mathring{f}_{i,j}(\mathbf{x}) for any 𝐱\mathbf{x}\in\mathcal{M}. The constants hidden in O()O(\cdot) depend on d,D,s,2dspd,p,q,c0,τd,D,s,\frac{2d}{sp-d},p,q,c_{0},\tau and the surface area of \mathcal{M}.

Proof of Lemma 17.

According to Lemma 13, there exists a CNN gi,jg_{i,j} realizing f~i,jCNNϕiCNN\widetilde{f}_{i,j}^{\rm CNN}\circ\phi_{i}^{\rm CNN} and a CNN g~i\widetilde{g}_{i} realizing 𝟙~Δd~i2\widetilde{\mathds{1}}_{\Delta}\circ\widetilde{d}_{i}^{2}. Using Lemma 14, one can construct a CNN excluding the fully connected layer, denoted by g¯i,j\bar{g}_{i,j}, such that

g¯i,j(𝐱)=[(gi,j(𝐱))+(gi,j(𝐱))(g~i(𝐱))+(g~i(𝐱))].\displaystyle\bar{g}_{i,j}(\mathbf{x})=\begin{bmatrix}(g_{i,j}(\mathbf{x}))_{+}&(g_{i,j}(\mathbf{x}))_{-}&(\widetilde{g}_{i}(\mathbf{x}))_{+}&(\widetilde{g}_{i}(\mathbf{x}))_{-}\\ \star&\star&\star&\star\end{bmatrix}. (54)

Here g¯i,j\bar{g}_{i,j} has 48d(s+1)(s+3)+28\lceil 48d(s+1)(s+3)+28\rceil channels.

Since the input of ×~\widetilde{\times} is [gi,jg~i],\begin{bmatrix}g_{i,j}\\ \widetilde{g}_{i}\end{bmatrix}, Lemma 15 shows that there exists a CNN g̊i,jCNN\mathring{g}^{\rm CNN}_{i,j} which takes (54) as the input and outputs ×~(gi,j,g~i)\widetilde{\times}(g_{i,j},\widetilde{g}_{i}).

Note that g¯i,j\bar{g}_{i,j} only contains convolutional layers. The composition g̊i,jCNNg¯i,j\mathring{g}^{\rm CNN}_{i,j}\circ\bar{g}_{i,j}, denoted by f˘i,jCNN\breve{f}^{\rm CNN}_{i,j}, is a CNN and for any 𝐱\mathbf{x}\in\mathcal{M}, f˘i,jCNN(𝐱)=f¯i,j(𝐱)\breve{f}^{\rm CNN}_{i,j}(\mathbf{x})=\bar{f}_{i,j}(\mathbf{x}). We have f˘i,jCNNCNN(L,J,K,κ,κ)\breve{f}^{\rm CNN}_{i,j}\in\mathcal{F}^{\rm CNN}(L,J,K,\kappa,\kappa) with

L=O(log1ε+D+logD),J=28d(s+1)(s+3)+18d+6D,κ=O(ε(log2)ds(2dspd+c1d1)).\displaystyle L=O\left(\log\frac{1}{\varepsilon}+D+\log D\right),\ J=\lceil 28d(s+1)(s+3)+18d\rceil+6D,\ \kappa=O\left(\varepsilon^{-(\log 2)\frac{d}{s}(\frac{2d}{sp-d}+c_{1}d^{-1})}\right).

and KK can be any integer in [2,D][2,D].

We next rescale all parameters in convolutional layers of f˘i,jCNN\breve{f}^{\rm CNN}_{i,j} to be no larger than 1. Using Lemma 16, we can realize f˘i,jCNN\breve{f}^{\rm CNN}_{i,j} by f¯i,jCNNCNN(L,J,K,α1κ,αLκ)\bar{f}^{\rm CNN}_{i,j}\in\mathcal{F}^{\rm CNN}(L,J,K,\alpha^{-1}\kappa,\alpha^{L}\kappa) for any α>1\alpha>1. Set α=Cε(log2)ds(2dspd+c1d1)(8KD)M1L\alpha=C^{\prime}\varepsilon^{-(\log 2)\frac{d}{s}(\frac{2d}{sp-d}+c_{1}d^{-1})}(8KD)M^{\frac{1}{L}} where CC^{\prime} is a constant such that κCε(log2)ds(2dspd+c1d1)\kappa\leq C^{\prime}\varepsilon^{-(\log 2)\frac{d}{s}(\frac{2d}{sp-d}+c_{1}d^{-1})}. With this α\alpha, we have f¯i,jCNNCNN(L,J,K,κ1,κ2)\bar{f}^{\rm CNN}_{i,j}\in\mathcal{F}^{\rm CNN}(L,J,K,\kappa_{1},\kappa_{2}) with

L=O(log(1/ε)+D+logD),J=28d(s+1)(s+3)+18d+6D,\displaystyle L=O(\log(1/\varepsilon)+D+\log D),\ J=\lceil 28d(s+1)(s+3)+18d\rceil+6D,
κ1=(8KD)1M1L=O(1),logκ2=O(log21/ε).\displaystyle\kappa_{1}=(8KD)^{-1}M^{-\frac{1}{L}}=O(1),\ \log\kappa_{2}=O\left(\log^{2}1/\varepsilon\right).

C.16 Lemma 18 and its proof

Lemma 18 shows that the sum of a set of CNNs can be realized by a ConvResNet.

Lemma 18.

Let \mathcal{F}^{\rm CNN}(L,J,K,\kappa_{1},\kappa_{2}) be any CNN architecture from \mathbb{R}^{D} to \mathbb{R}. Assume the weight matrix in the fully connected layer of \mathcal{F}^{\rm CNN}(L,J,K,\kappa_{1},\kappa_{2}) has nonzero entries only in the first row. Let M be a positive integer. There exists a ConvResNet architecture \mathcal{C}(M,L,J,\kappa_{1},\kappa_{2}(1\vee\kappa_{1}^{-1})) such that for any \{f_{m}(\mathbf{x})\}_{m=1}^{M}\subset\mathcal{F}^{\rm CNN}(L,J,K,\kappa_{1},\kappa_{2}), there exists \bar{f}\in\mathcal{C}(M,L,J,\kappa_{1},\kappa_{2}(1\vee\kappa_{1}^{-1})) with

\bar{f}(\mathbf{x})=\sum_{m=1}^{M}f_{m}(\mathbf{x}).
Proof of Lemma 18.

Denote the architecture of fmf_{m} by

fm(𝐱)=WmConv𝒲m,m(𝐱)f_{m}(\mathbf{x})=W_{m}\cdot\mathrm{Conv}_{\mathcal{W}_{m},\mathcal{B}_{m}}(\mathbf{x})

with 𝒲m={𝒲m(l)}l=1L,m={Bm(l)}l=1L.\mathcal{W}_{m}=\left\{\mathcal{W}_{m}^{(l)}\right\}_{l=1}^{L},\mathcal{B}_{m}=\left\{B_{m}^{(l)}\right\}_{l=1}^{L}. In f¯\bar{f}, denote the weight matrix and bias in the fully connected layer by W¯,b¯\bar{W},\bar{b} and the set of filters and biases in the mm-th block by 𝒲¯m\bar{\mathcal{W}}_{m} and ¯m\bar{\mathcal{B}}_{m}, respectively. The padding layer PP in f¯\bar{f} pads the input from D\mathbb{R}^{D} to D×3\mathbb{R}^{D\times 3} by zeros. Here each column denotes a channel.

We first show that for each m, there exists a subnetwork \mathrm{Conv}_{\bar{\mathcal{W}}_{m},\bar{\mathcal{B}}_{m}}:\mathbb{R}^{D\times 3}\rightarrow\mathbb{R}^{D\times 3} such that for any Z\in\mathbb{R}^{D\times 3} in the form of

Z=[𝐱],\displaystyle Z=\begin{bmatrix}\mathbf{x}&\star&\star\end{bmatrix}, (55)

we have

Conv𝒲¯m,¯m(Z)=[0κ1κ2(fm)+κ1κ2(fm)𝟎]\displaystyle\mathrm{Conv}_{\bar{\mathcal{W}}_{m},\bar{\mathcal{B}}_{m}}(Z)=\begin{bmatrix}0&\frac{\kappa_{1}}{\kappa_{2}}(f_{m})_{+}&\frac{\kappa_{1}}{\kappa_{2}}(f_{m})_{-}\\ \mathbf{0}&\star&\star\end{bmatrix} (56)

where \star denotes entries that do not affect the result.

For any m, the first layer of f_{m} takes input in \mathbb{R}^{D}. As a result, the filters in \mathcal{W}_{m}^{(1)} are in \mathbb{R}^{D}. We pad these filters by zeros to get filters in \mathbb{R}^{D\times 3} and construct \bar{\mathcal{W}}_{m}^{(1)} as

(𝒲¯m(1))j,:,:=[(𝒲m(1))j,:,:𝟎𝟎].(\bar{\mathcal{W}}_{m}^{(1)})_{j,:,:}=\begin{bmatrix}(\mathcal{W}_{m}^{(1)})_{j,:,:}&\mathbf{0}&\mathbf{0}\end{bmatrix}.

For any ZZ in the form of (55), we have 𝒲¯m(1)Z=𝒲m(1)𝐱\bar{\mathcal{W}}_{m}^{(1)}*Z=\mathcal{W}_{m}^{(1)}*\mathbf{x}. For the filters in the following layers and all biases, we simply set

\bar{\mathcal{W}}_{m}^{(l)}=\mathcal{W}_{m}^{(l)}\quad\mbox{ for }l=2,\dots,L-1,
\bar{B}_{m}^{(l)}=B_{m}^{(l)}\quad\mbox{ for }l=1,\dots,L-1.

In \mathrm{Conv}_{\bar{\mathcal{W}}_{m},\bar{\mathcal{B}}_{m}}, another convolutional layer is constructed to realize the fully connected layer in f_{m}. According to our assumption, only the first row of W_{m} has nonzero entries. We set \bar{\mathcal{B}}_{m}^{(L)}=\mathbf{0} and \bar{\mathcal{W}}_{m}^{(L)} as size-one filters with three output channels in the form of

(\bar{\mathcal{W}}_{m}^{(L)})_{1,:,:}=\mathbf{0},\ (\bar{\mathcal{W}}_{m}^{(L)})_{2,:,:}=\frac{\kappa_{1}}{\kappa_{2}}(W_{m})_{1,:},\ (\bar{\mathcal{W}}_{m}^{(L)})_{3,:,:}=-\frac{\kappa_{1}}{\kappa_{2}}(W_{m})_{1,:}.

Under such choices, (56) is proved and all parameters in 𝒲¯m,¯m\bar{\mathcal{W}}_{m},\bar{\mathcal{B}}_{m} are bounded by κ1\kappa_{1}.

By composing all residual blocks, one has

(\mathrm{Conv}_{\bar{\mathcal{W}}_{M},\bar{\mathcal{B}}_{M}}+\mathrm{id})\circ\cdots\circ(\mathrm{Conv}_{\bar{\mathcal{W}}_{1},\bar{\mathcal{B}}_{1}}+\mathrm{id})\circ P(\mathbf{x})=\begin{bmatrix}&\frac{\kappa_{1}}{\kappa_{2}}\sum_{m=1}^{M}(f_{m})_{+}&\frac{\kappa_{1}}{\kappa_{2}}\sum_{m=1}^{M}(f_{m})_{-}\\ \mathbf{x}&\star&\star\\ &\vdots&\vdots\end{bmatrix}.

The fully connected layer is set as

W¯=[0κ2κ1κ2κ1𝟎𝟎𝟎],b¯=0.\bar{W}=\begin{bmatrix}0&\frac{\kappa_{2}}{\kappa_{1}}&-\frac{\kappa_{2}}{\kappa_{1}}\\ \mathbf{0}&\mathbf{0}&\mathbf{0}\end{bmatrix},\ \bar{b}=0.

The weights in the fully connected layer are bounded by κ2(1κ11)\kappa_{2}(1\vee\kappa_{1}^{-1}).

Such a ConvResNet gives

f¯(𝐱)=m=1M(fm(𝐱))+(m=1M(fm(𝐱)))=m=1Mfm(𝐱).\displaystyle\bar{f}(\mathbf{x})=\sum_{m=1}^{M}(f_{m}(\mathbf{x}))_{+}-\left(\sum_{m=1}^{M}(f_{m}(\mathbf{x}))_{-}\right)=\sum_{m=1}^{M}f_{m}(\mathbf{x}).
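To make the bookkeeping of this construction concrete, the following toy sketch (our own illustration; the functions f_{m} are replaced by arbitrary scalar-valued maps and the feature map is collapsed to three scalar channels) mimics how each residual block adds \frac{\kappa_{1}}{\kappa_{2}}(f_{m})_{\pm} to the reserved channels, the skip connections accumulate the sums, and the fully connected layer rescales and takes the difference.

import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(2)

kappa1, kappa2 = 0.1, 5.0
fs = [lambda x, c=c: float(c @ x) for c in rng.standard_normal((4, 3))]  # toy f_m, m = 1,...,4

# state mimics the padded feature map: entry 0 carries x, entries 1 and 2 accumulate
# (kappa1/kappa2) * sum_m (f_m)_+ and (kappa1/kappa2) * sum_m (f_m)_-
def residual_block(state, fm):
    x = state[0]
    plus = kappa1 / kappa2 * relu(fm(x))
    minus = kappa1 / kappa2 * relu(-fm(x))
    return [state[0], state[1] + plus, state[2] + minus]   # block output + identity skip

x = rng.standard_normal(3)
state = [x, 0.0, 0.0]            # padding layer P: extra channels initialised to zero
for fm in fs:
    state = residual_block(state, fm)

# fully connected layer with weights (0, kappa2/kappa1, -kappa2/kappa1)
f_bar = kappa2 / kappa1 * (state[1] - state[2])
assert np.isclose(f_bar, sum(fm(x) for fm in fs))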

Appendix D Proof of Lemmas and Propositions in Section A

D.1 Proof of Proposition 2

The proof of Proposition 2 relies on the following large-deviation inequality:

Lemma 19 (Theorem 3 in Shen and Wong (1994)).

Let \{\mathbf{x}_{i}\}_{i=1}^{n} be i.i.d. samples from some probability distribution. Let \mathcal{F} be a class of functions uniformly bounded by F. Let v>0 be a constant with v\geq\sup_{f\in\mathcal{F}}\operatorname{{\rm Var}}(f(\mathbf{x})). Assume there are constants M>0,0<\lambda<1 such that

  1. (B1)

    B(v1/2,,L2)λnM2/(8(4v+MF/3))\mathcal{H}_{B}(v^{1/2},\mathcal{F},\|\cdot\|_{L^{2}})\leq\lambda nM^{2}/(8(4v+MF/3)),

  2. (B2)

    Mλv/(4F),v1/2FM\leq\lambda v/(4F),\ v^{1/2}\leq F,

  3. (B3)

    if λM/8v1/2\lambda M/8\leq v^{1/2}, then

    M1λM/32v1/2B(u,,L2)1/2dun1/2λ3/2210.M^{-1}\int_{\lambda M/32}^{v^{1/2}}\mathcal{H}_{B}(u,\mathcal{F},\|\cdot\|_{L^{2}})^{1/2}du\leq\frac{n^{1/2}\lambda^{3/2}}{2^{10}}.

Then

(supf1ni=1n(f(𝐱i)𝔼[f(𝐱i)])M)3exp((1λ)nM22(4v+MF/3)).\mathbb{P}\left(\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}(f(\mathbf{x}_{i})-\mathbb{E}[f(\mathbf{x}_{i})])\geq M\right)\leq 3\exp\left(-(1-\lambda)\frac{nM^{2}}{2(4v+MF/3)}\right).
Proof of Proposition 2.

Let C1,C2,C3C_{1},C_{2},C_{3} be constants defined in Proposition 2. Set ϵn2=max{2an,27δn/C1}\epsilon^{2}_{n}=\max\{2a_{n},2^{7}\delta_{n}/C_{1}\}. For each n\mathcal{F}_{n} defined in (A2), define

En(f)=1ni=1n[ϕ(yifn(𝐱i))ϕ(yif(𝐱i))𝔼[ϕ(yifn(𝐱i))ϕ(yif(𝐱i))]].\displaystyle E_{n}(f)=\frac{1}{n}\sum_{i=1}^{n}\left[\phi(y_{i}f_{n}(\mathbf{x}_{i}))-\phi(y_{i}f(\mathbf{x}_{i}))-\mathbb{E}\left[\phi(y_{i}f_{n}(\mathbf{x}_{i}))-\phi(y_{i}f(\mathbf{x}_{i}))\right]\right]. (57)

Note that \mathcal{E}_{\phi}(f_{n},f_{\phi}^{*})\leq a_{n} by (A2). Since \widehat{f}_{\phi,n} minimizes the empirical \phi-risk \frac{1}{n}\sum_{i=1}^{n}\phi(y_{i}f(\mathbf{x}_{i})) over \mathcal{F}_{n}, we have

(ϕ(f^ϕ,n,fϕ)ϵn2)([supfn:ϕ(f,fϕ)ϵn21ni=1n[ϕ(yifn(𝐱i))ϕ(yif(𝐱i))]]0).\mathbb{P}\left(\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*})\geq\epsilon_{n}^{2}\right)\leq\mathbb{P}\left(\left[\sup_{f\in\mathcal{F}_{n}:\mathcal{E}_{\phi}(f,f_{\phi}^{*})\geq\epsilon_{n}^{2}}\frac{1}{n}\sum_{i=1}^{n}\left[\phi(y_{i}f_{n}(\mathbf{x}_{i}))-\phi(y_{i}f(\mathbf{x}_{i}))\right]\right]\geq 0\right).

We decompose the set {fn:ϕ(f,fϕ)ϵn2}\left\{f\in\mathcal{F}_{n}:\mathcal{E}_{\phi}(f,f_{\phi}^{*})\geq\epsilon_{n}^{2}\right\} into disjoint subsets {n,i}\{\mathcal{F}_{n,i}\} for i=1,,ini=1,...,i_{n} in the form of

n,i={fn:2i1ϵn2ϕ(f,fϕ)<2iϵn2}.\mathcal{F}_{n,i}=\left\{f\in\mathcal{F}_{n}:2^{i-1}\epsilon_{n}^{2}\leq\mathcal{E}_{\phi}(f,f_{\phi}^{*})<2^{i}\epsilon_{n}^{2}\right\}.

Note that fLFn\|f\|_{L^{\infty}}\leq F_{n} and ϕ(fϕ)ϕ(f)\mathcal{E}_{\phi}(f_{\phi}^{*})\leq\mathcal{E}_{\phi}(f) for all fnf\in\mathcal{F}_{n}. Therefore, for any fnf\in\mathcal{F}_{n}, we have

ϕ(f,fϕ)ϕ(f)=𝔼[ϕ(yf(𝐱))]C1𝔼[|f(𝐱)|]C1Fn,\mathcal{E}_{\phi}(f,f^{*}_{\phi})\leq\mathcal{E}_{\phi}(f)=\mathbb{E}[\phi(yf(\mathbf{x}))]\leq C_{1}\mathbb{E}[|f(\mathbf{x})|]\leq C_{1}F_{n},

which implies n,i\mathcal{F}_{n,i} is an empty set for 2i1ϵn2>C1Fn2^{i-1}\epsilon_{n}^{2}>C_{1}F_{n}. We set in=inf{i:2i1ϵn2>C1Fn}i_{n}=\inf\{i\in\mathbb{N}:2^{i-1}\epsilon_{n}^{2}>C_{1}F_{n}\}. Then we have {fn:ϕ(f,fϕ)ϵn2}=i=1inn,i\left\{f\in\mathcal{F}_{n}:\mathcal{E}_{\phi}(f,f_{\phi}^{*})\geq\epsilon_{n}^{2}\right\}=\bigcup_{i=1}^{i_{n}}\mathcal{F}_{n,i}. Since ϕ(fn,fϕ)anϵn2/2\mathcal{E}_{\phi}(f_{n},f_{\phi}^{*})\leq a_{n}\leq\epsilon_{n}^{2}/2, we have

inffn,i𝔼[ϕ(yf(𝐱))ϕ(yfn(𝐱))]=inffn,i[ϕ(f,fϕ)ϕ(fn,fϕ)]2i2ϵn2.\inf_{f\in\mathcal{F}_{n,i}}\mathbb{E}\left[\phi(yf(\mathbf{x}))-\phi(yf_{n}(\mathbf{x}))\right]=\inf_{f\in\mathcal{F}_{n,i}}\left[\mathcal{E}_{\phi}(f,f_{\phi}^{*})-\mathcal{E}_{\phi}(f_{n},f_{\phi}^{*})\right]\geq 2^{i-2}\epsilon_{n}^{2}.

Denote Mn,i=2i2ϵn2M_{n,i}=2^{i-2}\epsilon_{n}^{2}. We have

\mathbb{P}\left(\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*})\geq\epsilon_{n}^{2}\right)\leq\sum_{i=1}^{i_{n}}\mathbb{P}\left(\sup_{f\in\mathcal{F}_{n,i}}E_{n}(f)\geq\mathbb{E}\left[\phi(yf(\mathbf{x}))-\phi(yf_{n}(\mathbf{x}))\right]\right)
\leq\sum_{i=1}^{i_{n}}\mathbb{P}\left(\sup_{f\in\mathcal{F}_{n,i}}E_{n}(f)\geq M_{n,i}\right),

where E_{n}(f) is defined in (57). We will bound each summand on the right-hand side using Lemma 19. First, using (A4), we have

supfn,i𝔼[(ϕ(yf(𝐱))ϕ(yfn(𝐱)))2]\displaystyle\sup_{f\in\mathcal{F}_{n,i}}\mathbb{E}\left[\left(\phi(yf(\mathbf{x}))-\phi(yf_{n}(\mathbf{x}))\right)^{2}\right]
\displaystyle\leq 2supfn,i𝔼[(ϕ(yf(𝐱))ϕ(yfϕ(𝐱)))2+(ϕ(yfn(𝐱))ϕ(yfϕ(𝐱)))2]\displaystyle 2\sup_{f\in\mathcal{F}_{n,i}}\mathbb{E}\left[\left(\phi(yf(\mathbf{x}))-\phi(yf_{\phi}^{*}(\mathbf{x}))\right)^{2}+\left(\phi(yf_{n}(\mathbf{x}))-\phi(yf_{\phi}^{*}(\mathbf{x}))\right)^{2}\right]
\leq 2C_{2}e^{F_{n}}F_{n}^{2-\nu}\left(\sup_{f\in\mathcal{F}_{n,i}}\mathcal{E}_{\phi}(f,f_{\phi}^{*})^{\nu}+\mathcal{E}_{\phi}(f_{n},f_{\phi}^{*})^{\nu}\right)
\displaystyle\leq 2C2eFnFn2ν((2iϵn2)ν+(ϵn2/2)ν)4ν+1C2eFnFn2ν((2i2ϵn2)ν)\displaystyle 2C_{2}e^{F_{n}}F_{n}^{2-\nu}\left(\left(2^{i}\epsilon_{n}^{2}\right)^{\nu}+\left(\epsilon_{n}^{2}/2\right)^{\nu}\right)\leq 4^{\nu+1}C_{2}e^{F_{n}}F_{n}^{2-\nu}\left(\left(2^{i-2}\epsilon_{n}^{2}\right)^{\nu}\right)
=\displaystyle= 4ν+1C2eFnFn2νMn,iν.\displaystyle 4^{\nu+1}C_{2}e^{F_{n}}F_{n}^{2-\nu}M_{n,i}^{\nu}.

Define 𝒢n,i={g=ϕ(yfn(𝐱))ϕ(yf(𝐱)):fn,i}\mathcal{G}_{n,i}=\left\{g=\phi(yf_{n}(\mathbf{x}))-\phi(yf(\mathbf{x})):f\in\mathcal{F}_{n,i}\right\}. For any g𝒢n,ig\in\mathcal{G}_{n,i}, Var(g)4ν+1C2eFnFn2νMn,iν\operatorname{{\rm Var}}(g)\leq 4^{\nu+1}C_{2}e^{F_{n}}F_{n}^{2-\nu}M_{n,i}^{\nu}. Since fn,fnf_{n},f\in\mathcal{F}_{n}, we have gLC1|fnf|2C1Fn\|g\|_{L^{\infty}}\leq C_{1}|f_{n}-f|\leq 2C_{1}F_{n}.

To apply Lemma 19 on 𝒢n,i\mathcal{G}_{n,i}, we set λ=1/2,F=D1Fn,M=Mn,i\lambda=1/2,\ F=D_{1}F_{n},\ M=M_{n,i} and v=vn,i=D2Fn2νMn,iνv=v_{n,i}=D_{2}F_{n}^{2-\nu}M_{n,i}^{\nu} with

D1=18(2C1)1νD2,D2=max{41+νC2eFn,64(2C1)2ν}.D_{1}=\frac{1}{8(2C_{1})^{1-\nu}}D_{2},\quad D_{2}=\max\left\{4^{1+\nu}C_{2}e^{F_{n}},64(2C_{1})^{2-\nu}\right\}.

Since D_{1}\geq 2C_{1} and D_{2}\geq 4^{1+\nu}C_{2}e^{F_{n}}, we have \sup_{g\in\mathcal{G}_{n,i}}\|g\|_{L^{\infty}}\leq F and \sup_{g\in\mathcal{G}_{n,i}}\operatorname{{\rm Var}}(g)\leq v_{n,i}. We first verify (B2). Since M_{n,i}\leq 2C_{1}F_{n} and D_{2}\geq 64(2C_{1})^{2-\nu},

vF2=vn,iD12Fn264(2C1)22νD2Fn2ν(2C1Fn)νD22Fn2=64(2C1)2νD21,\displaystyle\frac{v}{F^{2}}=\frac{v_{n,i}}{D_{1}^{2}F_{n}^{2}}\leq\frac{64(2C_{1})^{2-2\nu}D_{2}F_{n}^{2-\nu}(2C_{1}F_{n})^{\nu}}{D_{2}^{2}F_{n}^{2}}=\frac{64(2C_{1})^{2-\nu}}{D_{2}}\leq 1,
M_{n,i}=M_{n,i}^{1-\nu}M_{n,i}^{\nu}\leq(2C_{1}F_{n})^{1-\nu}M_{n,i}^{\nu}=\frac{D_{2}F_{n}^{2-\nu}M_{n,i}^{\nu}}{8D_{1}F_{n}}=\frac{v_{n,i}}{8D_{1}F_{n}}=\frac{\lambda v_{n,i}}{4F}

when FnF_{n} is large enough. Thus (B2) is satisfied.

For (B3), note that for g1=ϕ(yfn(𝐱))ϕ(yf1(𝐱)),g2=ϕ(yfn(𝐱))ϕ(yf2(𝐱))g_{1}=\phi(yf_{n}(\mathbf{x}))-\phi(yf_{1}(\mathbf{x})),g_{2}=\phi(yf_{n}(\mathbf{x}))-\phi(yf_{2}(\mathbf{x})) where f1,f2n,if_{1},f_{2}\in\mathcal{F}_{n,i}, |g1g2|C1|f1f2||g_{1}-g_{2}|\leq C_{1}|f_{1}-f_{2}|. We have

B(δ,𝒢n,i,L2)B(C1δ,n,i,L2)B(C1δ,n,L2),\mathcal{H}_{B}(\delta,\mathcal{G}_{n,i},\|\cdot\|_{L^{2}})\leq\mathcal{H}_{B}(C_{1}\delta,\mathcal{F}_{n,i},\|\cdot\|_{L^{2}})\leq\mathcal{H}_{B}(C_{1}\delta,\mathcal{F}_{n},\|\cdot\|_{L^{2}}),

where the second inequality comes from n,in\mathcal{F}_{n,i}\subset\mathcal{F}_{n}. Since Mn,i1λMn,i/32vn,i1/2(B(τ,𝒢n,i,L2))1/2dτM_{n,i}^{-1}\displaystyle\int_{\lambda M_{n,i}/32}^{v_{n,i}^{1/2}}\left(\mathcal{H}_{B}(\tau,\mathcal{G}_{n,i},\|\cdot\|_{L^{2}})\right)^{1/2}d\tau is a non-increasing function of ii,

Mn,i1λMn,i/32vn,i1/2(B(τ,𝒢n,i,L2))1/2dτ\displaystyle M_{n,i}^{-1}\int_{\lambda M_{n,i}/32}^{v_{n,i}^{1/2}}\left(\mathcal{H}_{B}(\tau,\mathcal{G}_{n,i},\|\cdot\|_{L^{2}})\right)^{1/2}d\tau
\displaystyle\leq Mn,11Mn,1/64vn,11/2(B(C1τ,n,L2))1/2dτ\displaystyle M_{n,1}^{-1}\int_{M_{n,1}/64}^{v_{n,1}^{1/2}}\left(\mathcal{H}_{B}(C_{1}\tau,\mathcal{F}_{n},\|\cdot\|_{L^{2}})\right)^{1/2}d\tau
\displaystyle\leq Mn,11vn,11/2(B(C1ϵn2/128,n,L2))1/2\displaystyle M_{n,1}^{-1}v_{n,1}^{1/2}\left(\mathcal{H}_{B}(C_{1}\epsilon_{n}^{2}/128,\mathcal{F}_{n},\|\cdot\|_{L^{2}})\right)^{1/2}
\displaystyle\leq (D2Fn2ν)1/2Mn,1ν/21(C3eFnn(27C1ϵn2Fn1)2ν)1/2\displaystyle(D_{2}F_{n}^{2-\nu})^{1/2}M_{n,1}^{\nu/2-1}\left(C_{3}e^{-F_{n}}n\left(2^{-7}C_{1}\epsilon_{n}^{2}F_{n}^{-1}\right)^{2-\nu}\right)^{1/2}
\displaystyle\leq C31/2C11ν/2D21/2Fn1ν/2eFn/2(ϵn2/2)ν/21(27ν/27n1/2ϵn2νFnν/21)\displaystyle C_{3}^{1/2}C_{1}^{1-\nu/2}D_{2}^{1/2}F_{n}^{1-\nu/2}e^{-F_{n}/2}(\epsilon_{n}^{2}/2)^{\nu/2-1}\left(2^{7\nu/2-7}n^{1/2}\epsilon_{n}^{2-\nu}F_{n}^{\nu/2-1}\right)
=\displaystyle= (26ν12eFnC3C12νD2)1/2n1/2,\displaystyle\left(2^{6\nu-12}e^{-F_{n}}C_{3}C_{1}^{2-\nu}D_{2}\right)^{1/2}n^{1/2},

where in the third inequality we used (A5). (B3) is satisfied when FnF_{n} is large enough.

To verify (B1), we use (B2) and (B3). From (B2), since vn,i1/2Fv_{n,i}^{1/2}\leq F, we have Mn,iλvn,i/(4F)18vn,i1/2M_{n,i}\leq\lambda v_{n,i}/(4F)\leq\frac{1}{8}v_{n,i}^{1/2} which implies vn,i1/28Mn,i>Mn,i/16=λMn,i/8v_{n,i}^{1/2}\geq 8M_{n,i}>M_{n,i}/16=\lambda M_{n,i}/8. Thus the condition in (B3) is satisfied. From (B3), we derive

(B(vn,i1/2,𝒢n,i,L2))1/2Mn,ivn,i1/2Mn,i/64Mn.i1Mn,i/64vn,i1/2(B(τ,𝒢n,i,L2))1/2dτ\displaystyle\left(\mathcal{H}_{B}\left(v_{n,i}^{1/2},\mathcal{G}_{n,i},\|\cdot\|_{L^{2}}\right)\right)^{1/2}\leq\frac{M_{n,i}}{v_{n,i}^{1/2}-M_{n,i}/64}M_{n.i}^{-1}\int_{M_{n,i}/64}^{v_{n,i}^{1/2}}\left(\mathcal{H}_{B}(\tau,\mathcal{G}_{n,i},\|\cdot\|_{L^{2}})\right)^{1/2}d\tau
Mn,ivn,i1/2Mn,i/64223/2n1/243Mn,ivn,i1/2223/2n1/2=13219/2Mn,ivn,i1/2n1/2,\displaystyle\leq\frac{M_{n,i}}{v_{n,i}^{1/2}-M_{n,i}/64}\cdot 2^{-23/2}n^{1/2}\leq\frac{4}{3}\frac{M_{n,i}}{v_{n,i}^{1/2}}\cdot 2^{-23/2}n^{1/2}=\frac{1}{3\cdot 2^{19/2}}\frac{M_{n,i}}{v_{n,i}^{1/2}}n^{1/2},

where in the third inequality we used Mn,i16vn,i1/2M_{n,i}\leq 16v_{n,i}^{1/2}. Again from (B2), Mn,iλvn,i/(4F)=vn,i/(8F)M_{n,i}\leq\lambda v_{n,i}/(4F)=v_{n,i}/(8F). Therefore

λMn,i2n8(4vn,i+Mn,iF/3)\displaystyle\frac{\lambda M_{n,i}^{2}n}{8(4v_{n,i}+M_{n,i}F/3)} =Mn,i2n16(4vn,i+Mn,iF/3)Mn,i2n(64+2/3)vn,i\displaystyle=\frac{M_{n,i}^{2}n}{16(4v_{n,i}+M_{n,i}F/3)}\geq\frac{M_{n,i}^{2}n}{(64+2/3)v_{n,i}}
Mn,i2n9219vn,iB(vn,i1/2,𝒢n,i,L2)\displaystyle\geq\frac{M_{n,i}^{2}n}{9\cdot 2^{19}v_{n,i}}\geq\mathcal{H}_{B}\left(v_{n,i}^{1/2},\mathcal{G}_{n,i},\|\cdot\|_{L^{2}}\right)

and (B1) is verified.

Applying Lemma 19 to each \mathcal{G}_{n,i}, we get

(ϕ(f^ϕ,n,fϕ)ϵn2)i=1in3exp(nMn,i24(4vn,i+Mn,iF/3))\displaystyle\mathbb{P}\left(\mathcal{E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*})\geq\epsilon_{n}^{2}\right)\leq\sum_{i=1}^{i_{n}}3\exp\left(-\frac{nM_{n,i}^{2}}{4\left(4v_{n,i}+M_{n,i}F/3\right)}\right)
\displaystyle\leq i=13exp(C4nMn,i2/vn,i)i=13exp(C5(2i)2νeFnn(ϵn2/Fn)2ν)\displaystyle\sum_{i=1}^{\infty}3\exp\left(-C_{4}nM_{n,i}^{2}/v_{n,i}\right)\leq\sum_{i=1}^{\infty}3\exp\left(-C_{5}\left(2^{i}\right)^{2-\nu}e^{-F_{n}}n\left(\epsilon_{n}^{2}/F_{n}\right)^{2-\nu}\right)
\displaystyle\leq C6exp(C5eFnn(ϵn2/Fn)2ν).\displaystyle C_{6}\exp\left(-C_{5}e^{-F_{n}}n\left(\epsilon_{n}^{2}/F_{n}\right)^{2-\nu}\right).

D.2 Proof of Lemma 2

The proof of Lemma 2 consists of two steps. We first show that there exists a composition of networks f~ϕ,n\widetilde{f}_{\phi,n} such that f~ϕ,nfϕ,nL4eFnε\|\widetilde{f}_{\phi,n}-f_{\phi,n}^{*}\|_{L^{\infty}}\leq 4e^{F_{n}}\varepsilon. Then we show that f~ϕ,n\widetilde{f}_{\phi,n} can be realized by a ConvResNet f¯ϕ,n\bar{f}_{\phi,n}.

Lemma 20 shows the existence of f~ϕ,n\widetilde{f}_{\phi,n}.

Lemma 20.

Assume Assumptions 1 and 2 hold. Let 0<p,q\leq\infty, 0<s<\infty and s\geq d/p+1. There exists a network composition architecture

~(Fn)={gFnh~ngnη¯}\displaystyle\widetilde{\mathcal{F}}^{(F_{n})}=\{g_{F_{n}}\circ\widetilde{h}_{n}\circ g_{n}\circ\bar{\eta}\} (58)

where η¯𝒞(M,L,J,K,κ1,κ2)\bar{\eta}\in\mathcal{C}(M,L,J,K,\kappa_{1},\kappa_{2}) with

M=O(εd/s),L=O(log(1/ε)+D+logD),J=O(D),κ1=O(1),logκ2=O(log2(1/ε)),\displaystyle M=O(\varepsilon^{-d/s}),\ L=O(\log(1/\varepsilon)+D+\log D),\ J=O(D),\ \kappa_{1}=O(1),\ \log\kappa_{2}=O\left(\log^{2}(1/\varepsilon)\right),

h~nMLP(L2,J2,κ2,Fn)\widetilde{h}_{n}\in\mathcal{F}^{\rm MLP}(L_{2},J_{2},\kappa_{2},F_{n}) with

L2=O(log(eFn/ε)),J2=O(eFnε1log(eFn/ε)),κ2=O(eFn),\displaystyle L_{2}=O(\log(e^{F_{n}}/\varepsilon)),\ J_{2}=O(e^{F_{n}}\varepsilon^{-1}\log(e^{F_{n}}/\varepsilon)),\ \kappa_{2}=O(e^{F_{n}}),

and

gn(z)=ReLU(ReLU(z+eFn1+eFn)+eFn11+eFn)+11+eFn,\displaystyle g_{n}(z)=\mathrm{ReLU}\left(-\mathrm{ReLU}\left(-z+\frac{e^{F_{n}}}{1+e^{F_{n}}}\right)+\frac{e^{F_{n}}-1}{1+e^{F_{n}}}\right)+\frac{1}{1+e^{F_{n}}},
gFn=ReLU(ReLU(z+Fn)+2Fn)Fn.\displaystyle g_{F_{n}}=\mathrm{ReLU}\left(-\mathrm{ReLU}\left(-z+F_{n}\right)+2F_{n}\right)-F_{n}.

For any ηBp,qs()\eta\in B_{p,q}^{s}(\mathcal{M}) with ηBp,qs()c0\|\eta\|_{B_{p,q}^{s}(\mathcal{M})}\leq c_{0} for some constant c0c_{0}, let fϕ,nf_{\phi,n}^{*} be defined as in (33). For any nn and ε(0,1)\varepsilon\in(0,1), there exists a composition of networks f~ϕ,n~(Fn)\widetilde{f}_{\phi,n}\in\widetilde{\mathcal{F}}^{(F_{n})} such that

f~ϕ,nfϕ,nL4eFnε.\displaystyle\|\widetilde{f}_{\phi,n}-f_{\phi,n}^{*}\|_{L^{\infty}}\leq 4e^{F_{n}}\varepsilon.
Proof of Lemma 20.

According to Theorem 1, for any ε1(0,1)\varepsilon_{1}\in(0,1) and K[2,D]K\in[2,D], there is a ConvResNet architecture 𝒞(M,L,J,K,κ1,κ2)\mathcal{C}(M,L,J,K,\kappa_{1},\kappa_{2}) with

M=O(ε1d/s),L=O(log(1/ε1)+D+logD),J=O(D),κ1=O(1),logκ2=O(log2(1/ε)),\displaystyle M=O(\varepsilon_{1}^{-d/s}),\ L=O(\log(1/\varepsilon_{1})+D+\log D),\ J=O(D),\ \kappa_{1}=O(1),\ \log\kappa_{2}=O\left(\log^{2}(1/\varepsilon)\right),

such that there exists \bar{\eta}\in\mathcal{C}(M,L,J,K,\kappa_{1},\kappa_{2}) with \|\bar{\eta}-\eta\|_{L^{\infty}}\leq\varepsilon_{1}.

Since fϕ=logη1ηf_{\phi}^{*}=\log\frac{\eta}{1-\eta}, we have fϕ,n=logηn1ηnf_{\phi,n}^{*}=\log\frac{\eta_{n}}{1-\eta_{n}} with

ηn(𝐱)={11+eFn, if η(𝐱)<11+eFn,η(𝐱), if 11+eFnη(𝐱)eFn1+eFn,eFn1+eFn, if η(𝐱)>eFn1+eFn.\displaystyle\eta_{n}(\mathbf{x})=\begin{cases}\frac{1}{1+e^{F_{n}}},&\mbox{ if }\eta(\mathbf{x})<\frac{1}{1+e^{F_{n}}},\\ \eta(\mathbf{x}),&\mbox{ if }\frac{1}{1+e^{F_{n}}}\leq\eta(\mathbf{x})\leq\frac{e^{F_{n}}}{1+e^{F_{n}}},\\ \frac{e^{F_{n}}}{1+e^{F_{n}}},&\mbox{ if }\eta(\mathbf{x})>\frac{e^{F_{n}}}{1+e^{F_{n}}}.\end{cases}

The function max(min(z,eFn1+eFn),11+eFn)\max(\min(z,\frac{e^{F_{n}}}{1+e^{F_{n}}}),\frac{1}{1+e^{F_{n}}}) can be realized by gn(z)=ReLU(ReLU(z+eFn1+eFn)+eFn11+eFn)+11+eFng_{n}(z)=\mathrm{ReLU}\left(-\mathrm{ReLU}\left(-z+\frac{e^{F_{n}}}{1+e^{F_{n}}}\right)+\frac{e^{F_{n}}-1}{1+e^{F_{n}}}\right)+\frac{1}{1+e^{F_{n}}}. Then gnη¯g_{n}\circ\bar{\eta} is an approximation of ηn\eta_{n}.
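A direct numerical check (our own sketch, with an arbitrary value of F_{n}) that the two-ReLU expressions for g_{n} and g_{F_{n}} implement the corresponding clipping operations is given below.

import numpy as np

relu = lambda z: np.maximum(z, 0.0)
Fn = 3.0
lo, hi = 1.0 / (1.0 + np.exp(Fn)), np.exp(Fn) / (1.0 + np.exp(Fn))

# the two-ReLU forms of g_n and g_{F_n}; note hi - lo = (e^{F_n} - 1) / (1 + e^{F_n})
g_n = lambda z: relu(-relu(-z + hi) + hi - lo) + lo
g_Fn = lambda z: relu(-relu(-z + Fn) + 2.0 * Fn) - Fn

z = np.linspace(-2.0, 2.0, 401)
assert np.allclose(g_n(z), np.clip(z, lo, hi))     # g_n clips to [1/(1+e^{F_n}), e^{F_n}/(1+e^{F_n})]
assert np.allclose(g_Fn(z), np.clip(z, -Fn, Fn))   # g_{F_n} clips to [-F_n, F_n]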

Define

hn=max(min(log(z1z),Fn),Fn)h_{n}=\max\left(\min\left(\log\left(\frac{z}{1-z}\right),F_{n}\right),-F_{n}\right)

for z[0,1]z\in[0,1] which is a Lipschitz function with Lipschitz constant (1+eFn)2/eFn(1+e^{F_{n}})^{2}/e^{F_{n}}. According to Chen et al. (2019b, Theorem 4.1), there exists an MLP h~MLP(L,J,κ)\widetilde{h}\in\mathcal{F}^{\rm MLP}(L,J,\kappa) such that h~hneFn(1+eFn)2Lε2eFn(1+eFn)2\|\widetilde{h}-h_{n}\cdot\frac{e^{F_{n}}}{(1+e^{F_{n}})^{2}}\|_{L^{\infty}}\leq\varepsilon_{2}\frac{e^{F_{n}}}{(1+e^{F_{n}})^{2}} with

L=O(log(eFn/ε2)),J=O(eFnε21log(eFn/ε2)),κ=1.\displaystyle L=O(\log(e^{F_{n}}/\varepsilon_{2})),\ J=O(e^{F_{n}}\varepsilon_{2}^{-1}\log(e^{F_{n}}/\varepsilon_{2})),\ \kappa=1.

Let \widetilde{h}_{n}=\frac{(1+e^{F_{n}})^{2}}{e^{F_{n}}}\widetilde{h}. Then \widetilde{h}_{n}\in\mathcal{F}^{\rm MLP}(L_{2},J_{2},\kappa_{2}) satisfies \|\widetilde{h}_{n}-h_{n}\|_{L^{\infty}}\leq\varepsilon_{2} with

L2=O(log(eFn/ε2)),J2=O(eFnε21log(eFn/ε2)),κ2=O(eFn).\displaystyle L_{2}=O(\log(e^{F_{n}}/\varepsilon_{2})),\ J_{2}=O(e^{F_{n}}\varepsilon_{2}^{-1}\log(e^{F_{n}}/\varepsilon_{2})),\ \kappa_{2}=O(e^{F_{n}}).

Let gFn=ReLU(ReLU(z+Fn)+2Fn)Fn.g_{F_{n}}=\mathrm{ReLU}\left(-\mathrm{ReLU}\left(-z+F_{n}\right)+2F_{n}\right)-F_{n}. We define

f~ϕ,n=gFnh~ngnη¯\displaystyle\widetilde{f}_{\phi,n}=g_{F_{n}}\circ\widetilde{h}_{n}\circ g_{n}\circ\bar{\eta}

as an approximation of fϕ,nf_{\phi,n}^{*}. Then the error of f~ϕ,n\widetilde{f}_{\phi,n} can be decomposed as

f~ϕ,nfϕ,nL\displaystyle\|\widetilde{f}_{\phi,n}-f_{\phi,n}^{*}\|_{L^{\infty}} =(gFnh~n)(gnη¯)hnηnL\displaystyle=\|(g_{F_{n}}\circ\widetilde{h}_{n})\circ(g_{n}\circ\bar{\eta})-h_{n}\circ\eta_{n}\|_{L^{\infty}}
(gFnh~n)(gnη¯)hn(gnη¯)L+hn(gnη¯)hnηnL\displaystyle\leq\|(g_{F_{n}}\circ\widetilde{h}_{n})\circ(g_{n}\circ\bar{\eta})-h_{n}\circ(g_{n}\circ\bar{\eta})\|_{L^{\infty}}+\|h_{n}\circ(g_{n}\circ\bar{\eta})-h_{n}\circ\eta_{n}\|_{L^{\infty}}
ε2+(1+eFn)2eFnηngnη¯L(1+eFn)2eFnε1+ε2.\displaystyle\leq\varepsilon_{2}+\frac{(1+e^{F_{n}})^{2}}{e^{F_{n}}}\|\eta_{n}-g_{n}\circ\bar{\eta}\|_{L^{\infty}}\leq\frac{(1+e^{F_{n}})^{2}}{e^{F_{n}}}\varepsilon_{1}+\varepsilon_{2}.

Choosing ε2=(1+eFn)2eFnε1\varepsilon_{2}=\frac{(1+e^{F_{n}})^{2}}{e^{F_{n}}}\varepsilon_{1} gives rise to f~ϕ,nfϕ,nL4eFnε1\|\widetilde{f}_{\phi,n}-f_{\phi,n}^{*}\|_{L^{\infty}}\leq 4e^{F_{n}}\varepsilon_{1}. With this choice, we have

L2=O(log(1/ε1)),J2=O(ε11log(1/ε1)),κ2=O(eFn).\displaystyle L_{2}=O(\log(1/\varepsilon_{1})),\ J_{2}=O(\varepsilon_{1}^{-1}\log(1/\varepsilon_{1})),\ \kappa_{2}=O(e^{F_{n}}).

Setting ε1=ε\varepsilon_{1}=\varepsilon proves the lemma. ∎

To show that ~(Fn)\widetilde{\mathcal{F}}^{(F_{n})} can be realized by a ConvResNet class and to derive its covering number, we need the following lemma to bound the covering number of ConvResNets.

Lemma 21.

Let 𝒞(M,L,J,K,κ1,κ2)\mathcal{C}(M,L,J,K,\kappa_{1},\kappa_{2}) be the ConvResNet structure defined in Theorem 1. Its covering number is bounded by

\log\mathcal{N}\left(\delta,\mathcal{C}(M,L,J,K,\kappa_{1},\kappa_{2}),\|\cdot\|_{L^{\infty}}\right)=O\left(D^{3}\varepsilon^{-d/s}\log(1/\varepsilon)\left(\log^{2}(1/\varepsilon)+\log D+\log(1/\delta)\right)\right).

Lemma 21 is proved based on the following lemma:

Lemma 22 (Lemma 4 of Oono and Suzuki (2019)).

Let 𝒞(M,L,J,K,κ1,κ2)\mathcal{C}(M,L,J,K,\kappa_{1},\kappa_{2}) be a class of ConvResNet architecture from D\mathbb{R}^{D} to \mathbb{R}. Let κ=κ1κ2\kappa=\kappa_{1}\vee\kappa_{2}. For δ>0\delta>0, we have

𝒩(δ,𝒞(M,L,J,K,κ1,κ2),L)(2κΛ1/δ)Λ2,\mathcal{N}(\delta,\mathcal{C}(M,L,J,K,\kappa_{1},\kappa_{2}),\|\cdot\|_{L^{\infty}})\leq(2\kappa\Lambda_{1}/\delta)^{\Lambda_{2}},

where

Λ1=(8M+12)D2(1κ2)(1κ1)ρ~ρ~+,Λ2=ML(16D2K+4D)+4D2+1\displaystyle\Lambda_{1}=(8M+12)D^{2}(1\vee\kappa_{2})(1\vee\kappa_{1})\widetilde{\rho}\widetilde{\rho}^{+},\ \Lambda_{2}=ML(16D^{2}K+4D)+4D^{2}+1

with ρ~=(1+ρ)M,ρ~+=1+MLρ+,ρ=(4DKκ1)L\widetilde{\rho}=(1+\rho)^{M},\widetilde{\rho}^{+}=1+ML\rho^{+},\rho=(4DK\kappa_{1})^{L} and ρ+=(14DKκ1)L\rho^{+}=(1\vee 4DK\kappa_{1})^{L}.

Proof of Lemma 21.

According to Lemma 22,

log𝒩(δ,𝒞(M,L,J,K,κ1,κ2),L)Λ2log(2κΛ1/δ).\displaystyle\log\mathcal{N}\left(\delta,\mathcal{C}(M,L,J,K,\kappa_{1},\kappa_{2}),\|\cdot\|_{L^{\infty}}\right)\leq\Lambda_{2}\log(2\kappa\Lambda_{1}/\delta).

In the ConvResNet architecture defined in Theorem 1, κ1=(8DK)1M1/L\kappa_{1}=(8DK)^{-1}M^{-1/L}, ρ=(1/2)LM1<M1\rho=(1/2)^{L}M^{-1}<M^{-1}. We have ρ~=(1+ρ)M(1+M1)Me\widetilde{\rho}=(1+\rho)^{M}\leq(1+M^{-1})^{M}\leq e. Moreover, we have ρ+=1,ρ~+=1+ML\rho^{+}=1,\widetilde{\rho}^{+}=1+ML. Since logκ2=O(log2(1/ε))\log\kappa_{2}=O(\log^{2}(1/\varepsilon)), substituting M=O(εd/s)M=O\left(\varepsilon^{-d/s}\right) and L=O(log(1/ε)+D+logD)L=O(\log(1/\varepsilon)+D+\log D) gives rise to logΛ1=O(log2(1/ε)+logD)\log\Lambda_{1}=O(\log^{2}(1/\varepsilon)+\log D) and Λ2=O(D3εd/slog(1/ε))\Lambda_{2}=O\left(D^{3}\varepsilon^{-d/s}\log(1/\varepsilon)\right). Therefore,

log𝒩(δ,𝒞(M,L,J,K,κ1,κ2),L)=O(D3εd/slog(1/ε)(log2(1/ε)+logD+log(1/δ)).\log\mathcal{N}\left(\delta,\mathcal{C}(M,L,J,K,\kappa_{1},\kappa_{2}),\|\cdot\|_{L^{\infty}}\right)=O\left(D^{3}\varepsilon^{-d/s}\log(1/\varepsilon)(\log^{2}(1/\varepsilon)+\log D+\log(1/\delta)\right).

The constants hidden in O()O(\cdot) depend on d,s,2dspd,p,q,c0,τd,s,\frac{2d}{sp-d},p,q,c_{0},\tau and the surface area of \mathcal{M}. ∎
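For intuition on how the bound of Lemma 22 is evaluated with the choice \kappa_{1}=(8DK)^{-1}M^{-1/L}, the following sketch (our own; the sizes D,K,M,L,\kappa_{2},\delta are placeholder values, not the ones required by Theorem 1) computes \rho,\widetilde{\rho},\Lambda_{1},\Lambda_{2} and the resulting covering number bound.

import numpy as np

# placeholder sizes, not the values required by Theorem 1
D, K, J, M, L = 16, 2, 32, 100, 12
kappa1 = (8 * D * K) ** (-1.0) * M ** (-1.0 / L)      # the choice of kappa_1 in Theorem 1
kappa2, delta = np.exp(5.0), 1e-3
kappa = max(kappa1, kappa2)

rho = (4 * D * K * kappa1) ** L                       # = (1/2)^L / M < 1/M
rho_plus = max(1.0, 4 * D * K * kappa1) ** L          # = 1
rho_tilde = (1.0 + rho) ** M                          # <= e
rho_tilde_plus = 1.0 + M * L * rho_plus

Lambda1 = (8 * M + 12) * D**2 * max(1.0, kappa2) * max(1.0, kappa1) * rho_tilde * rho_tilde_plus
Lambda2 = M * L * (16 * D**2 * K + 4 * D) + 4 * D**2 + 1
log_cover = Lambda2 * np.log(2 * kappa * Lambda1 / delta)   # the bound of Lemma 22
print(rho, rho_tilde, Lambda2, log_cover)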

The following lemma shows that \widetilde{\mathcal{F}}^{(F_{n})} can be realized by a ConvResNet class \mathcal{C}^{(F_{n})} and estimates the covering number of \mathcal{C}^{(F_{n})}.

Lemma 23.

Let 𝒞(Fn)\mathcal{C}^{(F_{n})} be defined as in Lemma 2. The network composition class ~(Fn)\widetilde{\mathcal{F}}^{(F_{n})} defined in Lemma 20 can be realized by a ConvResNet class 𝒞(Fn)\mathcal{C}^{(F_{n})}. Moreover, the covering number of 𝒞(Fn)\mathcal{C}^{(F_{n})} is bounded by

\log\mathcal{N}(\delta,\mathcal{C}^{(F_{n})},\|\cdot\|_{L^{\infty}})=O\left(\varepsilon^{-\left(\frac{d}{s}\vee 1\right)}\left(\log^{3}(1/\varepsilon)+F_{n}+\log(1/\delta)\right)\right).
Proof of Lemma 23.

In this proof, we show that each part of \widetilde{f}_{\phi,n} in Lemma 20 can be realized by ConvResNet architectures. Specifically, we show that \bar{\eta},g_{n} and \widetilde{h}_{n} can be realized by residual blocks \bar{\eta}^{(\mathrm{Conv})},\bar{g}_{n} and \bar{h}^{(\mathrm{Conv})}, respectively, and that g_{F_{n}} can be realized by a ConvResNet \bar{g}_{F_{n}}. In the following, we show the existence of each ingredient.

Realize η¯\bar{\eta} by residual blocks.

In Lemma 20, η¯𝒞(M(η),L(η),J(η),K(η),κ1(η),κ2(η))\bar{\eta}\in\mathcal{C}(M^{(\eta)},L^{(\eta)},J^{(\eta)},K^{(\eta)},\kappa^{(\eta)}_{1},\kappa^{(\eta)}_{2}) with

M^{(\eta)}=O\left(\varepsilon^{-d/s}\right),\ L^{(\eta)}=O(\log(1/\varepsilon)+D+\log D),\ J^{(\eta)}=O(D),\ \kappa^{(\eta)}_{1}=O(1),\ \log\kappa^{(\eta)}_{2}=O(\log^{2}(1/\varepsilon)).

By Lemma 21, the covering number of this architecture is bounded by

log𝒩(δ,𝒞(M(η),L(η),J(η),K(η),κ1(η),κ2(η)),L)\displaystyle\log\mathcal{N}\left(\delta,\mathcal{C}(M^{(\eta)},L^{(\eta)},J^{(\eta)},K^{(\eta)},\kappa^{(\eta)}_{1},\kappa^{(\eta)}_{2}),\|\cdot\|_{L^{\infty}}\right)
= O\left(\varepsilon^{-\left(\frac{d}{s}\vee 1\right)}\log(1/\varepsilon)\left(\log^{2}(1/\varepsilon)+\log D+\log(1/\delta)\right)\right).

Excluding the final fully connected layer, denote the residual blocks of \bar{\eta} by \bar{\eta}^{(\mathrm{Conv})}. Then \bar{\eta}^{(\mathrm{Conv})}\in\mathcal{C}^{(\eta)} with \mathcal{C}^{(\eta)}=\mathcal{C}^{\mathrm{Conv}}(M^{(\eta)},L^{(\eta)},J^{(\eta)},K^{(\eta)},\kappa^{(\eta)}_{1}) and

log𝒩(δ,𝒞(η),L)\displaystyle\log\mathcal{N}\left(\delta,\mathcal{C}^{(\eta)},\|\cdot\|_{L^{\infty}}\right) log𝒩(δ,𝒞(M(η),L(η),J(η),K(η),κ1(η),κ2(η)),L)\displaystyle\leq\log\mathcal{N}\left(\delta,\mathcal{C}(M^{(\eta)},L^{(\eta)},J^{(\eta)},K^{(\eta)},\kappa^{(\eta)}_{1},\kappa^{(\eta)}_{2}),\|\cdot\|_{L^{\infty}}\right)
=O(D3εd/slog(1/ε)(log2(1/ε)+logD+log(1/δ))).\displaystyle=O\left(D^{3}\varepsilon^{-d/s}\log(1/\varepsilon)(\log^{2}(1/\varepsilon)+\log D+\log(1/\delta))\right).

We denote the i-th row (the i-th element of all channels) of the output of the residual blocks \bar{\eta}^{(\mathrm{Conv})} by (\bar{\eta}^{(\mathrm{Conv})})_{i,:}. In the proof of Theorem 1, the input in \mathbb{R}^{D} is padded into \mathbb{R}^{D\times 3} by zeros, and the output of \bar{\eta}^{(\mathrm{Conv})} has the form (\bar{\eta}^{(\mathrm{Conv})})_{1,:}=\frac{\kappa_{1}^{(\eta)}}{\kappa_{2}^{(\eta)}}\begin{bmatrix}\star&\bar{\eta}_{+}&\bar{\eta}_{-}\end{bmatrix}. Here \star denotes a number that does not affect the result. In this proof, instead of padding the input into size D\times 3, we pad it into size D\times 8. The weights in the first M blocks of our network are the same as those of \bar{\eta}, except that we pad the filters and biases by zeros to be compatible with the additional channels. Then the output of \bar{\eta}^{(\mathrm{Conv})} is \frac{\kappa_{1}^{(\eta)}}{\kappa_{2}^{(\eta)}}\begin{bmatrix}\star&\bar{\eta}_{+}&\bar{\eta}_{-}&0&0&0&0&0\end{bmatrix}.

Realize gng_{n} by residual blocks.

To realize g_{n}, we add another block with four layers, with filters and biases \left\{\mathcal{W}_{g_{n}}^{(l)},B_{g_{n}}^{(l)}\right\}_{l=1}^{4}, where \mathcal{W}_{g_{n}}^{(l)}\in\mathbb{R}^{8\times 1\times 8},B_{g_{n}}^{(l)}\in\mathbb{R}^{8\times 8}. We set the parameters in the first layer as

(𝒲gn(1))2,1,:\displaystyle\left(\mathcal{W}_{g_{n}}^{(1)}\right)_{2,1,:} =[0κ2(η)κ1(η)000000],\displaystyle=\begin{bmatrix}0&\frac{\kappa_{2}^{(\eta)}}{\kappa_{1}^{(\eta)}}&0&0&0&0&0&0\end{bmatrix},
(𝒲gn(1))3,1,:\displaystyle\left(\mathcal{W}_{g_{n}}^{(1)}\right)_{3,1,:} =[00κ2(η)κ1(η)00000],\displaystyle=\begin{bmatrix}0&0&\frac{\kappa_{2}^{(\eta)}}{\kappa_{1}^{(\eta)}}&0&0&0&0&0\end{bmatrix},
(𝒲gn(1))i,1,:\displaystyle\left(\mathcal{W}_{g_{n}}^{(1)}\right)_{i,1,:} =𝟎 for i=1,4,5,,8,\displaystyle=\mathbf{0}\quad\mbox{ for }i=1,4,5,...,8,

and B_{g_{n}}^{(1)}=\mathbf{0}. This layer scales the output of \bar{\eta}^{(\mathrm{Conv})} back to \begin{bmatrix}\star&\bar{\eta}_{+}&\bar{\eta}_{-}&0&0&0&0&0\end{bmatrix}. Then we use the remaining three layers to realize g_{n}. The second layer is set as

(𝒲gn(2))4,1,:\displaystyle\left(\mathcal{W}_{g_{n}}^{(2)}\right)_{4,1,:} =[01100000],\displaystyle=\begin{bmatrix}0&-1&1&0&0&0&0&0\end{bmatrix},
(𝒲gn(2))i,1,:\displaystyle\left(\mathcal{W}_{g_{n}}^{(2)}\right)_{i,1,:} =𝟎 for i=1,2,3,5,6,7,8,\displaystyle=\mathbf{0}\quad\mbox{ for }i=1,2,3,5,6,7,8,
(Bgn(2))1,:\displaystyle\left(B_{g_{n}}^{(2)}\right)_{1,:} =[000eFn1+eFn0000],\displaystyle=\begin{bmatrix}0&0&0&\frac{e^{F_{n}}}{1+e^{F_{n}}}&0&0&0&0&\end{bmatrix},
(Bgn(2))i,:\displaystyle\left(B_{g_{n}}^{(2)}\right)_{i,:} =𝟎 for i=2,,8.\displaystyle=\mathbf{0}\quad\mbox{ for }i=2,...,8.

The third layer is set as

(𝒲gn(3))4,1,:\displaystyle\left(\mathcal{W}_{g_{n}}^{(3)}\right)_{4,1,:} =[00010000],\displaystyle=\begin{bmatrix}0&0&0&-1&0&0&0&0\end{bmatrix},
(𝒲gn(3))i,1,:\displaystyle\left(\mathcal{W}_{g_{n}}^{(3)}\right)_{i,1,:} =𝟎 for i=1,2,3,5,6,7,8,\displaystyle=\mathbf{0}\quad\mbox{ for }i=1,2,3,5,6,7,8,
(Bgn(3))1,:\displaystyle\left(B_{g_{n}}^{(3)}\right)_{1,:} =[000eFn11+eFn0000],\displaystyle=\begin{bmatrix}0&0&0&\frac{e^{F_{n}}-1}{1+e^{F_{n}}}&0&0&0&0&\end{bmatrix},
(Bgn(3))i,:\displaystyle\left(B_{g_{n}}^{(3)}\right)_{i,:} =𝟎 for i=2,,8.\displaystyle=\mathbf{0}\quad\mbox{ for }i=2,...,8.

The fourth layer is set as

(𝒲gn(4))4,1,:\displaystyle\left(\mathcal{W}_{g_{n}}^{(4)}\right)_{4,1,:} =[00010000],\displaystyle=\begin{bmatrix}0&0&0&1&0&0&0&0\end{bmatrix},
(𝒲gn(4))i,1,:\displaystyle\left(\mathcal{W}_{g_{n}}^{(4)}\right)_{i,1,:} =𝟎 for i=1,2,3,5,6,7,8,\displaystyle=\mathbf{0}\quad\mbox{ for }i=1,2,3,5,6,7,8,
(Bgn(4))1,:\displaystyle\left(B_{g_{n}}^{(4)}\right)_{1,:} =[00011+eFn0000],\displaystyle=\begin{bmatrix}0&0&0&\frac{1}{1+e^{F_{n}}}&0&0&0&0&\end{bmatrix},
(Bgn(4))i,:\displaystyle\left(B_{g_{n}}^{(4)}\right)_{i,:} =𝟎 for i=2,,8.\displaystyle=\mathbf{0}\quad\mbox{ for }i=2,...,8.

The output of g_{n}\circ\bar{\eta} is stored as the first element in the fourth channel of the output of \bar{g}_{n}\circ\bar{\eta}^{\rm(Conv)}:

(\bar{g}_{n}\circ\bar{\eta}^{\rm(Conv)})_{1,:}=\begin{bmatrix}\star&\star&\star&g_{n}\circ\bar{\eta}&0&0&0&0\end{bmatrix}.

We have \bar{g}_{n}\in\mathcal{C}^{(g_{n})} where \mathcal{C}^{(g_{n})}=\mathcal{C}^{\mathrm{Conv}}(M^{(g_{n})},L^{(g_{n})},J^{(g_{n})},K^{(g_{n})},\kappa^{(g_{n})}) with

M(gn)=1,L(gn)=4,J(gn)=8,K(gn)=1,κ(gn)=O(κ2(η)κ1(η)eFn1+eFn).\displaystyle M^{(g_{n})}=1,\ L^{(g_{n})}=4,\ J^{(g_{n})}=8,\ K^{(g_{n})}=1,\ \kappa^{(g_{n})}=O\left(\frac{\kappa^{(\eta)}_{2}}{\kappa^{(\eta)}_{1}}\vee\frac{e^{F_{n}}}{1+e^{F_{n}}}\right).

According to Lemma 22, the covering number of 𝒞(gn)\mathcal{C}^{(g_{n})} is bounded as

log𝒩(δ,𝒞(gn),L)=O(log(κ2(η)κ1(η)eFn1+eFn)+log(1/δ)).\log\mathcal{N}(\delta,\mathcal{C}^{(g_{n})},\|\cdot\|_{L^{\infty}})=O\left(\log\left(\frac{\kappa_{2}^{(\eta)}}{\kappa_{1}^{(\eta)}}\vee\frac{e^{F_{n}}}{1+e^{F_{n}}}\right)+\log(1/\delta)\right).

Substituting the expressions of \kappa^{(\eta)}_{1},\kappa^{(\eta)}_{2} into the expression above gives rise to \log\kappa^{(g_{n})}=O\left(\log^{2}(1/\varepsilon)\right) and \log\mathcal{N}(\delta,\mathcal{C}^{(g_{n})},\|\cdot\|_{L^{\infty}})=O(\log^{2}(1/\varepsilon)+\log(1/\delta)).

Realize h~n\widetilde{h}_{n} by residual blocks.

To realize h~n\widetilde{h}_{n}, from the construction of h~n\widetilde{h}_{n} and using Oono and Suzuki (2019, Corollary 4), we can realize h~n\widetilde{h}_{n} by h¯n𝒞(hn)\bar{h}_{n}\in\mathcal{C}^{(h_{n})} with

\mathcal{C}^{(h_{n})}=\mathcal{C}(M^{(h_{n})},L^{(h_{n})},J^{(h_{n})},K^{(h_{n})},\kappa_{1}^{(h_{n})},\kappa_{2}^{(h_{n})},F_{n}),

and

M(hn)=O(ε21),L(hn)=O(log(1/ε2)),J(hn)=O(1),K(hn)=1,\displaystyle M^{(h_{n})}=O\left(\varepsilon_{2}^{-1}\right),\ L^{(h_{n})}=O\left(\log(1/\varepsilon_{2})\right),\ J^{(h_{n})}=O(1),\ K^{(h_{n})}=1,
κ1(hn)=O(1),logκ2(hn)=O(log(Lh/ε2)),\displaystyle\kappa_{1}^{(h_{n})}=O(1),\ \log\kappa_{2}^{(h_{n})}=O(\log(L_{h}/\varepsilon_{2})),

where ε2=(1+eFn)2eFnε\varepsilon_{2}=\frac{(1+e^{F_{n}})^{2}}{e^{F_{n}}}\varepsilon, and LhL_{h} is the Lipschitz constant of hnh_{n}. According to Lemma 22, the covering number of this class is bounded by

\log\mathcal{N}\left(\delta,\mathcal{C}^{(h_{n})},\|\cdot\|_{L^{\infty}}\right)=O\left(D^{2}\varepsilon_{2}^{-1}\log(1/\varepsilon_{2})\left(\log(1/\varepsilon_{2})+\log(L_{h}/\varepsilon_{2})+\log(1/\delta)\right)\right).

Note that such \bar{h} is from \mathbb{R} to \mathbb{R}. Since the information we need from the output of \bar{g}_{n}\circ\bar{\eta}^{\rm(Conv)} is only the first element in the fourth channel, we can follow the proof of Oono and Suzuki (2019, Theorem 6) to construct \bar{h} by padding the elements in the filters and biases by zeros so that all operations work on the fourth channel and store results in the fifth and sixth channels. Substituting \varepsilon_{2}=\frac{(1+e^{F_{n}})^{2}}{e^{F_{n}}}\varepsilon and L_{h_{n}}=(1+e^{F_{n}})^{2}/e^{F_{n}} yields

M(hn)=O(eFnε1),L(hn)=O(log(1/ε)),J(hn)=O(1),\displaystyle M^{(h_{n})}=O\left(e^{-F_{n}}\varepsilon^{-1}\right),L^{(h_{n})}=O\left(\log(1/\varepsilon)\right),J^{(h_{n})}=O(1),
κ1(hn)=O(1),logκ2(hn)=O(log(eFn/ε))\displaystyle\kappa_{1}^{(h_{n})}=O(1),\log\kappa_{2}^{(h_{n})}=O\left(\log\left(e^{F_{n}}/\varepsilon\right)\right)

and

\log\mathcal{N}\left(\delta,\mathcal{C}^{(h_{n})},\|\cdot\|_{L^{\infty}}\right)=O\left(D^{2}e^{-F_{n}}\varepsilon^{-1}\log(1/\varepsilon)\left(\log(1/\varepsilon)+F_{n}+\log(1/\delta)\right)\right).

Similar to η¯(Conv)\bar{\eta}^{(\mathrm{Conv})}, denote all residual blocks of h¯\bar{h} by h¯(Conv)\bar{h}^{(\mathrm{Conv})}. We have

(h¯(Conv))1,:=κ1(hn)κ2(hn)[(h~n)+(h~n)00].(\bar{h}^{(\mathrm{Conv})})_{1,:}=\frac{\kappa_{1}^{(h_{n})}}{\kappa_{2}^{(h_{n})}}\begin{bmatrix}\star&\star&\star&\star&(\widetilde{h}_{n})_{+}&(\widetilde{h}_{n})_{-}&0&0\end{bmatrix}.

Realize gFng_{F_{n}} by a ConvResNet.

We then add another residual block of 3 layers followed by a fully connected layer to realize gFng_{F_{n}}. Denote the parameters in this block and the fully connected layer by {𝒲gFn(l),BgFn(l)}l=13\{\mathcal{W}^{(l)}_{g_{F_{n}}},B^{(l)}_{g_{F_{n}}}\}_{l=1}^{3} and {W,b}\{W,b\}, respectively. Here 𝒲gFn(l)8×1×8,BgFn(l)8×8,W8×8\mathcal{W}^{(l)}_{g_{F_{n}}}\in\mathbb{R}^{8\times 1\times 8},B^{(l)}_{g_{F_{n}}}\in\mathbb{R}^{8\times 8},W\in\mathbb{R}^{8\times 8} and bb\in\mathbb{R}.

The first layer is set as

(𝒲gFn(1))5,1,:\displaystyle\left(\mathcal{W}_{g_{F_{n}}}^{(1)}\right)_{5,1,:} =[0000κ2(hn)κ1(hn)000],\displaystyle=\begin{bmatrix}0&0&0&0&\frac{\kappa_{2}^{(h_{n})}}{\kappa_{1}^{(h_{n})}}&0&0&0\end{bmatrix},
(𝒲gFn(1))6,1,:\displaystyle\left(\mathcal{W}_{g_{F_{n}}}^{(1)}\right)_{6,1,:} =[00000κ2(hn)κ1(hn)00],\displaystyle=\begin{bmatrix}0&0&0&0&0&\frac{\kappa_{2}^{(h_{n})}}{\kappa_{1}^{(h_{n})}}&0&0\end{bmatrix},
(𝒲gFn(1))i,1,:\displaystyle\left(\mathcal{W}_{g_{F_{n}}}^{(1)}\right)_{i,1,:} =𝟎 for i=1,2,3,4,7,8,\displaystyle=\mathbf{0}\quad\mbox{ for }i=1,2,3,4,7,8,
(BgFn(1))i,:\displaystyle\left(B_{g_{F_{n}}}^{(1)}\right)_{i,:} =𝟎 for i=1,2,,8.\displaystyle=\mathbf{0}\quad\mbox{ for }i=1,2,...,8.

This layer scales the output of \bar{h}^{(\mathrm{Conv})} back to \begin{bmatrix}\star&\star&\star&\star&(\widetilde{h}_{n})_{+}&(\widetilde{h}_{n})_{-}&0&0\end{bmatrix}. The remaining layers are used to realize g_{F_{n}}.

The second layer is set as

(𝒲gFn(2))7,1,:\displaystyle\left(\mathcal{W}_{g_{F_{n}}}^{(2)}\right)_{7,1,:} =[00001000],\displaystyle=\begin{bmatrix}0&0&0&0&-1&0&0&0\end{bmatrix},
(𝒲gFn(2))8,1,:\displaystyle\left(\mathcal{W}_{g_{F_{n}}}^{(2)}\right)_{8,1,:} =[00000100],\displaystyle=\begin{bmatrix}0&0&0&0&0&-1&0&0\end{bmatrix},
(𝒲gFn(2))i,1,:\displaystyle\left(\mathcal{W}_{g_{F_{n}}}^{(2)}\right)_{i,1,:} =𝟎 for i=1,,6\displaystyle=\mathbf{0}\quad\mbox{ for }i=1,...,6
(BgFn(2))1,:\displaystyle\left(B_{g_{F_{n}}}^{(2)}\right)_{1,:} =[000000FnFn],\displaystyle=\begin{bmatrix}0&0&0&0&0&0&F_{n}&F_{n}\end{bmatrix},
(BgFn(2))i,:\displaystyle\left(B_{g_{F_{n}}}^{(2)}\right)_{i,:} =𝟎 for i=2,,8.\displaystyle=\mathbf{0}\quad\mbox{ for }i=2,...,8.

The third layer is set as

(𝒲gFn(3))7,1,:\displaystyle\left(\mathcal{W}_{g_{F_{n}}}^{(3)}\right)_{7,1,:} =[00000010],\displaystyle=\begin{bmatrix}0&0&0&0&0&0&-1&0\end{bmatrix},
(𝒲gFn(3))8,1,:\displaystyle\left(\mathcal{W}_{g_{F_{n}}}^{(3)}\right)_{8,1,:} =[00000001],\displaystyle=\begin{bmatrix}0&0&0&0&0&0&0&-1\end{bmatrix},
(𝒲gFn(3))i,1,:\displaystyle\left(\mathcal{W}_{g_{F_{n}}}^{(3)}\right)_{i,1,:} =𝟎 for i=1,,6\displaystyle=\mathbf{0}\quad\mbox{ for }i=1,...,6
(BgFn(3))1,:\displaystyle\left(B_{g_{F_{n}}}^{(3)}\right)_{1,:} =[000000FnFn],\displaystyle=\begin{bmatrix}0&0&0&0&0&0&F_{n}&F_{n}\end{bmatrix},
(BgFn(3))i,:\displaystyle\left(B_{g_{F_{n}}}^{(3)}\right)_{i,:} =𝟎 for i=2,,8.\displaystyle=\mathbf{0}\quad\mbox{ for }i=2,...,8.

The first row of the output of the third layer is

[min((h~n)+,Fn)min((h~n),Fn)].\begin{bmatrix}\star&\star&\star&\star&\star&\star&\min((\widetilde{h}_{n})_{+},F_{n})&\min((\widetilde{h}_{n})_{-},F_{n})\end{bmatrix}.

Then the fully connected layer is set as

W1,:\displaystyle W_{1,:} =[00000011],\displaystyle=\begin{bmatrix}0&0&0&0&0&0&1&-1\end{bmatrix},
Wi,:\displaystyle W_{i,:} =𝟎 for i=2,,8\displaystyle=\mathbf{0}\quad\mbox{ for }i=2,...,8

and b=0b=0.
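The fully connected layer above returns \min((\widetilde{h}_{n})_{+},F_{n})-\min((\widetilde{h}_{n})_{-},F_{n}), which equals g_{F_{n}}(\widetilde{h}_{n}) because at most one of (\widetilde{h}_{n})_{+},(\widetilde{h}_{n})_{-} is nonzero. A short numerical check of this identity (our own sketch, with arbitrary test values) follows.

import numpy as np

rng = np.random.default_rng(3)
Fn = 2.0
h = 5.0 * rng.standard_normal(1000)                   # stand-ins for values of h_tilde_n

lhs = np.minimum(np.maximum(h, 0.0), Fn) - np.minimum(np.maximum(-h, 0.0), Fn)
assert np.allclose(lhs, np.clip(h, -Fn, Fn))          # equals g_{F_n}(h)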

Thus g¯Fn𝒞(gFn)\bar{g}_{F_{n}}\in\mathcal{C}^{(g_{F_{n}})} with 𝒞(gFn)=𝒞(M(gFn),L(gFn),J(gFn),K(gFn),κ1(gFn),κ2(gFn))\mathcal{C}^{(g_{F_{n}})}=\mathcal{C}(M^{(g_{F_{n}})},L^{(g_{F_{n}})},J^{(g_{F_{n}})},K^{(g_{F_{n}})},\kappa_{1}^{(g_{F_{n}})},\kappa_{2}^{(g_{F_{n}})}) and

M(gFn)=1,L(gFn)=4,J(gFn)=8,K(gFn)=1,κ1(gFn)=O(κ2(hn)κ1(hn)Fn),κ2(gFn)=1.\displaystyle M^{(g_{F_{n}})}=1,\ L^{(g_{F_{n}})}=4,\ J^{(g_{F_{n}})}=8,\ K^{(g_{F_{n}})}=1,\ \kappa_{1}^{(g_{F_{n}})}=O\left(\frac{\kappa^{(h_{n})}_{2}}{\kappa^{(h_{n})}_{1}}\vee F_{n}\right),\ \kappa_{2}^{(g_{F_{n}})}=1.

According to Lemma 22, substituting the expressions of κ1(hn),κ2(hn)\kappa^{(h_{n})}_{1},\kappa^{(h_{n})}_{2} gives rise to log𝒩(δ,𝒞(gFn),L)=O(D2(log(1/ε)+logFn+log(1/δ)))\log\mathcal{N}(\delta,\mathcal{C}^{(g_{F_{n}})},\|\cdot\|_{L^{\infty}})=O\left(D^{2}\left(\log(1/\varepsilon)+\log F_{n}+\log(1/\delta)\right)\right).

The resulting network $\bar{f}_{\phi,n}\equiv\bar{g}_{F_{n}}\circ\bar{h}^{(\mathrm{Conv})}\circ\bar{g}_{n}\circ\bar{\eta}^{(\mathrm{Conv})}$ is a ConvResNet and

\bar{f}_{\phi,n}(\mathbf{x})=\widetilde{f}_{\phi,n}(\mathbf{x})

for any $\mathbf{x}\in\mathcal{M}$. Denote by $\mathcal{C}^{(F_{n})}$ the network class given by the architecture of $\bar{f}_{\phi,n}$. Its covering number is bounded by

\begin{align*}
\log\mathcal{N}(\delta,\mathcal{C}^{(F_{n})},\|\cdot\|_{L^{\infty}}) \leq{}& \log\mathcal{N}\left(\delta,\mathcal{C}^{(\eta)},\|\cdot\|_{L^{\infty}}\right)+\log\mathcal{N}\left(\delta,\mathcal{C}^{(g_{n})},\|\cdot\|_{L^{\infty}}\right)\\
&+\log\mathcal{N}\left(\delta,\mathcal{C}^{(h_{n})},\|\cdot\|_{L^{\infty}}\right)+\log\mathcal{N}\left(\delta,\mathcal{C}^{(g_{F_{n}})},\|\cdot\|_{L^{\infty}}\right)\\
={}& O\left(D^{3}\varepsilon^{-\left(\frac{d}{s}\vee 1\right)}\log(1/\varepsilon)\left(\log^{2}(1/\varepsilon)+\log D+\log F_{n}+\log(1/\delta)\right)\right).
\end{align*}

The constants hidden in $O(\cdot)$ depend on $d$, $s$, $\frac{2d}{sp-d}$, $p$, $q$, $c_{0}$, $\tau$, and the surface area of $\mathcal{M}$. ∎

Proof of Lemma 2.

Lemma 2 is a direct result of Lemma 20 and Lemma 23. ∎

D.3 Proof of Lemma 3

Proof of Lemma 3.

We divide $A_{n}^{\complement}$ into two regions: $\{\mathbf{x}\in\mathcal{M}:f^{*}_{\phi}>F_{n}\}$ and $\{\mathbf{x}\in\mathcal{M}:f^{*}_{\phi}<-F_{n}\}$. A bound on ${\rm T_{2}}$ is obtained by bounding the integral over each region separately.
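Explicitly, write $\Delta(\mathbf{x})=\left|\left[\eta\phi(f_{\phi}^{*})+(1-\eta)\phi(-f_{\phi}^{*})\right]-\left[\eta\phi(\bar{f}_{\phi,n})+(1-\eta)\phi(-\bar{f}_{\phi,n})\right]\right|$ for the integrand appearing at the end of this proof (the shorthand $\Delta$ is ours). Since the two regions partition $A_{n}^{\complement}$,
\[
\int_{A_{n}^{\complement}}\Delta(\mathbf{x})\,\mu(d\mathbf{x})=\int_{\{f^{*}_{\phi}>F_{n}\}}\Delta(\mathbf{x})\,\mu(d\mathbf{x})+\int_{\{f^{*}_{\phi}<-F_{n}\}}\Delta(\mathbf{x})\,\mu(d\mathbf{x}),
\]
and the two integrals are bounded in turn below.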

Let us first consider the region $\{\mathbf{x}\in\mathcal{M}:f^{*}_{\phi}>F_{n}\}$. Since $\eta=e^{f_{\phi}^{*}}/(1+e^{f_{\phi}^{*}})$, we have

\begin{align}
\eta\phi(f_{\phi}^{*})+(1-\eta)\phi(-f_{\phi}^{*}) &= \frac{e^{f_{\phi}^{*}}}{1+e^{f_{\phi}^{*}}}\phi(f_{\phi}^{*})+\frac{1}{1+e^{f_{\phi}^{*}}}\phi(-f_{\phi}^{*})\nonumber\\
&\leq \phi(F_{n})+\sup_{z\geq F_{n}}\frac{\log(1+e^{z})}{1+e^{z}}\nonumber\\
&\leq \log(1+e^{-F_{n}})+\frac{\log(1+e^{F_{n}})}{1+e^{F_{n}}}\nonumber\\
&\leq 2F_{n}e^{-F_{n}}. \tag{59}
\end{align}
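The last step is elementary and can also be double-checked numerically; a minimal sanity check (ours), assuming $F_{n}\geq 1$:

```python
import numpy as np

# Check (ours): for F_n >= 1,
#   log(1 + exp(-F_n)) + log(1 + exp(F_n)) / (1 + exp(F_n)) <= 2 * F_n * exp(-F_n).
F = np.linspace(1.0, 20.0, 2000)
lhs = np.log1p(np.exp(-F)) + np.log1p(np.exp(F)) / (1.0 + np.exp(F))
rhs = 2.0 * F * np.exp(-F)
assert np.all(lhs <= rhs)
print("largest gap lhs - rhs:", float(np.max(lhs - rhs)))  # negative on the whole grid
```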

On this region, $F_{n}-1\leq\bar{f}_{\phi,n}\leq F_{n}$. Thus

\begin{align}
\eta\phi(\bar{f}_{\phi,n})+(1-\eta)\phi(-\bar{f}_{\phi,n}) &= \frac{e^{f_{\phi}^{*}}}{1+e^{f_{\phi}^{*}}}\phi(\bar{f}_{\phi,n})+\frac{1}{1+e^{f_{\phi}^{*}}}\phi(-\bar{f}_{\phi,n})\nonumber\\
&\leq \phi(F_{n}-1)+\frac{\log(1+e^{\bar{f}_{\phi,n}})}{1+e^{f_{\phi}^{*}}}\nonumber\\
&\leq \log(1+e^{-(F_{n}-1)})+\frac{\log(1+e^{F_{n}})}{1+e^{F_{n}}}\nonumber\\
&\leq 2F_{n}e^{-F_{n}}. \tag{60}
\end{align}

Since (59) and (60) bound nonnegative quantities, combining them via the triangle inequality gives

\begin{equation}
\left|\left[\eta\phi(f_{\phi}^{*})+(1-\eta)\phi(-f_{\phi}^{*})\right]-\left[\eta\phi(\bar{f}_{\phi,n})+(1-\eta)\phi(-\bar{f}_{\phi,n})\right]\right|\leq 4F_{n}e^{-F_{n}}. \tag{61}
\end{equation}

Now consider the region $\{\mathbf{x}\in\mathcal{M}:f^{*}_{\phi}<-F_{n}\}$. On this region we have

\begin{align}
\eta\phi(f_{\phi}^{*})+(1-\eta)\phi(-f_{\phi}^{*}) &= \frac{e^{f_{\phi}^{*}}}{1+e^{f_{\phi}^{*}}}\phi(f_{\phi}^{*})+\frac{1}{1+e^{f_{\phi}^{*}}}\phi(-f_{\phi}^{*})\nonumber\\
&= \frac{1}{1+e^{-f_{\phi}^{*}}}\phi(f_{\phi}^{*})+\frac{e^{-f_{\phi}^{*}}}{1+e^{-f_{\phi}^{*}}}\phi(-f_{\phi}^{*})\nonumber\\
&\leq \phi(F_{n})+\sup_{z\leq -F_{n}}\frac{\log(1+e^{-z})}{1+e^{-z}}\nonumber\\
&\leq \log(1+e^{-F_{n}})+\frac{\log(1+e^{F_{n}})}{1+e^{F_{n}}}\nonumber\\
&\leq 2F_{n}e^{-F_{n}}. \tag{62}
\end{align}

On this region, $-F_{n}\leq\bar{f}_{\phi,n}\leq-F_{n}+1$. Thus

\begin{align}
\eta\phi(\bar{f}_{\phi,n})+(1-\eta)\phi(-\bar{f}_{\phi,n}) &= \frac{1}{1+e^{-f_{\phi}^{*}}}\phi(\bar{f}_{\phi,n})+\frac{e^{-f_{\phi}^{*}}}{1+e^{-f_{\phi}^{*}}}\phi(-\bar{f}_{\phi,n})\nonumber\\
&\leq \phi(F_{n}-1)+\frac{\log(1+e^{-\bar{f}_{\phi,n}})}{1+e^{-f_{\phi}^{*}}}\nonumber\\
&\leq \log(1+e^{-(F_{n}-1)})+\frac{\log(1+e^{F_{n}})}{1+e^{F_{n}}}\nonumber\\
&\leq 2F_{n}e^{-F_{n}}. \tag{63}
\end{align}

Since (62) and (63) also bound nonnegative quantities, the triangle inequality gives

\begin{equation}
\left|\left[\eta\phi(f_{\phi}^{*})+(1-\eta)\phi(-f_{\phi}^{*})\right]-\left[\eta\phi(\bar{f}_{\phi,n})+(1-\eta)\phi(-\bar{f}_{\phi,n})\right]\right|\leq 4F_{n}e^{-F_{n}}. \tag{64}
\end{equation}

Putting (61) and (64) together, we have

\begin{align*}
{\rm T_{2}} \leq \int_{A_{n}^{\complement}}\left|\left[\eta\phi(f_{\phi}^{*})+(1-\eta)\phi(-f_{\phi}^{*})\right]-\left[\eta\phi(\bar{f}_{\phi,n})+(1-\eta)\phi(-\bar{f}_{\phi,n})\right]\right|\mu(d\mathbf{x})\leq 8F_{n}e^{-F_{n}},
\end{align*}
where the last inequality holds because the integrand is at most $4F_{n}e^{-F_{n}}$ on each of the two regions and each region has $\mu$-measure at most $1$.