
Exploring the Complexity of Deep Neural Networks through Functional Equivalence

Guohao Shen
Department of Applied Mathematics
The Hong Kong Polytechnic University
Hung Hom, Kowloon, Hong Kong SAR, China
guohao.shen@polyu.edu.hk
Abstract

We investigate the complexity of deep neural networks through the lens of functional equivalence, which posits that different parameterizations can yield the same network function. Leveraging the equivalence property, we present a novel bound on the covering number for deep neural networks, which reveals that the complexity of neural networks can be reduced. Additionally, we demonstrate that functional equivalence benefits optimization, as overparameterized networks tend to be easier to train since increasing network width leads to a diminishing volume of the effective parameter space. These findings can offer valuable insights into the phenomenon of overparameterization and have implications for understanding generalization and optimization in deep learning.

1 Introduction

Artificial neural networks, particularly deep and wide ones, have shown remarkable success in a wide range of applications in machine learning and artificial intelligence. However, a major challenge in understanding this success is explaining why such networks generalize well even when they are very large and prone to overfitting the data (Neyshabur et al., 2014; Neyshabur et al., 2017a; Razin and Cohen, 2020).

Theoretical studies have suggested that the generalization error can be related to the complexity, approximation power, and optimization of deep neural networks. Larger neural networks have been proven to possess better approximation power (Yarotsky, 2017; Lu et al., 2021a; Zhou, 2020b), but may exhibit larger complexity and generalization gaps (Bartlett et al., 2017; Mohri et al., 2018; Bartlett et al., 2019), and can be more challenging to optimize (Glorot et al., 2011). However, some aspects of deep learning initially appeared to contradict common sense: overparameterized networks tend to be easier to train (Frankle and Carbin, 2019; Allen-Zhu et al., 2019; Du et al., 2019) and exhibit better generalization (Belkin et al., 2019; Neyshabur et al., 2019; Novak et al., 2018). Although the capacity of the model class is immense, deep networks do not tend to overfit (Zhang et al., 2017).

Recent studies have highlighted that the functional form of neural networks may be less complex than their parametric form (Bui Thi Mai and Lampert, 2020; Stock and Gribonval, 2022; Grigsby et al., 2022b), as networks with different parameters may implement the same function. This insight provides a fresh perspective for reconsidering how overparameterization truly affects generalization.

In this work, we quantitatively characterize the redundancy in the parameterization of deep neural networks and derive a complexity measure for these networks based on functional equivalence. We analyze the results to gain insights into generalization and optimization in deep learning.

1.1 Related work

The issue of redundancy or identifiability in the parameterization of neural networks has been noted since Hecht-Nielsen (1990). Subsequent studies of neural networks with Tanh and sigmoid activations (Chen et al., 1993; Fefferman and Markel, 1993; Kůrková and Kainen, 1994) proved that, given the input-output mapping of a Tanh neural network, its architecture can be determined and its weights are identified up to permutations and sign flips. Recently, the identifiability of parameterization in deep neural networks, particularly ReLU networks, has received considerable attention (Elbrächter et al., 2019; Bui Thi Mai and Lampert, 2020; Bona-Pellissier et al., 2021; Dereich and Kassing, 2022; Stock and Gribonval, 2022; Grigsby et al., 2022b; Grigsby et al., 2022a). Most recently, Bui Thi Mai and Lampert (2020) demonstrated that ReLU networks with non-increasing widths are identifiable up to permutation and scaling of weight matrices. With redundant parameterization, the weight space of deep neural networks can exhibit symmetric structures, which has implications for optimization (Neyshabur et al., 2015a; Badrinarayanan et al., 2015a; Stock et al., 2019). These studies observed that the naive loss gradient is sensitive to reparameterization by scaling and proposed alternative, scaling-invariant optimization procedures. In addition, the redundancy or identifiability properties of neural networks are closely related to the study of inverse stability (Elbrächter et al., 2019; Rolnick and Kording, 2020; Bona-Pellissier et al., 2021; Petersen et al., 2021; Stock and Gribonval, 2022), which investigates whether one can recover the parameters (weights and biases) of a neural network from the function it implements.

The complexity of neural networks in terms of their parameterization redundancy has received limited attention. Among the few relevant studies, Grigsby et al. (2022a) and Grigsby et al. (2022b) are worth mentioning. In Grigsby et al. (2022a), the authors investigated local and global notions of topological complexity for fully-connected feedforward ReLU neural network functions. On the other hand, Grigsby et al. (2022b) defined the functional dimension of ReLU neural networks based on perturbations in parameter space. They explored functional redundancy and conditions under which the functional dimension reaches its theoretical maximum. However, it should be noted that these results on functional dimension do not directly translate into generalization error bounds for deep learning algorithms.

The complexity of a class of functions is closely related to generalization error, with larger complexity often leading to larger generalization error (Bartlett et al., 2017; Mohri et al., 2018). Various complexity upper bounds have been studied for deep neural networks using different measurements, such as Rademacher complexity (Neyshabur et al., 2015b; Golowich et al., 2018; Li et al., 2018), VC-dimension and pseudo-dimension (Baum and Haussler, 1988; Goldberg and Jerrum, 1993; Anthony et al., 1999; Bartlett et al., 2019), and covering numbers (Anthony et al., 1999; Neyshabur et al., 2017b; Bartlett et al., 2017; Lin and Zhang, 2019). These measurements characterize the complexity of the class of neural networks and are influenced by hyperparameters such as network depth, width, the number of weights and bias vectors, and the corresponding norm bounds. While these bounds are not directly comparable in magnitude, they are closely related and can be converted to facilitate comparisons (Anthony et al., 1999; Mohri et al., 2018).

1.2 Our contributions

We summarize our contributions as follows:

  • 1.

    We exploit the permutation equivalence property to obtain, for the first time, a tighter upper bound on the covering number of neural networks, which improves existing results by a factorial of the network widths and provides new insights into the relationship between network complexity and layer width.

  • 2.

    We improve existing covering number bounds in the sense that our results hold for neural networks with bias vectors and general activation functions. Since bias terms are indispensable for the approximation power of neural networks, our results are useful in both theory and practice. Additionally, we express our bound explicitly in terms of the network’s width, depth, size, and the norm of the parameters.

  • 3.

    We discuss the implications of our findings for understanding generalization and optimization. In particular, we found that overparameterized networks tend to be easier to train in the sense that increasing the width of neural networks leads to a vanishing volume of the effective parameter space.

The remainder of the paper is organized as follows. In section 2, we introduce the concept of functional equivalence, investigate the permutation invariance property of general feedforward neural networks, derive novel covering number bounds for shallow and deep neural networks by exploiting this property, and compare our results with existing ones. In section 3, we discuss the extension to convolutional, residual, and attention-based networks. In section 4, we demonstrate the theoretical implication of permutation invariance for optimization complexity in deep learning and discuss the implications of our results for generalization. Finally, we discuss the limitations of this study and future research directions in section 5. All technical proofs are included in the Appendix.

Table 1: A comparison of recent results on the complexity of feedforward neural networks.
Paper | Complexity | Explicit | Bias Vectors | Permutation Invariance
Bartlett et al., (2017) | $B_{x}^{2}(\bar{\rho}\bar{s})^{2}\mathcal{U}\log(W)/\epsilon^{2}$ | | |
Neyshabur et al., (2017b) | $B_{x}^{2}(\bar{\rho}\bar{s})^{2}\mathcal{S}L^{2}\log(WL)/\epsilon^{2}$ | | |
Lin and Zhang, (2019) | $B_{x}(\bar{\rho}\bar{s})\mathcal{S}^{2}L/\epsilon$ | | |
Bartlett et al., (2019) | $L\mathcal{S}\log(\mathcal{S})\log(\bar{\rho}\bar{s}B_{x}/\epsilon)$ | | |
This paper | $L\mathcal{S}\log\big(\bar{\rho}\bar{s}B_{x}^{1/L}/((d_{1}!\cdots d_{L}!)^{1/\mathcal{S}}\epsilon)^{1/L}\big)$ | ✓ | ✓ | ✓
  • Notations: $\mathcal{S}$, number of parameters; $\mathcal{U}$, number of hidden neurons; $L$, number of hidden layers; $W$, maximum hidden-layer width; $B_{x}$, $L_{2}$ norm of the input; $\bar{\rho}=\Pi_{j=1}^{L}\rho_{j}$, product of the Lipschitz constants of the activations; $\bar{s}=\Pi_{j=1}^{L}s_{j}$, product of the spectral norms of the hidden-layer weight matrices; $\epsilon$, radius for the covering number.


2 Functionally equivalent Neural Networks

A feedforward neural network is a fully connected artificial neural network consisting of multiple layers of interconnected neurons. The network's architecture can be expressed as a composition of linear maps and activations. The functional form of an $L$-layer feedforward neural network is determined by its weight matrices, bias vectors, and activation functions:

f(x;\theta)=\mathcal{A}_{L+1}\circ\sigma_{L}\circ\mathcal{A}_{L}\circ\cdots\circ\sigma_{2}\circ\mathcal{A}_{2}\circ\sigma_{1}\circ\mathcal{A}_{1}(x). (1)

Here, $\mathcal{A}_{l}(x)=W^{(l)}x+b^{(l)}$ is the linear transformation for layer $l$, where $W^{(l)}$ and $b^{(l)}$ are the weight matrix and bias vector, respectively. The activation function $\sigma_{l}$ is applied element-wise to the output of $\mathcal{A}_{l}$ and can differ across layers. The collection of weight matrices and bias vectors is denoted by $\theta=(W^{(1)},b^{(1)},\ldots,W^{(L+1)},b^{(L+1)})$. The input $x$ is propagated through each layer of the network to produce the output $f(x;\theta)$.
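As a concrete illustration of (1), the following NumPy sketch implements the forward pass of such a network; the helper name, the layer sizes, and the choice of activation are our own assumptions for illustration, not part of the paper.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Forward pass of the feedforward network in (1).

    weights[l], biases[l] parameterize the affine map A_{l+1};
    activations[l] is applied element-wise after every layer except the last.
    """
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        h = W @ h + b
        if l < len(weights) - 1:          # no activation after A_{L+1}
            h = activations[l](h)
    return h

# Example: a two-hidden-layer network with tanh activations (arbitrary sizes).
rng = np.random.default_rng(0)
dims = [3, 5, 4, 1]                       # d_0, d_1, d_2, output
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [rng.normal(size=dims[i + 1]) for i in range(len(dims) - 1)]
y = forward(rng.normal(size=dims[0]), Ws, bs, [np.tanh, np.tanh])
```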

The parameterization of a neural network can be redundant, with different parameter sets producing identical function outputs. This redundancy arises from the non-identifiability of weight matrices or activation functions.

Definition 1 (Functionally-Equivalent Neural Networks).

Two neural networks $f(x;\theta_{1})$ and $f(x;\theta_{2})$ are said to be functionally equivalent on $\mathcal{X}$ if they produce the same input-output function for all possible inputs, i.e.,

f(x;\theta_{1})=f(x;\theta_{2})\quad\forall x\in\mathcal{X}, (2)

where $\mathcal{X}$ is the input space and $\theta_{1}$ and $\theta_{2}$ denote the sets of parameters of the two networks, respectively.

Neural networks with a fixed architecture can have functionally-equivalent versions through weight scaling, sign flips, and permutations. This can even occur across networks with different architectures. In this paper, we focus on the complexity of a specific class of neural networks with fixed architecture but varying parameterizations. We provide examples of functionally-equivalent shallow neural networks to illustrate this concept.

Example 1 (Scaling).

Consider two shallow neural networks parameterized by $\theta_{1}=(W^{(1)}_{1},b^{(1)}_{1},W^{(2)}_{1},b^{(2)}_{1})$ and $\theta_{2}=(W^{(1)}_{2},b^{(1)}_{2},W^{(2)}_{2},b^{(2)}_{2})$, defined as:

f(x;\theta_{1})=W^{(2)}_{1}\sigma(W^{(1)}_{1}x+b^{(1)}_{1})+b^{(2)}_{1},
f(x;\theta_{2})=W^{(2)}_{2}\sigma(W^{(1)}_{2}x+b^{(1)}_{2})+b^{(2)}_{2},

respectively, where $x\in\mathbb{R}^{n}$ is the input to the network and $\sigma$ satisfies $\sigma(\lambda x)=\lambda\sigma(x)$ for all $x\in\mathbb{R}^{n}$ and $\lambda>0$. If there exists a scalar value $\alpha>0$ such that:

(W^{(1)}_{2},b^{(1)}_{2})=(\alpha W^{(1)}_{1},\alpha b^{(1)}_{1})\quad{\rm and}\quad W^{(2)}_{2}=\frac{1}{\alpha}W^{(2)}_{1},

then $f(\cdot;\theta_{1})$ and $f(\cdot;\theta_{2})$ are functionally equivalent.

The scaling invariance property applies to ReLU, Leaky ReLU, and other piecewise-linear activated neural networks. Specifically, for all $x\in\mathbb{R}^{n}$ and $\lambda\geq 0$, we have $\sigma(\lambda x)=\lambda\sigma(x)$ when $\sigma$ is the ReLU or Leaky ReLU function. It is worth noting that the above example is presented for shallow neural networks, but scaling invariance can occur in deep networks across any two consecutive layers.
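The following minimal NumPy sketch checks the scaling equivalence of Example 1 numerically for a ReLU network; the dimensions, the random seed, and the scale $\alpha$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

n, m = 4, 6                        # input and hidden dimensions (arbitrary)
W1, b1 = rng.normal(size=(m, n)), rng.normal(size=m)
W2, b2 = rng.normal(size=(1, m)), rng.normal(size=1)
alpha = 2.5                        # any positive scale

f1 = lambda x: W2 @ relu(W1 @ x + b1) + b2
f2 = lambda x: (W2 / alpha) @ relu(alpha * W1 @ x + alpha * b1) + b2

x = rng.normal(size=n)
assert np.allclose(f1(x), f2(x))   # scaled parameters realize the same function
```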

Example 2 (Sign Flipping).

Consider two shallow neural networks $f(\cdot;\theta_{1})$ and $f(\cdot;\theta_{2})$ defined in Example 1 with $\sigma$ being an odd function, that is, $\sigma(-x)=-\sigma(x)$ for all $x\in\mathbb{R}^{n}$. If

(W^{(1)}_{2},b^{(1)}_{2})=(-W^{(1)}_{1},-b^{(1)}_{1})\quad{\rm and}\quad W^{(2)}_{2}=-W^{(2)}_{1},

then $f(x;\theta_{1})$ and $f(x;\theta_{2})$ are functionally equivalent.

The sign-flipping invariance property holds for neural networks activated by Tanh, Sin, and other odd functions. It is worth noting that Sigmoid does not have a straightforward sign-flipping invariance: Sigmoid is odd only up to an additive constant of 0.5, so sign flipping changes the output by a constant, which can be absorbed into a bias term (Martinelli et al., 2023). The sign-flipping invariance property can also be generalized to deep neural networks across any two consecutive layers.

Example 3 (Permutation).

Consider two shallow neural networks $f(\cdot;\theta_{1})$ and $f(\cdot;\theta_{2})$ defined in Example 1 with $\sigma$ being a general activation function. Let the dimension of the hidden layer of $f(x;\theta_{1})$ and $f(x;\theta_{2})$ be denoted by $m$. If there exists an $m\times m$ permutation matrix $P$ such that

(PW^{(1)}_{2},Pb^{(1)}_{2})=(W^{(1)}_{1},b^{(1)}_{1})\quad{\rm and}\quad W^{(2)}_{2}P^{\top}=W^{(2)}_{1},

then $f(x;\theta_{1})$ and $f(x;\theta_{2})$ are functionally equivalent.
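Below is a short NumPy check of Example 3 (a sketch under arbitrary choices of width, activation, and random permutation): re-indexing the hidden units and permuting the outgoing weights accordingly leaves the network function unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = np.tanh                       # permutation equivalence holds for any activation

n, m = 3, 5                           # arbitrary input and hidden widths
W1, b1 = rng.normal(size=(m, n)), rng.normal(size=m)
W2, b2 = rng.normal(size=(1, m)), rng.normal(size=1)

P = np.eye(m)[rng.permutation(m)]     # random m x m permutation matrix

f      = lambda x: W2 @ sigma(W1 @ x + b1) + b2
f_perm = lambda x: (W2 @ P.T) @ sigma(P @ W1 @ x + P @ b1) + b2

x = rng.normal(size=n)
assert np.allclose(f(x), f_perm(x))   # permuted parameters give the same output
```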

Feedforward neural networks are built from linear transformations and activations, and it is intuitive that simply re-indexing the neurons in a hidden layer, together with the corresponding rows of its weight matrix and bias vector, leads to a functionally equivalent network. Permutation invariance is the most basic type of equivalence for neural networks since it does not rely on any specific properties of the activation functions, whereas scaling and sign-flipping invariance are activation-dependent properties. A comparison of the functional equivalence properties of neural networks with commonly used activation functions is presented in Table 2.

Table 2: Functional equivalence property for networks with different activation functions.
Activation | Formula | Sign flipping | Scaling | Permutation
Sigmoid | $[1+\exp(-x)]^{-1}$ | ✗ | ✗ | ✓
Tanh | $[1-\exp(-2x)]/[1+\exp(-2x)]$ | ✓ | ✗ | ✓
ReLU | $\max\{0,x\}$ | ✗ | ✓ | ✓
Leaky ReLU | $\max\{ax,x\}$ for $a>0$ | ✗ | ✓ | ✓

Next, we derive sufficient conditions for feedforward neural networks (FNNs) to be permutation equivalent.

Proposition 1 (Permutation equivalence for deep FNNs).

Consider two neural networks $f(x;\theta_{1})$ and $f(x;\theta_{2})$ with the same activations $\sigma_{1},\ldots,\sigma_{L}$ and architecture

f(x;\theta)=W^{(L+1)}\sigma_{L}(W^{(L)}\cdots\sigma_{1}(W^{(1)}x+b^{(1)})\cdots+b^{(L)})+b^{(L+1)},

but parameterized by different parameters

\theta_{j}=(W^{(1)}_{j},b^{(1)}_{j},\ldots,W^{(L+1)}_{j},b^{(L+1)}_{j}),\quad j=1,2,

respectively, where $x\in\mathbb{R}^{n}$ is the input to the network. Let $P^{\top}$ denote the transpose of a matrix $P$. If there exist permutation matrices $P_{1},\ldots,P_{L}$ such that

W^{(1)}_{1}=P_{1}W^{(1)}_{2}, \qquad b^{(1)}_{1}=P_{1}b^{(1)}_{2},
W^{(l)}_{1}=P_{l}W^{(l)}_{2}P_{l-1}^{\top}, \qquad b^{(l)}_{1}=P_{l}b^{(l)}_{2}, \quad l=2,\ldots,L,
W^{(L+1)}_{1}=W^{(L+1)}_{2}P_{L}^{\top}, \qquad b^{(L+1)}_{1}=b^{(L+1)}_{2},

then $f(x;\theta_{1})$ and $f(x;\theta_{2})$ are functionally equivalent.

Proposition 1 describes the relationship between the parameters of two permutation-equivalent deep feedforward neural networks. This relationship can be used to construct functionally equivalent networks for a fixed architecture. It is important to note that although permutation invariance is sufficient for functional equivalence of feedforward neural networks, it is not always necessary. Petzka et al. (2020) gave a complete characterization for fully-connected networks with two layers, while for general networks, certain restrictions on the architecture and activation function are required to fully characterize functional equivalence (Sussmann, 1992; Kůrková and Kainen, 1994; Bui Thi Mai and Lampert, 2020) and to recover the parameters of a network (Martinelli et al., 2023). This study focuses on utilizing permutation invariance to investigate neural network complexity.
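To make Proposition 1 concrete, the following NumPy sketch builds a permuted parameterization of a three-hidden-layer network according to the conditions above and checks functional equivalence numerically; the widths, the activation, and the random seed are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = np.tanh
dims = [3, 5, 4, 6, 1]                                  # d_0, d_1, d_2, d_3, output

Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [rng.normal(size=dims[i + 1]) for i in range(len(dims) - 1)]

def forward(x, Ws, bs):
    h = x
    for l in range(len(Ws) - 1):                        # hidden layers
        h = sigma(Ws[l] @ h + bs[l])
    return Ws[-1] @ h + bs[-1]                          # output layer

# Permutation matrices P_1, ..., P_L, one per hidden layer.
Ps = [np.eye(d)[rng.permutation(d)] for d in dims[1:-1]]

# Construct theta_1 from theta_2 := (Ws, bs) following Proposition 1.
Ws1 = [Ps[0] @ Ws[0]]
bs1 = [Ps[0] @ bs[0]]
for l in range(1, len(Ps)):
    Ws1.append(Ps[l] @ Ws[l] @ Ps[l - 1].T)
    bs1.append(Ps[l] @ bs[l])
Ws1.append(Ws[-1] @ Ps[-1].T)                           # output weights
bs1.append(bs[-1])                                      # output bias unchanged

x = rng.normal(size=dims[0])
assert np.allclose(forward(x, Ws, bs), forward(x, Ws1, bs1))
```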

We now analyze the complexity of a class of feedforward neural networks by examining the redundancy that arises from permutation invariance. Specifically, we study the covering number of real-valued, deep feedforward neural networks that share the same architecture but have different parameterizations.

Let the vector $(d_{0},d_{1},\ldots,d_{L})$ represent the dimensions (widths) of the layers of the neural network $f(x;\theta)$ defined in (1), where $d_{L+1}=1$ since the output is real-valued. Note that the bias vectors in the hidden layers contain $\mathcal{U}:=\sum_{i=1}^{L}d_{i}$ entries, and the weight matrices together with the bias vectors contain $\mathcal{S}:=\sum_{i=0}^{L}(d_{i}\times d_{i+1}+d_{i+1})$ entries in total. We define the parameter space of $\theta$ as $\Theta=[-B,B]^{\mathcal{S}}$ for some $B\geq 1$, which is closed under permutation operations and ensures that the absolute values of the entries of the weight matrices and bias vectors are bounded by $B$. This setting is in line with complexity studies with norm controls, as in (Neyshabur et al., 2015b; Bartlett et al., 2017; Golowich et al., 2018). A bounded parameter space also corresponds to the implicit regularization phenomena observed in SGD-based optimization algorithms (Neyshabur et al., 2014; Gunasekar et al., 2018a; Gunasekar et al., 2018b), which can lead to minimum-norm solutions (e.g., for least squares problems). We do not specify the activation functions $\sigma_{1},\ldots,\sigma_{L}$ since we consider general deep feedforward neural networks. Finally, the class of feedforward neural networks we consider is denoted by

\mathcal{F}(L,d_{0},d_{1},\ldots,d_{L},B)=\{f(\cdot;\theta):\mathbb{R}^{d_{0}}\to\mathbb{R}\ {\rm defined\ in\ (1)}:\theta\in[-B,B]^{\mathcal{S}}\}. (3)

2.1 Shallow Feed-Forward Neural Networks

It is well known that a shallow neural network with a single hidden layer has universal approximation properties (Cybenko, 1989; Hornik, 1991) and is sufficient for many learning tasks (Hanin and Rolnick, 2019; Hertrich et al., 2021). We begin our investigation by considering shallow neural networks of the form

\mathcal{F}(1,d_{0},d_{1},B)=\{f(x;\theta)=W^{(2)}\sigma_{1}(W^{(1)}x+b^{(1)})+b^{(2)}:\theta\in[-B,B]^{\mathcal{S}}\}, (4)

where the total number of parameters is given by $\mathcal{S}=(d_{0}+2)\times d_{1}+1$. By Proposition 1, for any $\theta=(W^{(1)},b^{(1)},W^{(2)},b^{(2)})\in\Theta$ and any permutation matrix $P$, the parameter $\tilde{\theta}=(PW^{(1)},Pb^{(1)},W^{(2)}P^{\top},b^{(2)})\in\Theta$ produces the same input-output function. In other words, permutation invariance leads to equivalence classes of parameters that yield the same realization, and we can choose a set of representatives from these equivalence classes. A canonical choice is

\Theta_{0}:=\{\theta\in[-B,B]^{\mathcal{S}}:b^{(1)}_{1}\geq b^{(1)}_{2}\geq\cdots\geq b^{(1)}_{d_{1}}\},

where the set of representatives is obtained by restricting the bias vector $b^{(1)}=(b^{(1)}_{1},\ldots,b^{(1)}_{d_{1}})^{\top}$ to have descending components. Alternatively, we can sort the first components of the rows of $W^{(1)}$ to obtain a set of representatives. It is worth mentioning that $\Theta_{0}$ may not be the minimal set of representatives, since there may be other symmetries within $\Theta_{0}$. We do not exploit these additional symmetries to further reduce $\Theta_{0}$, because such symmetries depend on the specific properties of the activation functions of the network. In this work, we employ only permutation invariance (which holds for networks with any activations) to obtain a representative set $\Theta_{0}$.
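As an illustration of this canonicalization, the sketch below maps a shallow-network parameter to a representative in $\Theta_{0}$ by sorting the hidden-layer biases in descending order and permuting the outgoing weights accordingly; the function name and the sorting rule are our own illustrative choices.

```python
import numpy as np

def canonicalize(W1, b1, W2):
    """Return an equivalent parameterization whose hidden biases are descending.

    Rows of (W1, b1) and the corresponding columns of W2 are permuted together,
    so the realized function is unchanged (Example 3).
    """
    order = np.argsort(-b1)                 # indices giving descending biases
    return W1[order], b1[order], W2[:, order]

# Quick check that the canonical form realizes the same function.
rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0.0)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)
W1c, b1c, W2c = canonicalize(W1, b1, W2)
x = rng.normal(size=2)
assert np.allclose(W2 @ relu(W1 @ x + b1) + b2, W2c @ relu(W1c @ x + b1c) + b2)
assert np.all(np.diff(b1c) <= 0)            # biases are now in descending order
```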

The set of representatives $\Theta_{0}$ has two important properties:

  • Neural networks $\{f(\cdot;\theta):\theta\in\Theta_{0}\}$ parameterized by $\Theta_{0}$ contain all the functions in $\{f(\cdot;\theta):\theta\in\Theta\}$, i.e.,

    \{f(\cdot;\theta):\theta\in\Theta_{0}\}=\{f(\cdot;\theta):\theta\in\Theta\}.
  • The volume (in terms of the Lebesgue measure) of the set of representatives $\Theta_{0}$ is $1/d_{1}!$ times that of the parameter space $\Theta$, i.e.,

    {\rm Volume}(\Theta)=d_{1}!\times{\rm Volume}(\Theta_{0}).

The first property holds naturally since any parameter $\theta\in\Theta$ has a permuted version in $\Theta_{0}$. Regarding the second property, we note that both $\Theta$ and $\Theta_{0}$ are subsets of the Euclidean space $\mathbb{R}^{\mathcal{S}}$ (where $\mathcal{S}$ denotes the number of parameters), so their volumes in terms of the Lebesgue measure are well defined. For any parameter $\theta$ whose bias vector $b^{(1)}=(b^{(1)}_{1},\ldots,b^{(1)}_{d_{1}})^{\top}$ has distinct components, permuting the bias vector and the corresponding weights yields $d_{1}!$ distinct equivalent parameters $\theta_{1},\ldots,\theta_{d_{1}!}$. This implies that the volume of the representative set $\Theta_{0}$ is $d_{1}!$ times smaller than that of $\Theta$.

Together, these two properties suggest that $\Theta_{0}$ can serve as a representative parameterization of neural networks when they are viewed only as input-output functions. Based on these observations, we derive improved complexities of the class of neural networks in terms of its covering number.

Definition 2 (Covering Number).

Let $\mathcal{F}=\{f:\mathcal{X}\to\mathbb{R}\}$ be a class of functions. We define the supremum norm of $f\in\mathcal{F}$ as $\|f\|_{\infty}:=\sup_{x\in\mathcal{X}}|f(x)|$. For a given $\epsilon>0$, we define the covering number of $\mathcal{F}$ with radius $\epsilon$ under the norm $\|\cdot\|_{\infty}$ as the least cardinality of a subset $\mathcal{G}\subseteq\mathcal{F}$ satisfying

\sup_{f\in\mathcal{F}}\min_{g\in\mathcal{G}}\|f-g\|_{\infty}\leq\epsilon.

Denoted by $\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})$, the covering number measures the minimum number of functions in $\mathcal{F}$ needed to cover the set of functions within a distance of $\epsilon$ under the supremum norm.

The covering number $\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})$ provides a quantitative measure of the complexity of the class of functions $\mathcal{F}$ under the supremum norm, with smaller values indicating simpler classes. Covering numbers, along with Rademacher complexity, VC dimension, and pseudo dimension, are essential complexity measures in the analysis of learning theory and in estimating generalization errors. Although these measures are different, they are correlated with each other, and we detail their relationships in the Appendix.

Remark 1.

We define the covering number of a class of functions in the uniform sense. This is an extension of the canonical definition of covering numbers, which was originally developed for subsets of Euclidean space. While most existing studies of covering numbers for function spaces consider the image of the functions on a finite sample (Anthony et al., 1999; Bartlett et al., 2017), our definition is formulated directly in terms of the function space itself, without requiring a finite sample or any other auxiliary construction.

Theorem 1 (Covering number of shallow neural networks).

Consider the class of single-hidden-layer neural networks $\mathcal{F}:=\mathcal{F}(1,d_{0},d_{1},B)$ defined in (4), parameterized by $\theta\in\Theta=[-B,B]^{\mathcal{S}}$. Suppose the radius of the domain $\mathcal{X}$ of $f\in\mathcal{F}$ is bounded by some $B_{x}>0$, and the activation $\sigma_{1}$ is continuous. Then for any $\epsilon>0$, the covering number satisfies

\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})\leq(16B^{2}(B_{x}+1)\sqrt{d_{0}}d_{1}/\epsilon)^{\mathcal{S}}\times\rho^{\mathcal{S}_{h}}/d_{1}!, (5)

where $\rho$ denotes the Lipschitz constant of $\sigma_{1}$ on the range of the hidden layer (i.e., $[-\sqrt{d_{0}}B(B_{x}+1),\sqrt{d_{0}}B(B_{x}+1)]$), $\mathcal{S}_{h}=d_{0}d_{1}+d_{1}$ is the number of parameters in the linear transformation from the input to the hidden layer, and $\mathcal{S}=d_{0}\times d_{1}+2d_{1}+1$ is the total number of parameters.

Our upper bound on the covering number is the first to take advantage of permutation invariance, resulting in a reduced complexity (by a factorial term $d_{1}!$ in the denominator) compared to existing studies (Neyshabur et al., 2015b; Bartlett et al., 2017; Neyshabur et al., 2017b; Neyshabur, 2017; Lin and Zhang, 2019). The factorial reduction $d_{1}!$ can be significant. For instance, for a shallow ReLU network with a hidden dimension of $d_{1}=128$, the factorial $128!\approx 10^{215}$, which is far larger than $10^{82}$, the upper estimate of the number of atoms in the known universe. This reduction can be substantial and can sharpen theoretical analyses and results that rely on covering numbers. In addition, it is worth noting that the bound in Theorem 1 holds for networks with bias vectors. This is also an improvement over existing studies that did not consider bias vectors in neural networks, since bias terms are crucial for the approximation power of neural networks (Yarotsky, 2017; Lu et al., 2021b; Shen et al., 2022).

Stirling's formula gives $\sqrt{2\pi d_{1}}(d_{1}/e)^{d_{1}}\exp(1/(12d_{1}+1))<d_{1}!<\sqrt{2\pi d_{1}}(d_{1}/e)^{d_{1}}\exp(1/(12d_{1}))$ for $d_{1}\geq 1$. The factorial thus reduces the covering number approximately by a factor of $(d_{1}/e)^{d_{1}}$, and the bound in (5) becomes $(C_{B,B_{x},d_{0},\rho}/\epsilon)^{\mathcal{S}}\times d_{1}^{\mathcal{S}-\mathcal{U}}$, where $\mathcal{U}=d_{1}$ denotes the number of hidden neurons and $C_{B,B_{x},d_{0},\rho}>0$ is a constant depending only on $B,B_{x},d_{0}$ and $\rho$. Notably, $\mathcal{S}-\mathcal{U}$ is the number of weights (excluding biases), so the reduced bound is essentially that of a network without bias vectors once permutation invariance is taken into account. Lastly, we note that increasing the number of neurons in a shallow neural network enlarges its approximation power (Lu et al., 2017; Ongie et al., 2019), but, according to our results, at a smaller increase in complexity.
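For a sense of scale, the following snippet computes $\log_{10}(d_{1}!)$ via the log-gamma function for a few widths (arbitrary choices); the $d_{1}=128$ case matches the $128!\approx 10^{215}$ figure quoted above.

```python
import math

for d1 in (16, 64, 128, 512):
    log10_fact = math.lgamma(d1 + 1) / math.log(10)   # log10(d1!)
    print(f"d1 = {d1:4d}:  d1! ~ 10^{log10_fact:.1f}")
# d1 = 128 prints roughly 10^215.6, i.e. 128! ~ 4 x 10^215.
```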

Remark 2.

Theorem 1 applies to any activation function that is locally Lipschitz on bounded sets (the range of the hidden layer); it does not require a specific choice such as the hinge or ReLU activation as in Neyshabur et al. (2015b) and Neyshabur et al. (2017b), nor does it require a global Lipschitz constant and $\sigma(0)=0$ as in Bartlett et al. (2017) and Lin and Zhang (2019). For the ReLU or Leaky ReLU activation, $\rho=1$ without any condition, so the term $\rho^{\mathcal{S}_{h}}$ disappears from our bound.

2.2 Deep Feed-Forward Neural Networks

For deep neural networks, we can likewise analyze the effective parameter space based on the permutation invariance property. By Proposition 1, a set of representatives $\Theta_{0}=\Theta_{0}^{(1)}\times\Theta_{0}^{(2)}\times\cdots\times\Theta_{0}^{(L)}\times\Theta_{0}^{(L+1)}$ can be constructed, where

\Theta_{0}^{(l)}=\big\{(W^{(l)},b^{(l)})\in[-B,B]^{\mathcal{S}_{l}}:b^{(l)}_{1}\geq b^{(l)}_{2}\geq\cdots\geq b^{(l)}_{d_{l}}\big\}\quad{\rm for\ }l=1,\ldots,L,
\Theta_{0}^{(L+1)}=\big\{(W^{(L+1)},b^{(L+1)})\in[-B,B]^{\mathcal{S}_{L+1}}\big\}.

Then we can obtain an upper bound of the covering number of deep feedforward neural networks.

Theorem 2 (Covering number of deep neural networks).

Consider the class of deep neural networks $\mathcal{F}:=\mathcal{F}(L,d_{0},d_{1},\ldots,d_{L},B)$ defined in (3), parameterized by $\theta\in\Theta=[-B,B]^{\mathcal{S}}$. Suppose the radius of the domain $\mathcal{X}$ of $f\in\mathcal{F}$ is bounded by $B_{x}$ for some $B_{x}>0$, and the activations $\sigma_{1},\ldots,\sigma_{L}$ are locally Lipschitz. Then for any $\epsilon>0$, the covering number $\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})$ is bounded by

\frac{\Big(4(L+1)(B_{x}+1)(2B)^{L+2}(\Pi_{j=1}^{L}\rho_{j})(\Pi_{j=0}^{L}d_{j})\cdot\epsilon^{-1}\Big)^{\mathcal{S}}}{d_{1}!\times d_{2}!\times\cdots\times d_{L}!},

where $\mathcal{S}=\sum_{i=0}^{L}(d_{i}d_{i+1}+d_{i+1})$ and $\rho_{i}$ denotes the Lipschitz constant of $\sigma_{i}$ on the range of the $(i-1)$-th hidden layer; in particular, the range of the $(i-1)$-th hidden layer is contained in $[-B^{(i)},B^{(i)}]$ with $B^{(i)}\leq(2B)^{i}\Pi_{j=1}^{i-1}\rho_{j}d_{j}$ for $i=1,\ldots,L$.

Theorem 2 provides a novel upper bound on the covering number of deep neural networks based on permutation invariance, which reduces the complexity compared to previous results (Neyshabur et al., 2015b; Neyshabur et al., 2017b; Bartlett et al., 2017; Lin and Zhang, 2019) by approximately a factor of $d_{1}!d_{2}!\cdots d_{L}!$. According to Theorem 2, increasing the depth of a neural network increases its complexity; interestingly, however, each added hidden layer $l$ contributes a $d_{l}!$ discount to the complexity. If the hidden layers have equal width ($d=d_{1}=\cdots=d_{L}$), the bound reduces to $(C_{B,B_{x},d_{0},\rho}/\epsilon)^{\mathcal{S}}\times d^{\mathcal{S}-\mathcal{U}}$, where $\mathcal{U}=Ld$ denotes the number of hidden neurons and $C_{B,B_{x},d_{0},\rho}>0$ is a constant depending only on $B,B_{x},d_{0}$ and $\rho_{i},i=1,\ldots,L$. As with shallow neural networks, the improved exponent $\mathcal{S}-\mathcal{U}$ equals the number of weights excluding biases; the bias terms preserve the approximation power of the network while the complexity grows at a rate free of the number of biases.

Remark 3.

As discussed in Remark 2, our results take permutation invariance into account and impose weaker requirements on the activation functions compared to existing results (Neyshabur et al., 2015b; Neyshabur et al., 2017b; Bartlett et al., 2017; Lin and Zhang, 2019). In addition, our upper bound is expressed explicitly in terms of quantities that are known and can be specified in practice, e.g., the network depth $L$, widths $(d_{0},d_{1},\ldots,d_{L})$, size $\mathcal{S}$, and the uniform bound $B$ on weights and biases. In contrast, most existing bounds (Neyshabur et al., 2015b; Neyshabur et al., 2017b; Bartlett et al., 2017; Lin and Zhang, 2019) are in terms of the spectral norms of the weight matrices and measures involving externally introduced reference matrices, which are usually unknown in practice.

2.3 Comparing to existing results

Complexity upper bounds in terms of the covering number of classes of deep neural networks have been studied in Anthony et al. (1999), Neyshabur et al. (2017b), Bartlett et al. (2017), and Lin and Zhang (2019). These results are proved using similar mathematical-induction arguments (e.g., Lemma A.7 in Bartlett et al. (2017), Lemma 2 in Neyshabur et al. (2017b), and Lemma 14 in Lin and Zhang (2019)). We improve upon these results in three ways.

  • First, we consider generally defined neural networks in which bias vectors are allowed. Bias terms are indispensable for the approximation power of neural networks (Yarotsky, 2017; Lu et al., 2021b; Shen et al., 2022), so results that exclude them can be limited in both theory and practice.

  • Second, we explicitly express our bound in terms of the width, depth, and size of the neural network, as well as the infinity norm of the parameters.

  • Third, we utilize the permutation equivalence property to obtain a tighter upper bound on the covering number. Notably, this bound improves existing results by a factorial of the network widths, providing new insight into the relationship between network complexity and layer width.

Various complexity upper bounds for deep neural networks under other measurements have also been studied, including Rademacher complexity (Neyshabur et al., 2015b; Golowich et al., 2018; Li et al., 2018), VC-dimension, and pseudo-dimension (Baum and Haussler, 1988; Goldberg and Jerrum, 1993; Anthony et al., 1999; Bartlett et al., 2019). Our covering number bounds are not directly comparable with these measurements. To make a comparison, we convert the upper bounds into metric entropy, i.e., the logarithm of the covering number $\log(\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|_{\infty}))$. Specifically, let $\bar{\rho}=\Pi_{j=1}^{L}\rho_{j}$ denote the product of the Lipschitz constants of the activation functions and $\bar{s}=\Pi_{j=1}^{L}s_{j}$ the product of the spectral norms of the hidden-layer weight matrices. Bartlett et al. (2017) derived a spectral-norm based metric entropy bound $B_{x}^{2}(\bar{\rho}\bar{s})^{2}\mathcal{U}\log(W)/\epsilon^{2}$, following which Neyshabur et al. (2017b) obtained $B_{x}^{2}(\bar{\rho}\bar{s})^{2}\mathcal{S}L^{2}\log(WL)/\epsilon^{2}$ and Lin and Zhang (2019) obtained $B_{x}(\bar{\rho}\bar{s})\mathcal{S}^{2}L/\epsilon$. Based on Theorem 12.2 in Anthony et al. (1999), the pseudo-dimension bound in Bartlett et al. (2019) leads to $L\mathcal{S}\log(\mathcal{S})\log(\bar{\rho}\bar{s}B_{x}/\epsilon)$. Lastly, our result in Theorem 2 can be presented as $L\mathcal{S}\log(\bar{\rho}\bar{s}B_{x}^{1/L}/((d_{1}!\cdots d_{L}!)^{1/\mathcal{S}}\epsilon)^{1/L})$ by letting $s_{i}:=B\sqrt{d_{i}d_{i-1}}$ in our setting. It is important to note that the quantity $B\sqrt{d_{i}d_{i-1}}$ provides an upper bound on the spectral norm $s_{i}$, but the reverse is not necessarily true; we cannot guarantee that Theorem 2 remains true if $B\sqrt{d_{i}d_{i-1}}$ is directly replaced by $s_{i}$. In other words, if the covering number bounds are derived in terms of the spectral norm, the reduction factor $d_{1}!\cdots d_{L}!$ may not be obtained, and additional quantities such as matrix norm bounds would appear in the upper bound (Bartlett et al., 2017). We also note that, even though our results improve over existing ones, all of these bounds scale with the number of parameters $\mathcal{S}$ and can still be vacuous in error analyses under extremely over-parameterized settings. A detailed comparison of the results is presented in Table 1.

3 Extension to other neural networks

Functional equivalence arises across many types of neural networks; in particular, permutation equivalence is present in any network with linear transformation layers. This section explores extensions of our analysis that harness functional equivalence in convolutional neural networks, residual networks, and attention-based networks.

3.1 Convolutional neural networks

Convolutional neural networks (CNNs) are characterized by their use of convolution and pooling layers. In a convolution layer, the input is convolved with parameter-carrying filters, which resembles a linear layer with a sparse weight matrix. A pooling layer is commonly employed for downsampling and summarizing input feature information: it partitions the input into non-overlapping regions and applies a pooling operation to each region. The most prevalent pooling operations are max/min/avg pooling, which retain the maximum, minimum, or average value within each region.
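To illustrate the statement that a convolution layer resembles a linear layer with a sparse weight matrix, the following sketch writes a one-dimensional valid convolution as multiplication by a banded (Toeplitz-like) matrix; the signal length and filter size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 8, 3                                    # input length, filter size (arbitrary)
x, w = rng.normal(size=n), rng.normal(size=k)

# Valid 1-D convolution (cross-correlation, as used in CNN layers) written as
# multiplication by a sparse matrix whose rows are shifted copies of the filter.
W = np.zeros((n - k + 1, n))
for i in range(n - k + 1):
    W[i, i:i + k] = w

assert np.allclose(W @ x, np.convolve(x, w[::-1], mode="valid"))
```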

As demonstrated in Examples 1-3, scaling, sign-flipping, and permutation equivalence directly apply to convolution layers (linear layers with sparse weight matrices). We also extend permutation equivalence to pooling regions as follows.

Example 4 (Permutation within Pooling Regions).

Consider two shallow CNNs defined by $f(x;\theta_{1})={\rm Pool}(W_{1}x+b_{1})$ and $f(x;\theta_{2})={\rm Pool}(W_{2}x+b_{2})$, where "Pool" is a pooling operator. Let $\mathcal{I}_{1},\ldots,\mathcal{I}_{K}$ be the non-overlapping index sets (corresponding to the pooling regions) of the rows of $W_{1}x+b_{1}$ and $W_{2}x+b_{2}$. Then $f(\cdot;\theta_{1})$ and $f(\cdot;\theta_{2})$ are functionally equivalent if there exists a permutation matrix $P$ such that, for all $k\in\{1,\ldots,K\}$,

(PW_{2})_{\mathcal{I}_{k}}\cong(W_{1})_{\mathcal{I}_{k}}\quad{\rm and}\quad(Pb_{2})_{\mathcal{I}_{k}}\cong(b_{1})_{\mathcal{I}_{k}},

where $A_{\mathcal{I}_{k}}$ denotes the rows of $A$ indexed by $\mathcal{I}_{k}$ and $A\cong B$ denotes that $A$ equals $B$ up to row permutations.

Permutation within the non-overlapping regions of a CNN preserves the max/min/avg values, eliminating the need for cancel-off operations in subsequent layers. This allows complexity bounds for CNNs to be derived by identifying their effective parameter space, similar to Theorem 2. In addition, results on the connections between CNNs and feedforward networks can be used to improve covering number bounds and capacity estimates (Fang et al., 2020; Mao et al., 2021); for example, Zhou (2020a,b) showed that fully connected deep ReLU networks can be realized by deep CNNs with the same order of network parameters.
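A minimal numerical check of Example 4 for max pooling (the region size, widths, and seed are arbitrary): permuting rows only within each pooling region leaves the pooled output unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_regions, region = 3, 4, 2              # arbitrary sizes
m = n_regions * region

W, b = rng.normal(size=(m, n_in)), rng.normal(size=m)

def max_pool(z):                               # pool over consecutive regions
    return z.reshape(n_regions, region).max(axis=1)

# Permute rows only *within* each pooling region.
perm = np.concatenate([k * region + rng.permutation(region)
                       for k in range(n_regions)])

x = rng.normal(size=n_in)
assert np.allclose(max_pool(W @ x + b), max_pool(W[perm] @ x + b[perm]))
```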

3.2 Residual Networks

Residual networks (ResNets) are a type of deep CNN architecture that has had a significant impact on computer vision (He et al., 2016). The key feature of ResNet is the use of skip connections, which enable networks to learn residual mappings and bypass layers, leading to very deep yet trainable networks. Mathematically, a residual layer $f(x;\theta)=x+F(x;\theta)$ outputs the sum of the input $x$ and its transformation $F(x;\theta)$. Here $F(\cdot;\theta)$ can be any transformation that maps $x$ into the same space, and it defines the residual layer. The equivalence of $F$ then implies that of the residual layer.

Example 5 (Equivalence of Residual Layer).

Consider two residual layers $f(x;\theta_{1})=x+F(x;\theta_{1})$ and $f(x;\theta_{2})=x+F(x;\theta_{2})$. Then $f(\cdot;\theta_{1})$ and $f(\cdot;\theta_{2})$ are functionally equivalent if and only if $F(\cdot;\theta_{1})$ and $F(\cdot;\theta_{2})$ are functionally equivalent.

3.3 Attention-based Networks

Attention-based models, exemplified by BERT, GPT, and many others, have been successful in natural language processing and computer vision tasks (Vaswani et al., 2017; Devlin et al., 2018; Radford et al., 2018). They utilize an attention mechanism to focus on relevant parts of the input data. Here we focus on the self-attention module due to its effectiveness. Let $X_{n\times d}$ denote an input sequence of $n$ embeddings of dimension $d$, and let $W^{Q}_{d\times d_{q}}$, $W^{K}_{d\times d_{k}}$, and $W^{V}_{d\times d_{v}}$ be the weight matrices, where $d_{q}=d_{k}$. Then the self-attention map outputs

{\rm Softmax}\left(\frac{XW^{Q}(W^{K})^{\prime}X^{\prime}}{\sqrt{d_{k}}}\right)XW^{V},

where ${\rm Softmax}(\cdot)$ is applied to each row of its input and $A^{\prime}$ denotes the transpose of a matrix $A$.

Example 6 (Permutation within Attention map).

Consider two attention maps $f(x;\theta_{1})$ and $f(x;\theta_{2})$ with $f(x;\theta_{i})={\rm Softmax}(XW_{i}^{Q}(W_{i}^{K})^{\prime}X^{\prime}/\sqrt{d_{k}})XW_{i}^{V}$ for $i=1,2$. Then $f(\cdot;\theta_{1})$ and $f(\cdot;\theta_{2})$ are functionally equivalent if there exists a $d_{k}\times d_{k}$ permutation matrix $P$ such that

W_{2}^{Q}P=W^{Q}_{1}\quad{\rm and}\quad W_{2}^{K}P=W^{K}_{1}.

For the attention module, there is no activation function between the key and query matrices, so the relevant symmetry can be considered for any factorization yielding the same linear map $W^{Q}(W^{K})^{\prime}$. In addition, the output of the ${\rm Softmax}$ operator is invariant to adding a constant to each row of its input, which leaves further room to reduce the complexity of attention modules and to understand the overparameterization of large language models.
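The sketch below verifies Example 6 numerically: right-multiplying both $W^{Q}$ and $W^{K}$ by the same permutation matrix leaves the self-attention output unchanged because $PP^{\top}=I$; all dimensions and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, dk, dv = 4, 8, 6, 5                      # arbitrary sizes, with d_q = d_k

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)       # row-wise softmax, numerically stable
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def attention(X, WQ, WK, WV):
    return softmax_rows(X @ WQ @ WK.T @ X.T / np.sqrt(dk)) @ X @ WV

X = rng.normal(size=(n, d))
WQ, WK = rng.normal(size=(d, dk)), rng.normal(size=(d, dk))
WV = rng.normal(size=(d, dv))
P = np.eye(dk)[rng.permutation(dk)]            # d_k x d_k permutation matrix

assert np.allclose(attention(X, WQ, WK, WV),
                   attention(X, WQ @ P, WK @ P, WV))
```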

4 Implications for generalization and optimization

In this section, we describe the relevance and usefulness of our study for both generalization and optimization within the empirical risk minimization (ERM) framework.

The goal of ERM is to find the target function $f_{0}$, which represents the true relationship between the inputs and outputs and is typically defined as the minimizer (possibly unbounded) of some risk $\mathcal{R}(\cdot)$, i.e., $f_{0}:=\arg\min_{f}\mathcal{R}(f)$. Since the target function is unknown, we can only approximate it within a predefined hypothesis space $\mathcal{F}$, such as a class of neural networks parameterized by $\theta$ in deep learning, i.e., $\mathcal{F}(\Theta)=\{f_{\theta}(\cdot)=f(\cdot;\theta):\theta\in\Theta\}$. The “best in class” estimator is then defined by $f_{\theta^{*}}=\arg\min_{f\in\mathcal{F}(\Theta)}\mathcal{R}(f)$. It is worth noting that the risk $\mathcal{R}$ is defined with respect to the distribution of the data, which is unknown in practice. Instead, only a sample of size $n$ is available, and the empirical risk $\mathcal{R}_{n}$ can be defined and minimized to obtain an empirical risk minimizer (ERM), i.e., $f_{\theta_{n}}\in\arg\min_{f\in\mathcal{F}(\Theta)}\mathcal{R}_{n}(f)$. Finally, optimization algorithms such as SGD and Adam lead to the estimator obtained in practice, i.e., $f_{\hat{\theta}_{n}}$. The generalization error of $f_{\hat{\theta}_{n}}$ can be defined and decomposed as (Mohri et al., 2018):

\underbrace{\mathcal{R}(f_{\hat{\theta}_{n}})-\mathcal{R}(f_{0})}_{\rm generalization\ error}=\underbrace{\mathcal{R}(f_{\hat{\theta}_{n}})-\mathcal{R}(f_{\theta_{n}})}_{\rm optimization\ error}+\underbrace{\mathcal{R}(f_{\theta_{n}})-\mathcal{R}(f_{\theta^{*}})}_{\rm estimation\ error}+\underbrace{\mathcal{R}(f_{\theta^{*}})-\mathcal{R}(f_{0})}_{\rm approximation\ error}.

The estimation error is closely related to the complexity of the function class $\mathcal{F}(\Theta)$ and the sample size $n$. Specifically, for a wide range of problems such as regression and classification, the estimation error is $\mathcal{O}((\log\{\mathcal{N}(\mathcal{F}(\Theta),1/n,\|\cdot\|_{\infty})\}/n)^{k})$ for $k=1/2$ or $1$ (Bartlett et al., 2019; Kohler and Langer, 2021; Shen et al., 2022; Jiao et al., 2023). Our results improve the estimation error by subtracting at least $\log(d_{1}!\cdots d_{L}!)$ from the metric entropy appearing in the numerator of $(\cdot/n)^{k}$ compared to existing results.

The approximation error depends on the expressive power of the networks $\mathcal{F}(\Theta)$ and on features of the target $f_{0}$, such as its input dimension $d$ and smoothness $\beta$. Typical bounds on the approximation error are $\mathcal{O}((L\mathcal{W})^{-\beta/d})$ (Yarotsky, 2017, 2018; Petersen and Voigtlaender, 2018; Lu et al., 2021b), where $L$ and $\mathcal{W}$ denote the depth and width of the neural network. However, it is unclear how our reduced covering number bounds would improve the approximation error under current theories.

Regarding the optimization error, quantitative analysis based on current theories is limited due to the high non-convexity and complexity of deep learning problems; even proving convergence (to stationary points) of existing methods is difficult (Sun, 2020). However, we find that the symmetric structure of the parameter space can facilitate optimization. Specifically, Theorem 3 below indicates that accounting for the symmetry structure of the deep network parameter space makes the probability of achieving zero (or a given level of) optimization error $(d_{1}!\cdots d_{L}!)$ times larger.

For a deep neural network in (3), we say two rows of the parameters $\theta^{(l)}:=(W^{(l)};b^{(l)})$ in the $l$-th hidden layer are identical if the corresponding rows of the concatenated matrix $(W^{(l)};b^{(l)})$ are identical. We concatenate the weight matrix $W^{(l)}$ and bias vector $b^{(l)}$ as $(W^{(l)};b^{(l)})$ because of the one-to-one correspondence between the rows of $W^{(l)}$ and $b^{(l)}$: if the $l$-th layer is activated by $\sigma$, then the $i$-th entry of the output vector $\sigma(W^{(l)}x+b^{(l)})$ is $\sigma(W^{(l)}_{i}x+b^{(l)}_{i})$, where $W^{(l)}_{i}$ and $b^{(l)}_{i}$ denote the $i$-th row of $W^{(l)}$ and $b^{(l)}$, respectively. We let $d_{l}^{*}$ denote the number of distinct permutations of the rows of $\theta^{(l)}$, and let $(d_{1}^{*},\ldots,d_{L}^{*})$ collect these numbers over the hidden layers of the parameterization $\theta=(\theta^{(1)},\ldots,\theta^{(L)})$. We let $\Delta_{\rm min}(\theta)$ and $\Delta_{\rm max}(\theta)$ denote the minimum and maximum of the $L_{\infty}$ norm of distinct rows in $\theta^{(l)}$ over $l\in\{1,\ldots,L\}$ (see Definitions (8) and (9) in the Appendix for details). Then we have the following result.

Theorem 3.

Suppose we have an ERM $f_{\theta_{n}}(\cdot)=f(\cdot;\theta_{n})$ whose parameter $\theta_{n}$ has $(d_{1}^{*},\ldots,d_{L}^{*})$ distinct permutations and $\Delta_{\rm min}(\theta_{n})=\delta$. For any optimization algorithm $\mathcal{A}$, if it is guaranteed to produce a solution converging to $\theta_{n}$ whenever its initialization $\theta^{(0)}_{n}$ satisfies $\Delta_{\rm max}(\theta^{(0)}_{n}-\theta_{n})\leq\delta/2$, then any initialization scheme that uses identical random distributions for the entries of the weights and biases within a layer will produce a convergent solution with probability at least $d_{1}^{*}\times\cdots\times d_{L}^{*}\times\mathbb{P}(\Delta_{\rm max}(\theta^{(0)}-\theta_{n})\leq\delta/2)$. Here, $\theta^{(0)}$ denotes the random initialization, and $\mathbb{P}(\cdot)$ is with respect to the randomness of the initialization.

Theorem 3 can be understood straightforwardly. By Proposition 1, if $f_{\theta_{n}}$ parameterized by $\theta_{n}$ is a solution, then $f_{\tilde{\theta}_{n}}$ parameterized by any permuted version $\tilde{\theta}_{n}$ of $\theta_{n}$ is also a solution. The conditions in Theorem 3 ensure that the convergence regions for the permutation-induced solutions are disjoint, so the probability of convergence can be multiplied by the number of distinct permutations. These conditions hold under specific scenarios; for instance, when the loss is locally (strongly) convex within the $(\delta/2)$-neighborhood under $\Delta_{\rm max}$ of a global solution $\theta_{n}$, (stochastic) gradient descent algorithms $\mathcal{A}$ can guarantee convergence to the solution if the initialization $\theta^{(0)}_{n}$ falls within this neighborhood. It is also worth noting that when the parameter space is $\Theta=[-B,B]^{\mathcal{S}}$, the optimization problem is equivalent when restricted to the effective parameter space $\Theta_{0}$ defined in section 2.2, whose volume is $(2B)^{\mathcal{S}}/(d_{1}!\cdots d_{L}!)$. In particular, when $B$ is fixed, $(2B)^{\mathcal{S}}/(d_{1}!\cdots d_{L}!)$ approaches zero as $d_{l}\to\infty$ for any $l=1,\ldots,L$. Remarkably, increasing the width of a neural network drives the volume of the effective parameter space toward zero. This may explain the observations in Frankle and Carbin (2019), Allen-Zhu et al. (2019), and Du et al. (2019) that overparameterized networks tend to be easier to train. In Simsek et al. (2021), the geometry (in terms of manifolds and connected affine subspaces) of the sets of minima and critical points in deep learning was described, which also indicates that overparameterized networks have more minima, thereby facilitating optimization.

The landscape of the loss surface in deep learning has been studied by considering the symmetry of the parameter space in several works. Specifically, Brea et al. (2019) discovered that permutation critical points are embedded in high-dimensional flat plateaus and proved that all permutation points in a given layer are connected by equal-loss paths. Entezari et al. (2021) conjectured that SGD solutions will likely have no barrier along the linear interpolation between them if the permutation invariance of neural networks is taken into account. Ainsworth et al. (2022) further explored the role of permutation symmetries in the linear mode connectivity of SGD solutions, and argued that neural network loss landscapes often contain (nearly) a single basin after accounting for all possible permutation symmetries of hidden units. Subsequently, Jordan et al. (2022) proposed methods to mitigate the variance collapse phenomenon that occurs in the interpolated networks and to improve their empirical performance. Additionally, optimization algorithms for deep learning have been proposed to enhance training based on the symmetry of the network parameter space (Badrinarayanan et al., 2015b; Cho and Lee, 2017; Meng et al., 2018; Navon et al., 2023).

Remark 4.

Popular initialization schemes, including the Xavier and He methods, use normally distributed random numbers to initialize the entries of the weight matrices and bias vectors identically within a layer (Glorot and Bengio, 2010; He et al., 2015; Shang et al., 2016; Reddi et al., 2019). By Theorem 3, these initializations reduce the optimization difficulty thanks to the permutation invariance property.

5 Conclusion

In this work, we quantitatively characterized the redundancy in the parameterization of deep neural networks based on functional equivalence and derived a tighter upper bound on the covering number, which is explicit and holds for networks with bias vectors and general activations. We also explored functional equivalence in convolutional, residual, and attention-based networks, and discussed the implications for understanding generalization and optimization. Specifically, we found that permutation equivalence indicates a reduced theoretical complexity of both estimation and optimization in deep learning.

A limitation of our work is that we only considered permutation invariance, neglecting sign-flipping and scaling invariance, which may be relevant for specific activations. Furthermore, functional equivalence in practice may be limited to a finite sample, potentially resulting in further reduced complexity. Future research could explore the effects of sign-flipping and scaling invariance and investigate advanced optimization algorithms or designs for deep learning. We also acknowledge the importance of deriving lower bounds on the covering number. We intend to pursue these directions to enhance our understanding in the future.

References

  • Ainsworth et al., (2022) Ainsworth, S. K., Hayase, J., and Srinivasa, S. (2022). Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836.
  • Allen-Zhu et al., (2019) Allen-Zhu, Z., Li, Y., and Song, Z. (2019). A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR.
  • Anthony et al., (1999) Anthony, M. and Bartlett, P. L. (1999). Neural network learning: Theoretical foundations, volume 9. Cambridge University Press, Cambridge.
  • Badrinarayanan et al., (2015a) Badrinarayanan, V., Mishra, B., and Cipolla, R. (2015a). Symmetry-invariant optimization in deep networks. arXiv preprint arXiv:1511.01754.
  • Badrinarayanan et al., (2015b) Badrinarayanan, V., Mishra, B., and Cipolla, R. (2015b). Understanding symmetries in deep networks. arXiv preprint arXiv:1511.01029.
  • Bartlett et al., (2017) Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. Advances in neural information processing systems, 30.
  • Bartlett et al., (2019) Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. (2019). Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301.
  • Baum and Haussler, (1988) Baum, E. and Haussler, D. (1988). What size net gives valid generalization? Advances in neural information processing systems, 1.
  • Belkin et al., (2019) Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854.
  • Bona-Pellissier et al., (2021) Bona-Pellissier, J., Bachoc, F., and Malgouyres, F. (2021). Parameter identifiability of a deep feedforward relu neural network. arXiv preprint arXiv:2112.12982.
  • Brea et al., (2019) Brea, J., Simsek, B., Illing, B., and Gerstner, W. (2019). Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape. arXiv preprint arXiv:1907.02911.
  • Bui Thi Mai and Lampert, (2020) Bui Thi Mai, P. and Lampert, C. (2020). Functional vs. parametric equivalence of relu networks. In 8th International Conference on Learning Representations.
  • Chen et al., (1993) Chen, A. M., Lu, H.-m., and Hecht-Nielsen, R. (1993). On the geometry of feedforward neural network error surfaces. Neural computation, 5(6):910–927.
  • Cho and Lee, (2017) Cho, M. and Lee, J. (2017). Riemannian approach to batch normalization. Advances in Neural Information Processing Systems, 30.
  • Cybenko, (1989) Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314.
  • Dereich and Kassing, (2022) Dereich, S. and Kassing, S. (2022). On minimal representations of shallow relu networks. Neural Networks, 148:121–128.
  • Devlin et al., (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Du et al., (2019) Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. (2019). Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pages 1675–1685. PMLR.
  • Dudley, (1967) Dudley, R. M. (1967). The sizes of compact subsets of hilbert space and continuity of gaussian processes. Journal of Functional Analysis, 1(3):290–330.
  • Dudley, (2010) Dudley, R. M. (2010). Universal donsker classes and metric entropy. In Selected Works of RM Dudley, pages 345–365. Springer.
  • Elbrächter et al., (2019) Elbrächter, D. M., Berner, J., and Grohs, P. (2019). How degenerate is the parametrization of neural networks with the relu activation function? Advances in Neural Information Processing Systems, 32.
  • Entezari et al., (2021) Entezari, R., Sedghi, H., Saukh, O., and Neyshabur, B. (2021). The role of permutation invariance in linear mode connectivity of neural networks. arXiv preprint arXiv:2110.06296.
  • Fang et al., (2020) Fang, Z., Feng, H., Huang, S., and Zhou, D.-X. (2020). Theory of deep convolutional neural networks ii: Spherical analysis. Neural Networks, 131:154–162.
  • Fefferman and Markel, (1993) Fefferman, C. and Markel, S. (1993). Recovering a feed-forward net from its output. Advances in neural information processing systems, 6.
  • Frankle and Carbin, (2019) Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations.
  • Glorot and Bengio, (2010) Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings.
  • Glorot et al., (2011) Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323. JMLR Workshop and Conference Proceedings.
  • Goldberg and Jerrum, (1993) Goldberg, P. and Jerrum, M. (1993). Bounding the vapnik-chervonenkis dimension of concept classes parameterized by real numbers. In Proceedings of the sixth annual conference on Computational learning theory, pages 361–369.
  • Golowich et al., (2018) Golowich, N., Rakhlin, A., and Shamir, O. (2018). Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299. PMLR.
  • (30) Grigsby, J. E., Lindsey, K., and Masden, M. (2022a). Local and global topological complexity measures of relu neural network functions. arXiv preprint arXiv:2204.06062.
  • (31) Grigsby, J. E., Lindsey, K., Meyerhoff, R., and Wu, C. (2022b). Functional dimension of feedforward relu neural networks. arXiv preprint arXiv:2209.04036.
  • (32) Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. (2018a). Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, pages 1832–1841. PMLR.
  • (33) Gunasekar, S., Lee, J. D., Soudry, D., and Srebro, N. (2018b). Implicit bias of gradient descent on linear convolutional networks. Advances in neural information processing systems, 31.
  • Hanin and Rolnick, (2019) Hanin, B. and Rolnick, D. (2019). Deep relu networks have surprisingly few activation patterns. Advances in neural information processing systems, 32.
  • Haussler, (1995) Haussler, D. (1995). Sphere packing numbers for subsets of the boolean n-cube with bounded vapnik-chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232.
  • He et al., (2015) He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034.
  • He et al., (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  • Hecht-Nielsen, (1990) Hecht-Nielsen, R. (1990). On the algebraic structure of feedforward network weight spaces. In Advanced Neural Computers, pages 129–135. Elsevier.
  • Hertrich et al., (2021) Hertrich, C., Basu, A., Di Summa, M., and Skutella, M. (2021). Towards lower bounds on the depth of relu neural networks. Advances in Neural Information Processing Systems, 34:3336–3348.
  • Hornik, (1991) Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257.
  • Jiao et al., (2023) Jiao, Y., Shen, G., Lin, Y., and Huang, J. (2023). Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of Statistics, 51(2):691–716.
  • Jordan et al., (2022) Jordan, K., Sedghi, H., Saukh, O., Entezari, R., and Neyshabur, B. (2022). Repair: Renormalizing permuted activations for interpolation repair. arXiv preprint arXiv:2211.08403.
  • Kohler and Langer, (2021) Kohler, M. and Langer, S. (2021). On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231–2249.
  • Kůrkov’a and Kainen, (1994) Kůrkov’a, V. and Kainen, P. C. (1994). Functionally equivalent feedforward neural networks. Neural Computation, 6(3):543–558.
  • Li et al., (2018) Li, X., Lu, J., Wang, Z., Haupt, J., and Zhao, T. (2018). On tighter generalization bound for deep neural networks: Cnns, resnets, and beyond. arXiv preprint arXiv:1806.05159.
  • Lin and Zhang, (2019) Lin, S. and Zhang, J. (2019). Generalization bounds for convolutional neural networks. arXiv preprint arXiv:1910.01487.
  • (47) Lu, J., Shen, Z., Yang, H., and Zhang, S. (2021a). Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506.
  • (48) Lu, J., Shen, Z., Yang, H., and Zhang, S. (2021b). Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506.
  • Lu et al., (2017) Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). The expressive power of neural networks: A view from the width. Advances in neural information processing systems, 30.
  • Mao et al., (2021) Mao, T., Shi, Z., and Zhou, D.-X. (2021). Theory of deep convolutional neural networks iii: Approximating radial functions. Neural Networks, 144:778–790.
  • Martinelli et al., (2023) Martinelli, F., Simsek, B., Brea, J., and Gerstner, W. (2023). Expand-and-cluster: exact parameter recovery of neural networks. arXiv preprint arXiv:2304.12794.
  • Meng et al., (2018) Meng, Q., Zheng, S., Zhang, H., Chen, W., Ye, Q., Ma, Z.-M., Yu, N., and Liu, T.-Y. (2018). G-sgd: Optimizing relu neural networks in its positively scale-invariant space. In International Conference on Learning Representations.
  • Mohri et al., (2018) Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of machine learning. MIT press.
  • Navon et al., (2023) Navon, A., Shamsian, A., Achituve, I., Fetaya, E., Chechik, G., and Maron, H. (2023). Equivariant architectures for learning in deep weight spaces. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J., editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 25790–25816. PMLR.
  • Neyshabur, (2017) Neyshabur, B. (2017). Implicit regularization in deep learning. arXiv preprint arXiv:1709.01953.
  • (56) Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017a). Exploring generalization in deep learning. Advances in neural information processing systems, 30.
  • (57) Neyshabur, B., Bhojanapalli, S., and Srebro, N. (2017b). A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564.
  • Neyshabur et al., (2019) Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. (2019). The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations.
  • (59) Neyshabur, B., Salakhutdinov, R. R., and Srebro, N. (2015a). Path-sgd: Path-normalized optimization in deep neural networks. Advances in neural information processing systems, 28.
  • Neyshabur et al., (2014) Neyshabur, B., Tomioka, R., and Srebro, N. (2014). In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614.
  • (61) Neyshabur, B., Tomioka, R., and Srebro, N. (2015b). Norm-based capacity control in neural networks. In Conference on learning theory, pages 1376–1401. PMLR.
  • Novak et al., (2018) Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. (2018). Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations.
  • Ongie et al., (2019) Ongie, G., Willett, R., Soudry, D., and Srebro, N. (2019). A function space view of bounded norm infinite width relu nets: The multivariate case. arXiv preprint arXiv:1910.01635.
  • Petersen et al., (2021) Petersen, P., Raslan, M., and Voigtlaender, F. (2021). Topological properties of the set of functions generated by neural networks of fixed size. Foundations of computational mathematics, 21:375–444.
  • Petersen and Voigtlaender, (2018) Petersen, P. and Voigtlaender, F. (2018). Optimal approximation of piecewise smooth functions using deep relu neural networks. Neural Networks, 108:296–330.
  • Petzka et al., (2020) Petzka, H., Trimmel, M., and Sminchisescu, C. (2020). Notes on the symmetries of 2-layer relu-networks. In Proceedings of the northern lights deep learning workshop, volume 1, pages 6–6.
  • Radford et al., (2018) Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative pre-training.
  • Razin and Cohen, (2020) Razin, N. and Cohen, N. (2020). Implicit regularization in deep learning may not be explainable by norms. Advances in neural information processing systems, 33:21174–21187.
  • Reddi et al., (2019) Reddi, S. J., Kale, S., and Kumar, S. (2019). On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237.
  • Rolnick and Kording, (2020) Rolnick, D. and Kording, K. (2020). Reverse-engineering deep relu networks. In International Conference on Machine Learning, pages 8178–8187. PMLR.
  • Shang et al., (2016) Shang, W., Sohn, K., Almeida, D., and Lee, H. (2016). Understanding and improving convolutional neural networks via concatenated rectified linear units. In international conference on machine learning, pages 2217–2225. PMLR.
  • Shen et al., (2022) Shen, G., Jiao, Y., Lin, Y., and Huang, J. (2022). Approximation with cnns in sobolev space: with applications to classification. Advances in Neural Information Processing Systems, 35:2876–2888.
  • Simsek et al., (2021) Simsek, B., Ged, F., Jacot, A., Spadaro, F., Hongler, C., Gerstner, W., and Brea, J. (2021). Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. In International Conference on Machine Learning, pages 9722–9732. PMLR.
  • Stock et al., (2019) Stock, P., Graham, B., Gribonval, R., and Jégou, H. (2019). Equi-normalization of neural networks. In International Conference on Learning Representations.
  • Stock and Gribonval, (2022) Stock, P. and Gribonval, R. (2022). An embedding of relu networks and an analysis of their identifiability. Constructive Approximation, pages 1–47.
  • Sun, (2020) Sun, R.-Y. (2020). Optimization for deep learning: An overview. Journal of the Operations Research Society of China, 8(2):249–294.
  • Sussmann, (1992) Sussmann, H. J. (1992). Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural networks, 5(4):589–593.
  • Vaswani et al., (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
  • Yarotsky, (2017) Yarotsky, D. (2017). Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114.
  • Yarotsky, (2018) Yarotsky, D. (2018). Optimal approximation of continuous functions by very deep relu networks. In Conference on Learning Theory, pages 639–649. PMLR.
  • Zhang et al., (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations.
  • (82) Zhou, D.-X. (2020a). Theory of deep convolutional neural networks: Downsampling. Neural Networks, 124:319–327.
  • (83) Zhou, D.-X. (2020b). Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal., 48(2):787–794.

Appendix

In this Appendix, we present the technical details of the proofs of the theorems and provide supporting definitions and lemmas.

Appendix A Proof of Theorems

In the proofs, we adopt the following notation. For a collection of parameters $\theta=(W^{(1)},b^{(1)},\ldots,W^{(L)},b^{(L)},W^{(L+1)},b^{(L+1)})$, we define its $L^{\infty}$-norm by $\|\theta\|_{\infty}=\max\{\max_{l=1,\ldots,L}\|W^{(l)}\|_{\infty},\max_{l=1,\ldots,L}\|b^{(l)}\|_{\infty}\}$, where $\|\cdot\|_{\infty}$ denotes the maximum absolute value of the entries of a vector or matrix. We let $\|\cdot\|_{2}$ denote the $L^{2}$ norm of a vector. For any matrix $A\in\mathbb{R}^{m\times n}$, the spectral norm of $A$ is $\|A\|_{2}=\max_{x\neq 0}\|Ax\|_{2}/\|x\|_{2}$, which equals the largest singular value of $A$, i.e., the square root of the largest eigenvalue of $A^{\top}A$. We also have $\|A\|_{2}\leq\sqrt{mn}\|A\|_{\infty}$.

A.1 Proof of Theorem 1

Proof.

The proof is straightforward and relies on two properties of permutation matrices and element-wise activation functions. First, for any $n\times n$ permutation matrix $P$, we have $PP^{\top}=P^{\top}P=I_{n}$, where $I_{n}$ is the $n\times n$ identity matrix. Second, for any element-wise activation function $\sigma$, any vector $x\in\mathbb{R}^{n}$, and any $n\times n$ permutation matrix $P$, it is easy to check that

\[
\sigma(Px)=P\sigma(x).
\]

Then, for any deep neural network

\[
f(x;\theta)=W^{(L+1)}\sigma_{L}(W^{(L)}\cdots\sigma_{1}(W^{(1)}x+b^{(1)})\cdots+b^{(L)})+b^{(L+1)},
\]

and any permutation matrices $P_{1},\ldots,P_{L}$, it is easy to check that

\[
\begin{aligned}
&W^{(L+1)}P^{\top}_{L}\sigma_{L}(P_{L}W^{(L)}P^{\top}_{L-1}\cdots\sigma_{1}(P_{1}W^{(1)}x+P_{1}b^{(1)})\cdots+P_{L}b^{(L)})+b^{(L+1)}\\
&\qquad=W^{(L+1)}\sigma_{L}(W^{(L)}\cdots\sigma_{1}(W^{(1)}x+b^{(1)})\cdots+b^{(L)})+b^{(L+1)},
\end{aligned}
\]

which completes the proof. ∎
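To make the identity above concrete, the following small numerical sketch (not part of the proof; it assumes NumPy and uses illustrative dimensions) builds a random two-hidden-layer ReLU network, applies the permutations exactly as in the display above, and checks that the input-output map is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2 = 3, 5, 4                       # illustrative input and hidden widths
relu = lambda z: np.maximum(z, 0.0)        # any element-wise activation works here

# random parameters of a network with two hidden layers and a scalar output
W1, b1 = rng.normal(size=(d1, d0)), rng.normal(size=d1)
W2, b2 = rng.normal(size=(d2, d1)), rng.normal(size=d2)
W3, b3 = rng.normal(size=(1, d2)), rng.normal(size=1)

def f(x, W1, b1, W2, b2, W3, b3):
    # f(x) = W3 sigma(W2 sigma(W1 x + b1) + b2) + b3
    return W3 @ relu(W2 @ relu(W1 @ x + b1) + b2) + b3

# random permutation matrices P1 (d1 x d1) and P2 (d2 x d2)
P1 = np.eye(d1)[rng.permutation(d1)]
P2 = np.eye(d2)[rng.permutation(d2)]

# permuted parameters, exactly as in the display above
W1p, b1p = P1 @ W1, P1 @ b1
W2p, b2p = P2 @ W2 @ P1.T, P2 @ b2
W3p = W3 @ P2.T

x = rng.normal(size=d0)
assert np.allclose(f(x, W1, b1, W2, b2, W3, b3),
                   f(x, W1p, b1p, W2p, b2p, W3p, b3))   # same input-output map
```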

A.2 Proof of Theorem 1

Proof.

Firstly, by the permutation-invariance property, the class of neural networks $\{f(x;\theta)=W^{(2)}\sigma_{1}(W^{(1)}x+b^{(1)})+b^{(2)}:\theta\in\Theta_{0}\}$ parameterized by $\Theta_{0}$ contains all the functions in $\{f(x;\theta)=W^{(2)}\sigma_{1}(W^{(1)}x+b^{(1)})+b^{(2)}:\theta\in\Theta\}$, and the covering numbers of these two classes of functions are the same. It therefore suffices to consider the covering number of $\{f(x;\theta)=W^{(2)}\sigma_{1}(W^{(1)}x+b^{(1)})+b^{(2)}:\theta\in\Theta_{0}\}$, where

\[
\Theta_{0}:=\{\theta\in[-B,B]^{\mathcal{S}}:b^{(1)}_{1}\geq b^{(1)}_{2}\geq\cdots\geq b^{(1)}_{d_{1}}\}.
\]

Recall that for a single-hidden-layer neural network $f(\cdot;\theta)$ parameterized by $\theta=(W^{(1)},b^{(1)},W^{(2)},b^{(2)})$, the parameters $W^{(1)}\in\mathbb{R}^{d_{1}\times d_{0}}$, $b^{(1)}\in\mathbb{R}^{d_{1}}$, $W^{(2)}\in\mathbb{R}^{1\times d_{1}}$, and $b^{(2)}\in\mathbb{R}$ have components bounded by $B$. We start by considering the covering number of the activated linear transformations

\[
\mathcal{H}:=\{\sigma_{1}\circ\mathcal{A}_{1}:(W^{(1)},b^{(1)})\in\Theta_{0}^{(1)}\},
\]

where $\Theta_{0}^{(1)}=\{(W^{(1)},b^{(1)})\in[-B,B]^{d_{1}\times d_{0}+d_{1}}:b^{(1)}_{1}\geq b^{(1)}_{2}\geq\cdots\geq b^{(1)}_{d_{1}}\}$, $\sigma_{1}$ is a $\rho$-Lipschitz activation function on $[-BB_{x}-B,BB_{x}+B]$, and $\mathcal{A}_{1}(x)=W^{(1)}x+b^{(1)}$. Here $\mathcal{A}_{1}$ outputs $d_{1}$-dimensional vectors, and for such vector-valued functions we define $\|\sigma_{1}\circ\mathcal{A}_{1}\|_{\infty}:=\sup_{x\in\mathcal{X}}\|\sigma_{1}\circ\mathcal{A}_{1}(x)\|_{2}$.

Let $\epsilon_{1}>0$ and let $\Theta^{(1)}_{0,\epsilon_{1}}=\{(W^{(1)}_{j},b^{(1)}_{j})\}_{j=1}^{\mathcal{N}_{1}}$ be a minimal $\epsilon_{1}$-covering of $\Theta^{(1)}_{0}$ under the $\|\cdot\|_{\infty}$ norm, with covering number $\mathcal{N}_{1}=\mathcal{N}(\Theta_{0}^{(1)},\epsilon_{1},\|\cdot\|_{\infty})$. Then for any $(W^{(1)},b^{(1)})\in\Theta_{0}^{(1)}$, there exists a $(W^{(1)}_{j},b^{(1)}_{j})$ such that $\|(W^{(1)}_{j}-W^{(1)},b^{(1)}_{j}-b^{(1)})\|_{\infty}\leq\epsilon_{1}$. For any $x\in\mathcal{X}$ and $(W^{(1)},b^{(1)})\in\Theta_{0}^{(1)}$, it is not hard to check that $\|W^{(1)}x+b^{(1)}\|_{2}\leq B(\sqrt{d_{0}d_{1}}B_{x}+1)$. Then we have

\[
\begin{aligned}
\|\sigma_{1}(W^{(1)}x+b^{(1)})-\sigma_{1}(W^{(1)}_{j}x+b^{(1)}_{j})\|_{2}
&\leq\rho\|(W^{(1)}x+b^{(1)})-(W^{(1)}_{j}x+b^{(1)}_{j})\|_{2}\\
&\leq\rho\|(W^{(1)}-W^{(1)}_{j})x\|_{2}+\rho\|b^{(1)}-b^{(1)}_{j}\|_{2}\\
&\leq\rho\sqrt{d_{0}d_{1}}\|W^{(1)}-W^{(1)}_{j}\|_{\infty}\|x\|_{2}+\rho\sqrt{d_{1}}\|b^{(1)}-b^{(1)}_{j}\|_{\infty}\\
&\leq\rho\epsilon_{1}(\sqrt{d_{0}d_{1}}B_{x}+\sqrt{d_{1}}).
\end{aligned}
\]

This implies that

\[
\mathcal{H}_{1}=\{\sigma_{1}\circ\mathcal{A}_{1}:(W^{(1)},b^{(1)})\in\Theta^{(1)}_{0,\epsilon_{1}}\}
\]

is a set with no more than $\mathcal{N}_{1}$ elements, and it covers $\mathcal{H}$ under the $\|\cdot\|_{\infty}$ norm with radius $\epsilon_{1}\rho(\sqrt{d_{0}d_{1}}B_{x}+\sqrt{d_{1}})$. By Lemma 2, the covering number satisfies $\mathcal{N}_{1}=\mathcal{N}(\Theta_{0}^{(1)},\epsilon_{1},\|\cdot\|_{\infty})\leq{\rm Volume}(\Theta_{0}^{(1)})\times(2/\epsilon_{1})^{d_{1}\times d_{0}+d_{1}}=(4B/\epsilon_{1})^{d_{1}\times d_{0}+d_{1}}/d_{1}!$.

Now let $\epsilon_{2}>0$ and let $\Theta^{(2)}_{0,\epsilon_{2}}=\{(W^{(2)}_{j},b^{(2)}_{j})\}_{j=1}^{\mathcal{N}_{2}}$ be a minimal $\epsilon_{2}$-covering of $\Theta^{(2)}_{0}$ under the $\|\cdot\|_{\infty}$ norm, with $\mathcal{N}_{2}=\mathcal{N}(\Theta_{0}^{(2)},\epsilon_{2},\|\cdot\|_{\infty})\leq{\rm Volume}(\Theta_{0}^{(2)})\times(2/\epsilon_{2})^{d_{1}\times 1+1}=(4B/\epsilon_{2})^{d_{1}+1}$. We also construct the class of functions

\[
\mathcal{H}_{2}=\{\mathcal{A}_{2}\circ h:h\in\mathcal{H}_{1},(W^{(2)},b^{(2)})\in\Theta^{(2)}_{0,\epsilon_{2}}\},
\]

where $\mathcal{A}_{2}(x)=W^{(2)}x+b^{(2)}$. Now for any $f=\mathcal{A}_{2}\circ\sigma_{1}\circ\mathcal{A}_{1}$ parameterized by $\theta=(W^{(1)},b^{(1)},W^{(2)},b^{(2)})\in\Theta_{0}$, by the definition of covering, there exists $h_{j}\in\mathcal{H}_{1}$ such that $\|h_{j}-\sigma_{1}\circ\mathcal{A}_{1}\|_{\infty}\leq\rho\epsilon_{1}(\sqrt{d_{0}d_{1}}B_{x}+\sqrt{d_{1}})$, and there exists $(W^{(2)}_{k},b^{(2)}_{k})\in\Theta^{(2)}_{0,\epsilon_{2}}$ such that $\|(W^{(2)}_{k}-W^{(2)},b^{(2)}_{k}-b^{(2)})\|_{\infty}\leq\epsilon_{2}$. Then for any $x\in\mathcal{X}$,

\[
\begin{aligned}
&\|f(x)-W^{(2)}_{k}h_{j}(x)-b^{(2)}_{k}\|_{2}\\
&\quad=\|W^{(2)}\sigma_{1}\circ\mathcal{A}_{1}(x)+b^{(2)}-W^{(2)}_{k}h_{j}(x)-b^{(2)}_{k}\|_{2}\\
&\quad\leq\|W^{(2)}\sigma_{1}\circ\mathcal{A}_{1}(x)-W^{(2)}_{k}h_{j}(x)\|_{2}+\|b^{(2)}-b^{(2)}_{k}\|_{2}\\
&\quad\leq\|W^{(2)}\sigma_{1}\circ\mathcal{A}_{1}(x)-W^{(2)}h_{j}(x)\|_{2}+\|W^{(2)}h_{j}(x)-W^{(2)}_{k}h_{j}(x)\|_{2}+\epsilon_{2}\\
&\quad\leq\sqrt{d_{1}}B\rho\epsilon_{1}(\sqrt{d_{0}d_{1}}B_{x}+\sqrt{d_{1}})+\epsilon_{2}\sqrt{d_{1}}B(\sqrt{d_{0}d_{1}}B_{x}+\sqrt{d_{1}})+\epsilon_{2}\\
&\quad\leq 2\sqrt{d_{0}}d_{1}B(B_{x}+1)[\rho\epsilon_{1}+\epsilon_{2}],
\end{aligned}
\]

which implies that $\mathcal{H}_{2}$ is a $(2\sqrt{d_{0}}d_{1}B(B_{x}+1)[\rho\epsilon_{1}+\epsilon_{2}])$-covering of the class of neural networks $\mathcal{F}(1,d_{0},d_{1},B)$, and $\mathcal{H}_{2}$ contains at most $(4B/\epsilon_{1})^{d_{1}\times d_{0}+d_{1}}/d_{1}!\times(4B/\epsilon_{2})^{d_{1}+1}$ elements. Given $\epsilon>0$, taking $\epsilon_{1}=\epsilon/(4\rho\sqrt{d_{0}}d_{1}B(B_{x}+1))$ and $\epsilon_{2}=\epsilon/(4\sqrt{d_{0}}d_{1}B(B_{x}+1))$ then implies

\[
\begin{aligned}
\mathcal{N}(\mathcal{F}(1,d_{0},d_{1},B),\epsilon,\|\cdot\|_{\infty})
&\leq(4B/\epsilon_{1})^{d_{1}\times d_{0}+d_{1}}/d_{1}!\times(4B/\epsilon_{2})^{d_{1}+1}\\
&=(16B/\epsilon)^{d_{0}d_{1}+d_{1}+d_{1}+1}\times(\rho\sqrt{d_{0}}d_{1}B(B_{x}+1))^{d_{0}d_{1}+d_{1}}\times(\sqrt{d_{0}}d_{1}B(B_{x}+1))^{d_{1}+1}/d_{1}!\\
&=(16B/\epsilon)^{d_{0}d_{1}+d_{1}+d_{1}+1}\times(\sqrt{d_{0}}d_{1}B(B_{x}+1))^{d_{0}d_{1}+d_{1}+d_{1}+1}\times\rho^{d_{0}d_{1}+d_{1}}/d_{1}!\\
&=(16B^{2}(B_{x}+1)\sqrt{d_{0}}d_{1}/\epsilon)^{d_{0}d_{1}+d_{1}+d_{1}+1}\times\rho^{d_{0}d_{1}+d_{1}}/d_{1}!\\
&=(16B^{2}(B_{x}+1)\sqrt{d_{0}}d_{1}/\epsilon)^{\mathcal{S}}\times\rho^{\mathcal{S}_{1}}/d_{1}!,
\end{aligned}
\]

where $\mathcal{S}=d_{0}d_{1}+d_{1}+d_{1}+1$ is the total number of parameters in the network and $\mathcal{S}_{1}$ denotes the number of parameters in the linear transformation from the input to the hidden layer. ∎
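The restriction to $\Theta_{0}$ used in this proof can also be realized algorithmically: any parameterization can be mapped to a functionally equivalent representative whose hidden-unit biases are non-increasing by sorting the hidden units. The sketch below (illustrative names, assuming NumPy; a ReLU activation is used only for the numerical check) is one way to do this for a single-hidden-layer network.

```python
import numpy as np

def canonicalize(W1, b1, W2):
    """Map (W1, b1, W2) to a functionally equivalent parameterization whose
    hidden-layer biases are in non-increasing order, i.e. a representative in Theta_0."""
    order = np.argsort(-b1)          # permutation putting b1 in non-increasing order
    P = np.eye(len(b1))[order]       # the corresponding permutation matrix
    return P @ W1, P @ b1, W2 @ P.T  # (P W1, P b1, W2 P^T)

rng = np.random.default_rng(0)
d0, d1 = 4, 6
W1, b1 = rng.normal(size=(d1, d0)), rng.normal(size=d1)
W2, b2 = rng.normal(size=(1, d1)), rng.normal(size=1)
W1c, b1c, W2c = canonicalize(W1, b1, W2)

# the canonical representative implements the same single-hidden-layer network
relu = lambda z: np.maximum(z, 0.0)
x = rng.normal(size=d0)
assert np.all(np.diff(b1c) <= 0)     # biases are now non-increasing
assert np.allclose(W2 @ relu(W1 @ x + b1) + b2, W2c @ relu(W1c @ x + b1c) + b2)
```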

A.3 Proof of Theorem 2

Proof.

Our proof takes permutation equivalence into account and extends Theorem 1 by mathematical induction. Similar induction arguments can be found in Lemma A.7 of Bartlett et al., (2017) and Lemma 14 of Lin and Zhang, (2019). However, we improve upon their results in three ways. First, we consider generally defined neural networks in which bias vectors are allowed. Second, we express our bound in terms of the width, depth, and size of the neural network as well as the infinity norm of the parameters, instead of the spectral norms of the weight matrices, which can be unknown in practice. Third, we utilize permutation equivalence to derive tighter upper bounds on the covering number.

Step 1. We analyze the effective region of the parameter space

\[
\Theta=[-B,B]^{\mathcal{S}},
\]

where $\mathcal{S}=\sum_{i=0}^{L}(d_{i}d_{i+1}+d_{i+1})$ is the total number of parameters in the deep neural network. We write $\mathcal{S}_{l}=d_{l-1}d_{l}+d_{l}$ for the number of parameters in $(W^{(l)},b^{(l)})$ in the $l$th layer, and $\Theta^{(l)}=\{(W^{(l)},b^{(l)})\in[-B,B]^{\mathcal{S}_{l}}\}$ for the parameter space of the $l$th layer, $l=1,\ldots,L+1$. By Theorem 1, for any given neural network $f(\cdot;\theta)$ parameterized by

\[
\theta=(W^{(1)},b^{(1)},\ldots,W^{(L+1)},b^{(L+1)})\in\Theta,
\]

there exist permutation matrices $P_{1},\ldots,P_{L}$ such that $f(\cdot;\tilde{\theta})$ parameterized by

\[
\tilde{\theta}=(P_{1}W^{(1)},P_{1}b^{(1)},\ldots,P_{l}W^{(l)}P_{l-1}^{\top},P_{l}b^{(l)},\ldots,W^{(L+1)}P_{L}^{\top},b^{(L+1)})
\]

implements the same input-output function. This implies that there exists a subset $\Theta_{0}$ of $\Theta$ such that the class of neural networks $\{f(\cdot;\theta):\theta\in\Theta_{0}\}$ contains all the functions in $\{f(\cdot;\theta):\theta\in\Theta\}$, and the covering numbers of these two classes of functions are the same. To be specific, the effective parameter space is

\[
\Theta_{0}=\Theta_{0}^{(1)}\times\Theta_{0}^{(2)}\times\cdots\times\Theta_{0}^{(L)}\times\Theta_{0}^{(L+1)},
\]

where

\[
\begin{aligned}
\Theta_{0}^{(1)}&=\{(W^{(1)},b^{(1)})\in[-B,B]^{\mathcal{S}_{1}}:b^{(1)}_{1}\geq b^{(1)}_{2}\geq\cdots\geq b^{(1)}_{d_{1}}\},\\
\Theta_{0}^{(l)}&=\{(W^{(l)},b^{(l)})\in[-B,B]^{\mathcal{S}_{l}}:b^{(l)}_{1}\geq b^{(l)}_{2}\geq\cdots\geq b^{(l)}_{d_{l}}\}\quad{\rm for\ }l=2,\ldots,L,\\
\Theta_{0}^{(L+1)}&=\{(W^{(L+1)},b^{(L+1)})\in[-B,B]^{\mathcal{S}_{L+1}}\}.
\end{aligned}
\]

In the following, we focus on the covering number of $\{f(\cdot;\theta):\theta\in\Theta_{0}\}$.

Step 2. We start by bounding the covering number for the first activated hidden layer. Let $\mathcal{H}_{1}=\{\sigma_{1}\circ\mathcal{A}_{1}:(W^{(1)},b^{(1)})\in\Theta_{0}^{(1)}\}$, where $\mathcal{A}_{1}(x)=W^{(1)}x+b^{(1)}$ is the linear transformation from the input to the first hidden layer. Given any $\epsilon_{1}>0$, in the proof of Theorem 1 we have shown that

\[
\mathcal{N}_{1}:=\mathcal{N}(\mathcal{H}_{1},\epsilon_{1}\rho_{1}\sqrt{d_{0}d_{1}}(B_{x}+1),\|\cdot\|_{\infty})\leq{\rm Volume}(\Theta_{0}^{(1)})\times(2/\epsilon_{1})^{\mathcal{S}_{1}},
\]

and $\|h^{(1)}\|_{\infty}\leq\rho_{1}\sqrt{d_{0}d_{1}}B(B_{x}+1)$ for all $h^{(1)}\in\mathcal{H}_{1}$.

Step 3. We use induction to proceed for $l=1,\ldots,L$. Let $\mathcal{H}_{l}=\{\sigma_{l}\circ\mathcal{A}_{l}\circ\cdots\circ\sigma_{1}\circ\mathcal{A}_{1}:(W^{(k)},b^{(k)})\in\Theta_{0}^{(k)},k=1,\ldots,l\}$, where $\mathcal{A}_{k}(x)=W^{(k)}x+b^{(k)}$ is the linear transformation from the $(k-1)$th layer to the $k$th layer for $k=1,\ldots,l$. Let $B^{(l)}$ denote the infinity norm of the functions $h^{(l)}\in\mathcal{H}_{l}$ for $l=1,\ldots,L$. For any $e_{l}>0$, let

\[
\tilde{\mathcal{H}}_{l}=\{h^{(l)}_{j}\}_{j=1}^{\mathcal{N}(\mathcal{H}_{l},e_{l},\|\cdot\|_{\infty})}
\]

be an $e_{l}$-covering of $\mathcal{H}_{l}$ under the $\|\cdot\|_{\infty}$ norm. For any $\epsilon_{l+1}>0$, let

\[
\Theta^{(l+1)}_{0,\epsilon_{l+1}}=\{(W^{(l+1)}_{t},b^{(l+1)}_{t})\}_{t=1}^{\mathcal{N}(\Theta^{(l+1)}_{0},\epsilon_{l+1},\|\cdot\|_{\infty})}
\]

be an $\epsilon_{l+1}$-covering of $\Theta^{(l+1)}_{0}$. Then for any $h^{(l+1)}=\sigma_{l+1}\circ\mathcal{A}_{l+1}\circ h^{(l)}\in\mathcal{H}_{l+1}$, where $\mathcal{A}_{l+1}(x)=W^{(l+1)}x+b^{(l+1)}$, there exist $h^{(l)}_{j}\in\tilde{\mathcal{H}}_{l}$ and $(W^{(l+1)}_{t},b^{(l+1)}_{t})\in\Theta^{(l+1)}_{0,\epsilon_{l+1}}$ such that

\[
\|h^{(l)}-h^{(l)}_{j}\|_{\infty}\leq e_{l}
\]

and

\[
\|(W^{(l+1)}_{t}-W^{(l+1)},b^{(l+1)}_{t}-b^{(l+1)})\|_{\infty}\leq\epsilon_{l+1}.
\]

Then for any $x\in\mathcal{X}$, we have

\[
\begin{aligned}
&\|h^{(l+1)}(x)-\sigma_{l+1}(W^{(l+1)}_{t}h^{(l)}_{j}(x)+b^{(l+1)}_{t})\|_{2}\\
&\quad=\|\sigma_{l+1}(W^{(l+1)}h^{(l)}(x)+b^{(l+1)})-\sigma_{l+1}(W^{(l+1)}_{t}h^{(l)}_{j}(x)+b^{(l+1)}_{t})\|_{2}\\
&\quad\leq\rho_{l+1}\|(W^{(l+1)}h^{(l)}(x)+b^{(l+1)})-(W^{(l+1)}_{t}h^{(l)}_{j}(x)+b^{(l+1)}_{t})\|_{2}\\
&\quad\leq\rho_{l+1}\|W^{(l+1)}h^{(l)}(x)-W^{(l+1)}_{t}h^{(l)}_{j}(x)\|_{2}+\rho_{l+1}\sqrt{d_{l+1}}\|b^{(l+1)}-b^{(l+1)}_{t}\|_{\infty}\\
&\quad\leq\rho_{l+1}\|W^{(l+1)}h^{(l)}(x)-W^{(l+1)}h^{(l)}_{j}(x)\|_{2}+\rho_{l+1}\|W^{(l+1)}h^{(l)}_{j}(x)-W^{(l+1)}_{t}h^{(l)}_{j}(x)\|_{2}+\rho_{l+1}\sqrt{d_{l+1}}\epsilon_{l+1}\\
&\quad\leq\rho_{l+1}\sqrt{d_{l}}Be_{l}+\rho_{l+1}\epsilon_{l+1}\sqrt{d_{l}}B^{(l)}+\rho_{l+1}\sqrt{d_{l+1}}\epsilon_{l+1}\\
&\quad\leq\rho_{l+1}(\sqrt{d_{l}}Be_{l}+(\sqrt{d_{l}}B^{(l)}+\sqrt{d_{l+1}})\epsilon_{l+1}).
\end{aligned}
\]

This proves that the covering number $\mathcal{N}(\mathcal{H}_{l+1},e_{l+1},\|\cdot\|_{\infty})$ with radius $e_{l+1}=\rho_{l+1}(\sqrt{d_{l}}Be_{l}+(B^{(l)}+\sqrt{d_{l+1}})\epsilon_{l+1})$ satisfies

\[
\mathcal{N}(\mathcal{H}_{l+1},e_{l+1},\|\cdot\|_{\infty})\leq\mathcal{N}(\mathcal{H}_{l},e_{l},\|\cdot\|_{\infty})\times\mathcal{N}(\Theta^{(l+1)}_{0},\epsilon_{l+1},\|\cdot\|_{\infty})
\]

for $l=1,\ldots,L-1$. Recall that in the proof of Theorem 1, we have shown that

\[
\mathcal{N}(\mathcal{H}_{1},e_{1},\|\cdot\|_{\infty})\leq\mathcal{N}(\Theta^{(1)}_{0},\epsilon_{1},\|\cdot\|_{\infty}),
\]

where $e_{1}=\rho_{1}\epsilon_{1}\sqrt{d_{0}d_{1}}(B_{x}+1)$, which leads to

\[
\mathcal{N}(\mathcal{H}_{l},e_{l},\|\cdot\|_{\infty})\leq\prod_{i=1}^{l}\mathcal{N}(\Theta^{(i)}_{0},\epsilon_{i},\|\cdot\|_{\infty})
\]

for $l=1,\ldots,L$.

Step 4. Next we give upper bounds on $e_{l}$ and $B^{(l)}$ for $l=1,\ldots,L$. Recall that $B^{(l)}$ denotes the infinity norm of the functions $h^{(l)}\in\mathcal{H}_{l}$ for $l=1,\ldots,L$. It is easy to see that $B^{(l)}\geq\rho_{l}B$, since we can always take the bias vectors to have components $B$ or $-B$. In addition,

\[
B^{(l+1)}=\|\sigma_{l+1}\circ\mathcal{A}_{l+1}\circ h^{(l)}\|_{\infty}\leq\rho_{l+1}B(\sqrt{d_{l}}B^{(l)}+\sqrt{d_{l+1}})\leq 2\rho_{l+1}\sqrt{d_{l}d_{l+1}}BB^{(l)}.
\]

As shown in the proof of Theorem 1, we know $B^{(1)}\leq\rho_{1}B\sqrt{d_{0}d_{1}}(B_{x}+1)$, and therefore

\[
B^{(l)}\leq(2B)^{l}(B_{x}+1)\sqrt{d_{0}}\Big(\prod_{i=1}^{l}d_{i}\rho_{i}\Big)\big/\sqrt{d_{l}}.
\]

Recall that $e_{l+1}=\rho_{l+1}(\sqrt{d_{l}}Be_{l}+(B^{(l)}+\sqrt{d_{l+1}})\epsilon_{l+1})$ for $l=1,\ldots,L-1$ and $e_{1}=\rho_{1}\epsilon_{1}\sqrt{d_{0}}(B_{x}+1)$; then a direct calculation gives

\[
\begin{aligned}
e_{L}&=B^{L-1}\Big(\prod_{i=2}^{L}\rho_{i}\sqrt{d_{i-1}}\Big)e_{1}+\sum_{i=2}^{L}(B^{(i-1)}+\sqrt{d_{i}})\epsilon_{i}\Big(\prod_{j=i}^{L}\rho_{j}\sqrt{d_{j-1}}\Big)\big/\sqrt{d_{L}}\\
&\leq B^{L-1}\Big(\prod_{i=1}^{L}\rho_{i}\sqrt{d_{i-1}}\Big)(B_{x}+1)\epsilon_{1}+\sum_{i=2}^{L}(B_{x}+1)(2B)^{i-1}\sqrt{d_{0}}\Big(\prod_{j=1}^{L}\rho_{j}d_{j}\Big)\epsilon_{i}\big/\sqrt{d_{L}}\\
&\leq 2\sum_{i=1}^{L}\sqrt{d_{0}}(B_{x}+1)(2B)^{i-1}\Big(\prod_{j=1}^{L}\rho_{j}d_{j}\Big)\epsilon_{i}\big/\sqrt{d_{L}}.
\end{aligned}
\]

Step 5. Lastly, we construct a covering of $\mathcal{F}=\{f(\cdot;\theta):\theta\in\Theta_{0}\}$, the class of deep neural networks under consideration. Let

\[
\tilde{\mathcal{H}}_{L}=\{h^{(L)}_{j}\}_{j=1}^{\mathcal{N}(\mathcal{H}_{L},e_{L},\|\cdot\|_{\infty})}
\]

be an $e_{L}$-covering of $\mathcal{H}_{L}$ under the $\|\cdot\|_{\infty}$ norm. For any $\epsilon_{L+1}>0$, let

\[
\Theta^{(L+1)}_{0,\epsilon_{L+1}}=\{(W^{(L+1)}_{t},b^{(L+1)}_{t})\}_{t=1}^{\mathcal{N}(\Theta^{(L+1)}_{0},\epsilon_{L+1},\|\cdot\|_{\infty})}
\]

be an $\epsilon_{L+1}$-covering of $\Theta^{(L+1)}_{0}$. Then for any $f=\mathcal{A}_{L+1}\circ h^{(L)}\in\mathcal{F}$, where $\mathcal{A}_{L+1}(x)=W^{(L+1)}x+b^{(L+1)}$, there exist $h^{(L)}_{j}\in\tilde{\mathcal{H}}_{L}$ and $(W^{(L+1)}_{t},b^{(L+1)}_{t})\in\Theta^{(L+1)}_{0,\epsilon_{L+1}}$ such that

\[
\|h^{(L)}-h^{(L)}_{j}\|_{\infty}\leq e_{L}
\]

and

\[
\|(W^{(L+1)}_{t}-W^{(L+1)},b^{(L+1)}_{t}-b^{(L+1)})\|_{\infty}\leq\epsilon_{L+1}.
\]

Then for any $x\in\mathcal{X}$, we have

\[
\begin{aligned}
&\|f(x)-W^{(L+1)}_{t}h^{(L)}_{j}(x)-b^{(L+1)}_{t}\|_{2}\\
&\quad=\|W^{(L+1)}h^{(L)}(x)+b^{(L+1)}-W^{(L+1)}_{t}h^{(L)}_{j}(x)-b^{(L+1)}_{t}\|_{2}\\
&\quad\leq\|W^{(L+1)}h^{(L)}(x)-W^{(L+1)}_{t}h^{(L)}_{j}(x)\|_{2}+\|b^{(L+1)}-b^{(L+1)}_{t}\|_{2}\\
&\quad\leq\|W^{(L+1)}h^{(L)}(x)-W^{(L+1)}h^{(L)}_{j}(x)\|_{2}+\|W^{(L+1)}h^{(L)}_{j}(x)-W^{(L+1)}_{t}h^{(L)}_{j}(x)\|_{2}+\epsilon_{L+1}\\
&\quad\leq\sqrt{d_{L}}Be_{L}+\epsilon_{L+1}\sqrt{d_{L}}B^{(L)}+\epsilon_{L+1}\\
&\quad=\sqrt{d_{L}}Be_{L}+(\sqrt{d_{L}}B^{(L)}+1)\epsilon_{L+1}\\
&\quad\leq 2\sqrt{d_{0}}\sum_{i=1}^{L}(B_{x}+1)(2B)^{i}\Big(\prod_{j=1}^{L}\rho_{j}d_{j}\Big)\epsilon_{i}+2(2B)^{L}(B_{x}+1)\sqrt{d_{0}}\Big(\prod_{j=1}^{L}\rho_{j}d_{j}\Big)\epsilon_{L+1}\\
&\quad\leq 2\sqrt{d_{0}}(B_{x}+1)\Big(\prod_{j=1}^{L}\rho_{j}d_{j}\Big)\sum_{i=1}^{L+1}(2B)^{i}\epsilon_{i}.
\end{aligned}
\]

Then we know that the covering number $\mathcal{N}(\mathcal{F},e_{L+1},\|\cdot\|_{\infty})$ with radius

\[
e_{L+1}=2(B_{x}+1)\sqrt{d_{0}}\Big(\prod_{j=1}^{L}\rho_{j}d_{j}\Big)\sum_{i=1}^{L+1}(2B)^{i-1}\epsilon_{i}
\]

satisfies

\[
\begin{aligned}
\mathcal{N}(\mathcal{F},e_{L+1},\|\cdot\|_{\infty})
&\leq\mathcal{N}(\mathcal{H}_{L},e_{L},\|\cdot\|_{\infty})\times\mathcal{N}(\Theta^{(L+1)}_{0},\epsilon_{L+1},\|\cdot\|_{\infty})\\
&\leq\prod_{i=1}^{L+1}\mathcal{N}(\Theta^{(i)}_{0},\epsilon_{i},\|\cdot\|_{\infty})\\
&\leq\prod_{i=1}^{L+1}{\rm Volume}(\Theta^{(i)}_{0})\times(2/\epsilon_{i})^{\mathcal{S}_{i}}\\
&=(4B/\epsilon_{L+1})^{\mathcal{S}_{L+1}}\times\prod_{i=1}^{L}(4B/\epsilon_{i})^{\mathcal{S}_{i}}/(d_{i}!).
\end{aligned}
\]

Finally, setting $\epsilon_{i}=\{2(L+1)\sqrt{d_{0}}(B_{x}+1)(2B)^{i}(\prod_{j=1}^{L}\rho_{j}d_{j})\}^{-1}\epsilon$ for $i=1,\ldots,L+1$ leads to the following upper bound on the covering number $\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})$:

\[
\begin{aligned}
\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})
&\leq\Big(4(L+1)\sqrt{d_{0}}(B_{x}+1)(2B)^{L+2}\Big(\prod_{j=1}^{L}\rho_{j}d_{j}\Big)\big/\epsilon\Big)^{\mathcal{S}_{L+1}}\\
&\qquad\times\prod_{i=1}^{L}\Big(4(L+1)\sqrt{d_{0}}(B_{x}+1)(2B)^{i+1}\Big(\prod_{j=1}^{L}\rho_{j}d_{j}\Big)\big/\epsilon\Big)^{\mathcal{S}_{i}}\big/(d_{i}!)\\
&\leq\frac{\Big(4(L+1)\sqrt{d_{0}}(B_{x}+1)(2B)^{L+2}\Big(\prod_{j=1}^{L}\rho_{j}d_{j}\Big)\big/\epsilon\Big)^{\mathcal{S}}}{d_{1}!\times d_{2}!\times\cdots\times d_{L}!}.
\end{aligned}
\]

This completes the proof. ∎
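As a back-of-the-envelope illustration of how much the factor $d_{1}!\times\cdots\times d_{L}!$ reduces the bound (this is only a heuristic reading of the result above, not an additional claim), take logarithms and apply Stirling's formula $\log(d!)=d\log d-d+O(\log d)$:

\[
\log\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})
\leq\mathcal{S}\log\Big(4(L+1)\sqrt{d_{0}}(B_{x}+1)(2B)^{L+2}\Big(\prod_{j=1}^{L}\rho_{j}d_{j}\Big)\big/\epsilon\Big)
-\sum_{i=1}^{L}\big(d_{i}\log d_{i}-d_{i}+O(\log d_{i})\big).
\]

For a network of uniform width $d$, the metric entropy is therefore reduced by roughly $Ld\log d$ compared with a volume argument that ignores permutation equivalence.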

A.4 Proof of Theorem 3

Proof.

The proof is straightforward, and we present it in three steps. First, we show that if $f_{\theta_{n}}$ is an empirical risk minimizer, then there are at least $d_{1}^{*}\times d_{2}^{*}\times\cdots\times d_{L}^{*}$ empirical risk minimizers with distinct parameterizations. Second, we prove that the $(\delta/2)$-neighborhoods of these distinct parameterizations are disjoint under the $L_{\infty}$ norm. Lastly, we show that initialization schemes using identical random distributions for the weights and biases within each layer can indeed increase the probability of convergence for any appropriate optimization algorithm $\mathcal{A}$.

Step 1. Suppose $f_{\theta_{n}}$ is an empirical risk minimizer with parameterization

\[
\theta_{n}=(W^{(1)}_{n},b^{(1)}_{n},W^{(2)}_{n},b^{(2)}_{n},\ldots,W^{(L)}_{n},b^{(L)}_{n},W^{(L+1)}_{n},b^{(L+1)}_{n}).
\]

By Theorem 1, for any permutation matrices $P_{1},\ldots,P_{L}$,

\[
\tilde{\theta}_{n}=(P_{1}W^{(1)}_{n},P_{1}b^{(1)}_{n},P_{2}W^{(2)}_{n}P_{1}^{\top},P_{2}b^{(2)}_{n},\ldots,P_{L}W^{(L)}_{n}P_{L-1}^{\top},P_{L}b^{(L)}_{n},W^{(L+1)}_{n}P_{L}^{\top},b^{(L+1)}_{n}) \qquad(6)
\]

also yields an empirical risk minimizer $f_{\tilde{\theta}_{n}}$. However, the concatenated matrices $(W^{(l)}_{n};b^{(l)}_{n})$ may have identical rows, and the permuted matrices $(P_{l}W^{(l)}_{n}P_{l-1}^{\top};P_{l}b^{(l)}_{n})$ may remain unchanged for some permutation matrices $P_{l}$. Let $d^{*}_{l}$ denote the number of distinct permutations of the rows of $(W^{(l)}_{n};b^{(l)}_{n})$. Then the set $\{(P_{l}W^{(l)}_{n}P_{l-1}^{\top};P_{l}b^{(l)}_{n}):P_{l},P_{l-1}{\rm\ are\ permutation\ matrices}\}$ has at least $d^{*}_{l}$ distinct elements. Note that $1\leq d^{*}_{l}\leq d_{l}!$ for $l=1,\ldots,L$, where $d_{l}$ is the dimension of the bias vector $b^{(l)}$ as well as the number of rows of $(W^{(l)}_{n};b^{(l)}_{n})$. Specifically, $d^{*}_{l}=1$ if and only if all the rows of the concatenated matrix $(W^{(l)}_{n};b^{(l)}_{n})$ are identical, and $d^{*}_{l}=d_{l}!$ if and only if all the rows of $(W^{(l)}_{n};b^{(l)}_{n})$ are distinct. Moreover, the number of distinct elements in $\{(P_{l}W^{(l)}_{n}P_{l-1}^{\top};P_{l}b^{(l)}_{n}):P_{l},P_{l-1}{\rm\ are\ permutation\ matrices}\}$ can range from $1$ to $d_{l-1}!\,d_{l}!$: it is $1$ if and only if all the entries of $b^{(l)}_{n}$ are identical and all the entries of $W^{(l)}_{n}$ are identical, and it is $d_{l-1}!\,d_{l}!$ if all the entries of $b^{(l)}_{n}$ are distinct and all the entries of $W^{(l)}_{n}$ are distinct. Lastly, it is easy to see that $\theta_{n}\not=\tilde{\theta}_{n}$ if $(W^{(l)}_{n};b^{(l)}_{n})\not=(P_{l}W^{(l)}_{n}P_{l-1}^{\top};P_{l}b^{(l)}_{n})$ for some $l\in\{1,\ldots,L\}$. Then there are at least $d_{1}^{*}\times\cdots\times d_{L}^{*}$ distinct elements in

\[
\tilde{\Theta}_{n}=\{\tilde{\theta}_{n}{\rm\ defined\ in\ (6)}:P_{1},\ldots,P_{L}{\rm\ are\ permutation\ matrices}\}. \qquad(7)
\]
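For intuition, $d^{*}_{l}$ is a multinomial coefficient: $d_{l}!$ divided by the product of the factorials of the multiplicities of repeated rows of $(W^{(l)}_{n};b^{(l)}_{n})$. A minimal sketch of this count (illustrative, assuming NumPy) is given below.

```python
import numpy as np
from collections import Counter
from math import factorial

def num_distinct_row_permutations(W, b):
    """d_l^* = d_l! / prod_r(m_r!), where the m_r are the multiplicities of
    repeated rows of the concatenated matrix (W ; b)."""
    rows = Counter(tuple(row) for row in np.column_stack([W, b]))
    d = int(sum(rows.values()))
    denom = 1
    for m in rows.values():
        denom *= factorial(m)
    return factorial(d) // denom

# all rows distinct: d_l^* = d_l!
W = np.arange(6.0).reshape(3, 2); b = np.array([0.0, 1.0, 2.0])
assert num_distinct_row_permutations(W, b) == factorial(3)

# all rows identical: d_l^* = 1
assert num_distinct_row_permutations(np.ones((3, 2)), np.zeros(3)) == 1
```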

Step 2.

Let $\theta=(W^{(1)},b^{(1)},W^{(2)},b^{(2)},\ldots,W^{(L)},b^{(L)},W^{(L+1)},b^{(L+1)})$ be the collection of parameters of a network and let $\theta_{i}^{(l)}$ be the $i$th row of the concatenated matrix $\theta^{(l)}=(W^{(l)};b^{(l)})$. We define

\[
\Delta_{\rm min}(\theta):=\min_{l\in\{1,\ldots,L\}}\Big[\min_{i,j\in\{1,\ldots,d_{l}\},\,\theta^{(l)}_{i}\not=\theta^{(l)}_{j}}\|\theta^{(l)}_{i}-\theta^{(l)}_{j}\|_{\infty}\Big], \qquad(8)
\]

and

\[
\Delta_{\rm max}(\theta):=\max_{l\in\{1,\ldots,L\}}\Big[\max_{i,j\in\{1,\ldots,d_{l}\}}\|\theta^{(l)}_{i}-\theta^{(l)}_{j}\|_{\infty}\Big]. \qquad(9)
\]

Recall that $\Delta_{\rm min}(\theta_{n})=\delta$. This implies that for any two distinct permutation-implemented parameterizations $\tilde{\theta}_{n,1}$ and $\tilde{\theta}_{n,2}$ in (7),

\[
\Delta_{\rm max}(\tilde{\theta}_{n,1}-\tilde{\theta}_{n,2})\geq\delta,
\]

where $\Delta_{\rm max}(\tilde{\theta}_{n,1}-\tilde{\theta}_{n,2})$ takes the maximum of the $\|\cdot\|_{\infty}$ norms of the rows of the layer-wise differences over $l\in\{1,\ldots,L\}$. Then the neighborhoods $B_{\infty}(\tilde{\theta}_{n},\delta/2):=\{\theta:\Delta_{\rm max}(\theta-\tilde{\theta}_{n})\leq\delta/2\}$ in the collection

\[
\mathcal{B}_{(\delta/2)}=\{B_{\infty}(\tilde{\theta}_{n},\delta/2):\tilde{\theta}_{n}\in\tilde{\Theta}_{n}\}
\]

are pairwise disjoint. It is worth mentioning that for $\tilde{\theta}_{n}=\mathcal{P}(\theta_{n})$, where $\mathcal{P}$ is the permutation operator defined in Step 3 below, we have

\[
B_{\infty}(\tilde{\theta}_{n},\delta/2)=\{\mathcal{P}(\theta):\Delta_{\rm max}(\theta-\theta_{n})\leq\delta/2\}, \qquad(10)
\]

by the symmetry of permutation and the definition of $\Delta_{\rm max}$.
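For concreteness, the quantities $\Delta_{\rm min}$ and $\Delta_{\rm max}$ in (8)-(9) can be computed directly from the concatenated matrices $\theta^{(l)}=(W^{(l)};b^{(l)})$; a small sketch (illustrative, assuming NumPy) is given below.

```python
import numpy as np

def delta_min_max(layers):
    """layers: list of concatenated matrices theta^{(l)} = (W^{(l)}; b^{(l)}),
    each of shape (d_l, d_{l-1} + 1).  Returns (Delta_min, Delta_max) as in (8)-(9)."""
    mins, maxs = [], []
    for th in layers:
        d = th.shape[0]
        # pairwise sup-norm distances between rows of the concatenated matrix
        gaps = np.max(np.abs(th[:, None, :] - th[None, :, :]), axis=-1)
        off_diag = gaps[~np.eye(d, dtype=bool)]
        maxs.append(off_diag.max() if off_diag.size else 0.0)
        distinct = off_diag[off_diag > 0]          # only pairs of distinct rows count
        if distinct.size:
            mins.append(distinct.min())
    # Delta_min is undefined when every layer has all rows equal; return inf in that case
    return (min(mins) if mins else float("inf")), max(maxs)

# toy layer with two identical rows: the identical pair does not enter Delta_min
theta_1 = np.array([[1.0, 0.0, 0.5],
                    [1.0, 0.0, 0.5],
                    [0.0, 2.0, -1.0]])
print(delta_min_max([theta_1]))   # (2.0, 2.0)
```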

Step 3. For any given permutation matrices $P_{1},\ldots,P_{L}$, we let $\mathcal{P}=\mathcal{P}(P_{1},\ldots,P_{L})$ denote the operator such that

\[
\mathcal{P}(\theta)=\tilde{\theta}
\]

for any $\theta\in\Theta$, where

\[
\theta=(W^{(1)},b^{(1)},W^{(2)},b^{(2)},\ldots,W^{(L)},b^{(L)},W^{(L+1)},b^{(L+1)})
\]

and

\[
\tilde{\theta}=(P_{1}W^{(1)},P_{1}b^{(1)},P_{2}W^{(2)}P_{1}^{\top},P_{2}b^{(2)},\ldots,P_{L}W^{(L)}P_{L-1}^{\top},P_{L}b^{(L)},W^{(L+1)}P_{L}^{\top},b^{(L+1)}).
\]

Now if an optimization algorithm $\mathcal{A}$ is guaranteed to produce a solution convergent to $\theta_{n}$ whenever the initialization $\theta^{(0)}$ belongs to the $(\delta/2)$-neighborhood of $\theta_{n}$, then for any $\mathcal{P}$, the algorithm $\mathcal{A}$ is also guaranteed to produce a solution convergent to $\mathcal{P}(\theta_{n})$ whenever $\theta^{(0)}$ belongs to the $(\delta/2)$-neighborhood of $\mathcal{P}(\theta_{n})$. The reason is that the loss surface of the empirical risk has the same structure on

\[
B_{\infty}({\theta}_{n},\delta/2)=\{\theta^{(0)}:\Delta_{\rm max}(\theta^{(0)}-\theta_{n})\leq\delta/2\}
\]

and

\[
B_{\infty}(\tilde{\theta}_{n},\delta/2)=\{\mathcal{P}(\theta^{(0)}):\Delta_{\rm max}(\theta^{(0)}-\theta_{n})\leq\delta/2\}.
\]

This implies that if the initialization $\theta^{(0)}$ belongs to the $(\delta/2)$-neighborhood of any $\tilde{\theta}_{n}\in\tilde{\Theta}_{n}$, the algorithm $\mathcal{A}$ is guaranteed to produce a solution convergent to some $\tilde{\theta}_{n}\in\tilde{\Theta}_{n}$, which yields an empirical risk minimizer $f_{\tilde{\theta}_{n}}$. The rest of the proof calculates the probability that a random initialization $\theta^{(0)}$ lies in the union of the neighborhoods in $\mathcal{B}_{(\delta/2)}$, i.e., we target

\[
\mathbb{P}\big(\theta^{(0)}\in\cup_{\tilde{\theta}_{n}\in\tilde{\Theta}_{n}}B_{\infty}(\tilde{\theta}_{n},\delta/2)\big),
\]

where the probability is with respect to the randomness of the initialization. Firstly, for any random initialization scheme and any permutation operator $\mathcal{P}$, we have

\[
\mathbb{P}(\theta^{(0)}\in B_{\infty}({\theta}_{n},\delta/2))=\mathbb{P}(\mathcal{P}(\theta^{(0)})\in B_{\infty}(\mathcal{P}(\theta_{n}),\delta/2)), \qquad(11)
\]

by (10) and the definition of permutation. With a slight abuse of notation, let

\[
\theta^{(0)}=(W^{(1)}_{(0)},b^{(1)}_{(0)},\ldots,W^{(L+1)}_{(0)},b^{(L+1)}_{(0)}).
\]

If the initialization method uses identical random distributions for the entries of the weights and biases within a layer, then the entries of $W^{(l)}_{(0)}$ are independent and identically distributed, and the entries of $b^{(l)}_{(0)}$ are independent and identically distributed. This further implies that for any permutation matrices $P_{1},\ldots,P_{L}$,

\[
\begin{aligned}
(W^{(1)}_{(0)},b^{(1)}_{(0)})&\stackrel{d}{=}(P_{1}W^{(1)}_{(0)},P_{1}b^{(1)}_{(0)}),\\
(W^{(l)}_{(0)},b^{(l)}_{(0)})&\stackrel{d}{=}(P_{l}W^{(l)}_{(0)}P^{\top}_{l-1},P_{l}b^{(l)}_{(0)}),\quad l=2,\ldots,L,\\
(W^{(L+1)}_{(0)},b^{(L+1)}_{(0)})&\stackrel{d}{=}(W^{(L+1)}_{(0)}P^{\top}_{L},b^{(L+1)}_{(0)}),
\end{aligned}
\]

where $\stackrel{d}{=}$ denotes equality in distribution. Consequently, $\mathcal{P}(\theta^{(0)})$ has the same distribution as $\theta^{(0)}$, and we have

\[
\mathbb{P}(\mathcal{P}(\theta^{(0)})\in B_{\infty}(\mathcal{P}(\theta_{n}),\delta/2))=\mathbb{P}(\theta^{(0)}\in B_{\infty}(\mathcal{P}(\theta_{n}),\delta/2)) \qquad(12)
\]

for any permutation operator $\mathcal{P}$.

Combining (11), (12), and the property of (7), we have

\[
\begin{aligned}
\mathbb{P}\big(\theta^{(0)}\in\cup_{\tilde{\theta}_{n}\in\tilde{\Theta}_{n}}B_{\infty}(\tilde{\theta}_{n},\delta/2)\big)
&=\sum_{\tilde{\theta}_{n}\in\tilde{\Theta}_{n}}\mathbb{P}(\theta^{(0)}\in B_{\infty}(\tilde{\theta}_{n},\delta/2))\\
&=\sum_{\tilde{\theta}_{n}\in\tilde{\Theta}_{n}}\mathbb{P}(\theta^{(0)}\in B_{\infty}({\theta}_{n},\delta/2))\\
&\geq d_{1}^{*}\times\cdots\times d_{L}^{*}\times\mathbb{P}(\theta^{(0)}\in B_{\infty}({\theta}_{n},\delta/2))\\
&=d_{1}^{*}\times\cdots\times d_{L}^{*}\times\mathbb{P}(\Delta_{\rm max}(\theta^{(0)}-\theta_{n})\leq\delta/2).
\end{aligned}
\]

This completes the proof.

Appendix B Supporting Lemmas

In this section, we give the definitions of the covering number and packing number of subsets of Euclidean space. We also present definitions of other complexity measures, including the VC dimension, pseudo-dimension, and Rademacher complexity, together with lemmas on their properties and relationships.

Definition 3 (Covering number).

Let $(K,\|\cdot\|)$ be a metric space, let $C$ be a subset of $K$, and let $\epsilon$ be a positive real number. Let $B_{\epsilon}(x)$ denote the ball of radius $\epsilon$ centered at $x$. Then $C$ is called an $\epsilon$-covering of $K$ if $K\subset\cup_{x\in C}B_{\epsilon}(x)$. The covering number of the metric space $(K,\|\cdot\|)$ with radius $\epsilon>0$ is the minimum cardinality of any $\epsilon$-covering, defined by $\mathcal{N}(K,\epsilon,\|\cdot\|)=\min\{|C|:C\text{ is an }\epsilon\text{-covering of }K\}$.

Definition 4 (Packing number).

Let $(K,\|\cdot\|)$ be a metric space, let $P$ be a subset of $K$, and let $\epsilon$ be a positive real number. Let $B_{\epsilon}(x)$ denote the ball of radius $\epsilon$ centered at $x$. Then $P$ is called an $\epsilon$-packing of $K$ if the balls $\{B_{\epsilon}(x)\}_{x\in P}$ are pairwise disjoint. The $\epsilon$-packing number of the metric space $(K,\|\cdot\|)$ with radius $\epsilon>0$ is the maximum cardinality of any $\epsilon$-packing, defined by $\mathcal{M}(K,\epsilon,\|\cdot\|)=\max\{|P|:P\text{ is an }\epsilon\text{-packing of }K\}$.

Lemma 1.

Let $(K,\|\cdot\|)$ be a metric space, and for any $\epsilon>0$, let $\mathcal{N}(K,\epsilon,\|\cdot\|)$ and $\mathcal{M}(K,\epsilon,\|\cdot\|)$ denote the $\epsilon$-covering number and the $\epsilon$-packing number, respectively. Then

\[
\mathcal{M}(K,2\epsilon,\|\cdot\|)\leq\mathcal{N}(K,\epsilon,\|\cdot\|)\leq\mathcal{M}(K,\epsilon/2,\|\cdot\|).
\]
Proof.

For simplicity, we write $\mathcal{N}_{\epsilon}=\mathcal{N}(K,\epsilon,\|\cdot\|)$ and $\mathcal{M}_{\epsilon}=\mathcal{M}(K,\epsilon,\|\cdot\|)$. We first prove $\mathcal{M}_{2\epsilon}\leq\mathcal{N}_{\epsilon}$ by contradiction. Let $P=\{p_{1},\ldots,p_{\mathcal{M}_{2\epsilon}}\}$ be any maximal $2\epsilon$-packing of $K$ and $C=\{c_{1},\ldots,c_{\mathcal{N}_{\epsilon}}\}$ be any minimal $\epsilon$-covering of $K$. If $\mathcal{M}_{2\epsilon}\geq\mathcal{N}_{\epsilon}+1$, then we must have some $p_{i}$ and $p_{j}$ belonging to the same $\epsilon$-ball $B_{\epsilon}(c_{k})$ for some $i\not=j$ and $k$. This means that the distance between $p_{i}$ and $p_{j}$ cannot exceed the diameter of the ball, i.e., $\|p_{i}-p_{j}\|\leq 2\epsilon$, which leads to a contradiction since $\|p_{i}-p_{j}\|>2\epsilon$ by the definition of packing.

Secondly, we prove $\mathcal{N}_{\epsilon}\leq\mathcal{M}_{\epsilon/2}$ by showing that each maximal $(\epsilon/2)$-packing $P=\{p_{1},\ldots,p_{\mathcal{M}_{\epsilon/2}}\}$ is also an $\epsilon$-covering. Note that for any $x\in K\backslash P$, there exists a $p_{i}\in P$ such that $\|x-p_{i}\|\leq\epsilon$, since otherwise we could construct a bigger packing by adding $p_{\mathcal{M}_{\epsilon/2}+1}=x$. Thus $P$ is also an $\epsilon$-covering, and we have $\mathcal{N}_{\epsilon}\leq\mathcal{M}_{\epsilon/2}$ by the definition of covering. ∎
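As a concrete check of Lemma 1, the sketch below (illustrative, assuming NumPy) computes exact covering and packing numbers by brute force for a small finite point set under the sup norm, treating balls as closed so that an $\epsilon$-packing corresponds to pairwise distances strictly greater than $2\epsilon$, and verifies the sandwich inequality.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
K = rng.uniform(0.0, 1.0, size=(8, 2))  # a small finite "metric space" in R^2
# pairwise sup-norm distances
dist = np.max(np.abs(K[:, None, :] - K[None, :, :]), axis=-1)

def covering_number(eps):
    # minimal number of eps-balls centred at points of K needed to cover K
    n = len(K)
    for k in range(1, n + 1):
        for C in combinations(range(n), k):
            if np.all(dist[:, list(C)].min(axis=1) <= eps):
                return k
    return n

def packing_number(eps):
    # maximal number of centres whose closed eps-balls are pairwise disjoint,
    # i.e. pairwise distances strictly greater than 2 * eps
    n = len(K)
    for k in range(n, 1, -1):
        for P in combinations(range(n), k):
            if all(dist[i, j] > 2 * eps for i, j in combinations(P, 2)):
                return k
    return 1

eps = 0.2
assert packing_number(2 * eps) <= covering_number(eps) <= packing_number(eps / 2)
```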

Lemma 2.

Let $S$ be a subset of $\mathbb{R}^{d}$ with volume $V$, and let $\epsilon>0$ be a positive real number. Then the covering number $\mathcal{N}(S,\epsilon,\|\cdot\|_{\infty})$ and the packing number $\mathcal{M}(S,\epsilon/2,\|\cdot\|_{\infty})$ of $S$ satisfy

\[
\mathcal{N}(S,\epsilon,\|\cdot\|_{\infty})\leq\mathcal{M}(S,\epsilon/2,\|\cdot\|_{\infty})\leq V\times\Big(\frac{2}{\epsilon}\Big)^{d}.
\]
Proof.

Consider a packing of $S$ with non-overlapping hypercubes of side length $\epsilon/2$ under the $L^{\infty}$ norm. The volume of each hypercube is $(\epsilon/2)^{d}$, and since the hypercubes do not overlap, the total volume of the hypercubes in the packing is at most the volume of $S$. Thus, we have

\[
\mathcal{M}(S,\epsilon/2,\|\cdot\|_{\infty})\cdot(\epsilon/2)^{d}\leq V,
\]

which implies

\[
\mathcal{M}(S,\epsilon/2,\|\cdot\|_{\infty})\leq V\times\Big(\frac{2}{\epsilon}\Big)^{d}.
\]

Then by Lemma 1, $\mathcal{N}(S,\epsilon,\|\cdot\|_{\infty})\leq\mathcal{M}(S,\epsilon/2,\|\cdot\|_{\infty})$, which completes the proof. ∎
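The $1/d_{1}!$ factors appearing in the covering bounds above come from the volume of the ordered-bias region of the parameter space; a quick Monte Carlo sanity check of this volume fraction (illustrative, assuming NumPy) is given below.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(0)
B, d1, n = 1.0, 4, 200_000

# fraction of uniform draws b in [-B, B]^{d1} with b_1 >= b_2 >= ... >= b_{d1}
b = rng.uniform(-B, B, size=(n, d1))
frac = np.mean(np.all(np.diff(b, axis=1) <= 0.0, axis=1))

print(frac, 1.0 / factorial(d1))  # the two numbers should be close (about 1/24 here)
```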

Definition 5 (Shattering, Definition 11.4 in Mohri et al., (2018)).

Let $\mathcal{F}$ be a family of functions from a set $\mathcal{Z}$ to $\mathbb{R}$. A set $\{z_{1},\ldots,z_{n}\}\subset\mathcal{Z}$ is said to be shattered by $\mathcal{F}$ if there exist $t_{1},\ldots,t_{n}\in\mathbb{R}$ such that

\[
\Big|\Big\{\big({\rm sgn}(f(z_{1})-t_{1}),\ldots,{\rm sgn}(f(z_{n})-t_{n})\big):f\in\mathcal{F}\Big\}\Big|=2^{n},
\]

where ${\rm sgn}$ is the sign function, returning $+1$ or $-1$, and $|\cdot|$ denotes the cardinality of a set. When they exist, the threshold values $t_{1},\ldots,t_{n}$ are said to witness the shattering.

Definition 6 (Pseudo dimension, Definition 11.5 in Mohri et al., (2018)).

Let $\mathcal{F}$ be a family of functions mapping from $\mathcal{Z}$ to $\mathbb{R}$. Then the pseudo-dimension of $\mathcal{F}$, denoted by ${\rm Pdim}(\mathcal{F})$, is the size of the largest set shattered by $\mathcal{F}$.

Definition 7 (VC dimension).

Let $\mathcal{F}$ be a family of functions mapping from $\mathcal{Z}$ to $\mathbb{R}$. Then the Vapnik–Chervonenkis (VC) dimension of $\mathcal{F}$, denoted by ${\rm VCdim}(\mathcal{F})$, is the size of the largest set shattered by $\mathcal{F}$ with all threshold values equal to zero, i.e., $t_{1}=\cdots=t_{n}=0$.

Definition 8 (Empirical Rademacher Complexity, Definition 3.1 in Mohri et al., (2018)).

Let $\mathcal{F}$ be a family of functions mapping from $\mathcal{Z}$ to $[a,b]$ and let $S=(z_{1},\ldots,z_{n})$ be a fixed sample of size $n$ with elements in $\mathcal{Z}$. Then the empirical Rademacher complexity of $\mathcal{F}$ with respect to the sample $S$ is defined as

\[
\hat{\mathcal{R}}_{S}(\mathcal{F})=\mathbb{E}_{\sigma}\Bigg[\sup_{f\in\mathcal{F}}\Bigg|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(z_{i})\Bigg|\Bigg],
\]

where $\sigma=(\sigma_{1},\ldots,\sigma_{n})^{\top}$, with the $\sigma_{i}$ independent uniform random variables taking values in $\{+1,-1\}$. The random variables $\sigma_{i}$ are called Rademacher variables.
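When the supremum over $\mathcal{F}$ is tractable, the expectation over the Rademacher variables can be approximated by Monte Carlo. The sketch below (illustrative, assuming NumPy) does this for the class of linear functions $f(z)=w^{\top}z$ with $\|w\|_{2}\leq 1$, for which the supremum has the closed form $\|\frac{1}{n}\sum_{i}\sigma_{i}z_{i}\|_{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_mc = 50, 5, 5000

X = rng.normal(size=(n, d))  # a fixed sample S = (z_1, ..., z_n), here z_i in R^d

# For F = {z -> w.z : ||w||_2 <= 1}, the supremum inside the definition equals
# ||(1/n) sum_i sigma_i z_i||_2, so only the expectation over sigma is approximated.
sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))        # Rademacher draws
sup_values = np.linalg.norm(sigma @ X / n, axis=1)     # sup over F for each draw
print("empirical Rademacher complexity ~", sup_values.mean())
```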

Definition 9 (Rademacher Complexity, Definition 3.2 in Mohri et al., (2018)).

Let $\mathcal{D}$ denote the distribution according to which samples are drawn. For any integer $n\geq 1$, the Rademacher complexity of $\mathcal{F}$ is the expectation of the empirical Rademacher complexity over all samples of size $n$ drawn according to $\mathcal{D}$:

\[
\mathcal{R}_{n}(\mathcal{F})=\mathbb{E}_{S\sim\mathcal{D}^{n}}\big[\hat{\mathcal{R}}_{S}(\mathcal{F})\big].
\]
Lemma 3 (Dudley’s Theorem, Dudley, (1967)).

Let $\mathcal{F}$ be a set of functions $f:\mathcal{X}\to\mathbb{R}$. Then

\[
{\mathcal{R}}_{n}(\mathcal{F})\leq 12\int_{0}^{\infty}\sqrt{\frac{\log\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})}{n}}\,d\epsilon.
\]

Dudley's theorem gives a way to bound the Rademacher complexity using the covering number (Dudley, , 1967, 2010), and the covering and packing numbers can in turn be bounded in terms of the VC dimension or pseudo-dimension (Haussler, , 1995; Anthony et al., , 1999).

Lemma 4 (Theorem 12.2 in Anthony et al., (1999)).

Suppose that $\mathcal{F}$ is a class of functions from $\mathcal{X}$ to the bounded interval $[0,B]\subset\mathbb{R}$. Given a sequence $x=(x_{1},\ldots,x_{n})\in\mathcal{X}^{n}$, we let $\mathcal{F}|_{x}$ be the subset of $\mathbb{R}^{n}$ given by

\[
\mathcal{F}|_{x}=\{(f(x_{1}),\ldots,f(x_{n})):f\in\mathcal{F}\}.
\]

We define the uniform covering number $\mathcal{N}_{n}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})$ to be the maximum, over all $x\in\mathcal{X}^{n}$, of the covering number $\mathcal{N}(\mathcal{F}|_{x},\epsilon,\|\cdot\|_{\infty})$, that is,

\[
\mathcal{N}_{n}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})=\max\{\mathcal{N}(\mathcal{F}|_{x},\epsilon,\|\cdot\|_{\infty}):x\in\mathcal{X}^{n}\}.
\]

Let $\epsilon>0$ and suppose the pseudo-dimension of $\mathcal{F}$ is $d$. Then

\[
\mathcal{N}_{n}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})\leq\sum_{i=1}^{d}\binom{n}{i}\Big(\frac{B}{\epsilon}\Big)^{i},
\]

which is less than $(enB/(\epsilon d))^{d}$ for $n\geq d$.